Sentiment analysis (SA) is a branch of natural language processing (NLP). It is also known as opinion extraction, opinion mining, sentiment mining, or subjectivity analysis, and it is the process of analyzing, processing, summarizing, and reasoning about subjective text that carries emotional coloring.
This chapter walks through a concrete example: implementing sentiment analysis with an LSTM network in TensorFlow.
Natural language processing is about teaching machines how to process or understand human language; it currently spans several hot research directions.
The essence of sentiment analysis is to infer, from known text and sentiment cues, whether a piece of text is positive or negative.
Sentiment analysis also comes with several difficulties of its own.
Whether we use machine learning or deep learning, the input data must be converted into numbers the computer can work with. Convolutional neural networks take pixels as input, logistic regression takes quantifiable feature values as input, and reinforcement-learning models update themselves from reward signals. For NLP tasks, text likewise has to be converted into numbers.
from IPython.display import Image

Image(filename="./data/17_01.png", width=500)
So we cannot feed raw strings into the model; every word in a sentence has to be turned into a number or a vector. How? There are many options. The first one that comes to mind is to represent each word with an integer. This is simple, but it cannot capture dependencies between words, nor can it express synonyms or near-synonyms. To overcome this, the integers can be converted into one-hot encodings, which avoid the problem of larger numbers implicitly carrying larger weight, but the resulting matrices are usually huge and still cannot reflect the contextual dependencies within a sentence. A better approach is to use a Word2Vec-style algorithm or model to map each word in a sentence to a vector of moderate dimension (say, between 50 and 300). This keeps the vector dimension under control while still capturing the contextual relationships between the words in a sentence or document.
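To make the contrast concrete, here is a small illustration (not part of the original notebook; the toy vocabulary and dimensions are made up) of a one-hot representation versus a low-dimensional dense word vector:

import numpy as np

# A toy vocabulary of only 5 words; a real one (like the GloVe list used below) has 400,000
vocab = ["i", "love", "this", "movie", "terrible"]

# One-hot: every word becomes a sparse vector as long as the whole vocabulary
one_hot = np.eye(len(vocab), dtype=np.float32)
print(one_hot[vocab.index("movie")])        # [0. 0. 0. 1. 0.]

# Dense word vectors: every word maps to a short real-valued vector (here 4-dimensional);
# in practice these values are learned so that related words end up close together
np.random.seed(0)
embedding = np.random.randn(len(vocab), 4).astype(np.float32)
print(embedding[vocab.index("movie")])      # random here, learned by Word2Vec/GloVe in practice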
Image(filename="./data/17_02.png",width=500)
Image(filename="./data/17_03.png",width=500)
A Word2Vec model is trained on every sentence in the dataset: it slides a fixed-size window over each sentence and uses the surrounding context to predict the vector of the word in the middle of the window.
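The short sketch below (an illustration only; the sentence and window size are invented) shows the kind of (context, center-word) pairs that such a sliding window produces:

# Slide a window of 2 words on each side over a sentence and collect
# (context words, center word) pairs, the raw material a CBOW-style model trains on
sentence = "the movie was surprisingly good".split()
window = 2

for i, center in enumerate(sentence):
    left = max(0, i - window)
    right = min(len(sentence), i + window + 1)
    context = [sentence[j] for j in range(left, right) if j != i]
    print(context, "->", center)
# e.g. ['the', 'was', 'surprisingly'] -> movie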
With word vectors as the input to the neural network, the next question is what network to build. A defining characteristic of NLP data is that it is sequential: the occurrence of each word depends on the words before and after it. Because of this dependency, a recurrent neural network is a natural fit.
In a recurrent neural network, each word in the sentence occupies one time step. In practice, the number of time steps equals the maximum sequence length.
Image(filename="./data/17_04.png",width=500)
Transfer learning is a machine-learning technique in which, roughly speaking, a model developed for task A is reused as the starting point for task B. Used sensibly, transfer learning avoids training a separate model from scratch for every target task, which saves a great deal of compute.
In both computer-vision and natural-language-processing tasks, starting a new model from a pretrained one is common practice, because pretraining such models usually takes a lot of time and enormous computing resources. Transfer learning simply moves a pretrained model onto the new task.
We use two pretrained word-vector structures: a Python list (converted from a NumPy array) containing 400,000 words, and a 400,000 x 50 embedding matrix holding their word vectors.
The sentiment-analysis task is to decide whether the emotion of an input word or sentence is positive or negative. We break this specific task into five steps.
We use another off-the-shelf but smaller matrix, trained with GloVe. It contains 400,000 word vectors, each of dimension 50.
Following the transfer-learning approach, we directly import these two pretrained data structures: a Python list of 400,000 words and a 400,000 x 50 embedding matrix holding the vector of every word.
import numpy as np

wordsList = np.load("./imdb/wordsList.npy")
print("Load the word list!")
# Originally loaded as a NumPy array
wordsList = wordsList.tolist()
# Decode the words as UTF-8
wordsList = [word.decode("UTF-8") for word in wordsList]
wordVectors = np.load("./imdb/wordVectors.npy")
print("Loaded the word vectors!")
Load the word list!
Loaded the word vectors!
Check the sizes of the vocabulary list and the embedding matrix:
print(len(wordsList))
print(wordVectors.shape)
400000
(400000, 50)
We can also search the vocabulary for a word, for example "baseball", and then look up the embedding matrix to get the corresponding vector:
baseballIndex = wordsList.index("baseball")
wordVectors[baseballIndex]
array([-1.9327 , 1.0421 , -0.78515 , 0.91033 , 0.22711 , -0.62158 , -1.6493 , 0.07686 , -0.5868 , 0.058831, 0.35628 , 0.68916 , -0.50598 , 0.70473 , 1.2664 , -0.40031 , -0.020687, 0.80863 , -0.90566 , -0.074054, -0.87675 , -0.6291 , -0.12685 , 0.11524 , -0.55685 , -1.6826 , -0.26291 , 0.22632 , 0.713 , -1.0828 , 2.1231 , 0.49869 , 0.066711, -0.48226 , -0.17897 , 0.47699 , 0.16384 , 0.16537 , -0.11506 , -0.15962 , -0.94926 , -0.42833 , -0.59457 , 1.3566 , -0.27506 , 0.19918 , -0.36008 , 0.55667 , -0.70315 , 0.17157 ], dtype=float32)
With the vectors in hand, the first step is to take an input sentence and build its vector representation. Suppose the input sentence is "I thought the movie was incredible and inspiring". To get the word vectors we can use TensorFlow's embedding-lookup function, which takes two arguments: the embedding matrix (here, the word-vector matrix) and the index of each word. The following concrete example shows how:
import tensorflow as tf

# Maximum sentence length
maxSeqLength = 10
# Word-vector dimension (not actually used below; the GloVe vectors loaded here are 50-dimensional)
numDimensions = 300
firstSentence = np.zeros((maxSeqLength), dtype="int32")
firstSentence[0] = wordsList.index("i")
firstSentence[1] = wordsList.index("thought")
firstSentence[2] = wordsList.index("the")
firstSentence[3] = wordsList.index("movie")
firstSentence[4] = wordsList.index("was")
firstSentence[5] = wordsList.index("incredible")
firstSentence[6] = wordsList.index("and")
firstSentence[7] = wordsList.index("inspiring")
# firstSentence[8] and firstSentence[9] stay 0
print(firstSentence.shape)
print(firstSentence)
(10,)
[ 41 804 201534 1005 15 7446 5 13767 0 0]
Image(filename="./data/17_07.png",width=500)
with tf.Session() as session:
    print(tf.nn.embedding_lookup(wordVectors, firstSentence).eval().shape)
WARNING:tensorflow:From E:\Anaconda\envs\mytensorflow\lib\site-packages\tensorflow\python\ops\embedding_ops.py:132: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
(10, 50)
The output is a 10 x 50 word matrix: 10 words, each represented by a 50-dimensional vector.
The new training set is the IMDB dataset, which contains 25,000 movie reviews: 12,500 positive and 12,500 negative. Each review is stored as a text file, so the first thing to do is parse those files. The positive reviews live in one directory and the negative reviews in another.
from os import listdir
from os.path import isfile, join

positiveFiles = ["./imdb/positiveReviews/" + f for f in listdir("./imdb/positiveReviews/")
                 if isfile(join("./imdb/positiveReviews/", f))]
negativeFiles = ["./imdb/negativeReviews/" + f for f in listdir("./imdb/negativeReviews/")
                 if isfile(join("./imdb/negativeReviews/", f))]
numWords = []
for pf in positiveFiles:
    with open(pf, "r", encoding="utf-8") as f:
        line = f.readline()
        counter = len(line.split())
        numWords.append(counter)
print("Positive files finished")

for nf in negativeFiles:
    with open(nf, "r", encoding="utf-8") as f:
        line = f.readline()
        counter = len(line.split())
        numWords.append(counter)
print("Negative files finished")

numFiles = len(numWords)
print("The total number of files is", numFiles)
print("The total number of words in the files is", sum(numWords))
print("The average number of words in the files is", sum(numWords)/len(numWords))
Positive files finished
Negative files finished
The total number of files is 25000
The total number of words in the files is 5844680
The average number of words in the files is 233.7872
Visualize the data with Matplotlib:
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm

myfont = fm.FontProperties(fname="E:/Anaconda/envs/mytensorflow/Lib/site-packages/matplotlib/mpl-data/fonts/ttf/Simhei.ttf")
%matplotlib inline

plt.hist(numWords, 50)
plt.xlabel("序列长度", fontproperties=myfont)
plt.ylabel("频率", fontproperties=myfont)
plt.axis([0, 1200, 0, 8000])
plt.show()
From the histogram and the average number of words per review, setting the maximum sentence length to 250 is reasonable. Next we convert the text of a single file into an index matrix. The code below prints one review from the corpus:
maxSeqLength = 250
fname = positiveFiles[3]
with open(fname) as f:
    for lines in f:
        print(lines)
        break
This is easily the most underrated film inn the Brooks cannon. Sure, its flawed. It does not give a realistic view of homelessness (unlike, say, how Citizen Kane gave a realistic view of lounge singers, or Titanic gave a realistic view of Italians YOU IDIOTS). Many of the jokes fall flat. But still, this film is very lovable in a way many comedies are not, and to pull that off in a story about some of the most traditionally reviled members of society is truly impressive. Its not The Fisher King, but its not crap, either. My only complaint is that Brooks should have cast someone else in the lead (I love Mel as a Director and Writer, not so much as a lead).
Now convert it into an index array:
import re

# Remove punctuation, parentheses, question marks, etc., keeping only alphanumeric characters
strip_special_chars = re.compile("[^A-Za-z0-9 ]+")

def cleanSentences(string):
    string = string.lower().replace("<br />", " ")
    return re.sub(strip_special_chars, "", string.lower())

firstFile = np.zeros((maxSeqLength), dtype="int32")
with open(fname) as f:
    indexCounter = 0
    line = f.readline()
    cleanedLine = cleanSentences(line)
    split = cleanedLine.split()
    for word in split:
        try:
            firstFile[indexCounter] = wordsList.index(word)
        except ValueError:
            firstFile[indexCounter] = 399999
        indexCounter = indexCounter + 1
firstFile
array([ 37, 14, 2407, 201534, 96, 37314, 319, 7158, 201534, 6469, 8828, 1085, 47, 9703, 20, 260, 36, 455, 7, 7284, 1139, 3, 26494, 2633, 203, 197, 3941, 12739, 646, 7, 7284, 1139, 3, 11990, 7792, 46, 12608, 646, 7, 7284, 1139, 3, 8593, 81, 36381, 109, 3, 201534, 8735, 807, 2983, 34, 149, 37, 319, 14, 191, 31906, 6, 7, 179, 109, 15402, 32, 36, 5, 4, 2933, 12, 138, 6, 7, 523, 59, 77, 3, 201534, 96, 4246, 30006, 235, 3, 908, 14, 4702, 4571, 47, 36, 201534, 6429, 691, 34, 47, 36, 35404, 900, 192, 91, 4499, 14, 12, 6469, 189, 33, 1784, 1318, 1726, 6, 201534, 410, 41, 835, 10464, 19, 7, 369, 5, 1541, 36, 100, 181, 19, 7, 410, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
Now we process all 25,000 reviews in the same way: importing the movie-review training set produces a 25,000 x 250 matrix. Because this is computationally very expensive, we again take the transfer-learning shortcut and load a precomputed index-matrix file directly.
ids=np.load("./imdb/idsMatrix.npy")
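For reference, here is a hedged sketch of how an index matrix like idsMatrix.npy could be built from scratch, reusing numFiles, maxSeqLength, cleanSentences, wordsList, positiveFiles, and negativeFiles defined above. It is very slow (wordsList.index is a linear search over 400,000 entries), which is exactly why the precomputed file is loaded instead:

# Sketch only: rebuild the 25000 x 250 index matrix instead of loading it
ids = np.zeros((numFiles, maxSeqLength), dtype="int32")
fileCounter = 0
for reviewFile in positiveFiles + negativeFiles:
    with open(reviewFile, "r", encoding="utf-8") as f:
        line = f.readline()
        split = cleanSentences(line).split()
        for indexCounter, word in enumerate(split[:maxSeqLength]):
            try:
                ids[fileCounter][indexCounter] = wordsList.index(word)
            except ValueError:
                ids[fileCounter][indexCounter] = 399999   # unknown words map to the last index
    fileCounter = fileCounter + 1
# np.save("./imdb/idsMatrix.npy", ids)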
Next we define two helper functions, one that returns a random training batch and one that returns a test batch. As the code below shows, training batches mix positive reviews (indices 1-11499) with negative reviews (indices 13499-24999), while the reviews in between (11499-13499) are reserved for testing:
from random import randint

def getTrainBatch():
    labels = []
    arr = np.zeros([batchSize, maxSeqLength])
    for i in range(batchSize):
        if (i % 2 == 0):
            num = randint(1, 11499)
            labels.append([1, 0])
        else:
            num = randint(13499, 24999)
            labels.append([0, 1])
        arr[i] = ids[num-1:num]
    return arr, labels

def getTestBatch():
    labels = []
    arr = np.zeros([batchSize, maxSeqLength])
    for i in range(batchSize):
        num = randint(11499, 13499)
        if (num <= 12499):
            labels.append([1, 0])
        else:
            labels.append([0, 1])
        arr[i] = ids[num-1:num]
    return arr, labels
First define some hyperparameters, such as the batch size, the number of LSTM units, the number of classes, and the number of training iterations:
batchSize = 24
lstmUnits = 64
numClasses = 2
iterations = 20000
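As a quick sanity check (not part of the original flow), one training batch can be drawn to confirm that its shape matches the placeholders defined next:

# Draw one batch: 24 sequences of 250 word indices and 24 one-hot labels
arr, batchLabels = getTrainBatch()
print(arr.shape)                    # (24, 250)
print(np.array(batchLabels).shape)  # (24, 2)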
We need two placeholders, one for the input data and one for the labels. The most important thing with placeholders is getting their dimensions right:
import tensorflow as tf

tf.reset_default_graph()

labels = tf.placeholder(tf.float32, [batchSize, numClasses])
input_data = tf.placeholder(tf.int32, [batchSize, maxSeqLength])
The labels placeholder holds a set of values, each of which is either [1,0] or [0,1] depending on whether the review is positive or negative. The input placeholder is an array of integer word indices.
from IPython.display import Image Image(filename="./data/17_06.png",width=500)
Once the input-data placeholder is set up, we can call tf.nn.embedding_lookup() to obtain the word vectors. The function returns a three-dimensional tensor whose first dimension is the batch size, second dimension is the sentence length, and third dimension is the word-vector length.
data=tf.nn.embedding_lookup(wordVectors,input_data)
Image(filename="./data/17_05.png",width=500)
How do we feed this kind of data into our LSTM network? First we use the tf.nn.rnn_cell.BasicLSTMCell function, which takes an integer argument specifying the number of LSTM units. This is one of our hyperparameters and needs tuning to find the best value. We then wrap the cell with a dropout parameter to reduce overfitting. Finally, we feed the LSTM cell and the three-dimensional data into tf.nn.dynamic_rnn, which unrolls the whole network and builds the RNN model.
lstmCell = tf.contrib.rnn.BasicLSTMCell(lstmUnits)
lstmCell = tf.contrib.rnn.DropoutWrapper(cell=lstmCell, output_keep_prob=0.25)
value, _ = tf.nn.dynamic_rnn(lstmCell, data, dtype=tf.float32)
WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.
WARNING:tensorflow:From <ipython-input-16-56feceb9201e>:1: BasicLSTMCell.__init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This class is equivalent as tf.keras.layers.LSTMCell, and will be replaced by that in Tensorflow 2.0.
WARNING:tensorflow:From <ipython-input-16-56feceb9201e>:3: dynamic_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
WARNING:tensorflow:From E:\Anaconda\envs\mytensorflow\lib\site-packages\tensorflow\python\ops\rnn_cell_impl.py:1259: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
LSTMs help the model remember more contextual information, but the downside is that the number of trainable parameters grows substantially, so training takes much longer and the risk of overfitting increases.
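To make that trade-off concrete, here is a hedged alternative that is not used in this chapter: stacking two LSTM layers with tf.contrib.rnn.MultiRNNCell. Every extra layer adds another full set of LSTM weights, which is exactly the growth in parameters, training time, and overfitting risk described above:

# Not used in this chapter: a two-layer LSTM stack for comparison
def make_cell():
    cell = tf.contrib.rnn.BasicLSTMCell(lstmUnits)
    return tf.contrib.rnn.DropoutWrapper(cell=cell, output_keep_prob=0.25)

stackedCell = tf.contrib.rnn.MultiRNNCell([make_cell() for _ in range(2)])
# value, _ = tf.nn.dynamic_rnn(stackedCell, data, dtype=tf.float32)  # would replace the call above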
The first output of the dynamic_rnn function can be regarded as the final hidden-state vector. This vector is reshaped and then multiplied by a final weight matrix, with a bias term added, to produce the final output values.
weight = tf.Variable(tf.truncated_normal([lstmUnits, numClasses]))
bias = tf.Variable(tf.constant(0.1, shape=[numClasses]))
value = tf.transpose(value, [1, 0, 2])
last = tf.gather(value, int(value.get_shape()[0]) - 1)
prediction = (tf.matmul(last, weight) + bias)
Next we define the prediction-correctness check and the accuracy metric. A prediction is correct when the 0-1 vector produced by the final output matches the labelled 0-1 vector.
correctPred = tf.equal(tf.argmax(prediction, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correctPred, tf.float32))
Then we use the standard cross-entropy as the loss function. For the optimizer we choose Adam with its default learning rate:
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=prediction, labels=labels))
optimizer = tf.train.AdamOptimizer().minimize(loss)
If you want to visualize the loss and accuracy with TensorBoard, modify and run the following code:
import datetime

tf.summary.scalar("Loss", loss)
tf.summary.scalar("Accuracy", accuracy)
merged = tf.summary.merge_all()
logdir = "./tensorboard/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S") + "/"
sess = tf.Session()
writer = tf.summary.FileWriter(logdir, sess.graph)
Choosing appropriate hyperparameters (learning rate, number of LSTM units, batch size, and so on) is crucial for training your network well.
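For example (an illustrative tweak, not the configuration actually used in this chapter), the learning rate of the Adam optimizer defined above can be set explicitly instead of relying on the default; the value 0.0005 below is made up for illustration:

# Optional, illustrative only: redefine the optimizer with an explicit learning rate
# (this would replace the AdamOptimizer().minimize(loss) line defined earlier)
optimizer = tf.train.AdamOptimizer(learning_rate=0.0005).minimize(loss)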
The basic idea of the training loop is: first define a TensorFlow session; then load a batch of reviews and their labels; then call the session's run function. run takes two arguments. The first is called fetches and defines the values we are interested in, here the optimizer, because we want to minimize the loss. The second is called feed_dict and supplies the placeholders: we feed a batch of reviews and labels into the model and keep looping over fresh training batches.
To get reasonably good performance the number of iterations should be fairly large. Here we iterate only 20,000 times (iterations=20000), using a GPU.
sess = tf.InteractiveSession()
saver = tf.train.Saver()
sess.run(tf.global_variables_initializer())

with tf.device("/gpu:0"):
    for i in range(iterations):
        # Get the next batch of data
        nextBatch, nextBatchLabels = getTrainBatch()
        sess.run(optimizer, {input_data: nextBatch, labels: nextBatchLabels})
        # Write summary information to TensorBoard
        if (i % 50 == 0):
            summary = sess.run(merged, {input_data: nextBatch, labels: nextBatchLabels})
            writer.add_summary(summary, i)
        # Save the network every 1000 training iterations
        if (i % 1000 == 0 and i != 0):
            save_path = saver.save(sess, "./models/pretrained_lstm.ckpt", global_step=i)
            print("saved to %s" % save_path)
writer.close()
saved to ./models/pretrained_lstm.ckpt-1000
saved to ./models/pretrained_lstm.ckpt-2000
saved to ./models/pretrained_lstm.ckpt-3000
saved to ./models/pretrained_lstm.ckpt-4000
saved to ./models/pretrained_lstm.ckpt-5000
WARNING:tensorflow:From E:\Anaconda\envs\mytensorflow\lib\site-packages\tensorflow\python\training\saver.py:966: remove_checkpoint (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to delete files with this prefix.
saved to ./models/pretrained_lstm.ckpt-6000
saved to ./models/pretrained_lstm.ckpt-7000
saved to ./models/pretrained_lstm.ckpt-8000
saved to ./models/pretrained_lstm.ckpt-9000
saved to ./models/pretrained_lstm.ckpt-10000
saved to ./models/pretrained_lstm.ckpt-11000
saved to ./models/pretrained_lstm.ckpt-12000
saved to ./models/pretrained_lstm.ckpt-13000
saved to ./models/pretrained_lstm.ckpt-14000
saved to ./models/pretrained_lstm.ckpt-15000
saved to ./models/pretrained_lstm.ckpt-16000
saved to ./models/pretrained_lstm.ckpt-17000
saved to ./models/pretrained_lstm.ckpt-18000
saved to ./models/pretrained_lstm.ckpt-19000
The code above has already saved the model. To restore a pretrained model we use TensorFlow's Saver together with a session, and call its restore function. restore takes two arguments: the current session and the path of the saved model.
sess = tf.InteractiveSession()
saver = tf.train.Saver()
saver.restore(sess, tf.train.latest_checkpoint("./models"))
WARNING:tensorflow:From E:\Anaconda\envs\mytensorflow\lib\site-packages\tensorflow\python\training\saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from ./models\pretrained_lstm.ckpt-19000
Then we load some movie reviews from our test set. The following code reports the accuracy for each test batch:
iterations = 10
for i in range(iterations):
    nextBatch, nextBatchLabels = getTestBatch()
    print("Accuracy for this batch:", (sess.run(accuracy, {input_data: nextBatch, labels: nextBatchLabels})) * 100)
Accuracy for this batch: 41.66666567325592
Accuracy for this batch: 41.66666567325592
Accuracy for this batch: 33.33333432674408
Accuracy for this batch: 37.5
Accuracy for this batch: 54.16666865348816
Accuracy for this batch: 41.66666567325592
Accuracy for this batch: 41.66666567325592
Accuracy for this batch: 54.16666865348816
Accuracy for this batch: 45.83333432674408
Accuracy for this batch: 50.0
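Finally, the restored model can also score a single hand-written review. The helper below is a hedged sketch written for illustration (getSentenceMatrix and the sample sentence are not from the chapter); it reuses cleanSentences, wordsList, batchSize, maxSeqLength, and the prediction tensor defined above:

# Sketch: turn one sentence into an index matrix and run it through the restored network
def getSentenceMatrix(sentence):
    sentenceMatrix = np.zeros([batchSize, maxSeqLength], dtype="int32")
    words = cleanSentences(sentence).split()
    for i, word in enumerate(words[:maxSeqLength]):
        try:
            sentenceMatrix[0, i] = wordsList.index(word)
        except ValueError:
            sentenceMatrix[0, i] = 399999   # unknown word
    return sentenceMatrix

inputMatrix = getSentenceMatrix("That movie was terrible and boring")
scores = sess.run(prediction, {input_data: inputMatrix})[0]
# scores[0] is the "positive" logit, scores[1] the "negative" one (matching the [1,0]/[0,1] labels)
print("Positive" if scores[0] > scores[1] else "Negative")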