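Pointwise mutual information (PMI) is a measure of how strongly two things are associated, for example two words. It is defined as
PMI(x; y) = log [ p(x,y) / (p(x)p(y)) ] = log [ p(x|y) / p(x) ] = log [ p(y|x) / p(y) ]
From probability theory we know that if x and y are independent, then p(x,y)=p(x)p(y), so the ratio is 1 and the PMI is 0.
The more strongly the two are associated, the larger p(x,y) is relative to p(x)p(y). The later forms of the expression may be easier to interpret: the conditional probability p(x|y) of x appearing given that y has appeared, divided by the probability p(x) of x appearing on its own, naturally expresses how strongly x is associated with y.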
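As a quick numeric check of the definition, here is a minimal sketch that evaluates PMI directly from probabilities. The function name `pmi`, the sample probabilities, and the base-2 logarithm are all choices made for this illustration; any base works as long as it is used consistently.

```python
import math


def pmi(p_xy, p_x, p_y):
    """PMI in bits: log2 of the joint probability over the product of the marginals."""
    return math.log2(p_xy / (p_x * p_y))


# Illustrative probabilities only (not taken from any corpus).
independent = pmi(0.25, 0.5, 0.5)  # joint equals the product of the marginals
attracted = pmi(0.40, 0.5, 0.5)    # joint is larger than the product
repelled = pmi(0.10, 0.5, 0.5)     # joint is smaller than the product

print(independent, attracted, repelled)
```

The independent case comes out as zero, the second case is positive, and the third is negative, which is the sign convention the rest of the post relies on.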
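Example:
To take an example from natural language processing, suppose we want to measure the polarity of the word like (positive or negative sentiment). We can pick some positive-sentiment words in advance, such as good, and then compute the PMI of like and good:
PMI(like, good) = log [ p(like, good) / (p(like)p(good)) ]
where p(like) and p(good) are the probabilities of each word occurring in the corpus, and p(like, good) is the probability of the two occurring together.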
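To make the like/good idea concrete, here is a small sketch under stated assumptions: the corpus is a hypothetical set of sentences invented for this illustration, co-occurrence is counted at the sentence level, and the helper name `sentence_pmi` is not from any library. Real polarity estimation would need a far larger corpus and several seed words.

```python
import math

# Hypothetical toy corpus, invented for illustration only.
sentences = [
    "i like this good movie",
    "good food and i like it",
    "bad service and bad food",
    "i like the good weather",
    "the movie was bad",
    "the weather is fine",
    "i do not like bad coffee",
]
token_sets = [set(s.split()) for s in sentences]
n = len(token_sets)


def sentence_pmi(w1, w2):
    """PMI (base 2) where p(w) is the share of sentences that contain w."""
    n1 = sum(w1 in s for s in token_sets)
    n2 = sum(w2 in s for s in token_sets)
    n12 = sum(w1 in s and w2 in s for s in token_sets)
    if n12 == 0:
        # The words never co-occur, so the log of zero is taken as -infinity.
        return float("-inf")
    # (n12/n) / ((n1/n) * (n2/n)) simplifies to n12 * n / (n1 * n2)
    return math.log2(n12 * n / (n1 * n2))


pmi_good = sentence_pmi("like", "good")
pmi_bad = sentence_pmi("like", "bad")
print(pmi_good, pmi_bad)
```

On this toy data like scores positive against good and negative against bad, which is the kind of contrast a polarity estimate would build on.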
Here is a PMI implementation found on Stack Overflow:
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.tokenize import word_tokenize

text = "this is a foo bar bar black sheep foo bar bar black sheep foo bar bar black sheep shep bar bar black sentence"
words = word_tokenize(text)

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
for row in finder.score_ngrams(bigram_measures.pmi):
    print(row)
(('is', 'a'), 4.523561956057013)
(('this', 'is'), 4.523561956057013)
(('a', 'foo'), 2.938599455335857)
(('sheep', 'shep'), 2.938599455335857)
(('black', 'sheep'), 2.5235619560570135)
(('black', 'sentence'), 2.523561956057013)
(('sheep', 'foo'), 2.3536369546147005)
(('bar', 'black'), 1.523561956057013)
(('foo', 'bar'), 1.523561956057013)
(('shep', 'bar'), 1.523561956057013)
(('bar', 'bar'), 0.5235619560570131)
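The scores above can be reproduced by hand, which helps show what the measure is doing. The sketch below uses only the standard library and assumes that probabilities are estimated as counts divided by the total number of words, with a base-2 logarithm; that assumption matches the printed values for this text. The helper name `bigram_pmi` is made up for this example, and a plain whitespace split stands in for `word_tokenize`.

```python
import math
from collections import Counter

text = ("this is a foo bar bar black sheep foo bar bar black sheep "
        "foo bar bar black sheep shep bar bar black sentence")
words = text.split()  # whitespace split is enough for this punctuation-free text

word_counts = Counter(words)                    # occurrences of each word
bigram_counts = Counter(zip(words, words[1:]))  # occurrences of each adjacent pair
total = len(words)


def bigram_pmi(w1, w2):
    """log2( count(w1, w2) * N / (count(w1) * count(w2)) )"""
    return math.log2(bigram_counts[(w1, w2)] * total / (word_counts[w1] * word_counts[w2]))


# Sort from highest to lowest score, as in the listing above.
scores = sorted(((pair, bigram_pmi(*pair)) for pair in bigram_counts),
                key=lambda row: -row[1])
for row in scores:
    print(row)
```

Note that pairs seen only once, such as ('is', 'a'), get the highest scores because both words are rare, while the frequent pair ('bar', 'bar') ranks last.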
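Now let's write a complete program.
It implements the following features:
Read data from txt, csv, xls, or xlsx files (for tabular files, the text is stored in a single column)
Preprocess the text: tokenize, lowercase English, stem English words, and remove stop words
Compute the PMI of every pair of adjacent words (bigrams) across the whole text
Complete code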
import re
import csv

import jieba
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder


def save_pmi(words, csv_path):
    """Score every adjacent word pair by PMI and write the rows to csv_path."""
    bigram_measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(words)
    # 'w' rather than 'a+', so rerunning does not append a second header
    with open(csv_path, 'w', encoding='gbk', newline='') as csvf:
        writer = csv.writer(csvf)
        writer.writerow(('word1', 'word2', 'pmi_score'))
        for (word1, word2), score in finder.score_ngrams(bigram_measures.pmi):
            try:
                writer.writerow((word1, word2, score))
            except UnicodeEncodeError:
                # skip pairs containing characters that gbk cannot encode
                pass


def chinese(text):
    """Process Chinese text and save the PMI scores to "中文pmi计算.csv"."""
    # keep only Chinese characters, then segment with jieba
    content = ''.join(re.findall(r'[\u4e00-\u9fa5]+', text))
    words = [w for w in jieba.cut(content) if len(w) > 1]
    save_pmi(words, '中文pmi计算.csv')


def english(text):
    """Process English text and save the PMI scores to "english_pmi_computer.csv"."""
    stopwordss = set(stopwords.words('english'))
    stemmer = SnowballStemmer('english')
    tokenizer = RegexpTokenizer(r'\w+')
    words = tokenizer.tokenize(text)
    words = [w for w in words if not w.isnumeric()]
    words = [w.lower() for w in words]
    words = [stemmer.stem(w) for w in words]
    words = [w for w in words if w not in stopwordss]
    save_pmi(words, 'english_pmi_computer.csv')


def pmi_score(file, lang, column='数据列'):
    """
    Compute PMI.
    :param file: the raw text data file
    :param lang: language of the data, either chinese or english
    :param column: for csv/excel files, the name of the column that holds the text
    """
    # read the data
    if file.endswith('.csv'):
        text = '\n'.join(pd.read_csv(file)[column].astype(str))
    elif file.endswith(('.xls', '.xlsx')):
        text = '\n'.join(pd.read_excel(file)[column].astype(str))
    else:
        # assumes the text file is UTF-8 encoded; adjust if yours is not
        with open(file, encoding='utf-8') as f:
            text = f.read()
    # look up the handler named by lang (chinese or english) and run it on the text
    globals()[lang](text)


# compute PMI
pmi_score(file='test.txt', lang='chinese')
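Two spots in a script like this are fragile on unexpected input: deciding the file type from its name, and looking up the language handler through `globals()`, which raises a bare `KeyError` on a typo and will call any global that happens to match. The following is a hedged sketch of one alternative, not the original author's approach. The names `file_kind`, `pick_handler`, and `handlers` are hypothetical, and the lambdas are stand-ins for the `chinese` and `english` functions so the example runs on its own.

```python
from pathlib import Path


def file_kind(path):
    """Classify a data file by its extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return "csv"
    if suffix in (".xls", ".xlsx"):
        return "excel"
    return "text"


def pick_handler(lang, handlers):
    """Return the handler registered for lang, or raise a descriptive error."""
    try:
        return handlers[lang]
    except KeyError:
        raise ValueError(
            f"unsupported lang {lang!r}; expected one of {sorted(handlers)}"
        ) from None


# Stand-in handlers (hypothetical); in the full script these would be
# the chinese and english functions.
handlers = {"chinese": lambda text: "zh", "english": lambda text: "en"}

print(file_kind("data.XLSX"), file_kind("my_csv_notes.txt"))
print(pick_handler("english", handlers)("some text"))
```

An explicit dictionary keeps the set of accepted languages visible in one place, and checking the suffix avoids treating a file such as my_csv_notes.txt as a csv file.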
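The data in test.txt comes from the descriptions of more than 4,000 Zhihu Live sessions. (Screenshot: part of the PMI results.)
The PMI results are output in descending order. They show that the larger the PMI, the more strongly the two words are tied to each other and the better they go together.
Toward the very end of the list the PMI turns negative: those pairs co-occur less often than they would if the two words were independent, so there is little association between them.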