Computing text similarity between multiple documents

Date: 2019-12-13


Let's start with the code:

from sklearn.feature_extraction.text import CountVectorizer
corpus = [
'UNC played Duke in basketball',
'Duke lost the basketball game',
'I ate a sandwich'
]
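# binary=True records only whether a word occurs (1) or not (0), not how often.
# stop_words='english' drops common function words such as "a", "an", "the", "in".
# Note: CountVectorizer does not stem or lemmatize, so "play" and "played" stay distinct.
vectorizer = CountVectorizer(binary=True, stop_words='english')
print(vectorizer.fit_transform(corpus).toarray())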
print(vectorizer.vocabulary_)

Output:

[[0 1 1 0 0 1 0 1]
 [0 1 1 1 1 0 0 0]
 [1 0 0 0 0 0 1 0]]
{'unc': 7, 'played': 5, 'duke': 2, 'basketball': 1, 'lost': 4, 'game': 3, 'ate': 0, 'sandwich': 6}

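The matrix has three rows, one per document, and every entry is either 0 or 1. Each row has 8 entries, which means the vocabulary built from the corpus contains 8 distinct English words. There are only 8 because the vectorizer was created with:

vectorizer = CountVectorizer(binary=True, stop_words='english')  # filter English stop words

Setting stop_words='english' filters out common function words such as "a", "an", "the" and "in". It does not merge inflected forms, though: CountVectorizer performs no stemming or lemmatization, so singular and plural nouns or different verb tenses remain separate tokens. That is why the vocabulary contains 'played' rather than 'play'. Treating "play" and "played" as the same word would require a custom tokenizer or a separate preprocessing step.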
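To check how inflected forms are handled, here is a minimal sketch that fits a vectorizer with the same settings on three forms of one verb. The toy list `forms` and the variable names are made up for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three inflected forms of the same verb, one per toy document (illustrative data).
forms = ['play', 'played', 'plays']

# Same settings as in the article.
vec = CountVectorizer(binary=True, stop_words='english')
vec.fit(forms)

# Each form gets its own vocabulary entry: no stemming or lemmatization happened.
vocab = sorted(vec.vocabulary_)
print(vocab)  # ['play', 'played', 'plays']
```

If the forms were merged, the vocabulary would contain a single entry instead of three.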

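Each 0 or 1 in the output indicates whether a particular vocabulary word occurs in the document. The column index assigned to each word is given by vocabulary_: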

{'unc': 7, 'played': 5, 'duke': 2, 'basketball': 1, 'lost': 4, 'game': 3, 'ate': 0, 'sandwich': 6}
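The mapping and the matrix can also be rebuilt by hand, which makes the encoding easier to follow. The sketch below is plain Python, not what sklearn does internally: `STOP_WORDS` is a hypothetical mini stop list containing only the stop words that occur in this corpus, and `tokenize` is a simplified whitespace tokenizer.

```python
corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'I ate a sandwich',
]

# Hypothetical mini stop list: just the stop words that appear in this corpus.
STOP_WORDS = {'in', 'the', 'i', 'a'}

def tokenize(text):
    """Lowercase, split on whitespace, drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

# Set of remaining words per document.
tokens = [set(tokenize(doc)) for doc in corpus]

# Column index = position of the word in the alphabetically sorted vocabulary.
words = sorted(set().union(*tokens))
vocabulary = {w: i for i, w in enumerate(words)}

# One row per document: 1 if the word occurs, 0 otherwise.
matrix = [[int(w in doc) for w in words] for doc in tokens]

print(vocabulary)
for row in matrix:
    print(row)
```

Under these assumptions the rows match the matrix printed by CountVectorizer, and 'unc' lands in column 7 because it sorts last alphabetically.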

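Within each sentence, a word that occurs is recorded as 1 and a word that does not is recorded as 0; that is all the matrix above encodes. Finally, we use a function from sklearn to compute the Euclidean distance between the feature vectors of the three sentences, i.e. between rows of the matrix. For two vectors x and y of length n the distance is:

d(x, y) = sqrt((x_1 - y_1)^2 + (x_2 - y_2)^2 + ... + (x_n - y_n)^2)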

The code is as follows:

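from sklearn.metrics.pairwise import euclidean_distances
# Reuses vectorizer and corpus from the first snippet.
# .toarray() returns a plain ndarray; recent scikit-learn releases reject the np.matrix from .todense().
counts = vectorizer.fit_transform(corpus).toarray()
for x, y in [[0, 1], [0, 2], [1, 2]]:
    # counts[[x]] keeps the row two-dimensional (shape 1 x n), as euclidean_distances expects
    dist = euclidean_distances(counts[[x]], counts[[y]])
    print('Distance between document {} and document {}: {}'.format(x, y, dist))

This gives the output:

Distance between document 0 and document 1: [[2.]]
Distance between document 0 and document 2: [[2.44948974]]
Distance between document 1 and document 2: [[2.44948974]]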

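This shows that document 2 is equally distant from documents 0 and 1 (about 2.449 in both cases), while documents 0 and 1, which share 'duke' and 'basketball', are closer to each other at a distance of 2.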
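The three distances can be verified by hand from the rows of the matrix. The following is a small sketch in plain Python; the helper name `euclidean` is made up for this illustration and is not part of sklearn.

```python
import math

# Rows copied from the binary matrix printed earlier.
rows = [
    [0, 1, 1, 0, 0, 1, 0, 1],  # document 0
    [0, 1, 1, 1, 1, 0, 0, 0],  # document 1
    [1, 0, 0, 0, 0, 0, 1, 0],  # document 2
]

def euclidean(u, v):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

distances = {(x, y): euclidean(rows[x], rows[y]) for x, y in [(0, 1), (0, 2), (1, 2)]}
for (x, y), d in distances.items():
    print(x, y, round(d, 8))
```

For binary vectors the squared distance is simply the number of words that occur in exactly one of the two documents: 4 for documents 0 and 1 (sqrt(4) = 2), and 6 for each pair involving document 2 (sqrt(6) ≈ 2.449).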