The simhash algorithm: deduplicating data at the scale of tens of millions


References on the simhash algorithm and its principles:

A plain-language explanation of the simhash algorithm: https://blog.csdn.net/le_le_name/article/details/51615931

An introduction to the simhash algorithm and its principles: https://blog.csdn.net/lengye7/article/details/79789206

Deduplicating massive amounts of text with SimHash: https://www.cnblogs.com/maybe2030/p/5203186.html#_label3
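The core algorithm described in the posts above fits in a few lines of pure Python: hash every feature to an f-bit integer, take a weighted vote per bit column, and keep the sign of each column. The following is an illustrative sketch, not the code of any of the linked implementations; the choice of md5 and 64 bits is an assumption:

```python
import hashlib

def simhash(features, f=64):
    # One vote counter per bit column: +1 for a 1-bit, -1 for a 0-bit.
    v = [0] * f
    for feat in features:
        # Hash each feature to an f-bit integer (md5 is an arbitrary stable choice).
        h = int(hashlib.md5(feat.encode("utf-8")).hexdigest(), 16) & ((1 << f) - 1)
        for i in range(f):
            v[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the fingerprint is set iff column i's vote is positive.
    return sum(1 << i for i in range(f) if v[i] > 0)
```

Because the votes are summed, the fingerprint depends only on the multiset of features, not their order, and changing a few features flips only a few bits.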

Python implementations:

Text-similarity comparison with simhash in Python (full code): http://www.javashuo.com/article/p-ritxpaev-kg.html

A Python implementation of simhash: https://blog.csdn.net/gzt940726/article/details/80460419

Using the Python simhash library

For details, see: https://leons.im/posts/a-python-implementation-of-simhash-algorithm/

(1) Computing a simhash value

>>> from simhash import Simhash
>>> print('%x' % Simhash(u'I am very happy'.split()).value)
9f8fd7efdb1ded7f

Simhash() takes a sequence of tokens, also called a feature sequence.
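Whitespace tokens are only one possible feature set; another common choice is overlapping character n-grams, which also work for text without clear word boundaries. A sketch of such a feature extractor (the function name and the width of 3 are assumptions, not part of the library):

```python
import re

def get_features(s, width=3):
    # Lowercase, strip non-word characters, then emit overlapping
    # character n-grams of the given width as the feature sequence.
    s = s.lower()
    s = re.sub(r"[^\w]+", "", s)
    return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))]
```

The resulting list can be passed to Simhash() in place of `text.split()`.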

(2) Computing the distance between two simhash values

>>> hash1 = Simhash(u'I am very happy'.split())
>>> hash2 = Simhash(u'I am very sad'.split())
>>> print(hash1.distance(hash2))
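The distance here is the Hamming distance between the two 64-bit fingerprints: the number of bit positions where they differ, i.e. the popcount of their XOR. A minimal sketch:

```python
def hamming_distance(a: int, b: int) -> int:
    # XOR leaves a 1 exactly at each differing bit; count those ones.
    return bin(a ^ b).count("1")
```

The smaller the distance, the more similar the two original texts; a common rule of thumb in the linked posts is to treat 64-bit fingerprints within distance 3 as near-duplicates.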



(3) Building an index

simhash is used for deduplication. Computing pairwise distances clearly cannot keep up once the dataset is large, so a dedicated index structure is used instead; see: http://www.cnblogs.com/maybe2030/p/5203186.html#_label4

from simhash import Simhash, SimhashIndex

# Build the index
data = {
    u'1': u'How are you I Am fine . blar blar blar blar blar Thanks .'.lower().split(),
    u'2': u'How are you i am fine .'.lower().split(),
    u'3': u'This is simhash test .'.lower().split(),
}
objs = [(id, Simhash(sent)) for id, sent in data.items()]
index = SimhashIndex(objs, k=10)  # k is the tolerance; the larger k is, the more near-duplicates are retrieved

# Query the index
s1 = Simhash(u'How are you . blar blar blar blar blar Thanks'.lower().split())
print(index.get_near_dups(s1))

# Add a new entry to the index
index.add(u'4', s1)
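The trick that makes such an index fast is the pigeonhole principle: if two f-bit fingerprints differ in at most k bits, then splitting each into k+1 bands guarantees at least one band is bit-for-bit identical, so the bands can serve as hash-table keys. A simplified sketch of that idea (class and method names are mine, not the library's, and f is assumed divisible by k+1):

```python
from collections import defaultdict

class BandIndex:
    """Near-duplicate lookup: split each f-bit fingerprint into k+1 bands;
    any fingerprint within Hamming distance k must share at least one band."""

    def __init__(self, k=3, f=64):
        self.k = k
        self.bands = k + 1
        self.width = f // self.bands          # band width; assumes f % (k+1) == 0
        self.buckets = defaultdict(set)       # (band_no, band_bits) -> {(id, fingerprint)}

    def _keys(self, fp):
        for i in range(self.bands):
            yield (i, (fp >> (i * self.width)) & ((1 << self.width) - 1))

    def add(self, obj_id, fp):
        for key in self._keys(fp):
            self.buckets[key].add((obj_id, fp))

    def get_near_dups(self, fp):
        # Candidates share at least one band; confirm with the exact distance.
        candidates = set().union(*(self.buckets[key] for key in self._keys(fp)))
        return [oid for oid, cand in candidates
                if bin(cand ^ fp).count("1") <= self.k]
```

Each query touches only k+1 buckets instead of scanning every stored fingerprint, which is what makes deduplication feasible at the scale of tens of millions of documents.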