01 Boolean Model & Inverted Index

Date: 2019-11-21

Tags: Boolean model, index

Original post: http://www.cnblogs.com/jacklu/p/8379726.html

I took this course, SEEM 5680 Text Mining Models and Applications, in the first year of my PhD. These are my notes, kept for later reference.

1. The Boolean Model of Information Retrieval

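A 0 or 1 records whether a term appears in a document. In the example, to answer "Brutus AND Caesar but NOT Calpurnia" we apply Boolean operations to the term vectors. NOT Calpurnia is the complement of Calpurnia's vector 010000, i.e. 101111, so the query becomes 110100 AND 110111 AND 101111 = 100100. The matching documents are Antony and Cleopatra and Hamlet.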
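
The bitwise computation above can be reproduced with a short Python sketch. The three incidence vectors come from the example; the column order in `PLAYS` is an assumption (the usual six Shakespeare plays of the textbook version of this example), and the names `N_DOCS`, `mask`, and `matches` are illustrative only.

```python
# Assumed column order of the incidence matrix (leftmost bit = first play).
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
N_DOCS = len(PLAYS)

# One bit per document: 1 means the term occurs in that play.
brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000

# Python integers are unbounded, so NOT must be masked to N_DOCS bits.
mask = (1 << N_DOCS) - 1
not_calpurnia = ~calpurnia & mask  # 101111

result = brutus & caesar & not_calpurnia
print(format(result, "06b"))  # 100100

# Translate the set bits back into document names.
matches = [play for k, play in enumerate(PLAYS)
           if (result >> (N_DOCS - 1 - k)) & 1]
print(matches)  # ['Antony and Cleopatra', 'Hamlet']
```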

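However, this approach becomes very wasteful of space as the data grows. Suppose we have 1 million documents averaging 1,000 words each, with 500,000 distinct terms overall. The matrix is then 500,000 x 1,000,000. It is also sparse: the collection holds only 1,000,000 x 1,000 = 1 billion word occurrences, so the matrix contains at most 1 billion 1s among its 500 billion cells.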

2. Inverted Index

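The inverted index is designed to solve the space problem of the Boolean model described above. Specifically, for each term it stores the IDs of the documents containing that term, in sorted order, in a linked list.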

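The core of index construction is to sort the terms alphabetically and merge duplicate terms, while still recording each term's frequency.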
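
As a minimal sketch of these construction steps, the Python below emits (term, docID) pairs, sorts them, and merges duplicates. Several things here are assumptions for illustration: the function name `build_inverted_index`, the toy `docs` collection, whitespace tokenization, the use of Python lists in place of linked lists, and the choice to read "frequency" as the number of documents containing the term.

```python
def build_inverted_index(docs):
    """Return {term: sorted list of doc IDs}, with terms in alphabetical order."""
    # Step 1: emit one (term, doc_id) pair per token.
    pairs = [(term, doc_id)
             for doc_id, text in docs.items()
             for term in text.lower().split()]
    # Step 2: sort by term, then by document ID.
    pairs.sort()
    # Step 3: merge duplicate pairs into one postings list per term.
    index = {}
    for term, doc_id in pairs:
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return index

# Hypothetical toy collection, not taken from the course material.
docs = {
    1: "brutus killed caesar",
    2: "caesar praised brutus",
    3: "calpurnia warned caesar",
}
index = build_inverted_index(docs)
for term, postings in index.items():
    # len(postings) is the frequency assumed here: documents containing the term.
    print(term, len(postings), postings)
```

Because the pairs are sorted before merging, every postings list comes out in increasing document ID order, which is what the merge-style query algorithms rely on.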

3. Processing AND Queries in the Inverted Index Model

1) Compute Brutus AND Calpurnia, i.e. the intersection of the two linked lists.

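The idea of the algorithm is that whenever the two document IDs differ, we advance the pointer that points to the smaller ID. Pseudocode INTERSECTION(p1, p2):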

answer <- ()
while p1 != NIL and p2 != NIL
do if docID(p1) = docID(p2)
    then ADD(answer, docID(p1))
        p1 <- next(p1)
        p2 <- next(p2)
    else if docID(p1) < docID(p2)
    then p1 <- next(p1)
    else p2 <- next(p2)
return answer
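
A runnable Python version of this merge can look as follows. It is a sketch that assumes sorted Python lists stand in for the linked lists, with indices `i` and `j` playing the role of the pointers `p1` and `p2`; the function name `intersect` and the sample postings are illustrative.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted in increasing doc ID order."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same document in both lists: keep it and advance both.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer at the smaller doc ID
        else:
            j += 1
    return answer

# Illustrative postings lists.
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```

Each step advances at least one pointer, so the running time is linear in the combined length of the two lists.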

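Exercise: two terms A and B have postings lists of length 3 and 5 respectively. When intersecting A and B, what are the minimum and maximum numbers of accesses? Give an example of each.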

Counting each posting visited as one access, the minimum is 4, for example A: 1-2-3, B: 3-4-5-6-7. The maximum is 8, for example A: 1-7-8, B: 3-4-5-7-9.
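
The two counts can be checked with an instrumented version of the merge. This sketch assumes that an "access" means one distinct posting read during the intersection; the function name `count_visited` is hypothetical.

```python
def count_visited(a, b):
    """Count the distinct postings read while intersecting sorted lists a and b."""
    visited = set()
    i = j = 0
    while i < len(a) and j < len(b):
        # Both current postings are read in order to compare them.
        visited.add(("A", i))
        visited.add(("B", j))
        if a[i] == b[j]:
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return len(visited)

print(count_visited([1, 2, 3], [3, 4, 5, 6, 7]))  # 4
print(count_visited([1, 7, 8], [3, 4, 5, 7, 9]))  # 8
```

In the first case A is exhausted while B is still at its first posting, so only 3 + 1 postings are read. In the second case the merge walks to the end of both lists and reads all 3 + 5 postings.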

2) Exercise: compute Brutus OR Calpurnia, i.e. the union of the two linked lists. Pseudocode UNION(p1, p2):

answer <- ()
while p1 != NIL and p2 != NIL
do if docID(p1) = docID(p2)
    then ADD(answer, docID(p1))
        p1 <- next(p1)
        p2 <- next(p2)
    else if docID(p1) < docID(p2)
    then ADD(answer, docID(p1))
        p1 <- next(p1)
    else ADD(answer, docID(p2))
        p2 <- next(p2)
while p1 != NIL
do ADD(answer, docID(p1))
    p1 <- next(p1)
while p2 != NIL
do ADD(answer, docID(p2))
    p2 <- next(p2)
return answer
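
A Python sketch of the union merge, again assuming sorted lists in place of linked lists and using the illustrative name `union`. Once one list is exhausted, the remaining entries of the other list still belong to the union, so the sketch appends that tail after the main loop.

```python
def union(p1, p2):
    """Merge two sorted postings lists into their sorted union, without duplicates."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # present in both: emit once
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            answer.append(p1[i])
            i += 1
        else:
            answer.append(p2[j])
            j += 1
    # At most one of these tails is non-empty; every leftover doc ID is in the union.
    answer.extend(p1[i:])
    answer.extend(p2[j:])
    return answer

print(union([1, 2, 4, 11], [2, 31, 54]))  # [1, 2, 4, 11, 31, 54]
```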

3) Exercise: compute Brutus AND NOT Calpurnia. Pseudocode INTERSECTION(p1, p2, AND NOT):

answer <- ()
while p1 != NIL and p2 != NIL
do if docID(p1) = docID(p2)
    then p1 <- next(p1)
        p2 <- next(p2)
    else if docID(p1) < docID(p2)
    then ADD(answer, docID(p1))
        p1 <- next(p1)
    else p2 <- next(p2)
while p1 != NIL
do ADD(answer, docID(p1))
    p1 <- next(p1)
return answer
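
A Python sketch of AND NOT under the same assumptions (sorted lists instead of linked lists, illustrative name `and_not`). When the second list runs out, every posting still left in the first list has no counterpart to exclude it, so the whole tail is kept.

```python
def and_not(p1, p2):
    """Return the doc IDs that appear in sorted list p1 but not in sorted list p2."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Document contains the excluded term: drop it.
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # p2 has already moved past this doc ID, so it cannot be excluded.
            answer.append(p1[i])
            i += 1
        else:
            j += 1
    # p2 is exhausted (or p1 is): the rest of p1 has nothing to exclude it.
    answer.extend(p1[i:])
    return answer

print(and_not([1, 2, 4, 11, 31, 45], [2, 31]))  # [1, 4, 11, 45]
```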

References: http://www1.se.cuhk.edu.hk/~seem5680/