Code analysis of an open-source search engine project: indexing (1)

Date: 2019-11-15


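First, a brief introduction to the basic terms. Keywords are search keys; tokens are the terms extracted from a document's text. Tokens and labels together make up the search keys (keywords) used by the indexer.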

1. The previous post crawled posts from Weibo. This post takes that data through the next stage of the search engine: indexing.

  // Read in the Weibo data
  file, err := os.Open("../../testdata/weibo_data.txt")

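2. To use the wukong engine you need to import two packages: the first defines the engine's functionality and the second defines the shared structs. The engine must also be initialized before it is used.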

import(
"github.com/huichen/wukong/engine"
"github.com/huichen/wukong/types"
)

3. Before indexing, there are a few basic concepts to keep in mind:

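IndexerInitOptions.IndexType selects one of three kinds of index table:
    1) DocIdsIndex provides the most basic index and records only the docid of each document in which a search key appears;
    2) FrequenciesIndex records the docid and also the frequency of the search key in each document;
    3) LocationsIndex covers what the two above provide and additionally stores the exact positions of the tokens within each document.
From top to bottom, these three index types offer more computing power at the cost of more memory; LocationsIndex in particular can use a great deal of memory when documents are long. Choose according to your needs. If no type is specified, the engine defaults to FrequenciesIndex.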
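
To make the memory trade-off concrete, here is a minimal sketch of what might be kept for one (search key, document) pair under each option. The constants are local stand-ins named after the article's options, and posting and buildPosting are hypothetical helpers, not wukong code. Frequency is simplified to a raw occurrence count, which is an assumption.

```go
package main

import "fmt"

// IndexType mirrors the three options described above. These are local
// stand-ins, not the constants from the wukong types package.
type IndexType int

const (
	DocIdsIndex IndexType = iota
	FrequenciesIndex
	LocationsIndex
)

// posting is a hypothetical record of what is stored for one
// (search key, document) pair.
type posting struct {
	DocId     uint64
	Frequency float32 // filled in only for FrequenciesIndex
	Locations []int   // filled in only for LocationsIndex
}

// buildPosting keeps only the fields that the chosen index type needs.
// Frequency is simplified to the number of occurrences (an assumption).
func buildPosting(t IndexType, docId uint64, locations []int) posting {
	p := posting{DocId: docId}
	switch t {
	case FrequenciesIndex:
		p.Frequency = float32(len(locations))
	case LocationsIndex:
		// Copy so the posting does not alias the caller's slice.
		p.Locations = append([]int(nil), locations...)
	}
	return p
}

func main() {
	locations := []int{0, 12, 40}
	for _, t := range []IndexType{DocIdsIndex, FrequenciesIndex, LocationsIndex} {
		fmt.Printf("%+v\n", buildPosting(t, 7, locations))
	}
}
```

The per-document cost grows from one integer, to an integer plus a float, to an integer plus one position per occurrence, which is why long documents make LocationsIndex expensive.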

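4. The wukong engine lets you add three kinds of index data:
    1) The document's content, which is segmented into tokens that are added to the index.
    2) The document's tokens. When the content is empty, the user may bypass wukong's built-in segmenter and supply the tokens directly, which makes it possible to segment documents outside the engine.
    3) The document's labels, such as a Weibo post's author or category. Labels do not appear in the content.
Note that the content takes priority as the source of tokens: user-supplied tokens are used only when the content is empty. Tokens and labels together make up the search keys (keywords) in the indexer, and labels never come from the content.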
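
The priority rule above can be sketched as follows. indexData and searchKeys are hypothetical names, and strings.Fields is only a stand-in for a real segmenter; none of this is wukong's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// indexData is a hypothetical stand-in for the three kinds of input
// described above: content, tokens and labels.
type indexData struct {
	Content string
	Tokens  []string
	Labels  []string
}

// searchKeys returns the keys a document would be indexed under.
// Assumption: strings.Fields replaces the engine's segmenter.
func searchKeys(d indexData) []string {
	var tokens []string
	if d.Content != "" {
		// Content wins: supplied tokens are ignored.
		tokens = strings.Fields(d.Content)
	} else {
		// Only an empty content lets the caller's tokens through.
		tokens = d.Tokens
	}
	keys := make([]string, 0, len(tokens)+len(d.Labels))
	keys = append(keys, tokens...)
	keys = append(keys, d.Labels...)
	return keys
}

func main() {
	withContent := indexData{Content: "open source search", Tokens: []string{"ignored"}, Labels: []string{"author:alice"}}
	tokensOnly := indexData{Tokens: []string{"search", "engine"}, Labels: []string{"category:tech"}}
	fmt.Println(searchKeys(withContent))
	fmt.Println(searchKeys(tokensOnly))
}
```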

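5. The engine indexes asynchronously: when IndexDocument returns, the document may not yet have been added to the index table. This makes it convenient to add documents concurrently in a loop. If you need to wait until indexing has finished before continuing, call: searcher.FlushIndex()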

6. The indexing code is analyzed below; parts that overlap with other features will be covered in later posts on indexing.

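The code below defines the indexer's basic building blocks. A sync.RWMutex read-write lock is embedded next to the map to make access to it safe. Locks and unlocks must be paired: unlocking a mutex more times than it was locked is a fatal run-time error in Go. One option is to check the counts explicitly; the more common route, used in AddDocument later in this post, is to pair each Lock with a deferred Unlock so the two cannot drift apart.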

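// The indexer
type Indexer struct {
    // Inverted index from search keys to document lists.
    // Guarded by a read-write lock for safe concurrent access.
    tableLock struct {
        sync.RWMutex
        table map[string]*KeywordIndices
    }
    initOptions types.IndexerInitOptions
    initialized bool
    // This is actually an approximation of the total number of documents
    numDocuments uint64
    // Total number of tokens across all indexed text
    totalTokenLength float32
    // Token length of each document
    docTokenLengths map[uint64]float32
}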

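The fields defined here were covered in the concepts above. Make sure the chosen IndexType fits both the business requirements and the memory the system can afford.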

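// One row of the inverted index table: all documents in which a search key appears, sorted by DocId in ascending order.
type KeywordIndices struct {
    // Whether the slices below are empty depends on the IndexType chosen at initialization
    docIds      []uint64  // present for every index type
    frequencies []float32 // IndexType == FrequenciesIndex
    locations   [][]int   // IndexType == LocationsIndex
}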

Next, the indexer is initialized:

// Initialize the indexer
func (indexer *Indexer) Init(options types.IndexerInitOptions) {
    if indexer.initialized == true {
        // Message: "the indexer cannot be initialized twice"
        log.Fatal("索引器不能初始化两次")
    }
    indexer.initialized = true
    indexer.tableLock.table = make(map[string]*KeywordIndices)
    indexer.initOptions = options
    indexer.docTokenLengths = make(map[uint64]float32)
}

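Next, a document is added to the index: its search keys are recorded along with their frequencies or, depending on the index type, their positions.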

// Add a document to the inverted index table
func (indexer *Indexer) AddDocument(document *types.DocumentIndex) {
    if indexer.initialized == false {
        // Message: "the indexer has not been initialized"
        log.Fatal("索引器还没有初始化")
    }
    indexer.tableLock.Lock()
    defer indexer.tableLock.Unlock()
    // Update the total token length
    if document.TokenLength != 0 {
        originalLength, found := indexer.docTokenLengths[document.DocId]
        indexer.docTokenLengths[document.DocId] = float32(document.TokenLength)
        if found {
            indexer.totalTokenLength += document.TokenLength - originalLength
        } else {
            indexer.totalTokenLength += document.TokenLength
        }
    } 
    ...
    ...
    ...

After the token lengths are updated, each of the document's search keys is looked up in the table:

    docIdIsNew := true
    for _, keyword := range document.Keywords {
        indices, foundKeyword := indexer.tableLock.table[keyword.Text]
        if !foundKeyword {
            // If the search key is not in the table yet, add it
            ti := KeywordIndices{}
            switch indexer.initOptions.IndexType {
            case types.LocationsIndex:
                ti.locations = [][]int{keyword.Starts}
            case types.FrequenciesIndex:
                ti.frequencies = []float32{keyword.Frequency}
            }
            ti.docIds = []uint64{document.DocId}
            indexer.tableLock.table[keyword.Text] = &ti
            continue
        }
        // Find the position at which to insert
        position, found := indexer.searchIndex(
            indices, 0, indexer.getIndexLength(indices)-1, document.DocId)
        if found {
            docIdIsNew = false
            // Overwrite the existing index entry
            switch indexer.initOptions.IndexType {
            case types.LocationsIndex:
                indices.locations[position] = keyword.Starts
            case types.FrequenciesIndex:
                indices.frequencies[position] = keyword.Frequency
            }
            continue
        }

Here a new index entry is inserted according to the chosen IndexType:

        // When the entry does not exist yet, insert a new one
        switch indexer.initOptions.IndexType {
        case types.LocationsIndex:
            indices.locations = append(indices.locations, []int{})
            copy(indices.locations[position+1:], indices.locations[position:])
            indices.locations[position] = keyword.Starts
        case types.FrequenciesIndex:
            indices.frequencies = append(indices.frequencies, float32(0))
            copy(indices.frequencies[position+1:], indices.frequencies[position:])
            indices.frequencies[position] = keyword.Frequency
        }
        indices.docIds = append(indices.docIds, 0)
        copy(indices.docIds[position+1:], indices.docIds[position:])
        indices.docIds[position] = document.DocId
    }
    // Update the total number of documents
    if docIdIsNew {
        indexer.numDocuments++
    }
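
The append-then-copy idiom used above to keep each row sorted by DocId can be tried in isolation. In this sketch, insertSorted is a hypothetical helper, and sort.Search from the standard library replaces the indexer's own searchIndex method, whose body is not shown in this post.

```go
package main

import (
	"fmt"
	"sort"
)

// insertSorted inserts docId into the ascending slice ids and keeps it
// sorted. If docId is already present the slice is returned unchanged.
func insertSorted(ids []uint64, docId uint64) []uint64 {
	// Binary search for the first position whose value is >= docId.
	pos := sort.Search(len(ids), func(i int) bool { return ids[i] >= docId })
	if pos < len(ids) && ids[pos] == docId {
		return ids
	}
	ids = append(ids, 0)         // grow by one slot
	copy(ids[pos+1:], ids[pos:]) // shift the tail one step to the right
	ids[pos] = docId             // fill the gap
	return ids
}

func main() {
	ids := []uint64{2, 5, 9}
	ids = insertSorted(ids, 7)
	ids = insertSorted(ids, 1)
	ids = insertSorted(ids, 5) // already present, no change
	fmt.Println(ids)           // [1 2 5 7 9]
}
```

Each insert shifts the tail of the slice, so it costs time proportional to the row length; in exchange, lookups can binary-search the row and rows can be intersected in one pass.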

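When the search keys combine tokens and labels, the search is narrowed further; again, note that labels are not part of the content. One might ask whether copy(keywords[len(tokens):], labels) in the code below should instead be copy(keywords[len(tokens)+1:], labels). It should not: Go slices are zero-indexed, so the tokens occupy indices 0 through len(tokens)-1 and the labels must start at index len(tokens). Starting at len(tokens)+1 would leave an empty slot and drop the last label.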

// Find the documents that contain all of the search keys (AND operation).
// When docIds is not nil, search only within the documents listed in docIds.
func (indexer *Indexer) Lookup(
    tokens []string, labels []string, docIds *map[uint64]bool) (docs []types.IndexedDocument) {
    if indexer.initialized == false {
        log.Fatal("索引器还没有初始化")
    }
    if indexer.numDocuments == 0 {
        return
    }
    // Merge tokens and labels into search keys
    keywords := make([]string, len(tokens)+len(labels))
    copy(keywords, tokens)
    copy(keywords[len(tokens):], labels)
    indexer.tableLock.RLock()
    defer indexer.tableLock.RUnlock()
    table := make([]*KeywordIndices, len(keywords))
    for i, keyword := range keywords {
        indices, found := indexer.tableLock.table[keyword]
        if !found {
            // Return immediately when the inverted index has no such search key
            return
        } else {
            // Otherwise add the row to the lookup table
            table[i] = indices
        }
    }
    // Return immediately when nothing was found
    if len(table) == 0 {
        return
    }
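
The offset used when merging tokens and labels can be checked with a short experiment. mergeKeys repeats the three lines from Lookup; mergeKeysOffByOne is a hypothetical variant that uses the +1 offset, included only to show what goes wrong.

```go
package main

import "fmt"

// mergeKeys concatenates tokens and labels the same way Lookup does.
func mergeKeys(tokens, labels []string) []string {
	keywords := make([]string, len(tokens)+len(labels))
	copy(keywords, tokens)               // fills indices 0 .. len(tokens)-1
	copy(keywords[len(tokens):], labels) // fills indices len(tokens) .. end
	return keywords
}

// mergeKeysOffByOne uses the +1 offset. It leaves an empty string at
// index len(tokens) and silently drops the last label. With no labels
// at all it would panic, because the slice start exceeds the length.
func mergeKeysOffByOne(tokens, labels []string) []string {
	keywords := make([]string, len(tokens)+len(labels))
	copy(keywords, tokens)
	copy(keywords[len(tokens)+1:], labels)
	return keywords
}

func main() {
	tokens := []string{"a", "b"}
	labels := []string{"x", "y"}
	fmt.Printf("%q\n", mergeKeys(tokens, labels))         // ["a" "b" "x" "y"]
	fmt.Printf("%q\n", mergeKeysOffByOne(tokens, labels)) // ["a" "b" "" "x"]
}
```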

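Summary:

The code above lays the groundwork for the inverted index. An inverted index follows a word → document pattern: given a word, it finds every document that contains it, and it can also record how often and where the word occurs in each document. This post has only covered the preliminary details of building the index; the next one will focus on how the index is used.
As for the code above, my understanding is that Go reports failures as error values returned at run time rather than as exceptions, so the err returned by calls such as os.Open should be checked before the result is used; err.Error() only returns the message text. I am not sure I have this fully right, and it is something to watch in practice.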
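
The word → document pattern and the AND lookup can be sketched end to end in a few lines. buildIndex, intersect and lookupAll are hypothetical helpers for illustration; the intersection step in particular is an assumption, since the part of Lookup that combines the rows is not shown in this post.

```go
package main

import "fmt"

// buildIndex maps each word to the ascending ids of the documents that
// contain it. A document's id is its position in docs.
func buildIndex(docs [][]string) map[string][]uint64 {
	index := make(map[string][]uint64)
	for id, words := range docs {
		seen := make(map[string]bool)
		for _, w := range words {
			if !seen[w] { // record each document once per word
				seen[w] = true
				index[w] = append(index[w], uint64(id))
			}
		}
	}
	return index
}

// intersect returns the ids present in both ascending lists.
func intersect(a, b []uint64) []uint64 {
	var out []uint64
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			out = append(out, a[i])
			i++
			j++
		case a[i] < b[j]:
			i++
		default:
			j++
		}
	}
	return out
}

// lookupAll returns the documents that contain every key (AND query).
func lookupAll(index map[string][]uint64, keys ...string) []uint64 {
	if len(keys) == 0 {
		return nil
	}
	result, ok := index[keys[0]]
	if !ok {
		return nil
	}
	for _, k := range keys[1:] {
		result = intersect(result, index[k])
	}
	return result
}

func main() {
	docs := [][]string{
		{"go", "search", "index"},
		{"go", "weibo"},
		{"search", "go"},
	}
	index := buildIndex(docs)
	fmt.Println(lookupAll(index, "go", "search")) // [0 2]
}
```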
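
For the file-reading step at the start of this post, error handling might look like the following sketch. openData is a hypothetical wrapper; os.Open and fmt.Errorf with %w are standard library features.

```go
package main

import (
	"fmt"
	"os"
)

// openData opens path and, on failure, wraps the error with context so
// the caller can decide how to report it.
func openData(path string) (*os.File, error) {
	file, err := os.Open(path)
	if err != nil {
		return nil, fmt.Errorf("open weibo data: %w", err)
	}
	return file, nil
}

func main() {
	file, err := openData("../../testdata/weibo_data.txt")
	if err != nil {
		// err.Error() only yields the message text; the check above is
		// what actually prevents using a nil file.
		fmt.Println("cannot index:", err.Error())
		return
	}
	defer file.Close()
	fmt.Println("opened", file.Name())
}
```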