Elasticsearch Field Options Norms

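What the norms option does when defining Elasticsearch fields

This article explains what the norms parameter does for two Elasticsearch field types: text and keyword.

When creating an ES index, you usually specify two kinds of configuration: settings and mappings. Settings govern data storage (how many shards, how many replicas), while mappings are the data model, similar to a table schema in MySQL. The mapping specifies the type of each field. Elasticsearch supports many field datatypes, such as String, Numeric, and Date, and String is further divided into two types: keyword and text. When creating an index, you define the fields and assign each one a type, as in the following example: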

PUT my_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "_doc": {
      "_source": {
        "enabled": true
      },
      "properties": {
        "title": {
          "type": "text",
          "norms": false
        },
        "overview": {
          "type": "text",
          "norms": true
        },
        "body": {
          "type": "text"
        },
        "author": {
          "type": "keyword",
          "norms": true
        },
        "chapters": {
          "type": "keyword",
          "norms": false
        },
        "email": {
          "type": "keyword"
        }
      }
    }
  }
}

In my_index, the title field has type text, while the author field has type keyword.

For text fields, norms are enabled by default; for keyword fields, norms are disabled by default.

Whether field-length should be taken into account when scoring queries. Accepts true (text field datatype) or false (keyword field datatype)

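Why are norms disabled by default for keyword fields? A keyword string can be understood as: do index the field, but don't analyze the string value. In other words, a keyword field is not broken into individual terms by an Analyzer; it is a single-token field, so it does not need normalization factors such as field length (fieldNorm) or tfNorm (term frequency norm). A text field, by contrast, is processed by an Analyzer and produces a number of terms. Of two text fields, one may contain many terms (the body of an article) and the other only a few (its title), so a multi-field query needs length normalization. That is presumably why text fields enable norms by default.

In addition, of the two scoring algorithms commonly used with Lucene, TF-IDF and BM25, TF-IDF tends to give higher scores to shorter fields. Why? Lucene's similarity formula has three main parts: the IDF score, the TF score, and fieldNorms. In the TF-IDF formula, the IDF score is log(numDocs/(docFreq+1)), the TF score is sqrt(tf), and fieldNorms is 1/sqrt(length). So the shorter the field, the larger fieldNorms and the higher the score, which is why TF-IDF is heavily biased toward short text.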
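To make the length bias concrete, here is a minimal Python sketch of the three simplified components quoted above. It follows the formulas exactly as written in this article, not Lucene's actual implementation (which adds further details), and the function names and sample numbers are hypothetical.

```python
import math


def idf_score(num_docs: int, doc_freq: int) -> float:
    """Simplified IDF from the text: log(numDocs / (docFreq + 1))."""
    return math.log(num_docs / (doc_freq + 1))


def tf_score(term_freq: int) -> float:
    """Simplified TF from the text: sqrt(tf)."""
    return math.sqrt(term_freq)


def field_norm(length: int) -> float:
    """Simplified length norm from the text: 1 / sqrt(length)."""
    return 1.0 / math.sqrt(length)


def tfidf_term_score(term_freq: int, doc_freq: int, num_docs: int, length: int) -> float:
    """Product of the three parts for a single term in a single field."""
    return tf_score(term_freq) * idf_score(num_docs, doc_freq) * field_norm(length)


# Hypothetical numbers: the same term occurs once in a 4-term title
# and once in a 400-term body of an index holding 1000 documents.
title_score = tfidf_term_score(term_freq=1, doc_freq=9, num_docs=1000, length=4)
body_score = tfidf_term_score(term_freq=1, doc_freq=9, num_docs=1000, length=400)
print(title_score, body_score)
```

With everything else equal, only the 1/sqrt(length) factor differs, so in this sketch the 4-term field scores ten times higher than the 400-term field.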

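What do norms do?

norms are an adjustment factor used when computing the score of a document or field. Both TF-IDF and BM25 use norms when computing document scores; for details, see the Lucene document scoring formula in the referenced article.

An Elasticsearch document contains multiple fields. The query parser (QueryParser) parses the user's query string into terms. In a multi-field search, each term is matched against each field and a score is computed per field; the per-field scores are then combined in some way (term-centric versus field-centric search) to produce the final score of the document.

The official ES documentation explains norms as follows: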
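The difference between the two combination strategies can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Elasticsearch's actual implementation: the per-term, per-field scores are made-up numbers, and the function names and the tie_breaker parameter are assumptions chosen for the example.

```python
# Hypothetical scores of one document: scores[term][field].
scores = {
    "elasticsearch": {"title": 2.0, "body": 0.5},
    "norms": {"title": 0.0, "body": 1.5},
}


def field_centric(term_field_scores: dict, tie_breaker: float = 0.0) -> float:
    """Score each field over all terms, then keep the best field.

    The other fields contribute only through tie_breaker.
    """
    fields = sorted({f for per_field in term_field_scores.values() for f in per_field})
    totals = [
        sum(per_field.get(f, 0.0) for per_field in term_field_scores.values())
        for f in fields
    ]
    best = max(totals)
    return best + tie_breaker * (sum(totals) - best)


def term_centric(term_field_scores: dict) -> float:
    """For each term keep its best-matching field, then add the terms up."""
    return sum(max(per_field.values()) for per_field in term_field_scores.values())


print(field_centric(scores))  # best single field
print(term_centric(scores))   # best field per term, summed
```

In this made-up case neither field matches both terms well, so the field-centric score stays at the level of one field, while the term-centric score rewards the document for matching every term somewhere.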

Norms store various normalization factors that are later used at query time in order to compute the score of a document relatively to a query.

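The normalization factors here are used for boosting when document scores are computed at query time. For example, in the BM25 formula (freq*(k1+1))/(freq+k1*(1-b+b*fieldLength/avgFieldLength)), the term fieldLength/avgFieldLength is the normalization factor.

The cost of norms

With norms enabled, every field of every document needs one byte to store its norm. Because text fields enable norms by default, you can disable norms on text fields that do not need scoring; this is a worthwhile tuning point.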
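The effect of fieldLength/avgFieldLength is easy to see by evaluating the BM25 term-frequency expression directly. The Python sketch below implements only that expression; the function name and sample lengths are hypothetical, and the defaults k1=1.2 and b=0.75 are the commonly used BM25 values.

```python
def bm25_tf_norm(freq: float, field_length: float, avg_field_length: float,
                 k1: float = 1.2, b: float = 0.75) -> float:
    """Term-frequency part of BM25, as given in the text."""
    length_ratio = field_length / avg_field_length
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * length_ratio))


# Same term frequency, average field length of 100 terms (made-up values).
short_field = bm25_tf_norm(freq=1, field_length=10, avg_field_length=100)
average_field = bm25_tf_norm(freq=1, field_length=100, avg_field_length=100)
long_field = bm25_tf_norm(freq=1, field_length=1000, avg_field_length=100)
print(short_field, average_field, long_field)

# With b = 0 the length ratio drops out of the formula entirely.
no_length_short = bm25_tf_norm(freq=1, field_length=10, avg_field_length=100, b=0.0)
no_length_long = bm25_tf_norm(freq=1, field_length=1000, avg_field_length=100, b=0.0)
```

A field shorter than average gets a larger value than a field longer than average, and b controls how strongly the length ratio counts.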

Although useful for scoring, norms also require quite a lot of disk (typically in the order of one byte per document per field in your index, even for documents that don’t have this specific field). As a consequence, if you don’t need scoring on a specific field, you should disable norms on that field
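As a rough back-of-the-envelope check, the "one byte per document per field" figure can be turned into a storage estimate. The Python helper below is a hypothetical sketch based only on that figure; real usage depends on the Lucene version and on compression, and the document and field counts are made up.

```python
def estimated_norms_bytes(num_docs: int, fields_with_norms: int,
                          bytes_per_norm: int = 1) -> int:
    """Approximate norms storage: one value per document per field with norms enabled."""
    return num_docs * fields_with_norms * bytes_per_norm


# Hypothetical index: 100 million documents, 3 text fields with norms enabled.
total = estimated_norms_bytes(num_docs=100_000_000, fields_with_norms=3)
print(f"{total} bytes, about {total / 1024 ** 2:.0f} MiB")

# Disabling norms on one of the three fields removes one third of that estimate.
after_tuning = estimated_norms_bytes(num_docs=100_000_000, fields_with_norms=2)
```

Because the quoted documentation says the byte is spent even for documents that lack the field, the estimate multiplies by the total document count rather than by the number of documents that actually contain each field.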

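The norms factor is part of index-time boosting: when a document is indexed (written), all boosting factors are stored, and at query time they are read from memory and used in the score computation. See this passage from "Lucene in Action":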

During indexing, all sources of index-time boosts are combined into a single floating point number for each indexed field in the document. The document may have its own boost; each field may have a boost; and Lucene computes an automatic boost based on the number of tokens in the field (shorter fields have a higher boost). These boosts are combined and then compactly encoded (quantized) into a single byte, which is stored per field per document. During searching, norms for any field being searched are loaded into memory, decoded back into a floating-point number, and used when computing the relevance score.

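The other kind of boosting is search-time boosting: the boosting factor is specified in the query, and the document score is computed dynamically. For details, see "Relevant Search: With Applications for Solr and Elasticsearch"; this article does not cover it further. Note, however, that current ES versions no longer recommend index-time boosting and recommend search-time boosting instead. The official ES documentation gives the following reasons:

  • A boosting factor stored at index time (with the norms option enabled) cannot be changed once it is stored. To change it, you have to reindex.
  • Search-time boosting has the same effect as index-time boosting, and it lets you specify the boosting factor dynamically (though computing scores probably costs more CPU), so it is more flexible. Index-time boosting also needs extra storage.
  • Index-time boosting factors are stored in norms, which interferes with field length normalization and leads to less accurate similarity computation (lower quality relevance calculations).

Appendix: the mapping of the my_index index: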
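For comparison, a search-time boost lives entirely in the query body. The Python sketch below only builds such a body as a dictionary and prints it as JSON; it uses the field^boost notation of the multi_match query, the title and body fields from the example mapping, and a hypothetical helper name. Sending the request to a cluster is left out.

```python
import json


def boosted_multi_match(query_text: str, field_boosts: dict) -> dict:
    """Build a multi_match query body with per-field search-time boosts."""
    fields = [f"{name}^{boost}" for name, boost in field_boosts.items()]
    return {"query": {"multi_match": {"query": query_text, "fields": fields}}}


# Weight matches in title twice as heavily as matches in body (made-up weights).
body = boosted_multi_match("elasticsearch norms", {"title": 2, "body": 1})
print(json.dumps(body, indent=2))
```

Changing the weights means changing only the dictionary passed to the helper; nothing stored in the index has to be rebuilt, which is the flexibility the list above describes.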

GET my_index/_mapping

{
  "my_index": {
    "mappings": {
      "_doc": {
        "properties": {
          "author": {
            "type": "keyword",
            "norms": true
          },
          "body": {
            "type": "text"
          },
          "chapters": {
            "type": "keyword"
          },
          "email": {
            "type": "keyword"
          },
          "overview": {
            "type": "text"
          },
          "title": {
            "type": "text",
            "norms": false
          }
        }
      }
    }
  }
}

Original: http://www.javashuo.com/article/p-dnaymfdf-bo.html