Elasticsearch Field Options Norms

Date: 2019-11-05


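The role of the norms option when defining Elasticsearch fields

This article describes what the norms parameter does for two Elasticsearch field types, text and keyword.

When creating an ES index, you usually specify two kinds of configuration: settings and mappings. The settings relate to data storage (how many shards, how many replicas), while the mappings are the data model, similar to a table definition in MySQL. The mapping specifies the type of each field. Elasticsearch supports many field datatypes, such as String, Numeric, Date… String is further divided into two types: keyword and text. When creating an index, you define the fields and assign each one a type, as in the following example: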

PUT my_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "_doc": {
      "_source": {
        "enabled": true
      },
      "properties": {
        "title": {
          "type": "text",
          "norms": false
        },
        "overview": {
          "type": "text",
          "norms": true
        },
        "body": {
          "type": "text"
        },
        "author": {
          "type": "keyword",
          "norms": true
        },
        "chapters": {
          "type": "keyword",
          "norms": false
        },
        "email": {
          "type": "keyword"
        }
      }
    }
  }
}

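In the my_index index, the title field is of type text, while the author field is of type keyword.

For text fields, norms are enabled by default, whereas for keyword fields norms are disabled by default:

Whether field-length should be taken into account when scoring queries. Accepts true (text field datatype) or false (keyword field datatype).

Why are norms disabled by default for keyword fields? A keyword string can be understood as: do index the field, but don't analyze the string value. In other words, a keyword field is not broken into individual terms by an analyzer; it is a single-token field, so it has no need for normalization factors such as field length (fieldNorm) or tfNorm (term frequency norm). A text field, by contrast, is processed by an analyzer and produces a number of terms. Of two text fields, one may contain many terms (for example an article body) and the other only a few (for example an article title), so a multi-field query needs length normalization. That is presumably why norms are enabled by default for text fields.

In addition, of the two scoring algorithms commonly used in Lucene, TF-IDF and BM25, TF-IDF tends to give higher scores to shorter fields. Why? Lucene's similarity score consists mainly of three parts: the IDF score, the TF score, and fieldNorms. In the TF-IDF formula, the IDF score is log(numDocs/(docFreq+1)), the TF score is sqrt(tf), and fieldNorms is 1/sqrt(length). Therefore, the shorter the text, the larger fieldNorms and the higher the score, which is why TF-IDF is heavily biased toward short text.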
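The short-text bias can be checked with a few lines of Python. This is a minimal sketch that multiplies the three factors exactly as they are written above; it is not Lucene's actual implementation (which includes further details), and the document counts and field lengths are made-up illustration values.

```python
import math

def classic_tf(freq):
    # TF score as given in the text: sqrt(tf)
    return math.sqrt(freq)

def classic_idf(num_docs, doc_freq):
    # IDF score as given in the text: log(numDocs / (docFreq + 1))
    return math.log(num_docs / (doc_freq + 1))

def field_norm(length):
    # fieldNorms as given in the text: 1 / sqrt(length)
    return 1.0 / math.sqrt(length)

def tfidf_score(freq, num_docs, doc_freq, length):
    # Simplified score: product of the three factors (an assumption of this sketch)
    return classic_tf(freq) * classic_idf(num_docs, doc_freq) * field_norm(length)

# Same term statistics, different field lengths (hypothetical numbers)
short_score = tfidf_score(freq=1, num_docs=1000, doc_freq=9, length=4)
long_score = tfidf_score(freq=1, num_docs=1000, doc_freq=9, length=400)
print(short_score, long_score)
```

With identical term frequency and document frequency, the 4-term field scores ten times higher than the 400-term field, because 1/sqrt(4) is ten times 1/sqrt(400).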

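What do norms do?

Norms are an adjustment factor used when computing the score of a document or field. Both the TF-IDF and BM25 algorithms use norms when computing document scores; for details, see the Lucene document scoring formula in the linked article.

An Elasticsearch document contains multiple fields. The query parser (QueryParser) parses the user's query string into terms. In a multi-field search, each term is matched against each field and a score is computed per field; the per-field scores are then combined in some way (term-centric search vs. field-centric search) to produce the final score of the document.

The official ES documentation explains norms as follows:

Norms store various normalization factors that are later used at query time in order to compute the score of a document relatively to a query.

The normalization factors here are used for boosting when document scores are computed at query time. For example, when the score is computed with the BM25 formula (freq*(k1+1))/(freq+k1*(1-b+b*fieldLength/avgFieldLength)), the term fieldLength/avgFieldLength is the normalization factor.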
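As a rough illustration of "combined in some way", the sketch below shows two simple strategies for turning per-field scores into one document score: taking the best field plus a fraction of the others (the idea behind a dis_max query with a tie_breaker), and summing all fields. The function names and the per-field scores are hypothetical; the sketch does not reproduce how Elasticsearch executes any particular query type.

```python
def combine_best_field(field_scores, tie_breaker=0.0):
    # Best field wins; the other fields contribute tie_breaker * their score
    scores = list(field_scores.values())
    if not scores:
        return 0.0
    best = max(scores)
    return best + tie_breaker * (sum(scores) - best)

def combine_sum(field_scores):
    # Every matching field contributes its full score
    return sum(field_scores.values())

# Hypothetical per-field scores for one document
scores = {"title": 2.0, "overview": 0.5, "body": 0.25}

print(combine_best_field(scores))                   # best field only
print(combine_best_field(scores, tie_breaker=0.5))  # best field plus half of the rest
print(combine_sum(scores))                          # all fields added together
```

Whichever strategy is used, the per-field scores fed into it already contain the effect of norms, which is why length normalization matters most in multi-field queries.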
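The effect of fieldLength/avgFieldLength is easy to see by evaluating the formula directly. The sketch below is a direct transcription of the expression above; k1=1.2 and b=0.75 are the usual BM25 defaults, and the field lengths are made-up values.

```python
def bm25_tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    # (freq*(k1+1)) / (freq + k1*(1 - b + b*fieldLength/avgFieldLength))
    length_ratio = field_length / avg_field_length
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * length_ratio))

# One occurrence of the term, average field length of 100 terms (hypothetical)
average = bm25_tf_norm(freq=1, field_length=100, avg_field_length=100)
shorter = bm25_tf_norm(freq=1, field_length=50, avg_field_length=100)
longer = bm25_tf_norm(freq=1, field_length=200, avg_field_length=100)
print(shorter, average, longer)

# With b=0 the length ratio drops out of the formula entirely
no_length_effect = bm25_tf_norm(freq=1, field_length=200, avg_field_length=100, b=0.0)
print(no_length_effect)
```

A field shorter than average scores above the average-length case and a longer field scores below it; the parameter b controls how strongly the length ratio is applied.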

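The cost of norms

With norms enabled, each field of each document needs one byte to store its norm. Because norms are enabled by default for text fields, you can disable them on text fields that do not need scoring, which is a simple tuning opportunity.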

Although useful for scoring, norms also require quite a lot of disk (typically in the order of one byte per document per field in your index, even for documents that don’t have this specific field). As a consequence, if you don’t need scoring on a specific field, you should disable norms on that field
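A back-of-the-envelope estimate shows how this adds up. The sketch assumes the "one byte per document per field" figure quoted above; the document count and number of fields are hypothetical, and real on-disk size may differ.

```python
def norms_bytes(num_docs, num_fields_with_norms, bytes_per_norm=1):
    # Rough estimate: one norm per document per field that has norms enabled
    return num_docs * num_fields_with_norms * bytes_per_norm

# Hypothetical index: 100 million documents, 3 text fields with norms enabled
total = norms_bytes(num_docs=100_000_000, num_fields_with_norms=3)
print(total, "bytes, about", round(total / 1024 ** 2), "MiB")

# Disabling norms on two of the three fields
reduced = norms_bytes(num_docs=100_000_000, num_fields_with_norms=1)
print("saved:", total - reduced, "bytes")
```

Since each replica holds its own copy of the data, the figure would be multiplied accordingly on an index with replicas.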

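Norms are part of index-time boosting: when a document is indexed (written), all boosting factors are stored, and at query time they are read from memory and used in the score computation. See this passage from "Lucene in Action":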

During indexing, all sources of index-time boosts are combined into a single floating point number for each indexed field in the document. The document may have its own boost; each field may have a boost; and Lucene computes an automatic boost based on the number of tokens in the field (shorter fields have a higher boost). These boosts are combined and then compactly encoded (quantized) into a single byte, which is stored per field per document. During searching, norms for any field being searched are loaded into memory, decoded back into a floating-point number, and used when computing the relevance score.
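To make "encoded (quantized) into a single byte" concrete, here is a deliberately simplified sketch. It uses a plain linear mapping of 1/sqrt(length) onto 0..255, which is an assumption made for illustration only; Lucene uses its own compact encoding, which this code does not reproduce.

```python
import math

def length_boost(num_tokens):
    # Automatic boost from the passage: shorter fields get a higher boost
    return 1.0 / math.sqrt(num_tokens)

def encode_norm(boost):
    # Hypothetical linear quantization of a boost in (0, 1] to one byte
    return max(0, min(255, round(boost * 255)))

def decode_norm(byte_value):
    # Inverse of the hypothetical encoding; precision is lost on the way
    return byte_value / 255

for tokens in (1, 4, 400, 1000, 1010):
    stored = encode_norm(length_boost(tokens))
    print(tokens, stored, decode_norm(stored))
```

The point of the sketch is the loss of precision: fields of 1000 and 1010 tokens end up with the same stored byte, so after decoding they are indistinguishable, while clearly different lengths such as 4 and 400 remain far apart.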

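The other kind of boosting is search-time boosting, where boost factors are specified in the query and the document score is computed dynamically; for details see "Relevant Search: With Applications for Solr and Elasticsearch", which this article does not cover further. Note, however, that current ES versions no longer recommend index-time boosting and recommend search-time boosting instead. The official ES documentation gives the following reasons:

Boost factors stored at index time (with norms enabled) cannot be changed once stored; the only way to change them is to reindex.
Search-time boosting has the same effect as index-time boosting and lets you specify boost factors dynamically (though it presumably costs more CPU when scoring), so it is more flexible. Index-time boosting also requires extra storage.
Index-time boost factors are stored in the norms, where they interfere with field length normalization and lead to less accurate similarity calculations (lower quality relevance calculations).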
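The flexibility argument is easiest to see in a query body. The sketch below builds a multi_match request in which boosts are attached with the field^boost syntax; the helper name, search text, and boost values are hypothetical, and the body would still need to be sent to the _search endpoint of an index such as my_index.

```python
import json

def build_boosted_query(text, field_boosts):
    # Attach a search-time boost to each field using the "field^boost" syntax
    fields = [f"{name}^{boost}" for name, boost in field_boosts.items()]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}

# Boost title over overview for this query only (hypothetical values)
body = build_boosted_query("norms", {"title": 2, "overview": 1})
print(json.dumps(body, indent=2))

# Changing the weighting later only means building a different query; no reindex
retuned = build_boosted_query("norms", {"title": 5, "overview": 1})
print(json.dumps(retuned, indent=2))
```

Nothing stored in the index changes between the two requests, which is the advantage over boosts baked in at index time.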

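Appendix: the mapping of the my_index index. Settings that equal the default (norms on chapters and overview) are not shown in the output: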

GET my_index/_mapping

{
  "my_index": {
    "mappings": {
      "_doc": {
        "properties": {
          "author": {
            "type": "keyword",
            "norms": true
          },
          "body": {
            "type": "text"
          },
          "chapters": {
            "type": "keyword"
          },
          "email": {
            "type": "keyword"
          },
          "overview": {
            "type": "text"
          },
          "title": {
            "type": "text",
            "norms": false
          }
        }
      }
    }
  }
}

Source: http://www.javashuo.com/article/p-dnaymfdf-bo.html