ES 6.3.2, index name user_v1, 5 primary shards, one replica per shard. Each shard is around 11 GB:

GET _cat/shards/user_v1
There are 340 million documents in total, and the primary shards add up to 57 GB.
Segment info:

curl -X GET "221.228.105.140:9200/_cat/segments/user_v1?v" >> user_v1_segment
The user_v1 index has 404 segments in total:
cat user_v1_segment | wc -l
404
Massage the data a bit and plot a histogram with Python to see the distribution:
sed -i '1d' user_v1_segment                                   # drop the header line
awk -F ' ' '{print $7}' user_v1_segment >> docs_count.txt     # keep only the column of interest (docs.count)
import matplotlib.pyplot as plt

with open('docs_count.txt') as f:
    data = f.read()
doc_list = data.splitlines()
doc_nums = list(map(int, doc_list))
# note: the old normed=0 argument was dropped; it was removed in matplotlib 3.x
plt.hist(doc_nums, bins=40, facecolor='blue', edgecolor='black')
plt.show()
This gives a rough view of how many documents each segment holds. The x-axis is document count, the y-axis is number of segments. As you can see, most segments contain only a small number of documents (below $0.5\times10^7$).
Set refresh_interval to 30s (the default is 1s), which reduces the rate of new-segment creation to some extent (the settings call is sketched a few lines below). Then force merge to bring the 404 segments down to 200:
POST /user_v1/_forcemerge?only_expunge_deletes=false&max_num_segments=200&flush=true
But on checking again, there were still 312 segments. This is probably related to the merge settings. If you're interested, look into what the two parameters in the force merge call above, only_expunge_deletes and max_num_segments, actually mean.
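For reference, the refresh_interval change mentioned above is an ordinary index-settings update; a minimal sketch against this post's index:

PUT /user_v1/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}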
Profile analysis of the query then surfaced two problems:

1. Collector time was too long; some shards took as long as 7.9s. For background on profile analysis, see: profile-api
2. With the HanLP analysis plugin, the analyzer output surprisingly contained a "whitespace term", and matching that term took as long as 800ms!
Let's look at why:
POST /_analyze
{
  "analyzer": "hanlp_standard",
  "text": "人生 如梦"
}
The tokenization result contains a space token:
{ "tokens": [ { "token": "人生", "start_offset": 0, "end_offset": 2, "type": "n", "position": 0 }, { "token": " ", "start_offset": 0, "end_offset": 1, "type": "w", "position": 1 }, { "token": "如", "start_offset": 0, "end_offset": 1, "type": "v", "position": 2 }, { "token": "梦", "start_offset": 0, "end_offset": 1, "type": "n", "position": 3 } ] }
So when an actual document goes through the analyzer, does the space actually get stored as a term?
To find out, first define an index with term_vector enabled. See: store term-vector
PUT user
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "profile": {
      "properties": {
        "nick": {
          "type": "text",
          "analyzer": "hanlp_standard",
          "term_vector": "yes",
          "fields": {
            "raw": { "type": "keyword" }
          }
        }
      }
    }
  }
}
Then PUT a document into it:
PUT user/profile/1
{
  "nick": "人生 如梦"
}
Check the term vectors: docs-termvectors
GET /user/profile/1/_termvectors
{
  "fields": ["nick"],
  "offsets": true,
  "payloads": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true
}
The stored terms do include the space.
{ "_index": "user", "_type": "profile", "_id": "1", "_version": 1, "found": true, "took": 2, "term_vectors": { "nick": { "field_statistics": { "sum_doc_freq": 4, "doc_count": 1, "sum_ttf": 4 }, "terms": { " ": { "doc_freq": 1, "ttf": 1, "term_freq": 1 }, "人生": { "doc_freq": 1, "ttf": 1, "term_freq": 1 }, "如": { "doc_freq": 1, "ttf": 1, "term_freq": 1 }, "梦": { "doc_freq": 1, "ttf": 1, "term_freq": 1 } } } } }
Then run a profile analysis on the query:
GET user/profile/_search?human=true
{
  "profile": true,
  "query": {
    "match": {
      "nick": "人生 如梦"
    }
  }
}
The profile output actually contains a TermQuery against the whitespace term! (Note the space after "nick:" in the description below.)
"type": "TermQuery", "description": "nick: ", "time": "58.2micros", "time_in_nanos": 58244,
The full profile result is as follows:
"profile": { "shards": [ { "id": "[7MyDkEDrRj2RPHCPoaWveQ][user][0]", "searches": [ { "query": [ { "type": "BooleanQuery", "description": "nick:人生 nick: nick:如 nick:梦", "time": "642.9micros", "time_in_nanos": 642931, "breakdown": { "score": 13370, "build_scorer_count": 2, "match_count": 0, "create_weight": 390646, "next_doc": 18462, "match": 0, "create_weight_count": 1, "next_doc_count": 2, "score_count": 1, "build_scorer": 220447, "advance": 0, "advance_count": 0 }, "children": [ { "type": "TermQuery", "description": "nick:人生", "time": "206.6micros", "time_in_nanos": 206624, "breakdown": { "score": 942, "build_scorer_count": 3, "match_count": 0, "create_weight": 167545, "next_doc": 1493, "match": 0, "create_weight_count": 1, "next_doc_count": 2, "score_count": 1, "build_scorer": 36637, "advance": 0, "advance_count": 0 } }, { "type": "TermQuery", "description": "nick: ", "time": "58.2micros", "time_in_nanos": 58244, "breakdown": { "score": 918, "build_scorer_count": 3, "match_count": 0, "create_weight": 46130, "next_doc": 964, "match": 0, "create_weight_count": 1, "next_doc_count": 2, "score_count": 1, "build_scorer": 10225, "advance": 0, "advance_count": 0 } }, { "type": "TermQuery", "description": "nick:如", "time": "51.3micros", "time_in_nanos": 51334, "breakdown": { "score": 888, "build_scorer_count": 3, "match_count": 0, "create_weight": 43779, "next_doc": 1103, "match": 0, "create_weight_count": 1, "next_doc_count": 2, "score_count": 1, "build_scorer": 5557, "advance": 0, "advance_count": 0 } }, { "type": "TermQuery", "description": "nick:梦", "time": "59.1micros", "time_in_nanos": 59108, "breakdown": { "score": 3473, "build_scorer_count": 3, "match_count": 0, "create_weight": 49739, "next_doc": 900, "match": 0, "create_weight_count": 1, "next_doc_count": 2, "score_count": 1, "build_scorer": 4989, "advance": 0, "advance_count": 0 } } ] } ], "rewrite_time": 182090, "collector": [ { "name": "CancellableCollector", "reason": "search_cancelled", "time": "25.9micros", "time_in_nanos": 25906, "children": [ { "name": "SimpleTopScoreDocCollector", "reason": "search_top_hits", "time": "19micros", "time_in_nanos": 19075 } ] } ] } ], "aggregations": [] } ] }
In the actual production environment, the whitespace-term query took 480ms, while the query for a normal word ("微信") took only 18ms. Below is the profile result on shard [user_v1][3]:
"profile": { "shards": [ { "id": "[8eN-6lsLTJ6as39QJhK5MQ][user_v1][3]", "searches": [ { "query": [ { "type": "BooleanQuery", "description": "nick:微信 nick: nick:黄色", "time": "888.6ms", "time_in_nanos": 888636963, "breakdown": { "score": 513864260, "build_scorer_count": 50, "match_count": 0, "create_weight": 93345, "next_doc": 364649642, "match": 0, "create_weight_count": 1, "next_doc_count": 5063173, "score_count": 4670398, "build_scorer": 296094, "advance": 0, "advance_count": 0 }, "children": [ { "type": "TermQuery", "description": "nick:微信", "time": "18.4ms", "time_in_nanos": 18480019, "breakdown": { "score": 656810, "build_scorer_count": 62, "match_count": 0, "create_weight": 23633, "next_doc": 17712339, "match": 0, "create_weight_count": 1, "next_doc_count": 7085, "score_count": 5705, "build_scorer": 74384, "advance": 0, "advance_count": 0 } }, { "type": "TermQuery", "description": "nick: ", "time": "480.5ms", "time_in_nanos": 480508016, "breakdown": { "score": 278358058, "build_scorer_count": 72, "match_count": 0, "create_weight": 6041, "next_doc": 192388910, "match": 0, "create_weight_count": 1, "next_doc_count": 5056541, "score_count": 4665006, "build_scorer": 33387, "advance": 0, "advance_count": 0 } }, { "type": "TermQuery", "description": "nick:黄色", "time": "3.8ms", "time_in_nanos": 3872679, "breakdown": { "score": 136812, "build_scorer_count": 50, "match_count": 0, "create_weight": 5423, "next_doc": 3700537, "match": 0, "create_weight_count": 1, "next_doc_count": 923, "score_count": 755, "build_scorer": 28178, "advance": 0, "advance_count": 0 } } ] } ], "rewrite_time": 583986593, "collector": [ { "name": "CancellableCollector", "reason": "search_cancelled", "time": "730.3ms", "time_in_nanos": 730399762, "children": [ { "name": "SimpleTopScoreDocCollector", "reason": "search_top_hits", "time": "533.2ms", "time_in_nanos": 533238387 } ] } ] } ], "aggregations": [] },
Since I'm using HanLP tokenization via the elasticsearch-analysis-hanlp plugin, while ik_max_word does not exhibit this problem, this is most likely a bug in the plugin, so I filed an issue on GitHub; follow it there if you're interested. Looks like I'll have to dig into the source of the whole Elasticsearch analyze flow and the plugin-loading code :(
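Until the plugin is fixed, one workaround (a sketch of my own, not the plugin's official fix; the host and index names are the ones used in this post) is to pre-analyze the query string, drop whitespace-only tokens, and query the surviving terms explicitly:

import requests

ES = "http://221.228.105.140:9200"  # the cluster address used in this post

def clean_match_query(index, field, analyzer, text):
    # Run the query string through _analyze and drop whitespace-only tokens.
    resp = requests.get("%s/%s/_analyze" % (ES, index),
                        json={"analyzer": analyzer, "text": text})
    tokens = [t["token"] for t in resp.json()["tokens"] if t["token"].strip()]
    # Rebuild an equivalent OR-of-terms query without the blank term.
    return {"query": {"bool": {"should": [
        {"term": {field: tok}} for tok in tokens
    ]}}}

body = clean_match_query("user_v1", "nick", "hanlp_standard", "人生 如梦")
print(requests.get(ES + "/user_v1/_search", json=body).json())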
The above is a query-performance problem caused by a whitespace term. During profile analysis I also found that Collector time on SSD machines is roughly 10x faster than on machines with spinning disks.
Shard [user_v1][0] had a Collector time as high as 7.6 seconds, and the machine hosting this shard uses a spinning disk, whereas shard [user_v1][3] above lives on an SSD and its Collector time is only 730.3ms, roughly a 10x gap between SSD and spinning disk. Below is the profile analysis for shard [user_v1][0]:
{ "id": "[wx0dqdubRkiqJJ-juAqH4A][user_v1][0]", "searches": [ { "query": [ { "type": "BooleanQuery", "description": "nick:微信 nick: nick:黄色", "time": "726.1ms", "time_in_nanos": 726190295, "breakdown": { "score": 339421458, "build_scorer_count": 48, "match_count": 0, "create_weight": 65012, "next_doc": 376526603, "match": 0, "create_weight_count": 1, "next_doc_count": 4935754, "score_count": 4665766, "build_scorer": 575653, "advance": 0, "advance_count": 0 }, "children": [ { "type": "TermQuery", "description": "nick:微信", "time": "63.2ms", "time_in_nanos": 63220487, "breakdown": { "score": 649184, "build_scorer_count": 61, "match_count": 0, "create_weight": 32572, "next_doc": 62398621, "match": 0, "create_weight_count": 1, "next_doc_count": 6759, "score_count": 5857, "build_scorer": 127432, "advance": 0, "advance_count": 0 } }, { "type": "TermQuery", "description": "nick: ", "time": "1m", "time_in_nanos": 60373841264, "breakdown": { "score": 60184752245, "build_scorer_count": 69, "match_count": 0, "create_weight": 5888, "next_doc": 179443959, "match": 0, "create_weight_count": 1, "next_doc_count": 4929373, "score_count": 4660228, "build_scorer": 49501, "advance": 0, "advance_count": 0 } }, { "type": "TermQuery", "description": "nick:黄色", "time": "528.1ms", "time_in_nanos": 528107489, "breakdown": { "score": 141744, "build_scorer_count": 43, "match_count": 0, "create_weight": 4717, "next_doc": 527942227, "match": 0, "create_weight_count": 1, "next_doc_count": 967, "score_count": 780, "build_scorer": 17010, "advance": 0, "advance_count": 0 } } ] } ], "rewrite_time": 993826311, "collector": [ { "name": "CancellableCollector", "reason": "search_cancelled", "time": "7.8s", "time_in_nanos": 7811511525, "children": [ { "name": "SimpleTopScoreDocCollector", "reason": "search_top_hits", "time": "7.6s", "time_in_nanos": 7616467158 } ] } ] } ], "aggregations": [] },
Query performance depends not only on segment count, Collector time, and the like, but also on the index mapping and the query type (match, filter, term, ...). The Profile API can be used to analyze query-performance problems. There are also load-testing tools, for example: esrally
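A typical esrally run against an existing cluster might look roughly like the sketch below (the track choice is a placeholder, and flags can differ across esrally versions):

esrally --pipeline=benchmark-only --target-hosts=221.228.105.140:9200 --track=pmc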
For Chinese, also pay attention to which tokens the query string is analyzed into, and therefore which tokens are actually queried. You can test this with term vectors, although term vectors are generally not enabled in production. So the choice of Chinese word-segmentation algorithm does affect search hits.
As for search ranking, you can first use the explain API to analyze each term's score, then consider ES's function score feature to adjust scoring based on particular fields (field_value_factor), or even use machine-learning models to optimize ranking (learning to rank).
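For instance, a function_score query with field_value_factor might look like the sketch below; note the popularity field is hypothetical and not part of this post's mapping:

GET user_v1/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "nick": "微信" } },
      "field_value_factor": {
        "field": "popularity",
        "modifier": "log1p",
        "factor": 1.2
      },
      "boost_mode": "multiply"
    }
  }
}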
Those are some thoughts on improving Elasticsearch query efficiency.