The Internals of Elasticsearch Analyzers

1 This article introduces the various kinds of Analyzer and their respective use cases.

Concepts involved:

  1. Character filter
  2. Tokenizer
  3. Token filter
  4. Analyzer
  5. Term query

An Analyzer generally consists of three parts: character filters, tokenizers, and token filters. Once you understand how an Analyzer works, you can configure one to fit your own use case.

Elasticsearch ships with 10 tokenizers, 31 token filters, 3 character filters, and a pile of configuration options. On top of that, plugins can be installed to extend the set. These are the raw materials for building an analyzer.

2 The building blocks of an Analyzer

Internally, an Analyzer is simply a pipeline:

  • Step 1: Character filtering (character filter)
  • Step 2: Tokenization (tokenizer)
  • Step 3: Token filtering (token filter)

Elasticsearch provides 8 built-in analyzers by default. If none of them meets our needs, we can build our own through the index Settings API:

PUT /my-index/_settings
{
    "index": {
        "analysis": {
            "analyzer": {
                "customHTMLSnowball": {
                    "type": "custom",
                    "char_filter": [
                        "html_strip"
                    ],
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "stop",
                        "snowball"
                    ]
                }
            }
        }
    }
}

The custom Analyzer above, named customHTMLSnowball, does the following:

  1. Strip HTML tags (html_strip character filter), e.g. <p> <a> <div>.

  2. Tokenize the text and drop punctuation (standard tokenizer).

  3. Lowercase every token (lowercase token filter).

  4. Remove stop words (stop token filter), e.g. "the", "they", "i", "a", "an", "and".

  5. Stem each token (snowball token filter; the Snowball algorithm is the most commonly used stemmer for English):

    cats -> cat

    catty -> cat

    stemmer -> stem

    stemming -> stem

    stemmed -> stem

The two <em>lazy</em> dogs, were slower than the less lazy <em>dog</em>

A picture is worth a thousand words; here is how customHTMLSnowball processes this piece of text.

[Flowchart: how customHTMLSnowball processes the text]
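
You can also verify this with the _analyze API against the index where the analyzer was defined. A quick check (assuming the my-index settings above have already been applied):

GET /my-index/_analyze
{
  "analyzer": "customHTMLSnowball",
  "text": "The two <em>lazy</em> dogs, were slower than the less lazy <em>dog</em>"
}

The response lists the tokens that come out of the pipeline: HTML stripped, everything lowercased, stop words dropped, and the remaining words stemmed (e.g. lazy -> lazi, dogs -> dog).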

3 How to choose the right Analyzer?

3.1 Which analyzer for long-form English text?

When the search scenario involves long passages of English text such as blog posts, news articles, or forum threads, it is best to use an analyzer that includes a stemming token filter.

Common stemming token filters include: stemmer, snowball, porter_stem.

Take the snowball token filter as an example: it reduces sing / sings / singing to the stem sing, and drops the stop words "they" and "are". Whether the user searches for sing, sings, or singing, the search runs against the term "sing", so the result set is the same.

GET http://localhost:9200/_analyze?text=I%20sing%20he%20sings%20they%20are%20singing&analyzer=snowball

// Output (abbreviated)
{
  "tokens": [
    {"token": "i",    "position": 1, ...},
    {"token": "sing", "position": 2, ...},
    {"token": "he",   "position": 3, ...},
    {"token": "sing", "position": 4, ...},
    {"token": "sing", "position": 7, ...}
  ]
}

Stemming is widely used in English search, but it has its limits:

  1. Stemming is of little (if any) use for Chinese.

  2. For technical terms and person names, stemming can actually make search results worse.

    e.g.: flying fish and fly fishing mean very different things, yet after snowball processing they share the same terms: fli fish.

    A user searching for fly-fishing information then gets "flying fish" results, which is far from ideal.

    For scenarios like this, exact matching is recommended: a simple analysis strategy (no stemming, just lowercase) plus a fuzzy query may be the better choice; see the sketch below.
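
As a minimal sketch of that idea (the index, field, and analyzer names below are made up for illustration): a custom analyzer that only lowercases, plus a fuzzy query to tolerate small spelling differences.

PUT /articles-demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_only": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "article": {
      "properties": {
        "title": { "type": "text", "analyzer": "lowercase_only" }
      }
    }
  }
}

GET /articles-demo/_search
{
  "query": {
    "fuzzy": {
      "title": { "value": "fishing", "fuzziness": 1 }
    }
  }
}

Because no stemming is applied, "flying" and "fly" remain distinct terms, and the fuzzy query only bridges small typos rather than collapsing different words onto one stem.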

3.2 Which analyzer for Chinese?

Tokenizing English is relatively easy: splitting on spaces and punctuation gets it mostly right. But Chinese has no spaces between words (and German occasionally glues words together into compounds), so the default standard analyzer no longer does the job.

> curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty=true' -d '耶稣登山宝训'
{
  "tokens" : [
    { "token" : "耶", "start_offset" : 0, "end_offset" : 1, "type" : "<IDEOGRAPHIC>", "position" : 1 },
    { "token" : "稣", "start_offset" : 1, "end_offset" : 2, "type" : "<IDEOGRAPHIC>", "position" : 2 },
    { "token" : "登", "start_offset" : 2, "end_offset" : 3, "type" : "<IDEOGRAPHIC>", "position" : 3 },
    { "token" : "山", "start_offset" : 3, "end_offset" : 4, "type" : "<IDEOGRAPHIC>", "position" : 4 },
    { "token" : "宝", "start_offset" : 4, "end_offset" : 5, "type" : "<IDEOGRAPHIC>", "position" : 5 },
    { "token" : "训", "start_offset" : 5, "end_offset" : 6, "type" : "<IDEOGRAPHIC>", "position" : 6 }
  ]
}

The standard analyzer breaks 「耶稣登山宝训」 into 6 individual characters, which is not very useful. A more reasonable segmentation would be ["耶稣", "登山宝训"].

At this point we need a plugin to handle Chinese word segmentation. mmseg is a fairly solid plugin for Chinese; once installed, it provides an mmseg analyzer that handles Chinese reasonably well.
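
Once the plugin is installed you can sanity-check it with _analyze. The analyzer name below is an assumption based on the plugin's README (it registers several analyzers, e.g. mmseg_maxword); check your installed version for the exact names:

POST _analyze?pretty
{
  "analyzer": "mmseg_maxword",
  "text": "耶稣登山宝训"
}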

3.3 Searching Tokens Exactly (exact search)

When searching usernames, product categories, or tags, we want exact matches. When indexing such fields it is best not to tokenize or stem at all; the analysis step can be skipped entirely.

You can set "index": "not_analyzed" in the field's mapping, so the raw text is stored directly as the term.
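
A minimal sketch with hypothetical index and field names. On ES 2.x the mapping uses "index": "not_analyzed"; on ES 5.x and later the string type is gone and the keyword field type is the equivalent:

PUT /users-v1
{
  "mappings": {
    "user": {
      "properties": {
        "username": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}

PUT /users-v2
{
  "mappings": {
    "user": {
      "properties": {
        "username": { "type": "keyword" }
      }
    }
  }
}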

 

4 Configuring the IK Chinese analyzer

First, test the IK analyzer's basic functionality:

POST _analyze?pretty
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国国歌"
}

Result:

{
    "tokens": [
        {
            "token": "中华人民共和国",
            "start_offset": 0,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "国歌",
            "start_offset": 7,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 1
        }
    ]
}

As you can see, ik_smart quite intelligently produces the correct segmentation of "中华人民共和国国歌".

Another example:

POST _analyze?pretty
{
  "analyzer": "ik_smart",
  "text": "王者荣耀是最好玩的游戏"
}

Result:

{
    "tokens": [
        {
            "token": "王者",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "荣耀",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "最",
            "start_offset": 5,
            "end_offset": 6,
            "type": "CN_CHAR",
            "position": 2
        },
        {
            "token": "好玩",
            "start_offset": 6,
            "end_offset": 8,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "游戏",
            "start_offset": 9,
            "end_offset": 11,
            "type": "CN_WORD",
            "position": 4
        }
    ]
}

If your result differs from this, that is expected: by default the IK dictionary splits "王者荣耀" into separate words. If we would rather keep it as a single word, we can configure a custom dictionary following the instructions on GitHub.

IKAnalyzer.cfg.xml is located in: elasticsearch-5.4.0/plugins/ik/config

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer 扩展配置</comment>
	<!-- Users can configure their own extension dictionaries here -->
	<entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
	<!-- Users can configure their own extension stop-word dictionaries here -->
	<entry key="ext_stopwords">custom/ext_stopword.dic</entry>
	<!-- Users can configure a remote extension dictionary here -->
	<!-- <entry key="remote_ext_dict">words_location</entry> -->
	<!-- Users can configure a remote extension stop-word dictionary here -->
	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

 

//TODO

Once the configuration is in place, you will get the result described above.

While we are at it, let's also test ik_max_word:

POST _analyze?pretty
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}
The result, just for reference:
{
  "tokens": [
    {
      "token": "中华人民共和国",
      "start_offset": 0,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "中华人民",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "中华",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "华人",
      "start_offset": 1,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "人民共和国",
      "start_offset": 2,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "人民",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "共和国",
      "start_offset": 4,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 6
    },
    {
      "token": "共和",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 7
    },
    {
      "token": "国",
      "start_offset": 6,
      "end_offset": 7,
      "type": "CN_CHAR",
      "position": 8
    },
    {
      "token": "国歌",
      "start_offset": 7,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 9
    }
  ]
}

Now let's look at an example from the GitHub README:

POST /index/fulltext/_mapping
{
  "fulltext": {
    "_all": {
      "analyzer": "ik_smart"
    },
    "properties": {
      "content": {
        "type": "text"
      }
    }
  }
}

Store some documents:

POST /index/fulltext/1
{
  "content": "美国留给伊拉克的是个烂摊子吗"
}

POST /index/fulltext/2
{
  "content": "公安部:各地校车将享最高路权"
}

POST /index/fulltext/3
{
  "content": "中韩渔警冲突调查:韩警平均天天扣1艘中国渔船"
}

POST /index/fulltext/4
{
  "content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
}

Query:

POST /index/fulltext/_search
{
  "query": {
    "match": {
      "content": "中国"
    }
  }
}

Result:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1.0869478,
    "hits": [
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "4",
        "_score": 1.0869478,
        "_source": {
          "content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
        }
      },
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "3",
        "_score": 0.61094594,
        "_source": {
          "content": "中韩渔警冲突调查:韩警平均天天扣1艘中国渔船"
        }
      },
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "1",
        "_score": 0.27179778,
        "_source": {
          "content": "美国留给伊拉克的是个烂摊子吗"
        }
      }
    ]
  }
}

ES indexes documents by their tokens, and then returns results ranked by score according to your query conditions.

The official repo has a full example worth studying: https://github.com/medcl/elasticsearch-analysis-ik

Here is another interesting example:

PUT /index1
{
  "settings": {
     "refresh_interval": "5s",
     "number_of_shards" :   1, 
     "number_of_replicas" : 0 
  },
  "mappings": {
    "_default_":{
      "_all": { "enabled":  false } 
    },
    "resource": {
      "dynamic": false, 
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "cn": {
              "type": "text",
              "analyzer": "ik_smart"
            },
            "en": {
              "type": "text",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}

Multi-fields (fields) serve two purposes:

1. The same string can be mapped as a text field for full-text search and as a keyword field for sorting and aggregations (see the sketch below).
2. They effectively give the field extra names, each analyzed with a different analyzer.
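
For point 1, a minimal sketch (the index and field names are made up for illustration): the same string is indexed as text for full-text search and as a keyword sub-field for sorting and aggregations.

PUT /products-demo
{
  "mappings": {
    "item": {
      "properties": {
        "name": {
          "type": "text",
          "fields": {
            "raw": { "type": "keyword" }
          }
        }
      }
    }
  }
}

GET /products-demo/_search
{
  "query": { "match": { "name": "新电影" } },
  "sort": [ { "name.raw": "asc" } ],
  "aggs": {
    "by_name": { "terms": { "field": "name.raw" } }
  }
}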

Bulk-insert some documents:

POST /_bulk
{ "create": { "_index": "index1", "_type": "resource", "_id": 1 } }
{ "title": "周星驰最新电影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 2 } }
{ "title": "周星驰最好看的新电影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 3 } }
{ "title": "周星驰最新电影,最好,新电影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 4 } }
{ "title": "最最最最好的新新新新电影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 5 } }
{ "title": "I'm not happy about the foxes" }

Query:

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "fox",
      "fields": "title"
    }
  }
}

Result:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

The reason: we searched for fox in title, but title uses the standard analyzer, so what got indexed is foxes and nothing matches. The following query, however, does return results:

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "fox",
      "fields": "title.en"
    }
  }
}

The result is omitted here; it matches because title.en uses the english analyzer.
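
You can see why by running the word through both analyzers (these are built-in, so the request works on any cluster):

POST _analyze?pretty
{
  "analyzer": "standard",
  "text": "foxes"
}
// token: "foxes"

POST _analyze?pretty
{
  "analyzer": "english",
  "text": "foxes"
}
// token: "fox" (stemmed, so the query term "fox" matches)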

Compare the output of the following queries to get a feel for how multi-fields are used:

GET /index1/resource/_search
{
  "query": {
    "match": {
      "title.cn": "the最好游戏"
    }
  }
}

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "the最新游戏",
      "fields": [ "title", "title.cn", "title.en" ]
    }
  }
}

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "the最新",
      "fields": "title.cn"
    }
  }
}

Study the results to get a feel for the behavior.

Now let's test with "王者荣耀". Here you can see that the HotWords.php configured earlier is a double-edged sword: once "王者荣耀" is added there, it is treated as a single term and is no longer split into "王者" and "荣耀". But what if someone just wants to search for "王者"? This is where multi-fields (fields) show their strength; see below.

First, store some data:

POST /_bulk
{ "create": { "_index": "index1", "_type": "resource", "_id": 6 } }
{ "title": "王者荣耀最好玩的游戏" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 7 } }
{ "title": "王者荣耀最好玩的新游戏" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 8 } }
{ "title": "王者荣耀最新游戏,最好玩,新游戏" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 9 } }
{ "title": "最最最最好的新新新新游戏" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 10 } }
{ "title": "I'm not happy about the foxes" }
POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "王者荣耀",
      "fields": "title.cn"
    }
  }
}

# The following query returns no results
POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "王者",
      "fields": "title.cn"
    }
  }
}

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "王者",
      "fields": "title"
    }
  }
}

Comparing the results makes the difference obvious; output omitted!

So understand the business requirements well from the start; only then can you design a good mapping, which saves a lot of trouble at search time.

 

 

Commands for inspecting analysis: after configuring ES, test the analysis to make sure it produces the expected tokens.

With curl:

1. Analyze text with a custom analyzer (ansj_index_synonym is the name of the custom analyzer; pretty pretty-prints the JSON output):

curl -XGET 'http://localhost:8200/zh/_analyze?analyzer=ansj_index_synonym&pretty' -d '童装童鞋'

2. Analyze text with a specific tokenizer and token filters:

curl -XGET 'http://localhost:8200/zh/_analyze?tokenizer=ansj_index&filters=synonym&pretty' -d '童装童鞋'

3. Analyze text using a field's configured analyzer:

curl -XGET 'http://localhost:8200/zh/_analyze?field=brand_name&pretty' -d '童装童鞋'

"brand_name" is the field name; if the field is of nested or object type, you can also write "brand_name.name".

 

Besides defining your own analyzers, ES also has built-in analyzers such as:

standard 
simple 
whitespace 
stop 
keyword 
pattern 
language
snowball 
custom

Details: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html

(The docs are in English, so decent English helps.)
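
A quick way to get a feel for the differences is to run the same text through two of them. The standard analyzer lowercases and splits on word boundaries, while whitespace only splits on spaces:

POST _analyze?pretty
{
  "analyzer": "standard",
  "text": "Quick Brown-Foxes!"
}
// tokens: "quick", "brown", "foxes"

POST _analyze?pretty
{
  "analyzer": "whitespace",
  "text": "Quick Brown-Foxes!"
}
// tokens: "Quick", "Brown-Foxes!"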

 

ES also ships with built-in tokenizers and token filters:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-tokenizers.html

Tokenizers:

standard
edge_ngram
keyword
letter
lowercase
ngram
whitespace
pattern
uax_url_email
path_hierarchy

Token filters:

asciifolding
length
lowercase
uppercase
nGram
edge_ngram
porter_stem
shingle
stop
word_delimiter
stemmer
stemmer_override
keyword_marker
keyword_repeat
kstem
snowball
phonetic
synonym
reverse
elision
truncate
unique
pattern_capture
pattern_replace
trim
limit
hunspell
common_grams
normalization
delimited_payload
keep_words

 

 

References:

https://github.com/medcl/elasticsearch-analysis-ik

http://keenwon.com/1404.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html#_example_output
