The Internals of Elasticsearch Analyzers

1 This article introduces the various kinds of Analyzer and their respective use cases.

Concepts involved:

  1. Character filter
  2. Tokenizer
  3. Token filter
  4. Analyzer
  5. Term query

An Analyzer generally consists of three parts: character filters, tokenizers, and token filters. Once you understand how an Analyzer works, you can configure one to fit your own use case.

Elasticsearch ships with 10 tokenizers, 31 token filters, 3 character filters, and a pile of configuration options. On top of that, plugins can be installed to extend the set. These are the raw materials for building an analyzer.

2 The building blocks of an Analyzer

Internally, an Analyzer is simply a pipeline:

  • Step 1: Character filtering (character filter)
  • Step 2: Tokenization (tokenizer)
  • Step 3: Token filtering (token filter)

Elasticsearch provides 8 built-in analyzers by default. If none of them meets our needs, we can build our own through the index Settings API:

PUT /my-index/_settings
{
    "index": {
        "analysis": {
            "analyzer": {
                "customHTMLSnowball": {
                    "type": "custom",
                    "char_filter": [
                        "html_strip"
                    ],
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "stop",
                        "snowball"
                    ]
                }
            }
        }
    }
}

The custom Analyzer above, named customHTMLSnowball, does the following:

  1. Strip HTML tags (html_strip character filter), e.g. <p> <a> <div>.

  2. Tokenize the text and drop punctuation (standard tokenizer).

  3. Lowercase every token (lowercase token filter).

  4. Remove stop words (stop token filter), e.g. "the", "they", "i", "a", "an", "and".

  5. Stem each token (snowball token filter; the Snowball algorithm is the most commonly used stemmer for English):

    cats -> cat

    catty -> cat

    stemmer -> stem

    stemming -> stem

    stemmed -> stem

The two <em>lazy</em> dogs, were slower than the less lazy <em>dog</em>

A picture is worth a thousand words; here is how customHTMLSnowball processes this piece of text.

[Flowchart: how customHTMLSnowball processes the text]
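
You can also verify this with the _analyze API against the index where the analyzer was defined. A quick check (assuming the my-index settings above have already been applied):

GET /my-index/_analyze
{
  "analyzer": "customHTMLSnowball",
  "text": "The two <em>lazy</em> dogs, were slower than the less lazy <em>dog</em>"
}

The response lists the tokens that come out of the pipeline: HTML stripped, everything lowercased, stop words dropped, and the remaining words stemmed (e.g. lazy -> lazi, dogs -> dog).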

3 How to choose the right Analyzer?

3.1 Which analyzer for long-form English text?

When the search scenario involves long passages of English text such as blog posts, news articles, or forum threads, it is best to use an analyzer that includes a stemming token filter.

Common stemming token filters include: stemmer, snowball, porter_stem.

Take the snowball token filter as an example: it reduces sing / sings / singing to the stem sing, and drops the stop words "they" and "are". Whether the user searches for sing, sings, or singing, the search runs against the term "sing", so the result set is the same.

GET http://localhost:9200/_analyze?text=I%20sing%20he%20sings%20they%20are%20singing&analyzer=snowball

// Output (abbreviated)
{
  "tokens": [
    {"token": "i",    "position": 1, ...},
    {"token": "sing", "position": 2, ...},
    {"token": "he",   "position": 3, ...},
    {"token": "sing", "position": 4, ...},
    {"token": "sing", "position": 7, ...}
  ]
}

Stemming is widely used in English search, but it has its limits:

  1. Stemming is of little (if any) use for Chinese.

  2. For technical terms and person names, stemming can actually make search results worse.

    e.g.: flying fish and fly fishing mean very different things, yet after snowball processing they share the same terms: fli fish.

    A user searching for fly-fishing information then gets "flying fish" results, which is far from ideal.

    For scenarios like this, exact matching is recommended: a simple analysis strategy (no stemming, just lowercase) plus a fuzzy query may be the better choice; see the sketch below.
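
As a minimal sketch of that idea (the index, field, and analyzer names below are made up for illustration): a custom analyzer that only lowercases, plus a fuzzy query to tolerate small spelling differences.

PUT /articles-demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_only": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "article": {
      "properties": {
        "title": { "type": "text", "analyzer": "lowercase_only" }
      }
    }
  }
}

GET /articles-demo/_search
{
  "query": {
    "fuzzy": {
      "title": { "value": "fishing", "fuzziness": 1 }
    }
  }
}

Because no stemming is applied, "flying" and "fly" remain distinct terms, and the fuzzy query only bridges small typos rather than collapsing different words onto one stem.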

3.2 Which analyzer for Chinese?

Tokenizing English is relatively easy: splitting on spaces and punctuation gets it mostly right. But Chinese has no spaces between words (and German occasionally glues words together into compounds), so the default standard analyzer no longer does the job.

> curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty=true' -d '耶稣登山宝训'
{
  "tokens" : [
    { "token" : "耶", "start_offset" : 0, "end_offset" : 1, "type" : "<IDEOGRAPHIC>", "position" : 1 },
    { "token" : "稣", "start_offset" : 1, "end_offset" : 2, "type" : "<IDEOGRAPHIC>", "position" : 2 },
    { "token" : "登", "start_offset" : 2, "end_offset" : 3, "type" : "<IDEOGRAPHIC>", "position" : 3 },
    { "token" : "山", "start_offset" : 3, "end_offset" : 4, "type" : "<IDEOGRAPHIC>", "position" : 4 },
    { "token" : "宝", "start_offset" : 4, "end_offset" : 5, "type" : "<IDEOGRAPHIC>", "position" : 5 },
    { "token" : "训", "start_offset" : 5, "end_offset" : 6, "type" : "<IDEOGRAPHIC>", "position" : 6 }
  ]
}

The standard analyzer breaks 「耶稣登山宝训」 into 6 individual characters, which is not very useful. A more reasonable segmentation would be ["耶稣", "登山宝训"].

At this point we need a plugin to handle Chinese word segmentation. mmseg is a fairly solid plugin for Chinese; once installed, it provides an mmseg analyzer that handles Chinese reasonably well.
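
Once the plugin is installed you can sanity-check it with _analyze. The analyzer name below is an assumption based on the plugin's README (it registers several analyzers, e.g. mmseg_maxword); check your installed version for the exact names:

POST _analyze?pretty
{
  "analyzer": "mmseg_maxword",
  "text": "耶稣登山宝训"
}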

3.3 Searching Tokens Exactly (exact search)

When searching usernames, product categories, or tags, we want exact matches. When indexing such fields it is best not to tokenize or stem at all; the analysis step can be skipped entirely.

You can set "index": "not_analyzed" in the field's mapping, so the raw text is stored directly as the term.
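
A minimal sketch with hypothetical index and field names. On ES 2.x the mapping uses "index": "not_analyzed"; on ES 5.x and later the string type is gone and the keyword field type is the equivalent:

PUT /users-v1
{
  "mappings": {
    "user": {
      "properties": {
        "username": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}

PUT /users-v2
{
  "mappings": {
    "user": {
      "properties": {
        "username": { "type": "keyword" }
      }
    }
  }
}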

 

4 Configuring the IK Chinese analyzer

First, test the IK analyzer's basic functionality:

POST _analyze?pretty
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国国歌"
}

Result:

{
    "tokens": [
        {
            "token": "中华人民共和国",
            "start_offset": 0,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "国歌",
            "start_offset": 7,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 1
        }
    ]
}

As you can see, ik_smart quite intelligently produces the correct segmentation of "中华人民共和国国歌".

Another example:

POST _analyze?pretty
{
  "analyzer": "ik_smart",
  "text": "王者荣耀是最好玩的游戏"
}

Result:

{
    "tokens": [
        {
            "token": "王者",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "荣耀",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "最",
            "start_offset": 5,
            "end_offset": 6,
            "type": "CN_CHAR",
            "position": 2
        },
        {
            "token": "好玩",
            "start_offset": 6,
            "end_offset": 8,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "游戏",
            "start_offset": 9,
            "end_offset": 11,
            "type": "CN_WORD",
            "position": 4
        }
    ]
}

If your result differs from this, that is expected: by default the IK dictionary splits "王者荣耀" into separate words. If we would rather keep it as a single word, we can configure a custom dictionary following the instructions on GitHub.

IKAnalyzer.cfg.xml is located in: elasticsearch-5.4.0/plugins/ik/config

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer 扩展配置</comment>
	<!-- Users can configure their own extension dictionaries here -->
	<entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
	<!-- Users can configure their own extension stop-word dictionaries here -->
	<entry key="ext_stopwords">custom/ext_stopword.dic</entry>
	<!-- Users can configure a remote extension dictionary here -->
	<!-- <entry key="remote_ext_dict">words_location</entry> -->
	<!-- Users can configure a remote extension stop-word dictionary here -->
	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

 

//TODO

Once the configuration is in place, you will get the result described above.

While we are at it, let's also test ik_max_word:

POST _analyze?pretty
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}
The result, just for reference:
{
  "tokens": [
    {
      "token": "中华人民共和国",
      "start_offset": 0,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "中华人民",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "中华",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "华人",
      "start_offset": 1,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "人民共和国",
      "start_offset": 2,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "人民",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "共和国",
      "start_offset": 4,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 6
    },
    {
      "token": "共和",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 7
    },
    {
      "token": "国",
      "start_offset": 6,
      "end_offset": 7,
      "type": "CN_CHAR",
      "position": 8
    },
    {
      "token": "国歌",
      "start_offset": 7,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 9
    }
  ]
}

Now let's look at an example from the GitHub README:

POST /index/fulltext/_mapping
{
  "fulltext": {
    "_all": {
      "analyzer": "ik_smart"
    },
    "properties": {
      "content": {
        "type": "text"
      }
    }
  }
}

Store some documents:

POST /index/fulltext/1
{
  "content": "美国留给伊拉克的是个烂摊子吗"
}

POST /index/fulltext/2
{
  "content": "公安部:各地校车将享最高路权"
}

POST /index/fulltext/3
{
  "content": "中韩渔警冲突调查:韩警平均天天扣1艘中国渔船"
}

POST /index/fulltext/4
{
  "content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
}

Query:

POST /index/fulltext/_search
{
  "query": {
    "match": {
      "content": "中国"
    }
  }
}

Result:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1.0869478,
    "hits": [
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "4",
        "_score": 1.0869478,
        "_source": {
          "content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
        }
      },
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "3",
        "_score": 0.61094594,
        "_source": {
          "content": "中韩渔警冲突调查:韩警平均天天扣1艘中国渔船"
        }
      },
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "1",
        "_score": 0.27179778,
        "_source": {
          "content": "美国留给伊拉克的是个烂摊子吗"
        }
      }
    ]
  }
}

ES indexes documents by their tokens, and then returns results ranked by score according to your query conditions.

The official repo has a full example worth studying: https://github.com/medcl/elasticsearch-analysis-ik

Here is another interesting example:

PUT /index1
{
  "settings": {
     "refresh_interval": "5s",
     "number_of_shards" :   1, 
     "number_of_replicas" : 0 
  },
  "mappings": {
    "_default_":{
      "_all": { "enabled":  false } 
    },
    "resource": {
      "dynamic": false, 
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "cn": {
              "type": "text",
              "analyzer": "ik_smart"
            },
            "en": {
              "type": "text",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}

Multi-fields (fields) serve two purposes:

1. The same string can be mapped as a text field for full-text search and as a keyword field for sorting and aggregations (see the sketch below).
2. They effectively give the field extra names, each analyzed with a different analyzer.
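
For point 1, a minimal sketch (the index and field names are made up for illustration): the same string is indexed as text for full-text search and as a keyword sub-field for sorting and aggregations.

PUT /products-demo
{
  "mappings": {
    "item": {
      "properties": {
        "name": {
          "type": "text",
          "fields": {
            "raw": { "type": "keyword" }
          }
        }
      }
    }
  }
}

GET /products-demo/_search
{
  "query": { "match": { "name": "新电影" } },
  "sort": [ { "name.raw": "asc" } ],
  "aggs": {
    "by_name": { "terms": { "field": "name.raw" } }
  }
}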

Bulk-insert some documents:

POST /_bulk
{ "create": { "_index": "index1", "_type": "resource", "_id": 1 } }
{ "title": "周星驰最新电影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 2 } }
{ "title": "周星驰最好看的新电影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 3 } }
{ "title": "周星驰最新电影,最好,新电影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 4 } }
{ "title": "最最最最好的新新新新电影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 5 } }
{ "title": "I'm not happy about the foxes" }

Query:

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "fox",
      "fields": "title"
    }
  }
}

Result:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

The reason: we searched for fox in title, but title uses the standard analyzer, so what got indexed is foxes and nothing matches. The following query, however, does return results:

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "fox",
      "fields": "title.en"
    }
  }
}

The result is omitted here; it matches because title.en uses the english analyzer.
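
You can see why by running the word through both analyzers (these are built-in, so the request works on any cluster):

POST _analyze?pretty
{
  "analyzer": "standard",
  "text": "foxes"
}
// token: "foxes"

POST _analyze?pretty
{
  "analyzer": "english",
  "text": "foxes"
}
// token: "fox" (stemmed, so the query term "fox" matches)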

Compare the output of the following queries to get a feel for how multi-fields are used:

GET /index1/resource/_search
{
  "query": {
    "match": {
      "title.cn": "the最好游戏"
    }
  }
}

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "the最新游戏",
      "fields": [ "title", "title.cn", "title.en" ]
    }
  }
}

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "the最新",
      "fields": "title.cn"
    }
  }
}

Study the results to get a feel for the behavior.

Now let's test with "王者荣耀". Here you can see that the HotWords.php configured earlier is a double-edged sword: once "王者荣耀" is added there, it is treated as a single term and is no longer split into "王者" and "荣耀". But what if someone just wants to search for "王者"? This is where multi-fields (fields) show their strength; see below.

First, store some data:

POST /_bulk
{ "create": { "_index": "index1", "_type": "resource", "_id": 6 } }
{ "title": "王者荣耀最好玩的游戏" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 7 } }
{ "title": "王者荣耀最好玩的新游戏" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 8 } }
{ "title": "王者荣耀最新游戏,最好玩,新游戏" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 9 } }
{ "title": "最最最最好的新新新新游戏" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 10 } }
{ "title": "I'm not happy about the foxes" }
POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "王者荣耀",
      "fields": "title.cn"
    }
  }
}

# The following query returns no results
POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "王者",
      "fields": "title.cn"
    }
  }
}

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "王者",
      "fields": "title"
    }
  }
}

Comparing the results makes the difference obvious; output omitted!

So understand the business requirements well from the start; only then can you design a good mapping, which saves a lot of trouble at search time.

 

 

Commands for inspecting analysis: after configuring ES, test the analysis to make sure it produces the expected tokens.

With curl:

1. Analyze text with a custom analyzer (ansj_index_synonym is the name of the custom analyzer; pretty pretty-prints the JSON output):

curl -XGET 'http://localhost:8200/zh/_analyze?analyzer=ansj_index_synonym&pretty' -d '童装童鞋'

2. Analyze text with a specific tokenizer and token filters:

curl -XGET 'http://localhost:8200/zh/_analyze?tokenizer=ansj_index&filters=synonym&pretty' -d '童装童鞋'

3. Analyze text using a field's configured analyzer:

curl -XGET 'http://localhost:8200/zh/_analyze?field=brand_name&pretty' -d '童装童鞋'

"brand_name" is the field name; if the field is of nested or object type, you can also write "brand_name.name".

 

Besides defining your own analyzers, ES also has built-in analyzers such as:

standard 
simple 
whitespace 
stop 
keyword 
pattern 
language
snowball 
custom

Details: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html

(The docs are in English, so decent English helps.)
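
A quick way to get a feel for the differences is to run the same text through two of them. The standard analyzer lowercases and splits on word boundaries, while whitespace only splits on spaces:

POST _analyze?pretty
{
  "analyzer": "standard",
  "text": "Quick Brown-Foxes!"
}
// tokens: "quick", "brown", "foxes"

POST _analyze?pretty
{
  "analyzer": "whitespace",
  "text": "Quick Brown-Foxes!"
}
// tokens: "Quick", "Brown-Foxes!"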

 

ES also ships with built-in tokenizers and token filters:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-tokenizers.html

Tokenizers:

standard
edge_ngram
keyword
letter
lowercase
ngram
whitespace
pattern
uax_url_email
path_hierarchy

Token filters:

asciifolding
length
lowercase
uppercase
nGram
edge_ngram
porter_stem
shingle
stop
word_delimiter
stemmer
stemmer_override
keyword_marker
keyword_repeat
kstem
snowball
phonetic
synonym
reverse
elision
truncate
unique
pattern_capture
pattern_replace
trim
limit
hunspell
common_grams
normalization
delimited_payload
keep_words

 

 

References:

https://github.com/medcl/elasticsearch-analysis-ik

http://keenwon.com/1404.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html#_example_output
