Concepts involved
An analyzer usually consists of three parts: character filters, tokenizers, and token filters. Once you understand how analyzers work, you can configure one to fit your own use case.
Elasticsearch ships with 10 tokenizers, 31 token filters, 3 character filters, and a pile of configuration options. On top of that, you can install plugins for extra functionality. These are the raw materials for building an analyzer.
Internally, an analyzer is simply a pipeline.
Elasticsearch comes with 8 analyzers built in. If none of them meets our needs, we can build our own through the settings API.
PUT /my-index/_settings
{
  "index": {
    "analysis": {
      "analyzer": {
        "customHTMLSnowball": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "snowball"]
        }
      }
    }
  }
}
The custom analyzer above is named customHTMLSnowball. Its parts mean the following:
Strip HTML tags (html_strip character filter), e.g. <p> <a> <div>.
Split the text into tokens and drop punctuation (standard tokenizer).
Lowercase the tokens (lowercase token filter).
Remove stop words (stop token filter), e.g. "the", "they", "i", "a", "an", "and".
Extract word stems (snowball token filter; the Snowball algorithm is the most widely used English stemming algorithm), e.g.:
cats -> cat
catty -> cat
stemmer -> stem
stemming -> stem
stemmed -> stem
The two <em>lazy</em> dogs, were slower than the less lazy <em>dog</em>
A picture is worth a thousand words: here is how customHTMLSnowball processes this text.
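If you want to try it yourself, you can also run this text through the analyzer with the _analyze API. A rough sketch, assuming the settings above were applied to my-index (the expected tokens are paraphrased, not copied from real output):

GET /my-index/_analyze
{
  "analyzer": "customHTMLSnowball",
  "text": "The two <em>lazy</em> dogs, were slower than the less lazy <em>dog</em>"
}

// Roughly: the <em> tags are stripped, the text is tokenized and lowercased,
// stop words such as "the" are dropped, and the remaining tokens are stemmed,
// e.g. "dogs" -> "dog", "lazy" -> "lazi".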
When the search scenario is long English text, such as blog posts, news articles, or forum threads, it is best to use an analyzer that includes a stemming token filter.
Common stemming token filters include: stemmer, snowball, porter_stem.
Take the snowball token filter as an example: it reduces sing / sings / singing to the stem sing, and drops the stop words "they" and "are". Whether the user searches for sing, sings, or singing, the search runs against the term "sing", so the result set is the same.
GET http://localhost:9200/_analyze?text=I%20sing%20he%20sings%20they%20are%20singing&analyzer=snowball

// Output (abbreviated)
{
  "tokens": [
    {"token": "i",    "position": 1, ...},
    {"token": "sing", "position": 2, ...},
    {"token": "he",   "position": 3, ...},
    {"token": "sing", "position": 4, ...},
    {"token": "sing", "position": 7, ...}
  ]
}
Stemming is widely used in English search, but it has its limits:
Stemming does little (arguably nothing) for Chinese.
When searching for technical terms or people's names, stemming can actually make the results worse.
E.g. flying fish and fly fishing mean completely different things, yet after snowball they share the same terms: fli fish.
So a user searching for fly fishing gets flying fish results, which is far from ideal.
For such scenarios, exact search is recommended: a simple analysis strategy (no stemming, just lowercasing) plus a fuzzy query is probably the better choice.
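A minimal sketch of that combination (index and field names are hypothetical; the field is assumed to be mapped with a lowercase-only analyzer): small spelling variations are tolerated at query time with fuzziness instead of being collapsed at index time by stemming:

POST /my-index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "fly fishing",
        "fuzziness": "AUTO"
      }
    }
  }
}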
Tokenizing English is fairly simple: splitting on spaces and punctuation gets you most of the way there. Chinese, however, has no spaces between words, and German occasionally glues two words together, so the default standard analyzer is no longer good enough.
> curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty=true' -d '耶稣登山宝训'
{
  "tokens" : [
    { "token" : "耶", "start_offset" : 0, "end_offset" : 1, "type" : "<IDEOGRAPHIC>", "position" : 1 },
    { "token" : "稣", "start_offset" : 1, "end_offset" : 2, "type" : "<IDEOGRAPHIC>", "position" : 2 },
    { "token" : "登", "start_offset" : 2, "end_offset" : 3, "type" : "<IDEOGRAPHIC>", "position" : 3 },
    { "token" : "山", "start_offset" : 3, "end_offset" : 4, "type" : "<IDEOGRAPHIC>", "position" : 4 },
    { "token" : "宝", "start_offset" : 4, "end_offset" : 5, "type" : "<IDEOGRAPHIC>", "position" : 5 },
    { "token" : "训", "start_offset" : 5, "end_offset" : 6, "type" : "<IDEOGRAPHIC>", "position" : 6 }
  ]
}
The standard analyzer breaks 「耶稣登山宝训」 into six separate characters, which is not very useful. A more reasonable result would be ["耶稣", "登山宝训"].
At this point we need a plugin to handle Chinese word segmentation. mmseg is a fairly reliable plugin for Chinese; once installed, it provides the mmseg analyzer, which handles Chinese quite well.
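Analysis plugins are normally installed with the elasticsearch-plugin command. A sketch only: the release URL below is a placeholder and has to match your exact Elasticsearch version (the ik plugin used later in this article is installed the same way):

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-mmseg/releases/download/v5.4.0/elasticsearch-analysis-mmseg-5.4.0.zip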
When we search usernames, product categories, or tags, we want exact matches. When indexing, it's best not to tokenize or stem at all; the analyzer step can be skipped entirely.
You can set "index": "not_analyzed" in a field's mapping so that the original text becomes a single term as-is.
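"index": "not_analyzed" is the pre-5.x string-mapping syntax; on ES 5.x and later the equivalent is a keyword field. A minimal sketch of the modern form (index, type, and field names are hypothetical):

PUT /my-index
{
  "mappings": {
    "product": {
      "properties": {
        "tag": { "type": "keyword" }
      }
    }
  }
}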
First, test the basic behavior of the ik analyzer.
POST _analyze?pretty
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国国歌"
}
Result:
{
  "tokens": [
    { "token": "中华人民共和国", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0 },
    { "token": "国歌", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 1 }
  ]
}
As you can see, ik_smart segments "中华人民共和国国歌" correctly and quite intelligently.
Another example:
POST _analyze?pretty
{
  "analyzer": "ik_smart",
  "text": "王者荣耀是最好玩的游戏"
}
Result:
{
  "tokens": [
    { "token": "王者", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0 },
    { "token": "荣耀", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 1 },
    { "token": "最", "start_offset": 5, "end_offset": 6, "type": "CN_CHAR", "position": 2 },
    { "token": "好玩", "start_offset": 6, "end_offset": 8, "type": "CN_WORD", "position": 3 },
    { "token": "游戏", "start_offset": 9, "end_offset": 11, "type": "CN_WORD", "position": 4 }
  ]
}
If your result differs from mine, that's expected: the stock ik dictionary splits "王者荣耀" into separate words, but we don't want it split. Following the instructions on GitHub, this can be configured.
IKAnalyzer.cfg.xml lives in: elasticsearch-5.4.0/plugins/ik/config
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
  <comment>IK Analyzer extension configuration</comment>
  <!-- configure your own extension dictionaries here -->
  <entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
  <!-- configure your own extension stop-word dictionaries here -->
  <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
  <!-- configure a remote extension dictionary here -->
  <!-- <entry key="remote_ext_dict">words_location</entry> -->
  <!-- configure a remote extension stop-word dictionary here -->
  <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
//TODO
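As a sketch of what the extension dictionary might contain: custom/mydict.dic (referenced by the ext_dict entry above) is a plain UTF-8 text file with one word per line, so keeping "王者荣耀" whole just means adding a line like this:

王者荣耀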
Once that is configured, you can re-run the request above and get the result with "王者荣耀" kept as a single token.
While we're at it, let's also test ik_max_word:
POST _analyze?pretty
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}

The result is just for reference:

{
  "tokens": [
    { "token": "中华人民共和国", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0 },
    { "token": "中华人民", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 1 },
    { "token": "中华", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 2 },
    { "token": "华人", "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 3 },
    { "token": "人民共和国", "start_offset": 2, "end_offset": 7, "type": "CN_WORD", "position": 4 },
    { "token": "人民", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 5 },
    { "token": "共和国", "start_offset": 4, "end_offset": 7, "type": "CN_WORD", "position": 6 },
    { "token": "共和", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 7 },
    { "token": "国", "start_offset": 6, "end_offset": 7, "type": "CN_CHAR", "position": 8 },
    { "token": "国歌", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 9 }
  ]
}
Now let's look at an example from the GitHub page:
POST /index/fulltext/_mapping
{
  "fulltext": {
    "_all": {
      "analyzer": "ik_smart"
    },
    "properties": {
      "content": {
        "type": "text"
      }
    }
  }
}
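A commonly used alternative (not part of the example above, so treat it as an assumption) is to set the analyzers on the field itself instead of on _all, typically ik_max_word at index time and ik_smart at search time; a sketch:

POST /index/fulltext/_mapping
{
  "fulltext": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}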
Index some documents:
POST /index/fulltext/1
{ "content": "美国留给伊拉克的是个烂摊子吗" }

POST /index/fulltext/2
{ "content": "公安部:各地校车将享最高路权" }

POST /index/fulltext/3
{ "content": "中韩渔警冲突调查:韩警平均天天扣1艘中国渔船" }

POST /index/fulltext/4
{ "content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首" }
Query:
POST /index/fulltext/_search
{
  "query": {
    "match": {
      "content": "中国"
    }
  }
}
Result:
{
  "took": 3,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 3,
    "max_score": 1.0869478,
    "hits": [
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "4",
        "_score": 1.0869478,
        "_source": { "content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首" }
      },
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "3",
        "_score": 0.61094594,
        "_source": { "content": "中韩渔警冲突调查:韩警平均天天扣1艘中国渔船" }
      },
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "1",
        "_score": 0.27179778,
        "_source": { "content": "美国留给伊拉克的是个烂摊子吗" }
      }
    ]
  }
}
ES indexes the analyzed tokens and then returns matching documents ranked by score according to your query conditions.
The project page has an example worth studying: https://github.com/medcl/elasticsearch-analysis-ik
Here is another interesting example.
PUT /index1
{
  "settings": {
    "refresh_interval": "5s",
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "_default_": {
      "_all": {
        "enabled": false
      }
    },
    "resource": {
      "dynamic": false,
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "cn": {
              "type": "text",
              "analyzer": "ik_smart"
            },
            "en": {
              "type": "text",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}
Multi-fields (the fields parameter) serve two purposes:
1. For example, a string can be mapped as a text field for full-text search and as a keyword field for sorting and aggregations (see the sketch after this list);
2. They effectively give the same value several aliases, each indexed with a different analyzer.
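A minimal sketch of point 1 (index, type, and field names are hypothetical): the same string is analyzed for full-text search and also stored untouched for sorting and aggregations:

PUT /my-index
{
  "mappings": {
    "resource": {
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "raw": { "type": "keyword" }
          }
        }
      }
    }
  }
}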
Bulk-insert some documents:
POST /_bulk
{ "create": { "_index": "index1", "_type": "resource", "_id": 1 } }
{ "title": "周星驰最新电影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 2 } }
{ "title": "周星驰最好看的新电影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 3 } }
{ "title": "周星驰最新电影,最好,新电影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 4 } }
{ "title": "最最最最好的新新新新电影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 5 } }
{ "title": "I'm not happy about the foxes" }
Query:
POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "fox",
      "fields": "title"
    }
  }
}
Result:
{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "failed": 0 },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}
The reason: the query searches for fox in title, and title uses the standard analyzer, so the indexed term is foxes and there is no match. The following query does return results:
POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "fox",
      "fields": "title.en"
    }
  }
}
The result is omitted here; it matches because title.en uses the english analyzer, which stems foxes to fox.
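You can see the difference directly with _analyze; a rough sketch (output abbreviated to the relevant token):

GET /index1/_analyze
{ "analyzer": "standard", "text": "I'm not happy about the foxes" }
// last token: "foxes"

GET /index1/_analyze
{ "analyzer": "english", "text": "I'm not happy about the foxes" }
// last token: "fox" (stemmed; stop words such as "the" are also dropped)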
Compare the output of the queries below to get a feel for how multi-fields are used:
GET /index1/resource/_search
{
  "query": {
    "match": {
      "title.cn": "the最好游戏"
    }
  }
}

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "the最新游戏",
      "fields": [ "title", "title.cn", "title.en" ]
    }
  }
}

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "the最新",
      "fields": "title.cn"
    }
  }
}
Work through the results to get a feel for the behavior.
Next, let's test with "王者荣耀". Here the HotWords.php configured earlier turns out to be a double-edged sword: once "王者荣耀" is in it, the phrase is treated as a single unit and is no longer split into "王者" and "荣耀". But what if we still want to search for just "王者"? This is where multi-fields really shine; see below.
First index the data:
POST /_bulk
{ "create": { "_index": "index1", "_type": "resource", "_id": 6 } }
{ "title": "王者荣耀最好玩的游戏" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 7 } }
{ "title": "王者荣耀最好玩的新游戏" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 8 } }
{ "title": "王者荣耀最新游戏,最好玩,新游戏" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 9 } }
{ "title": "最最最最好的新新新新游戏" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 10 } }
{ "title": "I'm not happy about the foxes" }
POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "王者荣耀",
      "fields": "title.cn"
    }
  }
}

# The following query returns no results
POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "王者",
      "fields": "title.cn"
    }
  }
}

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "王者",
      "fields": "title"
    }
  }
}
Compare the results and the difference is obvious at a glance; the results themselves are omitted here.
So you need a solid understanding of the business requirements up front in order to design a good mapping, and that also makes searching much easier later on.
Commands for inspecting analysis: after ES is configured, you should test the analysis and check that the tokens come out as expected.
With curl:
1. Analyze with a custom analyzer (ansj_index_synonym is the name of the custom analyzer; pretty formats the JSON output):
curl -XGET 'http://localhost:8200/zh/_analyze?analyzer=ansj_index_synonym&pretty' -d '童装童鞋'
2. Analyze with a specific tokenizer and filters:
curl -XGET 'http://localhost:8200/zh/_analyze?tokenizer=ansj_index&filters=synonym&pretty' -d '童装童鞋'
3. Analyze using a specific field's configured analyzer:
curl -XGET 'http://localhost:8200/zh/_analyze?field=brand_name&pretty' -d '童装童鞋'
"brand_name" is the field name; for nested or object fields you can also write "brand_name.name".
Besides custom analyzers, ES also has built-in analyzers such as the following (a quick comparison of two of them follows the list):
standard
simple
whitespace
stop
keyword
pattern
language
snowball
custom
The official documentation for these is in English, so reasonably good English helps.
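As a quick illustration of how two of the built-in analyzers differ (host and index follow the earlier curl examples; output abbreviated):

curl -XGET 'http://localhost:8200/zh/_analyze?analyzer=standard&pretty' -d 'Quick Brown-Foxes!'
# tokens: quick, brown, foxes   (lowercased, split on punctuation and hyphens)

curl -XGET 'http://localhost:8200/zh/_analyze?analyzer=whitespace&pretty' -d 'Quick Brown-Foxes!'
# tokens: Quick, Brown-Foxes!   (split on whitespace only; case and punctuation preserved)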
ES also has built-in tokenizers and token filters; see http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-tokenizers.html
Tokenizers:
standard
edge_ngram
keyword
letter
lowercase
ngram
whitespace
pattern
uax_email_url
path_hierarchy
Token filters:
asciifolding
length
lowercase
uppercase
nGram
edge_ngram
porter_stem
shingle
stop
word_delimiter
stemmer
stemmer_override
keyword_marker
keyword_repeat
kstem
snowball
phonetic
synonym
reverse
elision
truncate
unique
pattern_capture
pattern_replace
trim
limit
hunspell
common_grams
normalization
delimited_payload
keep_words
References:
https://github.com/medcl/elasticsearch-analysis-ik
http://keenwon.com/1404.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html#_example_output