When Elasticsearch builds an inverted index over documents, it uses an analyzer to analyze the text and create the index entries. The algorithm that extracts tokens from the text is called the tokenizer; the algorithm that pre-processes the text before tokenization is the character filter; the algorithm that further processes the tokens is the token filter; the final output is terms. Taken together, this whole analysis pipeline is called the analyzer.
Its workflow looks like this:
- Character Filters: strip characters the document does not need (for example, HTML markup such as <br/>).
- Tokenizer: split the raw text into tokens (for example, splitting a sentence on whitespace).
- Token Filter: filter and post-process the tokens (for example, removing common English stop words such as a and the, or reducing plurals to their singular form).
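As a quick sketch of how these three stages compose, Elasticsearch's _analyze API (used throughout the rest of this article) lets you name each component explicitly. The html_strip character filter, standard tokenizer, and lowercase token filter below are all built-ins; the sample text is made up for illustration:

GET /_analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "standard",
  "filter": [ "lowercase" ],
  "text": "<p>The QUICK Brown-Foxes</p>"
}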
_analyze

To check whether Elasticsearch tokenizes text the way we expect, let's analyze the following sentence with the default analyzer. The result contains each token together with its start and end offsets, its type, and its position; for now we only need to pay attention to the tokens themselves.

GET /jindouwin_search_group/_analyze
{
  "text": "Her(5) a Black-cats"
}
Result:
"tokens": [ { "token": "her", "start_offset": 0, "end_offset": 3, "type": "<ALPHANUM>", "position": 0 }, { "token": "5", "start_offset": 4, "end_offset": 5, "type": "<NUM>", "position": 1 }, { "token": "a", "start_offset": 7, "end_offset": 8, "type": "<ALPHANUM>", "position": 2 }, { "token": "black", "start_offset": 9, "end_offset": 14, "type": "<ALPHANUM>", "position": 3 }, { "token": "cats", "start_offset": 15, "end_offset": 19, "type": "<ALPHANUM>", "position": 4 } ] }
From the result we can see that the analyzer first strips the useless symbols and splits the sentence into Her, 5, a, Black, and cats, and a token filter then lowercases the tokens.
Besides the standard analyzer, Elasticsearch also ships with built-in analyzers such as english, stop, and others. Let's see what the english analyzer does with the same sentence.
GET /jindouwin_search_group/_analyze
{
  "text": "Her(5) a Black-cats",
  "analyzer": "english"
}

Result:

{
  "tokens": [
    { "token": "her",   "start_offset": 0,  "end_offset": 3,  "type": "<ALPHANUM>", "position": 0 },
    { "token": "5",     "start_offset": 4,  "end_offset": 5,  "type": "<NUM>",      "position": 1 },
    { "token": "black", "start_offset": 9,  "end_offset": 14, "type": "<ALPHANUM>", "position": 3 },
    { "token": "cat",   "start_offset": 15, "end_offset": 19, "type": "<ALPHANUM>", "position": 4 }
  ]
}
The difference is obvious: the english analyzer removed the common stop word a and stemmed the plural cats down to cat.
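The other built-in analyzers mentioned above can be tried on the same sentence simply by swapping the analyzer name; for example, a quick sketch using the built-in stop analyzer (output omitted):

GET /jindouwin_search_group/_analyze
{
  "text": "Her(5) a Black-cats",
  "analyzer": "stop"
}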
Of course, part of what makes Elasticsearch powerful is that, beyond the built-in analyzers, we can define our own by assembling the three kinds of components: character filters, a tokenizer, and token filters. We can also use analyzers written by others; the best-known example is the ik Chinese word-segmentation plugin.
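As a rough sketch of how a plugin analyzer is used (this assumes the ik plugin is actually installed on the cluster), its analyzers are invoked just like the built-in ones:

GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}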
Beyond that, the character filters, tokenizer, and token filters themselves can all be customized. This article will not go through every built-in analyzer; the full list can be found in the official documentation.
Example from the official documentation:
As a demonstration, let's create a custom analyzer together. This analyzer will:

1. Strip out HTML with the html_strip character filter.
2. Replace & characters with " and ", using a custom mapping character filter.
3. Tokenize with the standard tokenizer.
4. Lowercase terms with the lowercase token filter.
5. Remove a custom list of stop words with a custom stop token filter.

First, the custom mapping character filter that replaces & with " and ":
"char_filter": { "&_to_and": { "type": "mapping", "mappings": [ "&=> and "] } }
"filter": { "my_stopwords": { "type": "stop", "stopwords": [ "the", "a" ] } }
The analyzer definition then combines the custom filters we just configured with pre-defined tokenizers and filters:
"analyzer": { "my_analyzer": { "type": "custom", "char_filter": [ "html_strip", "&_to_and" ], "tokenizer": "standard", "filter": [ "lowercase", "my_stopwords" ] } }
Putting it all together, the complete create-index request looks like this:
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": [ "&=> and " ]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": [ "the", "a" ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip", "&_to_and" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "my_stopwords" ]
        }
      }
    }
  }
}
After the index has been created, use the analyze API to test the new analyzer:
GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The quick & brown fox"
}
The abbreviated result below shows that our analyzer is working correctly:
{ "tokens": [ { "token": "quick", "start_offset": 4, "end_offset": 9, "type": "<ALPHANUM>", "position": 1 }, { "token": "and", "start_offset": 10, "end_offset": 11, "type": "<ALPHANUM>", "position": 2 }, { "token": "brown", "start_offset": 12, "end_offset": 17, "type": "<ALPHANUM>", "position": 3 }, { "token": "fox", "start_offset": 18, "end_offset": 21, "type": "<ALPHANUM>", "position": 4 } ] }
This analyzer isn't of much use until we tell Elasticsearch where to apply it. We can apply it to a string field in a mapping like this:
PUT /my_index/_mapping/my_type
{
  "properties": {
    "title": {
      "type": "string",
      "analyzer": "my_analyzer"
    }
  }
}
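Note that this mapping request follows the older (pre-5.x) syntax from the official guide, with a mapping type (my_type) and the string field type. On recent Elasticsearch versions (7.x and later) the equivalent would look roughly like this sketch, with no mapping type and text in place of string:

PUT /my_index/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}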