Elasticsearch ships with quite a few built-in token filters, most of which rarely come up in day-to-day work. While preparing for the Elastic Certified Engineer exam I had to get familiar with these less common filters. The official docs only cover some of them in passing, so I decided to turn my study notes into this post as a reference, and hopefully it will help anyone else with the same need.
First up is the length filter. The official description:
A token filter of type length that removes words that are too long or too short for the stream.
This filter removes tokens that are too long or too short for the stream. It has two configurable parameters: min, the minimum token length (default 0), and max, the maximum token length (default Integer.MAX_VALUE).
Let's start with a quick test of its effect:
GET _analyze
{
  "tokenizer" : "standard",
  "filter": [{"type": "length", "min": 1, "max": 3 }],
  "text" : "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone"
}
Output:
{ "tokens" : [ { "token" : "The", "start_offset" : 0, "end_offset" : 3, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "2", "start_offset" : 4, "end_offset" : 5, "type" : "<NUM>", "position" : 1 }, { "token" : "the", "start_offset" : 36, "end_offset" : 39, "type" : "<ALPHANUM>", "position" : 7 } ] }
As you can see, every word longer than 3 characters has been filtered out.
To apply the length filter to a specific index, you can follow this example:
PUT /length_example
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "default" : {
          "tokenizer" : "standard",
          "filter" : ["my_length"]
        }
      },
      "filter" : {
        "my_length" : { "type" : "length", "min" : 1, "max" : 3 }
      }
    }
  }
}

GET length_example/_analyze
{
  "analyzer": "default",
  "text" : "The 2 QUICK Brown-Foxes jumped over the lazy dog's bonet"
}
The meaning of the ngram filter can be understood by looking at the ngram tokenizer: the latter is equivalent to the keyword tokenizer combined with the ngram filter, and the effect is the same.
What it does is split the text using the N-gram algorithm. An N-gram is a contiguous character sequence of a given length, produced by sliding a window across the word.
That sounds rather abstract, so here is an example:
GET _analyze
{
  "tokenizer": "ngram",
  "text": "北京大学"
}

GET _analyze
{
  "tokenizer" : "keyword",
  "filter": [{"type": "ngram", "min_gram": 1, "max_gram": 2 }],
  "text" : "北京大学"
}
As you can see, the filter has two properties, min_gram and max_gram. By default the gap between max_gram and min_gram (the step size) can be at most 1; this limit can be changed through the index-level max_ngram_diff setting, as in the following example:
PUT /ngram_example
{
  "settings" : {
    "index": { "max_ngram_diff": 10 },
    "analysis" : {
      "analyzer" : {
        "default" : {
          "tokenizer" : "keyword",
          "filter" : ["my_ngram"]
        }
      },
      "filter" : {
        "my_ngram" : { "type" : "ngram", "min_gram" : 2, "max_gram" : 4 }
      }
    }
  }
}
Test it with the index's analyzer:
GET ngram_example/_analyze
{
  "analyzer": "default",
  "text" : "北京大学"
}
Output:
{ "tokens" : [ { "token" : "北京", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 0 }, { "token" : "北京大", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 0 }, { "token" : "北京大学", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 0 }, { "token" : "京大", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 0 }, { "token" : "京大学", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 0 }, { "token" : "大学", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 0 } ] }
By now you should have a basic idea of how the ngram filter is used, but you may be wondering: in what scenario would you actually use it? It is well suited to prefix and infix search, for example search suggestions: when the user has typed only part of a phrase, the search engine can show matches containing that fragment, implementing a suggest-as-you-type feature, as sketched below.
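To make the suggestion scenario concrete, here is a minimal sketch under a few assumptions of mine: the index ngram_suggest_example, the field title and the analyzer ngram_index_analyzer are all made-up names. The idea is to apply ngrams only at index time and search with the plain keyword analyzer, so whatever fragment the user has typed is matched against the stored ngrams.

PUT /ngram_suggest_example
{
  "settings": {
    "index": { "max_ngram_diff": 10 },
    "analysis": {
      "analyzer": {
        // index-time analyzer: whole value -> all 1..10 character fragments
        "ngram_index_analyzer": {
          "tokenizer": "keyword",
          "filter": ["my_ngram"]
        }
      },
      "filter": {
        "my_ngram": { "type": "ngram", "min_gram": 1, "max_gram": 10 }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ngram_index_analyzer",
        "search_analyzer": "keyword"
      }
    }
  }
}

PUT /ngram_suggest_example/_doc/1?refresh
{ "title": "北京大学" }

GET /ngram_suggest_example/_search
{
  "query": { "match": { "title": "京大" } }
}

Because 北京大学 is indexed as all of its fragments, even the infix input 京大 matches the document.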
The trim filter's purpose is obvious from its name: it removes leading and trailing whitespace. Here is an example:
GET _analyze
{
  "tokenizer" : "keyword",
  "filter": [{"type": "trim"}],
  "text" : " 北京大学"
}
Output:
{ "tokens" : [ { "token" : " 北京大学", "start_offset" : 0, "end_offset" : 5, "type" : "word", "position" : 0 } ] }
The truncate filter has a length property; it truncates the terms produced by the tokenizer so that no term exceeds length characters. Here is an example:
GET _analyze
{
  "tokenizer" : "keyword",
  "filter": [{"type": "truncate", "length": 3}],
  "text" : "北京大学"
}
Output:
{ "tokens" : [ { "token" : "北京大", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 0 } ] }
One more example:
GET _analyze
{
  "tokenizer" : "standard",
  "filter": [{"type": "truncate", "length": 3}],
  "text" : "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Output:
{ "tokens" : [ { "token" : "The", "start_offset" : 0, "end_offset" : 3, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "2", "start_offset" : 4, "end_offset" : 5, "type" : "<NUM>", "position" : 1 }, { "token" : "QUI", "start_offset" : 6, "end_offset" : 11, "type" : "<ALPHANUM>", "position" : 2 }, ...
This filter is handy when keyword values can get very long; truncating them helps avoid problems such as oversized terms or out-of-memory errors.
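As an illustration of that scenario, here is a minimal sketch (the index, analyzer and field names, and the 256-character cap, are assumptions of mine) that emulates a keyword-style field but caps every term with truncate:

PUT /truncate_keyword_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        // behaves like the keyword analyzer, but caps term length at 256
        "truncated_keyword": {
          "tokenizer": "keyword",
          "filter": ["my_truncate"]
        }
      },
      "filter": {
        "my_truncate": { "type": "truncate", "length": 256 }
      }
    }
  },
  "mappings": {
    "properties": {
      "raw_value": {
        "type": "text",
        "analyzer": "truncated_keyword"
      }
    }
  }
}

For true keyword fields, the built-in ignore_above mapping parameter is a related option, but note that it drops values longer than the limit entirely rather than truncating them.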
The unique token filter makes sure that identical tokens are emitted only once. Here is an example:
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["unique"],
  "text": "this is a test test test"
}
Output:
{ "tokens" : [ { "token" : "this", "start_offset" : 0, "end_offset" : 4, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "is", "start_offset" : 5, "end_offset" : 7, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "a", "start_offset" : 8, "end_offset" : 9, "type" : "<ALPHANUM>", "position" : 2 }, { "token" : "test", "start_offset" : 10, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 3 } ] }
Next is the synonym filter. Its typical use case is this: suppose a document contains the word 番茄 (tomato), and we want searches for 番茄, 西红柿 or 圣女果 to all find that document. Example:
PUT /synonym_example
{
  "settings": {
    "analysis" : {
      "analyzer" : {
        "synonym" : {
          "tokenizer" : "whitespace",
          "filter" : ["my_synonym"]
        }
      },
      "filter" : {
        "my_synonym" : {
          "type" : "synonym",
          "synonyms_path" : "analysis/synonym.txt"
        }
      }
    }
  }
}
We need to create a file analysis/synonym.txt under the config directory of the ES instance, with the following content:

番茄,西红柿,圣女果

Remember to restart the node (or reopen the index) so the synonym file is picked up.
Then test it:
GET /synonym_example/_analyze
{
  "analyzer": "synonym",
  "text": "番茄"
}
Output:
{ "tokens" : [ { "token" : "番茄", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 0 }, { "token" : "西红柿", "start_offset" : 0, "end_offset" : 2, "type" : "SYNONYM", "position" : 0 }, { "token" : "圣女果", "start_offset" : 0, "end_offset" : 2, "type" : "SYNONYM", "position" : 0 } ] }
We know that an analyzer can contain multiple filters; how is that done? Take a look at the following example:
GET _analyze
{
  "tokenizer" : "standard",
  "filter": [
    {"type": "length", "min": 1, "max": 4 },
    {"type": "truncate", "length": 3}
  ],
  "text" : "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
In this example we combine the length filter and the truncate filter. The text is first tokenized with the standard tokenizer, then terms longer than 4 characters are dropped, and finally the remaining terms are truncated to 3 characters. The output is:
{ "tokens" : [ { "token" : "The", "start_offset" : 0, "end_offset" : 3, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "2", "start_offset" : 4, "end_offset" : 5, "type" : "<NUM>", "position" : 1 }, { "token" : "ove", "start_offset" : 31, "end_offset" : 35, "type" : "<ALPHANUM>", "position" : 6 }, { "token" : "the", "start_offset" : 36, "end_offset" : 39, "type" : "<ALPHANUM>", "position" : 7 }, { "token" : "laz", "start_offset" : 40, "end_offset" : 44, "type" : "<ALPHANUM>", "position" : 8 }, { "token" : "bon", "start_offset" : 51, "end_offset" : 55, "type" : "<ALPHANUM>", "position" : 10 } ] }
To use the same combination in an index, refer to this example:
PUT /length_truncate_example
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "default" : {
          "tokenizer" : "standard",
          "filter" : ["my_length", "my_truncate"]
        }
      },
      "filter" : {
        "my_length" : { "type" : "length", "min" : 1, "max" : 4 },
        "my_truncate" : { "type" : "truncate", "length" : 3 }
      }
    }
  }
}

GET length_truncate_example/_analyze
{
  "analyzer": "default",
  "text" : "The 2 QUICK Brown-Foxes jumped over the lazy dog's bonet"
}