1. An Analyzer is usually composed of three parts:
character filters, tokenizers, token filters
2. Components of an Analyzer:
Internally, an Analyzer is a pipeline:
Step 1: Character filtering (character filter)
Step 2: Tokenization (tokenizer)
Step 3: Token filtering (token filter)
3. Analyzer pipeline:
(input)
-----String----->> (CharacterFilters)
-----String----->> (Tokenizer)
-----Tokens----->> (TokenFilters)
-----Tokens----->>
(output)
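The three-stage pipeline above can be sketched in Python. This is a minimal illustration only, not Elasticsearch's implementation; the function names and the regex-based filters are hypothetical stand-ins for the real character filter, tokenizer, and token filter:

```python
import re

def html_strip(text):
    # character filter: String -> String, remove HTML tags from the raw input
    return re.sub(r"<[^>]+>", " ", text)

def standard_tokenize(text):
    # tokenizer: String -> Tokens, split on non-word characters
    return [t for t in re.split(r"\W+", text) if t]

def lowercase(tokens):
    # token filter: Tokens -> Tokens, normalize each token to lower case
    return [t.lower() for t in tokens]

def analyze(text):
    # (input) String -> char filters -> String -> tokenizer -> Tokens -> token filters -> Tokens (output)
    return lowercase(standard_tokenize(html_strip(text)))

print(analyze("<p>Hello World</p>"))  # ['hello', 'world']
```

Each stage has a fixed contract: character filters map string to string, the tokenizer maps string to tokens, and token filters map tokens to tokens, which is why they can be chained freely in the config.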
========================Example 1==========================
{
"index": {
"analysis": {
"analyzer": {
"customHTMLSnowball": {
"type": "custom",
"char_filter": ["html_strip"],
"tokenizer": "standard",
"filter": ["lowercase", "stop", "snowball"]
}
}
}
}
}
The custom Analyzer above is named customHTMLSnowball. It does the following:
Remove HTML tags (html_strip character filter), such as <p> <a> <div>.
Tokenize and strip punctuation (standard tokenizer).
Convert uppercase letters to lowercase (lowercase token filter).
Remove stop words (stop token filter), such as "the", "they", "i", "a", "an", "and".
Extract word stems (snowball token filter; the Snowball algorithm is one of the most commonly used stemming algorithms for English), e.g.:
cats -> cat
catty -> cat
stemmer -> stem
stemming -> stem
stemmed -> stem
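Once an index is created with this analyzer, its output can be inspected with Elasticsearch's `_analyze` API (e.g. `POST /<index>/_analyze`). A request-body sketch, with an arbitrary sample text:

```json
{
  "analyzer": "customHTMLSnowball",
  "text": "<p>The cats are stemming</p>"
}
```

Given the filter chain above, the response should contain roughly the tokens cat and stem: the HTML tags are stripped, "The" and "are" are dropped as stop words, and the remaining tokens are lowercased and stemmed.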
========================Example 1==========================
========================Example 2==========================
裸心 ES search, with pinyin search:
curl -XPUT "http://localhost:9200/yyyy" -H 'Content-Type: application/json' -d'
{
"settings": {
"analysis": {
"analyzer": {
"default": {
"type": "custom",
"tokenizer": "ik_smart",
"char_filter": [
"html_strip"
],
"filter": [
"pinyin_filter",
"lowercase",
"stop",
"ngram_1_20"
]
},
"default_search": {
"type": "custom",
"tokenizer": "ik_smart",
"char_filter": [
"html_strip"
]
}
},
"filter": {
"ngram_1_20": {
"type": "ngram",
"min_gram": 1,
"max_gram": 20,
"token_chars": [
"letter",
"digit"
]
},
"pinyin_filter": {
"type": "pinyin",
"keep_original": true,
"keep_joined_full_pinyin": true
}
}
}
}
}'
========================例子2==========================
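The ngram_1_20 filter in this config emits, for each token, every substring whose length is between min_gram (1) and max_gram (20), which is what makes partial-input matching work at index time. A minimal Python sketch approximating that behavior for a single token (an illustration of the idea, not the Lucene implementation, and without its token_chars splitting):

```python
def ngrams(token, min_gram=1, max_gram=20):
    # emit every substring of token whose length lies in [min_gram, max_gram],
    # approximating what the ngram token filter produces for one input token
    out = []
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            out.append(token[start:start + size])
    return out

print(ngrams("abc"))  # ['a', 'b', 'c', 'ab', 'bc', 'abc']
```

With min_gram=1 even single characters are indexed, which grows the index considerably; that is the trade-off the ngram_1_20 filter accepts in exchange for matching very short query fragments.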