1. The Default Analyzer
The standard analyzer consists of:
standard tokenizer: splits text on word boundaries
standard token filter: does nothing
lowercase token filter: converts all letters to lowercase
stop token filter (disabled by default): removes stop words such as a, the, it, and so on (see the quick check below)
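To see these defaults in action, the _analyze API can be called without any index-specific configuration. A minimal sketch (the sample sentence is only illustrative; the real response also includes offsets and positions, omitted here):

GET /_analyze
{
  "analyzer": "standard",
  "text": "The QUICK Brown Foxes jumped"
}

This should return the tokens the, quick, brown, foxes, jumped: the text is split on word boundaries and lowercased, and because the stop filter is disabled by default, the stop word "the" is kept.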
2. Modifying Analyzer Settings
Enable the English stop-word token filter:
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}

GET /my_index/_analyze
{
  "analyzer": "standard",
  "text": "a dog is in the house"
}

GET /my_index/_analyze
{
  "analyzer": "es_std",
  "text": "a dog is in the house"
}
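Comparing the two _analyze responses should make the difference visible: the standard analyzer keeps all six tokens (a, dog, is, in, the, house), while es_std applies the _english_ stop-word list and should return only dog and house.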
3. Customizing Your Own Analyzer
The custom analyzer below handles three requirements:
1. Converting the & character
2. Removing certain stop words
3. Converting letters to lowercase
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": ["&=> and"]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["the", "a"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "&_to_and"],
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stopwords"]
        }
      }
    }
  }
}

GET /my_index/_analyze
{
  "text": "tom&jerry are a friend in the house, <a>, HAHA!!",
  "analyzer": "my_analyzer"
}

PUT /my_index/_mapping/my_type
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
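In the _analyze request above, html_strip removes the <a> tag, the &_to_and character filter rewrites & as "and", the lowercase filter turns HAHA into haha, and my_stopwords drops "the" and "a". Once the mapping is in place, the analyzer can also be verified by field name instead of analyzer name; a small sketch (the content field and my_type come from the mapping above):

GET /my_index/_analyze
{
  "field": "content",
  "text": "tom&jerry are a friend in the house, <a>, HAHA!!"
}

This should produce the same token stream as the analyzer-by-name request, since the content field is mapped to my_analyzer.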