For Chinese tokenization in ES, the mainstream recommendation is ik: it is easy to use, the author keeps it actively updated, and it is arguably the best Chinese tokenizer in the Lucene ecosystem. But the text being indexed is often more complex than pure Chinese: it mixes in English, digits, and symbols. ik handles Chinese well, but falls short on letter-plus-digit combinations, and strings like model numbers (letters followed by digits) could hardly be more common in practice.
Here's an example:
curl -XGET 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d' { "tokenizer" : "ik_max_word", "text" : "m123-test detailed output 一丝不挂 青丝变白发" } '
The result:
{
  "tokens" : [
    { "token" : "m123-test", "start_offset" : 0, "end_offset" : 9, "type" : "LETTER", "position" : 0 },
    { "token" : "m", "start_offset" : 0, "end_offset" : 1, "type" : "ENGLISH", "position" : 1 },
    { "token" : "123", "start_offset" : 1, "end_offset" : 4, "type" : "ARABIC", "position" : 2 },
    { "token" : "test", "start_offset" : 5, "end_offset" : 9, "type" : "ENGLISH", "position" : 3 },
    { "token" : "detailed", "start_offset" : 10, "end_offset" : 18, "type" : "ENGLISH", "position" : 4 },
    { "token" : "output", "start_offset" : 19, "end_offset" : 25, "type" : "ENGLISH", "position" : 5 },
    { "token" : "一丝不挂", "start_offset" : 26, "end_offset" : 30, "type" : "CN_WORD", "position" : 6 },
    { "token" : "一丝", "start_offset" : 26, "end_offset" : 28, "type" : "CN_WORD", "position" : 7 },
    { "token" : "一", "start_offset" : 26, "end_offset" : 27, "type" : "TYPE_CNUM", "position" : 8 },
    { "token" : "丝", "start_offset" : 27, "end_offset" : 28, "type" : "CN_WORD", "position" : 9 },
    { "token" : "不挂", "start_offset" : 28, "end_offset" : 30, "type" : "CN_WORD", "position" : 10 },
    { "token" : "挂", "start_offset" : 29, "end_offset" : 30, "type" : "CN_WORD", "position" : 11 },
    { "token" : "青丝", "start_offset" : 31, "end_offset" : 33, "type" : "CN_WORD", "position" : 12 },
    { "token" : "丝", "start_offset" : 32, "end_offset" : 33, "type" : "CN_WORD", "position" : 13 },
    { "token" : "变白", "start_offset" : 33, "end_offset" : 35, "type" : "CN_WORD", "position" : 14 },
    { "token" : "白发", "start_offset" : 34, "end_offset" : 36, "type" : "CN_WORD", "position" : 15 },
    { "token" : "发", "start_offset" : 35, "end_offset" : 36, "type" : "CN_WORD", "position" : 16 }
  ]
}
Here the letter-plus-digit string m123 gets split into m and 123, so when you search for m123, what you are actually searching for is 123.
One of ES's built-in tokenizers can solve the letter-plus-digit problem. Taking standard as an example, running the same text through it:

curl -XGET 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d' { "tokenizer" : "standard", "text" : "m123-test detailed output 一丝不挂 青丝变白发" } '

gives:
{
  "tokens" : [
    { "token" : "m123", "start_offset" : 0, "end_offset" : 4, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "test", "start_offset" : 5, "end_offset" : 9, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "detailed", "start_offset" : 10, "end_offset" : 18, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "output", "start_offset" : 19, "end_offset" : 25, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "一", "start_offset" : 26, "end_offset" : 27, "type" : "<IDEOGRAPHIC>", "position" : 4 },
    { "token" : "丝", "start_offset" : 27, "end_offset" : 28, "type" : "<IDEOGRAPHIC>", "position" : 5 },
    { "token" : "不", "start_offset" : 28, "end_offset" : 29, "type" : "<IDEOGRAPHIC>", "position" : 6 },
    { "token" : "挂", "start_offset" : 29, "end_offset" : 30, "type" : "<IDEOGRAPHIC>", "position" : 7 },
    { "token" : "青", "start_offset" : 31, "end_offset" : 32, "type" : "<IDEOGRAPHIC>", "position" : 8 },
    { "token" : "丝", "start_offset" : 32, "end_offset" : 33, "type" : "<IDEOGRAPHIC>", "position" : 9 },
    { "token" : "变", "start_offset" : 33, "end_offset" : 34, "type" : "<IDEOGRAPHIC>", "position" : 10 },
    { "token" : "白", "start_offset" : 34, "end_offset" : 35, "type" : "<IDEOGRAPHIC>", "position" : 11 },
    { "token" : "发", "start_offset" : 35, "end_offset" : 36, "type" : "<IDEOGRAPHIC>", "position" : 12 }
  ]
}
Now m123 is searchable, but this introduces a new problem: the Chinese text has been split into single characters, so searching for 一 or 一挂 will both return results.
You can't have your cake and eat it too. To get both, you would either have to change how ik tokenizes, or add a separate extra field for the Chinese (or non-Chinese) part. I don't know how to do the former, so the latter it is.
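The extra-field approach can be sketched with Elasticsearch multi-fields: the main field is analyzed with ik, and a sub-field re-analyzes the same source text with standard, so each analyzer's strengths are available at query time. This is a minimal sketch assuming ES 7.x typeless mappings; the index name goods and field name title are made-up examples.

```json
PUT /goods
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "fields": {
          "std": {
            "type": "text",
            "analyzer": "standard"
          }
        }
      }
    }
  }
}
```

At query time, a multi_match over both "title" and "title.std" lets m123 match through the standard sub-field while Chinese phrases like 一丝不挂 still match as ik words, at the cost of storing the field's terms twice.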