For Chinese tokenization in ES, the mainstream recommendation is ik: it is easy to use, the author keeps it actively updated, and it is arguably the best Chinese tokenizer in the Lucene ecosystem. But the text being indexed is often more complex than pure Chinese: it mixes in English, digits, and symbols. ik handles Chinese well, but falls short on letter-plus-digit combinations, and strings like model numbers (letters followed by digits) could hardly be more common in practice.
Here's an example:
curl -XGET 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d' { "tokenizer" : "ik_max_word", "text" : "m123-test detailed output 一丝不挂 青丝变白发" } '
The result:
{
  "tokens" : [
    { "token" : "m123-test", "start_offset" : 0, "end_offset" : 9, "type" : "LETTER", "position" : 0 },
    { "token" : "m", "start_offset" : 0, "end_offset" : 1, "type" : "ENGLISH", "position" : 1 },
    { "token" : "123", "start_offset" : 1, "end_offset" : 4, "type" : "ARABIC", "position" : 2 },
    { "token" : "test", "start_offset" : 5, "end_offset" : 9, "type" : "ENGLISH", "position" : 3 },
    { "token" : "detailed", "start_offset" : 10, "end_offset" : 18, "type" : "ENGLISH", "position" : 4 },
    { "token" : "output", "start_offset" : 19, "end_offset" : 25, "type" : "ENGLISH", "position" : 5 },
    { "token" : "一丝不挂", "start_offset" : 26, "end_offset" : 30, "type" : "CN_WORD", "position" : 6 },
    { "token" : "一丝", "start_offset" : 26, "end_offset" : 28, "type" : "CN_WORD", "position" : 7 },
    { "token" : "一", "start_offset" : 26, "end_offset" : 27, "type" : "TYPE_CNUM", "position" : 8 },
    { "token" : "丝", "start_offset" : 27, "end_offset" : 28, "type" : "CN_WORD", "position" : 9 },
    { "token" : "不挂", "start_offset" : 28, "end_offset" : 30, "type" : "CN_WORD", "position" : 10 },
    { "token" : "挂", "start_offset" : 29, "end_offset" : 30, "type" : "CN_WORD", "position" : 11 },
    { "token" : "青丝", "start_offset" : 31, "end_offset" : 33, "type" : "CN_WORD", "position" : 12 },
    { "token" : "丝", "start_offset" : 32, "end_offset" : 33, "type" : "CN_WORD", "position" : 13 },
    { "token" : "变白", "start_offset" : 33, "end_offset" : 35, "type" : "CN_WORD", "position" : 14 },
    { "token" : "白发", "start_offset" : 34, "end_offset" : 36, "type" : "CN_WORD", "position" : 15 },
    { "token" : "发", "start_offset" : 35, "end_offset" : 36, "type" : "CN_WORD", "position" : 16 }
  ]
}
Here the letter-plus-digit string m123 gets split into m and 123, so when you search for m123, what you are actually searching for is 123.
One of ES's built-in tokenizers can solve the letter-plus-digit problem. Taking standard as an example, running the same text through it:

curl -XGET 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d' { "tokenizer" : "standard", "text" : "m123-test detailed output 一丝不挂 青丝变白发" } '

gives:
{
  "tokens" : [
    { "token" : "m123", "start_offset" : 0, "end_offset" : 4, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "test", "start_offset" : 5, "end_offset" : 9, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "detailed", "start_offset" : 10, "end_offset" : 18, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "output", "start_offset" : 19, "end_offset" : 25, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "一", "start_offset" : 26, "end_offset" : 27, "type" : "<IDEOGRAPHIC>", "position" : 4 },
    { "token" : "丝", "start_offset" : 27, "end_offset" : 28, "type" : "<IDEOGRAPHIC>", "position" : 5 },
    { "token" : "不", "start_offset" : 28, "end_offset" : 29, "type" : "<IDEOGRAPHIC>", "position" : 6 },
    { "token" : "挂", "start_offset" : 29, "end_offset" : 30, "type" : "<IDEOGRAPHIC>", "position" : 7 },
    { "token" : "青", "start_offset" : 31, "end_offset" : 32, "type" : "<IDEOGRAPHIC>", "position" : 8 },
    { "token" : "丝", "start_offset" : 32, "end_offset" : 33, "type" : "<IDEOGRAPHIC>", "position" : 9 },
    { "token" : "变", "start_offset" : 33, "end_offset" : 34, "type" : "<IDEOGRAPHIC>", "position" : 10 },
    { "token" : "白", "start_offset" : 34, "end_offset" : 35, "type" : "<IDEOGRAPHIC>", "position" : 11 },
    { "token" : "发", "start_offset" : 35, "end_offset" : 36, "type" : "<IDEOGRAPHIC>", "position" : 12 }
  ]
}
Now m123 is searchable, but this introduces a new problem: the Chinese text has been split into single characters, so searching for 一 or 一挂 will both return results.
You can't have your cake and eat it too. To get both, you would either have to change how ik tokenizes, or add a separate extra field for the Chinese (or non-Chinese) part. I don't know how to do the former, so the latter it is.
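The extra-field approach can be sketched with Elasticsearch multi-fields: the main field is analyzed with ik, and a sub-field re-analyzes the same source text with standard, so each analyzer's strengths are available at query time. This is a minimal sketch assuming ES 7.x typeless mappings; the index name goods and field name title are made-up examples.

```json
PUT /goods
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "fields": {
          "std": {
            "type": "text",
            "analyzer": "standard"
          }
        }
      }
    }
  }
}
```

At query time, a multi_match over both "title" and "title.std" lets m123 match through the standard sub-field while Chinese phrases like 一丝不挂 still match as ik words, at the cost of storing the field's terms twice.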