All About Elasticsearch Tokenization

Preparing for today's experiments

Delete the experimental index from last time

curl -XDELETE http://127.0.0.1:9200/synctest/article

output:
{"acknowledged":true}

Create a new mapping

curl -XPUT 'http://127.0.0.1:9200/servcie/_mapping/massage' -d '
{
  "massage": {
    "properties": {
      "location": { "type": "geo_point" },
      "name":     { "type": "string" },
      "age":      { "type": "integer" },
      "address":  { "type": "string" },
      "price":    { "type": "double", "index": "not_analyzed" },
      "is_open":  { "type": "boolean" }
    }
  }
}'

Check the newly created mapping

curl -XGET http://127.0.0.1:9200/servcie/massage/_mapping?pretty

{
  "servcie" : {
    "mappings" : {
      "massage" : {
        "properties" : {
          "address" : {
            "type" : "string"
          },
          "age" : {
            "type" : "integer"
          },
          "is_open" : {
            "type" : "boolean"
          },
          "location" : {
            "type" : "geo_point"
          },
          "name" : {
            "type" : "string"
          },
          "price" : {
            "type" : "double"
          }
        }
      }
    }
  }
}

Now, on to our tokenization tests

curl -XPOST 'http://127.0.0.1:9200/_analyze?pretty' -d '{"text":"波多菠萝蜜"}'

{
  "tokens" : [ {
    "token" : "波",
    "start_offset" : 0,
    "end_offset" : 1,
    "type" : "<IDEOGRAPHIC>",
    "position" : 0
  }, {
    "token" : "多",
    "start_offset" : 1,
    "end_offset" : 2,
    "type" : "<IDEOGRAPHIC>",
    "position" : 1
  }, {
    "token" : "菠",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "<IDEOGRAPHIC>",
    "position" : 2
  }, {
    "token" : "萝",
    "start_offset" : 3,
    "end_offset" : 4,
    "type" : "<IDEOGRAPHIC>",
    "position" : 3
  }, {
    "token" : "蜜",
    "start_offset" : 4,
    "end_offset" : 5,
    "type" : "<IDEOGRAPHIC>",
    "position" : 4
  } ]
}
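No analyzer was specified above, so Elasticsearch fell back to its default standard analyzer, which breaks CJK text into single ideographs. You can confirm that the default is standard by naming it explicitly; the output should be identical:

curl -XPOST 'http://127.0.0.1:9200/_analyze?pretty' -d '{"analyzer":"standard","text":"波多菠萝蜜"}'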

An analyzer is composed of a single tokenizer plus zero or more token filters. For comparison, here is the same call on some English text:

curl -XPOST 'http://127.0.0.1:9200/_analyze?pretty' -d '{"text":"abc dsf,sdsf"}'

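To see that composition in action, you can assemble an ad-hoc analyzer right in the _analyze call by naming the parts yourself. A minimal sketch (note: on the 2.x API used throughout this post the body key is "filters"; later versions renamed it to "filter"):

curl -XPOST 'http://127.0.0.1:9200/_analyze?pretty' -d '{"tokenizer":"standard","filters":["lowercase"],"text":"Abc DSF,sdsf"}'

The standard tokenizer splits on whitespace and punctuation, and the lowercase token filter then normalizes each token, so this should yield abc, dsf and sdsf.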

Chinese search

To search Chinese text you also need a Chinese word segmenter, and the one you will run into most often is probably the IK analyzer.

Install the IK analyzer

./bin/plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v1.9.3/elasticsearch-analysis-ik-1.9.3.zip

After restarting, list the plugins to check that it loaded:

curl -XGET http://localhost:9200/_cat/plugins

Marrow analysis-ik 1.9.3 j  

Test analysis with the ik analyzer

curl -XPOST 'http://127.0.0.1:9200/_analyze?pretty' -d '{"analyzer":"ik","text":"波多菠萝蜜"}'

{
  "tokens" : [ {
    "token" : "波",
    "start_offset" : 0,
    "end_offset" : 1,
    "type" : "CN_WORD",
    "position" : 0
  }, {
    "token" : "多",
    "start_offset" : 1,
    "end_offset" : 2,
    "type" : "CN_CHAR",
    "position" : 1
  }, {
    "token" : "菠萝蜜",
    "start_offset" : 2,
    "end_offset" : 5,
    "type" : "CN_WORD",
    "position" : 2
  }, {
    "token" : "菠萝",
    "start_offset" : 2,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 3
  }, {
    "token" : "菠",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "CN_WORD",
    "position" : 4
  }, {
    "token" : "萝",
    "start_offset" : 3,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 5
  }, {
    "token" : "蜜",
    "start_offset" : 4,
    "end_offset" : 5,
    "type" : "CN_WORD",
    "position" : 6
  } ]
}

You can see that IK now produces 菠萝 and 菠萝蜜 as tokens.
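A side note: the ik analyzer used above behaves like ik_max_word, emitting every word it can recognize, which is why the tokens overlap. The plugin also registers an ik_smart analyzer that produces a coarser, non-overlapping segmentation; it is worth comparing the two on the same text:

curl -XPOST 'http://127.0.0.1:9200/_analyze?pretty' -d '{"analyzer":"ik_smart","text":"波多菠萝蜜"}'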

As the language evolves and new business jargon appears, some new words are simply not in our IK dictionary, so even a match_phrase query can fail to find the data. What do we do then?

For example, suppose we want the word 吊炸天 to be searchable (IK 1.9.3 does not include it):

curl -XPOST 'http://127.0.0.1:9200/_analyze?pretty' -d '{"analyzer":"ik","text":"吊炸天天不容"}'

{
  "tokens" : [ {
    "token" : "吊",
    "start_offset" : 0,
    "end_offset" : 1,
    "type" : "CN_WORD",
    "position" : 0
  }, {
    "token" : "炸",
    "start_offset" : 1,
    "end_offset" : 2,
    "type" : "CN_CHAR",
    "position" : 1
  }, {
    "token" : "每天",
    "start_offset" : 2,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 2
  }, {
    "token" : "不容",
    "start_offset" : 4,
    "end_offset" : 6,
    "type" : "CN_WORD",
    "position" : 3
  } ]
}

If we really must support it, it is time to modify IK's dictionary.

Edit the mydict.dic file under analysis-ik/config/ik/custom; this file exists precisely so we can extend the vocabulary. Append the new word at the very end, save, and restart ES, as sketched below.
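A minimal sketch of that edit, run from the plugin directory (exact paths can differ between installs):

# the dictionary format is one word per line
echo "吊炸天" >> analysis-ik/config/ik/custom/mydict.dic

# mydict.dic should already be listed in the ext_dict entry of
# analysis-ik/config/ik/IKAnalyzer.cfg.xml; if you create a brand-new
# .dic file instead, remember to add it there as well

Then restart Elasticsearch and re-run the analysis: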

curl -XPOST 'http://127.0.0.1:9200/_analyze?pretty' -d '{"analyzer":"ik","text":"吊炸天天不容"}'

{
  "tokens" : [ {
    "token" : "吊炸天",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "CN_WORD",
    "position" : 0
  }, {
    "token" : "吊",
    "start_offset" : 0,
    "end_offset" : 1,
    "type" : "CN_WORD",
    "position" : 1
  }, {
    "token" : "炸",
    "start_offset" : 1,
    "end_offset" : 2,
    "type" : "CN_CHAR",
    "position" : 2
  }, {
    "token" : "每天",
    "start_offset" : 2,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 3
  }, {
    "token" : "不容",
    "start_offset" : 4,
    "end_offset" : 6,
    "type" : "CN_WORD",
    "position" : 4
  } ]
}

We can see that 吊炸天 now comes out as a single token.
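To put this to work in real queries, the field you search has to be analyzed with ik at index time, and the analyzer of an existing field cannot be changed after the fact, so declare it when the field is first mapped. A sketch on a fresh, hypothetical index (demo_ik, article and content are made-up names):

curl -XPUT 'http://127.0.0.1:9200/demo_ik' -d '
{
  "mappings": {
    "article": {
      "properties": {
        "content": { "type": "string", "analyzer": "ik" }
      }
    }
  }
}'

Once documents are indexed, a match_phrase query for the new word should be able to find them:

curl -XPOST 'http://127.0.0.1:9200/demo_ik/article/_search?pretty' -d '{"query":{"match_phrase":{"content":"吊炸天"}}}'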
