Elasticsearch in Practice (4): IK Tokenization

Date: 2019-12-12


Environment: Elasticsearch 6.2.4 + Kibana 6.2.4 + IK 6.2.4

Elasticsearch can tokenize Chinese text out of the box.

Let's first look at what the built-in Chinese tokenization produces:

curl -XGET "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d '{"analyzer": "default","text": "今天天气真好"}'

Or, in the Kibana console:

GET /_analyze
{
  "analyzer": "default",
  "text": "今天天气真好"
}

Result:

{
  "tokens": [
    {
      "token": "今",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "天",
      "start_offset": 1,
      "end_offset": 2,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "天",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "气",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "真",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    },
    {
      "token": "好",
      "start_offset": 5,
      "end_offset": 6,
      "type": "<IDEOGRAPHIC>",
      "position": 5
    }
  ]
}

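As we can see, the text is split into individual characters. In most real applications this will not give the search behavior we want. For log search, on the other hand, the built-in analyzer is good enough.

analyzer=default actually invokes the standard analyzer.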
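To make the limitation concrete, here is a minimal Python sketch that reproduces only the shape of the response above: one token per character. The function name per_character_tokens is hypothetical, and the code is not the standard analyzer's real algorithm; it only holds for plain runs of Chinese characters like the sample text.

```python
def per_character_tokens(text):
    """Mimic the shape of the response above: one token per character."""
    tokens = []
    for position, char in enumerate(text):
        tokens.append({
            "token": char,
            "start_offset": position,
            "end_offset": position + 1,
            "type": "<IDEOGRAPHIC>",
            "position": position,
        })
    return tokens


def token_set(text):
    """The set of distinct tokens, ignoring order and position."""
    return {token["token"] for token in per_character_tokens(text)}


print([t["token"] for t in per_character_tokens("今天天气真好")])
# The word 天气 and its reversal 气天 yield the same set of tokens.
print(token_set("天气") == token_set("气天"))
```

With single-character tokens, a query that ignores positions cannot tell a real word from the same characters in another order, which is why word-level tokenization matters for Chinese.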

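Next, we install the IK plugin and tokenize with it.

Installing IK

IK project page: https://github.com/medcl/elasticsearch-analysis-ik

Note first that the IK plugin version must match the Elasticsearch version; otherwise the two are incompatible.

Installation method 1:
Download the archive from https://github.com/medcl/elasticsearch-analysis-ik/releases, create an analysis-ik subdirectory under the Elasticsearch plugins directory, and extract the archive's contents into it. The plugins/analysis-ik/ directory should end up containing: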

plugins/analysis-ik/
    commons-codec-1.9.jar
    commons-logging-1.2.jar
    elasticsearch-analysis-ik-6.2.4.jar
    httpclient-4.5.2.jar
    httpcore-4.4.4.jar
    plugin-descriptor.properties

Then restart Elasticsearch.

Installation method 2:

/usr/local/elk/elasticsearch-6.2.4/bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.2.4/elasticsearch-analysis-ik-6.2.4.zip

If you have already downloaded the archive, install it from the local file instead:

/usr/local/elk/elasticsearch-6.2.4/bin/elasticsearch-plugin install file:///tmp/elasticsearch-analysis-ik-6.2.4.zip

Then restart Elasticsearch.
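Because the plugin version has to match the Elasticsearch version, it can help to derive the archive name and URL from a single version string. The sketch below does that in Python. The helper names are hypothetical, and it assumes other releases follow the same naming pattern as the 6.2.4 URL above, which may not hold for every version.

```python
def ik_archive_name(es_version):
    """Archive name for a given Elasticsearch version (pattern taken from 6.2.4)."""
    return "elasticsearch-analysis-ik-{0}.zip".format(es_version)


def ik_release_url(es_version):
    """Build the release URL, assuming the same layout as the 6.2.4 release."""
    return (
        "https://github.com/medcl/elasticsearch-analysis-ik/releases/download/"
        "v{0}/{1}".format(es_version, ik_archive_name(es_version))
    )


def versions_match(es_version, archive_name):
    """True if the archive name corresponds exactly to the given ES version."""
    return archive_name == ik_archive_name(es_version)


print(ik_release_url("6.2.4"))
print(versions_match("6.2.4", "elasticsearch-analysis-ik-6.2.3.zip"))
```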

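IK Tokenization

IK supports two tokenization modes:

ik_max_word: splits the text at the finest granularity, exhausting every possible combination of words
ik_smart: splits the text at the coarsest granularity

Next, let's see how IK's output differs from the built-in analyzer's:

curl -XGET "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d'{"analyzer": "ik_smart","text": "今天天气真好"}'

Result: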

{
  "tokens": [
    {
      "token": "今天天气",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "真好",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 1
    }
  ]
}

Now let's try ik_max_word on the same text:

{
  "tokens": [
    {
      "token": "今天天气",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "今天",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "天天",
      "start_offset": 1,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "天气",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "真好",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 4
    }
  ]
}
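The difference between the two modes can be illustrated with a toy dictionary matcher in Python. This is only a sketch: TOY_DICT, max_word and smart are hypothetical names, the dictionary holds just the words seen in the results above, and the matching logic is a simplification rather than IK's actual algorithm.

```python
TOY_DICT = {"今天天气", "今天", "天天", "天气", "真好"}


def max_word(text, words):
    """Emit every dictionary word found in text, ordered by start offset,
    longest first (a toy stand-in for the finest-grained mode)."""
    found = []
    for start in range(len(text)):
        for end in range(len(text), start, -1):
            if text[start:end] in words:
                found.append(text[start:end])
    return found


def smart(text, words):
    """Greedy longest match from left to right (a toy stand-in for the
    coarsest-grained mode); unknown characters become single tokens."""
    found = []
    start = 0
    while start < len(text):
        for end in range(len(text), start, -1):
            if text[start:end] in words:
                break
        else:
            end = start + 1  # no dictionary word starts at this offset
        found.append(text[start:end])
        start = end
    return found


print(max_word("今天天气真好", TOY_DICT))
print(smart("今天天气真好", TOY_DICT))
```

For this one sentence the toy matcher yields the same token sequences as the two responses above; on other input it will diverge from IK.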

Setting the analyzer in the mapping

Example:

{
    "properties": {
        "content": {
            "type": "text",
            "analyzer": "ik_max_word",
            "search_analyzer": "ik_max_word"
        }
    }
}

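Note: search_analyzer is set to the same value as analyzer here so that the same analyzer is used at search time and at index time, which ensures that the terms in the query have the same format as the terms in the inverted index. If search_analyzer is not set, it defaults to the value of analyzer. For details, see: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html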
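The mapping fragment above has to be wrapped in a full request body before it can be used to create an index. Here is a minimal Python sketch that builds such a body. The function name build_index_body is hypothetical, and the mapping type name "_doc" is an assumption for Elasticsearch 6.x that the article does not specify.

```python
import json


def build_index_body(field, analyzer, search_analyzer=None, doc_type="_doc"):
    """Build a create-index body that applies an analyzer to one text field.

    Elasticsearch 6.x nests properties under a mapping type; the default
    type name "_doc" is an assumption, not something the article specifies.
    """
    field_mapping = {"type": "text", "analyzer": analyzer}
    if search_analyzer is not None:
        # Only emitted when given explicitly; otherwise the index-time
        # analyzer is used at search time as well.
        field_mapping["search_analyzer"] = search_analyzer
    return {"mappings": {doc_type: {"properties": {field: field_mapping}}}}


body = build_index_body("content", "ik_max_word", "ik_max_word")
print(json.dumps(body, indent=2))
```

The printed JSON could then be sent as the body of a PUT request to an index of your choosing.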

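Anti-piracy notice: this is an original article published on the WeChat official account 飞鸿影的博客 (fhyblog) and on 博客园 (cnblogs); reproduction requires the author's consent.

Custom dictionaries

We can also define our own dictionary for IK to use. For example:

curl -XGET "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d'{"analyzer": "ik_smart","text": "去朝阳公园"}'

Result: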

{
  "tokens": [
    {
      "token": "去",
      "start_offset": 0,
      "end_offset": 1,
      "type": "CN_CHAR",
      "position": 0
    },
    {
      "token": "朝阳",
      "start_offset": 1,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "公园",
      "start_offset": 3,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 2
    }
  ]
}

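We would like 朝阳公园 (Chaoyang Park) to be kept as a single token. To get that, we can add the word to a dictionary of our own.

Creating your own dictionary takes only a few simple steps:
1. Add a my.dic file in the elasticsearch-6.2.4/config/analysis-ik/ directory: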

$ touch my.dic
$ echo 朝阳公园 > my.dic

$ cat my.dic
朝阳公园

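A .dic file is a dictionary file. It is just a plain text file with one word per line, and it must be UTF-8 encoded. Here is a look at the bundled main dictionary: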

$ head -n 5 main.dic
一一列举
一一对应
一一道来
一丁
一丁不识
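When a dictionary is generated by a script rather than by hand, the two format rules (UTF-8, one word per line) are easy to enforce in code. The following Python sketch shows one way to do so. The function name write_dictionary and the extra sample word are hypothetical, and the demo writes to a temporary directory rather than a real Elasticsearch installation.

```python
import os
import tempfile


def write_dictionary(path, words):
    """Write words to a .dic file: UTF-8, one word per line, no duplicates."""
    seen = set()
    unique = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            unique.append(word)
    # newline="\n" keeps line endings as plain \n on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for word in unique:
            handle.write(word + "\n")
    return unique


demo_path = os.path.join(tempfile.mkdtemp(), "my.dic")
written = write_dictionary(demo_path, ["朝阳公园", "朝阳公园", " 奥林匹克公园 "])
print(written)
```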

2. Then edit elasticsearch-6.2.4/config/analysis-ik/IKAnalyzer.cfg.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Users can configure their own extension dictionaries here -->
    <entry key="ext_dict">my.dic</entry>
    <!-- Users can configure their own extension stop-word dictionaries here -->
    <entry key="ext_stopwords"></entry>
    <!-- Users can configure a remote extension dictionary here -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!-- Users can configure a remote extension stop-word dictionary here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

With my.dic added, restart Elasticsearch and check the result again:

GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "去朝阳公园"
}

Result:

{
  "tokens": [
    {
      "token": "去",
      "start_offset": 0,
      "end_offset": 1,
      "type": "CN_CHAR",
      "position": 0
    },
    {
      "token": "朝阳公园",
      "start_offset": 1,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 1
    }
  ]
}

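This shows that the custom dictionary has taken effect. If there are multiple dictionaries, separate them with an ASCII semicolon: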

<entry key="ext_dict">my.dic;custom/single_word_low_freq.dic</entry>
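To check which dictionaries a configuration file actually lists, the entry can be parsed and split on the semicolon. Below is a minimal Python sketch using the standard library XML parser. SAMPLE_CONFIG is a shortened stand-in for IKAnalyzer.cfg.xml and list_entries is a hypothetical helper; the plugin's own parsing may differ in details such as whitespace handling.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for IKAnalyzer.cfg.xml, kept as bytes so the XML
# encoding declaration is honored by the parser.
SAMPLE_CONFIG = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer</comment>
    <entry key="ext_dict">my.dic;custom/single_word_low_freq.dic</entry>
    <entry key="ext_stopwords"></entry>
</properties>
"""


def list_entries(config_bytes, key):
    """Return the semicolon-separated values of one <entry key="..."> element."""
    root = ET.fromstring(config_bytes)
    for entry in root.findall("entry"):
        if entry.get("key") == key:
            text = entry.text or ""
            return [part.strip() for part in text.split(";") if part.strip()]
    return []


print(list_entries(SAMPLE_CONFIG, "ext_dict"))
```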

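The configuration also contains an extension stop-word dictionary, which is used to help with segmentation. Here is a look at one of the bundled extension stop-word dictionaries: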

$ head -n 5 extra_stopword.dic
也
了
仍
从
以

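In other words, when the IK tokenizer encounters one of these words, it assumes that the text before it does not combine with it to form a word.

IK also supports remote dictionaries, whose advantage is that they can be hot-updated. The format is the same as for local dictionaries, one word per line (with \n as the line separator), and the configured URL must also satisfy the following:

The HTTP request must return two headers, Last-Modified and ETag, both of which are strings. Whenever either one changes, the plugin fetches the new word list and updates its dictionary.

For details, see the section on hot-updating the IK dictionary at https://github.com/medcl/elasticsearch-analysis-ik.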
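The two-header requirement can be sketched with Python's standard library HTTP server. This is a minimal illustration under assumptions, not a production setup: the names dictionary_headers, make_handler and serve are hypothetical, the ETag is derived from a SHA-256 hash of the file content, and answering HEAD as well as GET is a precaution rather than something the text above requires.

```python
import hashlib
import os
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer


def dictionary_headers(body, modified_timestamp):
    """Compute the two headers the plugin is described as watching."""
    return {
        "Last-Modified": formatdate(modified_timestamp, usegmt=True),
        "ETag": '"{0}"'.format(hashlib.sha256(body).hexdigest()),
        "Content-Type": "text/plain; charset=utf-8",
    }


def make_handler(dictionary_path):
    """Create a request handler class that serves one dictionary file."""

    class DictionaryHandler(BaseHTTPRequestHandler):
        def _send(self, include_body):
            with open(dictionary_path, "rb") as handle:
                body = handle.read()
            headers = dictionary_headers(body, os.path.getmtime(dictionary_path))
            self.send_response(200)
            for name, value in headers.items():
                self.send_header(name, value)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            if include_body:
                self.wfile.write(body)

        def do_GET(self):
            self._send(include_body=True)

        def do_HEAD(self):
            self._send(include_body=False)

    return DictionaryHandler


def serve(dictionary_path, port=8000):
    """Serve the dictionary on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), make_handler(dictionary_path)).serve_forever()
```

Calling serve("my.dic") would expose the file on port 8000; editing the file changes both its modification time and its content hash, so both headers change with it.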

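Note: in the examples above we edited files under elasticsearch-6.2.4/config/analysis-ik/ because IK was installed with elasticsearch-plugin, as in method 2. If you installed IK by extracting the archive, its configuration lives under the plugins directory instead, i.e. elasticsearch-6.2.4/plugins/analysis-ik/config. In other words, the plugin's configuration can be placed either in the plugin's own directory or in Elasticsearch's config directory.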
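If you are unsure which of the two locations applies to your installation, a small script can check both. The Python sketch below uses the hypothetical helper ik_config_candidates; it only reports which directories exist and makes no claim about which one takes precedence when both are present.

```python
import os
import tempfile


def ik_config_candidates(es_home):
    """Return the IK config directories that exist under an ES install."""
    candidates = [
        os.path.join(es_home, "config", "analysis-ik"),
        os.path.join(es_home, "plugins", "analysis-ik", "config"),
    ]
    return [path for path in candidates if os.path.isdir(path)]


# Demo against a throwaway directory tree rather than a real installation.
demo_home = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_home, "plugins", "analysis-ik", "config"))
print(ik_config_candidates(demo_home))
```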

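Elasticsearch's built-in analyzers

Elasticsearch ships with many built-in analyzers that can be used in any index without further configuration:

Standard analyzer (standard): splits the text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation, lowercases the terms, and supports stop words.
Simple analyzer (simple): splits the text whenever it encounters a non-letter character and lowercases all terms.
Whitespace analyzer (whitespace): splits the text whenever it encounters a whitespace character.
Stop analyzer (stop): like the simple analyzer, but also supports removing stop words.
Keyword analyzer (keyword): a no-op analyzer that outputs the input unchanged as a single term.
Pattern analyzer (pattern): uses a regular expression to split the text into terms. It supports lowercasing and stop words.
Language analyzers (language): a set of language-specific analyzers, such as english or french.
Fingerprint analyzer (fingerprint): a specialist analyzer that produces a fingerprint which can be used for duplicate detection.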
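Custom analyzers: if the built-in analyzers do not meet your needs, you can define a custom analyzer as a combination of character filters, a tokenizer, and token filters. IK is an example of going beyond the built-in analyzers, although it is delivered as a plugin rather than configured with the custom type.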

See the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html
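The differences between a few of these analyzers are easier to see side by side. The Python functions below are rough approximations written for illustration only; they are not Elasticsearch's implementations, and the sample sentence is made up.

```python
import re


def whitespace_analyzer(text):
    """Split on whitespace only; keep case and punctuation."""
    return text.split()


def simple_analyzer(text):
    """Split on anything that is not a letter, then lowercase."""
    return [term.lower() for term in re.findall(r"[^\W\d_]+", text)]


def keyword_analyzer(text):
    """Return the whole input as one term."""
    return [text]


sample = "The QUICK brown-fox, 2 jumps"
for analyzer in (whitespace_analyzer, simple_analyzer, keyword_analyzer):
    print(analyzer.__name__, analyzer(sample))
```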

References

1. medcl/elasticsearch-analysis-ik: The IK Analysis plugin integrates Lucene IK analyzer into elasticsearch, support customized dictionary.
https://github.com/medcl/elasticsearch-analysis-ik
2. ElesticSearch IK中文分词使用详解 (a detailed guide to using IK Chinese tokenization), xsdxs's blog, CSDN
https://blog.csdn.net/xsdxs/article/details/72853288