IK Analysis: a Chinese word-segmentation plugin for Elasticsearch

Date: 2019-11-07

Tags: elasticsearch, Chinese word-segmentation plugin, analyze

Elasticsearch version: 7.3

Installing the Chinese word-segmentation plugin

The plugin version must match the Elasticsearch version.

Download page for all plugin releases:

https://github.com/medcl/elasticsearch-analysis-ik/releases

Install it with the script that ships with Elasticsearch:

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.3.0/elasticsearch-analysis-ik-7.3.0.zip
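Because the plugin version has to track the Elasticsearch version, it can help to derive the download URL from a single version string. The sketch below is only an illustration: the class and method names are hypothetical, and it assumes other releases follow the same URL pattern as the 7.3.0 link above, which should be checked against the releases page.

```java
final class IkPluginRelease {
    // Base of the release download URLs, taken from the 7.3.0 link above.
    private static final String BASE =
        "https://github.com/medcl/elasticsearch-analysis-ik/releases/download";

    // Builds the release URL for a given Elasticsearch version such as "7.3.0".
    // Assumption: every release uses the same naming pattern as 7.3.0.
    static String downloadUrl(String esVersion) {
        if (!esVersion.matches("\\d+\\.\\d+\\.\\d+")) {
            throw new IllegalArgumentException(
                "expected a version like 7.3.0, got: " + esVersion);
        }
        return BASE + "/v" + esVersion
            + "/elasticsearch-analysis-ik-" + esVersion + ".zip";
    }

    public static void main(String[] args) {
        // Prints the install command for the version used in this post.
        System.out.println("./bin/elasticsearch-plugin install " + downloadUrl("7.3.0"));
    }
}
```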

The plugin's JAR files are installed under elasticsearch-7.3.0/plugins/analysis-ik.

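The plugin's configuration files live under elasticsearch-7.3.0/config/analysis-ik. This directory holds a number of dictionaries. To add custom dictionaries for your own business domain, edit the IKAnalyzer.cfg.xml file in this directory.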

For example:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
        <comment>IK Analyzer extension configuration</comment>
        <!-- Configure your own extension dictionaries here -->
        <entry key="ext_dict">custom/mydict.dic;</entry>
        <!-- Configure your own extension stop-word dictionaries here -->
        <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
        <!-- Configure a remote extension dictionary here -->
        <entry key="remote_ext_dict">http://10.0.11.1:10002/elasticsearch/myDict</entry>
        <!-- Configure a remote extension stop-word dictionary here -->
        <entry key="remote_ext_stopwords">http://10.0.11.1:10002/elasticsearch/stopWordDict</entry>
</properties>

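Extension dictionaries can either be stored locally or hosted on a remote server.

The custom directory sits in the same directory as IKAnalyzer.cfg.xml. Note that extension dictionary files must be plain text encoded as UTF-8.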
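A local dictionary file can be generated from code rather than typed by hand, which makes it easier to guarantee the UTF-8 encoding. The following is a minimal sketch, assuming one word per line as the file format; the class name IkDictFile and its methods are hypothetical, and the target path is whatever you reference from ext_dict (custom/mydict.dic in the example above).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class IkDictFile {
    // Encodes the words as UTF-8 text, one word per line; blank entries are skipped.
    static byte[] encode(List<String> words) {
        StringBuilder sb = new StringBuilder();
        for (String word : words) {
            String trimmed = word.trim();
            if (!trimmed.isEmpty()) {
                sb.append(trimmed).append('\n');
            }
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Writes the dictionary file, creating the parent directory if needed.
    static void write(Path dictFile, List<String> words) {
        try {
            Path parent = dictFile.toAbsolutePath().getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            Files.write(dictFile, encode(words));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For instance, IkDictFile.write(Path.of("config/analysis-ik/custom/mydict.dic"), words) would produce the file named in the configuration, relative to the Elasticsearch home directory.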

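With a remote dictionary, updating the word list does not require a restart, but the HTTP response headers must be set up as follows.

The HTTP request must return two headers, Last-Modified and ETag. Both are strings, and as soon as either one changes, the plugin fetches the new words and updates its dictionary.
The response body contains one word per line, with \n as the line separator.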
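The reload rule described above can be modelled in a few lines. This is a simplified illustration of the "either header changed" check, not the plugin's actual code; the class RemoteDictWatcher is hypothetical.

```java
import java.util.Objects;

final class RemoteDictWatcher {
    // Header values seen on the previous poll; null before the first poll.
    private String lastModified;
    private String eTag;

    // Returns true when either header differs from the value seen last time,
    // then remembers the new values for the next poll.
    boolean shouldReload(String newLastModified, String newETag) {
        boolean changed = !Objects.equals(lastModified, newLastModified)
                || !Objects.equals(eTag, newETag);
        lastModified = newLastModified;
        eTag = newETag;
        return changed;
    }
}
```

Under this model the first poll always triggers a load, and later polls only trigger one when the server changes a header value, which is why the endpoint has to change at least one of them whenever the word list changes.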

After changing IKAnalyzer.cfg.xml itself, the service must be restarted.

// Create the index
PUT /full_text_test

// Add the mapping
POST /full_text_test/_mapping
{
  "properties":{
    "content":{
      "type":"text",
      "analyzer":"ik_max_word",
      "search_analyzer":"ik_smart"
    }
  }
}

// Index one document
POST /full_text_test/_doc/1
{
  "content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
}
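The mapping request above can also be issued from application code. Below is a minimal sketch using the JDK's java.net.http client (Java 11+); the class name IkMappingRequest and the base URL http://localhost:9200 are assumptions for illustration, and the JSON body is the same mapping as in the console request.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class IkMappingRequest {
    // Same mapping as the console request: index with ik_max_word, search with ik_smart.
    static final String MAPPING_JSON =
        "{\"properties\":{\"content\":{\"type\":\"text\","
        + "\"analyzer\":\"ik_max_word\",\"search_analyzer\":\"ik_smart\"}}}";

    // Builds (but does not send) POST {baseUrl}/{index}/_mapping.
    static HttpRequest build(String baseUrl, String index) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + index + "/_mapping"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(MAPPING_JSON))
                .build();
    }
}
```

To send it, pass the request to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) and inspect the status code of the response.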

Testing the segmentation

ik_max_word: splits the text at the finest granularity

ik_smart: splits the text at the coarsest granularity

POST /full_text_test/_analyze
{
  "text": ["中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"],
  "tokenizer": "ik_max_word"
}

Result

{
  "tokens" : [
    {
      "token" : "中国",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "驻",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "洛杉矶",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "领事馆",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "领事",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "馆",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "CN_CHAR",
      "position" : 5
    },
    {
      "token" : "遭",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "CN_CHAR",
      "position" : 6
    },
    {
      "token" : "亚裔",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "男子",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "CN_WORD",
      "position" : 8
    },
    {
      "token" : "子枪",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "CN_WORD",
      "position" : 9
    },
    {
      "token" : "枪击",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "CN_WORD",
      "position" : 10
    },
    {
      "token" : "嫌犯",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "CN_WORD",
      "position" : 11
    },
    {
      "token" : "已",
      "start_offset" : 19,
      "end_offset" : 20,
      "type" : "CN_CHAR",
      "position" : 12
    },
    {
      "token" : "自首",
      "start_offset" : 20,
      "end_offset" : 22,
      "type" : "CN_WORD",
      "position" : 13
    }
  ]
}

POST /full_text_test/_analyze
{
  "text": ["中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"],
  "tokenizer": "ik_smart"
}

Result

{
  "tokens" : [
    {
      "token" : "中国",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "驻",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "洛杉矶",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "领事馆",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "遭",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "亚裔",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "男子",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "枪击",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "嫌犯",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "CN_WORD",
      "position" : 8
    },
    {
      "token" : "已",
      "start_offset" : 19,
      "end_offset" : 20,
      "type" : "CN_CHAR",
      "position" : 9
    },
    {
      "token" : "自首",
      "start_offset" : 20,
      "end_offset" : 22,
      "type" : "CN_WORD",
      "position" : 10
    }
  ]
}
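To make the difference between the two results concrete, the token lists can be compared directly. The sketch below hard-codes the tokens from the two responses above (14 for ik_max_word, 11 for ik_smart); the class name IkTokenDiff is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class IkTokenDiff {
    // Tokens copied from the ik_max_word response above.
    static final List<String> MAX_WORD = List.of(
        "中国", "驻", "洛杉矶", "领事馆", "领事", "馆", "遭",
        "亚裔", "男子", "子枪", "枪击", "嫌犯", "已", "自首");
    // Tokens copied from the ik_smart response above.
    static final List<String> SMART = List.of(
        "中国", "驻", "洛杉矶", "领事馆", "遭",
        "亚裔", "男子", "枪击", "嫌犯", "已", "自首");

    // Returns the tokens of a that do not occur in b, in their original order.
    static List<String> onlyIn(List<String> a, List<String> b) {
        List<String> extra = new ArrayList<>(a);
        extra.removeAll(b);
        return extra;
    }

    public static void main(String[] args) {
        System.out.println(onlyIn(MAX_WORD, SMART)); // prints [领事, 馆, 子枪]
        System.out.println(onlyIn(SMART, MAX_WORD)); // prints []
    }
}
```

For this sentence every ik_smart token also appears in the ik_max_word output, which is consistent with the mapping's choice of ik_max_word at index time and ik_smart at search time.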

Implementing a dictionary table managed in a database, so that the dictionary can be extended at any time

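/**
 * Remote dictionary endpoint for the elasticsearch ik-analysis plugin.
 * 1. The response must carry two headers, Last-Modified and ETag. Both are
 *    strings; as soon as either one changes, the plugin fetches the new words
 *    and updates its dictionary.
 * 2. The response body contains one word per line, with \n as the separator.
 * Assumes the enclosing class is a @RestController, so the returned String is
 * written as the response body.
 */
@RequestMapping("myDict")
public String myDict(HttpServletResponse response) {
    // Read the current dictionary version from the database
    String version = esDictVersionMapper.selectById(1).getVersion();
    // Expose the version in both response headers that the plugin checks
    response.setHeader("Last-Modified", version);
    response.setHeader("ETag", version);
    StringBuilder sb = new StringBuilder();
    // Load every row of the extension dictionary table in MySQL, separated by \n
    esDictMapper.selectList(null).forEach(item -> sb.append(item.getWord()).append("\n"));
    return sb.toString();
}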
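The endpoint only triggers a reload if the stored version actually changes whenever the word table changes. The following in-memory sketch shows one way to keep the two in step; it stands in for the database tables and is not the author's implementation. VersionedDict and its methods are hypothetical names.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class VersionedDict {
    // Insertion-ordered set, so the body is stable and free of duplicates.
    private final Set<String> words = new LinkedHashSet<>();
    private long version = 0;

    // Adds a word; the version is bumped only when the word is actually new.
    synchronized boolean addWord(String word) {
        boolean added = words.add(word);
        if (added) {
            version++;
        }
        return added;
    }

    // Value to send in the Last-Modified / ETag headers.
    synchronized String version() {
        return Long.toString(version);
    }

    // Response body: one word per line, separated by \n.
    synchronized String body() {
        StringBuilder sb = new StringBuilder();
        for (String word : words) {
            sb.append(word).append('\n');
        }
        return sb.toString();
    }
}
```

In a database-backed version, the same idea would mean updating the version row in the same transaction that inserts or deletes a word.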

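Common problems
Problem 1: "analyzer [ik_max_word] not found for field [content]"
Solution: install the IK plugin on every Elasticsearch node and restart the nodes; the error then goes away.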
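One way to spot a node that was skipped is to compare the cluster's node names with the output of GET _cat/plugins, which lists node name, plugin component and version per line. The sketch below only parses text you have already fetched; it assumes node names contain no whitespace and that the plugin is listed as analysis-ik. The class name IkPluginCheck is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class IkPluginCheck {
    // Returns the nodes from nodeNames that have no line for the given plugin
    // in the _cat/plugins output ("<node> <component> <version>" per line).
    static List<String> nodesMissingPlugin(
            List<String> nodeNames, String catPluginsOutput, String plugin) {
        Set<String> nodesWithPlugin = new HashSet<>();
        for (String line : catPluginsOutput.split("\n")) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length >= 2 && cols[1].equals(plugin)) {
                nodesWithPlugin.add(cols[0]);
            }
        }
        List<String> missing = new ArrayList<>();
        for (String node : nodeNames) {
            if (!nodesWithPlugin.contains(node)) {
                missing.add(node);
            }
        }
        return missing;
    }
}
```

The node names themselves can be taken from GET _cat/nodes?h=name; any name returned by nodesMissingPlugin still needs the plugin installed.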