We know that the first thing a search engine does with a query is tokenize ("analyze") the text. Like other search engines, Elasticsearch 2.3.3's default standard analyzer is not well suited to Chinese; the most commonly used Chinese analysis plugin is IK Analysis. In this article we walk through installing the IK Analysis plugin.
Before installing IK, let's look at the output of the standard analyzer.
Start the previously installed ES instance and enter the following URL in the browser's address bar:
http://192.168.133.134:9200/hotel/_analyze?analyzer=standard&text=58码农,我帮码农,咱们为程序员的匠心精神服务!
The analysis result looks like this:
{
  "tokens": [
    { "token": "58", "start_offset": 0, "end_offset": 2, "type": "<NUM>", "position": 0 },
    { "token": "码", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 1 },
    { "token": "农", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 2 },
    { "token": "我", "start_offset": 5, "end_offset": 6, "type": "<IDEOGRAPHIC>", "position": 3 },
    { "token": "帮", "start_offset": 6, "end_offset": 7, "type": "<IDEOGRAPHIC>", "position": 4 },
    { "token": "码", "start_offset": 7, "end_offset": 8, "type": "<IDEOGRAPHIC>", "position": 5 },
    { "token": "农", "start_offset": 8, "end_offset": 9, "type": "<IDEOGRAPHIC>", "position": 6 },
    { "token": "我", "start_offset": 10, "end_offset": 11, "type": "<IDEOGRAPHIC>", "position": 7 },
    { "token": "们", "start_offset": 11, "end_offset": 12, "type": "<IDEOGRAPHIC>", "position": 8 },
    { "token": "为", "start_offset": 12, "end_offset": 13, "type": "<IDEOGRAPHIC>", "position": 9 },
    { "token": "程", "start_offset": 13, "end_offset": 14, "type": "<IDEOGRAPHIC>", "position": 10 },
    { "token": "序", "start_offset": 14, "end_offset": 15, "type": "<IDEOGRAPHIC>", "position": 11 },
    { "token": "员", "start_offset": 15, "end_offset": 16, "type": "<IDEOGRAPHIC>", "position": 12 },
    { "token": "的", "start_offset": 16, "end_offset": 17, "type": "<IDEOGRAPHIC>", "position": 13 },
    { "token": "匠", "start_offset": 17, "end_offset": 18, "type": "<IDEOGRAPHIC>", "position": 14 },
    { "token": "心", "start_offset": 18, "end_offset": 19, "type": "<IDEOGRAPHIC>", "position": 15 },
    { "token": "精", "start_offset": 19, "end_offset": 20, "type": "<IDEOGRAPHIC>", "position": 16 },
    { "token": "神", "start_offset": 20, "end_offset": 21, "type": "<IDEOGRAPHIC>", "position": 17 },
    { "token": "服", "start_offset": 21, "end_offset": 22, "type": "<IDEOGRAPHIC>", "position": 18 },
    { "token": "务", "start_offset": 22, "end_offset": 23, "type": "<IDEOGRAPHIC>", "position": 19 }
  ]
}
As you can see, the text is split almost character by character; meaningful words are not kept together. We will look at the output again after installing IK.
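If you prefer the command line to the browser, the same `_analyze` request can be issued with curl and the token values pulled out with a little text processing. The sketch below uses an inline sample response so it is self-contained; against a live node you would pipe curl output instead (the host, port, and index are the ones used above):

```shell
# Sketch: extract just the token strings from an _analyze response.
# With a live node you would do something like:
#   curl 'http://192.168.133.134:9200/hotel/_analyze?analyzer=standard&text=...'
# Here a small inline sample stands in for the response.
response='{"tokens":[{"token":"58","type":"<NUM>"},{"token":"码","type":"<IDEOGRAPHIC>"},{"token":"农","type":"<IDEOGRAPHIC>"}]}'
printf '%s\n' "$response" | grep -o '"token":"[^"]*"' | cut -d'"' -f4
# prints 58, 码, 农 — one token per line
```

This makes it easy to diff the token lists produced by different analyzers side by side.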
Installing the IK Analysis plugin is actually quite simple, but since it usually has to be built from source, many people trip over the build. Below I walk through the source installation.
1. Installing Maven
IK Analysis is written in Java, so building it from source requires a Maven environment. Let's install Maven first.
1) Download the Maven archive.
wget http://mirror.bit.edu.cn/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
I downloaded the archive into the /usr/local directory.
2) Extract the archive
tar -xvf apache-maven-3.3.9-bin.tar.gz
3) Set the environment variables
vi /etc/profile
Then append the following three lines at the end of the file:
MAVEN_HOME=/usr/local/apache-maven-3.3.9
export MAVEN_HOME
export PATH=${PATH}:${MAVEN_HOME}/bin
After saving, reload the environment variables:
source /etc/profile
4) Verify the installation
mvn -version
If Maven's version information is printed, the installation succeeded.
2. Installing Git
A yum install is sufficient:
yum install git
3. Downloading the IK source
git clone https://github.com/medcl/elasticsearch-analysis-ik
I cloned it into the /usr/es/ik directory.
4. Building and packaging
This step downloads many dependency packages, so it takes some time. The commands are as follows.
Enter the elasticsearch-analysis-ik directory and run the clean step:
mvn clean
After cleaning, run the compile step; this one takes even longer:
mvn compile
Finally, package the plugin:
mvn package
Once packaging finishes, the built archive can be found in the target directory.
5. Copying and extracting elasticsearch-analysis-ik-1.9.3.zip
Run the following command:
unzip /usr/es/ik/elasticsearch-analysis-ik/target/releases/elasticsearch-analysis-ik-1.9.3.zip -d /usr/es/plugins/ik
After extraction, the plugin files can be found under /usr/es/plugins/ik.
6. Restarting Elasticsearch
After restarting, the startup log should include a line indicating that the ik analysis plugin was loaded. Now let's try the analyzer.
Enter the following in the browser's address bar:
http://192.168.133.134:9200/hotel/_analyze?analyzer=ik&text=58码农,我帮码农,咱们为程序员的匠心精神服务!
This is the same request as before except for the analyzer: standard earlier, ik this time. The result looks like this:
{
  "tokens": [
    { "token": "58", "start_offset": 0, "end_offset": 2, "type": "ARABIC", "position": 0 },
    { "token": "码", "start_offset": 2, "end_offset": 3, "type": "COUNT", "position": 1 },
    { "token": "农", "start_offset": 3, "end_offset": 4, "type": "CN_WORD", "position": 2 },
    { "token": "我", "start_offset": 5, "end_offset": 6, "type": "CN_CHAR", "position": 3 },
    { "token": "帮", "start_offset": 6, "end_offset": 7, "type": "CN_CHAR", "position": 4 },
    { "token": "码", "start_offset": 7, "end_offset": 8, "type": "CN_CHAR", "position": 5 },
    { "token": "农", "start_offset": 8, "end_offset": 9, "type": "CN_WORD", "position": 6 },
    { "token": "咱们", "start_offset": 10, "end_offset": 12, "type": "CN_WORD", "position": 7 },
    { "token": "为", "start_offset": 12, "end_offset": 13, "type": "CN_CHAR", "position": 8 },
    { "token": "程序员", "start_offset": 13, "end_offset": 16, "type": "CN_WORD", "position": 9 },
    { "token": "程序", "start_offset": 13, "end_offset": 15, "type": "CN_WORD", "position": 10 },
    { "token": "序", "start_offset": 14, "end_offset": 15, "type": "CN_WORD", "position": 11 },
    { "token": "员", "start_offset": 15, "end_offset": 16, "type": "CN_CHAR", "position": 12 },
    { "token": "匠心", "start_offset": 17, "end_offset": 19, "type": "CN_WORD", "position": 13 },
    { "token": "匠", "start_offset": 17, "end_offset": 18, "type": "CN_WORD", "position": 14 },
    { "token": "心", "start_offset": 18, "end_offset": 19, "type": "CN_CHAR", "position": 15 },
    { "token": "精神", "start_offset": 19, "end_offset": 21, "type": "CN_WORD", "position": 16 },
    { "token": "服务", "start_offset": 21, "end_offset": 23, "type": "CN_WORD", "position": 17 }
  ]
}
This time 程序员 (programmer), 程序 (program), 精神 (spirit), and 服务 (service) come out as whole words. That completes the IK installation covered in this article.
Finally, a question to think about: how could we get "58码农" recognized as a single token?
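As a pointer (not covered step by step here): the IK plugin supports user extension dictionaries, which are plain text files with one word per line, referenced from the plugin's IKAnalyzer.cfg.xml via the `ext_dict` entry. The sketch below only demonstrates creating such a dictionary file; the directory used is a stand-in, and the exact config location in your install is an assumption you should verify:

```shell
# Sketch (assumed paths): create a custom IK dictionary containing "58码农".
# In a real install the file would live under the plugin's config directory
# (e.g. /usr/es/plugins/ik/config/custom/) and be referenced from
# IKAnalyzer.cfg.xml with:  <entry key="ext_dict">custom/mydict.dic</entry>
IK_CONF=./ik-config                      # stand-in for the plugin's config directory
mkdir -p "$IK_CONF/custom"
printf '58码农\n' > "$IK_CONF/custom/mydict.dic"   # one word per line, UTF-8
cat "$IK_CONF/custom/mydict.dic"
```

After pointing `ext_dict` at the dictionary file, restart Elasticsearch so IK reloads its word lists, then re-run the `_analyze` request to check the new token.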
For more on the IK plugin, see the project page (in English): https://github.com/medcl/elasticsearch-analysis-ik