Installing and Using the Pinyin and IK Analyzers in Elasticsearch

Date: 2019-11-06

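I. Downloading and configuring the Elasticsearch plugins

1. Downloading and installing the IK analyzer

I won't introduce the IK analyzer at length. In short, IK is a very widely used Chinese analyzer with good segmentation quality, and most Elasticsearch projects that handle Chinese text use it.

Download: https://github.com/medcl/elasticsearch-analysis-ik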

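2. Downloading and installing the pinyin analyzer

On Taobao or JD you can type pinyin into the search box and still find what you are looking for. That is pinyin analysis: the pinyin analyzer converts Chinese text into pinyin, so documents can be found by searching on the pinyin produced at analysis time.

Download: https://github.com/medcl/elasticsearch-analysis-pinyin

Note: the plugin version you download must match your Elasticsearch version, and Elasticsearch must be restarted after a plugin is installed for it to take effect.

Plugin install location: each plugin sits in its own folder under the Elasticsearch plugins directory. (I have three plugins installed; the murmur3 plugin is not covered here and can be ignored for now.)

Once the plugins are in place, restart Elasticsearch.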
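
One way to confirm which plugins are installed, and at which versions, is to compare the node version with the plugin list. The following is a minimal sketch, assuming the node version is read from `GET /` (`version.number`) and the plugin rows from `GET /_cat/plugins?format=json`. The helper name, the sample rows and the version numbers are made up for illustration.

```python
def mismatched_plugins(es_version, plugin_rows):
    """Return the plugin components whose version differs from the node version."""
    return [row["component"] for row in plugin_rows if row["version"] != es_version]


# Hypothetical rows, shaped like the output of GET /_cat/plugins?format=json
sample_rows = [
    {"name": "node-1", "component": "analysis-ik", "version": "6.3.0"},
    {"name": "node-1", "component": "analysis-pinyin", "version": "6.2.4"},
]

print(mismatched_plugins("6.3.0", sample_rows))  # ['analysis-pinyin']
```

An empty list means every listed plugin reports the same version as the node.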

II. Using the pinyin and IK analyzers

1. Using the IK Chinese analyzer

1.1 ik_smart: performs the coarsest-grained segmentation

GET /_analyze
{
  "text":"中华人民共和国国徽",
  "analyzer":"ik_smart"
}

Result:
{
  "tokens": [
    {
      "token": "中华人民共和国",
      "start_offset": 0,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "国徽",
      "start_offset": 7,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 1
    }
  ]
}

1.2 ik_max_word: performs the finest-grained segmentation of the text

GET /_analyze
{
  "text": "中华人民共和国国徽",
  "analyzer": "ik_max_word"
}

Result:
{
  "tokens": [
    {
      "token": "中华人民共和国",
      "start_offset": 0,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "中华人民",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "中华",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "华人",
      "start_offset": 1,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "人民共和国",
      "start_offset": 2,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "人民",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "共和国",
      "start_offset": 4,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 6
    },
    {
      "token": "共和",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 7
    },
    {
      "token": "国",
      "start_offset": 6,
      "end_offset": 7,
      "type": "CN_CHAR",
      "position": 8
    },
    {
      "token": "国徽",
      "start_offset": 7,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 9
    }
  ]
}
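
To make the difference between the two modes concrete, the two results above can be compared as sets of terms. This is only a sketch over the sample responses shown in this section (trimmed to the "token" field); the helper name is hypothetical, and the subset relation it prints holds for this input, not necessarily for every text.

```python
def token_texts(analyze_response):
    """Extract the token strings from an _analyze response body."""
    return [t["token"] for t in analyze_response["tokens"]]


# Responses copied from the results above, keeping only the "token" field.
smart = {"tokens": [{"token": "中华人民共和国"}, {"token": "国徽"}]}
max_word = {"tokens": [{"token": t} for t in [
    "中华人民共和国", "中华人民", "中华", "华人", "人民共和国",
    "人民", "共和国", "共和", "国", "国徽",
]]}

smart_terms = set(token_texts(smart))
max_terms = set(token_texts(max_word))

print(len(smart_terms), len(max_terms))  # 2 10
print(smart_terms <= max_terms)          # True
```

For this sentence ik_max_word emits everything ik_smart does plus the overlapping shorter words, which increases recall at the cost of a larger index.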

2. Using the pinyin analyzer

GET /_analyze
{
  "text":"刘德华",
  "analyzer": "pinyin"
}

Result:
{
  "tokens": [
    {
      "token": "liu",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "ldh",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "de",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "hua",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 2
    }
  ]
}

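Note: whether you use the pinyin analyzer or the IK analyzer, a document can only be found through terms that the analyzer actually produced for it. Searching for anything else returns nothing.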
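
The note above can be illustrated with a plain set lookup. This is a simplified model, assuming exact term lookup as a `term` query would do; a `match` query first analyzes the query text as well, so real behaviour also depends on the search-time analyzer. The function name is hypothetical.

```python
# Terms the pinyin analyzer produced for "刘德华" in the result above.
indexed_terms = {"liu", "ldh", "de", "hua"}


def term_matches(term, terms):
    """Exact term lookup: true only if the analyzer emitted this term."""
    return term in terms


for candidate in ["liu", "ldh", "dehua", "刘德华"]:
    print(candidate, term_matches(candidate, indexed_terms))
# liu True
# ldh True
# dehua False
# 刘德华 False
```

Here "dehua" and the original Chinese string are absent because the analyzer, as configured in that request, did not emit them.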

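III. Combining the IK and pinyin analyzers

When creating an index you can define custom analyzers, and then assign them to fields through the mapping.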

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik_smart_pinyin": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "filter": ["my_pinyin", "word_delimiter"]
        },
        "ik_max_word_pinyin": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["my_pinyin", "word_delimiter"]
        }
      },
      "filter": {
        "my_pinyin": {
          "type": "pinyin",
          "keep_separate_first_letter": true,
          "keep_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "lowercase": true,
          "remove_duplicated_term": true
        }
      }
    }
  }
}
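
Before mapping any field to a custom analyzer, it can help to run it through the index-level `_analyze` API and look at the tokens it emits. The sketch below assumes a local, unsecured node at http://localhost:9200; the helper names are hypothetical, and only the request-building part runs without a live cluster.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local node without authentication


def build_analyze_request(index, analyzer, text):
    """Return (path, JSON body) for the index-level _analyze API."""
    path = f"/{index}/_analyze"
    body = json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)
    return path, body


def analyze(index, analyzer, text):
    """Send the request and return the token strings (needs a running node)."""
    path, body = build_analyze_request(index, analyzer, text)
    req = urllib.request.Request(
        ES_URL + path,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return [t["token"] for t in json.load(resp)["tokens"]]


path, body = build_analyze_request("my_index", "ik_smart_pinyin", "刘德华")
print(path)  # /my_index/_analyze
print(body)  # {"analyzer": "ik_smart_pinyin", "text": "刘德华"}
```

Calling `analyze("my_index", "ik_smart_pinyin", "刘德华")` against a node with both plugins installed returns the terms that would be indexed for that text.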

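When creating the type, set the field's analyzer property to the custom analyzer defined above. (The type-based endpoints used here, with my_type in the path, follow the syntax of Elasticsearch 6.x and earlier.)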

PUT /my_index/my_type/_mapping
{
    "my_type":{
      "properties": {
        "id":{
          "type": "integer"
        },
        "name":{
          "type": "text",
          "analyzer": "ik_smart_pinyin"
        }
      }
    }
}

To test, first add a few documents:

POST /my_index/my_type/_bulk
{ "index": { "_id":1}}
{ "name": "张三"}
{ "index": { "_id": 2}}
{ "name": "张四"}
{ "index": { "_id": 3}}
{ "name": "李四"}
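
When the same bulk request is sent from a script rather than from the console, the body has to be newline-delimited JSON, and the last line must also end with a newline. A minimal sketch of building that body follows; the function name is hypothetical, and the result would be sent with the Content-Type header application/x-ndjson.

```python
import json


def bulk_index_body(docs):
    """Build an NDJSON _bulk body from {id: source} pairs."""
    lines = []
    for doc_id, source in docs.items():
        lines.append(json.dumps({"index": {"_id": doc_id}}))   # action line
        lines.append(json.dumps(source, ensure_ascii=False))   # document line
    # The bulk API requires a trailing newline after the final line.
    return "\n".join(lines) + "\n"


body = bulk_index_body({1: {"name": "张三"}, 2: {"name": "张四"}, 3: {"name": "李四"}})
print(body, end="")
```

Each document contributes two lines: the action line with its `_id`, then the source.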

Querying with Chinese text (IK analysis):

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": "李"
    }
  }
}

Result:
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.47160998,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 0.47160998,
        "_source": {
          "name": "李四"
        }
      }
    ]
  }
}

Querying with pinyin:

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": "zhang"
    }
  }
}

Result:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.3758317,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 0.3758317,
        "_source": {
          "name": "张四"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.3758317,
        "_source": {
          "name": "张三"
        }
      }
    ]
  }
}

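Note: before searching, check which tokens the text is analyzed into. If your search input does not correspond to any of those tokens, nothing will be found.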