Configuring and Using the IK Chinese Analyzer for Elasticsearch

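Elasticsearch's built-in analyzers handle Chinese poorly: they split Chinese text into individual characters for full-text search, so queries do not return the results you would expect.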

IK Analysis for Elasticsearch: https://github.com/medcl/elasticsearch-analysis-ik

IK ships with two analyzers:

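  •     ik_max_word: splits the text at the finest granularity, extracting as many words as possible
  •     ik_smart: splits the text at the coarsest granularity; text already assigned to one word is not reused by another word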
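To see the difference on your own cluster, the same text can be sent through each analyzer with the `_analyze` API. The sketch below is only an illustration and rests on assumptions that are not in the original: the `ES_URL` variable (defaulting to `http://localhost:9200`), the helper names `analyze_body` and `compare_analyzers`, and the sample sentence in the usage note. The built-in `standard` analyzer is included as a baseline, and the IK plugin must already be installed.

```shell
#!/usr/bin/env bash
# Assumption: Elasticsearch with the IK plugin is reachable at ES_URL.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build the JSON body for the _analyze API.
# $1 = analyzer name, $2 = text (must not contain double quotes or backslashes).
analyze_body() {
  printf '{"analyzer":"%s","text":"%s"}' "$1" "$2"
}

# Send the same text through the baseline and both IK analyzers
# and print the raw JSON responses one after another.
compare_analyzers() {
  for analyzer in standard ik_max_word ik_smart; do
    echo "== ${analyzer} =="
    curl -s -H 'Content-Type: application/json' \
      -XPOST "${ES_URL}/_analyze?pretty" \
      -d "$(analyze_body "${analyzer}" "$1")"
  done
}
```

For example, `compare_analyzers "中华人民共和国国歌"` prints three token lists. Going by the descriptions above, `standard` should return single characters, `ik_max_word` the largest number of overlapping words, and `ik_smart` the fewest.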

Usage

Create an index named iktest and define a custom analyzer named ik on it that uses the ik_max_word tokenizer. Also create a type named article with a subject field that is analyzed with ik_max_word:

[root@k8s-0001 bin]# curl -H "Content-Type: application/json" -XPUT 'http://114.116.97.49:9200/iktest?pretty' -d '{
>     "settings" : {
>         "analysis" : {
>             "analyzer" : {
>                 "ik" : {
>                     "tokenizer" : "ik_max_word"
>                 }
>             }
>         }
>     },
>     "mappings" : {
>         "article" : {
>             "dynamic" : true,
>             "properties" : {
>                 "subject" : {
>                     "type" : "text",
>                     "analyzer" : "ik_max_word"
>                 }
>             }
>         }
>     }
> }'
{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "iktest"
}

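Bulk-index a few documents. I set the _id metadata explicitly so the results are easier to follow; the subject values are a few news headlines picked at random: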

[root@k8s-0001 bin]# curl -H "Content-Type: application/json" -XPOST 'http://114.116.97.49:9200/iktest/article/_bulk?pretty' -d '
> { "index" : { "_id" : "1" } }
> {"subject" : "“闺蜜”崔顺实被韩检方传唤 韩总统府促彻查真相" }
> { "index" : { "_id" : "2" } }
> {"subject" : "韩举行“护国训练” 青瓦台:决不准国家安全出问题" }
> { "index" : { "_id" : "3" } }
> {"subject" : "媒体称FBI已经取得搜查令 检视希拉里电邮" }
> { "index" : { "_id" : "4" } }
> {"subject" : "村上春树获安徒生奖 演讲中谈及欧洲排外问题" }
> { "index" : { "_id" : "5" } }
> {"subject" : "希拉里团队炮轰FBI 参院民主党领袖批其“违法”" }
> '
{
  "took" : 10,
  "errors" : false,
  "items" : [
    {
      "index" : {
        "_index" : "iktest",
        "_type" : "article",
        "_id" : "1",
        "_version" : 1,
        "result" : "created",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 0,
        "_primary_term" : 1,
        "status" : 201
      }
    },
    {
      "index" : {
        "_index" : "iktest",
        "_type" : "article",
        "_id" : "2",
        "_version" : 1,
        "result" : "created",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 0,
        "_primary_term" : 1,
        "status" : 201
      }
    },
    {
      "index" : {
        "_index" : "iktest",
        "_type" : "article",
        "_id" : "3",
        "_version" : 1,
        "result" : "created",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 0,
        "_primary_term" : 1,
        "status" : 201
      }
    },
    {
      "index" : {
        "_index" : "iktest",
        "_type" : "article",
        "_id" : "4",
        "_version" : 1,
        "result" : "created",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 1,
        "_primary_term" : 1,
        "status" : 201
      }
    },
    {
      "index" : {
        "_index" : "iktest",
        "_type" : "article",
        "_id" : "5",
        "_version" : 1,
        "result" : "created",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 0,
        "_primary_term" : 1,
        "status" : 201
      }
    }
  ]
}

Search for “希拉里和韩国” ("Hillary and South Korea"):

[root@k8s-0001 bin]# curl -H "Content-Type: application/json" -XPOST 'http://114.116.97.49:9200/iktest/article/_search?pretty' -d '
> {
>     "query" : { "match" : { "subject" : "希拉里和韩国" }},
>     "highlight" : {
>         "pre_tags" : ["<font color=red>"],
>         "post_tags" : ["</font>"],
>         "fields" : {
>             "subject" : {}
>         }
>     }
> }
> '
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "iktest",
        "_type" : "article",
        "_id" : "5",
        "_score" : 0.2876821,
        "_source" : {
          "subject" : "希拉里团队炮轰FBI 参院民主党领袖批其“违法”"
        },
        "highlight" : {
          "subject" : [
            "<font color=red>希拉里</font>团队炮轰FBI 参院民主党领袖批其“违法”"
          ]
        }
      },
      {
        "_index" : "iktest",
        "_type" : "article",
        "_id" : "3",
        "_score" : 0.2876821,
        "_source" : {
          "subject" : "媒体称FBI已经取得搜查令 检视希拉里电邮"
        },
        "highlight" : {
          "subject" : [
            "媒体称FBI已经取得搜查令 检视<font color=red>希拉里</font>电邮"
          ]
        }
      }
    ]
  }
}
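Only documents 3 and 5 come back, even though the query also mentions South Korea. One way to investigate is to ask the index how the analyzer mapped to subject tokenizes the query string. This is a minimal sketch under assumptions that are not in the original: the `ES_URL` variable and the helper names `field_analyze_body` and `explain_query_tokens` are made up for illustration.

```shell
#!/usr/bin/env bash
# Assumption: the iktest index from above exists and ES_URL points at the cluster.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build an _analyze body that reuses whatever analyzer is mapped to a field.
# $1 = field name, $2 = text (must not contain double quotes or backslashes).
field_analyze_body() {
  printf '{"field":"%s","text":"%s"}' "$1" "$2"
}

# Print the tokens the subject field's analyzer produces for a query string.
explain_query_tokens() {
  curl -s -H 'Content-Type: application/json' \
    -XPOST "${ES_URL}/iktest/_analyze?pretty" \
    -d "$(field_analyze_body subject "$1")"
}
```

Calling `explain_query_tokens "希拉里和韩国"` lists the terms the match query searches for. A match query combines its terms with OR by default, so a document is returned if it contains any one of them. If 韩国 is emitted as a single token, that would explain why headlines 1 and 2, which contain 韩 only as part of other words, are not matched.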

Hot-word update configuration

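Internet slang changes quickly. How can newly coined trending words (or other specific terms) be picked up by our search in real time?
First, test how ik currently handles such a word.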
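A test of this kind could look like the sketch below, which asks IK to tokenize a candidate word before any dictionary change is made. It is an illustration only: the `ES_URL` variable, the helper names `hotword_body` and `test_hotword`, and the sample word in the usage note are assumptions, not part of the original.

```shell
#!/usr/bin/env bash
# Assumption: Elasticsearch with the IK plugin is reachable at ES_URL.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build an _analyze body for ik_max_word, the analyzer used by the subject field.
# $1 = candidate word (must not contain double quotes or backslashes).
hotword_body() {
  printf '{"analyzer":"ik_max_word","text":"%s"}' "$1"
}

# Print how IK currently tokenizes a candidate hot word.
test_hotword() {
  curl -s -H 'Content-Type: application/json' \
    -XPOST "${ES_URL}/_analyze?pretty" \
    -d "$(hotword_body "$1")"
}
```

For example, run `test_hotword "王者荣耀"` (a sample word chosen purely for illustration). If the response breaks the word into several shorter tokens, IK's dictionary does not yet treat it as a single term.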
