Getting Started with Elasticsearch and Its Query Syntax (with the IK Chinese Analyzer)

Date: 2019-11-18

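Full-text search is a common feature these days. You can of course build it with MySQL plus Sphinx, but the open-source Elasticsearch (ES for short) is currently the first choice among full-text search engines. Sites such as GitHub and Wikipedia use ES, which can store, search, and analyze data quickly.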

1. Installation and Startup

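ES depends on a Java runtime. Run java -version on the command line (the single-dash form also works on Java 8, where java --version is not recognized). If it prints version information, Java is already installed; otherwise, install Java first.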

Then we can install ES. There are two options: 1. install it as a Docker container; 2. install it from the archive.

I installed it from the archive.

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.3.1.tar.gz
tar -xzf elasticsearch-6.3.1.tar.gz
cd elasticsearch-6.3.1/

Then run ./bin/elasticsearch to start ES. Open localhost:9200 in a browser; if it returns a JSON document describing the node (name, cluster name, version), ES is up and running.

If you are not familiar with ES, have a look at this article by Ruan Yifeng: http://www.ruanyifeng.com/blog/2017/08/elasticsearch.html

2. IK Word Segmentation

ES's default analyzer targets English text and handles Chinese poorly, so we need to install the IK Chinese analyzer. Let's look at the difference.

One thing to note here: many ES API requests are GETs that carry a request body, so sending them from a browser or a tool such as Postman may fail. There are two ways around this.

1. Send the request with curl from the command line.

curl -X GET "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer" : "standard",
  "text" : "this is a test"
}
'

2. Send the request from code.

// Request sent with PHP's Guzzle package
use GuzzleHttp\Client;

$client = new Client();
$response = $client->get('localhost:9200/_analyze', [
    'json' => [
        'analyzer'  => 'standard',
        'text'      => "功能进阶",
    ]
]);

$res = ($response->getBody()->getContents());
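The same "GET with a JSON body" idea can be sketched outside PHP as well. The block below is a minimal Python sketch using only the standard library; ES_URL, build_request, and send are names made up for this illustration, and it assumes a local single-node ES like the one started above. It builds the request without sending it, so the method, URL, and body can be inspected first.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local ES, as in this article


def build_request(path, body, method="GET"):
    """Build (but do not send) a JSON request for an ES endpoint."""
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        ES_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method=method,
    )


def send(request):
    """Send the request and decode the JSON response (needs a running ES)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


req = build_request("/_analyze", {"analyzer": "standard", "text": "this is a test"})
print(req.get_method(), req.full_url)
# With ES running, send(req) returns the token list as a Python dict.
```

If a client refuses to attach a body to a GET, passing method="POST" should also work, since _analyze and _search accept POST too.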

Now let's compare the IK analyzer with ES's default analyzer, sending the same request as above.

Result with the ES default analyzer:

{
  "tokens": [
    {
      "token": "功",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "能",
      "start_offset": 1,
      "end_offset": 2,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "进",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "阶",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    }
  ]
}

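Result with the IK analyzer:

IK comes with two analyzers. ik_smart produces as few tokens as possible (coarsest-grained segmentation); ik_max_word produces as many tokens as possible (finest-grained segmentation).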

$response = $client->get('localhost:9200/_analyze', [
    'json' => [
        'analyzer'  => 'ik_max_word',
        'text'      => "功能进阶",
    ]
]);

The result is:

{
  "tokens": [
    {
      "token": "功能",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "能进",
      "start_offset": 1,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "进阶",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 3
    }
  ]
}

And with ik_smart:

$response = $client->get('localhost:9200/_analyze', [
    'json' => [
        'analyzer'  => 'ik_smart',
        'text'      => "功能进阶",
    ]
]);

the result is:

{
  "tokens": [
    {
      "token": "功能",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "进阶",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 3
    }
  ]
}
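To compare the three results at a glance, it helps to reduce each response to its token strings. The sketch below does that in Python; the three dictionaries are trimmed copies of the responses shown above (offsets, types, and positions omitted), and token_list is a helper name invented for this example.

```python
# Trimmed copies of the _analyze responses shown in this article.
standard = {"tokens": [{"token": "功"}, {"token": "能"}, {"token": "进"}, {"token": "阶"}]}
ik_max_word = {"tokens": [{"token": "功能"}, {"token": "能进"}, {"token": "进阶"}]}
ik_smart = {"tokens": [{"token": "功能"}, {"token": "进阶"}]}


def token_list(analyze_response):
    """Return just the token strings from an _analyze response body."""
    return [item["token"] for item in analyze_response["tokens"]]


results = [("standard", standard), ("ik_max_word", ik_max_word), ("ik_smart", ik_smart)]
for name, response in results:
    print(f"{name:12} {token_list(response)}")
```

The standard analyzer yields one token per character, ik_max_word yields every two-character word it recognizes (including the overlapping one), and ik_smart keeps only the non-overlapping split.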

Their names pretty much give the difference away, haha.

Suppose someone asks: I want “功能进阶” to be searched as a single term. Is that possible?

Of course!!

For that we need a custom dictionary. Go to your ES directory and run cd config/analysis-ik/ to enter the IK configuration directory. Find the file IKAnalyzer.cfg.xml and open it with vi IKAnalyzer.cfg.xml.

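Under config/analysis-ik in my ES directory I created custom/mydict.dic and added its path to the ext_dict entry in IKAnalyzer.cfg.xml. This is the custom dictionary file. To use several files, separate them with an ASCII semicolon (;).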
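For reference, the relevant part of IKAnalyzer.cfg.xml would look roughly like this. It is a sketch based on the default file shipped with the IK plugin: the comments are paraphrased, and custom/mydict.dic is the path used in this article.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- extension dictionaries, relative to this config directory; separate several with ; -->
    <entry key="ext_dict">custom/mydict.dic</entry>
    <!-- extension stop-word dictionaries -->
    <entry key="ext_stopwords"></entry>
</properties>
```

As far as I know, the dictionary file holds one word per line and should be saved as UTF-8, and ES has to be restarted before a change to a local dictionary takes effect.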

I added the word “功能进阶” to the custom dictionary file. With it in place, the ik_smart analyzer returns:

{
  "tokens": [
    {
      "token": "功能进阶",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 0
    }
  ]
}

Good, that is exactly what we want.
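Why does one extra dictionary entry change the result? A toy model makes the intuition concrete. The sketch below is a greedy forward longest-match segmenter in Python; it is not IK's actual algorithm, only a simplified stand-in, and the two-word base dictionary is an assumption chosen to reproduce the ik_smart output above.

```python
def segment(text, dictionary):
    """Greedy forward longest-match segmentation (toy model, not IK's real algorithm)."""
    tokens = []
    longest = max(len(word) for word in dictionary)
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character.
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in dictionary or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens


base = {"功能", "进阶"}  # assumed stand-in for the built-in dictionary
before = segment("功能进阶", base)
after = segment("功能进阶", base | {"功能进阶"})
print(before)
print(after)
```

Once the four-character word is in the dictionary, the longest match wins and the text comes out as a single token, which mirrors what the custom dictionary does for ik_smart.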

3. Query DSL

match

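The syntax is shown below. title is the field to query and can be replaced with any field; query is the text to search for. Here the text is analyzed into 'php' and '后台'. Because operator is or, ES returns documents whose title contains either '后台' or 'php'. With operator set to and, it returns only documents whose title contains both '后台' and 'php'.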

$response = $client->get('localhost:9200/accounts/person/_search', [
    'json' => [
        'query' => [
            'match' => [
                'title' => [
                    'query' => '后台php',
                    'operator' => 'or',
                ]
            ]
        ]
    ]
]);
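The effect of operator can be sketched without a cluster. The Python block below builds the same match body and then applies a toy version of the or/and rule to two made-up documents; match_body, title_matches, and the sample titles are all hypothetical, and real ES additionally scores and ranks the hits.

```python
import json


def match_body(field, text, operator="or"):
    """Build the request body for a match query on one field."""
    return {"query": {"match": {field: {"query": text, "operator": operator}}}}


def title_matches(title_tokens, query_tokens, operator="or"):
    """Toy model of operator: 'or' needs any query token, 'and' needs all of them."""
    hits = [token in title_tokens for token in query_tokens]
    return all(hits) if operator == "and" else any(hits)


query_tokens = ["后台", "php"]  # how the article says '后台php' is analyzed
titles = {
    "doc1": {"php", "开发"},  # hypothetical analyzed titles
    "doc2": {"后台", "php"},
}

print(json.dumps(match_body("title", "后台php", "and"), ensure_ascii=False))
matched = {}
for operator in ("or", "and"):
    matched[operator] = [
        doc for doc, tokens in titles.items()
        if title_matches(tokens, query_tokens, operator)
    ]
    print(operator, matched[operator])
```

With or, both documents match because each contains at least one token; with and, only doc2 does.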

multi_match

To search across several fields, use a multi_match query:

$response = $client->get('localhost:9200/accounts/person/_search', [
    'json' => [
        'query' => [
            'multi_match' => [
                'query' => '张三 php',
                'fields' => ['title', 'desc', 'user']
            ]
        ]
    ]
]);
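The Guzzle array above maps one-to-one onto the JSON that ES receives. As a small sketch, the Python helper below (multi_match_body is an invented name) builds that body and prints it, which is handy when you want to paste it into a curl -d argument.

```python
import json


def multi_match_body(text, fields):
    """Build a multi_match body that searches the same text in several fields."""
    if not fields:
        raise ValueError("fields must name at least one field")
    return {"query": {"multi_match": {"query": text, "fields": list(fields)}}}


body = multi_match_body("张三 php", ["title", "desc", "user"])
# ensure_ascii=False keeps the Chinese text readable in the printed JSON
print(json.dumps(body, ensure_ascii=False, indent=2))
```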

query_string

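The syntax is shown below. It plays a role similar to operator in a match query, except that here the terms are combined with OR or AND inside the query text itself.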

$response = $client->get('localhost:9200/accounts/person/_search', [
    'json' => [
        'query' => [
            'query_string' => [
                'query' => '(张三) OR (php)',
                'default_field' => 'title',
            ]
        ]
    ]
]);

Querying several fields looks like this:

$response = $client->get('localhost:9200/accounts/person/_search', [
    'json' => [
        'query' => [
            'query_string' => [
                'query' => '(张三) OR (php)',
                'fields' => ['title', 'user'],
            ]
        ]
    ]
]);
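When the terms come from user input, the query text has to be assembled in code. The following Python sketch shows one way to do it; join_terms and query_string_body are hypothetical helpers, and the sketch does not escape query_string's reserved characters, which real input would need.

```python
def join_terms(terms, operator="OR"):
    """Combine terms into a query_string expression such as '(a) OR (b)'."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    return f" {operator} ".join(f"({term})" for term in terms)


def query_string_body(terms, fields, operator="OR"):
    """Build a query_string body: one field uses default_field, several use fields."""
    clause = {"query": join_terms(terms, operator)}
    if len(fields) == 1:
        clause["default_field"] = fields[0]
    else:
        clause["fields"] = list(fields)
    return {"query": {"query_string": clause}}


single = query_string_body(["张三", "php"], ["title"])
multi = query_string_body(["张三", "php"], ["title", "user"], operator="AND")
print(single)
print(multi)
```

The first call reproduces the default_field example above; the second switches to fields and AND.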

range query

This is a range query, for example finding people aged 10 to 20. The syntax is:

$response = $client->get('localhost:9200/accounts/person/_search', [
    'json' => [
        'query' => [
            'range' => [
                'age' => [
                    'gte' => 10,
                    'lte' => 20,
                ],
            ]
        ]
    ]
]);

gte means >=, lte means <=, gt means >, and lt means <.
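The four bounds can be mirrored directly with Python's comparison functions. The sketch below is only an illustration of how the bounds are read; range_body and in_range are invented names, and ES of course evaluates the range against its index rather than value by value like this.

```python
import operator

# The four bounds a range query accepts, mapped to Python comparisons.
BOUNDS = {"gte": operator.ge, "gt": operator.gt, "lte": operator.le, "lt": operator.lt}


def range_body(field, **bounds):
    """Build a range query body, rejecting bound names ES would not understand."""
    unknown = set(bounds) - set(BOUNDS)
    if unknown:
        raise ValueError(f"unknown bounds: {sorted(unknown)}")
    return {"query": {"range": {field: bounds}}}


def in_range(value, **bounds):
    """Toy check: the value must satisfy every bound that was given."""
    return all(BOUNDS[name](value, limit) for name, limit in bounds.items())


print(range_body("age", gte=10, lte=20))
for age in (9, 10, 20, 21):
    print(age, in_range(age, gte=10, lte=20))
```

Both ends are inclusive with gte and lte, so 10 and 20 match while 9 and 21 do not; switching to gt or lt excludes the endpoint.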

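bool query

The four bool clauses, must, filter, should, and must_not, all take the same form; use one of them as the clause key: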

$response = $client->get('localhost:9200/accounts/person/_search', [
    'json' => [
        'query' => [
            'bool' => [
                'must' => [ // or 'filter', 'should', 'must_not'
                    [
                        'query_string' => [
                            'query' => '研发',
                        ]
                    ],
                    [
                        'range' => [
                            'age' => [
                                'gt' => 20
                            ]
                        ]
                    ],

                ],
            ]
        ]
    ]
]);

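1) must: a document must satisfy both conditions above, and the clauses contribute to its score.

2) filter: like must, both conditions must be satisfied, but the clauses are not scored; with only filter clauses, the score is always 0.

3) should: a document only needs to satisfy one of the two conditions above. This holds when the bool query has no must or filter clause; alongside those, should clauses become optional and only affect the score.

4) must_not: a document must satisfy neither of the two conditions above.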
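The matching rules of the four clauses can be written down as a few lines of Python. This is a toy model under stated assumptions: bool_matches and select are invented names, each clause is a plain predicate over a made-up document, and scoring is not modeled at all.

```python
def bool_matches(doc, must=(), filter_=(), should=(), must_not=()):
    """Toy model of bool matching; each clause is a predicate taking the document."""
    if not all(check(doc) for check in must):
        return False
    if not all(check(doc) for check in filter_):
        return False
    if any(check(doc) for check in must_not):
        return False
    # should is only required when there is no must or filter clause
    if should and not must and not filter_:
        return any(check(doc) for check in should)
    return True


def select(docs, **clauses):
    """Return the names of the documents that match the given clauses."""
    return [doc["name"] for doc in docs if bool_matches(doc, **clauses)]


def is_rd(doc):
    return "研发" in doc["title"]


def older_than_20(doc):
    return doc["age"] > 20


docs = [  # hypothetical documents
    {"name": "a", "title": "研发工程师", "age": 25},
    {"name": "b", "title": "研发助理", "age": 18},
    {"name": "c", "title": "销售", "age": 30},
    {"name": "d", "title": "销售", "age": 19},
]
conditions = [is_rd, older_than_20]
for clause in ("must", "filter_", "should", "must_not"):
    print(clause, select(docs, **{clause: conditions}))
```

must and filter select the same documents and differ only in scoring, which is why filter is the usual choice for yes/no conditions such as the age range.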

All of the above is my own summary from reading the documentation; if anything is wrong, corrections are welcome.