Getting Started with ES - Term-Based Queries

Date: 2020-10-02


Preparation

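First, a note on versions: I am using ES 5.2.0 here.

To make things easier to follow, all examples use an index with the format below. The documents are netflow traffic records captured by PMACCT, and every example in this article is based on this index.

This article covers introductory ES material, mainly filtering on terms; treat it as the basics.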

{
                "_index": "shflows_agg_1600358400",
                "_type": "shflows_agg",
                "_id": "node1_1600359600_0_172698718_shflows_agg_0",
                "_score": 1.0,
                "_source": {
                    "collector": "node1",
                    "src_port": "443",
                    "timestamp": 1600359600,
                    "device_ip": "1.1.1.1",
                    "flows": "40",
                    "dst_host": "2.2.2.2",
                    "TAG": 10001,
                    "router_ip": 172698718,
                    "dst_port": "16384",
                    "pkts": 40000,
                    "bits": 320000000000,
                    "src_host": "3.3.3.3"
                }
            },

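Before getting into search proper, one concept needs to be clear. Many people confuse Term queries with full-text queries when they start learning ES queries.

First, a Term is the smallest unit that carries meaning; both search and statistical language models operate on Terms.

In ES, the input to a Term query is not analyzed at all. The input is treated as a whole and matched against the terms in the inverted index, and the results are scored with the relevance formula and returned. A Term query can also be wrapped in Constant Score to turn it into a filter, which skips scoring and can use the cache, improving performance.

Although the query input is not analyzed, a text field is analyzed when the document is indexed. This sometimes means a search returns nothing. For example, take a doc whose name is 'Jack': the standard analyzer indexes it as the lowercase term jack, so a Term query for Jack finds nothing. Either query for lowercase jack, or run the Term query against the field's keyword sub-field (for example name.keyword), which stores the original value.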
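The mismatch between an unanalyzed Term input and an analyzed field can be reproduced without a cluster. The following is a minimal sketch, assuming a toy `analyze` function that only splits on non-alphanumerics and lowercases (a rough stand-in for the standard analyzer, not its real implementation); `build_index` and `term_query` are hypothetical helper names.

```python
import re
from collections import defaultdict


def analyze(text):
    """Rough stand-in for the standard analyzer: split on non-alphanumerics, lowercase."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]


def build_index(docs, field):
    """Build an inverted index {term: set(doc_id)} for one analyzed text field."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for term in analyze(doc[field]):
            index[term].add(doc_id)
    return index


def term_query(index, value):
    """A term query looks the input up as-is, with no analysis."""
    return index.get(value, set())


docs = {1: {"name": "Jack"}, 2: {"name": "Rose Jack"}}
name_index = build_index(docs, "name")

print(term_query(name_index, "Jack"))  # no hit: only the lowercase term was indexed
print(term_query(name_index, "jack"))  # hits both documents
```

The lookup for `Jack` comes back empty because the only term stored in the toy index is `jack`.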

Term queries include:

Term Query
Range Query
Exists Query
Prefix Query
Wildcard Query

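A full-text query, by contrast, works on the full text.

In ES, text is analyzed both at index time and at search time. The query string is first passed to the appropriate analyzer, which produces a list of terms to search for.

Full-text queries include: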

Match Query
Match Phrase Query
Query String Query
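For contrast with Term queries, here is a small sketch of what a full-text query does with its input: the query string is analyzed first, and each resulting term is then looked up. It assumes the same toy lowercase analyzer as a stand-in for a real one, a hand-written inverted index, and OR semantics between terms; `match_query` is a hypothetical name.

```python
import re


def analyze(text):
    """Toy analyzer: split on non-alphanumerics and lowercase each token."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]


def match_query(inverted_index, query_string):
    """Analyze the query string, then union the postings of every resulting term."""
    hits = set()
    for term in analyze(query_string):
        hits |= inverted_index.get(term, set())
    return hits


# Hand-written postings for illustration: term -> ids of docs containing it.
inverted_index = {"jack": {1, 2}, "rose": {2}}

print(analyze("Jack ROSE"))
print(match_query(inverted_index, "Jack"))
```

Because the input is lowercased before the lookup, `Jack` now finds the documents that the unanalyzed Term lookup would miss.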

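All the examples below are Term queries.

ES search overview

The ES search API falls into two broad categories:

URL parameter search, suited to simple searches.
Request Body search (DSL), suited to more complex searches.

Selecting which indices to query: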

/_search: all indices in the cluster

/index1/_search: the index1 index

/index1,index2/_search: the index1 and index2 indices

/index*/_search: all indices whose names start with index
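The four path forms above follow one rule: join the index names or patterns with commas and append `_search`. A minimal sketch, where `search_path` is a hypothetical helper and not part of any ES client:

```python
def search_path(*indices):
    """Return the _search path for zero or more index names or wildcard patterns."""
    if not indices:
        return "/_search"  # no index given: search every index in the cluster
    return "/" + ",".join(indices) + "/_search"


print(search_path())
print(search_path("index1"))
print(search_path("index1", "index2"))
print(search_path("index*"))
```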

URL queries

Querying a specific field:

Use the q parameter and pass the condition as a key:value pair.

Example 1: find the documents whose device IP is 1.1.1.1:

/shflows_agg_*/_search?q=device_ip:1.1.1.1 

{
    "profile": "true"
}

profile shows how the query was executed.

Result: the type is TermQuery, and the search runs against the specified field: "device_ip:1.1.1.1"

"profile": {
        "shards": [
            {
                "id": "[e_Ac3cNJRtmVxFW9DwOwjA][shflows_agg_1600531200][0]",
                "searches": [
                    {
                        "query": [
                            {
                                "type": "TermQuery",
                                "description": "device_ip:1.1.1.1",
                                "time": "445.8407320ms",
............
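When the URL is assembled in code rather than typed by hand, the q value should be URL-encoded. The sketch below uses only the Python standard library; `uri_search` is a hypothetical helper, and leaving the colon unescaped (`safe=":"`) is a readability choice made here, not an ES requirement.

```python
from urllib.parse import urlencode


def uri_search(index_pattern, value, field=None):
    """Build a URI search path; without a field, only the bare value is sent in q."""
    q = f"{field}:{value}" if field else str(value)
    return f"/{index_pattern}/_search?" + urlencode({"q": q}, safe=":")


print(uri_search("shflows_agg_*", "1.1.1.1", field="device_ip"))
print(uri_search("shflows_agg_*", "1.1.1.1"))
```

The second call omits the field name, so only the value is passed in q.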

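Generic queries

If only a value is given and no key, the value is matched against all keys of the document.

Example 2: find documents in which any field contains 1.1.1.1. For instance, documents where src_host or dst_host is 1.1.1.1 are also returned.

/shflows_agg_*/_search?q=1.1.1.1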

{
    "profile": "true"
}

Result: the field in description changes to _all

"profile": {
        "shards": [
            {
                "id": "[e_Ac3cNJRtmVxFW9DwOwjA][shflows_agg_1600531200][0]",
                "searches": [
                    {
                        "query": [
                            {
                                "type": "TermQuery",
                                "description": "_all:1.1.1.1",
 ......

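DSL queries

Method: write JSON in the request body to express more complex queries.

Querying all documents

Example 1: query all documents in the current index:

/shflows_agg_index1/_search

{
    "query": {
        "match_all": {} # return all docs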
    }
}

Sorting and paginating documents

Example 2: query all documents in the current index, sorted by time

/shflows_agg_index1/_search

{
    "from": 10,
    "size": 20,
    "sort": [{"timestamp": "desc"}],
    "query": {
        "match_all": {} # return all docs
    }
}
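The from value is an offset in documents, not a page number. If the caller thinks in pages, the offset can be derived as shown in this sketch; `page_body` and its 1-based `page` argument are assumptions made for illustration.

```python
import json


def page_body(page, size, sort_field="timestamp", order="desc"):
    """Build a match_all request body for a 1-based page number."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    return {
        "from": (page - 1) * size,  # number of hits to skip
        "size": size,               # number of hits to return
        "sort": [{sort_field: order}],
        "query": {"match_all": {}},
    }


print(json.dumps(page_body(page=2, size=20), indent=4))
```

With 20 hits per page, page 2 starts at offset 20, so the request skips the first 20 hits.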

Choosing which fields are returned

Example: return only the listed fields of each document

/shflows_agg_index1/_search

{
    "_source": ["timestamp", "device_ip"],
    "query": {
        "match_all": {} # return all docs
    }
}
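The effect of `_source` on a single hit can be pictured with a plain dictionary filter. This is a local illustration only: `filter_source` is a hypothetical name, and it handles exact field names only.

```python
def filter_source(source, fields):
    """Keep only the listed top-level fields of a document's _source."""
    return {key: value for key, value in source.items() if key in fields}


# A few fields from the sample document at the top of the article.
source = {
    "collector": "node1",
    "timestamp": 1600359600,
    "device_ip": "1.1.1.1",
    "dst_port": "16384",
}

print(filter_source(source, ["timestamp", "device_ip"]))
```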

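Using script fields to compute over several values in a document

Example: concatenate the device IP and destination port of each document and return the result as ip_address: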

/shflows_agg_index1/_search

{
    "script_fields": {
        "ip_address":{
            "script": {
                "lang": "painless",
                "inline": "params.comment + doc['device_ip'].value + ':' + doc['dst_port'].value",
                "params" : {
                    "comment" : "ip address is: " 
                }
            }
        }
    },
    "query": {
        "match_all": {} 
    }
}

Result: fields now contains the new value built by the script

{
    "took": 84,
    "timed_out": false,
    "_shards": {
        "total": 10,
        "successful": 10,
        "failed": 0
    },
    "hits": {
        "total": 36248845,
        "max_score": 1.0,
        "hits": [
            {
                "_index": "shflows_agg_1600358400",
                "_type": "shflows_agg",
                "_id": "node1_1600359600_0_172698718_shflows_agg_0",
                "_score": 1.0,
                "fields": {
                    "ip_address": [
                        "ip address is: 1.1.1.1:16384"
                    ]
                }
            },
            {
                "_index": "shflows_agg_1600358400",
                "_type": "shflows_agg",
                "_id": "node1_1600359600_0_172698718_shflows_agg_5",
                "_score": 1.0,
                "fields": {
                    "ip_address": [
                        "ip address is: 1.1.1.1:443"
                    ]
                }
            },
.......
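To check what the script expression produces for a given document, the same concatenation can be written in Python. This sketch only mirrors the string logic of the Painless one-liner. It does not run Painless, and `ip_address_field` is a hypothetical name.

```python
def ip_address_field(source, comment="ip address is: "):
    """Mirror params.comment + device_ip + ':' + dst_port; ES returns script fields as lists."""
    return [comment + str(source["device_ip"]) + ":" + str(source["dst_port"])]


# Values taken from the sample document at the top of the article.
source = {"device_ip": "1.1.1.1", "dst_port": "16384"}

print(ip_address_field(source))
```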

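Query Context or Filter Context

In ES, a search clause runs in one of two contexts, Query or Filter:

Query context: relevance scores are computed during the search
Filter context: no scoring is needed, so results can be cached for better performance

Both contexts support:

Equality queries (term)
Range queries (range)

Example: query the docs whose dst_port is 443, with scoring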

/shflows_agg_index1/_search

{
    "profile": "true",
    "explain": true,
    "query": {
        "term": {"dst_port": 443}
    }
}

Result:

{
    "took": 191,
    "timed_out": false,
    "_shards": {
        "total": 11,
        "successful": 11,
        "failed": 0
    },
    "hits": {
        "total": 3871488,
        "max_score": 2.2973032,
        "hits": [
            {
                "_shard": "[shflows_agg_1600358400][0]",
                "_node": "RWTixYPtTieZaRgAH0NOkQ",
                "_index": "shflows_agg_1600358400",
                "_type": "shflows_agg",
                "_id": "node1_1600359600_0_172698718_shflows_agg_5",
                "_score": 2.2973032,  ####### note the computed score here
                "_source": {
                    "collector": "node1",
                    "src_port": "16384",
                    "timestamp": 1600359600,

Using a filter query:

/shflows_agg_index1/_search

{
  "profile": "true",
  "explain": true,
  "query": {
   # use constant_score to skip scoring
    "constant_score": {
      "filter": {
        "term": {
          "dst_port": 443
        }
      }
    }
  }
}

Result:

"hits": {
        "total": 3872768,
        "max_score": 1.0, # 1.0 is a fixed value
.....

"profile": {
        "shards": [
            {
                "id": "[e_Ac3cNJRtmVxFW9DwOwjA][shflows_agg_1600531200][0]",
                "searches": [
                    {
                        "query": [
                            {
                                "type": "ConstantScoreQuery", ## constant-score query
                                "description": "ConstantScore(dst_port:443)",

Example: a terms query for docs whose dst_port is 443 or 22

/shflows_agg_index1/_search

{
  "profile": "true",
  "explain": true,
  "query": {
   # use constant_score to skip scoring
    "constant_score": {
      "filter": {
        "terms": {
          "dst_port": [443,22]
        }
      }
    }
  }
}
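Both filter requests above share the same wrapper around a term-level clause. When request bodies are generated in code, that wrapper can be factored out. A minimal sketch, with `as_filter` as a hypothetical helper name:

```python
import json


def as_filter(clause):
    """Wrap a term-level clause in constant_score so it runs without scoring."""
    return {"constant_score": {"filter": clause}}


term_body = {"query": as_filter({"term": {"dst_port": 443}})}
terms_body = {"query": as_filter({"terms": {"dst_port": [443, 22]}})}

print(json.dumps(term_body, indent=2))
print(json.dumps(terms_body, indent=2))
```

The profile and explain flags from the article's requests are left out here to keep the sketch short.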

Example: range query on a numeric field

{
    "profile": "true",
    "explain": true,
    "query": {
        "constant_score": {
            "filter": {
                "range": {
                    "timestamp": {
                        # greater than or equal to
                        "gte": 1601049600,
                        # less than or equal to
                        "lte": 1601308800
                    }
                }
            }
        }
    }
}
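The matching rules of term, terms and range can be checked against one document with a few lines of Python. This is a local approximation under stated assumptions: values are compared as strings for term and terms (the sample document stores ports as strings while the requests pass numbers), range only handles numeric fields, and `matches` is a hypothetical name.

```python
import operator

RANGE_OPS = {"gt": operator.gt, "gte": operator.ge, "lt": operator.lt, "lte": operator.le}


def matches(source, clause):
    """Return True if `source` satisfies a single term, terms or range clause."""
    (kind, spec), = clause.items()
    (field, condition), = spec.items()
    value = source.get(field)
    if value is None:
        return False
    if kind == "term":
        return str(value) == str(condition)
    if kind == "terms":
        return str(value) in {str(c) for c in condition}  # any listed value may match
    if kind == "range":
        return all(RANGE_OPS[op](value, bound) for op, bound in condition.items())
    raise ValueError(f"unsupported clause: {kind}")


doc = {"dst_port": "16384", "timestamp": 1600359600}

print(matches(doc, {"term": {"dst_port": 16384}}))
print(matches(doc, {"terms": {"dst_port": [443, 22]}}))
print(matches(doc, {"range": {"timestamp": {"gte": 1600358400, "lte": 1600362000}}}))
```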

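Bool compound queries: filtering on multiple conditions

In ES, a bool query combines or nests one or more query clauses to express more complex queries.

A bool query has 4 kinds of clause:

must: results must match; contributes to the score
should: optional match, similar to OR, satisfied when any one condition matches; contributes to the score
must_not: results must not match; runs in Filter context and does not contribute to the score
filter: results must match; runs in Filter context and does not contribute to the score

must_not and filter perform better because no scoring is needed.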
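How the four clause types decide whether a document matches can be sketched as a small evaluator. It ignores scoring entirely and models query-context behaviour under one assumption: should clauses are only required when the bool query has no must or filter clause. Only term and nested bool clauses are supported, and the function names are hypothetical.

```python
def clause_matches(source, clause):
    """Evaluate a term clause or a nested bool clause against one document."""
    if "term" in clause:
        (field, expected), = clause["term"].items()
        return str(source.get(field)) == str(expected)
    if "bool" in clause:
        return bool_matches(source, clause["bool"])
    raise ValueError("only term and bool clauses are modelled here")


def bool_matches(source, bool_query):
    """Decide matching (not scoring) for must / filter / should / must_not."""
    must = bool_query.get("must", [])
    filters = bool_query.get("filter", [])
    should = bool_query.get("should", [])
    must_not = bool_query.get("must_not", [])
    if not all(clause_matches(source, c) for c in must + filters):
        return False
    if any(clause_matches(source, c) for c in must_not):
        return False
    if should and not (must or filters):
        # With nothing else required, at least one should clause has to match.
        return any(clause_matches(source, c) for c in should)
    return True


doc = {"dst_port": "80", "src_host": "9.9.9.9"}
port_80 = {"term": {"dst_port": 80}}
from_1111 = {"term": {"src_host": "1.1.1.1"}}

sibling = {"must": [port_80], "should": [from_1111]}
nested = {"must": [port_80, {"bool": {"should": [from_1111]}}]}

print(bool_matches(doc, sibling))  # should next to must does not exclude the doc
print(bool_matches(doc, nested))   # should nested inside must is required
```

The two calls show why the placement of should matters: as a sibling of must it is optional, while a bool that contains only should clauses and is placed inside must becomes a required OR.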

Example: find the docs whose timestamp is between 1601171628 and 1601175228, whose destination port is 80, and whose source IP (src_host) is any one of [1.1.1.1, 1.1.1.2, 1.1.1.3].

{
    "profile": "true",
    "explain": true,
    "query": {
        "bool": {
            "must": [
                {
                    "range": {
                        "timestamp": {
                            "gte": 1601171628,
                            "lte": 1601175228
                        }
                    }
                },
                {
                    "term": {
                        "dst_port": 80
                    }
                },
                {
                    "bool": {
                     # Note that this should sits inside the must array. If it were a sibling of must, it could not change which docs must matches.
                    "should": [
                            {
                                "term": {
                                    "src_host": "1.1.1.1"
                                }
                            },
                            {
                                "term": {
                                    "src_host": "1.1.1.2"
                                }
                            },
                            {
                                "term": {
                                    "src_host": "1.1.1.3"
                                }
                                }
                            }
                        ]
                    }
                }
            ]
        }
    }
}
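Writing one term clause per IP by hand is error-prone once the list grows. The sketch below generates the same request shape from a list of addresses. `flow_query` and its parameter names are hypothetical, and the profile and explain flags are omitted.

```python
import json


def flow_query(start, end, dst_port, src_hosts):
    """Build the bool request: time range AND destination port AND (any source host)."""
    any_src_host = {
        "bool": {"should": [{"term": {"src_host": ip}} for ip in src_hosts]}
    }
    return {
        "query": {
            "bool": {
                "must": [
                    {"range": {"timestamp": {"gte": start, "lte": end}}},
                    {"term": {"dst_port": dst_port}},
                    any_src_host,  # nested inside must, so one of the hosts is required
                ]
            }
        }
    }


body = flow_query(1601171628, 1601175228, 80, ["1.1.1.1", "1.1.1.2", "1.1.1.3"])
print(json.dumps(body, indent=4))
```

Since `json.dumps` emits strict JSON, the generated body contains no inline comments or stray punctuation that a hand-edited request might pick up.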
