A Detailed Guide to Query, Filter, Metric, and Bucketing in ES

Date: 2019-12-18

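In my own projects I have only used ES as an index database and have not studied its search features in depth. My knowledge of search engines is also limited, so this article covers only the simple (non-full-text) query APIs of ES.

I originally planned to cover the aggregation API in this article as well, but the text was getting too long to read comfortably, so the aggregation content has been moved to the next post.

Introduction

Describing theory and APIs on their own is dull and inefficient, so this article walks through the APIs with a concrete example. The table below shows the structure of the data table used throughout; the table name (type) is "student". Note that studentNo is the id of this table, that is, the value of the _id field is kept identical to the value of studentNo.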

Field	Meaning	Type	Indexed	Notes
studentNo	Student number	string	yes	id
name	Name	string	yes
male	Gender	string	yes
age	Age	integer	yes
birthday	Date of birth	date	yes
address	Home address	string	yes
classNo	Class	string	yes
isLeader	Whether the student is a class leader	boolean	yes

The mapping for the table structure above is shown below. The data is stored in an index named "student".

{
  "student": {
    "properties": {
      "studentNo": { "type": "string", "index": "not_analyzed" },
      "name":      { "type": "string", "index": "not_analyzed" },
      "male":      { "type": "string", "index": "not_analyzed" },
      "age":       { "type": "integer" },
      "birthday":  { "type": "date", "format": "yyyy-MM-dd" },
      "address":   { "type": "string", "index": "not_analyzed" },
      "classNo":   { "type": "string", "index": "not_analyzed" },
      "isLeader":  { "type": "boolean" }
    }
  }
}

The index holds the data below. All the APIs introduced in this article operate on this data set.

studentNo	name	male	age	birthday	classNo	address	isLeader
1	刘备	男	24	1985-02-03	1	湖南省长沙市	true
2	关羽	男	22	1987-08-23	2	四川省成都市	false
3	糜夫人	女	19	1990-06-12	1	上海市	false
4	张飞	男	20	1989-07-30	3	北京市	false
5	诸葛亮	男	18	1992-04-27	2	江苏省南京市	true
6	孙尚香	女	16	1994-05-21	3		false
7	马超	男	19	1991-10-20	1	黑龙江省哈尔滨市	false
8	赵云	男	23	1986-10-26	2	浙江省杭州市	false

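Query API

Queries in ES are very flexible, and the API is both convenient and powerful. I find the interface design very good: the endpoints are sensible and consistent in style, and well worth studying.

Query and Filter

ES provides two kinds of query API. A query applies its conditions during the search phase; a filter further filters the data that a query has returned. The two differ as follows:

A query computes the relevance between the query conditions and each candidate document and writes the result to a score field, much like a search engine. A filter only checks whether a document matches and computes no relevance, much like an ordinary database lookup, so a filter is faster than a query.
Data returned by a filter is cached automatically; data returned by a query is not.

Queries and filters can be used on their own or nested inside each other, which makes them very flexible.

Query

A query is the right choice when:

Full-text search is required.
The result depends on relevance, that is, the relevance between the query string and the data has to be computed.

(1) Match All Query

Returns all data, equivalent to a query with no conditions. The code below is a typical match_all call.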

curl -XPOST "192.168.1.101:9200/student/student/_search" -d ' { "query": { "match_all": {} } } '

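The result is shown below. All other queries return data in the same format.

{
  "took": 156,                 // time taken by the query (ms)
  "timed_out": false,          // whether the query timed out
  "_shards": {
    "total": 5,                // number of shards queried
    "successful": 5,           // shards that succeeded
    "failed": 0                // shards that failed
  },
  "hits": {
    "total": 8,                // number of records matched by this query
    "max_score": 1,            // highest score among all matched records
    "hits": [                  // list of records
      {
        "_index": "student",   // index the record belongs to
        "_type": "student",    // type the record belongs to
        "_id": "4",            // id of the record
        "_score": 1,           // score of this record
        "_source": {           // ES keeps the original data in the _source field
          "studentNo": "4",
          "name": "张飞",
          "male": "男",
          "age": "20",
          "birthday": "1989-07-30",
          "classNo": "3",
          "isLeader": "F"
        }
      },
      {
        ...                    // the remaining records have the same format and are omitted
      }
    ]
  }
}

When you run queries, you will notice that at most 10 records come back no matter how large the data set is. This is because the ES server paginates query results by default, with a default page size of 10. To control which records are returned, use the from and size fields, optionally sorting by a given field.

curl -XPOST "192.168.1.101:9200/student/student/_search" -d '
{
  "query": { "match_all": {} },
  "from": 2,                   // skip the first 2 records (from is a zero-based offset)
  "size": 4,                   // return 4 records
  "sort": {
    "studentNo": {             // sort by studentNo
      "order": "asc"           // ascending; use desc for descending
    }
  }
}
'

Note: do not set from too high (above 10000), or the ES server may become unable to serve requests because of frequent GC. In practice hardly anyone pages that far, but to keep ES available you should always put a limit on the page number of paginated queries.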
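One way to enforce such a limit is to build the from/size values in application code and reject pages that go too deep. The following is a minimal Python sketch under stated assumptions: build_page_query, MAX_OFFSET and DEFAULT_PAGE_SIZE are hypothetical names chosen for illustration, and the function only builds the request body; it does not talk to ES.

```python
import json

MAX_OFFSET = 10000       # assumed cap, taken from the note above
DEFAULT_PAGE_SIZE = 10   # the ES default page size mentioned above


def build_page_query(page, page_size=DEFAULT_PAGE_SIZE, sort_field="studentNo"):
    """Return a match_all request body for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    offset = (page - 1) * page_size  # from is a zero-based record offset
    if offset + page_size > MAX_OFFSET:
        raise ValueError("page is too deep: beyond %d records" % MAX_OFFSET)
    return {
        "query": {"match_all": {}},
        "from": offset,
        "size": page_size,
        "sort": {sort_field: {"order": "asc"}},
    }


body = build_page_query(page=2, page_size=4)
print(json.dumps(body, ensure_ascii=False))
```

The resulting dictionary can be serialized and sent as the body of the _search request shown above.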

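(2) Term Query

Term query. When run against a field that is not analyzed, it is an exact match. The example finds the student named "诸葛亮"; the result is the record with student number 5.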

curl -XPOST "192.168.1.101:9200/student/student/_search" -d ' { "query": { "term": { "name": "诸葛亮" } } } '

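(3) Bool Query

A bool (boolean) query is a compound query that combines several other query conditions. It has three kinds of logical clauses:

must: results must satisfy the condition (or list of conditions).
should: similar to an in condition. If the bool query contains no must clause, should means that at least one of the listed conditions must match.
must_not: results must not satisfy the condition (or list of conditions).

The example finds the class leaders of class 2; the result is the record with student number 5.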

curl -XPOST "192.168.1.101:9200/student/student/_search" -d ' { "query": { "bool": { "must": [ { "term": { "classNo": "2" } }, { "term": { "isLeader": "true" } } ] } } } '
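To make the semantics of must, should and must_not concrete, the sketch below evaluates a bool condition locally in plain Python against the sample table. It only illustrates the logic described above and says nothing about how ES executes the query; matches_bool and the predicate-style clauses are hypothetical.

```python
# (studentNo, classNo, isLeader) taken from the sample table
ROWS = [
    ("1", "1", True), ("2", "2", False), ("3", "1", False), ("4", "3", False),
    ("5", "2", True), ("6", "3", False), ("7", "1", False), ("8", "2", False),
]
STUDENTS = [{"studentNo": n, "classNo": c, "isLeader": l} for n, c, l in ROWS]


def matches_bool(doc, must=(), should=(), must_not=()):
    """Apply bool-style logic; every clause is a predicate on the document."""
    if not all(clause(doc) for clause in must):
        return False
    if any(clause(doc) for clause in must_not):
        return False
    # Without a must clause, at least one should clause has to match.
    if should and not must:
        return any(clause(doc) for clause in should)
    return True


def in_class_2(doc):
    return doc["classNo"] == "2"


def is_leader(doc):
    return doc["isLeader"]


leaders_of_class_2 = [d["studentNo"] for d in STUDENTS
                      if matches_bool(d, must=[in_class_2, is_leader])]
class_2_or_leader = [d["studentNo"] for d in STUDENTS
                     if matches_bool(d, should=[in_class_2, is_leader])]
print(leaders_of_class_2, class_2_or_leader)
```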

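(4) Ids Query

Query by id. The example looks up the students whose id is 1 or 2. Because the id is identical to studentNo, the result is the students with student numbers 1 and 2.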

curl -XPOST "192.168.1.101:9200/student/student/_search" -d ' { "query": { "ids": { "type": "student", "values": [ "1", "2" ] } } } '

(5) Prefix Query

Prefix query. The example finds students whose surname is 赵; the result is 赵云, student number 8.

curl -XPOST "192.168.1.101:9200/student/student/_search" -d ' { "query": { "prefix": { "name": "赵" } } } '

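(6) Range Query

Range query, for date and number fields. The example finds students aged 18 to 20; the result is the records with student numbers 3, 4, 5 and 7.

curl -XPOST "192.168.1.101:9200/student/student/_search" -d '
{
  "query": {
    "range": {
      "age": {
        "gte": "18",   // >=
        "lte": "20"    // <=
      }
    }
  }
}
'

In fact, ES stores date values as timestamps (long integers).

(7) Terms Query

Multi-term query: finds data that matches any term in a list. If the field is indexed as not_analyzed, a terms query is very similar to an in query in a relational database. The example finds the students with student numbers 1 and 3.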

curl -XPOST "192.168.1.101:9200/student/student/_search" -d ' { "query": { "terms": { "studentNo": [ "1", "3" ] } } } '

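(8) Wildcard Query

Wildcard query, a simplified form of regular-expression query. It supports two wildcards:

* matches any number of characters (including zero)
? matches exactly one character

The example finds students whose name ends with 亮; the result is 诸葛亮, student number 5.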

curl -XPOST "192.168.1.101:9200/student/student/_search" -d ' { "query": { "wildcard": { "name": "*亮" } } } '
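The two wildcards behave like shell-style globbing, so their effect can be tried out locally. The sketch below uses Python's fnmatch module on the names from the sample table. This is only an approximation for illustration: fnmatch also gives special meaning to square brackets, which goes beyond the two wildcards listed above, and wildcard_match is a hypothetical helper.

```python
import fnmatch

NAMES = ["刘备", "关羽", "糜夫人", "张飞", "诸葛亮", "孙尚香", "马超", "赵云"]


def wildcard_match(names, pattern):
    """Return the names that match the whole pattern, using * and ? wildcards."""
    return [name for name in names if fnmatch.fnmatchcase(name, pattern)]


ends_with_liang = wildcard_match(NAMES, "*亮")  # any prefix, then 亮
zhao_plus_one = wildcard_match(NAMES, "赵?")    # 赵 followed by exactly one character
three_chars = wildcard_match(NAMES, "???")      # exactly three characters
print(ends_with_liang, zhao_plus_one, three_chars)
```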

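(9) Regexp Query

Regular-expression query, the most flexible way to query string fields. The example finds students who live in 长沙市; the result is student number 1.

curl -XPOST "192.168.1.101:9200/student/student/_search" -d '
{
  "query": {
    "regexp": {
      "address": ".*长沙市.*"   // . matches any single character
    }
  }
}
'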
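The pattern above pads 长沙市 with .* on both sides. To my understanding this is needed because an ES regexp pattern has to match the whole term rather than a substring of it. The sketch below imitates that behaviour with Python's re.fullmatch on the addresses from the sample table; Python's regex syntax is not identical to the one ES uses, and regexp_match is a hypothetical helper, so treat it as an illustration of anchoring only.

```python
import re

# Addresses from the sample table; student 6 has no address and is left out.
ADDRESSES = {
    "1": "湖南省长沙市", "2": "四川省成都市", "3": "上海市", "4": "北京市",
    "5": "江苏省南京市", "7": "黑龙江省哈尔滨市", "8": "浙江省杭州市",
}


def regexp_match(values, pattern):
    """Return the keys whose value matches the pattern as a whole."""
    compiled = re.compile(pattern)
    return [key for key, value in values.items() if compiled.fullmatch(value)]


bare = regexp_match(ADDRESSES, "长沙市")        # must equal the whole address
padded = regexp_match(ADDRESSES, ".*长沙市.*")  # 长沙市 anywhere in the address
print(bare, padded)
```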

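Filter

A filter is the right choice for:

Binary yes/no queries
Queries on exact values

Many filters overlap with queries, so this section only shows how to call the APIs and does not repeat the general notes given above.

(1) Term Filter

Term filter. When run against a field that is not analyzed, it is an exact match. The example finds the student named "诸葛亮"; the result is the record with student number 5.

curl -XPOST "192.168.1.101:9200/student/student/_search" -d '
{
  "filter": {
    "term": {
      "name": "诸葛亮",
      "_cache": true   // the main difference from the query: caching can be configured
    }
  }
}
'

Every filter can cache its data by setting _cache to true. If the next request uses exactly the same condition and the cache has not expired, the data is read directly from the cache, which speeds up the query considerably.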

(2) Bool Filter

The example finds the class leaders of class 2; the result is the record with student number 5.

curl -XPOST "192.168.1.101:9200/student/student/_search" -d ' { "filter": { "bool": { "must": [ { "term": { "classNo": "2" } }, { "term": { "isLeader": "true" } } ] } } } '

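(3) And Filter

And filter: joins one or more conditions with logical AND. It is very similar to the must clause of a bool query; in fact, an and filter can be converted into the equivalent bool query. The example finds the class leaders of class 2; the result is student number 5.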

curl -XPOST "192.168.1.101:9200/student/student/_search" -d ' { "filter": { "and": [ { "term": { "classNo": "2" } }, { "term": { "isLeader": "true" } } ] } } '

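(4) Or Filter

Or filter: joins conditions with logical OR. The example lists the students who are in class 2 or are class leaders; the result is student numbers 1, 2, 5 and 8.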

curl -XPOST "192.168.1.101:9200/student/student/_search" -d ' { "filter": { "or": [ { "term": { "classNo": "2" } }, { "term": { "isLeader": "true" } } ] } } '

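(5) Exists Filter

Exists filter: finds data in which the given field has at least one non-null value. If the field is indexed as not_analyzed, this is similar to is not null in SQL. The example finds students whose address exists; the result is every student except number 6.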

curl -XPOST "192.168.1.101:9200/student/student/_search" -d ' { "filter": { "exists": { "field": "address" } } } '

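(6) Missing Filter

Missing filter, the exact opposite of the exists filter. The example finds students whose address does not exist; the result is student number 6.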

curl -XPOST "192.168.1.101:9200/student/student/_search" -d ' { "filter": { "missing": { "field": "address" } } } '
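The exists and missing filters split the data into two complementary sets. The sketch below shows that relationship in plain Python on the sample table. It assumes that the empty address of student 6 is stored as a null value (modelled here as None); the function names are hypothetical.

```python
# Address per studentNo; None models the missing address of student 6 (assumption).
ADDRESS_BY_STUDENT = {
    "1": "湖南省长沙市", "2": "四川省成都市", "3": "上海市", "4": "北京市",
    "5": "江苏省南京市", "6": None, "7": "黑龙江省哈尔滨市", "8": "浙江省杭州市",
}


def exists_filter(values):
    """Keys whose field has a non-null value (like is not null in SQL)."""
    return [key for key, value in values.items() if value is not None]


def missing_filter(values):
    """Keys whose field has no value (like is null in SQL)."""
    return [key for key, value in values.items() if value is None]


with_address = exists_filter(ADDRESS_BY_STUDENT)
without_address = missing_filter(ADDRESS_BY_STUDENT)
print(with_address, without_address)
```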

(7) Prefix Filter

Prefix filter. The example finds students whose surname is 赵; the result is 赵云, student number 8.

curl -XPOST "192.168.1.101:9200/student/student/_search" -d ' { "filter": { "prefix": { "name": "赵" } } } '

(8) Range Filter

Range filter, for date and number fields. The example finds students aged 18 to 20; the result is the records with student numbers 3, 4, 5 and 7.

curl -XPOST "192.168.1.101:9200/student/student/_search" -d ' { "filter": { "range": { "age": { "gte": "18", "lte": "20" } } } } '

(9) Terms Filter

Multi-term filter. The example finds the students with student numbers 1 and 3.

curl -XPOST "192.168.1.101:9200/student/student/_search" -d ' { "filter": { "terms": { "studentNo": [ "1", "3" ] } } } '

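(10) Regexp Filter

Regular-expression filter, the most flexible way to query string fields. The example finds students who live in 长沙市; the result is student number 1.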

curl -XPOST "192.168.1.101:9200/student/student/_search" -d ' { "filter": { "regexp": { "address": ".*长沙市.*" } } } '

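Using the Aggregations API

The aggregation features of ES can be used for simple data analysis. This article continues to use the data from the previous post as its example. The data is as follows: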

studentNo	name	male	age	birthday	classNo	address	isLeader
1	刘备	男	24	1985-02-03	1	湖南省长沙市	true
2	关羽	男	22	1987-08-23	2	四川省成都市	false
3	糜夫人	女	19	1990-06-12	1	上海市	false
4	张飞	男	20	1989-07-30	3	北京市	false
5	诸葛亮	男	18	1992-04-27	2	江苏省南京市	true
6	孙尚香	女	16	1994-05-21	3		false
7	马超	男	19	1991-10-20	1	黑龙江省哈尔滨市	false
8	赵云	男	23	1986-10-26	2	浙江省杭州市	false

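This article covers:

Using the metric API
Using the bucketing API
Nesting the two kinds of API

1. Aggregation API

The Aggregations API in ES grew out of the Facets feature. The official site is in the process of replacing Facets, and users are advised to use the Aggregations API rather than the Facets API. Aggregations in ES fall into two categories:

metric aggregations: mainly for number fields; they require ES to do a fair amount of computation
bucketing aggregations: define a set of "buckets" and assign the data to them. Very similar in meaning to the group by clause in SQL.

A metric can operate on the whole data set, or it can act as a sub-aggregation of a bucketing aggregation and operate on the data in each bucket. We can of course treat the whole data set as one big bucket that holds all the data.

The call format of the aggregation API is as follows:

"aggregations" : {                       // the aggregation operation; aggs can be used instead
    "<aggregation_name>" : {             // aggregation name, any string; used as the key in the response so the right result is easy to find
        "<aggregation_type>" : {         // aggregation type, such as min
            <aggregation_body>           // aggregation body; differs per aggregation type
        }
        [,"aggregations" : { [<sub_aggregation>]+ } ]?   // nested sub-aggregations, zero or more
    }
    [,"<aggregation_name_2>" : { ... } ]*                // further aggregations, zero or more
}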

1.1 Metric aggregations

(1) Min Aggregation

Minimum value, for number fields. The example finds the lowest age in class 2.

curl -XPOST "192.168.1.101:9200/student/student/_search" -d '
{
  "query": {                 // a query can be used first to select the data set
    "term": { "classNo": "2" }
  },
  "aggs": {
    "min_age": {
      "min": { "field": "age" }
    }
  }
}
'

The result is:

{
  "took": 19,                // the first part is the same as an ordinary query response
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 3,
    "max_score": 1.4054651,
    "hits": [
      {
        "_index": "student", "_type": "student", "_id": "2", "_score": 1.4054651,
        "_source": { "studentNo": "2", "name": "关羽", "male": "男", "age": "22", "birthday": "1987-08-23", "classNo": "2", "isLeader": "false" }
      },
      {
        "_index": "student", "_type": "student", "_id": "8", "_score": 1,
        "_source": { "studentNo": "8", "name": "赵云", "male": "男", "age": "23", "birthday": "1986-10-26", "classNo": "2", "isLeader": "false" }
      },
      {
        "_index": "student", "_type": "student", "_id": "5", "_score": 0.30685282,
        "_source": { "studentNo": "5", "name": "诸葛亮", "male": "男", "age": "18", "birthday": "1992-04-27", "classNo": "2", "isLeader": "true" }
      }
    ]
  },
  "aggregations": {          // aggregation result
    "min_age": {             // the aggregation name given in the request
      "value": 18,           // the aggregated value
      "value_as_string": "18.0"
    }
  }
}

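Two things are worth noting about the aggregation query above:

A query can be used to filter the data first.
The response includes the full set of data that the aggregation operated on.

Sometimes we are not interested in that data set and only need the final aggregation result. The search_type parameter handles this. The query below returns far less data, and ES also skips some expensive steps internally, so it runs more efficiently.

curl -XPOST "192.168.1.101:9200/student/student/_search?search_type=count" -d '
{
  // note search_type=count in the URL above
  "query": {                 // a query can be used first to select the data set
    "term": { "classNo": "2" }
  },
  "aggs": {
    "min_age": {
      "min": { "field": "age" }
    }
  }
}
'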

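This time the result is:

{
  ...
  "aggregations": {          // aggregation result
    "min_age": {             // the aggregation name given in the request
      "value": 18,           // the aggregated value
      "value_as_string": "18.0"
    }
  }
}

(2) Max Aggregation

Maximum value. The example finds the highest age in class 2; the result is 23.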
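The min request follows the general call format: a named aggregation wraps a type and a body, and may carry nested aggs. As a rough sketch, such request bodies can be assembled in Python with a small helper before being sent to ES. The agg helper and the by_class name are hypothetical and are not part of any ES client library.

```python
import json


def agg(agg_type, body, sub_aggs=None):
    """Build one aggregation node: {type: body} plus optional nested aggs."""
    node = {agg_type: body}
    if sub_aggs:
        node["aggs"] = sub_aggs
    return node


request = {
    "query": {"term": {"classNo": "2"}},
    "aggs": {
        # a plain metric aggregation on the whole result set
        "min_age": agg("min", {"field": "age"}),
        # a bucketing aggregation with a metric nested in each bucket
        "by_class": agg("terms", {"field": "classNo"},
                        sub_aggs={"max_age": agg("max", {"field": "age"})}),
    },
}
print(json.dumps(request, indent=2))
```

The printed JSON can be used as the body of the _search request in the same way as the hand-written examples.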

curl -XPOST "192.168.1.101:9200/student/student/_search?search_type=count" -d ' { "query": { "term": { "classNo": "2" } }, "aggs": { "max_age": { "max": { "field": "age" } } } } '

(3) Sum Aggregation

Sum of values. The example computes the total age of class 2; the result is 63.

curl -XPOST "192.168.1.101:9200/student/student/_search?search_type=count" -d ' { "query": { "term": { "classNo": "2" } }, "aggs": { "sum_age": { "sum": { "field": "age" } } } } '

(4) Avg Aggregation

Average value. The example computes the average age of class 2; the result is 21.

curl -XPOST "192.168.1.101:9200/student/student/_search?search_type=count" -d ' { "query": { "term": { "classNo": "2" } }, "aggs": { "avg_age": { "avg": { "field": "age" } } } } '

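(5) Stats Aggregation

Stats aggregation: computes the common statistics of a field in a single request. The example produces simple statistics for all students in the school.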

curl -XPOST "192.168.1.101:9200/student/student/_search?search_type=count" -d ' { "aggs": { "stats_age": { "stats": { "field": "age" } } } } '

The result is:

{
  ...                        // less important data omitted
  "aggregations": {
    "stats_age": {
      "count": 8,            // number of students with an age value
      "min": 16,             // lowest age
      "max": 24,             // highest age
      "avg": 20.125,         // average age
      "sum": 161,            // sum of ages
      "min_as_string": "16.0",
      "max_as_string": "24.0",
      "avg_as_string": "20.125",
      "sum_as_string": "161.0"
    }
  }
}
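These figures can be checked by hand against the sample table. The short Python sketch below recomputes the same five statistics locally from the ages listed in the table; the stats function is a hypothetical stand-in used only for this check and does not call ES.

```python
# Ages of students 1 to 8 from the sample table
AGES = [24, 22, 19, 20, 18, 16, 19, 23]


def stats(values):
    """Compute the five figures reported by the stats aggregation."""
    total = sum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
        "sum": total,
    }


stats_age = stats(AGES)
print(stats_age)
```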

(6) Top hits Aggregation

Returns the top n matching records. The example finds the 2 oldest students in the school, returning only name and age.

curl -XPOST "192.168.1.101:9200/student/student/_search?search_type=count" -d '
{
  "aggs": {
    "top_age": {
      "top_hits": {
        "sort": [                          // sort order
          { "age": { "order": "desc" } }   // by age, descending
        ],
        "_source": {
          "include": [ "name", "age" ]     // fields to return
        },
        "size": 2                          // take the first 2 records
      }
    }
  }
}
'

The result is:

{
  ...
  "aggregations": {
    "top_age": {
      "hits": {
        "total": 9,
        "max_score": null,
        "hits": [
          {
            "_index": "student", "_type": "student", "_id": "1", "_score": null,
            "_source": { "name": "刘备", "age": "24" },
            "sort": [ 24 ]
          },
          {
            "_index": "student", "_type": "student", "_id": "8", "_score": null,
            "_source": { "name": "赵云", "age": "23" },
            "sort": [ 23 ]
          }
        ]
      }
    }
  }
}

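1.2 Bucketing aggregations

(1) Terms Aggregation

Splits the data into groups by the value of one or more fields, counts the records in each group, and sorts the groups in the requested order. The example counts the students in each class, sorts by student count in descending order, and takes the 2 classes with the most students.

curl -XPOST "192.168.1.101:9200/student/student/_search?search_type=count" -d '
{
  "aggs": {
    "terms_classNo": {
      "terms": {
        "field": "classNo",   // group by class number
        "order": {            // sort by student count, descending
          "_count": "desc"
        },
        "size": 2             // take the top 2
      }
    }
  }
}
'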

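Note that the student counts returned for the top 2 are actually approximate; the official documentation describes how ES implements this. To get exact values, you can leave out the size restriction so that ES performs a full sort, and then take the first 2 records in your own program. This makes ES do a lot of sorting work, however, so it is less efficient.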
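For comparison, here is what the exact grouping looks like when computed locally over the sample table with Python's collections.Counter. This is only a sketch of the group-count-sort-truncate logic; it runs on a single in-memory list, so it does not reproduce the approximation mentioned above, and terms_agg is a hypothetical helper.

```python
from collections import Counter

# classNo of students 1 to 8 from the sample table
CLASS_NUMBERS = ["1", "2", "1", "3", "2", "3", "1", "2"]


def terms_agg(values, size):
    """Group by value, count, sort by count descending, keep the top `size`."""
    counts = Counter(values)
    return [{"key": key, "doc_count": count}
            for key, count in counts.most_common(size)]


top_classes = terms_agg(CLASS_NUMBERS, size=2)
all_classes = terms_agg(CLASS_NUMBERS, size=None)  # None keeps every bucket
print(top_classes)
print(all_classes)
```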

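(2) Range Aggregation

Range aggregation with custom intervals: we define the ranges by hand, and ES assigns the data to them. The example splits all students into 5 age ranges (under 16, 16 to 18, 19 to 21, 22 to 24, and over 24) and counts the students in each range.

curl -XPOST "192.168.1.101:9200/student/student/_search?search_type=count" -d '
{
  "aggs": {
    "range_age": {
      "range": {
        "field": "age",
        "ranges": [                      // from is inclusive, to is exclusive
          { "to": 16 },                  // under 16
          { "from": 16, "to": 19 },      // 16 to 18
          { "from": 19, "to": 22 },      // 19 to 21
          { "from": 22, "to": 25 },      // 22 to 24
          { "from": 25 }                 // over 24
        ]
      }
    }
  }
}
'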
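In a range aggregation each bucket includes its from value and excludes its to value, so the boundaries have to be chosen with care. The Python sketch below applies that half-open rule to the ages in the sample table, using None for an open end; range_agg and RANGES are hypothetical names and the code does not call ES.

```python
# Ages of students 1 to 8 from the sample table
AGES = [24, 22, 19, 20, 18, 16, 19, 23]

# (from, to) pairs; from is inclusive, to is exclusive, None means unbounded
RANGES = [(None, 16), (16, 19), (19, 22), (22, 25), (25, None)]


def range_agg(values, ranges):
    """Count the values that fall into each half-open range [from, to)."""
    buckets = []
    for low, high in ranges:
        count = sum(1 for value in values
                    if (low is None or value >= low)
                    and (high is None or value < high))
        buckets.append({"from": low, "to": high, "doc_count": count})
    return buckets


age_buckets = range_agg(AGES, RANGES)
for bucket in age_buckets:
    print(bucket)
```

With these boundaries every age lands in exactly one bucket, which is why the upper bounds are written as 16, 19, 22 and 25 rather than 15, 18, 21 and 24.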

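(3) Date Range Aggregation

Date range aggregation, specifically for date fields. Its main difference from the range aggregation is that it accepts date math expressions. These consist of + (addition), - (subtraction) and / (rounding), and each operation can be applied to different time units. Some example expressions:

now+10y: 10 years from now.
now+10M: 10 months from now.
1990-01-01||+20y: 20 years after 1990-01-01, that is, 2010-01-01.
now/y: rounds at the year level. If today is 2015-09-06, the expression evaluates to 2015-01-01. Although the operation is called rounding, the result here is a floor: the date is truncated to the start of the year.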
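The two results quoted in the list can be reproduced with ordinary date arithmetic. The Python sketch below mimics now/y and +20y with the standard datetime module. It is an illustration of the arithmetic only, not the ES date math parser; floor_to_year and add_years are hypothetical helpers, and add_years does not handle February 29.

```python
from datetime import datetime


def floor_to_year(moment):
    """Truncate a datetime to the first instant of its year (like now/y)."""
    return moment.replace(month=1, day=1, hour=0, minute=0,
                          second=0, microsecond=0)


def add_years(moment, years):
    """Shift a datetime by whole years (like +20y); Feb 29 is not handled."""
    return moment.replace(year=moment.year + years)


now = datetime(2015, 9, 6, 14, 30)  # fixed "now" so the result is reproducible
start_of_year = floor_to_year(now)
twenty_years_on = add_years(datetime(1990, 1, 1), 20)
print(start_of_year.date())    # 2015-01-01
print(twenty_years_on.date())  # 2010-01-01
```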

The example counts the students born 25 or more years ago.

curl -XPOST "192.168.1.101:9200/student/student/_search?search_type=count" -d ' { "aggs": { "range_age": { "date_range": { "field": "birthday", "ranges": [ { "to": "now-25y" } ] } } } } '

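(4) Histogram Aggregation

Histogram aggregation: splits a number field into equal-width intervals and counts the records in each. It closely resembles the range aggregation described earlier, except that range allows arbitrary intervals while histogram uses equal spacing. Equal spacing requires a width parameter, which is interval. The example counts students per age band, with a band width of 2 years.

curl -XPOST "192.168.1.101:9200/student/student/_search?search_type=count" -d '
{
  "aggs": {
    "histogram_age": {
      "histogram": {
        "field": "age",
        "interval": 2,        // band width of 2
        "min_doc_count": 1    // return only bands with at least 1 record
      }
    }
  }
}
'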
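With equal-width bands, the bucket a value belongs to can be computed directly: round the value down to the nearest multiple of interval. The Python sketch below applies this rule to the ages in the sample table. It assumes bands start at multiples of the interval (no offset), and histogram is a hypothetical helper that does not call ES.

```python
from collections import Counter

# Ages of students 1 to 8 from the sample table
AGES = [24, 22, 19, 20, 18, 16, 19, 23]


def histogram(values, interval, min_doc_count=1):
    """Bucket each value under the largest multiple of interval not above it."""
    counts = Counter((value // interval) * interval for value in values)
    return [{"key": key, "doc_count": count}
            for key, count in sorted(counts.items())
            if count >= min_doc_count]


age_histogram = histogram(AGES, interval=2)
for bucket in age_histogram:
    print(bucket)
```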

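(5) Date Histogram Aggregation

Date histogram aggregation: a histogram aggregation specifically for date fields. This is a common need, because statistics are usually computed over fixed time periods (1 month, 1 year, and so on). The example counts the students born in each year.

curl -XPOST "192.168.1.101:9200/student/student/_search?search_type=count" -d '
{
  "aggs": {
    "date_histogram_birthday": {
      "date_histogram": {
        "field": "birthday",
        "interval": "year",   // one bucket per year
        "format": "yyyy"      // format of the key in the response
      }
    }
  }
}
'

The result is shown below. Because of "format": "yyyy" in the request, key_as_string contains only the year.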

{
  "buckets": [ { "key_as_string": "1985", "key": 473385600000, "doc_count": 1 }, { "key_as_string": "1986", "key": 504921600000, "doc_count": 1 }, { "key_as_string": "1987", "key": 536457600000, "doc_count": 1 }, { "key_as_string": "1989", "key": 599616000000, "doc_count": 1 }, { "key_as_string": "1990", "key": 631152000000, "doc_count": 1 }, { "key_as_string": "1991", "key": 662688000000, "doc_count": 1 }, { "key_as_string": "1992", "key": 694224000000, "doc_count": 1 }, { "key_as_string": "1994", "key": 757382400000, "doc_count": 1 } ] }

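(6) Missing Aggregation

Missing aggregation: a single-bucket aggregation, meaning it always produces exactly one bucket. The example counts the student records whose address is missing. Only 孙尚香, student number 6, has no address, so the count is 1.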

curl -XPOST "192.168.1.101:9200/student/student/_search?search_type=count" -d ' { "aggs": { "missing_address": { "missing": { "field": "address" } } } } '

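1.3 Nesting

As mentioned earlier, aggregations can be nested. Nesting lets a metric aggregation operate on each bucket, so nested aggregations can handle somewhat more complex statistics. The example finds the highest age in each class.

curl -XPOST "192.168.1.101:9200/student/student/_search?search_type=count" -d '
{
  "aggs": {
    "terms_classNo": {
      "terms": { "field": "classNo" },
      "aggs": {               // the nested sub-aggregation goes here
        "max_age": {
          "max": {            // use the max aggregation
            "field": "age"
          }
        }
      }
    }
  }
}
'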

The result is as follows:

{
  "buckets": [
    {
      "key": "1",             // key is the class number
      "doc_count": 3,         // number of students in the class
      "max_age": {            // the sub-aggregation name given in the request
        "value": 24,          // highest age in the class
        "value_as_string": "24.0"
      }
    },
    { "key": "2", "doc_count": 3, "max_age": { "value": 23, "value_as_string": "23.0" } },
    { "key": "3", "doc_count": 1, "max_age": { "value": 20, "value_as_string": "20.0" } },
    { "key": "4", "doc_count": 1, "max_age": { "value": 16, "value_as_string": "16.0" } }
  ]
}

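2. Summary

This article covered some commonly used aggregation APIs in ES, including metric and bucketing aggregations and how to nest them. With these APIs you can implement simple data statistics; see the official documentation for more.

Readers who want to go further should read Elasticsearch: The Definitive Guide.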