在 Elasticsearch 5.x 有一个字段折叠(Field Collapsing,#22337)的功能很是有意思,在这里分享一下,
字段折叠是一个颇有历史的需求了,能够看这个 issue,编号#256,最初是2010年7月提的issue,也是讨论最多的帖子之一(240+评论),熬了6年才支持的特性,你说牛不牛,哈哈。
目测该特性将于5.3发布,尝鲜地址:Elasticsearch-5.3.0-SNAPSHOT,文档地址:search-request-collapse。
So,什么是字段折叠,能够理解就是按特定字段进行合并去重,好比咱们有一个菜谱搜索,我但愿按菜谱的“菜系”字段进行折叠,即返回结果每一个菜系都返回一个结果,也就是按菜系去重,我搜索关键字“鱼”,要去返回的结果里面各类菜系都有,有湘菜,有粤菜,有中餐,有西餐,别全是湘菜,就是这个意思,经过按特定字段折叠以后,来丰富搜索结果的多样性。
说到这里,有人确定会想到,使用 term agg+ top hits agg 来实现啊,这种组合两种聚和的方式能够实现上面的功能,不过也有一些局限性,好比,不能分页,#4915;结果不够精确(top term+top hits,es 的聚合实现选择了牺牲精度来提升速度);数据量大的状况下,聚合比较慢,影响搜索体验。
而新的的字段折叠的方式是怎么实现的的呢,有这些要点:html
下面来看看具体的例子,就知道怎么回事了,使用起来很简单。git
DELETE recipes PUT recipes POST recipes/type/_mapping { "properties": { "name":{ "type": "text" }, "rating":{ "type": "float" },"type":{ "type": "keyword" } } } POST recipes/type/ { "name":"清蒸鱼头","rating":1,"type":"湘菜" } POST recipes/type/ { "name":"剁椒鱼头","rating":2,"type":"湘菜" } POST recipes/type/ { "name":"红烧鲫鱼","rating":3,"type":"湘菜" } POST recipes/type/ { "name":"鲫鱼汤(辣)","rating":3,"type":"湘菜" } POST recipes/type/ { "name":"鲫鱼汤(微辣)","rating":4,"type":"湘菜" } POST recipes/type/ { "name":"鲫鱼汤(变态辣)","rating":5,"type":"湘菜" } POST recipes/type/ { "name":"广式鲫鱼汤","rating":5,"type":"粤菜" } POST recipes/type/ { "name":"鱼香肉丝","rating":2,"type":"川菜" } POST recipes/type/ { "name":"奶油鲍鱼汤","rating":2,"type":"西菜" }
POST recipes/type/_search { "query": {"match": { "name": "鱼" }},"size": 3 }
全是湘菜,个人天,最近上火不想吃辣,这个第一页的结果对我来讲就是垃圾,以下:github
{ "took": 2, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 9, "max_score": 0.26742277, "hits": [ { "_index": "recipes", "_type": "type", "_id": "AVoESHYF_OA-dG63Txsd", "_score": 0.26742277, "_source": { "name": "鲫鱼汤(变态辣)", "rating": 5, "type": "湘菜" } }, { "_index": "recipes", "_type": "type", "_id": "AVoESHXO_OA-dG63Txsa", "_score": 0.19100356, "_source": { "name": "红烧鲫鱼", "rating": 3, "type": "湘菜" } }, { "_index": "recipes", "_type": "type", "_id": "AVoESHWy_OA-dG63TxsZ", "_score": 0.19100356, "_source": { "name": "剁椒鱼头", "rating": 2, "type": "湘菜" } } ] } }
咱们再看看,此次我想加个评分排序,你们都喜欢的是那些,看看有没有喜欢吃的,执行查询:json
POST recipes/type/_search { "query": {"match": { "name": "鱼" }},"sort": [ { "rating": { "order": "desc" } } ],"size": 3 }
结果稍微好点了,不过3个里面2个是湘菜,仍是有点不合适,结果以下:app
{ "took": 1, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 9, "max_score": null, "hits": [ { "_index": "recipes", "_type": "type", "_id": "AVoESHYF_OA-dG63Txsd", "_score": null, "_source": { "name": "鲫鱼汤(变态辣)", "rating": 5, "type": "湘菜" }, "sort": [ 5 ] }, { "_index": "recipes", "_type": "type", "_id": "AVoESHYW_OA-dG63Txse", "_score": null, "_source": { "name": "广式鲫鱼汤", "rating": 5, "type": "粤菜" }, "sort": [ 5 ] }, { "_index": "recipes", "_type": "type", "_id": "AVoESHX7_OA-dG63Txsc", "_score": null, "_source": { "name": "鲫鱼汤(微辣)", "rating": 4, "type": "湘菜" }, "sort": [ 4 ] } ] } }
如今我知道了,我要看看其余菜系,这家不是还有西餐、广东菜等各类菜系的么,来来,帮我每一个菜系来一个菜看看,换 terms agg 先获得惟一的 term 的 bucket,再组合 top_hits agg,返回按评分排序的第一个 top hits,有点复杂,不要紧,看下面的查询就知道了:elasticsearch
GET recipes/type/_search { "query": { "match": { "name": "鱼" } }, "sort": [ { "rating": { "order": "desc" } } ],"aggs": { "type": { "terms": { "field": "type", "size": 10 },"aggs": { "rated": { "top_hits": { "sort": [{ "rating": {"order": "desc"} }], "size": 1 } } } } }, "size": 0, "from": 0 }
看下面的结果,虽然 json 结构有点复杂,不过总算是咱们想要的结果了,湘菜、粤菜、川菜、西菜都出来了,每样一个,不重样:ide
{ "took": 4, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 9, "max_score": 0, "hits": [] }, "aggregations": { "type": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "湘菜", "doc_count": 6, "rated": { "hits": { "total": 6, "max_score": null, "hits": [ { "_index": "recipes", "_type": "type", "_id": "AVoESHYF_OA-dG63Txsd", "_score": null, "_source": { "name": "鲫鱼汤(变态辣)", "rating": 5, "type": "湘菜" }, "sort": [ 5 ] } ] } } }, { "key": "川菜", "doc_count": 1, "rated": { "hits": { "total": 1, "max_score": null, "hits": [ { "_index": "recipes", "_type": "type", "_id": "AVoESHYr_OA-dG63Txsf", "_score": null, "_source": { "name": "鱼香肉丝", "rating": 2, "type": "川菜" }, "sort": [ 2 ] } ] } } }, { "key": "粤菜", "doc_count": 1, "rated": { "hits": { "total": 1, "max_score": null, "hits": [ { "_index": "recipes", "_type": "type", "_id": "AVoESHYW_OA-dG63Txse", "_score": null, "_source": { "name": "广式鲫鱼汤", "rating": 5, "type": "粤菜" }, "sort": [ 5 ] } ] } } }, { "key": "西菜", "doc_count": 1, "rated": { "hits": { "total": 1, "max_score": null, "hits": [ { "_index": "recipes", "_type": "type", "_id": "AVoESHY3_OA-dG63Txsg", "_score": null, "_source": { "name": "奶油鲍鱼汤", "rating": 2, "type": "西菜" }, "sort": [ 2 ] } ] } } } ] } } }
上面的实现方法,前面已经说了,能够作,有局限性,那看看新的字段折叠法如何作到呢,查询以下,加一个 collapse 参数,指定对那个字段去重就好了,这里固然对菜系“type”字段进行去重了:ui
GET recipes/type/_search { "query": { "match": { "name": "鱼" } }, "collapse": { "field": "type" }, "size": 3, "from": 0 }
结果很理想嘛,命中结果仍是熟悉的那个味道(和查询结果长的同样嘛),以下:htm
{ "took": 1, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 9, "max_score": null, "hits": [ { "_index": "recipes", "_type": "type", "_id": "AVoDNlRJ_OA-dG63TxpW", "_score": 0.018980097, "_source": { "name": "鲫鱼汤(微辣)", "rating": 4, "type": "湘菜" }, "fields": { "type": [ "湘菜" ] } }, { "_index": "recipes", "_type": "type", "_id": "AVoDNlRk_OA-dG63TxpZ", "_score": 0.013813315, "_source": { "name": "鱼香肉丝", "rating": 2, "type": "川菜" }, "fields": { "type": [ "川菜" ] } }, { "_index": "recipes", "_type": "type", "_id": "AVoDNlRb_OA-dG63TxpY", "_score": 0.0125863515, "_source": { "name": "广式鲫鱼汤", "rating": 5, "type": "粤菜" }, "fields": { "type": [ "粤菜" ] } } ] } }
我再试试翻页,把 from 改一下,如今返回了3条数据,from 改为3,新的查询以下:排序
{ "took": 1, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 9, "max_score": null, "hits": [ { "_index": "recipes", "_type": "type", "_id": "AVoDNlRw_OA-dG63Txpa", "_score": 0.012546891, "_source": { "name": "奶油鲍鱼汤", "rating": 2, "type": "西菜" }, "fields": { "type": [ "西菜" ] } } ] } }
上面的结果只有一条了,去重以后原本就只有4条数据,上面的工做正常,每一个菜系只有一个菜啊,那我不乐意了,帮我每一个菜系里面多返回几条,我好选菜啊,加上参数 inner_hits 来控制返回的条数,这里返回2条,按 rating 也排个序,新的查询构造以下:
GET recipes/type/_search { "query": { "match": { "name": "鱼" } }, "collapse": { "field": "type", "inner_hits": { "name": "top_rated", "size": 2, "sort": [ { "rating": "desc" } ] } }, "sort": [ { "rating": { "order": "desc" } } ], "size": 2, "from": 0 }
查询结果以下,完美:
{ "took": 1, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 9, "max_score": null, "hits": [ { "_index": "recipes", "_type": "type", "_id": "AVoESHYF_OA-dG63Txsd", "_score": null, "_source": { "name": "鲫鱼汤(变态辣)", "rating": 5, "type": "湘菜" }, "fields": { "type": [ "湘菜" ] }, "sort": [ 5 ], "inner_hits": { "top_rated": { "hits": { "total": 6, "max_score": null, "hits": [ { "_index": "recipes", "_type": "type", "_id": "AVoESHYF_OA-dG63Txsd", "_score": null, "_source": { "name": "鲫鱼汤(变态辣)", "rating": 5, "type": "湘菜" }, "sort": [ 5 ] }, { "_index": "recipes", "_type": "type", "_id": "AVoESHX7_OA-dG63Txsc", "_score": null, "_source": { "name": "鲫鱼汤(微辣)", "rating": 4, "type": "湘菜" }, "sort": [ 4 ] } ] } } } }, { "_index": "recipes", "_type": "type", "_id": "AVoESHYW_OA-dG63Txse", "_score": null, "_source": { "name": "广式鲫鱼汤", "rating": 5, "type": "粤菜" }, "fields": { "type": [ "粤菜" ] }, "sort": [ 5 ], "inner_hits": { "top_rated": { "hits": { "total": 1, "max_score": null, "hits": [ { "_index": "recipes", "_type": "type", "_id": "AVoESHYW_OA-dG63Txse", "_score": null, "_source": { "name": "广式鲫鱼汤", "rating": 5, "type": "粤菜" }, "sort": [ 5 ] } ] } } } } ] } }
好了,字段折叠介绍就到这里。