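In Elasticsearch full-text search, one of the queries we use most often is the Multi Match Query, which matches a query string against multiple fields. Elasticsearch supports 5 types of Multi Match; let's look closely at how they differ.
Here is an excerpt taken directly from the official documentation:
We only consider the first three types here. The last two can be studied separately, so we skip them for now.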
PUT /gino_product
{
  "mappings": {
    "product": {
      "properties": {
        "productName": {
          "type": "string",
          "analyzer": "fulltext_analyzer",
          "copy_to": [ "bigSearchField" ]
        },
        "brandName": {
          "type": "string",
          "analyzer": "fulltext_analyzer",
          "copy_to": [ "bigSearchField" ],
          "fields": {
            "brandName_pinyin": { "type": "string", "analyzer": "pinyin_analyzer", "search_analyzer": "standard" },
            "brandName_keyword": { "type": "string", "analyzer": "keyword", "search_analyzer": "standard" }
          }
        },
        "sortName": {
          "type": "string",
          "analyzer": "fulltext_analyzer",
          "copy_to": [ "bigSearchField" ],
          "fields": {
            "sortName_pinyin": { "type": "string", "analyzer": "pinyin_analyzer", "search_analyzer": "standard" }
          }
        },
        "productKeyword": {
          "type": "string",
          "analyzer": "fulltext_analyzer",
          "copy_to": [ "bigSearchField" ]
        },
        "bigSearchField": {
          "type": "string",
          "analyzer": "fulltext_analyzer"
        }
      }
    }
  },
  "settings": {
    "index": { "number_of_shards": 1, "number_of_replicas": 0 },
    "analysis": {
      "tokenizer": {
        "simple_pinyin": { "type": "pinyin", "first_letter": "none" }
      },
      "analyzer": {
        "fulltext_analyzer": { "type": "ik", "use_smart": true },
        "pinyin_analyzer": {
          "type": "custom",
          "tokenizer": "simple_pinyin",
          "filter": [ "word_delimiter", "lowercase" ]
        }
      }
    }
  }
}
POST /gino_product/product/1
{ "productName": "耐克女生运动轻跑鞋", "brandName": "耐克", "sortName": "鞋子", "productKeyword": "耐克,潮流,运动,轻跑鞋" }

POST /gino_product/product/2
{ "productName": "耐克女生休闲运动服", "brandName": "耐克", "sortName": "上衣", "productKeyword": "耐克,休闲,运动" }

POST /gino_product/product/3
{ "productName": "阿迪达斯女生冬季运动板鞋", "brandName": "阿迪达斯", "sortName": "鞋子", "productKeyword": "阿迪达斯,冬季,运动,板鞋" }

POST /gino_product/product/4
{ "productName": "阿迪达斯女生冬季运动夹克外套", "brandName": "阿迪达斯", "sortName": "上衣", "productKeyword": "阿迪达斯,冬季,运动,夹克,外套" }
POST /gino_product/_search
{
  "query": {
    "multi_match": {
      "query": "运动",
      "fields": [
        "brandName^100",
        "brandName.brandName_pinyin^100",
        "brandName.brandName_keyword^100",
        "sortName^80",
        "sortName.sortName_pinyin^80",
        "productName^60",
        "productKeyword^20"
      ],
      "type": <multi-match-type>,
      "operator": "AND"
    }
  }
}
All 3 types return the 4 products, and the ordering is identical.
POST /gino_product/_search
{
  "query": {
    "multi_match": {
      "query": "运动 上衣",
      "fields": [
        "brandName^100",
        "brandName.brandName_pinyin^100",
        "brandName.brandName_keyword^100",
        "sortName^80",
        "sortName.sortName_pinyin^80",
        "productName^60",
        "productKeyword^20"
      ],
      "type": <multi-match-type>,
      "operator": "AND"
    }
  }
}
This time only cross_fields returns any data; best_fields and most_fields return nothing. Why?
Use the validate API to compare how each type is rewritten:
POST /gino_product/_validate/query?rewrite=true
{
  "query": {
    "multi_match": {
      "query": "运动 上衣",
      "fields": [
        "brandName^100",
        "brandName.brandName_pinyin^100",
        "brandName.brandName_keyword^100",
        "sortName^80",
        "sortName.sortName_pinyin^80",
        "productName^60",
        "productKeyword^20"
      ],
      "type": <multi-match-type>,
      "operator": "AND"
    }
  }
}
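With best_fields, each field is matched on its own, using the analyzer and search_analyzer defined for that field in the mapping. With operator AND, every token therefore has to be found in one and the same field: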
(+brandName:运动 +brandName:上衣)^100.0
| (+brandName.brandName_pinyin:运 +brandName.brandName_pinyin:动 +brandName.brandName_pinyin:上 +brandName.brandName_pinyin:衣)^100.0
| (+brandName.brandName_keyword:运 +brandName.brandName_keyword:动 +brandName.brandName_keyword:上 +brandName.brandName_keyword:衣)^100.0
| (+sortName:运动 +sortName:上衣)^80.0
| (+sortName.sortName_pinyin:运 +sortName.sortName_pinyin:动 +sortName.sortName_pinyin:上 +sortName.sortName_pinyin:衣)^80.0
| (+productName:运动 +productName:上衣)^60.0
| (+productKeyword:运动 +productKeyword:上衣)^20.0
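most_fields differs from best_fields in relevance scoring: best_fields takes the highest matching field score (max), while most_fields takes the sum of all matching field scores (sum).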
(
  (+brandName:运动 +brandName:上衣)^100.0
  (+brandName.brandName_pinyin:运 +brandName.brandName_pinyin:动 +brandName.brandName_pinyin:上 +brandName.brandName_pinyin:衣)^100.0
  (+brandName.brandName_keyword:运 +brandName.brandName_keyword:动 +brandName.brandName_keyword:上 +brandName.brandName_keyword:衣)^100.0
  (+sortName:运动 +sortName:上衣)^80.0
  (+sortName.sortName_pinyin:运 +sortName.sortName_pinyin:动 +sortName.sortName_pinyin:上 +sortName.sortName_pinyin:衣)^80.0
  (+productName:运动 +productName:上衣)^60.0
  (+productKeyword:运动 +productKeyword:上衣)^20.0
)
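The max-versus-sum difference can be sketched in a few lines of Python. This is only an illustration: the per-field scores are made-up numbers, the function names are hypothetical, and the sketch ignores tie_breaker and any coordination factor that Elasticsearch may also apply.

```python
def best_fields_score(field_scores):
    """best_fields: keep only the score of the best-matching field (max)."""
    return max(field_scores.values(), default=0.0)


def most_fields_score(field_scores):
    """most_fields: add up the scores of every matching field (sum)."""
    return sum(field_scores.values())


# Hypothetical per-field scores for one document (not real explain output).
scores = {"sortName": 8.0, "productName": 6.0, "productKeyword": 2.0}

print(best_fields_score(scores))  # 8.0
print(most_fields_score(scores))  # 16.0
```

A document that matches in many fields gains under most_fields, while under best_fields only its strongest field counts.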
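For cross_fields, ES first rewrites the query by grouping the fields, and the grouping key is the search_analyzer. In our example the three fields [brandName.brandName_pinyin, brandName.brandName_keyword, sortName.sortName_pinyin] use standard as their search_analyzer, while the remaining fields use fulltext_analyzer, so the fields end up in two groups. Within a group, each token only has to match in any one of the group's fields, so 运动 can match in productKeyword while 上衣 matches in sortName.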
(
  (
    +(brandName.brandName_pinyin:运^100.0 | sortName.sortName_pinyin:运^80.0 | brandName.brandName_keyword:运^100.0)
    +(brandName.brandName_pinyin:动^100.0 | sortName.sortName_pinyin:动^80.0 | brandName.brandName_keyword:动^100.0)
    +(brandName.brandName_pinyin:上^100.0 | sortName.sortName_pinyin:上^80.0 | brandName.brandName_keyword:上^100.0)
    +(brandName.brandName_pinyin:衣^100.0 | sortName.sortName_pinyin:衣^80.0 | brandName.brandName_keyword:衣^100.0)
  )
  (
    +(productKeyword:运动^20.0 | brandName:运动^100.0 | sortName:运动^80.0 | productName:运动^60.0)
    +(productKeyword:上衣^20.0 | brandName:上衣^100.0 | sortName:上衣^80.0 | productName:上衣^60.0)
  )
)
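To make the grouping and the per-term matching concrete, here is a small Python simulation. It is a sketch under stated assumptions, not how Elasticsearch is implemented: the search analyzer table is read off the mapping above, the helper names are hypothetical, and the indexed tokens for product 2 are assumed (the real output of the ik analyzer may differ).

```python
from collections import defaultdict

# Search analyzer per queried field, read off the mapping
# (fields without search_analyzer fall back to their analyzer).
SEARCH_ANALYZER = {
    "brandName": "fulltext_analyzer",
    "brandName.brandName_pinyin": "standard",
    "brandName.brandName_keyword": "standard",
    "sortName": "fulltext_analyzer",
    "sortName.sortName_pinyin": "standard",
    "productName": "fulltext_analyzer",
    "productKeyword": "fulltext_analyzer",
}


def group_by_search_analyzer(field_to_analyzer):
    """Group field names by the analyzer used at search time."""
    groups = defaultdict(list)
    for field, analyzer in field_to_analyzer.items():
        groups[analyzer].append(field)
    return dict(groups)


# ASSUMED indexed tokens for product 2; illustrative only.
DOC2_TOKENS = {
    "brandName": {"耐克"},
    "sortName": {"上衣"},
    "productKeyword": {"耐克", "休闲", "运动"},
}


def field_centric_match(doc, terms):
    """best_fields/most_fields with AND: one field must contain every term."""
    return any(all(t in tokens for t in terms) for tokens in doc.values())


def term_centric_match(doc, terms, fields):
    """cross_fields with AND: every term must appear in some field of the group."""
    return all(any(t in doc.get(f, set()) for f in fields) for t in terms)


groups = group_by_search_analyzer(SEARCH_ANALYZER)
terms = ["运动", "上衣"]

print(field_centric_match(DOC2_TOKENS, terms))                             # False
print(term_centric_match(DOC2_TOKENS, terms, groups["fulltext_analyzer"]))  # True
```

No single field holds both tokens, so the field-centric check fails, while the term-centric check succeeds because the two tokens are found in different fields of the same group.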
The most common alternative way to search across fields is to use the _all field or a copy_to field, such as the bigSearchField field in our mapping.
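As a rough sketch of that approach, the request body below runs a plain match query against the combined field. The helper name copy_to_search_body is hypothetical; the match query with an operator option is standard Elasticsearch query DSL, and bigSearchField comes from the mapping above.

```python
import json


def copy_to_search_body(text, field="bigSearchField", operator="and"):
    """Build a match query against a single combined (copy_to) field."""
    return {
        "query": {
            "match": {
                field: {"query": text, "operator": operator}
            }
        }
    }


body = copy_to_search_body("运动 上衣")
# The body could be sent as: POST /gino_product/_search
print(json.dumps(body, ensure_ascii=False, indent=2))
```

One trade-off to keep in mind: once everything is copied into a single field, the per-field boosts used in the multi_match examples (such as brandName^100) can no longer be applied at query time.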
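Because cross_fields groups fields by search_analyzer, an input such as [运动 shangyi] cannot match any product: its tokens would have to match in different groups. You should therefore keep the number of groups as small as possible, that is, use one search_analyzer across the fields wherever you can, or explicitly specify an analyzer at search time to override the search_analyzer defined in the mapping.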
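The second option can be sketched as follows. multi_match accepts an analyzer parameter that is applied to the query text for all listed fields; the helper name cross_fields_body and the choice of "standard" are only assumptions for illustration, and an analyzer whose tokens do not line up with what was indexed will simply match nothing.

```python
def cross_fields_body(text, fields, analyzer=None, operator="AND"):
    """Build a cross_fields multi_match body, optionally forcing one analyzer."""
    multi_match = {
        "query": text,
        "fields": list(fields),
        "type": "cross_fields",
        "operator": operator,
    }
    if analyzer is not None:
        # Overrides the per-field search_analyzer from the mapping,
        # so all fields fall into a single group.
        multi_match["analyzer"] = analyzer
    return {"query": {"multi_match": multi_match}}


fields = ["brandName^100", "sortName^80", "sortName.sortName_pinyin^80"]
forced = cross_fields_body("运动 shangyi", fields, analyzer="standard")
print(forced["query"]["multi_match"]["analyzer"])  # standard
```

Running the resulting body through the validate API shown earlier is a quick way to confirm that the fields really collapse into one group.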
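In all the examples above we set operator to AND, which means every token of the search text must match. What happens if we set it to OR, and in which scenarios should OR be used?
Be especially careful with OR: a product is returned as soon as a single token matches. For example, the search [运动 上衣] above would also return shoes, which greatly reduces search precision.
In some special searches, for example [耐克 阿迪达斯 上衣], using operator AND means no product is matched regardless of the multi-match type (think about why). In this case we can set operator to OR and minimum_should_match to 60%, so that tops from 耐克 and 阿迪达斯 can be found. This acts as a kind of smart search fallback.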
POST /gino_product/_search
{
  "query": {
    "multi_match": {
      "query": "耐克 阿迪达斯 上衣",
      "fields": [
        "brandName^100",
        "brandName.brandName_pinyin^100",
        "brandName.brandName_keyword^100",
        "sortName^80",
        "sortName.sortName_pinyin^80",
        "productName^60",
        "productKeyword^20"
      ],
      "type": "cross_fields",
      "operator": "OR",
      "minimum_should_match": "60%"
    }
  }
}
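How a percentage translates into a number of required tokens is easy to misjudge. The sketch below assumes the rule given in the Elasticsearch minimum_should_match documentation, namely that the number computed from a positive percentage is rounded down. The helper name required_clauses is hypothetical, and the token counts are illustrative because they depend on how the analyzers split the query.

```python
def required_clauses(optional_clauses, percent):
    """Number of optional clauses that must match for a positive percentage.

    Assumes the documented behaviour: the computed number is rounded down.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return optional_clauses * percent // 100


for tokens in (3, 5, 8):
    print(tokens, required_clauses(tokens, 60))
# 3 1
# 5 3
# 8 4
```

Under this assumption, a query that is analyzed into exactly three tokens needs only one of them to match at 60%, whereas 67% would require two, so it is worth checking the effective behaviour with the validate API before relying on a particular percentage.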
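In the article "Elasticsearch相关性打分机制学习" (a study of Elasticsearch relevance scoring) we discussed how best_fields and cross_fields compute relevance scores, but the examples there all used the same search_analyzer. How is the cross_fields score computed when the fields are split into groups?
We reuse the example above and add the explain parameter to find out.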
POST /gino_product/_search
{
  "explain": true,
  "query": {
    "multi_match": {
      "query": "运动 上衣",
      "fields": [
        "brandName^100",
        "brandName.brandName_pinyin^100",
        "brandName.brandName_keyword^100",
        "sortName^80",
        "sortName.sortName_pinyin^80",
        "productName^60",
        "productKeyword^20"
      ],
      "type": "cross_fields",
      "operator": "AND"
    }
  }
}
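Detailed ES response: cross_fields_scoring.json
From the grouping information returned by the validate API and the scoring details returned by explain, we can summarize a scoring formula for cross_fields: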
score(q, d) = coord(q, d) * ∑(∑(max(score(t, f))))
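One way to read this formula, given the rewritten query shown earlier, is: take the best field score for each token (max), add those up over the tokens of a group (inner sum), add up the matching groups (outer sum), and multiply by the coordination factor. The Python sketch below follows that reading. The function name and all numbers are hypothetical, and coord is passed in as a plain value (assumed here to be the fraction of groups that matched) rather than derived the way Lucene does it.

```python
def cross_fields_score(matching_groups, coord):
    """Sum, over groups and tokens, of the best per-field score; scaled by coord."""
    total = 0.0
    for term_scores in matching_groups:            # outer sum: analyzer groups
        for field_scores in term_scores.values():  # inner sum: tokens in the group
            total += max(field_scores.values())    # max: best field for this token
    return coord * total


# Hypothetical per-field scores for the group of fulltext_analyzer fields.
fulltext_group = {
    "运动": {"productName": 6.0, "productKeyword": 2.0},
    "上衣": {"sortName": 8.0},
}

# Only one of the two groups matched, so coord is assumed to be 1/2.
print(cross_fields_score([fulltext_group], coord=0.5))  # 7.0
```

Comparing the result of such a calculation with the explain output for a real document is a useful sanity check on the interpretation.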