Elasticsearch Study Notes, Advanced Series (10): Multi-Field Search (Part 1)

Date: 2019-12-07

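Queries that consist of a single simple match clause are rare. We often need to search one or more fields for the same or for different query strings, which means we need to be able to combine multiple subqueries and make their relevance scores meaningful.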

1. best fields

Suppose we have a website that lets users search blog posts, and the index contains the following two documents:

PUT /test_index/_create/1
{
    "title": "Quick brown rabbits",
    "body":  "Brown rabbits are commonly seen."
}

PUT /test_index/_create/2
{
    "title": "Keeping pets healthy",
    "body":  "My quick brown fox eats rabbits on a regular basis."
}

Run the query:

GET /test_index/_search
{
    "query": {
        "bool": {
            "should": [
                { "match": { "title": "Brown fox" }},
                { "match": { "body":  "Brown fox" }}
            ]
        }
    }
}

Output:

{
  "took" : 326,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.90425634,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.90425634,
        "_source" : {
          "title" : "Quick brown rabbits",
          "body" : "Brown rabbits are commonly seen."
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.77041256,
        "_source" : {
          "title" : "Keeping pets healthy",
          "body" : "My quick brown fox eats rabbits on a regular basis."
        }
      }
    ]
  }
}

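Document 1 has the higher relevance score and ranks first. To understand why, consider how the bool query computes its score:
(1) It runs both queries in the should clause.
(2) It adds up the scores that the queries return.
(3) In older Elasticsearch releases that still applied a coordination factor, it multiplies the sum by the number of matching clauses.
(4) In those releases, it then divides by the total number of clauses.
Document 1 contains brown in both fields, so both match queries match and each contributes a score. Document 2 contains both brown and fox in the body field but none of the search words in the title field, so its total is the high body score plus zero for title. With a coordination factor, that total would then be multiplied by 1 (the number of matching clauses) and divided by 2 (the total number of clauses). The output above shows plain summation instead: document 2's score of 0.77041256 is the same score the dis_max query later in this section reports for its body match. Either way, document 2 ends up below document 1, which collects a score from both fields.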
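The arithmetic behind this ranking can be sketched in Python. This is only an illustration, not Elasticsearch's implementation: the helper `bool_should_score` and its `coord` flag are hypothetical, and the split of document 1's total into a title part and a body part is inferred from the scores printed in this article (0.90425634 minus the 0.6931472 reported for document 1 elsewhere), not returned by Elasticsearch.

```python
# Per-clause scores for the "Brown fox" query.
# ASSUMPTION: document 1's title/body split is inferred from the article's
# outputs; Elasticsearch only reported the totals.
clause_scores = {
    "doc1": {"title": 0.6931472, "body": 0.21110914},
    "doc2": {"title": 0.0, "body": 0.77041256},
}


def bool_should_score(scores, coord=False):
    """Sum the scores of the matching should clauses.

    With coord=True, also apply the coordination factor described in
    steps (3) and (4): multiply by matching clauses / total clauses.
    """
    matched = [s for s in scores.values() if s > 0]
    total = sum(matched)
    if coord:
        total = total * len(matched) / len(scores)
    return total


for use_coord in (False, True):
    totals = {doc: bool_should_score(s, coord=use_coord)
              for doc, s in clause_scores.items()}
    ranking = sorted(totals, key=totals.get, reverse=True)
    print(use_coord, totals, ranking)
```

In both variants document 1 comes out first, because it is the only document that collects a score from both fields.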

Solution:

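Instead of the bool query, we can use the dis_max query (Disjunction Max Query). It returns every document that matches any of the queries, and it scores each document with the score of its best-matching query: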

GET /test_index/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {
          "match": {
            "title": "Brown fox"
          }
        },
        {
          "match": {
            "body": "Brown fox"
          }
        }
      ]
    }
  }
}

Output:

{
  "took" : 11,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.77041256,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.77041256,
        "_source" : {
          "title" : "Keeping pets healthy",
          "body" : "My quick brown fox eats rabbits on a regular basis."
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.6931472,
        "_source" : {
          "title" : "Quick brown rabbits",
          "body" : "Brown rabbits are commonly seen."
        }
      }
    ]
  }
}

This gives the result we wanted: document 2 now ranks ahead of document 1.
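A minimal Python sketch of why the order flips: dis_max keeps only the best clause score instead of summing. The helper name `dis_max_score` is hypothetical, and document 1's body score is inferred from the earlier bool output rather than reported by Elasticsearch.

```python
# Per-clause scores for the "Brown fox" query.
# ASSUMPTION: document 1's body score (0.21110914) is inferred from the
# article's outputs; the other values appear in the responses above.
clause_scores = {
    "doc1": {"title": 0.6931472, "body": 0.21110914},
    "doc2": {"title": 0.0, "body": 0.77041256},
}


def dis_max_score(scores):
    """Return the score of the single best-matching clause."""
    return max(scores.values())


dis_max_scores = {doc: dis_max_score(s) for doc, s in clause_scores.items()}
ranking = sorted(dis_max_scores, key=dis_max_scores.get, reverse=True)
print(dis_max_scores, ranking)
```

Document 1's weaker body match no longer adds to its total, so document 2's single strong body match wins.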

2. Tuning best fields queries

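Continuing the example above, suppose the user searches for "quick pets". Both documents contain the word quick, but only document 2 contains pets. Neither document contains both search words in a single field.
Run the query: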

GET /test_index/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {
          "match": {
            "title": "Quick pets"
          }
        },
        {
          "match": {
            "body": "Quick pets"
          }
        }
      ]
    }
  }
}

Output:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.6931472,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.6931472,
        "_source" : {
          "title" : "Quick brown rabbits",
          "body" : "Brown rabbits are commonly seen."
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.6931472,
        "_source" : {
          "title" : "Keeping pets healthy",
          "body" : "My quick brown fox eats rabbits on a regular basis."
        }
      }
    ]
  }
}

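Both documents receive the same score. What we want, however, is for the document that matches in both the title field and the body field to rank higher.
Note: the dis_max query simply uses the relevance score of the single best-matching clause.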

Solution:

To get the result we expect, specify the tie_breaker parameter.
The query looks like this:

GET /test_index/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {
          "match": {
            "title": "Quick pets"
          }
        },
        {
          "match": {
            "body": "Quick pets"
          }
        }
      ],
      "tie_breaker": 0.3
    }
  }
}

Output:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.87613803,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.87613803,
        "_source" : {
          "title" : "Keeping pets healthy",
          "body" : "My quick brown fox eats rabbits on a regular basis."
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.6931472,
        "_source" : {
          "title" : "Quick brown rabbits",
          "body" : "Brown rabbits are commonly seen."
        }
      }
    ]
  }
}

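This is the result we expected: document 2 now scores higher than document 1.
The tie_breaker parameter makes the dis_max query behave like a compromise between dis_max and bool. It changes the score calculation as follows:
(1) Take the _score of the best-matching clause.
(2) Multiply the score of every other matching clause by tie_breaker.
(3) Add the results together.
With tie_breaker, every matching clause contributes to the score, but the best-matching clause contributes the most.
Note: tie_breaker is a floating-point value between 0 and 1. A value of 0 uses only the best-matching clause, and a value of 1 treats all matching clauses equally. The exact value has to be tuned against your data and queries, but a reasonable value stays close to 0 so that it does not overwhelm the best-match nature of the dis_max query.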
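The three steps can be checked against the scores in the output above with a short Python sketch. The function `dis_max_score` is a hypothetical helper, not an Elasticsearch API. Document 2's weaker clause score (about 0.6099694) is an inferred value, derived from the reported totals as (0.87613803 - 0.6931472) / 0.3.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times every other matching score."""
    matched = [s for s in clause_scores if s > 0]
    if not matched:
        return 0.0
    best = max(matched)
    others = sum(matched) - best
    return best + tie_breaker * others


# Clause scores for the "Quick pets" query.
# ASSUMPTION: 0.6099694 is inferred from the article's outputs.
doc1 = [0.6931472, 0.0]        # only one field matches
doc2 = [0.6931472, 0.6099694]  # both fields match

print(dis_max_score(doc1, 0.3))  # unchanged: there is no second match to add
print(dis_max_score(doc2, 0.3))  # best score plus 30% of the other match
print(dis_max_score(doc2, 0.0))  # tie_breaker 0: best clause only
print(dis_max_score(doc2, 1.0))  # tie_breaker 1: plain sum of all matches
```

With tie_breaker set to 0.3, document 2 moves to roughly 0.876 while document 1 stays at 0.6931472, which agrees with the response shown above.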

3. The multi_match query

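The multi_match query provides a convenient way to run the same query against multiple fields.
By default it runs as the best_fields type: it generates one match query per field and wraps them in a dis_max query. Take the following dis_max query: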

GET /test_index/_search
{
  "query": {
    "dis_max": {
      "tie_breaker": 0.3,
      "queries": [
        {
          "match": {
            "title": {
              "query": "Quick brown fox",
              "minimum_should_match": "30%"
            }
          }
        },
        {
          "match": {
            "body": {
              "query": "Quick brown fox",
              "minimum_should_match": "30%"
            }
          }
        }
      ]
    }
  }
}

It can be rewritten more concisely with multi_match:

GET /test_index/_search
{
  "query": {
    "multi_match": {
      "query": "Quick brown fox",
      "type": "best_fields",
      "fields": ["title", "body"],
      "tie_breaker": 0.3,
      "minimum_should_match": "30%"
    }
  }
}

Note: the type is best_fields, and parameters such as minimum_should_match and operator are passed on to the generated match queries.
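The rewrite can be made explicit with a small Python sketch that expands a best_fields multi_match body into the dis_max body it stands for. The function `multi_match_to_dis_max` is a hypothetical helper written for this note, not part of any Elasticsearch client, and it assumes plain field names (no wildcards or boosts).

```python
def multi_match_to_dis_max(multi_match):
    """Expand a best_fields multi_match body into an equivalent dis_max body."""
    # Options that are copied into every generated match query.
    per_field = {"query": multi_match["query"]}
    for option in ("minimum_should_match", "operator"):
        if option in multi_match:
            per_field[option] = multi_match[option]

    dis_max = {
        "queries": [
            {"match": {field: dict(per_field)}}
            for field in multi_match["fields"]
        ]
    }
    # tie_breaker stays on the dis_max level, not on the match queries.
    if "tie_breaker" in multi_match:
        dis_max["tie_breaker"] = multi_match["tie_breaker"]
    return {"dis_max": dis_max}


multi_match = {
    "query": "Quick brown fox",
    "type": "best_fields",
    "fields": ["title", "body"],
    "tie_breaker": 0.3,
    "minimum_should_match": "30%",
}
expanded = multi_match_to_dis_max(multi_match)
print(expanded)
```

The resulting dictionary has the same structure as the dis_max request shown earlier, with one match query per entry in fields.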
With multi_match you can also use wildcards to match field names, and boost individual fields.
Wildcard:

"fields": "*_title"

Boosting:

"fields": ["*_title", "chapter_title^2"] # chapter_title now has a boost of 2, while book_title and section_title keep the default boost of 1
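To make the two notations concrete, here is a rough Python sketch of how a field entry can be read. It only illustrates the syntax: `parse_field` and the `index_fields` list are hypothetical, and the sketch uses Python's `fnmatch` for the wildcard rather than reproducing how Elasticsearch resolves field names internally.

```python
from fnmatch import fnmatchcase


def parse_field(spec):
    """Split a 'name^boost' entry into (name, boost); boost defaults to 1.0."""
    name, _, boost = spec.partition("^")
    return name, float(boost) if boost else 1.0


# ASSUMPTION: a hypothetical set of field names in the index mapping.
index_fields = ["book_title", "section_title", "chapter_title", "body"]

pattern, default_boost = parse_field("*_title")
boosted_name, boosted_value = parse_field("chapter_title^2")

# Fields selected by the wildcard pattern.
matched = [f for f in index_fields if fnmatchcase(f, pattern)]
print(matched, default_boost, boosted_name, boosted_value)
```

The wildcard selects the three title fields and leaves body out, while the caret suffix attaches a boost of 2 to chapter_title only.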