23 Most Useful Elasticsearch Search Techniques (Part 1)

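Preface

This article introduces 23 of the most useful Elasticsearch search techniques, with detailed source examples and a matching Java API implementation, making it a handy resource for learning and practicing Elasticsearch.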

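Data preparation

To demonstrate the different kinds of ES queries, we will search a collection of book documents with the following fields:

title               book title
authors             list of authors
summary             short summary
publish_date        publication date
num_reviews         number of reviews
publisher           publisher

First, we use the bulk API to create a new index and load the data.

# Index settings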
PUT /bookdb_index
{ "settings": { "number_of_shards": 1 }}

# Submit the data with the bulk API
POST /bookdb_index/book/_bulk
{"index":{"_id":1}}
{"title":"Elasticsearch: The Definitive Guide","authors":["clinton gormley","zachary tong"],"summary":"A distibuted real-time search and analytics engine","publish_date":"2015-02-07","num_reviews":20,"publisher":"oreilly"}
{"index":{"_id":2}}
{"title":"Taming Text: How to Find, Organize, and Manipulate It","authors":["grant ingersoll","thomas morton","drew farris"],"summary":"organize text using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization","publish_date":"2013-01-24","num_reviews":12,"publisher":"manning"}
{"index":{"_id":3}}
{"title":"Elasticsearch in Action","authors":["radu gheorge","matthew lee hinman","roy russo"],"summary":"build scalable search applications using Elasticsearch without having to do complex low-level programming or understand advanced data science algorithms","publish_date":"2015-12-03","num_reviews":18,"publisher":"manning"}
{"index":{"_id":4}}
{"title":"Solr in Action","authors":["trey grainger","timothy potter"],"summary":"Comprehensive guide to implementing a scalable search engine using Apache Solr","publish_date":"2014-04-05","num_reviews":23,"publisher":"manning"}

Note: the examples in this article were run against ES 6.3.0.
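For readers who load the data from a script rather than from the Kibana console, the bulk request body can be assembled programmatically. The following is a minimal Python sketch and not part of the original article: the helper name bulk_body is hypothetical, and only two abbreviated documents are shown. It relies on the bulk API's newline-delimited JSON format, in which each action line is followed by its source line and the body ends with a newline.

```python
import json

# Abbreviated copies of two documents from the article (other fields omitted).
BOOKS = [
    {"_id": 1, "title": "Elasticsearch: The Definitive Guide", "num_reviews": 20},
    {"_id": 4, "title": "Solr in Action", "num_reviews": 23},
]


def bulk_body(docs):
    """Build an NDJSON bulk body: one action line plus one source line per doc."""
    lines = []
    for doc in docs:
        source = {key: value for key, value in doc.items() if key != "_id"}
        lines.append(json.dumps({"index": {"_id": doc["_id"]}}))
        lines.append(json.dumps(source))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(bulk_body(BOOKS), end="")
```

The resulting string would be sent as the body of POST /bookdb_index/book/_bulk with any HTTP client.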

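1. Basic Match Query

1.1 Full-text search

There are two ways to run a full-text search:

1) Use the search API with the query passed as a parameter in the URL

Example: the following runs a full-text search for "guide"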

GET bookdb_index/book/_search?q=guide

[Results]
  "hits": {
    "total": 2,
    "max_score": 1.3278645,
    "hits": [
      {
        "_index": "bookdb_index",
        "_type": "book",
        "_id": "4",
        "_score": 1.3278645,
        "_source": {
          "title": "Solr in Action",
          "authors": [
            "trey grainger",
            "timothy potter"
          ],
          "summary": "Comprehensive guide to implementing a scalable search engine using Apache Solr",
          "publish_date": "2014-04-05",
          "num_reviews": 23,
          "publisher": "manning"
        }
      },
      {
        "_index": "bookdb_index",
        "_type": "book",
        "_id": "1",
        "_score": 1.2871116,
        "_source": {
          "title": "Elasticsearch: The Definitive Guide",
          "authors": [
            "clinton gormley",
            "zachary tong"
          ],
          "summary": "A distibuted real-time search and analytics engine",
          "publish_date": "2015-02-07",
          "num_reviews": 20,
          "publisher": "oreilly"
        }
      }
    ]
  }

2) Use the full ES query DSL, with a JSON body as the request body.
The results are the same as for method 1).

GET bookdb_index/book/_search
{
  "query": {
    "multi_match": {
      "query": "guide",
      "fields" : ["_all"]
    }
  }
}

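Explanation: the multi_match keyword is used in place of match as a convenient shorthand for running the same query against multiple fields. The fields property specifies which fields to query; in this case we want to query all fields in the document.

Note: ES 6.x does not enable the _all field by default. If fields is omitted, the query searches all fields by default.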
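To make the difference between the two request styles concrete, here is a small Python sketch that only builds the request path and the JSON body, without sending anything. The function names uri_search_path and multi_match_body are hypothetical helpers introduced for illustration. Following the note above, the body helper leaves out fields unless the caller supplies them.

```python
from urllib.parse import quote


def uri_search_path(index, doc_type, query):
    """Path for a URI search; the query text is percent-encoded."""
    return f"/{index}/{doc_type}/_search?q={quote(query)}"


def multi_match_body(query, fields=None):
    """Request body for a multi_match search; fields is optional."""
    multi_match = {"query": query}
    if fields:
        multi_match["fields"] = list(fields)
    return {"query": {"multi_match": multi_match}}


if __name__ == "__main__":
    print(uri_search_path("bookdb_index", "book", "guide"))
    print(multi_match_body("guide"))
```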

1.2 Searching a specific field

Both APIs also let you specify which field to search.
For example, to search for books with "in action" in the title field:

1) URL search

GET bookdb_index/book/_search?q=title:in action

[Results]
  "hits": {
    "total": 2,
    "max_score": 1.6323128,
    "hits": [
      {
        "_index": "bookdb_index",
        "_type": "book",
        "_id": "3",
        "_score": 1.6323128,
        "_source": {
          "title": "Elasticsearch in Action",
          "authors": [
            "radu gheorge",
            "matthew lee hinman",
            "roy russo"
          ],
          "summary": "build scalable search applications using Elasticsearch without having to do complex low-level programming or understand advanced data science algorithms",
          "publish_date": "2015-12-03",
          "num_reviews": 18,
          "publisher": "manning"
        }
      },
      {
        "_index": "bookdb_index",
        "_type": "book",
        "_id": "4",
        "_score": 1.6323128,
        "_source": {
          "title": "Solr in Action",
          "authors": [
            "trey grainger",
            "timothy potter"
          ],
          "summary": "Comprehensive guide to implementing a scalable search engine using Apache Solr",
          "publish_date": "2014-04-05",
          "num_reviews": 23,
          "publisher": "manning"
        }
      }
    ]
  }

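2) DSL search
The full-body DSL, however, gives you more flexibility for building complex queries (as we will see later) and for specifying how results are returned. In the example below, we specify the number of results to return, the offset (useful for pagination), the document fields to return, and highlighting of matched terms.

Number of results: size
Offset: from
Fields to return: _source
Highlighting: highlight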

GET bookdb_index/book/_search
{
  "query": {
    "match": {
      "title": "in action"
    }
  },
  "size": 2,
  "from": 0,
  "_source": ["title", "summary", "publish_date"],
  "highlight": {
    "fields": {
      "title": {}
    }
  }
}

[Results]
  "hits": {
    "total": 2,
    "max_score": 1.6323128,
    "hits": [
      {
        "_index": "bookdb_index",
        "_type": "book",
        "_id": "3",
        "_score": 1.6323128,
        "_source": {
          "summary": "build scalable search applications using Elasticsearch without having to do complex low-level programming or understand advanced data science algorithms",
          "title": "Elasticsearch in Action",
          "publish_date": "2015-12-03"
        },
        "highlight": {
          "title": [
            "Elasticsearch <em>in</em> <em>Action</em>"
          ]
        }
      },
      {
        "_index": "bookdb_index",
        "_type": "book",
        "_id": "4",
        "_score": 1.6323128,
        "_source": {
          "summary": "Comprehensive guide to implementing a scalable search engine using Apache Solr",
          "title": "Solr in Action",
          "publish_date": "2014-04-05"
        },
        "highlight": {
          "title": [
            "Solr <em>in</em> <em>Action</em>"
          ]
        }
      }
    ]
  }

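Note:

For multi-word queries, the match query lets you specify the and operator
instead of the default or operator ---> "operator" : "and"

You can also set the minimum_should_match option to tune the relevance of the returned results. See the Elasticsearch guide for details.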
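As a rough illustration of how the options in this section fit together, the Python sketch below builds a request body that combines the operator setting with size/from pagination. The helper name title_search and its page and page_size parameters are assumptions made for this example; the body uses the long form of the match query, in which the query text and the operator sit in an object under the field name.

```python
def title_search(text, page=1, page_size=2, operator="or"):
    """Build a title search body with pagination, field selection and highlighting."""
    if operator not in ("or", "and"):
        raise ValueError("operator must be 'or' or 'and'")
    if page < 1:
        raise ValueError("page numbers start at 1")
    return {
        "query": {"match": {"title": {"query": text, "operator": operator}}},
        "size": page_size,
        # "from" is a zero-based offset, so page 1 starts at 0.
        "from": (page - 1) * page_size,
        "_source": ["title", "summary", "publish_date"],
        "highlight": {"fields": {"title": {}}},
    }


if __name__ == "__main__":
    print(title_search("in action", page=2, operator="and"))
```

With operator set to "and", a title must contain both "in" and "action" to match, whereas the default "or" accepts either word.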

2. Multi-field Search

As we have already seen, to query more than one document field in a search (for example, the same query string in both title and summary), use the multi_match query.

GET bookdb_index/book/_search
{
  "query": {
    "multi_match": {
      "query": "elasticsearch guide",
      "fields": ["title", "summary"]
    }
  }
}

[Results]
  "hits": {
    "total": 3,
    "max_score": 2.0281231,
    "hits": [
      {
        "_index": "bookdb_index",
        "_type": "book",
        "_id": "1",
        "_score": 2.0281231,
        "_source": {
          "title": "Elasticsearch: The Definitive Guide",
          "authors": [
            "clinton gormley",
            "zachary tong"
          ],
          "summary": "A distibuted real-time search and analytics engine",
          "publish_date": "2015-02-07",
          "num_reviews": 20,
          "publisher": "oreilly"
        }
      },
      {
        "_index": "bookdb_index",
        "_type": "book",
        "_id": "4",
        "_score": 1.3278645,
        "_source": {
          "title": "Solr in Action",
          "authors": [
            "trey grainger",
            "timothy potter"
          ],
          "summary": "Comprehensive guide to implementing a scalable search engine using Apache Solr",
          "publish_date": "2014-04-05",
          "num_reviews": 23,
          "publisher": "manning"
        }
      },
      {
        "_index": "bookdb_index",
        "_type": "book",
        "_id": "3",
        "_score": 1.0333893,
        "_source": {
          "title": "Elasticsearch in Action",
          "authors": [
            "radu gheorge",
            "matthew lee hinman",
            "roy russo"
          ],
          "summary": "build scalable search applications using Elasticsearch without having to do complex low-level programming or understand advanced data science algorithms",
          "publish_date": "2015-12-03",
          "num_reviews": 18,
          "publisher": "manning"
        }
      }
    ]
  }

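Note: in the results above, document 4 (_id=4) matches because "guide" appears in its summary.

3. Boosting

Since we are searching across multiple fields, we may want to boost the score of a particular field. In the example below, we boost the summary field by a factor of 3 to increase its importance, which raises the relevance of document 4.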

GET bookdb_index/book/_search
{
  "query": {
    "multi_match": {
      "query": "elasticsearch guide", 
      "fields": ["title", "summary^3"]
    }
  },
  "_source": ["title", "summary", "publish_date"]
}

[Results]
  "hits": {
    "total": 3,
    "max_score": 3.9835935,
    "hits": [
      {
        "_index": "bookdb_index",
        "_type": "book",
        "_id": "4",
        "_score": 3.9835935,
        "_source": {
          "summary": "Comprehensive guide to implementing a scalable search engine using Apache Solr",
          "title": "Solr in Action",
          "publish_date": "2014-04-05"
        }
      },
      {
        "_index": "bookdb_index",
        "_type": "book",
        "_id": "3",
        "_score": 3.1001682,
        "_source": {
          "summary": "build scalable search applications using Elasticsearch without having to do complex low-level programming or understand advanced data science algorithms",
          "title": "Elasticsearch in Action",
          "publish_date": "2015-12-03"
        }
      },
      {
        "_index": "bookdb_index",
        "_type": "book",
        "_id": "1",
        "_score": 2.0281231,
        "_source": {
          "summary": "A distibuted real-time search and analytics engine",
          "title": "Elasticsearch: The Definitive Guide",
          "publish_date": "2015-02-07"
        }
      }
    ]
  }

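Note: boosting does not simply mean that the calculated score is multiplied by the boost factor. The actual boost value applied goes through normalization and some internal optimization. See the Elasticsearch guide for more.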
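The field^boost notation and the scores reported in this article can be checked with a few lines of Python. This is only a sketch: parse_field is a hypothetical helper, and the score tables simply copy the numbers printed in the multi-field results and in the boosted results above. For documents 4 and 3 the reported numbers happen to be consistent with a factor of 3 (up to rounding), while document 1 keeps the same score in both result sets; that observation applies to this data set only and does not contradict the note above about normalization.

```python
import math


def parse_field(spec):
    """Split a 'field^boost' spec into (field, boost); boost defaults to 1.0."""
    name, sep, boost = spec.partition("^")
    return name, (float(boost) if sep else 1.0)


# Scores copied from the result listings in this article.
UNBOOSTED = {"1": 2.0281231, "4": 1.3278645, "3": 1.0333893}
BOOSTED = {"1": 2.0281231, "4": 3.9835935, "3": 3.1001682}

if __name__ == "__main__":
    field, boost = parse_field("summary^3")
    for doc_id in ("4", "3"):
        scaled = UNBOOSTED[doc_id] * boost
        print(doc_id, round(scaled, 6), BOOSTED[doc_id])
```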

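4. Bool Query

The AND / OR / NOT operators can be used to fine-tune a search query to return more relevant or more specific results.

In the search API this is implemented with the bool query.
The bool query accepts a must parameter (equivalent to AND), a must_not parameter (equivalent to NOT), and a should parameter (equivalent to OR).

For example, to search for a book with "Elasticsearch" OR "Solr" in the title, AND authored by "clinton gormley", but NOT authored by "radu gheorge":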

GET bookdb_index/book/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": [
              {"match": {"title": "Elasticsearch"}},
              {"match": {"title": "Solr"}}
            ]
          }
        },
        {
          "match": {"authors": "clinton gormely"}
        }
      ],
      "must_not": [
        {
          "match": {"authors": "radu gheorge"}
        }
      ]
    }
  }
}

[Results]
  "hits": {
    "total": 1,
    "max_score": 2.0749094,
    "hits": [
      {
        "_index": "bookdb_index",
        "_type": "book",
        "_id": "1",
        "_score": 2.0749094,
        "_source": {
          "title": "Elasticsearch: The Definitive Guide",
          "authors": [
            "clinton gormley",
            "zachary tong"
          ],
          "summary": "A distibuted real-time search and analytics engine",
          "publish_date": "2015-02-07",
          "num_reviews": 20,
          "publisher": "oreilly"
        }
      }
    ]
  }

Note: as you can see, a bool query can wrap any other query type, including other bool queries, to build arbitrarily complex or deeply nested queries.
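The boolean logic of the query above can be traced in plain Python against the four sample books. This is a simplified in-memory model, not how Elasticsearch evaluates or scores the query: the match helper is an assumption that treats a field as a set of lowercase word tokens and accepts a document if any query token is present (the default or behaviour).

```python
import re

BOOKS = {
    1: {"title": "Elasticsearch: The Definitive Guide",
        "authors": ["clinton gormley", "zachary tong"]},
    2: {"title": "Taming Text: How to Find, Organize, and Manipulate It",
        "authors": ["grant ingersoll", "thomas morton", "drew farris"]},
    3: {"title": "Elasticsearch in Action",
        "authors": ["radu gheorge", "matthew lee hinman", "roy russo"]},
    4: {"title": "Solr in Action",
        "authors": ["trey grainger", "timothy potter"]},
}


def tokens(value):
    """Lowercase and split into word tokens; lists are joined first."""
    text = " ".join(value) if isinstance(value, list) else value
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def match(doc, field, text):
    """True if any query token occurs in the field (default 'or' behaviour)."""
    return bool(tokens(text) & tokens(doc[field]))


def bool_filter(doc):
    must = [
        # Nested bool with two should clauses: title is Elasticsearch OR Solr.
        match(doc, "title", "Elasticsearch") or match(doc, "title", "Solr"),
        match(doc, "authors", "clinton gormley"),
    ]
    must_not = [match(doc, "authors", "radu gheorge")]
    return all(must) and not any(must_not)


matching_ids = [doc_id for doc_id, doc in BOOKS.items() if bool_filter(doc)]
print(matching_ids)  # prints [1]
```

Only document 1 passes both must clauses, which agrees with the single hit shown in the results.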

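5. Fuzzy Queries

Fuzzy matching can be enabled on match and multi_match queries to catch spelling errors. The degree of fuzziness is specified as a Levenshtein distance from the original term.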

GET bookdb_index/book/_search
{
  "query": {
    "multi_match": {
      "query": "comprihensiv guide",
      "fields": ["title","summary"],
      "fuzziness": "AUTO"
    }
  },
  "_source": ["title","summary","publish_date"],
  "size": 2
}

[Results]
  "hits": {
    "total": 2,
    "max_score": 2.4344182,
    "hits": [
      {
        "_index": "bookdb_index",
        "_type": "book",
        "_id": "4",
        "_score": 2.4344182,
        "_source": {
          "summary": "Comprehensive guide to implementing a scalable search engine using Apache Solr",
          "title": "Solr in Action",
          "publish_date": "2014-04-05"
        }
      },
      {
        "_index": "bookdb_index",
        "_type": "book",
        "_id": "1",
        "_score": 1.2871116,
        "_source": {
          "summary": "A distibuted real-time search and analytics engine",
          "title": "Elasticsearch: The Definitive Guide",
          "publish_date": "2015-02-07"
        }
      }
    ]
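  }

A fuzziness of "AUTO" is equivalent to specifying 2 when the term is longer than 5 characters. However, 80% of misspellings have an edit distance of 1, so setting the fuzziness to 1 may improve overall search performance. For more information, see "Typos and Misspellings" in the Elasticsearch guide.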

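6. Wildcard Query

Wildcard queries let you specify a pattern to match instead of an entire term.

? matches any single character
* matches zero or more characters

For example, to find all records with an author whose name begins with the letter "t":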

GET bookdb_index/book/_search
{
  "query": {
    "wildcard": {
      "authors": {
        "value": "t*"
      }
    }
  },
  "_source": ["title", "authors"],
  "highlight": {
    "fields": {
      "authors": {}
    }
  }
}

[Results]
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "bookdb_index",
        "_type": "book",
        "_id": "1",
        "_score": 1,
        "_source": {
          "title": "Elasticsearch: The Definitive Guide",
          "authors": [
            "clinton gormley",
            "zachary tong"
          ]
        },
        "highlight": {
          "authors": [
            "zachary <em>tong</em>"
          ]
        }
      },
      {
        "_index": "bookdb_index",
        "_type": "book",
        "_id": "2",
        "_score": 1,
        "_source": {
          "title": "Taming Text: How to Find, Organize, and Manipulate It",
          "authors": [
            "grant ingersoll",
            "thomas morton",
            "drew farris"
          ]
        },
        "highlight": {
          "authors": [
            "<em>thomas</em> morton"
          ]
        }
      },
      {
        "_index": "bookdb_index",
        "_type": "book",
        "_id": "4",
        "_score": 1,
        "_source": {
          "title": "Solr in Action",
          "authors": [
            "trey grainger",
            "timothy potter"
          ]
        },
        "highlight": {
          "authors": [
            "<em>trey</em> grainger",
            "<em>timothy</em> potter"
          ]
        }
      }
    ]
  }
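The highlights above show that the pattern is tested against individual terms, which is why "zachary tong" matches through "tong". The following Python sketch reproduces that behaviour in memory. It is a simplified model: wildcard_to_regex and matching_terms are hypothetical helpers, and splitting names on whitespace stands in for the analyzer.

```python
import re

AUTHORS = {
    1: ["clinton gormley", "zachary tong"],
    2: ["grant ingersoll", "thomas morton", "drew farris"],
    3: ["radu gheorge", "matthew lee hinman", "roy russo"],
    4: ["trey grainger", "timothy potter"],
}


def wildcard_to_regex(pattern):
    """Translate ? (one character) and * (zero or more) into a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def matching_terms(authors, pattern):
    """Return the individual name terms that match the whole pattern."""
    regex = wildcard_to_regex(pattern)
    terms = [term for name in authors for term in name.lower().split()]
    return [term for term in terms if regex.fullmatch(term)]


if __name__ == "__main__":
    for doc_id, authors in AUTHORS.items():
        print(doc_id, matching_terms(authors, "t*"))
```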

7. Regexp Query

Regexp queries let you specify more complex patterns than wildcard queries. For example:

POST bookdb_index/book/_search
{
  "query": {
    "regexp": {
      "authors": "t[a-z]*y"
    }
  },
  "_source": ["title", "authors"],
  "highlight": {
    "fields": {
      "authors": {}
    }
  }
}

[Results]
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "bookdb_index",
        "_type": "book",
        "_id": "4",
        "_score": 1,
        "_source": {
          "title": "Solr in Action",
          "authors": [
            "trey grainger",
            "timothy potter"
          ]
        },
        "highlight": {
          "authors": [
            "<em>trey</em> grainger",
            "<em>timothy</em> potter"
          ]
        }
      }
    ]
  }
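The same term-by-term view explains the single hit above. In the Python sketch below, the pattern from the query is applied to each author term with re.fullmatch, on the assumption that the pattern has to match a whole term. Python's regex dialect is not the same as the one Elasticsearch uses, but this particular pattern means the same thing in both. The helper name regexp_hits is hypothetical.

```python
import re

AUTHORS = {
    1: ["clinton gormley", "zachary tong"],
    2: ["grant ingersoll", "thomas morton", "drew farris"],
    3: ["radu gheorge", "matthew lee hinman", "roy russo"],
    4: ["trey grainger", "timothy potter"],
}

PATTERN = re.compile(r"t[a-z]*y")


def regexp_hits(authors_by_id, pattern):
    """Map each matching document id to the terms that match the whole pattern."""
    hits = {}
    for doc_id, authors in authors_by_id.items():
        terms = [term for name in authors for term in name.lower().split()]
        matched = [term for term in terms if pattern.fullmatch(term)]
        if matched:
            hits[doc_id] = matched
    return hits


print(regexp_hits(AUTHORS, PATTERN))  # prints {4: ['trey', 'timothy']}
```

"tong" and "thomas" start with "t" but do not end in "y", so documents 1 and 2 drop out compared with the wildcard example.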

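8. Match Phrase Query

The match phrase query requires that all the terms in the query string be present in the document, in the order given in the query string, and close to each other.

By default the terms must be exactly adjacent, but you can specify a slop value that indicates how far apart terms may be while the document is still considered a match.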

GET bookdb_index/book/_search
{
  "query": {
    "multi_match": {
      "query": "search engine",
      "fields": ["title", "summary"],
      "type": "phrase",
      "slop": 3
    }
  },
  "_source": [ "title", "summary", "publish_date" ]
}

[Results]
  "hits": {
    "total": 2,
    "max_score": 0.88067603,
    "hits": [
      {
        "_index": "bookdb_index",
        "_type": "book",
        "_id": "4",
        "_score": 0.88067603,
        "_source": {
          "summary": "Comprehensive guide to implementing a scalable search engine using Apache Solr",
          "title": "Solr in Action",
          "publish_date": "2014-04-05"
        }
      },
      {
        "_index": "bookdb_index",
        "_type": "book",
        "_id": "1",
        "_score": 0.51429313,
        "_source": {
          "summary": "A distibuted real-time search and analytics engine",
          "title": "Elasticsearch: The Definitive Guide",
          "publish_date": "2015-02-07"
        }
      }
    ]
  }

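Note: in the example above, for a non-phrase query, document _id 1 would normally score higher and appear ahead of document _id 4 because its field length is shorter.

As a phrase query, however, the proximity of the terms is taken into account, so document _id 4 scores better.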
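The effect of slop on the two hits can be approximated in Python. The sketch below is a deliberately simplified model: it only handles terms that appear in query order and counts the words sitting between them, whereas the real slop computation is more general (for example, it can also tolerate reordered terms at a higher cost). The helper names tokenize and phrase_matches are assumptions for this example.

```python
import re

SUMMARIES = {
    1: "A distibuted real-time search and analytics engine",
    4: "Comprehensive guide to implementing a scalable search engine using Apache Solr",
}


def tokenize(text):
    """Lowercase word tokens in document order."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_matches(text, phrase, slop=0):
    """True if the phrase terms occur in order with at most `slop` words between them."""
    doc_tokens = tokenize(text)
    terms = tokenize(phrase)

    def search(term_idx, prev_pos, used):
        if term_idx == len(terms):
            return True
        for pos, token in enumerate(doc_tokens):
            if token != terms[term_idx]:
                continue
            if prev_pos is None:
                gap = 0
            elif pos > prev_pos:
                gap = pos - prev_pos - 1  # words between the two terms
            else:
                continue
            if used + gap <= slop and search(term_idx + 1, pos, used + gap):
                return True
        return False

    return search(0, None, 0)


if __name__ == "__main__":
    for slop in (0, 3):
        hits = [i for i, s in SUMMARIES.items() if phrase_matches(s, "search engine", slop)]
        print(slop, hits)
```

In document 4 the two words are adjacent, while in document 1 "and analytics" sits between them, so document 1 only matches once slop is at least 2.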

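9. Match Phrase Prefix Query

The match phrase prefix query provides search-as-you-type, a "relatively simple" version of autocomplete, at query time, without needing to prepare the data in any way.

Like the match_phrase query, it accepts a slop parameter to make the word order and relative positions less strict. It also accepts the max_expansions parameter to limit the number of terms matched, in order to reduce resource usage.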

GET bookdb_index/book/_search
{
  "query": {
    "match_phrase_prefix": {
      "summary": {
        "query": "search en",
        "slop": 3,
        "max_expansions": 10
      }
    }
  },
  "_source": ["title","summary","publish_date"]
}

Note: search-as-you-type at query time has a performance cost. A better solution is to do search-as-you-type at index time. For more, see the Completion Suggester API or Edge-Ngram filters.
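The index-time alternative mentioned in the note can be pictured with a small Python model of edge n-grams. This is a conceptual sketch only: edge_ngrams and build_prefix_index are hypothetical helpers, min_gram and max_gram mirror the parameter names of the edge n-gram filter, and a plain dictionary stands in for the inverted index. The point is that once every prefix is stored as its own term, a prefix search becomes an exact term lookup.

```python
SUMMARIES = {
    1: "A distibuted real-time search and analytics engine",
    4: "Comprehensive guide to implementing a scalable search engine using Apache Solr",
}


def edge_ngrams(token, min_gram=1, max_gram=10):
    """Prefixes of the token whose length is between min_gram and max_gram."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def build_prefix_index(texts_by_id, min_gram=1, max_gram=10):
    """Map every edge n-gram to the set of document ids that contain it."""
    index = {}
    for doc_id, text in texts_by_id.items():
        for token in text.lower().split():
            for gram in edge_ngrams(token, min_gram, max_gram):
                index.setdefault(gram, set()).add(doc_id)
    return index


if __name__ == "__main__":
    index = build_prefix_index(SUMMARIES)
    print(edge_ngrams("engine", 1, 3))
    print(sorted(index["en"]))
```

The trade-off is a larger index in exchange for cheaper queries, since the expansion work is done once when the document is indexed.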

10. Query String

The query_string query provides a concise shorthand syntax for running multi_match queries, bool queries, boosting, fuzzy matching, wildcards, regexp, and range queries.

In the example below, we run a fuzzy search for the terms "search algorithm" in books where one of the authors is "grant ingersoll" or "tom morton". We search all fields but apply a boost of 2 to the summary field.

GET bookdb_index/book/_search
{
  "query": {
    "query_string": {
      "query": "(saerch~1 algorithm~1) AND (grant ingersoll)  OR (tom morton)",
      "fields": ["summary^2","title","authors","publisher"]
    }
  },
  "_source": ["title","summary","authors"],
  "highlight": {
    "fields": {
      "summary": {}
    }
  }
}

[Results]
  "hits": {
    "total": 1,
    "max_score": 3.571021,
    "hits": [
      {
        "_index": "bookdb_index",
        "_type": "book",
        "_id": "2",
        "_score": 3.571021,
        "_source": {
          "summary": "organize text using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization",
          "title": "Taming Text: How to Find, Organize, and Manipulate It",
          "authors": [
            "grant ingersoll",
            "thomas morton",
            "drew farris"
          ]
        },
        "highlight": {
          "summary": [
            "organize text using approaches such as full-text <em>search</em>, proper name recognition, clustering, tagging"
          ]
        }
      }
    ]
  }

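11. Simple Query String

The simple_query_string query is a version of the query_string query that is better suited to a single search box exposed to users,
because it replaces AND / OR / NOT with + / | / - respectively, and it discards invalid parts of a query instead of throwing an exception when the user makes a mistake.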

GET bookdb_index/book/_search
{
  "query": {
    "simple_query_string": {
      "query": "(saerch~1 algorithm~1) + (grant ingersoll)  | (tom morton)",
      "fields": ["summary^2","title","authors","publisher"]
    }
  },
  "_source": ["title","summary","authors"],
  "highlight": {
    "fields": {
      "summary": {}
    }
  }
}

[Results]
# Same results as above
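The operator substitution described in this section can be shown with a naive Python rewrite of the query string from section 10 into the form used here. The function to_simple_syntax is a hypothetical helper written for this illustration only: it swaps the keywords textually, does not understand quoted phrases or escaping, and makes no claim that the two query types treat every input identically.

```python
import re


def to_simple_syntax(query):
    """Naively replace AND / OR / NOT with + / | / - (keywords must be uppercase)."""
    query = re.sub(r"\bAND\b", "+", query)
    query = re.sub(r"\bOR\b", "|", query)
    # NOT binds to the following term, so drop the whitespace after it.
    query = re.sub(r"\bNOT\s+", "-", query)
    return query


if __name__ == "__main__":
    original = "(saerch~1 algorithm~1) AND (grant ingersoll)  OR (tom morton)"
    print(to_simple_syntax(original))
```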

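Java API implementation

For the Java API implementation, see the code at https://github.com/whirlys/elastic-example/tree/master/UsefullESSearchSkill

Summary

Because a single official-account post cannot exceed 50,000 characters, the article is split into two parts.
It is late today, so corrections to the article and the Java API implementation will be updated tomorrow.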

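For more, visit my personal blog: http://laijianfeng.org
References:
铭毅天下: [Translation] 23 Most Useful Elasticsearch Search Techniques You Must Know
English original: 23 Useful Elasticsearch Example Queries

This article was shared from the WeChat official account 小旋锋 (whirlysBigData).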