ElasticSearch 2.x: Introduction and Quick Start

Date: 2019-11-16


This article is part of the author's series on best practices for crawlers and search engines.

Introduction

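ElasticSearch is an open-source search engine built on Apache Lucene(TM). In both the open-source and the proprietary world, Lucene can be regarded as the most advanced, best-performing, and most feature-complete search engine library to date. Lucene, however, is only a library. To use it you must develop in Java and integrate it directly into your application. Worse, Lucene is very complex, and you need a solid grounding in information retrieval to understand how it works. ElasticSearch is also written in Java and uses Lucene at its core for all indexing and searching, but its goal is to hide Lucene's complexity behind a simple RESTful API and so make full-text search easy.
Elasticsearch is more than Lucene and full-text search, though. It can also be described as: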

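A distributed real-time document store in which every field is indexed and searchable
A distributed search engine with real-time analytics
Able to scale to hundreds of servers and handle petabytes of structured or unstructured data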

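All of this is packaged into a single service that your application can talk to through a simple RESTful API, through clients for various languages, or even from the command line. Getting started with Elasticsearch is easy. It ships with sensible defaults and hides complicated search-engine theory from beginners. It works out of the box, and it takes very little learning before you can use it in production. In ElasticSearch we often hear terms such as Index, Type, and Document. They map onto the familiar terms of a relational database as follows: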

Relational DB -> Databases -> Tables -> Rows -> Columns
Elasticsearch -> Indices   -> Types  -> Documents -> Fields

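The mind map below, borrowed from another article, outlines what you should know about the whole ElasticSearch ecosystem:
[Figure: mind map of the ElasticSearch ecosystem]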

Reference

Books & Tutorial

Elasticsearch: The Definitive Guide (Chinese edition)
elasticsearch-definitive-guide

Quick Start

Installation

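Download the latest prebuilt release of ElasticSearch, then simply unpack it and start it. At the time of writing the author is using ElasticSearch 2.3.5, whose directory layout is as follows: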

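home --- the directory where Elasticsearch was unpacked
  bin --- the scripts that start ES
  config --- contains elasticsearch.yml, the ES configuration file
  data --- shard data for the current node; it can be copied directly to other nodes
  logs --- log files
  plugins --- commonly used plugins; any additional plugins can also be placed here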

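By default, ElasticSearch 2.x refuses to run an instance as the root user. You can start it as root with bin/elasticsearch -Des.insecure.allow.root=true. You can also use bin/elasticsearch -f -Des.path.conf=/path/to/config/dir to read the .yml or .json configuration from a given directory.

Some other common settings are listed below: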

Setting	Description
`http.port`	A bind port range. Defaults to `9200-9300`.
`http.publish_port`	The port that HTTP clients should use when communicating with this node. Useful when a cluster node is behind a proxy or firewall and the `http.port` is not directly addressable from the outside. Defaults to the actual port assigned via `http.port`.
`http.bind_host`	The host address to bind the HTTP service to. Defaults to `http.host` (if set) or `network.bind_host`.
`http.publish_host`	The host address to publish for HTTP clients to connect to. Defaults to `http.host` (if set) or `network.publish_host`.
`http.host`	Used to set the `http.bind_host` and the `http.publish_host`. Defaults to `network.host`.
`http.max_content_length`	The max content of an HTTP request. Defaults to `100mb`. If set to greater than `Integer.MAX_VALUE`, it will be reset to 100mb.
`http.max_initial_line_length`	The max length of an HTTP URL. Defaults to `4kb`.
`http.max_header_size`	The max size of allowed headers. Defaults to `8kB`.
`http.compression`	Support for compression when possible (with Accept-Encoding). Defaults to `false`.
`http.compression_level`	Defines the compression level to use. Defaults to `6`.
`http.cors.enabled`	Enable or disable cross-origin resource sharing, i.e. whether a browser on another origin can do requests to Elasticsearch. Defaults to `false`.
`http.cors.allow-origin`	Which origins to allow. Defaults to no origins allowed. If you prepend and append a `/` to the value, it is treated as a regular expression, allowing you to support both HTTP and HTTPS. For example, `/https?:\/\/localhost(:[0-9]+)?/` would return the request header appropriately in both cases. `*` is a valid value but is considered a security risk, as your Elasticsearch instance is then open to cross-origin requests from anywhere.
`http.cors.max-age`	Browsers send a "preflight" OPTIONS request to determine CORS settings. `max-age` defines how long the result should be cached for. Defaults to `1728000` (20 days).
`http.cors.allow-methods`	Which methods to allow. Defaults to `OPTIONS, HEAD, GET, POST, PUT, DELETE`.
`http.cors.allow-headers`	Which headers to allow. Defaults to `X-Requested-With, Content-Type, Content-Length`.
`http.cors.allow-credentials`	Whether the `Access-Control-Allow-Credentials` header should be returned. Note: this header is only returned when the setting is set to `true`. Defaults to `false`.
`http.detailed_errors.enabled`	Enables or disables the output of detailed error messages and stack traces in response output. Note: when set to `false` and the `error_trace` request parameter is specified, an error will be returned; when `error_trace` is not specified, a simple message will be returned. Defaults to `true`.
`http.pipelining`	Enable or disable HTTP pipelining, defaults to `true`.
`http.pipelining.max_events`	The maximum number of events to be queued up in memory before a HTTP connection is closed, defaults to `10000`.
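
The regular-expression form of `http.cors.allow-origin` is easy to get wrong, so here is a small Python sketch of how such a value could be checked against an Origin header. It is only an illustration of the rule described in the table, not ElasticSearch's own code. The function name `origin_allowed` is made up for this example, and matching the whole origin string (rather than a substring) is an assumption.

```python
import re


def origin_allowed(origin, allow_origin):
    """Illustrative check of an Origin value against an allow-origin setting.

    Hypothetical helper: a value wrapped in slashes is treated as a regular
    expression, "*" allows everything, anything else is compared literally.
    """
    if allow_origin == "*":
        return True
    if len(allow_origin) > 2 and allow_origin.startswith("/") and allow_origin.endswith("/"):
        pattern = allow_origin[1:-1]  # strip the surrounding slashes
        # Assumption: the pattern has to match the entire origin.
        return re.fullmatch(pattern, origin) is not None
    return origin == allow_origin


# The example value from the settings table above.
SETTING = r"/https?:\/\/localhost(:[0-9]+)?/"

print(origin_allowed("http://localhost:8080", SETTING))
print(origin_allowed("https://localhost", SETTING))
print(origin_allowed("http://example.com", SETTING))
```

With the value from the table, both the http and the https localhost origins pass, while an unrelated host does not.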

REST API

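Once an ElasticSearch instance is running, we can interact with it through its built-in JSON-based REST API. We can use the curl tool shown in the official tutorial, or somewhat heavier general-purpose tools such as Fiddler or RESTClient. Here we recommend Sense, a Chrome extension that offers autocompletion for many ElasticSearch requests.

Accessing the root path directly returns a response like the following: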

{
   "name": "Mister Fear",
   "cluster_name": "elasticsearch",
   "version": {
      "number": "2.3.5",
      "build_hash": "90f439ff60a3c0f497f91663701e64ccd01edbb4",
      "build_timestamp": "2016-07-27T10:36:52Z",
      "build_snapshot": false,
      "lucene_version": "5.5.0"
   },
   "tagline": "You Know, for Search"
}
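
As a small illustration, the version number in this response can be read programmatically. The sketch below parses a copy of the sample response above instead of calling a live node; the function name `parse_version` is made up for this example.

```python
import json

# A copy of the sample response shown above (not fetched from a server).
ROOT_RESPONSE = """
{
   "name": "Mister Fear",
   "cluster_name": "elasticsearch",
   "version": {
      "number": "2.3.5",
      "build_hash": "90f439ff60a3c0f497f91663701e64ccd01edbb4",
      "build_timestamp": "2016-07-27T10:36:52Z",
      "build_snapshot": false,
      "lucene_version": "5.5.0"
   },
   "tagline": "You Know, for Search"
}
"""


def parse_version(text):
    """Return the ElasticSearch version as a tuple of ints, e.g. (2, 3, 5)."""
    info = json.loads(text)
    number = info["version"]["number"]
    return tuple(int(part) for part in number.split("."))


version = parse_version(ROOT_RESPONSE)
print(version)
print("running a 2.x node:", version[0] == 2)
```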

CRUD

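Index: Creating and Updating Documents

In ElasticSearch, the Index action corresponds to both Create and Update in CRUD. When we index a document that does not yet exist, a new document is created from its type and ID; otherwise the existing document is modified. ElasticSearch uses PUT requests for the Index operation. You supply the index name, the type name, and an optional ID, in the format http://localhost:9200/<index>/<type>/[<id>]. (If the ID is omitted so that ElasticSearch generates one, the request must be a POST rather than a PUT.) The index name can be chosen freely as long as it is lowercase, and the index is created automatically if it does not yet exist. Type names follow much the same rules as index names, but a type carries more detail than an index:

Each type has its own ID space
Different types have different mappings, that is, different schemes for indexing their properties/fields
Where possible, limit each search request to a single type or to specific types

A typical Index request looks like this: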

curl -XPUT "http://localhost:9200/movies/movie/1" -d'
{
    "title": "The Godfather",
    "director": "Francis Ford Coppola",
    "year": 1972
}'
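
For readers who prefer a script to curl, the same request can be put together with Python's standard library. This is a minimal sketch, assuming a node at http://localhost:9200 as in the curl example; the helper name `build_index_request` is made up here, and the request is only constructed, not sent.

```python
import json
import urllib.request


def build_index_request(index, doc_type, doc_id, document,
                        base_url="http://localhost:9200"):
    """Build (but do not send) a PUT request to <index>/<type>/<id>."""
    url = "{}/{}/{}/{}".format(base_url, index, doc_type, doc_id)
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(url, data=body, method="PUT")


request = build_index_request("movies", "movie", 1, {
    "title": "The Godfather",
    "director": "Francis Ford Coppola",
    "year": 1972,
})
print(request.get_method(), request.full_url)

# To send it to a running node, something like the following would be needed:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```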

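Once the curl request above has run, ElasticSearch has created a document with ID 1 in the index movies under the type movie. You can of course also run the request in Sense, which makes for a somewhat nicer experience.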

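ElasticSearch's response to the PUT request reports whether the operation succeeded, along with the document ID and related information. At this point we can run a default global search.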

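The document we just created is already searchable. Next, let us modify it by adding some keyword attributes. We can again use a PUT request, but this time we must include the ID of the document to be modified: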

curl -XPUT "http://localhost:9200/movies/movie/1" -d'
{
    "title": "The Godfather",
    "director": "Francis Ford Coppola",
    "year": 1972,
    "genres": ["Crime", "Drama"]
}'

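ElasticSearch's response to this operation is very similar to the previous one, except that the value of _version has changed.

This attribute tracks how many times the document has been modified and can be used for optimistic concurrency control over concurrent writes. With the default internal versioning, a write that passes a version parameter succeeds only if it matches the document's current version. With version_type=external, ElasticSearch only accepts a write whose version number is higher than that of the stored document.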
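
To make the idea of version-based optimistic concurrency concrete, here is a toy in-memory model in Python. It is not ElasticSearch code and talks to no server; the class names `VersionedStore` and `VersionConflict` are made up for this sketch, and the rules it enforces are a simplified reading of the behaviour described above.

```python
class VersionConflict(Exception):
    """Raised when a write carries a stale or too-low version."""


class VersionedStore:
    """Toy model of version checks on writes (illustrative only)."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def index(self, doc_id, source, version=None, external=False):
        current = self._docs.get(doc_id, (0, None))[0]
        if external:
            # External versioning: the caller supplies the version,
            # which must be strictly higher than the stored one.
            if version is None:
                raise ValueError("external versioning needs a version")
            if version <= current:
                raise VersionConflict("version {} is not above {}".format(version, current))
            new_version = version
        else:
            # Internal versioning: an expected version, if given, must match.
            if version is not None and version != current:
                raise VersionConflict("expected {}, found {}".format(version, current))
            new_version = current + 1
        self._docs[doc_id] = (new_version, source)
        return new_version


store = VersionedStore()
v1 = store.index("1", {"title": "The Godfather"})
v2 = store.index("1", {"title": "The Godfather", "genres": ["Crime", "Drama"]}, version=v1)
print(v1, v2)

try:
    # A second writer still believes the document is at version 1.
    store.index("1", {"title": "Stale update"}, version=v1)
except VersionConflict as error:
    print("conflict:", error)
```

The second writer's update is rejected because it was based on a version that is no longer current, which is the point of optimistic concurrency control: nobody takes a lock, and stale writes fail instead.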

GET

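The simplest way to fetch a document is to look it up by its ID. The standard request format is http://localhost:9200/<index>/<type>/<id>. Let us fetch the movie data inserted above:

curl -XGET "http://localhost:9200/movies/movie/1"

The response again contains the version, the ID, and the source document.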

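Delete: Deleting Documents

Now let us delete one of the documents inserted above. Deleting a document likewise requires the index name, the type name, and the document ID, for example:

curl -XDELETE "http://localhost:9200/movies/movie/1"

After the document has been deleted, a further GET for it returns a 404 response whose body reports "found": false.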
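
A client usually wants to tell "document present" from "document missing" when reading such responses. The sketch below shows one way to do that in Python. The two sample bodies are hand-written in the general shape of ElasticSearch 2.x GET responses and are not captured from a real node, and the helper name `source_or_none` is made up for this example.

```python
import json


def source_or_none(response_text):
    """Return the stored document, or None if the response says it is missing."""
    reply = json.loads(response_text)
    if not reply.get("found", False):
        return None
    return reply["_source"]


# Illustrative sample bodies (assumed shape, not real server output).
FOUND = ('{"_index": "movies", "_type": "movie", "_id": "1", "_version": 2, '
         '"found": true, "_source": {"title": "The Godfather", "year": 1972}}')
MISSING = '{"_index": "movies", "_type": "movie", "_id": "1", "found": false}'

print(source_or_none(FOUND))
print(source_or_none(MISSING))
```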

Search

The most attractive thing about ElasticSearch is its convenient, fast search. We start by creating some test documents with the following commands:

curl -XPUT "http://localhost:9200/movies/movie/1" -d'
{
    "title": "The Godfather",
    "director": "Francis Ford Coppola",
    "year": 1972,
    "genres": ["Crime", "Drama"]
}'

curl -XPUT "http://localhost:9200/movies/movie/2" -d'
{
    "title": "Lawrence of Arabia",
    "director": "David Lean",
    "year": 1962,
    "genres": ["Adventure", "Biography", "Drama"]
}'

curl -XPUT "http://localhost:9200/movies/movie/3" -d'
{
    "title": "To Kill a Mockingbird",
    "director": "Robert Mulligan",
    "year": 1962,
    "genres": ["Crime", "Drama", "Mystery"]
}'

curl -XPUT "http://localhost:9200/movies/movie/4" -d'
{
    "title": "Apocalypse Now",
    "director": "Francis Ford Coppola",
    "year": 1979,
    "genres": ["Drama", "War"]
}'

curl -XPUT "http://localhost:9200/movies/movie/5" -d'
{
    "title": "Kill Bill: Vol. 1",
    "director": "Quentin Tarantino",
    "year": 2003,
    "genres": ["Action", "Crime", "Thriller"]
}'

curl -XPUT "http://localhost:9200/movies/movie/6" -d'
{
    "title": "The Assassination of Jesse James by the Coward Robert Ford",
    "director": "Andrew Dominik",
    "year": 2007,
    "genres": ["Biography", "Crime", "Drama"]
}'
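
The same documents could also be sent in one call through ElasticSearch's _bulk endpoint, which expects a newline-delimited body: one action line followed by one source line per document, with a trailing newline at the end. The following Python sketch only builds such a body for two of the movies above. The helper name `build_bulk_body` is made up, and posting the result to http://localhost:9200/_bulk is left out.

```python
import json


def build_bulk_body(index, doc_type, documents):
    """Build a newline-delimited _bulk body from (id, source) pairs."""
    lines = []
    for doc_id, source in documents:
        action = {"index": {"_index": index, "_type": doc_type, "_id": str(doc_id)}}
        lines.append(json.dumps(action))   # action/metadata line
        lines.append(json.dumps(source))   # document source line
    # The bulk format requires every line, including the last, to end in a newline.
    return "\n".join(lines) + "\n"


movies = [
    (1, {"title": "The Godfather", "director": "Francis Ford Coppola",
         "year": 1972, "genres": ["Crime", "Drama"]}),
    (2, {"title": "Lawrence of Arabia", "director": "David Lean",
         "year": 1962, "genres": ["Adventure", "Biography", "Drama"]}),
]

body = build_bulk_body("movies", "movie", movies)
print(body)
```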

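Note that ElasticSearch provides a general-purpose _bulk endpoint for creating multiple documents in a single request, but for simplicity we issue separate requests here. Searching in ElasticSearch goes mainly through the _search endpoint, whose standard request format is <index>/<type>/_search, where both index and type are optional. In other words, we can issue requests in the following ways:

http://localhost:9200/_search... - search all indices and types
http://localhost:9200/movies/... - search all types in the movies index
http://localhost:9200/movies/... - search only documents of type movie in the movies index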
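
The three forms follow directly from the <index>/<type>/_search pattern with its optional parts. As a sketch, a small Python helper can assemble them. The name `search_url` is made up for this example, and rejecting a type without an index is a simplification chosen here.

```python
def search_url(index=None, doc_type=None, base_url="http://localhost:9200"):
    """Assemble a _search URL from an optional index and an optional type."""
    if doc_type is not None and index is None:
        # Simplification for this sketch: a type is only accepted with an index.
        raise ValueError("a type can only be given together with an index")
    parts = [base_url]
    if index is not None:
        parts.append(index)
    if doc_type is not None:
        parts.append(doc_type)
    parts.append("_search")
    return "/".join(parts)


print(search_url())                   # all indices and types
print(search_url("movies"))           # all types in the movies index
print(search_url("movies", "movie"))  # only the movie type in movies
```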

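Full-Text Search

ElasticSearch's Query DSL offers many powerful query types. Its core query string query supports a large number of options and is compiled by ElasticSearch into several simpler queries. The simplest query is a full-text search. For example, to search for the keyword kill: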

curl -XPOST "http://localhost:9200/_search" -d'
{
    "query": {
        "query_string": {
            "query": "kill"
        }
    }
}'

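Running this request returns the matching documents in the hits section of the response.

Searching Specific Fields

The simple full-text search above searches every field of every document. Often, though, we only need to search certain fields. For example, to find documents in which the term ford appears in the title: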

curl -XPOST "http://localhost:9200/_search" -d'
{
    "query": {
        "query_string": {
            "query": "ford",
            "fields": ["title"]
        }
    }
}'

In the full-text search, by contrast, fields is left at its default value, _all.
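
The two request bodies differ only in the optional fields list, which a small Python helper makes explicit. This is a sketch that only builds the JSON body; the function name `query_string_body` is made up here, and sending the body to the _search endpoint is left out.

```python
import json


def query_string_body(text, fields=None):
    """Build a query_string search body, optionally limited to given fields."""
    query = {"query": text}
    if fields:
        # Without this key the search falls back to the default field.
        query["fields"] = list(fields)
    return {"query": {"query_string": query}}


full_text = query_string_body("kill")
title_only = query_string_body("ford", fields=["title"])

print(json.dumps(full_text, indent=4))
print(json.dumps(title_only, indent=4))
```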

Web Interface

Kibana

Kibana 4 Definitive Guide