Elasticsearch: Index Creation / Data Retrieval

Date: 2019-12-05

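Starting with ES 6.0, an index can hold only one type; mapping types are deprecated in 7.0 and removed in 8.0. Types are being removed because they caused field-type conflicts between types sharing an index and made retrieval less efficient. (5.x to 6.x)
The _all field is deprecated as well; use copy_to to define a custom combined field instead. (5.x to 6.x)
type: text/keyword decides whether a field is analyzed, and index: true/false decides whether it is indexed. (2.x to 5.x)
analyzer sets the analyzer for an individual field. (2.x to 5.x)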
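
As a rough illustration of the copy_to note above, here is a minimal sketch of a 6.x-style mapping built as a Python dict. The combined field name full_text is an assumption made up for this example (it does not appear elsewhere in this article), and the helper copy_sources is only there to show which fields feed the combined field.

```python
import json

# Hypothetical 6.x mapping: "full_text" is an illustrative field name,
# not something defined elsewhere in this article.
mapping = {
    "mappings": {
        "_doc": {
            "properties": {
                "title": {"type": "text", "copy_to": "full_text"},
                "content": {"type": "text", "copy_to": "full_text"},
                "full_text": {"type": "text"},
            }
        }
    }
}


def copy_sources(mapping_body, target):
    """Return the sorted names of fields whose values are copied into `target`."""
    props = mapping_body["mappings"]["_doc"]["properties"]
    return sorted(
        name for name, spec in props.items() if spec.get("copy_to") == target
    )


# The JSON printed here could be used as the body of a PUT <index> request.
print(json.dumps(mapping, indent=2))
print(copy_sources(mapping, "full_text"))
```

Searching full_text would then play the role that _all used to play, but only for the fields explicitly copied into it.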

Creating the index

First install the ik analysis plugin, then restart the service.

# Install with elasticsearch-plugin
elasticsearch-plugin install \
https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.6.2/elasticsearch-analysis-ik-6.6.2.zip

Field data type reference:
https://www.elastic.co/guide/...

Reference for other field parameters (different field types may have their own specific parameters):
https://www.elastic.co/guide/...

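Let's create an index named news:

Set the default analyzer to the ik analyzer to handle Chinese
Use the default name _doc for the type
Deliberately disable _source storage (to verify the store option)
title is not stored, author is not analyzed, content is stored

For the meaning of the _source field, see this blog post: https://blog.csdn.net/napoay/...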

PUT /news
{
    "settings": {
        "number_of_shards": 5,
        "number_of_replicas": 1,
        "index": {
            "analysis.analyzer.default.type" : "ik_smart"
        }
    },
    "mappings": {
        "_doc": {
            "_source": {
                "enabled": false
            },
            "properties": {
                "news_id": {
                    "type": "integer",
                    "index": true
                },
                "title": {
                    "type": "text",
                    "store": false
                },
                "author": {
                    "type": "keyword"
                },
                "content": {
                    "type": "text",
                    "store": true
                },
                "created_at": {
                    "type": "date",
                    "format": "yyyy-MM-dd HH:mm:ss"
                }
            }
        }
    }
}
# Inspect the mapping that was created
GET /news/_mapping

Verify that the analyzer works

# Check that the analysis plugin is working
GET /_analyze
{
    "analyzer": "ik_smart",
    "text": "我热爱祖国"
}
GET /_analyze
{
    "analyzer": "ik_max_word",
    "text": "我热爱祖国"
}

# The index's default analyzer
GET /news/_analyze
{
    "text": "我热爱祖国！"
}

# With a field specified, the text is analyzed according to that field's mapping
# author is a keyword field, so it is not tokenized
GET /news/_analyze
{
    "field": "author",
    "text": "我热爱祖国！"
}
# Analysis result for title
GET /news/_analyze
{
    "field": "title",
    "text": "我热爱祖国！"
}

Adding documents

These documents are for demonstration; the queries below use them as examples.

POST /news/_doc
{
    "news_id": 1,
    "title": "咱们一块儿学旺叫",
    "author": "才华横溢王大猫",
    "content": "咱们一块儿学旺叫，一块儿旺旺旺旺旺，在你面撒个娇，哎呦旺旺旺旺旺，个人尾巴可劲儿摇",
    "created_at": "2019-03-26 11:55:20"
}
POST /news/_doc
{
    "news_id": 2,
    "title": "咱们一块儿学猫叫",
    "author": "王大猫不会被分词",
    "content": "咱们一块儿学猫叫，仍是旺旺旺旺旺，在你面撒个娇，哎呦旺旺旺旺旺，个人尾巴可劲儿摇",
    "created_at": "2019-03-26 11:55:20"
}
POST /news/_doc
{
    "news_id": 3,
    "title": "实在编不出来了",
    "author": "王大猫",
    "content": "实在编不出来了，随便写点数据作测试吧，旺旺旺",
    "created_at": "2019-03-26 11:55:20"
}
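
When there are more than a handful of documents, sending them one request at a time gets tedious. The following is a minimal sketch, assuming the standard _bulk endpoint and its newline-delimited JSON format, of how a request body for the sample documents could be assembled. The helper name build_bulk_body is made up for this example, and the content field is left out only to keep the sketch short.

```python
import json

# Trimmed copies of the sample documents above ("content" omitted for brevity).
docs = [
    {"news_id": 1, "title": "咱们一块儿学旺叫", "author": "才华横溢王大猫",
     "created_at": "2019-03-26 11:55:20"},
    {"news_id": 2, "title": "咱们一块儿学猫叫", "author": "王大猫不会被分词",
     "created_at": "2019-03-26 11:55:20"},
    {"news_id": 3, "title": "实在编不出来了", "author": "王大猫",
     "created_at": "2019-03-26 11:55:20"},
]


def build_bulk_body(index, doc_type, documents):
    """Build an NDJSON body: one action line followed by one source line per document."""
    lines = []
    for doc in documents:
        # Action line: tells Elasticsearch where to index the next line.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        # Source line: the document itself, kept readable for Chinese text.
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


body = build_bulk_body("news", "_doc", docs)
print(body)
```

The resulting string would be sent as the body of POST /_bulk with the Content-Type header set to application/x-ndjson.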

Searching data

GET /news/_doc/_search is the endpoint for searching documents of type _doc in the news index. The examples below use the REST API with the query DSL.

match_all

Fetches all documents with no search conditions.

# Unconditional paged search, sorted by news_id
GET /news/_doc/_search
{
    "query": {
        "match_all": {}
    },
    "from": 0,
    "size": 2,
    "sort": {
        "news_id": "desc"
    }
}
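
The from and size parameters in the request above implement offset-based paging. As a small sketch, assuming a 1-based page number (the page parameter and the helper name page_request are made up for this example), the request body for any page could be derived like this:

```python
def page_request(page, size, sort_field="news_id", order="desc"):
    """Build a match_all search body for a 1-based page number."""
    if page < 1 or size < 1:
        raise ValueError("page and size must both be >= 1")
    return {
        "query": {"match_all": {}},
        # "from" is the number of hits to skip before the first returned hit.
        "from": (page - 1) * size,
        "size": size,
        "sort": {sort_field: order},
    }


first_page = page_request(1, 2)   # same body as the request above
third_page = page_request(3, 2)   # skips the first 4 hits
print(first_page)
print(third_page["from"])
```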

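Because we disabled the _source field, ES only builds the inverted index for the data and does not store the original document, so the results contain no original document content. We disabled it mainly to demonstrate the highlight mechanism.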

match

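A regular full-text search. Many articles say that a match query analyzes the query text, but that is not entirely accurate: it also depends on the type of the field being searched. If the field is a keyword (not_analyzed) field that is not tokenized, match is equivalent to a term query.

We can use the _analyze API to see how text for a given field will be processed: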

GET /news/_analyze
{
    "field": "title",
    "text": "我会被如何处理呢？分词？不分词？"
}

Query

GET /news/_doc/_search
{
    "query": {
        "match": {
            "title": "咱们会被分词"
        }
    },
    "highlight": {
        "fields": {
            "title": {}
        }
    }
}

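With highlight, the matched keywords are returned in their surrounding context, marked up for highlighting. If _source is disabled, the field's store option must be enabled so that the field's original value is stored; otherwise there is no original content to highlight.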

multi_match

Searches across multiple fields. For example, to find documents whose title or content contains the keyword 咱们:

GET /news/_doc/_search
{
    "query": {
        "multi_match": {
            "query": "咱们是好人",
            "fields": ["title", "content"]
        }
    },
    "highlight": {
        "fields": {
            "title": {},
            "content": {}
        }
    }
}

match_phrase

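This one deserves careful study. match_phrase is a phrase query. Put simply, the document field must contain every term produced by analyzing the query text, and the positions of those terms in the document must fall within the threshold set by slop. slop is the number of position moves allowed to make the query terms line up with their arrangement in the document; if slop is large enough, a document matches even when the terms are scattered far apart.

content: i love china
match_phrase: i china
slop: 0  // no match: china has to be moved 1 position, so that "i china" becomes "i _ china"
slop: 1  // match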

Test example

# First see how the query text is analyzed
GET /news/_analyze
{
    "field": "title",
    "text": "咱们学"
}
# response
{
    "tokens": [
        {
            "token": "咱们",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "学",
            "start_offset": 2,
            "end_offset": 3,
            "type": "CN_CHAR",
            "position": 1
        }
    ]
}

# Then see how a document's title was indexed into the inverted index
GET /news/_analyze
{
    "field": "title",
    "text": "咱们一块儿学旺叫"
}
# response
{
    "tokens": [
        {
            "token": "咱们",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "一块儿",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "学",
            "start_offset": 4,
            "end_offset": 5,
            "type": "CN_CHAR",
            "position": 2
        },
        ...
    ]
}

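Note the position field. A document matches only when slop is at least the number of positions that terms have to be moved to line up with the phrase in the query.

The query text is analyzed into ["咱们", "学"], at positions 0 and 1. In the document the same two terms sit at positions 0 and 2, one position farther apart, so slop must be greater than or equal to 1 for this document to be found.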
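
The position arithmetic above can be sketched in a few lines of Python. This is a simplified model written for this article, not Elasticsearch's actual implementation: each document position is shifted back by the term's position in the query, and the phrase matches when the spread of the shifted positions is within slop. The function name phrase_matches is made up, and repeated query terms are not handled specially.

```python
from itertools import product


def phrase_matches(doc_tokens, query_tokens, slop=0):
    """Return True if query_tokens occur in doc_tokens within `slop` position moves."""
    # Collect every document position at which each query term occurs.
    positions = []
    for term in query_tokens:
        found = [i for i, tok in enumerate(doc_tokens) if tok == term]
        if not found:
            return False  # every query term has to be present
        positions.append(found)

    # Try every combination of one position per query term.
    for combo in product(*positions):
        # Shift each document position back by the term's position in the query;
        # an exact phrase yields identical shifted values (spread 0).
        shifted = [doc_pos - query_pos for query_pos, doc_pos in enumerate(combo)]
        if max(shifted) - min(shifted) <= slop:
            return True
    return False


doc = ["i", "love", "china"]
print(phrase_matches(doc, ["i", "china"], slop=0))  # False
print(phrase_matches(doc, ["i", "china"], slop=1))  # True

# The title tokens shown in the _analyze response above.
title = ["咱们", "一块儿", "学"]
print(phrase_matches(title, ["咱们", "学"], slop=1))  # True
```

Under this model, two adjacent terms in reversed order need a slop of 2, which is why a large slop lets terms match in any order.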

Using a phrase query:

GET /news/_doc/_search
{
    "query": {
        "match_phrase": {
            "title": {
                "query": "咱们学",
                "slop": 1
            }
        }
    },
    "highlight": {
        "fields": {
            "title": {}
        }
    }
}

Result:

{
            ...
            {
                "_index": "news",
                "_type": "_doc",
                "_id": "if-CuGkBddO9SrfVBoil",
                "_score": 0.37229446,
                "highlight": {
                    "title": [
                        "<em>咱们</em>一块儿<em>学</em>猫叫"
                    ]
                }
            },
            {
                "_index": "news",
                "_type": "_doc",
                "_id": "iP-AuGkBddO9SrfVOIg3",
                "_score": 0.37229446,
                "highlight": {
                    "title": [
                        "<em>咱们</em>一块儿<em>学</em>旺叫"
                    ]
                }
            }
            ...
}

term

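The key point about term is that it does not analyze the query text; the text is looked up in the index as a single term. Whether a field was tokenized at index time is determined by the mapping. The index may contain the two terms ["咱们", "一块儿"] but no term ["咱们一块儿"], so the query below finds nothing. A keyword field is not tokenized at index time and is indexed as one whole term, and the query text is not analyzed either, so the value must match exactly.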

GET /news/_doc/_search
{
    "query": {
        "term": {
           "title": "咱们一块儿" 
        }
    },
    "highlight": {
        "fields": {
            "title": {}
        }
    }
}
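
To make the difference between term and match concrete, here is a toy inverted index in Python. It is only a sketch under simplifying assumptions: a whitespace tokenizer stands in for the ik analyzer, the two English sample documents are made up for this example, and scoring is ignored entirely.

```python
def analyze_text(text):
    """Toy stand-in for a text analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def analyze_keyword(text):
    """A keyword field keeps the whole value as a single term."""
    return [text]


def build_index(docs, analyze):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyze(text):
            index.setdefault(token, set()).add(doc_id)
    return index


def term_query(index, value):
    """term: look the value up as-is, without analyzing it."""
    return index.get(value, set())


def match_query(index, text, analyze):
    """match: analyze the query text, then OR the per-term lookups together."""
    hits = set()
    for token in analyze(text):
        hits |= index.get(token, set())
    return hits


docs = {1: "learn to bark", 2: "learn to meow"}  # hypothetical sample data
text_index = build_index(docs, analyze_text)
keyword_index = build_index(docs, analyze_keyword)

print(term_query(text_index, "learn to"))                  # set(): no such single term
print(match_query(text_index, "learn to", analyze_text))   # {1, 2}
print(term_query(keyword_index, "learn to bark"))          # {1}: exact value only
```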

terms

terms takes several keywords at once, much like tokenizing the query by hand.

GET /news/_doc/_search
{
    "query": {
        "terms": {
           "title": ["咱们", "一块儿"]
        }
    },
    "highlight": {
        "fields": {
            "title": {}
        }
    }
}

Any document containing at least one of the terms ["咱们", "一块儿"] is returned.

wildcard

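Shell-style wildcard query: ? matches exactly one character and * matches any number of characters. The pattern is matched against the terms in the inverted index.

Find documents that have a term exactly two characters long:

GET /news/_doc/_search
{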
   "query": {
       "wildcard": {
               "title": "??"
       }
   },
   "highlight": {
        "fields": {
            "title": {},
            "content": {}
        }
    }
}
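
It is easy to forget that the pattern is tested against each indexed term as a whole, not against the original field value. A rough sketch using Python's fnmatch module illustrates this; it is an approximation (fnmatch also understands [seq] character classes, which are an extra here), and the term list is taken from the _analyze response shown earlier.

```python
from fnmatch import fnmatchcase

# Terms from the _analyze response for the title field shown earlier
# (the response was truncated, so only these three are listed).
title_terms = ["咱们", "一块儿", "学"]


def wildcard_terms(terms, pattern):
    """Return the terms that match a ?/* wildcard pattern in full."""
    return [term for term in terms if fnmatchcase(term, pattern)]


print(wildcard_terms(title_terms, "??"))   # ['咱们']
print(wildcard_terms(title_terms, "一*"))  # ['一块儿']
```

A document is a hit as soon as one of its terms matches, which is why "??" finds documents whose titles are much longer than two characters.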

prefix

Prefix query: matches terms in the inverted index that start with the given prefix.

GET /news/_doc/_search
{
   "query": {
       "prefix": {
               "title": "我"
       }
   },
   "highlight": {
        "fields": {
            "title": {},
            "content": {}
        }
    }
}

regexp

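Regular expression query: matches terms in the inverted index against the pattern.

Find documents that have a term two to three characters long:

GET /news/_doc/_search
{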
   "query": {
       "regexp": {
               "title": ".{2,3}"
       }
   },
   "highlight": {
        "fields": {
            "title": {},
            "content": {}
        }
    }
}
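
As with wildcard, the regular expression has to match an entire term. The sketch below approximates that with Python's re.fullmatch; note that Python's regex syntax is not identical to the Lucene syntax Elasticsearch uses, so this only illustrates the whole-term anchoring, using the title terms from the earlier _analyze response.

```python
import re

# Terms from the _analyze response for the title field shown earlier
# (the response was truncated, so only these three are listed).
title_terms = ["咱们", "一块儿", "学"]


def regexp_terms(terms, pattern):
    """Return the terms that the pattern matches from start to end."""
    compiled = re.compile(pattern)
    return [term for term in terms if compiled.fullmatch(term)]


print(regexp_terms(title_terms, ".{2,3}"))  # ['咱们', '一块儿']
```

The one-character term 学 is excluded because the pattern requires at least two characters.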

bool

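A bool query combines multiple queries:
must: every clause must match
must_not: no clause may match
should: at least one clause must match when there is no must or filter; otherwise should clauses are optional and only raise the score of documents that match them
filter: every clause must match, but without affecting the score

GET /news/_doc/_search
{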
   "query": {
        "bool": {
            "must": {
                "match": {
                    "title": "绝对要有咱们"
                }
            },
            "must_not": {
                "term": {
                    "title": "绝对不能有我"
                }
            },
            "should": [
                {
                    "match": {
                        "content": "咱们"
                    }
                },
                {
                    "multi_match": {
                        "query": "知足",
                        "fields": ["title", "content"]
                    }
                },
                {
                    "match_phrase": {
                        "title": "一个便可"
                    }
                }
            ],
            "filter": {
                "range": {
                    "created_at": {
                        "lt": "2020-12-05 12:00:00",
                        "gt": "2019-01-05 12:00:00"
                    }
                }
            }
        }
   },
   "highlight": {
        "fields": {
            "title": {},
            "content": {}
        }
    }
}
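
The way the clause types interact can be sketched as a small Python predicate. This models matching only, not scoring, and relies on the default behaviour that should clauses become optional once a must or filter clause is present in the same bool query. The names bool_matches and has_term, and the two sample documents, are made up for this example.

```python
def bool_matches(doc, must=(), must_not=(), should=(), filters=()):
    """Decide whether `doc` matches a bool query; each clause is a predicate on doc."""
    if not all(clause(doc) for clause in must):
        return False
    if not all(clause(doc) for clause in filters):
        return False
    if any(clause(doc) for clause in must_not):
        return False
    # should is mandatory only when there is no must and no filter clause.
    if should and not must and not filters:
        return any(clause(doc) for clause in should)
    return True


def has_term(term):
    """Build a clause that checks for a term in the document's title terms."""
    return lambda doc: term in doc["title_terms"]


doc_a = {"title_terms": {"咱们", "学"}}   # hypothetical analyzed documents
doc_b = {"title_terms": {"学"}}

# should does not have to match because a must clause is present.
print(bool_matches(doc_a, must=[has_term("咱们")], should=[has_term("猫")]))  # True
# With should alone, at least one should clause has to match.
print(bool_matches(doc_a, should=[has_term("猫")]))                           # False
print(bool_matches(doc_b, must=[has_term("咱们")]))                           # False
```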

filter

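filter is usually combined with queries such as match to narrow down the documents that satisfy the query. Filter clauses only include or exclude documents; they do not contribute to the score.

GET /news/_doc/_search
{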
   "query": {
        "bool": {
            "must": {
                "match_all": {}
            },
            "filter": {
                "range": {
                    "created_at": {
                        "lt": "2020-12-05 12:00:00",
                        "gt": "2017-12-05 12:00:00"
                    }
                }
            }
        }
   }
}

Or use it on its own:

GET /news/_doc/_search
{
   "query": {
       "constant_score" : {
            "filter": {
                "range": {
                    "created_at": {
                        "lt": "2020-12-05 12:00:00",
                        "gt": "2017-12-05 12:00:00"
                    }
                }
            }
       }
   }
}

Multiple filter conditions: 2017-12-05 12:00:00 < created_at < 2020-12-05 12:00:00 and news_id >= 2

GET /news/_doc/_search
{
   "query": {
       "constant_score" : {
            "filter": {
                "bool": {
                    "must": [
                        {
                            "range": {
                                "created_at": {
                                    "lt": "2020-12-05 12:00:00",
                                    "gt": "2017-12-05 12:00:00"
                                }
                            }
                        },
                        {
                            "range": {
                                "news_id": {
                                    "gte": 2
                                }
                            }
                        }
                    ]
                }
            }
       }
   }
}
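
As a sanity check on what the combined filter should return, the same two range conditions can be evaluated in plain Python against the sample documents. This is only a sketch of the filter logic: the helper names in_range and filter_news are made up, and the documents are reduced to the two fields the filter looks at.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # Python equivalent of yyyy-MM-dd HH:mm:ss


def in_range(value, gt=None, gte=None, lt=None, lte=None):
    """Mimic a range clause: every bound that is given has to hold."""
    if gt is not None and not value > gt:
        return False
    if gte is not None and not value >= gte:
        return False
    if lt is not None and not value < lt:
        return False
    if lte is not None and not value <= lte:
        return False
    return True


def filter_news(docs):
    """Apply both range filters (AND) and return the matching news_id values."""
    lower = datetime.strptime("2017-12-05 12:00:00", FMT)
    upper = datetime.strptime("2020-12-05 12:00:00", FMT)
    return [
        doc["news_id"]
        for doc in docs
        if in_range(datetime.strptime(doc["created_at"], FMT), gt=lower, lt=upper)
        and in_range(doc["news_id"], gte=2)
    ]


# The three sample documents, reduced to the filtered fields.
sample = [
    {"news_id": 1, "created_at": "2019-03-26 11:55:20"},
    {"news_id": 2, "created_at": "2019-03-26 11:55:20"},
    {"news_id": 3, "created_at": "2019-03-26 11:55:20"},
]
print(filter_news(sample))  # [2, 3]
```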