Elasticsearch from Basics to Advanced (8) Search Engine: Mapping, Exact Match vs. Full-Text Search, Analyzers, and a Mapping Summary

Date: 2019-11-06


First, a brief description of what a mapping is.

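A mapping is the data structure and related configuration that is created, automatically or manually, for a type in an index.
With dynamic mapping, ES automatically creates the index, the type, and the mapping for that type; the mapping records the data type of every field, along with settings such as how the field is analyzed.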

Let's insert a few documents and let ES create the index for us automatically:

PUT /website/article/1
{
  "post_date": "2019-08-21",
  "title": "my first article",
  "content": "this is my first article in this website",
  "author_id": 11400
}

PUT /website/article/2
{
  "post_date": "2019-08-22",
  "title": "my second article",
  "content": "this is my second article in this website",
  "author_id": 11400
}

PUT /website/article/3
{
  "post_date": "2019-08-23",
  "title": "my third article",
  "content": "this is my third article in this website",
  "author_id": 11400
}

View the mapping:

GET /website/_mapping

{
  "website": {
    "mappings": {
      "article": {
        "properties": {
          "author_id": {
            "type": "long"
          },
          "content": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "post_date": {
            "type": "date"
          },
          "title": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}

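The mapping above was generated automatically when the data was inserted; a mapping can also be created manually. In both cases, the data structure and related configuration created for a type in an index is called its mapping.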
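As a rough illustration of how dynamic mapping could arrive at the types shown above, here is a minimal Python sketch. The functions `infer_field` and `infer_mapping` are hypothetical names, and the date detection is deliberately simplified to the `yyyy-MM-dd` form used in this article; it is not how ES is implemented.

```python
import re

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # simplified: only yyyy-MM-dd


def infer_field(value):
    """Guess a field mapping from a JSON value (simplified rules)."""
    if isinstance(value, bool):  # bool must be tested before int in Python
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "float"}
    if isinstance(value, str):
        if DATE_RE.match(value):
            return {"type": "date"}
        # Strings become analyzed text plus an un-analyzed keyword sub-field.
        return {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        }
    raise TypeError(f"unsupported value: {value!r}")


def infer_mapping(doc):
    """Build a properties block for one document."""
    return {"properties": {name: infer_field(v) for name, v in doc.items()}}


article = {
    "post_date": "2019-08-21",
    "title": "my first article",
    "content": "this is my first article in this website",
    "author_id": 11400,
}
mapping = infer_mapping(article)
for name, definition in mapping["properties"].items():
    print(name, "->", definition["type"])
```

For the first article this yields `date` for post_date, `text` for title and content, and `long` for author_id, which agrees with the mapping ES returned.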

Try a few searches:

GET /website/article/_search?q=2019                       // 3 results
GET /website/article/_search?q=2019-08-21                 // 3 results
GET /website/article/_search?q=post_date:2019-08-21       // 1 result
GET /website/article/_search?q=post_date:2019             // 0 results

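Why are the results inconsistent? When ES created the mapping automatically, it assigned different data types to different fields, and different data types are analyzed and searched differently. That is why a search against the _all field and a search against the post_date field behave so differently.
Below is a manually created mapping.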

PUT /test_mapping
{
  "mappings" : {
    "properties" : {
      "author_id" : {
        "type" : "long"
      },
      "content" : {
        "type" : "text",
        "fields" : {
          "keyword" : {
            "type" : "keyword",
            "ignore_above" : 256
          }
        }
      },
      "post_date" : {
        "type" : "date"
      },
      "title" : {
        "type" : "text",
        "fields" : {
          "keyword" : {
            "type" : "keyword",
            "ignore_above" : 256
          }
        }
      }
    }
  }
}


Exact match vs. full-text search

exact value

The entire value of the field must match for the corresponding document to be returned.
Example:

GET /website/article/_search?q=post_date:2019-08-21       // 1 result
GET /website/article/_search?q=post_date:2019             // 0 results

With an exact value, you have to search for the full 2019-08-21 to get a hit.
Searching for just 21 finds nothing.

full text

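Unlike exact value, full text does not simply match one complete value: the value is split into terms (tokenized) before matching, and matches can also be made across abbreviations, tenses, letter case, synonyms, and so on.
Example: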

GET /website/article/_search?q=2019                       // 3 results
GET /website/article/_search?q=2019-08-21                 // 3 results

Core principle of the inverted index

The following walks through building a simple inverted index; in practice, the process is far more complex.
doc1: I really liked my small dogs, and I think my mom also liked them.
doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.

After tokenization, a first version of the inverted index:

word    doc1    doc2
I        *        *
really   *
liked    *        *
my       *        *
small    *
dogs     *        *
and      *
think    *
mom      *        *
also     *        
them     *
He                *
never             *
any               *
so                *
hope              *
that              *
will              *
not               *
expect            *
me                *
to                *
him               *

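Searching for mother like little dog against this index returns nothing, because none of the query terms is in it:
mother
like
little
dog
This is clearly not what we want: mother and mom, for instance, mean the same thing, yet the search finds nothing. In practice ES can still match such queries, because when it builds the inverted index it also processes each token to raise the chance that later searches find related documents: tense conversion, singular/plural conversion, synonym conversion, and case conversion. This process is called normalization; which of these steps actually run depends on the analyzer configured for the field.
mother -> mom
liked -> like
small -> little
dogs -> dog
Rebuilding the inverted index with normalization gives: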

word    doc1    doc2
I        *        *
really   *
like     *        *
my       *        *
little   *
dog      *        *
and      *
think    *
mom      *        *
also     *        
them     *
He                *
never             *
any               *
so                *
hope              *
that              *
will              *
not               *
expect            *
me                *
to                *
him               *

Query: mother like little dog, after tokenization and normalization:
mother -> mom
like -> like
little -> little
dog -> dog
Both doc1 and doc2 are returned:
doc1: I really liked my small dogs, and I think my mom also liked them.
doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.
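The walkthrough above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `NORMALIZE` table is a hypothetical stand-in for real stemming and synonym filters, the function names are made up for this example, and all tokens are lowercased (unlike the tables above, which keep I and He capitalized).

```python
import re
from collections import defaultdict

# Hypothetical normalization table taken from the article's example;
# a real analyzer would use stemmers and synonym filters instead.
NORMALIZE = {"mother": "mom", "liked": "like", "small": "little", "dogs": "dog"}


def analyze(text):
    """Lowercase, split on non-letters, then normalize each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [NORMALIZE.get(t, t) for t in tokens]


def build_index(docs):
    """Map each normalized term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents matching at least one query term."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits


docs = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}
index = build_index(docs)
print(sorted(search(index, "mother like little dog")))  # ['doc1', 'doc2']
```

The important design point is that the same `analyze` function is applied to both the documents and the query; otherwise mother would never be rewritten to mom at search time.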

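Analyzers

Splitting text into terms and normalizing them (to improve recall)

Given a sentence, an analyzer splits it into individual words and normalizes each word (tense conversion, singular/plural conversion).
Recall: normalization increases the number of relevant results a search is able to find.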

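character filter: preprocesses the text before tokenization; the most common cases are stripping HTML tags (<span>hello</span> --> hello) and replacing characters, e.g. & --> and (I&you --> I and you)
tokenizer: splits the text into tokens, e.g. hello you and me --> hello, you, and, me
token filter: lowercase, stop words, synonyms, and so on, e.g. dogs --> dog, liked --> like, Tom --> tom, a/the/an --> removed, mother --> mom, small --> little

An analyzer matters a great deal: it runs the text through all of these steps, and only the final result is used to build the inverted index.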
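To make the three stages concrete, here is a minimal Python sketch of such a pipeline. All names (`char_filter`, `tokenizer`, `token_filters`) and the small lookup tables are assumptions made for illustration; they only mimic the examples listed above and do not reflect how ES implements its filters.

```python
import re

STOP_WORDS = {"a", "an", "the"}
SYNONYMS = {"mother": "mom", "small": "little"}  # illustrative only
STEMS = {"dogs": "dog", "liked": "like"}         # stand-in for a real stemmer


def char_filter(text):
    """Strip HTML tags and spell out '&' before tokenizing."""
    text = re.sub(r"<[^>]+>", "", text)
    return text.replace("&", " and ")


def tokenizer(text):
    """Split the text into word tokens."""
    return re.findall(r"\w+", text)


def token_filters(tokens):
    """Lowercase, drop stop words, then apply stems and synonyms."""
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok in STOP_WORDS:
            continue
        tok = STEMS.get(tok, tok)
        out.append(SYNONYMS.get(tok, tok))
    return out


def analyze(text):
    """Run the full pipeline: character filter, tokenizer, token filters."""
    return token_filters(tokenizer(char_filter(text)))


print(analyze("<span>Tom</span> liked the small dogs & his mother"))
# ['tom', 'like', 'little', 'dog', 'and', 'his', 'mom']
```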

Built-in analyzers:

Text to analyze: Set the shape to semi-transparent by calling set_trans(5)

standard analyzer: set, the, shape, to, semi, transparent, by, calling, set_trans, 5 (standard is the default)
simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans
whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
language analyzer (language-specific, e.g. english for English): set, shape, semi, transpar, call, set_tran, 5
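Two of these analyzers are simple enough to approximate in Python, which makes the difference between them easy to see. The functions below are rough stand-ins written for this article, assuming ASCII input; the standard and english analyzers involve Unicode segmentation and stemming and are not reproduced here.

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"


def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()


def simple_analyzer(text):
    """Split on anything that is not a letter, then lowercase."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]


print(whitespace_analyzer(TEXT))
print(simple_analyzer(TEXT))
```

The whitespace variant keeps `semi-transparent` and `set_trans(5)` intact, while the simple variant breaks them apart at the hyphen, underscore, and parentheses and drops the digit 5.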

Resolving the open question from the mapping example

GET /_search?q=2019

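The search runs against the _all field: all fields of a document are concatenated into one long string, which is then analyzed. For doc2:

2019-08-22 my second article this is my second article in this website 11400

        doc1        doc2        doc3
2019      *          *           *
08        *          *           *
21        *
22                   *
23                               *

Searching _all for 2019 therefore matches all 3 documents. A search for 2019-08-21 is analyzed into 2019, 08 and 21, and since every document contains 2019 and 08, it matches all 3 as well.

GET /_search?q=post_date:2019-08-21

A date field is indexed as an exact value:

             doc1        doc2        doc3
2019-08-21    *
2019-08-22                 *
2019-08-23                             *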
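The difference between the two kinds of search can be reproduced with a small Python simulation over the three articles inserted at the start of this post. It is a sketch under assumptions: `tokenize` is a crude stand-in for the standard analyzer, `search_all` imitates a query against _all where any matching term is enough, and `search_exact` imitates an exact value field. None of these functions exist in ES.

```python
import re

# The three documents inserted at the start of the article.
docs = {
    1: {"post_date": "2019-08-21", "title": "my first article",
        "content": "this is my first article in this website", "author_id": 11400},
    2: {"post_date": "2019-08-22", "title": "my second article",
        "content": "this is my second article in this website", "author_id": 11400},
    3: {"post_date": "2019-08-23", "title": "my third article",
        "content": "this is my third article in this website", "author_id": 11400},
}


def tokenize(text):
    """Rough stand-in for the standard analyzer: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search_all(query):
    """Full text: match if any query token appears in the concatenated fields."""
    wanted = set(tokenize(query))
    hits = []
    for doc_id, doc in docs.items():
        all_field = " ".join(str(v) for v in doc.values())
        if wanted & set(tokenize(all_field)):
            hits.append(doc_id)
    return hits


def search_exact(field, value):
    """Exact value: the whole field value must equal the query."""
    return [doc_id for doc_id, doc in docs.items() if str(doc[field]) == value]


print(search_all("2019"))                       # [1, 2, 3]
print(search_all("2019-08-21"))                 # [1, 2, 3]
print(search_exact("post_date", "2019-08-21"))  # [1]
print(search_exact("post_date", "2019"))        # []
```

The four results line up with the 3, 3, 1 and 0 hits reported for the four searches earlier in the article.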

Testing an analyzer

Syntax:

GET /_analyze
{
  "analyzer": "standard",
  "text": "Text to analyze"
}

{
  "tokens": [
    {
      "token": "text",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "to",
      "start_offset": 5,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "analyze",
      "start_offset": 8,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
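The offsets in this response point back into the original text. A short Python sketch, using the token entries copied from the response above (the helper name `source_slices` is made up for this example), shows how each token maps to the characters it came from:

```python
text = "Text to analyze"

# Token entries copied from the response shown above.
tokens = [
    {"token": "text", "start_offset": 0, "end_offset": 4, "position": 0},
    {"token": "to", "start_offset": 5, "end_offset": 7, "position": 1},
    {"token": "analyze", "start_offset": 8, "end_offset": 15, "position": 2},
]


def source_slices(text, tokens):
    """Return (token, original substring) pairs using the character offsets."""
    return [(t["token"], text[t["start_offset"]:t["end_offset"]]) for t in tokens]


for token, original in source_slices(text, tokens):
    print(f"{token!r} <- {original!r}")
```

Note that the first token is `text` while the source substring is `Text`: the offsets refer to the untouched input, and the lowercasing happens in a token filter.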

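A further summary of mapping

When you insert data directly into ES, it automatically creates the index, the type, and the corresponding mapping.
The mapping automatically defines the data type of each field.
Depending on the data type (text versus date, for example), a field is treated either as an exact value or as full text.
An exact value is put into the inverted index as a single term, the whole value at once; full text first goes through analysis, that is tokenization and normalization (tense conversion, synonym conversion, case conversion), before it is put into the inverted index.
At search time, the same distinction decides how a field is searched, and the behavior mirrors how the inverted index was built: a query against an exact value field is matched against the whole value, while a query against a full text field is itself tokenized and normalized before it is looked up in the inverted index.
You can rely on ES dynamic mapping to create the mapping automatically, including the data types; or you can create the mapping for the index and type manually in advance and configure each field yourself, including its data type, indexing behavior, and analyzer.

In essence, a mapping is the metadata of an index's type: it determines the data types, how the inverted index is built, and how searches behave.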

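Core mapping data types and dynamic mapping

Core data types

string / text: string type (string was replaced by text and keyword in ES 5.0)
byte: byte type
short: short integer
integer: integer
long: long integer
float: floating point
boolean: boolean
date: date type

There are also more complex types such as arrays and object; underneath, they are flattened into fields of the core types, as the section on object fields shows.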

dynamic mapping

true or false -> boolean
123 -> long
123.45 -> float
"2017-01-01" -> date
"hello world" -> text (string in older versions)

Viewing a mapping

Syntax:

GET /{index}/_mapping
GET /{index}/_mapping/{type}

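Creating and modifying a mapping manually, and controlling whether string fields are analyzed

Note: you can only define a mapping manually when the index is created, or add mappings for new fields later; you cannot update the mapping of an existing field.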

```
"analyzer": "standard": analyzed (tokenized automatically)
```
```
date: date
```
```
keyword: not analyzed
```

# Create the index
PUT /website
{
  "mappings": {
    "properties": {
      "author_id": {
        "type": "long"
      },
      "title": {
        "type": "text",
        "analyzer": "standard"
      },
      "content": {
        "type": "text"
      },
      "post_date": {
        "type": "date"
      },
      "publisher_id": {
        "type": "keyword"
      }
    }
  }
}


# Try to modify a field's mapping (rejected: PUT /website attempts to create an index that already exists)
PUT /website
{
  "mappings": {
    "properties": {
      "author_id": {
        "type": "text"
      }
    }
  }
}

{
  "error": {
    "root_cause": [
      {
        "type": "resource_already_exists_exception",
        "reason": "index [website/5xLohnJITHqCwRYInmBFmA] already exists",
        "index_uuid": "5xLohnJITHqCwRYInmBFmA",
        "index": "website"
      }
    ],
    "type": "resource_already_exists_exception",
    "reason": "index [website/5xLohnJITHqCwRYInmBFmA] already exists",
    "index_uuid": "5xLohnJITHqCwRYInmBFmA",
    "index": "website"
  },
  "status": 400
}


# Add a field to the mapping
PUT /website/_mapping
{
  "properties": {
    "new_field": {
      "type": "text"
    }
  }
}

{
  "acknowledged" : true
}
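The rule stated in the note (new fields may be added, existing fields may not be changed) can be modelled with a small Python function. This is a simplified sketch: `add_fields` is a hypothetical helper, and it treats any difference in an existing field definition as a conflict, which is stricter than what ES itself enforces for some parameters.

```python
def add_fields(mapping, new_properties):
    """Return a new mapping with extra fields; refuse to change existing ones."""
    merged = dict(mapping["properties"])
    for name, definition in new_properties.items():
        if name in merged and merged[name] != definition:
            raise ValueError(f"cannot change mapping of existing field [{name}]")
        merged[name] = definition
    return {"properties": merged}


website = {
    "properties": {
        "author_id": {"type": "long"},
        "title": {"type": "text", "analyzer": "standard"},
    }
}

# Adding a new field is accepted.
updated = add_fields(website, {"new_field": {"type": "text"}})
print(sorted(updated["properties"]))  # ['author_id', 'new_field', 'title']

# Changing the type of an existing field is rejected.
try:
    add_fields(website, {"author_id": {"type": "text"}})
except ValueError as err:
    print("rejected:", err)
```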

Complex mapping types and the underlying structure of object fields

multivalue field
```
{
    "tags": ["tag1", "tag2"]
}
```
A multivalue field is indexed the same way as a single string value; the data types inside the array must not be mixed.
empty field
```
null，[]，[null]
```

object field
Sample data:

PUT /company/employee/1
{
  "address": {
    "country": "china",
    "province": "guangdong",
    "city": "guangzhou"
  },
  "name": "jack",
  "age": 27,
  "join_date": "2017-01-01"
}

View the mapping:

GET /company/_mapping/employee

{
  "company": {
    "mappings": {
      "employee": {
        "properties": {
          "address": {
            "properties": {
              "city": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "country": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "province": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          },
          "age": {
            "type": "long"
          },
          "join_date": {
            "type": "date"
          },
          "name": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}


How an object field is stored underneath

{
  "address": {
    "country": "china",
    "province": "guangdong",
    "city": "guangzhou"
  },
  "name": "jack",
  "age": 27,
  "join_date": "2017-01-01"
}

↓↓↓↓

{
    "name":             [jack],
    "age":              [27],
    "join_date":        [2017-01-01],
    "address.country":  [china],
    "address.province": [guangdong],
    "address.city":     [guangzhou]
}

{
    "authors": [
        { "age": 26, "name": "Jack White"},
        { "age": 55, "name": "Tom Jones"},
        { "age": 39, "name": "Kitty Smith"}
    ]
}

↓↓↓↓

{
    "authors.age":    [26, 55, 39],
    "authors.name":   [jack, white, tom, jones, kitty, smith]
}
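The flattening shown above can be sketched as a short recursive function in Python. The name `flatten` is made up for this example, and the handling of strings (lowercase, split on whitespace) is an assumption chosen only to reproduce the token lists shown in this article.

```python
def flatten(value, prefix="", out=None):
    """Flatten nested objects and arrays into dotted field names with value lists."""
    if out is None:
        out = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flatten(child, f"{prefix}.{key}" if prefix else key, out)
    elif isinstance(value, list):
        for item in value:
            flatten(item, prefix, out)
    elif isinstance(value, str):
        # Assumed analysis: lowercase and split on whitespace.
        out.setdefault(prefix, []).extend(value.lower().split())
    else:
        out.setdefault(prefix, []).append(value)
    return out


book = {
    "authors": [
        {"age": 26, "name": "Jack White"},
        {"age": 55, "name": "Tom Jones"},
        {"age": 39, "name": "Kitty Smith"},
    ]
}
print(flatten(book))
# {'authors.age': [26, 55, 39],
#  'authors.name': ['jack', 'white', 'tom', 'jones', 'kitty', 'smith']}
```

One consequence is visible in the output: after flattening, nothing records that 26 belongs to Jack White rather than to Tom Jones, so a query combining authors.age and authors.name can match across different authors. ES provides the nested type for cases where that association must be preserved.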