Elasticsearch Data Types and Their Properties

Date: 2019-11-12

1. Data types

Overview of field types

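Category	Subcategory	Types
Core types	String types	string, text, keyword
Core types	Integer types	integer, long, short, byte
Core types	Floating-point types	double, float, half_float, scaled_float
Core types	Boolean type	boolean
Core types	Date type	date
Core types	Range types	range
Core types	Binary type	binary
Complex types	Array type	array
Complex types	Object type	object
Complex types	Nested type	nested
Geo types	Geo-point type	geo_point
Geo types	Geo-shape type	geo_shape
Specialised types	IP type	ip
Specialised types	Completion (suggester) type	completion
Specialised types	Token count type	token_count
Specialised types	Attachment type	attachment
Specialised types	Percolator type	percolator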

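Core types

1) String types
string: widely used in older Elasticsearch releases. It was deprecated in Elasticsearch 5.x and removed in 6.0, replaced by text and keyword.
text: use text when a field needs full-text search, such as the body of an email or a product description. A text field is analyzed: before the inverted index is built, the analyzer splits the string into individual terms. text fields are not used for sorting and are rarely used for aggregations.
keyword: suited to indexing structured content such as email addresses, hostnames, status codes and tags. Use it when a field needs filtering (for example, finding the blog posts whose status is published), sorting or aggregations. A keyword field can only be found by searching for its exact value.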

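2) Integer types

Type	Range
byte	-128 ~ 127
short	-32768 ~ 32767
integer	$-2^{31}$ ~ $2^{31}-1$
long	$-2^{63}$ ~ $2^{63}-1$

Provided it meets the requirement, choose the type with the smallest range. For example, if a field never exceeds 100, byte is enough. For an age field, short is more than enough. The shorter the field, the more efficient indexing and searching are.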
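The rule of picking the narrowest type can be checked mechanically. The following is a plain-Python sketch, not an Elasticsearch API: the helper names `value_range` and `smallest_type` are made up for this illustration. It derives each range from the bit width and returns the first type that fits.

```python
# Bit widths of the signed integer types listed in the table above.
INTEGER_TYPES = {"byte": 8, "short": 16, "integer": 32, "long": 64}


def value_range(bits):
    """Return (min, max) of a signed two's-complement integer with `bits` bits."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def smallest_type(min_value, max_value):
    """Return the narrowest type name whose range covers [min_value, max_value]."""
    for name, bits in INTEGER_TYPES.items():  # dicts keep insertion order
        low, high = value_range(bits)
        if low <= min_value and max_value <= high:
            return name
    raise ValueError("value does not fit into a 64-bit long")


print(smallest_type(0, 100))   # byte
print(smallest_type(0, 150))   # short
```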

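3) Floating-point types

Type	Description
double	64-bit double-precision IEEE 754 floating point
float	32-bit single-precision IEEE 754 floating point
half_float	16-bit half-precision IEEE 754 floating point
scaled_float	floating point backed by an integer with a fixed scaling factor

For double, float and half_float, -0.0 and +0.0 are different values: a term query for -0.0 does not match +0.0, and vice versa. Likewise, in a range query an upper bound of -0.0 does not match +0.0, and a lower bound of +0.0 does not match -0.0.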
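One way to see how the two zeros can be distinct values: they compare equal numerically but have different IEEE 754 bit patterns. The sketch below only shows the encoding in plain Python (`float_bits` is a helper invented here); that the index distinguishes them for this reason is an assumption, the text above does not explain the mechanism.

```python
import struct


def float_bits(value):
    """Return the 32-bit IEEE 754 pattern of a float as a hex string."""
    return struct.pack(">f", value).hex()


print(0.0 == -0.0)        # True: numeric comparison ignores the sign of zero
print(float_bits(0.0))    # 00000000
print(float_bits(-0.0))   # 80000000: only the sign bit differs
```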

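As for scaled_float: if a price only needs to be accurate to the cent, a price field with value 57.34 and a scaling factor of 100 is stored as 5734.
Where it fits, prefer scaled_float with a scaling factor over the other floating-point types.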
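The arithmetic behind scaled_float can be sketched in a few lines of Python. This is only a model of the idea (multiply by the scaling factor, round to an integer, divide on the way back); the function names are hypothetical and the code does not call Elasticsearch.

```python
def encode_scaled_float(value, scaling_factor):
    """Multiply by the scaling factor and round to the nearest integer."""
    return round(value * scaling_factor)


def decode_scaled_float(stored, scaling_factor):
    """Recover the (rounded) floating-point value from the stored integer."""
    return stored / scaling_factor


stored = encode_scaled_float(57.34, 100)
print(stored)                              # 5734
print(decode_scaled_float(stored, 100))    # 57.34
```

Anything finer than the scaling factor is lost: with a factor of 100, 19.999 is stored as 2000 and read back as 20.0.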

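4) date type
A date can be expressed in any of the following forms:
(1) a formatted date string, such as "2018-01-13" or "2018-01-13 12:10:30"
(2) a long number of milliseconds since the epoch (the epoch is 1970-01-01 00:00:00 UTC)
(3) an integer number of seconds since the epoch
5) boolean type: true and false
6) binary type
A binary field holds binary data, for example an image, as a Base64-encoded string. By default such a field is kept in the document source but is not indexed, so it cannot be searched. In older versions the binary type supports only the index_name attribute.
7) array type
(1) string array: [ "one", "two" ]
(2) integer array: productid: [ 1, 2 ]
(3) array of objects (documents): "user": [ { "name": "Mary", "age": 12 }, { "name": "John", "age": 10 } ]
Note: Elasticsearch does not support arrays whose elements have different data types, such as [ 10, "some string" ]
8) object type
A JSON object; a document can contain nested objects.
9) ip type
An ip field stores an IPv4 or IPv6 address.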

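2. Properties supported in a mapping

1) enabled: when set to false the field is only kept in _source; it cannot be searched or used in aggregations
```
"enabled":true (default) | false
```
2) index: whether to build an inverted index for the field (when set to false the field is not indexed and cannot be searched)
```
"index": true (default) | false
```

3) index_options: which information is stored in the inverted index

Four possible values:
      docs: document numbers only
      freqs: document numbers + term frequencies
      positions: document numbers + term frequencies + positions, typically used for proximity (phrase) queries
      offsets: document numbers + term frequencies + positions + character offsets, typically used for highlighting
  Analyzed fields default to positions; all other fields default to docs.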
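To make the four index_options levels concrete, here is a toy inverted index in plain Python. It is a teaching sketch under simplifying assumptions (a lowercase `\w+` tokenizer instead of a real analyzer; the function name `build_postings` is invented) and does not reflect how Lucene lays out its postings on disk.

```python
import re
from collections import defaultdict

# Which pieces of information each index_options level keeps per posting.
KEPT = {
    "docs": [],
    "freqs": ["freq"],
    "positions": ["freq", "positions"],
    "offsets": ["freq", "positions", "offsets"],
}


def build_postings(docs, index_options="positions"):
    """Build term -> {doc_id: posting} for a dict of doc_id -> text."""
    full = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, match in enumerate(re.finditer(r"\w+", text.lower())):
            entry = full[match.group()].setdefault(
                doc_id, {"freq": 0, "positions": [], "offsets": []}
            )
            entry["freq"] += 1
            entry["positions"].append(position)
            entry["offsets"].append((match.start(), match.end()))
    keep = KEPT[index_options]
    return {
        term: {doc_id: {k: e[k] for k in keep} for doc_id, e in by_doc.items()}
        for term, by_doc in full.items()
    }


docs = {1: "quick brown fox", 2: "quick quick dog"}
print(build_postings(docs, "docs")["quick"])       # {1: {}, 2: {}}
print(build_postings(docs, "freqs")["quick"])      # {1: {'freq': 1}, 2: {'freq': 2}}
print(build_postings(docs, "positions")["quick"][2])  # {'freq': 2, 'positions': [0, 1]}
```

Each level is a superset of the previous one, which is why offsets costs the most index space.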
  
  "index_options": "docs"

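4) norms: whether to store normalisation factors; can be turned off when the field is used only for filtering and aggregations
Enabled by default on analyzed fields; on not-analyzed fields the default is {"enabled": false}. Norms store the length normalisation factor and the index-time boost. They are recommended for fields that take part in scoring, at the cost of extra memory. The object syntax below is the pre-5.x form; from 5.0 on norms is a plain boolean.
```
"norms": {"enabled": true, "loading": "lazy"}
```
5) doc_values: whether doc values are enabled, used for aggregations and sorting
Enabled by default on not_analyzed fields and unavailable on analyzed fields. Doc values considerably speed up sorting and aggregations and save memory.
```
"doc_values": true (default) | false
```
6) fielddata: whether fielddata is enabled on a text field so that it can be sorted and aggregated
Intended for analyzed fields that take part in sorting or aggregations; for not-analyzed fields use doc_values instead.
```
"fielddata": {"format": "disabled"}
```
7) store: whether the field value is stored separately from _source. With the default (false) the field can be searched, but its value cannot be retrieved as a stored field
```
"store": false (default) | true
```
8) coerce: whether automatic type coercion is enabled, for example string to number or floating point to integer
```
"coerce": true (default) | false
```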
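9) multifields: use multi-fields flexibly to cover different business requirements
11) dynamic: controls whether the mapping is updated automatically
```
"dynamic": true (default) | false | strict
```

12) date_detection: whether date strings are detected automatically
```
"date_detection": true (default) | false
```

For details on dynamic and date_detection, see the article "Elasticsearch dynamic mapping strategies".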

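13) analyzer: the analyzer to use; the default is the standard analyzer
```
"analyzer": "ik"
```
14) boost: field-level score boost; the default is 1.0
```
"boost": 1.23
```
15) fields: index the same field value in several ways, for example once analyzed and once not analyzed (the example uses the pre-5.x syntax; on 5.x and later use {"type": "keyword"})
```
"fields": {"raw": {"type": "string", "index": "not_analyzed"}}
```
16) ignore_above: strings longer than the given length (100 characters in the example) are ignored and not indexed
```
"ignore_above": 100
```
17) include_in_all: whether the field is included in the _all field; the default is true unless index is set to no
```
"include_in_all": true
```
18) null_value: the value indexed in place of a missing (null) field; usable on string fields only, and on an analyzed field the substitute value is analyzed as well
```
"null_value": "NULL"
```
19) position_increment_gap: affects proximity and phrase queries; it can be set on multi-valued fields or on analyzed fields, and a slop can be given at query time; the default is 100
```
"position_increment_gap": 0
```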
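The effect of position_increment_gap on a multi-valued field can be modelled as follows. This is a simplified sketch that assumes whitespace tokenization and one position per token; `token_positions` is a name invented for the example, not an Elasticsearch function.

```python
def token_positions(values, position_increment_gap=100):
    """Assign a position to every token of a multi-valued text field."""
    positions = []
    position = -1
    for index, value in enumerate(values):
        if index > 0:
            # Artificial gap inserted between two values of the same field.
            position += position_increment_gap
        for token in value.lower().split():
            position += 1
            positions.append((token, position))
    return positions


names = ["John Abraham", "Lincoln Smith"]
print(token_positions(names))
# [('john', 0), ('abraham', 1), ('lincoln', 102), ('smith', 103)]
print(token_positions(names, position_increment_gap=0))
# [('john', 0), ('abraham', 1), ('lincoln', 2), ('smith', 3)]
```

With the default gap, a phrase query for "abraham lincoln" does not match across the two values unless a large slop is given; with a gap of 0 the two values behave like one continuous string.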
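20) search_analyzer: the analyzer used at search time; by default it is the same as analyzer. For example, index with standard + ngram and search with standard to implement autocomplete
```
"search_analyzer": "ik"
```
21) similarity: the scoring algorithm for a field; the default is TF/IDF (BM25 from 5.0 on). It only applies to string and analyzed fields
```
"similarity": "BM25"
```
22) term_vector: term vectors are not stored by default. Supported values are yes (terms), with_positions (terms + positions), with_offsets (terms + offsets) and with_positions_offsets (terms + positions + offsets). They speed up the fast vector highlighter, but enabling them increases index size, so they are a poor fit for large data volumes
```
"term_vector": "no"
```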

3. Workflow for configuring mapping fields

[Figure: flowchart for configuring mapping fields]

----------------------------

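A note up front: every field in Elasticsearch corresponds to exactly one data type.
All examples in this article were run against Elasticsearch 6.6.0. Other versions may have changed or dropped some of these APIs, so please keep that in mind.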

1 Core data types

1.1 String type - string (no longer supported)

(1) Usage example:

PUT website
{
    "mappings": {
        "blog": {
            "properties": {
                "title": {"type": "string"},    // full text
                "tags": {"type": "string", "index": "not_analyzed"} // keyword, not analyzed
            }
        }
    }
}

(2) Response from ES 5.6.10:

#! Deprecation: The [string] field is deprecated, please use [text] or [keyword] instead on [tags]
#! Deprecation: The [string] field is deprecated, please use [text] or [keyword] instead on [title]
{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "website"
}

(3) Response from ES 6.6.0:

{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "No handler for type [string] declared on field [title]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "Failed to parse mapping [blog]: No handler for type [string] declared on field [title]",
    "caused_by": {
      "type": "mapper_parsing_exception",
      "reason": "No handler for type [string] declared on field [title]"
    }
  },
  "status": 400
}

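So the string type has been removed; use text or keyword instead.

1.1.1 Text type - text

Starting with Elasticsearch 5.0, text replaces string for fields that need to be analyzed.

When a field is used for full-text search (and is therefore analyzed), such as a product name or product description, it should be of type text.

The content of a text field is analyzed. Whether the field is indexed (searchable) can be set with "index": "true|false".
By default text fields cannot be used for sorting, and they are rarely used for aggregations.

Usage example: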

PUT website
{
    "mappings": {
        "blog": {
            "properties": {
                "summary": {"type": "text", "index": "true"}
            }
        }
    }
}

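1.1.2 Keyword type - keyword

Starting with Elasticsearch 5.0, keyword replaces string for fields that do not need to be analyzed.

When a field is filtered, sorted or aggregated by its exact value, it should be of type keyword.

The content of a keyword field is not analyzed. Whether the field is indexed (searchable) can be set with "index": "true|false".

Usage example: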

PUT website
{
    "mappings": {
        "blog": {
            "properties": {
                "tags": {"type": "keyword", "index": "true"}
            }
        }
    }
}

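1.2 Numeric types - 8 kinds

The numeric types are:

Type	Description
byte	signed 8-bit integer, range: [-128 ~ 127]
short	signed 16-bit integer, range: [-32768 ~ 32767]
integer	signed 32-bit integer, range: [$-2^{31}$ ~ $2^{31}-1$]
long	signed 64-bit integer, range: [$-2^{63}$ ~ $2^{63}-1$]
float	32-bit single-precision floating point
double	64-bit double-precision floating point
half_float	16-bit half-precision IEEE 754 floating point
scaled_float	floating point with a scaling factor; for example, a price that only needs cent precision, 57.34 with a scaling factor of 100, is stored as 5734

Usage notes:

Choose the type with the smallest sufficient range; the shorter the field, the more efficient indexing and searching are;
Prefer the floating-point type with a scaling factor where it fits.

Usage example:

PUT shop
{
    "mappings": {
        "book": {
            "properties": {
                "name": {"type": "text"},
                "quantity": {"type": "integer"},  // integer type
                "price": {
                    "type": "scaled_float",       // scaled_float type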
                    "scaling_factor": 100
                }
            }
        }
    }
}

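1.3 Date type - date

JSON has no date data type, so in Elasticsearch a date can be:

a string containing a formatted date, such as "2018-10-01" or "2018/10/01 12:10:30";
a long number representing milliseconds since the epoch;
an integer representing seconds since the epoch.

Internally, dates are converted to UTC (when a time zone is given) and stored as a long number of milliseconds since the epoch.
A custom date format can be set; if none is given, the default format is used: strict_date_optional_time||epoch_millis

(1) Example using the default date format:

// add the mapping
PUT website
{
    "mappings": {
        "blog": {
            "properties": {
                "pub_date": {"type": "date"}   // date type
            }
        }
    }
}

// add documents
PUT website/blog/11
{ "pub_date": "2018-10-10" }

PUT website/blog/12
{ "pub_date": "2018-10-10T12:00:00Z" }  // the default date format used by Solr

PUT website/blog/13
{ "pub_date": "1589584930103" }         // milliseconds since the epoch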

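(2) Multiple date formats:

Several formats can be given, separated by a double pipe ||. Each format is tried in turn until one matches.
The first format is also the one used to convert the stored millisecond value back into a string.

Usage example:

// add the mapping
PUT website
{
    "mappings": {
        "blog": {
            "properties": {
                "date": {
                    "type": "date",  // accepts any of the formats below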
                    "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
                }
            }
        }
    }
}
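The "try each format in turn" behaviour can be sketched in Python. This is an illustration only: the `strptime` patterns are an assumed translation of `yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis`, the helper name `to_epoch_millis` is made up, and values without a time zone are assumed to be UTC.

```python
from datetime import datetime, timezone

# Assumed Python equivalents of "yyyy-MM-dd HH:mm:ss" and "yyyy-MM-dd".
STRING_FORMATS = ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d"]


def to_epoch_millis(value):
    """Try each string format in order, then fall back to epoch_millis."""
    for fmt in STRING_FORMATS:
        try:
            parsed = datetime.strptime(str(value), fmt)
        except ValueError:
            continue  # this format did not match, try the next one
        return int(parsed.replace(tzinfo=timezone.utc).timestamp() * 1000)
    # Last alternative: the value already is a millisecond count.
    return int(value)


print(to_epoch_millis("2018-10-10 12:00:00"))  # 1539172800000
print(to_epoch_millis("2018-10-10"))           # 1539129600000
print(to_epoch_millis("1589584930103"))        # 1589584930103
```

A value that matches none of the alternatives raises `ValueError`, which mirrors the fact that a document with an unparseable date is rejected.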

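1.4 Boolean type - boolean

Versions before 6.0 accept strings and numbers that represent true or false:

True values: true, "true", "on", "yes", "1"...
False values: false, "false", "off", "no", "0", "" (empty string), 0.0, 0

From 6.0 on, the lenient forms such as "on", "yes", "1" and 0 are rejected; use true / false (or "true" / "false").

1.5 Binary type - binary

The binary type accepts a binary value as a Base64-encoded string. The field is not stored by default and cannot be searched.

Usage example:

// add the mapping
PUT website
{
    "mappings": {
        "blog": {
            "properties": {
                "blob": {"type": "binary"}   // binary
            }
        }
    }
}
// add a document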
PUT website/blog/1
{
    "title": "Some binary blog",
    "blob": "hED903KSrA084fRiD5JLgY=="
}

Note: the Base64-encoded binary value must not contain embedded newlines \n.
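When the value is produced in Python, the choice of standard-library function matters for the newline rule. A minimal sketch (the helper name `to_binary_field` is made up for this example): `base64.b64encode` returns a single line, whereas `base64.encodebytes` inserts line breaks and would violate the rule above.

```python
import base64


def to_binary_field(data):
    """Encode raw bytes as a single-line Base64 string for a binary field."""
    return base64.b64encode(data).decode("ascii")


encoded = to_binary_field(b"Some binary blog")
print(encoded)
print("\n" in encoded)                                      # False
print("\n" in base64.encodebytes(b"x" * 100).decode())      # True: do not use this one
```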

1.6 Range types - range

The following range types are supported:

Type	Range
integer_range	$-2^{31}$ ~ $2^{31}-1$
long_range	$-2^{63}$ ~ $2^{63}-1$
float_range	32-bit single-precision floating point
double_range	64-bit double-precision floating point
date_range	64-bit integer, in milliseconds
ip_range	a range of IP values, IPv4, IPv6 or a mix of both

(1) Add the mapping:

PUT company
{
    "mappings": {
        "department": {
            "properties": {
                "expected_number": {  // expected headcount
                    "type": "integer_range"
                },
                "time_frame": {       // development timeline
                    "type": "date_range", 
                    "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
                },
                "ip_whitelist": {     // IP whitelist
                    "type": "ip_range"
                }
            }
        }
    }
}

(2) Add a document:

PUT company/department/1
{
    "expected_number" : {
        "gte" : 10,
        "lte" : 20
    },
    "time_frame" : { 
        "gte" : "2018-10-01 12:00:00", 
        "lte" : "2018-11-01"
    }, 
    "ip_whitelist": "192.168.0.0/16"
}

(3) Query the data:

GET company/department/_search
{
    "query": {
        "term": {
            "expected_number": {
                "value": 12
            }
        }
    }
}
GET company/department/_search
{
    "query": {
        "range": {
            "time_frame": {
                "gte": "2018-08-01",
                "lte": "2018-12-01",
                "relation": "within" 
            }
        }
    }
}

Query result:

{
  "took": 26,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "company",
        "_type": "department",
        "_id": "1",
        "_score": 1.0,
        "_source": {
          "expected_number": {
            "gte": 10,
            "lte": 20
          },
          "time_frame": {
            "gte": "2018-10-01 12:00:00",
            "lte": "2018-11-01"
          },
          "ip_whitelist" : "192.168.0.0/16"
        }
      }
    ]
  }
}
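The way a query is compared with a stored range can be modelled with closed intervals. The sketch below is plain Python with an invented helper name (`range_matches`); it covers the three relation values of the range query (intersects, within, contains) for gte/lte bounds only.

```python
from datetime import date


def range_matches(field_range, query_range, relation="intersects"):
    """Compare a stored closed range with a query range.

    Both arguments are (gte, lte) tuples of comparable values.
    """
    f_lo, f_hi = field_range
    q_lo, q_hi = query_range
    if relation == "within":      # stored range lies entirely inside the query range
        return q_lo <= f_lo and f_hi <= q_hi
    if relation == "contains":    # stored range entirely covers the query range
        return f_lo <= q_lo and q_hi <= f_hi
    if relation == "intersects":  # the two ranges overlap somewhere
        return f_lo <= q_hi and q_lo <= f_hi
    raise ValueError(f"unknown relation: {relation}")


# expected_number is 10..20, so a term query for 12 matches.
print(range_matches((10, 20), (12, 12)))  # True

# time_frame 2018-10-01 .. 2018-11-01 lies within 2018-08-01 .. 2018-12-01.
stored = (date(2018, 10, 1), date(2018, 11, 1))
print(range_matches(stored, (date(2018, 8, 1), date(2018, 12, 1)), "within"))  # True
```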

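2 Complex data types

2.1 Array type - array

Elasticsearch has no dedicated array type; simply use [] to define one.

All values in an array must be of the same data type; arrays of mixed types are not supported:

① array of strings: ["one", "two"];
② array of integers: [1, 2];
③ array of arrays: [1, [2, 3]], which is equivalent to [1, 2, 3];
④ array of objects: [{"name": "Tom", "age": 20}, {"name": "Jerry", "age": 18}].

Notes:

When data is added dynamically, the type of the first value in the array determines the type of the whole array;

Mixed-type arrays such as [1, "abc"] are not supported;

An array may contain null values; an empty array [] is treated as a missing field, that is, a field with no value.

2.2 Object type - object

JSON documents are hierarchical: a document can contain inner objects, which can in turn contain inner objects.

(1) Example document: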

PUT employee/developer/1
{
    "name": "ma_shoufeng",
    "address": {
        "region": "China",
        "location": {"province": "GuangDong", "city": "GuangZhou"}
    }
}

(2) How it is stored:

{
    "name":                       "ma_shoufeng",
    "address.region":             "China",
    "address.location.province":  "GuangDong", 
    "address.location.city":      "GuangZhou"
}
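The flattening into dotted paths shown above can be reproduced with a short recursive function. This is only a Python model of the idea (the name `flatten` is chosen for the example); it handles nested dicts, not arrays.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into a single dict keyed by dotted paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))  # descend into the inner object
        else:
            flat[path] = value
    return flat


doc = {
    "name": "ma_shoufeng",
    "address": {
        "region": "China",
        "location": {"province": "GuangDong", "city": "GuangZhou"},
    },
}
for path, value in flatten(doc).items():
    print(path, "=", value)
```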

(3) The mapping of this document looks roughly like:

PUT employee
{
    "mappings": {
        "developer": {
            "properties": {
                "name": { "type": "text", "index": "true" }, 
                "address": {
                    "properties": {
                        "region": { "type": "keyword", "index": "true" },
                        "location": {
                            "properties": {
                                "province": { "type": "keyword", "index": "true" },
                                "city": { "type": "keyword", "index": "true" }
                            }
                        }
                    }
                }
            }
        }
    }
}

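2.3 Nested type - nested

The nested type is a specialised version of the object type that lets the objects in an array be indexed and queried independently of each other.

2.3.1 How arrays of objects are stored

① Add a document: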

PUT game_of_thrones/role/1
{
    "group": "stark",
    "performer": [
        {"first": "John", "last": "Snow"},
        {"first": "Sansa", "last": "Stark"}
    ]
}

② Internal storage structure:

{
    "group":             "stark",
    "performer.first": [ "john", "sansa" ],
    "performer.last":  [ "snow", "stark" ]
}

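③ Analysis:

As shown, performer.first and performer.last are flattened into multi-valued fields, so the association between John and Snow is lost.

A query may therefore return a match for John Stark.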
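The false match can be reproduced outside Elasticsearch. The sketch below is a plain-Python model with invented helper names: `flat_match` looks at the flattened multi-valued fields, while `per_object_match` checks each object on its own, which is the behaviour the nested type is meant to provide.

```python
def flatten_object_array(objects, field):
    """Flatten an array of objects into multi-valued dotted fields."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"{field}.{key}", []).append(value.lower())
    return flat


def flat_match(flat, first, last):
    """Match against the flattened fields: the pairing is lost."""
    return first in flat["performer.first"] and last in flat["performer.last"]


def per_object_match(objects, first, last):
    """Match each object independently, as a nested query would."""
    return any(
        o["first"].lower() == first and o["last"].lower() == last for o in objects
    )


performers = [
    {"first": "John", "last": "Snow"},
    {"first": "Sansa", "last": "Stark"},
]
flat = flatten_object_array(performers, "performer")
print(flat_match(flat, "john", "stark"))              # True: a false positive
print(per_object_match(performers, "john", "stark"))  # False
print(per_object_match(performers, "john", "snow"))   # True
```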

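2.3.2 Using nested to overcome the limits of object

If you need to index arrays of objects and keep each object in the array independent, use the nested data type.

In essence, each nested object is split out and indexed as a separate hidden document.

① Create the mapping: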

PUT game_of_thrones
{
    "mappings": {
        "role": {
            "properties": {
                "performer": {"type": "nested" }
            }
        }
    }
}

② Add a document:

PUT game_of_thrones/role/1
{
    "group" : "stark",
    "performer" : [
        {"first": "John", "last": "Snow"},
        {"first": "Sansa", "last": "Stark"}
    ]
}

③ Search the data:

GET game_of_thrones/_search
{
    "query": {
        "nested": {
            "path": "performer",
            "query": {
                "bool": {
                    "must": [
                        { "match": { "performer.first": "John" }},
                        { "match": { "performer.last":  "Snow" }} 
                    ]
                }
            }, 
            "inner_hits": {
                "highlight": {
                    "fields": {"performer.first": {}}
                }
            }
        }
    }
}

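3 Geo data types

3.1 Geo-point type - geo_point

The geo-point type stores a latitude/longitude pair and can be used to:

find geo-points within a given area;

aggregate documents by geographic location or by distance from a central point;

integrate distance into the relevance score of a document;

sort documents by distance.

(1) Add the mapping: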

PUT employee
{
    "mappings": {
        "developer": {
            "properties": {
                "location": {"type": "geo_point"}
            }
        }
    }
}

(2) Store geo locations:

// Option 1: latitude + longitude as an object
PUT employee/developer/1
{
    "text": "Canton Tower - geo-point as a lat/lon object", 
    "location": {
        "lat": 23.11, "lon": 113.33     // lat: latitude, lon: longitude
    }
}

// Option 2: a "latitude, longitude" string
PUT employee/developer/2
{
  "text": "Canton Tower - geo-point as a string",
  "location": "23.11, 113.33"           // latitude, longitude
}

// Option 3: a [longitude, latitude] array
PUT employee/developer/3
{
  "text": "Canton Tower - geo-point as an array",
  "location": [ 113.33, 23.11 ]         // longitude, latitude
}

(3) Query example:

GET employee/_search
{
    "query": { 
        "geo_bounding_box": { 
            "location": {
                "top_left": { "lat": 24, "lon": 113 },      // top-left corner of the bounding box
                "bottom_right": { "lat": 22, "lon": 114 }   // bottom-right corner of the bounding box
            }
        }
    }
}
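What the bounding-box filter checks can be written out as two comparisons. This is a simplified Python sketch (the function name `in_bounding_box` is invented) that assumes the box does not cross the 180° meridian; it also shows the [lon, lat] order of the array form being normalised.

```python
def in_bounding_box(point, top_left, bottom_right):
    """Return True if a {"lat", "lon"} point lies inside the box."""
    lat_ok = bottom_right["lat"] <= point["lat"] <= top_left["lat"]
    lon_ok = top_left["lon"] <= point["lon"] <= bottom_right["lon"]
    return lat_ok and lon_ok


def from_array(lon_lat):
    """Convert the array form, which is [lon, lat], into a lat/lon dict."""
    lon, lat = lon_lat
    return {"lat": lat, "lon": lon}


top_left = {"lat": 24, "lon": 113}
bottom_right = {"lat": 22, "lon": 114}

print(in_bounding_box({"lat": 23.11, "lon": 113.33}, top_left, bottom_right))  # True
print(in_bounding_box(from_array([113.33, 23.11]), top_left, bottom_right))    # True
print(in_bounding_box({"lat": 39.9, "lon": 116.4}, top_left, bottom_right))    # False
```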

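3.2 Geo-shape type - geo_shape

Used for complex shapes such as polygons. It is used less often and is omitted here.

For more, see the article "Summary of geolocation in Elasticsearch".

4 Specialised data types

4.1 IP type

An ip field stores an IPv4 or IPv6 address; in older releases an IPv4 address was in essence stored as a long.

(1) Add the mapping: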

PUT employee
{
    "mappings": {
        "customer": {
            "properties": {
                "ip_addr": { "type": "ip" }
            }
        }
    }
}

(2) Add a document:

PUT employee/customer/1
{ "ip_addr": "192.168.1.1" }

(3) Query the data:

GET employee/customer/_search
{
    "query": {
        "term": { "ip_addr": "192.168.0.0/16" }
    }
}
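The term query above uses CIDR notation, so it matches every address inside the 192.168.0.0/16 network. The same membership test can be sketched with Python's standard `ipaddress` module (the helper name `ip_in_cidr` is made up for the example):

```python
import ipaddress


def ip_in_cidr(address, cidr):
    """Return True if the address belongs to the network written in CIDR notation."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(cidr)


print(ip_in_cidr("192.168.1.1", "192.168.0.0/16"))  # True
print(ip_in_cidr("10.0.0.1", "192.168.0.0/16"))     # False
```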

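4.2 Token count type - token_count

The token_count type counts the number of tokens (words) in a string.

It is in essence an integer field: it accepts a string value, analyzes it, and then indexes the number of tokens in the string.

(1) Add the mapping: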

PUT employee
{
    "mappings": {
        "customer": {
            "properties": {
                "name": { 
                    "type": "text",
                    "fields": {
                        "length": {
                            "type": "token_count", 
                            "analyzer": "standard"
                        }
                    }
                }
            }
        }
    }
}

(2) Add documents:

PUT employee/customer/1
{ "name": "John Snow" }
PUT employee/customer/2
{ "name": "Tyrion Lannister" }

(3) Query the data:

GET employee/customer/_search
{
    "query": {
        "term": { "name.length": 2 }
    }
}
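The value that ends up in name.length can be approximated in Python. This is a rough sketch: it assumes that splitting on word characters is close enough to the standard analyzer for simple English names, and `token_count` here is a local helper, not the Elasticsearch implementation.

```python
import re


def token_count(text):
    """Count word-like tokens, a rough stand-in for the standard analyzer."""
    return len(re.findall(r"\w+", text))


for name in ["John Snow", "Tyrion Lannister"]:
    print(name, "->", token_count(name))
```

Both sample names contain two tokens, so both documents match the term query on name.length with value 2.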

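References

Elasticsearch 6.6 official documentation - Field datatypes

Elasticsearch 5.4 Mapping in detail

Author: yongfutian. Link: https://www.jianshu.com/p/01f489c46c38. Source: Jianshu. Copyright belongs to the author; for any form of reproduction, please contact the author for permission and credit the source.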