Master analyzer configuration and testing
Master document management operations
Master routing rules
Understanding Analyzers
Analyzer
In ES, an analyzer is built from three kinds of components:
character filter: filters the raw text before tokenization, e.g. stripping HTML tag characters; the result is then handed to the tokenizer. An analyzer may contain zero or more character filters; when there are several, they run in the configured order.
tokenizer: splits the text into tokens. An analyzer must contain exactly one tokenizer.
token filter: post-processes the tokens produced by the tokenizer, e.g. lowercasing, stop-word removal, synonym handling. An analyzer may contain zero or more token filters, applied in the configured order.
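A quick way to see all three stages working together is the _analyze API. A minimal sketch (the sample text is made up) that strips HTML, tokenizes, then lowercases:

POST _analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "standard",
  "filter": [ "lowercase" ],
  "text": "<b>The QUICK Fox</b>"
}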
How to test an analyzer
POST _analyze
{
  "analyzer": "whitespace",
  "text": "The quick brown fox."
}

POST _analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase", "asciifolding" ],
  "text": "Is this déja vu?"
}
Make sure you understand position and offset in the output:
{ "token": "The", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 }, { "token": "quick", "start_offset": 4, "end_offset": 9, "type": "word", "position": 1 }
Built-in character filters
HTML Strip Character Filter
html_strip: strips HTML tags and decodes HTML entities such as &amp;.
Mapping Character Filter
mapping: replaces occurrences of specified strings in the text with the replacement strings you configure.
Pattern Replace Character Filter
pattern_replace: performs regular-expression replacement.
HTML Strip Character Filter
Test:
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": [ "html_strip" ],
  "text": "<p>I'm so <b>happy</b>!</p>"
}
Configure it in an index:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["b"]
        }
      }
    }
  }
}

Test:
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I'm so <b>happy</b>!</p>"
}
escaped_tags lists tags that should be left alone. If you have no such exceptions to configure, there is no need for a custom definition; just use html_strip directly, as in the my_analyzer above.
Mapping character filter
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html
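The linked page has the details. A minimal sketch (the index name and the mappings themselves are made up here) that replaces emoticons with words before tokenization:

PUT my_mapping_demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["my_mappings"]
        }
      },
      "char_filter": {
        "my_mappings": {
          "type": "mapping",
          "mappings": [ ":) => happy", ":( => sad" ]
        }
      }
    }
  }
}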
Pattern Replace Character Filter
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-replace-charfilter.html
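Likewise, a sketch for pattern_replace (names assumed) that rewrites dash-separated numbers such as 123-456-789 into 123_456_789 so they survive tokenization as one token:

PUT my_pattern_demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["my_pattern"]
        }
      },
      "char_filter": {
        "my_pattern": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}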
Built-in tokenizers
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html
The integrated Chinese analysis plugin IKAnalyzer provides two tokenizers: ik_smart and ik_max_word.
Test the tokenizers:
POST _analyze
{
  "tokenizer": "standard",
  "text": "张三说的确实在理"
}

POST _analyze
{
  "tokenizer": "ik_smart",
  "text": "张三说的确实在理"
}
Built-in token filters
ES ships with many token filters; for details see: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html
Lowercase Token Filter: lowercase, converts tokens to lower case
Stop Token Filter: stop, removes stop words
Synonym Token Filter: synonym, handles synonyms
Note: the IKAnalyzer Chinese analyzer comes with stop-word filtering built in.
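A quick sketch of the stop filter in isolation (with its default English stop list, words like "The" and "is" are dropped):

POST _analyze
{
  "tokenizer": "standard",
  "filter": [ "stop" ],
  "text": "The brown fox is quick"
}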
Synonym Token Filter (synonyms)
PUT /test_index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_ik_synonym": {
            "tokenizer": "ik_smart",
            "filter": ["synonym"]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms_path": "analysis/synonym.txt"
          }
        }
      }
    }
  }
}

synonyms_path: the synonym file, given relative to the config directory.
Synonym definition format
ES supports two synonym formats: Solr and WordNet.
Define the following synonyms in analysis/synonym.txt using the Solr format. The file must be UTF-8 encoded:
张三,李四
电饭煲,电饭锅 => 电饭煲
电脑 => 计算机,computer
One synonym group per line; => means "normalize to".
Test:
POST test_index/_analyze
{
  "analyzer": "my_ik_synonym",
  "text": "张三说的确实在理"
}

POST test_index/_analyze
{
  "analyzer": "my_ik_synonym",
  "text": "我想买个电饭锅和一个电脑"
}

Study the results of these examples to understand how synonyms are handled.
Built-in analyzers
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html
IKAnalyzer provides two analyzers: ik_smart and ik_max_word.
The built-in and integrated analyzers can be used directly. If they do not meet our needs, we can combine character filters, a tokenizer, and token filters into a custom analyzer of our own.
Custom Analyzer
zero or more character filters
a tokenizer
zero or more token filters.
Configuration:
PUT my_index8
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ik_analyzer": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "char_filter": [ "html_strip" ],
          "filter": [ "synonym" ]
        }
      },
      "filter": {
        "synonym": {
          "type": "synonym",
          "synonyms_path": "analysis/synonym.txt"
        }
      }
    }
  }
}
Specifying an analyzer for a field
PUT my_index8/_mapping/_doc
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "my_ik_analyzer"
    }
  }
}

PUT my_index8/_mapping/_doc
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "my_ik_analyzer",
      "search_analyzer": "other_analyzer"
    }
  }
}

PUT my_index8/_doc/1
{ "title": "张三说的确实在理" }

GET /my_index8/_search
{
  "query": {
    "term": { "title": "张三" }
  }
}
Defining a default analyzer for an index
PUT /my_index10
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "ik_smart",
          "filter": [ "synonym" ]
        }
      },
      "filter": {
        "synonym": {
          "type": "synonym",
          "synonyms_path": "analysis/synonym.txt"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": { "type": "text" }
      }
    }
  }
}

PUT my_index10/_doc/1
{ "title": "张三说的确实在理" }

GET /my_index10/_search
{
  "query": {
    "term": { "title": "张三" }
  }
}
Order in which analyzers are applied
We can specify an analyzer per query, per field, and per index.
At index time, ES chooses the analyzer in the following order:
First, the analyzer specified in the field's mapping definition
If the field mapping specifies none, the analyzer named default in the index settings
If the index settings define no default analyzer, the standard analyzer
At search time, ES chooses the analyzer in the following order:
The analyzer defined in a full-text query.
The search_analyzer defined in the field mapping.
The analyzer defined in the field mapping.
An analyzer named default_search in the index settings.
An analyzer named default in the index settings.
The standard analyzer.
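For example, the first rule (an analyzer set on the full-text query itself) looks like this; a sketch reusing my_index8 from above:

GET my_index8/_search
{
  "query": {
    "match": {
      "title": {
        "query": "张三",
        "analyzer": "ik_smart"
      }
    }
  }
}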
Creating documents
// create/overwrite with an explicit document id
PUT twitter/_doc/1
{
  "id": 1,
  "user": "kimchy",
  "post_date": "2009-11-15T14:12:12",
  "message": "trying out Elasticsearch"
}

// create with an auto-generated document id
POST twitter/_doc/
{
  "id": 1,
  "user": "kimchy",
  "post_date": "2009-11-15T14:12:12",
  "message": "trying out Elasticsearch"
}

{
  "_index": "twitter",            // the index the document belongs to
  "_type": "_doc",                // the mapping type
  "_id": "p-D3ymMBl4RK_V6aWu_V",  // document id
  "_version": 1,                  // document version
  "result": "created",
  "_shards": {                    // how the shard copies handled the write
    "total": 3,                   // the shard has three copies
    "successful": 1,              // written successfully on one copy
    "failed": 0                   // number of failed copies
  },
  "_seq_no": 0,                   // sequence number of this operation on the shard
  "_primary_term": 3              // the primary shard's term
}
Getting a single document
// check whether the document exists
HEAD twitter/_doc/11

GET twitter/_doc/1

// fetch without the _source
GET twitter/_doc/1?_source=false

// fetch only the _source
GET twitter/_doc/1/_source

{
  "_index": "twitter",
  "_type": "_doc",
  "_id": "1",
  "_version": 2,
  "found": true,
  "_source": {
    "id": 1,
    "user": "kimchy",
    "post_date": "2009-11-15T14:12:12",
    "message": "trying out Elasticsearch"
  }
}

// retrieving stored fields
PUT twitter11
{
  "mappings": {
    "_doc": {
      "properties": {
        "counter": { "type": "integer", "store": false },
        "tags": { "type": "keyword", "store": true }
      }
    }
  }
}

PUT twitter11/_doc/1
{ "counter": 1, "tags": ["red"] }

GET twitter11/_doc/1?stored_fields=tags,counter
Getting multiple documents: _mget
GET /_mget
{
  "docs": [
    { "_index": "twitter", "_type": "_doc", "_id": "1" },
    { "_index": "twitter", "_type": "_doc", "_id": "2", "stored_fields": ["field3", "field4"] }
  ]
}

GET /twitter/_mget
{
  "docs": [
    { "_type": "_doc", "_id": "1" },
    { "_type": "_doc", "_id": "2" }
  ]
}

GET /twitter/_doc/_mget
{
  "docs": [
    { "_id": "1" },
    { "_id": "2" }
  ]
}

GET /twitter/_doc/_mget
{ "ids": ["1", "2"] }

The _source and stored_fields request parameters can be given either in the URL or in the request JSON body.
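For instance, a sketch of filtering _source in the URL (field names follow the twitter examples above):

GET twitter/_doc/1?_source=user,post_date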
Deleting documents
// delete by document id
DELETE twitter/_doc/1

// delete with version-based concurrency control
DELETE twitter/_doc/1?version=1

{
  "_shards": {
    "total": 2,
    "failed": 0,
    "successful": 2
  },
  "_index": "twitter",
  "_type": "_doc",
  "_id": "1",
  "_version": 2,
  "_primary_term": 1,
  "_seq_no": 5,
  "result": "deleted"
}
Delete by query
POST twitter/_delete_by_query
{
  "query": {
    "match": { "message": "some message" }
  }
}

// when documents hit version conflicts, do not abort: record the conflicting
// documents and keep deleting the other documents that match the query
POST twitter/_doc/_delete_by_query?conflicts=proceed
{
  "query": { "match_all": {} }
}
Monitoring delete-by-query jobs with the task API
GET _tasks?detailed=true&actions=*/delete/byquery

// check the status of a specific task
GET /_tasks/taskId:1

// cancel a task
POST _tasks/task_id:1/_cancel
{ "nodes" : { "r1A2WoRbTwKZ516z6NEs5A" : { "name" : "r1A2WoR", "transport_address" : "127.0.0.1:9300", "host" : "127.0.0.1", "ip" : "127.0.0.1:9300", "attributes" : { "testattr" : "test", "portsfile" : "true" }, "tasks" : { "r1A2WoRbTwKZ516z6NEs5A:36619" : { "node" : "r1A2WoRbTwKZ516z6NEs5A", "id" : 36619, "type" : "transport", "action" : "indices:data/write/delete/byquery", "status" : { "total" : 6154, "updated" : 0, "created" : 0, "deleted" : 3500, "batches" : 36, "version_conflicts" : 0, "noops" : 0, "retries": 0, "throttled_millis": 0 }, "description" : "" } } } }}
Updating documents
// update (overwrite) by specifying the document id
PUT twitter/_doc/1
{
  "id": 1,
  "user": "kimchy",
  "post_date": "2009-11-15T14:12:12",
  "message": "trying out Elasticsearch"
}

// optimistic concurrency control via version
PUT twitter/_doc/1?version=1
{
  "id": 1,
  "user": "kimchy",
  "post_date": "2009-11-15T14:12:12",
  "message": "trying out Elasticsearch"
}

{
  "_index": "twitter",
  "_type": "_doc",
  "_id": "1",
  "_version": 3,
  "result": "updated",
  "_shards": { "total": 3, "successful": 1, "failed": 0 },
  "_seq_no": 2,
  "_primary_term": 3
}
Scripted update: updating documents with a script
// 1. prepare a document
PUT uptest/_doc/1
{ "counter": 1, "tags": ["red"] }

// 2. add 4 to the counter of document 1
POST uptest/_doc/1/_update
{
  "script": {
    "source": "ctx._source.counter += params.count",
    "lang": "painless",
    "params": { "count": 4 }
  }
}

// 3. append an element to the array
POST uptest/_doc/1/_update
{
  "script": {
    "source": "ctx._source.tags.add(params.tag)",
    "lang": "painless",
    "params": { "tag": "blue" }
  }
}

About the script: painless is a scripting language built into ES; ctx is the execution context (through it you can also access _index, _type, _id, _version, _routing and _now, the current timestamp); params holds the parameters.
Note: scripted updates require the index's _source field to be enabled. An update runs as follows:
1. Fetch the original document.
2. Run the script against the original _source data to modify it.
3. Delete the original indexed document.
4. Index the modified document.
Compared with doing the get and reindex yourself, this merely saves some network round trips and reduces the chance of a version conflict between the get and the index.
// 4. add a field
POST uptest/_doc/1/_update
{ "script": "ctx._source.new_field = 'value_of_new_field'" }

// 5. remove a field
POST uptest/_doc/1/_update
{ "script": "ctx._source.remove('new_field')" }

// 6. decide in the script whether to delete the document or do nothing
POST uptest/_doc/1/_update
{
  "script": {
    "source": "if (ctx._source.tags.contains(params.tag)) { ctx.op = 'delete' } else { ctx.op = 'none' }",
    "lang": "painless",
    "params": { "tag": "green" }
  }
}

// 7. merge the given partial document into the existing one
POST uptest/_doc/1/_update
{ "doc": { "name": "new_name" } }

// 8. running 7 again with identical content changes nothing: result is "noop"
{
  "_index": "uptest",
  "_type": "_doc",
  "_id": "1",
  "_version": 4,
  "result": "noop",
  "_shards": { "total": 0, "successful": 0, "failed": 0 }
}

// 9. turn off noop detection
POST uptest/_doc/1/_update
{
  "doc": { "name": "new_name" },
  "detect_noop": false
}

// 10. upsert: if the target document exists, run the script to update it;
// otherwise index the content of "upsert" as a new document
POST uptest/_doc/1/_update
{
  "script": {
    "source": "ctx._source.counter += params.count",
    "lang": "painless",
    "params": { "count": 4 }
  },
  "upsert": { "counter": 1 }
}
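A related option not shown above is doc_as_upsert, which uses the partial doc itself as the new document when the target does not exist; a minimal sketch:

POST uptest/_doc/1/_update
{
  "doc": { "name": "new_name" },
  "doc_as_upsert": true
}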
Update by query
// update the documents selected by a query
POST twitter/_update_by_query
{
  "script": {
    "source": "ctx._source.likes++",
    "lang": "painless"
  },
  "query": {
    "term": { "user": "kimchy" }
  }
}
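As with delete-by-query, version conflicts can be skipped instead of aborting the whole run:

POST twitter/_update_by_query?conflicts=proceed
{
  "script": {
    "source": "ctx._source.likes++",
    "lang": "painless"
  },
  "query": {
    "term": { "user": "kimchy" }
  }
}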
Bulk operations
The bulk API /_bulk lets us perform many index/delete operations in a single call, which can greatly increase indexing speed. The request body must use the following newline-delimited JSON structure:
action_and_meta_data\n
optional_source\n
action_and_meta_data\n
optional_source\n
....
action_and_meta_data\n
optional_source\n

POST _bulk
{ "index" : { "_index" : "test", "_type" : "_doc", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_type" : "_doc", "_id" : "2" } }
{ "create" : { "_index" : "test", "_type" : "_doc", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_type" : "_doc", "_index" : "test"} }
{ "doc" : {"field2" : "value2"} }

action_and_meta_data: the action can be index, create, delete or update; the metadata consists of _index, _type and _id.
The request endpoint can be /_bulk, /{index}/_bulk, or /{index}/{type}/_bulk.
Bulk-indexing multiple documents with curl + a JSON file
curl -H "Content-Type: application/json" -XPOST "localhost:9200/bank/_doc/_bulk?pretty&refresh" --data-binary "@accounts.json"
curl "localhost:9200/_cat/indices?v"
Reindex
The Reindex API /_reindex lets us reindex (copy) the documents of one index into another. It requires _source to be enabled on the source index. The destination index's settings and mappings are not copied from the source index.
POST _reindex
{
  "source": { "index": "twitter" },
  "dest": { "index": "new_twitter" }
}
One question to consider when reindexing: the destination index may already contain some of the source documents; how should their versions be handled?
1. If version_type is not specified, or is set to internal, the destination index's own versioning is used; the reindex simply performs creates and updates.
POST _reindex
{
  "source": { "index": "twitter" },
  "dest": {
    "index": "new_twitter",
    "version_type": "internal"
  }
}
2. If you want updates to be version-controlled by the source index's versions, set version_type to external. The reindex will then write documents that do not yet exist and update those whose destination version is older.
POST _reindex
{
  "source": { "index": "twitter" },
  "dest": {
    "index": "new_twitter",
    "version_type": "external"
  }
}
If you only want to copy documents that do not already exist in the destination index, set op_type to create. Existing documents will then trigger version conflicts (which abort the operation); set "conflicts": "proceed" to skip them and continue.
POST _reindex
{
  "conflicts": "proceed",
  "source": { "index": "twitter" },
  "dest": {
    "index": "new_twitter",
    "op_type": "create"
  }
}
You can also reindex only part of the source index, selecting the data you need by type or by query:
POST _reindex
{
  "source": {
    "index": "twitter",
    "type": "_doc",
    "query": {
      "term": { "user": "kimchy" }
    }
  },
  "dest": { "index": "new_twitter" }
}
Data can be pulled from multiple sources:
POST _reindex
{
  "source": {
    "index": ["twitter", "blog"],
    "type": ["_doc", "post"]
  },
  "dest": { "index": "all_together" }
}

POST _reindex
{
  "size": 10000,   // you can cap the number of documents
  "source": {
    "index": "twitter",
    "sort": { "date": "desc" }
  },
  "dest": { "index": "new_twitter" }
}

POST _reindex
{
  "source": {
    "index": "twitter",
    "_source": ["user", "_doc"]   // you can choose which source fields to copy
  },
  "dest": { "index": "new_twitter" }
}

POST _reindex
{
  "source": { "index": "twitter" },
  "dest": {
    "index": "new_twitter",
    "version_type": "external"
  },
  "script": {   // a script can transform the documents on the way
    "source": "if (ctx._source.foo == 'bar') {ctx._version++; ctx._source.remove('foo')}",
    "lang": "painless"
  }
}

POST _reindex
{
  "source": {
    "index": "source",
    "query": {
      "match": { "company": "cat" }
    }
  },
  "dest": {
    "index": "dest",
    "routing": "=cat"   // a routing value can be set on the destination
  }
}

POST _reindex
{
  "source": {
    "remote": {   // copy from a remote cluster
      "host": "http://otherhost:9200",
      "username": "user",
      "password": "pass"
    },
    "index": "source",
    "query": {
      "match": { "test": "data" }
    }
  },
  "dest": { "index": "dest" }
}
Checking the execution status via the task API
GET _tasks?detailed=true&actions=*reindex
?refresh
For index, update, and delete operations, add the refresh parameter if you want the change to be searchable immediately after the call returns.
PUT /test/_doc/1?refresh
{"test": "test"}

PUT /test/_doc/2?refresh=true
{"test": "test"}

Allowed values of refresh
No value or true: force an immediate refresh so the change is visible to search right away.
false: equivalent to omitting the parameter; visibility follows the normal periodic refresh.
wait_for: wait for the next refresh before returning; when the number of waiting requests reaches index.max_refresh_listeners (defaults to 1000), a refresh is triggered.
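A sketch of wait_for (the call does not return until the document is searchable):

PUT /test/_doc/3?refresh=wait_for
{"test": "test"}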
Cluster composition
The flow of creating an index
Node failures
Indexing documents
How documents are routed
Which shard should a document be stored on?
Deciding which shard a document is stored on is called document routing. ES computes each document's shard as follows:
shard = hash(routing) % number_of_primary_shards
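For illustration (the hash value here is made up): if the index has 5 primary shards and hash("kimchy") = 2647538102, then shard = 2647538102 % 5 = 2, so the document goes to primary shard 2.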
routing is the value that gets hashed; by default it is the document id. We can supply a different routing value with the routing parameter when indexing a document:
POST twitter/_doc?routing=kimchy
{
  "user": "kimchy",
  "post_date": "2009-11-15T14:12:12",
  "message": "trying out Elasticsearch"
}
The routing parameter (it may carry multiple values on queries) can be used on index, delete, update, and search operations to target specific shards.
// force callers to supply a routing value
PUT my_index2
{
  "mappings": {
    "_doc": {
      "_routing": { "required": true }
    }
  }
}
Food for thought: relational databases have partitioned tables; by picking a partition, an operation touches less data and runs faster. Can we do the same with an ES index?
Yes: by choosing routing values you can make one shard hold one partition's data. For example, to partition data by department, use the department as the routing value, as sketched below.
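A sketch of the idea (index and field names are made up): index each document with its department as the routing value, then pass the same routing at search time so only that shard is queried:

PUT employees/_doc/1?routing=sales
{ "name": "kimchy", "dept": "sales" }

GET employees/_search?routing=sales
{
  "query": {
    "match": { "dept": "sales" }
  }
}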
Search
The steps of a search, say against index s1:
1. node2 parses the query.
2. node2 forwards the query to one copy of each of s1's shards (R1, R2, R0).
3. Each node executes the query and returns its results to node2.
4. node2 merges the results and sends back the response.
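Any ordinary search request goes through this flow; for example (index name from the scenario above):

GET s1/_search
{
  "query": { "match_all": {} }
}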