1. Handling cluster health problems
When the cluster is in yellow or red status, the overall handling steps are as follows:
(1) First, check the cluster status:
curl -XGET 'localhost:9200/_cluster/health?pretty'
{
"cluster_name": "elasticsearch",
"status": "yellow",
"timed_out": false,
"number_of_nodes": 1,
"number_of_data_nodes": 1,
"active_primary_shards": 278,
"active_shards": 278,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 278,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks": 0,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 50
}
The main metric to watch is unassigned_shards: shards that exist in the cluster state but cannot be found in the cluster itself. The usual source of unassigned shards is unassigned replicas. For example, an index with 5 shards and 1 replica will have 5 unassigned replica shards on a single-node cluster. If your cluster is red, it will also hold unassigned shards long-term, because primary shards are missing. The other metrics:
(1) initializing_shards is the number of freshly created shards. For example, when you create your first index, its shards briefly sit in the initializing state. This is normally a transient event, and shards should not stay in initializing for long. You may also see initializing shards right after a node restarts: shards start out in the initializing state while they are loaded from disk.
(2) number_of_nodes and number_of_data_nodes are entirely self-descriptive.
(3) active_primary_shards is the number of primary shards in your cluster, summed across all indices.
(4) active_shards is the total across all shards of all indices, including replica shards.
(5) relocating_shards
显示当前正在从一个节点迁往其余节点的分片的数量。一般来讲应该是 0,不过在 Elasticsearch 发现集群不太均衡时,该值会上涨。好比说:添加了一个新节点,或者下线了一个节点。索引
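The key metrics above can be pulled out of the health JSON without extra tooling. A minimal sketch, assuming python3 is available; the sample values are copied from the single-node output shown earlier:

```shell
# Saved sample of a _cluster/health response (values from the output above)
health='{"status":"yellow","active_shards":278,"unassigned_shards":278,"active_shards_percent_as_number":50}'

# Extract the status and the unassigned-shard count; python3 stands in for jq here
printf '%s' "$health" | python3 -c '
import json, sys
h = json.load(sys.stdin)
print(h["status"], h["unassigned_shards"])
'
```

On a live cluster the same pipeline works with `curl -s 'localhost:9200/_cluster/health'` in place of the saved sample.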
(2) Find the problem index
curl -XGET 'localhost:9200/_cluster/health?level=indices'
{
  "cluster_name": "elasticsearch",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 278,
  "active_shards": 278,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 278,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 50,
  "indices": {
    "gaczrk": {
      "status": "yellow",
      "number_of_shards": 5,
      "number_of_replicas": 1,
      "active_primary_shards": 5,
      "active_shards": 5,
      "relocating_shards": 0,
      "initializing_shards": 0,
      "unassigned_shards": 5
    },
    "special-sms-extractor_zhuanche_20200204": {
      "status": "yellow",
      "number_of_shards": 5,
      "number_of_replicas": 1,
      "active_primary_shards": 5,
      "active_shards": 5,
      "relocating_shards": 0,
      "initializing_shards": 0,
      "unassigned_shards": 5
    },
    "specialhtl201905": {
      "status": "yellow",
      "number_of_shards": 1,
      "number_of_replicas": 1,
      "active_primary_shards": 1,
      "active_shards": 1,
      "relocating_shards": 0,
      "initializing_shards": 0,
      "unassigned_shards": 1
    },
"v2": { "status": "red",
"number_of_shards": 10,
"number_of_replicas": 1,
"active_primary_shards": 0,
"active_shards": 0,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 20
},
"sms20181009": {
"status": "yellow",
"number_of_shards": 5,
"number_of_replicas": 1,
"active_primary_shards": 5,
"active_shards": 5,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 5
},
......
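On a cluster with many indices, scanning the per-index map by eye gets tedious. A sketch that filters out only the red indices from a saved level=indices response; the trimmed sample mirrors the output above, and python3 is assumed to be available:

```shell
# Trimmed sample of _cluster/health?level=indices (statuses from the output above)
resp='{"status":"yellow","indices":{"gaczrk":{"status":"yellow"},"v2":{"status":"red"},"sms20181009":{"status":"yellow"}}}'

# Print only the indices whose status is red
printf '%s' "$resp" | python3 -c '
import json, sys
for name, idx in json.load(sys.stdin)["indices"].items():
    if idx["status"] == "red":
        print(name, idx["status"])
'
```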
This parameter makes the cluster-health API add a list of indices to the cluster information, along with details about each index (status, shard count, unassigned shard count, and so on). Once we ask for index-level output, which index has the problem is immediately clear: the v2 index. We can also see that this index once had 10 primary shards and one replica, and that all 20 of those shards are now gone. We can infer that those 20 shards sat on the two nodes that have disappeared from our cluster. Normally, Elasticsearch reassigns shards on its own, so first check whether this feature is enabled:
curl -XGET 'localhost:9200/_cluster/settings?pretty'
{
  "persistent": {},
  "transient": {
    "cluster": {
      "routing": {
        "allocation": {
          "enable": "all"
        }
      }
    }
  }
}
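If the enable value is not all, allocation may have been switched off, for example during maintenance. A hedged sketch of turning it back on as a transient setting; the Content-Type header is needed on Elasticsearch 6 and later, and the curl line is left commented out because it needs a live cluster:

```shell
# Transient settings body that re-enables allocation for all shard types
body='{"transient":{"cluster.routing.allocation.enable":"all"}}'

# Sanity-check the JSON before sending it
printf '%s' "$body" | python3 -m json.tool

# curl -XPUT 'localhost:9200/_cluster/settings' \
#      -H 'Content-Type: application/json' -d "$body"
```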
The level parameter also accepts other options:
curl -XGET 'localhost:9200/_cluster/health?level=shards'
{
  "cluster_name": "elasticsearch",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 278,
  "active_shards": 278,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 278,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 50,
  "indices": {
    "gaczrk": {
      "status": "yellow",
      "number_of_shards": 5,
      "number_of_replicas": 1,
      "active_primary_shards": 5,
      "active_shards": 5,
      "relocating_shards": 0,
      "initializing_shards": 0,
      "unassigned_shards": 5,
      "shards": {
        "0": { "status": "yellow", "primary_active": true, "active_shards": 1, "relocating_shards": 0, "initializing_shards": 0, "unassigned_shards": 1 },
        "1": { "status": "yellow", "primary_active": true, "active_shards": 1, "relocating_shards": 0, "initializing_shards": 0, "unassigned_shards": 1 },
        "2": { "status": "yellow", "primary_active": true, "active_shards": 1, "relocating_shards": 0, "initializing_shards": 0, "unassigned_shards": 1 },
        "3": { "status": "yellow", "primary_active": true, "active_shards": 1, "relocating_shards": 0, "initializing_shards": 0, "unassigned_shards": 1 },
        "4": { "status": "yellow", "primary_active": true, "active_shards": 1, "relocating_shards": 0, "initializing_shards": 0, "unassigned_shards": 1 }
      }
    },
......
The shards option produces far more detailed output, listing the status and location of every shard of every index. This output is sometimes useful, but its sheer verbosity makes it hard to work with.
(3) Manually assign unassigned shards
Query which shards are unassigned, and why:
curl -XGET 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason'
index  shard prirep state      unassigned.reason
gaczrk 4     p      STARTED
gaczrk 4     r      UNASSIGNED CLUSTER_RECOVERED
gaczrk 2     p      STARTED
gaczrk 2     r      UNASSIGNED CLUSTER_RECOVERED
gaczrk 1     p      STARTED
The unassigned reasons explained:
INDEX_CREATED: unassigned as a result of the index-creation API.
CLUSTER_RECOVERED: unassigned as a result of a full cluster recovery.
INDEX_REOPENED: unassigned as a result of opening or closing an index.
DANGLING_INDEX_IMPORTED: unassigned as a result of importing a dangling index.
NEW_INDEX_RESTORED: unassigned as a result of restoring into a new index.
EXISTING_INDEX_RESTORED: unassigned as a result of restoring into a closed index.
REPLICA_ADDED: unassigned because a replica was explicitly added.
ALLOCATION_FAILED: unassigned because shard allocation failed.
NODE_LEFT: unassigned because the node hosting the shard left the cluster.
REINITIALIZED: unassigned when a shard moves from started back to initializing (for example, with shadow replicas).
REROUTE_CANCELLED: allocation was cancelled as a result of an explicit cancel-reroute command.
REALLOCATED_REPLICA: a better replica location was identified and marked for use, so the existing replica allocation was cancelled, leaving the shard unassigned.
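When many shards are unassigned, it helps to summarize the _cat/shards output by reason before acting. A small awk sketch over a saved sample; the sample lines mirror the output above:

```shell
# Saved sample of _cat/shards?h=index,shard,prirep,state,unassigned.reason
shards='gaczrk 4 p STARTED
gaczrk 4 r UNASSIGNED CLUSTER_RECOVERED
gaczrk 2 p STARTED
gaczrk 2 r UNASSIGNED CLUSTER_RECOVERED
gaczrk 1 p STARTED'

# Count unassigned shards grouped by reason (field 4 is state, field 5 the reason)
printf '%s\n' "$shards" | awk '$4 == "UNASSIGNED" { n[$5]++ } END { for (r in n) print r, n[r] }'
```

Against a live cluster, replace the saved sample with `curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason'`.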
Then run the command to assign the shard manually:
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands": [{
    "allocate": {
      "index": "gaczrk",
      "shard": 4,
      "node": "node-id",
      "allow_primary": true
    }
  }]
}'
(Replace gaczrk with the index name, 4 with the shard number, and node-id with the id of another node.)
If there are many unassigned shards, the following script can distribute them automatically:
#!/bin/bash
array=( node1 node2 node3 )
node_counter=0
length=${#array[@]}
IFS=$'\n'
for line in $(curl -s 'http://127.0.0.1:9200/_cat/shards' | fgrep UNASSIGNED); do
    INDEX=$(echo $line | awk '{print $1}')
    SHARD=$(echo $line | awk '{print $2}')
    NODE=${array[$node_counter]}
    echo $NODE
    curl -XPOST 'http://127.0.0.1:9200/_cluster/reroute' -d '{
        "commands": [{
            "allocate": {
                "index": "'$INDEX'",
                "shard": '$SHARD',
                "node": "'$NODE'",
                "allow_primary": true
            }
        }]
    }'
    # round-robin over the node array, wrapping back to index 0
    node_counter=$(( (node_counter + 1) % length ))
done
(4) Quickly assign shards
If, in the command output above, all primary shards are fine and only replica shards have problems, there is a quick recovery method: force-delete the replica shards and let Elasticsearch regenerate them on its own. First set the problem index's replica count to 0:
curl -XPUT 'localhost:9200/<problem-index>/_settings?pretty' -d '{
  "index": {
    "number_of_replicas": 0
  }
}'
Then watch the cluster status, and finally restore the index's replica count with:
curl -XPUT 'localhost:9200/<problem-index>/_settings?pretty' -d '{
  "index": {
    "number_of_replicas": 1
  }
}'
After the nodes finish assigning the shards automatically, the cluster recovers to green.
(5) A cluster shard stuck in the INITIALIZING state
curl -XGET 'localhost:9200/_cat/shards/7a_cool?v&pretty'
7a_cool 5  r STARTED      4583018 759.4mb 10.2.4.21 pt01-pte-10-2-4-21
7a_cool 17 r INITIALIZING                 10.2.4.22 pt01-pte-10-2-4-22   <== abnormal shard
Solution:
1) First stop the Elasticsearch service on the host carrying the abnormal shard:
log in to pt01-pte-10-2-4-22 and run /etc/init.d/elasticsearch stop
If the shard automatically migrates to another host and its state recovers, the cluster is back to normal. If it is still stuck in the initializing state, the problem persists: run the manual shard-assignment command above. If the problem still remains after that, set the problem index's replica count to 0 and let the cluster redistribute its shards by itself; once the adjustment completes, the cluster status turns green.