1. Handling cluster health problems
When the cluster is in yellow or red status, the overall handling steps are as follows:
(1) First, check the cluster status:
curl -XGET 'localhost:9200/_cluster/health?pretty'
{
"cluster_name": "elasticsearch",
"status": "yellow",
"timed_out": false,
"number_of_nodes": 1,
"number_of_data_nodes": 1,
"active_primary_shards": 278,
"active_shards": 278,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 278,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks": 0,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 50
}
The main metric to watch is unassigned_shards: shards that exist in the cluster state but cannot be found in the cluster itself. The usual source of unassigned shards is unassigned replicas. For example, an index with 5 shards and 1 replica will have 5 unassigned replica shards on a single-node cluster. If your cluster is red, it will also hold unassigned shards long-term, because primary shards are missing. The other metrics:
(1) initializing_shards is the number of freshly created shards. For example, when you create your first index, its shards briefly sit in the initializing state. This is normally a transient event, and shards should not stay in initializing for long. You may also see initializing shards right after a node restarts: shards start out in the initializing state while they are loaded from disk.
(2) number_of_nodes and number_of_data_nodes are entirely self-descriptive.
(3) active_primary_shards is the number of primary shards in your cluster, summed across all indices.
(4) active_shards is the total across all shards of all indices, including replica shards.
(5) relocating_shards
显示当前正在从一个节点迁往其余节点的分片的数量。一般来讲应该是 0,不过在 Elasticsearch 发现集群不太均衡时,该值会上涨。好比说:添加了一个新节点,或者下线了一个节点。索引
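The key metrics above can be pulled out of the health JSON without extra tooling. A minimal sketch, assuming python3 is available; the sample values are copied from the single-node output shown earlier:

```shell
# Saved sample of a _cluster/health response (values from the output above)
health='{"status":"yellow","active_shards":278,"unassigned_shards":278,"active_shards_percent_as_number":50}'

# Extract the status and the unassigned-shard count; python3 stands in for jq here
printf '%s' "$health" | python3 -c '
import json, sys
h = json.load(sys.stdin)
print(h["status"], h["unassigned_shards"])
'
```

On a live cluster the same pipeline works with `curl -s 'localhost:9200/_cluster/health'` in place of the saved sample.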
(2) Find the problem index
curl -XGET 'localhost:9200/_cluster/health?level=indices'
{
  "cluster_name": "elasticsearch",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 278,
  "active_shards": 278,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 278,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 50,
  "indices": {
    "gaczrk": {
      "status": "yellow",
      "number_of_shards": 5,
      "number_of_replicas": 1,
      "active_primary_shards": 5,
      "active_shards": 5,
      "relocating_shards": 0,
      "initializing_shards": 0,
      "unassigned_shards": 5
    },
    "special-sms-extractor_zhuanche_20200204": {
      "status": "yellow",
      "number_of_shards": 5,
      "number_of_replicas": 1,
      "active_primary_shards": 5,
      "active_shards": 5,
      "relocating_shards": 0,
      "initializing_shards": 0,
      "unassigned_shards": 5
    },
    "specialhtl201905": {
      "status": "yellow",
      "number_of_shards": 1,
      "number_of_replicas": 1,
      "active_primary_shards": 1,
      "active_shards": 1,
      "relocating_shards": 0,
      "initializing_shards": 0,
      "unassigned_shards": 1
    },
"v2": { "status": "red",
"number_of_shards": 10,
"number_of_replicas": 1,
"active_primary_shards": 0,
"active_shards": 0,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 20
},
"sms20181009": {
"status": "yellow",
"number_of_shards": 5,
"number_of_replicas": 1,
"active_primary_shards": 5,
"active_shards": 5,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 5
},
......
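On a cluster with many indices, scanning the per-index map by eye gets tedious. A sketch that filters out only the red indices from a saved level=indices response; the trimmed sample mirrors the output above, and python3 is assumed to be available:

```shell
# Trimmed sample of _cluster/health?level=indices (statuses from the output above)
resp='{"status":"yellow","indices":{"gaczrk":{"status":"yellow"},"v2":{"status":"red"},"sms20181009":{"status":"yellow"}}}'

# Print only the indices whose status is red
printf '%s' "$resp" | python3 -c '
import json, sys
for name, idx in json.load(sys.stdin)["indices"].items():
    if idx["status"] == "red":
        print(name, idx["status"])
'
```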
This parameter makes the cluster-health API add a list of indices to the cluster information, along with details about each index (status, shard count, unassigned shard count, and so on). Once we ask for index-level output, which index has the problem is immediately clear: the v2 index. We can also see that this index once had 10 primary shards and one replica, and that all 20 of those shards are now gone. We can infer that those 20 shards sat on the two nodes that have disappeared from our cluster. Normally, Elasticsearch reassigns shards on its own, so first check whether this feature is enabled:
curl -XGET 'localhost:9200/_cluster/settings?pretty'
{
  "persistent": {},
  "transient": {
    "cluster": {
      "routing": {
        "allocation": {
          "enable": "all"
        }
      }
    }
  }
}
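If the enable value is not all, allocation may have been switched off, for example during maintenance. A hedged sketch of turning it back on as a transient setting; the Content-Type header is needed on Elasticsearch 6 and later, and the curl line is left commented out because it needs a live cluster:

```shell
# Transient settings body that re-enables allocation for all shard types
body='{"transient":{"cluster.routing.allocation.enable":"all"}}'

# Sanity-check the JSON before sending it
printf '%s' "$body" | python3 -m json.tool

# curl -XPUT 'localhost:9200/_cluster/settings' \
#      -H 'Content-Type: application/json' -d "$body"
```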
The level parameter also accepts other options:
curl -XGET 'localhost:9200/_cluster/health?level=shards'
{
  "cluster_name": "elasticsearch",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 278,
  "active_shards": 278,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 278,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 50,
  "indices": {
    "gaczrk": {
      "status": "yellow",
      "number_of_shards": 5,
      "number_of_replicas": 1,
      "active_primary_shards": 5,
      "active_shards": 5,
      "relocating_shards": 0,
      "initializing_shards": 0,
      "unassigned_shards": 5,
      "shards": {
        "0": { "status": "yellow", "primary_active": true, "active_shards": 1, "relocating_shards": 0, "initializing_shards": 0, "unassigned_shards": 1 },
        "1": { "status": "yellow", "primary_active": true, "active_shards": 1, "relocating_shards": 0, "initializing_shards": 0, "unassigned_shards": 1 },
        "2": { "status": "yellow", "primary_active": true, "active_shards": 1, "relocating_shards": 0, "initializing_shards": 0, "unassigned_shards": 1 },
        "3": { "status": "yellow", "primary_active": true, "active_shards": 1, "relocating_shards": 0, "initializing_shards": 0, "unassigned_shards": 1 },
        "4": { "status": "yellow", "primary_active": true, "active_shards": 1, "relocating_shards": 0, "initializing_shards": 0, "unassigned_shards": 1 }
      }
    },
......
The shards option produces far more detailed output, listing the status and location of every shard of every index. This output is sometimes useful, but its sheer verbosity makes it hard to work with.
(3) Manually assign unassigned shards
Query which shards are unassigned, and why:
curl -XGET 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason'
index  shard prirep state      unassigned.reason
gaczrk 4     p      STARTED
gaczrk 4     r      UNASSIGNED CLUSTER_RECOVERED
gaczrk 2     p      STARTED
gaczrk 2     r      UNASSIGNED CLUSTER_RECOVERED
gaczrk 1     p      STARTED
The unassigned reasons explained:
INDEX_CREATED: unassigned as a result of the index-creation API.
CLUSTER_RECOVERED: unassigned as a result of a full cluster recovery.
INDEX_REOPENED: unassigned as a result of opening or closing an index.
DANGLING_INDEX_IMPORTED: unassigned as a result of importing a dangling index.
NEW_INDEX_RESTORED: unassigned as a result of restoring into a new index.
EXISTING_INDEX_RESTORED: unassigned as a result of restoring into a closed index.
REPLICA_ADDED: unassigned because a replica was explicitly added.
ALLOCATION_FAILED: unassigned because shard allocation failed.
NODE_LEFT: unassigned because the node hosting the shard left the cluster.
REINITIALIZED: unassigned when a shard moves from started back to initializing (for example, with shadow replicas).
REROUTE_CANCELLED: allocation was cancelled as a result of an explicit cancel-reroute command.
REALLOCATED_REPLICA: a better replica location was identified and marked for use, so the existing replica allocation was cancelled, leaving the shard unassigned.
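When many shards are unassigned, it helps to summarize the _cat/shards output by reason before acting. A small awk sketch over a saved sample; the sample lines mirror the output above:

```shell
# Saved sample of _cat/shards?h=index,shard,prirep,state,unassigned.reason
shards='gaczrk 4 p STARTED
gaczrk 4 r UNASSIGNED CLUSTER_RECOVERED
gaczrk 2 p STARTED
gaczrk 2 r UNASSIGNED CLUSTER_RECOVERED
gaczrk 1 p STARTED'

# Count unassigned shards grouped by reason (field 4 is state, field 5 the reason)
printf '%s\n' "$shards" | awk '$4 == "UNASSIGNED" { n[$5]++ } END { for (r in n) print r, n[r] }'
```

Against a live cluster, replace the saved sample with `curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason'`.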
Then run the command to assign the shard manually:
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands": [{
    "allocate": {
      "index": "gaczrk",
      "shard": 4,
      "node": "node-id",
      "allow_primary": true
    }
  }]
}'
(Replace gaczrk with the index name, 4 with the shard number, and node-id with the id of another node.)
If there are many unassigned shards, the following script can distribute them automatically:
#!/bin/bash
array=( node1 node2 node3 )
node_counter=0
length=${#array[@]}
IFS=$'\n'
for line in $(curl -s 'http://127.0.0.1:9200/_cat/shards' | fgrep UNASSIGNED); do
    INDEX=$(echo $line | awk '{print $1}')
    SHARD=$(echo $line | awk '{print $2}')
    NODE=${array[$node_counter]}
    echo $NODE
    curl -XPOST 'http://127.0.0.1:9200/_cluster/reroute' -d '{
        "commands": [{
            "allocate": {
                "index": "'$INDEX'",
                "shard": '$SHARD',
                "node": "'$NODE'",
                "allow_primary": true
            }
        }]
    }'
    # round-robin over the node array, wrapping back to index 0
    node_counter=$(( (node_counter + 1) % length ))
done
(4) Quickly assign shards
If, in the command output above, all primary shards are fine and only replica shards have problems, there is a quick recovery method: force-delete the replica shards and let Elasticsearch regenerate them on its own. First set the problem index's replica count to 0:
curl -XPUT 'localhost:9200/<problem-index>/_settings?pretty' -d '{
  "index": {
    "number_of_replicas": 0
  }
}'
Then watch the cluster status, and finally restore the index's replica count with:
curl -XPUT 'localhost:9200/<problem-index>/_settings?pretty' -d '{
  "index": {
    "number_of_replicas": 1
  }
}'
After the nodes finish assigning the shards automatically, the cluster recovers to green.
(5) A cluster shard stuck in the INITIALIZING state
curl -XGET 'localhost:9200/_cat/shards/7a_cool?v&pretty'
7a_cool 5  r STARTED      4583018 759.4mb 10.2.4.21 pt01-pte-10-2-4-21
7a_cool 17 r INITIALIZING                 10.2.4.22 pt01-pte-10-2-4-22   <== abnormal shard
Solution:
1) First stop the Elasticsearch service on the host carrying the abnormal shard:
log in to pt01-pte-10-2-4-22 and run /etc/init.d/elasticsearch stop
If the shard automatically migrates to another host and its state recovers, the cluster is back to normal. If it is still stuck in the initializing state, the problem persists: run the manual shard-assignment command above. If the problem still remains after that, set the problem index's replica count to 0 and let the cluster redistribute its shards by itself; once the adjustment completes, the cluster status turns green.