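1. Checklist for troubleshooting a non-green cluster
1.1 What the cluster status means
- Red: at least one primary shard is unassigned.
- Yellow: at least one replica shard is unassigned.
- Green: all primary and replica shards are assigned.
1.2 Troubleshooting in practice
1.2.1 Check the cluster status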
GET _cluster/health
Example response: "status" : "red" means the cluster is red, that is, at least one primary shard is unassigned.
1.2.2 Which index is red or yellow?
GET _cluster/health?level=indices
The following form is quicker and more direct:
GET /_cat/indices?v&health=yellow
GET /_cat/indices?v&health=red
This identifies the affected indices.
1.2.3 Which shard of the index is red or yellow?
GET _cluster/health?level=shards
1.2.4 What caused the cluster to turn red or yellow?
GET _cluster/allocation/explain
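Example of the key fields in the response (excerpt), with annotations:
"current_state" : "unassigned",        <- the shard is unassigned
"unassigned_info" : {
  "reason" : "INDEX_CREATED",          <- reason: the index was just created
  "at" : "2020-01-29T07:32:39.041Z",
  "last_allocation_status" : "no"
},
"explanation" : """node does not match index setting [index.routing.allocation.require] filters [box_type:"hot"]"""
}
Root cause: the index requires nodes whose box_type attribute is "hot" (index.routing.allocation.require), and the node does not match that filter. Once the root cause is known, the fix follows from it: here, either adjust the index-level allocation filter or make a node with the required attribute available.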
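When the explain response is long, it can help to pull out only the deciders that blocked allocation. The following is a minimal sketch, assuming the response has already been parsed into a Python dict; the helper name `blocking_explanations`, the node name `node-1`, and the sample data are made up for illustration.

```python
def blocking_explanations(explain):
    """Collect 'node: explanation' for every decider that answered NO."""
    found = []
    for node in explain.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") == "NO":
                found.append(f'{node.get("node_name")}: {decider.get("explanation")}')
    return found


# Made-up sample shaped like the excerpt above; "node-1" is hypothetical.
sample = {
    "current_state": "unassigned",
    "unassigned_info": {"reason": "INDEX_CREATED"},
    "node_allocation_decisions": [
        {
            "node_name": "node-1",
            "deciders": [
                {
                    "decider": "filter",
                    "decision": "NO",
                    "explanation": "node does not match index setting "
                                   "[index.routing.allocation.require] "
                                   'filters [box_type:"hot"]',
                }
            ],
        }
    ],
}

for line in blocking_explanations(sample):
    print(line)
```

The function only reads the response; it makes no request of its own, so it can be used with whatever HTTP client you already have.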
1.3 Further thought: besides the case above, what other reasons can leave a shard with "current_state" : "unassigned"?
In practice:
GET _cat/shards?h=index,shard,prirep,state,unassigned.reason
Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/7.2/cat-shards.html
Unassigned reasons and their meaning:
(1) INDEX_CREATED: Unassigned as a result of an API creation of an index.
(2) CLUSTER_RECOVERED: Unassigned as a result of a full cluster recovery.
(3) INDEX_REOPENED: Unassigned as a result of opening a closed index.
(4) DANGLING_INDEX_IMPORTED: Unassigned as a result of importing a dangling index.
(5) NEW_INDEX_RESTORED: Unassigned as a result of restoring into a new index.
(6) EXISTING_INDEX_RESTORED: Unassigned as a result of restoring into a closed index.
(7) REPLICA_ADDED: Unassigned as a result of explicit addition of a replica.
(8) ALLOCATION_FAILED: Unassigned as a result of a failed allocation of the shard.
(9) NODE_LEFT: Unassigned as a result of the node hosting it leaving the cluster.
(10) REROUTE_CANCELLED: Unassigned as a result of explicit cancel reroute command.
(11) REINITIALIZED: When a shard moves from started back to initializing, for example, with shadow replicas.
(12) REALLOCATED_REPLICA: A better replica location is identified and causes the existing replica allocation to be cancelled.
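On a cluster with many shards, a count per reason is often easier to read than the raw listing. The sketch below assumes the `_cat/shards` request was sent with `&format=json` added, so each row arrives as a dict keyed by column name; the helper name `unassigned_reasons` and the sample rows are made up.

```python
from collections import Counter


def unassigned_reasons(rows):
    """Count unassigned shards per reason from _cat/shards JSON rows."""
    return Counter(
        row.get("unassigned.reason") or "UNKNOWN"
        for row in rows
        if row.get("state") == "UNASSIGNED"
    )


# Made-up rows in the shape of
# GET _cat/shards?h=index,shard,prirep,state,unassigned.reason&format=json
rows = [
    {"index": "logs-1", "shard": "0", "prirep": "p", "state": "STARTED",
     "unassigned.reason": None},
    {"index": "logs-1", "shard": "0", "prirep": "r", "state": "UNASSIGNED",
     "unassigned.reason": "NODE_LEFT"},
    {"index": "logs-2", "shard": "1", "prirep": "r", "state": "UNASSIGNED",
     "unassigned.reason": "NODE_LEFT"},
    {"index": "logs-3", "shard": "0", "prirep": "p", "state": "UNASSIGNED",
     "unassigned.reason": "INDEX_CREATED"},
]

for reason, count in unassigned_reasons(rows).most_common():
    print(reason, count)
```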
2. Moving shards between nodes
Use case: manually relocating a shard, that is, moving a started shard from one node to another.
POST /_cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "indexname",
        "shard": 1,
        "from_node": "nodename",
        "to_node": "nodename"
      }
    }
  ]
}
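If you script such moves, building the body in code avoids typos in the nested structure. This is a minimal sketch; the function name `move_command` and the node names `node-a` and `node-b` are hypothetical. The reroute API also accepts a `dry_run=true` query parameter, which lets you check the outcome before committing to the move.

```python
import json


def move_command(index, shard, from_node, to_node):
    """Build the body for POST /_cluster/reroute with a single move command."""
    if from_node == to_node:
        raise ValueError("from_node and to_node must differ")
    return {
        "commands": [
            {
                "move": {
                    "index": index,
                    "shard": shard,
                    "from_node": from_node,
                    "to_node": to_node,
                }
            }
        ]
    }


# Hypothetical node names, for illustration only.
body = move_command("indexname", 1, "node-a", "node-b")
print(json.dumps(body, indent=2))
```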
3. Gracefully decommissioning a node
Use case: taking a node out of service while keeping the cluster green.
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._ip": "122.5.3.55"
  }
}
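The exclusion only starts the drain: the node should be shut down once no shards remain on it, and the setting should be reset afterwards so it does not linger. The sketch below covers those two follow-up steps, assuming shard rows come from `GET _cat/shards?format=json` (which includes an `ip` column by default); the helper names and the second IP address are made up.

```python
import json

SETTING = "cluster.routing.allocation.exclude._ip"


def exclude_ip(ip):
    """Body that tells the cluster to move shards off the given IP."""
    return {"transient": {SETTING: ip}}


def clear_exclusion():
    """Body that resets the setting; None is serialized as JSON null."""
    return {"transient": {SETTING: None}}


def node_is_drained(rows, ip):
    """True when no shard row is still located on the given IP."""
    return not any(row.get("ip") == ip for row in rows)


# Made-up rows in the shape of GET _cat/shards?format=json
rows = [
    {"index": "logs-1", "shard": "0", "prirep": "p", "ip": "122.5.3.56"},
    {"index": "logs-1", "shard": "0", "prirep": "r", "ip": "122.5.3.55"},
]

print(json.dumps(exclude_ip("122.5.3.55")))
print(node_is_drained(rows, "122.5.3.55"))
print(json.dumps(clear_exclusion()))
```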
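4. Forcing a flush
Use case: a flush ensures that all data currently stored only in the transaction log is also permanently stored in the Lucene index.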
POST /_flush
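Note: from version 7.6 onward, a plain flush has the same effect as the synced flush used in earlier versions. Synced flush is deprecated as of 7.6 and removed in 8.0.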
POST /_flush/synced
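5. Changing the number of concurrent shard rebalances to balance the cluster
Use case: controls how many concurrent shard rebalances are allowed cluster-wide. The default is 2.
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": 2
  }
}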
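6. Changing the number of shards recovered concurrently per node
Use case: if a node is disconnected from the cluster, all of its shards become unassigned. After a delay, the shards are allocated elsewhere. This setting determines how many shards each node recovers concurrently.
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 6
  }
}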
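7. Adjusting the recovery speed
Use case: to avoid overloading the cluster, Elasticsearch limits the bandwidth given to recovery. You can carefully raise this setting to make recovery faster.
If the value is set too high, ongoing recoveries may consume too much bandwidth and other resources, which can destabilize the cluster.
PUT /_cluster/settings
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "80mb"
  }
}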
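8. Clearing the caches on nodes
Use case: if a node reaches high JVM heap usage, you can call the clear cache API to have Elasticsearch clear its caches.
This degrades performance, but it can save you from an OOM (out of memory) error.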
POST /_cache/clear
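9. Adjusting the circuit breakers
Use case: to keep Elasticsearch from running out of memory, you can adjust the circuit breaker settings. This limits the memory available to searches and rejects any search whose estimated memory use exceeds the configured limit.
Note: this is a very delicate setting that needs careful calibration.
PUT /_cluster/settings
{
  "persistent": {
    "indices.breaker.total.limit": "40%"
  }
}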
10. Cluster migration
Use case: migrating cluster data, index data, and so on.
Option 1: reindex part or all of an index
POST _reindex
{
  "source": {
    "index": "my-index-000001"
  },
  "dest": {
    "index": "my-new-index-000001"
  }
}
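The request above copies every document. To copy only part of an index, the reindex API accepts a `query` inside `source`. The sketch below builds both variants; the function name `reindex_body` and the field `user.id` with value `kimchy` are placeholders, not taken from any particular index.

```python
import json


def reindex_body(source_index, dest_index, query=None):
    """Build a _reindex body; with a query, only matching documents are copied."""
    source = {"index": source_index}
    if query is not None:
        source["query"] = query
    return {"source": source, "dest": {"index": dest_index}}


# Full copy, equivalent to the request above.
full = reindex_body("my-index-000001", "my-new-index-000001")

# Partial copy; the field name and value are hypothetical.
partial = reindex_body(
    "my-index-000001",
    "my-new-index-000001",
    query={"term": {"user.id": "kimchy"}},
)

print(json.dumps(partial, indent=2))
```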
Option 2: migrate indices or a whole cluster with third-party tools
- elasticdump
- elasticsearch-migration
Under the hood, these tools are built on scroll + bulk.
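To make the scroll + bulk pattern concrete, here is a minimal sketch of the step in the middle: turning one page of scroll hits into a `_bulk` request body. It is not how either tool is actually implemented; the function name `hits_to_bulk` and the sample hits are made up, and fetching pages and sending the payload are left to your HTTP client.

```python
import json


def hits_to_bulk(hits, dest_index):
    """Convert one page of search hits into an NDJSON body for POST /_bulk."""
    lines = []
    for hit in hits:
        # Action line: index the document under its original _id.
        lines.append(json.dumps({"index": {"_index": dest_index, "_id": hit["_id"]}}))
        # Source line: the document itself.
        lines.append(json.dumps(hit["_source"]))
    if not lines:
        return ""
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Made-up page in the shape of response["hits"]["hits"] from a scroll request.
page = [
    {"_id": "1", "_source": {"title": "a"}},
    {"_id": "2", "_source": {"title": "b"}},
]

payload = hits_to_bulk(page, "my-new-index-000001")
print(payload, end="")
```

A full migration would repeat this for every page until a scroll request returns no hits.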
11. Cluster data backup and restore
Use case: high-availability scenarios that need regular incremental and full backups, kept ready for emergencies.
PUT /_snapshot/my_backup/snapshot_hamlet_index?wait_for_completion=true
{
  "indices": "hamlet_*",
  "ignore_unavailable": true,
  "include_global_state": false,
  "metadata": {
    "taken_by": "mingyi",
    "taken_because": "backup before upgrading"
  }
}
POST /_snapshot/my_backup/snapshot_hamlet_index/_restore