1. Checklist for troubleshooting a non-green cluster

1.1 What the cluster status colors mean

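Red: at least one primary shard is unassigned.
Yellow: all primary shards are assigned, but at least one replica shard is unassigned.
Green: all primary and replica shards are assigned.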
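The three rules above can be sketched as a small function. This is only an illustration of the definitions, not how Elasticsearch computes health internally; the `(is_primary, is_assigned)` tuple layout and the function name are assumptions made for the example.

```python
def cluster_color(shards):
    """Derive a health color from (is_primary, is_assigned) tuples."""
    if any(is_primary and not assigned for is_primary, assigned in shards):
        return "red"      # at least one primary shard is unassigned
    if any(not assigned for _, assigned in shards):
        return "yellow"   # primaries are fine, but a replica is unassigned
    return "green"        # every primary and replica is assigned


# Hypothetical shard table: two primaries, two replicas, one replica unassigned.
example = [(True, True), (False, False), (True, True), (False, True)]
print(cluster_color(example))
```

Note that red takes precedence: a cluster with both an unassigned primary and an unassigned replica reports red.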

1.2 Troubleshooting in practice

1.2.1 Check the cluster status

GET _cluster/health

Example response: "status" : "red" means the cluster is red, so at least one primary shard is unassigned.

1.2.2 Which index is red or yellow?

GET _cluster/health?level=indices

The following approach is quicker and more direct:

GET /_cat/indices?v&health=yellow
GET /_cat/indices?v&health=red

This identifies the affected indices.

1.2.3 Which shard of the index is red or yellow?

GET _cluster/health?level=shards

1.2.4 What caused the cluster to turn red or yellow?

GET _cluster/allocation/explain

An annotated example of the key fields in the response:

"current_state" : "unassigned",  // the shard is unassigned
  "unassigned_info" : {
    "reason" : "INDEX_CREATED",  // reason: unassigned since index creation
    "at" : "2020-01-29T07:32:39.041Z",
    "last_allocation_status" : "no"
  },
  "explanation" : """node does not match index setting [index.routing.allocation.require] filters [box_type:"hot"]"""
        }

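Root cause: the index's allocation filter (it requires box_type: hot) does not match the node's attributes. Once the root cause is known, the corresponding fix follows directly.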
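When the explain response is long, it can help to pull out just the unassigned reason and the deciders that said no. The sketch below assumes the response has already been parsed into a Python dict; the `sample` dict is hypothetical and only modeled on the fragment above, and the field names `node_allocation_decisions`, `deciders`, and `decision` reflect the response shape as I understand it, so check them against your Elasticsearch version.

```python
def summarize_explain(response):
    """Collect the unassigned reason and every decider that answered NO."""
    info = response.get("unassigned_info", {})
    blockers = []
    for node in response.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") == "NO":
                blockers.append((node.get("node_name"), decider.get("explanation")))
    return {
        "state": response.get("current_state"),
        "reason": info.get("reason"),
        "blockers": blockers,
    }


# Hypothetical response, trimmed to the fields used above.
sample = {
    "current_state": "unassigned",
    "unassigned_info": {"reason": "INDEX_CREATED", "last_allocation_status": "no"},
    "node_allocation_decisions": [
        {
            "node_name": "node-1",  # illustrative node name
            "deciders": [
                {
                    "decider": "filter",
                    "decision": "NO",
                    "explanation": 'node does not match index setting '
                                   '[index.routing.allocation.require] filters [box_type:"hot"]',
                }
            ],
        }
    ],
}

summary = summarize_explain(sample)
print(summary["state"], summary["reason"])
for node_name, explanation in summary["blockers"]:
    print(node_name, "->", explanation)
```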

1.3 Going further: what other reasons can leave a shard in "current_state" : "unassigned"?

In practice:

GET _cat/shards?h=index,shard,prirep,state,unassigned.reason

Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/7.2/cat-shards.html

Unassigned reasons and their meanings:

(1) INDEX_CREATED
Unassigned as a result of an API creation of an index.
(2) CLUSTER_RECOVERED
Unassigned as a result of a full cluster recovery.
(3) INDEX_REOPENED
Unassigned as a result of opening a closed index.
(4) DANGLING_INDEX_IMPORTED
Unassigned as a result of importing a dangling index.
(5) NEW_INDEX_RESTORED
Unassigned as a result of restoring into a new index.
(6) EXISTING_INDEX_RESTORED
Unassigned as a result of restoring into a closed index.
(7) REPLICA_ADDED
Unassigned as a result of explicit addition of a replica.
(8) ALLOCATION_FAILED
Unassigned as a result of a failed allocation of the shard.
(9) NODE_LEFT
Unassigned as a result of the node hosting it leaving the cluster.
(10) REROUTE_CANCELLED
Unassigned as a result of explicit cancel reroute command.
(11) REINITIALIZED
When a shard moves from started back to initializing, for example, with shadow replicas.
(12) REALLOCATED_REPLICA
A better replica location is identified and causes the existing replica allocation to be cancelled.
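On a cluster with many shards, it is often easier to count unassigned shards per reason than to read the raw listing. The sketch below parses the plain-text output of the `_cat/shards?h=index,shard,prirep,state,unassigned.reason` request shown earlier. The sample text and index names are made up, and the parser assumes whitespace-separated columns with no header row (that is, the request was sent without `v`).

```python
from collections import Counter


def unassigned_by_reason(cat_output):
    """Count UNASSIGNED shards per unassigned.reason value."""
    counts = Counter()
    for line in cat_output.strip().splitlines():
        fields = line.split()
        # Columns: index shard prirep state [unassigned.reason]
        # Started shards have an empty reason, so they yield only 4 fields.
        if len(fields) >= 5 and fields[3] == "UNASSIGNED":
            counts[fields[4]] += 1
    return dict(counts)


# Hypothetical _cat/shards output.
sample_output = """
logs-1 0 p STARTED
logs-1 0 r UNASSIGNED NODE_LEFT
logs-2 0 p UNASSIGNED INDEX_CREATED
logs-2 0 r UNASSIGNED INDEX_CREATED
"""

print(unassigned_by_reason(sample_output))
```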

2. Moving shards between nodes

Use case: manually relocating a shard, that is, moving a started shard from one node to another.

POST /_cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "indexname",
        "shard": 1,
        "from_node": "source_nodename",
        "to_node": "target_nodename"
      }
    }
  ]
}
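If you issue move commands from a script, a small builder keeps the request body consistent and catches an obvious mistake before it reaches the cluster. This is a minimal sketch: the function name and the node names are illustrative, and the guard against identical source and target is an addition of this example, not something the article prescribes.

```python
import json


def move_command(index, shard, from_node, to_node):
    """Build the body for POST /_cluster/reroute with a single move command."""
    if from_node == to_node:
        raise ValueError("from_node and to_node must be different nodes")
    return {
        "commands": [
            {
                "move": {
                    "index": index,
                    "shard": shard,
                    "from_node": from_node,
                    "to_node": to_node,
                }
            }
        ]
    }


# Illustrative index and node names.
body = move_command("indexname", 1, "node-a", "node-b")
print(json.dumps(body, indent=2))
```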

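3. Gracefully decommissioning a cluster node

Use case: taking a node out of service while keeping the cluster green. The setting below tells Elasticsearch to move shards off the node with the given IP; once no shards remain on it, the node can be shut down safely.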

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._ip": "122.5.3.55"
  }
}
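Before shutting the node down, you would typically confirm that it no longer holds any shards. The sketch below builds the exclusion body and counts the shards still on a given IP from `_cat/shards?h=index,shard,prirep,state,ip` text. The helper names and sample output are hypothetical. Passing `None` produces a `null` value, which is how a cluster setting is reset, so the same helper can clear the exclusion afterwards.

```python
import json


def exclude_ip_settings(ip):
    """Body for PUT /_cluster/settings; ip=None serializes to null and clears it."""
    return {"transient": {"cluster.routing.allocation.exclude._ip": ip}}


def shards_left_on(ip, cat_shards_output):
    """Count shards whose last column (ip) equals the node being drained."""
    count = 0
    for line in cat_shards_output.strip().splitlines():
        fields = line.split()
        if fields and fields[-1] == ip:
            count += 1
    return count


# Hypothetical _cat/shards?h=index,shard,prirep,state,ip output mid-drain.
sample_output = """
logs-1 0 p STARTED 122.5.3.55
logs-1 0 r STARTED 122.5.3.56
logs-2 0 p RELOCATING 122.5.3.55
"""

print(json.dumps(exclude_ip_settings("122.5.3.55")))
print(shards_left_on("122.5.3.55", sample_output))  # 2 shards still to move
```

When the count reaches zero the node is drained; sending `exclude_ip_settings(None)` afterwards removes the filter.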

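4. Forcing a flush

Use case: flushing an index ensures that any data currently stored only in the transaction log is also permanently stored in the Lucene index.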

POST /_flush

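Note: from version 7.6 on, a plain flush has the same effect as the older synced flush, which is deprecated and slated for removal in version 8.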

POST /_flush/synced

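5. Changing the number of concurrent shard rebalances

Use case:

Controls how many shard rebalances may run concurrently across the cluster. The default is 2.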

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": 2
  }
}

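6. Changing the number of shards recovered concurrently per node

Use case:

If a node disconnects from the cluster, all of its shards become unassigned. After a certain delay, those shards are allocated elsewhere. This setting determines how many shards each node recovers concurrently.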

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 6
  }
}

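7. Adjusting recovery speed

Use case:

To avoid overloading the cluster, Elasticsearch throttles the bandwidth given to recovery. You can raise this setting carefully to make recovery faster.

If the value is set too high, ongoing recoveries may consume too much bandwidth and other resources, which can destabilize the cluster.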

PUT /_cluster/settings
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "80mb"
  }
}
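To get a feel for what a given limit means, a rough lower bound on recovery time is simply the data size divided by the rate. The sketch below is back-of-the-envelope arithmetic only: it assumes the throttle is the sole bottleneck, treats `mb` as 1024 * 1024 bytes, and uses a made-up 100 GiB of shard data. The helper names are hypothetical.

```python
UNITS = {"kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


def parse_size(text):
    """Convert a size string such as '80mb' into bytes."""
    text = text.strip().lower()
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * factor
    raise ValueError(f"unsupported size: {text!r}")


def min_recovery_seconds(data_bytes, max_bytes_per_sec):
    """Lower bound on recovery time if the throttle is the only limit."""
    return data_bytes / parse_size(max_bytes_per_sec)


data = 100 * 1024 ** 3  # hypothetical: 100 GiB to recover onto one node
print(min_recovery_seconds(data, "80mb"))  # 1280.0 seconds, about 21 minutes
print(min_recovery_seconds(data, "40mb"))  # 2560.0 seconds, about 43 minutes
```

Real recoveries are usually slower than this bound, since disk, network, and concurrent recoveries also compete for resources.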

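8. Clearing caches on a node

Use case: if a node reaches high JVM heap usage, you can call this API to have Elasticsearch clear its caches.

This degrades performance, but it can save you from an OOM (out of memory) condition.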

POST /_cache/clear

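9. Adjusting circuit breakers

Use case: to keep Elasticsearch from running into OOM, you can adjust the circuit breaker settings. This limits the memory available to searches and rejects any search estimated to consume more memory than the configured limit.

Note: this is a very delicate setting that needs careful calibration.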

PUT /_cluster/settings
{
  "persistent": {
    "indices.breaker.total.limit": "40%"
  }
}
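The percentage is relative to the JVM heap, so it helps to translate it into bytes when calibrating. The sketch below is a simplified model of the idea, not Elasticsearch's actual breaker logic: it assumes a request is rejected when memory already in use plus the request's estimate would exceed the limit. All function names and figures are illustrative.

```python
def breaker_limit_bytes(heap_bytes, limit):
    """Turn a percentage such as '40%' into an absolute byte limit."""
    if not limit.endswith("%"):
        raise ValueError("this sketch only handles percentage limits")
    return int(heap_bytes * float(limit[:-1]) / 100)


def would_trip(used_bytes, estimated_bytes, heap_bytes, limit):
    """Simplified check: reject if current use plus the estimate exceeds the limit."""
    return used_bytes + estimated_bytes > breaker_limit_bytes(heap_bytes, limit)


heap = 10 * 1024 ** 3          # hypothetical 10 GiB heap
limit = breaker_limit_bytes(heap, "50%")
print(limit)                   # 5368709120 bytes, i.e. 5 GiB

# 4 GiB in use plus a 2 GiB estimate exceeds 5 GiB, so this request is rejected.
print(would_trip(4 * 1024 ** 3, 2 * 1024 ** 3, heap, "50%"))
```

A lower percentage rejects more requests but leaves more headroom before the heap is exhausted.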

10. Cluster migration

Use case: migrating cluster data, index data, and so on.

Option 1: reindex, for part or all of an index's data

POST _reindex
{
  "source": {
    "index": "my-index-000001"
  },
  "dest": {
    "index": "my-new-index-000001"
  }
}

Option 2: migrate the index or cluster with third-party tools

elasticdump
elasticsearch-migration

Under the hood, these tools are built on scroll + bulk.
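The bulk half of that pattern comes down to turning each hit read by scroll into a pair of newline-delimited JSON lines. The sketch below shows only that conversion, with no network calls; it is not taken from either tool. The hits are made up and the function name is hypothetical, while the `_id` and `_source` keys follow the shape of search hits.

```python
import json


def bulk_body(dest_index, hits):
    """Convert scroll hits into an NDJSON body for POST /_bulk."""
    lines = []
    for hit in hits:
        # Action line: index the document under its original _id.
        lines.append(json.dumps({"index": {"_index": dest_index, "_id": hit["_id"]}}))
        # Source line: the document itself.
        lines.append(json.dumps(hit["_source"]))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical page of hits, as one scroll request might return them.
page = [
    {"_id": "1", "_source": {"title": "Hamlet", "act": 1}},
    {"_id": "2", "_source": {"title": "Hamlet", "act": 2}},
]

print(bulk_body("my-new-index-000001", page), end="")
```

A migration loop would repeat this for each scroll page until the scroll returns no more hits.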

11. Cluster data backup and restore

Use case: high-availability workloads that need regular incremental and full backups in case of emergency.

PUT /_snapshot/my_backup/snapshot_hamlet_index?wait_for_completion=true
{
  "indices": "hamlet_*",
  "ignore_unavailable": true,
  "include_global_state": false,
  "metadata": {
    "taken_by": "mingyi",
    "taken_because": "backup before upgrading"
  }
}

POST /_snapshot/my_backup/snapshot_hamlet_index/_restore
