kubectl get componentstatus ERROR: HTTP probe failed with statuscode: 503

The kubectl command can show the status of each Kubernetes component:

[root@wecloud-test-k8s-1 ~]# kubectl get cs
NAME                 STATUS    MESSAGE              ERROR
controller-manager   Healthy   ok
scheduler            Healthy   ok
etcd-2               Healthy   {"health": "true"}
etcd-1               Healthy   {"health": "true"}
etcd-0               Healthy   {"health": "true"}

Here is the fix for a problem I ran into: when I checked the status repeatedly, some of the etcd members kept showing up as Unhealthy.

[root@wecloud-test-k8s-1 ~]# kubectl get componentstatuses
NAME                 STATUS      MESSAGE                                  ERROR
controller-manager   Healthy     ok
scheduler            Healthy     ok
etcd-0               Healthy     {"health": "true"}
etcd-2               Healthy     {"health": "true"}
etcd-1               Unhealthy   HTTP probe failed with statuscode: 503
[root@wecloud-test-k8s-1 ~]# kubectl get componentstatuses
NAME                 STATUS      MESSAGE                                  ERROR
scheduler            Healthy     ok
controller-manager   Healthy     ok
etcd-0               Healthy     {"health": "true"}
etcd-2               Unhealthy   HTTP probe failed with statuscode: 503
etcd-1               Unhealthy   HTTP probe failed with statuscode: 503
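Because the failure is intermittent, it can help to capture the output several times and pick out the unhealthy rows automatically. The following is a minimal sketch in Python, not part of the original troubleshooting; the function name `unhealthy_components` is made up for illustration, and it assumes the column layout shown above (name first, status second).

```python
def unhealthy_components(output: str) -> list:
    """Return the names of components whose STATUS column is not 'Healthy'."""
    names = []
    for line in output.strip().splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 2 or fields[0] == "NAME":
            continue
        if fields[1] != "Healthy":
            names.append(fields[0])
    return names


# Sample text copied from the second run shown above.
sample = """\
NAME                 STATUS      MESSAGE                                  ERROR
scheduler            Healthy     ok
controller-manager   Healthy     ok
etcd-0               Healthy     {"health": "true"}
etcd-2               Unhealthy   HTTP probe failed with statuscode: 503
etcd-1               Unhealthy   HTTP probe failed with statuscode: 503
"""

print(unhealthy_components(sample))  # ['etcd-2', 'etcd-1']
```

Feeding it the text of each `kubectl get componentstatuses` run shows which members flap between runs.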

The symptom is that the reported health of etcd is very unstable. The logs show that the heartbeat between the etcd members is failing:

root@zhangchi-ThinkPad-T450s:~# ssh 192.168.99.189
[root@wecloud-test-k8s-2 ~]# systemctl status etcd
● etcd.service - Etcd Server
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since 一 2018-04-09 22:56:31 CST; 1 day 10h ago
     Docs: https://github.com/coreos
 Main PID: 17478 (etcd)
   CGroup: /system.slice/etcd.service
           └─17478 /usr/local/bin/etcd --name infra1 --cert-file=/etc/kubernetes/ssl/kubernetes.pem --key-file=/etc/kubernetes/ssl/kubern...

4月 11 09:33:35 wecloud-test-k8s-2.novalocal etcd[17478]: e23bf6fd185b2dc5 [quorum:2] has received 1 MsgVoteResp votes and 1 vote ...ctions
4月 11 09:33:36 wecloud-test-k8s-2.novalocal etcd[17478]: e23bf6fd185b2dc5 received MsgVoteResp from c9b9711086e865e3 at term 337
4月 11 09:33:36 wecloud-test-k8s-2.novalocal etcd[17478]: e23bf6fd185b2dc5 [quorum:2] has received 2 MsgVoteResp votes and 1 vote ...ctions
4月 11 09:33:36 wecloud-test-k8s-2.novalocal etcd[17478]: e23bf6fd185b2dc5 became leader at term 337
4月 11 09:33:36 wecloud-test-k8s-2.novalocal etcd[17478]: raft.node: e23bf6fd185b2dc5 elected leader e23bf6fd185b2dc5 at term 337
4月 11 09:33:41 wecloud-test-k8s-2.novalocal etcd[17478]: timed out waiting for read index response
4月 11 09:33:46 wecloud-test-k8s-2.novalocal etcd[17478]: failed to send out heartbeat on time (exceeded the 100ms timeout for 401...516ms)
4月 11 09:33:46 wecloud-test-k8s-2.novalocal etcd[17478]: server is likely overloaded
4月 11 09:33:46 wecloud-test-k8s-2.novalocal etcd[17478]: failed to send out heartbeat on time (exceeded the 100ms timeout for 401.80886ms)
4月 11 09:33:46 wecloud-test-k8s-2.novalocal etcd[17478]: server is likely overloaded
Hint: Some lines were ellipsized, use -l to show in full.

The key error message is: failed to send out heartbeat on time (exceeded the 100ms timeout for 401.80886ms)

Heartbeat errors like this usually come down to one of three factors: disk speed, CPU performance, or an unstable network.

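etcd uses the Raft algorithm, in which the leader periodically sends a heartbeat to every follower. If the leader fails to send heartbeats for two consecutive heartbeat intervals, etcd prints this log message as a warning. In most cases the cause is a slow disk. The leader normally attaches some metadata to the heartbeat, and it has to persist that data to disk before it can send it. The disk write may be competing with other applications, or the disk may simply be slow because it is virtualized or a SATA drive; in that case only better, faster disk hardware solves the problem. The wal_fsync_duration_seconds metric that etcd exposes to Prometheus shows how long fsync of the WAL takes, and it should normally stay below 10ms (the etcd documentation gives this threshold for the p99 duration).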
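As a rough way to check the disk explanation, the mean WAL fsync time can be derived from the histogram's `_sum` and `_count` series in the Prometheus text format. The sketch below is illustrative only: it assumes the full metric name is `etcd_disk_wal_fsync_duration_seconds`, the helper name `mean_wal_fsync_seconds` is hypothetical, and the sample numbers are invented rather than taken from this cluster.

```python
from typing import Optional

METRIC = "etcd_disk_wal_fsync_duration_seconds"  # assumed full metric name


def mean_wal_fsync_seconds(metrics_text: str) -> Optional[float]:
    """Return sum/count for the WAL fsync histogram, or None if unavailable."""
    total = None
    count = None
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and the '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        if name == METRIC + "_sum":
            total = float(value)
        elif name == METRIC + "_count":
            count = float(value)
    if total is None or not count:
        return None
    return total / count


# Invented sample in Prometheus text exposition format.
sample_metrics = """\
# TYPE etcd_disk_wal_fsync_duration_seconds histogram
etcd_disk_wal_fsync_duration_seconds_sum 2.5
etcd_disk_wal_fsync_duration_seconds_count 100
"""

mean = mean_wal_fsync_seconds(sample_metrics)
print(f"mean WAL fsync: {mean * 1000:.1f} ms")  # mean WAL fsync: 25.0 ms
```

A mean is a coarser signal than a percentile, so a value well above 10ms points to a slow disk, while a low mean does not rule out occasional slow syncs.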

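The second possible cause is insufficient CPU capacity. If the monitoring system shows that CPU utilization really is high, move etcd to a better machine, then use cgroups to guarantee the etcd process exclusive use of some cores, or raise the priority of the etcd process.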

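The third possible cause is a slow network. If Prometheus shows poor network quality, such as high latency or a high packet loss rate, moving etcd to a less congested network solves the problem. If etcd is deployed across data centers, however, long latency is unavoidable; in that case heartbeat-interval has to be tuned to the RTT between the data centers, and election-timeout should be at least 5 times heartbeat-interval.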
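The relationship between the two parameters can be sketched as a small calculation. This is only an illustration under stated assumptions: the heartbeat is set to roughly the measured RTT with a floor of 100 ms (the timeout visible in the log above), the 5x ratio is the minimum named in the text, and the function name `suggest_raft_timers` is hypothetical.

```python
import math


def suggest_raft_timers(rtt_ms: float, ratio: int = 5, floor_ms: int = 100):
    """Return (heartbeat_interval_ms, election_timeout_ms) for a given RTT.

    Assumptions: heartbeat is about one RTT, never below floor_ms,
    and the election timeout is ratio times the heartbeat.
    """
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    if ratio < 5:
        raise ValueError("election timeout should be at least 5x the heartbeat")
    heartbeat = max(floor_ms, math.ceil(rtt_ms))
    return heartbeat, heartbeat * ratio


print(suggest_raft_timers(250))  # (250, 1250)
print(suggest_raft_timers(20))   # (100, 500)
```

The resulting numbers are starting points to test against the real cluster, not recommended values.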

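This experiment ran on OpenStack cloud instances, where insufficient disk I/O is a known problem, so the fix here is to change the heartbeat-interval value (make it larger).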

On each etcd node, edit the /etc/etcd/etcd.conf file and add the following:

# heartbeat every 6 seconds
ETCD_HEARTBEAT_INTERVAL=6000
ETCD_ELECTION_TIMEOUT=30000

Then restart the etcd service.
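When changing these values, it may be worth confirming that the file still respects the rule that election-timeout is at least 5 times heartbeat-interval. The following sketch parses KEY=VALUE lines in the style of /etc/etcd/etcd.conf; the helper name `check_etcd_timers` is hypothetical, and it assumes both variables are present as plain integers in milliseconds.

```python
def check_etcd_timers(conf_text: str, min_ratio: int = 5) -> bool:
    """Return True if ETCD_ELECTION_TIMEOUT >= min_ratio * ETCD_HEARTBEAT_INTERVAL."""
    values = {}
    for line in conf_text.splitlines():
        line = line.strip()
        # Ignore blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    heartbeat = int(values["ETCD_HEARTBEAT_INTERVAL"])
    election = int(values["ETCD_ELECTION_TIMEOUT"])
    return election >= min_ratio * heartbeat


# The values used in this article.
conf = """\
# heartbeat every 6 seconds
ETCD_HEARTBEAT_INTERVAL=6000
ETCD_ELECTION_TIMEOUT=30000
"""

print(check_etcd_timers(conf))  # True
```

With the values from this article the check passes, since 30000 is exactly 5 times 6000.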