While creating an instance in a k8s cluster, I found the etcd cluster reporting connection failures, which caused the instance creation to fail, so I dug into the cause.
Problem origin
Here is the etcd cluster health status:
[root@docker01 ~]# cd /opt/kubernetes/ssl/
[root@docker01 ssl]# /opt/kubernetes/bin/etcdctl \
> --ca-file=ca.pem --cert-file=server.pem --key-file=server-key.pem \
> --endpoints="https://10.0.0.99:2379,https://10.0.0.100:2379,https://10.0.0.111:2379" \
> cluster-health
member 1bd4d12de986e887 is healthy: got healthy result from https://10.0.0.99:2379
member 45396926a395958b is healthy: got healthy result from https://10.0.0.100:2379
failed to check the health of member c2c5804bd87e2884 on https://10.0.0.111:2379: Get https://10.0.0.111:2379/health: net/http: TLS handshake timeout
member c2c5804bd87e2884 is unreachable: [https://10.0.0.111:2379] are all unreachable
cluster is healthy
[root@docker01 ssl]#
The output clearly shows that etcd node 03 has a problem.
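Before restarting anything, it can help to confirm from node 03 itself whether the local etcd process has died or is merely unresponsive. A quick check along these lines works (same certificate paths as above; the curl call simply probes etcd's /health endpoint and is only a sketch of the idea):

[root@docker03 ~]# systemctl status etcd -l
[root@docker03 ~]# curl --cacert /opt/kubernetes/ssl/ca.pem \
>     --cert /opt/kubernetes/ssl/server.pem \
>     --key /opt/kubernetes/ssl/server-key.pem \
>     https://10.0.0.111:2379/health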
At this point, log in to node 03 and try to restart the etcd service:
[root@docker03 ~]# systemctl restart etcd
Job for etcd.service failed because the control process exited with error code. See "systemctl status etcd.service" and "journalctl -xe" for details.
[root@docker03 ~]# journalctl -xe
Mar 24 22:24:32 docker03 etcd[1895]: setting maximum number of CPUs to 1, total number of available CPUs is 1
Mar 24 22:24:32 docker03 etcd[1895]: the server is already initialized as member before, starting as etcd member...
Mar 24 22:24:32 docker03 etcd[1895]: peerTLS: cert = /opt/kubernetes/ssl/server.pem, key = /opt/kubernetes/ssl/server-key.pem, ca = , trusted-ca = /opt/kubernetes/ssl
Mar 24 22:24:32 docker03 etcd[1895]: listening for peers on https://10.0.0.111:2380
Mar 24 22:24:32 docker03 etcd[1895]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Mar 24 22:24:32 docker03 etcd[1895]: listening for client requests on 127.0.0.1:2379
Mar 24 22:24:32 docker03 etcd[1895]: listening for client requests on 10.0.0.111:2379
Mar 24 22:24:32 docker03 etcd[1895]: member c2c5804bd87e2884 has already been bootstrapped
Mar 24 22:24:32 docker03 systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILURE
Mar 24 22:24:32 docker03 systemd[1]: Failed to start Etcd Server.
-- Subject: Unit etcd.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit etcd.service has failed.
--
-- The result is failed.
Mar 24 22:24:32 docker03 systemd[1]: Unit etcd.service entered failed state.
Mar 24 22:24:32 docker03 systemd[1]: etcd.service failed.
Mar 24 22:24:33 docker03 systemd[1]: etcd.service holdoff time over, scheduling restart.
Mar 24 22:24:33 docker03 systemd[1]: start request repeated too quickly for etcd.service
Mar 24 22:24:33 docker03 systemd[1]: Failed to start Etcd Server.
-- Subject: Unit etcd.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit etcd.service has failed.
--
-- The result is failed.
Mar 24 22:24:33 docker03 systemd[1]: Unit etcd.service entered failed state.
Mar 24 22:24:33 docker03 systemd[1]: etcd.service failed.
The service did not start successfully, and the key message in the log is: member c2c5804bd87e2884 has already been bootstrapped.
Searching for this message turns up the following explanation:
One of the member was bootstrapped via discovery service. You must remove the previous data-dir to clean up the member information. Or the member will ignore the new configuration and start with the old configuration. That is why you see the mismatch.
With that, the cause is clear: the startup failure comes from a mismatch between the member information recorded in the data-dir (/var/lib/etcd/default.etcd) and the information specified by the options etcd is started with.
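To see what etcd is actually holding on to, you can look inside that data directory; a member that has been bootstrapped before keeps its Raft state (write-ahead log and snapshots) under member/. The listing below is only a sketch, assuming the default etcd v3 on-disk layout:

[root@docker03 ~]# ls -R /var/lib/etcd/default.etcd
/var/lib/etcd/default.etcd:
member

/var/lib/etcd/default.etcd/member:
snap  wal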
Problem resolution
The first option is to fix the error by adjusting the startup parameters. Since the data-dir already records the member information, there is no need to repeat it in the startup options; specifically, change the --initial-cluster-state parameter:
[root@docker03 ~]# cat /usr/lib/systemd/system/etcd.service
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target
[Service]
Type=notify
EnvironmentFile=-/opt/kubernetes/cfg/etcd
ExecStart=/opt/kubernetes/bin/etcd \
--name=${ETCD_NAME} \
--data-dir=${ETCD_DATA_DIR} \
--listen-peer-urls=${ETCD_LISTEN_PEER_URLS} \
--listen-client-urls=${ETCD_LISTEN_CLIENT_URLS},http://127.0.0.1:2379 \
--advertise-client-urls=${ETCD_ADVERTISE_CLIENT_URLS} \
--initial-advertise-peer-urls=${ETCD_INITIAL_ADVERTISE_PEER_URLS} \
--initial-cluster=${ETCD_INITIAL_CLUSTER} \
--initial-cluster-token=${ETCD_INITIAL_CLUSTER} \
--initial-cluster-state=existing \
# changed this parameter from new to existing, and the service starts normally
--cert-file=/opt/kubernetes/ssl/server.pem \
--key-file=/opt/kubernetes/ssl/server-key.pem \
--peer-cert-file=/opt/kubernetes/ssl/server.pem \
--peer-key-file=/opt/kubernetes/ssl/server-key.pem \
--trusted-ca-file=/opt/kubernetes/ssl/ca.pem \
--peer-trusted-ca-file=/opt/kubernetes/ssl/ca.pem
Restart=on-failure
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
Change --initial-cluster-state=new to --initial-cluster-state=existing, restart the service, and it comes up fine.
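After editing the unit file, reload systemd, restart etcd, and re-run the health check from the beginning to confirm that member c2c5804bd87e2884 has rejoined:

[root@docker03 ~]# systemctl daemon-reload
[root@docker03 ~]# systemctl restart etcd
[root@docker03 ~]# systemctl status etcd
[root@docker01 ssl]# /opt/kubernetes/bin/etcdctl \
> --ca-file=ca.pem --cert-file=server.pem --key-file=server-key.pem \
> --endpoints="https://10.0.0.99:2379,https://10.0.0.100:2379,https://10.0.0.111:2379" \
> cluster-health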
The second option is to delete the data-dir on every etcd node (though it can also work without deleting it) and restart the etcd service on each node; each node's data-dir is then rebuilt with fresh data and the failure above no longer occurs.
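A rough per-node sequence for this approach might look like the sketch below. Keep in mind that wiping every data-dir re-bootstraps the cluster from an empty state, so anything stored in etcd should be backed up first; moving the directory aside instead of deleting it keeps a fallback copy:

[root@docker03 ~]# systemctl stop etcd
[root@docker03 ~]# mv /var/lib/etcd/default.etcd /var/lib/etcd/default.etcd.bak-$(date +%F)
[root@docker03 ~]# systemctl start etcd
# repeat on each etcd node, then re-check cluster-health as in the first section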
The third option is to copy the contents of the data-dir from another node, use that as the basis to forcibly bring the member up with --force-new-cluster, and then restore the cluster by adding the other nodes back as new members.
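A sketch of that recovery path is shown below. The member names (etcd03, etcd01) and the exact flag set are assumptions based on the unit file above; in practice you would reuse the same TLS and URL options, add --force-new-cluster only for the first start, and then re-add the remaining nodes and start them with --initial-cluster-state=existing:

# on a healthy node (e.g. docker01): take a copy of its data-dir and ship it to node 03
[root@docker01 ~]# systemctl stop etcd
[root@docker01 ~]# scp -r /var/lib/etcd/default.etcd root@10.0.0.111:/var/lib/etcd/
[root@docker01 ~]# systemctl start etcd

# on node 03: start etcd once with --force-new-cluster to form a single-member cluster
[root@docker03 ~]# /opt/kubernetes/bin/etcd --name=etcd03 \
>     --data-dir=/var/lib/etcd/default.etcd --force-new-cluster ...

# then add the other members back one at a time
[root@docker03 ~]# /opt/kubernetes/bin/etcdctl --ca-file=/opt/kubernetes/ssl/ca.pem \
>     --cert-file=/opt/kubernetes/ssl/server.pem --key-file=/opt/kubernetes/ssl/server-key.pem \
>     --endpoints="https://10.0.0.111:2379" member add etcd01 https://10.0.0.99:2380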
Those are the main ways to resolve this problem for now.