Zabbix raised an alert that the journal SSD on one of our Ceph nodes has used more than 96% of its rated write endurance (PercentageUsed : 97). According to Intel, once the write endurance reaches its specified limit the drive can no longer be written to reliably.
```
[root@ceph-11 ~]# isdct show -sensor
PowerOnHours : 0x021B5
EraseFailCount : 0
EndToEndErrorDetectionCount : 0
ReliabilityDegraded : False
AvailableSpare : 100
AvailableSpareBelowThreshold : False
DeviceStatus : Healthy
SpecifiedPCBMaxOperatingTemp : 85
SpecifiedPCBMinOperatingTemp : 0
UnsafeShutdowns : 0x08
CrcErrorCount : 0
AverageNandEraseCycles : 2917
MediaErrors : 0x00
PowerCycles : 0x0C
ProgramFailCount : 0
MaxNandEraseCycles : 2922
HighestLifetimeTemperature : 57
PercentageUsed : 97
ThermalThrottleStatus : 0
ErrorInfoLogEntries : 0x00
MinNandEraseCycles : 2913
LowestLifetimeTemperature : 23
ReadOnlyMode : False
ThermalThrottleCount : 0
TemperatureThresholdExceeded : False
Temperature - Celsius : 50
```
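For reference, a rough sketch of the kind of wear check that can sit behind such an alert, assuming Intel's isdct tool is installed on the node (the 96% threshold simply mirrors the alert above):

```bash
# parse PercentageUsed from the isdct sensor output and warn near end of life
used=$(isdct show -sensor | awk -F': ' '/PercentageUsed/ {gsub(/ /, "", $2); print $2; exit}')
echo "journal SSD wear: ${used}%"
[ "${used:-0}" -ge 96 ] && echo "WARNING: SSD write endurance nearly exhausted"
```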
Twelve OSDs on this host use this drive as their journal device:
```
[root@ceph-11 ~]# lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda             8:0    0   5.5T  0 disk
└─sda1          8:1    0   5.5T  0 part /var/lib/ceph/osd/ceph-87
sdb             8:16   0   5.5T  0 disk
└─sdb1          8:17   0   5.5T  0 part /var/lib/ceph/osd/ceph-88
sdc             8:32   0   5.5T  0 disk
└─sdc1          8:33   0   5.5T  0 part /var/lib/ceph/osd/ceph-89
sdd             8:48   0   5.5T  0 disk
└─sdd1          8:49   0   5.5T  0 part /var/lib/ceph/osd/ceph-90
sde             8:64   0   5.5T  0 disk
└─sde1          8:65   0   5.5T  0 part /var/lib/ceph/osd/ceph-91
sdf             8:80   0   5.5T  0 disk
└─sdf1          8:81   0   5.5T  0 part /var/lib/ceph/osd/ceph-92
sdg             8:96   0   5.5T  0 disk
└─sdg1          8:97   0   5.5T  0 part /var/lib/ceph/osd/ceph-93
sdh             8:112  0   5.5T  0 disk
└─sdh1          8:113  0   5.5T  0 part /var/lib/ceph/osd/ceph-94
sdi             8:128  0   5.5T  0 disk
└─sdi1          8:129  0   5.5T  0 part /var/lib/ceph/osd/ceph-95
sdj             8:144  0   5.5T  0 disk
└─sdj1          8:145  0   5.5T  0 part /var/lib/ceph/osd/ceph-96
sdk             8:160  0   5.5T  0 disk
└─sdk1          8:161  0   5.5T  0 part /var/lib/ceph/osd/ceph-97
sdl             8:176  0   5.5T  0 disk
└─sdl1          8:177  0   5.5T  0 part /var/lib/ceph/osd/ceph-98
sdm             8:192  0 419.2G  0 disk
└─sdm1          8:193  0 419.2G  0 part /
nvme0n1       259:0    0 372.6G  0 disk
├─nvme0n1p1   259:1    0    30G  0 part
├─nvme0n1p2   259:2    0    30G  0 part
├─nvme0n1p3   259:3    0    30G  0 part
├─nvme0n1p4   259:4    0    30G  0 part
├─nvme0n1p5   259:5    0    30G  0 part
├─nvme0n1p6   259:6    0    30G  0 part
├─nvme0n1p7   259:7    0    30G  0 part
├─nvme0n1p8   259:8    0    30G  0 part
├─nvme0n1p9   259:9    0    30G  0 part
├─nvme0n1p10  259:10   0    30G  0 part
├─nvme0n1p11  259:11   0    30G  0 part
└─nvme0n1p12  259:12   0    30G  0 part
[root@ceph-11 ~]#
```
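The twelve 30G partitions on nvme0n1 are the journals. A quick sketch of how to confirm which OSD uses which partition (for a filestore deployment like this one, each OSD keeps a `journal` symlink in its data directory):

```bash
# each OSD's journal symlink should resolve, via by-partuuid, to a partition on nvme0n1
ls -l /var/lib/ceph/osd/ceph-*/journal
readlink -f /var/lib/ceph/osd/ceph-87/journal    # e.g. /dev/nvme0n1p1
```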
1. Lower the OSD primary affinity
In most failure scenarios of this kind we need to power the node off. To keep the operation invisible to users, we first lower the priority of the node that is about to be serviced. Check the Ceph version first: this cluster runs 10.x (Jewel) and has primary-affinity enabled, so a client IO request is handled by the primary PG first and then written to the other replicas. Find the OSDs that belong to host ceph-11 and set their primary-affinity to 0, which means the PGs on them should not become primary unless the other replicas are down.
```
-12  65.47299     host ceph-11
 87   5.45599         osd.87        up  1.00000          0.89999
 88   5.45599         osd.88        up  0.79999          0.29999
 89   5.45599         osd.89        up  1.00000          0.89999
 90   5.45599         osd.90        up  1.00000          0.89999
 91   5.45599         osd.91        up  1.00000          0.89999
 92   5.45599         osd.92        up  1.00000          0.79999
 93   5.45599         osd.93        up  1.00000          0.89999
 94   5.45599         osd.94        up  1.00000          0.89999
 95   5.45599         osd.95        up  1.00000          0.89999
 96   5.45599         osd.96        up  1.00000          0.89999
 97   5.45599         osd.97        up  1.00000          0.89999
 98   5.45599         osd.98        up  0.89999          0.89999
```
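Before changing anything, it can help to record the current per-OSD values so they can be restored exactly after the maintenance window. A minimal sketch (the host name and backup path are our own choices):

```bash
# save "osd.N <primary-affinity>" pairs for the 12 OSDs under host ceph-11
ceph osd tree | grep -A 12 'host ceph-11' \
    | awk '/osd\./ {print $3, $NF}' > /root/ceph-11-affinity.bak
cat /root/ceph-11-affinity.bak
```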
Set the primary affinity of osd.87 through osd.98 to 0:

```bash
for osd in {87..98}; do ceph osd primary-affinity "$osd" 0; done
```
Run `ceph osd tree` again to confirm the new settings on the node:
```
-12  65.47299     host ceph-11
 87   5.45599         osd.87        up  1.00000                0
 88   5.45599         osd.88        up  0.79999                0
 89   5.45599         osd.89        up  1.00000                0
 90   5.45599         osd.90        up  1.00000                0
 91   5.45599         osd.91        up  1.00000                0
 92   5.45599         osd.92        up  1.00000                0
 93   5.45599         osd.93        up  1.00000                0
 94   5.45599         osd.94        up  1.00000                0
 95   5.45599         osd.95        up  1.00000                0
 96   5.45599         osd.96        up  1.00000                0
 97   5.45599         osd.97        up  1.00000                0
 98   5.45599         osd.98        up  0.89999                0
```
2. Prevent OSDs from being marked out of the cluster

```bash
ceph osd set noout
```
By default, an OSD that stays unresponsive for too long is automatically marked out of the cluster, which triggers data migration. Shutting the node down and replacing the SSD takes quite a while, so we temporarily stop the cluster from kicking OSDs out, to avoid pointless back-and-forth data movement. Check the result with `ceph -s`: the cluster status changes to HEALTH_WARN, with an extra note that the noout flag is set, and `noout` now appears in the flags line.
```
[root@ceph-11 ~]# ceph -s
    cluster 936a5233-9441-49df-95c1-01de82a192f4
     health HEALTH_WARN
            noout flag(s) set
     monmap e5: 6 mons at {ceph-1=100.100.200.201:6789/0,ceph-2=100.100.200.202:6789/0,ceph-3=100.100.200.203:6789/0,ceph-4=100.100.200.204:6789/0,ceph-5=100.100.200.205:6789/0,ceph-6=100.100.200.206:6789/0}
            election epoch 406, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
      fsmap e94: 1/1/1 up {0=ceph-1=up:active}, 1 up:standby
     osdmap e73511: 111 osds: 108 up, 108 in
            flags noout,sortbitwise,require_jewel_osds
      pgmap v85913863: 5064 pgs, 24 pools, 89164 GB data, 12450 kobjects
            261 TB used, 141 TB / 403 TB avail
                5060 active+clean
                   4 active+clean+scrubbing+deep
  client io 27608 kB/s rd, 59577 kB/s wr, 399 op/s rd, 668 op/s wr
```
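Besides `ceph -s`, the flag can also be read straight from the osdmap; a quick sketch:

```bash
# the osdmap keeps the cluster-wide flags on a single "flags" line
ceph osd dump | grep ^flags
```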
3. Check whether the primary PGs have moved off the host
```
[root@ceph-11 ~]# ceph pg ls | grep "\[9[1-8],"
13.24  5066 0 0 0 0 41480507922 3071 3071 active+clean 2019-07-02 19:33:37.537802 73497'120563162 73511:110960694 [94,25,64] 94 [94,25,64] 94 73497'120562718 2019-07-02 19:33:37.537761 73294'120561198 2019-07-01 18:11:54.686413
13.10f 4874 0 0 0 0 39967832064 3083 3083 active+clean 2019-07-01 23:56:13.911259 73511'59603193 73511:52739094 [91,44,38] 91 [91,44,38] 91 73302'59589396 2019-07-01 23:56:13.911226 69213'5954576 2019-06-26 22:58:12.864475
13.17d 5001 0 0 0 0 40919228578 3088 3088 active+clean 2019-07-02 13:51:04.162137 73511'34680543 73511:26095334 [96,45,72] 96 [96,45,72] 96 73497'34678725 2019-07-02 13:51:04.162089 70393'3467604 2019-07-01 08:47:58.771910
13.20d 4872 0 0 0 0 40007166482 3036 3036 active+clean 2019-07-03 07:40:28.677097 73511'27811217 73511:22372286 [93,85,73] 93 [93,85,73] 93 73497'27809831 2019-07-03 07:40:28.677059 73302'2779662 2019-07-01 23:15:14.731237
13.214 5006 0 0 0 0 40940654592 3079 3079 active+clean 2019-07-02 21:10:51.094829 73511'34400529 73511:27161705 [94,61,53] 94 [94,61,53] 94 73497'34398612 2019-07-02 21:10:51.094784 73294'3439396 2019-07-01 18:54:06.249357
13.2fd 4950 0 0 0 0 40522633728 3086 3086 active+clean 2019-07-02 06:36:14.763435 73511'149011011 73511:136693896 [91,58,36] 91 [91,58,36] 91 73497'148963815 2019-07-02 06:36:14.763383 73497'148963815 2019-07-02 06:36:14.763383
13.3ae 4989 0 0 0 0 40879544320 3055 3055 active+clean 2019-07-02 00:30:44.817062 73511'67827999 73511:60578765 [91,54,25] 91 [91,54,25] 91 73302'67806651 2019-07-02 00:30:44.817017 69213'67776352
```
Some primary PGs simply refuse to move, so we leave them alone. The noout flag is already set and the pools are triple-replicated, so when this machine is shut down Ceph will serve IO from the remaining replicas and no data migration will be triggered.
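If you prefer a count over eyeballing the grep output, a rough sketch (it assumes the Jewel `ceph pg dump pgs_brief` column layout, with acting_primary in the last column):

```bash
# count how many PGs still have an acting primary among osd.87..98
ceph pg dump pgs_brief 2>/dev/null \
    | awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ && $NF >= 87 && $NF <= 98 {n++} END {print n+0, "PGs still primary on ceph-11"}'
```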
A cluster that keeps three copies of each object can tolerate the failure of any two hosts, so make sure the number of hosts powered off at the same time stays within that limit, to avoid turning a maintenance window into a larger outage.
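Before powering the node off it is worth double-checking the replication level of every pool. A rough sketch (the column positions assume the Jewel `ceph osd dump` pool-line format):

```bash
# print pool id, name, size and min_size to confirm 3 replicas everywhere
ceph osd dump | awk '/^pool/ {print $2, $3, $5, $6, $7, $8}'
```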
4. Stop the services, power off the server, and replace the SSD
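A minimal sketch of this step, assuming systemd-managed OSD units (Jewel on EL7):

```bash
systemctl stop ceph-osd@{87..98}       # stop the 12 OSDs on this host
systemctl is-active ceph-osd@{87..98}  # should no longer report "active"
shutdown -h now                        # power off and swap the SSD
```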
The newly installed SSD shows zero wear (PercentageUsed : 0):
```
[root@ceph-11 ~]# isdct show -sensor
PowerOnHours : 0x063F3
EraseFailCount : 0
EndToEndErrorDetectionCount : 0
ReliabilityDegraded : False
AvailableSpare : 100
AvailableSpareBelowThreshold : False
DeviceStatus : Healthy
SpecifiedPCBMaxOperatingTemp : 85
SpecifiedPCBMinOperatingTemp : 0
UnsafeShutdowns : 0x00
CrcErrorCount : 0
AverageNandEraseCycles : 7
MediaErrors : 0x00
PowerCycles : 0x012
ProgramFailCount : 0
MaxNandEraseCycles : 10
HighestLifetimeTemperature : 48
PercentageUsed : 0
ThermalThrottleStatus : 0
ErrorInfoLogEntries : 0x00
MinNandEraseCycles : 6
LowestLifetimeTemperature : 16
ReadOnlyMode : False
ThermalThrottleCount : 0
TemperatureThresholdExceeded : False
Temperature - Celsius : 48
```
5. The new drive comes up as nvme0n1
```
[root@ceph-11 ~]# lsblk
NAME      MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda         8:0    0   5.5T  0 disk
└─sda1      8:1    0   5.5T  0 part /var/lib/ceph/osd/ceph-87
sdb         8:16   0   5.5T  0 disk
└─sdb1      8:17   0   5.5T  0 part /var/lib/ceph/osd/ceph-88
sdc         8:32   0   5.5T  0 disk
└─sdc1      8:33   0   5.5T  0 part /var/lib/ceph/osd/ceph-89
sdd         8:48   0   5.5T  0 disk
└─sdd1      8:49   0   5.5T  0 part /var/lib/ceph/osd/ceph-90
sde         8:64   0   5.5T  0 disk
└─sde1      8:65   0   5.5T  0 part /var/lib/ceph/osd/ceph-91
sdf         8:80   0   5.5T  0 disk
└─sdf1      8:81   0   5.5T  0 part /var/lib/ceph/osd/ceph-92
sdg         8:96   0   5.5T  0 disk
└─sdg1      8:97   0   5.5T  0 part /var/lib/ceph/osd/ceph-93
sdh         8:112  0   5.5T  0 disk
└─sdh1      8:113  0   5.5T  0 part /var/lib/ceph/osd/ceph-94
sdi         8:128  0   5.5T  0 disk
└─sdi1      8:129  0   5.5T  0 part /var/lib/ceph/osd/ceph-95
sdj         8:144  0   5.5T  0 disk
└─sdj1      8:145  0   5.5T  0 part /var/lib/ceph/osd/ceph-96
sdk         8:160  0   5.5T  0 disk
└─sdk1      8:161  0   5.5T  0 part /var/lib/ceph/osd/ceph-97
sdl         8:176  0   5.5T  0 disk
└─sdl1      8:177  0   5.5T  0 part /var/lib/ceph/osd/ceph-98
sdm         8:192  0 419.2G  0 disk
└─sdm1      8:193  0 419.2G  0 part /
nvme0n1   259:0    0 372.6G  0 disk
```
6. Rebuild the journals
Because the journal device is gone, the OSDs cannot start after boot; the journals have to be recreated. The script below only prepares things: it partitions the new SSD and generates the final script that will actually be executed.
```bash
#!/bin/bash
desc="create ceph journal part for specified osd."

type_journal_uuid=45b0969e-9b03-4f30-b4c6-b4b80ceff106
sgdisk=sgdisk
journal_size=30G           # journal partition size
journal_dev=/dev/nvme0n1   # SSD device to partition
sleep=5

osd_uuids=$(grep "" /var/lib/ceph/osd/ceph-*/journal_uuid 2>/dev/null)

die(){ echo >&2 "$@"; exit 1; }
tip(){ printf >&2 "%b" "$@"; }

[ "$osd_uuids" ] || die "no osd uuid found."
echo "osd journal uuid:"
echo "$osd_uuids"
echo "now sleep $sleep"
sleep $sleep

journal_script="/dev/shm/ceph-journal.sh"
echo "ls -l /dev/nvme0n1p*" > "$journal_script"
echo "sleep 5" >> "$journal_script"

# The partition slot has to be worked out in advance; only then can the
# partition name, uuid and type code be set correctly.
IFS=": "
while read osd_path uuid; do
    let d++
    [ "$osd_path" ] || continue
    osd_id=${osd_path#/var/lib/ceph/osd/ceph-}
    osd_id=${osd_id%/journal_uuid}
    journal_link=${osd_path%_uuid}
    [ ${osd_id:-1} -ge 0 ] || { echo "invalid osd id: $osd_id."; exit 11; }
    tip "create journal for osd $osd_id ... "
    $sgdisk --mbrtogpt --new=$d:0:+"$journal_size" \
        --change-name=$d:'ceph journal' \
        --typecode=$d:"$type_journal_uuid" \
        --partition-guid=$d:"$uuid" \
        "$journal_dev" || exit 1
    tip "part done.\n"
    ln -sfT /dev/disk/by-partuuid/"$uuid" "$journal_link" || exit 3
    echo "ceph-osd --mkjournal --osd-journal /dev/nvme0n1p"$d "-i "$osd_id >> "$journal_script"
    sleep 1
done << EOF
$osd_uuids
EOF
```
The script above only generates the final execution script; its default path is /dev/shm/ceph-journal.sh.
Review the generated content manually and make sure it is correct before running it by hand as root:

```bash
[root@ceph-11 ~]# bash /dev/shm/ceph-journal.sh
```
Generated script:
```
[root@ceph-11 ~]# cat /dev/shm/ceph-journal.sh
#!/bin/bash
ls -l /dev/nvme0n1p*
sleep 5
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p1 -i 87
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p2 -i 88
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p3 -i 89
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p4 -i 90
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p5 -i 91
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p6 -i 92
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p7 -i 93
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p8 -i 94
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p9 -i 95
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p10 -i 96
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p11 -i 97
ceph-osd --mkjournal --osd-journal /dev/nvme0n1p12 -i 98
[root@ceph-11 ~]#
```
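After running it, a quick sketch of how to cross-check that the new partitions, the by-partuuid links and the per-OSD journal symlinks all line up:

```bash
sgdisk --print /dev/nvme0n1                     # 12 x 30G "ceph journal" partitions
ls -l /dev/disk/by-partuuid/ | grep nvme0n1     # partuuid links point at nvme0n1p1..p12
ls -l /var/lib/ceph/osd/ceph-*/journal          # each OSD's journal symlink resolves again
```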
7. With the journal replaced, restore and check the services
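If the OSD daemons did not come up on their own after the journals were rebuilt, start them manually; a minimal sketch assuming systemd-managed OSD units:

```bash
systemctl start ceph-osd@{87..98}
systemctl is-active ceph-osd@{87..98}   # expect "active" twelve times
```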
The OSD services are back up:
```
[root@ceph-11 ~]# ceph osd tree
ID     WEIGHT    TYPE NAME         UP/DOWN REWEIGHT PRIMARY-AFFINITY
-10008         0 root sas6t3
-10007         0 root sas6t2
-10006 130.94598 root sas6t1
   -12  65.47299     host ceph-11
    87   5.45599         osd.87         up  1.00000                0
    88   5.45599         osd.88         up  0.79999                0
    89   5.45599         osd.89         up  1.00000                0
    90   5.45599         osd.90         up  1.00000                0
    91   5.45599         osd.91         up  1.00000                0
    92   5.45599         osd.92         up  1.00000                0
    93   5.45599         osd.93         up  1.00000                0
    94   5.45599         osd.94         up  1.00000                0
    95   5.45599         osd.95         up  1.00000                0
    96   5.45599         osd.96         up  1.00000                0
    97   5.45599         osd.97         up  1.00000                0
    98   5.45599         osd.98         up  0.89999                0
```
Restore the OSD flags; everything changed during the maintenance window needs to be reverted:

```bash
ceph osd unset noout
```
Restore the OSD primary affinity:
```
[root@ceph-11 ~]# for osd in {87..98}; do ceph osd primary-affinity "$osd" 0.8; done
set osd.87 primary-affinity to 0.8 (8524282)
set osd.88 primary-affinity to 0.8 (8524282)
set osd.89 primary-affinity to 0.8 (8524282)
set osd.90 primary-affinity to 0.8 (8524282)
set osd.91 primary-affinity to 0.8 (8524282)
set osd.92 primary-affinity to 0.8 (8524282)
set osd.93 primary-affinity to 0.8 (8524282)
set osd.94 primary-affinity to 0.8 (8524282)
set osd.95 primary-affinity to 0.8 (8524282)
set osd.96 primary-affinity to 0.8 (8524282)
set osd.97 primary-affinity to 0.8 (8524282)
set osd.98 primary-affinity to 0.8 (8524282)
[root@ceph-11 ~]#
```
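Here all twelve OSDs are simply set back to 0.8. If the original per-OSD values were recorded beforehand (as in the backup sketch in step 1), they could instead be restored exactly; a minimal sketch assuming that backup file:

```bash
# replay "osd.N <primary-affinity>" pairs saved before the maintenance window
while read name aff; do
    ceph osd primary-affinity "$name" "$aff"
done < /root/ceph-11-affinity.bak
```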
Wait for the cluster to recover
Let the cluster run its automatic recovery until it returns to HEALTH_OK.
If HEALTH_ERR shows up during this period, follow up promptly and search for the symptoms (Google is your friend).
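A simple way to wait is to poll the health status; a minimal sketch:

```bash
# block until the cluster reports HEALTH_OK again, printing status along the way
until ceph health | grep -q HEALTH_OK; do
    ceph health
    sleep 30
done
```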
```
[root@ceph-11 ~]# ceph -s
    cluster 936a5233-9441-49df-95c1-01de82a192f4
     health HEALTH_WARN
            12 pgs degraded
            2 pgs recovering
            10 pgs recovery_wait
            12 pgs stuck unclean
            recovery 116/38259009 objects degraded (0.000%)
     monmap e5: 6 mons at {ceph-1=100.100.200.201:6789/0,ceph-2=100.100.200.202:6789/0,ceph-3=100.100.200.203:6789/0,ceph-4=100.100.200.204:6789/0,ceph-5=100.100.200.205:6789/0,ceph-6=100.100.200.206:6789/0}
            election epoch 406, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
      fsmap e94: 1/1/1 up {0=ceph-1=up:active}, 1 up:standby
     osdmap e73609: 111 osds: 108 up, 108 in
            flags sortbitwise,require_jewel_osds
      pgmap v85918476: 5064 pgs, 24 pools, 89195 GB data, 12454 kobjects
            261 TB used, 141 TB / 403 TB avail
            116/38259009 objects degraded (0.000%)
                5049 active+clean
                  10 active+recovery_wait+degraded
                   3 active+clean+scrubbing+deep
                   2 active+recovering+degraded
recovery io 22105 kB/s, 4 objects/s
  client io 55017 kB/s rd, 77280 kB/s wr, 944 op/s rd, 590 op/s wr
[root@ceph-11 ~]#
[root@ceph-11 ~]# ceph -s
    cluster 936a5233-9441-49df-95c1-01de82a192f4
     health HEALTH_WARN
            1 pgs degraded
            1 pgs recovering
            1 pgs stuck unclean
            recovery 2/38259009 objects degraded (0.000%)
     monmap e5: 6 mons at {ceph-1=100.100.200.201:6789/0,ceph-2=100.100.200.202:6789/0,ceph-3=100.100.200.203:6789/0,ceph-4=100.100.200.204:6789/0,ceph-5=100.100.200.205:6789/0,ceph-6=100.100.200.206:6789/0}
            election epoch 406, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
      fsmap e94: 1/1/1 up {0=ceph-1=up:active}, 1 up:standby
     osdmap e73609: 111 osds: 108 up, 108 in
            flags sortbitwise,require_jewel_osds
      pgmap v85918493: 5064 pgs, 24 pools, 89195 GB data, 12454 kobjects
            261 TB used, 141 TB / 403 TB avail
            2/38259009 objects degraded (0.000%)
                5060 active+clean
                   3 active+clean+scrubbing+deep
                   1 active+recovering+degraded
  client io 81789 kB/s rd, 245 MB/s wr, 1441 op/s rd, 651 op/s wr
[root@ceph-11 ~]# ceph -s
    cluster 936a5233-9441-49df-95c1-01de82a192f4
     health HEALTH_OK
     monmap e5: 6 mons at {ceph-1=100.100.200.201:6789/0,ceph-2=100.100.200.202:6789/0,ceph-3=100.100.200.203:6789/0,ceph-4=100.100.200.204:6789/0,ceph-5=100.100.200.205:6789/0,ceph-6=100.100.200.206:6789/0}
            election epoch 406, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
      fsmap e94: 1/1/1 up {0=ceph-1=up:active}, 1 up:standby
     osdmap e73609: 111 osds: 108 up, 108 in
            flags sortbitwise,require_jewel_osds
      pgmap v85918494: 5064 pgs, 24 pools, 89195 GB data, 12454 kobjects
            261 TB used, 141 TB / 403 TB avail
                5061 active+clean
                   3 active+clean+scrubbing+deep
recovery io 7388 kB/s, 0 objects/s
  client io 67551 kB/s rd, 209 MB/s wr, 1153 op/s rd, 901 op/s wr
[root@ceph-11 ~]#
```
The cluster is back to a normal, healthy state.