Version
[root@controller1 ~]# ceph -v
ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
Status
Run ceph -s on the admin node.
It shows the cluster status, for example:
    cluster 936a5233-9441-49df-95c1-01de82a192f4
     health HEALTH_OK
     monmap e5: 6 mons at {ceph-1=100.100.200.201:6789/0,ceph-2=100.100.200.202:6789/0,ceph-3=100.100.200.203:6789/0,ceph-4=100.100.200.204:6789/0,ceph-5=100.100.200.205:6789/0,ceph-6=100.100.200.206:6789/0}
            election epoch 382, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
      fsmap e85: 1/1/1 up {0=ceph-2=up:active}
     osdmap e62553: 111 osds: 109 up, 109 in
            flags sortbitwise,require_jewel_osds
      pgmap v72844263: 5064 pgs, 24 pools, 93130 GB data, 13301 kobjects
            273 TB used, 133 TB / 407 TB avail
                5058 active+clean
                   6 active+clean+scrubbing+deep
  client io 57046 kB/s rd, 35442 kB/s wr, 1703 op/s rd, 1486 op/s wr
If we need to keep watching the cluster, there are two ways.
The first is: ceph -w
This is the official way; the output is the same as ceph -s, except that the client io line at the bottom keeps updating.
Sometimes we also want to see how the other fields above change, so I wrote a script:
watch -n 1 "ceph -s| awk -v ll=$COLUMNS '/^ *mds[0-9]/{ \$0=substr(\$0, 1, ll); } /^ +[0-9]+ pg/{next} /monmap/{ next } /^ +recovery [0-9]+/{next} { print}'; ceph osd pool stats | awk '/^pool/{ p=\$2 } /^ +(recovery|client)/{ if(p){print \"\n\"p; p=\"\"}; print }'"
Sample output:
Every 1.0s: ceph -s| awk -v ll=105 '/^ *mds[0-9]/{$0=substr($0, 1, ll);} /^ ...  Mon Jan 21 18:09:44 2019

    cluster 936a5233-9441-49df-95c1-01de82a192f4
     health HEALTH_OK
            election epoch 382, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
      fsmap e85: 1/1/1 up {0=ceph-2=up:active}
     osdmap e62561: 111 osds: 109 up, 109 in
            flags sortbitwise,require_jewel_osds
      pgmap v73183831: 5064 pgs, 24 pools, 93179 GB data, 13310 kobjects
            273 TB used, 133 TB / 407 TB avail
                5058 active+clean
                   6 active+clean+scrubbing+deep
  client io 263 MB/s rd, 58568 kB/s wr, 755 op/s rd, 1165 op/s wr

cinder-sas
  client io 248 MB/s rd, 33529 kB/s wr, 363 op/s rd, 597 op/s wr

vms
  client io 1895 B/s rd, 2343 kB/s wr, 121 op/s rd, 172 op/s wr

cinder-ssd
  client io 15620 kB/s rd, 22695 kB/s wr, 270 op/s rd, 395 op/s wr
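For readability, the same pipeline can also be written out as a small script with comments. The file name ceph-watch.sh is only an illustration; run it with watch -n 1 ./ceph-watch.sh, and the filtering is the same as in the one-liner above:

#!/bin/bash
# ceph-watch.sh -- condensed cluster status plus per-pool io.
# Usage: watch -n 1 ./ceph-watch.sh

# Cluster status: truncate long mds lines to the terminal width,
# drop the monmap line, per-pg count lines and recovery counters.
ceph -s | awk -v ll="${COLUMNS:-120}" '
    /^ *mds[0-9]/        { $0 = substr($0, 1, ll) }  # keep the mds line within one row
    /^ +[0-9]+ pg/       { next }                    # skip "<n> pg..." detail lines
    /monmap/             { next }                    # skip the monmap line
    /^ +recovery [0-9]+/ { next }                    # skip recovery counters
    { print }
'

# Per-pool io: print a pool name only when it currently has
# client or recovery activity.
ceph osd pool stats | awk '
    /^pool/                { p = $2 }                # remember the current pool name
    /^ +(recovery|client)/ {
        if (p) { print "\n" p; p = "" }              # print the name once
        print                                        # then its io line(s)
    }
'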
Usage
# ceph df
GLOBAL:
    SIZE     AVAIL     RAW USED     %RAW USED
    407T     146T      260T         64.04
POOLS:
    NAME           ID     USED       %USED     MAX AVAIL     OBJECTS
    cinder-sas     13     76271G     89.25     9186G         10019308
    images         14     649G       6.60      9186G         339334
    vms            15     7026G      43.34     9186G         1807073
    cinder-ssd     16     4857G      74.73     1642G         645823
    rbd            17     0          0         16909G        1
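If per-pool read/write and object counters are also needed, two related commands should be available on jewel as well (the exact columns vary by version):

# more columns per pool than plain ceph df
ceph df detail
# a similar per-pool view from the rados side
rados df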
osd
Gives a quick view of the osd topology; also useful for checking osd status and other information:
# ceph osd tree
ID     WEIGHT     TYPE NAME        UP/DOWN  REWEIGHT  PRIMARY-AFFINITY
-10008         0  root sas6t3
-10007         0  root sas6t2
-10006 130.94598  root sas6t1
   -12  65.47299      host ceph-11
    87   5.45599          osd.87        up   1.00000           0.89999
    88   5.45599          osd.88        up   0.79999           0.29999
    89   5.45599          osd.89        up   1.00000           0.89999
    90   5.45599          osd.90        up   1.00000           0.89999
    91   5.45599          osd.91        up   1.00000           0.89999
    92   5.45599          osd.92        up   1.00000           0.79999
    93   5.45599          osd.93        up   1.00000           0.89999
    94   5.45599          osd.94        up   1.00000           0.89999
    95   5.45599          osd.95        up   1.00000           0.89999
    96   5.45599          osd.96        up   1.00000           0.89999
    97   5.45599          osd.97        up   1.00000           0.89999
    98   5.45599          osd.98        up   0.89999           0.89999
   -13  65.47299      host ceph-12
    99   5.45599          osd.99        up   1.00000           0.79999
   100   5.45599          osd.100       up   1.00000           0.79999
   101   5.45599          osd.101       up   1.00000           0.79999
   102   5.45599          osd.102       up   1.00000           0.79999
   103   5.45599          osd.103       up   1.00000           0.79999
   104   5.45599          osd.104       up   0.79999           0.79999
   105   5.45599          osd.105       up   1.00000           0.79999
   106   5.45599          osd.106       up   1.00000           0.79999
   107   5.45599          osd.107       up   1.00000           0.79999
   108   5.45599          osd.108       up   1.00000           0.79999
   109   5.45599          osd.109       up   1.00000           0.79999
   110   5.45599          osd.110       up   1.00000           0.79999
I also wrote a one-liner that highlights the %USE column when an osd's usage crosses a threshold (blue above 84%, red above 90%):
ceph osd df | awk -v c1=84 -v c2=90 '{z=NF-2; if($z<=100&&$z>c1){c=34;if($z>c2)c=31;$z="\033["c";1m"$z"\033[0m"}; print}'
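For reference, the same one-liner written out with comments; the %USE column is the third field from the end of jewel's ceph osd df output, and c1/c2 are the thresholds in percent:

ceph osd df | awk -v c1=84 -v c2=90 '
    {
        z = NF - 2                                # index of the %USE column
        if ($z <= 100 && $z > c1) {
            c = 34                                # blue when usage exceeds c1%
            if ($z > c2) c = 31                   # red when usage exceeds c2%
            $z = "\033[" c ";1m" $z "\033[0m"     # wrap the value in ANSI color codes
        }
        print
    }'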
reweight
Manual weighting
When OSD load is unbalanced, the weights need manual intervention. The default is 1, and we normally only lower it:
osd reweight <int[0-]> <float[0.0-1.0]>    reweight osd to 0.0 < <weight> < 1.0
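A minimal example, assuming we want to shift some data away from osd.88 (the id and the value 0.8 are only illustrative; lower the weight gradually and watch the resulting backfill):

# map fewer PGs to osd.88 by lowering its reweight to 0.8
ceph osd reweight 88 0.8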
primary affinity
This controls how likely the PGs on an OSD are to become primary. 0 means the OSD will not be chosen as primary unless no other replica is available; 1 means it will always be preferred as primary unless the others are also set to 1. For values in between, the exact number of primary PGs is derived from the OSD topology, since different pools may sit on different OSDs.
osd primary-affinity <osdname (id|osd.id)> <float[0.0-1.0]>    adjust osd primary-affinity from 0.0 <= <weight> <= 1.0
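For example (osd.87 and the value 0.9 are only illustrative):

# make osd.87 somewhat less likely to be elected primary for its PGs
ceph osd primary-affinity osd.87 0.9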
pool
All of these commands start with ceph osd pool.
List the pools:
ceph osd pool ls
Appending detail shows the pool details:
# ceph osd pool ls detail
pool 13 'cinder-sas' replicated size 3 min_size 2 crush_ruleset 8 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 63138 flags hashpspool stripe_width 0
        removed_snaps [1~5,7~2,a~2,e~10,23~4,2c~24,51~2,54~2,57~2,5a~a]
pool 14 'images' replicated size 3 min_size 2 crush_ruleset 8 object_hash rjenkins pg_num 512 pgp_num 512 last_change 63012 flags hashpspool stripe_width 0
Adjust a pool attribute:
# ceph osd pool set <pool name> <attribute> <value>
osd pool set <poolname> size|min_size|crash_replay_interval|pg_num|pgp_num|crush_ruleset|hashpspool|nodelete|nopgchange|nosizechange|write_fadvise_dontneed|noscrub|nodeep-scrub|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|use_gmt_hitset|debug_fake_ec_pool|target_max_bytes|target_max_objects|cache_target_dirty_ratio|cache_target_dirty_high_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|auid|min_read_recency_for_promote|min_write_recency_for_promote|fast_read|hit_set_grade_decay_rate|hit_set_search_last_n|scrub_min_interval|scrub_max_interval|deep_scrub_interval|recovery_priority|recovery_op_priority|scrub_priority <val> {--yes-i-really-mean-it} : set pool parameter <var> to <val>
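A small example of the usual get-then-set flow, using a pool name from the cluster above purely as an illustration:

# read the current value first
ceph osd pool get cinder-sas min_size
# then change it
ceph osd pool set cinder-sas min_size 2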
pg
All of these commands start with ceph pg.
Check the status:
# ceph pg stat
v79188443: 5064 pgs: 1 active+clean+scrubbing, 2 active+clean+scrubbing+deep, 5061 active+clean; 88809 GB data, 260 TB used, 146 TB / 407 TB avail; 384 MB/s rd, 134 MB/s wr, 2380 op/s
ceph pg ls can be followed by a state, or by other arguments; an example filter is shown after the listing below.
# ceph pg ls | grep scrub
pg_stat objects mip degr misp unf bytes       log  disklog state                       state_stamp                v               reported        up          up_primary acting      acting_primary last_scrub      scrub_stamp                last_deep_scrub deep_scrub_stamp
13.1e   4832    0   0    0   0   39550330880 3034 3034    active+clean+scrubbing+deep 2019-04-08 15:24:46.496295 63232'167226529 63232:72970092  [95,80,44]  95         [95,80,44]  95             63130'167208564 2019-04-07 05:16:01.452400 63130'167117875 2019-04-05 18:53:54.796948
13.13b  4955    0   0    0   0   40587477010 3065 3065    active+clean+scrubbing+deep 2019-04-08 15:19:43.641336 63232'93849435  63232:89107385  [87,39,78]  87         [87,39,78]  87             63130'93838372  2019-04-07 08:07:43.825933 62998'93796094  2019-04-01 22:23:14.399257
13.1ac  4842    0   0    0   0   39605106850 3081 3081    active+clean+scrubbing+deep 2019-04-08 15:26:40.119698 63232'29801889  63232:23652708  [110,31,76] 110        [110,31,76] 110            63130'29797321  2019-04-07 10:50:26.243588 62988'29759937  2019-04-01 08:19:34.927978
13.31f  4915    0   0    0   0   40128633874 3013 3013    active+clean+scrubbing      2019-04-08 15:27:19.489919 63232'45174880  63232:38010846  [99,25,42]  99         [99,25,42]  99             63130'45170307  2019-04-07 06:29:44.946734 63130'45160962  2019-04-05 21:30:38.849569
13.538  4841    0   0    0   0   39564094976 3003 3003    active+clean+scrubbing      2019-04-08 15:27:15.731348 63232'69555013  63232:58836987  [109,85,24] 109        [109,85,24] 109            63130'69542700  2019-04-07 08:09:00.311084 63130'69542700  2019-04-07 08:09:00.311084
13.71f  4851    0   0    0   0   39552301568 3014 3014    active+clean+scrubbing      2019-04-08 15:27:16.896665 63232'57281834  63232:49191849  [100,75,66] 100        [100,75,66] 100            63130'57247440  2019-04-07 05:43:44.886559 63008'57112775  2019-04-03 05:15:51.434950
13.774  4867    0   0    0   0   39723743842 3092 3092    active+clean+scrubbing      2019-04-08 15:27:19.501188 63232'32139217  63232:28360980  [101,63,21] 101        [101,63,21] 101            63130'32110484  2019-04-07 06:24:22.174377 63130'32110484  2019-04-07 06:24:22.174377
13.7fe  4833    0   0    0   0   39485484032 3015 3015    active+clean+scrubbing+deep 2019-04-08 15:27:15.699899 63232'38297730  63232:32962414  [108,82,56] 108        [108,82,56] 108            63130'38286258  2019-04-07 07:59:53.586416 63008'38267073  2019-04-03 14:44:02.779800
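The state filter mentioned above looks like this; 13 is the pool id seen in the listing, and the accepted state names are enumerated in the usage text below:

# only PGs that are currently scrubbing
ceph pg ls scrubbing
# the same, restricted to pool 13
ceph pg ls 13 scrubbing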
The ls-by variants can also be used; their usage is below, followed by a few examples.
pg ls {<int>} {active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered [active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered...]}
pg ls-by-primary <osdname (id|osd.id)> {<int>} {active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered [active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered...]}
pg ls-by-osd <osdname (id|osd.id)> {<int>} {active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered [active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered...]}
pg ls-by-pool <poolstr> {active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered [active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered...]}
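For example, using osd ids and a pool name from the earlier output:

# PGs whose primary is osd.87
ceph pg ls-by-primary osd.87
# PGs that have any copy on osd.110
ceph pg ls-by-osd osd.110
# PGs of a single pool, optionally filtered by state
ceph pg ls-by-pool cinder-sas scrubbing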
Repair
# ceph pg repair 13.e1
instructing pg 13.e1 on osd.110 to repair
Routine troubleshooting
pg inconsistent
A pg in the inconsistent state matches this problem; the "1 scrub errors" that follows shows it is scrub-related.
# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors; noout flag(s) set
pg 13.e1 is active+clean+inconsistent, acting [110,55,21]
1 scrub errors
noout flag(s) set
Run the following:
# ceph pg repair 13.e1
instructing pg 13.e1 on osd.110 to repair
Verify
At this point you can see that 13.e1 has entered a deep scrub:
# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 pgs repair; 1 scrub errors; noout flag(s) set
pg 13.e1 is active+clean+scrubbing+deep+inconsistent+repair, acting [110,55,21]
1 scrub errors
noout flag(s) set
After a while the error disappears and pg 13.e1 returns to the active+clean state:
# ceph health detail
HEALTH_WARN noout flag(s) set
noout flag(s) set
Root cause
Ceph scrubs its PGs periodically. An inconsistent state does not necessarily mean the data itself is inconsistent; it only means that the data and its checksum disagree. Once repair is issued, Ceph runs a deep scrub to determine whether the data really is inconsistent. If the deep scrub passes, the data is fine and only the checksum needs to be corrected.
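To see which objects the scrub flagged before deciding to repair, jewel should also allow querying the inconsistency details of a pg (13.e1 is the pg from the example above); this is an optional check, not part of the repair itself:

rados list-inconsistent-obj 13.e1 --format=json-pretty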
request blocked for XXs
Locate the OSDs whose requests are blocked:
ceph health detail | grep blocked
Then lower the primary affinity of those OSDs; this moves some primary PGs elsewhere and reduces the pressure. The current value can be seen in the PRIMARY-AFFINITY column of ceph osd tree:
ceph osd primary-affinity OSD_ID <a value lower than the current one>
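A minimal sketch of that flow, assuming osd.95 was the one reported as blocked (the id and the value 0.5 are only illustrative):

# check the current PRIMARY-AFFINITY column for this osd
ceph osd tree | grep -w osd.95
# lower it so fewer PGs elect this osd as primary
ceph osd primary-affinity 95 0.5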
This is usually caused by cluster imbalance: some OSDs are overloaded and cannot handle requests in time. If it happens frequently, investigate the cause:
1. If client IO demand has grown, try to optimize the clients and cut unnecessary reads and writes.
2. If certain OSDs consistently cannot keep up with requests, temporarily lower their primary affinity and keep an eye on them, since this can be an early sign of disk failure.
3. If all OSDs sharing one journal SSD show this problem, check whether that journal SSD has a write bottleneck or is failing.