Version
[root@controller1 ~]# ceph -v
ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
Status
Run ceph -s on the admin node.
It shows the cluster status, for example:
    cluster 936a5233-9441-49df-95c1-01de82a192f4
     health HEALTH_OK
     monmap e5: 6 mons at {ceph-1=100.100.200.201:6789/0,ceph-2=100.100.200.202:6789/0,ceph-3=100.100.200.203:6789/0,ceph-4=100.100.200.204:6789/0,ceph-5=100.100.200.205:6789/0,ceph-6=100.100.200.206:6789/0}
            election epoch 382, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
      fsmap e85: 1/1/1 up {0=ceph-2=up:active}
     osdmap e62553: 111 osds: 109 up, 109 in
            flags sortbitwise,require_jewel_osds
      pgmap v72844263: 5064 pgs, 24 pools, 93130 GB data, 13301 kobjects
            273 TB used, 133 TB / 407 TB avail
                5058 active+clean
                   6 active+clean+scrubbing+deep
  client io 57046 kB/s rd, 35442 kB/s wr, 1703 op/s rd, 1486 op/s wr
If we need to keep watching the cluster, there are two ways.
The first is: ceph -w
This is the official way; the output is the same as ceph -s, except that the client io line at the bottom keeps updating.
Sometimes we also want to see how the other fields above change, so I wrote a script:
watch -n 1 "ceph -s| awk -v ll=$COLUMNS '/^ *mds[0-9]/{ \$0=substr(\$0, 1, ll); } /^ +[0-9]+ pg/{next} /monmap/{ next } /^ +recovery [0-9]+/{next} { print}'; ceph osd pool stats | awk '/^pool/{ p=\$2 } /^ +(recovery|client)/{ if(p){print \"\n\"p; p=\"\"}; print }'"
Sample output:
Every 1.0s: ceph -s| awk -v ll=105 '/^ *mds[0-9]/{$0=substr($0, 1, ll);} /^ ...  Mon Jan 21 18:09:44 2019

    cluster 936a5233-9441-49df-95c1-01de82a192f4
     health HEALTH_OK
            election epoch 382, quorum 0,1,2,3,4,5 ceph-1,ceph-2,ceph-3,ceph-4,ceph-5,ceph-6
      fsmap e85: 1/1/1 up {0=ceph-2=up:active}
     osdmap e62561: 111 osds: 109 up, 109 in
            flags sortbitwise,require_jewel_osds
      pgmap v73183831: 5064 pgs, 24 pools, 93179 GB data, 13310 kobjects
            273 TB used, 133 TB / 407 TB avail
                5058 active+clean
                   6 active+clean+scrubbing+deep
  client io 263 MB/s rd, 58568 kB/s wr, 755 op/s rd, 1165 op/s wr

cinder-sas
  client io 248 MB/s rd, 33529 kB/s wr, 363 op/s rd, 597 op/s wr

vms
  client io 1895 B/s rd, 2343 kB/s wr, 121 op/s rd, 172 op/s wr

cinder-ssd
  client io 15620 kB/s rd, 22695 kB/s wr, 270 op/s rd, 395 op/s wr
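For readability, the same pipeline can also be written out as a small script with comments. The file name ceph-watch.sh is only an illustration; run it with watch -n 1 ./ceph-watch.sh, and the filtering is the same as in the one-liner above:

#!/bin/bash
# ceph-watch.sh -- condensed cluster status plus per-pool io.
# Usage: watch -n 1 ./ceph-watch.sh

# Cluster status: truncate long mds lines to the terminal width,
# drop the monmap line, per-pg count lines and recovery counters.
ceph -s | awk -v ll="${COLUMNS:-120}" '
    /^ *mds[0-9]/        { $0 = substr($0, 1, ll) }  # keep the mds line within one row
    /^ +[0-9]+ pg/       { next }                    # skip "<n> pg..." detail lines
    /monmap/             { next }                    # skip the monmap line
    /^ +recovery [0-9]+/ { next }                    # skip recovery counters
    { print }
'

# Per-pool io: print a pool name only when it currently has
# client or recovery activity.
ceph osd pool stats | awk '
    /^pool/                { p = $2 }                # remember the current pool name
    /^ +(recovery|client)/ {
        if (p) { print "\n" p; p = "" }              # print the name once
        print                                        # then its io line(s)
    }
'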
Usage
# ceph df
GLOBAL:
    SIZE     AVAIL     RAW USED     %RAW USED
    407T     146T      260T         64.04
POOLS:
    NAME           ID     USED       %USED     MAX AVAIL     OBJECTS
    cinder-sas     13     76271G     89.25     9186G         10019308
    images         14     649G       6.60      9186G         339334
    vms            15     7026G      43.34     9186G         1807073
    cinder-ssd     16     4857G      74.73     1642G         645823
    rbd            17     0          0         16909G        1
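If per-pool read/write and object counters are also needed, two related commands should be available on jewel as well (the exact columns vary by version):

# more columns per pool than plain ceph df
ceph df detail
# a similar per-pool view from the rados side
rados df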
osd
Gives a quick view of the osd topology; also useful for checking osd status and other information:
# ceph osd tree
ID     WEIGHT     TYPE NAME        UP/DOWN  REWEIGHT  PRIMARY-AFFINITY
-10008         0  root sas6t3
-10007         0  root sas6t2
-10006 130.94598  root sas6t1
   -12  65.47299      host ceph-11
    87   5.45599          osd.87        up   1.00000           0.89999
    88   5.45599          osd.88        up   0.79999           0.29999
    89   5.45599          osd.89        up   1.00000           0.89999
    90   5.45599          osd.90        up   1.00000           0.89999
    91   5.45599          osd.91        up   1.00000           0.89999
    92   5.45599          osd.92        up   1.00000           0.79999
    93   5.45599          osd.93        up   1.00000           0.89999
    94   5.45599          osd.94        up   1.00000           0.89999
    95   5.45599          osd.95        up   1.00000           0.89999
    96   5.45599          osd.96        up   1.00000           0.89999
    97   5.45599          osd.97        up   1.00000           0.89999
    98   5.45599          osd.98        up   0.89999           0.89999
   -13  65.47299      host ceph-12
    99   5.45599          osd.99        up   1.00000           0.79999
   100   5.45599          osd.100       up   1.00000           0.79999
   101   5.45599          osd.101       up   1.00000           0.79999
   102   5.45599          osd.102       up   1.00000           0.79999
   103   5.45599          osd.103       up   1.00000           0.79999
   104   5.45599          osd.104       up   0.79999           0.79999
   105   5.45599          osd.105       up   1.00000           0.79999
   106   5.45599          osd.106       up   1.00000           0.79999
   107   5.45599          osd.107       up   1.00000           0.79999
   108   5.45599          osd.108       up   1.00000           0.79999
   109   5.45599          osd.109       up   1.00000           0.79999
   110   5.45599          osd.110       up   1.00000           0.79999
I also wrote a one-liner that highlights the %USE column when an osd's usage crosses a threshold (blue above 84%, red above 90%):
ceph osd df | awk -v c1=84 -v c2=90 '{z=NF-2; if($z<=100&&$z>c1){c=34;if($z>c2)c=31;$z="\033["c";1m"$z"\033[0m"}; print}'
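For reference, the same one-liner written out with comments; the %USE column is the third field from the end of jewel's ceph osd df output, and c1/c2 are the thresholds in percent:

ceph osd df | awk -v c1=84 -v c2=90 '
    {
        z = NF - 2                                # index of the %USE column
        if ($z <= 100 && $z > c1) {
            c = 34                                # blue when usage exceeds c1%
            if ($z > c2) c = 31                   # red when usage exceeds c2%
            $z = "\033[" c ";1m" $z "\033[0m"     # wrap the value in ANSI color codes
        }
        print
    }'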
reweight
Manual weighting
When OSD load is unbalanced, the weights need manual intervention. The default is 1, and we normally only lower it:
osd reweight <int[0-]> <float[0.0-1.0]>    reweight osd to 0.0 < <weight> < 1.0
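A minimal example, assuming we want to shift some data away from osd.88 (the id and the value 0.8 are only illustrative; lower the weight gradually and watch the resulting backfill):

# map fewer PGs to osd.88 by lowering its reweight to 0.8
ceph osd reweight 88 0.8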
primary affinity
This controls how likely the PGs on an OSD are to become primary. 0 means the OSD will not be chosen as primary unless no other replica is available; 1 means it will always be preferred as primary unless the others are also set to 1. For values in between, the exact number of primary PGs is derived from the OSD topology, since different pools may sit on different OSDs.
osd primary-affinity <osdname (id|osd.id)> <float[0.0-1.0]>    adjust osd primary-affinity from 0.0 <= <weight> <= 1.0
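For example (osd.87 and the value 0.9 are only illustrative):

# make osd.87 somewhat less likely to be elected primary for its PGs
ceph osd primary-affinity osd.87 0.9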
pool
All of these commands start with ceph osd pool.
List the pools:
ceph osd pool ls
Appending detail shows the pool details:
# ceph osd pool ls detail
pool 13 'cinder-sas' replicated size 3 min_size 2 crush_ruleset 8 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 63138 flags hashpspool stripe_width 0
        removed_snaps [1~5,7~2,a~2,e~10,23~4,2c~24,51~2,54~2,57~2,5a~a]
pool 14 'images' replicated size 3 min_size 2 crush_ruleset 8 object_hash rjenkins pg_num 512 pgp_num 512 last_change 63012 flags hashpspool stripe_width 0
Adjust a pool attribute:
# ceph osd pool set <pool name> <attribute> <value>
osd pool set <poolname> size|min_size|crash_replay_interval|pg_num|pgp_num|crush_ruleset|hashpspool|nodelete|nopgchange|nosizechange|write_fadvise_dontneed|noscrub|nodeep-scrub|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|use_gmt_hitset|debug_fake_ec_pool|target_max_bytes|target_max_objects|cache_target_dirty_ratio|cache_target_dirty_high_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|auid|min_read_recency_for_promote|min_write_recency_for_promote|fast_read|hit_set_grade_decay_rate|hit_set_search_last_n|scrub_min_interval|scrub_max_interval|deep_scrub_interval|recovery_priority|recovery_op_priority|scrub_priority <val> {--yes-i-really-mean-it} : set pool parameter <var> to <val>
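A small example of the usual get-then-set flow, using a pool name from the cluster above purely as an illustration:

# read the current value first
ceph osd pool get cinder-sas min_size
# then change it
ceph osd pool set cinder-sas min_size 2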
pg
All of these commands start with ceph pg.
Check the status:
# ceph pg stat
v79188443: 5064 pgs: 1 active+clean+scrubbing, 2 active+clean+scrubbing+deep, 5061 active+clean; 88809 GB data, 260 TB used, 146 TB / 407 TB avail; 384 MB/s rd, 134 MB/s wr, 2380 op/s
ceph pg ls can be followed by a state, or by other arguments; an example filter is shown after the listing below.
# ceph pg ls | grep scrub
pg_stat objects mip degr misp unf bytes       log  disklog state                       state_stamp                v               reported        up          up_primary acting      acting_primary last_scrub      scrub_stamp                last_deep_scrub deep_scrub_stamp
13.1e   4832    0   0    0   0   39550330880 3034 3034    active+clean+scrubbing+deep 2019-04-08 15:24:46.496295 63232'167226529 63232:72970092  [95,80,44]  95         [95,80,44]  95             63130'167208564 2019-04-07 05:16:01.452400 63130'167117875 2019-04-05 18:53:54.796948
13.13b  4955    0   0    0   0   40587477010 3065 3065    active+clean+scrubbing+deep 2019-04-08 15:19:43.641336 63232'93849435  63232:89107385  [87,39,78]  87         [87,39,78]  87             63130'93838372  2019-04-07 08:07:43.825933 62998'93796094  2019-04-01 22:23:14.399257
13.1ac  4842    0   0    0   0   39605106850 3081 3081    active+clean+scrubbing+deep 2019-04-08 15:26:40.119698 63232'29801889  63232:23652708  [110,31,76] 110        [110,31,76] 110            63130'29797321  2019-04-07 10:50:26.243588 62988'29759937  2019-04-01 08:19:34.927978
13.31f  4915    0   0    0   0   40128633874 3013 3013    active+clean+scrubbing      2019-04-08 15:27:19.489919 63232'45174880  63232:38010846  [99,25,42]  99         [99,25,42]  99             63130'45170307  2019-04-07 06:29:44.946734 63130'45160962  2019-04-05 21:30:38.849569
13.538  4841    0   0    0   0   39564094976 3003 3003    active+clean+scrubbing      2019-04-08 15:27:15.731348 63232'69555013  63232:58836987  [109,85,24] 109        [109,85,24] 109            63130'69542700  2019-04-07 08:09:00.311084 63130'69542700  2019-04-07 08:09:00.311084
13.71f  4851    0   0    0   0   39552301568 3014 3014    active+clean+scrubbing      2019-04-08 15:27:16.896665 63232'57281834  63232:49191849  [100,75,66] 100        [100,75,66] 100            63130'57247440  2019-04-07 05:43:44.886559 63008'57112775  2019-04-03 05:15:51.434950
13.774  4867    0   0    0   0   39723743842 3092 3092    active+clean+scrubbing      2019-04-08 15:27:19.501188 63232'32139217  63232:28360980  [101,63,21] 101        [101,63,21] 101            63130'32110484  2019-04-07 06:24:22.174377 63130'32110484  2019-04-07 06:24:22.174377
13.7fe  4833    0   0    0   0   39485484032 3015 3015    active+clean+scrubbing+deep 2019-04-08 15:27:15.699899 63232'38297730  63232:32962414  [108,82,56] 108        [108,82,56] 108            63130'38286258  2019-04-07 07:59:53.586416 63008'38267073  2019-04-03 14:44:02.779800
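The state filter mentioned above looks like this; 13 is the pool id seen in the listing, and the accepted state names are enumerated in the usage text below:

# only PGs that are currently scrubbing
ceph pg ls scrubbing
# the same, restricted to pool 13
ceph pg ls 13 scrubbing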
The ls-by variants can also be used; their usage is below, followed by a few examples.
pg ls {<int>} {active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered [active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered...]}
pg ls-by-primary <osdname (id|osd.id)> {<int>} {active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered [active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered...]}
pg ls-by-osd <osdname (id|osd.id)> {<int>} {active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered [active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered...]}
pg ls-by-pool <poolstr> {active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered [active|clean|down|replay|splitting|scrubbing|scrubq|degraded|inconsistent|peering|repair|recovering|backfill_wait|incomplete|stale|remapped|deep_scrub|backfill|backfill_toofull|recovery_wait|undersized|activating|peered...]}
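For example, using osd ids and a pool name from the earlier output:

# PGs whose primary is osd.87
ceph pg ls-by-primary osd.87
# PGs that have any copy on osd.110
ceph pg ls-by-osd osd.110
# PGs of a single pool, optionally filtered by state
ceph pg ls-by-pool cinder-sas scrubbing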
Repair
# ceph pg repair 13.e1
instructing pg 13.e1 on osd.110 to repair
Routine troubleshooting
pg inconsistent
A pg in the inconsistent state matches this problem; the "1 scrub errors" that follows shows it is scrub-related.
# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors; noout flag(s) set
pg 13.e1 is active+clean+inconsistent, acting [110,55,21]
1 scrub errors
noout flag(s) set
Run the following:
# ceph pg repair 13.e1
instructing pg 13.e1 on osd.110 to repair
Verify
At this point you can see that 13.e1 has entered a deep scrub:
# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 pgs repair; 1 scrub errors; noout flag(s) set
pg 13.e1 is active+clean+scrubbing+deep+inconsistent+repair, acting [110,55,21]
1 scrub errors
noout flag(s) set
After a while the error disappears and pg 13.e1 returns to the active+clean state:
# ceph health detail
HEALTH_WARN noout flag(s) set
noout flag(s) set
Root cause
Ceph scrubs its PGs periodically. An inconsistent state does not necessarily mean the data itself is inconsistent; it only means that the data and its checksum disagree. Once repair is issued, Ceph runs a deep scrub to determine whether the data really is inconsistent. If the deep scrub passes, the data is fine and only the checksum needs to be corrected.
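To see which objects the scrub flagged before deciding to repair, jewel should also allow querying the inconsistency details of a pg (13.e1 is the pg from the example above); this is an optional check, not part of the repair itself:

rados list-inconsistent-obj 13.e1 --format=json-pretty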
request blocked for XXs
Locate the OSDs whose requests are blocked:
ceph health detail | grep blocked
Then lower the primary affinity of those OSDs; this moves some primary PGs elsewhere and reduces the pressure. The current value can be seen in the PRIMARY-AFFINITY column of ceph osd tree:
ceph osd primary-affinity OSD_ID <a value lower than the current one>
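A minimal sketch of that flow, assuming osd.95 was the one reported as blocked (the id and the value 0.5 are only illustrative):

# check the current PRIMARY-AFFINITY column for this osd
ceph osd tree | grep -w osd.95
# lower it so fewer PGs elect this osd as primary
ceph osd primary-affinity 95 0.5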
This is usually caused by cluster imbalance: some OSDs are overloaded and cannot handle requests in time. If it happens frequently, investigate the cause:
1. If client IO demand has grown, try to optimize the clients and cut unnecessary reads and writes.
2. If certain OSDs consistently cannot keep up with requests, temporarily lower their primary affinity and keep an eye on them, since this can be an early sign of disk failure.
3. If all OSDs sharing one journal SSD show this problem, check whether that journal SSD has a write bottleneck or is failing.