Storage planning is a factor every company has to take into account: ever-growing resources not only need management and maintenance, they also need a disaster-recovery and backup mechanism. When our company saw disk utilization about to blow past its threshold during a storage change, we kicked off a data storage migration plan.
res1 through res10 are 10 static resource nodes. Each has a 2TB data disk mounted under /data as the data directory, plus a 32TB shared storage block NFS-mounted under /mnt. Data is synchronized to /mnt via rsync+sersync, keeping it in real time with the 32TB shared storage block.
In the new plan, odd-numbered res nodes act as nfs-master and even-numbered res nodes as nfs-slave, with each pair of nodes sharing one 32T storage block. The 2T SSD data disk remains the program directory, and the old static resource directory /data/webapps/nginx/res/upload/cdn/node* is changed to the new path /resource_nginx/res/upload/cdn/node*. The 10 nodes remain nfs-slaves of the shared storage, with static resources synchronized in real time.
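For concreteness, the mount layout on one node pair might look like the following — a minimal sketch in which the storage server address and export name are placeholder assumptions, not the production values:

# both nodes of a pair mount the same new storage block at /resource_nginx over NFS
# (192.0.2.10 and /export/block1 are assumptions)
mount -t nfs 192.0.2.10:/export/block1 /resource_nginx

# matching /etc/fstab entry so the mount persists across reboots
192.0.2.10:/export/block1  /resource_nginx  nfs  defaults,_netdev  0  0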
① Purchase 5 shared storage blocks.
② Mount one shared storage block per pair of the 10 instances via NFS, 5 blocks in total.
③ Keep the 16T shared storage block previously used for backups as the backup target.
④ Change the res static directory: copy the static resources from the 2T disks into the new storage block paths — node1 and node2's static data into one new block, node3 and node4's into the next, node5 and node6's into the next, and so on, two nodes per block (see the sketch after this list).
⑤ Confirm the new data has synced to all 5 storage blocks before rolling out the update.
⑥ Change nginx's static root to the new res path, and change the directory parameter of the resource project in the central config project to the new res path (also sketched below).
⑦ Node by node, drop the weight to 0, restart the project for a smooth upgrade, and reload nginx so the new path takes effect.
⑧ Check the project's output logs for anomalies.
⑨ Test upload and access functionality.
⑩ Restore the weights, update fstab, and update the rsync+sersync paths to the backup block.
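A sketch of steps ④ and ⑥ for one node pair, as referenced above. The rsync flags and the nginx test-and-reload are standard; the nginx.conf location and the exact form of the root directive are assumptions, not the production values:

# step 4: seed a new storage block with one pair's static data (repeat per pair)
rsync -az /data/webapps/nginx/res/upload/cdn/node1/ /resource_nginx/res/upload/cdn/node1/
rsync -az /data/webapps/nginx/res/upload/cdn/node2/ /resource_nginx/res/upload/cdn/node2/

# step 6: switch the nginx static root to the new path, then verify and reload
# (the nginx.conf path is an assumption)
sed -i 's#/data/webapps/nginx/res#/resource_nginx/res#g' /usr/local/nginx/conf/nginx.conf
nginx -t && nginx -s reload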
① With the August 1 update done, the task for August 2 was to restore the static resource backup scheme and re-adjust the rsync+sersync configuration files. Due to an operational oversight, of the 10 res nodes, res6 was left with its configuration unchanged. Below are the correctly changed files alongside res6's files as they stood at the time.
The correct rsyncd.conf configuration to be restored (res1 as an example):
[root@hmf_res1 ~]# cat /etc/rsyncd.conf
#rsync_config_________________start
#created by wang 2017-04-18 00:08
##rsyncd.conf start##
uid = root
gid = root
use chroot = no
max connections = 200
timeout = 300
pid file = /var/run/rsyncd.pid
lock file = /var/run/rysnc.lock
log file = /var/log/rsyncd.log
ignore errors
read only = false
list = false
hosts allow = 127.0.0.1/8
#hosts deny = 0.0.0.0/32
#auth users = rsync_backup
#secrets file = /etc/rsync.password
[node1]
path = /mnt/node1
[node2]
path = /mnt/node2
[node3]
path = /mnt/node3
[node4]
path = /mnt/node4
[node5]
path = /mnt/node5
[node6]
path = /mnt/node6
[node7]
path = /mnt/node7
[node8]
path = /mnt/node8
[node9]
path = /mnt/node9
[node10]
path = /mnt/node10
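Once this file is in place, the rsync daemon is started against it with the standard invocation:

rsync --daemon --config=/etc/rsyncd.conf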
The correct sersync config.xml configuration to be restored:
[root@hmf_res1 ~]# cat /data/server/sersync/config/config.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<head version="2.5">
    <host hostip="localhost" port="8008"></host>
    <debug start="false"/>
    <fileSystem xfs="false"/>
    <filter start="false">
        <exclude expression="(.*)\.svn"></exclude>
        <exclude expression="(.*)\.gz"></exclude>
        <exclude expression="^info/*"></exclude>
        <exclude expression="^static/*"></exclude>
    </filter>
    <inotify>
        <delete start="true"/>
        <createFolder start="true"/>
        <createFile start="false"/>
        <closeWrite start="true"/>
        <moveFrom start="true"/>
        <moveTo start="true"/>
        <attrib start="false"/>
        <modify start="false"/>
    </inotify>
    <sersync>
        <localpath watch="/resource_nginx/res/upload/cdn/node1">
            <remote ip="127.0.0.1" name="node1"/>
            <!--<remote ip="192.168.8.39" name="tongbu"/>-->
            <!--<remote ip="192.168.8.40" name="tongbu"/>-->
        </localpath>
        <rsync>
            <commonParams params="-az"/>
            <auth start="false" users="rsync_backup" passwordfile="/etc/rsync.password"/>
            <userDefinedPort start="false" port="874"/><!-- port=874 -->
            <timeout start="true" time="600"/><!-- timeout=100 -->
            <ssh start="false"/>
        </rsync>
        <failLog path="/data/server/sersync/logs/rsync_fail.log" timeToExecute="60"/><!--default every 60mins execute once-->
        <crontab start="false" schedule="600"><!--600mins-->
            <crontabfilter start="false">
                <exclude expression="*.php"></exclude>
                <exclude expression="info/*"></exclude>
            </crontabfilter>
        </crontab>
        <plugin start="false" name="command"/>
    </sersync>
    <plugin name="command">
        <param prefix="/bin/sh" suffix="" ignoreError="true"/>  <!--prefix /opt/tongbu/mmm.sh suffix-->
        <filter start="false">
            <include expression="(.*)\.php"/>
            <include expression="(.*)\.sh"/>
        </filter>
    </plugin>
    <plugin name="socket">
        <localpath watch="/opt/tongbu">
            <deshost ip="192.168.138.20" port="8009"/>
        </localpath>
    </plugin>
    <plugin name="refreshCDN">
        <localpath watch="/data0/htdocs/cms.xoyo.com/site/">
            <cdninfo domainname="ccms.chinacache.com" port="80" username="xxxx" passwd="xxxx"/>
            <sendurl base="http://pic.xoyo.com/cms"/>
            <regexurl regex="false" match="cms.xoyo.com/site([/a-zA-Z0-9]*).xoyo.com/images"/>
        </localpath>
    </plugin>
</head>
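sersync is then launched against this configuration: -d backgrounds the process, -r forces an initial full sync before inotify watching begins, -o names the config file. A sketch, assuming the binary sits under /data/server/sersync:

cd /data/server/sersync
./sersync2 -r -d -o /data/server/sersync/config/config.xml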
res6's rsyncd.conf as it stood, unchanged at the time — i.e., the configuration from before the migration project:
[root@hmf_res6 ~]# cat /etc/rsyncd.conf
#rsync_config_________________start
#created by wang 2017-04-18 00:08
##rsyncd.conf start##
uid = root
gid = root
use chroot = no
max connections = 200
timeout = 300
pid file = /var/run/rsyncd.pid
lock file = /var/run/rysnc.lock
log file = /var/log/rsyncd.log
ignore errors
read only = false
list = false
hosts allow = 127.0.0.1/8
#hosts deny = 0.0.0.0/32
#auth users = rsync_backup
#secrets file = /etc/rsync.password
[node1]
path = /mnt/node1
[node2]
path = /mnt/node2
[node3]
path = /mnt/node3
[node4]
path = /mnt/node4
[node5]
path = /mnt/node5
[node6]
path = /mnt/node6
[node7]
path = /mnt/node7
[node8]
path = /mnt/node8
[node9]
path = /mnt/node9
[node10]
path = /mnt/node10
[resource]
# this module was not removed, so syncing into the new block continued
path = /resource_nginx/res/upload/cdn/node6
res6's sersync config.xml, unchanged:
[root@hmf_res6 ~]# cat /data/server/sersync/config/config.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<head version="2.5">
    <host hostip="localhost" port="8008"></host>
    <debug start="false"/>
    <fileSystem xfs="false"/>
    <filter start="false">
        <exclude expression="(.*)\.svn"></exclude>
        <exclude expression="(.*)\.gz"></exclude>
        <exclude expression="^info/*"></exclude>
        <exclude expression="^static/*"></exclude>
    </filter>
    <inotify>
        <delete start="true"/>
        <createFolder start="true"/>
        <createFile start="false"/>
        <closeWrite start="true"/>
        <moveFrom start="true"/>
        <moveTo start="true"/>
        <attrib start="false"/>
        <modify start="false"/>
    </inotify>
    <sersync>
        <!-- not changed to the new storage path: still watching the old 2T path, to which no new data was being written -->
        <localpath watch="/data/webapps/nginx/res/upload/cdn/node6">
            <remote ip="127.0.0.1" name="node6"/>
            <!--<remote ip="192.168.8.39" name="tongbu"/>-->
            <!--<remote ip="192.168.8.40" name="tongbu"/>-->
        </localpath>
        <rsync>
            <commonParams params="-az"/>
            <auth start="false" users="rsync_backup" passwordfile="/etc/rsync.password"/>
            <userDefinedPort start="false" port="874"/><!-- port=874 -->
            <timeout start="true" time="600"/><!-- timeout=100 -->
            <ssh start="false"/>
        </rsync>
        <failLog path="/data/server/sersync/logs/rsync_fail.log" timeToExecute="60"/><!--default every 60mins execute once-->
        <crontab start="false" schedule="600"><!--600mins-->
            <crontabfilter start="false">
                <exclude expression="*.php"></exclude>
                <exclude expression="info/*"></exclude>
            </crontabfilter>
        </crontab>
        <plugin start="false" name="command"/>
    </sersync>
    <plugin name="command">
        <param prefix="/bin/sh" suffix="" ignoreError="true"/>  <!--prefix /opt/tongbu/mmm.sh suffix-->
        <filter start="false">
            <include expression="(.*)\.php"/>
            <include expression="(.*)\.sh"/>
        </filter>
    </plugin>
    <plugin name="socket">
        <localpath watch="/opt/tongbu">
            <deshost ip="192.168.138.20" port="8009"/>
        </localpath>
    </plugin>
    <plugin name="refreshCDN">
        <localpath watch="/data0/htdocs/cms.xoyo.com/site/">
            <cdninfo domainname="ccms.chinacache.com" port="80" username="xxxx" passwd="xxxx"/>
            <sendurl base="http://pic.xoyo.com/cms"/>
            <regexurl regex="false" match="cms.xoyo.com/site([/a-zA-Z0-9]*).xoyo.com/images"/>
        </localpath>
    </plugin>
</head>
② Under sersync's default sync rules, <delete start="true"/> mirrors the watched source directory: any file absent from the source directory is deleted on the target, and files present are synced over, keeping both sides identical. This was the fatal fuse.
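To make the mechanism concrete: for each batch of inotify events, sersync shells out to rsync with --delete against the watched directory, roughly equivalent to the command below, so anything missing from the watch path is also removed from the target module. An illustration only, not the exact command line sersync builds:

rsync -az --delete /data/webapps/nginx/res/upload/cdn/node6/ rsync://127.0.0.1/node6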
③ A week went by, through August 7, without any anomaly being noticed. During that whole period, however, res6 retained none of its newly uploaded files: sersync was still watching the old path directory, which by definition had no new files, so every new file uploaded to res6 was deleted the moment it arrived, following the <delete start="true"/> rule. By this point res6 had long been returning 404s in volume, and monitoring of static resource HTTP status codes was not yet in place.
④ As it happened, permission to delete the old-path data came through that very day, and during operations the old data was removed with rm:
rm -rf /data/webapps/nginx/res/upload/cdn/node1/*
rm -rf /data/webapps/nginx/res/upload/cdn/node2/*
rm -rf /data/webapps/nginx/res/upload/cdn/node3/*
rm -rf /data/webapps/nginx/res/upload/cdn/node4/*
rm -rf /data/webapps/nginx/res/upload/cdn/node5/*
rm -rf /data/webapps/nginx/res/upload/cdn/node6/*
rm -rf /data/webapps/nginx/res/upload/cdn/node7/*
rm -rf /data/webapps/nginx/res/upload/cdn/node8/*
rm -rf /data/webapps/nginx/res/upload/cdn/node9/*
rm -rf /data/webapps/nginx/res/upload/cdn/node10/*
On August 8 the incident hit: res6's static resources were returning 404 in volume. The access logs, checked immediately, showed 404s building up steadily from the moment the old data was deleted; by midnight on the 7th everything was 404. Realizing it was a data problem, we found the data gone from the new block storage at /resource_nginx/res/upload/cdn/node6/, and when we turned to restore from backup, the backup data at /mnt/node6/ was gone as well.
The cause: sersync was watching /data/webapps/nginx/res/upload/cdn/node6/ and syncing it to two paths on the target server, /resource_nginx/res/upload/cdn/node6 and /mnt/node6 — so once rm emptied the watched directory, the deletion was mirrored into both, wiping out the new storage and the backup at once.
① Fix for problem one: missed or incorrect configuration during operations means unified, automated operations were not in place. The architecture was moved to saltstack afterward, with host groups for unified command execution, unified file distribution, and monitoring:
# in the salt master configuration (/etc/salt/master): define the node group
nodegroups:
  hmf_res: 'L@hmf_res1,hmf_res2,hmf_res3,hmf_res4,hmf_res5,hmf_res6,hmf_res7,hmf_res8,hmf_res9,hmf_res10'

# run one command across the whole group at once
salt -N 'hmf_res' cmd.run 'command'
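File distribution works the same way: a corrected sersync config can be pushed to the whole group in one command — a sketch, assuming the fixed file has been placed under the master's file root as salt://sersync/config.xml:

salt -N 'hmf_res' cp.get_file salt://sersync/config.xml /data/server/sersync/config/config.xml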
② Fix for problem two: in a production environment, unless file-for-file consistency between the source and target servers is deliberately required, be sure to turn off this default. In the sersync main configuration file, change:

<delete start="false"/>
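sersync reads its configuration only at startup, so the change has to be followed by a restart — a sketch, assuming the binary location used on these hosts:

pkill sersync2    # stop the running watcher
cd /data/server/sersync
./sersync2 -d -o /data/server/sersync/config/config.xml    # relaunch with the new config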
③ Fix for problem three: improve and extend the monitoring scripts on the monitoring server so that surges of 404, 500, and 502 status codes are caught early (a minimal check is sketched below).
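A minimal sketch of such a check, counting error responses in the tail of the nginx access log and alerting past a threshold; the log path, sample size, threshold, and alert address are all assumptions:

#!/bin/bash
# count 404/500/502 responses among the most recent 10000 requests
LOG=/data/webapps/nginx/logs/access.log    # assumed log path
ERRORS=$(tail -n 10000 "$LOG" | awk '$9 ~ /^(404|500|502)$/' | wc -l)
if [ "$ERRORS" -gt 500 ]; then             # assumed threshold
    echo "$(hostname): $ERRORS error responses in last 10000 requests" \
        | mail -s "static resource status alert" ops@example.com
fi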
④ Take the long view in storage planning: plans never keep up with change, and ever-growing resources must never be squeezed by cost alone. And use rm with caution — when you reach for it, know exactly the consequences it can bring.
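One habit that would have bounded the damage here: stage a deletion rather than executing it outright — move the data aside, sit through a full verification cycle, then remove it. In this incident the sync would still have propagated the removal to the targets, but the data itself would have survived locally and the backup could have been re-seeded. A sketch:

# stage the old data instead of rm'ing it in place
mkdir -p /data/trash
mv /data/webapps/nginx/res/upload/cdn/node6 /data/trash/node6.$(date +%F)
# verify the new storage and the backups for a few days, then and only then:
# rm -rf /data/trash/node6.*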