Storage planning is a factor every company has to take into account: ever-growing resources not only need management and maintenance, they also need a disaster-recovery and backup mechanism. When our company saw disk utilization about to blow past its threshold during a storage change, we kicked off a data storage migration plan.
res1 through res10 are 10 static resource nodes. Each has a 2TB data disk mounted under /data as the data directory, plus a 32TB shared storage block NFS-mounted under /mnt. Data is synchronized to /mnt via rsync+sersync, keeping it in real time with the 32TB shared storage block.
In the new plan, odd-numbered res nodes act as nfs-master and even-numbered res nodes as nfs-slave, with each pair of nodes sharing one 32T storage block. The 2T SSD data disk remains the program directory, and the old static resource directory /data/webapps/nginx/res/upload/cdn/node* is changed to the new path /resource_nginx/res/upload/cdn/node*. The 10 nodes remain nfs-slaves of the shared storage, with static resources synchronized in real time.
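For concreteness, the mount layout on one node pair might look like the following — a minimal sketch in which the storage server address and export name are placeholder assumptions, not the production values:

# both nodes of a pair mount the same new storage block at /resource_nginx over NFS
# (192.0.2.10 and /export/block1 are assumptions)
mount -t nfs 192.0.2.10:/export/block1 /resource_nginx

# matching /etc/fstab entry so the mount persists across reboots
192.0.2.10:/export/block1  /resource_nginx  nfs  defaults,_netdev  0  0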
① Purchase 5 shared storage blocks.
② Mount one shared storage block per pair of the 10 instances via NFS, 5 blocks in total.
③ Keep the 16T shared storage block previously used for backups as the backup target.
④ Change the res static directory: copy the static resources from the 2T disks into the new storage block paths — node1 and node2's static data into one new block, node3 and node4's into the next, node5 and node6's into the next, and so on, two nodes per block (see the sketch after this list).
⑤ Confirm the new data has synced to all 5 storage blocks before rolling out the update.
⑥ Change nginx's static root to the new res path, and change the directory parameter of the resource project in the central config project to the new res path (also sketched below).
⑦ Node by node, drop the weight to 0, restart the project for a smooth upgrade, and reload nginx so the new path takes effect.
⑧ Check the project's output logs for anomalies.
⑨ Test upload and access functionality.
⑩ Restore the weights, update fstab, and update the rsync+sersync paths to the backup block.
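A sketch of steps ④ and ⑥ for one node pair, as referenced above. The rsync flags and the nginx test-and-reload are standard; the nginx.conf location and the exact form of the root directive are assumptions, not the production values:

# step 4: seed a new storage block with one pair's static data (repeat per pair)
rsync -az /data/webapps/nginx/res/upload/cdn/node1/ /resource_nginx/res/upload/cdn/node1/
rsync -az /data/webapps/nginx/res/upload/cdn/node2/ /resource_nginx/res/upload/cdn/node2/

# step 6: switch the nginx static root to the new path, then verify and reload
# (the nginx.conf path is an assumption)
sed -i 's#/data/webapps/nginx/res#/resource_nginx/res#g' /usr/local/nginx/conf/nginx.conf
nginx -t && nginx -s reload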
① With the August 1 update done, the task for August 2 was to restore the static resource backup scheme and re-adjust the rsync+sersync configuration files. Due to an operational oversight, of the 10 res nodes, res6 was left with its configuration unchanged. Below are the correctly changed files alongside res6's files as they stood at the time.
The correct rsyncd.conf configuration to be restored (res1 as an example):
[root@hmf_res1 ~]# cat /etc/rsyncd.conf
#rsync_config_________________start
#created by wang 2017-04-18 00:08
##rsyncd.conf start##
uid = root
gid = root
use chroot = no
max connections = 200
timeout = 300
pid file = /var/run/rsyncd.pid
lock file = /var/run/rysnc.lock
log file = /var/log/rsyncd.log
ignore errors
read only = false
list = false
hosts allow = 127.0.0.1/8
#hosts deny = 0.0.0.0/32
#auth users = rsync_backup
#secrets file = /etc/rsync.password
[node1]
path = /mnt/node1
[node2]
path = /mnt/node2
[node3]
path = /mnt/node3
[node4]
path = /mnt/node4
[node5]
path = /mnt/node5
[node6]
path = /mnt/node6
[node7]
path = /mnt/node7
[node8]
path = /mnt/node8
[node9]
path = /mnt/node9
[node10]
path = /mnt/node10
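Once this file is in place, the rsync daemon is started against it with the standard invocation:

rsync --daemon --config=/etc/rsyncd.conf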
The correct sersync config.xml configuration to be restored:
[root@hmf_res1 ~]# cat /data/server/sersync/config/config.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<head version="2.5">
    <host hostip="localhost" port="8008"></host>
    <debug start="false"/>
    <fileSystem xfs="false"/>
    <filter start="false">
        <exclude expression="(.*)\.svn"></exclude>
        <exclude expression="(.*)\.gz"></exclude>
        <exclude expression="^info/*"></exclude>
        <exclude expression="^static/*"></exclude>
    </filter>
    <inotify>
        <delete start="true"/>
        <createFolder start="true"/>
        <createFile start="false"/>
        <closeWrite start="true"/>
        <moveFrom start="true"/>
        <moveTo start="true"/>
        <attrib start="false"/>
        <modify start="false"/>
    </inotify>
    <sersync>
        <localpath watch="/resource_nginx/res/upload/cdn/node1">
            <remote ip="127.0.0.1" name="node1"/>
            <!--<remote ip="192.168.8.39" name="tongbu"/>-->
            <!--<remote ip="192.168.8.40" name="tongbu"/>-->
        </localpath>
        <rsync>
            <commonParams params="-az"/>
            <auth start="false" users="rsync_backup" passwordfile="/etc/rsync.password"/>
            <userDefinedPort start="false" port="874"/><!-- port=874 -->
            <timeout start="true" time="600"/><!-- timeout=100 -->
            <ssh start="false"/>
        </rsync>
        <failLog path="/data/server/sersync/logs/rsync_fail.log" timeToExecute="60"/><!--default every 60mins execute once-->
        <crontab start="false" schedule="600"><!--600mins-->
            <crontabfilter start="false">
                <exclude expression="*.php"></exclude>
                <exclude expression="info/*"></exclude>
            </crontabfilter>
        </crontab>
        <plugin start="false" name="command"/>
    </sersync>
    <plugin name="command">
        <param prefix="/bin/sh" suffix="" ignoreError="true"/>  <!--prefix /opt/tongbu/mmm.sh suffix-->
        <filter start="false">
            <include expression="(.*)\.php"/>
            <include expression="(.*)\.sh"/>
        </filter>
    </plugin>
    <plugin name="socket">
        <localpath watch="/opt/tongbu">
            <deshost ip="192.168.138.20" port="8009"/>
        </localpath>
    </plugin>
    <plugin name="refreshCDN">
        <localpath watch="/data0/htdocs/cms.xoyo.com/site/">
            <cdninfo domainname="ccms.chinacache.com" port="80" username="xxxx" passwd="xxxx"/>
            <sendurl base="http://pic.xoyo.com/cms"/>
            <regexurl regex="false" match="cms.xoyo.com/site([/a-zA-Z0-9]*).xoyo.com/images"/>
        </localpath>
    </plugin>
</head>
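sersync is then launched against this configuration: -d backgrounds the process, -r forces an initial full sync before inotify watching begins, -o names the config file. A sketch, assuming the binary sits under /data/server/sersync:

cd /data/server/sersync
./sersync2 -r -d -o /data/server/sersync/config/config.xml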
res6's rsyncd.conf as it stood, unchanged at the time — i.e., the configuration from before the migration project:
[root@hmf_res6 ~]# cat /etc/rsyncd.conf
#rsync_config_________________start
#created by wang 2017-04-18 00:08
##rsyncd.conf start##
uid = root
gid = root
use chroot = no
max connections = 200
timeout = 300
pid file = /var/run/rsyncd.pid
lock file = /var/run/rysnc.lock
log file = /var/log/rsyncd.log
ignore errors
read only = false
list = false
hosts allow = 127.0.0.1/8
#hosts deny = 0.0.0.0/32
#auth users = rsync_backup
#secrets file = /etc/rsync.password
[node1]
path = /mnt/node1
[node2]
path = /mnt/node2
[node3]
path = /mnt/node3
[node4]
path = /mnt/node4
[node5]
path = /mnt/node5
[node6]
path = /mnt/node6
[node7]
path = /mnt/node7
[node8]
path = /mnt/node8
[node9]
path = /mnt/node9
[node10]
path = /mnt/node10
[resource]
# this module was not removed, so syncing into the new block continued
path = /resource_nginx/res/upload/cdn/node6
res6's sersync config.xml, unchanged:
[root@hmf_res6 ~]# cat /data/server/sersync/config/config.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<head version="2.5">
    <host hostip="localhost" port="8008"></host>
    <debug start="false"/>
    <fileSystem xfs="false"/>
    <filter start="false">
        <exclude expression="(.*)\.svn"></exclude>
        <exclude expression="(.*)\.gz"></exclude>
        <exclude expression="^info/*"></exclude>
        <exclude expression="^static/*"></exclude>
    </filter>
    <inotify>
        <delete start="true"/>
        <createFolder start="true"/>
        <createFile start="false"/>
        <closeWrite start="true"/>
        <moveFrom start="true"/>
        <moveTo start="true"/>
        <attrib start="false"/>
        <modify start="false"/>
    </inotify>
    <sersync>
        <!-- not changed to the new storage path: still watching the old 2T path, to which no new data was being written -->
        <localpath watch="/data/webapps/nginx/res/upload/cdn/node6">
            <remote ip="127.0.0.1" name="node6"/>
            <!--<remote ip="192.168.8.39" name="tongbu"/>-->
            <!--<remote ip="192.168.8.40" name="tongbu"/>-->
        </localpath>
        <rsync>
            <commonParams params="-az"/>
            <auth start="false" users="rsync_backup" passwordfile="/etc/rsync.password"/>
            <userDefinedPort start="false" port="874"/><!-- port=874 -->
            <timeout start="true" time="600"/><!-- timeout=100 -->
            <ssh start="false"/>
        </rsync>
        <failLog path="/data/server/sersync/logs/rsync_fail.log" timeToExecute="60"/><!--default every 60mins execute once-->
        <crontab start="false" schedule="600"><!--600mins-->
            <crontabfilter start="false">
                <exclude expression="*.php"></exclude>
                <exclude expression="info/*"></exclude>
            </crontabfilter>
        </crontab>
        <plugin start="false" name="command"/>
    </sersync>
    <plugin name="command">
        <param prefix="/bin/sh" suffix="" ignoreError="true"/>  <!--prefix /opt/tongbu/mmm.sh suffix-->
        <filter start="false">
            <include expression="(.*)\.php"/>
            <include expression="(.*)\.sh"/>
        </filter>
    </plugin>
    <plugin name="socket">
        <localpath watch="/opt/tongbu">
            <deshost ip="192.168.138.20" port="8009"/>
        </localpath>
    </plugin>
    <plugin name="refreshCDN">
        <localpath watch="/data0/htdocs/cms.xoyo.com/site/">
            <cdninfo domainname="ccms.chinacache.com" port="80" username="xxxx" passwd="xxxx"/>
            <sendurl base="http://pic.xoyo.com/cms"/>
            <regexurl regex="false" match="cms.xoyo.com/site([/a-zA-Z0-9]*).xoyo.com/images"/>
        </localpath>
    </plugin>
</head>
② Under sersync's default sync rules, <delete start="true"/> mirrors the watched source directory: any file absent from the source directory is deleted on the target, and files present are synced over, keeping both sides identical. This was the fatal fuse.
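To make the mechanism concrete: for each batch of inotify events, sersync shells out to rsync with --delete against the watched directory, roughly equivalent to the command below, so anything missing from the watch path is also removed from the target module. An illustration only, not the exact command line sersync builds:

rsync -az --delete /data/webapps/nginx/res/upload/cdn/node6/ rsync://127.0.0.1/node6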
③ A week went by, through August 7, without any anomaly being noticed. During that whole period, however, res6 retained none of its newly uploaded files: sersync was still watching the old path directory, which by definition had no new files, so every new file uploaded to res6 was deleted the moment it arrived, following the <delete start="true"/> rule. By this point res6 had long been returning 404s in volume, and monitoring of static resource HTTP status codes was not yet in place.
④ As it happened, permission to delete the old-path data came through that very day, and during operations the old data was removed with rm:
rm -rf /data/webapps/nginx/res/upload/cdn/node1/*
rm -rf /data/webapps/nginx/res/upload/cdn/node2/*
rm -rf /data/webapps/nginx/res/upload/cdn/node3/*
rm -rf /data/webapps/nginx/res/upload/cdn/node4/*
rm -rf /data/webapps/nginx/res/upload/cdn/node5/*
rm -rf /data/webapps/nginx/res/upload/cdn/node6/*
rm -rf /data/webapps/nginx/res/upload/cdn/node7/*
rm -rf /data/webapps/nginx/res/upload/cdn/node8/*
rm -rf /data/webapps/nginx/res/upload/cdn/node9/*
rm -rf /data/webapps/nginx/res/upload/cdn/node10/*
On August 8 the incident hit: res6's static resources were returning 404 in volume. The access logs, checked immediately, showed 404s building up steadily from the moment the old data was deleted; by midnight on the 7th everything was 404. Realizing it was a data problem, we found the data gone from the new block storage at /resource_nginx/res/upload/cdn/node6/, and when we turned to restore from backup, the backup data at /mnt/node6/ was gone as well.
The cause: sersync was watching /data/webapps/nginx/res/upload/cdn/node6/ and syncing it to two paths on the target server, /resource_nginx/res/upload/cdn/node6 and /mnt/node6 — so once rm emptied the watched directory, the deletion was mirrored into both, wiping out the new storage and the backup at once.
① Fix for problem one: missed or incorrect configuration during operations means unified, automated operations were not in place. The architecture was moved to saltstack afterward, with host groups for unified command execution, unified file distribution, and monitoring:
# in the salt master configuration (/etc/salt/master): define the node group
nodegroups:
  hmf_res: 'L@hmf_res1,hmf_res2,hmf_res3,hmf_res4,hmf_res5,hmf_res6,hmf_res7,hmf_res8,hmf_res9,hmf_res10'

# run one command across the whole group at once
salt -N 'hmf_res' cmd.run 'command'
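File distribution works the same way: a corrected sersync config can be pushed to the whole group in one command — a sketch, assuming the fixed file has been placed under the master's file root as salt://sersync/config.xml:

salt -N 'hmf_res' cp.get_file salt://sersync/config.xml /data/server/sersync/config/config.xml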
② Fix for problem two: in a production environment, unless file-for-file consistency between the source and target servers is deliberately required, be sure to turn off this default. In the sersync main configuration file, change:

<delete start="false"/>
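sersync reads its configuration only at startup, so the change has to be followed by a restart — a sketch, assuming the binary location used on these hosts:

pkill sersync2    # stop the running watcher
cd /data/server/sersync
./sersync2 -d -o /data/server/sersync/config/config.xml    # relaunch with the new config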
③ Fix for problem three: improve and extend the monitoring scripts on the monitoring server so that surges of 404, 500, and 502 status codes are caught early (a minimal check is sketched below).
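A minimal sketch of such a check, counting error responses in the tail of the nginx access log and alerting past a threshold; the log path, sample size, threshold, and alert address are all assumptions:

#!/bin/bash
# count 404/500/502 responses among the most recent 10000 requests
LOG=/data/webapps/nginx/logs/access.log    # assumed log path
ERRORS=$(tail -n 10000 "$LOG" | awk '$9 ~ /^(404|500|502)$/' | wc -l)
if [ "$ERRORS" -gt 500 ]; then             # assumed threshold
    echo "$(hostname): $ERRORS error responses in last 10000 requests" \
        | mail -s "static resource status alert" ops@example.com
fi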
④ Take the long view in storage planning: plans never keep up with change, and ever-growing resources must never be squeezed by cost alone. And use rm with caution — when you reach for it, know exactly the consequences it can bring.
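One habit that would have bounded the damage here: stage a deletion rather than executing it outright — move the data aside, sit through a full verification cycle, then remove it. In this incident the sync would still have propagated the removal to the targets, but the data itself would have survived locally and the backup could have been re-seeded. A sketch:

# stage the old data instead of rm'ing it in place
mkdir -p /data/trash
mv /data/webapps/nginx/res/upload/cdn/node6 /data/trash/node6.$(date +%F)
# verify the new storage and the backups for a few days, then and only then:
# rm -rf /data/trash/node6.*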