顺风车运营研发团队 黄桃php
问题现象
某机器这段时间出现cpu-idle掉地的报警 如图:php-fpm
缘由分析优化
查看当时的监控(php-fpm-idle、cpu-idle,io-wait、io-write 等)
(1)php-fpm-idle今天出现两次突降,一次是12点左右,一次是16:30左右,查看整周也常常出现突降,如图spa
(2)io-wait在11:58分时 忽然升高 3d
(3)io-write也在11:58时出现大量写: 日志
(4)cpu-idle当时短暂出现下降,以后出现徒增,可是结合整周的曲线来看,一直维持在70-80之间,徒增缘由待分析:code
(5)缘由推测:由于当时出现了大批量的写日志,致使io-wait上升,php-fpm进程由于写文件出现延时,形成总体响应过慢,从而致使了fpm掉地; 对同一个文件进行 write时,大批量并行会出现等待,阻塞blog
验证推测
(1)查看当时的php-fpm的慢日志,看当时阻塞的地方,基本是在调用fwrite阻塞进程
(2)查看当时的程序日志trace.log的大小,日志文件越大的时间段,正好是fpm-idle下滑严重的阶段: 图片
(3)在经过 sar命令,验证下当时的写入磁盘状况,在出现掉地的时间段确实出现极大的写入,wr_sec/s 从每秒几百低峰的几百增加到十几万:
问题缘由及优化建议
写入日志大可能两方面缘由:
(1)当时请求暴增
(2)请求未暴增,可是某些请求触发了某些不合理的打日志
验证缘由1,出现问题时间段每秒的流量
32 [02/Jul/2018:12:01:12 18 [02/Jul/2018:12:01:13 18 [02/Jul/2018:12:01:14 42 [02/Jul/2018:12:01:15 30 [02/Jul/2018:12:01:16 35 [02/Jul/2018:12:01:17 26 [02/Jul/2018:12:01:18 30 [02/Jul/2018:12:01:19 1 [02/Jul/2018:12:01:22 108 [02/Jul/2018:12:01:24 17 [02/Jul/2018:12:01:25 1 [02/Jul/2018:12:01:27 1 [02/Jul/2018:12:01:29 1 [02/Jul/2018:12:01:30 9 [02/Jul/2018:12:01:33 1 [02/Jul/2018:12:01:31 1 [02/Jul/2018:12:01:32 146 [02/Jul/2018:12:01:33 62 [02/Jul/2018:12:01:34 44 [02/Jul/2018:12:01:35 1 [02/Jul/2018:12:01:37 1 [02/Jul/2018:12:01:38 1 [02/Jul/2018:12:01:41 2 [02/Jul/2018:12:01:44 2 [02/Jul/2018:12:01:50 1 [02/Jul/2018:12:01:45 12 [02/Jul/2018:12:01:50 2 [02/Jul/2018:12:01:45 7 [02/Jul/2018:12:01:50 1 [02/Jul/2018:12:01:46 1 [02/Jul/2018:12:01:50 1 [02/Jul/2018:12:01:46 15 [02/Jul/2018:12:01:50 7 [02/Jul/2018:12:01:48 2 [02/Jul/2018:12:01:50 1 [02/Jul/2018:12:01:48 342 [02/Jul/2018:12:01:50 65 [02/Jul/2018:12:01:51 46 [02/Jul/2018:12:01:52 54 [02/Jul/2018:12:01:53 1 [02/Jul/2018:12:01:55 1 [02/Jul/2018:12:01:56 1 [02/Jul/2018:12:01:57 1 [02/Jul/2018:12:01:59 16 [02/Jul/2018:12:02:03 1 [02/Jul/2018:12:02:01 1 [02/Jul/2018:12:02:02 42 [02/Jul/2018:12:02:03 1 [02/Jul/2018:12:02:02 187 [02/Jul/2018:12:02:03 39 [02/Jul/2018:12:02:04 40 [02/Jul/2018:12:02:05 25 [02/Jul/2018:12:02:06 44 [02/Jul/2018:12:02:07 29 [02/Jul/2018:12:02:08
正常状况是QPS是30左右,出问题时间段则很是不稳定,时高时低,相差很是大,例如在 12:01:50 时 342 qps,可是前十几秒则基本都是个位数;缘由?前几十秒都阻塞了,响应不了,积累到了12:01:50才响应;总体流量并未出现暴增;
验证缘由2,查看当时traceId日志,查看写进去的内容:
发现这段写入很是巨大,一行就119kb ,总共写了 33555 行,总大小占:33555 * 119KB = 3993045KB =3899M 基本能够判定为这行出的问题了
优化建议:在底层打日志的类中对字符串长度作限制,避免这种大批量的写入;