See the Google SRE ebook. For RPC services I usually build monitoring dashboards around qps/rtt/error: qps maps to saturation and traffic (provided the specific service has been load-tested), rtt (round-trip time per query) maps to latency, and error speaks for itself.
I used to derive both qps and rtt from a single histogram. The problem: when requests time out, the histogram's count dips first and then climbs back, because an observation is only recorded once a request completes. The correct approach is:
counter_inc()
start_time = now()
call()
end_time = now()
histogram_observe(end_time - start_time)
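A minimal sketch of this pattern in Python. The `Counter`/`Histogram` classes and `instrumented_call` helper here are hand-rolled stand-ins I made up for illustration, not a real Prometheus client; in production you would use an actual client library.

```python
import time

class Counter:
    """Toy stand-in for a Prometheus counter."""
    def __init__(self):
        self.value = 0
    def inc(self):
        self.value += 1

class Histogram:
    """Toy stand-in for a Prometheus histogram (cumulative buckets)."""
    def __init__(self, buckets):
        self.buckets = sorted(buckets)
        self.counts = {le: 0 for le in self.buckets}
        self.count = 0
        self.sum = 0.0
    def observe(self, v):
        self.count += 1
        self.sum += v
        for le in self.buckets:
            if v <= le:
                self.counts[le] += 1

requests_started = Counter()
request_duration = Histogram([0.1, 0.5, 1.0, float("inf")])

def instrumented_call(fn):
    requests_started.inc()          # QPS counter moves at request start,
    start = time.monotonic()        # so timeouts don't make qps dip
    try:
        return fn()
    finally:
        # latency is only observed once the call finishes (or times out)
        request_duration.observe(time.monotonic() - start)
```

Because the counter is incremented before the call, a pile-up of slow or timing-out requests still shows up immediately in the qps graph, while the histogram count lags until those requests complete.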
For QPS, for example:
sum(rate(xxxx_count{app="xxx"}[30s])) by (method)
For example, with a 5s scrape interval, rate gives the average qps over that 5s window; the real qps is not evenly distributed within it.
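A toy calculation (hypothetical sample values) showing why rate over a window only yields the window-average qps:

```python
# Counter samples taken 5 s apart; all 50 requests actually arrive
# within the first second of the window.
samples = [(0, 0), (5, 50)]  # (timestamp_s, counter_value)

(t0, v0), (t1, v1) = samples
rate = (v1 - v0) / (t1 - t0)
print(rate)  # 10.0 -- reported as 10 qps, though the peak was ~50 qps
```

Short bursts inside the scrape window are smoothed away, which matters when alerting on qps spikes.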
Mean:
sum(rate(xxx_ts_sum{app="xxx"}[30s]) / rate(xxx_ts_count{app="xxx"}[30s])) by (method)
Median:
histogram_quantile(0.5, sum(rate(xxx_ts_bucket{app="xxx"}[30s])) by (le, method))
p99:
histogram_quantile(0.99, sum(rate(xxx_ts_bucket{app="xxx"}[30s])) by (le, method))
Running wget http://localhost:9090/metrics produces output like the sample below: only the number of requests falling into each bucket is recorded.
Suppose the buckets are only [0, 1000, 5000] while every request actually takes 100ms. histogram_quantile cannot know how requests are distributed inside the 0–1000ms bucket, so it interpolates linearly within it: the median is reported as 500ms and the p99 as 990ms.
So in ranges where the distribution is highly concentrated, the buckets need to be subdivided more finely.
Likewise, when analysing tail latency, keep this bucket-induced statistical error in mind.
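A sketch of the linear interpolation that histogram_quantile performs inside a bucket, simplified (cumulative bucket counts as (le, count) pairs, no +Inf handling), reproducing the all-requests-at-100ms example:

```python
def histogram_quantile(q, buckets):
    """buckets: sorted list of (le, cumulative_count).

    Mimics Prometheus's linear interpolation inside the target bucket;
    a simplified sketch, not the exact server implementation.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            # interpolate linearly within [prev_le, le]
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return buckets[-1][0]

# 1000 requests, all actually 100 ms, but coarse buckets [1000, 5000]:
buckets = [(1000, 1000), (5000, 1000)]
print(histogram_quantile(0.50, buckets))  # 500.0
print(histogram_quantile(0.99, buckets))  # 990.0
```

The estimate is driven entirely by the bucket boundaries, not the true 100ms latency, which is exactly the error described above.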
# TYPE http_ts histogram
# HELP http_ts Http Post execution time.
http_ts_bucket{host_name="ubuntu",app="nil",method="post",api="http://127.0.0.1/test",code="200",le="2"} 1
http_ts_bucket{host_name="ubuntu",app="nil",method="post",api="http://127.0.0.1/test",code="200",le="5"} 1
http_ts_bucket{host_name="ubuntu",app="nil",method="post",api="http://127.0.0.1/test",code="200",le="10"} 1
http_ts_bucket{host_name="ubuntu",app="nil",method="post",api="http://127.0.0.1/test",code="200",le="25"} 1
http_ts_bucket{host_name="ubuntu",app="nil",method="post",api="http://127.0.0.1/test",code="200",le="50"} 1
http_ts_bucket{host_name="ubuntu",app="nil",method="post",api="http://127.0.0.1/test",code="200",le="100"} 1
http_ts_bucket{host_name="ubuntu",app="nil",method="post",api="http://127.0.0.1/test",code="200",le="250"} 1
http_ts_bucket{host_name="ubuntu",app="nil",method="post",api="http://127.0.0.1/test",code="200",le="500"} 1