nginx 日志分析

时间 2019-11-05

标签 nginx 日志分析栏目 Nginx 繁體版

原文原文链接

一、日志简介php

1. nginx日志主要有两种：访问日志和错误日志。访问日志主要记录客户端访问nginx的每个请求，格式能够自定义；错误日志主要记录客户端访问nginx出错时的日志，格式不支持自定义。两种日志均可以选择性关闭。
1. 经过访问日志，你能够获得用户地域来源、跳转来源、使用终端、某个URL访问量等相关信息；经过错误日志，你能够获得系统某个服务或server的性能瓶颈等。所以，将日志好好利用，你能够获得不少有价值的信息。

二、访问日志html

[Access.log]

log_format  main  '$remote_addr $remote_user [$time_local] "$request" $http_host '
              '$status $upstream_status $body_bytes_sent "$http_referer" '
              '"$http_user_agent" $ssl_protocol $ssl_cipher $upstream_addr '
              '$request_time $upstream_response_time'; 

变量名称    变量描述    举例说明
$remote_addr    客户端地址    例：10.140.15.91
$remote_user    客户端用户名称    -
$time_local    访问时间和时区    例：18/Jul/2016:17:00:01 +0800
$request    请求的URI和HTTP协议"    例：GET /pa/img/home/logo-alipay-t.png HTTP/1.1"
$http_host    请求地址，即浏览器中你输入的地址(IP或域名)    例：img.xxxxx.com 10.253.70.113
$status    HTTP请求状态    例：200
$upstream_status    upstream状态    例：200
$body_bytes_sent    发送给客户端文件内容大小    例：547
$http_referer    跳转来源     例："https://cashier.xxxxx.com.../"
$http_user_agent    用户终端代理    例："Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; SV1; GTB7.0; .NET4.0C;
$ssl_protocol    SSL协议版本    例：TLSv1
$ssl_cipher    交换数据中的算法    例：RC4-SHA
$upstream_addr    后台upstream的地址，即真正提供服务的主机地址    例：10.238.35.247:80
$request_time    整个请求的总时间    例：0.205
$upstream_response_time    请求过程当中，upstream响应时间    例：0.002

实例：nginx

10.9.137.91 - [02/Aug/2016:14:47:12 +0800] "GET /images/XX/20160324752729.png HTTP/1.1"img.xxxxx.com 200 200 2038 https://cashier.xxxxx.com/XX/PaymentResult.htm?payNo=XX&outBizNo=2012XX "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; Tablet PC 2.0; 360SE)" TLSv1 AES128-SHA 10.228.21.237:80 0.198 0.001

线下测试（$http_referer）：算法

10.14.21.107 - - [14/Aug/2016:17:28:22 +0800] "GET /spanner/watch/v1?--db=ztg-1&--mode=compare&--index=status&--option=&--cluster=whole&-F=2016/8/12-00:00:00&-T=+2880&-i=1&-n=0&_=1344936501292 HTTP/1.1" 200 94193 "http://spanner.xxxxx.net/optionFrame/history.html" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.60 Safari/537.1"

备注：$http_referer和重定向有关。

三、错误日志后端

错误信息                                            错误说明
"upstream prematurely closed connection"        请求uri的时候出现的异常，是因为upstream还未返回应答给用户时用户断掉链接形成的，对系统没有影响，能够忽略
"recv() failed (104: Connection reset by peer)"
        （1）服务器的并发链接数超过了其承载量，服务器会将其中一些链接Down掉； 
        （2）客户关掉了浏览器，而服务器还在给客户端发送数据； 
        （3）浏览器端按了Stop
"(111: Connection refused) while connecting to upstream"    用户在链接时，若遇到后端upstream挂掉或者不通，会收到该错误
"(111: Connection refused) while reading response header from upstream"    用户在链接成功后读取数据时，若遇到后端upstream挂掉或者不通，会收到该错误
"(111: Connection refused) while sending request to upstream"    Nginx和upstream链接成功后发送数据时，若遇到后端upstream挂掉或者不通，会收到该错误
"(110: Connection timed out) while connecting to upstream"    nginx链接后面的upstream时超时
"(110: Connection timed out) while reading upstream"    nginx读取来自upstream的响应时超时
"(110: Connection timed out) while reading response header from upstream"    nginx读取来自upstream的响应头时超时
"(110: Connection timed out) while reading upstream"    nginx读取来自upstream的响应时超时
"(104: Connection reset by peer) while connecting to upstream"    upstream发送了RST，将链接重置
"upstream sent invalid header while reading response header from upstream"    upstream发送的响应头无效
"upstream sent no valid HTTP/1.0 header while reading response header from upstream"    upstream发送的响应头无效
"client intended to send too large body"    用于设置容许接受的客户端请求内容的最大值，默认值是1M，client发送的body超过了设置值
"reopening logs"    用户发送kill  -USR1命令
"gracefully shutting down",    用户发送kill  -WINCH命令
"no servers are inside upstream"    upstream下未配置server
"no live upstreams while connecting to upstream"    upstream下的server全都挂了
"SSL_do_handshake() failed"    SSL握手失败
"SSL_write() failed (SSL:) while sending to client" 
"(13: Permission denied) while reading upstream" 
"(98: Address already in use) while connecting to upstream" 
"(99: Cannot assign requested address) while connecting to upstream" 
"ngx_slab_alloc() failed: no memory in SSL session shared cache"                ssl_session_cache大小不够等缘由形成
"could not add new SSL session to the session cache while SSL handshaking"        ssl_session_cache大小不够等缘由形成
"send() failed (111: Connection refused)"

例： connect() failed (111: Connection refused) while connecting to upstream. 后端的upstream中指定的是fast-cgi的php服务器，因此缘由有多是php-fpm.conf的进程数量过少致使。解决方法：php-fpm.conf配置文件pm.max_children修改大一点,重启php-fpm浏览器

除了上述方法，还须要注意os层面的资源限制，整体个人以下：tomcat

ulimit -a  
    core file size              (blocks, -c) 0  
    data seg size               (kbytes, -d) unlimited  
    scheduling priority                 (-e) 0  
    file size               (blocks, -f) unlimited  
    pending signals                 (-i) 32063  
    max locked memory       (kbytes, -l) 64  
    max memory size         (kbytes, -m) unlimited  
    open files                      (-n) 65536  
    pipe size            (512 bytes, -p) 8  
    POSIX message queues     (bytes, -q) 819200  
    real-time priority              (-r) 0   
    stack size              (kbytes, -s) 8192  
    cpu time               (seconds, -t) unlimited  
    max user processes              (-u) 32063  
    virtual memory          (kbytes, -v) unlimited  
    file locks                      (-x) unlimited

例： connect() failed (110: Connection refused) while connecting to upstream. 后端的upstream中指定的是tomcat server ，因此缘由有多是tomcat 的链接数过少，tomcat没法处理高并发致使。解决办法：建议修改tomcat 配置文件中子进程的最大个数、并发链接数、内存大小等。另外，还建议调整nginx 到后端upstream的timeout 时长，以下所示：服务器

worker_processes 4;
    pid /run/nginx.pid;
    worker_rlimit_nofile 16384;
    events {
            worker_connections 10000;
            # multi_accept on;
    }
    http {
            proxy_connect_timeout 300s;
            proxy_send_timeout   600s;
            proxy_read_timeout   600s;
            send_timeout 600s;

}

例：一台nginx是用来作转发数据的,也就是反向代理。当链接数达到8000以上时，就会出现NGINX 502 Bad Gateway错误。nginx日志所有都是一些110错误代码：*7300911 connect() failed (110: Connection timed out) while connecting to upstream。(nginx的并发链接数五六万都没问题的吧，怎么会出现这种现象呢)cookie

分析：Nginx 502错误的缘由比较多，大部分是由于在反向代理模式下后端服务器出现问题引发的。这些错误通常都不是nginx自己的问题，因此必定要从后端找缘由！但nginx抛出的这些出错信息很是简单，着实让nginx的用户备受置疑，毕竟从字眼上理解，bad gateway？不就是bad nginx吗？让不了解的人看到，会直接把责任推在nginx身上。502错误最一般的出现状况就是后端主机当机，固然，后端主机并发数过高也可能引发偶尔的502，后端主机优化配置不得当，也可能形成502。若是一直502，得检查后端主机地址或端口是否配置错误。

例：nginx 配置以下:session

upstream tomcat{
server tom1:8901 max_fails=1 fail_timeout=60s;
server tom2:8901 max_fails=1 fail_timeout=60s;
}

日志里面每隔一两分钟就会记录一条相似 379803415 no live upstreams while connecting to upstream 的日志，此外，还有大量的“upstream prematurely closed connection while reading response header from upstream”的日志。
· ·
·分析，nginx在链接upstream的其中一个server时，maxfail为1，表示有任何一次失败，nginx就认为该机器失败，而后就去尝试连下一个；若是全部的都失败了，在底层就会进行quick recovery 把每一个peer的失败次数都重置为0，而后再返回一个NGX_BUSY，而后nginx就会打印一条no live upstreams ,最后又回到原始状态，接着进行转发了。这就解释了no live upstreams以后还能正常访问。

·
从新看配置文件，若是其中一台有一次失败，nginx就会认为它已经死掉，而后就会把之后的流量全都打到另外一台上面，当另一台也有一次失败的时候，就认为两个都死掉了，而后quick recovery，而后打印一条日志。这样带来的另外一个问题是，若是几台nginx同时认定一台后端已经死掉的时候，会形成流量的不均衡。

初步的解决方法：把max_fails从1改为5，效果很明显，“no live upstreams”出现的几率变少了不少，但却没有彻底消失。

至于日志中的“upstream prematurely closed connection while reading response header from upstream” 抓包分析后发现是由于nginx 给后端服务器发送一个请求后，后端服务器直接回复[fin,ack]，而后nginx将502的结果转发给请求者。这种状况极可能是因为后端应用引发

关于概念

1.并发链接数

客户端向服务器发起请求，并创建了TCP链接。每秒钟服务器连接的总TCP数量，就是并发链接数。

2.请求数

请求数指的是客户端在创建完链接后，向http服务发出GET/POST/HEAD数据包。
拓展：服务器返回了请求结果后有两种状况：
http数据包头包含Close字样，关闭本次TCP链接；
http数据包头包含Keep-Alive字样，本次链接不关闭，可继续经过该链接继续向http服务发 送请求，用于减小TCP并发链接数。

PV（page view）

访问量,即页面浏览量或者点击量,用户每次对网站的访问均被记录1次。用户对同一页面的屡次访问，访问量值累计

UV（unique visitor）

独立访客量。将每一个独立设备（以cookie为依据）视为一位访客，一天以内（00:00-24:00）的访客数量。一天以内相同cookie的访问只被计算1次。

独立IP

00:00-24:00内相同IP地址只被计算一次

注：关于并发链接数、请求数、并发用户数的概念参考日志http://www.4wei.cn/archives/1002399

经常使用日志分析命令

一、总请求数

wc -l access.log |awk '{print $1}'

二、独立IP数

awk '{print $1}' access.log|sort |uniq |wc -l

三、每秒客户端请求数 TOP5

awk  -F'[ []' '{print $5}' access.log|sort|uniq -c|sort -rn|head -5

四、访问最频繁IP Top5

awk '{print $1}' access.log|sort |uniq -c | sort -rn |head -5

五、访问最频繁的URL TOP5

awk '{print $7}' access.log|sort |uniq -c | sort -rn |head -5

六、响应大于10秒的URL TOP5

awk '{if ($12 > 10){print $7}}' access.log|sort|uniq -c|sort -rn |head -5

七、HTTP状态码(非200)统计 Top5

awk '{if ($13 != 200){print $13}}' access.log|sort|uniq -c|sort -rn|head -5

八、分析请求数大于50000的源IP的行为

awk '{print $1}' access.log|sort |uniq -c |sort -rn|awk '{if ($1 > 50000){print $2}}' > tmp.txt
for i in $(cat tmp.txt)
do
     echo $i  >> analysis.txt
     echo "访问行为统计" >> analysis.txt
     grep $i  access.log|awk '{print $6}' |sort |uniq -c | sort -rn |head -5 >> analysis.txt
     echo "访问接口统计" >> analysis.txt
     grep $i  access.log|awk '{print $7}' |sort |uniq -c | sort -rn |head -5 >> analysis.txt
     echo -e "\n"  >> /root/analysis/$Ydate.txt
done
注：若是源IP来自代理服务器，应将第一条命令过滤地址改成$http_x_forwarded_for地址

awk '{print $NF}' access.log|sort |uniq -c |sort -rn|awk '{if ($1 > 50000){print $2}}' > tmp.txt