Demonstrating nginx 502 and 504 Timeouts

Date: 2019-11-07

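Our production nginx recently hit some 502 and 504 errors that were hard to track down, so I took the opportunity to study the relevant nginx settings. Many articles about nginx timeouts only list what each directive means and its value, without explaining which situation triggers which directive. This post demonstrates how to make nginx return 502 and 504 deliberately and predictably.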

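502 and 504 explained

The HTTP status definitions say:

502 Bad Gateway: The server was acting as a gateway or proxy and received an invalid response from the upstream server.
504 Gateway Timeout: The server was acting as a gateway or proxy and did not receive a timely response from the upstream server.

A 502 means Bad Gateway and is usually caused by a fault in the upstream service; a 504 means nginx timed out while talking to the upstream service. The two mean completely different things. In some cases, though, a timeout inside the upstream service (which triggers a TCP reset) can also produce a 502, as detailed later.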

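Demo environment

You need three logical components: an nginx server, php-fpm, and a client. All three can run on the same machine. I use docker to set up the PHP and nginx environment and send requests from the host. If you know these three components well, you can skip this part. docker is very convenient for all kinds of tests and experiments, but I won't go into that here. The docker-compose configuration is adapted from another article. My docker-compose file is as follows: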

version: '3'
services:
  web:
    image: nginx:alpine
    ports:
      - "8080:80"
    volumes:
      - ./code:/code
      - ./nginx/site.conf:/etc/nginx/conf.d/site.conf
    depends_on:
      - php
  php:
    image: php:7.1-fpm-alpine
    volumes:
      - ./code:/code
      - ./php/php-fpm.conf:/usr/local/etc/php-fpm.conf

Both images are based on alpine and are very small:

REPOSITORY  TAG               SIZE
php         7.1-fpm-alpine    69.5MB
nginx       alpine            18.6MB

nginx configuration:

server {
  index index.php index.html;
  server_name php-docker.local;
  error_log  /var/log/nginx/error.log;
  access_log /var/log/nginx/access.log;
  root /code;

  location ~ \.php$ {
    try_files $uri =404;
    fastcgi_split_path_info ^(.+\.php)(/.+)$;
    fastcgi_pass php:9000;
    fastcgi_index index.php;
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    fastcgi_param PATH_INFO $fastcgi_path_info;
    fastcgi_connect_timeout 5s;
    fastcgi_read_timeout 8s;
    fastcgi_send_timeout 10s;
  }
}

php-fpm configuration:

[global]
include=etc/php-fpm.d/*.conf
request_terminate_timeout=3s

The code is on GitHub.

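Key parameters

In this demo PHP has two key parameters. One is the PHP script's max_execution_time, set in php.ini; the other is php-fpm's request_terminate_timeout, set in php-fpm.conf. When PHP is served through php-fpm, request_terminate_timeout overrides max_execution_time, so we only test request_terminate_timeout here.

request_terminate_timeout is the timeout for a request accepted by php-fpm; once it is exceeded, php-fpm kills the worker process that is running the script.

The key nginx parameters are the fastcgi timeouts: fastcgi_connect_timeout, fastcgi_read_timeout, and fastcgi_send_timeout.

The subject of all these nginx parameters is nginx itself: fastcgi_connect_timeout is how long nginx waits when connecting to the fastcgi server, fastcgi_read_timeout is how long nginx waits when reading the response from the fastcgi server, and fastcgi_send_timeout is how long nginx waits when sending the request to the fastcgi server.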
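To make the interplay of these timeouts concrete, here is a minimal Python sketch of a simplified model: whichever deadline expires first decides the status code. The function name `predict_outcome` and the model itself are illustrative assumptions, not part of nginx or php-fpm. The model assumes the script sends no output until it finishes and that connecting and sending the request take negligible time.

```python
from typing import Optional, Tuple


def predict_outcome(
    script_seconds: float,
    request_terminate_timeout: Optional[float],
    fastcgi_read_timeout: float,
) -> Tuple[int, float]:
    """Return (expected HTTP status, seconds until nginx answers).

    Simplified model (an assumption, not nginx internals): the earliest
    of the three deadlines wins.
    """
    # The script finishing normally yields a 200.
    deadlines = [(script_seconds, 200)]
    # None or 0 means php-fpm never terminates the request.
    if request_terminate_timeout:
        # php-fpm kills the worker, nginx sees a reset: 502.
        deadlines.append((request_terminate_timeout, 502))
    # nginx gives up waiting for the response: 504.
    deadlines.append((fastcgi_read_timeout, 504))
    seconds, status = min(deadlines, key=lambda item: item[0])
    return status, seconds
```

With the values used in this post, `predict_outcome(70, 3, 8)` models the 502 case, and `predict_outcome(70, None, 8)` models the 504 case that appears once request_terminate_timeout is removed.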

Demo procedure

First start nginx and PHP:

docker-compose up

Add an index.php file under the code folder:

<?php
sleep(70);
echo 'hello world';

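Upstream service actively resets the connection

Request php-docker.local:8080/index.php and you get 502 Bad Gateway. The error arrives after 3s, which shows that request_terminate_timeout was triggered and php-fpm closed the connection.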
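Because the experiments depend on both the status code and how long it takes to arrive, a small timing client is handy. The following is a minimal sketch using only the Python standard library; the helper name `timed_get` is an assumption for illustration, and any HTTP client that reports status and elapsed time would do.

```python
import time
import urllib.error
import urllib.request
from typing import Tuple


def timed_get(url: str, client_timeout: float = 120.0) -> Tuple[int, float]:
    """Fetch url and return (HTTP status, elapsed seconds)."""
    # An empty ProxyHandler ignores proxy environment variables, so the
    # request goes straight to the local nginx.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    start = time.monotonic()
    try:
        with opener.open(url, timeout=client_timeout) as response:
            status = response.status
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx/5xx responses; the status is on the exception.
        status = error.code
    return status, time.monotonic() - start
```

Calling `timed_get("http://php-docker.local:8080/index.php")` against the setup above should report the 502 together with an elapsed time close to request_terminate_timeout. Keep `client_timeout` larger than every server-side timeout so the client itself never gives up first.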

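Watching ps aux | grep php shows that php-fpm deals with a timed-out process by killing it (one pid changes each time, meaning one process was killed and another was started. This depends on php-fpm's process pool settings; with your settings a new process may not be started).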

/var/www/html # ps aux | grep php
    1 root       0:00 php-fpm: master process (/usr/local/etc/php-fpm.conf)
    6 www-data   0:00 php-fpm: pool www
    7 www-data   0:00 php-fpm: pool www
/var/www/html # ps aux | grep php
    1 root       0:00 php-fpm: master process (/usr/local/etc/php-fpm.conf)
    7 www-data   0:00 php-fpm: pool www
   17 www-data   0:00 php-fpm: pool www
/var/www/html # ps aux | grep php
    1 root       0:00 php-fpm: master process (/usr/local/etc/php-fpm.conf)
   17 www-data   0:00 php-fpm: pool www
   20 www-data   0:00 php-fpm: pool www

In this case the error in the nginx log is:

recv() failed (104: Connection reset by peer) while reading response header from upstream

That is, the connection was reset by the server side (PHP), which makes sense.

Note that php-fpm also records this in its own log:

php_1  | [18-Jul-2018 16:33:42] WARNING: [pool www] child 5, script '/code/index.php' (request: "GET /index.php") execution timed out (3.040130 sec), terminating
php_1  | [18-Jul-2018 16:33:42] WARNING: [pool www] child 5 exited on signal 15 (SIGTERM) after 30.035736 seconds from start
php_1  | [18-Jul-2018 16:33:42] NOTICE: [pool www] child 8 started

This is another place where the problem can be spotted.
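The "Connection reset by peer" error can be reproduced at the socket level without nginx or PHP. The sketch below is an analogy under stated assumptions: a tiny local server aborts the connection with SO_LINGER set to zero, which is one standard way to force a TCP RST and not necessarily how php-fpm's worker ends up resetting the connection. The function names are hypothetical.

```python
import errno
import socket
import struct
import threading


def upstream_that_resets(listener: socket.socket) -> None:
    """Accept one connection and abort it so the peer receives a TCP RST."""
    conn, _ = listener.accept()
    # l_onoff=1, l_linger=0 makes close() send RST instead of FIN.
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    conn.close()


def read_from_resetting_upstream() -> int:
    """Return the errno the client sees while waiting for a response (0 if none)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    worker = threading.Thread(target=upstream_that_resets, args=(listener,))
    worker.start()
    client = socket.create_connection(listener.getsockname(), timeout=5)
    try:
        client.recv(1024)  # plays the role of nginx reading the response header
        return 0
    except OSError as error:
        return error.errno
    finally:
        client.close()
        worker.join()
        listener.close()
```

The returned value corresponds to `errno.ECONNRESET`, which is the 104 that nginx prints in its log on Linux.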

nginx times out reading from the upstream service

Remove the request_terminate_timeout setting and restart the application:

docker-compose down && docker-compose up

The PHP script now runs for 70s, which certainly exceeds the timeout configured in nginx. A GET request confirms it: after 8s nginx returns 504 Gateway Time-out, and the nginx log says:

upstream timed out (110: Operation timed out) while reading response header from upstream

This shows that fastcgi_read_timeout was triggered.
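The read timeout has a simple socket-level counterpart: the connection is established, but the peer stays silent longer than the reader is willing to wait. The following sketch uses a hypothetical local server that accepts a connection and sends nothing, so it is an analogy for the situation rather than a reproduction of nginx's behavior.

```python
import socket
import threading


def read_from_silent_upstream(read_timeout: float) -> str:
    """Connect to a server that never answers and report what the reader sees."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    release = threading.Event()

    def silent_upstream() -> None:
        conn, _ = listener.accept()
        release.wait()  # hold the connection open without sending anything
        conn.close()

    worker = threading.Thread(target=silent_upstream)
    worker.start()
    client = socket.create_connection(listener.getsockname())
    client.settimeout(read_timeout)  # analogous in spirit to fastcgi_read_timeout
    try:
        data = client.recv(1024)
        return "data" if data else "closed"
    except socket.timeout:
        return "timed out"
    finally:
        client.close()
        release.set()
        worker.join()
        listener.close()
```

Unlike the reset case, nothing goes wrong on the wire here; the reader simply stops waiting, which is why nginx reports 504 rather than 502.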

Shutting down the upstream service

Stop the PHP service:

docker-compose stop php

The first request after PHP is stopped returns 504, and the logged error is:

upstream timed out (110: Operation timed out) while connecting to upstream

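The delay equals the fastcgi_connect_timeout setting. This suggests that at this point the TCP connection was still considered present, but the attempt to connect failed.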

The next request returns 502, and the error is:

connect() failed (113: Host is unreachable) while connecting to upstream

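The 502 is easy to understand: the upstream service is down, and because the previous request found that it could not connect and dropped the connection, the next attempt can no longer find the host.

I suspected that the 504 on the first request was caused by keepalive. But even when I waited a long time after stopping PHP before sending the first request, the result was the same.

If nginx's fastcgi_pass is set to 127.0.0.1:9000 (nothing listens on that port locally), a 502 is returned immediately, with the error: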

connect() failed (111: Connection refused) while connecting to upstream
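The connect-stage errors can be checked independently of nginx with a plain TCP connect. This is a minimal sketch; `connect_errno` is a hypothetical helper name, and the errno numbers quoted in this post (110, 111, 113) are the Linux values.

```python
import errno
import socket


def connect_errno(host: str, port: int, timeout: float = 2.0) -> int:
    """Try a TCP connect; return 0 on success or the errno on failure.

    Note: when the timeout expires, connect_ex reports EWOULDBLOCK/EAGAIN
    rather than ETIMEDOUT.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        return sock.connect_ex((host, port))
    finally:
        sock.close()
```

Against a local port with no listener the result equals `errno.ECONNREFUSED` and comes back at once, which matches the immediate 502 described above.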

Log in to the nginx container and use tcpdump to watch the traffic on port 9000:

tcpdump -i eth0 -nnA tcp port 9000
# if your PHP runs locally, change eth0 to lo

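We find that on the first request after PHP is shut down, nginx sends several TCP SYN packets to PHP. PHP obviously does not answer, and nginx returns 504. On the second request nginx sends nothing at all and returns 502 directly[^2]. If we run nginx -t at this point, nginx already considers the configuration file broken: nginx: configuration file /etc/nginx/nginx.conf test failed.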

An alternative configuration

Another article points out that our earlier nginx configuration is not reasonable[^1], so we reconfigure nginx:

server {
  index index.php index.html;
  server_name php-docker.local;
  error_log  /var/log/nginx/error.log;
  access_log /var/log/nginx/access.log;
  root /code;
  resolver 127.0.0.11;  # here
  location ~ \.php$ {
    set $upstream php:9000; # here
    try_files $uri =404;
    fastcgi_split_path_info ^(.+\.php)(/.+)$;
    fastcgi_pass $upstream;  # here
    fastcgi_index index.php;
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    fastcgi_param PATH_INFO $fastcgi_path_info;
    fastcgi_connect_timeout 5s;
    fastcgi_read_timeout 8s;
    fastcgi_send_timeout 10s;
  }
}

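Here 127.0.0.11 is docker's internal DNS resolver. This configuration sets fastcgi_pass dynamically, so nginx does not check whether the connection can be established.

Start nginx with this configuration, request index.php once to establish a connection, then shut PHP down. The behavior is:

During the keepalive period, a 504 is returned after fastcgi_connect_timeout, with the error: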

upstream timed out (110: Operation timed out) while connecting to upstream

After keepalive has expired, a 502 is returned after a variable delay, with the error:

connect() failed (113: Host is unreachable) while connecting to upstream

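As that article says, nginx does not consider this configuration broken, and running nginx -t confirms it. For a period of time, nginx sends a SYN to the upstream on every request and the status code is 504; after that it stops sending TCP packets and the status code becomes 502.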

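Other notes

PHP scripts have one more timeout setting: max_execution_time. It limits the script's execution time, but that time does not count system calls (such as sleep, I/O, and so on). When PHP kills a script for this reason it raises a fatal error, whereas php-fpm does not raise one.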
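The difference between execution time and wall-clock time is easy to see by measuring both around a sleep. The sketch below is a Python analogy, not PHP: `time.process_time` counts CPU time only, which is similar in spirit to how max_execution_time ignores time spent sleeping or waiting on I/O. The helper name is hypothetical.

```python
import time
from typing import Callable, Tuple


def cpu_and_wall_seconds(work: Callable[[], None]) -> Tuple[float, float]:
    """Run work() and return (CPU seconds used, wall-clock seconds elapsed)."""
    cpu_start = time.process_time()
    wall_start = time.monotonic()
    work()
    return time.process_time() - cpu_start, time.monotonic() - wall_start
```

For `cpu_and_wall_seconds(lambda: time.sleep(0.2))` the wall-clock figure is about 0.2s while the CPU figure stays close to zero. This is why the `sleep(70)` in index.php never gets near a max_execution_time limit, while request_terminate_timeout, which measures wall-clock time, fires after 3s.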

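This experiment uses PHP through fastcgi. If nginx connects to the upstream service as a proxy instead, replace fastcgi_connect_timeout, fastcgi_read_timeout, and fastcgi_send_timeout with the corresponding proxy_connect_timeout, proxy_read_timeout, and proxy_send_timeout.

Conclusion

The causes of a 504 are fairly simple: usually the upstream service took longer than nginx was willing to wait, either because the upstream's business logic is too slow or because connecting to the upstream server timed out. Judging from the experiments above, the latter is harder to trace, because the connection appears to exist yet cannot be established. Fortunately this kind of 504 usually turns into a 502 after a while.

A 502 is caused by a fault in the upstream server, such as downtime, a killed process, the upstream resetting the connection, or a hung process. The nginx log shows the specific cause of a 502: 104: Connection reset by peer, 113: Host is unreachable, or 111: Connection refused.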
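The mapping from nginx error log messages to status codes observed in these experiments can be written down as a small lookup. This is a sketch that covers only the four errors seen in this post; the table and the function name `expected_status` are illustrative, and the errno numbers are the Linux values.

```python
import re
from typing import Optional

# errno in the nginx error log -> status code observed in the experiments above
ERRNO_TO_STATUS = {
    104: 502,  # Connection reset by peer
    111: 502,  # Connection refused
    113: 502,  # Host is unreachable
    110: 504,  # Operation timed out
}

# Matches fragments such as "(104: Connection reset by peer)".
_ERRNO_PATTERN = re.compile(r"\((\d+): [^)]*\)")


def expected_status(error_log_line: str) -> Optional[int]:
    """Return the status code seen with this error, or None if unknown."""
    match = _ERRNO_PATTERN.search(error_log_line)
    if match is None:
        return None
    return ERRNO_TO_STATUS.get(int(match.group(1)))
```

Anything outside the table returns `None` rather than a guess, since other upstream errors were not part of these experiments.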

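Some differences in detail are related to how nginx works internally; I have not dug into that part yet.

[^1]: That article shows that with our earlier configuration, nginx cannot start unless PHP has started first. This is not reasonable: one faulty upstream server leaves nginx unable to serve at all. It has little to do with our demo, though, so the main text does not dwell on it.

[^2]: In theory, since nginx already knows PHP is unreachable and no longer sends TCP packets, it should return 502 immediately. In the experiment, however, this 502 arrives with a delay of about 3s, for reasons I do not know.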