Zero-downtime updates for server programs

Date: 2019-11-08

Tags: zero time, updating server programs


Starting with a question

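To get the ball rolling: for statically compiled programs, such as those written in C, C++, Go or similar languages, once we fix a bug or add a new feature, how do we update the application without taking the service offline?

That is an ordinary, everyday question. Some people think questions through thoroughly: Newton, for example, was hit by an apple, thought long and hard about it, and ended up discovering the law of universal gravitation.

What would you do if an apple hit you?

Just kidding: would it simply knock us silly?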

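So let's return to the question:

After fixing a bug or adding a new requirement, how do we upgrade a server application as smoothly as silk, without interrupting service?

The question implies the following:

C, C++ and Go are compiled languages: every instruction is compiled into the executable, so upgrading means building a new executable to replace the old one. How can a process that is already running load the new image (the executable file) and run it?

Business logic that is in progress must not be interrupted, and connections being served must not be cut off abruptly.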

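Upgrading an application this smoothly is what we call a hot update.

To put it in a picture:

You are riding in a truck doing 150 km/h.

Then one of the tires blows out.

Then the driver says: just change it, I'm not stopping. Be careful.

Ah, Brother Lee, I get it: in situations like these we can't fall back on the all-purpose "restart" to fix the problem.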

Solution 1: lessons from gray release and A/B testing

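Gray release (also called canary release) is a release approach that moves smoothly between black and white. A/B testing can be done on top of it: some users keep using feature A while others start using feature B; if users raise no objections to B, the scope is gradually widened until all users have been migrated to B. Gray release keeps the overall system stable, because problems can be found and fixed during the initial gray phase, which limits their impact. The scheme for doing gray release with nginx is shown in the figure below:

nginx is a reverse proxy that forwards requests from the external network to business servers on the internal network. In a layered system design, nginx is usually placed in the access layer; LVS, F5, Apache and others can forward user requests too. For example, here is an nginx configuration: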

http {
    upstream cluster {
        ip_hash;
        server 192.168.2.128:8086 weight=1 fail_timeout=15 max_fails=3;
        server 192.168.2.130:8086 weight=2 fail_timeout=15 max_fails=3;
    }

    server {
        listen 8080;

        location / {
            proxy_pass http://cluster;
        }
    }
}

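All requests to port 8080 are forwarded to the upstream defined as cluster, and the upstream forwards them, according to the IP-hash policy, to the services on port 8086 of 192.168.2.128 and 192.168.2.130. This configuration uses ip_hash; nginx also supports other policies, such as the default weighted round robin and least_conn.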
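The ip_hash idea can be sketched as follows. According to the nginx documentation, ip_hash uses the first three octets of a client's IPv4 address as the hashing key, so all clients in the same /24 reach the same server. The sketch follows that rule and spreads keys over servers in proportion to their weight; the hash function, the slot walk and the names are simplified assumptions and do not reproduce nginx's exact algorithm.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry of a hypothetical upstream list, mirroring "server ... weight=N". */
struct upstream_server {
    const char *addr;
    unsigned weight;
};

/* Hash only the first three octets of an IPv4 address (host byte order),
 * so every client in the same /24 produces the same key. */
static uint32_t ip_hash_key(uint32_t ipv4)
{
    uint32_t h = 0;
    for (int shift = 24; shift >= 8; shift -= 8)
        h = h * 31u + ((ipv4 >> shift) & 0xffu);
    return h;
}

/* Map the key onto [0, total weight) and walk the weight ranges, so a server
 * with weight 2 owns twice as many slots as a server with weight 1. */
const struct upstream_server *pick_ip_hash(const struct upstream_server *servers,
                                           size_t n, uint32_t client_ipv4)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += servers[i].weight;
    if (total == 0)
        return NULL;

    unsigned slot = ip_hash_key(client_ipv4) % total;
    for (size_t i = 0; i < n; i++) {
        if (slot < servers[i].weight)
            return &servers[i];
        slot -= servers[i].weight;
    }
    return NULL; /* not reached when total > 0 */
}
```

For example, 10.0.0.1 and 10.0.0.254 always land on the same server, because only 10.0.0 feeds the hash.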

So how can nginx be used to upgrade a service program smoothly?

Take this nginx configuration as an example:

http {
    upstream cluster {
        ip_hash;
        server 192.168.2.128:8086 weight=1 fail_timeout=15 max_fails=3;
        server 192.168.2.130:8086 weight=2 fail_timeout=15 max_fails=3;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://cluster;
        }
    }
}

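Suppose our service was originally deployed on 192.168.2.128. After fixing a bug or adding a feature, we deploy the new version on another machine (say 192.168.2.130), change the nginx configuration as shown above, and run nginx -s reload to load it. Existing connections and services are not interrupted, yet the new service starts taking traffic. That is gray release done with nginx, and testing done this way is called A/B testing. Now, how do we take the old service out of service completely?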
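The weight= values decide how requests are shared between the old and the new server. Without ip_hash, nginx's default round-robin balancing honours weights with a "smooth" weighted round-robin that interleaves picks instead of sending bursts to one server. Below is a minimal sketch of that algorithm for the weights 1 and 2 used above; the struct and function names are hypothetical, and nginx's real balancer additionally tracks failures (max_fails, fail_timeout) and effective weights.

```c
struct wrr_peer {
    const char *addr;
    int weight;    /* configured weight */
    int current;   /* running score, starts at 0 */
};

/* Smooth weighted round-robin: every peer gains its weight, the highest score
 * wins, and the winner pays back the total weight. Returns the index or -1. */
int wrr_next(struct wrr_peer *peers, int n)
{
    int total = 0, best = -1;
    for (int i = 0; i < n; i++) {
        peers[i].current += peers[i].weight;
        total += peers[i].weight;
        if (best < 0 || peers[i].current > peers[best].current)
            best = i;
    }
    if (best >= 0)
        peers[best].current -= total;
    return best;
}
```

With weights 1 and 2 the picks come out as .130, .128, .130 and then repeat, so the new server receives two thirds of the requests.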

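Modify the nginx configuration as follows, adding the down parameter to the old server in the upstream. With ip_hash, marking a server down rather than deleting its line preserves the current hashing of the other clients' IP addresses: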

http {
    upstream cluster {
        ip_hash;
        server 192.168.2.128:8086 weight=1 fail_timeout=15 max_fails=3 down;
        server 192.168.2.130:8086 weight=2 fail_timeout=15 max_fails=3;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://cluster;
        }
    }
}

After a while, once the old server has finished the requests it was handling, the service on 192.168.2.128 can be stopped.
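Instead of waiting a fixed amount of time, the old service can decide when it is safe to stop: once nginx sends it no new connections, it waits until its count of in-flight requests reaches zero, with a deadline as a safety net. A sketch, assuming the service maintains a hypothetical atomic active_requests counter that its request handlers increment and decrement:

```c
#define _GNU_SOURCE   /* expose nanosleep under strict compiler modes */
#include <stdatomic.h>
#include <time.h>

/* Poll the in-flight counter every 10 ms until it reaches zero or roughly
 * timeout_ms has elapsed. Returns 0 when drained, -1 on timeout. */
int wait_for_drain(atomic_int *active_requests, long timeout_ms)
{
    const struct timespec step = {0, 10 * 1000 * 1000}; /* 10 ms */
    long waited_ms = 0;

    while (atomic_load(active_requests) > 0) {
        if (waited_ms >= timeout_ms)
            return -1;
        nanosleep(&step, NULL);
        waited_ms += 10;
    }
    return 0;
}
```

The process would call this after nginx has been reloaded with down, and exit on either result; the deadline keeps a stuck request from blocking the rollout forever.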

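This is a silky-smooth approach built on the nginx access layer. The same idea applies to LVS, Apache and similar components, and can also be implemented with DNS, ZooKeeper, etcd and so on: the point is to shift all traffic onto the new system.

Gray release solves the problem of moving traffic onto the new system. But what about an application like nginx itself, or a case where the image must be upgraded on this very machine? That calls for a true hot update, and the questions to consider are: what if the old service has cached data, and what if it is in the middle of business logic?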

Solution 2: nginx's hot update scheme

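nginx uses a master/worker multi-process model: the master process manages nginx as a whole (shutdown, reopening logs, hot updates and so on), while the worker processes handle user requests.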
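The master/worker split can be modelled with fork(): the master starts N workers, records their pids, and later stops and reaps them. This is a toy sketch under stated assumptions: the workers only sleep in pause() instead of running an event loop, and a plain SIGTERM stands in for nginx's own shutdown messages.

```c
#define _GNU_SOURCE   /* expose fork/kill/pause under strict compiler modes */
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Worker body: a real worker runs an event loop; this one waits for a signal. */
static void worker_main(void)
{
    for (;;)
        pause();
}

/* Fork n workers and store their pids. Returns how many were started. */
int spawn_workers(pid_t *pids, int n)
{
    int started = 0;
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;                 /* fork failed: stop early */
        if (pid == 0) {
            worker_main();         /* child: never returns normally */
            _exit(0);
        }
        pids[started++] = pid;     /* parent: remember the child */
    }
    return started;
}

/* Ask every worker to stop, then reap them so no zombies remain.
 * Returns how many workers were reaped. */
int stop_workers(const pid_t *pids, int n)
{
    int reaped = 0;
    for (int i = 0; i < n; i++)
        kill(pids[i], SIGTERM);
    for (int i = 0; i < n; i++) {
        int status;
        if (waitpid(pids[i], &status, 0) == pids[i])
            reaped++;
    }
    return reaped;
}
```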

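All the listening ports configured in nginx, such as the one in the previous example, are first set up in the master process (socket, bind and listen, producing a descriptor sfd). When the master forks its worker processes, each worker inherits these listening sockets; in addition, nginx keeps a Unix domain socketpair ("channel") between the master and each worker for control messages. Each worker adds the listening sockets to its epoll instance, and when a client connects, a worker accepts and handles the connection. In other words, user requests are processed by the worker processes.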
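Descriptors can also be handed between processes that are already running, over a Unix domain socket, by attaching them as SCM_RIGHTS ancillary data; nginx's master/worker channels use this mechanism when they pass descriptors. A minimal sketch, in which send_fd and recv_fd are hypothetical helper names:

```c
#define _GNU_SOURCE   /* expose CMSG_* helpers under strict compiler modes */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send fd over a connected Unix domain socket; one data byte carries the message. */
int send_fd(int sock, int fd)
{
    char byte = 'F';
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {                                  /* aligned buffer for one int */
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } ctrl;
    memset(&ctrl, 0, sizeof ctrl);

    struct msghdr msg = {0};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof ctrl.buf;

    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type = SCM_RIGHTS;              /* "this message carries descriptors" */
    cm->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cm), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor sent by send_fd. Returns the new fd, or -1. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } ctrl;

    struct msghdr msg = {0};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof ctrl.buf;

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    if (cm == NULL || cm->cmsg_level != SOL_SOCKET || cm->cmsg_type != SCM_RIGHTS)
        return -1;

    int fd;
    memcpy(&fd, CMSG_DATA(cm), sizeof(int));
    return fd;
}
```

The receiver gets a new descriptor number that refers to the same open socket, much like the copy a forked child inherits.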

With that background on nginx's I/O model, let's look at its hot update scheme.

The upgrade steps:

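Step 1: upgrade the nginx binary. First replace the old nginx executable with the new one, then send the USR2 signal to the nginx master process to tell it to start the upgrade. The master renames its pid file by appending the .oldbin suffix, then starts the new executable: it forks, and the child calls exec on the new binary. The new binary becomes a new master process, which starts its own worker processes and writes its pid to the pid file. Note in the listing that the new master 23692 has the old master 4584 as its parent.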

UID   PID    PPID   C  STIME  TTY  TIME      CMD
root  4584   1      0  Oct17  ?    00:00:00  nginx: master process /usr/local/apigw/apigw_nginx/nginx
root  12936  4584   0  Oct26  ?    00:03:24  nginx: worker process
root  12937  4584   0  Oct26  ?    00:00:04  nginx: worker process
root  12938  4584   0  Oct26  ?    00:00:04  nginx: worker process
root  23692  4584   0  21:28  ?    00:00:00  nginx: master process /usr/local/apigw/apigw_nginx/nginx
root  23693  23692  3  21:28  ?    00:00:00  nginx: worker process
root  23694  23692  3  21:28  ?    00:00:00  nginx: worker process
root  23695  23692  3  21:28  ?    00:00:00  nginx: worker process
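What makes step 1 work is that the listening sockets survive exec. A sketch of the old master's side, under these assumptions: the helper names are hypothetical, the listening descriptors are already known, and the environment entry NGINX=fd;fd; mirrors the variable nginx uses to tell the new binary which descriptors it inherited. Any descriptor carrying FD_CLOEXEC would be closed by exec, so the flag is cleared first.

```c
#define _GNU_SOURCE   /* expose fork/fcntl/execve under strict compiler modes */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Format "NGINX=6;7;" into buf. Returns 0 on success, -1 if buf is too small. */
int build_inherit_env(char *buf, size_t len, const int *fds, int n)
{
    size_t used = (size_t)snprintf(buf, len, "NGINX=");
    if (used >= len)
        return -1;
    for (int i = 0; i < n; i++) {
        int w = snprintf(buf + used, len - used, "%d;", fds[i]);
        if (w < 0 || (size_t)w >= len - used)
            return -1;
        used += (size_t)w;
    }
    return 0;
}

/* Fork a child that execs the new binary with the listening fds still open.
 * Returns the child's pid (the future new master), or -1. */
pid_t exec_new_binary(const char *path, const int *fds, int n)
{
    static char env_entry[256];
    if (build_inherit_env(env_entry, sizeof env_entry, fds, n) != 0)
        return -1;

    for (int i = 0; i < n; i++) {          /* keep the sockets open across exec */
        int flags = fcntl(fds[i], F_GETFD);
        if (flags >= 0)
            fcntl(fds[i], F_SETFD, flags & ~FD_CLOEXEC);
    }

    pid_t pid = fork();
    if (pid == 0) {
        char *const argv[] = { (char *)path, NULL };
        char *const envp[] = { env_entry, NULL };
        execve(path, argv, envp);
        _exit(127);                        /* only reached if exec failed */
    }
    return pid;                            /* -1 if fork failed */
}
```

For listening sockets on fds 6 and 7, the child starts with NGINX=6;7; in its environment. A real master would also pass along its original arguments and the rest of its environment; this sketch passes only the NGINX entry.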

The exec family of functions is described in its manual page:

NAME
    execl, execlp, execle, execv, execvp, execvpe - execute a file

SYNOPSIS
    #include <unistd.h>

    extern char **environ;

    int execl(const char *path, const char *arg, ...
                    /* (char  *) NULL */);
    int execlp(const char *file, const char *arg, ...
                    /* (char  *) NULL */);
    int execle(const char *path, const char *arg, ...
                    /*, (char *) NULL, char * const envp[] */);
    int execv(const char *path, char *const argv[]);
    int execvp(const char *file, char *const argv[]);
    int execvpe(const char *file, char *const argv[],
                    char *const envp[]);

  Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

    execvpe(): _GNU_SOURCE

DESCRIPTION
    The exec() family of functions replaces the current process image with a new process image. The functions described in this manual page are front-ends for execve(2). (See the manual page for execve(2) for further details about the replacement of the current process image.)

    The initial argument for these functions is the name of a file that is to be executed.

    The const char *arg and subsequent ellipses in the execl(), execlp(), and execle() functions can be thought of as arg0, arg1, ..., argn. Together they describe a list of one or more pointers to null-terminated strings that represent the argument list available to the executed program. The first argument, by convention, should point to the filename associated with the file being executed. The list of arguments must be terminated by a null pointer, and, since these are variadic functions, this pointer must be cast (char *) NULL.

    The execv(), execvp(), and execvpe() functions provide an array of pointers to null-terminated strings that represent the argument list available to the new program. The first argument, by convention, should point to the filename associated with the file being executed. The array of pointers must be terminated by a null pointer.

    The execle() and execvpe() functions allow the caller to specify the environment of the executed program via the argument envp. The envp argument is an array of pointers to null-terminated strings and must be terminated by a null pointer. The other functions take the environment for the new process image from the external variable environ in the calling process.
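Because exec replaces the calling process image, a master that must keep running forks first and lets the child call exec, which is why the new master in the step 1 listing has the old master as its parent. A minimal sketch of that fork/execv/waitpid pattern; run_and_wait is a hypothetical helper, and /bin/sh is only a convenient program to run:

```c
#define _GNU_SOURCE   /* expose fork/execv/waitpid under strict compiler modes */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork, exec argv[0] in the child, and wait for it.
 * Returns the child's exit status, or -1 on error or abnormal termination. */
int run_and_wait(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execv(argv[0], argv);   /* on success this never returns */
        _exit(127);             /* exec failed: conventional "not found" status */
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

As the manual page requires, execv takes a full path and a NULL-terminated argv; execvp would search PATH instead.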

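Step 2: from now on, all worker processes (old and new) keep accepting requests. Next, send the WINCH signal to the old master process; it sends messages to its worker processes asking them to shut down gracefully, and each worker exits once it has finished handling its connections.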

UID   PID    PPID   C  STIME  TTY  TIME      CMD
root  4584   1      0  Oct17  ?    00:00:00  nginx: master process /usr/local/apigw/apigw_nginx/nginx
root  12936  4584   0  Oct26  ?    00:03:24  nginx: worker process
root  12937  4584   0  Oct26  ?    00:00:04  nginx: worker process
root  12938  4584   0  Oct26  ?    00:00:04  nginx: worker process
root  23692  4584   0  21:28  ?    00:00:00  nginx: master process /usr/local/apigw/apigw_nginx/nginx

If an old worker process still has connections to handle, it does not exit immediately; it exits once that work is done.
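Graceful exit on a signal is usually built around a flag: the handler only records that QUIT arrived, and the worker's event loop notices the flag, stops accepting new connections, and exits once its in-flight work is done. A sketch of the handler side, assuming a hypothetical event loop that calls should_quit() once per iteration:

```c
#define _GNU_SOURCE   /* expose sigaction under strict compiler modes */
#include <signal.h>
#include <string.h>

/* Set from the signal handler, read by the event loop. */
static volatile sig_atomic_t quit_requested = 0;

static void on_quit(int signo)
{
    (void)signo;
    quit_requested = 1;   /* only async-signal-safe work inside a handler */
}

/* Install on_quit for SIGQUIT. Returns 0 on success, -1 on error. */
int install_quit_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_quit;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGQUIT, &sa, NULL);
}

/* Polled once per event-loop iteration. */
int should_quit(void)
{
    return quit_requested != 0;
}
```

Keeping the handler to a single flag assignment avoids calling non-reentrant code (logging, malloc, close on shared state) from signal context; the loop does the actual shutdown work.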

Step 3: after some time, only the new worker processes handle new connections.

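Note that the old master process does not close its listening sockets, because if something goes wrong and we need to roll back, the old master must be able to start its worker processes again (sending it the HUP signal makes it start new workers without rereading the configuration).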
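Every step above comes down to reading a pid from a pid file and sending it a signal; the nginx documentation does this with kill and the pid read from nginx.pid (or nginx.pid.oldbin for the old master). The same operation in C as a sketch, with hypothetical function names; write_pidfile shows what a master does at startup so the example is self-contained:

```c
#define _GNU_SOURCE   /* expose kill/getpid under strict compiler modes */
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* What a master does at startup: record its own pid. Returns 0 on success. */
int write_pidfile(const char *pid_path)
{
    FILE *f = fopen(pid_path, "w");
    if (f == NULL)
        return -1;
    fprintf(f, "%ld\n", (long)getpid());
    return fclose(f) == 0 ? 0 : -1;
}

/* Read a pid from pid_path and send it sig (e.g. SIGUSR2, SIGHUP, SIGQUIT;
 * 0 only checks that the process exists). Returns 0 on success, -1 otherwise. */
int signal_from_pidfile(const char *pid_path, int sig)
{
    FILE *f = fopen(pid_path, "r");
    if (f == NULL)
        return -1;

    long pid = 0;
    int ok = fscanf(f, "%ld", &pid) == 1 && pid > 1;  /* never signal init */
    fclose(f);
    if (!ok)
        return -1;

    return kill((pid_t)pid, sig) == 0 ? 0 : -1;
}
```

A rollback would then be signal_from_pidfile on the .oldbin file with SIGHUP, followed by SIGQUIT to the pid in the current pid file.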

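Step 4: if the upgrade succeeds, send the QUIT signal to the old master process to stop it. If instead the new master process exits (unexpectedly), the old master removes the .oldbin suffix from its pid file.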
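The pid file bookkeeping in steps 1 and 4 amounts to two renames. A sketch, assuming the pid file path is known; the helper names are hypothetical:

```c
#include <stdio.h>

/* Build "<pid_path>.oldbin" into out. Returns 0 on success, -1 if out is too small. */
static int oldbin_path(char *out, size_t len, const char *pid_path)
{
    int n = snprintf(out, len, "%s.oldbin", pid_path);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* On USR2: nginx.pid -> nginx.pid.oldbin, freeing the name for the new master. */
int pidfile_mark_old(const char *pid_path)
{
    char old[4096];
    if (oldbin_path(old, sizeof old, pid_path) != 0)
        return -1;
    return rename(pid_path, old);
}

/* If the new master dies: nginx.pid.oldbin -> nginx.pid, so the old master
 * is the current one again. rename() replaces any existing target. */
int pidfile_restore(const char *pid_path)
{
    char old[4096];
    if (oldbin_path(old, sizeof old, pid_path) != 0)
        return -1;
    return rename(old, pid_path);
}
```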

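The key steps and commands are summarized below.

Signals handled by the master process:

USR2    upgrade the executable
WINCH   gracefully shut down the worker processes
QUIT    gracefully shut down the master process

Signals handled by worker processes:

TERM, INT   fast shutdown
QUIT        graceful shutdown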
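A master process can keep a small table that maps these signals to actions and consult it from its main loop. A sketch covering only the master signals listed above; the enum and lookup function are hypothetical, not nginx's internal names:

```c
#define _GNU_SOURCE   /* SIGWINCH is not in older POSIX feature sets */
#include <signal.h>
#include <stddef.h>

enum master_action {
    ACT_NONE,
    ACT_UPGRADE_BINARY,   /* USR2 */
    ACT_STOP_WORKERS,     /* WINCH: graceful shutdown of workers */
    ACT_GRACEFUL_QUIT     /* QUIT: graceful shutdown of the master */
};

static const struct {
    int signo;
    enum master_action action;
} master_signal_table[] = {
    { SIGUSR2,  ACT_UPGRADE_BINARY },
    { SIGWINCH, ACT_STOP_WORKERS },
    { SIGQUIT,  ACT_GRACEFUL_QUIT },
};

/* Look up the action for a signal; unknown signals map to ACT_NONE. */
enum master_action master_action_for(int signo)
{
    size_t n = sizeof master_signal_table / sizeof master_signal_table[0];
    for (size_t i = 0; i < n; i++)
        if (master_signal_table[i].signo == signo)
            return master_signal_table[i].action;
    return ACT_NONE;
}
```

In practice the signal handler would only store the signal number, and the main loop would switch on master_action_for() outside signal context.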

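nginx itself is a proxy component (proxying HTTP, TCP and UDP) with no business logic of its own, and therefore essentially no state data; still, the same scheme also works for programs that do have business logic.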

How does nginx shut down gracefully? In other words, what happens to HTTP requests still being processed and to long-lived connections?
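In broad terms, an nginx worker that is shutting down gracefully closes its listening sockets, closes idle keep-alive connections right away, and lets connections with a request in progress finish; since nginx 1.11.11 the worker_shutdown_timeout directive can cap that wait. The bookkeeping can be sketched as below, with connections modelled as a plain array; the types and function are hypothetical:

```c
#include <stdbool.h>
#include <stddef.h>

enum conn_state { CONN_FREE, CONN_IDLE_KEEPALIVE, CONN_ACTIVE };

struct conn {
    int fd;
    enum conn_state state;
};

/* Begin a graceful shutdown: stop accepting, drop idle keep-alive connections,
 * and leave active ones alone. Returns how many connections are still active. */
size_t begin_graceful_shutdown(struct conn *conns, size_t n, bool *accepting)
{
    size_t still_active = 0;
    *accepting = false;                 /* 1. no new connections */
    for (size_t i = 0; i < n; i++) {
        if (conns[i].state == CONN_IDLE_KEEPALIVE) {
            /* 2. a real server would close(conns[i].fd) here */
            conns[i].state = CONN_FREE;
        } else if (conns[i].state == CONN_ACTIVE) {
            still_active++;             /* 3. let in-flight requests finish */
        }
    }
    return still_active;
}
```

The worker would then keep running its event loop and exit once the active count reaches zero, or when the shutdown timeout fires.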

How is the new image started?
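After the exec, the new binary has to notice that it was started by an upgrade and reuse the inherited sockets instead of binding the ports again (which would typically fail with EADDRINUSE while the old master still holds them). A sketch of that check, assuming the old master exported the descriptor numbers in an environment variable named NGINX in the form 6;7;, as nginx does; parse_inherited_fds is a hypothetical helper that also verifies each descriptor is a listening socket:

```c
#define _GNU_SOURCE   /* expose getenv/getsockopt details under strict modes */
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>

/* Read descriptor numbers from NGINX ("6;7;", ';' or ':' separated) and keep
 * only those that are listening sockets. Returns how many were stored. */
int parse_inherited_fds(int *fds, int max)
{
    const char *s = getenv("NGINX");
    int count = 0;
    if (s == NULL)
        return 0;                        /* normal start: bind and listen as usual */

    while (*s != '\0' && count < max) {
        char *end;
        long fd = strtol(s, &end, 10);
        if (end == s)
            break;                       /* not a number: stop parsing */

        int listening = 0;
        socklen_t len = sizeof listening;
        if (fd >= 0 && fd <= INT_MAX &&
            getsockopt((int)fd, SOL_SOCKET, SO_ACCEPTCONN, &listening, &len) == 0 &&
            listening)
            fds[count++] = (int)fd;
        else
            fprintf(stderr, "ignoring inherited fd %ld: not a listening socket\n", fd);

        s = (*end == ';' || *end == ':') ? end + 1 : end;
    }
    return count;
}
```

If this returns zero, the new binary falls back to the usual socket/bind/listen sequence.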

That covers several approaches to zero-downtime updates. If anything is still unclear, see the video below.
https://www.bilibili.com/vide...