Container Internals: An In-Depth Look at the Namespace Mechanism

Namespace

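Linux Namespace is a kernel-level mechanism that Linux provides for isolating environments. It resembles chroot: chroot turns a given directory into the root directory so that nothing outside it can be accessed. Linux Namespace builds on that idea and provides isolation for UTS, IPC, Mount, PID, Network, User and more, as listed below.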

Type	Clone flag	Kernel version
Mount Namespaces	CLONE_NEWNS	Linux 2.4.19
UTS Namespaces	CLONE_NEWUTS	Linux 2.6.19
IPC Namespaces	CLONE_NEWIPC	Linux 2.6.19
PID Namespaces	CLONE_NEWPID	Linux 2.6.24
Network Namespaces	CLONE_NEWNET	started in Linux 2.6.24, completed in Linux 2.6.29
User Namespaces	CLONE_NEWUSER	started in Linux 2.6.23, completed in Linux 3.8

★
Linux Namespace documentation: Namespaces in operation
”

Three system calls are available for working with namespaces:

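clone() --- the system call that implements threads; it creates a new process, and passing the flags above places that process in new namespaces.
unshare() --- moves the calling process out of a namespace it currently shares and into a newly created one
setns(int fd, int nstype) --- moves the calling process into an existing namespace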
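To make the difference between these calls concrete, here is a minimal sketch of unshare() on its own, without clone(). The helper names rename_in_new_uts and hostname_is are made up for this illustration and are not from the original article. The sketch forks a child, detaches the child into a fresh UTS namespace, and renames the host there; the caller's hostname should stay untouched. Creating a UTS namespace normally needs CAP_SYS_ADMIN, so without privileges the child simply reports that unshare() was refused.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child, move it into a fresh UTS namespace
 * with unshare(), and set a new hostname there.
 * Returns 0 if the child renamed itself, 1 if unshare() was refused
 * (typically EPERM without CAP_SYS_ADMIN), 2 if sethostname() failed,
 * and -1 on fork/wait errors.
 */
int rename_in_new_uts(const char *name) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        /* Child: only touch the hostname once we are in our own namespace. */
        if (unshare(CLONE_NEWUTS) != 0) _exit(1);
        if (sethostname(name, strlen(name)) != 0) _exit(2);
        _exit(0);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) return -1;
    return WEXITSTATUS(status);
}

/* Hypothetical helper: returns 1 if the caller's hostname equals `expected`. */
int hostname_is(const char *expected) {
    char buf[256] = {0};
    if (gethostname(buf, sizeof(buf) - 1) != 0) return 0;
    return strcmp(buf, expected) == 0;
}
```

Calling rename_in_new_uts("sketch-host") and then checking hostname_is() against the hostname recorded beforehand should show the caller unaffected, whether or not the child had the privilege to create the namespace.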

The sections below use these system calls to demonstrate what each namespace does. For more detail, see DOCKER基础技术：LINUX NAMESPACE（上） and DOCKER基础技术：LINUX NAMESPACE（下）.

UTS Namespace

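A UTS Namespace mainly isolates the hostname, so every container can have its own. The code below demonstrates it. Note: if no hostname is set inside the container, the container uses the host's hostname; and if a hostname is set inside the container without CLONE_NEWUTS, what actually changes is the host's hostname.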

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/mount.h>
#include <stdio.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)
static char container_stack[STACK_SIZE];

char* const container_args[] = {
    "/bin/bash",
    NULL
};

int container_main(void* arg) {
    printf("Container [%5d] - inside the container!\n", getpid());
    sethostname("container_dawn", 15);
    execv(container_args[0], container_args);
    printf("Something's wrong!\n");
    return 1;
}

int main() {
    printf("Parent [%5d] - start a container!\n", getpid());
    int container_id = clone(container_main, container_stack + STACK_SIZE, 
                                CLONE_NEWUTS | SIGCHLD, NULL);
    waitpid(container_id, NULL, 0);
    printf("Parent - container stopped!\n");
    return 0;
}

PID Namespace

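Each container has its own process environment: PIDs inside the container are numbered from 1, while PIDs on the host also still start at 1. In effect there are two process environments, one on the host starting at 1 and another in the container starting at 1.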

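Why does numbering PIDs from 1 amount to isolating the process environment? Because on a traditional UNIX system the process with PID 1 is init, which has a special status. As the ancestor of all processes it has many privileges. It also watches over process state: when a process is orphaned (its parent exits without calling wait on it), init adopts it and reaps its resources when it terminates. So to isolate processes, the first step is to create a process whose PID is 1.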

However, 【in the case of Kubernetes】

int container_main(void* arg) {
    printf("Container [%5d] - inside the container!\n", getpid());
    sethostname("container_dawn", 15);
    execv(container_args[0], container_args);
    printf("Something's wrong!\n");
    return 1;
}

int main() {
    printf("Parent [%5d] - start a container!\n", getpid());
    int container_id = clone(container_main, container_stack + STACK_SIZE, 
                                CLONE_NEWUTS | CLONE_NEWPID | SIGCHLD, NULL);
    waitpid(container_id, NULL, 0);
    printf("Parent - container stopped!\n");
    return 0;
}

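If you now run ps, top, or similar commands in the child's shell, you still see every process. That is because ps and top read the /proc filesystem, and since the filesystem has not been isolated yet, the parent and the child see the same thing through these commands.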

IPC Namespace

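Common IPC mechanisms include shared memory, semaphores, and message queues. Once IPC is isolated with an IPC Namespace, only processes in the same namespace can communicate with each other, because the IPC objects of the host and of other namespaces are no longer visible. The isolation works because every IPC object that gets created has a unique ID, so isolating those IDs is enough.

To enable IPC isolation, just add the CLONE_NEWIPC flag to the clone call.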

int container_main(void* arg) {
    printf("Container [%5d] - inside the container!\n", getpid());
    sethostname("container_dawn", 15);
    execv(container_args[0], container_args);
    printf("Something's wrong!\n");
    return 1;
}

int main() {
    printf("Parent [%5d] - start a container!\n", getpid());
    int container_id = clone(container_main, container_stack + STACK_SIZE, 
                                CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWIPC | SIGCHLD, NULL);
    waitpid(container_id, NULL, 0);
    printf("Parent - container stopped!\n");
    return 0;
}

Mount Namespace

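A Mount Namespace lets a container have its own root filesystem. Note that when a mount namespace is created with CLONE_NEWNS, the child receives a copy of the parent's mount table. So unless the child mounts something again, the child and the parent have the same view of the filesystem; to change the container process's view you have to remount (this is where the mount namespace differs from the other namespaces).

In addition, mount operations performed in the child's new namespace affect only its own view of the filesystem and have no effect on the outside, which allows fairly strict isolation (shared mounts excepted, of course). Note that this applies to mount operations only; creating files and similar changes are still visible outside.

Below we remount /proc in the child so that ps shows what is inside the container.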

int container_main(void* arg) {
    printf("Container [%5d] - inside the container!\n", getpid());

    sethostname("container_dawn", 15);

    if (mount("proc", "/proc", "proc", 0, NULL) !=0 ) {
        perror("proc");
    }

    execv(container_args[0], container_args);
    printf("Something's wrong!\n");
    return 1;
}

int main() {
    printf("Parent [%5d] - start a container!\n", getpid());
    int container_id = clone(container_main, container_stack + STACK_SIZE, 
                                CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);
    waitpid(container_id, NULL, 0);
    printf("Parent - container stopped!\n");
    return 0;
}

★
One problem here: after the child exits, running ps -elf again reports an error, as shown below.

This is because /proc is a shared mount, so operations on it affect every mount namespace. See: http://unix.stackexchange.com/questions/281844/why-does-child-with-mount-namespace-affect-parent-mounts
”
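A common way to avoid the side effect described in the note above is to mark the mounts in the new namespace as private before mounting anything, so that later mounts no longer propagate back to the parent. The sketch below is one way to do that under stated assumptions, not the original author's code: mount_private_proc and try_private_proc are names made up for this illustration. The first function is meant to be called at the top of container_main; the second is a small driver that exercises it in a forked child with its own mount namespace.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/mount.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, meant to run inside a new mount namespace.
 * Marks every mount as private (recursively), then mounts a fresh /proc.
 * Returns 0 on success, 2 if the MS_PRIVATE step failed, 3 if mounting
 * /proc failed.
 */
int mount_private_proc(void) {
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) != 0) return 2;
    if (mount("proc", "/proc", "proc", 0, NULL) != 0) return 3;
    return 0;
}

/*
 * Hypothetical driver: run the helper in a child that has its own mount
 * namespace. Returns the helper's result, 1 if unshare(CLONE_NEWNS) was
 * refused (for example without CAP_SYS_ADMIN), or -1 on fork/wait errors.
 */
int try_private_proc(void) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        if (unshare(CLONE_NEWNS) != 0) _exit(1);
        _exit(mount_private_proc());
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) return -1;
    return WEXITSTATUS(status);
}
```

Because the new /proc mount exists only in the child's private namespace, it disappears when the child exits and the parent's /proc is left alone.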

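The code above only remounts /proc; every other directory still looks the same as in the parent. Generally, once a container is created, its process should see an independent, isolated environment rather than inheriting the host's filesystem. Next we build a knock-off image to imitate Docker's use of the Mount Namespace: we give the child a reasonably complete, independent root filesystem so that the process can only access what is inside that filesystem (think of how a Docker container normally looks).

First we use docker export to export the busybox image into a rootfs directory. As the figure shows, this rootfs directory already contains special directories such as /proc and /sys.

Then, in the code, we remount a few special directories and use the chroot() system call to change the process's root directory to the rootfs directory above.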

char* const container_args[] = {
    "/bin/sh",
    NULL
};

int container_main(void* arg) {
    printf("Container [%5d] - inside the container!\n", getpid());
    sethostname("container_dawn", 15);
    
    if (mount("proc", "rootfs/proc", "proc", 0, NULL) != 0) {
        perror("proc");
    }
    if (mount("sysfs", "rootfs/sys", "sysfs", 0, NULL)!=0) {
        perror("sys");
    }
    if (mount("none", "rootfs/tmp", "tmpfs", 0, NULL)!=0) {
        perror("tmp");
    }
    if (mount("udev", "rootfs/dev", "devtmpfs", 0, NULL)!=0) {
        perror("dev");
    }
    if (mount("devpts", "rootfs/dev/pts", "devpts", 0, NULL)!=0) {
        perror("dev/pts");
    }
    if (mount("shm", "rootfs/dev/shm", "tmpfs", 0, NULL)!=0) {
        perror("dev/shm");
    }
    if (mount("tmpfs", "rootfs/run", "tmpfs", 0, NULL)!=0) {
        perror("run");
    }

    if ( chdir("./rootfs") || chroot("./") != 0 ){
        perror("chdir/chroot");
    }

    // After chroot, /bin/sh is looked up relative to the new root directory
    execv(container_args[0], container_args);
    perror("exec");
    printf("Something's wrong!\n");
    return 1;
}

int main() {
    printf("Parent [%5d] - start a container!\n", getpid());
    int container_id = clone(container_main, container_stack + STACK_SIZE, 
                                CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);
    waitpid(container_id, NULL, 0);
    printf("Parent - container stopped!\n");
    return 0;
}

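Finally, the figure below shows the result.

In fact, the Mount Namespace was invented by steadily improving on chroot, and chroot can be considered the first namespace in Linux. The filesystem above, mounted at the container's root directory to give the container process its isolated execution environment, is the so-called container image, also known as the rootfs (root filesystem). To be clear, a rootfs contains only an operating system's files, configuration, and directories; it does not include the operating system kernel.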

User Namespace

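The UIDs and GIDs seen inside a container differ from those outside. For example, the user dawn may show up as 0 inside the container while actually being 1000 on the host. To get this effect, the container's UIDs must be mapped to host UIDs by writing to /proc/<pid>/uid_map and /proc/<pid>/gid_map, whose format is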

ID-INSIDE-NS  ID-OUTSIDE-NS LENGTH

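ID-INSIDE-NS: the first UID or GID as displayed inside the container
ID-OUTSIDE-NS: the real UID or GID outside the container that it maps to
LENGTH: the size of the mapped range, usually 1, meaning a one-to-one mapping

For example, the following maps the real uid=1000 to uid=0 inside the container: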

$ cat /proc/8353/uid_map
0       1000          1

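As another example, the following maps UIDs starting at 0 inside the namespace to UIDs starting at 0 outside, with a range spanning the unsigned 32-bit integers (this command was run on the host).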

$ cat /proc/$$/uid_map
0          0 4294967295
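The three-column format can be read as a lookup rule: an inside ID falls in a line's range when it lies between ID-INSIDE-NS and ID-INSIDE-NS + LENGTH - 1, and its outside ID is offset by the same amount from ID-OUTSIDE-NS. The sketch below applies that rule to the text of a map file. The function name map_inside_to_outside is made up for this illustration, and the kernel's own lookup is of course implemented differently.

```c
#include <stdio.h>

/*
 * Hypothetical helper: translate an ID seen inside a namespace to the ID
 * outside it, given the text of a uid_map/gid_map file. Each line reads
 * "ID-INSIDE-NS ID-OUTSIDE-NS LENGTH". Returns -1 when no line covers
 * the ID (the ID is unmapped).
 */
long long map_inside_to_outside(const char *map_text, long long inside_id) {
    const char *p = map_text;
    long long in, out, len;
    int used = 0;

    /* %n records how many characters one entry consumed, so we can advance. */
    while (sscanf(p, " %lld %lld %lld%n", &in, &out, &len, &used) == 3) {
        if (inside_id >= in && inside_id - in < len) {
            return out + (inside_id - in);
        }
        p += used;
    }
    return -1;
}
```

With the first example map above, inside ID 0 resolves to 1000 and every other inside ID is unmapped; with the host's identity map, each ID resolves to itself.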

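By default, if CLONE_NEWUSER is set but the two files above are not written, the IDs inside the container show up as 65534. This is because the process's real UID has no mapping in the new namespace, so the kernel reports the overflow UID instead, which is 65534 by default (see /proc/sys/kernel/overflowuid). The code below shows this: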

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/mount.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

static char container_stack[STACK_SIZE];
char* const container_args[] = {
    "/bin/bash",
    NULL
};

int container_main(void* arg) {

    printf("Container [%5d] - inside the container!\n", getpid());

    printf("Container: eUID = %ld; eGID = %ld, UID=%ld, GID=%ld\n",
            (long) geteuid(), (long) getegid(), (long) getuid(), (long) getgid());

    printf("Container [%5d] - setup hostname!\n", getpid());
    
    //set hostname
    sethostname("container",10);

    execv(container_args[0], container_args);
    printf("Something's wrong!\n");
    return 1;
}

int main() {
    const int gid=getgid(), uid=getuid();

    printf("Parent: eUID = %ld; eGID = %ld, UID=%ld, GID=%ld\n",
            (long) geteuid(), (long) getegid(), (long) getuid(), (long) getgid());
 
    printf("Parent [%5d] - start a container!\n", getpid());

    int container_pid = clone(container_main, container_stack+STACK_SIZE, 
            CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWUSER | SIGCHLD, NULL);

    
    printf("Parent [%5d] - Container [%5d]!\n", getpid(), container_pid);

    printf("Parent [%5d] - user/group mapping done!\n", getpid());

    waitpid(container_pid, NULL, 0);
    printf("Parent - container stopped!\n");
    return 0;
}

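When I run the program as user dawn, the output is as shown in the figure below. Running as root gives the same result:

Next we implement the mapping, so that dawn shows up as 0 inside the container. The code is taken almost entirely from 耗子叔's blog; the link is at the end of this article: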

int pipefd[2];

void set_map(char* file, int inside_id, int outside_id, int len) {
    FILE* mapfd = fopen(file, "w");
    if (NULL == mapfd) {
        perror("open file error");
        return;
    }
    fprintf(mapfd, "%d %d %d", inside_id, outside_id, len);
    fclose(mapfd);
}

void set_uid_map(pid_t pid, int inside_id, int outside_id, int len) {
    char file[256];
    sprintf(file, "/proc/%d/uid_map", pid);
    set_map(file, inside_id, outside_id, len);
}

int container_main(void* arg) {

    printf("Container [%5d] - inside the container!\n", getpid());

    printf("Container: eUID = %ld; eGID = %ld, UID=%ld, GID=%ld\n",
            (long) geteuid(), (long) getegid(), (long) getuid(), (long) getgid());

    /* Wait for the parent's notification before continuing (inter-process synchronization) */
    char ch;
    close(pipefd[1]);
    read(pipefd[0], &ch, 1);

    printf("Container [%5d] - setup hostname!\n", getpid());
    //set hostname
    sethostname("container",10);

    //remount "/proc" to make sure the "top" and "ps" show container's information
    mount("proc", "/proc", "proc", 0, NULL);

    execv(container_args[0], container_args);
    printf("Something's wrong!\n");
    return 1;
}

int main() {
    const int gid=getgid(), uid=getuid();

    printf("Parent: eUID = %ld; eGID = %ld, UID=%ld, GID=%ld\n",
            (long) geteuid(), (long) getegid(), (long) getuid(), (long) getgid());

    pipe(pipefd);
 
    printf("Parent [%5d] - start a container!\n", getpid());

    int container_pid = clone(container_main, container_stack+STACK_SIZE, 
            CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUSER | SIGCHLD, NULL);

    
    printf("Parent [%5d] - Container [%5d]!\n", getpid(), container_pid);

    //To map the uid/gid, 
    // we need edit the /proc/PID/uid_map (or /proc/PID/gid_map) in parent
    set_uid_map(container_pid, 0, uid, 1);

    printf("Parent [%5d] - user/group mapping done!\n", getpid());

    /* Notify the child */
    close(pipefd[1]);

    waitpid(container_pid, NULL, 0);
    printf("Parent - container stopped!\n");
    return 0;
}

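The figure shows the final result: inside the container, dawn's UID is displayed as 0 (root). The /bin/bash process in the container is nevertheless still running as an ordinary user, dawn; only the displayed UID is 0, so looking into the /root directory is still denied.

A User Namespace can be created by an ordinary user, but creating the other namespaces requires root privileges. So what do we do when we need several namespaces? We can first create the User Namespace as an ordinary user, map that user to root, and then create the other namespaces as root inside the container.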
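The ordering described above can be sketched as follows. This is an illustration under stated assumptions rather than the article's code: userns_then_uts and write_file are names made up here, the child writes its own /proc/self/uid_map instead of having the parent do it, and the whole thing only works on systems that allow unprivileged user namespaces. The child first enters a new User Namespace, maps its outer UID to 0, and only then creates a UTS Namespace and sets the hostname in it.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: write `text` to an existing file; 0 on success. */
static int write_file(const char *path, const char *text) {
    int fd = open(path, O_WRONLY);
    if (fd < 0) return -1;
    ssize_t n = write(fd, text, strlen(text));
    close(fd);
    return n == (ssize_t)strlen(text) ? 0 : -1;
}

/*
 * Hypothetical helper. Returns 0 when every step worked, otherwise the
 * number of the first step that failed:
 *   1 = unshare(CLONE_NEWUSER), 2 = writing uid_map,
 *   3 = UID is not 0 after mapping, 4 = unshare(CLONE_NEWUTS),
 *   5 = sethostname. Returns -1 on fork/wait errors.
 */
int userns_then_uts(void) {
    uid_t outer_uid = geteuid();   /* read before entering the namespace */
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        char line[64];
        if (unshare(CLONE_NEWUSER) != 0) _exit(1);
        /* Map UID 0 inside the namespace to our original UID outside. */
        snprintf(line, sizeof(line), "0 %ld 1", (long)outer_uid);
        if (write_file("/proc/self/uid_map", line) != 0) _exit(2);
        if (geteuid() != 0) _exit(3);
        /* Now "root" in the user namespace: other namespaces are allowed. */
        if (unshare(CLONE_NEWUTS) != 0) _exit(4);
        if (sethostname("container", 9) != 0) _exit(5);
        _exit(0);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) return -1;
    return WEXITSTATUS(status);
}
```

Everything happens in a forked child, so the caller's own UID and hostname are left as they were. A return value of 1 usually means the system does not permit unprivileged user namespaces.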

Network Namespace

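A Network Namespace isolates the network in a container: each container has its own virtual network interfaces and IP addresses. On Linux, the ip command can create a Network Namespace (Docker's source code does not call the ip command; it implements some of ip's functionality itself).

Below we use the ip command to walk through building a Network Namespace, using a bridge network as the example. A bridge network's topology generally looks like the figure below, where br0 is a Linux bridge.

When using Docker, if you start a container and run ip link show to look at the host's network, you will see a docker0 interface and a veth**** virtual NIC. That veth NIC is the veth in the figure above, and docker0 corresponds to br0 in the figure.

We can use the commands below to get a result similar to Docker's (adapted from 耗子叔's blog, linked in the references at the end, with some extra commentary based on the figure above).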

## 1. First, add a bridge named lxcbr0 to imitate docker0
brctl addbr lxcbr0
brctl stp lxcbr0 off
ifconfig lxcbr0 192.168.10.1/24 up # assign an IP address to the bridge

## 2. Next, create a network namespace named ns1

# Add a namespace named ns1 (using the ip netns add command)
ip netns add ns1 

# Bring up loopback (127.0.0.1) in the namespace. ip netns exec ns1 enters the ns1 namespace, so ip link set dev lo up runs inside ns1.
ip netns exec ns1   ip link set dev lo up 

## 3. Then, add a pair of virtual NICs

# Add a pair of virtual NICs; note the veth type. There are two NICs here: veth-ns1 and lxcbr0.1. veth-ns1 goes into the container, and lxcbr0.1 is attached to the bridge lxcbr0, i.e. the veth in the figure above.
ip link add veth-ns1 type veth peer name lxcbr0.1

# Move veth-ns1 into namespace ns1, so the container gets a new NIC
ip link set veth-ns1 netns ns1

# Rename veth-ns1 inside the container to eth0 (the name would clash outside the container, but not inside)
ip netns exec ns1  ip link set dev veth-ns1 name eth0 

# Assign an IP address to the container's NIC and bring it up
ip netns exec ns1 ifconfig eth0 192.168.10.11/24 up


# We moved veth-ns1 into the container above; now attach lxcbr0.1 to the bridge
brctl addif lxcbr0 lxcbr0.1

# Add a route for the container so it can reach the outside network
ip netns exec ns1     ip route add default via 192.168.10.1

## 4. Set up resolv.conf for this namespace so that domain names resolve inside the container
echo "nameserver 8.8.8.8" > conf/resolv.conf

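This is essentially how Docker networking works, except that:

Docker does not use the ip command; it implements some of ip's functionality itself.
Docker does not handle resolv.conf this way. It writes the settings to a designated resolv.conf and, when starting the container, mounts that file read-only into the container's filesystem together with the hostname and hosts files.
Docker uses the process's PID as the name of the network namespace.

Likewise, we can add a new NIC to a running Docker container as follows: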

ip link add peerA type veth peer name peerB 
brctl addif docker0 peerA 
ip link set peerA up 
ip link set peerB netns ${container-pid} 
ip netns exec ${container-pid} ip link set dev peerB name eth1 
ip netns exec ${container-pid} ip link set eth1 up 
ip netns exec ${container-pid} ip addr add ${ROUTEABLE_IP} dev eth1

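Inspecting Namespaces

The interface to Cgroups is a filesystem located at /sys/fs/cgroup. Namespaces can likewise be inspected through the filesystem, mainly the /proc/<pid>/ns directory.

Take the PID Namespace program above as an example. Once it is running, we can see that its PID is 11702.

Then, keeping the child running, we open another shell and look up the PID of the child this program created, that is, the host PID of the process running in the container.

Finally, we look at /proc/11702/ns and /proc/11703/ns, that is, the namespaces of these two processes. cgroup, ipc, mnt, net, and user share the same ID, while pid and uts have different IDs. If two processes have the same namespace number, they are in the same namespace; otherwise they are in different namespaces.

Besides letting you inspect namespaces, these files have another use: once one of them is opened, the namespace keeps existing for as long as the fd is held, even after every process in the namespace has exited. For example, mount --bind /proc/11703/ns/uts ~/uts keeps the UTS Namespace of process 11703 alive.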
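The same comparison can be done programmatically. According to the namespaces(7) manual page, two processes are in the same namespace when the /proc/<pid>/ns entries for that namespace type have the same device ID and inode number. The sketch below checks exactly that; same_namespace is a name made up for this illustration.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if processes a and b are in the same
 * namespace of the given type ("uts", "pid", "mnt", "net", ...),
 * 0 if they are in different ones, and -1 if either /proc entry
 * cannot be read.
 */
int same_namespace(pid_t a, pid_t b, const char *type) {
    char path_a[64], path_b[64];
    struct stat sa, sb;

    snprintf(path_a, sizeof(path_a), "/proc/%ld/ns/%s", (long)a, type);
    snprintf(path_b, sizeof(path_b), "/proc/%ld/ns/%s", (long)b, type);

    /* stat() follows the symlink, so we compare the namespace objects. */
    if (stat(path_a, &sa) != 0 || stat(path_b, &sb) != 0) return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

For the two PIDs in the walkthrough above, same_namespace(11702, 11703, "uts") would be expected to return 0 and same_namespace(11702, 11703, "net") to return 1, matching the IDs read from the ns directories. Reading another user's /proc/<pid>/ns entries may require extra privileges.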

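Summary

Namespace technology in fact changes the application process's "view" of the whole machine: the operating system restricts that view so the process can only "see" certain specified content, and this affects only the application process. To the host, however, these isolated processes are still just processes, not much different from any other process on the host, and all of them are managed by the host. They merely carry extra Namespace parameters. The role Docker plays here is more that of a side-channel helper and manager, as shown in the left-hand figure.

For this reason, containers are more popular than virtual machines. If a virtual machine is used as the application sandbox, a hypervisor must create the virtual machine, the virtual machine really exists, and it must run a complete guest OS before it can execute the user's application process. Using virtual machines therefore inevitably brings extra resource consumption. In experiments, a KVM virtual machine running CentOS takes up 100-200 MB of memory after boot without any tuning. Moreover, when the user's application runs in a virtual machine, its calls to the host operating system must be intercepted and handled by the virtualization software, which is itself a layer of overhead and is especially costly for compute, network, and disk I/O.

With containers, a containerized application is still essentially a process on the host, so the performance loss caused by virtualization does not exist. A container that uses Namespaces for isolation also needs no separate guest OS, which makes the container's extra resource usage almost negligible.

Overall, "agility" and "high performance" are the biggest advantages containers have over virtual machines, and they are a major reason containers thrive on PaaS, a platform with more fine-grained resource management.

But! Compared with virtualization, isolation based on Linux Namespaces also has many shortcomings, the main one being that the isolation is incomplete.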

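First, a container is just a special kind of process running on the host, so containers all use the same host operating system. A mount namespace lets a container mount the files of a different operating system version, such as centos or ubuntu, but that does not change the fact that the host kernel is shared. It means that running a Linux container on windows, or running a container that needs a newer Linux kernel on a host with an older kernel, simply does not work.

A virtual machine, with hardware virtualization and its own guest OS, is far more convenient in this respect.

Second, many resources and objects in the Linux kernel cannot be namespaced, the system time being one example. If a program in your container changes the time with the settimeofday(2) system call, the time of the whole host changes with it.

Compared with the freedom to do anything inside a virtual machine, "what can and cannot be done" is something users must think about when deploying applications in containers. In addition, a container exposes a fairly large attack surface to the application, and "breaking out" is much easier than from a virtual machine. In practice, technologies such as Seccomp can filter and vet every system call issued from inside the container to harden it, but the extra filtering layer also costs container performance. As a result, nobody in production dares to expose a Linux container running on a physical machine directly to the public internet.

Furthermore, a container follows a "single process" model. A container is in essence a process: the user's application process is the PID=1 process in the container, and it is also the parent of every process created afterwards. This means you cannot run two different applications in one container at the same time unless you find a common PID=1 program in advance to act as the parent of both, such as systemd or supervisord. Containers are designed so that the container and the application share the same lifecycle, rather than the container still running while the application inside has long since died.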

★
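My own understanding of the paragraph above: once a child process is created it needs to run, and the parent then has to wait for it to finish, so in effect only the child is running. The first process in a container is usually the one the business needs, that is, the process started from the program specified by entrypoint, and creating a child would leave that process suspended. Even if the child is moved to the background, the PID=1 process in the container has no ability at all to manage background processes, so some processes still end up unmanaged.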
”
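As a rough illustration of the "managing child processes" duty that an ordinary application running as PID=1 usually lacks, here is a minimal sketch of the reaping part of an init-like process. The names spawn_short_lived and reap_all_children are made up for this illustration; a real init such as systemd or supervisord does far more (signal forwarding, restarts, logging).

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork n children that exit at once; returns how many started. */
int spawn_short_lived(int n) {
    int started = 0;
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0) _exit(0);      /* child: finish immediately */
        if (pid > 0) started++;
    }
    return started;
}

/*
 * Hypothetical helper: block until every child of the caller has exited,
 * collecting each exit status so no zombie is left behind.
 * Returns the number of children reaped.
 */
int reap_all_children(void) {
    int reaped = 0;
    for (;;) {
        pid_t pid = waitpid(-1, NULL, 0);
        if (pid > 0) {
            reaped++;
            continue;
        }
        if (errno == EINTR) continue;  /* interrupted by a signal: retry */
        break;                         /* ECHILD: no children remain */
    }
    return reaped;
}
```

A PID=1 process that never runs a loop like this leaves exited children as zombies, which is one concrete form of the "unmanaged processes" mentioned in the note above.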

On the Shoulders of Giants

极客时间, 《深入剖析 Kubernetes》, by 张磊

DOCKER基础技术：LINUX NAMESPACE（上）

DOCKER基础技术：LINUX NAMESPACE（下）

This article is shared from the WeChat public account 多选参数 (zhouxintalk).