Linux Namespace Series (02): UTS namespace (CLONE_NEWUTS)

Date: 2019-11-09

The UTS namespace isolates two system identifiers: the hostname and the NIS domain name.

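Both can be set with sethostname(2) and setdomainname(2), and read with uname(2), gethostname(2) and getdomainname(2). (The 2 in parentheses means the function is a system call, documented in section 2 of the manual; for the meaning of the other section numbers, see man's own documentation, `man man`.)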

The term UTS comes from the structure that uname() fills in, struct utsname, whose name in turn derives from "UNIX Time-sharing System".

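Because the UTS namespace is the simplest, it is covered first. In this article we will get to know the UTS namespace and the three namespace-related system calls: clone, setns and unshare.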

Note: the NIS domain name has nothing to do with DNS. I am not very familiar with it, so it is not covered in this article.

All examples below were tested on ubuntu-server-x86_64 16.04.

Creating a new UTS namespace

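Enough talk, here is the code. I have tried to make the comments as detailed as possible, so please read both the code and its output carefully.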

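Notes:

To keep the code simple, errors are handled only for the clone call; for details on clone, see its man page.

For convenience, a UTS namespace is sometimes referred to by its hostname: the namespace whose hostname is container001 is called namespace container001.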

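#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

#define NOT_OK_EXIT(code, msg) do { if ((code) == -1) { perror(msg); exit(-1); } } while (0)

// The child process starts executing here
static int child_func(void *hostname)
{
    // Set the hostname (affects only the child's new UTS namespace)
    sethostname(hostname, strlen(hostname));

    // Replace the current child process with a new bash.
    // After execlp the child has not exited and no new process was created;
    // the child simply stops running its own code and runs bash's code instead.
    // See "man execlp" for details.
    // When bash exits, the child process is finished.
    execlp("bash", "bash", (char *) NULL);

    // Code from here on runs only if execlp fails; on success the
    // child has already been replaced by bash
    return 0;
}

static char child_stack[1024*1024]; // 1 MiB stack for the child process

int main(int argc, char *argv[])
{
    pid_t child_pid;

    if (argc < 2) {
        printf("Usage: %s <child-hostname>\n", argv[0]);
        return -1;
    }

    // Create and start the child. After this call the parent keeps running,
    // i.e. it goes on to the waitpid below
    child_pid = clone(child_func,  // the child runs child_func
                    // the stack grows from high to low addresses, so pass its top
                    child_stack + sizeof(child_stack),
                    // CLONE_NEWUTS creates a new UTS namespace;
                    // SIGCHLD is the signal sent to the parent when the child exits,
                    // it is unrelated to namespaces
                    CLONE_NEWUTS | SIGCHLD,
                    argv[1]);  // argument passed to child_func
    NOT_OK_EXIT(child_pid, "clone");

    waitpid(child_pid, NULL, 0); // wait for the child to finish

    return 0;    // after this line the parent exits
}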

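In the code above:

The parent creates a child with CLONE_NEWUTS set, which creates a new UTS namespace and places the child in it; the parent then waits for the child to exit.
The child sets the new hostname and is then replaced by bash.
When bash exits, the child exits, and then the parent exits as well.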

Now let's look at the output.

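#------------------------first shell window------------------------
#Save the code above as namespace_uts_demo.c,
#then compile it with gcc into the executable namespace_uts_demo
dev@ubuntu:~/code$ gcc namespace_uts_demo.c -o namespace_uts_demo

#Start the program with the argument container001
#Creating a new UTS namespace requires root privileges (CAP_SYS_ADMIN), hence sudo
dev@ubuntu:~/code$ sudo ./namespace_uts_demo container001

#A new bash has started; the prompt shows that the hostname is now container001
#The prompt ends with '#', meaning this bash has root privileges,
#because we ran the program with sudo, so the child it created also has root privileges
root@container001:~/code#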

#Double-check with the hostname command
root@container001:~/code# hostname
container001

#pstree shows the parent/child relationships between processes
#The output below leaves out everything unrelated to namespace_uts_demo
#This session was done over an ssh connection to the Linux host,
#so the ancestors of bash(24429) are a chain of sshd processes.
#In bash(24429) we ran sudo ./namespace_uts_demo container001,
#which gives us sudo(27332) and our program's process namespace_uts_d(27333).
#Our program cloned a new child; because CLONE_NEWUTS was passed to clone,
#the new child belongs to a new UTS namespace. The child then called execlp and was replaced by bash,
#giving bash(27334), which keeps all the attributes of the child process it replaced.
#Since we ran pstree inside bash(27334),
#pstree(27345) is a child of bash(27334)
root@container001:~/code# pstree -pl
systemd(1)───sshd(24351)───sshd(24428)───bash(24429)───sudo(27332)───namespace_uts_d(27333)───bash(27334)───pstree(27345)

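#Verify that the bash we are running is bash(27334)
#The following command prints the PID of the current bash
root@container001:~/code# echo $$
27334

#Verify that our parent and child processes are in different UTS namespaces
root@container001:~/code# readlink /proc/27333/ns/uts
uts:[4026531838]
root@container001:~/code# readlink /proc/27334/ns/uts
uts:[4026532445]
#They are indeed in different UTS namespaces, so the new UTS namespace was created successfully

#By default a child process inherits its parent's namespaces.
#systemd(1) is an ancestor of our program's parent process namespace_uts_d(27333),
#so they should be in the same namespace
root@container001:~/code# readlink /proc/1/ns/uts
uts:[4026531838]

#Every process started inside bash(27334) should be in the same namespace as bash(27334)
#self refers to the currently running process, here the readlink process itself
root@container001:~/code# readlink /proc/self/ns/uts
uts:[4026532445]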

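#------------------------second shell window------------------------
#Open a new shell window and confirm that this shell is in the same namespace as namespace_uts_d(27333) above
dev@ubuntu:~/code$ readlink /proc/$$/ns/uts
uts:[4026531838]

#The hostname in the old namespace is unchanged; the new namespace does not affect it
dev@ubuntu:~/code$ hostname
ubuntu
#If you are curious, try setting the hostname with the hostname command in each window;
#you will find that the two do not affect each other. This is not shown here.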


#------------------------first shell window------------------------
#Back in the original shell, let's see what happens if we run the program again inside container001
root@container001:~/code# ./namespace_uts_demo container002

#A new UTS namespace was created and the hostname changed to container002
root@container002:~/code#
root@container002:~/code# hostname
container002

#A new UTS namespace
root@container002:~/code# readlink /proc/$$/ns/uts
uts:[4026532455]

#The process relationships are as before, with namespace_uts_d(27354) and bash(27355) appended
root@container002:~/code# pstree -pl
systemd(1)───sshd(24351)───sshd(24428)───bash(24429)───sudo(27332)───namespace_uts_d(27333)───bash(27334)───namespace_uts_d(27354)───bash(27355)───pstree(27367)

#After bash(27355) exits, its parent namespace_uts_d(27354) exits as well,
#so we are back in bash(27334) and the hostname is container001 again
#Note: while bash(27355) exits, no process changes its namespace;
#all processes belonging to namespace container002 have simply finished and exited
root@container002:~/code# exit
exit
root@container001:~/code#
root@container001:~/code# hostname
container001

Joining the current process to a specified namespace

Again, straight to the code. With the groundwork above, this code is very simple; please read the code and its output carefully.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

#define NOT_OK_EXIT(code, msg) do { if ((code) == -1) { perror(msg); exit(-1); } } while (0)

int main(int argc, char *argv[])
{
    int fd, ret;

    if (argc < 2) {
        printf("Usage: %s /proc/PID/ns/FILE\n", argv[0]);
        return -1;
    }

    // Get a file descriptor for the namespace file
    fd = open(argv[1], O_RDONLY);
    NOT_OK_EXIT(fd, "open");

    // After setns, the current process is a member of the given namespace.
    // A second argument of 0 accepts any namespace type; the kernel
    // determines the type from fd
    ret = setns(fd, 0);
    NOT_OK_EXIT(ret, "setns");

    // Replace the current process with a new bash
    execlp("bash", "bash", (char *) NULL);

    // Reached only if execlp fails
    perror("execlp");
    return -1;
}

In the code above, the program calls setns to join the namespace given on the command line, then replaces itself with bash.

Now the results:

#--------------------------first shell window----------------------
#Reuse namespace container001 created above
#First confirm that the hostname is correct
root@container001:~/code# hostname
container001

#Get the PID of bash
root@container001:~/code# echo $$
27334

#Get the UTS namespace that bash belongs to
root@container001:~/code# readlink /proc/27334/ns/uts
uts:[4026532445]



#--------------------------second shell window----------------------
#Open another shell window, save the code above as namespace_join.c and compile it
dev@ubuntu:~/code$ gcc namespace_join.c -o namespace_join

#Before running the program, confirm that the current bash is not in namespace container001
dev@ubuntu:~/code$ hostname
ubuntu
dev@ubuntu:~/code$ readlink /proc/$$/ns/uts
uts:[4026531838]

#Run the program to join the namespace of the bash in the first shell window
#27334 is the PID of bash in the first shell window
dev@ubuntu:~/code$ sudo ./namespace_join /proc/27334/ns/uts
root@container001:~/code#

#Success: the hostname in the prompt and the UTS namespace inode number both match the first shell window
root@container001:~/code# hostname
container001
root@container001:~/code# readlink /proc/$$/ns/uts
uts:[4026532445]

Leaving the current namespace and joining a newly created one

On to the next piece of code:

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

#define NOT_OK_EXIT(code, msg) do { if ((code) == -1) { perror(msg); exit(-1); } } while (0)

static void usage(const char *pname)
{
    char usage[] = "Usage: %s [options]\n"
                   "Options are:\n"
                   "    -i   unshare IPC namespace\n"
                   "    -m   unshare mount namespace\n"
                   "    -n   unshare network namespace\n"
                   "    -p   unshare PID namespace\n"
                   "    -u   unshare UTS namespace\n"
                   "    -U   unshare user namespace\n";
    printf(usage, pname);
    exit(0);
}

int main(int argc, char *argv[])
{
    int flags = 0, opt, ret;

    // Parse command-line options to decide which namespace types to leave
    while ((opt = getopt(argc, argv, "imnpuUh")) != -1) {
        switch (opt) {
            case 'i': flags |= CLONE_NEWIPC;        break;
            case 'm': flags |= CLONE_NEWNS;         break;
            case 'n': flags |= CLONE_NEWNET;        break;
            case 'p': flags |= CLONE_NEWPID;        break;
            case 'u': flags |= CLONE_NEWUTS;        break;
            case 'U': flags |= CLONE_NEWUSER;       break;
            case 'h': usage(argv[0]);               break;
            default:  usage(argv[0]);
        }
    }

    if (flags == 0) {
        usage(argv[0]);
    }

    // After unshare, the current process leaves its current namespace(s) of
    // the requested type(s) and enters newly created ones of those types.
    // Exception: with CLONE_NEWPID the caller itself stays where it is;
    // only children it creates afterwards go into the new PID namespace
    ret = unshare(flags);
    NOT_OK_EXIT(ret, "unshare");

    // Replace the current process with a new bash
    execlp("bash", "bash", (char *) NULL);

    // Reached only if execlp fails
    perror("execlp");
    return -1;
}

The result:

#Save the code above as namespace_leave.c and compile it
dev@ubuntu:~/code$ gcc namespace_leave.c -o namespace_leave

#Check which UTS namespace the current bash belongs to
dev@ubuntu:~/code$ readlink /proc/$$/ns/uts
uts:[4026531838]

#Run the program; -u means leave the current UTS namespace and join a new one
dev@ubuntu:~/code$ sudo ./namespace_leave -u
root@ubuntu:~/code#

#Check the UTS namespace again: it has changed, so we left the old namespace and joined a new one
#Observant readers may notice that this inode number is the same as that of namespace container002 above,
#which shows that after container002 was destroyed, its inode number was reused
root@ubuntu:~/code# readlink /proc/$$/ns/uts
uts:[4026532455]

#Run it a few more times, with similar results
root@ubuntu:~/code# ./namespace_leave -u
root@ubuntu:~/code# readlink /proc/$$/ns/uts
uts:[4026532456]
root@ubuntu:~/code# ./namespace_leave -u
root@ubuntu:~/code# readlink /proc/$$/ns/uts
uts:[4026532457]
root@ubuntu:~/code# ./namespace_leave -u
root@ubuntu:~/code# readlink /proc/$$/ns/uts
uts:[4026532458]

Implementation in the kernel

The sections above demonstrated what these three system calls do. So how is the UTS namespace implemented in the kernel?

In older kernels, the UTS information was stored in a global variable shared by all processes, and gethostname() was implemented roughly as follows:

asmlinkage long sys_gethostname(char __user *name, int len)
{
  ...
  if (copy_to_user(name, system_utsname.nodename, i))
    errno = -EFAULT;
  ...
}

In newer Linux kernels, the per-process task structure, struct task_struct, has a field named nsproxy of type struct nsproxy:

struct task_struct {
  ...
  /* namespaces */
  struct nsproxy *nsproxy;
  ...
};

struct nsproxy {
  atomic_t count;
  struct uts_namespace *uts_ns;
  struct ipc_namespace *ipc_ns;
  struct mnt_namespace *mnt_ns;
  struct pid_namespace *pid_ns_for_children;
  struct net       *net_ns;
  struct cgroup_namespace *cgroup_ns;
};

So the newer gethostname() looks roughly like this:

static inline struct new_utsname *utsname(void)
{
  //current points to the task_struct of the currently running process
  return &current->nsproxy->uts_ns->name;
}

SYSCALL_DEFINE2(gethostname, char __user *, name, int, len)
{
  struct new_utsname *u;
  ...
  u = utsname();
  if (copy_to_user(name, u->nodename, i)){
    errno = -EFAULT;
  }
  ...
}

For processes in different UTS namespaces, the nsproxy->uts_ns pointer in their task structures points to different structures, and that is how UTS isolation is achieved.

The other namespace types work on essentially the same principle.

Summary

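The essence of a namespace is to split a resource that used to be shared globally by all processes into many copies, each shared by one group of processes.
When all processes in a namespace have exited, the namespace is destroyed (unless something else keeps it alive, such as an open /proc/PID/ns/uts file descriptor or a bind mount of that file), so it makes little sense to talk about a namespace apart from its processes.
The UTS namespace is simply an attribute of a process: processes with the same value belong to the same namespace, regardless of whether they are related to each other.
Both clone and unshare can create a new namespace and join it; the main difference is:
- unshare moves the current process into the newly created namespace
- clone creates a new child process and puts the child into the new namespace
UTS namespaces are not nested, i.e. no UTS namespace is the parent of another.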