Reading Notes on "Linux Kernel Development": Processes and Scheduling

Date: 2019-12-14

1. Processes

A process includes:

executing program code(text section)
data section containing global variables
open files
pending signals
internal kernel data
address space
one or more threads of execution

Processes, in effect, are the living result of running program code.

This is LKD's classic description of a process.

1.1 Process Descriptor

In Linux, the process descriptor is the struct task_struct structure, which is roughly 1.7KB on a 32-bit machine.

1.1.1 PID

struct task_struct {
    ...
    pid_t pid;
    ...
}

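1.1.2 The current Macro

Kernel code usually obtains a pointer to a task_struct and manipulates the process directly through that pointer. The current macro yields this pointer for the running task and is implemented separately for each architecture. For example, on x86: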

     +---------+
     | current |
     +----+----+
          |
          v
    +-----+---------+
    | get_current() |
    +-----+---------+
          |
          v
+---------+------------+
| percpu_read_stable() |
+---------+------------+
          |
          v
  +-------+----------+
  | percpu_from_op() |
  +------------------+

#define __percpu_arg(x)     "%%"__stringify(__percpu_seg)":%P" #x

#ifdef CONFIG_X86_64
#define __percpu_seg        gs
#define __percpu_mov_op     movq
#else
#define __percpu_seg        fs
#define __percpu_mov_op     movl
#endif

/* 32-bit expansion */
asm("movl %%fs:%P1,%0"
    : "=r" (pfo_ret__)
    : "p" (&(var)));

/* 64-bit expansion */
asm("movq %%gs:%P1,%0"
    : "=r" (pfo_ret__)
    : "p" (&(var)));

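This assembly reads the value stored at offset %P1 within the fs segment (32-bit) or the gs segment (64-bit) (see: Linux kernel data structures). What exactly lives at that location? On x86, get_current() passes the per-CPU variable current_task, so the value read is the current CPU's pointer to the running task's task_struct.

The macro above lives in /arch/x86/include/asm; the source tree also carries a generic definition in /include/asm-generic: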

       +---------+
       | current |
       +----+----+
            |
            v
    +-------+-------+
    | get_current() |
    +-------+-------+
            |
            v
+-----------+-----------+
| current_thread_info() |
+-----------+-----------+
            |
            v
 +----------+-----------+
 | percpu_read_stable() |
 +----------------------+

union thread_union {
    struct thread_info thread_info;
    unsigned long stack[THREAD_SIZE/sizeof(long)];
};
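
The union above puts thread_info at the lowest address of the kernel stack, which is what lets a kernel locate it from the stack pointer alone by masking off the low bits. Below is a minimal user-space sketch of that masking trick. It assumes THREAD_SIZE is 8192 bytes and that the stack is THREAD_SIZE-aligned; the *_demo names and stack_base() are hypothetical stand-ins, not kernel definitions.

```c
#include <assert.h>
#include <stdint.h>

#define THREAD_SIZE 8192u   /* assumption: 8KB kernel stack */

/* Hypothetical stand-in for struct thread_info. */
struct thread_info_demo {
    void *task;             /* would point at the owning task_struct */
};

union thread_union_demo {
    struct thread_info_demo thread_info;
    unsigned long stack[THREAD_SIZE / sizeof(long)];
};

/* Round an address inside the stack down to its THREAD_SIZE boundary. */
static uintptr_t stack_base(uintptr_t sp)
{
    return sp & ~((uintptr_t)THREAD_SIZE - 1);
}

/* Returns 1 if an address in the middle of an aligned stack maps back
 * to the thread_info at the bottom of the same union. */
static int lookup_demo(void)
{
    static unsigned char backing[2 * THREAD_SIZE];
    /* Carve a THREAD_SIZE-aligned union out of the oversized buffer. */
    uintptr_t base = ((uintptr_t)backing + THREAD_SIZE - 1)
                     & ~((uintptr_t)THREAD_SIZE - 1);
    union thread_union_demo *u = (union thread_union_demo *)base;
    uintptr_t sp = (uintptr_t)&u->stack[100];   /* pretend stack pointer */

    assert(sizeof(*u) == THREAD_SIZE);
    return stack_base(sp) == (uintptr_t)&u->thread_info;
}
```

Because the union is exactly THREAD_SIZE bytes and aligned to that size, any address inside the stack shares its high bits with the thread_info at the base.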

1.2 Process State

#define TASK_RUNNING        0
#define TASK_INTERRUPTIBLE  1
#define TASK_UNINTERRUPTIBLE    2
#define __TASK_STOPPED      4
#define __TASK_TRACED       8

struct task_struct {
    ...
    volatile long state;
    ...
}

set_current_state(state);
set_task_state(current, state);

1.3 Process Lifecycle

+----------+       +----------+      +----------+
|  fork()  +------>+  exec()  +----->+  exit()  |
+----------+       +----+-----+      +----+-----+
                        |                 |
                        |                 v
                        |            +----+-----+
                        +----------->+  wait()  +--------->
                                     +----------+

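1.3.1 Process Creation (CoW fork)

Copy-on-Write (CoW): after a CoW fork(), parent and child share a single copy of all data; that is, their pages map to the same physical memory. Their PTEs are marked read-only, so as soon as either the parent or the child writes to a shared region, a page fault is triggered. The kernel recognizes that the fault was caused by a write to a CoW region, copies the affected page, and points the faulting process's PTE at the new copy (marking that PTE writable). The write is then retried and succeeds against the copy.

Linux implements CoW fork.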
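
The visible consequence of CoW can be observed from user space: a write in the child never shows up in the parent. The sketch below shows only that observable behaviour, not the page-fault path itself, which happens inside the kernel. The names shared_value and child_write_is_private() are made up for illustration.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared_value = 1;    /* lives in a page shared right after fork() */

/* Returns 1 if the child's write stayed private to the child,
 * 0 if not, -1 if fork() or waitpid() failed. */
static int child_write_is_private(void)
{
    assert(shared_value == 1);  /* precondition for the demo */

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: this write is the point where the kernel would copy
         * the page and give the child its own writable copy. */
        shared_value = 42;
        _exit(shared_value == 42 ? 0 : 1);
    }

    /* Parent: wait for the child, then check our own copy. */
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    int child_ok = WIFEXITED(status) && WEXITSTATUS(status) == 0;
    return child_ok && shared_value == 1;
}
```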

+------------+   +-------------+   +-------------+   +-----------------+
| sys_fork() |   | sys_vfork() |   | sys_clone() |   | kernel_thread() |
+------+-----+   +-------------+   +----+--------+   +-------+---------+
       |               |                |                    |
       |               +------+  +------+                    |
       |                      |  |                           |
       +-------------------+  |  |  +------------------------+
                           |  |  |  |
                          +v--v--v--v--+
                          |  do_fork() |
                          +------+-----+
                                 |
                         +-------+--------+
                         | copy_process() |
                         +----+---+-------+
      +--------------------+  |   |  |------------------------------+
      |                       |   +---------------+                 |
      v                       v                   v                 v
 +----+--------+      +-------+---------+     +---+----------+    +-+---+
 | alloc_pid() |      |dup_task_struct()|     | copy_flags() |    | ... |
 +-------------+      +-----------------+     +--------------+    +-----+

Whether the child shares or copies the parent's resources depends on the flags argument:

#define CSIGNAL         0x000000ff
#define CLONE_VM        0x00000100
#define CLONE_FS        0x00000200
#define CLONE_FILES     0x00000400
#define CLONE_SIGHAND   0x00000800
...
#define CLONE_NEWNET    0x40000000
#define CLONE_IO        0x80000000

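After a successful fork, Linux usually lets the child run first, for the following reason:

Suppose that, once both have returned to user space, the parent is scheduled first. The parent may perform a write, which triggers CoW. If the child runs first instead, it typically calls exec right after fork and from then on no longer shares data with the parent, so later writes by the parent do not trigger CoW either.

To Linux, a thread is a special kind of process. Whether a thread or a process gets created depends on the flags passed at creation time: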

// thread
clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);

// process
clone(SIGCHLD, 0);

Linux in fact has no strict notion of a thread; its threads are simply processes (because processes in Linux are already lightweight).

Interestingly, note that threads share the virtual memory abstraction, whereas each receives its own virtualized processor.

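1.3.2 Process Termination

A process's life can end in two ways:

an explicit call to exit()
an implicit call to exit()

The second case refers to the C compiler placing a call to exit() after main() returns.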

NORET_TYPE void do_exit(long code)
{
    ...
    exit_signals(tsk);  /* sets PF_EXITING */
    ...
    tsk->exit_code = code;
    ...
    exit_mm(tsk); /*release the mm_struct held by this process*/
    ...
    exit_sem(tsk); /* leave the IPC semaphore queue */
    exit_files(tsk);
    exit_fs(tsk);
    ...
    exit_notify(tsk, group_dead);
    ...
    schedule();
    BUG();

    /* Avoid "noreturn function does return".  */
    for (;;)
        cpu_relax();    /* For when BUG is null */
}

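This function never returns. At this point the process is marked EXIT_ZOMBIE. It still counts as a process because three of its resources have not yet been released: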

kernel stack
thread_info structure
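task_struct structure

These three resources are kept so that the parent can be notified; the parent is then responsible for releasing them.

The parent calls one of the wait family of functions to release the resources above: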

                      +-------------+
                      | sys_wait4() |
                      +------+------+
                             |
                             v
                  +----------+---------+
                  | wait_task_zombie() |
                  +----------+---------+
                             |
                             v
                     +-------+--------+
                     | release_task() |
                     +------+---------+
         +---------------+  |    +-------------------+
         |                  |                        |
         v                  v                        v
+--------+--------+     +---+---------------+     +--+---+
| __exit_signal() |     | put_task_struct() |     | ...  |
+-----------------+     +-------------------+     +------+

From this point on, every trace of the process/thread in the operating system is gone for good.

2. Process Scheduling

Scheduling policies:
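
Seen from user space, the exit/wait handshake described above looks like the following minimal sketch: the child exits with a status code, remains a zombie until the parent waits for it, and the parent's waitpid() call both collects the code and lets the kernel release what is left of the child. The helper name reap_child() is made up for illustration.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits with `code`, reap it, and return the exit
 * status the parent observed (or -1 on failure). */
static int reap_child(int code)
{
    assert(code >= 0 && code <= 255);   /* exit status is 8 bits wide */

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0)
        _exit(code);    /* child terminates and becomes a zombie */

    /* Parent: waitpid() reaps the zombie and hands back its status. */
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```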

SCHED_NORMAL/SCHED_OTHER
SCHED_FIFO
SCHED_RR
SCHED_BATCH
SCHED_IDLE

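Process categories:

Normal process
- interactive process
- batch process
Real-time process

Real-time processes use the SCHED_FIFO/SCHED_RR policies; normal processes use SCHED_NORMAL.

Priorities:

Real-time priority (0~99; a higher value means higher priority)
Nice priority (-20~19, mapped to 100~139; a higher value means lower priority)

Real-time processes use the real-time priority, while normal processes use the nice priority. In Linux, real-time processes are always scheduled ahead of normal processes, so the two kinds of priority do not interfere with each other.

Scheduler classes: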
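
The mapping between the nice range -20~19 and the range 100~139 is a fixed offset. A small sketch, assuming the offset is MAX_RT_PRIO + 20 with MAX_RT_PRIO equal to 100; the function names are illustrative.

```c
#include <assert.h>

#define MAX_RT_PRIO 100     /* priorities below this belong to real-time tasks */

/* Map a nice value (-20..19) to the 100..139 range. */
static int nice_to_prio(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return MAX_RT_PRIO + nice + 20;
}

/* Inverse mapping: 100..139 back to a nice value. */
static int prio_to_nice(int prio)
{
    assert(prio >= MAX_RT_PRIO && prio <= MAX_RT_PRIO + 39);
    return prio - MAX_RT_PRIO - 20;
}
```

For example, nice 0 maps to 120, the midpoint of the normal range.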

rt_sched_class
fair_sched_class
idle_sched_class

All of these are of type struct sched_class. Scheduler classes themselves are also prioritized.

Scheduler entities:

sched_entity
sched_rt_entity
sched_dl_entity

The highest priority scheduler class that has a runnable process wins, selecting who runs next.

2.1 Scheduling Normal Processes

In Linux, normal processes are scheduled by the Completely Fair Scheduler (CFS).

CFS is based on a simple concept: Model process scheduling as if the system had an ideal, perfectly multitasking processor. In such a system, each process would receive 1/n of the processor’s time, where n is the number of runnable processes, and we’d schedule them for infinitely small durations, so that in any measurable period we’d have run all n processes for the same amount of time.

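What is described above is only an ideal. Suppose the system has 100 processes and the measurable period is 1ms (an extreme example): each process would run for just 0.01ms before a context switch. That is not realistic.

We still need some standard by which to measure how well CFS performs, so two concepts are introduced: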

targeted latency
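minimum granularity (1ms by default)

In one sentence: within a window as long as the targeted latency, every process should get scheduled, and each should run for no less than the minimum granularity.

So far this is all on paper. The key question is how long a process should run once it has been scheduled. In CFS, that time is determined by the nice values of all normal processes.

First, the nice value gives each process i its weight: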

weight[i] ≈ 1024 / (1.25)^(nice[i])

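The weight then gives the share of the CPU that the process should receive:

CPU proportion[i] = weight[i] / (weight[1] + ... + weight[n])

This is a geometric weighting. With it, CFS comes close to perfect multitasking when scheduling normal processes. The CFS implementation has four parts: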
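
The two formulas above can be tried out with a few lines of C. This is a sketch that evaluates the approximation 1024 / 1.25^nice directly in floating point; the kernel itself works with precomputed integer weights, and the function names here are made up for illustration.

```c
#include <assert.h>

/* weight ~ 1024 / 1.25^nice, computed by repeated multiply/divide. */
static double nice_to_weight(int nice)
{
    assert(nice >= -20 && nice <= 19);
    double w = 1024.0;
    for (int i = 0; i < nice; i++)
        w /= 1.25;          /* positive nice: lighter weight */
    for (int i = 0; i > nice; i--)
        w *= 1.25;          /* negative nice: heavier weight */
    return w;
}

/* CPU proportion of process i among n runnable processes. */
static double cpu_share(const int *nice, int n, int i)
{
    assert(n > 0 && i >= 0 && i < n);
    double total = 0.0;
    for (int k = 0; k < n; k++)
        total += nice_to_weight(nice[k]);
    return nice_to_weight(nice[i]) / total;
}
```

Two processes with the same nice value each get half of the CPU, whatever that value is. With nice values 0 and 5, the nice-0 process gets roughly three quarters, because five steps of 1.25 make its weight about three times larger.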

Time Accounting
Process Selection
The Scheduler Entry Point
Sleeping and Waking Up

2.1.1 Time Accounting

struct task_struct {
    ...
    struct sched_entity se;
    ...
}

struct sched_entity {
    ...
    u64         vruntime;
    ...
}

In the ideal CFS model every process would have the same vruntime, but in practice they differ.

CFS uses vruntime to account for how long a process has run and thus how much longer it ought to run.

static void update_curr(struct cfs_rq *cfs_rq)
{
    ...
    delta_exec = (unsigned long)(now - curr->exec_start);
    ...
    __update_curr(cfs_rq, curr, delta_exec);
    ...
}

static inline void
__update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
          unsigned long delta_exec)
{
    ...
    delta_exec_weighted = calc_delta_fair(delta_exec, curr);
    curr->vruntime += delta_exec_weighted;
    update_min_vruntime(cfs_rq);
}

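As the code shows, vruntime is a weighted quantity.

2.1.2 Process Selection

CFS picks the process with the smallest vruntime to run next. For fast lookup, CFS organizes the struct cfs_rq run queue as a red-black tree: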
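
The weighting can be pictured as scaling the real run time by the ratio of the nice-0 weight to the task's own weight, so a heavier task accumulates vruntime more slowly. The sketch below is a simplified model of that idea using plain integer division; the kernel's actual arithmetic is more involved, and the names entity_demo, weighted_delta() and account_demo() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_LOAD 1024u   /* weight of a nice-0 task */

struct entity_demo {
    uint32_t weight;        /* load weight derived from the nice value */
    uint64_t vruntime;      /* weighted run time, in nanoseconds */
};

/* Scale a real-time delta by NICE_0_LOAD / weight. */
static uint64_t weighted_delta(uint64_t delta_exec, uint32_t weight)
{
    assert(weight != 0);
    if (weight == NICE_0_LOAD)
        return delta_exec;  /* nice 0: virtual time equals real time */
    return delta_exec * NICE_0_LOAD / weight;
}

/* Charge delta_exec nanoseconds of real run time to an entity. */
static void account_demo(struct entity_demo *se, uint64_t delta_exec)
{
    se->vruntime += weighted_delta(delta_exec, se->weight);
}
```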

struct cfs_rq {
    ...
    struct sched_entity *curr, *next, *last;
    ...
}

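The sched_entity with the smallest vruntime is the leftmost node of the red-black tree.

2.1.3 The Scheduler Entry Point

The main entry point into the scheduler is schedule() in kernel/sched.c; the heart of that function is pick_next_task():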
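
The selection rule itself is just "take the minimum vruntime". The sketch below uses a linear scan over an array as a stand-in for the red-black tree, so it illustrates the rule rather than the data structure; in the tree, the same answer is the leftmost node. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct rq_entry_demo {
    int      id;            /* illustrative task identifier */
    uint64_t vruntime;
};

/* Return the entry with the smallest vruntime, or NULL if the queue
 * is empty. Ties go to the earliest entry. */
static const struct rq_entry_demo *
pick_min_vruntime(const struct rq_entry_demo *q, size_t n)
{
    assert(q != NULL || n == 0);
    if (n == 0)
        return NULL;

    const struct rq_entry_demo *best = &q[0];
    for (size_t i = 1; i < n; i++) {
        if (q[i].vruntime < best->vruntime)
            best = &q[i];
    }
    return best;
}
```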

static inline struct task_struct *
pick_next_task(struct rq *rq)
{
    const struct sched_class *class;
    struct task_struct *p;
    ...
    class = sched_class_highest;
    for ( ; ; ) {
        p = class->pick_next_task(rq);
        if (p)
            return p;
        class = class->next;
    }
}

This function looks simple, yet it is the essence of the whole scheduler. As mentioned above, there are three struct sched_class instances:

fair_sched_class
rt_sched_class
idle_sched_class

In sched_rt.c, rt_sched_class registers its own set of functions:

static const struct sched_class rt_sched_class = {
    .next           = &fair_sched_class,
    .enqueue_task       = enqueue_task_rt,
    .dequeue_task       = dequeue_task_rt,
    .yield_task     = yield_task_rt,

    .check_preempt_curr = check_preempt_curr_rt,

    .pick_next_task     = pick_next_task_rt,
    .put_prev_task      = put_prev_task_rt,
    ...
}

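So the logic of pick_next_task() is: walk the scheduler classes from highest to lowest priority and call each class's pick_next_task_*() function, which looks for a suitable process in that class's struct *_rq run queue. The highest-priority scheduler class is: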

#define sched_class_highest (&rt_sched_class)
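
The class chain can be modelled in user space with a linked list of structs holding a function pointer. This is a minimal sketch under simplifying assumptions: each class's run queue is reduced to a single task slot, and every *_demo name is hypothetical rather than a kernel definition.

```c
#include <assert.h>
#include <stddef.h>

struct task_demo { const char *name; };

/* One slot per class stands in for the per-class run queues. */
struct rq_demo {
    struct task_demo *rt_task;
    struct task_demo *fair_task;
    struct task_demo *idle_task;
};

struct sched_class_demo {
    const struct sched_class_demo *next;    /* next lower-priority class */
    struct task_demo *(*pick_next_task)(struct rq_demo *rq);
};

static struct task_demo *pick_rt(struct rq_demo *rq)   { return rq->rt_task; }
static struct task_demo *pick_fair(struct rq_demo *rq) { return rq->fair_task; }
static struct task_demo *pick_idle(struct rq_demo *rq) { return rq->idle_task; }

/* Chain: rt -> fair -> idle, highest priority first. */
static const struct sched_class_demo idle_class_demo = { NULL, pick_idle };
static const struct sched_class_demo fair_class_demo = { &idle_class_demo, pick_fair };
static const struct sched_class_demo rt_class_demo   = { &fair_class_demo, pick_rt };

/* Ask each class in priority order; the first one with a task wins. */
static struct task_demo *pick_next_task_demo(struct rq_demo *rq)
{
    assert(rq != NULL);
    for (const struct sched_class_demo *c = &rt_class_demo; c; c = c->next) {
        struct task_demo *p = c->pick_next_task(rq);
        if (p)
            return p;
    }
    return NULL;    /* unlike this sketch, the kernel loop has no exit test */
}
```

The sketch ends the loop when the chain runs out, whereas the kernel version quoted above loops unconditionally and relies on some class always returning a task.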

2.1.4 Sleeping and Waking Up

voluntary sleep
involuntary sleep

The kernel uses a structure to organize sleeping tasks:

struct __wait_queue_head {
    spinlock_t lock;
    struct list_head task_list;
};
typedef struct __wait_queue_head wait_queue_head_t;

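The mechanism resembles sleep/wakeup in xv6.

2.2 Scheduling Real-Time Processes

Real-time processes use a different scheduling mechanism, implemented in kernel/sched_rt.c, that is much simpler than CFS. There are two real-time policies: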

SCHED_FIFO
SCHED_RR

SCHED_RR is SCHED_FIFO with timeslices.

struct task_struct {
    ...
    struct sched_rt_entity rt;
    ...
}
