Exploring systemtap (2) - The C code generated from probes

Date: 2019-12-08


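In the previous article I briefly introduced systemtap's workflow and covered its first and second passes. Starting with this article we reach the main act of the series: the third pass, which generates C code.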

With a command such as stap -v test.stp -p3 > out.c, we can have stap redirect the generated C code to out.c.

hello, world

As is customary, let's start with a "hello world" example.

probe begin {
    printf("hello")
}

probe oneshot {
    printf(" wor")
}

probe end {
    printf("ld\n")
}

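For my own amusement, I have split a complete hello world into three pieces. By searching for specific strings, we can quickly find the code generated for these three probes in the C output.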

static void probe_3646 (struct context * __restrict__ c) {
  __label__ deref_fault;
  __label__ out;
  struct probe_3646_locals * __restrict__ l = & c->probe_locals.probe_3646;
  (void) l;
  if (c->actionremaining < 1) { c->last_error = "MAXACTION exceeded"; goto out; }
  (void)
  ({
    _stp_print ("hello");
  });
deref_fault: __attribute__((unused));
out:
  _stp_print_flush();
}

The above is the code generated for probe begin.

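As we can see, every probe receives a context argument when it runs. Each context contains a struct probe_<id>_locals member (struct probe_3646_locals here), which stores the probe's local variables. Our hello world example uses no local variables, so these structs are empty.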

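Next comes the MAXACTION exceeded check. According to the systemtap documentation, MAXACTION caps the number of statements a single probe hit may execute. This bounds how long a probe can run and keeps the kernel from becoming unresponsive.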
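The budget idea behind this check can be sketched in plain C. This is only an illustration under assumptions: struct mini_context, charge_actions and run_handler are hypothetical names, not part of systemtap's runtime, and the per-statement decrement is a simplification of how a real handler would account for its work.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for systemtap's struct context. */
struct mini_context {
    int actionremaining;     /* statements the handler may still execute */
    const char *last_error;  /* NULL while no error has occurred */
};

/* Charge `cost` statements against the budget.
 * Returns 0 on success, -1 (with last_error set) if the budget is too small. */
static int charge_actions(struct mini_context *c, int cost)
{
    assert(c != NULL && cost > 0);
    if (c->actionremaining < cost) {
        c->last_error = "MAXACTION exceeded";
        return -1;
    }
    c->actionremaining -= cost;
    return 0;
}

/* Run `iterations` one-statement steps under the budget.
 * Returns the number of steps that actually completed. */
static int run_handler(struct mini_context *c, int iterations)
{
    int done = 0;
    for (int i = 0; i < iterations; i++) {
        if (charge_actions(c, 1) != 0)
            break;  /* corresponds to the `goto out` in the generated code */
        done++;
    }
    return done;
}
```

With a budget of 3, a handler that wants to run 5 steps stops after the third and leaves "MAXACTION exceeded" in last_error, which is the behaviour the generated check is meant to guarantee.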

Next is

(void)
  ({
    _stp_print ("hello");
  });

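As we can see, the printf statement is compiled into a call to the corresponding built-in function. In addition, to avoid polluting the surrounding scope, the compiled form of each statement is wrapped in a layer of parentheses and braces (a GCC statement expression).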

The other two probes are much the same, except that probe oneshot additionally gets a function___global_exit__overload_0, which calls the built-in _stp_exit function.

Each of these probes has a corresponding struct stap_be_probe instance. The generated code shows that the enter_be_probe function runs the probe's handler, specifically on this line:

(*stp->probe->ph) (c);

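The code before this line is setup. The code after it checks whether an error occurred during execution, collects timing statistics, and so on. Note that the context passed to the probe function is reused.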

enter_be_probe is in turn called by systemtap_module_init and systemtap_module_exit. Specifically, probe begin and probe oneshot are invoked from systemtap_module_init (the type of their struct stap_be_probe is 0), while probe end is invoked from systemtap_module_exit (its type is 1). As the names suggest, systemtap_module_init and systemtap_module_exit are called when the session starts and ends, respectively. The exact call flow is described in runtime/transport/transport.txt in the systemtap source tree.
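The type-based dispatch described here can be modelled with a small C sketch. Everything in it is an assumption made for illustration: struct mini_be_probe, run_phase, mini_module_init and mini_module_exit are hypothetical names, and "running a handler" is reduced to appending text to a buffer. Only the type values 0 and 1 come from the generated code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct stap_be_probe. */
struct mini_be_probe {
    int type;          /* 0 = runs at module init, 1 = runs at module exit */
    const char *text;  /* what the handler would print */
};

/* Run every probe of the given type, appending its text to `out`. */
static void run_phase(const struct mini_be_probe *probes, size_t n,
                      int type, char *out, size_t cap)
{
    assert(out != NULL && cap > 0);
    for (size_t i = 0; i < n; i++) {
        if (probes[i].type != type)
            continue;
        size_t used = strlen(out);
        size_t len = strlen(probes[i].text);
        if (used + len < cap)  /* leave room for the terminating NUL */
            memcpy(out + used, probes[i].text, len + 1);
    }
}

/* Models systemtap_module_init: only type 0 probes fire. */
static void mini_module_init(const struct mini_be_probe *p, size_t n,
                             char *out, size_t cap)
{
    run_phase(p, n, 0, out, cap);
}

/* Models systemtap_module_exit: only type 1 probes fire. */
static void mini_module_exit(const struct mini_be_probe *p, size_t n,
                             char *out, size_t cap)
{
    run_phase(p, n, 1, out, cap);
}
```

Feeding it the three hello world pieces, with "hello" and " wor" as type 0 and "ld\n" as type 1, produces "hello wor" after the init call and the full "hello world\n" after the exit call.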

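One way to think about it: the systemtap runtime has a begin phase and an end phase. probe begin and probe oneshot both run in the begin phase. The latter calls _stp_exit, which marks the transition to the end phase. Finally, probe end runs in the end phase.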

So is there an intermediate phase between begin and end? The answer is of course yes. Next, let's look at an example that includes a timer.

timer

Replace probe oneshot with probe timer.ms(149):

probe timer.ms(149) {
    printf(" wor")
    exit()
}

The C code generated for the probe body is basically the same as before. Outside the probe body, however, there are two differences.

First, there is no struct stap_be_probe for probe timer.ms(149), because probe timer.ms(149) does not run in the begin or end phase.

Second, there is a new type, struct stap_hrtimer_probe. This is the probe type that corresponds to probe timer.ms(149). The generated code shows a call to _stp_hrtimer_create inside systemtap_module_init. That function registers _stp_hrtimer_notify_function, which is almost a copy of enter_be_probe.

Notably, _stp_hrtimer_notify_function performs one extra check when it collects timing statistics:

if (interval > STP_OVERLOAD_INTERVAL) {
          if (c->cycles_sum > STP_OVERLOAD_THRESHOLD) {
            _stp_error ("probe overhead exceeded threshold");
            atomic_set (session_state(), STAP_SESSION_ERROR);
            atomic_inc (error_count());
          }
          c->cycles_base = cycles_atend;
          c->cycles_sum = 0;
        }

This check keeps systemtap from consuming too much time within a given interval, so that the kernel does not become unresponsive.
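To make the windowed accounting easier to follow, here is a standalone C sketch of the same idea. It is a model under assumptions, not the runtime's code: struct mini_overload and account_probe_hit are hypothetical names, the two constants are made-up values rather than systemtap's real STP_OVERLOAD_INTERVAL and STP_OVERLOAD_THRESHOLD, and the error reporting is reduced to a return value.

```c
#include <assert.h>
#include <stdint.h>

#define OVERLOAD_INTERVAL  1000u  /* hypothetical window length, in cycles */
#define OVERLOAD_THRESHOLD  500u  /* hypothetical probe-cycle limit per window */

/* Hypothetical subset of the per-context timing fields. */
struct mini_overload {
    uint64_t cycles_base;  /* timestamp at which the current window started */
    uint64_t cycles_sum;   /* cycles spent inside probes during this window */
};

/* Account for one probe hit. Returns 1 if the window just closed with
 * too many cycles spent in probes, 0 otherwise. */
static int account_probe_hit(struct mini_overload *c,
                             uint64_t cycles_atstart, uint64_t cycles_atend)
{
    assert(cycles_atend >= cycles_atstart);
    int overloaded = 0;

    c->cycles_sum += cycles_atend - cycles_atstart;
    uint64_t interval = cycles_atend - c->cycles_base;

    if (interval > OVERLOAD_INTERVAL) {
        if (c->cycles_sum > OVERLOAD_THRESHOLD)
            overloaded = 1;  /* the real code reports an error and ends the session */
        /* Start a new window either way. */
        c->cycles_base = cycles_atend;
        c->cycles_sum = 0;
    }
    return overloaded;
}
```

The point of the design is that a single slow hit is tolerated; only the ratio of probe time to wall time over a whole window can trip the check.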

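In the C code generated from the timer script, the begin phase is not followed immediately by a switch to the end phase via _stp_exit. Instead, a timer is registered and the probe logic runs inside it. Only after that does the session switch to the end phase, because the timer handler calls _stp_exit.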

Now let's look at an example with a uprobe.

uprobe

probe process("/usr/local/openresty/luajit/bin/luajit").function("lj_str_new") {
    printf(" wor")
    exit()
}

The stp code above attaches to the lj_str_new function of the luajit executable. Note that to run this script you need to make sure luajit's debuginfo is available.

In the generated C code, the type corresponding to this probe is stapiu_consumer.

static struct stapiu_consumer stap_inode_uprobe_consumers[] = {
  { .target=&stap_inode_uprobe_targets[0], .offset=(loff_t)0x6a55ULL, .probe=(&stap_probes[1]), },
};

The odd part is the 0x6a55. That number does not appear anywhere in the script, so where does it come from?

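Running readelf -s /usr/local/openresty/luajit/bin/luajit | grep lj_str_new shows that the address of this function is 0x406a55. The actual runtime address would be X + 0x406a55, where X is random. Since 0x400000 is the base address fixed when the program is linked, we can write the address of lj_str_new as X + 0x400000 + 0x6a55. In other words, using 0x6a55 as the offset is enough to locate lj_str_new. This is also why luajit's debuginfo is required: without it, the address of lj_str_new cannot be determined.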
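The arithmetic above can be written out as a tiny C sketch. It is a simplification under stated assumptions: uprobe_offset and runtime_address are hypothetical helpers, and subtracting a single link-time base presumes the layout described in this article, whereas a real tool would derive the mapping from the ELF headers.

```c
#include <assert.h>
#include <stdint.h>

/* Offset of a symbol relative to the link-time base address. */
static uint64_t uprobe_offset(uint64_t sym_addr, uint64_t link_base)
{
    assert(sym_addr >= link_base);
    return sym_addr - link_base;
}

/* Address following the article's decomposition: X + base + offset,
 * where x is the random displacement chosen at run time. */
static uint64_t runtime_address(uint64_t x, uint64_t link_base,
                                uint64_t offset)
{
    return x + link_base + offset;
}
```

For the values in this article, uprobe_offset(0x406a55, 0x400000) gives 0x6a55, the number that shows up in the .offset field of the stapiu_consumer.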

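The stapiu_consumer is handled in stapiu_probe_handler, which executes the probe in the same way as the two earlier probe types. systemtap examines all existing and newly created processes. If a process's executable matches a probe, the probe is registered through the kernel API, and this function runs when the kernel fires the callback.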

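It is worth stressing that the probe fires in every matching process. Specifying -x PID only sets the value of target(). If you do not want the probe to be triggered by multiple processes, you have to handle that yourself in the stp code: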

probe process("/usr/local/openresty/luajit/bin/luajit").function("lj_str_new") {
    _target = target();
    if (pid() != _target) {
        next;
    }

    printf(" wor")
    exit()
}

The same goes for -c CMD: this option simply creates a child process and uses that child's PID as the value of target().

uretprobe

Finally, let's look at uretprobe, the counterpart of uprobe.

probe process("/usr/local/openresty/luajit/bin/luajit").function("lj_str_new").return {
    printf(" wor")
    exit()
}

The C code generated from the stp code above is basically the same as in the uprobe case. Only the stapiu_consumer differs slightly:

static struct stapiu_consumer stap_inode_uprobe_consumers[] = {
  { .return_p=1, .target=&stap_inode_uprobe_targets[0], .offset=(loff_t)0x6a55ULL, .probe=(&stap_probes[1]), },
};

There is an extra return_p=1.

Coming up

In the next article we will look at how the various stp types are compiled into C code, and discuss more of systemtap's implementation details.