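How do I trace a system call in Linux? Answers

【Question title】: How do I trace a system call in Linux?
【Posted】: 2015-07-02 15:04:54
【Question】:

How would I trace a system call from the trap into the kernel: how the arguments are passed, how the system call is located in the kernel, how it is actually processed there, how control returns to user space, and how state is restored?

【Comments】:

【Answer 1】:

You can use the -f and -ff options. Something like this: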

strace -f -e trace=process bash -c 'ls; :'

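-f Trace child processes as they are created by currently traced processes as a result of the fork(2) system call.

-ff If the -o filename option is in effect, each process's trace is written to filename.pid, where pid is the numeric process ID of each process. This is incompatible with -c, since no per-process counts are kept.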
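As a rough sketch of how -ff combines with -o, the snippet below wraps the command from this answer in a small helper that writes one trace file per process into a temporary directory and lists the results. The helper name trace_per_process and the trace.out prefix are made up for illustration; only the strace flags come from the answer.

```shell
#!/bin/sh
# Hypothetical helper (not part of strace): run a command under
# strace -ff and list the per-process trace files it produced.
trace_per_process() {
  out_dir=$(mktemp -d) || return 1
  # With -o, -ff writes one file per traced process: trace.out.<pid>
  strace -ff -o "${out_dir}/trace.out" -e trace=process "$@" >/dev/null 2>&1
  ls "${out_dir}"
}

# Example call (requires strace to be installed):
# trace_per_process sh -c 'ls; :'
```

With the example call, the shell and the ls child it forks should each get their own trace.out.<pid> file.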

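【Comments】:

Note: "process" here means the kernel's notion of a process, which user space usually calls a "thread".

【Answer 2】:

ftrace is actually fairly simple to use. Here is a classic article by Steven "Mr. ftrace" Rostedt. Part two is here.

There is a free video by Jan-Simon Möller of the Linux Foundation, as well as plenty of other good introductory articles that you can find with search terms such as "ftrace tutorial" or "ftrace example".

【Comments】:

【Answer 3】:

SystemTap

This is the most powerful method I have found so far. It can even show the call arguments: Does ftrace allow capture of system call arguments to the Linux kernel, or only function names?

Usage: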

sudo apt-get install systemtap
sudo stap -e 'probe syscall.mkdir { printf("%s[%d] -> %s(%s)\n", execname(), pid(), name, argstr) }'

Then, in another terminal:

sudo rm -rf /tmp/a /tmp/b
mkdir /tmp/a
mkdir /tmp/b

Sample output:

mkdir[4590] -> mkdir("/tmp/a", 0777)
mkdir[4593] -> mkdir("/tmp/b", 0777)

Documentation: https://sourceware.org/systemtap/documentation.html

It appears to be based on kprobes: https://sourceware.org/systemtap/archpaper.pdf

See also: How to trace just system call events with ftrace without showing any other functions in the Linux kernel?

Tested on Ubuntu 18.04, Linux kernel 4.15.

ltrace -S shows both system calls and library calls

This makes it a great tool for gaining further insight into what an executable is doing.

For example, I used it to analyze which system calls dlopen makes: https://unix.stackexchange.com/questions/226524/what-system-call-is-used-to-load-libraries-in-linux/462710#462710

ftrace minimal runnable example

It was mentioned at https://stackoverflow.com/a/29840482/895245, but here is a minimal runnable example.

Run with sudo:

#!/bin/sh
set -eux

d=debug/tracing

mkdir -p debug
if ! mountpoint -q debug; then
  mount -t debugfs nodev debug
fi

# Stop tracing.
echo 0 > "${d}/tracing_on"

# Clear previous traces.
echo > "${d}/trace"

# Find the tracer name.
cat "${d}/available_tracers"

# Disable tracing functions, show only system call events.
echo nop > "${d}/current_tracer"

# Find the event name.
grep mkdir "${d}/available_events"

# Enable tracing mkdir.
# Both statements below seem to do the exact same thing,
# just with different interfaces.
# https://www.kernel.org/doc/html/v4.18/trace/events.html
echo sys_enter_mkdir > "${d}/set_event"
# echo 1 > "${d}/events/syscalls/sys_enter_mkdir/enable"

# Start tracing.
echo 1 > "${d}/tracing_on"

# Generate two mkdir calls by two different processes.
rm -rf /tmp/a /tmp/b
mkdir /tmp/a
mkdir /tmp/b

# View the trace.
cat "${d}/trace"

# Stop tracing.
echo 0 > "${d}/tracing_on"

umount debug

Sample output:

# tracer: nop
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
           mkdir-5619  [005] .... 10249.262531: sys_mkdir(pathname: 7fff93cbfcb0, mode: 1ff)
           mkdir-5620  [003] .... 10249.264613: sys_mkdir(pathname: 7ffcdc91ecb0, mode: 1ff)
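Note that this trace prints raw argument values: pathname is a user-space pointer and mode is hexadecimal, whereas the SystemTap output shows the string and an octal mode. As a small sketch, the snippet below takes one line copied from the sample output and converts the mode field to octal. The variable names are arbitrary.

```shell
#!/bin/sh
# One line copied from the sample output above.
line='           mkdir-5619  [005] .... 10249.262531: sys_mkdir(pathname: 7fff93cbfcb0, mode: 1ff)'

# Extract the hexadecimal mode field (the text between "mode: " and ")").
mode_hex=$(printf '%s\n' "$line" | sed -n 's/.*mode: \([0-9a-f]*\)).*/\1/p')

# Convert hexadecimal to octal, the notation used by chmod and by the
# SystemTap output.
mode_oct=$(printf '%o' "0x${mode_hex}")

echo "mode: 0${mode_oct}"   # prints "mode: 0777"
```

So mode: 1ff here is the same value as the 0777 that SystemTap reports.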

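One cool thing about this method is that it shows the calls for all processes on the system at once, although you can also filter by PID of interest with set_ftrace_pid. (The kernel documentation describes set_ftrace_pid as a filter for the function tracer; for event tracing as used here, set_event_pid appears to be the matching file.)

Documentation: https://www.kernel.org/doc/html/v4.18/trace/index.html

Tested on Ubuntu 18.04, Linux kernel 4.15.

GDB step debugging the Linux kernel

Depending on the level of internal detail you need, this is an option: How to debug the Linux kernel with GDB and QEMU?

strace minimal runnable example

Here is a minimal runnable example of strace: How should strace be used? It uses a freestanding hello world, which makes it very clear how everything works.

More information

https://en.pingcap.com/blog/how-to-trace-linux-system-calls-in-production-with-minimal-impact-on-performance may be worth a read. It mentions: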
```
perf top -F 49 -e raw_syscalls:sys_enter --sort comm,dso --show-nr-samples
```
and the BPF-based traceloop: https://github.com/kinvolk/traceloop, which the article claims is a very fast approach:
```
sudo -E ./traceloop cgroups --dump-on-exit /sys/fs/cgroup/system.slice/sshd.service
```

【Comments】: