使用 trace-cmd 进行内核跟踪

trace-cmd 是一个易于使用、功能丰富的实用程序，用于跟踪 Linux 内核函数。

Puzzle pieces coming together to form a computer screen

图片来源：

Opensource.com

在我之前的文章中，我解释了如何使用 ftrace 跟踪内核函数。通过写入和读取文件来使用 ftrace 可能会很繁琐，所以我使用了一个包装器来围绕它运行命令，并带有选项来启用和禁用跟踪、设置过滤器、查看输出、清除输出等等。

trace-cmd 命令是一个实用程序，可以帮助您做到这一点。在本文中，我使用 trace-cmd 执行与我在 ftrace 文章中相同的任务。由于我会频繁地回顾那篇文章，我建议您在阅读本文之前先阅读它。

安装 trace-cmd

我在本文中以 root 用户身份运行命令。

ftrace 机制内置于内核中，您可以使用以下命令验证它是否已启用：

# mount | grep tracefs
none on /sys/kernel/tracing type tracefs (rw,relatime,seclabel)

但是，您需要手动安装 trace-cmd 实用程序。

# dnf install trace-cmd -y

列出可用的跟踪器

当使用 ftrace 时，您必须查看文件的内容才能查看有哪些跟踪器可用。但是使用 trace-cmd，您可以使用以下命令获取此信息：

# trace-cmd list -t
hwlat blk mmiotrace function_graph wakeup_dl wakeup_rt wakeup function nop

启用 function 跟踪器

在我之前的文章中，我使用了两个跟踪器，在这里我也会这样做。使用以下命令启用您的第一个跟踪器 function：

$ trace-cmd start -p function
  plugin 'function'

查看跟踪输出

启用跟踪器后，您可以使用 show 参数查看输出。这里只显示前 20 行以使示例简短（有关输出的解释，请参阅我之前的文章）：

# trace-cmd show | head -20
## tracer: function
#
# entries-in-buffer/entries-written: 410142/3380032   #P:8
#
#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| /     delay
#           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
#              | |         |   ||||      |         |
           gdbus-2606    [004] ..s. 10520.538759: __msecs_to_jiffies <-rebalance_domains
           gdbus-2606    [004] ..s. 10520.538760: load_balance <-rebalance_domains
           gdbus-2606    [004] ..s. 10520.538761: idle_cpu <-load_balance
           gdbus-2606    [004] ..s. 10520.538762: group_balance_cpu <-load_balance
           gdbus-2606    [004] ..s. 10520.538762: find_busiest_group <-load_balance
           gdbus-2606    [004] ..s. 10520.538763: update_group_capacity <-update_sd_lb_stats.constprop.0
           gdbus-2606    [004] ..s. 10520.538763: __msecs_to_jiffies <-update_group_capacity
           gdbus-2606    [004] ..s. 10520.538765: idle_cpu <-update_sd_lb_stats.constprop.0
           gdbus-2606    [004] ..s. 10520.538766: __msecs_to_jiffies <-rebalance_domains

停止跟踪并清除缓冲区

跟踪继续在后台运行，您可以继续使用 show 查看输出。

要停止跟踪，请使用 stop 参数运行 trace-cmd：

# trace-cmd stop

要清除缓冲区，请使用 clear 参数运行它：

# trace-cmd clear

启用 function_graph 跟踪器

通过运行以下命令启用第二个跟踪器 function_graph：

# trace-cmd start -p function_graph
  plugin 'function_graph'

再次使用 show 参数查看输出。正如预期的那样，输出与第一个跟踪输出略有不同。这次它包含了一个 function calls 链：

# trace-cmd show | head -20
## tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 4)   0.079 us    |        } /* rcu_all_qs */
 4)   0.327 us    |      } /* __cond_resched */
 4)   0.081 us    |      rcu_read_unlock_strict();
 4)               |      __cond_resched() {
 4)   0.078 us    |        rcu_all_qs();
 4)   0.243 us    |      }
 4)   0.080 us    |      rcu_read_unlock_strict();
 4)               |      __cond_resched() {
 4)   0.078 us    |        rcu_all_qs();
 4)   0.241 us    |      }
 4)   0.080 us    |      rcu_read_unlock_strict();
 4)               |      __cond_resched() {
 4)   0.079 us    |        rcu_all_qs();
 4)   0.235 us    |      }
 4)   0.095 us    |      rcu_read_unlock_strict();
 4)               |      __cond_resched() {

使用 stop 和 clear 命令停止跟踪并清除缓冲区

# trace-cmd stop
# trace-cmd clear

调整跟踪以增加深度

如果您想在函数调用中看到更多深度，您可以调整跟踪器：

# trace-cmd start -p function_graph --max-graph-depth 5
  plugin 'function_graph'

现在，当您将此输出与之前看到的输出进行比较时，您应该会看到更多嵌套的函数调用：

# trace-cmd show | head -20
## tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 6)               |        __fget_light() {
 6)   0.804 us    |          __fget_files();
 6)   2.708 us    |        }
 6)   3.650 us    |      } /* __fdget */
 6)   0.547 us    |      eventfd_poll();
 6)   0.535 us    |      fput();
 6)               |      __fdget() {
 6)               |        __fget_light() {
 6)   0.946 us    |          __fget_files();
 6)   1.895 us    |        }
 6)   2.849 us    |      }
 6)               |      sock_poll() {
 6)   0.651 us    |        unix_poll();
 6)   1.905 us    |      }
 6)   0.475 us    |      fput();
 6)               |      __fdget() {

了解可用于跟踪的函数

如果您只想跟踪某些函数而忽略其余函数，则需要知道确切的函数名称。您可以使用 list 参数后跟 -f 来获取它们。此示例搜索常见的内核函数 kmalloc，该函数用于在内核中分配内存：

# trace-cmd list -f | grep kmalloc
bpf_map_kmalloc_node
mempool_kmalloc
__traceiter_kmalloc
__traceiter_kmalloc_node
kmalloc_slab
kmalloc_order
kmalloc_order_trace
kmalloc_large_node
__kmalloc
__kmalloc_track_caller
__kmalloc_node
__kmalloc_node_track_caller
[...]

这是我的测试系统上可用函数的总数：

# trace-cmd list -f | wc -l
63165

您还可以跟踪与特定内核模块相关的函数。假设您要跟踪 kvm 内核模块相关函数。确保模块已加载：

# lsmod  | grep kvm_intel
kvm_intel             335872  0
kvm                   987136  1 kvm_intel

再次使用 list 参数运行 trace-cmd，并从输出中 grep 以 ] 结尾的行。这将过滤掉内核模块。然后 grep 内核模块 kvm_intel，您应该会看到与该内核模块相关的所有函数：

# trace-cmd list -f | grep ]$  | grep kvm_intel
vmx_can_emulate_instruction [kvm_intel]
vmx_update_emulated_instruction [kvm_intel]
vmx_setup_uret_msr [kvm_intel]
vmx_set_identity_map_addr [kvm_intel]
handle_machine_check [kvm_intel]
handle_triple_fault [kvm_intel]
vmx_patch_hypercall [kvm_intel]

[...]

vmx_dump_dtsel [kvm_intel]
vmx_dump_sel [kvm_intel]

跟踪特定函数

现在您已经知道如何查找感兴趣的函数，请将这些知识应用于一个示例。与之前的文章一样，尝试跟踪文件系统相关函数。我的测试系统上的文件系统是 ext4。

此过程略有不同；您不是使用 start，而是使用 record 参数后跟您要跟踪的函数的“模式”来运行命令。您还需要指定您想要的跟踪器；在本例中，它是 function_graph。该命令会持续记录跟踪，直到您使用 Ctrl+C 停止它。因此，几秒钟后，按 Ctrl+C 停止跟踪：

# trace-cmd list -f | grep ^ext4_

# trace-cmd record -l ext4_* -p function_graph
  plugin 'function_graph'
Hit Ctrl^C to stop recording
^C
CPU0 data recorded at offset=0x856000
    8192 bytes in size
[...]

查看记录的跟踪

要查看您之前记录的跟踪，请使用 report 参数运行命令。从输出中可以清楚地看出，过滤器起作用了，您只看到 ext4 相关函数跟踪：

# trace-cmd report | head -20
[...]
cpus=8
       trace-cmd-12697 [000] 11303.928103: funcgraph_entry:                   |  ext4_show_options() {
       trace-cmd-12697 [000] 11303.928104: funcgraph_entry:        0.187 us   |    ext4_get_dummy_policy();
       trace-cmd-12697 [000] 11303.928105: funcgraph_exit:         1.583 us   |  }
       trace-cmd-12697 [000] 11303.928122: funcgraph_entry:                   |  ext4_create() {
       trace-cmd-12697 [000] 11303.928122: funcgraph_entry:                   |    ext4_alloc_inode() {
       trace-cmd-12697 [000] 11303.928123: funcgraph_entry:        0.101 us   |      ext4_es_init_tree();
       trace-cmd-12697 [000] 11303.928123: funcgraph_entry:        0.083 us   |      ext4_init_pending_tree();
       trace-cmd-12697 [000] 11303.928123: funcgraph_entry:        0.141 us   |      ext4_fc_init_inode();
       trace-cmd-12697 [000] 11303.928123: funcgraph_exit:         0.931 us   |    }
       trace-cmd-12697 [000] 11303.928124: funcgraph_entry:        0.081 us   |    ext4_get_dummy_policy();
       trace-cmd-12697 [000] 11303.928124: funcgraph_entry:        0.133 us   |    ext4_get_group_desc();
       trace-cmd-12697 [000] 11303.928124: funcgraph_entry:        0.115 us   |    ext4_free_inodes_count();
       trace-cmd-12697 [000] 11303.928124: funcgraph_entry:        0.114 us   |    ext4_get_group_desc();

跟踪特定 PID

假设您要跟踪与特定持久标识符 (PID) 相关的函数。打开另一个终端并记下正在运行的 shell 的 PID：

# echo $$
10885

再次运行 record 命令，并使用 -P 选项传递 PID。这次，让终端运行（即，暂时不要按 Ctrl+C）：

# trace-cmd record -P 10885 -p function_graph
  plugin 'function_graph'
Hit Ctrl^C to stop recording

在 shell 上运行一些活动

返回到您运行 shell 的另一个终端，该 shell 具有特定的 PID，并运行任何命令，例如 ls 以列出文件：

# ls
Temp-9b61f280-fdc1-4512-9211-5c60f764d702
tracker-extract-3-files.1000
v8-compile-cache-1000
[...]

返回到您启用跟踪的终端，然后按 Ctrl+C 停止跟踪：

# trace-cmd record -P 10885 -p function_graph
  plugin 'function_graph'
Hit Ctrl^C to stop recording
^C
CPU1 data recorded at offset=0x856000
    618496 bytes in size
[...]

在跟踪的输出中，您可以在左侧看到 PID 和 Bash shell，在右侧看到与其相关的函数调用。这对于缩小您的跟踪范围非常有用：

# trace-cmd report  | head -20

cpus=8
          <idle>-0     [001] 11555.380581: funcgraph_entry:                   |  switch_mm_irqs_off() {
          <idle>-0     [001] 11555.380583: funcgraph_entry:        1.703 us   |    load_new_mm_cr3();
          <idle>-0     [001] 11555.380586: funcgraph_entry:        0.493 us   |    switch_ldt();
          <idle>-0     [001] 11555.380587: funcgraph_exit:         7.235 us   |  }
            bash-10885 [001] 11555.380589: funcgraph_entry:        1.046 us   |  finish_task_switch.isra.0();
            bash-10885 [001] 11555.380591: funcgraph_entry:                   |  __fdget() {
            bash-10885 [001] 11555.380592: funcgraph_entry:        2.036 us   |    __fget_light();
            bash-10885 [001] 11555.380594: funcgraph_exit:         3.256 us   |  }
            bash-10885 [001] 11555.380595: funcgraph_entry:                   |  tty_poll() {
            bash-10885 [001] 11555.380597: funcgraph_entry:                   |    tty_ldisc_ref_wait() {
            bash-10885 [001] 11555.380598: funcgraph_entry:                   |      ldsem_down_read() {
            bash-10885 [001] 11555.380598: funcgraph_entry:                   |        __cond_resched() {