如何使用 perf 和 Python 分析您的系统

新的 Linux 工具 curt 使用 perf 命令的 Python 脚本功能，按进程、按任务和按 CPU 分析系统利用率。

metrics and data shown on a computer screen

图片来源：

Opensource.com

现代计算机的性能和容量不断提高。如果不断提高的容量没有得到充分利用，这也就无关紧要了。以下是对“curt”背后的动机和工作的描述，“curt”是一个新的 Linux 系统工具，用于使用 perf 命令的 Python 脚本功能，按进程、按任务和按 CPU 测量和分解系统利用率。

我有幸在 Texas Linux Fest 2018 上介绍了这个主题，在这里我更深入地探讨了细节，包括了更多信息的链接，并扩展了我的演讲范围。

系统利用率

在讨论计算时，让我们从一些断言开始

每个计算系统在无所事事时速度都一样快。
创建计算系统是为了做事。
当计算系统在做事情时，比无所事事时更擅长做事。

现代计算系统具有许多执行流

通常，大型系统是通过字面上将较小的系统连接在一起而创建的。在 IBM，这些较小的系统有时被称为 CEC（中央电子复合体的缩写，发音为“keks”）。
每个系统中都有多个处理器模块插槽。
每个插槽有时有多个芯片（以双芯片模块 - DCM - 或多芯片模块 - MCM - 的形式）。
每个芯片有多个核心。
每个核心有多个线程。

总而言之，单个计算系统中可能存在数千个执行线程。

理想情况下，所有这些执行流都 100% 忙于进行有用的工作。衡量单个执行流（CPU 线程）利用率的一种方法是该线程计划和运行任务的时间百分比。（请注意，我没有说“做有用的工作”。创建一种衡量有用工作的工具留给读者作为练习。）通过扩展，系统利用率是系统所有执行流计划和运行任务的总时间百分比。同样，利用率可以针对单个任务定义。 任务利用率是任务生命周期中积极运行在任何 CPU 线程上的时间百分比。通过扩展，进程利用率是其任务的集体利用率。

利用率测量工具

有一些工具可以测量系统利用率：uptime、vmstat、mpstat、nmon 等。有一些工具可以测量单个进程利用率：time。没有很多工具可以测量系统范围内的按进程和按任务利用率。AIX 上的一个这样的命令是 curt。根据 IBM 知识中心：“curt 命令将 AIX 跟踪文件作为输入，并生成许多与处理器 (CPU) 利用率和进程/线程/pthread 活动相关的统计信息。”

AIX curt 命令报告系统范围、按处理器、按进程和按任务的统计信息，用于应用程序处理（用户时间）、系统调用（系统时间）、虚拟机监控程序调用、内核线程、中断和空闲时间。

这似乎是一个类似于 Linux 系统的命令的良好模型。

利用率数据

在开始创建任何利用率分析工具之前，重要的是要知道需要哪些数据。由于利用率直接关系到任务是否正在积极运行，因此需要相关的调度事件：任务何时开始运行，何时暂停？跟踪任务在哪个 CPU 上运行很重要，因此隐式迁移需要迁移事件。还有某些系统调用会强制显式迁移。任务的创建和删除显然很重要。由于我们想了解用户时间、系统时间、虚拟机监控程序时间和中断时间，因此需要显示这些任务状态之间转换的事件。

Linux 内核包含所有这些事件的“tracepoints”。可以直接在内核的 debugfs 文件系统中启用这些事件的跟踪，该文件系统通常挂载在 /sys/kernel/debug 中，在 tracing 目录 (/sys/kernel/debug/tracing) 中。

记录跟踪数据的一种更简单方法是使用 Linux perf 命令。

perf 命令

perf 是一个非常强大的用户空间命令，用于跟踪或计数硬件和软件事件。

软件事件在内核中预定义，可以在用户空间代码中预定义，并且可以在内核或用户空间代码中动态创建（作为“探测器”）。

然而，perf 可以做的远不止跟踪和计数。

perf stat

perf 的 stat 子命令将运行一个命令，计数一些常见的感兴趣的事件，并生成一个简单的报告

Performance counter stats for './load 100000':
 
      90537.006424      task-clock:u (msec)       #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
               915      page-faults:u             #    0.010 K/sec                  
   386,836,206,133      cycles:u                  #    4.273 GHz                      (66.67%)
     3,488,523,420      stalled-cycles-frontend:u #    0.90% frontend cycles idle     (50.00%)
   287,222,191,827      stalled-cycles-backend:u  #   74.25% backend cycles idle      (50.00%)
   291,102,378,513      instructions:u            #    0.75  insn per cycle         
                                                  #    0.99  stalled cycles per insn  (66.67%)
    43,730,320,236      branches:u                #  483.010 M/sec                    (50.00%)
       822,030,340      branch-misses:u           #    1.88% of all branches          (50.00%)
 
      90.539972837 seconds time elapsed

perf record、perf report 和 perf annotate

对于更有趣的分析，perf 命令还可以用于记录事件以及与事件发生时任务状态相关的信息

$ perf record ./some-command
[ perf record: Woken up 55 times to write data ]
[ perf record: Captured and wrote 13.973 MB perf.data (366158 samples) ]
$ perf report --stdio --show-nr-samples --percent-limit 4
# Samples: 366K of event 'cycles:u'
# Event count (approx.): 388851358382
#
# Overhead       Samples  Command  Shared Object      Symbol                                          
# ........  ............  .......  .................  ................................................
#
    62.31%        228162  load     load               [.] main
    19.29%         70607  load     load               [.] sum_add
    18.33%         67117  load     load               [.] sum_sub

此示例显示了一个程序，该程序大约 60% 的运行时间花费在函数 main 中，大约 20% 的时间分别花费在子函数 sum_sub 和 sum_add 中。请注意，perf record 使用的默认事件是“cycles”。后面的示例将展示如何将 perf record 与其他事件一起使用。

perf report 可以进一步按源代码行报告运行时统计信息（如果编译时使用 -g 标志生成调试信息）

$ perf report --stdio --show-nr-samples --percent-limit 4 --sort=srcline
# Samples: 366K of event 'cycles:u'
# Event count (approx.): 388851358382
#
# Overhead       Samples  Source:Line                        
# ........  ............  ...................................
#
    19.40%         71031  load.c:58
    16.16%         59168  load.c:18
    15.11%         55319  load.c:14
    13.30%         48690  load.c:66
    13.23%         48434  load.c:70
     4.58%         16767  load.c:62
     4.01%         14677  load.c:56

此外，perf annotate 可以显示程序每条指令的统计信息

$ perf annotate --stdio
Percent |      Source code & Disassembly of load for cycles:u (70607 samples)
------------------------------------------------------------------------------
         :      0000000010000774 <sum_add>:
         :      int sum_add(int sum, int value) {
   12.60 :        10000774:   std     r31,-8(r1)
    0.02 :        10000778:   stdu    r1,-64(r1)
    0.00 :        1000077c:   mr      r31,r1
   41.90 :        10000780:   mr      r10,r3
    0.00 :        10000784:   mr      r9,r4
    0.05 :        10000788:   stw     r10,32(r31)
   23.78 :        1000078c:   stw     r9,36(r31)
         :              return (sum + value);
    0.76 :        10000790:   lwz     r10,32(r31)
    0.00 :        10000794:   lwz     r9,36(r31)
   14.75 :        10000798:   add     r9,r10,r9
    0.00 :        1000079c:   extsw   r9,r9
         :      }
    6.09 :        100007a0:   mr      r3,r9
    0.02 :        100007a4:   addi    r1,r31,64
    0.03 :        100007a8:   ld      r31,-8(r1)
    0.00 :        100007ac:   blr

（注意：此代码未优化。）

perf top

与 top 命令类似，top 命令（以规则的更新间隔）显示使用最多 CPU 时间的进程，perf top 将显示系统上所有进程中使用最多 CPU 时间的函数，这是粒度上的一个很好的飞跃。

perf list

到目前为止的示例都使用了默认事件，即运行周期。有数百甚至数千种不同类型的事件。perf list 将显示所有这些事件。以下仅是一些示例

$ perf list
  instructions                                       [Hardware event]
  context-switches OR cs                             [Software event]
  L1-icache-loads                                    [Hardware cache event]
  mem_access OR cpu/mem_access/                      [Kernel PMU event]
cache:
  pm_data_from_l2                                   
       [The processor's data cache was reloaded from local core's L2 due to a demand load]
floating point:
  pm_fxu_busy                                       
       [fxu0 busy and fxu1 busy]
frontend:
  pm_br_mpred_cmpl                                  
       [Number of Branch Mispredicts]
memory:
  pm_data_from_dmem                                 
       [The processor's data cache was reloaded from another chip's memory on the same Node or Group (Distant) due to a demand load]
  pm_data_from_lmem                                 
       [The processor's data cache was reloaded from the local chip's Memory due to a demand load]
  rNNN                                               [Raw hardware event descriptor]
  raw_syscalls:sys_enter                             [Tracepoint event]
  syscalls:sys_enter_chmod                           [Tracepoint event]
  sdt_libpthread:pthread_create                      [SDT event]

标记为 Hardware event、Hardware cache event、Kernel PMU event 的事件，以及 cache、floating point、frontend 和 memory 等类别下的大多数（如果不是全部）事件是硬件事件，由硬件计数，并且每次达到特定计数时触发。触发后，将在内核跟踪缓冲区中输入条目，其中包含关联任务的当前状态。 Raw hardware event 代码是硬件事件的字母数字编码。这些主要是在硬件比内核更新并且用户需要启用对该硬件而言是新的事件时才需要。用户很少（如果有的话）需要使用原始事件代码。

标记为 Tracepoint event 的事件嵌入在内核中。当内核执行该代码段时，将触发这些事件。内核支持的每个系统调用都有“syscalls”事件。 raw_syscalls 事件为每个系统调用触发。由于正在主动跟踪的事件数量有限制，因此当需要跟踪大量系统调用时，raw_syscalls 事件可能更实用。

标记为 SDT event 的事件用于软件定义的跟踪点 (SDT)。这些可以嵌入在应用程序或库代码中，并根据需要启用。启用后，它们的行为与其他事件完全相同：当（系统上正在跟踪的任何任务）执行该代码段时，将在内核跟踪缓冲区中输入条目，其中包含关联任务的当前状态。这是一项非常强大的功能，可以证明非常有用。

perf buildid-cache 和 perf probe

启用 SDT 非常容易。首先，使 perf 知道某个库的 SDT

$ perf buildid-cache -v --add /lib/powerpc64le-linux-gnu/libpthread.so.0
$ perf list | grep libpthread
[…]
  sdt_libpthread:pthread_create                      [SDT event]
[…]

然后，将 SDT 定义转换为可用的跟踪点

$ /usr/bin/sudo perf probe sdt_libpthread:pthread_create
Added new event:
  sdt_libpthread:pthread_create (on %pthread_create in /lib/powerpc64le-linux-gnu/libpthread-2.27.so)
You can now use it in all perf tools, such as:
    perf record -e sdt_libpthread:pthread_create -aR sleep 1
$ perf record -a -e sdt_libpthread:pthread_create ./test
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.199 MB perf.data (9 samples) ]

请注意，应用程序或库中的任何位置都可以转换为跟踪点。要查找应用程序中可以转换为跟踪点的函数，请将 perf probe 与 –funcs 一起使用

$ perf probe –x ./load --funcs
[…]
main
sum_add
sum_sub

要启用 ./load 应用程序的函数 main 作为跟踪点

/usr/bin/sudo perf probe –x ./load main
Added new event:
  probe_load:main      (on main in /home/pc/projects/load-2.1pc/load)
You can now use it in all perf tools, such as:
    perf record –e probe_load:main –aR sleep 1
$ perf list | grep load:main
  probe_load:main                                     [Tracepoint event]
$ perf record –e probe_load:main ./load
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.024 MB perf.data (1 samples) ]

perf script

继续前面的示例，可以使用 perf script 遍历 perf.data 文件并输出每个记录的内容

$ perf script
            Load 16356 [004] 80526.760310: probe_load:main: (4006a2)

处理 perf 跟踪数据

前面的讨论和示例表明，perf 可以收集系统利用率分析所需的数据。但是，如何处理这些数据以产生所需的结果？

perf eBPF

perf 中一项相对较新且新兴的技术称为 eBPF。 BPF 是 Berkeley Packet Filter 的首字母缩写，它是一种类似于 C 的语言，最初用于（毫不奇怪）内核中的网络数据包过滤。 eBPF 是扩展 BPF 的首字母缩写，它是一种基于 BPF 的类似但更强大的 C 语言。

最新版本的 perf 可用于将编译后的 eBPF 代码合并到内核中，以安全且智能地处理任何数量用途的事件，但有一些限制。

该功能非常强大，对于事件相关数据和统计信息的实时、持续更新非常有用。

但是，由于此功能正在兴起，因此当前 Linux 发行版上的支持参差不齐。这有点复杂（或者，换句话说，我还没有弄清楚）。它也仅适用于在线使用；没有离线功能。由于这些原因，我在此处不再进一步介绍它。

perf 数据文件

perf record 生成一个 perf.data 文件。该文件是一个结构化的二进制文件，没有特别完善的文档记录，没有用于访问的编程接口，并且不清楚存在哪些兼容性保证。由于这些原因，我选择不直接使用 perf.data 文件。

perf script

上面的最后一个示例显示了如何使用 perf script 遍历 perf.data 文件并发出有关其中每个记录的基本信息。这对于处理文件和跟踪状态更改以及计算系统利用率分析所需的统计信息所需的模型是合适的。

perf script 有多种操作模式，包括 perf 自带的几种更高级别的脚本，这些脚本基于 perf.data 文件中的跟踪数据生成统计信息。

$ perf script -l
List of available trace scripts:
  rw-by-pid                            system-wide r/w activity
  rwtop [interval]                     system-wide r/w top
  wakeup-latency                       system-wide min/max/avg wakeup latency
  failed-syscalls [comm]               system-wide failed syscalls
  rw-by-file <comm>                    r/w activity for a program, by file
  failed-syscalls-by-pid [comm]        system-wide failed syscalls, by pid
  intel-pt-events                      print Intel PT Power Events and PTWRITE
  syscall-counts-by-pid [comm]         system-wide syscall counts, by pid
  export-to-sqlite [database name] [columns] [calls] export perf data to a sqlite3 database
  futex-contention                     futext contention measurement
  sctop [comm] [interval]              syscall top
  event_analyzing_sample               analyze all perf samples
  net_dropmonitor                      display a table of dropped frames
  compaction-times [-h] [-u] [-p|-pv] [-t | [-m] [-fs] [-ms]] [pid|pid-range|comm-regex] display time taken by mm compaction
  export-to-postgresql [database name] [columns] [calls] export perf data to a postgresql database
  stackcollapse                        produce callgraphs in short form for scripting use
  netdev-times [tx] [rx] [dev=] [debug] display a process of packet and processing time
  syscall-counts [comm]                system-wide syscall counts
  sched-migration                      sched migration overview
$ perf script failed-syscalls-by-pid /bin/ls
 
syscall errors:
 
comm [pid]                           count
------------------------------  ----------
 
ls [18683]
  syscall: access          
    err = ENOENT                         1
  syscall: statfs          
    err = ENOENT                         1
  syscall: ioctl           
    err = ENOTTY                         3

这些脚本是什么样的？让我们来看看。

$ locate failed-syscalls-by-pid
/usr/libexec/perf-core/scripts/python/failed-syscalls-by-pid.py
[…]
$ rpm –qf /usr/libexec/perf-core/scripts/python/failed-syscalls-by-pid.py
perf-4.14.0-46.el7a.x86_64
$ $ ls /usr/libexec/perf-core/scripts
perl  python
$ perf script -s lang
 
Scripting language extensions (used in perf script -s [spec:]script.[spec]):
 
  Perl                                       [Perl]
  pl                                         [Perl]
  Python                                     [Python]
  py                                         [Python]

因此，这些脚本随 perf 一起提供，并且 Python 和 Perl 都是受支持的语言。

请注意，在本文的全部内容中，我将仅参考 Python。

perf 脚本

这些脚本是如何工作的？以下是 /usr/libexec/perf-core/scripts/python/failed-syscalls-by-pid.py 中的重要摘录

def raw_syscalls__sys_exit(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        common_callchain, id, ret):
[…]
        if ret < 0:
[…]
                        syscalls[common_comm][common_pid][id][ret] += 1

函数 raw_syscalls__sys_exit 具有与关联事件的所有数据相关的参数。该函数的其余部分仅增加与命令、进程 ID 和系统调用关联的计数器。其余代码没有做那么多。大部分复杂性都在事件处理例程的函数签名中。

幸运的是，perf 可以轻松地找出各种跟踪点事件处理函数的正确签名。

perf script –gen-script

对于 raw_syscalls 事件，我们可以生成一个仅包含这些事件的跟踪

$ perf list | grep raw_syscalls
  raw_syscalls:sys_enter                             [Tracepoint event]
  raw_syscalls:sys_exit                              [Tracepoint event]
$ perf record -e 'raw_syscalls:*' /bin/ls >/dev/null
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.025 MB perf.data (176 samples) ]

然后，我们可以让 perf 生成一个脚本，其中包含 perf.data 文件中事件的事件处理函数的示例实现

$ perf script --gen-script python
generated Python script: perf-script.py

我们在脚本中发现了什么？

def raw_syscalls__sys_exit(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        common_callchain, id, ret):
[…] 
def raw_syscalls__sys_enter(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        common_callchain, id, args):

两个事件处理函数都使用它们的签名指定。很好！

请注意，此脚本适用于 perf script –s

$ perf script -s ./perf-script.py
in trace_begin
raw_syscalls__sys_exit     7 94571.445908134    21117 ls                    id=0, ret=0
raw_syscalls__sys_enter     7 94571.445942946    21117 ls                    id=45, args=���?bc���?�
[…]

现在我们有了一个模板，可以基于此编写 Python 脚本来解析感兴趣的事件，以报告系统利用率。

perf 脚本

perf script –gen-script 生成的 Python 脚本不是直接可执行的。它们必须由 perf 调用

$ perf script –s ./perf-script.py

这里到底发生了什么？

首先，perf 启动。 script 子命令的 -s 选项指示将使用外部脚本。
perf 建立 Python 运行时环境。
perf 加载指定的脚本。
perf 运行脚本。脚本可以执行正常的初始化，甚至处理命令行参数，尽管传递参数有点笨拙，需要在 perf 和脚本的参数之间使用 -- 分隔符
```
$ perf script -s ./perf-script.py -- --script-arg1 [...]
```
perf 处理跟踪文件的每个记录，调用脚本中适当的事件处理函数。这些事件处理函数可以执行它们需要做的任何事情。

利用率

看起来 perf 脚本具有足够的功能来提供可行的解决方案。生成系统利用率统计信息需要哪些类型的信息？

任务创建 (fork, pthread_create)
任务终止 (exit)
任务替换 (exec)
任务迁移，显式或隐式，以及当前 CPU
任务调度
系统调用
虚拟机监控程序调用
中断

了解任务在各种系统调用、处理中断或显式调用虚拟机监控程序上花费的时间比例可能很有帮助。这些时间类别中的每一个都可以被视为任务的“状态”，并且需要跟踪从一种状态转换到另一种状态的方法

该图表最重要的点是每个状态转换都有事件。

任务创建：clone 系统调用
任务终止：sched:sched_process_exit
任务替换：sched:sched_process_exec
任务迁移：sched_setaffinity 系统调用（显式），sched:sched_migrate_task（隐式）
任务调度：sched:sched_switch
系统调用：raw_syscalls:sys_enter，raw_syscalls:sys_exit
虚拟机监控程序调用：（特定于 POWER）powerpc:hcall_entry，powerpc:hcall_exit
中断：irq:irq_handler_entry，irq:irq_handler_exit

Linux 的 curt 命令

perf 提供了一个合适的框架，可以用来捕获系统利用率的必要数据。 Linux 内核中有一组足够的事件可用于跟踪。 Python 脚本功能允许强大而灵活的方式来处理跟踪数据。现在是编写工具的时候了。

高级设计

在处理每个事件时，必须更新受影响任务的相关状态

新任务？创建并初始化数据结构以跟踪任务的状态
- 命令
- 进程 ID
- 任务 ID
- 迁移计数 (0)
- 当前 CPU
此任务的新 CPU？为 CPU 特定数据创建并初始化数据结构
- 用户时间 (0)
- 系统时间 (0)
- 虚拟机监控程序时间 (0)
- 中断时间 (0)
- 空闲时间 (0)
此任务的新事务？为事务特定数据创建并初始化数据结构
- 经过时间 (0)
- 计数 (0)
- 最小值 (maxint)，最大值 (0)
现有任务？
- 累积先前状态的时间
- 事务结束？累积事务的时间，调整最小值、最大值
设置新状态
保存当前时间（进入当前状态的时间）
迁移？增加迁移计数

高级示例

对于 raw_syscalls:sys_enter 事件

如果以前从未见过此任务，则分配并初始化新的任务数据结构
如果 CPU 对于此任务是新的，则分配并初始化新的 CPU 数据结构
如果此系统调用对于此任务是新的，则分配并初始化新的调用数据结构
在任务数据结构中
- 在当前状态（“用户”）的存储桶中累积自上次状态更改以来的时间
- 设置新状态（“系统”）
- 将当前时间戳保存为新状态的此时间段的开始时间

边缘情况

sys_exit 作为任务的第一个事件

如果任务的跟踪中的第一个事件是 raw_syscalls:sys_exit

没有匹配的 raw_syscalls:sys_enter 来确定此系统调用的开始时间。
自跟踪开始以来累积的时间都花费在系统调用中，并且需要添加到此系统调用的所有调用中花费的总经过时间中。
此系统调用的经过时间未知。
在计算此系统调用的平均值、最小值或最大值统计信息时，考虑此经过时间是不准确的。

在这种情况下，该工具会创建一个名为“pending”的单独存储桶，用于存储无法在平均值、最小值或最大值中计算的系统调用中花费的时间。

所有事务事件（系统调用、虚拟机监控程序调用和中断）都需要“pending”存储桶。

sys_enter 作为任务的最后一个事件

类似地，如果任务的跟踪中的最后一个事件是 raw_syscalls:sys_enter

没有匹配的 raw_syscalls:sys_exit 来确定此系统调用的结束时间。
从系统调用开始到跟踪结束的累积时间都花费在系统调用中，并且需要添加到此系统调用的所有调用中花费的总经过时间中。
此系统调用的经过时间未知。
在计算此系统调用的平均值、最小值或最大值统计信息时，考虑此经过时间是不准确的。

此经过时间也累积在“pending”存储桶中。

所有事务事件（系统调用、虚拟机监控程序调用和中断）都需要“pending”存储桶。

由于只有在跟踪结束时才能发现此条件，因此该工具中需要一个最终的“wrap-up”步骤，在该步骤中，根据所有已知任务的最终状态完成这些任务的统计信息。

不确定的状态

繁忙的任务（或短跟踪）可能永远看不到来自任务的事件，从中可以确定任务的状态。例如，如果仅看到任务的 sched:sched_switch 或 sched:sched_task_migrate 事件，则无法确定该任务的状态。但是，已知该任务存在并且正在运行。

由于无法确定实际状态，因此任务的运行时累积在一个单独的存储桶中，任意称为“busy-unknown”。为了完整起见，此时间也会显示在最终报告中。

不可见的任务

对于非常非常繁忙的任务（或短跟踪），任务可能在收集跟踪的整个时间内都在积极运行，但该任务的任何事件都不会出现在跟踪中。它从未迁移、暂停或被迫等待。

此类任务无法被该工具识别，并且不会出现在报告中。

curt.py Python 类

Task

每个任务一个
保存所有任务特定数据（命令、进程 ID、状态、CPU、CPU 数据结构列表 [见下文]、迁移计数、每个调用数据结构列表 [见下文]）
维护任务状态

Call

每个唯一事务，每个任务一个（例如，一个用于“open”系统调用，一个用于“close”系统调用，一个用于 IRQ 27 等）
保存特定于调用的数据（例如，开始时间戳、计数、经过时间、最小值、最大值）
根据需要分配（延迟分配）
存储在任务内的 Python 字典中，该字典由调用的唯一标识符（例如，系统调用代码、IRQ 编号等）索引。

CPU

每个 CPU 一个，在该 CPU 上已观察到此任务正在运行
保存每个 CPU 的任务数据（例如，用户时间、系统时间、虚拟机监控程序调用时间、中断时间）
根据需要分配（延迟分配）
存储在任务内的 Python 字典中，该字典由 CPU 编号索引。

curt.py 事件处理示例

如前所述，perf script 将迭代跟踪中的所有事件，并为每个事件调用适当的事件处理函数。

给定上面的高级示例，首次尝试用于 sys_exit 的事件处理函数可能是

tasks = {}

def raw_syscalls__sys_enter(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, id, args):
 
  # convert the multiple timestamp values into a single value
  timestamp = nsecs(common_secs, common_nsecs)

  # find this task's data structure
  try:
    task = tasks[common_pid]
  except:
    # new task!
    task = Task()
    # save the command string
    task.comm = common_comm
    # save the new task in the global list (dictionary) of tasks
    tasks[common_pid] = task

  if common_cpu not in task.cpus:
    # new CPU!
    task.cpu = common_cpu
    task.cpus[common_cpu] = CPU()

  # compute time spent in the previous state ('user')
  delta = timestamp – task.timestamp
  # accumulate 'user' time for this task/CPU
  task.cpus[task.cpu].user += delta
  if id not in task.syscalls:
    # new system call for this task!
    task.syscalls[id] = Call()

  # change task's state
  task.mode = 'sys'

  # save the timestamp for the last event (this one) for this task
  task.timestamp = timestamp

def raw_syscalls__sys_exit(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, id, ret):

  # convert the multiple timestamp values into a single value
  timestamp = nsecs(common_secs, common_nsecs)

  # get the task data structure
  task = tasks[common_pid]

  # compute elapsed time for this system call
  delta =  task.timestamp - timestamp

  # accumulate time for this task/system call
  task.syscalls[id].elapsed += delta
  # increment the tally for this task/system call
  task.syscalls[id].count += 1
  # adjust statistics
  if delta < task.syscalls[id].min:
    task.syscalls[id].min = delta
  if delta > task.syscalls[id].max:
    task.syscalls[id].max = delta

  # accumulate time for this task's state on this CPU
  task.cpus[common_cpu].system += delta

  # change task's state
  task.mode = 'user'

  # save the timestamp for the last event (this one) for this task
  task.timestamp = timestamp

处理边缘情况

以下是一些可能且必须处理的边缘情况。

Sys_exit 作为第一个事件

由于系统范围的跟踪可以在任意时间启动，因此任务的第一个事件当然可能是 raw_syscalls:sys_exit。这需要将来自 raw_syscalls:sys_enter 的事件处理函数的新任务发现的相同代码添加到 raw_syscalls:sys_exit 的处理程序中。这

  # get the task data structure
  task = tasks[common_pid]

变为这样

  # find this task's data structure
  try:
    task = tasks[common_pid]
  except:
    # new task!
    task = Task()
    # save the command string
    task.comm = common_comm
    # save the new task in the global list (dictionary) of tasks
    tasks[common_pid] = task

另一个问题是，由于没有系统调用开始时间的时间戳，因此无法正确累积此系统调用的数据。从跟踪开始到此事件的时间已由此任务在系统调用中花费。忽略此时间是不准确的。以使其用于计算平均值、最小值或最大值的方式合并此时间也是不准确的。唯一合理的选择是将此时间单独累积，称其为“pending”系统时间。为了准确计算此时间，必须知道跟踪的第一个事件的时间戳。由于任何事件都可能是跟踪中的第一个事件，因此每个事件都必须有条件地保存其时间戳（如果它是第一个事件）。需要一个全局变量

start_timestamp = 0

并且每个事件处理函数都必须有条件地保存其时间戳

  # convert the multiple timestamp values into a single value
  timestamp = nsecs(common_secs, common_nsecs)

  If start_timestamp = 0:
    start_timestamp = timestamp

因此，raw_syscalls:sys_exit 的事件处理函数变为

def raw_syscalls__sys_exit(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, id, ret):

  # convert the multiple timestamp values into a single value
  timestamp = nsecs(common_secs, common_nsecs)

  If start_timestamp = 0:
    start_timestamp = timestamp

  # find this task's data structure
  try:
    task = tasks[common_pid]

    # compute elapsed time for this system call
    delta =  task.timestamp - timestamp

    # accumulate time for this task/system call
    task.syscalls[id].elapsed += delta
    # increment the tally for this task/system call
    task.syscalls[id].count += 1
    # adjust statistics
    if delta < task.syscalls[id].min:
      task.syscalls[id].min = delta
    if delta > task.syscalls[id].max:
      task.syscalls[id].max = delta

  except:
    # new task!
    task = Task()
    # save the command string
    task.comm = common_comm
    # save the new task in the global list (dictionary) of tasks
    tasks[common_pid] = task

    # compute elapsed time for this system call
    delta =  start_timestamp - timestamp

    # accumulate time for this task/system call
    task.syscalls[id].pending += delta

  # accumulate time for this task's state on this CPU
  task.cpus[common_cpu].system += delta

  # change task's state
  task.mode = 'user'

  # save the timestamp for the last event (this one) for this task
  task.timestamp = timestamp

Sys_enter 作为最后一个事件

与将 sys_exit 作为任务的第一个事件类似的问题是，当 sys_enter 是任务的最后一个事件时。为了完整起见，必须累积在系统调用中花费的时间，但不能准确地影响平均值、最小值或最大值。此时间也将累积到单独的“pending”状态。

为了准确确定 pending 系统调用的经过时间（从 sys_entry 到跟踪期间结束），需要跟踪文件中最后一个事件的时间戳。不幸的是，在事件已被处理之前，无法知道哪个事件是最后一个事件。因此，所有事件都必须将其各自的时间戳保存在全局变量中。

可能有很多任务处于看到它们的最后一个事件是 sys_enter 的状态。因此，在处理完最后一个事件后，需要一个最终的“wrap up”步骤来完成这些任务的统计信息。幸运的是，在处理完最后一个事件后，perf 会调用一个 trace_end 函数。

最后，我们需要在每个 sys_enter 中保存系统调用的 id。

curr_timestamp = 0 

def raw_syscalls__sys_enter(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, id, args):

  # convert the multiple timestamp values into a single value 
  curr_timestamp = nsecs(common_secs, common_nsecs)
[…] 
  task.syscall = id
[…]

def trace_end(): 
        for tid in tasks.keys(): 
                task = tasks[tid] 
                # if this task ended while executing a system call
                if task.mode == 'sys':
                        # compute the time from the entry to the system call to the end of the trace period
                        delta = curr_timestamp - task.timestamp
                        # accumulate the elapsed time for this system call
                        task.syscalls[task.syscall].pending += delta
                        # accumulate the system time for this task/CPU
                        task.cpus[task.cpu].sys += delta

迁移

任务迁移是指当在一个 CPU 上运行的任务被移动到另一个 CPU 时。这可能通过以下方式发生：

显式请求（例如，调用 sched_setaffinity），或者
内核隐式地（例如，负载均衡或腾空要离线的 CPU）

检测到时

应增加任务的迁移计数
应更新先前 CPU 的统计信息
如果 CPU 对于任务是新的，则可能需要更新和初始化新的 CPU 数据结构
任务的当前 CPU 设置为新的 CPU

为了获得准确的统计信息，必须尽快检测到任务迁移。第一种情况，显式请求，发生在系统调用中，并且可以在该系统调用的 sys_exit 事件中检测到。第二种情况有其自身的事件 sched:sched_migrate_task，因此它需要一个新的事件处理函数。

def raw_syscalls__sys_exit(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, id, ret):

  # convert the multiple timestamp values into a single value
  timestamp = nsecs(common_secs, common_nsecs)

  If start_timestamp = 0:
    start_timestamp = timestamp

  # find this task's data structure
  try:
    task = tasks[common_pid]

    # compute elapsed time for this system call
    delta =  task.timestamp - timestamp

    # accumulate time for this task/system call
    task.syscalls[id].elapsed += delta
    # increment the tally for this task/system call
    task.syscalls[id].count += 1
    # adjust statistics
    if delta < task.syscalls[id].min:
      task.syscalls[id].min = delta
    if delta > task.syscalls[id].max:
      task.syscalls[id].max = delta

  except:
    # new task!
    task = Task()
    # save the command string
    task.comm = common_comm
    # save the new task in the global list (dictionary) of tasks
    tasks[common_pid] = task

    task.cpu = common_cpu

    # compute elapsed time for this system call
    delta =  start_timestamp - timestamp

    # accumulate time for this task/system call
    task.syscalls[id].pending += delta

  If common_cpu != task.cpu:
    task.migrations += 1
    # divide the time spent in this syscall in half...
    delta /= 2
    # and give have to the previous CPU, below, and half to the new CPU, later
    task.cpus[task.cpu].system += delta

  # accumulate time for this task's state on this CPU
  task.cpus[common_cpu].system += delta

  # change task's state
  task.mode = 'user'

  # save the timestamp for the last event (this one) for this task
  task.timestamp = timestamp

def sched__sched_migrate_task(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        common_callchain, comm, pid, prio, orig_cpu, 
        dest_cpu, perf_sample_dict):

  If start_timestamp = 0:
    start_timestamp = timestamp

  # find this task's data structure
  try:
    task = tasks[common_pid]
  except:
    # new task!
    task = Task()
    # save the command string
    task.comm = common_comm
    # save the new task in the global list (dictionary) of tasks
    tasks[common_pid] = task

    task.cpu = common_cpu

    If common_cpu not in task.cpus:
      task.cpus[common_cpu] = CPU()

    task.migrations += 1

任务创建

为了准确收集任务的统计信息，必须知道任务何时创建。任务可以使用 fork() 创建，这将创建一个新进程，或者使用 pthread_create() 创建，这将在同一进程中创建一个新任务。幸运的是，两者都通过 clone 系统调用体现出来，并通过 sched:sched_process_fork 事件显现出来。任务的生命周期从 sched_process_fork 事件开始。出现的边缘情况是新任务的第一个可能事件是

当新任务开始运行时，sched_switch。在此事件发生之前，应将新任务视为空闲状态
clone 系统调用的 sys_exit。新任务的初始状态需要基于创建它的任务的状态，包括在 clone 系统调用中。

必须处理的一个边缘情况是，如果创建任务（父任务）尚不为人所知，则必须创建并初始化它，并且假设它自跟踪开始以来一直在积极运行。

def sched__sched_process_fork(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        common_callchain, parent_comm, parent_pid, child_comm, child_pid):
  global start_timestamp, curr_timestamp
  curr_timestamp = self.timestamp
  if (start_timestamp == 0):
    start_timestamp = curr_timestamp
  # find this task's data structure
  try:
    task = tasks[common_pid]
  except:
    # new task!
    task = Task()
    # save the command string
    task.comm = common_comm
    # save the new task in the global list (dictionary) of tasks
    tasks[common_pid] = task
  try:
    parent = tasks[self.parent_tid]
  except:
    # need to create parent task here!
    parent = Task(start_timestamp, self.command, 'sys', self.pid)
    parent.sched_stat = True # ?
    parent.cpu = self.cpu
    parent.cpus[parent.cpu] = CPU()
    tasks[self.parent_tid] = parent
 
    task.resume_mode = parent.mode
    task.syscall = parent.syscall
    task.syscalls[task.syscall] = Call()
    task.syscalls[task.syscall].timestamp = self.timestamp

任务退出

类似地，为了完整和准确的任务统计信息，必须知道任务何时终止。有一个事件用于此目的：sched:sched_process_exit。这一个很容易处理，因为所做的努力只是关闭统计信息并适当地设置模式，因此任何跟踪结束处理都不会认为任务仍在活动状态

def sched__sched_process_exit_old(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        common_callchain, comm, pid, prio):
  global start_timestamp, curr_timestamp
  curr_timestamp = self.timestamp
  if (start_timestamp == 0):
    start_timestamp = curr_timestamp

  # find this task's data structure
  try:
    task = tasks[common_pid]
  except:
    # new task!
    task = Task()
    # save the command string
    task.comm = common_comm
    task.timestamp = curr_timestamp
    # save the new task in the global list (dictionary) of tasks
    tasks[common_pid] = task

  delta = timestamp – task.timestamp
  task.sys += delta
  task.mode = 'exit'

输出

以下是 curt 显示的报告示例，略微重新格式化以适应较窄的页面宽度，并删除了空闲时间分类数据（这使得输出非常宽），并且为了简洁起见。可以看到两个进程，1497 和 2857。进程 1497 有两个任务，1497 和 1523。每个任务都有每个 CPU 的摘要和系统范围（“ALL”CPU）的摘要。每个任务的数据之后是该任务的系统调用数据（如果有）、虚拟机监控程序调用数据（如果有）和中断数据（如果有）。在每个进程的各自任务之后是每个进程的摘要。进程 2857 有一个任务 2857-0，它是 exec() 系统调用替换进程映像之前的上一个任务映像。在所有进程之后是系统范围的摘要。

1497:
-- [  task] command     cpu      user       sys       irq        hv      busy      idle |  util% moves
   [  1497] X             2  0.076354  0.019563  0.000000  0.000000  0.000000 15.818719 |   0.6%
   [  1497] X           ALL  0.076354  0.019563  0.000000  0.000000  0.000000 15.818719 |   0.6%     0
 
  -- ( ID)name             count   elapsed      pending      average      minimum      maximum
     (  0)read                 2  0.004699     0.000000     0.002350     0.002130     0.002569
     (232)epoll_wait           1  9.968375     5.865208     9.968375     9.968375     9.968375
 
-- [  task] command     cpu      user       sys       irq        hv      busy      idle |  util% moves
   [  1523] InputThread   1  0.052598  0.037073  0.000000  0.000000  0.000000 15.824965 |   0.6%
   [  1523] InputThread ALL  0.052598  0.037073  0.000000  0.000000  0.000000 15.824965 |   0.6%     0
 
  -- ( ID)name             count   elapsed      pending      average      minimum      maximum
     (  0)read                14  0.011773     0.000000     0.000841     0.000509     0.002185
     (  1)write                2  0.010763     0.000000     0.005381     0.004974     0.005789
     (232)epoll_wait           1  9.966649     5.872853     9.966649     9.966649     9.966649
 
-- [  task] command     cpu      user       sys       irq        hv      busy      idle |  util% moves
   [   ALL]             ALL  0.128952  0.056636  0.000000  0.000000  0.000000 31.643684 |   0.6%     0
 
2857:
-- [  task] command     cpu      user       sys       irq        hv      busy      idle |  util% moves
   [  2857] execs.sh      1  0.257617  0.249685  0.000000  0.000000  0.000000  0.266200 |  65.6%
   [  2857] execs.sh      2  0.000000  0.023951  0.000000  0.000000  0.000000  0.005728 |  80.7%
   [  2857] execs.sh      5  0.313509  0.062271  0.000000  0.000000  0.000000  0.344279 |  52.2%
   [  2857] execs.sh      6  0.136623  0.128883  0.000000  0.000000  0.000000  0.533263 |  33.2%
   [  2857] execs.sh      7  0.527347  0.194014  0.000000  0.000000  0.000000  0.990625 |  42.1%
   [  2857] execs.sh    ALL  1.235096  0.658804  0.000000  0.000000  0.000000  2.140095 |  46.9%     4
 
  -- ( ID)name             count   elapsed      pending      average      minimum      maximum
     (  9)mmap                15  0.059388     0.000000     0.003959     0.001704     0.017919
     ( 14)rt_sigprocmask      12  0.006391     0.000000     0.000533     0.000431     0.000711
     (  2)open                 9  2.253509     0.000000     0.250390     0.008589     0.511953
     (  3)close                9  0.017771     0.000000     0.001975     0.000681     0.005245
     (  5)fstat                9  0.007911     0.000000     0.000879     0.000683     0.001182
     ( 10)mprotect             8  0.052198     0.000000     0.006525     0.003913     0.018073
     ( 13)rt_sigaction         8  0.004281     0.000000     0.000535     0.000458     0.000751
     (  0)read                 7  0.197772     0.000000     0.028253     0.000790     0.191028
     ( 12)brk                  5  0.003766     0.000000     0.000753     0.000425     0.001618
     (  8)lseek                3  0.001766     0.000000     0.000589     0.000469     0.000818
 
-- [  task] command     cpu      user       sys       irq        hv      busy      idle |  util% moves
   [2857-0] perf          6  0.053925  0.191898  0.000000  0.000000  0.000000  0.827263 |  22.9%
   [2857-0] perf          7  0.000000  0.656423  0.000000  0.000000  0.000000  0.484107 |  57.6%
   [2857-0] perf        ALL  0.053925  0.848321  0.000000  0.000000  0.000000  1.311370 |  40.8%     1
 
  -- ( ID)name             count   elapsed      pending      average      minimum      maximum
     (  0)read                 0  0.000000     0.167845           --           --           --
     ( 59)execve               0  0.000000     0.000000           --           --           --
 
ALL:
-- [  task] command     cpu      user       sys       irq        hv      busy      idle |  util% moves
   [   ALL]             ALL 10.790803 29.633170  0.160165  0.000000  0.137747 54.449823 |   7.4%    50
 
  -- ( ID)name             count   elapsed      pending      average      minimum      maximum
     (  1)write             2896  1.623985     0.000000     0.004014     0.002364     0.041399
     (102)getuid            2081  3.523861     0.000000     0.001693     0.000488     0.025157
     (142)sched_setparam     691  7.222906    32.012841     0.024925     0.002024     0.662975
     ( 13)rt_sigaction       383  0.235087     0.000000     0.000614     0.000434     0.014402
     (  8)lseek              281  0.169157     0.000000     0.000602     0.000452     0.013404
     (  0)read               133  2.782795     0.167845     0.020923     0.000509     1.864439
     (  7)poll                96  8.583354   131.889895     0.193577     0.000626     4.596280
     (  4)stat                93  7.036355     1.058719     0.183187     0.000981     3.661659
     ( 47)recvmsg             85  0.146644     0.000000     0.001725     0.000646     0.019067
     (  3)close               79  0.171046     0.000000     0.002165     0.000428     0.020659
     (  9)mmap                78  0.311233     0.000000     0.003990     0.001613     0.017919
     (186)gettid              74  0.067315     0.000000     0.000910     0.000403     0.014075
     (  2)open                71  3.081589     0.213059     0.184248     0.001921     0.937946
     (202)futex               62  5.145112   164.286154     0.405566     0.000597    11.587437
 
  -- ( ID)name             count   elapsed      pending      average      minimum      maximum
     ( 12)i8042               10  0.160165     0.000000     0.016016     0.010920     0.032805
 
Total Trace Time: 15.914636 ms

障碍和问题

以下是在 curt 的开发中遇到的一些问题。

乱序事件

更具挑战性的问题之一是发现 perf.data 文件中的事件可能时间顺序错乱。对于试图仔细监控状态转换的程序来说，这是一个严重的问题。例如，跟踪可能包括以下事件序列，这些事件按它们在跟踪文件中出现的顺序显示

time 0000:  sys_enter syscall1
time 0007:  sys_enter syscall2
time 0006:  sys_exit syscall1
time 0009:  sys_exit syscall2

只是盲目地按照事件呈现给它们各自的事件处理函数的顺序（时间顺序错误）处理这些事件将导致不正确的统计信息（或更糟）。

处理乱序事件的最用户友好的方法包括

首先通过更改 perf record 的工作方式来防止跟踪出现乱序事件
提供一种重新排序跟踪文件中事件的方法，也许通过增强 perf inject
修改 perf script 的工作方式，以便按时间顺序将事件呈现给事件处理函数

但是用户友好与直接或容易不同。此外，以上任何一项都不在用户的控制之下。

我选择为传入事件实现一个队列，该队列应足够深，以便正确地重新排序所有事件。这需要对代码进行重大重新设计，包括为每个事件实现类，并将每个事件类型的事件处理移动到该事件类的方法中。

在重新设计的代码中，实际的事件处理程序的唯一工作是将事件中的相关数据保存到事件类的实例中，将其排队，然后处理队列中顶部（时间最旧）的事件

def raw_syscalls__sys_enter(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, id, args):
         event = Event_sys_enter(nsecs(common_secs,common_nsecs), common_cpu, common_pid, common_comm, id)
        process_event(event)

简单的可重新排序的排队机制在一个通用函数中

events = []
n_events = 0
def process_event(event):
        global events,n_events,curr_timestamp
        i = n_events
        while i > 0 and events[i-1].timestamp > event.timestamp:
                i = i-1
        events.insert(i,event)
        if n_events < params.window:
                n_events = n_events+1
        else:
                event = events[0]
                # need to delete from events list now,
                # because event.process() could reenter here
                del events[0]
                if event.timestamp < curr_timestamp:
                        sys.stderr.write("Error: OUT OF ORDER events detected.\n  Try increasing the size of the look-ahead window with --window=<n>\n")
                event.process()

请注意，队列的大小是可配置的，主要用于性能和限制内存消耗。当队列大小不足以消除乱序事件时，该函数将报告。值得考虑是否将这种情况视为灾难性故障并选择终止程序。

为每种事件类型实现一个类导致了一些关于重构的考虑，以便通用代码可以合并到基类中

class Event (object):
 
        def __init__(self):
                self.timestamp = 0
                self.cpu = 0
                self.tid = 0
                self.command = 'unknown'
                self.mode = 'unknown'
                self.pid = 0
 
        def process(self):
                global start_timestamp
 
                try:
                        task = tasks[self.tid]
                        if task.pid == 'unknown':
                                tasks[self.tid].pid = self.pid
                except:
                        task = Task(start_timestamp, self.command, self.mode, self.pid)
                        tasks[self.tid] = task
 
                if self.cpu not in task.cpus:
                        task.cpus[self.cpu] = CPU()
                        if task.cpu == 'unknown':
                                task.cpu = self.cpu
 
                if self.cpu != task.cpu:
                        task.cpu = self.cpu
                        task.migrations += 1
 
                return task

然后将类似地构造每种事件类型的类

class Event_sys_enter ( Event ):
 
        def __init__(self, timestamp, cpu, tid, comm, id, pid):
                self.timestamp = timestamp
                self.cpu = cpu
                self.tid = tid
                self.command = comm
                self.id = id
                self.pid = pid
                self.mode = 'busy-unknown'
                
        def process(self):
                global start_timestamp, curr_timestamp
                curr_timestamp = self.timestamp
                if (start_timestamp == 0):
                        start_timestamp = curr_timestamp
 
                task = super(Event_sys_enter, self).process()
 
                if task.mode == 'busy-unknown':
                        task.mode = 'user'
                        for cpu in task.cpus:
                                task.cpus[cpu].user = task.cpus[cpu].busy_unknown
                                task.cpus[cpu].busy_unknown = 0
 
                task.syscall = self.id
                if self.id not in task.syscalls:
                        task.syscalls[self.id] = Call()
 
                task.syscalls[self.id].timestamp = curr_timestamp
                task.change_mode(curr_timestamp, 'sys')

上面的进一步重构也很明显，将基于任务状态更改和状态更改本身更新相关统计信息的通用代码移动到 Task 类的 change_mode 方法中。

跟踪开始时间戳

如上所述，对于依赖于经过时间的脚本，应该有一种更简单的方法来获取跟踪中的第一个时间戳，而不是强制每个事件处理函数有条件地将其时间戳保存为跟踪开始时间戳。

笨拙的调用

调用 perf Python 脚本（包括脚本参数）的语法有点笨拙

$ perf script –s ./curt.py -- --window=80

此外，perf Python 脚本本身不可执行也很笨拙。

curt.py 脚本已直接可执行，并将调用 perf，而 perf 又将调用该脚本。实现有点令人困惑，但易于使用

$ ./curt.py --window=80

此脚本必须检测到何时被直接调用。由 perf 建立的 Python 环境是一个虚拟模块，perf Python 脚本从中导入

try:
        from perf_trace_context import *

如果此导入失败，则脚本被直接调用。在这种情况下，脚本将 exec perf，指定自身作为要运行的脚本，并传递任何命令行参数

except:
        if len(params.file_or_command) == 0:
                params.file_or_command = [ "perf.data" ]
        sys.argv = ['perf', 'script', '-i' ] + params.file_or_command + [ '-s', sys.argv[0] ]
        sys.argv.append('--')
        sys.argv += ['--window', str(params.window)]
        if params.debug:
                sys.argv.append('--debug')
        sys.argv += ['--api', str(params.api)]
        if params.debug:
                print sys.argv
        os.execvp("perf", sys.argv)
        sys.exit(1)

通过这种方式，该脚本不仅可以直接运行，还可以使用 perf script 命令运行。

需要同时注册事件

perf 启用事件的方式的人工产物可能会导致意外的跟踪数据。例如，指定

$ perf record –a –e raw_syscalls:sys_enter –e raw_syscalls:sys_exit ./command

将导致跟踪文件以单个任务（perf 命令本身）的以下一系列事件开始

sys_enter
sys_enter
sys_enter
…

发生这种情况是因为 perf 将为系统上的每个 CPU 注册 sys_enter 事件（由于 -a 参数），然后它将为每个 CPU 注册 sys_exit 事件。在后一种情况下，由于已为每个 CPU 启用了 sys_enter 事件，因此该事件会显示在跟踪中；但是由于在调用返回后才在每个 CPU 上启用 sys_exit，因此 sys_exit 调用不会显示在跟踪中。相反的问题发生在跟踪文件的末尾，跟踪中出现一系列 sys_exit 事件，因为 sys_enter 事件已被禁用。

此问题的解决方案是将事件分组，但这方面的文档记录不足

$ perf record –e '{raw_syscalls:sys_enter,raw_syscalls:sys_exit}' ./command

使用此语法，sys_enter 和 sys_exit 事件将同时启用。

繁琐的记录步骤

计算任务的完整统计信息需要许多不同的事件。这导致记录命令非常长且复杂

$ perf record -e '{raw_syscalls:*,sched:sched_switch,sched:sched_migrate_task,sched:sched_process_exec,sched:sched_process_fork,sched:sched_process_exit,sched:sched_stat_runtime,sched:sched_stat_wait,sched:sched_stat_sleep,sched:sched_stat_blocked,sched:sched_stat_iowait,powerpc:hcall_entry,powerpc:hcall_exit}' -a *command --args*

此问题的解决方案是使脚本能够自行执行记录步骤，即自行调用 perf。进一步的增强是在记录完成后继续进行，并报告该记录中的统计信息

if params.record:
        # [ed. Omitting here the list of events for brevity]
        eventlist = '{' + eventlist + '}' # group the events
        command = ['perf', 'record', '--quiet', '--all-cpus',
                '--event', eventlist ] + params.file_or_command
        if params.debug:
                print command
        subprocess.call(command)

记录和报告所需的命令语法变为

$ ./curt.py --record ./command

进程 ID 和 perf API 变更

perf 脚本对进程 ID 的处理方式有些随意。请注意上面，为生成的事件处理函数提供的常用参数之一名为 common_pid。这不是进程 ID，而是任务 ID。实际上，在当前许多基于 Linux 的发行版上，无法从 perf Python 脚本中确定任务的进程 ID。这对于想要计算进程统计信息的脚本来说是一个严重的问题。

幸运的是，在 Linux 内核 v4.14 中，为每个事件处理函数提供了一个额外的参数——perf_sample_dict——这是一个字典，可以从中提取进程 ID：（perf_sample_dict['sample']['pid']）。

不幸的是，当前的 Linux 发行版可能没有该版本的 Linux 内核。如果脚本编写为期望该额外参数，则脚本将失败并报告错误

TypeError: irq__irq_handler_exit_new() takes exactly 11 arguments (10 given)

理想情况下，应该提供一种自动发现是否传递了额外参数的方法，以允许脚本轻松地使用旧的和新的 API 运行，并在新的 API 可用时利用它。不幸的是，目前还没有显而易见的方法。

由于很明显使用新的 API 来确定进程范围的统计信息具有价值，因此 curt 提供了一个命令行选项来使用新的 API。然后，curt 利用 Python 的惰性函数绑定在运行时调整要使用的 API

if params.api == 1:
        dummy_dict = {}
        dummy_dict['sample'] = {}
        dummy_dict['sample']['pid'] = 'unknown'
        raw_syscalls__sys_enter = raw_syscalls__sys_enter_old
        […]
else:
        raw_syscalls__sys_enter = raw_syscalls__sys_enter_new
        […]

这需要每个事件有两个函数

def raw_syscalls__sys_enter_new(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, id, args, perf_sample_dict):
 
        event = Event_sys_enter(nsecs(common_secs,common_nsecs), common_cpu, common_pid, common_comm, id, perf_sample_dict['sample']['pid'])
        process_event(event)
 
def raw_syscalls__sys_enter_old(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, id, args):
        global dummy_dict
        raw_syscalls__sys_enter_new(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, id, args, dummy_dict)

请注意，旧 API 的事件处理函数将使用新 API 的函数，传递一个静态定义的字典，其中包含足够的数据，以便像 perf_sample_dict['sample']['pid'] 这样访问它时可以工作（结果为 'unknown'）。

在其他 CPU 上报告的事件

并非所有引用任务的事件都从任务正在运行的 CPU 上报告。这可能会导致人为地提高迁移计数和其他不正确的统计信息。对于这些类型的事件 (sched_stat)，事件 CPU 将被忽略。

显式迁移（没有 sched_migrate 事件）

虽然方便地有一个事件用于指示内核何时决定将任务从一个 CPU 迁移到另一个 CPU，但没有事件用于指示任务何时自行请求迁移。这些迁移通过系统调用 (sched_setaffinity) 实现，因此 sys_exit 事件处理程序必须将事件 CPU 与任务的 CPU 进行比较，如果不同，则假定发生了迁移。（这在上面已经描述过，但为了完整起见，此处在“问题”部分重复。）

将系统调用 ID 映射到名称是特定于架构的

系统调用在事件中仅以唯一的数字标识符标识。这些标识符不易被人类在报告中理解。这些数字标识符不容易映射到它们的助记符，因为它们是特定于架构的，并且新的系统调用可以在较新的内核中添加。幸运的是，perf 提供了一种将系统调用数字标识符映射到系统调用名称的方法。下面是一个简单的例子

from Util import syscall_name
def raw_syscalls__sys_enter(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        common_callchain, id, args, perf_sample_dict):
                print "%s id=%d" % (syscall_name(id), id)

不幸的是，使用 syscall_name 引入了对 audit python 绑定的依赖。此依赖关系正在上游版本的 perf 中被删除。

将虚拟机管理程序调用 ID 映射到名称是不存在的

与系统调用类似，虚拟机管理程序调用也仅使用数字标识符进行标识。对于 IBM 的 POWER 虚拟机管理程序，它们是静态定义的。不幸的是，perf 没有提供将虚拟机管理程序调用标识符映射到助记符的方法。 curt 包含一个（硬编码的）函数来做到这一点

hcall_to_name = {
        '0x4':'H_REMOVE',
        '0x8':'H_ENTER',      
        '0xc':'H_READ',      
        '0x10':'H_CLEAR_MOD', 
[…]
}
 
def hcall_name(opcode):
        try:
                return hcall_to_name[hex(opcode)]
        except:
                return str(opcode)

命令字符串作为字节数组

perf 将命令名称和字符串参数存储在 Python 字节数组中。不幸的是，在 Python 中打印字节数组会打印字节数组中的每个字符——即使字符串以 null 结尾也是如此。例如

$ perf record –a –e 'sched:sched_switch' sleep 3
$ perf script –g Python
generated Python script: perf-script.py
$ perf script -s ./perf-script.py
in trace_begin
sched__sched_switch      3 664597.912692243    21223 perf                  prev_comm=perf^@-terminal-^@, prev_pid=21223, prev_prio=120, prev_state=, next_comm=migration/3^@^@^@^@^@, next_pid=23, next_prio=0
[…]

一种解决方案是根据 null 终止截断这些字节数组的长度，在打印之前根据需要进行截断

def null(ba):
        null = ba.find('\x00')
        if null >= 0:
                ba = ba[0:null]
        return ba

def sched__sched_switch(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        common_callchain, prev_comm, prev_pid, prev_prio, prev_state,
        next_comm, next_pid, next_prio, perf_sample_dict):

                print "prev_comm=%s, prev_pid=%d, prev_prio=%d, " \
                "prev_state=%s, next_comm=%s, next_pid=%d, " \
                "next_prio=%d" % \
                (null(prev_comm), prev_pid, prev_prio,
                flag_str("sched__sched_switch", "prev_state", prev_state),
                null(next_comm), next_pid, next_prio)

这样可以很好地清理输出

sched__sched_switch      3 664597.912692243    21223 perf                  prev_comm=perf, prev_pid=21223, prev_prio=120, prev_state=, next_comm=migration/3, next_pid=23, next_prio=0

动态映射，例如 IRQ 号到名称

与系统调用和虚拟机管理程序调用不同，中断号 (IRQ) 由内核按需动态分配，因此不可能存在将 IRQ 号映射到名称的静态表。幸运的是，perf 将名称传递给事件的 irq_handler_entry 例程。这允许脚本创建一个将 IRQ 号映射到名称的字典

irq_to_name = {}
def irq__irq_handler_entry_new(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, irq, name, perf_sample_dict):
        irq_to_name[irq] = name
        event = Event_irq_handler_entry(nsecs(common_secs,common_nsecs), common_cpu, common_pid, common_comm, irq, name, getpid(perf_sample_dict))
        process_event(event)

有点奇怪的是，perf 没有将名称传递给 irq_handler_exit 例程。因此，跟踪可能只看到 IRQ 的 irq_handler_exit，并且必须能够容忍这种情况。在这里，没有将 IRQ 映射到名称，而是将 IRQ 号作为字符串返回

def irq_name(irq):
        if irq in irq_to_name:
                return irq_to_name[irq]
        return str(irq)

任务 0

任务 0 无处不在。它不是真正的任务。它是“空闲”状态的替代品。当 CPU 将要（或来自）“空闲”状态时，它是赋予 sched_switch 事件处理程序的任务 ID。它通常是“中断”被中断的任务。将任务 0 的统计信息视为真实任务来跟踪是没有意义的。目前，curt 忽略任务 0。但是，这会丢失一些信息，例如花费在中断处理中的一些时间。 curt 应该（但目前没有）跟踪任务 0 的有趣（非空闲）时间。

虚假的 sched_migrate_task 事件（相同的 CPU）

极少数情况下，会发生 sched_migrate_task 事件，其中源 CPU 和目标 CPU 是相同的。换句话说，任务未迁移。为了避免人为地夸大迁移计数，必须显式忽略这种情况

class Event_sched_migrate_task (Event):
        def process(self):
[…]
                if self.cpu == self.dest_cpu:
                        return

exec

exec 系统调用的语义是当前进程的映像被完全新的进程映像替换，而进程 ID 不变。这对于基于进程（任务）ID 跟踪进程（实际上是任务）的统计信息来说很尴尬。更改非常重要，以至于每个任务的统计信息都应单独累积，因此需要关闭当前任务的统计信息，并应初始化一组新的统计信息。挑战在于旧任务和新任务都具有相同的进程（任务）ID。 curt 通过用数字后缀标记任务的任务 ID 来解决此问题

class Event_sched_process_exec (Event):
  def process(self):
    global start_timestamp, curr_timestamp
    curr_timestamp = self.timestamp
    if (start_timestamp == 0):
      start_timestamp = curr_timestamp
 
    task = super(Event_sched_process_exec, self).process()
 
    new_task = Task(self.timestamp, self.command, task.mode, self.pid)
    new_task.sched_stat = True
    new_task.syscall = task.syscall
    new_task.syscalls[task.syscall] = Call()
    new_task.syscalls[task.syscall].timestamp = self.timestamp
 
    task.change_mode(curr_timestamp, 'exit')
 
    suffix=0
    while True:
      old_tid = str(self.tid)+"-"+str(suffix)
      if old_tid in tasks:
        suffix += 1
      else:
        break
  
    tasks[old_tid] = tasks[self.tid]
 
    del tasks[self.tid]
 
    tasks[self.tid] = new_task

这将清楚地分隔不同进程映像的统计信息。在下面的示例中，perf 命令（任务“9614-0”）exec 了 exec.sh（任务“9614-1”），而 exec.sh 又 exec 了自身（任务“9614”）

-- [  task] command   cpu      user       sys       irq        hv      busy      idle |  util% moves
    [  9614] execs.sh    4  1.328238  0.485604  0.000000  0.000000  0.000000  2.273230 |  44.4%
    [  9614] execs.sh    7  0.000000  0.201266  0.000000  0.000000  0.000000  0.003466 |  98.3%
    [  9614] execs.sh  ALL  1.328238  0.686870  0.000000  0.000000  0.000000  2.276696 |  47.0%     1

-- [  task] command   cpu      user       sys       irq        hv      busy      idle |  util% moves
    [9614-0] perf        3  0.000000  0.408588  0.000000  0.000000  0.000000  2.298722 |  15.1%
    [9614-0] perf        4  0.059079  0.028269  0.000000  0.000000  0.000000  0.611355 |  12.5%
    [9614-0] perf        5  0.000000  0.067626  0.000000  0.000000  0.000000  0.004702 |  93.5%
    [9614-0] perf      ALL  0.059079  0.504483  0.000000  0.000000  0.000000  2.914779 |  16.2%     2
 
-- [  task] command   cpu      user       sys       irq        hv      busy      idle |  util% moves
    [9614-1] execs.sh    3  1.207972  0.987433  0.000000  0.000000  0.000000  2.435908 |  47.4%
    [9614-1] execs.sh    4  0.000000  0.341152  0.000000  0.000000  0.000000  0.004147 |  98.8%
    [9614-1] execs.sh  ALL  1.207972  1.328585  0.000000  0.000000  0.000000  2.440055 |  51.0%     1

发行版支持

令人惊讶的是，Ubuntu 中目前不支持 perf 的 Python 绑定。有关更多详细信息，请关注传奇故事。

跟踪事件数量的限制

随着 curt 变得越来越复杂，可能需要将越来越多的事件包含在跟踪文件中。 perf 目前每个 CPU 每个事件需要一个文件描述符。当最大打开文件描述符数不是系统上 CPU 数量的很大倍数时，这会成为问题。在具有大量 CPU 的系统上，这很快就会成为问题。例如，默认的最大打开文件描述符数通常为 1,024。具有四个插槽的 IBM POWER8 系统可能每个插槽有 12 个内核，每个内核有 8 个线程 (CPU)。这样的系统有 4 * 12 * 8 = 392 个 CPU。在这种情况下，perf 只能跟踪大约两个事件！一种解决方法是（显着）增加最大打开文件描述符数（如果系统管理员已将硬限制配置得足够高，则为 ulimit –n；或者管理员可以在 /etc/security/limits.conf 中为 nofile 设置更高的限制）。

总结

我希望本文展示了 perf 的强大功能——特别是 perf 启用的 Python 脚本的实用性和灵活性——以执行内核跟踪数据的复杂处理。此外，它还展示了在测试此类技术的边界时可能遇到的一些问题和边缘情况。

请随时下载并使用此处描述的 curt 工具，报告问题，提出改进建议，或在 curt GitHub 页面上贡献您自己的代码。

标签

Python

德克萨斯 Linux 节

Paul Clarke

IBM Linux 技术中心的软件架构师，致力于移植、性能分析和调优工具以及核心系统库。软件架构师、开发人员、调试器、性能分析师、顾问、Linux on POWER 布道者和导师。

更多关于我

评论已关闭。

本作品根据 Creative Commons Attribution-Share Alike 4.0 International License 获得许可。

系统利用率

利用率测量工具

利用率数据

perf 命令

perf stat

perf record、perf report 和 perf annotate

perf top

perf list

perf buildid-cache 和 perf probe

perf script

处理 perf 跟踪数据

perf eBPF

perf 数据文件

perf script

perf 脚本

perf script –gen-script

perf 脚本

利用率

Linux 的 curt 命令

高级设计

高级示例

边缘情况

sys_exit 作为任务的第一个事件

sys_enter 作为任务的最后一个事件

不确定的状态

不可见的任务

curt.py Python 类

Task

Call

CPU

curt.py 事件处理示例

处理边缘情况

Sys_exit 作为第一个事件

Sys_enter 作为最后一个事件

迁移

任务创建

任务退出

输出

障碍和问题

乱序事件

跟踪开始时间戳

笨拙的调用

需要同时注册事件

繁琐的记录步骤

进程 ID 和 perf API 变更

在其他 CPU 上报告的事件

显式迁移（没有 sched_migrate 事件）

将系统调用 ID 映射到名称是特定于架构的

将虚拟机管理程序调用 ID 映射到名称是不存在的

命令字符串作为字节数组

动态映射，例如 IRQ 号到名称

任务 0

虚假的 sched_migrate_task 事件（相同的 CPU）

exec

发行版支持

跟踪事件数量的限制

总结

评论已关闭。

相关内容