How to benchmark a few lines of C programming code? Answers

【Question title】: How to benchmark a few lines of C programming code?
【Posted】: 2020-11-25 11:44:22
【Question description】:

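I recently heard about the idea of branchless programming, and I want to try it to see whether it can improve performance. I have the following C function.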

int square(int num) {
    int result = 0;
    if (num > 10) {
        result += num;
    }
    return result * result;
}

After removing the if branch, I have this:

int square(int num) {
    int result = 0;
    int tmp = num > 10;
    result = result * tmp + num * tmp + result * !tmp;
    return result * result;
}

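Now I want to know whether the branchless version is faster. I searched around and found a tool called hyperfine (https://github.com/sharkdp/hyperfine). So I wrote the following main function and tested both versions of the square function with hyperfine.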

int main() {
    printf("%d\n", square(38));
    return 0;
}

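The problem is that, based on the hyperfine results, I can't determine which version is better. In C programming, how do people usually determine which version of a function is faster?

Below are some of my hyperfine results.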

C:\my_projects\untitled>hyperfine branchless.exe
Benchmark #1: branchless.exe
  Time (mean ± σ):       5.4 ms ±   0.2 ms    [User: 2.2 ms, System: 3.2 ms]
  Range (min … max):     4.9 ms …   6.1 ms    230 runs

C:\my_projects\untitled>hyperfine branch.exe
Benchmark #1: branch.exe
  Time (mean ± σ):       6.1 ms ±   0.7 ms    [User: 2.2 ms, System: 3.7 ms]
  Range (min … max):     5.0 ms …   9.7 ms    225 runs

C:\my_projects\untitled>hyperfine branch.exe
Benchmark #1: branch.exe
  Time (mean ± σ):       5.5 ms ±   0.3 ms    [User: 2.1 ms, System: 3.5 ms]
  Range (min … max):     4.9 ms …   7.0 ms    211 runs

C:\my_projects\untitled>hyperfine branch.exe
Benchmark #1: branch.exe
  Time (mean ± σ):       5.6 ms ±   0.4 ms    [User: 2.0 ms, System: 3.9 ms]
  Range (min … max):     4.8 ms …   7.0 ms    217 runs

  Warning: Command took less than 5 ms to complete. Results might be inaccurate.


C:\my_projects\untitled>hyperfine branch.exe
Benchmark #1: branch.exe
  Time (mean ± σ):       5.7 ms ±   0.3 ms    [User: 1.9 ms, System: 4.0 ms]
  Range (min … max):     5.0 ms …   6.6 ms    220 runs

C:\my_projects\untitled>hyperfine branchless.exe
Benchmark #1: branchless.exe
  Time (mean ± σ):       5.6 ms ±   0.3 ms    [User: 1.9 ms, System: 3.9 ms]
  Range (min … max):     4.8 ms …   6.9 ms    219 runs

C:\my_projects\untitled>hyperfine branchless.exe
Benchmark #1: branchless.exe
  Time (mean ± σ):       5.8 ms ±   0.3 ms    [User: 1.5 ms, System: 4.0 ms]
  Range (min … max):     5.2 ms …   7.3 ms    224 runs

C:\my_projects\untitled>

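【Comments】:

"how does people usually determine which version of a function is faster?" In a simple case like this, look at the generated assembly. Note that you are not benchmarking the code alone; you are benchmarking the combination of compiler + compiler options + code.
Looking at the assembly, you can see how many instructions the two code samples use. For code this minimal, I would assume (depending on the platform you are working on) that each assembler instruction takes one clock cycle.
Secondly, you also seem to be measuring the execution time of printf, which is most likely several orders of magnitude higher than the execution time of your function. That hides the effect of any difference between the functions. It is like trying to see the light of a distant star next to the sun: it is invisible.
Why not simply do int tmp = num > 10; return num * num * tmp;? That is simpler and probably faster. Also, on modern processors conditional jumps are only slow when they are hard to predict. Since num > 10 is always true when computing square(38), the version with the branch should be fast.
@KrisVandermotten: More importantly, they are measuring the total time for a whole process to start up and exit!! That typically includes a page fault or two, plus dynamic linking. printf is probably a significant part of that, especially on Windows in a slow terminal window, but yes, totally insane. Not quite an exact duplicate of the more general Idiomatic way of performance evaluation?, but that one points out several methodology problems, as well as the fundamental flaw in trying to find a simple one-dimensional cost for something this short.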

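Tags: c performance performance-testing benchmarking microbenchmark

【Solution 1】:

How to benchmark a few lines of C programming code?

Compile the code and inspect the assembly generated by the compiler.

It is common to use Godbolt and inspect the generated assembly there. Godbolt link.

A semi-unreliable method is to count the assembly instructions executed. I don't know about Windows - I work on Linux. With gdb, I take the code provided in this question and use the following: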

// 1.c
#if MACRO
int square(int num) {
    int result = 0;
    if (num > 10) {
        result += num;
    }
    return result * result;
}
#else
int square(int num) {
    int result = 0;
    int tmp = num > 10;
    result = result * tmp + num * tmp + result * !tmp;
    return result * result;
}
#endif
// start-stop places for counting assembly instructions
// Adding attribute and a specific asm syntax that is a GNU extension
// So that the compiler will not optimize the functions out
__attribute__((__noinline__)) void begin() { __asm__("nop"); }
__attribute__((__noinline__)) void finish() { __asm__("nop"); }
// trying to use volatile so that compiler 
// wouldn't optimize the function completely out
volatile int arg = 38, res;
int main() {
    begin();
    res = square(arg);
    finish();
}

Then compile and benchmark in bash:

# a short function to count number of instructions executed between "begin" and "finish" functions
$ b() { printf "%s\n" 'set editing off' 'set prompt' 'set confirm off' 'set pagination off' 'b begin' 'r' 'set $count=0' 'while ($pc != finish)' 'stepi' 'set $count=$count+1' 'end' 'printf "The count of instruction between begin and finish is: %d\n", $count' 'q' | gdb "$1" |& grep 'The count'; }

# then compile and measure
$ gcc -D MACRO=0 1.c ; b a.out
The count of instruction between begin and finish is: 34
$ gcc -D MACRO=1 1.c ; b a.out
The count of instruction between begin and finish is: 22

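It looks like on my platform, with the gcc10 compiler, no options and no optimization, the branching version (MACRO=1) executes 12 fewer instructions than the branchless one (MACRO=0). But comparing compiler output without optimization makes no sense. With optimization enabled, the difference is one instruction: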

$ gcc -O -D MACRO=0 1.c ; b a.out
The count of instruction between begin and finish is: 11
$ gcc -O -D MACRO=1 1.c ; b a.out 
The count of instruction between begin and finish is: 10

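Notes:

With your code, square(38) can be optimized away to a no-op.
With your code and hyperfine branchless.exe, you are comparing the execution of printf, i.e. the time it takes to print and flush the output, not the execution time of square().
As stated in that answer, you can use hardware counters when they are available.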

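【Comments】:

Please don't say "12 instructions faster". Say "12 instructions shorter", and leave the inference of "faster" to the reader if they want to make that leap. It is usually true that, absent memory bottlenecks, fewer executed instructions = faster because of front-end bottlenecks, but performance at this small scale really has 3 dimensions: front-end uops, latency from input to output, and which back-end execution ports it needs. Most instructions are 1 uop and similar in speed, but exceptions include div, and imul has 3-cycle latency.
For more details see What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?. Also How many CPU cycles are needed for each assembly instruction? - as I said, that question has no answer. You can't assign a number to each instruction and add up the costs; superscalar out-of-order execution doesn't work that way. You have to work out which of the front-end, latency, or a single execution port is the bottleneck for a given loop.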

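【Solution 2】:

As said in the comments, the execution time of printf is greater than the time you want to measure, and independently of that, the time you are trying to measure is too small.

To make a measurement you have to put square in one file and the calls to it, in a loop, in another file. Also, don't use a literal, otherwise the generated code can simply be the result and nothing more (never underestimate the optimizations a compiler can make when it knows everything, as with C++ constexpr).

For example:

File c1.c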

int square(int num) {
    int result = 0;
    if (num > 10) {
        result += num;
    }
    return result * result;
}

File c2.c

int square(int num) {
    int result = 0;
    int tmp = num > 10;
    result = result * tmp + num * tmp + result * !tmp;
    return result * result;
}

File main.c

#include <stdio.h>

extern int square(int);

int main(int argc, char ** argv)
{
  int n, v, r = 0;
  
  if ((argc == 3) && 
      (sscanf(argv[1], "%d", &n) == 1) &&
      (sscanf(argv[2], "%d", &v) == 1))
    while (n--)
      r += square(v);
  return r;
}

Using the first solution (not optimized):

/tmp % gcc c1.c main.c 
/tmp % time ./a.out 1000000000 38
2.315u 0.000s 0:02.41 95.8% 0+0k 0+0io 0pf+0w
/tmp % time ./a.out 1000000000 38
2.316u 0.000s 0:02.41 95.8% 0+0k 0+0io 0pf+0w
/tmp % time ./a.out 1000000000 38
2.316u 0.000s 0:02.41 95.8% 0+0k 0+0io 0pf+0w
/tmp %

Using the second solution (not optimized):

/tmp % gcc c2.c main.c 
/tmp % time ./a.out 1000000000 38
3.087u 0.000s 0:03.21 95.9% 0+0k 0+0io 0pf+0w
/tmp % time ./a.out 1000000000 38
3.107u 0.000s 0:03.23 95.9% 0+0k 0+0io 0pf+0w
/tmp % time ./a.out 1000000000 38
3.098u 0.000s 0:03.22 95.9% 0+0k 0+0io 0pf+0w
/tmp  %

So without optimization, the second proposal takes more time. When compiling with optimization, the difference between the two is almost zero:

/tmp % gcc -O2 c1.c main.c
/tmp % time ./a.out 1000000000 38
1.337u 0.000s 0:01.39 95.6% 0+0k 0+0io 0pf+0w
/tmp % time ./a.out 1000000000 38
1.336u 0.001s 0:01.39 95.6% 0+0k 0+0io 0pf+0w
/tmp % time ./a.out 1000000000 38
1.343u 0.000s 0:01.39 96.4% 0+0k 0+0io 0pf+0w
/tmp % 
/tmp % 
/tmp % gcc -O2 c2.c main.c
/tmp % time ./a.out 1000000000 38
1.341u 0.000s 0:01.39 96.4% 0+0k 0+0io 0pf+0w
/tmp % time ./a.out 1000000000 38
1.343u 0.000s 0:01.40 95.7% 0+0k 0+0io 0pf+0w
/tmp % time ./a.out 1000000000 38
1.339u 0.000s 0:01.39 95.6% 0+0k 0+0io 0pf+0w
/tmp %

I did this under Linux, but you can do the same under Windows with your tools.

The generated optimized code is as follows:

First way:

square:
.LFB0:
    .cfi_startproc
    movl    %edi, %edx
    xorl    %eax, %eax
    imull   %edi, %edx
    cmpl    $11, %edi
    cmovge  %edx, %eax
    ret

Second way:

square:
.LFB0:
    .cfi_startproc
    xorl    %eax, %eax
    cmpl    $10, %edi
    setg    %al
    imull   %edi, %eax
    imull   %eax, %eax
    ret
    .cfi_endproc

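【Comments】:

Measuring performance without optimization is pointless.
@P__J__ Of course. I do show the difference with and without, and show that optimization makes the results depend less on the differences in the source.
The key point you could have made in your answer is that GCC optimizes the if into branchless asm using cmov, which is more efficient than setg+imul. This is called "if-conversion". It is most reliable when you use the ternary operator in the C source.
@PeterCordes Sure, that is interesting, but completely off topic. May I remind you that the question is how to benchmark short code?
This code is so short that its execution will overlap with the surrounding code, or even be optimized into the surrounding code at compile time, depending on what else that code does with the same variables. There is no simple answer for that part; see my comments on the other answer and on the question. Calling it repeatedly as a non-inline function with the same input measures its throughput, but not its latency. In your case, the overall throughput bottleneck includes call/ret overhead, which is not negligible compared to how short the function is.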