Clang gives very different performance for expressions that intuitively should be equivalent

【Question title】: Clang giving very different performance for expressions which intuitively should be equivalent
【Posted】: 2014-05-14 04:41:31
【Question】:

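Can anyone explain the fairly large performance differences between these expressions, which I would expect to perform similarly? I am compiling in release mode with Apple LLVM version 5.1 (clang-503.0.38) (based on LLVM 3.4svn).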

Here is my test code (just change CASE to 1, 2, 3 or 4 to test it yourself):

#include <iostream>
#include <chrono>

#define CASE 1

inline int foo(int n) {
    return
#if CASE == 1
    (n % 2) ? 9 : 6

#elif CASE == 2
    (n % 2) == true ? 9 : 6

#elif CASE == 3
    6 + (n % 2) * 3

#elif CASE == 4
    6 + bool(n % 2) * 3

#endif
    ;
}

int main(int argc, const char* argv[])
{
    std::chrono::time_point<std::chrono::system_clock> start, end;
    start = std::chrono::system_clock::now();

    int n = argc;
    for (int i = 0; i < 100000000; ++i) {
        n += foo(n);
    }

    end = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;

    std::cout << "elapsed time: " << elapsed_seconds.count() << "\n";
    std::cout << "value: " << n << "\n";

    return 0;
}

Here are the timings I get:

CASE   EXPRESSION                TIME (s)
1      (n % 2) ? 9 : 6           0.1585
2      (n % 2) == true ? 9 : 6   0.3491
3      6 + (n % 2) * 3           0.2559
4      6 + bool(n % 2) * 3       0.1906

The assembly for CASE 1 and CASE 2 differs as follows:

CASE 1:

Ltmp12:
LBB0_1:                                 ## =>This Inner Loop Header: Depth=1
    ##DEBUG_VALUE: main:argv <- RSI
    ##DEBUG_VALUE: i <- 0
    .loc    1 24 0                  ## /Test/main.cpp:24:0
    movl    %ebx, %ecx
    andl    $1, %ecx
    leal    (%rcx,%rcx,2), %ecx
Ltmp13:
    .loc    1 48 14                 ## /Test/main.cpp:48:14
    leal    6(%rbx,%rcx), %ebx

CASE 2:

Ltmp12:
LBB0_1:                                 ## =>This Inner Loop Header: Depth=1
    ##DEBUG_VALUE: main:argv <- RSI
    ##DEBUG_VALUE: i <- 0
    .loc    1 24 0                  ## /Test/main.cpp:24:0
    movl    %ebx, %ecx
    shrl    $31, %ecx
    addl    %ebx, %ecx
    andl    $-2, %ecx
    movl    %ebx, %edx
    subl    %ecx, %edx
    cmpl    $1, %edx
    sete    %cl
    movzbl  %cl, %ecx
    leal    (%rcx,%rcx,2), %ecx
Ltmp13:
    .loc    1 48 14                 ## /Test/main.cpp:48:14
    leal    6(%rbx,%rcx), %ebx

And this is how the assembly for CASE 3 and CASE 4 differs:

CASE 3:

Ltmp12:
LBB0_1:                                 ## =>This Inner Loop Header: Depth=1
    ##DEBUG_VALUE: main:argv <- RSI
    ##DEBUG_VALUE: i <- 0
    .loc    1 24 0                  ## /Test/main.cpp:24:0
    movl    %ebx, %ecx
    shrl    $31, %ecx
    addl    %ebx, %ecx
    andl    $-2, %ecx
    movl    %ebx, %edx
    subl    %ecx, %edx
    leal    (%rdx,%rdx,2), %ecx
Ltmp13:
    .loc    1 48 14                 ## /Test/main.cpp:48:14
    leal    6(%rbx,%rcx), %ebx

CASE 4:

Ltmp12:
LBB0_1:                                 ## =>This Inner Loop Header: Depth=1
    ##DEBUG_VALUE: main:argv <- RSI
    ##DEBUG_VALUE: i <- 0
    .loc    1 24 0                  ## /Test/main.cpp:24:0
    movl    %ebx, %ecx
    andl    $1, %ecx
    negl    %ecx
    andl    $3, %ecx
Ltmp13:
    .loc    1 48 14                 ## /Test/main.cpp:48:14
    leal    6(%rbx,%rcx), %ebx

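【Comments】:

What are your compile flags? Without knowing which optimizations were applied, this is a meaningless comparison. So far, it should be obvious that different numbers of assembly instructions give different execution times.
I'm using Xcode, so I can't easily see all the flags, but one important flag I can see is -Os
What does it look like with -O2? Otherwise, it looks like the optimizer is failing here.
With clang++ and g++ on coliru (-O3), (n % 2) != false ? 9 : 6 is noticeably faster than (n % 2) == true ? 9 : 6
@dyp That's interesting! Looks like a corner case in the optimizer.

Tags: c++ performance clang performance-testing clang++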

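【Solution 1】:

This answer currently covers only the difference between the first two cases.

What are the possible values of (n % 2)? Surely 0 and 1, right?

Wrong. They are 0, 1 and -1, because n is a signed integer and the result of % can be negative.

(n % 2) ? 9 : 6 implicitly converts the expression n % 2 to bool. The result of this conversion is true if and only if the value is nonzero. So the conversion is equivalent to (n % 2) != 0.

In (n % 2) == true ? 9 : 6, the usual arithmetic conversions are applied to both sides of the comparison (n % 2) == true (note that bool is an arithmetic type). true is promoted to an int with the value 1. So the comparison is equivalent to (n % 2) == 1.

The two tests, (n % 2) != 0 and (n % 2) == 1, produce different results for negative n: let n = -1. Then n % 2 == -1, and -1 != 0 is true, but -1 == 1 is false.

Therefore, the compiler has to introduce some extra complexity to handle the sign.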

If you make n an unsigned integer, or eliminate the sign issue in any other way (e.g. by comparing n % 2 != false), the difference in run time disappears.

I got this idea by looking at the assembly output, especially the following line:

shrl    $31, %eax

Using the highest bit made no sense to me at first, until I realized that the highest bit is the sign bit.

【Comments】:

With unsigned int there is also no difference between cases 3 and 4. Both are, however, slower than cases 1 and 2.
OK, that makes sense!