Why does this C++ function produce so many branch mispredictions? (Answers)

【Question title】: Why does this C++ function produce so many branch mispredictions?
【Posted】: 2017-01-23 15:40:17
【Question】:

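Let A be an array containing an odd number of zeros and ones. If n is the size of A, then A is constructed such that the first ceil(n/2) elements are 0 and the remaining elements are 1.

So if n = 9, A would look like this: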

0,0,0,0,0,1,1,1,1

Our goal is to find the sum of the 1s in the array, and we do this with the following function:

s = 0;
void test1(int curIndex){
    //A is 0,0,0,...,0,1,1,1,1,1...,1

    if(curIndex == ceil(n/2)) return;

    if(A[curIndex] == 1) return;

    test1(curIndex+1);
    test1(size-curIndex-1);

    s += A[curIndex+1] + A[size-curIndex-1];

}

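The function is rather silly for the given problem, but it is a simulation of a different function that I want to look like this and that produces the same number of branch mispredictions.

Here is the whole code of the experiment: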

#include <iostream>
#include <cstdlib>  // for atoi

using namespace std;


int size;
int *A;
int half;
int s;

void test1(int curIndex){
    //A is 0,0,0,...,0,1,1,1,1,1...,1

    if(curIndex == half) return;
    if(A[curIndex] == 1) return;

    test1(curIndex+1);
    test1(size - curIndex - 1);

    s += A[curIndex+1] + A[size-curIndex-1];

}


int main(int argc, char* argv[]){

    if(argc!=2){
        cout<<"type ./executable size{odd integer}"<<endl;
        return 1;
    }
    size = atoi(argv[1]);  // read argv[1] only after argc has been checked
    if(size%2!=1){
        cout<<"size must be an odd number"<<endl;
        return 1;
    }
    A = new int[size];

    half = size/2;
    int i;
    for(i=0;i<=half;i++){
        A[i] = 0;
    }
    for(i=half+1;i<size;i++){
        A[i] = 1;
    }

    for(i=0;i<100;i++) {
        test1(0);
    }
    cout<<s<<endl;

    return 0;
}

Compile with g++ -O3 -std=c++11 file.cpp and run with ./executable size{odd integer}.

I am using an Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz with 8 GB of RAM, L1 cache 256 KB, L2 cache 1 MB, L3 cache 6 MB.

Running perf stat -B -e branches,branch-misses ./cachetests 111111 gives me the following:

   Performance counter stats for './cachetests 111111':

    32,639,932      branches                                                    
     1,404,836      branch-misses             #    4.30% of all branches        

   0.060349641 seconds time elapsed

If I remove the line

s += A[curIndex+1] + A[size-curIndex-1];

I get the following output from perf:

  Performance counter stats for './cachetests 111111':

    24,079,109      branches                                                    
        39,078      branch-misses             #    0.16% of all branches        

   0.027679521 seconds time elapsed

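What does that line have to do with branch prediction when it is not even an if statement?

The way I see it, in the first ceil(n/2) - 1 calls of test1(), both if statements will be false. In the ceil(n/2)-th call, if(curIndex == ceil(n/2)) will be true. In the remaining n-ceil(n/2) calls, the first statement will be false and the second statement will be true.

Why does Intel fail to predict such simple behavior?

Now let's look at a second case. Suppose that A now has alternating zeros and ones, always starting from 0. So if n = 9, A will look like this: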

0,1,0,1,0,1,0,1,0

The function we are going to use is the following:

void test2(int curIndex){
    //A is 0,1,0,1,0,1,0,1,....
    if(curIndex == size-1) return;
    if(A[curIndex] == 1) return;

    test2(curIndex+1);
    test2(curIndex+2);

    s += A[curIndex+1] + A[curIndex+2];

}

And here is the whole code of the experiment:

#include <iostream>
#include <cstdlib>  // for atoi

using namespace std;


int size;
int *A;
int s;

void test2(int curIndex){
    //A is 0,1,0,1,0,1,0,1,....
    if(curIndex == size-1) return;
    if(A[curIndex] == 1) return;

    test2(curIndex+1);
    test2(curIndex+2);

    s += A[curIndex+1] + A[curIndex+2];

}

int main(int argc, char* argv[]){

    if(argc!=2){
        cout<<"type ./executable size{odd integer}"<<endl;
        return 1;
    }
    size = atoi(argv[1]);  // read argv[1] only after argc has been checked
    if(size%2!=1){
        cout<<"size must be an odd number"<<endl;
        return 1;
    }
    A = new int[size];
    int i;
    for(i=0;i<size;i++){
        if(i%2==0){
            A[i] = false;
        }
        else{
            A[i] = true;
        }
    }

    for(i=0;i<100;i++) {
        test2(0);
    }
    cout<<s<<endl;

    return 0;
}

I ran perf using the same commands as before:

    Performance counter stats for './cachetests2 111111':

    28,560,183      branches                                                    
        54,204      branch-misses             #    0.19% of all branches        

   0.037134196 seconds time elapsed

Removing that line again improved things a little:

   Performance counter stats for './cachetests2 111111':

    28,419,557      branches                                                    
        16,636      branch-misses             #    0.06% of all branches        

   0.009977772 seconds time elapsed

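Now if we analyze the function, if(curIndex == size-1) will be false n-1 times, and if(A[curIndex] == 1) will alternate between true and false.

As I see it, both functions should be easy to predict, yet this is not the case for the first one. At the same time, I am not sure what is happening with that line and why it plays a role in improving branch behavior.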

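【Comments】:

Are you sure this is correct? I see the double recursion ending up traversing the array twice.
What does the different assembler code look like?
In the first function, we increment curIndex if curIndex is not pointing to the last 0 and is also not pointing to a 1. If the array is indexed from 0, the second-to-last 0 will be at position (floor(n/2) - 1), and the highest jump we make will be towards n-(floor(n/2) - 1)-1 = n - floor(n/2), which should point to the element after the last 0. If we are at position 0, we will jump to (n-0-1), which points to the last element in the array. As for the second function, we do the same: when we reach the last 0, the index will be equal to n-1, so we will stop.
@jsguy It's a pity nobody has answered yet. I would recommend adding the performance tag, which is followed by many people and could therefore attract some who missed this question. I have already proposed this edit myself, but it was rejected. I don't want to submit it again, so I leave it here as a suggestion for you. Your call.
Did you look at it with cachegrind? (valgrind.org/docs/manual/cg-manual.html)

Tags: c++ performance branch-prediction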

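【Solution 1】:

Here are my thoughts after staring at this for a while. First of all, the issue is easily reproducible with -O2, so it is better to use that as a reference, because it generates simple non-unrolled code that is easy to analyze. The problem with -O3 is essentially the same; it is just a bit less obvious.

So, for the first case (the half-zeros, half-ones pattern) the compiler generates this code: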

 0000000000400a80 <_Z5test1i>:
   400a80:       55                      push   %rbp
   400a81:       53                      push   %rbx
   400a82:       89 fb                   mov    %edi,%ebx
   400a84:       48 83 ec 08             sub    $0x8,%rsp
   400a88:       3b 3d 0e 07 20 00       cmp    0x20070e(%rip),%edi        #
   60119c <half>
   400a8e:       74 4f                   je     400adf <_Z5test1i+0x5f>
   400a90:       48 8b 15 09 07 20 00    mov    0x200709(%rip),%rdx        #
   6011a0 <A>
   400a97:       48 63 c7                movslq %edi,%rax
   400a9a:       48 8d 2c 85 00 00 00    lea    0x0(,%rax,4),%rbp
   400aa1:       00 
   400aa2:       83 3c 82 01             cmpl   $0x1,(%rdx,%rax,4)
   400aa6:       74 37                   je     400adf <_Z5test1i+0x5f>
   400aa8:       8d 7f 01                lea    0x1(%rdi),%edi
   400aab:       e8 d0 ff ff ff          callq  400a80 <_Z5test1i>
   400ab0:       89 df                   mov    %ebx,%edi
   400ab2:       f7 d7                   not    %edi
   400ab4:       03 3d ee 06 20 00       add    0x2006ee(%rip),%edi        #
   6011a8 <size>
   400aba:       e8 c1 ff ff ff          callq  400a80 <_Z5test1i>
   400abf:       8b 05 e3 06 20 00       mov    0x2006e3(%rip),%eax        #
   6011a8 <size>
   400ac5:       48 8b 15 d4 06 20 00    mov    0x2006d4(%rip),%rdx        #
   6011a0 <A>
   400acc:       29 d8                   sub    %ebx,%eax
   400ace:       48 63 c8                movslq %eax,%rcx
   400ad1:       8b 44 2a 04             mov    0x4(%rdx,%rbp,1),%eax
   400ad5:       03 44 8a fc             add    -0x4(%rdx,%rcx,4),%eax
   400ad9:       01 05 b9 06 20 00       add    %eax,0x2006b9(%rip)        #
   601198 <s>
   400adf:       48 83 c4 08             add    $0x8,%rsp
   400ae3:       5b                      pop    %rbx
   400ae4:       5d                      pop    %rbp
   400ae5:       c3                      retq   
   400ae6:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
   400aed:       00 00 00

Very simple, pretty much what you would expect: two conditional branches, two calls. It gives us these (or similar) statistics on a Core 2 Duo T6570, an AMD Phenom II X4 925 and a Core i7-4770:

$ perf stat -B -e branches,branch-misses ./a.out 111111
5555500

 Performance counter stats for './a.out 111111':

        45,216,754      branches                                                    
         5,588,484      branch-misses             #   12.36% of all branches        

       0.098535791 seconds time elapsed

If you make this change, moving the assignment before the recursive calls:

 --- file.cpp.orig  2016-09-22 22:59:20.744678438 +0300
 +++ file.cpp   2016-09-22 22:59:36.492583925 +0300
 @@ -15,10 +15,10 @@
      if(curIndex == half) return;
      if(A[curIndex] == 1) return;

 +    s += A[curIndex+1] + A[size-curIndex-1];
      test1(curIndex+1);
      test1(size - curIndex - 1);

 -    s += A[curIndex+1] + A[size-curIndex-1];

  }

the picture changes:

 $ perf stat -B -e branches,branch-misses ./a.out 111111
 5555500

  Performance counter stats for './a.out 111111':

         39,495,804      branches                                                    
             54,430      branch-misses             #    0.14% of all branches        

        0.039522259 seconds time elapsed

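And yes, as was already noted, this is directly related to tail-recursion optimization, because if you compile the patched code with -fno-optimize-sibling-calls you get the same "bad" results. So let's look at what the assembly looks like with tail-call optimization: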

 0000000000400a80 <_Z5test1i>:
   400a80:       3b 3d 16 07 20 00       cmp    0x200716(%rip),%edi        #
   60119c <half>
   400a86:       53                      push   %rbx
   400a87:       89 fb                   mov    %edi,%ebx
   400a89:       74 5f                   je     400aea <_Z5test1i+0x6a>
   400a8b:       48 8b 05 0e 07 20 00    mov    0x20070e(%rip),%rax        #
   6011a0 <A>
   400a92:       48 63 d7                movslq %edi,%rdx
   400a95:       83 3c 90 01             cmpl   $0x1,(%rax,%rdx,4)
   400a99:       74 4f                   je     400aea <_Z5test1i+0x6a>
   400a9b:       8b 0d 07 07 20 00       mov    0x200707(%rip),%ecx        #
   6011a8 <size>
   400aa1:       eb 15                   jmp    400ab8 <_Z5test1i+0x38>
   400aa3:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
   400aa8:       48 8b 05 f1 06 20 00    mov    0x2006f1(%rip),%rax        #
   6011a0 <A>
   400aaf:       48 63 d3                movslq %ebx,%rdx
   400ab2:       83 3c 90 01             cmpl   $0x1,(%rax,%rdx,4)
   400ab6:       74 32                   je     400aea <_Z5test1i+0x6a>
   400ab8:       29 d9                   sub    %ebx,%ecx
   400aba:       8d 7b 01                lea    0x1(%rbx),%edi
   400abd:       8b 54 90 04             mov    0x4(%rax,%rdx,4),%edx
   400ac1:       48 63 c9                movslq %ecx,%rcx
   400ac4:       03 54 88 fc             add    -0x4(%rax,%rcx,4),%edx
   400ac8:       01 15 ca 06 20 00       add    %edx,0x2006ca(%rip)        #
   601198 <s>
   400ace:       e8 ad ff ff ff          callq  400a80 <_Z5test1i>
   400ad3:       8b 0d cf 06 20 00       mov    0x2006cf(%rip),%ecx        #
   6011a8 <size>
   400ad9:       89 c8                   mov    %ecx,%eax
   400adb:       29 d8                   sub    %ebx,%eax
   400add:       89 c3                   mov    %eax,%ebx
   400adf:       83 eb 01                sub    $0x1,%ebx
   400ae2:       39 1d b4 06 20 00       cmp    %ebx,0x2006b4(%rip)        #
   60119c <half>
   400ae8:       75 be                   jne    400aa8 <_Z5test1i+0x28>
   400aea:       5b                      pop    %rbx
   400aeb:       c3                      retq   
   400aec:       0f 1f 40 00             nopl   0x0(%rax)

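It has four conditional branches and one call. So let's analyze the data we have so far.

First of all, what is a branch instruction from the processor's perspective? It is any of call, ret, j* (including a direct jmp) and loop. call and jmp are a bit unintuitive, but they are crucial for counting things correctly.

Overall, we expect this function to be called 11111100 times, once per element, which is roughly 11M. In the non-tail-call-optimized version we see about 45M branches; the initialization in main() accounts for just 111K and everything else is minor, so the main contribution to this number comes from our function. Our function is call-ed, it evaluates the first je, which is not taken in all cases except one (curIndex == half holds only once per pass), then it evaluates the second je, which is taken half of the time, and then it either calls itself recursively (but we have already counted that the function is invoked 11M times) or returns (as it also does after the recursive calls). So that is 4 branch instructions per call across 11M calls, exactly the number we see. Of these, around 5.5M branches are missed, which suggests that the misses all come from one mispredicted instruction: either something that is evaluated 11M times and missed around 50% of the time, or something that is evaluated half of the time and always missed.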
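The call count used in this estimate can be checked with an instrumented copy of test1. The following is only a sketch: the names Counts, test1Counted and countOnePass are made up for illustration, and the array is passed as a parameter instead of living in globals as in the question.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper type: what one pass of test1 does.
struct Counts {
    long calls = 0;  // number of times the function body is entered
    long sum = 0;    // what the original code accumulates in s
};

// Same control flow as test1 from the question, plus a call counter.
void test1Counted(const std::vector<int>& a, int cur, Counts& c) {
    ++c.calls;
    const int size = static_cast<int>(a.size());
    const int half = size / 2;

    if (cur == half) return;   // first conditional branch
    if (a[cur] == 1) return;   // second conditional branch

    test1Counted(a, cur + 1, c);
    test1Counted(a, size - cur - 1, c);

    c.sum += a[cur + 1] + a[size - cur - 1];
}

// Builds the half-zeros, half-ones array and runs a single pass.
Counts countOnePass(int size) {
    assert(size > 0 && size % 2 == 1);
    std::vector<int> a(size, 0);
    for (int i = size / 2 + 1; i < size; ++i) a[i] = 1;

    Counts c;
    test1Counted(a, 0, c);
    return c;
}
```

One pass makes exactly size calls, so size = 111111 and 100 passes give 11,111,100 calls. At roughly 4 branch instructions per call (call, two je, ret) that is about 44.4M, close to the 45.2M reported by perf.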

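What do we have in the tail-call-optimized version? We have a function that is called around 5.5M times, but now each invocation incurs a call, two branches initially (the first is not taken in all cases except one, the second is never taken because of our data), then a jmp, then a call (but we have already counted our 5.5M calls), then a branch at 400ae8 and a branch at 400ab6 (always taken because of our data), then a return. So, on average, that is 4 conditional branches, 1 unconditional jump, 1 call and 1 indirect branch (the return from the function); 5.5M times 7 gives an overall count of around 39M branches, just as we see in the perf output.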
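Read back into C++, the tail-call-optimized assembly behaves roughly like a loop in which only the first recursive call is still a real call. The sketch below is a hand-written approximation of that shape, not compiler output; LoopCounts, test1Loop and countLoopPass are hypothetical names, and the array is a parameter rather than a global.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper type for the instrumented loop version.
struct LoopCounts {
    long calls = 0;  // real call instructions executed
    long sum = 0;    // value accumulated in s by the patched code
};

// Patched test1 (s += moved before the calls) after sibling-call
// optimization: the second recursive call turns into "update the
// argument and jump back to the top".
void test1Loop(const std::vector<int>& a, int cur, LoopCounts& c) {
    ++c.calls;
    const int size = static_cast<int>(a.size());
    const int half = size / 2;

    while (cur != half && a[cur] != 1) {
        c.sum += a[cur + 1] + a[size - cur - 1];
        test1Loop(a, cur + 1, c);  // still a genuine call
        cur = size - cur - 1;      // formerly test1(size - curIndex - 1)
    }
}

// Builds the half-zeros, half-ones array and runs a single pass.
LoopCounts countLoopPass(int size) {
    assert(size > 0 && size % 2 == 1);
    std::vector<int> a(size, 0);
    for (int i = size / 2 + 1; i < size; ++i) a[i] = 1;

    LoopCounts c;
    test1Loop(a, 0, c);
    return c;
}
```

One pass now enters the function only size/2 + 1 times, which for size = 111111 is 55,556 calls per pass, or about 5.56M over 100 passes. Multiplying by the 7 branch instructions counted above gives about 38.9M, in line with the 39.5M measured.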

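What we know is that the processor has no problem at all predicting things in a flow with one function call (even though this version has more conditional branches), and that it does have problems with two function calls. This suggests that the problem lies in the returns from the function.

Unfortunately, we know very little about the details of how exactly the branch predictors of our modern processors work. The best analysis that I could find suggests that the processors have a return stack buffer of around 16 entries. If we go back to our data with this finding at hand, things start to clear up a bit.

When you have the half-zeros, half-ones pattern, you recurse very deeply into test1(curIndex+1), but then you start returning and calling test1(size-curIndex-1). That recursion is never deeper than one call, so its returns are predicted perfectly. But remember that we are now 55555 invocations deep and the processor only remembers the last 16, so it is not surprising that it cannot guess our returns starting from the 55539-level depth. It is more surprising that it can do so with the tail-call-optimized version.

Actually, the behaviour of the tail-call-optimized version suggests that, lacking any other information about returns, the processor just assumes that the right one is the last one seen. This is also supported by the behaviour of the non-tail-call-optimized version: it goes 55555 calls deep into test1(curIndex+1), and then on the way back it always goes one level deep into test1(size-curIndex-1). So when we have come up from 55555-deep to 55539-deep (or whatever your processor's return buffer size is), it calls into test1(size-curIndex-1), returns from that, and has absolutely no information about the next return. It therefore assumes that we are about to return to the last seen address (which is the return address of test1(size-curIndex-1)), and that is obviously wrong. 55539 times wrong. With 100 cycles of the function, that is exactly the 5.5M branch prediction misses we see.

Now let's get to your alternating pattern and the code for it. This code is actually very different if you analyze how it goes into depth. Here your test2(curIndex+1) always returns immediately and your test2(curIndex+2) always goes deeper. So the returns from test2(curIndex+1) are always predicted perfectly (they just don't go deep enough), and when we finish our recursion into test2(curIndex+2), it always returns to the same point, all 55555 times, so the processor has no problem with that.

This can be further demonstrated by this small change to your original half-zeros, half-ones code: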

--- file.cpp.orig       2016-09-23 11:00:26.917977032 +0300
+++ file.cpp    2016-09-23 11:00:31.946027451 +0300
@@ -15,8 +15,8 @@
   if(curIndex == half) return;
   if(A[curIndex] == 1) return;

-  test1(curIndex+1);
   test1(size - curIndex - 1);
+  test1(curIndex+1);

   s += A[curIndex+1] + A[size-curIndex-1];

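So now the generated code is still not tail-call optimized (assembly-wise it is very similar to the original), but you get something like this in the perf output: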

$ perf stat -B -e branches,branch-misses ./a.out 111111 
5555500

 Performance counter stats for './a.out 111111':

        45 308 579      branches                                                    
            75 927      branch-misses             #    0,17% of all branches        

       0,026271402 seconds time elapsed

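As expected, our first call now always returns immediately, and the second call goes 55555 deep and then only ever returns to the same point.

Now that this is solved, let me show what I have up my sleeve. On one system, namely a Core i5-5200U, the non-tail-call-optimized original half-zeros, half-ones version shows these results: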
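The return-prediction argument above can be turned into a toy model. This is a sketch under stated assumptions, not a description of any real CPU: it assumes a 16-entry return stack that drops its oldest entry on overflow and, once empty, falls back to the most recently seen return target (the hypothesis made in this answer). The names Site, ReturnPredictor, replay and missesPerPass are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical labels for the three places a return can land.
enum class Site { Main, AfterFirstCall, AfterSecondCall };

// Toy return predictor: bounded stack plus a "last target seen" fallback.
class ReturnPredictor {
public:
    explicit ReturnPredictor(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // A call pushes its return site; when full, the oldest entry is lost.
    void onCall(Site returnSite) {
        if (stack_.size() == capacity_) stack_.pop_front();
        stack_.push_back(returnSite);
    }

    // Predict from the top of the stack, or from the last seen target
    // if the stack has run dry (assumed fallback, see lead-in).
    void onReturn(Site actual) {
        Site predicted = lastSeen_;
        if (!stack_.empty()) {
            predicted = stack_.back();
            stack_.pop_back();
        }
        if (predicted != actual) ++misses_;
        lastSeen_ = actual;
    }

    long misses() const { return misses_; }

private:
    std::size_t capacity_;
    std::deque<Site> stack_;
    Site lastSeen_ = Site::Main;
    long misses_ = 0;
};

// Replays the call/return pattern of test1; s is irrelevant here.
void replay(const std::vector<int>& a, int cur, bool deepCallFirst,
            ReturnPredictor& rsb) {
    const int size = static_cast<int>(a.size());
    const int half = size / 2;
    if (cur == half || a[cur] == 1) return;

    const int deep = cur + 1;            // test1(curIndex+1)
    const int shallow = size - cur - 1;  // test1(size-curIndex-1)

    rsb.onCall(Site::AfterFirstCall);
    replay(a, deepCallFirst ? deep : shallow, deepCallFirst, rsb);
    rsb.onReturn(Site::AfterFirstCall);

    rsb.onCall(Site::AfterSecondCall);
    replay(a, deepCallFirst ? shallow : deep, deepCallFirst, rsb);
    rsb.onReturn(Site::AfterSecondCall);
}

// Mispredicted returns in one pass over a half-zeros, half-ones array.
long missesPerPass(int size, bool deepCallFirst, std::size_t capacity = 16) {
    assert(size > 0 && size % 2 == 1);
    std::vector<int> a(size, 0);
    for (int i = size / 2 + 1; i < size; ++i) a[i] = 1;

    ReturnPredictor rsb(capacity);
    rsb.onCall(Site::Main);
    replay(a, 0, deepCallFirst, rsb);
    rsb.onReturn(Site::Main);
    return rsb.misses();
}
```

For size = 101 the model reports 35 mispredicted returns per pass with the original call order (deepCallFirst = true) and only 1 with the calls swapped. In the original order the count grows as size/2 - 15, so a depth of 55555 gives about 55.5K misses per pass and roughly 5.5M over 100 passes, the same order of magnitude as the perf numbers. The model says nothing about why the i5-5200U behaves differently.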

 $ perf stat -B -e branches,branch-misses ./a.out 111111
 5555500

  Performance counter stats for './a.out 111111':

         45 331 670      branches                                                    
             16 349      branch-misses             #    0,04% of all branches        

        0,043351547 seconds time elapsed

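So, apparently, Broadwell can handle this pattern easily, which brings us back to the question of how much we really know about the branch prediction logic of our modern processors.

【Discussion】:

I guess I was wrong with my answer. Since I am using an i5-6400, I see the same thing that happens in your test case with Broadwell. GJ, nice answer.
As a side note, I stumbled upon this document: agner.org/optimize/microarchitecture.pdf IMHO, a must read.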

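【Solution 2】:

The following piece of code is tail-recursive: the last line of the function does not require a call, just a branch to the point where the function begins using the first argument: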

void f(int i) {
    if (i == size) return;  // 'break' is not valid outside a loop or switch
    s += a[i];
    f(i + 1);
}

However, if we break this and make it non-tail-recursive:

void f(int i) {
    if (i == size) return;  // 'break' is not valid outside a loop or switch
    f(i + 1);
    s += a[i];
}

There are several reasons why the compiler cannot deduce that the latter is tail-recursive, but in the example you have given,

test(A[N]);
test(A[M]);
s += a[N] + a[M];

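the same rules apply. The compiler cannot determine that this is tail-recursive, and even more so it cannot do it because of the two calls (see before and after).

What you appear to be expecting the compiler to produce from this is a function that performs a couple of simple conditional branches, two calls and some loads/adds/stores.

Instead, the compiler is unrolling this loop and generating code that has a lot of branch points. This is done partly because the compiler believes it will be more efficient this way (involving fewer branches), and partly because it decreases the runtime recursion depth.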

int size;
int* A;
int half;
int s;

void test1(int curIndex){
  if(curIndex == half || A[curIndex] == 1) return;
  test1(curIndex+1);
  test1(size-curIndex-1);
  s += A[curIndex+1] + A[size-curIndex-1];
}

produces:

test1(int):
        movl    half(%rip), %edx
        cmpl    %edi, %edx
        je      .L36
        pushq   %r15
        pushq   %r14
        movslq  %edi, %rcx
        pushq   %r13
        pushq   %r12
        leaq    0(,%rcx,4), %r12
        pushq   %rbp
        pushq   %rbx
        subq    $24, %rsp
        movq    A(%rip), %rax
        cmpl    $1, (%rax,%rcx,4)
        je      .L1
        leal    1(%rdi), %r13d
        movl    %edi, %ebp
        cmpl    %r13d, %edx
        je      .L42
        cmpl    $1, 4(%rax,%r12)
        je      .L42
        leal    2(%rdi), %ebx
        cmpl    %ebx, %edx
        je      .L39
        cmpl    $1, 8(%rax,%r12)
        je      .L39
        leal    3(%rdi), %r14d
        cmpl    %r14d, %edx
        je      .L37
        cmpl    $1, 12(%rax,%r12)
        je      .L37
        leal    4(%rdi), %edi
        call    test1(int)
        movl    %r14d, %edi
        notl    %edi
        addl    size(%rip), %edi
        call    test1(int)
        movl    size(%rip), %ecx
        movq    A(%rip), %rax
        movl    %ecx, %esi
        movl    16(%rax,%r12), %edx
        subl    %r14d, %esi
        movslq  %esi, %rsi
        addl    -4(%rax,%rsi,4), %edx
        addl    %edx, s(%rip)
        movl    half(%rip), %edx
.L10:
        movl    %ecx, %edi
        subl    %ebx, %edi
        leal    -1(%rdi), %r14d
        cmpl    %edx, %r14d
        je      .L38
        movslq  %r14d, %rsi
        cmpl    $1, (%rax,%rsi,4)
        leaq    0(,%rsi,4), %r15
        je      .L38
        call    test1(int)
        movl    %r14d, %edi
        notl    %edi
        addl    size(%rip), %edi
        call    test1(int)
        movl    size(%rip), %ecx
        movq    A(%rip), %rax
        movl    %ecx, %edx
        movl    4(%rax,%r15), %esi
        movl    %ecx, %edi
        subl    %r14d, %edx
        subl    %ebx, %edi
        movslq  %edx, %rdx
        addl    -4(%rax,%rdx,4), %esi
        movl    half(%rip), %edx
        addl    s(%rip), %esi
        movl    %esi, s(%rip)
.L13:
        movslq  %edi, %rdi
        movl    12(%rax,%r12), %r8d
        addl    -4(%rax,%rdi,4), %r8d
        addl    %r8d, %esi
        movl    %esi, s(%rip)
.L7:
        movl    %ecx, %ebx
        subl    %r13d, %ebx
        leal    -1(%rbx), %r14d
        cmpl    %edx, %r14d
        je      .L41
        movslq  %r14d, %rsi
        cmpl    $1, (%rax,%rsi,4)
        leaq    0(,%rsi,4), %r15
        je      .L41
        cmpl    %edx, %ebx
        je      .L18
        movslq  %ebx, %rsi
        cmpl    $1, (%rax,%rsi,4)
        leaq    0(,%rsi,4), %r8
        movq    %r8, (%rsp)
        je      .L18
        leal    1(%rbx), %edi
        call    test1(int)
        movl    %ebx, %edi
        notl    %edi
        addl    size(%rip), %edi
        call    test1(int)
        movl    size(%rip), %ecx
        movq    A(%rip), %rax
        movq    (%rsp), %r8
        movl    %ecx, %esi
        subl    %ebx, %esi
        movl    4(%rax,%r8), %edx
        movslq  %esi, %rsi
        addl    -4(%rax,%rsi,4), %edx
        addl    %edx, s(%rip)
        movl    half(%rip), %edx
.L18:
        movl    %ecx, %edi
        subl    %r14d, %edi
        leal    -1(%rdi), %ebx
        cmpl    %edx, %ebx
        je      .L40
        movslq  %ebx, %rsi
        cmpl    $1, (%rax,%rsi,4)
        leaq    0(,%rsi,4), %r8
        je      .L40
        movq    %r8, (%rsp)
        call    test1(int)
        movl    %ebx, %edi
        notl    %edi
        addl    size(%rip), %edi
        call    test1(int)
        movl    size(%rip), %ecx
        movq    A(%rip), %rax
        movq    (%rsp), %r8
        movl    %ecx, %edx
        movl    %ecx, %edi
        subl    %ebx, %edx
        movl    4(%rax,%r8), %esi
        subl    %r14d, %edi
        movslq  %edx, %rdx
        addl    -4(%rax,%rdx,4), %esi
        movl    half(%rip), %edx
        addl    s(%rip), %esi
        movl    %esi, %r8d
        movl    %esi, s(%rip)
.L20:
        movslq  %edi, %rdi
        movl    4(%rax,%r15), %esi
        movl    %ecx, %ebx
        addl    -4(%rax,%rdi,4), %esi
        subl    %r13d, %ebx
        addl    %r8d, %esi
        movl    %esi, s(%rip)
.L16:
        movslq  %ebx, %rbx
        movl    8(%rax,%r12), %edi
        addl    -4(%rax,%rbx,4), %edi
        addl    %edi, %esi
        movl    %esi, s(%rip)
        jmp     .L4
.L45:
        movl    s(%rip), %edx
.L23:
        movslq  %ebx, %rbx
        movl    4(%rax,%r12), %ecx
        addl    -4(%rax,%rbx,4), %ecx
        addl    %ecx, %edx
        movl    %edx, s(%rip)
.L1:
        addq    $24, %rsp
        popq    %rbx
        popq    %rbp
        popq    %r12
        popq    %r13
        popq    %r14
        popq    %r15
.L36:
        rep ret
.L42:
        movl    size(%rip), %ecx
.L4:
        movl    %ecx, %ebx
        subl    %ebp, %ebx
        leal    -1(%rbx), %r14d
        cmpl    %edx, %r14d
        je      .L45
        movslq  %r14d, %rsi
        cmpl    $1, (%rax,%rsi,4)
        leaq    0(,%rsi,4), %r15
        je      .L45
        cmpl    %edx, %ebx
        je      .L25
        movslq  %ebx, %rsi
        cmpl    $1, (%rax,%rsi,4)
        leaq    0(,%rsi,4), %r13
        je      .L25
        leal    1(%rbx), %esi
        cmpl    %edx, %esi
        movl    %esi, (%rsp)
        je      .L26
        cmpl    $1, 8(%rax,%r15)
        je      .L26
        leal    2(%rbx), %edi
        call    test1(int)
        movl    (%rsp), %esi
        movl    %esi, %edi
        notl    %edi
        addl    size(%rip), %edi
        call    test1(int)
        movl    size(%rip), %ecx
        movl    (%rsp), %esi
        movq    A(%rip), %rax
        movl    %ecx, %edx
        subl    %esi, %edx
        movslq  %edx, %rsi
        movl    12(%rax,%r15), %edx
        addl    -4(%rax,%rsi,4), %edx
        addl    %edx, s(%rip)
        movl    half(%rip), %edx
.L26:
        movl    %ecx, %edi
        subl    %ebx, %edi
        leal    -1(%rdi), %esi
        cmpl    %edx, %esi
        je      .L43
        movslq  %esi, %r8
        cmpl    $1, (%rax,%r8,4)
        leaq    0(,%r8,4), %r9
        je      .L43
        movq    %r9, 8(%rsp)
        movl    %esi, (%rsp)
        call    test1(int)
        movl    (%rsp), %esi
        movl    %esi, %edi
        notl    %edi
        addl    size(%rip), %edi
        call    test1(int)
        movl    size(%rip), %ecx
        movl    (%rsp), %esi
        movq    A(%rip), %rax
        movq    8(%rsp), %r9
        movl    %ecx, %edx
        movl    %ecx, %edi
        subl    %esi, %edx
        movl    4(%rax,%r9), %esi
        subl    %ebx, %edi
        movslq  %edx, %rdx
        addl    -4(%rax,%rdx,4), %esi
        movl    half(%rip), %edx
        addl    s(%rip), %esi
        movl    %esi, s(%rip)
.L28:
        movslq  %edi, %rdi
        movl    4(%rax,%r13), %r8d
        addl    -4(%rax,%rdi,4), %r8d
        addl    %r8d, %esi
        movl    %esi, s(%rip)
.L25:
        movl    %ecx, %r13d
        subl    %r14d, %r13d
        leal    -1(%r13), %ebx
        cmpl    %edx, %ebx
        je      .L44
        movslq  %ebx, %rdi
        cmpl    $1, (%rax,%rdi,4)
        leaq    0(,%rdi,4), %rsi
        movq    %rsi, (%rsp)
        je      .L44
        cmpl    %edx, %r13d
        je      .L33
        movslq  %r13d, %rdx
        cmpl    $1, (%rax,%rdx,4)
        leaq    0(,%rdx,4), %r8
        movq    %r8, 8(%rsp)
        je      .L33
        leal    1(%r13), %edi
        call    test1(int)
        movl    %r13d, %edi
        notl    %edi
        addl    size(%rip), %edi
        call    test1(int)
        movl    size(%rip), %ecx
        movq    A(%rip), %rdi
        movq    8(%rsp), %r8
        movl    %ecx, %edx
        subl    %r13d, %edx
        movl    4(%rdi,%r8), %eax
        movslq  %edx, %rdx
        addl    -4(%rdi,%rdx,4), %eax
        addl    %eax, s(%rip)
.L33:
        subl    %ebx, %ecx
        leal    -1(%rcx), %edi
        call    test1(int)
        movl    size(%rip), %ecx
        movq    A(%rip), %rax
        movl    %ecx, %esi
        movl    %ecx, %r13d
        subl    %ebx, %esi
        movq    (%rsp), %rbx
        subl    %r14d, %r13d
        movslq  %esi, %rsi
        movl    4(%rax,%rbx), %edx
        addl    -4(%rax,%rsi,4), %edx
        movl    s(%rip), %esi
        addl    %edx, %esi
        movl    %esi, s(%rip)
.L31:
        movslq  %r13d, %r13
        movl    4(%rax,%r15), %edx
        subl    %ebp, %ecx
        addl    -4(%rax,%r13,4), %edx
        movl    %ecx, %ebx
        addl    %esi, %edx
        movl    %edx, s(%rip)
        jmp     .L23
.L44:
        movl    s(%rip), %esi
        jmp     .L31
.L39:
        movl    size(%rip), %ecx
        jmp     .L7
.L41:
        movl    s(%rip), %esi
        jmp     .L16
.L43:
        movl    s(%rip), %esi
        jmp     .L28
.L38:
        movl    s(%rip), %esi
        jmp     .L13
.L37:
        movl    size(%rip), %ecx
        jmp     .L10
.L40:
        movl    s(%rip), %r8d
        jmp     .L20
s:
half:
        .zero   4
A:
        .zero   8
size:
        .zero   4

In the case of the alternating values, assuming size == 7:

test1(curIndex = 0)
{
    if (curIndex == size - 1) return;  // false x1
    if (A[curIndex] == 1) return;  // false x1

    test1(curIndex + 1 => 1) {
        if (curIndex == size - 1) return;  // false x2
        if (A[curIndex] == 1) return;  // false x1 -mispred-> returns
    }

    test1(curIndex + 2 => 2) {
        if (curIndex == size - 1) return; // false x 3
        if (A[curIndex] == 1) return;  // false x2
        test1(curIndex + 1 => 3) {
            if (curIndex == size - 1) return;  // false x3
            if (A[curIndex] == 1) return;  // false x2 -mispred-> returns
        }
        test1(curIndex + 2 => 4) {
            if (curIndex == size - 1) return;  // false x4
            if (A[curIndex] == 1) return; // false x3
            test1(curIndex + 1 => 5) {
                if (curIndex == size - 1) return; // false x5
                if (A[curIndex] == 1) return; // false x3 -mispred-> returns
            }
            test1(curIndex + 2 => 6) {
                if (curIndex == size - 1) return; // false x5 -mispred-> returns
            }
            s += A[5] + A[6];
        }
        s += A[3] + A[4];
    }
    s += A[1] + A[2];
}

Let's imagine a case where

size = 11;
A[11] = { 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0 };

test1(0)
  -> test1(1)
       -> test1(2)
            -> test1(3)  -> returns because 1
            -> test1(4)
                 -> test1(5)
                      -> test1(6)
                           -> test1(7) -- returns because 1
                           -> test1(8)
                                -> test1(9) -- returns because 1
                                -> test1(10) -- returns because size-1
                      -> test1(7) -- returns because 1
                 -> test1(6)
                   -> test1(7)
                   -> test1(8)
                        -> test1(9) -- 1
                        -> test1(10) -- size-1
       -> test1(3)  -> returns
  -> test1(2)
       ... as above

or

size = 5;
A[5] = { 0, 0, 0, 0, 1 };

test1(0)
  -> test1(1)
       -> test1(2)
            -> test1(3)
                 -> test1(4)  --  size
                 -> test1(5)  --  UB
            -> test1(4)
       -> test1(3)
            -> test1(4)  -- size
            -> test1(5)  -- UB
  -> test1(2)
       ..

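The two cases you have singled out (the alternating pattern and the half pattern) are the optimal extremes; the compiler has picked some intermediate case that it will try to handle best.

【Discussion】:

【Solution 3】:

The problem is this: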

if(A[curIndex] == 1) return;

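Because of some optimizations, every call of the test function alternates the result of this comparison, since the array is, for example, 0,0,0,0,0,1,1,1,1

In other words: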

curIndex = 0 -> A[0] = 0
test1(curIndex + 1) -> curIndex = 1 -> A[1] = 0

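However, the processor architecture MIGHT (a big might, because it depends; for me that optimization is disabled, on an i5-6400) have a feature called runahead (performed alongside branch prediction), which executes the instructions remaining in the pipeline before entering a branch; so it would execute test1(size - curIndex -1) before the offending if statement.

When the assignment is removed, a different optimization kicks in, as user1850903 said.

【Discussion】:

【Solution 4】: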

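It is interesting that in the first execution you have about 30% more branches than in the second execution (32M branches vs 24M branches).

I generated the assembly code for your application using gcc 4.8.5 and the same flags (plus -S), and there is a significant difference between the two assemblies. The code with the conflicting statement is about 572 lines long, while the code without that statement is only 409 lines. Focusing on the symbol _Z5test1i (the decorated C++ name for test1), the routine is 367 lines long in the first case and only 202 lines in the second. Of all those lines, the first case contains 36 branches (plus 15 call instructions) and the second case contains 34 branches (plus 1 call instruction).

It is also interesting that compiling the application with -O1 does not expose this difference between the two versions (although the branch misprediction rate is higher, approximately 12%). Using -O2 shows a difference between the two versions (12% vs 3% branch mispredictions).

I am not enough of a compiler expert to understand the control flow and logic used by the compiler, but it looks like the compiler is able to apply smarter optimizations (maybe including the tail-recursion optimization pointed out by user1850903 in his answer) when that portion of the code is not present.

【Discussion】:

【Solution 5】:

Removing the line s += A[curIndex+1] + A[size-curIndex-1]; enables tail-recursion optimization. This optimization can only happen when the recursive call is in the last line of the function.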

https://en.wikipedia.org/wiki/Tail_call
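To make the "last line" condition concrete, here is a small sketch with made-up names (sumTail, sumLoop, sumAfterCall) that sums an array in three ways. Whether a given compiler really applies the transformation depends on the compiler and its flags, so the comments describe what is possible, not what is guaranteed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Tail-recursive: nothing is left to do after the recursive call, so a
// compiler may reuse the current frame and turn the call into a jump.
long sumTail(const std::vector<int>& a, std::size_t i, long acc) {
    assert(i <= a.size());
    if (i == a.size()) return acc;
    return sumTail(a, i + 1, acc + a[i]);
}

// The loop that such a transformation effectively produces.
long sumLoop(const std::vector<int>& a) {
    long acc = 0;
    for (std::size_t i = 0; i < a.size(); ++i) acc += a[i];
    return acc;
}

// Not a tail call as written: the addition happens after the recursive
// call returns, so the frame must stay alive across the call. (Some
// compilers can still rewrite simple cases like this one, but that is
// an extra optimization, not plain tail-call elimination.)
long sumAfterCall(const std::vector<int>& a, std::size_t i) {
    assert(i <= a.size());
    if (i == a.size()) return 0;
    long rest = sumAfterCall(a, i + 1);
    return rest + a[i];
}
```

All three return the same value for the same input, for example 4 for the array 0,0,0,0,0,1,1,1,1 from the question. They differ only in how many real call and ret instructions the generated code may need.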

【Discussion】: