Optimization of a branch in a while loop: why do fewer instructions cost more running time?

【Question title】: Optimization of branch in while loop, why less instruction cost more running time?
【Posted】: 2015-06-27 01:53:03
【Question description】:

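I have a data structure project implemented in C that exports various APIs to other programs. Recently I have wanted to optimize the hot functions identified by gprof profiling. Here is the project for your reference: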

https://github.com/Incarnation-p-lee/libds

One hot function is binary_search_tree_node_insert, shown below:

/*
 * RETURN the pointer of inserted node of the binary search tree
 *        If root is NULL or node is NULL, RETURN NULL.
 */
struct binary_search_tree *
binary_search_tree_node_insert(struct binary_search_tree *root,
    struct binary_search_tree *node)
{
    register struct binary_search_tree **iter;

    if (!node || !root) {
        pr_log_warn("Attempt to access NULL pointer.\n");
    } else {
        iter = &root;
        while (*iter) {
            if (node->chain.nice == (*iter)->chain.nice) {
                if (*iter == node) {
                    pr_log_info("Insert node exist, nothing will be done.\n");
                } else {
                    doubly_linked_list_merge((*iter)->chain.link, node->chain.link);
                }
                return *iter;
#ifndef OPT_HOT
            } else if (node->chain.nice > (*iter)->chain.nice) {
                    iter = &(*iter)->right;
            } else if (node->chain.nice < (*iter)->chain.nice) {
                    iter = &(*iter)->left;
#else
            } else {
                binary_search_tree_insert_path_go_through(node, iter);
#endif
            }
        }
        return *iter = node;
    }

    return NULL;
}

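I want to optimize the two else-if parts, because they form a roughly 50/50 branch, which may hurt the pipeline considerably. So I wrote a macro, binary_search_tree_insert_path_go_through, to replace these two branches. The implementation is as follows: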

/*
 * 1. node->nice => rbx, *iter => rcx.
 * 2. compare rbx, and 0x8(rcx).
 * 3. update iter.
 */
#define binary_search_tree_insert_path_go_through(node, iter) \
    asm volatile (                                            \
        "mov $0x18, %%rax\n\t"                                \
        "mov $0x20, %%rdx\n\t"                                \
        "mov 0x8(%1), %%rbx\n\t"                              \
        "mov (%0), %%rcx\n\t"                                 \
        "cmp 0x8(%%rcx), %%rbx\n\t"                           \
        "cmovg %%rdx, %%rax\n\t"                              \
        "lea (%%rcx, %%rax), %0\n\t"                          \
        :"+r"(iter)                                           \
        :"r"(node)                                            \
        :"rax", "rbx", "rcx", "rdx")

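But with this change the unit test for this function became about 6-8% slower. Judging from the objdump output (the optimized code is on the right-hand side), the new version has fewer instructions, so I can hardly understand why it takes more time than the code before the optimization.

Here are some more details for your reference: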

struct collision_chain {
    struct doubly_linked_list *link;
    sint64                    nice;
};
/*
 * binary search tree
 */
struct binary_search_tree {
    struct collision_chain chain;
    sint32                 height;  /* reserved for avl */
    /* root node has height 0, NULL node has height -1 */
    union {
        struct binary_search_tree *left;
        struct avl_tree           *avl_left;    /* reserved for avl   */
        struct splay_tree         *splay_left;  /* reserved for splay */
    };
    union {
        struct binary_search_tree *right;
        struct avl_tree           *avl_right;    /* reserved for avl   */
        struct splay_tree         *splay_right;  /* reserved for splay */
    };
};
struct doubly_linked_list {
    uint32                    sid;
    void                      *val;
    struct doubly_linked_list *next;
    struct doubly_linked_list *previous;
};
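The inline assembly above hard-codes the displacements 0x8, 0x18 and 0x20, so it only works while the struct layout stays exactly as listed. Below is a minimal sketch of how those constants could be checked with offsetof. It assumes that sint64 and sint32 are typedefs for the fixed-width types in <stdint.h> (the project's own definitions are not shown in the post) and that the target uses an LP64 ABI such as x86-64 Linux. The helper name layout_matches_asm_constants is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed typedefs; the project's real definitions are not shown in the post. */
typedef int64_t sint64;
typedef int32_t sint32;

struct doubly_linked_list;  /* only used through a pointer here */

struct collision_chain {
    struct doubly_linked_list *link;
    sint64                    nice;
};

/* Simplified copy: the avl/splay union members are dropped. That does not
 * change the offsets, because every union member is a pointer. */
struct binary_search_tree {
    struct collision_chain    chain;
    sint32                    height;
    struct binary_search_tree *left;
    struct binary_search_tree *right;
};

/* Returns 1 when the layout matches the constants used in the asm macro:
 * nice at 0x8, left at 0x18, right at 0x20. */
static int layout_matches_asm_constants(void)
{
    return offsetof(struct binary_search_tree, chain.nice) == 0x8
        && offsetof(struct binary_search_tree, left) == 0x18
        && offsetof(struct binary_search_tree, right) == 0x20;
}
```

If the check fails on some target, the macro would silently walk the wrong fields, which is one reason to prefer plain C member access over fixed displacements.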

The code runs in VirtualBox on a 2-core i5-3xxM; cpuinfo is as follows:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz
stepping        : 9
microcode       : 0x19
cpu MHz         : 2568.658
cache size      : 6144 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl pni ssse3 lahf_lm
bogomips        : 5137.31
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz
stepping        : 9
microcode       : 0x19
cpu MHz         : 2568.658
cache size      : 6144 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl pni ssse3 lahf_lm
bogomips        : 5137.31
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

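【Question comments】:

It is an x86_64 64-bit LFS machine: Linux pli.lfs 3.10.10 #1 SMP Sun Mar 2 18:07:33 CST 2014 x86_64 GNU/Linux.
Could you provide more processor details: cat /proc/cpuinfo?
Fewer instructions does not mean the code will run faster. If you use -O3, the code is very likely to be longer than at lower optimization levels. Even a snippet in which 1 or 2 instructions form an infinite or very long loop will run longer than a program with more instructions but fewer iterations.
The assembly being compared was compiled with gcc at -O3.
What is the type of nice here, and what is the possible range of values of this variable?

Tags: c optimization data-structures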

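【Solution 1】:

I don't know whether it is the same for modern processors, but Linus really didn't like the CMOV instruction back in '07.

Since you are micro-optimizing, move the equality check to the last position. It is almost always false, yet you evaluate it in every iteration.

Further, I would try not using the pointer-to-pointer pattern. Indirection tends to make optimizers choke because of pointer aliasing concerns. Instead, use an inchworm pattern with two pointers: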

void insert(NODE *x, NODE **root) {
  NODE *trail = NULL;
  NODE *lead = *root;
  while (lead) {
    trail = lead;
    if (x->key < lead->key)
      lead = lead->left;
    else if (x->key > lead->key)
      lead = lead->right;
    else return; // do nothing;
  }
  // lead has found null, so insert
  if (trail) {
    // redo the last key comparison
    if (x->key < trail->key)
      trail->left = x;
    else
      trail->right = x;
  } else {
    *root = x;
  }
}

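On my MacBook, this compiles to the following 64-bit code, in which the loop contains only 10 instructions. It is hard to tell from the tiny listing in your post, but yours looks considerably longer: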

    pushq   %rbp
    movq    %rsp, %rbp
    movq    (%rsi), %rdx
    testq   %rdx, %rdx
    je      LBB0_11
    movl    16(%rdi), %ecx
LBB0_2:                                 
    movq    %rdx, %rax     # dx=lead, ax=trail
    cmpl    16(%rax), %ecx # comparison with key
    jge     LBB0_4         # first branch
    movq    %rax, %rdx     # go left (redundant because offset(left)==0!)
    jmp     LBB0_6
LBB0_4:                                 
    jle     LBB0_12        # second branch
    leaq    8(%rax), %rdx  # go right
LBB0_6:                                 
    movq    (%rdx), %rdx   # move lead down the tree
    testq   %rdx, %rdx     # test for null
    jne     LBB0_2         # loop if not
    testq   %rax, %rax     # insertion logic
    je      LBB0_11
    movl    16(%rdi), %ecx
    cmpl    16(%rax), %ecx
    jge     LBB0_10
    movq    %rdi, (%rax)
    popq    %rbp
    retq
LBB0_11:
    movq    %rdi, (%rsi)
LBB0_12:                   # return for equal keys
    popq    %rbp
    retq
LBB0_10:
    movq    %rdi, 8(%rax)
    popq    %rbp
    retq

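If comparisons are expensive (you don't show what "nice" is), you can also try storing the binary result of the trail comparison so that the final check uses it rather than redoing the comparison.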
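A minimal sketch of that last suggestion follows. It assumes a hypothetical NODE type with left, right and an int key, since the answer's NODE is not defined in the post. The function name insert_remember and the flag went_left are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node type; the answer's NODE is not defined in the post. */
typedef struct node {
    struct node *left;
    struct node *right;
    int          key;
} NODE;

/* Inchworm insert that remembers the direction of the last comparison,
 * so the final link step does not compare keys again. */
void insert_remember(NODE *x, NODE **root)
{
    NODE *trail = NULL;
    NODE *lead = *root;
    int went_left = 0;           /* result of the most recent comparison */

    while (lead) {
        trail = lead;
        if (x->key < lead->key) {
            went_left = 1;
            lead = lead->left;
        } else if (x->key > lead->key) {
            went_left = 0;
            lead = lead->right;
        } else {
            return;              /* equal key: do nothing */
        }
    }
    if (!trail)
        *root = x;               /* empty tree: x becomes the root */
    else if (went_left)
        trail->left = x;
    else
        trail->right = x;
}
```

Whether this beats redoing one comparison depends on how costly the comparison is. For a plain integer key the extra store in the loop may well cost more than it saves, so it is worth measuring.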

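【Comments】:

I'm a university student and just finished my assembly course last quarter (we used SPARC, though). There is a lot in this answer that I would love to understand better. I'll start with this: could you elaborate on the "inchworm" pattern as opposed to pointer-to-pointer? How does it affect compilation?
Thanks a lot. Putting the equality comparison in the last position may affect performance slightly, but I thought the inline asm dropped the cmp and jmp instructions, so it might be faster than the if-else.
Or, on the other hand, should I rewrite all the code under the while loop to optimize it?
@IncarnationP.Lee Haven't you read the Linus Torvalds post about CMOV? It can be a very expensive instruction and slower than branches. The type of branch in this loop is exactly the kind that CPUs with multiple execution units are designed to handle well. They execute both sides of the branch more or less in parallel and then throw one computation away when the comparison completes, so there is no pipeline flush at all.
Actually, I often see the Intel compiler emit cmovs where gcc and clang emit jumps or flag-setting instructions, so these conditional moves may be much better nowadays.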
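The comments above contrast branches with conditional moves. One way to leave that choice to the compiler instead of hard-coding cmov in inline assembly is to write the descent step as a plain C select. The sketch below assumes a hypothetical simplified node type (struct tnode) rather than the project's struct, and the names descend and tnode_insert are illustrative. Whether the compiler emits a cmov or a branch for the ternary depends on the compiler, its version and its flags.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified node; not the project's struct. */
struct tnode {
    struct tnode *left;
    struct tnode *right;
    long          key;
};

/* One descent step written as a select. The compiler is free to lower
 * this to a conditional move or to a branch. */
static struct tnode **descend(struct tnode **iter, const struct tnode *node)
{
    struct tnode *cur = *iter;
    return (node->key > cur->key) ? &cur->right : &cur->left;
}

/* Returns the inserted node, or the existing node with the same key. */
struct tnode *tnode_insert(struct tnode **root, struct tnode *node)
{
    struct tnode **iter = root;

    while (*iter) {
        if (node->key == (*iter)->key)
            return *iter;        /* key already present */
        iter = descend(iter, node);
    }
    return *iter = node;
}
```

Comparing the generated assembly (for example with objdump -d) and timing both variants on the target machine is the only reliable way to tell which form is faster there.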

【Solution 2】:

This is not a direct answer to your question, but you can avoid the else if entirely:

sint64 mask,left,right;
...
if (node->chain.nice == (*iter)->chain.nice)
{
    ...
}
else
{
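    /* mask is all ones when node->chain.nice > (*iter)->chain.nice (go right),
     * otherwise 0 (go left). This relies on an arithmetic right shift of
     * negative values and on the subtraction not overflowing. */
    mask  = ((*iter)->chain.nice - node->chain.nice) >> 63;
    left  = (sint64)(&(*iter)->left);
    right = (sint64)(&(*iter)->right);
    iter  = (struct binary_search_tree**)((right & mask) | (left & ~mask));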
}

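【Comments】:

Nice suggestion, I'll give it a try.
If you want to do tricks like this, just put the child pointers in an array, where children[0] is left and children[1] is right, and then say node = node->children[x->key > node->key].
@Gene: That was my initial thought, but you would end up having to: 1. Store them into the array on every iteration (since they are different each time). 2. Use the sign bit as the index into the array, which is essentially equivalent to my "trick", since you would still need to apply >> and & in order to retrieve that bit. So, bottom line, I think it is better to do it this way (i.e., my way) because of the first reason above. Thanks.
@IncarnationP.Lee: You're welcome. Note that it may not work as expected if the expression (*iter)->chain.nice - node->chain.nice itself overflows. I haven't investigated it fully, and I'm not even sure it is a problem. But if it is, you might be able to work around it using the overflow flag bit, which, as far as I know, is specific to your hardware architecture (i.e., beyond the scope of the C language standard).
I don't think putting it in children[0] is a good idea, because it needs two instructions in each iteration.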
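The children-array idea discussed in these comments can be sketched as follows. It assumes the child pointers live in the node itself as a two-element array, so nothing has to be copied on each iteration. The struct cnode type and the function cnode_insert are hypothetical, not part of the project. Because the index comes from a comparison that yields 0 or 1, there is no subtraction and therefore no overflow concern.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node with its children stored in an array. */
struct cnode {
    struct cnode *children[2];   /* [0] = left, [1] = right */
    long          key;
};

/* Returns the inserted node, or the existing node with the same key. */
struct cnode *cnode_insert(struct cnode **root, struct cnode *x)
{
    struct cnode **iter = root;

    while (*iter) {
        struct cnode *cur = *iter;

        if (x->key == cur->key)
            return cur;          /* key already present */
        /* The comparison evaluates to 0 or 1 and selects the child slot. */
        iter = &cur->children[x->key > cur->key];
    }
    return *iter = x;
}
```

This changes the node layout, so in the project it would affect every piece of code that uses left and right by name. Treat it as an illustration of the indexing technique, not as a drop-in replacement.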