Need help constructing a long loop(long x, int n) function in C given 64-bit assembly instructions

[Question title]: Need help constructing a long loop(long x, int n) function in C given 64-bit assembly instructions
[Posted]: 2017-03-02 20:31:41
[Question description]:

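I have the following assembly code, generated from a C function long loop(long x, int n), with x in %rdi and n in %esi, on a 64-bit machine. I've written comments on what I think each assembly instruction is doing.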

loop:

    movl   %esi, %ecx // store the value of n in register ecx
    movl   $1, %edx // store the value of 1 in register edx (rdx).initial mask
    movl   $0, %eax //store the value of 0 in register eax (rax). this is initial return value 
    jmp    .L2 

.L3:

    movq   %rdi, %r8 //store the value of x in register r8  
    andq    %rdx, %r8 //store the value of (x & mask) in r8
    orq    %r8, %rax //update the return value rax by (x & mask | [rax] ) 
    salq   %cl, %rdx //update the mask rdx by ( [rdx] << n)

.L2:

    testq  %rdx, %rdx //test mask&mask
    jne    .L3 // if (mask&mask) != 0, jump to L3    
    rep; ret

I have the following C function, which needs to correspond to the assembly code:

    long loop(long x, int n){
          long result = _____ ;   
          long mask; 
       // for (mask = ______; mask ________; mask = ______){ // filled in as:
          for (mask = 1;      mask != 0;     mask <<n) {
              result |= ________;
          } 
          return result;
     }

I need some help filling in the blanks. I'm not 100% sure what the assembly instructions do, but I've done my best by commenting each line.

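[Comments on the question]:

We are not a "do my homework" site. Ask your teacher for advice.
The first step in these exercises is to guess which variables live in which registers. One thing to note is that the function's return value ends up in the rax register.
Sorry, I'm not asking you to do my homework. I only added comments on what I think each instruction is doing. I need help connecting it to the C code.
The test instruction is mainly used to set the flags for a later branch instruction. So although test is technically computing mask & mask, the equivalent C code is just mask != 0.
The comment after the orq instruction isn't quite right. You should take another look at it.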

Tags: c assembly x86-64 reverse-engineering att

[Solution 1]:

You almost have it already in your comments.

long loop(long x, int n) {
    long result = 0;
    long mask;
    for (mask = 1; mask != 0; mask <<= n) {
        result |= (x & mask);
    }
    return result;
}

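Since result is the return value, and the return value is stored in %rax, movl $0, %eax initially loads 0 into result.

Inside the for loop, %r8 holds the value that gets ORed into result, which, as you mentioned in your comments, is just x & mask.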
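As a quick sanity check of the reconstruction, here is a minimal sketch of the same loop. It differs from the code above in one respect that is my own assumption, not part of the answer: the mask and result are kept as unsigned long so that shifting a bit into position 63 and then out of the top stays well defined in ISO C. It also assumes a 64-bit long and 1 <= n <= 63.

```c
/* Same logic as the reconstructed loop(), with an unsigned mask so that
   the shift into bit 63 and past it is well defined in ISO C.
   Assumption: long is 64 bits wide and 1 <= n <= 63; n == 0 never terminates. */
long loop(long x, int n) {
    unsigned long result = 0;
    unsigned long mask;
    for (mask = 1; mask != 0; mask <<= n) {
        /* Keep only the bit of x selected by the current mask. */
        result |= ((unsigned long)x & mask);
    }
    return (long)result;
}
```

For example, loop(0xFF, 2) keeps bits 0, 2, 4 and 6 of the input and yields 0x55.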

[Discussion]:

[Solution 2]:

The function copies every nth bit into result.

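For the record, the implementation is full of missed optimizations, especially if we are tuning for Sandybridge-family CPUs, where bts reg,reg is only 1 uop with 1c latency, but shl %cl is 3 uops. (BTS is also 1 uop on Intel P6 / Atom / Silvermont CPUs.)

bts is only 2 uops on AMD K8/Bulldozer/Zen. BTS reg,reg masks the bit index the same way x86 integer shifts mask the shift count, so bts %rdx, %rax implements rax |= 1ULL << (rdx&0x3f), i.e. it sets bit n in RAX.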
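The register form of bts described above can be modeled in C roughly as follows. This is only a sketch: the helper name bts64 is hypothetical, and it ignores the carry flag, which the real instruction also sets to the old value of the selected bit.

```c
#include <stdint.h>

/* Models `bts %rdx, %rax` with register operands: the bit index is taken
   modulo 64, then that bit is set in the destination.
   bts64 is a hypothetical helper name; CF is not modeled. */
uint64_t bts64(uint64_t rax, uint64_t rdx) {
    return rax | (UINT64_C(1) << (rdx & 0x3f));
}
```

Because of the masking, an index of 65 sets bit 1 and an index of 64 sets bit 0.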

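(This code is clearly designed to be simple and easy to understand, not even using the best-known x86 peephole optimization, xor-zeroing, but it is interesting to see how we could implement the same thing efficiently.)

More obviously, doing the and inside the loop is bad. Instead, we can build a mask with every nth bit set, and return x & mask. This has the added advantage that, with a non-ret instruction following the conditional branch, we don't need the rep prefix as padding for ret, even if we care about tuning for the branch predictor in AMD Phenom CPUs. (Because the ret is not the first byte after a conditional branch.)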

# x86-64 System V:  x in RDI,  n in ESI
mask_bitstride:                      # give the function a meaningful name
    mov    $1, %eax                  # mask = 1
    mov    %esi, %ecx                # unsigned bitidx = n   (new tmp var)

# the loop always runs at least 1 iteration, so just fall into it
.Lloop:                     # do {
    bts    %rcx, %rax                # rax |= 1ULL << bitidx
    add    %esi, %ecx                # bitidx += n
    cmp    $63, %ecx                 # sizeof(long)*CHAR_BIT - 1
    jbe    .Lloop           # }while(bitidx <= maxbit);  // unsigned condition

    and    %rdi, %rax       # return x & mask
    ret                # not following a JCC so no need for a REP prefix even on K10
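For readers more comfortable with C, the assembly above can be read as roughly the following. This is my own line-by-line rendering rather than code from the answer; it uses unsigned types throughout and assumes 1 <= n <= 63.

```c
#include <stdint.h>

/* C rendering of the mask_bitstride assembly; assumes 1 <= n <= 63.
   n == 0 would loop forever, just like the assembly. */
uint64_t mask_bitstride(uint64_t x, unsigned n) {
    uint64_t mask = 1;                          /* mov $1, %eax   */
    unsigned bitidx = n;                        /* mov %esi, %ecx */
    do {
        mask |= UINT64_C(1) << (bitidx & 63);   /* bts %rcx, %rax */
        bitidx += n;                            /* add %esi, %ecx */
    } while (bitidx <= 63);                     /* cmp $63, %ecx ; jbe .Lloop */
    return x & mask;                            /* and %rdi, %rax */
}
```

The loop only builds the mask; the single and with x happens once, after the loop, which is the point of the rewrite.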

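We assume n is in the range 0..63, because otherwise the C would have undefined behavior. Outside that range, this implementation differs from the shift-based implementation in the question. The shl version would treat n==64 as an infinite loop, because shift count = 0x40 & 0x3f = 0, so mask would never change. This bitidx += n version will exit after the first iteration, because bitidx immediately becomes > 63, i.e. out of range.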

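A less extreme case is n=65: the shl version would copy all the bits (shift count of 1), while this version copies only the low two bits (bit 0, plus bit 65 & 63 = 1 set by the single bts).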

Both versions create an infinite loop for n=0. I used an unsigned compare, so a negative n will exit the loop right away.
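The edge cases for the shift-based loop can be checked with a small model of how shl %cl masks its count. This is a sketch under my own assumptions: shl_cl and shift_loop_iterations are hypothetical helper names, and the iteration limit exists only so the non-terminating cases can be detected instead of hanging.

```c
#include <stdint.h>

/* Models `salq %cl, %rdx`: for 64-bit operands the CPU uses only the low
   6 bits of the count. Hypothetical helper name. */
uint64_t shl_cl(uint64_t value, unsigned count) {
    return value << (count & 63);
}

/* Counts iterations of the question's loop for a given n, giving up after
   `limit` iterations. Returns -1 if the mask did not reach zero by then. */
int shift_loop_iterations(unsigned n, int limit) {
    uint64_t mask = 1;
    int iterations = 0;
    while (mask != 0) {
        if (iterations == limit) {
            return -1;              /* mask never became zero within limit */
        }
        mask = shl_cl(mask, n);
        iterations++;
    }
    return iterations;
}
```

With this model, n=65 behaves exactly like n=1, while n=64 and n=0 leave the mask unchanged and never terminate.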

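On Intel Sandybridge-family CPUs, the original inner loop is 7 uops (mov = 1 + and = 1 + or = 1 + variable-count shl = 3 + macro-fused test+jcc = 1). This will bottleneck on the front-end, or on ALU throughput on SnB/IvB.

My version is only 3 uops, and can run about twice as fast (1 iteration per clock).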

[Discussion]: