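Optimizing a loop that prints a counter: answers

【Question title】: Optimizing a loop that prints a counter
【Posted】: 2016-12-07 22:17:23
【Question description】:

I have a very small loop program that prints the numbers from 5000000 down to 1. I want to make it run as fast as possible.

I'm learning Linux x86-64 assembly with NASM.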

global  main
extern  printf
main:
    push    rbx                     
    mov rax,5000000d
print:
    push    rax                     
    push    rcx                     
    mov     rdi, format             
    mov     rsi, rax                
    call    printf                  
    pop     rcx                     
    pop     rax                     
    dec     rax                     
    jnz     print                   
    pop     rbx                     
    ret

format:
db  "%ld", 10, 0

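【Question discussion】:

Do you know what monstrous calculations printf is doing? :-o You should. Check how numbers get converted to decimal format. Peter's answer obviously gets rid of that: in a case like your example it's easy to operate on the string directly and avoid working with numbers at all, which is much faster in this particular case.

Tags: linux assembly optimization nasm x86-64

【Solution 1】: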

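The call to printf completely dominates the run time of even this highly inefficient loop. (Did you notice that you push/pop rcx even though you never use it anywhere? Maybe that's left over from using the slow LOOP instruction.)

To learn more about writing efficient x86 asm, see Agner Fog's Optimizing Assembly guide. (And his microarchitecture guide, too, if you want to really get into the details of specific CPUs and how they differ: what's optimal on one uarch might not be on another. For example, IMUL r64 has better throughput and latency on Intel CPUs than on AMD, but CMOV and ADC are 2 uops on Intel pre-Broadwell, with 2-cycle latency, vs. 1 on AMD, since 3-input ALU m-ops (FLAGS + two registers) aren't a problem for AMD.) See also the other links in the x86 tag wiki.

Purely optimizing the loop without changing the 5M calls to printf is useful only as an example of how to write a loop properly, not as a way to actually speed up this code. But let's start with that: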

; trivial fixes to loop efficiently while calling the same slow function
global  main
extern  printf
main:
    push    rbx
    mov     ebx, 5000000         ; don't waste a REX prefix for constants that fit in 32 bits
.print:
    ;; removed the push/pops from inside the loop.
    ; Use call-preserved regs instead of saving/restoring stuff inside a loop yourself.
    mov     edi, format          ; static data / code always has a 32-bit address
    mov     esi, ebx
    xor     eax, eax             ; The x86-64 SysV ABI requires al = number of FP args passed in FP registers for variadic functions
    call    printf                  
    dec     ebx
    jnz     .print

    pop     rbx                ; restore rbx, the one call-preserved reg we actually used.
    xor     eax,eax            ; successful exit status.
    ret

section .rodata       ; it's usually best to put constant data in a separate section of the text segment, not right next to code.
format:
db  "%ld", 10, 0

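To make this fast, we should take advantage of the redundancy in converting consecutive integers to strings. Since "5000000\n" is only 8 bytes long (including the newline), the string representation fits in a 64-bit register.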

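We can store that string into a buffer and increment a pointer by the string length. (Since the string gets shorter for smaller numbers, just keep the current string length in a register, which you can update in the special-case branch where it changes.)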
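The store-then-advance idea can be sketched in C. This is only an illustration under stated assumptions: `store_packed` and `pack_string` are hypothetical helper names that do not appear in the answer, and the buffer is assumed to have at least 8 bytes of slack past the last string, because every store writes a full 8 bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: pack up to 8 characters into a 64-bit integer.
   On a little-endian machine such as x86 the first character ends up in
   the low byte, like a NASM string constant used as an integer. */
static uint64_t pack_string(const char *s)
{
    uint64_t v = 0;
    size_t n = strlen(s);
    assert(n <= 8);
    memcpy(&v, s, n);
    return v;
}

/* Hypothetical helper: store the whole 8-byte register image at p, but
   advance the pointer only by the real string length, so the next store
   overwrites the unused tail of this one. */
static char *store_packed(char *p, uint64_t packed, size_t len)
{
    assert(len >= 1 && len <= 8);
    memcpy(p, &packed, 8);   /* compilers typically emit one 8-byte store */
    return p + len;
}
```

Calling `store_packed` repeatedly with strings of length 4, 3 and 3 leaves 10 contiguous bytes of output in the buffer. The overlapping tails are harmless because each one is overwritten by the next store.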

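We can decrement the string representation in place, which avoids ever (re)doing the repeated division by 10 needed to turn an integer into a decimal string.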
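As a plain, non-optimized illustration of decrementing a decimal string in place, here is a minimal C sketch. The function name `ascii_decrement` is hypothetical. It works on a byte array rather than a register, and it deliberately does not shrink the string when a leading digit becomes '0'; in the answer's design that is the job of the rarely taken special-case branch.

```c
#include <assert.h>
#include <stddef.h>

/* Decrement the ASCII decimal number in digits[0..len-1], most-significant
   digit first. Returns 1 if the borrow ran off the top (the value was all
   zeros), otherwise 0. No division by 10 is ever performed. */
static int ascii_decrement(char *digits, size_t len)
{
    assert(len > 0);
    for (size_t i = len; i-- > 0; ) {
        if (digits[i] != '0') {
            digits[i]--;          /* common case: only the last digit changes */
            return 0;
        }
        digits[i] = '9';          /* '0' - 1 borrows from the next digit up */
    }
    return 1;
}
```

Nine times out of ten the loop exits after touching a single byte, which is why the asm version can unroll the last digit and keep borrow handling off the hot path.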

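Since carry/borrow doesn't propagate naturally inside a register in the way we need, and the AAS instruction isn't available in 64-bit mode (and only works on AX, not even EAX, and is slow), we have to do it ourselves. We're decrementing by 1 every time, so we know what's going to happen. We can handle the least-significant digit by unrolling 10 times, so there's no branching to handle it.

Also note that since we want the digits in printing order, carry goes in the wrong direction anyway, because x86 is little-endian. If there were a good way to take advantage of having our string in the other byte order, we could use BSWAP or MOVBE. (But note that MOVBE r64 is 3 fused-domain uops on Skylake, 2 of them ALU uops. BSWAP r64 is also 2 uops.)

Perhaps we should be doing odd/even counters in parallel, in the two halves of an XMM vector register. But that stops working well once the strings are shorter than 8B. Storing one number-string at a time, we can easily overlap. Still, we could do the carry propagation in a vector reg and store the two halves separately with MOVQ and MOVHPS. Or, since 4/5 of the numbers from 0 to 5M are 7 digits, it might be worth having code for the special case where we can store a whole 16B vector holding two numbers.

A better way to handle the shorter strings: SSSE3 PSHUFB to shuffle the two strings so they are left-packed in a vector register, then a single MOVUPS to store both at once. The shuffle mask only needs to be updated when the string length (number of digits) changes, so the infrequently executed carry-handling special-case code can do that, too.

Vectorizing the hot part of the loop should be very simple and cheap, and should just about double performance.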

;;; Optimized version: keep the string data in a register and modify it
;;; instead of doing the whole int->string conversion every time.

section  .bss
printbuf:  resb 1024*128 + 4096     ;  Buffer size ~= half L2 cache size on Intel SnB-family.  Or use a giant buffer that we write() once.  Or maybe vmsplice to give it away to the kernel, since we only run once.

global  main
extern  printf
main:
    push    rbx

    ; use some REX-only regs for values that we're always going to use a REX prefix with anyway for 64-bit operand size.
    mov     rdx, `5000000\n`   ; (NASM string constants as integers work like little-endian, so DL = '5' = 0x35 and the high byte holds '\n' = 10).  Note that YASM doesn't support back-ticks for C-style backslash processing.
    mov     r9, 1<<48         ; decrement by 1 in the 2nd-last byte: LSB of the decimal string (the last byte, bits 56..63, holds the '\n')
    ;xor     r9d, r9d
    ;bts      r9, 48           ; IDK if this code-size optimization outside the loop would help or not.

    mov     eax, 8            ; string length.
    mov     edi, printbuf

.storeloop:

    ;;  rdx = "????x9\n".  We compute the start value for the next iteration, i.e. counter -= 10 in rdx.

    mov     r8, rdx
    ;;  r8 = rdx.  We modify it to have each last digit from 9 down to 0 in sequence, and store those strings in the buffer.
    ;;  The string could be any length, always with the first ASCII digit in the low byte; our other constants are adjusted correctly for it
    ;; narrower than 8B means that our stores overlap, but that's fine.

    ;; Starting from here to compute the next unrolled iteration's starting value takes the `sub r8, r9` instructions off the critical path, vs. if we started from r8 at the bottom of the loop.  This gives out-of-order execution more to play with.
    ;;  It means each loop iteration's sequence of subs and stores are a separate dependency chain (except for the store addresses, but OOO can get ahead on those because we only pointer-increment every 2 stores).

    mov     [rdi], r8
    sub     r8, r9             ; r8 = "xxx8\n"

    mov     [rdi + rax], r8    ; defer p += len by using a 2-reg addressing mode
    sub     r8, r9             ; r8 = "xxx7\n"

    lea     edi, [rdi + rax*2]  ; if we had len*3 in another reg, we could defer this longer
           ;; our static buffer is guaranteed to be in the low 31 bits of address space so we can safely save a REX prefix on the LEA here.  Normally you shouldn't truncate pointers to 32-bits, but you asked for the fastest possible.  This won't hurt, and might help on some CPUs, especially with possible decode bottlenecks.

    ;; repeat that block 3 more times.
    ;; using a short inner loop for the 9..0 last digit might be a win on some CPUs (like maybe Core2), depending on their front-end loop-buffer capabilities if the frontend is a bottleneck at all here.

    ;; anyway, then for the last one:
    mov     [rdi], r8             ; r8 = "xxx1\n"
    sub     r8, r9
    mov     [rdi + rax], r8       ; r8 = "xxx0\n"

    lea     edi, [rdi + rax*2]


    ;; compute next iteration's RDX.  It's probably a win to interleave some of this into the loop body, but out-of-order execution should do a reasonably good job here.
    mov     rcx, r9
    shr     rcx, 8      ; maybe hoist this constant out, too
    ; rcx = 1 in the second-lowest digit
    sub     rdx, rcx

    ; detect carry when '0' (0x30) - 1 = 0x2F by checking the low bit of the high nibble in that byte.
    shl     rcx, 4      ; rcx = bit 4 (0x10) of that byte: set for '0'..'9' (0x30..0x39), clear for 0x2F
    test    rdx, rcx
    jz      .carry_second_digit
    ; .carry_second_digit is some complicated code to propagate carry as far as it needs to go, up to the most-significant digit.
    ; when it's done, it re-enters the loop at the top, with eax and r9 set appropriately.
    ; it only runs once per 100 digits, so it doesn't have to be super-fast

    ; maybe only do buffer-length checks in the carry-handling branch,
    ; in which case the jz .carry  can be  jnz .storeloop
    cmp     edi, esi              ; } while(p <= endp);  esi = end-of-buffer pointer, which this sketch never sets up
    jbe     .storeloop

    ; write() system call on the buffer.
    ; Maybe need a loop around this instead of doing all 5M integer-strings in one giant buffer.

    pop     rbx
    xor     eax,eax            ; successful exit status.
    ret

This isn't fully fleshed out, but it should give you an idea of which approaches are likely to work well.
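The borrow check from the loop (subtract 1 from one digit's byte, then test the low bit of that byte's high nibble) can be written out in C to make the bit arithmetic explicit. This is a sketch: the function name `dec_digit_detect_borrow` and the byte-index parameter are assumptions made for illustration, not part of the answer's code.

```c
#include <assert.h>
#include <stdint.h>

/* Subtract 1 from the ASCII digit held in byte `idx` of a packed string
   (byte 0 = first character, as in the asm above). ASCII digits are
   0x30..0x39, so bit 4 (0x10) of the byte is always set. The only way it
   can be clear afterwards is the wrap '0' (0x30) -> 0x2F, which means a
   borrow must be propagated to the next more-significant digit. The
   subtraction never borrows across bytes because the byte is >= 0x30. */
static int dec_digit_detect_borrow(uint64_t *packed, unsigned idx)
{
    assert(idx < 8);
    uint64_t one = (uint64_t)1 << (8 * idx);
    *packed -= one;
    return (*packed & (one << 4)) == 0;   /* nonzero result = borrow needed */
}
```

When the function reports a borrow, the caller still has to fix that byte back up to '9' and decrement the next digit, which is the "complicated code" the asm leaves to `.carry_second_digit`.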

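If vectorizing with SSE2, you'd probably use a scalar integer register to keep track of when you need to break out and handle carry, i.e. a down-counter starting from 10.

Even this scalar version probably comes close to sustaining one store per clock, which saturates the store port. They're only 8B stores (and when the string gets shorter, the useful part is shorter than that), so we're definitely leaving performance on the table if we don't bottleneck on cache misses. But with a 3GHz CPU and dual-channel DDR3-1600 (~25.6GB/s theoretical max bandwidth), 8B per clock is about enough to saturate main memory from a single core.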

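We could parallelize this, breaking the 5M .. 1 range up into chunks. With some clever math, we can figure out which byte to write the first character of "2500000\n" to, or we could have each thread make its own call to write() in the correct order. (Or use the same clever math to have them call pwrite(2) independently with different file offsets, so the kernel takes care of all the synchronization for multiple writers to the same file.)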
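The "clever math" is not spelled out in the answer. One way to do it, shown here as an assumption-laden sketch, is to count how many bytes all longer-or-equal-length lines occupy, digit-length by digit-length. The function names `bytes_up_to` and `offset_of` are hypothetical, and the output format is assumed to be one decimal number plus '\n' per line.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes occupied by the lines for all integers in [1, n], one number plus
   a '\n' per line. Walks the ranges 1..9, 10..99, 100..999, ... */
static uint64_t bytes_up_to(uint64_t n)
{
    uint64_t total = 0;
    uint64_t lo = 1;               /* smallest number with `digits` digits */
    unsigned digits = 1;
    while (lo <= n) {
        uint64_t hi = lo * 10 - 1; /* largest number with that many digits */
        if (hi > n)
            hi = n;
        total += (hi - lo + 1) * (digits + 1);   /* +1 for the newline */
        lo *= 10;
        digits++;
    }
    return total;
}

/* Byte offset of the first character of n's line when counting down from
   `start`: everything in (n, start] is printed before it. */
static uint64_t offset_of(uint64_t start, uint64_t n)
{
    assert(n >= 1 && n <= start);
    return bytes_up_to(start) - bytes_up_to(n);
}
```

With these definitions, `offset_of(5000000, 2500000)` is 20000000, because the 2500000 lines before it are all 7 digits plus a newline. A thread that starts at 2500000 could pass that value as the offset argument to pwrite(2).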

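【Discussion】:

Thanks, Peter, for the descriptive answer. I need to take some time to understand all of it. For now I'd like to ask another question: before posting, I measured the run time and it was 60 seconds. Then I applied your first suggestion and got 82 seconds, which surprised me, so I ran my original code again and it took about 90 seconds. What is going on? I'm running Ubuntu 14.04.5 on a laptop with a Core i5-4200M and 8 GB of RAM. Stranger still, when I got 60 seconds I had a lot of applications running at the same time.
@Adou: If you're letting it print to a terminal window, that will be an even bigger bottleneck. To time the program, redirect its output to /dev/null, or at least to a file. Piping it into wc -c is also reasonable. Use time ./a.out > /dev/null and look at the user, system, and real times. Also try perf stat ./a.out > /dev/null to see what frequency your CPU was running at. It should ramp up quickly, but turbo makes a big difference on laptop CPUs (because their limited power budget means max turbo is significantly higher than max sustained turbo).
See also High throughput Fizz Buzz for code that takes this idea to the extreme, incrementing an ASCII counter in a register rather than redoing the int->string conversion separately each time.

【Solution 2】: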

You're effectively printing a fixed string. I would pre-generate that string as one long constant.

The program then becomes a single call to write (or a short loop to handle incomplete writes).
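A short loop that copes with incomplete writes might look like the following C sketch. The helper name `write_all` is hypothetical; the only real API used is POSIX write(2), which may write fewer bytes than requested or fail with EINTR.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Write all `len` bytes of buf to fd, retrying after partial writes and
   after EINTR. Returns 0 on success, -1 on any other error (errno is left
   as set by write). */
static int write_all(int fd, const char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;        /* interrupted before writing anything */
            return -1;
        }
        buf += n;                /* skip what the kernel accepted */
        len -= (size_t)n;
    }
    return 0;
}
```

For the pre-generated string, the whole program would reduce to one `write_all(1, data, data_len)` call, where `data` and `data_len` stand for the constant and its size.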

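【Discussion】:

I think you could generate it on the fly at close to memset speed, which is much faster than reading a pre-generated string from disk. See my answer.
@PeterCordes Yes, I was assuming the program was already loaded into memory.
That's a bad assumption. Especially compared with rewriting a small buffer in place, so that the write() call is reading data that's already hot in L2 cache when the kernel copies it into the page cache (or into a pipe buffer, or whatever).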