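You're loading only the low 16 bits of EDX (that is, DX), so the upper bits (including the sign bit of the 32-bit register) keep whatever garbage was there before. Use 16-bit operand-size for the compare.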
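As a rough illustration of that point, here is a minimal sketch. The helper name is_negative_word and its register interface are hypothetical, and the "buggy" pattern in the comments is only an assumption about what the original code looked like:

```nasm
section .text
;; Hypothetical helper: ESI = pointer to a signed 16-bit value.
;; Returns EAX = 1 if the word is negative, else 0.
is_negative_word:
    xor   eax, eax            ; zero the result register first (also clears the upper bytes)
    ; Assumed buggy pattern:  mov dx, [esi]  /  cmp edx, 0
    ;   mov dx writes only DX, so bit 31 of EDX is stale and SF is meaningless.
    cmp   word [esi], 0       ; 16-bit operand-size: SF comes from bit 15 of the word
    sets  al                  ; al = 1 if the word is negative
    ret
    ; Alternative fix: movsx edx, word [esi]  /  test edx, edx
    ;   (sign-extend first, then a 32-bit test is safe)
```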
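Count either the negative or the non-negative numbers, then subtract from the total count to get the other.
If you need to count negatives and positives separately, you need two counters, and one test or cmp followed by two branches (so that zero doesn't go into either counter).
Adapted from Sten's answer, with some improvements. Note that test value, -1 sets SF and ZF the same way as cmp value, 0.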
    section .rodata
    word_array  dw -1,2,-3,4
    len equ $-word_array    ; length in bytes. Assemble-time constant, so we can use it as an immediate (mov reg, imm32) rather than loading it from memory.

    section .text
    ;; clobbers ESI, ECX. Returns neg_count in EAX, non_neg_count in EDX
    proc:
        mov   esi, word_array       ; esi points to the array. In MASM, use OFFSET word_array
        mov   ecx, len/2 - 1        ; [esi + ecx*2] points to the last element
        xor   edx, edx              ; non_neg_count = 0
    countloop:
        ; cmp word [esi + ecx*2], 0 ; This can't macro-fuse (memory and immediate operand). Also can't micro-fuse on SnB, because of the 2-reg addressing mode
        movsx eax, word [esi + ecx*2] ; use a 2-reg addressing mode to save loop overhead, since there's no ALU execution-port component to this insn. It doesn't need to micro-fuse to be one uop
        test  eax, eax              ; can macro-fuse with js
        js    isNegative
        inc   edx                   ; counting non-negative numbers
    isNegative:
        dec   ecx                   ; can macro-fuse with jge, but on SnB probably won't unless alignment stops it from being decoded in the same cycle as the earlier test/js
        jge   countloop             ; jge, not jnz, because we want ecx to run over indices [0 : len/2 - 1], rather than [1 : len/2]
        ; after the loop, ecx=-1, edx=non_neg_count
        ; neg_count = array_count - non_neg_count
        mov   eax, len/2
        sub   eax, edx              ; eax = neg_count
        ret                         ; return values in eax, edx
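The loop is 4 uops on Intel. (Or more likely 5 on pre-Haswell Sandybridge-family CPUs: if both test/branch pairs hit the decoders in the same cycle, only one of them macro-fuses. Haswell can do 2 macro-fusions in one decode group.)
A branchless version using sets bl / add edx, ebx would probably do well. (With sets, the counter holds the negative count; use setns to keep counting non-negatives.)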
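A minimal, unbenchmarked sketch of such a branchless loop follows. The name proc_branchless is hypothetical, and unlike the suggestion above it accumulates the negative count in EAX so the return registers match the branchy version. It also clobbers EBX, which most 32-bit calling conventions treat as call-preserved, so a real caller would need it saved and restored:

```nasm
section .rodata
word_array: dw -1, 2, -3, 4
len equ $ - word_array           ; length in bytes

section .text
;; Sketch only. Clobbers ESI, ECX, EBX.
;; Returns neg_count in EAX, non_neg_count in EDX.
proc_branchless:
    mov   esi, word_array
    mov   ecx, len/2 - 1         ; index of the last element
    xor   eax, eax               ; neg_count = 0
    xor   ebx, ebx               ; zero EBX once, so only BL changes inside the loop
.loop:
    cmp   word [esi + ecx*2], 0  ; 16-bit compare: SF = sign bit of the element
    sets  bl                     ; bl = 1 if the element is negative, else 0
    add   eax, ebx               ; neg_count += 0 or 1 (flags are dead here)
    dec   ecx
    jge   .loop                  ; run indices len/2 - 1 down to 0
    mov   edx, len/2             ; total element count
    sub   edx, eax               ; non_neg_count = total - neg_count
    ret
```

For the four-element example array this returns EAX = 2 and EDX = 2. The only branch left is the loop branch, which predicts well, so the data-dependent mispredicts of the js version go away.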
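You could save a little code size by zeroing eax and then using scasw in the loop, which compares ax with [edi] (scas always uses ES:EDI, not ESI) and increments edi by 2, but this is usually not a good choice for performance. Note that scasw sets flags from ax - [edi], so with ax = 0 the sign of the comparison is reversed.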
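For completeness, here is a sketch of what the scasw variant might look like. The name proc_scasw is hypothetical, and the code assumes a flat memory model (ES and DS have the same base) and a non-empty array. scasw implicitly addresses ES:EDI and sets flags from AX - [EDI], so with AX = 0 the branch tests "0 > element" rather than the element's own sign bit:

```nasm
section .rodata
word_array: dw -1, 2, -3, 4
len equ $ - word_array           ; length in bytes

section .text
;; Sketch only, optimized for size rather than speed. Clobbers EDI, ECX.
;; Returns neg_count in EAX, non_neg_count in EDX.
proc_scasw:
    cld                          ; scasw steps EDI upward only when DF = 0
    mov   edi, word_array        ; scasw reads from ES:EDI
    mov   ecx, len/2             ; element count (assumed non-zero)
    xor   eax, eax               ; AX = 0, the value to compare against
    xor   edx, edx               ; non_neg_count = 0
.loop:
    scasw                        ; flags from AX - word [es:edi], then edi += 2
    jg    .is_negative           ; signed 0 > element, so the element is negative
    inc   edx                    ; counting non-negative numbers
.is_negative:
    dec   ecx
    jnz   .loop
    mov   eax, len/2
    sub   eax, edx               ; neg_count = total - non_neg_count
    ret
```

A signed condition (jg) is used instead of js because 0 - (-32768) overflows in 16 bits; jg takes OF into account, so the result is still correct for that element.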
If the difference between positive and non-negative matters:
    section .rodata
    word_array  dw -1,2,0,-3,4
    len equ $-word_array    ; length in bytes. Assemble-time constant, so we can use it as an immediate (mov reg, imm32) rather than loading it from memory.

    section .text
    ;; clobbers ESI, EDI, EBX. Returns neg_count in EAX, pos_count in EDX
    proc_pos_and_neg:
        mov   esi, word_array       ; esi points to the array. In MASM, use OFFSET word_array
        xor   edx, edx              ; pos_count = 0
        xor   eax, eax              ; neg_count = 0
        lea   edi, [esi + len]      ; points one past the end of the array
        xor   ebx, ebx              ; clear upper portion, because setcc r32 isn't available, only setcc r8 :(
    countloop:
        cmp   word [esi], 0
        setg  bl                    ; 0 or 1, depending on array[i] > 0
        lea   edx, [edx + ebx]      ; add without affecting flags
        setl  bl                    ; 0 or 1, depending on array[i] < 0
        add   eax, ebx              ; can clobber flags now
        add   esi, 2                ; simple pointer-increment
        cmp   esi, edi
        jb    countloop             ; loop while our pointer is below the pointer to one-past-the-end
        ret                         ; neg_count in eax, pos_count in edx
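The zero count, if you need it, is n - eax - edx, where n is the number of elements. For the example array, that is 5 - 2 - 2 = 1.
I used a different loop structure here just for variety. The loop should be 7 uops.
Reading ebx after setcc writes bl avoids a partial-register merging penalty because we xor-zeroed EBX outside the loop. (A context switch or interrupt that saves/restores EBX will remove that performance benefit, but for a short loop it's probably still worth hoisting the xor-zeroing out of the loop.)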