The next two solutions are alternatives to Paul R's answer.
They are of interest when the masks are needed inside a performance-critical loop.
SSE2
__m128i bit_mask_v2(unsigned int n){                     /* Create an __m128i vector with the n most significant bits set to 1 */
    __m128i ones_hi  = _mm_set_epi64x(-1,0);             /* Binary vector of bits 1...1 and 0...0 */
    __m128i ones_lo  = _mm_set_epi64x(0,-1);             /* Binary vector of bits 0...0 and 1...1 */
    __m128i cnst64   = _mm_set1_epi64x(64);
    __m128i cnst128  = _mm_set1_epi64x(128);
    __m128i shift    = _mm_cvtsi32_si128(n);             /* Move n to SSE register */
    __m128i shift_hi = _mm_subs_epu16(cnst64,shift);     /* Subtract with saturation */
    __m128i shift_lo = _mm_subs_epu16(cnst128,shift);
    __m128i hi       = _mm_sll_epi64(ones_hi,shift_hi);  /* Shift the hi bits 64-n positions if 64-n>=0, else no shift */
    __m128i lo       = _mm_sll_epi64(ones_lo,shift_lo);  /* Shift the lo bits 128-n positions if 128-n>=0, else no shift */
    return _mm_or_si128(lo,hi);                          /* Merge hi and lo */
}
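To see why the saturating subtraction gives the right mask for every n, a scalar model of the same steps may help. This sketch is not part of the original answer, and the names mask128, sub_sat, shl64 and bit_mask_ref are hypothetical. It mirrors the two properties the SSE2 code relies on: 64-n and 128-n are clamped at zero, and _mm_sll_epi64 yields zero when the shift count is 64 or larger.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type: the two 64-bit halves of a 128-bit mask. */
typedef struct { uint64_t hi, lo; } mask128;

/* Clamped subtraction a - b, as _mm_subs_epu16 does per 16-bit element. */
static unsigned sub_sat(unsigned a, unsigned b) { return a > b ? a - b : 0; }

/* Left shift that yields 0 for counts >= 64, as _mm_sll_epi64 does. */
static uint64_t shl64(uint64_t v, unsigned count) { return count >= 64 ? 0 : v << count; }

/* Scalar model of bit_mask_v2: the n most significant bits set, 0 <= n <= 128. */
static mask128 bit_mask_ref(unsigned n) {
    mask128 m;
    m.hi = shl64(~UINT64_C(0), sub_sat(64, n));   /* high half: shift by 64-n, clamped */
    m.lo = shl64(~UINT64_C(0), sub_sat(128, n));  /* low half: shift by 128-n, clamped */
    return m;
}

/* Spot checks for a few values of n. */
static void check_bit_mask_ref(void) {
    assert(bit_mask_ref(0).hi == 0 && bit_mask_ref(0).lo == 0);
    assert(bit_mask_ref(1).hi == UINT64_C(0x8000000000000000) && bit_mask_ref(1).lo == 0);
    assert(bit_mask_ref(64).hi == UINT64_MAX && bit_mask_ref(64).lo == 0);
    assert(bit_mask_ref(126).lo == UINT64_C(0xFFFFFFFFFFFFFFFC));
    assert(bit_mask_ref(128).hi == UINT64_MAX && bit_mask_ref(128).lo == UINT64_MAX);
}
```

For n <= 64 the low half is shifted out completely, and for n >= 64 the high half is not shifted at all, so no branch on n is needed.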
SSSE3
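The SSSE3 case is more interesting. The pshufb instruction is used as a tiny lookup table. It took me some time to figure out the right combination of (saturating) arithmetic and constants.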
__m128i bit_mask_SSSE3(unsigned int n){                  /* Create an __m128i vector with the n most significant bits set to 1 */
    __m128i sat_const = _mm_set_epi8(247,239,231,223, 215,207,199,191, 183,175,167,159, 151,143,135,127); /* Constant used in combination with saturating addition */
    __m128i sub_const = _mm_set1_epi8(248);
    __m128i pshub_lut = _mm_set_epi8(0,0,0,0, 0,0,0,0,
                                     0b11111111, 0b11111110, 0b11111100, 0b11111000,
                                     0b11110000, 0b11100000, 0b11000000, 0b10000000);
    __m128i shift_bc  = _mm_set1_epi8(n);                       /* Broadcast n to the 16 8-bit elements. */
    __m128i shft_byte = _mm_adds_epu8(shift_bc,sat_const);      /* The constants sat_const and sub_const are selected such that */
    __m128i shuf_indx = _mm_sub_epi8(shft_byte,sub_const);      /* _mm_shuffle_epi8 can be used as a tiny lookup table */
    return _mm_shuffle_epi8(pshub_lut,shuf_indx);               /* which finds the right bit pattern at the right position. */
}
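The interplay of the constants is easier to follow one byte at a time. The following scalar sketch models a single byte lane of bit_mask_SSSE3. It is not from the original answer, and the names mask_byte and check_mask_byte are hypothetical. It assumes the usual semantics of the three intrinsics: _mm_adds_epu8 saturates at 255, _mm_sub_epi8 wraps modulo 256, and _mm_shuffle_epi8 returns 0 for an index byte with its top bit set and otherwise selects by the low 4 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one byte lane of bit_mask_SSSE3.
   k is the byte position: 0 = least significant, 15 = most significant. */
static uint8_t mask_byte(unsigned n, unsigned k) {
    /* Same table as pshub_lut, written with element 0 first. */
    static const uint8_t lut[16] = {0x80, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC, 0xFE, 0xFF,
                                    0, 0, 0, 0, 0, 0, 0, 0};
    unsigned sat_const = 127 + 8 * k;        /* lane k of sat_const: 127, 135, ..., 247 */
    unsigned sum = (n & 0xFF) + sat_const;   /* _mm_adds_epu8: add ... */
    if (sum > 255) sum = 255;                /* ... and saturate at 255 */
    uint8_t idx = (uint8_t)(sum - 248);      /* _mm_sub_epi8: wraps modulo 256 */
    /* _mm_shuffle_epi8: top bit of the index set gives 0,
       otherwise the low 4 bits select the table entry. */
    return (idx & 0x80) ? 0 : lut[idx & 0x0F];
}

/* Compare every lane with the directly computed byte for 0 <= n <= 128. */
static void check_mask_byte(void) {
    for (unsigned n = 0; n <= 128; n++) {
        for (unsigned k = 0; k < 16; k++) {
            unsigned above = 8 * (15 - k);   /* bits in the more significant bytes */
            unsigned bits = n <= above ? 0 : (n - above >= 8 ? 8 : n - above);
            uint8_t expect = (uint8_t)(0xFF00u >> bits);  /* top 'bits' bits set */
            assert(mask_byte(n, k) == expect);
        }
    }
}
```

In this model a lane that lies entirely below the mask ends up with an index in the range 135..255 (top bit set, result 0), a lane entirely inside the mask saturates to index 7 (0xFF), and the single boundary lane gets an index 0..7 that picks the partial bit pattern.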
Functionality
For 1<=n<=128, the range specified by the OP, the functions bit_mask_Paul_R(n) (Paul R's answer),
bit_mask_v2(n) and bit_mask_SSSE3(n) produce the same results:
bit_mask_Paul_R( 0) = FFFFFFFFFFFFFFFF 0000000000000000
bit_mask_Paul_R( 1) = 8000000000000000 0000000000000000
bit_mask_Paul_R( 2) = C000000000000000 0000000000000000
bit_mask_Paul_R( 3) = E000000000000000 0000000000000000
.....
bit_mask_Paul_R(126) = FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFC
bit_mask_Paul_R(127) = FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFE
bit_mask_Paul_R(128) = FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF
bit_mask_v2( 0) = 0000000000000000 0000000000000000
bit_mask_v2( 1) = 8000000000000000 0000000000000000
bit_mask_v2( 2) = C000000000000000 0000000000000000
bit_mask_v2( 3) = E000000000000000 0000000000000000
.....
bit_mask_v2(126) = FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFC
bit_mask_v2(127) = FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFE
bit_mask_v2(128) = FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF
bit_mask_SSSE3( 0) = 0000000000000000 0000000000000000
bit_mask_SSSE3( 1) = 8000000000000000 0000000000000000
bit_mask_SSSE3( 2) = C000000000000000 0000000000000000
bit_mask_SSSE3( 3) = E000000000000000 0000000000000000
.....
bit_mask_SSSE3(126) = FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFC
bit_mask_SSSE3(127) = FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFE
bit_mask_SSSE3(128) = FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF
For n=0 the most reasonable result is the zero vector, which is what
bit_mask_v2(n) and bit_mask_SSSE3(n) produce.
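The code that printed the listing is not shown in this answer. A minimal sketch of how such lines could be produced with SSE2 is given below. The helpers mask_to_u64, print_mask and check_zero_and_full are hypothetical names, and bit_mask_v2 is copied from the SSE2 section so that the block compiles on its own (x86 with SSE2 is assumed).

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* bit_mask_v2 as given in the SSE2 section of this answer. */
static __m128i bit_mask_v2(unsigned int n) {
    __m128i ones_hi  = _mm_set_epi64x(-1, 0);
    __m128i ones_lo  = _mm_set_epi64x(0, -1);
    __m128i cnst64   = _mm_set1_epi64x(64);
    __m128i cnst128  = _mm_set1_epi64x(128);
    __m128i shift    = _mm_cvtsi32_si128(n);
    __m128i shift_hi = _mm_subs_epu16(cnst64, shift);
    __m128i shift_lo = _mm_subs_epu16(cnst128, shift);
    __m128i hi       = _mm_sll_epi64(ones_hi, shift_hi);
    __m128i lo       = _mm_sll_epi64(ones_lo, shift_lo);
    return _mm_or_si128(lo, hi);
}

/* Hypothetical helper: store v to memory; out[1] is the high half, out[0] the low half. */
static void mask_to_u64(__m128i v, uint64_t out[2]) {
    _mm_storeu_si128((__m128i *)out, v);
}

/* Hypothetical helper: print one line as "name(  n) = HIGH LOW" in hexadecimal. */
static void print_mask(const char *name, unsigned int n, __m128i v) {
    uint64_t x[2];
    mask_to_u64(v, x);
    printf("%s(%3u) = %016" PRIX64 " %016" PRIX64 "\n", name, n, x[1], x[0]);
}

/* Check the two end points: n=0 gives the zero vector, n=128 gives all ones. */
static void check_zero_and_full(void) {
    uint64_t x[2];
    mask_to_u64(bit_mask_v2(0), x);
    assert(x[1] == 0 && x[0] == 0);
    mask_to_u64(bit_mask_v2(128), x);
    assert(x[1] == UINT64_MAX && x[0] == UINT64_MAX);
}
```

Calling print_mask("bit_mask_v2", n, bit_mask_v2(n)) in a loop over n prints one line per mask, high half first, in the same layout as the listing.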
Performance
To get a rough impression of the performance of the different functions, the following code was used:
__m128i sum = _mm_setzero_si128();
for (i=0;i<1000000000;i=i+1){
    sum=_mm_add_epi64(sum,bit_mask_Paul_R(i)); // or use next line instead
    // sum=_mm_add_epi64(sum,bit_mask_v2(i));
    // sum=_mm_add_epi64(sum,bit_mask_SSSE3(i));
}
_mm_storeu_si128((__m128i*)x,sum);
printf("sum = %016lX %016lX\n", x[1],x[0]);
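The performance of the code depends slightly on the type of instruction encoding.
The GCC options opts1 = -O3 -m64 -Wall -march=nehalem lead to non-VEX-encoded SSE instructions,
whereas opts2 = -O3 -m64 -Wall -march=sandybridge compiles to VEX-encoded AVX-128 instructions.
The results with gcc 5.4 are: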
Cycles per iteration on Intel Skylake, estimated with: perf stat -d ./a.out
                    opts1     opts2
bit_mask_Paul_R      6.0       7.0
bit_mask_v2          3.8       3.3
bit_mask_SSSE3       3.0       3.0
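In practice the performance will depend on the CPU type and the surrounding code.
The performance of bit_mask_SSSE3 is limited by port 5 pressure;
three instructions per iteration (one movd and two pshufb-s) are handled by port 5.
With AVX2, more efficient code is possible, see here.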