Bit twiddle help: Expanding bits to follow a given bitmask

【Question title】: Bit twiddle help: Expanding bits to follow a given bitmask
【Posted】: 2016-03-08 22:26:56
【Question】:

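I'm interested in a fast method for "expanding bits", which can be defined as follows:

Let B be a binary number with n bits, i.e. B \in {0,1}^n.
Let P be the positions of all 1/true bits in B, i.e. ((1 << p[i]) & B) != 0, with |P| = k.
For another given number A \in {0,1}^k, let Ap be the bit-expanded form of A given B, such that Ap[j] == A[j] << p[j] (bit j of A lands at position p[j] of Ap).
The result of the "bit expansion" is Ap.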

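A couple of examples:

Given B: 0010 1110 and A: 0110, Ap should be 0000 1100.
Given B: 1001 1001 and A: 1101, Ap should be 1001 0001.

Below is a straightforward algorithm, but I can't help feeling there is a faster/simpler way to do this.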

unsigned int expand_bits(unsigned int A, unsigned int B, int n) {
  int k = popcount(B); // cuda function, but there are good methods for this
  unsigned int Ap = 0;
  int j = k-1;
  // Starting at the most significant bit,
  for (int i = n - 1; i >= 0; --i) {
    Ap <<= 1;
    // if B is 1, add the value at A[j] to Ap, decrement j. 
    if (B & (1 << i)) {
      Ap += (A >> j--) & 1;
    }
  }
  return Ap;
}

【Comments on the question】:

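This looks like the functionality of the PDEP instruction in the BMI2 extension of the x86 instruction set (it is also exposed as an intrinsic). I am not aware of an instruction with similar functionality on NVIDIA GPUs, and there is no CUDA intrinsic for it. Do you need a fully general emulation of this functionality, or are you actually looking at specific, more narrowly defined instances of this kind of bit deposit?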
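Could you clarify the specification? As far as I can tell, A is the operand whose bits are to be deposited, and B is the mask that controls the destination position of each deposited bit. What exactly is the role of n? Since B may be sparse, it would seem best to iterate over the 1-bits of B starting at the least significant end.
@njuffa I just looked at the PDEP and PEXT instructions, and it looks closer to the PEXT instruction. As for your other question, I need the full functionality. B ranges in value from 0 to ~0, but I only care about values whose k is below a given value. The value of n is just the number of bits in the number: 32 for uint32_t, 16 for short.
The examples you give in the question correspond to PDEP; before posting my answer, I also ran both examples through the code below: pdep (0x6, 0x2e) = 0c; pdep (0xd, 0x99) = 91. Since n appears to be just the number of bits in the operand, it is not needed. If you need versions for various operand types, you can use templates to create overloaded versions, since CUDA is a subset of C++.
@njuffa You are right. Thanks for pointing that out.

Tags: algorithm cuda bit-manipulation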

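【Answer 1】:

The question appears to ask for a CUDA emulation of the BMI2 instruction PDEP, which takes a source operand a and deposits its bits at the positions of the 1-bits of a mask b. Currently shipping GPUs have no hardware support for an identical or similar operation; that is, up to and including the Maxwell architecture.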

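Based on the two examples given, I assume that the mask b is generally sparse, and that we can minimize work by iterating only over the 1-bits of b. This may cause divergent branches on the GPU, but without knowledge of the specific use case the exact performance trade-off is unknown. For now, I assume that exploiting the sparsity of the mask b helps performance more than the divergence hurts it.

In the emulation code below, I have reduced the use of potentially "expensive" shift operations and rely mostly on simple ALU instructions instead. On various GPUs, shift instructions execute at lower throughput than simple integer arithmetic. I have kept a single shift in the code, off the critical path, so that execution does not become limited by the arithmetic units. If desired, the expression 1U << i can be replaced by an addition: introduce a variable m that is initialized to 1 before the loop and doubled on every iteration.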

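The basic idea is to isolate each 1-bit of the mask b in turn (starting at the least significant end), AND it with the value of the i-th bit of a, and merge the result into the expanded destination. After a 1-bit of b has been used, we remove it from the mask, and we iterate until the mask becomes zero.

To avoid shifting the i-th bit of a into place, we simply isolate it and then replicate its value into all more significant bits by simple negation, taking advantage of the two's-complement representation of integers.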
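The two tricks just described (isolating the mask's lowest 1-bit, and spreading a single bit upward by negation) can be sketched in isolation as follows. This is only a host-side illustration: the helper names `lowest_set_bit` and `spread_upward` are hypothetical and do not appear in the answer's code, and 32-bit operands are assumed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: isolate the least significant 1-bit of x (0 if x == 0).
// In two's complement, 0 - x inverts every bit above the lowest 1-bit and
// keeps that bit, so the AND leaves only the lowest 1-bit.
inline std::uint32_t lowest_set_bit(std::uint32_t x) {
    return x & (0u - x);
}

// Hypothetical helper: given a value with at most one bit set, return a word
// in which that bit and all more significant bits are set (0 stays 0).
inline std::uint32_t spread_upward(std::uint32_t single_bit) {
    assert((single_bit & (single_bit - 1u)) == 0u); // at most one bit set
    return 0u - single_bit;
}
```

For example, with mask bit l = 0x20 and bit 2 of a isolated as 0x4, spread_upward(0x4) is 0xFFFFFFFC, so l & 0xFFFFFFFC yields 0x20 and the bit is deposited. Had that bit of a been 0, the AND would yield 0. This relies on the i-th 1-bit of the mask never sitting below bit position i.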

/* Emulate PDEP: deposit the bits of 'a' (starting with the least significant 
   bit) at the positions indicated by the set bits of the mask stored in 'b'.
*/
__device__ unsigned int my_pdep (unsigned int a, unsigned int b)
{
    unsigned int l, s, r = 0;
    int i;
    for (i = 0; b; i++) { // iterate over 1-bits in mask, until mask becomes 0
        l = b & (0 - b); // extract mask's least significant 1-bit
        b = b ^ l; // clear mask's least significant 1-bit
        s = 0 - (a & (1U << i)); // spread i-th bit of 'a' to more signif. bits
        r = r | (l & s); // deposit i-th bit of 'a' at position of mask's 1-bit
    }
    return r;
}

The variant mentioned above, without any shift operations, looks like this:

/* Emulate PDEP: deposit the bits of 'a' (starting with the least significant 
   bit) at the positions indicated by the set bits of the mask stored in 'b'.
*/
__device__ unsigned int my_pdep (unsigned int a, unsigned int b)
{
    unsigned int l, s, r = 0, m = 1;
    while (b) { // iterate over 1-bits in mask, until mask becomes 0
        l = b & (0 - b); // extract mask's least significant 1-bit
        b = b ^ l; // clear mask's least significant 1-bit
        s = 0 - (a & m); // spread i-th bit of 'a' to more significant bits
        r = r | (l & s); // deposit i-th bit of 'a' at position of mask's 1-bit
        m = m + m; // mask for next bit of 'a'
    }
    return r;
}
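A comment on the question suggests using templates to cover other operand widths. A minimal sketch of that idea follows, written as host-side C++ so it can be checked on a CPU (under CUDA one would presumably add `__host__ __device__` qualifiers). The names `pdep_emulated` and `check_question_examples` are hypothetical and not part of the original answer; the explicit casts are an assumption made here to keep narrow types such as 8-bit and 16-bit operands correct after integer promotion.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Sketch: the shift-free loop from above, templated over any unsigned type.
template <typename T>
T pdep_emulated(T a, T b) {
    static_assert(std::is_unsigned<T>::value, "operands must be unsigned");
    T r = 0;
    T m = 1;                                         // selects the next bit of 'a'
    while (b) {                                      // one iteration per 1-bit of the mask
        T l = static_cast<T>(b & static_cast<T>(0 - b)); // mask's lowest 1-bit
        b = static_cast<T>(b ^ l);                   // clear it from the mask
        T s = static_cast<T>(0 - (a & m));           // spread selected bit of 'a' upward
        r = static_cast<T>(r | (l & s));             // deposit it at the mask position
        m = static_cast<T>(m + m);                   // advance to the next bit of 'a'
    }
    return r;
}

// The two examples from the question, at two different operand widths.
inline void check_question_examples() {
    assert(pdep_emulated<std::uint32_t>(0x6u, 0x2eu) == 0x0cu);
    assert(pdep_emulated<std::uint8_t>(0x0d, 0x99) == 0x91);
}
```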

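In the comments below, @Evgeny Kluev points to a shift-free PDEP emulation on the chessprogramming website, which looks like it may be faster than either of my two implementations above; it seems worth a try.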
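For comparison, a shift-free loop in the same spirit can also be written with a conditional in place of the negation trick. The sketch below is an illustration only and is not a quotation of the chessprogramming code: the names `pdep_conditional` and `check_reported_results` are hypothetical, and it is plain host-side C++ assuming 32-bit operands.

```cpp
#include <cassert>
#include <cstdint>

// Sketch: walk the mask's 1-bits from the least significant end; 'm' selects
// successive bits of 'a' and is doubled by addition, so no shift is needed.
inline std::uint32_t pdep_conditional(std::uint32_t a, std::uint32_t b) {
    std::uint32_t r = 0;
    for (std::uint32_t m = 1; b != 0; m += m) {
        if (a & m) {
            r |= b & (0u - b);   // deposit at the mask's lowest 1-bit
        }
        b &= b - 1u;             // clear the mask's lowest 1-bit
    }
    return r;
}

// The two results reported in the comments on the question.
inline void check_reported_results() {
    assert(pdep_conditional(0x6u, 0x2eu) == 0x0cu);
    assert(pdep_conditional(0xdu, 0x99u) == 0x91u);
}
```

Whether this form is actually faster on a given GPU would need to be measured, since the conditional may diverge across threads.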

【Comments】:

Almost the same algorithm is posted on chessprogramming, but they do not use shift instructions at all.