Efficiently randomly shuffling the bits of a sequence of words: answers

【Question title】: Efficiently randomly shuffling the bits of a sequence of words
【Posted】: 2019-12-10 11:24:59
【Question description】:

Consider the following algorithm from the C++ standard library: std::shuffle, which has the following signature:

template <class RandomIt, class URBG>
void shuffle(RandomIt first, RandomIt last, URBG&& g);

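It reorders the elements in the given range [first, last) such that each possible permutation of those elements has equal probability of appearance.

I am trying to implement the same algorithm, but working at the bit level, randomly shuffling the bits of the words of the input sequence. Given a sequence of 64-bit words, I am trying to implement: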

template <class URBG>
void bit_shuffle(std::uint64_t* first, std::uint64_t* last, URBG&& g);

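Question: how can this be done as efficiently as possible (using compiler intrinsics if necessary)? I am not necessarily looking for a complete implementation, but rather for suggestions and directions of research, because it is not clear to me whether implementing this efficiently is feasible at all.

【Question discussion】:

I would think of this not as a shuffle, but as generating a random bit sequence, packed into a uint64 array, where the numbers of 1 and 0 bits equal those of the input. -- I am not saying this necessarily makes the task easier, but I think it might.
How about using std::bitset together with 500's idea? I don't know about the performance, but I like std::bitset ;)
@Demolishun No, that would favor some permutations over others (if I understand correctly). E.g. if you have two words, one with all bits set and one with none set, you would only get two possible permutations.
@Justin Yes, that is exactly what I want to avoid.
One point that may open up more optimization opportunities is that you don't need all possible permutations, only all possible combinations. The order doesn't matter, because one bit is indistinguishable from another of the same value.

Tags: c++ algorithm optimization random bit-manipulation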

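【Solution 1】:

Asymptotically, the running time is clearly O(N), where N is the number of bits. Our goal is to improve the constants involved.

Disclaimer: the description of the proposed algorithm is a rough sketch. Many things need to be added, and in particular many details need attention to make it work correctly. The estimated execution time, however, should not differ from what is claimed here.

Baseline algorithm

The most obvious one is the textbook approach, which takes N operations. Each operation calls random_generator, which takes R milliseconds, and reads the values of two different bits and writes new values to them, for a total of 4 * A milliseconds (A is the time to read or write one bit). Assume an array lookup takes C milliseconds. The total time of this algorithm is then approximately N * (R + 4 * A + 2 * C) milliseconds. It is also reasonable to assume that random number generation takes much more time, i.e. R >> A == C.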
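For reference, the baseline can be sketched as a Fisher-Yates shuffle over individual bit positions. This is only an illustrative sketch: the helper names get_bit, set_bit and bit_shuffle_baseline are assumptions introduced here, not part of the answer. Each iteration performs the one random draw (R), two bit reads and two bit writes (4 * A) counted in the estimate above.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>

// Hypothetical helper: read bit i of the word array w.
inline bool get_bit(const std::uint64_t* w, std::size_t i) {
    return (w[i / 64] >> (i % 64)) & 1u;
}

// Hypothetical helper: write bit i of the word array w.
inline void set_bit(std::uint64_t* w, std::size_t i, bool b) {
    const std::uint64_t mask = std::uint64_t{1} << (i % 64);
    w[i / 64] = b ? (w[i / 64] | mask) : (w[i / 64] & ~mask);
}

// Baseline sketch: Fisher-Yates over bit positions,
// one random draw and one bit swap per position.
template <class URBG>
void bit_shuffle_baseline(std::uint64_t* first, std::uint64_t* last, URBG&& g) {
    const std::size_t n = static_cast<std::size_t>(last - first) * 64;
    for (std::size_t i = n; i > 1; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        const std::size_t j = pick(g);          // random partner in [0, i-1]
        const bool a = get_bit(first, i - 1);
        const bool b = get_bit(first, j);
        set_bit(first, i - 1, b);               // swap the two bits
        set_bit(first, j, a);
    }
}
```

Whatever the random source, the number of set bits is unchanged, which gives a simple sanity check for any faster replacement.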

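Proposed algorithm

Assume the bits are stored in byte storage, i.e. we will work with blocks of bytes.

constexpr std::size_t field_size = N / 8;  // N is the number of bits
unsigned char bit_field[field_size];

First, let's count the number of 1 bits in our bitset. To do that, we can use a lookup table and iterate through the bitset as a byte array: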

// Generate the lookup table; you may rewrite it with `constexpr`
// so that it is computed at compile time.
int bitcount_lookup[256];
for (int i = 0; i < 256; ++i) {
  bitcount_lookup[i] = 0;
  for (int b = 0; b < 8; ++b)
    bitcount_lookup[i] += (i >> b) & 1;
}

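We can treat this as preprocessing overhead (since it can also be computed at compile time) and say that it takes 0 milliseconds. Counting the 1 bits is now easy (the following takes (N / 8) * C milliseconds):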

int bitcount = 0;
for (auto *it = bit_field; it != bit_field + field_size; ++it)
  bitcount += bitcount_lookup[*it];

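Now we randomly generate N / 8 numbers (call the resulting array gencnt[N / 8]), each in the range [0..8], such that they sum to bitcount. This is a bit tricky and hard to do uniformly (a "correct" algorithm for generating a uniform distribution is rather slow compared to the baseline algorithm). A fairly uniform but fast solution is roughly:

Fill the gencnt[N / 8] array with the value v = bitcount / (N / 8).
Randomly choose N / 16 "black" cells. The rest are "white". The algorithm is similar to a random permutation, but over only half of the array.
Generate N / 16 random numbers in the range [0..v]. Let's call them tmp[N / 16].
Increase the "black" cells by the tmp[i] values and decrease the "white" cells by tmp[i]. This ensures that the sum stays at bitcount.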
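The four steps above can be sketched roughly as follows. The function name make_gencnt and its parameters are assumptions made for illustration. Two further assumptions are not in the answer: each offset is capped at min(v, 8 - v) so that every count stays within [0..8], and the remainder lost by the integer division bitcount / cells is not redistributed, so this sketch sums to v * cells, which equals bitcount only when the division is exact.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: per-byte counts of 1 bits, following the four steps.
template <class URBG>
std::vector<int> make_gencnt(std::size_t cells, int bitcount, URBG&& g) {
    // Step 1: fill every cell with the average count v.
    const int v = bitcount / static_cast<int>(cells);
    std::vector<int> gencnt(cells, v);

    // Step 2: random coloring. After shuffling the indices, the first
    // half is treated as "black" and the second half as "white".
    std::vector<std::size_t> idx(cells);
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    std::shuffle(idx.begin(), idx.end(), g);
    const std::size_t half = cells / 2;

    // Steps 3 and 4: move a random amount from a white cell to a black cell.
    // Assumption: the offset is capped so that counts stay in [0, 8].
    std::uniform_int_distribution<int> offset(0, std::min(v, 8 - v));
    for (std::size_t i = 0; i < half; ++i) {
        const int d = offset(g);
        gencnt[idx[i]] += d;          // black cell
        gencnt[idx[half + i]] -= d;   // white cell
    }
    return gencnt;
}
```

Because every offset is added to one cell and subtracted from another, the total is preserved by construction.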

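After that, we have a roughly uniform random array gencnt[N / 8], whose values are the numbers of 1 bits in each "cell" (byte). All of it is generated in: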

(N / 8) * C   +  (N / 16) * (4 * C)  +  (N / 16) * (R + 2 * C)
^^^^^^^^^^^^     ^^^^^^^^^^^^^^^^^^     ^^^^^^^^^^^^^^^^^^^^^^
filling step      random coloring              filling

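milliseconds (this estimate was made with a concrete implementation in mind). Finally, we can build a lookup table of the bytes that have a given number of bits set to 1 (this can be treated as precomputed overhead, or even built at compile time with constexpr, so let's assume it takes 0 milliseconds):

// One bucket per possible count 0..8, hence 9 buckets.
std::vector<std::vector<unsigned char>> random_lookup(9);
for (int value = 0; value < 256; value++)
  random_lookup[bitcount_lookup[value]].push_back(static_cast<unsigned char>(value));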

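Then we can fill our bit_field as follows (this takes roughly (N / 8) * (R + 3 * C) milliseconds):

for (int i = 0; i < field_size; i++) {
  // Pick a random byte among those with exactly gencnt[i] bits set.
  const auto& candidates = random_lookup[gencnt[i]];
  bit_field[i] = candidates[rand() % candidates.size()];
}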

Summing everything up, we get the total execution time:
T = (N / 8) * C +
    (N / 8) * C + (N / 16) * (4 * C) + (N / 16) * (R + 2 * C) + 
    (N / 8) * (R + 3 * C)

  = N * (C + (3/16) * R)  <  N * (R + 4 * A + 2 * C)
    ^^^^^^^^^^^^^^^^^^^^     ^^^^^^^^^^^^^^^^^^^^^^^
     proposed algorithm        naive baseline algo
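Although it is not truly uniformly random, it does spread the bits fairly evenly and randomly, and it is quite fast, so hopefully it will do the job in your use case.

【Discussion】:

There are many optimization opportunities for counting the set bits. std::accumulate/reduce with __builtin_popcount is one simple way to implement it, or a SIMD approach might work.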
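A minimal sketch of the std::accumulate variant mentioned in this comment, applied to 64-bit words. The function name count_set_bits is an assumption for illustration. __builtin_popcountll is the 64-bit form of the GCC/Clang builtin; with C++20, std::popcount from <bit> would be a portable alternative.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>

// Hypothetical helper: total number of set bits in the words [first, last).
// Relies on the GCC/Clang builtin __builtin_popcountll.
inline std::size_t count_set_bits(const std::uint64_t* first,
                                  const std::uint64_t* last) {
    return std::accumulate(
        first, last, std::size_t{0},
        [](std::size_t acc, std::uint64_t word) {
            return acc + static_cast<std::size_t>(__builtin_popcountll(word));
        });
}
```

On targets with a hardware population-count instruction, compilers typically lower the builtin to that instruction when the matching target flags are enabled.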

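【Solution 2】:

Observe that actually shuffling the bits through Fisher-Yates swaps is not necessary to produce an exactly equivalent random distribution of bits.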

#include <iostream>
#include <vector>
#include <random>

// shuffle a vector of bools. This requires only counting the number of trues in the vector
// followed by clearing the vector and inserting bool trues to produce an equivalent to
// a bit shuffle. This is cache line friendly and doesn't require swapping.
std::vector<bool> DistributeBitsRandomly(std::vector<bool> bvector)
{
    std::random_device rd;
    static std::mt19937 gen(rd());  //mersenne_twister_engine seeded with rd()

    // count the number of set bits and clear bvector
    int set_bits_count = 0;
    for (int i=0; i < bvector.size(); i++)
        if (bvector[i])
        {
            set_bits_count++;
            bvector[i] = 0;
        }

    // set a bit if a random value in range bvector.size()-bit_loc-1 is
    // less than the number of bits remaining to be placed. This produces exactly the same
    // distribution as a random shuffle but only does an insertion of a 1 bit rather than
    // a swap. It requires counting the number of 1 bits. There are efficient ways
    // of doing this. See https://stackoverflow.com/questions/109023/how-to-count-the-number-of-set-bits-in-a-32-bit-integer
    for (int bit_loc = 0; set_bits_count; bit_loc++)
    {
        std::uniform_int_distribution<int> dis(0, bvector.size()-bit_loc-1);
        auto x = dis(gen);
        if (x < set_bits_count)
        {
            bvector[bit_loc] = true;
            set_bits_count--;
        }
    }
    return bvector;
}

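This is equivalent to shuffling the bools in a vector<bool>. It is cache-line friendly and involves no swapping. It is presented as a runnable but simple algorithm, as the OP asked. A lot can be done to optimize it, such as speeding up the bit counting and the clearing of the array.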
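The same idea can be applied directly to the word array from the question. The following is a sketch only, not the answerer's code: the name bit_shuffle_by_count is an assumption, and __builtin_popcountll is a GCC/Clang builtin. It counts and clears the set bits, then walks the bit positions once, setting each with probability (ones left) / (positions left).

```cpp
#include <cstddef>
#include <cstdint>
#include <random>

// Sketch: count the 1 bits, clear the words, then place the 1 bits again
// by sequential selection instead of swapping.
template <class URBG>
void bit_shuffle_by_count(std::uint64_t* first, std::uint64_t* last, URBG&& g) {
    // Count the set bits and clear every word.
    std::size_t ones = 0;
    for (std::uint64_t* p = first; p != last; ++p) {
        ones += static_cast<std::size_t>(__builtin_popcountll(*p));
        *p = 0;
    }

    // Number of bit positions not yet visited.
    std::size_t remaining = static_cast<std::size_t>(last - first) * 64;

    for (std::uint64_t* p = first; p != last && ones > 0; ++p) {
        for (unsigned b = 0; b < 64 && ones > 0; ++b, --remaining) {
            // Set this position with probability ones / remaining.
            std::uniform_int_distribution<std::size_t> pick(0, remaining - 1);
            if (pick(g) < ones) {
                *p |= std::uint64_t{1} << b;
                --ones;
            }
        }
    }
}
```

This still draws one random number per visited bit position, so it mainly saves the swaps and the random memory accesses rather than the cost of random number generation.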

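The following sets 4 bits out of 10, calls the "shuffle" routine 100,000 times, and prints how many times a 1 bit appeared in each of the 10 positions. Each position should come out at around 40,000.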

int main()
{
    std::vector<bool> initial{ 1,1,1,1,0,0,0,0,0,0 };
    std::vector<int> totals(initial.size());
    for (int i = 0; i < 100000; i++)
        {
        auto a_distribution = DistributeBitsRandomly(initial);
        for (int ii = 0; ii < totals.size(); ii++)
            if (a_distribution[ii])
                totals[ii]++;
        }
    for (auto cnt : totals)
        std::cout << cnt << "\n";
}

Possible output:

【Discussion】: