Iteratively generating a permutation of natural numbers: answers

【Question title】: Iteratively generating a permutation of natural numbers
【Posted】: 2018-07-18 23:11:48
【Question description】:

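I have a somewhat unusual question that may or may not have been asked before (I didn't find anything, but I might just be searching for the wrong buzzwords).

My task is quite simple: given the "list" of natural numbers up to N, [0, 1, 2, ..., N - 1], I want to shuffle this sequence. E.g. when I input the number 4, one possible result would be [3, 0, 1, 2]. The randomness should be determined by some seed (but that is standard for PRNGs in most common languages).

The naive approach would be to just instantiate an array of size N, fill it with the numbers, and apply any shuffling algorithm.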
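For reference, the naive approach could look like the following minimal Python sketch. The function name `shuffled_range` is an illustrative assumption; the point is only that the whole list is held in memory at once.

```python
import random

def shuffled_range(n, seed):
    """Return a seeded random permutation of range(n); uses O(n) memory."""
    rng = random.Random(seed)   # independent, reproducible random stream
    values = list(range(n))     # materialises all n numbers at once
    rng.shuffle(values)         # shuffles the list in place
    return values

print(shuffled_range(4, seed=42))
```

The same seed always yields the same permutation, but the list itself grows linearly with n.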

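The problem, however, is that the memory complexity of this approach is O(n), which is intractable in my special case. My idea is to write a generator that iteratively provides the numbers of the resulting list.

More precisely, I want some "algorithm" that provides the numbers one at a time. A conceptual class would look like this: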

class Generator {
   // some state
   int nextNumber(...) {
      // some magic
   }
}

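Calling the nextNumber method repeatedly provides the numbers of the sequence (i.e. any permutation of [0, 1, ..., N - 1]). Of course, the state of this generator instance should have a memory complexity better than O(n) (otherwise I would gain nothing).

Is there any algorithm that does what I want?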

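【Question discussion】:

I thought of primitive roots modulo a prime. The order might not be random enough for your needs, though.
I also thought about just using a hash function and returning hash(x++) at each step, but that could lead to collisions, and I want an exact permutation...
en.m.wikipedia.org/wiki/Format-preserving_encryption
FWIW, I've made some improvements to the code in my answer and added some more information that you may find useful.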

Tags: algorithm permutation

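【Solution 1】:

Here's a fairly simple Python 3 implementation of Format-preserving encryption using a balanced Feistel network, which I wrote about 2 years ago. It can perform the index permutation you want for N up to 2⁶⁴ on a 32-bit system, or 2¹²⁸ on a 64-bit build of Python. This is due to the size of the integers returned by the hash() function. See sys.hash_info to find the limits for your system. It wouldn't be hard to use a fancier hash function that returns values with a larger bit length, but I didn't want to make this code more complicated or slower.

Update

I've made a few minor improvements to the previous version and added more information in the comments. Instead of using the low bits returned by the hash function we now use the high bits, which generally improves randomness, especially for short bit lengths. I've also added another hash function, xxhash by Yann Collet, which works much better than Python's hash in this application, especially for shorter bit lengths, although it is a little slower. The xxhash algorithm has a much higher avalanche effect than the built-in hash, so the resulting permutations tend to be more thoroughly shuffled.

Although this code works for small values of stop, it is better suited to stop >= 2**16. If you need to permute smaller ranges, it's probably a good idea to just use random.shuffle on list(range(stop)). It will be faster, and it won't use that much RAM: list(range(2**16)) consumes about 1280 kilobytes on a 32-bit machine.

You'll notice that I use a string to seed the random number generator. For this application we want the randomizer to have plenty of entropy, and using a large string (or bytes) is an easy way to do that, as the random module docs mention. Even so, this program can only produce a small fraction of all the possible permutations when stop is large. For stop == 35 there are 35! (35 factorial) different permutations, and 35! > 2¹³², but the total bit length of our keys is only 128, so they can't cover all those permutations. We could increase the number of Feistel rounds to get a bit more coverage, but obviously that's impractical for large values of stop.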
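The arithmetic behind that key-space argument can be checked directly. This is a small sketch; `KEY_BITS` and `smallest_uncoverable` are names introduced here for illustration only.

```python
from math import factorial

KEY_BITS = 4 * 32  # four 32-bit round keys, as in the code below

def smallest_uncoverable(key_bits):
    """Smallest n for which n! exceeds the number of distinct keys."""
    n = 1
    while factorial(n) <= 2 ** key_bits:
        n += 1
    return n

# 35! needs more than 132 bits, so 2**128 keys cannot reach every permutation.
print(factorial(35) > 2 ** 132)
print(smallest_uncoverable(KEY_BITS))
```

With 128 key bits, stop == 35 is the first size at which some permutations are necessarily unreachable.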

''' Format preserving encryption using a Feistel network

    This code is *not* suitable for cryptographic use.

    See https://en.wikipedia.org/wiki/Format-preserving_encryption
    https://en.wikipedia.org/wiki/Feistel_cipher
    http://security.stackexchange.com/questions/211/how-to-securely-hash-passwords

    A Feistel network performs an invertible transformation on its input,
    so each input number produces a unique output number. The network operates
    on numbers of a fixed bit width, which must be even, i.e., the numbers
    a particular network operates on are in the range(4**k), and it outputs a
    permutation of that range.

    To permute a range of general size we use cycle walking. We set the
    network size to the next higher power of 4, and when we produce a number
    higher than the desired range we simply feed it back into the network,
    looping until we get a number that is in range.

    The worst case is when stop is of the form 4**k + 1, where we need 4
    steps on average to reach a valid n. In the typical case, where stop is
    roughly halfway between 2 powers of 4, we need 2 steps on average.

    Written by PM 2Ring 2016.08.22
'''

from random import Random

# xxhash by Yann Collet. Specialised for a 32 bit number
# See http://fastcompression.blogspot.com/2012/04/selecting-checksum-algorithm.html

def xxhash_num(n, seed):
    n = (374761397 + seed + n * 3266489917) & 0xffffffff
    n = ((n << 17 | n >> 15) * 668265263) & 0xffffffff
    n ^= n >> 15
    n = (n * 2246822519) & 0xffffffff
    n ^= n >> 13
    n = (n * 3266489917) & 0xffffffff
    return n ^ (n >> 16)

class FormatPreserving:
    """ Invertible permutation of integers in range(stop), 0 < stop <= 2**64
        using a simple Feistel network. NOT suitable for cryptographic purposes.
    """
    def __init__(self, stop, keystring):
        if not 0 < stop <= 1 << 64:
            raise ValueError('stop must satisfy 0 < stop <= 2**64, got {}'.format(stop))

        # The highest number in the range
        self.maxn = stop - 1

        # Get the number of bits in each part by rounding
        # the bit length up to the nearest even number
        self.shiftbits = -(-self.maxn.bit_length() // 2)
        self.lowmask = (1 << self.shiftbits) - 1
        self.lowshift = 32 - self.shiftbits

        # Make 4 32 bit round keys from the keystring.
        # Create an independent random stream so we
        # don't interfere with the default stream.
        stream = Random()
        stream.seed(keystring)
        self.keys = [stream.getrandbits(32) for _ in range(4)]
        self.ikeys = self.keys[::-1]

    def feistel(self, n, keys):
        # Split the bits of n into 2 parts & perform the Feistel
        # transformation on them.
        left, right = n >> self.shiftbits, n & self.lowmask
        for key in keys:
            left, right = right, left ^ (xxhash_num(right, key) >> self.lowshift)
            #left, right = right, left ^ (hash((right, key)) >> self.lowshift) 
        return (right << self.shiftbits) | left

    def fpe(self, n, reverse=False):
        keys = self.ikeys if reverse else self.keys
        while True:
            # Cycle walk, if necessary, to ensure n is in range.
            n = self.feistel(n, keys)
            if n <= self.maxn:
                return n

def test():
    print('Shuffling a small number')
    maxn = 10
    fpe = FormatPreserving(maxn, 'secret key string')
    for i in range(maxn):
        a = fpe.fpe(i)
        b = fpe.fpe(a, reverse=True)
        print(i, a, b)

    print('\nShuffling a small number, with a slightly different keystring')
    fpe = FormatPreserving(maxn, 'secret key string.')
    for i in range(maxn):
        a = fpe.fpe(i)
        b = fpe.fpe(a, reverse=True)
        print(i, a, b)

    print('\nHere are a few values for a large maxn')
    maxn = 10000000000000000000
    print('maxn =', maxn)
    fpe = FormatPreserving(maxn, 'secret key string')
    for i in range(10):
        a = fpe.fpe(i)
        b = fpe.fpe(a, reverse=True)
        print('{}: {:19} {}'.format(i, a, b))

    print('\nUsing a set to test that there are no collisions...')
    maxn = 100000
    print('maxn', maxn)
    fpe = FormatPreserving(maxn, 'secret key string')
    a = {fpe.fpe(i) for i in range(maxn)}
    print(len(a) == maxn)

    print('\nTesting that the operation is bijective...')
    for i in range(maxn):
        a = fpe.fpe(i)
        b = fpe.fpe(a, reverse=True)
        assert b == i, (i, a, b)
    print('yes')

if __name__ == "__main__":
    test()

Output

Shuffling a small number
0 4 0
1 2 1
2 5 2
3 9 3
4 1 4
5 3 5
6 7 6
7 0 7
8 6 8
9 8 9

Shuffling a small number, with a slightly different keystring
0 9 0
1 8 1
2 3 2
3 5 3
4 2 4
5 6 5
6 1 6
7 4 7
8 7 8
9 0 9

Here are a few values for a large maxn
maxn = 10000000000000000000
0: 7071024217413923554 0
1: 5613634032642823321 1
2: 8934202816202119857 2
3:  296042520195445535 3
4: 5965959309128333970 4
5: 8417353297972226870 5
6: 7501923606289578535 6
7: 1722818114853762596 7
8:  890028846269590060 8
9: 8787953496283620029 9

Using a set to test that there are no collisions...
maxn 100000
True

Testing that the operation is bijective...
yes
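
The cycle-walking cost mentioned in the docstring can be estimated with a little arithmetic. Summed over all in-range inputs, cycle walking visits each element of the Feistel domain at most once, so the average number of feistel() calls per output is at most the domain size divided by stop. The helper names below are illustrative and not part of the answer's code.

```python
def domain_size(stop):
    """Size of the Feistel domain used for range(stop): a power of 4."""
    half_bits = -(-(stop - 1).bit_length() // 2)  # ceil(bit_length / 2)
    return 4 ** half_bits

def walk_bound(stop):
    """Upper bound on the average number of Feistel calls per output."""
    return domain_size(stop) / stop

for stop in (4 ** 8, 4 ** 8 + 1, 2 * 4 ** 8):
    print(stop, domain_size(stop), round(walk_bound(stop), 3))
```

A stop that is exactly a power of 4 needs no extra steps, while a stop just above a power of 4 approaches the worst case of about 4 calls per output.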

Here's how to use it to make a generator:

def ipermute(stop, keystring):
    fpe = FormatPreserving(stop, keystring)
    for i in range(stop):
        yield fpe.fpe(i)

for i, v in enumerate(ipermute(10, 'secret key string')):
    print(i, v)

Output

0 4
1 2
2 5
3 9
4 1
5 3
6 7
7 0
8 6
9 8

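It's fairly fast (for Python), but it is definitely not suitable for cryptography. It could be made crypto-grade by increasing the number of Feistel rounds to at least 5 and by using a suitable cryptographic hash function, e.g. Blake2. A cryptographic method would also be needed to generate the Feistel keys. Of course, you shouldn't write cryptographic software unless you know exactly what you're doing, since it's too easy to write code that is vulnerable to timing attacks and the like.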
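
As a rough illustration of that direction, a round function built on keyed BLAKE2b from Python's hashlib might look like the sketch below. The function name `blake2_round`, the 8-byte encoding, and the 16-byte keys are assumptions made for this example; this is not a vetted cipher design and it has not been reviewed for side channels.

```python
import hashlib
import secrets

def blake2_round(right, key, out_bits):
    """Hypothetical round function: keyed BLAKE2b of `right`, cut to out_bits.

    `right` must fit in 8 bytes; `key` is a bytes object of at most 64 bytes.
    """
    data = right.to_bytes(8, 'big')
    digest = hashlib.blake2b(data, digest_size=8, key=key).digest()
    # Keep the high bits of the 64-bit digest.
    return int.from_bytes(digest, 'big') >> (64 - out_bits)

# Five round keys from the OS random source. They are not reproducible,
# so they must be stored if the same permutation is needed again.
round_keys = [secrets.token_bytes(16) for _ in range(5)]
print([blake2_round(12345, k, 16) for k in round_keys])
```

Such a function could stand in for xxhash_num inside feistel(), with the right shift by lowshift replaced by passing shiftbits as out_bits.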

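【Discussion】:

【Solution 2】:

What you are looking for is a pseudo-random permutation in the form of a function, say f, that maps the numbers 1 to N onto the numbers 1 to N in a pseudo-random, bijective way. Then, to generate the nth number of the pseudo-random permutation, you just return f(n).

This is essentially the same problem as encryption. A block cipher with a key is a pseudo-random bijective function. If you feed it all the possible plaintext blocks exactly once in some order, it will return all the possible ciphertext blocks exactly once, in a different, pseudo-random order.

So to solve a problem like yours, what you essentially do is create a cipher that works on the numbers from 1 to N instead of 256-bit blocks or whatever. You can use the tools from cryptography to do this.

For example, you can construct your permutation function using a Feistel structure (https://en.wikipedia.org/wiki/Feistel_cipher) like this: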

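1. Let W be floor(sqrt(N)), and let the input to the function be x.
2. If x < W^2, split x into two fields in the range 0 to W-1: h = floor(x/W) and l = x % W. Hash h to produce a value from 0 to W-1, and set l = (l + hash(h)) % W. Then swap the fields: x = l*W + h.
3. x = (x + (N - W^2)) % N
4. Repeat steps (2) and (3) some number of times. The more you do it, the more random the result looks. Step (3) ensures that x < W^2 will be true for many rounds.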

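Since this function consists of a number of steps, each of which maps the numbers 0 to N-1 onto the numbers 0 to N-1 in a bijective way, the whole function also has this property. If you feed it the numbers from 0 to N-1, you'll get them back in a pseudo-random order.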
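
A minimal Python sketch of such a construction follows. It assumes that the Feistel step splits x into a high field h = x // W and a low field l = x % W, adds a hash of h to l modulo W, and then swaps the two fields. The helper `_hash_to`, the `key` parameter, the use of SHA-256, and the default of 8 rounds are illustrative assumptions, not part of the answer.

```python
import hashlib
from math import isqrt

def _hash_to(h, key, round_no, w):
    """Hypothetical keyed hash of h into range(w)."""
    data = '{}:{}:{}'.format(key, round_no, h).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], 'big') % w

def permute(x, n, key, rounds=8):
    """Map x in range(n) to a unique pseudo-random value in range(n)."""
    w = isqrt(n)                        # W = floor(sqrt(N))
    for r in range(rounds):
        if x < w * w:                   # Feistel-style step on the W x W square
            h, l = divmod(x, w)
            l = (l + _hash_to(h, key, r, w)) % w
            x = l * w + h               # swap the two fields
        x = (x + (n - w * w)) % n       # rotate so other values enter the square
    return x

def ipermute(n, key):
    """Yield a permutation of range(n) using constant extra memory."""
    for i in range(n):
        yield permute(i, n, key)

print(list(ipermute(10, 'secret key string')))
```

Each step is invertible on range(n), so the composition is a bijection. How well shuffled the output looks depends on the hash and on the number of rounds.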

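【Discussion】:

【Solution 3】:

I think what you are dealing with here is the rank of a permutation (I may be wrong). I wrote a Rosetta Code task for this, and have answered other SO questions on it here and here.

Does this help?

【Discussion】:

For the second link: do you know what the memory complexity of your algorithm is?
Not sure about your question. Are you talking about getting truly random bits?
No, I mean this: Memory Complexity?
Sorry for the confusion. What I'm asking is how much memory it takes to generate a particular permutation. Does it depend on the size of the input? I ask because I had posted an answer (now deleted) that does exactly what you suggest (i.e. it generates the nth lexicographic permutation). In terms of memory complexity, my algorithm is O(n). The OP is asking for a solution that uses less than O(n).
Hmm. I think the Q. may have been clarified since, and the above would fail too.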