How to shuffle a 2D binary matrix, preserving marginal distributions - Answers

【Question title】: How to shuffle a 2d binary matrix, preserving marginal distributions
【Posted】: 2020-09-23 17:54:08
【Question description】:

Suppose I have an (n*m) binary matrix df, similar to the following:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.binomial(1, .3, size=(6,8)))

    0   1   2   3   4   5   6   7
   ------------------------------
0 | 0   0   0   0   0   1   1   0
1 | 0   1   0   0   0   0   0   0
2 | 0   0   0   0   1   0   0   0
3 | 0   0   0   0   0   1   0   1
4 | 0   1   1   0   1   0   0   0
5 | 1   0   1   1   1   0   0   1

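I'd like to shuffle the values in the matrix to create a new_df of the same shape, such that both marginal distributions stay the same, for example: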

    0   1   2   3   4   5   6   7
   ------------------------------
0 | 0   0   0   0   1   0   0   1
1 | 0   0   0   0   1   0   0   0
2 | 0   0   0   0   0   0   0   1
3 | 0   1   1   0   0   0   0   0
4 | 1   0   0   0   1   1   0   0
5 | 0   1   1   1   0   1   1   0

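In the new matrix, the sum of each row equals the sum of the corresponding row in the original matrix, and likewise each column of the new matrix has the same sum as the corresponding column of the original.

A candidate solution is easy to check: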

# rows have the same marginal distribution
assert(all(df.sum(axis=1) == new_df.sum(axis=1)))  

# columns have the same marginal distribution
assert(all(df.sum(axis=0) == new_df.sum(axis=0)))

If n*m is small, I can use a brute-force approach to the shuffle:

def shuffle_2d(df):
    """Shuffles a multidimensional binary array, preserving marginal distributions"""
    # get a list of indices where the df is 1
    rowlist = []
    collist = []
    for i_row, row in df.iterrows():
        for i_col, val in row.items():  # Series.iteritems() was removed in pandas 2.0
            if df.loc[i_row, i_col] == 1:
                rowlist.append(i_row)
                collist.append(i_col)

    # create an empty df of the same shape
    new_df = pd.DataFrame(index=df.index, columns=df.columns, data=0)

    # shuffle until you get no repeat coordinates 
    # this is so you don't increment the same cell in the matrix twice
    repeats = 999
    while repeats > 1:
        pairs = list(zip(np.random.permutation(rowlist), np.random.permutation(collist)))
        # count how often each (row, col) pair occurs; top-level pd.value_counts is deprecated
        repeats = pd.Series(pairs).value_counts().max()

    # populate new data frame at indicated points
    for i_row, i_col in pairs:
        new_df.at[i_row, i_col] += 1

    return new_df

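The problem is that the brute-force approach scales poorly. (As in that line from Indiana Jones and the Last Crusade: https://youtu.be/Ubw5N8iVDHI?t=3)

As a quick demo, for an n*n matrix, the number of attempts needed to get an acceptable shuffle looks like this (from a single run):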

n   attempts
2   1
3   2
4   4
5   1
6   1
7   11
8   9
9   22
10  4416
11  800
12  66
13  234
14  5329
15  26501
16  27555
17  5932
18  668902
...
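
To see the poor scaling concretely, the rejection loop can be rewritten on a plain NumPy array so that the attempts are counted directly. This is a rough sketch, not the code that produced the table above; the names count_attempts, rng and max_attempts are assumptions introduced here, and the seed is arbitrary.

```python
import numpy as np

def count_attempts(m, rng, max_attempts=100_000):
    """Count rejection-sampling attempts until no (row, col) pair repeats.

    Returns None if no collision-free pairing is found within max_attempts.
    """
    m = np.asarray(m)
    rows, cols = np.nonzero(m)  # coordinates of the 1s
    for attempt in range(1, max_attempts + 1):
        # independently permute the row and column labels, as in shuffle_2d
        r = rng.permutation(rows)
        c = rng.permutation(cols)
        # encode each (row, col) pair as one integer so repeats are easy to spot
        flat = r * m.shape[1] + c
        if np.unique(flat).size == flat.size:
            return attempt
    return None

rng = np.random.default_rng(0)  # seed chosen arbitrarily
for n in (4, 6, 8):
    m = rng.binomial(1, 0.3, size=(n, n))
    print(n, count_attempts(m, rng))
```

Every 1 in the matrix has to land on a distinct cell at once, so the chance that a single attempt succeeds shrinks quickly as the number of 1s grows.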

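Is there a straightforward solution that preserves the exact marginal distributions (or tells you that no other pattern preserving those distributions is possible)?

As a fallback, I could also use an approximation algorithm that minimizes the sum of squared errors on each row.

Thanks! =)

Edit: For some reason I didn't find the existing answers before I wrote this question, but after posting it they all showed up in the sidebar:

Is it possible to shuffle a 2D matrix while preserving row AND column frequencies?

Randomize matrix in perl, keeping row and column totals the same

Sometimes all you need to do is ask...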

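【Question discussion】:

Could you provide an example of what the shuffled table should look like? And the data structure of the input? (Is it an array of arrays of numbers?)
Sure - I edited the original post to include an example of how the original might be transformed. The first line of code, df = pd.DataFrame(np.random.binomial(1, .3, size=(6,8))), generates the data as a pandas DataFrame, although it could just as easily be a 2D numpy array.
My attempt would be based on exact solvers from the discrete-optimization domain: CP, SAT, or integer programming. They are all sound and complete, and can reason about infeasibility. Now, obtaining uniformity is not that trivial, and the approach would depend on how important uniformity is (and on the size of the matrix). Options range from trivial approaches, e.g. ignoring it and hoping it is good enough, or enumerating X solutions and picking one at random, to using more theory to achieve more uniformity by introducing constraints (e.g. xor) that kill off solutions until we find one with a single sol

Tags: python algorithm shuffle approximation

【Solution 1】: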

Thanks mainly to https://stackoverflow.com/a/2137012/6361632 for the inspiration, here is a solution that seems to work:


def flip1(m):
    """
    Chooses a single (i0, j0) location in the matrix to 'flip'
    Then randomly selects a different (i, j) location that creates
    a quad [(i0, j0), (i0, j), (i, j0), (i, j)] in which flipping every
    element leaves the marginal distributions unaltered.  
    Changes those elements, and returns 1.

    If such a quad cannot be completed from the original position, 
    does nothing and returns 0.
    """
    i0 = np.random.randint(m.shape[0])
    j0 = np.random.randint(m.shape[1])

    level = m[i0, j0]
    flip = 0 if level == 1 else 1  # the opposite value

    for i in np.random.permutation(range(m.shape[0])):  # try in random order
        if (i != i0 and  # don't swap with self
            m[i, j0] != level):  # maybe swap with a cell that holds opposite value
            for j in np.random.permutation(range(m.shape[1])):
                if (j != j0 and  # don't swap with self
                    m[i, j] == level and  # check that other swaps work
                    m[i0, j] != level):
                    # make the swaps
                    m[i0, j0] = flip
                    m[i0, j] = level
                    m[i, j0] = level
                    m[i, j] = flip
                    return 1

    return 0

def shuffle(m1, n=100):
    m2 = m1.copy()
    f_success = np.mean([flip1(m2) for _ in range(n)])

    # f_success is the fraction of flip attempts that succeed, for diagnostics
    #print(f_success)

    # check the answer
    assert(all(m1.sum(axis=1) == m2.sum(axis=1)))
    assert(all(m1.sum(axis=0) == m2.sum(axis=0)))

    return m2

We can call it like this:

m1 = np.random.binomial(1, .3, size=(6,8))

array([[0, 0, 0, 1, 1, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 1, 0, 1],
       [1, 1, 0, 0, 0, 1, 0, 1],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [1, 0, 1, 0, 1, 0, 0, 0]])

m2 = shuffle(m1)

array([[0, 0, 0, 0, 1, 1, 0, 1],
       [1, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [1, 0, 0, 1, 0, 0, 0, 1]])

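How many iterations do we need to reach a steady-state distribution? I've set a default of 100 here, which is enough for these small matrices.

Below I plot the correlation between the original and the shuffled matrix (over 500 runs) for various numbers of iterations.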

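import matplotlib.pyplot as plt

n_iters = 25  # example value (assumed); rerun with other values to compare histograms
corrs = []    # one correlation per run
for _ in range(500):
    m1 = np.random.binomial(1, .3, size=(9,9)) # create starting matrix
    m2 = shuffle(m1, n_iters)  # shuffle() as defined above
    corrs.append(np.corrcoef(m1.flatten(), m2.flatten())[1,0])

plt.hist(corrs, bins=40, alpha=.4, label=n_iters)
plt.legend()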

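For a 9x9 matrix, we see improvement up to about 25 iterations, beyond which we are in steady state.

For an 18x18 matrix, we see small gains going from 100 to 250 iterations, but not much beyond that.

Note that for larger matrices the correlation between the starting and ending matrices is lower, but it takes longer to get there.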
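
The comparison across iteration counts can be sketched as a single sweep. To keep the block self-contained it uses trial_swap_shuffle, a hypothetical stand-in for shuffle() that simply tries random 2x2 swaps; because many of its trials are rejected, its trial counts are not directly comparable to the iteration counts quoted above. All names, the seed, and the chosen trial counts are assumptions.

```python
import numpy as np

def trial_swap_shuffle(m, n_trials, rng):
    """Stand-in shuffler: try n_trials random 2x2 checkerboard swaps."""
    out = np.array(m, copy=True)
    rows = rng.integers(out.shape[0], size=(n_trials, 2))
    cols = rng.integers(out.shape[1], size=(n_trials, 2))
    for (i0, i1), (j0, j1) in zip(rows, cols):
        # Swap only if the 2x2 submatrix is [[a, b], [b, a]] with a != b.
        # Picks with i0 == i1 or j0 == j1 can never pass this test.
        if (out[i0, j0] == out[i1, j1]
                and out[i0, j1] == out[i1, j0]
                and out[i0, j0] != out[i0, j1]):
            out[i0, j0], out[i0, j1] = out[i0, j1], out[i0, j0]
            out[i1, j0], out[i1, j1] = out[i1, j1], out[i1, j0]
    return out

def mean_correlation(n_trials, size=9, p=0.3, n_repeats=50, seed=0):
    """Average correlation between a random matrix and its shuffled copy."""
    rng = np.random.default_rng(seed)
    corrs = []
    for _ in range(n_repeats):
        m1 = rng.binomial(1, p, size=(size, size))
        if m1.min() == m1.max():  # constant matrix: correlation is undefined
            continue
        m2 = trial_swap_shuffle(m1, n_trials, rng)
        corrs.append(np.corrcoef(m1.ravel(), m2.ravel())[0, 1])
    return float(np.mean(corrs))

for n_trials in (0, 50, 200, 800):
    print(n_trials, round(mean_correlation(n_trials), 3))
```

Once the printed mean stops dropping as n_trials grows, further swaps are no longer decorrelating the result from the starting matrix, which is the plateau the histograms are meant to reveal.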

【Discussion】:

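【Solution 2】:

You have to look for two rows and two columns whose intersections give a 2x2 submatrix with 1 0 on top and 0 1 on the bottom (or the reverse). You can then switch these values (to 0 1 and 1 0) without changing any row or column sum.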
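
As a minimal sketch of this idea, the block below enumerates every swappable 2x2 "checkerboard" and applies one swap. The function names checkerboard_quads and swap_quad are hypothetical. By a classical result usually attributed to Ryser, any two 0/1 matrices with the same row and column sums are connected by a sequence of such swaps, so an empty enumeration means no other matrix with these margins exists.

```python
import numpy as np
from itertools import combinations

def checkerboard_quads(m):
    """Yield (i0, i1, j0, j1) with m[i0, j0] = m[i1, j1] = 1 and
    m[i0, j1] = m[i1, j0] = 0, i.e. every swappable 2x2 submatrix."""
    m = np.asarray(m).astype(int)
    for i0, i1 in combinations(range(m.shape[0]), 2):
        diff = m[i0] - m[i1]
        ones_top = np.flatnonzero(diff == 1)      # m[i0, j] = 1, m[i1, j] = 0
        ones_bottom = np.flatnonzero(diff == -1)  # m[i0, j] = 0, m[i1, j] = 1
        for j0 in ones_top:
            for j1 in ones_bottom:
                yield i0, i1, int(j0), int(j1)

def swap_quad(m, quad):
    """Return a copy of m with the four cells of the quad flipped."""
    i0, i1, j0, j1 = quad
    out = np.array(m, copy=True)
    idx = np.ix_([i0, i1], [j0, j1])
    out[idx] = 1 - out[idx]
    return out

m = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0]])
quads = list(checkerboard_quads(m))
if quads:
    print(swap_quad(m, quads[0]))
else:
    print("no other matrix has these row and column sums")
```

Enumerating all quads costs time on large matrices; for shuffling, picking candidate rows and columns at random is usually enough, and the full enumeration is mainly useful for detecting the case where no swap exists.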

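There is even an algorithm for sampling from all possible matrices with the same marginals, developed by Verhelst (2008, link to article page) and implemented in the R package RaschSampler.

A newer algorithm by Wang (2020, link), which is more efficient in some situations, is also available.

【Discussion】: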