Sampling from within Pandas groups with defined probabilities

【Question title】: Sampling from within Pandas groups with defined probabilities
【Posted】: 2018-05-14 23:48:45
【Question description】:

Consider the following Pandas DataFrame:

import pandas as pd

df = pd.DataFrame(
    [
         ['X', 0, 0.5],
         ['X', 1, 0.5],

         ['Y', 0, 0.25],
         ['Y', 1, 0.3],
         ['Y', 2, 0.45],

         ['Z', 0, 0.6],
         ['Z', 1, 0.1],
         ['Z', 2, 0.3]
    ], columns=['NAME', 'POSITION', 'PROB'])

Note that df defines a discrete probability distribution for each unique NAME value, i.e.

assert ((df.groupby('NAME')['PROB'].sum() - 1)**2 < 1e-10).all()

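What I would like to do is sample from these probability distributions.

We can think of POSITION as the value that each probability corresponds to. So for X, a sample is 0 with probability 0.5 and 1 with probability 0.5.

I would like to create a new DataFrame with columns ['NAME', 'POSITION', 'PROB', 'SAMPLE'] that represents these samples. Each unique SAMPLE value represents a new sample. The PROB column is now always 0 or 1, indicating whether the given row was selected in the given sample. For example, if I were to draw 3 samples, one possible result is: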
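
As a minimal sketch of what a single draw means for one group, the X distribution can be sampled directly with NumPy. The seed, the sample size of 1000, and the variable names below are illustrative assumptions, not part of the original question.

```python
import numpy as np

# Seed chosen only so the sketch is reproducible.
rng = np.random.default_rng(0)

positions = np.array([0, 1])   # POSITION values for NAME == 'X'
probs = np.array([0.5, 0.5])   # matching PROB values; they must sum to 1

# Each draw returns one POSITION, chosen with the corresponding probability.
draws = rng.choice(positions, size=1000, p=probs)

# The fraction of draws equal to 1 should be close to 0.5.
fraction_ones = draws.mean()
```
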

df_samples = pd.DataFrame(
    [
         ['X', 0, 1, 0],
         ['X', 1, 0, 0],
         ['X', 0, 0, 1],
         ['X', 1, 1, 1],
         ['X', 0, 1, 2],
         ['X', 1, 0, 2],

         ['Y', 0, 1, 0],
         ['Y', 1, 0, 0],
         ['Y', 2, 0, 0],
         ['Y', 0, 0, 1],
         ['Y', 1, 0, 1],
         ['Y', 2, 1, 1],
         ['Y', 0, 1, 2],
         ['Y', 1, 0, 2],
         ['Y', 2, 0, 2],

         ['Z', 0, 0, 0],
         ['Z', 1, 0, 0],
         ['Z', 2, 1, 0],
         ['Z', 0, 0, 1],
         ['Z', 1, 0, 1],
         ['Z', 2, 1, 1],
         ['Z', 0, 1, 2],
         ['Z', 1, 0, 2],
         ['Z', 2, 0, 2],
    ], columns=['NAME', 'POSITION', 'PROB', 'SAMPLE'])

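Of course, because randomness is involved, this is just one of many possible outcomes.

A unit test for the procedure is that, as the number of samples increases, the law of large numbers implies that the sample mean for each (NAME, POSITION) pair should tend to the true probability. One can compute a confidence region from the total number of samples used and then check that the true probability lies inside it. For example, using the normal approximation to binomial outcomes (which requires the total number of samples n_samples to be "large"), a (-4 sd, 4 sd) region test would be: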

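import numpy as np

z = 4  # half-width of the region, in standard deviations

p_est = df_samples.groupby(['NAME', 'POSITION'])['PROB'].mean()
p_true = df.set_index(['NAME', 'POSITION'])['PROB']

CI_lower = p_est - z*np.sqrt(p_est*(1-p_est)/n_samples)
CI_upper = p_est + z*np.sqrt(p_est*(1-p_est)/n_samples)

# Comparing two Series yields a boolean Series, so reduce with .all() before asserting.
assert (p_true < CI_upper).all()
assert (p_true > CI_lower).all()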
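
The check above can be wrapped in a small helper so it can be reused for different sample sizes. This is only a sketch: the function name `within_confidence_region` is hypothetical, and it assumes df_samples has one 0/1 row per (NAME, POSITION, SAMPLE) as described in the question. The true probabilities are reindexed to match the estimate's index so the two Series are compared label by label.

```python
import numpy as np
import pandas as pd


def within_confidence_region(df, df_samples, n_samples, z=4):
    """Return True if every true probability lies within p_est +/- z standard errors."""
    p_est = df_samples.groupby(['NAME', 'POSITION'])['PROB'].mean()
    # Align the true probabilities to the (NAME, POSITION) order of the estimates.
    p_true = df.set_index(['NAME', 'POSITION'])['PROB'].reindex(p_est.index)
    # Normal approximation to the binomial standard error, scaled by z.
    half_width = z * np.sqrt(p_est * (1 - p_est) / n_samples)
    inside = (p_true > p_est - half_width) & (p_true < p_est + half_width)
    return bool(inside.all())
```

Note that when an estimate is exactly 0 or 1 the half-width collapses to zero, so the check is only meaningful for a large n_samples, as the question already points out.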

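What is the most efficient way to do this in Pandas? I feel like I want to apply some sample function to the df.groupby('NAME') object.

P.S.

To be more explicit, here is a very long-winded way of doing this with Numpy.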

import numpy as np
import pandas as pd

n_samples = 3
df_list = []
for name in ['X', 'Y', 'Z']:
    idx = df['NAME'] == name  # boolean mask selecting this group's rows
    # Draw n_samples positions according to this group's probabilities.
    position_samples = np.random.choice(df.loc[idx, 'POSITION'],
                                        n_samples,
                                        p=df.loc[idx, 'PROB'])
    # One-hot matrix of shape (positions, samples); assumes POSITION runs 0..k-1.
    prob = np.zeros([idx.sum(), n_samples])
    prob[position_samples, np.arange(n_samples)] = 1
    position = np.tile(np.arange(idx.sum())[:, None], n_samples)
    sample = np.tile(np.arange(n_samples)[:,None], idx.sum()).T

    df_list.append(pd.DataFrame(
        [[name, prob.ravel()[i], position.ravel()[i],
          sample.ravel()[i]]
         for i in range(n_samples*idx.sum())],
        columns=['NAME', 'PROB', 'POSITION', 'SAMPLE']))

df_samples = pd.concat(df_list)

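【Question discussion】:

There is a way to do something like this. The problem is your question. You say "the simplest way to explain..." and I disagree. I would like to see a better explanation.
What are you confused about?
@rwolst, unfortunately I don't understand any of your question. Maybe it's me, but maybe not. You may want to edit it to elaborate on your logic.
What does "if I were to choose 3 samples" mean? You mention probabilities but don't give the probability you want. Infinitely many probabilities could all generate a particular combination of 3 samples. I don't understand whether you want a mechanism that generates 3 samples or 1 sample. It would be best to read about the minimal reproducible example and edit your post accordingly.
It could be me. I'll edit once I'm at a computer.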

Tags: python pandas

【Solution 1】:

If I understand correctly, you are looking for groupby + sample, followed by some indexing.

First, sample according to the probabilities:

n_samples = 3
# For each NAME, draw n_samples rows with replacement, weighted by PROB.
df_samples = (
    df.groupby('NAME')
      .apply(lambda x: x[['NAME', 'POSITION']]
                        .sample(n_samples, replace=True, weights=x['PROB']))
      .reset_index(drop=True)
)

Now add the extra columns:

df_samples['SAMPLE'] = df_samples.groupby('NAME').cumcount()
df_samples['PROB'] = 1


print(df_samples)

  NAME  POSITION  SAMPLE  PROB
0    X         1       0     1
1    X         0       1     1
2    X         1       2     1
3    Y         1       0     1
4    Y         1       1     1
5    Y         1       2     1
6    Z         2       0     1
7    Z         0       1     1
8    Z         0       2     1

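Note that this does not include the 0-probability positions for each sample that the original question asked for, but it is a more concise way of storing the information.

If we also want to include the positions with probability 0, we can merge in the other positions as follows: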
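
The compact format still carries enough information to estimate the probabilities. As a small sketch, the empirical frequency of each drawn position can be read off with a grouped `value_counts`. The `compact` frame below is made-up illustrative data in the same shape as the printed output, not a result from the answer.

```python
import pandas as pd

# Illustrative data in the compact format: one row per (NAME, SAMPLE).
compact = pd.DataFrame({
    'NAME':     ['X', 'X', 'X', 'Y', 'Y', 'Y'],
    'POSITION': [1, 0, 1, 1, 1, 2],
    'SAMPLE':   [0, 1, 2, 0, 1, 2],
})

# Fraction of samples in which each POSITION was drawn, per NAME.
# Positions that were never drawn simply do not appear in the result.
freq = compact.groupby('NAME')['POSITION'].value_counts(normalize=True)
```
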

domain = df[['NAME', 'POSITION']].drop_duplicates()
df_samples.drop('PROB', axis=1, inplace=True)
df_samples = pd.merge(df_samples, domain, on='NAME', 
                      suffixes=['_sample', ''])
df_samples['PROB'] = (df_samples['POSITION'] ==
                     df_samples['POSITION_sample']).astype(int)
df_samples.drop('POSITION_sample', axis=1, inplace=True)
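
For comparison, here is a sketch that produces the full 0/1 layout in one pass without a merge. It is not the answerer's code: the function name `sample_groups` and the `seed` parameter are assumptions made for illustration, and it loops over groups with NumPy rather than using `groupby` + `sample`.

```python
import numpy as np
import pandas as pd


def sample_groups(df, n_samples, seed=None):
    """Draw n_samples positions per NAME; return one 0/1 row per (NAME, SAMPLE, POSITION)."""
    rng = np.random.default_rng(seed)
    frames = []
    for name, group in df.groupby('NAME'):
        positions = group['POSITION'].to_numpy()
        # One draw per sample, weighted by this group's PROB column.
        chosen = rng.choice(positions, size=n_samples, p=group['PROB'].to_numpy())
        # hits[s, j] is 1 when sample s selected positions[j]; shape (n_samples, k).
        hits = (chosen[:, None] == positions[None, :]).astype(int)
        frames.append(pd.DataFrame({
            'NAME': name,
            'POSITION': np.tile(positions, n_samples),
            'PROB': hits.ravel(),
            'SAMPLE': np.repeat(np.arange(n_samples), len(positions)),
        }))
    return pd.concat(frames, ignore_index=True)
```

Calling `sample_groups(df, 3)` on the question's df should give a frame with the same columns as the example df_samples, with exactly one PROB equal to 1 for each (NAME, SAMPLE) pair.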

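【Discussion】:

This seems very close, and the sample function does look like what I want. I have edited the answer to match the output format I asked for. As noted, it still does not include the 0-probability positions for each sample. Thinking about it, that is not really a big problem.
Added a final merge that gives the desired output, but I would be very happy for someone better with Pandas than me to clean up the code.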