Is there a Python function for subsampling a large sample set to match the distribution of a variable in another sample? (Answers)

【Question title】: Is there a python function for subsampling a large sample set to match the distribution of a variable in another sample
【Posted】: 2021-12-06 08:16:20
【Question description】:

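I have 2 sample datasets (pandas DataFrames):

df_1 = 700 students
df_2 = 200 students

Each dataframe has the same columns:

student_id
height

I want to subset df_1 so that it also has 200 students, with the same height distribution as the students in df_2. I have the mean, standard deviation, min, and median of the df_2 students, if those can be used somehow.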

【Question discussion】:

Tags: python python-3.x pandas

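【Solution 1】:

Well, I'm not sure I've understood the question completely, but let me suggest two steps.

Subsampling

To create a random subsample from your dataset, you can use the following function, which returns a new dataframe (a deep copy) built from your initial df (which is not mutated).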

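import numpy as np
import pandas as pd

seed = 42
np.random.seed(seed)  # fix the seed so the draw is reproducible

def subsample(df: pd.DataFrame, size: int) -> pd.DataFrame:
    assert 0 < size < len(df)
    # draw distinct row positions, so no student is picked twice
    subsample_indexes = np.random.choice(len(df), size, replace=False)
    return df.iloc[subsample_indexes, :].copy()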

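Same distribution?

For this, I suggest running the subsampling function above for a number of iterations (say, 50 subsamples) and comparing the distribution of each subsample with that of df_2. The pseudocode looks like this: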

def compare_distributions(df, df_compare, n_subsample=50):
    preserve_subsample = False
    n = 0
    # stop at the first matching subsample, or after n_subsample attempts
    while not preserve_subsample and n < n_subsample:
        df_sub = subsample(df, len(df_compare))
        # check whether the distributions are similar:
        # here you may conduct a hypothesis test and/or
        # look at some statistics; compare() is a placeholder
        # for whichever check you choose and returns True or False
        preserve_subsample = compare(df_sub, df_compare)
        n += 1
    if not preserve_subsample:
        # no match found: return an empty df
        return pd.DataFrame()
    return df_sub
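
One possible way to fill in the comparison step is a two-sample Kolmogorov-Smirnov check on the height column. The sketch below is only an illustration under stated assumptions: the names ks_statistic and find_matching_subsample, the 5% level, and the toy data are all made up for this example and are not part of the answer above. The statistic is computed by hand with NumPy to keep the sketch light; scipy.stats.ks_2samp offers the same statistic together with a p-value.

```python
import numpy as np
import pandas as pd

def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two samples."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def find_matching_subsample(df, df_compare, column="height",
                            n_tries=50, alpha=0.05, seed=42):
    """Return the first random subsample that passes the KS check, else an empty df."""
    rng = np.random.RandomState(seed)
    n = m = len(df_compare)
    # Asymptotic critical value of the two-sample KS statistic at level alpha.
    threshold = np.sqrt(-0.5 * np.log(alpha / 2)) * np.sqrt((n + m) / (n * m))
    for _ in range(n_tries):
        # sample() draws without replacement and leaves df untouched.
        candidate = df.sample(n=n, random_state=rng)
        if ks_statistic(candidate[column], df_compare[column]) < threshold:
            return candidate
    return pd.DataFrame()

# Toy data (assumed): both groups drawn from the same height distribution.
toy_rng = np.random.RandomState(0)
df_1 = pd.DataFrame({"student_id": np.arange(700),
                     "height": toy_rng.normal(170, 10, 700)})
df_2 = pd.DataFrame({"student_id": np.arange(700, 900),
                     "height": toy_rng.normal(170, 10, 200)})

matched = find_matching_subsample(df_1, df_2)
print(len(matched))
```

Passing the check only means the test found no evidence of a difference; it does not prove the distributions are equal. A purely random subsample also tends to mirror df_1's own distribution, so when df_1 and df_2 differ substantially this approach may keep returning an empty frame.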

【Discussion】:

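【Solution 2】:

You can combine pd.cut() with df.sample().

For example, use pd.cut() to get bin edges that split df_2.height into 20 bins of 10 students each (equal-count edges like these come from quantiles, for instance via pd.qcut(); pd.cut() given only a bin count produces equal-width bins instead). Then loop over those bins and, for each one, use df.sample() to draw 10 students from the subset of df_1 that falls into that height bin. If any such bin holds fewer than 10 students, consider sampling with replacement, or reduce the number of bins from the start.

This gives you a random subset of df_1 whose height distribution is roughly the same as the height distribution in df_2.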
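
The binning approach can be sketched roughly as follows. This is a minimal illustration, not the answerer's own code: the function name sample_matching_bins, its parameters, and the toy data are assumptions. It takes quantile edges from pd.qcut() so that each bin holds about the same number of df_2 students, assigns rows to bins with pd.cut(), and falls back to sampling with replacement when a df_1 bin is too small.

```python
import numpy as np
import pandas as pd

def sample_matching_bins(source, target, column="height", n_bins=20, seed=0):
    """Draw from `source`, bin by bin, as many rows as `target` has in each bin."""
    rng = np.random.RandomState(seed)
    # Quantile edges of the target: each bin holds roughly the same number of target rows.
    _, edges = pd.qcut(target[column], q=n_bins, retbins=True, duplicates="drop")
    # Integer bin code per row; source rows outside the target's range become NaN.
    target_codes = pd.cut(target[column], bins=edges, labels=False, include_lowest=True)
    source_codes = pd.cut(source[column], bins=edges, labels=False, include_lowest=True)
    parts = []
    for code, count in target_codes.value_counts().sort_index().items():
        pool = source[source_codes == code]
        if pool.empty:
            continue  # nothing to draw from; the result will be short by `count` rows
        # Sample with replacement only when the bin has too few source rows.
        too_small = bool(len(pool) < count)
        parts.append(pool.sample(n=int(count), replace=too_small, random_state=rng))
    return pd.concat(parts) if parts else source.iloc[0:0]

# Toy data (assumed): df_2 is taller on average and less spread out than df_1.
toy_rng = np.random.RandomState(1)
df_1 = pd.DataFrame({"student_id": np.arange(700),
                     "height": toy_rng.normal(170, 10, 700)})
df_2 = pd.DataFrame({"student_id": np.arange(700, 900),
                     "height": toy_rng.normal(172, 6, 200)})

matched = sample_matching_bins(df_1, df_2)
print(matched["height"].describe())
print(df_2["height"].describe())
```

Comparing the two describe() outputs gives a quick sanity check on the match. Rows drawn with replacement may repeat a student_id, so checking matched["student_id"].is_unique shows whether that fallback was used.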

【Discussion】: