【Question title】: Pandas: Sampling from a DataFrame according to a target distribution
【Posted】: 2020-12-23 13:29:52
【Question】:

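I have a Pandas DataFrame containing a dataset D of instances, each of which has some continuous value x. x is distributed in some way, say uniformly, but it could be anything.

I want to draw n samples from D such that x follows a target distribution that I can sample from or approximate. In practice the target comes from a dataset; here I just take a normal distribution.

How can I sample instances from D such that the distribution of x in the sample is equal or similar to an arbitrary distribution that I specify?

Right now, I sample a value x, subset D so that it contains all instances with x +- eps, and sample one instance from that subset. But this gets slow as the datasets grow. People must have come up with a better solution. Maybe the approach is already fine but could be implemented more efficiently?

I could split x into strata, which would be faster, but is there a solution without this?

My current code, which works fine but is slow (1 minute for 30k/100k, but I have around 200k/700k):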

import numpy as np
import pandas as pd
import numpy.random as rnd
from matplotlib import pyplot as plt
from tqdm import tqdm

n_target = 30000
n_dataset = 100000

x_target_distribution = rnd.normal(size=n_target)
# In reality this would be x_target_distribution = my_dataset["x"].sample(n_target, replace=True)

df = pd.DataFrame({
    'instances': np.arange(n_dataset),
    'x': rnd.uniform(-5, 5, size=n_dataset)
    })

plt.hist(df["x"], histtype="step", density=True)
plt.hist(x_target_distribution, histtype="step", density=True)

def sample_instance_with_x(x, eps=0.2):
    try:
        return df.loc[abs(df["x"] - x) < eps].sample(1)
    except ValueError: # fallback if no instance possible
        return df.sample(1)

df_sampled_ = [sample_instance_with_x(x) for x in tqdm(x_target_distribution)]
df_sampled = pd.concat(df_sampled_)

plt.hist(df_sampled["x"], histtype="step", density=True)
plt.hist(x_target_distribution, histtype="step", density=True)

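【Comments】:

Tags: python pandas sampling

【Answer 1】:

Rather than generating new points and finding their nearest neighbors in df.x, define the probability with which each point should be sampled according to your target distribution. You can then use np.random.choice. For a gaussian target distribution like this one, a million points are sampled from df.x in a second or so: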

x = np.sort(df.x)
f_x = np.gradient(x)*np.exp(-x**2/2)
sample_probs = f_x/np.sum(f_x)
samples = np.random.choice(x, p=sample_probs, size=1000000)

sample_probs is the key quantity, as it can be joined back to the dataframe or used as an argument to df.sample, e.g.:

# sample df rows without replacement
df_samples = df["x"].sort_values().sample(
    n=1000, 
    weights=sample_probs, 
    replace=False,
)
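The snippet above samples only the x column. If whole rows are needed (as in the question), one option is to sort the full frame by x and keep the weights as a column. This is a minimal sketch rather than part of the original answer: the column name sample_prob, the seed, and the sample size are illustrative assumptions, and it samples with replacement to mirror the question's loop.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# Same toy dataset as in the question: uniform x on [-5, 5]
df = pd.DataFrame({
    "instances": np.arange(100_000),
    "x": rng.uniform(-5, 5, size=100_000),
})

# Sort the whole frame so that np.gradient sees increasing x values
df_sorted = df.sort_values("x")
x = df_sorted["x"].to_numpy()

# Local spacing times the (unnormalized) standard normal density
f_x = np.gradient(x) * np.exp(-x**2 / 2)
df_sorted["sample_prob"] = f_x / f_x.sum()  # hypothetical column name

# Draw full rows; weights can be given as a column name
df_sampled = df_sorted.sample(
    n=30_000,
    weights="sample_prob",
    replace=True,
    random_state=0,
)
```

The sampled frame keeps the original index labels and the instances column, so it can be joined back to other data about the same instances.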

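Result of plt.hist(samples, bins=100, density=True):

Gaussian distributed x, uniform target distribution

Let's see how this method performs when the original samples are drawn from a gaussian distribution and we wish to resample them according to a uniform target distribution: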

x = np.sort(np.random.normal(size=100000))
f_x = np.gradient(x)*np.ones(len(x))
sample_probs = f_x/np.sum(f_x)
samples = np.random.choice(x, p=sample_probs, size=1000000)

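It actually does quite well. Few of the original points lie in the gaussian tails, so each of them is assigned a large probability under the uniform target; that is why the tails look sparse, with each tail point drawn more often than a point in the middle.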
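Both examples above use a target density with a closed form. In the question the target is only available as a sample from another dataset, so rho has to be estimated. A rough sketch of one way to do that, using a histogram of the target sample as the density estimate; the bin count, the seed, and the stand-in arrays are assumptions, and a kernel density estimate would work just as well:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

x = np.sort(rng.uniform(-5, 5, size=100_000))  # stand-in for the sorted df.x
x_target = rng.normal(size=30_000)             # stand-in for a sample of the target

# Histogram estimate of the target density (50 bins is an arbitrary choice)
density, edges = np.histogram(x_target, bins=50, density=True)

# Look up the bin of every x; points outside the target's range get density 0
bin_idx = np.searchsorted(edges, x, side="right") - 1
inside = (bin_idx >= 0) & (bin_idx < len(density))
rho = np.zeros_like(x)
rho[inside] = density[bin_idx[inside]]

# Same weighting as before: local spacing times estimated density
f_x = np.gradient(x) * rho
sample_probs = f_x / f_x.sum()
samples = rng.choice(x, p=sample_probs, size=30_000)
```

Points of x that fall outside the range covered by the target sample are never drawn, which is usually what you want here.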

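Method

Approximate probabilities are computed for the samples in x in the form:

prob(x_i) ~ delta_x*rho(x_i)

where rho(x_i) is the target density function and np.gradient(x) serves as the differential delta_x, i.e. the local spacing between the sorted points. If this differential weight is ignored, f_x will over-represent closely spaced points and under-represent sparse points in the resampling. I made this mistake initially; the effect is small if x is uniformly distributed, but in general it can be significant.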
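To see how much the spacing term matters, here is a small check under assumed settings (not from the original answer): x is drawn from a normal distribution with standard deviation 2 and the target is a standard normal. With the np.gradient factor the weighted variance should come out close to 1. Without it, the weights effectively multiply the source density by the target density, which for these two gaussians gives a variance of about 0.8.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# Source data: wider than the target (standard deviation 2), sorted
x = np.sort(rng.normal(scale=2.0, size=100_000))
rho = np.exp(-x**2 / 2)  # unnormalized standard normal target density


def normalize(w):
    """Scale non-negative weights so that they sum to 1."""
    return w / w.sum()


def weighted_var(values, probs):
    """Variance of values under the discrete distribution probs."""
    mean = np.sum(probs * values)
    return np.sum(probs * (values - mean) ** 2)


p_with = normalize(np.gradient(x) * rho)  # spacing times density
p_without = normalize(rho)                # density only (the mistake)

var_with = weighted_var(x, p_with)
var_without = weighted_var(x, p_without)
print(f"with spacing: {var_with:.3f}, without: {var_without:.3f}")
```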

【Comments】: