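r subset stratify dataframe by group; subset the max amount of observations per group as long as true and false boolean are balanced (answers)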

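【Question Title】: r subset stratify dataframe by group; subset the max amount of observations per group as long as true and false boolean are balanced
【Posted】: 2018-10-22 22:07:41
【Question】:

R: subset a data frame stratified by group, keeping the maximum number of observations per group as long as the true and false booleans are balanced (Python answers are also accepted).

I have a dataset of 10000 samples from 600 restaurant IDs, some of which are missing, and I need to balance its biased boolean to 50:50 before running any model. Here is the code to recreate the dataset: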

x<-floor(runif(10000, 0, 600)) #make a dataset of 10000 samples from 600 restaurant IDs
x<-sort(x)
y<-sample(0:1,10000,prob=c(.16,.84),replace=TRUE) #make a biased boolean for those 10000 samples
df = data.frame(x,y) #dataframe has random number of restaurants and biased boolean
colnames(df) <- c("Restaurant_ID","Restaurant_Bool")
summary(df)
nrow(df)

z<-floor(runif(10, 0, 600)) #pick 10 restaurant IDs to remove from the dataset
for (i in seq_along(z)) { #loop over all 10 IDs; `for (i in 10)` would only visit i = 10
  df<-df[!(df$Restaurant_ID==z[i]),] #remove that restaurant's rows from the dataset
}
summary(df)
nrow(df)
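
Since Python answers are also accepted, the same test data can be sketched with the standard library alone. This is only an approximation of the R code above: the fixed seed, the tuple representation of rows, and the names `ids`, `flags`, `missing` and `rows` are assumptions made for illustration, and unlike the R version it always removes 10 distinct IDs.

```python
import random

rng = random.Random(0)  # fixed seed (an assumption) so the sketch is reproducible

# 10000 observations spread over restaurant IDs 0..599, sorted by ID.
ids = sorted(rng.randrange(600) for _ in range(10000))
# Biased boolean: 1 with probability 0.84, 0 with probability 0.16.
flags = [1 if rng.random() < 0.84 else 0 for _ in ids]

# Drop every observation that belongs to 10 randomly chosen restaurant IDs.
missing = set(rng.sample(range(600), 10))
rows = [(rid, flag) for rid, flag in zip(ids, flags) if rid not in missing]

print(len(rows), sum(flag for _, flag in rows) / len(rows))
```

Each element of `rows` is a `(Restaurant_ID, Restaurant_Bool)` pair, which is enough to try out the balancing rule described below.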

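The dataset has a true:false ratio of roughly 84:16, but that ratio also varies by restaurant ID.

Similar to stratifying by restaurant ID, I need to cap the number of true observations so that it equals the number of false observations for each restaurant ID.

I don't know how to code this, and any help is appreciated.

For example, restaurant_ID 0 might have 10 observations, of which 8 are true and 2 are false. There is no restaurant_ID 1.

restaurant_ID 2 might have 8 observations, of which 3 are true and 5 are false.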

    X restaurant_ID Restaurant_Bool
    1 0             1
    2 0             1
    3 0             1
    4 0             0
    5 0             1
    6 0             1
    7 0             1
    8 0             0
    9 0             1
   10 0             1
   11 2             0
   12 2             0
   13 2             1
   14 2             0
   15 2             1
   16 2             0
   17 2             1
   18 2             0
   ...

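I would like a subset in which the number of rows with Restaurant_Bool == 0 equals the number with Restaurant_Bool == 1, keeping as many observations as possible: for each restaurant_ID, the number kept from each class is the smaller of its two boolean counts.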

 X restaurant_ID Restaurant_Bool
 1 0             1
 2 0             1
 4 0             0
 8 0             0
11 2             0
12 2             0
13 2             1
15 2             1
16 2             0
17 2             1
...
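
The number of rows kept per class in the example above is simply the size of the rarer class for each restaurant. A minimal Python sketch of that count, using plain tuples for the rows (the names `rows`, `counts` and `quota` are illustrative, not from the question):

```python
from collections import Counter

# (Restaurant_ID, Restaurant_Bool) pairs copied from the example table.
rows = [(0, 1), (0, 1), (0, 1), (0, 0), (0, 1), (0, 1), (0, 1), (0, 0), (0, 1), (0, 1),
        (2, 0), (2, 0), (2, 1), (2, 0), (2, 1), (2, 0), (2, 1), (2, 0)]

counts = Counter(rows)  # counts[(restaurant_id, flag)] -> number of observations

# Rows to keep from EACH class per restaurant: the size of the rarer class.
restaurant_ids = {rid for rid, _ in rows}
quota = {rid: min(counts[(rid, 0)], counts[(rid, 1)]) for rid in restaurant_ids}

for rid in sorted(quota):
    print(rid, quota[rid])
```

Restaurant 0 keeps 2 rows of each class (4 in total) and restaurant 2 keeps 3 of each (6 in total), which matches the 10 rows shown above.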

That could be the first subset. A second subset can be drawn at random from the other observations, following the same rule:

 X restaurant_ID Restaurant_Bool
 6 0             1
 7 0             1
 4 0             0
 8 0             0
14 2             0
18 2             0
13 2             1
15 2             1
16 2             0
17 2             1
...

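...and so on. Multiple different subsets can be drawn from the same dataset, as long as the number of samples with Restaurant_Bool == 1 equals the number with Restaurant_Bool == 0 for each restaurant_ID.

If a restaurant has more observations with Restaurant_Bool == 0 than with Restaurant_Bool == 1, its subset is still built from the size of the least-represented boolean. If a restaurant ID has no true observations or no false observations, the whole restaurant ID can be dropped from the dataset.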
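
One way to draw several such subsets is to repeat the sampling with a different random seed each time. The sketch below uses only the standard library; the function name `balanced_subset_indices`, the tuple layout and the toy data (including a restaurant 3 that has only true observations) are assumptions for illustration.

```python
import random
from collections import defaultdict

def balanced_subset_indices(rows, seed):
    """Return sorted row indices of one balanced random subset.

    `rows` is a list of (restaurant_id, flag) pairs with flag in {0, 1}.
    """
    rng = random.Random(seed)
    by_group = defaultdict(lambda: {0: [], 1: []})
    for index, (rid, flag) in enumerate(rows):
        by_group[rid][flag].append(index)

    keep = []
    for classes in by_group.values():
        n = min(len(classes[0]), len(classes[1]))  # size of the rarer class
        if n == 0:
            continue  # restaurant has only one class: drop it entirely
        keep += rng.sample(classes[0], n) + rng.sample(classes[1], n)
    return sorted(keep)

# Toy data: restaurant 0 (8 true, 2 false), restaurant 2 (3 true, 5 false),
# and restaurant 3 (4 true, 0 false), which should be dropped.
rows = [(0, 1)] * 8 + [(0, 0)] * 2 + [(2, 1)] * 3 + [(2, 0)] * 5 + [(3, 1)] * 4

first = balanced_subset_indices(rows, seed=1)
second = balanced_subset_indices(rows, seed=2)
print(first)
print(second)
```

Both calls return 10 indices, half of them pointing at true rows, and different seeds will usually select different rows from the over-represented class.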

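The reason I want to stratify by restaurant_ID is that the remaining columns, which I need to keep when building the model, may be correlated within each restaurant.

The closest answer I found was "Subset panel data by group", but it does not keep the maximum number of observations per restaurant_ID while keeping the true and false booleans balanced.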

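【Comments】:

Tags: python r boolean subset panel

【Solution 1】:

In Python, the code looks like this.

Create a new empty DataFrame and write a for loop over the data grouped by restaurant_id, finding for each group the smaller of the two Restaurant_Bool counts, n.

Add a check so that if n is 0 the loop skips to the next restaurant_id.

Combine the recommended and not-recommended samples into a temporary group_reviews DataFrame, and append group_reviews to the balanced_reviews DataFrame while asserting that the mean of Restaurant_Bool is 0.5.

After the loop over all groups has finished, assert that the mean of Restaurant_Bool over the whole balanced_reviews DataFrame is 0.5.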

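import pandas as pd

# `reviews` is the input DataFrame with columns 'restaurant_id' and 'Restaurant_Bool'.
balanced_reviews = pd.DataFrame()
for restaurant_id, group in reviews.groupby('restaurant_id'):
    # Rows to take from each class: the size of the rarer class in this group.
    take_n = min((group['Restaurant_Bool'] == 0).sum(), (group['Restaurant_Bool'] == 1).sum())
    if take_n == 0:
        continue  # one class is absent, so this restaurant is dropped
    reg_reviews = group[group['Restaurant_Bool'] == 1].sample(n=take_n, random_state=0)
    not_reviews = group[group['Restaurant_Bool'] == 0].sample(n=take_n, random_state=0)
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
    group_reviews = pd.concat([reg_reviews, not_reviews])

    assert group_reviews['Restaurant_Bool'].mean() == .5
    balanced_reviews = pd.concat([balanced_reviews, group_reviews])

assert balanced_reviews['Restaurant_Bool'].mean() == .5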
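
The same rule can also be expressed without an explicit loop. This is an alternative sketch rather than the answer's method: shuffle the rows once, rank each row within its (restaurant, boolean) group, and keep rows whose rank is below the restaurant's rarer-class count. The function name `balance_by_group`, its parameters and the toy `reviews` frame are assumptions for illustration, and the boolean column is assumed to hold integer 0/1 values.

```python
import pandas as pd

def balance_by_group(df, group_col, bool_col, seed=0):
    """Keep, per group, equally many 0-rows and 1-rows (as many as possible)."""
    # Count 0s and 1s per group; a class that never occurs gets a count of 0.
    counts = df.groupby([group_col, bool_col]).size().unstack(fill_value=0)
    counts = counts.reindex(columns=[0, 1], fill_value=0)
    take_n = counts.min(axis=1)  # rarer-class size, indexed by group

    # Shuffle, then number the rows 0, 1, 2, ... within each (group, class).
    shuffled = df.sample(frac=1, random_state=seed)
    rank = shuffled.groupby([group_col, bool_col]).cumcount()

    # Keep the first take_n rows of each class; groups with take_n == 0 vanish.
    keep = rank < shuffled[group_col].map(take_n)
    return shuffled[keep].sort_index()

# Toy data: restaurant 0 (8 true, 2 false), restaurant 2 (3 true, 5 false),
# restaurant 3 (4 true, 0 false).
reviews = pd.DataFrame({
    'restaurant_id': [0] * 10 + [2] * 8 + [3] * 4,
    'Restaurant_Bool': [1] * 8 + [0] * 2 + [1] * 3 + [0] * 5 + [1] * 4,
})
balanced = balance_by_group(reviews, 'restaurant_id', 'Restaurant_Bool', seed=0)
print(balanced)
```

Calling the function with a different `seed` should give another balanced subset of the same size, and restaurant 3 is dropped because it has no false observations.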

【Comments】: