How to extract a random sample with multiple conditions that vary by group?

【Question title】: How to extract a random sample with multiple conditions that vary by group?
【Posted】: 2015-07-27 05:46:10
【Question】:

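I have a cross-country dataset in which each respondent has at least one diary. The number of diaries per respondent and the days on which diaries were completed vary by country.

For example, in one country each respondent completed only 1 diary (half of the respondents completed it on a weekend day, the other half on a weekday). In another country, each respondent completed 2 diaries (one weekend, one weekday), and in yet another, each person completed 7 diaries (one for each day of the week). There are also surveys in which some respondents returned 2 diaries while others returned 3, and some in which everyone returned 4 diaries. The data looks like this: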

country_id<-rep(1:4,c(8,8,14,10))
diarist_id<-c(11:18,rep(21:24,each=2),
              rep(31:32,each=7),
              rep(41:44,c(3,3,2,2)))
diary_id<-c(111:118,211,212,221,222,231,232,241,242,
            311:317,321:327,411,412,413,
            421,422,423,431,432,441,442)
weekend<-c(1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,
           0,1,1,0,0,0,0,0,1,1,0,0,0,0,0,1,
           0,1,0,1,0,1,0,1,0)

dat<-data.frame(country_id,diarist_id,diary_id,weekend)

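I am trying to draw a "one person, one diary" random sample from each country. At the country level, however, I need roughly 29% of the sampled diaries to be weekend diaries. How can I draw such a conditional random sample by group?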

【Comments】:

Have you considered using the prob option in sample?
@MichaelChirico I couldn't figure out how to incorporate the conditions in sample.
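
For reference, the `prob` idea from the comment can be sketched in base R as below. This is only an illustration, not something either commenter posted: `row_weights` and `p_weekend` are hypothetical names, and the flags are country 3's `weekend` values from the question. The weights are set so that weekend rows together carry 29% of the total weight. A single draw is then a weekend diary with probability 0.29, but when several rows are drawn without replacement the realised weekend share varies from draw to draw and only approximates the target.

```r
# Weekend flags for country 3 in the question's toy data (4 weekend, 10 weekday)
weekend <- c(1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0)

# Hypothetical helper: weekend rows share p_weekend of the total weight,
# weekday rows share the rest, split evenly within each type
row_weights <- function(weekend, p_weekend = 0.29) {
  ifelse(weekend == 1,
         p_weekend / sum(weekend == 1),
         (1 - p_weekend) / sum(weekend == 0))
}

w <- row_weights(weekend)

set.seed(1)
# Draw 5 row positions without replacement, using the weights above
picked <- sample.int(length(weekend), size = 5, prob = w)
picked_share <- mean(weekend[picked])  # realised weekend share in this draw
```

Because the share is random here, this does not by itself guarantee a fixed number of weekend diaries per country, nor one diary per diarist.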

Tags: r random-sample

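【Solution 1】:

I think this is what you're after. I've opted to split the sample for clarity; there may be a way to get what you want without doing so, but it hasn't come to me.

I'd use data.table: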

set.seed(100)
library(data.table)
setDT(dat) #turn dat into a data.table (by reference)
country_n<-5 #how many observations you'd like per country

#split the data by weekend status
weekend.dat<-dat[weekend==1] #weekend is coded 0/1; avoid T, which can be reassigned
#we have to take care that there are actually enough
#  weekend observations in each country, so we take the
#  minimum of 29% of country_n (rounded) and the total
#  number of weekend observations in that country
weekend.sample<-
  weekend.dat[weekend.dat[,.I[sample(.N,min(round(.29*country_n),.N))],
                          by=country_id]$V1]

#repeat for the weekday sample, except take 71% this time
weekday.dat<-dat[weekend==0] #weekend is coded 0/1; avoid F, which can be reassigned
weekday.sample<-
  weekday.dat[weekday.dat[,.I[sample(.N,min(round(.71*country_n),.N))],
                          by=country_id]$V1]

#combine; setkey orders the data (as well as other
#  things that may be useful later on)
full.sample<-setkey(rbindlist(list(weekend.sample,weekday.sample)),
                    country_id,diarist_id,diary_id)

This is the sample produced with the random seed I set:

> full.sample
    country_id diarist_id diary_id weekend
 1:          1         12      112       0
 2:          1         13      113       1
 3:          1         14      114       0
 4:          1         16      116       0
 5:          1         18      118       0
 6:          2         21      212       0
 7:          2         22      221       1
 8:          2         22      222       0
 9:          2         23      232       0
10:          2         24      242       0
11:          3         31      315       0
12:          3         31      316       0
13:          3         31      317       0
14:          3         32      321       1
15:          3         32      324       0
16:          4         41      411       1
17:          4         42      421       0
18:          4         42      423       0
19:          4         43      432       0
20:          4         44      442       0
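
As a quick check on the printed sample, the following base-R sketch re-enters the `country_id`, `diarist_id` and `weekend` columns shown above and summarises them per country. The names `full_sample`, `weekend_share` and `repeated` are introduced here for illustration only. With `country_n<-5`, `round(.29*5)` is 1, so each country contributes exactly one weekend diary out of five.

```r
# Columns copied from the printed sample above
full_sample <- data.frame(
  country_id = rep(1:4, each = 5),
  diarist_id = c(12, 13, 14, 16, 18,
                 21, 22, 22, 23, 24,
                 31, 31, 31, 32, 32,
                 41, 42, 42, 43, 44),
  weekend    = c(0, 1, 0, 0, 0,
                 0, 1, 0, 0, 0,
                 0, 0, 0, 1, 0,
                 1, 0, 0, 0, 0)
)

# Share of weekend diaries in each country's sample
weekend_share <- tapply(full_sample$weekend, full_sample$country_id, mean)

# Number of rows per country whose diarist_id already appeared earlier
repeated <- tapply(full_sample$diarist_id, full_sample$country_id,
                   function(x) sum(duplicated(x)))
```

The realised weekend share is 20% in every country rather than 29%, which is as close as five rows allow, and countries 2 to 4 contain repeated diarists.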

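【Discussion】:

@Eva Note that some diarist_id values appear more than once. From your description I'm not sure whether that is acceptable. The code above can easily be adjusted (using the unique function) to handle that case.
Thanks! I've only just seen this message. Yes, by "one person, one diary" I mean that no diarist_id should be repeated. I'll try to adjust the code.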
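
One way the "no repeated diarist_id" requirement could be met is sketched below in base R. This is an assumption-laden illustration, not the answerer's adjustment: it samples diarists rather than diaries, and `take`, `pick_one`, `sample_country`, `n` and `p_weekend` are hypothetical names. Within each country it first picks the diarists who will contribute a weekend diary, then picks weekday diarists from those not already chosen, and finally draws one diary of the required type per chosen diarist.

```r
# Toy data from the question
country_id <- rep(1:4, c(8, 8, 14, 10))
diarist_id <- c(11:18, rep(21:24, each = 2), rep(31:32, each = 7),
                rep(41:44, c(3, 3, 2, 2)))
diary_id <- c(111:118, 211, 212, 221, 222, 231, 232, 241, 242,
              311:317, 321:327, 411, 412, 413,
              421, 422, 423, 431, 432, 441, 442)
weekend <- c(1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,
             0,1,1,0,0,0,0,0,1,1,0,0,0,0,0,1,
             0,1,0,1,0,1,0,1,0)
dat <- data.frame(country_id, diarist_id, diary_id, weekend)

# Draw k ids at random (k may be 0)
take <- function(ids, k) if (k > 0) ids[sample.int(length(ids), k)] else ids[0]
# Draw one element; sidesteps sample()'s special case for length-1 input
pick_one <- function(x) x[sample.int(length(x), 1)]

# d: rows of one country; n: target sample size; p_weekend: target weekend share
sample_country <- function(d, n, p_weekend = 0.29) {
  # Diarists with at least one weekend diary
  we_ids <- unique(d$diarist_id[d$weekend == 1])
  k_we <- min(round(p_weekend * n), length(we_ids))
  chosen_we <- take(we_ids, k_we)

  # Remaining diarists with at least one weekday diary
  wd_ids <- unique(d$diarist_id[d$weekend == 0 & !(d$diarist_id %in% chosen_we)])
  k_wd <- min(n - k_we, length(wd_ids))
  chosen_wd <- take(wd_ids, k_wd)

  # One random diary of the required type per chosen diarist
  rows_we <- vapply(chosen_we, function(id)
    pick_one(which(d$diarist_id == id & d$weekend == 1)), integer(1))
  rows_wd <- vapply(chosen_wd, function(id)
    pick_one(which(d$diarist_id == id & d$weekend == 0)), integer(1))
  d[c(rows_we, rows_wd), ]
}

set.seed(100)
one_per_diarist <- do.call(
  rbind, lapply(split(dat, dat$country_id), sample_country, n = 5))
```

With the toy data, countries that have fewer than five diarists return fewer than five rows, so the weekend share drifts away from 29% there (for example, country 3 has only two diarists). On a real survey with many diarists per country, `n` would be chosen so that `round(p_weekend * n)` lands close to the intended share.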