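[Posted]: 2014-12-08 19:04:52
[Question]:
I have two data frames, remove and dat (the actual data). remove lists the combinations of factor variables found in dat, along with the number of cases to sample for each combination (remove$cases).
Reproducible example: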
set.seed(83)
dat <- data.frame(RateeGender=sample(c("Male", "Female"), size = 1500, replace = TRUE),
RateeAgeGroup=sample(c("18-39", "40-49", "50+"), size = 1500, replace = TRUE),
Relationship=sample(c("Direct", "Manager", "Work Peer", "Friend/Family"), size = 1500, replace = TRUE),
X=rnorm(n=1500, mean=0, sd=1),
y=rnorm(n=1500, mean=0, sd=1),
z=rnorm(n=1500, mean=0, sd=1))
What I want to do is read remove row by row and use each row to subset dat. My current approach is as follows:
remove <- expand.grid(RateeGender = c("Male", "Female"),
RateeAgeGroup = c("18-39","40-49", "50+"),
Relationship = c("Direct", "Manager", "Work Peer", "Friend/Family"))
remove$cases <- c(36,34,72,58,47,38,18,18,15,22,17,10,24,28,11,27,15,25,72,70,52,43,21,27)
# For each row of remove (combination of factor levels:)
for (i in 1:nrow(remove)) {
  selection <- character()
  # For each column of remove (particular selection):
  for (j in 1:(ncol(remove)-1)){
    add <- paste0("dat$", names(remove)[j], ' == "', remove[i,j], '" & ')
    selection <- paste0(selection, add)
  }
  selection <- sub(' & $', '', selection) # Remove trailing ampersand
  cat(selection, sep = "\n") # What does selection string look like?
  tmp <- sample(dat[selection, ], size = remove$cases[i], replace = TRUE)
}
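The cat() output looks correct as the loop runs, for example: dat$RateeGender == "Male" & dat$RateeAgeGroup == "18-39" & dat$Relationship == "Direct", and if I paste that into dat[dat$RateeGender == "Male" & dat$RateeAgeGroup == "18-39" & dat$Relationship == "Direct", ], I get the correct subset.
However, if I run the loop as written, with dat[selection, ], every subset comes back as nothing but NAs. I get the same result if I use subset(). Note that I have replace = TRUE above only because the example data are randomly generated. In the real application, there will always be more cases per combination than requested.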
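The symptom described above can be reproduced on a tiny example. This is a minimal sketch, assuming a made-up data frame df (not the question's dat): a character string passed as a row index is matched against row names rather than evaluated as code, so it selects a single all-NA row, whereas the same string parsed and evaluated yields a logical vector that filters rows.

```r
# Tiny illustrative data frame (hypothetical, not the question's dat)
df <- data.frame(g = c("a", "b", "a"), v = 1:3)

# The condition stored as text, as in the question's loop
selection <- 'df$g == "a"'

# A character index is looked up among the row names; no row has this
# name, so the result is one row filled with NA.
by_string <- df[selection, ]

# Parsing and evaluating the text produces a logical vector,
# which is what row filtering actually needs.
mask <- eval(parse(text = selection))
by_mask <- df[mask, ]
```

Here by_string has one row of NAs, while by_mask holds the two rows where g is "a".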
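I know I can use paste() in this way to dynamically construct formulas for lm() and other functions, but I am clearly missing something when carrying the idea over to [ , ].
Any advice would be greatly appreciated!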
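One possible way to get the intended per-combination samples without building code as text is to assemble a logical mask column by column and then sample row indices. The following is only a sketch under stated assumptions: dat and remove here are small stand-ins with two key columns (not the question's full data), and the names keys, mask, rows, picked, sampled and result are illustrative. It also samples row indices rather than calling sample() on the data frame itself, since a data frame is a list of columns and sample() would pick columns.

```r
set.seed(1)

# Small stand-ins for the question's dat and remove (hypothetical values)
dat <- data.frame(
  RateeGender  = sample(c("Male", "Female"), 200, replace = TRUE),
  Relationship = sample(c("Direct", "Manager"), 200, replace = TRUE),
  X = rnorm(200)
)
remove <- expand.grid(RateeGender  = c("Male", "Female"),
                      Relationship = c("Direct", "Manager"),
                      stringsAsFactors = FALSE)
remove$cases <- c(5, 4, 3, 2)

# Every column of remove except the count identifies a combination
keys <- setdiff(names(remove), "cases")

sampled <- lapply(seq_len(nrow(remove)), function(i) {
  # Start with all rows selected, then AND in one equality test per key
  mask <- rep(TRUE, nrow(dat))
  for (k in keys) {
    mask <- mask & as.character(dat[[k]]) == as.character(remove[i, k])
  }
  rows <- which(mask)
  # Sample row positions without replacement; this assumes each
  # combination has at least remove$cases[i] matching rows
  picked <- rows[sample.int(length(rows), remove$cases[i])]
  dat[picked, ]
})

# Stack the per-combination samples into one data frame
result <- do.call(rbind, sampled)
```

Comparing through as.character() sidesteps problems when the two data frames store the same labels as factors with different level orders.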
[Comments]: