Randomly subset a data set multiple times and calculate means and variances: answers

【Question title】: Randomly subset data set multiple times and calculate means and variances
【Posted】: 2011-10-12 16:03:22
【Question】:

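I never reached a conclusion on this question, so I thought I would rephrase it and ask again.

I would like to subsample my data set 10,000 times to generate a mean and a 95% CI for each of my responses.

Here is an example of how the data set is structured: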

x <- read.table(tc <- textConnection("
study      expt    variable  value1  value2
  1         1         A       1.0      1.1 
  1         2         B       1.1      2.1 
  1         3         B       1.2      2.9
  1         4         C       1.5      2.3 
  2         1         A       1.7      0.3 
  2         2         A       1.9      0.3 
  3         1         A       0.2      0.5"), header = TRUE); close(tc)

I want to sample each study/variable combination only once. So, for example, a subsetted data set would look like this:

study      expt    variable  value1  value2
  1         1         A       1.0      1.1 
  1         2         B       1.1      2.1 
  1         4         C       1.5      2.3 
  2         1         A       1.7      0.3 
  3         1         A       0.2      0.5

Notice that rows 3 and 6 are gone, because in each case a variable was measured twice within the same study (B in the first case, A in the second).

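I want to draw subsampled data sets over and over, so that I can derive overall means of value1 and value2, with 95% confidence intervals, for each variable. So the output I would like after the whole subsampling routine is: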

variable   mean_value1   lower_value1  upper_value1  mean_value2  etc....
   A            2.3           2.0          2.6           2.1
   B            2.5           2.0          3.0           2.5
   C            2.1           1.9          2.3           2.6

Here is some code I have for getting the subsets:

library(plyr)  # provides ddply() and .()

subsample <- function(x, B) {
    x <- x[order(x$study, x$variable), ]  # keep each study/variable combination contiguous
    # for each study/variable combination, how many experiments are there
    samps <- ddply(x, .(study, variable), nrow)[, 3]
    # what is the first row of each study/variable combination
    expIdx <- which(!duplicated(x[, c("study", "variable")]))
    n <- length(samps)  # how many study/variable combinations are there

    lapply(1:B, function(a) {  # one subsampled data frame per iteration
        idx <- floor(runif(n, 0, samps))  # zero-based offset within each combination
        x[idx + expIdx, ]                 # one row per study/variable combination
    })
}

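Any help is appreciated. I know this is complicated, so please let me know if you need clarification!

【Comments】:

A minimal reproducible example is encouraged. See: stackoverflow.com/questions/5963269/…
Sorry, and thanks for updating my question. I am still new to StackOverflow!
@jslefche: There is a decent blog article about writing questions.
Thanks everyone. You will see from my next question that I took your comments to heart: stackoverflow.com/questions/6819047/…
OK, so what you are describing is not bootstrapping, which involves resampling your data with replacement; what you are describing is random subsetting of the data. I can write some code to do that, but are you sure this is what you want?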
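The distinction drawn in the last comment can be illustrated with base R's `sample()`. This is only a minimal sketch: the vector `v` reuses the value1 column from the question's example, and the subset size of 5 is an arbitrary choice for illustration.

```r
set.seed(42)
v <- c(1.0, 1.1, 1.2, 1.5, 1.7, 1.9, 0.2)  # value1 from the question's example

# Bootstrap resample: same size as the data, drawn WITH replacement,
# so an observation can appear more than once
boot_draw <- sample(v, size = length(v), replace = TRUE)

# Random subset: fewer observations, drawn WITHOUT replacement,
# so each observation appears at most once (size 5 is arbitrary here)
sub_draw <- sample(v, size = 5, replace = FALSE)

print(boot_draw)
print(sub_draw)
```

What the question describes is closer to the second draw, with the extra constraint that exactly one row is kept per study/variable combination.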

Tags: r subset mean confidence-interval

【Solution 1】:

Here is one solution. Fair warning, though: it will not scale well, and I cannot vouch for the statistical validity of this scheme:

library(plyr)  # provides ddply() and .()

#Replicate your example data
set.seed(1)
dat <- expand.grid(Study = 1:4,Experiment = 1:3, Response = LETTERS[1:4])
dat$Value1 <- runif(48)
dat$Value2 <- runif(48)

#Function to apply to each Response level
#Note the rather inefficient use of ddply 
# in a for loop to do the 'stratified' 
# subsampling you describe
myFun <- function(x,B){
    rs <- matrix(NA,B,2)
    for (i in 1:B){
        temp <- ddply(x,.(Study), .fun = function(x) x[sample(1:nrow(x),1),])
        rs[i,] <- colMeans(temp[,4:5])
    }
    c(Value1 = mean(x$Value1), quantile(rs[,1],probs=c(0.025,0.975)),
            Value2 = mean(x$Value2), quantile(rs[,2],probs=c(0.025,0.975)))
}

ddply(dat,.(Response),.fun = myFun,B=50)

Sample output

  Response    Value1      2.5%     97.5%    Value2      2.5%     97.5%
1        A 0.4914725 0.2721876 0.8311799 0.4600546 0.2596446 0.6909686
2        B 0.5941457 0.4018281 0.8047503 0.5241470 0.2865285 0.7099486
3        C 0.4596998 0.2752685 0.6340614 0.5761497 0.3546133 0.8115933
4        D 0.5550651 0.2717772 0.7298913 0.4645609 0.1868757 0.7985816
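For comparison, the same stratified subsampling idea can be sketched in base R against the column names used in the question (study, variable, value1, value2). This is a rough sketch, not the answerer's code: `draw_one` and `subsample_ci` are hypothetical helper names, the intervals are plain percentile intervals of the subsample means, and the caveat about statistical validity applies here too.

```r
# Example data copied from the question
x <- read.table(text = "
study expt variable value1 value2
1 1 A 1.0 1.1
1 2 B 1.1 2.1
1 3 B 1.2 2.9
1 4 C 1.5 2.3
2 1 A 1.7 0.3
2 2 A 1.9 0.3
3 1 A 0.2 0.5", header = TRUE)

# Hypothetical helper: keep one random row per study/variable combination
draw_one <- function(d) {
  groups <- split(seq_len(nrow(d)),
                  interaction(d$study, d$variable, drop = TRUE))
  # index through sample.int() so a single-row group is never read as 1:n
  keep <- vapply(groups, function(i) i[sample.int(length(i), 1)], integer(1))
  d[sort(keep), ]
}

# Hypothetical helper: repeat the draw B times, then summarise per variable
subsample_ci <- function(d, B = 1000, cols = c("value1", "value2")) {
  vars <- sort(unique(as.character(d$variable)))
  out <- lapply(cols, function(col) {
    # B x length(vars) matrix: per-variable means, one row per subsample
    m <- do.call(rbind, lapply(seq_len(B), function(b) {
      s <- draw_one(d)
      vapply(vars, function(v) mean(s[[col]][s$variable == v]), numeric(1))
    }))
    res <- data.frame(mean  = colMeans(m),
                      lower = apply(m, 2, quantile, probs = 0.025),
                      upper = apply(m, 2, quantile, probs = 0.975))
    names(res) <- paste(names(res), col, sep = "_")
    res
  })
  result <- data.frame(variable = vars, do.call(cbind, out))
  rownames(result) <- NULL
  result
}

set.seed(1)
res <- subsample_ci(x, B = 200)
print(res)
```

With the question's seven rows, only the duplicated combinations (study 1/B and study 2/A) vary between draws, so the intervals are very narrow; a realistic data set would give wider ones.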

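【Comments】:

Unfortunately, this still seems to select studies at random from the data set.
@jslefche - That is not correct. There is no random selection of Study in this code.

【Solution 2】:

Split the data by study, experiment, and variable, then apply the bootstrap to each subset. There are many ways to do this, including: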

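library(boot)  # provides boot()

# dfr is assumed to be a data frame with columns Study, Experiment, Variable, Response1
sdfr <- with(dfr, split(dfr, list(Study, Experiment, Variable)))
sdfr <- Filter(nrow, sdfr)   #to remove empty data frames

lapply(sdfr, function(x) 
{
  # ordinary (nonparametric) bootstrap: the statistic takes the data and the
  # resampled indices; sim = "parametric" without a ran.gen function would
  # return the same value for every replicate
  boot(x$Response1, statistic = function(d, i) mean(d[i]), R = 10000)
})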

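【Comments】:

The OP's description of the bootstrap is a bit vague, but they may mean to bootstrap the response estimates for each variable across the experiments within a study, in which case you would simply omit Experiment when splitting. With the OP's example data, I think this code would bootstrap a single value in each case.
@jslefche - That does not sound like what Richie's code is doing. Also, this description of your bootstrap procedure makes even less sense to me than the one in your original post. Perhaps you could edit your question to show us what you think the output should look like?
@Joran - Sorry for the confusion, I am still a bit new to this. I have edited my post above and tried to be as explicit as possible. Let me know if you need more information!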
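The first comment above suggests omitting Experiment when splitting, so that each piece contains all experiments for one variable within one study. A minimal sketch of that variant follows. The data frame `dfr` is made up for illustration and only borrows the column names assumed in Solution 2; `R = 999` is an arbitrary replicate count, and the intervals are simple percentile intervals.

```r
library(boot)  # recommended package shipped with standard R installations

# Hypothetical data using the column names assumed in Solution 2
set.seed(1)
dfr <- expand.grid(Study = 1:2, Experiment = 1:5, Variable = c("A", "B"))
dfr$Response1 <- runif(nrow(dfr))

# Split by Study and Variable only, so each piece holds all five experiments
sdfr <- split(dfr, list(dfr$Study, dfr$Variable), drop = TRUE)

# Ordinary (nonparametric) bootstrap of the mean within each piece
boots <- lapply(sdfr, function(x) {
  boot(x$Response1, statistic = function(d, i) mean(d[i]), R = 999)
})

# Percentile 95% intervals taken from the bootstrap replicates
cis <- t(sapply(boots, function(b) quantile(b$t, probs = c(0.025, 0.975))))
print(cis)
```

Note that this resamples with replacement inside each study/variable piece, which is a bootstrap in the usual sense and not the one-row-per-combination subsetting described in the question.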