【Question title】: Random subsampling of a data.frame and summarizing mean and standard deviation
【Posted】: 2021-07-23 06:43:32
【Question】:

In R, I have some ecological data of this form:

sample <- seq(1, 20, by=1)

group <- c("A","A","A","B","B","C","D","E","E","E","E","E","E",
          "E","E","E","E","F","F","F")

df <- data.frame(sample, group)

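where sample is the sample number and group is the taxonomic group associated with each sample.

I have 20 samples in total (more in reality), and I can get the relative frequency of each group with: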

data.frame(table(group)/length(group))

group Freq
1     A 0.15
2     B 0.10
3     C 0.05
4     D 0.05
5     E 0.50
6     F 0.15

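Now I want to subsample my data frame 100 times (10 samples per subsample) and get the mean relative frequency of each group together with its standard deviation.

How can I do this?

【Comments】:

    Tags: r random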


    【Solution 1】:

    You can use the following code

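    data <- with(
        df,
        # proportions() is available from R 4.0.1; prop.table() is the older name
        proportions(
            replicate(
                100,
                # draw 10 rows without replacement; fixed factor levels keep
                # groups that are absent from a subsample as 0 counts
                table(
                    factor(group[sample(length(group), 10)], levels = unique(group))
                )
            ), 2
        )
    )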
    

    which gives

    > data
       
        [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
      A  0.2  0.3  0.2  0.1  0.1  0.1  0.3  0.2  0.1   0.2   0.1   0.2   0.2   0.2
      B  0.1  0.2  0.0  0.2  0.2  0.1  0.1  0.0  0.2   0.1   0.1   0.1   0.1   0.0
      C  0.1  0.0  0.0  0.1  0.1  0.1  0.0  0.1  0.1   0.1   0.0   0.1   0.0   0.1
      D  0.0  0.0  0.1  0.0  0.1  0.1  0.0  0.1  0.0   0.0   0.0   0.1   0.0   0.1
      E  0.4  0.5  0.5  0.4  0.5  0.6  0.6  0.5  0.3   0.5   0.7   0.4   0.7   0.5
      F  0.2  0.0  0.2  0.2  0.0  0.0  0.0  0.1  0.3   0.1   0.1   0.1   0.0   0.1
    
        [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
      A   0.1   0.0   0.1   0.2   0.1   0.2   0.1   0.3   0.2   0.3   0.1   0.1
      B   0.1   0.1   0.1   0.1   0.1   0.1   0.0   0.1   0.1   0.1   0.2   0.1
      C   0.1   0.0   0.1   0.1   0.1   0.0   0.0   0.0   0.0   0.0   0.0   0.1
      D   0.1   0.0   0.0   0.1   0.0   0.1   0.1   0.1   0.1   0.0   0.1   0.1
      E   0.5   0.6   0.6   0.3   0.5   0.5   0.6   0.4   0.5   0.4   0.4   0.5
      F   0.1   0.3   0.1   0.2   0.2   0.1   0.2   0.1   0.1   0.2   0.2   0.1
    
        [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37] [,38]
      A   0.1   0.2   0.2   0.1   0.2   0.2   0.0   0.3   0.2   0.1   0.0   0.2
      B   0.1   0.1   0.2   0.1   0.1   0.0   0.1   0.2   0.1   0.1   0.1   0.0
      C   0.0   0.0   0.0   0.1   0.0   0.1   0.1   0.1   0.1   0.0   0.1   0.0
      D   0.1   0.1   0.0   0.1   0.1   0.1   0.1   0.1   0.0   0.0   0.1   0.1
      E   0.5   0.3   0.4   0.5   0.4   0.5   0.6   0.2   0.4   0.5   0.4   0.6
      F   0.2   0.3   0.2   0.1   0.2   0.1   0.1   0.1   0.2   0.3   0.3   0.1
    
        [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48] [,49] [,50]
      A   0.1   0.2   0.1   0.2   0.2   0.1   0.1   0.1   0.2   0.2   0.2   0.1
      B   0.0   0.1   0.0   0.1   0.1   0.2   0.0   0.0   0.2   0.1   0.1   0.1
      C   0.0   0.1   0.1   0.1   0.0   0.1   0.0   0.0   0.1   0.1   0.0   0.0
      D   0.0   0.1   0.0   0.0   0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1
      E   0.7   0.4   0.5   0.4   0.5   0.5   0.7   0.8   0.2   0.4   0.4   0.6
      F   0.2   0.1   0.3   0.2   0.1   0.0   0.1   0.0   0.2   0.1   0.2   0.1
    
        [,51] [,52] [,53] [,54] [,55] [,56] [,57] [,58] [,59] [,60] [,61] [,62]
      A   0.1   0.0   0.0   0.2   0.3   0.0   0.2   0.2   0.2   0.1   0.1   0.2
      B   0.1   0.2   0.2   0.1   0.0   0.2   0.0   0.1   0.2   0.2   0.2   0.1
      C   0.0   0.0   0.1   0.0   0.1   0.1   0.0   0.0   0.0   0.0   0.1   0.0
      D   0.1   0.1   0.0   0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.0
      E   0.5   0.5   0.6   0.5   0.3   0.4   0.4   0.5   0.4   0.5   0.4   0.5
      F   0.2   0.2   0.1   0.1   0.2   0.2   0.3   0.1   0.1   0.1   0.1   0.2
    
        [,63] [,64] [,65] [,66] [,67] [,68] [,69] [,70] [,71] [,72] [,73] [,74]
      A   0.2   0.3   0.2   0.1   0.2   0.1   0.2   0.3   0.3   0.1   0.2   0.2
      B   0.0   0.0   0.1   0.2   0.1   0.1   0.1   0.1   0.0   0.0   0.0   0.1
      C   0.0   0.0   0.1   0.0   0.0   0.1   0.0   0.0   0.0   0.1   0.1   0.1
      D   0.0   0.1   0.0   0.0   0.1   0.1   0.1   0.0   0.1   0.0   0.0   0.1
      E   0.6   0.4   0.6   0.5   0.4   0.4   0.5   0.6   0.5   0.6   0.6   0.4
      F   0.2   0.2   0.0   0.2   0.2   0.2   0.1   0.0   0.1   0.2   0.1   0.1
    
        [,75] [,76] [,77] [,78] [,79] [,80] [,81] [,82] [,83] [,84] [,85] [,86]
      A   0.1   0.2   0.1   0.2   0.2   0.0   0.2   0.1   0.2   0.0   0.1   0.1
      B   0.1   0.1   0.2   0.2   0.2   0.1   0.1   0.1   0.2   0.2   0.1   0.1
      C   0.1   0.0   0.1   0.0   0.0   0.1   0.1   0.0   0.1   0.1   0.1   0.0
      D   0.1   0.1   0.0   0.0   0.1   0.1   0.1   0.1   0.0   0.1   0.0   0.0
      E   0.4   0.5   0.6   0.6   0.4   0.5   0.4   0.4   0.5   0.4   0.4   0.6
      F   0.2   0.1   0.0   0.0   0.1   0.2   0.1   0.3   0.0   0.2   0.3   0.2
    
        [,87] [,88] [,89] [,90] [,91] [,92] [,93] [,94] [,95] [,96] [,97] [,98]
      A   0.2   0.3   0.2   0.1   0.1   0.2   0.2   0.1   0.2   0.1   0.3   0.3
      B   0.0   0.0   0.1   0.2   0.0   0.2   0.1   0.1   0.0   0.0   0.0   0.1
      C   0.1   0.1   0.0   0.1   0.0   0.1   0.1   0.1   0.0   0.1   0.0   0.0
      D   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.1   0.1   0.0
      E   0.7   0.5   0.6   0.5   0.7   0.5   0.5   0.6   0.7   0.6   0.4   0.4
      F   0.0   0.1   0.1   0.1   0.2   0.0   0.1   0.1   0.1   0.1   0.2   0.2
    
        [,99] [,100]
      A   0.1    0.2
      B   0.1    0.2
      C   0.0    0.0
      D   0.0    0.1
      E   0.7    0.4
      F   0.1    0.1
    

    Based on the resulting data, you can obtain the mean and sd as follows:

    • mean by group
    > rowMeans(data)
        A     B     C     D     E     F 
    0.167 0.096 0.051 0.056 0.492 0.138
    
    • sd by group
    > apply(data, 1, sd)
             A          B          C          D          E          F
    0.08577631 0.07035265 0.05000000 0.05016136 0.12583057 0.07869517
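
The two steps above can be combined into one reproducible summary. The following is only a sketch in base R: the helper name `subsample_freqs`, its arguments and the seed are illustrative assumptions, not part of the original answer. Because subsampling is done without replacement, the count of each group follows a hypergeometric distribution, so the expected mean and sd are added as a rough sanity check on the simulation.

```r
# Data from the question
group <- c("A","A","A","B","B","C","D","E","E","E","E","E","E",
           "E","E","E","E","F","F","F")

# Hypothetical helper: one column of relative frequencies per subsample
subsample_freqs <- function(group, n_rep = 100, size = 10) {
  lev <- sort(unique(group))
  freqs <- replicate(n_rep, {
    idx <- sample(length(group), size)   # without replacement
    # fixed levels so that absent groups are counted as 0
    as.vector(table(factor(group[idx], levels = lev))) / size
  })
  rownames(freqs) <- lev
  freqs
}

set.seed(42)   # arbitrary seed, only for reproducibility
freqs <- subsample_freqs(group)

# Hypergeometric expectation for a subsample of `size` out of N rows
N    <- length(group)
size <- 10
p    <- as.vector(table(group)) / N

summary_df <- data.frame(
  group         = rownames(freqs),
  mean          = rowMeans(freqs),
  sd            = apply(freqs, 1, sd),
  expected_mean = p,
  expected_sd   = sqrt(p * (1 - p) / size * (N - size) / (N - 1)),
  row.names     = NULL
)
summary_df
```

With only 100 replicates the simulated values will scatter around the expected ones; increasing `n_rep` should bring them closer.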
    

    【Comments】:

      【Solution 2】:

      Here is a functional approach using the tidyverse (purrr, dplyr). I'm not sure how you want to handle the standard deviation:

      library(tidyverse)
      
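      times <- 100
      # Note: this draws 21 rows with replacement; the question asks for 10 samples
      # per subsample, i.e. subpopulation <- 10 (and replace = FALSE below)
      subpopulation <- 21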
      
      sample_summary <- function(time, df_in = df, subpop = subpopulation){
      
          df_temp <- df_in[sample(1:nrow(df_in), size = subpop, replace = TRUE),]
          df_summary <- df_temp %>% group_by(group) %>% summarize(mean_freq = n() / subpop) 
          df_summary$experiment <- time
          
          return(df_summary)
      }
      
      1:times %>%
          map_dfr(., sample_summary)
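
One possible way to get from per-subsample frequencies to a mean and standard deviation per group is sketched below. It is an assumption about what the asker wants, not part of the original answer: it draws 10 rows without replacement, fills in groups that are missing from a subsample with a frequency of 0 via `tidyr::complete()`, and then summarises. The names `freqs`, `summary_tbl`, `n_rep` and `size`, and the seed, are illustrative.

```r
library(dplyr)
library(purrr)
library(tidyr)

df <- data.frame(
  sample = 1:20,
  group  = c("A","A","A","B","B","C","D","E","E","E","E","E","E",
             "E","E","E","E","F","F","F")
)

set.seed(1)   # arbitrary seed, only for reproducibility
n_rep <- 100
size  <- 10

# One block of rows per subsample: group, n, experiment, freq
freqs <- map_dfr(seq_len(n_rep), function(i) {
  df %>%
    slice_sample(n = size) %>%      # without replacement by default
    count(group) %>%
    mutate(experiment = i, freq = n / size)
})

summary_tbl <- freqs %>%
  # add a row with frequency 0 for every group absent from a subsample
  complete(experiment, group = unique(df$group),
           fill = list(n = 0L, freq = 0)) %>%
  group_by(group) %>%
  summarise(mean_freq = mean(freq), sd_freq = sd(freq))

summary_tbl
```

Without the `complete()` step, the mean and sd of a rare group would be computed only over the subsamples in which it happens to appear.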
      

      【讨论】:

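        【Solution 3】:

        The question is not entirely clear to me, but how about something like this in base R? The idea is to create 100 subsamples in a list, use lapply() to compute the relative frequencies for each element, and finally put everything in a data.frame() to aggregate and compute mean() and sd().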

        # first an empty list
        listed <- list()
        
        # now you create a data.frame with all the groups in unique()
        unique_groups <- data.frame(group = unique(df$group))
        
        # now let's populate it:
        # set seed for sake of reproducibility
        set.seed(1234)
        for(i in 1:100){
                      # sampling
                      temp <- df[sample(nrow(df), 10), ]
                      # merge with the unique data frame
                      temp <- merge(unique_groups, temp, by = 'group', all.x = T)
                      # replace NAs with 0s
                      temp[is.na(temp)] <- 0
                      # put it in list
                      listed[[i]] <- temp
         }
        
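        # here you apply to each element of the list your frequency calc;
        # rows added by merge() have sample == 0 and must not be counted,
        # so absent groups get a frequency of 0 and the denominator stays 10
        listed_freq <- lapply(listed, function(x) {
                      real <- x$sample != 0
                      data.frame(table(factor(x$group[real], levels = unique_groups$group)) / sum(real))
        })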
        
        # put it as data.frame
        df_freq <- do.call(rbind, listed_freq)
        
        # here you aggregate and calculate mean and sd
        aggregate(. ~ Var1, data = df_freq, FUN = function(x) c(mn = mean(x), stdev = sd(x) ) )
        

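        Result (note that C and D show a mean of exactly 0.1 and an sd of 0, which suggests this output was produced before absent groups were counted as 0; see the comments below):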

          Var1    Freq.mn Freq.stdev
        1    A 0.17333333 0.05958659
        2    B 0.15362319 0.05023389
        3    C 0.10000000 0.00000000
        4    D 0.10000000 0.00000000
        5    E 0.47300000 0.11621558
        6    F 0.16813187 0.06972824
        

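        【Comments】:

        • Thank you very much for your answer, it is very useful and I learned a lot from it. The only problem is that the mean and standard deviation do not take into account subsamples in which one or more groups are missing (i.e. their relative frequency is 0). I need to compute the mean and standard deviation over all subsamples.
        • You're welcome: see the edit. We only need a new df with the unique groups and a small tweak to the loop. You could also handle this with something like complete(), but imo this is more intuitive.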
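
The point raised in the comments can be seen on a tiny made-up example. The vectors `groups`, `sub1` and `sub2` and both helper functions below are hypothetical and serve only to illustrate the effect; the sketch compares averaging with absent groups dropped against averaging with fixed factor levels.

```r
groups <- c("A", "B", "C")       # hypothetical full set of groups
sub1   <- c("A", "A", "B", "C")  # C present: frequency 0.25
sub2   <- c("A", "B", "B", "B")  # C absent: frequency should be 0

# table() on the raw values silently drops groups that do not occur
freq_dropped <- function(x) as.data.frame(table(group = x) / length(x))
# fixed factor levels keep every group, with 0 where it is absent
freq_fixed <- function(x) {
  as.data.frame(table(group = factor(x, levels = groups)) / length(x))
}

mean_by_group <- function(f) {
  stacked <- do.call(rbind, lapply(list(sub1, sub2), f))
  aggregate(Freq ~ group, data = stacked, FUN = mean)
}

agg_dropped <- mean_by_group(freq_dropped)
agg_fixed   <- mean_by_group(freq_fixed)

mean_c_dropped <- agg_dropped$Freq[agg_dropped$group == "C"]  # 0.25
mean_c_fixed   <- agg_fixed$Freq[agg_fixed$group == "C"]      # 0.125
```

When absent groups are dropped, C is averaged over a single subsample, which overstates its mean and makes its sd meaningless; with fixed levels the second subsample contributes a 0.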