Resample and looping over dplyr functions in R

[Question title]: Resample and looping over dplyr functions in R
[Posted]: 2019-12-26 03:58:50
[Question description]:

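I have the following dataset (dat) with 8 unique treatment groups. I want to sample 3 points from each unique group and store their mean and variance. I want to do this 1000 times (sampling with replacement), using a loop that stores all the values in an output. I tried writing this loop, but I keep getting the error unexpected '=' in:"output[i] <- summarise(group_by(new_df[i], fertilizer,crop, level),mean[i]="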

Any suggestions on how to fix it, or make it more...

fertilizer <- c("N","N","N","N","N","N","N","N","N","N","N","N","P","P","P","P","P","P","P","P","P","P","P","P","N","N","N","N","N","N","N","N","N","N","N","N","P","P","P","P","P","P","P","P","P","P","P","P")

crop <- c("alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group","alone","group")

level <- c("low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","high","low","low","high","low")

growth <- c(0,0,1,2,90,5,2,5,8,55,1,90,2,4,66,80,1,90,2,33,56,70,99,100,66,80,1,90,2,33,0,0,1,2,90,5,2,2,5,8,55,1,90,2,4,66,0,0)

dat <- data.frame(fertilizer, crop, level, growth)

library(dplyr)

for(i in 1:1000){
  new_df[i] <- dat %>% 
                  group_by(fertilizer, crop, level) %>% 
                  sample_n(3)
  output[i] <- summarise(
                  group_by(new_df[i], fertilizer, crop, level),
                  mean[i] = mean(growth), 
                  var[i] = sd(growth) * sd(growth))
}

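[Comments]:

summarize(..., mean[i]=...) is a problem for a couple of reasons: (1) summarize does not accept indexed assignment like that (even though it works for simple vectors at the REPL); (2) naming variables after (generic) functions is arguably bad form, but that is just my two cents. Mostly it is the first.
I fixed a typo in your code; please check the code you give us before asking.

Tags: r for-loop dplyr sample resampling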
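
For readers who still want an explicit loop, the point made in the comments can be sketched as follows. This is not code from the thread: `n_reps` and `results` are hypothetical names, and the small `dat` built here is only a stand-in with the same columns as the question's data so that the block runs on its own. Each iteration stores a whole summary table in one list element, and the columns created inside `summarise()` get plain names instead of indexed ones.

```r
library(dplyr)

# Stand-in data with the same columns as `dat` in the question
# (values are illustrative only, not the asker's measurements)
set.seed(1)
dat <- expand.grid(
  fertilizer = c("N", "P"),
  crop = c("alone", "group"),
  level = c("low", "high"),
  rep = 1:6,
  stringsAsFactors = FALSE
)
dat$growth <- round(runif(nrow(dat), 0, 100))

n_reps <- 100                       # hypothetical name; the question uses 1000
results <- vector("list", n_reps)   # one list slot per resample

for (i in seq_len(n_reps)) {
  results[[i]] <- dat %>%
    group_by(fertilizer, crop, level) %>%
    sample_n(3, replace = TRUE) %>%   # 3 points per group, with replacement
    summarise(mean = mean(growth), var = var(growth)) %>%
    ungroup()
}

# Stack the per-iteration tables; `.id` records which resample a row came from
output <- bind_rows(results, .id = "sample_id")
```

A list is used because each iteration produces a table rather than a single value, which is why `output[i] <- ...` on a plain vector does not fit here.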

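[Solution 1]:

I don't think you need a loop. You can do this faster by sampling 3*1000 values from each group in one go, assigning a sample_id and adding it to the grouping variables, and finally calling summarise to get the values you want. That way every function is called only once.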

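dat %>% 
  group_by(fertilizer, crop, level) %>% 
  # draw all 3 * 1000 values per group in one call
  sample_n(3*1000, replace = TRUE) %>% 
  # label consecutive triples as resample 1, 2, ..., 1000
  mutate(sample_id = rep(1:1000, each = 3)) %>% 
  # keep the existing groups and add sample_id
  # (in dplyr >= 1.0.0 this argument is spelled `.add`)
  group_by(sample_id, add = TRUE) %>% 
  summarise(
    mean = mean(growth, na.rm = TRUE),
    var = sd(growth)^2
  ) %>% 
  ungroup()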

# A tibble: 8,000 x 6
   fertilizer crop  level sample_id  mean      var
   <chr>      <chr> <chr>     <int> <dbl>    <dbl>
 1 N          alone high          1 30.7  2640.   
 2 N          alone high          2  1       0    
 3 N          alone high          3 60.3  2640.   
 4 N          alone high          4  1.33    0.333
 5 N          alone high          5  1.33    0.333
 6 N          alone high          6 60.3  2640.   
 7 N          alone high          7  1.33    0.333
 8 N          alone high          8 30.3  2670.   
 9 N          alone high          9  1.33    0.333
10 N          alone high         10 60.7  2581.   
# ... with 7,990 more rows

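[Discussion]:

Is there a way to get the final table as a data frame instead of a tibble? Tibbles are harder to work with later on.
@Biotechgeek You can convert any tibble to a data frame with as.data.frame().
When I try to apply the same idea to my original data, I get Error in summarise_impl(.data, dots) : Evaluation error: dims [product 8000] do not match the length of object [1]. Why do you think that is?
It should be sd(growth, na.rm = T)^2
Hard to say, but it looks like whichever summary function you are using does not produce a single value per group. And yes, sd(growth, na.rm = T)^2 is better.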
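
The two suggestions in this discussion can be tried on a toy example. The data frame `toy` and the names `res` and `res_df` below are made up for illustration and are not the asker's data. The sketch passes `na.rm = TRUE` to both summary functions and then converts the resulting tibble with `as.data.frame()`.

```r
library(dplyr)

# Toy data (illustrative only): group "a" contains a missing value
toy <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  growth = c(1, 2, NA, 4, 5, 9)
)

res <- toy %>%
  group_by(group) %>%
  summarise(
    mean = mean(growth, na.rm = TRUE),
    var = sd(growth, na.rm = TRUE)^2   # same value as var(growth, na.rm = TRUE)
  )

# Each summary expression returns one value per group,
# so `res` has exactly one row per group
res_df <- as.data.frame(res)   # plain data.frame instead of a tibble
class(res_df)
```

Without `na.rm = TRUE`, the mean and variance for group "a" would both come out as `NA`.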

[Solution 2]:

Try this:

replicate(2, {
  dat %>%
    group_by(fertilizer, crop, level) %>%
    sample_n(3) %>%
    summarize(mu = mean(growth), sigma2 = sd(growth)^2) %>%
    ungroup()
}, simplify = FALSE)
# [[1]]
# # A tibble: 8 x 5
#   fertilizer crop  level    mu  sigma2
#   <fct>      <fct> <fct> <dbl>   <dbl>
# 1 N          alone high   1       1
# 2 N          alone low   30.7  2641.
# 3 N          group high  33.3  2408.
# 4 N          group low   56     553
# 5 P          alone high  22.7  1409.
# 6 P          alone low    2.33    2.33
# 7 P          group high  40.3  1336.
# 8 P          group low   23    1387
# [[2]]
# # A tibble: 8 x 5
#   fertilizer crop  level    mu sigma2
#   <fct>      <fct> <fct> <dbl>  <dbl>
# 1 N          alone high   30.3  2670.
# 2 N          alone low    52.7  2069.
# 3 N          group high   61.7  2408.
# 4 N          group low    20     925
# 5 P          alone high   35.3  3042.
# 6 P          alone low    19.7   990.
# 7 P          group high   14.3   270.
# 8 P          group low    32    2524.

(Replace 2 with your 1000.)
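
`replicate(..., simplify = FALSE)` returns a list with one tibble per resample. If a single table is more convenient, one option is to stack the list with `bind_rows()`. The sketch below is an addition to the answer, not part of it: `n_reps`, `samples` and `combined` are hypothetical names, and `dat` is a small stand-in with the same columns as the question's data so that the block runs on its own.

```r
library(dplyr)

# Stand-in data with the same columns as `dat` in the question
# (values are illustrative only)
set.seed(42)
dat <- expand.grid(
  fertilizer = c("N", "P"),
  crop = c("alone", "group"),
  level = c("low", "high"),
  rep = 1:6,
  stringsAsFactors = FALSE
)
dat$growth <- round(runif(nrow(dat), 0, 100))

n_reps <- 5   # hypothetical name; the question uses 1000

# One summary table per resample, collected in a list
samples <- replicate(n_reps, {
  dat %>%
    group_by(fertilizer, crop, level) %>%
    sample_n(3) %>%
    summarize(mu = mean(growth), sigma2 = sd(growth)^2) %>%
    ungroup()
}, simplify = FALSE)

# Stack the list into one table; `.id` adds a column identifying the resample
combined <- bind_rows(samples, .id = "sample_id")
```

Note that `sample_n(3)` as written in the answer samples without replacement; passing `replace = TRUE` would match the "with replacement" wording in the question.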
