管理 dplyr group_by 函数以在与 group_modify 结合使用时保留分组变量答案

【问题标题】：Managing dplyr group_by function to keep the grouping variable when used in combination with group_modify管理 dplyr group_by 函数以在与 group_modify 结合使用时保留分组变量
【发布时间】：2020-08-14 12:33:22
【问题描述】：

我正在尝试使用函数group_modify（我已经了解了here）。

目标是获取data.frame，将其与group_by 拆分，然后应用自制函数进行一些重组（即排序，选择“最佳行”，如果超过一个，则取平均值） .我需要输出 data.frame 具有 all 原始列的列。

这是一个让一切变得更清晰的 RE：

数据：

library(dplyr)
(dd <- data.frame(id = c("a", "a", "b", "b", "c", "c", "c"), cat = c("s2", "s1", "s1", "s1", "s3", "s2", "s2"), val = 1:7))
  id cat val
1  a  s2   1
2  a  s1   2
3  b  s1   3
4  b  s1   4
5  c  s3   5
6  c  s2   6
7  c  s2   7

我的功能（显示我的问题的基本功能，但不完全是我实际使用的功能）：

simple_fun <- function(slice, key){
  big_out_to_show_error <<- slice

  temp1 <- arrange(slice, cat)
  
  temp2 <- temp1 %>% 
    filter(cat==temp1$cat[1])

  if(nrow(temp2)>1) {
    temp2 <- temp2 %>% 
      group_by(id, cat) %>% 
      summarise(val = mean(val))
  }
  
  return(data.frame(temp2))
  
}

我想要的输出（每个 ID 一行具有“最佳”cat，如果超过一行，则为 val 的平均值，并且具有原始 data.frame 中的所有变量):

  id cat val
a  a  s1 2.0
b  b  s1 3.5
c  c  s2 6.5

我对@987654332@ 函数的尝试抛出了一个错误：

dd %>% 
   group_by(id) %>%
   group_modify(simple_fun)
 Show Traceback
 
 Rerun with Debug
 Error: Column `id` is unknown

这是因为使用的slice 不包括分组变量。从这个简单的代码可以看出这一点，它在 main 函数中使用了 big_out_to_show_error <<- slice 行并限制为 id=="a"：

filter(dd, id=="a") %>% 
   group_by(id) %>%
   group_modify(simple_fun)
# A tibble: 1 x 3
# Groups:   id [1]
  id    cat     val
  <fct> <fct> <int>
1 a     s1        2

big_out_to_show_error
# A tibble: 2 x 2
  cat     val
  <fct> <int>
1 s2        1
2 s1        2

如何管理 group_by 函数以仍然将分组变量放入切片中，以便我的函数与 group_modify 一起使用？

作为旁注，我真的很想了解和修复 dplyr group_by 的行为。我已经知道基本的 R 方法：

split(dd, dd$id) %>% 
  lapply(simple_fun) %>% 
  do.call("rbind", .)
  id cat val
a  a  s1 2.0
b  b  s1 3.5
c  c  s2 6.5

谢谢

【问题讨论】：

标签： r group-by dplyr

【解决方案1】：

group_modify() 为每个组创建两个对象 - 一个包含子集数据的小标题和一个包含组信息的单独的单行小标题。

由于group_modify()返回数据时会自动恢复组信息，所以一般不需要将这些信息保留在子集数据中，默认情况下会被删除。但是，您可以使用.keep 参数来保留它，但是如果在您的函数返回数据时存在组变量，这将导致错误。

因此您可以通过使用.keep 参数来修复您的函数，然后在返回数据之前删除分组变量：

simple_fun <- function(slice, key){

  temp1 <- arrange(slice, cat)
  
  temp2 <- temp1 %>% 
    filter(cat==temp1$cat[1])
  
  if(nrow(temp2)>1) {
    temp2 <- temp2 %>% 
      group_by(id, cat) %>% 
      summarise(val = mean(val), .groups = "drop")
  }   
  temp2 %>%
    select(-id)      
}

dd %>% 
  group_by(id) %>%
  group_modify(simple_fun, .keep = TRUE)

# A tibble: 3 x 3
# Groups:   id [3]
  id    cat     val
  <chr> <chr> <dbl>
1 a     s1      2  
2 b     s1      3.5
3 c     s2      6.5

您还可以简化函数以完全回避这个问题：

simple_fun2 <- function(slice, key){

slice %>% 
    slice_min(cat, 1) %>%
    summarise(cat = unique(cat),
              val = mean(val))
}

dd %>% 
  group_by(id) %>%
  group_modify(simple_fun2)

# A tibble: 3 x 3
# Groups:   id [3]
  id    cat     val
  <chr> <chr> <dbl>
1 a     s1      2  
2 b     s1      3.5
3 c     s2      6.5

【讨论】：

感谢您的回答，它确实提供了丰富的信息。我不知道slice_min 函数，真的有一个 tidyverse 函数可以处理所有事情......我的实际代码运行在 130 万行上，而且速度很慢，所以我测试了不同的选项并显示 @ 987654328@ 方法是最快的，当与lapply 一起使用时（您可以查看我刚刚在另一个答案中生成的基准）。

【解决方案2】：

27 φ 9 答案很完美，回答我的问题。现在，考虑到分析数据集有多种选择，而且我的数据集相当大（130 万行），我做了一个快速基准测试来比较 Base R (split/lapply) 和 Tidyverse (@987654325 @/group_modify) 方法使用两种可能的函数（使用arrange 的一种和使用slice_min 的一种）。

它可能不是最佳/完美/最先进的编程，但它是一个快速而肮脏的比较，它给出了进行此分析的最有效方法的公平想法。

library(dplyr)
library(microbenchmark)
library(ggplot2)

nbrows <- 200
set.seed(1234)
bigdd <- data.frame(id = sample(nbrows/2, nbrows, replace = T), 
                    cat = sample(c("S1", "S2", "S3"), nbrows, replace = T),
                    val = runif(nbrows)) %>% 
  arrange(id)

f_baser_arrange <- function(dd){
  
  simple_fun0 <- function(slice, key){
    temp1 <- arrange(slice, cat)
    temp2 <- temp1 %>% 
      filter(cat==temp1$cat[1])
    if(nrow(temp2)>1) {
      temp2 <- temp2 %>% 
        group_by(id, cat) %>% 
        summarise(val = mean(val), .groups = 'drop')
    }
    return(data.frame(temp2))
  }
  
  split(dd, dd$id) %>% 
    lapply(simple_fun0) %>% 
    do.call("rbind", .)
}

f_baser_slice_min <- function(dd){
  simple_fun3 <- function(slice, key){
    slice %>% 
      slice_min(cat, 1) %>%
      summarise(id = unique(id),
                cat = unique(cat),
                val = mean(val))
  }
  
  split(dd, dd$id) %>% 
    lapply(simple_fun3) %>% 
    do.call("rbind", .)
}

f_tidy_arrange <- function(dd){
  simple_fun1 <- function(slice, key){
    temp1 <- arrange(slice, cat)
    temp2 <- temp1 %>% 
      filter(cat==temp1$cat[1])
    if(nrow(temp2)>1) {
      temp2 <- temp2 %>% 
        group_by(cat) %>% 
        summarise(val = mean(val), .groups = 'drop')
    }
    return(data.frame(temp2))
  }
  
  dd %>% 
    group_by(id) %>%
    group_modify(simple_fun1)
}

f_tidy_slice_min <- function(dd){
  simple_fun2 <- function(slice, key){
    slice %>% 
      slice_min(cat, 1) %>%
      summarise(cat = unique(cat),
                val = mean(val))
  }
  
  dd %>% 
    group_by(id) %>%
    group_modify(simple_fun2)
}

res <- microbenchmark(f_baser_arrange(bigdd),
               f_baser_slice_min(bigdd),
               f_tidy_arrange(bigdd),
               f_tidy_slice_min(bigdd),
               times = 100)

data.frame(res) %>% 
  mutate(Philosophy = ifelse(grepl("baser", expr), "Base R", "Tidyverse"),
         Method = ifelse(grepl("arrange", expr), "arrange", "slice_min")) %>% 
  ggplot(aes(x=Philosophy, y=time, color=Method))+
  geom_boxplot(position=position_dodge(0.5))

产生：

我们注意到基本 R split/lapply 方法通常比 Tidyverse group_by/group_modify 方法更快。我们还注意到 @27 φ 9 slice_min 比我原来的 arrange 方法更快。

此外，通过将lapply 更改为parLapply，基本 R 方法和速度更快。

【讨论】：