Is there some function to keep unique values in R dplyr with group_by?

【Question title】: Is there some function to keep unique values in R dplyr with group_by?
【Posted】: 2021-05-05 12:09:30
【Question】:

I have a data.frame (or tibble, or similar) with an id variable. I often use dplyr::group_by on this id to perform some operation, like so:

data %>%
    group_by(id) %>%
    summarise/mutate/...()

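Usually, each id also has other non-numeric variables that take a single value per id, such as the project or country the id belongs to, and other characteristics of the id (gender, and so on). When I use summarise as above, these other variables are lost unless I specify them explicitly, either with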

data %>%
    group_by(id) %>%
    summarise(across(c(project, country, gender, ...), unique),...)

or

data %>%
    group_by(id, project, country, gender, ...) %>%
    summarise()

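Is there a function that detects the variables that are unique within each id, so that I don't have to specify them?

Thanks!

PS: I am mainly asking about dplyr and group_by-related functions, but other environments such as base R or data.table are fine too.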

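【Comments】:

Have you considered ungroup()-ing the data, or iterating over the variables you want to group by, e.g. with map()?
I am afraid the answer is no: there is no automatic detection of such variables. The solutions you already have are the way to go: 1) mention them in group_by, 2) use across + unique, 3) use across + first to keep them in the data.
Care to review the answers?
@mnist I have seen the answers. Thank you.
Care to give any form of feedback, e.g. a comment, upvote, or accept?

Tags: r dplyr group-by tidyverse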
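
For reference, the three manual options listed in the comment above might look like the following minimal sketch. The toy data and the column names (`value`, `total`) are made up for illustration and are not from the original post.

```r
library(dplyr)

# Toy data (an assumption for illustration): project and country are constant within id
toy <- tibble(
  id      = c(1, 1, 2, 2, 2),
  project = c("p1", "p1", "p2", "p2", "p2"),
  country = c("DE", "DE", "FR", "FR", "FR"),
  value   = c(10, 20, 1, 2, 3)
)

# 1) name the extra variables in group_by()
by_group <- toy %>%
  group_by(id, project, country) %>%
  summarise(total = sum(value), .groups = "drop")

# 2) across() + unique(): gives one row per id only if the variables are constant within id
by_unique <- toy %>%
  group_by(id) %>%
  summarise(across(c(project, country), unique), total = sum(value), .groups = "drop")

# 3) across() + first(): always one row per id, but does not check for constancy
by_first <- toy %>%
  group_by(id) %>%
  summarise(across(c(project, country), first), total = sum(value), .groups = "drop")
```

On data like this all three give the same table. Note that first() will not flag a variable that actually varies within an id; it simply keeps the first value.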

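【Solution 1】:

This is a bit more advanced in application, but what you are looking for are linear combinations of your grouping variables. You can convert these to factors and then use some linear algebra.

You can use findLinearCombos() from caret to locate these. It takes some work to organize the result the way I think you want it, though.

Something like this might work. I have not tested it extensively.

Packages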

library(dplyr)
library(caret)
library(purrr)

Function

group_by_lc <- function(.data, ..., .add = FALSE, .drop = group_by_drop_default(.data)) {
  
  # capture the ... and convert to a character vector
  .groups <- rlang::ensyms(...)
  .groups_chr <- map_chr(.groups, rlang::as_name)
  
  # convert all character and factor variables to a numeric
  d <- .data %>% 
    mutate(across(where(is.factor), as.character), 
           across(where(is.character), as.factor),
           across(where(is.factor), as.integer))
  
  
  # find linear combinations of the character / factor variables
  lc <- caret::findLinearCombos(d)
  
  # see if any of your grouping variables have linear combinations
  find_group_match <- function(known_groups, lc_pair) {
    if (any(lc_pair %in% known_groups)) unique(c(lc_pair, known_groups)) else NULL
  }
  
  # convert column indices to names
  lc_pairs <- map(lc$linearCombos, ~ names(d)[.x])
  
  # iteratively look for linear combinations of known grouping variables
  lc_cols <- reduce(lc_pairs, find_group_match, .init = .groups_chr)

  # find new grouping variables
  added_groups <- rlang::syms(lc_cols[!(lc_cols %in% .groups_chr)])

  # apply the grouping to your groups and the linear combinations
  group_by(.data, !!!.groups, !!!added_groups, .add = .add, .drop = .drop)
    
}

Usage

data <- tibble(V = LETTERS[1:10], W = letters[1:10], X = paste0(V, W), Y = rep(LETTERS[1:5], each = 2), Z = runif(10))
group_by_lc(data, W)

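Result

You can see how it added the other grouping variables (X and V) to W. You could redo all of this in other ways; the key part is building the added_groups list to find them.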

# A tibble: 10 x 5
# Groups:   W, X, V [10]
   V     W     X     Y          Z
   <chr> <chr> <chr> <chr>  <dbl>
 1 A     a     Aa    A     0.884 
 2 B     b     Bb    A     0.133 
 3 C     c     Cc    B     0.194 
 4 D     d     Dd    B     0.407 
 5 E     e     Ee    C     0.256 
 6 F     f     Ff    C     0.0976
 7 G     g     Gg    D     0.635 
 8 H     h     Hh    D     0.0542
 9 I     i     Ii    E     0.0104
10 J     j     Jj    E     0.464
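
To see what findLinearCombos() reports on its own, here is a small sketch on two hand-made matrices (the matrices are illustrative assumptions, not data from the post). It also suggests a caveat worth checking before relying on this approach: the function looks for linear dependence between columns, so a variable that is constant within each id but whose integer codes are not a linear function of the id codes would presumably not be picked up.

```r
library(caret)

# Columns a and b are identical, so they are linearly dependent; z is unrelated
m1 <- cbind(a = c(1, 2, 3, 4, 5),
            b = c(1, 2, 3, 4, 5),
            z = c(0.3, 0.9, 0.1, 0.7, 0.5))
lc1 <- findLinearCombos(m1)
lc1$linearCombos  # one dependency involving columns 1 and 2

# g is fully determined by id (1 -> 2, 2 -> 1, 3 -> 3), but not as a linear function
m2 <- cbind(id = c(1, 1, 2, 2, 3, 3),
            g  = c(2, 2, 1, 1, 3, 3))
lc2 <- findLinearCombos(m2)
lc2$linearCombos  # no dependency reported
```

In group_by_lc() the integer codes come from alphabetically sorted factor levels, so whether a per-id variable is detected may depend on how its labels happen to sort.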

【Comments】:

【Solution 2】:

I have not tested it extensively, but it should do the job.

library(dplyr)

myData <- tibble(X = c(1, 1, 2, 2, 2, 3),
                 Y = LETTERS[c(1, 1, 2, 2, 2, 3)],
                 R = rnorm(6))
myData
#> # A tibble: 6 x 3
#>       X Y          R
#>   <dbl> <chr>  <dbl>
#> 1     1 A      0.463
#> 2     1 A     -0.965
#> 3     2 B     -0.403
#> 4     2 B     -0.417
#> 5     2 B     -2.28 
#> 6     3 C      0.423

group_by_id_vars <- function(.data, ...) {
  # group by the prespecified ID variables
  .data <- .data %>% group_by(...)
  
  # how many groups do these ID determine
  ID_groups <- .data %>% n_groups()
  
  # Get the number of groups if the initial grouping variables are combined
  # with other variables
  groupVars <- sapply(substitute(list(...))[-1], deparse) #specified grouping Variable
  nms <- names(.data) # all variables in .data
  res <- sapply(nms[!nms %in% groupVars], 
                function(x) {
                  .data %>%
                    # important to specify .add = TRUE to combine the variable 
                    # with the IDs
                    group_by(across(all_of(x)), .add = TRUE) %>% 
                    n_groups()})
  
  # which combinations are identical, i.e. this variable does not increase the
  # number of groups in the data if combined with IDvars
  v <- names(res)[which(res == ID_groups)]
  
  # group the data accordingly
  .data <- .data %>% ungroup() %>% group_by(across(all_of(c(groupVars, v))))
  return(.data)
}

myData %>% 
  group_by_id_vars(X) %>% 
  summarise(n = n())
#> `summarise()` regrouping output by 'X' (override with `.groups` argument)
#> # A tibble: 3 x 3
#> # Groups:   X [3]
#>       X Y         n
#>   <dbl> <chr> <int>
#> 1     1 A         2
#> 2     2 B         3
#> 3     3 C         1
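
A variant of the same idea, sketched here under the assumption that "unique per id" means exactly one distinct value in every group, counts distinct values per group with n_distinct() instead of comparing group counts. The helper name `constant_within()` and the toy data are hypothetical and not part of the original answer.

```r
library(dplyr)

# Hypothetical helper: names of the columns that have exactly one distinct
# value within every group defined by the columns named in `id`
constant_within <- function(.data, id) {
  .data %>%
    group_by(across(all_of(id))) %>%
    summarise(across(everything(), n_distinct), .groups = "drop") %>%
    select(-all_of(id)) %>%
    select(where(function(col) all(col == 1))) %>%
    names()
}

# Deterministic toy data shaped like myData above
d <- tibble(X = c(1, 1, 2, 2, 2, 3),
            Y = LETTERS[c(1, 1, 2, 2, 2, 3)],
            R = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6))

keep <- constant_within(d, "X")

# Group by the id plus every detected column, then summarise as usual
out <- d %>%
  group_by(across(all_of(c("X", keep)))) %>%
  summarise(n = n(), .groups = "drop")
```

Here `keep` contains only Y, because R varies within X = 1 and X = 2. As with the function above, missing values and very large data may need extra care.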

【Comments】: