【Question title】: How to sum a variable on other aggregated variables, whilst keeping remaining variables in R?
【Posted】: 2021-02-22 11:42:00
【Question】:

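I am trying to aggregate a dataset of 12,000 observations. It has 37 variables, of which I want to group by 2 and sum 1.
All other variables must be kept, because they contain important information for later analysis.
Most of the remaining variables hold the same value within a group, so I would like to take the first value of each.
To give a better view of what is going on, I created a small random test dataset (10 obs., 5 variables).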

row <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4)
set1 <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3)
set2 <- c(1, 1, 1, 2, 2, 2, 1, 1, 2, 1)
set3 <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
df <- data.frame(row, y, set1, set2, set3)

df
   row y set1 set2 set3
1    1 1    1    1    1
2    2 1    1    1    1
3    3 1    1    1    2
4    4 2    1    2    2
5    5 2    1    2    3
6    6 2    1    2    3
7    7 3    2    1    4
8    8 3    2    1    4
9    9 3    2    2    5
10  10 4    3    1    5

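I want to aggregate the data by set1 and set2 and get sum(y), while keeping the other columns (here row and set3) by taking the first value of each remaining column. This should give the following aggregated data frame (or tibble):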

# row y set1  set2  set3
# 1   3 1     1     1
# 4   6 1     2     2
# 7   6 2     1     4
# 9   3 2     2     5
# 10  4 3     1     5

I checked other questions for possible solutions, but could not solve my problem.
The most relevant questions and sites I studied and tried are:
Combine rows and sum their values
https://community.rstudio.com/t/combine-rows-and-sum-values/41963
https://datascienceplus.com/aggregate-data-frame-r/
R: How to aggregate some columns while keeping other columns
Aggregate by multiple columns, sum one column and keep other columns? Create new column based on aggregated values?

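I found that using summarise in dplyr always drops the remaining variables.
I thought I had found a solution in R: How to aggregate some columns while keeping other columns, because reproducing that example gave a satisfactory result.
That is, using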

library(dplyr)
df_aggr1 <-
  df %>%
  group_by(set1, set2) %>%
  slice(which.max(y))

results in

# A tibble: 5 x 5
# Groups:   set1, set2 [5]
    row     y  set1  set2  set3
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     1     1     1     1
2     4     2     1     2     2
3     7     3     2     1     4
4     9     3     2     2     5
5    10     4     3     1     5

However, using

library(dplyr)
df_aggr2 <-
  df %>%
  group_by(set1, set2) %>%
  slice(sum(y))

results in:

# A tibble: 1 x 5
# Groups:   set1, set2 [1]
    row     y  set1  set2  set3
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     3     1     1     1     2

Here y is clearly not summed, so I do not understand what is going on.

What am I missing?
Thanks in advance!

【Comments】:

Tags: r group-by aggregate

【Answer 1】:

It works for me when you explicitly state that you want the first value, i.e.:

library(tidyverse)
df %>%
  group_by(set1, set2) %>%
  summarize(y = sum(y),
            row = row[1],
            set3 = set3[1])

# A tibble: 5 x 5
# Groups:   set1 [3]
   set1  set2     y   row  set3
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     1     3     1     1
2     1     2     6     4     2
3     2     1     6     7     4
4     2     2     3     9     5
5     3     1     4    10     5
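
As a side note on why slice(sum(y)) in the question returned a single row: slice() picks rows by position within each group, so sum(y) is read as a row number rather than as a value to compute. The sketch below rebuilds the question's test data and compares each group's size with its sum(y). The names group_info and sliced are only illustrative.

```r
library(dplyr)

# Test data from the question
df <- data.frame(
  row  = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  y    = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4),
  set1 = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3),
  set2 = c(1, 1, 1, 2, 2, 2, 1, 1, 2, 1),
  set3 = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
)

# Number of rows and sum of y per group (illustrative helper table)
group_info <- df %>%
  group_by(set1, set2) %>%
  summarise(n_rows = n(), sum_y = sum(y), .groups = "drop")

# slice() treats sum(y) as a row position inside each group;
# positions beyond the group size select nothing
sliced <- df %>%
  group_by(set1, set2) %>%
  slice(sum(y)) %>%
  ungroup()

print(group_info)
print(sliced)
```

Only the group with set1 = 1 and set2 = 1 has at least sum(y) rows (3 rows, sum 3), so only its third row is kept. In every other group the requested position does not exist and nothing is returned.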

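Edit: To keep all the other columns without naming each one, you can use across() and indicate that the aggregation should apply to every column except one.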

df %>%
  group_by(set1, set2) %>%
  summarize(
    across(!y, ~ .x[1]), 
    y = sum(y)
  )

# A tibble: 5 x 5
# Groups:   set1 [3]
   set1  set2   row  set3     y
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     1     1     1     3
2     1     2     4     2     6
3     2     1     7     4     6
4     2     2     9     5     3
5     3     1    10     5     4
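
A possible variant of the above, assuming dplyr 1.0.0 or later: first() can stand in for ~ .x[1], and .groups = "drop" returns an ungrouped tibble instead of one still grouped by set1. This is a sketch on the question's test data, not the only way to write it, and result is an illustrative name.

```r
library(dplyr)

# Test data from the question
df <- data.frame(
  row  = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  y    = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4),
  set1 = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3),
  set2 = c(1, 1, 1, 2, 2, 2, 1, 1, 2, 1),
  set3 = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
)

result <- df %>%
  group_by(set1, set2) %>%
  summarise(
    across(!y, first),  # first value of every non-grouping column except y
    y = sum(y),         # sum of y within each group
    .groups = "drop"    # return an ungrouped tibble
  )

print(result)
```

Dropping the grouping can avoid surprises when later steps (for example mutate() or another summarise()) would otherwise still operate per set1.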

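【Comments】:

Thanks for the quick reply! I forgot to mention that I used dplyr. At first I got a wrong result of 4 obs. from the tidyverse solution compared to dplyr. But now it works (although I am not sure whether it really picks dplyr now, since I also loaded tidyverse). I do not know what happened there, but tidyverse seems to work anyway, so this may be the solution. Do you know a way to select all remaining variables at once? Otherwise I would need to add 34 lines for all the remaining columns.
Hey, tidyverse should use the dplyr library by default, unless you loaded another library with the same function names afterwards. I have edited my answer to show how to select all the remaining variables at once :)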
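
If another attached package masks dplyr's verbs (plyr is a common example, though the thread does not say which package, if any, was involved here), prefixing the calls with dplyr:: removes the ambiguity. A minimal sketch on the question's test data, where result and provider are illustrative names:

```r
library(dplyr)

# Test data from the question
df <- data.frame(
  row  = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  y    = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4),
  set1 = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3),
  set2 = c(1, 1, 1, 2, 2, 2, 1, 1, 2, 1),
  set3 = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
)

# Explicit namespaces: these calls use dplyr regardless of load order
result <- df %>%
  dplyr::group_by(set1, set2) %>%
  dplyr::summarise(
    y    = sum(y),
    row  = row[1],
    set3 = set3[1],
    .groups = "drop"
  )

# Which package an unqualified summarise() currently resolves to
provider <- environmentName(environment(summarise))

print(result)
print(provider)
```

If provider is anything other than dplyr, an unqualified summarise() is being supplied by a different package, which could explain an unexpected number of rows.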