dplyr `across()` function and data frame length while grouping

【Question Title】: dplyr `across()` function and data frame length while grouping
【Posted】: 2020-05-01 16:34:42
【Question】:

packageVersion("dplyr")
#[1] ‘0.8.99.9002’

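Note that this question uses dplyr's new across() function. To install the latest development version of dplyr, run remotes::install_github("tidyverse/dplyr"). To go back to the released version of dplyr, run install.packages("dplyr"). If you are reading this at some point in the future and are already on dplyr 1.X+, you can ignore this note.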

library(tidyverse)
df <- tibble(Date = c(rep(as.Date("2020-01-01"), 3), 
                      rep(as.Date("2020-02-01"), 2)),
             Type = c("A", "A", "B", "C", "C"),
             col1 = 1:5,
             col2 = c(0, 8, 0, 3, 0),
             col3 = c(25:29),
             colX = rep(99, 5))
#> # A tibble: 5 x 6
#>   Date       Type   col1  col2  col3  colX
#>   <date>     <chr> <int> <dbl> <int> <dbl>
#> 1 2020-01-01 A         1     0    25    99
#> 2 2020-01-01 A         2     8    26    99
#> 3 2020-01-01 B         3     0    27    99
#> 4 2020-02-01 C         4     3    28    99
#> 5 2020-02-01 C         5     0    29    99

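I'd like to sum columns 1 through X by row, grouping by "Date" and "Type". I will always start at the third column (i.e. col1), but I will never know the numerical value of X in colX. That's fine, because I can use the length of the data frame to determine how far out I need to go to capture all columns up to the end of the data frame. Here's my approach: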

df %>% 
  group_by(Date, Type) %>% 
  summarize(across(3:length(.)), sum())
#> Error: Problem with `summarise()` input `..1`.
#> x Can't subset columns that don't exist.
#> x Locations 5 and 6 don't exist.
#> i There are only 4 columns.
#> i Input `..1` is `across(3:length(.))`.
#> i The error occured in group 1: Date = 2020-01-01, Type = "A".
#> Run `rlang::last_error()` to see where the error occurred.

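But it appears my use of the base R length(.) function is incorrect. Am I using dplyr's new across() function the right way? How can I get the length of the data frame at the point in the pipe where I need it? I'll never know exactly how many columns there are, and the actual names aren't as clean as in my example data frame.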

【Comments】:

Tags: r dplyr

【Solution 1】:

packageVersion("dplyr")
#[1] ‘0.8.99.9002’

First, there is a small problem with your syntax: the column selection and the function both go inside the across() call.

df %>% summarize(across(3:length(.),sum))
## A tibble: 1 x 4
#   col1  col2  col3  colX
#  <int> <dbl> <int> <dbl>
#1    15    11   135   495

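The following code does not work, because you cannot select the columns you are currently grouping by. On grouped data, across() only sees the four non-grouping columns, while length(.) still counts all six columns of the data frame.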

df %>% 
   group_by(Date, Type) %>% 
   summarize(across(3:length(.), sum))
#Error: Problem with `summarise()` input `..1`.
#x Can't subset columns that don't exist.
#x Locations 5 and 6 don't exist.
#ℹ There are only 4 columns.

This becomes apparent when you try the following:

df %>% 
   group_by(Date, Type) %>% 
   summarize(across(everything(), sum))
## A tibble: 3 x 6
## Groups:   Date [2]
#  Date       Type   col1  col2  col3  colX
#  <date>     <chr> <int> <dbl> <int> <dbl>
#1 2020-01-01 A         3     8    51   198
#2 2020-01-01 B         3     0    27    99
#3 2020-02-01 C         9     3    57   198
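
To see where the "only 4 columns" in the error message comes from, it can help to compare the full width of the data frame with what is left once the grouping variables are set aside. This is a minimal sketch, not part of the original answer; it rebuilds the question's df so it runs on its own, and `grouped` and `visible` are illustrative names.

```r
library(dplyr)

# Rebuild the example data from the question so this block is self-contained.
df <- tibble(Date = c(rep(as.Date("2020-01-01"), 3),
                      rep(as.Date("2020-02-01"), 2)),
             Type = c("A", "A", "B", "C", "C"),
             col1 = 1:5,
             col2 = c(0, 8, 0, 3, 0),
             col3 = c(25:29),
             colX = rep(99, 5))

grouped <- df %>% group_by(Date, Type)

# length(.) in the pipe counts every column, grouping columns included.
length(grouped)
#> [1] 6

# across() inside summarize() can only select from the non-grouping columns.
visible <- setdiff(names(grouped), group_vars(grouped))
visible
#> [1] "col1" "col2" "col3" "colX"
length(visible)
#> [1] 4
```

So 3:length(.) expands to 3:6, but only positions 1 through 4 exist among the columns across() can select from, which is what the error reports.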

Other options include tidy-select helpers such as starts_with().

df %>% 
  group_by(Date, Type) %>% 
  summarize(across(starts_with("col"), sum))
## A tibble: 3 x 6
## Groups:   Date [2]
#  Date       Type   col1  col2  col3  colX
#  <date>     <chr> <int> <dbl> <int> <dbl>
#1 2020-01-01 A         3     8    51   198
#2 2020-01-01 B         3     0    27    99
#3 2020-02-01 C         9     3    57   198
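
If the real column names share no common prefix, another possibility is to capture the names by position before grouping and hand them to across() through all_of(). This is a sketch rather than something from the original answer: `value_cols` and `result` are hypothetical names, it assumes the columns to sum always start at position 3, and it assumes dplyr 1.0.0 or later for the `.groups` argument.

```r
library(dplyr)

# Rebuild the example data from the question so this block is self-contained.
df <- tibble(Date = c(rep(as.Date("2020-01-01"), 3),
                      rep(as.Date("2020-02-01"), 2)),
             Type = c("A", "A", "B", "C", "C"),
             col1 = 1:5,
             col2 = c(0, 8, 0, 3, 0),
             col3 = c(25:29),
             colX = rep(99, 5))

# Take the names from column 3 to the last column while nothing is grouped yet,
# so positions still refer to the full data frame.
value_cols <- names(df)[3:ncol(df)]

# Select by name inside across(); the hidden grouping columns no longer matter.
result <- df %>%
  group_by(Date, Type) %>%
  summarize(across(all_of(value_cols), sum), .groups = "drop")

result
```

Because the columns are selected by name rather than by position, it does not matter that the grouping columns are excluded from what across() can see.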

The row-wise and column-wise vignettes are very good. The row-wise one actually discusses how the group_by columns are subset.

【Comments】:

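Thanks for the syntax correction, but even with the proper syntax, grouping still seems to produce an error. Can you try this: df %>% group_by(Date, Type) %>% summarize(across(3:length(.), sum))? I get an error message saying "Can't subset columns that don't exist. Locations 5 and 6 don't exist."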