Sidestepping for-loops using dplyr 1.0.0: answers

[Question title]: Sidestepping for-loops using dplyr 1.0.0
[Posted]: 2020-11-23 04:15:46
[Question]:

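I have just started to appreciate the power of the new dplyr 1.0.0. But having read the vignettes, I needed to read more, and of course there was no more, so I am turning to SO again.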

Suppose I have the following dataset:

# Compute new variables using rowwise and c_across
rm(list = ls())

library(tidyverse)
set.seed(1)
df <- tibble(d_1_a = round(sample(1:10,10,replace=TRUE)),
             d_1_b = round(sample(1:10,10,replace=TRUE)),
             d_1_c = round(sample(1:10,10,replace=TRUE)),
             d_1_d = round(sample(1:10,10,replace=TRUE)),
             d_2_a = round(sample(1:10,10,replace=TRUE)),
             d_2_b = round(sample(1:10,10,replace=TRUE)),
             d_2_c = round(sample(1:10,10,replace=TRUE)),
             d_2_d = round(sample(1:10,10,replace=TRUE)))

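I want to compute row sums over subsets of the columns in the dataset and add them to the existing dataset. I came up with the following for loop: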

for (i in 1:2) {
  namesCols <- grep(paste0("^d_",i,"_[a-z]$"), names(df), perl = TRUE) # indexes of subset of columns
  newDF <- df %>% select(all_of(namesCols)) # extract subset of columns from main
  totDF <- newDF %>% rowwise() %>% 
                     mutate(!!paste0("sum_",i) := sum(c_across(everything()))) %>% # new column from old 
                     select(starts_with("sum")) # now extract just the new column as a dataframe
  df <- cbind(df,totDF) # binds the new column to the old dataframe
}

Now if we print the original dataset df:

   d_1_a d_1_b d_1_c d_1_d d_2_a d_2_b d_2_c d_2_d sum_1 sum_2
1      9     5     5    10     9     2     6     7    29    24
2      4    10     5     6     7     2     8     6    25    23
3      7     6     2     4     8     6     7     1    19    22
4      1    10    10     4     6     6     1     5    25    18
5      2     7     9    10    10     1     4     6    28    21
6      7     9     1     9     7     3     8     1    26    19
7      2     5     4     7     3     3     9     9    18    24
8      3     5     3     6    10     8     9     7    17    34
9      1     9     6     9     6     6     7     7    25    26
10     5     9    10     8     8     7     4     3    32    22

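We can see the two sum columns, each computed from a different subset of the existing columns in the original dataset and then appended to the end of that dataset.

But I would love to learn some new dplyr/purrr voodoo, and I can't figure out how the syntax works.

Can anyone suggest a tidyverse version of my for loop?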

[Comments on the question]:

Tags: r for-loop dplyr tidyverse

[Solution 1]:

A literal translation of the for loop would be -

library(dplyr)
library(purrr)

bind_cols(df, map_dfc(1:2, function(i) {
  df %>% 
    transmute(!!paste0("sum_",i) := rowSums(
              select(., matches(paste0("^d_",i,"_[a-z]$")))))
}))

#   d_1_a d_1_b d_1_c d_1_d d_2_a d_2_b d_2_c d_2_d sum_1 sum_2
#   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1     9     5     5    10     9     2     6     7    29    24
# 2     4    10     5     6     7     2     8     6    25    23
# 3     7     6     2     4     8     6     7     1    19    22
# 4     1    10    10     4     6     6     1     5    25    18
# 5     2     7     9    10    10     1     4     6    28    21
# 6     7     9     1     9     7     3     8     1    26    19
# 7     2     5     4     7     3     3     9     9    18    24
# 8     3     5     3     6    10     8     9     7    17    34
# 9     1     9     6     9     6     6     7     7    25    26
#10     5     9    10     8     8     7     4     3    32    22
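
As a hedged aside that is not part of the original answer: when there are only a couple of groups, the same sums can also be written directly inside one mutate() call, using across() with the tidyselect helper matches() to pick the columns and rowSums() to add them up. The sketch below runs on a small made-up tibble called toy (an assumption for illustration, with values chosen so the sums can be checked by hand) rather than on the random df from the question.

```r
library(dplyr)

# Hypothetical toy data, two "d_1_*" columns and two "d_2_*" columns
toy <- tibble(d_1_a = c(1, 2), d_1_b = c(3, 4),
              d_2_a = c(10, 20), d_2_b = c(30, 40))

res <- toy %>%
  mutate(
    # across() with only a column selection returns those columns as a tibble,
    # which rowSums() then sums row by row
    sum_1 = rowSums(across(matches("^d_1_[a-z]$"))),
    sum_2 = rowSums(across(matches("^d_2_[a-z]$")))
  )

res$sum_1  # 4 6
res$sum_2  # 40 60
```

This trades the generality of the map_dfc() version for readability: every new group needs its own line, so it only pays off when the number of groups is small and fixed. Newer dplyr releases also offer pick() for this kind of pure column selection inside mutate().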

However, we can also use split.default -

bind_cols(df, df %>%
  split.default(sub('.*(\\d+).*', '\\1', names(.))) %>%
  imap_dfc(~.x %>% transmute(!!paste0("sum_",.y) := rowSums(.))))

where the sub part returns the grouping that determines how the columns are split.

sub('.*(\\d+).*', '\\1', names(df))
#[1] "1" "1" "1" "1" "2" "2" "2" "2"
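
To make the splitting step more concrete, here is a minimal sketch on a made-up tibble called toy (hypothetical data, not the df from the question). It shows what split.default() returns for that grouping key. It also shows one caveat worth knowing: because the leading .* in the pattern is greedy, the capture group keeps only the last digit, which is fine for indices 1 and 2 but would merge groups once an index has two digits. The anchored pattern in the last line is one possible alternative, assuming the column names always follow the d_<number>_<letter> scheme.

```r
library(dplyr)

# Hypothetical toy data with the same naming scheme as the question
toy <- tibble(d_1_a = c(1, 2), d_1_b = c(3, 4),
              d_2_a = c(10, 20), d_2_b = c(30, 40))

# Grouping key: one label per column
key <- sub('.*(\\d+).*', '\\1', names(toy))

# split.default() splits the *columns* of the tibble by that key
pieces <- split.default(toy, key)
names(pieces)         # "1" "2"
lapply(pieces, names) # d_1_* columns in piece "1", d_2_* columns in piece "2"

# Caveat: with multi-digit indices the greedy pattern keeps only the last digit
wide_names   <- c("d_1_a", "d_10_a", "d_12_b")
greedy_key   <- sub('.*(\\d+).*', '\\1', wide_names)     # "1" "0" "2"
anchored_key <- sub('^d_(\\d+)_.*$', '\\1', wide_names)  # "1" "10" "12"
```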

[Comments]: