Summarise based on multiple columns with a lot of conditions: answers

【Question title】: summarise based on multiple columns with a lot of conditions
【Posted】: 2018-12-04 13:19:49
【Question description】:

Sample data

df <- data.frame( id = 1:10,
                  group = c(1,1,1,1,1,2,2,2,2,2),
                  p1 = c("A", NA, "A", "A", "B", NA, NA, NA, NA, "C"),
                  p2 = c("F", NA, "G", "G", "A", "H", NA, NA, NA, NA),
                  stringsAsFactors = FALSE )

#     id group   p1   p2
#  1   1     1    A    F
#  2   2     1 <NA> <NA>
#  3   3     1    A    G
#  4   4     1    A    G
#  5   5     1    B    A
#  6   6     2 <NA>    H
#  7   7     2 <NA> <NA>
#  8   8     2 <NA> <NA>
#  9   9     2 <NA> <NA>
# 10  10     2    C <NA>

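I want to summarise df by group to get the following counts per group:

the number of unique ids
the number of unique ids where any p column is not NA
the number of unique ids where any p column equals "A"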

Desired output

data.frame( group = c(1,2),
            total = c(5,5),
            with_any_p = c(4,2),
            with_any_p_is_A = c(4,0),
            stringsAsFactors = FALSE)

#   group total with_any_p with_any_p_is_A
# 1     1     5          4               4
# 2     2     5          2               0

Code so far

I know I can get the desired output with:

library(dplyr)

df %>% group_by( group ) %>%
  summarise( total = n_distinct( id ),
             with_any_p = n_distinct( id[ !is.na(p1) | !is.na(p2) ] ),
             # the comparison yields NA where p1/p2 are NA, hence na.rm = TRUE
             with_any_p_is_A = n_distinct( id[ p1 == "A" | p2 == "A" ], na.rm = TRUE ) )

# # A tibble: 2 x 4
#   group total with_any_p with_any_p_is_A
#   <dbl> <int>      <int>           <int>
# 1     1     5          4               4
# 2     2     5          2               0

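Question

But since my production data contains a lot of "p columns", I do not want to retype the or-statements above for p1 to p100.

I can use filter_at to select the rows/subsets I need: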

p.cols <- paste0( "p", 1:2 )
#for with_any_p
df %>% filter_at( vars( p.cols ), any_vars( !is.na(.) ) )
#for with_any_p_is_A
df %>% filter_at( vars( p.cols ), any_vars( . == "A" ) )

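But I do not know how to get these selections into the summarise call.

Can this be done in the same "style" as the code I already have, so that I get the desired result in one go, without binding/joining multiple results?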

【Comments】:

by(df[, c("p1", "p2")], df$group, FUN = function(x){ cbind(total = nrow(x), with_any_p = sum(as.logical(rowSums(!is.na(x)))), with_any_p_is_A = sum(as.logical(rowSums(x == "A", na.rm = TRUE)))) })
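
The rowSums idea from the comment above can also be combined with the group_by/summarise style of the question. The following is only a sketch under assumptions: the helper columns any_p and any_A and the copy named flags are made up for illustration, and p.cols is assumed to list exactly the "p" columns.

```r
library(dplyr)

df <- data.frame(id = 1:10,
                 group = c(1,1,1,1,1,2,2,2,2,2),
                 p1 = c("A", NA, "A", "A", "B", NA, NA, NA, NA, "C"),
                 p2 = c("F", NA, "G", "G", "A", "H", NA, NA, NA, NA),
                 stringsAsFactors = FALSE)

p.cols <- paste0("p", 1:2)

# Hypothetical helper columns: one logical flag per row, computed over all p columns
flags <- df
flags$any_p <- rowSums(!is.na(df[p.cols])) > 0                 # at least one non-NA value
flags$any_A <- rowSums(df[p.cols] == "A", na.rm = TRUE) > 0    # at least one "A"

res <- flags %>%
  group_by(group) %>%
  summarise(total = n_distinct(id),
            with_any_p = n_distinct(id[any_p]),
            with_any_p_is_A = n_distinct(id[any_A]))

res
```

Because the flags are never NA, no na.rm argument is needed in the summarise step, and the number of "p" columns only affects the definition of p.cols.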

Tags: r dplyr

【Solution 1】:

Here is a solution for an arbitrary number of "p" columns that starts with a wide-to-long conversion:

library(dplyr)
library(tidyr)

df %>%
    gather(key, val, -id, -group) %>%   # wide to long: one row per id and p column
    group_by(group) %>%
    summarise(
        total = n_distinct(id),
        with_any_p = n_distinct(id[!is.na(val)]),
        with_any_p_is_A = n_distinct(id[val == "A"], na.rm = TRUE))
## A tibble: 2 x 4
#  group total with_any_p with_any_p_is_A
#  <dbl> <int>      <int>           <int>
#1     1     5          4               4
#2     2     5          2               0

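Note: I assume that all columns except id and group are "p" columns. If that is not the case, you may have to change the gather statement to reflect your more general column structure.

【Discussion】:

Clever! I had not thought of switching to a long(er) format. Even better (for me) is df %>% gather(key, val, p.cols ) %>% ..., so that I do not have to deselect multiple columns...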
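
In newer versions of tidyr, gather is superseded by pivot_longer, so the same idea could be written roughly as follows. This is a sketch, not the answerer's code; it assumes tidyr 1.0.0 or later and that p.cols names exactly the "p" columns.

```r
library(dplyr)
library(tidyr)

df <- data.frame(id = 1:10,
                 group = c(1,1,1,1,1,2,2,2,2,2),
                 p1 = c("A", NA, "A", "A", "B", NA, NA, NA, NA, "C"),
                 p2 = c("F", NA, "G", "G", "A", "H", NA, NA, NA, NA),
                 stringsAsFactors = FALSE)

p.cols <- paste0("p", 1:2)

res <- df %>%
  # reshape only the listed p columns; all other columns are kept as they are
  pivot_longer(all_of(p.cols), names_to = "key", values_to = "val") %>%
  group_by(group) %>%
  summarise(total = n_distinct(id),
            with_any_p = n_distinct(id[!is.na(val)]),
            # val == "A" is NA where val is NA, so the NA ids are dropped here
            with_any_p_is_A = n_distinct(id[val == "A"], na.rm = TRUE))

res
```

Selecting the columns with all_of(p.cols) avoids having to deselect every non-"p" column, which is the point raised in the comment above.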