Summarize in dplyr and insert 0 for categories with no values

【Question title】: Summarize in dplyr and insert 0 for categories with no values
【Posted】: 2021-03-20 16:37:25
【Question】:

Suppose you have data like this:

library(dplyr)

set.seed(2021)

age <- floor(runif(35, min = 20, max = 25))

dat <- data.frame(age)

dat %>%
  mutate(education = sample(c("Low", "Mid-level", "High"), 
                           size = nrow(dat), prob = c(0.55, 0.2, 0.25), replace = TRUE)) %>%
  group_by(age, education) %>%
  summarise(n = n())

Result:

     age education     n
   <dbl> <chr>     <int>
 1    20 High          1
 2    20 Low           2
 3    21 Low           3
 4    21 Mid-level     2
 5    22 High          2
 6    22 Low           4
 7    23 Low           4
 8    23 Mid-level     2
 9    24 High          1
10    24 Low          10
11    24 Mid-level     4

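As you can see, some combinations never occur. For example, nobody aged 20 has "Mid-level" education, so that category is dropped from the data frame. Is it possible to show it with a count of 0 instead?

For example: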

# A tibble: 11 x 3
# Groups:   age [5]
     age education     n
   <dbl> <chr>     <int>
 1    20 High          1
 2    20 Low           2
 3    20 Mid-level     0

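【Comments】:

Have you tried adding .drop = FALSE to the summarising step?

Tags: r dplyr

【Solution 1】:

Instead of group_by and summarise, you can use count with the argument .drop = FALSE. The education column has to be a factor first, so you could try adding this at the end of the pipeline: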

  count(age, as.factor(education), .drop = FALSE)

Edit: tidied up the factor to get a cleaner result

dat %>%
  mutate(education = sample(
    c("Low", "Mid-level", "High"),
    size = nrow(dat),
    prob = c(0.55, 0.2, 0.25),
    replace = TRUE
  )) %>%
# convert to factor with levels in specified order
  mutate(education = factor(education, levels = c("Low", "Mid-level", "High"))) %>%
  count(age, education, .drop = FALSE)

Result:

   age education  n
1   20       Low  2
2   20 Mid-level  0
3   20      High  1
4   21       Low  3
5   21 Mid-level  2
6   21      High  0
7   22       Low  4
8   22 Mid-level  0
9   22      High  2
10  23       Low  4
11  23 Mid-level  2
12  23      High  0
13  24       Low 10
14  24 Mid-level  4
15  24      High  1
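
The same idea also works if you prefer to keep group_by and summarise: .drop is an argument of group_by, and it only preserves empty groups for factor columns. The following is a minimal sketch on a small hand-made data frame (the name toy and its values are made up for illustration and are not the question's random sample):

```r
library(dplyr)

# Hypothetical toy data: nobody aged 20 is "Mid-level", nobody aged 21 is "High"
toy <- data.frame(
  age = c(20, 20, 20, 21, 21),
  education = c("Low", "Low", "High", "Mid-level", "Low")
)

toy_counts <- toy %>%
  # empty groups are only kept for factor levels, so convert first
  mutate(education = factor(education, levels = c("Low", "Mid-level", "High"))) %>%
  # .drop = FALSE belongs to group_by(), not to summarise()
  group_by(age, education, .drop = FALSE) %>%
  summarise(n = n(), .groups = "drop")

toy_counts
```

This should give one row per observed age and education level, with n = 0 for the combinations that never occur.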

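【Discussion】:

【Solution 2】:

The combination age = 20 and education = "Mid-level" does not exist in the data frame, so summarise() has no way of knowing about it.

One way to handle this is to specify all possible combinations explicitly and join them with the summarised output, like this: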

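# stringsAsFactors = FALSE keeps education as character, matching the summarised column
join_df <- expand.grid(age = unique(age),
                       education = c("Low", "Mid-level", "High"),
                       stringsAsFactors = FALSE)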

dat %>%
  mutate(education = sample(c("Low", "Mid-level", "High"), 
                            size = nrow(dat), prob = c(0.55, 0.2, 0.25), replace = TRUE)) %>%
  group_by(age, education) %>%
  summarise(n = n()) %>% 
  full_join(join_df, by = c("age", "education")) %>% 
  tidyr::replace_na(list(n = 0)) %>% 
  arrange(age, education)
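
A variation on the same approach, sketched on a small hand-made data frame so the result is easy to check by eye: start from the grid of all combinations and left_join the counts onto it, then replace the missing counts with coalesce. The names toy, all_combos and toy_joined are made up for this illustration.

```r
library(dplyr)

# Hypothetical toy data, not the question's random sample
toy <- data.frame(
  age = c(20, 20, 20, 21, 21),
  education = c("Low", "Low", "High", "Mid-level", "Low")
)

# Every age that occurs, crossed with every education level we expect
all_combos <- expand.grid(
  age = unique(toy$age),
  education = c("Low", "Mid-level", "High"),
  stringsAsFactors = FALSE
)

toy_joined <- all_combos %>%
  # combinations missing from the counts get n = NA after the join
  left_join(count(toy, age, education), by = c("age", "education")) %>%
  # turn those NA counts into integer zeros
  mutate(n = coalesce(n, 0L)) %>%
  arrange(age, education)

toy_joined
```

Starting from the grid means every row of the result is a combination you asked for, so a left join is enough and no full join is needed.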

【Discussion】:

【Solution 3】:

Use tidyr::complete:

library(dplyr)
library(tidyr)
set.seed(2021)

age <- floor(runif(35, min = 20, max = 25))

dat <- data.frame(age)

incomplete_data <- dat %>%
  mutate(education = sample(c("Low", "Mid-level", "High"), 
    size = nrow(dat), prob = c(0.55, 0.2, 0.25), replace = TRUE)) %>%
  group_by(age, education) %>%
  summarise(n = n(), .groups = "drop")

incomplete_data:

# A tibble: 11 x 3
     age education     n
 * <dbl> <chr>     <int>
 1    20 High          1
 2    20 Low           2
 3    21 Low           3
 4    21 Mid-level     2
 5    22 High          2
 6    22 Low           4
 7    23 Low           4
 8    23 Mid-level     2
 9    24 High          1
10    24 Low          10
11    24 Mid-level     4

Using the complete function:

complete_data <- incomplete_data %>% 
  complete(age, education, fill = list(n = 0))

Output:

# A tibble: 15 x 3
     age education     n
   <dbl> <chr>     <dbl>
 1    20 High          1
 2    20 Low           2
 3    20 Mid-level     0
 4    21 High          0
 5    21 Low           3
 6    21 Mid-level     2
 7    22 High          2
 8    22 Low           4
 9    22 Mid-level     0
10    23 High          0
11    23 Low           4
12    23 Mid-level     2
13    24 High          1
14    24 Low          10
15    24 Mid-level     4
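
One caveat worth keeping in mind: complete(age, education) only combines values that actually appear somewhere in the data. If a level never occurs at all, you can pass the full set of values by name. The sketch below assumes a small hand-made data frame (toy, a hypothetical name) in which "Mid-level" is missing entirely:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: "Mid-level" does not occur for any age
toy <- data.frame(
  age = c(20, 20, 21),
  education = c("Low", "High", "Low")
)

toy_complete <- toy %>%
  count(age, education) %>%
  # supply all expected education levels explicitly;
  # 0L keeps the count column an integer
  complete(age,
           education = c("Low", "Mid-level", "High"),
           fill = list(n = 0L))

toy_complete
```

Here every age gets a row for each of the three education levels, including the one that was absent from the input.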

【Discussion】: