【Question title】: Summarize in dplyr and insert 0 for categories with no values
【Posted】: 2021-03-20 16:37:25
【Question】:

Imagine you have data like this:

library(dplyr)

set.seed(2021)

age <- floor(runif(35, min = 20, max = 25))

dat <- data.frame(age)

dat %>%
  mutate(education = sample(c("Low", "Mid-level", "High"), 
                           size = nrow(dat), prob = c(0.55, 0.2, 0.25), replace = TRUE)) %>%
  group_by(age, education) %>%
  summarise(n = n())

Result:

     age education     n
   <dbl> <chr>     <int>
 1    20 High          1
 2    20 Low           2
 3    21 Low           3
 4    21 Mid-level     2
 5    22 High          2
 6    22 Low           4
 7    23 Low           4
 8    23 Mid-level     2
 9    24 High          1
10    24 Low          10
11    24 Mid-level     4

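As you can see, some combinations have no observations (for example, "Mid-level" education at age 20), so they are dropped from the data frame. Is it possible to show such a combination with a count of 0?

For example: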

# A tibble: 11 x 3
# Groups:   age [5]
     age education     n
   <dbl> <chr>     <int>
 1    20 High          1
 2    20 Low           2
 3    20 Mid-level     0

【Comments】:

  • Have you tried adding .drop = FALSE to the summarise step?

Tags: r dplyr


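【Solution 1】:

Instead of group_by and summarise, you can use count with .drop = FALSE as an argument. The education column needs to be a factor first, so you could try ending the pipeline with:

  count(age, as.factor(education), .drop = FALSE) 

Edit: tidied up the factor for a cleaner result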

dat %>%
  mutate(education = sample(
    c("Low", "Mid-level", "High"),
    size = nrow(dat),
    prob = c(0.55, 0.2, 0.25),
    replace = TRUE
  )) %>%
  # convert to factor with levels in the specified order
  mutate(education = factor(education, levels = c("Low", "Mid-level", "High"))) %>%
  count(age, education, .drop = FALSE) 

Result:

   age education  n
1   20       Low  2
2   20 Mid-level  0
3   20      High  1
4   21       Low  3
5   21 Mid-level  2
6   21      High  0
7   22       Low  4
8   22 Mid-level  0
9   22      High  2
10  23       Low  4
11  23 Mid-level  2
12  23      High  0
13  24       Low 10
14  24 Mid-level  4
15  24      High  1
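
For readers who prefer to keep group_by() and summarise(), the same idea can be sketched as follows. Note that .drop is an argument of group_by() (and count()), not of summarise(), and it only preserves empty groups for factor columns. The object name counts is illustrative and not part of the original answer.

```r
library(dplyr)

set.seed(2021)

dat <- data.frame(age = floor(runif(35, min = 20, max = 25)))

counts <- dat %>%
  mutate(
    education = sample(c("Low", "Mid-level", "High"),
                       size = nrow(dat), prob = c(0.55, 0.2, 0.25), replace = TRUE),
    # a factor is required so that unobserved levels are known to dplyr
    education = factor(education, levels = c("Low", "Mid-level", "High"))
  ) %>%
  # .drop = FALSE keeps groups for factor levels that have no rows
  group_by(age, education, .drop = FALSE) %>%
  # n() is 0 for the empty groups
  summarise(n = n(), .groups = "drop")

counts
```

Because age is numeric rather than a factor, only the ages that actually occur in the data appear in the result; each of them is paired with every education level.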

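【Comments】:

【Solution 2】:

Because the combination age = 20, education = "Mid-level" does not exist in the data frame, summarise() has no way of knowing about it.

One way to handle this is to specify all possible combinations explicitly and join them to the output, like this: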

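    # stringsAsFactors = FALSE keeps education as character, matching the summarised data
    join_df <- expand.grid(age = unique(age), 
                           education = c("Low", "Mid-level", "High"),
                           stringsAsFactors = FALSE)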
    
    dat %>%
      mutate(education = sample(c("Low", "Mid-level", "High"), 
                                size = nrow(dat), prob = c(0.55, 0.2, 0.25), replace = TRUE)) %>%
      group_by(age, education) %>%
      summarise(n = n()) %>% 
      full_join(join_df, by = c("age", "education")) %>% 
      tidyr::replace_na(list(n = 0)) %>% 
      arrange(age, education)
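
A variant of the same join idea, sketched under the assumption that starting from the grid of combinations is acceptable: build the grid with tidyr::expand_grid(), left-join the counts onto it, and replace the missing counts with 0. The names all_combos and result are illustrative and do not come from the original answer.

```r
library(dplyr)
library(tidyr)

set.seed(2021)

dat <- data.frame(age = floor(runif(35, min = 20, max = 25)))
dat$education <- sample(c("Low", "Mid-level", "High"),
                        size = nrow(dat), prob = c(0.55, 0.2, 0.25), replace = TRUE)

# every observed age paired with every education level
all_combos <- expand_grid(
  age = sort(unique(dat$age)),
  education = c("Low", "Mid-level", "High")
)

result <- all_combos %>%
  # combinations without observations get n = NA after the join
  left_join(count(dat, age, education), by = c("age", "education")) %>%
  # turn those NAs into integer zeros
  mutate(n = coalesce(n, 0L))

result
```

Starting from the grid means the row order and the set of combinations are determined by all_combos, so no extra arrange() step is needed.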
    

【Comments】:

【Solution 3】:

Use tidyr::complete:

      library(dplyr)
      library(tidyr)
      set.seed(2021)
      
      age <- floor(runif(35, min = 20, max = 25))
      
      dat <- data.frame(age)
      
      incomplete_data <- dat %>%
        mutate(education = sample(c("Low", "Mid-level", "High"), 
          size = nrow(dat), prob = c(0.55, 0.2, 0.25), replace = TRUE)) %>%
        group_by(age, education) %>%
        summarise(n = n(), .groups = "drop")
      

Incomplete data:

      # A tibble: 11 x 3
           age education     n
       * <dbl> <chr>     <int>
       1    20 High          1
       2    20 Low           2
       3    21 Low           3
       4    21 Mid-level     2
       5    22 High          2
       6    22 Low           4
       7    23 Low           4
       8    23 Mid-level     2
       9    24 High          1
      10    24 Low          10
      11    24 Mid-level     4
      

Using the complete function:

      complete_data <- incomplete_data %>% 
        complete(age, education, fill = list(n = 0))
      

Output:

      # A tibble: 15 x 3
           age education     n
         <dbl> <chr>     <dbl>
       1    20 High          1
       2    20 Low           2
       3    20 Mid-level     0
       4    21 High          0
       5    21 Low           3
       6    21 Mid-level     2
       7    22 High          2
       8    22 Low           4
       9    22 Mid-level     0
      10    23 High          0
      11    23 Low           4
      12    23 Mid-level     2
      13    24 High          1
      14    24 Low          10
      15    24 Mid-level     4
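
In the output above n is printed as <dbl>, because the fill value 0 is a double (depending on the tidyr version, the column may instead stay integer). A minimal sketch that passes an integer zero, 0L, so the count column remains integer either way; the object name complete_int is illustrative:

```r
library(dplyr)
library(tidyr)

set.seed(2021)

dat <- data.frame(age = floor(runif(35, min = 20, max = 25)))
dat$education <- sample(c("Low", "Mid-level", "High"),
                        size = nrow(dat), prob = c(0.55, 0.2, 0.25), replace = TRUE)

complete_int <- dat %>%
  count(age, education) %>%
  # 0L is an integer literal, so n keeps the integer type produced by count()
  complete(age, education, fill = list(n = 0L))

complete_int
```

Note that complete() only knows about values that occur somewhere in the data, so a level that never appears at all would still be missing unless the column is a factor or the values are listed explicitly.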
      

【Comments】:
