【Question Title】: Is there a way to group data twice (i.e. by month and then by season) in R?
【Posted】: 2020-02-29 15:20:52
【Question】:

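I am trying to answer this question:

Use the nycflights13 package and the flights data frame to answer the following: Which month had the highest proportion of cancelled flights? Which month had the lowest? Interpret any seasonal patterns.

I have technically answered the question, but I am trying to produce a more concise tibble than the one I have now.

Here is what I have so far: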

#Load packages
library(nycflights13)
library(tidyverse)

#Data frame "cancprop" with three new variables ("canc" = flights that were canceled, "notc" = flights that were not canceled, and "canp" = proportion of all flights that were canceled)
cancprop <- flights %>%
  mutate(
    canc = is.na(dep_time),
    notc = !is.na(dep_time),
    canp = canc / (canc + notc)
  )

#A tibble showing the average proportion of all flights that were canceled by month sorted by descending average proportion.
cancprop %>%
  group_by(month) %>% 
  summarize(mcanp = mean(canp)) %>% 
  arrange(desc(mcanp))
# A tibble: 12 x 2
   month   mcanp
   <int>   <dbl>
 1     2 0.0505 
 2    12 0.0364 
 3     6 0.0357 
 4     7 0.0319 
 5     3 0.0299 
 6     4 0.0236 
 7     5 0.0196 
 8     1 0.0193 
 9     8 0.0166 
10     9 0.0164 
11    11 0.00854
12    10 0.00817

#Data frame "seas" with a new variable ("season" = the season corresponding with the month)
seas <- cancprop %>% 
  group_by(month) %>% 
  summarize(mcanp = mean(canp)) %>% 
  mutate(
    season = case_when(
      month %in% 3:5 ~ "Spring",
      month %in% 6:8 ~ "Summer",
      month %in% 9:11 ~ "Fall",
      TRUE ~ "Winter"
    ))
seas
# A tibble: 12 x 3
   month   mcanp season
   <int>   <dbl> <chr> 
 1     1 0.0193  Winter
 2     2 0.0505  Winter
 3     3 0.0299  Spring
 4     4 0.0236  Spring
 5     5 0.0196  Spring
 6     6 0.0357  Summer
 7     7 0.0319  Summer
 8     8 0.0166  Summer
 9     9 0.0164  Fall  
10    10 0.00817 Fall  
11    11 0.00854 Fall  
12    12 0.0364  Winter

#A plot showing the proportion of flights canceled
ggplot(seas, aes(x = factor(month), y = mcanp, fill = season)) +
  geom_bar(stat = "identity") +
  labs(x = "Month", y = "Proportion of Flights Canceled", fill = "Season")

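What I want to create is a tibble showing the average proportion of cancelled flights for each season, like this one (random, non-calculated proportions, since I am not sure how to actually get the result):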

# A tibble: 4 x 2
       season   mcanp
        <chr>   <dbl> 
 1     Winter  0.0433
 2     Spring  0.0235
 3     Summer  0.0109
 4     Fall    0.0246

Thanks for your help!

【Comments】:

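  • I think you need seas %>% group_by(season) %>% summarise(mcanp = mean(mcanp))
  • That works in this case, but it is not quite what I want, because it takes the average of the monthly averages within each season rather than the average for the season as a whole.
  • Using seas %>% group_by(season) %>% summarise(mcanp = mean(mcanp)) gives 1 Winter 0.0354, 2 Summer 0.0281, 3 Spring 0.0243, 4 Fall 0.0110, whereas the answer I am looking for is 1 Winter 0.0350, 2 Summer 0.0280, 3 Spring 0.0243, 4 Fall 0.0110.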
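
The small gap between the two sets of numbers in the comments above (0.0354 vs. 0.0350 for Winter) arises because an unweighted mean of monthly proportions treats every month equally, while the seasonal proportion weights each month by how many flights it had. A minimal base R sketch on made-up data (the `toy` data frame and its values are hypothetical, not taken from `flights`) illustrates the difference:

```r
# Toy data (hypothetical): two months of one season with very different flight counts.
toy <- data.frame(
  month     = c(rep(1, 10), rep(2, 2)),
  cancelled = c(TRUE, rep(FALSE, 9), TRUE, FALSE)
)

# Per-month proportion of cancelled flights: 0.1 for month 1, 0.5 for month 2.
monthly <- as.numeric(tapply(toy$cancelled, toy$month, mean))
# Number of flights per month, used as weights below: 10 and 2.
n_month <- as.numeric(tapply(toy$cancelled, toy$month, length))

mean_of_means <- mean(monthly)                       # every month counts equally
pooled        <- mean(toy$cancelled)                 # every flight counts equally
weighted      <- weighted.mean(monthly, w = n_month) # monthly means weighted by flight count

print(c(mean_of_means = mean_of_means, pooled = pooled, weighted = weighted))
```

Here the mean of the monthly means is 0.3, while the pooled proportion is 2/12; weighting the monthly means by flight count recovers the pooled value.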

Tags: r dplyr


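【Solution 1】:

If I understand you correctly, you want the cancellation proportion by season. If so, you have already done most of the work yourself. Do not group_by month and then season in sequence because, as your comment correctly points out, that computes the average of the monthly cancellation proportions within each season. Instead, create the season variable inside mutate on the ungrouped data frame, and then group by season alone.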

cancprop <- flights %>%
 mutate(
  canc = is.na(dep_time),
  notc = !is.na(dep_time),
  canp = canc / (canc + notc),
  season = case_when(
     month %in% 3:5  ~ "Spring",
     month %in% 6:8  ~ "Summer",
     month %in% 9:11 ~ "Fall",
     TRUE            ~ "Winter"))

cancprop %>%
 group_by(season) %>% 
 summarize(mcanp = mean(canp)) %>% 
 arrange(desc(mcanp))

# A tibble: 4 x 2
  season  mcanp
  <chr>   <dbl>
1 Winter 0.0350
2 Summer 0.0280
3 Spring 0.0243
4 Fall   0.0110

These are the cancellation proportions by season, in descending order.
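
If a monthly table is still wanted as an intermediate step, one way to group twice without distorting the seasonal figure is to carry counts, not proportions, through the first summary. The following is only a sketch under assumptions: `toy` is a made-up stand-in for `flights`, and the column names `n_canc` and `n_total` are hypothetical.

```r
library(dplyr)

# Toy stand-in for `flights` (hypothetical values): an NA dep_time marks a cancellation.
toy <- tibble(
  month    = c(12, 12, 12, 12, 1, 1, 6, 6, 6, 6),
  dep_time = c(NA, 600, 615, 630, NA, 700, 800, NA, 815, 830)
)

# First grouping: one row per month, keeping counts instead of only a proportion.
monthly <- toy %>%
  mutate(season = case_when(
    month %in% 3:5  ~ "Spring",
    month %in% 6:8  ~ "Summer",
    month %in% 9:11 ~ "Fall",
    TRUE            ~ "Winter"
  )) %>%
  group_by(season, month) %>%
  summarise(n_canc = sum(is.na(dep_time)), n_total = n(), .groups = "drop")

# Second grouping: rebuild the seasonal proportion from the monthly counts.
seasonal <- monthly %>%
  group_by(season) %>%
  summarise(mcanp = sum(n_canc) / sum(n_total))

print(seasonal)
```

Because the counts are summed before dividing, the seasonal value matches what group_by(season) on the raw rows would give, which is not generally true when averaging the monthly proportions.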

【Comments】:

    【Solution 2】:

    I figured it out: I needed to start from the whole data frame rather than from the data already grouped by month.

    library(nycflights13)
    library(tidyverse)
    
    cancprop <- flights %>%
      mutate(
        canc = is.na(dep_time),
        notc = !is.na(dep_time),
        canp = canc / (canc + notc),
        season = case_when(
          month %in% 3:5 ~ "Spring",
          month %in% 6:8 ~ "Summer",
          month %in% 9:11 ~ "Fall",
          TRUE ~ "Winter"
        )
      )
    cancprop
    # A tibble: 336,776 x 23
        year month   day dep_time sched_dep_time
       <int> <int> <int>    <int>          <int>
     1  2013     1     1      517            515
     2  2013     1     1      533            529
     3  2013     1     1      542            540
     4  2013     1     1      544            545
     5  2013     1     1      554            600
     6  2013     1     1      554            558
     7  2013     1     1      555            600
     8  2013     1     1      557            600
     9  2013     1     1      557            600
    10  2013     1     1      558            600
    # ... with 336,766 more rows, and 18 more
    #   variables: dep_delay <dbl>, arr_time <int>,
    #   sched_arr_time <int>, arr_delay <dbl>,
    #   carrier <chr>, flight <int>, tailnum <chr>,
    #   origin <chr>, dest <chr>, air_time <dbl>,
    #   distance <dbl>, hour <dbl>, minute <dbl>,
    #   time_hour <dttm>, canc <lgl>, notc <lgl>,
    #   canp <dbl>, season <chr>
    
    
    mcp <- cancprop %>%
      group_by(month, season) %>% 
      summarize(mcanp = mean(canp)) %>% 
      arrange(desc(mcanp))
    mcp
    # A tibble: 12 x 3
    # Groups:   month [12]
       month season   mcanp
       <int> <chr>    <dbl>
     1     2 Winter 0.0505 
     2    12 Winter 0.0364 
     3     6 Summer 0.0357 
     4     7 Summer 0.0319 
     5     3 Spring 0.0299 
     6     4 Spring 0.0236 
     7     5 Spring 0.0196 
     8     1 Winter 0.0193 
     9     8 Summer 0.0166 
    10     9 Fall   0.0164 
    11    11 Fall   0.00854
    12    10 Fall   0.00817
    
    ggplot(mcp, aes(x = factor(month), y = mcanp, fill = season)) +
      geom_bar(stat = "identity") +
      labs(x = "Month", y = "Proportion of Flights Canceled", fill = "Season")
    # February had the highest proportion of canceled flights and October had the lowest.
    
    
    scp <- cancprop %>% 
      group_by(season) %>% 
      summarize(mcanp = mean(canp)) %>% 
      arrange(desc(mcanp))
    scp
    # A tibble: 4 x 2
      season  mcanp
      <chr>   <dbl>
    1 Winter 0.0350
    2 Summer 0.0280
    3 Spring 0.0243
    4 Fall   0.0110
    
    ggplot(scp, aes(x = factor(season), y = mcanp, fill = season)) +
      geom_bar(stat = "identity") +
      labs(x = "Season", y = "Proportion of Flights Canceled", fill = "Season")
    # Winter had the highest proportion of canceled flights and Fall had the lowest.
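
As a side note on the code above, the helper columns can probably be collapsed: since notc is the negation of canc, canc + notc is always 1, so canp is just canc as a number and mean(canp) equals mean(is.na(dep_time)). A small base R check on made-up values (the `dep_time` vector here is hypothetical):

```r
# Hypothetical departure times; NA stands for a cancelled flight.
dep_time <- c(517, NA, 542, NA, 554)

canc <- is.na(dep_time)
notc <- !is.na(dep_time)
canp <- canc / (canc + notc)   # the denominator is always 1

# canp carries the same information as canc, just stored as numeric.
same_values <- identical(canp, as.numeric(canc))

# So the proportion of cancelled flights can be computed in one step.
prop_direct <- mean(is.na(dep_time))

print(same_values)
print(c(via_canp = mean(canp), direct = prop_direct))
```

With that in mind, summarize(mcanp = mean(is.na(dep_time))) after group_by(season) should give the same seasonal table without the intermediate canc, notc and canp columns.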
    

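    【Comments】:

    • It looks like you figured it out yourself while I was posting my reply. Double grouping is not necessary to answer your question. Nice work figuring it out on your own!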