【Question title】: Why does a mutate following a group_by(year, month) seem to miss a row?
【Posted】: 2021-04-08 03:35:24
【Question description】:

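I have a data frame of daily observations that I convert to monthly frequency, including a simple transformation based on the summarised values: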

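library(dplyr)      # tibble(), mutate(), group_by(), summarise(), lag(), last()
library(lubridate)  # ymd(), year(), month()

tibble(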
  date = ymd("2002-12-31") + c(0:60),
  index = 406 * exp(cumsum(rnorm(61,0,0.01)))
) %>% mutate(
  year = year(date),
  month = month(date)
) %>% group_by(year, month) %>% summarise(
  date = last(date),
  month.close = last(index),
) %>% mutate(
  month.change = log(month.close / lag(month.close))
)

The code looks simple enough, but when I run it I get something odd:

`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 4 x 5
# Groups:   year [2]
   year month date       month.close month.change
  <dbl> <dbl> <date>           <dbl>        <dbl>
1  2002    12 2002-12-31        403.     NA      
2  2003     1 2003-01-31        419.     NA      
3  2003     2 2003-02-28        422.      0.00572
4  2003     3 2003-03-01        417.     -0.0121 

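Rows 1 and 2 both have valid month.close values, so why does row 2 have no month.change value? Does the summarise() operation act separately on the two grouping dimensions?

I really need to understand why this happens, so please don't just tell me to use a different function to collapse the periodicity. I want to know which part of my implementation I have misunderstood, so that I don't introduce a similar bug elsewhere later. I know it has something to do with grouping by 2 variables, because when I collapse the two columns into one I get the expected behaviour.

This code: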

library(zoo)
tibble(
  date = ymd("2002-12-31") + c(0:60),
  index = 406 * exp(cumsum(rnorm(61,0,0.01)))
) %>% mutate(
  year.month = as.yearmon(date)
) %>% group_by(year.month) %>% summarise(
  date = last(date),
  month.close = last(index),
) %>% mutate(
  month.change = log(month.close / lag(month.close))
)

returns the expected result:

`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 4 x 4
  year.month date       month.close month.change
  <yearmon>  <date>           <dbl>        <dbl>
1 Dec 2002   2002-12-31        405.     NA      
2 Jan 2003   2003-01-31        428.      0.0560 
3 Feb 2003   2003-02-28        421.     -0.0173 
4 Mar 2003   2003-03-01        423.      0.00513

What am I missing?

【Discussion】:

    Tags: r tidyr dplyr summarize


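    【Solution 1】:

    By default, when you use group_by together with summarise, only the last level of grouping is dropped.

    So at this stage your data is still grouped by year, and the mutate that follows applies lag() within each year. January 2003 is the first row of its group, hence the NA.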

    tibble(
      date = ymd("2002-12-31") + c(0:60),
      index = 406 * exp(cumsum(rnorm(61,0,0.01)))
    ) %>% mutate(
      year = year(date),
      month = month(date)
    ) %>% group_by(year, month) %>% summarise(
      date = last(date),
      month.close = last(index))
    
    # A tibble: 4 x 4
    # Groups:   year [2] # <- Notice this
    #   year month date       month.close
    #  <int> <int> <date>           <dbl>
    #1  2002    12 2002-12-31        411.
    #2  2003     1 2003-01-31        393.
    #3  2003     2 2003-02-28        406.
    #4  2003     3 2003-03-01        398.
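
The effect of the leftover grouping can be checked on a small deterministic example. This is only a sketch: the `monthly` tibble and its `close` values are made up for illustration (they are not the asker's random data), and `.groups = "drop_last"` simply spells out the default behaviour described above.

```r
library(dplyr)

# Hypothetical month-end closes, chosen so the arithmetic is easy to check.
monthly <- tibble(
  year  = c(2002, 2003, 2003, 2003),
  month = c(12, 1, 2, 3),
  close = c(100, 110, 121, 133.1)
)

by_year <- monthly %>%
  group_by(year, month) %>%
  summarise(month.close = last(close), .groups = "drop_last")

# Only the last grouping variable (month) was peeled off.
group_vars(by_year)
#> [1] "year"

# lag() restarts inside each year, so the first row of 2003 has no predecessor.
grouped_change <- by_year %>%
  mutate(month.change = log(month.close / lag(month.close)))

is.na(grouped_change$month.change)
#> [1]  TRUE  TRUE FALSE FALSE
```

Calling group_vars() on an intermediate result is a quick way to find out which grouping a later mutate() will respect.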
    

    To get around this behaviour, specify .groups = 'drop' in summarise(), or call ungroup() after the step above:

    tibble(
      date = ymd("2002-12-31") + c(0:60),
      index = 406 * exp(cumsum(rnorm(61,0,0.01)))
    ) %>% mutate(
      year = year(date),
      month = month(date)
    ) %>% group_by(year, month) %>% summarise(
      date = last(date),
      month.close = last(index), .groups = 'drop',
    ) %>% mutate(
      month.change = log(month.close / lag(month.close))
    )
    
    #   year month date       month.close month.change
    #  <int> <int> <date>           <dbl>        <dbl>
    #1  2002    12 2002-12-31        399.    NA       
    #2  2003     1 2003-01-31        380.    -0.0510  
    #3  2003     2 2003-02-28        381.     0.00257 
    #4  2003     3 2003-03-01        381.     0.000673
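
The ungroup() route mentioned above should give the same result. A minimal sketch, again using a made-up `monthly` tibble rather than the original random data:

```r
library(dplyr)

# Hypothetical month-end closes; each month is 10% above the previous one.
monthly <- tibble(
  year  = c(2002, 2003, 2003, 2003),
  month = c(12, 1, 2, 3),
  close = c(100, 110, 121, 133.1)
)

ungrouped_change <- monthly %>%
  group_by(year, month) %>%
  summarise(month.close = last(close), .groups = "drop_last") %>%
  ungroup() %>%  # remove the leftover year grouping before lag()
  mutate(month.change = log(month.close / lag(month.close)))

# Only the very first row is NA; the rest are all log(1.1).
ungrouped_change$month.change
```

Which of the two to use is mostly a matter of taste: `.groups = 'drop'` keeps the intent inside summarise(), while ungroup() also works on data that was grouped somewhere further upstream.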
    

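    In your second attempt the data is grouped by only one key, so summarise drops that grouping entirely and lag() runs over all rows, which gives the expected output.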
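
The single-key case can be confirmed the same way. In this sketch the numeric `ym` column is a hypothetical stand-in for the as.yearmon() key, used only to avoid depending on zoo:

```r
library(dplyr)

monthly <- tibble(
  ym    = c(200212, 200301, 200302, 200303),  # hypothetical combined year-month key
  close = c(100, 110, 121, 133.1)
)

single_key <- monthly %>%
  group_by(ym) %>%
  summarise(month.close = last(close))

# With one grouping variable, dropping the last level leaves no groups at all.
group_vars(single_key)
#> character(0)
```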
