Grouping a data frame by dates: resolving the missing time periods bug

【Question title】: Grouping a data frame by dates: resolve missing time periods' bug
【Posted】: 2017-08-31 14:21:46
【Question】:

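I've found a hard-to-resolve bug in some nice code I received a few weeks ago from a generous responder here on StackOverflow (unless I introduced it myself), and I could use some fresh help today.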

Sample data (referred to below as the object eh):

    ID        2013-03-20 2013-04-09 2013-04-11 2013-04-17 2013-04-25 2013-05-15 2013-05-24 2013-05-25 2013-05-26
    5167f          0          0          0          0          0          0          0          0          0
    1214m          0          0          0          0          0          0          0          0          0
    1844f          0          0          0          0          0          0          0          0          0
    2113m          0          0          0          0          0          0          0          0          0
    2254m          0          0          0          0          0          0          0          0          0
    2721f          0          0          0          0          0          0          0          0          0
    3121f          0          0          0          0          0          0          0          0          0
    3486f          0          0          0          0          0          0          0          0          0
    3540f          0          0          0          0          0          0          0          0          0
    4175m          0          0          0          0          0          0          0          0          0

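I need to be able to group the 0s and 1s by the time period (e.g., every 1, 2, 3, or 4 weeks) that their column dates fall in. Whenever a 1 falls at least once within a particular date range (Period), a 1 is summarized for that ID in that Period; otherwise a 0.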

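I use the 1-week summary routine as an example. My main problem is that the final output is missing some of the possible 1-week Periods in the time series from "2013-03-20" to "2015-12-31".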

Note how Periods 2, 5, 7, and 9 are missing in this sample output, where rows are unique IDs and columns are unique Periods:

    1   3   4   6   8   10  11  12  13  14
    0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0

Here is the full routine for grouping the original data frame (see the sample data shared above):

    # Packages used by this routine
    library(data.table)
    library(tidyverse)
    library(lubridate)

    #Convert to data table from original data frame, eh
    dt <- as.data.table(eh)

    #One week summarized encounter histories
    dt_merge <- data_frame(
      # Create a column showing the beginning date
      Date1 = seq(from = ymd("2013-03-20"), to = ymd("2015-12-31"), by = "1 week")) %>%
      # Create  a column showing the end date of each period
      mutate(Date2 = lead(Date1)) %>%
      # Adjust Date1
      mutate(Date1 = if_else(Date1 == ymd("2013-03-20"), Date1, Date1 + 1)) %>%
      # Remove the last row
      drop_na(Date2) %>%
      # Create date list
      mutate(Dates = map2(Date1, Date2, function(x, y){ seq(x, y, by = "day") })) %>%
      unnest() %>%
      # Create Group ID
      mutate(RunID = group_indices_(., .dots = c("Date1", "Date2"))) %>%
      # Create Period ID
      mutate(Period = paste0(RunID)) %>%
      # Add a column showing Month
      mutate(Month = month(Dates)) %>%
      # Add a column showing Year
      mutate(Year = year(Dates)) %>%
      # Add a column showing season
      mutate(Season = case_when(
        Month %in% 3:5            ~ "Spring",
        Month %in% 6:8            ~ "Summer",
        Month %in% 9:11           ~ "Fall",
        Month %in% c(12, 1, 2)    ~ "Winter",
        TRUE                      ~ NA_character_
      )) %>%
      # Combine Season and Year
      mutate(SeasonYear = paste0(Season, Year)) %>%
      select(-Date1, -Date2, -RunID)
    dt2 <- dt %>%
      # Reshape the data frame
      gather(Date, Value, -ID) %>%
      # Convert Date to date class
      mutate(Date = ymd(Date)) %>%
      # Join dt_merge
      left_join(dt_merge, by = c("Date" = "Dates")) 
    one.week <- dt2 %>%
      group_by(ID, Period) %>%
      summarise(Value = max(Value)) %>%
      spread(Period, Value)

    #Finished product
    one.week <- as.data.frame(one.week)

    #Missing weeks 2, 5, 7, and 9...
    one.week

Can anyone help me understand where I went wrong? Thanks in advance!

-AD

【Question comments】:

Tags: r dataframe dplyr lubridate tidyverse

【Solution 1】:

This happens because those weeks are missing from the eh data. For example, if you look at the dates that make up week 2:

    dt_merge %>%
      filter(Period == 2)
    #> # A tibble: 7 x 6
    #>        Dates Period Month  Year Season SeasonYear
    #>       <date>  <chr> <dbl> <dbl>  <chr>      <chr>
    #> 1 2013-03-28      2     3  2013 Spring Spring2013
    #> 2 2013-03-29      2     3  2013 Spring Spring2013
    #> 3 2013-03-30      2     3  2013 Spring Spring2013
    #> 4 2013-03-31      2     3  2013 Spring Spring2013
    #> 5 2013-04-01      2     4  2013 Spring Spring2013
    #> 6 2013-04-02      2     4  2013 Spring Spring2013
    #> 7 2013-04-03      2     4  2013 Spring Spring2013

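You can see that none of these dates appear among the columns of eh, which jump from 2013-03-20 to 2013-04-09. Because you used left_join when creating dt2, only the dates (and therefore the weeks) present in eh are kept.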
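To make the join behaviour concrete, here is a minimal sketch on toy data. The names `obs` and `periods` are hypothetical stand-ins for the reshaped eh data and for dt_merge, and the toy periods are plain 7-day blocks rather than the exact ranges built in the question.

```r
library(dplyr)

# Hypothetical stand-in for the reshaped eh data: only two observed dates
obs <- tibble(
  Date  = as.Date(c("2013-03-20", "2013-04-09")),
  Value = c(0, 0)
)

# Hypothetical stand-in for dt_merge: 21 consecutive days in three periods
periods <- tibble(
  Dates  = as.Date("2013-03-20") + 0:20,
  Period = rep(1:3, each = 7)
)

# left_join() keeps every row of `obs` and only the matching rows of `periods`
joined <- left_join(obs, periods, by = c("Date" = "Dates"))

# Periods that survive the join
sort(unique(joined$Period))

# Periods defined in `periods` that never appear after the join
setdiff(periods$Period, joined$Period)
```

Running the same `setdiff()` check against the real dt_merge and dt2 should list the Periods that have no observed date in eh.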

This can be corrected by using complete() from the tidyr package to create the missing ID and date combinations.

    dt2 <- dt %>%
      # Reshape the data frame
      gather(Date, Value, -ID) %>%
      # Convert Date to date class
      mutate(Date = ymd(Date)) %>%
      # Create missing ID/Date combinations
      complete(ID, Date = dt_merge$Dates) %>%
      # Join dt_merge
      left_join(dt_merge, by = c("Date" = "Dates"))
    one.week <- dt2 %>%
      mutate(Period = as.numeric(Period)) %>%
      group_by(ID, Period) %>%
      summarise(Value = max(Value, na.rm = TRUE)) %>%
      spread(Period, Value)
    one.week
    #> # A tibble: 10 x 146
    #> # Groups:   ID [10]
    #>       ID   `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`  `10`  `11`
    #>  * <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    #>  1 1214m     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
    #>  2 1844f     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
    #>  3 2113m     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
    #>  4 2254m     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
    #>  5 2721f     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
    #>  6 3121f     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
    #>  7 3486f     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
    #>  8 3540f     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
    #>  9 4175m     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
    #> 10 5167f     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
    #> # ... with 134 more variables: `12` <dbl>, `13` <dbl>, `14` <dbl>,
    #> #   `15` <dbl>, `16` <dbl>, `17` <dbl>, `18` <dbl>, `19` <dbl>,
    #> #   `20` <dbl>, `21` <dbl>, `22` <dbl>, `23` <dbl>, `24` <dbl>,
    #> #   `25` <dbl>, `26` <dbl>, `27` <dbl>, `28` <dbl>, `29` <dbl>,
    #> #   `30` <dbl>, `31` <dbl>, `32` <dbl>, `33` <dbl>, `34` <dbl>,
    #> #   `35` <dbl>, `36` <dbl>, `37` <dbl>, `38` <dbl>, `39` <dbl>,
    #> #   `40` <dbl>, `41` <dbl>, `42` <dbl>, `43` <dbl>, `44` <dbl>,
    #> #   `45` <dbl>, `46` <dbl>, `47` <dbl>, `48` <dbl>, `49` <dbl>,
    #> #   `50` <dbl>, `51` <dbl>, `52` <dbl>, `53` <dbl>, `54` <dbl>,
    #> #   `55` <dbl>, `56` <dbl>, `57` <dbl>, `58` <dbl>, `59` <dbl>,
    #> #   `60` <dbl>, `61` <dbl>, `62` <dbl>, `63` <dbl>, `64` <dbl>,
    #> #   `65` <dbl>, `66` <dbl>, `67` <dbl>, `68` <dbl>, `69` <dbl>,
    #> #   `70` <dbl>, `71` <dbl>, `72` <dbl>, `73` <dbl>, `74` <dbl>,
    #> #   `75` <dbl>, `76` <dbl>, `77` <dbl>, `78` <dbl>, `79` <dbl>,
    #> #   `80` <dbl>, `81` <dbl>, `82` <dbl>, `83` <dbl>, `84` <dbl>,
    #> #   `85` <dbl>, `86` <dbl>, `87` <dbl>, `88` <dbl>, `89` <dbl>,
    #> #   `90` <dbl>, `91` <dbl>, `92` <dbl>, `93` <dbl>, `94` <dbl>,
    #> #   `95` <dbl>, `96` <dbl>, `97` <dbl>, `98` <dbl>, `99` <dbl>,
    #> #   `100` <dbl>, `101` <dbl>, `102` <dbl>, `103` <dbl>, `104` <dbl>,
    #> #   `105` <dbl>, `106` <dbl>, `107` <dbl>, `108` <dbl>, `109` <dbl>,
    #> #   `110` <dbl>, `111` <dbl>, ...

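If an ID has no value in a given week, -Inf is returned, because max() with na.rm = TRUE has nothing left to compare once the NAs are dropped. Alternatively, instead of filling the missing values with NA, you can fill them with, for example, 0 by using complete(ID, Date = dt_merge$Dates, fill = list(Value = 0)). This sets the Value variable to 0 for any unobserved ID and date combination.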
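The difference between the two fill behaviours can be seen in a small sketch. The data here are hypothetical (one ID, three dates, only one of them observed); `all_dates` stands in for dt_merge$Dates.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one ID observed on a single date
obs <- tibble(
  ID    = "1214m",
  Date  = as.Date("2013-03-20"),
  Value = 0
)

# Stand-in for dt_merge$Dates: the full set of dates that should exist
all_dates <- as.Date("2013-03-20") + 0:2

# Default behaviour: unobserved ID/Date combinations get Value = NA
with_na <- complete(obs, ID, Date = all_dates)

# With `fill`: unobserved ID/Date combinations get Value = 0
with_zero <- complete(obs, ID, Date = all_dates, fill = list(Value = 0))

# Why -Inf shows up: after na.rm = TRUE drops the NAs there is nothing
# left to compare, so max() returns -Inf (and emits a warning)
all_na_max <- suppressWarnings(max(NA_real_, na.rm = TRUE))
all_na_max
```

With the filled version, summarising by max(Value) gives 0 for periods that have no observations, so the -Inf entries should no longer appear. Whether 0 or NA is the right marker for an unobserved week depends on how the encounter histories are analysed afterwards.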

【Comments】:

That did it. Thank you so much!