Expand rows by date range using start and end date: answers

【Question title】: Expand rows by date range using start and end date
【Posted】: 2014-07-17 12:18:07
【Question description】:

Consider a data frame of the form

       idnum      start        end
1993.1    17 1993-01-01 1993-12-31
1993.2    17 1993-01-01 1993-12-31
1993.3    17 1993-01-01 1993-12-31

start and end are both of type Date

 $ idnum : int  17 17 17 17 27 27
 $ start : Date, format: "1993-01-01" "1993-01-01" "1993-01-01" "1993-01-01" ...
 $ end   : Date, format: "1993-12-31" "1993-12-31" "1993-12-31" "1993-12-31" ...

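I would like to create a new data frame that instead has one monthly observation per row, for every month between start and end (including the boundaries):

Desired output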

idnum       month
   17  1993-01-01
   17  1993-02-01
   17  1993-03-01
...
   17  1993-11-01
   17  1993-12-01

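I'm not sure what format month should have; at some point I will want to group by idnum, month to run regressions on the rest of the data set.

So far, for a single row, seq(from=test[1,'start'], to=test[1, 'end'], by='1 month') gives me the right sequence, but as soon as I try to apply it to the whole data frame, it fails: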

> foo <- apply(test, 1, function(x) seq(x['start'], to=x['end'], by='1 month'))
Error in to - from : non-numeric argument to binary operator
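
A likely cause of this error, sketched here on made-up data that mirrors the sample above: apply() first converts the data frame to a matrix, and a matrix built from mixed column types is a character matrix, so seq() receives strings instead of Date values. Map() is swapped in below as one base R way to keep the Date class; the data frame contents and the names months and res are assumptions for illustration.

```r
# Hypothetical data mirroring the structure in the question
test <- data.frame(
  idnum = c(17L, 27L),
  start = as.Date(c("1993-01-01", "1994-03-01")),
  end   = as.Date(c("1993-12-31", "1994-05-31"))
)

# apply() operates on as.matrix(test), which is a character matrix,
# so the Date class is already gone before seq() is called
class(as.matrix(test)[1, "start"])

# Map() walks the two Date columns in parallel and keeps each value a Date
months <- Map(seq, test$start, test$end, by = "1 month")

# Stack the per-row sequences back into a single data frame
res <- data.frame(
  idnum = rep(test$idnum, lengths(months)),
  month = do.call(c, months)
)
res
```

Keeping month as a Date, as done here, means it can still be used directly as a grouping variable later.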

【Question discussion】:

As a beginner in R, how should I judge the answers? Is there a way to check their efficiency, like %timeit in Python?

Tags: r

【Solution 1】:

Using data.table:

require(data.table) ## 1.9.2+
setDT(df)[ , list(idnum = idnum, month = seq(start, end, by = "month")), by = 1:nrow(df)]

# you may use dot notation as a shorthand alias of list in j:
setDT(df)[ , .(idnum = idnum, month = seq(start, end, by = "month")), by = 1:nrow(df)]

setDT converts df to a data.table. Then, for each row (by = 1:nrow(df)), we create idnum and month as required.
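
A minimal reproducible sketch of the same row-wise grouping, assuming the data.table package is installed. The input data is made up, and the grouping column is given the explicit, hypothetical name row so that it can be dropped afterwards; this is a variation on the call above, not the answer's exact code.

```r
library(data.table)

# Hypothetical input with the same columns as in the question
df <- data.frame(
  idnum = c(17L, 27L),
  start = as.Date(c("1993-01-01", "1994-03-01")),
  end   = as.Date(c("1993-03-31", "1994-04-30"))
)

# One group per row, so seq() sees exactly one start and one end per group;
# idnum has length 1 within each group and is recycled along the sequence
out <- setDT(df)[, .(idnum = idnum, month = seq(start, end, by = "month")),
                 by = .(row = seq_len(nrow(df)))]

# Drop the helper grouping column added by `by`
out[, row := NULL]
out
```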

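【Discussion】:

As far as I can tell, this is the most efficient answer. A quick follow-up: suppose I actually have a long list of columns that should go into the new data frame, not just idnum. Is there an elegant way to supply them? Replacing idnum=idnum with colnames(df) surely won't work.
On a small data set of about 40k records, this was 25 times faster than the dplyr::rowwise() option.
How can I use multiple columns instead of idnum?
@jeganathanvelu That is best asked as a separate question.

【Solution 2】:

Using dplyr: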

test %>%
    group_by(idnum) %>%
    summarize(start=min(start),end=max(end)) %>%
    do(data.frame(idnum=.$idnum, month=seq(.$start,.$end,by="1 month")))

Note that this does not generate the sequence between start and end for each row; it generates the sequence between min(start) and max(end) for each idnum. If you want the former:

test %>%
    rowwise() %>%
    do(data.frame(idnum=.$idnum, month=seq(.$start,.$end,by="1 month")))

【Discussion】:

【Solution 3】:

Update 2

With newer versions of purrr (0.3.0) and dplyr (0.8.0), this can be done with map2

library(dplyr)
library(purrr)
library(tidyr)  # provides unnest()
test %>%
    # sequence of monthly dates for each corresponding start, end elements
    transmute(idnum, month = map2(start, end, seq, by = "1 month")) %>%
    # unnest the list column
    unnest(month) %>%
    # remove any duplicate rows
    distinct()

Update

Based on @Ananda Mahto's comments

 library(reshape2)  # melt() for lists is provided by reshape2
 res1 <- melt(setNames(lapply(1:nrow(test), function(x) seq(test[x, "start"],
 test[x, "end"], by = "1 month")), test$idnum))

Also,

  res2 <- setNames(do.call(`rbind`,
          with(test, 
          Map(`expand.grid`,idnum,
          Map(`seq`, start, end, by='1 month')))), c("idnum", "month"))


  head(res1)
 #  idnum      month
 #1    17 1993-01-01
 #2    17 1993-02-01
 #3    17 1993-03-01
 #4    17 1993-04-01
 #5    17 1993-05-01
 #6    17 1993-06-01

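【Discussion】:

+1. I've used melt(setNames(lapply(1:nrow(test), function(x) seq(test[x, "start"], test[x, "end"], by = "1 month")), test$idnum)) to avoid calling data.frame unnecessarily.
If all of these approaches work with my version of R, how should I choose? I'm a complete beginner here... Do some of these approaches generalize better to similar problems, or are some newer and less likely to be deprecated? Is there a routine I can use to check their performance?
@Ananda Mahto. Thanks, I replaced my code with yours.
@FooBar, it's part personal preference, part "what code will I be able to understand in 6 months?", and part "how big is my data?" There are many different reasons to choose one approach over another. The "microbenchmark" package helps you determine which approaches are most efficient in terms of computation time.
@FooBar, in my experience, if the data set is fairly large, a dplyr- or data.table-based solution will generally be faster. It is hard to predict which one will be deprecated.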
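
As a rough illustration of the microbenchmark suggestion, here is a sketch that times two base R candidates on made-up data. It assumes the microbenchmark package is installed; the helper names with_lapply and with_map and the choice of times = 20 are hypothetical, and actual timings will vary by machine and data size.

```r
library(microbenchmark)

# Hypothetical test data: 100 identical rows like those in the question
test <- data.frame(
  idnum = rep(17L, 100),
  start = as.Date("1993-01-01"),
  end   = as.Date("1993-12-31")
)

# Two candidate implementations that build the per-row monthly sequences
with_lapply <- function(d) {
  lapply(seq_len(nrow(d)), function(i) seq(d$start[i], d$end[i], by = "1 month"))
}
with_map <- function(d) {
  Map(seq, d$start, d$end, by = "1 month")
}

# Each named expression is evaluated `times` times;
# printing the result summarises the timing distribution per expression
timings <- microbenchmark(
  lapply = with_lapply(test),
  Map    = with_map(test),
  times  = 20
)
print(timings)
```

Benchmarking on data close to the real size matters, since rankings between approaches can change as the number of rows grows.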

【Solution 4】:

tidyverse answer

Data

df <- structure(list(idnum = c(17L, 17L, 17L), start = structure(c(8401, 
8401, 8401), class = "Date"), end = structure(c(8765, 8765, 8765
), class = "Date")), class = "data.frame", .Names = c("idnum", 
"start", "end"), row.names = c(NA, -3L))

Answer and output

library(tidyverse)
df %>%
  nest(start, end) %>%
  mutate(data = map(data, ~seq(unique(.x$start), unique(.x$end), 1))) %>%
  unnest(data)

# # A tibble: 365 x 2
   # idnum       data
   # <int>     <date>
 # 1    17 1993-01-01
 # 2    17 1993-01-02
 # 3    17 1993-01-03
 # 4    17 1993-01-04
 # 5    17 1993-01-05
 # 6    17 1993-01-06
 # 7    17 1993-01-07
 # 8    17 1993-01-08
 # 9    17 1993-01-09
# 10    17 1993-01-10
# # ... with 355 more rows
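
Note that the output above has 365 rows because seq() is called with a step of 1, which for Date values means one day. A small base R sketch of the difference, using the same date range (the variable names are only for illustration):

```r
start <- as.Date("1993-01-01")
end   <- as.Date("1993-12-31")

daily   <- seq(start, end, by = 1)        # numeric step: one day at a time
monthly <- seq(start, end, by = "month")  # first day of each month

length(daily)    # 365
length(monthly)  # 12
```

Passing by = "month" in place of 1 in the answer's seq() call should therefore give the monthly rows the question asks for.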

【Discussion】:

dplyr version 0.7.4 gives Error: Each column must either be a list of vectors or a list of data frames [data]

【Solution 5】:

One option using dplyr and tidyr to create a sequence for each row could be:

df %>%
 rowwise() %>%
 transmute(idnum,
           date = list(seq(start, end, by = "month"))) %>%
 unnest(date)

  idnum date      
   <int> <date>    
 1    17 1993-01-01
 2    17 1993-02-01
 3    17 1993-03-01
 4    17 1993-04-01
 5    17 1993-05-01
 6    17 1993-06-01
 7    17 1993-07-01
 8    17 1993-08-01
 9    17 1993-09-01
10    17 1993-10-01
# … with 26 more rows

Or creating the sequence using a grouping ID:

df %>%
 group_by(idnum) %>%
 transmute(date = list(seq(min(start), max(end), by = "month"))) %>%
 unnest(date)

Or, when the goal is to create one unique sequence per ID:

df %>%
 group_by(idnum) %>%
 summarise(start = min(start),
           end = max(end)) %>%
 transmute(date = list(seq(min(start), max(end), by = "month"))) %>%
 unnest(date)

   date      
   <date>    
 1 1993-01-01
 2 1993-02-01
 3 1993-03-01
 4 1993-04-01
 5 1993-05-01
 6 1993-06-01
 7 1993-07-01
 8 1993-08-01
 9 1993-09-01
10 1993-10-01
11 1993-11-01
12 1993-12-01

【Discussion】: