【发布时间】:2023-03-05 01:47:01
【问题描述】:
我有一个data.frame 的群组和日期。如何在每个组的最小-最大日期范围内填写所有缺失的日期?
理想情况下,我会在dplyr 中执行此操作。但最终,我只想用尽可能少的(可读)代码行有效地做到这一点。下面是一个最小的例子。我实际上有很多约会和团体。我的两种方法看起来都很难看。一定有更好的方法吧?
#### setup ####
library(sqldf)
library(dplyr)
df <- data.frame(the_group = rep(LETTERS[1:2], each=3), date = Sys.Date() + c(0:2, 1:3), stringsAsFactors = F) %>%
tbl_df() %>%
slice(-2) # represents that I may be missing data in a range!
#### dplyr approach with cross join dummy ####
full_seq <- data.frame(cross_join_dummy = 1, date = seq.Date(from=min(df$date), to=max(df$date), by = "day"))
range_by_group <- df %>%
group_by(the_group) %>%
summarise(min_date = min(date), max_date = max(date)) %>%
ungroup() %>%
mutate(cross_join_dummy = 1)
desired <- range_by_group %>%
inner_join(full_seq, by="cross_join_dummy") %>%
filter(date >= min_date, date <= max_date) %>%
select(the_group, date)
#### sqldf approach ####
full_seq <- data.frame(date = as.character(seq.Date(from=min(df$date), to=max(df$date), by="day")))
df <- df %>%
mutate(date = as.character(date))
range_by_group <- sqldf("
SELECT the_group, MIN(date) AS min_date, MAX(date) AS max_date
FROM df
GROUP BY the_group
")
desired <- sqldf("
SELECT rbg.the_group, fs.date
FROM range_by_group rbg
JOIN full_seq fs
ON fs.date BETWEEN rbg.min_date AND rbg.max_date
")
【问题讨论】: