【Posted】: 2020-02-09 20:29:25
【Question】:
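I have a data frame with a set of start and end dates. I am looping over a list of dates and checking how many rows in my data frame are "open" on each date (i.e. the start date has already occurred but the end date has not; in the code below a row still counts as open on its end date, and a missing end date means it is still open).
I am currently doing this with lapply, but I would like to know whether it can be done in dplyr, and whether there would be any benefit in memory use or speed (the real data frame has 1.5 million rows).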
# Dates to evaluate: the last 16 days up to and including today
RollingDateRange <- seq(Sys.Date() - 15, Sys.Date(), by = "days")
temp <- data.frame(RollingDateRange)

# Sample data; an NA End.Date means the row has not been closed yet
dat <- data.frame(
  Order = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Code = c("Green", "Yellow", "Blue", "Yellow", "Yellow", "Red", "Purple", "Green", "Blue"),
  Start.Date = as.Date(c("2020-02-01", "2020-02-02", "2020-02-03", "2020-02-01", "2020-02-02", "2020-02-03", "2020-02-01", "2020-02-02", "2020-02-03")),
  End.Date = as.Date(c("2020-02-02", "2020-02-08", NA, "2020-02-07", "2020-02-06", NA, "2020-02-03", "2020-02-08", "2020-02-06")),
  Count = c(1, 1, 1, 1, 1, 1, 1, 1, 1),
  stringsAsFactors = FALSE)

# For each date, sum Count over the rows that are open on that date.
# unlist() turns the list returned by lapply() into a plain numeric column.
temp$Count <- unlist(lapply(temp$RollingDateRange, function(d) {
  is_open <- (dat$Start.Date <= d) & (is.na(dat$End.Date) | dat$End.Date >= d)
  sum(dat$Count[is_open], na.rm = TRUE)
}))
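The loop above rescans every row of dat for every date. As a rough sketch of a vectorized alternative (base R rather than dplyr, and not something taken from the question), the open total on a date can be computed as "Count opened on or before the date" minus "Count closed strictly before the date", using sorted dates and cumulative sums. The helper name cum_count_upto and the fixed ref_date are made up for this illustration, and the sketch assumes whole-day dates, no missing Start.Date or Count values, and End.Date never earlier than Start.Date.

```r
# Fixed reference date so the example is reproducible (the question uses Sys.Date()).
ref_date <- as.Date("2020-02-09")
dates <- seq(ref_date - 15, ref_date, by = "days")

dat <- data.frame(
  Start.Date = as.Date(c("2020-02-01", "2020-02-02", "2020-02-03",
                         "2020-02-01", "2020-02-02", "2020-02-03",
                         "2020-02-01", "2020-02-02", "2020-02-03")),
  End.Date = as.Date(c("2020-02-02", "2020-02-08", NA,
                       "2020-02-07", "2020-02-06", NA,
                       "2020-02-03", "2020-02-08", "2020-02-06")),
  Count = rep(1, 9)
)

# Hypothetical helper: for each query date, the total weight of the rows
# whose `when` date is on or before that query date (NA dates are ignored).
cum_count_upto <- function(when, weight, query) {
  keep <- !is.na(when)
  ord <- order(when[keep])
  sorted_when <- when[keep][ord]
  running <- c(0, cumsum(weight[keep][ord]))
  # findInterval() gives, per query date, how many sorted dates are <= it.
  running[findInterval(as.numeric(query), as.numeric(sorted_when)) + 1]
}

# Opened on or before d, minus closed strictly before d (End.Date <= d - 1).
opened <- cum_count_upto(dat$Start.Date, dat$Count, dates)
closed <- cum_count_upto(dat$End.Date, dat$Count, dates - 1)

result <- data.frame(RollingDateRange = dates, Count = opened - closed)
print(result)
```

This sorts each date column once instead of filtering the whole data frame per date, which may help on a table with 1.5 million rows; whether it actually does is worth measuring on the real data.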
Output:
> temp
   RollingDateRange Count
1        2020-01-25     0
2        2020-01-26     0
3        2020-01-27     0
4        2020-01-28     0
5        2020-01-29     0
6        2020-01-30     0
7        2020-01-31     0
8        2020-02-01     3
9        2020-02-02     6
10       2020-02-03     8
11       2020-02-04     7
12       2020-02-05     7
13       2020-02-06     7
14       2020-02-07     5
15       2020-02-08     4
16       2020-02-09     2
【Comments】:
- Curious, @Kevin: did the tidyverse solution meet your performance needs?