【Question title】: How to count records with start date end date interval in R?
【Posted】: 2019-04-30 02:57:55
【Question description】:

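I have a data frame of items, each with a start date and an end date. I want to know how many items are active on each day within a given period.

Example data set: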

library(lubridate)  # provides ymd()

ItemId <- c(1,2,3)
StartDate <- c(ymd("2014-01-01"),ymd("2014-02-01"),ymd("2014-03-01"))
EndDate <- c(ymd("2014-02-15"),ymd("2014-02-07"),ymd("2014-03-03"))
data.frame(ItemId,StartDate,EndDate)
  ItemId           StartDate             EndDate
1      1 2014-01-01 01:00:00 2014-02-15 01:00:00
2      2 2014-02-01 01:00:00 2014-02-07 01:00:00
3      3 2014-03-01 01:00:00 2014-03-03 01:00:00

The result should look like this (one entry per day):

Date        ActiveCount
2014-01-01  1
2014-01-02  1
...
2014-02-01  2
...

I have a solution using sqldf, but I'm not sure how to do this in plain R.

select d.date
,      ( select count(ItemID)
         from   items
         where  startdate <= d.date
         and    enddate >= d.date
       ) activecount
from   (select distinct startdate from items
        union
        select distinct enddate from items
       ) d
order by 1

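(My data contains multiple entries for every day, so this works with SQLite in R. On PostgreSQL I could generate a series of dates instead, which would be better.)

Thanks.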

【Comments】:

  • Take a look at foverlaps() from the data.table package, or search on SO.
  • Note that the SQL statement could use where d.date between startdate and enddate
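
The first comment points toward data.table. As a rough sketch of that direction, the snippet below uses a non-equi join rather than foverlaps() itself, which is a related data.table feature and not something the commenter spelled out. The object names `items`, `days`, `joined` and `counts` are made up for illustration, and the sample data is rebuilt with base `as.Date()`.

```r
library(data.table)

# Sample data from the question (names are illustrative)
items <- data.table(
  ItemId    = 1:3,
  StartDate = as.Date(c("2014-01-01", "2014-02-01", "2014-03-01")),
  EndDate   = as.Date(c("2014-02-15", "2014-02-07", "2014-03-03"))
)

# One row per calendar day in the overall range
days <- data.table(
  Date = seq(min(items$StartDate), max(items$EndDate), by = "day")
)

# Non-equi join: for each day, match the items whose interval contains it.
# by = .EACHI evaluates j once per row of `days`, in the order of `days`;
# a day with no match yields ItemId = NA, which counts as 0 here.
joined <- items[days,
  on = .(StartDate <= Date, EndDate >= Date),
  .(ActiveCount = sum(!is.na(ItemId))),
  by = .EACHI
]

counts <- data.table(Date = days$Date, ActiveCount = joined$ActiveCount)
```

Days on which nothing is active (for example, the second half of February in the sample data) stay in `counts` with a count of 0.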

Tags: r


【Solution 1】:

Calling your data df:

dates = seq(min(df$StartDate), max(df$EndDate), by = "day")

counts = data.frame(date = dates,
                    count = sapply(dates, function(x) sum(x <= df$EndDate & x >= df$StartDate)))
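
To see what this produces, here is a small worked run on the sample data from the question. It is only a sketch: the data frame is rebuilt with base `as.Date()` instead of `ymd()`, the column names `Date` and `ActiveCount` follow the output the question asks for, and `vapply()` stands in for `sapply()` so the result type is fixed.

```r
# Sample data from the question, built with base R only
df <- data.frame(
  ItemId    = c(1, 2, 3),
  StartDate = as.Date(c("2014-01-01", "2014-02-01", "2014-03-01")),
  EndDate   = as.Date(c("2014-02-15", "2014-02-07", "2014-03-03"))
)

# Every calendar day between the earliest start and the latest end
dates <- seq(min(df$StartDate), max(df$EndDate), by = "day")

# For each day, count the items whose interval contains that day
counts <- data.frame(
  Date = dates,
  ActiveCount = vapply(
    seq_along(dates),
    function(i) sum(df$StartDate <= dates[i] & dates[i] <= df$EndDate),
    integer(1)
  )
)

# Spot-check a few days: one item, two items, and no items active
counts[counts$Date %in% as.Date(c("2014-01-01", "2014-02-01", "2014-02-20")), ]
```

Because the counting runs over the full day sequence, days on which no item is active appear in the result with a count of 0.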

【Discussion】:

    【Solution 2】:

    When an R task looks like a SQL task, it may be time to take dplyr out of the cupboard:

    library(dplyr)
    library(lubridate)  # provides ymd()
    ItemId <- c(1,2,3)
    StartDate <- c(ymd("2014-01-01"),ymd("2014-02-01"),ymd("2014-03-01"))
    EndDate <- c(ymd("2014-02-15"),ymd("2014-02-07"),ymd("2014-03-03"))

    jim <- data.frame(ItemId,StartDate,EndDate)

    # One technique that's often useful, especially in R, is to take your
    # iterator and define it as a variable. You can use that to implement
    # a vectorised version of whatever you were thinking of doing.
    
    ed <- data.frame(rng = seq(min(jim$StartDate), max(jim$EndDate), by = 'day'))
    merge(jim, ed, all=TRUE) %>% 
         filter(rng >= StartDate, rng <= EndDate) %>%
         group_by(rng) %>% 
         summarise(n())
    

    This gives you:

        rng         n()
    1   2014-01-01  1 
    2   2014-01-02  1
    3   2014-01-03  1
    ...
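
The pipeline above works by pairing every item with every day (the two data frames share no columns, so `merge()` returns their Cartesian product) and then filtering. One consequence is that days with no active item have no surviving rows and so drop out of the result. If those days are wanted with a count of 0, one possible variation is sketched below; the names `active` and `all_days` and the column name `ActiveCount` are assumptions for illustration, and the data is rebuilt with base `as.Date()`.

```r
library(dplyr)

jim <- data.frame(
  ItemId    = c(1, 2, 3),
  StartDate = as.Date(c("2014-01-01", "2014-02-01", "2014-03-01")),
  EndDate   = as.Date(c("2014-02-15", "2014-02-07", "2014-03-03"))
)
ed <- data.frame(rng = seq(min(jim$StartDate), max(jim$EndDate), by = "day"))

# by = NULL makes the Cartesian product explicit: every item x every day
active <- merge(jim, ed, by = NULL) %>%
  filter(rng >= StartDate, rng <= EndDate) %>%
  count(rng, name = "ActiveCount")

# Join back onto the full day sequence so empty days show up as 0
all_days <- ed %>%
  left_join(active, by = "rng") %>%
  mutate(ActiveCount = coalesce(ActiveCount, 0L))
```

Note that the cross join creates `nrow(jim) * nrow(ed)` rows before filtering, so memory use grows quickly with many items or a long date range.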
    

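    【Discussion】:

      【Solution 3】:

      I have come back to this problem several times and kept looking for the most efficient approach.

      I previously used a map-reduce approach, but noticed that it does not scale well to large data frames with wide date intervals. I just tried the interval class from the lubridate package and found it to be the fastest implementation so far.

      Here is the final code: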

      library(tidyverse)
      library(lubridate)
      
      # Initialize a dataframe with start and end "active" dates per object
      N = 1000
      id_dates = tibble(id = 1 : N) %>%
        mutate(
          start = sample(seq(as.Date('2018-1-1'), as.Date('2019-1-1'), by = "day"), size = N, replace = TRUE),
          end   = start + sample(7 : 100, size = N, replace = TRUE),
          interval = interval(start, end))
      
      # Use the %within% command to calculate the number of active items per date
      queue_history = tibble(Date = seq(min(id_dates$start), max(id_dates$end), by = "1 day")) %>% 
        rowwise() %>% 
        mutate(numInWIP = sum(Date %within% id_dates$interval)) %>%
        ungroup()
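
Since this answer is about scaling, here is one further alternative for comparison. It is not taken from the answer and has not been benchmarked here: a base-R sketch that marks +1 on each start day and -1 on the day after each end day, then takes a running sum. The function name `count_active` is hypothetical, and the sketch assumes `start` and `end` are `Date` vectors with `start <= end`.

```r
# Hypothetical helper: active-item count per day via a running sum
count_active <- function(start, end) {
  first <- min(start)
  days  <- seq(first, max(end), by = "day")
  n     <- length(days)

  # 1-based positions of each start day and of the day after each end day
  s <- as.integer(start - first) + 1L
  e <- as.integer(end - first) + 2L

  # Net change in the active count on each day (one extra bin for end + 1)
  delta <- tabulate(s, nbins = n + 1L) - tabulate(e, nbins = n + 1L)

  data.frame(Date = days, numInWIP = cumsum(delta)[seq_len(n)])
}

# Usage on the sample data from the question
res <- count_active(
  start = as.Date(c("2014-01-01", "2014-02-01", "2014-03-01")),
  end   = as.Date(c("2014-02-15", "2014-02-07", "2014-03-03"))
)
```

The work here is proportional to the number of items plus the number of days, rather than their product, which is why it may be worth adding to a benchmark on larger inputs.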
      

      Here are some benchmarks indicating that the lubridate solution is much faster than the current answer and the map-reduce approach:

      library(tidyverse)
      library(lubridate)
      
      # Initialize a dataframe with start and end "active" dates per object
      N = 1000
      id_dates = tibble(id = 1 : N) %>%
        mutate(
          start = sample(seq(as.Date('2018-1-1'), as.Date('2019-1-1'), by = "day"), size = N, replace = TRUE),
          end   = start + sample(7 : 100, size = N, replace = TRUE),
          interval = interval(start, end))
      
      # a map-reduce solution
      method_mapreduce = function() {
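        # as.tibble() is deprecated; go through as.data.frame() so the columns
        # are reliably named Var1 (character) and Freq
        queue_history = as_tibble(as.data.frame(
          table(reduce(map2(id_dates$start, id_dates$end, seq, by = 1), c)),
          stringsAsFactors = FALSE))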
        queue_history = queue_history %>%
          rename(Date = Var1, numInWIP = Freq) %>%
          mutate(Date = as_date(Date))
      
        return (queue_history)
      }
      
      # a lubridate interval solution
      method_intervals = function() {
        date_df = tibble(Date = seq(min(id_dates$start), max(id_dates$end), by = "1 day"))
        queue_history = date_df %>% 
          rowwise() %>% 
          mutate(numInWIP = sum(Date %within% id_dates$interval))
      
        return (queue_history)
      }
      
      # current best answer
      method_currentsolution = function() {
        date_df = tibble(Date = seq(min(id_dates$start), max(id_dates$end), by = "1 day"))
        queue_history = merge(id_dates, date_df, all=TRUE) %>% 
          filter(Date >= start, Date <= end) %>%
          group_by(Date) %>% 
          summarise(n())
      
      }
      
      # Compare with benchmarks
      tst = microbenchmark::microbenchmark(
        method_mapreduce(),
        method_intervals(),
        method_currentsolution(),
        times = 5)
      
      microbenchmark::autoplot.microbenchmark(tst) +
        scale_y_log10(
          name   = sprintf("Time [%s]", attr(summary(tst), "unit")),
          breaks = scales::trans_breaks("log10", function(x) round(10^x)))
      

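      【Discussion】:

        【Solution 4】:

        You first want to get every date in the range covered by the items, and then count the number of active items on each of those days. If your data is stored in itemDates, this should take care of it: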

        library(lubridate)  # provides days()
        dates <- min(itemDates$StartDate) + days(0:as.numeric(max(itemDates$EndDate) - min(itemDates$StartDate)))
        dateCounts <- data.frame(
            row.names=dates,
            counts=sapply(dates, function(date)
                sum(date >= itemDates$StartDate & date <= itemDates$EndDate)))
        

        【Discussion】:
