[Question title]: Extract overlapping and non-overlapping time periods using R (data.table)
[Posted]: 2022-01-17 17:33:14
[Question description]:

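I have a dataset containing the time periods during which interventions took place. There are two types of intervention, and I have the start and end date of each one. I would now like to extract the time (in days) during which the two types do not overlap, as well as the extent of the overlap.

Here is an example dataset: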

library(data.table)

data <- data.table( id = seq(1,21),
                    type = as.character(c(1,2,2,2,2,2,2,2,1,1,1,1,1,2,1,2,1,1,1,1,1)),
                    start_dt = as.Date(c("2015-01-09", "2015-04-14", "2015-06-19", "2015-10-30", "2016-03-01", "2016-05-24", 
                                         "2016-08-03", "2017-08-18", "2017-08-18", "2018-02-01", "2018-05-07", "2018-08-09", 
                                         "2019-01-31", "2019-03-22", "2019-05-16", "2019-11-04", "2019-11-04", "2020-02-06",
                                         "2020-05-28", "2020-08-25", "2020-12-14")),
                    end_dt   = as.Date(c("2017-07-24", "2015-05-04", "2015-08-27", "2015-11-19", "2016-03-21", "2016-06-09", 
                                         "2017-07-18", "2019-02-21", "2018-01-23", "2018-04-25", "2018-07-29", "2019-01-15", 
                                         "2019-04-24", "2019-09-13", "2019-10-13", "2020-12-23", "2020-01-26", "2020-04-29", 
                                         "2020-08-19", "2020-11-16", "2021-03-07")))

> data
    id type   start_dt     end_dt
 1:  1    1 2015-01-09 2017-07-24
 2:  2    2 2015-04-14 2015-05-04
 3:  3    2 2015-06-19 2015-08-27
 4:  4    2 2015-10-30 2015-11-19
 5:  5    2 2016-03-01 2016-03-21
 6:  6    2 2016-05-24 2016-06-09
 7:  7    2 2016-08-03 2017-07-18
 8:  8    2 2017-08-18 2019-02-21
 9:  9    1 2017-08-18 2018-01-23
10: 10    1 2018-02-01 2018-04-25
11: 11    1 2018-05-07 2018-07-29
12: 12    1 2018-08-09 2019-01-15
13: 13    1 2019-01-31 2019-04-24
14: 14    2 2019-03-22 2019-09-13
15: 15    1 2019-05-16 2019-10-13
16: 16    2 2019-11-04 2020-12-23
17: 17    1 2019-11-04 2020-01-26
18: 18    1 2020-02-06 2020-04-29
19: 19    1 2020-05-28 2020-08-19
20: 20    1 2020-08-25 2020-11-16
21: 21    1 2020-12-14 2021-03-07

Here is a plot of the data, to give a better idea of what I want to know:

library(ggplot2)
ggplot(data = data,
       aes(x = start_dt, xend = end_dt, y = id, yend = id, color = type)) +  
  geom_segment(size = 2) +
  xlab("") + 
  ylab("") + 
  theme_bw()

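To describe the first part of the example: from 2015-01-09 to 2017-07-24 we have a type 1 intervention. However, from 2015-04-14 a type 2 intervention is also taking place. This means we only have "pure" type 1 from 2015-01-09 to 2015-04-13, which is 95 days. Then we have an overlapping period from 2015-04-14 to 2015-05-04, which is 21 days. Then we again have a period with only type 1, from 2015-05-05 to 2015-06-18, which is 45 days. In total, we now have (95 + 45 =) 140 days of "pure" type 1 and 21 days of overlap. We then continue like this for the entire time period.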
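The day counts in this walkthrough include both the start and the end date, so each length is end minus start plus one. A minimal check of that arithmetic, using only the dates quoted above (the variable names `start`, `end`, and `days` are illustrative, not part of the dataset):

```r
# The three periods described above: pure type 1, overlap, pure type 1
start <- as.Date(c("2015-01-09", "2015-04-14", "2015-05-05"))
end   <- as.Date(c("2015-04-13", "2015-05-04", "2015-06-18"))

# Inclusive length in days: both the start and the end date count
days <- as.integer(end - start) + 1L
days
# [1] 95 21 45
```

Note that a plain `end - start` would give 94, 20, and 44, one day fewer per period.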

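I would like to know the total time (in days) of "pure" type 1, "pure" type 2, and overlap.

Alternatively, if possible, I would like to reorganise the data so that all the individual time periods are extracted, meaning the data would look like this (type 3 = overlap):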

> data_adjusted
    id type   start_dt     end_dt
 1:  1    1 2015-01-09 2015-04-13
 2:  2    3 2015-04-14 2015-05-04
 3:  3    1 2015-05-05 2015-06-18
 4:  4    3 2015-06-19 2015-08-27
 ........

The time (in days) spent in each intervention type could then easily be calculated from data_adjusted.
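As a rough sketch of that last step, assuming a table shaped like `data_adjusted` with inclusive, non-overlapping periods, the totals per type could be computed as follows. The four rows are the periods from the worked description; the `totals` name and the `days` column are illustrative.

```r
library(data.table)

# First four periods from the description (type 3 = overlap); dates are inclusive
data_adjusted <- data.table(
  type     = c("1", "3", "1", "3"),
  start_dt = as.Date(c("2015-01-09", "2015-04-14", "2015-05-05", "2015-06-19")),
  end_dt   = as.Date(c("2015-04-13", "2015-05-04", "2015-06-18", "2015-08-27"))
)

# Total inclusive days per type (+1 because both endpoints count)
totals <- data_adjusted[, .(days = sum(as.integer(end_dt - start_dt) + 1L)), by = type]
totals
```

For these four rows this gives 140 days for type 1 and 91 days for type 3.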

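I have found similar answers that use dplyr or that only flag overlapping time periods, but I have not found one for my specific case. Is there an efficient way to calculate this using data.table?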

[Question discussion]:

    Tags: r date time data.table overlap


    [Solution 1]:

    This approach looks at every date in the range, so it may not scale well if your data gets large.

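    library(data.table)
    library(magrittr)  # provides the %>% pipe used below

    # one row per calendar day in the observed range
    alldates <- data.table(date = seq(min(data$start_dt), max(data$end_dt), by = "day"))
    # non-equi join: match each day to every intervention that covers it
    data[alldates, on = .(start_dt <= date, end_dt >= date)] %>%
      .[, .N, by = .(start_dt, type) ] %>%
      .[ !is.na(type), ] %>%
      dcast(start_dt ~ type, value.var = "N") %>%
      # consecutive days with the same combination of types share a run id
      .[, r := do.call(rleid, .SD), .SDcols = setdiff(colnames(.), "start_dt") ] %>%
      .[, .(type = fcase(is.na(`1`[1]), "2", is.na(`2`[1]), "1", TRUE, "3"),
            start_dt = min(start_dt), end_dt = max(start_dt)), by = r ]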
    #         r   type   start_dt     end_dt
    #     <int> <char>     <Date>     <Date>
    #  1:     1      1 2015-01-09 2015-04-13
    #  2:     2      3 2015-04-14 2015-05-04
    #  3:     3      1 2015-05-05 2015-06-18
    #  4:     4      3 2015-06-19 2015-08-27
    #  5:     5      1 2015-08-28 2015-10-29
    #  6:     6      3 2015-10-30 2015-11-19
    #  7:     7      1 2015-11-20 2016-02-29
    #  8:     8      3 2016-03-01 2016-03-21
    #  9:     9      1 2016-03-22 2016-05-23
    # 10:    10      3 2016-05-24 2016-06-09
    # 11:    11      1 2016-06-10 2016-08-02
    # 12:    12      3 2016-08-03 2017-07-18
    # 13:    13      1 2017-07-19 2017-07-24
    # 14:    14      3 2017-08-18 2018-01-23
    # 15:    15      2 2018-01-24 2018-01-31
    # 16:    16      3 2018-02-01 2018-04-25
    # 17:    17      2 2018-04-26 2018-05-06
    # 18:    18      3 2018-05-07 2018-07-29
    # 19:    19      2 2018-07-30 2018-08-08
    # 20:    20      3 2018-08-09 2019-01-15
    # 21:    21      2 2019-01-16 2019-01-30
    # 22:    22      3 2019-01-31 2019-02-21
    # 23:    23      1 2019-02-22 2019-03-21
    # 24:    24      3 2019-03-22 2019-04-24
    # 25:    25      2 2019-04-25 2019-05-15
    # 26:    26      3 2019-05-16 2019-09-13
    # 27:    27      1 2019-09-14 2019-10-13
    # 28:    28      3 2019-11-04 2020-01-26
    # 29:    29      2 2020-01-27 2020-02-05
    # 30:    30      3 2020-02-06 2020-04-29
    # 31:    31      2 2020-04-30 2020-05-27
    # 32:    32      3 2020-05-28 2020-08-19
    # 33:    33      2 2020-08-20 2020-08-24
    # 34:    34      3 2020-08-25 2020-11-16
    # 35:    35      2 2020-11-17 2020-12-13
    # 36:    36      3 2020-12-14 2020-12-23
    # 37:    37      1 2020-12-24 2021-03-07
    #         r   type   start_dt     end_dt
    

    It drops the id field, and I don't know a good way to map it back to your original data.
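Because the approach above enumerates every day, one possible alternative (not the answerer's method) is to cut the timeline only at the dates where something changes and then label each resulting segment. The sketch below assumes inclusive end dates and uses a three-row subset of the question's data; the names `d`, `cuts`, `segs`, `hits`, and `res` are illustrative.

```r
library(data.table)

# Small illustrative input: the first three rows of the question's data
d <- data.table(
  type     = c("1", "2", "2"),
  start_dt = as.Date(c("2015-01-09", "2015-04-14", "2015-06-19")),
  end_dt   = as.Date(c("2017-07-24", "2015-05-04", "2015-08-27"))
)

# Breakpoints: every start date, and the day after every end date
cuts <- sort(unique(c(d$start_dt, d$end_dt + 1L)))

# Elementary segments between consecutive breakpoints (inclusive ends)
segs <- data.table(start_dt = head(cuts, -1), end_dt = tail(cuts, -1) - 1L)
setkey(segs, start_dt, end_dt)

# For each intervention, find the segments it covers; uncovered segments drop out
hits <- foverlaps(d, segs, type = "any", nomatch = 0L)

# A segment covered by both types is labelled "3", otherwise it keeps its type
res <- hits[, .(type = if (uniqueN(type) > 1L) "3" else type[1L]),
            by = .(start_dt, end_dt)]
setorder(res, start_dt)
res
```

The number of segments grows with the number of intervals rather than the number of days. Adjacent segments that end up with the same type are not merged here, which does not matter for totals per type but would need an extra step (for example with `rleid`) to reproduce the table above exactly.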

    [Discussion]:

      [Solution 2]:

      @r2evans's solution is more complete, but if you want to explore using foverlaps, you could start like this:

      #split into two frames
      data = split(data,by="type")
      
      # key the second frame
      setkey(data[[2]], start_dt, end_dt)
      
      # create the rows that have overlaps
      overlap = foverlaps(data[[1]],data[[2]], type="any", nomatch=0)
      
      # get the overlapping time periods
      overlap[, .(start_dt = max(start_dt,i.start_dt), end_dt=min(end_dt,i.end_dt)), by=1:nrow(overlap)][,type:=3]
      

      Output:

         nrow   start_dt     end_dt type
       1:    1 2015-04-14 2015-05-04    3
       2:    2 2015-06-19 2015-08-27    3
       3:    3 2015-10-30 2015-11-19    3
       4:    4 2016-03-01 2016-03-21    3
       5:    5 2016-05-24 2016-06-09    3
       6:    6 2016-08-03 2017-07-18    3
       7:    7 2017-08-18 2018-01-23    3
       8:    8 2018-02-01 2018-04-25    3
       9:    9 2018-05-07 2018-07-29    3
      10:   10 2018-08-09 2019-01-15    3
      11:   11 2019-01-31 2019-02-21    3
      12:   12 2019-03-22 2019-04-24    3
      13:   13 2019-05-16 2019-09-13    3
      14:   14 2019-11-04 2020-01-26    3
      15:   15 2020-02-06 2020-04-29    3
      16:   16 2020-05-28 2020-08-19    3
      17:   17 2020-08-25 2020-11-16    3
      18:   18 2020-12-14 2020-12-23    3
      

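      The sum of these overlap periods, computed as end_dt - start_dt, is 1492 days. Counting both endpoints, as the question does, adds one day per period and gives 1510.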
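As a sketch of how such a total could be obtained, the snippet below repeats the foverlaps step on a small subset and sums the lengths both ways. The frames `t1` and `t2` are stand-ins for `data[[1]]` and `data[[2]]`, and `pmax()`/`pmin()` are swapped in for the row-by-row `max()`/`min()` above as a vectorized alternative; `ov` and `totals` are illustrative names.

```r
library(data.table)

# Stand-ins for the type 1 and type 2 frames (first rows of the question's data)
t1 <- data.table(start_dt = as.Date("2015-01-09"), end_dt = as.Date("2017-07-24"))
t2 <- data.table(
  start_dt = as.Date(c("2015-04-14", "2015-06-19", "2015-10-30")),
  end_dt   = as.Date(c("2015-05-04", "2015-08-27", "2015-11-19"))
)
setkey(t2, start_dt, end_dt)

# Rows where a type 1 period and a type 2 period overlap
ov <- foverlaps(t1, t2, type = "any", nomatch = 0L)

# Overlap bounds: later of the two starts, earlier of the two ends
ov <- ov[, .(start_dt = pmax(start_dt, i.start_dt),
             end_dt   = pmin(end_dt, i.end_dt))]

# Sum as a date difference, and as an inclusive day count
totals <- ov[, .(diff_days = sum(as.integer(end_dt - start_dt)),
                 incl_days = sum(as.integer(end_dt - start_dt) + 1L))]
totals
```

For these three overlaps the difference-based sum is 109 and the inclusive sum is 112, so the two conventions differ by exactly the number of periods.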

      [Discussion]:
