[Question title]: Fill in missing date intervals by group in R
[Posted]: 2020-05-07 11:25:08
[Question description]:

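I have a large dataset containing date periods with different disease statuses for each ID and reference date. For each ID, I want to add a "healthy" status for every missing date period within +/- 5 years of the reference date.

I tried to adapt the solution from "Fill in missing date ranges" but failed. Preferably, I would like to stay within the data.table framework. Any suggestions are much appreciated!

Sample data: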

library(data.table)
DT <- fread("
id  reference_date  period_start  period_end   Status
1   2010-01-10      2004-06-22    2005-03-15   1
1   2010-01-10      2008-10-11    2008-10-12   1
1   2010-01-10      2014-11-05    2016-01-03   2
2   2013-05-10      2012-02-01    2012-03-01   2
2   2014-06-11      2012-02-01    2012-03-01   2
3   2011-08-14      NA            NA           NA 
")

Desired output:

DT <- fread("
id  reference_date  period_start  period_end   Status
1   2010-01-10      2004-06-22    2005-03-15   1
1   2010-01-10      2005-03-16    2008-10-10   0   
1   2010-01-10      2008-10-11    2008-10-12   1
1   2010-01-10      2008-10-13    2014-11-04   0
1   2010-01-10      2014-11-05    2016-01-03   2
2   2013-05-10      2008-05-10    2012-01-31   0
2   2013-05-10      2012-02-01    2012-03-01   2
2   2013-05-10      2012-03-02    2018-05-10   0
2   2014-06-11      2009-06-11    2012-01-31   0
2   2014-06-11      2012-02-01    2012-03-01   2
2   2014-06-11      2012-03-02    2019-06-11   0
3   2011-08-14      2006-08-14    2016-08-14   0 
")

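Comments: For the first row, the +/- 5 year interval runs from 2005-01-10 to 2015-01-10. However, because of an ongoing disease state that ends on 2005-03-15, the "healthy" period starts on 2005-03-16. Because each id can have several reference dates, duplicated date periods will exist (as seen for id 2: 2012-02-01 to 2012-03-01), and that is fine. Finally, ids with no disease state are marked with NA (as for id 3).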
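The window arithmetic described above can be sketched in base R as follows. This is only an illustration: `ref_window` is a hypothetical helper name that does not appear in the post, and it uses plain `Date` objects rather than data.table's `IDate`.

```r
# Hypothetical helper (not from the original post): returns the
# +/- `years` window around a single reference date.
ref_window <- function(reference_date, years = 5L) {
  reference_date <- as.Date(reference_date)
  # seq() with a "-5 years" step yields c(reference, reference - 5 years)
  lower <- seq(reference_date, by = paste0("-", years, " years"), length.out = 2L)[2L]
  upper <- seq(reference_date, by = paste0(years, " years"), length.out = 2L)[2L]
  list(start = lower, end = upper)
}

w <- ref_window("2010-01-10")
format(w$start)  # "2005-01-10"
format(w$end)    # "2015-01-10"

# A disease period that is still running when the window opens pushes the
# first healthy day to the day after that period ends.
healthy_start <- max(c(w$start, as.Date("2005-03-15") + 1))
format(healthy_start)  # "2005-03-16"
```

Stepping by whole years with `seq()` keeps the month and day of the reference date, which is why the window ends land on the same calendar day.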

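Edit: I ran into some problems with my real data, so I adjusted the solution slightly. I also added the status, so that statuses are collapsed per date interval: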

DT2 <- DT[, {
  # +/- 5 years from the reference date (t0)
  sdt <- seq(reference_date, by = "-5 years", length.out = 2L)[2L]
  edt <- seq(reference_date, by = "5 years", length.out = 2L)[2L]

  if (is.na(period_start[1L])) {
    # no disease period: the whole window is 'healthy'
    .(period_start = sdt, period_end = edt, status = "notsick")
  } else {
    # prepend the -5 year date if it precedes the first disease start
    if (sdt < period_start[1L]) {
      period_start <- c(sdt, period_start)
    }
    # append the +5 year date if it follows the last disease end
    if (edt > period_end[.N]) {
      period_end <- c(period_end, edt)
    }
    # cut points: every start, plus the day after every end
    dates <- unique(sort(c(period_start, period_end + 1L)))
    # same column names as the NA branch, so all groups stack consistently
    .(period_start = dates[-length(dates)], period_end = dates[-1L] - 1L, status = "")
  }
}, .(id, reference_date)]

## (c). Collapse status for overlapping periods
# Status (capitalised, as in the sample data) is taken from DT, the x table
DT <- DT[DT2, on = .(id, reference_date, period_start <= period_start, period_end >= period_end),
  .(status = paste(Status, collapse = ";")),
  by = .EACHI, allow.cartesian = TRUE]
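To illustrate the collapsing step in isolation, here is a minimal sketch of a non-equi join with `by = .EACHI`. The tables `disease` and `pieces` and their contents are made up for this example and are not part of the question's data. In the `on` clause, the left-hand side of each comparison refers to a column of `disease` (the x table) and the right-hand side to a column of `pieces` (the i table).

```r
library(data.table)

# Hypothetical disease periods: the second lies inside the first.
disease <- data.table(
  id           = 1L,
  period_start = as.IDate(c("2020-01-01", "2020-01-05")),
  period_end   = as.IDate(c("2020-01-10", "2020-01-07")),
  Status       = c("1", "2")
)

# Hypothetical non-overlapping sub-intervals produced by the cut points.
pieces <- data.table(
  id           = 1L,
  period_start = as.IDate(c("2020-01-01", "2020-01-05", "2020-01-08")),
  period_end   = as.IDate(c("2020-01-04", "2020-01-07", "2020-01-10"))
)

# For each sub-interval, find every disease period that fully covers it
# and paste the matching statuses together.
res <- disease[pieces,
  on = .(id, period_start <= period_start, period_end >= period_end),
  .(Status = paste(Status, collapse = ";")),
  by = .EACHI]

res$Status
```

Only the middle sub-interval is covered by both disease periods, so it is the only one that ends up with a combined status.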

[Question discussion]:

    Tags: r date data.table


    [Solution 1]:

    Here is one option:

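    interweave <- function(x, y) c(rbind(x, y)) # see ref
    ans <- DT[, {
            # +/- 5 year window around the reference date
            sdt <- seq(reference_date, by="-5 years", length.out=2L)[2L]
            edt <- seq(reference_date, by="5 years", length.out=2L)[2L]
    
            if(is.na(period_start[1L])) {
                # no disease period: the whole window is healthy
                .(period_start=sdt, period_end=edt, Status=0L)
            } else {    
                if (sdt < period_start[1L]) {
                    period_start <- c(sdt, period_start)
                } 
                # unique() drops the duplicate created when rbind() recycles
                # period_end after sdt was prepended to period_start
                # (rbind() may also warn about the length mismatch)
                ps <- as.IDate(unique(sort(interweave(period_start, period_end+1L))))
    
                if (period_end[.N] > edt) {
                    # last disease period runs past the window: keep its end
                    ps <- ps[-length(ps)]
                    pe <- period_end[.N]
                } else {
                    pe <- edt
                }
                .(period_start=ps, period_end=c(ps[-1L] - 1, pe), Status=0L)
            }
        },
        .(id, reference_date)]
    # overwrite the default 0 with the recorded status for the original periods
    ans[DT, on=setdiff(names(DT), "Status"), Status := i.Status]
    ans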
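The idea behind the grouped step can also be written out for a single id and reference date in base R. The sketch below is not the answer's code: `split_at_cut_points` is a hypothetical helper, it works on plain `Date` vectors, and it assumes the disease periods are given as matching `starts` and `ends` vectors.

```r
# Hypothetical helper, not from the original answer.
# Splits the span covered by the window [lower, upper] and the disease
# periods into consecutive, non-overlapping intervals.
split_at_cut_points <- function(starts, ends, lower, upper) {
  # An interval begins at the window start, at a disease start,
  # or on the day after a disease end.
  cuts <- sort(unique(c(min(c(lower, starts)), starts, ends + 1)))
  # The last interval ends at the window end or at a later disease end.
  last_end <- max(c(upper, ends))
  cuts <- cuts[cuts <= last_end]
  data.frame(
    period_start = cuts,
    period_end   = c(cuts[-1L] - 1, last_end)
  )
}

# id 2 with reference date 2013-05-10 from the sample data
out <- split_at_cut_points(
  starts = as.Date("2012-02-01"),
  ends   = as.Date("2012-03-01"),
  lower  = as.Date("2008-05-10"),
  upper  = as.Date("2018-05-10")
)
out
```

For this input the three intervals match the rows listed for id 2 and reference date 2013-05-10 in the desired output; the statuses would still have to be joined back on afterwards.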
    

    Data:

    library(data.table)
    DT <- fread("
    id  reference_date  period_start  period_end   Status
    1   2010-01-10      2004-06-22    2005-03-15   1
    1   2010-01-10      2008-10-11    2008-10-12   1
    1   2010-01-10      2014-11-05    2016-01-03   2
    2   2013-05-10      2012-02-01    2012-03-01   2
    2   2014-06-11      2012-02-01    2012-03-01   2
    3   2011-08-14      NA            NA           NA 
    ")
    cols <- c("reference_date","period_start","period_end")
    DT[, (cols) := lapply(.SD, as.IDate, format="%Y-%m-%d"), .SDcols=cols]
    

    Reference: Alternate, interweave or interlace two vectors
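As a small illustration of the interweave trick, the sketch below shows what `c(rbind(x, y))` does with equal-length vectors, with vectors of different lengths, and with dates. The input values are arbitrary examples chosen for this note.

```r
interweave <- function(x, y) c(rbind(x, y))

# Equal lengths: elements alternate, x first.
a <- interweave(c(1, 3, 5), c(2, 4, 6))
a  # 1 2 3 4 5 6

# Different lengths: rbind() recycles the shorter vector, so its
# values appear more than once and may need to be dropped again,
# for example with unique().
b <- interweave(c(1, 3), 2)
b  # 1 2 3 2

# rbind() drops the Date class, so the result is plain numeric
# (days since 1970-01-01) and has to be converted back.
d <- interweave(as.Date("2020-01-01"), as.Date("2020-01-02"))
class(d)  # "numeric"
format(as.Date(d, origin = "1970-01-01"))  # "2020-01-01" "2020-01-02"
```

This is why the interwoven result is wrapped in a date conversion before it is used as `period_start`.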

    [Discussion]:

    • Awesome! The code is clear and fast. I was almost tearing my hair out trying to solve this. Thanks