【Question title】: How to generate a unique ID for each group based on relative date interval in R using dplyr?
【Posted】: 2019-06-21 22:01:55
【Question description】:

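I have a dataset containing visits by multiple people, and I want to group visits under a common ID based on person number and visit time. The condition is that if a visit starts within 24 hours of the previous discharge, those visits should share the same ID.

Sample data: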

dat <- data.frame(
  Person_ID = c(1,1,1,2,3,3,3,4,4),
  Admit_Date_Time = as.POSIXct(c("2017-02-07 15:26:00",
                                 "2017-04-21 10:20:00",
                                 "2017-04-22 12:12:00",
                                 "2017-10-16 01:31:00",
                                 "2017-01-24 02:41:00",
                                 "2017-01-24 05:31:00",
                                 "2017-01-28 04:26:00",
                                 "2017-12-01 01:31:00",
                                 "2017-12-01 01:31:00"), format = "%Y-%m-%d %H:%M:%S"),
  Discharge_Date_Time = as.POSIXct(c("2017-03-01 11:42:00",
                                     "2017-04-22 05:56:00",
                                     "2017-04-26 21:01:00",
                                     "2017-10-18 20:11:00",
                                     "2017-01-27 22:15:00",
                                     "2017-01-26 15:35:00",
                                     "2017-01-28 09:25:00",
                                     "2017-12-05 18:33:00",
                                     "2017-12-04 16:41:00"), format = "%Y-%m-%d %H:%M:%S"),
  Visit_ID = c(1:9))

Here is what I tried as a starting point:

library(dplyr)
library(lubridate)  # for hours()

dat1 <-
  dat %>%
  arrange(Person_ID, Admit_Date_Time) %>%
  group_by(Person_ID) %>%
  mutate(Previous_Visit_Interval = difftime(lag(Discharge_Date_Time, 1),
                                            Admit_Date_Time, units = "hours")) %>%
  mutate(start = c(1, Previous_Visit_Interval[-1] < hours(-24)),
         run = cumsum(start))

dat1$ID = as.numeric(as.factor(paste0(dat1$Person_ID, dat1$run)))

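This is almost right, except that it does not give the correct ID for visit 7 (person 3). That person has three visits: the second falls entirely within the first, and the third starts within 24 hours of the discharge from the first, but not from the second.

【Discussion】:

    Tags: r group-by dplyr datediff cumsum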


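    【Solution 1】:

    There may be a way to shorten this, but here is an approach using tidyr::gather and spread. By gathering into long format, we can track cumulative admissions within each visit. A new visit is recorded whenever there is a new Person_ID, or the Person_ID completed a visit at least 24 hours earlier (cumulative admissions of zero).

    library(dplyr)
    library(tidyr)
    library(lubridate)  # for ddays()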
    dat1 <- dat %>%
      # Gather into long format with event type in one column, timestamp in another
      gather(event, time, Admit_Date_Time:Discharge_Date_Time) %>%
    
      # I want discharges to have an effect up to 24 hours later. Sort using that.
      mutate(time_adj = if_else(event == "Discharge_Date_Time", 
                                time + ddays(1), 
                                time)) %>%
      arrange(Person_ID, time_adj) %>%
    
      # For each Person_ID, track cumulative admissions. 0 means a visit has completed. 
      #   (b/c we sorted by time_adj, these reflect the 24hr period after discharges.)
      group_by(Person_ID) %>%
      mutate(admissions = if_else(event == "Admit_Date_Time", 1, -1)) %>%
      mutate(admissions_count = cumsum(admissions)) %>%
      ungroup() %>%
    
      # Record a new Hosp_ID when either (a) a new Person, or (b) preceded by a 
      #   completed visit (ie admissions_count was zero).
      mutate(Hosp_ID_chg = 1 * 
               (Person_ID != lag(Person_ID, default = 0) |   # (a); default 0 assumes no Person_ID is 0, so the first row starts Hosp_ID at 1
                lag(admissions_count, default = 1) == 0),    # (b)
             Hosp_ID = cumsum(Hosp_ID_chg)) %>%
    
      # Spread back into original format
      select(-time_adj, -admissions, -admissions_count, -Hosp_ID_chg) %>%
      spread(event, time)
    

    Result

    > dat1
    # A tibble: 9 x 5
      Person_ID Visit_ID Hosp_ID Admit_Date_Time     Discharge_Date_Time
          <dbl>    <int>   <dbl> <dttm>              <dttm>             
    1         1        1       1 2017-02-07 15:26:00 2017-03-01 11:42:00
    2         1        2       2 2017-04-21 10:20:00 2017-04-22 05:56:00
    3         1        3       2 2017-04-22 12:12:00 2017-04-26 21:01:00
    4         2        4       3 2017-10-16 01:31:00 2017-10-18 20:11:00
    5         3        5       4 2017-01-24 02:41:00 2017-01-27 22:15:00
    6         3        6       4 2017-01-24 05:31:00 2017-01-26 15:35:00
    7         3        7       4 2017-01-28 04:26:00 2017-01-28 09:25:00
    8         4        8       5 2017-12-01 01:31:00 2017-12-05 18:33:00
    9         4        9       5 2017-12-01 01:31:00 2017-12-04 16:41:00
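
For comparison, here is a more compact sketch that is not taken from either answer. It assumes that "within 24 hours" means the admission is at most 24 hours after the latest discharge seen so far for that person, and it uses a running maximum (`cummax`) of discharge times instead of reshaping. The names `visits`, `latest_prev`, `gap_hours` and `new_stay` are illustrative, and the time zone is set to UTC only to keep the example reproducible.

```r
library(dplyr)

# Sample data from the question (time zone fixed to UTC as an assumption)
visits <- data.frame(
  Person_ID = c(1, 1, 1, 2, 3, 3, 3, 4, 4),
  Admit_Date_Time = as.POSIXct(c(
    "2017-02-07 15:26:00", "2017-04-21 10:20:00", "2017-04-22 12:12:00",
    "2017-10-16 01:31:00", "2017-01-24 02:41:00", "2017-01-24 05:31:00",
    "2017-01-28 04:26:00", "2017-12-01 01:31:00", "2017-12-01 01:31:00"),
    tz = "UTC"),
  Discharge_Date_Time = as.POSIXct(c(
    "2017-03-01 11:42:00", "2017-04-22 05:56:00", "2017-04-26 21:01:00",
    "2017-10-18 20:11:00", "2017-01-27 22:15:00", "2017-01-26 15:35:00",
    "2017-01-28 09:25:00", "2017-12-05 18:33:00", "2017-12-04 16:41:00"),
    tz = "UTC"),
  Visit_ID = 1:9
)

grouped <- visits %>%
  arrange(Person_ID, Admit_Date_Time) %>%
  group_by(Person_ID) %>%
  mutate(
    # Latest discharge (in seconds) among all earlier visits of this person
    latest_prev = lag(cummax(as.numeric(Discharge_Date_Time))),
    # Hours between that discharge and this admission (NA for the first visit)
    gap_hours = (as.numeric(Admit_Date_Time) - latest_prev) / 3600,
    # A new stay starts on a person's first visit or after a gap of more than 24 hours
    new_stay = is.na(gap_hours) | gap_hours > 24
  ) %>%
  ungroup() %>%
  # Rows are sorted by person, and each person's first row is a new stay,
  # so a global running count yields one ID per stay
  mutate(Hosp_ID = cumsum(new_stay))

grouped$Hosp_ID
# 1 2 2 3 4 4 4 5 5
```

Comparing against the running maximum rather than only the previous row's discharge is what lets visit 7 join visits 5 and 6: visit 6 ends earlier than visit 5, so the previous row alone understates when the person was last discharged.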
    

    【Discussion】:

    • This is perfect. Thank you so much for your help, much appreciated!
    【Solution 2】:

    Here is a data.table approach using an overlap join.

    library( data.table )
    library( lubridate )
    setDT( dat )
    setorder( dat, Person_ID, Admit_Date_Time )
    #create a 1-day extension after each discharge
    dt2 <- dat[, discharge_24h := Discharge_Date_Time %m+% days(1)][]
    #now create id
    setkey( dat, Admit_Date_Time, discharge_24h )
    #create data-table with overlap-join, create groups based on overlapping ranges
    dt2 <- setorder( 
      foverlaps( dat, 
                 dat, 
                 mult = "first", 
                 type = "any", 
                 nomatch = 0L 
                 ), 
      Visit_ID )[, list( Visit_ID = i.Visit_ID, 
                         Hosp_ID = .GRP ), 
                 by = .( Visit_ID )][, Visit_ID := NULL]
    
    #reorder the result
    setorder( dt2[ dat, on = "Visit_ID" ][, discharge_24h := NULL], Visit_ID )[]
    
    #    Visit_ID Hosp_ID Person_ID     Admit_Date_Time Discharge_Date_Time
    # 1:        1       1         1 2017-02-07 15:26:00 2017-03-01 11:42:00
    # 2:        2       2         1 2017-04-21 10:20:00 2017-04-22 05:56:00
    # 3:        3       2         1 2017-04-22 12:12:00 2017-04-26 21:01:00
    # 4:        4       3         2 2017-10-16 01:31:00 2017-10-18 20:11:00
    # 5:        5       4         3 2017-01-24 02:41:00 2017-01-27 22:15:00
    # 6:        6       4         3 2017-01-24 05:31:00 2017-01-26 15:35:00
    # 7:        7       4         3 2017-01-28 04:26:00 2017-01-28 09:25:00
    # 8:        8       5         4 2017-12-01 01:31:00 2017-12-05 18:33:00
    # 9:        9       5         4 2017-12-01 01:31:00 2017-12-04 16:41:00
    

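    【Discussion】:

    • Thanks for the post. Unfortunately data.table is well above my level, and I could not get it to work on my actual code (see the error below). The dplyr post is what I needed. Thanks a lot anyway! Error in foverlaps(dat, dat, mult = "first", type = "any", nomatch = 0L): All entries in column Admit_date_time should
    • @r_newb That error means some of your discharge date-times are before your admit date-times... You might want to check that! Correct them, or filter them out.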
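
The check suggested in the comment above can be sketched in base R as follows. The data frame `visits` and its values are hypothetical, made up only to include one row whose discharge precedes its admission. The column names follow the question.

```r
# Hypothetical data: visit 2 is discharged before it is admitted
visits <- data.frame(
  Visit_ID = 1:3,
  Admit_Date_Time = as.POSIXct(c(
    "2017-02-07 15:26:00", "2017-04-21 10:20:00", "2017-04-22 12:12:00"),
    tz = "UTC"),
  Discharge_Date_Time = as.POSIXct(c(
    "2017-03-01 11:42:00", "2017-04-20 05:56:00", "2017-04-26 21:01:00"),
    tz = "UTC")
)

# which() drops NA comparisons, so rows with missing times are not flagged here
bad_rows <- which(visits$Discharge_Date_Time < visits$Admit_Date_Time)

# Inspect the offending visits before deciding whether to correct or drop them
bad <- visits[bad_rows, ]

# One option: keep only rows where discharge is not earlier than admission
clean <- visits[which(visits$Discharge_Date_Time >= visits$Admit_Date_Time), ]
```

Inspecting `bad` first is usually safer than filtering blindly, since a reversed pair may be a data-entry error that can be corrected rather than discarded.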