【Question title】: Merge data tables by time interval overlap
【Posted】: 2016-01-24 07:53:30
【Question】:

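Suppose I have two tables: one of appointments and one of receptions. Each table has a filial ID, a medic ID, a start time and an end time (the plan for appointments, the actual times for receptions), plus some other data. I want to count how many appointments have a reception within the appointment's time interval. A reception can actually begin before the appointment start time or after it, it can lie entirely inside the appointment interval, and so on.

Below I build the two tables, one for appointments and one for receptions. I wrote a nested loop, but it runs very slowly. My tables contain about 50 rows each. I need a fast solution. How can I do this without loops? Thanks in advance!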

library(data.table)

date <- as.POSIXct('2015-01-01 14:30:00')

# appointments data table
app <- data.table(med.id = 1:10,
                  filial.id = rep(c(100,200), each = 5),
                  start.time = rep(seq(date, length.out = 5, by = "hours"),2),
                  end.time = rep(seq(date+3599, length.out = 5, by = "hours"),2),
                  A = rnorm(10))


# receptions data table
re <- data.table(med.id = c(1,11,3,4,15,6,7),
                 filial.id = c(rep(100, 5), 200,200),
                 start.time = as.POSIXct(paste(rep('2015-01-01 ',7), c('14:25:00', '14:25:00','16:32:00', '17:25:00', '16:10:00', '15:35:00','15:50:00'))),
                 end.time = as.POSIXct(paste(rep('2015-01-01 ',7), c('15:25:00', '15:20:00','17:36:00', '18:40:00', '16:10:00', '15:49:00','16:12:00'))),
                 B = rnorm(7))



app$count <- 0

for (i in 1:dim(app)[1]){
  for (j in 1:dim(re)[1]){
    if ((app$med.id[i] == re$med.id[j]) & # med.id is equal and
        app$filial.id[i] == re$filial.id[j]) { # filial.id is equal
      if ((re$start.time[j] < app$start.time[i]) & (re$end.time[j] > app$start.time[i])) { # reception starts before appointment start time and ends after appointment start time OR 
        app$count[i] <- app$count[i] + 1
      } else if ((re$start.time[j] < app$end.time[i]) & (re$start.time[j] > app$start.time[i])) { # reception starts before appointment end time and after app. start time
        app$count[i] <- app$count[i] + 1
      }
    }
  }
}

【Comments】:

  • Try ?foverlaps. Check here

Tags: r merge group-by data.table dplyr


【Solution 1】:

Using foverlaps():

setkey(re, med.id, filial.id, start.time, end.time)
olaps = foverlaps(app, re, which=TRUE, nomatch=0L)[, .N, by=xid]
app[, count := 0L][olaps$xid, count := olaps$N]
app
#     med.id filial.id          start.time            end.time           A count
#  1:      1       100 2015-01-01 14:30:00 2015-01-01 15:29:59  0.60878560     1
#  2:      2       100 2015-01-01 15:30:00 2015-01-01 16:29:59 -0.11545284     0
#  3:      3       100 2015-01-01 16:30:00 2015-01-01 17:29:59  0.68992084     1
#  4:      4       100 2015-01-01 17:30:00 2015-01-01 18:29:59  0.04703938     1
#  5:      5       100 2015-01-01 18:30:00 2015-01-01 19:29:59 -0.95315419     0
#  6:      6       200 2015-01-01 14:30:00 2015-01-01 15:29:59  0.26193554     0
#  7:      7       200 2015-01-01 15:30:00 2015-01-01 16:29:59  1.55206077     1
#  8:      8       200 2015-01-01 16:30:00 2015-01-01 17:29:59  0.44517362     0
#  9:      9       200 2015-01-01 17:30:00 2015-01-01 18:29:59  0.11475881     0
# 10:     10       200 2015-01-01 18:30:00 2015-01-01 19:29:59 -0.66139828     0
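
The one-liner above does three things at once, so a self-contained sketch that splits it into steps may help. It rebuilds the question's tables without the random A and B columns. Two things here are assumptions that are not in the original: the time zone is pinned to UTC for reproducibility, and med.id is stored as an integer in both tables so the join columns have the same type. The helper at() is hypothetical and exists only for this sketch.

```r
library(data.table)

# Hypothetical helper: build a POSIXct on 2015-01-01 (tz = "UTC" is an assumption)
at <- function(x) as.POSIXct(paste("2015-01-01", x), tz = "UTC")
hours <- seq(at("14:30:00"), length.out = 5, by = "hours")

# Appointments, as in the question but without column A
app <- data.table(med.id = 1:10,
                  filial.id = rep(c(100, 200), each = 5),
                  start.time = rep(hours, 2),
                  end.time = rep(hours + 3599, 2))

# Receptions, as in the question but without column B (med.id made integer)
re <- data.table(med.id = c(1L, 11L, 3L, 4L, 15L, 6L, 7L),
                 filial.id = c(rep(100, 5), 200, 200),
                 start.time = at(c("14:25:00", "14:25:00", "16:32:00", "17:25:00",
                                   "16:10:00", "15:35:00", "15:50:00")),
                 end.time = at(c("15:25:00", "15:20:00", "17:36:00", "18:40:00",
                                 "16:10:00", "15:49:00", "16:12:00")))

# Step 1: key the lookup table; the last two key columns are the interval
setkey(re, med.id, filial.id, start.time, end.time)

# Step 2: which = TRUE returns row-index pairs (xid = row of app, yid = row of re)
pairs <- foverlaps(app, re, which = TRUE, nomatch = 0L)

# Step 3: count matching receptions per appointment row
olaps <- pairs[, .N, by = xid]

# Step 4: start every appointment at 0, then fill in the rows that had matches
app[, count := 0L][olaps$xid, count := olaps$N]

print(pairs)
print(app[count > 0L])
```

Note that foverlaps() treats intervals as closed, so a reception that merely touches an appointment's start or end counts as an overlap, whereas the loop in the question uses strict inequalities. On this data the two agree, but they could differ on boundary cases.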

PS: Please go through the vignettes to learn how to use data.table effectively.

【Comments】:

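    【Solution 2】:

    I actually don't think you need to merge by time overlap at all: your code is really merging on med.id and filial.id and then performing a simple comparison.

    First, for clarity, let's rename the start.time and end.time fields: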

    setnames(app, c("start.time", "end.time"), c("app.start.time", "app.end.time"))
    setnames(re, c("start.time", "end.time"), c("re.start.time", "re.end.time"))
    

    Then merge the two data.tables on the keys med.id and filial.id, like so:

    app_re <- re[app, on=c("med.id", "filial.id")]
    #    med.id filial.id       re.start.time         re.end.time          B
    # 1:      1       100 2015-01-01 14:25:00 2015-01-01 15:25:00  0.4307760
    # 2:      2       100                <NA>                <NA>         NA
    # 3:      3       100 2015-01-01 16:32:00 2015-01-01 17:36:00 -1.2933755
    # 4:      4       100 2015-01-01 17:25:00 2015-01-01 18:40:00 -1.2374469
    # 5:      5       100                <NA>                <NA>         NA
    # 6:      6       200 2015-01-01 15:35:00 2015-01-01 15:49:00 -0.8054822
    # 7:      7       200 2015-01-01 15:50:00 2015-01-01 16:12:00  2.5742241
    # 8:      8       200                <NA>                <NA>         NA
    # 9:      9       200                <NA>                <NA>         NA
    # 10:    10       200                <NA>                <NA>         NA
    #          app.start.time        app.end.time           A
    # 1:  2015-01-01 14:30:00 2015-01-01 15:29:59 -0.26828337
    # 2:  2015-01-01 15:30:00 2015-01-01 16:29:59  0.24246341
    # 3:  2015-01-01 16:30:00 2015-01-01 17:29:59  1.55824948
    # 4:  2015-01-01 17:30:00 2015-01-01 18:29:59  1.25829302
    # 5:  2015-01-01 18:30:00 2015-01-01 19:29:59  1.14244558
    # 6:  2015-01-01 14:30:00 2015-01-01 15:29:59 -0.41234563
    # 7:  2015-01-01 15:30:00 2015-01-01 16:29:59  0.07710022
    # 8:  2015-01-01 16:30:00 2015-01-01 17:29:59 -1.46421985
    # 9:  2015-01-01 17:30:00 2015-01-01 18:29:59  1.21682394
    # 10: 2015-01-01 18:30:00 2015-01-01 19:29:59  1.11197318
    

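    Then you can create the count variable using the same conditions as before:

    # as.numeric() must wrap the whole OR expression, otherwise the result is logical
    app_re[, count := 
      as.numeric((re.start.time < app.start.time & re.end.time > app.start.time) | 
        (re.start.time < app.end.time & re.start.time > app.start.time))]
    # Convert the NAs (appointments with no reception) to 0
    app_re[, count := ifelse(is.na(count), 0, count)]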
    

    This should be much faster than the for loop.

    【Comments】:
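
One caveat with the join approach: if a doctor has several receptions, the join returns one row per (appointment, reception) pair, so the flag has to be summed per appointment to reproduce the loop's count. The sketch below illustrates this on hypothetical toy data with plain numbers standing in for times; the toy values, the hit column and the allow.cartesian = TRUE safeguard are assumptions added for illustration, not part of the original answer.

```r
library(data.table)

# Hypothetical toy data: numbers stand in for timestamps
app <- data.table(med.id = c(1L, 2L),
                  filial.id = c(100, 100),
                  app.start.time = c(10, 20),
                  app.end.time = c(19, 29))
re <- data.table(med.id = c(1L, 1L, 1L),
                 filial.id = c(100, 100, 100),
                 re.start.time = c(8, 12, 30),
                 re.end.time = c(11, 15, 35))

# One row per (appointment, reception) pair; unmatched appointments get NA
app_re <- re[app, on = c("med.id", "filial.id"), allow.cartesian = TRUE]

# Same two conditions as the loop in the question
app_re[, hit := (re.start.time < app.start.time & re.end.time > app.start.time) |
                (re.start.time < app.end.time & re.start.time > app.start.time)]

# Collapse back to one row per appointment; NA (no reception) counts as 0
counts <- app_re[, .(count = sum(hit, na.rm = TRUE)),
                 by = .(med.id, filial.id, app.start.time, app.end.time)]
print(counts)
```

Here the first appointment matches two of its doctor's three receptions and the second matches none, so grouping by the appointment columns gives one count per appointment rather than one flag per joined row.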
