按时间间隔重叠合并数据表答案

【问题标题】：merge data tables by time intervals overlap按时间间隔重叠合并数据表
【发布时间】：2016-01-24 07:53:30
【问题描述】：

假设我有两张桌子。一个是约会，第二个是招待会。每张表都有孩子ID、医生ID、开始和结束时间（约会计划和接待事实）和一些其他数据。我想计算在预约期间的时间间隔内有多少预约有接待。接待事实可以在预约开始时间之前开始，之后，它可以在应用程序内部。间隔等

下面我做了两张桌子。一种用于约会，一种用于接待。我写了嵌套循环，但它的工作速度很慢。我的表每个包含大约 50 行。我需要快速解决这个问题。我怎么能在没有循环的情况下做到这一点？提前致谢！

library(data.table)

date <- as.POSIXct('2015-01-01 14:30:00')

# appointments data table
app <- data.table(med.id = 1:10,
                  filial.id = rep(c(100,200), each = 5),
                  start.time = rep(seq(date, length.out = 5, by = "hours"),2),
                  end.time = rep(seq(date+3599, length.out = 5, by = "hours"),2),
                  A = rnorm(10))


# receptions data table
re <- data.table(med.id = c(1,11,3,4,15,6,7),
                 filial.id = c(rep(100, 5), 200,200),
                 start.time = as.POSIXct(paste(rep('2015-01-01 ',7), c('14:25:00', '14:25:00','16:32:00', '17:25:00', '16:10:00', '15:35:00','15:50:00'))),
                 end.time = as.POSIXct(paste(rep('2015-01-01 ',7), c('15:25:00', '15:20:00','17:36:00', '18:40:00', '16:10:00', '15:49:00','16:12:00'))),
                 B = rnorm(7))



app$count <- 0

for (i in 1:dim(app)[1]){
  for (j in 1:dim(re)[1]){
    if ((app$med.id[i] == re$med.id[j]) & # med.id is equal and
        app$filial.id[i] == re$filial.id[j]) { # filial.id is equal
      if ((re$start.time[j] < app$start.time[i]) & (re$end.time[j] > app$start.time[i])) { # reception starts before appointment start time and ends after appointment start time OR 
        app$count[i] <- app$count[i] + 1
      } else if ((re$start.time[j] < app$end.time[i]) & (re$start.time[j] > app$start.time[i])) { # reception starts before appointment end time and after app. start time
        app$count[i] <- app$count[i] + 1
      }
    }
  }
}

【问题讨论】：

试试?foverlaps。检查here

标签： r merge group-by data.table dplyr

【解决方案1】：

使用foverlaps()：

setkey(re, med.id, filial.id, start.time, end.time)
olaps = foverlaps(app, re, which=TRUE, nomatch=0L)[, .N, by=xid]
app[, count := 0L][olaps$xid, count := olaps$N]
app
#     med.id filial.id          start.time            end.time           A count
#  1:      1       100 2015-01-01 14:30:00 2015-01-01 15:29:59  0.60878560     1
#  2:      2       100 2015-01-01 15:30:00 2015-01-01 16:29:59 -0.11545284     0
#  3:      3       100 2015-01-01 16:30:00 2015-01-01 17:29:59  0.68992084     1
#  4:      4       100 2015-01-01 17:30:00 2015-01-01 18:29:59  0.04703938     1
#  5:      5       100 2015-01-01 18:30:00 2015-01-01 19:29:59 -0.95315419     0
#  6:      6       200 2015-01-01 14:30:00 2015-01-01 15:29:59  0.26193554     0
#  7:      7       200 2015-01-01 15:30:00 2015-01-01 16:29:59  1.55206077     1
#  8:      8       200 2015-01-01 16:30:00 2015-01-01 17:29:59  0.44517362     0
#  9:      9       200 2015-01-01 17:30:00 2015-01-01 18:29:59  0.11475881     0
# 10:     10       200 2015-01-01 18:30:00 2015-01-01 19:29:59 -0.66139828     0

PS：请通过vignettes学习有效使用数据表。

【讨论】：

【解决方案2】：

我实际上认为您根本不需要按时间重叠进行合并：您的代码实际上是通过med.id 和filial.id 合并然后执行简单的比较。

首先，为了清楚起见，让我们重命名 start.time 和 end.time 字段：

setnames(app, c("start.time", "end.time"), c("app.start.time", "app.end.time"))
setnames(re, c("start.time", "end.time"), c("re.start.time", "re.end.time"))

然后您应该合并两个 data.tables 键上的 med.id 和 filial.id，如下所示：

app_re <- re[app, on=c("med.id", "filial.id")]
#    med.id filial.id       re.start.time         re.end.time          B
# 1:      1       100 2015-01-01 14:25:00 2015-01-01 15:25:00  0.4307760
# 2:      2       100                <NA>                <NA>         NA
# 3:      3       100 2015-01-01 16:32:00 2015-01-01 17:36:00 -1.2933755
# 4:      4       100 2015-01-01 17:25:00 2015-01-01 18:40:00 -1.2374469
# 5:      5       100                <NA>                <NA>         NA
# 6:      6       200 2015-01-01 15:35:00 2015-01-01 15:49:00 -0.8054822
# 7:      7       200 2015-01-01 15:50:00 2015-01-01 16:12:00  2.5742241
# 8:      8       200                <NA>                <NA>         NA
# 9:      9       200                <NA>                <NA>         NA
# 10:    10       200                <NA>                <NA>         NA
#          app.start.time        app.end.time           A
# 1:  2015-01-01 14:30:00 2015-01-01 15:29:59 -0.26828337
# 2:  2015-01-01 15:30:00 2015-01-01 16:29:59  0.24246341
# 3:  2015-01-01 16:30:00 2015-01-01 17:29:59  1.55824948
# 4:  2015-01-01 17:30:00 2015-01-01 18:29:59  1.25829302
# 5:  2015-01-01 18:30:00 2015-01-01 19:29:59  1.14244558
# 6:  2015-01-01 14:30:00 2015-01-01 15:29:59 -0.41234563
# 7:  2015-01-01 15:30:00 2015-01-01 16:29:59  0.07710022
# 8:  2015-01-01 16:30:00 2015-01-01 17:29:59 -1.46421985
# 9:  2015-01-01 17:30:00 2015-01-01 18:29:59  1.21682394
# 10: 2015-01-01 18:30:00 2015-01-01 19:29:59  1.11197318

然后您可以使用与以前相同的条件创建计数变量：

app_re[, count := 
  as.numeric(re.start.time < app.start.time & re.end.time > app.start.time) | 
    (re.start.time < app.end.time & re.start.time > app.start.time)]
# Convert the NAs to 0
app_re[, count := ifelse(is.na(count), 0, count)]

这应该比for 循环快得多。

【讨论】：