How to join 2 data tables by time interval and summarize overlapping and non-overlapping time periods by factor variable

【Question title】: How to join 2 data tables by time interval and summarize overlapping and non-overlapping time periods by factor variable
【Posted】: 2019-03-06 23:09:50
【Question description】:

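I have 2 data tables, each listing periods of observation effort and the type of effort (A, B, C). I want to know the duration of overlapping and non-overlapping effort.

I have tried to do this with data.table and foverlaps, but I can't figure out how to include all the non-overlapping periods.

Here is my example data. I first create 2 data tables containing the effort periods. My real dataset will consist of the periods during which a single observer is on effort.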

library(data.table)
library(lubridate)

# times have been edited so not fixed to minute intervals - to make more realistic
set.seed(13)
EffortType = sample(c("A","B","C"), 100, replace = TRUE)
On = sample(seq(as.POSIXct('2016/01/01 01:00:00'), as.POSIXct('2016/01/03 01:00:00'), by = "1 sec"), 100, replace=F)
Off = On + minutes(sample(1:60, 100, replace=T))
Effort1 = data.table(EffortType, On, Off)

EffortType2 = sample(c("A","B","C"), 100, replace = TRUE)
On2 = sample(seq(as.POSIXct('2016/01/01 12:00:00'), as.POSIXct('2016/01/03 12:00:00'), by = "1 sec"), 100, replace=F)
Off2 = On2 + minutes(sample(1:60, 100, replace=T))
Effort2 = data.table(EffortType2, On2, Off2)

#prep for using foverlaps
setkey(Effort1, On, Off)
setkey(Effort2, On2, Off2)

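I then use foverlaps to find where the efforts overlap. I set nomatch=NA, but that only gives me a right outer join. I want a full outer join, so I am wondering what a more appropriate function would be.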

matches = foverlaps(Effort1,Effort2,type="any",nomatch=NA)
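One possible way to approximate a full outer join, sketched here on toy data rather than the data above: run foverlaps in both directions and append the rows of the second table that matched nothing. The tables e1 and e2, their values, and the names left, right_only and full_join are assumptions made for illustration only.

```r
library(data.table)

tz <- "UTC"
# Toy effort tables (hypothetical values)
e1 <- data.table(
  EffortType = c("A", "B"),
  On  = as.POSIXct(c("2016-01-01 01:00:00", "2016-01-01 03:00:00"), tz = tz),
  Off = as.POSIXct(c("2016-01-01 02:00:00", "2016-01-01 04:00:00"), tz = tz)
)
e2 <- data.table(
  EffortType2 = c("C", "A"),
  On2  = as.POSIXct(c("2016-01-01 01:30:00", "2016-01-01 06:00:00"), tz = tz),
  Off2 = as.POSIXct(c("2016-01-01 02:30:00", "2016-01-01 07:00:00"), tz = tz)
)
setkey(e1, On, Off)
setkey(e2, On2, Off2)

# Every row of e1, with the overlapping e2 rows (NA where nothing overlaps)
left <- foverlaps(e1, e2, type = "any", nomatch = NA)

# Rows of e2 that overlap no row of e1
right_only <- foverlaps(e2, e1, type = "any", nomatch = NA)[is.na(EffortType)]

# Stack both parts, matching columns by name
full_join <- rbindlist(list(left, right_only), use.names = TRUE)
print(full_join)
```

This only collects the matched and unmatched rows. It does not yet split a partially overlapping period into its overlapping and non-overlapping parts.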

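I continue here to show how I tried to determine the durations of all overlapping and non-overlapping shift periods. But I don't think I have this part right either.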

# find start and end of intersection of all shifts
matches$start = pmax(matches$On, matches$On2, na.rm=T)
matches$end = pmin(matches$Off, matches$Off2, na.rm=T)

# create intervals and find durations
matches$int = interval(matches$start, matches$end)
matches$dur = as.duration(matches$int)

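I would then like to summarize the observation effort time for each "EffortType" grouping

to end up with something like this (the numbers are only examples, as I haven't managed to work out how to calculate this correctly, even in Excel)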

EffortType  Duration(in minutes)
A           10
B           20
C           12
AA          8
BB          6
CC          1
AC          160
AB          200
BC          150
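A minimal sketch of just the grouping step that would produce a table of this shape, assuming a table that already holds one row per overlapping or non-overlapping piece with its start and end. The table pieces, its values, and the columns dur_min and combo are made up for illustration. Getting the pieces right is the hard part and is not solved here.

```r
library(data.table)

tz <- "UTC"
# Hypothetical pieces: EffortType2 is NA where only one effort was active
pieces <- data.table(
  EffortType  = c("A", "A", "B", "C"),
  EffortType2 = c("C", "A", NA,  "A"),
  start = as.POSIXct(c("2016-01-01 01:30:00", "2016-01-01 03:00:00",
                       "2016-01-01 05:00:00", "2016-01-01 06:00:00"), tz = tz),
  end   = as.POSIXct(c("2016-01-01 02:00:00", "2016-01-01 03:10:00",
                       "2016-01-01 05:45:00", "2016-01-01 06:20:00"), tz = tz)
)

# Length of each piece in minutes
pieces[, dur_min := as.numeric(difftime(end, start, units = "mins"))]

# Label each piece; pmin/pmax sort the pair so that "CA" and "AC" share one label
pieces[, combo := ifelse(is.na(EffortType2),
                         EffortType,
                         paste0(pmin(EffortType, EffortType2),
                                pmax(EffortType, EffortType2)))]

# Total minutes per label
effort_summary <- pieces[, .(Duration = sum(dur_min)), by = combo]
print(effort_summary)
```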

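【Comments】:

You should add some library(...) calls. minutes is not in base R.
There are overlaps within Effort1 and within Effort2. How should these be handled? Should they be collapsed by EffortType?
I have edited how the times in the example are created to make them more realistic. There may be overlaps between Effort1 and Effort2. These should be included in the duration summary, e.g. as AA, BB or CC.

Tags: r data.table overlap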

【Solution 1】:

Not a complete answer (see the last paragraph), but I think this will get you what you want.

library( data.table )
library( lubridate )

set.seed(13)
EffortType = sample(c("A","B","C"), 100, replace = TRUE)
On = sample(seq(as.POSIXct('2016/01/01 01:00:00'), as.POSIXct('2016/01/03 01:00:00'), by = "15 mins"), 100, replace=T)
Off = On + minutes(sample(1:60, 100, replace=T))
Effort1 = data.table(EffortType, On, Off)

EffortType2 = sample(c("A","B","C"), 100, replace = TRUE)
On = sample(seq(as.POSIXct('2016/01/01 12:00:00'), as.POSIXct('2016/01/03 12:00:00'), by = "15 mins"), 100, replace=T)
Off = On + minutes(sample(1:60, 100, replace=T))
Effort2 = data.table(EffortType2, On, Off)

#create DT of minutes, spanning your entire period.
dt.minutes <- data.table( On = seq(as.POSIXct('2016/01/01 01:00:00'), as.POSIXct('2016/01/03 12:00:00'), by = "1 mins"), 
                          Off = seq(as.POSIXct('2016/01/01 01:00:00'), as.POSIXct('2016/01/03 12:00:00'), by = "1 mins") + 60 )

#prep for using foverlaps
setkey(Effort1, On, Off)
setkey(Effort2, On, Off)

#overlap join both efforts on the dt.minutes. note the use of "within" and "nomatch" to throw away minutes without events.

m1 <- foverlaps(dt.minutes, Effort1 ,type="within",nomatch=0L)
m2 <- foverlaps(dt.minutes, Effort2 ,type="within",nomatch=0L)

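#bind together by position (the first column is EffortType in m1 but EffortType2 in m2),
#then keep the minute bounds (i.On, i.Off) as On and Off
result <- rbindlist(list(m1,m2), use.names = FALSE)[, `:=`(On=i.On, Off = i.Off)][, `:=`(i.On = NULL, i.Off = NULL)]

#cast the result: count how many efforts of each type cover each minute
result.cast <- dcast( result, On + Off ~ EffortType, value.var = "EffortType", fun.aggregate = length)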

Result

head( result.cast, 10)

#                      On                 Off A B C
#  1: 2016-01-01 01:00:00 2016-01-01 01:01:00 1 0 1
#  2: 2016-01-01 01:01:00 2016-01-01 01:02:00 1 0 1
#  3: 2016-01-01 01:02:00 2016-01-01 01:03:00 1 0 1
#  4: 2016-01-01 01:03:00 2016-01-01 01:04:00 1 0 1
#  5: 2016-01-01 01:04:00 2016-01-01 01:05:00 1 0 1
#  6: 2016-01-01 01:05:00 2016-01-01 01:06:00 1 0 1
#  7: 2016-01-01 01:06:00 2016-01-01 01:07:00 1 0 1
#  8: 2016-01-01 01:07:00 2016-01-01 01:08:00 1 0 1
#  9: 2016-01-01 01:08:00 2016-01-01 01:09:00 1 0 1
# 10: 2016-01-01 01:09:00 2016-01-01 01:10:00 1 0 1
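To illustrate what type="within" does in the join above, here is a small sketch on a single made-up effort period: only grid minutes that lie entirely inside the period are kept, so a period that starts or ends partway through a minute loses that partial minute. The names effort, grid and covered and all the values are assumptions made for illustration.

```r
library(data.table)

tz <- "UTC"
# One hypothetical effort period that starts 30 seconds into a minute
effort <- data.table(
  EffortType = "A",
  On  = as.POSIXct("2016-01-01 01:00:30", tz = tz),
  Off = as.POSIXct("2016-01-01 01:03:00", tz = tz)
)
setkey(effort, On, Off)

# Minute grid: 01:00-01:01, 01:01-01:02, ..., 01:04-01:05
grid_start <- seq(as.POSIXct("2016-01-01 01:00:00", tz = tz),
                  by = "1 min", length.out = 5)
grid <- data.table(On = grid_start, Off = grid_start + 60)

# Keep only the grid minutes that fall completely within the effort period
covered <- foverlaps(grid, effort, type = "within", nomatch = 0L)
print(covered)
```

Here only the minutes starting at 01:01 and 01:02 survive. The 30 seconds of effort between 01:00:30 and 01:01:00 are dropped, which matters if the real data are recorded to the second.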

Sometimes an effort type occurs 2-3 times within the same minute, e.g.

#                     On                 Off A B C
#53: 2016-01-02 14:36:00 2016-01-02 14:37:00 2 2 3

I am not sure how you want to summarize those...

If you can treat them as a single minute, then:

> result.cast[A > 0 & B == 0 & C == 0, .N]   # A only
[1] 476
> result.cast[A == 0 & B > 0 & C == 0, .N]   # B only
[1] 386
> result.cast[A == 0 & B == 0 & C > 0, .N]   # C only
[1] 504
> result.cast[A > 0 & B > 0 & C == 0, .N]    # A and B
[1] 371
> result.cast[A == 0 & B > 0 & C > 0, .N]    # B and C
[1] 341
> result.cast[A > 0 & B == 0 & C > 0, .N]    # A and C
[1] 472
> result.cast[A > 0 & B > 0 & C > 0, .N]     # A, B and C
[1] 265

I think that gives you the durations in minutes (although this could probably be done in a smarter way).
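One possibly smarter way, sketched on a made-up stand-in for result.cast: build a label per minute from the types that are present and count the minutes per label in a single grouped step, instead of the seven separate counts above. The toy values and the column names combo and minutes are assumptions made for illustration.

```r
library(data.table)

# Toy stand-in for result.cast: one row per minute, counts per effort type
toy.cast <- data.table(
  On = as.POSIXct("2016-01-01 01:00:00", tz = "UTC") + 60 * 0:5,
  A  = c(1L, 1L, 0L, 2L, 1L, 1L),
  B  = c(0L, 0L, 1L, 1L, 1L, 0L),
  C  = c(0L, 1L, 0L, 0L, 1L, 0L)
)

# Label each minute by the types present, treating counts above 1 as present
toy.cast[, combo := paste0(ifelse(A > 0, "A", ""),
                           ifelse(B > 0, "B", ""),
                           ifelse(C > 0, "C", ""))]

# One row per combination; each grid row stands for one minute
combo.minutes <- toy.cast[, .(minutes = .N), by = combo]
print(combo.minutes)
```

Like the counts above, this treats repeated efforts of the same type within a minute as one minute and does not distinguish which table an effort came from.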

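【Discussion】:

Your solution revealed a weakness in my example dataset: I shouldn't have events occurring 2-3 times within the same minute, as your result pointed out. I have changed the example data to make it more realistic and avoid this.
@heatherr If this answers your question, please let me know by accepting the answer. If not, please point out which problems remain.
Could you explain why i.On and i.Off are assigned to On and Off when binding together? I'm not sure why we don't just drop the On and Off columns. Is it to keep the names tidy?
@heatherr, yes.. just to keep the names tidy.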