The data.table package offers two approaches: the foverlaps() function and a non-equi join. Both require adding helper columns to the data.
Create data
arrayA <- anytime::utctime(c("2000-12-31 10:00:00", "2000-12-31 12:00:00",
"2000-12-31 12:00:05", "2000-12-31 12:00:10"), tz = "UTC")
arrayB <- anytime::utctime(c("2000-12-31 10:00:00", "2000-12-31 12:00:01",
"2000-12-31 12:00:02", "2000-12-31 11:00:00"), tz = "UTC")
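Note that both vectors are of class POSIXct, which is better suited here than the POSIXlt class returned by strptime(). In addition, a few more timestamps have been added to test non-matches.
Prepare data
Data preparation is the same for both approaches: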
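The difference between the two classes can be checked in base R. The following is a minimal sketch with illustrative names (x, ts_ct, ts_lt, ts_conv are not part of the answer): POSIXct stores each timestamp as a single number of seconds since the epoch, whereas POSIXlt is a list of date-time components, which tends to be awkward as a table column.

```r
# Illustrative timestamp string (not part of the original data)
x <- "2000-12-31 12:00:00"

ts_ct <- as.POSIXct(x, tz = "UTC")                     # POSIXct: seconds since the epoch
ts_lt <- strptime(x, "%Y-%m-%d %H:%M:%S", tz = "UTC")  # POSIXlt: list of components

class(ts_ct)   # "POSIXct" "POSIXt"
class(ts_lt)   # "POSIXlt" "POSIXt"
typeof(ts_ct)  # "double"
typeof(ts_lt)  # "list"

# A POSIXlt value can be converted before it goes into a data.table
ts_conv <- as.POSIXct(ts_lt)
identical(as.numeric(ts_conv), as.numeric(ts_ct))
```

If timestamps arrive as POSIXlt, converting them with as.POSIXct() before building the data.table keeps the rest of the workflow unchanged.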
# make data.tables
library(data.table) # version 1.10.4 used here
A <- data.table(arrayA)
B <- data.table(arrayB)
# define tolerance = 2 * tol_half
tol_half <- 1L # seconds
# add helper columns
A[, "copyA" := arrayA]
A
# arrayA copyA
#1: 2000-12-31 10:00:00 2000-12-31 10:00:00
#2: 2000-12-31 12:00:00 2000-12-31 12:00:00
#3: 2000-12-31 12:00:05 2000-12-31 12:00:05
#4: 2000-12-31 12:00:10 2000-12-31 12:00:10
B[, `:=`(start = arrayB - tol_half, end = arrayB + tol_half)]
B
# arrayB start end
#1: 2000-12-31 10:00:00 2000-12-31 09:59:59 2000-12-31 10:00:01
#2: 2000-12-31 12:00:01 2000-12-31 12:00:00 2000-12-31 12:00:02
#3: 2000-12-31 12:00:02 2000-12-31 12:00:01 2000-12-31 12:00:03
#4: 2000-12-31 11:00:00 2000-12-31 10:59:59 2000-12-31 11:00:01
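The start and end columns in B define the tolerated time range into which a value of arrayA must fall to count as a match. This is similar to what the match_fun function does on the fly in the fuzzyjoin solution.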
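To make the matching rule concrete: start <= arrayA <= end is the same condition as |arrayA - arrayB| <= tol_half. The sketch below checks it by brute force in base R, as a sanity check only. It assumes the timestamps can be rebuilt with as.POSIXct() instead of anytime::utctime(), and the names diff_sec, hits and brute_force are illustrative. It compares every pair, so it is only practical for small inputs.

```r
# Same timestamps as above, built with base R instead of anytime::utctime()
arrayA <- as.POSIXct(c("2000-12-31 10:00:00", "2000-12-31 12:00:00",
                       "2000-12-31 12:00:05", "2000-12-31 12:00:10"), tz = "UTC")
arrayB <- as.POSIXct(c("2000-12-31 10:00:00", "2000-12-31 12:00:01",
                       "2000-12-31 12:00:02", "2000-12-31 11:00:00"), tz = "UTC")
tol_half <- 1L  # seconds

# Pairwise absolute differences in seconds: rows index arrayA, columns index arrayB
diff_sec <- abs(outer(as.numeric(arrayA), as.numeric(arrayB), "-"))

# start <= arrayA <= end is equivalent to |arrayA - arrayB| <= tol_half
hits <- which(diff_sec <= tol_half, arr.ind = TRUE)

# Matching pairs (brute_force is a hypothetical name used for illustration)
brute_force <- data.frame(arrayA = arrayA[hits[, "row"]],
                          arrayB = arrayB[hits[, "col"]])
brute_force
```

The pairs found this way can be compared against the output of either data.table approach.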
foverlaps()
Use foverlaps() to search for overlapping time ranges in A and B:
# setting keys is required by foverlaps()
setkey(A, arrayA, copyA)
setkey(B, start, end)
# find overlaps
result <- foverlaps(B, A, nomatch = 0)[, c("copyA", "start", "end") := NULL][]
result
# arrayA arrayB
#1: 2000-12-31 10:00:00 2000-12-31 10:00:00
#2: 2000-12-31 12:00:00 2000-12-31 12:00:01
Note that [, c("copyA", "start", "end") := NULL][] immediately removes the helper columns from the output of foverlaps().
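As a variation that is not part of the approach above, foverlaps() can also be called with A as x and B as y. In this sketch only B is keyed, the interval columns of A are named through by.x, and type = "within" asks that each zero-length interval [arrayA, copyA] lie inside a window [start, end]. The table contents are rebuilt with as.POSIXct() so that the example is self-contained, and within_hits is an illustrative name.

```r
library(data.table)

# Rebuild the example tables with base R timestamps
A <- data.table(arrayA = as.POSIXct(c("2000-12-31 10:00:00", "2000-12-31 12:00:00",
                                      "2000-12-31 12:00:05", "2000-12-31 12:00:10"),
                                    tz = "UTC"))
B <- data.table(arrayB = as.POSIXct(c("2000-12-31 10:00:00", "2000-12-31 12:00:01",
                                      "2000-12-31 12:00:02", "2000-12-31 11:00:00"),
                                    tz = "UTC"))
tol_half <- 1L  # seconds

# Helper columns, as in the preparation step
A[, copyA := arrayA]
B[, `:=`(start = arrayB - tol_half, end = arrayB + tol_half)]

# Only y (here B) needs a key; the interval columns of x are given in by.x
setkey(B, start, end)
within_hits <- foverlaps(A, B, by.x = c("arrayA", "copyA"),
                         type = "within", nomatch = 0L)

# Keep only the two timestamp columns
within_hits[, .(arrayA, arrayB)]
```

Selecting the two columns in a separate step leaves within_hits intact, which can help when inspecting which window each timestamp fell into.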
Non-equi join
With recent versions of data.table (1.9.8 or later), non-equi joins are possible:
result <- A[B, .(arrayA, arrayB), on = c("copyA>=start", "copyA<=end"), nomatch = 0L]
result
# arrayA arrayB
#1: 2000-12-31 10:00:00 2000-12-31 10:00:00
#2: 2000-12-31 12:00:00 2000-12-31 12:00:01
Note that, thanks to automatic indexing, a non-equi join does not require keys to be set beforehand.
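In a non-equi join, the join columns of the left table are reported with the bound values taken from the right table, which is why the join above is done on copyA and leaves arrayA untouched. As a possible variation, not taken from the answer, the x. prefix in j can be used to refer to A's own column, so that the join can be written on arrayA directly. This sketch assumes the same data rebuilt with as.POSIXct(); result_x is an illustrative name. The start and end columns in B are still needed.

```r
library(data.table)

# Rebuild the example tables with base R timestamps
A <- data.table(arrayA = as.POSIXct(c("2000-12-31 10:00:00", "2000-12-31 12:00:00",
                                      "2000-12-31 12:00:05", "2000-12-31 12:00:10"),
                                    tz = "UTC"))
B <- data.table(arrayB = as.POSIXct(c("2000-12-31 10:00:00", "2000-12-31 12:00:01",
                                      "2000-12-31 12:00:02", "2000-12-31 11:00:00"),
                                    tz = "UTC"))
tol_half <- 1L  # seconds
B[, `:=`(start = arrayB - tol_half, end = arrayB + tol_half)]

# x.arrayA refers to A's own value rather than the join bounds taken from B
result_x <- A[B, .(arrayA = x.arrayA, arrayB),
              on = .(arrayA >= start, arrayA <= end), nomatch = 0L]
result_x
```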
Benchmark
TODO: It would be interesting to compare fuzzyjoin, foverlaps(), and the non-equi join on a large use case.