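I benchmarked the five functions listed in my question (as @r2evans suggested). I used five different test setups (three data sets, two of them run again with the argument order swapped), because I thought performance might depend on whether the pairs of vectors are mostly disjoint or mostly overlapping. (It turns out this makes little difference for EIC.1 through EIC.4; EIC.5 runs slower when most pairs are disjoint.)
This is how I generated the data sets: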
n <- 1400L
a1 <- replicate(n, sample(5000000L, 500L, replace = TRUE), simplify = FALSE)
b1 <- replicate(n, sample(5000000L, 2500L, replace = TRUE), simplify = FALSE)
# two lists of vectors, to be compared pairwise, where about 22% of the pairs have elements in common
a2 <- replicate(n, sample(800000L, 500L, replace = TRUE), simplify = FALSE)
b2 <- replicate(n, sample(800000L, 2500L, replace = TRUE), simplify = FALSE)
# two lists of vectors, to be compared pairwise, where about 79% of the pairs have elements in common
a3 <- replicate(n, sample(3250000L, 1500L, replace = TRUE), simplify = FALSE)
b3 <- replicate(n, sample(3250000L, 1500L, replace = TRUE), simplify = FALSE)
# two lists of vectors, equal in length, to be compared pairwise, where about 50% of the pairs have elements in common
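The overlap percentages quoted in the comments above can be sanity-checked with a rough approximation: if both vectors are drawn uniformly from a pool of size N, the chance that a pair shares at least one element is roughly 1 - exp(-len_a * len_b / N). The sketch below is not part of the original benchmark; `expected_overlap` and `overlap_fraction` are hypothetical helper names introduced only for illustration, and the formula is an approximation that ignores duplicates within a vector.

```r
# Approximate probability that two uniform samples (with replacement)
# of lengths len_a and len_b from 1..pool share at least one value.
# Assumption: duplicates within a vector are rare enough to ignore.
expected_overlap <- function(pool, len_a, len_b) {
  1 - exp(-len_a * len_b / pool)
}

# Empirical share of pairs (a[[k]], b[[k]]) with at least one common element.
overlap_fraction <- function(a, b) {
  mean(mapply(function(x, y) any(x %in% y), a, b))
}

# Approximate values for the three data sets generated above
expected_overlap(5000000, 500, 2500)   # first data set  (a1, b1)
expected_overlap(800000, 500, 2500)    # second data set (a2, b2)
expected_overlap(3250000, 1500, 1500)  # third data set  (a3, b3)
```

These come out at roughly 0.22, 0.79 and 0.50. Calling `overlap_fraction(a1, b1)` on the generated lists gives the empirical counterpart, which varies a little from run to run because the data are random.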
Here are my results:
library(microbenchmark)
LL <- c(expression(sapply(1:n, function(k) EIC.1(v1[[k]], v2[[k]]))),
expression(sapply(1:n, function(k) EIC.2(v1[[k]], v2[[k]]))),
expression(sapply(1:n, function(k) EIC.3(v1[[k]], v2[[k]]))),
expression(sapply(1:n, function(k) EIC.4(v1[[k]], v2[[k]]))),
expression(sapply(1:n, function(k) EIC.5(v1[[k]], v2[[k]]))) )
v1 <- a1
v2 <- b1
microbenchmark(list=LL)
Unit: milliseconds
expr min lq mean median uq max neval
sapply(1:n, function(k) EIC.1(v1[[k]], v2[[k]])) 110.59374 110.98621 113.5366 112.52576 114.4162 130.0801 100
sapply(1:n, function(k) EIC.2(v1[[k]], v2[[k]])) 97.18203 97.64194 101.4938 99.20129 101.6032 158.8913 100
sapply(1:n, function(k) EIC.3(v1[[k]], v2[[k]])) 96.98262 98.73502 100.5121 99.06029 100.6465 136.2520 100
sapply(1:n, function(k) EIC.4(v1[[k]], v2[[k]])) 255.85385 256.67103 262.0515 258.23332 265.1787 291.9498 100
sapply(1:n, function(k) EIC.5(v1[[k]], v2[[k]])) 230.49910 231.25642 236.2385 233.05208 237.7731 280.7453 100
v1 <- a2
v2 <- b2
microbenchmark(list=LL)
Unit: milliseconds
expr min lq mean median uq max neval
sapply(1:n, function(k) EIC.1(v1[[k]], v2[[k]])) 112.40455 112.78578 114.8205 114.4925 114.9898 126.2302 100
sapply(1:n, function(k) EIC.2(v1[[k]], v2[[k]])) 98.45717 98.87847 101.7272 100.5070 101.0258 134.8737 100
sapply(1:n, function(k) EIC.3(v1[[k]], v2[[k]])) 98.15024 98.59084 101.1340 100.2553 101.2907 131.4896 100
sapply(1:n, function(k) EIC.4(v1[[k]], v2[[k]])) 258.48673 259.18759 264.2449 260.1710 265.2686 307.0624 100
sapply(1:n, function(k) EIC.5(v1[[k]], v2[[k]])) 200.79988 201.52592 205.8434 203.3817 207.2203 244.2715 100
v1 <- a3
v2 <- b3
microbenchmark(list=LL)
Unit: milliseconds
expr min lq mean median uq max neval
sapply(1:n, function(k) EIC.1(v1[[k]], v2[[k]])) 134.0820 134.5529 135.4400 134.6922 135.6203 142.1575 100
sapply(1:n, function(k) EIC.2(v1[[k]], v2[[k]])) 119.7959 120.1119 122.3887 120.2729 122.2338 158.0306 100
sapply(1:n, function(k) EIC.3(v1[[k]], v2[[k]])) 119.7705 120.2145 122.3458 121.9361 122.4224 150.4227 100
sapply(1:n, function(k) EIC.4(v1[[k]], v2[[k]])) 257.0928 259.0730 263.2403 259.6671 263.7227 318.9604 100
sapply(1:n, function(k) EIC.5(v1[[k]], v2[[k]])) 226.4821 227.0798 230.2878 228.4882 231.3292 258.4599 100
v1 <- b1 # the longer vector is now vec1
v2 <- a1
microbenchmark(list=LL)
Unit: milliseconds
expr min lq mean median uq max neval
sapply(1:n, function(k) EIC.1(v1[[k]], v2[[k]])) 199.2799 201.3817 202.5054 201.6378 202.7534 214.8660 100
sapply(1:n, function(k) EIC.2(v1[[k]], v2[[k]])) 187.5226 187.9299 188.9177 188.1184 189.8541 196.1020 100
sapply(1:n, function(k) EIC.3(v1[[k]], v2[[k]])) 187.8891 188.3417 190.5641 190.1809 190.8307 219.4735 100
sapply(1:n, function(k) EIC.4(v1[[k]], v2[[k]])) 255.1007 255.8905 260.1282 256.8316 262.1560 288.4900 100
sapply(1:n, function(k) EIC.5(v1[[k]], v2[[k]])) 237.7409 238.4515 241.5251 239.9415 243.5631 266.5916 100
v1 <- b2
v2 <- a2
microbenchmark(list=LL)
Unit: milliseconds
expr min lq mean median uq max neval
sapply(1:n, function(k) EIC.1(v1[[k]], v2[[k]])) 198.8747 201.2476 202.1573 201.5215 202.3886 207.7772 100
sapply(1:n, function(k) EIC.2(v1[[k]], v2[[k]])) 185.5260 185.7983 187.8099 185.9842 188.3947 225.7553 100
sapply(1:n, function(k) EIC.3(v1[[k]], v2[[k]])) 185.8022 186.1824 188.8937 187.9226 188.6763 221.2442 100
sapply(1:n, function(k) EIC.4(v1[[k]], v2[[k]])) 257.6607 258.5063 262.3677 259.6778 264.6313 304.4813 100
sapply(1:n, function(k) EIC.5(v1[[k]], v2[[k]])) 230.5553 231.3261 233.9914 232.9138 235.0349 260.4950 100
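In all cases, EIC.2 and EIC.3 are the fastest (and very close to each other), with EIC.1 close behind. Note, however, that all three are more efficient when the shorter vector comes first. For example, with vec1 as a1 (length 500) and vec2 as b1 (length 2500), the median for EIC.2 is 99 ms. When I swap them so that vec1 is b1 and vec2 is a1, EIC.2 slows to 188 ms. So, for efficiency, it is worth checking which vector is longer before calling EIC.2. (Or rewrite EIC.2 so that it always tests [shorter vector] %in% [longer vector].)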
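A minimal sketch of that last suggestion follows. The definition of EIC.2 is given in the question, not here, so this sketch assumes it boils down to `any(vec1 %in% vec2)`. `EIC.shorter_first` is a hypothetical name, not one of the benchmarked functions.

```r
# Assumption: EIC.2 is essentially any(vec1 %in% vec2).
# This wrapper always puts the shorter vector on the left of %in%,
# which was the faster ordering in the timings above.
EIC.shorter_first <- function(vec1, vec2) {
  if (length(vec1) > length(vec2)) {
    any(vec2 %in% vec1)
  } else {
    any(vec1 %in% vec2)
  }
}

# The result does not depend on argument order:
EIC.shorter_first(1:3, 3:10)   # TRUE  (3 is shared)
EIC.shorter_first(3:10, 1:3)   # TRUE
EIC.shorter_first(1:3, 4:10)   # FALSE (no shared element)
```

Because "have an element in common" is symmetric, swapping the arguments cannot change the answer. I have not benchmarked this wrapper.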