【Posted】: 2018-04-01 15:12:31
【Question】:
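I'm trying to generate several summary statistics, some of which need to be computed on a subset of each group. The data.table is fairly large, with 10 million rows, but grouping with by is very fast (under a second) as long as no column is subset inside the group. Adding just one extra column that has to be computed on a subset of each group makes the run time about 12 times longer.
Is there a faster way to do this? My full code is below.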
library(data.table)
library(microbenchmark)
N = 10^7
DT = data.table(id1 = sample(1:400, size = N, replace = TRUE),
id2 = sample(1:100, size = N, replace = TRUE),
id3 = sample(1:50, size = N, replace = TRUE),
filter_var = sample(1:10, size = N, replace = TRUE),
x1 = sample(1:1000, size = N, replace = TRUE),
x2 = sample(1:1000, size = N, replace = TRUE),
x3 = sample(1:1000, size = N, replace = TRUE),
x4 = sample(1:1000, size = N, replace = TRUE),
x5 = sample(1:1000, size = N, replace = TRUE) )
setkey(DT, id1,id2,id3)
microbenchmark(
DT[, .(
sum_x1 = sum(x1),
sum_x2 = sum(x2),
sum_x3 = sum(x3),
sum_x4 = sum(x4),
sum_x5 = sum(x5),
avg_x1 = mean(x1),
avg_x2 = mean(x2),
avg_x3 = mean(x3),
avg_x4 = mean(x4),
avg_x5 = mean(x5)
) , by = c('id1','id2','id3')] , unit = 's', times = 10L)
min lq mean median uq max neval
0.942013 0.9566891 1.004134 0.9884895 1.031334 1.165144 10
microbenchmark( DT[, .(
sum_x1 = sum(x1),
sum_x2 = sum(x2),
sum_x3 = sum(x3),
sum_x4 = sum(x4),
sum_x5 = sum(x5),
avg_x1 = mean(x1),
avg_x2 = mean(x2),
avg_x3 = mean(x3),
avg_x4 = mean(x4),
avg_x5 = mean(x5),
sum_x1_F1 = sum(x1[filter_var < 5]) #this line slows everything down
) , by = c('id1','id2','id3')] , unit = 's', times = 10L)
min lq mean median uq max neval
12.24046 12.4123 12.83447 12.72026 13.49059 13.61248 10
【Comments】:
- Try adding verbose=TRUE and reading ?GForce. If you have to do this calculation, you could first create v := x1*(filter_var < 5) and then take its sum.
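- @Frank Great suggestion, you should post it as an answer - my code now runs in one second instead of 12. I didn't realize that gforce gets turned off when subsetting. Has it always been this way? I've rarely used data.table since 2016, and I seem to remember it being as fast as expected in this situation, but I could be wrong.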
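- OK, done. Yes, I don't think GForce has ever covered this usage. By the way, they are working on a benchmarking vignette, if you're interested: github.com/Rdatatable/data.table/blob/master/vignettes/… I'd guess it will be in the next CRAN release.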
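The workaround from the comments can be sketched roughly as follows. This is a minimal illustration rather than a benchmark: it uses a much smaller N than the question, only two grouping columns, and a helper column named x1_F1 (a name chosen here for illustration; the comment calls it v). The idea is to build the masked column once, outside the grouped call, so that j contains only plain sum() calls.

```r
library(data.table)

set.seed(1)
N <- 1e5  # assumption: far smaller than the question's 10^7 so the sketch runs quickly

DT <- data.table(
  id1        = sample(1:400,  size = N, replace = TRUE),
  id2        = sample(1:100,  size = N, replace = TRUE),
  filter_var = sample(1:10,   size = N, replace = TRUE),
  x1         = sample(1:1000, size = N, replace = TRUE)
)

# Form from the question: x1 is subset inside j for every group.
slow <- DT[, .(sum_x1    = sum(x1),
               sum_x1_F1 = sum(x1[filter_var < 5])),
           by = .(id1, id2)]

# Workaround: zero out the rows that fail the filter once, by reference,
# then aggregate the helper column with a plain sum().
DT[, x1_F1 := x1 * (filter_var < 5)]
fast <- DT[, .(sum_x1    = sum(x1),
               sum_x1_F1 = sum(x1_F1)),
           by = .(id1, id2)]

# Both forms should give the same per-group totals.
stopifnot(isTRUE(all.equal(slow, fast)))
```

Passing verbose = TRUE to either grouped call prints whether the j expression was optimized. Note that the masking trick is only equivalent for sums, and only when x1 has no missing values: NA * FALSE is NA in R, so an NA in a filtered-out row would still propagate into the total.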
Tags: r data.table