Split and expand.grid by group on a large data set: answers

[Question title]: split and expand.grid by group on large data set
[Posted]: 2017-10-26 15:31:03
[Question]:

I have a df in the following format and am trying to get a data frame with all the pairwise combinations per group

df<-structure(list(id = c(209044052, 209044061, 209044061, 209044061,209044062, 209044062, 209044062, 209044182, 209044183, 209044295), group = c(2365686, 387969, 388978, 2365686, 387969, 388978, 2365686, 2278460, 2278460, 654238)), .Names = c("id", "group"), row.names = c(NA, -10L), class = "data.frame")

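While do.call(rbind, lapply(split(df, df$group), function(i) expand.grid(i$id, i$id))) works for a small data frame, I run into time problems on my large data (about 12 million observations and about 1.5 million groups).

After some testing I found that the split command seems to be the bottleneck, and expand.grid might not be the fastest solution either.

I found some improvements for expand.grid ("Use outer instead of expand.grid") and some faster split alternatives ("Improving performance of split() function in R?"), but I am struggling to put them together with grouping.

The output should look something like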

  Var1      Var2
209044061 209044061
209044062 209044061
209044061 209044062
209044062 209044062
209044061 209044061
209044062 209044061
209044061 209044062
209044062 209044062
209044295 209044295
209044182 209044182
209044183 209044182
....

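As an extra, I would like to exclude repetitions of the same pair: drop self-references (e.g. 209044061 209044061 above) and keep only one combination when the same two ids appear in a different order (e.g. 209044061 209044062 and 209044062 209044061 above), i.e. combinations without repetition. I tried library(gtools) with `combinations()`, but could not figure out whether this slows down the calculation even more.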

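[Comments]:

Maybe data.table? library(data.table); setDT(df)[, expand.grid(id, id), by = group]

Tags: r split expand

[Solution 1]:

One possible solution, which avoids both repetitions of the same pair and the same pair in a different order, is to use the data.table and combinat packages: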

library(data.table)
setDT(df)[order(id), data.table(combinat::combn2(unique(id))), by = group]

     group        V1        V2
1: 2365686 209044052 209044061
2: 2365686 209044052 209044062
3: 2365686 209044061 209044062
4:  387969 209044061 209044062
5:  388978 209044061 209044062
6: 2278460 209044182 209044183

order(id) is used here only for convenience, to make the results easier to check; it can be skipped in production code.
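For checking results on a small input, the same unordered pairs can be reproduced in base R. The following is only a reference sketch, not the answer's method: the helper name pairs_for is made up for illustration, and the split/lapply pattern it relies on is exactly what the question found too slow on large data.

```r
# Sample data from the question
df <- data.frame(
  id = c(209044052, 209044061, 209044061, 209044061, 209044062,
         209044062, 209044062, 209044182, 209044183, 209044295),
  group = c(2365686, 387969, 388978, 2365686, 387969,
            388978, 2365686, 2278460, 2278460, 654238)
)

# Hypothetical helper: all unordered pairs of the distinct ids in one group
pairs_for <- function(ids) {
  ids <- sort(unique(ids))
  # Guard: utils::combn() treats a single number n as seq_len(n),
  # so groups with fewer than two ids are handled explicitly
  if (length(ids) < 2L) {
    return(data.frame(V1 = numeric(0), V2 = numeric(0)))
  }
  p <- t(utils::combn(ids, 2L))  # one row per pair, smaller id first
  data.frame(V1 = p[, 1], V2 = p[, 2])
}

res <- do.call(rbind, lapply(split(df$id, df$group), pairs_for))
print(res)
```

The guard for single-id groups matters because utils::combn() would otherwise expand the lone id into the sequence 1, 2, ..., id.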

Replacing `combn2()` by a non-equi join

An alternative approach replaces the call to combn2() with a non-equi join:

mdf <- setDT(df)[order(id), unique(id), by = group]
mdf[mdf, on = .(group, V1 < V1), .(group, x.V1, i.V1), nomatch = 0L,
    allow.cartesian = TRUE]

     group        V1        V2
1: 2365686 209044052 209044061
2: 2365686 209044052 209044062
3: 2365686 209044061 209044062
4:  387969 209044061 209044062
5:  388978 209044061 209044062
6: 2278460 209044182 209044183

Note that the non-equi join requires the data to be ordered.
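To see what the join condition selects, the same result can be written in base R as a self-merge on group followed by a filter on id. This is an illustration of the semantics only, not how data.table executes the join: merge() first materializes every id-by-id pair within a group and discards most of them afterwards, so it would not scale to the data sizes in the question.

```r
# Sample data from the question
df <- data.frame(
  id = c(209044052, 209044061, 209044061, 209044061, 209044062,
         209044062, 209044062, 209044182, 209044183, 209044295),
  group = c(2365686, 387969, 388978, 2365686, 387969,
            388978, 2365686, 2278460, 2278460, 654238)
)

# Distinct (group, id) rows, as mdf holds in the data.table version
mdf <- unique(df[, c("group", "id")])

# Self-join on group: every id paired with every id of the same group
joined <- merge(mdf, mdf, by = "group", suffixes = c(".x", ".y"))

# Keep each unordered pair once and drop self-pairs
pairs <- joined[joined$id.x < joined$id.y, ]
print(pairs)
```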

Benchmark

The second approach appears to be much faster, roughly 75 times in the single run reported below.

# create benchmark data
nr <- 1.2e5L # number of rows
rg <- 8L # number of ids within each group
ng <- nr / rg # number of groups
set.seed(1L)
df2 <- data.table(
  id = sample.int(rg, nr, TRUE),
  group = sample.int(ng, nr, TRUE)
)

#benchmark code
microbenchmark::microbenchmark(
  combn2 = df2[order(group, id), data.table((combinat::combn2(unique(id)))), by = group],
  nej = {
    mdf <- df2[order(group, id), unique(id), by = group]
    mdf[mdf, on = .(group, V1 < V1), .(group, x.V1, i.V1), nomatch = 0L,
        allow.cartesian = TRUE]},
  times = 1L)

For 120000 rows and 14994 groups the timings are:

Unit: milliseconds
   expr        min         lq       mean     median         uq        max neval
 combn2 10259.1115 10259.1115 10259.1115 10259.1115 10259.1115 10259.1115     1
    nej   137.3228   137.3228   137.3228   137.3228   137.3228   137.3228     1

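Caveat

As pointed out by the OP, the number of ids per group is crucial in terms of memory consumption and speed. The number of combinations is O(n²), exactly n * (n-1) / 2 or choose(n, 2L), if n is the number of ids.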
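As a worked example of this quadratic growth, take the group size of 200000 ids that the OP reports in the discussion. The memory estimate assumes two id columns stored as 8-byte doubles and ignores the group column and any overhead, so it is a lower bound.

```r
n <- 200000                      # ids in the largest group, as reported by the OP
n_pairs <- choose(n, 2)          # n * (n - 1) / 2 = 19,999,900,000 pairs

# Far more rows than a 32-bit integer can count
exceeds_int <- n_pairs > .Machine$integer.max

# Assumed storage: two double columns, 8 bytes per value
gigabytes <- n_pairs * 2 * 8 / 1e9   # roughly 320 GB

print(format(n_pairs, big.mark = ",", scientific = FALSE))
print(exceeds_int)
print(gigabytes)
```

A single group of that size therefore dominates the whole result, regardless of how fast the pairing itself is.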

The size of the largest group can be found by

df2[, uniqueN(id), by = group][, max(V1)]

The total number of rows in the final result can be computed in advance by

df2[, uniqueN(id), by = group][, sum(choose(V1, 2L))]
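If a few groups turn out to be too large, one option is to set them aside before building the pairs, which is what the OP ended up doing. The sketch below uses the question's sample data; the threshold max_ids and the names sizes, keep and df_small are hypothetical and chosen only for illustration.

```r
library(data.table)

# Sample data from the question
df <- data.table(
  id = c(209044052, 209044061, 209044061, 209044061, 209044062,
         209044062, 209044062, 209044182, 209044183, 209044295),
  group = c(2365686, 387969, 388978, 2365686, 387969,
            388978, 2365686, 2278460, 2278460, 654238)
)

max_ids <- 2L  # hypothetical cut-off; pick one that fits the available memory

# Number of distinct ids per group
sizes <- df[, .(n_ids = uniqueN(id)), by = group]

# Groups small enough to be paired, and the rows that belong to them
keep <- sizes[n_ids <= max_ids, group]
df_small <- df[group %in% keep]

print(sizes)
print(df_small)
```

The groups that were left out can then be inspected or processed separately.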

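[Discussion]:

I ran into "negative length vectors are not allowed". Is this due to memory problems? @Uwe
Have you tried the non-equi join version? With artificial benchmark data of 12 M rows and 1.5 M groups it took 14 seconds without memory problems. What is the maximum number of ids per group, e.g., df2[, uniqueN(id), by = group][, max(V1)]?
Wow, it works now. The hint to check the ids per group helped. The largest group (200000 ids per group) was too much, I assume; once I took it out, your code ran through. Maybe this could be an addition, as future OPs might run into this large-group problem.
I have added a caveat to check the size of the largest group and to compute the number of rows in the final result in advance. Thank you for the suggestion.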
