Efficient tidy R technique for processing aligned data frame groups (answers)

【Question title】: Efficient tidy R technique for processing aligned data frame groups
【Posted】: 2017-09-29 14:52:10
【Question】:

I'm trying to find an efficient (and ideally tidy) way to process a pair of grouped data_frames. The setup looks more or less like this:

A = crossing(idx=1:1e5, asdf=seq(1:rpois(1,50)))
B = tibble(idx=sample(1:1e5, replace=TRUE), yet_more_stuff='whatever')
proc_one_group <- function(one_A, one_b) { ... }
# example:
proc_one_group(filter(A, idx==50), filter(B, idx==50))

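So my processing operation is fairly complicated and works on one idx at a time, drawing from two separate data frames. One of them has one or more rows per idx (typically a few dozen), while the other can have zero, one, or many rows per idx.

I know I can do it like this, but it's very slow, because the filter operation for each value requires a full table scan and subset.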

map_df(unique(A$idx), ~ proc_one_group(filter(A, idx==.), filter(B, idx==.)))

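I also know I can use split to create a list of sub-frames of a data_frame relatively efficiently, but I don't know a good way to do an O(1) lookup by index across the two data_frames.

What I want is something like the first step of a left_join, which works out the index subgroups for each group. Instead of actually building a single data_frame holding the Cartesian combination for each group, it would just hand me a pair of subgroups that I can process as needed. (A full left_join doesn't help me here.)

Any ideas?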

【Comments】:

Tags: r dataframe tidyverse

【Solution 1】:

One possibility is to nest both data frames before joining:

library(tidyverse)

set.seed(1234)

A = crossing(idx = 1:1e5, asdf = seq(1:rpois(1, 50)))
B = tibble(idx = sample(1:1e5, replace = TRUE), yet_more_stuff = "whatever")

proc_one_group <- function(one_A, one_B) { ... }

nest_A <- A %>%
  group_by(idx) %>%
  nest(.key = "data_a")
nest_B <- B %>%
  group_by(idx) %>%
  nest(.key = "data_b")

all_data <- full_join(nest_A, nest_B, by = "idx")
all_data
#> # A tibble: 100,000 x 3
#>      idx            data_a           data_b
#>    <int>            <list>           <list>
#>  1     1 <tibble [41 x 1]>           <NULL>
#>  2     2 <tibble [41 x 1]> <tibble [2 x 1]>
#>  3     3 <tibble [41 x 1]> <tibble [2 x 1]>
#>  4     4 <tibble [41 x 1]> <tibble [1 x 1]>
#>  5     5 <tibble [41 x 1]>           <NULL>
#>  6     6 <tibble [41 x 1]>           <NULL>
#>  7     7 <tibble [41 x 1]> <tibble [2 x 1]>
#>  8     8 <tibble [41 x 1]>           <NULL>
#>  9     9 <tibble [41 x 1]> <tibble [1 x 1]>
#> 10    10 <tibble [41 x 1]> <tibble [1 x 1]>
#> # ... with 99,990 more rows

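This yields a data frame with one row per idx, where data_a holds that idx's rows from data frame A and data_b holds its rows from data frame B. Once that's done, there's no need to filter the large data frames for each case in the map_df call.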

# data_b is NULL for any idx with no rows in B, so proc_one_group has to handle that case
map2_df(all_data$data_a, all_data$data_b, proc_one_group)
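
As a minimal end-to-end sketch of the nested approach, here is a version that runs on small stand-in data. The `proc_one_group` body below is hypothetical (it only counts rows) and is not the asker's real function; the `rename()` calls are just one way to name the list columns without relying on `.key`.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Small stand-ins for A and B (hypothetical, far smaller than the question's data)
A <- crossing(idx = 1:4, asdf = 1:3)
B <- tibble(idx = c(2L, 2L, 4L), yet_more_stuff = "whatever")

# Hypothetical per-group function: counts the rows on each side.
# one_B is NULL when B has no rows for this idx, so guard against that.
proc_one_group <- function(one_A, one_B) {
  tibble(n_a = nrow(one_A),
         n_b = if (is.null(one_B)) 0L else nrow(one_B))
}

# Nest each table once, then line the groups up with a join on idx
nest_A <- A %>% group_by(idx) %>% nest() %>% ungroup() %>% rename(data_a = data)
nest_B <- B %>% group_by(idx) %>% nest() %>% ungroup() %>% rename(data_b = data)
all_data <- full_join(nest_A, nest_B, by = "idx")

# Walk the two list columns in parallel and row-bind the per-group results
result <- map2_df(all_data$data_a, all_data$data_b, proc_one_group)
result <- tibble(idx = all_data$idx) %>% bind_cols(result)
```

In this toy data, idx 1 and 3 have no rows in B, so their `n_b` comes out as 0 via the `is.null()` guard.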

【Comments】:

Oh, very clever. Yes, this performs the matching part of the join without actually doing the join.

【Solution 2】:

Here are some benchmark results:

A = crossing(idx=1:1e3, asdf=seq(1:rpois(1,50)))
B = tibble(idx=sample(1:1e3, replace=TRUE), yet_more_stuff='whatever')

The first idea is to use split as you suggested, keeping split.A and split.B in the same order. You can then use map2 to iterate over the matched lists:

myfun <- function(A,B) {
    split.A <- split(A, A$idx)
    splitsort.A <- split.A[order(names(split.A))]
    splitsort.B <- map(names(splitsort.A), ~B[as.character(B$idx) == .x,])
    ans <- map2(splitsort.A, splitsort.B, ~unique(.x$idx) == unique(.y$idx))
    return(ans)
}
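
A possible variation on the split idea, sketched here under assumptions: split both tables once, then fetch each A group's partner from the split of B by name, rather than subsetting B once per group. The data and the row-counting function are hypothetical stand-ins.

```r
library(purrr)
library(tibble)

# Small stand-ins for A and B (hypothetical data)
A <- tibble(idx = rep(1:4, each = 3), asdf = rep(1:3, times = 4))
B <- tibble(idx = c(2L, 2L, 4L), yet_more_stuff = "whatever")

# Split each table once; list names are the idx values as character strings
split_A <- split(A, A$idx)
split_B <- split(B, B$idx)

# Index split_B by the names of split_A in a single vectorized lookup.
# Names that are absent from split_B yield NULL elements.
aligned_B <- unname(split_B[names(split_A)])

# Hypothetical per-group work: count the B rows matched to each A group
n_b <- map2_int(split_A, aligned_B,
                function(a, b) if (is.null(b)) 0L else nrow(b))
```

Because the two lists share the same order and length, `map2` style iteration pairs each A group with its B partner, and unmatched groups arrive as `NULL`.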

This is the approach you're currently using, with dplyr::filter:

OP <- function(A,B) {
    ans <- map(unique(A$idx), ~unique(filter(A, idx==.x)$idx) == unique(filter(B, idx==.x)$idx))
    return(ans)
}

This is the same logic, but it avoids dplyr::filter, which is slow compared to base R subsetting:

OP2 <- function(A,B) {
    ans <- map(unique(A$idx), ~unique(A[A$idx==.x,]$idx) == unique(B[B$idx==.x,]$idx))
    return(ans)
}

This uses @JakeThompson's approach (which appears to be the winner among the current approaches):

JT <- function(A,B) {
    nest.A <- A %>% group_by(idx) %>% nest()
    nest.B <- B %>% group_by(idx) %>% nest()
    ans <- full_join(nest.A, nest.B, by="idx")
}
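
For comparison, a similar pairing can probably be had from `dplyr::nest_join`, which attaches the matching rows of the second table as a list column instead of expanding them. This is a sketch only, not part of the benchmark above; the function name `NJ` is hypothetical, and it assumes dplyr 0.8.0 or later and tidyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

# Small stand-ins for A and B (hypothetical data)
A <- crossing(idx = 1:4, asdf = 1:3)
B <- tibble(idx = c(2L, 2L, 4L), yet_more_stuff = "whatever")

# Hypothetical helper: nest A by idx, then attach B's matching rows per idx
NJ <- function(A, B) {
  A %>%
    nest(data_a = -idx) %>%
    nest_join(B, by = "idx", name = "data_b")
}

all_data <- NJ(A, B)

# Rows of B matched to each idx; unmatched idx values get a zero-row tibble
n_b <- vapply(all_data$data_b, nrow, integer(1))
```

One difference from the full_join version: groups with no match in B hold a zero-row tibble rather than `NULL`, so the per-group function does not need an `is.null()` check.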

Some validation to make sure the results of some of the functions make sense:

identical(OP(A,B), OP2(A,B))
# TRUE

E <- myfun(A,B)
any(E==FALSE)
# NA

F <- myfun(A,B)
any(F==FALSE)
# NA

identical(sum(E==TRUE, na.rm=TRUE), sum(F==TRUE, na.rm=TRUE))
# TRUE

Benchmark results:

library(microbenchmark)
microbenchmark(myfun(A,B), OP(A,B), OP2(A,B), JT(A,B), times=2L)
# Unit: seconds
#         expr       min        lq      mean    median        uq       max neval
#  myfun(A, B)  3.164046  3.164046  3.254588  3.254588  3.345129  3.345129     2
#     OP(A, B) 14.926431 14.926431 15.053662 15.053662 15.180893 15.180893     2
#    OP2(A, B)  3.202414  3.202414  3.728423  3.728423  4.254432  4.254432     2
#     JT(A, B)  1.330278  1.330278  1.378241  1.378241  1.426203  1.426203     2

【Comments】: