Efficiently find unique groups of subsets (e.g. unique shopping baskets): answers

【Question title】: Efficiently find unique groups of subsets (e.g. unique shopping baskets)
【Posted】: 2019-08-23 00:44:42
【Question】:

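I have a data frame in which one column is the index of a shopping basket. For each basket, a second column identifies the items in that basket. What is the most efficient way to find the unique baskets in the data set?

Here is an example using dplyr: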

library(dplyr)

outer_num <- 10000
tmp_df <-
    data.frame(basket_index = rep(1:(8*outer_num), each = 2),
               items_purchased = rep(rep(c(1, 1, 2, 2, 1, 1, 3, 3), 2), outer_num))

items_purchased_df <-
    data.frame(items_purchased = 1:3, 
               item_name = c("shampoo", "soap", "conditioner"))

tmp_df_2 <-
    tmp_df %>%
    inner_join(items_purchased_df) %>%
    select(basket_index, items_purchased = item_name) 

head(tmp_df_2, 16)
#    basket_index items_purchased
# 1             1         shampoo
# 2             1         shampoo
# 3             2            soap
# 4             2            soap
# 5             3         shampoo
# 6             3         shampoo
# 7             4     conditioner
# 8             4     conditioner
# 9             5         shampoo
# 10            5         shampoo
# 11            6            soap
# 12            6            soap
# 13            7         shampoo
# 14            7         shampoo
# 15            8     conditioner
# 16            8     conditioner

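In this example there are only three unique baskets, each containing two items. In general, baskets may contain different numbers of items, may or may not contain duplicate items, and in some cases the order in which items appear in a basket matters.

The following function produces acceptable output: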

tmp_fn <- function(tmp_df) {
    tmp_df %>%
        group_by(basket_index) %>%
        mutate(collapsed_purchases = paste0(items_purchased, collapse = ',')) %>%
        group_by(collapsed_purchases) %>%
        filter(basket_index == min(basket_index)) %>%
        ungroup
}

For example:

tmp_fn(tmp_df_2)
#   basket_index items_purchased collapsed_purchases    
#           <int> <fct>           <chr>                  
# 1            1 shampoo         shampoo,shampoo        
# 2            1 shampoo         shampoo,shampoo        
# 3            2 soap            soap,soap              
# 4            2 soap            soap,soap              
# 5            4 conditioner     conditioner,conditioner
# 6            4 conditioner     conditioner,conditioner

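This is not very time-efficient. Converting the item factor to integers (and assuming that conversion is instantaneous!) speeds it up by nearly two orders of magnitude, but it still takes about half a second on average, even on this small data set: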

tmp_df_3 <-
    tmp_df_2 %>%
    mutate(items_purchased_old = items_purchased,
           items_purchased = as.integer(factor(items_purchased)))

microbenchmark::microbenchmark(tmp_fn(tmp_df_2), times = 10)
# Unit: seconds
#            expr     min       lq     mean   median       uq      max neval
# tmp_fn(tmp_df_2) 20.6301 20.93541 21.98261 22.24193 22.43473 23.77921    10

microbenchmark::microbenchmark(tmp_fn(tmp_df_3), times = 10)
# Unit: milliseconds
#       expr      min       lq     mean   median       uq      max neval
# tmp_fn(tmp_df_3) 348.3901 358.0814 507.7983 363.7639 387.2384 1566.903    10

【Comments】:

Tags: r dplyr

【Solution 1】:

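Update: my results were obtained with stringsAsFactors = F. Without it, there is no significant performance gain over the OP's tmp_fn() function.

As far as I know, group_by + mutate and group_by + filter are slow. Here is a way to avoid them: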

# for outer_num <- 10000
system.time(
  res <- tmp_df_2 %>%
    group_by(basket_index) %>%
    summarize(collapsed_purchases = paste0(items_purchased, collapse = ',')) %>%
    filter(!duplicated(collapsed_purchases)) 
    # summarize drops one (in this case, the only) grouping level
    # so filter is on ungrouped data which is good; also duplicated() is fast enough
)

# user  system elapsed 
# 4.35    0.00    4.41 

res
# A tibble: 3 x 2
#   basket_index collapsed_purchases    
#          <int> <chr>                  
# 1            1 shampoo,shampoo        
# 2            2 soap,soap              
# 3            4 conditioner,conditioner

# get desired result
tmp_df_2 %>% 
  inner_join(res, by = "basket_index")

#   basket_index items_purchased     collapsed_purchases
# 1            1         shampoo         shampoo,shampoo
# 2            1         shampoo         shampoo,shampoo
# 3            2            soap               soap,soap
# 4            2            soap               soap,soap
# 5            4     conditioner conditioner,conditioner
# 6            4     conditioner conditioner,conditioner

Note: using data.table may be faster still.
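
The note above mentions data.table without showing it. A minimal sketch of the same summarize, de-duplicate, join-back idea might look like the following. It assumes the data.table package is installed; `baskets` is a small hypothetical stand-in for tmp_df_2, and no timing claim is made here.

```r
library(data.table)  # assumption: installed; not used in the original answer

# Hypothetical stand-in for tmp_df_2, with the same column names
baskets <- data.table(
  basket_index    = rep(1:6, each = 2),
  items_purchased = rep(c("shampoo", "soap", "shampoo",
                          "conditioner", "soap", "shampoo"), each = 2)
)

# One collapsed key per basket, built in a single grouped pass
keys <- baskets[, .(collapsed_purchases = paste(items_purchased, collapse = ",")),
                by = basket_index]

# Keep the first basket seen for each distinct key
unique_keys <- keys[!duplicated(collapsed_purchases)]

# Join back to recover the item rows of the retained baskets
res_dt <- baskets[unique_keys, on = "basket_index"]
```

As in the dplyr version, the collapsed key is order-sensitive, so baskets holding the same items in a different order count as different unless the items are sorted first.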

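【Comments】:

I found that this runs twice as fast when applied to tmp_df_3 (i.e. the data frame in which the items are recoded as integers).
@Alex Good to know. I'll try benchmarking again tomorrow.

【Solution 2】: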

If the unique combinations of items_purchased are all you need, unique(list_data) is very fast.

tmp_df_2 %>%
  with(split(x = items_purchased, f = basket_index)) %>% 
  unique()

## output
# [[1]]
# [1] shampoo shampoo
# Levels: conditioner shampoo soap
#
# [[2]]
# [1] soap soap
# Levels: conditioner shampoo soap
#
# [[3]]
# [1] conditioner conditioner
# Levels: conditioner shampoo soap



f <- function() tmp_df_2 %>%
  with(split(x = items_purchased, f = basket_index)) %>% 
  unique()

microbenchmark::microbenchmark(tmp_fn(tmp_df_2), f(), times = 5)

# Unit: milliseconds  ## ! f() took 1 second or less !
# expr                    min         lq       mean     median         uq        max neval cld
# tmp_fn(tmp_df_2) 22902.3614 24637.1447 24657.7256 24928.6063 25280.1145 25540.4009     5   b
# f()                657.4491   672.0378   674.6513   673.4228   676.9276   693.4191     5  a

[Edited]
With real data, the items need to be sorted before calling unique().

test_d <- data.frame(basket_index = c(rep(1, 2), rep(2, 2), rep(3, 3), rep(4, 3), rep(5, 3), rep(6, 2)),
                     items_purchased = letters[c(1, 2, 2, 1, 1, 2, 3, 1, 2, 3,  2, 3, 1, 3, 4)])

tmp_fn(test_d) %>% distinct(collapsed_purchases)
#  collapsed_purchases  # Oops!
# 1 a,b                
# 2 b,a                
# 3 a,b,c              
# 4 b,c,a              
# 5 c,d    

test_d %>% 
  arrange(items_purchased) %>% 
  with(split(x = items_purchased, f = basket_index)) %>% 
  unique()

# [[1]]
# [1] a b
# Levels: a b c d
# 
# [[2]]
# [1] a b c
# Levels: a b c d
# 
# [[3]]
# [1] c d
# Levels: a b c d
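
The sort-before-split step can also be wrapped in a small base R helper, so that order-sensitive and order-insensitive uniqueness are both available from one place. This is only a sketch: `unique_baskets` and its `ordered` argument are hypothetical names, not part of the original answer.

```r
# Hypothetical helper: returns one representative basket per distinct content.
# ordered = TRUE treats (a, b) and (b, a) as different baskets;
# ordered = FALSE sorts each basket first, so they count as the same.
unique_baskets <- function(items, basket, ordered = FALSE) {
  items  <- as.character(items)      # compare values, not factor codes
  groups <- split(items, basket)     # one character vector per basket
  if (!ordered) {
    groups <- lapply(groups, sort)   # note: sort() drops NA items by default
  }
  groups[!duplicated(groups)]        # names keep the first basket_index seen
}

test_d <- data.frame(
  basket_index    = c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6),
  items_purchased = letters[c(1, 2, 2, 1, 1, 2, 3, 1, 2, 3, 2, 3, 1, 3, 4)]
)

unordered <- with(test_d, unique_baskets(items_purchased, basket_index))
in_order  <- with(test_d, unique_baskets(items_purchased, basket_index, ordered = TRUE))
```

On test_d, `unordered` holds three baskets and `in_order` holds five, which matches the two results shown above. Duplicate items within a basket are preserved in both modes, so the comparison is on multisets rather than sets.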

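【Comments】:

Nice, this is on par with calling my original function on the items mapped to factors.
And applying it to tmp_df_3 gives another huge improvement.
Does this method take the order of the items into account?
It should, right? Because split produces a list of vectors? I assume the vectors keep their order?
@Alex I agree with Shree. In real-world cases you need to sort before splitting (the same is true of the group_by %>% mutate(paste) approach). I'll edit my answer to cover real-world scenarios.

【Solution 3】:

You could try base R, using paste() inside aggregate() and then filtering out the duplicated rows. In aggregate I prefer the 'data.frame' method over the 'formula' method, because it yields the "collapsed_purchases" column name right away (see ?aggregate).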

FUN <- function(dat) {
  res <- with(dat, aggregate(list(collapsed_purchases=items_purchased), 
                             by=list(basket_index=basket_index), paste, collapse=","))
  res <- res[!duplicated(res[2]), ]
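  # Note: this merges against the global tmp_df_2, not `dat`, which is why
  # FUN(tmp_df_3) below still shows item names next to integer-coded keys.
  return(merge(tmp_df_2, res, all.y=TRUE))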
}

Result

> system.time(res2 <- FUN(tmp_df_2))
   user  system elapsed 
   1.73    0.01    1.76 
> res2
  basket_index items_purchased     collapsed_purchases
1            1         shampoo         shampoo,shampoo
2            1         shampoo         shampoo,shampoo
3            2            soap               soap,soap
4            2            soap               soap,soap
5            4     conditioner conditioner,conditioner
6            4     conditioner conditioner,conditioner
>
> system.time(res3 <- FUN(tmp_df_3))  # numerized version
   user  system elapsed 
   0.77    0.02    0.78 
> res3
  basket_index items_purchased collapsed_purchases
1            1         shampoo                 2,2
2            1         shampoo                 2,2
3            2            soap                 3,3
4            2            soap                 3,3
5            4     conditioner                 1,1
6            4     conditioner                 1,1
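
Because FUN() merges against the global tmp_df_2, its output depends on an object outside the function. A self-contained variant that joins back to its own argument could be sketched as follows; `FUN2` and `toy` are hypothetical names used only for illustration, and no timings are claimed.

```r
# Hypothetical variant of FUN() that depends only on its argument
FUN2 <- function(dat) {
  # Collapse each basket into one comma-separated key
  res <- aggregate(list(collapsed_purchases = dat$items_purchased),
                   by = list(basket_index = dat$basket_index),
                   FUN = paste, collapse = ",")
  # Keep the first basket for each distinct key
  res <- res[!duplicated(res$collapsed_purchases), ]
  # Join back to the input itself rather than to a global object
  merge(dat, res, by = "basket_index")
}

# Small made-up data set with the same column names as tmp_df_2
toy <- data.frame(
  basket_index    = rep(1:4, each = 2),
  items_purchased = rep(c("shampoo", "soap", "shampoo", "conditioner"), each = 2)
)

res_toy <- FUN2(toy)
```

Applied to an integer-coded data frame such as tmp_df_3, this variant would return the integer codes in items_purchased (along with any extra columns of the input) rather than the item names shown in res3 above.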

【Comments】: