【Question title】: Efficiently find unique groups of subsets (e.g. unique shopping baskets)
【Posted】: 2019-08-23 00:44:42
【Question description】:

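I have a data frame in which one column is the index of a shopping basket and another column identifies the items in that basket. What is the most efficient way to find the unique baskets in the data set?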

Here is an example using dplyr:

library(dplyr)

outer_num <- 10000
tmp_df <-
    data.frame(basket_index = rep(1:(8*outer_num), each = 2),
               items_purchased = rep(rep(c(1, 1, 2, 2, 1, 1, 3, 3), 2), outer_num))

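# Note: before R 4.0.0, data.frame() defaulted to stringsAsFactors = TRUE,
# which is why item_name appears as a factor (<fct>) in the output further down.
items_purchased_df <-
    data.frame(items_purchased = 1:3,
               item_name = c("shampoo", "soap", "conditioner"))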

tmp_df_2 <-
    tmp_df %>%
    inner_join(items_purchased_df) %>%
    select(basket_index, items_purchased = item_name) 

head(tmp_df_2, 16)
#    basket_index items_purchased
# 1             1         shampoo
# 2             1         shampoo
# 3             2            soap
# 4             2            soap
# 5             3         shampoo
# 6             3         shampoo
# 7             4     conditioner
# 8             4     conditioner
# 9             5         shampoo
# 10            5         shampoo
# 11            6            soap
# 12            6            soap
# 13            7         shampoo
# 14            7         shampoo
# 15            8     conditioner
# 16            8     conditioner

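In this example there are only three unique baskets, each containing two items. In general, baskets may differ in size, may or may not contain repeated items, and in some cases the order in which items appear in a basket matters.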
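Whether order matters, and whether repeated items count, comes down to how each basket is turned into a comparison key. The following is a minimal base R sketch on a made-up toy data frame; `baskets`, `collapse_key` and the `key_*` names are illustrative and not part of the original post.

```r
# Toy data (hypothetical): baskets 1 and 2 hold the same items in a
# different order; basket 3 repeats item "a".
baskets <- data.frame(
  basket_index    = c(1, 1, 2, 2, 3, 3, 3),
  items_purchased = c("a", "b", "b", "a", "a", "b", "a"),
  stringsAsFactors = FALSE
)

# One character vector of items per basket
items_by_basket <- split(baskets$items_purchased, baskets$basket_index)

collapse_key <- function(x) paste(x, collapse = ",")

# Order matters, repeats count: paste items as they appear
key_ordered <- vapply(items_by_basket, collapse_key, character(1))

# Order ignored, repeats count: sort within each basket first
key_sorted <- vapply(items_by_basket,
                     function(x) collapse_key(sort(x)), character(1))

# Order ignored, repeats ignored: sort the distinct items
key_set <- vapply(items_by_basket,
                  function(x) collapse_key(sort(unique(x))), character(1))

# Number of distinct baskets under each definition
length(unique(key_ordered))
length(unique(key_sorted))
length(unique(key_set))
```

Which key is appropriate depends on what "the same basket" means for the data at hand; the approaches discussed in this thread differ mainly in how that key is computed.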

The following function produces acceptable output:

tmp_fn <- function(tmp_df) {
    tmp_df %>%
        group_by(basket_index) %>%
        mutate(collapsed_purchases = paste0(items_purchased, collapse = ',')) %>%
        group_by(collapsed_purchases) %>%
        filter(basket_index == min(basket_index)) %>%
        ungroup
}

so that

tmp_fn(tmp_df_2)
#   basket_index items_purchased collapsed_purchases    
#           <int> <fct>           <chr>                  
# 1            1 shampoo         shampoo,shampoo        
# 2            1 shampoo         shampoo,shampoo        
# 3            2 soap            soap,soap              
# 4            2 soap            soap,soap              
# 5            4 conditioner     conditioner,conditioner
# 6            4 conditioner     conditioner,conditioner

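This is not very time-efficient. Converting the item factor to integers (and pretending that the conversion is instantaneous!) makes it nearly two orders of magnitude faster, but it still takes about half a second even on this small data set: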

tmp_df_3 <-
    tmp_df_2 %>%
    mutate(items_purchased_old = items_purchased,
           items_purchased = as.integer(factor(items_purchased)))

microbenchmark::microbenchmark(tmp_fn(tmp_df_2), times = 10)
# Unit: seconds
#            expr     min       lq     mean   median       uq      max neval
# tmp_fn(tmp_df_2) 20.6301 20.93541 21.98261 22.24193 22.43473 23.77921    10

microbenchmark::microbenchmark(tmp_fn(tmp_df_3), times = 10)
# Unit: milliseconds
#       expr      min       lq     mean   median       uq      max neval
# tmp_fn(tmp_df_3) 348.3901 358.0814 507.7983 363.7639 387.2384 1566.903    10

【Question discussion】:

    Tags: r dplyr


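    【Solution 1】:

    Update: my results were obtained with stringsAsFactors = FALSE. Without it, there is no significant performance gain over the OP's tmp_fn() function.


    As far as I know, group_by + mutate and group_by + filter are slow. Here is a way to avoid them: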

    # for outer_num <- 10000
    system.time(
      res <- tmp_df_2 %>%
        group_by(basket_index) %>%
        summarize(collapsed_purchases = paste0(items_purchased, collapse = ',')) %>%
        filter(!duplicated(collapsed_purchases)) 
        # summarize drops one (in this case, the only) grouping level
        # so filter is on ungrouped data which is good; also duplicated() is fast enough
    )
    
    # user  system elapsed 
    # 4.35    0.00    4.41 
    
    res
    # A tibble: 3 x 2
    #   basket_index collapsed_purchases    
    #          <int> <chr>                  
    # 1            1 shampoo,shampoo        
    # 2            2 soap,soap              
    # 3            4 conditioner,conditioner
    
    # get desired result
    tmp_df_2 %>% 
      inner_join(res, by = "basket_index")
    
    #   basket_index items_purchased     collapsed_purchases
    # 1            1         shampoo         shampoo,shampoo
    # 2            1         shampoo         shampoo,shampoo
    # 3            2            soap               soap,soap
    # 4            2            soap               soap,soap
    # 5            4     conditioner conditioner,conditioner
    # 6            4     conditioner conditioner,conditioner
    

    Note: using data.table may be faster still.
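The data.table route mentioned in the note is not shown in the answer. A rough sketch of how the same summarize-then-deduplicate idea might look there, on a small made-up table (`baskets`, `keys`, `first_baskets` and `result` are illustrative names, and no timing is claimed):

```r
library(data.table)

# Toy data (hypothetical): basket 3 repeats the contents of basket 1
baskets <- data.table(
  basket_index    = c(1, 1, 2, 2, 3, 3),
  items_purchased = c("shampoo", "shampoo", "soap", "soap",
                      "shampoo", "shampoo")
)

# One row per basket, items collapsed into a single key
keys <- baskets[, .(collapsed_purchases = paste(items_purchased, collapse = ",")),
                by = basket_index]

# Keep the first basket for each distinct key
first_baskets <- keys[!duplicated(collapsed_purchases)]

# Join back to recover the item rows of those baskets
result <- baskets[first_baskets, on = "basket_index"]
result
```

This mirrors the dplyr pipeline above step by step; whether it is actually faster on the original data would need to be benchmarked.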

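    【Discussion】:

    • I found that it runs about twice as fast when applied to tmp_df_3 (the data frame in which the items are recoded as integers).
    • @Alex Good to know. I will try benchmarking again tomorrow.
    【Solution 2】:

    If you are satisfied with just the unique combinations of items_purchased, unique(list_data) is very fast.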

    tmp_df_2 %>%
      with(split(x = items_purchased, f = basket_index)) %>% 
      unique()
    
    ## output
    # [[1]]
    # [1] shampoo shampoo
    # Levels: conditioner shampoo soap
    #
    # [[2]]
    # [1] soap soap
    # Levels: conditioner shampoo soap
    #
    # [[3]]
    # [1] conditioner conditioner
    # Levels: conditioner shampoo soap
    
    
    
    f <- function() tmp_df_2 %>%
      with(split(x = items_purchased, f = basket_index)) %>% 
      unique()
    
    microbenchmark::microbenchmark(tmp_fn(tmp_df_2), f(), times = 5)
    
    # Unit: milliseconds  ## ! f() took 1 second or less !
    # expr                    min         lq       mean     median         uq        max neval cld
    # tmp_fn(tmp_df_2) 22902.3614 24637.1447 24657.7256 24928.6063 25280.1145 25540.4009     5   b
    # f()                657.4491   672.0378   674.6513   673.4228   676.9276   693.4191     5  a 
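One thing unique() on the split list does not return is which basket each combination came from. A sketch of one way to recover that is to call duplicated() on the same list. It assumes character items and a made-up toy data frame; `baskets`, `basket_list` and `first_index` are illustrative names.

```r
# Toy data (hypothetical): basket 3 repeats the contents of basket 1
baskets <- data.frame(
  basket_index    = c(1, 1, 2, 3, 3),
  items_purchased = c("a", "b", "c", "a", "b"),
  stringsAsFactors = FALSE
)

# Named list: one character vector of items per basket
basket_list <- split(baskets$items_purchased, baskets$basket_index)

# duplicated() works element-wise on a list, so the names of the
# non-duplicated elements are the first basket of each combination
first_index <- names(basket_list)[!duplicated(basket_list)]
first_index

# Item rows belonging to those baskets
baskets[baskets$basket_index %in% first_index, ]
```

Like unique() on the list, this treats baskets as equal only when their item vectors match element by element, so order within a basket still matters.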
    

    [Edited]
    With real data, you need to sort the data before calling unique().

    test_d <- data.frame(basket_index = c(rep(1, 2), rep(2, 2), rep(3, 3), rep(4, 3), rep(5, 3), rep(6, 2)),
                         items_purchased = letters[c(1, 2, 2, 1, 1, 2, 3, 1, 2, 3,  2, 3, 1, 3, 4)])
    
    tmp_fn(test_d) %>% distinct(collapsed_purchases)
    #  collapsed_purchases  # Oops!
    # 1 a,b                
    # 2 b,a                
    # 3 a,b,c              
    # 4 b,c,a              
    # 5 c,d    
    
    test_d %>% 
      arrange(items_purchased) %>% 
      with(split(x = items_purchased, f = basket_index)) %>% 
      unique()
    
    # [[1]]
    # [1] a b
    # Levels: a b c d
    # 
    # [[2]]
    # [1] a b c
    # Levels: a b c d
    # 
    # [[3]]
    # [1] c d
    # Levels: a b c d
    

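    【Discussion】:

    • Nice, this is equivalent to calling my original function on the items mapped to factors.
    • And applying it to tmp_df_3 brings another huge improvement.
    • Does this method handle the order of the items?
    • It should, right? Because split produces a list of vectors? I assume the vectors are ordered?
    • @Alex I agree with Shree. In a real case you need to sort before splitting (the same is true of the group_by %>% mutate(paste) approach). I will edit my answer so that it works in a real scenario.
    【Solution 3】:

    You can try base R by using paste() inside aggregate() and then filtering out the duplicated rows. For aggregate I prefer the 'data.frame' method over the 'formula' method, so that the "collapsed_purchases" column name is set immediately (see ?aggregate).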

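    FUN <- function(dat) {
      # collapse each basket's items into one comma-separated key
      res <- with(dat, aggregate(list(collapsed_purchases=items_purchased), 
                                 by=list(basket_index=basket_index), paste, collapse=","))
      # keep the first basket for each distinct key
      res <- res[!duplicated(res[2]), ]
      # note: this merges against the global tmp_df_2 rather than dat,
      # which is why res3 below still shows the item names
      return(merge(tmp_df_2, res, all.y=TRUE))
    }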
    

    Result

    > system.time(res2 <- FUN(tmp_df_2))
       user  system elapsed 
       1.73    0.01    1.76 
    > res2
      basket_index items_purchased     collapsed_purchases
    1            1         shampoo         shampoo,shampoo
    2            1         shampoo         shampoo,shampoo
    3            2            soap               soap,soap
    4            2            soap               soap,soap
    5            4     conditioner conditioner,conditioner
    6            4     conditioner conditioner,conditioner
    >
    > system.time(res3 <- FUN(tmp_df_3))  # numerized version
       user  system elapsed 
       0.77    0.02    0.78 
    > res3
      basket_index items_purchased collapsed_purchases
    1            1         shampoo                 2,2
    2            1         shampoo                 2,2
    3            2            soap                 3,3
    4            2            soap                 3,3
    5            4     conditioner                 1,1
    6            4     conditioner                 1,1
    

    【Discussion】:
