【Title】: Different aggregation rules with data.table in R
【Posted】: 2020-03-29 19:39:54
【Question】:

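I have a large data frame that I want to aggregate by two different ids. Different columns follow different aggregation rules, and I would like to write compact code for the aggregation (the final data set also contains many useless variables that I do not need). I made a toy example that aggregates my data with dplyr::group_by: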
library(dplyr)

n <- 10
df <- data.frame(id1 = sample(c("a","b"),n,T),id2 = sample(c("c","d"),n,T), # variables with IDs
                 var_sum1 = rnorm(n,0,1),var_sum2 = rnorm(n,5,1),           # variables to sum
                 var_mean1 = rnorm(n,10,1), var_mean2 = rnorm(n,15,1),      # variables to average
                 var_weighted_mean = rnorm(n,0,1),                          # vars to weight average
                 weight = sample(c(1,2),n,T),                               # weight
                 var_useless_1 = 1,var_useless_n = 1)                       # useless variables to throw away


final_dplyr <- df %>%
  group_by(id1, id2) %>%
  summarise(var_sum1 = sum(var_sum1),
            var_sum2 = sum(var_sum2),
            var_mean1 = mean(var_mean1),
            var_mean2 = mean(var_mean2),
            var_weighted_mean = weighted.mean(var_weighted_mean,weight))

Now I would like to define, in vectors, the variables that follow each rule:

ids <- c("id1","id2")
summing = c("var_sum1","var_sum2")
averaging = c("var_mean1","var_mean2")
wght_avergage = c("var_weighted_mean")

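Each vector will contain around 20 variable names, so aggregating "by hand" as I did in the dplyr toy example would be cumbersome.

Can I do this with the data.table package? Other solutions are welcome too, but since I am currently learning this package, I would really appreciate a data.table solution.

I thought of something like this (but since I am new to data.table, it may be completely wrong):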

library(data.table)

dt <- as.data.table(df)

# line not working
dt[ , .(summing, averaging, wght_average) := list(lapply(.SD[,.(summing)],sum),
                                               lapply(.SD[,.(averaging)],mean),
                                               lapply(.SD[,.(wght_average)],function(x)weighted.mean(x,weight))), 
    by = .(ids), 
    .SDcols = .(summing, averaging, wght_average)]

Thanks for your help!

【Comments】:

    Tags: r data.table aggregate


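    【Solution 1】:

    You can use that general syntax with a few changes: (1) you are creating a new data frame (the columns no longer have length nrow(df)), so you do not need := or the part before it; (2) you can use mget to get a list of columns from a character vector and pass it to lapply; (3) use c to concatenate the lists into one flat list, rather than list, which would create sub-lists.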

    ids <- c("id1","id2")
    summing = c("var_sum1","var_sum2")
    averaging = c("var_mean1","var_mean2")
    wght_average = c("var_weighted_mean")
    
    
    library(data.table)
    setDT(df)  # this syntax needs a data.table; setDT converts df by reference
    
    df[ ,  c(lapply(mget(summing), sum), 
             lapply(mget(averaging), mean), 
             lapply(mget(wght_average), weighted.mean, weight)), 
        by = c(ids)]
    
    #    id1 id2   var_sum1  var_sum2 var_mean1 var_mean2 var_weighted_mean
    # 1:   a   c -0.4091754 19.469144 10.181026  15.29206        0.06766247
    # 2:   a   d -0.9797636  4.884255  8.856079  15.36002        1.43762082
    # 3:   b   c -3.0569705 15.284160 10.021045  14.94577       -0.72186913
    # 4:   b   d -0.4616429 10.076022  8.442672  15.09100        0.13813689
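
To make the role of c() in point (3) concrete, here is a minimal self-contained sketch that compares this approach with the same aggregation written out by hand. The seed, the reduced set of columns, and the names `res` and `manual` are assumptions added for illustration; they are not part of the original question.

```r
library(data.table)

set.seed(42)  # assumed seed, only to make the sketch reproducible
n <- 10
dt <- data.table(
  id1 = sample(c("a", "b"), n, TRUE),
  id2 = sample(c("c", "d"), n, TRUE),
  var_sum1 = rnorm(n),
  var_mean1 = rnorm(n, 10),
  var_weighted_mean = rnorm(n),
  weight = sample(c(1, 2), n, TRUE)
)

ids <- c("id1", "id2")
summing <- "var_sum1"
averaging <- "var_mean1"
wght_average <- "var_weighted_mean"

# c() flattens the three named lists into one flat list,
# so every element becomes one column of the result
res <- dt[, c(lapply(mget(summing), sum),
              lapply(mget(averaging), mean),
              lapply(mget(wght_average), weighted.mean, w = weight)),
          by = ids]

# the same aggregation written out by hand, for comparison
manual <- dt[, .(var_sum1 = sum(var_sum1),
                 var_mean1 = mean(var_mean1),
                 var_weighted_mean = weighted.mean(var_weighted_mean, weight)),
             by = ids]

all.equal(res, manual)
```

With list() in place of c(), j would return three nested lists rather than one flat list of columns, which is why the answer recommends c().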
    

    A possible tidyverse solution is to store the rules in a tibble:

    library(tidyverse)
    
    ids = c("id1","id2")
    do_over <- 
      list(
        summing = c("var_sum1","var_sum2"),
        averaging = c("var_mean1","var_mean2"),
        wght_average = c("var_weighted_mean"))
    do_what <- 
      list(
        summing = sum,
        averaging = mean,
        wght_average = ~weighted.mean(., weight))
    
    todo <- tibble(do_over, do_what)
    
    todo
    # # A tibble: 3 x 2
    #   do_over      do_what     
    #   <named list> <named list>
    # 1 <chr [2]>    <fn>        
    # 2 <chr [2]>    <fn>        
    # 3 <chr [1]>    <formula>   
    

    Then pmap over the tibble to get your output:

    pmap_dfc(todo, ~
               df %>% 
                group_by_at(ids) %>% 
                summarise_at(.x, .y))
    
    # # A tibble: 3 x 11
    # # Groups:   id1 [2]
    #   id1   id2   var_sum1 var_sum2 id11  id21  var_mean1 var_mean2 id12  id22  var_weighted_mean
    #   <fct> <fct>    <dbl>    <dbl> <fct> <fct>     <dbl>     <dbl> <fct> <fct>             <dbl>
    # 1 a     c        0.152     4.90 a     c          9.04      15.1 a     c                 0.294
    # 2 a     d        2.74     16.0  a     d         10.0       14.8 a     d                -0.486
    # 3 b     c       -0.112    23.6  b     c         10.2       14.5 b     c                 0.421
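
The output above repeats the id columns once per rule (id11, id21, and so on) because pmap_dfc binds the pieces column-wise. One possible variation, sketched here, builds one summary per rule and joins them on the ids instead. The seed, the reduced column set, and the names `pieces` and `result` are illustrative assumptions.

```r
library(dplyr)
library(purrr)

set.seed(42)  # assumed seed, only to make the sketch reproducible
n <- 10
df <- data.frame(
  id1 = sample(c("a", "b"), n, TRUE),
  id2 = sample(c("c", "d"), n, TRUE),
  var_sum1 = rnorm(n),
  var_mean1 = rnorm(n, 10),
  var_weighted_mean = rnorm(n),
  weight = sample(c(1, 2), n, TRUE)
)

ids <- c("id1", "id2")
do_over <- list(summing = "var_sum1",
                averaging = "var_mean1",
                wght_average = "var_weighted_mean")
do_what <- list(summing = sum,
                averaging = mean,
                wght_average = ~ weighted.mean(., weight))

# one grouped summary per rule
pieces <- map2(do_over, do_what, function(cols, fn) {
  df %>%
    group_by_at(ids) %>%
    summarise_at(cols, fn) %>%
    ungroup()
})

# join the pieces on the ids so the id columns appear only once
result <- reduce(pieces, inner_join, by = ids)
result
```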
    

    【Comments】:

      【Solution 2】:

      In dplyr, you can use the _at variants, which accept column names as strings, so you do not have to repeat the functions:

      library(dplyr)
      
      df %>%
        group_by_at(ids) %>%
        mutate_at(summing, sum) %>%
        mutate_at(averaging, mean) %>%
        mutate_at(wght_avergage, ~weighted.mean(., weight)) %>%
        slice(1L) %>%
        select(all_of(c(summing, averaging, wght_avergage)))
      
      #  id1   id2   var_sum1 var_sum2 var_mean1 var_mean2 var_weighted_mean
      #  <fct> <fct>    <dbl>    <dbl>     <dbl>     <dbl>             <dbl>
      #1 a     c       -0.840     9.87      9.76      13.9            0.308 
      #2 a     d        3.27     14.4       9.66      15.8            0.275 
      #3 b     c       -0.408    18.5       8.82      14.8            0.0450
      #4 b     d        1.29      4.85     10.3       15.4           -0.521 
      

      This gives the same output as final_dplyr.

      final_dplyr
      
      #  id1   id2   var_sum1 var_sum2 var_mean1 var_mean2 var_weighted_mean
      #  <fct> <fct>    <dbl>    <dbl>     <dbl>     <dbl>             <dbl>
      #1 a     c       -0.840     9.87      9.76      13.9            0.308 
      #2 a     d        3.27     14.4       9.66      15.8            0.275 
      #3 b     c       -0.408    18.5       8.82      14.8            0.0450
      #4 b     d        1.29      4.85     10.3       15.4           -0.521 
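
In dplyr 1.0.0 and later, the _at variants are superseded by across(). A minimal sketch of the same aggregation with summarise() and across() follows. It assumes dplyr >= 1.0.0; the seed, the reduced column set, and the names `res` and `manual` are illustrative assumptions.

```r
library(dplyr)

set.seed(42)  # assumed seed, only to make the sketch reproducible
n <- 10
df <- data.frame(
  id1 = sample(c("a", "b"), n, TRUE),
  id2 = sample(c("c", "d"), n, TRUE),
  var_sum1 = rnorm(n),
  var_mean1 = rnorm(n, 10),
  var_weighted_mean = rnorm(n),
  weight = sample(c(1, 2), n, TRUE)
)

ids <- c("id1", "id2")
summing <- "var_sum1"
averaging <- "var_mean1"
wght_average <- "var_weighted_mean"

# all_of() selects columns from character vectors; each across() applies one rule
res <- df %>%
  group_by(across(all_of(ids))) %>%
  summarise(
    across(all_of(summing), sum),
    across(all_of(averaging), mean),
    across(all_of(wght_average), ~ weighted.mean(.x, weight)),
    .groups = "drop"
  )

# the same aggregation written out by hand, for comparison
manual <- df %>%
  group_by(id1, id2) %>%
  summarise(var_sum1 = sum(var_sum1),
            var_mean1 = mean(var_mean1),
            var_weighted_mean = weighted.mean(var_weighted_mean, weight),
            .groups = "drop")

all.equal(res, manual)
```

Because summarise() returns one row per group directly, this form does not need the mutate_at plus slice(1L) step.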
      

      【Comments】:

        【Solution 3】:

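        We can also do this with map2 from purrr. Note that weighted.mean takes its weights through the argument w, so weight = weight below is not used as weights; the last column is a plain per-group mean, which is why rows 2 and 3 differ from final_dplyr: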

        library(dplyr)
        library(purrr)
        fns <- list(sum, mean, partial(weighted.mean, weight = weight))
        map2(list(df[3:4], df[5:6], df[7:8]), fns,
           ~  bind_cols(.x, df %>% 
                  select(id1, id2))  %>% 
                 group_by(id1, id2) %>%
                 summarise_at(vars(-group_cols()), .y)) %>% 
          reduce(inner_join, by = c('id1', 'id2')) %>%
          select(-weight)
        # A tibble: 4 x 7
        # Groups:   id1 [2]
        #  id1   id2   var_sum1 var_sum2 var_mean1 var_mean2 var_weighted_mean
        #  <fct> <fct>    <dbl>    <dbl>     <dbl>     <dbl>             <dbl>
        #1 a     c       -0.840     9.87      9.76      13.9             0.308
        #2 a     d        3.27     14.4       9.66      15.8             0.511
        #3 b     c       -0.408    18.5       8.82      14.8             0.390
        #4 b     d        1.29      4.85     10.3       15.4            -0.521
        

        Or use Map from base R:

        Reduce(function(...) merge(..., by = c('id1', 'id2')), 
             Map(function(fn, dat)  aggregate(.~ id1 + id2, 
                cbind(dat, df[c('id1', 'id2')]), fn), 
              list(sum, mean, weighted.mean), list(df[3:4], df[5:6], df[7:8])))[-8]
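
aggregate() applies its function to one column at a time, so it has no direct way to pass a second column as weights. A base R sketch that does use the weights splits the data by the ids and aggregates each piece. The helper `agg_group()`, the seed, and the reduced column set are hypothetical and only serve as illustration.

```r
set.seed(42)  # assumed seed, only to make the sketch reproducible
n <- 10
df <- data.frame(
  id1 = sample(c("a", "b"), n, TRUE),
  id2 = sample(c("c", "d"), n, TRUE),
  var_sum1 = rnorm(n),
  var_mean1 = rnorm(n, 10),
  var_weighted_mean = rnorm(n),
  weight = sample(c(1, 2), n, TRUE)
)

ids <- c("id1", "id2")
summing <- "var_sum1"
averaging <- "var_mean1"
wght_average <- "var_weighted_mean"

# hypothetical helper: collapse one group (a slice of df) into a single row
agg_group <- function(d) {
  data.frame(
    d[1, ids, drop = FALSE],
    lapply(d[summing], sum),
    lapply(d[averaging], mean),
    lapply(d[wght_average], weighted.mean, w = d$weight),
    row.names = NULL
  )
}

# split by every observed id combination, aggregate each piece, stack the rows
groups <- split(df, df[ids], drop = TRUE)
result <- do.call(rbind, lapply(groups, agg_group))
rownames(result) <- NULL
result
```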
        

        【Comments】:
