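[Question title]: Rolling cumsum in data.table
[Posted]: 2021-01-15 04:23:20
[Question]:

I'm trying to compute a (reverse) cumulative sum over a moving window, by group, in data.table. For example, from the following data I want to obtain the values in the "roll_cumsum" column: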

library(data.table)

dt = data.table()
dt[, a := seq(1, 10, 1)]
dt[, group := rep(1:2, each = 5)]
dt[, roll_cumsum := c(15, 14, 12, 9, 5, 40, 34, 27, 19, 10)]

The code below gives me the result I want, but it is slow on large data sets:

partial_sum = function(x) { n <- seq_along(x); cumsum(x)[length(x)] - cumsum(x)[n] + x[n] }
dt[, partial_sum(a), by = group]

Any suggestions for speeding up the computation? Many thanks!

[Comments]:

    Tags: r data.table cumsum


    [Solution 1]:

    There is a revcumsum function in the spatstat.utils package:

    library(spatstat.utils)
    dt[, roll_cumsum2 := revcumsum(a), group]
    

    -output

    dt
    #     a group roll_cumsum roll_cumsum2
    # 1:  1     1          15           15
    # 2:  2     1          14           14
    # 3:  3     1          12           12
    # 4:  4     1           9            9
    # 5:  5     1           5            5
    # 6:  6     2          40           40
    # 7:  7     2          34           34
    # 8:  8     2          27           27
    # 9:  9     2          19           19
    #10: 10     2          10           10
    

    Or simply reverse the vector with rev, take the cumsum, and reverse it back:

    dt[, roll_cumsum2 := rev(cumsum(rev(a))), group]
    

    -output

    dt
    #     a group roll_cumsum roll_cumsum2
    # 1:  1     1          15           15
    # 2:  2     1          14           14
    # 3:  3     1          12           12
    # 4:  4     1           9            9
    # 5:  5     1           5            5
    # 6:  6     2          40           40
    # 7:  7     2          34           34
    # 8:  8     2          27           27
    # 9:  9     2          19           19
    #10: 10     2          10           10
    

    Or another way is to reverse by index with .N:

    dt[, roll_cumsum2 := cumsum(a[.N:1])[.N:1], group]
    

    Note: both of these are compact alternatives.

    Benchmarks
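
    As a quick sanity check, the two base-R variants above can be compared against the partial_sum function from the question on the sample data. This is only a sketch: the column names expected, rc_rev and rc_idx are made up for illustration and do not appear in the thread.

    ```r
    library(data.table)

    # Sample data from the question
    dt <- data.table(a = seq(1, 10, 1), group = rep(1:2, each = 5))

    # The asker's original function, used here as the reference result
    partial_sum = function(x) { n <- seq_along(x); cumsum(x)[length(x)] - cumsum(x)[n] + x[n] }

    dt[, expected := partial_sum(a), by = group]
    dt[, rc_rev := rev(cumsum(rev(a))), by = group]    # reverse, cumsum, reverse back
    dt[, rc_idx := cumsum(a[.N:1])[.N:1], by = group]  # same idea, reversing by index

    # Both variants should agree with the reference within each group
    stopifnot(isTRUE(all.equal(dt$expected, dt$rc_rev)),
              isTRUE(all.equal(dt$expected, dt$rc_idx)))
    ```

    Neither variant needs a package beyond data.table, which may matter if installing spatstat.utils is not an option.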

    dt1 <- data.table(a = 1:1e7, group = rep(1:1e6, length.out = 1e7))

    # original approach from the question
    system.time(dt1[, roll_cumsum := partial_sum(a), by = group])
    #   user  system elapsed 
    #  2.073   0.037   2.094 

    # spatstat.utils::revcumsum
    system.time(dt1[, roll_cumsum2 := revcumsum(a), group])
    #   user  system elapsed 
    #  2.623   0.029   2.637 

    # rev / cumsum / rev
    system.time(dt1[, roll_cumsum3 := rev(cumsum(rev(a))), group])
    #   user  system elapsed 
    #  4.275   0.051   4.276 

    # reversing by index with .N
    system.time(dt1[, roll_cumsum4 := cumsum(a[.N:1])[.N:1], group])
    #   user  system elapsed 
    #  1.703   0.028   1.722 

    # sum minus lagged cumsum (Solution 2)
    system.time(dt1[, roll_cumsum5 := sum(a) - cumsum(shift(a, fill = 0)), group])
    #   user  system elapsed 
    # 10.095   0.041  10.129 
    

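    [Comments]:

    • What about frollapply timing? I doubt it would be competitive, but it is the "go-to" method for rolling UDFs.
    • @jangorecki Not sure. It would be good to check that as well.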
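
    The thread does not include a timing for data.table's rolling functions. For anyone who wants to try one, a minimal sketch follows. It uses frollsum rather than frollapply, with an adaptive window that grows by one element at each position. It assumes a data.table version in which frollsum accepts adaptive = TRUE with the default right alignment. The helper name rev_cumsum_froll and the column name roll_cumsum_froll are hypothetical, and no claim is made about how fast this is.

    ```r
    library(data.table)

    # Sample data from the question
    dt <- data.table(a = seq(1, 10, 1), group = rep(1:2, each = 5))

    # Hypothetical helper (not from the thread): reverse the vector, take a
    # right-aligned rolling sum whose window sizes are 1, 2, ..., length(x),
    # then reverse the result back to the original order.
    rev_cumsum_froll <- function(x) {
      rev(frollsum(rev(x), n = seq_along(x), adaptive = TRUE))
    }

    dt[, roll_cumsum_froll := rev_cumsum_froll(a), by = group]
    ```

    To compare it with the other options, it could be timed with system.time on the same dt1 used in the benchmarks.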
    [Solution 2]:

    Within each group, you can subtract the cumulative sum of the lagged values of a from sum(a).

    library(data.table)
    dt[, roll_cumsum1 :=  sum(a) - cumsum(shift(a, fill = 0)), group]
    dt
    
    #     a group roll_cumsum roll_cumsum1
    # 1:  1     1          15           15
    # 2:  2     1          14           14
    # 3:  3     1          12           12
    # 4:  4     1           9            9
    # 5:  5     1           5            5
    # 6:  6     2          40           40
    # 7:  7     2          34           34
    # 8:  8     2          27           27
    # 9:  9     2          19           19
    #10: 10     2          10           10
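
    To see why this works, the expression can be unpacked step by step on the first group from the question. This is an illustrative sketch on a plain vector: the intermediate names lagged, before and result are made up here, and shift is the data.table function used above.

    ```r
    library(data.table)  # for shift()

    a <- c(1, 2, 3, 4, 5)          # values of the first group

    lagged <- shift(a, fill = 0)   # 0 1 2 3 4   (each value moved down one position)
    before <- cumsum(lagged)       # 0 1 3 6 10  (sum of everything before position i)
    result <- sum(a) - before      # 15 14 12 9 5

    # Same as the reverse cumulative sum
    stopifnot(isTRUE(all.equal(result, rev(cumsum(rev(a))))))
    ```

    In other words, position i receives the group total minus everything that comes before it, which is the sum from i to the end of the group.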
    

    [Comments]:
