Rolling sum function in Rcpp: answers

【Question title】: Rolling sum function in Rcpp
【Posted】: 2018-12-03 00:47:40
【Question】:

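I am currently working with a large data frame and need to create rolling sums over several window lengths for multiple variables. I have a working method using data.table, but it takes quite a long time to run (roughly 50 minutes per variable).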

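I have spent some time improving the script to speed it up, but I have run out of ideas. I have no C++ experience, but thought the Rcpp package might be an option. I have looked into it myself but have not come up with anything usable.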

Here is my data.table script for one variable:

df_td <- setDT(df_1, key=c("Match","Name"))[,by=.(Match, Name), paste0("Period_", 1:10) 
                                        := mclapply((1:10)*600, function(x) rollsumr(Dist, x, fill = NA))][]

I used parallel::mclapply, which helped, but it still takes a long time to run.

> dput(head(df_1, 20))
structure(list(Match = c("Bath_A", "Bath_A", "Bath_A", "Bath_A", 
"Bath_A", "Bath_A", "Bath_A", "Bath_A", "Bath_A", "Bath_A", "Bath_A", 
"Bath_A", "Bath_A", "Bath_A", "Bath_A", "Bath_A", "Bath_A", "Bath_A", 
"Bath_A", "Bath_A"), Name = c("Jono Lance", "Jono Lance", "Jono Lance", 
"Jono Lance", "Jono Lance", "Jono Lance", "Jono Lance", "Jono Lance", 
"Jono Lance", "Jono Lance", "Jono Lance", "Jono Lance", "Jono Lance", 
"Jono Lance", "Jono Lance", "Jono Lance", "Jono Lance", "Jono Lance", 
"Jono Lance", "Jono Lance"), Dist = c(0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), Dist_HS = c(0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), Dist_SD = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA, 
-20L), class = c("tbl_df", "tbl", "data.frame"))

> str(df_1)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   26533771 obs. of  5 variables:
$ Match  : chr  "Bath_A" "Bath_A" "Bath_A" "Bath_A" ...
$ Name   : chr  "Jono Lance" "Jono Lance" "Jono Lance" "Jono Lance" ...
$ Dist   : num  0 0 0 0 0 0 0 0 0 0 ...
$ Dist_HS: num  0 0 0 0 0 0 0 0 0 0 ...
$ Dist_SD: num  0 0 0 0 0 0 0 0 0 0 ...

Any suggestions on how to speed this up would be greatly appreciated.

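【Comments】:

If the dataset can be reduced, you may want to take a step or two back (to before you create this huge dataset)? Perhaps there are a lot of zeros in your dataset.
The dataset is created from nearly 400 individual files. I had thought about keeping them in list format, but that did not seem any faster for me. As for the zeros, the data is based on GPS distance sampled every 0.1 seconds, so each file has some zeros at the start, but the number differs from file to file.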

Tags: r data.table rcpp zoo

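【Solution 1】:

Because the windows overlap, you can reuse the sums from the previous iteration: the sum over n*600 samples equals the sum over the previous (n-1)*600 samples, shifted by 600, plus the sum over the latest 600 samples. Here is a possible approach using shift: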

library(RcppRoll)
DT[, Period_1 := roll_sumr(Dist, 600L, fill=NA), by=.(ID)]
for (n in 2L:10L) {
    DT[, paste0("Period_", n) := {
            x <- get(paste0("Period_", n-1L))
            shift(x, 600L) + Period_1
        },
        by=.(ID)]
}
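
To make the reuse explicit outside of data.table, here is a minimal standalone C++ sketch of the same identity. It is only an illustration under stated assumptions: the names direct_rolling_sum and extend_window are hypothetical (they are not part of RcppRoll or data.table), and NaN stands in for R's NA.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Reference implementation (hypothetical helper): right-aligned rolling sum
// over w samples, recomputed from scratch for every position.
// Positions without a complete window hold NaN (standing in for R's NA).
std::vector<double> direct_rolling_sum(const std::vector<double>& x, std::size_t w) {
    assert(w > 0);
    const double na = std::numeric_limits<double>::quiet_NaN();
    std::vector<double> res(x.size(), na);
    for (std::size_t i = w - 1; i < x.size(); ++i) {
        double s = 0.0;
        for (std::size_t j = i + 1 - w; j <= i; ++j) s += x[j];
        res[i] = s;
    }
    return res;
}

// Reuse step (hypothetical helper): given `prev`, the rolling sum over
// (k-1)*w samples, and `base`, the rolling sum over w samples, build the
// rolling sum over k*w samples as prev shifted by w plus base.
std::vector<double> extend_window(const std::vector<double>& prev,
                                  const std::vector<double>& base,
                                  std::size_t w) {
    assert(prev.size() == base.size());
    const double na = std::numeric_limits<double>::quiet_NaN();
    std::vector<double> res(base.size(), na);
    for (std::size_t i = w; i < base.size(); ++i) {
        res[i] = prev[i - w] + base[i];  // NaN propagates where prev is missing
    }
    return res;
}
```

Calling extend_window(s, s, w) on the w-sample sums s should reproduce direct_rolling_sum(x, 2 * w), and feeding each result back in as prev gives the 3*w, 4*w, ... windows, which is what the loop does with shift and Period_1.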

Replacing the loop with Reduce:

library(RcppRoll)
DT[, Period_1 := roll_sumr(Dist, 600L, fill=NA), by=.(ID)]
DT[, paste0("Period_", 1L:10L) :=
    Reduce(function(x, y) x + y, shift(Period_1, (1L:9L)*600L), Period_1, accum=TRUE),
    by=.(ID)]

Data:

library(data.table)
set.seed(0L)
nsampl <- 6003
nIDs <- 1
DT <- data.table(ID=rep(1:nIDs, each=nsampl), 
    Dist=rnorm(nIDs*nsampl, 1000, 100))

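【Comments】:

Thanks, but the method I wrote up, building the function via Rcpp and incorporating it, seems to work great for me. I would be keen to know whether it is a good solution from a C++ point of view, though, as I have no experience with it.
You may want to time it and see whether it meets your needs.
It works great for me; every analysis I have run so far takes less than a minute. Big improvement!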

【Solution 2】:

I may have found a solution to my problem here. By adding the following function via Rcpp:

library(Rcpp)

cppFunction(includes = "#include <numeric>", code = '
NumericVector run_sum_v2(NumericVector x, int n) {

    int sz = x.size();

    // start with every element NA; the first n-1 positions stay NA
    NumericVector res(sz, NA_REAL);

    // no complete window exists if n is non-positive or longer than x
    if (n <= 0 || n > sz) return res;

    // sum the first n values to get the first complete window
    res[n-1] = std::accumulate(x.begin(), x.begin() + n, 0.0);

    // slide the window: add the entering value, drop the leaving one
    for (int i = n; i < sz; i++) {
        res[i] = res[i-1] + x[i] - x[i-n];
    }

    return res;
}
')
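
For anyone who wants to try the sliding-window update without R, here is a rough standalone analogue on std::vector. It is a sketch only: run_sum and check_run_sum are hypothetical names, NaN stands in for NA_REAL, and the guard for a window longer than the input is my own assumption about the desired behaviour.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

// Standalone analogue of the sliding-window rolling sum (no Rcpp needed).
// Positions without a complete window hold NaN.
std::vector<double> run_sum(const std::vector<double>& x, std::size_t n) {
    const double na = std::numeric_limits<double>::quiet_NaN();
    std::vector<double> res(x.size(), na);
    if (n == 0 || n > x.size()) return res;  // no complete window exists

    // first complete window: sum of the first n values
    res[n - 1] = std::accumulate(
        x.begin(), x.begin() + static_cast<std::ptrdiff_t>(n), 0.0);

    // every later window: one addition and one subtraction,
    // independent of the window length n
    for (std::size_t i = n; i < x.size(); ++i) {
        res[i] = res[i - 1] + x[i] - x[i - n];
    }
    return res;
}

// Small self-check on an input that can be verified by hand.
void check_run_sum() {
    std::vector<double> r = run_sum({1, 2, 3, 4, 5}, 3);
    assert(std::isnan(r[0]) && std::isnan(r[1]));
    assert(r[2] == 6.0 && r[3] == 9.0 && r[4] == 12.0);
}
```

Each output element after the first window costs one addition and one subtraction regardless of n, which is a plausible reason for the speed-up over recomputing every window. Because the result is updated incrementally, floating-point rounding can accumulate over very long inputs, so it may be worth spot-checking a few values against a direct sum.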

run_sum_v2 fits into my data.table line in place of zoo::rollsumr, and appears to be much faster (

It has cut a script that took more than 2 hours down to

【Comments】: