Loop calculation with previous value without using for in R (answers)

【Question title】: Loop calculation with previous value not using for in R
【Posted】: 2020-05-02 09:23:39
【Question】:

I am a beginner R programmer. I am having trouble with a loop calculation that uses the previous value (like a recursion). Example of my data:

library(data.table)
dt <- data.table(a = 0:4, b = c(0, 1, 2, 1, 3))

The computed value y is defined as y[n] = (y[n-1] + b[n]) * a[n]. The initial value of y is 0 (y[1] = 0).

I used a for loop; the code and result are below.

dt$y <- 0
for (i in 2:nrow(dt)) {
  dt$y[i] <- (dt$y[i - 1] + dt$b[i]) * dt$a[i]
}

   a b  y
1: 0 0  0
2: 1 1  1
3: 2 2  6
4: 3 1 21
5: 4 3 96

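This result is what I want. However, my data has more than 1,000,000 rows and several columns, so I am trying to find another way that does not use a for loop. I tried "Reduce()", but it only works with a single vector (for example y[n] = y[n-1] + b[n]). As shown above, my function uses two vectors, a and b, so I could not find a solution.

Is there a more efficient way to speed this up without a for loop, such as a recursive function or a good package function?

【Comments】:

Tags: r loops recursion reduce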

【Solution 1】:

I don't think this will be faster, but here is one way to do it without an explicit loop:

dt[, y := purrr::accumulate2(a, b, function(last, a, b) (last + b)*a
                             , .init = 0)[-1]]

dt      
#    a b  y
# 1: 0 0  0
# 2: 1 1  1
# 3: 2 2  6
# 4: 3 1 21
# 5: 4 3 96
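
The question mentions that Reduce() seemed limited to a single vector. As a minimal base R sketch of the same idea as accumulate2 above, one can fold over the row indices instead of over the values, so that both a and b remain reachable inside the function. The vectors here are the question's example data; this is an illustration only and is not expected to be faster than a plain loop over numeric vectors.

```r
a <- 0:4
b <- c(0, 1, 2, 1, 3)

# Fold over the row indices 2..n; `prev` carries y[i - 1].
# The third argument (0) is the initial value y[1].
y <- Reduce(function(prev, i) (prev + b[i]) * a[i],
            seq_along(a)[-1], 0, accumulate = TRUE)
y
# [1]  0  1  6 21 96
```

With accumulate = TRUE, Reduce() returns every intermediate value, including the initial one, so the result already has one entry per row.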

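【Discussion】:

【Solution 2】:

Here is a base R solution.

Based on the information from @ThetaFC, the hint for speeding things up is to use a matrix or plain vectors (rather than a data.frame or data.table). It is therefore better to do the following preprocessing before calculating df$y: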

a <- as.numeric(df$a)
b <- as.numeric(df$b)

Then you have two ways to obtain df$y:
- write your own custom recursive function

f <- function(k) {
  if (k == 1) return(0)
  # evaluate f(k - 1) only once; calling it twice doubles the work at every level
  prev <- f(k - 1)
  c(prev, (tail(prev, 1) + b[k]) * a[k])
}

df$y <- f(nrow(df))

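- or a non-recursive function (I guess this would be much faster than the recursive approach)

# y[1] is 0 by definition; the guard is needed because 2:1 counts down to c(2, 1)
g <- Vectorize(function(k) if (k == 1) 0 else sum(rev(cumprod(rev(a[2:k]))) * b[2:k]))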

df$y <- g(seq(nrow(df)))
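
For readers wondering where the expression inside g comes from, here is a short derivation sketch obtained by unrolling the recurrence from the question. The subscript notation is introduced here for illustration and is not part of the original answer.

```latex
% Recurrence from the question: y_1 = 0, \quad y_n = (y_{n-1} + b_n)\, a_n
\begin{aligned}
y_2 &= b_2 a_2 \\
y_3 &= (y_2 + b_3)\, a_3 = b_2 a_2 a_3 + b_3 a_3 \\
y_k &= \sum_{j=2}^{k} b_j \prod_{i=j}^{k} a_i \qquad (k \ge 2)
\end{aligned}
```

With the example data, y_3 = 1*1*2 + 2*2 = 6, which matches the table in the question. In g, rev(cumprod(rev(a[2:k]))) produces the trailing products and the multiplication by b[2:k] followed by sum() evaluates the sum. Each y_k costs work proportional to k, so computing all n values this way is quadratic in n.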

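【Discussion】:

On a df with only 20,000 rows, your second solution took 27 seconds on my machine, using the toy example from the answer I provided below. Accessing the vectors through a matrix or as individual numeric vectors is much faster (0.02 seconds). It seems that indexing the data.frame or data.table is the speed bottleneck, not the for loop.
@ThetaFC Thanks for the information; my answer has now been updated.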

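【Solution 3】:

Because each iteration depends on the previous one, this calculation cannot take advantage of R's vectorization. But the slowdown really seems to come from indexing performance on the data.frame or data.table.

Interestingly, I was able to get a large speedup by directly accessing a, b, and y as numeric vectors (1000+ fold advantage for 2*10^5 rows) or as matrix "columns" (100+ fold advantage for 2*10^5 rows), compared with accessing them as columns in a data.table or data.frame.

This old discussion may still shed some light on this rather surprising result: https://stat.ethz.ch/pipermail/r-help/2011-July/282666.html

Note that I also made a different toy data.frame, so that I could test a larger example without returning Inf as y grows with i:

Option data.frame (numeric vectors embedded in a data.frame or data.table, as in your example):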

vec_length <- 200000
dt <- data.frame(a=seq(from=0, to=1, length.out = vec_length), b=seq(from=0, to=-1, length.out = vec_length), y=0)
system.time(for (i in 2:nrow(dt)) {
  dt$y[i] <- (dt$y[i - 1] + dt$b[i]) * dt$a[i]
})
#user  system elapsed 
#79.39  146.30  225.78
#NOTE: Sorry, I didn't have the patience to let the data.table version finish for vec_length=2*10^5.  
tail(dt$y)
#[1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674

Option vector (numeric vectors extracted before the loop):

vec_length <- 200000
dt <- data.frame(a=seq(from=0, to=1, length.out = vec_length), b=seq(from=0, to=-1, length.out = vec_length), y=0)
y <- as.numeric(dt$y)
a <- as.numeric(dt$a)
b <- as.numeric(dt$b)
system.time(for (i in 2:length(y)) {
  y[i] <- (y[i - 1] + b[i]) * a[i]
})
#user  system elapsed 
#0.03    0.00    0.03 
tail(y)
#[1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674

Option matrix (data.frame converted to matrix before the loop):

vec_length <- 200000
dt <- as.matrix(data.frame(a=seq(from=0, to=1, length.out = vec_length), b=seq(from=0, to=-1, length.out = vec_length), y=0))
system.time(for (i in 2:nrow(dt)) {
  dt[i, 3] <- (dt[i - 1, 3] + dt[i, 2]) * dt[i, 1]  # write to column 3 (y), not column 1 (a)
})
#user  system elapsed 
#0.67    0.01    0.69
tail(dt[,3])
#[1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674
#NOTE: a matrix is actually a vector with an additional attribute (its "dim") that says how the "matrix" should be organized into rows and columns

Option data.frame with matrix-style indexing:

vec_length <- 200000
dt <- data.frame(a=seq(from=0, to=1, length.out = vec_length), b=seq(from=0, to=-1, length.out = vec_length), y=0)
system.time(for (i in 2:nrow(dt)) {
    dt[i, 3] <- (dt[(i - 1), 3] + dt[i, 2]) * dt[i, 1]
})
#user  system elapsed 
#110.69    0.03  112.01 
tail(dt[,3])
#[1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674
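
Putting these timings to use, a minimal sketch of the "extract, loop, write back once" pattern could look like the following. The helper name recur_y is hypothetical and chosen only for illustration; the data is the question's small example rather than the toy benchmark above.

```r
# Hypothetical helper (name chosen for illustration): run the recurrence on
# plain numeric vectors so the loop never indexes into a data.frame.
recur_y <- function(a, b) {
  stopifnot(length(a) == length(b))
  n <- length(a)
  y <- numeric(n)              # y[1] stays 0, the initial value
  if (n < 2) return(y)
  for (i in 2:n) {
    y[i] <- (y[i - 1] + b[i]) * a[i]
  }
  y
}

df <- data.frame(a = 0:4, b = c(0, 1, 2, 1, 3))
df$y <- recur_y(df$a, df$b)    # single column assignment after the loop
df$y
# [1]  0  1  6 21 96
```

The loop itself is unchanged from the question; the only difference is that the data.frame is touched twice (once to read the columns, once to store the result) instead of on every iteration.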

【Discussion】:

【Solution 4】:

One option is to use Rcpp, since this recursive equation is easy to code in C++:

library(Rcpp)
cppFunction("
NumericVector func(NumericVector b, NumericVector a) {
    int len = b.size();
    NumericVector y(len);

    for (int i = 1; i < len; i++) {
        y[i] = (y[i-1] + b[i]) * a[i];
    }

    return(y);
}
")
func(c( 0, 1, 2, 1, 3), c(0:4))
#[1]  0  1  6 21 96

Timing code:

vec_length <- 1e7
dt <- data.frame(a=1:vec_length, b=1:vec_length, y=0)
y <- as.numeric(dt$y)
a <- as.numeric(dt$a)
b <- as.numeric(dt$b)

system.time(for (i in 2:length(y)) {
    y[i] <- (y[i - 1] + b[i]) * a[i]
})
#   user  system elapsed 
#  19.22    0.06   19.44 

system.time(func(b, a))
#   user  system elapsed 
#   0.09    0.02    0.09

【Discussion】: