How do I speed up my data loop calculations in R? (Answers)

【Question title】: How do I speed up my data loop calculations in R?
【Posted】: 2021-05-30 07:51:04
【Question】:

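I created the data loop below, and it gives me the results I need. However, it takes a long time to run. I need to analyze a large amount of data (400,000+ observations, ideally 25,000,000+), so I would like to know whether there is any way to speed up the calculation below (data excerpt):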

My data frame is called crsp.comp3:

Permno Observation  C.xsgaq  C.xsgaq.depr
10026      1        45.145    44.393     
10026      2        45.145    43.653     
10026      3        45.145    42.925     
10026      4        96.730    92.935     
10026      5        96.730    91.386     
10026      6        96.730    89.863     
10026      7        145.511   136.333     
10026      8        145.511   134.061     
10026      9        145.511   131.827     
10026     10        190.986   174.347

Currently, I calculate the numbers in the "C.xsgaq.depr" column as follows:

for (i in 1:nrow(crsp.comp3)) {
  if (crsp.comp3[i, 2] == 1) {
    crsp.comp3[i, 4] <- crsp.comp3[i, 3]*(1 - (0.2/12))
  } else {
    crsp.comp3[i, 4] <- (crsp.comp3[i - 1, 4] + 
                           (crsp.comp3[i, 3] - crsp.comp3[i - 1, 3]))*(1 - (0.2/12))
  }
}

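Observations numbered "1" need to be calculated as in the first branch above, and all observations not equal to 1 need to be calculated as in the else branch of the loop. My goal is to optimize the code so that it runs faster. I have heard something about vectorizing data frames?

Thanks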

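【Question discussion】:

You posted similar data earlier but have not yet responded to some of the queries there. Please check.
Why not use the ifelse function inside the loop instead of the if ... else construct?
Hi John, thanks for your reply! I am very new to R, so I am not familiar with that kind of setup. How would the loop look with the "ifelse" function?
Just replace if ... else ... with crsp.comp3[i, 4] <- ifelse(crsp.comp3[i, 2] == 1, crsp.comp3[i, 3]*(1 - (0.2/12)), (crsp.comp3[i - 1, 4] + (crsp.comp3[i, 3] - crsp.comp3[i - 1, 3]))*(1 - (0.2/12)))
Your related post: stackoverflow.com/questions/66399219/…

Tags: r performance dataframe loops vectorization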

【Solution 1】:

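Performance comparison: for 10k observations this approach is about 150 times faster (8.1 s -> 0.05 s), and for 100k observations about 1,000 times faster (166 s -> 0.15 s). I expect the gap to widen as your data grows. The fake data used for testing is at the bottom.

This can process 25 million rows of fake data in about 5 seconds. If you need it faster still, I suggest data.table.

Here is an alternative approach using dplyr (not because it offers any particular advantage over base R here, but because it is easier for me), which relies on some algebraic manipulation. (Suggestions for further simplification are welcome!)

The key to making R fast is to frame the problem so that you can apply the same calculation to all elements of the data at once. That is vectorization.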
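To make the idea of vectorization concrete, here is a minimal base R sketch. The vector `x` is a hypothetical toy input (not the asker's full data), and the only rule applied is the "Observation == 1" branch from the question, i.e. multiplying by 1 - 0.2/12.

```r
# Hypothetical toy vector, used only for illustration
x <- c(45.145, 96.730, 145.511, 190.986)
rate <- 1 - 0.2 / 12

# Loop version: R handles one element per iteration
out_loop <- numeric(length(x))
for (i in seq_along(x)) {
  out_loop[i] <- x[i] * rate
}

# Vectorized version: a single call operates on the whole vector
out_vec <- x * rate

# Both versions give the same numbers
print(all.equal(out_loop, out_vec))
```

The loop and the vectorized expression compute the same thing; the vectorized form simply hands the whole vector to R's internal code in one step instead of interpreting each iteration separately.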

library(dplyr)


crsp.comp3 %>%
  # The grouping here will make it so that the first Obs
  #   in a new group won't "see" the last Obs of the prior group. 
  #   We could just as easily group by Permno...
  mutate(group = cumsum(Observation == 1)) %>%
  group_by(group) %>%
         
  mutate(deprec = (11.8/12) ^ Observation,
         C_change =  (C.xsgaq - lag(C.xsgaq, default = 0)) /
           lag(deprec, default = 1),  # Edit: a little faster than 
                                      # (11.8/12)^(Observation-1),
         cuml = cumsum(C_change),
         output = cuml * deprec) %>%
  ungroup()

Result

# A tibble: 10 x 8
   Permno Observation C.xsgaq group deprec C_change  cuml output
    <int>       <int>   <dbl> <int>  <dbl>    <dbl> <dbl>  <dbl>
 1  10026           1    45.1     1  0.983     45.1  45.1   44.4
 2  10026           2    45.1     1  0.967      0    45.1   43.7
 3  10026           3    45.1     1  0.951      0    45.1   42.9
 4  10026           4    96.7     1  0.935     54.3  99.4   92.9
 5  10026           5    96.7     1  0.919      0    99.4   91.4
 6  10026           6    96.7     1  0.904      0    99.4   89.9
 7  10026           7   146.      1  0.889     54.0 153.   136. 
 8  10026           8   146.      1  0.874      0   153.   134. 
 9  10026           9   146.      1  0.860      0   153.   132. 
10  10026          10   191.      1  0.845     52.9 206.   174.
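The algebra behind the pipeline above can be spelled out and checked in base R. Writing r = 11.8/12 and taking C_0 = 0, the question's recurrence d_i = (d_{i-1} + C_i - C_{i-1}) * r unrolls to d_i = r^i * sum over k <= i of (C_k - C_{k-1}) / r^(k-1). The sketch below, which assumes a single Permno with Observation running 1, 2, 3, ..., compares the loop with that closed form on the ten sample rows; the variable names are illustrative, not part of the original answer.

```r
# Sample values of C.xsgaq from the question (single Permno)
C <- c(45.145, 45.145, 45.145, 96.73, 96.73, 96.73,
       145.511, 145.511, 145.511, 190.986)
obs <- seq_along(C)   # plays the role of Observation
r <- 11.8 / 12        # same as 1 - 0.2/12

# Recurrence exactly as written in the question's loop
loop_out <- numeric(length(C))
loop_out[1] <- C[1] * r
for (i in 2:length(C)) {
  loop_out[i] <- (loop_out[i - 1] + C[i] - C[i - 1]) * r
}

# Closed form: rescale each change, take a running sum, scale back
delta <- diff(c(0, C))                        # C_k - C_{k-1}, with C_0 = 0
closed_out <- r^obs * cumsum(delta / r^(obs - 1))

print(all.equal(loop_out, closed_out))
```

Here `delta / r^(obs - 1)` corresponds to `C_change` in the dplyr code, the `cumsum()` to `cuml`, and `r^obs` to `deprec`.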

Fake data for testing:

n = 100000
Permno = 1000
Obs = floor(n / Permno)

crsp.comp3 <- tibble(Permno = rep(1:Permno, each = Obs),
                    Observation = rep(1:Obs, length.out = n),
                    Chg = sample(c(rep(0, 10), runif(5, 1, 100)), n, replace = TRUE)) %>%
  group_by(Permno) %>%
  mutate(C.xsgaq = cumsum(Chg))  %>%
  ungroup() %>%
  select(Permno, Observation, C.xsgaq)
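The answer mentions data.table as an option when more speed is needed but does not show it. The following is only a sketch of how the same closed-form calculation might look in data.table, assuming the package is installed, that rows are sorted by Observation within each Permno, and that Observation restarts at 1 for every Permno. The column name `output` mirrors the dplyr version; no timing claims are made for it.

```r
library(data.table)

# The ten sample rows from the question (Permno is recycled)
dt <- data.table(
  Permno = 10026L,
  Observation = 1:10,
  C.xsgaq = c(45.145, 45.145, 45.145, 96.73, 96.73, 96.73,
              145.511, 145.511, 145.511, 190.986)
)

r <- 11.8 / 12  # monthly factor, same as 1 - 0.2/12

# Per Permno: change in C.xsgaq, rescaled, cumulated, then scaled back
dt[, output := {
  delta <- C.xsgaq - shift(C.xsgaq, fill = 0)
  r^Observation * cumsum(delta / r^(Observation - 1))
}, by = Permno]

print(dt)
```

If Observation does not restart at 1 for each company, a group id such as `cumsum(Observation == 1)` (as in the dplyr code) would be needed in place of `by = Permno`.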

【Discussion】:

Thank you so much, Jon! You just saved our master's thesis progress :)

【Solution 2】:

Another approach: iterate with purrr::accumulate()

library(purrr)

#add column for asset addition mid-way during the year
crsp.comp3$assetadd <- c(0, diff(crsp.comp3$C.xsgaq))

#create your new desired column iteratively
accumulate(crsp.comp3$assetadd, ~ (.x + .y)*(11.8/12), .init = crsp.comp3$C.xsgaq[1])[-1]

 [1]  44.39258  43.65271  42.92516  92.93499  91.38608  89.86297 136.33324 134.06102 131.82667 174.34664

#Or store it in new variable directly
crsp.comp3$desired_val <- accumulate(crsp.comp3$assetadd, ~ (.x + .y)*(11.8/12), .init = crsp.comp3$C.xsgaq[1])[-1]

#check it
> crsp.comp3
   Permno Observation C.xsgaq C.xsgaq.depr assetadd desired_val
1   10026           1  45.145       44.393    0.000    44.39258
2   10026           2  45.145       43.653    0.000    43.65271
3   10026           3  45.145       42.925    0.000    42.92516
4   10026           4  96.730       92.935   51.585    92.93499
5   10026           5  96.730       91.386    0.000    91.38608
6   10026           6  96.730       89.863    0.000    89.86297
7   10026           7 145.511      136.333   48.781   136.33324
8   10026           8 145.511      134.061    0.000   134.06102
9   10026           9 145.511      131.827    0.000   131.82667
10  10026          10 190.986      174.347   45.475   174.34664
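The accumulate() call above treats the whole column as one series, which fits the single-Permno sample. With several companies, the calculation would presumably need to restart for each Permno. One possible sketch combines purrr::accumulate() with base R's ave(); the helper `depreciate` and the two-company toy data are hypothetical, and the rows are assumed to be sorted by Observation within each Permno.

```r
library(purrr)

r <- 11.8 / 12  # monthly factor, same as 1 - 0.2/12

# Hypothetical toy data: two companies, three observations each
df <- data.frame(
  Permno = rep(c(1L, 2L), each = 3),
  Observation = rep(1:3, times = 2),
  C.xsgaq = c(10, 10, 25, 40, 55, 55)
)

# Hypothetical helper: depreciated value for ONE company's series.
# Starting from 0, each step adds the period's addition and then
# applies the monthly factor, as in the question's loop.
depreciate <- function(x) {
  additions <- diff(c(0, x))
  accumulate(additions, ~ (.x + .y) * r, .init = 0)[-1]
}

# ave() applies the helper separately within each Permno
df$desired_val <- ave(df$C.xsgaq, df$Permno, FUN = depreciate)

print(df)
```

This is still an iterative calculation per group, so it is not expected to match the speed of a fully vectorized approach on very large data.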

Data used

crsp.comp3 <- structure(list(Permno = c(10026L, 10026L, 10026L, 10026L, 10026L, 
10026L, 10026L, 10026L, 10026L, 10026L), Observation = 1:10, 
    C.xsgaq = c(45.145, 45.145, 45.145, 96.73, 96.73, 96.73, 
    145.511, 145.511, 145.511, 190.986), C.xsgaq.depr = c(44.393, 
    43.653, 42.925, 92.935, 91.386, 89.863, 136.333, 134.061, 
    131.827, 174.347)), class = "data.frame", row.names = c(NA, 
-10L))

> crsp.comp3
   Permno Observation C.xsgaq C.xsgaq.depr
1   10026           1  45.145       44.393
2   10026           2  45.145       43.653
3   10026           3  45.145       42.925
4   10026           4  96.730       92.935
5   10026           5  96.730       91.386
6   10026           6  96.730       89.863
7   10026           7 145.511      136.333
8   10026           8 145.511      134.061
9   10026           9 145.511      131.827
10  10026          10 190.986      174.347

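【Discussion】:

Thank you very much for your input, Anil! Given the amount of data we need to process, we have to avoid heavily iterative procedures because the run time becomes too long. We really appreciate your help.
If an answer helped you, you can upvote it. On any question, you can upvote every answer that you find correct.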