Making nested foreach loops more efficient in R? Answers

【Question title】: Making nested foreach loops more efficient in R?
【Posted】: 2015-12-14 17:31:21
【Question description】:

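I wrote a function with three nested foreach loops that run in parallel. Its goal is to split a list of 30 [10,5] matrices (i.e. [[30]][10,5]) into a list of 5 [10,30] matrices (i.e. [[5]][10,30]).

However, I am trying to run this function with 1,000,000 paths (i.e. foreach (m = 1:1000000)) and, unsurprisingly, the performance is terrible.

If possible, I would like to avoid the apply functions, because I have found that they do not work well in combination with parallel foreach loops: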

library(foreach)
library(doParallel)

# input matr: a list of 30 [10,5] matrices
matrix_splitter <- function(matr) {
  time_horizon <- 30
  paths <- 10
  asset <- 5

  security_paths <- foreach(i = 1:asset, .combine = rbind, .packages = "doParallel", .export = "daily") %dopar% {
    foreach(m = 1:paths, .combine = rbind, .packages = "doParallel", .export = "daily") %dopar% {
      foreach(p = daily, .combine = c) %dopar% {
        p[m,i]  
      }
    }
  }
  df_securities <- as.data.frame(security_paths)
  split(df_securities, sample(rep(1:paths), asset))
}

Overall, I am trying to convert this data format:

[[30]]
            [,1]        [,2]       [,3]       [,4]        [,5]
 [1,]  0.2800977  2.06715521  0.9196326  0.3560659  1.36126507
 [2,] -0.5119867  0.24329025  0.1513218 -1.2528092 -0.04795098
 [3,] -2.0293933 -1.17989270  0.3053376 -0.9528611  0.86758140
 [4,] -0.6419024 -0.24846720 -0.6640066 -1.7104961 -0.32759406
 [5,] -0.4340359 -0.44034013  3.3440507  0.7380613  2.01237069
 [6,] -0.6679914 -0.01332117  1.9286056 -0.7194116  0.15549978
 [7,]  0.5919820  0.11616685 -0.8424634 -0.7652715  1.34176688
 [8,]  0.8079152  0.40592119 -0.4291811  0.9358829 -0.97479314
 [9,] -0.0265207 -0.03598320  1.1287344  0.4732984  1.37792596
[10,]  1.0553966  0.65776721 -1.2833613 -0.2414846  0.81528686

into this format (continuing through V30, of course):

$`5`
                    V1         V2          V3         V4         V5         V6         V7
result.2   -0.11822260  1.7712833  1.97737285 -1.6643193  0.4788075  1.2394064  1.4800787
result.7   -1.23251178  0.4267885 -0.07728632  0.3463092  0.8766395  0.6324840  0.5946710
result.2.1 -1.27309457 -0.3128173 -0.79561297 -0.4713307 -0.4344864  0.4688124 -0.5646857
result.7.1  0.51702719 -1.6242650 -2.37976199 -0.1088408  0.4846507 -0.7594376  0.9326529
result.2.2  1.77550390  0.9279155  0.26168402  0.4893835  1.4131326  0.5989508 -0.3434010
result.7.2 -0.01590682 -0.5568578  1.35789122 -0.1385092 -0.4501515 -0.2581724  0.5451699
result.2.3  0.30400225 -1.0245640 -0.05285694 -0.1354228  0.3070331 -0.7618850  1.0330961
result.7.3 -0.08139912  0.4106541  1.40418839  0.2471505  1.2106539  1.3844721  0.4006751
result.2.4  0.94977544 -0.8045054  1.48791211  1.4361686 -0.3789274 -1.9570125 -1.6576634
result.7.4  0.70449194  1.6887800  0.56447340  0.6465640  2.6865388 -0.7367524  0.6242624
                     V8         V9         V10         V11        V12         V13
result.2   -0.432404728 -1.6225350  0.09855465  0.17371907  0.3081843  0.15148452
result.7   -0.597420706  0.6173004  0.07518596  2.01741406  0.1767152 -0.39219471
result.2.1  0.918408322 -1.6896424 -0.13409626  0.38674224  0.3491750 -1.61083286
result.7.1  2.564057340 -0.7696399  1.06103614  1.38528367  1.1684045 -0.08467871
result.2.2  0.951995816  0.1910284  1.79943500  2.13909498  0.2847664  0.31094568
result.7.2 -0.479349220 -0.2368760  0.04298525 -0.40385960  0.3986555 -1.93499213
result.2.3 -1.382370069  1.0459845 -0.33106323 -0.43362925  0.7045572 -0.30211601
result.7.3 -1.457106442  0.1487447 -2.52392942 -0.02399523 -1.0349746  0.87666365
result.2.4 -0.848879365  0.7521024  0.16790915  0.47112444  0.8886361 -0.12733039
result.7.4 -0.003350467  0.4021858 -1.80031445 -1.42399232  1.0507765 -0.36193846

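【Question discussion】:

How do you want the data rearranged? Your example does not show how the input maps to the output.
It really is just going from [[30]][10,5] to [[5]][10,30].
I have not found any of the explanation very clear, but I suspect you may find the package (and function) abind helpful, followed by the function @987654329 @.
Is the performance being hurt by the parallel overhead? You are doing essentially no work in each iteration, yet dispatching it in parallel. Beyond that, as far as I can tell, do you want to change [[30]][1000000,5] into [[5]][1000000,30]?

Tags: r matrix foreach parallel-processing dataframe

【Solution 1】:

The plyr package is designed for this kind of problem, thanks to alply. The idea is: unlist your list, put it into an array in the appropriate way, then use alply to turn this array into a list of matrices.

An example that converts a list of 2 matrices of size 3x5 into a list of 5 matrices of size 2x3: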

标签： r matrix foreach parallel-processing dataframe

【解决方案1】：

感谢alply，包plyr 是针对这个问题设计的。这个想法是：取消列出您的列表，以适当的方式将其放入数组中，然后使用 alply 将此数组转换为矩阵列表。

将2 矩阵3x5 列表转换为5 矩阵2x3 列表的示例：

library(plyr)

lst = list(matrix(1:15, ncol=5), matrix(10:24, ncol=5))

alply(array(unlist(lst), c(2,3,5)),3)

#$`1`
#     [,1] [,2] [,3]
#[1,]    1    3    5
#[2,]    2    4    6

#$`2`
#     [,1] [,2] [,3]
#[1,]    7    9   11
#[2,]    8   10   12

#$`3`
#     [,1] [,2] [,3]
#[1,]   13   15   11
#[2,]   14   10   12

#$`4`
#     [,1] [,2] [,3]
#[1,]   13   15   17
#[2,]   14   16   18

#$`5`
#     [,1] [,2] [,3]
#[1,]   19   21   23
#[2,]   20   22   24
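
Note that array(unlist(lst), c(2,3,5)) fills the array in storage order, so each slice is simply the next six values of the flattened vector rather than one column taken from every input matrix. If the goal is the rearrangement described in the question (output k collects column k of every input), a base R sketch might look like the following. It assumes all matrices share the same dimensions, and the name by_column is made up for illustration.

```r
# Example data: 2 matrices of 3 x 5, as in the answer above
lst <- list(matrix(1:15, ncol = 5), matrix(10:24, ncol = 5))

# Stack the list into a 3 x 5 x 2 array (rows, columns, list index)
arr <- array(unlist(lst, use.names = FALSE),
             dim = c(nrow(lst[[1]]), ncol(lst[[1]]), length(lst)))

# Swap the last two dimensions: rows x list index x columns
arr <- aperm(arr, c(1, 3, 2))

# Slice along the third dimension: one 3 x 2 matrix per original column
by_column <- lapply(seq_len(dim(arr)[3]), function(k) arr[, , k])

by_column[[1]]
#      [,1] [,2]
# [1,]    1   10
# [2,]    2   11
# [3,]    3   12
```

Applied to the question's data, the same steps would give 5 matrices of 10 x 30. The unlist() call still creates one large intermediate vector, so this addresses correctness rather than memory use.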

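【Discussion】:

Thanks. This works great at small scale. However, when I go up to 5,000,000 paths, the vector created by unlist is too large (5.6 GB). Is there a way to do this without unlist()?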
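
One way to avoid the single large intermediate vector is to build each output matrix directly from the list. The sketch below uses base R only and runs sequentially, without foreach. It assumes a list named daily (the name used inside the question's function) of equally sized numeric matrices, and the sizes are toy values. Whether it fits in memory at 5,000,000 paths depends on the machine, since the five results together still hold as many values as the input.

```r
set.seed(1)
# Toy stand-in for the real data: 30 matrices of 10 paths x 5 assets
daily <- replicate(30, matrix(rnorm(10 * 5), nrow = 10, ncol = 5),
                   simplify = FALSE)

n_paths  <- nrow(daily[[1]])
n_assets <- ncol(daily[[1]])

# For each asset k, pull column k out of every matrix.
# vapply binds the 30 length-10 columns into one 10 x 30 matrix.
by_asset <- lapply(seq_len(n_assets), function(k) {
  vapply(daily, function(m) m[, k], numeric(n_paths))
})

dim(by_asset[[1]])
```

Column t of by_asset[[k]] is column k of daily[[t]], which is the [[5]][10,30] layout asked for.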

【Solution 2】:

I would convert your whole list into one big vector and then re-dimension it.

For my solution I used the following matrix:

[[28]]
        [,1] [,2] [,3] [,4] [,5]
  [1,]    1   11   21   31   41
  [2,]    2   12   22   32   42
  [3,]    3   13   23   33   43
  [4,]    4   14   24   34   44
  [5,]    5   15   25   35   45
  [6,]    6   16   26   36   46
  [7,]    7   17   27   37   47
  [8,]    8   18   28   38   48
  [9,]    9   19   29   39   49
 [10,]   10   20   30   40   50

repeated thirty times. This is the variable orig. My code:

flattened.vec <- unlist(orig)  #flatten the list of matrices into one big vector
dim(flattened.vec) <-c(10,150) #need to rearrange the vector so the re-shape comes out right
transposed.matrix <- t(flattened.vec) #transposing to make sure right elements go to the right place
new.matrix.list <- split(transposed.matrix,cut(seq_along(transposed.matrix)%%5, 10, labels = FALSE))  #split the big, transposed matrix into 5 10x30 matrices

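This code gives you 5 vectors of 300 elements each. To turn each one into a 10x30 matrix, set its dimensions to c(30,10) and then apply t() to it, for example inside a foreach loop (I would normally use an apply function, and I am not familiar with the foreach library).

The final result for one of the 5 matrices after doing this: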

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17]
 [1,]    1    1    1    1    1    1    1    1    1     1     1     1     1     1     1     1     1
 [2,]    2    2    2    2    2    2    2    2    2     2     2     2     2     2     2     2     2
 [3,]    3    3    3    3    3    3    3    3    3     3     3     3     3     3     3     3     3
 [4,]    4    4    4    4    4    4    4    4    4     4     4     4     4     4     4     4     4
 [5,]    5    5    5    5    5    5    5    5    5     5     5     5     5     5     5     5     5
 [6,]    6    6    6    6    6    6    6    6    6     6     6     6     6     6     6     6     6
 [7,]    7    7    7    7    7    7    7    7    7     7     7     7     7     7     7     7     7
 [8,]    8    8    8    8    8    8    8    8    8     8     8     8     8     8     8     8     8
 [9,]    9    9    9    9    9    9    9    9    9     9     9     9     9     9     9     9     9
[10,]   10   10   10   10   10   10   10   10   10    10    10    10    10    10    10    10    10

        [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30]
 [1,]     1     1     1     1     1     1     1     1     1     1     1     1     1
 [2,]     2     2     2     2     2     2     2     2     2     2     2     2     2
 [3,]     3     3     3     3     3     3     3     3     3     3     3     3     3
 [4,]     4     4     4     4     4     4     4     4     4     4     4     4     4
 [5,]     5     5     5     5     5     5     5     5     5     5     5     5     5
 [6,]     6     6     6     6     6     6     6     6     6     6     6     6     6
 [7,]     7     7     7     7     7     7     7     7     7     7     7     7     7
 [8,]     8     8     8     8     8     8     8     8     8     8     8     8     8
 [9,]     9     9     9     9     9     9     9     9     9     9     9     9     9
[10,]    10    10    10    10    10    10    10    10    10    10    10    10    10
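
The reshaping step can be sketched as follows. Each element returned by split() is a plain vector of 300 values in which the 30 matrices vary fastest and the 10 rows slowest, so filling a 10-row matrix by row (equivalent to setting dim to c(30, 10) and transposing) gives the 10 x 30 layout shown above. The construction of orig is an assumption based on the description: the same 10 x 5 matrix repeated thirty times.

```r
# Assumed test data: the 10 x 5 matrix holding 1:50, repeated thirty times
orig <- replicate(30, matrix(1:50, nrow = 10, ncol = 5), simplify = FALSE)

# Steps from the answer above
flattened.vec <- unlist(orig)
dim(flattened.vec) <- c(10, 150)
transposed.matrix <- t(flattened.vec)
new.matrix.list <- split(transposed.matrix,
                         cut(seq_along(transposed.matrix) %% 5, 10, labels = FALSE))

# Reshape every 300-element vector into a 10 x 30 matrix, filling by row
reshaped <- lapply(new.matrix.list, matrix, nrow = 10, byrow = TRUE)

dim(reshaped[[1]])
```

Because the groups are keyed on the element index modulo 5, the first list element corresponds to column 5 of the inputs and the remaining four to columns 1 through 4, so the list may need reordering before use.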

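By the way, this is probably what the plyr package already does internally (see the answer posted by Colonel Beauvel), only done by hand rather than with an external library.

【Discussion】:

Thanks. This works great at small scale. However, when I go up to 5,000,000 paths, the vector created by unlist is too large (5.6 GB). Is there a way to do this without unlist()?
Does the data have to come in list form? From my testing, unlist() does seem to be slow (setting use.names=FALSE helps a little, but not much). However, if you can start from a 3-dimensional array instead, this inefficiency goes away (and the storage needed for your data set drops by about 10%).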
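
If the data can be produced as a three-dimensional array from the start, no list and no unlist() step are needed, and each per-asset matrix is a single indexing operation. A minimal sketch, assuming a hypothetical array sims with dimensions paths x assets x time steps (the name, the helper asset_paths and the random data are illustrative only):

```r
set.seed(42)
n_paths  <- 10
n_assets <- 5
n_steps  <- 30

# Hypothetical simulation output stored directly as paths x assets x steps
sims <- array(rnorm(n_paths * n_assets * n_steps),
              dim = c(n_paths, n_assets, n_steps))

# Asset k across all time steps: a 10 x 30 matrix
asset_paths <- function(arr, k) arr[, k, ]

# All assets at once, as a list of 5 matrices
by_asset <- lapply(seq_len(n_assets), function(k) asset_paths(sims, k))

dim(by_asset[[2]])
```

Indexing on demand with asset_paths(sims, k) avoids holding a second full copy of the data when only one asset is needed at a time.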

【Solution 3】:

I believe you are looking for the answer to this question: Function to split a matrix into sub-matrices in R

You just need to use do.call(rbind, matlist) as the input to those functions.
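
As a rough illustration of the stacking step, do.call(rbind, matlist) turns the 30 matrices into a single 300 x 5 matrix with one column per asset. The splitting step shown here is a plain matrix() call chosen for illustration, not one of the functions from the linked question, and matlist is toy data.

```r
# Toy data: 30 matrices of 10 x 5, each with distinct values
matlist <- lapply(1:30, function(step) {
  matrix(step * 100 + 1:50, nrow = 10, ncol = 5)
})

# Stack the matrices on top of each other: 300 rows x 5 columns
stacked <- do.call(rbind, matlist)

# Column k holds asset k for step 1, then step 2, and so on,
# so folding it into 10 rows yields one 10 x 30 matrix per asset
by_asset <- lapply(seq_len(ncol(stacked)), function(k) {
  matrix(stacked[, k], nrow = nrow(matlist[[1]]))
})

dim(by_asset[[1]])
```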

【Discussion】: