【问题标题】：Efficiency of matrix rowSums() vs. colSums() in R vs Rcpp vs ArmadilloR 与 Rcpp 与犰狳中矩阵 rowSums() 与 colSums() 的效率
【发布时间】：2019-01-17 08:51:42
【问题描述】：

背景

来自 R 编程，我正在使用 Rcpp 扩展为 C/C++ 形式的编译代码。作为循环交换效果的实践练习（通常只是 C/C++），我为具有 Rcpp 的矩阵实现了 R 的 rowSums() 和 colSums() 函数的等价物（我知道这些存在作为 Rcpp 糖和犰狳 - 这只是一个练习）。

问题

我在this matsums.cpp file 中有rowSums() 和colSums() 的C++ 实现以及Rcpp sugar 和arma::sum() 版本。我的只是这样的简单循环：

NumericVector Cpp_colSums(const NumericMatrix& x) {
  int nr = x.nrow(), nc = x.ncol();
  NumericVector ans(nc);
  for (int j = 0; j < nc; j++) {
    double sum = 0.0;
    for (int i = 0; i < nr; i++) {
      sum += x(i, j);
    }
    ans[j] = sum;
  }
  return ans;
}

NumericVector Cpp_rowSums(const NumericMatrix& x) {
  int nr = x.nrow(), nc = x.ncol();
  NumericVector ans(nr);
  for (int j = 0; j < nc; j++) {
    for (int i = 0; i < nr; i++) {
      ans[i] += x(i, j);
    }
  }
  return ans;
}

（R 矩阵以列为主，因此外循环中的列应该是更有效的方法。这就是我最初测试的。）

在对这些进行基准测试时，我遇到了一些我没有预料到的事情：行和和列和之间存在明显的性能差异（请参阅下面的基准）：

使用内置 R 函数，colSums() 的速度大约是 rowSums() 的两倍。
使用我自己的 Rcpp 和糖版本，情况正好相反：rowSums() 的速度大约是 colSums() 的两倍。
最后，添加 Armadillo 实现后，操作大致相等（col sum 可能会快一点，因为我预计它们也会出现在 R 中）。

所以我的主要问题是：为什么 Cpp_rowSums() 比 Cpp_colSums() 快得多？

作为次要兴趣，我也很好奇为什么在 R 实现中会颠倒相同的差异。（我浏览了the C source，但无法真正找出显着差异。）（第三，犰狳为什么能获得相同的性能？）

基准

我在10,000 x 10,000 对称矩阵上测试了这两个函数的所有 4 种实现：

Rcpp::sourceCpp("matsums.cpp")

set.seed(92136)
n <- 1e4 # build n x n test matrix
x <- matrix(rnorm(n), 1, n)
x <- crossprod(x, x) # symmetric

bench::mark(
  rowSums(x),
  colSums(x),
  Cpp_rowSums(x),
  Cpp_colSums(x),
  Sugar_rowSums(x),
  Sugar_colSums(x),
  Arma_rowSums(x),
  Arma_colSums(x)
)[, 1:7]

#> # A tibble: 8 x 7
#>   expression            min     mean   median      max `itr/sec` mem_alloc
#>   <chr>            <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 rowSums(x)        192.2ms  207.9ms  194.6ms  236.9ms      4.81    78.2KB
#> 2 colSums(x)         93.4ms   97.2ms   96.5ms  101.3ms     10.3     78.2KB
#> 3 Cpp_rowSums(x)     73.5ms   76.3ms     76ms   80.4ms     13.1     80.7KB
#> 4 Cpp_colSums(x)    126.5ms  127.6ms  126.8ms  130.3ms      7.84    80.7KB
#> 5 Sugar_rowSums(x)   73.9ms   75.6ms   74.3ms   79.4ms     13.2     80.7KB
#> 6 Sugar_colSums(x)  124.2ms  125.8ms  125.6ms  127.9ms      7.95    80.7KB
#> 7 Arma_rowSums(x)    73.2ms   74.7ms   73.9ms   79.3ms     13.4     80.7KB
#> 8 Arma_colSums(x)    62.8ms   64.4ms   63.7ms   69.6ms     15.5     80.7KB

（同样，你可以找到C++源文件matsums.cpphere。）

平台：

> sessioninfo::platform_info()
 setting  value                       
 version  R version 3.5.1 (2018-07-02)
 os       Windows >= 8 x64            
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  English_United States.1252  
 tz       Europe/Helsinki             
 date     2018-08-09

更新

进一步调查，我还使用 R 的传统 C 接口编写了相同的函数：源代码是 here。我compiled the functions 和R CMD SHLIB，并再次测试：行总和比列总和快的相同现象持续存在（benchmarks）。然后我还查看了the disassembly with objdump，但在我看来（我对 asm 的理解非常有限）编译器并没有真正优化主循环体（rows，cols）从 C 代码开始，或者？

【问题讨论】：

标签： c++ r performance matrix rcpp

【解决方案1】：

“为什么 Cpp_rowSums() 比 Cpp_colSums() 快得多？” - 在获取“主要行”时，CPU 预取器可以预测您在做什么，并在您需要之前将您需要的下一组数据从主内存获取到 CPU 缓存。这可以加快您访问数据的速度。

当您访问“列专业”时，预取器会更难预测您接下来需要什么，因此它不会像以前那样有效地将东西填充到缓存内存中（如果有的话） - 这会减慢你倒下了。

CPU 喜欢线性访问数据。如果你不做他们喜欢做的事，你就要付出缓存未命中和主内存访问延迟的代价。

【讨论】：

和/或您想要的下一个项目已经被缓存，因为它与最后一个项目在同一缓存行中
矩阵按列存储；获取主要行时，您将什么称为对数据的线性访问？
谢谢！正如其他人所提到的，R 矩阵（和犰狳也是）以列为主存储，这就是为什么我在外循环中遍历列和行总和的原因。然后应该始终尽可能线性地访问数据，是吗？

【解决方案2】：

首先，让我在我的笔记本电脑上显示计时统计信息。我使用了一个 5000 x 5000 矩阵，足以进行基准测试，microbenchmark 包用于 100 次评估。

Unit: milliseconds
             expr       min        lq      mean    median        uq       max
       colSums(x)  71.40671  71.64510  71.80394  71.72543  71.80773  75.07696
   Cpp_colSums(x)  71.29413  71.42409  71.65525  71.48933  71.56241  77.53056
 Sugar_colSums(x)  73.05281  73.19658  73.38979  73.25619  73.31406  76.93369
  Arma_colSums(x)  39.08791  39.34789  39.57979  39.43080  39.60657  41.70158
       rowSums(x) 177.33477 187.37805 187.57976 187.49469 187.73155 194.32120
   Cpp_rowSums(x)  54.00498  54.37984  54.70358  54.49165  54.73224  64.16104
 Sugar_rowSums(x)  54.17001  54.38420  54.73654  54.56275  54.75695  61.80466
  Arma_rowSums(x)  49.54407  49.77677  50.13739  49.90375  50.06791  58.29755

R 核心中的 C 代码并不总是比我们自己编写的更好。 Cpp_rowSums 比 rowSums 快表明了这一点。我觉得自己没有能力解释为什么 R 的版本比它应该的要慢。我将重点关注：我们如何进一步优化我们自己的 colSums 和 rowSums 以击败犰狳。请注意，我编写 C，使用 R 的旧 C 接口并使用 R CMD SHLIB 进行编译。

`colSums` 和 `rowSums` 之间有什么实质性区别吗？

如果我们有一个比 CPU 缓存容量大得多的n x n 矩阵，colSums 将n x n 数据从 RAM 加载到缓存，但rowSums 加载两倍，即@987654336 @。

想想保存总和的结果向量：这个长度-n 向量从 RAM 加载到缓存中多少次？对于colSums，它只加载一次，但对于rowSums，它加载n 次。每次向其中添加矩阵列时，它都会被加载到缓存中，但由于它太大而被驱逐。

对于大号n：

colSums 导致 n x n + n 数据从 RAM 加载到缓存；
rowSums 导致 n x n + n x n 数据从 RAM 加载到缓存。

换句话说，rowSums 在理论上内存效率较低，并且可能更慢。

如何提高`colSums`的性能？

由于 RAM 和缓存之间的数据流很容易达到最佳状态，因此唯一的改进是循环展开。将内部循环（求和循环）展开 2 的深度就足够了，我们将看到 2 倍的提升。

循环展开是因为它启用了 CPU 的指令流水线。如果我们每次迭代只做一个加法，那么流水线是不可能的；通过两个添加，这种指令级并行性开始起作用。我们也可以将循环展开深度为 4，但我的经验是深度为 2 的展开足以从循环展开中获得大部分好处。

如何提高`rowSums`的性能？

优化数据流是第一步。我们需要先做缓存阻塞，以减少从2 x n x n到n x n的数据传输。

将此n x n 矩阵切成多个行块：每个块为2040 x n（最后一个块可能更小），然后逐块应用普通的rowSums 块。对于每个块，累加器向量的长度为 2040，大约是 32KB CPU 缓存可以容纳的长度的一半。对于添加到此累加器向量的矩阵列，另一半则相反。这样，累加器向量可以保存在缓存中，直到处理完该块中的所有矩阵列。结果，累加器向量只加载到缓存中一次，因此整体内存性能与colSums一样好。

现在我们可以进一步对每个块中的rowSums 应用循环展开。将外循环和内循环都展开 2 的深度，我们将看到提升。展开外循环后，块大小应减少到 1360，因为现在我们需要缓存中的空间来保存每个外循环迭代的三个长度为 1360 的向量。

C 代码：让我们打败犰狳

使用循环展开编写代码可能是一项令人讨厌的工作，因为我们现在需要为一个函数编写几个不同的版本。

对于colSums，我们需要两个版本：

colSums_1x1：内外循环都以深度 1 展开，即这是一个没有循环展开的版本；
colSums_2x1: 没有外循环展开，而内循环展开深度为 2。

对于rowSums，我们最多可以有四个版本，rowSums_sxt，其中s = 1 or 2 是内循环的展开深度，t = 1 or 2 是外循环的展开深度。

如果我们一个一个地编写每个版本，代码编写可能会非常乏味。经过多年或对此感到沮丧后，我使用内联模板函数和宏开发了“自动代码/版本生成”技巧。

#include <stdlib.h>
#include <Rinternals.h>

static inline void colSums_template_sx1 (size_t s,
                                         double *A, size_t LDA,
                                         size_t nr, size_t nc,
                                         double *sum) {

  size_t nrc = nr % s, i;
  double *A_end = A + LDA * nc, a0, a1;

  for (; A < A_end; A += LDA) {
    a0 = 0.0; a1 = 0.0;  // accumulator register variables
    if (nrc > 0) a0 = A[0];  // is there a "fractional loop"?
    for (i = nrc; i < nr; i += s) {  // main loop of depth-s
      a0 += A[i];  // 1st iteration
      if (s > 1) a1 += A[i + 1];  // 2nd iteration
      }
    if (s > 1) a0 += a1;  // combine two accumulators
    *sum++ = a0;  // write-back
    }

  }

#define macro_define_colSums(s, colSums_sx1) \
SEXP colSums_sx1 (SEXP matA) { \
  double *A = REAL(matA); \
  size_t nrow_A = (size_t)nrows(matA); \
  size_t ncol_A = (size_t)ncols(matA); \
  SEXP result = PROTECT(allocVector(REALSXP, ncols(matA))); \
  double *sum = REAL(result); \
  colSums_template_sx1(s, A, nrow_A, nrow_A, ncol_A, sum); \
  UNPROTECT(1); \
  return result; \
  }

macro_define_colSums(1, colSums_1x1)
macro_define_colSums(2, colSums_2x1)

模板函数计算（在 R 语法中）sum <- colSums(A[1:nr, 1:nc]) 的矩阵 A 和 LDA（A 的前导维度）行。参数s 是内循环展开的版本控制。模板函数乍一看很糟糕，因为它包含许多if。但是，它被声明为static inline。如果通过将已知常量 1 或 2 传递给 s 来调用它，则优化编译器能够在编译时评估那些 if，消除无法访问的代码并删除“设置但未使用”的变量（寄存器已初始化、修改但未写回 RAM 的变量）。

宏用于函数声明。接受常量s 和函数名称，它会生成一个具有所需循环展开版本的函数。

以下是rowSums。

static inline void rowSums_template_sxt (size_t s, size_t t,
                                         double *A, size_t LDA,
                                         size_t nr, size_t nc,
                                         double *sum) {

  size_t ncr = nc % t, nrr = nr % s, i;
  double *A_end = A + LDA * nc, *B;
  double a0, a1;

  for (i = 0; i < nr; i++) sum[i] = 0.0;  // necessary initialization

  if (ncr > 0) {  // is there a "fractional loop" for the outer loop?
    if (nrr > 0) sum[0] += A[0];  // is there a "fractional loop" for the inner loop?
    for (i = nrr; i < nr; i += s) {  // main inner loop with depth-s
      sum[i] += A[i];
      if (s > 1) sum[i + 1] += A[i + 1];
      }
    A += LDA;
    }

  for (; A < A_end; A += t * LDA) {  // main outer loop with depth-t
    if (t > 1) B = A + LDA;
    if (nrr > 0) {  // is there a "fractional loop" for the inner loop?
      a0 = A[0]; if (t > 1) a0 += A[LDA];
      sum[0] += a0;
      }
    for(i = nrr; i < nr; i += s) {  // main inner loop with depth-s
      a0 = A[i]; if (t > 1) a0 += B[i];
      sum[i] += a0;
      if (s > 1) {
        a1 = A[i + 1]; if (t > 1) a1 += B[i + 1];
        sum[i + 1] += a1;
        }
      }
    }

  }

#define macro_define_rowSums(s, t, rowSums_sxt) \
SEXP rowSums_sxt (SEXP matA, SEXP chunk_size) { \
  double *A = REAL(matA); \
  size_t nrow_A = (size_t)nrows(matA); \
  size_t ncol_A = (size_t)ncols(matA); \
  SEXP result = PROTECT(allocVector(REALSXP, nrows(matA))); \
  double *sum = REAL(result); \
  size_t block_size = (size_t)asInteger(chunk_size); \
  size_t i, block_size_i; \
  if (block_size > nrow_A) block_size = nrow_A; \
  for (i = 0; i < nrow_A; i += block_size_i) { \
    block_size_i = nrow_A - i; if (block_size_i > block_size) block_size_i = block_size; \
    rowSums_template_sxt(s, t, A, nrow_A, block_size_i, ncol_A, sum); \
    A += block_size_i; sum += block_size_i; \
    } \
  UNPROTECT(1); \
  return result; \
  }

macro_define_rowSums(1, 1, rowSums_1x1)
macro_define_rowSums(1, 2, rowSums_1x2)
macro_define_rowSums(2, 1, rowSums_2x1)
macro_define_rowSums(2, 2, rowSums_2x2)

请注意，模板函数现在接受s 和t，并且要由宏定义的函数已应用行分块。

尽管我在代码中留下了一些 cmets，但代码可能仍然不容易理解，但我不能花更多时间更详细地解释。

要使用它们，请将它们复制并粘贴到名为“matSums.c”的 C 文件中，然后使用 R CMD SHLIB -c matSums.c 编译它。

对于R端，在“matSums.R”中定义如下函数。

colSums_zheyuan <- function (A, s) {
  dyn.load("matSums.so")
  if (s == 1) result <- .Call("colSums_1x1", A)
  if (s == 2) result <- .Call("colSums_2x1", A)
  dyn.unload("matSums.so")
  result
  }

rowSums_zheyuan <- function (A, chunk.size, s, t) {
  dyn.load("matSums.so")
  if (s == 1 && t == 1) result <- .Call("rowSums_1x1", A, as.integer(chunk.size))
  if (s == 2 && t == 1) result <- .Call("rowSums_2x1", A, as.integer(chunk.size))
  if (s == 1 && t == 2) result <- .Call("rowSums_1x2", A, as.integer(chunk.size))
  if (s == 2 && t == 2) result <- .Call("rowSums_2x2", A, as.integer(chunk.size))
  dyn.unload("matSums.so")
  result
  }

现在让我们做一个基准测试，再次使用 5000 x 5000 矩阵。

A <- matrix(0, 5000, 5000)

library(microbenchmark)
source("matSums.R")

microbenchmark("col0" = colSums(A),
               "col1" = colSums_zheyuan(A, 1),
               "col2" = colSums_zheyuan(A, 2),
               "row0" = rowSums(A),
               "row1" = rowSums_zheyuan(A, nrow(A), 1, 1),
               "row2" = rowSums_zheyuan(A, 2040, 1, 1),
               "row3" = rowSums_zheyuan(A, 1360, 1, 2),
               "row4" = rowSums_zheyuan(A, 1360, 2, 2))

在我的笔记本电脑上，我得到：

Unit: milliseconds
 expr       min        lq      mean    median        uq       max neval
 col0  65.33908  71.67229  71.87273  71.80829  71.89444 111.84177   100
 col1  67.16655  71.84840  72.01871  71.94065  72.05975  77.84291   100
 col2  35.05374  38.98260  39.33618  39.09121  39.17615  53.52847   100
 row0 159.48096 187.44225 185.53748 187.53091 187.67592 202.84827   100
 row1  49.65853  54.78769  54.78313  54.92278  55.08600  60.27789   100
 row2  49.42403  54.56469  55.00518  54.74746  55.06866  60.31065   100
 row3  37.43314  41.57365  41.58784  41.68814  41.81774  47.12690   100
 row4  34.73295  37.20092  38.51019  37.30809  37.44097  99.28327   100

注意循环展开如何加快colSums 和rowSums。并且通过全面优化（“col2”和“row4”），我们击败了犰狳（参见本答案开头的时序表）。

在这种情况下，行分块策略显然不会带来好处。让我们尝试一个包含数百万行的矩阵。

A <- matrix(0, 1e+7, 20)
microbenchmark("row1" = rowSums_zheyuan(A, nrow(A), 1, 1),
               "row2" = rowSums_zheyuan(A, 2040, 1, 1),
               "row3" = rowSums_zheyuan(A, 1360, 1, 2),
               "row4" = rowSums_zheyuan(A, 1360, 2, 2))

我明白了

Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval
 row1 604.7202 607.0256 617.1687 607.8580 609.1728 720.1790   100
 row2 514.7488 515.9874 528.9795 516.5193 521.4870 636.0051   100
 row3 412.1884 413.8688 421.0790 414.8640 419.0537 525.7852   100
 row4 377.7918 379.1052 390.4230 379.9344 386.4379 476.9614   100

在这种情况下，我们观察到缓存阻塞带来的好处。

最后的想法

基本上这个答案已经解决了所有问题，除了以下问题：

为什么 R 的 rowSums 效率低于应有水平。
为什么不进行任何优化，rowSums ("row1") 比 colSums ("col1") 快。

再次，我无法解释第一个，实际上我不在乎，因为我们可以轻松编写比 R 的内置版本更快的版本。

第二个绝对值得追求。我复制了我们讨论室中的 cmets 以供记录。

这个问题归结为：“为什么将单个向量相加比按元素相加两个向量要慢？”

我不时看到类似的现象。我第一次遇到这种奇怪的行为是在几年前，当我编写自己的矩阵-矩阵乘法时。我发现 DAXPY 比 DDOT 快。

DAXPY 这样做：y += a * x，其中x 和y 是向量，a 是标量； DDOT 这样做：a += x * y。

鉴于 DDOT 是一种归约操作，我希望它比 DAXPY 快。但是不，DAXPY 更快。

但是，只要我展开矩阵乘法的三重循环嵌套中的循环，DDOT 就比 DAXPY 快得多。

您的问题也发生了非常相似的事情。减少操作：a = x[1] + x[2] + ... + x[n] 比元素方式添加：y[i] += x[i] 慢。但是一旦循环展开，就失去了后者的优势。

由于没有证据，我不确定以下解释是否属实。

归约操作有一个依赖链，因此计算是严格串行的；另一方面，逐元素操作没有依赖链，因此 CPU 可能会做得更好。

一旦我们展开循环，每次迭代都有更多的算术要做，CPU 可以在这两种情况下进行流水线操作。然后可以观察到归约操作的真正优势。

回复Jaap关于使用`rowSums2`和`colSums2`来自`matrixStats`

仍然使用上面的 5000 x 5000 示例。

A <- matrix(0, 5000, 5000)

library(microbenchmark)
source("matSums.R")
library(matrixStats)  ## NEW

microbenchmark("col0" = base::colSums(A),
               "col*" = matrixStats::colSums2(A),  ## NEW
               "col1" = colSums_zheyuan(A, 1),
               "col2" = colSums_zheyuan(A, 2),
               "row0" = base::rowSums(A),
               "row*" = matrixStats::rowSums2(A),  ## NEW
               "row1" = rowSums_zheyuan(A, nrow(A), 1, 1),
               "row2" = rowSums_zheyuan(A, 2040, 1, 1),
               "row3" = rowSums_zheyuan(A, 1360, 1, 2),
               "row4" = rowSums_zheyuan(A, 1360, 2, 2))

Unit: milliseconds
 expr       min        lq      mean    median        uq       max neval
 col0  71.53841  71.72628  72.13527  71.81793  71.90575  78.39645   100
 col*  75.60527  75.87255  76.30752  75.98990  76.18090  87.07599   100
 col1  71.67098  71.86180  72.06846  71.93872  72.03739  77.87816   100
 col2  38.88565  39.03980  39.57232  39.08045  39.16790  51.39561   100
 row0 187.44744 187.58121 188.98930 187.67168 187.86314 206.37662   100
 row* 158.08639 158.26528 159.01561 158.34864 158.62187 174.05457   100
 row1  54.62389  54.81724  54.97211  54.92394  55.04690  56.33462   100
 row2  54.15409  54.44208  54.78769  54.59162  54.76073  60.92176   100
 row3  41.43393  41.63886  42.57511  41.73538  41.81844 111.94846   100
 row4  37.07175  37.25258  37.45033  37.34456  37.47387  43.14157   100

我没有看到 rowSums2 和 colSums2 的性能优势。

【讨论】：

我想为在 Windows 10 上使用 R 4.xx 的用户（如我）留下评论。为了使上述代码可运行，我们必须设置环境变量（参见 support.shotgunsoftware.com/hc/en-us/articles/…） .在系统变量路径中添加“C:\rtools40\usr\bin; C:\rtools40\mingw64\x86_64-w64-mingw32; C:\Program Files\R\R-4.0.2\bin\x64”。然后，打开命令窗口并转到您的文件夹。输入“R CMD SHLIB -c matSums.c”并输入。我们将在文件夹中获取“matSums.dll”文件。
接下来，在 "matSums.R" 中，将所有 "dyn.load("matSums.so")" 更改为 "dyn.load("matSums.dll")"。最后，“source("matSums.r")”。完成！
写得很好。需要注意的是，在支持它的系统中（例如具有 x87 FPU 的系统），R 将使用长双精度累加器和相应的 x87 指令。所以这不是苹果对苹果。 IIUC 您的“row1”实际上没有展开，并且已经进行了大部分改进。我已经单独确认切换到双累加器会得到类似的结果。大概 CPU 已经通过 OOO 在内部展开。我发现展开 4 列是 rowSums 的长双累加器的最佳选择。

【解决方案3】：

有一个非常棒的 R 包，名为 Rfast (here)，它提供行/列总和和均值的 C++ 实现等等。刚试了一下，比base包中各自的默认函数快很多。

【讨论】：

背景

问题

基准

更新

colSums 和 rowSums 之间有什么实质性区别吗？

如何提高colSums的性能？

如何提高rowSums的性能？

C 代码：让我们打败犰狳

最后的想法

回复Jaap关于使用rowSums2和colSums2来自matrixStats

`colSums` 和 `rowSums` 之间有什么实质性区别吗？

如何提高`colSums`的性能？

如何提高`rowSums`的性能？

回复Jaap关于使用`rowSums2`和`colSums2`来自`matrixStats`