Summing rows of a matrix based on column index: answers

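【Question title】: Summing rows of a matrix based on column index
【Posted】: 2018-03-28 10:21:06
【Question description】:

I am trying to go from a matrix whose columns "belong together" in groups to a matrix holding the row sums of each group's sub-matrix, i.e. from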

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
[1,]    1    5    9   13   17   21   25   29   33    37    41    45    49    53    57    61
[2,]    2    6   10   14   18   22   26   30   34    38    42    46    50    54    58    62
[3,]    3    7   11   15   19   23   27   31   35    39    43    47    51    55    59    63
[4,]    4    8   12   16   20   24   28   32   36    40    44    48    52    56    60    64

to

     [,1] [,2] [,3] [,4] [,5]
[1,]   15   30   46  185  220
[2,]   18   32   48  190  224
[3,]   21   34   50  195  228
[4,]   24   36   52  200  232

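I think there must be a more elegant and faster way to do this than looping over the index as below (in particular, my real matrices are more like 4000 by several thousand).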

example <- matrix(1:64, nrow=4)
myindex <- c(1,1,1,2,2,3,3,4,4,4,4,4,5,5,5,5)
summed <- matrix( rep(unique(myindex), each=dim(example)[1]), nrow=dim(example)[1])
for (i in 1:length(unique(myindex))){
  summed[,i] <- apply(X=example[,(myindex==i)], MARGIN=1, FUN=sum)
}

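It is probably my lack of experience with apply and tapply that keeps me from figuring this out. Fast dplyr approaches are of course welcome too.

【Question discussion】:

Tags: r matrix apply tapply rowsum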

【Solution 1】:

We can use a one-liner with sapply:

sapply(unique(myindex), function(x) rowSums(example[, which(myindex == x), drop = FALSE]))

     [,1] [,2] [,3] [,4] [,5]
[1,]   15   30   46  185  220
[2,]   18   32   48  190  224
[3,]   21   34   50  195  228
[4,]   24   36   52  200  232

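We let sapply loop over all unique values of myindex and use which to select the columns that go into rowSums.

Edit: added drop = FALSE so that a group consisting of a single column is not simplified to a vector. Thanks to @mt1022 for pointing out the bug!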
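To illustrate why drop = FALSE matters, here is a minimal sketch on a made-up matrix (m and idx are hypothetical, not from the question) in which one group has only a single column:

```r
# Small example in which group 1 and group 2 each consist of a single column.
m   <- matrix(1:12, nrow = 3)   # columns: 1:3, 4:6, 7:9, 10:12
idx <- c(1, 2, 3, 3)

# Without drop = FALSE a one-column selection collapses to a plain vector,
# which rowSums() rejects because it needs at least two dimensions.
dim(m[, idx == 2])                 # NULL
dim(m[, idx == 2, drop = FALSE])   # 3 1

res <- sapply(unique(idx), function(g) rowSums(m[, idx == g, drop = FALSE]))
res
```

With drop = FALSE every selection stays a matrix, so the same call works regardless of group size.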

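【Discussion】:

Thanks for the quick reply. On my 4000 by 3020 example it consistently seems to be the fastest of the 3 suggested answers, so I have marked it as the accepted answer.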

【Solution 2】:

We can also do this by splitting the columns into groups:

sapply(split.default(as.data.frame(example), myindex), rowSums)
#     1  2  3   4   5
#[1,] 15 30 46 185 220
#[2,] 18 32 48 190 224
#[3,] 21 34 50 195 228
#[4,] 24 36 52 200 232
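The intermediate step may be easier to follow when inspected on its own. The sketch below reuses example and myindex from the question; the name groups is only introduced here for illustration:

```r
example <- matrix(1:64, nrow = 4)
myindex <- c(1,1,1,2,2,3,3,4,4,4,4,4,5,5,5,5)

# A data frame is a list of columns, so split.default() splits it column-wise
# (plain split() on a data frame dispatches to split.data.frame and splits rows).
groups <- split.default(as.data.frame(example), myindex)

names(groups)                            # one list element per group label
n_cols <- sapply(groups, ncol)           # number of columns in each group
second <- unname(rowSums(groups[["2"]])) # row sums of the second group
second
```

Each list element is a small data frame, and sapply then binds the per-group row sums into the columns of the result.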

【Discussion】:

【Solution 3】:

Another approach...

example <- matrix(1:64, nrow=4)
myindex <- c(1,1,1,2,2,3,3,4,4,4,4,4,5,5,5,5)

summed <- t(apply(example,1,cumsum))
summed <- summed[,cumsum(rle(myindex)$lengths)]
summed[,-1] <- t(apply(summed,1,diff))
summed

     [,1] [,2] [,3] [,4] [,5]
[1,]   15   30   46  185  220
[2,]   18   32   48  190  224
[3,]   21   34   50  195  228
[4,]   24   36   52  200  232
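The answer gives no explanation, so here is a hedged walk-through of the same idea on the first row only. The names row1, ends, running and sums are introduced for illustration. Note that rle works on runs, so this appears to assume that equal values of myindex sit next to each other:

```r
myindex <- c(1,1,1,2,2,3,3,4,4,4,4,4,5,5,5,5)
row1    <- seq(1, 61, by = 4)            # first row of `example`

ends    <- cumsum(rle(myindex)$lengths)  # last column of each group: 3 5 7 12 16
running <- cumsum(row1)[ends]            # running total at each group's last column
sums    <- c(running[1], diff(running))  # differences give the per-group sums
sums                                     # 15 30 46 185 220
```

The answer applies the same three steps to all rows at once via apply and t.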

【Discussion】:

【Solution 4】:

An alternative using matrix multiplication (less efficient for large datasets):

x <- matrix(0, nrow = ncol(example), ncol = max(myindex))
x[cbind(1:ncol(example), myindex)] <- 1
example %*% x

#      [,1] [,2] [,3] [,4] [,5]
# [1,]   15   30   46  185  220
# [2,]   18   32   48  190  224
# [3,]   21   34   50  195  228
# [4,]   24   36   52  200  232
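For readers unfamiliar with indexing by a two-column matrix, the following sketch looks at the indicator matrix x by itself. It assumes, as the answer's code does, that myindex holds the positive integers 1 to the number of groups; sizes and one_each are hypothetical names:

```r
myindex <- c(1,1,1,2,2,3,3,4,4,4,4,4,5,5,5,5)

x <- matrix(0, nrow = length(myindex), ncol = max(myindex))
# Each row of the index matrix is a (row, column) pair: original column k
# gets a 1 in the column of the group it belongs to.
x[cbind(seq_along(myindex), myindex)] <- 1

sizes    <- colSums(x)            # group sizes: 3 2 2 5 4
one_each <- all(rowSums(x) == 1)  # every original column is in exactly one group
```

Multiplying example by x therefore adds up, for every row, exactly the columns that share a group.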

Here is a benchmark on sample data that matches the size of the real data:

library(microbenchmark)

n_row <- 4000
n_col <- 3020
example <- matrix(rnorm(n_row * n_col), nrow = n_row)
myindex <- ceiling((1:n_col)/5)

microbenchmark(
    matrix = {
        x <- matrix(0, nrow = ncol(example), ncol = max(myindex))
        x[cbind(1:ncol(example), myindex)] <- 1
        example %*% x
    },
    split = {  # by akrun
        sapply(split.default(as.data.frame(example), myindex), rowSums)
    },
    which = {  # by LAP
        sapply(unique(myindex), function(x) rowSums(example[, which(myindex == x)]))
    },
    times = 10
)

# Unit: milliseconds
#    expr       min        lq     mean    median       uq      max neval
#  matrix 982.55727 989.65177 992.7295 992.91230 997.3704 999.0066    10
#   split 162.13377 162.57711 194.5668 167.92963 182.5335 403.8740    10
#   which  90.28227  94.82681 119.3977  96.03701 103.1125 316.9170    10
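Before comparing timings it can be worth confirming that the candidates return the same numbers. The sketch below does this on the small example from the question; the rowsum line is an extra base R cross-check that does not appear in the thread, and res and same are hypothetical names:

```r
example <- matrix(1:64, nrow = 4)
myindex <- c(1,1,1,2,2,3,3,4,4,4,4,4,5,5,5,5)

x <- matrix(0, nrow = ncol(example), ncol = max(myindex))
x[cbind(seq_len(ncol(example)), myindex)] <- 1

res <- list(
  matrix = example %*% x,
  split  = sapply(split.default(as.data.frame(example), myindex), rowSums),
  which  = sapply(unique(myindex),
                  function(g) rowSums(example[, which(myindex == g), drop = FALSE])),
  rowsum = t(rowsum(t(example), myindex))   # base R cross-check, not from the thread
)

# Drop dimnames before comparing; all.equal tolerates integer vs double storage.
same <- sapply(res, function(r) isTRUE(all.equal(unname(r), unname(res$matrix))))
same
```

If any entry of same were FALSE, the timings would be comparing different computations.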

【Discussion】:

Thanks. The microbenchmark package is really interesting. Interestingly, with my real example (4000 by 3020) I end up with timings (unit: milliseconds) of roughly 13,000 to 13,500 for matrix, roughly 2,400 to 3,200 for split, and roughly 200 to 380 for which.
@Björn, the matrix approach seems far too slow on the real dataset. I would have expected it to be faster.
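One possible reason is that the dense indicator matrix consists almost entirely of zeros. As a sketch only, and assuming the Matrix package is available, the indicator can be stored sparsely instead. Whether this actually helps on a given dataset would need to be benchmarked, and x_sparse and summed are names introduced here:

```r
library(Matrix)  # recommended package, normally installed alongside R

example <- matrix(as.numeric(1:64), nrow = 4)
myindex <- c(1,1,1,2,2,3,3,4,4,4,4,4,5,5,5,5)

# Same 0/1 indicator matrix as in the answer, but storing only the non-zero entries.
x_sparse <- sparseMatrix(i = seq_along(myindex), j = myindex, x = 1,
                         dims = c(length(myindex), max(myindex)))

summed <- as.matrix(example %*% x_sparse)  # convert back to an ordinary dense matrix
summed
```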