【Question title】: r - Binding sparse matrices of different sizes on rows
【Posted】: 2017-03-30 12:15:05
【Question description】:

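I am trying to bind two sparse matrices of different sizes together using the Matrix package. The bind is on rows, with the column names used for matching.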

Table A:

ID     | AAAA   | BBBB   |
------ | ------ | ------ |
XXXX   | 1      | 2      |

Table B:

ID     | BBBB   | CCCC   |
------ | ------ | ------ |
YYYY   | 3      | 4      |

Binding tables A and B:

ID     | AAAA   | BBBB   | CCCC   |
------ | ------ | ------ | ------ |
XXXX   | 1      | 2      |        |
YYYY   |        | 3      | 4      |

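The goal is to insert a large number of small matrices into a single large matrix, to allow continuous querying and updates/inserts.

Neither the Matrix package nor the slam package seems to have a function that handles this.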
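
To illustrate why a plain rbind is not enough here, the following minimal sketch rebuilds tables A and B with sparseMatrix (my own construction for illustration, not code from the question). It shows that rbind stacks by column position and does not match column names:

```r
library(Matrix)

# Tables A and B from the question, built as sparse matrices
A <- sparseMatrix(i = c(1, 1), j = c(1, 2), x = c(1, 2), dims = c(1, 2),
                  dimnames = list("XXXX", c("AAAA", "BBBB")))
B <- sparseMatrix(i = c(1, 1), j = c(1, 2), x = c(3, 4), dims = c(1, 2),
                  dimnames = list("YYYY", c("BBBB", "CCCC")))

# rbind stacks by column position, so B's CCCC value lands in the second column
naive <- rbind(A, B)
dim(naive)
```

The result has only two columns, so the values of B end up under the wrong names. This is the misalignment that the answers below work around.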

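Similar questions have been asked in the past, but no solution seems to have been found:

Post 1: in-r-when-using-named-rows-can-a-sparse-matrix-column-be-added-concatenated

Post 2: bind-together-sparse-model-matrices-by-row-names

Any ideas on how to solve this would be much appreciated.

Best regards,

Fredrik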

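【Question discussion】:

Tags: r sparse-matrix

【Solution 1】:

For my purposes (very sparse matrices with millions of rows and tens of thousands of columns, where more than 99.9% of the values are empty), the other answers were still too slow. What worked is the code below, which may help others too: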

library(Matrix)

merge.sparse = function(listMatrixes) {
  # takes a list of sparse matrixes with different columns and adds them row wise

  allColnames <- sort(unique(unlist(lapply(listMatrixes, colnames))))
  matrixToReturn <- NULL
  for (currentMatrix in listMatrixes) {
    # position of each column of the current matrix within the combined columns
    newColLocations <- match(colnames(currentMatrix), allColnames)
    # (i, j, x) triplets of the stored entries, so indices and values stay aligned
    entries <- summary(currentMatrix)
    newMatrix <- sparseMatrix(i = entries$i, j = newColLocations[entries$j], x = entries$x,
                              dims = c(nrow(currentMatrix), length(allColnames)))
    if (is.null(matrixToReturn)) {
      matrixToReturn <- newMatrix
    } else {
      matrixToReturn <- rbind2(matrixToReturn, newMatrix)
    }
  }
  colnames(matrixToReturn) <- allColnames
  matrixToReturn
}
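
A small usage sketch of the same idea, applied to the two tables from the question. The helper name remap_cols and the example matrices are my own, hypothetical choices for illustration; the technique (remap the column indices, then bind the rows) is the one described above:

```r
library(Matrix)

# Hypothetical helper: re-express one sparse matrix in the combined column space
remap_cols <- function(M, allColnames) {
  entries <- summary(M)                       # (i, j, x) triplets
  newCols <- match(colnames(M), allColnames)  # old column -> new column position
  sparseMatrix(i = entries$i, j = newCols[entries$j], x = entries$x,
               dims = c(nrow(M), length(allColnames)),
               dimnames = list(rownames(M), allColnames))
}

A <- sparseMatrix(i = c(1, 1), j = c(1, 2), x = c(1, 2), dims = c(1, 2),
                  dimnames = list("XXXX", c("AAAA", "BBBB")))
B <- sparseMatrix(i = c(1, 1), j = c(1, 2), x = c(3, 4), dims = c(1, 2),
                  dimnames = list("YYYY", c("BBBB", "CCCC")))

allColnames <- sort(union(colnames(A), colnames(B)))
merged <- rbind(remap_cols(A, allColnames), remap_cols(B, allColnames))
merged
```

Because every block is brought into the same column space first, the final rbind lines up by position and by name at the same time.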

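【Discussion】:

【Solution 2】:

It appears necessary to add empty columns (columns of 0s) to the matrices so that they are compatible for rbind (matrices with the same column names, in the same order). The code below does this: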

# dummy data
library(Matrix)
set.seed(3344)
A = Matrix(matrix(rbinom(16, 2, 0.2), 4))
colnames(A)=letters[1:4]
B = Matrix(matrix(rbinom(9, 2, 0.2), 3))
colnames(B) = letters[3:5]

# finding what's missing
misA = colnames(B)[!colnames(B) %in% colnames(A)]
misB = colnames(A)[!colnames(A) %in% colnames(B)]

misAl = as.vector(numeric(length(misA)), "list")
names(misAl) = misA
misBl = as.vector(numeric(length(misB)), "list")
names(misBl) = misB

## adding missing columns to initial matrices
## (wrap A and B in list() so the Matrix objects reach cbind intact)
An = do.call(cbind, c(list(A), misAl))
Bn = do.call(cbind, c(list(B), misBl))[,colnames(An)]

# final bind
rbind(An, Bn)

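【Discussion】:

Thanks, a very fast solution. Merging two sparse matrices of dimensions 100,000x5 and 10x5 takes 8 milliseconds.
colnames(B)[!colnames(B) %in% colnames(A)] (etc.) is not very readable (nor very fast); I suggest replacing it with setdiff(colnames(B), colnames(A)) and so on.
@plijnzaad makes a good point, and offers a good alternative.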
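
A minimal sketch of the setdiff variant suggested above. The helper name pad_cols and the tiny example matrices are hypothetical, and padding with an all-zero sparse block is my own choice rather than the exact code of the answer:

```r
library(Matrix)

# Hypothetical helper: append all-zero sparse columns for the missing names
pad_cols <- function(M, missing) {
  if (length(missing) == 0) return(M)
  zeros <- sparseMatrix(i = integer(0), j = integer(0), x = numeric(0),
                        dims = c(nrow(M), length(missing)),
                        dimnames = list(rownames(M), missing))
  cbind(M, zeros)
}

A <- sparseMatrix(i = c(1, 2), j = c(1, 2), x = c(1, 2), dims = c(2, 2),
                  dimnames = list(NULL, c("a", "b")))
B <- sparseMatrix(i = 1, j = 2, x = 5, dims = c(1, 2),
                  dimnames = list(NULL, c("b", "c")))

An <- pad_cols(A, setdiff(colnames(B), colnames(A)))
# reorder the columns of B to match An before binding
Bn <- pad_cols(B, setdiff(colnames(A), colnames(B)))[, colnames(An), drop = FALSE]
res <- rbind(An, Bn)
res
```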

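【Solution 3】:

Starting from Valentin's answer above, I wrote my own merge.sparse function, which does the following:

Keeps the column and row names (and of course takes them into account when merging)
Keeps the original order of the row and column names, merging only those that are shared

The code below seems to do this: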

if (length(find.package(package="Matrix",quiet=TRUE))==0) install.packages("Matrix")
require(Matrix)

merge.sparse <- function(...) {

  cnnew <- character()   # combined column names, in order of first appearance
  rnnew <- character()   # combined row names, in order of first appearance
  x <- vector()          # stored values collected from all matrices
  i <- numeric()         # their row indices in the merged matrix
  j <- numeric()         # their column indices in the merged matrix

  for (M in list(...)) {

    cnold <- colnames(M)
    rnold <- rownames(M)

    cnnew <- union(cnnew,cnold)
    rnnew <- union(rnnew,rnold)

    # map the names of M to their positions in the merged matrix
    cindnew <- match(cnold,cnnew)
    rindnew <- match(rnold,rnnew)
    ind <- unname(which(M != 0,arr.ind=T))
    i <- c(i,rindnew[ind[,1]])
    j <- c(j,cindnew[ind[,2]])
    x <- c(x,M@x)
  }

  sparseMatrix(i=i,j=j,x=x,dims=c(length(rnnew),length(cnnew)),dimnames=list(rnnew,cnnew))
}

I tested it with the following data:

df1 <- data.frame(x=c("N","R","R","S","T","T","U"),y=c("N","N","M","X","X","Z","Z"))
M1 <- xtabs(~y+x,df1,sparse=T)
df2 <- data.frame(x=c("S","S","T","T","U","V","V","W","W","X"),y=c("N","M","M","K","Z","M","N","N","K","Z"))
M2 <- xtabs(~y+x,df2,sparse=T)
df3 <- data.frame(x=c("A","C","C","B"),y=c("N","M","Z","K"))
M3 <- xtabs(~y+x,df3,sparse=T)
df4 <- data.frame(x=c("N","R","R","S","T","T","U"),y=c("F","F","G","G","H","I","L"))
M4 <- xtabs(~y+x,df4,sparse=T)
df5 <- data.frame(x=c("K1","K2","K3","K4"),y=c("J1","J2","J3","J4"))
M5 <- xtabs(~y+x,df5,sparse=T)

This gives:

Ms <- merge.sparse(M1,M2,M3,M4,M5)
as.matrix(Ms)
#   N R S T U V W X A B C K1 K2 K3 K4
#M  0 1 1 1 0 1 0 0 0 0 1  0  0  0  0
#N  1 1 1 0 0 1 1 0 1 0 0  0  0  0  0
#X  0 0 1 1 0 0 0 0 0 0 0  0  0  0  0
#Z  0 0 0 1 2 0 0 1 0 0 1  0  0  0  0
#K  0 0 0 1 0 0 1 0 0 1 0  0  0  0  0
#F  1 1 0 0 0 0 0 0 0 0 0  0  0  0  0
#G  0 1 1 0 0 0 0 0 0 0 0  0  0  0  0
#H  0 0 0 1 0 0 0 0 0 0 0  0  0  0  0
#I  0 0 0 1 0 0 0 0 0 0 0  0  0  0  0
#L  0 0 0 0 1 0 0 0 0 0 0  0  0  0  0
#J1 0 0 0 0 0 0 0 0 0 0 0  1  0  0  0
#J2 0 0 0 0 0 0 0 0 0 0 0  0  1  0  0
#J3 0 0 0 0 0 0 0 0 0 0 0  0  0  1  0
#J4 0 0 0 0 0 0 0 0 0 0 0  0  0  0  1
Ms
#14 x 15 sparse Matrix of class "dgCMatrix"
#   [[ suppressing 15 column names ‘N’, ‘R’, ‘S’ ... ]]
#                                
#M  . 1 1 1 . 1 . . . . 1 . . . .
#N  1 1 1 . . 1 1 . 1 . . . . . .
#X  . . 1 1 . . . . . . . . . . .
#Z  . . . 1 2 . . 1 . . 1 . . . .
#K  . . . 1 . . 1 . . 1 . . . . .
#F  1 1 . . . . . . . . . . . . .
#G  . 1 1 . . . . . . . . . . . .
#H  . . . 1 . . . . . . . . . . .
#I  . . . 1 . . . . . . . . . . .
#L  . . . . 1 . . . . . . . . . .
#J1 . . . . . . . . . . . 1 . . .
#J2 . . . . . . . . . . . . 1 . .
#J3 . . . . . . . . . . . . . 1 .
#J4 . . . . . . . . . . . . . . 1

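I don't know why the column names are "suppressed" when displaying the merged sparse matrix Ms; converting it to a non-sparse matrix does bring them back, so...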
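
For what it is worth, the names are still stored in the object. As far as I can tell from the printSpMatrix documentation, the print method of the Matrix package simply leaves them out by default once the matrix has ten or more columns. A small sketch with a made-up 12-column matrix M (not the Ms above):

```r
library(Matrix)

# A made-up sparse matrix with enough columns to trigger the suppression
M <- sparseMatrix(i = 1:12, j = 1:12, x = rep(1, 12),
                  dimnames = list(NULL, paste0("col", 1:12)))

colnames(M)                        # the names are still there
printSpMatrix(M, col.names = TRUE) # ask for them explicitly when printing
```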

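Also, I noticed that when the same "coordinates" occur more than once, the sparse matrix holds the sum of the corresponding values in x (see row "Z", column "U", which is 1 in both M1 and M2). There may be a way to change this, but it is fine for my application.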
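
The summing of repeated coordinates can be seen in isolation with a tiny example. The sketch below uses made-up values; the use.last.ij argument of sparseMatrix is one documented way to keep only the last value instead of the sum:

```r
library(Matrix)

# The pair (1, 1) is given twice, with values 3 and 4
summed <- sparseMatrix(i = c(1, 1), j = c(1, 1), x = c(3, 4), dims = c(2, 2))
summed[1, 1]   # duplicates are added together by default

# Keep only the last value supplied for a repeated pair
last <- sparseMatrix(i = c(1, 1), j = c(1, 1), x = c(3, 4), dims = c(2, 2),
                     use.last.ij = TRUE)
last[1, 1]
```

Whether summing or overwriting is the right behavior depends on whether the merged matrices hold counts or measurements.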

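I am sharing this code in case someone else needs to merge sparse matrices this way, and in case someone can test it on large matrices and suggest performance improvements.

Edit

After checking this post, I found that summary makes it easier to extract information about the (non-zero) elements of a sparse matrix, without needing which.

So this part of my code above: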

ind <- unname(which(M != 0,arr.ind=T))
i <- c(i,rindnew[ind[,1]])
j <- c(j,cindnew[ind[,2]])
x <- c(x,M@x)

can be replaced with:

ind <- summary(M)
i <- c(i,rindnew[ind[,1]])
j <- c(j,cindnew[ind[,2]])
x <- c(x,ind[,3])
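
To make clear what summary returns, here is a minimal sketch on a made-up matrix. Each stored entry becomes one row with its row index i, its column index j and its value x, so indices and values cannot get out of step (negative values are included as well):

```r
library(Matrix)

# A made-up 3 x 2 sparse matrix with three stored entries, one of them negative
M <- sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 2), x = c(5, -1, 7), dims = c(3, 2))

ind <- summary(M)
names(ind)   # columns of the triplet table
ind
```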

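I don't know which of the two is more computationally efficient, or whether there is a simpler way to do this by changing the dimensions of the matrices and then adding them together, but this seems to work for me, so...

【Discussion】:

I tested all the approaches here and found your code to be the fastest.

【Solution 4】:

We can create an empty sparse matrix containing all the rows and columns, and then insert the values into it using subset assignment: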

my.bind = function(A, B){
  C = Matrix(0, nrow = NROW(A) + NROW(B), ncol = length(union(colnames(A), colnames(B))), 
             dimnames = list(c(rownames(A), rownames(B)), union(colnames(A), colnames(B))))
  C[rownames(A), colnames(A)] = A
  C[rownames(B), colnames(B)] = B
  return(C)
}

my.bind(A,B)
# 2 x 3 sparse Matrix of class "dgCMatrix"
#      AAAA BBBB CCCC
# XXXX    1    2    .
# YYYY    .    3    4

Note that the above assumes A and B do not share row names. If there are shared row names, you should assign by row number instead of by name.
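
A minimal sketch of that row-number variant. The function name bind_by_position and the example matrices with a shared row name are hypothetical, and building the empty target with sparseMatrix is my own choice:

```r
library(Matrix)

# Hypothetical variant of my.bind that addresses rows by position, not by name
bind_by_position <- function(A, B) {
  allCols <- union(colnames(A), colnames(B))
  C <- sparseMatrix(i = integer(0), j = integer(0), x = numeric(0),
                    dims = c(nrow(A) + nrow(B), length(allCols)),
                    dimnames = list(c(rownames(A), rownames(B)), allCols))
  C[seq_len(nrow(A)), colnames(A)] <- A            # rows 1..nrow(A)
  C[nrow(A) + seq_len(nrow(B)), colnames(B)] <- B  # the rows after those
  C
}

# Both matrices use the row name "r1"
A <- sparseMatrix(i = c(1, 1), j = c(1, 2), x = c(1, 2), dims = c(1, 2),
                  dimnames = list("r1", c("a", "b")))
B <- sparseMatrix(i = c(1, 1), j = c(1, 2), x = c(3, 4), dims = c(1, 2),
                  dimnames = list("r1", c("b", "c")))

res <- bind_by_position(A, B)
res
```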

Data:

library(Matrix)
A = Matrix(c(1,2), 1, dimnames = list('XXXX', c('AAAA','BBBB')))
B = Matrix(c(3,4), 1, dimnames = list('YYYY', c('BBBB','CCCC')))

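【Discussion】:

Thanks. An elegant solution, but a bit slow on larger matrices. I tried merging two sparse matrices of dimensions 100,000x5 and 10x5. It took 4.3 seconds.

【Solution 5】:

If many small sparse matrices need to be combined/concatenated into one large sparse matrix, it is better and more efficient to construct the large sparse matrix using a mapping between global and local row and column indices. For example,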

globalInds <- matrix(NA, nrow=dim(localPairRowColInds)[1], 2)

# extract the corresponding global row indices for the local row indices
globalInds[ , 1] <- globalRowInds[ localPairRowColInds[,1] ] 
globalInds[ , 2] <- globalColInds[ localPairRowColInds[,2] ]

write.table(cbind(globalInds, localPairVals), file=dataFname, append = T, sep = " ", row.names = F, col.names = F)
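
The fragment above leaves its inputs undefined, so here is a self-contained sketch of the same mapping idea. The variable names follow the fragment, but all values are made up, and building the matrix directly with sparseMatrix (instead of writing the triplets to a file first) is my own simplification:

```r
library(Matrix)

# Made-up inputs for one small block:
# local row 1 -> global row 5, local row 2 -> global row 2
globalRowInds <- c(5L, 2L)
# local columns 1, 2, 3 -> global columns 1, 3, 4
globalColInds <- c(1L, 3L, 4L)
# local (row, column) pairs of the non-zero entries, and their values
localPairRowColInds <- cbind(c(1L, 2L, 2L), c(1L, 2L, 3L))
localPairVals <- c(10, 20, 30)

# translate local indices to global ones
globalInds <- cbind(globalRowInds[localPairRowColInds[, 1]],
                    globalColInds[localPairRowColInds[, 2]])

# build the large matrix in one call from the global triplets
big <- sparseMatrix(i = globalInds[, 1], j = globalInds[, 2],
                    x = localPairVals, dims = c(5, 4))
big
```

With many blocks, the global triplets of all blocks would be collected (in memory or in a file, as in the fragment) and passed to a single sparseMatrix call at the end.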

【Discussion】: