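Starting from Valentin's answer above, I wrote my own merge.sparse function, which:
- preserves the column and row names (and, of course, takes them into account when merging)
- keeps the original order of the row and column names, merging only the ones the matrices have in common
The following code seems to do this: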
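if (length(find.package(package = "Matrix", quiet = TRUE)) == 0) install.packages("Matrix")
library(Matrix)

merge.sparse <- function(...) {
  cnnew <- character()  # union of the column names seen so far
  rnnew <- character()  # union of the row names seen so far
  x <- vector()         # non-zero values
  i <- numeric()        # row indices in the merged matrix
  j <- numeric()        # column indices in the merged matrix

  for (M in list(...)) {
    cnold <- colnames(M)
    rnold <- rownames(M)

    # New names are appended, so the order of earlier names is preserved
    cnnew <- union(cnnew, cnold)
    rnnew <- union(rnnew, rnold)

    # Positions of this matrix's rows/columns in the merged matrix
    cindnew <- match(cnold, cnnew)
    rindnew <- match(rnold, rnnew)
    ind <- unname(which(M != 0, arr.ind = TRUE))
    i <- c(i, rindnew[ind[, 1]])
    j <- c(j, cindnew[ind[, 2]])
    x <- c(x, M@x)
  }

  sparseMatrix(i = i, j = j, x = x,
               dims = c(length(rnnew), length(cnnew)),
               dimnames = list(rnnew, cnnew))
}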
I tested it with the following data:
df1 <- data.frame(x=c("N","R","R","S","T","T","U"),y=c("N","N","M","X","X","Z","Z"))
M1 <- xtabs(~y+x,df1,sparse=T)
df2 <- data.frame(x=c("S","S","T","T","U","V","V","W","W","X"),y=c("N","M","M","K","Z","M","N","N","K","Z"))
M2 <- xtabs(~y+x,df2,sparse=T)
df3 <- data.frame(x=c("A","C","C","B"),y=c("N","M","Z","K"))
M3 <- xtabs(~y+x,df3,sparse=T)
df4 <- data.frame(x=c("N","R","R","S","T","T","U"),y=c("F","F","G","G","H","I","L"))
M4 <- xtabs(~y+x,df4,sparse=T)
df5 <- data.frame(x=c("K1","K2","K3","K4"),y=c("J1","J2","J3","J4"))
M5 <- xtabs(~y+x,df5,sparse=T)
This gives:
Ms <- merge.sparse(M1,M2,M3,M4,M5)
as.matrix(Ms)
#   N R S T U V W X A B C K1 K2 K3 K4
#M  0 1 1 1 0 1 0 0 0 0 1  0  0  0  0
#N  1 1 1 0 0 1 1 0 1 0 0  0  0  0  0
#X  0 0 1 1 0 0 0 0 0 0 0  0  0  0  0
#Z  0 0 0 1 2 0 0 1 0 0 1  0  0  0  0
#K  0 0 0 1 0 0 1 0 0 1 0  0  0  0  0
#F  1 1 0 0 0 0 0 0 0 0 0  0  0  0  0
#G  0 1 1 0 0 0 0 0 0 0 0  0  0  0  0
#H  0 0 0 1 0 0 0 0 0 0 0  0  0  0  0
#I  0 0 0 1 0 0 0 0 0 0 0  0  0  0  0
#L  0 0 0 0 1 0 0 0 0 0 0  0  0  0  0
#J1 0 0 0 0 0 0 0 0 0 0 0  1  0  0  0
#J2 0 0 0 0 0 0 0 0 0 0 0  0  1  0  0
#J3 0 0 0 0 0 0 0 0 0 0 0  0  0  1  0
#J4 0 0 0 0 0 0 0 0 0 0 0  0  0  0  1
Ms
#14 x 15 sparse Matrix of class "dgCMatrix"
# [[ suppressing 15 column names ‘N’, ‘R’, ‘S’ ... ]]
#
#M  . 1 1 1 . 1 . . . . 1 . . . .
#N  1 1 1 . . 1 1 . 1 . . . . . .
#X  . . 1 1 . . . . . . . . . . .
#Z  . . . 1 2 . . 1 . . 1 . . . .
#K  . . . 1 . . 1 . . 1 . . . . .
#F  1 1 . . . . . . . . . . . . .
#G  . 1 1 . . . . . . . . . . . .
#H  . . . 1 . . . . . . . . . . .
#I  . . . 1 . . . . . . . . . . .
#L  . . . . 1 . . . . . . . . . .
#J1 . . . . . . . . . . . 1 . . .
#J2 . . . . . . . . . . . . 1 . .
#J3 . . . . . . . . . . . . . 1 .
#J4 . . . . . . . . . . . . . . 1
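I don't know why the column names are "suppressed" when I try to display the merged sparse matrix Ms; converting it to a non-sparse matrix does bring them back, so...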
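For what it's worth, the suppression seems to come from Matrix's print method, which by default leaves out the column names once a sparse matrix has ten or more columns. Here is a minimal sketch, assuming the `col.names` argument of that print method behaves as documented; the toy matrix `S` and its names are made up for illustration:

```r
library(Matrix)

# Hypothetical 2 x 12 sparse matrix, wide enough to trigger the suppression
S <- sparseMatrix(i = c(1, 2), j = c(1, 12), x = c(1, 1),
                  dims = c(2, 12),
                  dimnames = list(c("r1", "r2"), paste0("c", 1:12)))

# Explicitly ask the print method to show the column names
print(S, col.names = TRUE)
```

If I read the documentation correctly, setting options(sparse.colnames = TRUE) should have the same effect for the whole session.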
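Also, I noticed that when the same "coordinates" are included more than once, the sparse matrix contains the sum of the corresponding values in x (see row "Z", column "U", which is 1 in both M1 and M2). Maybe there is a way to change this, but for my application it is fine.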
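As far as I can tell, summing repeated coordinates is the documented default of sparseMatrix, and its `use.last.ij` argument switches to keeping only the last value instead. A small sketch with made-up indices and values:

```r
library(Matrix)

# Two entries share the coordinate (1, 1); the values are illustrative only
i <- c(1, 1, 2)
j <- c(1, 1, 2)
x <- c(2, 3, 7)

# Default: values at repeated (i, j) pairs are added together
summed <- sparseMatrix(i = i, j = j, x = x, dims = c(2, 2))

# use.last.ij = TRUE: only the last value given for a repeated pair is kept
last_only <- sparseMatrix(i = i, j = j, x = x, dims = c(2, 2),
                          use.last.ij = TRUE)

summed[1, 1]     # 5, i.e. 2 + 3
last_only[1, 1]  # 3
```

Passing `use.last.ij = TRUE` in the final sparseMatrix call of merge.sparse would presumably make later matrices overwrite earlier ones rather than add to them, though I have not tested that on real data.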
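I'm sharing this code in case someone else needs to merge sparse matrices this way, and in case someone can test it on large matrices and suggest performance improvements.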
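EDIT
After checking this post, I found that summary makes it easier to extract information about the (non-zero) elements of a sparse matrix, without having to use which.
So this part of my code above: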
ind <- unname(which(M != 0, arr.ind = TRUE))
i <- c(i, rindnew[ind[, 1]])
j <- c(j, cindnew[ind[, 2]])
x <- c(x, M@x)
can be replaced with:
ind <- summary(M)
i <- c(i, rindnew[ind[, 1]])
j <- c(j, cindnew[ind[, 2]])
x <- c(x, ind[, 3])
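To make it clearer what summary returns here, a minimal sketch on a made-up 3 x 3 matrix (the values are arbitrary):

```r
library(Matrix)

# Hypothetical 3 x 3 matrix with three stored entries
M <- sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 3), x = c(10, 20, 30),
                  dims = c(3, 3))

s <- summary(M)  # data-frame-like triplet view with columns i, j and x
s$i              # row indices of the stored entries
s$j              # column indices of the stored entries
s$x              # the corresponding values
```

One possible advantage of this variant: the indices and the values come from the same object, so they stay aligned even if a matrix happens to store explicit zeros, whereas which(M != 0) would skip those entries while M@x still contains them.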
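Now, I don't know which of the two is computationally more efficient, or whether there is an easier way to do this by changing the dimensions of the matrices and then adding them up, but this seems to work for me, so...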
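For the "resize, then add" idea, something along these lines might work. This is only a sketch, not benchmarked: `expand_to` and `merge_by_adding` are names I made up, and it assumes every input is a numeric sparse matrix with both row and column names:

```r
library(Matrix)

# Hypothetical helper: re-embed M in a larger matrix indexed by rn x cn
expand_to <- function(M, rn, cn) {
  s <- summary(M)
  sparseMatrix(i = match(rownames(M), rn)[s$i],
               j = match(colnames(M), cn)[s$j],
               x = s$x,
               dims = c(length(rn), length(cn)),
               dimnames = list(rn, cn))
}

# Hypothetical alternative to merge.sparse: bring all matrices to the
# same dimensions, then add them up
merge_by_adding <- function(...) {
  Ms <- list(...)
  rn <- Reduce(union, lapply(Ms, rownames))
  cn <- Reduce(union, lapply(Ms, colnames))
  Reduce(`+`, lapply(Ms, expand_to, rn = rn, cn = cn))
}

# Small made-up example: the two matrices overlap at row "b", column "q"
A <- sparseMatrix(i = c(1, 2), j = c(1, 2), x = c(1, 2),
                  dimnames = list(c("a", "b"), c("p", "q")))
B <- sparseMatrix(i = c(1, 2), j = c(1, 2), x = c(4, 8),
                  dimnames = list(c("b", "c"), c("q", "r")))
res <- merge_by_adding(A, B)
res
```

Overlapping entries are added here as well, so the result should match what merge.sparse produces; whether it is any faster on large matrices I cannot say.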