试试这个:
CommonWordsMatrixOld <- function(vList) {
v <- lapply(vList, tolower)
do.call(rbind, lapply(v, function(x) {
xSplit <- strsplit(x, " ")[[1]]
do.call(c, lapply(v, function(y) length(intersect(xSplit, strsplit(y, " ")[[1]]))))
}))
}
myText <- list(text1, text2, text3)
我们有这样的称呼:
CommonWordsMatrixOld(myText)
[,1] [,2] [,3]
[1,] 4 1 0
[2,] 1 6 1
[3,] 0 1 8
对于 OP 请求的数据大小,它的速度相当快。数据获得here:
testWords <- read.csv("4000-most-common-english-words-csv.csv", stringsAsFactors = FALSE)
set.seed(1111)
myTestText <- lapply(1:100, function(x) {
paste(testWords[sample(1000:1020, sample(30, 1), replace = TRUE),],collapse = " ")
})
myTestText[[15]]
[1] "access restaurant video opinion video eventually fresh eventually
reform credit publish judge Senate publish fresh restaurant publish
version Senate critical release recall relation version"
system.time(test1 <- CommonWordsMatrixOld(myTestText))
user system elapsed
0.625 0.009 0.646
这是输出:
test1[1:10,1:10]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 9 3 5 1 3 4 4 2 2 1
[2,] 3 5 3 1 1 3 3 0 0 1
[3,] 5 3 12 0 3 8 4 3 2 1
[4,] 1 1 0 1 0 0 1 0 0 0
[5,] 3 1 3 0 4 2 1 1 1 0
[6,] 4 3 8 0 2 13 7 4 1 1
[7,] 4 3 4 1 1 7 10 4 1 1
[8,] 2 0 3 0 1 4 4 7 3 0
[9,] 2 0 2 0 1 1 1 3 4 0
[10,] 1 1 1 0 0 1 1 0 0 2
更新
这是一个更快的算法,它可以减少许多不必要的操作,并利用lower.tri,同时保持非常通用。
CommonWordsMatrixNew <- function(vList) {
v <- lapply(vList, function(x) tolower(strsplit(x, " ")[[1]]))
s <- length(v)
m <- do.call(rbind, lapply(1L:s, function(x) {
c(rep(0L,(x-1L)), do.call(c, lapply(x:s, function(y) length(intersect(v[[x]], v[[y]])))))
}))
m[lower.tri(m)] <- t(m)[lower.tri(m)]
m
}
为了让您了解性能提升,这里有一些基准测试。(需要注意的是,OP 的解决方案不是拆分向量,因此不是真正的比较)。新算法的速度几乎是 OP 解决方案的两倍。
microbenchmark(New=CommonWordsMatrixNew(myTestText),
Old=CommonWordsMatrixOld(myTestText),
Pach=CommonWordsMatrixPach(PreSplit1), times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
New 78.64434 79.07127 86.10754 79.72828 81.39679 137.0695 10
Old 321.49031 323.89835 326.61801 325.75221 328.50877 335.3306 10
Pach 138.34742 143.00504 145.35147 145.17376 148.34699 151.5535 10
identical(CommonWordsMatrixNew(myTestText), CommonWordsMatrixOld(myTestText), CommonWordsMatrixPach(PreSplit1))
[1] TRUE
新算法将对strsplit 的调用次数减少n^2 - n 次(例如,在上面的示例中,strplit 在原始算法中被调用10000 次,而在算法中只有100 次更新后的版本)。此外,由于生成的矩阵是对称的,因此无需多次计算每个句子之间的交互,因此 lapply 函数中的 x = 1:s 和 y = x:s。这些循环的计算次数从n^2 减少到nth triangle number = (n*(n+1)/2)(例如,在我们上面的示例中,从10000 减少到5050)。之后,我们依靠R 中的索引功能,这通常比手动制造要快得多。