Compare Terms of Documents in a Document Term Matrix in R: Answers

【Question Title】: Compare Terms of Documents in Document Term Matrix in R
【Posted】: 2013-01-14 16:30:43
【Question】:

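I need to build a similarity matrix by comparing the terms of documents. For example, if Document1 and Document2 have 2 terms in common, I need to write a 2 at m[1, 2] in my similarity matrix. My similarity matrix currently looks like this: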

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,]    0   NA   NA   NA   NA   NA   NA   NA   NA
[2,]    0    0   NA   NA   NA   NA   NA   NA   NA
[3,]    0    0    0   NA   NA   NA   NA   NA   NA
[4,]    0    0    0    0   NA   NA   NA   NA   NA
[5,]    0    0    0    0    0   NA   NA   NA   NA
[6,]    0    0    0    0    0    0   NA   NA   NA
[7,]    0    0    0    0    0    0    0   NA   NA
[8,]    0    0    0    0    0    0    0    0   NA

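The documents and terms are in a document-term matrix. Now I have to fill the similarity matrix by comparing all documents and their terms for the cells shown as NA. For every term that matches in a pair of documents, I have to count +1 and write the final value into the correct position in the matrix.

My problem is that I can't seem to access the individual documents and their terms in the document-term matrix. Is there another way to do this, or am I missing something?

The code is as follows: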

install.packages("tm")
install.packages("openNLP")
install.packages("openNLPmodels.en")

Sys.setenv(NOAWT=TRUE)

library(tm)
library(openNLP)
library(openNLPmodels.en)

sample = c(
  "count eagle alien", 
  "dis bound eagle",   
  "bound count eagle dis",
  "count eagle dis alien",
  "bound eagle",
  "count dis alien",
  "bound count alien",
  "bound count",
  "count eagle dis"
)
print(sample)
corpus <- Corpus(VectorSource(sample))
inspect(corpus)

corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument,language="english")
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, tmTagPOS)
inspect(corpus)

dtm <- DocumentTermMatrix(corpus)
inspect(dtm)

# need to create similarity matrix here
#dist(dtm, method = "manhattan", diag = FALSE, upper = TRUE)

rowCount <- nrow(dtm)
similMatrix = matrix(nrow = rowCount - 1, ncol = rowCount)
show(similMatrix)
similMatrix[ row(similMatrix) >= col(similMatrix) ] <- 0

for(i in 1:(rowCount - 1)){  # rows
  for (j in (i+1):rowCount){      # cols
  
      # need to compare document i and j here and write
      # the value into similarity matrix 
  }
}
show(similMatrix)
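
One detail about the inner loop's range: in R the `:` operator has higher precedence than `+`, so a range meant to start at i + 1 needs parentheses around the start value. A quick check with arbitrary illustrative values for `i` and `n`:

```r
i <- 3
n <- 9

# `:` binds tighter than `+`, so this is i + (1:n), i.e. 4, 5, ..., 12.
wrong <- i + 1:n

# Parentheses give the intended range i + 1, ..., n.
right <- (i + 1):n

print(wrong)
print(right)
```

Without the parentheses the column index runs past the last column, which would produce a subscript error as soon as the loop body writes to the matrix.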

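【Comments】:

That's a lot of packages. Do you need to install all of them to reproduce this?
For example: if you installed only the package that defines DocumentTermMatrix and then posted the dput output of the result, would that be enough to reproduce this?
I think the tm, openNLP and openNLPmodels.en packages should do the job, but I'm not 100% sure about that. My professor recommended all of these packages for this task.
I'm not asking whether you need all the packages to do the task. I'm asking whether someone who wants to help needs to install all of them (many users might want to answer but won't bother installing 8 packages for it). This is generally a good question, but could you try to make the example simpler?
Yes, you're right. I removed everything except the 3 mentioned above.

Tags: r matrix document similarity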

【Solution 1】:

I think your similarity matrix is one row short, because you never reach your last document. Mine looks like this.

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
 [1,]   NA   NA   NA   NA   NA   NA   NA   NA   NA
 [2,]    1   NA   NA   NA   NA   NA   NA   NA   NA
 [3,]    2    3   NA   NA   NA   NA   NA   NA   NA
 [4,]    3    2    3   NA   NA   NA   NA   NA   NA
 [5,]    1    2    2    1   NA   NA   NA   NA   NA
 [6,]    2    1    2    3    0   NA   NA   NA   NA
 [7,]    2    1    2    2    1    2   NA   NA   NA
 [8,]    1    1    2    1    1    1    2   NA   NA
 [9,]    2    2    3    3    1    2    1    1   NA

To get this result, I took the following steps.

mat <- as.data.frame(as.matrix(dtm))  # convert the DocumentTermMatrix to a data frame
rowCount <- nrow(dtm)
colCount <- ncol(dtm)
similMatrix <- matrix(nrow = rowCount, ncol = rowCount)
similMatrix[row(similMatrix) >= col(similMatrix)] <- 0
for (i in 1:rowCount) {  # set the diagonal to NA; change to zeros later if needed
  similMatrix[i, i] <- NA
}
# now do the actual work
for (i in 1:rowCount) {      # rows
  for (j in 1:rowCount) {    # cols
    if (!is.na(similMatrix[i, j])) {   # only the lower triangle is non-NA
      a <- mat[i, ]
      b <- mat[j, ]
      for (k in 1:colCount) {  # one column per term in the document-term matrix
        # count term k if it occurs in both documents, whatever its frequency
        if (a[[k]] > 0 && b[[k]] > 0) {
          similMatrix[i, j] <- similMatrix[i, j] + 1
        }
      }
    }
  }
}
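
The same counts can probably be obtained without explicit loops. The following is a minimal sketch, not part of the original answer: it assumes the document-term matrix is available as a plain count matrix (with tm that would be `as.matrix(dtm)`). So that it runs without tm, the matrix `m` is built by hand from the question's sample documents, unstemmed; the names `docs`, `tokens`, `terms`, `m`, `b` and `shared` are illustrative only.

```r
# Hand-built stand-in for as.matrix(dtm): the sample documents from the
# question, unstemmed (an assumption; real DTM column names may differ).
docs <- c("count eagle alien", "dis bound eagle", "bound count eagle dis",
          "count eagle dis alien", "bound eagle", "count dis alien",
          "bound count alien", "bound count", "count eagle dis")
tokens <- strsplit(docs, " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))

# Rows = documents, columns = terms, cells = term counts.
m <- t(vapply(tokens,
              function(tk) tabulate(match(tk, terms), nbins = length(terms)),
              integer(length(terms))))
colnames(m) <- terms

# Reduce counts to presence/absence, then multiply: entry [i, j] of
# b %*% t(b) is the number of terms that documents i and j both contain.
b <- (m > 0) * 1
shared <- tcrossprod(b)

# Keep only the strict lower triangle, matching the layout shown above.
shared[upper.tri(shared, diag = TRUE)] <- NA
print(shared)
```

Working on a presence/absence matrix means a term is counted once per document pair even if it appears several times in a document.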

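【Comments】:

This is great!!! It looks like you inverted the matrix. Is it possible to swap the lower-left and upper-right triangles of the matrix? What would I have to change in the code?
You could try transposing the matrix: similMatrix = t(similMatrix).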
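
As a small illustration of the transpose suggestion, here is a sketch on a hypothetical 3 x 3 matrix `s` (not the full result from the answer). It shows t() moving the filled lower triangle into the upper one, plus one possible way to fill both triangles:

```r
# Hypothetical 3 x 3 similarity matrix with only the lower triangle filled.
s <- matrix(NA_real_, nrow = 3, ncol = 3)
s[lower.tri(s)] <- c(1, 2, 3)   # fills [2,1], [3,1], [3,2] in that order

# Transposing moves each value from [i, j] to [j, i].
upper <- t(s)
print(upper)

# To fill both triangles instead, copy the mirrored values across.
full <- s
full[upper.tri(full)] <- t(s)[upper.tri(s)]
print(full)
```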