Finding the cosine similarity of documents and removing near-duplicates from an R data frame

【Question title】: Finding cosine similarity of documents and their removal from R dataframe
【Posted】: 2018-11-29 14:04:14
【Question】:

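I am working with a data frame that contains only a document number and the text for each row. The data was exported from an XML file and is held as a data frame in the variable text_df: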

row / text

 1 when uploading objective file bugzilla se
 2 spelling mistake docs section searching fo…
 3 editparams cgi won save updates iis instal…
 4 editparams cgi won save updates            
 5 rfe unsubscribe from bug you reported      
 6 unsubscribe from bug you reported

I am using the following code to identify and remove duplicates.

library(text2vec)
library(magrittr)  # provides %>%

doc_set_1 = text_df
it1 = itoken(doc_set_1$text, progressbar = FALSE)

# specially take different number of docs in second set
doc_set_2 = text_df
it2 = itoken(doc_set_2$text, progressbar = FALSE)
it = itoken(text_df$text, progressbar = FALSE)
v = create_vocabulary(it) %>%
  prune_vocabulary(doc_proportion_max = 0.1, term_count_min = 5)
vectorizer = vocab_vectorizer(v)
dtm1 = create_dtm(it1, vectorizer)
dtm2 = create_dtm(it2, vectorizer)
d1_d2_cos_sim = sim2(dtm1, dtm2, method = "cosine", norm = "l2")
mat <- d1_d2_cos_sim
mat[lower.tri(mat, diag = TRUE)] <- 0
## for converting a sparse matrix into dataframe
mdf <- as.data.frame(as.matrix(mat))
datalist = list()
for (i in 1:nrow(mat)) {
  t <- which(mat[i, ] > 0.8)
  if (length(t) > 1) {
    datalist[[i]] <- t # add it to your list
  }
}

# Number of Duplicates Found
length(unique(unlist(datalist)))

tmdf <- subset(mdf, select = -c(unique(unlist(datalist))))

# Removing the similar documents
text_df <- text_df[names(tmdf), ]
nrow(text_df)

This code takes a long time to run. Any suggestions for improvement are welcome.

【Comments】:

Note that stackoverflow.com/questions/5963269/… will go a long way toward helping others help you.

Tags: r xml nlp cosine-similarity

【Solution 1】:

The quanteda library works well in this situation. Here is an example:

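library(tibble)
library(quanteda)
# In quanteda >= 3.0, textstat_simil() lives in the quanteda.textstats package
library(quanteda.textstats)
df <- tibble(text = c("when uploading objective file bugzilla se",
                      "spelling mistake docs section searching fo",
                      "editparams cgi won save updates iis instal",
                      "editparams cgi won save updates",
                      "rfe unsubscribe from bug you reported",
                      "unsubscribe from bug you reported"))
# Tokenise first; recent quanteda versions no longer build a dfm directly from text
DocTerm <- quanteda::dfm(quanteda::tokens(df$text))
textstat_simil(DocTerm, margin = "documents", method = "cosine")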
          text1     text2     text3     text4     text5
text2 0.0000000                                        
text3 0.0000000 0.0000000                              
text4 0.0000000 0.0000000 0.8451543                    
text5 0.0000000 0.0000000 0.0000000 0.0000000          
text6 0.0000000 0.0000000 0.0000000 0.0000000 0.9128709
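
As a sanity check on the numbers above, the two non-zero similarities can be recomputed by hand from term counts. This is a minimal sketch in base R and not part of the original answer: the helper name `cosine_sim` is made up for illustration, and plain whitespace splitting stands in for quanteda's tokenizer, an assumption that holds only because these strings contain no punctuation.

```r
# Hypothetical helper: cosine similarity between two strings,
# using whitespace tokens and raw term counts (base R only).
cosine_sim <- function(a, b) {
  tok_a <- strsplit(a, "\\s+")[[1]]
  tok_b <- strsplit(b, "\\s+")[[1]]
  vocab <- union(tok_a, tok_b)
  # Count each vocabulary term in both documents (0 when absent)
  va <- as.numeric(table(factor(tok_a, levels = vocab)))
  vb <- as.numeric(table(factor(tok_b, levels = vocab)))
  sum(va * vb) / sqrt(sum(va^2) * sum(vb^2))
}

# text3 vs text4: 5 shared terms, 7 and 5 terms in total, so 5 / sqrt(7 * 5)
cosine_sim("editparams cgi won save updates iis instal",
           "editparams cgi won save updates")

# text5 vs text6: 5 shared terms, 6 and 5 terms in total, so 5 / sqrt(6 * 5)
cosine_sim("rfe unsubscribe from bug you reported",
           "unsubscribe from bug you reported")
```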

If you want to subset the result by a threshold and see which documents are more similar than a given value (0.9 here), you can do the following:

mycosinesim <- textstat_simil(DocTerm, margin = "documents", method = "cosine")
myMatcosine <- as.data.frame(as.matrix(mycosinesim))
higherthan90 <- as.data.frame(which(myMatcosine > 0.9, arr.ind = TRUE, useNames = TRUE))
higherthan90[which(higherthan90$row != higherthan90$col), ]

row col
text6     6   5
text5.1   5   6

Now you can decide whether to drop text 5 or text 6, since they are very similar.
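
To go one step further and actually drop the near-duplicates, one option is to look only at the upper triangle of the similarity matrix, so each pair is seen once and the earlier document of the pair is kept. The following is a minimal sketch rather than part of the original answer: `drop_near_duplicates` is a hypothetical helper, and the similarity matrix is built in base R from whitespace tokens as a stand-in for the quanteda result.

```r
texts <- c("when uploading objective file bugzilla se",
           "spelling mistake docs section searching fo",
           "editparams cgi won save updates iis instal",
           "editparams cgi won save updates",
           "rfe unsubscribe from bug you reported",
           "unsubscribe from bug you reported")

# Document-term count matrix from whitespace tokens (base R stand-in for a dfm)
tokens <- strsplit(texts, "\\s+")
vocab  <- unique(unlist(tokens))
dtm    <- t(vapply(tokens,
                   function(tk) as.numeric(table(factor(tk, levels = vocab))),
                   numeric(length(vocab))))

# Cosine similarity: scale rows to unit length, then take all pairwise dot products
unit <- dtm / sqrt(rowSums(dtm^2))
sim  <- tcrossprod(unit)

# Hypothetical helper: return the row indices to keep, dropping the later
# document of every pair whose similarity exceeds the threshold
drop_near_duplicates <- function(sim, threshold = 0.9) {
  sim[lower.tri(sim, diag = TRUE)] <- 0   # count each pair once, ignore self-similarity
  dup <- unique(which(sim > threshold, arr.ind = TRUE)[, "col"])
  setdiff(seq_len(nrow(sim)), dup)
}

keep <- drop_near_duplicates(sim, threshold = 0.9)
texts[keep]   # the sixth text is dropped as a near-duplicate of the fifth
```

With a data frame, the same index vector would be used as `df[keep, ]`. Note that this rule drops a document as soon as it resembles any earlier one, even if that earlier one is itself dropped, which may or may not be the behavior you want.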

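【Comments】:

Thanks for the reply, @Carles, but I also want to remove the documents whose similarity exceeds, say, 0.9 from the data frame. Please advise on that as well.
I hope the edit gives you an idea of how to extract them. Cheers! :)
I appreciate the answer, but the second part will again be computationally intensive for the 90,000 documents I am currently working with. Is there any alternative that would work here?
I'm sorry, @osmjit, I'm not sure how to do that very efficiently. That would, however, be a different and more interesting question, namely: how to efficiently extract indices from large data.frames(). Please close the question, since it has been answered. Cheers!
I found this, which may help you do it faster :). stackoverflow.com/questions/28233561/…; hope it helps!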
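
For a collection on the order of the 90,000 documents mentioned above, converting the similarity matrix to a dense data frame is likely to be the expensive step. One possible alternative, sketched here under stated assumptions, keeps everything sparse with the Matrix package: `triu()` keeps the upper triangle, and `summary()` on a sparse matrix returns its non-zero entries as (i, j, x) triplets. The toy matrix `sim` below uses made-up values for illustration; in practice it would be the sparse result of something like `text2vec::sim2()`. Whether this is fast enough depends on how many document pairs have non-zero similarity.

```r
library(Matrix)

# Toy sparse, symmetric similarity matrix for 6 documents (illustrative values)
sim <- sparseMatrix(
  i = c(1:6, 3, 4, 5, 6),
  j = c(1:6, 4, 3, 6, 5),
  x = c(rep(1, 6), 0.845, 0.845, 0.913, 0.913),
  dims = c(6, 6)
)

upper    <- triu(sim, k = 1)    # strictly upper triangle, still sparse
triplets <- summary(upper)      # data frame with columns i, j, x
dup      <- unique(triplets$j[triplets$x > 0.8])
keep     <- setdiff(seq_len(nrow(sim)), dup)
keep                            # row indices of the documents to retain
```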