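【Posted】: 2018-11-29 14:04:14
【Question】:
I am working with a data frame that contains only a document number and text data in each row. The data was exported from an XML file and is held as a data frame in the variable text_df: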
row  text
1 when uploading objective file bugzilla se
2 spelling mistake docs section searching fo…
3 editparams cgi won save updates iis instal…
4 editparams cgi won save updates
5 rfe unsubscribe from bug you reported
6 unsubscribe from bug you reported
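To make the notion of "duplicate" concrete, here is a minimal base R sketch of the cosine similarity between rows 5 and 6 of the table. Whitespace tokenisation and raw term counts are assumptions made for illustration only; they do not reproduce the vocabulary pruning used in the question's own pipeline, so the value obtained on the real data may differ.

```r
# Texts of rows 5 and 6, copied from the table above.
a <- "rfe unsubscribe from bug you reported"
b <- "unsubscribe from bug you reported"

# Assumption: split on single spaces (not the tokenizer used by text2vec).
tok_a <- strsplit(a, " ", fixed = TRUE)[[1]]
tok_b <- strsplit(b, " ", fixed = TRUE)[[1]]

# Term-count vectors over the combined vocabulary of both rows.
vocab <- union(tok_a, tok_b)
va <- as.numeric(table(factor(tok_a, levels = vocab)))
vb <- as.numeric(table(factor(tok_b, levels = vocab)))

# Cosine similarity: dot product divided by the product of the L2 norms.
cos_sim <- sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))
print(cos_sim)
```

The two rows share five of six terms, so under these assumptions the similarity is 5 / sqrt(30), which is above a 0.8 cut-off.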
I am using the following code to identify and remove duplicates.
library(text2vec)
library(magrittr)  # for %>%

doc_set_1 = text_df
it1 = itoken(doc_set_1$text, progressbar = FALSE)
# second set (here the same documents as the first)
doc_set_2 = text_df
it2 = itoken(doc_set_2$text, progressbar = FALSE)
it = itoken(text_df$text, progressbar = FALSE)
v = create_vocabulary(it) %>%
  prune_vocabulary(doc_proportion_max = 0.1, term_count_min = 5)
vectorizer = vocab_vectorizer(v)
dtm1 = create_dtm(it1, vectorizer)
dtm2 = create_dtm(it2, vectorizer)
d1_d2_cos_sim = sim2(dtm1, dtm2, method = "cosine", norm = "l2")
mat <- d1_d2_cos_sim
# keep only the upper triangle so that each pair is compared once
mat[lower.tri(mat, diag = TRUE)] <- 0
# convert the sparse matrix into a data frame
mdf <- as.data.frame(as.matrix(mat))
datalist = list()
for (i in 1:nrow(mat)) {
  hits <- which(mat[i, ] > 0.8)  # later documents similar to document i
  if (length(hits) > 0) {        # a single match is already a duplicate
    datalist[[i]] <- hits        # add it to the list
  }
}
# number of duplicates found
dup_cols <- unique(unlist(datalist))
length(dup_cols)
# remove the similar documents (nothing to drop when none were found)
tmdf <- if (length(dup_cols) > 0) subset(mdf, select = -dup_cols) else mdf
text_df <- text_df[names(tmdf), ]
nrow(text_df)
This code takes a long time to run; any suggestions for improvement are welcome.
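One possible direction, offered as a sketch rather than a verified fix: the row-by-row loop and the dense conversion can be avoided by reading the non-zero entries of the sparse similarity matrix in a single pass. The example uses only the Matrix package and a small hypothetical document-term matrix (`dtm` here is toy data, not the output of create_dtm); whether it is fast enough on the real data is untested.

```r
library(Matrix)

# Hypothetical toy document-term matrix: 4 documents x 4 terms.
# Documents 1 and 2 contain exactly the same terms.
dtm <- sparseMatrix(
  i = c(1, 1, 2, 2, 3, 3, 4, 4),
  j = c(1, 2, 1, 2, 3, 4, 2, 3),
  x = rep(1, 8),
  dims = c(4, 4)
)

# L2-normalise each row so that the cross-product is the cosine similarity.
row_norms <- sqrt(rowSums(dtm^2))
row_norms[row_norms == 0] <- 1  # guard against documents with no terms
dtm_norm <- Diagonal(x = 1 / row_norms) %*% dtm
sim <- tcrossprod(dtm_norm)

# summary() lists the stored non-zero entries as (i, j, x) triplets,
# so the threshold is applied to all pairs at once instead of per row.
nz <- summary(sim)
hits <- nz[nz$i != nz$j & nz$x > 0.8, ]

# For each similar pair, mark the later document as the duplicate.
dup_idx <- sort(unique(pmax(hits$i, hits$j)))
keep <- setdiff(seq_len(nrow(dtm)), dup_idx)
print(keep)
```

With the question's data, `sim` would be the matrix returned by sim2, and `text_df[keep, ]` would play the role of the final subsetting step. Working on the triplets keeps everything sparse, which avoids building an n x n dense matrix.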
【Comments】:
- Note that stackoverflow.com/questions/5963269/… would go a long way toward helping others help you.
Tags: r xml nlp cosine-similarity