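Posted: 2015-05-22 05:34:26
Question:
I am trying to write code that builds a table showing the correlations between all pairs of words in a corpus.
I know I can use findAssocs from the tm package to find the correlations for a single word. For example, findAssocs(dtm, "quick", 0.5) gives me every word whose correlation with "quick" is above 0.5. But I don't want to do this manually for every word in the text.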
# Load the .csv file into R (one comment per row, no header)
file_loc <- "C:/temp/TESTER.csv"
x <- read.csv(file_loc, header = FALSE, stringsAsFactors = FALSE)
library(tm)
# Note: newer tm releases expect DataframeSource input to have doc_id and text columns
corp <- Corpus(DataframeSource(x))
# Clean up the text before building the matrix
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, stripWhitespace)
# Build the document-term matrix from the cleaned corpus
dtm <- DocumentTermMatrix(corp)
From here I can find the word correlations for a single word:
findAssocs(dtm, "quick", 0.4)
But I would like to find all of the correlations at once, like this:
        quick   easy    the    and
quick    1.00   0.54   0.72   0.92
easy     0.54   1.00   0.98   0.54
the      0.72   0.98   1.00   0.05
and      0.92   0.54   0.05   1.00
Any suggestions?
Sample of the "TESTER.csv" data file (starting at cell A1):
[1] I got my question answered very quickly
[2] It was quick and easy to find the information I needed
[3] My question was answered quickly by the people at stack overflow
[4] Because they're good at what they do
[5] They got it dealt with quickly and didn't mess around
[6] The information I needed was there all along
[7] They resolved it quite quickly
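For reference, one way to get a table of that shape is to take the Pearson correlation between every pair of term columns of a dense document-term matrix. The sketch below is only that: it uses base R and rebuilds a small count matrix from the sample rows so it runs without tm, and the names docs, tokens, terms, m and cor_all are made up for illustration. With tm, as.matrix(dtm) should be usable in place of m, assuming the matrix fits in memory (the result has one row and one column per term).

```r
# Sample comments copied from the TESTER.csv rows above.
docs <- c(
  "I got my question answered very quickly",
  "It was quick and easy to find the information I needed",
  "My question was answered quickly by the people at stack overflow",
  "Because they're good at what they do",
  "They got it dealt with quickly and didn't mess around",
  "The information I needed was there all along",
  "They resolved it quite quickly"
)

# Rough equivalent of the cleaning steps: lower-case, drop punctuation and digits.
clean <- gsub("[[:punct:][:digit:]]", "", tolower(docs))
tokens <- strsplit(trimws(clean), "\\s+")

# Dense document-term count matrix: one row per comment, one column per word.
# (Assumption: with tm, as.matrix(dtm) plays the role of m.)
terms <- sort(unique(unlist(tokens)))
m <- sapply(terms, function(w) sapply(tokens, function(tok) sum(tok == w)))

# Pearson correlation between every pair of term columns.
cor_all <- cor(m)

# Inspect the sub-table for a few words of interest.
words <- c("quick", "easy", "the", "and")
print(round(cor_all[words, words], 2))
```

A term that occurs the same number of times in every document has zero variance, so cor() would return NA for it; such columns may need to be dropped first.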
Comments:
- Please provide a reproducible example.
- The actual data has about 1000 comments like these.
Tags: r text correlation tm