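【Question title】: Using findAssocs to build a correlation matrix of all word combinations in R
【Posted】: 2015-05-22 05:34:26
【Question】:

I am trying to write code that builds a table showing the correlations between all pairs of words in a corpus.

I know I can use findAssocs from the tm package to find the correlations for a single word, i.e. findAssocs(dtm, "quick", 0.5) gives me every word whose correlation with "quick" is above 0.5, but I don't want to do this manually for every word in the text.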

# Load the .csv file into R (one comment per row, no header)
library(tm)
file_loc <- "C:/temp/TESTER.csv"
x <- read.csv(file_loc, header = FALSE, stringsAsFactors = FALSE)
corp <- Corpus(DataframeSource(x))

# Clean up the text
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, content_transformer(stripWhitespace))

# Build the document-term matrix once, after cleaning
dtm <- DocumentTermMatrix(corp)

From here I can find the word correlations for a single word:

findAssocs(dtm, "quick", 0.4)

But I would like to find all of the correlations at once, like this:

       quick  easy   the   and 
quick   1.00  0.54  0.72  0.92     
 easy   0.54  1.00  0.98  0.54   
  the   0.72  0.98  1.00  0.05  
  and   0.92  0.54  0.05  1.00

Any suggestions?

Sample of the "TESTER.csv" data file (starting at cell A1):

[1] I got my question answered very quickly
[2] It was quick and easy to find the information I needed
[3] My question was answered quickly by the people at stack overflow
[4] Because they're good at what they do
[5] They got it dealt with quickly and didn't mess around
[6] The information I needed was there all along
[7] They resolved it quite quickly
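
To make the question reproducible without the original file, the sample comments above could be written to a CSV and read back in. This is only a minimal sketch: the temporary file path stands in for "C:/temp/TESTER.csv", and the variable names `comments` and `csv_path` are assumptions made for illustration.

```r
# Sample comments copied from the question (one comment per row)
comments <- c(
  "I got my question answered very quickly",
  "It was quick and easy to find the information I needed",
  "My question was answered quickly by the people at stack overflow",
  "Because they're good at what they do",
  "They got it dealt with quickly and didn't mess around",
  "The information I needed was there all along",
  "They resolved it quite quickly"
)

# Write a single-column CSV with no header; a temp file is used here
# as a stand-in for the real TESTER.csv location
csv_path <- tempfile(fileext = ".csv")
write.table(data.frame(comments), file = csv_path, sep = ",",
            row.names = FALSE, col.names = FALSE)

# Read it back the same way the question does
x <- read.csv(csv_path, header = FALSE, stringsAsFactors = FALSE)
str(x)
```

Reading with `stringsAsFactors = FALSE` keeps the comments as character strings, which matters on R versions before 4.0 where factors were the default.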

【Comments】:

  • Please provide a reproducible example.
  • The actual data has about 1,000 comments like these.

Tags: r text correlation tm


【Solution 1】:

You could probably use as.matrix and cor. findAssocs has a lower limit of 0:

(cor_1 <- findAssocs(dtm, colnames(dtm)[1:2], 0))
#               all along
#  there       1.00  1.00
#  information 0.65  0.65
#  needed      0.65  0.65
#  the         0.47  0.47
#  was         0.47  0.47

cor gives you all Pearson correlations, for what it's worth:

cor_2 <- cor(as.matrix(dtm))
cor_2[c("there", "information", "needed", "the", "was"), c("all", "along")]
#                   all     along
# there       1.0000000 1.0000000
# information 0.6454972 0.6454972
# needed      0.6454972 0.6454972
# the         0.4714045 0.4714045
# was         0.4714045 0.4714045
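
Building on the cor(as.matrix(dtm)) idea, the full term-by-term matrix can be rounded into the layout the question asks for and then filtered into a list of strongly correlated pairs. The following is a rough sketch rather than a definitive recipe: it assumes the tm package is installed, uses VCorpus in place of Corpus as an assumption, and the names `cor_mat`, `wanted`, `strong_pairs` and the 0.5 cutoff are chosen only for illustration.

```r
library(tm)

# Sample comments from the question, standing in for the real data
comments <- c(
  "I got my question answered very quickly",
  "It was quick and easy to find the information I needed",
  "My question was answered quickly by the people at stack overflow",
  "Because they're good at what they do",
  "They got it dealt with quickly and didn't mess around",
  "The information I needed was there all along",
  "They resolved it quite quickly"
)

# Same cleaning steps as in the thread
corp <- VCorpus(VectorSource(comments))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, stripWhitespace)
dtm <- DocumentTermMatrix(corp)

# Pearson correlation between every pair of term columns
cor_mat <- cor(as.matrix(dtm))

# A small square slice in the layout shown in the question
wanted <- c("quick", "easy", "the", "and")
print(round(cor_mat[wanted, wanted], 2))

# Long-format table of distinct pairs at or above a cutoff (0.5 is arbitrary)
idx <- which(upper.tri(cor_mat) & cor_mat >= 0.5, arr.ind = TRUE)
strong_pairs <- data.frame(
  term1 = rownames(cor_mat)[idx[, "row"]],
  term2 = colnames(cor_mat)[idx[, "col"]],
  cor   = cor_mat[idx],
  stringsAsFactors = FALSE
)
head(strong_pairs[order(-strong_pairs$cor), ])
```

Unlike findAssocs, the matrix from cor also contains negative correlations. With around 1,000 comments the dense matrix has one row and column per term, so trimming rare terms first (for example with removeSparseTerms from tm, at a sparsity threshold of your choosing) may keep it manageable.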

The preceding code:

library(tm)

# Sample comments from the question, as a character vector
x <- c(
  "I got my question answered very quickly",
  "It was quick and easy to find the information I needed",
  "My question was answered quickly by the people at stack overflow",
  "Because they're good at what they do",
  "They got it dealt with quickly and didn't mess around",
  "The information I needed was there all along",
  "They resolved it quite quickly"
)
corp <- Corpus(VectorSource(x))

# Clean up the text, then build the document-term matrix
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, content_transformer(stripWhitespace))
dtm <- DocumentTermMatrix(corp)

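【Comments】:

  • Hi Luke, thanks for the examples. I can get the second one to work using the Pearson correlation, but not your first example.
  • I added the preceding code (and a parenthesis that I had forgotten).
  • Sorry, Luke, but when I run that code I keep getting "numeric(0)".
  • I can't reproduce that with tm_0.6 and R version 3.1.3.
  • I'm using tm_0.6 and R version 2.15.3.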