Word columns appearing in text from a data frame column, with their frequencies in R: answers

【Question title】: Word columns appearing in text from a data frame column, with their frequencies in R
【Posted】: 2020-03-04 12:24:20
【Question description】:

I have a question about this old post: R Text mining - how to change texts in R data frame column into several columns with word frequencies?

I am trying to use R to reproduce exactly what was posted at the link above, but with strings that contain numeric characters.

Suppose res is the data frame I define:

library(qdap)
x1 <- as.factor(c( "7317 test1 fool 4258 6287" , "thi1s is 6287 test funny text1 test1", "this is test1 6287 text1 funny fool"))
y1 <- as.factor(c("test2 6287", "this is test text2", "test2 6287"))
z1 <- as.factor(c( "test2 6287" , "this is test 4258 text2 fool", "test2 6287"))
res <- data.frame(x1, y1, z1)

When I compute the word frequencies with these commands,

freqs <- t(wfm(as.factor(res$x1), 1:nrow(res), char.keep=TRUE))
abcd <- data.frame(res, freqs, check.names = FALSE)

abcd ignores 7317, 4258, 6287, and even the digit 1 in test1 when computing the frequencies.

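In the first row of column x1, the 1 is removed from test1 and what remains is counted as a word. Similarly, thi1s is stripped of its digit and counted as a word. What I want, however, is test1 itself. Likewise, 7317, 4258, and so on, which are stored as strings, must be counted as words and appear in the data table with their frequencies. What else needs to be added to the code?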


Tags: r text count word mining

【Solution 1】:

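You need to add the following to the freqs statement: removeNumbers = FALSE. The wfm function calls several other functions, one of which is tm::TermDocumentMatrix. The default value that wfm passes to that function is removeNumbers = TRUE, so it has to be set to FALSE explicitly.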

Code:

freqs <- t(wfm(as.factor(res$x1), 1:nrow(res), char.keep=TRUE, removeNumbers = FALSE))
abcd <- data.frame(res, freqs, check.names = FALSE)
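As a cross-check that does not depend on qdap, the same kind of table can be sketched in base R. This is only an illustration, not what wfm does internally: it assumes that words are separated by whitespace, that lowercasing is acceptable, and that no punctuation needs stripping. The names tokens and vocab are made up for this example.

```r
# Example data from the question (x1 column only), kept as plain character strings.
x1 <- c("7317 test1 fool 4258 6287",
        "thi1s is 6287 test funny text1 test1",
        "this is test1 6287 text1 funny fool")

# Assumption: tokens are separated by whitespace; digits inside tokens are left untouched.
tokens <- strsplit(tolower(x1), "\\s+")

# Vocabulary: every distinct token across all rows.
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary entry per row; the result has one row per text, one column per token.
freqs <- t(vapply(tokens,
                  function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
                  integer(length(vocab))))
dimnames(freqs) <- list(as.character(seq_along(x1)), vocab)

# check.names = FALSE keeps column names such as "6287" as they are.
abcd <- data.frame(x1, freqs, check.names = FALSE)
```

Inspecting abcd[, c("test1", "6287", "7317")] should show test1 and the purely numeric strings as columns of their own, which can be compared against the wfm result.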
