R: compare words histogram by document (answers)

【Question title】: R: compare words histogram by document
【Posted】: 2018-07-01 07:47:04
【Question description】:

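I am looking for a way to compare word histograms by document, where the documents belong to a corpus built from a folder containing several documents. I tried: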

freq <- sort(colSums(as.matrix(dtm), group=Docs), decreasing=TRUE)

I also tried the ggplot option:

p <- p + geom_bar(stat="identity") +   facet_wrap(~ Docs)

But sadly I got errors.

Below is a modified sample of my code. Sadly, my 3 documents are plotted as if they were one, and the plot is not faceted by Docs:

library(tm)
library(ggplot2)

texts <- c("hola como  hola como  hola como", "hola me fui hola me fui hola me fui hola me fui", "hola como estas hola como estas hola como estas")
corpus <- VCorpus(VectorSource(texts))

dtm <- DocumentTermMatrix(corpus)

m <- as.matrix(dtm)   
m 
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)  
wf <- data.frame(word=names(freq), freq=freq)   

p <- ggplot(subset(wf, freq>1), aes(word, freq))    
p <- p + geom_bar(stat="identity") 
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1)) 
p

【Question discussion】:

Tags: r text text-mining corpus

【Solution 1】:

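The way you create wf means you lose the document information: colSums adds the counts across all documents, so nothing document-level is left for ggplot2 to facet on. Instead, I created wf as a combination of the document names and the dtm. (Be careful here if you have a large corpus, because as.matrix converts the sparse dtm into a dense matrix.) Then I transformed wf into long format, which is the format ggplot2 works best with. After that it is just a matter of faceting by document in whatever way you need when creating the plot. In the example below I split the graph by document.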

library(tm)

texts <- c("hola como  hola como  hola como", "hola me fui hola me fui hola me fui hola me fui", "hola como estas hola como estas hola como estas")
corpus <- VCorpus(VectorSource(texts))

dtm <- DocumentTermMatrix(corpus)

# Wide format: one row per document, one column per term.
# as.matrix() makes the dtm dense, so memory grows with documents x terms.
wf <- data.frame(docs=Docs(dtm), as.matrix(dtm))

library(tidyr)
# Long format: one row per (document, term) pair.
wf <- wf %>% gather(key = "terms", value = "freq", -docs)

library(ggplot2)
# One panel per document.
ggplot(wf, aes(terms, freq)) +
  geom_bar(stat="identity") +
  facet_wrap(~ docs) +
  theme(axis.text.x=element_text(angle=45, hjust=1))
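
For a large corpus, one possible way around the dense as.matrix() step is to build the long table straight from the sparse representation. The sketch below is not part of the original answer. It assumes the dtm is stored as a sparse triplet matrix, which is how tm represents a DocumentTermMatrix, with fields i (document index), j (term index) and v (count). The name wf_long is made up for this illustration.

```r
library(tm)
library(ggplot2)

texts <- c("hola como  hola como  hola como",
           "hola me fui hola me fui hola me fui hola me fui",
           "hola como estas hola como estas hola como estas")
corpus <- VCorpus(VectorSource(texts))
dtm <- DocumentTermMatrix(corpus)

# Assumption: dtm is a sparse triplet matrix, so dtm$i, dtm$j and dtm$v
# hold the row index, column index and value of each non-zero cell.
# wf_long is a hypothetical name for the resulting long table.
wf_long <- data.frame(
  docs  = Docs(dtm)[dtm$i],   # document name for each non-zero cell
  terms = Terms(dtm)[dtm$j],  # term for each non-zero cell
  freq  = dtm$v               # count of that term in that document
)

# One panel per document; free x scales let each panel list only its own terms.
p <- ggplot(wf_long, aes(terms, freq)) +
  geom_col() +
  facet_wrap(~ docs, scales = "free_x") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p
```

Unlike the gather() version, this table has no rows for zero counts, so a term that is absent from a document simply does not appear in that document's panel. Drop scales = "free_x" if all panels should share the same x axis.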

【Discussion】: