How to read and write a TermDocumentMatrix in R? Answers

【Question title】: How to read and write a TermDocumentMatrix in R?
【Posted】: 2017-02-08 02:34:29
【Question】:

I built a word cloud in R from a CSV file, using the TermDocumentMatrix function from the tm package. Here is my code:

library(tm)     # Corpus, tm_map, TermDocumentMatrix
library(KoNLP)  # extractNoun, useSejongDic

csvData <- read.csv("word", encoding = "UTF-8", stringsAsFactors = FALSE)

Encoding(csvData$content) <- "UTF-8"
# useSejongDic() - KoNLP package
nouns <- sapply(csvData$content, extractNoun, USE.NAMES = FALSE)
# create the corpus
myCorpus <- Corpus(VectorSource(nouns))

# remove punctuation
myCorpus <- tm_map(myCorpus, removePunctuation)
# remove numbers
myCorpus <- tm_map(myCorpus, removeNumbers)
# remove stop words (myStopwords is defined elsewhere, not shown here)
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

# create the term-document matrix, keeping terms of 2 to 5 characters
TDM <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(2, 5)))

# convert to a dense matrix
m <- as.matrix(TDM)

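This process seems to take too long, and I think extractNoun is the reason. To save time, I would like to save the resulting TDM to a file. When I read that saved file back in, can I still use m <- as.matrix(saved TDM file) exactly as before? Or is there a better option?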
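The round trip asked about here can be sketched with base R's saveRDS() and readRDS(), which store and restore a single R object. The toy corpus and the file name tdm.rds below are assumptions for illustration only; in the workflow above, the object to save would be TDM itself.

```r
library(tm)

# Toy documents standing in for the noun-extracted text (illustrative data).
docs <- c("apple banana apple", "banana cherry", "cherry apple banana")
corpus <- VCorpus(VectorSource(docs))
tdm <- TermDocumentMatrix(corpus)

# Save the TDM once, after the slow preprocessing has finished.
tdm_path <- file.path(tempdir(), "tdm.rds")  # hypothetical file name
saveRDS(tdm, tdm_path)

# In a later session: reload the object and convert it as before.
tdm_loaded <- readRDS(tdm_path)
m <- as.matrix(tdm_loaded)
```

Since readRDS() hands back the TermDocumentMatrix object itself, as.matrix() should work on it as before, provided tm is loaded in the session that reads the file.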

【Question comments】:

Tags: r nlp term-document-matrix

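【Solution 1】:

I'm not an expert, but I use NLP from time to time.

I use parSapply from the parallel package. Here is the documentation: http://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf

parallel ships with base R. Here is a toy example of how to use it: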

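library(parallel)

# Leave one core free for the rest of the system
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)

# Define the variable first, then export it to the workers
base <- 2
clusterExport(cl, "base")

parSapply(cl, as.character(2:4),
          function(exponent) {
            x <- as.numeric(exponent)
            c(base = base^x, self = x^x)
          })

# Release the worker processes when done
stopCluster(cl)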

So, parallelize nouns <- sapply(csvData$content, extractNoun, USE.NAMES = F) and it will be faster :)
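A minimal sketch of that substitution follows. To keep it runnable without KoNLP, it uses a hypothetical stand-in function, extract_tokens, in place of extractNoun, and a small made-up character vector in place of csvData$content. With KoNLP, each worker would presumably also need the package and its dictionary loaded first, as the comment indicates.

```r
library(parallel)

# Hypothetical stand-in for KoNLP::extractNoun so the sketch runs on its own:
# it simply splits a string on whitespace.
extract_tokens <- function(text) unlist(strsplit(text, "[[:space:]]+"))

# Made-up input standing in for csvData$content.
content <- c("alpha beta", "gamma delta epsilon", "zeta")

cl <- makeCluster(2)

# With KoNLP, the workers would likely need something like:
# clusterEvalQ(cl, { library(KoNLP); useSejongDic() })

# Drop-in replacement for sapply(): same arguments, plus the cluster.
nouns <- parSapply(cl, content, extract_tokens, USE.NAMES = FALSE)

stopCluster(cl)
```

The result has the same shape as the sequential sapply() call, so the rest of the pipeline (Corpus(VectorSource(nouns)) and onward) does not need to change.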

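【Comments】:

Could you answer one more question, if you don't mind? I used parallel as you suggested, but it seems to cause a huge memory leak, so I am calling gc() as a stopgap. Have you run into this? If so, is there a fix?
I haven't seen that. It depends on the system. If you use Windows, maybe try cl<-makeCluster(no_cores, type="FORK"), but I haven't used Windows since XP.
I tried cl<-makeCluster(no_cores, type="FORK") to solve my problem, but it didn't work. So I think I need to find a way to manage the memory... Thank you for your reply!
Hi. Maybe it is best to keep using the working approach I posted yesterday. Otherwise, perhaps compile R to make full use of your architecture.
I think that would be better. I will keep using the approach you posted. Thanks!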

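【Solution 2】:

I noticed that you call several tm functions that can also be parallelized easily. In the tm package this functionality was updated in March 2017, a month after you asked your question.

The New Features section of the release notes for tm version 0.7 (2017-03-02) states:

tm_parLapply() is now internally used for the parallelization of transformations, filters, and term-document matrix construction. The preferred parallelization engine can be registered via tm_parLapply_engine(). The default is to use no parallelization (instead of mclapply (package parallel) in previous versions).

To set up parallelization for the tm commands, the following worked for me: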

library(tm)
library(parallel)
cores <- detectCores()
cl <- makeCluster(cores)   # use cores-1 if you want to do anything else on the PC.
tm_parLapply_engine(cl)    # register the cluster as tm's parallel engine
## insert your commands for create corpus, 
## tm_map and TermDocumentMatrix commands here
tm_parLapply_engine(NULL)  # switch tm back to no parallelization
stopCluster(cl)

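If you apply your own function through the tm_map content transformer, you need to pass that function to the parallel environment with clusterExport before the tm_map(MyCorpus, content_transformer(clean)) command. For example, to pass my clean function to the environment: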

clusterExport(cl, "clean")
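Putting the two snippets above together, an end-to-end run might look like the sketch below. The clean function and the two toy documents are hypothetical placeholders, and a two-worker cluster is assumed; passing envir = environment() to clusterExport is an extra precaution so the export also works when the code runs inside a function.

```r
library(tm)
library(parallel)

# Hypothetical cleaning function: lower-case the text and strip digits.
clean <- function(x) gsub("[0-9]+", "", tolower(x))

# Toy documents (illustrative data only).
docs <- c("Apple 123 Banana", "Cherry 42 apple")
corpus <- VCorpus(VectorSource(docs))

cl <- makeCluster(2)
# Make the custom function available on every worker before tm_map uses it.
clusterExport(cl, "clean", envir = environment())
tm_parLapply_engine(cl)

corpus <- tm_map(corpus, content_transformer(clean))
tdm <- TermDocumentMatrix(corpus)

# Unregister the engine and shut the workers down.
tm_parLapply_engine(NULL)
stopCluster(cl)

m <- as.matrix(tdm)
```

On a corpus this small the cluster start-up cost will outweigh any gain; the point is only to show the order of the calls.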

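One last comment: keep an eye on your memory usage. If your computer starts paging memory to disk, the CPU is no longer the critical path and none of the parallelization will make a difference.

【Comments】: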