How to exactly remove the punctuation when using R with tm package (answers)

【Question title】: How to exactly remove the punctuation when using R with tm package
【Posted】: 2015-08-20 08:47:45
【Question description】:

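Update:

I think I may have a workaround for this problem: adding one line, dtms = removeSparseTerms(dtm,0.1), removes the sparse terms from the document-term matrix. But I consider this only a workaround and am still waiting for an expert's answer!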
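For context on why the workaround hides the stray tokens: as far as I understand removeSparseTerms(dtm, sparse), it keeps only terms that occur in more than a fraction 1 - sparse of the documents, so 0.1 keeps terms present in more than 90% of the files. The sketch below mimics that rule in base R on an invented count matrix; the matrix m and the helper keep_dense_terms are hypothetical and not part of tm.

```r
# Toy document-term counts: rows are documents, columns are terms.
# The values are invented for illustration only.
m <- matrix(
  c(3, 0, 1,
    2, 0, 4,
    5, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("data", "lv_total", "select"))
)

# Hypothetical helper: keep terms that appear in more than
# (1 - sparse) of the documents, which is my reading of removeSparseTerms().
keep_dense_terms <- function(m, sparse) {
  docs_with_term <- colSums(m > 0)
  m[, docs_with_term > nrow(m) * (1 - sparse), drop = FALSE]
}

dense <- keep_dense_terms(m, sparse = 0.1)
print(colnames(dense))
```

Under this rule a token that shows up in only a few files is dropped whether or not it is punctuation debris, which is presumably why the word cloud looks cleaner without the underlying tokenization being fixed.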

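Recently I have been learning text mining in R with the tm package. My idea was to draw a word cloud of the most frequent words in my ABAP programs, so I wrote an R program to do that.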

library(tm)
library(SnowballC)
library(wordcloud)

# set path
path = system.file("texts","abapcode",package = "tm")

# make corpus
code = Corpus(DirSource(path),readerControl = list(language = "en"))

# cleanse text
code = tm_map(code,stripWhitespace)
code = tm_map(code,removeWords,stopwords("en"))
code = tm_map(code,removePunctuation)
code = tm_map(code,removeNumbers)

# make DocumentTermMatrix
dtm = DocumentTermMatrix(code)

# frequency
freq = sort(colSums(as.matrix(dtm)),decreasing = T)

#wordcloud(code,scale = c(5,1),max.words = 50,random.order = F,colors = brewer.pal(8, "Dark2"),rot.per = 0.35,use.r.layout = F)
wordcloud(names(freq),freq,scale = c(5,1),max.words = 50,random.order = F,colors = brewer.pal(8, "Dark2"),rot.per = 0.35,use.r.layout = F)

But in my ABAP code some variables contain "_" and "-" in their names, so if I execute this:

code = tm_map(code,removePunctuation)

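the corpus content is not quite right, and the word cloud looks like this:

With "_" or "-" removed, some of the words look odd.

Then I commented out that line of code, and the word cloud looks like this:

This time the words are correct, but some unexpected characters pop up, for example from my ABAP code comments...

So is there a way to remove exactly the punctuation we do not want and keep the punctuation we do want?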

【Question discussion】:

Near duplicate: tm custom removePunctuation except hashtag

Tags: r customization text-mining tm punctuation

【Solution 1】:

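Posting as an answer for the code formatting, but this is an adaptation of the content_transformer documentation, reached via getTransformations from the tm_map documentation:

Essentially it uses gsub inside content_transformer to do the same as removePunctuation (the [:punct:] regex class) minus _ and -. removePunctuation has an option to preserve dashes (preserve_intra_word_dashes) but none for the underscore _.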

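f <- content_transformer(function(x, pattern) gsub(pattern, "", x))
code <- tm_map(code, f, "[]!\"#$%&'()*+,./:;<=>?@[\\\\^`{|}~]")

Inside the R string you have to escape " and \ (a literal backslash in the regex is written \\\\), and the closing bracket ] is placed first in the character class so that it is taken literally.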
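A quick way to check a character class like this is to try it on plain strings before running it over the corpus. The sketch below does that in base R, with invented ABAP-like sample lines; it also shows an alternative that is not from the answer, a PCRE negative lookahead that skips "_" and "-" before matching [[:punct:]].

```r
# Sample ABAP-like lines, invented for illustration.
x <- c("lv_total = wa-field + 1.", "* a comment: DATA(lt_items).")

# Every ASCII [:punct:] character except "_" and "-".
# "]" comes first so it is literal; "\\\\" in the R string is a regex backslash.
keep_us_dash <- "[]!\"#$%&'()*+,./:;<=>?@[\\\\^`{|}~]"
a <- gsub(keep_us_dash, "", x)

# Alternative (an assumption, not from the answer): with perl = TRUE,
# the lookahead (?![_-]) refuses to match when the next character is "_" or "-".
b <- gsub("(?![_-])[[:punct:]]", "", x, perl = TRUE)

print(a)
print(identical(a, b))
```

The lookahead form is shorter and harder to get wrong, at the cost of requiring perl = TRUE; either pattern can be passed as the pattern argument of the transformer f.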


【Solution 2】:

OK... so the following works: convert the corpus to a data frame, remove the unwanted characters, then convert it back to a corpus.

dataframe <- data.frame(text = unlist(sapply(code, `[`, "content")), stringsAsFactors = FALSE)
# note: this class also strips "_"; drop it from the pattern to keep underscores
dataframe$text <- gsub("[][!#$%()*,.:;<=>@^_|~.{}]", "", dataframe$text)

code <- Corpus(VectorSource(dataframe$text))
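The same round trip can be tried on a couple of in-memory strings before touching the real corpus. This is a minimal sketch under stated assumptions: the sample lines and the names clean_text, raw, cleaned and code_demo are invented here, and the requireNamespace guard is added only so the text-cleaning part runs even where tm is not installed.

```r
# Pure-text cleaning step; the pattern is the one from the answer,
# which also strips "_" (delete it from the class to keep underscores).
clean_text <- function(txt) gsub("[][!#$%()*,.:;<=>@^_|~.{}]", "", txt)

# Invented ABAP-like sample lines.
raw <- c("lv_total = wa-field.", "WRITE: / lv_total.")
cleaned <- clean_text(raw)
print(cleaned)

# Rebuild a corpus only if tm is installed (guard added for this sketch).
if (requireNamespace("tm", quietly = TRUE)) {
  code_demo <- tm::VCorpus(tm::VectorSource(cleaned))
  print(length(code_demo))
}
```

Keeping the cleaning in a plain function makes it easy to see exactly which characters disappear, and the same function could be wrapped in content_transformer instead of going through a data frame.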
