Looking for twit and text message style stopwords: answers

【Question title】: looking for twit and text message style stopwords
【Posted】: 2012-11-13 13:36:33
【Question】:

I am using R to mine tweets, and I have extracted the most frequently used words in them. However, the most common words look like this:

 [1] "cant"     "dont"     "girl"     "gonna"    "lol"      "love"    
 [7] "que"      "thats"    "watching" "wish"     "youre"

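I am looking for trends, names, and events in the text. Is there a way to remove these text-message style words (e.g. gonna, wanna, ...) from the corpus? Are there stopword lists for them? Any help would be appreciated.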

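【Comments】:

You might want to take a look at ark.cs.cmu.edu/TweetNLP

Tags: r nlp text-mining stop-words

【Solution 1】:

The text mining package (tm) maintains its own list of stopwords and provides useful tools for managing and summarizing this type of text.

Let's say your tweets are stored in a vector.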

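library(tm)
# vector_of_strings: a character vector with one tweet per element
words <- vector_of_strings
corpus <- Corpus(VectorSource(words))
corpus <- tm_map(corpus, removePunctuation)
# content_transformer() wraps a plain function so it can be used in tm_map (tm >= 0.6)
corpus <- tm_map(corpus, content_transformer(tolower))
# stopwords() with no argument returns tm's English list
corpus <- tm_map(corpus, removeWords, stopwords())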

You can use the removeWords line with your own list of stopwords, for example:

stoppers <- c(stopwords(), "gonna", "wanna", "lol", ... )
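As a rough illustration of how such a custom list might be applied, the sketch below runs removeWords with an extended list on two made-up tweets. The sample tweets and the names tweets and cleaned are assumptions for the example, not part of the original answer. It also assumes tm 0.6 or later, where content_transformer is available.

```r
library(tm)

# Hypothetical sample tweets, for illustration only
tweets <- c("gonna watch the game lol", "wanna see the new movie")

# tm's English stopwords plus a few hand-picked slang terms
stoppers <- c(stopwords("en"), "gonna", "wanna", "lol")

corpus <- VCorpus(VectorSource(tweets))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stoppers)
# removeWords leaves gaps where words were deleted; collapse them
corpus <- tm_map(corpus, stripWhitespace)

# Pull the cleaned text back out as a plain character vector
cleaned <- unname(trimws(sapply(corpus, content)))
print(cleaned)
```

The slang terms are lower case because the corpus is lower-cased before removeWords runs; entries in any other case would not match.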

Unfortunately, you will have to build your own list of "text message" or "internet messaging" stopwords.
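One cheap starting point when building such a list, offered here as a sketch rather than as part of the original answer: tm's English list stores contractions with their apostrophes ("don't", "can't"), so once punctuation has been stripped from the tweets, forms like "dont" and "cant" no longer match. Adding apostrophe-free variants covers several of the words in the question's output. The names base_words and no_apostrophe are made up for this example.

```r
library(tm)

base_words <- stopwords("en")

# Contractions with the apostrophe stripped, e.g. "don't" -> "dont"
no_apostrophe <- gsub("'", "", base_words, fixed = TRUE)

# Keep both spellings, without duplicates
stoppers <- unique(c(base_words, no_apostrophe))

print(c("dont", "cant", "youre", "thats") %in% stoppers)
```

Be aware that stripping apostrophes also produces a few ordinary words ("we'll" becomes "well", "she'll" becomes "shell"), so it may be worth reviewing the additions before using them.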

However, you can cheat a little by borrowing from NetLingo (http://vps.netlingo.com/acronyms.php)

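library(XML)
theurl <- "http://vps.netlingo.com/acronyms.php"
# Download and parse the page
doc <- htmlParse(theurl)
# Select the link nodes that hold the acronyms
nodes <- getNodeSet(doc, "//ul/li/span//a")
# Extract the text of each node as a character vector
stoppers <- sapply(nodes, xmlValue)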
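A scraped list like this probably needs some cleanup before it goes into removeWords. The sketch below assumes the entries may come back in upper case, with stray whitespace, or containing symbols. Because removeWords builds a regular expression out of the words it is given, entries with special characters could misbehave. The scraped vector is a made-up stand-in for the real page content, and normalize_stoppers is a hypothetical helper name.

```r
# Hypothetical stand-in for the scraped values; the real list may differ
scraped <- c("LOL", " BRB ", "lol", "?", "2moro", "")

normalize_stoppers <- function(x) {
  # Match the lower-cased corpus and drop surrounding whitespace
  x <- tolower(trimws(x))
  # Keep only plain alphanumeric tokens; this also drops empty strings
  x <- x[grepl("^[a-z0-9]+$", x)]
  unique(x)
}

slang <- normalize_stoppers(scraped)
print(slang)
```

The result can then be combined with the standard list, for instance c(stopwords(), slang), and passed to removeWords as before.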

【Comments】: