【Question title】: R : Text Analysis - tm Package - stemComplete error
【Posted】: 2015-02-20 01:11:34
【Question】:

Machine: Windows 7, 64-bit. R version: R version 3.1.2 (2014-10-31), "Pumpkin Helmet".

I am preparing some text for an analysis, and everything works until I reach 'stemCompletion'. More context follows.

Packages:

tm
SnowballC
rJava
RWeka
RWekajars
NLP

Example word list

test <- as.vector(c('win', 'winner', 'wins', 'wins', 'winning'))

Convert to a corpus

Test_Corpus <- Corpus(VectorSource(test))

Text manipulation

Test_Corpus <- tm_map(Test_Corpus, content_transformer(tolower))
Test_Corpus <- tm_map(Test_Corpus, removePunctuation)
Test_Corpus <- tm_map(Test_Corpus, removeNumbers)

Stemming with tm_map from the tm package

Test_stem <- tm_map(Test_Corpus, stemDocument, language = 'english')

Here are the results of the stemming above, which are correct so far:

win
winner
win
win
win

Now here is the problem. I try to use Test_Corpus as the dictionary to convert the stems back into complete words with the following code:

Test_complete <- tm_map(Test_stem, stemCompletion, Test_Corpus)

This is the message I receive:

Warning messages:

1: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
argument 'pattern' has length > 1 and only the first element will be used
2: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
argument 'pattern' has length > 1 and only the first element will be used
3: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
argument 'pattern' has length > 1 and only the first element will be used
4: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
argument 'pattern' has length > 1 and only the first element will be used
5: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
argument 'pattern' has length > 1 and only the first element will be used
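
The call quoted in the warning, grep(sprintf("^%s", w), dictionary, value = TRUE), builds one anchored pattern per stem, so it expects w to be a single string. The base-R sketch below reproduces the same kind of message outside tm. The dictionary vector and the two-element w are made up for illustration, and the idea that tm_map hands stemCompletion a whole document object rather than one word at a time is an assumption about the cause, not something confirmed in this thread.

```r
# Illustrative dictionary of complete words (not taken from tm itself).
dictionary <- c("win", "winner", "wins", "winning")

# One stem at a time: the pattern has length 1 and grep() returns every
# dictionary entry that starts with the stem.
single <- grep(sprintf("^%s", "win"), dictionary, value = TRUE)
print(single)

# Several strings at once: sprintf() returns a pattern vector of length 2,
# which grep() complains about. The condition is captured so its text can
# be inspected; both handlers are present because this sketch does not
# assume how a given R version signals it.
w <- c("win", "winner")
msg <- tryCatch(
  {
    grep(sprintf("^%s", w), dictionary, value = TRUE)
    NA_character_  # reached only if grep() raised no condition
  },
  warning = function(cond) conditionMessage(cond),
  error = function(cond) conditionMessage(cond)
)
print(msg)
```

If that assumption holds, the practical consequence is that stemCompletion should be given a character vector in which every element is one stem.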

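I have already tried several things listed in earlier posts, which other people with the same problem had also tried, without luck. Here is the list:

Updating Java
Using content_transformer
Using PlainTextDocument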

【Comments】:

I'm not sure your formatting came out the way you intended. Indent code blocks (including comments) and try to avoid overusing headings.

Tags: regex r text tm stemming

【Solution 1】:

I think you need to save your Test_Corpus as a dictionary before the stemming step. You could try something like Test_Corpus <- corpus, then go on to stem, and use corpus later in Test_complete <- tm_map(corpus, stemCompletion).
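
The idea of keeping the unstemmed words around as the dictionary can be sketched without tm_map at all. The example below is one possible approach rather than a confirmed fix for the corpus-based code in the question: it calls stemDocument and stemCompletion directly on a character vector, so each stem is a single string. The variable names words_in, dictionary, stems and completed are illustrative, and type = "shortest" is chosen here only to make the result easy to predict. It assumes the tm and SnowballC packages are installed.

```r
library(tm)         # provides stemDocument() and stemCompletion()
library(SnowballC)  # stemmer backend used by stemDocument()

# The example words from the question.
words_in <- c("win", "winner", "wins", "wins", "winning")

# Keep the unstemmed words as the completion dictionary before stemming.
dictionary <- unique(words_in)

# Stem each word; every element of the vector is a single token.
stems <- stemDocument(words_in, language = "english")
print(stems)

# Complete each stem against the dictionary. With type = "shortest",
# the shortest dictionary entry starting with the stem is chosen.
completed <- stemCompletion(stems, dictionary = dictionary, type = "shortest")
print(unname(completed))
```

To get back to a corpus, the completed vector can be passed to Corpus(VectorSource(...)) again, in the same way the question builds Test_Corpus.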

【Comments】:

Wouldn't that do the same thing as simply changing the name of the corpus when stemming?