How to use a regular expression inside TermDocumentMatrix for text mining?

【Question title】: How to use a regular expression inside TermDocumentMatrix for text mining?
【Posted】: 2013-08-22 14:18:21
【Question】:

I know that I can use the tm package, via the Dictionary function, to count how often specific words occur in a corpus:

require(tm)
data(crude)

dic <- Dictionary("crude")
tdm <- TermDocumentMatrix(crude, control = list(dictionary = dic, removePunctuation = TRUE))
inspect(tdm)

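Is there a facility for supplying a regular expression to Dictionary instead of a fixed word?

Sometimes stemming is not what I want (for example, I may want to pick up misspellings), so I would like to do something like this: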

dic <- Dictionary(c("crude",
                    "\\bcrud[[:alnum:]]+",
                    "\\bcrud[de]"))

and then carry on using the rest of the tm package's functionality?

【Question comments】:

Tags: regex r text-mining tm

【Solution 1】:

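I'm not sure you can put a regular expression into the Dictionary function, since it only accepts a character vector or a term-document matrix. The workaround I'd suggest is to use a regular expression to subset the terms of the term-document matrix and then do the word counts: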

# What I would do instead
tdm <- TermDocumentMatrix(crude, control = list(removePunctuation = TRUE))
# subset the tdm according to the criteria
# this is where you can use regex
crit <- grep("cru", tdm$dimnames$Terms)
# have a look to see what you got (terms are rows, so index the rows)
inspect(tdm[crit, ])
# A term-document matrix (2 terms, 20 documents)
#
# Non-/sparse entries: 10/30
# Sparsity           : 75%
# Maximal term length: 7
# Weighting          : term frequency (tf)
#
#          Docs
# Terms     127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 502 543
#   crucial   0   0   0   0   0   0   2   0   0   0   0   0   0   0   0   0   0   0
#   crude     2   0   2   3   0   2   0   0   0   0   5   2   0   2   0   0   0   2
#          Docs
# Terms     704 708
#   crucial   0   0
#   crude     0   1

# and count the number of times that criteria is met in each doc
colSums(as.matrix(tdm[crit, ]))
# 127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 502 543 704 708
#   2   0   2   3   0   2   2   0   0   0   5   2   0   2   0   0   0   2   0   1

# count the total number of times in all docs
sum(colSums(as.matrix(tdm[crit, ])))
# [1] 23

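If that's not what you're after, go ahead and edit your question to include some example data that properly represents your actual use case, along with an example of the output you want.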
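If the goal is to keep working with a dictionary-restricted matrix, one possible variation on the workaround is to expand the regular expressions into the concrete terms they match and pass that character vector as the dictionary. The following is only a sketch: it assumes a tm version (0.6 or later, as far as I know) in which the `dictionary` control option takes a plain character vector, and the names `ctrl`, `all_terms`, `patterns`, `dic` and `tdm_dic` are illustrative, not part of tm. The patterns are adapted from the question and anchored with `^` and `$`, because each term in the matrix is already a single token.

```r
library(tm)
data(crude)

# Shared preprocessing options for both passes
ctrl <- list(removePunctuation = TRUE)

# First pass: build the full matrix only to collect the vocabulary
all_terms <- Terms(TermDocumentMatrix(crude, control = ctrl))

# Expand the regular expressions into the terms they actually match
patterns <- c("^crude$", "^crud[[:alnum:]]+$")
dic <- grep(paste(patterns, collapse = "|"), all_terms, value = TRUE)

# Second pass: restrict the matrix to the expanded dictionary
tdm_dic <- TermDocumentMatrix(crude, control = c(ctrl, list(dictionary = dic)))
inspect(tdm_dic)
```

The cost is that the corpus is processed twice; for a large corpus, subsetting the rows of a single matrix as shown above is likely cheaper.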

【Comments】:

【Solution 2】:

The text analysis package quanteda lets you select features with a regular expression if you specify valuetype = "regex".

require(tm)
require(quanteda)
data(crude)

dfm(corpus(crude), keptFeatures = "^cru", valuetype = "regex", verbose = FALSE)
# Document-feature matrix of: 20 documents, 2 features.
# 20 x 2 sparse Matrix of class "dfmSparse"
#      features
# docs  crude crucial
#   127     2       0
#   144     0       0
#   191     2       0
#   194     3       0
#   211     0       0
#   236     2       0
#   237     0       2
#   242     0       0
#   246     0       0
#   248     0       0
#   273     5       0
#   349     2       0
#   352     0       0
#   353     2       0
#   368     0       0
#   489     0       0
#   502     0       0
#   543     2       0
#   704     0       0
#   708     1       0

See also ?selectFeatures.
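The call above reflects the quanteda interface at the time the answer was written. In more recent releases, to the best of my knowledge, `dfm()` no longer has a `keptFeatures` argument and feature selection is done with `dfm_select()`. A rough equivalent, assuming quanteda 3 or later and that `corpus()` still accepts a tm corpus, might look like the sketch below; `toks`, `dfmat` and `dfm_cru` are illustrative names.

```r
library(tm)
library(quanteda)
data(crude)

# Convert the tm corpus to a quanteda corpus and tokenize it
toks <- tokens(corpus(crude), remove_punct = TRUE)

# Build the document-feature matrix from the tokens
dfmat <- dfm(toks)

# Keep only the features whose names match the regular expression
dfm_cru <- dfm_select(dfmat, pattern = "^cru", selection = "keep",
                      valuetype = "regex")

featnames(dfm_cru)  # the features that survived the selection
colSums(dfm_cru)    # total count of each selected feature
```

Passing `selection = "remove"` instead drops the matching features and keeps everything else.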

【Comments】: