R: How do I keep punctuation with TermDocumentMatrix()?

【Question title】: R: How do I keep punctuation with TermDocumentMatrix()?
【Posted】: 2015-11-27 10:01:53
【Question】:

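I have a large data frame in which I identify patterns in strings and then extract them. I have provided a small subset to illustrate the task. I generate my patterns by creating a TermDocumentMatrix of multi-word terms. I then use these patterns with stri_extract and str_replace from the stringi and stringr packages to search within the "punct_prob" data frame.

My problem is that I need to keep the punctuation in "punct_prob$description" intact to preserve the literal meaning of each string. For example, I can't have 2.35mm become 235mm. However, the TermDocumentMatrix procedure I am using removes punctuation (or at least periods), so my pattern-search functions can no longer match the strings.

In short: how do I keep punctuation when generating a TDM? I tried including removePunctuation=FALSE in the TermDocumentMatrix control argument, without success.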

library(tm)
punct_prob = data.frame(description = tolower(c("CONTRA ANGLE HEAD 2:1 FOR 2.35mm BUR",
                                    "TITANIUM LINE MINI P.B F.O. TRIP SPRAY",
                                    "TITANIUM LINE POWER P. B F.O. TRIP SPR",
                                    "MEDESY SPECIAL ITEM")))

punct_prob$description = as.character(punct_prob$description)

# a control for the number of words in phrases
max_ngram = max(sapply(strsplit(punct_prob$description, " "), length))

#set up ngrams and tdm
BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = max_ngram, max = max_ngram))}
punct_prob_corpus = Corpus(VectorSource(punct_prob$description))
punct_prob_tdm <- TermDocumentMatrix(punct_prob_corpus, control = list(tokenize = BigramTokenizer, removePunctuation=FALSE))
inspect(punct_prob_tdm)

Inspecting the result shows that the punctuation is gone:

                                   Docs
Terms                              1 2 3 4
  angle head 2 1 for 2 35mm bur    1 0 0 0
  contra angle head 2 1 for 2 35mm 1 0 0 0
  line mini p b f o trip spray     0 1 0 0
  line power p b f o trip spr      0 0 1 0
  titanium line mini p b f o trip  0 1 0 0
  titanium line power p b f o trip 0 0 1 0

Thanks in advance for your help :)

【Comments】:

Tags: r tm punctuation term-document-matrix

【Solution 1】:

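The problem is not TermDocumentMatrix but the RWeka-based n-gram tokenizer. RWeka removes punctuation when it tokenizes.

If you use the NLP tokenizer instead, the punctuation is kept. See the code below.

P.S. I removed a space in your third text line, so that "P. B" becomes "P.B", as it is in line 2.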

library(tm)
punct_prob = data.frame(description = tolower(c("CONTRA ANGLE HEAD 2:1 FOR 2.35mm BUR",
                                                "TITANIUM LINE MINI P.B F.O. TRIP SPRAY",
                                                "TITANIUM LINE POWER P.B F.O. TRIP SPR",
                                                "MEDESY SPECIAL ITEM")))
punct_prob$description = as.character(punct_prob$description)

max_ngram = max(sapply(strsplit(punct_prob$description, " "), length))

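# VCorpus rather than Corpus: in recent tm versions Corpus() may return a
# SimpleCorpus, for which custom tokenizers are ignored.
punct_prob_corpus = VCorpus(VectorSource(punct_prob$description))

# n-gram tokenizer built on ngrams() and words() from the NLP package
# (attached together with tm); no punctuation is stripped along the way.
NLPBigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), max_ngram), paste, collapse = " "), use.names = FALSE)
}

punct_prob_tdm <- TermDocumentMatrix(punct_prob_corpus, control = list(tokenize = NLPBigramTokenizer))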
inspect(punct_prob_tdm)

<<TermDocumentMatrix (terms: 3, documents: 4)>>
Non-/sparse entries: 3/9
Sparsity           : 75%
Maximal term length: 38
Weighting          : term frequency (tf)

                                        Docs
Terms                                    1 2 3 4
  contra angle head 2:1 for 2.35mm bur   1 0 0 0
  titanium line mini p.b f.o. trip spray 0 1 0 0
  titanium line power p.b f.o. trip spr  0 0 1 0
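
To see why the choice of tokenizer matters, here is a minimal base R sketch. It does not call RWeka or NLP. It only assumes, for illustration, that one tokenizer treats punctuation as a delimiter (which reproduces the tokens seen in the question's output) while the other splits on whitespace alone.

```r
x <- "contra angle head 2:1 for 2.35mm bur"

# Assumption for illustration: punctuation acts as a delimiter,
# so "2:1" and "2.35mm" fall apart into separate tokens.
split_on_punct <- strsplit(x, "[[:space:][:punct:]]+")[[1]]

# Splitting on whitespace only leaves punctuation inside each token.
split_on_space <- strsplit(x, "[[:space:]]+")[[1]]

print(split_on_punct)
print(split_on_space)
```

The first split yields nine tokens, which is why the question's 8-grams contain "2 1 for 2 35mm". The second yields seven tokens with "2:1" and "2.35mm" intact.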

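【Comments】:

【Solution 2】:

The quanteda package is smart enough to tokenize without treating intra-word punctuation as "punctuation". That makes building your matrix very easy: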

txt <- c("CONTRA ANGLE HEAD 2:1 FOR 2.35mm BUR",
         "TITANIUM LINE MINI P.B F.O. TRIP SPRAY",
         "TITANIUM LINE POWER P.B F.O. TRIP SPR",
         "MEDESY SPECIAL ITEM")

require(quanteda)
myDfm <- dfm(txt, ngrams = 6:8, concatenator = " ")
t(myDfm)
#                                        docs
# features                                text1 text2 text3 text4
#   contra angle head for 2.35mm bur          1     0     0     0
#   titanium line mini p.b f.o trip           0     1     0     0
#   line mini p.b f.o trip spray              0     1     0     0
#   titanium line mini p.b f.o trip spray     0     1     0     0
#   titanium line power p.b f.o trip          0     0     1     0
#   line power p.b f.o trip spr               0     0     1     0
#   titanium line power p.b f.o trip spr      0     0     1     0

If you want to keep the "punctuation", it is tokenized as a separate token when it ends a term:

myDfm2 <- dfm(txt, ngrams = 8, concatenator = " ", removePunct = FALSE)
t(myDfm2)
#                                          docs
# features                                  text1 text2 text3 text4
#   titanium line mini p.b f.o . trip spray     0     1     0     0
#   titanium line power p.b f.o . trip spr      0     0     1     0

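Note that the ngrams argument is fully flexible and can take a vector of n-gram sizes, as in the first example, where ngrams = 6:8 indicates that it should form 6-, 7-, and 8-grams.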
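
The idea of passing a vector of n-gram sizes can be sketched in base R as well. This is only an illustration of the concept, not quanteda's implementation: ngrams_of_sizes is a hypothetical helper, and it splits on whitespace only, so a trailing period stays attached to its token.

```r
# Hypothetical helper (not part of quanteda or tm): returns the word n-grams
# of every requested size from a single string, splitting on whitespace only.
ngrams_of_sizes <- function(x, sizes) {
  tokens <- strsplit(trimws(x), "[[:space:]]+")[[1]]
  out <- lapply(sizes, function(n) {
    # A string shorter than n tokens contributes no n-grams of that size.
    if (n > length(tokens)) return(character(0))
    starts <- seq_len(length(tokens) - n + 1)
    vapply(starts,
           function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
           character(1))
  })
  unlist(out, use.names = FALSE)
}

res <- ngrams_of_sizes("titanium line mini p.b f.o. trip spray", 6:8)
print(res)
```

The string has seven tokens, so it produces two 6-grams and one 7-gram, and the size 8 is skipped silently.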

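【Comments】:

Thanks @Ken. I'll play with this soon. I like the idea of variable n-gram lengths, which is one of the reasons I turned to the RWeka tokenizer in the first place.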