R: consider punctuation to do word segmentation

【Question title】: R: consider punctuation to do word segmentation
【Posted】: 2018-03-02 07:53:37
【Question】:

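I am using NGramTokenizer() to split text into 1- to 3-grams, but it does not seem to take punctuation into account: the punctuation is simply stripped out.

As a result, the segmentation is not ideal for me.

(For example, the results include "oxidant amino", "oxidant amino acid", "pellet oxidant", and so on, all of which span a comma in the original text.)

Is there a segmentation method that preserves punctuation? (I think I could then filter out the strings that contain punctuation after segmentation, for instance with POS tagging.)

Or is there another way to take punctuation into account when segmenting? That would suit me even better.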

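library(tm)     # VCorpus, VectorSource, DocumentTermMatrix
library(RWeka)  # NGramTokenizer, Weka_control

text <-  "the slurry includes: attrition pellet, oxidant, amino acid and water."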

corpus_text <- VCorpus(VectorSource(text))
content(corpus_text[[1]])

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
dtm <-  DocumentTermMatrix(corpus_text, control = list(tokenize = BigramTokenizer))
mat <- as.matrix(dtm)
colnames(mat)

 [1] "acid"                      "acid and"                  "acid and water"           
 [4] "amino"                     "amino acid"                "amino acid and"           
 [7] "and"                       "and water"                 "attrition"                
[10] "attrition pellet"          "attrition pellet oxidant"  "includes"                 
[13] "includes attrition"        "includes attrition pellet" "oxidant"                  
[16] "oxidant amino"             "oxidant amino acid"        "pellet"                   
[19] "pellet oxidant"            "pellet oxidant amino"      "slurry"                   
[22] "slurry includes"           "slurry includes attrition" "the"                      
[25] "the slurry"                "the slurry includes"       "water"

【Comments】:

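If all you need is to respect the punctuation, perhaps you could tokenize on the punctuation itself (with a regular expression). Would that work?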
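The regex idea in the comment above could be sketched roughly as follows in base R. This is only one possible reading of the suggestion, not code from the original thread: the text is first split into segments at punctuation, and n-grams are then built inside each segment, so no n-gram spans a punctuation mark. The helper name `segment_ngrams` and its parameters `n_min` and `n_max` are made up for this illustration.

```r
text <- "the slurry includes: attrition pellet, oxidant, amino acid and water."

# Split the text into segments at runs of punctuation, then tidy up.
segments <- unlist(strsplit(text, "[[:punct:]]+"))
segments <- trimws(segments)
segments <- segments[nzchar(segments)]

# Hypothetical helper: all n-grams of sizes n_min..n_max from one segment.
segment_ngrams <- function(segment, n_min = 1, n_max = 3) {
  words <- strsplit(segment, "[[:space:]]+")[[1]]
  out <- character(0)
  for (n in n_min:n_max) {
    if (length(words) < n) break  # segment too short for this n
    starts <- seq_len(length(words) - n + 1)
    grams <- vapply(
      starts,
      function(i) paste(words[i:(i + n - 1)], collapse = " "),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

# N-grams are built per segment, so none of them crosses a punctuation mark.
ngrams <- unique(unlist(lapply(segments, segment_ngrams)))
print(ngrams)
```

With this approach "amino acid" is kept, while "oxidant amino" and "pellet oxidant" never appear. One caveat: `[[:punct:]]` also matches hyphens and apostrophes, so hyphenated words would be split unless the pattern is narrowed.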

Tags: r tm text-segmentation

【Solution 1】:

You can use the tokens function from the quanteda package as follows:

library(quanteda)
text <- "some text, with commas, and semicolons; and even fullstop. to be toekinzed"
tokens(text, what = "word", remove_punct = FALSE, ngrams = 1:3)

Output:

tokens from 1 document.
text1 :
 [1] "some"              "text"              ","                 "with"             
 [5] "commas"            ","                 "and"               "semicolons"       
 [9] ";"                 "and"               "even"              "fullstop"         
[13] "."                 "to"                "be"                "toekinzed"        
[17] "some text"         "text ,"            ", with"            "with commas"      
[21] "commas ,"          ", and"             "and semicolons"    "semicolons ;"     
[25] "; and"             "and even"          "even fullstop"     "fullstop ."       
[29] ". to"              "to be"             "be toekinzed"      "some text ,"      
[33] "text , with"       ", with commas"     "with commas ,"     "commas , and"     
[37] ", and semicolons"  "and semicolons ;"  "semicolons ; and"  "; and even"       
[41] "and even fullstop" "even fullstop ."   "fullstop . to"     ". to be"          
[45] "to be toekinzed"

For details on each argument of the function, see the documentation.
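Once punctuation is kept as separate tokens, the n-grams that touch a punctuation mark can be dropped afterwards, which is the filtering step the question has in mind. The sketch below uses a plain regular expression instead of POS tagging, as a simpler stand-in, and works on a handful of n-grams copied by hand from the output above. The variable names are illustrative only.

```r
# A few of the n-grams from the output above, typed in by hand for illustration.
ngrams <- c("some text", "text ,", ", with", "with commas",
            "commas , and", "and even fullstop", "fullstop . to")

# TRUE for every n-gram that contains at least one punctuation character.
has_punct <- grepl("[[:punct:]]", ngrams)

# Keep only the n-grams made up purely of words.
clean_ngrams <- ngrams[!has_punct]
print(clean_ngrams)
```

The same filter could be applied to the full character vector of n-grams, under the assumption that no legitimate term contains punctuation (hyphenated words would be dropped too).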

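Update: for document term frequencies, have a look at Constructing a document-frequency matrix.

Try the following as an example:

For bigrams (note that you do not need to tokenize first):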

dfm(text, remove_punct = FALSE, ngrams = 2, concatenator = " ")
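The calls in this answer match the quanteda release that was current when it was written. As far as I am aware, later releases no longer accept the `ngrams` argument in `tokens()` and expect `dfm()` to receive a tokens object rather than raw text, so it is worth checking against the installed version. Assuming a recent quanteda, a roughly equivalent sketch would be:

```r
library(quanteda)

text <- "some text, with commas, and semicolons; and even fullstop. to be toekinzed"

# Tokenize first, keeping punctuation marks as separate tokens.
toks <- tokens(text, what = "word", remove_punct = FALSE)

# Form 1- to 3-grams, joining the tokens with a space.
toks_ngrams <- tokens_ngrams(toks, n = 1:3, concatenator = " ")

# Build the document-feature matrix from the bigrams only.
bigram_dfm <- dfm(tokens_ngrams(toks, n = 2, concatenator = " "))
featnames(bigram_dfm)
```

Here the tokenization step is explicit, and the n-gram construction is a separate call to `tokens_ngrams()` before the matrix is built.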

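【Comments】:

This looks like a good way to get the segmentation I want. However, I need to convert these strings into a DTM after segmentation. Is it possible to build a DTM without using a corpus?
@Eva I have updated the answer to cover the document term frequencies; I hope it helps.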

【Solution 2】:

You could probably pass the corpus through tm_map before building the DTM, for example:

library(tm)     # VCorpus, tm_map, DocumentTermMatrix
library(RWeka)  # NGramTokenizer, Weka_control

text <-  "the slurry includes: attrition pellet, oxidant, amino acid and water."

corpus_text <- VCorpus(VectorSource(text))
content(corpus_text[[1]])


clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, removePunctuation) # strip punctuation marks
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removeWords, c(stopwords("en"), "and")) #ignoring "and"
  return(corpus)
}

corpus_text <- clean_corpus(corpus_text)
content(clean_corpus(corpus_text)[[1]])
#" slurry includes attrition pellet oxidant amino acid water"

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
dtm <-  DocumentTermMatrix(corpus_text, control = list(tokenize = BigramTokenizer))
mat <- as.matrix(dtm)
colnames(mat)
