您可以使用更智能的标记器,例如 quanteda 包中的标记器,其中removePunct = TRUE 将自动删除括号。
quanteda::tokenize(txt, removePunct = TRUE)
## tokenizedText object from 1 document.
## Component 1 :
## [1] "Other" "segment" "comprised" "of" "our" ## "active" "pharmaceutical"
## [8] "ingredient" "API" "business" "which"
添加:
如果您想首先标记文本,那么您需要 lapply 和 gsub,直到我们在 quanteda 中将正则表达式 valuetype 添加到 removeFeatures.tokenizedTexts()。但这会起作用:
# tokenized version
require(quanteda)
toks <- tokenize(txt, what = "fasterword", simplify = TRUE)
toks[-grep("^\\(.*\\)$", toks)]
## [1] "Other" "segment" "comprised" "of" "our" "active"
## [7] "pharmaceutical" "ingredient" "business,which..."
如果您只是想删除问题中的括号表达式,那么您不需要 tm 或 quanteda:
# exactly as in the question
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt)
## [1] "Other segment comprised of our active pharmaceutical ingredient business,which..."
# with added punctuation
txt2 <- "ingredient (API), business,which..."
txt3 <- "ingredient (API). New sentence..."
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt2)
## [1] "ingredient, business,which..."
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt3)
## [1] "ingredient. New sentence..."
较长的正则表达式还可以捕获括号表达式结束句子或后跟附加标点符号(例如逗号)的情况。