Remove all words in string containing punctuation (R): answers

【Question title】: Remove all words in string containing punctuation (R)
【Posted】: 2019-06-06 05:14:50
【Question description】:

How can I (in R) remove every word in a string that contains punctuation, while keeping the words that do not?

  test.string <- "I am:% a test+ to& see if-* your# fun/ction works o\r not"

  desired <- "I a see works not"

【Question discussion】:

Tags: r regex string gsub

【Solution 1】:

Here is an approach using gsub that seems to work:

test.string <- "I am:% a test$ to& see if* your# fun/ction works o\r not"
gsub("[A-Za-z]*[^A-Za-z ]\\S*\\s*", "", test.string)

[1] "I a see works not"

This approach uses the following regex pattern:

[A-Za-z]*     match zero or more leading letters
[^A-Za-z ]    then match exactly one symbol (any character that is neither a letter nor a space)
\\S*          followed by zero or more non-whitespace characters
\\s*          followed by any amount of whitespace

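We then simply replace each match with an empty string, which removes every word that contains one or more symbols.

【Discussion】:

【Solution 2】:

You can use this regex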
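
As a minimal sketch, the pattern above can be wrapped in a small helper. The name `drop_punct_words` is hypothetical and does not come from the answer, and the `trimws()` call is an added assumption: when the last word contains a symbol, the pattern leaves the space in front of it behind.

```r
# Hypothetical helper (the name is illustrative): drop every space-separated
# word that contains a character other than A-Z or a-z.
drop_punct_words <- function(x) {
  out <- gsub("[A-Za-z]*[^A-Za-z ]\\S*\\s*", "", x)
  # Assumption: a stray space at either end is unwanted, so trim it.
  trimws(out)
}

test.string <- "I am:% a test$ to& see if* your# fun/ction works o\r not"
drop_punct_words(test.string)
drop_punct_words("keep this, drop that!")
```

Note that digits count as symbols under this pattern, so a word such as `abc123` is removed as well.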

(?<=\\s|^)[a-z0-9]+(?=\\s|$)

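(?<=\\s|^) - positive lookbehind: the match must be preceded by whitespace or the start of the string.
[a-z0-9]+ - matches one or more letters or digits (lowercase only as written; the R version below uses [A-Za-z0-9]).
(?=\\s|$) - positive lookahead: the match must be followed by whitespace or the end of the string.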

Demo

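Tim's edit:

This answer takes a whitelist approach: it identifies all the words the OP does want to keep in the output. We can match with the regex pattern given above, then use paste to join the vector of matches: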

test.string <- "I am:% a test$ to& see if* your# fun/ction works o\\r not"
result <- regmatches(test.string,gregexpr("(?<=\\s|^)[A-Za-z0-9]+(?=\\s|$)",test.string, perl=TRUE))[[1]]
paste(result, collapse=" ")

[1] "I a see works not"
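
One caveat, sketched under an assumption: the question's string contains `o\r` (a real carriage return), whereas the string above contains `o\\r` (a backslash followed by `r`). Because `\\s` also matches a carriage return, the lookahead succeeds after `o` in the question's version, so `o` is kept. If only a literal space should separate words (an assumption, not something the answer states), negative lookarounds on `[^ ]` are one option. The helper name `extract_words` is made up for illustration.

```r
# The question's string: "\r" here is a real carriage return.
cr.string <- "I am:% a test$ to& see if* your# fun/ction works o\r not"

# Hypothetical wrapper around the regmatches/gregexpr idiom used above.
extract_words <- function(pattern, x) {
  m <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
  paste(m, collapse = " ")
}

# Pattern from the answer: "\r" counts as whitespace, so "o" survives.
extract_words("(?<=\\s|^)[A-Za-z0-9]+(?=\\s|$)", cr.string)

# Variant: a word may only touch a space or a string edge.
extract_words("(?<![^ ])[A-Za-z0-9]+(?![^ ])", cr.string)
```

The first call keeps `o` in its result, while the second drops it and matches the desired output from the question.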

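【Discussion】:

You need to escape \\s
@TimBiegeleisen An explanation would help me learn something new. If you don't mind, could you explain your first comment?
@CodeManiac I edited your answer so that it works in R. Not sure which of our answers would perform better.
@TimBiegeleisen Oh, I see, you meant it specifically for R. Thanks for the edit anyway. I don't know R, since I'm new to the industry, but I know this regex is fairly cross-platform and supported by most languages, with some limitations, so I added this answer.
OK, but for future reference: if a question is tagged with a specific programming language, then the OP, and anyone who reads it later, will probably expect a working solution in that language. That said, your answer is valid, as you can see in this demo.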

【Solution 3】:

Here are a few more approaches

First approach:

library(stringr)  # str_split(), str_detect(); also re-exports %>%

str_split(test.string, " ", n=Inf) %>%  # split the line into words
unlist %>% 
.[!str_detect(., "\\W|\r")] %>%         # keep words without punctuation or \r
paste(.,collapse=" ")                   # collapse the words back into one line
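
For readers without stringr, a rough base R equivalent of the first approach might look like the sketch below. It is not the answerer's code; the explicit class `[^A-Za-z0-9_]` stands in for `\\W` as an assumption about what counts as a word character, and the `nzchar()` filter is an added guard against empty pieces produced by doubled spaces.

```r
test.string <- "I am:% a test$ to& see if* your# fun/ction works o\r not"

# Split on single spaces, as the stringr version does.
words <- strsplit(test.string, " ", fixed = TRUE)[[1]]

# Keep non-empty words made only of letters, digits, or underscore.
clean <- words[nzchar(words) & !grepl("[^A-Za-z0-9_]", words)]

result <- paste(clean, collapse = " ")
result
```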

Second approach:

str_extract_all(test.string, "^\\w+|\\s\\w+\\s|\\w+$") %>% 
unlist %>% 
trimws() %>% 
paste(., collapse=" ")

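^\\w+ - a run of [a-zA-Z0-9_] characters at the start of the string
\\s\\w+\\s - a run of [a-zA-Z0-9_] characters with whitespace before and after it
\\w+$ - a run of [a-zA-Z0-9_] characters at the end of the string

【Discussion】: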
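
One limitation worth noting: `\\s\\w+\\s` consumes the space after a word, so when two clean words sit next to each other, the second has no leading space left to match and is skipped. The sketch below compares the pattern with a lookaround variant, which checks the neighbouring characters without consuming them. It uses base R with `perl = TRUE` rather than stringr, and the helper name `extract_all` and the sample string are made up for illustration.

```r
# Hypothetical helper: extract all matches, trim them, and join with spaces.
extract_all <- function(pattern, x) {
  m <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
  paste(trimws(m), collapse = " ")
}

sample.string <- "a b c d"

# Alternation pattern from the answer: " b " eats the space before "c".
extract_all("^\\w+|\\s\\w+\\s|\\w+$", sample.string)

# Lookaround variant: neighbours are checked but not consumed.
extract_all("(?<!\\S)\\w+(?!\\S)", sample.string)
```

With the alternation pattern the word `c` is lost, while the lookaround variant returns all four words.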