How to remove a pattern of text ending with a colon in R? Answers

【Question title】: How to remove a pattern of text ending with a colon in R?
【Posted】: 2019-07-12 13:34:44
【Question】:

I have the following sentence:

review <- c("1a. How long did it take for you to receive a personalized response to an internet or email inquiry made to THIS dealership?: Approx. It was very prompt however. 2f. Consideration of your time and responsiveness to your requests.: Were a little bit pushy but excellent otherwise 2g. Your satisfaction with the process of coming to an agreement on pricing.: Were willing to try to bring the price to a level that was acceptable to me. Please provide any additional comments regarding your recent sales experience.: Abel is awesome! Took care of everything from welcoming me into the dealership to making sure I got the car I wanted (even the color)! ")

I want to remove everything before the colon (:).

I tried the following code:

gsub("^[^:]+:","",review)

However, it removes only the first sentence ending with a colon.

Expected result:

Approx. It was very prompt however. Were a little bit pushy but excellent otherwise Were willing to try to bring the price to a level that was acceptable to me. Abel is awesome! Took care of everything from welcoming me into the dealership to making sure I got the car I wanted (even the color)!

Any help or suggestions would be appreciated. Thanks.

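【Question comments】:

Your question is unclear. "Everything before" could include all of the characters. Is it a single sentence?
So you only want to remove 1a., 2f., 2g. and :? Are these characters the same in every row?
Sorry, I meant that I want to get rid of all the questions in the sentence and keep only the responses. In my case the questions end with a colon, which is why I said everything before the colon.
Try gsub("(?:\\d+[a-zA-Z]\\.)?[^.?!:]*[?!.]:\\s*", "", review)
It would be great if you could explain the regex.

Tags: r regex gsub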

【Solution 1】:

If the sentences are not complex and contain no abbreviations, you can use

gsub("(?:\\d+[a-zA-Z]\\.)?[^.?!:]*[?!.]:\\s*", "", review)

See the regex demo.

Note that you can generalize it further by changing \\d+[a-zA-Z] to [0-9a-zA-Z]+ / [[:alnum:]]+ so that it matches 1 or more digits or letters.
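
As a rough illustration of that generalization, the sketch below compares the two prefixes on labels such as 12. and Q3., which the \\d+[a-zA-Z] prefix does not cover. The input strings and variable names are made up, and perl = TRUE is my own choice (the answer calls gsub without it).

```r
# Hypothetical inputs whose labels are not "digits followed by one letter"
x <- c("12. How was it?: Fine.", "Q3. Rate us.: Good.")

orig_pat  <- "(?:\\d+[a-zA-Z]\\.)?[^.?!:]*[?!.]:\\s*"
broad_pat <- "(?:[[:alnum:]]+\\.)?[^.?!:]*[?!.]:\\s*"

res_orig  <- gsub(orig_pat,  "", x, perl = TRUE)  # "12.Fine." "Q3.Good."
res_broad <- gsub(broad_pat, "", x, perl = TRUE)  # "Fine."    "Good."

# Caveat: the broader prefix can also swallow the last word of a reply
# when an unlabeled prompt follows it
caveat <- gsub(broad_pat, "", "Fine by me. Any comments.: None", perl = TRUE)
# "Fine by None"
```

The caveat is worth checking against your own data before widening the prefix: any word directly followed by a dot can be mistaken for a label.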

Details

(?:\d+[a-zA-Z]\.)? - an optional sequence of:
- \d+ - 1 or more digits
- [a-zA-Z] - an ASCII letter
- \. - a dot
[^.?!:]* - 0 or more characters other than ., ?, ! and :
[?!.] - a ?, ! or .
: - a colon
\s* - 0 or more whitespace characters
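
To see how those pieces fit together, here is a small sketch that rebuilds the pattern from its parts and lists what it matches. The variable names and the sample string are invented for illustration, and perl = TRUE is an assumption on my side (the answer calls gsub without it).

```r
# Assemble the pattern from the components described above
label    <- "(?:\\d+[a-zA-Z]\\.)?"  # optional label such as 1a.
question <- "[^.?!:]*"              # question text without . ? ! :
ending   <- "[?!.]"                 # closing punctuation of the question
colon    <- ":"
spaces   <- "\\s*"
pattern  <- paste0(label, question, ending, colon, spaces)

# Hypothetical sample with two labeled questions
x <- "1a. Was the reply quick?: Yes. 2f. Rate our service.: Good"

# The substrings that gsub() would delete
matched <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
# "1a. Was the reply quick?: " "2f. Rate our service.: "

cleaned <- gsub(pattern, "", x, perl = TRUE)
# "Yes. Good"
```

Inspecting the matches with regmatches() before deleting them is a cheap way to confirm that only the questions are being removed.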

R test:

> gsub("(?:\\d+[a-zA-Z]\\.)?[^.?!:]*[?!.]:\\s*", "", review)
[1] "Approx. It was very prompt however. Were a little bit pushy but excellent otherwise Were willing to try to bring the price to a level that was acceptable to me.Abel is awesome! Took care of everything from welcoming me into the dealership to making sure I got the car I wanted (even the color)! "
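
In the output above, "me.Abel" has lost its space: the last prompt has no label, so the match begins at the space that follows "me.". One possible workaround, offered only as a sketch, is to let the pattern consume leading whitespace as well, substitute a single space, and trim the ends. The leading \\s*, the " " replacement, trimws() and perl = TRUE are my additions rather than part of the answer, and the sample string is made up.

```r
# Hypothetical sample: the second prompt has no label
x <- "1a. Was the reply quick?: Yes. It was prompt. Any other comments.: Abel is awesome!"

answer_pat <- "(?:\\d+[a-zA-Z]\\.)?[^.?!:]*[?!.]:\\s*"
spaced_pat <- paste0("\\s*", answer_pat)  # also take the whitespace before a question

as_posted <- gsub(answer_pat, "", x, perl = TRUE)
# "Yes. It was prompt.Abel is awesome!"

# Replace each question with one space, then trim the ends
respaced <- trimws(gsub(spaced_pat, " ", x, perl = TRUE))
# "Yes. It was prompt. Abel is awesome!"
```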

Extending to handle abbreviations

If you add an alternation, you can enumerate the exceptions:

gsub("(?:\\d+[a-zA-Z]\\.)?(?:i\\.?e\\.|[^.?!:])*[?!.]:\\s*", "", review)     
                          ^^^^^^^^^^^^^^^^^^^^^^

Here, (?:i\.?e\.|[^.?!:])* matches 0 or more occurrences of either an ie. or i.e. substring, or any character other than ., ?, ! or :.

See this demo.
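
A short sketch of the difference the alternation makes, using a made-up string that contains i.e. inside the question (the sample text, variable names and perl = TRUE are assumptions, not taken from the thread):

```r
# Hypothetical question containing the abbreviation "i.e."
x <- "4c. Rate the condition (i.e. cleanliness).: Thanks for the wash!"

basic_pat <- "(?:\\d+[a-zA-Z]\\.)?[^.?!:]*[?!.]:\\s*"
abbr_pat  <- "(?:\\d+[a-zA-Z]\\.)?(?:i\\.?e\\.|[^.?!:])*[?!.]:\\s*"

# The basic pattern cannot cross the dots in "i.e.", so part of the question stays
basic <- gsub(basic_pat, "", x, perl = TRUE)
# "4c. Rate the condition (i.e.Thanks for the wash!"

# The alternation treats "i.e." as ordinary question text
fixed <- gsub(abbr_pat, "", x, perl = TRUE)
# "Thanks for the wash!"
```

Other abbreviations would need their own branch in the alternation, placed before the [^.?!:] branch in the same way.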

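【Comments】:

For a sentence such as "4c. Please rate the condition of your vehicle when it was returned to you (i.e. cleanliness, undamaged).: Thanks a lot for the wash!", the regex does not return the expected result. What should I do?
@gamyanaidu I stated at the very start: if there are no abbreviations. If there are, you can add them manually, e.g. (?:\d+[a-zA-Z]\.)?(?:i\.?e\.|[^.?!:])*[?!.]:\s*, see this demo.
Perfect answer. Thanks a lot.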