【Question title】: Easy regex is confusing
【Posted】: 2012-12-24 04:58:14
【Question description】:

I can't seem to extract the email address from the following phrase:

“mailto:fwwrp-3492801490@yahoo.com?”

So far I have tried:

regexpr(":([^\\?*]?)", phrase)

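The logic behind the code is as follows:

Start at the colon (:)
Take every character that is not a question mark
Return the characters matched inside the parentheses.

I'm not sure where my regex goes wrong.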

【Discussion】:

Tags: regex r

【Solution 1】:

Let's look at your regex and see where it goes wrong. We'll pull it apart to make it easier to discuss:

:            Just a literal colon, no worries here.
(            Open a capture group.
    [        Open a character class, this will match one character.
        ^    The leading ^ means "negate this class"
        \\   This ends up as a single \ when the regex engine sees it and that will
             escape the next character.
        ?    This has no special meaning inside a character class, sometimes a
             question mark is just a question mark and this is one of those
             times. Escaping a simple character doesn't do anything interesting.
        *    Again, we're in a character class so * has no special meaning.
    ]        Close the character class.
    ?        Zero or one of the preceding pattern.
)            Close the capture group.

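Stripping out the noise leaves us with :([^?*]?).

So your regex actually matches:

A colon followed by zero or one character that is neither a question mark nor an asterisk; that single character, if present, ends up in the first capture group.

That is quite different from what you are trying to do. A small adjustment should fix it: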

:([^?]*)

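This matches:

A colon followed by any number of characters that are not question marks; those characters end up in the first capture group.

Outside a character class, * is special and means "zero or more"; inside a character class it is just a literal *.

I'll leave it to others to help with the R side of things; I just wanted you to understand what was going on with the regex.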
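For readers who want to see the difference in R, here is a minimal sketch comparing the two patterns. It assumes base R's `regexec` and `regmatches` are an acceptable way to pull out a capture group (note that `regexpr` on its own only reports the position and length of the match, not the captured text); the names `m_old` and `m_new` are made up for illustration.

```r
phrase <- "mailto:fwwrp-3492801490@yahoo.com?"

# Original pattern: the character class is followed by ?, so the group
# can hold at most one character.
m_old <- regmatches(phrase, regexec(":([^\\?*]?)", phrase))[[1]]
m_old
# [1] ":f" "f"

# Adjusted pattern: the * now sits outside the class and means "zero or more".
m_new <- regmatches(phrase, regexec(":([^?]*)", phrase))[[1]]
m_new[2]
# [1] "fwwrp-3492801490@yahoo.com"
```

The first element of each result is the whole match and the second is the first capture group, which is why `m_new[2]` holds the address.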

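【Discussion】:

Thanks for breaking it down. I see where my mistake was and what I should actually be doing. This was really helpful.
@user1103294: Thanks, I like to think of myself as a fishing instructor :)

【Solution 2】:

Here is a very simple approach using gsub: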

gsub("([a-z]+:)(.*)([?]$)", "\\2", "mailto:fwwrp-3492801490@yahoo.com?")
## Or, if you expect things other than characters before the colon
gsub("(.*:)(.*)([?]$)", "\\2", "mailto:fwwrp-3492801490@yahoo.com?")
## Or, discarding the first and third groups since they aren't very useful
gsub(".*:(.*)[?]$", "\\1", "mailto:fwwrp-3492801490@yahoo.com?")

Building on where @TylerRinker started, you could also use strsplit as follows (to avoid needing a separate gsub to strip out the question mark):

strsplit("mailto:fwwrp-3492801490@yahoo.com?", ":|\\?", fixed=FALSE)[[1]][2]
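To see why the `[[1]][2]` index picks out the address, it can help to look at the full result of the split. This is a small sketch on the same string; the names `x` and `parts` are introduced here purely for illustration.

```r
x <- "mailto:fwwrp-3492801490@yahoo.com?"

# Split on either a colon or a literal question mark.
# strsplit returns a list with one element per input string.
parts <- strsplit(x, ":|\\?")[[1]]

# A match at the very end of the string does not produce a trailing
# empty piece, so only the scheme and the address remain.
length(parts)
# [1] 2

parts[2]
# [1] "fwwrp-3492801490@yahoo.com"
```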

What if you have a vector of such strings?

phrase <- c("mailto:fwwrp-3492801490@yahoo.com?", 
            "mailto:somefunk.y-address@Sqmpalm.net?")
phrase
# [1] "mailto:fwwrp-3492801490@yahoo.com?"  
# [2] "mailto:somefunk.y-address@Sqmpalm.net?"

## Using gsub
gsub("(.*:)(.*)([?]$)", "\\2", phrase)
# [1] "fwwrp-3492801490@yahoo.com"     "somefunk.y-address@Sqmpalm.net"

## Using strsplit
sapply(phrase, 
       function(x) strsplit(x, ":|\\?", fixed=FALSE)[[1]][2], 
       USE.NAMES=FALSE)
# [1] "fwwrp-3492801490@yahoo.com"     "somefunk.y-address@Sqmpalm.net"

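I prefer the gsub approach for its succinctness.

【Discussion】:

Thanks Anado, I'll review your answer as well. I like how your implementation allows for gibberish before the colon.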
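As a possible variation, the negated character class from Solution 1 can be combined with `sub`, which is vectorised just like `gsub`. This is only a sketch under the assumption that every string contains a colon before the address; the name `addresses` is hypothetical.

```r
phrase <- c("mailto:fwwrp-3492801490@yahoo.com?",
            "mailto:somefunk.y-address@Sqmpalm.net?")

# Skip everything up to the first colon, capture the run of characters
# that are not question marks, then consume the rest of the string so
# that only the captured group survives the replacement.
addresses <- sub("^[^:]*:([^?]*).*$", "\\1", phrase)
addresses
# [1] "fwwrp-3492801490@yahoo.com"     "somefunk.y-address@Sqmpalm.net"
```

A string that does not match the pattern at all (for example, one without a colon) is returned unchanged, so it may be worth checking the input first if that case can occur.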