R's equivalent of Python's re.findall

【Question title】: R's equivalent of Python's re.findall
【Posted】: 2017-09-10 03:01:11
【Question】:

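I'm trying to get all matches of a regular expression from a string, but apparently that's not so easy in R, or I'm overlooking something. Honestly, it's really confusing, and I find myself lost among all the options: str_extract, str_match, str_match_all, regexec, grep, gregexpr, and who knows how many others.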

All I really want to accomplish is this (in Python):

>>> import re
>>> re.findall(r'([\w\']+|[.,;:?!])', 'This is starting to get really, really annoying!!')
['This', 'is', 'starting', 'to', 'get', 'really', ',', 'really', 'annoying', '!', '!']

The problem with the functions mentioned above is that they either return a single match or return no match at all.

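【Comments】:

Wait, is this Python 3 or Python 2? \w has different default behavior in the two.
It's Python 3, but I'm trying to find a general solution to the problem, not necessarily one for this particular regex.
OK, I'll add more to the answer.

Tags: python r regex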

【Answer 1】:

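In general, R has no exact equivalent of Python's re.findall, which returns either a list of match values or a list of tuples holding the capturing-group submatches. The closest is str_match_all from the stringr package, but it is also very close to Python's re.finditer: it returns the match value in the first item and all submatches (capturing group contents) in the subsequent items. It is still not an exact equivalent of re.finditer, since only text is returned, not match data objects. So, if str_match_all did not return the whole match value, it would be an exact equivalent of Python's re.findall.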
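To make the comparison concrete, here is a small Python sketch of the two return shapes described above. The sample string and the two-group pattern are made up for illustration and are not taken from the question.

```python
import re

# Hypothetical sample input and pattern, chosen only to show the return shapes.
text = "key=value; other=thing"
pattern = r"(\w+)=(\w+)"  # two capturing groups

# re.findall: with several groups, each match becomes a tuple of group
# contents; the whole match itself is not included.
pairs = re.findall(pattern, text)
print(pairs)  # [('key', 'value'), ('other', 'thing')]

# re.finditer: yields match objects; group(0) is the whole match and
# group(1), group(2) are the submatches.
rows = [(m.group(0), m.group(1), m.group(2)) for m in re.finditer(pattern, text)]
print(rows)  # [('key=value', 'key', 'value'), ('other=thing', 'other', 'thing')]
```

The rows built from finditer have the layout described for str_match_all: the whole match first, then the capturing-group contents.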

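You are using re.findall only to return matches, not captures, so the capturing group in your pattern is redundant and you may remove it. You can therefore safely use regmatches with gregexpr and the PCRE flavor (perl=TRUE), since [\\w'] won't work with a TRE regex: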

s <- "This is starting to get really, really annoying!!"
res <- regmatches(s, gregexpr("[\\w']+|[.,;:?!]", s, perl=TRUE))
## => [[1]]
## [1] "This"     "is"      "starting" "to"       "get"      "really"  
## [7] ","        "really"   "annoying" "!"        "!"

See the R demo
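As a quick check on the Python side that the capturing group really is redundant here, the sketch below runs the question's pattern with and without the outer group. With exactly one group wrapping the whole pattern, re.findall returns that group's text, which is the same as the full match.

```python
import re

s = "This is starting to get really, really annoying!!"

# Pattern from the question, with the single outer capturing group.
with_group = re.findall(r"([\w']+|[.,;:?!])", s)

# Same pattern with the group removed.
without_group = re.findall(r"[\w']+|[.,;:?!]", s)

print(with_group == without_group)  # True
print(without_group)
# ['This', 'is', 'starting', 'to', 'get', 'really', ',', 'really', 'annoying', '!', '!']
```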

Alternatively, to make \w Unicode-aware so that it works as it does in Python 3, add the (*UCP) PCRE verb:

res <- regmatches(s, gregexpr("(*UCP)[\\w']+|[.,;:?!]", s, perl=TRUE))

See another R demo
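For reference, this is the Python 3 behavior that (*UCP) is meant to mirror, sketched with a made-up accented string. In Python 3, \w on a str pattern matches Unicode word characters by default, and the re.ASCII flag restricts it to [a-zA-Z0-9_].

```python
import re

# Hypothetical sample: "naïve café", written with escapes to avoid encoding issues.
s = "na\u00efve caf\u00e9"

# Default in Python 3: \w is Unicode-aware for str patterns.
unicode_words = re.findall(r"\w+", s)
print(unicode_words)  # ['naïve', 'café']

# With re.ASCII, the accented letters no longer count as word characters,
# so the words are split or truncated at them.
ascii_words = re.findall(r"\w+", s, re.ASCII)
print(ascii_words)  # ['na', 've', 'caf']
```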

If you want to use the stringr package (which uses the ICU regex library behind the scenes), you need str_extract_all:

library(stringr)
res <- str_extract_all(s, "[\\w']+|[.,;:?!]")

【Comments】:

Damn, str_extract_all, the only one I hadn't checked out... Thanks for the clarification about \w.