RegEx with Python: findall inside a boundary

【Question title】: RegEx with Python: findall inside a boundary
【Posted】: 2017-09-30 22:54:58
【Question】:

I have a string that can be illustrated as follows (note the extra whitespace):

"words that don't matter   START    some words one       some words two     some words three   END    words that don't matter"

To get every substring between START and END, i.e. ['some words one', 'some words two', 'some words three'], I wrote the following code:

import re  # `string` holds the sample text shown above
result = re.search(r'(?<=START).*?(?=END)', string, flags=re.S).group()
result = re.findall(r'(\(?\w+(?:\s\w+)*\)?)', result)

Is it possible to achieve this with a single regular expression?

【Comments】:

Tags: python regex findall

【Solution 1】:

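In theory, you could wrap your second regex in ()* and put it inside your first one. That would capture every occurrence of the inner expression within the boundaries. Unfortunately, the Python implementation keeps only the last match of a group that matches more than once. The only implementation I know of that keeps all matches of a group is the .NET one. So unfortunately, that is not a solution for you.

On the other hand, why can't you simply keep the two-step approach you already have?

Edit: You can compare the behaviour I described using online regex testers.

Pattern: (\w+\s*)* Input: aaa bbb ccc

Take https://pythex.org/ and http://regexstorm.net/tester for example. You will see that Python returns a single match/group, namely ccc, while .NET returns $1 as three captures: aaa, bbb, ccc.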
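The behaviour described above can be reproduced locally with the standard re module. This is only a small sketch using the pattern and input from the edit; it shows that group 1 holds just the final repetition even though the whole match covers all three words.

```python
import re

# Pattern and input taken from the answer's edit.
m = re.match(r'(\w+\s*)*', 'aaa bbb ccc')

print(m.group(0))  # the whole match: 'aaa bbb ccc'
print(m.group(1))  # only the last repetition of the group: 'ccc'
```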

Edit 2: As @Jan pointed out, the newer regex module also supports multiple captures. I had completely forgotten about that.
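A minimal sketch of that multi-capture support, assuming the third-party regex module is installed (pip install regex). Its match objects offer a captures() method that the standard re module does not have. The helper name all_captures is made up for illustration, and the sketch simply returns None when the module is missing.

```python
try:
    import regex  # third-party: pip install regex
except ImportError:
    regex = None  # lets the sketch load even when the module is absent


def all_captures(pattern, text, group=1):
    """Return every capture of a repeated group, or None without `regex`."""
    if regex is None:
        return None
    m = regex.match(pattern, text)
    # captures() lists each repetition of the group, not just the last one.
    return m.captures(group) if m else []


print(all_captures(r'(\w+\s*)*', 'aaa bbb ccc'))
```

With the module installed, each capture keeps its trailing whitespace because \s* sits inside the group.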

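【Comments】:

"Why can't you simply keep the two-step approach you have?" I will, but it made me wonder whether a single regex pattern could do it, since I'm trying to learn as much as I can. Funny: I just realized I had actually done this in an earlier piece of code: actors = re.findall(r'Actors[\n\r\t]([\w\s\-\'\,]*)[\n\r\t]Stage', crew) This works, but the source material is slightly different, and I can't find a way to make it work with the original example.

【Solution 2】:

With the newer regex module, you can do it in one step: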

(?:\G(?!\A)|START)\s*\K
(?!\bEND\b)
\w+\s+\w+\s+\w+

This looks complicated, but broken down, it says:

(?:\G(?!\A)|START)  # look for START or the end of the last match
\s*\K               # whitespaces, \K "forgets" all characters to the left
(?!\bEND\b)         # neg. lookahead, do not overrun END
\w+\s+\w+\s+\w+     # your original expression

In Python, this looks like:

import regex as re

rx = re.compile(r'''
        (?:\G(?!\A)|START)\s*\K
        (?!\bEND\b)
        \w+\s+\w+\s+\w+''', re.VERBOSE)

string = "words that don't matter   START    some words one       some words two     some words three   END    words that don't matter"

print(rx.findall(string))
# ['some words one', 'some words two', 'some words three']

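See also a demo on regex101.com.

【Comments】:

This is what I was looking for: a single-regex solution. It's a fairly new module, right? I wasn't aware of it. I also still need to learn about the IF x THEN | ELSE possibilities in regex.
@LeandroRibeiro: Indeed it is. Have a look at regexone.com and rexegg.com (fairly advanced, but great).
I changed your regex a bit. This way it grabs every substring regardless of the number of words. My example has three-word substrings, but I need it to match an unknown number of words in each one: (?:\G(?!\A)|START)\s*\K (?!\bEND\b) \w+ (?:\s\w+)*
Note: on Regex101, this code does not work if I switch to Python: (regex101.com/r/oLFVRk/3)
@LeandroRibeiro: Indeed. After all, regex101.com is only an emulator, and it uses the re module internally. You need to install the regex module first via pip install regex before you can use it.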
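The commenter's variable-length variant can be tried out roughly as follows. This is a sketch that assumes the third-party regex module is installed (pip install regex); the names PATTERN, grab_phrases and sample are invented for illustration, and the function returns None when the module is unavailable.

```python
try:
    import regex  # third-party: pip install regex
except ImportError:
    regex = None  # lets the sketch load even when the module is absent

# Commenter's pattern: one or more words per phrase instead of exactly three.
PATTERN = r'''
    (?:\G(?!\A)|START)\s*\K   # START, or continue where the last match ended
    (?!\bEND\b)               # do not run past END
    \w+(?:\s\w+)*             # words separated by single whitespace characters
'''


def grab_phrases(text):
    """Return the phrases between START and END, or None without `regex`."""
    if regex is None:
        return None
    return regex.findall(PATTERN, text, flags=regex.VERBOSE)


sample = "junk   START    one       two words     three more words   END    junk"
print(grab_phrases(sample))
```

The standard re module rejects this pattern because it supports neither \G nor \K, which is consistent with the comment that it fails when regex101 is switched to Python.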

【Solution 3】:

This is an ideal case for re.split, as @PeterE mentioned, to get around the problem of only having access to the last captured group.

import re
s=r'"words that don\'t matter   START    some words one       some words two     some words three   END    words that don\'t matter" START abc  a bc c   END'
print('\n'.join(re.split(r'^.*?START\s+|\s+END.*?START\s+|\s+END.*?$|\s{2,}',s)[1:-1]))

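Enable the re.MULTILINE (re.M) flag when the input spans several lines, because the pattern uses ^ and $. The snippet above does not pass the flag, which is fine for this single-line sample.

Output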

some words one
some words two
some words three
abc
a bc c
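For comparison, here is a different standard-library sketch (not the answerer's method) that avoids one long alternation: it first finds each START ... END block with re.finditer and then splits the block on runs of two or more whitespace characters. The function name phrases_between and its parameters are hypothetical, and it assumes phrases are separated by at least two whitespace characters, as in the sample.

```python
import re


def phrases_between(text, start='START', end='END'):
    """Collect the phrases from every start ... end block in text."""
    block = re.compile(re.escape(start) + r'(.*?)' + re.escape(end), re.S)
    phrases = []
    for m in block.finditer(text):
        # Two or more whitespace characters separate phrases inside a block.
        parts = re.split(r'\s{2,}', m.group(1).strip())
        phrases.extend(p for p in parts if p)
    return phrases


s = ("words that don't matter   START    some words one       some words two"
     "     some words three   END    words that don't matter START abc  a bc c   END")
print(phrases_between(s))
```

This is still two steps per block, so it trades the single-expression goal for a pattern that is easier to read.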

【Comments】:

This is elegant. Thanks.