Answers: multi-line pattern matching in Python

【Question title】: multi-line pattern matching in python
【Posted】: 2011-02-12 18:48:40
【Question】:

A periodic computer-generated message (simplified):

Hello user123,

- (604)7080900
- 152
- minutes

Regards

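Using python, how can I extract "(604)7080900", "152" and "minutes" (i.e. any text that follows a leading "- " pattern) between the two empty lines? The empty lines are the \n\n after "Hello user123" and the \n\n before "Regards". Even better if the resulting strings end up in a list. Thanks!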

Edit: the number of lines between the two empty lines is not fixed.

Second edit:

For example

hello

- x1
- x2
- x3

- x4

- x6
morning
- x7

world

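x1, x2 and x3 are good because the whole run is enclosed by 2 empty lines; x4 is good for the same reason. x6 is not good because it is not followed by an empty line, and x7 is not good because it is not preceded by an empty line. x2 is good (unlike x6 and x7) because the line before it is a good line and the line after it is also a good line.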

This condition may not have been clear when I posted the question:

a continuous run of good lines between 2 empty lines

good line must have leading "- "
good line must follow an empty line or follow another good line
good line must be followed by an empty line or followed by another good line

Thanks

【Comments on the question】:

Tags: python regex multiline

【Solution 1】:

>>> import re
>>>
>>> x="""Hello user123,
...
... - (604)7080900
... - 152
... - minutes
...
... Regards
... """
>>>
>>> re.findall(r"\n+\n-\s*(.*)\n-\s*(.*)\n-\s*(minutes)\s*\n\n+",x)
[('(604)7080900', '152', 'minutes')]
>>>
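
The pattern above hard-codes exactly three "- " lines. As a rough sketch only (not part of the original answer), a variable number of lines could be handled in two steps: first capture each run of "- " lines that sits between two empty lines, then strip the prefix from each line. The names `BLOCK` and `extract_blocks` are made up for illustration, and the sketch assumes lines end in "\n" and that empty lines contain no whitespace.

```python
import re

# Hypothetical helper, not from the original answer.
#   (?<=\n\n)      the run must be preceded by an empty line
#   ((?:- .*\n)+)  one or more complete lines starting with "- "
#   (?=\n)         the run must be followed by an empty line
BLOCK = re.compile(r"(?<=\n\n)((?:- .*\n)+)(?=\n)")

def extract_blocks(text):
    # Drop the "- " prefix and trailing whitespace from every captured line.
    return [
        [line[2:].rstrip() for line in block.splitlines()]
        for block in BLOCK.findall(text)
    ]

x = """Hello user123,

- (604)7080900
- 152
- minutes

Regards
"""
print(extract_blocks(x))  # [['(604)7080900', '152', 'minutes']]
```

Because "." does not match a newline, a run that is interrupted by a non-"- " line fails the trailing lookahead and is skipped as a whole.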

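【Comments】:

@S.Mark, sorry, I didn't make the question clear; please see the edit about the undefined number of lines between the two empty lines.
@Horace, added \n+ to match more than 2 empty lines
@S.Mark, is it possible to take (minutes) out of the re? "minutes" does not have to appear on the last line.
@Horace, yes, if you don't want it in the result you can change (minutes) to .*.
Throwing out a complex one-liner like re.findall("\n+\n-\s*(.*)\n-\s*(.*)\n-\s*(minutes)\s*\n\n+",x) without any explanation in your answer is very bad practice.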

【Solution 2】:

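The simplest approach is to iterate over the lines (assuming you have a list of lines, or a file, or you split the string into a list of lines) until you see a line that is just '\n'. Then check that each following line starts with '- ' (using the startswith string method), slice the prefix off and store the result, until you find another empty line. For example: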

# if you have a single string, split it into lines.
L = s.splitlines()
# if you (now) have a list of lines, grab an iterator so we can continue
# iteration where it left off.
it = iter(L)
# Alternatively, if you have a file, just use that directly
# (instead of the two assignments above):
# it = open(....)

# Find the first empty line:
for line in it:
    # Treat lines of just whitespace as empty lines too. If you don't want
    # that, do 'if line == ""'.
    if not line.strip():
        break
# Now starts data. Collect the stripped lines here.
data = []
for line in it:
    if not line.rstrip():
        # End of data.
        break
    if line.startswith('- '):
        # Drop the '- ' prefix and any trailing whitespace.
        data.append(line[2:].rstrip())
    else:
        # misformed data?
        raise ValueError("misformed line %r" % (line,))

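Edited: since you elaborated on what you want to do, here is an updated version of the loop. It no longer loops twice; instead it collects data until it encounters a 'bad' line, and either saves or discards the collected lines when it encounters a block separator. It doesn't need an explicit iterator because it doesn't restart iteration, so you can just pass it a list (or any iterable) of lines: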

def getblocks(L):
    # The list of good blocks (as lists of lines.) You can also make this
    # a flat list if you prefer.
    data = []
    # The list of good lines encountered in the current block
    # (but the block may still become bad.)
    block = []
    # Whether the current block is bad.
    bad = 1
    for line in L:
        # Not in a 'good' block, and encountering the block separator.
        if bad and not line.rstrip():
            bad = 0
            block = []
            continue
        # In a 'good' block and encountering the block separator.
        if not bad and not line.rstrip():
            # Save 'good' data. Or, if you want a flat list of lines,
            # use 'extend' instead of 'append' (also below.)
            data.append(block)
            block = []
            continue
        if not bad and line.startswith('- '):
            # A good line in a 'good' (not 'bad' yet) block; save the line,
            # minus the '- ' prefix and trailing whitespace.
            block.append(line[2:].rstrip())
            continue
        else:
            # A 'bad' line, invalidating the current block.
            bad = 1
    # Don't forget to handle the last block, if it's good
    # (and if you want to handle the last block.)
    if not bad and block:
        data.append(block)
    return data

Here it is in action:

>>> L = """hello
...
... - x1
... - x2
... - x3
...
... - x4
...
... - x6
... morning
... - x7
...
... world""".splitlines()
>>> print(getblocks(L))
[['x1', 'x2', 'x3'], ['x4']]
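
For comparison, here is a different technique from the loop above, offered only as a sketch: split the whole text on empty lines and keep the chunks in which every line starts with '- '. The function name `good_blocks` is hypothetical. It assumes empty lines contain no whitespace, and unlike getblocks it ignores the first and last chunks, since those are not enclosed by empty lines on both sides.

```python
import re

def good_blocks(text):
    # Split on runs of two or more newlines, i.e. on empty lines.
    chunks = re.split(r"\n{2,}", text)
    result = []
    # Skip the first and last chunks: they are not enclosed by
    # empty lines on both sides.
    for chunk in chunks[1:-1]:
        lines = chunk.splitlines()
        # Keep the chunk only if every one of its lines is a '- ' line.
        if lines and all(line.startswith("- ") for line in lines):
            result.append([line[2:].rstrip() for line in lines])
    return result

text = """hello

- x1
- x2
- x3

- x4

- x6
morning
- x7

world"""
print(good_blocks(text))  # [['x1', 'x2', 'x3'], ['x4']]
```

The chunk containing x6, "morning" and x7 is rejected as a whole because one of its lines lacks the '- ' prefix.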

【Comments】:

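@Thomas Wouters, "for line" is not reliable (otherwise I would not have tagged this question multiline ;-) I can only start matching after "\n\n- " (two newlines, then a leading minus and a space)
Your question wasn't clear (and still isn't), but the basic approach stays the same. You can still iterate over the lines, but if you want me to write an example you will have to clarify what you actually have and what you really want. What if there are lines in between that don't start with '-'? What if there are multiple such blocks? What if the lines aren't empty but just contain some whitespace?
I'm still not sure whether my current answer doesn't work for you. (I don't see a second edit?)
Updated, see the "a continuous of good lines between 2 empty lines" part, thanks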

【Solution 3】:

>>> s = """Hello user123,

- (604)7080900
- 152
- minutes

Regards
"""
>>> import re
>>> re.findall(r'^- (.*)', s, re.M)
['(604)7080900', '152', 'minutes']
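
As a brief illustration of why this works (a sketch added for explanation, using the sample from the question's second edit): with re.M, "^" matches at the start of every line rather than only at the start of the string, and "(.*)" captures the rest of that line because "." does not match a newline. Note that this pattern looks at each line in isolation, so it does not check for the surrounding empty lines.

```python
import re

s = """hello

- x1
- x2
- x3

- x4

- x6
morning
- x7

world"""

# With re.M (re.MULTILINE), "^" matches at the start of every line.
with_flag = re.findall(r"^- (.*)", s, re.M)
print(with_flag)     # ['x1', 'x2', 'x3', 'x4', 'x6', 'x7']

# Without the flag, "^" only matches at the very start of the string,
# which here is "hello", so nothing is found.
without_flag = re.findall(r"^- (.*)", s)
print(without_flag)  # []
```

x6 and x7 appear in the result even though the second edit calls them bad lines, so this approach fits the original message format but not the stricter block rule.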

【Comments】:

You should add at least one line to your answer explaining how re.findall(r'^- (.*)', s, re.M) works.

【Solution 4】:

s = """Hello user123,

- (604)7080900
- 152
- minutes

Regards  

Hello user124,

- (604)8576576
- 345
- minutes
- seconds
- bla

Regards"""

Do this:

result = []
for data in s.split('Regards'): 
    result.append([v.strip() for v in data.split('-')[1:]])
del result[-1] # remove empty list at end

This gives:

>>> result
[['(604)7080900', '152', 'minutes'],
['(604)8576576', '345', 'minutes', 'seconds', 'bla']]
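
One caveat, shown as a small sketch: splitting on every '-' also splits values that contain a hyphen themselves. The hyphenated number below is invented for illustration and does not come from the question; checking the '- ' prefix line by line is one possible way around it.

```python
# Made-up message with a hyphenated value, for illustration only.
msg = """Hello user123,

- 604-708-0900
- 152

Regards"""

body = msg.split('Regards')[0]

# Splitting on every '-' cuts the hyphenated number into pieces.
naive = [v.strip() for v in body.split('-')[1:]]
print(naive)    # ['604', '708', '0900', '152']

# Testing the '- ' prefix per line keeps each value whole.
by_line = [line[2:].strip() for line in body.splitlines()
           if line.startswith('- ')]
print(by_line)  # ['604-708-0900', '152']
```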

【Comments】: