Matching paragraphs below a title using regex in Python: answers

【Question title】: Matching paragraphs below title using regex, Python
【Posted】: 2014-12-04 18:14:34
【Question description】:

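I have a section that I need to match. My condition is: match everything, including the title. I have already matched the pattern for the title, and I need to match the paragraphs that start with "fig". I have done that as well, but I noticed that as soon as the regex hits a line that does not match, it stops matching anything further.
Another condition is that a paragraph with fewer than 3 words should not be matched.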

Here is the sample text:

List of tables and figure captions:

Figure 1 shows study area and locations of borewell and surface water sampling  points. Low lying area on the western side is clearly visible.


Figure 2 displays nothing much.
no match
here


Fig.y yhth hyt htyh hyt htyh th thyt htyht thh

Table xvnm,mcxnv  bvv nd vdm v

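There can be any number of blank lines between paragraphs. What happens here is that, after the end of the first line of the paragraph starting with Figure 2, the following words are not matched because they do not start with "Fig", even though the paragraph after them does start with "Fig". How can I get the match to continue through to the Fig.y line?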

Here is my regex:

'((?:^(?:Supp[elmntary]*\s|list\sof\s)?[^\n]*Fig[ures]*[^\n]*(?:Captions?|Legends?|Lists?)[^\n])(?:(?!^)[^\n]+|(?!\n\w+\s*\w+\s*:?\s*$)\n|Fig)*)'

Flags used: re.I, re.M, re.S (DOTALL)

I tried adding a lookahead:

(?:.*^Fig[^\n]*$){0,}

But this doesn't work, because I can't find a way to skip the lines containing "no match" and "here".

Any help is appreciated. I'll be using re.findall.

【Question discussion】:

Tags: python regex python-2.7

【Solution 1】:

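New answer: I may not have fully understood your requirements, but I'll try again. I'm assuming that the correct regex for capturing titles can be plugged in from your original regex.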

# Python 2.7
# Typos may exist, didn't test yet
import re

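def emitRecord(matches):
  # Print one record: the title line followed by its matching figure lines.
  # Single-argument print(...) behaves the same on Python 2.7 and Python 3.
  if len(matches) > 0:
    print("----- Start record -----")
    print("\n".join(matches))
    print("----- End record -----")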

matches = []
seenTitle = False
titleRegex = re.compile(r'expression to capture titles here')
figureRegex = re.compile(r'^(?:fig|figure)[^a-z]', re.I)
with open('text.txt', 'r') as text:
  for line in text:
    if not line.strip(): continue
    if titleRegex.search(line):
      seenTitle = True
      emitRecord(matches)
      matches = [line.strip()]
    elif seenTitle:
      if len(line.split()) < 3: continue
      if figureRegex.search(line): matches.append(line.strip())
emitRecord(matches)
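
For reference, here is a self-contained sketch of the same line-by-line idea, run against the sample text from the question instead of a file. The names `SAMPLE`, `TITLE_RE`, `FIGURE_RE` and `collect_records` are made up for illustration, and `TITLE_RE` is only a simplified stand-in for the asker's title pattern, so treat it as an assumption rather than the real expression.

```python
import re

# Sample text copied from the question.
SAMPLE = """List of tables and figure captions:

Figure 1 shows study area and locations of borewell and surface water sampling  points. Low lying area on the western side is clearly visible.


Figure 2 displays nothing much.
no match
here


Fig.y yhth hyt htyh hyt htyh th thyt htyht thh

Table xvnm,mcxnv  bvv nd vdm v
"""

# Hypothetical, simplified title pattern (an assumption, not the asker's regex).
TITLE_RE = re.compile(
    r'^(?:supp\w*\s+|list\s+of\s+)?.*fig\w*.*(?:captions?|legends?|lists?)',
    re.I)
# A line counts as a figure paragraph if it starts with "fig" or "figure"
# followed by a non-letter (space, dot, digit, ...).
FIGURE_RE = re.compile(r'^fig(?:ure)?[^a-z]', re.I)


def collect_records(lines):
    """Group figure lines under the most recent title line."""
    records = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines never end a record
        if TITLE_RE.search(line):
            current = [line]  # a title starts a new record
            records.append(current)
        elif (current is not None
              and len(line.split()) >= 3  # skip lines with fewer than 3 words
              and FIGURE_RE.search(line)):
            current.append(line)
    return records


records = collect_records(SAMPLE.splitlines())
```

Because non-matching lines are simply skipped instead of ending the record, the "no match" and "here" lines do not stop the scan before it reaches the Fig.y line.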

【Discussion】:

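Sorry, this doesn't work for me. My text is spread out in XML, and you haven't accounted for variations in the title. The above can be done simply with a regex. Lookahead on the title is the main challenge (and one of the reasons there are no other answers yet).
I've given it another try. Sorry if I've missed the mark again; I may still not have fully understood your use case.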
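
Since the comment above suggests this can be done with regular expressions alone, here is one possible two-pass sketch: first locate the heading, then run `re.findall` over the text that follows it. It assumes plain text (not the XML mentioned in the comment), and `HEADING_RE`, `PARA_RE` and `figure_paragraphs` are hypothetical names; the heading pattern is a simplified assumption, not the asker's expression.

```python
import re

# Sample text copied from the question.
TEXT = """List of tables and figure captions:

Figure 1 shows study area and locations of borewell and surface water sampling  points. Low lying area on the western side is clearly visible.


Figure 2 displays nothing much.
no match
here


Fig.y yhth hyt htyh hyt htyh th thyt htyht thh

Table xvnm,mcxnv  bvv nd vdm v
"""

# Pass 1: a heading line mentioning figures plus captions/legends/lists
# (simplified assumption about what a heading looks like).
HEADING_RE = re.compile(
    r'^[^\n]*fig\w*[^\n]*(?:captions?|legends?|lists?)[^\n]*$',
    re.I | re.M)

# Pass 2: a whole line that starts with "fig"/"figure" followed by a
# non-letter. The lookahead requires at least three whitespace-separated
# words on that same line.
PARA_RE = re.compile(
    r'^(?=\S+(?:[ \t]+\S+){2})fig(?:ure)?[^a-z\n][^\n]*',
    re.I | re.M)


def figure_paragraphs(text):
    """Return the heading followed by every qualifying figure line after it."""
    heading = HEADING_RE.search(text)
    if heading is None:
        return []
    body = text[heading.end():]
    # findall scans the whole body, so lines that do not match are skipped
    # rather than ending the search.
    return [heading.group(0).strip()] + PARA_RE.findall(body)


result = figure_paragraphs(TEXT)
```

Splitting the work into two passes avoids one large expression that has to step over unrelated lines, which is where the original pattern stopped matching.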