Find a paragraph and find a string inside this paragraph with REGEX

【Question title】: Find a paragraph and find a string inside this paragraph with REGEX
【Posted】: 2014-10-22 15:19:09
【Question description】:

I have some lines like this in an HTML page:

<div>
    <p class="match"> this sentence should match </p> 
    some text
    <a class="a"> some text </a>  
</div>
<div> 
    <p class="match"> this sentence shouldnt match</p> 
    some text
    <a class ="b"> some text </a> 
</div>

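I want to extract the line inside <p class="match">, but only if the div also contains <a class="a">.

What I have done so far is below (I first find the paragraphs that contain <a class="a">, then iterate over the results to find the sentence inside <p class="match">):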

import re
file_to_r = open("a")

regex_div = re.compile(r'<div>.+"a".+?</div>', re.DOTALL)

regex_match = re.compile(r'<p class="match">(.+)</p>')
for m in regex_div.findall(file_to_r.read()):
    print(regex_match.findall(m))

But I would like to know whether there is another (still efficient) way to do this in a single pass?

【Question comments】:

Try BeautifulSoup 4 to parse the HTML file.
stackoverflow.com/a/1732454

Tags: python html regex html-parsing

【Solution 1】:

Use an HTML parser, for example BeautifulSoup.

Find the a tag with class a, then find its previous sibling, the p tag with class match:

from bs4 import BeautifulSoup

data = """
<div>
    <p class="match"> this sentence should match </p>
    some text
    <a class="a"> some text </a>
</div>
<div>
    <p class="match"> this sentence shouldn't match</p>
    some text
    <a class ="b"> some text </a>
</div>
"""

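# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, "html.parser")
a = soup.find('a', class_='a')
# The wanted <p> is an earlier sibling of the <a> inside the same <div>
print(a.find_previous_sibling('p', class_='match').text)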

Prints:

this sentence should match

Also, see here for why regular expressions should be avoided for parsing HTML:

RegEx match open tags except XHTML self-contained tags
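
The snippet above only looks at the first <a class="a"> on the page. As a rough sketch of the same parser-based idea applied to every div, and using only the standard library's html.parser in place of BeautifulSoup, something like the following could work. The class name MatchExtractor and its attributes are made up for illustration, and the sketch assumes the divs are not nested inside one another.

```python
from html.parser import HTMLParser


class MatchExtractor(HTMLParser):
    """Collect <p class="match"> text from each <div> that also holds <a class="a">.

    Hypothetical helper for illustration; assumes <div> elements are not nested.
    """

    def __init__(self):
        super().__init__()
        self.results = []
        self._in_p = False      # currently inside a <p class="match">
        self._p_text = None     # text of the p.match seen in the current div
        self._has_a = False     # current div contains <a class="a">

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div":
            self._p_text, self._has_a = None, False
        elif tag == "p" and "match" in classes:
            self._in_p = True
            self._p_text = ""
        elif tag == "a" and "a" in classes:
            self._has_a = True

    def handle_data(self, data):
        if self._in_p:
            self._p_text += data

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False
        elif tag == "div":
            # Keep the paragraph only when the div also had <a class="a">
            if self._has_a and self._p_text is not None:
                self.results.append(self._p_text.strip())
            self._p_text, self._has_a = None, False


data = """
<div>
    <p class="match"> this sentence should match </p>
    some text
    <a class="a"> some text </a>
</div>
<div>
    <p class="match"> this sentence shouldn't match</p>
    some text
    <a class ="b"> some text </a>
</div>
"""

parser = MatchExtractor()
parser.feed(data)
print(parser.results)
```

The decision is made when the closing </div> is seen, so the order of the p and a tags inside a div does not matter here.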

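【Comments】:

@user3683807 Please read the linked thread carefully. HTML parsers are made specifically for parsing HTML: a specific tool for a specific task. I would suggest using BeautifulSoup here, since it makes HTML parsing easy and reliable.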

【Solution 2】:

You should use an HTML parser, but if you still want to use a regular expression, you can use something like this:

<div>\s*<p class="match">([\w\s]+)</p>[\w\s]+(?=<a class="a").*?</div>

Working demo
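
As a minimal sketch of how this pattern might be used from Python on the sample from the question: the re.DOTALL flag is an assumption on my part (the answer does not name any flags), added because the final .*? has to cross the newline between </a> and </div>.

```python
import re

html = """
<div>
    <p class="match"> this sentence should match </p>
    some text
    <a class="a"> some text </a>
</div>
<div>
    <p class="match"> this sentence shouldnt match</p>
    some text
    <a class ="b"> some text </a>
</div>
"""

# Pattern copied from the answer; re.DOTALL is assumed so that "." can match newlines
pattern = re.compile(
    r'<div>\s*<p class="match">([\w\s]+)</p>[\w\s]+(?=<a class="a").*?</div>',
    re.DOTALL,
)

# findall returns group 1 (the paragraph text) for each matching div
matches = [text.strip() for text in pattern.findall(html)]
print(matches)
```

Note that [\w\s]+ only accepts word characters and whitespace, so a paragraph containing punctuation or nested tags would not be matched by this pattern.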

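【Comments】:

@Jerry As I suggested in my answer, I would not use a regular expression to parse HTML. I posted the answer as an option that addresses the question using a regex.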

【Solution 3】:

 <div>\s*\n\s*.*?<p class=.*?>(.*?)<\/p>\s*\n\s*.*?\s*\n\s*(?=(\<a class=\"a\"\>))

You can use this.

See the demo.

http://regex101.com/r/lK9iD2/7
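
A small sketch of applying this pattern in Python, assuming no flags are set and that the input is laid out line by line as in the question. Because the lookahead contains a second capturing group, finditer with group(1) is used here so that only the paragraph text is collected.

```python
import re

html = """
<div>
    <p class="match"> this sentence should match </p>
    some text
    <a class="a"> some text </a>
</div>
<div>
    <p class="match"> this sentence shouldn't match</p>
    some text
    <a class ="b"> some text </a>
</div>
"""

# Pattern copied from the answer; "." does not match newlines here, so the
# pattern relies on <p>, the plain text, and <a> sitting on separate lines
pattern = re.compile(
    r'<div>\s*\n\s*.*?<p class=.*?>(.*?)<\/p>\s*\n\s*.*?\s*\n\s*(?=(\<a class=\"a\"\>))'
)

# Group 1 is the paragraph text; group 2 is the <a class="a"> seen by the lookahead
sentences = [m.group(1).strip() for m in pattern.finditer(html)]
print(sentences)
```

This pattern accepts any class on the p tag (<p class=.*?>), so it is slightly broader than the requirement stated in the question.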

【Comments】: