Figuring out RegEx search term: answers

【Question title】: Figuring out RegEx search term
【Posted】: 2020-09-15 23:59:09
【Question description】:

I'm new to all of this. I'm using a regular expression to extract data from HTML that contains the following:

<p class="bold"> Last Statement:</p>
<p>Yes sir. I  would like to thank God, my dad, my Lord Jesus savior for saving me and changing  my life. I want to apologize to my in-laws for causing all this emotional pain.  I love y&rsquo;all and consider y&rsquo;all my sisters I never had. I want to thank you for  forgiving me. Thank you warden. </p>

I'm trying to extract the text with:

word = re.findall('Last Statement:</p>.*<p>(.+)</p>', x)

But it gives me an empty list. How can I debug this?

【Comments】:

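Try testing your regular expression with a tool such as regex101.com. You may also want to use an HTML parser: docs.python.org/3/library/html.parser.html
By default, the . in a regular expression does not match a newline, so .* cannot cross the line break between </p> and <p>, yet your pattern spans two lines. Pass the additional flag re.DOTALL to findall() so that . also matches newlines.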
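
The effect described in the comment above can be checked with a small sketch. The shortened sample string and the variable names are assumptions made for illustration; the pattern is the one from the question.

```python
import re

# Shortened stand-in for the HTML in the question (assumed sample).
html = '<p class="bold"> Last Statement:</p>\n<p>Thank you warden. </p>'

# The pattern from the question, written as a raw string.
pattern = r'Last Statement:</p>.*<p>(.+)</p>'

# Without flags, . stops at the newline, so nothing matches.
no_flag = re.findall(pattern, html)

# With re.DOTALL, . also matches the newline between the two tags.
with_dotall = re.findall(pattern, html, re.DOTALL)

print(no_flag)
print(with_dotall)
```

On a full page, the greedy .* and .+ combined with re.DOTALL can run on to the last <p> in the document, so the non-greedy forms .*? and .+? are usually safer there.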

Tags: python html regex

【Solution 1】:

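You're almost there. Replacing .* with \s* should work, because \s* matches the whitespace (including the newline) between the two tags, whereas . does not match a newline by default.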

word = re.findall(r'Last Statement:</p>\s*<p>(.+)</p>', x)

For example:

import re

if __name__ == "__main__":
    # Sample HTML: the label paragraph and the statement paragraph are on separate lines.
    s = """
<p class="bold"> Last Statement:</p>
<p>Yes sir. I  would like to thank God, my dad, my Lord Jesus savior for saving me and changing  my life. I want to apologize to my in-laws for causing all this emotional pain.  I love y&rsquo;all and consider y&rsquo;all my sisters I never had. I want to thank you for  forgiving me. Thank you warden. </p>
        """
    # Raw string so that \s reaches the regex engine unchanged; \s* matches the newline between the tags.
    word = re.findall(r'Last Statement:</p>\s*<p>(.+)</p>', s)
    print(word)

Since you are working with HTML, you may be better off using an XML parser plus XPath to find the text you are interested in...
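
A minimal sketch of the parser-based route follows. It uses the standard library's html.parser module (the one linked in the comments) rather than an XML parser with XPath, mainly because it needs no extra install and decodes entities such as &rsquo; on its own. The class name LastStatementParser, the function extract_last_statement, and the shortened sample text are all hypothetical, introduced only for illustration.

```python
from html.parser import HTMLParser


class LastStatementParser(HTMLParser):
    """Collect the text of the first <p> that follows the 'Last Statement:' label."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &rsquo; into characters.
        super().__init__(convert_charrefs=True)
        self.in_p = False         # currently inside a <p> element
        self.label_seen = False   # the 'Last Statement:' paragraph has closed
        self.buffer = []          # text pieces of the current <p>
        self.statement = None     # extracted result

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_p:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "p" or not self.in_p:
            return
        self.in_p = False
        text = "".join(self.buffer).strip()
        if self.label_seen and self.statement is None:
            # First paragraph after the label: this is the statement.
            self.statement = text
        elif text == "Last Statement:":
            self.label_seen = True


def extract_last_statement(html):
    """Return the statement text, or None if the label is not found."""
    parser = LastStatementParser()
    parser.feed(html)
    parser.close()
    return parser.statement


# Shortened stand-in for the HTML in the question (assumed sample).
sample = '<p class="bold"> Last Statement:</p>\n<p>I love y&rsquo;all. Thank you warden. </p>'
print(extract_last_statement(sample))
```

Unlike the regex, this does not depend on how much whitespace or how many line breaks separate the two paragraphs.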

【Discussion】: