Answers: How can I specifically remove a tag with a class using re.sub

【Question title】: How can I specifically remove a tag with a class using re.sub
【Posted】: 2021-12-11 06:46:46
【Question】:

I want to filter the p tag that has a class out of the HTML below, without affecting any of the other p tags that follow it.

<article data-article-content>
    <div class="abc">
       <p class="ghf">Some Text</p>
       <p>Some other Text</p>
       <p>A different Text</p>
    </div>
</article>

What I am using:

def myFunction(result):
    result = re.sub(r'<article data-article-content><div class="abc"><p class="ghf">.*?</p><\/article>',' ',result)
    return result

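I call this function and print the result, which should omit "Some Text". I am a beginner with regular expressions. Please help with suggestions.

Expected output:

Some other Text A different Text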

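【Comments】:

Welcome to Stack Overflow! Don't use regular expressions to parse HTML. It is a bad idea. But why not? Here are some examples of the problems you may run into. Use an HTML parser instead.
What are you trying to achieve with the given HTML? Perhaps you could explain the use case or context a bit more, so that we can find a better solution than a regular expression, because a regular expression may severely limit your solution.
There is no </p><\/article> in the input, so of course the regex does not match. Replacing everything up to the closing </article> would obviously replace all <p> nodes, not just the first one. Could you please edit to clarify what the expected result should be?
Thanks for the suggestions. I am fetching the body text from a website and want to avoid scraping unwanted text from it. I am trying BeautifulSoup now. I will update the question with more information.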
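
The point made in the comments about the pattern not matching can be checked directly. The sketch below is only an illustration: the variable names and the narrower pattern `narrow` are assumptions and do not come from the thread, and the narrower pattern only works for this exact markup, which is why the comments recommend a parser instead.

```python
import re

html = '''
<article data-article-content>
    <div class="abc">
       <p class="ghf">Some Text</p>
       <p>Some other Text</p>
       <p>A different Text</p>
    </div>
</article>
'''

# Pattern from the question: it allows no whitespace between tags and expects
# "</p></article>" back to back, neither of which occurs in the input.
original = r'<article data-article-content><div class="abc"><p class="ghf">.*?</p><\/article>'
unchanged = re.sub(original, ' ', html)
print(unchanged == html)  # True: nothing matched, so nothing was replaced

# Hypothetical narrower pattern: match only the single <p class="ghf"> element
# together with the whitespace in front of it.
narrow = r'\s*<p class="ghf">.*?</p>'
narrowed = re.sub(narrow, '', html)
print(narrowed)
```

The narrower pattern breaks as soon as the attribute order, quoting, or nesting changes, so it should be read as a demonstration of the failure rather than as a recommended fix.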

Tags: python html

【Solution 1】:

Use BeautifulSoup. It is a great HTML parser with a very intuitive API. I have used it hundreds of times in projects large and small.

html = '''
<article data-article-content>
    <div class="abc">
       <p class="ghf">Some Text</p>
       <p>Some other Text</p>
       <p>A different Text</p>
    </div>
</article>
'''

from bs4 import BeautifulSoup

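# Name the parser explicitly to avoid a warning. The output shown below
# (with <html> and <body> wrappers) matches lxml, which must be installed.
soup = BeautifulSoup(html, 'lxml')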

ps = soup.find_all('p')

p_with_class = [p for p in ps if p.get('class') is not None][0]

print(p_with_class)
# <p class="ghf">Some Text</p>

# Remove it.
p_with_class.decompose()

print(soup.prettify())

Output:

<html>
 <body>
  <article data-article-content="">
   <div class="abc">
    <p>
     Some other Text
    </p>
    <p>
     A different Text
    </p>
   </div>
  </article>
 </body>
</html>

More here.
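
As a variation on the list comprehension above, a CSS attribute selector can pick out every `<p>` that carries a class attribute. This is a minimal sketch, assuming the `html.parser` backend and the same sample HTML; the name `remaining` is illustrative only.

```python
from bs4 import BeautifulSoup

html = '''
<article data-article-content>
    <div class="abc">
       <p class="ghf">Some Text</p>
       <p>Some other Text</p>
       <p>A different Text</p>
    </div>
</article>
'''

soup = BeautifulSoup(html, 'html.parser')

# "p[class]" selects every <p> that has a class attribute, whatever its value.
for p in soup.select('p[class]'):
    p.decompose()  # remove the tag and its contents from the tree

# Collect the text of the paragraphs that are left.
remaining = [p.get_text() for p in soup.find_all('p')]
print(remaining)  # ['Some other Text', 'A different Text']
```

Unlike indexing with `[0]`, the loop does not raise an error when no paragraph has a class, and it removes all such paragraphs rather than only the first.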

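【Comments】:

Thanks for the suggestion. Yes, I have started using BeautifulSoup. I will come back once I reach a certain point.

【Solution 2】:

With BeautifulSoup you can transform the given HTML so that

any <p> tag with the class ghf is removed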

Input

<article data-article-content>
    <div class="abc">
       <p class="ghf">Some Text</p>
       <p>Some other Text</p>
       <p>A different Text</p>
    </div>
</article>

Expected output

<article data-article-content="">
<div class="abc">

<p>Some other Text</p>
<p>A different Text</p>
</div>
</article>

With BeautifulSoup

This uses BeautifulSoup version 4, also known by the abbreviation bs4.

Install it with pip:

pip install beautifulsoup4

Then parse, find, modify and print:

from bs4 import BeautifulSoup

html = '''
<article data-article-content>
    <div class="abc">
       <p class="ghf">Some Text</p>
       <p>Some other Text</p>
       <p>A different Text</p>
    </div>
</article>
'''

soup = BeautifulSoup(html, features='html.parser') # parses HTML using python's internal HTML-parser

paragraph = soup.find("p", {"class": "ghf"}) # first <p> with class "ghf", or None if there is none
if paragraph is not None:
    paragraph.extract() # removes the tag and leaves an empty line

print(soup) # unfortunately indentation is lost

You can call prettify() on soup to restore some of the indentation.
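
The question asks for plain text rather than HTML, so the steps above can also be wrapped in a function that returns only the remaining text. This is a sketch under stated assumptions: the function name `remove_ghf_paragraphs` is hypothetical, and joining the text fragments with a single space is one possible reading of the expected output.

```python
from bs4 import BeautifulSoup

def remove_ghf_paragraphs(html):
    """Drop every <p class="ghf"> and return the remaining text."""
    soup = BeautifulSoup(html, 'html.parser')
    for p in soup.find_all('p', class_='ghf'):
        p.decompose()  # remove the tag and its contents
    # strip=True trims each text fragment and skips whitespace-only ones;
    # the fragments are then joined with a single space.
    return soup.get_text(' ', strip=True)

html = '''
<article data-article-content>
    <div class="abc">
       <p class="ghf">Some Text</p>
       <p>Some other Text</p>
       <p>A different Text</p>
    </div>
</article>
'''

print(remove_ghf_paragraphs(html))  # Some other Text A different Text
```

Using `find_all` instead of `find` means the function also handles input with several matching paragraphs, or with none.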

See also

The functions used are explained in detail in related questions and answers:

【Comments】: