Simple .html filter in python - modify text elements only

【Question title】: Simple .html filter in python - modify text elements only
【Posted】: 2019-09-25 04:23:25
【Question】:

I need to filter a rather long (but very regular) set of .html files, modifying a few constructs only if they appear in text elements.

A good example is changing <p><div class="speech">it's hard to find his "good" side! He has <i>none</i>!<div></p> to <p><div class="speech">it's hard to find his &ldquo;good&rdquo; side! He has <i>none</i>!<div></p>.

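I can easily parse my files with html.parser, but it is not clear how to generate the result file, which should be as similar as possible to the input (no reformatting).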

I had a look at beautiful-soup, but it seems too big for this (supposedly?) simple task.

Note: I do not need or want to feed the .html files to a browser of any kind; I just need to update them (possibly in place) with (slightly) changed content.
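For the html.parser route mentioned above, one possible approach is to re-emit every piece of markup as it was read and pass only text nodes through a transform. The following is a minimal sketch under that assumption, not a definitive implementation. The names `TextOnlyFilter`, `filter_text` and `curl_quotes` are hypothetical; `HTMLParser`, its `handle_*` callbacks, `get_starttag_text()` and `convert_charrefs` are standard-library APIs.

```python
import re
from html.parser import HTMLParser


class TextOnlyFilter(HTMLParser):
    """Re-emit markup as read; pass only text nodes through `transform`."""

    RAW_TEXT_TAGS = {"script", "style"}  # their content is code, not prose

    def __init__(self, transform):
        # convert_charrefs=False keeps entities such as &amp; out of
        # handle_data, so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.transform = transform
        self.out = []
        self.in_raw_text = False

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as written.
        self.out.append(self.get_starttag_text())
        if tag in self.RAW_TEXT_TAGS:
            self.in_raw_text = True

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # End tags are rebuilt, so "</P >" would come back as "</p>".
        self.out.append(f"</{tag}>")
        if tag in self.RAW_TEXT_TAGS:
            self.in_raw_text = False

    def handle_data(self, data):
        # The only place where content is modified.
        self.out.append(data if self.in_raw_text else self.transform(data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def filter_text(html, transform):
    """Return `html` with `transform` applied to text nodes only."""
    parser = TextOnlyFilter(transform)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def curl_quotes(text):
    """Replace each "quoted" run with &ldquo;quoted&rdquo;."""
    return re.sub(r'"([^"]+)"', r'&ldquo;\1&rdquo;', text)
```

Applied to the sample above, `filter_text(src, curl_quotes)` changes only the quotes in the text and leaves the attribute quotes and the unclosed <div> alone. Known limits of this sketch: end tags are normalized to lowercase, an entity written without its trailing semicolon gains one, CDATA sections are not handled, and a text run interrupted by an entity reaches `handle_data` in several pieces, so a quote pair spanning an entity would not be matched. Files read as bytes need to be decoded before and encoded after.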

UPDATE:

Following @soundstripe's advice, I wrote the following code:

import bs4
from re import sub

def handle_html(html):
    sp = bs4.BeautifulSoup(html, features='html.parser')
    for e in list(sp.strings):
        s = sub(r'"([^"]+)"', r'&ldquo;\1&rdquo;', e)
        if s != e:
            e.replace_with(s)
    return str(sp).encode()

raw = b"""<p><div class="speech">it's hard to "find" his "good" side! He has <i>none</i>!<div></p>"""
new = handle_html(raw)
print(raw)
print(new)

Unfortunately, BeautifulSoup tries to be too smart for its own (and my) good:

b'<p><div class="speech">it\'s hard to "find" his "good" side! He has <i>none</i>!<div></p>'
b'<p><div class="speech">it\'s hard to &amp;ldquo;find&amp;rdquo; his &amp;ldquo;good&amp;rdquo; side! He has <i>none</i>!<div></div></div></p>'

That is, it converts the plain & into &amp;, thus breaking the &ldquo; entity (note that I am using bytes, not strings; is that relevant?).

How can I fix this?
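A likely explanation: the string handed to replace_with is treated as literal text, so its & is escaped when the tree is written out. The standard library's `html` module shows the same mechanism in isolation. This is an analogy under that assumption, not BeautifulSoup's own code path; the variable names are illustrative only.

```python
import html

replaced = '&ldquo;good&rdquo;'  # what the re.sub() call puts into the text node

# Serializing text-node content escapes "&", which mangles the entity.
as_text = html.escape(replaced, quote=False)

# Decoding the entities first yields the real characters instead.
as_chars = html.unescape(replaced)

print(as_text)   # &amp;ldquo;good&amp;rdquo;
print(as_chars)  # “good”
```

Putting the actual characters “ and ” into the text node avoids the double escaping; whether they are written back as entities is then a matter of the output formatter. The input being bytes rather than str should not matter here, since the document is decoded to Unicode before parsing.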

【Question comments】:

You can use selenium webdriver
@Code_Ninja: at first glance it looks even more heavyweight than Beautiful Soup. Am I missing something?
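Haha, don't be afraid of the API; selenium webdriver gives you more features than beautiful-soup, because it was created mainly to track and automate changes on websites at the front-end level.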

Tags: python html filter

【Solution 1】:

I don't know why you wouldn't use BeautifulSoup. Here is an example that replaces the quotes the way you asked.

import re
import bs4

raw = b"""<p><div class="speech">it's hard to find his "good" side! He has <i>none</i>!<div></p> to <p><div class="speech">it's hard to find his &ldquo;good&rdquo; side! He has <i>none</i>!<div></p>"""
soup = bs4.BeautifulSoup(raw, features='html.parser')

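def replace_quotes(s):
    # turn each "quoted" run into &ldquo;quoted&rdquo;
    return re.sub(r'"([^"]+)"', r'&ldquo;\1&rdquo;', s)


for e in list(soup.strings):
    # wrapping the new string in BeautifulSoup() call to correctly parse entities
    # (no parser is named here, so bs4 picks the best one installed and warns about it)
    new_string = bs4.BeautifulSoup(replace_quotes(e))
    e.replace_with(new_string)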

# use the soup.encode() formatter keyword to specify you want html entities in your output
new = soup.encode(formatter='html')


print(raw)
print(new)

【Discussion】:

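Please see the updated question; we're almost there... but not quite.
Normally you would need to open another post (and do some research first) to ask a different question, but I'm feeling nice today :P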
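I'll point out that this re.sub() pattern only works for matched pairs of quotes within a single HTML string. You may want something more like what Word does for smart quotes: if the quote is followed by a letter, it should be a left quote; if it is followed by whitespace or punctuation, it should be a right quote.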
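The cure is worse than the disease. Using bs4.BeautifulSoup(replace_quotes(e)) wraps the string in <html><body><p>...</p></body></html>; the outer <html><body>...</body></html> is removed by replace_with, but the <p>...</p> remains and breaks the formatting. I'll accept your answer because I simply changed &ldquo; to “, since I don't really care about html entities in the final product. Thanks.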
I'm aware of the limitations of re.sub(...), but I still think it is the best choice; one of my actual snippets reads: <br/>« <span class="speech">Ho visto io quello che è successo. La ruota è passata su un pietrone che la ha spostata di un palmo. Un palmo più in là c’era il vuoto. Non ho fatto a tempo nemmeno a dirti "attento!"</span> »<br/>« <span class="speech">Andiamo a vedere?</span> »<br/> It is not obvious how to handle the "closing" quote, since it is surrounded by punctuation (and this is just the first hit). Any insight is welcome.
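The Word-style heuristic suggested in the comments can be sketched as follows: a quote directly followed by a letter or digit opens, and any other quote closes. The function name `smart_quotes` and its parameters are hypothetical, and this is only one reading of that suggestion.

```python
import re


def smart_quotes(text, left="\u201c", right="\u201d"):
    """Curl straight double quotes using a look-ahead heuristic."""
    # A quote directly followed by a word character is treated as opening.
    text = re.sub(r'"(?=\w)', left, text)
    # Every remaining quote (before whitespace, punctuation, or the end
    # of the text) is treated as closing.
    return text.replace('"', right)


print(smart_quotes('Non ho fatto a tempo nemmeno a dirti "attento!"'))
```

Because each quote is classified on its own, the closing quote after `attento!` is handled even though it follows punctuation, and pairs no longer have to sit in the same text node. The function must be applied to text nodes only, otherwise attribute quotes such as class="speech" would be rewritten too. Pass `left="&ldquo;"` and `right="&rdquo;"` to get entities instead of characters. A closing quote immediately followed by a letter would be misclassified as opening.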