使用 lxml 查找/替换文本，同时保持 HTML 结构答案

【问题标题】：Using lxml to find/replace text while maintaining HTML structure使用 lxml 查找/替换文本，同时保持 HTML 结构
【发布时间】：2016-02-26 12:24:10
【问题描述】：

我正在尝试构建一个简单的脚本，以根据字典将超链接插入 HTML。

它可以在一定程度上使用简单的str.replace()，但是当它遇到现有的超链接或其他标记时会中断，然后这些标记会被破坏。

我已经看到了广泛的answers on StackOverflow 问题，建议使用 lxml 和 BeatifulSoup，但我遇到了具体问题，希望有人可以推动我朝着正确的方向前进。

这是一个示例：

test_string = """<p>
    COLD and frosty <a href="/weather">weather</a> has gripped the area and the Met Office have predicted strong winds along will the possibility of sleet and snow later in the week.
</p>
<p>
    For more details take a look at the full five-day weather forecast below.
</p>
<h3>Weather map</h3>
...
<p>
    <strong>Today:</strong> It will be a cold start today with a hard frost, perhaps with isolated wintry showers in the east at first. There will be plenty of sunny spells through the day, especially towards the west. Feeling cold. Maximum Temperature 6C.
</p>
<p>
    <strong>Tonight:</strong> Tonight will be quiet with long clear spells and light winds. A widespread frost is expected with very cold temperatures. Minimum Temperature -5C.
</p>
<p>
    <strong>Tuesday:</strong> Tuesday will be a similar day with a cold, frosty but bright start. There will be plenty of sunshine, with skies clouding over into the evening. Maximum Temperature 7C.
</p>"""

import lxml.html
root = lxml.html.fromstring(test_string)
paragraphs = root.xpath("//*")

for p in paragraphs:
    if p.tag == "p":
       DO SOMETHING

逻辑是只处理文章中的段落标签（并避免标题标签和表格）。

但是，问题是我无法找到一种方法来处理标签中的文本，从而保留标签的位置，例如

如何使用 lxml 将段落分成块：

然后我可以在其上运行查找/替换的文本
必须保留的标签

这样我就可以保留整体结构？

更新：例如，我可能想用<a href="met_office.html">Met Office</a> 替换Met Office，同时保留其他文本，因此：

<p>
        COLD and frosty <a href="/weather">weather</a> has gripped the area and the <a href="met_office.html">Met Office</a> have predicted strong winds along will the possibility of sleet and snow later in the week.
    </p>

【问题讨论】：

您可以使用 XPath 直接选择 <p> 元素：root.xpath("//p")
给出的例子确实令人困惑："..我可能想用 Met Office" 替换 Met Office
替换应该是：Met Office（无标签）到 Met Office（带有超链接）。由于某种原因，它没有出现在文本中（这就是我添加代码示例的原因）。

标签： python replace lxml

【解决方案1】：

例如，要替换段落文本，使用node.set(attribute, value)，然后使用lxml.html.tostring() 保存更改：

p.set('string', '')
result = lxml.html.tostring(root)

【讨论】：

【解决方案2】：

只需使用 BeautifulSoup，它（可选）使用 lxml：

from bs4 import BeautifulSoup
soup = BeautifulSoup(test_string)
pars = soup.findall('p')
for p in pars:
  p.string = 'this text was replaced'
print s.prettify()

结果是：

<html>
 <body>
  <p>
   this text was replaced
  </p>
  <p>
   this text was replaced
  </p>
  <h3>
   Weather map
  </h3>
  ...
  <p>
   this text was replaced
  </p>
  <p>
   this text was replaced
  </p>
  <p>
   this text was replaced
  </p>
 </body>
</html>

更新：如果您想在段落文本中搜索字符串并将其包装在一个新的<a> 标记中，那么结合new_tag() function in BeautifulSoup 的正则表达式应该可以完成这项工作。

【讨论】：

我添加了一个更新来更好地解释我想要实现的目标。
现在这是一个完全不同的问题...我假设您想将文本包装在另一个标记中？
我接受我的问题没有得到应有的概述。我遇到的问题是我不知道如何在不破坏现有超链接或其他 HTML 标记的情况下将超链接添加到 HTML 中。我没有使用 BeautifulSoup 而是以前依赖于 lxml 并没有帮助。