【发布时间】:2016-02-26 12:24:10
【问题描述】:
我正在尝试构建一个简单的脚本,以根据字典将超链接插入 HTML。
它可以在一定程度上使用简单的str.replace(),但是当它遇到现有的超链接或其他标记时会中断,然后这些标记会被破坏。
我已经看到了广泛的answers on StackOverflow 问题,建议使用 lxml 和 BeatifulSoup,但我遇到了具体问题,希望有人可以推动我朝着正确的方向前进。
这是一个示例:
test_string = """<p>
COLD and frosty <a href="/weather">weather</a> has gripped the area and the Met Office have predicted strong winds along will the possibility of sleet and snow later in the week.
</p>
<p>
For more details take a look at the full five-day weather forecast below.
</p>
<h3>Weather map</h3>
...
<p>
<strong>Today:</strong> It will be a cold start today with a hard frost, perhaps with isolated wintry showers in the east at first. There will be plenty of sunny spells through the day, especially towards the west. Feeling cold. Maximum Temperature 6C.
</p>
<p>
<strong>Tonight:</strong> Tonight will be quiet with long clear spells and light winds. A widespread frost is expected with very cold temperatures. Minimum Temperature -5C.
</p>
<p>
<strong>Tuesday:</strong> Tuesday will be a similar day with a cold, frosty but bright start. There will be plenty of sunshine, with skies clouding over into the evening. Maximum Temperature 7C.
</p>"""
import lxml.html
root = lxml.html.fromstring(test_string)
paragraphs = root.xpath("//*")
for p in paragraphs:
if p.tag == "p":
DO SOMETHING
逻辑是只处理文章中的段落标签(并避免标题标签和表格)。
但是,问题是我无法找到一种方法来处理标签中的文本,从而保留标签的位置,例如
如何使用 lxml 将段落分成块:
- 然后我可以在其上运行查找/替换的文本
- 必须保留的标签
这样我就可以保留整体结构?
更新:
例如,我可能想用<a href="met_office.html">Met Office</a> 替换Met Office,同时保留其他文本,因此:
<p>
COLD and frosty <a href="/weather">weather</a> has gripped the area and the <a href="met_office.html">Met Office</a> have predicted strong winds along will the possibility of sleet and snow later in the week.
</p>
【问题讨论】:
-
您可以使用 XPath 直接选择
<p>元素:root.xpath("//p") -
给出的例子确实令人困惑:"..我可能想用 Met Office" 替换 Met Office
-
替换应该是:Met Office(无标签)到 Met Office(带有超链接)。由于某种原因,它没有出现在文本中(这就是我添加代码示例的原因)。