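【Question title】: Unable to parse links from XML content
【Posted】: 2018-01-15 05:29:44
【Question】:

I've written a script in Python, in combination with XPath, to scrape links from a site that serves XML content. As I've never worked with XML before, I can't figure out where I'm going wrong. Thanks in advance for any workaround. This is what I'm trying: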

import requests
from lxml import html

response = requests.get("https://drinkup.london/sitemap.xml").text
tree = html.fromstring(response)
for item in tree.xpath('//div[@class="expanded"]//span[@class="text"]'):
    print(item)

The XML content in which the links are located:

<div xmlns="http://www.w3.org/1999/xhtml" class="collapsible" id="collapsible4"><div class="expanded"><div class="line"><span class="button collapse-button"></span><span class="html-tag">&lt;url&gt;</span></div><div class="collapsible-content"><div class="line"><span class="html-tag">&lt;loc&gt;</span><span class="text">https://drinkup.london/</span><span class="html-tag">&lt;/loc&gt;</span></div></div><div class="line"><span class="html-tag">&lt;/url&gt;</span></div></div><div class="collapsed hidden"><div class="line"><span class="button expand-button"></span><span class="html-tag">&lt;url&gt;</span><span class="text">...</span><span class="html-tag">&lt;/url&gt;</span></div></div></div>

The error thrown on execution is as follows:

    value = etree.fromstring(html, parser, **kw)
  File "src\lxml\lxml.etree.pyx", line 3228, in lxml.etree.fromstring (src\lxml\lxml.etree.c:79593)
  File "src\lxml\parser.pxi", line 1843, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:119053)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

【Comments】:

  • You are assigning the response variable from the text attribute of the requests.get result, which is a unicode string, hence the error. Use the content attribute instead of text.
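To illustrate the comment above, here is a minimal offline sketch of the difference between str and bytes input. The SAMPLE_TEXT document and the describe_parse helper are hypothetical stand-ins (the real sitemap is not reproduced here), and the behaviour is assumed to match the lxml version that produced the traceback in the question.

```python
from lxml import html

# Hypothetical stand-in for the sitemap body, including an XML declaration.
SAMPLE_TEXT = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    '<url><loc>https://drinkup.london/</loc></url>'
    '</urlset>'
)


def describe_parse(document):
    """Report whether lxml.html accepts the given input (hypothetical helper)."""
    try:
        html.fromstring(document)
    except ValueError as exc:
        return "rejected: " + str(exc)
    return "accepted"


if __name__ == "__main__":
    # str input, comparable to response.text
    print(describe_parse(SAMPLE_TEXT))
    # bytes input, comparable to response.content
    print(describe_parse(SAMPLE_TEXT.encode("utf-8")))
```

The str case corresponds to response.text and the bytes case to response.content, so only the second should get past the parser.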

Tags: python xml python-3.x xpath web-scraping


【Solution 1】:

Switch to .content, which returns bytes, instead of .text, which returns unicode:

import requests
from lxml import html


response = requests.get("https://drinkup.london/sitemap.xml").content
tree = html.fromstring(response)
for item in tree.xpath('//url/loc/text()'):
    print(item)

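Also note the corrected XPath expression: it targets the sitemap's own url and loc elements rather than the div/span markup that the browser's XML viewer displays.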
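For comparison, a strict XML parser does not ignore namespaces the way the HTML parser used above does, so the element names need a namespace prefix. The following is only a sketch using the standard library's xml.etree.ElementTree rather than lxml. It assumes the sitemap declares the standard sitemaps.org namespace; the SAMPLE_SITEMAP bytes, the second URL in it, and the extract_links helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for the downloaded sitemap bytes.
SAMPLE_SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://drinkup.london/</loc></url>
  <url><loc>https://drinkup.london/example-page</loc></url>
</urlset>
"""

# Assumed namespace of the sitemap; "sm" is an arbitrary local prefix.
NAMESPACES = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_links(sitemap_bytes):
    """Return the text of every <loc> element nested in a <url> element."""
    root = ET.fromstring(sitemap_bytes)
    return [loc.text for loc in root.findall("sm:url/sm:loc", NAMESPACES)]


if __name__ == "__main__":
    for link in extract_links(SAMPLE_SITEMAP):
        print(link)
```

With a real response, the bytes from requests.get(...).content would be passed to extract_links in place of the sample.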

【Comments】:

  • You are simply awesome, Mr. alecxe. Whenever I'm stuck, you are there. It worked like magic. Thanks a lot.