lxml parsing atom - empty results?

[Question title]: lxml parsing atom - empty results?
[Posted]: 2017-10-30 14:42:30
[Question description]:

I am trying to get the titles and links from atom_sample.xml, using the same code that works for my other RSS feeds.

from lxml import etree
tree = etree.parse('atom_sample.xml')
root = tree.getroot()

titles = root.xpath('//entry/title/text()')
links = root.xpath('//entry/link/@href')
print(titles)
print(links)

Result: [] []

With another RSS file, the one from "Issues with python 3.x multiline regex?", the same code works perfectly.

[Question comments]:

Tags: python python-3.x lxml atom-feed

[Solution 1]:

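I think your problem is that lxml.etree parses your XML file with the XML namespace {http://www.w3.org/2005/Atom}. Every element in the feed belongs to that namespace, so an unprefixed XPath step such as entry, which only matches elements in no namespace, finds nothing: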

In [1]: from lxml import etree
...: tree = etree.parse('atom_sample.xml')
...: root = tree.getroot()


In [2]: root
Out[2]: <Element {http://www.w3.org/2005/Atom}feed at 0x7f198e8da808>
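To reproduce this without the original file, here is a minimal sketch. The inline feed is a hypothetical stand-in for atom_sample.xml (I am assuming it declares the Atom namespace as its default namespace, which is what the output above suggests):

```python
from lxml import etree

# Hypothetical stand-in for atom_sample.xml: a tiny Atom feed that
# declares the Atom namespace as its default namespace.
ATOM = b"""<feed xmlns="http://www.w3.org/2005/Atom">
  <title>sample feed</title>
  <entry>
    <title>sample.00</title>
    <link href="https://myfeedurl.com/feed/00"/>
  </entry>
</feed>"""

root = etree.fromstring(ATOM)

# The tag carries the namespace URI in Clark notation.
print(root.tag)    # {http://www.w3.org/2005/Atom}feed

# The default namespace is stored under the key None.
print(root.nsmap)  # {None: 'http://www.w3.org/2005/Atom'}

# Unprefixed steps only match elements in no namespace, so this is empty.
unprefixed = root.xpath('//entry/title/text()')
print(unprefixed)  # []
```

Checking root.tag or root.nsmap like this is a quick way to tell whether a feed that returns empty results is namespaced.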

I'm not sure how to get rid of this namespace easily, but you could try one of the answers to this question.
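One way that might work for getting rid of the namespace is to rewrite every tag to its local name after parsing. This is only a sketch, not necessarily what the linked answers do; the inline feed is again a hypothetical stand-in for atom_sample.xml. Note that it throws the namespace information away, so it is only reasonable when the document uses a single namespace and you do not need to serialize it back as valid Atom:

```python
from lxml import etree

# Hypothetical stand-in for atom_sample.xml.
ATOM = b"""<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>sample.00</title>
    <link href="https://myfeedurl.com/feed/00"/>
  </entry>
  <entry>
    <title>sample.01</title>
    <link href="https://myfeedurl.com/feed/01"/>
  </entry>
</feed>"""

root = etree.fromstring(ATOM)

# Replace each namespaced tag with its local name.
# Comments and processing instructions have a non-string tag, so skip them.
for element in root.iter():
    if isinstance(element.tag, str):
        element.tag = etree.QName(element).localname

# Drop the now-unused xmlns declaration.
etree.cleanup_namespaces(root)

# The original, unprefixed XPath expressions now match.
titles = root.xpath('//entry/title/text()')
links = root.xpath('//entry/link/@href')
print(titles)
print(links)
```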

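Anyway, as a workaround, I added a namespace prefix to every step of the XPath (<prefix>:<tag>) and passed a namespaces dictionary as an argument to the xpath method. For example: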

In [4]: namespaces = {'atom':'http://www.w3.org/2005/Atom'}

In [5]: root.xpath('//atom:entry/atom:title/text()', namespaces=namespaces)
Out[5]: 
['sample.00',
 'sample.01',
 'sample.02',
 'sample.03',
 'sample.04',
 'sample.05',
 'sample.06',
 'sample.07',
 'sample.08',
 'sample.09',
 'sample.10']

In [6]: root.xpath('//atom:entry/atom:link/@href', namespaces=namespaces)
Out[6]: 
['https://myfeedurl.com/feed/00',
 'https://myfeedurl.com/feed/01',
 'https://myfeedurl.com/feed/02',
 'https://myfeedurl.com/feed/03',
 'https://myfeedurl.com/feed/04',
 'https://myfeedurl.com/feed/05',
 'https://myfeedurl.com/feed/06',
 'https://myfeedurl.com/feed/07',
 'https://myfeedurl.com/feed/08',
 'https://myfeedurl.com/feed/09',
 'https://myfeedurl.com/feed/10']
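The two queries above return separate lists, which can get out of step if an entry has no link. If you want each title paired with its own link, something like the following sketch might help. The helper name extract_entries and the inline feed are my own assumptions for illustration; the prefix atom is arbitrary, and only the URI has to match the feed:

```python
from lxml import etree

# The prefix is arbitrary; the URI must match the one used by the feed.
ATOM_NS = {'atom': 'http://www.w3.org/2005/Atom'}

# Hypothetical stand-in for atom_sample.xml.
ATOM = b"""<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>sample.00</title>
    <link href="https://myfeedurl.com/feed/00"/>
  </entry>
  <entry>
    <title>sample.01</title>
  </entry>
</feed>"""


def extract_entries(root):
    """Return a list of (title, href) tuples, one per atom:entry child of root."""
    result = []
    for entry in root.findall('atom:entry', namespaces=ATOM_NS):
        title = entry.findtext('atom:title', namespaces=ATOM_NS)
        link = entry.find('atom:link', namespaces=ATOM_NS)
        # An entry without a link yields None instead of shifting the list.
        href = link.get('href') if link is not None else None
        result.append((title, href))
    return result


root = etree.fromstring(ATOM)
pairs = extract_entries(root)
print(pairs)
# [('sample.00', 'https://myfeedurl.com/feed/00'), ('sample.01', None)]
```

For a file on disk, replace etree.fromstring(ATOM) with etree.parse('atom_sample.xml').getroot() as in the question.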

[Comments]: