Discrepancy in Python's lxml's equivalent parsing methods on HTML: cssselect vs xpath

[Question title]: Discrepancy in Python's lxml's equivalent parsing methods on HTML: cssselect vs xpath
[Posted]: 2011-12-24 07:55:10
[Question]:

I'm trying to parse example.com's home page with both xpath and cssselect, but it seems that either I don't understand how xpath works or lxml's xpath is broken, because it is missing matches.

Here's the quick and dirty code.

from lxml.html import parse

mySearchTree = parse('http://www.example.com').getroot()

# Find all 'a' elements anywhere inside 'tr' table rows with a CSS selector
for a in mySearchTree.cssselect('tr a'):
    print('found "%s" link to href "%s"' % (a.text, a.get('href')))

print('-'*8 + 'Now for Xpath' + 8*'-')
# Find all 'a' elements inside 'tr' table rows with xpath
for a in mySearchTree.xpath('.//tr/*/a'):
    print('found "%s" link to href "%s"' % (a.text, a.get('href')))

Result:

found "About" link to href "/about/"
found "Presentations" link to href "/about/presentations/"
found "Performance" link to href "/about/performance/"
found "Reports" link to href "/reports/"
found "Domains" link to href "/domains/"
found "Root Zone" link to href "/domains/root/"
found ".INT" link to href "/domains/int/"
found ".ARPA" link to href "/domains/arpa/"
found "IDN Repository" link to href "/domains/idn-tables/"
found "Protocols" link to href "/protocols/"
found "Number Resources" link to href "/numbers/"
found "Abuse Information" link to href "/abuse/"
found "Internet Corporation for Assigned Names and Numbers" link to href "http://www.icann.org/"
--------Now for Xpath--------
found "Presentations" link to href "/about/presentations/"
found "Performance" link to href "/about/performance/"
found "Reports" link to href "/reports/"
found "Root Zone" link to href "/domains/root/"
found ".INT" link to href "/domains/int/"
found ".ARPA" link to href "/domains/arpa/"
found "IDN Repository" link to href "/domains/idn-tables/"
found "Abuse Information" link to href "/abuse/"
found "Internet Corporation for Assigned Names and Numbers" link to href "http://www.icann.org/"

Basically, xpath found every link it should have, except the ones that Example.com has bolded. But shouldn't the asterisk wildcard in the xpath expression './/tr/*/a' allow for that?

[Comments]:

Tags: python xpath css-selectors lxml

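[Solution 1]:

Something else may be going on (I haven't examined the sample document closely), but your CSS selector and your XPath are not equivalent.

The CSS selector tr a is //tr//a in XPath. .//tr/*/a means (conceptually, not precisely):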

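.: the current node
//: all descendants of the current node
tr: all tr elements among the descendants of the current node
/: all children of the tr elements found
*: any element among the children of the tr elements found
/: all children of any child element of the tr elements found
a: all a elements that are children of children of the tr elements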

In other words, given the following HTML:

<ul>
    <li><a href="link1"></a></li>
    <li><b><a href="link2"></a></b></li>
</ul>

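//ul/*/a will match only link1, because the link2 anchor is wrapped in b and is therefore a great-grandchild of ul, not a grandchild.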
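This can be checked directly with lxml. The following is a minimal sketch, assuming lxml is installed; the html/body wrapper and the hrefs helper are added here for illustration and are not part of the original snippet.

```python
from lxml import html  # third-party package: pip install lxml

# The list from above, wrapped in a full document so that absolute
# paths such as //ul are evaluated from the document root.
SNIPPET = """
<html><body>
<ul>
    <li><a href="link1"></a></li>
    <li><b><a href="link2"></a></b></li>
</ul>
</body></html>
"""

root = html.document_fromstring(SNIPPET)


def hrefs(expr):
    """Return the href of every element matched by an XPath expression."""
    return [a.get("href") for a in root.xpath(expr)]


# <a> elements exactly two levels below <ul> (child of a child)
grandchildren_only = hrefs("//ul/*/a")
# <a> elements at any depth below <ul>
any_depth = hrefs("//ul//a")

print(grandchildren_only)
print(any_depth)
```

The second expression is the one that corresponds to the CSS selector ul a, since it does not care how many elements sit between ul and a.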

XPath primer

In reality, an "XPath" is a series of location steps separated by slashes. A location step consists of:

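An axis (e.g. child::)
A node test (either a node name or one of the special node types, e.g. node(), text())
Optional predicates (enclosed in []. A node matches only if all of the predicates are true.)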

If we break .//tr/*/a down into its location steps, it looks like this:

.
(the "space" between the two slashes in "//")
tr
*
a

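It may not be obvious what I'm talking about. That is because XPath has an abbreviated syntax. Here is the expression with the abbreviations expanded (an axis and its node test are separated by ::, and steps are separated by /):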

self::node()/descendant-or-self::node()/child::tr/child::*/child::a

(Note that self::node() is redundant.)
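To see that the abbreviated and expanded forms select the same nodes, here is a small sketch, assuming lxml is installed. The two-row table is invented for illustration: one link sits directly in a td, the other is wrapped in b.

```python
from lxml import html  # third-party package: pip install lxml

# Hypothetical sample document, not taken from the question's page
DOC = html.document_fromstring("""
<html><body><table>
  <tr><td><a href="/reports/">Reports</a></td></tr>
  <tr><td><b><a href="/about/">About</a></b></td></tr>
</table></body></html>
""")

ABBREVIATED = ".//tr/*/a"
EXPANDED = "self::node()/descendant-or-self::node()/child::tr/child::*/child::a"
# Same path with the redundant leading self::node() step dropped
NO_SELF = "descendant-or-self::node()/child::tr/child::*/child::a"


def hrefs(expr):
    """Evaluate an XPath expression against DOC and collect the hrefs."""
    return [a.get("href") for a in DOC.xpath(expr)]


short_hits = hrefs(ABBREVIATED)
long_hits = hrefs(EXPANDED)
no_self_hits = hrefs(NO_SELF)

print(short_hits, long_hits, no_self_hits)
```

All three spellings return only the link that is a direct child of a td; the bolded one is one level too deep for child::*/child::a.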

Conceptually, what happens at each step is:

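Start with a set of context nodes (by default the current node, or the root node if the path begins with "/")
For each context node, build the set of nodes that satisfy the location step
Union all of the per-context-node sets into a single node-set
Pass that set to the next location step as its context nodes
Repeat until you run out of steps. The set left after the last step is the result of the whole path.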
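The procedure above can be sketched in plain Python. This is a toy model, not how lxml or any real XPath engine is implemented: it uses the standard library's xml.etree.ElementTree, handles element nodes only, supports just three axes, ignores predicates, and does not re-sort results into document order. The function names apply_step and evaluate, and the sample table, are made up for illustration.

```python
import xml.etree.ElementTree as ET


def apply_step(node, axis, name):
    """Return the elements reachable from `node` along `axis` that pass the node test."""
    if axis == "self":
        candidates = [node]
    elif axis == "child":
        candidates = list(node)
    elif axis == "descendant-or-self":
        # iter() yields the node itself first, then all of its descendants
        candidates = list(node.iter())
    else:
        raise ValueError("unsupported axis: %s" % axis)
    # In this simplified model, * and node() both accept any element
    return [c for c in candidates if name in ("*", "node()") or c.tag == name]


def evaluate(context, steps):
    """Feed the node-set produced by each location step into the next one."""
    current = [context]
    for axis, name in steps:
        merged, seen = [], set()
        for node in current:
            for hit in apply_step(node, axis, name):
                if id(hit) not in seen:  # union: keep each node once
                    seen.add(id(hit))
                    merged.append(hit)
        current = merged
    return current


# Hypothetical sample: one plain link, one link wrapped in <b>
TABLE = ET.fromstring(
    "<table>"
    '<tr><td><a href="/reports/"/></td></tr>'
    '<tr><td><b><a href="/about/"/></b></td></tr>'
    "</table>"
)

# self::node()/descendant-or-self::node()/child::tr/child::*/child::a
STEPS = [
    ("self", "node()"),
    ("descendant-or-self", "node()"),
    ("child", "tr"),
    ("child", "*"),
    ("child", "a"),
]

result = [a.get("href") for a in evaluate(TABLE, STEPS)]
print(result)
```

Tracing the steps by hand: the node-set goes from the table, to every element in it, to the two tr elements, to the two td elements, and finally to the one a that is a direct child of a td.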

Note that this is still a simplification. Read the XPath Standard for the gory details if you want them.

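[Comments]:

Thanks for your answer, which also covered 3 other questions I had not asked yet!
Francis, your explanation of what the XPath operators mean differs widely from their actual meaning. By your explanation, expressions such as .// or .//tr/ should be valid, whereas in fact they are syntactically illegal. Please delete or correct the explanation.
Correcting the explanation (i.e. introducing the concept of location steps) would probably make it less clear to the OP. I will add a second, more precise explanation.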

[Solution 2]:

'tr a' -> '//tr//a'

[Comments]: