Incorrect number of results found by XPath

【Question title】: Incorrect number of results found by XPath
【Posted】: 2017-02-13 16:50:47
【Question】:

Actually, the situation is a bit more complicated than the title suggests.

I'm trying to extract data from this sample HTML:

<li itemprop="itemListElement">
    <h4>
        <a href="/one" title="page one">one</a>
    </h4>
</li>

<li itemprop="itemListElement">
    <h4>
        <a href="/two" title="page two">two</a>
    </h4>
</li>

<li itemprop="itemListElement">
    <h4>
        <a href="/three" title="page three">three</a>
    </h4>
</li>

<li itemprop="itemListElement">
    <h4>
        <a href="/four" title="page four">four</a>
    </h4>
</li>

Currently, I'm using Python 3 with urllib and lxml. For some reason, the following code doesn't work as expected (please read the comments):

import urllib.request

from lxml import html

scan = []

example_url = "path/to/html"
page = html.fromstring(urllib.request.urlopen(example_url).read())

# Extracting the li elements from the html
for item in page.xpath("//li[@itemprop='itemListElement']"):
    scan.append(item)

# At this point, the list 'scan' length is 4 (Nothing wrong)

for list_item in scan:
    # This is supposed to print '1' since there's only one match
    # Yet, this actually prints '4' (This is wrong)
    print(len(list_item.xpath("//h4/a")))

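As you can see, the first step extracts the four li elements and appends them to a list; each li element is then scanned for a elements. The problem is that every li element in scan appears to contain all four a elements.

...or so I thought.

With some quick debugging, I found that the scan list correctly contains the four li elements, so I came to one possible conclusion: something is wrong with the for loop mentioned above.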

for list_item in scan:
    # This is supposed to print '1' since there's only one match
    # Yet, this actually prints '4' (This is wrong)
    print(len(list_item.xpath("//h4/a")))

    # Something is wrong here...

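The only real problem is that I can't pinpoint the bug. What is causing this?

PS: I know there's an easier way to get the a elements from the list, but this is just a sample HTML; the real one contains a lot more... stuff.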

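【Question comments】:

Tags: python-3.x loops xpath lxml urllib

【Solution 1】:

In your example, the XPath begins with //, so the search starts from the root of the document, even though you call xpath() on an li element. That is why it matches all four anchor elements. If you want to search relative to the li element, omit the leading slashes: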

for item in page.xpath("//li[@itemprop='itemListElement']"):
    scan.append(item)

for list_item in scan:
    print(len(list_item.xpath("h4/a")))

Of course, you can also replace // with .// so that the search is relative as well:

for item in page.xpath("//li[@itemprop='itemListElement']"):
    scan.append(item)

for list_item in scan:
    print(len(list_item.xpath(".//h4/a")))
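To compare the three expressions side by side, here is a minimal sketch that parses an inline copy of the question's markup instead of fetching a URL. The SAMPLE string, the wrapping html/body/ul elements, and the count_links helper are assumptions added for illustration; they are not part of the original code.

```python
from lxml import html

# Hypothetical inline copy of the question's markup, so no network access is needed.
SAMPLE = """
<html><body><ul>
  <li itemprop="itemListElement"><h4><a href="/one" title="page one">one</a></h4></li>
  <li itemprop="itemListElement"><h4><a href="/two" title="page two">two</a></h4></li>
  <li itemprop="itemListElement"><h4><a href="/three" title="page three">three</a></h4></li>
  <li itemprop="itemListElement"><h4><a href="/four" title="page four">four</a></h4></li>
</ul></body></html>
"""

page = html.fromstring(SAMPLE)
scan = page.xpath("//li[@itemprop='itemListElement']")


def count_links(expression):
    """Return, for each li in scan, how many nodes the expression selects."""
    return [len(list_item.xpath(expression)) for list_item in scan]


absolute = count_links("//h4/a")       # evaluated from the document root
child_path = count_links("h4/a")       # evaluated from the li's own children
descendants = count_links(".//h4/a")   # evaluated from the li, at any depth

print(absolute)      # [4, 4, 4, 4]
print(child_path)    # [1, 1, 1, 1]
print(descendants)   # [1, 1, 1, 1]
```

Note that h4/a only matches an h4 that is a direct child of the li, while .//h4/a also matches an h4 nested deeper inside it, which may matter for the more complex real page.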

Here is the relevant quote from the specification:

2.5 Abbreviated Syntax

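// is short for /descendant-or-self::node()/. For example, //para is short for /descendant-or-self::node()/child::para and so will select any para element in the document (even a para element that is a document element will be selected by //para since the document element node is a child of the root node); div//para is short for div/descendant-or-self::node()/child::para and so will select all para descendants of div children.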
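The abbreviation can be checked directly with lxml by running the short and expanded forms from the same li element. This is a small sketch under assumptions: the two-item SAMPLE document and the hrefs helper are made up for illustration.

```python
from lxml import html

# Hypothetical two-item document, reduced from the question's markup.
SAMPLE = """<html><body><ul>
<li><h4><a href="/one">one</a></h4></li>
<li><h4><a href="/two">two</a></h4></li>
</ul></body></html>"""

first_li = html.fromstring(SAMPLE).xpath("//li")[0]


def hrefs(expression):
    """Evaluate the expression with first_li as context node and collect href values."""
    return [a.get("href") for a in first_li.xpath(expression)]


# Abbreviated and expanded absolute forms: both start at the document root.
abbreviated_absolute = hrefs("//h4/a")
expanded_absolute = hrefs("/descendant-or-self::node()/child::h4/child::a")

# Abbreviated and expanded relative forms: both start at first_li.
abbreviated_relative = hrefs(".//h4/a")
expanded_relative = hrefs("self::node()/descendant-or-self::node()/child::h4/child::a")

print(abbreviated_absolute, expanded_absolute)   # ['/one', '/two'] ['/one', '/two']
print(abbreviated_relative, expanded_relative)   # ['/one'] ['/one']
```

The leading / in the expanded absolute form is what moves the starting point to the root; the context node passed to xpath() is ignored for that first step.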

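【Comments】:

.// solved the problem, thanks for the answer. But why does it behave that way? First we load a page and get its HTML, then we extract the li tags and put each one into a list. Why would using // make any difference? Since the second for loop iterates over each li tag, there should be only one h4 and therefore only one a tag. Edit: could it be that even after extracting the li tags we still have the whole HTML? That might be the real culprit.
@Eekan - Correct, even after extracting the li tags, the XPath query still has access to the whole HTML. In your example, list_item is a reference to the li element. I believe the reason is that XPath lets you traverse the tree and select parent elements. This means the li has to be a reference, so that the other elements in the tree remain available for more complex queries.
Thanks, man. I think I have a better grasp of XPath now.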
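The point that list_item remains attached to the full tree can be illustrated with a short sketch. The SAMPLE markup below is a hypothetical reduction of the question's HTML; getparent() and getroottree() are standard lxml element methods.

```python
from lxml import html

# Hypothetical two-item document, reduced from the question's markup.
SAMPLE = """<html><body><ul>
<li><h4><a href="/one">one</a></h4></li>
<li><h4><a href="/two">two</a></h4></li>
</ul></body></html>"""

page = html.fromstring(SAMPLE)
list_item = page.xpath("//li")[0]

# The extracted element still knows its parent, both via the lxml API and via XPath.
parent_tag = list_item.getparent().tag
parent_via_xpath = list_item.xpath("..")[0].tag

# Sibling elements are reachable too, so the li is a node in the tree, not a copy.
sibling_links = list_item.xpath("following-sibling::li/h4/a/@href")

# The whole document is still available from the element.
total_items = len(list_item.getroottree().xpath("//li"))

print(parent_tag, parent_via_xpath)   # ul ul
print(sibling_links)                  # ['/two']
print(total_items)                    # 2
```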

【Solution 2】:

print(len(list_item.xpath(".//h4/a")))

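// means /descendant-or-self::node()/. Because it starts with /, the search begins at the document's root node.

Use . to refer to the current context node, which here is list_item rather than the whole document.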

【Comments】: