Most efficient way to count nodes using XPath in Python: answers

【Question title】: Most efficient way to count nodes using XPath in Python
【Posted】: 2014-11-13 19:38:55
【Question】:

In Python, how do I count nodes using XPath? For example, using this webpage and this code:

from lxml import html, etree
import requests
url = "http://intelligencesquaredus.org/debates/past-debates/item/587-islam-is-dominated-by-radicals"
r = requests.get(url)
tree = html.fromstring(r.content)
count = tree.xpath('count(//*[@id="body"])')
print(count)

It prints 1, but the page has 5 div nodes. Please explain how I can do this correctly.

【Comments】:

Tags: python xpath lxml python-requests scrape

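【Solution 1】:

It prints 1 (or 1.0) because the HTML file you fetched contains only one element with id="body".

I downloaded the file and confirmed that this is the case. For example: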

$ curl -O http://intelligencesquaredus.org/debates/past-debates/item/587-islam-is-dominated-by-radicals

This fetches the file 587-islam-is-dominated-by-radicals.

$ grep --count 'id="body"' 587-islam-is-dominated-by-radicals

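The answer is 1. To be even more sure, I also searched the file by hand in vi. There is only one!

Perhaps you are looking for a different div node? One with a different id?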
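To illustrate the difference between counting the element that carries the id and counting the div nodes around it, here is a minimal sketch. The markup in `SAMPLE` is a made-up stand-in, not the real page, so the numbers only describe this toy document.

```python
from lxml import html

# Hypothetical markup standing in for the real page (an assumption).
SAMPLE = """
<html><body>
  <div id="body">
    <div>one</div>
    <div>two</div>
    <div><div>nested</div></div>
  </div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# Elements whose id attribute is "body": only the outer div.
by_id = tree.xpath('count(//*[@id="body"])')

# div elements that are direct children of that element.
child_divs = tree.xpath('count(//*[@id="body"]/div)')

# div elements anywhere below that element.
descendant_divs = tree.xpath('count(//*[@id="body"]//div)')

# Every div in the document, including the outer one.
all_divs = tree.xpath('count(//div)')

print(by_id, child_divs, descendant_divs, all_divs)
```

The predicate `[@id="body"]` selects the element itself, so the first count stays at 1 however many div nodes are nested inside it. Which of the other expressions is the right one depends on which div nodes you actually mean.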

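Update: By the way, working with XPath and other HTML/XML parsing is very challenging. Lots of bad data and lots of complex markup multiply the complexity of the retrieval, parsing, and traversal process. You will probably run your tests and trials many times. It will be a lot faster if you don't "hit the net" for every one. Cache the live results. Raw code would look something like this: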

from lxml import html, etree
import requests

filepath = "587-islam-is-dominated-by-radicals"
try:
    # Read bytes so the cached copy matches what r.content returns.
    with open(filepath, "rb") as f:
        contents = f.read()
    print("(reading cached copy)")
except IOError:
    url = "http://intelligencesquaredus.org/debates/past-debates/item/587-islam-is-dominated-by-radicals"
    print("(getting file from the net; please stand by)")
    r = requests.get(url)
    contents = r.content
tree = html.fromstring(contents)
count = tree.xpath('count(//*[@id="body"])')
print(count)
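Note that the code above only reads a cached copy; it relies on the file having been saved beforehand, for example by the curl command earlier. A minimal sketch of a variant that also writes the cache on a miss could look like this. The names `load_cached` and `fetch` are hypothetical helpers introduced for illustration, not part of requests or lxml.

```python
from pathlib import Path


def load_cached(path, fetch):
    """Return the bytes stored at `path`, calling `fetch()` only on a cache miss.

    `fetch` is any zero-argument callable that returns bytes (an assumption
    made for this sketch, so the download step stays swappable).
    """
    cache_file = Path(path)
    if cache_file.exists():
        # Cache hit: no network access needed.
        return cache_file.read_bytes()
    # Cache miss: fetch once and save the result for later runs.
    data = fetch()
    cache_file.write_bytes(data)
    return data
```

With requests, a call might look like `load_cached(filepath, lambda: requests.get(url).content)`, after which the result can be passed to `html.fromstring` as before.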

But you can simplify a lot of this by using a generic caching front end for requests, such as requests-cache. Happy parsing!

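【Comments】:

It prints 1.0 rather than 1 because that is how XPath 1.0 works: count() returns a number, and XPath 1.0 numbers are floating point. XPath 2.0 would return the more expected integer result. See this question for a deeper explanation.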
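A small sketch of what that means in lxml, which implements XPath 1.0. The markup in `SAMPLE` is invented for illustration. It shows the float coming back from `count()` and two ways to get an integer.

```python
from lxml import html

# Hypothetical markup used only for this illustration.
SAMPLE = "<html><body><div id='body'><div>a</div><div>b</div></div></body></html>"
tree = html.fromstring(SAMPLE)

# XPath 1.0 count() yields a number, which lxml hands back as a Python float.
xpath_count = tree.xpath("count(//div)")

# Convert explicitly if an integer is wanted.
as_int = int(xpath_count)

# Alternative: select the nodes and count them in Python.
via_len = len(tree.xpath("//div"))

print(xpath_count, as_int, via_len)
```

Both approaches give the same total. `count()` does the counting inside the XPath engine and so avoids building a Python list of elements, which may matter on large documents, though that is worth measuring rather than assuming.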