Getting the XPath from an HTML document

【Question title】: Getting the XPath from an HTML document
【Posted】: 2018-02-19 22:47:08
【Question description】:

https://next.newsimpact.com/NewsWidget/Live

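I'm trying to write a Python script that grabs a value from the HTML table at the link above. That is the site I'm trying to scrape, and below is the code I wrote. I suspect my XPath is incorrect: the same approach works fine on other elements, but the path I'm using here returns and prints nothing.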

from lxml import html
import requests
page = requests.get('https://next.newsimpact.com/NewsWidget/Live')
tree = html.fromstring(page.content)

# Select the text of the 4th cell in the first row of the table:
value = tree.xpath('//*[@id="table9521"]/tr[1]/td[4]/text()')

print('Value: ', value)

Strangely, when I open the page with View Source, I can't find the table I want to extract from. Thanks for your help!

【Comments】:

What is the expected output?
2,632 @GillesQuenot
Put your "import" lines in as well.
Oops, I've included them in the code now. @GillesQuenot

Tags: python html xpath python-requests

【Solution 1】:

The required data is missing from the initial page source; it comes from an XHR. You can fetch it like this:

import requests

response = requests.get('https://next.newsimpact.com/NewsWidget/GetNextEvents?offset=-120').json()

first_previous = response['Items'][0]['Previous']  # Current output - "2.632"
second_previous = response['Items'][1]['Previous']  # Currently - "0.2"
first_forecast = response['Items'][0]['Forecast']  # ""
second_forecast = response['Items'][1]['Forecast']  # "0.3"

Since .json() already returns the response as a plain Python dict, you can read all the data you need from it.
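
As a rough sketch of reading every entry rather than indexing items one by one: the helper names extract_values and fetch_events below are hypothetical, and the sample dict only mirrors the 'Items', 'Previous' and 'Forecast' keys and values quoted in the answer above. The live endpoint may return a different shape today.

```python
def extract_values(payload, field="Previous"):
    """Return the given field from every entry in payload['Items'].

    Missing keys fall back to an empty string, matching the empty
    'Forecast' value shown in the answer.
    """
    return [item.get(field, "") for item in payload.get("Items", [])]


def fetch_events(url="https://next.newsimpact.com/NewsWidget/GetNextEvents?offset=-120"):
    """Fetch the XHR endpoint from the answer and decode its JSON body."""
    # Imported here so extract_values can be used without the network dependency.
    import requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()


# Offline sample shaped like the values quoted in the answer (an assumption).
sample = {
    "Items": [
        {"Previous": "2.632", "Forecast": ""},
        {"Previous": "0.2", "Forecast": "0.3"},
    ]
}

previous_values = extract_values(sample, "Previous")
forecast_values = extract_values(sample, "Forecast")
print(previous_values)  # ['2.632', '0.2']
print(forecast_values)  # ['', '0.3']
```

Against the live site, extract_values(fetch_events()) would replace the sample dict.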

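【Comments】:

You have no idea how much you've helped me... lol, I've been stuck on this for a while. I didn't know you could use JSON for this, which shows how little I know about this stuff. Thank you so much for your help!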

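【Solution 2】:

Your problem is simple: requests doesn't process JavaScript at all. These values are generated by JS!

If you really need to run this XPath, you need a module that understands JS, such as spynner.

You can first test whether JS is needed by using curl or by disabling JS in your browser. In Firefox: enter about:config in the address bar, search for javascript.enabled, then double-click it to toggle between true and false.

In Chrome, open Chrome DevTools; there is an option for it somewhere.

See https://github.com/makinacorpus/spynner

Another (possible) issue: use tree = html.fromstring(page.text) instead of tree = html.fromstring(page.content).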
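
The same "is JS needed?" check can also be approximated from Python by looking for the target element in the raw, unrendered HTML. This is only a sketch: IdCollector and element_in_static_html are hypothetical helper names, and static_page is a made-up stand-in for what requests.get(url).text might return, not the real page source.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect every id attribute found in the start tags of a document."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def element_in_static_html(raw_html, element_id):
    """Return True if an element with this id exists before any JS runs."""
    collector = IdCollector()
    collector.feed(raw_html)
    collector.close()
    return element_id in collector.ids


# Made-up page: the table is absent because a script would build it later.
static_page = (
    '<html><body><div id="widget"></div>'
    '<script src="app.js"></script></body></html>'
)

print(element_in_static_html(static_page, "widget"))     # True
print(element_in_static_html(static_page, "table9521"))  # False
```

If the id used in the XPath is missing from the raw HTML, the element is probably created by JavaScript, and an XPath run on the requests response cannot match it.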

【Comments】:

Oh, I see... sorry about that. Do you have any suggestions for how to get the data? Are you saying that XPath isn't a good way to do it?