Using Python and lxml to get HTML source code from a web site (answers)

【Question title】: Using Python and lxml to get HTML source code from a web site
【Posted】: 2015-12-22 05:24:08
【Question description】:

I am a Python beginner and am trying to write a Python 2.7 program that retrieves betting odds from the following web site.

English version URL: http://bet.hkjc.com/racing/pages/odds_wp.aspx?date=24-09-2015&venue=hv&raceno=1&lang=en

Chinese version URL: bet.hkjc.com/racing/pages/odds_wp.aspx?date=24-09-2015&venue=hv&raceno=1

The data I want to retrieve is marked in this image: https://na.cx/i/Bz873x.jpg

The program works fine on other web sites (e.g. reddit or lxml.de/parsing.html), but I don't know why the HTML it retrieves from this site differs from the HTML I get with Chrome.

from urllib2 import urlopen
from lxml import etree

# Print the source code of the web page.
# This works properly on other web sites (e.g. reddit.com or lxml.de/parsing.html)
# but not on the betting web site.
url = 'http://bet.hkjc.com/racing/pages/odds_wp.aspx?date=24-09-2015&venue=hv&raceno=1'
tree = etree.HTML(urlopen(url).read())
print(etree.tostring(tree, pretty_print=True))

# Print the first horse name from the Chinese version of the site (does not work)
horse_name = tree.xpath('//*[@id="detailWPTable"]/table/tbody/tr[2]/td[3]/a/span/text()')
print(horse_name)

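After running the program above, I found that the HTML retrieved by Python differs from the HTML I get in Chrome through [View Source] or [Developer Tools].

My question is:

How can I use Python to get the correct HTML (the same code that Chrome shows in View Source)?

Thanks :)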

【Comments on the question】:

Tags: python xml-parsing html-parsing lxml

【Solution 1】:

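This is probably because your user agent differs from the browser's and because some scripts on the page are not executed. You can set the former in the HTTP request headers, but most importantly you need to render the page with a headless browser.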
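The first suggestion, setting the user agent in the request headers, might look roughly like the sketch below. The `USER_AGENT` string and the helper names `build_request` and `fetch_html` are illustrative assumptions, not part of the original question, and the import fallback is only there so the same code runs on both Python 2.7 (`urllib2`) and Python 3 (`urllib.request`). Changing the header alone will not execute any JavaScript on the page.

```python
try:
    from urllib2 import Request, urlopen  # Python 2.7, as used in the question
except ImportError:
    from urllib.request import Request, urlopen  # Python 3

# Illustrative browser-like value (an assumption); any real browser's string would do.
USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/120.0 Safari/537.36")


def build_request(url, user_agent=USER_AGENT):
    """Create a request object that carries a custom User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent=USER_AGENT):
    """Download the raw (unrendered) HTML of `url` using the custom header."""
    response = urlopen(build_request(url, user_agent))
    try:
        return response.read()
    finally:
        response.close()

# Example use with the URL from the question (requires network access):
# html = fetch_html('http://bet.hkjc.com/racing/pages/odds_wp.aspx'
#                   '?date=24-09-2015&venue=hv&raceno=1')
```

The returned bytes can be passed to `etree.HTML(...)` exactly as in the question's code.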

A good example of such a framework that works with Python is Selenium.
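A minimal sketch of the Selenium approach follows. It assumes a recent Selenium release running on Python 3, with Chrome and a matching driver available on the machine; the helper names `make_headless_chrome` and `fetch_rendered_html` are hypothetical. Selenium is imported inside the factory function so that the fetching helper can be used with any WebDriver-like object.

```python
def make_headless_chrome():
    """Start Chrome without a visible window (assumes Chrome and its driver are installed)."""
    from selenium import webdriver  # imported lazily; only needed for a real browser

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def fetch_rendered_html(driver, url):
    """Load `url` in the browser and return the HTML after scripts have run."""
    driver.get(url)
    return driver.page_source

# Example use with the URL from the question (requires a working browser setup):
# driver = make_headless_chrome()
# try:
#     html = fetch_rendered_html(
#         driver,
#         'http://bet.hkjc.com/racing/pages/odds_wp.aspx'
#         '?date=24-09-2015&venue=hv&raceno=1')
# finally:
#     driver.quit()
```

The resulting `html` string can then be parsed with `etree.HTML(html)` and queried with XPath as in the question. If the odds are filled in some time after the initial page load, an explicit wait may be needed before reading `page_source`.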

【Comments】: