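【Posted】: 2011-12-24 07:55:10
【Question】:
I'm trying to parse example.com's home page with xpath and cssselect, but it seems that either I don't understand how xpath works, or lxml's xpath is broken, because it is missing matches.
Here's the quick and dirty code.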
from lxml.html import parse

# Fetch and parse the page, then grab the root element
mySearchTree = parse('http://www.example.com').getroot()

# Find all 'a' elements inside 'tr' table rows with a CSS selector
for a in mySearchTree.cssselect('tr a'):
    print('found "%s" link to href "%s"' % (a.text, a.get('href')))

print('-'*8 + 'Now for Xpath' + 8*'-')
# Find all 'a' elements inside 'tr' table rows with xpath
for a in mySearchTree.xpath('.//tr/*/a'):
    print('found "%s" link to href "%s"' % (a.text, a.get('href')))
Output:
found "About" link to href "/about/"
found "Presentations" link to href "/about/presentations/"
found "Performance" link to href "/about/performance/"
found "Reports" link to href "/reports/"
found "Domains" link to href "/domains/"
found "Root Zone" link to href "/domains/root/"
found ".INT" link to href "/domains/int/"
found ".ARPA" link to href "/domains/arpa/"
found "IDN Repository" link to href "/domains/idn-tables/"
found "Protocols" link to href "/protocols/"
found "Number Resources" link to href "/numbers/"
found "Abuse Information" link to href "/abuse/"
found "Internet Corporation for Assigned Names and Numbers" link to href "http://www.icann.org/"
--------Now for Xpath--------
found "Presentations" link to href "/about/presentations/"
found "Performance" link to href "/about/performance/"
found "Reports" link to href "/reports/"
found "Root Zone" link to href "/domains/root/"
found ".INT" link to href "/domains/int/"
found ".ARPA" link to href "/domains/arpa/"
found "IDN Repository" link to href "/domains/idn-tables/"
found "Abuse Information" link to href "/abuse/"
found "Internet Corporation for Assigned Names and Numbers" link to href "http://www.icann.org/"
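Basically, xpath found every link it was supposed to find, except the ones that Example.com displays in bold. But shouldn't the asterisk wildcard in the xpath expression './/tr/*/a' allow for that?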
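To make the difference between the two queries concrete, here is a minimal sketch. It assumes (this is a guess, not something verified against the actual page) that the bolded links are wrapped in an extra element such as `<strong>`, so they sit two levels below the `tr` child instead of one. The markup and the names `html`, `one_level`, and `any_depth` are hypothetical and exist only for illustration. In XPath, `*` matches exactly one element step, whereas `//` matches descendants at any depth, which is closer to what the CSS selector `tr a` means.

```python
from lxml.html import fromstring

# Hypothetical markup: one plain link, and one link wrapped in <strong>
html = """
<table>
  <tr>
    <td><a href="/plain/">Plain</a></td>
    <td><strong><a href="/bold/">Bold</a></strong></td>
  </tr>
</table>
"""
root = fromstring(html)

# '*' stands for exactly ONE element between tr and a (here: td),
# so the link nested as tr/td/strong/a is not matched.
one_level = [a.get('href') for a in root.xpath('.//tr/*/a')]

# '//' walks descendants at any depth below tr, so both links match.
any_depth = [a.get('href') for a in root.xpath('.//tr//a')]

print(one_level)
print(any_depth)
```

Under that assumption, `.//tr//a` would be the XPath counterpart of the `tr a` selector used in the first loop.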
【Discussion】:
Tags: python xpath css-selectors lxml