【Posted】:2016-08-20 08:10:00
【Question】:
In Python, I have a variable holding an HTML table element, obtained like this:
import requests
from lxml import html

page = requests.get('http://www.myPage.com')
tree = html.fromstring(page.content)
table = tree.xpath('//table[@class="list"]')
The table variable has content like this:
<table class="list">
<tr>
<th>Date(s)</th>
<th>Sport</th>
<th>Event</th>
<th>Location</th>
</tr>
<tr>
<td>Jan 18-31</td>
<td>Tennis</td>
<td><a href="tennis-grand-slam/australian-open/index.htm">Australia Open</a></td>
<td>Melbourne, Australia</td>
</tr>
</table>
I am trying to extract the headers like this:
rows = iter(table)
headers = [col.text for col in next(rows)]
print "headers are: ", headers
However, when I print the headers variable, I get this:
headers are: ['\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n
', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n
', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n
', '\n ', '\n ']
How can I extract the headers correctly?
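One possible explanation, offered as a sketch rather than a confirmed diagnosis: an XPath query returns a list of matching elements, so iter(table) walks that list, next(rows) is the <table> element itself, and the comprehension then reads the .text of every <tr> child, which is only the whitespace before its first cell. That would give one whitespace string per row, which matches the shape of the output above. The example below reproduces this on the sample table and then iterates the rows of the first matched table instead. To stay self-contained it uses the standard library's xml.etree.ElementTree as a stand-in for lxml (the sample happens to be well-formed XML); SAMPLE, cell_text, wrong and data are names made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Sample markup copied from the question, wrapped in a minimal document.
SAMPLE = """<html><body>
<table class="list">
  <tr>
    <th>Date(s)</th>
    <th>Sport</th>
    <th>Event</th>
    <th>Location</th>
  </tr>
  <tr>
    <td>Jan 18-31</td>
    <td>Tennis</td>
    <td><a href="tennis-grand-slam/australian-open/index.htm">Australia Open</a></td>
    <td>Melbourne, Australia</td>
  </tr>
</table>
</body></html>"""

root = ET.fromstring(SAMPLE)

# Like an XPath node query, findall returns a LIST of matching elements.
tables = root.findall('.//table[@class="list"]')

# What the question's loop does: iterate the list, so next(rows) is the
# <table> element and each "col" is actually a <tr>.
rows = iter(tables)
wrong = [col.text for col in next(rows)]  # whitespace only, one entry per row


def cell_text(cell):
    """Return all text inside a cell, including text in nested tags like <a>."""
    return "".join(cell.itertext()).strip()


# Pick the first matched table, then iterate its rows.
table = tables[0]
rows = iter(table)
headers = [cell_text(cell) for cell in next(rows)]
data = [[cell_text(cell) for cell in row] for row in rows]

print("headers are:", headers)
print("rows are:", data)
```

With lxml the same idea would be table = tree.xpath('//table[@class="list"]')[0] before calling iter(table). Note that col.text alone returns None for a cell whose text sits inside a nested tag, such as the Event cell with its <a> link, so the sketch joins all nested text instead.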
【Comments】:
- I can't reproduce the problem with this code. Could you post simplified but complete code that reproduces it?
Tags: python html xpath web-scraping