Cannot get table header elements

【Question title】: Cannot get table header elements
【Posted】: 2016-08-20 08:10:00
【Question】:

In Python, I have a variable that holds an HTML table element, obtained like this:

import requests
from lxml import html

page = requests.get('http://www.myPage.com')
tree = html.fromstring(page.content)
table = tree.xpath('//table[@class="list"]')

The table variable has this content:

<table class="list">
      <tr>
        <th>Date(s)</th>
        <th>Sport</th>
        <th>Event</th>
        <th>Location</th>
      </tr>
      <tr>
        <td>Jan 18-31</td>
        <td>Tennis</td>
        <td><a href="tennis-grand-slam/australian-open/index.htm">Australia Open</a></td>
        <td>Melbourne, Australia</td>
      </tr>
</table>

I am trying to extract the headers like this:

rows = iter(table)
headers = [col.text for col in next(rows)]
print("headers are: ", headers)

However, when I print the headers variable, I get this:

headers are:  ['\n        ', '\n        ', '\n        ', '\n        ', '\n        ', '\n        ', '\n        ', '\n        ', '\n        ', '\n        ', '\n        ', '\n
      ', '\n        ', '\n        ', '\n        ', '\n        ', '\n        ', '\n        ', '\n        ', '\n        ', '\n        ', '\n        ', '\n        ', '\n
', '\n        ', '\n        ', '\n        ', '\n        ', '\n        ', '\n        ', '\n        ', '\n        ', '\n        ', '\n        ', '\n        ', '\n        ', '\n
        ', '\n        ', '\n        ']

How do I extract the headers correctly?

【Comments】:

I cannot reproduce the problem with this code. Could you post a simplified but complete example that reproduces it?

Tags: python html xpath web-scraping

【Solution 1】:

Using the table variable, and assuming there is only one matching table:

table[0].xpath(".//th/text()")  # the leading "." keeps the search inside this table

Or, if you only want the headers and do not plan to use the table for anything else, all you need is:

headers = tree.xpath('//table[@class="list"]//th/text()')

Both give you:

['Date(s)', 'Sport', 'Event', 'Location']
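
For context on why the loop in the question printed only whitespace: xpath() returns a list, so next(iter(table)) is the <table> element itself, and iterating over that element yields its <tr> rows rather than the <th> cells. The sketch below illustrates this on a made-up two-table snippet (HTML_CODE and the second table are assumptions, not the asker's page). It also shows a caveat when calling xpath() on an element: an expression starting with // searches the whole document, while .// stays inside that element.

```python
from lxml import html

# Hypothetical snippet: the second table exists only to show the
# difference between "//th" and ".//th".
HTML_CODE = """<div>
<table class="list">
  <tr>
    <th>Date(s)</th>
    <th>Sport</th>
  </tr>
  <tr>
    <td>Jan 18-31</td>
    <td>Tennis</td>
  </tr>
</table>
<table class="other">
  <tr>
    <th>Unrelated</th>
  </tr>
</table>
</div>"""

tree = html.fromstring(HTML_CODE)
tables = tree.xpath('//table[@class="list"]')  # a list of matching elements
first_table = tables[0]

# Iterating over an element yields its children: the <tr> rows.
child_tags = [child.tag for child in first_table]
# The .text of a row is only the indentation before its first cell.
row_texts = [child.text for child in first_table]

relative = first_table.xpath('.//th/text()')  # limited to first_table
absolute = first_table.xpath('//th/text()')   # searches the whole document

print(child_tags)
print(relative)
print(absolute)
```

With a single table on the page both expressions return the same list, which is why the difference is easy to miss.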

【Comments】:

【Solution 2】:

Try this:

from lxml import html

HTML_CODE = """<table class="list">
      <tr>
        <th>Date(s)</th>
        <th>Sport</th>
        <th>Event</th>
        <th>Location</th>
      </tr>
      <tr>
        <td>Jan 18-31</td>
        <td>Tennis</td>
        <td><a href="tennis-grand-slam/australian-open/index.htm">Australia Open</a></td>
        <td>Melbourne, Australia</td>
      </tr>
</table>"""

tree = html.fromstring(HTML_CODE)
headers = tree.xpath('//table[@class="list"]/tr/th/text()')
print("Headers are: {}".format(', '.join(headers)))

Output:

Headers are: Date(s), Sport, Event, Location
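
One thing to watch for with text(): it returns only the text nodes that are direct children of each <th>. If a header cell wraps its label in another element, for example a link, that label is skipped. A minimal sketch of a workaround, assuming a hypothetical variation of the table in which one header is a link (this markup is not from the question), is to select the <th> elements and call text_content() on each:

```python
from lxml import html

# Hypothetical variation: the second header is wrapped in an <a> element.
HTML_CODE = """<table class="list">
      <tr>
        <th>Date(s)</th>
        <th><a href="#">Event</a></th>
      </tr>
</table>"""

tree = html.fromstring(HTML_CODE)

# text() sees only text directly inside <th>, so the linked header is missed.
text_nodes = tree.xpath('//table[@class="list"]/tr/th/text()')

# text_content() gathers the text of the element and all its descendants.
cells = tree.xpath('//table[@class="list"]/tr/th')
headers = [th.text_content().strip() for th in cells]

print(text_nodes)
print(headers)
```

For the table in the question, where every header is plain text, both approaches return the same four headers.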

【Comments】: