How to parse an HTML table with headers in rows: answers

【Question title】: How to parse an HTML table with headers in rows
【Posted】: 2020-11-29 09:58:45
【Question】:

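I have an HTML table like the one below, where each header sits in the same row as its value. How can I extract it in one go with a third-party Python package? (The result should be a list or a dictionary.)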

<table>
<tr>
<th>Header 1</th><td>Value 1</td>
</tr>
<tr>
<th>Header 2</th><td>Value 2</td>
</tr>
<tr>
<th>Header 3</th><td>Value 3</td>
</tr>
</table>

【Comments】:

pythonprogramming.net/…
What is the expected output?
It should be a list or a dictionary.

Tags: web-scraping beautifulsoup scrapy

【Solution 1】:

I assume you want a dictionary:

from bs4 import BeautifulSoup


txt = '''<table>
<tr>
<th>Header 1</th><td>Value 1</td>
</tr>
<tr>
<th>Header 2</th><td>Value 2</td>
</tr>
<tr>
<th>Header 3</th><td>Value 3</td>
</tr>
</table>
'''

soup = BeautifulSoup(txt, 'html.parser')

# Map each row's header (<th>) text to a list of its value (<td>) texts.
out = {}
for tr in soup.select('tr'):
    header = tr.select_one('th').get_text(strip=True)
    out[header] = [td.get_text(strip=True) for td in tr.select('td')]

print(out)

Prints:

{'Header 1': ['Value 1'], 'Header 2': ['Value 2'], 'Header 3': ['Value 3']}

Or, to store a single string per header instead of a list:

# Map each row's header text to the text of its first <td> only.
out = {}
for tr in soup.select('tr'):
    out[tr.select_one('th').get_text(strip=True)] = tr.select_one('td').get_text(strip=True)

print(out)

Prints:

{'Header 1': 'Value 1', 'Header 2': 'Value 2', 'Header 3': 'Value 3'}
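
One caveat with the loops above: `select_one` returns `None` when a row has no matching cell, so a row without a `<th>` or a `<td>` would raise an `AttributeError`. Below is a minimal sketch of a more defensive variant, assuming you would rather skip incomplete rows than fail. The helper name `table_to_dict` and the sample markup with a missing cell are made up for illustration and are not part of the original question.

```python
from bs4 import BeautifulSoup


def table_to_dict(html):
    """Map each row's <th> text to its first <td> text, skipping incomplete rows.

    Hypothetical helper for illustration; assumes one header/value pair per row.
    """
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for tr in soup.find_all("tr"):
        th = tr.find("th")
        td = tr.find("td")
        if th is None or td is None:
            continue  # row lacks a header cell or a value cell, so skip it
        result[th.get_text(strip=True)] = td.get_text(strip=True)
    return result


# Sample markup (assumed): the second row has no <td>.
html = """<table>
<tr><th>Header 1</th><td>Value 1</td></tr>
<tr><th>Header 2</th></tr>
<tr><th>Header 3</th><td>Value 3</td></tr>
</table>"""

data = table_to_dict(html)
print(data)
# {'Header 1': 'Value 1', 'Header 3': 'Value 3'}
```

If you need a list rather than a dictionary, `list(data.items())` gives `(header, value)` pairs in row order.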
