BeautifulSoup - Merging multiple <tr> elements into a single list

【Question title】: BeautifulSoup - Merging multiple <tr> elements into a single list
【Posted】: 2019-05-03 03:46:04
【Question description】:

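I am using Beautiful Soup to parse an HTML document into Python objects, but I have run into a small problem.

I am trying to convert a table into a list of dictionaries. I want the dictionary keys to be the column headings, but the table has several header rows containing different numbers of th elements. For the keys to be valid, I need to merge the two header rows into a single row of joined names.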

This is the underlying HTML of the header rows:

<thead>
   <tr>
      <th></th>
      <th class="metadata platform"></th>
      <th class="wtt time borderleft" colspan="2"><abbr title="Working Timetable">WTT</abbr></th>
      <th class="gbtt time borderleft" colspan="2"><abbr title="Public Timetable (Great Britain Timetable)">GBTT</abbr></th>
      <th class="metadata line path borderleft" colspan="2">Route</th>
      <th class="metadata allowances borderleft" colspan="3">Allowances</th>
   </tr>
   <tr>
      <th>Location</th>
      <th class="metadata platform span2">Pl</th>
      <th class="wtt time span3 borderleft">Arr</th>
      <th class="wtt time span3">Dep</th>
      <th class="gbtt time span3 borderleft">Arr</th>
      <th class="gbtt time span3">Dep</th>
      <th class="metadata line span2 borderleft">Line</th>
      <th class="metadata path span2">Path</th>
      <th class="metadata allowances engineering span2 borderleft"><abbr title="Engineering allowance">Eng</abbr></th>
      <th class="metadata allowances pathing span2"><abbr title="Pathing allowance">Pth</abbr></th>
      <th class="metadata allowances performance span2"><abbr title="Performance allowance">Prf</abbr></th>
   </tr>
</thead>

This is the ideal output I need, so that I can use a dict comprehension to build the list.

['Location', 'Pl', 'WTT Arr', 'WTT Dep', 'GBTT Arr', 
 'GBTT Dep', 'Route Line', 'Route Path', 'Allowances Eng', 
 'Allowances Pth', 'Allowances Prf']
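
For reference, once a merged header list like the one above exists, the dictionary step is short. This is only a minimal sketch, assuming each body row has already been reduced to a list of cell strings in column order. The `rows` values are made-up placeholders, not data from the page.

```python
# Merged header names, as in the desired output above.
keys = ['Location', 'Pl', 'WTT Arr', 'WTT Dep', 'GBTT Arr',
        'GBTT Dep', 'Route Line', 'Route Path', 'Allowances Eng',
        'Allowances Pth', 'Allowances Prf']

# Hypothetical body rows: one list of cell texts per <tr>, in column order.
rows = [
    ['Station A', '1', '', '0900', '', '0900', 'ML', '', '', '', ''],
    ['Station B', '2', '0915', '0916', '0915', '0916', 'ML', '', '', '', ''],
]

# zip pairs each key with the cell in the same position.
records = [dict(zip(keys, row)) for row in rows]
print(records[1]['WTT Arr'])
```

Note that `zip` stops at the shorter sequence, so a row with fewer cells than headers would silently produce a dictionary with missing keys.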

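The only way I can think of to do this is to iterate over each element and build the headers that way. So here I would end up with a list of 11 elements that takes two "passes" to build.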

# First pass
['', '', 'WTT', 'WTT', 'GBTT', 
 'GBTT', 'Route', 'Route', 'Allowances', 
 'Allowances', 'Allowances']

# Second pass
['Location', 'Pl', 'WTT Arr', 'WTT Dep', 'GBTT Arr', 
 'GBTT Dep', 'Route Line', 'Route Path', 'Allowances Eng', 
 'Allowances Pth', 'Allowances Prf']

While this is a workable solution, I imagine there is a more Pythonic way.

Edit: the code that creates the dictionary keys:

from bs4 import BeautifulSoup
import requests

url = 'http://www.realtimetrains.co.uk/train/P16871/2018/12/10/advanced'

bs = BeautifulSoup(requests.get(url).content, 'lxml')
table = bs.find_all('table', class_='advanced')
headers = table[0].select('thead tr ')

keys = []
for th in headers[0].findChildren('th'):
    keys.append(th.getText())
    try:
        colspan = int(th['colspan'])
        if colspan > 0:
            for i in range(0, colspan-1):
                keys.append(th.getText())
    except KeyError:
        pass

th_elements = list(headers[1].findChildren('th'))
for i in range(0, len(keys)):
    keys[i] = keys[i] + ' ' + th_elements[i].getText()
    keys[i] = keys[i].strip()

print(keys)
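
The same two-pass idea can be written more compactly with `itertools.chain` and `zip`. This is a sketch rather than a tested replacement for the code above: the (text, colspan) pairs are typed in by hand from the header HTML so that the example runs without a network request. With BeautifulSoup, each pair could presumably be built as `(th.get_text(), int(th.get('colspan', 1)))`.

```python
from itertools import chain

# (text, colspan) pairs for the first header row, copied by hand from the
# HTML above (an assumption standing in for the parsed <th> elements).
top = [('', 1), ('', 1), ('WTT', 2), ('GBTT', 2), ('Route', 2), ('Allowances', 3)]

# Cell texts of the second header row, one per column.
bottom = ['Location', 'Pl', 'Arr', 'Dep', 'Arr', 'Dep',
          'Line', 'Path', 'Eng', 'Pth', 'Prf']

# First pass: repeat each top-level label once per column it spans.
expanded = list(chain.from_iterable([text] * span for text, span in top))

# Second pass: join the two levels; strip() drops the leading space
# left behind when the top-level label is empty.
keys = [f'{a} {b}'.strip() for a, b in zip(expanded, bottom)]
print(keys)
```

Using `th.get('colspan', 1)` instead of `th['colspan']` would also remove the need for the `try`/`except KeyError` block.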

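【Question comments】:

Can you edit the question to include your code attempt? Even so, given that it takes 2 passes, I can't think of a more efficient or Pythonic way to do this. Because of how the page data is structured, you have to do some kind of string analysis, which needs one pass to convert the data into some format and a second pass to analyze and organize it into the final list. So I'm not sure how much real improvement can be made.
@davedwards, sorry for the delay. Added the code.
Thanks, nicely done. That's better and shorter than my attempt. If you can also include the header HTML so that it is a minimal reproducible example, maybe someone more skilled than us can offer a better solution.
Done. Added an MCVE.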

Tags: python python-3.x beautifulsoup lxml

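【Solution 1】:

As an alternative, you can use pandas read_html (which can also use BeautifulSoup under the hood). Read the HTML into a DataFrame, flatten the column names, and output the result as a list of dictionaries.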

import pandas as pd

df = pd.read_html('http://www.realtimetrains.co.uk/train/P16871/2018/12/10/advanced')[0]
df.columns = [' '.join([c for c in col if 'Unnamed' not in c]) 
              for col in df.columns.values]
df.to_dict(orient='records')

This gives:

[
  {
    'Location': 'Swansea [SWA]',
    'Pl': 3.0,
    'WTT Arr': nan,
    'GBTT Dep': 911.0,
    'Route Arr': nan,
    'Allowances Dep': 910.0,
    'Line': nan,
    'Path': nan,
    'Eng': nan,
    'Pth': nan,
    'Prf': nan
  }, 
  ...
]
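
The list comprehension in this answer relies on two things: with two header rows, the columns form a two-level index whose values are tuples, and pandas gives empty header cells placeholder names containing "Unnamed". The plain-Python sketch below uses hand-written tuples in place of `df.columns.values`. The exact `'Unnamed: ...'` strings are an assumption about the placeholder naming, not output captured from the page.

```python
# Stand-in for df.columns.values on a two-level column index.
columns = [
    ('Unnamed: 0_level_0', 'Location'),
    ('Unnamed: 1_level_0', 'Pl'),
    ('WTT', 'Arr'),
    ('WTT', 'Dep'),
    ('Allowances', 'Prf'),
]

# Drop placeholder parts, then join what is left with a space.
flat = [' '.join(part for part in col if 'Unnamed' not in part)
        for col in columns]
print(flat)  # ['Location', 'Pl', 'WTT Arr', 'WTT Dep', 'Allowances Prf']
```

Filtering on the substring `'Unnamed'` is simple, but it would also drop a real header that happens to contain that word.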

【Comments】: