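【Question title】: How to convert an HTML table into a Python dictionary
【Posted】: 2019-01-10 10:11:48
【Question】:

I have the following HTML excerpt, stored as a Python list, and I would like to convert it into a dictionary. It is a schedule of opening hours for each day of the week.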

[u'
<table class="hours table">\n
    <tbody>\n
        <tr>\n
            <th scope="row">Mon</th>\n
            <td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
                <br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
            <td class="extra">\n </td>\n </tr>\n\n
        <tr>\n
            <th scope="row">Tue</th>\n
            <td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
                <br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
            <td class="extra">\n </td>\n </tr>\n\n
        <tr>\n
            <th scope="row">Wed</th>\n
            <td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
                <br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
            <td class="extra">\n <span class="nowrap open">Open now</span>\n </td>\n </tr>\n\n
        <tr>\n
            <th scope="row">Thu</th>\n
            <td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
                <br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
            <td class="extra">\n </td>\n </tr>\n\n
        <tr>\n
            <th scope="row">Fri</th>\n
            <td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
                <br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
            <td class="extra">\n </td>\n </tr>\n\n
        <tr>\n
            <th scope="row">Sat</th>\n
            <td>\n <span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
            <td class="extra">\n </td>\n </tr>\n\n
        <tr>\n
            <th scope="row">Sun</th>\n
            <td>\n Closed\n </td>\n
            <td class="extra">\n </td>\n </tr>\n\n </tbody>\n </table>']

The desired output is:

{
'Mon': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'], 
'Tue': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'], 
'Wed': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'], 
'Thu': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'], 
'Fri': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'], 
'Sat': '5:00pm - 10:00pm', 
'Sun': 'Closed'
}

How would you do this in Python 3.x? I don't mind if the 'Sat' and 'Sun' keys also have list values, if that helps. Thanks in advance for your ideas.


    Tags: python html python-3.x dictionary beautifulsoup


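    【Solution 1】:

    Here is a solution that first reads the table into a pandas DataFrame and then converts it into a dictionary in your desired output format:

    from io import StringIO

    import pandas as pd

    # html_string is the HTML text, e.g. the single element of the list in the question
    dfs = pd.read_html(StringIO(html_string))  # StringIO: recent pandas deprecates literal HTML strings
    df = dfs[0]  # pd.read_html reads in all tables and returns a list of DataFrames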
    

    This gives:

         0                                      1         2
    0  Mon  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm       NaN
    1  Tue  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm       NaN
    2  Wed  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm  Open now
    3  Thu  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm       NaN
    4  Fri  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm       NaN
    5  Sat                     5:00 pm - 10:00 pm       NaN
    6  Sun                                 Closed       NaN
    

    Then use groupby and a dictionary comprehension:

    summary = {k: v.iloc[0, 1].split('  ') for k, v in df.groupby(0)}
    

    This gives:

    {'Fri': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
     'Mon': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
     'Sat': ['5:00 pm - 10:00 pm'],
     'Sun': ['Closed'],
     'Thu': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
     'Tue': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
     'Wed': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']}
    

    If splitting on two spaces does not always work for the format of your opening-hours data, you may need to adjust this slightly.
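
If the two-space split turns out to be fragile, one alternative is to pull the time ranges out with a regular expression instead. This is only a minimal sketch, assuming every range looks like `h:mm am/pm - h:mm am/pm` as in the sample table; `TIME_RANGE` and `split_hours` are names made up for this example, not part of pandas.

```python
import re

# Matches one opening interval such as "2:00 pm - 3:00 pm".
# The pattern is an assumption based on the sample table in the question.
TIME_RANGE = re.compile(
    r"\d{1,2}:\d{2}\s*[ap]m\s*-\s*\d{1,2}:\d{2}\s*[ap]m",
    re.IGNORECASE,
)


def split_hours(cell):
    """Return the time ranges found in a cell, or the stripped text if there are none."""
    ranges = TIME_RANGE.findall(cell)
    if ranges:
        return ranges
    return [cell.strip()]  # no range found, e.g. "Closed"


print(split_hours("2:00 pm - 3:00 pm  5:00 pm - 10:00 pm"))
print(split_hours("Closed"))
```

It would slot into the comprehension as `{k: split_hours(v.iloc[0, 1]) for k, v in df.groupby(0)}`, and it no longer depends on how many spaces separate the two ranges.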


      【Solution 2】:
      from bs4 import BeautifulSoup
      from collections import OrderedDict
      from pprint import pprint
      
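      # data is the HTML string, e.g. the single element of the list in the question
      soup = BeautifulSoup(data, 'lxml')
      
      d = OrderedDict()
      # each row has one <th> and two <td>; [::2] keeps the first <td> (the hours) of every row
      for th, td in zip(soup.select('th'), soup.select('td')[::2]):
          d[th.text.strip()] = td.text.strip().splitlines()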
      
      pprint(d)
      

      Prints:

      OrderedDict([('Mon', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
                   ('Tue', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
                   ('Wed', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
                   ('Thu', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
                   ('Fri', ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']),
                   ('Sat', ['5:00 pm - 10:00 pm']),
                   ('Sun', ['Closed'])])
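
The result above keeps the space before "pm" and wraps every value in a list, whereas the question asks for '2:00pm' and plain strings for 'Sat' and 'Sun'. If the exact requested format matters, a small post-processing step could bridge the gap. A minimal sketch; `to_desired_format` is a hypothetical helper name, and the `parsed` dictionary is a shortened copy of the output shown above:

```python
import re


def to_desired_format(hours_by_day):
    """Drop the space before am/pm and unwrap single-entry lists."""
    result = {}
    for day, entries in hours_by_day.items():
        # "2:00 pm - 3:00 pm" -> "2:00pm - 3:00pm"; "Closed" is left untouched
        compact = [re.sub(r"\s+(?=[ap]m\b)", "", entry) for entry in entries]
        result[day] = compact[0] if len(compact) == 1 else compact
    return result


parsed = {
    "Mon": ["2:00 pm - 3:00 pm", "5:00 pm - 10:00 pm"],
    "Sat": ["5:00 pm - 10:00 pm"],
    "Sun": ["Closed"],
}
print(to_desired_format(parsed))
```

Keeping every value as a list is usually easier to work with downstream, so this step is only worth adding if the mixed list/string layout is really required.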
      


        【Solution 3】:

        Use a library to parse the HTML, for example:

        import pandas as pd
        url = r'https://en.wikipedia.org/wiki/List_of_sovereign_states'
        tables = pd.read_html(url)  # one DataFrame per table on the page
        first_table = tables[0]  # selecting the first table (for example)
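
If installing pandas is not an option, the standard library's `html.parser` can probably handle a table this simple as well. The following is only a sketch under the assumption that every row has one `<th>` followed by an hours `<td>` and an "extra" `<td>`, as in the question; `HoursTableParser` and the shortened `sample` markup are made up for illustration.

```python
from html.parser import HTMLParser


class HoursTableParser(HTMLParser):
    """Collect {day: [hours, ...]} from rows shaped like <th>, <td>, <td>."""

    def __init__(self):
        super().__init__()
        self.hours = {}
        self._day = None        # text of the current row's <th>
        self._cell = None       # "th" or "td" while inside that cell
        self._td_index = 0      # how many <td> cells were closed in this row
        self._segments = [[]]   # text pieces of the cell, one list per <br> part

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._day, self._td_index = None, 0
        elif tag in ("th", "td"):
            self._cell, self._segments = tag, [[]]
        elif tag == "br" and self._cell == "td":
            self._segments.append([])  # <br> starts the next time range

    def handle_data(self, data):
        if self._cell:
            self._segments[-1].append(data)

    def handle_endtag(self, tag):
        if tag != self._cell:
            return  # ignore </span>, </tr>, and other tags
        # Join each segment and collapse runs of whitespace
        parts = [" ".join("".join(seg).split()) for seg in self._segments]
        parts = [p for p in parts if p]
        if tag == "th":
            self._day = " ".join(parts)
        else:
            self._td_index += 1
            if self._td_index == 1 and self._day:  # skip the "extra" column
                self.hours[self._day] = parts
        self._cell = None


sample = """
<table class="hours table"><tbody>
<tr><th scope="row">Mon</th>
<td> <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
<br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span> </td>
<td class="extra"> <span class="nowrap open">Open now</span> </td></tr>
<tr><th scope="row">Sun</th><td> Closed </td><td class="extra"> </td></tr>
</tbody></table>
"""

parser = HoursTableParser()
parser.feed(sample)
print(parser.hours)
```

This is more code than the pandas call and makes stronger assumptions about the row layout, so it is mainly useful when third-party packages cannot be used.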
        

