【Title】: Convert multiple tables from HTML to JSON - Python
【Posted】: 2020-06-08 06:52:14
【Question】:

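This question is a follow-up to this answer. I can convert a single HTML table to JSON, but when there are multiple tables with different headers, the result does not match.

For example, consider the following HTML content: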

<html>
    <body>
        <h1>My Heading</h1>
        <p>Hello world</p>
        <table>
            <tr>
                <th>Name</th>
                <th>Age</th>
                <th>License</th>
                <th>Amount</th>
            </tr>
            <tr>
                <td>John</td>
                <td>28</td>
                <td>Y</td>
                <td>12.30</td>
            </tr>
            <tr>
                <td>Kevin</td>
                <td>25</td>
                <td>Y</td>
                <td>22.30</td>
            </tr>
            <tr>
                <td>Smith</td>
                <td>38</td>
                <td>Y</td>
                <td>52.20</td>
            </tr>
            <tr>
                <td>Stewart</td>
                <td>21</td>
                <td>N</td>
                <td>3.80</td>
            </tr>
        </table>
        <table>
            <tr>
                <th>Name2</th>
                <th>Age2</th>
                <th>License2</th>
                <th>Amount2</th>
                <th>Random</th>
            </tr>
            <tr>
                <td>Rich</td>
                <td>28</td>
                <td>Y</td>
                <td>12.30</td>
                <td>2</td>
            </tr>
            <tr>
                <td>Lou</td>
                <td>25</td>
                <td>Y</td>
                <td>22.30</td>
                <td>2</td>
            </tr>
            <tr>
                <td>Harry</td>
                <td>38</td>
                <td>Y</td>
                <td>52.20</td>
                <td>2</td>
            </tr>
            <tr>
                <td>Phil</td>
                <td>21</td>
                <td>N</td>
                <td>3.80</td>
                <td>2</td>
            </tr>
        </table>
    </body>
</html>

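Note that besides the heading and paragraph tags, there are two separate tables with different headers. I want to convert these tables to JSON. However, with my code below,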

from bs4 import BeautifulSoup
import json

if __name__ == '__main__':
    model = BeautifulSoup(xml_data, features='lxml')
    fields = []
    table_data = []
    for table in model.find_all("table"):
        for tr in table.find_all('tr', recursive=False):
            for th in tr.find_all('th', recursive=False):
                fields.append(th.text)
        for tr in table.find_all('tr', recursive=False):
            datum = {}
            for i, td in enumerate(tr.find_all('td', recursive=False)):
                datum[fields[i]] = td.text
            if datum:
                table_data.append(datum)

    print(json.dumps(table_data, indent=4))

I get the following output:

[
    {
        "Name": "John",
        "Age": "28",
        "License": "Y",
        "Amount": "12.30"
    },
    {
        "Name": "Kevin",
        "Age": "25",
        "License": "Y",
        "Amount": "22.30"
    },
    {
        "Name": "Smith",
        "Age": "38",
        "License": "Y",
        "Amount": "52.20"
    },
    {
        "Name": "Stewart",
        "Age": "21",
        "License": "N",
        "Amount": "3.80"
    },
    {
        "Name": "Rich",
        "Age": "28",
        "License": "Y",
        "Amount": "12.30",
        "Name2": "2"
    },
    {
        "Name": "Lou",
        "Age": "25",
        "License": "Y",
        "Amount": "22.30",
        "Name2": "2"
    },
    {
        "Name": "Harry",
        "Age": "38",
        "License": "Y",
        "Amount": "52.20",
        "Name2": "2"
    },
    {
        "Name": "Phil",
        "Age": "21",
        "License": "N",
        "Amount": "3.80",
        "Name2": "2"
    }
]

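The output is incorrect: the two tables have different header columns, yet the second set of records in the JSON uses the same headers as the first. Also note that the last column of the second table is labeled completely wrong. I want the output to be: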

[
    {
        "Name": "John",
        "Age": "28",
        "License": "Y",
        "Amount": "12.30"
    },
    {
        "Name": "Kevin",
        "Age": "25",
        "License": "Y",
        "Amount": "22.30"
    },
    {
        "Name": "Smith",
        "Age": "38",
        "License": "Y",
        "Amount": "52.20"
    },
    {
        "Name": "Stewart",
        "Age": "21",
        "License": "N",
        "Amount": "3.80"
    },
    {
        "Name2": "Rich",
        "Age2": "28",
        "License2": "Y",
        "Amount2": "12.30",
        "Random": "2"
    },
    {
        "Name2": "Lou",
        "Age2": "25",
        "License2": "Y",
        "Amount2": "22.30",
        "Random": "2"
    },
    {
        "Name2": "Harry",
        "Age2": "38",
        "License2": "Y",
        "Amount2": "52.20",
        "Random": "2"
    },
    {
        "Name2": "Phil",
        "Age2": "21",
        "License2": "N",
        "Amount2": "3.80",
        "Random": "2"
    }
]

    Tags: python html json


    【Solution 1】:

    I had to clear the list of "th" fields at the start of each table iteration:

    from bs4 import BeautifulSoup
    import json
    
    if __name__ == '__main__':
        model = BeautifulSoup(xml_data, features='lxml')
        fields = []
        table_data = []
        for table in model.find_all("table"):
            fields.clear()
            for tr in table.find_all('tr', recursive=False):
                for th in tr.find_all('th', recursive=False):
                    fields.append(th.text)
            for tr in table.find_all('tr', recursive=False):
                datum = {}
                for i, td in enumerate(tr.find_all('td', recursive=False)):
                    datum[fields[i]] = td.text
                if datum:
                    table_data.append(datum)
    
        print(json.dumps(table_data, indent=4))
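
The effect of resetting the header list can be reproduced without any HTML parsing. The sketch below is only an illustration: `label_rows` and `reset_per_table` are hypothetical names, and the headers and rows are copied by hand from the first data row of each table in the question.

```python
# Headers and one data row per table, copied from the question's HTML.
tables = [
    (["Name", "Age", "License", "Amount"], ["John", "28", "Y", "12.30"]),
    (["Name2", "Age2", "License2", "Amount2", "Random"], ["Rich", "28", "Y", "12.30", "2"]),
]


def label_rows(tables, reset_per_table):
    """Pair each cell with fields[i], optionally clearing fields per table."""
    fields = []
    rows = []
    for headers, cells in tables:
        if reset_per_table:
            fields.clear()
        fields.extend(headers)  # same effect as appending each th.text
        rows.append({fields[i]: cell for i, cell in enumerate(cells)})
    return rows


buggy = label_rows(tables, reset_per_table=False)
fixed = label_rows(tables, reset_per_table=True)
print(buggy[1])  # second table labeled with the first table's headers
print(fixed[1])  # second table labeled with its own headers
```

Without the reset, `fields` holds nine entries by the time the second table's rows are read, so indices 0 to 3 still point at the first table's headers and index 4 lands on "Name2".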
    


      【Solution 2】:

      The problem is on this line:

      datum[fields[i]] = td.text

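      `i` is just the index from `enumerate`, so it always looks up headers in the order they were first appended to `fields` by the first inner loop. Because `fields` is never reset, the lookup starts with the headers from the first table. You need a separate `fields` list for each table, which you can get simply by moving the declaration of `fields` into the outer loop, like this: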

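      from bs4 import BeautifulSoup
      import json

      # xml_data is assumed to hold the HTML string from the question.
      if __name__ == '__main__':
          model = BeautifulSoup(xml_data, features='lxml')
          table_data = []
          for table in model.find_all("table"):
              # A fresh header list per table keeps headers from leaking across tables.
              fields = []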
              for tr in table.find_all('tr', recursive=False):
                  for th in tr.find_all('th', recursive=False):
                      fields.append(th.text)
              for tr in table.find_all('tr', recursive=False):
                  datum = {}
                  for i, td in enumerate(tr.find_all('td', recursive=False)):
                      datum[fields[i]] = td.text
                  if datum:
                      table_data.append(datum)
      
          print(json.dumps(table_data, indent=4))
      

      This should produce the desired output.
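
For readers who want something they can run directly, here is a minimal self-contained sketch of the same per-table idea. It is not the code from either answer: the function name `tables_to_records`, the shortened `sample` HTML, and the choice of the built-in `html.parser` backend (instead of `lxml`) are assumptions made for illustration. It also assumes the tables are not nested, since `find_all` searches recursively here.

```python
import json

from bs4 import BeautifulSoup


def tables_to_records(html):
    """Return one dict per data row, keyed by that row's own table headers."""
    soup = BeautifulSoup(html, "html.parser")  # stdlib backend; 'lxml' should also work
    records = []
    for table in soup.find_all("table"):
        # Build the header list fresh for every table.
        fields = [th.get_text(strip=True) for th in table.find_all("th")]
        for tr in table.find_all("tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:  # the header row has no <td> cells, so it is skipped
                # zip stops at the shorter sequence, so a short row cannot raise IndexError
                records.append(dict(zip(fields, cells)))
    return records


# Shortened, hypothetical sample in the same shape as the question's HTML.
sample = """
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>John</td><td>28</td></tr>
</table>
<table>
  <tr><th>Name2</th><th>Age2</th><th>Random</th></tr>
  <tr><td>Rich</td><td>28</td><td>2</td></tr>
</table>
"""

print(json.dumps(tables_to_records(sample), indent=4))
```

Using `zip` instead of `fields[i]` removes the index lookup entirely, so each cell can only ever be paired with a header from its own table.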

