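【Posted】: 2020-06-08 06:52:14
【Question】:
This question is a follow-up to this answer. I can convert a single HTML table to JSON, but when the document contains multiple tables with different headers, the keys in the result no longer match the columns.
For example, consider the following HTML: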
<html>
<body>
<h1>My Heading</h1>
<p>Hello world</p>
<table>
<tr>
<th>Name</th>
<th>Age</th>
<th>License</th>
<th>Amount</th>
</tr>
<tr>
<td>John</td>
<td>28</td>
<td>Y</td>
<td>12.30</td>
</tr>
<tr>
<td>Kevin</td>
<td>25</td>
<td>Y</td>
<td>22.30</td>
</tr>
<tr>
<td>Smith</td>
<td>38</td>
<td>Y</td>
<td>52.20</td>
</tr>
<tr>
<td>Stewart</td>
<td>21</td>
<td>N</td>
<td>3.80</td>
</tr>
</table>
<table>
<tr>
<th>Name2</th>
<th>Age2</th>
<th>License2</th>
<th>Amount2</th>
<th>Random</th>
</tr>
<tr>
<td>Rich</td>
<td>28</td>
<td>Y</td>
<td>12.30</td>
<td>2</td>
</tr>
<tr>
<td>Lou</td>
<td>25</td>
<td>Y</td>
<td>22.30</td>
<td>2</td>
</tr>
<tr>
<td>Harry</td>
<td>38</td>
<td>Y</td>
<td>52.20</td>
<td>2</td>
</tr>
<tr>
<td>Phil</td>
<td>21</td>
<td>N</td>
<td>3.80</td>
<td>2</td>
</tr>
</table>
</body>
</html>
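Note that, besides the heading and paragraph tags, there are two separate tables with different headers. I want to convert these tables to JSON. However, with my code below,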
from bs4 import BeautifulSoup
import json

if __name__ == '__main__':
    model = BeautifulSoup(xml_data, features='lxml')
    fields = []
    table_data = []
    for table in model.find_all("table"):
        for tr in table.find_all('tr', recursive=False):
            for th in tr.find_all('th', recursive=False):
                fields.append(th.text)
        for tr in table.find_all('tr', recursive=False):
            datum = {}
            for i, td in enumerate(tr.find_all('td', recursive=False)):
                datum[fields[i]] = td.text
            if datum:
                table_data.append(datum)
    print(json.dumps(table_data, indent=4))
I get the following output:
[
{
"Name": "John",
"Age": "28",
"License": "Y",
"Amount": "12.30"
},
{
"Name": "Kevin",
"Age": "25",
"License": "Y",
"Amount": "22.30"
},
{
"Name": "Smith",
"Age": "38",
"License": "Y",
"Amount": "52.20"
},
{
"Name": "Stewart",
"Age": "21",
"License": "N",
"Amount": "3.80"
},
{
"Name": "Rich",
"Age": "28",
"License": "Y",
"Amount": "12.30",
"Name2": "2"
},
{
"Name": "Lou",
"Age": "25",
"License": "Y",
"Amount": "22.30",
"Name2": "2"
},
{
"Name": "Harry",
"Age": "38",
"License": "Y",
"Amount": "52.20",
"Name2": "2"
},
{
"Name": "Phil",
"Age": "21",
"License": "N",
"Amount": "3.80",
"Name2": "2"
}
]
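The output is incorrect: the two tables have different header columns, yet the records from the second table reuse the keys of the first. Also note that the last column of the second table ends up under the wrong key ("Name2") in the JSON. I want the output to be: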
[
{
"Name": "John",
"Age": "28",
"License": "Y",
"Amount": "12.30"
},
{
"Name": "Kevin",
"Age": "25",
"License": "Y",
"Amount": "22.30"
},
{
"Name": "Smith",
"Age": "38",
"License": "Y",
"Amount": "52.20"
},
{
"Name": "Stewart",
"Age": "21",
"License": "N",
"Amount": "3.80"
},
{
"Name2": "Rich",
"Age2": "28",
"License2": "Y",
"Amount2": "12.30",
"Random": "2"
},
{
"Name2": "Lou",
"Age2": "25",
"License2": "Y",
"Amount2": "22.30",
"Random": "2"
},
{
"Name2": "Harry",
"Age2": "38",
"License2": "Y",
"Amount2": "52.20",
"Random": "2"
},
{
"Name2": "Phil",
"Age2": "21",
"License2": "N",
"Amount2": "3.80",
"Random": "2"
}
]
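For reference, the mismatch seems to come from `fields` being created once, before the loop over tables, so the second table's headers are appended after the first table's and `fields[i]` still indexes the old names. A minimal sketch of one possible fix is to rebuild the header list for each table. The function name `tables_to_records` and the shortened `sample_html` are hypothetical, introduced only for illustration, and the sketch assumes there are no nested tables. It uses the stdlib `"html.parser"` backend instead of `lxml` so that it needs no extra dependency beyond `bs4`.

```python
import json

from bs4 import BeautifulSoup


def tables_to_records(html):
    """Convert every <table> in `html` into one flat list of row dicts.

    Hypothetical helper for illustration; assumes tables are not nested.
    """
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for table in soup.find_all("table"):
        # Rebuild the header list per table so keys never leak across tables.
        fields = [th.get_text(strip=True) for th in table.find_all("th")]
        for tr in table.find_all("tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:  # the header row has no <td> cells, so it is skipped
                # zip() stops at the shorter sequence, so extra cells are dropped
                records.append(dict(zip(fields, cells)))
    return records


# Shortened stand-in for the HTML in the question (assumption: same structure).
sample_html = """
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>John</td><td>28</td></tr>
</table>
<table>
  <tr><th>Name2</th><th>Age2</th><th>Random</th></tr>
  <tr><td>Rich</td><td>28</td><td>2</td></tr>
</table>
"""

records = tables_to_records(sample_html)
print(json.dumps(records, indent=4))
```

With this change each record is keyed only by the headers of its own table, which is the shape of the desired output above. The same effect can be had in the original code by moving `fields = []` inside the `for table in ...` loop.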
【Discussion】: