额外的 HTML 标记导致 bs4 出现问题答案

【问题标题】：Extra HTML tag causing problems with bs4额外的 HTML 标记导致 bs4 出现问题
【发布时间】：2017-09-27 14:05:58
【问题描述】：

我正在尝试从网站http://www.house.gov/representatives/ 上的表格中获取一些信息具体来说，我想从“按姓氏排列的代表目录”表中获取有关代表的信息。到目前为止，我可以从网站下载 HTML 并将其写入文件，但是当使用 bs4 解析和抓取我想要的特定表格时，它只抓取每个表格的第一行。

这是因为HTML表格的每一行都有一个额外的标签：

<tr>
<td><a href="https://abraham.house.gov/">
Abraham, Ralph  </a>
</td>
<td>Louisiana 5th District</td>
<td>R</td>
<td>417 CHOB</td>
<td>202-225-8490</td>
<td>Agriculture<BR>Armed Services<BR>Science, Space, and Technology</td>
</td>
</tr>

最后一个 /td 标记以某种方式导致 bs4 无法抓取其余行。我确实进行了手动测试并删除了一些额外的标签，然后我取回了所有行，所以我知道额外的标签是问题所在。到目前为止，这是我的 python 代码：

import bs4, requests

res = requests.get('http://www.house.gov/representatives/')
res.raise_for_status()
file = open('HouseReps.html', 'wb')
for chunk in res.iter_content(100000):
    file.write(chunk)
file = open('HouseReps.html')
soup = bs4.BeautifulSoup(file, 'html.parser')
table = soup.select('table[title="Representative Directory By Last Name"]')
print(table)

我也尝试过使用 prettify() 但这也无济于事。关于如何清理 HTML 以便我可以使用 bs4（或其他东西）来解析和提取我需要的表的任何想法？

谢谢！

【问题讨论】：

标签： python html bs4

【解决方案1】：

您可以在代码中使用lxml 解析器而不是html.parser：

import bs4, requests

res = requests.get('http://www.house.gov/representatives/')
res.raise_for_status()
file = open('HouseReps.html', 'wb')
for chunk in res.iter_content(100000):
    file.write(chunk)
file = open('HouseReps.html')
soup = bs4.BeautifulSoup(file, 'lxml') #use the `lxml` parser instead of `html.parser`
table = soup.findAll("table",{"title":"Representative Directory By Last Name"})
print(table[0]) #print first table

输出将显示完整的第一个表，其中“title”=“Representative Directory By Last Name”：

<table class="directory" title="Representative Directory By Last Name">
<colgroup>
<col class="name"></col>
<col class="dist2"></col>
<col class="part"></col>
<col class="room"></col>
<col class="phone2"></col>
<col class="comm2"></col>
</colgroup>
<thead>
<tr>
<th>Name</th>
<th>District</th>
<th>Party</th>
<th>Room</th>
<th>Phone</th>
<th>Committee Assignment</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://abraham.house.gov/">
Abraham, Ralph  </a>
</td>
<td>Louisiana 5th District</td>
<td>R</td>
<td>417 CHOB</td>
<td>202-225-8490</td>
<td>Agriculture<br/>Armed Services<br/>Science, Space, and Technology</td>
</tr>
<tr>
<td><a href="http://adams.house.gov">
Adams, Alma </a>
</td>
<td>North Carolina 12th District</td>
<td>D</td>
<td>222 CHOB</td>
<td>202-225-1510</td>
<td>Agriculture<br/>Education and the Workforce<br/>Small Business</td>
</tr>
<tr>
<td><a href="https://aderholt.house.gov/">
Aderholt, Robert </a>
</td>
<td>Alabama 4th District</td>
<td>R</td>
<td>235 CHOB</td>
<td>202-225-4876</td>
<td>Appropriations</td>
</tr>
<tr>
<td><a href="https://aguilar.house.gov/">
Aguilar, Pete </a>
</td>
<td>California 31st District</td>
<td>D</td>
<td>1223 LHOB</td>
<td>202-225-3201</td>
<td>Appropriations</td>
</tr>
<tr>
<td><a href="http://allen.house.gov">
Allen, Rick </a>
</td>
<td>Georgia 12th District</td>
<td>R</td>
<td>426 CHOB</td>
<td>202-225-2823</td>
<td>Agriculture<br/>Education and the Workforce</td>
</tr>
<tr>
<td><a href="https://amash.house.gov/">
Amash, Justin </a>
</td>
<td>Michigan 3rd District</td>
<td>R</td>
<td>114 CHOB</td>
<td>202-225-3831</td>
<td>Oversight and Government</td>
</tr>
<tr>
<td><a href="https://amodei.house.gov">
Amodei, Mark </a>
</td>
<td>Nevada 2nd District</td>
<td>R</td>
<td>332 CHOB</td>
<td>202-225-6155</td>
<td>Appropriations</td>
</tr>
<tr>
<td><a href="https://arrington.house.gov">
Arrington, Jodey  </a>
</td>
<td>Texas 19th District</td>
<td>R</td>
<td>1029 LHOB</td>
<td>202-225-4005</td>
<td>Agriculture<br/>the Budget<br/>Veterans' Affairs</td>
</tr>
</tbody>
</table>

【讨论】：

感谢工作！这些解析器有什么区别？使用 lxml 解析器通常是一个更好的主意吗？
对于区别，也许你可以参考这个答案，它会给你更多的细节stackoverflow.com/questions/25714417/…