【Question title】: Extra HTML tag causing problems with bs4
【Posted】: 2017-09-27 14:05:58
【Question description】:

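I am trying to get some information from a table on the website http://www.house.gov/representatives/. Specifically, I want the information about representatives from the "Representative Directory By Last Name" table. So far I can download the HTML from the site and write it to a file, but when I use bs4 to parse it and grab the specific table I want, it only grabs the first row of each table.

This is because each row of the HTML table has an extra tag: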

<tr>
<td><a href="https://abraham.house.gov/">
Abraham, Ralph  </a>
</td>
<td>Louisiana 5th District</td>
<td>R</td>
<td>417 CHOB</td>
<td>202-225-8490</td>
<td>Agriculture<BR>Armed Services<BR>Science, Space, and Technology</td>
</td>
</tr>

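The final </td> tag somehow prevents bs4 from grabbing the remaining rows. I ran a manual test in which I deleted some of the extra tags, and then I got all of the rows back, so I know the extra tag is the problem. Here is my Python code so far: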

import bs4, requests

res = requests.get('http://www.house.gov/representatives/')
res.raise_for_status()
file = open('HouseReps.html', 'wb')
for chunk in res.iter_content(100000):
    file.write(chunk)
file = open('HouseReps.html')
soup = bs4.BeautifulSoup(file, 'html.parser')
table = soup.select('table[title="Representative Directory By Last Name"]')
print(table)

I also tried prettify(), but that did not help either. Any ideas on how to clean the HTML so that I can use bs4 (or something else) to parse it and extract the table I need?

Thanks!
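
The manual cleanup described in the question (deleting the extra closing tags by hand) could in principle be automated before the HTML reaches the parser. The following is only a minimal sketch using the standard library: the helper name `strip_doubled_td_close` is hypothetical, and it assumes the stray tag always appears as a `</td>` directly following another `</td>`, as in the row quoted above.

```python
import re

# Matches a cell-closing tag followed (ignoring whitespace) by one or more
# further closing </td> tags, as in the row shown in the question.
DOUBLED_TD_CLOSE = re.compile(r"(</td>)(?:\s*</td>)+", re.IGNORECASE)


def strip_doubled_td_close(html: str) -> str:
    """Collapse each run of consecutive '</td>' tags into a single '</td>'."""
    return DOUBLED_TD_CLOSE.sub(r"\1", html)


# Shortened copy of the malformed row from the question.
sample = """<tr>
<td>R</td>
<td>Agriculture<BR>Armed Services</td>
</td>
</tr>"""

cleaned = strip_doubled_td_close(sample)
print(cleaned)
```

The cleaned string can then be passed to `bs4.BeautifulSoup` as usual. A regular expression like this is a narrow patch for one known defect, not a general HTML repair tool.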

【Question discussion】:

    Tags: python html bs4


    【Solution 1】:

    You can use the lxml parser instead of html.parser in your code:

    import bs4, requests

    res = requests.get('http://www.house.gov/representatives/')
    res.raise_for_status()
    # Save the page; the with block closes (and flushes) the file before it is read back
    with open('HouseReps.html', 'wb') as file:
        for chunk in res.iter_content(100000):
            file.write(chunk)
    with open('HouseReps.html') as file:
        soup = bs4.BeautifulSoup(file, 'lxml')  # use the `lxml` parser instead of `html.parser`
    table = soup.find_all("table", {"title": "Representative Directory By Last Name"})
    print(table[0])  # print the first matching table
    

    The output shows the complete first table whose title attribute is "Representative Directory By Last Name":

    <table class="directory" title="Representative Directory By Last Name">
    <colgroup>
    <col class="name"></col>
    <col class="dist2"></col>
    <col class="part"></col>
    <col class="room"></col>
    <col class="phone2"></col>
    <col class="comm2"></col>
    </colgroup>
    <thead>
    <tr>
    <th>Name</th>
    <th>District</th>
    <th>Party</th>
    <th>Room</th>
    <th>Phone</th>
    <th>Committee Assignment</th>
    </tr>
    </thead>
    <tbody>
    <tr>
    <td><a href="https://abraham.house.gov/">
    Abraham, Ralph  </a>
    </td>
    <td>Louisiana 5th District</td>
    <td>R</td>
    <td>417 CHOB</td>
    <td>202-225-8490</td>
    <td>Agriculture<br/>Armed Services<br/>Science, Space, and Technology</td>
    </tr>
    <tr>
    <td><a href="http://adams.house.gov">
    Adams, Alma </a>
    </td>
    <td>North Carolina 12th District</td>
    <td>D</td>
    <td>222 CHOB</td>
    <td>202-225-1510</td>
    <td>Agriculture<br/>Education and the Workforce<br/>Small Business</td>
    </tr>
    <tr>
    <td><a href="https://aderholt.house.gov/">
    Aderholt, Robert </a>
    </td>
    <td>Alabama 4th District</td>
    <td>R</td>
    <td>235 CHOB</td>
    <td>202-225-4876</td>
    <td>Appropriations</td>
    </tr>
    <tr>
    <td><a href="https://aguilar.house.gov/">
    Aguilar, Pete </a>
    </td>
    <td>California 31st District</td>
    <td>D</td>
    <td>1223 LHOB</td>
    <td>202-225-3201</td>
    <td>Appropriations</td>
    </tr>
    <tr>
    <td><a href="http://allen.house.gov">
    Allen, Rick </a>
    </td>
    <td>Georgia 12th District</td>
    <td>R</td>
    <td>426 CHOB</td>
    <td>202-225-2823</td>
    <td>Agriculture<br/>Education and the Workforce</td>
    </tr>
    <tr>
    <td><a href="https://amash.house.gov/">
    Amash, Justin </a>
    </td>
    <td>Michigan 3rd District</td>
    <td>R</td>
    <td>114 CHOB</td>
    <td>202-225-3831</td>
    <td>Oversight and Government</td>
    </tr>
    <tr>
    <td><a href="https://amodei.house.gov">
    Amodei, Mark </a>
    </td>
    <td>Nevada 2nd District</td>
    <td>R</td>
    <td>332 CHOB</td>
    <td>202-225-6155</td>
    <td>Appropriations</td>
    </tr>
    <tr>
    <td><a href="https://arrington.house.gov">
    Arrington, Jodey  </a>
    </td>
    <td>Texas 19th District</td>
    <td>R</td>
    <td>1029 LHOB</td>
    <td>202-225-4005</td>
    <td>Agriculture<br/>the Budget<br/>Veterans' Affairs</td>
    </tr>
    </tbody>
    </table>
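
Once the whole table parses, the rows can be turned into plain Python records. The sketch below is one possible way to do that, not part of the original answer: the helper `table_to_records` is hypothetical, the sample HTML is two rows copied from the output above, and it uses `html.parser` only because this sample is already well-formed.

```python
import bs4

# Two rows copied from the output above, shortened to keep the example small.
html = """
<table class="directory" title="Representative Directory By Last Name">
<thead>
<tr><th>Name</th><th>District</th><th>Party</th><th>Room</th><th>Phone</th><th>Committee Assignment</th></tr>
</thead>
<tbody>
<tr>
<td><a href="https://abraham.house.gov/">
Abraham, Ralph  </a>
</td>
<td>Louisiana 5th District</td>
<td>R</td>
<td>417 CHOB</td>
<td>202-225-8490</td>
<td>Agriculture<br/>Armed Services<br/>Science, Space, and Technology</td>
</tr>
<tr>
<td><a href="https://aderholt.house.gov/">
Aderholt, Robert </a>
</td>
<td>Alabama 4th District</td>
<td>R</td>
<td>235 CHOB</td>
<td>202-225-4876</td>
<td>Appropriations</td>
</tr>
</tbody>
</table>
"""


def table_to_records(table):
    """Return one dict per body row, keyed by the header cell text."""
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    records = []
    for tr in table.find("tbody").find_all("tr"):
        # separator="; " keeps committees split by <br/> readable on one line
        cells = [td.get_text(separator="; ", strip=True) for td in tr.find_all("td")]
        records.append(dict(zip(headers, cells)))
    return records


soup = bs4.BeautifulSoup(html, "html.parser")
table = soup.find("table", {"title": "Representative Directory By Last Name"})
records = table_to_records(table)
for record in records:
    print(record["Name"], "|", record["Committee Assignment"])
```

With the real page, the same function would be called on `table[0]` from the code above, assuming the table has the `thead`/`tbody` layout shown in the output.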
    

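    【Discussion】:

    • Thanks, that works! What is the difference between these parsers? Is it generally a better idea to use the lxml parser?
    • For the differences, you could refer to this answer, which gives more detail: stackoverflow.com/questions/25714417/…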
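
On the parser question raised in the comments, one way to see the difference first-hand is to feed the same malformed markup to each parser that happens to be installed and count the rows each one recovers. This is a rough sketch: the function `rows_per_parser` and the cut-down sample table are made up for illustration, and the counts may vary with the parser and with the Beautiful Soup version, so no expected output is given.

```python
import bs4

# A cut-down table with the same defect as the page in the question:
# each row carries one extra closing </td> tag.
HTML = (
    "<table>"
    "<tr><td>Abraham, Ralph</td><td>R</td></td></tr>"
    "<tr><td>Adams, Alma</td><td>D</td></td></tr>"
    "<tr><td>Aderholt, Robert</td><td>R</td></td></tr>"
    "</table>"
)


def rows_per_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Map each parser name to the number of <tr> tags found in the first table.

    A value of None means Beautiful Soup could not find that parser.
    """
    counts = {}
    for name in parsers:
        try:
            soup = bs4.BeautifulSoup(html, name)
        except bs4.FeatureNotFound:
            counts[name] = None  # parser library is not installed
            continue
        table = soup.find("table")
        counts[name] = len(table.find_all("tr")) if table else 0
    return counts


counts = rows_per_parser(HTML)
for name, n in counts.items():
    print(f"{name}: {'not installed' if n is None else n}")
```

Running this against a saved copy of the real page, rather than the toy table, would show which parsers cope with that specific document.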