【Question title】: Can't seem to scrape tbody from this website
【Posted】: 2023-03-24 08:45:02
【Question】:

I'm trying to scrape data from this website: https://web.archive.org/web/20130725021041/http://www.usatoday.com/sports/nfl/injuries/


import requests
from bs4 import BeautifulSoup

page = requests.get('https://web.archive.org/web/20130725021041/http://www.usatoday.com/sports/nfl/injuries/')
soup = BeautifulSoup(page.text, 'html.parser')
soup.find_all('tbody')

soup.find_all('tbody') returns []. I'm not entirely sure why.

This is the tbody section I'm trying to scrape:

<tbody><tr class="page"><td>
                                    7/23/2013


                        </td><td>


                                    Anthony Spencer


                        </td><td>



                                        Cowboys



                        </td><td>


                                    DE


                        </td><td>


                                    Knee


                        </td><td>


                                    Knee


                        </td><td>


                                    Out


                        </td><td>


                                    Is questionable for 9/8 against the NY Giants


                        </td></tr><tr class="page"><td>


                                    7/22/2013


                        </td><td>


                                    Tyrone Crawford


                        </td><td>



                                        Cowboys



                        </td><td>


                                    DE


                        </td><td>


                                    Achilles-tendon


                        </td><td>


                                    Achilles


                        </td><td>


                                    Out


                        </td><td>


                                    Is expected to be placed on injured reserve


                        </td></tr><tr class="page"><td>


                                    7/16/2013


                        </td><td>


                                    Ryan Broyles


                        </td><td>



                                        Lions



                        </td><td>


                                    WR


                        </td><td>


                                    Knee


                        </td><td>


                                    Knee


                        </td><td>


                                    Questionable


                        </td><td>


                                    Is questionable for 9/8 against Minnesota


                        </td></tr><tr class="page"><td>


                                    7/2/2013


                        </td><td>


                                    Jahvid Best


                        </td><td>



                                        Lions



                        </td><td>


                                    RB


                        </td><td>


                                    Concussion


                        </td><td>


                                    Concussion


                        </td><td>


                                    Out


                        </td><td>


                                    Is out indefinitely


                        </td></tr><tr class="page"><td>


                                    7/2/2013


                        </td><td>


                                    Jerel Worthy


                        </td><td>



                                        Packers



                        </td><td>


                                    DE


                        </td><td>


                                    Knee


                        </td><td>


                                    Knee


                        </td><td>


                                    Out


                        </td><td>


                                    Is out indefinitely


                        </td></tr><tr class="page"><td>


                                    7/2/2013


                        </td><td>


                                    JC Tretter


                        </td><td>



                                        Packers



                        </td><td>


                                    TO


                        </td><td>


                                    Ankle


                        </td><td>


                                    Ankle


                        </td><td>


                                    Out


                        </td><td>


                                    Is out indefinitely


                        </td></tr><tr class="page"><td>



                        </td></tr></tbody>

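Can someone help me understand why find_all on tbody returns an empty list? Even when I search for tr with the class page, I still get an empty list.

【Question comments】:

  • That's because BS is using an HTML4 parser.

Tags: python web-scraping beautifulsoup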


【Solution 1】:

It looks like a problem with the page's HTML. Switch to 'lxml' as the parser instead of 'html.parser'. Honestly, I would use pandas here as well.

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://web.archive.org/web/20130725021041/http://www.usatoday.com/sports/nfl/injuries/')
soup = bs(r.content, 'lxml')
print(len(soup.find_all('tbody')))
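
To check whether the parser really is what makes the difference, one option is to count the tag under each parser you have installed. The following is a minimal sketch, not part of the original answer: the helper name count_tag_by_parser and the inline test markup are hypothetical, and parsers that are not installed are reported as None rather than raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_tag_by_parser(html, tag, parsers=("html.parser", "lxml", "html5lib")):
    """Count occurrences of `tag` in `html` under each parser name.

    Hypothetical helper for illustration. A parser that is not
    installed maps to None instead of raising.
    """
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser library is missing.
            counts[name] = None
            continue
        counts[name] = len(soup.find_all(tag))
    return counts


# Small, well-formed stand-in markup so the sketch runs without network access.
demo_html = "<table><tbody><tr><td>x</td></tr></tbody></table>"
counts = count_tag_by_parser(demo_html, "tbody")
for parser_name, n in counts.items():
    print(parser_name, n)
```

Passing r.content from the request above instead of demo_html would show, parser by parser, which ones see the tbody on the archived page.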

Or, more simply for a table:

import pandas as pd

df = pd.read_html('https://web.archive.org/web/20130725021041/http://www.usatoday.com/sports/nfl/injuries/')[0]
print(df)
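
If you would rather stay with BeautifulSoup than pandas, the rows can be pulled out by hand once the tbody is visible to the parser. This is a rough sketch under stated assumptions: extract_rows is a hypothetical helper, and SAMPLE_HTML is a trimmed copy of the markup from the question so the code runs offline. The default parser follows the suggestion above, while the demo call uses the built-in parser because the trimmed sample is well-formed.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (first four cells of two rows,
# plus the trailing row whose only cell is blank).
SAMPLE_HTML = """
<table><tbody>
<tr class="page"><td>
    7/23/2013
</td><td>
    Anthony Spencer
</td><td>
    Cowboys
</td><td>
    DE
</td></tr>
<tr class="page"><td>
    7/22/2013
</td><td>
    Tyrone Crawford
</td><td>
    Cowboys
</td><td>
    DE
</td></tr>
<tr class="page"><td>
</td></tr>
</tbody></table>
"""


def extract_rows(html, parser="lxml"):
    """Return each non-empty tr.page row as a list of stripped cell strings."""
    soup = BeautifulSoup(html, parser)
    rows = []
    for tr in soup.find_all("tr", class_="page"):
        # get_text(strip=True) drops the padding whitespace around each value.
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if any(cells):  # skip rows in which every cell is empty
            rows.append(cells)
    return rows


# The sample is well-formed, so the built-in parser is enough here;
# for the archived page, keep the default 'lxml' as suggested above.
rows = extract_rows(SAMPLE_HTML, parser="html.parser")
for row in rows:
    print(row)
```

Each inner list holds one table row in cell order, which can then be written to CSV or passed to pandas.DataFrame.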

【Comments】:
