【Posted】: 2023-03-24 08:45:02
【Question】:
I am trying to scrape data from this website: https://web.archive.org/web/20130725021041/http://www.usatoday.com/sports/nfl/injuries/
import requests
from bs4 import BeautifulSoup
page = requests.get('https://web.archive.org/web/20130725021041/http://www.usatoday.com/sports/nfl/injuries/')
soup = BeautifulSoup(page.text, 'html.parser')
soup.find_all('tbody')
soup.find_all('tbody') returns []. I'm not entirely sure why.
This is the tbody section I am trying to scrape:
<tbody><tr class="page"><td>
7/23/2013
</td><td>
Anthony Spencer
</td><td>
Cowboys
</td><td>
DE
</td><td>
Knee
</td><td>
Knee
</td><td>
Out
</td><td>
Is questionable for 9/8 against the NY Giants
</td></tr><tr class="page"><td>
7/22/2013
</td><td>
Tyrone Crawford
</td><td>
Cowboys
</td><td>
DE
</td><td>
Achilles-tendon
</td><td>
Achilles
</td><td>
Out
</td><td>
Is expected to be placed on injured reserve
</td></tr><tr class="page"><td>
7/16/2013
</td><td>
Ryan Broyles
</td><td>
Lions
</td><td>
WR
</td><td>
Knee
</td><td>
Knee
</td><td>
Questionable
</td><td>
Is questionable for 9/8 against Minnesota
</td></tr><tr class="page"><td>
7/2/2013
</td><td>
Jahvid Best
</td><td>
Lions
</td><td>
RB
</td><td>
Concussion
</td><td>
Concussion
</td><td>
Out
</td><td>
Is out indefinitely
</td></tr><tr class="page"><td>
7/2/2013
</td><td>
Jerel Worthy
</td><td>
Packers
</td><td>
DE
</td><td>
Knee
</td><td>
Knee
</td><td>
Out
</td><td>
Is out indefinitely
</td></tr><tr class="page"><td>
7/2/2013
</td><td>
JC Tretter
</td><td>
Packers
</td><td>
TO
</td><td>
Ankle
</td><td>
Ankle
</td><td>
Out
</td><td>
Is out indefinitely
</td></tr><tr class="page"><td>
</td></tr></tbody>
Can someone help me understand why find_all on tbody returns an empty list? Even when I search for tr with the class "page", it also returns an empty list.
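One way to narrow this down is to check whether the tag is present in the raw response text at all before blaming the parser. The sketch below is only illustrative: `diagnose_missing_tag` is a hypothetical helper name, and the inline `sample` string stands in for `page.text` so the snippet runs without a network request.

```python
from bs4 import BeautifulSoup


def diagnose_missing_tag(html: str, tag: str) -> dict:
    """Compare raw-text occurrences of an opening tag with what the parser finds.

    Hypothetical helper for illustration, not part of BeautifulSoup.
    """
    # Rough count of opening tags in the unparsed text (closing tags start with "</").
    raw_count = html.lower().count(f"<{tag}")
    # Number of elements BeautifulSoup finds with the same parser as in the question.
    parsed_count = len(BeautifulSoup(html, "html.parser").find_all(tag))
    return {"raw": raw_count, "parsed": parsed_count}


# Stand-in for page.text; the real page content is not reproduced here.
sample = '<table><tbody><tr class="page"><td>7/23/2013</td></tr></tbody></table>'
report = diagnose_missing_tag(sample, "tbody")
print(report)
```

If `raw` comes back as 0 for the real `page.text`, the rows are probably not in the HTML that `requests` received (for example, because the browser adds them later with a script), and no choice of parser would find them. If `raw` is positive but `parsed` is 0, the parser becomes the more likely suspect.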
【Comments】:
- That's because BS is using the html4 parser.
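The parser explanation in the comment can be tested directly by parsing the same markup with each parser BeautifulSoup supports and comparing the results. This is a minimal sketch under some assumptions: `count_tag_per_parser` is a hypothetical helper, `lxml` and `html5lib` are optional third-party packages that may not be installed, and the `sample` string stands in for the real `page.text`.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_tag_per_parser(html, tag, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: number of matching elements, or None if unavailable}."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # Raised by BeautifulSoup when the requested parser is not installed.
            counts[name] = None
            continue
        counts[name] = len(soup.find_all(tag))
    return counts


# Stand-in markup; replace with page.text to test the real response.
sample = "<table><tbody><tr class='page'><td>Knee</td></tr></tbody></table>"
counts = count_tag_per_parser(sample, "tbody")
print(counts)
```

If every installed parser reports zero `tbody` elements for the real response, the parser choice is unlikely to be the cause, and it is worth checking whether the table is in the downloaded HTML at all.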
Tags: python web-scraping beautifulsoup