Python - Scraping with BeautifulSoup not showing all rows

【Question title】: Python - Scraping with BeautifulSoup not showing all rows
【Posted】: 2016-12-17 19:07:39
【Question】:

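I am new to BeautifulSoup. I am trying to scrape the "Season Stats" table from the ESPN Fantasy Basketball Standings page, but not all rows are returned. After some research I thought the problem might be html.parser, so I switched to lxml and got the same result. I would appreciate it if someone could tell me how to get all the team names.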

My code:

from bs4 import BeautifulSoup
from urllib.request import urlopen

soup = BeautifulSoup(urlopen("http://games.espn.com/fba/standings?leagueId=20960&seasonId=2017"),'html.parser')
tableStats = soup.find("table", {"class" : "tableBody"})
for row in tableStats.findAll('tr')[2:]:
    col = row.findAll('td')

    try:
        name = col[0].a.string.strip()
        print(name)
    except Exception as e:
        print(str(e))

Output (as you can see, only a few team names are shown):

Le Tuc Grizzlies Peyton Ravens Heaven Vultures Versailles Golden Bears Baltimore Corto's La Murette Scavengers XO Gayfishes

【Comments】:

You seem to be targeting the wrong table. Why not use the overall standings section?

Tags: python web-scraping beautifulsoup

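【Solution 1】:

You seem to be picking up the wrong table entirely. Instead of calling find() for the <table> tag, which returns only the first match, you could use findAll() and look through the results for the table that holds the full standings. I also noticed that the stats table has its own table id, statsTable. Looking it up by id rather than by class is a good idea, since an id is meant to be unique within an HTML document.

See the comments in the code below for more guidance: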

from bs4 import BeautifulSoup
import requests
# Note, I'm using requests here as it's a superior library
text = requests.get("http://games.espn.com/fba/standings?leagueId=20960&seasonId=2017").text
soup = BeautifulSoup(text,'html.parser')
# searching by id, always a better option when available
tableStats = soup.find("table", {"id" : "statsTable"})
for row in tableStats.findAll('tr')[3:]:
    col = row.findAll('td')
    try:
        # This fetches all the text in the tag stripped off all the HTML
        name = col[1].get_text()
        print(name)
    except Exception as e:
        print(str(e))
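
To illustrate why the class lookup in the question can land on the wrong table, here is a minimal sketch that runs against an inline HTML string instead of the live ESPN page. The markup in SAMPLE_HTML is a made-up, heavily simplified stand-in (an assumption, not the real page structure); only the class name tableBody and the id statsTable are taken from the thread.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the standings page:
# two tables share the class "tableBody", but only one has id "statsTable".
SAMPLE_HTML = """
<table class="tableBody">
  <tr><td><a href="/team/1">Team A</a></td></tr>
  <tr><td><a href="/team/2">Team B</a></td></tr>
</table>
<table class="tableBody" id="statsTable">
  <tr><td>1</td><td><a href="/team/1">Team A</a></td></tr>
  <tr><td>2</td><td><a href="/team/2">Team B</a></td></tr>
  <tr><td>3</td><td><a href="/team/3">Team C</a></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() stops at the first table with this class, whichever one that is.
first_by_class = soup.find("table", {"class": "tableBody"})

# find_all() returns every match, so each candidate table can be inspected.
all_by_class = soup.find_all("table", {"class": "tableBody"})

# Looking up by id targets exactly the table we want.
by_id = soup.find("table", {"id": "statsTable"})

print(len(all_by_class))           # number of tables sharing the class
print(first_by_class.get("id"))    # id of the table find() returned, if any
print(len(by_id.find_all("tr")))   # rows in the table selected by id
```

In this sample the class lookup returns the shorter first table, which is the same symptom as in the question: some rows come back, but not all of them.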

【Discussion】:

【Solution 2】:

It may be easier to parse the table with id="statsTable", which contains all the teams, i.e.:

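from bs4 import BeautifulSoup
# Python 3 import, matching the question; the answer as posted used urllib2 (Python 2)
from urllib.request import urlopen
soup = BeautifulSoup(urlopen("http://games.espn.com/fba/standings?leagueId=20960&seasonId=2017"),'html.parser')
tableStats = soup.find('table', id="statsTable")
# Print the text of every link (<a href=...>) inside the stats table
for row in tableStats.findAll('a', href=True):
    print(row.text)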
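
The same link-based approach can be wrapped in a small function and tried offline. This is only a sketch: team_names and SAMPLE_HTML are hypothetical names introduced here, and the markup is an invented stand-in for the real page. The function also checks for a missing table, because find() returns None when nothing matches.

```python
from bs4 import BeautifulSoup


def team_names(html, table_id="statsTable"):
    """Return the text of every <a href=...> inside the table with the given id.

    Hypothetical helper for illustration; returns an empty list if the
    table is not present in the document.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        return []
    # href=True skips anchors that have no href attribute.
    return [a.get_text(strip=True) for a in table.find_all("a", href=True)]


# Made-up markup: one anchor without href, two team links.
SAMPLE_HTML = """
<table id="statsTable">
  <tr><td><a name="top">Season Stats</a></td></tr>
  <tr><td>1</td><td><a href="/team/1"> Team A </a></td></tr>
  <tr><td>2</td><td><a href="/team/2">Team B</a></td></tr>
</table>
"""

print(team_names(SAMPLE_HTML))
print(team_names("<p>no table here</p>"))
```

Returning an empty list for a missing table is a design choice for this sketch; raising an error instead may be preferable if a missing table should never happen.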

【Discussion】: