[Posted]: 2017-07-15 23:00:02
[Problem description]:
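I am trying to scrape statistics from this particular web page: https://www.sports-reference.com/cfb/schools/louisville/2016/gamelog/
However, when I look at the HTML source, the table for the "Defensive Game Log" appears to be commented out (it starts with <!-- and ends with -->).
As a result, when I use BeautifulSoup4, the following code only scrapes the offensive data, which is not commented out, and skips the commented-out defensive data.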
from urllib.request import Request,urlopen
from bs4 import BeautifulSoup
import re
accessurl = 'https://www.sports-reference.com/cfb/schools/oklahoma-state/2016/gamelog/'
req = Request(accessurl)
link = urlopen(req)
soup = BeautifulSoup(link.read(), "lxml")
tables = soup.find_all(['th', 'tr'])
my_table = tables[0]
rows = my_table.findChildren(['tr'])
for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        value = cell.string
        print(value)
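Is there a way, inside or outside BeautifulSoup4, to collect all of the defensive values into a list, the same way the offensive data is stored? Thanks!
Note that I added the following, which comes from here, to the solution given below: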
data = []
table = defensive_log
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele]) # Get rid of empty values
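For context, the snippet above relies on a `defensive_log` table object that is not defined in the question. One common way to get such an object is to search the parsed page for `Comment` nodes and parse each comment's text a second time. The sketch below is only an illustration under stated assumptions: the inline `SAMPLE_HTML`, the table ids `offense` and `defense`, and the helper names `find_commented_table` and `table_rows` are all hypothetical, and the real page's markup may differ.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the downloaded page: one visible table and one
# table wrapped in an HTML comment. The ids are made up for this example.
SAMPLE_HTML = """
<div><table id="offense"><tbody>
<tr><th>1</th><td>2016-09-01</td><td>70</td></tr>
</tbody></table></div>
<div><!--
<table id="defense"><tbody>
<tr><th>1</th><td>2016-09-01</td><td>14</td></tr>
<tr><th>2</th><td>2016-09-09</td><td></td></tr>
</tbody></table>
--></div>
"""


def find_commented_table(soup, table_id):
    """Return the first <table id=table_id> found inside an HTML comment, or None."""
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        if "<table" not in comment:
            continue  # this comment does not hold table markup
        # Parse the comment's text as its own small document.
        inner = BeautifulSoup(comment, "html.parser")
        table = inner.find("table", id=table_id)
        if table is not None:
            return table
    return None


def table_rows(table):
    """Collect the stripped text of every non-empty <td> in each body row."""
    data = []
    for row in table.find("tbody").find_all("tr"):
        cols = [cell.text.strip() for cell in row.find_all("td")]
        data.append([value for value in cols if value])  # drop empty cells
    return data


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
defensive_log = find_commented_table(soup, "defense")
data = table_rows(defensive_log)
print(data)
```

The sketch uses the built-in "html.parser" so that it runs without lxml; with a real page, `find_commented_table` may return None if the id does not match, so that case should be checked before calling `table_rows`.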
[Discussion]:
- What do you mean by "commented out"?
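To illustrate what "commented out" appears to mean here: markup placed between <!-- and --> is kept by the parser as a single comment string, so tag searches never see the table inside it. The following is a minimal sketch using a made-up HTML string, not the actual page.

```python
from bs4 import BeautifulSoup, Comment

# Made-up markup: a visible paragraph and a table hidden inside a comment.
html = '<p>visible</p><!-- <table id="hidden"><tr><td>1</td></tr></table> -->'
doc = BeautifulSoup(html, "html.parser")

# Tag searches do not look inside comments, so no table is found.
table_found = doc.find("table") is not None

# The commented markup survives only as the text of a Comment node.
comments = doc.find_all(string=lambda text: isinstance(text, Comment))

print(table_found)
print(len(comments))
```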