Scraping PFR Football Data with Python for a Beginner: Answers

【Question title】: Scraping PFR Football Data with Python for a Beginner
【Posted】: 2018-06-21 10:14:10
【Question description】:

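Background: I'm trying to scrape some tables from this pro-football-reference page. I'm a complete newbie to Python, so a lot of the technical jargon ends up lost on me, and I haven't been able to work out how to solve this.

Specific problem: because there are multiple tables on the page, I can't figure out how to get Python to target the one I want. I'm trying to get the Defense & Fumbles table. The code below is what I have so far; it's from this tutorial, which uses a page from the same site, but one with only a single table.

Sample code: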

from urllib.request import urlopen
from bs4 import BeautifulSoup

# url we are scraping
url = "https://www.pro-football-reference.com/teams/nwe/2017.htm"

# html from the given url
html = urlopen(url)

# make soup object of html
soup = BeautifulSoup(html)

# we see that soup is a beautifulsoup object
type(soup) 

# collect the column headers from the defense table
column_headers = [th.getText() for th in 
                  soup.findAll('table', {"id": "defense").findAll('th')]

column_headers #our column headers

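Attempts: I realized the tutorial's approach wouldn't work for me, so I tried changing the soup.findAll part to target the specific table. But I repeatedly get this error:

AttributeError: ResultSet object has no attribute 'findAll'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

When I change it to find, the error becomes:

AttributeError: 'NoneType' object has no attribute 'find'

Honestly, I don't know what I'm doing or what these messages mean. I'd appreciate any help in working out how to target that data and then scrape it.

Thanks,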

【Question comments】:

Does this answer your question? Beautiful Soup: 'ResultSet' object has no attribute 'find_all'?

Tags: python

【Solution 1】:

First, you want to use soup.find('table', {"id": "defense"}).findAll('th'): find one table, then find all of its 'th' tags.
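The difference between the two calls can be seen in a minimal sketch. The HTML below is a made-up stand-in, not the real PFR markup, and the sketch uses find_all and get_text, the current spellings of findAll and getText. find_all returns a list-like ResultSet, which is why chaining another search onto it raises the AttributeError from the question, while find returns a single tag that can be searched further.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a page with several tables; not the real PFR markup.
html = """
<table id="offense"><tr><th>Player</th><th>Yds</th></tr></table>
<table id="defense"><tr><th>Player</th><th>Int</th><th>FF</th></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all returns a ResultSet (a list of tags), which cannot be searched directly.
tables = soup.find_all("table", {"id": "defense"})
try:
    tables.find_all("th")
    chained_ok = True
except AttributeError:
    chained_ok = False

# find returns a single Tag (or None when nothing matches), so chaining works.
table = soup.find("table", {"id": "defense"})
column_headers = [th.get_text() for th in table.find_all("th")]

print(chained_ok)
print(column_headers)
```

Because find returns None when nothing matches, a 'NoneType' error after switching to find usually means the table was not found in the parsed document at all.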

The other problem is that the table with id "defense" is commented out in that page's HTML:

<div class="placeholder"></div>
<!--
   <div class="table_outer_container">
      <div class="overthrow table_container" id="div_defense">
  <table class="sortable stats_table" id="defense" data-cols-to-freeze=2><caption>Defense &amp; Fumbles Table</caption>
   <colgroup><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col></colgroup>
   <thead>

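And so on. I assume JavaScript is un-hiding it. BeautifulSoup doesn't parse the text inside comments, so you'll need to find the text of all the comments on the page, as in this answer, look for the one that contains id="defense", and then feed that comment's text into BeautifulSoup.

Like this: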

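from bs4 import BeautifulSoup, Comment
# collect every HTML comment node on the page
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
# pick the comment that contains the defense table
defenseComment = next(c for c in comments if 'id="defense"' in c)
# parse the comment's text as HTML in its own right
defenseSoup = BeautifulSoup(str(defenseComment), "html.parser")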
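For anyone who wants to try the idea without hitting the site, here is a self-contained sketch of the same steps. The HTML string is a hypothetical, cut-down imitation of the commented-out block shown above, and the snake_case names are my own. It uses the string= argument, which newer versions of BeautifulSoup prefer over text=.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical page fragment: the table only exists inside an HTML comment.
html = """
<div class="placeholder"></div>
<!--
<table id="defense">
  <thead><tr><th>Player</th><th>Int</th></tr></thead>
</table>
-->
"""
soup = BeautifulSoup(html, "html.parser")

# The parser keeps the comment as one text node, so the table is not found.
direct_lookup = soup.find("table", {"id": "defense"})
print(direct_lookup)  # prints None

# Collect the comment nodes, pick the one holding the table, and parse its text.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
defense_comment = next(c for c in comments if 'id="defense"' in c)
defense_soup = BeautifulSoup(str(defense_comment), "html.parser")

# Now the table is ordinary markup and can be searched as usual.
table = defense_soup.find("table", {"id": "defense"})
headers = [th.get_text() for th in table.find_all("th")]
print(headers)
```

The first lookup returning None is the source of the 'NoneType' error in the question: the table is simply not part of the parsed tree until the comment text is parsed separately.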

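【Comments】:

Hi, thanks for the reply. I ran what you suggested, and when I inspected the contents of defenseSoup it gave me some very hard-to-read text. I assume that's because all the HTML got turned into text, right? My original plan was to convert it into a data frame with pandas using the instructions outlined in the original tutorial, but it doesn't look like that will work in this case. I tried running the original column_headers = soup.find statement on defenseSoup.find instead, but that gave me a NoneType error, so I'm not sure where to go from here. Any suggestions?
You'll have to do a bit more work to turn the HTML table into a data frame. At a minimum you'll need something like defenseSoup.findAll('tr') to find all the rows, and then tr.findAll('td') on each row to get the cells. It takes some figuring out, but it's worth learning :)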
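A rough sketch of that row-by-row approach, under some assumptions: the table markup below is invented for illustration (including the placeholder player names and the idea that the first cell of each row is a th), so the real page may need adjustments such as handling multi-row headers.

```python
from bs4 import BeautifulSoup

# Hypothetical table markup standing in for the text recovered from the comment.
table_html = """
<table id="defense">
  <thead><tr><th>Player</th><th>Int</th><th>FF</th></tr></thead>
  <tbody>
    <tr><th>Player A</th><td>2</td><td>1</td></tr>
    <tr><th>Player B</th><td>0</td><td>3</td></tr>
  </tbody>
</table>
"""
defense_soup = BeautifulSoup(table_html, "html.parser")
table = defense_soup.find("table", {"id": "defense"})

# Header names come from the <th> cells of the header row.
headers = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]

# Each body row becomes a dict; row-label cells may be <th>, so accept both tags.
rows = []
for tr in table.find("tbody").find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(dict(zip(headers, cells)))

print(rows)
```

If pandas is installed, a list of dicts like this can be passed to pandas.DataFrame(rows) to get the data frame. All values are still strings at this point, so numeric columns would need converting.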

【Solution 2】:

You are missing a "}" in the dictionary after the word "defense". Try the line below and see if it works.

column_headers = [th.getText() for th in soup.findAll('table', {"id": "defense"}).findAll('th')]

【Comments】:

Unfortunately that didn't solve the problem; I'm still seeing the same error.
Check Nathan's answer for the rest... he beat me to it ;-)