Scraping a difficult table: answers

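【Question title】: Scraping difficult table
【Posted】: 2019-08-07 10:45:33
【Question】:

I have been trying to scrape a table from https://www.basketball-reference.com/leagues/NBA_2019.html, without success. The table I am trying to scrape is titled "Team Per Game Stats". I am confident that once I can scrape one element of that table, I can loop over the columns I want from a list and end up with a pandas DataFrame.

Here is my code so far: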

from bs4 import BeautifulSoup
import requests

# URL that we are scraping
r = requests.get('https://www.basketball-reference.com/leagues/NBA_2019.html')
# Let's look at what the request content looks like
print(r.content)

# use BeautifulSoup on the content from the request
c = r.content
soup = BeautifulSoup(c, 'html.parser')  # name the parser explicitly to avoid the bs4 warning
print(soup)

# prettify() indents the HTML so its nesting is visible,
# which can make reading the HTML a little easier
print(soup.prettify())

# get the wrapper element for the "Team Per Game Stats" table
team_per_game = soup.find(id="all_team-stats-per_game")
print(team_per_game)

Any help would be greatly appreciated.

【Comments】:

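The page cheats: the HTML source for the table is stored in HTML comments, which JavaScript then extracts and puts back into the HTML...
Is this a precaution to prevent scraping?
The more likely reason is to keep the tables from showing up in Google results pages.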

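Tags: python web-scraping beautifulsoup

【Solution 1】:

The web page uses a trick to try to stop search engines and other automated web clients (scrapers included) from finding the table data: the tables are stored in HTML comments: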

<div id="all_team-stats-per_game" class="table_wrapper setup_commented commented">

<div class="section_heading">
  <span class="section_anchor" id="team-stats-per_game_link" data-label="Team Per Game Stats"></span><h2>Team Per Game Stats</h2>    <div class="section_heading_text">
      <ul> <li><small>* Playoff teams</small></li>
      </ul>
    </div>      
</div>
<div class="placeholder"></div>
<!--
   <div class="table_outer_container">
      <div class="overthrow table_container" id="div_team-stats-per_game">
  <table class="sortable stats_table" id="team-stats-per_game" data-cols-to-freeze=2><caption>Team Per Game Stats Table</caption>

...

</table>

      </div>
   </div>
-->
</div>

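Note that the opening div has the setup_commented and commented classes. JavaScript code included in the page is then executed by your browser; it loads the text from those comments and replaces the placeholder div with the new HTML content for the browser to display.

You can extract the comment text like this: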

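from bs4 import BeautifulSoup, Comment

# r is the requests response from the question's code
soup = BeautifulSoup(r.content, 'lxml')
# the placeholder div sits directly before the comment that holds the table
placeholder = soup.select_one('#all_team-stats-per_game .placeholder')
# skip the whitespace text node(s) and take the first Comment sibling
comment = next(elem for elem in placeholder.next_siblings if isinstance(elem, Comment))
# the comment text is itself HTML, so parse it as a new document
table_soup = BeautifulSoup(comment, 'lxml')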

Then go on to parse the table HTML.
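As a rough offline sketch of the whole approach, the snippet below runs the same steps against a small inline HTML string instead of the live site. The sample markup, its three columns, and the names SAMPLE_HTML, table_in_tree, headers, and records are all made up for illustration; the real table is larger and may be structured differently.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical, heavily shortened stand-in for the real page markup.
SAMPLE_HTML = """
<div id="all_team-stats-per_game" class="table_wrapper setup_commented commented">
<div class="placeholder"></div>
<!--
<table id="team-stats-per_game">
<thead><tr><th>Rk</th><th>Team</th><th>G</th></tr></thead>
<tbody>
<tr><th>1</th><td>Milwaukee Bucks*</td><td>69</td></tr>
<tr><th>2</th><td>Los Angeles Clippers</td><td>70</td></tr>
</tbody>
</table>
-->
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
wrapper = soup.find(id="all_team-stats-per_game")

# The table is not part of the parsed tree, because it sits inside a comment.
table_in_tree = wrapper.find("table")
print(table_in_tree)

# Find the first Comment node after the placeholder div and parse its text.
placeholder = wrapper.select_one(".placeholder")
comment = next(el for el in placeholder.next_siblings if isinstance(el, Comment))
table_soup = BeautifulSoup(comment, "html.parser")

# Column names come from the header row; data rows here mix <th> and <td> cells.
headers = [th.get_text(strip=True) for th in table_soup.select("thead th")]
records = []
for row in table_soup.select("tbody tr"):
    cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    records.append(dict(zip(headers, cells)))
print(records)
```

Keeping each row as a dict keyed by header text makes it straightforward to pick only the columns you want later.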

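This particular site has published terms of use and a page on data use, which you should probably read if you intend to use their data. Specifically, their terms state, under section 6. Site Content:

You may not frame, capture, harvest, or collect any part of the Site or Content without SRL's advance written consent.

Scraping the data would fall under that heading.

【Comments】: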

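May I ask what the word next does in this context?
@QHarr: It is a function that fetches the next item from an iterator. Here it is passed a generator expression that filters the Element.next_siblings iterator. I use it to skip the text node that directly follows the placeholder div.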
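To make the next() pattern concrete without bs4, here is a minimal sketch on a plain list. The siblings list is a made-up stand-in for the sibling nodes (whitespace strings plus the items of interest); the second call shows the optional default argument of next(), which the answer's code does not use.

```python
# Mixed sequence standing in for sibling nodes: whitespace strings and other items.
siblings = ["\n", "   ", 42, "\n", 7]

# next() pulls the first item the generator yields; the filter skips non-matches.
first_int = next(item for item in siblings if isinstance(item, int))
print(first_int)

# With a default, next() returns it instead of raising StopIteration
# when nothing in the sequence matches.
first_float = next((item for item in siblings if isinstance(item, float)), None)
print(first_float)
```

Without a default, a page that has no matching comment would raise StopIteration, so passing None and checking for it may be the friendlier choice in a scraper.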

【Solution 2】:

Just to complete Martijn Pieters's answer (and without lxml):

from bs4 import BeautifulSoup, Comment
import requests

r = requests.get('https://www.basketball-reference.com/leagues/NBA_2019.html')
soup = BeautifulSoup(r.content, 'html.parser')
placeholder = soup.select_one('#all_team-stats-per_game .placeholder')
comment = next(elem for elem in placeholder.next_siblings if isinstance(elem, Comment))
table = BeautifulSoup(comment, 'html.parser')
rows = table.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    if cells:
        print([cell.text for cell in cells])

Partial output:

[u'New Orleans Pelicans', u'71', u'240.0', u'43.6', u'91.7', u'.476', u'10.1', u'29.4', u'.344', u'33.5', u'62.4', u'.537', u'18.1', u'23.9', u'.760', u'11.0', u'36.0', u'47.0', u'27.0', u'7.5', u'5.5', u'14.5', u'21.4', u'115.5']
[u'Milwaukee Bucks*', u'69', u'241.1', u'43.3', u'90.8', u'.477', u'13.3', u'37.9', u'.351', u'30.0', u'52.9', u'.567', u'17.6', u'22.8', u'.773', u'9.3', u'40.1', u'49.4', u'26.0', u'7.4', u'6.0', u'14.0', u'19.8', u'117.6']
[u'Los Angeles Clippers', u'70', u'241.8', u'41.0', u'87.6', u'.469', u'9.8', u'25.2', u'.387', u'31.3', u'62.3', u'.502', u'22.8', u'28.8', u'.792', u'9.9', u'35.7', u'45.6', u'23.4', u'6.6', u'4.7', u'14.5', u'23.5', u'114.6']
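Since the stated goal is a pandas DataFrame, rows like these could be loaded roughly as follows. This is only a sketch: the column names are an assumption (the output above does not include headers), and only the first four values of each printed row are used to keep it short.

```python
import pandas as pd

# Column names are an assumption; the printed output does not show headers.
columns = ["Team", "G", "MP", "FG"]

# First four values of each row from the partial output above.
rows = [
    ["New Orleans Pelicans", "71", "240.0", "43.6"],
    ["Milwaukee Bucks*", "69", "241.1", "43.3"],
    ["Los Angeles Clippers", "70", "241.8", "41.0"],
]

df = pd.DataFrame(rows, columns=columns)

# Cell text is scraped as strings, so convert the numeric columns explicitly.
numeric_cols = ["G", "MP", "FG"]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)
print(df.dtypes)
```

In a real run the column names would more reliably come from the table's own header cells than from a hand-written list.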

【Comments】:

Both are incredible answers. Thank you both!