BeautifulSoup returns None even though the element exists

【Question title】: BeautifulSoup returns None even though the element exists
【Posted】: 2017-03-01 23:07:41
【Question】:

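I have gone through most of the solutions to similar questions but have not found one that works. More importantly, I have not found an explanation of why this happens, beyond the suggestion that JavaScript or something similar is being called on the site being scraped.

I am trying to scrape the game's "Officials" table from this page: http://www.pro-football-reference.com/boxscores/201309050den.htm

My code is: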

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.pro-football-reference.com/boxscores/201309050den.htm"
html = urlopen(url)
bsObj = BeautifulSoup(html, "lxml")
officials = bsObj.findAll("table",{"id":"officials"})

for entry in officials:
    print(str(entry))

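For now I am just printing to the console, but I get an empty list with findAll, or None with find. I also tried the basic html.parser, with no luck.

Can someone with a better understanding of HTML tell me what is different about this web page? Thanks in advance!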

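【Comments】:

That element does not exist. Visit the URL in a browser and choose "View source" or similar, then search for "officials". Note that the only thing that looks like a table with that id is inside a comment (i.e. inside <!-- ... -->).
So what tells the site to display the officials table? When I open the dev tools I do see the element on the page, so can it exist on the site for display but not be in the HTML that BeautifulSoup sees? Schrödinger's cat?
Aside: be sure to read section 2 of their terms of use, which covers automated data retrieval.

Tags: python web-scraping beautifulsoup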

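【Solution 1】:

You can't see it because it isn't there. Turn off JS and open the page in your browser and you will see that the table is missing: the site does some JS DOM manipulation.

Your options are:

- In your case the HTML you want is already in the response, just inside a comment, so extract it from the comment with BeautifulSoup.
- Use Selenium or an equivalent tool to render the JS (which is exactly what your browser does).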
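
The first option can be sketched as follows. This is only an illustration, not the answerer's code: `SAMPLE_HTML` is a made-up stand-in for the downloaded page, `find_in_comments` is a hypothetical helper name, and `html.parser` is used so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup, Comment

# Made-up stand-in for the real page: the table only exists inside a comment.
SAMPLE_HTML = """
<html><body>
<div class="placeholder">
<!--
<table id="officials">
<tr><th>Referee</th><td>Walt Coleman</td></tr>
</table>
-->
</div>
</body></html>
"""


def find_in_comments(html, element_id):
    """Return the element with the given id, looking inside HTML comments too."""
    soup = BeautifulSoup(html, "html.parser")

    # Normal lookup first: works when the element is real markup.
    found = soup.find(id=element_id)
    if found is not None:
        return found

    # Otherwise re-parse every comment that mentions the id.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        if element_id not in comment:
            continue
        found = BeautifulSoup(comment, "html.parser").find(id=element_id)
        if found is not None:
            return found
    return None


table = find_in_comments(SAMPLE_HTML, "officials")
print(table.find("td").get_text())
```

Parsing each comment separately leaves the rest of the document untouched, which may matter if the page contains other comments that should stay comments.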

【Discussion】:

【Solution 2】:

Try this code:

from selenium import webdriver
import time
from bs4 import BeautifulSoup


driver = webdriver.Chrome()
url= "http://www.pro-football-reference.com/boxscores/201309050den.htm"
driver.maximize_window()
driver.get(url)

time.sleep(5)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
officials = soup.findAll("table",{"id":"officials"})

for entry in officials:
    print(str(entry))


driver.quit()

It will print:

<table class="suppress_all sortable stats_table now_sortable" data-cols-to-freeze="0" id="officials"><thead><tr class="thead onecell"><td class=" center" colspan="2" data-stat="onecell">Officials</td></tr></thead><caption>Officials Table</caption><tbody>
<tr data-row="0"><th class=" " data-stat="ref_pos" scope="row">Referee</th><td class=" " data-stat="name"><a href="/officials/ColeWa0r.htm">Walt Coleman</a></td></tr>
<tr data-row="1"><th class=" " data-stat="ref_pos" scope="row">Umpire</th><td class=" " data-stat="name"><a href="/officials/ElliRo0r.htm">Roy Ellison</a></td></tr>
<tr data-row="2"><th class=" " data-stat="ref_pos" scope="row">Head Linesman</th><td class=" " data-stat="name"><a href="/officials/BergJe1r.htm">Jerry Bergman</a></td></tr>
<tr data-row="3"><th class=" " data-stat="ref_pos" scope="row">Field Judge</th><td class=" " data-stat="name"><a href="/officials/GautGr0r.htm">Greg Gautreaux</a></td></tr>
<tr data-row="4"><th class=" " data-stat="ref_pos" scope="row">Back Judge</th><td class=" " data-stat="name"><a href="/officials/YettGr0r.htm">Greg Yette</a></td></tr>
<tr data-row="5"><th class=" " data-stat="ref_pos" scope="row">Side Judge</th><td class=" " data-stat="name"><a href="/officials/PattRi0r.htm">Rick Patterson</a></td></tr>
<tr data-row="6"><th class=" " data-stat="ref_pos" scope="row">Line Judge</th><td class=" " data-stat="name"><a href="/officials/BaynRu0r.htm">Rusty Baynes</a></td></tr>
</tbody></table>

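【Discussion】:

Could you explain why this works? And thanks for your help!
It works because he added a sleep so that all the elements can load before you capture them. I was stuck too and had completely forgotten about the sleep() method ;-;

【Solution 3】:

It is in the source, just commented out. Stripping the comment markers with a regex is simple: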

from bs4 import BeautifulSoup
import requests
import re

url = "http://www.pro-football-reference.com/boxscores/201309050den.htm"
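# Use .text (str): re.sub with a str pattern raises TypeError on the bytes from .content in Python 3
html = requests.get(url).text
# Remove only the comment delimiters so the markup inside them is parsed as real elements
bsObj = BeautifulSoup(re.sub("<!--|-->","", html), "lxml")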
officials = bsObj.find_all("table",{"id":"officials"})

for entry in officials:
    print(entry)

There is only one table, so you don't need find_all and your loop is rather pointless; just use find:

In [1]: from bs4 import BeautifulSoup
   ...: import requests
   ...: import re
   ...: url = "http://www.pro-football-reference.com/boxscores/201309050den.htm"
   ...: 
   ...: html = requests.get(url).text
   ...: bsObj = BeautifulSoup(re.sub("<!--|-->","", html), "lxml")
   ...: officials = bsObj.find(id="officials")
   ...: print(officials)
   ...: 

<table class="suppress_all sortable stats_table" data-cols-to-freeze="0" id="officials"><caption>Officials Table</caption><tr class="thead onecell"><td class=" center" colspan="2" data-stat="onecell">Officials</td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Referee</th><td class=" " data-stat="name"><a href="/officials/ColeWa0r.htm">Walt Coleman</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Umpire</th><td class=" " data-stat="name"><a href="/officials/ElliRo0r.htm">Roy Ellison</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Head Linesman</th><td class=" " data-stat="name"><a href="/officials/BergJe1r.htm">Jerry Bergman</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Field Judge</th><td class=" " data-stat="name"><a href="/officials/GautGr0r.htm">Greg Gautreaux</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Back Judge</th><td class=" " data-stat="name"><a href="/officials/YettGr0r.htm">Greg Yette</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Side Judge</th><td class=" " data-stat="name"><a href="/officials/PattRi0r.htm">Rick Patterson</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Line Judge</th><td class=" " data-stat="name"><a href="/officials/BaynRu0r.htm">Rusty Baynes</a></td></tr>
</table>
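
The regex approach can also be tried offline. The following is a rough sketch under stated assumptions rather than part of the original answer: `SAMPLE_HTML` is a made-up fragment, `strip_comment_markers` is a hypothetical helper name, and turning the rows into a dict is just one possible way to use the result. The helper decodes bytes first, because `re.sub` requires the pattern and the input to be of the same type.

```python
import re
from bs4 import BeautifulSoup

# Made-up fragment imitating a table that is shipped inside an HTML comment.
SAMPLE_HTML = """
<div class="placeholder">
<!--
<table id="officials">
<tr><th>Referee</th><td>Walt Coleman</td></tr>
<tr><th>Umpire</th><td>Roy Ellison</td></tr>
</table>
-->
</div>
"""


def strip_comment_markers(html):
    """Remove '<!--' and '-->' so commented-out markup becomes real markup."""
    if isinstance(html, bytes):
        # A str pattern cannot be applied to bytes, so decode first.
        html = html.decode("utf-8", errors="replace")
    return re.sub(r"<!--|-->", "", html)


soup = BeautifulSoup(strip_comment_markers(SAMPLE_HTML), "html.parser")
officials = soup.find(id="officials")

# Map each position (th) to the official's name (td).
rows = {
    tr.th.get_text(): tr.td.get_text()
    for tr in officials.find_all("tr")
    if tr.th and tr.td
}
print(rows)
```

Note that this removes every comment marker in the document, so any text that was deliberately commented out becomes part of the parsed page as well.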

【Discussion】:

I am using BS4 but still have the problem that the script id inside the comments cannot be found.