Why can BeautifulSoup not find the HTML class? Answers

【Question title】: Why can BeautifulSoup not find the HTML class?
【Posted】: 2019-06-08 12:52:57
【Question description】:

I am trying to scrape this website with requests and BeautifulSoup in Python:

I want to get all the information inside the article tag with class="ficha-jogo". When I run the code below, x is an empty list.

import requests
from bs4 import BeautifulSoup

url = "https://globoesporte.globo.com/rs/futebol/brasileirao-serie-a/jogo/25-05-2019/gremio-atletico-mg.ghtml"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
x = soup.select(".ficha-jogo")
print(x)  # prints an empty list

I expected it to return all the tags contained in the article tag with class="ficha-jogo".

【Comments】:

Tags: python web-scraping beautifulsoup python-requests-html

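【Solution 1】:

This site loads the article data dynamically (through JavaScript or AJAX requests), so the element is not in the HTML that requests downloads. Try the Selenium browser-automation library instead. It drives a real browser, which lets you scrape pages whose content is rendered dynamically.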
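Before launching a browser, one quick way to check this diagnosis is to test whether the class appears anywhere in the raw server response. The following is a minimal sketch using only the standard library. The helper name `class_in_static_html` and both sample strings are hypothetical and stand in for `r.text` from `requests.get(url)`.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Records whether any start tag carries the given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        for name, value in attrs:
            if name == "class" and value and self.class_name in value.split():
                self.found = True


def class_in_static_html(html, class_name):
    """Return True if class_name appears on some tag in the raw HTML string."""
    finder = ClassFinder(class_name)
    finder.feed(html)
    return finder.found


# Hypothetical sample markup, standing in for r.text from requests.get(url)
static_page = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'
rendered_page = '<html><body><article class="ficha-jogo destaque">...</article></body></html>'

print(class_in_static_html(static_page, "ficha-jogo"))    # False
print(class_in_static_html(rendered_page, "ficha-jogo"))  # True
```

If the check returns False for the raw response while the element is visible in the browser's developer tools, the content is most likely added by JavaScript, and a real browser is needed, as in the Selenium code that follows.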

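from bs4 import BeautifulSoup
from selenium import webdriver

# Requires Chrome and a matching chromedriver
browser = webdriver.Chrome()
url = "https://globoesporte.globo.com/rs/futebol/brasileirao-serie-a/jogo/25-05-2019/gremio-atletico-mg.ghtml"

try:
    # The browser runs the page's JavaScript before we read the HTML
    browser.get(url)
    soup = BeautifulSoup(browser.page_source, "html.parser")
finally:
    browser.quit()  # always close the browser

article = soup.find("article", {"class": "ficha-jogo"})
# find() returns None if the element is missing
print(article.text if article is not None else "ficha-jogo not found")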

Output:

GREPaulo Victor 1GOLLeonardo 6LADPedro Geromel 3ZADRodrigues 38ZAEJuninho Capixaba 29LAEMichel  5VOLMaicon 8VOLJean Pyerre 21MECThaciano 16MECEverton 11ATAAlisson 23ATADiego Tardelli 9ATAAndré 90ATAFelipe Vizeu 10ATACAMVictor 1GOLPatric 2LADLeonardo Silva 3ZADIgor Rabello 16ZAEFábio Santos 6LAEJosé Welison 14VOLNathan 23MECJair 88VOLCazares 10MECGeuvânio 49ATALuan 27MECBruninho 43MECRicardo Oliveira 9ATAChará 8ATARenato GaúchoTécnico4 - 3 - 3Esquema TáticoRodrigo SantanaTécnico4 - 4 - 2Esquema TáticoMostrar ficha completaReservasJúlio César 22GOLLéo Moura 2LADRafael Galhardo 42LADRomulo 13VOLDarlan 37VOLMontoya 20MECVico 15ATAPepê 25ATACleiton 40GOLIago Maidana 19ZADHulk 22LAEAdilson 21VOLVinícius 29MECTerans 20MECAlerrandro 44ATAMaicon 11ATAInformações sobre o jogoArena do GrêmioArena Desportiva
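The output above runs together because `.text` concatenates every text node without a separator. As a possible refinement, BeautifulSoup's `stripped_strings` and `get_text(separator=..., strip=True)` keep the fields apart. The markup below is a hypothetical fragment, not the site's actual structure, which may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating lineup markup; the real page may differ
html = """
<article class="ficha-jogo">
  <span>Paulo Victor</span><span>1</span><span>GOL</span>
  <span>Leonardo</span><span>6</span><span>LAD</span>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
article = soup.find("article", class_="ficha-jogo")

# stripped_strings yields each non-empty text node with whitespace trimmed
fields = list(article.stripped_strings)
print(fields)
# ['Paulo Victor', '1', 'GOL', 'Leonardo', '6', 'LAD']

# get_text can join the same text nodes with a chosen separator
print(article.get_text(separator=" | ", strip=True))
# Paulo Victor | 1 | GOL | Leonardo | 6 | LAD
```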

Download the Selenium web driver for the Chrome browser:

http://chromedriver.chromium.org/downloads

Install the web driver for the Chrome browser:

https://christopher.su/2015/selenium-chromedriver-ubuntu/

Selenium tutorial:

https://selenium-python.readthedocs.io/

【Discussion】:

【Solution 2】:

You can also do this with requests_html:

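from requests_html import HTMLSession

session = HTMLSession()

url = "https://globoesporte.globo.com/rs/futebol/brasileirao-serie-a/jogo/25-05-2019/gremio-atletico-mg.ghtml"

r = session.get(url)
# Execute the page's JavaScript in headless Chromium
# (Chromium is downloaded the first time render() is called)
r.html.render()

# first=True returns a single element instead of a list
article = r.html.find('.ficha-jogo', first=True).text
print(article)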

【Discussion】: