【Question title】: Cannot scrape a website with BeautifulSoup4
【Posted】: 2018-04-17 17:18:46
【Question】:

The text I want to scrape is the heading "123rd Meeting" from

https://www.bcb.gov.br/en/#!/c/copomstatements/1724

To do this, I am using the following code:

import urllib.request           #get the HTML page from url 
import urllib.error

from bs4 import BeautifulSoup


# set page to read
with urllib.request.urlopen('https://www.bcb.gov.br/en/#!/c/copomstatements/1724') as response:
   page = response.read()

# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, "html.parser")
print(soup)

# Inspect: <h3 class="BCTituloPagina ng-binding">123rd Meeting</h3>
title = soup.find("h3", attrs={"class": "BCTituloPagina ng-binding"})
print(title)

However, the command

print(soup)

returns neither the title (123rd Meeting) nor the body text (In light of .... target reduced by 25 basis points).

【Comments】:

    Tags: python-3.x beautifulsoup


    【Solution 1】:

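    You cannot extract the title with a plain HTTP request library in Python, because the element you are trying to extract is rendered by JavaScript in the browser. The HTML that the server returns does not contain it. You will need a tool that runs the page's JavaScript, such as Selenium, to get it.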
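
The URL in the question hints at why the plain request comes back without the heading. Everything after `#` is a fragment, which HTTP clients do not send to the server, so the route `!/c/copomstatements/1724` is only ever interpreted by JavaScript running in the browser. The following is a minimal sketch using the standard library's `urllib.parse.urldefrag`; the variable names are illustrative and not part of the original post.

```python
from urllib.parse import urldefrag

page_url = "https://www.bcb.gov.br/en/#!/c/copomstatements/1724"

# urldefrag splits a URL into the part that is requested from the server
# and the fragment that stays on the client side.
requested_url, fragment = urldefrag(page_url)

print(requested_url)  # https://www.bcb.gov.br/en/
print(fragment)       # !/c/copomstatements/1724
```

In other words, the `urlopen` call in the question most likely receives only the generic page at `/en/`, and the statement itself is filled in later by script.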

    Code:

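    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    try:
        driver.get('https://www.bcb.gov.br/en/#!/c/copomstatements/1724')
        # Wait up to 10 seconds for the JavaScript-rendered heading to become visible.
        # until() returns the located element, so no second lookup is needed
        # (find_element_by_xpath was removed in Selenium 4).
        heading = WebDriverWait(driver, 10).until(
            EC.visibility_of_element_located((By.XPATH, '//h3'))
        )
        print(heading.text)
    except TimeoutException:
        print('The heading did not appear within 10 seconds')
    finally:
        # quit() closes the browser and ends the driver process, even after an error
        driver.quit()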
    

    Output:

    123rd Meeting

    【Comments】:

    • Thanks @Ali for the prompt reply. Since driver = webdriver.Chrome() opens Google Chrome, and this command has to run (in a loop) at least 100 times, I added the following line to close it: driver.close()