Selenium is very slow, any alternative? (Answers)

【Question title】: Selenium is very slow, any alternative?
【Posted】: 2021-10-31 14:11:46
【Question description】:

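URL - https://finance.yahoo.com/quote/WRD.PA?p=WRD.PA&.tsrc=fin-srch

With Selenium I can extract data from the URL above, but the process is very slow. Is there a way to extract the data using only the requests library?

I want to extract the text shown in the image.

My code for extracting the data with Selenium -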

from bs4 import BeautifulSoup
from openpyxl import load_workbook
from openpyxl.styles import Font
from selenium import webdriver
import time

# Run Chrome without a visible window
option = webdriver.ChromeOptions()
option.add_argument('--headless')
driver = webdriver.Chrome(options=option)

driver.get('https://finance.yahoo.com/quote/WRD.PA?p=WRD.PA&.tsrc=fin-srch')
time.sleep(5)  # fixed wait for the JavaScript-rendered content

html_text2 = driver.page_source
driver.quit()  # release the browser once the HTML is captured
soup2 = BeautifulSoup(html_text2, 'lxml')

# Price and change are located by their full CSS class strings
data1 = soup2.find("span", "Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)").text.strip()
data2 = soup2.find("span", "Trsdu(0.3s) Fw(500) Pstart(10px) Fz(24px) C($negativeColor)").text.strip()

# Write both values into cell B9 of an existing workbook
wb = load_workbook('output.xlsx')
ws = wb.active
fontstyle = Font(size=16)
ws['B9'].value = f'{data1}  {data2}'
ws.cell(row=9, column=2).font = fontstyle

wb.save("output.xlsx")

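【Question discussion】:

What have you tried or researched so far?
@KlausD.: He has already shown his effort. Look at these two lines: data1 = soup2.find("span" , "Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)").text.strip() data2 = soup2.find("span" , "Trsdu(0.3s) Fw(500) Pstart(10px) Fz(24px) C($negativeColor)").text.strip() They are bs4.
First I tried using only beautifulsoup together with the requests library, but I got an error (NoneType). Then I used Selenium. I think the data can also be extracted with beautifulsoup, but I don't know how.
@cruisepandey No, he only showed his Selenium-based code. The phrase "using only beautifulsoup" is misleading, because beautifulsoup cannot fetch any data from the server. That is the part Selenium does. I guess he wants to use requests or something similar. But we will know once he provides details about his attempt.
Please update the question with your attempt and the full error message!

Tags: python selenium beautifulsoup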

【Solution 1】:

You can use this css_selector:

div>span[data-reactid='31']

It has a unique entry in the HTML DOM.

In Beautiful Soup, we use select for CSS selectors rather than find.

driver.get('https://finance.yahoo.com/quote/WRD.PA?p=WRD.PA&.tsrc=fin-srch')
time.sleep(5)

html_text2 = driver.page_source
soup2 = BeautifulSoup(html_text2, 'lxml')

info = [i.text.strip() for i in soup2.select("div>span[data-reactid='31']")]
print(info)
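
To make the select versus find distinction concrete, here is a minimal sketch that runs on an inline snippet instead of the live page. The markup in SAMPLE_HTML is hypothetical, loosely modeled on the data-reactid attributes mentioned in this thread, and it uses the built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on the quote header; not the real page.
SAMPLE_HTML = """
<div data-reactid="30">
  <span data-reactid="31">26.70</span>
  <span data-reactid="32">-0.00 (-0.01%)</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select() takes a CSS selector and always returns a list (possibly empty).
by_css = [tag.get_text(strip=True)
          for tag in soup.select("div > span[data-reactid='31']")]

# find() takes a tag name plus attribute filters and returns one tag or None.
tag = soup.find("span", attrs={"data-reactid": "31"})
by_find = tag.get_text(strip=True) if tag is not None else None

# select_one() is the CSS counterpart of find(): first match or None.
missing = soup.select_one("span[data-reactid='99']")

print(by_css, by_find, missing)
```

Checking for None before calling .text is what avoids the NoneType error mentioned in the question discussion when a selector matches nothing.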

【Discussion】:

【Solution 2】:

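After installing the latest phantomjs, you can simply replace Chrome() with PhantomJS() in the webdriver line. PhantomJS is a discontinued headless browser that was used for automating web page interaction. You could also try using urllib and posting directly to the login link. You can use cookiejar to store the cookies. You could even save the cookie yourself; after all, a cookie is just a string in the HTTP headers.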
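
As a rough illustration of the urllib plus cookiejar idea, the sketch below uses only the standard library. The login URL and the form field names are hypothetical placeholders, and nothing is sent over the network here; the request is only built.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_opener():
    """Return an opener that stores and resends cookies, plus its jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(url, fields):
    """Build a POST request with form-encoded fields (nothing is sent yet)."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


opener, jar = make_opener()

# Hypothetical endpoint and field names, for illustration only.
request = build_login_request(
    "https://example.com/login",
    {"user": "alice", "password": "secret"},
)

# opener.open(request) would send it; cookies set by the server would land in
# `jar` and be sent back automatically on later opener.open(...) calls.
print(request.get_method(), request.data)
```

Whether this works for a given site depends on how its login is implemented; pages that build their content with JavaScript may still need a browser.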

It is not always Selenium that is slow. Sometimes we need to look at the code we are using.
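
One place to look in the question's code is the fixed time.sleep(5), which always costs five seconds even when the page is ready sooner. The sketch below shows the general polling idea with the standard library only; wait_until is a hypothetical helper written for this example, and the lambda is a stand-in for "the element has rendered", not a real browser check.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(probe: Callable[[], Optional[T]],
               timeout: float = 5.0,
               poll: float = 0.1) -> T:
    """Call probe() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Stand-in condition: the value becomes available after about 0.3 s.
start = time.monotonic()
value = wait_until(lambda: "26.70" if time.monotonic() - start > 0.3 else None)
elapsed = time.monotonic() - start

print(value, f"{elapsed:.2f}s")
```

Selenium ships its own version of this pattern: WebDriverWait from selenium.webdriver.support.ui, combined with the conditions in selenium.webdriver.support.expected_conditions, returns as soon as the element appears instead of sleeping for a fixed time.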

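【Discussion】:

I use requests_html for data scraping.

【Solution 3】:

You can get that data quickly with BeautifulSoup.

The data sits in a <div> whose attribute is data-reactid="30". You can select that <div> and extract the data from it.

Here is the code: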

import requests
from bs4 import BeautifulSoup

headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}

url = 'https://finance.yahoo.com/quote/WRD.PA?p=WRD.PA&.tsrc=fin-srch&guccounter=1'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

d = soup.find('div', {'data-reactid': '30'})
print(list(d.stripped_strings))

['26.70', '-0.00 (-0.01%)', 'At close:  5:35PM CEST']
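
If numbers are needed rather than strings, the list printed above can be parsed further. This is a small sketch that assumes the first item is the price and the second has the form "change (percent%)", as in the sample output; parse_quote and to_float are hypothetical helper names, not part of any library.

```python
import re

# A signed decimal number, optionally with thousands separators.
_NUM = r"[-+]?\d[\d,]*\.?\d*"
CHANGE_RE = re.compile(rf"^\s*({_NUM})\s*\(\s*({_NUM})%\s*\)\s*$")


def to_float(text):
    """Convert '1,234.50' or '-0.01' to a float."""
    return float(text.replace(",", ""))


def parse_quote(strings):
    """Return (price, change, percent) from the stripped strings of the div."""
    price = to_float(strings[0])
    match = CHANGE_RE.match(strings[1])
    if match is None:
        raise ValueError(f"unexpected change format: {strings[1]!r}")
    change, percent = (to_float(group) for group in match.groups())
    return price, change, percent


sample = ['26.70', '-0.00 (-0.01%)', 'At close:  5:35PM CEST']
price, change, percent = parse_quote(sample)
print(price, change, percent)
```

Raising ValueError on an unexpected format makes a layout change on the page show up as a clear error instead of silently writing wrong values to the spreadsheet.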

【Discussion】: