How to make data extraction from a webpage with selenium more robust and efficient?

【Question title】: How to make the data extraction from webpage with selenium more robust and efficient?
【Posted】: 2021-10-23 12:07:21
【Question】:

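I want to extract all option chain data from the Yahoo Finance webpage; for simplicity, take the put option chain data. First, load all the packages used in the program: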

import time 
import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

The function that writes a company's put option chain data to a directory:

def write_option_chain(code):
    browser = webdriver.Chrome()
    browser.maximize_window()
    url = "https://finance.yahoo.com/quote/{}/options?p={}".format(code,code)
    browser.get(url)
    WebDriverWait(browser,10).until(EC.visibility_of_element_located((By.XPATH, ".//select/option")))
    time.sleep(25)
    date_elem = browser.find_elements_by_xpath(".//select/option")
    time_span = len(date_elem)
    print('{} option chains exists in {}'.format(time_span,code)) 
    df_all = pd.DataFrame()
    for item in range(1,time_span):
        element_date = browser.find_element_by_xpath('.//select/option[{}]'.format(item))
        print("parsing {}'s  put option chain on {} now".format(code,element_date.text))
        element_date.click()
        WebDriverWait(browser,10).until(EC.visibility_of_all_elements_located((By.XPATH, ".//table[@class='puts W(100%) Pos(r) list-options']//td")))
        time.sleep(11)
        put_table = browser.find_element_by_xpath((".//table[@class='puts W(100%) Pos(r) list-options']"))
        put_table_string = put_table.get_attribute('outerHTML')
        df_put = pd.read_html(put_table_string)[0]
        df_all = df_all.append(df_put)
    browser.close()
    browser.quit()
    df_all.to_csv('/tmp/{}.csv'.format(code))
    print('{} otpion chain written into csv file'.format(code))

Test write_option_chain with a list:

nas_list = ['aapl','adbe','adi','adp','adsk']
for item in nas_list:
    try:
        write_option_chain(code=item)
    except:
        print("check what happens to {} ".format(item))
        continue
    time.sleep(5)

The output shows:

#i omitted many lines for simplicity
18 option chains exists in aapl
parsing aapl's  put option chain on August 27, 2021 now
check what happens to aapl 
check what happens to adbe 
12 option chains exists in adi
parsing adi's  put option chain on December 17, 2021 now
adi otpion chain written into csv file
11 option chains exists in adp
parsing adp's  put option chain on August 27, 2021 now
adp otpion chain written into csv file
check what happens to adsk

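Let's summarize based on the output above:

1. Only the put option chain data of adp and adi was written to the target directory.
2. Only part of the option chain data was retrieved for aapl and adp.
3. The options webpage for adsk could not be opened.
4. Execution takes about 20 minutes.

How can the data extraction from the webpage with selenium be made more robust and efficient?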

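【Comments】:

What do you mean by more robust?
It can only get option data for some of the companies, not all of them. Please copy the code, try it on your computer, and compare what you get with my results.
@showkey Is selenium a must? Can't we use other libraries to speed things up?
@showkey Haven't you replied yet? Your last edit was meant for someone else, but you did it wrong. You should not send the whole nas_list; you must send the codes one by one, like x=write_option_chain("aapl"), or perhaps, as in your code, with a for loop :)
You get nothing back because the function write_option_chain() does not return anything. Instead, it saves a CSV file at /tmp/{code}.csv.

Tags: python selenium web-crawler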

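【Solution 1】:

I'm not sure whether I may use requests and BeautifulSoup after you explicitly asked

How to make the data extraction from webpage with selenium more robust and efficient?

but this requests and BeautifulSoup code works perfectly for me.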

import io
import requests # pip install requests
from bs4 import BeautifulSoup # pip install beautifulsoup4
import pandas as pd

headers={"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:92.0) Gecko/20100101 Firefox/92.0"}

def scrape(c):
    page=requests.get(f"https://finance.yahoo.com/quote/{c}/options?p={c}",headers=headers)
    soup=BeautifulSoup(page.content,"lxml")

    # Extract the expiration timestamps from the <option> values of the <select>
    timestamp=[opt["value"] for opt in soup.find("select").find_all("option")]

    tables=[]

    for t in timestamp: # Loop through the list of timestamps
        page2=requests.get(f"https://finance.yahoo.com/quote/{c}/options?date={t}&p={c}",headers=headers)
        soup2=BeautifulSoup(page2.content,"lxml")

        table=soup2.find("table",class_="puts W(100%) Pos(r) list-options")
        try:
            tables.append(pd.read_html(io.StringIO(str(table)))[0])
        except ValueError: # no puts table was found on this page
            pass

    # DataFrame.append was removed in pandas 2.0, so concatenate once at the end
    df=pd.concat(tables) if tables else pd.DataFrame()
    df.to_csv(f"/tmp/{c}.csv",index=False)

nas_list = ['aapl','adbe','adi','adp','adsk']
for nas in nas_list:
    scrape(nas)

BeautifulSoup is much faster than Selenium for websites that do not depend on JavaScript, so I use BeautifulSoup here. Yes, you can combine Selenium with BeautifulSoup by parsing browser.page_source, but I don't think Selenium is needed here.
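
The reason no clicking is needed in the code above is that each expiration date is just the value of an <option> element, and that value can be passed back in the date query parameter. As a rough, standard-library-only sketch of that one step (the HTML snippet and the names OptionValueParser and expiration_urls are made up for illustration; the real page is far larger and its markup may change):

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class OptionValueParser(HTMLParser):
    """Collect the value attribute of every <option> found inside a <select>."""

    def __init__(self):
        super().__init__()
        self.in_select = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "select":
            self.in_select = True
        elif tag == "option" and self.in_select:
            value = dict(attrs).get("value")
            if value is not None:
                self.values.append(value)

    def handle_endtag(self, tag):
        if tag == "select":
            self.in_select = False


def expiration_urls(code, html):
    """Build one options URL per expiration timestamp found in html."""
    parser = OptionValueParser()
    parser.feed(html)
    base = f"https://finance.yahoo.com/quote/{code}/options"
    return [f"{base}?{urlencode({'date': v, 'p': code})}" for v in parser.values]


# Hypothetical, heavily simplified stand-in for the downloaded page
SAMPLE_HTML = """
<select>
  <option value="1630022400">August 27, 2021</option>
  <option value="1639699200">December 17, 2021</option>
</select>
"""

urls = expiration_urls("aapl", SAMPLE_HTML)
for url in urls:
    print(url)
```

Each URL can then be fetched independently, which is what makes the per-date requests easy to run sequentially or concurrently.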

For more details, see: Selenium versus BeautifulSoup for web scraping

【Comments】:

【Solution 2】:

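If you can use something other than selenium, the best throughput is achieved with asyncio and the aiohttp package from the PyPI repository, because of the number of concurrent URL fetch requests that need to be made (which makes it a better choice than multithreading). For even greater performance (not done here), the code could be split into URL fetching (pure I/O) and dataframe processing (CPU-bound), with a multiprocessing pool used for the latter.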

import asyncio
import aiohttp
from bs4 import BeautifulSoup
import pandas as pd
import time

async def process_code(session, code):
    async with session.get(f'https://finance.yahoo.com/quote/{code}/options?p={code}') as resp:
        status = resp.status
        if status != 200:
            raise Exception('status returned =', status)
        code_page = await resp.text()
    soup = BeautifulSoup(code_page, 'lxml')
    dates = [elem['value'] for elem in soup.find('select').find_all('option')]
    df_tables = await asyncio.gather(*(process_date(session, code, date) for date in dates))
    # DataFrame.append was removed in pandas 2.0; concatenate the non-empty results instead
    tables = [df_table for df_table in df_tables if df_table is not None]
    df_all = pd.concat(tables) if tables else pd.DataFrame()
    df_all.to_csv('/tmp/{}.csv'.format(code))

async def process_date(session, code, date):
    async with session.get(f'https://finance.yahoo.com/quote/{code}/options?date={date}&p={code}') as resp:
        status = resp.status
        if status != 200:
            raise Exception('status returned =', status)
        code_page = await resp.text()
    soup = BeautifulSoup(code_page, 'lxml')
    table = soup.find('table', class_='puts W(100%) Pos(r) list-options')
    try:
        return pd.read_html(str(table))[0]
    except ValueError:
        return None

async def main():
    nas_list = ['aapl','adbe','adi','adp','adsk']
    t = time.time()
    # Connection: keep-alive required to prevent ClientPayloadError on some websites:
    async with aiohttp.ClientSession(headers = {'Connection': 'keep-alive', 'user-agent': 'my-application'}) as session:
        await asyncio.gather(*(process_code(session, code) for code in nas_list))
    print('Elapsed time:', time.time() - t)

# Test if we are running under iPython or Jupyter Notebook:
try:
    __IPYTHON__
except NameError:
    asyncio.run(main())
else:
    asyncio.get_running_loop().create_task(main())
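
To see in isolation why asyncio.gather gives the throughput described above, here is a small sketch in which the network call is replaced by a made-up coroutine, fake_fetch, that simply sleeps. The example.com URLs and the 0.1 s delay are assumptions for illustration, not measurements of the real site:

```python
import asyncio
import time


async def fake_fetch(url, delay=0.1):
    """Stand-in for an HTTP GET: wait `delay` seconds, then return a fake body."""
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def fetch_all(urls):
    # gather() runs all the coroutines concurrently and returns results in input order
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


def timed_fetch(urls):
    """Run fetch_all and report how long the whole batch took."""
    start = time.perf_counter()
    pages = asyncio.run(fetch_all(urls))
    return pages, time.perf_counter() - start


urls = [f"https://example.com/{code}" for code in ['aapl', 'adbe', 'adi', 'adp', 'adsk']]
pages, elapsed = timed_fetch(urls)
print(f"{len(pages)} pages in {elapsed:.2f}s")
```

Because the five waits overlap, the batch takes roughly as long as one request rather than five. With real requests the gain depends on the server and the network.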

Here is the multithreaded version:

from multiprocessing.pool import ThreadPool
from functools import partial
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

def process_code(session, pool, code):
    code_page = session.get(f'https://finance.yahoo.com/quote/{code}/options?p={code}')
    soup = BeautifulSoup(code_page.content, 'lxml')
    dates = [elem['value'] for elem in soup.find('select').find_all('option')]
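    # DataFrame.append was removed in pandas 2.0; collect the tables and concatenate once
    tables = [df_table for df_table in pool.imap(partial(process_date, session, code), dates)
              if df_table is not None]
    df_all = pd.concat(tables) if tables else pd.DataFrame()
    df_all.to_csv('/tmp/{}.csv'.format(code))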

def process_date(session, code, date):
    code_page = session.get(f'https://finance.yahoo.com/quote/{code}/options?date={date}&p={code}')
    soup = BeautifulSoup(code_page.content, 'lxml')
    table = soup.find('table', class_='puts W(100%) Pos(r) list-options')
    try:
        return pd.read_html(str(table))[0]
    except ValueError:
        return None

t = time.time()
nas_list = ['aapl','adbe','adi','adp','adsk']
with requests.Session() as session:
    headers = {'User-Agent': 'my-application'}
    session.headers = headers
    pool = ThreadPool(100)
    pool.map(partial(process_code, session, pool), nas_list)
print('Elapsed time:', time.time() - t)
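
One detail of the multithreaded version is worth spelling out. It shares a single ThreadPool(100) between process_code and process_date. That works here because 100 threads far exceed the 5 tickers. If the pool were no larger than the number of tickers, every worker could end up blocked in process_code, waiting for process_date tasks that have no free thread to run on. A possible alternative, sketched here with concurrent.futures and made-up stand-in functions (fetch_dates, fetch_table) instead of real HTTP calls, is to give each level its own pool:

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_dates(code):
    """Hypothetical stand-in for downloading the list of expiration dates."""
    return [f"{code}-d{i}" for i in range(3)]


def fetch_table(date):
    """Hypothetical stand-in for downloading one puts table."""
    return {"date": date, "rows": 1}


def process_code(inner_pool, code):
    # Inner tasks run on a different pool, so an outer worker waiting here
    # cannot starve the tasks it is waiting for.
    return code, list(inner_pool.map(fetch_table, fetch_dates(code)))


def run(codes, outer_workers=2, inner_workers=4):
    """Process tickers on one pool and their per-date requests on another."""
    with ThreadPoolExecutor(outer_workers) as outer, ThreadPoolExecutor(inner_workers) as inner:
        return dict(outer.map(lambda c: process_code(inner, c), codes))


result = run(['aapl', 'adbe', 'adi', 'adp', 'adsk'])
print({code: len(tables) for code, tables in result.items()})
```

With two pools the outer pool can be kept small without risking a stall, at the cost of managing two worker counts.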

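【Comments】:

I have made a correction to the asyncio code (it was only running correctly under Jupyter Notebook; sorry about that). On my desktop it runs in about 7 seconds. The multithreaded version takes about 11 seconds.

【Solution 3】:

For this use case, using selenium is fine. You only need a few optimizations; here are some that I found: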

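Use headless mode: selenium tests can take a while to complete because the browser has to load the elements on the page. Headless testing gets rid of this load time, allowing you to cut your testing times significantly. In our tests with headless testing, we've seen a 30% reduction of test execution times (source).
Avoid multiple time.sleep() and WebDriverWait().until() calls (especially inside for loops); use a simple .implicitly_wait() instead.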
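
The idea behind the second point is that a condition-based wait returns as soon as the element is available, whereas a fixed time.sleep() always costs its full duration. The following is only a sketch of that polling idea in plain Python, not Selenium's actual implementation; wait_until, find_table and the 0.2 s "load time" are invented for illustration:

```python
import time


def wait_until(predicate, timeout=10.0, poll=0.05):
    """Poll predicate() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Simulated page whose table "appears" 0.2 s after loading started
loaded_at = time.monotonic() + 0.2


def find_table():
    return "puts-table" if time.monotonic() >= loaded_at else None


start = time.monotonic()
table = wait_until(find_table, timeout=5)
waited = time.monotonic() - start
print(f"found {table} after {waited:.2f}s")
```

Here the caller waits about 0.2 s even though the timeout is 5 s. A fixed sleep sized for the worst case would have spent the whole 5 s on every iteration.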

Code example:

import io
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

def write_option_chain(code):
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--start-maximized")
    browser = webdriver.Chrome(options=chrome_options)
    browser.implicitly_wait(10)
    url = "https://finance.yahoo.com/quote/{}/options?p={}".format(code, code)
    browser.get(url)
    # find_element(s)_by_xpath was removed in Selenium 4; use find_element(s)(By.XPATH, ...)
    date_elem = browser.find_elements(By.XPATH, ".//select/option")
    time_span = len(date_elem)
    print('{} option chains exists in {}'.format(time_span, code))
    tables = []
    # XPath positions are 1-based: this visits options 1 .. time_span-1
    # (use time_span + 1 as the upper bound to include the last one)
    for item in range(1, time_span):
        element_date = browser.find_element(By.XPATH, './/select/option[{}]'.format(item))
        print("parsing {}'s  put option chain on {} now".format(
            code, element_date.text))
        element_date.click()
        put_table = browser.find_element(By.XPATH, ".//table[@class='puts W(100%) Pos(r) list-options']")
        put_table_string = put_table.get_attribute('outerHTML')
        tables.append(pd.read_html(io.StringIO(put_table_string))[0])
    browser.quit()  # quit() closes all windows and ends the driver session
    # DataFrame.append was removed in pandas 2.0, so concatenate once at the end
    df_all = pd.concat(tables) if tables else pd.DataFrame()
    df_all.to_csv('/tmp/{}.csv'.format(code))
    print('{} option chain written into csv file'.format(code))

Then:

nas_list = ['aapl', 'adbe', 'adi', 'adp', 'adsk']
for item in nas_list:
    write_option_chain(code=item)  # this saves your df at /tmp/{code}.csv

With these simple optimizations, the code takes about 180 seconds to finish extracting everything.

【Comments】:
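
On the robustness half of the question, one option is to retry a ticker that fails and to log the real exception rather than discarding it, so that failures like those reported for aapl, adbe and adsk can be diagnosed. This is a minimal sketch with invented names (run_with_retries, flaky); the retry count and delays are arbitrary assumptions:

```python
import logging
import time


def run_with_retries(func, arg, attempts=3, base_delay=0.01):
    """Call func(arg); on failure log the exception and retry with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func(arg)
        except Exception:
            # logging.exception records the full traceback of the current error
            logging.exception("attempt %d/%d failed for %r", attempt, attempts, arg)
            if attempt == attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))


# Hypothetical flaky worker that fails twice before succeeding
calls = {"n": 0}


def flaky(code):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("page not loaded yet")
    return f"{code} written"


outcome = run_with_retries(flaky, "aapl")
print(outcome)
```

In the loop above, write_option_chain could be passed in place of flaky. Whether retrying helps depends on why the page failed to load in the first place.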