【Question title】: How to run Selenium-Scrapy in parallel
【Posted】: 2021-05-09 09:46:03
【Question】:

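I'm trying to scrape a JavaScript website using Scrapy and Selenium. I open the site with Selenium and a Chrome driver, then use Scrapy to scrape all the links to the individual listings from the current page and store them in a list. This has been the best approach so far, because following the links with SeleniumRequest and a callback to a separate page-parsing function caused a lot of errors. Then I loop through the list of URLs, open each one in the Selenium driver, and scrape the information from the page. So far this scrapes 16 pages/minute, which is not ideal given the number of listings on the site. Ideally, I would have the Selenium drivers open links in parallel, as in the following implementations: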

How can I make Selenium run in parallel with Scrapy?

https://gist.github.com/miraculixx/2f9549b79b451b522dde292c4a44177b

However, I don't know how to implement parallel processing in my selenium-scrapy code.

    import scrapy
    import time
    from scrapy.selector import Selector
    from scrapy_selenium import SeleniumRequest
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support.ui import Select
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC


    class MarketPagSpider(scrapy.Spider):
        name = 'marketPagination'

        # absolute URLs of the individual listings found on the marketplace page
        responses = []

        def start_requests(self):
            yield SeleniumRequest(
                url="https://www.cryptoslam.io/nba-top-shot/marketplace",
                wait_time=5,
                wait_until=EC.presence_of_element_located((By.XPATH, '//SELECT[@name="table_length"]')),
                callback=self.parse
            )

        def parse(self, response):
            # reuse the Selenium driver that scrapy-selenium attached to the response
            driver = response.meta['driver']
            driver.set_window_size(1920, 1080)

            time.sleep(1)
            WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.XPATH, "(//th[@class='nowrap sorting'])[1]"))
            )

            # build a selector from the rendered page before selecting the rows
            # (response_obj was previously used here before it was assigned)
            response_obj = Selector(text=driver.page_source)

            rows = response_obj.xpath("//tbody/tr[@role='row']")
            for row in rows:
                link = row.xpath(".//td[4]/a/@href").get()
                absolute_url = response.urljoin(link)

                self.responses.append(absolute_url)

            # open every listing one after another in the same driver (the slow part)
            for resp in self.responses:
                driver.get(resp)
                html = driver.page_source
                response_obj = Selector(text=html)

                yield {
                    'name': response_obj.xpath("//div[@class='ibox-content animated fadeIn fetchable-content js-attributes-wrapper']/h4[4]/span/a/text()").get(),
                    'price': response_obj.xpath("//span[@class='js-auction-current-price']/text()").get()
                }

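I know that scrapy-splash can handle multiprocessing, but the website I'm trying to scrape doesn't open in Splash (at least I don't think it does).

Also, I removed the lines of code that handle pagination to keep the code concise.

I'm very new to this and open to any suggestions and solutions for multiprocessing with Selenium.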

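【Comments on the question】:

  • Post your multiprocessing code. It works as usual, but each "thread/process" should use its own driver.
  • @Wonka I'm not really sure how to implement it. I'm very unfamiliar with the multiprocessing library in general, my apologies.
  • See [this question](stackoverflow.com/questions/53475578/…) for the basic technique, both the accepted answer and my (Booboo) answer, which ensures that the drivers are terminated when you are done. The accepted answer shows the technique of using one driver per thread rather than one driver per URL. In other words, it reuses drivers, just as you reuse the driver for all the URLs in your non-threaded code.
  • @Booboo Hey, thanks for your answer! I managed to get Selenium to run in parallel as in your solution. However, I can't seem to delete the drivers after the script finishes, even when I put del threadlocal at the end. I actually end up with this error: NameError: name 'threadLocal' is not defined
  • The accepted answer has the declaration threadLocal = threading.local(). I didn't copy it into my answer, assuming it was understood. I have now updated the answer to include the declaration explicitly.

Tags: python selenium web-scraping scrapy multiprocessing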
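The "one driver per thread" idea from the comments, and the NameError quoted above, can be illustrated without Selenium. This is only a minimal sketch: FakeDriver, get_resource, fetch, run and the created list are hypothetical names used for illustration, and FakeDriver stands in for a real driver so the snippet runs anywhere.

```python
import threading
from multiprocessing.pool import ThreadPool

# Must be defined at module level before any worker uses it; leaving this
# line out is what leads to "NameError: name 'threadLocal' is not defined".
threadLocal = threading.local()

created = []  # every FakeDriver that was built, kept only for illustration


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver."""

    def __init__(self):
        self.owner = threading.current_thread().name
        created.append(self)

    def get(self, url):
        return f"{self.owner} fetched {url}"


def get_resource():
    # Reuse this thread's driver if it already has one; otherwise build it once.
    resource = getattr(threadLocal, "resource", None)
    if resource is None:
        resource = FakeDriver()
        threadLocal.resource = resource
    return resource


def fetch(url):
    return get_resource().get(url)


def run(urls, n_threads=2):
    # Each worker thread ends up with its own FakeDriver, shared by all the
    # URLs that thread happens to process.
    with ThreadPool(n_threads) as pool:
        return pool.map(fetch, urls)
```

Calling run() with eight URLs and two threads builds at most two FakeDriver objects, not eight, which is the reuse the comment describes.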


【Solution 1】:

The following sample program creates a thread pool with only 2 threads for demo purposes and then scrapes 4 URLs to get their titles:

from multiprocessing.pool import ThreadPool
from bs4 import BeautifulSoup
from selenium import webdriver
import threading
import gc

class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        # suppress logging:
        options.add_experimental_option('excludeSwitches', ['enable-logging'])
        self.driver = webdriver.Chrome(options=options)
        print('The driver was just created.')

    def __del__(self):
        self.driver.quit() # clean up driver when we are cleaned up
        print('The driver has terminated.')


threadLocal = threading.local()

def create_driver():
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = Driver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver.driver


def get_title(url):
    driver = create_driver()
    driver.get(url)
    source = BeautifulSoup(driver.page_source, "lxml")
    title = source.select_one("title").text
    print(f"{url}: '{title}'")

# just 2 threads in our pool for demo purposes:
with ThreadPool(2) as pool:
    urls = [
        'https://www.google.com',
        'https://www.microsoft.com',
        'https://www.ibm.com',
        'https://www.yahoo.com'
    ]
    pool.map(get_title, urls)
    # must be done before terminate is explicitly or implicitly called on the pool:
    del threadLocal
    gc.collect()
# pool.terminate() is called at exit of with block

Prints:

The driver was just created.
The driver was just created.
https://www.google.com: 'Google'
https://www.microsoft.com: 'Microsoft - Official Home Page'
https://www.ibm.com: 'IBM - United States'
https://www.yahoo.com: 'Yahoo'
The driver has terminated.
The driver has terminated.
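
The program above only prints titles. For the spider in the question, the workers would need to return data so the caller can yield it. The following is a hedged sketch of one way to do that, not part of the original answer: scrape_listings, make_chrome_driver, parse_listing and driver_factory are hypothetical names. It also swaps the `__del__` plus gc.collect() cleanup for an explicit list of drivers that are quit in a finally block.

```python
import threading
from multiprocessing.pool import ThreadPool


def make_chrome_driver():
    # Imported lazily so this module can be loaded without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def scrape_listings(urls, parse_listing, driver_factory=make_chrome_driver, n_threads=4):
    """Fetch every URL with one driver per worker thread.

    parse_listing(url, html) is supplied by the caller and turns a page into an item.
    Results come back in the same order as urls.
    """
    local = threading.local()   # holds the driver owned by each worker thread
    drivers = []                # every driver created, so all of them can be quit
    lock = threading.Lock()

    def get_driver():
        driver = getattr(local, "driver", None)
        if driver is None:
            driver = driver_factory()
            local.driver = driver
            with lock:
                drivers.append(driver)
        return driver

    def scrape_one(url):
        driver = get_driver()
        driver.get(url)
        return parse_listing(url, driver.page_source)

    try:
        with ThreadPool(n_threads) as pool:
            return pool.map(scrape_one, urls)
    finally:
        # Explicit cleanup: quit each driver whether or not scraping succeeded.
        for driver in drivers:
            driver.quit()
```

In the spider's parse method, this could be used roughly as `for item in scrape_listings(self.responses, parse_listing): yield item`, where parse_listing builds a Selector(text=html) and applies the name and price XPath expressions from the question. Note that this call blocks until all pages are fetched, just as the original sequential loop does.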
