How to navigate through js/ajax(href="#") based pagination with Scrapy? (Answers)

【Question title】: How to navigate through js/ajax(href="#") based pagination with Scrapy?
【Posted】: 2020-02-18 10:43:37
【Question description】:

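I want to iterate over all the category URLs and scrape the content of every page. In this code, urls = [response.xpath('//ul[@class="flexboxesmain categorieslist"]/li/a/@href').extract()[0]] only takes the first category's URL, but my goal is to get all the URLs and the content behind each of them.

I am using the scrapy_selenium library. The Selenium page source is not being passed to the "scrape_it" function. Please review my code and let me know if anything is wrong with it. I am new to the Scrapy framework.

Below is my spider code -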

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
from scrapy import Selector
from scrapy_selenium import SeleniumRequest
from ..items import CouponcollectItem

class Couponsite6SpiderSpider(scrapy.Spider):
    name = 'couponSite6_spider'
    allowed_domains = ['www.couponcodesme.com']
    start_urls = ['https://www.couponcodesme.com/ae/categories']
    
    def parse(self, response):   
        urls = [response.xpath('//ul[@class="flexboxesmain categorieslist"]/li/a/@href').extract()[0]]
        for url in urls:
            yield SeleniumRequest(
                url=response.urljoin(url),
                wait_time=3,
                callback=self.parse_urls
            ) 

    def parse_urls(self, response):
        driver = response.meta['driver']
        while True:
            next_page = driver.find_element_by_xpath('//a[@class="category_pagination_btn next_btn bottom_page_btn"]')
            try:
                html = driver.page_source
                response_obj = Selector(text=html)
                self.scrape_it(response_obj)
                next_page.click()
            except:
                break
        driver.close()

    def scrape_it(self, response):
        items = CouponcollectItem()
        print('Hi there')
        items['store_img_src'] = response.css('#temp1 > div > div.voucher_col_left.flexbox.spaceBetween > div.vouchercont.offerImg.flexbox.column1 > div.column.column1 > div > div > a > img::attr(src)').extract()
        yield items

I added the following code to my settings.py file -

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

#SELENIUM
from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS=['-headless']  # '--headless' if using chrome instead of firefox

I am attaching a screenshot of the terminal output. Thank you for your time! Please help me solve this.

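【Comments】:

You should not be using Selenium with Scrapy like this; it is also an outdated approach. All you would do is edit the downloader middleware so that it makes the requests with Selenium and returns its HTML. Ideally, you would use Splash instead of Selenium.
docs.scrapy.org/en/latest/topics/dynamic-content.html
As a follow-up I asked this question, please help me with it: stackoverflow.com/questions/60375046/…

Tags: python selenium web-scraping scrapy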

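【Solution 1】:

The problem is that you cannot share the driver among asynchronously running threads, and you cannot run more than one in parallel either. You can take out the yields, and it will process the categories one at a time:

At the top: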

from selenium import webdriver
import time

driver = webdriver.Chrome()

Then in your class:

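def parse(self, response):
  # Take every category link, not just the first one.
  urls = response.xpath('//ul[@class="flexboxesmain categorieslist"]/li/a/@href').extract()
  for url in urls:
    # urljoin makes relative hrefs absolute; absolute ones pass through unchanged.
    self.do_category(response.urljoin(url))

def do_page(self):
  # Crude fixed wait so the AJAX content can load before the source is read.
  time.sleep(1)
  html = driver.page_source
  response_obj = Selector(text=html)
  self.scrape_it(response_obj)

def do_category(self, url):
  driver.get(url)
  self.do_page()
  # find_elements_* returns an empty list when no next button exists, which ends the loop.
  next_links = driver.find_elements_by_css_selector('a.next_btn')
  while len(next_links) > 0:
    next_links[0].click()
    self.do_page()
    next_links = driver.find_elements_by_css_selector('a.next_btn')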
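
The loop in do_category ends cleanly because Selenium's find_elements_* methods return an empty list when nothing matches, whereas find_element_* (singular, called outside the try block in the question's code) raises NoSuchElementException on the last page. Here is a minimal offline sketch of the same pattern. FakeDriver, FakeLink, find_next_links, collect_pages and the max_pages cap are all hypothetical stand-ins for illustration; none of them are part of Selenium or of the answer above.

```python
class FakeLink:
    """Hypothetical stand-in for a Selenium element that can be clicked."""

    def __init__(self, driver):
        self.driver = driver

    def click(self):
        # Clicking "next" moves the fake browser to the following page.
        self.driver.index += 1


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver, so the loop runs offline."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    @property
    def page_source(self):
        return self.pages[self.index]

    def find_next_links(self):
        # Mirrors find_elements_*: an empty list on the last page, no exception.
        if self.index < len(self.pages) - 1:
            return [FakeLink(self)]
        return []


def collect_pages(driver, max_pages=50):
    # Read the current page, then keep clicking "next" while a link exists.
    # max_pages is an assumed safety cap against endless pagination.
    sources = [driver.page_source]
    next_links = driver.find_next_links()
    while next_links and len(sources) < max_pages:
        next_links[0].click()
        sources.append(driver.page_source)
        next_links = driver.find_next_links()
    return sources


pages = collect_pages(FakeDriver(['page 1', 'page 2', 'page 3']))
```

With a real driver, the fixed time.sleep(1) could be swapped for an explicit wait, but that is a separate design choice from the loop's stop condition.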

If this is too slow for you, I suggest switching to Puppeteer.
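
On taking out the yield: a Python function whose body contains yield returns a generator object when called, and the body does not run until something iterates over it. That would explain why 'Hi there' is never printed when scrape_it is called as a plain statement. A small sketch with stand-in functions (the names are hypothetical and no Scrapy is needed):

```python
def scrape_with_yield(page_html):
    # Stand-in for the question's scrape_it: the body contains `yield`,
    # so calling the function only builds a generator object.
    print('Hi there')
    yield {'length': len(page_html)}


def scrape_returning_list(page_html):
    # Same work without `yield`: the body runs as soon as it is called.
    return [{'length': len(page_html)}]


html = '<html><body>demo</body></html>'  # hypothetical page source

# Called as a bare statement: nothing is printed and nothing is scraped.
ignored = scrape_with_yield(html)

# Iterating over the generator is what finally runs the body.
items_from_generator = list(scrape_with_yield(html))

# The yield-free version does its work immediately.
items_from_list = scrape_returning_list(html)
```

If the items should still reach Scrapy's item pipelines, the callback itself has to yield or return them, for example with `yield from self.scrape_it(response_obj)` inside the callback.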

【Comments】:

Let me integrate this code. If it works for me, I will let you know.