How to use Selenium with Scrapy for crawling ajax pages (answers)

[Question title]: How to use Selenium with Scrapy for crawling ajax pages
[Posted]: 2018-05-09 18:11:07
[Question description]:

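I'm new to Scrapy and need to scrape a page, but I'm having trouble navigating the page I want to scrape.

Without filling in any field on the page, I click the "PESQUISAR" (translation: search) button directly, and I need to scrape all the result pages that are then shown below it.

My problem seems to be in the page's JavaScript. I have never worked with JavaScript.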

from scrapy import Spider
from selenium import webdriver
from scrapy.selector import Selector

class CarfSpider(Spider):
    name = 'carf'
    allowed_domains = ['example.com']

    def start_requests(self):
        self.driver = webdriver.Chrome('/Users/Desktop/chromedriver')
        self.driver.get('example.com')
        sel = Selector(text=self.driver.page_source)
        carf = sel.xpath('//*[@id="botaoPesquisarCarf"]')

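My main difficulty is navigating this page. If anyone can help me with this, I would be very grateful.

Sorry for my bad English; I hope you can understand.

[Question comments]:

Tags: python-3.x selenium scrapy selenium-chromedriver

[Solution 1]: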

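You have to use the driver to click the Pesquisar button, then call WebDriverWait to wait until the table element with ID tblJurisprudencia appears, which indicates that the results have loaded. After that, get the page source and parse the Acordão values out of the page.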

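# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy import Selector
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class CarfSpider(Spider):

    name = 'carf'
    start_urls = ['https://carf.fazenda.gov.br/sincon/public/pages/ConsultarJurisprudencia/consultarJurisprudenciaCarf.jsf']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Selenium 4 style. On Selenium 3, pass
        # executable_path='/home/laerte/chromedriver' to webdriver.Chrome instead.
        self.driver = webdriver.Chrome(service=Service('/home/laerte/chromedriver'))

    def closed(self, reason):
        # Scrapy calls this when the spider finishes; shut the browser down.
        self.driver.quit()

    def parse(self, response):
        # Load the page in a real browser so that its JavaScript runs.
        self.driver.get(response.url)

        # Submit the empty search form.
        self.driver.find_element(By.ID, 'botaoPesquisarCarf').click()

        # Explicit wait: up to 10 seconds for the results table to be present.
        # until() raises TimeoutException if the table never appears.
        WebDriverWait(self.driver, 10).until(
            EC.presence_of_element_located((By.ID, "tblJurisprudencia"))
        )

        # Hand the rendered HTML over to a Scrapy selector.
        response_selenium = Selector(text=self.driver.page_source)

        table = response_selenium.xpath("//table[@id='tblJurisprudencia']")

        # The leading "." keeps each query relative to the current node;
        # a bare "//" would search the whole document on every iteration.
        for row in table.xpath(".//tr"):
            body = row.xpath(".//div[@class='rich-panel-body ']")

            yield {
                "acordao": body.xpath("./a/text()").extract_first()
            }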

[Discussion]:

You should use an "explicit wait" with expected conditions instead of a static sleep.
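
The difference between the two approaches can be sketched without a browser. The helper below is a simplified, hypothetical stand-in for what an explicit wait does, not Selenium's actual implementation: it polls a condition and returns as soon as the condition holds, and it raises an error if the timeout expires. A static sleep, by contrast, always waits the full duration and still cannot tell whether the element ever appeared. The names `wait_until` and `FakePage` are invented for this illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


class FakePage:
    """Hypothetical page whose results table shows up after a few polls."""

    def __init__(self, ready_after_polls):
        self.polls = 0
        self.ready_after_polls = ready_after_polls

    def find_table(self):
        # Returns the table id once "loaded", otherwise None.
        self.polls += 1
        if self.polls >= self.ready_after_polls:
            return "tblJurisprudencia"
        return None


page = FakePage(ready_after_polls=3)
table_id = wait_until(page.find_table, timeout=2.0, poll_interval=0.01)
print(table_id, "found after", page.polls, "polls")
```

In the spider itself, `WebDriverWait(driver, 10).until(...)` plays the role of `wait_until`, with the expected condition as the callable being polled.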