【Question Title】: Scrapy-Selenium Pagination
【Posted】: 2021-10-13 23:14:30
【Question】:

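Can anyone help me? I'm practicing and can't figure out what I'm doing wrong with pagination. The spider only returns the first page, and sometimes I get an error. Even when it runs without the error, it still returns only the first page.

"The source list for the Content Security Policy directive 'frame-src' contains an invalid source: '*trackcmp.net'. It will be ignored.", source: https://naturaldaterra.com.br/hortifruti.html?page=2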

import scrapy
from scrapy_selenium import SeleniumRequest

class ComputerdealsSpider(scrapy.Spider):
    name = 'produtos'
    
    def start_requests(self):
        yield SeleniumRequest(
            url='https://naturaldaterra.com.br/hortifruti.html?page=1',
            wait_time=3,
            callback=self.parse
        )

    def parse(self, response):

        for produto in response.xpath("//div[@class='gallery-items-1IC']/div"):
            yield {
                'nome_produto': produto.xpath(".//div[@class='item-nameContainer-1kz']/span/text()").get(),
                'valor_produto': produto.xpath(".//span[@class='itemPrice-price-1R-']/text()").getall(),

            }
            
        next_page = response.xpath("//button[@class='tile-root-1uO'][1]/text()").get()
        if next_page:
            absolute_url = f"https://naturaldaterra.com.br/hortifruti.html?page={next_page}"
            yield SeleniumRequest(
                url=absolute_url,
                wait_time=3,
                callback=self.parse
            )

【Question Comments】:

    Tags: selenium web-scraping scrapy scrapy-selenium


    【Solution 1】:

    The problem is that your XPath selector returns None instead of the next page number. Consider changing it from

    next_page = response.xpath("//button[@class='tile-root-1uO'][1]/text()").get()

    to

    next_page = response.xpath("//button[@class='tile-root_active-TUl tile-root-1uO']/following-sibling::button[1]/text()").get()
    

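    For future projects, consider using scrapy-playwright to scrape JS-rendered sites. I find it faster and simpler to use. Here is an example implementation of your scraper using scrapy-playwright: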

    import scrapy
    from scrapy.crawler import CrawlerProcess
    
    
    class ComputerdealsSpider(scrapy.Spider):
        name = 'produtos'
    
        def start_requests(self):
    
            yield scrapy.Request(
                url='https://naturaldaterra.com.br/hortifruti.html?page=1',
                meta={"playwright": True}
            )
    
        def parse(self, response):
            for produto in response.xpath("//div[@class='gallery-items-1IC']/div"):
                yield {
                    'nome_produto': produto.xpath(".//div[@class='item-nameContainer-1kz']/span/text()").get(),
                    'valor_produto': produto.xpath(".//span[@class='itemPrice-price-1R-']/text()").getall(),
                }
            # Follow the next page only if a button exists after the active one;
            # on the last page the selector returns None.
            next_page = response.xpath(
                "//button[@class='tile-root_active-TUl tile-root-1uO']/following-sibling::button[1]/text()").get()
            if next_page:
                yield scrapy.Request(
                    url='https://naturaldaterra.com.br/hortifruti.html?page=' + next_page,
                    meta={"playwright": True}
                )
    
    
    if __name__ == "__main__":
        process = CrawlerProcess(settings={
            "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
            "DOWNLOAD_HANDLERS": {
                "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            }, })
        process.crawl(ComputerdealsSpider)
        process.start()
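
    On the last page the next-page selector yields no text, so it can help to validate the value before building the URL. The sketch below is one possible way to do that with the standard library; `build_page_url` is a hypothetical helper, and it assumes `page` is the only query parameter the listing URL needs.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_page_url(base_url, page_text):
    """Return base_url with its query replaced by ?page=<n>.

    Returns None when page_text is missing or is not a plain number,
    so the caller can stop paginating instead of raising an error.
    """
    if page_text is None:
        return None
    page_text = page_text.strip()
    if not page_text.isdigit():
        return None
    parts = urlsplit(base_url)
    # Assumption: 'page' is the only query parameter, so the query is replaced whole.
    return urlunsplit(parts._replace(query=urlencode({"page": page_text})))


if __name__ == "__main__":
    base = "https://naturaldaterra.com.br/hortifruti.html?page=1"
    print(build_page_url(base, "2"))
    print(build_page_url(base, None))
```

    In the spider, the result could be checked with `if url:` before yielding the next request.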
    

    【Comments】:
