【Question title】: Interpreting callbacks and cb_kwargs with scrapy
【Posted】: 2022-01-24 11:48:32
【Question】:

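I am about to reach a personal milestone with scrapy. My goal is to properly understand callback and cb_kwargs. I have read the documentation countless times, but I learn best by visualizing code, practicing, and having things explained.

I have an example scraper whose aim is to grab each book's name and price, then go into each book's page and extract a single piece of information. I am also trying to understand how to properly get information from the next few pages, which I know depends on understanding how callbacks work.

When I run my script it returns results only for the first page. How do I get the additional pages?

Here's my scraper: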

class BooksItem(scrapy.Item):
    items = Field(output_processor = TakeFirst())
    price = Field(output_processor = TakeFirst())
    availability = Field(output_processor = TakeFirst())

class BookSpider(scrapy.Spider):
    name = "books"
    start_urls = ['https://books.toscrape.com']

    def start_request(self):
        for url in self.start_url:
            yield scrapy.Request(
                url, 
                callback = self.parse)

    def parse(self, response):
        data = response.xpath('//div[@class = "col-sm-8 col-md-9"]')
        for books in data:
            loader = ItemLoader(BooksItem(), selector = books)
            loader.add_xpath('items','.//article[@class="product_pod"]/h3/a//text()')
            loader.add_xpath('price','.//p[@class="price_color"]//text()')
            
            for url in [books.xpath('.//a//@href').get()]:
                yield scrapy.Request(
                    response.urljoin(url),
                    callback = self.parse_book,
                    cb_kwargs = {'loader':loader})

        for next_page in [response.xpath('.//div/ul[@class="pager"]/li[@class="next"]/a//@href').get()]:
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)


    def parse_book(self, response, loader):
        book_quote = response.xpath('//p[@class="instock availability"]//text()').get()
        

        loader.add_value('availability', book_quote)
        yield loader.load_item()

I think the problem is in the part where I try to grab the next few pages. I have tried an alternative approach using the following:

def start_request(self):
        for url in self.start_url:
            yield scrapy.Request(
                url, 
                callback = self.parse,
                cb_kwargs = {'page_count':0}
)

def parse(self, response, next_page):
    if page_count > 3:
        return
...
...
    page_count += 1    
    for next_page in [response.xpath('.//div/ul[@class="pager"]/li[@class="next"]/a//@href').get()]:
        yield response.follow(next_page, callback=self.parse, cb_kwargs = {'page_count': page_count})

However, with this approach I get the following error:

TypeError: parse() missing 1 required positional argument: 'page_cntr'
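
One plausible reading of this traceback, offered as an assumption rather than a confirmed diagnosis: Scrapy passes each entry of cb_kwargs to the callback as a keyword argument, so a callback parameter that never receives a value raises this kind of TypeError. Because the method above is named start_request rather than start_requests, Scrapy would not call it, and the initial request would reach parse with no cb_kwargs at all. The plain-Python sketch below only imitates that call pattern; invoke_callback and the string standing in for a response are hypothetical and are not Scrapy APIs.

```python
def invoke_callback(callback, response, cb_kwargs=None):
    """Imitate the call pattern callback(response, **cb_kwargs)."""
    return callback(response, **(cb_kwargs or {}))


def parse(response, page_count):
    # page_count has no default, so it must arrive through cb_kwargs.
    return f"{response}: page_count={page_count}"


# The request carries cb_kwargs whose key matches the parameter name.
ok = invoke_callback(parse, "<response>", {"page_count": 0})

# The request carries no cb_kwargs, as a default start request would not.
try:
    invoke_callback(parse, "<response>")
    error = None
except TypeError as exc:
    error = str(exc)

print(ok)
print(error)
```

Giving the parameter a default value, for example page_count=0, would let the same callback handle requests with and without cb_kwargs.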

【Question comments】:

    Tags: python web-scraping scrapy


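    【Solution 1】:
    1. It should be start_requests, and self.start_urls (inside the function).

    2. get() returns only the first result; what you want is getall(), which returns a list.

    3. The "next_page" part doesn't need a for loop. It's not a mistake, just unnecessary.

    4. In the line for url in books.xpath you get every URL twice. Again not a mistake, but still...

    5. Here data = response.xpath('//div[@class = "col-sm-8 col-md-9"]') you don't select the books one by one, you select the whole container of books; you can check that len(data.getall()) == 1.

    6. book_quote = response.xpath('//p[@class="instock availability"]//text()').get() will return \n; look at the page source to find out why (hint: the 'i' tag).

    Compare your code to this and see what I changed: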
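
Points 2 and 6 can be illustrated without Scrapy. The sketch below uses the standard library's ElementTree as a stand-in for Scrapy selectors, and the SNIPPET markup is an assumed approximation of the availability paragraph on a book page, not a copy of it. It shows why the first text node is only whitespace, and how taking the text after the i tag, or joining all text nodes, recovers the real value.

```python
import xml.etree.ElementTree as ET

# Assumed approximation of the availability markup on a book page.
SNIPPET = """<p class="instock availability">
    <i class="icon-ok"></i>
        In stock
</p>"""

p = ET.fromstring(SNIPPET)

# All text nodes under <p>, in document order (comparable to getall()).
text_nodes = list(p.itertext())

# The first text node is just the whitespace before <i> (comparable to get()).
first_node = text_nodes[0]

# The text right after <i> (comparable to i/following-sibling::text()).
after_icon = p.find("i").tail.strip()

# Joining every text node and stripping the result.
joined = "".join(text_nodes).strip()

print(repr(first_node))
print(after_icon)
print(joined)
```

Either approach yields the availability text; joining all nodes is the more forgiving choice if the markup inside the paragraph changes.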

    import scrapy
    from scrapy import Field
    from scrapy.loader import ItemLoader
    from itemloaders.processors import TakeFirst  # scrapy.loader.processors is deprecated in newer Scrapy releases
    
    
    class BooksItem(scrapy.Item):
        items = Field(output_processor=TakeFirst())
        price = Field(output_processor=TakeFirst())
        availability = Field(output_processor=TakeFirst())
    
    
    class BookSpider(scrapy.Spider):
        name = "books"
        start_urls = ['https://books.toscrape.com']
    
        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(
                    url,
                    callback=self.parse)
    
        def parse(self, response):
            data = response.xpath('//div[@class = "col-sm-8 col-md-9"]//li')
            for books in data:
                loader = ItemLoader(BooksItem(), selector=books)
                loader.add_xpath('items', './/article[@class="product_pod"]/h3/a//text()')
                loader.add_xpath('price', './/p[@class="price_color"]//text()')
    
                for url in books.xpath('.//h3/a//@href').getall():
                    yield scrapy.Request(
                        response.urljoin(url),
                        callback=self.parse_book,
                        cb_kwargs={'loader': loader})
    
            next_page = response.xpath('.//div/ul[@class="pager"]/li[@class="next"]/a//@href').get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)
    
        def parse_book(self, response, loader):
            # option 1:
            book_quote = response.xpath('//p[@class="instock availability"]/i/following-sibling::text()').get().strip()
    
            # option 2:
            # book_quote = ''.join(response.xpath('//div[contains(@class, "product_main")]//p[@class="instock availability"]//text()').getall()).strip()
            loader.add_value('availability', book_quote)
            yield loader.load_item()
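
The page-limit idea from the question can be sketched in the same spirit. The following is a plain-Python simulation, not Scrapy code: NEXT_LINKS, MAX_PAGES, and crawl are hypothetical names, and a tuple of (url, cb_kwargs) stands in for a request. It shows a counter travelling from one callback invocation to the next, the way cb_kwargs={'page_count': ...} would carry it through response.follow.

```python
from collections import deque

# Hypothetical site map: each page URL maps to its "next" link (or None).
NEXT_LINKS = {"page-1": "page-2", "page-2": "page-3", "page-3": "page-4",
              "page-4": "page-5", "page-5": None}
MAX_PAGES = 3


def parse(url, page_count=1):
    """Yield the item for this page, then a follow-up request if allowed."""
    yield {"page": url, "page_count": page_count}
    next_page = NEXT_LINKS[url]
    if next_page is not None and page_count < MAX_PAGES:
        # The counter travels with the request, as cb_kwargs would carry it.
        yield (next_page, {"page_count": page_count + 1})


def crawl(start_url):
    """Tiny stand-in for the engine: run callbacks, follow yielded requests."""
    items = []
    queue = deque([(start_url, {})])
    while queue:
        url, cb_kwargs = queue.popleft()
        for result in parse(url, **cb_kwargs):
            if isinstance(result, tuple):
                queue.append(result)  # a follow-up "request"
            else:
                items.append(result)  # a scraped "item"
    return items


visited = [item["page"] for item in crawl("page-1")]
print(visited)
```

The default page_count=1 matters here: the first call arrives with empty cb_kwargs, so the callback still works when the initial request supplies no counter.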
    

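    【Comments】:

    • Thank you for sharing this with me! I did not know you could step into a container with a double slash. I have not used following-sibling::text() before, so I will have to study it. I am pleased that I made only a few small mistakes, and I now feel confident enough about using cb_kwargs and callbacks and about iterating over pages.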