Next page issues with scrapy python (json)

【Question title】: Next page issues with scrapy python (json)
【Posted】: 2022-01-21 12:41:27
【Question】:

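I am trying to feed postcodes from a list into my spider, but it does not work well (inside the class). start_urls takes sa1, sa2, sa3 as expected, but inside the def only 'sa3' (the last one) is passed, so next_page is built only for 'sa3'.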

Here is my code:

class OnthemarketSpider(scrapy.Spider):
    name = 'onthemarket'
    allowed_domains = ['onthemarket.com']

    postcodes = ('sa1'), ('sa2'), ('sa3')
    for postcode in postcodes:


        start_urls = [f'https://www.onthemarket.com/async/search/properties/?search-type=for-sale&location-id={postcode}&sort-field=keywords&under-offer=true&view=grid']

        def parse(self, response):
            data = json.loads(response.body)
            properties = data.get('properties')
            for property in properties:
                yield {
                    'id': property.get('id'),
                    'price': property.get('price'),
                    'title': property.get('property-title'),
                    'url': response.urljoin(property.get('property-link'))
                }

            pages = int(100 / 23)
            postcode = self.postcode

            for number in range(1, pages +1):
                next_page = f"https://www.onthemarket.com/async/search/properties/?search-type=for-sale&location-id={postcode}&page={number}&sort-field=keywords&under-offer=true&view=grid"
                yield scrapy.Request(next_page, callback=self.parse)

This is the result I would like to achieve, if possible.

This is start URL:  ['https://www.domainname-id=sa1&view=grid']
This is next page:  https://www.domainname-id=sa1&page=1&view=grid
This is next page:  https://www.domainname-id=sa1&page=2&view=grid
This is next page:  https://www.domainname-id=sa1&page=3&view=grid
This is start URL:  ['https://www.domainname-id=sa2&view=grid']
This is next page:  https://www.domainname-id=sa2&page=1&view=grid
This is next page:  https://www.domainname-id=sa2&page=2&view=grid
This is next page:  https://www.domainname-id=sa2&page=3&view=grid
This is start URL:  ['https://www.domainname-id=sa3&view=grid']
This is next page:  https://www.domainname-id=sa3&page=1&view=grid
This is next page:  https://www.domainname-id=sa3&page=2&view=grid
This is next page:  https://www.domainname-id=sa3&page=3&view=grid

Thank you for your time.

【Question discussion】:

Tags: python api scrapy

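【Solution 1】:

You create the start_urls list and overwrite it on every pass through the loop, so only the last URL survives. Instead, create the list once and append to it: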

start_urls = []

for postcode in postcodes:
    start_urls.append(f'https://www.onthemarket.com/async/search/properties/?search-type=for-sale&location-id={postcode}&sort-field=keywords&under-offer=true&view=grid')
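
The difference can be checked in plain Python without Scrapy. This is a minimal sketch; the shortened URL template `BASE` is a placeholder assumed for illustration, not the full search URL from the question:

```python
# Placeholder URL template, assumed for illustration only.
BASE = "https://www.onthemarket.com/async/search/properties/?location-id={}"
postcodes = ("sa1", "sa2", "sa3")

# Rebinding the name on every pass keeps only the last URL.
for postcode in postcodes:
    overwritten = [BASE.format(postcode)]

# Appending to one list keeps a URL for every postcode.
appended = []
for postcode in postcodes:
    appended.append(BASE.format(postcode))

print(len(overwritten), len(appended))
```

The first list ends up holding a single 'sa3' URL, while the second holds one URL per postcode.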

EDIT:

Full code:

import scrapy
import json


class OnthemarketSpider(scrapy.Spider):
    name = 'onthemarket'
    allowed_domains = ['onthemarket.com']

    postcodes = ('sa1'), ('sa2'), ('sa3')
    start_urls = []

    for postcode in postcodes:
        start_urls.append(f'https://www.onthemarket.com/async/search/properties/?search-type=for-sale&location-id={postcode}&sort-field=keywords&under-offer=true&view=grid')

    def parse(self, response):
        data = json.loads(response.body)
        properties = data.get('properties')
        for property in properties:
            yield {
                'id': property.get('id'),
                'price': property.get('price'),
                'title': property.get('property-title'),
                'url': response.urljoin(property.get('property-link'))
            }

        pages = int(100 / 23)
        postcode = self.postcode

        for number in range(1, pages + 1):
            next_page = f"https://www.onthemarket.com/async/search/properties/?search-type=for-sale&location-id={postcode}&page={number}&sort-field=keywords&under-offer=true&view=grid"
            yield scrapy.Request(next_page, callback=self.parse)

EDIT 2:

import scrapy
import json


class OnthemarketSpider(scrapy.Spider):
    name = 'onthemarket'
    allowed_domains = ['onthemarket.com']

    postcodes = ('sa1'), ('sa2'), ('sa3')
    start_urls = []

    for postcode in postcodes:
        start_urls.append(f'https://www.onthemarket.com/async/search/properties/?search-type=for-sale&location-id={postcode}&sort-field=keywords&under-offer=true&view=grid')

    def parse(self, response):
        data = json.loads(response.body)
        properties = data.get('properties')
        for property in properties:
            yield {
                'id': property.get('id'),
                'price': property.get('price'),
                'title': property.get('property-title'),
                'url': response.urljoin(property.get('property-link'))
            }

        # pages = int(100 / 23)
        pages = 4   # int(100/23) = 4
        postcode = self.postcode    # always 'sa3'

        for number in range(1, pages + 1):
            next_page = f'{response.url}&page={number}'
            yield scrapy.Request(next_page, callback=self.parse)
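
Why `self.postcode` is always 'sa3': a name bound in a class body, including a `for` loop variable, becomes a class attribute, and the loop variable keeps the last value it was given. The sketch below uses a hypothetical `Demo` class and an `example.com` URL in place of the real spider:

```python
class Demo:
    postcodes = ("sa1", "sa2", "sa3")
    start_urls = []

    # The loop variable lives in the class namespace, so after the loop
    # it survives as the class attribute Demo.postcode.
    for postcode in postcodes:
        start_urls.append(f"https://example.com/?location-id={postcode}")


demo = Demo()
print(demo.postcode)         # last value of the loop variable
print(len(demo.start_urls))  # one URL per postcode
```

Every instance therefore sees the same `postcode`, regardless of which start URL produced the response being parsed, which is why the page URLs are better derived from `response.url`.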

EDIT 3:

import scrapy
import json
import re


class OnthemarketSpider(scrapy.Spider):
    name = 'onthemarket'
    allowed_domains = ['onthemarket.com']

    postcodes = ('sa1'), ('sa2'), ('sa3')
    start_urls = []

    for postcode in postcodes:
        start_urls.append(f'https://www.onthemarket.com/async/search/properties/?search-type=for-sale&location-id={postcode}&sort-field=keywords&under-offer=true&view=grid&page=1')

    def parse(self, response):
        data = json.loads(response.body)
        properties = data.get('properties')
        for property in properties:
            yield {
                'id': property.get('id'),
                'price': property.get('price'),
                'title': property.get('property-title'),
                'url': response.urljoin(property.get('property-link'))
            }

        pages = 4   # int(100/23) = 4

        for number in range(1, pages + 1):
            next_page = re.sub(r'page=\d+', f'page={number}', response.url)
            yield scrapy.Request(next_page, callback=self.parse)
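
As an alternative to the regular expression, the page number can be set by parsing the query string, which also works when the URL has no `page=` parameter yet. This is a sketch using only the standard library; `with_page` is a hypothetical helper name, not part of Scrapy, and the shortened URL is assumed for illustration:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url: str, number: int) -> str:
    """Return url with exactly one page query parameter, set to number."""
    parts = urlsplit(url)
    # Drop any existing page parameter, then add a single new one.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "page"
    ]
    query.append(("page", str(number)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Shortened URL, assumed for illustration only.
url = "https://www.onthemarket.com/async/search/properties/?search-type=for-sale&location-id=sa1&view=grid"
print(with_page(url, 2))
print(with_page(with_page(url, 2), 3))
```

Inside `parse`, the call `with_page(response.url, number)` could stand in for the `re.sub(...)` line; repeated calls never leave more than one `page` parameter in the URL.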

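【Discussion】:

Thanks for the reply. This is what I did: allowed_domains = ['onthemarket.com'] start_urls = [] postcodes = ('sa1'), ('sa2') for postcode in postcodes: start_urls.append(f'onthemarket.com/async/search/properties/…{postcode}&sort-field=keywords&under-offer=true&view=grid') Unfortunately, it still only takes the last one ('sa2' in this case). I wonder whether I am doing something wrong.
It may be, because it works for me. See the edit.
You are right, it works for all 3 postcodes, but not completely. It takes only the first page from 'sa1' and 'sa2' but all pages from 'sa3', so for the first 2 postcodes it never reaches next_page. Any idea why? Thanks again
@christian_bear That is because self.postcode is always 'sa3'; see the edit. If this answer helped, please accept it.
Thank you very much for your time and knowledge. My error is "Ignoring response onthemarket.com/async/search/properties/…>: HTTP status code is not handled or not allowed". I assume it is because it requests 2 pages at the same time - &page=4&page=3>. Any ideas? Thanks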
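
The doubled parameter mentioned in the last comment is consistent with building each URL as `f'{response.url}&page={number}'`: every response that is parsed appends another `page` parameter to a URL that already has one. The sketch below simulates two generations of callbacks in plain Python, with a shortened URL assumed for illustration. Whether the doubled parameter is what triggers the HTTP error is not confirmed in this thread.

```python
# Shortened start URL, assumed for illustration only.
start = "https://www.onthemarket.com/async/search/properties/?location-id=sa1&view=grid"

# First callback: one request per page, built from the start URL.
first_generation = [f"{start}&page={n}" for n in range(1, 5)]

# Second callback: each of those responses appends a page parameter again.
second_generation = [
    f"{url}&page={n}" for url in first_generation for n in range(1, 5)
]

print(second_generation[-2])
```

The printed URL ends in `&page=4&page=3`, the same pattern as in the error message, which is why replacing the existing `page` value instead of appending a new one avoids it.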