【Question title】: Need help scraping contents of this page with Scrapy
【Posted】: 2021-09-17 06:19:51
【Question description】:

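Can anyone tell me how to scrape the data (name and number) from this page with Scrapy? The data is loaded dynamically. If you check the Network tab, you will see a POST request to https://www.icab.es/rest/icab-api/collegiates. I copied it as cURL and sent the request through Postman, but I get an error. Can someone help? URL: https://www.icab.es/es/servicios-a-la-ciudadania/necesito-un-abogado/buscador-de-profesionales/?extraSearch=false&probono=false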

【Question discussion】:

    Tags: web-scraping scrapy web-crawler


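    【Solution 1】:

    This is a very good question! Next time, though, please include your code and format the post a bit better. See How to Ask.

    Solution:

    You need to recreate the request. I inspected the requests with Burp Suite.

    I captured the headers for the URL in 'start_urls', and the headers and body for json_url.

    If you request json_url directly from start_requests, you get a 401 error, so we first visit the 'start_urls' URL and only then request json_url.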
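
The same two-step flow can be sketched outside Scrapy with only the standard library. This is a minimal sketch, not the answer's method. It assumes the 401 comes from cookies that the search page sets and the API expects, which the answer does not confirm; Scrapy's cookies middleware, enabled by default, carries cookies between the two requests in the spider. The helper names (`make_opener`, `build_api_request`, `fetch_members`) are hypothetical, and the trimmed header set is a guess; the full code sends the complete captured headers.

```python
import json
import urllib.request
from http.cookiejar import CookieJar

PAGE_URL = "https://www.icab.es/es/servicios-a-la-ciudadania/necesito-un-abogado/buscador-de-profesionales"
JSON_URL = "https://www.icab.es/rest/icab-api/collegiates"


def make_opener():
    """Build an opener whose cookie jar is shared by every request it sends."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_api_request(filters):
    """Build (but do not send) the POST request for JSON_URL."""
    data = json.dumps({"filters": filters}).encode("utf-8")
    return urllib.request.Request(
        JSON_URL,
        data=data,
        method="POST",
        headers={
            # Assumption: a reduced header set; the site may require more.
            "Accept": "application/json",
            "Content-Type": "application/json",
            "Origin": "https://www.icab.es",
            "Referer": PAGE_URL,
        },
    )


def fetch_members(filters):
    """Visit the page first, then call the API with the same cookie jar."""
    opener, _ = make_opener()
    # Step 1: load the search page (assumed to set the cookies the API checks).
    opener.open(PAGE_URL).close()
    # Step 2: send the API request through the same opener.
    with opener.open(build_api_request(filters)) as resp:
        return json.load(resp)["members"]
```

Only `fetch_members` touches the network; `build_api_request` can be inspected offline to compare its method, headers, and body against the request captured in the browser.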

    Full code:

    import scrapy
    
    
    class Temp(scrapy.Spider):
        name = "tempspider"
    
        allowed_domains = ['icab.es']
        start_urls = ['https://www.icab.es/es/servicios-a-la-ciudadania/necesito-un-abogado/buscador-de-profesionales']
        json_url = 'https://www.icab.es/rest/icab-api/collegiates'
    
        # Load the search page first; requesting json_url directly returns 401.
        def start_requests(self):
            headers = {
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
                "Origin": "https://www.icab.es",
                "Accept-Encoding": "gzip, deflate, br",
                "Accept-Language": "en-US,en;q=0.5",
                "Cache-Control": "no-cache",
                "Connection": "keep-alive",
                "DNT": "1",
                "Host": "www.icab.es",
                "Pragma": "no-cache",
                "Sec-Fetch-Dest": "document",
                "Sec-Fetch-Mode": "navigate",
                "Sec-Fetch-Site": "none",
                "Sec-Fetch-User": "?1",
                "Sec-GPC": "1",
                "Upgrade-Insecure-Requests": "1",
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
            }
    
            yield scrapy.Request(url=self.start_urls[0], headers=headers, callback=self.parse)
    
        # After the page has loaded, POST the search filters to json_url.
        def parse(self, response):
            headers = {
                "Cache-Control": "no-cache",
                "Connection": "keep-alive",
                "DNT": "1",
                "Pragma": "no-cache",
                "Sec-GPC": "1",
                'Accept': 'application/json',
                'Accept-Encoding': 'gzip, deflate',
                'Accept-Language': 'en-US,en;q=0.9',
                'Content-Type': 'application/json',
                'Host': 'www.icab.es',
                'Sec-Ch-Ua': '"Chromium";v="91", " Not;A Brand";v="99"',
                'Sec-Ch-Ua-Mobile': '?0',
                'Origin': 'https://www.icab.es',
                'Referer': 'https://www.icab.es/es/servicios-a-la-ciudadania/necesito-un-abogado/buscador-de-profesionales',
                'Sec-Fetch-Site': 'same-origin',
                'Sec-Fetch-Mode': 'cors',
                'Sec-Fetch-Dest': 'empty',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
                "X-KL-Ajax-Request": "Ajax_Request",
            }
            body = '{"filters":{"keyword":"","name":"","surname":"","street":"","postalCode":"","collegiateNumber":"","dedication":"","language":"","paginationFirst":"1","paginationLast":"25","paginationOrder":"surname","paginationOrderAscDesc":"ASC"}}'
    
            yield scrapy.Request(url=self.json_url, headers=headers, body=body, method='POST', callback=self.parse_json)
    
        # Yield one item per entry in the 'members' list of the JSON response.
        def parse_json(self, response):
            json_response = response.json()
            members = json_response['members']
    
            for member in members:
                yield {
                    'randomPosition': member['randomPosition'],
                    'collegiateNumber': member['collegiateNumber'],
                    'surname': member['surname'],
                    'name': member['name'],
                    'gender': member['gender'],
                }
    

    Output:

    {'randomPosition': '27661107', 'collegiateNumber': '35080', 'surname': 'Abad Bamala', 'name': 'Ana', 'gender': 'M'}
    {'randomPosition': '98668217', 'collegiateNumber': '14890', 'surname': 'Abad Calvo', 'name': 'Encarnacion', 'gender': 'M'}
    {'randomPosition': '53180188', 'collegiateNumber': '29746', 'surname': 'Abad de Brocá', 'name': 'Laura', 'gender': 'M'}
    {'randomPosition': '41073111', 'collegiateNumber': '31865', 'surname': 'Abad Esteve', 'name': 'Joan Domènec', 'gender': 'H'}
    {'randomPosition': '63371735', 'collegiateNumber': '29647', 'surname': 'Abad Fernández', 'name': 'Dolors', 'gender': 'M'}
    {'randomPosition': '30290704', 'collegiateNumber': '45016', 'surname': 'Abad Hernández', 'name': 'Laura', 'gender': 'M'}
    {'randomPosition': '57510617', 'collegiateNumber': '16083', 'surname': 'Abad Mariné', 'name': 'Jose Antonio', 'gender': 'H'}
    ................
    ................
    ................
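
The hard-coded body string in `parse` can also be built from a dict, which makes the pagination fields easier to change. The sketch below is an illustration under stated assumptions: `build_body` and `page_windows` are hypothetical helpers, and it assumes `paginationFirst` and `paginationLast` are 1-based inclusive bounds, which the captured request suggests (`"1"` to `"25"`) but the answer does not state.

```python
import json


def build_body(first=1, last=25, order="surname", direction="ASC"):
    """Return the JSON request body for one window of results."""
    filters = {
        "keyword": "",
        "name": "",
        "surname": "",
        "street": "",
        "postalCode": "",
        "collegiateNumber": "",
        "dedication": "",
        "language": "",
        # The captured request sends the pagination bounds as strings.
        "paginationFirst": str(first),
        "paginationLast": str(last),
        "paginationOrder": order,
        "paginationOrderAscDesc": direction,
    }
    # Compact separators reproduce the captured body byte for byte.
    return json.dumps({"filters": filters}, separators=(",", ":"))


def page_windows(total, page_size=25):
    """Yield (first, last) pairs covering 1..total (assumed 1-based, inclusive)."""
    for first in range(1, total + 1, page_size):
        yield first, min(first + page_size - 1, total)
```

With the defaults, `build_body()` returns the same string as the literal in the spider, so `body=build_body(first, last)` could replace it in `parse`. How to learn the total number of members is not shown in the answer, so the `total` argument to `page_windows` has to come from elsewhere.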
    

    【Discussion】:

    • Thank you very much...