【Question title】: Scraping a JavaScript-enabled site through its API fails with a 404 error
【Posted】: 2021-03-09 01:49:41
【Question】:
# -*- coding: utf-8 -*-
import scrapy
import json

class NtsschoolSpider(scrapy.Spider):
    name = 'ntsschool'
    start_urls = ['https://directory.ntschools.net/#/schools']
    headers = {        
            "Accept": "application/json",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "en-US,en;q=0.9,ur;q=0.8",
            "Referer": "https://directory.ntschools.net/",         
            "Sec-Fetch-Mode": "cors",
            "Sec-Fetch-Site": "same-origin",
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36",
            "X-Requested-With": "Fetch",
               }
            
    def parse(self, response):
        url = 'https://directory.ntschools.net/api/System/GetAllSchools'
        yield  scrapy.Request(url, 
                              callback = self.parse_api,
                              headers = self.headers)

    def  parse_api(self, response):
         base_url = 'https://directory.ntschools.net/api/System/GetSchool?itSchoolCode'
         raw_data = response.body
         data = json.loads(raw_data)                  
         for school in data:
             school_code = school['itSchoolCode']
             school_url = base_url + school_code
             request = scrapy.Request(school_url,
                                      callback = self.parse_url, 
                                      headers = self.headers  )

             yield request

    def  parse_url(self, response):
         raw_data = response.body
         data = json.loads(raw_data) 
         yield {
                'Name' : data['name'],
                'Physical_address': data['physicalAddress']['displayAddress'],
                'Postal_address': data['postalAddress']['displayAddress'],
                'Email': data['mail'],
                'Phone': data['telephoneNumber'],
         }

The error is:

2020-11-26 12:18:42 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://directory.ntschools.net/api/System/GetSchool?itSchoolCodelarapsch>: HTTP status code is not handled or not allowed
2020-11-26 12:18:42 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://directory.ntschools.net/api/System/GetSchool?itSchoolCodelarrasch>: HTTP status code is not handled or not allowed
2020-11-26 12:18:42 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://directory.ntschools.net/api/System/GetSchool?itSchoolCodekathesch>: HTTP status code is not handled or not allowed

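【Comments】:

  • Look at the URLs that return 404. Do you see anything wrong with them? In particular the query strings: ?itSchoolCodelarapsch, ?itSchoolCodelarrasch, ?itSchoolCodekathesch.
  • Thanks for pointing that out. I see the error now.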
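
The hint in the first comment can be checked mechanically. As a small sketch using only the standard library (the helper name `query_params` and the variable names are illustrative, and the `intended` URL is an assumption about the form the endpoint expects), parsing the query string of one failing URL from the log shows that the school code has been fused into the parameter name:

```python
from urllib.parse import urlsplit, parse_qs

# URL copied from the 404 log line in the question.
failing = 'https://directory.ntschools.net/api/System/GetSchool?itSchoolCodelarapsch'
# Assumed intended form: parameter name and value separated by "=".
intended = 'https://directory.ntschools.net/api/System/GetSchool?itSchoolCode=larapsch'


def query_params(url):
    """Return the query string of `url` as a dict of name -> list of values."""
    # keep_blank_values=True keeps parameters that have no "=value" part.
    return parse_qs(urlsplit(url).query, keep_blank_values=True)


failing_params = query_params(failing)
intended_params = query_params(intended)
print(failing_params)   # the whole string is treated as one parameter name
print(intended_params)  # name and value are separated
```

In the failing URL there is no `itSchoolCode` parameter at all, which is consistent with the server answering 404.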

Tags: python web-scraping scrapy http-status-code-404


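【Solution 1】:

It is a simple typo: you forgot the "=" at the end of base_url, so the school code is appended directly to the parameter name. Add it and the requests will work: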

base_url = 'https://directory.ntschools.net/api/System/GetSchool?itSchoolCode='
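
One way to make this kind of typo harder to repeat is to let the standard library assemble the query string instead of concatenating by hand. The following is a minimal sketch, not part of the original answer: the helper name `build_school_url` is hypothetical, while the endpoint and the `itSchoolCode` parameter come from the question.

```python
from urllib.parse import urlencode

# Endpoint taken from the question, with no query string attached yet.
BASE_URL = 'https://directory.ntschools.net/api/System/GetSchool'


def build_school_url(school_code):
    """Hypothetical helper: return the detail URL for one school code."""
    # urlencode inserts the "=" and percent-encodes the value if needed.
    return BASE_URL + '?' + urlencode({'itSchoolCode': school_code})


print(build_school_url('larapsch'))
```

In the spider, `school_url = build_school_url(school['itSchoolCode'])` could then replace the string concatenation in `parse_api`.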

【Comments】:

  • Thank you, I added the "=" and it started working.