【Posted】: 2021-03-09 01:49:41
【Problem description】:
# -*- coding: utf-8 -*-
import scrapy
import json


class NtsschoolSpider(scrapy.Spider):
    name = 'ntsschool'
    start_urls = ['https://directory.ntschools.net/#/schools']
    headers = {
        "Accept": "application/json",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.9,ur;q=0.8",
        "Referer": "https://directory.ntschools.net/",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Site": "same-origin",
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36",
        "X-Requested-With": "Fetch",
    }

    def parse(self, response):
        url = 'https://directory.ntschools.net/api/System/GetAllSchools'
        yield scrapy.Request(url,
                             callback=self.parse_api,
                             headers=self.headers)

    def parse_api(self, response):
        base_url = 'https://directory.ntschools.net/api/System/GetSchool?itSchoolCode'
        raw_data = response.body
        data = json.loads(raw_data)
        for school in data:
            school_code = school['itSchoolCode']
            school_url = base_url + school_code
            request = scrapy.Request(school_url,
                                     callback=self.parse_url,
                                     headers=self.headers)
            yield request

    def parse_url(self, response):
        raw_data = response.body
        data = json.loads(raw_data)
        yield {
            'Name': data['name'],
            'Physical_address': data['physicalAddress']['displayAddress'],
            'Postal_address': data['postalAddress']['displayAddress'],
            'Email': data['mail'],
            'Phone': data['telephoneNumber']
        }
The error is:
2020-11-26 12:18:42 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://directory.ntschools.net/api/System/GetSchool?itSchoolCodelarapsch>: HTTP status code is not handled or not allowed
2020-11-26 12:18:42 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://directory.ntschools.net/api/System/GetSchool?itSchoolCodelarrasch>: HTTP status code is not handled or not allowed
2020-11-26 12:18:42 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://directory.ntschools.net/api/System/GetSchool?itSchoolCodekathesch>: HTTP status code is not handled or not allowed
【Discussion】:
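- Look at the URLs that return 404. Do you see anything wrong with them? In particular, look at the query strings: ?itSchoolCodelarapsch, ?itSchoolCodelarrasch, and ?itSchoolCodekathesch.
- Thanks for pointing that out; I see the error now.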
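To spell out what the comment hints at: the 404 URLs have the parameter name and the school code run together (?itSchoolCodelarapsch), because base_url in parse_api ends in itSchoolCode with no "=" before the value. A minimal sketch of one way to build the URL so the separator cannot be forgotten is below. The helper name build_school_url and the constant BASE_URL are hypothetical and not part of the original spider; the endpoint is the one from the question.

```python
from urllib.parse import urlencode

# Endpoint taken from the question. BASE_URL and build_school_url are
# hypothetical names used only for this illustration.
BASE_URL = "https://directory.ntschools.net/api/System/GetSchool"


def build_school_url(school_code):
    """Return the detail URL with itSchoolCode passed as a key=value pair."""
    # urlencode inserts the "=" and percent-encodes the value if needed.
    return BASE_URL + "?" + urlencode({"itSchoolCode": school_code})


print(build_school_url("larapsch"))
# https://directory.ntschools.net/api/System/GetSchool?itSchoolCode=larapsch
```

In the spider, the loop in parse_api would then use school_url = build_school_url(school['itSchoolCode']). The smaller change, ending base_url with 'itSchoolCode=' and keeping the string concatenation, should work equally well for these codes.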
Tags: python web-scraping scrapy http-status-code-404