【Question title】:Error 403: HTTP status code is not handled or not allowed in scrapy
【Posted】:2017-08-18 13:44:04
【Question】:

This is the code I wrote to scrape the justdial website.

import scrapy
from scrapy.http.request import Request

class JustdialSpider(scrapy.Spider):
    name = 'justdial'
    # handle_httpstatus_list = [400]
    # headers={'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
    # handle_httpstatus_list = [403, 404]
    allowed_domains = ['justdial.com']
    start_urls = ['https://www.justdial.com/Delhi-NCR/Chemists/page-1']
    # def  start_requests(self):
    #     # hdef start_requests(self):
    #     headers= {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
    #     for url in self.start_urls:
    #         self.log("I just visited :---------------------------------- "+url)
    #         yield Request(url, headers=headers)
    def parse(self, response):
        self.log("I just visited the site:---------------------------------------------- " + response.url)
        # Collect every link target on the page and log the list.
        urls = response.xpath('//a/@href').extract()
        self.log("Urls-------: " + str(urls))

This is the error shown in the terminal:

2017-08-18 18:32:25 [scrapy.core.engine] INFO: Spider opened
2017-08-18 18:32:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-18 18:32:25 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in D:\scrapy\justdial\.scrapy\httpcache
2017-08-18 18:32:25 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-18 18:32:25 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.justdial.com/robots.txt> (referer: None) ['cached']
2017-08-18 18:32:25 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.justdial.com/Delhi-NCR/Chemists/page-1> (referer: None) ['cached']
2017-08-18 18:32:25 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.justdial.com/Delhi-NCR/Chemists/page-1>: HTTP status code is not handled or not allowed

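I have seen similar questions on Stack Overflow and tried the approaches suggested there. The attempts are still visible as comments in the code:

  • Changed the user agent

  • Set handle_httpstatus_list = [400]

Note: this site (https://www.justdial.com/Delhi-NCR/Chemists/page-1) is not blocked on my system; it opens normally in Chrome and Firefox. I get the same error with the site (https://www.practo.com/bangalore#doctor-search).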

【Comments】:

    Tags: python http scrapy


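    【Solution 1】:

    It starts working once you set the user agent through the spider's user_agent attribute. Setting the request header alone is probably not enough, because it may get overridden by the default user agent string. So set the spider attribute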

    user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"
    

    (the same way you set start_urls) and try again.

    【Comments】:

      【Solution 2】:

      As Tomáš Linhart said, we have to add a USER_AGENT setting to settings.py, for example:

      • USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1'

      【Comments】:

        【Solution 3】:

        Your investigation suggests the problem lies with the HTTP client (scrapy) rather than with the network (firewall, IP ban).
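
One way to check that conclusion outside scrapy is to request the same URL with different User-Agent strings and compare the status codes. The sketch below uses only the Python standard library. build_request and fetch_status are hypothetical helper names, the browser string is the Firefox one from the question, and the outcome depends on how the site responds when you run it.

```python
import urllib.error
import urllib.request

# Firefox user agent string taken from the commented-out code in the question.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0"


def build_request(url, user_agent):
    """Build a GET request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(url, user_agent, timeout=10):
    """Return the HTTP status code the server sends for this User-Agent."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx responses; the code is still what we want.
        return err.code


# Example (performs real network requests, so it is left commented out):
# url = "https://www.justdial.com/Delhi-NCR/Chemists/page-1"
# print(fetch_status(url, "Scrapy/1.4 (+https://scrapy.org)"))  # assumed scrapy-style string
# print(fetch_status(url, BROWSER_UA))
```

If the two calls return different codes, the server is most likely filtering on the User-Agent header rather than on your IP address.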

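        Read the scrapy documentation to turn on debug logging. You want to see the contents of the HTTP requests scrapy sends. They may contain a cookie that the website set while the user agent was still scrapy's.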

        https://doc.scrapy.org/en/latest/topics/debug.html

        https://doc.scrapy.org/en/latest/faq.html?highlight=cookies#how-can-i-see-the-cookies-being-sent-and-received-from-scrapy
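
A sketch of the settings those two links lead to, assuming a standard project settings.py. LOG_LEVEL and COOKIES_DEBUG are documented Scrapy settings. HTTPCACHE_ENABLED is included here as an assumption: the log in the question marks both 403 responses as ['cached'], so a stored response may be replayed no matter which headers are sent.

```python
# settings.py (fragment)

# Show DEBUG-level messages, including each crawled request and its status.
LOG_LEVEL = 'DEBUG'

# Log the Cookie headers sent in requests and the Set-Cookie headers received.
COOKIES_DEBUG = True

# Assumption: switch the HTTP cache off while debugging so that every
# request really reaches the server instead of being served from disk.
HTTPCACHE_ENABLED = False
```

With the cache off, a change to the user agent should show up in the next crawl rather than being masked by an earlier stored 403.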

        【Comments】:
