【Question title】: Scrapy error: HTTP status code is not handled or not allowed
【Posted】: 2018-08-27 07:47:13
【Question】:

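I ran into a problem when running my spider. When it crawls, the log shows the message "HTTP status code is not handled or not allowed" for every followed link: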

2018-08-27 14:30:58 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://id.priceprice.com/harga-hp/%3Ca%20href=%22/harga-hp/?maker=OPPO-182%22%3EOPPO%3C/a%3E> (referer: http://id.priceprice.com/harga-hp/)
2018-08-27 14:30:58 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 http://id.priceprice.com/harga-hp/%3Ca%20href=%22/harga-hp/?maker=OPPO-182%22%3EOPPO%3C/a%3E>: HTTP status code is not handled or not allowed
2018-08-27 14:30:58 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://id.priceprice.com/harga-hp/%3Ca%20href=%22/harga-hp/?maker=Smartfren-178%22%3ESmartfren%3C/a%3E> (referer: http://id.priceprice.com/harga-hp/)
2018-08-27 14:30:58 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://id.priceprice.com/harga-hp/%3Ca%20href=%22/harga-hp/?maker=Evercoss-184%22%3EEvercoss%20Evercoss%3C/a%3E> (referer: http://id.priceprice.com/harga-hp/)
2018-08-27 14:30:58 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://id.priceprice.com/harga-hp/%3Ca%20href=%22/harga-hp/?maker=Nokia-109%22%3ENokia%3C/a%3E> (referer: http://id.priceprice.com/harga-hp/)
2018-08-27 14:30:58 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://id.priceprice.com/harga-hp/%3Ca%20href=%22/harga-hp/?maker=HUAWEI-69%22%3EHUAWEI%3C/a%3E> (referer: http://id.priceprice.com/harga-hp/)
2018-08-27 14:30:58 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 http://id.priceprice.com/harga-hp/%3Ca%20href=%22/harga-hp/?maker=Smartfren-178%22%3ESmartfren%3C/a%3E>: HTTP status code is not handled or not allowed
2018-08-27 14:30:58 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 http://id.priceprice.com/harga-hp/%3Ca%20href=%22/harga-hp/?maker=Evercoss-184%22%3EEvercoss%20Evercoss%3C/a%3E>: HTTP status code is not handled or not allowed

Following advice from another post, I edited settings.py and added:

user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"

But it still does not work.

Here is my code:

import scrapy
from handset.items import HandsetItem
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider


class HandsetpriceSpider(scrapy.Spider):
    name = 'price'
    allowed_domains = ['id.priceprice.com']
    start_urls = ['http://id.priceprice.com/harga-hp/']

    def parse(self, response):

        rules = (
                Rule(LinkExtractor(allow='div.listCont:nth-child(2) > ul:nth-child(1)'), callback='parse_details'),
                Rule(LinkExtractor(restrict_css='ul > li > a[href*="maker"]'), follow =True)                
               )
        for url in  response.xpath('//ul[1]//li/a[contains(@href, "maker")]').extract() :
            url = response.urljoin(url)
            yield scrapy.Request(url, callback = self.parse_details)

        next_page_url = response.css('li.last > a::attr(href)').extract_first()
        if next_page_url:
           next_page_url = response.urljoin(next_page_url)
           yield scrapy.Request(url=next_page_url, callback=self.parse)

    def parse_details(self, response):
        yield {
       'Name' : response.css('div.itmName h3:nth-child(1) > a:nth-child(1) ::text').extract_first(),
       'Price' : response.css('div.itmPrice > a.price ::text').extract_first(),
        }

【Comments on the question】:

  • If you are getting a 404 error, the URL you are trying to crawl does not exist.

Tags: python scrapy


【Solution 1】:

Your selector returns far more than the URL:

scrapy shell http://id.priceprice.com/harga-hp/

In [3]: response.xpath('//ul[1]//li/a[contains(@href, "maker")]').extract()
Out[3]: 
['<a href="/harga-hp/?maker=OPPO-182">OPPO</a>',
 '<a href="/harga-hp/?maker=Vivo-466">Vivo</a>',
 '<a href="/harga-hp/?maker=Vivo-466">Vivo</a>',
....

So each extracted "link" is the whole a element, including the href attribute and the link text. Select only the href value:

In [4]: response.xpath('//ul[1]//li/a[contains(@href, "maker")]').css('a::attr(href)').extract()
Out[4]: 
['/harga-hp/?maker=OPPO-182',
 '/harga-hp/?maker=Vivo-466',
 '/harga-hp/?maker=Vivo-466',
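
The difference between the two selector results explains the 404s in the question's log. Below is a minimal sketch of that, assuming the standard library's `urljoin` behaves like `response.urljoin` for this input and using `quote` only to approximate the percent-encoding Scrapy applies before sending the request. The variable names are illustrative and not part of the original code.

```python
from urllib.parse import quote, urljoin

base = 'http://id.priceprice.com/harga-hp/'

# What .extract() returns for an <a> element: the full serialized tag.
element_html = '<a href="/harga-hp/?maker=OPPO-182">OPPO</a>'
# What the a::attr(href) selector returns: only the attribute value.
href = '/harga-hp/?maker=OPPO-182'

# Joining the serialized tag treats it as a relative path under /harga-hp/.
bad_url = urljoin(base, element_html)
# Joining the attribute value yields the real listing URL.
good_url = urljoin(base, href)

# Approximation of the escaping seen in the crawl log (an assumption,
# not Scrapy's own escaping routine).
bad_url_escaped = quote(bad_url, safe=':/?=')

print(bad_url_escaped)
print(good_url)
```

The first printed URL should contain the same `%3Ca%20href=%22...` fragment as the 404 lines in the log, while the second is the plain `?maker=OPPO-182` page.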

With this selector in your code, you get:

2018-08-27 04:53:03 [scrapy.core.scraper] DEBUG: Scraped from <200 http://id.priceprice.com/harga-hp/?maker=Meizu-95>
{'Name': 'Meizu M6', 'Price': '\nRp 1.150.000\n - '}


{'Name': 'Infinix HOT 6 Pro', 'Price': '\nRp 1.599.000\n - '}

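【Comments】:

  • OK, thank you, it works now. But how can I write a rule for the links so that the crawl returns each maker only once instead of repeating the same maker? Should I use deny, deny_domains, or deny_extensions?
  • Can you explain with an example what you want to get?
  • Duplicate requests are usually filtered out by Scrapy, so you don't need to worry about that.
  • OK, thank you. I have one more question. May I ask you?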