【Question title】: ValueError: Missing scheme in request url: h
【Posted】: 2017-03-14 08:45:08
【Question】:

I wrote a spider with Scrapy to download images from a website, but running it raises the error below. This is the code that extracts img_url:

img_url = div.find_all("img",class_="img-responsive img-thumbnail center-block")[0]['src']

When I paste img_url into a browser, the image loads. When the spider tries to download the image, it raises this error:

  File "C:\Python27\lib\site-packages\scrapy\http\request\__init__.py", line 57, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h

spider.py

# -*- coding: utf-8 -*-

from scrapy.spiders import Spider
import scrapy
from scrapy.selector import Selector
from bs4 import BeautifulSoup
from deep_web2.items import DeepWeb2Item

import sys
reload(sys)
sys.setdefaultencoding('utf8')


class DeepSpider(Spider):
    name = "deepSpider"
    staer_urls=["http://hansamktkykr5yt4.onion/category/1/"]
    bash_url = "http://hansamktkykr5yt4.onion"
    headers = {
        "Host": "hansamktkykr5yt4.onion",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:31.0) Gecko/20100101 Firefox/31.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-language": "zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3",
        "Connection": "keep-alive"
    }

    def start_requests(self):
        yield scrapy.Request(url="http://hansamktkykr5yt4.onion/category/1/",headers=self.headers,
                             callback=self.parse_item)

    def parse_item(self, response):
        sel = Selector(response)

        html = sel.extract()
        html = html.encode('utf-8')

        soup = BeautifulSoup(html,"lxml")

        item_rows  = soup.find_all("div",class_="row row-item")
        for div in item_rows:
            title = div.find_all("div",class_="item-details")[0].find_all("a")[0].get_text()
            url = div.find_all("div",class_="item-details")[0].find_all("a")[0]['href']
            address = div.find_all("small",class_="text-muted-666")[0].get_text()
            price = div.find_all("div",class_="col-xs-3 text-right listing-price")[0].find_all("strong")[0].get_text()

            img_url = div.find_all("img",class_="img-responsive img-thumbnail center-block")[0]['src']
            view_num =div.find_all("div",class_="text-muted text-center")[0].find_all("small")[0].get_text()

            link_ = self.bash_url+url
            yield scrapy.Request(url=link_,headers=self.headers,meta={"title":title,"address":address,
                                                                      "price":price,"img_url":img_url,
                                                                       "view_num":view_num},callback=self.parse_fetch)

        pageNum = soup.find_all("ul",class_="pagination")[0]
        now = pageNum.find_all("li",class_="active")[0].get_text()
        now = int(str(now).strip())
        print now
        for page_ in pageNum.find_all("li",class_=''):
            number_ = page_.get_text()
            try:
                temp = int(str(number_).strip())
            except:
                    continue
            page_next = int(str(number_).strip())
            if page_next==now+1:
                url = self.bash_url+page_.find_all("a")[0]['href']
                yield scrapy.Request(url=url,headers=self.headers,callback=self.parse_item)

    def parse_fetch(self, response):
        sel = Selector(response)

        html = sel.extract()
        html = html.encode('utf-8')

        soup = BeautifulSoup(html,"lxml")
        text = soup.find_all("p")[0].get_text()

        item = DeepWeb2Item()

        item['title'] = response.meta['title']
        item['address'] = response.meta['address']
        item['price'] = response.meta['price']
        item['img_url'] = response.meta['img_url']
        item['view_num'] = response.meta['view_num']
        item['content'] = text

        yield item

Here is more of the error output:

Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 587, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Python27\lib\site-packages\scrapy\pipelines\media.py", line 62, in process_item
    requests = arg_to_iter(self.get_media_requests(item, info))
  File "C:\Python27\lib\site-packages\scrapy\pipelines\images.py", line 147, in get_media_requests
    return [Request(x) for x in item.get(self.images_urls_field, [])]
  File "C:\Python27\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
    self._set_url(url)
  File "C:\Python27\lib\site-packages\scrapy\http\request\__init__.py", line 57, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
2017-03-15 08:42:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://hansamktkykr5yt4.onion/listing/63776/> (referer: http://hansamktkykr5yt4.onion/category/1/)
2017-03-15 08:42:23 [scrapy.core.scraper] ERROR: Error processing {'address': u' Ships from: Netherlands',

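【Comments】:

  • What is staer_urls for? Shouldn't it be start_urls?
  • Could you provide a more complete traceback? With only the last two lines, it doesn't show which Request instantiation failed.
  • I have added more of the traceback.
  • So it is related to your "img_url" field, which you appear to reference in your ImagesPipeline settings. IMAGES_URLS_FIELD must name an item field that holds a list of URLs, not a single URL. Try item['img_url'] = [response.meta['img_url']]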
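
The last comment can be illustrated without Scrapy. The traceback shows the pipeline building its requests with a list comprehension over the item field, so a minimal sketch, assuming the field holds a plain string, reproduces the stray "h". The function name urls_seen_by_pipeline and the example URL are hypothetical and are not part of Scrapy or the question.

```python
def urls_seen_by_pipeline(item, field="img_url"):
    """Mimic the list comprehension from the traceback: iterate over the field."""
    return [x for x in item.get(field, [])]


# Hypothetical image URL, used only for illustration.
image = "http://example.onion/img/1.jpg"

# A bare string is iterated character by character.
as_string = urls_seen_by_pipeline({"img_url": image})

# A one-element list is iterated URL by URL.
as_list = urls_seen_by_pipeline({"img_url": [image]})

print(as_string[0])  # h
print(as_list[0])    # http://example.onion/img/1.jpg
```

Wrapping the single URL in a list, as the comment suggests, means the first element handed to Request is the whole URL rather than its first character.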

Tags: python scrapy


【Solution 1】:

Your spider's start_urls must be a list:

start_urls = ["https://www.google.com/"]

In effect, your string is interpreted as a list of characters, and when the spider tries to take the first element, it gets the first letter, "h".

【Comments】:

  • My spider's start_urls is a list, but the error still occurs.
  • @Wnj Try replacing staer_urls with start_urls
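  • Does it stop immediately, or only after creating some requests? Can you provide the shell output? By the way: defining start_requests() tells Scrapy to ignore start_urls. It may help to print the value and type() of everything you pass as url= to Request, for example around: link_ = self.bash_url+url; yield scrapy.Request(url=link_, ...)
  • It did not stop immediately; in fact it can output the contents of the items.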
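
The debugging step suggested in the comments, printing the value and type() of whatever is passed as url=, could be sketched as below. This is only an illustration under stated assumptions: describe_url is a hypothetical helper, the example values are made up, and the scheme check uses the standard library's urlparse rather than Scrapy's own validation.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2, as used in the question


def describe_url(value):
    """Return (type name, has_scheme) for a value about to be used as a URL."""
    # Only text values can carry a scheme; lists and other types cannot.
    is_text = isinstance(value, str) or type(value).__name__ == "unicode"
    has_scheme = is_text and bool(urlparse(value).scheme)
    return type(value).__name__, has_scheme


# Hypothetical values: an absolute URL, a relative path, and a single character.
for candidate in ["http://example.onion/listing/1/", "/listing/1/", "h"]:
    kind, ok = describe_url(candidate)
    print(repr(candidate), kind, "scheme present" if ok else "MISSING scheme")
```

Logging this just before each Request is created shows quickly whether a relative path or a single character is being passed where an absolute URL is expected.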