【Question title】: Scrapy ImagesPipeline WARNING: File (unknown-error): Error downloading image from <GET
【Posted】: 2015-03-21 03:27:46
【Question】:

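I am learning Python and Scrapy, and I am currently working out how to download images with it. I am stuck and cannot figure out what the real problem is.

I get this error message when I run the spider: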

<None>: Unsupported URL scheme '': no handler available for that scheme

[imageflip] WARNING: File (unknown-error): Error downloading image from <GET

Here is my pipelines.py:

import scrapy
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem


class PriceoflipkartPipeline(object):
    def process_item(self, item, spider):
        return item

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

Here is my settings.py:

SPIDER_MODULES = ['PriceoFlipkart.spiders']
NEWSPIDER_MODULE = 'PriceoFlipkart.spiders'
ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']
IMAGES_STORE = r'D:\PriceoFlipkart\Images'
IMAGES_EXPIRES = 90

Here is my spider:

import scrapy
from PriceoFlipkart.items import PriceoflipkartItem

class FlipkartSpider(scrapy.Spider):
    name = "imageflip"
    allowed_domains = ["flipkart.com"]
    start_urls = [
        "http://www.flipkart.com/moto-g-2nd-gen/p/itme5z8n9mt77ajr?pid=MOBDYGZ6SHNB7RFC&srno=b_1&ref=06f4e48c-9548-45fa-b3ac-fa5fdf0e0d22"
    ]

    def parse(self, response):
        for sel in response.xpath('//body'):
            item = PriceoflipkartItem()
            item['image_urls'] = sel.select('//img[@class="productImage  current"]').extract()
            yield item

In my items.py I added the following code:

image_urls = scrapy.Field()
images = scrapy.Field()

Please tell me how to configure this correctly so that the images download. I am on a Windows 8 machine. Thanks in advance.

【Comments】:

    Tags: python scrapy scrapy-spider scrapy-shell


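    【Solution 1】:

    The XPath used to extract the image URL is incorrect: it should end with /@src so that only the image's URL is extracted. Without it, extract() returns the markup of the whole <img> element rather than a URL, which is why Scrapy reports an unsupported (empty) URL scheme. Change it to: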

    item['image_urls'] = sel.select(
        '//img[@class="productImage  current"]/@src').extract()
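
To illustrate the difference, here is a minimal standalone sketch that uses only the Python standard library rather than Scrapy itself. The sample markup, the example.com URL, and the helper name url_scheme are assumptions made up for illustration, and xml.etree.ElementTree merely stands in for Scrapy's selector. It suggests why the original XPath fails: the serialized <img> element has no URL scheme, whereas the src attribute value does.

```python
from urllib.parse import urlparse
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for the product page; the URL is invented.
SAMPLE_HTML = (
    '<body><img class="productImage  current" '
    'src="http://img.example.com/moto-g.jpeg" /></body>'
)

root = ET.fromstring(SAMPLE_HTML)
img = root.find(".//img[@class='productImage  current']")

# Roughly what the original XPath yields: the serialized <img> element.
element_markup = ET.tostring(img, encoding="unicode")

# Roughly what the corrected XPath (ending in /@src) yields: the attribute value.
src_value = img.get("src")


def url_scheme(value):
    """Return the URL scheme of value, or '' if it has none."""
    return urlparse(value).scheme


print(repr(url_scheme(element_markup)))  # ''
print(repr(url_scheme(src_value)))       # 'http'
```

An empty scheme corresponds to the "Unsupported URL scheme ''" message in the question, since the download request was being built from HTML markup instead of a URL.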
    

    【Comments】:

    • Good find! Thanks.
    • Awesome! Thanks. It works. I can download the images.