【Question Title】: Scrapy is not downloading images
【Posted】: 2018-12-15 09:25:11
【Question】:

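I'm trying to download some images with Scrapy. I have followed the official documentation, copied and pasted some examples, and read many similar questions, but it still does not work. What am I missing?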

I noticed that the list of enabled item pipelines is empty, but I can't figure out why.

2018-07-06 20:10:18 [scrapy.middleware] INFO: Enabled item pipelines: []

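I have also tried different websites and played with the headers, but nothing changed. The spider appears to run correctly, yet no files are saved.

Here is the code I'm using to test this feature.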

myspider.py:

class ImageSpider(scrapy.Spider):
    name = "imagespider"

    start_urls = [
        "http://www.upv.es/",
    ]

    def parse(self, response):
        for elem in response.xpath("//img"):
            img_url = elem.xpath("@src").extract_first()
            yield ImageItem(image_urls=[img_url]) # Not working
            #yield {'image_urls': [img_url]}  # Not working

items.py:

class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

settings.py:

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

IMAGES_STORE = '/Users/salva/Desktop/demo/demo/temp'

Console:

2018-07-06 20:10:18 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-07-06 20:10:18 [scrapy.utils.log] INFO: Versions: lxml 4.2.3.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.4.0, Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 03:03:55) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h  27 Mar 2018), cryptography 2.2.2, Platform Darwin-17.6.0-x86_64-i386-64bit
2018-07-06 20:10:18 [scrapy.crawler] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2018-07-06 20:10:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2018-07-06 20:10:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-07-06 20:10:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-07-06 20:10:18 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-07-06 20:10:18 [scrapy.core.engine] INFO: Spider opened
2018-07-06 20:10:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-06 20:10:18 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-07-06 20:10:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.upv.es/> (referer: None)
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/GRi.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/GRi.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/marcaUPVN1.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/img_identif.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/menu-hamburguesa2.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/menu-hamburguesa.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/espacio2.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/icon-desplegar_GR.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/icon-plegar_GR.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/ico_nueva_ventana.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/icon-desplegar.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/icon-desplegar.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/icon-desplegar.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/ico_nueva_ventana.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/icon-desplegar.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/icon-desplegar.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pcarrusel/slider_valentia_hyperloop2.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pcarrusel/slider_campus_109.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pcarrusel/slider_fsupv04_michigan.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pnoticias/icono_escuelas_fba_008.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pnoticias/icono_gente_campus_118.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pnoticias/icono_institutos_002.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pnoticias/icon_posgrado.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pnoticias/icono_alumnos_tecnologia_051.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pnoticias/icono_gente_campus_119.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/ppromo/promo_vida_universitaria.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/ppromo/promo_deportes3.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/ppromo/promo_alojamiento.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/ppromo/promo_valencia.jpg']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/ico_videoplayer_pvideos.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/mulet3-1.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/ico_videoplayer_pvideos.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/corma.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/ico_videoplayer_pvideos.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/andy.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/ico_videoplayer_pvideos.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/san_nicolas.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/ico_videoplayer_pvideos.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/formula.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/ico_videoplayer_pvideos.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pvideos/eco_sensor.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/icono_Riunet.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/icono_upvX.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/icono_poliConsulta.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/icono_poliAPPS.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/rs-twitter.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/rs-facebook.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/rs-linkedin.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/rs-instagram.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/rs-youtube.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/rs-google-plus.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/campus_excelencia-2WH.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/EMASupv-WH.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/xarxa_vives.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/universia_cl.png']}
2018-07-06 20:10:18 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.upv.es/>
{'image_urls': ['/imagenes/pinferior/forum_unesco_cl.png']}
2018-07-06 20:10:18 [scrapy.core.engine] INFO: Closing spider (finished)
2018-07-06 20:10:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 225,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 53981,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 7, 6, 18, 10, 18, 744230),
 'item_scraped_count': 56,
 'log_count/DEBUG': 58,
 'log_count/INFO': 7,
 'memusage/max': 103243776,
 'memusage/startup': 103239680,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 7, 6, 18, 10, 18, 355192)}
2018-07-06 20:10:18 [scrapy.core.engine] INFO: Spider closed (finished)

【Comments】:

    Tags: python image file download scrapy


    【Solution 1】:

    It works when I run the spider from the terminal (with scrapy crawl myspider), but not when I run it from a script (CrawlerProcess).

    https://github.com/scrapy/scrapy/issues/1904
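
One explanation consistent with the log in the question (the "Overridden settings" line lists only USER_AGENT, and the enabled item pipelines list is empty) is that a CrawlerProcess created in a script does not read the project's settings.py unless the settings are passed to it explicitly. The following is a minimal sketch under that assumption. The helper names build_settings and run_spider are hypothetical and not part of Scrapy; the settings keys are the ones from the question. ImagesPipeline also needs Pillow to be installed.

```python
from pathlib import Path


def build_settings(images_dir):
    """Return the settings the question keeps in settings.py (hypothetical helper)."""
    return {
        "ITEM_PIPELINES": {"scrapy.pipelines.images.ImagesPipeline": 1},
        "IMAGES_STORE": str(Path(images_dir)),
    }


def run_spider(spider_cls, images_dir):
    """Run a spider from a script with the image pipeline explicitly enabled."""
    # Imported here so that build_settings can be used without Scrapy installed.
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings=build_settings(images_dir))
    process.crawl(spider_cls)
    process.start()  # blocks until the crawl has finished
```

When the script lives inside a Scrapy project, passing get_project_settings() from scrapy.utils.project to CrawlerProcess is an alternative that loads settings.py instead of duplicating its values. Either way, the "Enabled item pipelines" log line should then list ImagesPipeline.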

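    【Comments】:

      【Solution 2】:

      The spider is scraping the main page as instructed, but you are not joining each image source with the main link, so the image URLs stay relative. Try something like this (not tested):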

      def parse(self, response):
          for elem in response.xpath("//img"):
              img_url = elem.xpath("@src").extract_first()
              if img_url:
                  # Resolve the relative src against the page URL
                  yield ImageItem(image_urls=[response.urljoin(img_url)])
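
The scraped items in the question's log contain relative paths such as /imagenes/GRi.png, which cannot be downloaded as they stand. As a rough illustration of the joining step, the sketch below uses urljoin from the Python standard library on two paths taken from that log. The variable names are illustrative only. It also shows why plain string concatenation is a poor substitute.

```python
from urllib.parse import urljoin

# Page URL and two relative image paths taken from the question's log
page_url = "http://www.upv.es/"
relative_srcs = [
    "/imagenes/GRi.png",
    "/imagenes/pcarrusel/slider_campus_109.jpg",
]

# urljoin resolves each path against the page URL
absolute_urls = [urljoin(page_url, src) for src in relative_srcs]

# Naive concatenation leaves a doubled slash after the host
concatenated = page_url + relative_srcs[0]

for url in absolute_urls:
    print(url)
```

Inside a Scrapy callback, response.urljoin(img_url) does the same resolution against the response's own URL, so the page address does not need to be hard-coded.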
      

      【Comments】:
