【Question title】: Scrapy Error. List index out of range
【Posted】: 2017-11-20 05:51:00
【Question】:

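I'm very new to web scraping, and I'm currently trying to use Scrapy for a TensorFlow project I'm working on, but for some reason Scrapy isn't giving me any results. I believe I'm doing something wrong when extracting the actual image links or the titles themselves. I stumbled upon an example that extracts images from imgur, and that is what I'm currently using.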

items.py

import scrapy

class ImgurItem(scrapy.Item):

    title = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()

settings.py

BOT_NAME = 'imgur'
SPIDER_MODULES = ['imgur.spiders']
NEWSPIDER_MODULE = 'imgur.spiders'
ITEM_PIPELINES = {'imgur.pipelines.ImgurPipeline': 1}
IMAGES_STORE = r'I:\ScrapySpiders\imgur\imgur\Images'  # raw string so the backslashes are not treated as escapes
ROBOTSTXT_OBEY = False

imgur_spider.py

import scrapy

from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.contrib.linkextractors import LinkExtractor
from imgur.items import ImgurItem

class ImgurSpider(CrawlSpider):
    name = 'imgur'
    allowed_domains = ['imgur.com']
    start_urls = ['http://www.imgur.com']
    rules = [Rule(LinkExtractor(allow=['/gallery/.*']), 'parse_imgur')]

    def parse_imgur(self, response):
        image = ImgurItem()
        image['title'] = response.xpath("//h1[@class='post-title']/text()").extract()
        rel = response.xpath("//img/@src").extract()
        image['image_urls'] = ['http:'+rel[0]]
        return image

pipelines.py

import scrapy
from scrapy.contrib.pipeline.images import ImagesPipeline

class ImgurPipeline(ImagesPipeline):

    def set_filename(self, response):
        #add a regex here to check the title is valid for a filename.
        return 'full/{0}.jpg'.format(response.meta['title'][0])

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url, meta={'title': item['title']})

    def get_images(self, response, request, info):
        for key, image, buf in super(ImgurPipeline, self).get_images(response, request, info):
            key = self.set_filename(response)
            yield key, image, buf

Updated error log:

Traceback (most recent call last):
  File "c:\users\tomas\appdata\local\programs\python\python35\lib\site-packages\scrapy\pipelines\files.py", line 356, in media_downloaded
    checksum = self.file_downloaded(response, request, info)
  File "c:\users\tomas\appdata\local\programs\python\python35\lib\site-packages\scrapy\pipelines\images.py", line 98, in file_downloaded
    return self.image_downloaded(response, request, info)
  File "c:\users\tomas\appdata\local\programs\python\python35\lib\site-packages\scrapy\pipelines\images.py", line 102, in image_downloaded
    for path, image, buf in self.get_images(response, request, info):
  File "I:\ScrapySpiders\imgur\imgur\pipelines.py", line 24, in get_images
    key = self.set_filename(response)
  File "I:\ScrapySpiders\imgur\imgur\pipelines.py", line 16, in set_filename
    return 'full/{0}.jpg'.format(response.meta['title'][0])
IndexError: list index out of range
2017-11-19 22:11:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://imgur.com/gallery/pKsYl>
{'image_urls': ['http://i.imgur.com/YEQb03D.jpg'], 'images': [], 'title': []}
2017-11-19 22:11:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://imgur.com/gallery/R6eQD> (referer: None)
2017-11-19 22:11:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://imgur.com/gallery/QrKeE>
{'image_urls': ['http://i.imgur.com/OpDDRWr.png'], 'images': [], 'title': []}
2017-11-19 22:11:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://imgur.com/gallery/JKz3U>
{'image_urls': ['http://i.imgur.com/VChqgP9r.jpg'], 'images': [], 'title': []}
{'image_urls': ['http://i.imgur.com/m9Cq6B1.png'], 'title': []}
2017-11-19 22:11:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://i.imgur.com/m9Cq6B1.png> (referer: None)
2017-11-19 22:11:27 [scrapy.pipelines.files] DEBUG: File (downloaded): Downloaded file from <GET http://i.imgur.com/m9Cq6B1.png> referred in <None>
2017-11-19 22:11:27 [PIL.PngImagePlugin] DEBUG: STREAM b'IHDR' 16 13
2017-11-19 22:11:27 [PIL.PngImagePlugin] DEBUG: STREAM b'IDAT' 41 8192
2017-11-19 22:11:28 [scrapy.pipelines.files] ERROR: File (unknown-error): Error processing file from <GET http://i.imgur.com/m9Cq6B1.png> referred in <None>
Traceback (most recent call last):
  File "c:\users\tomas\appdata\local\programs\python\python35\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "c:\users\tomas\appdata\local\programs\python\python35\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "c:\users\tomas\appdata\local\programs\python\python35\lib\site-packages\twisted\internet\defer.py", line 1363, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 http://i.imgur.com/m9Cq6B1.png>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\users\tomas\appdata\local\programs\python\python35\lib\site-packages\scrapy\pipelines\files.py", line 356, in media_downloaded
    checksum = self.file_downloaded(response, request, info)
  File "c:\users\tomas\appdata\local\programs\python\python35\lib\site-packages\scrapy\pipelines\images.py", line 98, in file_downloaded
    return self.image_downloaded(response, request, info)
  File "c:\users\tomas\appdata\local\programs\python\python35\lib\site-packages\scrapy\pipelines\images.py", line 102, in image_downloaded
    for path, image, buf in self.get_images(response, request, info):
  File "I:\ScrapySpiders\imgur\imgur\pipelines.py", line 24, in get_images
    key = self.set_filename(response)
  File "I:\ScrapySpiders\imgur\imgur\pipelines.py", line 16, in set_filename
    return 'full/{0}.jpg'.format(response.meta['title'][0])
IndexError: list index out of range
2017-11-19 22:11:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://imgur.com/gallery/R6eQD>
{'image_urls': ['http://i.imgur.com/m9Cq6B1.png'], 'images': [], 'title': []}
2017-11-19 22:11:28 [scrapy.core.engine] INFO: Closing spider (finished)
2017-11-19 22:11:28 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/builtins.ValueError': 1,
 'downloader/request_bytes': 29607,
 'downloader/request_count': 122,
 'downloader/request_method_count/GET': 122,
 'downloader/response_bytes': 14490175,
 'downloader/response_count': 121,
 'downloader/response_status_count/200': 115,
 'downloader/response_status_count/301': 4,
 'downloader/response_status_count/302': 2,
 'file_count': 45,
 'file_status_count/downloaded': 45,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 11, 19, 20, 11, 28, 247434),
 'item_scraped_count': 68,
 'log_count/DEBUG': 274,
 'log_count/ERROR': 46,
 'log_count/INFO': 7,
 'log_count/WARNING': 3,
 'request_depth_max': 1,
 'response_received_count': 115,
 'scheduler/dequeued': 76,
 'scheduler/dequeued/memory': 76,
 'scheduler/enqueued': 76,
 'scheduler/enqueued/memory': 76,
 'spider_exceptions/IndexError': 1,
 'start_time': datetime.datetime(2017, 11, 19, 20, 11, 21, 643056)}
2017-11-19 22:11:28 [scrapy.core.engine] INFO: Spider closed (finished)

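I know there are similar threads about problems with this exact code, but none of them helped me solve the issue I'm running into. Apparently Imgur changed its page markup, and I just don't know how I should extract these links.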

【Comments】:

Tags: python scrapy


【Answer 1】:

This has nothing to do with web scraping or imgur. You have a Python syntax error, which is reported at the start of this line:

rel = response.xpath("//img[@src='//i.imgur.com/*.*'])".extract()

That is because the previous line has two opening parentheses but only one closing parenthesis:

#                              v
image['title'] = response.xpath(\
    "//h1[@class='post-title']/text()".extract()
#                                             ^^

The opening parenthesis in response.xpath( is never closed.
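
One way to catch this kind of mistake before running the spider is to compile the source without executing it. The sketch below is only an illustration: find_syntax_error is a hypothetical helper name, and SNIPPET is a shortened stand-in for the spider code. The exact line number and message reported depend on the Python version.

```python
# Shortened stand-in for the spider code: the first call to
# response.xpath( is never closed, so the parser runs into the next statement.
SNIPPET = (
    "title = response.xpath(\n"
    "    \"//h1[@class='post-title']/text()\".extract()\n"
    "rel = response.xpath(\"//img/@src\").extract()\n"
)


def find_syntax_error(source):
    """Return (line, message) for the first syntax error, or None if it compiles."""
    try:
        # compile() only parses the code; nothing is executed,
        # so undefined names such as `response` do not matter here.
        compile(source, "<snippet>", "exec")
    except SyntaxError as exc:
        return exc.lineno, exc.msg
    return None


if __name__ == "__main__":
    print(find_syntax_error(SNIPPET))
```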

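【Comments】:

  • Thanks for the heads-up. I feel really silly now. However, after fixing that basic syntax error (and a few others that turned up afterwards), I ran into another problem, the one I had been working on before this post. It seems my code loses the object somewhere along the way. I get a list index out of range error: image['image_urls'] = ['http:'+rel[0]] IndexError: list index out of range
【Answer 2】:

Adding a new answer to clean things up. This should work:

Change the parse_imgur function to: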

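def parse_imgur(self, response):
    image = ImgurItem()
    # extract_first() returns a single string, or None when nothing matches
    image['title'] = response.xpath("//h1[contains(@class, 'post-title')]/text()").extract_first()
    rel = response.xpath("//img/@src").extract_first()
    try:
        image['image_urls'] = ['http:'+rel]
        return image
    except TypeError:
        # rel is None (no <img> was found), so skip this page
        pass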

Note that the h1 class name has a trailing space. You can either use @class="post-title " or, as I prefer, contains(@class, 'post-title').
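
To see why the exact comparison fails on a class value with a trailing space, and why contains() is looser than it looks, the following pure-Python sketch mimics three matching strategies on a raw class attribute value. The helper names are made up for illustration and nothing here calls Scrapy; the XPath in the comments is the common XPath 1.0 idiom for matching a single class token.

```python
def exact_match(class_attr, name):
    # Equivalent in spirit to @class='post-title': whole-string comparison.
    return class_attr == name


def substring_match(class_attr, name):
    # Equivalent in spirit to contains(@class, 'post-title').
    return name in class_attr


def token_match(class_attr, name):
    # Equivalent in spirit to:
    #   contains(concat(' ', normalize-space(@class), ' '), ' post-title ')
    # i.e. match one whitespace-separated class token exactly.
    return name in class_attr.split()


if __name__ == "__main__":
    for attr in ["post-title ", "long-post-title", "post-title other"]:
        print(repr(attr),
              exact_match(attr, "post-title"),
              substring_match(attr, "post-title"),
              token_match(attr, "post-title"))
```

The token-based form accepts the trailing space but rejects look-alike names, which is usually what a class selector is meant to do.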

Since I used .extract_first() for the image title, you should also change the following:

def set_filename(self, response):
    return 'full/{0}.jpg'.format(response.meta['title'][0])

to:

def set_filename(self, response):
    return 'full/{0}.jpg'.format(response.meta['title'])

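Further improvements could include sanitizing the file name and tightening the class selector for the post title (as written, it would also match class names such as long-post-title or post-title-again).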
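
A minimal sketch of the file name clean-up, assuming the title may be None or contain characters that are not allowed in file names. safe_image_path and its parameters are hypothetical names, not part of Scrapy, and the 100-character cap is an arbitrary choice.

```python
import hashlib
import re


def safe_image_path(title, image_url, folder="full", ext=".jpg"):
    """Build a storage path from a post title, falling back to a URL hash."""
    # Keep word characters and dashes; collapse every other run into "_".
    cleaned = re.sub(r"[^\w\-]+", "_", title or "").strip("_")
    if not cleaned:
        # No usable title (None, empty, or only punctuation): use a stable
        # hash of the image URL so the file still gets a unique name.
        cleaned = hashlib.sha1(image_url.encode("utf-8")).hexdigest()
    return "{0}/{1}{2}".format(folder, cleaned[:100], ext)


if __name__ == "__main__":
    print(safe_image_path("What: a test?/ok", "http://i.imgur.com/x.jpg"))
    print(safe_image_path(None, "http://i.imgur.com/x.jpg"))
```

In the pipeline, set_filename could then return something like safe_image_path(response.meta['title'], response.url), so that an empty title no longer breaks the download.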

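【Comments】:

  • It does indeed. Thank you very much for your patience. I have a couple of parting questions, since I'm determined to learn Scrapy. How does extract differ from extract_first? And why does this approach drop the list index on rel?
  • Glad I could help. extract() gives you a list of strings (if there are any), while extract_first() extracts only the first item that matches the selector and returns a plain string (not a list). In general, if you expect the selector to find only a single item, I recommend extract_first(), because it avoids any indexing problems when you try to index an empty variable.
  • Awesome! Thanks again for your help with this!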
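
The difference described in the comments can be modelled in plain Python. The sketch below does not call Scrapy: the lists stand in for what extract() returns, and first_or_default is a hypothetical stand-in for extract_first(), which returns None by default when nothing matches.

```python
def first_or_default(matches, default=None):
    """Mimic extract_first(): the first match, or a default when there is none."""
    return matches[0] if matches else default


if __name__ == "__main__":
    no_matches = []                      # extract() on a selector that matched nothing
    one_match = ["//i.imgur.com/a.jpg"]  # extract() on a selector with one hit

    print(first_or_default(one_match))   # //i.imgur.com/a.jpg
    print(first_or_default(no_matches))  # None

    try:
        no_matches[0]                    # what rel[0] does on an empty result
    except IndexError as exc:
        print("IndexError:", exc)        # IndexError: list index out of range
```

This is why the item in the log with 'title': [] triggered the IndexError in set_filename: indexing the empty list fails, whereas the first-or-default style only yields None, which still has to be checked before use.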
【Answer 3】:

Just move the quotation mark to the correct side of the closing parenthesis and it should work for you:

rel = response.xpath("//img[@src='//i.imgur.com/*.*']").extract()

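【Comments】:

  • Thanks for spotting that. I'm really questioning my attention span now. After fixing it, though, I get an index out of range error at image['image_urls'] = ['http:'+rel[0]]
  • That can happen if your rel variable ends up empty and you try to reference the first index of an empty variable. I would rewrite it as rel = response.xpath("//img[@src='//i.imgur.com/*.*']").extract_first() and drop the index from image['image_urls'] = ['http:'+rel[0]], so that it becomes image['image_urls'] = ['http:'+rel]
  • Strangely, your solution ends with "Can't convert 'NoneType' object to str implicitly", whereas only fixing the earlier syntax produces an index out of range error in the pipelines.py file. Your solution also seems more logical to me, but I'm getting more and more confused, haha.
  • Ah, that makes sense. If the rel variable is empty, concatenating it with a string is not a good idea. You could use a try-except statement: try: image['image_urls'] = ['http:'+rel] except: image['image_urls'] = None. Mind the line breaks and indentation, since I can't format it properly in this comment. You could also replace None with, for example, a string of your choice.
  • Just another thought. When scraping we sometimes get a few empty values, which can be considered normal depending on the site you are scraping. If all of your image links end up empty, however, there is a more fundamental problem with the spider.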
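
The try-except idea from the comments, written out with proper formatting, could look roughly like the sketch below. It swaps the try-except for an explicit None check and uses the standard library's urljoin instead of prepending 'http:'; build_image_urls is a hypothetical helper name, and returning an empty list for "no image" is an assumption about what the pipeline should receive.

```python
from urllib.parse import urljoin


def build_image_urls(page_url, rel):
    """Return a list with one absolute image URL, or an empty list if rel is None."""
    if rel is None:
        # The selector matched nothing; an empty list means "no images" for this item.
        return []
    # urljoin resolves protocol-relative ("//host/path") and relative links
    # against the page URL, and leaves absolute URLs unchanged.
    return [urljoin(page_url, rel)]


if __name__ == "__main__":
    page = "https://imgur.com/gallery/pKsYl"
    print(build_image_urls(page, "//i.imgur.com/YEQb03D.jpg"))
    print(build_image_urls(page, None))
```

Inside a spider callback, response.urljoin(rel) resolves the link against the response URL in the same way.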