【Question title】: How to pass a Scrapy item to the Images Pipeline
【Posted】: 2018-03-09 06:25:44
【Question】:

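I have a spider that downloads jpg files from a particular website. In the past I parsed response.url in the images pipeline so that I could rename each file as it was downloaded. The problem is that this site has an odd directory structure, so parsing image_urls to rename the target file does not work. As a workaround, I just use the original image name as the file name.

I would like to use data from the actual Scrapy item itself, but I can't seem to pass variables from the spider to the images pipeline. With the code below, I want to parse url in the spider and pass it as a variable to otImagesPipeline in the pipeline, but nothing I have tried works. I looked through the Scrapy documentation but could not find out how to do this.

Is this possible with Scrapy?

Here is my code: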

settings.py:

BOT_NAME = 'bid'
MEDIA_ALLOW_REDIRECTS = True
SPIDER_MODULES = ['bid.spiders']
NEWSPIDER_MODULE = 'bid.spiders'
ITEM_PIPELINES = {'bid.pipelines.otImagesPipeline': 1}  
IMAGES_STORE = 'C:\\temp\\images\\filenametest'  

pipelines.py

import scrapy
# scrapy.contrib is deprecated (and removed in later releases); import from scrapy.pipelines
from scrapy.pipelines.images import ImagesPipeline

class otImagesPipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        targetfile = request.url.split('/')[-1]
        return targetfile

items.py

import scrapy

class BidItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    caption = scrapy.Field()
    image_urls = scrapy.Field()

getbid.py (the spider)

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from bid.items import BidItem
from urllib import parse as urlparse

class GetbidSpider(CrawlSpider):
    name = 'getbid'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for sel in response.xpath('//a'):
          link = str(sel.xpath('@href').extract()[0])
          if (link.endswith('.jpg')):
            href = BidItem()
            href['url'] = response.url
            href['title'] = response.css("h1.entry-title::text").extract_first()
            href['caption'] = response.css("p.wp-caption-text::text").extract()
            href['image_urls'] = [link]
            yield href
            yield scrapy.Request(urlparse.urljoin('http://www.example.com/',link),callback=self.parse_item)
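The spider above stores the raw href in image_urls, while the images pipeline issues a request for each entry, so the entries need to be absolute URLs. As a minimal sketch of how urljoin resolves the different kinds of href (the page and image URLs here are made up for illustration, not taken from the real site):

```python
from urllib.parse import urljoin

# Hypothetical page URL, standing in for response.url in parse_item
page_url = "http://www.example.com/some-article/"

# A root-relative href is resolved against the site root
root_relative = urljoin(page_url, "/wp-content/uploads/photo.jpg")

# A plain relative href is resolved against the page's own directory
page_relative = urljoin(page_url, "photo.jpg")

# An href that is already absolute is returned unchanged
absolute = urljoin(page_url, "http://cdn.example.com/photo.jpg")

print(root_relative)
print(page_relative)
print(absolute)
```

Inside a callback, response.urljoin(link) does the same resolution against the current page, which avoids hard-coding the site root.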

Update

Thanks to Umair's help, I was able to fix this exactly the way I needed. Here is the revised code:

getbid.py

    def parse_item(self, response):
        for sel in response.xpath('//a'):
          link = str(sel.xpath('@href').extract()[0])
          if (link.endswith('.jpg')):
            href = BidItem()
            href['url'] = response.url
            href['title'] = response.css("h1.entry-title::text").extract_first()
            href['caption'] = response.css("p.wp-caption-text::text").extract()
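            # directory name: second-to-last segment of the page URL
            future_dir = href['url'].split("/")[-2]
            # BidItem needs an `images = scrapy.Field()` entry for this assignment to work
            href['images'] = {link: future_dir}
            yield href
            yield scrapy.Request(urlparse.urljoin('http://www.example.com/',link),callback=self.parse_item)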

pipelines.py

class otImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        if 'images' in item:
            for image_url, img_dir in item['images'].items():
                request = scrapy.Request(url=image_url)
                request.meta['img_dir'] = img_dir
                yield request

    def file_path(self, request, response=None, info=None):
        filename = request.url.split('/')[-1]
        filedir = request.meta['img_dir']
        filepath = filedir + "/" + filename
        return filepath
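To make the naming logic above easier to check without running a crawl, here is a minimal standalone sketch of the same path construction. The helper name build_image_path and the sample URLs are hypothetical. Unlike the pipeline code, it takes the file name from the URL path only, so a query string such as ?w=300 does not end up in the file name; that is an assumption about what is wanted, not something the original code does.

```python
import posixpath
from urllib.parse import urlparse


def build_image_path(image_url, page_url):
    """Return '<page dir>/<image file name>', relative to IMAGES_STORE."""
    # Second-to-last segment of the page URL, as in the spider's
    # href['url'].split("/")[-2]
    img_dir = page_url.split("/")[-2]
    # File name from the URL path only (query string is ignored)
    filename = posixpath.basename(urlparse(image_url).path)
    return posixpath.join(img_dir, filename)


example_path = build_image_path(
    "http://www.example.com/wp-content/uploads/2018/03/photo.jpg",
    "http://www.example.com/some-article/",
)
print(example_path)
```

Note that split("/")[-2] relies on the page URL ending with a slash; for a URL such as http://www.example.com/some-article it would return the host name instead of the article slug.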

【Discussion】:

    Tags: python web-scraping scrapy scrapy-spider


    【Answer 1】:

    Keep the IMAGES_STORE directory on your Spider class so that you can access it later in the ImagesPipeline file_path method

    class GetbidSpider(CrawlSpider):
        name = 'getbid'
    
        IMAGE_DIR = 'C:\\temp\\images\\filenametest'
    
        custom_settings = {
           "IMAGES_STORE": IMAGE_DIR
        }
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com']
    
        rules = (
            Rule(LinkExtractor(), callback='parse_item', follow=True),
        )
    
        def parse_item(self, response):
            for sel in response.xpath('//a'):
              link = str(sel.xpath('@href').extract()[0])
              if (link.endswith('.jpg')):
                href = BidItem()
                href['url'] = response.url
                href['title'] = response.css("h1.entry-title::text").extract_first()
                href['caption'] = response.css("p.wp-caption-text::text").extract()
    
                href['images'] = {link: href['title']}
    
                yield href
                yield scrapy.Request(urlparse.urljoin('http://www.example.com/',link),callback=self.parse_item)
    

    Then, in your ImagesPipeline

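    import os

    import scrapy
    from scrapy.pipelines.images import ImagesPipeline

    class CustomImagePipeline(ImagesPipeline):
    
        def get_media_requests(self, item, info):
            if 'images' in item:
                # .items() is the Python 3 spelling; on Python 2 use .iteritems()
                for image_url, img_name in item['images'].items():
                    request = scrapy.Request(url=image_url)
                    # carry the target name along so file_path can read it
                    request.meta['img_name'] = img_name
                    yield request
    
        def file_path(self, request, response=None, info=None):
            return os.path.join(info.spider.IMAGE_DIR, request.meta['img_name'])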
    

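    【Comments】:

    • Thanks. The only part I don't understand is what to put in the spider. Do you mean I should replace url, title, caption, image_urls with item['images']? Given what is in the current spider, what values should I use for "image_link_here" and "image_name_here"?
    • Thanks, I'm definitely getting closer. If I print the item in the pipeline, it shows all the elements, including "images". But when it reaches the line for image_url, img_name in item['images']: it fails with ValueError: too many values to unpack (expected 2). Yet if I print item['images'] just before that, it only shows 2 values.
    • Thank you, I had to make a few changes, but it works just the way I hoped. Thanks for your help!
    • Use for image_url, img_name in item['images'].iteritems() instead of for image_url, img_name in item['images']
    • OK, thanks! This is also on Python 3, so I think I have to change "iteritems" to "items"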
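
    The ValueError discussed in the comments can be reproduced without Scrapy. This is a small sketch with a made-up images dict: iterating a dict directly yields only its keys, so Python tries to unpack the URL string itself into two names, whereas .items() yields (key, value) pairs.

```python
# Hypothetical stand-in for item['images']: {image URL: target directory}
images = {"http://www.example.com/uploads/photo.jpg": "some-article"}

# Iterating the dict yields keys only; unpacking a long string key into
# two names raises ValueError
try:
    for image_url, img_dir in images:
        pass
except ValueError as exc:
    error = str(exc)

# .items() yields (key, value) tuples, which unpack cleanly into two names
pairs = [(image_url, img_dir) for image_url, img_dir in images.items()]

print(error)
print(pairs)
```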