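【Posted】:2021-10-25 15:31:00
【Question】:
I am trying to use scrapy to crawl a website and download its images. The code runs without errors, but no images are downloaded, even though I specified the images pipeline and the store directory in my settings.py.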
spider.py
import re
import scrapy
import os
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import ImagesItem


class ImageSpiderSpider(CrawlSpider):
    name = 'image_spider'
    allowed_domains = ['books.toscrape.com']
    # start_urls = ['http://books.toscrape.com/']

    def start_requests(self):
        url = 'http://books.toscrape.com/'
        yield scrapy.Request(url=url)

    rules = (
        Rule(LinkExtractor(allow=r'catalogue/'), callback='parse_image', follow=True),
    )

    # save_location = os.getcwd()

    custom_settings = {
        "ITEM_PIPELINES": {'scrapy.pipelines.images.ImagesPipeline': 1},
        "IMAGES_STORE": '.images_download/full'
    }

    def parse_image(self, response):
        if response.xpath('//div[@class="item active"]/img').get() is not None:
            img = response.xpath('//div[@class="item active"]/img/@src').get()

            """
            Computing the Absolute path of the image file.
            "image_urls" require absolute path, not relative path
            """
            m = re.match(r"^(?:../../)(.*)$", img).group(1)
            url = "http://books.toscrape.com/"
            img_url = "".join([url, m])
            image = ImagesItem()
            image["image_urls"] = [img_url]  # "image_urls" must be a list
            yield image
items.py
import scrapy


class ImagesItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
settings.py
BOT_NAME = 'images'
SPIDER_MODULES = ['images.spiders']
NEWSPIDER_MODULE = 'images.spiders'
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {"scrapy.pipelines.images.ImagesPipeline": 1}
IMAGES_STORE = "/Home/PycharmProjects/scrappy/images/images_downloader"
【Comments】:
-
I ran your code and it works fine; it downloads the images. I tried it with both settings.py and custom_settings, and both work. (I used PyCharm and scrapy 2.5.0.)
-
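I didn't run it, but first you have to check whether the folder exists, because it will not be created automatically; if the folder doesn't exist, nothing is downloaded. Second: you can use print() to see which part of the code is executed and what the variables contain. This is called "print debugging".
-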
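A minimal sketch of the two checks suggested above, assuming the store is a plain local folder. The helper name ensure_store and the example folder name are hypothetical and not part of scrapy; the demo writes into a temporary directory so it leaves nothing behind.

```python
import tempfile
from pathlib import Path


def ensure_store(store):
    """Create the image store folder if it is missing and return its absolute path."""
    path = Path(store).resolve()
    # parents=True creates intermediate folders; exist_ok=True makes repeat calls safe.
    path.mkdir(parents=True, exist_ok=True)
    # "print debugging": show exactly which folder is going to be used.
    print(f"image store: {path} (exists: {path.is_dir()})")
    return path


# Demo in a throwaway directory; "images_download" is a made-up folder name.
with tempfile.TemporaryDirectory() as tmp:
    ensure_store(Path(tmp) / "images_download")
```

The same kind of print() call inside parse_image would show whether the callback runs at all and what img_url contains.
-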
I don't understand why you use re.match. If you want to build an absolute URL, there is absolute_url = response.urljoin(relative_url)
-
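The urljoin suggestion can be illustrated with the standard library alone. This is only a sketch: the page URL and image path below are made up, and it assumes that response.urljoin resolves a relative URL against the page URL the same way urllib.parse.urljoin does.

```python
from urllib.parse import urljoin

# Hypothetical product page and relative image path, for illustration only.
page_url = "http://books.toscrape.com/catalogue/some-book_1/index.html"
relative_src = "../../media/cache/ab/cd/example.jpg"

# Each "../" climbs one folder up from the page's folder before the rest is appended.
absolute_url = urljoin(page_url, relative_src)
print(absolute_url)  # http://books.toscrape.com/media/cache/ab/cd/example.jpg
```

Inside the spider this would become img_url = response.urljoin(img), replacing the re.match and "".join lines.
-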
os.getcwd() and "." can be used interchangeably here, for example "IMAGES_STORE": "."
-
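A short sketch of how such relative values resolve, assuming the store is a local path that is interpreted relative to the directory the crawl is started from (that is my understanding, not something confirmed in this thread).

```python
import os

# "." and os.getcwd() name the same directory.
print(os.path.abspath(".") == os.getcwd())

# The value from custom_settings in the question, resolved the same way.
# Note that the leading dot is part of the folder name ".images_download".
resolved = os.path.abspath(".images_download/full")
print(resolved)
```

As far as I know, custom_settings on the spider takes precedence over settings.py, so this dot-prefixed folder (hidden by default on Unix-like systems) may be worth checking as well.
-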
The standard pipeline downloads into the subfolder full, i.e. IMAGES_STORE/full, so you should check whether you have a subfolder full.
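-
To make the full subfolder check concrete, here is a sketch that guesses where a downloaded image would land. It assumes the default naming scheme (SHA-1 hex digest of the image URL plus ".jpg" inside full); the helper expected_image_path, the store folder and the image URL are all hypothetical.

```python
import hashlib
from pathlib import Path


def expected_image_path(store, image_url):
    """Guess where the default pipeline would save an image (assumed naming scheme)."""
    # Assumption: the file name is the SHA-1 hex digest of the URL plus ".jpg".
    name = hashlib.sha1(image_url.encode("utf-8")).hexdigest() + ".jpg"
    return Path(store) / "full" / name


# Hypothetical store folder and image URL, for illustration only.
path = expected_image_path(
    "images_download",
    "http://books.toscrape.com/media/cache/ab/cd/example.jpg",
)
print(path)
print(path.exists())
```

With the question's custom_settings value '.images_download/full', the same rule would give a path under .images_download/full/full.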