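【Question title】: Scrapy is not crawling any URLs
【Posted】: 2019-02-27 17:53:08
【Question】:

I tested my XPath expressions by running my code in the scrapy shell, and everything seemed to work. But when I run the spider it crawls 0 pages, and I can't see why. Here is the log output: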

    2019-02-27 18:04:47 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: jumia)
    2019-02-27 18:04:47 [scrapy.utils.log] INFO: Versions: lxml 4.3.0.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 2.7.15+ (default, Nov 28 2018, 16:27:22) - [GCC 8.2.0], pyOpenSSL 18.0.0 (OpenSSL 1.1.0j  20 Nov 2018), cryptography 2.4.2, Platform Linux-4.19.0-kali1-amd64-x86_64-with-Kali-kali-rolling-kali-rolling
    2019-02-27 18:04:47 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'jumia.spiders', 'SPIDER_MODULES': ['jumia.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'jumia'}
    2019-02-27 18:04:47 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.corestats.CoreStats']
    2019-02-27 18:04:47 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2019-02-27 18:04:47 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2019-02-27 18:04:47 [scrapy.middleware] INFO: Enabled item pipelines: []
    2019-02-27 18:04:47 [scrapy.core.engine] INFO: Spider opened
    2019-02-27 18:04:47 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2019-02-27 18:04:47 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6029
    2019-02-27 18:04:47 [scrapy.core.engine] INFO: Closing spider (finished)
    2019-02-27 18:04:47 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'finish_reason': 'finished',
     'finish_time': datetime.datetime(2019, 2, 27, 17, 4, 47, 950397),
     'log_count/DEBUG': 1,
     'log_count/INFO': 7,
     'memusage/max': 53383168,
     'memusage/startup': 53383168,
     'start_time': datetime.datetime(2019, 2, 27, 17, 4, 47, 947520)}
    2019-02-27 18:04:47 [scrapy.core.engine] INFO: Spider closed (finished)

Here is my spider code:

    import scrapy
    from scrapy.loader import ItemLoader
    from scrapy.loader.processors import MapCompose
    from scrapy.loader.processors import TakeFirst
    from jumia.items import JumiaItem


    class ProductDetails (scrapy.Spider):
        name = "jumiaProject"
        start_url = ["https://www.jumia.com.ng/computing/hp/"]

        def parse (self, response):

            search_results = response.css('section.products.-mabaya > div')

            for product in search_results: 

                product_loader = ItemLoader(item=JumiaItem(), selector=product)

                product_loader.add_css('brand','h2.title > span.brand::text')

                product_loader.add_css('name', 'h2.title > span.name::text')

                product_loader.add_css('link', 'a.link::attr(href)')


                yield product_loader.load_item()

Here is my items.py:

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html

    import scrapy
    from scrapy.loader.processors import MapCompose


    class JumiatesteItem(scrapy.Item):
        # define the fields for your item here like:
        name = scrapy.Field()
        brand = scrapy.Field()
        price = scrapy.Field()
        link = scrapy.Field()

【Comments】:

    Tags: python scrapy-spider


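    【Solution 1】:

    The correct attribute name in the spider is start_urls, not start_url. Scrapy only reads start_urls when it builds the initial requests; because the name is misspelled, it finds no URLs, schedules nothing, and the spider opens and closes immediately with 0 pages crawled.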
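
To illustrate why the misspelled attribute is silently ignored, here is a minimal sketch in plain Python. `MiniSpider` is a hypothetical, simplified stand-in written for this example, not Scrapy's actual class; it only mimics the two relevant behaviours (falling back to an empty `start_urls` list and iterating over that list to produce the initial requests).

```python
class MiniSpider:
    """Hypothetical, simplified stand-in for scrapy.Spider (illustration only)."""

    def __init__(self):
        # Assumption mirrored from Scrapy: if the subclass does not define
        # `start_urls`, fall back to an empty list instead of raising an error.
        if not hasattr(self, "start_urls"):
            self.start_urls = []

    def start_requests(self):
        # Only `start_urls` is consulted; an attribute with any other name
        # (such as `start_url`) is never looked at.
        for url in self.start_urls:
            yield url


class Misspelled(MiniSpider):
    # Typo from the question: the trailing "s" is missing.
    start_url = ["https://www.jumia.com.ng/computing/hp/"]


class Corrected(MiniSpider):
    start_urls = ["https://www.jumia.com.ng/computing/hp/"]


misspelled_requests = list(Misspelled().start_requests())
corrected_requests = list(Corrected().start_requests())

print(len(misspelled_requests))  # number of initial requests with the typo
print(len(corrected_requests))   # number of initial requests after the fix
```

Because a missing attribute is treated as "no start URLs" rather than as an error, the typo produces no traceback, which matches the log in the question: the spider opens, has nothing to request, and finishes.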

    【Comments】:

    • Thanks for the helpful information, it worked. I can't believe I missed the s. Thank you very much.