【问题标题】:Scrapy spider finishing scraping process without scraping anythingScrapy spider 完成刮削过程而不刮任何东西
【发布时间】:2019-02-01 01:58:45
【问题描述】:

我有一只蜘蛛,它会在亚马逊上搜索信息。

蜘蛛读取一个 .txt 文件,我在其中写下它必须搜索的产品,然后进入该产品的亚马逊页面,例如:

https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=laptop

我使用关键字=laptop 来更改要搜索的产品等。

我遇到的问题是蜘蛛无法正常工作,这很奇怪,因为一周前它还可以正常工作。

此外,控制台上没有出现任何错误,蜘蛛启动,“抓取”关键字,然后停止。

这是完整的蜘蛛

import scrapy
import re
import string
import random
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from genericScraper.items import GenericItem
from scrapy.exceptions import CloseSpider
from scrapy.http import Request

class GenericScraperSpider(CrawlSpider):

    name = "generic_spider"

    #Dominio permitido
    allowed_domain = ['www.amazon.com']

    search_url = 'https://www.amazon.com/s?field-keywords={}'

    custom_settings = {

        'FEED_FORMAT': 'csv',
        'FEED_URI' : 'datosGenericos.csv'

    }

    rules = {

        #Gets all the elements in page 1 of the keyword i search
        Rule(LinkExtractor(allow =(), restrict_xpaths = ('//*[contains(@class, "s-access-detail-page")]') ), 
                            callback = 'parse_item', follow = False)

}


    def start_requests(self):

        txtfile = open('productosGenericosABuscar.txt', 'r')

        keywords = txtfile.readlines()

        txtfile.close()

        for keyword in keywords:

            yield Request(self.search_url.format(keyword))



    def parse_item(self,response):

        genericAmz_item = GenericItem()


        #info de producto
        categoria = response.xpath('normalize-space(//span[contains(@class, "a-list-item")]//a/text())').extract_first()

        genericAmz_item['nombreProducto'] = response.xpath('normalize-space(//span[contains(@id, "productTitle")]/text())').extract()
        genericAmz_item['precioProducto'] = response.xpath('//span[contains(@id, "priceblock")]/text()'.strip()).extract()
        genericAmz_item['opinionesProducto'] = response.xpath('//div[contains(@id, "averageCustomerReviews_feature_div")]//i//span[contains(@class, "a-icon-alt")]/text()'.strip()).extract()
        genericAmz_item['urlProducto'] = response.request.url
        genericAmz_item['categoriaProducto'] = re.sub('Back to search results for |"','', categoria) 

        yield genericAmz_item

我制作的其他具有类似结构的蜘蛛也可以工作,知道发生了什么吗?

这是我在控制台中得到的内容

2019-01-31 22:49:26 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: genericScraper)
2019-01-31 22:49:26 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.7.0,                     Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p  14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.17134-SP0
2019-01-31 22:49:26 [scrapy.crawler] INFO: Overridden settings:         {'AUTOTHROTTLE_ENABLED': True, 'BOT_NAME': 'genericScraper', 'DOWNLOAD_DELAY':     3, 'FEED_FORMAT': 'csv', 'FEED_URI': 'datosGenericos.csv', 'NEWSPIDER_MODULE':     'genericScraper.spiders', 'SPIDER_MODULES': ['genericScraper.spiders'],     'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36     (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'}
2019-01-31 22:49:26 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.throttle.AutoThrottle']
2019-01-31 22:49:26 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-01-31 22:49:26 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-01-31 22:49:26 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-01-31 22:49:26 [scrapy.core.engine] INFO: Spider opened
2019-01-31 22:49:26 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-31 22:49:26 [scrapy.extensions.telnet] DEBUG: Telnet console listening on xxx.x.x.x:xxxx
2019-01-31 22:49:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/s?field-keywords=Laptop> (referer: None)
2019-01-31 22:49:27 [scrapy.core.engine] INFO: Closing spider (finished)
2019-01-31 22:49:27 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 315,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 2525,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 2, 1, 1, 49, 27, 375619),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2019, 2, 1, 1, 49, 26, 478037)}
2019-01-31 22:49:27 [scrapy.core.engine] INFO: Spider closed (finished)

【问题讨论】:

    标签: python-3.x web-scraping scrapy


    【解决方案1】:

    有趣!这可能是由于网站没有返回任何数据。您是否尝试过使用scrapy shell 进行调试。如果没有,请尝试检查 response.body 返回您想要抓取的预期数据。

    def parse_item(self,response):
         from scrapy.shell import inspect_response
         inspect_response(response, self)
    

    更多详情,请阅读scrapy shell的详细信息


    调试后,如果您仍然没有得到预期的数据,则意味着站点中有更多内容阻碍了抓取过程。这可能是动态脚本或cookie/local-storage/session 依赖项。

    对于动态/JS脚本,可以使用seleniumsplash
    selenium-with-scrapy-for-dynamic-page
    handling-javascript-in-scrapy-with-splash

    对于cookie/local-storage/session,您必须更深入地查看inspect 窗口并找出获取数据所必需的。

    【讨论】:

    • 感谢您的帮助!,尝试调试,似乎没有任何问题,所以我再次运行蜘蛛并且......它工作了。下次发生这种情况时要重新启动我的电脑并尝试重新运行蜘蛛
    猜你喜欢
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 2012-11-26
    • 2013-09-24
    • 1970-01-01
    • 1970-01-01
    • 2014-07-16
    相关资源
    最近更新 更多