【Question title】: ERROR: Spider error processing in Scrapy module
【Posted】: 2018-10-19 08:15:29
【Question】:

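I wrote a web scraper with Scrapy that extracts the title and body text from search results. When I run the spider with the command

scrapy crawl reddit

it shows

DEBUG: Crawled (200) <GET https://www.reddit.com/r/help/search?q=hydrochlorothiazide/> (referer: None)

ERROR: Spider error processing <GET https://www.reddit.com/r/help/search?q=hydrochlorothiazide/> (referer: None)

However, if I run the same commands one at a time in a Scrapy shell, the page is scraped correctly. Can someone help me solve this problem?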

import scrapy

class RedditSpider(scrapy.Spider):
    name = 'reddit'
    allowed_domains = ['www.reddit.com']
    start_urls = ['https://www.reddit.com/r/help/search?q=hydrochlorothiazide/']

    def parse(self, response):
        #view(self.response)
        posts = response.xpath('//*[@class="search-result-group"]')
        for post in posts:
            header = post.xpath('//*[@class="search-result-header"]/a/text()').extract_first()
            text = post.xpath('//*[@class="md"]/p/text()').extract_first()
            yield{'Header':header,'Text':text}

【Comments】:

    Tags: python python-3.x web-scraping scrapy


    【Solution 1】:

    Which version of Scrapy are you using? Upgrade it to the latest version (1.5.0 at the time of writing).
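
    The version check suggested above can be sketched in Python as follows. This is a minimal sketch, not part of the original answer: `installed_version` is a hypothetical helper name, and it assumes Python 3.8 or newer, where `importlib.metadata` is in the standard library (the session below uses Python 3.5, where this module is not available).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        # The package is not installed in the current environment.
        return None


if __name__ == "__main__":
    # Prints the Scrapy version of the active environment, or None.
    print("Scrapy:", installed_version("scrapy"))
```

    Running `scrapy version` on the command line reports the same information, and `pip install --upgrade scrapy` performs the upgrade.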

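    Create an empty virtual environment and install Scrapy (on virtualenv 20 and later, omit --no-site-packages; the flag was removed because isolation from site packages is the default):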

    projects > $ virtualenv --no-site-packages --python=python3.5 venv
    ...
    Installing setuptools, pkg_resources, pip, wheel...done.
    projects > $ source venv/bin/activate
    [3.5.5](venv) projects > $ pip freeze
    pkg-resources==0.0.0
    [3.5.5](venv) projects > $ pip install scrapy
    ...
    Successfully installed Automat-0.6.0 PyDispatcher-2.0.5 Twisted-18.4.0
    asn1crypto-0.24.0 attrs-18.1.0 cffi-1.11.5 constantly-15.1.0     
    cryptography-2.2.2 cssselect-1.0.3 hyperlink-18.0.0 idna-2.6 
    incremental-17.5.0 lxml-4.2.1 parsel-1.4.0 pyOpenSSL-17.5.0 pyasn1-0.4.2 
    pyasn1-modules-0.2.1 pycparser-2.18 queuelib-1.5.0 scrapy-1.5.0 
    service-identity-17.0.0 six-1.11.0 w3lib-1.19.0 zope.interface-4.5.0
    [3.5.5](venv) projects > $ pip freeze
    asn1crypto==0.24.0
    attrs==18.1.0
    Automat==0.6.0
    cffi==1.11.5
    constantly==15.1.0
    cryptography==2.2.2
    cssselect==1.0.3
    hyperlink==18.0.0
    idna==2.6
    incremental==17.5.0
    lxml==4.2.1
    parsel==1.4.0
    pkg-resources==0.0.0
    pyasn1==0.4.2
    pyasn1-modules==0.2.1
    pycparser==2.18
    PyDispatcher==2.0.5
    pyOpenSSL==17.5.0
    queuelib==1.5.0
    Scrapy==1.5.0
    service-identity==17.0.0
    six==1.11.0
    Twisted==18.4.0
    w3lib==1.19.0
    zope.interface==4.5.0
    

    Create a Scrapy project and write your spider:

    [3.5.5](venv) projects > $ scrapy startproject reddit
    [3.5.5](venv) projects > $ cd reddit/reddit/spiders/
    [3.5.5](venv) spiders > $ touch spider.py && subl spider.py
    

    spider.py

    import scrapy
    
    class RedditSpider(scrapy.Spider):
        name = 'reddit'
        allowed_domains = ['www.reddit.com']
        start_urls = ['https://www.reddit.com/r/help/search?q=hydrochlorothiazide/']
    
        def parse(self, response):
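            # Each direct child <div> of the "contents" container is one search result.
            posts = response.xpath('//*[@class="contents"]/div')
            for post in posts:
                # The leading "." makes each XPath relative to the current post;
                # a bare "//" would search the whole document on every iteration.
                header = post.xpath('.//*[@class="search-result-header"]/a/text()').extract_first()
                # Join every paragraph of the post body instead of taking only the first.
                text = '\n'.join(post.xpath('.//*[@class="md"]/p/text()').extract())
                yield {'Header': header, 'Text': text}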
    

    Run the spider:

    [3.5.5](venv) spiders > $ scrapy crawl reddit
    ...
    [scrapy.core.engine] INFO: Spider opened
    [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
    [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.reddit.com/robots.txt> (referer: None)
    [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.reddit.com/r/help/search?q=hydrochlorothiazide/> (referer: None)
    [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.reddit.com/r/help/search?q=hydrochlorothiazide/>
    ...
    {'Text': '', 'Header': 'Human medicines European public assessment report (EPAR): Irbesartan Hydrochlorothiazide Zentiva (previously Irbesartan Hydrochlorothiazide Winthrop), irbesartan / hydrochlorothiazide, Revision: 18, Authorised'}
    [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.reddit.com/r/help/search?q=hydrochlorothiazide/>
    {'Text': '', 'Header': 'Human medicines European public assessment report (EPAR): Irbesartan/Hydrochlorothiazide Teva, irbesartan / hydrochlorothiazide, Revision: 6, Authorised'}
    [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.reddit.com/r/help/search?q=hydrochlorothiazide/>
    {'Text': '', 'Header': 'Human medicines European public assessment report (EPAR): MicardisPlus, telmisartan / hydrochlorothiazide, Revision: 22, Authorised'}
    [scrapy.core.engine] INFO: Closing spider (finished)
    [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 511,
     'downloader/request_count': 2,
     'downloader/request_method_count/GET': 2,
     'downloader/response_bytes': 28254,
     'downloader/response_count': 2,
     'downloader/response_status_count/200': 2,
     'finish_reason': 'finished',
     'item_scraped_count': 22,
     'log_count/DEBUG': 25,
     'log_count/INFO': 7,
     'memusage/max': 53526528,
     'memusage/startup': 53526528,
     'response_received_count': 2,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1}
    [scrapy.core.engine] INFO: Spider closed (finished)
    

    【Discussion】:

    • Scrapy 1.5.0 - project: web_scraping. I am already using the latest version of Scrapy. Are you saying I need to reinstall Scrapy?