【Question title】: Scrapy run from script not working
【Posted】: 2013-09-18 14:08:38
【Question】:

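I am trying to run a Scrapy spider that works perfectly with scrapy crawl single, but I cannot get it to run from inside a Python script.

I know the documentation explains how to do this: https://scrapy.readthedocs.org/en/0.18/topics/practices.html, and I have also read this already-answered question (How to run Scrapy from within a Python script), but I still cannot make it work.

The main problem is that the SingleBlogSpider.parse method is never executed, although start_requests is.

Here is the code and the output of running the script. I also tried moving the execution to a separate file, but the same thing happens.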

from urlparse import urlparse
from scrapy.http import Request
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class SingleBlogSpider(BaseSpider):
    name = 'single'

    def __init__(self, **kwargs):
        super(SingleBlogSpider, self).__init__(**kwargs)

        url = kwargs.get('url') or kwargs.get('domain') or 'seaofshoes.com'
        if not url.startswith('http://') and not url.startswith('https://'):
            url = 'http://%s/' % url

        self.url = url
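        hostname = urlparse(url).hostname
        # lstrip('www.') strips any leading 'w' and '.' characters, not the
        # literal prefix, so remove the 'www.' prefix explicitly instead.
        if hostname.startswith('www.'):
            hostname = hostname[len('www.'):]
        self.allowed_domains = [hostname]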
        self.link_extractor = SgmlLinkExtractor()
        self.cookies_seen = set()

        print 0, self.url

    def start_requests(self):
        print '1', self.url
        return [Request(self.url, callback=self.parse)]

    def parse(self, response):
        print '2'
        # Actual scraper code, that is never executed

if __name__ == '__main__':
    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
    from scrapy import log, signals

    spider = SingleBlogSpider(domain='scrapinghub.com')

    crawler = Crawler(Settings())
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()

    log.start()
    reactor.run()

Output:

 0 http://scrapinghub.com/
 1 http://scrapinghub.com/
 2013-09-13 14:21:46-0500 [single] INFO: Closing spider (finished)
 2013-09-13 14:21:46-0500 [single] INFO: Dumping Scrapy stats:
     {'downloader/request_bytes': 221,
      'downloader/request_count': 1,
      'downloader/request_method_count/GET': 1,
      'downloader/response_bytes': 9403,
      'downloader/response_count': 1,
      'downloader/response_status_count/200': 1,
      'finish_reason': 'finished',
      'finish_time': datetime.datetime(2013, 9, 13, 19, 21, 46, 563184),
      'response_received_count': 1,
      'scheduler/dequeued': 1,
      'scheduler/dequeued/memory': 1,
      'scheduler/enqueued': 1,
      'scheduler/enqueued/memory': 1,
      'start_time': datetime.datetime(2013, 9, 13, 19, 21, 46, 328961)}
 2013-09-13 14:21:46-0500 [single] INFO: Spider closed (finished)

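The program never reaches SingleBlogSpider.parse to print '2', so it does not scrape anything. Yet, as the output shows, it does make the request, so I am not sure what is going on.

Scrapy version == 0.18.2

I really cannot spot the error, so any help is much appreciated.

Thanks!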

【Comments】:

    Tags: python scrapy


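    【Solution 1】:

    I believe that when you say you "cannot run it from a script", you actually mean that you cannot get the crawler to produce an output file. This is a bug in the documentation's code example: it builds the crawler from a bare Settings() object, which does not load your project's settings. Change your code to the following, which uses get_project_settings() instead.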

    if __name__ == '__main__':
        from twisted.internet import reactor
        from scrapy.crawler import Crawler
        from scrapy import log, signals
        from scrapy.utils.project import get_project_settings
    
        spider = SingleBlogSpider(domain='scrapinghub.com')
        settings = get_project_settings()
        crawler = Crawler(settings)
        crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        crawler.configure()
        crawler.crawl(spider)
        crawler.start()
    
        log.start()
        reactor.run()
    

    For further reading, see the linked answer.

    【Comments】:

      【Solution 2】:

      parse() is actually being executed. Its print output just is not shown.

      Just as a test, put a = b inside parse():

      def parse(self, response):
          a = b
      

      You will see exceptions.NameError: global name 'b' is not defined, which proves that the method body runs.
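
The same diagnostic can be tried outside Scrapy. The sketch below is a Python 3 illustration, not Scrapy code: it assumes standard output is being hidden (simulated here with contextlib.redirect_stdout) and uses a hypothetical helper, run_with_hidden_stdout, to show that a print inside a callback vanishes while a deliberate NameError still surfaces.

```python
import contextlib
import io


def parse(response):
    """Stand-in for the spider callback; `response` is unused here."""
    print('2')  # goes to whatever sys.stdout currently is
    a = b       # deliberate NameError: `b` is not defined anywhere


def run_with_hidden_stdout(callback):
    """Call `callback` with stdout redirected; return (hidden_text, error)."""
    hidden = io.StringIO()
    error = None
    with contextlib.redirect_stdout(hidden):
        try:
            callback(None)
        except NameError as exc:
            error = exc
    return hidden.getvalue(), error


if __name__ == '__main__':
    hidden_text, error = run_with_hidden_stdout(parse)
    print('hidden print output:', repr(hidden_text))
    print('error that still surfaced:', repr(error))
```

An exception is a useful probe because it does not depend on where stdout points: if the error appears, the callback ran, regardless of whether its print output was visible.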

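      【Comments】:

      • Wow, I feel stupid. If anyone is interested in getting the print output to show up, just remove the log.start() line. Thanks!
      • Once you start the log, you can no longer print to the console. The question now is: how do you get back the ability to print to the console after that?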
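
One possible approach to the last comment, offered as a sketch and not as a confirmed Scrapy fix: if the logging setup works by replacing sys.stdout (an assumption; the thread does not say how the output is hidden), Python still keeps the interpreter's original stream in sys.__stdout__. The helper console_print below is hypothetical and written for Python 3; the redirect is again simulated with contextlib.redirect_stdout.

```python
import contextlib
import io
import sys


def console_print(message, stream=None):
    """Write `message` to `stream`, defaulting to the interpreter's original stdout."""
    # Assumption: sys.stdout was replaced at the Python level, so
    # sys.__stdout__ still points at the real console.
    target = stream if stream is not None else sys.__stdout__
    target.write(message + '\n')
    target.flush()


if __name__ == '__main__':
    hidden = io.StringIO()
    with contextlib.redirect_stdout(hidden):
        print('swallowed by the redirect')
        console_print('2')  # bypasses the replaced sys.stdout
```

If the output is instead captured at the file-descriptor level, this would not help, and writing the message through the logging system itself is likely the more robust choice.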