【Question title】: How to save the data from a scrapy crawler into a variable?
【Posted】: 2016-11-21 08:04:08
【Question description】:

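I'm currently building a web app that displays data collected by a scrapy spider. The user makes a request, the spider crawls a website, and the data is returned to the app to be displayed. I'd like to retrieve the data directly from the scraper, without relying on an intermediate .csv or .json file. Something like: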

from scrapy.crawler import CrawlerProcess
from scraper.spiders import MySpider

url = 'www.example.com'
spider = MySpider()
crawler = CrawlerProcess()
crawler.crawl(spider, start_urls=[url])
crawler.start()
data = crawler.data # this bit

【Question comments】:

    Tags: python scrapy


    【Solution 1】:

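    This is not so easy, because Scrapy is non-blocking and works in an event loop. It uses the Twisted event loop, which is not restartable, so you can't write crawler.start(); data = crawler.data: after crawler.start() the process runs forever, calling registered callbacks, until it is killed or ends.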
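
    For a one-off script (as opposed to a long-running web app), a minimal sketch of collecting items in memory is shown here. It assumes Scrapy is installed and that the crawl runs only once per process, since the reactor cannot be restarted. The names ItemCollector and run_spider are hypothetical helpers, not part of Scrapy.

```python
class ItemCollector:
    """Accumulates items delivered by Scrapy's item_scraped signal."""

    def __init__(self):
        self.items = []

    def on_item_scraped(self, item, response, spider):
        # Signature mirrors the keyword arguments sent with item_scraped.
        self.items.append(item)


def run_spider(spider_cls, **spider_kwargs):
    """Run one crawl to completion and return the scraped items.

    Assumption: called at most once per process, since the Twisted
    reactor cannot be restarted after CrawlerProcess.start() returns.
    """
    # Imported here so ItemCollector stays usable without Scrapy installed.
    from scrapy import signals
    from scrapy.crawler import CrawlerProcess

    collector = ItemCollector()
    process = CrawlerProcess(settings={"LOG_ENABLED": False})
    crawler = process.create_crawler(spider_cls)
    crawler.signals.connect(collector.on_item_scraped, signal=signals.item_scraped)
    process.crawl(crawler, **spider_kwargs)
    process.start()  # blocks until the crawl has finished
    return collector.items
```

    Because start() blocks and can only be called once, this pattern suits command-line scripts rather than a server that must answer many requests.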

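    These answers may be relevant:

    If you use an event loop in your app (e.g. you have a Twisted or Tornado web server), then it is possible to get the data from a crawl without storing it to disk. The idea is to listen to the item_scraped signal. I'm using the following helper to make it nicer: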

    import collections
    
    from twisted.internet.defer import Deferred
    from scrapy.crawler import Crawler
    from scrapy import signals
    
    def scrape_items(crawler_runner, crawler_or_spidercls, *args, **kwargs):
        """
        Start a crawl and return an object (ItemCursor instance)
        which allows to retrieve scraped items and wait for items
        to become available.
    
        Example:
    
        .. code-block:: python
    
            @inlineCallbacks
            def f():
                runner = CrawlerRunner()
                async_items = scrape_items(runner, my_spider)
                while (yield async_items.fetch_next):
                    item = async_items.next_item()
                    # ...
                # ...
    
        This convoluted way to write a loop should become unnecessary
        in Python 3.5 because of ``async for``.
        """
        crawler = crawler_runner.create_crawler(crawler_or_spidercls)    
        d = crawler_runner.crawl(crawler, *args, **kwargs)
        return ItemCursor(d, crawler)
    
    
    class ItemCursor(object):
        def __init__(self, crawl_d, crawler):
            self.crawl_d = crawl_d
            self.crawler = crawler
    
            # Set up state before registering callbacks, in case crawl_d
            # has already fired and the callbacks run immediately.
            self.closed = False
            self._items_available = Deferred()
            self._items = collections.deque()
    
            crawler.signals.connect(self._on_item_scraped, signals.item_scraped)
    
            crawl_d.addCallback(self._on_finished)
            crawl_d.addErrback(self._on_error)
    
        def _on_item_scraped(self, item):
            self._items.append(item)
            self._items_available.callback(True)
            self._items_available = Deferred()
    
        def _on_finished(self, result):
            self.closed = True
            self._items_available.callback(False)
    
        def _on_error(self, failure):
            self.closed = True
            self._items_available.errback(failure)
    
        @property
        def fetch_next(self):
            """
            A Deferred used with ``inlineCallbacks`` or ``gen.coroutine`` to
            asynchronously retrieve the next item, waiting for an item to be
            crawled if necessary. Resolves to ``False`` if the crawl is finished,
            otherwise :meth:`next_item` is guaranteed to return an item
            (a dict or a scrapy.Item instance).
            """
            if self.closed:
                # crawl is finished
                d = Deferred()
                d.callback(False)
                return d
    
            if self._items:
                # result is ready
                d = Deferred()
                d.callback(True)
                return d
    
            # We're active, but item is not ready yet. Return a Deferred which
            # resolves to True if item is scraped or to False if crawl is stopped.
            return self._items_available
    
        def next_item(self):
            """Get a document from the most recently fetched batch, or ``None``.
            See :attr:`fetch_next`.
            """
            if not self._items:
                return None
            return self._items.popleft()
    

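    The API is inspired by motor, a MongoDB driver for async frameworks. Using scrape_items, you can get items from Twisted or Tornado callbacks as soon as they are scraped, in a way similar to how you fetch items from a MongoDB query.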
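
    The docstring above notes that ``async for`` should make the fetch_next loop unnecessary. As an illustration of that idea only, here is a standard-library asyncio sketch of the same cursor pattern. It does not use Scrapy or Twisted: AsyncItemCursor and fake_crawl are hypothetical stand-ins, and fake_crawl simulates a spider by pushing three items.

```python
import asyncio

_DONE = object()  # sentinel marking the end of the (simulated) crawl


class AsyncItemCursor:
    """Async iterator over items pushed by a producer (hypothetical helper)."""

    def __init__(self):
        self._queue = asyncio.Queue()

    def on_item_scraped(self, item):
        # In a real crawl this would be connected to the item_scraped signal.
        self._queue.put_nowait(item)

    def on_finished(self):
        self._queue.put_nowait(_DONE)

    def __aiter__(self):
        return self

    async def __anext__(self):
        item = await self._queue.get()
        if item is _DONE:
            raise StopAsyncIteration
        return item


async def fake_crawl(cursor):
    # Stand-in for a spider: emits three items, then signals completion.
    for i in range(3):
        await asyncio.sleep(0)
        cursor.on_item_scraped({"n": i})
    cursor.on_finished()


async def main():
    cursor = AsyncItemCursor()
    producer = asyncio.ensure_future(fake_crawl(cursor))
    # Items are consumed as they arrive, without an explicit fetch loop.
    items = [item async for item in cursor]
    await producer
    return items


if __name__ == "__main__":
    print(asyncio.run(main()))
```

    The queue plus sentinel plays the role of the deque and the ``closed`` flag in ItemCursor; bridging it to a real Scrapy crawl would additionally require running Twisted on the asyncio reactor, which this sketch does not attempt.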

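    【Comments】:

      【Solution 2】:

      This is probably too late, but it may help others: you can pass a callback function to the Spider and call that function to return your data, like so:

      The dummy spider that we are going to use: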

      from scrapy import Spider
      
      
      class Trial(Spider):
          name = 'trial'
      
          start_urls = ['']
      
          def __init__(self, **kwargs):
              super().__init__(**kwargs)
              self.output_callback = kwargs.get('args').get('callback')
      
          def parse(self, response):
              pass
      
          def close(self, spider, reason):
              self.output_callback(['Hi, This is the output.'])
      

      A custom class with the callback:

      from scrapy.crawler import CrawlerProcess
      from scrapyapp.spiders.trial_spider import Trial
      
      
      class CustomCrawler:
      
          def __init__(self):
              self.output = None
              self.process = CrawlerProcess(settings={'LOG_ENABLED': False})
      
          def yield_output(self, data):
              self.output = data
      
          def crawl(self, cls):
              self.process.crawl(cls, args={'callback': self.yield_output})
              self.process.start()
      
      
      def crawl_static(cls):
          crawler = CustomCrawler()
          crawler.crawl(cls)
          return crawler.output
      

      Then you can do:

      out = crawl_static(Trial)
      print(out)
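
      The dummy spider above hands a fixed string to the callback. To pass real parsed data instead, one option is to accumulate items while parsing and hand the whole list over when the spider closes. The sketch below assumes this approach; CollectingMixin and its collect method are hypothetical names, and the class is written as a mixin so it can be exercised without Scrapy installed.

```python
class CollectingMixin:
    """Hypothetical mixin: collects items and reports them when the spider closes.

    Intended to be combined with scrapy.Spider, e.g.
    ``class Trial(CollectingMixin, scrapy.Spider)``.
    """

    def __init__(self, *args, callback=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.output_callback = callback
        self.collected = []

    def collect(self, item):
        # Remember the item and return it so parse() can still yield it.
        self.collected.append(item)
        return item

    def closed(self, reason):
        # Scrapy calls closed(reason) on a spider when the crawl ends.
        if self.output_callback is not None:
            self.output_callback(self.collected)


if __name__ == "__main__":
    # Stand-alone demonstration without Scrapy: drive the hooks by hand.
    received = []
    demo = CollectingMixin(callback=received.extend)
    demo.collect({"title": "first"})
    demo.collect({"title": "second"})
    demo.closed("finished")
    print(received)
```

      With Scrapy, a spider inheriting from both CollectingMixin and Spider would yield self.collect(item) from parse, and the crawl would be started with something like process.crawl(Trial, callback=self.yield_output).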
      

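      【Comments】:

      • How do I pass the parsed data to output_callback(...)? I mean, where would you get the parsed data in a single variable?
      【Solution 3】:

      You can pass a variable as an attribute of the spider class and store the data in it.

      Of course, you need to add the attribute in the __init__ method of your spider class.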

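      from scrapy.crawler import CrawlerProcess
      from scraper.spiders import MySpider
      
      url = 'http://www.example.com'  # Scrapy requires a URL scheme
      crawler = CrawlerProcess()
      data = []
      # Pass the spider class, not an instance; keyword arguments
      # are forwarded to MySpider.__init__.
      crawler.crawl(MySpider, start_urls=[url], data=data)
      crawler.start()  # blocks until the crawl finishes
      print(data)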
      

      【Comments】:
