【Question title】: How does Scrapy process the result of a Request's callback function?
【Posted】: 2014-12-27 07:46:51
【Question】:

Can someone explain how Scrapy calls a Request's callback function and processes its result?

I know that Scrapy accepts either a single object (Request, BaseItem, or None) or an iterable of such objects as the result. For example:

1. Returning a single object (Request, BaseItem, or None)

def parse(self, response):
    ...
    return scrapy.Request(...)

2. Returning an iterable of objects

def parse(self, response):
    ...
    for url in self.urls:
        yield scrapy.Request(...)

I assume that somewhere in Scrapy's code they are handled roughly like this:

# Assume process_callback_result is a function that is called after
# a Request's callback function has been executed.
# The "result" parameter is the callback's return value.

def process_callback_result(self, result):

    if isinstance(result, scrapy.Request):
        self.process_request(result)

    elif isinstance(result, scrapy.BaseItem):
        self.process_item(result)

    elif result is None:
        pass

    elif isinstance(result, collections.Iterable):
        for obj in result:
            self.process_callback_result(obj)
    else:
        # show an error message
        # ...
        pass

I found the corresponding code in the _process_spidermw_output function in <PYTHON_HOME>/Lib/site-packages/scrapy/core/scraper.py:

def _process_spidermw_output(self, output, request, response, spider):
    """Process each Request/Item (given in the output parameter) returned
    from the given spider
    """
    if isinstance(output, Request):
        self.crawler.engine.crawl(request=output, spider=spider)
    elif isinstance(output, BaseItem):
        self.slot.itemproc_size += 1
        dfd = self.itemproc.process_item(output, spider)
        dfd.addBoth(self._itemproc_finished, output, response, spider)
        return dfd
    elif output is None:
        pass
    else:
        typename = type(output).__name__
        log.msg(format='Spider must return Request, BaseItem or None, '
                       'got %(typename)r in %(request)s',
                level=log.ERROR, spider=spider, request=request, typename=typename)

But I cannot find the part that corresponds to the elif isinstance(result, collections.Iterable): branch.

【Discussion】:

    Tags: python callback scrapy iterable


    【Answer 1】:

    That is because _process_spidermw_output is only the handler for a single item/object. It is called from scrapy.utils.defer.parallel. This is the function that handles the spider output:

    def handle_spider_output(self, result, request, response, spider):
        if not result:
            return defer_succeed(None)
        it = iter_errback(result, self.handle_spider_error, request, response, spider)
        dfd = parallel(it, self.concurrent_items,
                       self._process_spidermw_output, request, response, spider)
        return dfd
    

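    Source: https://github.com/scrapy/scrapy/blob/master/scrapy/core/scraper.py#L163-L169

    As you can see, it calls parallel and passes the _process_spidermw_output function as an argument. The parameter is named callable, and it is invoked once for each element of iterable, which holds the spider's results. The parallel function is: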

    def parallel(iterable, count, callable, *args, **named):
        """Execute a callable over the objects in the given iterable, in parallel,
        using no more than ``count`` concurrent calls.
        Taken from: http://jcalderone.livejournal.com/24285.html
        """
        coop = task.Cooperator()
        work = (callable(elem, *args, **named) for elem in iterable)
        return defer.DeferredList([coop.coiterate(work) for i in xrange(count)])
    

    Source: https://github.com/scrapy/scrapy/blob/master/scrapy/utils/defer.py#L50-L58
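To make the role of count more concrete, here is a minimal sketch in plain Python, without Twisted, of several workers draining one shared generator. The name run_shared and the round-robin loop are hypothetical and only imitate the idea behind calling coop.coiterate(work) several times on the same generator; Twisted's real scheduling is different and is driven by Deferreds.

```python
def run_shared(iterable, count, func):
    """Simulate `count` cooperative workers pulling from one shared generator.

    Hypothetical helper, not part of Scrapy or Twisted.
    """
    # Lazy generator: func(elem) only runs when some worker calls next().
    work = (func(elem) for elem in iterable)
    log = []                      # (worker_index, result) in processing order
    active = list(range(count))   # workers that have not seen StopIteration yet
    while active:
        for worker in list(active):
            try:
                log.append((worker, next(work)))
            except StopIteration:
                # The shared generator is exhausted, so this worker stops.
                active.remove(worker)
    return log


log = run_shared([1, 2, 3, 4, 5], 2, lambda x: x * 10)
results = [value for _, value in log]
workers = [worker for worker, _ in log]
```

Because all workers share a single generator, each element is handled exactly once no matter how many workers exist; count only limits how many elements can be in progress at the same time.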

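    Basically, the flow is this:
    When enqueue_scrape is called, it adds the request and response to slot.queue by calling slot.add_response_request. The queue is then processed by _scrape_next, which calls self._scrape. The _scrape function registers handle_spider_output as a callback that will process the items from the iterator. The iterator is created when _scrape2 is called; at some point it calls the function call_spider, which registers scrapy.utils.spider.iterate_spider_output as a callback: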

    def iterate_spider_output(result):
        return [result] if isinstance(result, BaseItem) else arg_to_iter(result)
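The chain described above can be imitated without Scrapy or Twisted. The following is only a sketch under stated assumptions: Request, iterate_output, run_chain and collect_urls are hypothetical stand-ins, and a plain loop replaces the Deferred callback chain. It shows that whatever the spider callback returns is normalized to an iterable before the per-element stage sees it.

```python
class Request:
    """Hypothetical stand-in for scrapy.Request."""
    def __init__(self, url):
        self.url = url


def iterate_output(result):
    """Stand-in for iterate_spider_output: always hand back an iterable."""
    if result is None:
        return []
    if isinstance(result, Request):
        return [result]
    return result


def run_chain(value, *callbacks):
    """Feed each callback the previous callback's return value."""
    for callback in callbacks:
        value = callback(value)
    return value


def parse_single(response):
    return Request(response + "/next")


def parse_many(response):
    for page in (1, 2, 3):
        yield Request("%s/page/%d" % (response, page))


def parse_nothing(response):
    return None


def collect_urls(outputs):
    """Stand-in for the stage that handles the elements one by one."""
    return [request.url for request in outputs]


base = "http://example.com"
single = run_chain(base, parse_single, iterate_output, collect_urls)
many = run_chain(base, parse_many, iterate_output, collect_urls)
nothing = run_chain(base, parse_nothing, iterate_output, collect_urls)
```

All three callbacks end up in the same loop over an iterable, which is why the per-element handler needs no branch for iterables.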
    

    Finally, the function that actually converts a single item, None, or an iterable into an iterable is scrapy.utils.misc.arg_to_iter():

    def arg_to_iter(arg):
        """Convert an argument to an iterable. The argument can be a None, single
        value, or an iterable.
        Exception: if arg is a dict, [arg] will be returned
        """
        if arg is None:
            return []
        elif not isinstance(arg, _ITERABLE_SINGLE_VALUES) and hasattr(arg, '__iter__'):
            return arg
        else:
            return [arg]
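To see the effect of that function on the kinds of values a callback can return, here is a standalone copy that can be run without Scrapy. The contents of _ITERABLE_SINGLE_VALUES are an assumption made for this sketch (the real tuple is defined in scrapy.utils.misc and may differ between versions), and FakeRequest is a hypothetical placeholder for a Request object.

```python
# Assumption for this sketch only; Scrapy defines its own tuple.
_ITERABLE_SINGLE_VALUES = (dict, str, bytes)


def arg_to_iter(arg):
    """Standalone copy of the logic quoted above."""
    if arg is None:
        return []
    elif not isinstance(arg, _ITERABLE_SINGLE_VALUES) and hasattr(arg, "__iter__"):
        return arg
    else:
        return [arg]


class FakeRequest:
    """Hypothetical placeholder for a single Request object."""


req = FakeRequest()
gen = (n for n in range(3))

from_none = arg_to_iter(None)         # None becomes an empty list
from_single = arg_to_iter(req)        # a single object is wrapped in a list
from_gen = arg_to_iter(gen)           # a generator is passed through untouched
from_dict = arg_to_iter({"a": 1})     # a dict counts as one value
from_list = arg_to_iter([1, 2])       # a list is already iterable
```

A callback that returns one Request and a callback that yields many therefore look the same to the code that consumes the result: both produce something that can be iterated.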
    

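    【Comments】:

    • Does this mean that the return value (Request, BaseItem) of a callback function (e.g. def parse) is always turned into an iterable? Where is the code that converts a Request/BaseItem into an iterable? Starting from handle_spider_output, I only managed to trace back to the _scrape function, and I do not really understand the code in it.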