【Question title】: Python Scrapy - Yield statement not working as expected
【Posted】: 2016-04-08 21:17:09
【Question】:

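I have a Scrapy spider that looks like this. Basically, it takes a list of URLs, follows the internal links, and scrapes the external links. What I'm trying to do is make it synchronous, so that url_list is parsed in order.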

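from scrapy import Request, Spider
from scrapy.linkextractors import LinkExtractor
# CrawlsItem and get_domain are assumed to be defined elsewhere in the asker's project.

class SomeSpider(Spider):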
    name = 'grablinksync'
    url_list = ['http://www.sports.yahoo.com/', 'http://www.yellowpages.com/']
    allowed_domains = ['www.sports.yahoo.com', 'www.yellowpages.com']
    links_to_crawl = []
    parsed_links = 0

    def start_requests(self):
        # Initial request starts here
        start_url = self.url_list.pop(0)
        return [Request(start_url, callback=self.get_links_to_parse)]

    def get_links_to_parse(self, response):
        for link in LinkExtractor(allow=self.allowed_domains).extract_links(response):
            self.links_to_crawl.append(link.url)
            yield Request(link.url, callback=self.parse_obj, dont_filter=True)

    def start_next_request(self):
        self.parsed_links = 0
        self.links_to_crawl = []
        # All links have been parsed, now generate request for next URL
        if len(self.url_list) > 0:
            yield Request(self.url_list.pop(0), callback=self.get_links_to_parse)

    def parse_obj(self,response):
        self.parsed_links += 1
        for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            item = CrawlsItem()
            item['DomainName'] = get_domain(response.url)
            item['LinkToOtherDomain'] = link.url
            item['LinkFoundOn'] = response.url
            yield item
        if self.parsed_links == len(self.links_to_crawl):
            # This doesn't work
            self.start_next_request()

My problem is that the function start_next_request() is never called. If I move the code from the body of start_next_request() into parse_obj(), it works as expected.

def parse_obj(self, response):
    self.parsed_links += 1
    for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
        item = CrawlsItem()
        item['DomainName'] = get_domain(response.url)
        item['LinkToOtherDomain'] = link.url
        item['LinkFoundOn'] = response.url
        yield item
    if self.parsed_links == len(self.links_to_crawl):
        # This works..
        self.parsed_links = 0
        self.links_to_crawl = []
        # All links have been parsed, now generate request for next URL
        if len(self.url_list) > 0:
            yield Request(self.url_list.pop(0), callback=self.get_links_to_parse)

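I want to factor this out into start_next_request() because I plan to call it from several other places. I understand this has something to do with start_next_request() being a generator function, but I'm new to generators and yield, so I'm having a hard time figuring out what I did wrong.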

【Comments】:

  • Please read the posting guidelines carefully; you should extract a minimal example.

Tags: python scrapy yield


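【Answer 1】:

That's because yield turns the function into a generator, and merely writing self.start_next_request() does not make the generator do anything: the call only creates a generator object, which is then discarded.

Generators are lazy, meaning that none of the function body runs until you ask the generator for its first value.

You can change the code to: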
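A minimal plain-Python sketch of that laziness, with no Scrapy involved. The names start_next and log are made up for illustration:

```python
log = []

def start_next():
    # Nothing in this body runs until the generator is iterated.
    log.append("body started")
    yield "next request"

gen = start_next()         # only creates a generator object; the body has not run
calls_before = list(log)   # snapshot of log: still empty at this point

results = list(gen)        # iterating is what finally executes the body
calls_after = list(log)    # snapshot of log after iteration

print(calls_before, results, calls_after)
```

In the spider, the bare self.start_next_request() call corresponds to the line that builds gen here: the generator object is created and then thrown away without ever being iterated.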

def parse_obj(self,response):
    self.parsed_links += 1
    for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
        item = CrawlsItem()
        item['DomainName'] = get_domain(response.url)
        item['LinkToOtherDomain'] = link.url
        item['LinkFoundOn'] = response.url
        yield item
    if self.parsed_links == len(self.links_to_crawl):
        for res in self.start_next_request():
            yield res

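Returning the generator with return self.start_next_request() also works, but only from a callback that contains no yield of its own. parse_obj already yields items, so there the loop above is needed; on Python 3.3+ it can be shortened to yield from self.start_next_request().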
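The difference between calling, returning, and delegating to a generator can be checked in plain Python 3. This is only a sketch: the functions below are hypothetical stand-ins for the spider's callbacks, and strings stand in for Scrapy items and requests.

```python
def start_next_request(queue):
    # Hypothetical stand-in for the spider method: yields at most one "request".
    if queue:
        yield "request:" + queue.pop(0)

def callback_with_call_only(queue):
    yield "item"
    start_next_request(queue)             # generator created, then discarded

def callback_with_return(queue):
    yield "item"
    return start_next_request(queue)      # ends up in StopIteration.value, never yielded

def callback_with_yield_from(queue):
    yield "item"
    yield from start_next_request(queue)  # Python 3.3+: re-yields each value

def plain_callback(queue):
    # No yield here, so this is an ordinary function that returns an iterable.
    return start_next_request(queue)

call_only = list(callback_with_call_only(["a"]))
returned = list(callback_with_return(["a"]))
delegated = list(callback_with_yield_from(["a"]))
plain = list(plain_callback(["a"]))

print(call_only, returned, delegated, plain)
```

Only the yield from variant and the plain (non-generator) callback hand the next request to whoever iterates the callback's result, which in a spider is Scrapy itself.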
