【Question title】: Scrapy feed output contains the expected output several times instead of just once
【Posted】: 2016-07-14 06:44:34
【Question】:

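I wrote a spider whose sole purpose is to extract one number from http://www.funda.nl/koop/amsterdam/: the maximum page number shown in the pager at the bottom of the page (for example, the number 255 in the example below).

I managed to do this with a LinkExtractor whose regular expression matches the URLs of these pages. The spider is shown below: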

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess
from Funda.items import MaxPageItem

class FundaMaxPagesSpider(CrawlSpider):
    name = "Funda_max_pages"
    allowed_domains = ["funda.nl"]
    start_urls = ["http://www.funda.nl/koop/amsterdam/"]

    le_maxpage = LinkExtractor(allow=r'%s+p\d+' % start_urls[0])   # Link to a page containing thumbnails of several houses, such as http://www.funda.nl/koop/amsterdam/p10/

    rules = (
        Rule(le_maxpage, callback='get_max_page_number'),
    )

    def get_max_page_number(self, response):
        links = self.le_maxpage.extract_links(response)
        max_page_number = 0                                                 # Initialize the maximum page number
        page_numbers=[]
        for link in links:
            if link.url.count('/') == 6 and link.url.endswith('/'):         # Select only pages with a link depth of 3
                page_number = int(link.url.split("/")[-2].strip('p'))       # For example, get the number 10 out of the string 'http://www.funda.nl/koop/amsterdam/p10/'
                page_numbers.append(page_number)
                # if page_number > max_page_number:
                #     max_page_number = page_number                           # Update the maximum page number if the current value is larger than its previous value
        max_page_number = max(page_numbers)
        print("The maximum page number is %s" % max_page_number)
        yield {'max_page_number': max_page_number}
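
The link filter in the spider can be checked in isolation. The sketch below assumes that LinkExtractor applies its allow pattern as a regular-expression search on each absolute URL, and it uses only the standard re module. The candidate URLs are illustrative; the last one is hypothetical and is included only to show what the depth check removes.

```python
import re

start_url = "http://www.funda.nl/koop/amsterdam/"
# Same pattern as in the spider. The '+' applies to the trailing '/',
# and the dots are unescaped, so each matches any character.
pattern = re.compile(r'%s+p\d+' % start_url)

candidates = [
    "http://www.funda.nl/koop/amsterdam/p10/",
    "http://www.funda.nl/koop/amsterdam/p257/",
    "http://www.funda.nl/koop/amsterdam/",           # no page suffix
    "http://www.funda.nl/koop/rotterdam/p3/",        # different place
    "http://www.funda.nl/koop/amsterdam/p2/extra/",  # hypothetical deeper URL
]

# URLs the pattern accepts.
matched = [u for u in candidates if pattern.search(u)]

# The spider's extra check: exactly 6 slashes and a trailing slash.
kept = [u for u in matched if u.count('/') == 6 and u.endswith('/')]

page_numbers = [int(u.split("/")[-2].strip('p')) for u in kept]
print(matched)
print(page_numbers)
```

The pattern alone accepts the deeper URL as well, which is why the spider adds the slash-count check before converting the last path segment to an integer.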

If I run this with feed output by entering scrapy crawl Funda_max_pages -o funda_max_pages.json on the command line, the resulting JSON file looks like this:

[
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257},
{"max_page_number": 257}
]

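What I find strange is that the dict is output 7 times instead of just once. After all, the yield statement is outside the for loop. Can anyone explain this behavior?

【Comments】:

    Tags: python scrapy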


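    【Answer 1】:
    1. Your spider goes to the first start_url.
    2. It uses the LinkExtractor to extract 7 URLs.
    3. It downloads each of those 7 URLs and calls get_max_page_number on each response.
    4. For each URL, get_max_page_number yields one dict, so the feed ends up with 7 items.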
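
The steps above can be mimicked without Scrapy. This is only a toy simulation under stated assumptions: extract_page_links and crawl are hypothetical stand-ins for the LinkExtractor and the crawl loop, and the seven page numbers are made up (only the count of 7 and the maximum of 257 come from the question).

```python
def extract_page_links(url):
    """Hypothetical stand-in for the LinkExtractor.

    Assumes every page shows the same pager with 7 distinct links.
    """
    pages = (2, 3, 4, 5, 6, 7, 257)  # illustrative page numbers
    return ["http://www.funda.nl/koop/amsterdam/p%d/" % n for n in pages]


def get_max_page_number(url):
    """Callback: yields exactly one dict per response it is given."""
    links = extract_page_links(url)
    numbers = [int(link.split("/")[-2].strip("p")) for link in links]
    yield {"max_page_number": max(numbers)}


def crawl(start_url):
    """Toy crawl loop: follow each link from the start page once."""
    items = []
    for url in extract_page_links(start_url):   # steps 1 and 2
        items.extend(get_max_page_number(url))  # steps 3 and 4
    return items


items = crawl("http://www.funda.nl/koop/amsterdam/")
print(len(items))
```

The yield sits outside the for loop inside the callback, but the callback itself runs once per followed link, which is where the repetition comes from.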

    【Comments】:

      【Answer 2】:

      As a workaround, I now write the output to a text file that I can use, instead of relying on the JSON feed output:

      import scrapy
      from scrapy.spiders import CrawlSpider, Rule
      from scrapy.linkextractors import LinkExtractor
      from scrapy.crawler import CrawlerProcess
      
      class FundaMaxPagesSpider(CrawlSpider):
          name = "Funda_max_pages"
          allowed_domains = ["funda.nl"]
          start_urls = ["http://www.funda.nl/koop/amsterdam/"]
      
          le_maxpage = LinkExtractor(allow=r'%s+p\d+' % start_urls[0])   # Link to a page containing thumbnails of several houses, such as http://www.funda.nl/koop/amsterdam/p10/
      
          rules = (
              Rule(le_maxpage, callback='get_max_page_number'),
          )
      
          def get_max_page_number(self, response):
              links = self.le_maxpage.extract_links(response)
              max_page_number = 0                                                 # Initialize the maximum page number
              for link in links:
                  if link.url.count('/') == 6 and link.url.endswith('/'):         # Select only pages with a link depth of 3
                      print("The link is %s" % link.url)
                      page_number = int(link.url.split("/")[-2].strip('p'))       # For example, get the number 10 out of the string 'http://www.funda.nl/koop/amsterdam/p10/'
                      if page_number > max_page_number:
                          max_page_number = page_number                           # Update the maximum page number if the current value is larger than its previous value
              print("The maximum page number is %s" % max_page_number)
              place_name = link.url.split("/")[-3]                                # For example, "amsterdam" in 'http://www.funda.nl/koop/amsterdam/p10/'
              print("The place name is %s" % place_name)
              filename = str(place_name)+"_max_pages.txt"                         # File name with as prefix the place name
              with open(filename, 'w') as f:                                      # Text mode: writing a str to a 'wb' file fails on Python 3
                  f.write('max_page_number = %s' % max_page_number)               # Write the maximum page number to a text file
              yield {'max_page_number': max_page_number}
      
      process = CrawlerProcess({
          'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
      })
      
      process.crawl(FundaMaxPagesSpider)
      process.start() # the script will block here until the crawling is finished
      

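      I also adapted the spider so that it can be run as a script. The script generates a text file, amsterdam_max_pages.txt, containing the single line max_page_number = 257.

      【Comments】:

      • You are still crawling 7 URLs, but you overwrite the same file 7 times with max_page_number = 257...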