【Question Title】: Scrapy not calling parse function with start_requests
【Posted】: 2017-02-21 08:38:05
【Question】:

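I'm fairly new to Python and Scrapy, but something seems off. According to the documentation and examples, overriding the start_requests method makes Scrapy use the requests it returns instead of the start_urls list.

Everything works with start_urls, but once I add start_requests, the parse function is never entered. The documentation states that the parse method is

the default callback used by Scrapy to process downloaded responses, when their requests don't specify a callback

parse is never executed, judging by my logger output.

Here is my code; it is very short because I am just experimenting.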

import logging

import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess


class Crawler(scrapy.Spider):

    name = 'Hearthpwn'
    allowed_domains = ['hearthpwn.com']
    storage_dir = 'C:/Users/Michal/PycharmProjects/HearthpwnCrawler/'
    start_urls = ['http://www.hearthpwn.com/decks/645987-nzoth-warrior']

    def start_requests(self):

        logging.log(logging.INFO, "Loading requests")
        yield Request(url='http://www.hearthpwn.com/decks/646673-s31-legend-2eu-3asia-smorc-hunter')

    def parse(self, response):

        logging.log(logging.INFO, "parsing response")

        filename = response.url.split("/")[-1] + '.html'
        with open('html/' + filename, 'wb') as f:
            f.write(response.body)

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(Crawler)
process.start()

And the console output:

2016-10-12 15:33:39 [scrapy] INFO: Scrapy 1.2.0 started (bot: scrapybot)
2016-10-12 15:33:39 [scrapy] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2016-10-12 15:33:39 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats']
2016-10-12 15:33:39 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-10-12 15:33:39 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-10-12 15:33:39 [scrapy] INFO: Enabled item pipelines:
[]
2016-10-12 15:33:39 [scrapy] INFO: Spider opened
2016-10-12 15:33:39 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-12 15:33:39 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-10-12 15:33:39 [root] INFO: Loading requests
2016-10-12 15:33:41 [scrapy] DEBUG: Redirecting (302) to <GET http://www.hearthpwn.com/decks/646673-s31-legend-2eu-3asia-smorc-hunter?cookieTest=1> from <GET http://www.hearthpwn.com/decks/646673-s31-legend-2eu-3asia-smorc-hunter>
2016-10-12 15:33:41 [scrapy] DEBUG: Redirecting (302) to <GET http://www.hearthpwn.com/decks/646673-s31-legend-2eu-3asia-smorc-hunter> from <GET http://www.hearthpwn.com/decks/646673-s31-legend-2eu-3asia-smorc-hunter?cookieTest=1>
2016-10-12 15:33:41 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.hearthpwn.com/decks/646673-s31-legend-2eu-3asia-smorc-hunter> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-10-12 15:33:41 [scrapy] INFO: Closing spider (finished)
2016-10-12 15:33:41 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 655,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 1248,
 'downloader/response_count': 2,
 'downloader/response_status_count/302': 2,
 'dupefilter/filtered': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 10, 12, 13, 33, 41, 740724),
 'log_count/DEBUG': 4,
 'log_count/INFO': 8,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2016, 10, 12, 13, 33, 39, 441736)}
2016-10-12 15:33:41 [scrapy] INFO: Spider closed (finished)

Thanks for any clues.

【Question Discussion】:

    Tags: python request scrapy


    【Solution 1】:
    2016-10-12 15:33:41 [scrapy] DEBUG: Redirecting (302) to <GET http://www.hearthpwn.com/decks/646673-s31-legend-2eu-3asia-smorc-hunter?cookieTest=1> from <GET http://www.hearthpwn.com/decks/646673-s31-legend-2eu-3asia-smorc-hunter>
    2016-10-12 15:33:41 [scrapy] DEBUG: Redirecting (302) to <GET http://www.hearthpwn.com/decks/646673-s31-legend-2eu-3asia-smorc-hunter> from <GET http://www.hearthpwn.com/decks/646673-s31-legend-2eu-3asia-smorc-hunter?cookieTest=1>
    2016-10-12 15:33:41 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.hearthpwn.com/decks/646673-s31-legend-2eu-3asia-smorc-hunter> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
    

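    What happens here is that the website redirects you several times, so you end up requesting the same URL twice. Scrapy filters out duplicate requests by default (the filter is applied by the scheduler), so set dont_filter=True when creating the Request object to bypass this filter.

    For example: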

    def start_requests(self):
        # Request is scrapy.Request; dont_filter=True bypasses the duplicate filter
        yield Request('http://scrapy.org', dont_filter=True)
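
The log lines quoted above can be reproduced with a toy model, which may help to see why the second redirect is dropped. This is a minimal sketch in plain Python, not Scrapy's actual implementation: ToyRequest, ToyScheduler, toy_site and crawl are hypothetical names, and the server behaviour (set a cookie, redirect to ?cookieTest=1, redirect back) is an assumption inferred from the log. As far as I know, the default start_requests builds its requests from start_urls with dont_filter=True, which would explain why the start_urls version of the spider works.

```python
from dataclasses import dataclass

DECK = "http://www.hearthpwn.com/decks/646673-s31-legend-2eu-3asia-smorc-hunter"
PROBE = "?cookieTest=1"


@dataclass(frozen=True)
class ToyRequest:
    """Hypothetical stand-in for scrapy.Request (only the fields needed here)."""
    url: str
    dont_filter: bool = False


class ToyScheduler:
    """Toy scheduler with a duplicate filter keyed on the URL."""

    def __init__(self):
        self.seen = set()

    def enqueue(self, request):
        if request.dont_filter:
            return True   # flagged requests skip the duplicate check
        if request.url in self.seen:
            return False  # dropped as a duplicate
        self.seen.add(request.url)
        return True


def toy_site(url, has_cookie):
    """Assumed server behaviour: returns (redirect_target_or_None, has_cookie)."""
    if url.endswith(PROBE):
        return url[: -len(PROBE)], has_cookie  # bounce back to the clean URL
    if not has_cookie:
        return url + PROBE, True               # set a cookie, then test it
    return None, has_cookie                    # cookie present: serve the page


def crawl(start):
    """Return the URL whose response would reach the callback, or None if filtered."""
    scheduler = ToyScheduler()
    request, has_cookie = start, False
    while True:
        if not scheduler.enqueue(request):
            return None
        target, has_cookie = toy_site(request.url, has_cookie)
        if target is None:
            return request.url
        # In this model a redirected request inherits dont_filter from the original.
        request = ToyRequest(target, dont_filter=request.dont_filter)


print(crawl(ToyRequest(DECK)))                    # filtered on the way back
print(crawl(ToyRequest(DECK, dont_filter=True)))  # reaches the page
```

In the first call the clean URL is already in the seen set when the site redirects back to it, so the request is dropped and no callback runs, which matches the "Filtered duplicate request" line in the log.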
    

    【Discussion】:

      【Solution 2】:

      Setting the dont_merge_cookies key in the request's meta dictionary solves this problem.

          def start_requests(self):
      
              logging.log(logging.INFO, "Loading requests")
              yield Request(url='http://www.hearthpwn.com/decks/646673-s31-legend-2eu-3asia-smorc-hunter',
                            meta={'dont_merge_cookies': True})
      

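      【Discussion】:

      • Thanks to you and @Granitosaurus! While this answer was enough for my purposes, both gave me interesting insight. I ended up with the redirect-style links, and it is easy to parse the name back to its original form and save it.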
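
The comment above suggests that with dont_merge_cookies the crawl ends on the redirect-style URL (the one ending in ?cookieTest=1), in which case response.url.split("/")[-1] from the question would carry the query string into the filename. A minimal sketch of recovering the original name, using only the standard library; filename_for is a hypothetical helper, not part of Scrapy:

```python
from urllib.parse import urlsplit


def filename_for(url):
    """Build an .html filename from the last path segment, ignoring any query string."""
    path = urlsplit(url).path               # drops '?cookieTest=1' and any fragment
    slug = path.rstrip("/").split("/")[-1]  # last non-empty path segment
    return slug + ".html"


url = "http://www.hearthpwn.com/decks/646673-s31-legend-2eu-3asia-smorc-hunter?cookieTest=1"
print(filename_for(url))  # 646673-s31-legend-2eu-3asia-smorc-hunter.html
```

Inside parse, the same helper could be called as filename_for(response.url) before opening the output file.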