【Question Title】: Scrapy CrawlSpider isn't following the links on a particular page
【Posted】: 2014-04-29 07:11:56
【Question】:

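I built a spider to crawl a forum that requires login. I start it on the login page. The problem occurs on the page I direct the spider to after a successful login.

If I open up my rules to accept all links, the spider successfully follows the links on the login page. However, it doesn't follow any links on the page that I feed it using Request(). This suggests the problem isn't a botched XPath.

The login appears to work: the page_parse function writes the page source to a text file, and that source comes from the page I'm looking for, which can only be reached after logging in. However, the pipeline I have in place to take a screenshot of each page captures the login page but not the page I then send the spider to.

Here is the spider: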

class PLMSpider(CrawlSpider):
    name = 'plm'
    allowed_domains = ["patientslikeme.com"]
    start_urls = [
        "https://www.patientslikeme.com/login"
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=(r"patientslikeme.com/login")), callback='login_parse', follow=True),
        Rule(SgmlLinkExtractor(restrict_xpaths=("//div[@class='content-section']")), callback='post_parse', follow=False),
        Rule(SgmlLinkExtractor(restrict_xpaths=("//div[@class='pagination']")), callback='page_parse', follow=True),
    )

    def __init__(self, **kwargs):
        ScrapyFileLogObserver(open("debug.log", 'w'), level=logging.DEBUG).start()
        CrawlSpider.__init__(self, **kwargs)

    def post_parse(self, response):
        url = response.url
        log.msg("Post parse attempted for {0}".format(url))
        item = PLMItem()
        item['url'] = url
        return item

    def page_parse(self, response):
        url = response.url
        log.msg("Page parse attempted for {0}".format(url))
        item = PLMItem()
        item['url'] = url
        f = open("body.txt", "w")
        f.write(response.body)
        f.close()
        return item

    def login_parse(self, response):
        log.msg("Login attempted")
        return [FormRequest.from_response(response,
                    formdata={'userlogin[login]': username, 'userlogin[password]': password},
                    callback=self.after_login)]

    def after_login(self, response):
        log.msg("Post login")
        if "Login unsuccessful" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        else:
            return Request(url="https://www.patientslikeme.com/forum/diabetes2/topics",
               callback=self.page_parse)

Here is my debug log:

2014-03-21 18:22:05+0000 [scrapy] INFO: Scrapy 0.18.2 started (bot: plm)
2014-03-21 18:22:05+0000 [scrapy] DEBUG: Optional features available: ssl, http11
2014-03-21 18:22:05+0000 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'plm.spiders', 'ITEM_PIPELINES': {'plm.pipelines.ScreenshotPipeline': 1}, 'DEPTH_LIMIT': 5, 'SPIDER_MODULES': ['plm.spiders'], 'BOT_NAME': 'plm', 'DEPTH_PRIORITY': 1, 'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeue.FifoMemoryQueue', 'SCHEDULER_DISK_QUEUE': 'scrapy.squeue.PickleFifoDiskQueue'}
2014-03-21 18:22:05+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-03-21 18:22:06+0000 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-03-21 18:22:06+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-03-21 18:22:06+0000 [scrapy] DEBUG: Enabled item pipelines: ScreenshotPipeline
2014-03-21 18:22:06+0000 [plm] INFO: Spider opened
2014-03-21 18:22:06+0000 [plm] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-03-21 18:22:07+0000 [scrapy] INFO: Screenshooter initiated
2014-03-21 18:22:07+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-03-21 18:22:07+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-03-21 18:22:08+0000 [plm] DEBUG: Crawled (200) <GET https://www.patientslikeme.com/login> (referer: None)
2014-03-21 18:22:08+0000 [plm] DEBUG: Crawled (200) <GET https://www.patientslikeme.com/login> (referer: https://www.patientslikeme.com/login)
2014-03-21 18:22:08+0000 [scrapy] INFO: Login attempted
2014-03-21 18:22:08+0000 [plm] DEBUG: Filtered duplicate request: <GET https://www.patientslikeme.com/login> - no more duplicates will be shown (see DUPEFILTER_CLASS)
2014-03-21 18:22:09+0000 [plm] DEBUG: Redirecting (302) to <GET https://www.patientslikeme.com/profile/activity/all> from <POST https://www.patientslikeme.com/login>
2014-03-21 18:22:10+0000 [plm] DEBUG: Crawled (200) <GET https://www.patientslikeme.com/profile/activity/all> (referer: https://www.patientslikeme.com/login)
2014-03-21 18:22:10+0000 [scrapy] INFO: Post login
2014-03-21 18:22:10+0000 [plm] DEBUG: Crawled (200) <GET https://www.patientslikeme.com/forum/diabetes2/topics> (referer: https://www.patientslikeme.com/profile/activity/all)
2014-03-21 18:22:10+0000 [scrapy] INFO: Page parse attempted for https://www.patientslikeme.com/forum/diabetes2/topics
2014-03-21 18:22:10+0000 [scrapy] INFO: Screenshot attempted for https://www.patientslikeme.com/forum/diabetes2/topics
2014-03-21 18:22:15+0000 [plm] DEBUG: Scraped from <200 https://www.patientslikeme.com/forum/diabetes2/topics>

    {'url': 'https://www.patientslikeme.com/forum/diabetes2/topics'}
2014-03-21 18:22:15+0000 [plm] INFO: Closing spider (finished)
2014-03-21 18:22:15+0000 [plm] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 2068,
     'downloader/request_count': 5,
     'downloader/request_method_count/GET': 4,
     'downloader/request_method_count/POST': 1,
     'downloader/response_bytes': 53246,
     'downloader/response_count': 5,
     'downloader/response_status_count/200': 4,
     'downloader/response_status_count/302': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 3, 21, 18, 22, 15, 177000),
     'item_scraped_count': 1,
     'log_count/DEBUG': 13,
     'log_count/INFO': 8,
     'request_depth_max': 3,
     'response_received_count': 4,
     'scheduler/dequeued': 5,
     'scheduler/dequeued/memory': 5,
     'scheduler/enqueued': 5,
     'scheduler/enqueued/memory': 5,
     'start_time': datetime.datetime(2014, 3, 21, 18, 22, 6, 377000)}
2014-03-21 18:22:15+0000 [plm] INFO: Spider closed (finished)

Thanks for any help you can provide.

---- EDIT ----

I've tried implementing Paul t.'s suggestion. Unfortunately, I get the following error:

    Traceback (most recent call last):
      File "C:\Python27\lib\site-packages\scrapy\crawler.py", line 93, in start
        if self.start_crawling():
      File "C:\Python27\lib\site-packages\scrapy\crawler.py", line 168, in start_crawling
        return self.start_crawler() is not None
      File "C:\Python27\lib\site-packages\scrapy\crawler.py", line 158, in start_crawler
        crawler.start()
      File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 1213, in unwindGenerator
        return _inlineCallbacks(None, gen, Deferred())
    --- <exception caught here> ---
      File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 1070, in _inlineCallbacks
        result = g.send(result)
      File "C:\Python27\lib\site-packages\scrapy\crawler.py", line 74, in start
        yield self.schedule(spider, batches)
      File "C:\Python27\lib\site-packages\scrapy\crawler.py", line 61, in schedule
        requests.extend(batch)
    exceptions.TypeError: 'Request' object is not iterable

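Since the traceback doesn't point to a specific part of the spider, I'm struggling to work out where the problem is.

---- EDIT 2 ----

The problem was caused by the start_requests function provided by Paul t., which used return instead of yield. If I change it to yield, it works perfectly.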
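The return-versus-yield difference can be reproduced without Scrapy. The following is a minimal sketch, assuming a stand-in `Request` class and a hypothetical `schedule()` helper that only mimics the `requests.extend(batch)` call shown in the traceback; it is not Scrapy's actual scheduling code.

```python
class Request(object):
    """Stand-in for scrapy.http.Request; only the URL matters here."""

    def __init__(self, url):
        self.url = url


def start_requests_with_return():
    # Returns a bare Request object, which is not iterable.
    return Request("https://www.patientslikeme.com/login")


def start_requests_with_yield():
    # The yield turns this into a generator function, so calling it
    # returns an iterable that produces a single Request.
    yield Request("https://www.patientslikeme.com/login")


def schedule(batch):
    """Hypothetical helper mimicking requests.extend(batch)."""
    requests = []
    requests.extend(batch)
    return requests


# The generator version can be consumed by extend().
scheduled = schedule(start_requests_with_yield())

# The return version fails the same way as in the traceback.
try:
    schedule(start_requests_with_return())
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```

Returning a list such as `[Request(...)]` would presumably work as well, since the only requirement visible in the traceback is that the result can be iterated.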

【Question Discussion】:

    Tags: python-2.7 web-scraping scrapy


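    【Solution 1】:

    My suggestion is to trick the CrawlSpider:

    • make a manual request to the login page,
    • perform the login,
    • and only then use CrawlSpider's "magic", as if the CrawlSpider were starting from start_urls

    Here is an example: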

    class PLMSpider(CrawlSpider):
        name = 'plm'
        allowed_domains = ["patientslikeme.com"]
    
        # pseudo-start_url
        login_url = "https://www.patientslikeme.com/login"
    
        # start URLs used after login
        start_urls = [
            "https://www.patientslikeme.com/forum/diabetes2/topics",
        ]
    
        rules = (
            # you want to do the login only once, so it's probably cleaner
            # not to ask the CrawlSpider to follow links to the login page
            #Rule(SgmlLinkExtractor(allow=(r"patientslikeme.com/login")), callback='login_parse', follow=True),
    
            # you can also deny "/login" to be safe
            Rule(SgmlLinkExtractor(restrict_xpaths=("//div[@class='content-section']"),
                                   deny=('/login',)),
                 callback='post_parse', follow=False),
    
            Rule(SgmlLinkExtractor(restrict_xpaths=("//div[@class='pagination']"),
                                   deny=('/login',)),
                 callback='page_parse', follow=True),
        )
    
        def __init__(self, **kwargs):
            ScrapyFileLogObserver(open("debug.log", 'w'), level=logging.DEBUG).start()
            CrawlSpider.__init__(self, **kwargs)
    
        # by default start_urls pages will be sent to the parse method,
        # but parse() is rather special in CrawlSpider
        # so I suggest you create your own initial login request "manually"
        # and ask for it to be parsed by your specific callback
        def start_requests(self):
            yield Request(self.login_url, callback=self.login_parse)
    
        # you've got the login page, send credentials
        # (so far so good...)
        def login_parse(self, response):
            log.msg("Login attempted")
            return [FormRequest.from_response(response,
                        formdata={'userlogin[login]': username, 'userlogin[password]': password},
                        callback=self.after_login)]
    
        # so we got a response to the login thing
        # if we're good,
        # just do as if we were starting the crawl now,
        # basically doing what happens when you use start_urls
        def after_login(self, response):
            log.msg("Post login")
            if "Login unsuccessful" in response.body:
                self.log("Login failed", level=log.ERROR)
                return
            else:
                return [Request(url=u) for u in self.start_urls]
                # alternatively, you could even call CrawlSpider's start_requests() method directly
                # that's probably cleaner
                #return super(PLMSpider, self).start_requests()
    
        def post_parse(self, response):
            url = response.url
            log.msg("Post parse attempted for {0}".format(url))
            item = PLMItem()
            item['url'] = url
            return item
    
        def page_parse(self, response):
            url = response.url
            log.msg("Page parse attempted for {0}".format(url))
            item = PLMItem()
            item['url'] = url
            f = open("body.txt", "w")
            f.write(response.body)
            f.close()
            return item
    
        # if you want the start_urls pages to be parsed,
        # you need to tell CrawlSpider to do so by defining parse_start_url attribute
        # https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spiders/crawl.py#L38
        parse_start_url = page_parse
    

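    【Discussion】:

    • Thanks for the suggestion. I understand the logic of it (though, to be honest, I don't understand the flaw in my original design). However, I've run into an error; I've included it in the edit above. Do you know what causes it?
    • The order of the rules matters. The first rule that matches a link is the one that gets applied. So if your pagination links are also inside //div[@class='content-section'], only the post_parse rule will be applied to them, and because that rule has follow=False, the crawl won't continue from the pagination links. I can't confirm this, though, since I can't see those pages without logging in. So you could try putting the third rule before the second one.
    • I found the cause. The start_requests function in BaseSpider uses yield rather than return. If I change return to yield in the start_requests function you provided, it works fine. Thanks for your help!
    • Indeed, my bad... start_requests() must return an iterable: doc.scrapy.org/en/latest/topics/…. I'll fix it in my answer.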
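The rule-order comment above can be illustrated with a small pure-Python model. This is only a sketch of the "first matching rule wins" behaviour, not Scrapy's implementation: the `assign_links_to_rules` helper and the example paths are hypothetical, and, as the commenter notes, whether the real pagination block sits inside `content-section` is unconfirmed.

```python
def assign_links_to_rules(rules, links_by_rule):
    """Give each link to the first rule that extracts it.

    rules: list of (rule_name, follow) tuples in declaration order.
    links_by_rule: dict mapping rule_name to the links that rule extracts.
    Returns a dict mapping each link to the (rule_name, follow) that claimed it.
    """
    seen = set()
    assignment = {}
    for name, follow in rules:
        for link in links_by_rule[name]:
            if link in seen:
                continue  # an earlier rule already claimed this link
            seen.add(link)
            assignment[link] = (name, follow)
    return assignment


# Hypothetical page where the pagination block is nested inside
# content-section, so both rules extract the pagination link.
links = {
    "post_parse": ["/topics/1", "/topics/2", "/topics?page=2"],
    "page_parse": ["/topics?page=2"],
}

# Order used in the question: the follow=False rule comes first.
original_order = assign_links_to_rules(
    [("post_parse", False), ("page_parse", True)], links)

# Order suggested in the comment: the pagination rule comes first.
swapped_order = assign_links_to_rules(
    [("page_parse", True), ("post_parse", False)], links)
```

In this model, the pagination link is claimed by `post_parse` with `follow=False` under the original order, and by `page_parse` with `follow=True` once the rules are swapped.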
    【Solution 2】:

    Your login page is parsed by the parse_start_url method. You should redefine that method so that it parses the login page. Have a look at the documentation.
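As a rough illustration of why the callback matters, here is a toy model of the dispatch. It assumes, in line with the "parse() is rather special in CrawlSpider" remark in Solution 1, that link following happens only for responses handled by the default `parse()` callback, which also hands the page to `parse_start_url`. `TinyCrawlSpider`, `handle()`, and the `?page=2` link are hypothetical names made up for this sketch; this is not Scrapy's code.

```python
class TinyCrawlSpider(object):
    """Toy model: only the default parse() callback follows links."""

    def __init__(self):
        self.followed = []

    def parse(self, response):
        # Default callback: give the page to parse_start_url,
        # then follow the links found on it.
        items = self.parse_start_url(response)
        self.followed.extend(response["links"])
        return items

    def parse_start_url(self, response):
        # Hook meant to be overridden; produces nothing by default.
        return []

    def handle(self, response, callback=None):
        # A request without an explicit callback falls back to parse().
        return (callback or self.parse)(response)


class ForumSpider(TinyCrawlSpider):
    def page_parse(self, response):
        return [{"url": response["url"]}]

    # Same trick as in Solution 1: reuse page_parse for start pages.
    parse_start_url = page_parse


page = {
    "url": "https://www.patientslikeme.com/forum/diabetes2/topics",
    "links": ["/forum/diabetes2/topics?page=2"],
}

# Explicit callback, as in the question: an item comes back,
# but no links are followed.
direct = ForumSpider()
direct_items = direct.handle(page, callback=direct.page_parse)

# No callback, as in Solution 1: parse() runs, so the item is
# produced and the links are followed as well.
default = ForumSpider()
default_items = default.handle(page)
```

Under this model, the question's `Request(..., callback=self.page_parse)` yields one item and then stops, which matches the single scraped item in the debug log.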

    【Discussion】:
