【Question title】: I get a timeout error with Scrapy when crawling this page
【Posted】: 2018-12-17 20:52:25
【Question description】:

I can't crawl this page: https://www.adidas.pe/. Running scrapy crawl my_spider returns:

2018-12-17 15:36:39 [scrapy.core.engine] INFO: Spider opened
2018-12-17 15:36:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-12-17 15:36:39 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2018-12-17 15:36:39 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.adidas.pe/> from <GET http://adidas.pe/>
2018-12-17 15:37:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-12-17 15:38:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

I tried changing these values in settings.py:

COOKIES_ENABLED = True
ROBOTSTXT_OBEY = False

It did not help; the crawl still hangs.

【Discussion】:

    Tags: web-scraping scrapy web-crawler scrapy-spider


    【Solution 1】:

    You can try changing USER_AGENT in settings.py to a browser-like string; that worked for me. My settings.py:

    # -*- coding: utf-8 -*-
    
    # Scrapy settings for adidas project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'adidas'
    
    SPIDER_MODULES = ['adidas.spiders']
    NEWSPIDER_MODULE = 'adidas.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
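
A possible complement to the settings above, offered as a sketch rather than as part of the original answer: the log in the question shows the crawl sitting idle for minutes after the redirect, which is consistent with Scrapy's default DOWNLOAD_TIMEOUT of 180 seconds. Lowering the timeout and the retry count should make a blocked request fail sooner, which makes it easier to tell whether a new USER_AGENT actually helped. The values 30 and 1 are illustrative assumptions.

```python
# Optional additions to settings.py (illustrative values, not from the original answer).

# Give up on a request that has not completed after 30 seconds,
# instead of waiting for Scrapy's default of 180 seconds.
DOWNLOAD_TIMEOUT = 30

# Retry a failed request at most once (Scrapy's default is 2 retries),
# so a blocked page is reported quickly while debugging.
RETRY_TIMES = 1
```

To try a different user agent without editing the file, the setting can also be overridden for a single run with the -s option, for example: scrapy crawl my_spider -s USER_AGENT="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"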
    

    【Discussion】:
