【Question title】: scrapy (or selenium) freezing after being redirected to a different website
【Posted】: 2013-12-20 01:43:47
【Question description】:

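I am running a scrapy CrawlSpider with Selenium, and I have run into a strange problem. The spider crawls for a while and then freezes: it appears to do nothing, or to be stuck at some point. This happens every time, so to force the spider to stop I have to kill the PhantomJS driver. My spider works fine on external websites, but whenever I try it on my custom localhost website, it freezes. Here is the error log: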

scrapy crawl image -o test.csv -t csv
2013-12-19 18:12:43-0700 [scrapy] INFO: Scrapy 0.20.2 started (bot: cultr)
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Optional features available: ssl, http11
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'cultr.spiders', 'FEED_URI': 'test.csv', 'SPIDER_MODULES': ['cultr.spiders'], 'BOT_NAME': 'cultr', 'USER_AGENT': 'cultr (+http://cultr.business.ualberta.ca)', 'FEED_FORMAT': 'csv'}
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Enabled item pipelines: 
2013-12-19 18:12:43-0700 [image] INFO: Spider opened
2013-12-19 18:12:43-0700 [image] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-12-19 18:12:43-0700 [image] DEBUG: Crawled (200) <GET http://lh:8000/> (referer: None)
2013-12-19 18:12:43-0700 [image] DEBUG: Visiting start of site:http://lh:8000/
2013-12-19 18:12:43-0700 [image] DEBUG: Parsing images for:http://lh:8000/
2013-12-19 18:12:44-0700 [image] DEBUG: Scraped from <200 http://lh:8000/>
{'AreaList': [36864],
 'CSSImagesList': [],
 'ImageIDList': [u':wdc:1387501964546'],
 'ImagesFileNames': [u'homepage-bcorp.png'],
 'ImagesList': [],
 'PositionList': [{'x': 8, 'y': 309}],
 'SiteUrl': u'http://localhosts:8000/',
 'WidthHeightList': [{'height': 192, 'width': 192}],
 'depth': 1,
 'domain': 'http://localhosts:8000',
 'htmlImagesList': [],
 'status': 'ok',
 'totalAreaOfImages': 36864,
 'totalNumberOfImages': 1}

2013-12-19 18:13:33-0700 [image] ERROR: Spider error processing <GET http://<domain>:8000/pages/forbidden.html>
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 800, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/task.py", line 602, in _tick
    taskObj._oneWorkUnit()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/task.py", line 479, in _oneWorkUnit
    result = self._iterator.next()
  File "/Library/Python/2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
    work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
  File "/Library/Python/2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
    yield next(it)
  File "/Library/Python/2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 23, in process_spider_output
    for x in result:
  File "/Library/Python/2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/Library/Python/2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Library/Python/2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Library/Python/2.7/site-packages/scrapy/contrib/spiders/crawl.py", line 67, in _parse_response
    cb_res = callback(response, **cb_kwargs) or ()
  File "/Users/eddieantonio/Work/cultr/spider/cultr/spiders/ImageSpider.py", line 164, in parse_images
    driver.get(response.url)
  File "/Library/Python/2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 176, in get
    self.execute(Command.GET, {'url': url})
  File "/Library/Python/2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 162, in execute
    response = self.command_executor.execute(driver_command, params)
  File "/Library/Python/2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 349, in execute
    return self._request(url, method=command_info[0], data=data)
  File "/Library/Python/2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 410, in _request
    resp = opener.open(request)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1187, in do_open
    r = h.getresponse(buffering=True)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 1045, in getresponse
    response.begin()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 409, in begin
    version, status, reason = self._read_status()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 373, in _read_status
    raise BadStatusLine(line)
httplib.BadStatusLine: ''

【Question comments】:

    Tags: python selenium scrapy


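    【Solution 1】:

    httplib.BadStatusLine means:

    Raised if a server responds with an HTTP status code that we don't understand.

    I think an error response is being returned when you crawl your custom website. You should use scrapy shell or requests to fetch http://localhosts:8000/pages/forbidden.html and see what comes back.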
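
    As a rough illustration of the check suggested above, here is a minimal sketch in Python 3, where `http.client` is the successor of the `httplib` module seen in the traceback. One common cause of an empty `BadStatusLine: ''` is that the remote end closed the connection without sending any status line at all; the sketch reproduces that with a throwaway local server. The helper names `serve_empty_response`, `probe`, and `demo` are hypothetical and are not part of the asker's spider.

```python
import http.client
import socket
import threading


def serve_empty_response(server_sock):
    """Accept one connection, read the request, then close without replying."""
    conn, _ = server_sock.accept()
    conn.recv(65536)  # consume the request bytes
    conn.close()      # no status line is ever sent back


def probe(host, port, path="/"):
    """Return the HTTP status code, or the name of the exception raised."""
    client = http.client.HTTPConnection(host, port, timeout=5)
    try:
        client.request("GET", path)
        return client.getresponse().status
    except http.client.BadStatusLine as exc:
        # Python 3 raises RemoteDisconnected (a BadStatusLine subclass)
        # when the status line is empty.
        return type(exc).__name__
    finally:
        client.close()


def demo():
    """Start the silent server on a free local port and probe it once."""
    server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_sock.bind(("127.0.0.1", 0))
    server_sock.listen(1)
    port = server_sock.getsockname()[1]
    worker = threading.Thread(target=serve_empty_response, args=(server_sock,))
    worker.start()
    try:
        return probe("127.0.0.1", port, "/pages/forbidden.html")
    finally:
        worker.join()
        server_sock.close()


if __name__ == "__main__":
    print(demo())
```

    To test the real site instead of the throwaway server, `probe("localhosts", 8000, "/pages/forbidden.html")` would report either a numeric status code or the exception name, assuming that host name resolves on your machine.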

    【Discussion】:
