【发布时间】:2013-11-23 01:19:35
【问题描述】:
史前史: 我在 Python 2.7.2+ 上运行 Scrapy 版本 0.16.2,它在 Linux Mint 上。 前几天I had this problem 在帮助下,我设法克服了它。有一段时间,Crawler 正常工作:
2013-11-23 01:02:51+0200 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-11-23 01:02:51+0200 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-11-23 01:02:51+0200 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-11-23 01:02:51+0200 [scrapy] DEBUG: Enabled item pipelines:
2013-11-23 01:02:51+0200 [basketsp17] INFO: Spider opened
2013-11-23 01:02:51+0200 [basketsp17] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-11-23 01:02:51+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6024
2013-11-23 01:02:51+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6081
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Redirecting (301) to <GET http://www.euroleague.net/main/results/by-date> from <GET http://www.euroleague.net/main/results/by-date/>
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date> (referer: None)
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Filtered offsite request to 'www.euroleaguebasketball.net': <GET http://www.euroleaguebasketball.net/>
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Filtered offsite request to 'www.eurocupbasketball.com': <GET http://www.eurocupbasketball.com/>
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Filtered offsite request to 'www.euroleague.tv': <GET http://www.euroleague.tv/>
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Filtered offsite request to 'www.euroleaguestore.net': <GET http://www.euroleaguestore.net/>
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Filtered offsite request to 'fantasychallenge.euroleague.net': <GET http://fantasychallenge.euroleague.net/>
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Filtered offsite request to 'www.facebook.com': <GET http://www.facebook.com/TheEuroleague>
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Filtered offsite request to 'www.youtube.com': <GET http://www.youtube.com/euroleague>
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Filtered offsite request to 'euroleaguedevotion.ourtoolbar.com': <GET http://euroleaguedevotion.ourtoolbar.com/>
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Filtered offsite request to 'euroleague.synapticdigital.com': <GET http://euroleague.synapticdigital.com/>
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Filtered offsite request to 'twitter.com': <GET http://twitter.com/Euroleague>
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Filtered offsite request to 'kort.es': <GET http://kort.es/ulpGt>
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Filtered offsite request to 'adserver.itsfogo.com': <GET http://adserver.itsfogo.com/click.aspx?zoneid=136145>
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Crawled (200) <GET http://www.euroleague.net/> (referer: http://www.euroleague.net/main/results/by-date)
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Crawled (200) <GET http://www.euroleague.net/devotion/home> (referer: http://www.euroleague.net/main/results/by-date)
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Crawled (200) <GET http://www.euroleague.net/euroleaguenews/transactions/2013-14-signings> (referer: http://www.euroleague.net/main/results/by-date)
2013-11-23 01:02:51+0200 [basketsp17] DEBUG: Crawled (200) <GET http://www.euroleague.net/features/blog/2013-2014> (referer: http://www.euroleague.net/main/results/by-date)
但是几次之后它就停止了爬行。我想知道问题出在哪里。如果我第二天尝试代码,它会再次工作几分钟并停止。好吧,它可以工作,但它不会爬行。如果我更改 start_urls,它会再次开始工作,并使用相同的代码再次停止。 这里有什么问题?
这是我在它停止后看到的:
scrapy crawl basketsp17
2013-11-22 03:07:15+0200 [scrapy] INFO: Scrapy 0.20.0 started (bot: basketbase)
2013-11-22 03:07:15+0200 [scrapy] DEBUG: Optional features available: ssl, http11, boto, django
2013-11-22 03:07:15+0200 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'basketbase.spiders', 'SPIDER_MODULES': ['basketbase.spiders'], 'BOT_NAME': 'basketbase'}
2013-11-22 03:07:16+0200 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-11-22 03:07:16+0200 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-11-22 03:07:16+0200 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-11-22 03:07:16+0200 [scrapy] DEBUG: Enabled item pipelines:
2013-11-22 03:07:16+0200 [basketsp17] INFO: Spider opened
2013-11-22 03:07:16+0200 [basketsp17] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-11-22 03:07:16+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-11-22 03:07:16+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-11-22 03:07:16+0200 [basketsp17] DEBUG: Redirecting (301) to <GET http://www.euroleague.net/main/results/by-date> from <GET http://www.euroleague.net/main/results/by-date/>
2013-11-22 03:07:16+0200 [basketsp17] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date> (referer: None)
2013-11-22 03:07:16+0200 [basketsp17] INFO: Closing spider (finished)
2013-11-22 03:07:16+0200 [basketsp17] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 489,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 12181,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/301': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 11, 22, 1, 7, 16, 471690),
'log_count/DEBUG': 8,
'log_count/INFO': 3,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2013, 11, 22, 1, 7, 16, 172756)}
2013-11-22 03:07:16+0200 [basketsp17] INFO: Spider closed (finished)
这是我正在使用的代码:
from basketbase.items import BasketbaseItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy.http import TextResponse
from scrapy.http import HtmlResponse
class Basketspider(CrawlSpider):
name = "basketsp17"
allowed_domains = ["www.euroleague.net"]
start_urls = ["http://www.euroleague.net/main/results/by-date/"]
rules = (
Rule(SgmlLinkExtractor(allow=('main\/results\/showgame\?gamecode\=/\d$\&seasoncode\=E2013\#!boxscore')),follow=True),
Rule(SgmlLinkExtractor(allow=()),callback='parse_item'),
)
def init_request(self):
return HtmlResponse("http://www.euroleague.net/main/results/by-date/", body = body)
def parse_item(self, response):
sel = HtmlXPathSelector(response)
items=[]
item = BasketbaseItem()
item['date'] = sel.select('//div[@class="gs-dates"]/text()').extract() # Game date
item['time'] = sel.select('//div[@class="gs-dates"]/span[@class="GameScoreTimeContainer"]/text()').extract() # Game time
items.append(item)
return items
【问题讨论】:
-
我很想知道你是如何解决这个问题的。请告诉我。谢谢。s
标签: python web-crawler scrapy