[Posted]: 2020-12-15 14:58:43
[Question]:
I created a spider. When I run response.css in the scrapy shell it works, but when I run the spider it gives no output. Here is my code:
> import scrapy
>
> class tapo(scrapy.Spider):
>     name="mapit"
>     start_urls=["https://www.tapology.com/fightcenter?schedule=results"]
>
>     def event_parse(self,response):
>         event_link=response.css('.name a::attr(href)').getall()
>         BASE_URL="https://www.tapology.com"
>         event_urls=self.BASE_URL+event_link
>         yield scrapy.Request(event_urls, callback=self.parse_attr)
>
>     def event_items(self,response):
>         event_name=response.css('.name a::text').getall()
>         event_dtm=response.css('.datetime::text').getall()
>         event_loc=response.css('.region a::text').getall()
>         yield{
>             'event_name':event_name,
>             'event_dtm':event_dtm,
>             'event_loc':event_loc
>         }
I run the spider by typing > scrapy crawl mapit, and this is what happens:
2020-12-15 20:24:28 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: tapology1)
2020-12-15 20:24:28 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h 22 Sep 2020), cryptography 3.1.1, Platform Windows-10-10.0.18362-SP0
2020-12-15 20:24:28 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-12-15 20:24:28 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'tapology1',
'NEWSPIDER_MODULE': 'tapology1.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['tapology1.spiders']}
2020-12-15 20:24:28 [scrapy.extensions.telnet] INFO: Telnet Password: 56ccb4901a1e00f7
2020-12-15 20:24:28 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-12-15 20:24:29 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-12-15 20:24:29 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-12-15 20:24:29 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-12-15 20:24:29 [scrapy.core.engine] INFO: Spider opened
2020-12-15 20:24:29 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-12-15 20:24:29 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-12-15 20:24:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tapology.com/robots.txt> (referer: None)
2020-12-15 20:24:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tapology.com/fightcenter?schedule=results> (referer: None)
2020-12-15 20:24:31 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.tapology.com/fightcenter?schedule=results> (referer: None)
Traceback (most recent call last):
File "c:\users\chaitali\anaconda3\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "c:\users\chaitali\anaconda3\lib\site-packages\scrapy\spiders\__init__.py", line 90, in _parse
return self.parse(response, **kwargs)
File "c:\users\chaitali\anaconda3\lib\site-packages\scrapy\spiders\__init__.py", line 93, in parse
raise NotImplementedError(f'{self.__class__.__name__}.parse callback is not defined')
NotImplementedError: tapo.parse callback is not defined
2020-12-15 20:24:31 [scrapy.core.engine] INFO: Closing spider (finished)
2020-12-15 20:24:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 478,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 26743,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 2.060803,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 12, 15, 14, 54, 31, 263134),
'log_count/DEBUG': 2,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/NotImplementedError': 1,
'start_time': datetime.datetime(2020, 12, 15, 14, 54, 29, 202331)}
2020-12-15 20:24:31 [scrapy.core.engine] INFO: Spider closed (finished)
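Any ideas and tips on how to run the spider are appreciated. I am new to Scrapy; I think I understand how to run it, but I may be getting something wrong.
Edit: I made the corrections. What am I missing here?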
2020-12-15 22:44:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tapology.com/fightcenter?schedule=results> (referer: None)
2020-12-15 22:44:39 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.tapology.com/fightcenter?schedule=results> (referer: None)
Traceback (most recent call last):
File "c:\users\chaitali\anaconda3\lib\site-packages\scrapy\utils\defer.py", line 120, in iter_errback
yield next(it)
File "c:\users\chaitali\anaconda3\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
return next(self.data)
File "c:\users\chaitali\anaconda3\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
return next(self.data)
File "c:\users\chaitali\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "c:\users\chaitali\anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "c:\users\chaitali\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "c:\users\chaitali\anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 340, in <genexpr>
return (_set_referer(r) for r in result or ())
File "c:\users\chaitali\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "c:\users\chaitali\anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "c:\users\chaitali\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "c:\users\chaitali\anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "c:\users\chaitali\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
for r in iterable:
File "C:\Users\Chaitali\scrapeit\tapology1\tapology1\spiders\tapogy.py", line 15, in parse
event_urls=self.BASE_URL+event_link
AttributeError: 'tapo' object has no attribute 'BASE_URL'
2020-12-15 22:44:39 [scrapy.core.engine] INFO: Closing spider (finished)
2020-12-15 22:44:39 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 478,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 26935,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 3.043361,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 12, 15, 17, 14, 39, 661997),
'log_count/DEBUG': 2,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/AttributeError': 1,
'start_time': datetime.datetime(2020, 12, 15, 17, 14, 36, 618636)}
2020-12-15 22:44:39 [scrapy.core.engine] INFO: Spider closed (finished)
[Discussion]:
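The second traceback points at the line event_urls=self.BASE_URL+event_link, and that line seems to have two separate problems. BASE_URL is assigned as a local variable inside the method, so it never becomes an attribute of the spider and self.BASE_URL raises AttributeError. Also, .getall() returns a list of strings, so even with the lookup fixed, adding a string to that list would fail; each link has to be joined on its own. Below is a minimal sketch of just the URL-building step, using only the standard library. The helper name build_event_urls and the sample hrefs are hypothetical and only stand in for what the selector would return.

```python
from urllib.parse import urljoin

# Same base URL as in the question, defined once at module level
BASE_URL = "https://www.tapology.com"


def build_event_urls(hrefs, base_url=BASE_URL):
    """Turn a list of relative hrefs into a list of absolute URLs."""
    # .getall() yields a list, so every href is joined separately
    return [urljoin(base_url, href) for href in hrefs]


# Hypothetical hrefs, standing in for
# response.css('.name a::attr(href)').getall()
sample_hrefs = [
    "/fightcenter/events/1-example",
    "/fightcenter/events/2-example",
]

event_urls = build_event_urls(sample_hrefs)
for url in event_urls:
    print(url)
```

Inside the spider, the same idea would be a loop over the links that yields one scrapy.Request per URL; Scrapy's response.urljoin(href) does this joining relative to the page URL, so no BASE_URL is needed at all. Two more things to check: the first traceback shows Scrapy calling self.parse for start_urls, so the first callback has to be named parse, and the request in the posted code uses callback=self.parse_attr while the method is actually called event_items.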