【Question title】:Scrapy file gives no output after running, but the selectors work in scrapy shell
【Posted】:2020-12-15 14:58:43
【Question】:

I created a spider. The selectors return results when I run response.css in scrapy shell, but when I run the spider itself it gives no output. Here is my code:

> import scrapy
>
> class tapo(scrapy.Spider):
>     name="mapit"
>     start_urls=["https://www.tapology.com/fightcenter?schedule=results"]
>         def event_parse(self,response):
>         event_link=response.css('.name a::attr(href)').getall()
>         BASE_URL="https://www.tapology.com"
>         event_urls=self.BASE_URL+event_link
>         yield scrapy.Request(event_urls, callback=self.parse_attr)
>     def event_items(self,response):
>         event_name=response.css('.name a::text').getall()
>         event_dtm=response.css('.datetime::text').getall()
>         event_loc=response.css('.region a::text').getall()
>         yield{
>             'event_name':event_name,
>             'event_dtm':event_dtm,
>             'event_loc':event_loc
>         }

I run the spider by typing scrapy crawl mapit, and this is what happens:

2020-12-15 20:24:28 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: tapology1)
2020-12-15 20:24:28 [scrapy.utils.log] INFO: Versions: lxml 4.6.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.1.1, Platform Windows-10-10.0.18362-SP0
2020-12-15 20:24:28 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-12-15 20:24:28 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'tapology1',
 'NEWSPIDER_MODULE': 'tapology1.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['tapology1.spiders']}
2020-12-15 20:24:28 [scrapy.extensions.telnet] INFO: Telnet Password: 56ccb4901a1e00f7
2020-12-15 20:24:28 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2020-12-15 20:24:29 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-12-15 20:24:29 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-12-15 20:24:29 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-12-15 20:24:29 [scrapy.core.engine] INFO: Spider opened
2020-12-15 20:24:29 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-12-15 20:24:29 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-12-15 20:24:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tapology.com/robots.txt> (referer: None)
2020-12-15 20:24:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tapology.com/fightcenter?schedule=results> (referer: None)
2020-12-15 20:24:31 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.tapology.com/fightcenter?schedule=results> (referer: None)
Traceback (most recent call last):
  File "c:\users\chaitali\anaconda3\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "c:\users\chaitali\anaconda3\lib\site-packages\scrapy\spiders\__init__.py", line 90, in _parse
    return self.parse(response, **kwargs)
  File "c:\users\chaitali\anaconda3\lib\site-packages\scrapy\spiders\__init__.py", line 93, in parse
    raise NotImplementedError(f'{self.__class__.__name__}.parse callback is not defined')
NotImplementedError: tapo.parse callback is not defined
2020-12-15 20:24:31 [scrapy.core.engine] INFO: Closing spider (finished)
2020-12-15 20:24:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 478,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 26743,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 2.060803,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 12, 15, 14, 54, 31, 263134),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 1,
 'log_count/INFO': 10,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/NotImplementedError': 1,
 'start_time': datetime.datetime(2020, 12, 15, 14, 54, 29, 202331)}
2020-12-15 20:24:31 [scrapy.core.engine] INFO: Spider closed (finished)

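Any ideas and tips on how to run the spider are appreciated. I am new to Scrapy. I have read up on how to run it, but I may be doing something wrong.

Edit: I made the correction. What am I missing here?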

2020-12-15 22:44:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tapology.com/fightcenter?schedule=results> (referer: None)
2020-12-15 22:44:39 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.tapology.com/fightcenter?schedule=results> (referer: None)
Traceback (most recent call last):
  File "c:\users\chaitali\anaconda3\lib\site-packages\scrapy\utils\defer.py", line 120, in iter_errback
    yield next(it)
  File "c:\users\chaitali\anaconda3\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
    return next(self.data)
  File "c:\users\chaitali\anaconda3\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
    return next(self.data)
  File "c:\users\chaitali\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "c:\users\chaitali\anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "c:\users\chaitali\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "c:\users\chaitali\anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 340, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\users\chaitali\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "c:\users\chaitali\anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\users\chaitali\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "c:\users\chaitali\anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\users\chaitali\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\Chaitali\scrapeit\tapology1\tapology1\spiders\tapogy.py", line 15, in parse
    event_urls=self.BASE_URL+event_link
AttributeError: 'tapo' object has no attribute 'BASE_URL'
2020-12-15 22:44:39 [scrapy.core.engine] INFO: Closing spider (finished)
2020-12-15 22:44:39 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 478,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 26935,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 3.043361,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 12, 15, 17, 14, 39, 661997),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 1,
 'log_count/INFO': 10,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/AttributeError': 1,
 'start_time': datetime.datetime(2020, 12, 15, 17, 14, 36, 618636)}
2020-12-15 22:44:39 [scrapy.core.engine] INFO: Spider closed (finished)

【Question discussion】:

    Tags: python scrapy


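    【Answer 1】:

    You need a method named parse in your spider, because that is the default callback Scrapy uses for the responses to the requests generated from start_urls. event_parse can be renamed to parse, and assuming the indentation errors shown in your question are not in your actual code, the NotImplementedError should go away.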
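The traceback in the question shows the base class's parse raising NotImplementedError. The sketch below imitates that behaviour in plain Python so it can be run without Scrapy. MiniSpider, Broken, Fixed and dispatch are hypothetical names for illustration only; this is not Scrapy's actual implementation.

```python
class MiniSpider:
    """Simplified stand-in for scrapy.Spider (an assumption, not Scrapy's real code)."""

    def parse(self, response):
        # Mirrors the message in the traceback: the base class has no
        # usable default callback of its own.
        raise NotImplementedError(
            f"{self.__class__.__name__}.parse callback is not defined"
        )


class Broken(MiniSpider):
    # The callback has a custom name, so nothing overrides parse().
    def event_parse(self, response):
        return ["handled by event_parse"]


class Fixed(MiniSpider):
    # Renaming the method to parse() overrides the base-class default.
    def parse(self, response):
        return ["handled by parse"]


def dispatch(spider, response):
    """Call the default callback, roughly as happens for start_urls responses."""
    try:
        return spider.parse(response)
    except NotImplementedError as exc:
        return f"error: {exc}"


print(dispatch(Broken(), "<html>"))
print(dispatch(Fixed(), "<html>"))
```

In a real spider, the alternative to renaming the method would be to build the first requests yourself and pass callback=self.event_parse explicitly.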

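    Beyond that, there are a few other errors around the yield. You used getall(), which returns a list, so it cannot be concatenated with a string or passed to scrapy.Request as a single URL. If you want to follow every link built from that list, your parse method needs code like the following.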

    BASE_URL = "https://www.tapology.com"
    event_urls = []
    for link in event_link:
        link = BASE_URL + link # BASE_URL is not an attribute of the spider, so do not use self.BASE_URL
        event_urls.append(link)
    
    yield from response.follow_all(event_urls, callback=self.parse_attr)
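To see why the list has to be handled element by element, here is a small standalone sketch using only the standard library. The hrefs in event_links are made-up placeholders standing in for whatever getall() would return on the real page, and urljoin is used here instead of string concatenation as one possible way to build the absolute URLs.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.tapology.com"

# Hypothetical hrefs, standing in for the list returned by
# response.css('.name a::attr(href)').getall()
event_links = ["/fightcenter/events/1-example", "/fightcenter/events/2-example"]

# Concatenating a str with the whole list raises TypeError.
concat_failed = False
try:
    BASE_URL + event_links
except TypeError:
    concat_failed = True

# Joining each href separately produces one absolute URL per link.
event_urls = [urljoin(BASE_URL, href) for href in event_links]

print(concat_failed)
print(event_urls)
```

Inside a spider, response.follow and response.follow_all also accept relative URLs, so the manual join may not be needed at all.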
    

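    Also, make sure parse_attr is actually defined, otherwise the callback to it will fail. If you want the responses from the yielded requests to go to event_items, set that method as the callback instead.

    Some shortcuts are possible here, but since you are new to Python and/or Scrapy, I have tried to keep things as explicit as possible.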

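    【Discussion】:

    • I don't think I ever came across parse_attr. Where do I use it?
    • @Chai The yield in your event_parse sets the callback to the parse_attr method. That may be a typo on your part. If you want the response data to go to event_items, set that as the callback. I will edit my answer to include more information.
    【Answer 2】: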

    I think that if you rename event_parse to parse, and then change your callback method to parse as well, it should be fine:

    > import scrapy
    >
    > class tapo(scrapy.Spider):
    >     name="mapit"
    >     start_urls=["https://www.tapology.com/fightcenter?schedule=results"]
    >         def parse(self,response):                                   # CORRECTED
    >         event_link=response.css('.name a::attr(href)').getall()
    >         BASE_URL="https://www.tapology.com"
    >         event_urls=self.BASE_URL+event_link
    >         yield scrapy.Request(event_urls, callback=self.parse)       # CORRECTED
    >     def event_items(self,response):
    >         event_name=response.css('.name a::text').getall()
    >         event_dtm=response.css('.datetime::text').getall()
    >         event_loc=response.css('.region a::text').getall()
    >         yield{
    >             'event_name':event_name,
    >             'event_dtm':event_dtm,
    >             'event_loc':event_loc
    >         }
    

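    Or, if it is not fine, it should at least produce some output or a different error.

    Now that you have corrected the above, I think the new error comes from the way you are trying to build the absolute url.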

    event_link=response.css('.name a::attr(href)').getall() # why getall()? use line below
    event_link=response.css('.name a::attr(href)').get()
    event_urls=self.BASE_URL+event_link # change this
    event_urls = BASE_URL+event_link # to this
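The AttributeError in the edited question comes from ordinary Python name lookup, which can be reproduced without Scrapy. In this sketch LocalOnly and ClassAttribute are hypothetical classes used only for illustration: a variable assigned inside a method is local to that call, while self.BASE_URL looks for an attribute on the object or its class.

```python
class LocalOnly:
    def parse(self):
        # Local variable: it exists only inside this call and is never
        # attached to the object.
        BASE_URL = "https://www.tapology.com"
        # self.BASE_URL looks for an attribute, which was never set.
        return self.BASE_URL + "/fightcenter"


class ClassAttribute:
    # Class attribute: reachable as self.BASE_URL from any method.
    BASE_URL = "https://www.tapology.com"

    def parse(self):
        return self.BASE_URL + "/fightcenter"


message = ""
try:
    LocalOnly().parse()
except AttributeError as exc:
    message = str(exc)

print(message)
print(ClassAttribute().parse())
```

So either drop the self. prefix, as suggested above, or move BASE_URL to class level next to name and start_urls.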
    

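    【Discussion】:

    • Just to be clear, I run the spider like this: scrapy crawl mapit (I think it gives the same error). Nothing happens either if I run the whole file in debug mode. (Corrected the code.)
    • Yes, that is fine. If you want to save the output as .csv, you can run scrapy crawl mapit -o 'your_csv_name.csv'
    • So now the code runs, but the output is 0?
    • I edited the post to show the error. How do you know what the error is? I mean, I can barely understand it.
    • I am not sure this will work, because I have not run the spider on my end, and I personally tend to use xpath expressions. But give it a try and let me know.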