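【Posted】: 2013-07-19 15:09:59
【Question】:
My scraper works fine when I run it from the command line, but when I try to run it from within a Python script (using the Twisted approach outlined here) it does not output the two CSV files it normally produces. I have a pipeline that creates and populates these files, one of them using CsvItemExporter() and the other using writeCsvFile(). Here is the code: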
class CsvExportPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        nodes = open('%s_nodes.csv' % spider.name, 'w+b')
        self.files[spider] = nodes
        self.exporter1 = CsvItemExporter(nodes, fields_to_export=['url','name','screenshot'])
        self.exporter1.start_exporting()
        self.edges = []
        self.edges.append(['Source','Target','Type','ID','Label','Weight'])
        self.num = 1

    def spider_closed(self, spider):
        self.exporter1.finish_exporting()
        file = self.files.pop(spider)
        file.close()
        writeCsvFile(getcwd()+r'\edges.csv', self.edges)

    def process_item(self, item, spider):
        self.exporter1.export_item(item)
        for url in item['links']:
            self.edges.append([item['url'],url,'Directed',self.num,'',1])
            self.num += 1
        return item
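writeCsvFile() is not shown in the question. The following is a minimal sketch of what such a helper might look like, assuming it takes a target path and a list of rows. The names write_csv_file, output_path and build_edges are hypothetical, and the sketch targets Python 3's csv module even though the question itself is tagged python-2.7. It builds the path with os.path.join rather than getcwd()+r'\edges.csv', so the result does not depend on the platform's path separator, and it returns an absolute path that can be printed to confirm where the file ends up.

```python
import csv
import os


def output_path(filename, base_dir=None):
    """Return an absolute path for filename inside base_dir (default: cwd)."""
    base_dir = os.getcwd() if base_dir is None else base_dir
    return os.path.abspath(os.path.join(base_dir, filename))


def write_csv_file(path, rows):
    """Write rows (a list of lists) to path as CSV and return the path."""
    # newline='' lets the csv module control line endings itself.
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return path


def build_edges(items):
    """Hypothetical stand-in for process_item: one edge row per outgoing link."""
    edges = [["Source", "Target", "Type", "ID", "Label", "Weight"]]
    edge_id = 1
    for item in items:
        for url in item["links"]:
            edges.append([item["url"], url, "Directed", edge_id, "", 1])
            edge_id += 1
    return edges
```

Printing the value returned by write_csv_file() is a quick way to rule out the files simply being written to a different working directory when the spider is started from a script.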
Here is my file structure:
SiteCrawler/                # the CSVs are normally created in this folder
    runspider.py            # this is the script that runs the scraper
    scrapy.cfg
    SiteCrawler/
        __init__.py
        items.py
        pipelines.py
        screenshooter.py
        settings.py
        spiders/
            __init__.py
            myfuncs.py
            sitecrawler_spider.py
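The scraper appears to run normally in every other respect. The output at the end of the command line shows that the expected number of pages were crawled, and the spider seems to finish normally. I do not get any error messages.
---- EDIT: ----
Inserting print statements and syntax errors into the pipeline has no effect at all, so it looks like the pipeline is being ignored. Why might that be?
Here is the code for the script that runs the scraper (runspider.py):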
from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy.xlib.pydispatch import dispatcher
import logging
from SiteCrawler.spiders.sitecrawler_spider import MySpider

def stop_reactor():
    reactor.stop()

dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = MySpider()
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start(loglevel=logging.DEBUG)
log.msg('Running reactor...')
reactor.run() # the script will block here until the spider is closed
log.msg('Reactor stopped.')
【Comments】:
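-
Could the files be getting written somewhere else? Can you check your output file path, or use an absolute file path?
-
I suspect this has to do with the settings that are actually being used. What does the log say at the start? It should list all enabled middlewares and pipelines.
-
I'm looking at doc.scrapy.org/en/latest/topics/… and github.com/scrapy/scrapy/blob/master/scrapy/settings/…. Maybe you have to use CrawlerSettings(settings.module.to.use). At the very least you should be able to check this in your runspider.py by separating out mysettings = CrawlerSettings(settings.modules.to.use), perhaps printing a few values from those settings with mysettings.get(setting_name), and then crawler = Crawler(mysettings)...
-
Great! I might need this in the future too. You could post your own answer and explain how you solved it.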
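The settings check suggested in the comments can be approximated without Scrapy at all. This is a minimal sketch, assuming the project's settings live in a module named SiteCrawler.settings (as the file structure in the question suggests) and that pipelines are enabled there through an ITEM_PIPELINES attribute. declared_pipelines is a hypothetical helper, not part of Scrapy. It only reports what the settings module declares; it does not show what a given Crawler instance actually loaded.

```python
import importlib


def declared_pipelines(settings_module):
    """Return ITEM_PIPELINES from a settings module, or None if unavailable.

    None means either the module could not be imported from the current
    location or it does not define ITEM_PIPELINES.
    """
    try:
        module = importlib.import_module(settings_module)
    except ImportError:
        return None
    return getattr(module, "ITEM_PIPELINES", None)


if __name__ == "__main__":
    # 'SiteCrawler.settings' is assumed from the file structure in the question.
    print(declared_pipelines("SiteCrawler.settings"))
```

If this prints the pipeline while the crawl still skips it, the settings object handed to Crawler is a likely culprit: as far as I can tell, a Settings() built with no arguments carries only Scrapy's built-in defaults, not the values from the project's settings.py.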
Tags: python python-2.7 export twisted scrapy