【Question title】: Pass argument to scrapy spider within a python script
【Posted】: 2015-04-26 15:42:22
【Question description】:

I can run a crawl from a Python script using the following recipe from the wiki:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings

spider = FollowAllSpider(domain='scrapinghub.com')
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()

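As you can see, I can pass domain to FollowAllSpider, but my question is: how can I pass start_urls (actually a date that is used to build them) to my spider class using the code above?

This is my spider class: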

class MySpider(CrawlSpider):
    name = 'tw'
    def __init__(self,date):
        y,m,d=date.split('-') #this is a test , it could split with regex! 
        try:
            y,m,d=int(y),int(m),int(d)

        except ValueError:
            raise 'Enter a valid date'

        self.allowed_domains = ['mydomin.com']
        self.start_urls = ['my_start_urls{}-{}-{}'.format(y,m,d)]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="result-link"]/span/a/@href') 
        for question in questions:
            item = PoptopItem()
            item['url'] = question.extract()
            yield item['url']

This is my script:

from pdfcreator import convertor
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
#from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings
from poptop.spiders.stackoverflow_spider import MySpider
from poptop.items import PoptopItem

settings = get_project_settings()
crawler = Crawler(settings) 
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()

date=raw_input('Enter the date with this format (d-m-Y) : ')
print date
spider=MySpider(date=date)
crawler.crawl(spider)
crawler.start()
log.start()
item=PoptopItem()

for url in item['url']:
    convertor(url)

reactor.run() # the script will block here until the spider_closed signal was sent

If I just print item, I get the following error:

2015-02-25 17:13:47+0330 [tw] ERROR: Spider must return Request, BaseItem or None, got 'unicode' in <GET test-link2015-1-17>

Items:

import scrapy


class PoptopItem(scrapy.Item):
    titles= scrapy.Field()
    content= scrapy.Field()
    url=scrapy.Field()

【Question comments】:

    Tags: python python-2.7 web-scraping scrapy scrapy-spider


    【Solution 1】:

    You need to modify your __init__() constructor to accept the date argument. Also, I would use datetime.strptime() to parse the date string:

    from datetime import datetime
    
    class MySpider(CrawlSpider):
        name = 'tw'
        allowed_domains = ['test.com']
    
        def __init__(self, *args, **kwargs):
            super(MySpider, self).__init__(*args, **kwargs) 
    
            date = kwargs.get('date')
            if not date:
                raise ValueError('No date given')
    
            dt = datetime.strptime(date, "%m-%d-%Y")
            self.start_urls = ['http://test.com/{dt.year}-{dt.month}-{dt.day}'.format(dt=dt)]
    

    Then you can instantiate the spider like this:

    spider = MySpider(date='01-01-2015')
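
The date handling above can be tried on its own, without Scrapy. The following is a minimal sketch, assuming the placeholder host `http://test.com/` from the answer; the helper name `build_start_url` and its `fmt` parameter are hypothetical and only serve the illustration. It also shows that the format string has to match what the user types: the question's prompt asks for day-first input (d-m-Y), which would need `%d-%m-%Y` rather than `%m-%d-%Y`.

```python
from datetime import datetime

# Placeholder host taken from the answer; replace with the real site.
BASE_URL = 'http://test.com/'


def build_start_url(date_string, fmt='%m-%d-%Y'):
    """Parse date_string with fmt and return the start URL for that day.

    build_start_url is a hypothetical helper, not part of Scrapy.
    """
    # strptime raises ValueError when the string does not match fmt,
    # so invalid dates are rejected before any request is made.
    dt = datetime.strptime(date_string, fmt)
    return '{base}{dt.year}-{dt.month}-{dt.day}'.format(base=BASE_URL, dt=dt)


if __name__ == '__main__':
    # Month-first input, as in the answer.
    print(build_start_url('01-01-2015'))
    # Day-first input, as in the question's prompt.
    print(build_start_url('17-01-2015', fmt='%d-%m-%Y'))
```

Inside the spider, the same call could be used to fill `self.start_urls` in `__init__()`.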
    

    Alternatively, you can avoid parsing the date altogether by passing a datetime instance in the first place:

    class MySpider(CrawlSpider):
        name = 'tw'
        allowed_domains = ['test.com']
    
        def __init__(self, *args, **kwargs):
            super(MySpider, self).__init__(*args, **kwargs) 
    
            dt = kwargs.get('dt')
            if not dt:
                raise ValueError('No date given')
    
            self.start_urls = ['http://test.com/{dt.year}-{dt.month}-{dt.day}'.format(dt=dt)]
    
    spider = MySpider(dt=datetime(year=2014, month=1, day=1))
    

    And, FYI, see this answer for a detailed example of how to run Scrapy from a script.

    【Comments】:

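    • Thanks a lot for the explanation! As I said, the date parsing was just a test. Thanks for the link suggestion too. As you can see, my parse function yields the url; how can I get it (after running the crawl)?
    • I used the item, but it raises a KeyError; it seems the crawl has not run! for url in item['url']:
    • @KasraAD I think you just need to yield item instead of yield item['url']. Let me know if it helps.
    • In my spider I just yield item, and I get that error again! I will edit the question and add my script. Hope it helps!
    • @KasraAD Two things: 1. Why do you instantiate an item in the script that runs the crawl? I don't think you need it there. If you want to post-process the items returned from the spider, do it in a pipeline. 2. Could you also show the definition of the PoptopItem class? Thanks.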
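
The last comment suggests post-processing items in a pipeline instead of in the launch script. A minimal sketch of that idea follows. The class name `UrlCollectorPipeline` and the `handler` parameter are hypothetical; `handler` stands in for the question's `convertor(url)` function. The only part Scrapy itself relies on is the `process_item(self, item, spider)` method, which it calls once for each item the spider yields.

```python
class UrlCollectorPipeline(object):
    """Hypothetical pipeline that post-processes each item the spider yields."""

    def __init__(self, handler=None):
        # handler stands in for the question's convertor(url) function;
        # it defaults to None so the class can be built without arguments.
        self.handler = handler
        self.urls = []

    def process_item(self, item, spider):
        # 'url' is one of the fields declared on PoptopItem.
        url = item['url']
        self.urls.append(url)
        if self.handler is not None:
            self.handler(url)
        # Return the item so that later pipelines still receive it.
        return item
```

For this to run, the spider has to yield the whole item (`yield item`), and the pipeline has to be enabled through the `ITEM_PIPELINES` setting, for example `{'poptop.pipelines.UrlCollectorPipeline': 300}`, where the module path is an assumption about the project layout.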