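Why learn it?
scrapy_redis builds on Scrapy to provide more, and more powerful, features.
What are those features?
Request deduplication, crawler persistence, distributed crawling, resumable crawling (requests waiting to be crawled are stored in Redis), and incremental crawling (a fingerprint is generated for every request that has already been crawled).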
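The deduplication and resume ideas above can be sketched in a few lines. This is a simplified illustration, not the actual scrapy_redis implementation: `request_fingerprint` and `InMemoryScheduler` are hypothetical names, a Python `set` and `deque` stand in for the structures that scrapy_redis keeps in Redis, and the fingerprint here hashes only the method, URL and body. The `/Arts/` URL is made up for the demonstration.

```python
import hashlib
from collections import deque


def request_fingerprint(method: str, url: str, body: bytes = b"") -> str:
    """Hash the parts of a request that identify it (simplified assumption)."""
    h = hashlib.sha1()
    h.update(method.upper().encode("utf-8"))
    h.update(url.encode("utf-8"))
    h.update(body)
    return h.hexdigest()


class InMemoryScheduler:
    """Hypothetical stand-in for a Redis-backed scheduler."""

    def __init__(self):
        self.pending = deque()  # requests waiting to be crawled
        self.seen = set()       # fingerprints of requests already scheduled

    def enqueue(self, method, url, body=b""):
        """Queue the request unless its fingerprint was seen before."""
        fp = request_fingerprint(method, url, body)
        if fp in self.seen:
            return False  # duplicate, dropped
        self.seen.add(fp)
        self.pending.append((method, url, body))
        return True

    def next_request(self):
        """Pop the oldest pending request, or None if the queue is empty."""
        return self.pending.popleft() if self.pending else None


scheduler = InMemoryScheduler()
results = [
    scheduler.enqueue("GET", "http://www.dmoz.org/"),
    scheduler.enqueue("get", "http://www.dmoz.org/"),       # same request again
    scheduler.enqueue("GET", "http://www.dmoz.org/Arts/"),  # made-up URL
]
print(results)                 # [True, False, True]
print(len(scheduler.pending))  # 2
```

Because both `pending` and `seen` would live in Redis rather than in process memory, a restarted crawler (or a second crawler on another machine) would pick up the same queue and the same fingerprint set, which is what makes resuming and distributed crawling possible.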
Workflow
First, recall the crawl workflow of plain Scrapy.
Now look at the crawl workflow with scrapy_redis.
Installation:
pip install scrapy-redis
Installing from source:
git clone https://github.com/rmax/scrapy-redis.git
Official documentation: https://scrapy-redis.readthedocs.io/en/stable/index.html#running-the-example-project
The scrapy_redis source is on GitHub: https://github.com/rmax/scrapy-redis
It ships three demos in example-project/example.
The three examples are:
Let's start with the first one:
dmoz.py
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DmozSpider(CrawlSpider):
    """Follow categories and extract links."""
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = ['http://www.dmoz.org/']

    rules = [
        Rule(LinkExtractor(
            restrict_css=('.top-cat', '.sub-cat', '.cat-item')
        ), callback='parse_directory', follow=True),
    ]

    def parse_directory(self, response):
        for div in response.css('.title-and-desc'):
            yield {
                'name': div.css('.site-title::text').extract_first(),
                'description': div.css('.site-descr::text').extract_first().strip(),
                'link': div.css('a::attr(href)').extract_first(),
            }
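This example is hardly different from a CrawlSpider we would write ourselves, so the scrapy_redis behavior has to come from configuration, which is the next step.
First, here is what the official documentation says under "Use the following settings in your project":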
# Specify the scheduler queue.
# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"


# Specify which duplicates filter is used to deduplicate request objects.
# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Default requests serializer is pickle, but it can be changed to any module
# with loads and dumps functions. Note that pickle is not compatible between
# python versions.
# Caveat: In python 3.x, the serializer must return strings keys and support
# bytes as values. Because of this reason the json or msgpack module will not
# work by default. In python 2.x there is no such issue and you can use
# 'json' or 'msgpack' as serializers.
#SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"

# Whether the queue contents are persisted. If False, the Redis queues are
# cleared when the spider closes.
# Don't cleanup redis queues, allows to pause/resume crawls.
#SCHEDULER_PERSIST = True

# Schedule requests using a priority queue. (default)
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

# Alternative queues.
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'

# Max idle time to prevent the spider from being closed when distributed crawling.
# This only works if queue class is SpiderQueue or SpiderStack,
# and may also block the same time when your spider start at the first time (because the queue is empty).
#SCHEDULER_IDLE_BEFORE_CLOSE = 10

# The pipeline provided by scrapy_redis that saves items to Redis.
# Store scraped item in redis for post-processing.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300
}

# The item pipeline serializes and stores the items in this redis key.
#REDIS_ITEMS_KEY = '%(spider)s:items'

# The items serializer is by default ScrapyJSONEncoder. You can use any
# importable path to a callable object.
#REDIS_ITEMS_SERIALIZER = 'json.dumps'

# Specify the Redis address (host and port).
# Specify the host and port to use when connecting to Redis (optional).
#REDIS_HOST = 'localhost'
#REDIS_PORT = 6379


# Specify the Redis address (as a URL).
# Specify the full Redis URL for connecting (optional).
# If set, this takes precedence over the REDIS_HOST and REDIS_PORT settings.
#REDIS_URL = 'redis://user:pass@hostname:9001'

# Custom redis client parameters (i.e.: socket timeout, etc.)
#REDIS_PARAMS = {}
# Use custom redis client class.
#REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'

# If True, it uses redis' ``spop`` operation. This could be useful if you
# want to avoid duplicates in your start urls list. In this cases, urls must
# be added via ``sadd`` command or you will get a type error from redis.
#REDIS_START_URLS_AS_SET = False

# Default start urls key for RedisSpider and RedisCrawlSpider.
#REDIS_START_URLS_KEY = '%(name)s:start_urls'

# Use other encoding than utf-8 for redis.
#REDIS_ENCODING = 'latin1'
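The two key settings above are ordinary Python `%`-format templates, so it may help to see what they expand to. The sketch below only does the string formatting and does not touch Redis. The spider name `myspider` is a hypothetical placeholder, and the assumption is that scrapy_redis fills in the spider's `name` attribute for both placeholders.

```python
# Default templates, copied from the commented-out settings above.
REDIS_START_URLS_KEY = "%(name)s:start_urls"
REDIS_ITEMS_KEY = "%(spider)s:items"

# Hypothetical spider name, used only for illustration.
spider_name = "myspider"

# Expand each template with the spider name.
start_urls_key = REDIS_START_URLS_KEY % {"name": spider_name}
items_key = REDIS_ITEMS_KEY % {"spider": spider_name}

print(start_urls_key)  # myspider:start_urls
print(items_key)       # myspider:items
```

With the default `REDIS_START_URLS_AS_SET = False` the start URLs key is a Redis list, so URLs would be pushed with something like `lpush myspider:start_urls <url>`. If the setting is switched to True, the settings comment says the URLs must be added with `sadd` instead.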