Why learn it?

Scrapy-redis builds on Scrapy and adds more, and more powerful, features.

What does it add?

Request deduplication, spider persistence, distributed crawling, resumable crawls (pending requests are stored in Redis), and incremental crawling (a fingerprint is generated for every request already crawled).
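The "fingerprint" mentioned above is the core of request deduplication: Scrapy hashes each request, and scrapy_redis stores the hashes in a Redis set shared by all workers. Below is a simplified sketch of the idea, not the real implementation (the real one lives in `scrapy.utils.request.request_fingerprint` and also canonicalizes the URL, e.g. sorting query parameters, before hashing):

```python
import hashlib

def simple_fingerprint(method, url, body=b''):
    """Toy version of a request fingerprint: SHA1 over the request
    method, URL and body. Scrapy's real fingerprint additionally
    canonicalizes the URL first."""
    fp = hashlib.sha1()
    fp.update(method.encode())
    fp.update(url.encode())
    fp.update(body)
    return fp.hexdigest()

# scrapy_redis keeps this set in Redis so every worker shares it;
# a plain Python set is enough to show the dedup logic.
seen = set()
for method, url in [('GET', 'http://www.dmoz.org/'),
                    ('GET', 'http://www.dmoz.org/'),      # duplicate
                    ('GET', 'http://www.dmoz.org/Arts/')]:
    fp = simple_fingerprint(method, url)
    if fp in seen:
        print('skip duplicate:', url)
    else:
        seen.add(fp)
```

Because the set lives in Redis rather than in one process's memory, the dedup survives restarts (resumable crawls) and is visible to every spider process (distributed crawling).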

Workflow

First, recall the workflow of a plain Scrapy crawl:

[figure: plain Scrapy workflow]

Now compare it with the scrapy_redis workflow:

[figure: scrapy_redis workflow, with the scheduler queue and dupefilter moved into Redis]

 

Installation:

pip install scrapy-redis

Or install from source:

git clone git://github.com/rolando/scrapy-redis

The official documentation is at: https://scrapy-redis.readthedocs.io/en/stable/index.html#running-the-example-project

The scrapy_redis source code is on GitHub: https://github.com/rmax/scrapy-redis

It ships three demo spiders in example-project/example.

 

The three example spiders are dmoz.py, myspider_redis.py, and mycrawler_redis.py.

 

Let's look at the first one, dmoz.py:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DmozSpider(CrawlSpider):
    """Follow categories and extract links."""
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = ['http://www.dmoz.org/']

    rules = [
        Rule(LinkExtractor(
            restrict_css=('.top-cat', '.sub-cat', '.cat-item')
        ), callback='parse_directory', follow=True),
    ]

    def parse_directory(self, response):
        for div in response.css('.title-and-desc'):
            yield {
                'name': div.css('.site-title::text').extract_first(),
                'description': div.css('.site-descr::text').extract_first().strip(),
                'link': div.css('a::attr(href)').extract_first(),
            }

This spider looks exactly like a regular CrawlSpider we would write ourselves; the scrapy_redis behavior comes entirely from the project settings, so the next step is configuration.

First, the settings recommended by the official docs ("Use the following settings in your project"):

# Enables scheduling storing requests queue in redis
# (i.e. use the scrapy_redis scheduler).
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share same duplicates filter through redis
# (i.e. which dedup class is used for request objects).
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Default requests serializer is pickle, but it can be changed to any module
# with loads and dumps functions. Note that pickle is not compatible between
# python versions.
# Caveat: In python 3.x, the serializer must return strings keys and support
# bytes as values. Because of this reason the json or msgpack module will not
# work by default. In python 2.x there is no such issue and you can use
# 'json' or 'msgpack' as serializers.
#SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"

# Don't cleanup redis queues, allows to pause/resume crawls.
# If False, the queues are cleared when the spider closes.
#SCHEDULER_PERSIST = True

# Schedule requests using a priority queue. (default)
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

# Alternative queues.
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'

# Max idle time to prevent the spider from being closed when distributed crawling.
# This only works if queue class is SpiderQueue or SpiderStack,
# and may also block the same time when your spider start at the first time (because the queue is empty).
#SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Store scraped items in redis for post-processing
# (the pipeline scrapy_redis provides for saving items to redis).
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

# The item pipeline serializes and stores the items in this redis key.
#REDIS_ITEMS_KEY = '%(spider)s:items'

# The items serializer is by default ScrapyJSONEncoder. You can use any
# importable path to a callable object.
#REDIS_ITEMS_SERIALIZER = 'json.dumps'

# Specify the host and port to use when connecting to Redis (optional).
#REDIS_HOST = 'localhost'
#REDIS_PORT = 6379

# Specify the full Redis URL for connecting (optional).
# If set, this takes precedence over the REDIS_HOST and REDIS_PORT settings.
#REDIS_URL = 'redis://user:pass@hostname:9001'

# Custom redis client parameters (i.e.: socket timeout, etc.)
#REDIS_PARAMS = {}
# Use custom redis client class.
#REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'

# If True, it uses redis' ``spop`` operation. This could be useful if you
# want to avoid duplicates in your start urls list. In this cases, urls must
# be added via ``sadd`` command or you will get a type error from redis.
#REDIS_START_URLS_AS_SET = False

# Default start urls key for RedisSpider and RedisCrawlSpider.
#REDIS_START_URLS_KEY = '%(name)s:start_urls'

# Use other encoding than utf-8 for redis.
#REDIS_ENCODING = 'latin1'
