scrapy---Installing and configuring a virtual environment--Crawlers--Key points--Configuring cookiespool

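Installing and configuring a Python virtual environment (Windows)

1. Install Python 2.7 and Python 3.5 first and note the installation paths; install both on drive D.

2. Add the Python directories to the system PATH environment variable; the version whose directory appears first in PATH is the one used by default.

3. Run pip install virtualenv on the command line to install the virtualenv package.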

4. Downloads with pip often time out. A mirror hosted in China, such as the Douban mirror, speeds them up considerably.

5. Create a virtual environment with the command virtualenv followed by the environment name; it is created in the current directory.

6. Enter the virtual environment by running the activate script in its Scripts directory.

7. Leave the virtual environment with the deactivate command.

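8. How do you create a virtual environment for a specific Python version? Pass the path of the desired interpreter to virtualenv with its -p (--python) option.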

Then enter the newly created virtual environment with its activate script, as in step 6.
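For step 4 above, a mirror is selected with pip's -i (--index-url) option. The sketch below only builds and prints the command so that it has no side effects. The mirror URL is an assumption added for illustration (the original screenshot is not available, and the mirror may no longer be online), and pip_install_command is a hypothetical helper name.

```python
import sys

# Assumed mirror URL, for illustration only; substitute the mirror you actually use.
MIRROR_INDEX_URL = "https://pypi.doubanio.com/simple/"


def pip_install_command(package, index_url=None):
    """Build the argument list for installing `package`, optionally from a mirror."""
    command = [sys.executable, "-m", "pip", "install", package]
    if index_url:
        # -i / --index-url tells pip which package index to download from.
        command += ["-i", index_url]
    return command


if __name__ == "__main__":
    # Print the command instead of running it.
    print(" ".join(pip_install_command("virtualenv", MIRROR_INDEX_URL)))
```

The returned list can be passed to subprocess.run once the mirror URL has been checked.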

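With the approach above you have to remember the directory of every virtual environment, which is tedious. A shortcut for entering virtual environments can be set up as follows.

This second approach is simpler and is the recommended one.

1. Install the virtualenvwrapper-win package with pip.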

2. Run the workon command to check that it is available.

3. Create a virtual environment with virtualenvwrapper.

By default, environments are stored in the C:\Users\Administrator\Envs directory.

The storage location can be changed:
in the system environment variables, add WORKON_HOME and set it to the desired path.

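Running workon now lists no virtual environments, because the default directory has changed. Copy the earlier environments into the new directory to keep using them.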

After copying, workon lists them again.
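The lookup that makes workon appear empty after WORKON_HOME changes can be sketched roughly as follows. This is not virtualenvwrapper-win's own code: workon_home and list_envs are hypothetical helpers, and treating any subdirectory that contains Scripts\activate.bat as an environment is an assumption made for this sketch.

```python
import os


def workon_home(environ=None):
    """Return the directory in which environments are looked up."""
    environ = os.environ if environ is None else environ
    # Without WORKON_HOME, the Envs folder in the user's home directory is used.
    default = os.path.join(os.path.expanduser("~"), "Envs")
    return environ.get("WORKON_HOME", default)


def list_envs(home):
    """List subdirectories of `home` that look like Windows virtual environments."""
    if not os.path.isdir(home):
        return []
    return sorted(
        name
        for name in os.listdir(home)
        if os.path.isfile(os.path.join(home, name, "Scripts", "activate.bat"))
    )


if __name__ == "__main__":
    home = workon_home()
    print(home)
    print(list_envs(home))
```

Because the list is built from whatever sits under the configured directory, environments left in the old location are simply not found until they are copied over.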

Create a new virtual environment; once it has been created, it is entered automatically.

Create a virtual environment for a specific Python version:

mkvirtualenv --python=D:\python\python3.5\python.exe py

Libraries can then be installed and used normally inside it.

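From now on there is no need to remember installation paths when entering a virtual environment. Use the following commands directly:

List the virtual environments: workon

Create a virtual environment: mkvirtualenv [environment name]

Activate or switch to a virtual environment: workon [environment name]

Leave the virtual environment: deactivate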

Download the two files from GitHub and place them in the virtual environment's directory.

Running scrapy raises an exception if pypiwin32 is not installed, so install it:

pip install pypiwin32

If the installation reports the following packages as missing, they have to be installed one by one:

Automat constantly hyperlink incremental zope.interface

Each of these needs its own pip install command:

pip install Automat
pip install constantly
pip install hyperlink
pip install incremental
pip install zope.interface

Once all of them are installed, install scrapy: pip install scrapy
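Before running the commands above, it can help to see which of the packages are actually missing from the active environment. The following is a minimal sketch using only the standard library. The mapping from distribution names to importable module names is an assumption of this sketch, and is_importable and missing_packages are hypothetical helpers.

```python
import importlib.util

# Distribution name -> importable module name (the two can differ).
DEPENDENCIES = {
    "Automat": "automat",
    "constantly": "constantly",
    "hyperlink": "hyperlink",
    "incremental": "incremental",
    "zope.interface": "zope.interface",
}


def is_importable(module_name):
    """Return True if `module_name` can be located in the current environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised, for example, when a parent package such as "zope" is missing.
        return False


def missing_packages(dependencies):
    """Return the distribution names whose module cannot be imported, in order."""
    return [dist for dist, module in dependencies.items() if not is_importable(module)]


if __name__ == "__main__":
    # Print the commands rather than running them.
    for dist in missing_packages(DEPENDENCIES):
        print("pip install " + dist)
```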

Configure the interpreter:

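Key points:

The commands are as follows:

1. workon scrapy (enter the crawler virtual environment)

2. cd C:\Users\Administrator\Desktop\6-爬虫\py3scrapy (the directory where crawler projects are stored)

3. scrapy startproject TestSpider (TestSpider is the name of the crawler project)

4. cd TestSpider (enter the project root directory)

5. scrapy genspider baidu baidu.com

baidu is used in two places: 1. the file name baidu.py 2. name = "baidu"

baidu.com: the domain to crawl; it fills in allowed_domains and the crawler's starting URL in start_urls = [ ]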
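To make the role of the two genspider arguments concrete, the sketch below renders an approximation of the file that the command writes. The template text is reproduced from memory of Scrapy's basic spider template and may differ between versions; render_spider is a hypothetical helper, not part of Scrapy.

```python
from string import Template

# Approximation of the generated spider file; the real template may differ by version.
SPIDER_TEMPLATE = Template('''import scrapy


class ${classname}(scrapy.Spider):
    name = "${name}"
    allowed_domains = ["${domain}"]
    start_urls = ["http://${domain}/"]

    def parse(self, response):
        pass
''')


def render_spider(name, domain):
    """Return (file name, source text) for a spider called `name` crawling `domain`."""
    classname = name.capitalize() + "Spider"
    source = SPIDER_TEMPLATE.substitute(classname=classname, name=name, domain=domain)
    return name + ".py", source


if __name__ == "__main__":
    filename, source = render_spider("baidu", "baidu.com")
    print(filename)
    print(source)
```

The first argument ends up in the file name and the name attribute, which is the value later passed to scrapy crawl; the second ends up in allowed_domains and start_urls.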

Explanation of each part of the project structure:

In settings.py, change ROBOTSTXT_OBEY from True to False:


# ROBOTSTXT_OBEY = True  After changing it to False, Scrapy no longer obeys the site's robots.txt rules when crawling
ROBOTSTXT_OBEY = False
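What obeying robots.txt means in practice can be illustrated with the standard library's urllib.robotparser. This is only an illustration of the protocol, not Scrapy's own robots.txt middleware; the robots.txt content, the example.com URLs, and the allowed helper are made up for the sketch.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for a hypothetical site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(url, user_agent="TestSpider", robots_txt=ROBOTS_TXT):
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed("http://example.com/index.html"))
    print(allowed("http://example.com/private/data.html"))
```

With ROBOTSTXT_OBEY = True, requests to disallowed paths are filtered out before they are downloaded; with False, no such check is made.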

Official documentation: https://doc.scrapy.org/en/latest/intro/tutorial.html. The documentation at this address may be outdated, so search for the latest version when needed.

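How does data flow between the components of Scrapy? Describe it in detail.

1> The engine takes the initial Requests, which the spider builds from its start URLs, and hands them to the scheduler;

2> The scheduler generates a fingerprint for each Request and, depending on whether deduplication applies to it, decides whether to put the Request in the queue;

3> The engine keeps fetching the next Request from the scheduler's queue;

4> The engine hands the Request to the Downloader, passing through process_request of the downloader middlewares on the way;

5> When the download finishes, the Response passes through process_response and is returned to the engine;

6> The engine hands the Response to the Spider for parsing and data extraction, passing through the spider middlewares on the way;

7> The Spider passes the extracted results back to the engine, which sends items to the item pipeline and Request objects to the scheduler for further scheduling.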
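The seven steps can be mimicked with a toy, single-threaded loop. This is a sketch of the idea only, not Scrapy's implementation: the real engine is asynchronous, middlewares are left out, and the fingerprint here is a simplified SHA-1 over method, URL and body. SITE, fake_download and fake_parse are made-up stand-ins for a website, the downloader and the spider.

```python
import hashlib
from collections import deque


def fingerprint(method, url, body=b""):
    """Simplified request fingerprint: SHA-1 over method, URL and body."""
    digest = hashlib.sha1()
    for part in (method.encode(), url.encode(), body):
        digest.update(part)
    return digest.hexdigest()


class Scheduler:
    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, url, dont_filter=False):
        """Step 2: queue the request unless its fingerprint was already seen."""
        fp = fingerprint("GET", url)
        if not dont_filter and fp in self.seen:
            return False
        self.seen.add(fp)
        self.queue.append(url)
        return True

    def next_request(self):
        return self.queue.popleft() if self.queue else None


def crawl(start_urls, download, parse):
    """Toy engine loop tying scheduler, downloader and spider together."""
    scheduler = Scheduler()
    items = []
    for url in start_urls:                      # step 1
        scheduler.enqueue(url)
    while True:
        url = scheduler.next_request()          # step 3
        if url is None:
            break
        response = download(url)                # steps 4 and 5
        for kind, value in parse(url, response):  # step 6
            if kind == "item":
                items.append(value)             # step 7: item goes to the pipeline
            else:
                scheduler.enqueue(value)        # step 7: request goes back to the scheduler
    return items


# Made-up site: each path maps to the links found on that page.
SITE = {"/a": ["/b", "/c"], "/b": ["/a", "/c"], "/c": []}


def fake_download(url):
    return SITE[url]


def fake_parse(url, links):
    return [("item", url)] + [("request", link) for link in links]


if __name__ == "__main__":
    print(crawl(["/a"], fake_download, fake_parse))
```

Although /a and /c are linked more than once, each page is downloaded a single time because the scheduler drops requests whose fingerprint it has already seen.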

Source code analysis:

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for TestSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'TestSpider'

SPIDER_MODULES = ['TestSpider.spiders']
NEWSPIDER_MODULE = 'TestSpider.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'TestSpider (+http://www.yourdomain.com)'

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0'

# Obey robots.txt rules
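# The generated project obeys the robots.txt rules, which specify which addresses
# on a website may be requested and which may not.
# The generated value is True; set it to False to stop obeying the protocol.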
ROBOTSTXT_OBEY = False


# Configure maximum concurrent requests performed by Scrapy (default: 16)
# Number of concurrent requests performed by Scrapy; 16 requests run concurrently by default.
# CONCURRENT_REQUESTS = 10

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs

# Download delay: the interval in seconds between consecutive requests, which slows the crawl down. Default: 0
# DOWNLOAD_DELAY = 3


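# CONCURRENT_REQUESTS_PER_DOMAIN: maximum number of concurrent requests to a single website (domain).
# CONCURRENT_REQUESTS_PER_IP: maximum number of concurrent requests to a single IP.
# The download delay setting will honor only one of (choose one of the two):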
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16


# Disable cookies (enabled by default)
# Whether cookies are enabled; they are enabled by default. Disabling them is mainly intended for sites where cookies should not be used.
# COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False


# Override the default request headers:

# Configure the default request headers.
# DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
# }


# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html

# Configure custom spider middlewares. Scrapy also enables some spider middlewares by default; they can be disabled here.
# SPIDER_MIDDLEWARES = {
#    'TestSpider.middlewares.TestspiderSpiderMiddleware': 543,
# }


# Downloader middlewares: configure custom middlewares or disable ones that Scrapy enables by default.
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#    'TestSpider.middlewares.TestspiderDownloaderMiddleware': 543,
# }


# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
# }


# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html

# Configure custom item pipelines; pipelines with lower numbers run first.
# ITEM_PIPELINES = {
#    'TestSpider.pipelines.TestspiderPipeline': 300,
# }


# Throttling configuration
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html

# Whether to enable automatic throttling
# AUTOTHROTTLE_ENABLED = True


# The initial download delay
# Initial download delay, in seconds
# AUTOTHROTTLE_START_DELAY = 5


# The maximum download delay to be set in case of high latencies
# Maximum download delay, in seconds, applied when latency is high
# AUTOTHROTTLE_MAX_DELAY = 60


# Target concurrency per remote server; this is a number of requests, not a delay in seconds
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0


# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False


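# HTTP cache configuration; the cache is disabled by default.
# Responses to requests for the same page are cached, so a later identical request is served directly from the cache.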
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
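How the AutoThrottle settings in the file above interact can be sketched with a small function that follows the rules described in the AutoThrottle documentation: the target delay is the latency divided by the target concurrency, the next delay is the average of the current and target delays, the result is clamped between the minimum and maximum delay, and non-200 responses are not allowed to lower the delay. This is a simplified model and the actual extension may differ in details; next_delay and its parameter names are hypothetical.

```python
def next_delay(current_delay, latency, status=200,
               target_concurrency=1.0, min_delay=0.0, max_delay=60.0):
    """Return the download delay to use after one response has been observed.

    min_delay plays the role of DOWNLOAD_DELAY, max_delay of
    AUTOTHROTTLE_MAX_DELAY, target_concurrency of AUTOTHROTTLE_TARGET_CONCURRENCY.
    """
    target_delay = latency / target_concurrency
    # Move halfway from the current delay towards the target delay.
    new_delay = (current_delay + target_delay) / 2.0
    # Keep the delay within the configured bounds.
    new_delay = min(max(min_delay, new_delay), max_delay)
    # Error responses tend to come back quickly; they must not speed the crawl up.
    if status != 200 and new_delay < current_delay:
        return current_delay
    return new_delay


if __name__ == "__main__":
    delay = 5.0  # plays the role of AUTOTHROTTLE_START_DELAY
    for latency in (1.0, 1.0, 4.0):
        delay = next_delay(delay, latency)
        print(delay)
```

Starting from the initial delay, the value drifts towards the observed latency rather than jumping to it, which is why AUTOTHROTTLE_START_DELAY only matters for the first few requests.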