Web Crawlers with Scrapy -- Basic Operations

Reference blog: https://www.cnblogs.com/wupeiqi/p/6229292.html

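Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, such as data mining, information processing, and archiving historical data.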

Scrapy uses the Twisted asynchronous networking library to handle network communication.


Installing Scrapy

Linux
      pip3 install scrapy

Windows
      a. pip3 install wheel
      b. Download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
      c. In the download directory, run: pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl
         (choose the wheel that matches your Python version and architecture)
      d. pip3 install scrapy
      e. Download and install pywin32: https://sourceforge.net/projects/pywin32/files/
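After running the steps above, it can be handy to confirm that the packages are importable from the interpreter you intend to use. The following is a minimal sketch using only the standard library; the helper name `is_installed` is made up for this example.

```python
from importlib import util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be found in this environment."""
    return util.find_spec(package) is not None


# Report the status of the two packages the install steps are meant to provide.
for pkg in ("scrapy", "twisted"):
    print(pkg, "installed" if is_installed(pkg) else "missing")
```

If either package is reported as missing, the pip3 used for installation probably belongs to a different Python interpreter than the one running the check.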

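Basic Scrapy workflow

1. Create a project: scrapy startproject <project_name>

2. Create a spider (task): scrapy genspider <spider_name> <domain>

3. Run the spider: scrapy crawl <spider_name>

PS: scrapy list    # lists the names of all spiders in the project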

Scrapy project structure

scrapy_test\
    |---commads\
        |--crawlall.py    # custom scrapy command; its source is listed below


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author calmyan
# scrapy_test
# 2018/6/7    11:14
# __author__='Administrator'

from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings


class Command(ScrapyCommand):

    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        # spider_loader replaces the deprecated crawler_process.spiders attribute
        spider_list = self.crawler_process.spider_loader.list()  # names of all spiders in the project
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)  # schedule the spider for execution
        self.crawler_process.start()  # start all scheduled spiders concurrently

crawlall.py
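For Scrapy to discover a custom command such as crawlall.py, the project settings need to point at the package that contains it, via the COMMANDS_MODULE setting. A minimal sketch follows, assuming the commads directory sits inside the scrapy_test package next to settings.py and also contains an empty __init__.py; neither detail is shown in the structure above.

```python
# settings.py (fragment)
# Assumption: crawlall.py lives in scrapy_test/commads/, and that directory
# contains an __init__.py so it can be imported as a package.
COMMANDS_MODULE = "scrapy_test.commads"
```

With this in place, the command takes its name from the file name, so scrapy crawlall should appear in the output of scrapy -h and run every spider in the project.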