This post covers how to install the Scrapy framework and how to use it.
Installing Scrapy
From the command line, change into the C:\Anaconda2\Scripts directory and run: conda install scrapy
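To confirm the install succeeded, import Scrapy from the Anaconda Python and print its version; a minimal check (the exact version string depends on what conda installed):

# run in the Anaconda2 python to confirm Scrapy is importable
import scrapy
print scrapy.__version__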
Creating a Scrapy Project
1) Change into the directory where you want the project stored and run scrapy startproject <project-name>; the project here is named douban.
The new project directory and its contents:
douban/
    scrapy.cfg
    douban/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
These files are:
scrapy.cfg: the project's configuration file.
douban/: the project's Python module; all the code below goes in here.
douban/items.py: the project's item file, which declares what will be scraped.
douban/pipelines.py: the project's pipelines file, which defines how scraped data is stored.
douban/settings.py: the project's settings file, for customizing Scrapy components; the defaults are fine to start with.
douban/spiders/: the directory for spider code, which implements the crawl itself.
Defining the Crawler
1) Define the Item

# items.py
import scrapy

class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()
    rate = scrapy.Field()
    tag = scrapy.Field()
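An Item behaves like a dict restricted to its declared fields, which is what the pipeline below relies on when it calls dict(item). A quick sketch with made-up values:

# Items support dict-style access; assigning an undeclared key raises KeyError
from douban.items import DoubanItem

item = DoubanItem(title=u'Example', rate='9.0')
item['tag'] = u'hot'
print dict(item)  # {'title': u'Example', 'rate': '9.0', 'tag': u'hot'}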
2) Define the spider

# coding: utf-8
# spiders/dmoz_spider.py
import re
import json
import urllib
import sys

import scrapy
from douban.items import DoubanItem

# Python 2 workaround so unicode tags can be mixed with byte strings
reload(sys)
sys.setdefaultencoding('utf-8')


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["douban.com"]

    def start_requests(self):
        # issue one request per tag; the JSON endpoint returns
        # up to page_limit movies for each tag
        tags = [u'热门', u'最新', u'经典', u'豆瓣高分', u'冷门佳片', u'华语', u'欧美',
                u'韩国', u'日本', u'动作', u'喜剧', u'爱情', u'科幻', u'悬疑',
                u'恐怖', u'文艺']
        reqs = []
        for tag in tags:
            url = ('https://movie.douban.com/j/search_subjects?type=movie&tag=' +
                   str(tag) + '&sort=recommend&page_limit=1000&page_start=0')
            reqs.append(scrapy.Request(url))
        return reqs

    def parse(self, response):
        # the tag comes back percent-encoded in the request URL
        tag = urllib.unquote(re.findall(u'tag=(.*?)&', response.url)[0])
        data = json.loads(response.body)
        items = []
        for movie in data['subjects']:
            item = DoubanItem()
            item['url'] = movie['url']
            item['title'] = movie['title']
            item['rate'] = movie['rate']
            item['tag'] = tag
            items.append(item)
        return items
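This endpoint returns JSON rather than HTML, which is why parse() uses json.loads instead of Scrapy's selectors. A standalone sketch of the same extraction, run against a made-up one-movie response body:

# coding: utf-8
# illustrative only: a fake response body in the shape the endpoint returns
import json
import re
import urllib

body = '{"subjects": [{"title": "Example", "rate": "9.0", "url": "https://movie.douban.com/subject/0/"}]}'
url = ('https://movie.douban.com/j/search_subjects?type=movie'
       '&tag=%E7%83%AD%E9%97%A8&sort=recommend&page_limit=1000&page_start=0')

tag = urllib.unquote(re.findall('tag=(.*?)&', url)[0])  # -> '热门' as utf-8 bytes
for movie in json.loads(body)['subjects']:
    print movie['title'], movie['rate'], tag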
3) Define the pipeline

# -*- coding: utf-8 -*-
# pipelines.py
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# (shown in the settings step below).
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo


class DoubanPipeline(object):
    def open_spider(self, spider):
        # connect to MongoDB once per crawl, using the values in settings.py
        self.client = pymongo.MongoClient(spider.settings.get('MONGO_HOST'),
                                          spider.settings.get('MONGO_PORT'))
        db = self.client[spider.settings.get('MONGO_DB')]
        self.coll = db[spider.settings.get('MONGO_COLL')]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # items are dict-like, so they can be stored directly
        self.coll.insert(dict(item))
        return item
4) Define the settings

# settings.py: MongoDB connection values read by the pipeline
MONGO_HOST = "127.0.0.1"  # host IP
MONGO_PORT = 27017        # port
MONGO_DB = "Spider"       # database name
MONGO_COLL = "douban"     # collection name
# MONGO_USER = "Ryana"
# MONGO_PSW = "123456"
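The pipeline also has to be switched on in settings.py, or Scrapy will never call it; assuming the project module is named douban as above:

# enable the pipeline; the value (0-1000) orders it among other pipelines
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}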
Running the Spider
Change into the project directory and run scrapy crawl dmoz (the argument is the spider's name attribute, not its file name). For browsing the stored data, the MongoDB GUI Robomongo is worth a look; the results of the run are shown in the figure below.
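To spot-check the stored data without a GUI, a short pymongo query against the database and collection configured above will do:

# coding: utf-8
# count the stored movies and sample a few from one tag
import pymongo

coll = pymongo.MongoClient("127.0.0.1", 27017)["Spider"]["douban"]
print coll.count()
for doc in coll.find({'tag': u'热门'}).limit(3):
    print doc['title'], doc['rate']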