Ryana

This post covers installing the Scrapy framework and using it to build a small crawler.

Installing Scrapy

From a command prompt, change into the C:\Anaconda2\Scripts directory and run: conda install Scrapy

 

Creating a Scrapy project

1) Change into the directory where the project should live and run scrapy startproject projectname to create it.

The new project's files and layout:

demo/
    scrapy.cfg
    demo/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

These files are:

  • scrapy.cfg: the project's configuration file
  • demo/: the project's Python module.
  • demo/items.py: the project's item definitions, i.e. the fields to be scraped.
  • demo/pipelines.py: the project's pipelines, i.e. how the scraped data is stored.
  • demo/settings.py: the project's settings, i.e. how Scrapy components are configured; this part is more involved and can be left alone at first.
  • demo/spiders/: the directory holding the spider code, i.e. how the crawling itself is done.

 

Defining the crawler files

1) Define the Item

# items.py
import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()
    rate = scrapy.Field()
    tag = scrapy.Field()

2) Define the spider

# coding:utf8
import json
import re
import sys
import urllib

import scrapy

from douban.items import DoubanItem

reload(sys)
sys.setdefaultencoding('utf-8')  # Python 2 workaround so the unicode tags survive str()


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["douban.com"]
    start_urls = [
        "https://movie.douban.com/j/search_subjects?type=movie&tag=热门&sort=recommend&page_limit=1000&page_start=0",
    ]

    def start_requests(self):
        # Issue one request per tag against douban's JSON search endpoint
        reqs = []
        tags = [u'热门', u'最新', u'经典', u'豆瓣高分', u'冷门佳片', u'华语', u'欧美',
                u'韩国', u'日本', u'动作', u'喜剧', u'爱情', u'科幻', u'悬疑', u'恐怖', u'文艺']
        for i in tags:
            url = 'https://movie.douban.com/j/search_subjects?type=movie&tag=' + str(i) + '&sort=recommend&page_limit=1000&page_start=0'
            reqs.append(scrapy.Request(url))
        return reqs

    def parse(self, response):
        # The endpoint returns JSON; recover the (URL-encoded) tag from the request URL
        tag = urllib.unquote(re.findall(u'tag=(.*?)&', response.url)[0])

        subjects = json.loads(response.body)['subjects']
        items = []
        for a in subjects:
            pre_item = DoubanItem()  # the item class defined in items.py above
            pre_item['url'] = a['url']
            pre_item['title'] = a['title']
            pre_item['rate'] = a['rate']
            pre_item['tag'] = tag
            items.append(pre_item)
        return items
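The parse() method above boils down to decoding the endpoint's JSON and recovering the tag from the request URL. A standalone sketch of those two steps, using a hand-made payload (the real response comes from movie.douban.com, so the JSON below is only an assumed sample shaped after the fields the spider reads):

```python
# -*- coding: utf-8 -*-
import json
import re
try:
    from urllib import unquote        # Python 2
except ImportError:
    from urllib.parse import unquote  # Python 3

# The request URL carries the tag percent-encoded ('热门' here)
url = ('https://movie.douban.com/j/search_subjects?type=movie'
       '&tag=%E7%83%AD%E9%97%A8&sort=recommend&page_limit=1000&page_start=0')
tag = unquote(re.findall(r'tag=(.*?)&', url)[0])

# Hypothetical stand-in for response.body; only the shape matters
body = json.dumps({'subjects': [
    {'url': 'https://movie.douban.com/subject/1/',
     'title': 'Example Movie', 'rate': '8.7'},
]})

# Build one record per subject, tagging each with the decoded tag
items = []
for a in json.loads(body)['subjects']:
    items.append({'url': a['url'], 'title': a['title'],
                  'rate': a['rate'], 'tag': tag})
```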

3) Define the pipeline

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo


class DoubanPipeline(object):

    def open_spider(self, spider):
        # Open one connection per crawl, using the settings defined below
        self.client = pymongo.MongoClient(spider.settings.get('MONGO_HOST'),
                                          spider.settings.get('MONGO_PORT'))
        self.db = self.client[spider.settings.get('MONGO_DB')]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Store each scraped item as one MongoDB document
        self.db[spider.settings.get('MONGO_COLL')].insert_one(dict(item))
        return item
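Before storage, process_item converts the Scrapy Item into a plain dict, which is exactly what MongoDB stores as a document. Since pymongo needs a running mongod, here is a dry sketch of just that transformation, with a plain dict standing in for a populated item and a list standing in for the collection:

```python
# A plain dict stands in for a populated DoubanItem; a list stands in
# for the MongoDB collection (no running mongod needed for this sketch).
item = {'title': 'Example Movie',
        'url': 'https://movie.douban.com/subject/1/',
        'rate': '8.7',
        'tag': u'热门'}

collection = []

def process_item(item):
    # Mirrors DoubanPipeline.process_item: insert a dict copy, return the item
    collection.append(dict(item))
    return item

returned = process_item(item)
```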

4) Define the settings

MONGO_HOST = "127.0.0.1"  # MongoDB host IP
MONGO_PORT = 27017  # port
MONGO_DB = "Spider"  # database name
MONGO_COLL = "douban"  # collection name
# MONGO_USER = "Ryana"
# MONGO_PSW = "123456"
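As the boilerplate comment in the pipeline file notes, the pipeline only runs once it is registered in ITEM_PIPELINES. A minimal settings.py fragment for that (douban is this project's module name; 300 is an arbitrary priority in the 0-1000 range, lower runs earlier):

```python
# settings.py: enable the MongoDB pipeline
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}
```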

 

Running the spider

Change into the project's directory and run scrapy crawl spiderName (here: scrapy crawl dmoz). For inspecting the results in MongoDB, the GUI tool Robomongo is worth recommending; the crawl results are shown in the screenshot below.

[screenshot: crawl results viewed in Robomongo]