Created spider 'meiju' using template 'basic' in module:
movie.spiders.meiju
----------------------------------
- scrapy.cfg  project configuration; mainly gives the Scrapy command-line tool a base config
- (the settings that actually matter for crawling live in settings.py)
- items.py  defines the data-storage templates used to structure scraped data, similar to a Django Model
- pipelines.py  item-processing behaviour, e.g. persisting the structured data
- settings.py  the configuration file, e.g. crawl depth, concurrency, download delay
- spiders  the spider directory; create files here and write the crawling rules
-----------------------------------------
---------------Define the item template (1)--------------------
items.py
import scrapy

class MovieItem(scrapy.Item):
    # define the fields for your item here, like:
    name = scrapy.Field()
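A scrapy.Item behaves like a dict restricted to its declared fields: assigning a key that was not declared with scrapy.Field() raises a KeyError. A minimal stdlib stand-in (not the real class, which collects the Field declarations through a metaclass) illustrates the behaviour:

```python
# Stand-in for scrapy.Item's dict-like behaviour; illustrative only.
class DictLikeItem(dict):
    fields = {'name'}  # mirrors the declaration: name = scrapy.Field()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f'{key} is not a declared field')
        super().__setitem__(key, value)

item = DictLikeItem()
item['name'] = 'Some Show'   # ok: 'name' was declared
try:
    item['year'] = 2015      # not declared -> KeyError, as with scrapy.Item
except KeyError as exc:
    print(exc)
```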
----------------------Write the spider (2)---------------------------
# -*- coding: utf-8 -*-
import scrapy
from movie.items import MovieItem

class MeijuSpider(scrapy.Spider):
    name = 'meiju'
    allowed_domains = ['meijutt.com']
    start_urls = ['http://meijutt.com/']

    def parse(self, response):
        movies = response.xpath('//div[@class="list_2"]/ul/li')
        for each_movie in movies:
            item = MovieItem()
            item['name'] = each_movie.xpath('./a/@title').extract()[0]
            yield item
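The extraction in parse() can be tried standalone. The sketch below uses only the standard library against a small snippet that mimics the assumed structure of the meijutt.com list page; note that xml.etree.ElementTree requires well-formed markup, whereas Scrapy's selectors are built on lxml/parsel and tolerate real-world HTML:

```python
# Standalone sketch of the spider's XPath logic (stdlib only).
import xml.etree.ElementTree as ET

# Hypothetical snippet mimicking the target page's list structure.
html = """
<html><body>
  <div class="list_2">
    <ul>
      <li><a title="Show One">Show One</a></li>
      <li><a title="Show Two">Show Two</a></li>
    </ul>
  </div>
</body></html>
"""

root = ET.fromstring(html)
names = [li.find('./a').get('title')
         for li in root.findall('.//div[@class="list_2"]/ul/li')]
print(names)  # ['Show One', 'Show Two']
```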
-------------Configure settings (3)----------------
Add the following to settings.py. The value 100 is the pipeline's priority: when several pipelines are enabled, Scrapy runs them in ascending order of this integer (conventionally 0-1000).
ITEM_PIPELINES = {'movie.pipelines.MoviePipeline': 100}
--------------Write the item pipeline (4)------------------
pipelines.py
class MoviePipeline(object):
    def process_item(self, item, spider):
        # append each title to my_meiju.txt, one per line
        with open("my_meiju.txt", 'a', encoding='utf-8') as fp:
            fp.write(item['name'] + '\n')
        return item  # hand the item on to any later pipeline
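The pipeline can be exercised without running Scrapy at all: a plain dict stands in for the MovieItem, and spider is not used. The path parameter below is a demo-only addition (the real pipeline writes to a fixed my_meiju.txt), used here so the sketch stays self-contained:

```python
# Stand-alone exercise of the pipeline logic; a dict stands in for the item.
import os
import tempfile

class MoviePipeline(object):
    def __init__(self, path):
        self.path = path  # demo-only: real pipeline hardcodes my_meiju.txt

    def process_item(self, item, spider):
        with open(self.path, 'a', encoding='utf-8') as fp:
            fp.write(item['name'] + '\n')
        return item

path = os.path.join(tempfile.mkdtemp(), 'my_meiju.txt')
pipeline = MoviePipeline(path)
for name in ['Show One', 'Show Two']:
    pipeline.process_item({'name': name}, spider=None)

with open(path, encoding='utf-8') as fp:
    print(fp.read().splitlines())  # ['Show One', 'Show Two']
```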
---------Run the spider (5)-----------
scrapy crawl meiju --nolog