【发布时间】:2018-08-04 20:53:08
【问题描述】:
我想运行多个蜘蛛,所以我尝试使用CrawlerProcess。但是我发现open_spider方法会在开头和结尾运行两次process_item方法。
它导致当蜘蛛打开时,我删除了我的集合并将数据保存到 mongodb 完成。它最终会再次删除我的收藏。
我该如何解决这个问题以及为什么方法 open_spider 运行两次?
我输入scrapy crawl movies 运行项目:
这是我的电影.py:
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
import time
# scrapy api imports
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from Tainan.FirstSpider import FirstSpider
class MoviesSpider(scrapy.Spider):
name = 'movies'
allowed_domains = ['tw.movies.yahoo.com', 'movies.yahoo.com.tw']
start_urls = ['http://tw.movies.yahoo.com/movie_thisweek.html/']
process = CrawlerProcess(get_project_settings())
process.crawl(FirstSpider)
process.start()
这是我的 FirstSpider.py:
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
class FirstSpider(scrapy.Spider):
name = 'first'
allowed_domains = ['tw.movies.yahoo.com', 'movies.yahoo.com.tw']
start_urls = ['http://tw.movies.yahoo.com/movie_thisweek.html/']
def parse(self, response):
movieHrefs = response.xpath('//*[@class="release_movie_name"]/a/@href').extract()
for movieHref in movieHrefs:
yield Request(movieHref, callback=self.parse_page)
def parse_page(self, response):
print 'FirstSpider => parse_page'
movieImage = response.xpath('//*[@class="foto"]/img/@src').extract()
cnName = response.xpath('//*[@class="movie_intro_info_r"]/h1/text()').extract()
enName = response.xpath('//*[@class="movie_intro_info_r"]/h3/text()').extract()
movieDate = response.xpath('//*[@class="movie_intro_info_r"]/span/text()')[0].extract()
movieTime = response.xpath('//*[@class="movie_intro_info_r"]/span/text()')[1].extract()
imdbScore = response.xpath('//*[@class="movie_intro_info_r"]/span/text()')[3].extract()
movieContent = response.xpath('//*[@class="gray_infobox_inner"]/span/text()').extract_first().strip()
yield {'image': movieImage, 'cnName': cnName, 'enName': enName, 'movieDate': movieDate, 'movieTime': movieTime, 'imdbScore': imdbScore, 'movieContent': movieContent}
这是我的 pipelines.py:
from pymongo import MongoClient
from scrapy.conf import settings
class MongoDBPipeline(object):
global open_count
open_count = 1
global process_count
process_count = 1
def __init__(self):
connection = MongoClient(
settings['MONGODB_SERVER'],
settings['MONGODB_PORT'])
db = connection[settings['MONGODB_DB']]
self.collection = db[settings['MONGODB_COLLECTION']]
# My issue is here it will print open_spider count = 2 finally.
def open_spider(self, spider):
global open_count
print 'Pipelines => open_spider count =>'
print open_count
open_count += 1
self.collection.remove({})
# open_spider method call first time and process_item save data to my mongodb.
# but when process_item completed, open_spider method run again...it cause my data that i have saved it has been removed.
def process_item(self, item, spider):
global process_count
print 'Pipelines => process_item count =>'
print process_count
process_count += 1
self.collection.insert(dict(item))
return item
我想不通,有人可以帮助我,将不胜感激。提前致谢。
【问题讨论】: