open_spider method runs twice when using CrawlerProcess

【Question title】: open_spider method runs twice when using CrawlerProcess
【Posted】: 2018-08-04 20:53:08
【Question】:

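I want to run multiple spiders, so I tried using CrawlerProcess. But I found that the open_spider method runs twice: once at the beginning, and once more after process_item has finished.

As a result, my collection is cleared when the spider opens, the data is then saved to MongoDB, and at the end my collection is cleared again.

How can I fix this, and why does the open_spider method run twice?

I run the project by typing scrapy crawl movies.

Here is my movies.py: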

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
import time

# scrapy api imports
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from Tainan.FirstSpider import FirstSpider

class MoviesSpider(scrapy.Spider):
    name = 'movies'
    allowed_domains = ['tw.movies.yahoo.com', 'movies.yahoo.com.tw']
    start_urls = ['http://tw.movies.yahoo.com/movie_thisweek.html/']

process = CrawlerProcess(get_project_settings())

process.crawl(FirstSpider)
process.start()

Here is my FirstSpider.py:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request


class FirstSpider(scrapy.Spider):
    name = 'first'
    allowed_domains = ['tw.movies.yahoo.com', 'movies.yahoo.com.tw']
    start_urls = ['http://tw.movies.yahoo.com/movie_thisweek.html/']

    def parse(self, response):
        movieHrefs = response.xpath('//*[@class="release_movie_name"]/a/@href').extract()       
        for movieHref in movieHrefs:
            yield Request(movieHref, callback=self.parse_page)

    def parse_page(self, response):
        print 'FirstSpider => parse_page'
        movieImage = response.xpath('//*[@class="foto"]/img/@src').extract()
        cnName = response.xpath('//*[@class="movie_intro_info_r"]/h1/text()').extract()
        enName = response.xpath('//*[@class="movie_intro_info_r"]/h3/text()').extract()
        movieDate = response.xpath('//*[@class="movie_intro_info_r"]/span/text()')[0].extract()
        movieTime = response.xpath('//*[@class="movie_intro_info_r"]/span/text()')[1].extract()
        imdbScore = response.xpath('//*[@class="movie_intro_info_r"]/span/text()')[3].extract()
        movieContent = response.xpath('//*[@class="gray_infobox_inner"]/span/text()').extract_first().strip()
        yield {'image': movieImage, 'cnName': cnName, 'enName': enName, 'movieDate': movieDate, 'movieTime': movieTime, 'imdbScore': imdbScore, 'movieContent': movieContent}

Here is my pipelines.py:

from pymongo import MongoClient
from scrapy.conf import settings

class MongoDBPipeline(object):

    global open_count
    open_count = 1
    global process_count
    process_count = 1

    def __init__(self):
        connection = MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]
    # My issue is here: by the end, the open_spider count printed is 2.
    def open_spider(self, spider):
        global open_count
        print 'Pipelines => open_spider count =>'
        print open_count
        open_count += 1
        self.collection.remove({})
    # open_spider is called the first time, and process_item saves data to my MongoDB.
    # But once process_item has completed, open_spider runs again, so the data I saved is removed.
    def process_item(self, item, spider):
        global process_count
        print 'Pipelines => process_item count =>'
        print process_count
        process_count += 1
        self.collection.insert(dict(item))
        return item

I can't figure this out. Any help would be appreciated. Thanks in advance.

【Comments】:

Tags: python scrapy pymongo

【Solution 1】:

"How can I fix this, and why does the open_spider method run twice?"

The open_spider method runs once per spider, and you are running two spiders.

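"I run the project by typing scrapy crawl movies"

The crawl command runs the spider named movies (MoviesSpider).
To do that, it has to import the movies module, and importing that module also runs your FirstSpider, because the CrawlerProcess code sits at module level.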
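To make the import side effect concrete, here is a minimal sketch that does not need Scrapy at all. The module name movies_demo, the events list, and the stand-in MoviesSpider class are hypothetical names chosen for illustration. The only point is that top-level statements run whenever a module is imported, while code under an if __name__ == "__main__": guard does not.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Source of a throwaway module that mimics the layout of movies.py:
# a class definition followed by statements at module level.
MODULE_SOURCE = textwrap.dedent('''
    events = []

    class MoviesSpider:  # stand-in for the real spider class
        name = "movies"

    # Plays the role of process.crawl(...) / process.start() at top level.
    events.append("module-level code ran")

    if __name__ == "__main__":
        # Only reached when the file is executed directly as a script.
        events.append("guarded code ran")
''')


def import_demo_module():
    """Write the demo module to a temporary directory and import it."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "movies_demo.py").write_text(MODULE_SOURCE)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            module = importlib.import_module("movies_demo")
        finally:
            sys.path.remove(tmp)
    return module


demo = import_demo_module()
print(demo.events)  # ['module-level code ran']
```

If that is what happens here, one possible fix is to move the CrawlerProcess lines out of movies.py into a separate script, or to put them under such a guard, so that scrapy crawl movies no longer starts FirstSpider as a side effect of the import.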

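Now, how to fix this depends on what you are trying to do.
Maybe you should run only one spider, or use separate settings for each spider, or do something else entirely.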
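As one possible reading of "separate settings for each spider", the sketch below gives every spider its own collection, keyed by spider.name, so that opening a second spider cannot wipe the first spider's data. Everything here is an assumption for illustration: InMemoryCollection is a hypothetical stand-in for a pymongo collection, a plain dict stands in for the database, and PerSpiderPipeline is not the asker's pipeline. The method names delete_many and insert_one mirror current pymongo; the remove and insert calls in the question are the older, since-deprecated spellings.

```python
from types import SimpleNamespace


class InMemoryCollection:
    """Hypothetical stand-in for a pymongo collection (demo only)."""

    def __init__(self):
        self.documents = []

    def delete_many(self, query):
        # This sketch ignores the filter and always removes everything,
        # which matches the only call made below: delete_many({}).
        self.documents.clear()

    def insert_one(self, document):
        self.documents.append(document)


class PerSpiderPipeline:
    """Keeps each spider's items in a collection named after the spider."""

    def __init__(self, database):
        # `database` is a plain dict standing in for a pymongo database.
        self.database = database

    def open_spider(self, spider):
        # Clear only the collection that belongs to the spider being opened.
        collection = self.database.setdefault(spider.name, InMemoryCollection())
        collection.delete_many({})

    def process_item(self, item, spider):
        self.database[spider.name].insert_one(dict(item))
        return item


database = {}
first = SimpleNamespace(name="first")
movies = SimpleNamespace(name="movies")

# FirstSpider opens, stores two items, and finishes.
first_pipeline = PerSpiderPipeline(database)
first_pipeline.open_spider(first)
first_pipeline.process_item({"cnName": "A"}, first)
first_pipeline.process_item({"cnName": "B"}, first)

# MoviesSpider opens afterwards; it only clears its own collection.
movies_pipeline = PerSpiderPipeline(database)
movies_pipeline.open_spider(movies)

print(len(database["first"].documents))   # 2
print(len(database["movies"].documents))  # 0
```

With a real pymongo database, the lookup would be db[spider.name] rather than a dict setdefault. Whether per-spider collections are the right design depends on whether the spiders are meant to share data.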

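【Discussion】:

But I don't yield any data in movies.py; I only yield it in FirstSpider.py. I thought open_spider should run only once.
Yielding data doesn't matter. open_spider is called when the spider starts, before Scrapy knows whether it will return any items. If you don't want the movies spider to do anything, why run it at all?
So running two spiders is what made open_spider run twice. I understand now! Thank you very much!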