MongoDB and Scrapy usage

1. Using MongoDB with Scrapy

https://doc.scrapy.org/en/latest/topics/item-pipeline.html#write-items-to-mongodb

Write items to MongoDB

In this example we’ll write items to MongoDB using pymongo. MongoDB address and database name are specified in Scrapy settings; MongoDB collection is named after item class.

The main point of this example is to show how to use the from_crawler() method and how to clean up resources properly:

import pymongo

class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item
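
For the pipeline above to run, it has to be enabled in the project settings. Below is a minimal sketch of the relevant settings.py entries. The module path myproject.pipelines.MongoPipeline, the URI, and the priority value 300 are assumptions for illustration and do not come from the original example.

```python
# settings.py (sketch): the project name "myproject" is hypothetical

# Register the pipeline. The integer (conventionally 0-1000) sets the order
# in which pipelines run, lower values first.
ITEM_PIPELINES = {
    "myproject.pipelines.MongoPipeline": 300,
}

# Read by MongoPipeline.from_crawler() through crawler.settings.get().
MONGO_URI = "mongodb://localhost:27017/"  # assumed local instance
MONGO_DATABASE = "items"  # same value as the fallback in from_crawler()
```

If MONGO_DATABASE is left out, from_crawler() falls back to 'items'; MONGO_URI has no fallback in the example, so it should be set explicitly.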

2. MongoDB Tutorial

https://api.mongodb.com/python/current/tutorial.html

Create the data folder and start a MongoDB instance

C:\Users\win7>mongod --dbpath e:\mongodb\db

Connect to the database

from pymongo import MongoClient
client = MongoClient()
# client = MongoClient('localhost', 27017)
# client = MongoClient('mongodb://localhost:27017/')

db = client.test_database
# db = client['test-database']
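
The commented-out MongoClient calls above describe the same server in two notations. As a rough illustration only, the standard library can take the URI form apart to show the host and port it encodes. This uses Python 3's urllib.parse and is not how PyMongo parses URIs internally.

```python
from urllib.parse import urlsplit  # Python 3 standard library

uri = "mongodb://localhost:27017/"
parts = urlsplit(uri)

# hostname and port correspond to the two positional arguments in
# MongoClient('localhost', 27017)
host, port = parts.hostname, parts.port
print(host, port)
```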

Insert documents one by one into a collection (the equivalent of a table)

posts = db.posts
# posts = db['posts']

import datetime
post = {"author": "Mike",
        "text": "My first blog post!",
        "tags": ["mongodb", "python", "pymongo"],
        "date": datetime.datetime.utcnow()}
        
post2 = {"author": "Martin",
         "text": "My second blog post!",
         "tags": ["mongodb", "python", "pymongo"],
         "date": datetime.datetime.utcnow()}

# Equivalent to: result = posts.insert_one(post), then post_id = result.inserted_id.
# insert_many() exposes inserted_ids instead, which is a list.
post_id = posts.insert_one(post).inserted_id

posts.insert_one(post2)
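
For the insert_many case, a small helper can make the returned list of ids explicit. This is a sketch: insert_posts is a hypothetical name, and collection is assumed to be a PyMongo Collection such as db.posts.

```python
def insert_posts(collection, docs):
    """Insert several documents at once and return their ids as a list.

    `collection` is assumed to be a pymongo Collection; `docs` must be a
    non-empty list of dicts.
    """
    result = collection.insert_many(docs)  # returns an InsertManyResult
    return result.inserted_ids  # one _id per document, in input order
```

It could be called as insert_posts(db.posts, [new_post_a, new_post_b]) with dicts that have not been inserted before; dicts that already carry an _id from an earlier insert would be rejected as duplicates.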

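Documents with duplicate content are allowed

After the insert, post3 has been updated in place (it now carries an _id field), so running posts.insert_one(post3) again fails with a duplicate ObjectId error.

If you run post4 = post3.copy() before inserting post3, post4 has no _id yet and can be inserted with the same content.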

In [689]: post3 = {"author": "Mike",
     ...:         "text": "My first blog post!",
     ...:         "tags": ["mongodb", "python", "pymongo"],
     ...:         "date": datetime.datetime.utcnow()}

In [690]: posts.insert_one(post3)
Out[690]: <pymongo.results.InsertOneResult at 0xb803788>

In [691]: post3
Out[691]:
{'_id': ObjectId('59e57919fca565500c8e3692'),
 'author': 'Mike',
 'date': datetime.datetime(2017, 10, 17, 3, 29, 14, 966000),
 'tags': ['mongodb', 'python', 'pymongo'],
 'text': 'My first blog post!'}
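
The in-place update shown above suggests a simple pattern: insert a copy so the original dict never picks up an _id. The sketch below assumes collection is a PyMongo Collection; insert_copy is a hypothetical helper name.

```python
def insert_copy(collection, doc):
    """Insert a shallow copy of `doc`, leaving the caller's dict unchanged.

    insert_one() adds an '_id' key to the dict it receives, so passing a
    copy keeps `doc` free of '_id' and lets it be inserted again later.
    """
    to_insert = dict(doc)  # shallow copy, same effect as doc.copy()
    return collection.insert_one(to_insert).inserted_id
```

Calling insert_copy(posts, post3) twice would store two documents with the same content and different ids, provided post3 itself has not already been passed to insert_one (in which case the copy inherits its _id and the insert fails).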

Verify:

db.list_collection_names()  # replaces collection_names(include_system_collections=False), deprecated in PyMongo 3.7 and removed in 4.0

posts.count_documents({})  # replaces posts.count(), deprecated in PyMongo 3.7 and removed in 4.0

import pprint
pprint.pprint(posts.find_one())  # returns at most one document matching the filter; with no filter it gets the first document from the posts collection

posts.find_one({"author": "Mike"})

# find() returns a Cursor instance, an iterator over all matching documents.
# A filter works the same way, e.g. posts.find({"author": "Mike"}).
for i in posts.find():
    print(i)
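
Since find() hands back a lazy cursor, a common step is to pull just the fields of interest into a list. The following is a sketch; texts_by_author is a hypothetical helper, and collection is assumed to be a PyMongo Collection whose documents have "author" and "text" fields like the posts above.

```python
def texts_by_author(collection, author):
    """Return the 'text' field of every document written by `author`."""
    cursor = collection.find({"author": author})  # lazy Cursor, nothing fetched yet
    return [doc["text"] for doc in cursor]  # iterating pulls the matching documents
```

For example, texts_by_author(posts, "Mike") would collect the text of each of Mike's posts, and an author with no posts gives an empty list.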

c:\program files\anaconda2\lib\site-packages\pymongo\cursor.py

A cursor / iterator over Mongo query results.

In [707]: posts.find()
Out[707]: <pymongo.cursor.Cursor at 0x118a62b0>

In [708]: a=posts.find()

In [709]: a?
Type:           Cursor
String form:    <pymongo.cursor.Cursor object at 0x00000000116C6208>
File:           c:\program files\anaconda2\lib\site-packages\pymongo\cursor.py
Docstring:
A cursor / iterator over Mongo query results.

Init docstring:
Create a new cursor.

Should not be called directly by application developers - see
:meth:`~pymongo.collection.Collection.find` instead.

.. mongodoc:: cursors
