1.Scrapy with MongoDB
https://doc.scrapy.org/en/latest/topics/item-pipeline.html#write-items-to-mongodb
Write items to MongoDB
In this example we’ll write items to MongoDB using pymongo. MongoDB address and database name are specified in Scrapy settings; MongoDB collection is named after item class.
The main point of this example is to show how to use the from_crawler() method and how to clean up the resources properly:
import pymongo

class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item
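For the pipeline above to run, it has to be enabled in the project settings, and the two keys that from_crawler() reads have to be defined there. A minimal settings.py sketch follows; the module path myproject.pipelines is hypothetical, and the URI assumes a MongoDB instance on the local machine with the default port.

```python
# settings.py (sketch; adjust names to your own project)

# Connection settings read by MongoPipeline.from_crawler()
MONGO_URI = "mongodb://localhost:27017/"   # assumption: local instance, default port
MONGO_DATABASE = "items"                    # same value as the pipeline's fallback

# Enable the pipeline. "myproject.pipelines" is a hypothetical module path;
# the integer (0-1000) sets the order in which pipelines run, lower first.
ITEM_PIPELINES = {
    "myproject.pipelines.MongoPipeline": 300,
}
```

If MONGO_DATABASE is left out, the pipeline falls back to 'items'. MONGO_URI has no fallback in the code above, so crawler.settings.get('MONGO_URI') returns None when it is missing.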
2.MongoDB Tutorial
https://api.mongodb.com/python/current/tutorial.html
Create a data folder and start a MongoDB instance
C:\Users\win7>mongod --dbpath e:\mongodb\db
Connect to the database
from pymongo import MongoClient
client = MongoClient()
# client = MongoClient('localhost', 27017)
# client = MongoClient('mongodb://localhost:27017/')
db = client.test_database
# db = client['test-database']
A collection (the equivalent of a table) has documents inserted into it one at a time
posts = db.posts
# posts = db['posts']
import datetime
post = {"author": "Mike",
        "text": "My first blog post!",
        "tags": ["mongodb", "python", "pymongo"],
        "date": datetime.datetime.utcnow()}
post2 = {"author": "Martin",
         "text": "My second blog post!",
         "tags": ["mongodb", "python", "pymongo"],
         "date": datetime.datetime.utcnow()}
# Equivalent to result = posts.insert_one(post) followed by post_id = result.inserted_id.
# For insert_many the result has inserted_ids instead, which is a list.
post_id = posts.insert_one(post).inserted_id
posts.insert_one(post2)
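Inserting documents with duplicate content is allowed
After the insert, post3 is updated automatically (it gains an _id field), so running posts.insert_one(post3) again fails because that ObjectId already exists
If post4 = post3.copy() was run before inserting post3, post4 has no _id yet and can still be inserted with the same content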
In [689]: post3 = {"author": "Mike",
...: "text": "My first blog post!",
...: "tags": ["mongodb", "python", "pymongo"],
...: "date": datetime.datetime.utcnow()}
In [690]: posts.insert_one(post3)
Out[690]: <pymongo.results.InsertOneResult at 0xb803788>
In [691]: post3
Out[691]:
{'_id': ObjectId('59e57919fca565500c8e3692'),
'author': 'Mike',
'date': datetime.datetime(2017, 10, 17, 3, 29, 14, 966000),
'tags': ['mongodb', 'python', 'pymongo'],
'text': 'My first blog post!'}
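The behaviour above can be reproduced without a running server. The sketch below uses FakeCollection, a hypothetical in-memory stand-in (not part of pymongo) that copies only two things from the real insert_one(): it adds a generated _id to the dict it receives, and it rejects an _id it has already stored. Real pymongo generates an ObjectId and raises pymongo.errors.DuplicateKeyError; the counter and KeyError here are simplifications.

```python
import itertools


class FakeCollection(object):
    """Hypothetical stand-in for a pymongo collection (illustration only)."""

    def __init__(self):
        self._docs = {}
        self._ids = itertools.count(1)  # stands in for ObjectId generation

    def insert_one(self, document):
        if "_id" not in document:
            document["_id"] = next(self._ids)  # mutates the caller's dict
        if document["_id"] in self._docs:
            raise KeyError("duplicate _id: %r" % document["_id"])
        self._docs[document["_id"]] = dict(document)
        return document["_id"]


posts = FakeCollection()
post3 = {"author": "Mike", "text": "My first blog post!"}
post4 = post3.copy()           # copied before the insert, so it has no _id

posts.insert_one(post3)
print("_id" in post3)          # True: the insert added _id in place
print("_id" in post4)          # False: the copy is untouched

posts.insert_one(post4)        # same content, new _id, so this succeeds

try:
    posts.insert_one(post3)    # post3 still carries its old _id
except KeyError as exc:
    print("rejected: %s" % exc)
```

The point is that the duplicate check is on _id only, never on the other fields, which is why two documents with identical content can coexist.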
Check and confirm:
db.collection_names(include_system_collections=False)
posts.count()
import pprint
# find_one() returns a single document that matches the filter;
# with no filter it gets the first document from the posts collection
pprint.pprint(posts.find_one())
posts.find_one({"author": "Mike"})
# find() returns a Cursor instance, which allows us to iterate over all matching documents;
# it accepts a filter as well, e.g. posts.find({"author": "Mike"})
for i in posts.find():
    print(i)
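collection_names() and count() were deprecated in pymongo 3.7 and removed in 4.0. A sketch of the same checks with their replacements, list_collection_names() and count_documents(), might look like this. The helper name check_posts and its return layout are made up for illustration; db is assumed to be a pymongo Database such as client.test_database.

```python
def check_posts(db, author="Mike"):
    """Collect basic checks on the posts collection (hypothetical helper).

    db is assumed to be a pymongo Database object, pymongo 3.7 or newer.
    """
    posts = db["posts"]
    return {
        # replaces db.collection_names(include_system_collections=False)
        "collections": db.list_collection_names(),
        # replaces posts.count(); the empty filter matches every document
        "total": posts.count_documents({}),
        "by_author": posts.count_documents({"author": author}),
        # a single matching document, or None if nothing matches
        "first": posts.find_one({"author": author}),
        # find() returns a Cursor; iterating it yields each matching document
        "texts": [doc["text"] for doc in posts.find({"author": author})],
    }
```

With a server running, pprint.pprint(check_posts(client.test_database)) would print the result for the data inserted earlier.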
c:\program files\anaconda2\lib\site-packages\pymongo\cursor.py
A cursor / iterator over Mongo query results.
In [707]: posts.find()
Out[707]: <pymongo.cursor.Cursor at 0x118a62b0>
In [708]: a=posts.find()
In [709]: a?
Type: Cursor
String form: <pymongo.cursor.Cursor object at 0x00000000116C6208>
File: c:\program files\anaconda2\lib\site-packages\pymongo\cursor.py
Docstring: A cursor / iterator over Mongo query results.
Init docstring:
Create a new cursor.
Should not be called directly by application developers - see
:meth:`~pymongo.collection.Collection.find` instead.
.. mongodoc:: cursors