Bulk update is too slow

【Question title】: Bulk update is too slow
【Posted】: 2016-10-07 17:50:19
【Question】:

I am doing a bulk update with pymongo.
The names list below is a list of distinct names (each name may have several documents in the collection).

Code 1:

bulk = db.collection.initialize_unordered_bulk_op()
for name in names:  
    bulk.find({"A":{"$exists":False},'Name':name}).update({"$set":{'B':b,'C':c,'D':d}})
print bulk.execute()

Code 2:

bulk = db.collection.initialize_unordered_bulk_op()
counter = 0
for name in names:  
    bulk.find({"A":{"$exists":False},'Name':name}).update({"$set":{'B':b,'C':c,'D':d}})
    counter =counter + 1
    if (counter % 100 == 0):
        print bulk.execute()
        bulk = db.collection.initialize_unordered_bulk_op()
if (counter % 100 != 0):
    print bulk.execute()

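My collection has 50000 documents. If I remove the counter and the if statement (Code 1), the code hangs! With the if statement (Code 2), I assumed the operation would take no more than a few minutes, but it takes far longer than that! Can you help me speed it up, or is my assumption wrong?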

【Comments】:

Tags: mongodb pymongo

【Solution 1】:

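Most likely you forgot to add indexes to support your queries! Without them, each of your operations triggers a full collection scan, which is painfully slow (as you have noticed).

The code below times both update_many and the bulk API, first without and then with indexes on the 'name' and 'A' fields. The numbers speak for themselves.

Note: I did not have the patience to run the no-index case on 50000 documents, so I ran it on 10000 instead. The results for 10000 documents: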

Without index, update_many: 38.6 seconds
Without index, bulk update: 28.7 seconds
With index, update_many: 3.9 seconds
With index, bulk update: 0.52 seconds

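With the indexes in place, 50000 documents took 2.67 seconds. I ran the tests on a Windows machine, with mongo running in Docker on the same host.

For more about indexes, see https://docs.mongodb.com/manual/indexes/#indexes. In short: indexes are ideally held in RAM and let MongoDB look up matching documents quickly instead of scanning the whole collection. An index has to be chosen specifically to match your queries.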
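To make the "in short" above more concrete, here is a toy illustration in plain Python. It is not how MongoDB implements indexes (those are B-tree structures managed by the server); the dict is only a stand-in for "a lookup structure keyed on the queried field", and every name in it (`docs`, `find_by_scan`, `build_index`, `find_by_index`) is made up for this sketch:

```python
from collections import defaultdict

# 10000 fake documents sharing 1000 distinct names (10 documents per name)
docs = [{'name': 'n%d' % (i % 1000), 'value': i} for i in range(10000)]


def find_by_scan(docs, name):
    """Without an index: look at every document (a full collection scan)."""
    return [doc for doc in docs if doc['name'] == name]


def build_index(docs, field):
    """Build the lookup structure once: field value -> document positions."""
    index = defaultdict(list)
    for position, doc in enumerate(docs):
        index[doc[field]].append(position)
    return index


def find_by_index(docs, index, name):
    """With an index: jump straight to the matching documents."""
    return [docs[position] for position in index.get(name, [])]


index = build_index(docs, 'name')
assert find_by_scan(docs, 'n7') == find_by_index(docs, index, 'n7')
```

Running one update per distinct name means one such lookup per name. With the scan, the work grows as (number of names) x (number of documents); with the index, each lookup touches only the documents that match.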

from pymongo import MongoClient
import random
from timeit import timeit


col = MongoClient()['test']['test']

col.drop() # erase all documents in collection 'test'
docs = []

# create 10000 documents, using a random number between 0 and 1 converted
# to a string as the name; documents whose number is > 0.5 also get the key A
for i in range(0, 10000):
    number = random.random()
    if number > 0.5:
        doc = {'name': str(number),
        'A': True}
    else:
        doc = {'name': str(number)}
    docs.append(doc)

col.insert_many(docs) # insert all documents into the collection
names = col.distinct('name') # get all distinct values for the key name from the collection


def update_with_update_many():
    # one update_many call (one round trip) per name
    for name in names:
        col.update_many({'A': {'$exists': False}, 'name': name},
                        {'$set': {'B': 1, 'C': 2, 'D': 3}})

def update_with_bulk():
    # queue all updates and send them as one unordered bulk operation
    # (initialize_unordered_bulk_op was removed in PyMongo 4.0;
    # Collection.bulk_write replaces it)
    bulk = col.initialize_unordered_bulk_op()
    for name in names:
        bulk.find({'A': {'$exists': False}, 'name': name}).\
            update({'$set': {'B': 1, 'C': 2, 'D': 3}})
    bulk.execute()

print(timeit(update_with_update_many, number=1))  # without indexes
print(timeit(update_with_bulk, number=1))         # without indexes
col.create_index('A') # this adds an index on key A
col.create_index('name') # this adds an index on key name
print(timeit(update_with_update_many, number=1))  # with indexes
print(timeit(update_with_bulk, number=1))         # with indexes
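On PyMongo 4.0 and later, `initialize_unordered_bulk_op` is no longer available, so the bulk variant has to go through `Collection.bulk_write` instead. The following is a minimal sketch rather than part of the original answer. It assumes a `col` collection like the one above and the lowercase `name` field of the test documents; the helper names `chunked` and `update_in_batches` and the default batch size of 1000 are illustrative choices only:

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def update_in_batches(col, names, batch_size=1000):
    """Send one unordered bulk_write per batch of names.

    `col` is expected to be a pymongo Collection. PyMongo is imported
    locally so that `chunked` stays usable without it installed.
    """
    from pymongo import UpdateMany

    modified = 0
    for batch in chunked(names, batch_size):
        requests = [
            UpdateMany({'A': {'$exists': False}, 'name': name},
                       {'$set': {'B': 1, 'C': 2, 'D': 3}})
            for name in batch
        ]
        # ordered=False lets the server apply the updates in any order
        result = col.bulk_write(requests, ordered=False)
        modified += result.modified_count
    return modified
```

Calling `update_in_batches(col, names)` after creating the indexes should behave like `update_with_bulk`, with the difference that the requests are sent in several smaller batches. Whether batching helps at all depends on the data size; the indexes are what matter most here.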

【Comments】:

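Thanks for your help, but I don't think the timings you gave above are right, because they are not for 10000 documents but only for half of them (considering > 0.5 and
Also, how do indexes speed up the process? Could you share the theory behind it?
I added some more information to my answer. Besides that, MongoDB offers quite good free online courses: university.mongodb.com/courses/M101P/about. I suggest you take one of them to get up to speed with mongo quickly.
Thanks for your help! :)
I know this is old, but if you want a good GUI way to create indexes, you can use MongoDB Compass. It walks you through the process very intuitively, and you can see the size of each index relative to the others.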