MongoDB query takes too long: answers

【Question title】: mongodb query takes too long time
【Posted】: 2018-01-06 17:42:15
【Question description】:

I have the following documents in my MongoDB collection:

{'name' : 'abc-1','parent':'abc', 'price': 10}
{'name' : 'abc-2','parent':'abc', 'price': 5}
{'name' : 'abc-3','parent':'abc', 'price': 9}
{'name' : 'abc-4','parent':'abc', 'price': 11}

{'name' : 'efg', 'parent':'', 'price': 10}
{'name' : 'efg-1','parent':'efg', 'price': 5}
{'name' : 'abc-2','parent':'efg','price': 9}
{'name' : 'abc-3','parent':'efg','price': 11}

I want to perform the following operations:

a. Group By distinct parent
b. Sort all the groups based on price
c. For each group select a document with minimum price
  i. Check whether each record's parent sku exists as a record in the name field
  ii. If the name exists, do nothing
  iii. If the record does not exist, insert a document with parent as empty and the other values taken from the record selected previously (the one with the minimum price).
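
To make the expected result concrete, here is a minimal in-memory Python sketch of steps a to c applied to the sample documents above. It does not talk to MongoDB at all; the helper name `missing_parent_docs` is hypothetical, and treating an empty `parent` as "top-level record, nothing to check" is an assumption read from the sample data.

```python
# Sample documents copied from the question.
docs = [
    {'name': 'abc-1', 'parent': 'abc', 'price': 10},
    {'name': 'abc-2', 'parent': 'abc', 'price': 5},
    {'name': 'abc-3', 'parent': 'abc', 'price': 9},
    {'name': 'abc-4', 'parent': 'abc', 'price': 11},
    {'name': 'efg', 'parent': '', 'price': 10},
    {'name': 'efg-1', 'parent': 'efg', 'price': 5},
    {'name': 'abc-2', 'parent': 'efg', 'price': 9},
    {'name': 'abc-3', 'parent': 'efg', 'price': 11},
]


def missing_parent_docs(docs):
    """Return one new top-level document per parent that has no record of its own."""
    # Build the set of existing names once instead of querying per document.
    existing_names = {d['name'] for d in docs}

    # Steps a and c: keep the cheapest record of every parent group.
    cheapest = {}
    for d in docs:
        parent = d['parent']
        if not parent:  # assumption: '' marks a top-level record
            continue
        if parent not in cheapest or d['price'] < cheapest[parent]['price']:
            cheapest[parent] = d

    # Step b plus steps i to iii: walk the groups by price, keep the missing parents.
    new_docs = []
    for parent, d in sorted(cheapest.items(), key=lambda kv: kv[1]['price']):
        if parent not in existing_names:
            new_docs.append({**d, 'name': parent, 'parent': ''})
    return new_docs


print(missing_parent_docs(docs))
```

For the sample data only `abc` is missing as a name, so the sketch produces a single new document for `abc` carrying the price of its cheapest child.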

I tried the following:

db.file.find().sort({"price": 1}).forEach(function(doc){
    // One count query is issued for every document in the collection.
    var cnt = db.file.count({"sku": {"$eq": doc.parent}});
    if (cnt < 1){
        var newdoc = doc;
        newdoc.name = doc.parent;
        newdoc.parent = "";
        delete newdoc["_id"];
        db.file.insertOne(newdoc);
    }
});

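The problem is that it takes too much time. What is wrong here? How can it be optimized? Would an aggregation pipeline be a good solution, and if so, how would I write it?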

【Question discussion】:

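How many records do you have? Are your price and sku fields indexed?
1- Use name as the real _id ('_id' = name). 2- whole_DB = db.file.find() and then whole_DB.forEach(...): why do you scan the whole DB twice? 3- db.file.find() != db.file.aggregate(...), so not all DB entries are a search result. 4- db['PA'].aggregate(...), where P = product and A = first letter of product_name: use collections to avoid building one big hash file.
@GarbageCollector I cannot index them, because the collections are dynamic: I need to create them on the fly, search them, export them, and drop them.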

Tags: python mongodb pymongo

【Solution 1】:

1. Retrieve a set of product names ✔

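def iter_product_names():
    # Group on name so that each distinct name is yielded once.
    for product in db.file.aggregate([{'$group': {'_id': '$name'}}]):
        yield product['_id']

# Materialize the names as a set for fast membership tests.
product_names = set(iter_product_names())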

2. Retrieve the products with the minimum price for each group ✔

result_set = db.file.aggregate([
    {
        '$sort': {
            'price': 1,
        }
    }, 
    {
        '$group': {
            '_id': '$parent',
            'name': {
                '$first': '$name',
            }, 
            'price': {
                '$min': '$price',
            }
        }
    }, 
    {
        '$sort': {
            'price': 1,
        }
    }
])
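
To show what this pipeline returns, the following pure-Python sketch mimics the `$sort`, `$group` and final `$sort` stages on the sample documents from the question. It is an illustration only, not how MongoDB executes the stages, and the names `sample_docs` and `emulate_sort_group` are hypothetical.

```python
# Sample documents copied from the question.
sample_docs = [
    {'name': 'abc-1', 'parent': 'abc', 'price': 10},
    {'name': 'abc-2', 'parent': 'abc', 'price': 5},
    {'name': 'abc-3', 'parent': 'abc', 'price': 9},
    {'name': 'abc-4', 'parent': 'abc', 'price': 11},
    {'name': 'efg', 'parent': '', 'price': 10},
    {'name': 'efg-1', 'parent': 'efg', 'price': 5},
    {'name': 'abc-2', 'parent': 'efg', 'price': 9},
    {'name': 'abc-3', 'parent': 'efg', 'price': 11},
]


def emulate_sort_group(docs):
    """Mimic $sort by price, $group by parent ($first name, $min price), then $sort."""
    groups = {}
    for d in sorted(docs, key=lambda d: d['price']):  # '$sort': {'price': 1}
        # setdefault keeps the first document seen per parent, like '$first'.
        group = groups.setdefault(
            d['parent'],
            {'_id': d['parent'], 'name': d['name'], 'price': d['price']},
        )
        group['price'] = min(group['price'], d['price'])  # '$min': '$price'
    return sorted(groups.values(), key=lambda g: g['price'])  # final '$sort'


for group in emulate_sort_group(sample_docs):
    print(group)
```

Two things are visible in the output: the top-level records form a group of their own with `_id` equal to `''`, and every `name` produced by `$first` belongs to a document that already exists. If the goal is the one stated in the question, a later filter may therefore need to compare `_id` (the parent) against the existing names instead.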

3. Insert the products retrieved in 2 if their name is not in the set of product names retrieved in 1. ✔

from pymongo.operations import InsertOne

def insert_request(product):
    # Build an insert for a new top-level document (empty parent).
    return InsertOne({
        'name': product['name'],
        'price': product['price'],
        'parent': ''
    })

requests = [
    insert_request(product)
    for product in result_set
    if product['name'] not in product_names
]
# bulk_write raises InvalidOperation when given no operations.
if requests:
    db.file.bulk_write(requests)

Steps 2 and 3 can be implemented in the aggregation pipeline.

db.file.aggregate([
    {
        '$sort': {'price': 1}
    }, 
    {
        '$group': {
            '_id': '$parent',
            'name': {
                '$first': '$name'
            }, 
            'price': {
                '$min': '$price'
            },
        }
    }, 
    {
        '$sort': {
            'price': 1
        }
    }, 
    {
        '$project': {
            'name': 1, 
            'price': 1,
            '_id': 0, 
            'parent':''
        }
    }, 
    {
        '$match': {
            'name': {
                '$nin': list(product_names)
            }
        }
    }, 
    # Caution: $out replaces the whole 'file' collection with the pipeline output.
    {
        '$out': 'file'
    }
])
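
Because `$out` overwrites its target collection, a variant that appends documents may be closer to what the question asks for. The sketch below only builds the pipeline as plain Python data. It assumes a MongoDB version whose `$merge` stage can write to the collection being aggregated (the stage itself was introduced in 4.2). Following the wording of the question, it checks the parent of each group against the existing names, which differs from the pipeline above. `build_append_pipeline`, `existing_names` and `target` are hypothetical names.

```python
def build_append_pipeline(existing_names, target='file'):
    """Build a pipeline that adds one top-level document per missing parent."""
    # Parents to skip: the empty parent (assumed to mark top-level records)
    # and every parent that already exists as a name.
    skip = ['', *sorted(existing_names)]
    return [
        {'$match': {'parent': {'$nin': skip}}},
        {'$sort': {'price': 1}},
        # After the sort, $first picks the minimum price of each group.
        {'$group': {'_id': '$parent', 'price': {'$first': '$price'}}},
        {'$project': {
            '_id': 0,
            'name': '$_id',
            'parent': {'$literal': ''},
            'price': 1,
        }},
        # $merge adds to the target collection instead of replacing it.
        {'$merge': {
            'into': target,
            'whenMatched': 'keepExisting',
            'whenNotMatched': 'insert',
        }},
    ]


pipeline = build_append_pipeline({'efg', 'abc-1'})
```

With pymongo this would be run as `db.file.aggregate(pipeline)`, passing the set of names retrieved in step 1. A very large set of names makes the `$nin` list correspondingly large, so this is a sketch for moderately sized collections rather than a general recommendation.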

【Discussion】:

I did something a bit different, as follows: var all_docs = db.file.aggregate([{"$group": {"_id": "parent_id","price": {"$min": "$price"},"doc": {"$first": "$$ROOT"}}}]); all_docs.forEach(function(doc){ cnt = db.file.count({"sku": {"$eq": doc.parent_id}}); if (cnt