Find mongo documents while ignoring duplicate values, mongo side: answers

[Question title]: find mongo documents while ignoring duplicate values mongo side
[Posted]: 2017-02-09 21:55:53
[Question description]:

(Question inspired by this one)

Given a dataset:

db.mycollection.insert([
  {a:1, b:2, c:3},
  {a:1, b:3, c:4},
  {a:0, b:1, c:3},
  {a:3, b:2, c:4},
  {a:4, b:1, c:4}
])

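I want to find one and only one document for each given value of a key (say, a should be between 0 and 3) and ignore that value in the rest of the lookup. That is, once a document with the value 1 for a has been found, the search should not return any further document with 1 as its a value. The order of the results can be determined by the value of another key.

In our example, the expected output would be: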

# Findings are sorted by value of the b key
[{a:0, b:1, c:3}, {a:3, b:2, c:4}, {a:1, b:2, c:3}]
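
To pin down the behaviour being asked for, here is a minimal pure-Python sketch of "first document per value of a, ordered by b". It runs on plain dicts rather than on MongoDB, and the helper name `first_per_key` and its parameters are made up for illustration. When two documents tie on b, the relative order of the results is arbitrary (here it follows input order), so it may differ from the listing above.

```python
def first_per_key(docs, key, order_by, allowed):
    """Keep one document per distinct `key` value: the one with the smallest `order_by`."""
    seen = set()
    picked = []
    # sorted() is stable, so documents that tie on `order_by` keep their input order.
    for doc in sorted(docs, key=lambda d: d[order_by]):
        value = doc[key]
        if value in allowed and value not in seen:
            seen.add(value)
            picked.append(doc)
    return picked


docs = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 1, "b": 3, "c": 4},
    {"a": 0, "b": 1, "c": 3},
    {"a": 3, "b": 2, "c": 4},
    {"a": 4, "b": 1, "c": 4},
]

result = first_per_key(docs, key="a", order_by="b", allowed=set(range(4)))
print(result)
```

This is only a reference for the semantics; doing the same on millions of documents client-side is exactly what the question wants to avoid.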

Here is the code I am working with; after it runs I have to remove the duplicates on my side rather than on the mongo side.

import pymongo, pandas

# A cursor yields documents, so collect them in a list (dict() would fail here)
result = list(db.mycollection.find({'a': {'$in': list(range(4))}}).sort('b', pymongo.ASCENDING))

print(result)
>>> [{a:0, b:1, c:3}, {a:3, b:2, c:4}, {a:1, b:2, c:3}, {a:1, b:3, c:4}]

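Since the collection I am working with may contain millions of documents, I need the "ignore duplicates" part to be done on the mongo side, which saves memory and data transfer time.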

[Comments]:

Sort by the b key, then group by a and pick the first document of each group.

Tags: mongodb mongodb-query pymongo

[Solution 1]:

From Veeram's comment:

l = [i for i in range(4)]

result = db.mycollection.aggregate([{'$sort': {'b': 1}},
                           {'$group': {
                              '_id': '$a',
                              'data': {'$first': '$$ROOT'}
                                      }
                            },
                            {'$match': {'_id': {'$in': l}}}])

result_list = [i['data'] for i in result]

print(result_list) # Omitted the ObjectId that should appear too
>>>[{'a': 3, 'b': 2, 'c': 4},
    {'a': 1, 'b': 2, 'c': 3},
    {'a': 0, 'b': 1, 'c': 3}]

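This seems to work for me. Just be aware that the results are not necessarily sorted by the 'b' key: the $sort stage only decides which document $first picks within each 'a' group, and $group does not guarantee any order for the groups it outputs.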
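
If the final order matters, one possible variation is sketched below. It is an assumption, not part of Veeram's comment: it moves the `$match` to the front so that fewer documents reach `$sort` and `$group`, unwraps the grouped document with `$replaceRoot` (which requires MongoDB 3.4 or later), and sorts again at the end. The helper name `build_pipeline` is hypothetical.

```python
def build_pipeline(allowed_values, key="a", order_by="b"):
    """Build an aggregation pipeline returning one document per `key` value."""
    return [
        # Filter first so later stages only see the relevant documents.
        {"$match": {key: {"$in": list(allowed_values)}}},
        # Define the order that $first relies on inside each group.
        {"$sort": {order_by: 1}},
        # Keep the first whole document seen for each distinct key value.
        {"$group": {"_id": "$" + key, "data": {"$first": "$$ROOT"}}},
        # Promote the stored document back to the top level (MongoDB 3.4+).
        {"$replaceRoot": {"newRoot": "$data"}},
        # $group output is unordered, so sort the surviving documents again.
        {"$sort": {order_by: 1}},
    ]


pipeline = build_pipeline(range(4))
# With a live connection (not shown here), this would be run as:
# result_list = list(db.mycollection.aggregate(pipeline))
```

Documents that tie on 'b' can still come back in either order; adding a second key to the final `$sort` would make the output deterministic.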

[Discussion]: