根据字段删除重复文档答案

【问题标题】：Remove duplicate documents based on field根据字段删除重复文档
【发布时间】：2016-11-17 12:05:41
【问题描述】：

我已经看到了很多解决方案，但是它们都适用于 Mongo v2，不适合 V3。

我的文档如下所示：

    { 
    "_id" : ObjectId("582c98667d81e1d0270cb3e9"), 
    "asin" : "B01MTKPJT1", 
    "url" : "https://www.amazon.com/Trump-President-Presidential-Victory-T-Shirt/dp/B01MTKPJT1%3FSubscriptionId%3DAKIAIVCW62S7NTZ2U2AQ%26tag%3Dselfbalancingscooters-21%26linkCode%3Dxm2%26camp%3D2025%26creative%3D165953%26creativeASIN%3DB01MTKPJT1", 
    "image" : "http://ecx.images-amazon.com/images/I/41RvN8ud6UL.jpg", 
    "salesRank" : NumberInt(442137), 
    "title" : "Trump Wins 45th President Presidential Victory T-Shirt", 
    "brand" : "\"Getting Political On Me\"", 
    "favourite" : false, 
    "createdAt" : ISODate("2016-11-16T17:33:26.763+0000"), 
    "updatedAt" : ISODate("2016-11-16T17:33:26.763+0000")
}

我的收藏包含大约 50 万份文档。我想删除所有 ASIN 相同的重复文档（1 个除外）

我怎样才能做到这一点？

【问题讨论】：

你试过这个answer吗？
@chridam 我不喜欢这个答案。这不是要走的路（
您打算保留哪个文件？最后创建或更新？
@Styvane 说实话并不重要，因为其余部分的内容是相同的（除了 id、createdAt、updatedAt）
@Styvane 考虑到 OP 的数据库大小，这不是最好的，但排序的概念就足够了。可以通过利用 Bulk API 方法而不是集合的 remove() 方法来更好地优化。

标签： mongodb mongodb-query aggregation-framework

【解决方案1】：

这是我们实际上可以使用聚合框架而不是客户端处理来完成的事情。

MongoDB 3.4

db.collection.aggregate(
    [ 
        { "$sort": { "_id": 1 } }, 
        { "$group": { 
            "_id": "$asin", 
            "doc": { "$first": "$$ROOT" } 
        }}, 
        { "$replaceRoot": { "newRoot": "$doc" } },
        { "$out": "collection" }
    ]

)

MongoDB 版本

db.collection.aggregate(
    [ 
        { "$sort": { "_id": 1 } }, 
        { "$group": { 
            "_id": "$asin", 
            "doc": { "$first": "$$ROOT" } 
        }}, 
        { "$project": { 
            "asin": "$doc.asin", 
            "url": "$doc.url", 
            "image": "$doc.image", 
            "salesRank": "$doc.salesRank", 
            "title": "$doc.salesRank", 
            "brand": "$doc.brand", 
            "favourite": "$doc.favourite", 
            "createdAt": "$doc.createdAt", 
            "updatedAt": "$doc.updatedAt" 
        }},
        { "$out": "collection" }
    ]
)

【讨论】：

感谢您的回答。我想我在查询中做错了——我运行了第一个，但我所有的收藏都不见了：/这样说是为了以防有人在没有备份的情况下运行它
我使用的是 MongoDB 3.4，它的工作原理非常棒，非常感谢。很容易添加更多这样的键'_id': {"country":"$country", "city":"$city", 'language':"$language" }
如果您不想覆盖当前集合（例如用作查询），请删除 { "$out": "collection" }，或使用 {"$out": "newcollection"} 创建新集合（名为“newcollection”）。
@SkippyleGrandGourou 这是因为我们希望为每个文档维护最旧的版本。 _id 在这种情况下是 ObjectId 是基于时间戳的。
壮观的答案！节省了我大量的时间。老实说，应该在MongoDB的文档中放置一个更具体的例子来实现$replaceRoot和$$ROOT

【解决方案2】：

使用 for 循环，虽然会花费一些时间，但会完成工作

db.amazon_sales.find({}, {asin:1}).sort({_id:1}).forEach(function(doc){
    db.amazon_sales.remove({_id:{$gt:doc._id}, asin:doc.asin});
})

然后和这个索引

db.amazon_sales.createIndex( { "asin": 1 }, { unique: true } )

【讨论】：