【Question title】: Remove duplicate values in MongoDB
【Posted】: 2016-03-28 15:12:40
【Question description】:

I am learning MongoDB using Python and Tornado. I have a MongoDB collection, and when I run

db.cal.find()

{     
    "Pid" : "5652f92761be0b14889d9854",
    "Registration" : "TN 56 HD 6766",
    "Vid" : "56543ed261be0b0a60a896c9",
    "Period" : "10-2015",
    "AOs": [
        "14-10-2015",
        "15-10-2015",
        "18-10-2015",
        "14-10-2015",
        "15-10-2015",
        "18-10-2015"
    ],
    "Booked": [
        "5-10-2015",
        "7-10-2015",
        "8-10-2015",
        "5-10-2015",
        "7-10-2015",
        "8-10-2015"
    ],
    "NA": [
        "1-10-2015",
        "2-10-2015",
        "3-10-2015",
        "4-10-2015",
        "1-10-2015",
        "2-10-2015",
        "3-10-2015",
        "4-10-2015"
    ],

    "AOr": [
        "23-10-2015",
        "27-10-2015",
        "23-10-2015",
        "27-10-2015"
    ]
}

I need an operation that removes the duplicate values in Booked, NA, AOs and AOr. The final document should be

{
    "Pid" : "5652f92761be0b14889d9854",
    "Registration" : "TN 56 HD 6766",
    "Vid" : "56543ed261be0b0a60a896c9",
    "AOs": [
        "14-10-2015",
        "15-10-2015",
        "18-10-2015"
    ],
    "Booked": [
        "5-10-2015",
        "7-10-2015",
        "8-10-2015"
    ],
    "NA": [
        "1-10-2015",
        "2-10-2015",
        "3-10-2015",
        "4-10-2015"
    ],
    "AOr": [
        "23-10-2015",
        "27-10-2015"
    ]
}

How can I achieve this in MongoDB?

【Question discussion】:

Tags: python mongodb mongodb-query pymongo

【Solution 1】:

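You cannot use the "dropDups" syntax here in the first place, because it was deprecated as of MongoDB 2.6 and removed in MongoDB 3.0, so it is not even available.

To remove the duplicates from each list, you need to use the set class in Python.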

import pymongo


fields = ['Booked', 'NA', 'AOs', 'AOr']
client = pymongo.MongoClient()
db = client.test
collection = db.cal
# Legacy bulk API (removed in PyMongo 4.0)
bulk = collection.initialize_ordered_bulk_op()
count = 0
for document in collection.find():
    # set() drops duplicates but does not preserve the original order
    update = {field: list(set(document[field])) for field in fields}
    bulk.find({'_id': document['_id']}).update_one({'$set': update})
    count = count + 1
    if count % 200 == 0:
        # Send the current batch and start a new one
        bulk.execute()
        bulk = collection.initialize_ordered_bulk_op()

# Flush the remaining operations; executing an empty batch raises an error
if count % 200 != 0:
    bulk.execute()

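MongoDB 3.2 deprecates Bulk() and its associated methods and provides the .bulkWrite() method instead. PyMongo exposes this method as bulk_write() (available since PyMongo 3.0). The first thing to do when using it is to import the UpdateOne class.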

from pymongo import UpdateOne


# Reuses `fields` and `collection` from the previous snippet
requests = []  # list of write operations
for document in collection.find():
    update = {field: list(set(document[field])) for field in fields}
    requests.append(UpdateOne({'_id': document['_id']}, {'$set': update}))
if requests:  # bulk_write() raises on an empty list
    collection.bulk_write(requests)

Both approaches give the same expected result (the order within each array may change, because set is unordered):

{'AOr': ['27-10-2015', '23-10-2015'],
 'AOs': ['15-10-2015', '14-10-2015', '18-10-2015'],
 'Booked': ['7-10-2015', '5-10-2015', '8-10-2015'],
 'NA': ['1-10-2015', '4-10-2015', '3-10-2015', '2-10-2015'],
 'Period': '10-2015',
 'Pid': '5652f92761be0b14889d9854',
 'Registration': 'TN 56 HD 6766',
 'Vid': '56543ed261be0b0a60a896c9',
 '_id': ObjectId('567f808fc6e11b467e59330f')}
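
The result above shows the dates in a different order than in the original document, which is a side effect of going through set. If the original order matters, one possible alternative is sketched below. The helper name dedupe_keep_order is hypothetical and not part of the answer; it relies only on the fact that dict keys are unique and keep insertion order in Python 3.7+.

```python
def dedupe_keep_order(values):
    """Return values without duplicates, keeping the first occurrence of each."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(values))


booked = ["5-10-2015", "7-10-2015", "8-10-2015",
          "5-10-2015", "7-10-2015", "8-10-2015"]
print(dedupe_keep_order(booked))
```

Under that assumption, list(set(document[field])) in the snippets above could be swapped for dedupe_keep_order(document[field]) without changing anything else.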

【Discussion】:

【Solution 2】:

Working solution

I created a working JavaScript-based solution that can be used in the mongo shell:

var codes = ["AOs", "Booked", "NA", "AOr"]

// Use bulk operations for efficiency
var bulk = db.dupes.initializeUnorderedBulkOp()
var pending = 0

db.dupes.find().forEach(
  function(doc) {

    // Collect only the arrays that actually contained duplicates,
    // to prevent unnecessary operations
    var changes = {}
    var changed = false

    codes.forEach(
      function(code) {
        var values = doc[code] || []
        var uniq = []

        for (var i = 0; i < values.length; i++) {
          // Keep a value only the first time it is seen
          if (uniq.indexOf(values[i]) == -1) {
            uniq.push(values[i])
          }
        }

        if (uniq.length < values.length) {
          changes[code] = uniq
          changed = true
        }

      }
    )

    // Update the document only if something was changed
    if (changed) {
      bulk.find({"_id": doc._id}).updateOne({"$set": changes})
      pending++
    }
  }
)

// Apply all changes (executing an empty bulk raises an error)
if (pending > 0) {
  bulk.execute()
}

The resulting document for the sample input:

replset:PRIMARY> db.dupes.find().pretty()
{
  "_id" : ObjectId("567931aefefcd72d0523777b"),
  "Pid" : "5652f92761be0b14889d9854",
  "Registration" : "TN 56 HD 6766",
  "Vid" : "56543ed261be0b0a60a896c9",
  "Period" : "10-2015",
  "AOs" : [
    "14-10-2015",
    "15-10-2015",
    "18-10-2015"
  ],
  "Booked" : [
    "5-10-2015",
    "7-10-2015",
    "8-10-2015"
  ],
  "NA" : [
    "1-10-2015",
    "2-10-2015",
    "3-10-2015",
    "4-10-2015"
  ],
  "AOr" : [
    "23-10-2015",
    "27-10-2015"
  ]
}
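
Since the question is tagged python, the same "write only what changed" idea can be sketched in Python as well. This is a minimal sketch, not part of the original answer: the function name build_dedupe_updates and the sample documents are hypothetical, and the function only builds filter/update pairs without talking to a server.

```python
def build_dedupe_updates(documents, fields):
    """Yield (filter, update) pairs for documents whose arrays hold duplicates."""
    for doc in documents:
        changes = {}
        for field in fields:
            values = doc.get(field, [])
            unique = list(dict.fromkeys(values))  # order-preserving dedupe
            if len(unique) < len(values):
                changes[field] = unique
        if changes:  # skip documents that need no write at all
            yield {"_id": doc["_id"]}, {"$set": changes}


# Hypothetical sample data, shaped like the documents in the question
sample = [
    {"_id": 1, "AOr": ["23-10-2015", "27-10-2015", "23-10-2015"], "NA": ["1-10-2015"]},
    {"_id": 2, "AOr": ["23-10-2015"], "NA": []},
]
updates = list(build_dedupe_updates(sample, ["AOr", "NA"]))
print(updates)
```

Each pair could then be wrapped as pymongo's UpdateOne(filter, update) and handed to collection.bulk_write(), provided the resulting list is not empty.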

Using an index with `dropDups`

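This simply does not work. First, as of version 3.0 this option no longer exists. Since 3.2 has already been released, we should find a portable way.

Second, even with dropDups, the documentation clearly states:

dropDups boolean: MongoDB indexes only the first occurrence of a key and removes all documents from the collection that contain subsequent occurrences of that key.

So if another document had the same value in one of the bill codes as a previous one, the whole document would be deleted.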
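
To make the consequence concrete, here is a simplified pure-Python model of that documented behaviour. It is only an illustration under stated assumptions: simulate_drop_dups is a hypothetical helper, it models a scalar key only, and the second registration number is made up. It does not call MongoDB.

```python
def simulate_drop_dups(documents, key):
    """Keep only the first document seen for each value of `key`."""
    seen = set()
    kept = []
    for doc in documents:
        value = doc[key]
        if value in seen:
            continue  # the whole document is dropped, not just the value
        seen.add(value)
        kept.append(doc)
    return kept


docs = [
    {"_id": 1, "Registration": "TN 56 HD 6766", "Period": "10-2015"},
    {"_id": 2, "Registration": "TN 56 HD 6766", "Period": "11-2015"},
    {"_id": 3, "Registration": "TN 01 AB 1234", "Period": "10-2015"},  # made-up value
]
kept = simulate_drop_dups(docs, "Registration")
print([d["_id"] for d in kept])
```

In this model the document for period 11-2015 disappears entirely, which is data loss rather than the in-array deduplication the question asks for.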

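【Discussion】:

You can use one of the methods shown in "Remove Duplicates from JavaScript Array" to remove the duplicates from these arrays, and then use the $set operator with bulk operations to update the documents. Also note that MongoDB 3.2 deprecates Bulk() and its associated methods.
There is neither jQuery nor ECMAScript 6 on the shell, right? ;) I don't see what the disadvantage of this way of identifying unique values is. But good point about 3.2, I will add a solution for that as well.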

【Solution 3】:

Assuming you want to remove duplicate dates from the collection, you can add a unique index with the dropDups: true option:

db.bill_codes.ensureIndex({"fieldName":1}, {unique: true, dropDups: true})

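Note: back up your database first, in case this does not do exactly what you expect.

【Discussion】:

This will only delete other documents that have exactly the same value in one of the fields.
I get an error: { "createdCollectionAutomatically" : false, "numIndexesBefore" : 1, "errmsg" : "exception: bad index key pattern { Registration: \"TN 56 HD 6766\", Pid : \"5652f92761be0b14889d9854\" }: Unknown index plugin 'TN 56 HD 676'", "code" : 67, "ok" : 0 }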
You have to specify the key of your collection to index on, not the field name together with its value.
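This is not only outdated but plainly wrong, and without the advice to make a backup it would be outright dangerous. The deprecated dropDups deleted all documents that happened to have the same key value in the index, not the duplicate values.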

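【Solution 4】:

Have you tried distinct()?

Link: https://docs.mongodb.org/v3.0/reference/method/db.collection.distinct/

Specify a query with distinct

The following example returns the distinct values of the field sku, embedded in the item field, from the documents whose dept is equal to "A":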

db.inventory.distinct( "item.sku", { dept: "A" } )

The method returns the following array of distinct sku values:

[ "111", "333" ]

【Discussion】:

This does not reduce the data that is stored.
