How to change date format and have concatenated string matches in mongodb query filter?

【Question title】: How to change date format and have concatenated string matches in mongodb query filter?
【Posted】: 2019-11-13 06:19:52
【Question】:

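I am matching two collections that live in 2 different databases against a criterion, and creating a new collection from the records that satisfy it.

The code below works with a simple criterion, but I need a different one.

Definitions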

function insertBatch(collection, documents) {
  var bulkInsert = collection.initializeUnorderedBulkOp();
  var insertedIds = [];
  var id;
  documents.forEach(function(doc) {
    id = doc._id;
    // Insert without raising an error for duplicates
    bulkInsert.find({_id: id}).upsert().replaceOne(doc);
    insertedIds.push(id);
  });
  bulkInsert.execute();
  return insertedIds;
}

function moveDocuments(sourceCollection, targetCollection, filter, batchSize) {
  print("Moving " + sourceCollection.find(filter).count() + " documents from " + sourceCollection + " to " + targetCollection);
  var count;
  while ((count = sourceCollection.find(filter).count()) > 0) {
    print(count + " documents remaining");
    sourceDocs = sourceCollection.find(filter).limit(batchSize);
    idsOfCopiedDocs = insertBatch(targetCollection, sourceDocs);

    targetDocs = targetCollection.find({_id: {$in: idsOfCopiedDocs}});
  }
  print("Done!")
}

Call

var db2 = new Mongo("<URI_1>").getDB("analy")
var db = new Mongo("<URI_2>").getDB("clone")
var readDocs= db2.coll1
var writeDocs= db.temp_coll
var Urls = new Mongo("<URI_2>").getDB("clone").myCollection.distinct("Url" ,{})
var filter= {"Url": {$in: Urls }}
moveDocuments(readDocs, writeDocs, filter, 10932)

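In short, my criterion is the set of distinct "Url" strings. Instead, I want the Url + Date string to be my criterion. There are two problems:

1. In one collection the date is stored as ISODate("2016-03-14T13:42:00.000+0000"), while in the other it is stored as the string "2018-10-22T14:34:40Z". How do I bring them into one format so that they can match each other?
2. Suppose we have a solution to 1. and have built a new array of concatenated strings, UrlsAndDate, instead of Urls. How do we create a similar concatenated field on the fly and match it against the other collection?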
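
The two points above can be pictured in Python as a minimal sketch. It assumes the ISODate values arrive as `datetime` objects (pymongo returns naive UTC datetimes by default) and that the string dates always have second precision with a trailing `Z`. The helper names and the `|` separator are hypothetical, not taken from the thread.

```python
from datetime import datetime, timezone

# Assumed string layout, e.g. "2018-10-22T14:34:40Z" (second precision)
ISO_Z = "%Y-%m-%dT%H:%M:%SZ"


def to_utc_datetime(value):
    """Normalise either representation to a timezone-aware UTC datetime."""
    if isinstance(value, str):
        return datetime.strptime(value, ISO_Z).replace(tzinfo=timezone.utc)
    if value.tzinfo is None:
        # Naive datetimes are treated as UTC, which is pymongo's default
        return value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc)


def url_date_key(url, value):
    """Build a concatenated 'Url + Date' key (hypothetical '|' separator)."""
    return url + "|" + to_utc_datetime(value).strftime(ISO_Z)
```

Both representations of the same instant then produce the same key. Sub-second precision is dropped by this format, so it only suits data where seconds are enough to tell documents apart.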

For example (non-functional code!):

var UrlsAndDate = new Mongo("<URI_2>").getDB("clone").myCollection.distinct("Url"+"formated_Date" ,{})
var filter= {"Url"+"formated_Date": {$in: Urls }}
readDocs.find(filter)
...and do the same stuff as above!

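Any suggestions?

There is a brute-force solution, but it is not feasible!

Problem: I want to merge 2 collections, mycoll and coll1. Both have fields named Url and date. mycoll has 35,000 documents and coll1 has 4.7M documents (16+ GB), which cannot be loaded into memory.

Algorithm, written using the pymongo client: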

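Iterate over mycoll
- Create a src string "url+common_date_format"
- Try to find a match in coll1. Because coll1 is huge, I cannot load it into memory and treat it as a dictionary! So I iterate over every document in this collection again and again.
  1. Iterate over coll1
    - Create a dest string "url+common_date_format"
    - If src_string == dest_string, insert this document into a new collection named temp_coll

This is a terrible algorithm: it is O(35000*4.7M) and takes ages to complete! If I could load the 4.7M documents into memory, the run time would drop to O(35000), which is doable!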

Any suggestions for another algorithm?

【Comments】:

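Can the collection with 4.7 million records contain multiple documents with the same {url, date}?
No. There can be >1 match on url alone, which is why I decided to pair it with the date, so that I get a unique match!
So you are saying the combination of url and date is unique?
Yes! Do you have a better algorithm? Any suggestions?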

Tags: python algorithm mongodb-query aggregation-framework pymongo

【Solution 1】:

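The first thing I would do is create a compound index on the collection with {url: 1, date: 1}, if one does not already exist. Say collection A has 35k documents and collection B has 4.7M documents. We cannot load all 4.7M documents into memory. You are iterating over a cursor on B in the inner loop, and I assume that once the cursor is exhausted you query the collection again.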
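
As a minimal sketch, the index could be created from pymongo roughly like this. The helper name is hypothetical, and the field names `url` and `date` follow this answer; the question spells the field `Url`, and field names are case-sensitive, so they would need to match the real documents.

```python
def ensure_url_date_index(collection):
    """Create a compound index on (url, date) and return its name.

    `collection` is assumed to be a pymongo Collection. Asking for an
    index that already exists with the same keys does not rebuild it.
    """
    # 1 means ascending, the same value as pymongo.ASCENDING
    return collection.create_index([("url", 1), ("date", 1)])
```

It would be called once on the large collection, the one that receives the per-document lookups, before the matching loop starts.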

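A few observations. Why iterate over all 4.7M documents every time? Instead of fetching all 4.7M documents and then matching, we can fetch only the documents that match the url and date of each document in A. Converting a_doc's date to b_doc's format and then querying is better than converting both to a common format, because the latter forces us to iterate over all 4.7M documents. Read the pseudocode below.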

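a_docs = a_collection.find()
c_docs = []
for doc in a_docs:
    url = doc['url']
    # Placeholder helper: convert A's date into the format stored in B,
    # so the query can be answered from the {url, date} index
    date = convert_to_b_collection_date_format(doc['date'])
    query = {'url': url, 'date': date}
    # find_one returns a single document (or None) rather than a cursor
    b_doc = b_collection.find_one(query)
    if b_doc is not None:
        c_docs.append(b_doc)
# Placeholder helper: reshape the matched documents as required
c_docs = convert_c_docs_to_required_format(c_docs)
if c_docs:  # insert_many rejects an empty list
    c_collection.insert_many(c_docs)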

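Above, we loop over the 35k documents and run one filtered query for each of them. Given that we have created the index, each lookup takes logarithmic time, which seems reasonable.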
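
The loop described here could be fleshed out along the following lines. This is only a sketch under stated assumptions: A is assumed to hold `datetime` values and B to hold strings such as "2018-10-22T14:34:40Z" (the thread does not say which collection uses which format), the function names and `batch_size` parameter are hypothetical, and the collection arguments are assumed to behave like pymongo Collection objects. Writing in batches is an addition that keeps the list of matched documents small.

```python
from datetime import timezone


def to_iso_z(value):
    """Render a datetime as 'YYYY-MM-DDTHH:MM:SSZ' (assumed format in B)."""
    if value.tzinfo is not None:
        value = value.astimezone(timezone.utc)
    # Naive datetimes are taken to be UTC already (pymongo's default)
    return value.strftime("%Y-%m-%dT%H:%M:%SZ")


def copy_matches(a_collection, b_collection, c_collection, batch_size=1000):
    """Copy to C every B document whose (url, date) matches an A document.

    Returns the number of documents written to C.
    """
    batch, copied = [], 0
    # Only url and date are needed from A, so project just those fields
    for a_doc in a_collection.find({}, {"url": 1, "date": 1}):
        query = {"url": a_doc["url"], "date": to_iso_z(a_doc["date"])}
        b_doc = b_collection.find_one(query)  # one indexed lookup per A doc
        if b_doc is None:
            continue
        batch.append(b_doc)
        if len(batch) >= batch_size:
            c_collection.insert_many(batch)
            copied += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        c_collection.insert_many(batch)
        copied += len(batch)
    return copied
```

With pymongo this would be called with the three collection objects, for example `copy_matches(db.mycoll, db2.coll1, db.temp_coll)`, after the index on the large collection is in place.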

【Comments】: