MongoDB query to fetch list of documents with count of external linked documents (answers)

【Question title】: MongoDB query to fetch list of documents with count of external linked documents
【Posted】: 2018-03-23 10:08:42
【Question description】:

I have a MongoDB database whose collections hold documents roughly like these:

// user document
{
    _id: $oid,
    name: "name",
    description: "description",
    // ...
}

// book document
{
    _id: $oid,
    userId: "...",
    name: "name",
    description: "description"
    // ...
}

// page document
{
    _id: $oid,
    bookId: "...",
    name: "name",
    description: "description"
    // ...
}

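A user has many books, and a book has many pages. Each entity is a separate document because a user can have thousands of books and a book can have thousands of pages, so if everything lived in one document we could easily hit the 16 MB limit.

What is the best way to retrieve the list of books for a specified userId, where each book carries a pageCount field?

This is the JSON result I need: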

{
    books: [{
        _id: $oid,
        name: "name1",
        description: "description1",
        pageCount: 8
    }, {
        _id: $oid,
        name: "name2",
        description: "description2",
        pageCount: 12
    },
        // ...
    ]
}

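With a SQL database this would be a simple join with a count, but with MongoDB I can't see any simple solution other than making separate queries to get the list of books and then the page count for each book.

【Question discussion】:

If books don't gain or lose pages often, it makes sense to pre-aggregate pageCount into the book document. Otherwise you need to run an expensive $lookup aggregation every time.
@AlexBlex Pre-aggregation is something I've considered. Does that mean I would have to update the book document every time a new page is added or an existing page is removed? I guess that would have to be done at the application level, non-transactionally.
Yes, you see the problem. Just answered a similar question. You have a choice: complex writes, occasional discrepancies, or slow reads (using a lookup or 2 separate queries). Which one to pick really depends on how critical data integrity is versus how slow the reads are.
@AlexBlex MongoDB v4 transaction support looks promising for implementing pre-aggregation once it is released.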
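
The pre-aggregation idea from these comments could look roughly like the sketch below. The helper names `addPage` and `removePage` are made up for illustration, and `fakeDb` is only a stand-in that records calls so the snippet runs without a server. With a real handle, the two writes are separate operations, which is exactly the non-transactional gap discussed above.

```javascript
// Hypothetical helpers; `db` is any handle exposing the books and pages
// collections (for example the mongo shell's `db`).
function addPage(db, page) {
    db.pages.insertOne(page);
    // Not atomic with the insert above: a failure between the two calls
    // leaves pageCount one short.
    db.books.updateOne({ _id: page.bookId }, { $inc: { pageCount: 1 } });
}

function removePage(db, page) {
    const res = db.pages.deleteOne({ _id: page._id });
    // Only decrement when a page was actually deleted.
    if (res.deletedCount === 1) {
        db.books.updateOne({ _id: page.bookId }, { $inc: { pageCount: -1 } });
    }
}

// Stand-in that records calls, so the sketch runs without MongoDB.
const calls = [];
const fakeDb = {
    pages: {
        insertOne: doc => { calls.push(["pages.insertOne", doc]); },
        deleteOne: filter => {
            calls.push(["pages.deleteOne", filter]);
            return { deletedCount: 1 };
        }
    },
    books: {
        updateOne: (filter, update) => {
            calls.push(["books.updateOne", filter, update]);
        }
    }
};

addPage(fakeDb, { _id: "p1", bookId: "18-1", name: "name" });
removePage(fakeDb, { _id: "p1", bookId: "18-1" });
console.log(JSON.stringify(calls, null, 2));
```

Reads then become a plain find on books, at the cost of the extra write and the possible drift in pageCount.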

Tags: mongodb nosql mongodb-query aggregation-framework nosql-aggregation

【Solution 1】:

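This doesn't answer the question directly, but gives some thoughts on the

    making separate queries to get the list of books and then the page count for each book

part. That is not always a bad thing. MongoDB is quite efficient at simple queries, so here are some numbers comparing a single $lookup pipeline with multiple queries, and I encourage you to test typical queries on your own dataset. For example, pagination can make a huge difference if you don't need all the data at once.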
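
As a rough sketch of the pagination point, $skip and $limit can sit ahead of $lookup so that only one page of books is joined against pages. The helper name `buildPagedPipeline`, the sort on `_id`, and the page size of 20 are assumptions for illustration, not something that was measured here.

```javascript
// Hypothetical helper: build the $lookup pipeline for one page of books.
function buildPagedPipeline(userId, pageNumber, pageSize) {
    return [
        { $match: { userId: userId } },
        // A stable sort keeps page boundaries consistent between calls.
        { $sort: { _id: 1 } },
        { $skip: pageNumber * pageSize },
        { $limit: pageSize },
        // Only the books that survive $limit are joined against pages.
        { $lookup: {
            from: "pages",
            localField: "_id",
            foreignField: "bookId",
            as: "pageCount"
        }},
        { $addFields: { pageCount: { $size: "$pageCount" } } }
    ];
}

const pipeline = buildPagedPipeline(18, 0, 20);
console.log(JSON.stringify(pipeline, null, 2));
// In the mongo shell: db.books.aggregate(buildPagedPipeline(18, 0, 20))
```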

Setup

A tiny database of 100 users × 1,000 books × 1,000 pages each, on a tiny 1 vCPU / 2 GB memory / 50 GB disk / LON1 droplet running MongoDB 3.4.10 on Ubuntu 16.04.

The pages collection was created as follows:

for USERID in {1..100}; do   
    echo "" > pages.json;     
    for BOOKID in {1..1000}; do       
       ./node_modules/.bin/mgeneratejs "{\"bookId\": \"$USERID-$BOOKID\", \"name\": {\"\$sentence\":{\"words\":3}}, \"description\": \"\$paragraph\"}" -n 1000 >> pages.json
    done     
    cat pages.json | mongoimport -d so -c pages 
done

And almost the same for books.

Basic stats:

db.books.stats(1024*1024)
    "ns" : "so.books",
    "size" : 50,
    "count" : 100000,
    "avgObjSize" : 533,
    "storageSize" : 52,
    "nindexes" : 2,
    "totalIndexSize" : 1,
    "indexSizes" : {
            "_id_" : 0,
            "userId_1" : 0
    },

db.pages.stats(1024*1024)
    "ns" : "so.pages",
    "size" : 51673,
    "count" : 100000000,
    "avgObjSize" : 541,
    "storageSize" : 28920,
    "nindexes" : 2,
    "totalIndexSize" : 1424,
    "indexSizes" : {
            "_id_" : 994,
            "bookId_1" : 430
    },

$lookup

The pipeline from @chridam's answer

db.books.aggregate([
    { "$match": { "userId": 18 } },
    { "$lookup": {
        "from": "pages",
        "localField": "_id",
        "foreignField": "bookId",
        "as": "pageCount"
    }},
    { "$addFields": {
        "pageCount": { "$size": "$pageCount" }
    }}
])

gives a fairly quick response:

    "op" : "command",
    "command" : {
            "aggregate" : "books"
    },
    "keysExamined" : 1000,
    "docsExamined" : 1000,
    "nreturned" : 101,
    "responseLength" : 57234,
    "millis" : 1028

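for the first batch of about 100 documents, so you can start processing documents within roughly a second.

Total time for the whole thing: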

db.books.aggregate([
    { "$match": { "userId": 18 } },
    { "$lookup": {
        "from": "pages",
        "localField": "_id",
        "foreignField": "bookId",
        "as": "pageCount"
    }},
    { "$addFields": {
        "pageCount": { "$size": "$pageCount" }
    }}
]).toArray()

adds another 8 seconds:

    "op" : "getmore",
    "query" : {
            "getMore" : NumberLong("32322423895"),
            "collection" : "books"
    },
    "keysExamined" : 0,
    "docsExamined" : 0,
    "nreturned" : 899,
    "responseLength" : 500060,
    "millis" : 8471

which brings the total time to retrieve all data to over 9 seconds.

Multiple queries

Retrieving the books:

let bookIds = []; 
db.books.find({userId:12}).forEach(b=>{bookIds.push(b._id);});

populates the array in 10 ms:

"op" : "query",
"query" : {
        "find" : "books",
        "filter" : {
                "userId" : 34
        }
},
"keysExamined" : 101,
"docsExamined" : 101,
"nreturned" : 101,
"responseLength" : 54710,
"millis" : 3

and

"op" : "getmore",
"query" : {
        "getMore" : NumberLong("34224552674"),
        "collection" : "books"
},
"keysExamined" : 899,
"docsExamined" : 899,
"nreturned" : 899,
"responseLength" : 485698,
"millis" : 7

Counting pages:

db.pages.aggregate([
    { $match: { bookId: { $in: bookIds } } }, 
    { $group: { _id: "$bookId", cnt: { $sum: 1 } } }
]).toArray()

takes about 1.5 seconds in total:

"op" : "command",
"command" : {
        "aggregate" : "pages"
},
"keysExamined" : 1000001,
"docsExamined" : 0,
"nreturned" : 101,
"responseLength" : 3899,
"millis" : 1574

and

"op" : "getmore",
"query" : {
        "getMore" : NumberLong("58311204806"),
        "collection" : "pages"
},
"keysExamined" : 0,
"docsExamined" : 0,
"nreturned" : 899,
"responseLength" : 34935,
"millis" : 0

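Merging results

This is not a query; it has to be done at the application level. It takes a few milliseconds in mongo shell JavaScript, which brings the total time to retrieve all data to under 2 seconds.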
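
A minimal sketch of that merge step, assuming the books come from the find query and the counts from the $group stage shown earlier (documents shaped like `{ _id: <bookId>, cnt: <number> }`). The helper name `mergePageCounts` and the sample data are made up for illustration.

```javascript
// Hypothetical helper: attach a pageCount to each book using the
// output of the $group stage ({ _id: <bookId>, cnt: <number> }).
function mergePageCounts(books, counts) {
    // Index the counts by book id so the merge is a single pass.
    const countByBookId = new Map(counts.map(c => [String(c._id), c.cnt]));
    return books.map(book => ({
        _id: book._id,
        name: book.name,
        description: book.description,
        // Books with no pages never appear in the $group output.
        pageCount: countByBookId.get(String(book._id)) || 0
    }));
}

// Made-up sample data standing in for the two query results.
const books = [
    { _id: "18-1", userId: 18, name: "name1", description: "description1" },
    { _id: "18-2", userId: 18, name: "name2", description: "description2" },
    { _id: "18-3", userId: 18, name: "name3", description: "description3" }
];
const counts = [
    { _id: "18-2", cnt: 12 },
    { _id: "18-1", cnt: 8 }
];

const result = { books: mergePageCounts(books, counts) };
console.log(JSON.stringify(result, null, 2));
```

Defaulting to 0 matters because $group only emits a document for books that have at least one page.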

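【Discussion】:

Thanks, this is very interesting. Coming from a SQL background, making separate queries always sounds bad, but with MongoDB things are certainly different.

【Solution 2】:

MongoDB's aggregation framework has a pipeline stage called $lookup, which lets you do a left outer join to another collection in the same database and pull in the matching documents from the "joined" collection for processing.

So, armed with this, you can run an aggregation pipeline that joins the books collection to the pages collection.

In a later pipeline step, you can get pageCount from the size of the array produced by the "join".

Assuming your MongoDB server version is at least 3.4, consider running the following aggregate operation to get the desired result: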

db.books.aggregate([
    { "$match": { "userId": userId } },
    { "$lookup": {
        "from": "pages",
        "localField": "_id",
        "foreignField": "bookId",
        "as": "pageCount"
    }},
    { "$addFields": {
        "pageCount": { "$size": "$pageCount" }
    }}
])

Alternatively, you can run the $lookup pipeline from the users collection:

db.users.aggregate([
    { "$match": { "_id": userId } },
    { "$lookup": {
        "from": "books",
        "localField": "_id",
        "foreignField": "userId",
        "as": "books"
    }},
    { "$lookup": {
        "from": "pages",
        "localField": "books._id",
        "foreignField": "bookId",
        "as": "pages"
    }},
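    { "$addFields": {
        "books": {
            "$map": {
                "input": "$books",
                "as": "book",
                "in": {
                    "_id": "$$book._id",
                    "name": "$$book.name",
                    "description": "$$book.description",
                    // "pages" is a top-level array here, so count the entries that belong to this book
                    "pageCount": {
                        "$size": {
                            "$filter": {
                                "input": "$pages",
                                "as": "page",
                                "cond": { "$eq": ["$$page.bookId", "$$book._id"] }
                            }
                        }
                    }
                }
            }
        }
    }}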
])

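【Discussion】:

Why not db.books.aggregate? It would save one lookup stage.
Still the same, because you need the user data.
But why? "What is the best way to retrieve the list of books for a specified userId" doesn't require any user data.
Right, no user data is needed in the result. I'm still curious about the performance impact compared with making a separate pages query at the application level.
You're right, I misread the requirement. I've updated my answer to reflect that. Thanks!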

【Solution 3】:

You can use the $lookup stage of the aggregation framework:

db.Users.aggregate([
    {$match: {_id: userId}},
    // Books reference their owner through userId
    {$lookup: {
        from: "Book",
        localField: "_id",
        foreignField: "userId",
        as: "book"
    }},
    // Pages reference their book through bookId
    {$lookup: {
        from: "Page",
        localField: "book._id",
        foreignField: "bookId",
        as: "page"
    }}
])

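and add a $group stage to count the pages. But I think this query will be slow. Also, if you want to shard your collections later, or if they are already sharded, you can't use $lookup: in the MongoDB versions current at the time of writing, the joined ("from") collection cannot be sharded.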
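
One way the $group step could look, as a sketch only. It assumes the schema from the question (books carry userId, pages carry bookId) and reuses this answer's collection names `Book` and `Page`. The helper name `buildUserPageCountPipeline` is made up for illustration.

```javascript
// Hypothetical helper: the two lookups followed by a per-book page count.
function buildUserPageCountPipeline(userId) {
    return [
        { $match: { _id: userId } },
        { $lookup: {
            from: "Book",
            localField: "_id",
            foreignField: "userId",
            as: "book"
        }},
        { $lookup: {
            from: "Page",
            localField: "book._id",
            foreignField: "bookId",
            as: "page"
        }},
        // One document per joined page.
        { $unwind: "$page" },
        // Count pages per book; books without pages produce no group.
        { $group: { _id: "$page.bookId", pageCount: { $sum: 1 } } }
    ];
}

const stages = buildUserPageCountPipeline("some-user-id");
console.log(JSON.stringify(stages, null, 2));
// In the mongo shell: db.Users.aggregate(buildUserPageCountPipeline(userId))
```

The output has one document per book id with its pageCount. Book names and descriptions would still have to be merged back in from the book array or from a separate query.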

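【Discussion】:

Yes, this is something I've considered, but my concern is performance and future-proofing with a sharded database. Thanks.
In that case the best solution is to do it with multiple queries. You can always test the performance of the $lookup query and see whether it meets your needs, but it only works on non-sharded collections, so...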