MongoDB Conditional Sum of Array Elements for a collection of documents: Answers

【Question title】: MongoDB Conditional Sum of Array Elements for a collection of documents
【Posted】: 2020-02-26 17:05:17
【Question description】:

I have the following collection of MongoDB documents. Each document has a field named "history", which holds an array of subdocuments with the fields "date" and "points".

[{
    history: [{
        date: "2019-20-20",
        points: 1,
    }, {
        date: "2019-20-21",
        points: 1,
    }, {
        date: "2019-20-22",
        points: 1,
    }, {
        date: "2019-20-23",
        points: 1,
    }],
}, {
    history: [{
        date: "2019-20-20",
        points: 1,
    }, {
        date: "2019-20-21",
        points: 2,
    }, {
        date: "2019-20-22",
        points: 3,
    }, {
        date: "2019-20-23",
        points: 4,
    }],
}]

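I'm not sure what the best way is to build a query that produces the output below. In this example, the date range (inclusive) is "2019-20-21" to "2019-20-22". "totalPoints" is a new field holding the sum of all points in the "history" field that fall within that date range.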

[{
    history: [{
        date: "2019-20-20",
        points: 1,
    }, {
        date: "2019-20-21",
        points: 1,
    }, {
        date: "2019-20-22",
        points: 1,
    }, {
        date: "2019-20-23",
        points: 1,
    }],
    totalPoints: 2,
}, {
    history: [{
        date: "2019-20-20",
        points: 1,
    }, {
        date: "2019-20-21",
        points: 2,
    }, {
        date: "2019-20-22",
        points: 3,
    }, {
        date: "2019-20-23",
        points: 4,
    }],
    totalPoints: 5,
}]

Here is the general idea of what I'm trying to do:

User.aggregate([{
    $addFields: {
        totalPoints: { $sum: points in "history" field if date range between "2019-20-21" and "2019-20-22" } ,
    }
}]);

The reason I want to create a new "totalPoints" field is that I eventually want to sort by it.

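【Question comments】:

Note: if you sort by a "computed" value, MongoDB performance will be poor because of the top-k sort algorithm.
@Valijon If I do sort on the computed value, do you have any suggestions for improving the query's performance?
You need to run a benchmark; MongoDB does not offer any solution for such a situation. Consider sorting outside MongoDB with the "heap search" described here.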

Tags: mongodb mongoose aggregation-framework

【Solution 1】:

With a single pipeline stage, you can combine $reduce with $filter to get the sum, as follows:

var startDate = "2019-20-21";
var endDate = "2019-20-22";
User.aggregate([
    { "$addFields": { 
        "totalPoints": {
            "$reduce": {
                "input": {
                    "$filter": {
                        "input": "$history",
                        "as": "el",
                        "cond": {
                            "$and": [
                                { "$gte": ["$$el.date", startDate] },
                                { "$lte": ["$$el.date", endDate ] },
                            ]
                        }
                    }
                },
                "initialValue": 0,
                "in": { "$add": [ "$$value", "$$this.points" ] }
            }
        }
    } }
]);
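
As a quick sanity check of what the $filter / $reduce expression computes, here is a minimal plain-JavaScript sketch that mimics the same logic in memory on the sample documents from the question. The helper name totalPointsInRange is hypothetical (it is not part of MongoDB or Mongoose), and like the pipeline it assumes the dates are strings that can be compared lexicographically.

```javascript
// Hypothetical helper: mirrors $filter (inclusive date range) followed by
// $reduce (running sum of "points", starting from 0).
function totalPointsInRange(history, startDate, endDate) {
    return history
        .filter((el) => el.date >= startDate && el.date <= endDate)
        .reduce((sum, el) => sum + el.points, 0);
}

// Sample documents copied from the question.
const docs = [
    { history: [
        { date: "2019-20-20", points: 1 },
        { date: "2019-20-21", points: 1 },
        { date: "2019-20-22", points: 1 },
        { date: "2019-20-23", points: 1 },
    ] },
    { history: [
        { date: "2019-20-20", points: 1 },
        { date: "2019-20-21", points: 2 },
        { date: "2019-20-22", points: 3 },
        { date: "2019-20-23", points: 4 },
    ] },
];

// Equivalent of $addFields: keep every field and add "totalPoints".
const result = docs.map((doc) => ({
    ...doc,
    totalPoints: totalPointsInRange(doc.history, "2019-20-21", "2019-20-22"),
}));

// Totals are 2 for the first document and 5 for the second.
console.log(result.map((doc) => doc.totalPoints));
```

This matches the expected output in the question, where the two documents get totalPoints of 2 and 5.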

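Another option is to use two pipeline stages, where you start the aggregation with a filtered array that contains only the elements matching the date range. To do that, combine $addFields with $filter, with a filter condition that uses the boolean operator $and together with the comparison operators $gte and $lte. The following pipeline stage shows this: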

{ "$addFields": { 
    "totalPoints": {
        "$filter": {
            "input": "$history",
            "cond": {
                "$and": [
                    { "$gte": ["$$this.date", "2019-20-21"] },
                    { "$lte": ["$$this.date", "2019-20-22"] },
                ]
            }
        }
    }
} },

Once you have the filtered array, you can easily get the sum in the next pipeline stage with $sum, so your complete pipeline becomes:

var startDate = "2019-20-21";
var endDate = "2019-20-22";
User.aggregate([
    { "$addFields": { 
        "totalPoints": {
            "$filter": {
                "input": "$history",
                "cond": {
                    "$and": [
                        { "$gte": ["$$this.date", startDate] },
                        { "$lte": ["$$this.date", endDate ] },
                    ]
                }
            }
        }
    } },
    { "$addFields": { 
        "totalPoints": { "$sum": "$totalPoints.points" }
    } }
])
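
Since the question says the end goal is to sort by "totalPoints", a $sort stage could be appended after the computed field. Below is a minimal sketch that builds the two-stage pipeline above plus a $sort stage as a plain array. The function name buildSortedTotalPointsPipeline is hypothetical, and descending order is an assumption, since the question does not say which direction is wanted.

```javascript
// Hypothetical helper: returns the two-stage pipeline followed by a $sort stage.
function buildSortedTotalPointsPipeline(startDate, endDate) {
    return [
        // Stage 1: keep only the history elements inside the inclusive date range.
        { $addFields: {
            totalPoints: {
                $filter: {
                    input: "$history",
                    cond: {
                        $and: [
                            { $gte: ["$$this.date", startDate] },
                            { $lte: ["$$this.date", endDate] },
                        ],
                    },
                },
            },
        } },
        // Stage 2: replace the filtered array with the sum of its "points".
        { $addFields: { totalPoints: { $sum: "$totalPoints.points" } } },
        // Stage 3 (assumption): highest totals first.
        { $sort: { totalPoints: -1 } },
    ];
}

const sortedPipeline = buildSortedTotalPointsPipeline("2019-20-21", "2019-20-22");
// With the Mongoose model from the question, this would be passed as:
// User.aggregate(sortedPipeline)
```

As the question comments point out, sorting on a computed value cannot use an index, so it is worth benchmarking this on realistic data.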

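【Comments】:

Thanks for providing both approaches. Would it be correct to say that the single-pipeline approach performs better than the alternative you provided?
Correct, the single pipeline is better, although I believe the performance difference is negligible in the above case. What will greatly improve performance is reducing the amount of data passing through the pipeline, i.e. adding an initial $match pipeline stage that uses the above date range query.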
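
The initial $match stage suggested in the last comment might look like the sketch below, which uses $elemMatch so that only documents with at least one "history" entry inside the date range enter the pipeline. The variable name pipelineWithMatch is made up for illustration. One caveat to weigh: documents with no entry in the range are dropped entirely instead of appearing with totalPoints equal to 0, which may or may not be what you want.

```javascript
const startDate = "2019-20-21";
const endDate = "2019-20-22";

const pipelineWithMatch = [
    // Initial stage: only pass documents that have at least one history
    // element whose date lies in the inclusive range.
    { $match: {
        history: { $elemMatch: { date: { $gte: startDate, $lte: endDate } } },
    } },
    // Same single-stage sum as in the answer above.
    { $addFields: {
        totalPoints: {
            $reduce: {
                input: {
                    $filter: {
                        input: "$history",
                        as: "el",
                        cond: {
                            $and: [
                                { $gte: ["$$el.date", startDate] },
                                { $lte: ["$$el.date", endDate] },
                            ],
                        },
                    },
                },
                initialValue: 0,
                in: { $add: ["$$value", "$$this.points"] },
            },
        },
    } },
];

// With the Mongoose model from the question, this would be passed as:
// User.aggregate(pipelineWithMatch)
```

An index on "history.date" may help the $match stage, but that is something to confirm with a benchmark on your own data.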