MongoDB: count the number of consecutive matching documents and merge them into one

【Question title】: mongodb count the number of consecutive matching docs and merge them into one
【Posted】: 2022-03-19 13:06:02
【Question description】:

Suppose I have a MongoDB collection named chatMessages with these properties (using mongoose on nodejs):

const schema = {
  _id: ObjectID,
  chatID: String,
  type: String,
  message: String,
  senderID: String,
  date: Date
}

Suppose the collection contains these 5 documents:

[
  {
    _id: 'id1',
    chatID: 'alternateChat',
    type: 'txt',
    message: 'first message',
    senderID: 'first_sender_id',
    date: new Date('0001-01-02T01:01:11.001Z')
  },
  {
    _id: 'id2',
    chatID: 'alternateChat',
    type: 'groupedMsgs',
    message: 'second message',
    senderID: 'first_sender_id',
    date: new Date('0001-01-03T01:01:11.001Z')
  },
  {
    _id: 'id3',
    chatID: 'alternateChat',
    type: 'groupedMsgs',
    message: 'third message',
    senderID: 'first_sender_id',
    date: new Date('0001-01-04T01:01:11.001Z')
  },
  {
    _id: 'id4',
    chatID: 'alternateChat',
    type: 'txt',
    message: 'fourth message',
    senderID: 'first_sender_id',
    date: new Date('0001-01-05T01:01:11.001Z')
  },
  {
    _id: 'id5',
    chatID: 'alternateChat',
    type: 'groupedMsgs',
    message: 'fifth message',
    senderID: 'first_sender_id',
    date: new Date('0001-01-06T01:01:11.001Z')
  }
]

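I want to query these documents so that consecutive rows of type groupedMsgs (sorted by date) are represented by a single row, namely the first one of the run, and that row also reports how many consecutive rows the run contained. Specifically, I want output like the following: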

[
  {
    _id: 'id1',
    chatID: 'alternateChat',
    type: 'txt',
    message: 'first message',
    senderID: 'first_sender_id',
    date: new Date('0001-01-02T01:01:11.001Z')
  },
  {
    _id: 'id2',
    chatID: 'alternateChat',
    type: 'groupedMsgs',
    message: 'second message',
    senderID: 'first_sender_id',
    date: new Date('0001-01-03T01:01:11.001Z'),
    numConsecutiveItems: 2
  },
  {
    _id: 'id4',
    chatID: 'alternateChat',
    type: 'txt',
    message: 'fourth message',
    senderID: 'first_sender_id',
    date: new Date('0001-01-05T01:01:11.001Z')
  },
  {
    _id: 'id5',
    chatID: 'alternateChat',
    type: 'groupedMsgs',
    message: 'fifth message',
    senderID: 'first_sender_id',
    date: new Date('0001-01-06T01:01:11.001Z'),
    numConsecutiveItems: 1
  }
]

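Note that third message is absent from the final output because it has type groupedMsgs and directly follows another message of type groupedMsgs; for the same reason, second message has a numConsecutiveItems of 2. Likewise, fifth message is present because it does not directly follow another groupedMsgs message, which is also why its numConsecutiveItems is 1. What aggregation pipeline would do this for me? I would prefer to avoid $accumulator, $function, and $where, so that no JavaScript runs during the query, since that would slow the query down, but I am still open to all answers.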
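To pin down the rule described above, here is a minimal sketch of the same transformation in plain JavaScript, run on an in-memory array rather than inside MongoDB. It is only a reference for the expected result, not an aggregation pipeline; the function name `collapseGroupedRuns` is a hypothetical helper introduced for illustration.

```javascript
// Reference implementation of the desired output, applied client-side.
// collapseGroupedRuns is a hypothetical name, not part of mongoose or MongoDB.
function collapseGroupedRuns(messages, groupedType = 'groupedMsgs') {
  // Sort a copy by date so the caller's array is left untouched.
  const sorted = [...messages].sort((a, b) => a.date - b.date);
  const result = [];
  let currentRun = null; // output document of the run being extended, if any

  for (const msg of sorted) {
    if (msg.type !== groupedType) {
      // Any other type ends the current run and is passed through unchanged.
      currentRun = null;
      result.push({ ...msg });
    } else if (currentRun) {
      // Continues a run: the message is dropped, only the counter grows.
      currentRun.numConsecutiveItems += 1;
    } else {
      // First groupedMsgs message of a new run: keep it and start counting.
      currentRun = { ...msg, numConsecutiveItems: 1 };
      result.push(currentRun);
    }
  }
  return result;
}
```

Applied to the five sample documents, this keeps id1, id2, id4 and id5, with numConsecutiveItems set only on the two groupedMsgs documents.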

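【Question comments】:

The "_id" value of every document must be unique across the whole collection. You show several documents with the same "_id".
You are right, that was a copy-paste mistake. I have edited the values so that they are all unique.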

Tags: mongodb mongoose mongodb-query aggregation-framework

【Solution 1】:

It can be done, but it is not recommended.

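The approach I use here (avoiding "accumulators", as you asked) is not good practice and should not be used on large data sets, because it first groups all the data into a single document and then multiplies that document by the number of documents (the $unwind emits one copy per message, each still carrying the full docs array). If you do choose to use something like this, I strongly suggest adding a $match stage before it to reduce the number of documents being processed.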
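As a rough illustration of that advice, the sketch below prepends a $match stage to an existing list of stages so that the expensive grouping only sees one chat and, optionally, one date window. The helper name `withChatFilter` and its parameters are assumptions made for this example; only the $match stage itself is standard MongoDB.

```javascript
// withChatFilter is a hypothetical helper: it returns a new pipeline whose
// first stage restricts the input to one chat and an optional date window.
function withChatFilter(stages, chatID, from, to) {
  const match = { chatID };
  if (from || to) {
    match.date = {};
    if (from) match.date.$gte = from; // inclusive lower bound
    if (to) match.date.$lt = to;      // exclusive upper bound
  }
  return [{ $match: match }, ...stages];
}
```

With mongoose this could be called as `ChatMessage.aggregate(withChatFilter(stages, 'alternateChat'))`, where `ChatMessage` stands for whatever model is bound to the chatMessages collection.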

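In your case, you can run it over short ranges (chunks); messages whose "type" is not "groupedMsgs" are good places to cut between chunks, because no run of grouped messages spans them.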

db.collection.aggregate([
  // Order by date so that "consecutive" means adjacent in time.
  {"$sort": { date: 1}},
  {"$group": {"_id": 0, 
        data: {"$push": {type: "$type", chatID: "$chatID", message: "$message",
          senderID: "$senderID", date: "$date", id: "$_id"}},
        docs: {"$push": {type: "$type", id: "$_id"}}}},
  {"$unwind": {path: "$data", includeArrayIndex: "data.inx"}},
  {"$addFields": {"nextType": {$cond: [
                                {$lt: [{$add: ["$data.inx", 1]},{$size: "$docs"}]},
                                {"$arrayElemAt": ["$docs",{$add: ["$data.inx", 1]}]},
                                "NA"]},
                "prevType": {$cond: [
                                {$gte: [{$add: ["$data.inx", -1]}, 0]},
                                {"$arrayElemAt": ["$docs", {$add: ["$data.inx", -1]}]},
                                "NA"]}}},
  // Flag the first and the last document of every run of groupedMsgs,
  // using the prevType/nextType fields added by the previous stage.
  {"$project": {data: 1,
      isfirstGroup: {$cond: [{$and: [{$ne: ["$data.type", "$prevType.type"]}, 
                                     {$eq: ["$data.type", "groupedMsgs"]}]}, true, false]},
      isLastGroup: {$cond: [{$and: [{$ne: ["$data.type", "$nextType.type"]},
                                    {$eq: ["$data.type", "groupedMsgs"]}]}, true, false]}}},
  {"$group": {_id: 0, starts: {$push: {isfirstGroup: "$isfirstGroup", origInx: "$data.inx"}},
                        ends: {$push: {isLastGroup: "$isLastGroup", origInx: "$data.inx",
                        data: "$data"}}}},
  {$facet: {
      groupedMsgs: [
        {$project: {startItems: {$filter: {input: "$starts", as: "item",
                    cond: {$eq: ["$$item.isfirstGroup", true]}}},
                    endItems: {$filter: {input: "$ends", as: "item", 
                    cond: {$eq: ["$$item.isLastGroup", true]}}}}},
        {"$unwind": {path: "$endItems", includeArrayIndex: "endItems.inx"}},
        {"$project": {data: "$endItems.data",  maxInx: "$endItems.origInx", 
                        minObj: {"$arrayElemAt": ["$startItems", "$endItems.inx"]}}},
        {"$addFields": {"data.numConsecutiveItems": {"$subtract": [{$add: ["$maxInx", 1]}, 
                                                                    "$minObj.origInx"]}}},
        {"$replaceRoot": {newRoot: "$data"}}
      ],
      others: [
        {"$project": {items: {$filter: {input: "$ends", as: "item", 
                                        cond: {$ne: ["$$item.data.type", "groupedMsgs"]}}}}},
        {"$unwind": "$items"},
        {"$replaceRoot": {newRoot: "$items.data"}}
      ]
    }
  },
  {"$project": {results: {$setUnion: ["$groupedMsgs", "$others"]}}},
  {"$unwind": "$results"},
  // Flatten the result and restore the original _id (carried along as "id").
  {"$project": {_id: "$results.id", type: "$results.type", chatID: "$results.chatID",
                message: "$results.message", senderID: "$results.senderID",
                date: "$results.date", numConsecutiveItems: "$results.numConsecutiveItems"}},
  {"$sort": {date: 1}},
])

You can see it on the playground: https://mongoplayground.net/p/1C5apUd4e5p. Note that I added one more document on the playground to make things clearer.
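For comparison, and not as part of the answer above, the same result can probably be reached without collecting everything into one document by using window functions. The sketch below is a different technique: it assumes MongoDB 5.0 or later (for $setWindowFields and $shift), assumes dates are unique within a chat, and has not been run against the playground data. The names `GROUPED`, `prevType`, `startsRun`, `runId` and `runSize` are illustrative only.

```javascript
// Sketch only: requires MongoDB 5.0+ ($setWindowFields, $shift).
const GROUPED = 'groupedMsgs';

const pipeline = [
  // 1. Fetch the type of the previous message (by date) within the same chat.
  { $setWindowFields: {
      partitionBy: '$chatID',
      sortBy: { date: 1 },
      output: { prevType: { $shift: { output: '$type', by: -1, default: null } } },
  } },
  // 2. A document starts a new run unless it continues a groupedMsgs run.
  { $addFields: {
      startsRun: { $cond: [
        { $and: [{ $eq: ['$type', GROUPED] }, { $eq: ['$prevType', GROUPED] }] },
        0,
        1,
      ] },
  } },
  // 3. The running total of run starts gives every document its run number.
  { $setWindowFields: {
      partitionBy: '$chatID',
      sortBy: { date: 1 },
      output: { runId: { $sum: '$startsRun', window: { documents: ['unbounded', 'current'] } } },
  } },
  // 4. Count the documents in each run.
  { $setWindowFields: {
      partitionBy: { chatID: '$chatID', runId: '$runId' },
      output: { runSize: { $sum: 1 } },
  } },
  // 5. Keep only the first document of every run.
  { $match: { startsRun: 1 } },
  // 6. Report the run length on groupedMsgs documents only.
  { $addFields: {
      numConsecutiveItems: { $cond: [{ $eq: ['$type', GROUPED] }, '$runSize', '$$REMOVE'] },
  } },
  // 7. Drop the helper fields and restore date order.
  { $unset: ['prevType', 'startsRun', 'runId', 'runSize'] },
  { $sort: { date: 1 } },
];
```

It would be passed to `db.collection.aggregate(pipeline)` in the same way as the pipeline above. Every non-grouped message forms a run of its own, so step 5 keeps all of them and removes only the second and later messages of each groupedMsgs run.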

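【Comments】:

Thanks for the answer! My system has some other constraints that actually require me to identify the aggregated messages at the time they are added to the database. So the task boils down to $filtering and $projecting the fields needed for the output I want. I will try to find time to post what worked for me, but you can read more here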