【Question title】: MongoDB aggregate query to get the unique element list and count of every instance in record
【Posted】: 2018-06-16 04:30:20
【Question description】:

I have 2 collections, as shown below.

data1:

{ "_id" : , "timestamp" : ISODate("2016-01-05T07:42:37.312Z"), "Prof_Name" : "Jack ", "SUBJECT" : "Maths, Chemistry, Machinery1, Ele1" }
{ "_id" : , "timestamp" : ISODate("2016-01-05T07:42:37.312Z"), "Prof_Name" : "Mac", "SUBJECT" : "Chemistry, CS, German" }

data2:

{ "_id" : ObjectId(""), "timestamp" : ISODate("2016-08-05T07:42:37.312Z"), "SUBJECT_ID" : "Maths", "ID" : "OI-12", "Rating" : 6, "UUID" : 8123 }
{ "_id" : ObjectId(""), "timestamp" : ISODate("2017-09-05T07:42:37.312Z"), "SUBJECT_ID" : "Maths, Machinery1, German", "ID" : "OI-134", "Rating" : 6, "UUID" : 8123 }
{ "_id" : ObjectId(""), "timestamp" : ISODate("2016-01-05T07:42:37.312Z"), "SUBJECT_ID" : "Machinery1, Maths, French, German", "ID" : "OI-32", "Rating" : 3, "UUID" : 8123 }
{ "_id" : ObjectId(""), "timestamp" : ISODate("2016-01-05T07:42:37.312Z"), "SUBJECT_ID" : "CS, Chemistry", "ID" : "OI-36", "Rating" : , "UUID" : 8124 }

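I want to build a third collection covering timestamps from January 2016 to November 2016. For every Prof_Name in "data1" and every subject in its SUBJECT field, check whether the subject is present in "data2", collect the UUID, and count it as 1; if the same subject is found in the next record, make the UUID count 2, and so on. This is how my collection should look:

data3: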

{ "_id" : ,
"Prof_Name" : "Jack", 
"Subjects_list" : [ "Maths", "Chemistry", "Machinery1"], 
"UUID_list" : [8123, 8124 ], 
"UUID_count" : 3,   // Because UUID 8123 is present in 2 records that fall within the 2016 timestamp range
"subject_count" : 3 } // Ele1 is not listed because it does not appear in any data2 record
{ "_id" : , 
"Prof_Name" : "Mac", 
"Subjects_list" : [ "CS"], 
"UUID_list" : [8124 ],  
"UUID_count" : 1,   // Because UUID 8123 is present in 2 records that fall within the 2016 timestamp range
"subject_count" : 1 }
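
To pin down the counting rule, here is a plain JavaScript sketch of one possible reading of the requirement: add 1 to UUID_count for every data2 record inside the date window that shares at least one subject with the professor. The in-memory arrays and the names `splitList` and `summarize` are illustrative assumptions, not part of the question; under this reading Jack's row comes out with UUID_count 3 and subject_count 3.

```javascript
// Hypothetical in-memory copies of the sample documents (only the fields used here).
const data1 = [{ Prof_Name: "Jack ", SUBJECT: "Maths, Chemistry, Machinery1, Ele1" }];
const data2 = [
  { timestamp: new Date("2016-08-05T07:42:37.312Z"), SUBJECT_ID: "Maths", UUID: 8123 },
  { timestamp: new Date("2017-09-05T07:42:37.312Z"), SUBJECT_ID: "Maths, Machinery1, German", UUID: 8123 },
  { timestamp: new Date("2016-01-05T07:42:37.312Z"), SUBJECT_ID: "Machinery1, Maths, French, German", UUID: 8123 },
  { timestamp: new Date("2016-01-05T07:42:37.312Z"), SUBJECT_ID: "CS, Chemistry", UUID: 8124 },
];

// Split a comma-separated string and trim each entry.
const splitList = (s) => s.split(",").map((x) => x.trim());

// Assumed rule: one count per data2 record in [from, to) that shares a subject.
function summarize(prof, records, from, to) {
  const wanted = splitList(prof.SUBJECT);
  const subjects = new Set();
  const uuids = new Set();
  let uuidCount = 0;
  for (const rec of records) {
    if (rec.timestamp < from || rec.timestamp >= to) continue; // outside the window
    const hits = splitList(rec.SUBJECT_ID).filter((s) => wanted.includes(s));
    if (hits.length === 0) continue; // no shared subject
    uuidCount += 1;
    uuids.add(rec.UUID);
    hits.forEach((s) => subjects.add(s));
  }
  return {
    Prof_Name: prof.Prof_Name.trim(),
    Subjects_list: [...subjects],
    UUID_list: [...uuids],
    UUID_count: uuidCount,
    subject_count: subjects.size,
  };
}

const jack = summarize(
  data1[0],
  data2,
  new Date("2016-01-01T00:00:00Z"),
  new Date("2016-12-01T00:00:00Z")
);
console.log(jack);
```

The 2017 record is skipped by the date window, and Ele1 never appears because no data2 record contains it.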

My aggregation query is:

db.data1.aggregate([
  {
    "$addFields": {
      "SUBJECT": {
        "$split": [
          "$SUBJECT",
          ", "
        ]
      }
    }
  },
  {
    "$unwind": "$SUBJECT"
  },
  {
    "$lookup": {
      "from": "data2",
      "let": {
        "subject": "$SUBJECT"
      },
      "pipeline": [
        {
          "$addFields": {
            "SUBJECT_ID": {
              "$split": [
                "$SUBJECT_ID",
                ", "
              ]
            }
          }
        },
        {
          "$match": {
            "$expr": {
              "$in": [
                "$$subject",
                "$SUBJECT_ID"
              ]
            }
          }
        },
        {
          "$project": {
            "UUID": 1,
            "_id": 0
          }
        }
      ],
      "as": "ref_data"
    }
  },
  {
    "$unwind": {
      "path": "$ref_data",
      "preserveNullAndEmptyArrays": true
    }
  },
  {
    "$group": {
      "_id": "$Prof_Name",
      "subjects_list": {
        "$addToSet": "$SUBJECT"
      },
      "UUID_list": {
        "$addToSet": "$ref_data.UUID"
      }
    }
  },
  {
    "$addFields": {
      "Prof_Name": "$_id",
      "UUID_count": {
        "$size": "$UUID_list"
      },
      "subject_count": {
        "$size": "$subjects_list"
      }
    }
  },
  {
    "$project": {
      "_id": 0
    }
  },
  {
    "$out": "data3"
  }
])

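What changes does this query need in order to produce the data3 collection shown above, mainly UUID_list, UUID_count and Subjects_list?

I would also like to know how to match record timestamps against a given month and year (rather than a full ISO date) in an aggregation query.

I tried this: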

    { "$project": {"year":{"$year":"$timestamp"},"month":{"$month":"$timestamp"}}},{ "$match":{"year" :"2016","month": "01"}}  

But it does not work.
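
One likely reason the attempt above matches nothing: `$year` and `$month` return integers, so comparing them with the strings "2016" and "01" cannot succeed. A minimal sketch of two alternatives follows; `monthRange`, `matchByRange` and `matchByParts` are hypothetical names introduced here, and the stages are meant to be dropped into a pipeline in the mongo shell.

```javascript
// Hypothetical helper: half-open UTC range [start of month, start of next month).
function monthRange(year, month) {
  // month is 1-based (1 = January); Date.UTC expects a 0-based month.
  const start = new Date(Date.UTC(year, month - 1, 1));
  const end = new Date(Date.UTC(year, month, 1)); // rolls over into the next year for December
  return { $gte: start, $lt: end };
}

// Option 1: range match on the stored date (can use an index on timestamp).
const matchByRange = { $match: { timestamp: monthRange(2016, 1) } };

// Option 2: compare the extracted parts with numbers, not strings.
const matchByParts = {
  $match: {
    $expr: {
      $and: [
        { $eq: [{ $year: "$timestamp" }, 2016] },
        { $eq: [{ $month: "$timestamp" }, 1] },
      ],
    },
  },
};
```

Either stage can replace the `$project` + `$match` pair; unlike the `$project`, neither one discards the other fields of the document.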

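【Question discussion】:

    Tags: mongodb mongodb-query aggregation-framework pymongo aggregation


    【Solution 1】:

    You can simplify the aggregation by storing the subjects in the database as an array instead of a comma-separated string.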

    "SUBJECT" : ["Maths", "Chemistry", "Machinery1", "Ele1"]
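
One way the conversion might be done is sketched below. `toSubjectArray` is a hypothetical client-side helper; the commented `updateMany` calls assume MongoDB 4.2 or later (pipeline-style updates) and have not been run against the question's data.

```javascript
// Hypothetical helper: turn "Maths, Chemistry" into ["Maths", "Chemistry"],
// trimming whitespace and dropping empty entries.
function toSubjectArray(csv) {
  return csv
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// In the mongo shell the same split can be applied server-side. $split does
// not trim, so this relies on the separator being exactly ", ":
//
//   db.data1.updateMany(
//     { SUBJECT: { $type: "string" } },
//     [{ $set: { SUBJECT: { $split: ["$SUBJECT", ", "] } } }]
//   );
//   db.data2.updateMany(
//     { SUBJECT_ID: { $type: "string" } },
//     [{ $set: { SUBJECT_ID: { $split: ["$SUBJECT_ID", ", "] } } }]
//   );

console.log(toSubjectArray("Maths, Chemistry, Machinery1, Ele1"));
```

The `$type: "string"` filter keeps the update from touching documents that were already converted.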

    You can then use the following aggregation.

    db.data1.aggregate([
    {"$lookup":{
      "from":"data2",
      "localField":"SUBJECT",
      "foreignField":"SUBJECT_ID",
      "as":"ref_data"
    }}, // outputs all the input documents where there is any match between two subjects array.
    {"$unwind":{"path":"$ref_data","preserveNullAndEmptyArrays":true}},
    {"$match":{"ref_data.timestamp":{"$gte":ISODate("2016-01-01T00:00:00.000Z"), "$lte":ISODate("2016-11-30T23:59:59.999Z")}}}, // November has 30 days; the range ends at the last millisecond of the month
    {"$addFields":{"SUBJECT":{"$setIntersection":["$SUBJECT","$ref_data.SUBJECT_ID"]}}}, // outputs the common subjects (matching) between two subjects array
    {"$unwind":"$SUBJECT"},
    {"$group":{
      "_id":{
        "Prof_Name":"$Prof_Name",
        "UUID":"$ref_data.UUID",
        "SUBJECT":"$SUBJECT"
      }
    }},// outputs all the distinct combination of UUID and Subject
    {"$group":{
      "_id":"$_id.Prof_Name",
      "UUID_count":{"$sum":1},
      "subjects_list":{"$push":"$_id.SUBJECT"},
      "UUID_distinct_list":{"$addToSet":"$_id.UUID"}
    }}, // outputs the distinct uuid list, count the uuids & subjects list 
    {"$addFields": {
      "Prof_Name": "$_id",
      "UUID_distinct_count": {
        "$size": "$UUID_distinct_list"
      },
      "subject_count": {
        "$size": "$subjects_list"
      }
    }}, // Adds the subject list size
    {"$project": {"_id": 0}},// excludes the id from final output
    {"$out":"data3"}])
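
The comments under this answer ask for a count of every matching data2 record rather than a count of distinct (UUID, subject) pairs. A hedged sketch of stages that might take the place of everything from `{"$unwind":"$SUBJECT"}` through the final `$project` in the pipeline above, assuming the array schema and that `SUBJECT` already holds the `$setIntersection` result at that point. `countPerMatchedRecord` and `subject_sets` are illustrative names, and the fragment has not been run against a server.

```javascript
// Sketch only: after $lookup + $unwind there is one row per (professor, matched
// data2 record), so summing 1 per row counts matched records.
const countPerMatchedRecord = [
  {
    $group: {
      _id: "$Prof_Name",
      UUID_count: { $sum: 1 },                    // one per matched data2 record
      UUID_list: { $addToSet: "$ref_data.UUID" }, // distinct UUIDs
      subject_sets: { $push: "$SUBJECT" },        // array of per-record intersections
    },
  },
  {
    $addFields: {
      Prof_Name: "$_id",
      // Union the per-record subject arrays into one distinct list.
      subjects_list: {
        $reduce: {
          input: "$subject_sets",
          initialValue: [],
          in: { $setUnion: ["$$value", "$$this"] },
        },
      },
    },
  },
  { $addFields: { subject_count: { $size: "$subjects_list" } } },
  { $project: { _id: 0, subject_sets: 0 } },
];
```

In the shell the array can be spliced into the pipeline with the spread operator, for example `db.data1.aggregate([...earlierStages, ...countPerMatchedRecord, {"$out":"data3"}])`, where `earlierStages` stands for the stages up to and including the `$setIntersection` step.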
    

    Without changing the schema, you can use the following aggregation query.

    db.data1.aggregate([
      {"$lookup":{
        "from":"data2",
        "let":{"subject":{"$split":["$SUBJECT",", "]}},
        "pipeline":[
          {"$match": {"$expr":{"$and":[{"$eq":[{"$year":"$timestamp"}, 2016]}, {"$eq":[{"$month":"$timestamp"}, 1]}]}}},
          {"$addFields":{"SUBJECT_ID":{"$split":["$SUBJECT_ID",", "]},"SUBJECT":"$$subject"}},
          {"$unwind":"$SUBJECT"},
          {"$match":{"$expr":{"$in":["$SUBJECT","$SUBJECT_ID"]}}},
          {"$facet":{
            "UUID":[{"$group":{"_id":{"id":"$_id","UUID":"$UUID"}}},{"$count":"UUID_Count"}],
            "REST":[
              {"$group":{"_id":null,"subjects_list":{"$addToSet":"$SUBJECT"},"UUID_distinct_list":{"$addToSet":"$UUID"}}},
              {"$addFields":{"subject_count":{"$size":"$subjects_list"},"UUID_distinct_count":{"$size":"$UUID_distinct_list"}}},
              {"$project":{"_id":0}}
             ]
          }},
          {"$replaceRoot":{"newRoot":{"$mergeObjects":[{"$arrayElemAt":["$UUID",0]},{"$arrayElemAt":["$REST",0]}]}}}
        ],
        "as":"ref_data"
      }},
      {"$unwind":{"path":"$ref_data","preserveNullAndEmptyArrays":true}},
      {"$addFields":{"ref_data.Prof_Name":"$Prof_Name"}},
      {"$replaceRoot":{"newRoot":"$ref_data"}},
      {"$out":"data3"}
    ])
    

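    【Discussion】:

    • The query works for everything, but UUID_count is not what I want. I want to count every occurrence in each record, but this gives me the count of unique UUIDs.
    • Can you try it now? I have removed the group.
    • I tried it and it does not give the expected output. Is it possible to group on ref_data.UUID, count, and add the result to a new field such as UUID_unique, while keeping UUID_count as in the previous query? Something like adding {"$group":{"_id":"$_id.UUID","UUID_unique":{"$sum":1}
    • Could you update the example in your post? I think the initial answer matches what you have in the expected JSON.
    • The initial answer did not give the expected UUID count either (the count of every occurrence in each record).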