Percentage of OR conditions matched in MongoDB: answers

【Question title】: Percentage of OR conditions matched in mongodb
【Posted】: 2014-04-23 11:21:30
【Question description】:

My data is in the following format:

{
  "_id" : ObjectId("534fd4662d22a05415000000"),
  "product_id" : "50862224",
  "ean" : "8808992479390",
  "brand" : "LG",
  "model" : "37LH3000",
  "features" : [
    {
      "key" : "Screen Format",
      "value" : "16:9"
    }, {
      "key" : "DVD Player / Recorder",
      "value" : "No"
    }, {
      "key" : "Weight in kg",
      "value" : "12.6"
    }
    ... so on
  ]
}

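I need to compare the features of one product with those of other products, and then group the results into categories according to the percentage of features that match (100% match, 50-99% match).

My initial idea was to build a dynamic query with an OR condition for each feature and do the percentage calculation in PHP. But that means MongoDB would return even the products that match on only 1 feature. And since almost every product in a category probably has some features in common, I am afraid I would end up processing a lot of products in PHP.

I basically have two questions.

Is there another way to do this?
Is the data structure I am using good enough to support the functionality I am looking for, or should I consider changing it?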

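【Question discussion】:

Am I right that if one product has 3 features and another has 4, and only 2 of them are equal, the result would be 2/3 = 67%?
@DenisNikanorov Yes, you are right. The match would be 67%.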
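
The rule agreed on in this exchange can be sketched in plain JavaScript. The helper name `matchRatio` and the sample feature lists (including "HDMI Ports") are hypothetical and made up for illustration. The sketch assumes a feature counts as matching only when both key and value are equal, and that the divisor is the number of features of the reference product.

```javascript
// Hypothetical helper: fraction of the reference product's features
// that also appear (same key and same value) in the candidate product.
function matchRatio(referenceFeatures, candidateFeatures) {
  if (referenceFeatures.length === 0) return 0;
  const matched = referenceFeatures.filter((ref) =>
    candidateFeatures.some((c) => c.key === ref.key && c.value === ref.value)
  ).length;
  return matched / referenceFeatures.length;
}

// Made-up sample data: 3 reference features, 4 candidate features, 2 in common.
const reference = [
  { key: "Screen Format", value: "16:9" },
  { key: "DVD Player / Recorder", value: "No" },
  { key: "Weight in kg", value: "12.6" },
];
const candidate = [
  { key: "Screen Format", value: "16:9" },
  { key: "DVD Player / Recorder", value: "No" },
  { key: "Weight in kg", value: "9.8" },
  { key: "HDMI Ports", value: "3" },
];

// 2 of the 3 reference features match, rounded to a whole percentage.
const percent = Math.round(matchRatio(reference, candidate) * 100);
console.log(percent + "%");
```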

Tags: php mongodb mongodb-query aggregation-framework mongodb-php

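【Solution 1】:

Your solution really should be MongoDB-specific; otherwise you end up doing the calculation and the possible matching on the client side, which is not good for performance.

So what you really want is a way to do the processing on the server side: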

db.products.aggregate([

    // Match the documents that meet your conditions
    { "$match": {
        "$or": [
            { 
                "features": { 
                    "$elemMatch": {
                       "key": "Screen Format",
                       "value": "16:9"
                    }
                }
            },
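            { 
                "features": { 
                    "$elemMatch": {
                       "key" : "Weight in kg",
                       // The sample data stores values as strings, so this
                       // range is compared as strings, not as numbers
                       "value" : { "$gt": "5", "$lt": "8" }
                    }
                }
            },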
        ]
    }},

    // Keep the document and a copy of the features array
    { "$project": {
        "_id": {
            "_id": "$_id",
            "product_id": "$product_id",
            "ean": "$ean",
            "brand": "$brand",
            "model": "$model",
            "features": "$features"
        },
        "features": 1
    }},

    // Unwind the array
    { "$unwind": "$features" },

    // Find the actual elements that match the conditions
    { "$match": {
        "$or": [
            { 
               "features.key": "Screen Format",
               "features.value": "16:9"
            },
            { 
               "features.key" : "Weight in kg",
               "features.value" : { "$gt": "5", "$lt": "8" }
            },
        ]
    }},

    // Count those matched elements
    { "$group": {
        "_id": "$_id",
        "count": { "$sum": 1 }
    }},

    // Restore the document and divide the matched elements by the
    // number of elements in the "or" condition
    { "$project": {
        "_id": "$_id._id",
        "product_id": "$_id.product_id",
        "ean": "$_id.ean",
        "brand": "$_id.brand",
        "model": "$_id.model",
        "features": "$_id.features",
        "matched": { "$divide": [ "$count", 2 ] }
    }},

    // Sort by the matched percentage
    { "$sort": { "matched": -1 } }

])

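Since you know the "length" of the $or condition being applied, you only need to find out how many elements in the "features" array match those conditions. That is all the second $match in the pipeline is for.

Once you have that count, you simply divide it by the number of conditions passed in as your $or. The beauty here is that you can now do something useful with the result, such as sorting by relevance and even "paging" the results server side.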
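
Because the divisor has to track the number of $or branches, it can help to generate the pipeline from a list of conditions instead of hard-coding the 2. The following is only a sketch: `buildMatchPipeline` is a hypothetical helper name, it assumes at least one condition is supplied, and it is deliberately simplified so that it groups on `_id` alone and does not carry the other product fields through.

```javascript
// Hypothetical helper: builds the aggregation pipeline for any list of
// feature conditions, so the divisor always equals the number of $or branches.
function buildMatchPipeline(conditions) {
  // Document-level form: each condition must hold for a single array element.
  const documentLevel = conditions.map((c) => ({
    features: { $elemMatch: { key: c.key, value: c.value } },
  }));
  // After $unwind, "features" is a single sub-document, so plain
  // dotted-field matching is enough.
  const elementLevel = conditions.map((c) => ({
    "features.key": c.key,
    "features.value": c.value,
  }));

  return [
    { $match: { $or: documentLevel } },
    { $unwind: "$features" },
    { $match: { $or: elementLevel } },
    { $group: { _id: "$_id", count: { $sum: 1 } } },
    { $project: { matched: { $divide: ["$count", conditions.length] } } },
    { $sort: { matched: -1 } },
  ];
}

const pipeline = buildMatchPipeline([
  { key: "Screen Format", value: "16:9" },
  { key: "Weight in kg", value: { $gt: "5", $lt: "8" } },
]);
console.log(JSON.stringify(pipeline, null, 2));
```

With a live connection, the resulting array would be passed to `db.products.aggregate(pipeline)`.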

Of course, if you want some additional "categorization" of this, all you need to do is add another $project stage to the end of the pipeline:

    { "$project": {
        "product_id": 1,
        "ean": 1,
        "brand": 1,
        "model": 1,
        "features": 1,
        "matched": 1,
        "category": { "$cond": [
            { "$eq": [ "$matched", 1 ] },
            "100",
            { "$cond": [ 
                { "$gte": [ "$matched", 0.7 ] },
                "70-99",
                { "$cond": [
                   { "$gte": [ "$matched", 0.4 ] },
                   "40-69",
                   "under 40"
                ]} 
            ]}
        ]}
    }}

Or something similar. The $cond operator is what helps you here.
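
To sanity-check the thresholds without a database, the nested $cond can be mirrored in plain JavaScript. This is only an illustrative stand-in: the function name `category` is made up, and `matched` is assumed to be the ratio between 0 and 1 produced by the earlier $divide.

```javascript
// Plain-JavaScript mirror of the nested $cond expression, for checking
// the thresholds without a database. `matched` is a ratio between 0 and 1.
function category(matched) {
  if (matched === 1) return "100";
  if (matched >= 0.7) return "70-99";
  if (matched >= 0.4) return "40-69";
  return "under 40";
}

for (const ratio of [1, 0.75, 0.5, 0.25]) {
  console.log(ratio, "->", category(ratio));
}
```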

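The schema should be fine as it is, since you can create a compound index on the "key" and "value" of the entries in the features array, and that should scale well for queries.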
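
As a rough sketch, such an index could be declared as follows. The wrapper `createFeatureIndex` and the constant `featureIndexSpec` are hypothetical names; `createIndex` is the current shell method, while shells from around the time of this answer used `ensureIndex` for the same purpose.

```javascript
// Key specification for a compound multikey index over the embedded
// "key" and "value" fields of each features entry.
const featureIndexSpec = { "features.key": 1, "features.value": 1 };

// Hypothetical wrapper: `collection` is anything exposing createIndex(),
// such as db.products in the mongo shell.
function createFeatureIndex(collection) {
  return collection.createIndex(featureIndexSpec);
}
```

In the shell this would be called as `createFeatureIndex(db.products)`.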

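Of course, if you actually need something more, such as faceted search and results, you can look at solutions like Solr or Elasticsearch. But a full implementation would be a bit lengthy to cover here.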

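【Discussion】:

Perfect. Thank you very much :) I am still new to mongo and knew nothing about pipelines.
@Ankit Glad it helped, and glad you picked up that the divisor, the number of conditions you are testing, is important for getting the right result. Good luck.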

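【Solution 2】:

I am assuming you want to compare the rest of the collection against a given product, which is a textbook example for aggregation: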

lookingat = db.products.findOne({product_id:'50862224'})
matches = db.products.aggregate([
    { $unwind: '$features' },
    { $match: { features: { $in: lookingat.features }}},
    { $group: { _id: '$product_id', matchedfeatures: { $sum:1 }}},
    { $sort: { matchedfeatures: -1 }},
    { $limit: 5 },
    { $project: { _id:0, product_id: '$_id',
                  pctmatch: { $multiply: [ '$matchedfeatures',
                                           100/lookingat.features.length ]}
      }}
])

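A brief walk-through, from the perspective of a product in the collection with 6 features being compared against a target product ("lookingat") with 4 features, 3 of which match:

$unwind turns the 1 document with 6 features into 6 otherwise identical documents, each with 1 feature
$match looks for that feature in the target's features array (note that two documents are "equal" only if they have the same field names and values, in the same order), discarding the 3 that don't match and passing through the 3 that do
$group consumes those 3 matching documents and produces one new document telling you that 3 documents matched for that product_id
$sort and $limit give you the most relevant results and leave out all the 1-feature matches you were worried about
$project lets you rename the _id from the $group step back to product_id and turn the number of matched features into a percentage (we optimize by recognizing that 2 of the 3 terms in the calculation are constants and can be divided in JS)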
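
The same walk-through can be traced on in-memory data. This is a plain-JavaScript simulation, not the server's implementation: the feature lists are made up, `f` and `sameDoc` are hypothetical helpers, and `JSON.stringify` is used only as a stand-in for exact sub-document equality (same fields, same values, same order).

```javascript
// Small helper to keep the made-up sample data short.
const f = (key, value) => ({ key, value });

// Target with 4 features, candidate with 6, of which 3 are shared.
const lookingat = {
  product_id: "target",
  features: [f("Screen Format", "16:9"), f("DVD Player / Recorder", "No"),
             f("Weight in kg", "12.6"), f("Colour", "Black")],
};
const candidate = {
  product_id: "candidate",
  features: [f("Screen Format", "16:9"), f("DVD Player / Recorder", "No"),
             f("Colour", "Black"), f("Weight in kg", "9.8"),
             f("HDMI Ports", "3"), f("Tuner", "DVB-T")],
};

// Stand-in for exact sub-document equality; field order matters here
// because JSON.stringify emits these keys in insertion order.
const sameDoc = (a, b) => JSON.stringify(a) === JSON.stringify(b);

// $unwind: one document per feature.
const unwound = candidate.features.map((feature) => ({
  product_id: candidate.product_id,
  features: feature,
}));

// $match with $in: keep only features that also occur in the target.
const matchedDocs = unwound.filter((doc) =>
  lookingat.features.some((t) => sameDoc(t, doc.features))
);

// $group + $project: count the matches and scale to a percentage,
// with the constant 100 / length computed once on the client.
const pctmatch = matchedDocs.length * (100 / lookingat.features.length);

console.log(unwound.length, matchedDocs.length, pctmatch);
```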

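【Discussion】:

Thanks @ben-gamble for the accurate answer. It did solve my problem.. I just had to select the other answer because it gives me more control over how I want the features to match. With $in, I am assuming it only checks for equality. Please correct me if I have misunderstood. I wish I could award the bounty to both answers, but unfortunately I can't.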