【Posted】:2016-06-07 23:44:40
【Question】:
I have 2 mongo collections:
companies: each record is a company with several fields (city, country, etc.) -> 100k rows
{company_id:1, country:"USA", city:"New York",...}
{company_id:2, country:"Spain", city:"Valencia",... }
{company_id:3, country:"France", city:"Paris",... }
scores: organized in date blocks; each block has one company_id + score record per company, for example -> 100k rows in each block
{date: 2016-05-29, company_id:1, score:90}
{date: 2016-05-29, company_id:2, score:87}
{date: 2016-05-29, company_id:3, score:75}
...
{date: 2016-05-22, company_id:1, score:88}
{date: 2016-05-22, company_id:2, score:87}
{date: 2016-05-22, company_id:3, score:76}
...
{date: 2016-05-15, company_id:1, score:91}
{date: 2016-05-15, company_id:2, score:82}
{date: 2016-05-15, company_id:3, score:73}
...
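Goal:
I want to retrieve a list of companies that can be filtered by some fields (country, city, ...), together with each company's latest score (2016-05-29), ordered by score descending.
That is: filter on one collection, and filter + sort on the other.
Note: there is an index on scores.date, so we can quickly find/precompute the latest date (2016-05-29 in this example).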
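To pin down the expected result, here is a minimal in-memory sketch of the goal in plain JavaScript, using the sample documents above. The helper name `latestScores` is hypothetical, dates are kept as plain strings for simplicity, and the filter only supports exact equality on top-level fields. It illustrates the semantics only; it is not a MongoDB query.

```javascript
// Sample documents from the question; dates are plain strings here for simplicity.
const companies = [
  { company_id: 1, country: "USA", city: "New York" },
  { company_id: 2, country: "Spain", city: "Valencia" },
  { company_id: 3, country: "France", city: "Paris" },
];
const scores = [
  { date: "2016-05-29", company_id: 1, score: 90 },
  { date: "2016-05-29", company_id: 2, score: 87 },
  { date: "2016-05-29", company_id: 3, score: 75 },
  { date: "2016-05-22", company_id: 1, score: 88 },
  { date: "2016-05-22", company_id: 2, score: 87 },
  { date: "2016-05-22", company_id: 3, score: 76 },
];

// Hypothetical helper: in-memory equivalent of the desired result.
function latestScores(companies, scores, filter) {
  // "YYYY-MM-DD" strings compare correctly as text, so the max string is the latest date.
  const latest = scores.reduce((max, s) => (s.date > max ? s.date : max), "");
  // Map company_id -> score for the latest date block only.
  const byCompany = new Map(
    scores.filter((s) => s.date === latest).map((s) => [s.company_id, s.score])
  );
  return companies
    // Keep companies whose fields equal every key/value pair in the filter.
    .filter((c) => Object.entries(filter).every(([k, v]) => c[k] === v))
    // Keep only companies that have a score in the latest block.
    .filter((c) => byCompany.has(c.company_id))
    .map((c) => ({ ...c, score: byCompany.get(c.company_id), date: latest }))
    // Highest score first.
    .sort((a, b) => b.score - a.score);
}
```

For example, `latestScores(companies, scores, { country: "USA" })` keeps only company 1 with its 2016-05-29 score, while an empty filter returns all three companies sorted by score.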
Attempts:
I have been trying an aggregate query with $lookup. When the filter is selective (so few companies match), the query is fast.
The query is as follows:
db.companies.aggregate([
  {$match: {"status": "running", "country": "USA", "city": "San Francisco",
            "categories": {$in: ["Software"]}, "dummy": false}},
  {$lookup: {from: "scores", localField: "company_id", foreignField: "company_id", as: "scores"}},
  {$unwind: "$scores"},
  {$project: {_id: "$_id",
              "company_id": "$company_id",
              "company_name": "$company_name",
              "status": "$status",
              "city": "$city",
              "country": "$country",
              "categories": "$categories",
              "dummy": "$dummy",
              "score": "$scores.score",
              "date": "$scores.date"}},
  {$match: {"date": ISODate("2016-05-29T00:00:00Z")}},
  {$sort: {"score": -1}}
], {allowDiskUse: true})
But when the filter is loose or empty (so many more companies match), the $sort stage takes several seconds.
db.companies.aggregate([
  {$match: {"status": "running"}},
  {$lookup: {from: "scores", localField: "company_id", foreignField: "company_id", as: "scores"}},
  {$unwind: "$scores"},
  {$project: {_id: "$_id",
              "company_id": "$company_id",
              "company_name": "$company_name",
              "status": "$status",
              "city": "$city",
              "country": "$country",
              "categories": "$categories",
              "dummy": "$dummy",
              "score": "$scores.score",
              "date": "$scores.date"}},
  {$match: {"date": ISODate("2016-05-29T00:00:00Z")}},
  {$sort: {"score": -1}}
], {allowDiskUse: true})
This is probably due to the number of companies the filter matches: 59 rows are much easier to sort than 89k.
> db.companies.count({"status": "running", "country": "USA", "city": "San Francisco", "categories": { $in: ["Software"]}, dummy: false})
59
> db.companies.count({"status": "running"})
89043
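The two companies-first queries differ only in their first $match stage. As a sketch, the shared pipeline can be built by one function that takes the company filter and the precomputed latest date as parameters. The function name `companiesFirstPipeline` is hypothetical; the stages mirror the queries in this post, except that `$project` uses the `field: 1` inclusion form instead of `"field": "$field"`.

```javascript
// Hypothetical helper: builds the companies-first pipeline for any company filter.
function companiesFirstPipeline(companyFilter, latestDate) {
  return [
    // 1. Filter companies first (selective filters leave few documents to join).
    { $match: companyFilter },
    // 2. Join every score document of each matching company.
    { $lookup: { from: "scores", localField: "company_id",
                 foreignField: "company_id", as: "scores" } },
    // 3. One output document per (company, score) pair.
    { $unwind: "$scores" },
    // 4. Flatten the joined score fields next to the company fields.
    { $project: { company_id: 1, company_name: 1, status: 1, city: 1,
                  country: 1, categories: 1, dummy: 1,
                  score: "$scores.score", date: "$scores.date" } },
    // 5. Keep only the latest date block.
    { $match: { date: latestDate } },
    // 6. Sort the remaining documents by score, highest first.
    { $sort: { score: -1 } },
  ];
}

// Usage in the mongo shell (not executed here):
// db.companies.aggregate(
//   companiesFirstPipeline({status: "running"}, ISODate("2016-05-29T00:00:00Z")),
//   {allowDiskUse: true})
```

Written this way, it is easier to see that the $sort in step 6 runs on whatever survives steps 1 and 5, which is consistent with the 59 vs. 89043 difference shown by the counts.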
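I also tried a different approach: aggregate on scores, filter by date, then sort by score (an index on date + score is very useful here). Everything is very fast until the final $match, which filters on the company attributes.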
db.scores.aggregate([
  {$match: {"date": ISODate("2016-05-29T00:00:00Z")}},
  {$sort: {"score": -1}},
  {$lookup: {from: "companies", localField: "company_id", foreignField: "company_id", as: "companies"}},
  {$unwind: "$companies"},
  {$project: {_id: "$companies._id",
              "company_id": "$companies.company_id",
              "company_name": "$companies.company_name",
              "status": "$companies.status",
              "city": "$companies.city",
              "country": "$companies.country",
              "categories": "$companies.categories",
              "dummy": "$companies.dummy",
              "score": "$score",
              "date": "$date"}},
  {$match: {"status": "running", "country": "USA", "city": "San Francisco",
            "categories": {$in: ["Software"]}, "dummy": false}}
], {allowDiskUse: true})
With this approach, the selective filter (the previous example) is slow, while the loose filter (just {"status": "running"}) is faster.
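For reference, the scores-first approach can also be sketched as a pipeline builder. In this variant the company filter is applied directly to the joined `companies` sub-document by prefixing each filter key, so no $project is needed before the final $match. Both function names are hypothetical, `prefixKeys` only handles plain top-level field names (not operators such as `$or`), and whether this changes the timings described above has not been measured.

```javascript
// Hypothetical helper: {status: "running"} -> {"companies.status": "running"}.
// Only handles plain field names at the top level of the filter.
function prefixKeys(filter, prefix) {
  return Object.fromEntries(
    Object.entries(filter).map(([key, value]) => [prefix + key, value])
  );
}

// Hypothetical helper: builds the scores-first pipeline.
function scoresFirstPipeline(companyFilter, latestDate) {
  return [
    // 1. Select the latest date block.
    { $match: { date: latestDate } },
    // 2. Sort by score, highest first; placed before the join so it can
    //    use the date + score index mentioned in the post.
    { $sort: { score: -1 } },
    // 3. Join the company document for each score.
    { $lookup: { from: "companies", localField: "company_id",
                 foreignField: "company_id", as: "companies" } },
    // 4. One output document per (score, company) pair.
    { $unwind: "$companies" },
    // 5. Filter on the joined company attributes.
    { $match: prefixKeys(companyFilter, "companies.") },
  ];
}

// Usage in the mongo shell (not executed here):
// db.scores.aggregate(
//   scoresFirstPipeline({status: "running", country: "USA"},
//                       ISODate("2016-05-29T00:00:00Z")),
//   {allowDiskUse: true})
```

Note that the company filter still runs last, after every score in the date block has been joined, so this sketch keeps the same overall shape as the query above.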
Is there any way to join these two collections, filter on both, and sort by one field?
【Discussion】:
Tags: mongodb mongodb-query aggregation-framework