How to partition a LINQ to Objects query? Answers

【Question Title】: How to partition a LINQ to Objects query?
【Posted】: 2011-07-16 21:06:06
【Question Description】:

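This is a resource allocation problem. My goal is to run a query that fetches the highest-priority shift for every time slot.

The dataset is very large. For this example, assume 1000 companies with 100 shifts each (the real dataset is even larger). Everything is loaded into memory, and I need to run a single LINQ to Objects query against it: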

    var topShifts =
            (from s in shifts
            where (from s2 in shifts
                   where s2.CompanyId == s.CompanyId && s.TimeSlot == s2.TimeSlot
                   orderby s2.Priority
                   select s2).First().Equals(s)
            select s).ToList();

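The problem is that, without optimization, LINQ to Objects compares every object in both sets, cross-joining all 1,000 x 100 shifts against all 1,000 x 100 shifts, which amounts to 10 billion (10,000,000,000) comparisons. What I want is to compare only the objects within each company (as if the company were indexed in a SQL table). That would give 1000 sets of 100 x 100 objects, for a total of 10 million (10,000,000) comparisons. As the number of companies grows, the latter scales linearly rather than quadratically.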
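To make the arithmetic explicit, write C for the number of companies and S for the number of shifts per company (both symbols are introduced here only for illustration):

```latex
\begin{align*}
\text{unpartitioned:}\quad (C \cdot S)^2 &= (1000 \cdot 100)^2 = 10^{10} \\
\text{partitioned by company:}\quad C \cdot S^2 &= 1000 \cdot 100^2 = 10^{7}
\end{align*}
```

For a fixed S, the first count grows with the square of C and the second grows in proportion to C.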

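A technology like I4o would let me do something like this, but unfortunately I don't have the luxury of using a custom collection in the environment where this query runs. Also, I only expect to run this query once on any given dataset, so a persistent index is of limited value. I'm hoping for an extension method that groups the data by company and then runs the expression on each group.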
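A minimal sketch of what such an extension method might look like. `PartitionBy`, `PartitionExtensions` and `PartitionDemo` are hypothetical names, not part of LINQ, and the `Shift` struct here is a simplified stand-in without the iteration counter. The subquery from the question is kept unchanged, but it only ever sees one company's shifts at a time:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified stand-in for the Shift struct in the question (no iteration counter).
public struct Shift
{
    public int CompanyId;
    public int Id;
    public int TimeSlot;
    public int Priority;
}

public static class PartitionExtensions
{
    // Hypothetical helper: groups the source by a key, runs the supplied
    // query once per group, and flattens the per-group results.
    public static IEnumerable<TResult> PartitionBy<TSource, TKey, TResult>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        Func<IEnumerable<TSource>, IEnumerable<TResult>> queryPerPartition)
    {
        return source.GroupBy(keySelector).SelectMany(g => queryPerPartition(g));
    }
}

public static class PartitionDemo
{
    public static List<Shift> TopShifts(IEnumerable<Shift> shifts)
    {
        // Same subquery as in the question, restricted to a single company.
        return shifts.PartitionBy(
            s => s.CompanyId,
            company => from s in company
                       where (from s2 in company
                              where s2.TimeSlot == s.TimeSlot
                              orderby s2.Priority
                              select s2).First().Equals(s)
                       select s).ToList();
    }
}
```

Calling `PartitionDemo.TopShifts(shifts)` still scans a whole company for every shift, so this only removes the cross-company comparisons; it does not remove the subquery itself.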

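Full sample code:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public struct Shift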
{
    public static long Iterations;

    private int companyId;
    public int CompanyId
    {
        get { Iterations++; return companyId; }
        set { companyId = value; }
    }

    public int Id;
    public int TimeSlot;
    public int Priority;
}

class Program
{
    static void Main(string[] args)
    {
        const int Companies = 1000;
        const int Shifts = 100;
        Console.WriteLine(string.Format("{0} Companies x {1} Shifts", Companies, Shifts));
        var timer = Stopwatch.StartNew();

        Console.WriteLine("Populating data");
        var shifts = new List<Shift>();
        for (int companyId = 0; companyId < Companies; companyId++)
        {
            for (int shiftId = 0; shiftId < Shifts; shiftId++)
            {
                shifts.Add(new Shift() { CompanyId = companyId, Id = shiftId, TimeSlot = shiftId / 3, Priority = shiftId % 5 });
            }
        }
        Console.WriteLine(string.Format("Completed in {0:n}ms", timer.ElapsedMilliseconds));
        timer.Restart();

        Console.WriteLine("Computing Top Shifts");
        var topShifts =
                (from s in shifts
                where (from s2 in shifts
                       where s2.CompanyId == s.CompanyId && s.TimeSlot == s2.TimeSlot
                       orderby s2.Priority
                       select s2).First().Equals(s)
                select s).ToList();
        Console.WriteLine(string.Format("Completed in {0:n}ms", timer.ElapsedMilliseconds));
        timer.Restart();

        Console.WriteLine("\nShifts:");
        foreach (var shift in shifts.Take(20))
        {
            Console.WriteLine(string.Format("C {0} Id {1} T {2} P{3}", shift.CompanyId, shift.Id, shift.TimeSlot, shift.Priority));
        }

        Console.WriteLine("\nTop Shifts:");
        foreach (var shift in topShifts.Take(10))
        {
            Console.WriteLine(string.Format("C {0} Id {1} T {2} P{3}", shift.CompanyId, shift.Id, shift.TimeSlot, shift.Priority));
        }

        Console.WriteLine(string.Format("\nTotal Comparisons: {0:n}", Shift.Iterations/2));

        Console.WriteLine("Any key to continue");
        Console.ReadKey();
    }
}

Sample output:

1000 Companies x 100 Shifts
Populating data
Completed in 10.00ms
Computing Top Shifts
Completed in 520,721.00ms

Shifts:
C 0 Id 0 T 0 P0
C 0 Id 1 T 0 P1
C 0 Id 2 T 0 P2
C 0 Id 3 T 1 P3
C 0 Id 4 T 1 P4
C 0 Id 5 T 1 P0
C 0 Id 6 T 2 P1
C 0 Id 7 T 2 P2
C 0 Id 8 T 2 P3
C 0 Id 9 T 3 P4
C 0 Id 10 T 3 P0
C 0 Id 11 T 3 P1
C 0 Id 12 T 4 P2
C 0 Id 13 T 4 P3
C 0 Id 14 T 4 P4
C 0 Id 15 T 5 P0
C 0 Id 16 T 5 P1
C 0 Id 17 T 5 P2
C 0 Id 18 T 6 P3
C 0 Id 19 T 6 P4

Top Shifts:
C 0 Id 0 T 0 P0
C 0 Id 5 T 1 P0
C 0 Id 6 T 2 P1
C 0 Id 10 T 3 P0
C 0 Id 12 T 4 P2
C 0 Id 15 T 5 P0
C 0 Id 20 T 6 P0
C 0 Id 21 T 7 P1
C 0 Id 25 T 8 P0
C 0 Id 27 T 9 P2

Total Comparisons: 10,000,000,015.00
Any key to continue

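Questions:

How can I partition the query (while still executing it as a single LINQ query) so that the comparisons drop from 10 billion to 10 million?
Is there a more efficient way to solve this problem than a subquery?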

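【Question Discussion】:

A nicely explained question from a newcomer. More like this :-) +1
Thanks. I actually have a follow-up (coming soon) that focuses on matching times by range instead of by TimeSlot ID. I wanted to post it separately to keep this question from getting too complicated.
For reference, here's the follow-up question that doesn't use TimeSlot IDs. With overlap detection introduced, I still can't work out how to do the grouping, but I'm sure grouping is the way to solve that one too.

Tags: c# .net indexing clr linq-to-objects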

【Solution 1】:

How about

            var topShifts = from s in shifts.GroupBy(s => s.CompanyId)
                        from a in s.GroupBy(b => b.TimeSlot)
                        select a.OrderBy(p => p.Priority).First();

It seems to produce the same output, but with only 100015 comparisons

@Geoff's edit just halved my comparison count :-)
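The nested `GroupBy` works because each grouping is hash-based, so every shift is bucketed once per level instead of being compared against every other shift. Sorting each time slot just to take its first element can also be replaced with a single pass. The following is a sketch only: `TopShiftFinder` and `LowestPriority` are hypothetical names, the `Shift` struct is a simplified stand-in for the one in the question, and ties are assumed to go to the earliest shift in source order (which is what the stable `OrderBy(...).First()` does):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified stand-in for the Shift struct in the question (no iteration counter).
public struct Shift
{
    public int CompanyId;
    public int Id;
    public int TimeSlot;
    public int Priority;
}

public static class TopShiftFinder
{
    // Single pass over one time slot: keeps the first shift with the lowest
    // Priority. The strict '<' means an earlier shift wins a tie.
    private static Shift LowestPriority(IEnumerable<Shift> slot)
    {
        return slot.Aggregate((best, next) => next.Priority < best.Priority ? next : best);
    }

    public static List<Shift> TopShifts(IEnumerable<Shift> shifts)
    {
        // Group by company, then by time slot within each company.
        return (from company in shifts.GroupBy(s => s.CompanyId)
                from slot in company.GroupBy(s => s.TimeSlot)
                select LowestPriority(slot)).ToList();
    }
}
```

`Aggregate` without a seed throws on an empty sequence, but a group produced by `GroupBy` always contains at least one element.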

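【Discussion】:

+1 I don't think my answer can beat this :) It completes in 5 ms on my machine.
Thanks. This solves it, and it runs very fast (86 ms for me once I add .ToList() back to the end of the query). I like that it bypasses the subquery entirely. Accepting this answer because it was the first to post the nested GroupBy.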

【Solution 2】:

Have you tried using group by:

var topShifts =  from s in shifts
                 group s by new { 
                     CompanyId = s.CompanyId, 
                     TimeSlot = s.TimeSlot } into p
                 let temp = p.OrderBy(x => x.Priority).FirstOrDefault()
                 select new
                     {
                         CompanyId = temp.CompanyId,
                         TimeSlot = temp.TimeSlot,
                         Id = temp.Id,
                         Priority = temp.Priority
                     };
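Grouping by an anonymous type works as a composite key because C# anonymous types compare by the values of their properties. If the caller wants the original `Shift` values rather than copies in a new anonymous type, the projection can be dropped. A sketch under that assumption (`CompositeKeyQuery` is a hypothetical name, and `Shift` is a simplified stand-in for the struct in the question):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified stand-in for the Shift struct in the question (no iteration counter).
public struct Shift
{
    public int CompanyId;
    public int Id;
    public int TimeSlot;
    public int Priority;
}

public static class CompositeKeyQuery
{
    public static List<Shift> TopShifts(IEnumerable<Shift> shifts)
    {
        // One grouping pass keyed on (CompanyId, TimeSlot); each group is never
        // empty, so First() is safe after ordering by Priority.
        var query = from s in shifts
                    group s by new { s.CompanyId, s.TimeSlot } into slot
                    select slot.OrderBy(x => x.Priority).First();
        return query.ToList();
    }
}
```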

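【Discussion】:

I'm getting a different result set, but I'm still trying to work out why
@David. It works now. I was also grouping by shiftId. Speed-wise it's similar to your answer, so I guess it comes down to which code style you prefer.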

【Solution 3】:

To be honest, I'm not quite sure what you want, but from reading your code I'd say you could do something like

(from company in shifts.GroupBy(s=>s.CompanyID)
 let lowPriority = (from slot in company.GroupBy(s=>s.TimeSlot)
select slot).OrderBy(s=>s.Priority).First()
 select lowPriority).ToList();

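【Discussion】:

@Geoff What can I say. My phone has no compiler and my head doesn't produce error messages, but the idea should be obvious (especially since the answers posted after this one use the same approach :)). Hopefully it compiles now, but I can't verify that
Fair enough. It still doesn't work correctly, but I see what you mean.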
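As the comments note, the snippet in this answer does not work as posted: the property is `CompanyId`, not `CompanyID`, and `OrderBy(s=>s.Priority)` is applied to the sequence of time-slot groups rather than to the shifts inside each group. One possible reading of the intended query, keeping the `let` clause, is sketched below. This is a guess at the intent rather than the answerer's code; `LetQuery` is a hypothetical name and `Shift` is a simplified stand-in for the struct in the question:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified stand-in for the Shift struct in the question (no iteration counter).
public struct Shift
{
    public int CompanyId;
    public int Id;
    public int TimeSlot;
    public int Priority;
}

public static class LetQuery
{
    public static List<Shift> TopShifts(IEnumerable<Shift> shifts)
    {
        return (from company in shifts.GroupBy(s => s.CompanyId)
                // For each company, pick the lowest-Priority shift in every time slot.
                let topPerSlot = company.GroupBy(x => x.TimeSlot)
                                        .Select(slot => slot.OrderBy(y => y.Priority).First())
                // Flatten the per-company results into one sequence.
                from shift in topPerSlot
                select shift).ToList();
    }
}
```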