Improve algorithm execution time: answers

【Question title】: Improve algorithm execution time
【Posted】: 2012-12-02 02:22:20
【Question】:

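I am working on a data mining project and chose the Apriori algorithm for the association rule task. In short, I am not satisfied with its execution time. I will describe only the problematic part of my code.

I have two lists of lists.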

List<List<int>> one;

List<List<int>> two;

I have to iterate over the elements of list one and check whether one[i] is a subset of two[j].

foreach (List<int> items1 in one)
{
    foreach (List<int> items2 in two)
    {
        // ContainsSetOf is the asker's own method: true if items2 contains every element of items1
        if (items2.ContainsSetOf(items1))
        {
            //do something
        }
    }
}

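I was wondering whether there is a way to reduce the execution time of this approach (parallel execution, using dictionaries, etc.).

Do you have any idea how to reduce it?

Thanks!

【Comments】:

Tags: c# algorithm parallel-processing

【Solution 1】:

Make them lists of sets, and use set operations to determine whether one set is a subset of another.

Example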

HashSet<int> set1 = new HashSet<int>();
set1.Add(1);
set1.Add(2);

HashSet<int> set2 = new HashSet<int>();
set2.Add(1);
set2.Add(2);
set2.Add(3);

List<HashSet<int>> one = new List<HashSet<int>>();
one.Add(set1);
one.Add(set2);

List<HashSet<int>> two = new List<HashSet<int>>();
two.Add(set1);
two.Add(set2);

foreach (HashSet<int> setA in one) {
    foreach (HashSet<int> setB in two) {
        if (setA.IsSubsetOf(setB)) {
            // do something
        }
    }
}
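Since the question starts from List<List<int>>, the following is a minimal sketch of how the lists might be converted to sets once, up front, before running the same pairwise check. The class and method names (SubsetPairs, Find) are hypothetical and not part of the answer, and the number of comparisons is still n1 x n2; only each individual subset test becomes cheaper.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SubsetPairs
{
    // Hypothetical helper (not from the answer): returns every index pair (i, j)
    // such that one[i] is a subset of two[j].
    public static List<Tuple<int, int>> Find(List<List<int>> one, List<List<int>> two)
    {
        // Build each HashSet once, instead of once per comparison.
        List<HashSet<int>> setsOne = one.Select(l => new HashSet<int>(l)).ToList();
        List<HashSet<int>> setsTwo = two.Select(l => new HashSet<int>(l)).ToList();

        var result = new List<Tuple<int, int>>();
        for (int i = 0; i < setsOne.Count; i++)
        {
            for (int j = 0; j < setsTwo.Count; j++)
            {
                // Still n1 x n2 comparisons; only the check itself is cheaper.
                if (setsOne[i].IsSubsetOf(setsTwo[j]))
                {
                    result.Add(Tuple.Create(i, j));
                }
            }
        }
        return result;
    }
}
```

Calling SubsetPairs.Find(one, two) returns the matching index pairs, so the "do something" step can be applied to each pair afterwards.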

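【Comments】:

Yes, I can use the IsSubsetOf() method, but that is not the issue. I still have to compare every element with every other element, which is N^2. Maybe I misunderstood your solution. Could you provide a code example?
@JohnLatham: Having two List<HashSet<T>> will reduce the execution time. Subset tests are much cheaper on proper sets than on lists. You could also consider using an index.
@Ibrahim, thanks. Do you know roughly how many times faster the iteration should be?
If I understand your problem correctly, the number of iterations is the same as before, because the requirement is to compare every set in one with every set in two. What you gain is in the check itself, because the way sets are implemented makes the subset test more efficient. That said, the number of iterations might be less than n^2, but without more details about the problem and the nature of the data stored in the sets it is hard to know.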

【Solution 2】:

C# code snippet

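//index: item -> the lists in two that contain that item
var dict = new Dictionary<int, HashSet<List<int>>>();

foreach (List<int> list2 in two) {
   foreach (int i in list2) {
      if (!dict.ContainsKey(i)) {
         //create empty HashSet dict[i]
         dict.Add(i, new HashSet<List<int>>());
      }
      //add reference to list2 to the HashSet dict[i]
      dict[i].Add(list2);
   }
}

foreach (List<int> list1 in one) {
   HashSet<List<int>> listsInTwoContainingList1 = null;
   foreach (int i in list1) {
      HashSet<List<int>> listsWithI;
      if (!dict.TryGetValue(i, out listsWithI)) {
         //i appears in no list of two, so no list of two can contain list1
         listsInTwoContainingList1 = new HashSet<List<int>>();
         break;
      }
      if (listsInTwoContainingList1 == null) {
         listsInTwoContainingList1 = new HashSet<List<int>>(listsWithI);
      } else {
         listsInTwoContainingList1.IntersectWith(listsWithI);
      }
      if (listsInTwoContainingList1.Count == 0) {   //optimization :p
         break;
      }
   }
   if (listsInTwoContainingList1 == null) {
      continue;   //list1 is empty; handle this case as your task requires
   }
   foreach (List<int> list2 in listsInTwoContainingList1) {
      //list2 contains list1
      //do something
   }
}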

Example

L2= {
L2a = {10, 20, 30, 40}
L2b = {30, 40, 50, 60}
L2c = {10, 25, 30, 40}
}

L1 = {
L1a = {10, 30, 40}
L1b = {30, 25, 50}
}

After the first part of the code:

dict[10] = {L2a, L2c}
dict[20] = {L2a}
dict[25] = {L2c}
dict[30] = {L2a, L2b, L2c}
dict[40] = {L2a, L2b, L2c}
dict[50] = {L2b}
dict[60] = {L2b}

In the second part of the code:

L1a: dict[10] ∩ dict[30] ∩ dict[40] = {L2a, L2c}
L1b: dict[30] ∩ dict[25] ∩ dict[50] = { }

So L1a is contained in L2a and L2c, but L1b is not contained in any list of L2.
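The walkthrough above can be reproduced with a small self-contained sketch. It stores indices into two rather than list references, and the names (InvertedIndexDemo, FindSupersets) are hypothetical and not part of the answer. It also assumes that an empty list in one should be reported as having no supersets, which may not be what your task needs.

```csharp
using System;
using System.Collections.Generic;

public static class InvertedIndexDemo
{
    // Hypothetical helper: for each list in "one", returns the sorted indices
    // of the lists in "two" that contain all of its items.
    public static List<List<int>> FindSupersets(List<List<int>> one, List<List<int>> two)
    {
        // Part 1: item -> indices of the lists in "two" that contain the item.
        var index = new Dictionary<int, HashSet<int>>();
        for (int j = 0; j < two.Count; j++)
        {
            foreach (int item in two[j])
            {
                HashSet<int> owners;
                if (!index.TryGetValue(item, out owners))
                {
                    owners = new HashSet<int>();
                    index[item] = owners;
                }
                owners.Add(j);
            }
        }

        // Part 2: intersect the index entries of every item in each list of "one".
        var result = new List<List<int>>();
        foreach (List<int> list1 in one)
        {
            HashSet<int> candidates = null;
            foreach (int item in list1)
            {
                HashSet<int> owners;
                if (!index.TryGetValue(item, out owners))
                {
                    candidates = new HashSet<int>(); // item is absent from "two"
                    break;
                }
                if (candidates == null) candidates = new HashSet<int>(owners);
                else candidates.IntersectWith(owners);
                if (candidates.Count == 0) break; // no candidate can remain
            }
            // Assumption: an empty list1 is reported as having no supersets.
            var sorted = new List<int>(candidates ?? new HashSet<int>());
            sorted.Sort();
            result.Add(sorted);
        }
        return result;
    }
}
```

With L1 and L2 from the example (in the order listed), the first result holds indices 0 and 2 (L2a and L2c) and the second result is empty.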

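Complexity

Now, regarding algorithmic complexity: suppose L1 has n1 elements and L2 has n2 elements, the average number of elements in the sublists of L1 is m1, and the average number of elements in the sublists of L2 is m2. Then:

The original solution is O(n1 x n2 x m1 x m2) if the ContainsSetOf method performs a nested loop, or at best O(n1 x n2 x (m1 + m2)) if it uses a HashSet. Is7aq's solution is also O(n1 x n2 x (m1 + m2)).
The proposed solution is O(n2 x m2 + n1 x (m1 x nd + n2)), where nd is the average number of elements in the sets dict[i].

The efficiency of the proposed solution depends heavily on this nd:

If nd is large, close to n2 (when every integer is part of every sublist of L2), then it is as slow as the original.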
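If nd is small (the sublists of L2 differ substantially from one another), the proposed solution should typically be much faster, especially when n1 and n2 are large.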

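【Comments】:

【Solution 3】:

If you want to reduce the number of "is this list contained in that list" (or subset) checks, one approach is to build a hierarchy (tree) of the lists. Of course, the performance improvement (if any) depends on the data: if no list contains another list, you will have to perform all the checks just as you do now.

【Comments】:

Thank you, Igor. I was thinking about something similar, but I was still hoping it would be possible to do it in O(N) time.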