Performance ideas (in-memory C# HashSet and Contains too slow): answers

【Question title】: Performance ideas (in-memory C# hashset and contains too slow)
【Posted】: 2011-07-03 01:50:28
【Question】:

I have the following code:

private void LoadIntoMemory()
{
    //Init large HashSet
    HashSet<document> hsAllDocuments = new HashSet<document>();

    //Get first rows from database
    List<document> docsList = document.GetAllAboveDocID(0, 500000);

    //Load objects into dictionary
    foreach (document d in docsList)
    {
        hsAllDocuments.Add(d);
    }

    Application["dicAllDocuments"] = hsAllDocuments;
}

private HashSet<document> documentHits(HashSet<document> hsRawHit, HashSet<document> hsAllDocuments, string query, string[] queryArray)
{
    int counter = 0;
    const int maxCount = 1000;

    foreach (document d in hsAllDocuments)
    {
        //Headline
        if (d.Headline.Contains(query))
        {
            if (counter >= maxCount)
                break;
            hsRawHit.Add(d);
            counter++;
        }

        //Description
        if (d.Description.Contains(query))
        {
            if (counter >= maxCount)
                break;
            hsRawHit.Add(d);
            counter++;
        }

        //splitted query word by word
        //string[] queryArray = query.Split(' ');
        if (queryArray.Count() > 1)
        {
            foreach (string q in queryArray)
            {
                if (d.Headline.Contains(q))
                {
                    if (counter >= maxCount)
                        break;
                    hsRawHit.Add(d);
                    counter++;
                }

                //Description
                if (d.Description.Contains(q))
                {
                    if (counter >= maxCount)
                        break;
                    hsRawHit.Add(d);
                    counter++;
                }
            }
        }

    }

    return hsRawHit;
}

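First I load all the data into a HashSet (stored in Application for later use). That part works fine, and it is perfectly acceptable for it to be slow given what I'm doing.

This will run on C# with the .NET 4.0 framework (I can't move to the newer release with the async features).

The documentHits method runs fairly slowly on my current setup, considering that everything is in memory. What can I do to speed this method up?

Examples would be great. Thanks.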

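【Question comments】:

What does your profiler run say is the slowest part? Start there. How slow is it? What is your budget for "fast enough"?
It probably does not scale linearly with the number of documents.
Since you are only iterating over the contents, why bother using a HashSet at all?
He is using the HashSet to prevent duplicates, which is the wrong way to do it.

Tags: c# performance hashtable contains hashset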

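【Solution 1】:

If you can afford plenty of time up front when building the data store, you could consider using a Trie.

A Trie will make the string searches much faster.

There is a brief explanation and an implementation at the end here.

Another implementation: Trie class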
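As a rough illustration of the idea (not the linked implementations), here is a minimal word-level trie sketch. The names `WordTrie`, `Add` and `FindByPrefix` are made up for this example, and it assumes documents are split into words before indexing. Note that this only answers "which documents have a word starting with this prefix"; matching arbitrary substrings the way `string.Contains` does would require indexing every suffix of each word as well, at a considerable memory cost.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical word-level trie: each node remembers the ids of the documents
// that contain a word passing through that node.
public class WordTrie
{
    private class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public readonly HashSet<int> DocIds = new HashSet<int>();
    }

    private readonly Node _root = new Node();

    // Index one word for one document. Every node on the path records the id,
    // so a lookup for any non-empty prefix of the word finds the document.
    public void Add(string word, int docId)
    {
        Node node = _root;
        foreach (char c in word)
        {
            Node next;
            if (!node.Children.TryGetValue(c, out next))
            {
                next = new Node();
                node.Children.Add(c, next);
            }
            next.DocIds.Add(docId);
            node = next;
        }
    }

    // Returns the ids of documents having a word that starts with the prefix.
    // An empty prefix returns nothing, because the root stores no ids.
    public IEnumerable<int> FindByPrefix(string prefix)
    {
        Node node = _root;
        foreach (char c in prefix)
        {
            if (!node.Children.TryGetValue(c, out node))
                return new int[0];
        }
        return node.DocIds;
    }
}
```

The lookup cost depends on the length of the prefix rather than on the number of documents, which is where the speed-up over a linear scan would come from.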

【Comments】:

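【Solution 2】:

You are running linearly through all documents to find matches, which is O(n) in the number of documents. You can do better if you solve the inverse problem, similar to how a full-text index works: start with the query terms and precompute the set of documents that match each term. Since this can get complicated, I would suggest just using a database with full-text capabilities, which will be much faster than your approach.

Also, you are misusing HashSet. Just use a List instead and don't add duplicates: all the cases in documentHits() that produce a match should be mutually exclusive.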
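To make the "inverse problem" a little more concrete, here is a minimal inverted-index sketch. It is an illustration under stated assumptions, not a substitute for a real full-text engine: the class name `InvertedIndex`, the separator list and the case-insensitive comparison are all choices made for this example, and it matches whole words only, unlike the substring matching of `string.Contains` in the question.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical inverted index: word -> ascending list of document positions.
public class InvertedIndex
{
    private readonly Dictionary<string, List<int>> _postings =
        new Dictionary<string, List<int>>(StringComparer.OrdinalIgnoreCase);

    // Build once at load time; texts[i] is the searchable text of document i.
    public InvertedIndex(IList<string> texts)
    {
        char[] separators = { ' ', '\t', '\r', '\n', ',', '.', ';', ':', '!', '?' };
        for (int docId = 0; docId < texts.Count; docId++)
        {
            string[] words = texts[docId].Split(separators, StringSplitOptions.RemoveEmptyEntries);
            foreach (string word in words)
            {
                List<int> list;
                if (!_postings.TryGetValue(word, out list))
                {
                    list = new List<int>();
                    _postings.Add(word, list);
                }
                // Ids arrive in ascending order, so comparing with the last
                // entry is enough to avoid storing a document twice.
                if (list.Count == 0 || list[list.Count - 1] != docId)
                    list.Add(docId);
            }
        }
    }

    // Returns the positions of documents containing the exact word.
    public IList<int> Lookup(string word)
    {
        List<int> list;
        return _postings.TryGetValue(word, out list) ? list : new List<int>();
    }
}
```

With this in place, a query becomes a dictionary lookup per query word instead of a scan over every document, and the per-word lists can be merged if several words must match.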

【Comments】:

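【Solution 3】:

I see that you are using a HashSet, but you are not using any of its advantages, so you should use a List instead.

What takes time is looping through all the documents and looking for strings inside other strings, so you should try to eliminate as much of that work as possible.

One possibility is to build an index of which documents contain which character pairs. If the string query contains Hello, you would look at the documents that contain He, el, ll and lo.

You could set up a Dictionary<string, List<int>> where the dictionary key is the character combination and the list contains the indexes of the documents in your document list. Setting up the dictionary takes some time, of course, but you can concentrate on the less common character combinations. If a character combination exists in 80% of the documents, it is nearly useless for eliminating documents, but if a character combination exists in only 2% of the documents, it eliminates 98% of your work.

If you loop through the documents in the list and add the occurrences to the lists in the dictionary, the index lists will be sorted, so joining them later is very easy. As you add indexes to a list, you can throw the list away once it gets too large and it is clear that it would not be useful for eliminating documents. That way you keep only the shorter lists, and they will not consume much memory.

Edit:

I put together a small example: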

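using System;
using System.Collections.Generic;
using System.Linq;

// Indexes items by the character pairs in their content so that Find()
// only has to run Contains() on a small set of candidate items.
public class IndexElliminator<T> {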

  private List<T> _items;
  private Dictionary<string, List<int>> _index;
  private Func<T, string> _getContent;

  private static HashSet<string> GetPairs(string value) {
    HashSet<string> pairs = new HashSet<string>();
    for (int i = 1; i < value.Length; i++) {
      pairs.Add(value.Substring(i - 1, 2));
    }
    return pairs;
  }

  public IndexElliminator(List<T> items, Func<T, string> getContent, int maxIndexSize) {
    _items = items;
    _getContent = getContent;
    _index = new Dictionary<string, List<int>>();
    for (int index = 0;index<_items.Count;index++) {
      T item = _items[index];
      foreach (string pair in GetPairs(_getContent(item))) {
        List<int> list;
        if (_index.TryGetValue(pair, out list)) {
          if (list != null) {
            if (list.Count == maxIndexSize) {
              _index[pair] = null;
            } else {
              list.Add(index);
            }
          }
        } else {
          list = new List<int>();
          list.Add(index);
          _index.Add(pair, list);
        }
      }
    }
  }

  private static List<int> JoinLists(List<int> list1, List<int> list2) {
    List<int> result = new List<int>();
    int i1 = 0, i2 = 0;
    while (i1 < list1.Count && i2 < list2.Count) {
      switch (Math.Sign(list1[i1].CompareTo(list2[i2]))) {
        case 0: result.Add(list1[i1]); i1++; i2++; break;
        case -1: i1++; break;
        case 1: i2++; break;
      }
    }
    return result;
  }

  public List<T> Find(string query) {
    HashSet<string> pairs = GetPairs(query);
    List<List<int>> indexes = new List<List<int>>();
    bool found = false;
    foreach (string pair in pairs) {
      List<int> list;
      if (_index.TryGetValue(pair, out list)) {
        found = true;
        if (list != null) {
          indexes.Add(list);
        }
      }
    }
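    List<T> result = new List<T>();
    // Fall back to scanning every item when the query is too short to form a pair,
    // or when every matching pair list was discarded for being too common.
    if ((pairs.Count == 0 || found) && indexes.Count == 0) {
      indexes.Add(Enumerable.Range(0, _items.Count).ToList());
    }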
    if (indexes.Count > 0) {
      while (indexes.Count > 1) {
        indexes[indexes.Count - 2] = JoinLists(indexes[indexes.Count - 2], indexes[indexes.Count - 1]);
        indexes.RemoveAt(indexes.Count - 1);
      }
      foreach (int index in indexes[0]) {
        if (_getContent(_items[index]).Contains(query)) {
          result.Add(_items[index]);
        }
      }
    }
    return result;
  }

}

Test:

List<string> items = new List<string> {
  "Hello world",
  "How are you",
  "What is this",
  "Can this be true",
  "Some phrases",
  "Words upon words",
  "What to do",
  "Where to go",
  "When is this",
  "How can this be",
  "Well above margin",
  "Close to the center"
};
IndexElliminator<string> index = new IndexElliminator<string>(items, s => s, items.Count / 2);

List<string> found = index.Find("this");
foreach (string s in found) Console.WriteLine(s);

Output:

What is this
Can this be true
When is this
How can this be

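【Comments】:

【Solution 4】:

You should not run every test step against every document!

You should move on to the next document after the first successful test.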

hsRawHit.Add(d);
counter++;

You should add continue; after counter++;

hsRawHit.Add(d);
counter++;
continue;
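One caveat when applying this to the question's code: inside the inner foreach over queryArray, a `continue` (like the existing `break`) only affects that inner loop, not the loop over documents. A possible way around this, sketched below under some assumptions, is to move the tests into a helper that returns as soon as one of them succeeds. The `Document` class here is a stand-in for the question's `document` type, and `IsHit` and `DocumentSearch` are names invented for this example. Because each document is then added at most once, a plain List is enough and no HashSet is needed to filter duplicates.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for the question's document class; only the searched fields are modeled.
public class Document
{
    public string Headline;
    public string Description;
}

public static class DocumentSearch
{
    // Returns true as soon as any test succeeds, so the remaining tests
    // are skipped for this document.
    private static bool IsHit(Document d, string query, string[] queryArray)
    {
        if (d.Headline.Contains(query) || d.Description.Contains(query))
            return true;

        if (queryArray.Length > 1)
        {
            foreach (string q in queryArray)
            {
                if (d.Headline.Contains(q) || d.Description.Contains(q))
                    return true;
            }
        }
        return false;
    }

    public static List<Document> DocumentHits(IEnumerable<Document> allDocuments,
        string query, string[] queryArray, int maxCount)
    {
        List<Document> hits = new List<Document>();
        foreach (Document d in allDocuments)
        {
            if (hits.Count >= maxCount)
                break;              // stop scanning once enough hits are collected
            if (IsHit(d, query, queryArray))
                hits.Add(d);        // each document is added at most once
        }
        return hits;
    }
}
```

Unlike the original, this counts each matching document once toward maxCount, so the limit applies to distinct documents rather than to individual successful tests.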

【Comments】: