STL Container For Best Performance?

【Question title】: STL Container For Best Performance?
【Posted】: 2020-11-18 19:18:36
【Question description】:

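I have a project where I need to read a text file and record the number of occurrences of each string, character, or number read before EOF.

I then need to print the top 10 most used words.

For example, the file would contain "This is a test for this project". I would read this and store each word in a container along with its current count.

Now, we are graded on efficiency in terms of time complexity as the input grows. So, I need some help choosing the most efficient STL container.

It seems order doesn't matter: I can always insert at the end and never have to insert anywhere else. However, I do have to search the container for the 10 most used words. Which STL container has the best time complexity for these requirements?

Also, if you could explain your reasoning so that I know more in the future, that would be great!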

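【Question comments】:

Definitely unordered_map.
std::unordered_map for the mapping from word to frequency count.
A std::unordered_map will get you the frequency counts, but the other part of the problem is getting the top 10, which can itself be done with a min-heap that tracks the top 10. Using the STL's heap is done through std::make_heap, std::push_heap, and so on.
@PaulMcKenzie geeksforgeeks.org says the search time for an unordered map is O(n) in the worst case. That doesn't seem so bad. Meanwhile, a map's search time is O(log(n)), but insertion is also O(log(n)) plus rebalancing, whereas an unordered map is O(1) or O(n). It looks like a map would work out of the box, wouldn't it?
@Zevias Keep in mind that a map consists of a key and data. The data is not sorted. So you would still need to sort the data to get the top 10. Again, you're only considering the first half of the assignment, not the second half (getting the top 10). OK, so you have this map of strings with counts, great. Now what are you going to do to get the top 10?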

Tags: c++ performance stl containers time-complexity

【Solution 1】:

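Suppose you've decided to use std::unordered_map<std::string, int> to get the frequency counts of the items. That's a good start, but the other part of the problem that needs solving is getting the top 10 items.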

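Whenever a problem asks to "get the top N" or "get the smallest N" or similar, there are various ways of getting that information.

One way is to sort the data and take the first N items. Using std::sort or a good sorting routine, that operation should have a time complexity of O(M*log(M)), where M is the total number of items being sorted (not the N being selected).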
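
The sort-based approach can be sketched as follows. The function name top_n_by_sort is an assumption for illustration; the sketch copies the map into a vector because a hash map cannot be sorted in place.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: returns up to n (word, count) pairs, highest count first.
std::vector<std::pair<std::string, int>>
top_n_by_sort(const std::unordered_map<std::string, int>& counts, std::size_t n)
{
    // Copy every entry of the map into a sortable sequence
    std::vector<std::pair<std::string, int>> items(counts.begin(), counts.end());

    // Sort all entries by descending count
    std::sort(items.begin(), items.end(),
              [](const std::pair<std::string, int>& a,
                 const std::pair<std::string, int>& b)
              { return a.second > b.second; });

    // Keep only the first n entries (or fewer if the map is smaller)
    if (items.size() > n)
        items.resize(n);
    return items;
}
```

This is simple, but it sorts and temporarily stores every distinct word even though only N of them are wanted.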

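Another way is to use a min-heap or max-heap of N items, depending on whether you want the top N or the bottom N, respectively.

Suppose you have working code that uses unordered_map to get the frequency counts. Here is a routine that uses the STL heap functions to get the top N items. It hasn't been fully tested, but it should demonstrate how to work with the heap.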

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

void print_top_n(const std::unordered_map<std::string, int>& theMap, size_t n)
{
    // Nothing to print, and the heap logic below assumes n > 0
    if (n == 0)
        return;

    // This is the heap
    std::vector<std::pair<std::string, int>> vHeap;

    // This lambda is the predicate to build and perform the heapify.
    // Comparing with > makes the std heap functions maintain a min-heap.
    auto heapfn =
        [](const std::pair<std::string, int>& p1, const std::pair<std::string, int>& p2) -> bool
    { return p1.second > p2.second; };

    // Go through each entry in the map
    for (auto& m : theMap)
    {
        if (vHeap.size() < n)
        {
            // Add item to the heap, since we haven't reached n items yet 
            vHeap.push_back(m);
            
            // if we have reached n items, now is the time to build the heap  
            if (vHeap.size() == n)
                // make the min-heap of the N elements   
                std::make_heap(vHeap.begin(), vHeap.end(), heapfn);
            continue;
        }
        else
        // Heap has been built.  Check if the next element is larger than the 
        // top of the heap
        if (vHeap.front().second <= m.second)
        {
            // adjust the heap 
            // remove the front of the heap by placing it at the end of the vector
            std::pop_heap(vHeap.begin(), vHeap.end(), heapfn);
            // get rid of that item now 
            vHeap.pop_back();
            // add the new item 
            vHeap.push_back(m);
            // heapify
            std::push_heap(vHeap.begin(), vHeap.end(), heapfn);
        }
    }

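    // If the map held fewer than n entries, the heap was never built, so build it now
    if (vHeap.size() < n)
        std::make_heap(vHeap.begin(), vHeap.end(), heapfn);

    // sort the heap; with this predicate the result is in descending order of count
    std::sort_heap(vHeap.begin(), vHeap.end(), heapfn);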

    // Output the results
    for (auto& v : vHeap)
        std::cout << v.first << " " << v.second << "\n";
}

int main()
{
    std::unordered_map<std::string, int> test = { {"abc", 10},
        { "123",5 },
        { "456",1 },
        { "xyz",15 },
        { "go",8 },
        { "text1",7 },
        { "text2",17 },
        { "text3",27 },
        { "text4",37 },
        { "text5",47 },
        { "text6",9 },
        { "text7",7 },
        { "text8", 22 },
        { "text9", 8 },
        { "text10", 2 } };
    print_top_n(test, 10);
}

Output:

text5 47
text4 37
text3 27
text8 22
text2 17
xyz 15
abc 10
text6 9
text9 8
go 8

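The advantages of using a heap are:

Each heap adjustment costs O(log(N)) for a heap of N items, so the whole pass stays well below the usual O(M*log(M)) that a sorting routine needs for all M map entries.
Note that we only need to re-heapify when we detect that the top item on the min-heap is going to be discarded.
We don't need to store an entire (multi)map of frequency count to string in addition to the original map of string to frequency count.
The heap stores only N elements, no matter how many items are in the original map.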
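
The same bounded min-heap idea can also be sketched with std::priority_queue, which wraps the heap functions used above. This is an alternative formulation, not the routine from this answer: the name top_n_by_queue is an assumption, and the pairs are stored as (count, word) so that the default pair comparison orders by count first.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: returns up to n (count, word) pairs, highest count first.
std::vector<std::pair<int, std::string>>
top_n_by_queue(const std::unordered_map<std::string, int>& counts, std::size_t n)
{
    using Entry = std::pair<int, std::string>;   // count first, so pairs compare by count

    // std::greater turns the default max-heap into a min-heap: top() is the smallest entry
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

    for (const auto& kv : counts)
    {
        heap.emplace(kv.second, kv.first);
        if (heap.size() > n)
            heap.pop();          // discard the current smallest, keeping at most n entries
    }

    // Popping yields ascending order, so fill the result from the back
    std::vector<Entry> result(heap.size());
    for (std::size_t i = result.size(); i-- > 0; )
    {
        result[i] = heap.top();
        heap.pop();
    }
    return result;
}
```

Unlike the routine above, this version pushes every entry before trimming, so it trades a little extra heap work for shorter code. The heap still never holds more than N + 1 entries.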

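【Comments】:

【Solution 2】:

I have used two containers for this kind of task: std::unordered_map<std::string, int> to store the word frequencies, and std::map<int, std::string> to keep track of the most used words.

While updating the first map with a new word, you also update the second map. To keep it tidy, if the second map's size grows beyond 10, remove the least used word.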

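Update

In response to the comments below, I did some benchmarking.

First, @PaulMcKenzie - you are right: to keep ties, I need std::map<int, std::set<std::string>> (which became obvious once I started implementing this).

Second, @dratenik - it turns out you are right too. Although constantly cleaning up the frequency map keeps it small, the overhead doesn't pay off. Besides, it is only needed if the client wants to see "running totals" (as I was asked to in my project). It makes no sense at all in post-processing, when all the words have already been loaded.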
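
For the "running totals" case only, the two-container bookkeeping described above might be sketched like this. The class name RunningTopWords and its members are assumptions for illustration, not the code that was benchmarked; ties at the cut-off are broken arbitrarily (alphabetically within the lowest bucket).

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <unordered_map>

// Hypothetical class: keeps a running list of the `limit` most frequent words.
class RunningTopWords
{
public:
    explicit RunningTopWords(std::size_t limit) : limit_(limit) {}

    void add(const std::string& word)
    {
        int& count = counts_[word];

        // If the word is currently tracked, remove it from its old frequency bucket
        auto old_bucket = top_.find(count);
        if (old_bucket != top_.end() && old_bucket->second.erase(word) > 0)
        {
            --tracked_;
            if (old_bucket->second.empty())
                top_.erase(old_bucket);
        }

        // Record the new count and track the word under it
        ++count;
        top_[count].insert(word);
        ++tracked_;

        // Trim from the lowest frequency until at most `limit` words remain
        while (tracked_ > limit_)
        {
            auto lowest = top_.begin();
            lowest->second.erase(lowest->second.begin());
            --tracked_;
            if (lowest->second.empty())
                top_.erase(lowest);
        }
    }

    // Frequency -> words with that frequency; iterate from rbegin() for highest first
    const std::map<int, std::set<std::string>>& top() const { return top_; }

private:
    std::size_t limit_;
    std::size_t tracked_ = 0;
    std::unordered_map<std::string, int> counts_;
    std::map<int, std::set<std::string>> top_;
};
```

Every call to add touches both containers, which is the overhead mentioned above; when all words are available up front, a single pass at the end avoids it.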

For testing I used alice29.txt (available online), preprocessed: I removed all punctuation and converted everything to upper case. Here is my code:

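#include <chrono>
#include <fstream>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <unordered_map>

int main()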
{
  auto t1 = std::chrono::high_resolution_clock::now();
  std::ifstream src("c:\\temp\\alice29-clean.txt");
  std::string str;
  std::unordered_map<std::string, int> words;
  std::map<int, std::set<std::string>> freq;
  int i(0);
  while (src >> str)
  {
    words[str]++;
    i++;
  }
  for (auto& w : words)
  {
    freq[w.second].insert(w.first);
  }
  int count(0);
  for (auto it = freq.rbegin(); it != freq.rend(); ++it)
  {
    for (auto& w : it->second)
    {
      std::cout << w << " - " << it->first << std::endl;
      ++count;
    }
    if (count >= 10)
      break;
  }
  auto t2 = std::chrono::high_resolution_clock::now();
  std::cout << std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count() << std::endl;
  return i;
}

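【Comments】:

It at least doubles the memory requirement; boost::multiindex would be better.
This won't work if frequency counts are the same.
I think you would need at least a std::multimap to handle that.
"While updating" + "remove the least frequent": this could keep moving elements in and out of the top 10 (with added allocations, deallocations, and so on). Doing a separate pass over the collected counts at the end would be less silly.
@Slava - it only doubles the memory if I keep all the words in the second map.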