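Strange algorithm performance

【Question title】: Strange algorithm performance
【Posted】: 2017-06-07 07:11:04
【Question description】:

For context, I wrote this algorithm to get the number of unique substrings of any string. It builds the suffix tree for the string, counts the nodes it contains, and returns that count as the answer. The problem I wanted to solve requires an O(n) algorithm, so this question is only about how this code behaves, not about how bad it is at what it does.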

struct node{
    char value = ' ';
    vector<node*> children;
    ~node()
    {
        for (node* child: children)
        {
            delete child;
        }
    }
};

int numberOfUniqueSubstrings(string aString, node*& root)
{
    root = new node();
    int substrings = 0;
    for (int i = 0; i < aString.size(); ++i)
    {
        string tmp = aString.substr(i, aString.size());
        node* currentNode = root;
        char indexToNext = 0;
        for (int j = 0; j < currentNode->children.size(); ++j)
        {
            if (currentNode->children[j]->value == tmp[indexToNext])
            {
                currentNode = currentNode->children[j];
                j = -1;
                indexToNext++;
            }
        }
        for (int j = indexToNext; j < tmp.size(); ++j)
        {
            node* theNewNode = new node;
            theNewNode->value = tmp[j];
            currentNode->children.push_back(theNewNode);
            currentNode = theNewNode;
            substrings++;
        }
    }
    return substrings;
}

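I decided to benchmark this algorithm: I simply loop over a big string, taking a bigger substring on each iteration and calling numberOfUniqueSubstrings while measuring how long it takes to finish.

I plotted it in Octave and this is what I got (x is the string size and y is the time in microseconds):

I first thought the problem lay in the input string, but it is just an alphanumeric string I took from a book (any other text behaves just as strangely).

I also tried averaging many calls to the function with the same parameter, and the result is pretty much the same.

This was compiled with g++ problem.cpp -std=c++14 -O3, but it seems to do the same at -O2 and -O0.

Edit: After @interjay's answer I tried doing just that, which leaves the function as: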

int numberOfUniqueSubstrings(string aString, node*& root)
{
    root = new node();
    int substrings = 0;
    for (int i = 0; i < aString.size(); ++i)
    {
        node* currentNode = root;
        char indexToNext = i;
        for (int j = 0; j < currentNode->children.size(); ++j)
        {
            if (currentNode->children[j]->value == aString[indexToNext])
            {
                currentNode = currentNode->children[j];
                j = -1;
                indexToNext++;
            }
        }
        for (int j = indexToNext; j < aString.size(); ++j)
        {
            node* theNewNode = new node;
            theNewNode->value = aString[j];
            currentNode->children.push_back(theNewNode);
            currentNode = theNewNode;
            substrings++;
        }
    }
    return substrings;
}

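It does make it a little faster. But, just as strangely, I plotted this:

Something happens at x = 1000 and I have no idea what it could be.

Another plot for good measure:

I have now run gprof for a string of size 999: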

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  us/call  us/call  name    
100.15      0.02     0.02      974    20.56    20.56  node::~node()
  0.00      0.02     0.00   498688     0.00     0.00  void std::vector<node*, std::allocator<node*> >::_M_emplace_back_aux<node* const&>(node* const&)
  0.00      0.02     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z7imprimePK4node
  0.00      0.02     0.00        1     0.00     0.00  numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&)
            Call graph


granularity: each sample hit covers 2 byte(s) for 49.93% of 0.02 seconds

index % time    self  children    called     name
                               54285             node::~node() [1]
                0.02    0.00     974/974         test(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) [2]
[1]    100.0    0.02    0.00     974+54285   node::~node() [1]
                               54285             node::~node() [1]
-----------------------------------------------
                                                 <spontaneous>
[2]    100.0    0.00    0.02                 test(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) [2]
                0.02    0.00     974/974         node::~node() [1]
                0.00    0.00       1/1           numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&) [12]
-----------------------------------------------
                0.00    0.00  498688/498688      numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&) [12]
[10]     0.0    0.00    0.00  498688         void std::vector<node*, std::allocator<node*> >::_M_emplace_back_aux<node* const&>(node* const&) [10]
-----------------------------------------------
                0.00    0.00       1/1           __libc_csu_init [21]
[11]     0.0    0.00    0.00       1         _GLOBAL__sub_I__Z7imprimePK4node [11]
-----------------------------------------------
                0.00    0.00       1/1           test(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) [2]
[12]     0.0    0.00    0.00       1         numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&) [12]
                0.00    0.00  498688/498688      void std::vector<node*, std::allocator<node*> >::_M_emplace_back_aux<node* const&>(node* const&) [10]
-----------------------------------------------

And for a string of size 1001:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  us/call  us/call  name    
100.15      0.02     0.02      974    20.56    20.56  node::~node()
  0.00      0.02     0.00   498688     0.00     0.00  void std::vector<node*, std::allocator<node*> >::_M_emplace_back_aux<node* const&>(node* const&)
  0.00      0.02     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z7imprimePK4node
  0.00      0.02     0.00        1     0.00     0.00  numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&)


            Call graph


granularity: each sample hit covers 2 byte(s) for 49.93% of 0.02 seconds

index % time    self  children    called     name
                               54285             node::~node() [1]
                0.02    0.00     974/974         test(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) [2]
[1]    100.0    0.02    0.00     974+54285   node::~node() [1]
                               54285             node::~node() [1]
-----------------------------------------------
                                                 <spontaneous>
[2]    100.0    0.00    0.02                 test(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) [2]
                0.02    0.00     974/974         node::~node() [1]
                0.00    0.00       1/1           numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&) [12]
-----------------------------------------------
                0.00    0.00  498688/498688      numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&) [12]
[10]     0.0    0.00    0.00  498688         void std::vector<node*, std::allocator<node*> >::_M_emplace_back_aux<node* const&>(node* const&) [10]
-----------------------------------------------
                0.00    0.00       1/1           __libc_csu_init [21]
[11]     0.0    0.00    0.00       1         _GLOBAL__sub_I__Z7imprimePK4node [11]
-----------------------------------------------
                0.00    0.00       1/1           test(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) [2]
[12]     0.0    0.00    0.00       1         numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&) [12]
                0.00    0.00  498688/498688      void std::vector<node*, std::allocator<node*> >::_M_emplace_back_aux<node* const&>(node* const&) [10]
-----------------------------------------------


Index by function name

  [11] _GLOBAL__sub_I__Z7imprimePK4node [1] node::~node()
  [12] numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&) [10] void std::vector<node*, std::allocator<node*> >::_M_emplace_back_aux<node* const&>(node* const&)

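However, running the profiler seems to remove the effect, and the times are pretty much the same in both cases.

【Comments】:

Your graph has a large dispersion at the beginning and a very small dispersion after 1000. It looks like some standard class (vector or string) uses a different algorithm if it contains more than 1000 elements.
Is it really 1000, and not 1024?
What OS and standard library are you using? Does the heap's behavior change in some way after 1k allocations?
"Something happens at x = 1000" -- I may be seeing things, but doesn't something also happen at x = 256? These look to me like magic-numbered buffers inside the library.
You have a typo: char indexToNext = i; should be int indexToNext = i;. Once indexToNext reaches 128, the function starts accessing negative positions in aString.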
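
The narrowing described in the last comment can be illustrated with a small sketch. It assumes a platform where plain char behaves like a signed 8-bit type (typical for g++ on x86, but not guaranteed by the standard); indexAfterNarrowing is a hypothetical helper introduced only for illustration and is not part of the posted code.

```cpp
#include <cassert>

// Hypothetical helper (not from the original post): mimics what happens when
// a loop index is stored in a signed 8-bit type, as `char indexToNext = i;`
// does on platforms where plain char is signed.
int indexAfterNarrowing(int i) {
    // Assumption: two's-complement wrap-around, which mainstream compilers
    // implement (and which C++20 guarantees).
    signed char narrowed = static_cast<signed char>(i);
    return narrowed;  // promoted back to int when used as a subscript
}
```

Under that assumption, an index of 127 survives the round trip, while 128 comes back as a negative value, so a subscript such as aString[indexToNext] would read before the start of the string.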

Tags: c++ string algorithm performance suffix-tree

【Answer 1】:

for (int i = 0; i < aString.size(); ++i)
{
    string tmp = aString.substr(i, aString.size());

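This already makes your algorithm O(n^2) or worse. Each call to substr creates a substring of size n/2 on average, so it takes O(n), and you call it n times.

It seems you don't actually need the tmp string, since you only read from it. Instead, read from the original string and adjust your indexing accordingly.

The for (int j = indexToNext; j < tmp.size(); ++j) loop will probably also give your algorithm O(n^2) total time (I say "probably" because it depends on the computed value of indexToNext, but from testing with random strings it seems to hold true). It runs O(n) times, and each run takes up to O(n) iterations.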
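
The quadratic cost of the substr calls can be made concrete by counting the characters they copy. The following is a minimal sketch, not a benchmark; charactersCopiedBySubstrLoop is a hypothetical helper name, and it measures only the number of characters copied, not allocation cost or wall-clock time.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original post): total number of
// characters copied when s.substr(i, s.size()) is called once for every
// start index i, as the outer loop of the question's first version does.
std::size_t charactersCopiedBySubstrLoop(const std::string& s) {
    std::size_t total = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // substr clamps the count, so this copies s.size() - i characters.
        total += s.substr(i, s.size()).size();
    }
    return total;
}
```

For a string of length n the total is n(n+1)/2, so 1000 characters give 500500 copied characters and doubling the length roughly quadruples the work.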

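【Discussion】:

You should add a section about std::string_view, which would allow him to manipulate substrings without copying.
I edited the question to add information about your recommendation. It does improve the code, but in my opinion it reveals an even stranger performance... situation.
I disagree with this analysis. It is more appropriate to treat the call to substr as taking O(1) time, since the OP is working with small (~1000 byte) strings and any read from memory will load nearby elements. In addition, testing your suggestion by passing a const string reference along with a length to the subroutine results in a performance decrease.
@Richard 1000 bytes is definitely large enough to cause a problem here. substr is O(k), where k is the size of the result. You can clearly see that the original graph in the OP's post is parabolic, while the graph in the edit, made after he fixed the problem I pointed out, looks linear. I don't know what you mean by "passing a const string reference along with a length to the subroutine", since I didn't suggest doing that. By the way, saying something is O(1) because it was tested on small sizes suggests you don't understand big-O notation (it is about asymptotic performance).
I'm not so convinced by the OP's graphs: before they applied the fix you suggested, they showed the 0-1400 x range. Afterwards, they show a much larger x range, which looks more linear. On all of the graphs, the 0-1000 range could pass for a parabola.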

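【Answer 2】:

I would suspect malloc more than string or vector. It is entirely possible that it treats 1000 bytes differently. Coalescing freed blocks can be expensive. It may avoid trying to coalesce larger blocks (i.e. by maintaining them in pools). But this is really just a guess. Why don't you try a profiler and get some real data? gprof is easy to use.

This article has some interesting details on glibc malloc. If that is what sits underneath your program, the differences between the bin types described there might be at work. In fact, blocks are freed to an "unsorted bin" that is occasionally reorganized. The spikes might be these reorganizations, done to keep the heap from growing. If this theory is correct, the smoothing might be the result of the heap arena growing to a size where reorganization is no longer costly.

But again, this is all conjecture that can be resolved by running a profiler to see where the time goes around 1000.

【Discussion】:

I just added profiler results for 999 and 1001. This is the first time I've run gprof, so it may not be the expected result. I compiled with the -pg flag and then ran the analysis with gprof -b a.out gmon.out > analysis.txt (following a tutorial).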

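【Answer 3】:

Most people's working hypothesis seems to be that some kind of magic number is hard-coded into the libraries, resulting in a phase transition in performance around 999-1000 (except for LSerni, who made the prescient observation that there may be multiple magic numbers).

I'll try to systematically explore this and a few other hypotheses below (source code is at the end of this answer).

I then ran my code to see whether I could reproduce your results on my Intel(R) Core(TM) i5 CPU M480, Linux 4.8.0-34-generic machine, using G++ 6.2.0-5ubuntu2 as my compiler with -O3 optimizations.

Sure enough, there's a magic drop between 999 and 1000 (and another one near 1600):

Note that my trans-1000 dataset is not as clean as yours: this may be because I was playing with a few other things in the background on my machine, whereas you had a quieter testing environment.

My next question was: is this magic 1000 number stable between environments?

So I tried running the code on an Intel(R) Xeon(R) CPU E5-2680 v3, Linux 2.6.32-642.6.1.el6.x86_64 machine using G++ 4.9.2. And, not surprisingly, the magic number was different, occurring at 975-976:

This tells us that, if there is a magic number, it changes between versions. That decreases my confidence in the magic number theory for a few reasons. (a) It changes. (b) 1000+24 bytes of overhead is a good candidate for magic; 975+49 bytes is less so. (c) The first environment has better software on a slower processor, yet it shows what I would consider worse performance: waiting until 1000 to speed things up. This looks like a regression.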

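I tried a different test: running the program with different random input data. That gives this result:

The salient point in the above graph is that the 999-1000 drop is not so special. It looks like many of the drops before it: a slow decrease in speed followed by a sharp improvement. It's also worth noting that many of the previous drops do not align.

This suggested to me that the behavior is input-dependent and that there is correlation between runs. Therefore, I wondered what would happen if I reduced the correlation between runs by randomizing their order. This gave:

Something is still happening around 999-1000:

Let's zoom in even more:

Running this on the faster computer with the older software gives a similar result:

Zoomed:

Since randomizing the order in which strings of different lengths are considered essentially eliminated the slow build-up between runs (the aforementioned correlation), this suggests that the phenomenon you're seeing requires some sort of global state. Therefore, C++ string/vector cannot be the explanation, which leaves malloc, "the OS", or architectural constraints.

Note that when the order of lengths is randomized, the code runs slower rather than faster. To my mind this is consistent with some sort of cache size being exceeded, but the noise in the signal, coupled with the very first plot in this post, also suggests possible memory fragmentation. Therefore, I decided to restart the program before each run to ensure a fresh heap. That resulted in the following:

And now we see that there are no more breaks or jumps. This suggests that cache size was not the issue, but rather that the observed behavior has something to do with the program's overall memory usage.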

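Another argument against a caching effect is as follows. Both machines have 32kB and 256kB L1 and L2 caches, so their cache performance should be similar. My slow machine has a 3,072kB L3 cache. If you assume a 4kB page per allocation, 1000 nodes gives 4,000kB allocated, which is close to the cache size. However, the fast machine has a 30,720kB L3 cache and shows a break at 975. If the phenomenon were a caching effect, you would expect the break, if anything, to come later. Therefore, I'm pretty sure caching is not at work here.

The only remaining culprit is malloc.

Why is this happening? I don't know. But, as a programmer, I don't care, for the following reasons.

There's probably an explanation for this, but it sits at a level too deep to change or really worry about. I could do something exotic to fix it, but that would require thinking about what's going on somewhere in its dark underbelly. We use higher-level languages like C++ specifically to avoid messing with those kinds of details unless we really have to.

And my results say we don't have to in this case. (a) The last graph tells us that any independent run of the code is likely to exhibit near-optimal behavior, (b) running in randomized order can improve performance, and (c) the loss in efficiency is on the order of a hundredth of a second, which is entirely acceptable unless you're processing massive amounts of data.

Source code follows. Note that the code changes your version's char indexToNext to int indexToNext, fixing possible integer overflow issues. Testing interjay's suggestion to avoid copying the string actually resulted in a performance decrease.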

#include <string>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <vector>
#include <time.h>
#include <algorithm>

struct profiler
{
  std::string name;
  std::chrono::high_resolution_clock::time_point p;
  profiler(std::string const &n) :
      name(n), p(std::chrono::high_resolution_clock::now()) { }
  ~profiler()
  {
      using dura = std::chrono::duration<double>;
      auto d = std::chrono::high_resolution_clock::now() - p;
      std::cout //<< name << ": "
          << std::chrono::duration_cast<dura>(d).count()
          << std::endl;
  }
};

#define PROFILE_BLOCK(pbn) profiler _pfinstance(pbn)

struct node {
  char value = ' ';
  std::vector<node*> children;
  ~node(){
    for (node* child: children)
      delete child;
  }
};

int numberOfUniqueSubstrings(const std::string aString, node*& root)
{
    root = new node();
    int substrings = 0;
    for (int i = 0; i < aString.size(); ++i)
    {
        node* currentNode = root;
        int indexToNext = i;
        for (int j = 0; j < currentNode->children.size(); ++j)
        {
            if (currentNode->children[j]->value == aString[indexToNext])
            {
                currentNode = currentNode->children[j];
                j = -1;
                indexToNext++;
            }
        }
        for (int j = indexToNext; j < aString.size(); ++j)
        {
            node* theNewNode  = new node;
            theNewNode->value = aString[j];
            currentNode->children.push_back(theNewNode);
            currentNode = theNewNode;
            substrings++;
        }
    }
    return substrings;
}


int main(int argc, char **argv){
  const int MAX_LEN = 1300;

  if(argc==1){
    std::cerr<<"Syntax: "<<argv[0]<<" <SEED> [LENGTH]"<<std::endl;
    std::cerr<<"Seed of -1 implies all lengths should be explored and input randomized from time."<<std::endl;
    std::cerr<<"Positive seed sets the seed and explores a single input of LENGTH"<<std::endl;
    return -1;
  }

  int seed = std::stoi(argv[1]);

  if(seed==-1)
    srand(time(NULL));
  else
    srand(seed);

  //Generate a random string of the appropriate length
  std::string a;
  for(int fill=0;fill<MAX_LEN;fill++)
      a.push_back('a'+rand()%26);

  //Generate a list of lengths of strings to experiment with
  std::vector<int> lengths_to_try;
  if(seed==-1){
    for(int i=1;i<MAX_LEN;i++)
      lengths_to_try.push_back(i);
  } else {  
    lengths_to_try.push_back(std::stoi(argv[2]));
  }

  //Enable this line to randomly sort the strings
  std::random_shuffle(lengths_to_try.begin(),lengths_to_try.end());

  for(auto len: lengths_to_try){
    std::string test(a.begin(),a.begin()+len);

    std::cout<<len<<" ";
    {
      PROFILE_BLOCK("Some time");
      node *n;
      int c = numberOfUniqueSubstrings(test,n);
      delete n;
    }
  }
}

substr is a "constant"

OP's original code included the following:

for (int i = 0; i < aString.size(); ++i)
{
  string tmp = aString.substr(i, aString.size());

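The substr operation here takes O(n) time in the length of the string. In an answer below, it is argued that this O(n) operation is what causes the poor performance of OP's original code.

I disagree with this assessment. Due to caching and SIMD operations, CPUs can read and copy data in blocks of up to 64 bytes (or more!). Because of this, the cost of memory allocation can dominate the cost of copying the string. Thus, for OP's input sizes, the substr operation acts more like an expensive constant than like an additional loop.

This can be demonstrated by testing: compile the code with, e.g., g++ temp.cpp -O3 --std=c++14 -g and profile with, e.g., sudo operf ./a.out -1. The resulting time-use profile looks like this: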

25.24%  a.out    a.out                [.] _ZN4nodeD2Ev        #Node destruction                                                                           
24.77%  a.out    libc-2.24.so         [.] _int_malloc                                                                                    
13.93%  a.out    libc-2.24.so         [.] malloc_consolidate                                                                            
11.06%  a.out    libc-2.24.so         [.] _int_free                                                                                      
 7.39%  a.out    libc-2.24.so         [.] malloc                                                                                        
 5.62%  a.out    libc-2.24.so         [.] free                                                                                          
 3.92%  a.out    a.out                [.] _ZNSt6vectorIP4nodeSaIS1_EE19_M_emplace_back_auxIJRKS1_EEEvDpOT_                              
 2.68%  a.out    a.out                [.]
 8.07%  OTHER STUFF

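From this it is evident that memory management dominates the run time.

【Discussion】:

I think everything you've written (a nice analysis, by the way) is consistent with my theory that this is an artifact of malloc's free list maintenance. This could be further supported by avoiding malloc altogether in the timings: pre-allocate a node pool with the maximum possible size of character and child arrays rather than allocating dynamically.
Thanks, @Gene: I think your answer gets at the heart of the matter (and I've upvoted accordingly). I'd be worried about using results from a run without malloc, though, since that would simultaneously affect cache behaviour. I'm not familiar enough with profiling for that kind of effect, but it seems like the best empirical route.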
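
The pooling idea raised in this discussion could look roughly like the sketch below. It is an illustration under stated assumptions, not the code anyone in the thread actually ran: PoolNode and countUniqueSubstringsPooled are hypothetical names, children are stored as first-child/next-sibling indices rather than the per-node arrays suggested above, and the whole pool is reserved up front so that the counting loop performs no further allocations.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical pooled trie node: children are chained through indices into
// one shared vector instead of living in separately allocated vectors.
struct PoolNode {
    char value = ' ';
    std::int32_t firstChild = -1;   // index of the first child, -1 if none
    std::int32_t nextSibling = -1;  // index of the next sibling, -1 if none
};

// Counts distinct non-empty substrings by building a suffix trie in a pool.
// Still O(n^2) time and memory in the worst case, like the posted code;
// the 32-bit indices limit it to inputs of moderate length.
std::size_t countUniqueSubstringsPooled(const std::string& s) {
    std::vector<PoolNode> pool;
    // Worst case: one node per (start, length) pair, plus the root.
    pool.reserve(s.size() * (s.size() + 1) / 2 + 1);
    pool.push_back(PoolNode{});  // root lives at index 0

    for (std::size_t i = 0; i < s.size(); ++i) {
        std::int32_t current = 0;
        std::size_t pos = i;
        // Walk down while the existing trie matches the suffix starting at i.
        while (pos < s.size()) {
            std::int32_t child = pool[current].firstChild;
            while (child != -1 && pool[child].value != s[pos])
                child = pool[child].nextSibling;
            if (child == -1) break;  // no matching child: stop walking
            current = child;
            ++pos;
        }
        // Append the unmatched remainder as a chain of new nodes.
        for (; pos < s.size(); ++pos) {
            PoolNode n;
            n.value = s[pos];
            n.nextSibling = pool[current].firstChild;  // push to front of list
            pool.push_back(n);
            std::int32_t idx = static_cast<std::int32_t>(pool.size() - 1);
            pool[current].firstChild = idx;
            current = idx;
        }
    }
    return pool.size() - 1;  // every non-root node is one distinct substring
}
```

For example, countUniqueSubstringsPooled("abab") yields 7 (a, b, ab, ba, aba, bab, abab). As the reply above cautions, timing such a version would change cache behaviour as well as allocator behaviour, so any comparison with the malloc-based code should be read with that in mind.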