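Why is DFS slower in one tree and faster in the other?

【Question title】: Why is DFS slower in one tree and faster in the other?
【Posted】: 2016-09-12 13:10:14
【Question】:

UPDATE: It turned out that there was a bug in the parser that generated the trees. More in the final edit.

Let T be a binary tree in which every internal node has exactly two children. For this tree, we want to write a function that, for every node v in T, finds the number of nodes in the subtree defined by v.

Example

Input

Desired output

In red are the numbers we want to compute. The nodes of the tree are stored in an array, call it TreeArray, following the preorder layout.

For the example above, TreeArray contains the following objects:

10, 11, 0, 12, 13, 2, 7, 3, 14, 1, 15, 16, 4, 8, 17, 18, 5, 9, 6

A node of the tree is described by the following struct: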

struct tree_node{

    long long int id; //id of the node, randomly generated
    int numChildren; //number of children, it is 2 but for the leafs it's 0
    int size; //size of the subtree rooted at the current node,
    // what we want to compute

    int pos; //position in TreeArray where the node is stored
    int lpos; //position of the left child
    int rpos; //position of the right child

    tree_node(){
        id = -1;
        size = 1;
        pos = lpos = rpos = -1;
        numChildren = 0;
    }

};

The function that computes all the size values is the following:

void testCache(int cur){

    if(treeArray[cur].numChildren == 0){
        treeArray[cur].size = 1;
        return;
    }

    testCache(treeArray[cur].lpos);
    testCache(treeArray[cur].rpos);

    treeArray[cur].size = treeArray[treeArray[cur].lpos].size + 
    treeArray[treeArray[cur].rpos].size + 1;

}
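
As a small worked check of the layout and of testCache, the sketch below rebuilds the 19-node example from its preorder id list and computes every subtree size. It assumes that ids below 10 are leaves (the figure is not reproduced here; the assumption follows from the parser numbering the leaves first and the internal nodes afterwards). The names PreNode, linkChildren, subtreeSize and exampleSizes are illustrative and not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative node type: only the fields needed for the size computation.
struct PreNode {
    bool leaf;
    int lpos;
    int rpos;
    int size;
    explicit PreNode(bool isLeaf) : leaf(isLeaf), lpos(-1), rpos(-1), size(1) {}
};

// Fills lpos/rpos for the subtree stored at index cur of a preorder array
// and returns the index just past that subtree.
int linkChildren(std::vector<PreNode>& a, int cur) {
    assert(cur >= 0 && static_cast<std::size_t>(cur) < a.size());
    if (a[cur].leaf) return cur + 1;
    a[cur].lpos = cur + 1;                   // left child directly follows its parent
    a[cur].rpos = linkChildren(a, cur + 1);  // right child follows the whole left subtree
    return linkChildren(a, a[cur].rpos);
}

// Same recursion as testCache, on the illustrative node type.
int subtreeSize(std::vector<PreNode>& a, int cur) {
    if (a[cur].leaf) return a[cur].size = 1;
    a[cur].size = subtreeSize(a, a[cur].lpos) + subtreeSize(a, a[cur].rpos) + 1;
    return a[cur].size;
}

// Builds the example from the question and returns the sizes in array order.
std::vector<int> exampleSizes() {
    const int ids[] = {10, 11, 0, 12, 13, 2, 7, 3, 14, 1,
                       15, 16, 4, 8, 17, 18, 5, 9, 6};
    std::vector<PreNode> a;
    for (int id : ids) a.push_back(PreNode(id < 10));  // assumption: ids 0..9 are leaves
    const int end = linkChildren(a, 0);
    assert(static_cast<std::size_t>(end) == a.size()); // the whole array is one tree
    (void)end;
    subtreeSize(a, 0);
    std::vector<int> sizes;
    for (const PreNode& n : a) sizes.push_back(n.size);
    return sizes;
}
```

Under that assumption the root (id 10) gets size 19, and its two children (ids 11 and 14) get sizes 7 and 11.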

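I would like to understand why this function is faster when T looks like this (almost a left-going chain):

and slower when T looks like this (almost a right-going chain):

The following experiments were run on an Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz with 8 GB of RAM, 256 KB of L1 cache, 1 MB of L2 cache and 6 MB of L3 cache.

Each dot in the graphs is the result of the following for loop (the parameters are defined by the axes):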

for (int i = 0; i < 100; i++) {
        testCache(0);
}

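n is the total number of nodes and time is measured in seconds. As n grows, the function is clearly much faster when the tree looks like a left-going chain, even though the number of nodes is exactly the same in both cases.

Now let us try to find where the bottleneck is. I used the PAPI library to read the relevant hardware counters.

The first counter is the number of instructions: how many instructions do we actually execute, and does that differ when the trees look different?

The difference is not significant. For larger inputs the left-going chain appears to need fewer instructions, but the difference is so small that I think it is safe to assume both need the same number of instructions.

Since the tree is stored in a neat preorder layout inside treeArray, it makes sense to look at what happens in the cache. Unfortunately, my computer does not provide any counters for the L1 cache, but it does for L2 and L3.

Let's look at the accesses to the L2 cache. An access to L2 happens on a miss in L1, so this is also an indirect counter for L1 misses.

The right-going tree incurs fewer L1 misses, so it seems to use the cache efficiently.

The same holds for L2 misses: the right-going tree seems to be more efficient. There is still nothing to indicate why the right-going tree is so slow. Let's look at L3.

In L3, things explode for the right-going tree. So the problem seems to be in the L3 cache. Unfortunately, I could not explain the reason behind this behaviour. Why does the L3 cache get messed up for the right-going tree?

The entire code, together with the experiment, follows: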

#include <cstdlib>  //atoi
#include <ctime>    //clock_gettime, timespec
#include <fstream>
#include <iostream>
#include <string>
#define BILLION  1000000000LL

using namespace std;


/*
 *
 * Timing functions
 *
 */

timespec startT, endT;

void startTimer(){
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &startT);
}

double endTimer(){
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &endT);
    return endT.tv_sec * BILLION + endT.tv_nsec - (startT.tv_sec * BILLION + startT.tv_nsec);
}

/*
 *
 * tree node
 *
 */

//this struct is used for creating the first tree after reading it from the external file; for this we need left and right child pointers

struct tree_node_temp{

    long long int id; //id of the node, randomly generated
    int numChildren; //number of children, it is 2 but for the leafs it's 0
    int size; //size of the subtree rooted at the current node
    tree_node_temp *leftChild;
    tree_node_temp *rightChild;

    tree_node_temp(){
        id = -1;
        size = 1;
        leftChild = nullptr;
        rightChild = nullptr;
        numChildren = 0;
    }

};

struct tree_node{

    long long int id; //id of the node, randomly generated
    int numChildren; //number of children, it is 2 but for the leafs it's 0
    int size; //size of the subtree rooted at the current node

    int pos; //position in TreeArray where the node is stored
    int lpos; //position of the left child
    int rpos; //position of the right child

    tree_node(){
        id = -1;
        size = 1; //a single node is a subtree of size 1 (matches the struct shown earlier)
        pos = lpos = rpos = -1;
        numChildren = 0;
    }

};

/*
 *
 * Tree parser. The input is a file containing the tree in the newick format.
 *
 */

string treeNewickStr; //string storing the newick format of a tree that we read from a file
int treeCurSTRindex; //index to the current position we are in while reading the newick string
int treeNumLeafs; //number of leafs in current tree
tree_node ** treeArrayReferences; //stack of references to free node objects
tree_node *treeArray; //array of node objects
int treeStackReferencesTop; //the top index to the references stack
int curpos; //used to find pos,lpos and rpos when creating the pre order layout tree


//helper function for readNewick
tree_node_temp* readNewickHelper() {

    int i;
    if(treeCurSTRindex == treeNewickStr.size())
        return nullptr;

    tree_node_temp * leftChild = nullptr;  //initialized so they never hold an indeterminate value
    tree_node_temp * rightChild = nullptr;

    if(treeNewickStr[treeCurSTRindex] == '('){
        //create a left child
        treeCurSTRindex++;
        leftChild = readNewickHelper();
    }

    if(treeNewickStr[treeCurSTRindex] == ','){
        //create a right child
        treeCurSTRindex++;
        rightChild = readNewickHelper();
    }

    if(treeNewickStr[treeCurSTRindex] == ')' || treeNewickStr[treeCurSTRindex] == ';'){
        treeCurSTRindex++;
        tree_node_temp * cur = new tree_node_temp();
        cur->numChildren = 2;
        cur->leftChild = leftChild;
        cur->rightChild = rightChild;
        cur->size = 1 + leftChild->size + rightChild->size;
        return cur;
    }

    //we are about to read a label, keep reading until we read a "," ")" or "(" (we assume that the newick string has the right format)
    i = 0;
    char treeLabel[20]; //buffer used for the label
    while(treeNewickStr[treeCurSTRindex]!=',' && treeNewickStr[treeCurSTRindex]!='(' && treeNewickStr[treeCurSTRindex]!=')'){
        treeLabel[i] = treeNewickStr[treeCurSTRindex];
        treeCurSTRindex++;
        i++;
    }

    treeLabel[i] = '\0';
    tree_node_temp * cur = new tree_node_temp();
    cur->numChildren = 0;
    cur->id = atoi(treeLabel)-1;
    treeNumLeafs++;

    return cur;
}

//create the pre order tree, curRoot in the first call points to the root of the first tree that was given to us by the parser
void treeInit(tree_node_temp * curRoot){

    tree_node * curFinalRoot = treeArrayReferences[curpos];

    curFinalRoot->pos = curpos;

    if(curRoot->numChildren == 0) {
        curFinalRoot->id = curRoot->id;
        return;
    }

    //add left child
    tree_node * cnode = treeArrayReferences[treeStackReferencesTop];
    curFinalRoot->lpos = curpos + 1;
    curpos = curpos + 1;
    treeStackReferencesTop++;
    cnode->id = curRoot->leftChild->id;
    treeInit(curRoot->leftChild);

    //add right child
    curFinalRoot->rpos = curpos + 1;
    curpos = curpos + 1;
    cnode = treeArrayReferences[treeStackReferencesTop];
    treeStackReferencesTop++;
    cnode->id = curRoot->rightChild->id;
    treeInit(curRoot->rightChild);

    curFinalRoot->id = curRoot->id;
    curFinalRoot->numChildren = 2;
    curFinalRoot->size = curRoot->size;

}

//the ids of the leafs are determined by the newick file, for the internal nodes we just incrementally give the id determined by the dfs traversal
void updateInternalNodeIDs(int cur){

    tree_node* curNode = treeArrayReferences[cur];

    if(curNode->numChildren == 0){
        return;
    }
    curNode->id = treeNumLeafs++;
    updateInternalNodeIDs(curNode->lpos);
    updateInternalNodeIDs(curNode->rpos);

}

//frees the memory of the first tree generated by the parser
void treeFreeMemory(tree_node_temp* cur){

    if(cur->numChildren == 0){
        delete cur;
        return;
    }
    treeFreeMemory(cur->leftChild);
    treeFreeMemory(cur->rightChild);

    delete cur;

}

//reads the tree stored in "file" under the newick format and creates it in the main memory. The output (what the function returns) is a pointer to the root of the tree.
//this tree is scattered anywhere in the memory.

tree_node* readNewick(string& file){

    treeCurSTRindex = -1;
    treeNewickStr = "";
    treeNumLeafs = 0;

    ifstream treeFin;

    treeFin.open(file, ios_base::in);
    //read the newick format of the tree and store it in a string
    treeFin>>treeNewickStr;
    //initialize index for reading the string
    treeCurSTRindex = 0;
    //create the tree in main memory
    tree_node_temp* root = readNewickHelper();

    //store the tree in an array following the pre order layout
    treeArray = new tree_node[root->size];
    treeArrayReferences = new tree_node*[root->size];
    int i;
    for(i=0;i<root->size;i++)
        treeArrayReferences[i] = &treeArray[i];
    treeStackReferencesTop = 0;

    tree_node* finalRoot = treeArrayReferences[treeStackReferencesTop];
    curpos = treeStackReferencesTop;
    treeStackReferencesTop++;
    finalRoot->id = root->id;
    treeInit(root);

    //update the internal node ids (the leaf ids are defined by the ids stored in the newick string)
    updateInternalNodeIDs(0);
    //close the file
    treeFin.close();

    //free the memory of initial tree
    treeFreeMemory(root);
    //return the pre order tree
    return finalRoot;

}

/*
 *
 *
 * DOT FORMAT OUTPUT --- BEGIN
 *
 *
 */

void treeBstPrintDotAux(tree_node* node, ofstream& treeFout) {

    if(node->numChildren == 0) return;

    treeFout<<"    "<<node->id<<" -> "<<treeArrayReferences[node->lpos]->id<<";\n";
    treeBstPrintDotAux(treeArrayReferences[node->lpos], treeFout);

    treeFout<<"    "<<node->id<<" -> "<<treeArrayReferences[node->rpos]->id<<";\n";
    treeBstPrintDotAux(treeArrayReferences[node->rpos], treeFout);

}

void treePrintDotHelper(tree_node* cur, ofstream& treeFout){
    treeFout<<"digraph BST {\n";
    treeFout<<"    node [fontname=\"Arial\"];\n";

    if(cur == nullptr){
        treeFout<<"\n";
    }
    else if(cur->numChildren == 0){
        treeFout<<"    "<<cur->id<<";\n";
    }
    else{
        treeBstPrintDotAux(cur, treeFout);
    }

    treeFout<<"}\n";
}

void treePrintDot(string& file, tree_node* root){

    ofstream treeFout;
    treeFout.open(file, ios_base::out);
    treePrintDotHelper(root, treeFout);
    treeFout.close();

}

/*
 *
 *
 * DOT FORMAT OUTPUT --- END
 *
 *
 */

/*
 * experiments
 *
 */

tree_node* T;
int n;

void testCache(int cur){

    if(treeArray[cur].numChildren == 0){
        treeArray[cur].size = 1;
        return;
    }

    testCache(treeArray[cur].lpos);
    testCache(treeArray[cur].rpos);

    treeArray[cur].size = treeArray[treeArray[cur].lpos].size + treeArray[treeArray[cur].rpos].size + 1;

}


int main(int argc, char* argv[]){

    string Tnewick = argv[1];
    T = readNewick(Tnewick);

    n = T->size;
    double tt;

    startTimer();
    for (int i = 0; i < 100; i++) {
        testCache(0);
    }

    tt = endTimer();
    cout << tt / BILLION << '\t' << T->size;
    cout<<endl;

    return 0;
}

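Compile with g++ -O3 -std=c++11 file.cpp and run with ./executable tree.txt. tree.txt stores the tree in Newick format.

Here is a left-going tree with 10^5 leaves

Here is a right-going tree with 10^5 leaves

The running times I get: ~0.07 seconds for the left-going tree and ~0.12 seconds for the right-going tree.

I apologize for the long post but, given how narrow the problem seems to be, I could not find a better way to describe it.

Thank you in advance!

EDIT:

This is a follow-up edit after MrSmith42's answer. I understand that locality plays a very important role, but I am not sure I understand why it matters here.

For the two example trees above, let us see how we access memory over time.

For the left-going tree:

For the right-going tree:

To me it looks as if we have local access patterns in both cases.

EDIT:

Here is a plot of the number of conditional branches:

Here is a plot of the number of branch mispredictions:

Here is a left-going tree with 10^6 leaves

Here is a right-going tree with 10^6 leaves

FINAL EDIT:

I apologize for wasting everyone's time. The parser I used takes a parameter that controls how "left" or "right" the tree should look. It is a floating-point number that has to be close to 0 to make the tree go left and close to 1 to make it go right. However, to make the tree look like a chain, it has to be very close to 0 or to 1, such as 0.000000001 or 0.999999999. For small inputs the tree looks like a chain even for values like 0.0001. I thought that this number was small enough to give a chain for larger trees as well, but as I will show, that is not the case. If you use numbers like 0.000000001, the parser stops working because of floating-point issues.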
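
The source of that parser is not shown, so the exact failure is unknown. One plausible mechanism, assuming the parameter is stored as a 32-bit float, is that values extremely close to 1 round to exactly 1. The sketch below only demonstrates that rounding; collapsesToOne and collapsesToZero are hypothetical helper names.

```cpp
#include <cassert>

// Returns true when p, stored as a 32-bit float, becomes exactly 1.0f.
bool collapsesToOne(double p) {
    return static_cast<float>(p) == 1.0f;
}

// Returns true when p, stored as a 32-bit float, becomes exactly 0.0f.
bool collapsesToZero(double p) {
    return static_cast<float>(p) == 0.0f;
}
```

With IEEE 754 single precision, collapsesToOne(0.999999999) is true, because the nearest float below 1 is about 0.99999994. collapsesToZero(0.000000001) is false, so a failure near 0 would need a different explanation.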

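vadikrobot's answer showed that we have locality issues. Inspired by his experiment, I decided to generalize the access pattern diagram above to see how it behaves not only on the example trees but on any tree.

I modified vadikrobot's code to look like this: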

void testCache(int cur, FILE *f) {

    if(treeArray[cur].numChildren == 0){
        fprintf(f, "%d\t", tim++);
        fprintf (f, "%d\n", cur);
        treeArray[cur].size = 1;
        return;
    }

    fprintf(f, "%d\t", tim++);
    fprintf (f, "%d\n", cur);
    testCache(treeArray[cur].lpos, f);
    fprintf(f, "%d\t", tim++);
    fprintf (f, "%d\n", cur);
    testCache(treeArray[cur].rpos, f);
    fprintf(f, "%d\t", tim++);
    fprintf (f, "%d\n", cur);
    fprintf(f, "%d\t", tim++);
    fprintf (f, "%d\n", treeArray[cur].lpos);
    fprintf(f, "%d\t", tim++);
    fprintf (f, "%d\n", treeArray[cur].rpos);
    treeArray[cur].size = treeArray[treeArray[cur].lpos].size + 
    treeArray[treeArray[cur].rpos].size + 1;
}

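Access patterns generated by the wrong parser

Let's look at a left tree with 10 leaves.

Looks very nice, exactly as predicted by the diagrams above (I only forgot in those diagrams that, when we compute the size of a node, we also access the size field of that node, cur in the source code above).

Let's look at a left tree with 100 leaves.

Looks as expected. What about 1000 leaves?

This is definitely not expected. There is a small triangle in the top right corner. The reason is that the tree does not look like a left-going chain: there is a small subtree hanging somewhere near the end. The problem becomes even larger with 10^4 leaves.

Let's look at what happens with right trees. With 10 leaves:

Looks nice. What about 100 leaves?

Looks good as well. This is why I questioned the locality argument for right trees: to me, both seemed local, at least in theory. Now, if you increase the size, something interesting happens:

For 1000 leaves:

For 10^4 leaves things get even messier:

Access patterns generated by the correct parser

Instead of using the general parser, I created one for this specific question: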

#include <cstdlib>  //atoi
#include <iostream>
#include <fstream>

using namespace std;

int main(int argc, char* argv[]){

    if(argc!=4){
        cout<<"type ./executable n{number of leafs} type{l:left going, r:right going} outputFile"<<endl;
        return 0;
    }

    int i;

    int n = atoi(argv[1]);

    if(n <= 2){cout<<"leafs must be at least 3"<<endl; return 0;}

    char c = argv[2][0];

    ofstream fout;
    fout.open(argv[3], ios_base::out);

    if(c == 'r'){

        for(i=0;i<n-1;i++){

            fout<<"("<<i<<",";

        }
        fout<<i;
        for(i=0;i<n-1;i++){ //one ')' for each of the n-1 '(' written above
            fout<<")";
        }
        fout<<";"<<endl;

    }
    else{

        for(i=0;i<n-1;i++){
            fout<<"(";
        }

        fout<<1<<","<<n<<")";

        for(i=n-1;i>1;i--){
            fout<<","<<i<<")";
        }
        fout<<";"<<endl;

    }

    fout.close();


    return 0;
}

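Now the access patterns look as expected.

For the left tree with 10^4 leaves:

In the black part we go from low to high positions, but the distance between the previous low and the current low is small, and so is the distance between the previous high and the current high. The cache therefore only has to be smart enough to hold two blocks, one for the low positions and one for the high positions, which yields a small number of cache misses.

For the right tree with 10^4 leaves:

The original experiments again. This time I could only go up to 10^5 leaves because, as Mysticial noticed, we get a stack overflow due to the height of the tree. That was not the case in the previous experiments, since the height was smaller than expected.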
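
One way to sidestep the stack overflow on very tall trees is to drop the recursion. In a preorder layout both children sit at higher indices than their parent, so a single backward scan visits every child before its parent. The sketch below is an alternative under that assumption, not the code that was measured: FlatNode, computeSizesIterative and makeRightChain are hypothetical names, and the scan touches memory in a different order than testCache, so its cache behaviour is not comparable to the plots.

```cpp
#include <cassert>
#include <vector>

// Illustrative subset of tree_node: only what the size computation needs.
struct FlatNode {
    int numChildren;
    int size;
    int lpos;
    int rpos;
};

// Computes all subtree sizes without recursion by scanning the preorder
// array from the last node to the first.
void computeSizesIterative(std::vector<FlatNode>& a) {
    for (int i = static_cast<int>(a.size()) - 1; i >= 0; --i) {
        if (a[i].numChildren == 0) {
            a[i].size = 1;
            continue;
        }
        // Preorder layout: both children are stored after their parent.
        assert(a[i].lpos > i && a[i].rpos > i);
        a[i].size = a[a[i].lpos].size + a[a[i].rpos].size + 1;
    }
}

// Builds a right-going chain in preorder: internal node k is stored at
// index 2k, its left leaf at 2k + 1 and its right child at 2k + 2.
std::vector<FlatNode> makeRightChain(int leaves) {
    assert(leaves >= 2);
    std::vector<FlatNode> a(2 * leaves - 1, FlatNode{0, 1, -1, -1});
    for (int k = 0; k + 1 < leaves; ++k) {
        a[2 * k].numChildren = 2;
        a[2 * k].lpos = 2 * k + 1;
        a[2 * k].rpos = 2 * k + 2;
    }
    return a;
}
```

For a chain with 10^5 leaves this fills in a root size of 199999 without any deep call stack.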

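Time-wise the two seem to perform the same; cache-wise and branch-wise they do not. The right tree beats the left tree on branch prediction, and the left tree beats the right tree on cache behaviour.

Maybe my PAPI usage was wrong. Output from perf:

Left tree:

Right tree:

I might have messed something up again, and I apologize for that. I have included my attempt here in case anyone wants to continue investigating.

【Question discussion】:

You don't call startTimer() anywhere.
Were those graphs generated by PAPI?
Sorry, while editing I forgot to add startTimer() before the for loop. The running times are now ~0.7s for l.txt and ~0.12s for r.txt. I generated the graphs with the dot function and en.wikipedia.org/wiki/DOT_(graph_description_language)
What happens when you switch the order of recursion? In other words, search the right side before the left?
Also, if your tests go up to 2 million nodes, I would expect the benchmark to overflow the stack on both the left and the right tree.

Tags: c++ algorithm performance caching tree

【Solution 1】:

UPDATE:

I plotted the index of the accessed array element over time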

void testCache(int cur, FILE *f) {
   if(treeArray[cur].numChildren == 0){
       fprintf (f, "%d\n", cur);
       treeArray[cur].size = 1;
       return;
   }

   fprintf (f, "%d\n", cur);
   testCache(treeArray[cur].lpos, f);
   fprintf (f, "%d\n", cur);
   testCache(treeArray[cur].rpos, f);

   fprintf (f, "%d\n", treeArray[cur].lpos);
   fprintf (f, "%d\n", treeArray[cur].rpos);
   treeArray[cur].size = treeArray[treeArray[cur].lpos].size + treeArray[treeArray[cur].rpos].size + 1;
}

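So I plotted 999990 elements of the resulting text file:

You can see that for the left tree all elements are accessed locally, whereas for the right tree the accesses are non-uniform.

OLD:

I tried to count the number of memory reads using valgrind. For the right one: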

valgrind --tool=callgrind --cache-sim=yes ./a.out right
==11493== I   refs:      427,444,674
==11493== I1  misses:          2,288
==11493== LLi misses:          2,068
==11493== I1  miss rate:        0.00%
==11493== LLi miss rate:        0.00%
==11493== 
==11493== D   refs:      213,159,341  (144,095,416 rd + 69,063,925 wr)
==11493== D1  misses:     15,401,346  ( 12,737,497 rd +  2,663,849 wr)
==11493== LLd misses:        329,337  (      7,935 rd +    321,402 wr)
==11493== D1  miss rate:         7.2% (        8.8%   +        3.9%  )
==11493== LLd miss rate:         0.2% (        0.0%   +        0.5%  )
==11493== 
==11493== LL refs:        15,403,634  ( 12,739,785 rd +  2,663,849 wr)
==11493== LL misses:         331,405  (     10,003 rd +    321,402 wr)
==11493== LL miss rate:          0.1% (        0.0%   +        0.5%  )

For the left one:

valgrind --tool=callgrind --cache-sim=yes ./a.out left

==11496== I   refs:      418,204,722
==11496== I1  misses:          2,327
==11496== LLi misses:          2,099
==11496== I1  miss rate:        0.00%
==11496== LLi miss rate:        0.00%
==11496== 
==11496== D   refs:      204,114,971  (135,076,947 rd + 69,038,024 wr)
==11496== D1  misses:     19,470,268  ( 12,661,123 rd +  6,809,145 wr)
==11496== LLd misses:        306,948  (      7,935 rd +    299,013 wr)
==11496== D1  miss rate:         9.5% (        9.4%   +        9.9%  )
==11496== LLd miss rate:         0.2% (        0.0%   +        0.4%  )
==11496== 
==11496== LL refs:        19,472,595  ( 12,663,450 rd +  6,809,145 wr)
==11496== LL misses:         309,047  (     10,034 rd +    299,013 wr)
==11496== LL miss rate:          0.0% (        0.0%   +        0.4%  )

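As you can see, the number of memory reads ('rd') in the 'right' case is larger than in the left case.

【Discussion】:

You are right, and unfortunately this made me realize that I had misused the parser's parameter when creating these trees. Basically, none of the experiments above were really run on the trees I claimed they were run on. What a mess, I am really sorry (I will update my post with more details).

【Solution 2】:

The cache misses differ because of where the nodes are located in memory. If you access the nodes in the order in which they are laid out in memory, the cache has most likely already loaded them from RAM, because memory is loaded a cache line at a time and a cache line is most likely bigger than one of your nodes.

If you access the nodes in a random order (relative to their position in RAM) or in reverse order, it is more likely that the cache has not yet loaded them from RAM.

So the difference comes not from the structure of your tree, but from the position of the tree nodes in RAM compared with the order in which you want to access them.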
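
To put a rough number on "order of access compared with position in RAM", one crude proxy is to count how often two consecutive accesses fall on different cache lines. The sketch assumes 64-byte lines, a 32-byte node (what sizeof(tree_node) typically is on x86-64, but check on your platform) and an array that starts on a line boundary. It ignores cache capacity, associativity and prefetching, so it hints at locality and does not predict miss counts. countLineChanges is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how many consecutive pairs of accesses in `trace` (array indices)
// start on different cache lines, for nodes of nodeBytes bytes stored
// back to back from a line-aligned base address.
std::size_t countLineChanges(const std::vector<int>& trace,
                             std::size_t nodeBytes = 32,
                             std::size_t lineBytes = 64) {
    assert(nodeBytes > 0 && lineBytes > 0);
    std::size_t changes = 0;
    for (std::size_t k = 1; k < trace.size(); ++k) {
        const std::size_t prevLine =
            static_cast<std::size_t>(trace[k - 1]) * nodeBytes / lineBytes;
        const std::size_t curLine =
            static_cast<std::size_t>(trace[k]) * nodeBytes / lineBytes;
        if (prevLine != curLine) ++changes;
    }
    return changes;
}
```

With these defaults two nodes share a line, so the sequence 0, 1, 2, 3 changes line once, while 0, 2, 0, 2 changes line on every step.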

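EDIT (after the access patterns were added to the question):

As you can see in the access pattern diagrams:
On the "left-going tree", the accesses jump from low to high indexes after about half of the accesses. Because the distances keep growing, the second half will most likely always lead to cache misses.
On the "right-going tree", the second half has at least two nodes close to each other (in access order), and with a bit of luck the next two are sometimes on the same cache line.

【Discussion】:

Thank you for your answer; I have added an edit as a follow-up. I am not sure that we don't access memory locally for the right tree, unless I have misunderstood how testCache() accesses memory.
In general, moving through adjacent nodes in memory towards higher RAM addresses is faster, because the cache manager in the CPU is optimized for this direction. For example, the instructions of a program are laid out in RAM in this order and, absent jumps or loops, that is the order in which they execute. The same is true for data: videos, images and so on are stored in this order.
For the left-going tree, if we access A[i] and then A[j], where i is low and j is high, the next pair of accesses will be A[i-1] and A[j+1]. While I understand why this could cause cache misses, wouldn't the cache hold nearby blocks from both A[i] and A[j] at the same time, so that this jump causes no misses? As for the right tree, the access pattern is A[j], A[j+1], A[j-2], A[j-1], A[j-4], A[j-3], and so on. I am not sure why the cache cannot fetch blocks big enough to serve these reads efficiently.
@jsguy: If you are lucky, both blocks are in the cache. But at least for the left part we walk backwards through RAM addresses, which is bad in most cases.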
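
The access sequences discussed in these comments can be reproduced on tiny chains. The sketch below builds perfect left-going and right-going chains directly in preorder and records array indices in the same order as the instrumented testCache from the question. ChainNode, makeLeftChain, makeRightChain and traceAccesses are hypothetical names, and the index formulas are derived from the preorder layout rather than taken from the original code.

```cpp
#include <cassert>
#include <vector>

struct ChainNode {
    int numChildren;
    int lpos;
    int rpos;
};

// Left-going chain in preorder: internal node k is stored at index k, its
// left child at k + 1 and its right leaf at 2 * leaves - 2 - k.
std::vector<ChainNode> makeLeftChain(int leaves) {
    assert(leaves >= 2);
    std::vector<ChainNode> a(2 * leaves - 1, ChainNode{0, -1, -1});
    for (int k = 0; k + 1 < leaves; ++k) {
        a[k] = ChainNode{2, k + 1, 2 * leaves - 2 - k};
    }
    return a;
}

// Right-going chain in preorder: internal node k is stored at index 2k, its
// left leaf at 2k + 1 and its right child at 2k + 2.
std::vector<ChainNode> makeRightChain(int leaves) {
    assert(leaves >= 2);
    std::vector<ChainNode> a(2 * leaves - 1, ChainNode{0, -1, -1});
    for (int k = 0; k + 1 < leaves; ++k) {
        a[2 * k] = ChainNode{2, 2 * k + 1, 2 * k + 2};
    }
    return a;
}

// Appends to `out` the indices touched, in the order the instrumented
// testCache logs them: cur, (left subtree), cur, (right subtree), cur,
// lpos, rpos. A leaf logs only cur.
void traceAccesses(const std::vector<ChainNode>& a, int cur, std::vector<int>& out) {
    out.push_back(cur);
    if (a[cur].numChildren == 0) return;
    traceAccesses(a, a[cur].lpos, out);
    out.push_back(cur);
    traceAccesses(a, a[cur].rpos, out);
    out.push_back(cur);
    out.push_back(a[cur].lpos);
    out.push_back(a[cur].rpos);
}
```

For 3 leaves the left chain gives 0, 1, 2, 1, 3, 1, 2, 3, 0, 4, 0, 1, 4: while unwinding, the low index moves down as the high index moves up. The right chain gives 0, 1, 0, 2, 3, 2, 4, 2, 3, 4, 0, 1, 2.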