Answers to: Speed concerns of octree implementation

【Question title】: Speed concerns of octree implementation
【Posted】: 2016-05-06 13:07:15
【Question description】:

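For a few days now I have been struggling to speed up my force-directed graph implementation. So far I have implemented the Barnes-Hut algorithm, which uses an octree to reduce the number of computations. I have tested it multiple times, and the number of force-related computations is indeed greatly reduced.

[Figure: number of computations versus number of nodes, without Barnes-Hut (blue line) and with it (red line)]

Even though it should be a lot faster now, the truth is that in terms of speed (time) the improvement is only a few percent.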

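I think part of what may cause this is creating the tree and placing the elements in it. Because the elements move constantly, I need to recreate the tree on every loop iteration until some stopping condition is reached. But if I spend a lot of time creating the tree, I lose the time I gained from the cheaper force calculation. At least that is my thinking. This is how I add the elements in the main loop: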

void AddTreeElements(Octree* tree, glm::vec3* boundries, Graph& graph)
{
    for(auto& node:graph.NodeVector())
    {
        node.parent_group = nullptr;
        if(node.pos[0] < boundries[1][0] && node.pos[0] > boundries[0][0] &&
                node.pos[1] > boundries[4][1] && node.pos[1] < boundries[1][1] &&
                node.pos[2] < boundries[0][2] && node.pos[2] > boundries[3][2])
        {
            tree->AddObject(&node);
            continue;
        }

        if(node.pos[0] < boundries[0][0])
        {
            boundries[0][0] = node.pos[0]-1.0f;
            boundries[3][0] = node.pos[0]-1.0f;
            boundries[4][0] = node.pos[0]-1.0f;
            boundries[7][0] = node.pos[0]-1.0f;
        }
        else if(node.pos[0] > boundries[1][0])
        {
            boundries[1][0] = node.pos[0]+1.0f;
            boundries[2][0] = node.pos[0]+1.0f;
            boundries[5][0] = node.pos[0]+1.0f;
            boundries[6][0] = node.pos[0]+1.0f;
        }

        if(node.pos[1] < boundries[4][1])
        {
            boundries[4][1] = node.pos[1]-1.0f;
            boundries[5][1] = node.pos[1]-1.0f;
            boundries[6][1] = node.pos[1]-1.0f;
            boundries[7][1] = node.pos[1]-1.0f;
        }
        else if(node.pos[1] > boundries[0][1])
        {
            boundries[0][1] = node.pos[1]+1.0f;
            boundries[1][1] = node.pos[1]+1.0f;
            boundries[2][1] = node.pos[1]+1.0f;
            boundries[3][1] = node.pos[1]+1.0f;
        }

        if(node.pos[2] < boundries[3][2])
        {
            boundries[2][2] = node.pos[2]-1.0f;
            boundries[3][2] = node.pos[2]-1.0f;
            boundries[6][2] = node.pos[2]-1.0f;
            boundries[7][2] = node.pos[2]-1.0f;
        }
        else if(node.pos[2] > boundries[0][2])
        {
            boundries[0][2] = node.pos[2]+1.0f;
            boundries[1][2] = node.pos[2]+1.0f;
            boundries[4][2] = node.pos[2]+1.0f;
            boundries[5][2] = node.pos[2]+1.0f;
        }
    }
}

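What I am doing here is going through all the elements in the graph and adding them to the tree root. I am also expanding the box that represents the octree's boundaries for the next loop iteration, so that all nodes will fit inside it.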
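The box-growing step described above can also be written as a single min/max pass. This is only a minimal sketch under assumptions that are not in the original post: the box is kept as two opposite corners rather than eight, `Vec3` is a hypothetical stand-in for `glm::vec3`, and `Bounds` and `ComputeBounds` are illustrative names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for glm::vec3 so the sketch compiles on its own.
struct Vec3 { float x, y, z; };

// Axis-aligned box stored as two opposite corners instead of eight.
struct Bounds { Vec3 min, max; };

// Returns a box enclosing every position, padded by `margin` on each side.
Bounds ComputeBounds(const std::vector<Vec3>& positions, float margin)
{
    assert(!positions.empty());
    Bounds b{positions.front(), positions.front()};
    for (const Vec3& p : positions)
    {
        b.min.x = std::min(b.min.x, p.x);
        b.min.y = std::min(b.min.y, p.y);
        b.min.z = std::min(b.min.z, p.z);
        b.max.x = std::max(b.max.x, p.x);
        b.max.y = std::max(b.max.y, p.y);
        b.max.z = std::max(b.max.z, p.z);
    }
    b.min.x -= margin; b.min.y -= margin; b.min.z -= margin;
    b.max.x += margin; b.max.y += margin; b.max.z += margin;
    return b;
}
```

Computing the box before inserting anything means every node fits in the tree on the same iteration, at the cost of one extra linear pass over the positions.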

The fields of the octree structure that matter for the update are:

Octree* trees[2][2][2];
glm::vec3 vBoundriesBox[8];
bool leaf;
float combined_weight = 0;
std::vector<Element*> objects;

And the part of the code responsible for the update:

#define MAX_LEVELS 5

void Octree::AddObject(Element* object)
{
    this->objects.push_back(object);
}

void Octree::Update()
{
    if(this->objects.size()<=1 || level > MAX_LEVELS)
    {
        for(Element* element:this->objects)
        {
            element->parent_group = this;
        }
        return;
    }

    if(leaf)
    {
        GenerateChildren();
        leaf = false;
    }

    while (!this->objects.empty())
    {
        Element* obj = this->objects.back();
        this->objects.pop_back();
        if(contains(trees[0][0][0],obj))
        {
            trees[0][0][0]->AddObject(obj);
            trees[0][0][0]->combined_weight += obj->weight;
        } else if(contains(trees[0][0][1],obj))
        {
            trees[0][0][1]->AddObject(obj);
            trees[0][0][1]->combined_weight += obj->weight;
        } else if(contains(trees[0][1][0],obj))
        {
            trees[0][1][0]->AddObject(obj);
            trees[0][1][0]->combined_weight += obj->weight;
        } else if(contains(trees[0][1][1],obj))
        {
            trees[0][1][1]->AddObject(obj);
            trees[0][1][1]->combined_weight += obj->weight;
        } else if(contains(trees[1][0][0],obj))
        {
            trees[1][0][0]->AddObject(obj);
            trees[1][0][0]->combined_weight += obj->weight;
        } else if(contains(trees[1][0][1],obj))
        {
            trees[1][0][1]->AddObject(obj);
            trees[1][0][1]->combined_weight += obj->weight;
        } else if(contains(trees[1][1][0],obj))
        {
            trees[1][1][0]->AddObject(obj);
            trees[1][1][0]->combined_weight += obj->weight;
        } else if(contains(trees[1][1][1],obj))
        {
            trees[1][1][1]->AddObject(obj);
            trees[1][1][1]->combined_weight += obj->weight;
        }
    }

    for(int i=0;i<2;i++)
    {
        for(int j=0;j<2;j++)
        {
            for(int k=0;k<2;k++)
            {
                trees[i][j][k]->Update();
            }
        }
    }
}

bool Octree::contains(Octree* child, Element* object)
{
    if(object->pos[0] >= child->vBoundriesBox[0][0] && object->pos[0] <= child->vBoundriesBox[1][0] &&
       object->pos[1] >= child->vBoundriesBox[4][1] && object->pos[1] <= child->vBoundriesBox[0][1] &&
       object->pos[2] >= child->vBoundriesBox[3][2] && object->pos[2] <= child->vBoundriesBox[0][2])
        return true;
    return false;
}

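Because I use pointers to move elements around the tree, I don't think object creation/destruction is the problem here. One place that I think may affect speed is this: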

Element* obj = this->objects.back();
this->objects.pop_back();
if(contains(trees[0][0][0],obj))

Although I am not sure how to omit it or speed it up. Does anyone have suggestions for what could be done here?

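EDIT:

I did some napkin math, and I think there is one more place that may cause a big slowdown. The boundary check in the Update method looks like it does a lot of work, and by my calculation, in the worst case it adds a cost of: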

number_of_elements*number_of_children*number_of_faces*MAX_LEVELS

which in my case equals number_of_elements*240 (8 children * 6 face comparisons * 5 levels).

Can someone confirm whether my reasoning is sound?

【Discussion】:

codereview.stackexchange.com
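@Mihai Following your suggestion, I have posted it there: codereview.stackexchange.com/questions/127693/…
What DrunkCoder says may help, but remember the first three rules of performance optimization: measure, measure, measure! Use a sampling CPU profiler for your platform (e.g. perf + hotspot on Linux, the Visual Studio profiler on Windows, or Instruments on macOS), then use that data to find the real culprit behind the performance problem.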
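As a coarse complement to a sampling profiler (not a replacement for one), the tree build and the force computation can be timed separately with `std::chrono`. This is a minimal sketch; `PhaseTimer`, `Measure`, and `AverageMs` are hypothetical names, not part of the code in the question.

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Accumulates wall-clock time for one phase (e.g. tree build) across iterations.
struct PhaseTimer
{
    double total_ms = 0.0;
    int runs = 0;

    // Runs `work` once and adds its duration to the running total.
    template <typename Work>
    void Measure(Work&& work)
    {
        const auto start = std::chrono::steady_clock::now();
        std::forward<Work>(work)();
        const auto stop = std::chrono::steady_clock::now();
        total_ms += std::chrono::duration<double, std::milli>(stop - start).count();
        ++runs;
    }

    // Mean duration per measured run, in milliseconds.
    double AverageMs() const
    {
        assert(runs > 0);
        return total_ms / runs;
    }
};
```

Wrapping the `AddTreeElements` and `Update` calls in one timer and the force computation in another would show which phase dominates each iteration.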

Tags: c++ algorithm tree

【Solution 1】:

If I understand correctly, you are storing a vector of pointers in every octree node?

std::vector<Element*> objects;

...

void Octree::AddObject(Element* object)
{
    this->objects.push_back(object);
}

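As I understand this code, to build the octree the parent node pops element pointers off the back of its own vector and pushes them onto the back of the appropriate child's vector to transfer them.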

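If that is the case, I can say right away, without even measuring, that this is a major bottleneck. I have dealt with this kind of octree implementation before and made building it more than 10 times faster simply by using a singly-linked list. In this particular case, compared to a large number of tiny vectors (one per node), it significantly reduces the heap allocations/deallocations involved and even improves spatial locality. I am not saying it is the only bottleneck, but it is definitely an important one.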

If so, this is what I suggest:

struct OctreeElement
{
     // Points to next sibling.
     OctreeElement* next;

     // Points to the element data (point, triangle, whatever).
     Element* element;
};

struct OctreeNode
{
     OctreeNode* children[8];
     glm::vec3 vBoundriesBox[8];

     // Points to the first element in this node
     // or null if there are none.
     OctreeElement* first_element;

     float combined_weight;
     bool leaf;
};

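This is really just a first pass, but it should help a lot. When you transfer an element from parent to child there is then no pushing back and popping back, and no heap allocation. All you do is manipulate pointers. To transfer an element from parent to child: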

// Pop off element from parent.
OctreeElement* elt = parent->first_element;
parent->first_element = elt->next;

// Push it to the nth child.
elt->next = children[n]->first_element;
children[n]->first_element = elt;

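As you can see above, with the linked representation all we need to do to transfer an element from one node to another is manipulate 3 pointers: no heap allocation, no growing the size, no capacity checks, and so on. Further, you reduce the overhead of storing elements to one pointer per node and one pointer per element. One vector per node tends to be quite explosive in memory use, since a vector can often take 32+ bytes even when merely default-constructed: it has to store a data pointer, size, and capacity, and some implementations preallocate memory on top of that.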
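The three-pointer transfer can be wrapped in a small helper. This is a minimal sketch, assuming a stripped-down `Element` that carries only a weight and an `OctreeNode` without the bounding box; `TransferFirst` is a hypothetical name introduced for illustration.

```cpp
#include <cassert>

// Hypothetical minimal payload; the real Element also has a position, etc.
struct Element { float weight; };

struct OctreeElement
{
    OctreeElement* next;   // next sibling in the same node, or nullptr
    Element* element;      // the element data
};

struct OctreeNode
{
    OctreeNode* children[8] = {};
    OctreeElement* first_element = nullptr;
    float combined_weight = 0.0f;
};

// Moves the head element of `parent` to the front of child `n`'s list.
// No allocation takes place: only three pointers are rewritten.
void TransferFirst(OctreeNode& parent, int n)
{
    assert(parent.first_element != nullptr);
    assert(parent.children[n] != nullptr);
    OctreeNode& child = *parent.children[n];

    OctreeElement* elt = parent.first_element;
    parent.first_element = elt->next;   // pop from the parent
    elt->next = child.first_element;    // link in front of the child's list
    child.first_element = elt;          // push onto the child

    child.combined_weight += elt->element->weight;
}
```

Calling it in a loop until `parent.first_element` is null empties the parent, with `n` chosen per element by whichever containment test the tree uses.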

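There is still a lot of room for improvement, but this first pass should help a lot, even more so if you allocate the OctreeElement instances with an efficient allocator (e.g. a free list or a sequential allocator) or store them in a stable data structure that does not invalidate pointers but still provides some contiguity, such as std::deque. If you are willing to do a bit more work, use a std::vector to store all the elements (all the elements of the whole tree, not one vector per node) and link the elements together using indices into that vector instead of pointers. If you use indices instead of pointers for the linked lists, you can store all the nodes contiguously without a memory allocator, just one big old vector holding everything, and you halve the memory needed for the links (assuming 64-bit pointers and that 32-bit indices are enough for your data).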

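If you use 32-bit indices, you may not need all 32 bits either, at which point you can use 31 bits and cram the leaf boolean into the first-element index (that boolean inflates the node considerably: around 4 bytes for that one field once padding and alignment requirements are counted, assuming 64-bit pointers), or just set the first-child index to -1 to indicate a leaf, like so: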

struct OctreeElement
{
     // Points to the element data (point, triangle, whatever).
     int32_t element;

     // Points to next sibling.
     int32_t next;
};

struct OctreeNode
{
     // This can be further reduced down to two
     // vectors: a box center and half-size. A
     // little bit of arithmetic can still improve
     // efficiency of traversal and building if
     // the result is fewer cache misses and less
     // memory use.
     glm::vec3 vBoundriesBox[8];

     // Points to the first child. We don't need
     // to store 8 indices for the children if we
     // can assume that all 8 children are stored
     // contiguously in an array/vector. If the
     // node is a leaf, this stores -1.
     int32_t children;

     // Points to the first element in this node
     // or -1 if there are none.
     int32_t first_element;

     float combined_weight;
};

struct Octree
{
     // Stores all the elements for the entire tree.
     std::vector<OctreeElement> elements;

     // Stores all the nodes for the entire tree. The
     // first node is the root.
     std::vector<OctreeNode> nodes;
};

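This is all still quite rudimentary, and there is a lot of room for improvement that I can't really cover in one answer, but doing just these few things should already help a lot, starting with avoiding a separate vector per node as your biggest improvement.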
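To make the index-based layout concrete, here is a minimal sketch of splitting a node and redistributing its elements. It rests on assumptions beyond the answer: the box is stored as a center and half-size (as the code comment suggests is possible), `Vec3` stands in for `glm::vec3`, positions and weights live in separate arrays indexed by element id, and `Octant`, `Split`, and `Distribute` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };  // hypothetical stand-in for glm::vec3

struct OctreeElement { int32_t element; int32_t next; };

struct OctreeNode
{
    Vec3 center{0.0f, 0.0f, 0.0f};
    float half_size = 0.0f;       // half the edge length of the cube
    int32_t children = -1;        // index of the first of 8 contiguous children, -1 for a leaf
    int32_t first_element = -1;   // head of this node's element list, -1 if empty
    float combined_weight = 0.0f;
};

struct Octree
{
    std::vector<OctreeElement> elements;
    std::vector<OctreeNode> nodes;
};

// Picks the octant (0..7) with three comparisons against the center.
int Octant(const Vec3& center, const Vec3& p)
{
    return (p.x >= center.x ? 1 : 0) | (p.y >= center.y ? 2 : 0) | (p.z >= center.z ? 4 : 0);
}

// Appends 8 contiguous children for `node_index` and returns the index of the first.
int32_t Split(Octree& tree, int32_t node_index)
{
    const Vec3 c = tree.nodes[node_index].center;  // copy: push_back may reallocate
    const float h = tree.nodes[node_index].half_size * 0.5f;
    const int32_t first = static_cast<int32_t>(tree.nodes.size());
    for (int i = 0; i < 8; ++i)
    {
        OctreeNode child;
        child.center = Vec3{c.x + ((i & 1) ? h : -h),
                            c.y + ((i & 2) ? h : -h),
                            c.z + ((i & 4) ? h : -h)};
        child.half_size = h;
        tree.nodes.push_back(child);
    }
    tree.nodes[node_index].children = first;
    return first;
}

// Moves every element of an already split node into the matching child list.
void Distribute(Octree& tree, int32_t node_index,
                const std::vector<Vec3>& positions, const std::vector<float>& weights)
{
    OctreeNode& node = tree.nodes[node_index];
    assert(node.children != -1);
    while (node.first_element != -1)
    {
        const int32_t e = node.first_element;
        OctreeElement& elt = tree.elements[e];
        node.first_element = elt.next;                 // pop from the parent
        OctreeNode& child = tree.nodes[node.children + Octant(node.center, positions[elt.element])];
        elt.next = child.first_element;                // link in front of the child's list
        child.first_element = e;                       // push onto the child
        child.combined_weight += weights[elt.element];
    }
}
```

With this layout, choosing a child costs three comparisons instead of up to eight box tests, and the whole tree lives in two vectors that can be cleared and reused each iteration.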

Linked lists for reducing heap allocations and improving locality of reference

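I feel that many C++ developers I have worked with over the years either forgot this or perhaps never learned it: linked lists do not always have to translate to more heap allocations and cache misses, especially when each node does not require a separate heap allocation. If the point of comparison is a boatload of tiny vectors, linked lists can actually reduce cache misses and heap allocations. Take this basic example: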

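Suppose the actual grid has 10,000 cells. In that case it is much cheaper, and requires far fewer memory allocations (and typically far less memory), to store just one 32-bit index per cell and link the elements together using 32-bit indices stored in one big array (or one big vector) than to store 10,000 vectors. Vector is an excellent structure for storing a large amount of data, but it is not what you want for storing a boatload of small variable-sized lists. Singly-linked lists are already a big improvement there, and they lend themselves very well to transferring elements from one list to another in constant time and very cheaply, since that only requires manipulating 3 pointers (or 3 indices) without any extra branching.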
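The grid described above might look like the following. This is a minimal sketch; the `Grid` type and its member names are hypothetical, and elements are identified only by their index.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// A grid whose cells are singly-linked lists threaded through one shared array.
struct Grid
{
    std::vector<int32_t> cell_heads;  // per cell: index of its first element, -1 if empty
    std::vector<int32_t> next;        // per element: next element in the same cell, -1 at the end

    Grid(std::size_t num_cells, std::size_t num_elements)
        : cell_heads(num_cells, -1), next(num_elements, -1) {}

    // Pushes `element` onto the front of `cell`'s list.
    void Insert(int32_t cell, int32_t element)
    {
        next[element] = cell_heads[cell];
        cell_heads[cell] = element;
    }

    // Moves the first element of cell `from` to the front of cell `to`
    // by rewriting three indices.
    void MoveFirst(int32_t from, int32_t to)
    {
        const int32_t e = cell_heads[from];
        assert(e != -1);
        cell_heads[from] = next[e];
        next[e] = cell_heads[to];
        cell_heads[to] = e;
    }

    // Walks the list of `cell` and counts its elements.
    int32_t Count(int32_t cell) const
    {
        int32_t n = 0;
        for (int32_t e = cell_heads[cell]; e != -1; e = next[e]) ++n;
        return n;
    }
};
```

A grid built as `Grid grid(10000, element_count)` makes two allocations in total, however the elements are spread across the cells.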

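So linked lists still have plenty of uses. They are especially useful when you use them in a way that actually reduces, rather than increases, heap allocations.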

【Discussion】:

Possibly a stupid question: how is this linked list more cache-local?