Initializing std::vector outside of main() causes performance drop (multithreading): Answers

【Question title】: Initializing std::vector outside of main() causes performance drop (multithreading)
【Posted】: 2020-10-30 05:40:02
【Question】:

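I'm writing a path tracer as a programming exercise. Yesterday I finally decided to implement multithreading, and it worked well. However, once I wrapped the test code I had written in main() in a separate renderer class, I noticed a significant performance drop. In short, it seems that filling a std::vector anywhere outside of main() makes the threads that use its elements perform worse. I managed to isolate and reproduce the issue with simplified code, but unfortunately I still don't know why it happens or how to fix it.

The performance drop is clearly visible and consistent: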

  97 samples - time = 28.154226s, per sample = 0.290250s, per sample/th = 1.741498
  99 samples - time = 28.360723s, per sample = 0.286472s, per sample/th = 1.718832
 100 samples - time = 29.335468s, per sample = 0.293355s, per sample/th = 1.760128

vs.

  98 samples - time = 30.197734s, per sample = 0.308140s, per sample/th = 1.848841
  99 samples - time = 30.534240s, per sample = 0.308427s, per sample/th = 1.850560
 100 samples - time = 30.786519s, per sample = 0.307865s, per sample/th = 1.847191

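The code I originally posted in this question can be found here: https://github.com/Jacajack/rt/tree/mt_debug or in the edit history.

I created a struct foo that is supposed to mimic the behavior of my renderer class and is responsible for initializing the path tracing contexts in its constructor. Interestingly, when I remove the body of foo's constructor and do this instead (initializing contexts directly from main()):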

std::vector<rt::path_tracer> contexts; // Can be on stack or on heap, doesn't matter
foo F(cam, scene, bvh, width, height, render_threads, contexts); // no longer fills `contexts`

contexts.reserve(render_threads);
for (int i = 0; i < render_threads; i++)
    contexts.emplace_back(cam, scene, bvh, width, height, 1000 + i);

F.run(render_threads);

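performance goes back to normal. However, if I wrap these three lines in a separate function and call it from here, performance is worse again. The only pattern I can see is that filling the contexts vector outside of main() causes the problem.

I initially thought this was an alignment/cache issue, so I tried aligning the path_tracers with Boost's aligned_allocator and TBB's cache_aligned_allocator, with no result. It turns out the problem persists even when only one thread is running. I suspect it must be some kind of wild compiler optimization (I'm using -O3), although that's just a guess. Do you know any possible causes of such behavior and what can be done to avoid it?

This happens on both gcc 10.1.0 and clang 10.0.0. Currently I'm using only -O3.

I managed to reproduce a similar issue in this standalone example: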

#include <iostream>
#include <thread>
#include <random>
#include <algorithm>
#include <chrono>
#include <iomanip>
#include <vector>     // std::vector
#include <functional> // std::ref, std::plus

struct foo
{
    std::mt19937 rng;
    std::uniform_real_distribution<float> dist;
    std::vector<float> buf;
    int cnt = 0;
    
    foo(int seed, int n) :
        rng(seed),
        dist(0, 1),
        buf(n, 0)
    {
    }
    
    void do_stuff()
    {
        // Do whatever
        for (auto &f : buf)
            f = (f + 1) * dist(rng);
        cnt++;
    }
};

int main()
{
    int N = 50000000;
    int thread_count = 6;
    
    struct bar
    {
        std::vector<std::thread> threads;
        std::vector<foo> &foos;
        bool active = true;
        
        bar(std::vector<foo> &f, int thread_count, int n) :
            foos(f)
        {
            /*
            foos.reserve(thread_count);
            for (int i = 0; i < thread_count; i++)
                foos.emplace_back(1000 + i, n);
            //*/
        }
        
        void run(int thread_count)
        {
            auto task = [this](foo &f)
            {
                while (this->active)
                    f.do_stuff();
            };

            threads.reserve(thread_count);
            for (int i = 0; i < thread_count; i++)
                threads.emplace_back(task, std::ref(foos[i]));
        }
    };
    
    
    std::vector<foo> foos;
    bar B(foos, thread_count, N);
    
    ///*
    foos.reserve(thread_count);
    for (int i = 0; i < thread_count; i++)
        foos.emplace_back(1000 + i, N);
    //*/
    
    B.run(thread_count);
    
    std::vector<float> buffer(N, 0);
    int samples = 0, last_samples = 0;
    
    // Start time
    auto t_start = std::chrono::high_resolution_clock::now();
    
    while (1)
    {
        last_samples = samples;
        samples = 0;
        for (auto &f : foos)
        {
            std::transform(
                f.buf.cbegin(), f.buf.cend(),
                buffer.begin(),
                buffer.begin(),
                std::plus<float>()
            );
            samples += f.cnt;
        }
        
        if (samples != last_samples)
        {
            auto t_now = std::chrono::high_resolution_clock::now();
            std::chrono::duration<double> t_total = t_now - t_start;
            std::cerr << std::setw(4) << samples << " samples - time = " << std::setw(8) << std::fixed << t_total.count() 
                << "s, per sample = " << std::setw(8) << std::fixed << t_total.count() / samples 
                << "s, per sample/th = " << std::setw(8) << std::fixed << t_total.count() / samples * thread_count << std::endl;
        }
    }
}

And the results:

For N = 100000000, thread_count = 6

In main():
 196 samples - time = 26.789526s, per sample = 0.136681s, per sample/th = 0.820088
 197 samples - time = 27.045646s, per sample = 0.137288s, per sample/th = 0.823725
 200 samples - time = 27.312159s, per sample = 0.136561s, per sample/th = 0.819365


vs.
In foo::foo():
 193 samples - time = 22.690566s, per sample = 0.117568s, per sample/th = 0.705406
 196 samples - time = 22.972403s, per sample = 0.117206s, per sample/th = 0.703237
 198 samples - time = 23.257542s, per sample = 0.117462s, per sample/th = 0.704774
 200 samples - time = 23.540432s, per sample = 0.117702s, per sample/th = 0.706213

The results seem to be the opposite of what happens in my path tracer, but a visible difference is still there.

Thanks

【Comments】:

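How are you compiling your code? Have you enabled compiler optimizations? If not, do that as step 1.
Please provide a minimal reproducible example; details matter. Also include all the compiler flags you use.
Initially I had -ffast-math, -march=native, -ftree-vectorize and -O3 enabled. Now I only have -O3 and it still happens.
You should also include the benchmark (details matter).
Yes, go ahead. There are no answers referencing the old code yet, and it won't be permanently deleted if anyone wants to look at it.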

Tags: c++ multithreading performance c++11 vector

【Answer 1】:

There is a race condition on foo::buf: one thread stores into it while another reads it. This is undefined behavior, but on x86-64 it is harmless in this particular code.
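If the race ever needs to be removed while keeping a live preview, one option is to guard the buffer with a mutex and make the counter atomic. The following is only a minimal sketch: `guarded_buf` and `accumulate_into` are hypothetical names that do not appear in the original code, and the random distribution is replaced by a plain `scale` argument to keep the example short.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical wrapper (not in the original code): a worker-owned buffer
// that a preview thread may read without a data race.
struct guarded_buf
{
    std::vector<float> buf;
    std::atomic<int> cnt{0}; // sample counter, safe to read from any thread
    mutable std::mutex mtx;  // protects buf

    explicit guarded_buf(std::size_t n) : buf(n, 0.0f) {}

    // Worker thread: one pass over the buffer while holding the lock.
    void do_stuff(float scale)
    {
        std::lock_guard<std::mutex> lock(mtx);
        for (auto &f : buf)
            f = (f + 1) * scale;
        cnt.fetch_add(1, std::memory_order_relaxed);
    }

    // Preview thread: add the current contents to out under the same lock.
    void accumulate_into(std::vector<float> &out) const
    {
        assert(out.size() == buf.size());
        std::lock_guard<std::mutex> lock(mtx);
        for (std::size_t i = 0; i < buf.size(); i++)
            out[i] += buf[i];
    }
};
```

Holding the lock for a whole pass makes the preview wait for the worker, so this trades preview latency for correctness. The `active` flag in `bar` is also read and written from different threads and would need to become a `std::atomic<bool>` for the same reason.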

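I cannot reproduce your observations on an Intel i9-9900KS; both variants print the same per sample statistics.

Compiled with gcc-8.4: g++ -o release/gcc/test.o -c -pthread -m{arch,tune}=native -std=gnu++17 -g -O3 -ffast-math -falign-{functions,loops}=64 -DNDEBUG test.cc

With int N = 50000000;, each thread operates on its own float[N] array, which occupies 200MB. Such a data set does not fit in the CPU caches, so the program incurs a lot of data cache misses because it has to fetch the data from memory: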
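The 200MB figure follows directly from the element count and the size of a float. As a quick check, assuming a 4-byte float (true on x86-64) and the six worker threads from the question; `buffer_bytes` is a hypothetical helper name:

```cpp
#include <cstddef>

// Bytes occupied by one thread's float buffer of n elements.
constexpr std::size_t buffer_bytes(std::size_t n)
{
    return n * sizeof(float);
}

static_assert(sizeof(float) == 4, "assumes a 4-byte float, as on x86-64");

// 50,000,000 floats at 4 bytes each: 200,000,000 bytes (200MB) per thread.
static_assert(buffer_bytes(50000000) == 200000000u, "200MB per thread");

// Six worker threads together touch 1.2GB, not counting main()'s own buffer.
static_assert(6 * buffer_bytes(50000000) == 1200000000u, "1.2GB in total");
```

That is far more than the last-level cache of a desktop CPU, which is measured in tens of megabytes at most, so every pass over a buffer streams from main memory.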

$ perf stat -ddd ./release/gcc/test
[...]
      71474.813087      task-clock (msec)         #    6.860 CPUs utilized          
                66      context-switches          #    0.001 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
           341,942      page-faults               #    0.005 M/sec                  
   357,027,759,875      cycles                    #    4.995 GHz                      (30.76%)
   991,950,515,582      instructions              #    2.78  insn per cycle           (38.43%)
   105,609,126,987      branches                  # 1477.571 M/sec                    (38.40%)
       155,426,137      branch-misses             #    0.15% of all branches          (38.39%)
   150,832,846,580      L1-dcache-loads           # 2110.294 M/sec                    (38.41%)
     4,945,287,289      L1-dcache-load-misses     #    3.28% of all L1-dcache hits    (38.44%)
     1,787,635,257      LLC-loads                 #   25.011 M/sec                    (30.79%)
     1,103,347,596      LLC-load-misses           #   61.72% of all LL-cache hits     (30.81%)
   <not supported>      L1-icache-loads                                             
         7,457,756      L1-icache-load-misses                                         (30.80%)
   150,527,469,899      dTLB-loads                # 2106.021 M/sec                    (30.80%)
        54,966,843      dTLB-load-misses          #    0.04% of all dTLB cache hits   (30.80%)
            26,956      iTLB-loads                #    0.377 K/sec                    (30.80%)
           415,128      iTLB-load-misses          # 1540.02% of all iTLB cache hits   (30.79%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

      10.419122076 seconds time elapsed

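If you run this application on NUMA CPUs (e.g. AMD Ryzen and Intel Xeon with multiple sockets), then your observations may be explained by an adverse placement of threads on NUMA nodes remote from the node where foo::buf was allocated. Those last-level data cache misses have to read memory, and if that memory is on a remote NUMA node, the read takes longer.

To fix that, you may want to allocate the memory in the thread that uses it (rather than in the main thread, as the code does) and use a NUMA-aware allocator, such as TCMalloc. See NUMA aware heap memory manager for details.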
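Allocating in the thread that uses the memory can be sketched as follows. This is an illustration, not the answerer's code: `run_with_local_buffers` is a hypothetical helper, and it relies on the assumption that the operating system places a page on the node of the thread that first writes to it, which is typically the default on Linux.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (not in the original code): start thread_count workers,
// each of which allocates and fills its own buffer before using it.
// Returns the per-thread sums so the caller can check the result.
std::vector<double> run_with_local_buffers(int thread_count, std::size_t n)
{
    std::vector<double> sums(thread_count, 0.0);
    std::vector<std::thread> threads;
    threads.reserve(thread_count);

    for (int i = 0; i < thread_count; i++)
    {
        threads.emplace_back([i, n, &sums]
        {
            // Allocated and first written by the worker itself, so the pages
            // are first touched on whichever node this thread runs on.
            std::vector<float> buf(n, 1.0f);

            double sum = 0.0;
            for (float f : buf)
                sum += f;
            sums[i] = sum; // each worker writes only its own slot
        });
    }

    for (auto &t : threads)
        t.join();
    return sums;
}
```

A call such as `run_with_local_buffers(6, 50000000)` mirrors the sizes in the question. The sketch does not pin threads to cores, so the scheduler may still migrate a worker away from its memory later; thread affinity would be a separate step.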

When running benchmarks, you may want to fix the CPU frequency so that it is not adjusted dynamically during the run. On Linux you can do that with sudo cpupower frequency-set --related --governor performance.

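【Comments】:

I'm aware of that race condition, but I accept it for the sake of the real-time preview. My CPU is an i7-8700k, and the difference is noticeable for any number of threads (including 1, and with a fixed frequency). I'll try running this code on other computers to see how it behaves and report my results. Allocating memory in each thread also seems like a reasonable approach; I'll try it. Thanks!
@Jacajack On non-NUMA CPUs, such as the i7-8700k, you don't need NUMA-aware memory allocation.
Yes, but at least it's something new to try in case I find what the problem is. Here are my results on a Core 2 Duo; the difference is smaller but noticeable: pastebin.com/XuZFW3iv After seeing your compile command I tried adding -falign-*=64 to my path tracer's CMake, unfortunately with no result.
@Jacajack I've just double-checked my code, executables and results, and also ran the benchmark on an Intel Xeon Gold 6132. I don't observe any difference. I suspect that the code you compile and run is different from the code you posted.
I made this plot: imgur.com/a/Ps6IceR, and it seems that both versions eventually converge to roughly the same value after a longer while. At the start the difference is clearly bigger (apparently the "init in main" version happens to be faster here? at least judging by the plot...), and that is what made me think the overall difference was significant. I guess I'll have to be more patient when benchmarking in the future. Thanks for your time!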
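One reason the early numbers can mislead is that the "per sample" figure printed by the benchmark is a cumulative average (total time divided by total samples), so anything that happens at start-up fades out only slowly. A rough sketch of comparing it with the time per sample over a recent interval only; `measurement`, `cumulative_per_sample` and `interval_per_sample` are hypothetical names, not part of the posted code:

```cpp
#include <cassert>

// One line of benchmark output: total samples so far and elapsed seconds.
struct measurement
{
    int samples;
    double seconds;
};

// What the benchmark prints: total time divided by total samples.
double cumulative_per_sample(const measurement &m)
{
    assert(m.samples > 0);
    return m.seconds / m.samples;
}

// Time per sample between two measurements, ignoring everything before from.
double interval_per_sample(const measurement &from, const measurement &to)
{
    assert(to.samples > from.samples);
    return (to.seconds - from.seconds) / (to.samples - from.samples);
}
```

For example, feeding in the 193-sample and 200-sample lines of the second run in the question gives roughly 0.121 s per sample over that interval, while the cumulative figure at 200 samples is about 0.1177 s. Intervals this short are noisy, so this is only a hint that the steady-state rate differs from the running average.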