Why is the parallelized version slower than the single-thread version? False sharing?

【Question title】: Why is the parallelized version slower than the single-thread version? False sharing?
【Posted】: 2021-08-16 11:41:57
【Question description】:

Here are two versions of the code:

class VectorCount {
private:
     char               *arr;
     int                size;
     unsigned long long count;
public:
     VectorCount(char *arr, unsigned int size, unsigned long long count) : arr(arr), size(size), count(count) {}

     void add() {
          while(count--) {
               for (int i = 0; i < size; i++) {
                         arr[i]++;
               }
          }
     }
};

// Single-thread version of the code
void main_st() {
     // initialize array
     char arr[10];

     // create object
     VectorCount v(arr, 10, 100000000);

     // run add
     v.add();
}

// Parallelized version of the code
void main_mt() {
     // initialize array
     char arrA[10];
     char arrB[10];

     // create objects
     VectorCount v1(arrA, 10, 50000000);
     VectorCount v2(arrB, 10, 50000000);

     // create threads
     thread t1, t2;
     t1.create_thread(v1, v1.add);
     t2.create_thread(v2, v2.add);

     // join threads
     t1.join();
     t2.join();

     // Code to do the final sum of the two VectorCount objects to get the same
     // result as the single-threaded version (assume negligible overhead here)
}

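Say the program creates a single instance of a VectorCount object with an array size of 10 and a count of 100 million. The single-thread version takes 5 seconds to finish the summing. The parallelized version uses two threads, each owning a separate VectorCount instance with an array size of 10, so the count is only 50 million per instance because each thread does half of the work. The parallelized version takes 8 seconds to finish. Why is it slower? I think it is due to false sharing, but I am not sure. The cache line size is 64 bytes.

Can we make the parallelized version run faster? I am thinking of changing the VectorCount array size. How fast would the parallelized version then run with 2 threads? Since the cache line size is 64 bytes and an int is 4 bytes, would size = 16 fix the false sharing in this case? (4 x 16 = 64).

Thanks for your help.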

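【Comments】:

It would be good to have a reproducible example, preferably on quick-bench.com.
There are a few possibilities. First, both threads may be scheduled on the same core, which is slower than a single thread. Second, the two threads may be scheduled on a different core each time, which causes cache flushes on the old core and cache misses on the current one.
Your code does very little and has most likely been optimized away to nothing by the compiler.
Are t1 and t2 std::threads? I can't seem to find any member function named create_thread.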

Tags: c++ multithreading parallel-processing process

【Solution 1】:

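Your code does very little work. Given enough context, the optimizer can turn it into essentially the following, because each of the count passes of the outer loop increments every element exactly once (the sum is truncated to fit a char):

for (int i = 0; i < size; i++) {
  arr[i] += count;
}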

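Since the values in arr are never used, the optimizer can even eliminate this code entirely.

Note that because arr is uninitialized, your code has undefined behavior. You should initialize the array to 0 with char arr[10] {0}.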
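
As a rough check of the two points above, the sketch below compares the nested loops with a single pass that adds count to each element once, starting from zero-initialized arrays. The names add_loop, add_closed_form, and closed_form_matches are made up for illustration, and unsigned char is an assumption chosen so that wraparound is well defined. It shows why the collapse is legal, not what any particular compiler actually emits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Same shape as VectorCount::add(): count passes over a small array.
void add_loop(unsigned char* arr, std::size_t size, unsigned long long count) {
    assert(arr != nullptr);
    while (count--) {
        for (std::size_t i = 0; i < size; i++) {
            arr[i]++;  // unsigned char wraps around modulo 256
        }
    }
}

// Hypothetical closed form: each element grows by count in total,
// so one pass over the array is enough.
void add_closed_form(unsigned char* arr, std::size_t size, unsigned long long count) {
    assert(arr != nullptr);
    for (std::size_t i = 0; i < size; i++) {
        arr[i] = static_cast<unsigned char>(arr[i] + count);
    }
}

// Runs both versions on zero-initialized arrays and reports whether they agree.
bool closed_form_matches(unsigned long long count) {
    unsigned char a[10]{0};
    unsigned char b[10]{0};
    add_loop(a, 10, count);
    add_closed_form(b, 10, count);
    return std::memcmp(a, b, sizeof a) == 0;
}
```

Calling closed_form_matches with a small count keeps the check fast. The loop version does count * size increments, while the closed form does size additions regardless of count.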

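However, the more complex the optimizer's job is, the less likely it is to spot these opportunities. Passing your array and arguments through the machinery of the std::thread constructor may prevent the optimizer from realizing that it can optimize your code this way. Rearranging your code to: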

std::thread t1([] {
    char arrA[10]{0};
    VectorCount v1(arrA, 10, 50'000'000);
    v1.add();
});
std::thread t2([] {
    char arrB[10]{0};
    VectorCount v2(arrB, 10, 50'000'000);
    v2.add();
});
// join both threads before they go out of scope
t1.join();
t2.join();

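makes it more similar to your single-threaded version, so the optimizer is able to work its magic and the parallel version becomes much faster than before: https://godbolt.org/z/9j9ocoMTd. If you look at the assembly of the first version, you will notice that 50000000 is mentioned but 100000000 is not. In the second version 50000000 is absent as well, because both versions have been optimized to nothing.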

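Note that it is still slower than the single-threaded version, but since the single-threaded version takes only about 100 nanoseconds, the overhead of creating and joining 2 threads is probably larger than what parallelization saves.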
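
To get a feel for that overhead, one could time the creation and joining of two threads that do no work at all. This is a minimal sketch: time_two_empty_threads is a hypothetical helper, and the number it returns depends heavily on the OS, the hardware, and the system load, so treat it as an order-of-magnitude estimate only.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper: measures how long it takes to start two threads
// with empty bodies and wait for both of them to finish.
std::chrono::nanoseconds time_two_empty_threads() {
    const auto start = std::chrono::steady_clock::now();
    std::thread t1([] {});
    std::thread t2([] {});
    t1.join();
    t2.join();
    const auto stop = std::chrono::steady_clock::now();
    assert(stop >= start);  // steady_clock never goes backwards
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
}
```

If the returned duration is much larger than the run time of the single-threaded code, splitting that work across two threads cannot pay off. On GCC and Clang this typically needs to be built with -pthread.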

On my machine with Visual Studio, this code takes 85 ms for the single-threaded version and 45 ms for the parallel version (Visual Studio is unable to remove your code entirely).

Preventing the optimizer from removing the code on GCC produces more similar results: https://godbolt.org/z/Mhf5xjx33
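
One way to keep the optimizer from removing the loops, which is not necessarily what the linked example does, is to access the array through a volatile pointer. The sketch below assumes a hypothetical free function add_volatile in place of VectorCount::add(), and uses unsigned char so that wraparound is well defined.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical variant of VectorCount::add(). The volatile qualifier makes
// every read and write of arr[i] an observable side effect, so the compiler
// has to perform each increment instead of collapsing or deleting the loops.
void add_volatile(volatile unsigned char* arr, std::size_t size, unsigned long long count) {
    assert(arr != nullptr);
    while (count--) {
        for (std::size_t i = 0; i < size; i++) {
            // Written as an explicit read and write; this avoids ++ on a
            // volatile operand, which C++20 deprecates.
            arr[i] = static_cast<unsigned char>(arr[i] + 1);
        }
    }
}
```

Keep in mind that volatile also blocks vectorization and forces a load and a store for every increment, so the timings describe this deliberately pessimized loop rather than the original code.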

【Comments】: