C++ 11 std thread summation with atomic very slow

【Question title】: C++ 11 std thread summation with atomic very slow
【Posted】: 2015-03-04 00:24:53
【Question】:

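I wanted to learn to use C++ 11 std::threads with VS2012, so I wrote a very simple C++ console program with two threads that just increment a counter. I also want to test the performance difference when two threads are used. The test program is given below: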

#include <iostream>
#include <thread>
#include <conio.h>
#include <atomic>
#include <chrono>   // std::chrono is used for timing in main

std::atomic<long long> sum(0);
//long long sum;

using namespace std;

const int RANGE = 100000000;

void test_without_threds()
{
    sum = 0;
    for(unsigned int j = 0; j < 2; j++)
    for(unsigned int k = 0; k < RANGE; k++)
        sum ++ ;
}

void call_from_thread(int tid) 
{
    for(unsigned int k = 0; k < RANGE; k++)
        sum ++ ;
}

void test_with_2_threds()
{
    std::thread t[2];
    sum = 0;
    //Launch a group of threads
    for (int i = 0; i < 2; ++i) {
        t[i] = std::thread(call_from_thread, i);
    }

    //Join the threads with the main thread
    for (int i = 0; i < 2; ++i) {
        t[i].join();
    }
}

int main()
{
    chrono::time_point<chrono::system_clock> start, end;

    cout << "-----------------------------------------\n";
    cout << "test without threds()\n";

    start = chrono::system_clock::now();
    test_without_threds();
    end = chrono::system_clock::now();

    chrono::duration<double> elapsed_seconds = end-start;

    cout << "finished calculation for "
              << chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << "ms.\n";

    cout << "sum:\t" << sum << "\n";

    cout << "-----------------------------------------\n";
    cout << "test with 2_threds\n";

    start = chrono::system_clock::now();
    test_with_2_threds();
    end = chrono::system_clock::now();

    cout << "finished calculation for "
              << chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << "ms.\n";

    cout << "sum:\t" << sum << "\n";

    _getch();
    return 0;
}

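Now, when I use a long long variable (commented out) for the counter, I get a value that differs from the correct one: 100000000 instead of 200000000. I'm not sure why that happens. I suppose the two threads are changing the counter at the same time, but I'm not sure how that really happens, because ++ is just a very simple instruction. It seems that the threads cache the sum variable at the beginning. Performance is 110 ms with two threads versus 200 ms with one thread.

So, according to the documentation, the correct way is to use std::atomic. However, performance is now much worse in both cases: about 3300 ms without threads versus 15820 ms with threads. What is the correct way to use std::atomic in this case?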

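【Comments】:

Are you compiling in Release mode?
Without the atomic, your program has undefined behavior. With the atomic, it is slower because you now have two threads constantly contending for the same variable. Multithreading is quite expensive and only worthwhile in certain special cases.
It is much slower even without threads. What other options do I have to make it work both correctly and fast?
VS2012 also has a rather inefficient std::atomic implementation. VS2013 is much better.
Replace the loop with sum += RANGE; and the performance of the atomic won't matter nearly as much. The general rule with concurrency is to minimize contention, not to pointlessly maximize it.

Tags: c++ c++11 visual-studio-2012 atomic stdthread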

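【Answer 1】:

"I'm not sure why that happens. I suppose the two threads are changing the counter at the same time, but I'm not sure how that really happens, because ++ is just a very simple instruction."

Each thread pulls the value of sum into a register, increments the register, and only writes it back to memory at the end of the loop.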

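"So, according to the documentation, the correct way is to use std::atomic. However, performance is now much worse in both cases: about 3300 ms without threads versus 15820 ms with threads. What is the correct way to use std::atomic in this case?"

You are paying for the synchronization that std::atomic provides. It won't be nearly as fast as an unsynchronized integer, though you can get a small performance improvement by relaxing the memory order of the add: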

sum.fetch_add(1, std::memory_order_relaxed);

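In this particular case, you are compiling for x86 and operating on a 64-bit integer. This means the compiler has to generate code that updates the value using two 32-bit operations; if you change the target platform to x64, the compiler will generate code that performs the increment in a single 64-bit operation.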

As a general rule, the solution to this kind of problem is to reduce the number of writes to shared data.

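【Comments】:

How do you generate one 64-bit atomic operation out of two 32-bit atomic operations?
fetch_add is not much faster for me; the atomic calculation is just as slow, even without threads. I guess both processors work in their own registers, having copied the value from shared memory at the start?
@Yakk: It involves spinning on the lock cmpxchg8b instruction, which performs an atomic compare-and-exchange on 8 bytes of memory. The addition itself is done on two 32-bit registers with the add and adc (add with carry) instructions.
@BajMile: Using std::atomic ensures that each thread must commit its result back to memory after every operation. In this case the two threads are stepping on each other's toes: they are both fighting over who gets to update the same piece of memory.
@collin That is one 64-bit atomic operation that updates the value. Granted, there are two 32-bit machine-code operations that generate the value to update with. My point is that an atomic operation cannot be decomposed.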

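【Answer 2】:

Your code has a couple of problems. First, all the "inputs" involved are compile-time constants, so a good compiler can pre-compute the result for the single-threaded code, so that (regardless of the value you give for RANGE) it shows as running in 0 ms.

Second, you are sharing a single variable (sum) between all the threads, which forces all of their accesses to be synchronized at that point. Without synchronization, you get undefined behavior. As you have already found, synchronizing access to that variable is quite expensive, so you normally want to avoid it whenever reasonable.

One way to do that is to use a separate subtotal for each thread, so that the threads can do their additions in parallel without synchronizing, and then add the individual results together at the end.

Another point is to guard against false sharing. False sharing arises when two (or more) threads write to data that is genuinely separate but has been allocated in the same cache line. In that case, access to the memory can be serialized even though (as already noted) no data is actually shared between the threads.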

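Based on those factors, I have rewritten your code slightly to create a separate sum variable for each thread. Those variables are of a class type that gives (fairly) direct access to the data, but does stop the optimizer from seeing that it can do the entire computation at compile time, so we end up comparing one thread to 4 (which reminds me: I increased the number of threads from 2 to 4, since I'm using a quad-core machine). I moved that number into a const variable, though, so it should be easy to test with a different number of threads.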

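#include <iostream>
#include <thread>
#include <conio.h>
#include <atomic>
#include <numeric>
#include <algorithm>  // std::fill_n
#include <chrono>     // timing in main
#include <iterator>   // std::begin, std::end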

const int num_threads = 4;

struct val {
    long long sum;
    int pad[2];

    val &operator=(long long i) { sum = i; return *this; }
    operator long long &() { return sum; }
    operator long long() const { return sum; }
};

val sum[num_threads];

using namespace std;

const int RANGE = 100000000;

void test_without_threds()
{
    sum[0] = 0LL;
    for(unsigned int j = 0; j < num_threads; j++)
    for(unsigned int k = 0; k < RANGE; k++)
        sum[0] ++ ;
}

void call_from_thread(int tid) 
{
    for(unsigned int k = 0; k < RANGE; k++)
        sum[tid] ++ ;
}

void test_with_threads()
{
    std::thread t[num_threads];
    std::fill_n(sum, num_threads, 0);
    //Launch a group of threads
    for (int i = 0; i < num_threads; ++i) {
        t[i] = std::thread(call_from_thread, i);
    }

    //Join the threads with the main thread
    for (int i = 0; i < num_threads; ++i) {
        t[i].join();
    }
    long long total = std::accumulate(std::begin(sum), std::end(sum), 0LL);
}

int main()
{
    chrono::time_point<chrono::system_clock> start, end;

    cout << "-----------------------------------------\n";
    cout << "test without threds()\n";

    start = chrono::system_clock::now();
    test_without_threds();
    end = chrono::system_clock::now();

    chrono::duration<double> elapsed_seconds = end-start;

    cout << "finished calculation for "
              << chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << "ms.\n";

    cout << "sum:\t" << sum << "\n";  // sum is an array here, so this prints its address, not the total

    cout << "-----------------------------------------\n";
    cout << "test with threads\n";

    start = chrono::system_clock::now();
    test_with_threads();
    end = chrono::system_clock::now();

    cout << "finished calculation for "
              << chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << "ms.\n";

    cout << "sum:\t" << sum << "\n";  // sum is an array here, so this prints its address, not the total

    _getch();
    return 0;
}

When I run this, my results are closer to what I'd guess you were hoping for:

-----------------------------------------
test without threds()
finished calculation for 78ms.
sum:    000000013FCBC370
-----------------------------------------
test with threads
finished calculation for 15ms.
sum:    000000013FCBC370

...the sums are identical, but N threads increase the speed by a factor of approximately N (up to the number of cores available).

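【Comments】:

Yes, if the threads use their own counters, it works. There seems to be no way to work asynchronously and still have it run both correctly and fast. Thanks.

【Answer 3】:

Try using prefix increment, which will improve performance. In tests on my machine, std::memory_order_relaxed gave no advantage.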

【Comments】: