Using BinaryOp within std::reduce() from <numeric> with parallel execution policy

【Question title】: Using BinaryOp within std::reduce() from <numeric> with parallel execution policy
【Posted】: 2021-03-02 20:33:34
【Question】:

I have run into a problem using the std::reduce() function from the <numeric> standard library header.

Since I found a workaround, I will show the expected behavior first:

uint64_t f(uint64_t n)
{
   return 1ull; 
}

uint64_t solution(uint64_t N) // here N == 10000000
{
    uint64_t r(0);

    // persistent array of primes
    const auto& app = YTL::AccumulativePrimes::global().items(); 

    auto citEnd = std::upper_bound(app.cbegin(), app.cend(), 2*N);
    auto citBegin = std::lower_bound(app.cbegin(), citEnd, N);

    std::vector<uint64_t> v(citBegin, citEnd);

    std::for_each(std::execution::par,
                    v.begin(), v.end(),
                    [](auto& p)->void {p = f(p); });

    r = std::reduce(std::execution::par, v.cbegin(), v.cend(), 0ull); // 0ull keeps the accumulator uint64_t rather than int
    return r; // here is correct answer: 606028
}

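However, if I try to avoid the intermediate vector and instead apply the binary operator on the fly inside reduce() itself, also in parallel, I get a different answer on every run: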

uint64_t f(uint64_t n)
{
   return 1ull;
}

uint64_t solution(uint64_t N) // here N == 10000000
{
    uint64_t r(0);

    // persistent array of primes
    const auto& app = YTL::AccumulativePrimes::global().items(); 

    auto citEnd = std::upper_bound(app.cbegin(), app.cend(), 2*N);
    auto citBegin = std::lower_bound(app.cbegin(), citEnd, N);

    // bug in parallel reduce?! 
    r = std::reduce(std::execution::par,
                    citBegin, citEnd, 0ull,
                    [](const uint64_t& r, const uint64_t& v)->uint64_t { return r + f(v); });
    return r; // here the value of r is different every time I run!! 
}

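Can anyone explain why the latter usage is wrong?

I am using the MS C++ compiler cl.exe: version 19.28.29333.0;
Windows SDK version: 10.0.18362.0;
Platform toolset: Visual Studio 2019 (v142)
C++ language standard: Preview - Features from the Latest C++ Working Draft (/std:c++latest)
Computer: Dell XPS 9570, i7-8750H CPU @ 2.20GHz, 16GB RAM; OS: Windows 10 64-bit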

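【Question comments】:

Your use of std::reduce appears to be equivalent to std::count_if with a trivial predicate that always returns true. Or, for that matter, to std::distance(citBegin, citEnd) (but there is no parallel version of std::distance, so std::count_if may be faster if the iterators are not random-access).
The actual f() function is non-trivial; I simplified the real test case to rule out other doubts. So you are right in the answer below: my binary operator is not commutative. I therefore have to either stick with a parallel for_each() or transform() into an intermediate container and then reduce that in parallel, or use the sequential policy, in which case reduce is equivalent to a serial accumulate. In my case the first approach, with the vector, is still faster despite the memory-management overhead, and it avoids the non-determinism altogether. Thanks for your help!
You are looking for std::transform_reduce. It has an overload that does exactly what you need.
Very nice. Using std::transform_reduce now gives over 30% better performance than the parallel variant with the intermediate vector. Awesome!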

Tags: c++ multithreading visual-c++ c++17 reduce

【Answer 1】:

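From cppreference: "The behavior is non-deterministic if binary_op is not associative or not commutative." That is what you are observing; yours is not commutative.

Your binary operation assumes that the first argument is always the accumulator and the second argument is always an element value. In general that is not the case. For example, the simplest form of parallel reduction splits the range into two halves, reduces each half separately, and then combines the two results using the same operation, which in your case would lose half of the values.

What you really want is std::transform_reduce. For example: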
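
To make the loss concrete, here is a minimal sequential sketch of that split-in-half scheme. It is an illustration only: `op` and `reduce_in_halves` are hypothetical names, `f` is the question's simplified stand-in, and a real std::reduce implementation is free to group and reorder elements in other ways.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Stand-in for the question's f(): every element contributes 1.
std::uint64_t f(std::uint64_t) { return 1; }

// The question's binary operation: it treats the left argument as the
// accumulator and the right argument as a raw element.
std::uint64_t op(std::uint64_t acc, std::uint64_t v) { return acc + f(v); }

// Hypothetical helper: reduce each half sequentially, then combine the two
// partial results with the same operation, as a parallel reduce may do.
std::uint64_t reduce_in_halves(const std::vector<std::uint64_t>& v) {
    const auto mid = v.begin() + static_cast<std::ptrdiff_t>(v.size() / 2);
    const std::uint64_t left  = std::accumulate(v.begin(), mid, std::uint64_t{0}, op);
    const std::uint64_t right = std::accumulate(mid, v.end(), std::uint64_t{0}, op);
    // 'right' is already a partial result, but op passes it through f() as
    // if it were a single element, so it contributes only 1.
    return op(left, right);
}
```

With eight input elements, a plain left-to-right std::accumulate using `op` yields 8, while `reduce_in_halves` yields 4 + f(4) = 5: the whole right half collapses into a single increment.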

r = std::transform_reduce(
        std::execution::par, citBegin, citEnd, 0ull,
        std::plus<uint64_t>{}, f);
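
For reference, a self-contained sketch of the same call on a plain vector. The helper name `sum_of_f` is hypothetical and `f` is again the question's simplified stand-in. The execution policy is left out here so the snippet builds without a parallel backend; passing std::execution::par as the first argument (with <execution> included) selects the parallel overload used above.

```cpp
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Stand-in for the question's f(): every element contributes 1.
std::uint64_t f(std::uint64_t) { return 1; }

// Hypothetical helper: apply f to each element, then sum the results.
// std::plus is the reduction step; f is the per-element transform step.
std::uint64_t sum_of_f(const std::vector<std::uint64_t>& v) {
    return std::transform_reduce(v.cbegin(), v.cend(), std::uint64_t{0},
                                 std::plus<std::uint64_t>{}, f);
}
```

Because the reduction is plain addition, which is associative and commutative, and f is applied to each element exactly once, the way the implementation groups the partial sums does not change the result.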

【Comments】: