Loop fusion in C++ (how to help the compiler?): answers

【Question title】: Loop fusion in C++ (how to help the compiler?)
【Posted】: 2017-02-01 04:42:05
【Question】:

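I am trying to understand under which circumstances a C++ compiler is able to perform loop fusion and when it is not.

The following code measures the performance of two different ways of computing the square of the doubled value (f(x) = (2*x)^2) for every value in a vector.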

#include <chrono>
#include <iostream>
#include <numeric>
#include <vector>

constexpr int square( int x )
{
    return x * x;
}

constexpr int times_two( int x )
{
    return 2 * x;
}

// map ((^2) . (*2)) $ [1,2,3]
int manual_fusion( const std::vector<int>& xs )
{
    std::vector<int> zs;
    zs.reserve( xs.size() );
    for ( int x : xs )
    {
        zs.push_back( square( times_two( x ) ) );
    }
    return zs[0];
}

// map (^2) . map (*2) $ [1,2,3]
int two_loops( const std::vector<int>& xs )
{
    std::vector<int> ys;
    ys.reserve( xs.size() );
    for ( int x : xs )
    {
        ys.push_back( times_two( x ) );
    }

    std::vector<int> zs;
    zs.reserve( ys.size() );
    for ( int y : ys )
    {
        zs.push_back( square( y ) );
    }
    return zs[0];
}

template <typename F>
void test( F f )
{
    const std::vector<int> xs( 100000000, 42 );

    const auto start_time = std::chrono::high_resolution_clock::now();
    const auto result = f( xs );
    const auto end_time = std::chrono::high_resolution_clock::now();

    const auto elapsed = end_time - start_time;
    const auto elapsed_us = std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
    std::cout << elapsed_us / 1000 << " ms - " << result << std::endl;
}

int main()
{
    test( manual_fusion );
    test( two_loops );
}

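The version with two loops takes about twice as much time as the version with one loop, even with -O3 on both GCC and Clang.

Is there a way to let the compiler optimize two_loops so that it is as fast as manual_fusion, without operating in place in the second loop? The reason I am asking is that I would like to make chained calls to my library FunctionalPlus, such as fplus::enumerate(fplus::transform(f, xs));, faster.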
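
For illustration, what manual_fusion does by hand can also be expressed at the library level: compose the two functions first, then traverse the input once. The sketch below assumes C++14 or later; compose, map_once and fused_by_composition are hypothetical names chosen for this example and are not taken from FunctionalPlus.

```cpp
#include <cassert>
#include <vector>

constexpr int square( int x ) { return x * x; }
constexpr int times_two( int x ) { return 2 * x; }

// Hypothetical helper: returns the function x -> g( f( x ) ).
template <typename F, typename G>
auto compose( F f, G g )
{
    return [f, g]( int x ) { return g( f( x ) ); };
}

// Hypothetical helper: a single pass with a single allocation.
template <typename F>
std::vector<int> map_once( F f, const std::vector<int>& xs )
{
    std::vector<int> ys;
    ys.reserve( xs.size() );
    for ( int x : xs )
    {
        ys.push_back( f( x ) );
    }
    return ys;
}

int fused_by_composition( const std::vector<int>& xs )
{
    assert( !xs.empty() ); // indexing element 0 requires a non-empty input
    // times_two runs first, then square, as in manual_fusion.
    return map_once( compose( times_two, square ), xs )[0];
}
```

This moves the fusion from the compiler to the caller, so it only helps when the chained stages are element-wise maps that can be composed up front.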

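【Comments】:

You have two allocations in the second version. Make it operate on (modify) ys directly.
Thank you very much, it works! Do you think it is possible to get the compiler to optimize the allocation away?
I don't think so. That would take too much guessing (and a single guess already prohibits the optimization).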

Tags: c++ loops optimization compiler-optimization

【Solution 1】:

You can try modifying your two_loops function as follows:

int two_loops( const std::vector<int>& xs )
{
    std::vector<int> zs;
    zs.reserve( xs.size() );
    for ( int x : xs )
    {
        zs.push_back( times_two( x ) );
    }

    for ( std::size_t i = 0; i < zs.size(); i++ )
    {
        zs[i] = ( square( zs[i] ) );
    }
    return zs[0];
}

The point is to avoid allocating memory twice and calling push_back on a second vector.
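
The same idea can also be written with std::transform, which is allowed to write its output over its own input range. This is a minimal sketch rather than a benchmarked replacement; the name two_loops_in_place is made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

constexpr int square( int x ) { return x * x; }
constexpr int times_two( int x ) { return 2 * x; }

// Hypothetical name; same result as two_loops, but with one allocation.
int two_loops_in_place( const std::vector<int>& xs )
{
    assert( !xs.empty() ); // zs[0] below requires at least one element

    // First pass: the only allocation, sized up front.
    std::vector<int> zs( xs.size() );
    std::transform( xs.begin(), xs.end(), zs.begin(), times_two );

    // Second pass: overwrite zs in place, no second vector.
    std::transform( zs.begin(), zs.end(), zs.begin(), square );
    return zs[0];
}
```

Note that constructing zs with a size zero-initializes its elements, which reserve plus push_back does not do, so timings may differ from the version above.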

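【Comments】:

Thanks, that works. But I was hoping it would be possible without removing the allocation. The reason is that I would like to make chained calls to my library FunctionalPlus, such as fplus::enumerate(fplus::transform(f, xs));, faster. I have just extended my question accordingly.
If you want chained calls, you will have to pay the price of a longer execution time. At least you can avoid the allocations and push_backs: the first call creates the vector, and subsequent calls change its contents.
Unfortunately, changing the contents does not fit the approach I use in FunctionalPlus.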
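
One way to read the suggestion above ("the first call creates the vector, subsequent calls change its contents") without mutating the caller's data is to take the vector by value. The sketch below is an assumption about how that could look; map_reuse and chained are hypothetical names, not FunctionalPlus functions, and the approach only applies when the input and output element types match.

```cpp
#include <cassert>
#include <vector>

constexpr int square( int x ) { return x * x; }
constexpr int times_two( int x ) { return 2 * x; }

// Hypothetical helper: applies f to every element.
// xs is taken by value, so an lvalue argument is copied (the caller's
// vector stays untouched) while a temporary is moved in and its buffer
// is reused.
template <typename F>
std::vector<int> map_reuse( F f, std::vector<int> xs )
{
    for ( int& x : xs )
    {
        x = f( x );
    }
    return xs; // a by-value parameter is moved out here, not copied
}

int chained( const std::vector<int>& xs )
{
    assert( !xs.empty() ); // zs[0] below requires at least one element
    // The inner call copies xs once; the outer call reuses that buffer.
    const std::vector<int> zs = map_reuse( square, map_reuse( times_two, xs ) );
    return zs[0];
}
```

From the caller's point of view the functions still behave as pure functions; the in-place update only ever touches a copy or a temporary.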