【Posted】: 2016-01-12 10:56:14
【Question】:
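As part of a larger numerical computation, I have to perform the trivial task of summing the element-wise products of two vectors. Since this task has to be done very often, I tried to make use of the auto-vectorization capabilities of my compiler (VC2015). I introduced a temporary vector in which the products are stored in a first loop, and then performed the summation in a second loop. Optimization was set to full, favoring fast code. This way, the first loop gets vectorized by the compiler (I know this from the compiler output).
The result was surprising. On my machine (Core i5-4570, 3.20 GHz) the vectorized code runs 3 times slower than the simple code. Can anybody explain why, and what could improve the performance? I have put both versions of the algorithm fragment into a minimal running example, which I used for testing myself: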
#include "stdafx.h"
#include <cstdlib>   // rand
#include <vector>
#include <Windows.h>
#include <iostream>
using namespace std;
int main()
{
    // Prepare timer
    LARGE_INTEGER freq, c_start, c_stop;
    QueryPerformanceFrequency(&freq);
    int size = 20000000; // size of data
    double v = 0;
    // Some data vectors. The data inside doesn't matter
    vector<double> vv(size);
    vector<double> tt(size);
    vector<float> dd(size);
    // Put random values into the vectors
    for (int i = 0; i < size; i++)
    {
        tt[i] = rand();
        dd[i] = rand();
    }
    // The simple version of the algorithm fragment
    QueryPerformanceCounter(&c_start); // Start timer
    for (int p = 0; p < size; p++)
    {
        v += tt[p] * dd[p];
    }
    QueryPerformanceCounter(&c_stop); // Stop timer
    cout << "Simple version took: " << ((double)(c_stop.QuadPart - c_start.QuadPart)) / ((double)freq.QuadPart) << " s" << endl;
    cout << v << endl; // Use v once so that its computation is not optimized away.
    // The version that is auto-vectorized
    for (int i = 0; i < size; i++)
    {
        tt[i] = rand();
        dd[i] = rand();
    }
    v = 0;
    QueryPerformanceCounter(&c_start); // Start timer
    for (int p = 0; p < size; p++) // This loop is vectorized according to compiler output
    {
        vv[p] = tt[p] * dd[p];
    }
    for (int p = 0; p < size; p++)
    {
        v += vv[p];
    }
    QueryPerformanceCounter(&c_stop); // Stop timer
    cout << "Vectorized version took: " << ((double)(c_stop.QuadPart - c_start.QuadPart)) / ((double)freq.QuadPart) << " s" << endl;
    cout << v << endl; // Use v once so that its computation is not optimized away.
    cin.ignore();
    return 0;
}
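One thing that may be worth checking is how much memory each version touches. The sketch below is not the code from the example above. It restates the two variants as standalone functions and adds a rough byte count per call. The names `dot_fused`, `dot_two_pass`, `bytes_fused`, `bytes_two_pass` and `seconds` are hypothetical helpers introduced here for illustration, and the byte counts assume that the arrays are far larger than the caches and that no write-allocate traffic is counted.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Single pass: each element of a and b is read once, nothing is written back.
// Assumes b.size() >= a.size().
double dot_fused(const std::vector<double>& a, const std::vector<float>& b)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += a[i] * b[i];
    return sum;
}

// Two passes, as in the question: the products go through a temporary buffer
// that is written in full and then read back in full.
// Assumes b.size() >= a.size() and tmp.size() >= a.size().
double dot_two_pass(const std::vector<double>& a, const std::vector<float>& b,
                    std::vector<double>& tmp)
{
    for (std::size_t i = 0; i < a.size(); ++i)
        tmp[i] = a[i] * b[i];
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += tmp[i];
    return sum;
}

// Rough estimate of bytes moved per call for n elements, ignoring caches.
std::size_t bytes_fused(std::size_t n)
{
    return n * (sizeof(double) + sizeof(float));      // read a, read b
}

std::size_t bytes_two_pass(std::size_t n)
{
    return bytes_fused(n) + 2 * n * sizeof(double);   // plus write tmp, read tmp
}

// Portable timer: runs f once and returns the elapsed time in seconds.
template <class F>
double seconds(F&& f)
{
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

With 8-byte `double` and 4-byte `float`, `bytes_fused(20000000)` comes to 240 MB and `bytes_two_pass(20000000)` to 560 MB, so the two-loop variant moves at least about 2.3 times as much data. That would be consistent with a memory-bound slowdown, but it is only an estimate and not a measurement. Timing a call such as `seconds([&] { v = dot_fused(tt, dd); })` against the two-pass variant for a small `size` that fits in cache would show whether the gap shrinks there.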
【Discussion】:
Tags: c++ performance visual-c++ vectorization