【发布时间】:2019-05-21 04:29:19
【问题描述】:
我正在编写一个基本上利用 SSE2 优化此代码的代码:
double *pA = a;
double *pB = b[voiceIndex];
double *pC = c[voiceIndex];
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
pC[sampleIndex] = exp((mMin + std::clamp(pA[sampleIndex] + pB[sampleIndex], 0.0, 1.0) * mRange) * ln2per12);
}
在这个:
double *pA = a;
double *pB = b[voiceIndex];
double *pC = c[voiceIndex];
// SSE2
__m128d bound_lower = _mm_set1_pd(0.0);
__m128d bound_upper = _mm_set1_pd(1.0);
__m128d rangeLn2per12 = _mm_set1_pd(mRange * ln2per12);
__m128d minLn2per12 = _mm_set1_pd(mMin * ln2per12);
__m128d loaded_a = _mm_load_pd(pA);
__m128d loaded_b = _mm_load_pd(pB);
__m128d result = _mm_add_pd(loaded_a, loaded_b);
result = _mm_max_pd(bound_lower, result);
result = _mm_min_pd(bound_upper, result);
result = _mm_mul_pd(rangeLn2per12, result);
result = _mm_add_pd(minLn2per12, result);
double *pCEnd = pC + roundintup8(blockSize);
for (; pC < pCEnd; pA += 8, pB += 8, pC += 8) {
_mm_store_pd(pC, result);
loaded_a = _mm_load_pd(pA + 2);
loaded_b = _mm_load_pd(pB + 2);
result = _mm_add_pd(loaded_a, loaded_b);
result = _mm_max_pd(bound_lower, result);
result = _mm_min_pd(bound_upper, result);
result = _mm_mul_pd(rangeLn2per12, result);
result = _mm_add_pd(minLn2per12, result);
_mm_store_pd(pC + 2, result);
loaded_a = _mm_load_pd(pA + 4);
loaded_b = _mm_load_pd(pB + 4);
result = _mm_add_pd(loaded_a, loaded_b);
result = _mm_max_pd(bound_lower, result);
result = _mm_min_pd(bound_upper, result);
result = _mm_mul_pd(rangeLn2per12, result);
result = _mm_add_pd(minLn2per12, result);
_mm_store_pd(pC + 4, result);
loaded_a = _mm_load_pd(pA + 6);
loaded_b = _mm_load_pd(pB + 6);
result = _mm_add_pd(loaded_a, loaded_b);
result = _mm_max_pd(bound_lower, result);
result = _mm_min_pd(bound_upper, result);
result = _mm_mul_pd(rangeLn2per12, result);
result = _mm_add_pd(minLn2per12, result);
_mm_store_pd(pC + 6, result);
loaded_a = _mm_load_pd(pA + 8);
loaded_b = _mm_load_pd(pB + 8);
result = _mm_add_pd(loaded_a, loaded_b);
result = _mm_max_pd(bound_lower, result);
result = _mm_min_pd(bound_upper, result);
result = _mm_mul_pd(rangeLn2per12, result);
result = _mm_add_pd(minLn2per12, result);
}
我会说它运作良好。但是,找不到 SSE2 的任何 exp 函数来完成操作链。
正在阅读this,看来我需要从库中调用标准exp()?
真的吗?这不是惩罚吗?还有其他方法吗?不同的内置函数?
我正在使用MSVC、/arch:SSE2、/O2,正在生成 32 位代码。
【问题讨论】:
-
它需要有多准确?当然,有各种各样的技巧,在准确性/时间方面有各种权衡
-
也许指数的近似值会起作用?
-
pC + 1与pC和pC+2重叠。请记住,每个向量的宽度为 2doubles,而您使用的是double*而不是__m128d*。 -
另见sse_mathfun.h。它是用 SSE 内在函数编写的。它的起源是标量 cephes 库。
标签: c++ simd intrinsics sse2 exp