为等距角快速准确地迭代生成正弦和余弦答案

【问题标题】：Fast and accurate iterative generation of sine and cosine for equally-spaced angles为等距角快速准确地迭代生成正弦和余弦
【发布时间】：2018-08-07 21:18:33
【问题描述】：

在某些应用中，需要多个角度的正弦和余弦，其中角度是通过重复将相等大小的增量 incr 添加到起始值 base 得出的。出于性能原因，不是为每个生成的角度调用 sin()、cos() 标准数学库函数（或者可能是非标准 sincos() 函数），计算 sin(base) 可能非常有利 和 cos(base) 只有一次，然后通过 angle-sum formulas 的应用导出所有其他正弦和余弦：

sin(base+incr) = cos(incr) · sin(base) + sin(incr) · cos(base)
cos(base+incr) = cos (incr) · cos(base) - sin(incr) · sin(base)

这仅需要对比例因子 sin(incr) 和 cos(incr) 进行一次预计算，无论执行多少次迭代。

这种方法存在几个问题。如果增量很小，cos(incr) 将是一个接近 1 的数字，在以有限精度浮点格式计算时，通过隐式减法抵消导致精度损失。此外，由于计算不以有利的数值形式 sin(base+incr) = sin(base) + adjust 排列，因此会产生不必要的舍入误差，其中计算量 adjust 的大小明显小于 sin(base)（类似于余弦）。

由于一个人通常会应用数十到数百个迭代步骤，因此这些错误会累积。如何以最有利于保持高精度的方式构建迭代计算？如果 fused multiply-add 操作 (FMA) 可用（通过标准数学函数 fma() 和 fmaf() 公开），应该对算法进行哪些更改？

【问题讨论】：

在重写标准函数以利用速度时，通常会损失准确性或范围或便携性。然而，速度、范围、便携性与准确性的重要性并不相同，因此我们只能比较不同的质量。 IOW，如何同时评价快速和准确？ double my_sin(double x) { return x; } 速度快、便携性高且精度高，可能超过所有 double（小的）的 25%，但对于大值却很糟糕。

标签： c algorithm floating-point trigonometry

【解决方案1】：

将half-angle formula 应用于正弦可以解决问题中提到的两个影响准确性的问题：

sin(incr/2) = √((1-cos(incr))/2) ⇒
sin²(incr/2) = (1-cos(incr) ))/2 ⇔
2·sin²(incr/2) = 1-cos(incr) ⇔
1-2·sin²(incr/2 ) = cos(incr)

将其代入原始公式会得到此中间表示：

sin(base+incr) = (1 - 2·sin²(incr/2)) · sin(base) + sin(incr) · cos(base)
cos(base+incr) = (1 - 2·sin²(incr/2)) · cos(base) - sin(incr) · sin(base)

通过对术语进行简单的重新排序，即可得出所需的公式形式：

sin(base+incr) = sin(base) + (sin(incr) · cos(base) - 2·sin²(incr/2) · sin(base))
cos(base+incr) = cos(base) - (2·sin²(incr/2) · cos(base) + sin(incr) · sin(base))

和原来的公式一样，这只需要一次性预计算两个比例因子，即2·sin²(incr/2)和sin(incr) .对于小增量，它们都很小：保留完整的准确性。

关于如何将 FMA 应用于此计算，有两种选择。可以通过取消使用单次调整的方法来最小化操作次数，而是使用两次，希望 FMA 操作减少的舍入误差（一次舍入两次操作）可以补偿精度损失：

sin(base+incr) = fma (-2·sin²(incr/2), sin(base), fma ( sin(incr), cos(base), sin(base))) em>
cos(base+incr) = fma (-2·sin²(incr/2), cos(base), fma (-sin(incr), sin(base), cos(base)) )

另一种选择是将单个 FMA 应用于改进的公式，尽管目前尚不清楚两个乘法中的哪一个应该映射到 FMA 内的未取整乘法：

sin(base+incr) = sin(base) + fma (sin(incr), cos(base), -2·sin²(incr/2) · sin(base))cos(base+incr) = cos(base) - fma (sin(incr), sin(base), 2·sin²(incr/2) · cos(base))

下面的脚手架通过生成许多 (base, incr) 对来评估上面讨论的每个计算替代方案，然后对它们中的每一个进行迭代一定数量的步骤同时收集生成的所有正弦和余弦值的错误。由此它为每个测试用例计算一个root-mean square error，分别用于正弦、余弦。最后报告在所有生成的测试用例中观察到的最大 RMS 误差。

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NAIVE    (1)
#define ROBUST   (2)
#define FAST     (3)
#define ACCURATE (4)
#define MODE (ACCURATE)

// Fixes via: Greg Rose, KISS: A Bit Too Simple. http://eprint.iacr.org/2011/007
static unsigned int z=362436069,w=521288629,jsr=362436069,jcong=123456789;
#define znew (z=36969*(z&0xffff)+(z>>16))
#define wnew (w=18000*(w&0xffff)+(w>>16))
#define MWC  ((znew<<16)+wnew)
#define SHR3 (jsr^=(jsr<<13),jsr^=(jsr>>17),jsr^=(jsr<<5)) /* 2^32-1 */
#define CONG (jcong=69069*jcong+13579)                     /* 2^32 */
#define KISS ((MWC^CONG)+SHR3)

int main (void)
{
    double sumerrsqS, sumerrsqC, rmsS, rmsC, maxrmsS = 0, maxrmsC = 0;
    double refS, refC, errS, errC;
    float base, incr, s0, c0, s1, c1, tt;
    int count, i;

    const int N = 100;  // # rotation steps per test case
    count = 2000000;    // # test cases (a pair of base and increment values)

#if MODE == NAIVE
    printf ("testing: NAIVE (without FMA)\n");
#elif MODE == FAST
    printf ("testing: FAST (without FMA)\n");
#elif MODE == ACCURATE
    printf ("testing: ACCURATE (with FMA)\n");
#elif MODE == ROBUST
    printf ("testing: ROBUST (with FMA)\n");
#else
#error unsupported MODE
#endif // MODE

    do {
        /* generate test case */
        base = (float)(KISS * 1.21e-10);      // starting angle, < 30 degrees
        incr = (float)(KISS * 2.43e-10 / N);  // increment, < 60/n degrees

        /* set up rotation parameters */
        s1 = sinf (incr);
#if MODE == NAIVE
        c1 = cosf (incr);
#else
        tt = sinf (incr * 0.5f);
        c1 = 2.0f * tt * tt;
#endif // MODE
        sumerrsqS = 0;
        sumerrsqC = 0;

        s0 = sinf (base); // initial sine
        c0 = cosf (base); // initial cosine

        /* run test case through N rotation steps */
        i = 0;
        do {         

            tt = s0; // old sine
#if MODE == NAIVE
            /* least accurate, 6 FP ops */
            s0 = c1 * tt + s1 * c0; // new sine
            c0 = c1 * c0 - s1 * tt; // new cosine
#elif MODE == ROBUST
            /* very accurate, 8 FP ops */
            s0 = ( s1 * c0 - c1 * tt) + tt; // new sine
            c0 = (-s1 * tt - c1 * c0) + c0; // new cosine
#elif MODE == FAST
            /* accurate and fast, 4 FP ops */
            s0 = fmaf (-c1, tt, fmaf ( s1, c0, tt)); // new sine
            c0 = fmaf (-c1, c0, fmaf (-s1, tt, c0)); // new cosine
#elif MODE == ACCURATE
            /* most accurate, 6 FP ops */
            s0 = tt + fmaf (s1, c0, -c1 * tt); // new sine
            c0 = c0 - fmaf (s1, tt,  c1 * c0); // new cosine
#endif // MODE
            i++;

            refS = sin (fma ((double)i, (double)incr, (double)base));
            refC = cos (fma ((double)i, (double)incr, (double)base));
            errS = ((double)s0 - refS) / refS;
            errC = ((double)c0 - refC) / refC;
            sumerrsqS = fma (errS, errS, sumerrsqS);
            sumerrsqC = fma (errC, errC, sumerrsqC);
        } while (i < N);

        rmsS = sqrt (sumerrsqS / N);
        rmsC = sqrt (sumerrsqC / N);
        if (rmsS > maxrmsS) maxrmsS = rmsS;
        if (rmsC > maxrmsC) maxrmsC = rmsC;

    } while (--count);

    printf ("max rms error sin = % 16.9e\n", maxrmsS);
    printf ("max rms error cos = % 16.9e\n", maxrmsC);

    return EXIT_SUCCESS;
}

测试支架的输出表明，最快的基于 FMA 的替代方法优于问题中的朴素方法，而更准确的基于 FMA 的替代方法是所考虑的替代方法中最准确的：

testing: NAIVE (without FMA)
max rms error sin =  4.837386842e-006
max rms error cos =  6.884047862e-006

testing: ROBUST (without FMA)
max rms error sin =  3.330292645e-006
max rms error cos =  4.297631502e-006

testing: FAST (with FMA)
max rms error sin =  3.532624939e-006
max rms error cos =  4.763623188e-006

testing: ACCURATE (with FMA)
max rms error sin =  3.330292645e-006
max rms error cos =  4.104813533e-006

【讨论】：

您还应该检查sin(base)*sin(base) + cos(base)*cos(base) 偏离统一的程度，看看是否/何时可能需要重整化。
FMA 有时没那么有用，但要精确：Is my fma() broken?

【解决方案2】：

如果您想在较长的迭代计数中最大限度地提高准确性，您可以在从前一个精确值生成增量结果的同时增量计算一个没有累积误差的精确值。

例如，如果您为 x=6 预计算 sin(incr*2^x) 和 cos(incr*2^x) ...例如，31，那么您可以使用角度和公式计算每个 incr=64*n 的结果，同时输出前 64 个值。 p>

每 64 个值您就丢弃增量生成的结果以支持精确的结果，因此不会长时间累积错误。

此外，由于您只需要来自任何精确基数的 64 个增量结果，因此您可以预先计算直接从基数而不是先前结果计算这些结果所需的 64 个正弦和余弦。

【讨论】：

【解决方案3】：

可以用以下方式重新排列 sin(base+incr) 和 cos(base+incr) 的方程：

sin(base+incr) = cos(incr) · sin(base) + sin(incr) · cos(base)
sin(base+incr) = sin (base) + (1 - cos(incr)) · -sin(base) + sin(incr) · cos(base)
sin(base+incr) = sin(base) + sin(incr) · (-1 / sin(incr) · (1 - cos(incr)) · sin(base) + cos(base))
sin(base+incr) = sin(base) + sin(incr) · (-tan(incr/2) · sin(base) + cos(base))

cos(base+incr) = cos(incr) · cos(base) - sin(incr) · sin(base)
cos(base+incr) = cos (base) - sin(incr) · (tan(incr/2) · cos(base) + sin(base))

这里我们使用公式(1-cos(x)/sin(x) = tan(x/2), 例如，参见here。与其他方法相比，这不会立即产生更准确的结果，但实际上它工作得很好，我们稍后会看到。

同样，这需要一次性预计算两个比例因子 sin(incr) 和 tan(incr/2)。在 C 中，我们可以用 4 个 fma-s 编写公式：

        s0 = fmaf ( s1, fmaf (-tt, c1, c0), tt); // new sine
        c0 = fmaf (-s1, fmaf ( c0, c1, tt), c0); // new cosine

完整更新的测试代码在此答案的末尾。使用gcc -O3 -Wall -m64 -march=skylake fastsincos.c -lm（GCC 版本 7.3），结果为：

testing: FAST (with FMA)
max rms error sin =  3.532624939e-06
max rms error cos =  4.763623188e-06

testing: ACCURATE (with FMA)
max rms error sin =  3.330292645e-06
max rms error cos =  4.104813533e-06

testing: FAST_ACC (with FMA)
max rms error sin =  3.330292645e-06
max rms error cos =  3.775300478e-06

新的解决方案FAST_ACC在这个测试中确实比其他解决方案更准确。

修改后的测试代码：

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NAIVE    (1)
#define ROBUST   (2)
#define FAST     (3)
#define ACCURATE (4)
#define FAST_ACC (5)
#define MODE (FAST_ACC)

// Fixes via: Greg Rose, KISS: A Bit Too Simple. http://eprint.iacr.org/2011/007
static unsigned int z=362436069,w=521288629,jsr=362436069,jcong=123456789;
#define znew (z=36969*(z&0xffff)+(z>>16))
#define wnew (w=18000*(w&0xffff)+(w>>16))
#define MWC  ((znew<<16)+wnew)
#define SHR3 (jsr^=(jsr<<13),jsr^=(jsr>>17),jsr^=(jsr<<5)) /* 2^32-1 */
#define CONG (jcong=69069*jcong+13579)                     /* 2^32 */
#define KISS ((MWC^CONG)+SHR3)

int main (void)
{
    double sumerrsqS, sumerrsqC, rmsS, rmsC, maxrmsS = 0, maxrmsC = 0;
    double refS, refC, errS, errC;
    float base, incr, s0, c0, s1, c1, tt;
    int count, i;

    const int N = 100;  // # rotation steps per test case
    count = 2000000;    // # test cases (a pair of base and increment values)

#if MODE == NAIVE
    printf ("testing: NAIVE (without FMA)\n");
#elif MODE == FAST
    printf ("testing: FAST (without FMA)\n");
#elif MODE == ACCURATE
    printf ("testing: ACCURATE (with FMA)\n");
#elif MODE == ROBUST
    printf ("testing: ROBUST (with FMA)\n");
#elif MODE == FAST_ACC
    printf ("testing: FAST_ACC (with FMA)\n");
#else
#error unsupported MODE
#endif // MODE

    do {
        /* generate test case */
        base = (float)(KISS * 1.21e-10);      // starting angle, < 30 degrees
        incr = (float)(KISS * 2.43e-10 / N);  // increment, < 60/n degrees

        /* set up rotation parameters */
        s1 = sinf (incr);
#if MODE == NAIVE
        c1 = cosf (incr);
#elif MODE == FAST_ACC
        c1 = tanf (incr * 0.5f);
#else
        tt = sinf (incr * 0.5f);
        c1 = 2.0f * tt * tt;
#endif // MODE
        sumerrsqS = 0;
        sumerrsqC = 0;

        s0 = sinf (base); // initial sine
        c0 = cosf (base); // initial cosine

        /* run test case through N rotation steps */
        i = 0;
        do {         

            tt = s0; // old sine
#if MODE == NAIVE
            /* least accurate, 6 FP ops */
            s0 = c1 * tt + s1 * c0; // new sine
            c0 = c1 * c0 - s1 * tt; // new cosine
#elif MODE == ROBUST
            /* very accurate, 8 FP ops */
            s0 = ( s1 * c0 - c1 * tt) + tt; // new sine
            c0 = (-s1 * tt - c1 * c0) + c0; // new cosine
#elif MODE == FAST
            /* accurate and fast, 4 FP ops */
            s0 = fmaf (-c1, tt, fmaf ( s1, c0, tt)); // new sine
            c0 = fmaf (-c1, c0, fmaf (-s1, tt, c0)); // new cosine
#elif MODE == ACCURATE
            /* most accurate, 6 FP ops */
            s0 = tt + fmaf (s1, c0, -c1 * tt); // new sine
            c0 = c0 - fmaf (s1, tt,  c1 * c0); // new cosine
#elif MODE == FAST_ACC
            /* fast and accurate, 4 FP ops */
            s0 = fmaf ( s1, fmaf (-tt, c1, c0), tt); // new sine
            c0 = fmaf (-s1, fmaf ( c0, c1, tt), c0); // new cosine
#endif // MODE
            i++;

            refS = sin (fma ((double)i, (double)incr, (double)base));
            refC = cos (fma ((double)i, (double)incr, (double)base));
            errS = ((double)s0 - refS) / refS;
            errC = ((double)c0 - refC) / refC;
            sumerrsqS = fma (errS, errS, sumerrsqS);
            sumerrsqC = fma (errC, errC, sumerrsqC);
        } while (i < N);

        rmsS = sqrt (sumerrsqS / N);
        rmsC = sqrt (sumerrsqC / N);
        if (rmsS > maxrmsS) maxrmsS = rmsS;
        if (rmsC > maxrmsC) maxrmsC = rmsC;

    } while (--count);

    printf ("max rms error sin = % 16.9e\n", maxrmsS);
    printf ("max rms error cos = % 16.9e\n", maxrmsC);

    return EXIT_SUCCESS;
}

【讨论】：