定点中的 Log2 近似答案

【问题标题】：Log2 approximation in fixed-point定点中的 Log2 近似
【发布时间】：2019-02-13 01:24:35
【问题描述】：

我已经使用查找表和低阶多项式近似实现了定点 log2 函数，但对整个 32 位定点范围 [-1,+1) 的精度不太满意。输入格式为 s0.31，输出格式为 s15.16。

我在这里发布这个问题，以便其他用户可以发布他的答案（一些 cmets 在另一个线程中交换，但他们更愿意在单独的线程中提供全面的答案）。欢迎任何其他答案，如果您能提供算法及其实现的一些速度与准确性细节，我将不胜感激。

谢谢。

【问题讨论】：

你现在达到什么样的准确度，你希望达到什么样的准确度？
原来速度对我来说非常重要，所以我只使用一阶近似。我不记得整个范围内的平均误差，但最大相对误差约为 5%。汇编实现现在非常快，所以我愿意以更多时钟周期为代价来提高准确性。我只是想了解您的方法并了解我可以在哪里为我的应用程序绘制精度与速度线。
问题到底是什么？（我可以猜到，但是这样表述会很有帮助，谢谢。）
@JohnMcFarlane，如有任何混淆，我们深表歉意。我在另一个线程中问 njuffa，他是否为 log2 函数的定点实现做过 minmax 多项式逼近，他建议提出一个新问题，以便他提供解决方案。
@Ali 我没有看到指向另一个线程的链接，您还没有在这个线程中提出任何问题。也许：“使用定点实现 log2 近似的好方法是什么？”

【解决方案1】：

通过简单地计算定点数x 中的前导零位，可以将log2(x) 确定为最接近的严格更小的整数。在许多处理器架构上，都有“计数前导零”机器指令或内在指令。如果这不可用，则可以通过多种方式构建一个相当有效的 clz() 实现，其中一种包含在下面的代码中。

为了计算对数的小数部分，两个主要的明显竞争者是表中的插值和极小极大多项式近似。在这种特定情况下，在一个相当小的表格中进行二次插值似乎是更有吸引力的选择。 x = 2ⁱ * (1+f)，其中 0 ≤ f i，并使用f 的前导位来索引表。抛物线通过这个和下表的两个条目拟合，动态计算抛物线的参数。结果被四舍五入，并应用启发式调整以部分补偿定点算术的截断性质。最后，将整数部分相加，得到最终结果。

应该注意，计算涉及有符号整数的右移，可能为负数。我们需要将这些右移映射到机器代码级别的算术右移，这是 ISO-C 标准不保证的。然而，在实践中，大多数编译器都会做所希望的事情。在本例中，我在运行 Windows 的 x64 平台上使用了 Intel 编译器。

使用 32 位字的 66 项表，最大绝对误差可以降低到 8.18251e-6，从而达到完整的s15.16 精度。

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>

#define FRAC_BITS_OUT (16)
#define INT_BITS_OUT  (15)
#define FRAC_BITS_IN  (31)
#define INT_BITS_IN   ( 0)

/* count leading zeros: intrinsic or machine instruction on many architectures */
int32_t clz (uint32_t x)
{
    uint32_t n, y;

    n = 31 + (!x);
    if ((y = (x & 0xffff0000U))) { n -= 16;  x = y; }
    if ((y = (x & 0xff00ff00U))) { n -=  8;  x = y; }
    if ((y = (x & 0xf0f0f0f0U))) { n -=  4;  x = y; }
    if ((y = (x & 0xccccccccU))) { n -=  2;  x = y; }
    if ((    (x & 0xaaaaaaaaU))) { n -=  1;         }
    return n;
}

#define LOG2_TBL_SIZE (6)
#define TBL_SIZE      ((1 << LOG2_TBL_SIZE) + 2)

/* for i = [0,65]: log2(1 + i/64) * (1 << 31) */
const uint32_t log2Tab [TBL_SIZE] =
{
    0x00000000, 0x02dcf2d1, 0x05aeb4dd, 0x08759c50, 
    0x0b31fb7d, 0x0de42120, 0x108c588d, 0x132ae9e2, 
    0x15c01a3a, 0x184c2bd0, 0x1acf5e2e, 0x1d49ee4c, 
    0x1fbc16b9, 0x22260fb6, 0x24880f56, 0x26e2499d, 
    0x2934f098, 0x2b803474, 0x2dc4439b, 0x30014ac6, 
    0x32377512, 0x3466ec15, 0x368fd7ee, 0x38b25f5a, 
    0x3acea7c0, 0x3ce4d544, 0x3ef50ad2, 0x40ff6a2e, 
    0x43041403, 0x450327eb, 0x46fcc47a, 0x48f10751, 
    0x4ae00d1d, 0x4cc9f1ab, 0x4eaecfeb, 0x508ec1fa, 
    0x5269e12f, 0x5440461c, 0x5612089a, 0x57df3fd0, 
    0x59a80239, 0x5b6c65aa, 0x5d2c7f59, 0x5ee863e5, 
    0x60a02757, 0x6253dd2c, 0x64039858, 0x65af6b4b, 
    0x675767f5, 0x68fb9fce, 0x6a9c23d6, 0x6c39049b, 
    0x6dd2523d, 0x6f681c73, 0x70fa728c, 0x72896373, 
    0x7414fdb5, 0x759d4f81, 0x772266ad, 0x78a450b8, 
    0x7a231ace, 0x7b9ed1c7, 0x7d17822f, 0x7e8d3846, 
    0x80000000, 0x816fe50b
};

#define RND_SHIFT     (31 - FRAC_BITS_OUT)
#define RND_CONST     ((1 << RND_SHIFT) / 2)
#define RND_ADJUST    (0x10d) /* established heuristically */

/* 
   compute log2(x) in s15.16 format, where x is in s0.31 format
   maximum absolute error 8.18251e-6 @ 0x20352845 (0.251622232)
*/   
int32_t fixed_log2 (int32_t x)
{
    int32_t f1, f2, dx, a, b, approx, lz, i, idx;
    uint32_t t;

    /* x = 2**i * (1 + f), 0 <= f < 1. Find i */
    lz = clz (x);
    i = INT_BITS_IN - lz;
    /* normalize f */
    t = (uint32_t)x << (lz + 1);
    /* index table of log2 values using LOG2_TBL_SIZE msbs of fraction */
    idx = t >> (32 - LOG2_TBL_SIZE);
    /* difference between argument and smallest sampling point */
    dx = t - (idx << (32 - LOG2_TBL_SIZE));
    /* fit parabola through closest three sampling points; find coeffs a, b */
    f1 = (log2Tab[idx+1] - log2Tab[idx]);
    f2 = (log2Tab[idx+2] - log2Tab[idx]);
    a = f2 - (f1 << 1);
    b = (f1 << 1) - a;
    /* find function value for argument by computing ((a*dx+b)*dx) */
    approx = (int32_t)((((int64_t)a)*dx) >> (32 - LOG2_TBL_SIZE)) + b;
    approx = (int32_t)((((int64_t)approx)*dx) >> (32 - LOG2_TBL_SIZE + 1));
    approx = log2Tab[idx] + approx;
    /* round fractional part of result */
    approx = (((uint32_t)approx) + RND_CONST + RND_ADJUST) >> RND_SHIFT;
    /* combine integer and fractional parts of result */
    return (i << FRAC_BITS_OUT) + approx;
}

/* convert from s15.16 fixed point to double-precision floating point */
double fixed_to_float_s15_16 (int32_t a)
{
    return a / 65536.0;
}

/* convert from s0.31 fixed point to double-precision floating point */
double fixed_to_float_s0_31 (int32_t a)
{
    return a / (65536.0 * 32768.0);
}

int main (void)
{
    double a, res, ref, err, maxerr = 0.0;
    int32_t x, start, end;

    start = 0x00000001;
    end =   0x7fffffff;
    printf ("testing fixed_log2 with inputs in [%17.10e, %17.10e)\n",  
            fixed_to_float_s0_31 (start), fixed_to_float_s0_31 (end));

    for (x = start; x < end; x++) {
        a = fixed_to_float_s0_31 (x);
        ref = log2 (a);
        res = fixed_to_float_s15_16 (fixed_log2 (x));
        err = fabs (res - ref);
        if (err > maxerr) {
            maxerr = err;
        }
    }

    printf ("max. err = %g\n", maxerr);
    return EXIT_SUCCESS;
}

为了完整起见，我在下面展示了极小极大多项式近似。这种近似的系数可以通过 Maple、Mathematica、Sollya 等多种工具生成，或者使用 Remez 算法的自制代码生成，这是我在这里使用的。下面的代码显示了原始浮点系数、用于最大化中间计算精度的动态缩放，以及用于减轻非舍入定点算法影响的启发式调整。

计算log2(x) 的典型方法是使用 x = 2ⁱ * (1+f) 并在 [ √½, √2]，这意味着我们在初级逼近区间 [√½-1, √2-1] 上使用多项式 p(f)。

在我们希望使用 32 位 mulhi 操作作为其基本构建块的限制下，中间计算尽可能扩大操作数以提高准确性，因为这是许多 32 位架构上的本机指令，可通过内联机器代码或作为内在代码访问。与基于表的代码一样，有符号数据的右移可能为负数，并且这种右移必须映射到算术右移，这是 ISO-C 不保证但大多数 C 编译器可以做到的。

我设法将此变体的最大绝对误差降至 1.11288e-5，因此几乎完全 s15.16 准确度，但比基于表格的变体略差。我怀疑我应该在多项式中添加一个附加项。

/* on 32-bit architectures, there is often an instruction/intrinsic for this */
int32_t mulhi (int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * (int64_t)b) >> 32);
}

#define RND_SHIFT  (25 - FRAC_BITS_OUT)
#define RND_CONST  ((1 << RND_SHIFT) / 2)
#define RND_ADJUST (-2) /* established heuristically */

/* 
    compute log2(x) in s15.16 format, where x is in s0.31 format
    maximum absolute error 1.11288e-5 @ 0x5a82689f (0.707104757)
*/   
int32_t fixed_log2 (int32_t x)
{
    int32_t lz, i, f, p, approx;
    uint32_t t;
    /* x = 2**i * (1 + f), 0 <= f < 1. Find i */
    lz = clz (x);
    i = INT_BITS_IN - lz;
    /* force (1+f) into range [sqrt(0.5), sqrt(2)] */
    t = (uint32_t)x << lz;    
    if (t > (uint32_t)(1.414213562 * (1U << 31))) {
        i++;
        t = t >> 1;
    }
    /* compute log2(1+f) for f in [-0.2929, 0.4142] */
    f = t - (1U << 31);
    p =              + (int32_t)(-0.206191055 * (1U << 31) -  1);
    p = mulhi (p, f) + (int32_t)( 0.318199910 * (1U << 30) - 18);
    p = mulhi (p, f) + (int32_t)(-0.366491705 * (1U << 29) + 22);
    p = mulhi (p, f) + (int32_t)( 0.479811855 * (1U << 28) -  2);
    p = mulhi (p, f) + (int32_t)(-0.721206390 * (1U << 27) + 37);
    p = mulhi (p, f) + (int32_t)( 0.442701618 * (1U << 26) + 35);
    p = mulhi (p, f) + (f >> (31 - 25));
    /* round fractional part of the result */
    approx = (p + RND_CONST + RND_ADJUST) >> RND_SHIFT;
    /* combine integer and fractional parts of result */
    return (i << FRAC_BITS_OUT) + approx;
}

【讨论】：

这么棒的答案！如果有人想知道抛物线拟合发生了什么，op 使用拉格朗日插值公式，但有一个巧妙的技巧，将函数转换为 (0,0) 并将 (0,0) 作为第一个点。还有一个偷偷摸摸的除以二（在第二个近似语句中），以防你的数字不检查。
@gwiazdorr 自 1987 年左右以来我一直在推广这种方法，我很确定当时我对拉格朗日插值一无所知。将插值坐标系的原点放在第一个点，并将等距节点之间的 x 轴距离定义为单位后，通过将其他两个点代入 ax²+bx，我们得到两个联立方程组：a+b = f1和4*a+2*b = f2。这给了我们2*a = f2 - 2*f1 和2*b = 4*f1 - f2 = 2*f1 - 2*a。