定点算术中的单精度答案

【问题标题】：Single precision in Fixed Point Arithmetic定点算术中的单精度
【发布时间】：2017-11-19 00:08:19
【问题描述】：

对于使用定点算术的泰勒级数计算，我需要最多 6 位小数精度。我尝试了不同的定点格式来实现 6 位小数精度。

例如，使用 s16.15（左移 15）格式，我有多达 2 个小数位精度。1 个符号位、16 个整数位和 15 个小数位。

对于 s8.23（左移 23）格式，最多保留 4 个小数位，而对于 s4.27（左移 27）格式，精度仍然相同。我原以为情况会好转。

下面是一个泰勒级数展开，用来计算围绕某个点a的自然对数。

所以q=x-a，x是1到2之间的用户输入。

  // These are converted constants into s4.27 fixed point format
  const int32_t con=0x0B8AA3B3; //1.44269504088895
  const int32_t c0=0x033E647E; //0.40546510810816
  const int32_t c1=0x05555555; //0.66666666666666
  const int32_t c2=0x01C71C72; //0.222222222222
  const int32_t c3=0x00CA4588; //0.0987654321
  const int32_t c4=0x006522C4; //0.04938271605
  const int32_t c5=0x0035F069; //0.02633744856
  const int32_t c6=0x001DF757; //0.01463191587

//Expanded taylor series    
taylor=c0+mul(q,(c1-mul(q,(c2+mul(q,(c3-mul(q,(c4-mul(q,(c5+mul(q,c6)))))))))));
// Multiplication function
int32_t mul(int32_t x, int32_t y)
{
int32_t mul;
mul=((((x)>>13)*((y)>>13))>>1); // for s4.27 format, the best possible right shift
return mul;
}

上面提到的代码sn-ps用在C中。

我需要的结果：0.584963 但我得到的结果是：0.584949

我怎样才能获得更高的精度？

【问题讨论】：

您确定您看到的错误是由于精度有限，而不是您使用的是近似公式吗？
但是，您的 mul 函数似乎会立即删除每个参数的 13 位，这可能对精度非常不利。
(a) 说明输入是什么。 (b) 提供Minimal, Complete, and Verifiable Example。 (c) 使用double 算术或 64 位整数算术进行计算，以测试问题是由于算术的精度还是您使用的多项式的质量。它还可以帮助说明您使用泰勒级数逼近的原始函数，以便人们可以检查您的系数。
这似乎是某种与对数相关的计算（c0 == log(1.5) 但我不知道正在计算什么。为了在这种定点计算中最大化精度，通常（1）改变从术语到术语的定点格式；这需要知道输入域以避免溢出 (2) 使用 mulhi 指令或宽乘法（具有全双宽度乘积），因此原始参数的所有位都进入每个乘积。最好使用极大极小近似而不是泰勒级数展开。
cmets 不反映代码。 c2=0x01C71C72; --> 0.222222223878...，不是0.222222222222

标签： c fixed-point taylor-series

【解决方案1】：

OP 的 mul() 丢掉了太多的精度。

(x)>>13)*((y)>>13) 立即丢弃 x 和 y 的最低有效 13 位。

改为执行 64 位乘法

int32_t mul_better(int32_t x, int32_t y) {
  int64_t mul = x;
  mul *= y;
  mul >>= 27;

  // Code may want to detect overflow here (not shown)

  return (int32_t) mul;
}

更好的是，在丢弃最低有效位之前将乘积四舍五入到最接近（平成偶数）。简化是可能的。下面的详细代码是说明性的。

int32_t mul_better(int32_t x, int32_t y) {
  int64_t mul = x;
  mul *= y;
  int32_t least = mul % ((int32_t)1 << 27);
  mul /= (int32_t)1 << 27;
  int carry = 0;
  if (least >= 0) {
    if (least >  ((int32_t)1 << 26) carry = 1;
    else if ((least ==  ((int32_t)1 << 26)) && (mul % 2)) carry = 1;
  } else {
    if (-least > ((int32_t)1 << 26) carry = -1;
    else if ((-least ==  ((int32_t)1 << 26)) && (mul % 2)) carry = -1;
  }
  return (int32_t) (mul + carry);
}

int32_t mul(int32_t x, int32_t y) {
  int64_t mul = x;
  mul *= y;
  return mul >> 27;
}

void foo(double x) {
  int32_t q = (int32_t) (x * (1 << 27));  // **
  int32_t taylor =
      c0 + mul(q, (c1 - mul(q, (c2  + mul(q,
      (c3 - mul(q, (c4 - mul(q, (c5 + mul(q, c6)))))))))));
  printf("%f %f\n", x,  taylor * 1.0 / (1 << 27));
}

int main(void) {
  foo(0.303609);
}

输出

0.303609 0.584963

** 也可以在这里四舍五入，而不是简单地将 FP 截断为整数。

【讨论】：