java - How do I truncate the significand of a floating point number to an arbitrary precision in Java? [duplicate]

【Question title】: How do I truncate the significand of a floating point number to an arbitrary precision in Java? [duplicate]
【Posted】: 2018-02-11 10:07:32
【Question】:

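I want to introduce some artificial precision loss into two numbers being compared, to smooth out minor rounding errors, so that I don't have to use the Math.abs(x - y) < eps idiom in every comparison involving x and y.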

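Essentially, I want something that behaves like down-casting a double to a float and then up-casting it back to a double, except that I also want to preserve very large and very small exponents, and I want some control over the number of significand bits that are kept.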
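For reference, the float round trip described above can be sketched as follows. This only illustrates the baseline behavior, not the requested function; the class name FloatRoundTrip and the helper viaFloat are hypothetical names introduced for this sketch.

```java
public class FloatRoundTrip {
    /** Down-casts to float and widens back to double, keeping 24 significant bits. */
    public static double viaFloat(double x) {
        return (double) (float) x;
    }

    public static void main(String[] args) {
        // A small rounding difference is absorbed by the round trip.
        System.out.println(0.1 + 0.2 == 0.3);                     // false
        System.out.println(viaFloat(0.1 + 0.2) == viaFloat(0.3)); // true

        // Two doubles on opposite sides of a float rounding boundary still differ.
        double boundary = 1.0 + 0x1p-24; // exactly halfway between two adjacent floats
        System.out.println(
                viaFloat(Math.nextDown(boundary)) == viaFloat(Math.nextUp(boundary))); // false

        // The exponent range of float is much narrower than that of double.
        System.out.println(viaFloat(1e300));  // Infinity
        System.out.println(viaFloat(1e-300)); // 0.0
    }
}
```

The last two lines show why the cast alone is not enough here: values outside the float exponent range collapse to Infinity or 0.0.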

Given the following function, which produces the binary representation of the significand of a 64-bit IEEE 754 number:

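public static String significand(double d) {
    int SIGN_WIDTH = 1;
    int EXP_WIDTH = 11;
    int SIGNIFICAND_WIDTH = 53; // 52 stored bits plus the implicit leading bit
    // Left-pad the raw bit pattern with zeros to 64 characters.
    String s = String.format("%64s", Long.toBinaryString(Double.doubleToRawLongBits(d))).replace(' ', '0');
    // Skip the sign and exponent fields and return the 52 stored significand bits.
    return s.substring(SIGN_WIDTH + EXP_WIDTH, SIGN_WIDTH + EXP_WIDTH + SIGNIFICAND_WIDTH - 1);
}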

I want a function reducePrecision(double x, int bits) that reduces the precision of a double's significand such that:

significand(reducePrecision(x, bits)).substring(bits).equals(String.format("%0" + (52 - bits) + "d", 0))

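In other words, every bit after the bits-most significant bits in the significand of reducePrecision(x, bits) should be 0, and the bits-most significant bits in the significand of reducePrecision(x, bits) should reasonably approximate the bits-most significant bits in the significand of x.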
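The trailing-zeros half of this requirement can also be checked directly on the raw bit pattern instead of on a string. The following is a minimal sketch, assuming 0 <= bits <= 52; the class name TrailingBitsCheck and the method lowBitsAreZero are hypothetical names, not part of the question.

```java
public class TrailingBitsCheck {
    /**
     * Returns true if every stored significand bit after the first `bits` bits is 0.
     * Assumes 0 <= bits <= 52 (an assumption of this sketch).
     */
    public static boolean lowBitsAreZero(double x, int bits) {
        long raw = Double.doubleToRawLongBits(x);
        long lowMask = (1L << (52 - bits)) - 1; // selects the trailing (52 - bits) bits
        return (raw & lowMask) == 0;
    }

    public static void main(String[] args) {
        double a = 73.0967787376657;
        System.out.println(lowBitsAreZero(a, 23));                  // false
        System.out.println(lowBitsAreZero((double) (float) a, 23)); // true
    }
}
```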

【Comments on the question】:

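(a) This does not "eliminate" rounding errors; it makes them larger. It is not a good way to deal with rounding errors in floating-point arithmetic. (b) Methods for rounding a floating-point number to a specific number of bits in the significand are known. We recently had a question for which I pointed to the Veltkamp-Dekker split algorithm.
Agreed, reducing precision will always lead to larger errors. A better description of the goal would be "specifying an arbitrary binary discretization of a floating point number".
I see now that this question is a duplicate of a fairly common one. I need to work on my Google-fu.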
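For readers unfamiliar with the splitting technique named in the first comment, a rough sketch of Veltkamp's splitting is given below. It is not taken from the linked discussion; the class name VeltkampSplit, the method high, and the chosen value of s are assumptions made for illustration, and the sketch ignores overflow of the intermediate product for very large x.

```java
public class VeltkampSplit {
    /**
     * Rounds x to about (53 - s) significant bits using Veltkamp's splitting.
     * Assumes a moderate s (for example 2 <= s <= 51) and that (2^s + 1) * x
     * does not overflow; both are assumptions of this sketch.
     */
    public static double high(double x, int s) {
        double c = (double) ((1L << s) + 1); // 2^s + 1, exactly representable as a double
        double gamma = c * x;
        double delta = x - gamma;
        return gamma + delta;
    }

    public static void main(String[] args) {
        double x = 73.0967787376657;
        int s = 29;             // keep 53 - 29 = 24 significant bits
        double hi = high(x, s);
        double lo = x - hi;     // the part that was rounded away
        System.out.println(hi);
        System.out.println(lo);
        System.out.println(hi + lo == x);
    }
}
```

Unlike the scaling approach in the answer below, this uses only multiplication and addition, at the cost of a narrower safe input range.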

Tags: java floating-point precision ieee-754

【Solution 1】:

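Suppose x is the number whose precision you want to reduce and bits is the number of significant bits you want to keep.

When bits is sufficiently large and the magnitude of x is sufficiently close to 0, x * (1L << (bits - Math.getExponent(x))) scales x so that the bits to be discarded fall in the fractional part (after the radix point) and the bits to be kept fall in the integer part (before the radix point). You can then round the result to remove the fractional part, and divide the rounded number by (1L << (bits - Math.getExponent(x))) to restore the order of magnitude of x, i.e.: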

public static double reducePrecision(double x, int bits) {
    int exponent = bits - Math.getExponent(x);
    // Math.round returns a long, so cast the divisor to double to avoid integer division.
    return Math.round(x * (1L << exponent)) / (double) (1L << exponent);
}

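However, (1L << exponent) breaks down when Math.getExponent(x) > bits || Math.getExponent(x) < bits - 62, because a long shift only yields the intended positive power of 2 for shift distances from 0 to 62. The solution is to use Math.pow(2, exponent) (or the fast pow2(exponent) implementation from this answer) to compute fractional or very large powers of 2, i.e.: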

public static double reducePrecision(double x, int bits) {
    int exponent = bits - Math.getExponent(x);
    return Math.round(x * Math.pow(2, exponent)) * Math.pow(2, -exponent);
}
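The shift problem described above can be observed directly. The sketch below is only a demonstration; the class name ShiftVersusPow and the helper shiftPow2 are hypothetical names.

```java
public class ShiftVersusPow {
    /** 2^n computed with a long shift; only meaningful for 0 <= n <= 62. */
    public static long shiftPow2(int n) {
        return 1L << n;
    }

    public static void main(String[] args) {
        System.out.println(shiftPow2(62)); // 4611686018427387904
        System.out.println(shiftPow2(63)); // -9223372036854775808 (sign bit set)
        System.out.println(shiftPow2(64)); // 1 (only the low 6 bits of the distance are used)
        System.out.println(shiftPow2(-1)); // -9223372036854775808 (same as a shift by 63)

        // Math.pow has no such wrap-around for these arguments.
        System.out.println(Math.pow(2, 64)); // 1.8446744073709552E19
        System.out.println(Math.pow(2, -1)); // 0.5
    }
}
```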

However, Math.pow(2, exponent) breaks down as exponent approaches -1074 or +1023. The solution is to use Math.scalb(x, exponent), so that the power of 2 never has to be computed explicitly, i.e.:

public static double reducePrecision(double x, int bits) {
    int exponent = bits - Math.getExponent(x);
    // Cast to double so that scalb(double, int) is selected rather than scalb(float, int).
    return Math.scalb((double) Math.round(Math.scalb(x, exponent)), -exponent);
}
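To illustrate why an explicit power of 2 is a problem at the extremes, the following sketch scales the smallest subnormal both ways. The class name PowVersusScalb, its two helpers, and the chosen scale factor are hypothetical and only serve this demonstration.

```java
public class PowVersusScalb {
    /** Scales x by 2^n using an explicitly computed power of 2. */
    public static double scaleWithPow(double x, int n) {
        return x * Math.pow(2, n);
    }

    /** Scales x by 2^n without ever materializing 2^n. */
    public static double scaleWithScalb(double x, int n) {
        return Math.scalb(x, n);
    }

    public static void main(String[] args) {
        double tiny = Double.MIN_VALUE; // 2^-1074, the smallest subnormal
        int n = 1084;                   // the exact result 2^10 is perfectly representable

        System.out.println(Math.pow(2, n));          // Infinity, since 2^1084 overflows
        System.out.println(scaleWithPow(tiny, n));   // Infinity
        System.out.println(scaleWithScalb(tiny, n)); // 1024.0
    }
}
```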

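However, Math.round(y) returns a long, so it does not preserve Infinity, NaN, or cases where Math.abs(x) > Long.MAX_VALUE / Math.pow(2, exponent). In addition, Math.round(y) always rounds ties toward positive infinity (e.g. Math.round(0.5) == 1 && Math.round(1.5) == 2). The solution is to use Math.rint(y), which returns a double and follows the unbiased IEEE 754 round-to-nearest, ties-to-even rule (e.g. Math.rint(0.5) == 0.0 && Math.rint(1.5) == 2.0), i.e.: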

public static double reducePrecision(double x, int bits) {
    int exponent = bits - Math.getExponent(x);
    return Math.scalb(Math.rint(Math.scalb(x, exponent)), -exponent);
}
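The difference between the two rounding functions discussed above can be seen in a few lines. This is only a demonstration; the class name RoundVersusRint and the helpers viaRound and viaRint are hypothetical names.

```java
public class RoundVersusRint {
    /** Rounds through a long, as Math.round does. */
    public static double viaRound(double y) {
        return Math.round(y);
    }

    /** Rounds within double, as Math.rint does. */
    public static double viaRint(double y) {
        return Math.rint(y);
    }

    public static void main(String[] args) {
        // Math.round: ties go toward positive infinity.
        System.out.println(Math.round(0.5));  // 1
        System.out.println(Math.round(1.5));  // 2
        System.out.println(Math.round(2.5));  // 3
        System.out.println(Math.round(-0.5)); // 0

        // Math.rint: ties go to the nearest even value.
        System.out.println(Math.rint(0.5)); // 0.0
        System.out.println(Math.rint(1.5)); // 2.0
        System.out.println(Math.rint(2.5)); // 2.0

        // Special values survive rint but not the conversion to long.
        System.out.println(Math.round(Double.NaN));               // 0
        System.out.println(Math.round(Double.POSITIVE_INFINITY)); // 9223372036854775807
        System.out.println(Math.rint(Double.NaN));                // NaN
        System.out.println(Math.rint(Double.POSITIVE_INFINITY));  // Infinity
    }
}
```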

Finally, here is a unit test that confirms the expected behavior:

public static String decompose(double d) {
    int SIGN_WIDTH = 1;
    int EXP_WIDTH = 11;
    int SIGNIFICAND_WIDTH = 53;
    String s = String.format("%64s", Long.toBinaryString(Double.doubleToRawLongBits(d))).replace(' ', '0');
    return s.substring(0, 0 + SIGN_WIDTH) + " "
            + s.substring(0 + SIGN_WIDTH, 0 + SIGN_WIDTH + EXP_WIDTH) + " "
            + s.substring(0 + SIGN_WIDTH + EXP_WIDTH, 0 + SIGN_WIDTH + EXP_WIDTH + SIGNIFICAND_WIDTH - 1);
}

public static void test() {
    // Use a fixed seed so the generated numbers are reproducible.
    java.util.Random r = new java.util.Random(0);

    // Generate a floating point number that makes use of its full 52 bits of significand precision.
    double a = r.nextDouble() * 100;
    System.out.println(decompose(a) + " " + a);
    Assert.assertFalse(decompose(a).split(" ")[2].substring(23).equals(String.format("%0" + (52 - 23) + "d", 0)));

    // Cast the double to a float to produce a "ground truth" of precision loss to compare against.
    double b = (float) a;
    System.out.println(decompose(b) + " " + b);
    Assert.assertTrue(decompose(b).split(" ")[2].substring(23).equals(String.format("%0" + (52 - 23) + "d", 0)));
    // 32-bit float has a 23 bit significand, so c's bit pattern should be identical to b's bit pattern.
    double c = reducePrecision(a, 23);
    System.out.println(decompose(c) + " " + c);
    Assert.assertTrue(b == c);

    // 23rd-most significant bit in c is 1, so rounding it to the 22nd-most significant bit requires breaking a tie.
    // Since 22nd-most significant bit in c is 0, d will be rounded down so that its 22nd-most significant bit remains 0.
    double d = reducePrecision(c, 22);
    System.out.println(decompose(d) + " " + d);
    Assert.assertTrue(decompose(d).split(" ")[2].substring(22).equals(String.format("%0" + (52 - 22) + "d", 0)));
    Assert.assertTrue(decompose(c).split(" ")[2].charAt(22) == '1' && decompose(c).split(" ")[2].charAt(21) == '0');
    Assert.assertTrue(decompose(d).split(" ")[2].charAt(21) == '0');
    // 21st-most significant bit in d is 1, so rounding it to the 20th-most significant bit requires breaking a tie.
    // Since 20th-most significant bit in d is 1, e will be rounded up so that its 20th-most significant bit becomes 0.
    double e = reducePrecision(d, 20);
    System.out.println(decompose(e) + " " + e);
    Assert.assertTrue(decompose(e).split(" ")[2].substring(20).equals(String.format("%0" + (52 - 20) + "d", 0)));
    Assert.assertTrue(decompose(d).split(" ")[2].charAt(20) == '1' && decompose(d).split(" ")[2].charAt(19) == '1');
    Assert.assertTrue(decompose(e).split(" ")[2].charAt(19) == '0');

    // Reduce the precision of a number close to the largest normal number.
    double f = reducePrecision(a * 0x1p+1017, 23);
    System.out.println(decompose(f) + " " + f);
    // Reduce the precision of a number close to the smallest normal number.
    double g = reducePrecision(a * 0x1p-1028, 23);
    System.out.println(decompose(g) + " " + g);
    // Reduce the precision of a number close to the smallest subnormal number.
    double h = reducePrecision(a * 0x1p-1051, 23);
    System.out.println(decompose(h) + " " + h);
}

And its output:

0 10000000101 0010010001100011000110011111011100100100111000111011 73.0967787376657
0 10000000101 0010010001100011000110100000000000000000000000000000 73.0967788696289
0 10000000101 0010010001100011000110100000000000000000000000000000 73.0967788696289
0 10000000101 0010010001100011000110000000000000000000000000000000 73.09677124023438
0 10000000101 0010010001100011001000000000000000000000000000000000 73.0968017578125
0 11111111110 0010010001100011000110100000000000000000000000000000 1.0266060746443803E308
0 00000000001 0010010001100011000110100000000000000000000000000000 2.541339559435826E-308
0 00000000000 0000000000000000000000100000000000000000000000000000 2.652494739E-315
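To tie the output back to the original goal of comparing numbers, here is a small usage sketch. It repeats the final reducePrecision so that it is self-contained; the class name ReducePrecisionUsage and the choice of 40 bits are assumptions made for illustration. As the first comment on the question points out, two values that fall on opposite sides of a rounding boundary will still compare unequal, so this is not a general replacement for a tolerance check.

```java
public class ReducePrecisionUsage {
    /** The final reducePrecision from the answer, repeated so this sketch is self-contained. */
    public static double reducePrecision(double x, int bits) {
        int exponent = bits - Math.getExponent(x);
        return Math.scalb(Math.rint(Math.scalb(x, exponent)), -exponent);
    }

    public static void main(String[] args) {
        int bits = 40; // hypothetical choice for this example

        // The classic one-ulp discrepancy disappears after reducing precision.
        System.out.println(0.1 + 0.2 == 0.3);                                               // false
        System.out.println(reducePrecision(0.1 + 0.2, bits) == reducePrecision(0.3, bits)); // true

        // With 23 bits the result matches the float round trip, as in the unit test.
        double a = 73.0967787376657;
        System.out.println(reducePrecision(a, 23) == (double) (float) a); // true

        // Special values pass through unchanged.
        System.out.println(reducePrecision(Double.NaN, bits));               // NaN
        System.out.println(reducePrecision(Double.POSITIVE_INFINITY, bits)); // Infinity
        System.out.println(reducePrecision(0.0, bits));                      // 0.0
    }
}
```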

【Comments】: