Find the smallest integer that satisfies a floating-point inequality: answers

【Question title】: Find smallest integer that satisfies floating point inequality equation
【Posted】: 2020-09-03 00:01:29
【Question】:

I'm looking for a fast algorithm that finds the smallest integer N satisfying the following inequality, where s, q, u, and p are float numbers (using the IEEE-754 binary32 format):

s > q + u * p / (N - 1)

where N can be any positive integer representable by a signed 32-bit integer. After (N - 1) is converted to float, all arithmetic is evaluated in float.

Additional constraints are:

0 < p.
-1 ≤ q ≤ 1.
q < s.
0 < u.

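I can't figure out how to do this in a robust way that correctly handles floating-point rounding errors and comparisons. Here is my poor attempt at a solution, which is not fast and not even robust, because I can't determine the minimal SOME_AMOUNT: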

int n = std::max(1.0f, floorf((u * p / (s - q)) - 1.0f));

// Floating point math might require to round up by some amount...
for (int i = 0; i < SOME_AMOUNT; ++i)
    if (!(q + (u * p / (n + 1)) < s))
        ++n;

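Above you can see my formula for computing n using basic algebra. The for loop is my crude way of trying to account for floating-point rounding errors. I am checking it against brute force like this: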

int nExact = 0;
bool found = false;
for (; nExact < SOME_BIG_NUMBER; ++nExact) {
    if (q + (u * p / (nExact + 1)) < s) {
        found = true;
        break;
    }
}
assert(found);
assert(n == nExact);

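Does any floating-point guru have a reasonably fast answer in C++?

Frankly, if someone can just give a theoretically sound proof of an upper bound for "SOME_AMOUNT" above, I'd be reasonably happy...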

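【Question comments】:

Before unleashing your fingers to write code, do some basic algebra on paper to turn s > q + u * p / (N - 1) into an inequality with N on one side and everything else on the other. You will have to allow for a few cases (for example, if the algebra involves dividing by something, take care of the case where that something is zero), but you will end up with a simple closed-form formula for N given the values of p, q, u, and s. At most a few if() and else, and definitely no loop.
Do you want a solution for which s > q + u * p / (N - 1) is true when evaluated with floating-point arithmetic, or one for which it is true when evaluated with real-number arithmetic? Is the domain of N the set of integers representable in the floating-point format, or the set of all integers? Do p and q have the same sign? Is s > q? What do you know about s, q, u, and p? Do you know any bounds on their values? Are there any constraints on their domains? Where do they come from?
Just to carve off part of the problem: given s > q, if u and p have different signs, then the solution is 2 (assuming 1 is excluded because of the division by zero), since u * p / (2-1) is negative or zero and s > q + u * p / (2-1) is true. So we can reduce the problem to u and p having the same sign. And u * p can be replaced by x, since u and p do not otherwise take part in the expression. So we have s > q + x / (N-1), where x is positive.
The basic arithmetic operations are weakly monotonic in floating-point wherever the corresponding real-number operations are monotonic or weakly monotonic. That may be useful for establishing bounds for checking candidates for N. (Obviously N can be found easily in real-number arithmetic, but since we are asked for a solution in floating-point arithmetic, rounding may cause the float solution for N to differ from the real solution for N. Establishing bounds can give us an efficient empirical solution.)
One thing to consider is that, since N is a 32-bit integer and the expression is evaluated in float, N must be converted to float, and that introduces a rounding error. Consider the case where q is at least ½s. Then s-q computed in float is exact (no rounding error), and the smallest float n satisfying s > q + x/n is x/(s-q) or 1 ULP higher or lower, depending on the rounding in the division. For example, we might find that n is 2147483392. In that case N would be 2147483329, because N-1 is then 2147483328, which is the smallest integer that rounds to 2147483392 under round-to-nearest, ties-to-even.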

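Tags: c++ floating-point floating-accuracy floating-point-conversion inequality

【Solution 1】:

To be on the safe side, we can first obtain a larger possible value (an upper bound) and a smaller possible value (a lower bound), and then narrow them down to the actual answer. That is more accurate and faster than simply iterating over all the numbers.

Solving the inequality, we get: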

N > u * p / (s - q) + 1
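
For reference, the step from the original inequality to this bound can be written out as follows. This is a real-arithmetic derivation only, and it assumes N ≥ 2 (so N - 1 > 0) and s > q (so s - q > 0); those sign assumptions are what allow multiplying and dividing without flipping the inequality.

```latex
\begin{aligned}
s &> q + \frac{u\,p}{N-1} \\
s - q &> \frac{u\,p}{N-1} \\
(s-q)(N-1) &> u\,p \\
N &> \frac{u\,p}{s-q} + 1
\end{aligned}
```

In exact arithmetic the least such integer is \lfloor u p/(s-q) \rfloor + 2. In float arithmetic each operation rounds, so this value is only a starting estimate.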

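Getting the upper bound

So first you find the largest guess for the answer using integers. We increase the numerator and integer-cast the denominator.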

int UP = (int)(u * p + 1);    // Increase by one
int D = (int)(s - q);         // we don't increase this because it  would cause g to decrease, which we don't want

float g = UP / (float)D + 1;  // we again float cast D to avoid integer division
int R = (int)(g + 1);         // Now again increase g

/******** Or a more straightforward approach (use instead of the four lines above) ********/
// int R = (int)(((int)(u*p+1))/(s-q) + 1 + 1);

// Add rounding-off error here
if (R > 2147483647 - 128) R = 2147483647;    // The case of overflow, checked without overflowing
else R += 128;

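This is your maximum answer (upper bound).

Getting the lower bound

Same as before, but this time we increase the denominator and integer-cast the numerator.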

int UP = (int)(u * p);         // will automatically decrease
int D = (int)(s - q + 1);      // we increase this because it would cause g to decrease, which we want

float g = UP / (float)D + 1;   // we again float cast D to avoid integer division
int L = (int)g;                // Integer cast, will automatically decrease
/******** Or a more straightforward approach (use instead of the four lines above) ********/
// int L = (int)(((int)(u*p))/(s-q+1) + 1);

// Subtract rounding-off error
if (L - 128 <= 1) L = 2;        // N cannot be below 2
else L -= 128;

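This is your minimum answer (lower bound).

Note: the reason for the integer casts is to reduce the sample space. You can omit them if you like.

Eliminating candidates to get the correct number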

int i;                                // declared outside the loop so it is still visible afterwards
for (i = L; i <= R; ++i){
    if ((s > q + u*p/(i-1))) break;   // answer would be i
}
N = i;    // least number which satisfies the condition

If the gap between the bounds (R - L) is large, you can do this faster with a binary search: a range of 2^n numbers is narrowed down in only n steps.

// we know that
// lower limit = L;
// upper limit = R;
// Declare u, p, q, s in global space or pass as parameters to binarySearch

int binarySearch(int l, int r)
{
    if(l==r) return l;

    if (r > l) {
        int mid = l + (r - l) / 2;

        bool b = (s > q + (p*u)/(mid-1));

        if (b==true){
            // we know that numbers >= mid will all satisfy
            // so our scope reduced to [l, mid]
            return binarySearch(l, mid);
        }
        // If mid doesn't satisfy
        // we know that our element is greater than mid
        return binarySearch(mid+1, r); 
    }
    return -1;   // not reached when l <= r; keeps every path returning a value
}

int main(void) 
{
    // calculate lower bound L and upper bound R here using above methods
    int N = binarySearch(L, R);
    // N might have rounding-off errors, so check for them
    // There might be fluctuation of 128 [-63 to 64] so we will manually check.
    // To be on safe side I will assume fluctuation of 256
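    L = N-128 > 2 ? N-128 : 2;
    R = N > 2147483647 - 128 ? 2147483647 : N+128;   // avoid signed overflow in N+128
    int i;                                           // declared outside the loop so it is still visible below
    for(i=L; i<=R; ++i){
        if( s > q + u * p / ((float)i - 1)) {
            break;
        }
    }
    cout << i << endl;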
}

This is mostly a concept, but it is both fast and safe. The only catch is that I have not tested it, but it should work!

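【Comments】:

I think I'll give this a try, but your comments are confusing... You say "// we don't round off because increasing it would cause g to decrease, which we don't want", but you do round off by casting to an integer...
@YesheTenley By rounding off I meant going to the nearest integer, e.g. 5.7 becomes 6, whereas casting to an integer makes it 5. And yes, some of my comments are confusing; I am changing them now!
@YesheTenley Thanks for pointing out the rounding issue; it led me to a huge mistake. Rounding 4.3 gives 4, but I actually wanted it to become 5, so I removed the rounding and added 1 instead. It's fine now! The earlier mistake came from copy-pasting the same code twice and forgetting to edit the comments.
For s = 1, q = 0, u = 2^30 = 1073741824, p = 1, this code gives a lower bound of 536870912 and an upper bound of 1073741824, but the correct answer is 1073741890.
@EricPostpischil For those inputs, my code gives the range [2^29+1, 2^30+3] => [536870913, 1073741827]. The correct answer is 2^30+2 => 1073741826, which is smaller than your answer, lies within the range, and satisfies the inequality. Please check again!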

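【Solution 2】:

Here is the start of a solution. Some caveats:

It is in C, not C++.
It assumes IEEE-754 arithmetic with round-to-nearest.
It does not handle cases where the inequality requires N to be outside the range from 2 to INT_MAX.
I have not tested it much.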

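The code first uses floating-point arithmetic to estimate where the boundary at which the inequality changes lies, ignoring rounding errors. It tests the inequality to see whether the candidate value needs to be increased or decreased. Then it iterates through consecutive integer-valued float values to find the boundary. My sense is that this takes only a few iterations, but I have not analyzed it fully.

This yields the smallest float whose integer value satisfies the inequality when used in place of the denominator N-1. The code then finds the smallest int N such that N-1 rounds to that float, and this should be the smallest int N that satisfies the inequality.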

#include <math.h>
#include <stdio.h>
#include <stdlib.h>


//  Test the inequality.
static int Test(float s, float q, float u, float p, int N)
{
    return s > q + (float) (((float) (u * p)) / (N-1));
}


int main(void)
{
    float s = 1;
    float q = 0;
    float u = 0x1p30, p = 1;

    /*  Approximate the desired denominator (N-1) -- would be exact with real
        arithmetic but is subject to rounding errors.
    */
    float D = floorf(u*p/(s-q));

    //  Test which side of the boundary where the inequality changes we are on.
    if (Test(s, q, u, p, (int) D + 1))
    {
        //  We are above the boundary, decrement to find the boundary.
        float NextD = D;
        do
        {
            D = NextD;
            //  Decrement D by the greater of 1 or 1 ULP.
            NextD = fminf(D-1, nexttowardf(D, 0));
        }
        while (Test(s, q, u, p, (int) NextD + 1));
    }
    else
        //  We are below the boundary, increment to find the boundary.
        do
            //  Increment D by the greater of 1 or 1 ULP.
            D = fmaxf(D+1, nexttowardf(D, INFINITY));
        while (!Test(s, q, u, p, (int) D + 1));

    //  Find the distance to the next lower float, as an integer.
    int distance = D - nexttowardf(D, 0);

    /*  Find the least integer that rounds to D.  If the distance to the next
        lower float is less than 1, then D is that integer.  Otherwise, we want
        either the midpoint between the D and the next lower float or one more
        than that, depending on whether the low bit of D in the float
        significand is even (midpoint will round to it, so use midpoint) or odd
        (midpoint will not round to it, so use one higher).

        (int) D - distance/2 is the midpoint.

        ((int) D / distance) & 1 scales D to bring the low bit of its
        significand to the one’s position and tests it, producing 0 if it is
        even and 1 if it is odd.
    */
    int I = distance == 0 ? (int) D
        : (int) D - distance/2 + (((int) D / distance) & 1);

    //  Set N to one more than that integer.
    int N = I+1;

    printf("N = %d.\n", N);

    if (Test(s, q, u, p, N-1) || !Test(s, q, u, p, N))
    {
        fprintf(stderr, "Error, solution is wrong.\n");
        exit(EXIT_FAILURE);
    }
}

【Comments】: