如何在GPU上找到除法的魔法乘数？答案

【问题标题】：How to find magic multipliers for divisions by constant on a GPU?如何在GPU上找到除法的魔法乘数？
【发布时间】：2021-05-09 07:26:33
【问题描述】：

我正在考虑实现以下计算，其中 divisor 是非零且不是 2 的幂

unsigned multiplier(unsigned divisor)
{
    unsigned shift = 31 - clz(divisor);
    uint64_t t = 1ull << (32 + shift);

    return t / div;
}

对于缺少 64 位整数和浮点指令但可能具有 32 位融合乘加（例如 GPU，它也缺少除法）的处理器而言，这种方式是有效的。

此计算对于查找优化除法所涉及的“魔术乘法器”很有用，当除数提前知道时，乘高指令后跟按位移位。与编译器和reference code in libdivide 中使用的代码不同，它会找到最大的乘数。

另一个转折是，在我正在查看的应用程序中，我预计 divisor 几乎总是可以在 float 类型中表示。因此，有一个有效的“快速路径”来处理这些除数，以及一个尺寸优化的“慢速路径”来处理其余部分是有意义的。

【问题讨论】：

标签： math floating-point gpu division fma

【解决方案1】：

我想出的解决方案在“快速路径”上的 6 或 8 个 FMA 操作中执行专门针对此特定场景（股息是 2 的幂）的长除法，然后使用 8 执行二进制搜索“慢路径”上的迭代。

以下程序对建议的解决方案执行详尽的测试（在支持 FMA 的 CPU 上大约需要 1-2 分钟）。

#include <math.h>
#include <stdint.h>
#include <stdio.h>

struct quomod {
    unsigned long quo;
    unsigned long mod;
};

// Divide 1 << (32 + SHIFT) by DIV, return quotient and modulus
struct quomod
quomod_ref(unsigned div, unsigned shift)
{
    uint64_t t = 1ull << (32 + shift);

    return (struct quomod){t / div, t % div};
}

// Reinterpret given bits as float
static inline float int_as_float(uint32_t bits)
{
    return (union{ unsigned b; float f; }){bits}.f;
}

// F contains integral value in range [-2**32 .. 2**32]. Convert it to integer,
// with wrap-around on overflow. If the GPU implements saturating conversion,
// it also may be used
static inline uint32_t cvt_f32_u32_wrap(float f)
{
    return (uint32_t)(long long)f;
}

struct quomod
quomod_alt(unsigned div, unsigned shift)
{
    // t = float(1ull << (32 + shift))
    float t = int_as_float(0x4f800000 + (shift << 23));

    // mask with max(0, shift - 23) low bits zero
    uint32_t mask = (int)(~0u << shift) >> 23;

    // No roundoff in conversion
    float div_f = div & mask;

    // Caution: on the CPU this is correctly rounded, but on the GPU
    // native reciprocal may be off by a few ULP, in which case a
    // refinement step may be necessary:
    // recip = fmaf(fmaf(recip, -div_f, 1), recip, recip)
    float recip = 1.f / div_f;

    // Higher part of the quotient, integer in range 2^31 .. 2^32
    float quo_hi = t * recip;

    // No roundoff
    float res = fmaf(quo_hi, -div_f, t);

    float quo_lo_approx = res * recip;

    float res2 = fmaf(quo_lo_approx, -div_f, res);

    // Lower part of the quotient, may be negative
    float quo_lo = floorf(fmaf(res2, recip, quo_lo_approx));

    // Remaining part of the dividend
    float mod_f = fmaf(quo_lo, -div_f, res);

    // Quotient as sum of parts
    unsigned quo = cvt_f32_u32_wrap(quo_hi) + (int)quo_lo;

    // Adjust quotient down if remainder is negative
    if (mod_f < 0) {
        quo--;
    }

    if (div & ~mask) {
        // The quotient was computed for a truncated divisor, so
        // it matches or exceeds the true result

        // High part of the dividend
        uint32_t ref_hi = 1u << shift;

        // Unless quotient is zero after wraparound, increment it so
        // it's higher than true quotient (its high bit must be 1)
        quo -= (int)quo >> 31;

        // Binary search for the true quotient; search invariant:
        // quo is higher than true quotient, quo-2*bit is lower
        for (unsigned bit = 256; bit; bit >>= 1) {
            unsigned try = quo - bit;
            // One multiply-high instruction
            uint32_t prod_hi = 1ull * try * div >> 32;
            if (prod_hi >= ref_hi)
                quo = try;
        }
        // quo is zero or exceeds the true quotient, so quo-1 must be it
        quo--;
    }

    // Use the "left-pointing short magic wand" operator
    // to recover the remainder
    return (struct quomod){quo, quo *- div};
}

int main()
{
    fprintf(stderr, "%66c\r[", ']');
    unsigned step = 1;
    for (unsigned div = 3; div; div += step) {
        // Progress bar
        if (!(div & 0x03ffffff)) fprintf(stderr, "=");
        // Skip powers of two
        if (!(div & (div-1))) continue;
        unsigned shift = 31 - __builtin_clz(div);

        struct quomod ref = quomod_ref(div, shift);
        struct quomod alt = quomod_alt(div, shift);

        if (ref.quo != alt.quo || ref.mod != alt.mod) {
            printf("\nerror at %u\n", div);
            return 1;
        }
    }
    fprintf(stderr, "=\nAll ok\n");
    return 0;
}

【讨论】：

这里使用了很多整数类型（不必要的）（6+）。当unsigned 是 16 位时，代码很容易失败，从而降低了可移植性。
@chux-ReinstateMonica (uint32_t)f 不起作用：当 f 为负数或大于 UINT_MAX 时，它会调用未定义的行为。函数前面的注释提到了 f 的界限。
阿莫纳科夫，同意。