Reference implementation of the vrecpeq_f32 intrinsic?

【Question title】: Reference implementation of vrecpeq_f32 intrinsic?
【Posted】: 2021-12-01 23:48:17
【Question description】:

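There is an ARM NEON intrinsic called vrecpeq_f32.

The official description of vrecpeq_f32: https://developer.arm.com/architectures/instruction-sets/intrinsics/#f:@navigationhierarchiessimdisa=[Neon]&q=vrecpeq_f32

> Floating-point Reciprocal Estimate. This instruction finds an approximate reciprocal estimate for each vector element in the source SIMD&FP register, places the result in a vector, and writes the vector to the destination SIMD&FP register.

However, that description is still not precise enough for me. Can we write a reference implementation in C/C++ that produces exactly the same results as vrecpeq_f32?

I tried calling vrecpeq_f32 and got these results:

float32x4_t v1 = {1, 2, 3, 4};
float32x4_t v_out = vrecpeq_f32(v1); // 0.99805, 0.49902, 0.33301, 0.24951

I am curious why the reciprocal of 1 comes out as 0.99805 rather than 1.0.

P.S. I am not interested in how to get a more accurate reciprocal with NEON intrinsics and tricks such as one or more Newton-Raphson iterations.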

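【Comments】:

It is documented here, under FPRecipEstimate.
@Frank Oh, I had not clicked that link until you mentioned it. The pseudocode seems rather long, though; I expected it to be shorter.
> I am curious why the reciprocal of 1 comes out as 0.99805 rather than 1.0. I suspect the result of this instruction is read from a ROM with a finite set of bins, each bin covering a range of floating-point inputs. In other words, the value 0.99805 has to serve not only the input 1.0 but also its neighbouring values. The result is therefore an approximation, not an exact value.
Because it is only an estimate, as the instruction mnemonic makes clear.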

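Tags: c++ simd intrinsics neon

【Solution 1】:

The ARM documentation provides pseudocode that details the exact algorithm being performed. Look for FPRecipEstimate, which uses the fixed-point RecipEstimate.

It may look like a lot of code, but a large part of it handles the various edge cases, operating modes, and element sizes.

> Can we write a reference implementation in C/C++ that produces exactly the same results as vrecpeq_f32?

Of course! In the end it comes down to bit manipulation, so there is no reason it would not be feasible. Translated to C++, with most of the corner-case handling and the extended-precision mode removed, it looks like this (see godbolt):

Disclaimer: this is not a complete implementation of the function. It is only enough to explore the precision behavior, assuming finite, normalized inputs and no special cases. Do not drop it into a codebase and expect it to match the instruction in general.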

#include <cstdint>
#include <cstring>
#include <iomanip>
#include <iostream>

// Convenience struct to deal with encoding and decoding ieee754 floats
struct float_parts {
    explicit float_parts(float v);
    explicit operator float() const;

    std::uint32_t sign;
    std::uint32_t fraction;
    std::uint32_t exp;
};

// Adapted from:
// https://developer.arm.com/documentation/ddi0596/2021-03/Shared-Pseudocode/Shared-Functions?lang=en#impl-shared.FPRecipEstimate.2

// RecipEstimate()
// ===============
// Compute estimate of reciprocal of 9-bit fixed-point number.
//
// a is in range 256 .. 511 representing a number in
// the range 0.5 <= x < 1.0.
// result is in the range 256 .. 511 representing a
// number in the range 1.0 to 511/256
std::uint32_t RecipEstimate(std::uint32_t a) {
    a = a*2+1;
    std::uint32_t b = (1 << 19) / a;
    return ( b + 1) / 2;
}

// FPRecipEstimate()
// =================
float FPRecipEstimate(float operand) {
    // ([...],sign,[...]) = FPUnpack(operand, [...], [...]);
    // fraction = operand<22:0> : Zeros(29);
    // exp = UInt(operand<30:23>);
    float_parts parts{operand};    

    // scaled = UInt('1':fraction<51:44>);
    std::uint32_t scaled = 0x100 | ((parts.fraction >> 15) & 0xFF) ;

    // when 32 result_exp =  253 - exp; // In range 253-254 = -1 to 253+1 = 254
    parts.exp = 253 - parts.exp;

    // // Scaled is in range 256 .. 511 representing a
    // // fixed-point number in range [0.5 .. 1.0].
    // estimate = RecipEstimate(scaled, increasedprecision);
    std::uint32_t estimate = RecipEstimate(scaled);

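    // fraction = estimate<11:0> : Zeros(40);
    // Only the low 8 bits of the estimate are kept here, because the
    // increased-precision mode has been dropped from this version.
    parts.fraction = (estimate & 0xff ) << 15;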

    return float(parts);
}

int main() {
    std::cout << std::setprecision(5) 
              << FPRecipEstimate(1.0f) << "\n"
              << FPRecipEstimate(2.0f) << "\n"
              << FPRecipEstimate(3.0f) << "\n"
              << FPRecipEstimate(4.0f);
}

float_parts::float_parts(float v) {
    std::uint32_t v_bits;
    std::memcpy(&v_bits, &v, sizeof(float));

    sign = (v_bits >> 31) & 0x1;
    fraction = v_bits & ((1 << 23) - 1);
    exp = (v_bits >> 23) & 0xff;
}

float_parts::operator float() const {
    std::uint32_t v_bits = 
        ((sign & 0x1) << 31) |
        (fraction & ((1 << 23) - 1)) |
        ((exp & 0xff) << 23);

    float result;
    std::memcpy(&result, &v_bits, sizeof(float));
    return result;
}

This produces the expected values:

0.99805
0.49902
0.33301
0.24951

【Comments】: