Optimizing RGB565 to RGB888 conversions with SSE2

【Question title】: Optimizing RGB565 to RGB888 conversions with SSE2
【Posted】: 2015-02-15 12:27:45
【Question】:

I'm trying to optimize a pixel-depth conversion from 565 to 888 using SSE2, based on the basic formulas:

col8 = col5 << 3 | col5 >> 2
col8 = col6 << 2 | col6 >> 4
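
For reference, the two formulas can be checked against a plain scalar routine. This is only a minimal sketch, not code from the original post: the `Rgb888` struct and the `rgb565_to_rgb888` name are made up for illustration, and it assumes the usual 565 layout (red in bits 15..11, green in bits 10..5, blue in bits 4..0).

```cpp
#include <cstdint>

// Illustrative output type (hypothetical, not from the original post).
struct Rgb888 {
    std::uint8_t r, g, b;
};

// Expand one RGB565 pixel to 8 bits per channel by replicating the top
// bits of each channel into the vacated low bits.
// Assumed layout: RRRRRGGG GGGBBBBB (red in the most significant bits).
inline Rgb888 rgb565_to_rgb888(std::uint16_t p) {
    const unsigned r5 = (p >> 11) & 0x1Fu;
    const unsigned g6 = (p >> 5) & 0x3Fu;
    const unsigned b5 = p & 0x1Fu;

    Rgb888 out;
    out.r = static_cast<std::uint8_t>((r5 << 3) | (r5 >> 2));  // col5 << 3 | col5 >> 2
    out.g = static_cast<std::uint8_t>((g6 << 2) | (g6 >> 4));  // col6 << 2 | col6 >> 4
    out.b = static_cast<std::uint8_t>((b5 << 3) | (b5 >> 2));  // col5 << 3 | col5 >> 2
    return out;
}
```

Replicating the top bits maps the maximum 5-bit and 6-bit values to exactly 255, which a plain left shift would not do. A routine like this is also a convenient reference for testing a vectorized version.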

I take 2x 128-bit vectors of 565 pixels as input and output 3x 128-bit vectors of 888 pixels.

After some masking, shifting, and OR'ing, I end up with two vectors, where ((blue

    BR: BR7-BR6-...-BR1-BR0
    0G: 0G7-0G6-...-0G1-0G0
                 |
                 v
  OUT1: R5-BGR4-...-BGR1-BGR0

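SSSE3 has _mm_shuffle_epi8(), which would solve my problem, but I want to restrict myself to SSE2 because of the range of hardware I need to support.

The target is little-endian.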

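【Comments】:

The second diagram isn't very clear to me. What exactly is happening there?
@harold, it is meant to show these two partial vectors being packed into the final result with RGB values (BGR, because of little-endian).

Tags: c++ rgb sse2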

【Solution 1】:

You can look at Google's libyuv project, which contains an SSE2 conversion:

https://chromium.googlesource.com/libyuv/libyuv/+/master/source/row_win.cc

// pmul method to replicate bits.
// Math to replicate bits:
// (v << 8) | (v << 3)
// v * 256 + v * 8
// v * (256 + 8)
// G shift of 5 is incorporated, so shift is 5 + 8 and 5 + 3
// 20 instructions.
__declspec(naked)
void RGB565ToARGBRow_SSE2(const uint8* src_rgb565, uint8* dst_argb,
                          int width) {
  __asm {
    mov       eax, 0x01080108  // generate multiplier to repeat 5 bits
    movd      xmm5, eax
    pshufd    xmm5, xmm5, 0
    mov       eax, 0x20802080  // multiplier shift by 5 and then repeat 6 bits
    movd      xmm6, eax
    pshufd    xmm6, xmm6, 0
    pcmpeqb   xmm3, xmm3       // generate mask 0xf800f800 for Red
    psllw     xmm3, 11
    pcmpeqb   xmm4, xmm4       // generate mask 0x07e007e0 for Green
    psllw     xmm4, 10
    psrlw     xmm4, 5
    pcmpeqb   xmm7, xmm7       // generate mask 0xff00ff00 for Alpha
    psllw     xmm7, 8
    mov       eax, [esp + 4]   // src_rgb565
    mov       edx, [esp + 8]   // dst_argb
    mov       ecx, [esp + 12]  // width
    sub       edx, eax
    sub       edx, eax
 convertloop:
    movdqu    xmm0, [eax]   // fetch 8 pixels of bgr565
    movdqa    xmm1, xmm0
    movdqa    xmm2, xmm0
    pand      xmm1, xmm3    // R in upper 5 bits
    psllw     xmm2, 11      // B in upper 5 bits
    pmulhuw   xmm1, xmm5    // * (256 + 8)
    pmulhuw   xmm2, xmm5    // * (256 + 8)
    psllw     xmm1, 8
    por       xmm1, xmm2    // RB
    pand      xmm0, xmm4    // G in middle 6 bits
    pmulhuw   xmm0, xmm6    // << 5 * (256 + 4)
    por       xmm0, xmm7    // AG
    movdqa    xmm2, xmm1
    punpcklbw xmm1, xmm0
    punpckhbw xmm2, xmm0
    movdqu    [eax * 2 + edx], xmm1  // store 4 pixels of ARGB
    movdqu    [eax * 2 + edx + 16], xmm2  // store next 4 pixels of ARGB
    lea       eax, [eax + 16]
    sub       ecx, 8
    jg        convertloop
    ret
  }
}
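
For compilers that do not accept MSVC inline assembly, the same loop can be written with SSE2 intrinsics. The following is a hedged sketch of such a port, not code taken from libyuv: the function name `rgb565_to_argb_sse2` is hypothetical, and unlike the asm above it only vectorizes full blocks of 8 pixels and finishes the remainder with a scalar tail, so it never reads or writes past `width` pixels.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical intrinsics port of the asm loop above (not from libyuv).
// Output is 4 bytes per pixel in memory order B, G, R, A with A = 0xFF.
void rgb565_to_argb_sse2(const std::uint16_t* src, std::uint8_t* dst, int width) {
    const __m128i mul5   = _mm_set1_epi16(0x0108);  // 256 + 8: repeat 5 bits
    const __m128i mul6   = _mm_set1_epi16(0x2080);  // shift by 5, then repeat 6 bits
    const __m128i mask_r = _mm_set1_epi16(static_cast<short>(0xF800));
    const __m128i mask_g = _mm_set1_epi16(0x07E0);
    const __m128i alpha  = _mm_set1_epi16(static_cast<short>(0xFF00));

    int x = 0;
    for (; x + 8 <= width; x += 8) {
        const __m128i px =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + x));

        __m128i r = _mm_and_si128(px, mask_r);  // R in upper 5 bits
        __m128i b = _mm_slli_epi16(px, 11);     // B in upper 5 bits
        r = _mm_mulhi_epu16(r, mul5);           // 8-bit R in the low byte
        b = _mm_mulhi_epu16(b, mul5);           // 8-bit B in the low byte
        const __m128i rb = _mm_or_si128(_mm_slli_epi16(r, 8), b);  // R:B words

        __m128i g = _mm_and_si128(px, mask_g);  // G in middle 6 bits
        g = _mm_mulhi_epu16(g, mul6);           // 8-bit G in the low byte
        const __m128i ag = _mm_or_si128(g, alpha);  // A:G words

        // Interleave bytes to get B, G, R, A for each pixel.
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 4 * x),
                         _mm_unpacklo_epi8(rb, ag));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 4 * x + 16),
                         _mm_unpackhi_epi8(rb, ag));
    }

    // Scalar tail for the last (width % 8) pixels.
    for (; x < width; ++x) {
        const unsigned p = src[x];
        const unsigned r5 = p >> 11, g6 = (p >> 5) & 0x3Fu, b5 = p & 0x1Fu;
        dst[4 * x + 0] = static_cast<std::uint8_t>((b5 << 3) | (b5 >> 2));
        dst[4 * x + 1] = static_cast<std::uint8_t>((g6 << 2) | (g6 >> 4));
        dst[4 * x + 2] = static_cast<std::uint8_t>((r5 << 3) | (r5 >> 2));
        dst[4 * x + 3] = 0xFF;
    }
}
```

The multiply trick works because the channel already sits in the top bits of the 16-bit word: taking the high half of the product with 0x0108 yields `(c << 3) + (c >> 2)` for a 5-bit channel, and 0x2080 yields `(c << 2) + (c >> 4)` for the 6-bit green channel after its shift of 5.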

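【Comments】:

I assume there is a version with intrinsics somewhere. That said, this looks easy to turn into standalone asm that can be ported to 64-bit or used with other compilers. It is easy to read and has comments, so I upvoted it.
We are converting this to 64-bit asm and will post it here once it is ready.
Note that the input to this function must be at least 8 pixels, so for 565 that is an array of at least 16 bytes. The array must be 16-byte aligned.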
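
The libyuv routine above writes 32-bit ARGB, whereas the question asks for packed 24-bit output without _mm_shuffle_epi8(). One possible way to drop the alpha bytes using only SSE2 is sketched below. It is an illustration under stated assumptions, not part of libyuv or the original post: the helper name `pack_bgra4_to_bgr_sse2` is hypothetical, and the input is assumed to hold 4 pixels in memory order B, G, R, A.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>
#include <cstring>

// Hypothetical helper: pack 4 pixels stored as B,G,R,A (16 bytes) into
// 12 bytes of B,G,R using SSE2 only. Writes exactly 12 bytes to dst.
inline void pack_bgra4_to_bgr_sse2(__m128i bgra, std::uint8_t* dst) {
    // Per 64-bit lane: keep bytes 0..2 of the first pixel ...
    const __m128i keep_first = _mm_set_epi32(0, 0x00FFFFFF, 0, 0x00FFFFFF);
    // ... and bytes 3..5, where the second pixel lands after a 1-byte shift.
    const __m128i keep_second =
        _mm_set_epi32(0x0000FFFF, static_cast<int>(0xFF000000u),
                      0x0000FFFF, static_cast<int>(0xFF000000u));

    // Each 64-bit lane becomes: B0 G0 R0 B1 G1 R1 00 00
    const __m128i first  = _mm_and_si128(bgra, keep_first);
    const __m128i second = _mm_and_si128(_mm_srli_epi64(bgra, 8), keep_second);
    const __m128i lanes  = _mm_or_si128(first, second);

    // Move the upper lane's 6 bytes down so they follow the lower lane's 6.
    const __m128i low  = _mm_move_epi64(lanes);                       // bytes 0..5
    const __m128i high = _mm_slli_si128(_mm_srli_si128(lanes, 8), 6); // bytes 6..11
    const __m128i packed = _mm_or_si128(low, high);

    // Store 8 + 4 bytes so nothing is written beyond the 12 output bytes.
    _mm_storel_epi64(reinterpret_cast<__m128i*>(dst), packed);
    const int tail = _mm_cvtsi128_si32(_mm_srli_si128(packed, 8));
    std::memcpy(dst + 8, &tail, sizeof tail);
}
```

This costs several shifts and masks per 4 pixels, so it will likely be slower than a single SSSE3 byte shuffle; whether it beats a scalar store loop on a given SSE2-only machine would need measuring.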