极慢的双线性插值（与 OpenCV 相比）答案

【问题标题】：Extremely slow bilinear interpolation (compared to OpenCV)极慢的双线性插值（与 OpenCV 相比）
【发布时间】：2012-12-03 00:24:39
【问题描述】：

template<typename T>
cv::Mat_<T> const bilinear_interpolation(cv::Mat_<T> const &src, cv::Size dsize,
                                     float dx, float dy)
{
    cv::Mat_<T> dst = dsize.area() == 0 ? cv::Mat_<T>(src.rows * dy, src.cols * dx) :
                                        cv::Mat_<T>(dsize);
  
    float const x_ratio = static_cast<float>((src.cols - 1)) / dst.cols;
    float const y_ratio = static_cast<float>((src.rows - 1)) / dst.rows;
    for(int row = 0; row != dst.rows; ++row)
    {
        int y = static_cast<int>(row * y_ratio);
        float const y_diff = (row * y_ratio) - y; //distance of the nearest pixel(y axis)
        float const y_diff_2 = 1 - y_diff;
        auto *dst_ptr = &dst(row, 0)[0];
        for(int col = 0; col != dst.cols; ++col)
        {
            int x = static_cast<int>(col * x_ratio);
            float const x_diff = (col * x_ratio) - x; //distance of the nearest pixel(x axis)
            float const x_diff_2 = 1 - x_diff;
            float const y2_cross_x2 = y_diff_2 * x_diff_2;
            float const y2_cross_x = y_diff_2 * x_diff;
            float const y_cross_x2 = y_diff * x_diff_2;
            float const y_cross_x = y_diff * x_diff;
            for(int channel = 0; channel != cv::DataType<T>::channels; ++channel)
            {
                *dst_ptr++ = y2_cross_x2 * src(y, x)[channel] +
                             y2_cross_x * src(y, x + 1)[channel] +
                             y_cross_x2 * src(y + 1, x)[channel] +
                             y_cross_x * src(y + 1, x + 1)[channel];
            }
        }
    }
    
    return dst;
}

这是一个双线性插值的实现，我用它来将 512 * 512 的图像（“lena.png”）放大到 2048 * 2048。完成这项工作需要 0.195 秒，但是 cv::resize（不是OpenCV 的 GPU 版本）只需要 0.026 秒。我不知道是什么让我的程序这么慢（OpenCV 比我快了将近 750%），我想看看 OpenCV 调整大小的源代码，但我找不到它的实现。

你知道为什么 OpenCV 的大小调整会这么快或者我的双线性太慢吗？

    {
        timeEstimate<> time;
        cv::Mat_<cv::Vec3b> const src = input;
        bilinear_interpolation(src, cv::Size(), dx, dy);
        std::cout << "bilinear" << std::endl;
    }

    {
        timeEstimate<> time;
        cv::Mat output = input.clone();
        cv::resize(input, output, cv::Size(), dx, dy, cv::INTER_LINEAR);
        std::cout << "bilinear cv" << std::endl;
    }

编译器：mingw4.6.2 操作系统：win7 64位处理器：英特尔® i3-2330M (2.2G)

【问题讨论】：

cv::resize 可能正在利用处理器指令集扩展，例如 SSE，它允许它并行运行多个乘法运算。要自己执行此操作，您需要智能编译器优化或手动编写 x86 程序集。 编辑： According to sources, GCC enables vectorisation with the -O3 flag.。（但是，-O3 可能会导致非常奇怪的错误，因此一般不建议使用它。）
谢谢，最近去看看opencl，希望gpgpu开发的代码可以更便携。
您可以通过考虑针对特定情况的特定实现来开始加速它，假设每个通道具有 8 位深度的典型 RGB 整数图像。然后可以使用更少的乘法和更多的按位运算来执行此插值。虽然我没有适合 cmets 的实现，但您可以在 cboard.cprogramming.com/game-programming/… 找到一个我没有检查正确性的实现。
我研究了hqx github.com/Arcnor/hqx-java/tree/master/src/hqx的类似代码，我需要将cv::Vec3b（或4b）转换为int，然后将int转换回cv::Vec3b来保存或显示图像，不是一个非常通用的解决方案，但值得试一试。

标签： c++ image algorithm opencv

【解决方案1】：

主要有两点让 OpenCV 的版本更快：

OpenCV 将调整大小实现为“可分离操作”。 IE。它分两步完成：水平拉伸图像，然后垂直拉伸。这种技术允许使用更少的算术运算来调整大小。
手工编码的 SSE 优化。

【讨论】：

我检查了第一个解决方案，但无法弄清楚如何将操作分为两步。它可以用于某些矩阵乘法和 dct，但我不知道如何将它应用于双线性.

【解决方案2】：

可能有点晚了，但还要检查一下您是否在调试模式下运行应用程序。 OpenCV 是一个库，很可能会被编译以发布 - 带有编译器优化。

【讨论】：

【解决方案3】：

我最近在一些基于 CPU 的图形代码中添加双线性升级时遇到了同样的问题。

首先，我使用以下配置运行您的代码：

操作系统：虚拟机中的 Xubuntu 20 编译器：gcc 9.3.0 OpenCV 版本：4.2.0 CPU：i3-6100u (2.3 GHz) 源位图大小：512x512 目标位图大小：2048x2048

我发现你的代码用了 92 毫秒，而 OpenCV 用了 4.2 毫秒。所以现在的差异甚至比你在 2012 年问这个问题时更大。我猜 OpenCV 从那时起优化得更多。

（此时我切换到在 Windows 中使用 Visual Studio 2013，为 x64 目标构建）。

将代码转换为使用定点算术将时间减少到 30 毫秒。定点算术很有帮助，因为将数据保留为整数。输入和输出数据是整数。必须将它们转换为浮动并再次返回是昂贵的。如果我坚持使用 GCC 9.3，我预计速度会更快，因为我通常发现它生成的代码比 VS 2013 更快。无论如何，这是代码：

typedef union {
    unsigned c;
    struct { unsigned char b, g, r, a; };
} DfColour;

typedef struct _DfBitmap {
    int width, height;
    DfColour *pixels;
} DfBitmap;

void bilinear_interpolation(DfBitmap *src, DfBitmap *dst, float scale) {
    unsigned heightRatio = (double)(1<<8) * 255.0 / scale;
    unsigned widthRatio = (double)(1<<8) * 255.0 / scale;
    int dstH = scale * src->height;
    int dstW = scale * src->width;

    // For every output pixel...
    for (int y = 0; y < dstH; y++) {
        int srcYAndWeight = (y * heightRatio) >> 8;
        int srcY = srcYAndWeight >> 8;

        DfColour *dstPixel = &dst->pixels[y * dst->width];
        DfColour *srcRow = &src->pixels[srcY * src->width];

        unsigned weightY2 = srcYAndWeight & 0xFF;
        unsigned weightY = 256 - weightY2;

        for (int x = 0; x < dstW; x++, dstPixel++) {
            // Perform bilinear interpolation on 2x2 src pixels.

            int srcXAndWeight = (x * widthRatio) >> 8;
            int srcX = srcXAndWeight >> 8;

            unsigned r = 0, g = 0, b = 0;
            unsigned weightX2 = srcXAndWeight & 0xFF;
            unsigned weightX = 256 - weightX2;

            // Pixel 0,0
            DfColour *srcPixel = &srcRow[srcX];
            unsigned w = (weightX * weightY) >> 8;
            r += srcPixel->r * w;
            g += srcPixel->g * w;
            b += srcPixel->b * w;

            // Pixel 1,0
            srcPixel++;
            w = (weightX2 * weightY) >> 8;
            r += srcPixel->r * w;
            g += srcPixel->g * w;
            b += srcPixel->b * w;

            // Pixel 1,1
            srcPixel += src->width;
            w = (weightX2 * weightY2) >> 8;
            r += srcPixel->r * w;
            g += srcPixel->g * w;
            b += srcPixel->b * w;

            // Pixel 0,1
            srcPixel--;
            w = (weightX * weightY2) >> 8;
            r += srcPixel->r * w;
            g += srcPixel->g * w;
            b += srcPixel->b * w;

            dstPixel->r = r >> 8;
            dstPixel->g = g >> 8;
            dstPixel->b = b >> 8;
        }
    }
}

切换到更好的算法将时间减少到 19.5 毫秒。 正如 Andrey Kamaev 的回答所说，更好的算法通过将垂直和水平调整大小分成两个单独的通道来工作。目标位图用作第一遍输出的临时存储空间。第二遍中的 X 遍历是向后的，以避免覆盖它即将需要的数据。代码如下：

void bilinear_interpolation(DfBitmap *src, DfBitmap *dst, float scale) {
    unsigned heightRatio = (double)(1<<8) * 255.0 / scale;
    unsigned widthRatio = (double)(1<<8) * 255.0 / scale;
    int dstH = scale * src->height;
    int dstW = scale * src->width;

    for (int y = 0; y < dstH; y++) {
        int srcYAndWeight = (y * heightRatio) >> 8;
        int srcY = srcYAndWeight >> 8;

        DfColour *dstPixel = &dst->pixels[y * dst->width];
        DfColour *srcRow = &src->pixels[srcY * src->width];

        unsigned weightY2 = srcYAndWeight & 0xFF;
        unsigned weightY = 256 - weightY2;

        for (int x = 0; x < src->width; x++, dstPixel++) {
            unsigned r = 0, g = 0, b = 0;

            // Pixel 0,0
            DfColour *srcPixel = &srcRow[x];
            r += srcPixel->r * weightY;
            g += srcPixel->g * weightY;
            b += srcPixel->b * weightY;

            // Pixel 1,0
            srcPixel += src->width;
            r += srcPixel->r * weightY2;
            g += srcPixel->g * weightY2;
            b += srcPixel->b * weightY2;

            dstPixel->r = r >> 8;
            dstPixel->g = g >> 8;
            dstPixel->b = b >> 8;
        }
    }

    for (int y = 0; y < dstH; y++) {
        DfColour *dstRow = &dst->pixels[y * dst->width];

        for (int x = dstW - 1; x; x--) {
            int srcXAndWeight = (x * widthRatio) >> 8;
            int srcX = srcXAndWeight >> 8;

            unsigned r = 0, g = 0, b = 0;
            unsigned weightX2 = srcXAndWeight & 0xFF;
            unsigned weightX = 256 - weightX2;

            // Pixel 0,0
            DfColour *srcPixel = &dstRow[srcX];
            r += srcPixel->r * weightX;
            g += srcPixel->g * weightX;
            b += srcPixel->b * weightX;

            // Pixel 0,1
            srcPixel++;
            r += srcPixel->r * weightX2;
            g += srcPixel->g * weightX2;
            b += srcPixel->b * weightX2;

            DfColour *dstPixel = &dstRow[x];
            dstPixel->r = r >> 8;
            dstPixel->g = g >> 8;
            dstPixel->b = b >> 8;
        }
    }
}

使用简单的便携式 SIMD 方案将时间缩短到 16.5 毫秒。 SIMD 方案不使用 SSE/AVX 等专有指令集扩展。相反，它使用 hack 允许以 32 位整数存储和操作红色和蓝色通道。它不如 AVX 实现快，但它具有简单的优点。代码如下：

void bilinear_interpolation(DfBitmap *src, DfBitmap *dst, float scale) {
    unsigned heightRatio = (double)(1<<8) * 255.0 / scale;
    unsigned widthRatio = (double)(1<<8) * 255.0 / scale;
    int dstH = scale * src->height;
    int dstW = scale * src->width;

    for (int y = 0; y < dstH; y++) {
        int srcYAndWeight = (y * heightRatio) >> 8;
        int srcY = srcYAndWeight >> 8;

        DfColour *dstPixel = &dst->pixels[y * dst->width];
        DfColour *srcRow = &src->pixels[srcY * src->width];

        unsigned weightY2 = srcYAndWeight & 0xFF;
        unsigned weightY = 256 - weightY2;

        for (int x = 0; x < src->width; x++, dstPixel++) {
            unsigned rb = 0, g = 0;

            // Pixel 0,0
            DfColour *srcPixel = &srcRow[x];
            rb += (srcPixel->c & 0xff00ff) * weightY;
            g += srcPixel->g * weightY;

            // Pixel 1,0
            srcPixel += src->width;
            rb += (srcPixel->c & 0xff00ff) * weightY2;
            g += srcPixel->g * weightY2;

            dstPixel->c = rb >> 8;
            dstPixel->g = g >> 8;
        }
    }

    for (int y = 0; y < dstH; y++) {
        DfColour *dstRow = &dst->pixels[y * dst->width];

        for (int x = dstW - 1; x; x--) {
            int srcXAndWeight = (x * widthRatio) >> 8;
            int srcX = srcXAndWeight >> 8;

            unsigned rb = 0, g = 0;
            unsigned weightX2 = srcXAndWeight & 0xFF;
            unsigned weightX = 256 - weightX2;

            // Pixel 0,0
            DfColour *srcPixel = &dstRow[srcX];
            rb += (srcPixel->c & 0xff00ff) * weightX;
            g += srcPixel->g * weightX;

            // Pixel 0,1
            srcPixel++;
            rb += (srcPixel->c & 0xff00ff) * weightX2;
            g += srcPixel->g * weightX2;

            DfColour *dstPixel = &dstRow[x];
            dstPixel->c = rb >> 8;
            dstPixel->g = g >> 8;
        }
    }
}

可以单独保留 X 轴通道，但合并 Y 轴通道。这提高了缓存的一致性并使代码更简单一些。 重新组合这两个通道将时间减少到 14.6 毫秒。这是代码：

void bilinear_interpolation(DfBitmap *src, DfBitmap *dst, float scale) {
    unsigned heightRatio = (double)(1<<8) * 255.0 / scale;
    unsigned widthRatio = (double)(1<<8) * 255.0 / scale;
    int dstH = scale * src->height;
    int dstW = scale * src->width;

    for (int y = 0; y < dstH; y++) {
        int srcYAndWeight = (y * heightRatio) >> 8;
        int srcY = srcYAndWeight >> 8;

        DfColour *dstRow = &dst->pixels[y * dst->width];
        DfColour *srcRow = &src->pixels[srcY * src->width];

        unsigned weightY2 = srcYAndWeight & 0xFF;
        unsigned weightY = 256 - weightY2;

        for (int x = 0; x < src->width; x++) {
            unsigned rb = 0, g = 0;

            // Pixel 0,0
            DfColour *srcPixel = &srcRow[x];
            rb += (srcPixel->c & 0xff00ff) * weightY;
            g += srcPixel->g * weightY;

            // Pixel 1,0
            srcPixel += src->width;
            rb += (srcPixel->c & 0xff00ff) * weightY2;
            g += srcPixel->g * weightY2;

            dstRow[x].c = rb >> 8;
            dstRow[x].g = g >> 8;
        }

        for (int x = dstW - 1; x; x--) {
            unsigned rb = 0, g = 0;

            int srcXAndWeight = (x * widthRatio) >> 8;
            int srcX = srcXAndWeight >> 8;
            unsigned weightX2 = srcXAndWeight & 0xFF;
            unsigned weightX = 256 - weightX2;

            // Pixel 0,0
            DfColour *srcPixel = &dstRow[srcX];
            rb += (srcPixel->c & 0xff00ff) * weightX;
            g += srcPixel->g * weightX;

            // Pixel 0,1
            srcPixel++;
            rb += (srcPixel->c & 0xff00ff) * weightX2;
            g += srcPixel->g * weightX2;

            dstRow[x].c = rb >> 8;
            dstRow[x].g = g >> 8;
        }
    }
}

此时代码仍然是单线程的。我的 CPU 总共有两个物理内核和 4 个线程。 OpenCV 在我的机器上使用 2 个线程。 我希望将代码转换为使用 2 个线程将时间减少到大约 8 毫秒。

我不知道还需要什么其他技巧才能将时间缩短到 4 毫秒，尽管可能需要转换为真正的 AVX SIMD 实现。

【讨论】：