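Fastest Gaussian blur implementation: answers

【Question title】: Fastest Gaussian blur implementation
【Posted】: 2010-09-11 00:47:19
【Question description】:

How do I implement the fastest Gaussian blur algorithm?

I am going to implement it in Java, so GPU solutions are ruled out. My application, planetGenesis, is cross-platform, so I don't want JNI.

【Question discussion】:

It is worth looking at (or just using!) the GaussianFilter from JH Labs - jhlabs.com/ip/filters/index.html - I have used it and it is very fast.
Have a look here: github.com/RoyiAvital/FastGuassianBlur

Tags: java image-processing filtering gaussian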

【Solution 1】:

You should use the fact that a Gaussian kernel is separable, i.e. you can express a 2D convolution as a combination of two 1D convolutions.
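To make the separability point concrete, here is a minimal sketch in plain Java. The class name SeparableGaussian, the kernel radius of ceil(3 * sigma), the single-channel float layout, and the clamp-to-edge border handling are all assumptions made for illustration, not something the answer specifies.

```java
final class SeparableGaussian {
    /** Builds a normalized 1D Gaussian kernel; the radius ceil(3 * sigma) is an assumed cutoff. */
    static float[] kernel(double sigma) {
        int radius = (int) Math.ceil(3 * sigma);
        float[] k = new float[2 * radius + 1];
        double sum = 0;
        for (int i = -radius; i <= radius; i++) {
            double v = Math.exp(-(i * i) / (2 * sigma * sigma));
            k[i + radius] = (float) v;
            sum += v;
        }
        for (int i = 0; i < k.length; i++) {
            k[i] /= sum; // normalize so the weights add up to 1
        }
        return k;
    }

    /** One 1D pass along rows (horizontal = true) or along columns, clamping at the borders. */
    static float[] pass(float[] src, int w, int h, float[] k, boolean horizontal) {
        int r = k.length / 2;
        float[] dst = new float[src.length];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                float acc = 0;
                for (int t = -r; t <= r; t++) {
                    int xx = horizontal ? Math.min(w - 1, Math.max(0, x + t)) : x;
                    int yy = horizontal ? y : Math.min(h - 1, Math.max(0, y + t));
                    acc += k[t + r] * src[yy * w + xx];
                }
                dst[y * w + x] = acc;
            }
        }
        return dst;
    }

    /** 2D blur as a row pass followed by a column pass. */
    static float[] blur(float[] src, int w, int h, double sigma) {
        float[] k = kernel(sigma);
        return pass(pass(src, w, h, k, true), w, h, k, false);
    }
}
```

For a kernel with 2r + 1 taps, the two passes cost about 2 * (2r + 1) multiply-adds per pixel, compared with (2r + 1)^2 for a single square kernel.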

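If the filter is large, it may also make sense to use the fact that convolution in the spatial domain is equivalent to multiplication in the frequency (Fourier) domain. This means you can take the Fourier transform of the image and of the filter, multiply the (complex) results, and then take the inverse Fourier transform. The complexity of the FFT (fast Fourier transform) is O(n log n), while the complexity of direct convolution is O(n^2). Also, if you need to blur many images with the same filter, you only need to take the FFT of the filter once.

If you decide to go with an FFT, the FFTW library is a good choice.

【Discussion】:

Note also that the set of Gaussian functions is closed under the Fourier transform: the Fourier transform of a Gaussian is just another Gaussian.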
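For reference, this is the transform pair behind that remark, written for the continuous case with one common convention for the Fourier transform. The symbols g_sigma and omega are chosen here for illustration.

```latex
g_\sigma(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2/(2\sigma^2)}
\quad\Longrightarrow\quad
\hat g_\sigma(\omega) = \int_{-\infty}^{\infty} g_\sigma(x)\, e^{-i\omega x}\, dx
                      = e^{-\sigma^2 \omega^2 / 2}
```

A wide Gaussian in space (large sigma) therefore corresponds to a narrow Gaussian in frequency, with standard deviation 1/sigma. For a sampled, truncated kernel this holds only approximately.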

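【Solution 2】:

Math jocks probably know this, but for everyone else...

Because of a nice mathematical property of the Gaussian, you can blur a 2D image quickly by first running a 1D Gaussian blur on each row of the image and then running a 1D blur on each column.

【Discussion】:

Thanks for translating "You should use the fact that a Gaussian kernel is separable, i.e. you can express a 2D convolution as a combination of two 1D convolutions." (Dima)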

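【Solution 3】:

The ultimate solution

I was very confused by so much information and so many implementations, and I didn't know which one to trust. After figuring it out, I decided to write my own article. I hope it saves you hours of time.

Fastest Gaussian Blur (in linear time)

It contains the source code, which (I hope) is short, clean, and easy to rewrite in any other language. Please vote it up so other people can see it.

【Discussion】:

I made an RGBA version of your code to compare speed and quality against StackBlur. Here is the code: pastebin.com/mS0fNYFF - but I must say StackBlur is still faster, and it handles the boundary conditions in a better way (not sure if I am missing something, but I see some overflow in your code)
What do you mean by StackBlur? If you mean the "accumulator" algorithm, I use it in Algorithm 4.
StackBlur is a quasi-Gaussian blur algorithm and, at least as far as I know, one of the fastest non-box blur algorithms. The result of a single pass lies between a box blur and a Gaussian; if you need it for visual effects rather than scientific image analysis, the result should be good enough.
@IvanKuckir I can't get it to work. Could you provide an example of how to call your method in an HTML page? (Badly needed)
@IvanKuckir Could you take a look at this: Fastest Gaussian Blur Not Working?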

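【Solution 4】:

I found Quasimondo : Incubator : Processing : Fast Gaussian Blur. This method contains a lot of approximations, such as using integers and lookup tables instead of floats and floating-point divisions. I don't know how much of a speedup that gives in modern Java code.
Fast Shadows on Rectangles has an approximating algorithm that uses B-splines.
Fast Gaussian Blur Algorithm in C# claims to have some cool optimizations.
Also, Fast Gaussian Blur (PDF) by David Everly offers a fast method for Gaussian blur processing.

I will try out the various methods, benchmark them, and post the results here.

For my purposes, I copied from the Internet and implemented the basic method (processing the X and Y axes independently) and David Everly's Fast Gaussian Blur method. Their parameters differ, so I could not compare them directly. However, for a large blur radius the latter goes through far fewer iterations. Also, the latter is an approximate algorithm.

【Discussion】: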

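【Solution 5】:

You probably want a box blur, which is much faster. See this link for a great tutorial and some copy & paste C code.

【Discussion】:

How do I relate the STD of the Gaussian kernel to the length of the box blur?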
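One common way to relate the two is to match variances. A box filter that averages w consecutive samples has variance (w^2 - 1)/12, and the variances of successive passes add up. Assuming n passes of equal width w (the symbols here are chosen for illustration):

```latex
\sigma^2 = n \cdot \frac{w^2 - 1}{12}
\quad\Longrightarrow\quad
w = \sqrt{\frac{12\,\sigma^2}{n} + 1}
```

This is the same expression that the Java code in Solution 7 computes as idealFilterWidth. Since w is rarely an odd integer, that code mixes two neighboring odd widths across the passes.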

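【Solution 6】:

For larger blur radii, try applying a box blur three times. This approximates a Gaussian blur very well and is much faster than a true Gaussian blur.

【Discussion】:

Basically. If you want a "blur diameter" of 20, apply box blurs of diameter 7, 7, and 6. This produces a blur similar to a single box blur of diameter 20, but better looking.
AFAIK Photoshop does this instead of a true Gaussian blur.
By the way, BoxBlurRadius = 0.39 * GaussBlurRadius.
@IvanKuckir: I don't understand where you got that number 0.39 from.
@Jaan Neither do I. I came up with it by trial, and it works! It must be something like the integral that "converts" a 2D Gaussian (volume) into a box while keeping the same expected value. Check my answer below!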

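【Solution 7】:

I have converted Ivan Kuckir's fast Gaussian blur implementation to Java. The resulting process is O(n), as he states on his own blog. If you want to know more about why three passes of box blur approximate a Gaussian blur (to within 3%), my friend, you can look up box blur and Gaussian blur.

Here is the Java implementation. The methods are meant to live inside a class, and they need imports for java.awt.image.BufferedImage and java.util.ArrayList.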

@Override
public BufferedImage ProcessImage(BufferedImage image) {
    int width = image.getWidth();
    int height = image.getHeight();

    int[] pixels = image.getRGB(0, 0, width, height, null, 0, width);
    int[] changedPixels = new int[pixels.length];

    FastGaussianBlur(pixels, changedPixels, width, height, 12);

    BufferedImage newImage = new BufferedImage(width, height, image.getType());
    newImage.setRGB(0, 0, width, height, changedPixels, 0, width);

    return newImage;
}

private void FastGaussianBlur(int[] source, int[] output, int width, int height, int radius) {
    ArrayList<Integer> gaussianBoxes = CreateGausianBoxes(radius, 3);
    BoxBlur(source, output, width, height, (gaussianBoxes.get(0) - 1) / 2);
    BoxBlur(output, source, width, height, (gaussianBoxes.get(1) - 1) / 2);
    BoxBlur(source, output, width, height, (gaussianBoxes.get(2) - 1) / 2);
}

private ArrayList<Integer> CreateGausianBoxes(double sigma, int n) {
    double idealFilterWidth = Math.sqrt((12 * sigma * sigma / n) + 1);

    int filterWidth = (int) Math.floor(idealFilterWidth);

    if (filterWidth % 2 == 0) {
        filterWidth--;
    }

    int filterWidthU = filterWidth + 2;

    double mIdeal = (12 * sigma * sigma - n * filterWidth * filterWidth - 4 * n * filterWidth - 3 * n) / (-4 * filterWidth - 4);
    double m = Math.round(mIdeal);

    ArrayList<Integer> result = new ArrayList<>();

    for (int i = 0; i < n; i++) {
        result.add(i < m ? filterWidth : filterWidthU);
    }

    return result;
}

private void BoxBlur(int[] source, int[] output, int width, int height, int radius) {
    System.arraycopy(source, 0, output, 0, source.length);
    BoxBlurHorizontal(output, source, width, height, radius);
    BoxBlurVertical(source, output, width, height, radius);
}

private void BoxBlurHorizontal(int[] sourcePixels, int[] outputPixels, int width, int height, int radius) {
    int resultingColorPixel;
    // Normalize by the full window width, 2 * radius + 1 (same as BoxBlurVertical).
    float iarr = 1f / (radius + radius + 1);
    for (int i = 0; i < height; i++) {
        int outputIndex = i * width;
        int li = outputIndex;
        int sourceIndex = outputIndex + radius;

        int fv = Byte.toUnsignedInt((byte) sourcePixels[outputIndex]);
        int lv = Byte.toUnsignedInt((byte) sourcePixels[outputIndex + width - 1]);
        // Prime the window as if the first pixel were repeated radius + 1 times to the left.
        float val = (radius + 1) * fv;

        for (int j = 0; j < radius; j++) {
            val += Byte.toUnsignedInt((byte) (sourcePixels[outputIndex + j]));
        }

        // Left edge: radius + 1 outputs, so that every pixel in the row gets written.
        for (int j = 0; j <= radius; j++) {
            val += Byte.toUnsignedInt((byte) sourcePixels[sourceIndex++]) - fv;
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex++] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
        }

        for (int j = (radius + 1); j < (width - radius); j++) {
            val += Byte.toUnsignedInt((byte) sourcePixels[sourceIndex++]) - Byte.toUnsignedInt((byte) sourcePixels[li++]);
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex++] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
        }

        for (int j = (width - radius); j < width; j++) {
            val += lv - Byte.toUnsignedInt((byte) sourcePixels[li++]);
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex++] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
        }
    }
}

private void BoxBlurVertical(int[] sourcePixels, int[] outputPixels, int width, int height, int radius) {
    int resultingColorPixel;
    float iarr = 1f / (radius + radius + 1);
    for (int i = 0; i < width; i++) {
        int outputIndex = i;
        int li = outputIndex;
        int sourceIndex = outputIndex + radius * width;

        int fv = Byte.toUnsignedInt((byte) sourcePixels[outputIndex]);
        int lv = Byte.toUnsignedInt((byte) sourcePixels[outputIndex + width * (height - 1)]);
        float val = (radius + 1) * fv;

        for (int j = 0; j < radius; j++) {
            val += Byte.toUnsignedInt((byte) sourcePixels[outputIndex + j * width]);
        }
        for (int j = 0; j <= radius; j++) {
            val += Byte.toUnsignedInt((byte) sourcePixels[sourceIndex]) - fv;
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
            sourceIndex += width;
            outputIndex += width;
        }
        for (int j = radius + 1; j < (height - radius); j++) {
            val += Byte.toUnsignedInt((byte) sourcePixels[sourceIndex]) - Byte.toUnsignedInt((byte) sourcePixels[li]);
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
            li += width;
            sourceIndex += width;
            outputIndex += width;
        }
        for (int j = (height - radius); j < height; j++) {
            val += lv - Byte.toUnsignedInt((byte) sourcePixels[li]);
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
            li += width;
            outputIndex += width;
        }
    }
}

【Discussion】:

It works well, but the resulting image is black and white. How can I make it color?
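The output is grayscale because the code casts each packed pixel to a byte, which keeps only the lowest 8 bits (the blue channel in the int layout returned by getRGB), and then writes that single value into red, green, and blue. One possible fix is to blur each channel separately and repack the result. The sketch below is not part of the original answer: ChannelwiseBlur and its blurChannel parameter are hypothetical names, and blurChannel stands for any single-channel blur that takes and returns values in 0..255.

```java
import java.util.function.UnaryOperator;

final class ChannelwiseBlur {
    /** Applies a single-channel blur (values 0..255) to each ARGB channel independently. */
    static int[] apply(int[] argb, UnaryOperator<int[]> blurChannel) {
        int[] out = new int[argb.length];
        // shift = 0 is blue, 8 is green, 16 is red, 24 is alpha
        for (int shift = 0; shift <= 24; shift += 8) {
            int[] channel = new int[argb.length];
            for (int i = 0; i < argb.length; i++) {
                channel[i] = (argb[i] >>> shift) & 0xFF;
            }
            int[] blurred = blurChannel.apply(channel);
            for (int i = 0; i < argb.length; i++) {
                int v = Math.max(0, Math.min(255, blurred[i])); // clamp to one byte
                out[i] |= v << shift;
            }
        }
        return out;
    }
}
```

With the code above, blurChannel could wrap the three box-blur passes, provided the box-blur methods are changed to store the plain channel value instead of packing it into a gray pixel.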

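【Solution 8】:

I would consider using CUDA or some other GPU programming toolkit for this, especially if you want to use larger kernels. Failing that, there is always hand-tweaking your loops in assembly.

【Discussion】:

【Solution 9】:

Step 1: SIMD 1-dimensional Gaussian blur
Step 2: transpose
Step 3: repeat step 1

It is best done on small blocks, as a full-image transpose is slow, while a small-block transpose can be done extremely fast using a chain of PUNPCKs (PUNPCKHBW, PUNPCKHDQ, PUNPCKHWD, PUNPCKLBW, PUNPCKLDQ, PUNPCKLWD).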
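The PUNPCK instructions cannot be issued directly from Java, but the blocking idea carries over. Below is a rough scalar sketch of a tiled transpose; the class name TiledTranspose and the tile parameter are made up for illustration, and a good tile size would have to be found by measurement.

```java
final class TiledTranspose {
    /** Transposes a w-by-h row-major image tile by tile; the result is h wide and w tall. */
    static float[] transpose(float[] src, int w, int h, int tile) {
        float[] dst = new float[src.length];
        for (int by = 0; by < h; by += tile) {
            for (int bx = 0; bx < w; bx += tile) {
                int yEnd = Math.min(h, by + tile);
                int xEnd = Math.min(w, bx + tile);
                // Within one tile, reads and writes both stay in a small memory region.
                for (int y = by; y < yEnd; y++) {
                    for (int x = bx; x < xEnd; x++) {
                        dst[x * h + y] = src[y * w + x];
                    }
                }
            }
        }
        return dst;
    }
}
```

The three steps then become: blur the rows, transpose, blur the rows again, and transpose back if the original orientation is needed.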

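【Discussion】:

【Solution 10】:

In 1D:

Blurring repeatedly with almost any kernel tends toward a Gaussian kernel. This is what is so cool about the Gaussian distribution, and why statisticians like it. So choose something that is easy to blur with and apply it several times.

For example, it is easy to blur with a box-shaped kernel. First compute a cumulative sum: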

y(i) = y(i-1) + x(i)

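Then:

blurred(i) = y(i+radius) - y(i-radius)

Repeat several times. To keep the overall brightness unchanged, divide each result by the number of samples in the window (2 * radius for the difference above).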
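A minimal Java sketch of the cumulative-sum idea follows. It differs slightly from the formula above: it uses a symmetric window of 2 * radius + 1 samples, and near the borders it averages over whatever part of the window lies inside the array. Both are choices made here for illustration, as is the class name RunningSumBoxBlur.

```java
final class RunningSumBoxBlur {
    /** One box-blur pass computed from a prefix sum; the cost per sample does not depend on radius. */
    static double[] onePass(double[] x, int radius) {
        int n = x.length;
        double[] prefix = new double[n + 1]; // prefix[i] = x[0] + ... + x[i - 1]
        for (int i = 0; i < n; i++) {
            prefix[i + 1] = prefix[i] + x[i];
        }
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            int lo = Math.max(0, i - radius);
            int hi = Math.min(n - 1, i + radius);
            out[i] = (prefix[hi + 1] - prefix[lo]) / (hi - lo + 1);
        }
        return out;
    }

    /** Repeats the box blur; a few passes already look close to a Gaussian. */
    static double[] blur(double[] x, int radius, int passes) {
        double[] cur = x.clone();
        for (int p = 0; p < passes; p++) {
            cur = onePass(cur, radius);
        }
        return cur;
    }
}
```

For example, RunningSumBoxBlur.blur(signal, 4, 3) runs three passes with a 9-sample window.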

Or you might run any of a variety of IIR filters back and forth a few times; these are similarly fast.
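As a rough illustration of the back-and-forth idea, here is a first-order recursive smoother run once forward and once backward. This is deliberately the simplest possible IIR filter, not the Deriche or van Vliet design; the class name RecursiveSmoother and the parameter alpha (between 0 and 1, smaller means more blur) are assumptions for this sketch.

```java
final class RecursiveSmoother {
    /** First-order low-pass filter applied forward, then backward, over the same buffer. */
    static double[] forwardBackward(double[] x, double alpha) {
        int n = x.length;
        double[] y = new double[n];
        if (n == 0) {
            return y;
        }
        y[0] = x[0]; // start from the first sample so flat signals stay flat
        for (int i = 1; i < n; i++) {
            y[i] = alpha * x[i] + (1 - alpha) * y[i - 1];
        }
        for (int i = n - 2; i >= 0; i--) {
            y[i] = alpha * y[i] + (1 - alpha) * y[i + 1];
        }
        return y;
    }
}
```

Each sample costs a constant number of operations no matter how strong the blur is. A single forward-backward pass gives a two-sided exponential response rather than a Gaussian, which is why the answer suggests repeating it a few times.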

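In 2D or higher:

Blur each dimension one after the other, as DarenW said.

【Discussion】:

【Solution 11】:

I have struggled with this problem in my research and tried an interesting method for a fast Gaussian blur. First, as mentioned, it is best to separate the blur into two 1D blurs, but depending on how your hardware actually computes the pixel values, you can in fact pre-compute all possible values and store them in a lookup table.

In other words, pre-compute every combination of Gaussian coefficient * input pixel value. Of course you will need to discretize your coefficients, but I just wanted to add this solution. If you have an IEEE subscription, you can read more in Fast image blurring using Lookup Table for real time feature extraction.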
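A small sketch of what such a table could look like for 8-bit pixels is shown below. It is not taken from the cited paper: the class name LutGaussian, the 16.16 fixed-point scaling, and the clamp-to-edge borders are assumptions made here.

```java
final class LutGaussian {
    /** table[t][v] holds kernel[t] * v in 16.16 fixed point for every 8-bit value v. */
    static int[][] buildTable(double sigma, int radius) {
        double[] k = new double[2 * radius + 1];
        double sum = 0;
        for (int t = -radius; t <= radius; t++) {
            k[t + radius] = Math.exp(-(t * t) / (2 * sigma * sigma));
            sum += k[t + radius];
        }
        int[][] table = new int[k.length][256];
        for (int t = 0; t < k.length; t++) {
            for (int v = 0; v < 256; v++) {
                table[t][v] = (int) Math.round(k[t] / sum * v * 65536.0);
            }
        }
        return table;
    }

    /** 1D blur of one row of 0..255 values using only table lookups and integer additions. */
    static int[] blurRow(int[] row, int[][] table) {
        int radius = table.length / 2;
        int[] out = new int[row.length];
        for (int x = 0; x < row.length; x++) {
            int acc = 0;
            for (int t = -radius; t <= radius; t++) {
                int xx = Math.min(row.length - 1, Math.max(0, x + t)); // clamp at the borders
                acc += table[t + radius][row[xx]];
            }
            out[x] = Math.min(255, (acc + 32768) >> 16); // round back to an 8-bit value
        }
        return out;
    }
}
```

Whether this beats a plain multiply on a given machine is something to benchmark; on CPUs with fast multipliers the table lookups may well be slower.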

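Ultimately, I ended up using CUDA though :)

【Discussion】:

【Solution 12】:

I have seen several answers in different places and am collecting them here so that I can try to wrap my mind around them and remember them for later:

Regardless of which approach you use, filter horizontal and vertical dimensions separately with 1D filters rather than using a single square filter.

The standard "slow" method: convolution filter
Hierarchical pyramid of images with reduced resolution, as in SIFT
Repeated box blurs, motivated by the central limit theorem. The box blur is central to Viola and Jones's face detection, where they call it an integral image if I remember correctly. I think Haar-like features use something similar, too.
Stack Blur: a queue-based alternative somewhere between the convolutional and box blur approaches
IIR filters
- Deriche filter (Wikipedia): a second-order IIR filter
- van Vliet filter: I know nothing about this one
- Bessel filters, although there is some debate about these

After reviewing all of these, I'm reminded that simple, poor approximations often work well in practice. In a different field, Alex Krizhevsky found ReLUs to be faster than the classic sigmoid function in his groundbreaking AlexNet, even though at first glance they appear to be a terrible approximation to the sigmoid.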

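【Discussion】:

【Solution 13】:

There are several fast methods for Gaussian blur of 2D data. What you should know:

It is a separable filter, so only two 1D convolutions are required.
For big kernels you can process a reduced copy of the image and then upscale the result.
A good approximation can be obtained with multiple box filters (also separable); the number of iterations and the kernel sizes can be tuned.
There are O(n) complexity algorithms (for any kernel size) that give a precise Gaussian approximation using IIR filters.

Your choice depends on the required speed, precision, and implementation complexity.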

【Solution 14】:

Try using a box blur the way I did here: Approximating Gaussian Blur Using Extended Box Blur

This is the best approximation.

Using integral images you can make it even faster.
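For readers who have not met integral images (summed-area tables), here is a minimal sketch of a plain box blur built on one. It illustrates the data structure only and is not the extended box blur from the linked article. The class name IntegralImageBoxBlur, the single-channel int layout, and the choice to average over the part of the window that lies inside the image are assumptions.

```java
final class IntegralImageBoxBlur {
    /** sat[(y + 1) * (w + 1) + (x + 1)] = sum of src over rows 0..y and columns 0..x. */
    static long[] integral(int[] src, int w, int h) {
        long[] sat = new long[(w + 1) * (h + 1)];
        for (int y = 0; y < h; y++) {
            long rowSum = 0;
            for (int x = 0; x < w; x++) {
                rowSum += src[y * w + x];
                sat[(y + 1) * (w + 1) + (x + 1)] = sat[y * (w + 1) + (x + 1)] + rowSum;
            }
        }
        return sat;
    }

    /** Box blur where every output pixel needs four table lookups, whatever the radius. */
    static int[] boxBlur(int[] src, int w, int h, int radius) {
        long[] sat = integral(src, w, h);
        int stride = w + 1;
        int[] out = new int[src.length];
        for (int y = 0; y < h; y++) {
            int y0 = Math.max(0, y - radius);
            int y1 = Math.min(h - 1, y + radius) + 1;
            for (int x = 0; x < w; x++) {
                int x0 = Math.max(0, x - radius);
                int x1 = Math.min(w - 1, x + radius) + 1;
                long sum = sat[y1 * stride + x1] - sat[y0 * stride + x1]
                         - sat[y1 * stride + x0] + sat[y0 * stride + x0];
                int area = (x1 - x0) * (y1 - y0);
                out[y * w + x] = (int) ((sum + area / 2) / area); // rounded average
            }
        }
        return out;
    }
}
```

The table is built once per image, so several box sizes can be evaluated on the same image without recomputing any sums.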
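If you do, please share your solution.

【Discussion】:

【Solution 15】:

Answering this old question with the new libraries that are available now (as of 2016), because there have been many new advances in GPU technology for Java.

As suggested in several other answers, CUDA is an alternative. But Java now has CUDA support.

IBM CUDA4J library: provides Java APIs for managing and accessing GPU devices, libraries, kernels, and memory. Using these new APIs, it is possible to write Java programs that manage GPU device characteristics and offload work to the GPU, with the convenience of the Java memory model, exceptions, and automatic resource management.

JCuda: Java bindings for NVIDIA CUDA and related libraries. With JCuda it is possible to interact with the CUDA runtime and driver API from Java programs.

Aparapi: allows Java developers to take advantage of the compute power of GPU and APU devices by executing data-parallel code fragments on the GPU rather than being confined to the local CPU.

Some Java OpenCL binding libraries

https://github.com/ochafik/JavaCL: Java bindings for OpenCL: an object-oriented OpenCL library, based on auto-generated low-level bindings

http://jogamp.org/jocl/www/: Java bindings for OpenCL: an object-oriented OpenCL library, based on auto-generated low-level bindings

http://www.lwjgl.org/: Java bindings for OpenCL: auto-generated low-level bindings and object-oriented convenience classes

http://jocl.org/: Java bindings for OpenCL: low-level bindings that are a 1:1 mapping of the original OpenCL API

All of the libraries above will help implement Gaussian blur faster than any Java implementation running on the CPU.

【Discussion】: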

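【Solution 16】:

Dave Hale from CWP has a minejtk package, which includes a recursive Gaussian filter (the Deriche method and the Van Vliet method). The Java subroutine can be found at https://github.com/dhale/jtk/blob/0350c23f91256181d415ea7369dbd62855ac4460/core/src/main/java/edu/mines/jtk/dsp/RecursiveGaussianFilter.java

Deriche's method appears to be a very good one for Gaussian blur (and also for derivatives of Gaussians).

【Discussion】:

The reason to recommend Deriche's method for Gaussian blur is that it is very accurate. See the following paper for a survey and comparison: dev.ipol.im/~getreuer/code/doc/gaussian_20131215_doc/…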