NumPy template matching SQDIFF with `sliding_window_view`: answers

【Question title】: NumPy template matching SQDIFF with `sliding_window_view`
【Posted】: 2021-09-08 08:25:15
【Question】:

SQDIFF is defined as in the OpenCV definition. (I believe they omitted the channels.)

In basic NumPy and Python it should look like this:

import numpy as np
import cv2 as cv

A = np.arange(27, dtype=np.float32)
A = A.reshape(3,3,3) # The "image"
B = np.ones([2, 2, 3], dtype=np.float32) # window
rw, rh = A.shape[0] - B.shape[0] + 1, A.shape[1] - B.shape[1] + 1 # End result size
result = np.zeros([rw, rh])
for i in range(rw):
    for j in range(rh):
        w = A[i:i + B.shape[0], j:j + B.shape[1]]
        res =  B - w
        result[i, j] = np.sum(
            res ** 2
        )
cv_result = cv.matchTemplate(A, B, cv.TM_SQDIFF) # this result is the same as the simple for loops
assert np.allclose(cv_result, result)

This is a relatively slow solution. I have read about sliding_window_view, but I can't get it to work correctly.

# This will fail with these large arrays but is ok for smaller ones
A = np.random.rand(1028, 1232, 3).astype(np.float32)
B = np.random.rand(248, 249, 3).astype(np.float32)
locations = np.lib.stride_tricks.sliding_window_view(A, B.shape)
sqdiff = np.sum((B - locations) ** 2, axis=(-1,-2, -3, -4)) # This will fail with normal sized images

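Even though the result itself easily fits in memory, this fails with a MemoryError. How can I use this faster approach to produce a result similar to the cv2.matchTemplate function?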
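For reference, here is a minimal sketch of the same `sliding_window_view` expression on the small 3x3x3 example from above, where it fits in memory. It assumes NumPy 1.20 or later (where `sliding_window_view` is available) and only illustrates the shapes involved.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

A = np.arange(27, dtype=np.float32).reshape(3, 3, 3)  # the "image"
B = np.ones((2, 2, 3), dtype=np.float32)               # the template

# One template-sized window per output position: the first three axes index
# the window position, the last three axes index inside the window.
locations = sliding_window_view(A, B.shape)
print(locations.shape)  # (2, 2, 1, 2, 2, 3)

# Summing over the last four axes leaves one value per window position.
# (B - locations) is a full temporary array, which is what blows up for
# large inputs.
sqdiff = np.sum((B - locations) ** 2, axis=(-1, -2, -3, -4))
print(sqdiff.shape)  # (2, 2)
```

The view itself is cheap; the memory cost comes from the temporary `B - locations`, whose size grows with the product of the output size and the template size.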

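【Comments】:

Can you post your "normal sized image" and normal sized window? The example you posted doesn't reproduce the MemoryError...
I have now generated a source and a template with np.random.rand.
Now it is reproducible. I get an error: numpy.core._exceptions.MemoryError: Unable to allocate 530. GiB for an array with shape (781, 984, 1, 248, 249, 3) and data type float32. NumPy tries to allocate memory to store the entire result of (B - locations) at once.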

Tags: python numpy opencv stride

【Answer 1】:

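SQDIFF is equivalent to a combination of cross-correlation terms, in which the 'star' operation is cross-correlation, 1_[m, n] is a window the size of the template, and 1_[k, l] is a window the size of the image.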
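Expanding the square in the SQDIFF sum gives one form that is consistent with the symbols described here. The names $T$ (template, size $m \times n$), $I$ (image, size $k \times l$), and $R$ (the SQDIFF map) are assumptions made for this sketch, and the identity is meant for positions where the template lies fully inside the image:

```latex
\begin{aligned}
R(x, y) &= \sum_{x', y'} \bigl( T(x', y') - I(x + x', y + y') \bigr)^2 \\
        &= \sum_{x', y'} T(x', y')^2
           \;-\; 2 \sum_{x', y'} T(x', y')\, I(x + x', y + y')
           \;+\; \sum_{x', y'} I(x + x', y + y')^2 \\
        &= \bigl( T^2 \star 1_{[k, l]} \bigr)(x, y)
           \;-\; 2\, \bigl( T \star I \bigr)(x, y)
           \;+\; \bigl( 1_{[m, n]} \star I^2 \bigr)(x, y)
\end{aligned}
```

The first term is the same constant at every position (the sum of the squared template), the second is the ordinary cross-correlation of template and image, and the third is a sliding-window sum of the squared image.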

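You can compute the cross-correlation terms with `scipy.signal.correlate` and find matches by looking for local minima in the squared-difference map.
You may also want to apply some non-minimum suppression. This solution needs orders of magnitude less memory.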
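A minimal sketch of this idea, assuming the squared difference is expanded into a template-energy term, a cross term, and a windowed image-energy term. The function name `sqdiff_via_correlation` is hypothetical, and the inputs are cast to float64 here only to limit rounding error; this is an illustration, not a tuned implementation.

```python
import numpy as np
from scipy.signal import correlate


def sqdiff_via_correlation(image, template):
    """SQDIFF map computed from cross-correlations (hypothetical helper)."""
    image = image.astype(np.float64)
    template = template.astype(np.float64)

    # Sum of T^2: identical for every window position, so a scalar suffices.
    template_energy = np.sum(template ** 2)

    # Sum of I^2 under each window: correlate I^2 with a template-sized
    # window of ones.
    window = np.ones_like(template)
    image_energy = correlate(image ** 2, window, mode="valid")

    # Cross term: sum of T * I under each window.
    cross = correlate(image, template, mode="valid")

    # mode="valid" leaves a channel axis of length 1; drop it.
    return (template_energy - 2.0 * cross + image_energy)[..., 0]


A = np.arange(27, dtype=np.float32).reshape(3, 3, 3)
B = np.ones((2, 2, 3), dtype=np.float32)
result = sqdiff_via_correlation(A, B)

# Location of the best match (global minimum of the squared-difference map).
best = np.unravel_index(np.argmin(result), result.shape)
print(result.shape, best)
```

The temporaries here are roughly the size of the image, not the product of the output size and the template size.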

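For more help, please post a reproducible example with an image and a template that are meaningful for the algorithm. Using noise leads to meaningless output.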

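【Comments】:

Thanks for looking into this! I am trying to figure out how to do this with sliding_window_view, because I feel I am not the only one struggling to remove double loops in NumPy for this kind of computation. If this were for a real application, I would just use OpenCV or build a kernel myself. The inputs are for demonstration only.
I am fairly sure you cannot use 'sliding_window_view' for this, because it creates every possible template-sized view into the image. That is the reason for the large memory requirement. In other words, it is not possible. It is also unnecessary, because the approach I proposed does the same thing more efficiently and contains no explicit loops.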

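【Answer 2】:

As a last resort, you can perform the computation in tiles instead of all at once.

np.lib.stride_tricks.sliding_window_view returns a view of the data, so it does not consume a lot of RAM.

The expression B - locations cannot be a view, and needs RAM to store an array of float32 elements with shape (781, 984, 1, 248, 249, 3).

The total RAM needed to store B - locations is 781*984*1*248*249*3*4 = 569,479,908,096 bytes.

To avoid storing all of B - locations in RAM at once, we can compute sqdiff in tiles, where the computation for each tile needs far less RAM.

A simple tiling is to use each row as a tile: loop over the rows of sqdiff and compute the output row by row.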
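The arithmetic above can be checked directly from the image and template shapes. This small sketch uses only the shapes given in the question and assumes float32 elements (4 bytes each).

```python
import math

import numpy as np

image_shape = (1028, 1232, 3)
template_shape = (248, 249, 3)

# Number of window positions along each axis: size - window + 1.
out_shape = tuple(a - b + 1 for a, b in zip(image_shape, template_shape))

# (B - locations) holds one template-sized block per window position.
full_shape = out_shape + template_shape
n_bytes = math.prod(full_shape) * np.dtype(np.float32).itemsize

print(full_shape)                      # (781, 984, 1, 248, 249, 3)
print(n_bytes)                         # 569479908096
print(round(n_bytes / 2**30, 1), "GiB")
```

This matches the roughly 530 GiB reported in the MemoryError message.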

Example:

sqdiff = np.zeros((locations.shape[0], locations.shape[1]), np.float32)  # Allocate an array for storing the result.

# Compute sqdiff row by row instead of computing all at once.
for i in range(sqdiff.shape[0]):
    sqdiff[i, :] = np.sum((B - locations[i, :, :, :, :, :]) ** 2, axis=(-1, -2, -3, -4))
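The same idea extends to tiles of several rows, which trades a little more memory for fewer Python-level iterations. The sketch below is one possible variant, not part of the original answer; the function name `sqdiff_tiled` and the `rows_per_tile` parameter are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def sqdiff_tiled(A, B, rows_per_tile=4):
    """Compute SQDIFF in blocks of output rows (hypothetical helper)."""
    locations = sliding_window_view(A, B.shape)  # a view, no data copied
    out = np.empty(locations.shape[:2], dtype=np.float32)

    for start in range(0, out.shape[0], rows_per_tile):
        stop = min(start + rows_per_tile, out.shape[0])
        tile = locations[start:stop]  # still a view into A
        # Only (B - tile) is materialized: rows_per_tile rows at a time.
        out[start:stop] = np.sum((B - tile) ** 2, axis=(-1, -2, -3, -4))
    return out


rng = np.random.default_rng(0)
A = rng.random((9, 8, 3)).astype(np.float32)
B = rng.random((3, 4, 3)).astype(np.float32)
print(sqdiff_tiled(A, B).shape)  # (7, 5)
```

With `rows_per_tile=1` this reduces to the row-by-row loop shown above; larger values multiply the size of the temporary array accordingly.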

Executable code example:

import numpy as np
import cv2

A = np.random.rand(1028, 1232, 3).astype(np.float32)
B = np.random.rand(248, 249, 3).astype(np.float32)
locations = np.lib.stride_tricks.sliding_window_view(A, B.shape)

cv_result = cv2.matchTemplate(A, B, cv2.TM_SQDIFF)  # this result is the same as the simple for loops

#sqdiff = np.sum((B - locations) ** 2, axis=(-1, -2, -3, -4))  # This will fail with normal sized images

sqdiff = np.zeros((locations.shape[0], locations.shape[1]), np.float32)  # Allocate an array for storing the result.

# Compute sqdiff row by row instead of computing all at once.
for i in range(sqdiff.shape[0]):
    sqdiff[i, :] = np.sum((B - locations[i, :, :, :, :, :]) ** 2, axis=(-1, -2, -3, -4))

assert np.allclose(cv_result, sqdiff)

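I know the solution is a bit disappointing... but it is the only general solution I could find.

【Comments】:

I tried this approach, and the performance is even worse than the double loop. What a pity, especially since OpenCV takes 0.22 seconds while these take ~200 seconds.
Yes, it is slow... I guess OpenCV uses a trick to reduce the number of computations.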