C++中绝对差计算的可变块大小和答案

【问题标题】：Variable block size sum of absolute difference calculation in C++C++中绝对差计算的可变块大小和
【发布时间】：2011-11-11 20:32:28
【问题描述】：

我想在 C++ 程序中使用 16 位整数的二维数组尽可能高效地执行可变块大小 sum of absolute difference 计算。我对实时块匹配代码感兴趣。我想知道是否有任何可用的软件库来执行此操作？该代码在 Windows XP 上运行，我被困在使用 Visual Studio 2010 进行编译。 CPU 是 2 核 AMD Athlon 64 x2 4850e。

通过绝对差值的可变块大小和 (SAD) 计算，我的意思如下。

我有一个较小的二维数组，我称之为template_grid，还有一个较大的二维数组，我称之为image。我想在图像中找到使模板中的像素与图像中区域中的像素之间的绝对差之和最小的区域。

在 C++ 中计算 SAD 的最简单方法如下：

for(int shiftY = 0; shiftY < rangeY; shiftY++) {
    for(int shiftX = 0; shiftX < rangeX; shiftX++) {
        for(int x = 0; x < lenTemplateX; x++) {
            for(int y = 0; y < lenTemplateY; y++) {
                SAD[shiftY][shiftX]=abs(template_grid[x][y] - image[y + shiftY][x + shiftX]);
            }
        }
    }
}

针对特定阵列大小的 SAD 计算已在英特尔性能原语库中进行了优化。但是，我正在使用的数组不适合这些库中的大小。

我使用了两个搜索范围，

大范围：rangeY = 45，rangeX = 10

一个小范围：rangeY = 4, rangeX = 2

只有一种模板尺寸，它是： lenTemplateY = 61, lenTemplateX = 7

【问题讨论】：

可变块大小的绝对差之和。我认为标题就足够了。
lenTemplateX、lenTemplateY、rangeX、rangeY的典型值有多大？
你的意思是SAD[shiftY][shiftX] +=abs(template[x][y] - image[x + shiftX][y + shiftY])? （除了+=，还有tem[x][y]-img[**y**][**x**]之间的区别）

标签： c++ algorithm computer-vision

【解决方案1】：

小优化：

for(int shiftY = 0; shiftY < rangeY; shiftY++) {
  for(int shiftX = 0; shiftX < rangeX; shiftX++) {
    // if you can assume SAD is already filled with 0-es, 
    // you don't need the next line
    SAD[shiftX][shiftY]=0;
    for(int tx = 0, imx=shiftX; x < lenTemplateX; tx++,imx++) {
      for(int ty = 0, imy=shiftY; y < lenTemplateY; ty++,imy++) {
        // two increments of imx/imy may be cheaper than 
        // two addition with offsets
        SAD[shiftY][shiftX]+=abs(template_grid[tx][ty] - image[imx][imy]);
      }
    }
  }
}

使用 C++ 模板展开循环

对于您的配置来说可能是一个疯狂的想法（C++ 编译器让我担心），但它可能工作。我不提供任何保证，但请尝试一下。

这个想法可能会奏效，因为您的 template_grid 大小和范围是恒定的 - 因此在编译时已知。
此外，要使其工作，您的 image 和 template_grid 必须使用相同的布局进行组织（列第一或行第一） - 问题中描述的“示例代码”的方式将SAD x/y 与template_grid y/x 混合在一起。
在下文中，我将假设一个“列优先”的组织，因此SAD[ix] 表示您的SAD** 矩阵的ix^th 列。代码与“行优先”相同，只是变量的名称与值数组的含义不匹配。

那么，让我们开始吧：

template <
  typename sad_type, typename val_type,
  size_t template_len
> struct sad1D_simple {
  void operator()(
    const val_type* img, const val_type* templ,
    sad_type& result
  ) {
    // template specialization recursion, with one less element to add
    sad1D_simple<sad_type, val_type, template_len-1> one_shorter;
    // call it incrementing the img and template offsets
    one_shorter(img+1, templ+1, result);
    // the add the contribution of the first diff we skipped over above
    result+=abs(*(img+template_len-1)-*(templ+template_len-1));
  }
};

// at len of 0, the result is zero. We need it to stop the
template <
  typename sad_type, typename val_type
>
struct sad1D_simple<sad_type, val_type, 0> {
  void operator()(
    const val_type* img, const val_type* templ,
    sad_type& result
  ) {
    result=0;
  }
};

为什么是函子结构 - 带有运算符的结构？ C++ 不允许函数模板的部分特化。
sad1D_simple 的作用：展开一个for 循环，该循环计算输入中两个数组的SAD，没有任何偏移，基于template_grid 数组的长度是编译时已知的常量这一事实。这与“使用 C++ 模板计算编译时间的阶乘”相同

这有什么帮助？
下面代码中的使用示例：

typedef ulong SAD_t;
typedef int16_t pixel_val_t;

const size_t lenTemplateX = 7; // number of cols in the template_grid
const size_t lenTemplateY = 61;
const size_t rangeX=10, rangeY=45;

pixel_val_t **image, **template_grid;
SAD_t** SAD;
// assume those are initialized somehow


for(size_t tgrid_col=0; tgrid_col<lenTemplateX; tgrid_col++) {
  pixel_val_t* template_col=template_grid[tgrid_col];
  // the X axis - horizontal - is the column axis, right?
  for(size_t shiftX=0; shiftX < rangeX; shiftX++) {
    pixel_val_t* img_col=image[shiftX];
    for(size_t shiftY = 0; shiftY < rangeY; shiftY++) {
      // the Y axis - vertical - is the "offset in a column"=row, isn't it?
      pixel_val_t* img_col_offsetted=img_col+shiftY;

      // this functor is made by recursive specialization
      // there's no cycle inside it, it was unrolled into
      // lenTemplateY individual subtractions, abs-es and additions 
      sad1D_simple<SAD_t, pixel_val_t, lenTemplateY> calc;
      calc(img_col_offsetted, template_col, SAD[shiftX][shiftY]);
    }
  }
}

嗯……我们能做得更好吗？不，它不会是 X 轴展开，我们仍然希望停留在 1D 区域，但是......好吧，也许如果我们创建一个范围 sad1D 并在同一轴上展开另一个循环？
它会起作用如果frangeX 也是恒定的。

template <
  typename sad_type, typename val_type,
  size_t range, size_t template_len
> struct sad1D_ranged {
  void operator()(
    const val_type* img, const val_type* templ,
    // result is assumed to have at least `range` slots
    sad_type* result
  ) {
    // we'll compute here the first slot of the result
    sad1D_simple<sad_type, val_type, template_len>
      calculator_for_first_sad;
    calculator_for_first_sad(img, templ, *(result));

    // now, ask for a recursive specialization for 
    // the next (range-1) sad-s
    sad1D_ranged<sad_type, val_type, range-1, template_len>
       one_less_in_range;
    // when calling, pass the shifted img and result
    one_less_in_range(img+1, templ, result+1);
  }
};

// for a range of 0, there's nothing to do, but we need it
// to stop the template specialization recursion
template <
  typename sad_type, typename val_type,
  size_t template_len
> struct sad1D_ranged<sad_type, val_type, 0, template_len> {
  void operator()(
    const val_type* img, const val_type* templ,
    // result is assumed to have at least `range` slots
    sad_type* result
  ) {
  }
};

这是你如何使用它的：

for(size_t tgrid_col=0; tgrid_col<lenTemplateX; tgrid_col++) {
  pixel_val_t* template_col=template_grid[tgrid_col];
  for(size_t shiftX=0; shiftX < rangeX; shiftX++) {
    pixel_val_t* img_col=image[shiftX];
    SAD_t* sad_col=SAD[shiftX];

    sad1D_ranged<SAD_t, pixel_val_t, rangeY, lenTemplateY> calc;
    calc(img_col, template_col, sad_col);
  }
}

是的...但问题是：这会提高性能吗？
见鬼，如果我知道。对于循环内的少量循环和强数据局部性（值彼此接近，以便它们位于 CPU 缓存中），循环展开应该会提高性能。对于更多的循环，您可能会对 CPU 分支预测和其他 mumbo-jumbo-I-know-may-impact-performance-but-I-don't-know-how 产生负面影响。

胆量：即使相同的展开技术可能适用于其他两个循环，使用它很可能会导致性能下降：我们需要从一个连续向量（image 列）跳转到另一个 - 整个图像可能不适合 CPU 缓存。

注意：如果您的template_grid 数据也是常量（或者您有一组有限的常量模板网格），则可以更进一步，创建带有专用掩码的结构函子。但是我今天没有动力了。

【讨论】：

【解决方案2】：

您可以尝试使用 OpenCV 模板与平方差参数匹配，请参阅教程here。 OpenCV 使用 OpenCL 进行了优化，但我不知道这个特定的功能。我觉得你应该试一试。

【讨论】：

【解决方案3】：

我不确定您对使用 SAD 的限制有多少，或者您是否通常有兴趣在图像中找到与模板最匹配的区域。在最后一种情况下，您可以使用卷积而不是 SAD。这可以在 O(N log N) 的傅立叶域中解决，包括傅立叶变换 (FFT)。

简而言之，您可以使用 FFT（例如使用http://www.fftw.org/）将模板和图像都转换到频域，然后将它们相乘，然后再转换回时域。

当然，如果你一定要使用 SAD，这一切都无关紧要。

【讨论】：

这是错误的。他不是在用模板过滤图像之后。
SAD 大约是 L1 标准。您建议的是 L2 Norm 中最接近的补丁。有区别（如果有完美匹配，它们会发生碰撞，否则它们可能不会）。