【问题标题】：OpenMP on a 2-socket system2 插槽系统上的 OpenMP
【发布时间】：2014-08-04 20:04:09
【问题描述】：

我在 C++ 中进行了一些科学计算，并尝试利用 OpenMP 来并行化一些循环。到目前为止，这运作良好，例如在具有 8 个线程的 Intel i7-4770 上。

设置

我们有一个小型工作站，它在一个主板上包含两个 Intel CPU (E5-2680v2)。该代码只要在 1 个 CPU 上运行，线程数不限，就可以工作。但是一旦我使用第二个 CPU，我就会不时地观察到不正确的结果（大约每 50 到 100 次我运行代码）。即使我只使用 2 个线程并将它们分配给两个不同的 CPU，也会发生这种情况。由于我们有 5 个这样的工作站（都是相同的），我在每个工作站上运行了代码，都显示了这个问题。

工作站在 OpenSuse 13.1、内核 3.11.10-7 上运行。 g++ 4.8.1 和 4.9.0 以及 Intel 的 icc 13.1.3.192 存在此问题（尽管 icc 不经常出现此问题，但仍然存在）。

症状

症状可描述如下：

我有一大堆 std::complex:std::complex<double>* mFourierValues;
在循环中，我访问并设置每个元素。每次迭代访问一个不同的元素，所以我没有并发访问（我检查了这个）：mFourierValues[idx] = newValue;
如果我之后将设置的数组值与输入值进行比较，大致为mFourierValues[idx] == newValue，此检查有时会失败（尽管并非每次结果最终都不正确）。

所以症状看起来像我在没有任何同步的情况下同时访问元素。但是，当我将索引存储在 std::vector 中时（使用适当的 #pragma omp critical），所有指标都是唯一的并且在正确的范围内。

问题

经过几天的调试，我越来越怀疑发生了其他事情，并且我的代码是正确的。在我看来，当 CPU 将缓存与主存同步时，似乎发生了一些奇怪的事情。

因此，我的问题是：

OpenMP 甚至可以用于这样的系统吗？（我还没有找到拒绝的消息来源。）
是否存在针对这种情况的已知错误（我在错误跟踪器中没有发现任何错误）？
您认为问题可能出在哪里？
- 我的代码（在 1 个多核 CPU 上似乎运行良好！），
- 编译器（gcc，icc 两者！），
- 操作系统，
- 硬件（所有 5 个工作站都有缺陷？）

代码

[编辑：旧代码已删除，见下文]

用最小的例子编辑

好的，我终于能够生成一个更短（且自洽）的代码示例。

关于代码

保留一些内存空间。对于堆栈上的数组，可以这样访问：complex<double> mAllElements[tensorIdx][kappa1][kappa2][kappa3]。 IE。我有 3 个 rank-3-tensor (tensorIdx)。每个张量代表一个 3 维数组，由 kappa1、kappa2 和 kappa3 索引。
我有 4 个嵌套循环（在所有 4 个索引中），而 kappa1 循环是并行化的循环（并且是最外层的循环）。它们位于DoComputation()。
在main()中，我调用DoComputation()一次得到一些参考值，然后我调用了几次，比较结果。它们应该完全匹配，但有时它们不匹配。

不幸的是，代码仍然有 190 行左右。我试图进一步简化它（只有 1 个等级为 1 的张量，等等），但后来我再也无法重现这个问题。我猜它的出现是因为内存访问是非对齐的（tensorIdx 上的循环是最里面的循环）（我知道，这远非最佳。）

此外，需要在适当的地方进行一些延迟，以重现该错误。这就是nops() 电话的原因。没有它们，代码运行速度会快很多，但到目前为止还没有显示出问题。

请注意，我再次检查了关键部分CalcElementIdx()，并认为它是正确的（每个元素都被访问一次）。我还运行了 valgrind 的 memcheck、helgrind 和 drd（使用正确的重新编译的 libgomp），所有三个都没有出错。

输出

每隔一到第三次启动程序，我就会遇到一两个不匹配的问题。示例输出：

41      Is exactly 0
42      Is exactly 0
43      Is exactly 0
44      Is exactly 0
45      348496
46      Is exactly 0
47      Is exactly 0
48      Is exactly 0
49      Is exactly 0

gcc 和 icc 也是如此。

我的问题

我的问题是：下面的代码对您来说是否正确？（除了明显的设计缺陷。）（如果太长，我会尝试进一步减少它，但如上所述我到目前为止失败了。）

代码

代码是用

编译的

g++ main.cc -O3 -Wall -Wextra -fopenmp

或

icc main.cc -O3 -Wall -Wextra -openmp

当在 2 个 CPU 上运行时，两个版本都显示了所描述的问题，总共 40 个线程。我无法在 1 个 CPU（以及我喜欢的任意数量的线程）上观察到错误。

// File: main.cc
#include <cmath>
#include <iostream>
#include <fstream>
#include <complex>
#include <cassert>
#include <iomanip>
#include <omp.h>

using namespace std;


// If defined: We add some nops in certain places, to get the timing right.
// Without them, I haven't observed the bug.
#define ENABLE_NOPS

// The size of each of the 3 tensors is: GRID_SIZE x GRID_SIZE x GRID_SIZE
static const int GRID_SIZE = 60;

//=============================================
// Produces several nops. Used to get correct "timings".

//----
template<int N> __attribute__((always_inline)) inline void nop()
{
    nop<N-1>();
    asm("nop;");
}

//----
template<> inline void nop<0>() { }

//----
__attribute__((always_inline)) inline void nops()
{
    nop<500>(); nop<500>(); nop<500>(); nop<500>(); nop<500>(); nop<500>(); nop<500>(); nop<500>(); nop<500>();
}




//=============================================
/*
Memory layout: We have 3 rank-3-tensors, i.e. 3 arrays of dimension 3.
The layout looks like this: complex<double> allElements[tensorIdx][kappa1][kappa2][kappa3];
The kappas represent the indices into a certain tensor, and are all in the interval [0; GRID_SIZE-1].
*/
class MemoryManagerFFTW
{
public:
  //---------- Constructor ----------
  MemoryManagerFFTW()
  {
    mAllElements = new complex<double>[GetTotalNumElements()];
  }

  //---------- Destructor ----------
  ~MemoryManagerFFTW() 
  { 
    delete[] mAllElements; 
  }

  //---------- SetElement ----------
  void SetElement(int tensorIdx, int kappa1, int kappa2, int kappa3, const complex<double>& newVal)
  {
    // Out-of-bounds error checks are done in this function.
    const size_t idx = CalcElementIdx(tensorIdx, kappa1, kappa2, kappa3);

    // These nops here are important to reproduce the bug.
#if defined(ENABLE_NOPS)
    nops();
    nops();
#endif

    // A flush makes the bug appear more often.
    // #pragma omp flush
    mAllElements[idx] = newVal;

    // This was never false, although the same check is false in DoComputation() from time to time.
    assert(newVal == mAllElements[idx]);
  }

  //---------- GetElement ----------
  const complex<double>& GetElement(int tensorIdx, int kappa1, int kappa2, int kappa3)const
  {  
    const size_t idx = CalcElementIdx(tensorIdx, kappa1, kappa2, kappa3);
    return mAllElements[idx];
  }


  //---------- CalcElementIdx ----------
  size_t CalcElementIdx(int tensorIdx, int kappa1, int kappa2, int kappa3)const
  {
    // We have 3 tensors (index by "tensorIdx"). Each tensor is of rank 3. In memory, they are placed behind each other.
    // tensorStartIdx is the index of the first element in the tensor.
    const size_t tensorStartIdx = GetNumElementsPerTensor() * tensorIdx;

    // Index of the element relative to the beginning of the tensor. A tensor is a 3dim. array of size GRID_SIZE x GRID_SIZE x GRID_SIZE
    const size_t idxInTensor = kappa3 + GRID_SIZE * (kappa2 + GRID_SIZE * kappa1);

    const size_t finalIdx = tensorStartIdx + idxInTensor;
    assert(finalIdx < GetTotalNumElements());

    return finalIdx;
  }


  //---------- GetNumElementsPerTensor & GetTotalNumElements ----------
  size_t GetNumElementsPerTensor()const { return GRID_SIZE * GRID_SIZE * GRID_SIZE; }
  size_t GetTotalNumElements()const { return NUM_TENSORS * GetNumElementsPerTensor(); }



public:
  static const int NUM_TENSORS = 3; // The number of tensors.
  complex<double>* mAllElements; // All tensors. An array [tensorIdx][kappa1][kappa2][kappa3]
};




//=============================================
void DoComputation(MemoryManagerFFTW& mSingleLayerManager)
{
  // Parallize outer loop.
  #pragma omp parallel for
  for (int kappa1 = 0; kappa1 < GRID_SIZE; ++kappa1)
  {
    for (int kappa2 = 0; kappa2 < GRID_SIZE; ++kappa2)
    {
      for (int kappa3 = 0; kappa3 < GRID_SIZE; ++kappa3)
      {    
#ifdef ENABLE_NOPS
        nop<50>();
#endif
        const double k2 = kappa1*kappa1 + kappa2*kappa2 + kappa3*kappa3;
        for (int j = 0; j < 3; ++j)
        {
          // Compute and set new result.
          const complex<double> curElement = mSingleLayerManager.GetElement(j, kappa1, kappa2, kappa3);
          const complex<double> newElement = exp(-k2) * k2 * curElement;

          mSingleLayerManager.SetElement(j, kappa1, kappa2, kappa3, newElement);

          // Check if the results has been set correctly. This is sometimes false, but _not_ always when the result is incorrect.
          const complex<double> test = mSingleLayerManager.GetElement(j, kappa1, kappa2, kappa3);
          if (test != newElement)
            printf("Failure: (%g, %g) != (%g, %g)\n", test.real(), test.imag(), newElement.real(), newElement.imag());
        }
      }
    }
  }
}



//=============================================
int main()
{
  cout << "Max num. threads: " << omp_get_max_threads() << endl;

  // Call DoComputation() once to get a reference-array.
  MemoryManagerFFTW reference;
  for (size_t i = 0; i < reference.GetTotalNumElements(); ++i)
    reference.mAllElements[i] = complex<double>((double)i, (double)i+0.5);
  DoComputation(reference);

  // Call DoComputation() several times, and each time compare the result to the reference.
  const size_t NUM = 1000;
  for (size_t curTry = 0; curTry < NUM; ++curTry)
  {
    MemoryManagerFFTW mSingleLayerManager;
    for (size_t i = 0; i < mSingleLayerManager.GetTotalNumElements(); ++i)
      mSingleLayerManager.mAllElements[i] = complex<double>((double)i, (double)i+0.5);
    DoComputation(mSingleLayerManager);

    // Get the max. difference. This *should* be 0, but isn't from time to time.
    double maxDiff = -1;
    for (size_t i = 0; i < mSingleLayerManager.GetTotalNumElements(); ++i)
    {
      const complex<double> curDiff = mSingleLayerManager.mAllElements[i] - reference.mAllElements[i];
      maxDiff = max(maxDiff, max(curDiff.real(), curDiff.imag()));
    }

    if (maxDiff != 0)
      cout << curTry << "\t" << maxDiff << endl;
    else
      cout << curTry << "\t" << "Is exactly 0" << endl;
  }

  return 0;
}

编辑

从下面的 cmets 和 Zboson 的回答中可以看出，内核 3.11.10-7 中存在错误。更新到 3.15.0-1 后，我的问题消失了，代码可以正常工作了。

【问题讨论】：

您所拥有的称为 NUMA 系统，OpenMP 可用于 NUMA 系统。我不知道你为什么还有问题。几乎听起来您应该使用#pragma omp flush，但没有人向我展示何时应该使用#pragam omp flush 的好例子。顺便说一句，您可以将特殊情况 kappa1,2,3,=0 拉出循环，但我认为它不会帮助解决您的问题。
我同意@Zboson - 如果没有复制器，我们无法回答您的问题，但 OpenMP 经常跨套接字使用，这将是编译器的 OpenMP 实现（或主板内存管理硬件）弄错了。因此，即使您仔细查看，我仍然怀疑您的代码中存在细微的内存问题。继续尝试获取复制器，和/或仔细查看描述 OpenMP 内存模型的文档。
我建议您使用 Intel 的 Inspector XE（以前称为 Intel Thread Checker）之类的工具运行您的代码，并让它检查（请原谅双关语）程序的数据竞争行为。也可以使用 Valgrind 包中的 Helgrind tool，但它不能完全理解 OpenMP 运行时，并且可能会产生一些误报。
@Gugi，也许内核 3.11.10-7 中存在错误？您可以尝试更新 Linux 内核或在 USB 上安装具有不同内核的 Linux，然后从 USB 启动并测试您的代码。
这听起来很像处理 TLB 缓存失效的错误。搜索“Linux 3.11 TLB 缓存”我找到了this bug in 3.11。我强烈建议您按照 Z boson 的建议去做，并尝试使用不同的内核版本。

标签： c++ multithreading synchronization openmp

【解决方案1】：

问题是由于 Linux 内核 3.11.10-7 中的一个错误造成的。 The bug may be due to how the kernel handles invalidating the TLB cache 正如 Hristo Iliev 所指出的。我猜测内核可能是问题所在，因为我读到Linux Kernel 3.15 for NUMA systems 会有一些改进，所以我认为内核版本对于 NUMA 系统很重要。

当 OP 将他的 NUMA 系统的 Linux 内核更新到 3.15.0-1 时，问题就消失了。

【讨论】：