【Question title】: What causes the performance difference with OpenMP between an array of pointers and a pointer to an array?
【Posted】: 2019-11-21 11:07:56
【Question】:

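I wrote two programs in C that perform a tall-skinny matrix-matrix multiplication with OpenMP. The algorithm is memory bound on my machine. In one program I store the matrices as an array of pointers (aop). In the other I use a single array in which the rows of the matrix are stored one after another; I will call this version pta from now on. I observe that pta always outperforms the aop version. In particular, when going from 6 to 12 cores, the performance of aop drops slightly, while the performance of pta doubles. I cannot really explain this behavior; I am just assuming that the cores somehow interfere with each other during the calculation. Can somebody explain this behavior?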

Pointer-to-array version (pta):

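#define _GNU_SOURCE   // needed for sched_getcpu() on glibc
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[])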
{
// parallel region to verify that pinning works correctly
#pragma omp parallel
  {
    printf("OpenMP thread %d / %d runs on core %d\n", omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
  }

  //define dimensions
  int dim_n=atoi(*(argv+1));
  int dim_nb=2;
  printf("n = %d, nb = %d\n",dim_n,dim_nb);

  //allocate space for matrices M, V and W
  //each matrix is one contiguous block, stored row by row
  //the size of double depends on the compiler and machine

  double *M = malloc((dim_nb*dim_nb) * sizeof(double));

  //Initialize Matrix M
  for(int i=0; i<dim_nb; i++)
  {
    for(int j=0; j<dim_nb; j++)
    {
      M[i*dim_nb+j]=((i+1)-1.0)*dim_nb+(j+1)-1.0;
    }
  }

  double *V = malloc((dim_n*dim_nb) * sizeof(double));
  double *W = malloc((dim_n*dim_nb) * sizeof(double));


// using parallel region to Initialize the matrix V
#pragma omp parallel for schedule(static)
  for (int i=0; i<dim_n; i++)
  {
    for (int j=0; j<dim_nb; j++)
    {
      V[i*dim_nb+j]=j+1;
    }
  }

  int max_iter=100;
  double time = omp_get_wtime();

  // calculate the matrix-matrix product VM product max_iter times
  for(int iter=0; iter<max_iter; iter++)
  {
  // calculate matrix-matrix product in parallel
#pragma omp parallel for schedule(static)
    // i < #rows of V
    for(int i=0; i<dim_n; i++)
    {
      // j < #columns of M
      for(int j=0; j<dim_nb; j++)
      {
        // Initialize W_ij with zero every time W_ij is calculated
        W[i*dim_nb+j]=0;
        // k < #columns of V = #rows of M
        for(int k=0; k<dim_nb; k++)
        {
          W[i*dim_nb+j] += V[i*dim_nb+k]*M[k*dim_nb+j];
        }
      }
    }
  }
  time=omp_get_wtime()-time;

Array-of-pointers version (aop):

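#define _GNU_SOURCE   // needed for sched_getcpu() on glibc
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[])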
{
// parallel region to verify that pinning works correctly
#pragma omp parallel
  {
    printf("OpenMP thread %d / %d runs on core %d\n", omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
  }

  //define dimensions
  int dim_n=atoi(*(argv+1));
  int dim_nb=2;
  printf("n = %d, nb = %d\n",dim_n,dim_nb);

  //allocate space for matrices M, V and W
  //each element of M is a pointer to the first element of a row
  //the sizes of double and double* depend on the compiler and machine
  double **M = malloc(dim_nb * sizeof(double *));
  for(int i = 0; i < dim_nb; i++)
  {
    M[i] = malloc(dim_nb * sizeof(double));
  }


  //Initialize Matrix M
  for(int i=0; i<dim_nb; i++)
  {
    for(int j=0; j<dim_nb; j++)
    {
      M[i][j]=((i+1)-1.0)*dim_nb+(j+1)-1.0;
    }
  }

  double **V = malloc(dim_n * sizeof(double *));
  for(int i=0; i<dim_n; i++)
  {
    V[i] = malloc(dim_nb * sizeof(double));
  }

  double **W = malloc(dim_n * sizeof(double *));
  for(int i=0; i<dim_n; i++)
  {
    W[i] = malloc(dim_nb * sizeof(double));
  }


// using parallel region to Initialize the matrix V
#pragma omp parallel for schedule(static)
  for (int i=0; i<dim_n; i++)
  {
    for (int j=0; j<dim_nb; j++)
    {
      V[i][j]=j+1;
    }
  }

  int max_iter=100;
  double time = omp_get_wtime();

  // calculate the matrix-matrix product VM product max_iter times
  for(int iter=0; iter<max_iter; iter++)
  {
  // calculate matrix-matrix product in parallel
#pragma omp parallel for schedule(static)
    // i < #rows of V
    for(int i=0; i<dim_n; i++)
    {
      // j < #columns of M
      for(int j=0; j<dim_nb; j++)
      {
        // Initialize W_ij with zero every time W_ij is calculated
        W[i][j]=0;
        // k < #columns of V = #rows of M
        for(int k=0; k<dim_nb; k++)
        {
          W[i][j] += V[i][k]*M[k][j];
        }
      }
    }
  }
  time=omp_get_wtime()-time;

【Question comments】:

Tags: c performance pointers malloc openmp


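【Solution 1】:

This is straightforward to explain: the array-of-pointers version has to load the row pointer first and then dereference it. Those memory locations may be far apart from each other, so the cached data is also more likely to be evicted before it is reused. The data of a flat array is stored in one block of memory, so fewer memory accesses are needed and the CPU is less likely to miss the cache.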

https://godbolt.org/z/c_8c7c

【Comments】:

  • Why is the cache more likely to be flushed?
  • @Robin Because of cache eviction, I would assume.