DTRMM and DTRSM hang on certain matrix sizes

【Question title】: DTRMM & DTRSM hangs on certain matrix sizes
【Posted】: 2013-02-05 17:07:48
【Question】:

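I am testing the performance of ?GEMM, ?TRMM, and ?TRSM using MKL's Automatic Offload on the new Intel Xeon Phi coprocessors, and I am running into problems with DTRMM and DTRSM. My code tests matrix sizes up to 10240 in steps of 1024, and performance seems to drop off significantly somewhere after N=M=K=8192. When I tried a step size of 2 to pin down exactly where, my script hung. I then checked step sizes of 512, which work fine, and 256, which also works, but anything under 256 just stalls. I cannot find any known issue that matches this problem. All of the single-precision versions work, as do both the single- and double-precision versions of ?GEMM. Here is my code: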

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>
#include <stdint.h>
#include <time.h>
#include "mkl.h"

#define DBG 0

int main(int argc, char **argv)
{
   char transa = 'N', side = 'L', uplo = 'L', diag = 'U';
   MKL_INT N, NP; // N = M, N, K, lda, ldb, ldc
   double alpha = 1.0; // Scaling factors 
   double *A, *B; // Matrices 
   size_t matrix_bytes; // Matrix size in bytes (size_t: 8*N*N overflows int for N >= 16384)
   int matrix_elements; // Matrix size in elements
   int i, j; // Counters
   int msec;
   clock_t start, diff;

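   // Require the matrix dimension on the command line
   if (argc < 2)
   {
      fprintf(stderr, "Usage: %s N\n", argv[0]);
      return -1;
   }
   N = atoi(argv[1]);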

   start = clock();

   matrix_elements = N * N;
   matrix_bytes = sizeof(double) * matrix_elements;

   // Allocate the matrices
   A = malloc(matrix_bytes);
   if (A == NULL)
   {
      printf("Could not allocate matrix A\n");
      return -1;
   }

   B = malloc(matrix_bytes);
   if (B == NULL)
   {
      printf("Could not allocate matrix B\n");
      return -1;
   }

   for (i = 0; i < matrix_elements; i++)
   {
      A[i] = 0.0;
      B[i] = 0.0;
   }

   // Initialize the matrices
   for (i = 0; i < N; i++)
      for (j = 0; j <= i; j++)
      {
         A[i+N*j] = 1.0;
         B[i+N*j] = 2.0;
      }

   // DTRMM call
   dtrmm(&side, &uplo, &transa, &diag, &N, &N, &alpha, A, &N, B, &N);

   diff = clock() - start;
   msec = diff * 1000 / CLOCKS_PER_SEC;
   printf("%f\n", msec * 1e-3); // Convert milliseconds to seconds

   if (DBG == 1)
   {
      printf("\nMatrix dimension is set to %d \n\n", (int)N);

      // Display the result
      printf("\nResulting matrix B:\n");
      if (N > 10)
      {
         printf("NOTE: B is too large, print only upper-left 10x10 block...\n");
         NP = 10;
      }
      else
         NP = N;

      printf("\n");
      for (i = 0; i < NP; i++)
      {
         for (j = 0; j < NP; j++)
            printf("%7.3f ", B[i + j * N]);
         printf("\n");
      }
   }

   // Free the matrix memory
   free(A);
   free(B);

   return 0;
}

Any help or insight would be greatly appreciated.

【Question comments】:

Tags: c matrix-multiplication blas

【Answer 1】:

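This phenomenon has been discussed extensively in other questions, as well as in Intel's Software Optimization Manual and in Agner Fog's notes.

Typically, you are hitting a perfect storm of evictions in the memory hierarchy, so that suddenly (nearly) every access misses the cache and/or the TLB. Exactly which resource is being missed can be determined by looking at the specific data access pattern or by using the PMCs (performance monitoring counters). I can work through the calculation later when I am near a whiteboard, unless Mystical gets to you first.

You can also search my or Mystical's earlier answers for previous discussions of this.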

【Comments】:

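I'll start searching through your answers. Thanks for the reply!
Actually, if you could point me to the relevant question, that would be much appreciated. There are 26 pages of answers to browse!
There is some discussion of naive matrix multiplication here: (stackoverflow.com/questions/7905760/…). The mechanism in your case is somewhat different, because MKL does do cache blocking, but the phenomenon you are seeing is essentially the same. I'll add more detail later today.
Mystical also talks about this issue here: stackoverflow.com/questions/9515482/…
We may be on different pages. I'm not too worried about the performance drop (the linked discussions make perfect sense). What bothers me is that N=8192 takes about 10 seconds with 100% of the work offloaded to the MIC, while N=8292 doesn't run at all. It just hangs. No error or anything, it just sits there. This is not a slowdown caused by cache size, it is a complete stall.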

【Answer 2】:

The problem was an older version of the Intel icc compiler (beta update 10, I believe... maybe). The gold update works like a charm.

【Comments】: