Using MPI_Type_Vector and MPI_Gather, in C: answers

【Question title】: Using MPI_Type_Vector and MPI_Gather, in C
【Posted】: 2011-02-28 04:18:27
【Question】:

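I am trying to multiply square matrices in parallel with MPI.

I use an MPI_Type_vector to send square submatrices (arrays of float) to the processes so that they can compute the subproducts. Then, for the following iterations, these submatrices are sent to neighbouring processes as MPI_Type_contiguous (the whole submatrix is sent). This part works as expected, and the local results are correct.

Then I use MPI_Gather with the contiguous type to send all local results back to the root process. The problem is that the final matrix is built (obviously, given this method) line by line instead of submatrix by submatrix.

I wrote an ugly procedure to rearrange the final matrix, but I would like to know whether there is a direct way to perform the "inverse" operation of sending MPI_Type_vectors (i.e. sending an array of values and having it arranged directly as a subarray of the receiving array).

An example, to try to clarify my long text: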

A[16] and B[16] (these are really 2D arrays, A[4][4] and B[4][4]) are the 4x4 matrices to multiply; C[4][4] will contain the result; 4 processes are used (Pi, with i from 0 to 3):

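Pi gets two 2x2 submatrices: subAi[4] and subBi[4]; their product is stored locally in subCi[4].

For instance, P0 gets:

subA0[4] containing A[0], A[1], A[4] and A[5];
subB0[4] containing B[0], B[1], B[4] and B[5].

Once everything has been computed, the root process gathers all the subCi[4].

Then C[4][4] contains: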

[
subC0[0], subC0[1], subC0[2], subC0[3],
subC1[0], subC1[1], subC1[2], subC1[3],
subC2[0], subC2[1], subC2[2], subC2[3],
subC3[0], subC3[1], subC3[2], subC3[3]]

I would like it to be:

[
subC0[0], subC0[1], subC1[0], subC1[1],
subC0[2], subC0[3], subC1[2], subC1[3],
subC2[0], subC2[1], subC3[0], subC3[1],
subC2[2], subC2[3], subC3[2], subC3[3]]

without further manipulation. Does anyone know a way?
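For reference, the manual rearrangement that the question wants to avoid can be sketched in plain C as follows. This is only an illustration of the two layouts above, not the asker's actual procedure: the helper name blocks_to_row_major and its parameters are hypothetical, and it assumes that ranks are numbered row by row over the grid of blocks, as in the desired layout shown above.

```c
#include <assert.h>
#include <stddef.h>

/* Copy a buffer gathered block by block (rank 0's block first, then rank 1's,
 * and so on) into a row-major n x n matrix. Blocks are b x b, and ranks are
 * assumed to be numbered row by row over the (n/b) x (n/b) grid of blocks. */
void blocks_to_row_major(const float *gathered, float *c, size_t n, size_t b)
{
    assert(b > 0 && n % b == 0);
    size_t blocks_per_row = n / b;

    for (size_t rank = 0; rank < blocks_per_row * blocks_per_row; rank++) {
        size_t block_row = rank / blocks_per_row;
        size_t block_col = rank % blocks_per_row;

        for (size_t i = 0; i < b; i++) {
            for (size_t j = 0; j < b; j++) {
                /* position inside the gathered, block-ordered buffer */
                size_t src = rank * b * b + i * b + j;
                /* position inside the row-major n x n matrix */
                size_t dst = (block_row * b + i) * n + block_col * b + j;
                c[dst] = gathered[src];
            }
        }
    }
}
```

With n = 4 and b = 2, element subC1[0] (index 4 of the gathered buffer) lands at C[2], and subC0[2] (index 2) lands at C[4], which matches the desired layout.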

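Thanks for your advice.

Adding information in answer to "High Performance Mark":

1 Well, my initial matrices are 2D arrays (in the shape of A[4][4]). I wanted to keep things short when writing my question, and I now see that it was a bad idea...

I did define the MPI_Type_vector as follows, for the example: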

MPI_Type_vector(2, 2, 4, MPI_FLOAT, &subMatrix);

(By the way, I cannot see how a flattened array would make any difference.)
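To make the layout described by MPI_Type_vector(2, 2, 4, ...) concrete, here is a small sketch in plain C that lists the element offsets such a type selects. It is not an MPI call: vector_offsets is a hypothetical helper whose parameters simply mirror the count, blocklength and stride arguments, and it assumes a non-negative stride.

```c
#include <assert.h>
#include <stddef.h>

/* Write into out[] the element offsets selected by a vector layout made of
 * `count` blocks of `blocklength` elements, where consecutive blocks start
 * `stride` elements apart. Returns the number of offsets written. */
size_t vector_offsets(size_t count, size_t blocklength, size_t stride,
                      size_t *out)
{
    assert(out != NULL);
    size_t k = 0;
    for (size_t block = 0; block < count; block++)
        for (size_t e = 0; e < blocklength; e++)
            out[k++] = block * stride + e;
    return k;
}
```

For (2, 2, 4) the offsets are 0, 1, 4 and 5, which are the elements of subA0 listed earlier. Because a C array declared as A[4][4] is stored contiguously in row-major order, the same offsets apply whether the matrix is declared as A[16] or A[4][4]. The largest offset is 5, so the same type cannot describe a buffer that holds only 4 elements.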

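2 I am not an expert in MPI, far from it, so I might be doing strange things. Here is a bit of my code, applied to the example (only A is dealt with; B is quite similar):

Sending submatrices from the root to the worker processes: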

Master {
    for (i = 0 ; i < 2 ; i++)
        for (j = 0 ; j < 2 ; j++)
            MPI_Send(&A[j * 2][(i + j) % 2 * 2], 1, subMatrix, i + j * 2, 42, MPI_COMM_WORLD);
}

The worker processes receive:

MPI_Recv(subA, 4, MPI_FLOAT, 0, 42, MPI_COMM_WORLD, &status);

Then, exchanges between processes are done via MPI_Send and MPI_Recv of subMatrixLocal, which is:

MPI_Type_contiguous(4, MPI_FLOAT, &subMatrixLocal);

After all local operations are done, I gather all the subC matrices into C:

MPI_Gather(subC, 1, subMatrixLocal, C, 1, subMatrixLocal, 0, MPI_COMM_WORLD);

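and I obtain the result stated earlier, which I have to reorder...

About your proposed algorithm: the next step will be to do the matrix multiplications on GPUs, where square matrix products are efficient. MPI will only be used to transfer matrices from CPU to CPU. Of course, global efficiency will be tested then.

0 You said that "the same type definition should be applicable for the reverse operation". However, my MPI_Type_vector works fine on the "large" matrix, but it cannot be used directly on a submatrix (applying MPI_Type_vector(2, 2, 4) to a 2x2 matrix gives wrong results, because it takes the last two values from "outside" the defined array...). Do you mean that I should create another MPI_Type_vector and send/receive it?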

【Question comments】:

Tags: c mpi

【Answer 1】:

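The answer to your question, whether there is a direct way to perform the "inverse" operation of sending MPI_Type_vectors, is yes. If you have already defined a type vector to send a submatrix from one process to another, then the same type definition should be applicable for the reverse operation.

However, I am somewhat confused by your explanation and have some further questions for you. If you answer them, I may be able to provide better advice.

You write your matrices as A[16], B[16] and say that they are 4x4. Have you already flattened them? I expected that they would be A[4][4] etc. If you have flattened the matrices, why have you done so? You can certainly define an mpi_type_vector to describe a submatrix of a 2D matrix.
It seems a little odd to me, not necessarily wrong but odd, to match sends with gathers. I would normally expect to see gathers matched by scatters and sends by receives. Perhaps you could post enough of your code to clarify which operations you are using.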

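Finally, I think that multiplying matrices by multiplying submatrices is probably not an efficient approach with MPI. If you are doing this as an exercise, then carry on. But a better algorithm, and probably one that is easier to implement, would be:

mpi_broadcast matrix B to all processes;
the director process sends rows of A to the worker processes, in a loop;
a worker process computes a row of C and sends it back to the director process;
the director process receives rows of C and puts them in the right place.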
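The worker's part of the steps above can be sketched in plain C as follows. This is only the local computation for one row; the broadcast and the send/receive calls around it are left out, so it says nothing about how the answer's author would organise the communication. The name multiply_row and the row-major n x n layout are assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Compute one row of C = A * B for row-major n x n matrices.
 * a_row points to the n entries of one row of A, b to the whole of B,
 * and c_row receives the n entries of the matching row of C. */
void multiply_row(const float *a_row, const float *b, float *c_row, size_t n)
{
    assert(a_row != NULL && b != NULL && c_row != NULL);
    for (size_t j = 0; j < n; j++) {
        float sum = 0.0f;
        for (size_t k = 0; k < n; k++)
            sum += a_row[k] * b[k * n + j];
        c_row[j] = sum;
    }
}
```

Since each row of C is contiguous in a row-major matrix, the director can store a returned row directly at C + row * n, so no reordering step is needed in this scheme.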

【Comments】:

I have edited my post to try to answer your questions. Thanks for your attention.

【Answer 2】:

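I know this question was asked a long time ago, but I think it has not been answered optimally yet, and I recently stumbled across the same problem.

You need to do two things. First, use two MPI_Datatypes, one for sending and one for receiving. The send type (stype in my code example) has a stride equal to the number of local elements in a row (nloc in my code), which means that you could just as well construct it with MPI_Type_contiguous. The receiving process in the gather, however, has to place the data in an array with rows of length n=nloc*nproc, so you have to create the receive type with MPI_Type_vector.

Here is the crucial part, which took me some time to figure out (I eventually got the answer from Gilles Gouaillardet on the OpenMPI mailing list: https://www.mail-archive.com/users@lists.open-mpi.org//msg34678.html).

To place the incoming matrices at the correct offsets, you have to set the "extent" of the receive datatype to nloc elements (nloc*sizeof(int) bytes in my code), because this is the offset of the first element of the next block. To do this, you use MPI_Type_create_resized to obtain a new datatype from rtype (rtype_resized in my code).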
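The role of the extent can be illustrated with a small model in plain C that does not call MPI. It assumes, as described above, that the root places the block from rank r at r times the extent of the receive type, and that inside a block the receive type is a vector of m rows of nloc elements with stride n. The names model_gather and default_vector_extent are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Model of where a gather puts data at the root: the block from rank r starts
 * at element r * extent, and inside a block the receive layout is m rows of
 * nloc elements, with rows n elements apart. `blocks` holds the nproc local
 * m x nloc blocks back to back; `global` is the m x n result. */
void model_gather(const int *blocks, int *global, size_t nproc,
                  size_t m, size_t nloc, size_t n, size_t extent)
{
    assert(n == nloc * nproc);
    for (size_t r = 0; r < nproc; r++)
        for (size_t j = 0; j < m; j++)
            for (size_t i = 0; i < nloc; i++)
                global[r * extent + j * n + i] = blocks[(r * m + j) * nloc + i];
}

/* Extent, in elements, that a vector of m blocks of nloc elements with
 * stride n has before resizing: it spans from the first element to the
 * end of the last block. */
size_t default_vector_extent(size_t m, size_t nloc, size_t n)
{
    assert(m > 0);
    return (m - 1) * n + nloc;
}
```

For m = 2, nloc = 2 and n = 4, the extent before resizing is 6 elements, so the block from rank 1 would start at element 6 instead of element 2. Calling model_gather with an extent of nloc = 2 reproduces the 2x4 result that the MWE prints.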

MWE：

#include <mpi.h>
#include <iostream>
#include <sstream>
#include <string>

void print(std::string label, int rank, int nloc, int m, int* array)
{
  std::ostringstream oss;
  oss << label << " on P"<<rank<<": "<< m << "x" << nloc << std::endl;

  for (int i=0; i<m; i++)
  {
    for (int j=0; j<nloc; j++)
    {
      oss << array[i*nloc+j] << " ";
    }
    oss << std::endl;
  }
  std::cout << oss.str()<<std::flush<<std::endl;
}

int main(int argc, char** argv)
{

   MPI_Init(&argc,&argv);
   int rank, nproc;
   MPI_Comm_rank(MPI_COMM_WORLD,&rank);
   MPI_Comm_size(MPI_COMM_WORLD,&nproc);

   int nloc=2;
   int n=nloc*nproc;
   int m=2;

   int *Cj = new int[nloc*m+1000];
   int *Cglob = new int[n*m+1000];

   for (int j=0; j<m; j++)
     for (int i=0; i<nloc; i++)
       Cj[j*nloc+i]=j*n + rank*nloc + i;

   for (int r=0; r<nproc; r++)
   {
      if (rank==r) print("Cj", rank, nloc, m, Cj);
      std::cout << std::flush;
      MPI_Barrier(MPI_COMM_WORLD);
   }

   MPI_Datatype stype, rtype, rtype_resized;

   // this data type represents the local nloc x m matrix,
   // which is column-major and has stride nloc.
   MPI_Type_vector(m,nloc,nloc,MPI_INT,&stype);
   MPI_Type_commit(&stype);

   // this represents a block of size nloc x m within a col-major
   // matrix of size n x m, hence the stride is n.
   MPI_Type_vector(m,nloc,n,MPI_INT,&rtype);
   MPI_Type_commit(&rtype);

  // we need to manually define the extent of the receive type in order to
  // get the displacements in the MPI_Gather right:
  MPI_Type_create_resized(rtype, 0, nloc*sizeof(int), &rtype_resized);
  MPI_Type_commit(&rtype_resized);

   // these two give the same result on the root (note the resized receive type):
   //MPI_Allgather(Cj,nloc*m,MPI_INT,Cglob,1,rtype_resized,MPI_COMM_WORLD);
   MPI_Gather(Cj,1,stype,Cglob,1,rtype_resized,0,MPI_COMM_WORLD);

   if (rank==0)
     print("Cglob", rank, n, m, Cglob);

   MPI_Type_free(&stype);
   MPI_Type_free(&rtype);
   MPI_Type_free(&rtype_resized);

   delete [] Cj;
   delete [] Cglob;

   MPI_Finalize();
}

Output:


> mpicxx -o matrix_gather matrix_gather.cpp
> mpirun -np 2 ./matrix_gather

Cj on P0: 2x2
0 1 
4 5 

Cglob on P0: 2x4
0 1 2 3 
4 5 6 7 

Cj on P1: 2x2
2 3 
6 7

【Comments】: