How to get all ranks in MPI to send a value to rank 0, which then does a blocking receive on all of them?

【Question title】: How to get all ranks in MPI to send a value to rank 0, which then does a blocking receive on all of them?
【Posted】: 2017-09-10 19:48:36
【Question】:

Suppose I have n processes.

They each perform a computation and then send the result to rank 0. Here is what I want to happen:

Rank 0 waits until it has received the results from all ranks, and then adds them up.

How can I do this? I also want to avoid the following situation:

For example, with 4 processes P0, P1, P2, P3:

P1 -> P0
P2 -> P0
P3 -> P0

At this point P1 has already finished its next computation, so P1 -> P0 happens again.

I want P0 to add up the results of only these 3 processes in one cycle, and only then move on to the addition for the next cycle.

Can someone suggest an MPI function that does this? I know about MPI_Gather, but I'm not sure whether it is blocking.

I came up with this:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int pross, rank, p_count = 0;
    int count = 10;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &pross);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // room for one value per non-root rank
    int *num = malloc((pross-1)*sizeof(int));

    if (rank != 0)
    {
        // non-root ranks send their value with a point-to-point send (tag 1)
        MPI_Send(&count, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }
    else
    {
        // only rank 0 calls the collective MPI_Gather
        MPI_Gather(&count, 1, MPI_INT, num, 1, MPI_INT, 0, MPI_COMM_WORLD);
        for (int ii = 0; ii < pross-1; ii++) { printf("\n NUM %d \n", num[ii]); p_count += num[ii]; }
    }
    MPI_Finalize();
}

I get this error:

  *** Process received signal ***
  Signal: Segmentation fault (11)
  Signal code: Address not mapped (1)
  Failing at address: (nil)
  [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11630)[0x7fb3e3bc3630]
  [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x90925)[0x7fb3e387b925]
  [ 2] /usr/lib/libopen-pal.so.13(+0x30177)[0x7fb3e3302177]
  [ 3] /usr/lib/libmpi.so.12(ompi_datatype_sndrcv+0x54c)[0x7fb3e3e1e3ec]
  [ 4] /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_gather_intra_basic_linear+0x143)[0x7fb3d53d9063]
  [ 5] /usr/lib/libmpi.so.12(PMPI_Gather+0x1ba)[0x7fb3e3e29a3a]
  [ 6]  sosuks(+0xe83)[0x55ee72119e83]
  [ 7] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7fb3e380b3f1]
  [ 8]  sosuks(+0xb5a)[0x55ee72119b5a]
  *** End of error message ***

I also tried:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int pross, rank, p_count = 0;
    int count = 10;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &pross);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // room for one value per non-root rank
    int *num = malloc((pross-1)*sizeof(int));

    if (rank != 0)
    {
        // non-root ranks send their value with a point-to-point send (tag 1)
        MPI_Send(&count, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }
    else
    {
        // only rank 0 calls the collective MPI_Gather
        MPI_Gather(&count, 1, MPI_INT, num, 1, MPI_INT, 0, MPI_COMM_WORLD);
        for (int ii = 0; ii < pross-1; ii++) { printf("\n NUM %d \n", num[ii]); p_count += num[ii]; }
    }
    MPI_Finalize();
}

Here I get this error:

  *** Process received signal ***
  Signal: Segmentation fault (11)
  Signal code: Address not mapped (1)
  Failing at address: 0x560600000002
  [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11630)[0x7fefc8c11630]
  [ 1] mdscisuks(+0xeac)[0x5606c1263eac]
  [ 2] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7fefc88593f1]
  [ 3] mdscisuks(+0xb4a)[0x5606c1263b4a]
  *** End of error message ***

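For the second attempt, note that the sends and receives both succeeded, but for some reason the root only received 2 messages from the other ranks. The segmentation fault happens because num has only two elements in it, and I don't understand why num is only filled twice.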

I run the code with:

mpiexec -n 6 ./sosuks

Can someone show me a better/correct way to implement this?

Update:

In addition to the answer below, I found the mistake in my implementation above, which I'd like to share:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int pross, rank;
    int count = 10;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &pross);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank != 0)
    {
        // workers: send one result to rank 0 with tag 1
        MPI_Send(&count, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }
    else
    {
        // root: one blocking receive from each worker, in rank order
        int var = 0;
        for (int lick = 1; lick < pross; lick++)
        {
            int fetihs;
            MPI_Recv(&fetihs, 1, MPI_INT, lick, 1, MPI_COMM_WORLD, &status);
            var += fetihs;
        }
        // do things with var
    }
    MPI_Finalize();
    return 0;
}

【Question discussion】:

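If you want to sum all the results, then you probably want MPI_Reduce, not MPI_Gather.
But will it follow the blocking behavior I outlined in the question above? I'm trying to do the addition only after all processes have reached a certain point. In a sense, I'm trying to "synchronize" the results of all processes at that step.
Your description isn't very clear, but it sounds like you want a barrier at every round. Yes, there will be a barrier (there's no logical way to do a reduction without one).
I'm new to this. What's a barrier?
A barrier is a point in the execution that all processes must reach before any of them is allowed to continue past it.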

Tags: c multithreading mpi openmpi message-passing

【Solution 1】:

In your case, as Sneftel pointed out, you need MPI_Reduce. You also don't need any explicit synchronization before a cycle completes.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
    int pross, rank, p_count, count = 10;

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD, &pross);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // master does not send data to itself.
    // only workers send data to master.

    for (int i=0; i<3; ++i)
    {
        // to prove that no further sync is needed.
        // you will get the same answer in each cycle.
        p_count = 0; 

        if (rank == 0)
        {
            // this has no effect since master uses p_count for both
            // send and receive buffers due to MPI_IN_PLACE.
            count = 500;

            MPI_Reduce(MPI_IN_PLACE, &p_count, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        }
        else
        {
            // for the slaves, p_count is irrelevant (the receive buffer only matters at the root).
            MPI_Reduce(&count, NULL, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        }

        if (rank == 0)
        {
            printf("p_count = %i\n", p_count);
        }

        // slaves send their data to master before the cycle completes.
        // no need for explicit sync such as MPI_Barrier.
        // MPI_Barrier(MPI_COMM_WORLD); // no need.
    }

    MPI_Finalize();
}

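In the code above, count on the slaves is reduced into p_count on the master. Note MPI_IN_PLACE and the two different MPI_Reduce calls. You can get the same result by simply setting count = 0 on the master and calling MPI_Reduce without MPI_IN_PLACE on all ranks: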

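for (int i=0; i<3; ++i)
{
    p_count = 0;
    if (rank == 0) count = 0;   // master contributes 0, so only the slaves' values are summed

    // every rank calls the same MPI_Reduce; the sum lands in p_count on rank 0
    MPI_Reduce(&count, &p_count, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
}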

【Discussion】: