【Question title】: Open MPI - ring_c on multiple hosts fails
【Posted】: 2016-07-23 08:24:11
【Question】:

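I recently installed Open MPI on two Ubuntu 14.04 hosts, and I am now testing it with the two provided test programs, hello_c and ring_c. The hosts are called "hermes" and "zeus", and both have a user "mpiuser" that logs in non-interactively (via ssh-agent).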

Both mpirun hello_c and mpirun --host hermes,zeus hello_c work fine.

Running mpirun --host zeus ring_c locally also works. The output is the same on hermes and zeus:

mpiuser@zeus:/opt/openmpi-1.6.5/examples$ mpirun --host zeus ring_c
Process 0 sending 10 to 0, tag 201 (1 processes in ring)
Process 0 sent to 0
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting

But running mpirun --host zeus,hermes ring_c fails with the following output:

mpiuser@zeus:/opt/openmpi-1.6.5/examples$ mpirun --host hermes,zeus ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
[zeus:2930] *** An error occurred in MPI_Recv
[zeus:2930] *** on communicator MPI_COMM_WORLD
[zeus:2930] *** MPI_ERR_TRUNCATE: message truncated
[zeus:2930] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
Process 0 sent to 1
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 2930 on
node zeus exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

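I have not found any documentation on how to solve this kind of problem, and I cannot tell from the error output where to look for the mistake. How can I fix this?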

【Comments】:

    Tags: testing installation openmpi


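    【Answer 1】:

    You changed two things between the first and second runs: you increased the number of processes from 1 to 2, and you ran on multiple hosts instead of a single host.

    I suggest you first check that you can run with 2 processes on the same host: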

    mpirun -n 2 ring_c
    

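    and see what you get.

    When debugging on a cluster, it is often useful to know where each process is running. You should also always print out the total number of processes. Try using the following code at the top of ring_c.c: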

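    char nodename[MPI_MAX_PROCESSOR_NAME];  /* buffer for the host name */
    int namelen;                            /* length of the returned name */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    /* Report which node each rank is actually running on */
    MPI_Get_processor_name(nodename, &namelen);
    printf("Rank %d out of %d running on node %s\n", rank, size, nodename);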
    

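    The error you are getting says that the incoming message is too large for the receive buffer, which is odd given that the code always sends and receives a single integer.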

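    【Comments】:

    • mpirun -n 2 ring_c works on the same host. But I think I found the error. The environment variables were wrong: ssh user@IP env did not show the correct $PATH and $LD_LIBRARY_PATH. So I tried mpirun --prefix /opt/openmpi -host hermes,zeus ring_c, and it worked. Now I have to figure out the right way to export the variables.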