【Posted】: 2013-12-24 14:58:31
【Question】:
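I am trying to run a simple helloworld test against my OpenMPI installation. I have set up a two-node cluster on Amazon AWS, running SUSE SLES11 SP3 and OpenMPI 1.4.4 (a bit old, but no newer binaries are available for my Linux distribution). I have reached the last step, but I am having trouble setting the btl flags correctly.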
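Here is what I can do:
- I can scp in both directions between the nodes, so passwordless SSH is up and running.
- Running iptables -L shows that no firewall rules are active, so I think communication between the nodes should work.
- I can compile my helloworld.c program with mpicc, and I have confirmed that it runs correctly on another working cluster, so I think the local paths are set up correctly and the program itself works.
- If I execute mpirun from my master node, using only the master node, helloworld runs correctly:
  ip-xxx-xxx-xxx-133: # mpirun -n 1 -host master --mca btl sm,openib,self ./helloworldmpi
  ip-xxx-xxx-xxx-133: hello world from process 0 of 1
- If I execute mpirun from my master node, using only the worker node, helloworld runs correctly:
  ip-xxx-xxx-xxx-133: # mpirun -n 1 -host node001 --mca btl sm,openib,self ./helloworldmpi
  ip-xxx-xxx-xxx-210: hello world from process 0 of 1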
Now, my problem is that if I try to run helloworld across both nodes, I get an error:
ip-xxx-xxx-xxx-133: # mpirun -n 2 -host master,node001 --mca btl openib,self ./helloworldmpi
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[5228,1],0]) is on host: ip-xxx-xxx-xxx-133
Process 2 ([[5228,1],1]) is on host: node001
BTLs attempted: self
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
PML add procs failed
--> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[ip-xxx-xxx-xxx-133:7037] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 7037 on
node ip-xxx-xxx-xxx-133 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[ip-xxx-xxx-xxx-210:5838] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
[ip-xxx-xxx-xxx-133:07032] 1 more process has sent help message help-mca-bml-r2.txt / unreachable proc
[ip-xxx-xxx-xxx-133:07032] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[ip-xxx-xxx-xxx-133:07032] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:internal-failure
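The most telling line in the output above is probably "BTLs attempted: self": for the pair of processes on different hosts, only the self transport was tried, which suggests that openib did not find a usable device on these instances. That reading is an assumption and is not confirmed by anything else in the log. A minimal sketch for pulling that line out of a saved log follows; btls_attempted is a hypothetical helper name, not part of Open MPI.

```shell
# Hypothetical helper (not part of Open MPI): print the transport list from
# every "BTLs attempted:" line of an mpirun log, read from stdin or from files.
btls_attempted() {
  sed -n 's/^.*BTLs attempted:[[:space:]]*//p' "$@"
}

# Example using the line copied from the output above.
printf '%s\n' 'BTLs attempted: self' | btls_attempted
```

If the helper prints only self for a multi-node run, no inter-node transport was selected for that pair of processes.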
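Finally, if I omit the -mca btl sm,openib,self flag, nothing works at all. I admit that my understanding of these flags is close to zero, and there is very little information online about how to use them. I looked at my data.conf file, and I am not sure that all of the devices listed there actually exist, but the -mca flag seems to solve most of the problem, since I can at least run on each node of the cluster individually. Any pointers on what I might be doing wrong, or where I might look, would be greatly appreciated.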
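One way to check which BTL components this build actually provides is to filter the output of ompi_info, the tool that ships with Open MPI. The sketch below is only a starting point: list_btls is a hypothetical helper name, and it assumes ompi_info prints component lines of the form "MCA btl: <name> (...)". The commented mpirun line, which swaps openib for tcp, is something to try rather than a confirmed fix.

```shell
# Hypothetical helper: list the BTL component names reported by ompi_info.
# Assumes lines shaped like "MCA btl: tcp (MCA v2.0, API v2.0, Component v1.4.4)".
list_btls() {
  if ! command -v ompi_info >/dev/null 2>&1; then
    echo "ompi_info not found on PATH" >&2
    return 1
  fi
  # Keep only the component name that follows "MCA btl:".
  ompi_info | sed -n 's/^ *MCA btl: *\([^ ]*\).*/\1/p'
}

# Usage on each node:
#   list_btls
# If tcp appears in the list, a two-node run over TCP could be attempted with:
#   mpirun -n 2 -host master,node001 --mca btl tcp,self ./helloworldmpi
```

Running the helper on both master and node001 also shows whether the two installations offer the same set of components.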
【Comments】:
Tags: amazon-web-services mpi cluster-computing openmpi