【Question title】: Setting BTL flags in OpenMPI
【Posted】: 2013-12-24 14:58:31
【Question】:

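I am trying to run a simple helloworld test against my OpenMPI installation. I have set up a two-node cluster on Amazon AWS, running SUSE SLES11 SP3 and OpenMPI 1.4.4 (a bit old, but no newer binaries are available for my Linux distribution). I have reached the last step, but I am having trouble setting the btl flags correctly.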

Here is what I have working:

  • I can scp in both directions between the nodes, so passwordless SSH is up and running.

  • Running iptables -L shows that no firewall is active, so I think communication between the nodes should work.

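  • I can compile my helloworld.c program with mpicc, and I have confirmed that the same program runs correctly on another working cluster, so I think my local paths are set correctly and the program itself works.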

  • If I execute mpirun from my master node, using only the master node, helloworld runs correctly:

    ip-xxx-xxx-xxx-133: # mpirun -n 1 -host master --mca btl sm,openib,self ./helloworldmpi
    ip-xxx-xxx-xxx-133: hello world from process 0 of 1
    
  • If I execute mpirun from my master node, using only the worker node, helloworld runs correctly:

    ip-xxx-xxx-xxx-133: # mpirun -n 1 -host node001 --mca btl sm,openib,self ./helloworldmpi
    ip-xxx-xxx-xxx-210: hello world from process 0 of 1
    

Now, my problem is that if I try to run helloworld on both nodes, I get an error:

ip-xxx-xxx-xxx-133: # mpirun -n 2 -host master,node001 --mca btl openib,self ./helloworldmpi
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[5228,1],0]) is on host: ip-xxx-xxx-xxx-133
  Process 2 ([[5228,1],1]) is on host: node001
  BTLs attempted: self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[ip-xxx-xxx-xxx-133:7037] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 7037 on
node ip-xxx-xxx-xxx-133 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[ip-xxx-xxx-xxx-210:5838] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
[ip-xxx-xxx-xxx-133:07032] 1 more process has sent help message help-mca-bml-r2.txt / unreachable proc
[ip-xxx-xxx-xxx-133:07032] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[ip-xxx-xxx-xxx-133:07032] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:internal-failure

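Finally, if I omit the -mca btl sm,openib,self flag, nothing works at all. I admit that my understanding of these flags is close to zero, and there is little information on the web about how to use them. I looked at my data.conf file, and I am not sure that all the devices listed there actually exist, but the -mca flag seems to resolve most of the problems, since I can at least run on each node of the cluster individually. Any pointers on what I might be doing wrong, or where I should look, would be greatly appreciated.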

【Comments】:

Tags: amazon-web-services mpi cluster-computing openmpi


【Answer 1】:

"--mca btl openib,sm,self" tells Open MPI which transports to use for MPI traffic. You specified:

• openib: InfiniBand or iWARP
• sm: shared memory
• self: loopback

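As far as I know (although I have not followed AWS closely), AWS does not have InfiniBand or iWARP, so specifying openib here is useless. If you add "tcp" to the comma-separated list, Open MPI should use TCP, which should be what you want. Specifically: "--mca btl tcp,sm,self" (the order of the entries in the comma-separated list does not matter).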
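Applied to the failing two-node run from the question, the suggestion might look like the sketch below. The hostnames (master, node001) and the binary name (./helloworldmpi) are taken from the question; the variable names are only illustrative, and the script launches the job only when mpirun and the binary are actually present.

```shell
#!/bin/sh
# Hostnames and binary name are the ones used in the question; adjust for your cluster.
HOSTS="master,node001"
# tcp: traffic between nodes, sm: shared memory within a node, self: loopback.
BTLS="tcp,sm,self"

# Build and print the command first so it can be checked before launching.
CMD="mpirun -n 2 -host $HOSTS --mca btl $BTLS ./helloworldmpi"
echo "$CMD"

# Launch only if mpirun is on PATH and the test binary exists here.
if command -v mpirun >/dev/null 2>&1 && [ -x ./helloworldmpi ]; then
    $CMD
fi
```

The only change from the failing command in the question is that tcp replaces openib in the BTL list, and sm is kept for processes that share a node.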

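That said, Open MPI should effectively pick sm, tcp, and self by default, so you should not need to specify "--mca btl tcp,sm,self" at all. It is a bit odd to me that this does not work for you.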
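One way to look into why the defaults fail, offered as a sketch and not as a confirmed fix for this cluster, is to exclude only openib with the "^" prefix, let Open MPI choose among the remaining BTLs, and raise btl_base_verbose so that component selection is logged. The hostnames and binary name are again the ones from the question, and the verbosity level 30 is an arbitrary choice.

```shell
#!/bin/sh
# "^openib" means: use every available BTL except openib.
EXCLUDE="^openib"

# btl_base_verbose raises the logging level of the BTL framework (30 is an arbitrary pick).
CMD="mpirun -n 2 -host master,node001 --mca btl $EXCLUDE --mca btl_base_verbose 30 ./helloworldmpi"
echo "$CMD"

# ompi_info lists the components built into this installation; show the BTLs if it is available.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | grep "MCA btl" || true
fi
```

If tcp does not appear in the ompi_info listing, that alone would explain why no inter-node transport was found. The same setting can also be supplied through the environment as OMPI_MCA_btl instead of on the command line.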

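【Comments】:

• Thanks. That nicely explains the conclusion I reached yesterday after several hours of work: Amazon does not use InfiniBand. Regarding your last remark, I am not sure either, but I think it may be caused by my data.conf file. I believe that file lists some hardware that does not actually exist (I borrowed it from another Linux AMI on Amazon). From MPI's point of view, using the btl flag somehow filters the offending lines out of the data.conf file. If I do not use the btl flag, mpirun complains about a missing cma-blah-blah (I cannot pull up the exact message right now) and throws an error.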
【Answer 2】:

For the record, I just had to add tcp to the -mca btl flag, and it now works fine.

【Comments】:
