【Posted】: 2019-03-20 07:40:48
【Question】:
I cannot run Open MPI under Slurm through a Slurm script.
In general, I can get the hostname and run Open MPI on my machine:
$ mpirun hostname
myHost
$ cd NPB3.3-SER/ && make ua CLASS=B && mpirun -n 1 bin/ua.B.x inputua.data # Works
But if I do the same through the Slurm script, mpirun hostname returns an empty string, and as a result I cannot run mpirun -n 1 bin/ua.B.x inputua.data.
slurm-script.sh:
#!/bin/bash
#SBATCH -o slurm.out # STDOUT
#SBATCH -e slurm.err # STDERR
#SBATCH --mail-type=ALL
export LD_LIBRARY_PATH="/usr/lib/openmpi/lib"
mpirun hostname > output.txt # Returns empty
cd NPB3.3-SER/
make ua CLASS=B
mpirun --host myHost -n 1 bin/ua.B.x inputua.data
$ sbatch -N1 slurm-script.sh
Submitted batch job 1
The error I receive:
There are no allocated resources for the application
bin/ua.B.x
that match the requested mapping:
------------------------------------------------------------------
Verify that you have mapped the allocated resources properly using the
--host or --hostfile specification.
A daemon (pid unknown) died unexpectedly with status 1 while attempting
to launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
------------------------------------------------------------------
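【Discussion】:
- Remove the --host myHost option entirely. If SLURM was compiled with OpenMPI integration, it will be able to pass the list of allocated nodes to mpirun implicitly.
- I have removed --host myHost, but I still get the same error. I compiled SLURM as follows: ./configure --enable-debug --enable-front-end && make && make install. How can I compile SLURM with OpenMPI integration? @DmitriChubarov
- Can you provide the slurm and openmpi versions?
- What do you get if you run hostname (not mpirun hostname) through slurm? That would tell whether openmpi is involved in the problem. My guess is that openmpi most likely has nothing to do with output.txt being empty (I don't know whether that is the only problem you have, or just the first one to show up).
- If I run hostname through slurm, it returns ebloc, which is in fact also the NodeHostName in slurm.conf. @sancho.s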
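The checks suggested in the comments can be collected into a small diagnostic that is pasted into the batch script before the mpirun line. This is only a sketch under assumptions: the helper names allocated_nodes and host_matches_allocation are hypothetical and do not come from the thread, the comparison is only meaningful for a single-node job such as sbatch -N1, and it relies on Slurm exporting SLURM_JOB_NODELIST inside the job. The ompi_info line simply lists any Open MPI components whose names mention Slurm; an empty result would suggest that this Open MPI build has no Slurm support.

```shell
#!/bin/bash
# Hypothetical diagnostic helpers (not from the original post).

# Print the node list Slurm allocated, or a marker when run outside a job.
allocated_nodes() {
    echo "${SLURM_JOB_NODELIST:-<not inside a Slurm allocation>}"
}

# Succeed only when the given host name is literally the allocated node list.
# This is enough for a single-node job; multi-node lists use a compressed
# syntax that this sketch does not parse.
host_matches_allocation() {
    [ "$1" = "${SLURM_JOB_NODELIST:-}" ]
}

echo "hostname:        $(hostname)"
echo "allocated nodes: $(allocated_nodes)"

# Report Open MPI components that mention Slurm, if ompi_info is on PATH.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | grep -i slurm || echo "no Slurm components reported by ompi_info"
fi
```

If the name passed to --host (myHost in the question) differs from what the job reports (ebloc in the last comment), host_matches_allocation myHost fails, which would be consistent with the "no allocated resources" message; that reading is a guess, not something the thread confirms.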