【Question title】: MPI result differs under Slurm versus running the command directly
【Posted】: 2020-03-05 07:33:11
【Question】:

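I ran into a problem running my MPI project under Slurm.

a1 is my executable. When I run mpiexec -np 4 ./a1 directly, it works fine.

But when I run it under Slurm, it does not work correctly and appears to stop partway through.

This is the output from mpiexec -np 4 ./a1, which is correct: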

Processor1 will send and receive with processor0
Processor3 will send and receive with processor0
Processor0 will send and receive with processor1
Processor0 finished send and receive with processor1
Processor1 finished send and receive with processor0
Processor2 will send and receive with processor0
Processor1 will send and receive with processor2
Processor2 finished send and receive with processor0
Processor0 will send and receive with processor2
Processor0 finished send and receive with processor2
Processor0 will send and receive with processor3
Processor0 finished send and receive with processor3
Processor3 finished send and receive with processor0
Processor1 finished send and receive with processor2
Processor2 will send and receive with processor1
Processor2 finished send and receive with processor1
Processor0: I am very good, I save the hash in range 0 to 65
p: 4
Tp: 8.61754
Processor1 will send and receive with processor3
Processor3 will send and receive with processor1
Processor3 finished send and receive with processor1
Processor1 finished send and receive with processor3
Processor2 will send and receive with processor3
Processor1: I am very good, I save the hash in range 65 to 130
Processor2 finished send and receive with processor3
Processor3 will send and receive with processor2
Processor3 finished send and receive with processor2
Processor3: I am very good, I save the hash in range 195 to 260
Processor2: I am very good, I save the hash in range 130 to 195

This is the output under Slurm. Unlike the direct command, it does not return the full result:

Processor0 will send and receive with processor1
Processor2 will send and receive with processor0
Processor3 will send and receive with processor0
Processor1 will send and receive with processor0
Processor0 finished send and receive with processor1
Processor1 finished send and receive with processor0
Processor0 will send and receive with processor2
Processor0 finished send and receive with processor2
Processor2 finished send and receive with processor0
Processor1 will send and receive with processor2
Processor0 will send and receive with processor3
Processor2 will send and receive with processor1
Processor2 finished send and receive with processor1
Processor2 will send and receive with processor3
Processor1 finished send and receive with processor2

This is my Slurm.sh file. I think I made a mistake somewhere in it, since the result differs from the direct command, but I'm not sure...

#!/bin/bash

####### select partition (check CCR documentation)
#SBATCH --partition=general-compute --qos=general-compute

####### set memory that nodes provide (check CCR documentation, e.g., 32GB)
#SBATCH --mem=64000

####### make sure no other jobs are assigned to your nodes
#SBATCH --exclusive

####### further customizations
#SBATCH --job-name="a1"
#SBATCH --output=%j.stdout
#SBATCH --error=%j.stderr
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --time=12:00:00

mpiexec -np 4 ./a1

【Comments】:

    Tags: c++ mpi slurm


    【Solution 1】:

    Coming back to answer my own question. I made a silly mistake: I was using the wrong slurm.sh for my MPI code. The correct slurm.sh is:

    #!/bin/bash
    
    ####### select partition (check CCR documentation)
    #SBATCH --partition=general-compute --qos=general-compute
    
    ####### set memory that nodes provide (check CCR documentation, e.g., 32GB)
    #SBATCH --mem=32000
    
    ####### make sure no other jobs are assigned to your nodes
    #SBATCH --exclusive
    
    ####### further customizations
    #SBATCH --job-name="a1"
    #SBATCH --output=%j.stdout
    #SBATCH --error=%j.stderr
    #SBATCH --nodes=4
    #SBATCH --ntasks-per-node=12
    #SBATCH --time=01:00:00
    
    ####### check modules to see which version of MPI is available
    ####### and use appropriate module if needed
    module load intel-mpi/2018.3
    export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
    
    srun ./a1
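
A side note on the script above, offered as a sketch and not as part of the original answer: on a typical Slurm setup, srun without an explicit task count starts one task per allocated slot, so with --nodes=4 and --ntasks-per-node=12 the job would presumably launch 48 ranks, not the 4 used with mpiexec -np 4. The helper name total_tasks is hypothetical and only illustrates the arithmetic.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): number of tasks a job gets
# when it requests a fixed number of tasks on each node.
total_tasks() {
  local nodes=$1      # value passed to --nodes
  local per_node=$2   # value passed to --ntasks-per-node
  echo $(( nodes * per_node ))
}

total_tasks 4 12   # allocation in the corrected script, prints 48
total_tasks 1 16   # allocation in the script from the question, prints 16
```

If the goal is to reproduce the 4-rank run from the question, passing the count explicitly, for example srun --ntasks=4 ./a1, should keep the two runs comparable.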
    
    

    I'm so silly that I use 小南 as my nickname... I hope I can get smarter.

    【Comments】:
