How to start n tasks with one GPU each?

【Question title】: How to start n tasks with one GPU each?
【Posted】: 2021-02-09 22:23:25
【Question】:

I have a large cluster of compute nodes, each with 6 GPUs. I want to start, say, 100 workers on it, each with access to exactly one GPU.

What I currently do is this:

sbatch --gres=gpu:6 --gpus-per-task=1 --ntasks='100' main.sh

Inside main.sh:

srun --gpus-per-task=1 --gres=gpu:1 -n 100 worker.sh

This way, 100 workers get started (fully using 17 nodes). But I have a problem: CUDA_VISIBLE_DEVICES is not set correctly.

sbatch --gres=gpu:6 --gpus-per-task=1 --ntasks='100' main.sh
# CUDA_VISIBLE_DEVICES in main.sh: 0,1,2,3,4,5 (that's fine)
srun --gpus-per-task=1 --gres=gpu:1 -n 100 worker.sh
# CUDA_VISIBLE_DEVICES in worker.sh: 0,1,2,3,4,5 (this is my problem: how to assign exactly 1 GPU to each worker and to that worker alone?)
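To see how the GPUs are actually distributed, it can help to have every task report what it sees. The following is a minimal diagnostic sketch, not part of the original scripts: the function name describe_task is hypothetical, and it only reads SLURM_PROCID, SLURM_LOCALID and CUDA_VISIBLE_DEVICES, which Slurm and the GPU plugin normally export inside a job step.

```shell
#!/bin/bash
# Hypothetical diagnostic helper (not from the original post):
# print one line describing the current task and the GPUs visible to it.
describe_task() {
    # uname -n prints the node name; the ${VAR:-unset} form avoids
    # empty fields when a variable is not defined.
    echo "host=$(uname -n) procid=${SLURM_PROCID:-unset} localid=${SLURM_LOCALID:-unset} gpus=${CUDA_VISIBLE_DEVICES:-unset}"
}
```

Calling describe_task at the top of worker.sh and launching it with the srun line above should print one line per task, which makes it easy to spot tasks on the same node that share the same GPU list.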

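I may be misunderstanding how Slurm actually works, since I am new to programming on HPC systems of this kind. Still, does anyone have a clue how to achieve what I want? (Each worker should have exactly 1 GPU assigned to it, and to it alone.)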

We use SLURM 20.02.2.

【Question comments】:

Tags: gpu slurm

【Solution 1】:

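I think what you are missing here is that you should explicitly define the number of nodes. For example, 17 nodes x 6 GPUs per node gives 102 task slots (if you only need 100 tasks, the last node will be underutilized and run only 4 tasks).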

#!/bin/bash
#SBATCH --ntasks=100
#SBATCH --gres=gpu:6
#SBATCH --nodes=17
#SBATCH --ntasks-per-node=6
# --cpus-per-task depends on the available cores per node
#SBATCH --cpus-per-task=1

srun -l main.py
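If every task on a node still sees the full GPU list, one common workaround is to let each task pick a single device based on its node-local index. The sketch below is an assumption-laden illustration rather than part of the answer: pick_gpu is a hypothetical helper, and it assumes that SLURM_LOCALID (the node-local task ID set by Slurm) runs from 0 to 5 on each node and that CUDA_VISIBLE_DEVICES holds a comma-separated list such as 0,1,2,3,4,5.

```shell
#!/bin/bash
# Hypothetical helper (not from the original answer): choose one entry of a
# comma-separated GPU list by a task's node-local index. The index wraps
# around if there are more tasks than GPUs on the node.
pick_gpu() (
    visible="$1"    # e.g. "0,1,2,3,4,5"
    local_id="$2"   # e.g. the value of SLURM_LOCALID
    # Count the entries in the list (one per line after splitting on commas).
    count=$(printf '%s\n' "$visible" | tr ',' '\n' | wc -l)
    # cut numbers fields from 1, so shift the zero-based index by one.
    index=$(( local_id % count + 1 ))
    printf '%s\n' "$visible" | cut -d, -f"$index"
)

# Possible use at the top of worker.sh (commented out so the sketch can be
# sourced safely outside a Slurm job):
# export CUDA_VISIBLE_DEVICES="$(pick_gpu "$CUDA_VISIBLE_DEVICES" "$SLURM_LOCALID")"
```

This only narrows what the process sees; it does not make Slurm enforce the binding. On clusters where device isolation is already enforced per task, each task may see just its own GPU and the helper would not be needed.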

【Comments】: