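[Posted]: 2022-01-23 06:44:25
[Question]:
I am trying to run PyTorch on a cluster managed through srun, following the DDP example here (https://github.com/pytorch/examples/tree/master/distributed/ddp). With a single node and multiple processes (each process accessing one GPU), it works for me. The output is: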
$ srun -C gpu -N 1 -c 8 -n 1 --gpus-per-task=4 python -m torch.distributed.launch --nnode=1 --nproc_per_node=4 example.py --local_world_size=4
srun: job 2520346 queued and waiting for resources
srun: job 2520346 has been allocated resources
[7288] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '0', 'WORLD_SIZE': '4'}
[7289] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '1', 'WORLD_SIZE': '4'}
[7290] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '2', 'WORLD_SIZE': '4'}
[7291] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '3', 'WORLD_SIZE': '4'}
[7288]: world_size = 4, rank = 0, backend=nccl
[7288] rank = 0, world_size = 4, n = 1, device_ids = [0]
[7290]: world_size = 4, rank = 2, backend=nccl
[7290] rank = 2, world_size = 4, n = 1, device_ids = [2]
[7289]: world_size = 4, rank = 1, backend=nccl
[7289] rank = 1, world_size = 4, n = 1, device_ids = [1]
[7291]: world_size = 4, rank = 3, backend=nccl
[7291] rank = 3, world_size = 4, n = 1, device_ids = [3]
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
However, when I try 2 nodes, each with access to 4 GPUs, the program hangs:
$ srun -C gpu -N 2 -c 8 -n 2 --gpus-per-task=4 python -m torch.distributed.launch --nnode=2 --nproc_per_node=4 example.py --local_world_size=4
srun: job 2520347 queued and waiting for resources
srun: job 2520347 has been allocated resources
[62582] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '0', 'WORLD_SIZE': '8'}
[62583] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '1', 'WORLD_SIZE': '8'}
[62585] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '2', 'WORLD_SIZE': '8'}
[62586] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '3', 'WORLD_SIZE': '8'}
[48801] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '0', 'WORLD_SIZE': '8'}
[48829] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '1', 'WORLD_SIZE': '8'}
[48849] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '2', 'WORLD_SIZE': '8'}
[48850] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '3', 'WORLD_SIZE': '8'}
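I am not sure exactly how PyTorch initializes the environment here, but I suspect the master address should not be 127.0.0.1 in the second case, since there are two different nodes. Do you know how to make this example work in that setting? Thanks!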
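
For reference, the two-node log shows both nodes reporting RANK 0 to 3 with WORLD_SIZE 8 and MASTER_ADDR 127.0.0.1. That matches each launcher falling back to its defaults (`--node_rank=0`, `--master_addr=127.0.0.1`), so no process ever claims ranks 4 to 7 and each node looks for the rendezvous on itself. Below is a minimal sketch of how the per-node launcher flags could be derived from Slurm's environment. It assumes `SLURM_NNODES`, `SLURM_NODEID` and `SLURM_JOB_NODELIST` are set inside each task and that `scontrol` is on the PATH; the helper names (`first_hostname`, `launch_args`, `global_rank`) are hypothetical and not part of PyTorch or Slurm.

```python
import os
import subprocess


def first_hostname(nodelist, run=subprocess.run):
    """Return the first host of a Slurm node list such as 'nid[001-002]'.

    Expansion is delegated to `scontrol show hostnames`, which prints one
    hostname per line. `run` is injectable so the helper can be tested
    without a Slurm installation.
    """
    result = run(
        ["scontrol", "show", "hostnames", nodelist],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()[0]


def launch_args(env, master_addr, nproc_per_node=4, master_port=29500):
    """Build per-node flags for `python -m torch.distributed.launch`.

    `env` is a mapping holding SLURM_NNODES and SLURM_NODEID (assumed to be
    provided by srun for each task).
    """
    return [
        f"--nnodes={env['SLURM_NNODES']}",
        f"--node_rank={env['SLURM_NODEID']}",
        f"--nproc_per_node={nproc_per_node}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
    ]


def global_rank(node_rank, nproc_per_node, local_rank):
    """Global rank a launcher assigns to a local process on a given node."""
    return node_rank * nproc_per_node + local_rank


# Only prints something when run inside a Slurm allocation.
if __name__ == "__main__" and "SLURM_JOB_NODELIST" in os.environ:
    addr = first_hostname(os.environ["SLURM_JOB_NODELIST"])
    print(" ".join(launch_args(os.environ, addr)))
```

With two nodes and 4 processes per node, `global_rank(1, 4, 0)` is 4, which is the rank missing from the hanging run. Note that these values have to be evaluated inside each srun task, on its own node, so that `SLURM_NODEID` differs between the two launchers while `--master_addr` stays the same.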
[Discussion]:
Tags: pytorch distributed hpc