【Question title】: Distributed training initialisation of PyTorch based on srun
【Posted】: 2022-01-23 06:44:25
【Question】:

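I am trying to run PyTorch on a cluster where jobs are launched with srun, following the DDP example here (https://github.com/pytorch/examples/tree/master/distributed/ddp). It works when I use a single node with multiple processes, each accessing one GPU. The result is as follows: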

$ srun -C gpu -N 1 -c 8 -n 1 --gpus-per-task=4 python -m torch.distributed.launch --nnode=1 --nproc_per_node=4 example.py --local_world_size=4 
srun: job 2520346 queued and waiting for resources
srun: job 2520346 has been allocated resources
[7288] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '0', 'WORLD_SIZE': '4'}
[7289] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '1', 'WORLD_SIZE': '4'}
[7290] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '2', 'WORLD_SIZE': '4'}
[7291] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '3', 'WORLD_SIZE': '4'}
[7288]: world_size = 4, rank = 0, backend=nccl 
[7288] rank = 0, world_size = 4, n = 1, device_ids = [0] 
[7290]: world_size = 4, rank = 2, backend=nccl 
[7290] rank = 2, world_size = 4, n = 1, device_ids = [2] 
[7289]: world_size = 4, rank = 1, backend=nccl 
[7289] rank = 1, world_size = 4, n = 1, device_ids = [1] 
[7291]: world_size = 4, rank = 3, backend=nccl 
[7291] rank = 3, world_size = 4, n = 1, device_ids = [3] 
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************

However, when I try 2 nodes, each with access to 4 GPUs, the program hangs:

srun -C gpu -N 2 -c 8 -n 2 --gpus-per-task=4 python -m torch.distributed.launch --nnode=2 --nproc_per_node=4 example.py --local_world_size=4 
srun: job 2520347 queued and waiting for resources
srun: job 2520347 has been allocated resources
[62582] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '0', 'WORLD_SIZE': '8'}
[62583] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '1', 'WORLD_SIZE': '8'}
[62585] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '2', 'WORLD_SIZE': '8'}
[62586] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '3', 'WORLD_SIZE': '8'}
[48801] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '0', 'WORLD_SIZE': '8'}
[48829] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '1', 'WORLD_SIZE': '8'}
[48849] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '2', 'WORLD_SIZE': '8'}
[48850] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '3', 'WORLD_SIZE': '8'}

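I am not sure exactly how PyTorch initialises the environment here, but I guess the master address should not be 127.0.0.1 in the second case, since there are two different nodes. Do you know how to make this example work in this case? Thanks!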

【Comments】:

    Tags: pytorch distributed hpc


    【Solution 1】:

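    After some exploration I found a solution, which I am posting here. There may be better ones, but this one works for me so far. I wrote an MPI program that detects the address of the Ethernet interface (eth3 in my case). Rank 0 then broadcasts its address as the leader address to all workers, and each MPI process launches the Python script through a system call.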

    Here is the MPI program:

    #include <array>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>
    #include <iostream>
    #include <memory>
    #include <mpi.h>
    #include <stdexcept>
    #include <string>
    
    // Run a shell command and return everything it wrote to stdout.
    std::string exec(const char* cmd) {
        std::array<char, 128> buffer;
        std::string result;
        std::unique_ptr<FILE, decltype(&pclose)> pipe(popen(cmd, "r"), pclose);
        if (!pipe) {
            throw std::runtime_error("popen() failed!");
        }
        while (fgets(buffer.data(), buffer.size(), pipe.get()) != nullptr) {
            result += buffer.data();
        }
        return result;
    }
    
    int main(int argc, char *argv[]){
        MPI_Init(&argc, &argv);
        int rank, procs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &procs);
    
        // Extract the IPv4 address of eth3. The dots are escaped for egrep,
        // so the backslashes must be doubled inside the C++ string literal.
        std::string ipcommand="ifconfig eth3 | egrep -o 'inet [0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}'  | cut -d' ' -f2";
    
        // Note: ipaddr keeps the trailing newline printed by the command.
        std::string ipaddr = exec(ipcommand.c_str());
    
        std::cout << "rank is " << rank << " ip addr is " << ipaddr << std::endl;
        
        // Rank 0 provides the master address. The buffer is zero-initialised
        // so that it is always NUL-terminated on every rank.
        char masterAddr[128] = {0};
        if(rank==0){
            strncpy(masterAddr, ipaddr.c_str(), sizeof(masterAddr) - 1);
        }
    
        // Broadcast the master address from rank 0 to every rank.
        MPI_Bcast(masterAddr,128,MPI_CHAR,0,MPI_COMM_WORLD);
    
        //std::cout << "rank is " << rank << " ip addr is " << ipaddr << " master ip is " << std::string(masterAddr)<< std::endl;
    
        // Each MPI rank starts the PyTorch launcher, passing its own rank as the node rank.
        std::string pytorchcommand = "/bin/bash ../rundistributed.sh " + std::to_string(rank) + " " + std::string(masterAddr);
    
        std::cout << "pytorchcommand: " << pytorchcommand << std::endl;
    
        system(pytorchcommand.c_str());
    
        MPI_Finalize();
        return 0;
    }
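
As a side note on the address detection step, the `egrep`/`cut` pipeline can be sketched in Python as below. This is only an illustration of the same pattern, not part of the original solution. The function name `extract_inet_addr` and the sample text are hypothetical, and the pattern assumes `ifconfig` prints the address in the `inet <address>` form, as the original pipeline does.

```python
import re

# Same pattern as the egrep call in the MPI program:
# the word "inet", one space, then a dotted quad.
INET_PATTERN = re.compile(r"inet ([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})")


def extract_inet_addr(ifconfig_output: str) -> str:
    """Return the first IPv4 address that follows 'inet ' in the given text."""
    match = INET_PATTERN.search(ifconfig_output)
    if match is None:
        raise ValueError("no IPv4 address found in ifconfig output")
    return match.group(1)


# Hypothetical ifconfig output, made up for this example.
SAMPLE = """eth3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.174.13  netmask 255.255.255.0  broadcast 192.168.174.255
"""

if __name__ == "__main__":
    print(extract_inet_addr(SAMPLE))
```

Unlike the shell pipeline, the returned string carries no trailing newline, and a missing address raises an error instead of producing an empty master address.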
    

    Here is the content of rundistributed.sh:

    #!/bin/bash
    # $1 is the node rank (the MPI rank, one MPI process per node)
    # $2 is the master address
    
    python -m torch.distributed.launch \
        --nnodes=2 --nproc_per_node=4 --node_rank="$1" \
        --master_addr="$2" ../distributed4.py --local_world_size=4
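
A possible alternative that avoids the MPI helper, sketched here under assumptions and not tested on the cluster above: Slurm exports `SLURM_NODEID` and `SLURM_JOB_NODELIST` to each task, and `scontrol show hostnames` expands the node list, so a small Python wrapper started by srun could build the same launch command. The function names are hypothetical. The sketch also assumes that the first hostname in the list belongs to the node with `SLURM_NODEID` 0 and that this hostname resolves to an interface the other node can reach, which is exactly what the explicit eth3 lookup guarantees in the original solution.

```python
import os
import subprocess
from typing import List


def node_hostnames(nodelist: str) -> List[str]:
    """Expand a Slurm node list expression into hostnames via scontrol."""
    out = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.split()


def build_launch_command(node_rank: int, master_addr: str, nnodes: int = 2,
                         nproc_per_node: int = 4,
                         script: str = "../distributed4.py") -> List[str]:
    """Build the same command line that rundistributed.sh runs."""
    return [
        "python", "-m", "torch.distributed.launch",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        script,
        f"--local_world_size={nproc_per_node}",
    ]


def main() -> None:
    hosts = node_hostnames(os.environ["SLURM_JOB_NODELIST"])
    node_rank = int(os.environ["SLURM_NODEID"])
    # Assumption: the first host in the list acts as the master node.
    cmd = build_launch_command(node_rank, hosts[0], nnodes=len(hosts))
    print("launch command:", " ".join(cmd))
    subprocess.run(cmd, check=True)


# Only attempt the launch when running inside a Slurm allocation.
if __name__ == "__main__" and "SLURM_JOB_NODELIST" in os.environ:
    main()
```

Saved under a hypothetical name such as `launch_ddp.py`, it would be started with one task per node, in the same way `./initrank` is started with srun in the answer.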
    

    The result is as follows:

    $ srun -C gpu -N 2 -c 8 -n 2 --gpus-per-task=4 ./initrank 
    srun: job 2520882 queued and waiting for resources
    srun: job 2520882 has been allocated resources
    rank is 1 ip addr is 192.168.174.14
    
    rank is 0 ip addr is 192.168.174.13
    
    pytorchcommand: /bin/bash ../rundistributed.sh 1 192.168.174.13
    
    pytorchcommand: /bin/bash ../rundistributed.sh 0 192.168.174.13
    
    [37240] Initializing process group with: {'MASTER_ADDR': '192.168.174.13', 'MASTER_PORT': '29500', 'RANK': '6', 'WORLD_SIZE': '8'}
    [37238] Initializing process group with: {'MASTER_ADDR': '192.168.174.13', 'MASTER_PORT': '29500', 'RANK': '4', 'WORLD_SIZE': '8'}
    [78961] Initializing process group with: {'MASTER_ADDR': '192.168.174.13', 'MASTER_PORT': '29500', 'RANK': '0', 'WORLD_SIZE': '8'}
    [37239] Initializing process group with: {'MASTER_ADDR': '192.168.174.13', 'MASTER_PORT': '29500', 'RANK': '5', 'WORLD_SIZE': '8'}
    [78963] Initializing process group with: {'MASTER_ADDR': '192.168.174.13', 'MASTER_PORT': '29500', 'RANK': '2', 'WORLD_SIZE': '8'}
    [78962] Initializing process group with: {'MASTER_ADDR': '192.168.174.13', 'MASTER_PORT': '29500', 'RANK': '1', 'WORLD_SIZE': '8'}
    [37241] Initializing process group with: {'MASTER_ADDR': '192.168.174.13', 'MASTER_PORT': '29500', 'RANK': '7', 'WORLD_SIZE': '8'}
    [78964] Initializing process group with: {'MASTER_ADDR': '192.168.174.13', 'MASTER_PORT': '29500', 'RANK': '3', 'WORLD_SIZE': '8'}
    [37241]: world_size = 8, rank = 7, backend=gloo 
    [78962]: world_size = 8, rank = 1, backend=gloo 
    [78963]: world_size = 8, rank = 2, backend=gloo 
    [78961]: world_size = 8, rank = 0, backend=gloo 
    [78964]: world_size = 8, rank = 3, backend=gloo 
    [37238]: world_size = 8, rank = 4, backend=gloo 
    [37240]: world_size = 8, rank = 6, backend=gloo 
    [37239]: world_size = 8, rank = 5, backend=gloo 
    [37241] rank = 7, world_size = 8, attachedDevice = 1, device_ids = [3] 
    [78961] rank = 0, world_size = 8, attachedDevice = 1, device_ids = [0] 
    [78964] rank = 3, world_size = 8, attachedDevice = 1, device_ids = [3] 
    [78963] rank = 2, world_size = 8, attachedDevice = 1, device_ids = [2] 
    [78962] rank = 1, world_size = 8, attachedDevice = 1, device_ids = [1] 
    [37239] rank = 5, world_size = 8, attachedDevice = 1, device_ids = [1] 
    [37240] rank = 6, world_size = 8, attachedDevice = 1, device_ids = [2] 
    [37238] rank = 4, world_size = 8, attachedDevice = 1, device_ids = [0] 
    *****************************************
    Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
    *****************************************
    *****************************************
    Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
    *****************************************
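
The rank numbers in this output are consistent with the launcher deriving each global rank from the node rank and the local rank. A minimal sketch of that mapping, assuming 4 processes per node as above (the helper names are hypothetical):

```python
def global_rank(node_rank: int, nproc_per_node: int, local_rank: int) -> int:
    """Global rank of a process, given its node and its index on that node."""
    return node_rank * nproc_per_node + local_rank


def ranks_on_node(node_rank: int, nproc_per_node: int = 4) -> list:
    """All global ranks hosted by one node."""
    return [global_rank(node_rank, nproc_per_node, lr)
            for lr in range(nproc_per_node)]


if __name__ == "__main__":
    print(ranks_on_node(0))  # [0, 1, 2, 3]
    print(ranks_on_node(1))  # [4, 5, 6, 7]
```

In the hanging run from the question, both nodes report ranks 0 to 3. That is what this mapping yields when every node uses node rank 0, which is the launcher's default when `--node_rank` is not passed, so the process group with world size 8 never sees ranks 4 to 7.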
    
    
