[Question title]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! when resuming training
[Posted]: 2021-05-11 10:41:00
[Question description]:

I saved a checkpoint while training on the GPU. After reloading the checkpoint and continuing training, I get the following error.

Traceback (most recent call last):
  File "main.py", line 140, in <module>
    train(model,optimizer,train_loader,val_loader,criteria=args.criterion,epoch=epoch,batch=batch)
  File "main.py", line 71, in train
    optimizer.step()
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/optim/sgd.py", line 106, in step
    buf.mul_(momentum).add_(d_p, alpha=1 - dampening)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

My training code is:

def train(model,optimizer,train_loader,val_loader,criteria,epoch=0,batch=0):
    batch_count = batch
    if criteria == 'l1':
        criterion = L1_imp_Loss()
    elif criteria == 'l2':
        criterion = L2_imp_Loss()
    if args.gpu and torch.cuda.is_available():
        model.cuda()
        criterion = criterion.cuda()

    print(f'{datetime.datetime.now().time().replace(microsecond=0)} Starting to train..')
    
    while epoch <= args.epochs-1:
        print(f'********{datetime.datetime.now().time().replace(microsecond=0)} Epoch#: {epoch+1} / {args.epochs}')
        model.train()
        interval_loss, total_loss= 0,0
        for i , (input,target) in enumerate(train_loader):
            batch_count += 1
            if args.gpu and torch.cuda.is_available():
                input, target = input.cuda(), target.cuda()
            input, target = input.float(), target.float()
            pred = model(input)
            loss = criterion(pred,target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            ....

The checkpoint is saved at the end of every epoch.

torch.save({'epoch': epoch, 'batch': batch_count,
            'model_state_dict': model.state_dict(), 'optimizer_state_dict': optimizer.state_dict(),
            'loss': total_loss / len(train_loader), 'train_set': args.train_set, 'val_set': args.val_set, 'args': args},
           f'{args.weights_dir}/FastDepth_Final.pth')

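I don't know why this error occurs. args.gpu == True, and I move the model, all the data, and the loss function to cuda, yet somehow there is still a tensor on the cpu. Can anyone spot the problem?

Thanks.

[Comments]: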

  • It looks like the problem comes from criterion(pred, target). Can you check pred.is_cuda and target.is_cuda?
  • It looks like you are calling .cuda() on the model too late: it needs to be called before the optimizer is initialized. From the docs: "If you need to move a model to GPU via .cuda(), please do so before constructing optimizers for it. Parameters of a model after .cuda() will be different objects with those before the call. In general, you should make sure that optimized parameters live in consistent locations when optimizers are constructed and used." See the torch.optim documentation.
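
Both comments come down to finding out which tensor is still on the cpu. The traceback points at the SGD momentum buffer (buf), so the optimizer state is worth inspecting as well as pred and target. Below is a minimal sketch of such a check. The helper name report_devices and the tiny nn.Linear model are hypothetical stand-ins, not part of the asker's code.

```python
import torch
import torch.nn as nn


def report_devices(model, optimizer):
    """Return the devices used by model parameters and by optimizer state tensors."""
    param_devices = {str(p.device) for p in model.parameters()}
    state_devices = set()
    # optimizer.state maps each parameter to a dict of per-parameter state,
    # e.g. {'momentum_buffer': tensor} for SGD with momentum.
    for per_param_state in optimizer.state.values():
        for value in per_param_state.values():
            if torch.is_tensor(value):
                state_devices.add(str(value.device))
    return param_devices, state_devices


# Tiny stand-in model (hypothetical, not the asker's network).
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Take one step so that SGD actually creates its momentum buffers.
model(torch.randn(8, 4)).sum().backward()
optimizer.step()

param_devices, state_devices = report_devices(model, optimizer)
print("parameters on:     ", param_devices)
print("optimizer state on:", state_devices)
```

If the two sets differ right before optimizer.step() (for example parameters on cuda:0 and state on cpu), the mismatch is in the optimizer state rather than in the inputs or the loss.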

Tags: python deep-learning pytorch runtime-error


[Solution 1]:

There may be an issue with the device the parameters are on:

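If you need to move a model to GPU via .cuda(), please do so before constructing optimizers for it. Parameters of a model after .cuda() will be different objects with those before the call.
In general, you should make sure that optimized parameters live in consistent locations when optimizers are constructed and used.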
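
Applied to resuming from a checkpoint, that advice suggests this order: move the model to the target device first, then construct the optimizer, then load the optimizer state. As far as I know, optimizer.load_state_dict() casts the loaded state tensors to the device each parameter is on at the time of the call, so loading while the model is still on the cpu and calling model.cuda() later inside train() would leave the momentum buffers on the cpu. The sketch below assumes a tiny nn.Linear model and a temporary checkpoint path, both hypothetical, and falls back to the cpu when no GPU is available.

```python
import os
import tempfile

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# --- Simulate the first run: one training step, then save a checkpoint. ---
model = nn.Linear(4, 2).to(device)  # hypothetical stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
model(torch.randn(8, 4, device=device)).sum().backward()
optimizer.step()

ckpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pth")  # hypothetical path
torch.save(
    {
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    ckpt_path,
)

# --- Resume training. ---
checkpoint = torch.load(ckpt_path, map_location=device)

resumed_model = nn.Linear(4, 2)
resumed_model.load_state_dict(checkpoint["model_state_dict"])
resumed_model.to(device)  # 1. move the model to the target device first

# 2. construct the optimizer from the parameters that already live on `device`
resumed_optimizer = torch.optim.SGD(resumed_model.parameters(), lr=0.1, momentum=0.9)

# 3. only now load the optimizer state, so it ends up next to the parameters
resumed_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])

# One more training step; the momentum buffers and gradients share a device.
resumed_optimizer.zero_grad()
resumed_model(torch.randn(8, 4, device=device)).sum().backward()
resumed_optimizer.step()
```

In the asker's code this would mean calling model.cuda() in main.py before the optimizer is created and before its state is loaded, instead of inside train(). If the optimizer state has already been loaded on the cpu, another option is to move every tensor in optimizer.state to the target device by hand.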

[Discussion]:
