torch.nn.DataParallel fails with torch.autograd.grad in the loss function

【Question title】: torch.nn.DataParallel with torch.autograd.grad in loss function fails
【Posted】: 2021-10-09 08:37:52
【Question description】:

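I have a neural network model that represents the surface of an object. To train it, gradients are computed inside the loss function (for example, because one property of signed distance fields (SDFs) is that their gradient always has unit length). The loss function is the sdf function from SIREN, defined as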

import torch
import torch.nn.functional as F

import diff_operators  # SIREN module that provides gradient(), shown further down


def sdf(model_output, gt):
    gt_sdf = gt['sdf']
    gt_normals = gt['normals']

    coords = model_output['model_in']
    pred_sdf = model_output['model_out'].to(torch.float32)

    # Spatial gradient of the predicted SDF with respect to the input coordinates.
    gradient = diff_operators.gradient(pred_sdf, coords)

    # gt_sdf == -1 marks off-surface samples; every other point is treated as an on-surface (boundary) constraint.
    sdf_constraint = torch.where(gt_sdf != -1, pred_sdf, torch.zeros_like(pred_sdf))
    inter_constraint = torch.where(gt_sdf != -1, torch.zeros_like(pred_sdf), torch.exp(-1e2 * torch.abs(pred_sdf)))
    normal_constraint = torch.where(gt_sdf != -1, 1 - F.cosine_similarity(gradient, gt_normals, dim=-1)[..., None],
                                    torch.zeros_like(gradient[..., :1]))
    # Eikonal term: the gradient of an SDF should have unit norm everywhere.
    grad_constraint = torch.abs(gradient.norm(dim=-1) - 1)

    return {'sdf': torch.abs(sdf_constraint).mean() * 3e3,
            'inter': inter_constraint.mean() * 1e2,
            'normal_constraint': normal_constraint.mean() * 1e2,
            'grad_constraint': grad_constraint.mean() * 5e1}

The gradient computation uses torch.autograd.grad:

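import torch


def gradient(y, x, grad_outputs=None):
    # Returns dy/dx; x must be part of the graph that produced y.
    if grad_outputs is None:
        grad_outputs = torch.ones_like(y)
    # create_graph=True keeps the result differentiable so the loss can be backpropagated through it.
    grad = torch.autograd.grad(y, [x], grad_outputs=grad_outputs, create_graph=True)[0]
    return grad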
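
To make the role of this helper concrete, here is a minimal sketch that applies it to an exact signed distance function instead of a network. The `sphere_sdf` function and the sample points are illustrative assumptions, not part of the question. It only shows that the unit-gradient property mentioned above holds for a true SDF, so the `grad_constraint` term would be close to zero.

```python
import torch


def gradient(y, x, grad_outputs=None):
    # Same helper as in the question: dy/dx, kept differentiable.
    if grad_outputs is None:
        grad_outputs = torch.ones_like(y)
    return torch.autograd.grad(y, [x], grad_outputs=grad_outputs, create_graph=True)[0]


def sphere_sdf(points, radius=1.0):
    # Hypothetical stand-in for the network: exact signed distance to a sphere at the origin.
    return points.norm(dim=-1, keepdim=True) - radius


torch.manual_seed(0)
# Sample points away from the origin, where the sphere SDF is differentiable.
coords = (torch.randn(8, 3) + 2.0).requires_grad_(True)

sdf_values = sphere_sdf(coords)
grad = gradient(sdf_values, coords)

# For a true SDF the gradient norm is 1, so this residual is (numerically) zero.
grad_norm = grad.norm(dim=-1)
eikonal_residual = torch.abs(grad_norm - 1).mean()
```

Because `create_graph=True` is set, `grad` itself carries a graph, which is what lets the loss in the question backpropagate through the gradient term.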

Now I want to parallelize training by using torch.nn.DataParallel. I get the following error:

RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.
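
As a point of reference, the same error can be triggered on a single CPU, without DataParallel. This is a minimal sketch with made-up tensors: it only illustrates what the message means (the tensor passed as the differentiation target is not an ancestor of the output in the autograd graph), and it does not claim to show what DataParallel does internally.

```python
import torch

x = torch.randn(4, 3, requires_grad=True)
y = (x ** 2).sum(dim=-1, keepdim=True)

# Same values as x, but a separate leaf tensor: y was not computed from it.
x_copy = x.detach().clone().requires_grad_(True)

# Differentiating with respect to the tensor that actually produced y works.
grad_ok = torch.autograd.grad(y, [x], grad_outputs=torch.ones_like(y), create_graph=True)[0]

# Differentiating with respect to the copy raises the RuntimeError quoted above.
try:
    torch.autograd.grad(y, [x_copy], grad_outputs=torch.ones_like(y), create_graph=True)
    error_message = None
except RuntimeError as err:
    error_message = str(err)
```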

Is it possible to compute gradients inside the loss function when using torch.nn.DataParallel, and what do I need to change to make it work?

【Question discussion】:

Tags: deep-learning neural-network pytorch autograd

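【Solution 1】:

See the documentation for nn.parallel.DistributedDataParallel:

This module doesn't work with torch.autograd.grad() (i.e. it will only work if gradients are to be accumulated in .grad attributes of parameters).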
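
The distinction the quoted note draws can be seen in a small single-process sketch. The `Linear` model and tensor shapes below are arbitrary assumptions, and no distributed wrapper is involved. It only shows that torch.autograd.grad returns gradients directly and leaves the parameters' .grad attributes untouched, while backward() accumulates into them.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(3, 1)
x = torch.randn(5, 3, requires_grad=True)

out = model(x).sum()

# torch.autograd.grad returns the gradient as a tensor; nothing is written to .grad.
(input_grad,) = torch.autograd.grad(out, [x], create_graph=True)
grads_after_autograd_grad = [p.grad for p in model.parameters()]  # every entry is None

# backward() is the path that accumulates gradients into the .grad attributes.
out.backward()
grads_after_backward = [p.grad for p in model.parameters()]
```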

The torch.distributed documentation also suggests using the gloo backend:

Note that currently the only backend where all functions are guaranteed to work is gloo.

【Discussion】:

Thanks for your answer! But precisely for that reason I am not using nn.parallel.DistributedDataParallel, but nn.parallel.DataParallel.
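
The thread does not settle the DataParallel case. One plausible explanation, offered here as an assumption and not verified on a multi-GPU machine, is that the coordinates returned as 'model_in' are re-assembled from the per-replica outputs, so the merged tensor is no longer the one the merged 'model_out' was computed from. The sketch below imitates that split-and-merge on the CPU with torch.chunk and torch.cat. `TinySDF` and `fake_data_parallel` are hypothetical stand-ins, not the real SIREN model or the real DataParallel implementation. It shows a possible workaround: call torch.autograd.grad inside forward, where the output is still connected to its input, and return the gradient as an extra output.

```python
import torch


def gradient(y, x):
    return torch.autograd.grad(y, [x], grad_outputs=torch.ones_like(y), create_graph=True)[0]


class TinySDF(torch.nn.Module):
    """Hypothetical stand-in for the SIREN model; returns its input like model_output in the question."""

    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(torch.nn.Linear(3, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))

    def forward(self, coords):
        coords = coords.clone().detach().requires_grad_(True)
        out = self.net(coords)
        # Workaround idea: differentiate here, while out is still connected to coords.
        return {'model_in': coords, 'model_out': out, 'gradient': gradient(out, coords)}


def fake_data_parallel(model, coords, num_chunks=2):
    # CPU-only imitation of split -> per-chunk forward -> merge (an assumption, not the real code path).
    outputs = [model(chunk) for chunk in coords.chunk(num_chunks, dim=0)]
    return {key: torch.cat([o[key] for o in outputs], dim=0) for key in outputs[0]}


torch.manual_seed(0)
model = TinySDF()
merged = fake_data_parallel(model, torch.randn(8, 3))

# Differentiating the merged output w.r.t. the merged input fails, as in the question.
try:
    gradient(merged['model_out'], merged['model_in'])
    failed = False
except RuntimeError:
    failed = True

# The gradient computed inside forward is still usable in the loss and can be backpropagated.
grad_constraint = torch.abs(merged['gradient'].norm(dim=-1) - 1).mean()
grad_constraint.backward()
```

With this layout the loss function would read the precomputed 'gradient' entry instead of calling diff_operators.gradient on the merged tensors. Whether the real DataParallel behaves the same way would need to be checked on actual hardware.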