【Question title】: PyTorch error in trying to backward through the graph a second time
【Posted】: 2021-03-11 00:44:02
【Question】:

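I am trying to run this code: https://github.com/aitorzip/PyTorch-CycleGAN
I only modified the dataloader and the transforms to be compatible with my data. When I try to run it, I get this error: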

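Traceback (most recent call last):
  File "models/CycleGANs/train", line 150, in <module>
    loss_D_A.backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling backward the first time.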

This is the training loop up to the point of the error:

for epoch in range(opt.epoch, opt.n_epochs):
    for i, batch in enumerate(dataloader):
        # Set model input
        real_A = Variable(input_A.copy_(batch['A']))
        real_B = Variable(input_B.copy_(batch['B']))

        ##### Generators A2B and B2A #####
        optimizer_G.zero_grad()

        # Identity loss
        # G_A2B(B) should equal B if real B is fed
        same_B = netG_A2B(real_B)
        loss_identity_B = criterion_identity(same_B, real_B)*5.0
        # G_B2A(A) should equal A if real A is fed
        same_A = netG_B2A(real_A)
        loss_identity_A = criterion_identity(same_A, real_A)*5.0

        # GAN loss
        fake_B = netG_A2B(real_A)
        pred_fake = netD_B(fake_B)
        loss_GAN_A2B = criterion_GAN(pred_fake, target_real)

        fake_A = netG_B2A(real_B)
        pred_fake = netD_A(fake_A)
        loss_GAN_B2A = criterion_GAN(pred_fake, target_real)

        # Cycle loss
        # TODO: cycle loss doesn't allow for multimodality. I leave it for now but needs to be thrown out later
        recovered_A = netG_B2A(fake_B)
        loss_cycle_ABA = criterion_cycle(recovered_A, real_A)*10.0

        recovered_B = netG_A2B(fake_A)
        loss_cycle_BAB = criterion_cycle(recovered_B, real_B)*10.0

        # Total loss
        loss_G = loss_identity_A + loss_identity_B + loss_GAN_A2B + loss_GAN_B2A + loss_cycle_ABA + loss_cycle_BAB
        loss_G.backward()

        optimizer_G.step()

        ##### Discriminator A #####
        optimizer_D_A.zero_grad()

        # Real loss
        pred_real = netD_A(real_A)
        loss_D_real = criterion_GAN(pred_real, target_real)

        # Fake loss
        fake_A = fake_A_buffer.push_and_pop(fake_A)
        pred_fale = netD_A(fake_A.detach())
        loss_D_fake = criterion_GAN(pred_fake, target_fake)

        # Total loss
        loss_D_A = (loss_D_real + loss_D_fake)*0.5
        loss_D_A.backward()

I am completely unfamiliar with what this error means. My guess is that it has something to do with fake_A_buffer, which is just fake_A_buffer = ReplayBuffer():

class ReplayBuffer():
    def __init__(self, max_size=50):
        assert (max_size > 0), 'Empty buffer or trying to create a black hole. Be careful.'
        self.max_size = max_size
        self.data = []

    def push_and_pop(self, data):
        to_return = []
        for element in data.data:
            element = torch.unsqueeze(element, 0)
            if len(self.data) < self.max_size:
                self.data.append(element)
                to_return.append(element)
            else:
                if random.uniform(0,1) > 0.5:
                    i = random.randint(0, self.max_size-1)
                    to_return.append(self.data[i].clone())
                    self.data[i] = element
                else:
                    to_return.append(element)
        return Variable(torch.cat(to_return))

Error after setting loss_G.backward(retain_graph=True):

Traceback (most recent call last):
  File "models/CycleGANs/train", line 150, in <module>
    loss_D_A.backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [3, 64, 7, 7]] is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

After setting torch.autograd.set_detect_anomaly(True):

/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py:130: UserWarning: Error detected in MkldnnConvolutionBackward. Traceback of forward call that caused the error:
  File "models/CycleGANs/train", line 115, in <module>
    fake_B = netG_A2B(real_A)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/Histology-Style-Transfer-Research/models/CycleGANs/models.py", line 67, in forward
    return self.model(x)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/Histology-Style-Transfer-Research/models/CycleGANs/models.py", line 19, in forward
    return x + self.conv_block(x)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 423, in forward
    return self._conv_forward(input, self.weight)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 419, in _conv_forward
    return F.conv2d(input, weight, self.bias, self.stride,
 (Triggered internally at /opt/conda/conda-bld/pytorch_1603729096996/work/torch/csrc/autograd/python_anomaly_mode.cpp:104.)
  Variable._execution_engine.run_backward(
Traceback (most recent call last):
  File "models/CycleGANs/train", line 133, in <module>
    loss_G.backward(retain_graph=True)
  File "/opt/conda/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Function 'MkldnnConvolutionBackward' returned nan values in its second output.

【Question comments】:

Tags: python deep-learning pytorch backpropagation autograd


【Solution 1】:

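loss_G.backward() should be loss_G.backward(retain_graph=True). This is because a normal backward pass frees the intermediate results saved in the graph once it has used them, so the same graph cannot be traversed again; retain_graph=True tells autograd to keep them.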
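To make the effect of retain_graph concrete, here is a minimal sketch on a toy tensor. The names x and y are made up for this example and do not come from the question's code. It shows that a second backward() over the same graph fails by default, and that with retain_graph=True it succeeds and the gradients accumulate.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Default behaviour: the saved intermediates are freed by the first backward,
# so walking the same graph again raises a RuntimeError.
y = (x ** 2).sum()
y.backward()
second_call_error = None
try:
    y.backward()
except RuntimeError as err:
    second_call_error = err
print(type(second_call_error).__name__)

# With retain_graph=True on the first call, the graph survives and a second
# backward is allowed. Gradients accumulate across the two passes: 2*x + 2*x.
x.grad = None
y = (x ** 2).sum()
y.backward(retain_graph=True)
y.backward()
print(x.grad)
```

Note that retain_graph=True only keeps the graph alive. It does not help if a tensor that the graph depends on is changed in place between the two backward calls, for example by an optimizer step.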

【Comments】:

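  • I tried that, but unfortunately it doesn't work. It shows exactly the same error in the same place.
  • Does the criterion call backward for any reason? Can you show the full stack trace, please?
  • I must have done something wrong before; now it shows a different error, but in the same place. I have updated the original question with it. The criterion is just the standard nn.MSELoss and nn.L1Loss.
  • Try setting realA.grad = None and realB.grad = None after optimizer_D_A.zero_grad(). Doing second-order backpropagation can cause some strange things to happen, and setting the label/input grad to None has worked for me in the past.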
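One reading of the loop in the question, not confirmed anywhere in this thread: the discriminator step stores its new prediction in pred_fale, so loss_D_fake is still computed from the pred_fake left over from the generator step. That stale tensor is attached to the generator's graph, which would explain both errors. The sketch below reproduces the pattern with toy linear layers standing in for the real networks; netG, netD, run_step and both of its flags are hypothetical names for this illustration only.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-ins (assumptions): the real generator and discriminator are conv nets.
netG = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))
netD = nn.Linear(4, 1)
optimizer_G = torch.optim.SGD(netG.parameters(), lr=0.1)
optimizer_D = torch.optim.SGD(netD.parameters(), lr=0.1)
criterion_GAN = nn.MSELoss()

real = torch.randn(2, 4)
target_real = torch.ones(2, 1)
target_fake = torch.zeros(2, 1)


def run_step(reuse_stale_prediction, retain_graph=False):
    """One generator step followed by one discriminator step."""
    optimizer_G.zero_grad()
    fake = netG(real)
    pred_fake = netD(fake)  # attached to the generator's graph
    loss_G = criterion_GAN(pred_fake, target_real)
    loss_G.backward(retain_graph=retain_graph)
    optimizer_G.step()  # updates the generator weights in place

    optimizer_D.zero_grad()
    pred_real = netD(real)
    if not reuse_stale_prediction:
        # Fresh forward pass on a detached fake: no path back into netG.
        pred_fake = netD(fake.detach())
    loss_D = (criterion_GAN(pred_real, target_real)
              + criterion_GAN(pred_fake, target_fake)) * 0.5
    loss_D.backward()
    optimizer_D.step()
    return loss_D.item()


def capture_error(**kwargs):
    """Run a step and return the RuntimeError it raises, or None."""
    try:
        run_step(**kwargs)
    except RuntimeError as err:
        return err
    return None


# Stale prediction, plain backward(): the generator graph is already freed.
freed_graph_error = capture_error(reuse_stale_prediction=True)
# Stale prediction, retain_graph=True: the graph survives, but
# optimizer_G.step() has modified weights that the graph saved.
inplace_error = capture_error(reuse_stale_prediction=True, retain_graph=True)
# Fresh prediction: works without retain_graph.
fixed_loss = run_step(reuse_stale_prediction=False)

print(freed_graph_error)
print(inplace_error)
print(fixed_loss)
```

If this reading is right, renaming pred_fale to pred_fake in the discriminator step should be enough, and loss_G.backward() can go back to being called without retain_graph=True.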