Autoencoder model either oscillates or doesn't converge on MNIST dataset

【Question title】: Autoencoder model either oscillates or doesn't converge on MNIST dataset
【Posted】: 2019-05-31 23:39:55
【Question】:

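I ran this code 3 months ago and got the expected results. Nothing has changed since. I have tried troubleshooting with several earlier versions of the code, including the earliest one (which definitely worked). The problem persists.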

# 4 - Constructing the undercomplete architecture
class autoenc(nn.Module):
    def __init__(self, nodes = 100):
        super(autoenc, self).__init__() # inheritance
        self.full_connection0 = nn.Linear(784, nodes) # encoding weights
        self.full_connection1 = nn.Linear(nodes, 784) # decoding weights
        self.activation = nn.Sigmoid()

    def forward(self, x):
        x = self.activation(self.full_connection0(x)) # input encoding
        x = self.full_connection1(x) # output decoding
        return x



# 5 - Initializing autoencoder, squared L2 norm, and optimization algorithm
model = autoenc().cuda()
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(),
                          lr = 1e-3, weight_decay = 1/2)



# 6 - Training the undercomplete autoencoder model
num_epochs = 500
batch_size = 32
length = int(len(trn_data) / batch_size)

loss_epoch1 = []

for epoch in range(num_epochs):
    train_loss = 0
    score = 0. 


    for num_data in range(length - 2):
        batch_ind = (batch_size * num_data)
        input = Variable(trn_data[batch_ind : batch_ind + batch_size]).cuda()

        # === forward propagation ===
        output = model(input)
        loss = criterion(output, trn_data[batch_ind : batch_ind + batch_size])

        # === backward propagation ===
        loss.backward()

        # === calculating epoch loss ===
        train_loss += np.sqrt(loss.item())
        score += 1. #<- add for average loss error instead of total
        optimizer.step()

    loss_calculated = train_loss/score
    print('epoch: ' + str(epoch + 1) + '   loss: ' + str(loss_calculated))
    loss_epoch1.append(loss_calculated)

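Now when I plot the loss, it oscillates wildly (at lr = 1e-3), whereas 3 months ago it converged steadily (at lr = 1e-3).

I can't upload images because my account was created recently.

What it looks like now.

That, however, is with the learning rate lowered to 1e-5. At 1e-3, the loss is all over the place.

What it should look like, and used to look like at lr = 1e-3.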

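【Comments】:

You should call optimizer.zero_grad() before loss.backward(), because gradients accumulate. This is most likely what is causing the problem.
That was the problem. Thank you!!! How do I give you an upvote or reputation? Just out of curiosity, though: is this something that changed recently in pytorch?
I'll add it as an answer, which you can accept. No, I don't think it changed recently; this has been the standard for a long time.
Never mind. I added optimizer.zero_grad(), but now the loss barely changes and ends up at 0.35 (far from the roughly 0.14 it used to reach).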

Tags: python deep-learning pytorch autoencoder

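【Solution 1】:

You should call optimizer.zero_grad() before loss.backward(), because gradients accumulate across backward passes. This is most likely what is causing the problem.

The general order to follow during training: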

optimizer.zero_grad()
output = model(input)
loss = criterion(output, label)
loss.backward()
optimizer.step()
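
The accumulation behind this ordering can be checked directly. The following is a minimal sketch, assuming a toy tensor `w` (a hypothetical stand-in, not the model from the question): calling backward() twice without clearing doubles the stored gradient, and zeroing it brings back the fresh value.

```python
import torch

# Toy parameter (hypothetical; stands in for a model weight).
w = torch.ones(3, requires_grad=True)

# First backward pass: d/dw of sum(2 * w) is 2 for every element.
(w * 2).sum().backward()
first = w.grad.clone()

# Second backward pass without zeroing: the new gradient is ADDED to the old one.
(w * 2).sum().backward()
accumulated = w.grad.clone()

# Clearing the gradient first gives the fresh value again.
w.grad.zero_()
(w * 2).sum().backward()
fresh = w.grad.clone()

print(first.tolist())        # [2.0, 2.0, 2.0]
print(accumulated.tolist())  # [4.0, 4.0, 4.0]
print(fresh.tolist())        # [2.0, 2.0, 2.0]
```

In a training loop, optimizer.zero_grad() performs this clearing for every parameter the optimizer holds, which is why it belongs before each loss.backward().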

In addition, the weight decay value used (1 / 2) was also causing problems.
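
One way to see why a value as large as 0.5 hurts: torch.optim.Adam applies weight_decay as an L2 penalty, adding weight_decay * param to each gradient before the update. The following is a minimal sketch, assuming a toy parameter whose loss gradient is always zero (hypothetical, chosen so that the decay term is the only thing moving the weight); the helper name `final_weight` is made up for illustration.

```python
import torch

def final_weight(weight_decay, steps=100):
    """Run Adam on a toy parameter and return its value after `steps` updates."""
    w = torch.ones(1, requires_grad=True)  # hypothetical parameter, starts at 1.0
    opt = torch.optim.Adam([w], lr=1e-3, weight_decay=weight_decay)
    for _ in range(steps):
        opt.zero_grad()
        loss = (w * 0.0).sum()  # the loss itself contributes a zero gradient
        loss.backward()
        opt.step()
    return w.item()

no_decay = final_weight(0.0)     # nothing pushes on the weight
heavy_decay = final_weight(0.5)  # the decay term alone shrinks the weight

print(no_decay, heavy_decay)
```

Without decay the parameter stays at 1.0, while with weight_decay = 0.5 it is pulled steadily toward zero even though the loss asks for no change. In the real model this pull competes with the reconstruction gradient, which is a plausible reason for the loss stalling at a higher value.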

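【Comments】:

Adding optimizer.zero_grad() was definitely part of the solution. For anyone in the future who ends up struggling with training that seems to have stopped converging: what fixed it for me was getting rid of weight decay. It works now!
OK, I'll add that to the answer as well. You can try a lower weight decay value such as 1e-4 or 1e-5 instead of the 0.5 used before.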