Why does my training loss oscillate while training the final layer of AlexNet with pre-trained weights?

【Question title】: Why does my training loss oscillate while training the final layer of AlexNet with pre-trained weights?
【Posted】: 2018-02-15 09:46:47
【Question】:

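I am working on texture classification. Based on previous work, I am trying to modify the final layer of AlexNet to have 20 classes and to train only that layer for my multi-class classification problem. I am using TensorFlow-GPU on an NVIDIA GTX 1080, with Python 3.6 on Ubuntu 16.04. I am using the gradient descent optimizer and the Estimator class to build the model. I am also using two dropout layers for regularization. My hyperparameters are therefore the learning rate, batch_size, and weight_decay. I have tried batch_size values of 50, 100, and 200; weight_decay values of 0.005 and 0.0005; and learning rates of 1e-3, 1e-4, and 1e-5. The training loss curves for all of these values follow a similar trend.

Instead of decreasing monotonically, my training loss curve seems to oscillate. I have provided the TensorBoard visualization for learning rate = 1e-5, weight decay = 0.0005, and batch_size = 200.

Please help me understand what went wrong and how I can correct it. [Figure: the TensorBoard visualization for the case I specified]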

    import tensorflow as tf  # TensorFlow 1.x API

    # Create the Estimator
    classifier = tf.estimator.Estimator(model_fn=cnn_model)

    # Set up logging for predictions
    tensors_to_log = {"probabilities": "softmax_tensor"}
    logging_hook = tf.train.LoggingTensorHook(tensors=tensors_to_log, every_n_iter=10)

    # Train the model
    train_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"x": train_data},
        y=train_labels,
        batch_size=batch_size,
        num_epochs=None,
        shuffle=True)
    classifier.train(input_fn=train_input_fn, steps=200000, hooks=[logging_hook])

    # Evaluate the model and print results
    eval_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"x": eval_data},
        y=eval_labels,
        num_epochs=1,
        shuffle=False)
    eval_results = classifier.evaluate(input_fn=eval_input_fn)
    print(eval_results)

# Sections of cnn_model

    # Output config
    predictions = {
        # Generate predictions (for PREDICT and EVAL mode)
        "classes": tf.argmax(input=logits, axis=1),
        # Add `softmax_tensor` to the graph. It is used for PREDICT and by the `logging_hook`.
        "probabilities": tf.nn.softmax(logits, name="softmax_tensor"),
    }
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)

    # Calculate loss (for both TRAIN and EVAL modes)
    onehot_labels = tf.one_hot(indices=tf.cast(labels, tf.int32), depth=20)
    loss = tf.losses.softmax_cross_entropy(onehot_labels=onehot_labels, logits=logits)

    # Training config
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
        tf.summary.scalar('training_loss', loss)
        summary_hook = tf.train.SummarySaverHook(
            save_steps=10, output_dir='outputs', summary_op=tf.summary.merge_all())
        train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(
            mode=mode, loss=loss, train_op=train_op, training_hooks=[summary_hook])

    # Evaluation metric: accuracy
    # tf.metrics.accuracy returns a (value, update_op) pair; build it once and reuse it.
    accuracy = tf.metrics.accuracy(labels=labels, predictions=predictions["classes"])
    eval_metric_ops = {"accuracy": accuracy}
    print(time.time() - t)  # elapsed time; `time` and `t` come from code not shown here
    tf.summary.scalar('eval_loss', loss)
    tf.summary.scalar('eval_accuracy', accuracy[1])  # a scalar tensor, not the whole pair
    evaluation_hook = tf.train.SummarySaverHook(
        save_steps=10, output_dir='outputseval', summary_op=tf.summary.merge_all())
    return tf.estimator.EstimatorSpec(
        mode=mode, loss=loss, eval_metric_ops=eval_metric_ops,
        evaluation_hooks=[evaluation_hook])

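【Comments】:

This question belongs on datascience.stackexchange.com.
@RSSharma: Did you find a solution to your problem? I am facing a similar situation: my training loss also looks like a periodic function. Thanks.

Tags: python tensorflow machine-learning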

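【Answer 1】:

Are you selecting your mini-batches randomly? It looks like there is high variance between your mini-batches, which leads to high variance in the loss from one iteration to the next. I assume that the x-axis of your plot shows iterations rather than epochs, and that roughly every 160 iterations the training data being fed in is harder to predict, which would explain the periodic drop in your loss curve. How does your validation loss behave?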
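One way to check the "every 160 iterations" reading is to estimate the period of the logged loss and convert it into a number of training examples (period × batch size). If that number is close to the size of the training set, the cycle is tied to epoch boundaries. The following is only a sketch: `autocorrelation` and `estimate_period` are hypothetical helpers, and the loss values are synthetic, not the asker's data.

```python
import math
import random


def autocorrelation(values, max_lag):
    """Biased sample autocorrelation for lags 0..max_lag."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    variance = sum(c * c for c in centered)
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / variance
        for lag in range(max_lag + 1)
    ]


def estimate_period(values):
    """Return the lag of the strongest repeat, or None if no repeat is found."""
    acf = autocorrelation(values, max_lag=len(values) // 2)
    # Skip the trivial peak around lag 0: start after the first negative value.
    first_negative = next((lag for lag, r in enumerate(acf) if r < 0), None)
    if first_negative is None:
        return None
    return max(range(first_negative, len(acf)), key=lambda lag: acf[lag])


# Synthetic stand-in for a logged loss curve (NOT the asker's data):
# a slow cycle of 160 steps plus a little noise.
rng = random.Random(0)
loss_log = [1.0 + 0.3 * math.sin(2 * math.pi * step / 160) + rng.gauss(0, 0.02)
            for step in range(1600)]

period = estimate_period(loss_log)
batch_size = 200  # value from the question
examples_per_cycle = period * batch_size
print("estimated period (iterations):", period)
print("examples consumed per cycle:", examples_per_cycle)
```

A real loss curve usually also has a downward trend, which would need to be removed (for example by subtracting a moving average) before the autocorrelation is meaningful.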

Possible solutions/ideas:

Try randomizing your training data selection more thoroughly
Check whether your training data contains mislabeled examples
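A minimal sketch of the first idea, assuming the data fits in memory as Python lists: draw a fresh permutation at the start of every epoch and look at how many classes each batch contains. `epoch_batches` is a hypothetical helper, not part of TensorFlow, and the toy labels (20 classes of 50 examples, stored class by class) are made up for illustration.

```python
import random
from collections import Counter


def epoch_batches(examples, labels, batch_size, rng):
    """Yield (examples, labels) mini-batches in a freshly shuffled order."""
    order = list(range(len(examples)))
    rng.shuffle(order)  # a new permutation on every call, i.e. every epoch
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield [examples[i] for i in idx], [labels[i] for i in idx]


# Toy data stored class by class: 20 classes, 50 examples each (made up).
labels = [c for c in range(20) for _ in range(50)]
examples = list(range(len(labels)))  # placeholders for images
batch_size = 50
rng = random.Random(0)

# Without shuffling, every batch holds a single class.
unshuffled = [len(set(labels[s:s + batch_size]))
              for s in range(0, len(labels), batch_size)]

# With per-epoch shuffling, batches mix classes and epochs differ in order.
epoch1 = list(epoch_batches(examples, labels, batch_size, rng))
epoch2 = list(epoch_batches(examples, labels, batch_size, rng))
shuffled = [len(set(batch_labels)) for _, batch_labels in epoch1]

print("classes per batch, unshuffled:", min(unshuffled), "to", max(unshuffled))
print("classes per batch, shuffled:  ", min(shuffled), "to", max(shuffled))
print("most common labels in first shuffled batch:",
      Counter(epoch1[0][1]).most_common(3))
```

If the real batches turn out to be dominated by one or two classes, the loss will swing as the dominant class changes, regardless of the optimizer settings.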

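【Discussion】:

Thanks for the suggestions. I have checked the training data for mislabeled examples, and that is not the problem. My batches are selected randomly: I use shuffle=True in the training input function. How else can I try to randomize the training data selection? Code: train_input_fn = tf.estimator.inputs.numpy_input_fn(x={"x": train_data},y=train_labels,batch_size=batch_size,num_epochs=None,shuffle=True)
It seems that you shuffle the data once and then use the same shuffled order in every epoch. If that is right, you could try reshuffling the data before each epoch. Then you should no longer see the periodic drop in your loss. However, if the variance across the whole training set is large, you still won't get a smooth curve. In general, the larger the batch_size, the smoother the loss curve. (That does not mean training is faster.)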
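The batch-size remark can be illustrated with a small simulation. This sketch assumes per-example losses that are independent draws from a fixed distribution (an exponential one, chosen arbitrarily) and measures how much the mean loss of a random mini-batch varies. `batch_loss_spread` is a hypothetical helper and the numbers are synthetic, not measurements from the asker's model.

```python
import random
import statistics

rng = random.Random(0)

# Synthetic per-example losses with a wide spread (NOT measured data).
per_example_loss = [rng.expovariate(1.0) for _ in range(20000)]


def batch_loss_spread(losses, batch_size, num_batches, rng):
    """Standard deviation of the mean loss across random mini-batches."""
    batch_means = [
        sum(rng.sample(losses, batch_size)) / batch_size
        for _ in range(num_batches)
    ]
    return statistics.pstdev(batch_means)


spread = {
    b: batch_loss_spread(per_example_loss, b, num_batches=500, rng=rng)
    for b in (50, 100, 200)  # batch sizes tried in the question
}
for b, s in spread.items():
    print(f"batch_size={b:3d}  std of batch loss={s:.3f}")
```

Under these assumptions the spread shrinks roughly like 1/sqrt(batch_size), so going from 50 to 200 only about halves the noise. That fits the asker's observation that all three batch sizes gave similar-looking curves.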
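Thanks. I will try the reshuffling and increase the batch size. Also, since the loss curves obtained with different optimizers follow different paths to the global minimum, are there cases where the loss is allowed to increase and then decrease? Or must the training loss always decrease monotonically?
Because such networks usually have a great many parameters, you are dealing with a high-dimensional objective function. This function is typically very complex, with many local minima and saddle points, so it is hard to predict how training will proceed and what the loss curve will look like. If you escape a local minimum, the loss can indeed increase and then decrease further. Usually, at some point during training, the loss saturates.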
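The loss reported during mini-batch training does not have to decrease monotonically even when there are no local minima at all. As a hedged illustration, the sketch below runs plain mini-batch gradient descent on a convex toy problem (fitting a single weight in y = 3x + noise, with made-up data and hyperparameters). This is a simpler mechanism than escaping a local minimum, but it already produces a jagged curve that trends downward.

```python
import random

rng = random.Random(0)

# Synthetic regression data (made up): y = 3x + noise.
xs = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
ys = [3.0 * x + rng.gauss(0.0, 0.5) for x in xs]

w = 0.0              # the single trainable weight
learning_rate = 0.05
batch_size = 20
losses = []          # loss of each mini-batch, measured before its update

for step in range(400):
    idx = rng.sample(range(len(xs)), batch_size)
    errors = [w * xs[i] - ys[i] for i in idx]
    # Mean squared error on this mini-batch and its gradient w.r.t. w.
    loss = sum(e * e for e in errors) / batch_size
    grad = sum(2.0 * e * xs[i] for e, i in zip(errors, idx)) / batch_size
    w -= learning_rate * grad
    losses.append(loss)

# Count steps whose mini-batch loss is higher than the previous step's.
increases = sum(1 for prev, cur in zip(losses, losses[1:]) if cur > prev)
early = sum(losses[:50]) / 50
late = sum(losses[-50:]) / 50
print("steps where the batch loss went up:", increases, "of", len(losses) - 1)
print("mean loss, first 50 steps:", round(early, 3))
print("mean loss, last 50 steps: ", round(late, 3))
print("learned weight:", round(w, 3))
```

What matters in practice is the trend of a smoothed curve (for example TensorBoard's smoothing slider) and the validation loss, not whether every individual step goes down.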