TensorFlow for XOR is not predicting correctly after 500 epochs

【Question title】: TensorFlow for XOR is not predicting correctly after 500 epochs
【Posted】: 2020-11-04 14:22:33
【Question】:

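I am trying to implement a neural network in TensorFlow to solve the XOR problem. I chose sigmoid as the activation function, the shape (2, 2, 1), and optimizer=SGD(). I chose batch_size=1 because the universe of the problem is only 4 samples, so it is really small. The problem is that the predictions are not even close to the right answers. What am I doing wrong?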

I am running this on Google Colab, and the TensorFlow version is 2.3.0.

import tensorflow as tf
import numpy as np



x = np.array([[0, 0],
              [1, 1],
              [1, 0],
              [0, 1]],  dtype=np.float32)

y = np.array([[0], 
              [0], 
              [1], 
              [1]],     dtype=np.float32)



model =  tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(2,)))
model.add(tf.keras.layers.Dense(2, activation=tf.keras.activations.sigmoid))
model.add(tf.keras.layers.Dense(2, activation=tf.keras.activations.sigmoid))
model.add(tf.keras.layers.Dense(1, activation=tf.keras.activations.sigmoid))

model.compile(optimizer=tf.keras.optimizers.SGD(), 
              loss=tf.keras.losses.MeanSquaredError(), 
              metrics=['binary_accuracy'])

history = model.fit(x, y, batch_size=1, epochs=500, verbose=False)

print("Tensorflow version: ", tf.__version__)
predictions = model.predict_on_batch(x)
print(predictions)

Output:

Tensorflow version:  2.3.0
WARNING:tensorflow:10 out of the last 10 calls to <function Model.make_predict_function.<locals>.predict_function at 0x7f69f7a83a60> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/tutorials/customization/performance#python_or_tensor_args and https://www.tensorflow.org/api_docs/python/tf/function for  more details.
[[0.5090364 ]
[0.4890102 ]
[0.50011414]
[0.49678832]]

【Comments】:

Tags: python tensorflow machine-learning keras neural-network

【Solution 1】:

The problem lies in your learning rate and in the way the weights are optimized.

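Another factor to keep in mind when training is the size of the step we take in the direction of the gradient. If the step is too large, we may end up in the wrong place, jumping out of our local minimum. If it is too small, we never reach the minimum.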
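To make the step-size trade-off concrete, here is a minimal sketch that is not taken from the answer. It runs plain gradient descent on the toy function f(w) = w**2, a stand-in chosen only for illustration, with three fixed learning rates. The function name `gradient_descent` and the specific rates are assumptions.

```python
def gradient_descent(lr, w0=1.0, steps=50):
    """Minimise the toy function f(w) = w**2 with a fixed learning rate.

    Returns the value of w after `steps` updates.
    """
    w = w0
    for _ in range(steps):
        grad = 2.0 * w      # derivative of w**2
        w -= lr * grad      # fixed-size step against the gradient
    return w


# Too small: w barely moves. Moderate: w approaches the minimum at 0.
# Too large: every step overshoots and |w| grows.
for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: w after 50 steps = {gradient_descent(lr):.6g}")
```

On this toy problem each update multiplies w by (1 - 2 * lr), so any rate above 1.0 makes the iterate grow instead of shrink. Real loss surfaces are less regular, but the same overshoot and crawl behaviours appear.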

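By default, stochastic gradient descent (SGD) in Keras uses a learning rate of 0.01, and this learning rate stays fixed during training. If you inspect your training, the loss moves toward the global minimum too slowly, or sometimes jumps to higher values. For your specific problem it is hard to reach the minimum with a fixed learning rate, because it does not take the landscape of the loss function into account.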

For example, using Adam as the optimizer with learning_rate = 0.02, I was able to reach an accuracy of 1:

import tensorflow as tf
import numpy as np

x = np.array([[0, 0],
              [1, 1],
              [1, 0],
              [0, 1]],  dtype=np.float32)

y = np.array([[0], 
              [0], 
              [1], 
              [1]],     dtype=np.float32)

model =  tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(2,)))
model.add(tf.keras.layers.Dense(2, activation=tf.keras.activations.sigmoid))
model.add(tf.keras.layers.Dense(2, activation=tf.keras.activations.sigmoid))
model.add(tf.keras.layers.Dense(1, activation=tf.keras.activations.sigmoid))

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.02), # learning rate was 0.001 prior to this change
              loss=tf.keras.losses.MeanSquaredError(), 
              metrics=['mse', 'binary_accuracy'])
model.summary()

# Train first, then predict with the trained weights.
history = model.fit(x, y, batch_size=1, epochs=500)

print("Tensorflow version: ", tf.__version__)
predictions = model.predict_on_batch(x)
print(predictions)

[[0.05162644]
[0.06670767]
[0.9240402 ]
[0.923379  ]]

I used Adam because it has an adaptive learning rate that adjusts during training according to how the training is going.
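As a rough illustration of what "adaptive" means here, the sketch below implements the textbook Adam update rule for a single scalar parameter. It is not the Keras implementation, which differs in small details. The helper name `adam_step` and the value `eps=1e-7` are assumptions; `lr=0.02` matches the answer, and the beta values are the commonly used defaults.

```python
import math


def adam_step(grad, m, v, t, lr=0.02, beta1=0.9, beta2=0.999, eps=1e-7):
    """One textbook Adam update for a scalar parameter.

    Returns (delta, m, v): the change to apply to the parameter and the
    updated first and second moment estimates. `t` is the 1-based step count.
    """
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    delta = -lr * m_hat / (math.sqrt(v_hat) + eps)
    return delta, m, v


# Compare the first Adam step with a plain SGD step for very different gradients.
for grad in (1000.0, 1.0, 0.001):
    delta, _, _ = adam_step(grad, m=0.0, v=0.0, t=1)
    print(f"grad={grad}: Adam step={delta:+.6f}, SGD step={-0.02 * grad:+.6f}")
```

Because the update divides by the running magnitude of the gradient, the step size stays close to the learning rate whether the raw gradient is large or tiny. With plain SGD the step scales directly with the gradient, which is part of why small sigmoid gradients make it crawl.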

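If you use a larger learning rate (0.1) but keep SGD, the training history shows the accuracy reaching 1 at some point and then immediately dropping back to lower values. That happens because the learning rate is fixed. Another strategy is to stop training as soon as SGD reaches that value, perhaps using a Keras callback.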
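One possible shape for such a callback is sketched below. The class name `StopAtAccuracy` and its parameters are hypothetical and not part of Keras. It relies only on the documented `tf.keras.callbacks.Callback` hooks and on the `stop_training` flag that `fit()` checks between epochs. Keras also ships `tf.keras.callbacks.EarlyStopping`, which stops when a monitored metric stops improving rather than when it crosses a threshold.

```python
import tensorflow as tf


class StopAtAccuracy(tf.keras.callbacks.Callback):
    """Hypothetical helper: end training once a logged metric reaches a target."""

    def __init__(self, monitor="binary_accuracy", target=1.0):
        super().__init__()
        self.monitor = monitor   # must match a metric name passed to compile()
        self.target = target

    def on_epoch_end(self, epoch, logs=None):
        value = (logs or {}).get(self.monitor)
        if value is not None and value >= self.target:
            # fit() checks this flag after each epoch and exits the loop early.
            self.model.stop_training = True
```

It would be passed to training as `model.fit(x, y, batch_size=1, epochs=500, callbacks=[StopAtAccuracy()])`, assuming the model was compiled with `metrics=['binary_accuracy']`. Note that the logged accuracy is averaged over the batches of the epoch, so it is worth re-checking the predictions after training stops.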

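Don't forget to tune the learning rate and pick the right optimizer. Both are essential for fast training and for reaching a good minimum.

Also consider changing the network architecture (adding nodes, and using other activation functions such as ReLU for the hidden layers).

Here are some useful details on how to handle the learning rate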

【Comments】:

【Solution 2】:

You are right. I changed the learning rate a bit, to 0.5, still using SGD, with epochs=10000. I got the following output:

Tensorflow version:  2.3.0
[[0.00407344]
[0.00893608]
[0.9912169 ]
[0.99120843]]

Another training run:

[[0.0097596 ]
[0.00862199]
[0.99391216]
[0.98597133]]

Of course epochs=10000 is not what I would call fast training, but I am learning this stuff and I got what I wanted.

【Comments】:

【Solution 3】:

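I took the liberty of making a Google Colab notebook from some of your code and found that your model suffers from the so-called Vanishing Gradient Problem. I solved this by initializing the hidden layer's kernel values with a constant 0.5, as follows: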

model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(2,)))
model.add(tf.keras.layers.Dense(2, activation=tf.keras.activations.sigmoid, kernel_initializer=tf.initializers.Constant(0.5)))
model.add(tf.keras.layers.Dense(1, activation=tf.keras.activations.sigmoid))

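This minimizes the chance of the vanishing gradient problem occurring: the value 0.5 is equidistant from 0.0 and 1.0, the values at which the gradient of the sigmoid activation function vanishes.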
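For readers who want to see where the sigmoid's gradient vanishes, here is a small standalone sketch that is not from the notebook. It evaluates the sigmoid and its derivative s * (1 - s) at a few inputs. The function names are illustrative.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid, written in terms of its output s."""
    s = sigmoid(z)
    return s * (1.0 - s)


# The derivative peaks where the output is 0.5 and shrinks toward zero
# as the output saturates near 0.0 or 1.0.
for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"z={z:+.1f}: output={sigmoid(z):.5f}, gradient={sigmoid_grad(z):.5f}")

# Upper bound on the product of activation derivatives across three
# stacked sigmoid layers (weights not included).
max_chain = 0.25 ** 3
print(f"max product over 3 sigmoid layers: {max_chain}")
```

The derivative never exceeds 0.25, so each additional sigmoid layer can only shrink the signal that backpropagation passes to earlier layers. That is consistent with the two-hidden-layer network in the question learning more slowly than the single-hidden-layer model shown here.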

I also found that the Adam optimizer performs better than the SGD optimizer, as pointed out in the accepted answer.

You can access my Colab notebook here:

https://colab.research.google.com/drive/1gv-z-C9TpKAsnAyBYLmwRvk6NaAtlr_C?usp=sharing

【Comments】: