Why is this TensorFlow implementation vastly less successful than Matlab's NN? Answers

【Question title】: Why is this TensorFlow implementation vastly less successful than Matlab's NN?
【Posted】: 2016-02-16 16:21:37
【Question】:

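As a toy example, I am trying to fit the function f(x) = 1/x from 100 noise-free data points. The Matlab default implementation is phenomenally successful, with a mean squared difference of about 10^-10, and it interpolates perfectly.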

I implemented a neural network with one hidden layer of 10 sigmoid neurons. I am a beginner at neural networks, so be on your guard against dumb code.

import tensorflow as tf
import numpy as np

def weight_variable(shape):
  initial = tf.truncated_normal(shape, stddev=0.1)
  return tf.Variable(initial)

def bias_variable(shape):
  initial = tf.constant(0.1, shape=shape)
  return tf.Variable(initial)

#Can't make tensorflow consume ordinary lists unless they're parsed to ndarray
def toNd(lst):
    lgt = len(lst)
    x = np.zeros((1, lgt), dtype='float32')
    for i in range(0, lgt):
        x[0,i] = lst[i]
    return x

xBasic = np.linspace(0.2, 0.8, 101)
xTrain = toNd(xBasic)
yTrain = toNd(map(lambda x: 1/x, xBasic))

x = tf.placeholder("float", [1,None])
hiddenDim = 10

b = bias_variable([hiddenDim,1])
W = weight_variable([hiddenDim, 1])

b2 = bias_variable([1])
W2 = weight_variable([1, hiddenDim])

hidden = tf.nn.sigmoid(tf.matmul(W, x) + b)
y = tf.matmul(W2, hidden) + b2

# Minimize the squared errors.
loss = tf.reduce_mean(tf.square(y - yTrain))
optimizer = tf.train.GradientDescentOptimizer(0.5)
train = optimizer.minimize(loss)

# For initializing the variables.
init = tf.initialize_all_variables()

# Launch the graph
sess = tf.Session()
sess.run(init)

for step in xrange(0, 4001):
    train.run({x: xTrain}, sess)
    if step % 500 == 0:
        print loss.eval({x: xTrain}, sess)

The mean squared difference ends at ~2*10^-3, about seven orders of magnitude worse than Matlab. Visualising with

xTest = np.linspace(0.2, 0.8, 1001)
yTest = y.eval({x:toNd(xTest)}, sess)  
import matplotlib.pyplot as plt
plt.plot(xTest,yTest.transpose().tolist())
plt.plot(xTest,map(lambda x: 1/x, xTest))
plt.show()

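we can see that the fit is systematically imperfect, while the Matlab fit looks perfect to the naked eye, with the differences uniform. I have tried to replicate the diagram of the Matlab network with TensorFlow: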

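Incidentally, the diagram seems to imply a tanh rather than a sigmoid activation function. I cannot find it anywhere in the documentation to be sure. However, when I try to use tanh neurons in TensorFlow, the fitting quickly fails and the variables become nan. I do not know why.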
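One plausible explanation, suggested later in the comments on the first answer, is that a learning rate of 0.5 is simply too large for plain gradient descent once the activation changes. The sketch below is not the network itself. It uses a one-dimensional quadratic as a stand-in (the curvature value is a hypothetical choice) to show how the same update rule either shrinks or blows up depending only on the step size. Repeated blow-up of this kind is what eventually overflows to inf and nan in floating point.

```python
def gradient_descent(curvature, learning_rate, steps=50, w0=1.0):
    """Run plain gradient descent on f(w) = 0.5 * curvature * w**2."""
    w = w0
    for _ in range(steps):
        grad = curvature * w          # f'(w)
        w = w - learning_rate * grad  # multiplies w by (1 - learning_rate * curvature)
    return w

# Hypothetical curvature, chosen so that 0.5 is too large a step but 0.1 is fine.
CURVATURE = 5.0

w_large_lr = gradient_descent(CURVATURE, learning_rate=0.5)  # factor -1.5 per step
w_small_lr = gradient_descent(CURVATURE, learning_rate=0.1)  # factor +0.5 per step

print("lr=0.5 ->", abs(w_large_lr))  # grows geometrically
print("lr=0.1 ->", abs(w_small_lr))  # shrinks toward zero
```

For a quadratic, the iteration is stable only when the learning rate is below 2 / curvature. A real network has no single curvature, which is why the rate has to be tuned per problem.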

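Matlab uses the Levenberg–Marquardt training algorithm. Bayesian regularization is even more successful, with a mean squared error of about 10^-12 (we are probably in the realm of floating-point arithmetic).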

Why is the TensorFlow implementation so much worse, and what can I do to make it better?

【Question comments】:

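I haven't looked into TensorFlow, sorry about that, but you are doing some odd things with that toNd function. np.linspace already returns an ndarray, not a list. If you want to convert a list to an ndarray, all you need is np.array(my_list), and if you just need the extra axis, you can do new_array = my_array[np.newaxis, :]. It may simply be stopping short of zero error because it is supposed to. Most data has noise, and you don't necessarily want it to train to zero error. Judging from "reduce_mean", it may be using cross-validation.
@AdamAcosta toNd is definitely a stopgap for my lack of experience. I tried np.array before, and the problem seems to be that np.array([5,7]).shape is (2,) rather than (2,1). my_array[np.newaxis, :] seems to correct this, thanks! I don't use Python; I use F# day to day.
@AdamAcosta I don't think reduce_mean does cross-validation. From the docs: Computes the mean of elements across dimensions of a tensor. Matlab does cross-validation, which, as I see it, should reduce the fit on the training sample compared with no cross-validation. Is that right?
Yes, cross-validation generally prevents a perfect fit. Sorry I don't have a real answer. Knowledge of TensorFlow is still scarce. I have seen a lot of questions about it recently, but not many answers. Udacity is developing a course on it as part of their new Machine Learning Engineer nanodegree. I swear I don't work for Udacity, but it may be worth looking into!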
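The shape issue discussed in these comments can be checked directly in NumPy. The short sketch below (variable names are illustrative, not from the question) shows that np.linspace already returns an ndarray, that np.newaxis in the first position gives a row of shape (1, n), and that a column of shape (n, 1) needs np.newaxis in the second position instead.

```python
import numpy as np

values = np.linspace(0.2, 0.8, 101)        # already an ndarray, shape (101,)

row = values[np.newaxis, :]                # leading axis added -> shape (1, 101)
row_alt = values.reshape(1, -1)            # equivalent reshape
column = np.array([5, 7])[:, np.newaxis]   # trailing axis added -> shape (2, 1)

# Elementwise arithmetic keeps the shape, so no helper loop is needed.
targets = (1.0 / row).astype(np.float32)

print(values.shape, row.shape, column.shape, targets.shape)
```

With a row-shaped array like this, the toNd helper in the question becomes unnecessary.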

Tags: python matlab neural-network tensorflow

【Solution 1】:

I tried training for 50000 iterations and it reached an error of 0.00012. It takes about 180 seconds on a Tesla K40.

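It seems that first-order gradient descent is not a good fit for this kind of problem (pun intended), and that you need Levenberg–Marquardt or L-BFGS. I don't think anyone has implemented them in TensorFlow yet.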

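Edit: Use tf.train.AdamOptimizer(0.1) for this problem. It reaches 3.13729e-05 after 4000 iterations. Also, using the GPU with the default strategy seems like a bad idea for this problem. There are many small operations, and the overhead makes the GPU version run 3x slower than the CPU version on my machine.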
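To make the difference from plain gradient descent concrete, here is a minimal sketch of the Adam update rule on a single scalar parameter, using the commonly cited default moment decay rates (0.9 and 0.999). It is an illustration of the rule under those assumptions, not TensorFlow's implementation, and the quadratic test function is a hypothetical stand-in for the network's loss.

```python
import math

def adam_minimize(grad_fn, w0, learning_rate=0.1, steps=2000,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimise a scalar function with Adam, given its gradient function."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1.0 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1.0 - beta1 ** t)           # bias correction for the zero start
        v_hat = v / (1.0 - beta2 ** t)
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Stand-in objective f(w) = (w - 3)**2, whose gradient is 2 * (w - 3).
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_final)
```

Because each step is divided by the running root-mean-square of the gradient, the step size adapts to the gradient scale, which makes the optimizer less sensitive to the nominal learning rate than GradientDescentOptimizer.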

【Comments】:

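Thanks for looking into this. Do you mean 5000 of my loops, so 20M basic training runs? Can you confirm that it fails when the hidden layer is changed to tanh neurons, and if so, do you know why that happens?
I just changed your xrange(4001) to xrange(5000). For tanh, it looks like training diverges with a learning rate of 0.5. In general, for gradient descent you need to tune the learning rate for each problem. It seems to work if I use tf.train.GradientDescentOptimizer(0.1).
I understand about the gradient-descent parameter. It is strange that xrange(0, 5000) gives you an order of magnitude more precision than the 4k range, and that it takes 180 seconds on a GPU. I ran the same range on a CPU, with unchanged precision, in less than 10 seconds.
Oops, typo: 50000, not 5000.
Also, changing your data type from float32 to float64 and adjusting the AdamOptimizer to use an exponentially decaying learning rate, starting at 0.2 with a decay factor of 0.9999, gets 1.44e-05 after 4000 training steps: step = tf.Variable(0, trainable=False); rate = tf.train.exponential_decay(0.2, step, 1, 0.9999); optimizer = tf.train.AdamOptimizer(rate); train = optimizer.minimize(loss, global_step=step)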
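For reference, the schedule produced by tf.train.exponential_decay without the staircase option is initial_rate * decay_rate ** (step / decay_steps). The sketch below reimplements that formula in plain Python (the function name is mine, not a TensorFlow API) to show how far the rate has dropped for the settings mentioned in this thread.

```python
def exponential_decay(initial_rate, step, decay_steps, decay_rate):
    """Non-staircase schedule: initial_rate * decay_rate ** (step / decay_steps)."""
    return initial_rate * decay_rate ** (float(step) / decay_steps)

# Settings from the comment above: start at 0.2, decay 0.9999 per step.
rate_at_4000 = exponential_decay(0.2, 4000, 1, 0.9999)    # about 0.134

# Settings from the cleaned-up version in Solution 2: start at 0.15.
rate_at_40000 = exponential_decay(0.15, 40000, 1, 0.9999)  # about 0.0027

print(rate_at_4000, rate_at_40000)
```

Over 4000 steps the rate only falls by about a third, so the decay mainly matters in the longer 40k-step run, where the final steps are roughly 50 times smaller than the first ones.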

【Solution 2】:

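By the way, here is a slightly cleaned-up version of the above that tidies up some of the shape issues and the back-and-forth between tf and np. It reaches 3e-08 after 40k steps, or about 1.5e-5 after 4000 steps: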

import tensorflow as tf
import numpy as np

def weight_variable(shape):
  initial = tf.truncated_normal(shape, stddev=0.1)
  return tf.Variable(initial)

def bias_variable(shape):
  initial = tf.constant(0.1, shape=shape)
  return tf.Variable(initial)

xTrain = np.linspace(0.2, 0.8, 101).reshape([1, -1])
yTrain = (1/xTrain)

x = tf.placeholder(tf.float32, [1,None])
hiddenDim = 10

b = bias_variable([hiddenDim,1])
W = weight_variable([hiddenDim, 1])

b2 = bias_variable([1])
W2 = weight_variable([1, hiddenDim])

hidden = tf.nn.sigmoid(tf.matmul(W, x) + b)
y = tf.matmul(W2, hidden) + b2

# Minimize the squared errors.                                                                
loss = tf.reduce_mean(tf.square(y - yTrain))
step = tf.Variable(0, trainable=False)
rate = tf.train.exponential_decay(0.15, step, 1, 0.9999)
optimizer = tf.train.AdamOptimizer(rate)
train = optimizer.minimize(loss, global_step=step)
init = tf.initialize_all_variables()

# Launch the graph                                                                            
sess = tf.Session()
sess.run(init)

for step in xrange(0, 40001):
    train.run({x: xTrain}, sess)
    if step % 500 == 0:
        print loss.eval({x: xTrain}, sess)

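All that said, it is probably not too surprising that LMA does better than a more general DNN-style optimizer at fitting a 2D curve. Adam and the rest target very high-dimensional problems, and LMA starts to get glacially slow for very large networks (see 12-15).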
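To illustrate why LMA suits a problem this small, here is a minimal NumPy sketch of a Levenberg–Marquardt loop for the same 1-10-1 architecture. Several things are my assumptions rather than anything taken from Matlab or from this thread: the tanh hidden layer, the random initialisation, the factor-of-10 damping schedule, and all function names. The point is the structure of the update, not the exact error it reaches.

```python
import numpy as np

HIDDEN = 10  # same hidden width as the network in the question

def unpack(p):
    """Split the flat parameter vector into w1, b1, w2 and the scalar b2."""
    return p[:HIDDEN], p[HIDDEN:2 * HIDDEN], p[2 * HIDDEN:3 * HIDDEN], p[-1]

def residuals_and_jacobian(p, x, t):
    w1, b1, w2, b2 = unpack(p)
    h = np.tanh(np.outer(x, w1) + b1)     # hidden activations, shape (n, HIDDEN)
    r = h.dot(w2) + b2 - t                # residuals, shape (n,)
    dh = (1.0 - h ** 2) * w2              # d(output)/d(pre-activation)
    # Columns ordered like the parameter vector: w1, b1, w2, b2.
    J = np.hstack([dh * x[:, None], dh, h, np.ones((x.size, 1))])
    return r, J

def levenberg_marquardt(p, x, t, iterations=300, damping=1e-2):
    r, J = residuals_and_jacobian(p, x, t)
    loss = np.mean(r ** 2)
    for _ in range(iterations):
        # Damped Gauss-Newton step: minimise |J d + r|^2 + damping * |d|^2.
        A = np.vstack([J, np.sqrt(damping) * np.eye(p.size)])
        b = np.concatenate([-r, np.zeros(p.size)])
        step = np.linalg.lstsq(A, b, rcond=None)[0]
        r_new, J_new = residuals_and_jacobian(p + step, x, t)
        loss_new = np.mean(r_new ** 2)
        if loss_new < loss:   # accept: move toward pure Gauss-Newton
            p, r, J, loss = p + step, r_new, J_new, loss_new
            damping = max(damping / 10.0, 1e-12)
        else:                 # reject: move toward small gradient-like steps
            damping = min(damping * 10.0, 1e12)
    return p, loss

rng = np.random.RandomState(0)            # fixed seed, an arbitrary choice
x_train = np.linspace(0.2, 0.8, 101)
t_train = 1.0 / x_train
p0 = rng.normal(scale=1.0, size=3 * HIDDEN + 1)
initial_mse = np.mean(residuals_and_jacobian(p0, x_train, t_train)[0] ** 2)
p_fit, final_mse = levenberg_marquardt(p0, x_train, t_train)
print("initial MSE:", initial_mse, "final MSE:", final_mse)
```

Each iteration solves a least-squares problem whose size grows with both the number of parameters and the number of data points. That is cheap for 31 parameters and 101 points, and it is also the reason the method scales poorly to very large networks.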

【Comments】: