Checking Neural Network Gradient with Finite Difference Methods Doesn't Work

【Question title】: Checking Neural Network Gradient with Finite Difference Methods Doesn't Work
【Posted】: 2021-02-22 17:11:18
【Question】:

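After a full week of print statements, dimensional analysis, refactoring, and talking through the code out loud, I can say I am completely stuck.

The gradients my cost function produces are too far from the ones produced by finite differences.

I have confirmed that my cost function produces the correct cost both with and without regularization. Here is the cost function: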

import numpy as np
import pandas as pd

def nnCost(nn_params, X, y, lambda_, input_layer_size, hidden_layer_size, num_labels):
  # reshape parameter/weight vectors to suit network size
  Theta1 = np.reshape(nn_params[:hidden_layer_size * (input_layer_size + 1)], (hidden_layer_size, (input_layer_size + 1)))
  Theta2 = np.reshape(nn_params[(hidden_layer_size * (input_layer_size+1)):], (num_labels, (hidden_layer_size + 1)))

  if lambda_ is None:
    lambda_ = 0

  # grab number of observations
  m = X.shape[0]
  
  # init variables we must return
  cost = 0
  Theta1_grad = np.zeros(Theta1.shape)
  Theta2_grad = np.zeros(Theta2.shape)

  # one-hot encode the vector y
  y_mtx = pd.get_dummies(y.ravel()).to_numpy() 

  ones = np.ones((m, 1))
  X = np.hstack((ones, X))
  
  # layer 1
  a1 = X
  z2 = Theta1@a1.T
  # layer 2
  ones_l2 = np.ones((y.shape[0], 1))
  a2 = np.hstack((ones_l2, sigmoid(z2.T)))
  z3 = Theta2@a2.T
  # layer 3
  a3 = sigmoid(z3)

  # regularize every weight except the bias columns, matching the gradient terms below
  reg_term = (lambda_/(2*m)) * (np.sum(Theta1[:,1:]**2) + np.sum(Theta2[:,1:]**2))
  cost = (1/m) * np.sum((-np.log(a3).T * (y_mtx) - np.log(1-a3).T * (1-y_mtx))) + reg_term
  
  # BACKPROPAGATION
  # δ3 equals the difference between a3 and the y_matrix
  d3 = a3 - y_mtx.T
  # δ2 equals the product of δ3 and Θ2 (ignoring the Θ2 bias units) multiplied element-wise by the g′() of z2 (computed back in Step 2).
  d2 = Theta2[:,1:].T@d3 * sigmoidGradient(z2)
  # Δ1 equals the product of δ2 and a1.
  Delta1 = d2@a1
  Delta1 /= m
  # Δ2 equals the product of δ3 and a2.
  Delta2 = d3@a2
  Delta2 /= m
  
  reg_term1 = (lambda_/m) * np.append(np.zeros((Theta1.shape[0],1)), Theta1[:,1:], axis=1)
  reg_term2 = (lambda_/m) * np.append(np.zeros((Theta2.shape[0],1)), Theta2[:,1:], axis=1)
  
  Theta1_grad = Delta1 + reg_term1
  Theta2_grad = Delta2 + reg_term2
  
  grad = np.append(Theta1_grad.ravel(), Theta2_grad.ravel())
  
  return cost, grad
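
nnCost calls sigmoid and sigmoidGradient, which are not shown in the question. A minimal sketch, assuming they follow the usual definitions (the logistic function and its derivative g'(z) = g(z)(1 - g(z))); the bodies below are an assumption, not the asker's actual helpers:

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied element-wise (assumed definition)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoidGradient(z):
    """Derivative of the logistic function: g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)
```

Both accept scalars or arrays, so they can be applied directly to z2 and z3 in nnCost.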

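Here is the code that checks the gradients. I have gone over every line and cannot think of anything to change here. It appears to be in working order.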

def checkNNGradients(lambda_):
  """
  Creates a small neural network to check the backpropagation gradients. 
  Credit: Based on the MATLAB code provided by Dr. Andrew Ng, Stanford Univ.
  
  Input: Regularization parameter, lambda, as int or float.
  
  Output: Analytical gradients produced by backprop code and the numerical gradients (computed
  using computeNumericalGradient). These two gradient computations should result in 
  very similar values. 
  """

  input_layer_size = 3
  hidden_layer_size = 5
  num_labels = 3
  m = 5

  # generate 'random' test data
  Theta1 = debugInitializeWeights(hidden_layer_size, input_layer_size)
  Theta2 = debugInitializeWeights(num_labels, hidden_layer_size)

  # reusing debugInitializeWeights to generate X
  X  = debugInitializeWeights(m, input_layer_size - 1)
  y  = np.ones(m) + np.remainder(np.arange(m), num_labels)


  # unroll parameters
  nn_params = np.append(Theta1.ravel(), Theta2.ravel())
  costFunc = lambda p: nnCost(p, X, y, lambda_, input_layer_size, hidden_layer_size, num_labels)
    
  cost, grad = costFunc(nn_params)
    
  numgrad = computeNumericalGradient(costFunc, nn_params)

  # examine the two gradient computations; two columns should be very similar. 
  print('The columns below should be very similar.\n')
   
  # Credit: http://stackoverflow.com/a/27663954/583834
  print('{:<25}{}'.format('Numerical Gradient', 'Analytical Gradient'))
  for numerical, analytical in zip(numgrad, grad):
    print('{:<25}{}'.format(numerical, analytical))


  # If you have a correct implementation, and assuming you used EPSILON = 0.0001 
  # in computeNumericalGradient.m, then diff below should be less than 1e-9
  diff = np.linalg.norm(numgrad-grad)/np.linalg.norm(numgrad+grad)
  print(diff)
  print("\n")
  print('If your backpropagation implementation is correct, then \n' \
          'the relative difference will be small (less than 1e-9). \n' \
          '\nRelative Difference: {:.10f}'.format(diff))

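The check function generates its own data with the debugInitializeWeights function (so this is a reproducible example; just run it and it will call the other functions), then calls the function that computes the gradient with finite differences. Both are below.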

def debugInitializeWeights(fan_out, fan_in):
  """
  Initializes the weights of a layer with fan_in
  incoming connections and fan_out outgoing connections using a fixed
  strategy.

  Input: fan_out, number of outgoing connections for a layer as int; fan_in, number
  of incoming connections for the same layer as int. 
  
  Output: Weight matrix, W, of shape (fan_out, 1 + fan_in); the first column of W handles the "bias" terms
  """
  W = np.zeros((fan_out, 1 + fan_in))
  # Initialize W using "sin", this ensures that the values in W are of similar scale;
  # this will be useful for debugging
  W = np.sin(range(1, np.size(W)+1)) / 10 
  return W.reshape(fan_out, fan_in+1)

def computeNumericalGradient(J, nn_params):
  """
  Computes the gradient using "finite differences"
  and provides a numerical estimate of the gradient (i.e.,
  gradient of the function J around theta).
  Credit: Based on the MATLAB code provided by Dr. Andrew Ng, Stanford Univ. 

  Inputs: Cost, J, as computed by nnCost function; Parameter vector, theta.

  Output: Gradient vector using finite differences. Per Dr. Ng, 
  'Sets numgrad(i) to (a numerical approximation of) the partial derivative of 
  J with respect to the i-th input argument, evaluated at theta. (i.e., numgrad(i) should 
  be the (approximately) the partial derivative of J with respect
  to theta(i).)'          
  """
  numgrad = np.zeros(nn_params.shape)
  perturb = np.zeros(nn_params.shape)
  e = .0001
  for i in range(np.size(nn_params)):
      # Set perturbation (i.e., noise) vector
      perturb[i] = e
      # run cost fxn w/ noise added to and subtracted from parameters theta in nn_params
      cost1, grad1 = J((nn_params - perturb))
      cost2, grad2 = J((nn_params + perturb))
      # record the difference in cost function outputs; this is the numerical gradient
      numgrad[i] = (cost2 - cost1) / (2*e)
      perturb[i] = 0

  return numgrad

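The code is not for a course. That MOOC is in MATLAB, and it has ended. This is for me. Other solutions exist on the web; looking at them has proved fruitless. Everyone takes a different (inscrutable) approach. So I badly need help, or a miracle.

Edit/Update: Using Fortran ordering when raveling the vectors affects the results, but I have not been able to get the gradients to move together by changing that option.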
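
To see why ordering matters when unrolling and re-rolling weights, here is a small sketch with a made-up 2x3 matrix (not the network's weights). The point it illustrates: ravel and reshape must use the same order, otherwise the entries land in the wrong positions without any error being raised.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])

c_flat = A.ravel()            # C (row-major) order, NumPy's default
f_flat = A.ravel(order='F')   # Fortran (column-major) order, like MATLAB's A(:)

# Round trips succeed only when both steps use the same order.
same_c = np.array_equal(c_flat.reshape(A.shape), A)
same_f = np.array_equal(f_flat.reshape(A.shape, order='F'), A)

# Mixing orders silently scrambles the entries instead of failing.
mixed = f_flat.reshape(A.shape)

print(c_flat)          # [1 2 3 4 5 6]
print(f_flat)          # [1 4 2 5 3 6]
print(same_c, same_f)  # True True
print(mixed.tolist())  # [[1, 4, 2], [5, 3, 6]]
```

Either order works for gradient checking, as long as the unrolling of the parameters, the reshaping inside the cost function, and the unrolling of the gradient all agree.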

【Comments】:

Tags: python machine-learning neural-network gradient

【Answer 1】:

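One thought: I think your perturbation is a bit large at 1e-4. For double-precision floats it should be more like 1e-8, i.e. the square root of machine precision (or are you using single precision?!).

That said, finite differences can be a very poor approximation to the true derivative. Specifically, floating-point computations in numpy are not deterministic, as you seem to have discovered. In some cases, noise in the evaluation can cancel many significant digits. What values are you seeing, and what do you expect?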
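
The trade-off between truncation error and round-off error can be seen on a function whose derivative is known. The sketch below uses sin(x) at x = 1 as a stand-in (not the network's cost) and compares a forward difference with the central difference that computeNumericalGradient uses. As a commonly cited rule of thumb, the step that balances the two error sources is near the square root of machine epsilon for forward differences and near its cube root (roughly 6e-6 in double precision) for central differences, so the helper names and step values here are only illustrative:

```python
import numpy as np

def forward_diff(f, x, h):
    """One-sided estimate; truncation error shrinks like h."""
    return (f(x + h) - f(x)) / h

def central_diff(f, x, h):
    """Two-sided estimate; truncation error shrinks like h**2."""
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 1.0
exact = np.cos(x0)  # d/dx sin(x) = cos(x)

for h in (1e-2, 1e-4, 1e-6, 1e-8, 1e-10):
    err_fwd = abs(forward_diff(np.sin, x0, h) - exact)
    err_cen = abs(central_diff(np.sin, x0, h) - exact)
    print(f"h={h:.0e}  forward error={err_fwd:.2e}  central error={err_cen:.2e}")
```

Running it shows the error falling as h shrinks and then rising again once round-off dominates, which is why a much smaller step is not automatically better.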

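【Comments】:

First, thank you for replying. You will get the reward for reaching out and for saying not to expect too much from finite difference methods. Second, a few things were wrong, including when to use order='F', ravel() vs. flatten(), and an alternative formula that is actually better suited to comparing analytical and numerical gradients. I will post all of these later.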

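【Answer 2】:

Everything below went into solving my problem. For those trying to translate MATLAB code into Python, whether or not it comes from Andrew Ng's Coursera Machine Learning course, these are things everyone should know.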

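MATLAB does everything in Fortran (column-major) order; NumPy defaults to C (row-major) order. This affects how vectors are populated and therefore your results. If you want your answers to match what you did in MATLAB, you should always use Fortran order. See the docs.
Getting a vector in Fortran order is as easy as passing order='F' as an argument to .reshape(), .ravel(), or .flatten(). If you are using .ravel() on a 2-D array, however, you can achieve the same thing by transposing it first and then applying .ravel(), as in X.T.ravel().
Speaking of .ravel(): .ravel() and .flatten() do different things (.ravel() returns a view when it can, while .flatten() always returns a copy) and may have different use cases. For example, .flatten() is preferred for SciPy optimization methods. So if your fminunc equivalent is not working, it may be because you forgot to .flatten() your response vector y. See this Q&A on StackOverflow and the docs on .ravel(), which link to .flatten(). More docs.
If you are translating code from a MATLAB live script into a Jupyter notebook or Google Colab, you have to police your namespace. On one occasion I found that the variable I thought was being passed was not actually the one being passed. Why? Jupyter and Colab notebooks accumulate a lot of global variables that you would not normally write.
There is a better function for evaluating the difference between numerical and analytical gradients: the relative error comparison np.abs(numerical-analytical)/(numerical+analytical). Read about it here in CS231. Also, consider the accepted post above.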
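
A minimal sketch of an element-wise relative-error check. It follows the form given in the CS231n notes, |a - n| / max(|a|, |n|), rather than dividing by the signed sum numerical + analytical, which can be zero or negative when the two gradients differ in sign. The function name, the sample values, and the tiny-denominator guard are illustrative assumptions:

```python
import numpy as np

def relative_error(numerical, analytical, tiny=1e-12):
    """Element-wise |n - a| / max(|n|, |a|); `tiny` avoids 0/0 (an assumption)."""
    numerical = np.asarray(numerical, dtype=float)
    analytical = np.asarray(analytical, dtype=float)
    denom = np.maximum(np.abs(numerical), np.abs(analytical))
    return np.abs(numerical - analytical) / np.maximum(denom, tiny)

# Made-up gradients: the third entry disagrees badly, the rest match.
numgrad = np.array([0.5, -0.25, 1e-3, 0.0])
grad = np.array([0.5 + 1e-9, -0.25, 2e-3, 0.0])

errs = relative_error(numgrad, grad)
print(errs)
print("worst element:", int(np.argmax(errs)))
```

Unlike a single norm ratio, the per-element view points at which weights disagree, which helps when only one layer or only the bias column is wrong.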

【Comments】: