Lecture 7: Training Neural Networks

Optimization

SGD

  • Cons
    1. Very slow progress along shallow dimensions, jitter along steep directions.
    2. Local minima or saddle points. Saddle points are much more common in high dimensions.
    3. Gradients come from minibatches, so they can be noisy!
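As a baseline, the vanilla SGD update is a single step along the negative minibatch gradient. A minimal NumPy sketch (the quadratic loss here is just an illustrative stand-in for a real minibatch loss):

```python
import numpy as np

def sgd_step(w, dw, learning_rate=1e-2):
    """Vanilla SGD: step along the negative (minibatch) gradient."""
    return w - learning_rate * dw

# Toy problem: minimize f(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
for _ in range(100):
    w = sgd_step(w, 2 * w, learning_rate=0.1)
```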

SGD + Momentum

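The momentum update can be sketched as follows (same toy quadratic; rho is the usual friction hyperparameter, commonly around 0.9):

```python
import numpy as np

def momentum_step(w, v, dw, learning_rate=1e-2, rho=0.9):
    """SGD + Momentum: accumulate a velocity v that is a decaying sum of past gradients."""
    v = rho * v - learning_rate * dw
    return w + v, v

w, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(300):
    w, v = momentum_step(w, v, 2 * w, learning_rate=0.05)
```

The velocity averages out the minibatch jitter and keeps moving through shallow regions and saddle points where the raw gradient is tiny.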

Nesterov Momentum

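Nesterov differs from plain momentum only in where the gradient is evaluated: at the "look-ahead" point rather than the current point. A sketch (grad_fn stands in for a real gradient computation):

```python
import numpy as np

def nesterov_step(w, v, grad_fn, learning_rate=1e-2, rho=0.9):
    """Nesterov momentum: evaluate the gradient at the look-ahead point w + rho * v."""
    v = rho * v - learning_rate * grad_fn(w + rho * v)
    return w + v, v

w, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(300):
    w, v = nesterov_step(w, v, lambda x: 2 * x, learning_rate=0.05)
```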

AdaGrad


  • The step size becomes smaller and smaller because grad_squared only ever increases.
  • The update shrinks most in dimensions whose gradients oscillate, which damps the jitter.
  • Not commonly used: slow, gets stuck easily.
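A sketch of the AdaGrad update on the same toy quadratic:

```python
import numpy as np

def adagrad_step(w, grad_squared, dw, learning_rate=1e-2, eps=1e-7):
    """AdaGrad: per-dimension step sizes from a running *sum* of squared gradients."""
    grad_squared = grad_squared + dw * dw   # never decays -> steps keep shrinking
    w = w - learning_rate * dw / (np.sqrt(grad_squared) + eps)
    return w, grad_squared

w, gs = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(500):
    w, gs = adagrad_step(w, gs, 2 * w, learning_rate=0.1)
```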

RMSProp


  • decay_rate: commonly 0.9 or 0.99
  • Solves AdaGrad's problem of slowing the movement down in all dimensions: RMSProp only slows it in dimensions whose gradients are large.
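The only change from AdaGrad is the leaky accumulation, sketched below:

```python
import numpy as np

def rmsprop_step(w, grad_squared, dw, learning_rate=1e-2, decay_rate=0.9, eps=1e-7):
    """RMSProp: leaky running average of squared gradients instead of a full sum."""
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dw * dw
    w = w - learning_rate * dw / (np.sqrt(grad_squared) + eps)
    return w, grad_squared

w, gs = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(500):
    w, gs = rmsprop_step(w, gs, 2 * w, learning_rate=0.01)
```

Because old squared gradients decay away, the effective step size no longer shrinks toward zero as it does in AdaGrad.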

Adam


  • Sort of like RMSProp with momentum.
  • Bias correction avoids taking a very large step in the first iterations, when the moment estimates m and v are still close to their zero initialization.
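Both ideas together, with the bias-correction terms made explicit (defaults beta1=0.9, beta2=0.999, learning rate 1e-3 are the common starting point):

```python
import numpy as np

def adam_step(w, m, v, t, dw, learning_rate=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: RMSProp-style second moment plus momentum-style first moment."""
    m = beta1 * m + (1 - beta1) * dw           # first moment (momentum)
    v = beta2 * v + (1 - beta2) * dw * dw      # second moment (RMSProp)
    m_hat = m / (1 - beta1 ** t)               # bias correction: m and v start at 0,
    v_hat = v / (1 - beta2 ** t)               # so early estimates are too small
    w = w - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 1001):                       # note: t starts at 1
    w, m, v = adam_step(w, m, v, t, 2 * w, learning_rate=0.01)
```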

Learning rate decay

  • Common with SGD, less common with Adam.
  • Plot the loss curve and decide whether decay is needed.
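A typical schedule is step decay, e.g. halving the learning rate every few epochs (the numbers below are illustrative):

```python
def step_decay(lr0, epoch, drop=0.5, epochs_per_drop=20):
    """Step decay: multiply the learning rate by `drop` every `epochs_per_drop` epochs."""
    return lr0 * drop ** (epoch // epochs_per_drop)
```

Other common choices are exponential decay, lr = lr0 * e^(-k*t), and 1/t decay, lr = lr0 / (1 + k*t).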

Second-order Optimization

  • Quasi-Newton methods (BFGS is the most popular): instead of inverting the Hessian (O(n^3)), build up an approximation to the inverse Hessian with rank-1 updates over time (O(n^2) each).
  • L-BFGS (Limited-memory BFGS): does not form/store the full inverse Hessian.

  • Usually works very well in the full-batch, deterministic setting: if you have a single deterministic f(x), L-BFGS will probably work very nicely.
  • Does not transfer well to the mini-batch setting and gives bad results. Adapting L-BFGS to large-scale, stochastic settings is an active area of research.
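For intuition, here is a plain Newton step on a toy quadratic. This is the full second-order update that BFGS/L-BFGS approximate, not L-BFGS itself:

```python
import numpy as np

def newton_step(w, grad, hessian):
    """Second-order update: w <- w - H^{-1} grad. Note there is no learning rate."""
    return w - np.linalg.solve(hessian, grad)

# For a quadratic f(w) = 0.5 w^T A w - b^T w, one Newton step lands on the minimum.
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
w0 = np.zeros(2)
w = newton_step(w0, A @ w0 - b, A)
```

The O(n^3) cost of the solve (and O(n^2) storage of H) is exactly what makes this impractical for networks with millions of parameters.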

Model Ensembles

  1. Train multiple independent models
  2. At test time average their results

Enjoy 2% extra performance
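A sketch of the test-time averaging (the probability vectors here are made up for illustration):

```python
import numpy as np

def ensemble_predict(prob_list):
    """Average class probabilities from several independently trained models, then argmax."""
    return int(np.mean(prob_list, axis=0).argmax())

# Two hypothetical models' softmax outputs over 3 classes for one example:
p1 = np.array([0.5, 0.3, 0.2])
p2 = np.array([0.1, 0.6, 0.3])
pred = ensemble_predict([p1, p2])  # averaged probs: [0.3, 0.45, 0.25]
```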

Tips and Tricks

  • Instead of training independent models, use multiple snapshots of a single model during training!

Loshchilov and Hutter, “SGDR: Stochastic gradient descent with restarts”, arXiv 2016
Huang et al, “Snapshot ensembles: train 1, get M for free”, ICLR 2017

Improve single-model performance

Regularization

  • Add a term to the loss, e.g. L2 regularization: L = (1/N) Σ_i L_i + λ R(W), with R(W) = Σ_k Σ_l W_{k,l}^2

  • Dropout (two explanations)

    • Forces the network to have a redundant representation; Prevents co-adaptation of features
    • Dropout is training a large ensemble of models (that share parameters).
  • Data augmentation

    • Horizontal Flips
    • Random crops and scales
    • Color Jitter
    • Translation
    • Rotation
    • Stretching
    • Shearing
    • Lens distortions
  • DropConnect

Wan et al, “Regularization of Neural Networks using DropConnect”, ICML 2013

  • Fractional Max Pooling

Graham, “Fractional Max Pooling”, arXiv 2014

  • Stochastic Depth

Huang et al, “Deep Networks with Stochastic Depth”, ECCV 2016
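The dropout regularizer listed above is usually implemented as "inverted dropout": rescale at train time so that test time is a plain forward pass. A sketch (p and the input shape are illustrative):

```python
import numpy as np

def dropout_forward(x, p=0.5, train=True):
    """Inverted dropout: zero each unit with probability p at train time and rescale
    survivors by 1/(1-p), so E[output] = input and test time is a no-op."""
    if not train:
        return x
    mask = (np.random.rand(*x.shape) >= p) / (1.0 - p)
    return x * mask
```

This "scale at train time" convention is what makes dropout free at test time; the alternative is to scale activations by (1-p) at test time instead.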
