Lecture 7: Training Neural Networks

Optimization

SGD

  • Cons
    1. Very slow progress along shallow dimensions, jitter along steep directions.
    2. Local minima or saddle points. Saddle points are much more common in high dimensions.
    3. Gradients come from minibatches, so they can be noisy!
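As a baseline, the vanilla SGD update is a single step along the negative minibatch gradient. A minimal NumPy sketch (the quadratic loss here is just an illustrative stand-in for a real minibatch loss):

```python
import numpy as np

def sgd_step(w, dw, learning_rate=1e-2):
    """Vanilla SGD: step along the negative (minibatch) gradient."""
    return w - learning_rate * dw

# Toy problem: minimize f(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
for _ in range(100):
    w = sgd_step(w, 2 * w, learning_rate=0.1)
```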

SGD + Momentum

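The momentum update can be sketched as follows (same toy quadratic; rho is the usual friction hyperparameter, commonly around 0.9):

```python
import numpy as np

def momentum_step(w, v, dw, learning_rate=1e-2, rho=0.9):
    """SGD + Momentum: accumulate a velocity v that is a decaying sum of past gradients."""
    v = rho * v - learning_rate * dw
    return w + v, v

w, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(300):
    w, v = momentum_step(w, v, 2 * w, learning_rate=0.05)
```

The velocity averages out the minibatch jitter and keeps moving through shallow regions and saddle points where the raw gradient is tiny.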

Nesterov Momentum

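Nesterov differs from plain momentum only in where the gradient is evaluated: at the "look-ahead" point rather than the current point. A sketch (grad_fn stands in for a real gradient computation):

```python
import numpy as np

def nesterov_step(w, v, grad_fn, learning_rate=1e-2, rho=0.9):
    """Nesterov momentum: evaluate the gradient at the look-ahead point w + rho * v."""
    v = rho * v - learning_rate * grad_fn(w + rho * v)
    return w + v, v

w, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(300):
    w, v = nesterov_step(w, v, lambda x: 2 * x, learning_rate=0.05)
```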

AdaGrad


  • The step size becomes smaller and smaller because grad_squared only ever increases.
  • The update shrinks most in dimensions whose gradients oscillate, which damps the jitter.
  • Not commonly used: slow, gets stuck easily.
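A sketch of the AdaGrad update on the same toy quadratic:

```python
import numpy as np

def adagrad_step(w, grad_squared, dw, learning_rate=1e-2, eps=1e-7):
    """AdaGrad: per-dimension step sizes from a running *sum* of squared gradients."""
    grad_squared = grad_squared + dw * dw   # never decays -> steps keep shrinking
    w = w - learning_rate * dw / (np.sqrt(grad_squared) + eps)
    return w, grad_squared

w, gs = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(500):
    w, gs = adagrad_step(w, gs, 2 * w, learning_rate=0.1)
```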

RMSProp


  • decay_rate: commonly 0.9 or 0.99
  • Solves AdaGrad's problem of slowing the movement down in all dimensions: RMSProp only slows it in dimensions whose gradients are large.
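The only change from AdaGrad is the leaky accumulation, sketched below:

```python
import numpy as np

def rmsprop_step(w, grad_squared, dw, learning_rate=1e-2, decay_rate=0.9, eps=1e-7):
    """RMSProp: leaky running average of squared gradients instead of a full sum."""
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dw * dw
    w = w - learning_rate * dw / (np.sqrt(grad_squared) + eps)
    return w, grad_squared

w, gs = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(500):
    w, gs = rmsprop_step(w, gs, 2 * w, learning_rate=0.01)
```

Because old squared gradients decay away, the effective step size no longer shrinks toward zero as it does in AdaGrad.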

Adam


  • Sort of like RMSProp with momentum.
  • Bias correction avoids taking a very large step in the first iterations, when the moment estimates m and v are still close to their zero initialization.
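Both ideas together, with the bias-correction terms made explicit (defaults beta1=0.9, beta2=0.999, learning rate 1e-3 are the common starting point):

```python
import numpy as np

def adam_step(w, m, v, t, dw, learning_rate=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: RMSProp-style second moment plus momentum-style first moment."""
    m = beta1 * m + (1 - beta1) * dw           # first moment (momentum)
    v = beta2 * v + (1 - beta2) * dw * dw      # second moment (RMSProp)
    m_hat = m / (1 - beta1 ** t)               # bias correction: m and v start at 0,
    v_hat = v / (1 - beta2 ** t)               # so early estimates are too small
    w = w - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 1001):                       # note: t starts at 1
    w, m, v = adam_step(w, m, v, t, 2 * w, learning_rate=0.01)
```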

Learning rate decay

  • Common with SGD, less common with Adam.
  • Plot the loss curve and decide whether decay is needed.
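A typical schedule is step decay, e.g. halving the learning rate every few epochs (the numbers below are illustrative):

```python
def step_decay(lr0, epoch, drop=0.5, epochs_per_drop=20):
    """Step decay: multiply the learning rate by `drop` every `epochs_per_drop` epochs."""
    return lr0 * drop ** (epoch // epochs_per_drop)
```

Other common choices are exponential decay, lr = lr0 * e^(-k*t), and 1/t decay, lr = lr0 / (1 + k*t).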

Second-order Optimization

  • Quasi-Newton methods (BFGS is the most popular): instead of inverting the Hessian (O(n^3)), build up an approximation to the inverse Hessian with rank-1 updates over time (O(n^2) each).
  • L-BFGS (Limited-memory BFGS): does not form/store the full inverse Hessian.

  • Usually works very well in the full-batch, deterministic setting: if you have a single deterministic f(x), L-BFGS will probably work very nicely.
  • Does not transfer well to the mini-batch setting and gives bad results. Adapting L-BFGS to large-scale, stochastic settings is an active area of research.
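For intuition, here is a plain Newton step on a toy quadratic. This is the full second-order update that BFGS/L-BFGS approximate, not L-BFGS itself:

```python
import numpy as np

def newton_step(w, grad, hessian):
    """Second-order update: w <- w - H^{-1} grad. Note there is no learning rate."""
    return w - np.linalg.solve(hessian, grad)

# For a quadratic f(w) = 0.5 w^T A w - b^T w, one Newton step lands on the minimum.
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
w0 = np.zeros(2)
w = newton_step(w0, A @ w0 - b, A)
```

The O(n^3) cost of the solve (and O(n^2) storage of H) is exactly what makes this impractical for networks with millions of parameters.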

Model Ensembles

  1. Train multiple independent models
  2. At test time average their results

Enjoy 2% extra performance
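A sketch of the test-time averaging (the probability vectors here are made up for illustration):

```python
import numpy as np

def ensemble_predict(prob_list):
    """Average class probabilities from several independently trained models, then argmax."""
    return int(np.mean(prob_list, axis=0).argmax())

# Two hypothetical models' softmax outputs over 3 classes for one example:
p1 = np.array([0.5, 0.3, 0.2])
p2 = np.array([0.1, 0.6, 0.3])
pred = ensemble_predict([p1, p2])  # averaged probs: [0.3, 0.45, 0.25]
```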

Tips and Tricks

  • Instead of training independent models, use multiple snapshots of a single model during training!

Loshchilov and Hutter, “SGDR: Stochastic gradient descent with restarts”, arXiv 2016
Huang et al, “Snapshot ensembles: train 1, get M for free”, ICLR 2017

Improve single-model performance

Regularization

  • Add a term to the loss, e.g. L2 regularization: L = (1/N) Σ_i L_i + λ R(W), with R(W) = Σ_k Σ_l W_{k,l}^2

  • Dropout (two explanations)

    • Forces the network to have a redundant representation; Prevents co-adaptation of features
    • Dropout is training a large ensemble of models (that share parameters).
  • Data augmentation

    • Horizontal Flips
    • Random crops and scales
    • Color Jitter
    • Translation
    • Rotation
    • Stretching
    • Shearing
    • Lens distortions
  • DropConnect

Wan et al, “Regularization of Neural Networks using DropConnect”, ICML 2013

  • Fractional Max Pooling

Graham, “Fractional Max Pooling”, arXiv 2014

  • Stochastic Depth

Huang et al, “Deep Networks with Stochastic Depth”, ECCV 2016
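The dropout regularizer listed above is usually implemented as "inverted dropout": rescale at train time so that test time is a plain forward pass. A sketch (p and the input shape are illustrative):

```python
import numpy as np

def dropout_forward(x, p=0.5, train=True):
    """Inverted dropout: zero each unit with probability p at train time and rescale
    survivors by 1/(1-p), so E[output] = input and test time is a no-op."""
    if not train:
        return x
    mask = (np.random.rand(*x.shape) >= p) / (1.0 - p)
    return x * mask
```

This "scale at train time" convention is what makes dropout free at test time; the alternative is to scale activations by (1-p) at test time instead.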
