CS231n Winter 2016: Lecture 5: Neural Networks Part 2 

CS231n Winter 2016: Lecture 6: Neural Networks Part 3

Andrej Karpathy

This section mainly covers activation functions, weight initialization, and the surrounding body of knowledge.

Ref: Deep Learning (Goodfellow et al.), Chapter 8 - Optimization for Training Deep Models

 

Overview

1. One time setup

    • activation functions,
    • preprocessing,
    • weight initialization,
    • regularization,
    • gradient checking

2. Training dynamics

    • babysitting the learning process,
    • parameter updates,
    • hyperparameter optimization

3. Evaluation

    • model ensembles 

 

 

Activation Functions

I. Some commonly used activation functions


 

II. Discussion

Sigmoid's three problems:

1. Saturated neurons “kill” the gradients
2. Sigmoid outputs are not zero-centered
3. exp() is a bit computationally expensive

 

Like sigmoid, tanh suffers from problem 1: saturated neurons still "kill" the gradients.

Why do saturated neurons "kill" the gradients? When the sigmoid saturates, its local gradient σ'(x) = σ(x)(1 − σ(x)) is nearly zero, so the gradient flowing back gets multiplied by roughly zero and essentially stops there (see the numerical sketch below).

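A minimal numerical sketch (numpy, hypothetical input values) of the saturation problem: the local gradient is at most 0.25 (at x = 0) and is essentially zero for large |x|, so whatever gradient arrives from above is wiped out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
s = sigmoid(x)
local_grad = s * (1 - s)          # dσ/dx, at most 0.25 (reached at x = 0)
upstream = 1.0                    # pretend gradient flowing in from above
print(local_grad * upstream)      # ~[4.5e-05, 0.105, 0.25, 0.105, 4.5e-05]
```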

 

ReLU

Computes f(x) = max(0,x)

- Does not saturate (in +region)
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice (e.g. 6x)

 

- Not zero-centered output
- An annoyance: what is the gradient when x < 0? It is zero, so the gradient is "killed" and a neuron stuck in that regime never updates (a "dead ReLU"). At x = 0 the gradient is undefined; in practice either 0 or 1 is used (see the sketch below).
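A small sketch (numpy) of the ReLU forward and backward pass, showing that no gradient flows back through units with negative input:

```python
import numpy as np

def relu_forward(x):
    return np.maximum(0, x)

def relu_backward(dout, x):
    # local gradient is 1 where x > 0, else 0 (x < 0 "kills" the gradient;
    # at x = 0 we simply use 0, the usual convention)
    return dout * (x > 0)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu_forward(x))                     # [0.  0.  0.  0.5 3. ]
print(relu_backward(np.ones_like(x), x))   # [0. 0. 0. 1. 1.]
```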

 

Leaky ReLU

Computes f(x) = max(0.01x, x)


- Does not saturate
- Computationally efficient
- Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
- Will not "die": the gradient for x < 0 is small (0.01) but nonzero.

 

Parametric Rectifier (PReLU)

Computes f(x) = max(alpha*x, x)

The slope α is a learnable parameter: backprop also flows into α (see the sketch below).
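A sketch of a PReLU forward/backward pass (numpy; a single shared α is assumed here for simplicity, though PReLU can also learn one α per channel), showing the extra gradient with respect to the slope:

```python
import numpy as np

def prelu_forward(x, alpha):
    return np.where(x > 0, x, alpha * x)

def prelu_backward(dout, x, alpha):
    dx = dout * np.where(x > 0, 1.0, alpha)
    # gradient w.r.t. the learnable slope: only the negative inputs contribute
    dalpha = np.sum(dout * np.where(x > 0, 0.0, x))
    return dx, dalpha
```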

 

Exponential Linear Units (ELU)


- All benefits of ReLU
- Does not die
- Closer to zero mean outputs
- Computation requires exp()

 

Maxout “Neuron” 


- Does not have the basic form of dot product → nonlinearity
- Generalizes ReLU and Leaky ReLU
- Linear regime! Does not saturate! Does not die!

Problem: doubles the number of parameters per neuron (see the sketch below).
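A sketch of a maxout unit with k = 2 linear pieces (numpy, made-up shapes): two full sets of weights are kept and the output is their element-wise max, which is exactly why the parameter count per neuron doubles.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)                           # 5 input features
W1, b1 = rng.standard_normal((3, 5)), np.zeros(3)    # linear piece 1: 3 output units
W2, b2 = rng.standard_normal((3, 5)), np.zeros(3)    # linear piece 2: doubles the parameters
out = np.maximum(W1 @ x + b1, W2 @ x + b2)           # maxout: max of the two linear pieces
print(out)
```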

  

III. Summary - TLDR: In practice

- Use ReLU. Be careful with your learning rates
- Try out Leaky ReLU / Maxout / ELU
- Try out tanh but don’t expect much
- Don’t use sigmoid

 

 

Data Preprocessing


In practice, you may also see PCA and Whitening of the data.


The lecture slides show the effect of whitening clearly.

 

TLDR: In practice for Images: center only

With images we don't have separate features living on very different scales and units; everything is just pixels, all bounded between 0 and 255.

So it is not common to normalize the data, but it is very common to zero-center it.

(The following is, roughly, about reducing the influence of overall brightness.)

- Subtract the mean image (e.g. AlexNet)
  (mean image = [32,32,3] array)
- Subtract per-channel mean (e.g. VGGNet)
  (mean along each channel = 3 numbers)

It is not common to normalize the variance, or to do PCA or whitening, for images. A small sketch of the two centering schemes follows.
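A minimal sketch (numpy, fake CIFAR-10-shaped data) of the two common centering schemes mentioned above:

```python
import numpy as np

# fake image data: 1000 images of shape [32, 32, 3] with pixel values in [0, 255]
X_train = np.random.randint(0, 256, size=(1000, 32, 32, 3)).astype(np.float32)

# AlexNet-style: subtract the mean image (a [32, 32, 3] array)
mean_image = X_train.mean(axis=0)
X_centered = X_train - mean_image

# VGGNet-style: subtract the per-channel mean (just 3 numbers)
channel_mean = X_train.mean(axis=(0, 1, 2))
X_centered_vgg = X_train - channel_mean
```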

 

 

Weight Initialization

First idea: Small random numbers
(Gaussian with zero mean and 1e-2 standard deviation)


 

An experiment with a 10-layer network: with this small random initialization, the activations in the later layers shrink toward 0 (and the gradients with them).


First, this experiment is worth running yourself. As shown in the lecture, after scaling the initial weights up by a factor of 100, almost all neurons saturate, which is bad for convergence.


Tuning deep networks is a skill that combines technique with experience; the theory below (Xavier-style initialization, from the Glorot & Bengio paper listed later) helps.


Using different activation functions with the same initialization gives different results; run the experiment yourself and think about why.

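A compact version of the lecture's 10-layer experiment (numpy sketch with made-up layer sizes): compare the small Gaussian init (activations collapse toward 0), the 100x larger init (tanh saturates near ±1), and Xavier init (the spread stays roughly constant).

```python
import numpy as np

def forward_stats(init_std=None, xavier=False, n_layers=10, dim=500):
    rng = np.random.default_rng(0)
    h = rng.standard_normal((1000, dim))              # unit-Gaussian input data
    for _ in range(n_layers):
        scale = np.sqrt(1.0 / dim) if xavier else init_std
        W = rng.standard_normal((dim, dim)) * scale   # random weights for this layer
        h = np.tanh(h @ W)
        print(f"mean {h.mean():+.4f}  std {h.std():.4f}")

forward_stats(init_std=0.01)   # activations shrink toward 0 layer by layer
forward_stats(init_std=1.0)    # 100x larger: tanh saturates near ±1
forward_stats(xavier=True)     # Xavier: activation std stays roughly constant
```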

 

Proper initialization is an active area of research…

  • Understanding the difficulty of training deep feedforward neural networks by Glorot and Bengio, 2010
  • Exact solutions to the nonlinear dynamics of learning in deep linear neural networks by Saxe et al, 2013
  • Random walk initialization for training very deep feedforward networks by Sussillo and Abbott, 2014
  • Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification by He et al., 2015
  • Data-dependent Initializations of Convolutional Neural Networks by Krähenbühl et al., 2015
  • All you need is a good init, Mishkin and Matas, 2015

 

 

Batch Normalization

From: http://blog.csdn.net/hjimce/article/details/50866313

That blog post mainly explains a 2015 paper that is very much worth studying: "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift".

Another reference (with more illustrations): http://blog.csdn.net/elaine_bao/article/details/50890491
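A training-time forward-pass sketch of batch normalization (numpy), following the formulas in the paper: normalize each feature over the mini-batch, then scale and shift with the learnable γ and β. (At test time the running mean/variance collected during training would be used instead.)

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                     # per-feature mean over the mini-batch
    var = x.var(axis=0)                     # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to zero mean, unit variance
    return gamma * x_hat + beta             # learnable scale and shift

x = np.random.randn(64, 100) * 5 + 3        # a mini-batch of 64 examples, 100 features
out = batchnorm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
print(out.mean(axis=0)[:3], out.std(axis=0)[:3])   # ~0 and ~1 per feature
```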

 

First try your setup on a small subset of the data; you should be able to overfit it (training accuracy = 100%).

In the run shown in the lecture, the loss barely changes, which suggests the learning rate is too small.

But why does the accuracy suddenly improve a lot at the end? Think about it. (One explanation from the lecture: the scores shift only slightly, so the loss barely moves, yet the correct classes gradually become the argmax, so accuracy can jump.)


 

 

Hyperparameter Optimization

First stage: only a few epochs to get rough idea of what params work
Second stage: longer running time, finer search
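A sketch of the coarse-to-fine random search in log space recommended in the lecture (here `train_and_eval` is a hypothetical placeholder for your own training loop):

```python
import numpy as np

def random_search(n_trials, lr_range, reg_range, num_epochs):
    rng = np.random.default_rng()
    results = []
    for _ in range(n_trials):
        # sample learning rate and regularization strength in log space
        lr = 10 ** rng.uniform(*lr_range)
        reg = 10 ** rng.uniform(*reg_range)
        val_acc = train_and_eval(lr=lr, reg=reg, num_epochs=num_epochs)  # hypothetical
        results.append((val_acc, lr, reg))
    return sorted(results, reverse=True)

# First stage: coarse ranges, only a few epochs; second stage: narrower ranges, longer runs.
# coarse = random_search(100, lr_range=(-6, -1), reg_range=(-5, 5), num_epochs=1)
# fine   = random_search(100, lr_range=(-4, -3), reg_range=(-2, 0), num_epochs=10)
```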


 

 

Parameter Updates

As the comparison shows, vanilla SGD is always the slowest.

Note the effect of adding momentum, and of the improved momentum, NAG (Nesterov accelerated gradient), shown as the purple curve in the lecture's animation.


The only difference lies in where the gradient is evaluated:

  • Classical momentum: v = μ·v − ε·∇f(x), then x ← x + v (gradient at the current position x).

  • Improved (Nesterov) momentum: v = μ·v − ε·∇f(x + μ·v), then x ← x + v (gradient at the look-ahead position x + μ·v).

A "change of variables" (storing the look-ahead point as the parameter) then makes the NAG update convenient to compute; both forms appear in the sketch below.
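The updates written as code (numpy sketch on a toy quadratic objective; `dx` stands for the gradient function and is a made-up example here):

```python
import numpy as np

def momentum_step(x, v, dx, mu=0.9, lr=0.1):
    # classical momentum: gradient evaluated at the current position x
    v = mu * v - lr * dx(x)
    return x + v, v

def nesterov_step(x, v, dx, mu=0.9, lr=0.1):
    # NAG: gradient evaluated at the look-ahead position x + mu*v
    v = mu * v - lr * dx(x + mu * v)
    return x + v, v

def nesterov_step_rewritten(x_ahead, v, dx, mu=0.9, lr=0.1):
    # same update after the change of variables (the look-ahead point is stored);
    # only needs the gradient at the stored point, as in most implementations
    v_prev = v
    v = mu * v - lr * dx(x_ahead)
    return x_ahead - mu * v_prev + (1 + mu) * v, v

dx = lambda x: 2 * x                  # gradient of a toy objective f(x) = x**2
x, v = np.array([5.0]), np.array([0.0])
for _ in range(20):
    x, v = nesterov_step(x, v, dx)
print(x)                              # converges toward 0
```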

AdaGrad and RMSProp


Adam: seems to be the go-to first choice in practice.

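The per-parameter update rules as a sketch (numpy; `dx` is the gradient for the current step), including Adam's bias correction:

```python
import numpy as np

def adagrad_update(x, dx, cache, lr=1e-2, eps=1e-8):
    cache += dx ** 2                                   # accumulated squared gradients
    return x - lr * dx / (np.sqrt(cache) + eps), cache

def rmsprop_update(x, dx, cache, lr=1e-3, decay=0.99, eps=1e-8):
    cache = decay * cache + (1 - decay) * dx ** 2      # leaky accumulation
    return x - lr * dx / (np.sqrt(cache) + eps), cache

def adam_update(x, dx, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * dx                   # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * dx ** 2              # second moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)                       # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    return x - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```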

 

Choosing the Learning Rate

SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have learning rate as a hyperparameter.

Start with a larger (faster) learning rate, then decay it over time. Common strategies are step decay, exponential decay, and 1/t decay, sketched below.

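The three common schedules as simple functions of the epoch number (a sketch; the decay constants are arbitrary examples):

```python
import math

def step_decay(lr0, epoch, drop=0.5, every=10):
    return lr0 * (drop ** (epoch // every))   # e.g. halve the rate every 10 epochs

def exponential_decay(lr0, epoch, k=0.1):
    return lr0 * math.exp(-k * epoch)         # lr = lr0 * e^(-kt)

def inverse_time_decay(lr0, epoch, k=0.1):
    return lr0 / (1 + k * epoch)              # lr = lr0 / (1 + kt)
```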

 

Second order optimization methods 

- Newton's method: no learning rate hyperparameter to worry about!

- Quasi-Newton methods (BFGS most popular):
instead of inverting the Hessian (O(n^3)), approximate the inverse Hessian with rank-1 updates over time (O(n^2) each).

- L-BFGS (Limited memory BFGS):
Does not form/store the full inverse Hessian. Usually works very well in full batch, deterministic mode; not for mini-batch setting.

 

In practice:

- Adam is a good default choice in most cases
- If you can afford to do full batch updates then try out L-BFGS (and don’t forget to disable all sources of noise)


 

 

Regularization (dropout)

Ref: http://www.jianshu.com/p/ba9ca3b07922

Summary

  1. Dropout comes in two forms: direct (vanilla) and inverted (see the sketch after this list).
  2. On a single neuron, dropout can be modeled with a Bernoulli random variable.
  3. On a layer of neurons, dropout can likewise be modeled with Bernoulli random variables.
  4. Dropping exactly np neurons is unlikely, but on a layer with n neurons the expected number of dropped neurons is np.
  5. Inverted dropout produces an effective learning rate.
  6. Inverted dropout should be used together with other techniques that constrain the parameters, which helps simplify the choice of learning rate.
  7. Dropout helps prevent overfitting in deep neural networks.
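A sketch of inverted dropout (numpy), consistent with the summary above where p is the drop probability: neurons are zeroed with probability p at training time and the surviving activations are scaled by 1/(1 − p) right away, so nothing needs to change at test time.

```python
import numpy as np

p = 0.5                                                 # drop probability

def dropout_train(x, p):
    # Bernoulli keep-mask, scaled by 1/(1-p): "inverted" dropout
    mask = (np.random.rand(*x.shape) >= p) / (1 - p)
    return x * mask

def dropout_test(x):
    return x                                            # no scaling needed at test time

h = np.random.randn(4, 10)
print(dropout_train(h, p))                              # ~half the activations are zeroed
```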

 
