He K, Zhang X, Ren S, Sun J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification[C]. IEEE International Conference on Computer Vision (ICCV), 2015: 1026-1034.
@inproceedings{he2015delving,
  title={Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification},
  author={He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian},
  booktitle={Proceedings of the IEEE International Conference on Computer Vision (ICCV)},
  pages={1026--1034},
  year={2015}}
## Overview
This paper introduces the PReLU activation function and the Kaiming parameter initialization method.
## Main Content
### PReLU
$$
f(y_i) =
\begin{cases}
y_i, & y_i > 0, \\
a_i y_i, & y_i \le 0.
\end{cases}
$$
where $a_i$ is trained as a parameter of the network. Equivalently,
$$
f(y_i) = \max(0, y_i) + a_i \min(0, y_i).
$$
In particular, all units of one layer can share the same $a$ (the channel-shared variant).
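A minimal NumPy sketch of the max/min form above, with a hand-rolled backward pass; the function names and the scalar (channel-shared) `a` are my own choices, not from the paper:

```python
import numpy as np

def prelu(y, a):
    """PReLU: f(y) = max(0, y) + a * min(0, y)."""
    return np.maximum(0.0, y) + a * np.minimum(0.0, y)

def prelu_backward(y, a, grad_out):
    """Gradients w.r.t. the input y and the learnable slope a."""
    grad_y = np.where(y > 0, 1.0, a) * grad_out      # f'(y) is 1 or a
    grad_a = np.sum(np.minimum(0.0, y) * grad_out)   # df/da = min(0, y)
    return grad_y, grad_a

print(prelu(np.array([-2.0, 3.0]), 0.25))   # [-0.5  3. ]
```

With `a = 0` this reduces to ReLU, and with a small fixed `a` to Leaky ReLU.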
### Kaiming Initialization
#### Forward case
$$
\mathbf{y}_l = W_l \mathbf{x}_l + \mathbf{b}_l,
$$
For a convolutional layer, $\mathbf{x}_l$ is the unrolled $k \times k \times c$ input patch, so $\mathbf{x}_l \in \mathbb{R}^{k^2 c}$, while $\mathbf{y}_l \in \mathbb{R}^d$ and $W_l \in \mathbb{R}^{d \times k^2 c}$ (each row can be viewed as one kernel); write $n = k^2 c$. Since
$$
\mathbf{x}_l = f(\mathbf{y}_{l-1}),
$$
we have
$$
c_l = d_{l-1}.
$$
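To make the shapes concrete, a quick NumPy sanity check (the sizes are arbitrary):

```python
import numpy as np

k, c, d = 3, 64, 128        # kernel size, input channels, output channels
n = k * k * c               # n = k^2 c, the fan-in of the layer

x = np.random.randn(n)      # one unrolled k x k x c input patch
W = np.random.randn(d, n)   # each row is one k x k x c kernel
b = np.zeros(d)

y = W @ x + b               # the d responses at one spatial location
assert y.shape == (d,)
```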
Assume $w_l$ and $x_l$ (not bold, denoting single elements of $\mathbf{w}_l$ and $\mathbf{x}_l$) are independent of each other, and that $w_l$ is sampled from a symmetric distribution with zero mean. Then
$$
\mathrm{Var}[y_l] = n_l \,\mathrm{Var}[w_l x_l] = n_l \,\mathrm{Var}[w_l]\, E[x_l^2].
$$
This would reduce to $\mathrm{Var}[y_l] = n_l \mathrm{Var}[w_l] \mathrm{Var}[x_l]$ only if $E[x_l] = 0$, which does not hold for ReLU or PReLU.
If we let $b_{l-1} = 0$, then $y_{l-1}$ has zero mean and a distribution symmetric around zero, and it is easy to show that
$$
E[x_l^2] = \tfrac{1}{2} \mathrm{Var}[y_{l-1}]
$$
when $f$ is ReLU; if $f$ is PReLU,
$$
E[x_l^2] = \tfrac{1 + a^2}{2} \mathrm{Var}[y_{l-1}].
$$
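A quick Monte Carlo check of both identities (a sketch; any symmetric zero-mean distribution for $y$ works):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(0.0, 2.0, size=1_000_000)   # symmetric, zero-mean y_{l-1}

# ReLU: E[x^2] should equal Var[y] / 2
x = np.maximum(0.0, y)
print(np.mean(x**2), np.var(y) / 2)        # both approx 2.0

# PReLU with slope a: E[x^2] should equal (1 + a^2)/2 * Var[y]
a = 0.25
x = np.maximum(0.0, y) + a * np.minimum(0.0, y)
print(np.mean(x**2), (1 + a**2) / 2 * np.var(y))   # both approx 2.125
```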
The analysis below uses ReLU; the PReLU case is analogous.
Substituting $E[x_l^2] = \tfrac{1}{2}\mathrm{Var}[y_{l-1}]$ into the variance formula gives
$$
\mathrm{Var}[y_l] = \tfrac{1}{2} n_l \,\mathrm{Var}[w_l]\, \mathrm{Var}[y_{l-1}].
$$
Naturally, we want the variance to be preserved across layers:
$$
\mathrm{Var}[y_i] = \mathrm{Var}[y_j] \;\Longrightarrow\; \tfrac{1}{2} n_l \,\mathrm{Var}[w_l] = 1, \quad \forall l.
$$
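This yields the recipe $w_l \sim \mathcal{N}(0,\, 2/n_l)$. A sketch of the initializer plus an empirical check that activations keep a stable scale through a deep ReLU stack (the function name is mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def kaiming_forward_init(fan_in, fan_out):
    # Var[w] = 2 / n_l, i.e. std = sqrt(2 / fan_in)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

x = rng.normal(size=1000)
for _ in range(50):                       # a 50-layer ReLU stack, biases 0
    W = kaiming_forward_init(x.size, 1000)
    x = np.maximum(0.0, W @ x)
print(np.var(x))                          # stays O(1): no vanishing/exploding
```

In PyTorch this corresponds to `torch.nn.init.kaiming_normal_` with `mode='fan_in'` and `nonlinearity='relu'`.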
#### Backward case
$$
\Delta \mathbf{x}_l = \hat{W}_l \Delta \mathbf{y}_l, \tag{13}
$$
Here $\Delta \mathbf{x}_l$ denotes the derivative of the loss with respect to $\mathbf{x}_l$. The $\mathbf{y}_l$ here differs slightly from the $\mathbf{y}_l$ above: it comes from the gradient backpropagation of the convolution, which is hard to spell out in a few words; $\hat{W}_l$ is a rearrangement of $W_l$. Since $\mathbf{x}_{l+1} = f(\mathbf{y}_l)$ (the relation $\mathbf{x}_l = f(\mathbf{y}_{l-1})$ shifted by one layer), the chain rule gives
$$
\Delta y_l = f'(y_l)\, \Delta x_{l+1}.
$$
Assuming $f'(y_l)$ and $\Delta x_{l+1}$ are independent of each other,
$$
E[\Delta y_l] = E[f'(y_l)]\, E[\Delta x_{l+1}] = 0.
$$
If $f$ is ReLU:
$$
E[(\Delta y_l)^2] = \mathrm{Var}[\Delta y_l] = \tfrac{1}{2} \mathrm{Var}[\Delta x_{l+1}].
$$
If $f$ is PReLU:
$$
E[(\Delta y_l)^2] = \mathrm{Var}[\Delta y_l] = \tfrac{1 + a^2}{2} \mathrm{Var}[\Delta x_{l+1}].
$$
Below we take $f$ to be ReLU; PReLU is similar. Then
$$
\mathrm{Var}[\Delta x_l] = \hat{n}_l \,\mathrm{Var}[w_l]\, \mathrm{Var}[\Delta y_l] = \tfrac{1}{2} \hat{n}_l \,\mathrm{Var}[w_l]\, \mathrm{Var}[\Delta x_{l+1}],
$$
where $\hat{n}_l = k^2 d$ is the length of the unrolled $\Delta \mathbf{y}_l$ (the fan-out). As in the forward case, we want $\mathrm{Var}[\Delta x_l]$ to be the same for every layer, which requires
$$
\tfrac{1}{2} \hat{n}_l \,\mathrm{Var}[w_l] = 1, \quad \forall l.
$$
In practice we can choose either the forward or the backward condition (the error does not accumulate): using only one of them, the per-layer mismatch factor $\hat{n}_l / n_l = d_l / c_l$ telescopes across layers into a fixed constant rather than compounding with depth.
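A small numeric illustration of that last point, with arbitrary layer widths: initializing by the forward rule alone, the backward scaling factors multiply out to $d_L / d_0$, a constant independent of depth.

```python
import numpy as np

k = 3                                   # kernel size
widths = [3, 64, 128, 256, 512, 10]     # d_0, ..., d_L (arbitrary)

# forward rule:  Var[w_l] = 2 / (k^2 c_l)   (fan-in)
# backward rule: Var[w_l] = 2 / (k^2 d_l)   (fan-out)
fwd = [2.0 / (k * k * c) for c in widths[:-1]]
bwd = [2.0 / (k * k * d) for d in widths[1:]]

# per-layer backward mismatch when using only the forward rule: d_l / c_l
ratios = [f / b for f, b in zip(fwd, bwd)]
print(np.prod(ratios), widths[-1] / widths[0])   # both 10/3: it telescopes
```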