He K, Zhang X, Ren S, Sun J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification[C]. IEEE International Conference on Computer Vision (ICCV), 2015: 1026-1034.

@inproceedings{he2015delving,
title={Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification},
author={He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian},
booktitle={Proceedings of the IEEE International Conference on Computer Vision (ICCV)},
pages={1026--1034},
year={2015}}

The paper introduces the PReLU activation function and the Kaiming parameter initialization method.

Main Content

PReLU


$$f(y_i) = \left\{ \begin{array}{ll} y_i, & y_i > 0, \\ a_i y_i, & y_i \le 0, \end{array} \right.$$
where $a_i$ is trained as a parameter of the network.
This is equivalent to
$$f(y_i) = \max(0, y_i) + a_i \min(0, y_i).$$
In particular, all nodes in one layer can share the same $a$ (the channel-shared variant).
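
For concreteness, here is a minimal PyTorch sketch of this definition (PyTorch's built-in `torch.nn.PReLU` is the standard implementation; the class below and its argument names are only illustrative). Following the paper, $a$ is initialized to 0.25.

```python
import torch
import torch.nn as nn

class PReLU(nn.Module):
    """f(y) = max(0, y) + a * min(0, y), with trainable slope(s) a.

    num_parameters=1 shares a single a across the layer (channel-shared);
    num_parameters=C learns one a_i per channel.
    """
    def __init__(self, num_parameters: int = 1, init: float = 0.25):
        super().__init__()
        # a is trained jointly with the rest of the network's parameters
        self.a = nn.Parameter(torch.full((num_parameters,), init))

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # reshape a to broadcast over (N, C, ...) in the per-channel case
        a = self.a if self.a.numel() == 1 else self.a.view(1, -1, *([1] * (y.dim() - 2)))
        return torch.clamp(y, min=0) + a * torch.clamp(y, max=0)
```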

Kaiming Initialization

Forward case

$$\mathbf{y}_l = W_l \mathbf{x}_l + \mathbf{b}_l,$$
where, for a convolutional layer, $\mathbf{x}_l$ is the unrolled $k \times k \times c$ input patch, so $\mathbf{x}_l \in \mathbb{R}^{k^2 c}$, $\mathbf{y}_l \in \mathbb{R}^{d}$, and $W_l \in \mathbb{R}^{d \times k^2 c}$ (each row can be viewed as one kernel); write $n = k^2 c$.

$$\mathbf{x}_l = f(\mathbf{y}_{l-1}),$$

$$c_l = d_{l-1}.$$

Assume $w_l$ and $x_l$ (unbolded, denoting single elements of $\mathbf{w}_l$ and $\mathbf{x}_l$) are independent of each other, and that $w_l$ is sampled from a symmetric distribution with zero mean. Then


$$Var[y_l] = n_l Var[w_l x_l] = n_l Var[w_l] E[x_l^2].$$
Only if $E[x_l] = 0$ would this reduce to $Var[y_l] = n_l Var[w_l] Var[x_l]$; for ReLU or PReLU that does not hold, since in general their outputs are not zero-mean.

If we set $b_{l-1} = 0$, then $y_{l-1}$ has a zero-mean, symmetric distribution, and it is easy to show that
$$E[x_l^2] = \frac{1}{2} Var[y_{l-1}]$$
when $f$ is ReLU ($\max(0, \cdot)$ keeps only the positive half of a symmetric distribution, halving the second moment); if $f$ is PReLU,
$$E[x_l^2] = \frac{1+a^2}{2} Var[y_{l-1}].$$
Below we analyze the ReLU case; PReLU is analogous.
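
Both identities are easy to check numerically: for a zero-mean symmetric $y$, the cross term $\max(0, y)\min(0, y)$ vanishes pointwise, leaving $E[x^2] = \frac{1+a^2}{2}E[y^2]$. A quick Monte Carlo sketch (the standard deviation 1.7 is arbitrary):

```python
import torch

torch.manual_seed(0)
y = 1.7 * torch.randn(1_000_000)    # symmetric, zero-mean stand-in for y_{l-1}

for a in (0.0, 0.25):               # a = 0 is plain ReLU
    x = torch.clamp(y, min=0) + a * torch.clamp(y, max=0)   # PReLU
    lhs = (x ** 2).mean().item()                 # E[x_l^2]
    rhs = (1 + a ** 2) / 2 * y.var().item()      # (1+a^2)/2 * Var[y_{l-1}]
    print(f"a={a}: E[x^2]={lhs:.4f}  (1+a^2)/2*Var[y]={rhs:.4f}")
```

The two columns agree up to Monte Carlo noise.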


Substituting $E[x_l^2] = \frac{1}{2} Var[y_{l-1}]$ back in gives
$$Var[y_l] = \frac{1}{2} n_l Var[w_l] Var[y_{l-1}].$$
Naturally we want the variance preserved across layers:
$$Var[y_i] = Var[y_j] \Rightarrow \frac{1}{2} n_l Var[w_l] = 1, \quad \forall l,$$
which is satisfied by sampling $w_l$ from a zero-mean Gaussian with standard deviation $\sqrt{2/n_l}$.
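
As a sketch, the forward condition for a conv weight looks like this (the helper name is hypothetical; PyTorch ships the same rule as `torch.nn.init.kaiming_normal_` with `mode='fan_in'`, whose `a` argument covers the PReLU slope):

```python
import math
import torch

def kaiming_normal_forward_(weight: torch.Tensor, a: float = 0.0) -> torch.Tensor:
    """Sample w ~ N(0, 2 / ((1 + a^2) * n)) so that (1+a^2)/2 * n * Var[w] = 1.

    weight: conv weight of shape (d, c, k, k) or linear weight (d, n);
    the fan-in n = k^2 * c is the number of inputs per output unit.
    a = 0 recovers the ReLU condition 1/2 * n * Var[w] = 1.
    """
    n = weight[0].numel()                         # fan-in n_l = k^2 * c
    std = math.sqrt(2.0 / ((1 + a ** 2) * n))
    with torch.no_grad():
        return weight.normal_(0.0, std)

w = kaiming_normal_forward_(torch.empty(64, 3, 3, 3))   # d=64 kernels, k=3, c=3
print(w.var().item(), 2.0 / 27)                         # the two should be close
```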

Backward case

$$\Delta \mathbf{x}_l = \hat{W}_l \Delta \mathbf{y}_l, \tag{13}$$
where $\Delta \mathbf{x}_l$ denotes the derivative of the loss with respect to $\mathbf{x}_l$. The $\mathbf{y}_l$ here differs from the $\mathbf{y}_l$ above: it comes from back-propagating the gradient through the convolution, which is hard to spell out in a few words; $\hat{W}_l$ is a rearrangement of $W_l$.

Since $\mathbf{x}_{l+1} = f(\mathbf{y}_l)$, we have
$$\Delta y_l = f'(y_l) \Delta x_{l+1}.$$

Assume $f'(y_l)$ and $\Delta x_{l+1}$ are independent of each other, so that
$$E[\Delta y_l] = E[f'(y_l)] E[\Delta x_{l+1}] = 0.$$
If $f$ is ReLU:
$$E[(\Delta y_l)^2] = Var[\Delta y_l] = \frac{1}{2} Var[\Delta x_{l+1}].$$
If $f$ is PReLU:
$$E[(\Delta y_l)^2] = Var[\Delta y_l] = \frac{1+a^2}{2} Var[\Delta x_{l+1}].$$

Below we take $f$ to be ReLU; PReLU is analogous.

$$Var[\Delta x_l] = \hat{n}_l Var[w_l] Var[\Delta y_l] = \frac{1}{2} \hat{n}_l Var[w_l] Var[\Delta x_{l+1}],$$
where $\hat{n}_l = k^2 d$, with $d$ the length of $\mathbf{y}_l$ (in general $\hat{n}_l = k^2 d \ne n_l = k^2 c$).

As in the forward case, we want $Var[\Delta x_l]$ to be the same for every layer, which requires
$$\frac{1}{2} \hat{n}_l Var[w_l] = 1, \quad \forall l.$$
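
The backward condition differs from the forward one only in using the fan-out $\hat{n}_l = k^2 d$ instead of the fan-in $n_l = k^2 c$. A hypothetical counterpart of the earlier sketch (PyTorch exposes this as `mode='fan_out'`):

```python
import math
import torch

def kaiming_normal_backward_(weight: torch.Tensor, a: float = 0.0) -> torch.Tensor:
    """Sample w so that (1+a^2)/2 * n_hat * Var[w] = 1, with n_hat = k^2 * d."""
    d = weight.shape[0]                                    # number of kernels
    k2 = weight[0, 0].numel() if weight.dim() > 2 else 1   # k^2 (1 for linear)
    n_hat = k2 * d                                         # fan-out n_hat = k^2 * d
    std = math.sqrt(2.0 / ((1 + a ** 2) * n_hat))
    with torch.no_grad():
        return weight.normal_(0.0, std)
```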

In practice we can pick either the forward or the backward condition, because the error does not accumulate: if, say, only the backward condition holds, the forward factors $\frac{1}{2} n_l Var[w_l] = n_l / \hat{n}_l$ multiply across layers to a constant (roughly $c_2 / d_L$) rather than to something exponential in depth.
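
A small experiment illustrates why one condition suffices: stacking many ReLU layers initialized with the forward condition keeps the pre-activation variance flat with depth (a hypothetical setup with equal widths, where fan-in and fan-out coincide anyway):

```python
import math
import torch

torch.manual_seed(0)
depth, n = 30, 256
h = torch.randn(4096, n)                         # unit-variance inputs
for l in range(1, depth + 1):
    W = torch.randn(n, n) * math.sqrt(2.0 / n)   # forward condition: Var[w] = 2/n
    y = h @ W.t()
    if l in (1, 10, 20, 30):
        print(f"layer {l:2d}: Var[y] = {y.var().item():.3f}")
    h = torch.clamp(y, min=0)                    # ReLU
# Var[y] stays near 2 at every depth (n * (2/n) * E[x^2] with E[x^2] = 1);
# with Xavier-style Var[w] = 1/n it would instead halve at each layer.
```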
