Matrix Differentiation Methods
- Dimension-compatibility principle: treat every intermediate variable as if its dimensions were all different, and work out how the Jacobians must be arranged so that their dimensions conform to the rules of matrix multiplication. Once the dimensions line up, the formula is correct.
- Let $f(Y):\mathbb{R}^{m\times p}\to\mathbb{R}$ and $Y=AX+B$, a map $\mathbb{R}^{n\times p}\to\mathbb{R}^{m\times p}$. Then $\nabla_X f(AX+B)=A^T\nabla_Y f$, i.e. $\frac{\partial f}{\partial X}=A^T\frac{\partial f}{\partial Y}$.
- Let $f(Y):\mathbb{R}^{m\times p}\to\mathbb{R}$ and $Y=XA+B$, a map $\mathbb{R}^{m\times n}\to\mathbb{R}^{m\times p}$. Then $\nabla_X f(XA+B)=(\nabla_Y f)A^T$, i.e. $\frac{\partial f}{\partial X}=\frac{\partial f}{\partial Y}A^T$.
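The first identity is easy to sanity-check numerically. The sketch below is my own illustration (the shapes and the test function $f(Y)=\frac{1}{2}\sum Y^2$, for which $\nabla_Y f=Y$, are arbitrary choices, not from the text); it compares $A^T\nabla_Y f$ against a central-difference gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 4, 3, 2
A = rng.standard_normal((m, n))
B = rng.standard_normal((m, p))
X = rng.standard_normal((n, p))

def f_of_X(X):
    # f(Y) = 0.5 * sum(Y^2) with Y = AX + B, so grad_Y f = Y
    Y = A @ X + B
    return 0.5 * np.sum(Y ** 2)

analytic = A.T @ (A @ X + B)  # A^T (grad_Y f), the claimed gradient w.r.t. X

# Central-difference numerical gradient, entry by entry
h = 1e-5
numeric = np.zeros_like(X)
for idx in np.ndindex(X.shape):
    Xp, Xm = X.copy(), X.copy()
    Xp[idx] += h
    Xm[idx] -= h
    numeric[idx] = (f_of_X(Xp) - f_of_X(Xm)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))  # should be near zero
```

The second identity ($Y=XA+B$) can be checked the same way, with the transpose multiplied on the right instead of the left.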
Proof
In the forward pass, X has shape (N, D), W has shape (D, C), and Y = XW. Assume N = 2, D = 2, C = 3. Then
$$X=\begin{pmatrix}x_{1,1}&x_{1,2}\\x_{2,1}&x_{2,2}\end{pmatrix},\quad W=\begin{pmatrix}w_{1,1}&w_{1,2}&w_{1,3}\\w_{2,1}&w_{2,2}&w_{2,3}\end{pmatrix}$$

$$Y=XW=\begin{pmatrix}x_{1,1}w_{1,1}+x_{1,2}w_{2,1}&x_{1,1}w_{1,2}+x_{1,2}w_{2,2}&x_{1,1}w_{1,3}+x_{1,2}w_{2,3}\\x_{2,1}w_{1,1}+x_{2,2}w_{2,1}&x_{2,1}w_{1,2}+x_{2,2}w_{2,2}&x_{2,1}w_{1,3}+x_{2,2}w_{2,3}\end{pmatrix}$$

After the forward pass, we compute the loss $L$ from the output $Y$, and then obtain $\frac{\partial L}{\partial Y}$:

$$\frac{\partial L}{\partial Y}=\begin{pmatrix}\frac{\partial L}{\partial y_{1,1}}&\frac{\partial L}{\partial y_{1,2}}&\frac{\partial L}{\partial y_{1,3}}\\\frac{\partial L}{\partial y_{2,1}}&\frac{\partial L}{\partial y_{2,2}}&\frac{\partial L}{\partial y_{2,3}}\end{pmatrix}$$
Next we compute $\frac{\partial L}{\partial X}=\frac{\partial L}{\partial Y}\frac{\partial Y}{\partial X}$ and $\frac{\partial L}{\partial W}=\frac{\partial L}{\partial Y}\frac{\partial Y}{\partial W}$. Since $\frac{\partial L}{\partial X}$ has the same shape as $X$, we expand $\frac{\partial L}{\partial X}$ first.
$$X=\begin{pmatrix}x_{1,1}&x_{1,2}\\x_{2,1}&x_{2,2}\end{pmatrix}\implies\frac{\partial L}{\partial X}=\begin{pmatrix}\frac{\partial L}{\partial x_{1,1}}&\frac{\partial L}{\partial x_{1,2}}\\\frac{\partial L}{\partial x_{2,1}}&\frac{\partial L}{\partial x_{2,2}}\end{pmatrix}$$

Start with a single element, $\frac{\partial L}{\partial x_{1,1}}$:

$$\frac{\partial L}{\partial x_{1,1}}=\sum_{i=1}^{N}\sum_{j=1}^{C}\frac{\partial L}{\partial y_{i,j}}\frac{\partial y_{i,j}}{\partial x_{1,1}}=\frac{\partial L}{\partial Y}\cdot\frac{\partial Y}{\partial x_{1,1}}$$

where $L$, $x_{1,1}$, and $\frac{\partial L}{\partial x_{1,1}}$ are scalars, the dot denotes an element-wise product followed by a sum over all entries, and $\frac{\partial Y}{\partial x_{1,1}}$ is a matrix with the same shape as $Y$:

$$\frac{\partial Y}{\partial x_{1,1}}=\begin{pmatrix}w_{1,1}&w_{1,2}&w_{1,3}\\0&0&0\end{pmatrix}$$

Putting this together:

$$\frac{\partial L}{\partial x_{1,1}}=\frac{\partial L}{\partial Y}\cdot\frac{\partial Y}{\partial x_{1,1}}=\frac{\partial L}{\partial y_{1,1}}w_{1,1}+\frac{\partial L}{\partial y_{1,2}}w_{1,2}+\frac{\partial L}{\partial y_{1,3}}w_{1,3}$$

The same procedure gives $\frac{\partial L}{\partial x_{1,2}}$, $\frac{\partial L}{\partial x_{2,1}}$, and $\frac{\partial L}{\partial x_{2,2}}$.
$$\frac{\partial L}{\partial X}=\begin{pmatrix}\frac{\partial L}{\partial y_{1,1}}w_{1,1}+\frac{\partial L}{\partial y_{1,2}}w_{1,2}+\frac{\partial L}{\partial y_{1,3}}w_{1,3}&\frac{\partial L}{\partial y_{1,1}}w_{2,1}+\frac{\partial L}{\partial y_{1,2}}w_{2,2}+\frac{\partial L}{\partial y_{1,3}}w_{2,3}\\\frac{\partial L}{\partial y_{2,1}}w_{1,1}+\frac{\partial L}{\partial y_{2,2}}w_{1,2}+\frac{\partial L}{\partial y_{2,3}}w_{1,3}&\frac{\partial L}{\partial y_{2,1}}w_{2,1}+\frac{\partial L}{\partial y_{2,2}}w_{2,2}+\frac{\partial L}{\partial y_{2,3}}w_{2,3}\end{pmatrix}=\frac{\partial L}{\partial Y}\begin{pmatrix}w_{1,1}&w_{2,1}\\w_{1,2}&w_{2,2}\\w_{1,3}&w_{2,3}\end{pmatrix}=\frac{\partial L}{\partial Y}W^T$$

By the same element-by-element approach, we can obtain $\frac{\partial L}{\partial W}=X^T\frac{\partial L}{\partial Y}$.
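Both formulas can be verified numerically. In the sketch below (my own illustration; the surrogate loss is chosen purely so that the upstream gradient $\partial L/\partial Y$ equals an arbitrary matrix $G$):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, C = 2, 2, 3
X = rng.standard_normal((N, D))
W = rng.standard_normal((D, C))
G = rng.standard_normal((N, C))  # plays the role of an arbitrary dL/dY

def L(X, W):
    # Surrogate loss whose gradient w.r.t. Y = XW is exactly G
    return np.sum((X @ W) * G)

dX = G @ W.T   # claimed dL/dX = (dL/dY) W^T
dW = X.T @ G   # claimed dL/dW = X^T (dL/dY)

# Central-difference check of both formulas
h = 1e-5
num_dX = np.zeros_like(X)
for idx in np.ndindex(X.shape):
    Xp, Xm = X.copy(), X.copy()
    Xp[idx] += h
    Xm[idx] -= h
    num_dX[idx] = (L(Xp, W) - L(Xm, W)) / (2 * h)

num_dW = np.zeros_like(W)
for idx in np.ndindex(W.shape):
    Wp, Wm = W.copy(), W.copy()
    Wp[idx] += h
    Wm[idx] -= h
    num_dW[idx] = (L(X, Wp) - L(X, Wm)) / (2 * h)

print(np.allclose(dX, num_dX), np.allclose(dW, num_dW))
```

Note that `dX` has the shape of `X` and `dW` the shape of `W`, exactly as the dimension-compatibility principle requires: `(N,C) @ (C,D)` and `(D,N) @ (N,C)` are the only arrangements that produce those shapes.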
Gradient Derivation for a Two-Layer Neural Network
The structure of a two-layer neural network is shown in the figure below. The output layer produces the probability that a sample belongs to each class, and the final loss L can be computed with either an SVM or a Softmax loss.

The network's computation can be converted into the computational graph below; the hidden layer applies a ReLU activation function. Using the chain rule along the graph, we derive the gradients of the four parameters.

where $S_1=XW_1+b_1$, $S_{1\mathrm{relu}}=\mathrm{relu}(S_1)$, and $S_2=S_{1\mathrm{relu}}W_2+b_2$. We want $\frac{\partial L}{\partial W_1}$, $\frac{\partial L}{\partial W_2}$, $\frac{\partial L}{\partial b_1}$, and $\frac{\partial L}{\partial b_2}$.
- $\frac{\partial L}{\partial W_2}$:
  $$\frac{\partial L}{\partial W_2}=S_{1\mathrm{relu}}^T\frac{\partial L}{\partial S_2}$$
- $\frac{\partial L}{\partial b_2}$:
  $$\frac{\partial L}{\partial b_2}=\frac{\partial L}{\partial S_2}\frac{\partial S_2}{\partial b_2}=\sum_i\frac{\partial L}{\partial (S_2)_{i,j}}$$
  i.e. the column sums of $\frac{\partial L}{\partial S_2}$, since $b_2$ is broadcast across all $N$ rows.
- $\frac{\partial L}{\partial W_1}$:
  $$\frac{\partial L}{\partial W_1}=X^T\frac{\partial L}{\partial S_1}$$
  where, with $\odot$ denoting the element-wise product,
  $$\frac{\partial L}{\partial S_1}=\frac{\partial L}{\partial S_{1\mathrm{relu}}}\frac{\partial S_{1\mathrm{relu}}}{\partial S_1}=\frac{\partial L}{\partial S_2}W_2^T\odot\mathbb{1}(S_1>0)$$
  so
  $$\frac{\partial L}{\partial W_1}=X^T\left[\frac{\partial L}{\partial S_2}W_2^T\odot\mathbb{1}(S_1>0)\right]$$
- $\frac{\partial L}{\partial b_1}$:
  $$\frac{\partial L}{\partial b_1}=\frac{\partial L}{\partial S_1}\frac{\partial S_1}{\partial b_1}=\sum_i\frac{\partial L}{\partial (S_1)_{i,j}}$$
In summary, once the gradients of the four parameters have been obtained, the network can be trained by gradient descent.
classifiers/neural_net.py
def loss(self, X, y=None, reg=0.0):
    """
    Compute the loss and gradients for a two-layer fully connected neural
    network.
    Inputs:
    - X: Input data of shape (N, D). Each X[i] is a training sample.
    - y: Vector of training labels. y[i] is the label for X[i], and each y[i] is
      an integer in the range 0 <= y[i] < C. This parameter is optional; if it
      is not passed then we only return scores, and if it is passed then we
      instead return the loss and gradients.
    - reg: Regularization strength.
    Returns:
    If y is None, return a matrix scores of shape (N, C) where scores[i, c] is
    the score for class c on input X[i].
    If y is not None, instead return a tuple of:
    - loss: Loss (data loss and regularization loss) for this batch of training
      samples.
    - grads: Dictionary mapping parameter names to gradients of those parameters
      with respect to the loss function; has the same keys as self.params.
    """
    W1, b1 = self.params['W1'], self.params['b1']
    W2, b2 = self.params['W2'], self.params['b2']
    N, D = X.shape

    # Forward pass: S1 = X W1 + b1, S1_relu = relu(S1), S2 = S1_relu W2 + b2
    z1 = np.dot(X, W1) + b1
    a1 = np.maximum(0, z1)
    scores = np.dot(a1, W2) + b2
    if y is None:
        return scores

    # Softmax loss, with scores shifted for numerical stability
    scores -= np.max(scores, axis=1).reshape(N, 1)
    correct_scores = np.exp(scores[np.arange(N), y])
    sum_line = np.sum(np.exp(scores), axis=1)
    loss = np.sum(-np.log(correct_scores / sum_line))
    loss /= N
    loss += reg * (np.sum(W1 ** 2) + np.sum(W2 ** 2))

    # Backward pass, following the derivations above
    grads = {}
    dS2 = np.exp(scores) / sum_line.reshape(N, 1)
    dS2[np.arange(N), y] -= 1
    grads['W2'] = np.dot(a1.T, dS2) / N + 2 * reg * W2
    grads['b2'] = np.sum(dS2, axis=0) / N
    dz1 = dS2.dot(W2.T) * (z1 > 0)
    grads['W1'] = np.dot(X.T, dz1) / N + 2 * reg * W1
    grads['b1'] = np.sum(dz1, axis=0) / N
    return loss, grads
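To exercise the method outside the class, here is a self-contained version of the same forward/backward computation with a numerical gradient check on W1. This is my own sketch: the standalone function name and the illustrative shapes are assumptions, and the class in the original file stores parameters in `self.params` rather than taking them as an argument.

```python
import numpy as np

def two_layer_loss(params, X, y, reg=0.0):
    """Standalone version of the two-layer loss/grads computation above."""
    W1, b1 = params['W1'], params['b1']
    W2, b2 = params['W2'], params['b2']
    N = X.shape[0]

    # Forward pass
    z1 = X @ W1 + b1
    a1 = np.maximum(0, z1)               # ReLU
    scores = a1 @ W2 + b2
    scores = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(N), y]).mean()
    loss += reg * (np.sum(W1 ** 2) + np.sum(W2 ** 2))

    # Backward pass; dividing dS2 by N here is equivalent to dividing
    # each gradient by N afterwards, as the class method does
    dS2 = probs.copy()
    dS2[np.arange(N), y] -= 1
    dS2 /= N
    dz1 = (dS2 @ W2.T) * (z1 > 0)
    grads = {
        'W2': a1.T @ dS2 + 2 * reg * W2,
        'b2': dS2.sum(axis=0),
        'W1': X.T @ dz1 + 2 * reg * W1,
        'b1': dz1.sum(axis=0),
    }
    return loss, grads

rng = np.random.default_rng(3)
N, D, H, C = 5, 4, 6, 3
params = {'W1': 0.1 * rng.standard_normal((D, H)), 'b1': np.zeros(H),
          'W2': 0.1 * rng.standard_normal((H, C)), 'b2': np.zeros(C)}
X = rng.standard_normal((N, D))
y = rng.integers(0, C, size=N)

loss, grads = two_layer_loss(params, X, y, reg=0.05)

# Central-difference check on W1
h = 1e-5
num_dW1 = np.zeros_like(params['W1'])
for idx in np.ndindex(num_dW1.shape):
    params['W1'][idx] += h
    lp, _ = two_layer_loss(params, X, y, reg=0.05)
    params['W1'][idx] -= 2 * h
    lm, _ = two_layer_loss(params, X, y, reg=0.05)
    params['W1'][idx] += h
    num_dW1[idx] = (lp - lm) / (2 * h)

print(np.max(np.abs(grads['W1'] - num_dW1)))
```

The same loop over `np.ndindex` works for W2, b1, and b2; in the cs231n assignment this role is played by the provided `eval_numerical_gradient` helper.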
References
https://github.com/soloice/Matrix_Derivatives
http://cs231n.stanford.edu/handouts/linear-backprop.pdf
http://cs231n.stanford.edu/handouts/derivatives.pdf