Matrix Derivative Methods

  • Dimension-compatibility principle: assume every intermediate variable has a different shape, then work out the only way the Jacobians can be arranged so that the matrix-multiplication rules are satisfied. Once the dimensions line up, the formula is correct.
  • Given $f(Y): \mathbb{R}^{m \times p} \rightarrow \mathbb{R}$ and $Y = AX + B: \mathbb{R}^{n \times p} \rightarrow \mathbb{R}^{m \times p}$, then $\nabla_X f(AX + B) = A^T \nabla_Y f$, i.e. $\frac{\partial f}{\partial X} = A^T \frac{\partial f}{\partial Y}$.
  • Given $f(Y): \mathbb{R}^{m \times p} \rightarrow \mathbb{R}$ and $Y = XA + B: \mathbb{R}^{m \times n} \rightarrow \mathbb{R}^{m \times p}$, then $\nabla_X f(XA + B) = \nabla_Y f \, A^T$, i.e. $\frac{\partial f}{\partial X} = \frac{\partial f}{\partial Y} A^T$. (A quick numerical check of the first rule follows this list.)
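
As a sanity check of the first rule, here is a minimal numpy sketch; the scalar function f(Y) = sum(Y**2), the shapes, and the step size h are illustrative assumptions. It compares the analytic gradient $A^T \nabla_Y f$ against a central-difference numerical gradient of $f(AX+B)$ with respect to X.

import numpy as np

# Illustrative scalar function f(Y) = sum(Y**2), so dF/dY = 2*Y
def f(Y):
    return np.sum(Y ** 2)

n, p, m = 4, 5, 3
A = np.random.randn(m, n)
B = np.random.randn(m, p)
X = np.random.randn(n, p)

Y = A @ X + B
analytic = A.T @ (2 * Y)                   # claimed gradient A^T dF/dY, shape (n, p)

# Central-difference numerical gradient with respect to X
numeric = np.zeros_like(X)
h = 1e-5
for i in range(n):
    for j in range(p):
        X[i, j] += h;      fp = f(A @ X + B)
        X[i, j] -= 2 * h;  fm = f(A @ X + B)
        X[i, j] += h
        numeric[i, j] = (fp - fm) / (2 * h)

print(np.max(np.abs(analytic - numeric)))  # should be on the order of 1e-9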

Proof

In the forward pass, X has shape (N, D), W has shape (D, C), and Y = XW. Now suppose N = 2, D = 2, C = 3. Then
$$X = \left( \begin{array}{ll} x_{1,1} & x_{1,2} \\ x_{2,1} & x_{2,2} \end{array} \right) \qquad W = \left( \begin{array}{lll} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \end{array} \right)$$

$$Y = XW = \left( \begin{array}{lll} x_{1,1}w_{1,1} + x_{1,2}w_{2,1} & x_{1,1}w_{1,2} + x_{1,2}w_{2,2} & x_{1,1}w_{1,3} + x_{1,2}w_{2,3} \\ x_{2,1}w_{1,1} + x_{2,2}w_{2,1} & x_{2,1}w_{1,2} + x_{2,2}w_{2,2} & x_{2,1}w_{1,3} + x_{2,2}w_{2,3} \end{array} \right)$$

After the forward pass, the loss L is computed from the output Y, and we obtain $\frac{\partial L}{\partial Y}$:

$$\frac{\partial L}{\partial Y} = \left( \begin{array}{lll} \frac{\partial L}{\partial y_{1,1}} & \frac{\partial L}{\partial y_{1,2}} & \frac{\partial L}{\partial y_{1,3}} \\ \frac{\partial L}{\partial y_{2,1}} & \frac{\partial L}{\partial y_{2,2}} & \frac{\partial L}{\partial y_{2,3}} \end{array} \right)$$
Next we compute $\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} \frac{\partial Y}{\partial X}$ and $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial Y} \frac{\partial Y}{\partial W}$. Since $\frac{\partial L}{\partial X}$ has the same shape as X, first expand $\frac{\partial L}{\partial X}$:
$$X = \left( \begin{array}{ll} x_{1,1} & x_{1,2} \\ x_{2,1} & x_{2,2} \end{array} \right) \Longrightarrow \frac{\partial L}{\partial X} = \left( \begin{array}{ll} \frac{\partial L}{\partial x_{1,1}} & \frac{\partial L}{\partial x_{1,2}} \\ \frac{\partial L}{\partial x_{2,1}} & \frac{\partial L}{\partial x_{2,2}} \end{array} \right)$$

Start with a single element, $\frac{\partial L}{\partial x_{1,1}}$:

$$\frac{\partial L}{\partial x_{1,1}} = \sum_{i=1}^{N} \sum_{j=1}^{C} \frac{\partial L}{\partial y_{i,j}} \frac{\partial y_{i,j}}{\partial x_{1,1}} = \frac{\partial L}{\partial Y} \cdot \frac{\partial Y}{\partial x_{1,1}}$$

where "$\cdot$" denotes the element-wise product followed by a sum over all entries. Here $L$, $x_{1,1}$ and $\frac{\partial L}{\partial x_{1,1}}$ are all scalars, while $\frac{\partial Y}{\partial x_{1,1}}$ is a matrix with the same shape as $Y$:

$$\frac{\partial Y}{\partial x_{1,1}} = \left( \begin{array}{ccc} w_{1,1} & w_{1,2} & w_{1,3} \\ 0 & 0 & 0 \end{array} \right)$$

Putting these together,

$$\begin{aligned} \frac{\partial L}{\partial x_{1,1}} &= \frac{\partial L}{\partial Y} \cdot \frac{\partial Y}{\partial x_{1,1}} \\ &= \left( \begin{array}{ccc} \frac{\partial L}{\partial y_{1,1}} & \frac{\partial L}{\partial y_{1,2}} & \frac{\partial L}{\partial y_{1,3}} \\ \frac{\partial L}{\partial y_{2,1}} & \frac{\partial L}{\partial y_{2,2}} & \frac{\partial L}{\partial y_{2,3}} \end{array} \right) \cdot \left( \begin{array}{ccc} w_{1,1} & w_{1,2} & w_{1,3} \\ 0 & 0 & 0 \end{array} \right) \\ &= \frac{\partial L}{\partial y_{1,1}} w_{1,1} + \frac{\partial L}{\partial y_{1,2}} w_{1,2} + \frac{\partial L}{\partial y_{1,3}} w_{1,3} \end{aligned}$$

The same computation gives $\frac{\partial L}{\partial x_{1,2}}$, $\frac{\partial L}{\partial x_{2,1}}$ and $\frac{\partial L}{\partial x_{2,2}}$:

$$\begin{aligned} \frac{\partial L}{\partial X} &= \left( \begin{array}{cc} \frac{\partial L}{\partial y_{1,1}} w_{1,1} + \frac{\partial L}{\partial y_{1,2}} w_{1,2} + \frac{\partial L}{\partial y_{1,3}} w_{1,3} & \frac{\partial L}{\partial y_{1,1}} w_{2,1} + \frac{\partial L}{\partial y_{1,2}} w_{2,2} + \frac{\partial L}{\partial y_{1,3}} w_{2,3} \\ \frac{\partial L}{\partial y_{2,1}} w_{1,1} + \frac{\partial L}{\partial y_{2,2}} w_{1,2} + \frac{\partial L}{\partial y_{2,3}} w_{1,3} & \frac{\partial L}{\partial y_{2,1}} w_{2,1} + \frac{\partial L}{\partial y_{2,2}} w_{2,2} + \frac{\partial L}{\partial y_{2,3}} w_{2,3} \end{array} \right) \\ &= \left( \begin{array}{lll} \frac{\partial L}{\partial y_{1,1}} & \frac{\partial L}{\partial y_{1,2}} & \frac{\partial L}{\partial y_{1,3}} \\ \frac{\partial L}{\partial y_{2,1}} & \frac{\partial L}{\partial y_{2,2}} & \frac{\partial L}{\partial y_{2,3}} \end{array} \right) \left( \begin{array}{cc} w_{1,1} & w_{2,1} \\ w_{1,2} & w_{2,2} \\ w_{1,3} & w_{2,3} \end{array} \right) \\ &= \frac{\partial L}{\partial Y} W^T \end{aligned}$$

Applying the same element-by-element derivation to W gives $\frac{\partial L}{\partial W} = X^T \frac{\partial L}{\partial Y}$.
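
Both results can also be confirmed numerically. The sketch below is illustrative: it fixes a random matrix G in the role of $\frac{\partial L}{\partial Y}$ by defining a toy loss $L = \sum_{i,j} G_{i,j} Y_{i,j}$ with $Y = XW$, and checks that the analytic gradients $\frac{\partial L}{\partial X} = G W^T$ and $\frac{\partial L}{\partial W} = X^T G$ match central-difference estimates.

import numpy as np

N, D, C = 2, 2, 3
X = np.random.randn(N, D)
W = np.random.randn(D, C)
G = np.random.randn(N, C)            # plays the role of dL/dY

# Toy loss whose gradient with respect to Y = XW is exactly G
def loss():
    return np.sum(G * (X @ W))

dX = G @ W.T                         # claimed dL/dX
dW = X.T @ G                         # claimed dL/dW

def num_grad(param, i, j, h=1e-5):
    """Central-difference derivative of the toy loss w.r.t. param[i, j]."""
    old = param[i, j]
    param[i, j] = old + h;  fp = loss()
    param[i, j] = old - h;  fm = loss()
    param[i, j] = old
    return (fp - fm) / (2 * h)

print(abs(dX[0, 1] - num_grad(X, 0, 1)))   # ~1e-10
print(abs(dW[1, 2] - num_grad(W, 1, 2)))   # ~1e-10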

Gradient Derivation for a Two-Layer Neural Network

The structure of the two-layer network is shown in the figure below. The output layer produces the probability of the sample belonging to each class, and the final loss can be computed with either an SVM or a Softmax loss.
[Figure: structure of the two-layer neural network]
Converting the network's computation into the computation graph below, with a ReLU activation function in the hidden layer, we apply the chain rule along the graph to derive the gradients of the four parameters.
[Figure: computation graph of the two-layer neural network]
Here $S_1 = XW_1 + b_1$, $S_{1relu} = \mathrm{relu}(S_1)$, and $S_2 = S_{1relu}W_2 + b_2$. We want $\frac{\partial L}{\partial W_1}$, $\frac{\partial L}{\partial W_2}$, $\frac{\partial L}{\partial b_1}$ and $\frac{\partial L}{\partial b_2}$.

  • $\frac{\partial L}{\partial W_2}$:
    $\frac{\partial L}{\partial W_2} = S_{1relu}^T \frac{\partial L}{\partial S_2}$
  • $\frac{\partial L}{\partial b_2}$:
    $\frac{\partial L}{\partial b_2} = \frac{\partial L}{\partial S_2} \frac{\partial S_2}{\partial b_2}$, whose $j$-th component is $\sum_i \frac{\partial L}{\partial (S_2)_{i,j}}$, i.e. $\frac{\partial L}{\partial S_2}$ summed over the batch dimension
  • $\frac{\partial L}{\partial W_1}$:
    $\frac{\partial L}{\partial W_1} = X^T \frac{\partial L}{\partial S_1}$
    where $\frac{\partial L}{\partial S_{1relu}} = \frac{\partial L}{\partial S_2} W_2^T$ and $\frac{\partial L}{\partial S_1} = \frac{\partial L}{\partial S_{1relu}} \frac{\partial S_{1relu}}{\partial S_1} = \frac{\partial L}{\partial S_2} W_2^T \odot \mathbb{1}(S_1 > 0)$, so
    $\frac{\partial L}{\partial W_1} = X^T \left( \frac{\partial L}{\partial S_2} W_2^T \odot \mathbb{1}(S_1 > 0) \right)$
  • $\frac{\partial L}{\partial b_1}$:
    $\frac{\partial L}{\partial b_1} = \frac{\partial L}{\partial S_1} \frac{\partial S_1}{\partial b_1}$, whose $j$-th component is $\sum_i \frac{\partial L}{\partial (S_1)_{i,j}}$, i.e. $\frac{\partial L}{\partial S_1}$ summed over the batch dimension
    With the gradients of all four parameters in hand (translated into numpy in the sketch below), training can proceed by gradient descent.
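
The four formulas translate directly into numpy. The sketch below is illustrative: the function name and the assumption that the upstream gradient dS2 = ∂L/∂S₂ has already been computed (e.g. from a softmax loss, as in the code further down) are mine.

import numpy as np

def two_layer_backward(X, W1, b1, W2, b2, dS2):
    """Apply the four chain-rule results above, given dS2 = dL/dS2."""
    S1 = X @ W1 + b1                  # first affine layer
    S1relu = np.maximum(0, S1)        # ReLU

    dW2 = S1relu.T @ dS2              # dL/dW2 = S1relu^T dL/dS2
    db2 = dS2.sum(axis=0)             # dL/db2: sum over the batch dimension
    dS1 = (dS2 @ W2.T) * (S1 > 0)     # backprop through W2, then the ReLU mask
    dW1 = X.T @ dS1                   # dL/dW1 = X^T dL/dS1
    db1 = dS1.sum(axis=0)             # dL/db1: sum over the batch dimension
    return dW1, db1, dW2, db2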

classifiers/neural_net.py

def loss(self, X, y=None, reg=0.0):
    """
    Compute the loss and gradients for a two layer fully connected neural
    network.
    Inputs:
    - X: Input data of shape (N, D). Each X[i] is a training sample.
    - y: Vector of training labels. y[i] is the label for X[i], and each y[i] is
      an integer in the range 0 <= y[i] < C. This parameter is optional; if it
      is not passed then we only return scores, and if it is passed then we
      instead return the loss and gradients.
    - reg: Regularization strength.
    Returns:
    If y is None, return a matrix scores of shape (N, C) where scores[i, c] is
    the score for class c on input X[i].
    If y is not None, instead return a tuple of:
    - loss: Loss (data loss and regularization loss) for this batch of training
      samples.
    - grads: Dictionary mapping parameter names to gradients of those parameters
      with respect to the loss function; has the same keys as self.params.
    """
    # Unpack variables from the params dictionary
    W1, b1 = self.params['W1'], self.params['b1']
    W2, b2 = self.params['W2'], self.params['b2']
    N, D = X.shape

    # Compute the forward pass
    scores = None
    z1 = np.dot(X, W1) + b1          # first affine layer: S1
    a1 = np.maximum(0, z1)           # ReLU: S1relu
    scores = np.dot(a1, W2) + b2     # second affine layer: S2
    
    # If the targets are not given then jump out, we're done
    if y is None:
      return scores

    # Compute the loss
    loss = None
    # Numerically stable softmax: shift scores so the max per row is 0
    scores -= np.max(scores, axis=1).reshape(N, 1)
    correct_scores = np.exp(scores[np.arange(N), y])
    sum_line = np.sum(np.exp(scores), axis=1)
    # Average cross-entropy loss plus L2 regularization
    loss = np.sum(-np.log(correct_scores / sum_line))
    loss /= N
    loss += reg * (np.sum(W1 ** 2) + np.sum(W2 ** 2))


    # Backward pass: compute gradients
    grads = {}
    # Softmax gradient: dL/dS2 = softmax(scores), minus 1 at the correct class
    dS2 = np.exp(scores) / sum_line.reshape(N, 1)
    dS2[np.arange(N), y] -= 1
    # dL/dW2 = a1^T dS2, dL/db2 = column sums of dS2 (averaged over the batch)
    grads['W2'] = np.dot(a1.T, dS2) / N + 2 * reg * W2
    grads['b2'] = np.sum(dS2, axis=0) / N
    # Backprop through the second layer and the ReLU mask
    dz1 = dS2.dot(W2.T) * (z1 > 0)
    grads['W1'] = np.dot(X.T, dz1) / N + 2 * reg * W1
    grads['b1'] = np.sum(dz1, axis=0) / N

    return loss, grads
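
A minimal usage sketch, assuming this method lives in the cs231n-style TwoLayerNet class with the starter code's constructor TwoLayerNet(input_size, hidden_size, output_size, std) and self.params dictionary; the toy sizes and the num_grad helper are illustrative. It calls loss() once and spot-checks one entry of grads['W1'] against a central-difference estimate.

import numpy as np
from classifiers.neural_net import TwoLayerNet   # assumed surrounding class

np.random.seed(0)
N, D, H, C = 5, 4, 10, 3
net = TwoLayerNet(D, H, C, std=1e-1)             # assumed constructor
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)

loss, grads = net.loss(X, y, reg=0.05)

def num_grad(param, i, j, h=1e-5):
    """Central-difference derivative of the loss w.r.t. param[i, j]."""
    old = param[i, j]
    param[i, j] = old + h;  fp = net.loss(X, y, reg=0.05)[0]
    param[i, j] = old - h;  fm = net.loss(X, y, reg=0.05)[0]
    param[i, j] = old
    return (fp - fm) / (2 * h)

print(abs(grads['W1'][0, 0] - num_grad(net.params['W1'], 0, 0)))  # should be tiny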

References
https://github.com/soloice/Matrix_Derivatives
http://cs231n.stanford.edu/handouts/linear-backprop.pdf
http://cs231n.stanford.edu/handouts/derivatives.pdf
