About the Book

"The Matrix Calculus You Need For Deep Learning" is a free tutorial by Terence Parr, a professor at the University of San Francisco and the creator of ANTLR, and Jeremy Howard, the founder of fast.ai. It helps you get up to speed quickly on the matrix calculus used in deep learning. The tutorial is concise and accessible; a little calculus and some basic knowledge of neural networks are all you need to start.

What the Tutorial Covers

The tutorial first gives a quick review of the scalar derivative rules, vector calculus, and partial derivatives. It then introduces how to compute vector and matrix derivatives, starting from the generalization of the Jacobian matrix, and finally derives the gradient of a single neuron's output and the gradient of a neural network's loss function.

Summary of Contents

1. Introduction

Derivatives are an essential part of machine learning, and of deep learning in particular, where neural networks are trained by optimizing a loss function. What this requires, however, is not the scalar calculus we already know, but so-called matrix calculus: the "marriage" of linear algebra and multivariable calculus.
We are already familiar with scalar differentiation; the rules used most often are the power rule, the product rule, and the chain rule. Note that we can already introduce the notion of an operator here: $\frac{d}{dx}$ can be viewed as a differential operator that maps a function to its derivative, which means $\frac{d}{dx}f(x)$ and $\frac{df(x)}{dx}$ denote the same thing.
Next, consider the multivariable case. Differentiating a multivariable function with respect to a single variable yields a partial derivative (written with $\frac{\partial}{\partial x}$). Collecting all the partial derivatives into a row vector gives the gradient of the function $f(x,y)$:
$$\nabla f(x,y)=\left[\frac{\partial f(x,y)}{\partial x},\ \frac{\partial f(x,y)}{\partial y}\right]$$
Going one step further, consider multiple functions of multiple variables. Besides $f(x,y)$, add a second function $g(x,y)$. The gradients of these two functions can be stacked into a matrix, called the Jacobian matrix, in which each row is the gradient of one function:
$$J=\begin{bmatrix}\nabla f(x,y)\\ \nabla g(x,y)\end{bmatrix}=\begin{bmatrix}\frac{\partial f(x,y)}{\partial x} & \frac{\partial f(x,y)}{\partial y}\\ \frac{\partial g(x,y)}{\partial x} & \frac{\partial g(x,y)}{\partial y}\end{bmatrix}$$
And with that we arrive at the core topic of this tutorial: matrix calculus!
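As a quick numerical sanity check (my own sketch, not part of the tutorial), the snippet below compares the analytic Jacobian of a concrete pair of functions, $f(x,y)=3x^2y$ and $g(x,y)=2x+y^8$, against a central finite-difference approximation:

```python
import numpy as np

def f(x, y):
    return 3 * x**2 * y

def g(x, y):
    return 2 * x + y**8

def analytic_jacobian(x, y):
    # Rows are the gradients of f and g (numerator layout, as in these notes).
    return np.array([[6 * x * y, 3 * x**2],
                     [2.0,       8 * y**7]])

def numeric_jacobian(x, y, h=1e-6):
    # Central finite differences for each partial derivative.
    J = np.empty((2, 2))
    for i, fn in enumerate((f, g)):
        J[i, 0] = (fn(x + h, y) - fn(x - h, y)) / (2 * h)
        J[i, 1] = (fn(x, y + h) - fn(x, y - h)) / (2 * h)
    return J

x, y = 1.5, 0.8
print(analytic_jacobian(x, y))
print(numeric_jacobian(x, y))   # should agree to several decimal places
```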

2. Generalization of the Jacobian

Write the parameters as a vector: $\bold x=[x_1\ x_2\ \ldots\ x_n]^T$.
Likewise, write the functions as a vector: $\bold y=\bold f(\bold x)=[f_1(\bold x)\ f_2(\bold x)\ \ldots\ f_m(\bold x)]^T$, i.e., a vector of $m$ scalar functions.
In general, the Jacobian matrix is the collection of all $m \times n$ possible partial derivatives, that is, the stack of the $m$ gradients with respect to $\bold x$:
$$\frac{\partial \bold y}{\partial \bold x}=\begin{bmatrix}\nabla f_1(\bold x)\\ \nabla f_2(\bold x)\\ \vdots\\ \nabla f_m(\bold x)\end{bmatrix}=\begin{bmatrix}\frac{\partial}{\partial \bold x}f_1(\bold x)\\ \frac{\partial}{\partial \bold x}f_2(\bold x)\\ \vdots\\ \frac{\partial}{\partial \bold x}f_m(\bold x)\end{bmatrix}=\begin{bmatrix}\frac{\partial}{\partial x_1}f_1(\bold x) & \frac{\partial}{\partial x_2}f_1(\bold x) & \cdots & \frac{\partial}{\partial x_n}f_1(\bold x)\\ \frac{\partial}{\partial x_1}f_2(\bold x) & \frac{\partial}{\partial x_2}f_2(\bold x) & \cdots & \frac{\partial}{\partial x_n}f_2(\bold x)\\ & \vdots & \\ \frac{\partial}{\partial x_1}f_m(\bold x) & \frac{\partial}{\partial x_2}f_m(\bold x) & \cdots & \frac{\partial}{\partial x_n}f_m(\bold x)\end{bmatrix}$$
As a special case, suppose $\bold f(\bold x)=\bold x$, i.e., $f_i(\bold x)=x_i$. There are $n$ functions, each with $n$ parameters, so the Jacobian is a square matrix, and it is easy to see that it is the identity matrix $I$.
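A small helper (my own sketch, not from the tutorial) that approximates the Jacobian of any vector function by central differences makes this special case easy to confirm:

```python
import numpy as np

def numeric_jacobian(f, x, h=1e-6):
    """Finite-difference Jacobian of a vector function f at x (numerator layout)."""
    x = np.asarray(x, dtype=float)
    m = f(x).size
    J = np.empty((m, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h
        J[:, j] = (f(x + step) - f(x - step)) / (2 * h)
    return J

x = np.array([1.0, 2.0, 3.0])
print(numeric_jacobian(lambda v: v, x))   # f(x) = x, so the Jacobian is the 3x3 identity
```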

3. Derivatives of Element-wise Binary Operations on Vectors

There are many element-wise binary operations on vectors, such as vector addition and subtraction, element-wise multiplication, and multiplying a vector by a scalar. In general, such an operation can be written in the form $\bold y = \bold f(\bold w) \circ \bold g(\bold x)$, where $\circ$ stands for any element-wise operator.
For simple element-wise operations such as addition, subtraction, multiplication, and division, the partial derivatives of $\bold y$ with respect to $\bold w$ and $\bold x$ are as follows:
[Image in the original post: table of the Jacobians of the basic element-wise operations.]
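Since the table itself is an image, here are the standard results for the four basic operations, stated from general knowledge rather than copied from the table. Because each $y_i$ depends only on $w_i$ and $x_i$, every one of these Jacobians is diagonal ($\otimes$ is element-wise multiplication, $\oslash$ element-wise division):
$$\frac{\partial(\bold w+\bold x)}{\partial \bold w}=I,\quad \frac{\partial(\bold w-\bold x)}{\partial \bold w}=I,\quad \frac{\partial(\bold w\otimes\bold x)}{\partial \bold w}=diag(\bold x),\quad \frac{\partial(\bold w\oslash\bold x)}{\partial \bold w}=diag(\bold 1\oslash\bold x)$$
$$\frac{\partial(\bold w+\bold x)}{\partial \bold x}=I,\quad \frac{\partial(\bold w-\bold x)}{\partial \bold x}=-I,\quad \frac{\partial(\bold w\otimes\bold x)}{\partial \bold x}=diag(\bold w),\quad \frac{\partial(\bold w\oslash\bold x)}{\partial \bold x}=diag(-\bold w\oslash(\bold x\otimes\bold x))$$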

4. Vector Derivatives Involving Scalar Operations

Simply put, this covers adding a scalar to a vector or multiplying a vector by a scalar, and then differentiating. The operations involved are still element-wise, so the scalar can be expanded into a vector whose entries all equal that scalar. The results are simple:
$$\frac{\partial \bold y}{\partial \bold x}=\frac{\partial (\bold x+z)}{\partial \bold x}=I$$
$$\frac{\partial \bold y}{\partial z}=\frac{\partial (\bold x+z)}{\partial z}=\bold 1$$
$$\frac{\partial \bold y}{\partial \bold x}=\frac{\partial (\bold xz)}{\partial \bold x}=Iz$$
$$\frac{\partial \bold y}{\partial z}=\frac{\partial (\bold xz)}{\partial z}=\bold x$$
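As a quick check of the multiplication case (a worked step added here for clarity): the $i$-th component of $\bold y=\bold xz$ is $y_i=x_iz$, so
$$\frac{\partial y_i}{\partial x_j}=z\,\delta_{ij}\ \Rightarrow\ \frac{\partial \bold y}{\partial \bold x}=Iz,\qquad \frac{\partial y_i}{\partial z}=x_i\ \Rightarrow\ \frac{\partial \bold y}{\partial z}=\bold x$$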

5. Derivatives of the Vector Sum

Summing the elements of a vector is an important operation in deep learning, for example when computing a network's loss function. The same idea also applies to differentiating dot products and other operations that reduce a vector to a scalar.
$$y=sum(\bold f(\bold x))=\sum_{i=1}^{n}f_i(\bold x)$$
$$\frac{\partial y}{\partial \bold x}=\begin{bmatrix}\frac{\partial y}{\partial x_1} & \frac{\partial y}{\partial x_2} & \cdots & \frac{\partial y}{\partial x_n}\end{bmatrix}$$
Some concise results:
$$y=sum(\bold x),\qquad \nabla y=\bold 1^T$$
$$y=sum(\bold xz),\qquad \frac{\partial y}{\partial \bold x}=\bold 1^Tz,\qquad \frac{\partial y}{\partial z}=sum(\bold x)$$
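A short numerical check of the second result (my own sketch), using central differences for the gradient of a scalar-valued function:

```python
import numpy as np

def numeric_gradient(f, x, h=1e-6):
    # Row-vector gradient of a scalar-valued function, by central differences.
    grad = np.empty_like(x)
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h
        grad[j] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

x = np.array([0.5, -1.0, 2.0])
z = 3.0
y = lambda v: np.sum(v * z)               # y = sum(x z)

print(numeric_gradient(y, x))             # should be z * [1, 1, 1] = [3, 3, 3]
print((np.sum(x * (z + 1e-6)) - np.sum(x * (z - 1e-6))) / 2e-6)  # dy/dz, should be sum(x) = 1.5
```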

6. The Chain Rule

The basic matrix differentiation rules alone cannot handle the partial derivatives of more complicated functions, such as nested functions; for those we have to combine the basic vector differentiation rules with the vector chain rule. Unfortunately, several different rules go by the name "chain rule," so we have to be careful about which one we use.

6.1 Single-variable Chain Rule

The single-variable chain rule is just the chain rule we already know, applied when there is only one variable.
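For instance (a small example added here): for $y=\sin(x^2)$, introduce the intermediate variable $u=x^2$, then
$$\frac{dy}{dx}=\frac{dy}{du}\frac{du}{dx}=\cos(u)\cdot 2x=2x\cos(x^2)$$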

6.2 Single-variable Total-derivative Chain Rule

The single-variable chain rule has limited applicability: all intermediate variables must be functions of a single variable. The total derivative means that, to compute $\frac{dy}{dx}$, we must add up every possible contribution of a change in $x$ to the change in $y$. The total derivative with respect to $x$ assumes that all variables are functions of $x$ and may vary as $x$ varies. For example, $f(x)=u_2(x,u_1)$ depends on $x$ both directly and indirectly through the intermediate variable $u_1(x)$, so the total derivative is:
$$\frac{dy}{dx}=\frac{\partial f(x)}{\partial x}=\frac{\partial u_2(x,u_1)}{\partial x}=\frac{\partial u_2}{\partial x}\frac{\partial x}{\partial x}+\frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x}=\frac{\partial u_2}{\partial x}+\frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x}$$
Its general form (with $u_{n+1}=x$, so the last term accounts for the direct dependence on $x$) is:
$$\frac{\partial f(u_1,\ldots,u_{n+1})}{\partial x}=\sum_{i=1}^{n+1}\frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial x}$$
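As a concrete illustration (my own example, not from the tutorial): take $u_1(x)=x^2$ and $u_2(x,u_1)=x\,u_1$, so $y=x^3$. The total-derivative chain rule gives
$$\frac{dy}{dx}=\frac{\partial u_2}{\partial x}+\frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x}=u_1+x\cdot 2x=x^2+2x^2=3x^2,$$
which matches differentiating $y=x^3$ directly.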

6.3 Vector Chain Rule

Let us start by computing the derivative of a vector function $\bold y=\bold f(\bold g(x))$ with respect to a scalar and see whether we can abstract a general formula from it. Clearly,
$$\frac{\partial \bold y}{\partial x}=\begin{bmatrix}\frac{\partial f_1(\bold g)}{\partial x}\\ \frac{\partial f_2(\bold g)}{\partial x}\end{bmatrix}=\begin{bmatrix}\frac{\partial f_1}{\partial g_1}\frac{\partial g_1}{\partial x}+\frac{\partial f_1}{\partial g_2}\frac{\partial g_2}{\partial x}\\ \frac{\partial f_2}{\partial g_1}\frac{\partial g_1}{\partial x}+\frac{\partial f_2}{\partial g_2}\frac{\partial g_2}{\partial x}\end{bmatrix}=\begin{bmatrix}\frac{\partial f_1}{\partial g_1} & \frac{\partial f_1}{\partial g_2}\\ \frac{\partial f_2}{\partial g_1} & \frac{\partial f_2}{\partial g_2}\end{bmatrix}\begin{bmatrix}\frac{\partial g_1}{\partial x}\\ \frac{\partial g_2}{\partial x}\end{bmatrix}=\frac{\partial \bold f}{\partial \bold g}\frac{\partial \bold g}{\partial x}$$
When the variable $x$ is extended to a vector $\bold x$ and we again form the Jacobian, the full vector chain rule becomes:
$$\frac{\partial}{\partial \bold x}\bold f(\bold g(\bold x))=\frac{\partial \bold f}{\partial \bold g}\frac{\partial \bold g}{\partial \bold x}$$

This equation can be simplified further, because in many cases the Jacobian matrices are square ($m=n$) and the off-diagonal entries are all zero. This is in the nature of neural networks: the associated mathematics deals with functions of vectors, not vectors of functions. For example, the neuron's affine function contains the term $sum(\bold w\otimes\bold x)$, and the activation function is $max(0,\bold x)$.
As discussed earlier, for element-wise operations the Jacobian is diagonal, because the $i$-th output is a function only of the $i$-th input and not of the $j$-th ($j\ne i$). Hence:
$$\frac{\partial \bold f}{\partial \bold g}=diag\left(\frac{\partial f_i}{\partial g_i}\right)$$
$$\frac{\partial \bold g}{\partial \bold x}=diag\left(\frac{\partial g_i}{\partial x_i}\right)$$
In this case, the chain rule simplifies to:
$$\frac{\partial}{\partial \bold x}\bold f(\bold g(\bold x))=diag\left(\frac{\partial f_i}{\partial g_i}\right)diag\left(\frac{\partial g_i}{\partial x_i}\right)=diag\left(\frac{\partial f_i}{\partial g_i}\frac{\partial g_i}{\partial x_i}\right)$$
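A quick numerical check of the diagonal form (my own sketch), with the element-wise functions $g_i(\bold x)=x_i^2$ and $f_i(\bold g)=\sin(g_i)$:

```python
import numpy as np

x = np.array([0.3, -0.7, 1.2])

g = lambda x: x**2              # element-wise g_i(x) = x_i^2
f = lambda g: np.sin(g)         # element-wise f_i(g) = sin(g_i)

# Diagonal chain rule: diag(df_i/dg_i * dg_i/dx_i) = diag(cos(x_i^2) * 2 x_i)
analytic = np.diag(np.cos(x**2) * 2 * x)

# Full finite-difference Jacobian of the composition f(g(x)).
h = 1e-6
numeric = np.empty((3, 3))
for j in range(3):
    step = np.zeros(3)
    step[j] = h
    numeric[:, j] = (f(g(x + step)) - f(g(x - step))) / (2 * h)

print(np.round(analytic, 6))
print(np.round(numeric, 6))     # off-diagonal entries ~0, diagonals match
```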

6.4 Summary

To sum up, the table below summarizes which chain-rule product form to use when computing a Jacobian.
[Image in the original post: summary table of the chain rules.]

7. Gradient of the Activation Function

The activation function of a single neuron:
$$activation(\bold x)=max(0,\bold w\cdot\bold x+b)$$
Introduce intermediate variables (writing the dot product as the sum of an element-wise product):
$$\bold u=\bold w\otimes\bold x$$
$$y=sum(\bold u)+b$$
Partial derivatives of the affine part:
$$\frac{\partial y}{\partial \bold w}=\frac{\partial}{\partial \bold w}(\bold w\cdot\bold x)+\frac{\partial}{\partial \bold w}b=\bold x^T+\bold 0^T=\bold x^T$$
$$\frac{\partial y}{\partial b}=\frac{\partial}{\partial b}(\bold w\cdot\bold x)+\frac{\partial}{\partial b}b=0+1=1$$
Partial derivatives of the activation function:
$$\frac{\partial activation}{\partial \bold w}=\begin{cases}\bold 0^T, &\bold w\cdot\bold x+b\le 0\\ \bold x^T, &\bold w\cdot\bold x+b>0\end{cases}$$
$$\frac{\partial activation}{\partial b}=\begin{cases}0, &\bold w\cdot\bold x+b\le 0\\ 1, &\bold w\cdot\bold x+b>0\end{cases}$$
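A small numerical check (my own sketch) of the non-zero case, comparing the analytic gradient $\bold x^T$ against finite differences of $max(0,\bold w\cdot\bold x+b)$:

```python
import numpy as np

relu = lambda z: max(0.0, z)

w = np.array([0.2, -0.5, 1.0])
x = np.array([1.0, 2.0, 3.0])
b = 0.5                                   # w.x + b = 2.7 > 0, so we are in the non-zero case

activation = lambda w, x, b: relu(np.dot(w, x) + b)

# Finite-difference gradient with respect to w.
h = 1e-6
grad_w = np.array([(activation(w + h * e, x, b) - activation(w - h * e, x, b)) / (2 * h)
                   for e in np.eye(3)])

print(grad_w)                             # should be x = [1, 2, 3]
print((activation(w, x, b + h) - activation(w, x, b - h)) / (2 * h))   # should be 1
```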

8. Gradient of the Loss Function

Training a neural network requires the derivatives of the loss (or cost) function with respect to the model parameters $\bold w$ and $b$. We train with multiple input vectors (e.g., multiple images) and their scalar targets (e.g., one class label per image). Let
$$X=[\bold x_1,\bold x_2,\ldots,\bold x_N]^T$$
$$\bold y=[target(\bold x_1),target(\bold x_2),\ldots,target(\bold x_N)]^T$$
The cost function is then:
$$C(\bold w,b,X,\bold y)=\frac{1}{N}\sum_{i=1}^{N}(y_i-activation(\bold x_i))^2=\frac{1}{N}\sum_{i=1}^{N}(y_i-max(0,\bold w\cdot\bold x_i+b))^2$$
Now introduce the intermediate variables:
$$u(\bold w,b,\bold x)=max(0,\bold w\cdot\bold x+b)$$
$$v(y,u)=y-u$$
$$C(v)=\frac{1}{N}\sum_{i=1}^{N}v^2$$
The partial derivative with respect to the weights is therefore:
$$\frac{\partial C(v)}{\partial \bold w}=\frac{\partial}{\partial \bold w}\frac{1}{N}\sum_{i=1}^{N}v^2=\frac{1}{N}\sum_{i=1}^{N}2v\frac{\partial v}{\partial \bold w}$$
$$=\frac{1}{N}\sum_{i=1}^{N}\begin{cases}\bold 0^T, &\bold w\cdot\bold x_i+b\le 0\\ -2v\bold x_i^T, &\bold w\cdot\bold x_i+b>0\end{cases}=\frac{1}{N}\sum_{i=1}^{N}\begin{cases}\bold 0^T, &\bold w\cdot\bold x_i+b\le 0\\ -2(y_i-u)\bold x_i^T, &\bold w\cdot\bold x_i+b>0\end{cases}$$
$$=\begin{cases}\bold 0^T, &\bold w\cdot\bold x_i+b\le 0\\ \frac{2}{N}\sum_{i=1}^{N}(\bold w\cdot\bold x_i+b-y_i)\bold x_i^T, &\bold w\cdot\bold x_i+b>0\end{cases}$$
Defining the error term $e_i=\bold w\cdot\bold x_i+b-y_i$, the partial derivative in the non-zero case becomes
$$\frac{\partial C}{\partial \bold w}=\frac{2}{N}\sum_{i=1}^{N}e_i\bold x_i^T$$
This is a weighted average of all the $\bold x_i$ in $X$, where the weights are the error terms.
The partial derivative with respect to the bias is:
$$\frac{\partial C(v)}{\partial b}=\frac{\partial}{\partial b}\frac{1}{N}\sum_{i=1}^{N}v^2=\frac{1}{N}\sum_{i=1}^{N}2v\frac{\partial v}{\partial b}=\frac{1}{N}\sum_{i=1}^{N}\begin{cases}0, &\bold w\cdot\bold x_i+b\le 0\\ -2v, &\bold w\cdot\bold x_i+b>0\end{cases}=\begin{cases}0, &\bold w\cdot\bold x_i+b\le 0\\ \frac{2}{N}\sum_{i=1}^{N}(\bold w\cdot\bold x_i+b-y_i), &\bold w\cdot\bold x_i+b>0\end{cases}$$
which in the non-zero case is simply $\frac{2}{N}\sum_{i=1}^{N}e_i$.
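To tie things together, here is a small numerical sanity check (my own sketch) that compares the closed-form gradients $\frac{2}{N}\sum_i e_i\bold x_i^T$ and $\frac{2}{N}\sum_i e_i$ against finite differences of the cost, using data chosen so that every pre-activation is positive:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 8, 3
X = rng.uniform(0.1, 1.0, size=(N, d))   # positive inputs
w = np.array([0.4, 0.3, 0.2])             # positive weights => all pre-activations > 0
b = 0.1
y = rng.uniform(0.0, 1.0, size=N)         # scalar targets

def cost(w, b):
    pre = X @ w + b
    return np.mean((y - np.maximum(0.0, pre)) ** 2)

# Closed-form gradients in the non-zero case, with e_i = w.x_i + b - y_i.
e = X @ w + b - y
grad_w = 2.0 / N * (e @ X)                # 2/N * sum_i e_i x_i^T
grad_b = 2.0 / N * np.sum(e)

# Finite-difference comparison.
h = 1e-6
fd_w = np.array([(cost(w + h * u, b) - cost(w - h * u, b)) / (2 * h) for u in np.eye(d)])
fd_b = (cost(w, b + h) - cost(w, b - h)) / (2 * h)

print(grad_w, fd_w)
print(grad_b, fd_b)
```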
In practice it is more convenient to merge $\bold w$ and $b$ into a single vector: $\hat{\bold w}=[\bold w^T,b]^T$. The input vector $\bold x$ is extended accordingly to $\hat{\bold x}=[\bold x^T,1]^T$, so that $\bold w\cdot\bold x+b=\hat{\bold w}\cdot\hat{\bold x}$.
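In NumPy this "bias trick" is just an append (a tiny self-contained illustration with hypothetical values):

```python
import numpy as np

w, b = np.array([0.2, -0.5, 1.0]), 0.5
x = np.array([1.0, 2.0, 3.0])

w_hat = np.append(w, b)        # [w; b]
x_hat = np.append(x, 1.0)      # [x; 1]
print(np.dot(w, x) + b, np.dot(w_hat, x_hat))   # both print 2.7
```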
