Reading Notes: The Matrix Calculus You Need For Deep Learning
About the Tutorial
"The Matrix Calculus You Need For Deep Learning" is a free tutorial by Terence Parr, professor at the University of San Francisco and creator of ANTLR, and Jeremy Howard, founder of fast.ai. It is meant to get you up to speed quickly on the matrix calculus used in deep learning. The tutorial is concise and accessible; a little basic calculus and some familiarity with neural networks is all you need to get started.
What the Tutorial Covers
The tutorial begins with a quick review of the scalar derivative rules, vector calculus, and partial derivatives. It then introduces matrix derivatives by generalizing the Jacobian, and finally derives the gradient of a single neuron's output and the gradient of a neural-network loss function.
Summary of Contents
1. Introduction
Derivatives are an essential ingredient of machine learning, and especially of deep learning, where neural networks are trained by optimizing a loss function. What is needed, however, is not the scalar calculus we learned before, but so-called matrix calculus: the marriage of linear algebra and multivariable calculus.
We are already familiar with scalar differentiation; the rules used most often are the power rule, the product rule, and the chain rule. Note that we can already think in terms of operators here: $\frac{d}{dx}$ is the differentiation operator that maps a function to its derivative, which means that $\frac{d}{dx}f(x)$ and $\frac{df(x)}{dx}$ denote the same thing.
Now consider the multivariable case. Differentiating a multivariable function with respect to a single variable yields a partial derivative (written with the operator $\frac{\partial}{\partial x}$). Collecting all of the partial derivatives into a row vector gives the gradient of the function $f(x,y)$:
$$\nabla f(x,y)=\left[\frac{\partial f(x,y)}{\partial x},\frac{\partial f(x,y)}{\partial y}\right]$$
Going one step further, consider several functions of several variables. Alongside $f(x,y)$, add a second function $g(x,y)$. The gradients of the two functions can be stacked into a matrix, called the Jacobian matrix, with one gradient per row:
$$J=\begin{bmatrix}\nabla f(x,y)\\ \nabla g(x,y)\end{bmatrix}=\begin{bmatrix}\frac{\partial f(x,y)}{\partial x} & \frac{\partial f(x,y)}{\partial y}\\ \frac{\partial g(x,y)}{\partial x} & \frac{\partial g(x,y)}{\partial y}\end{bmatrix}$$
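As a concrete worked example (the particular functions are chosen here for illustration and are not part of the original notes), take $f(x,y)=3x^2y$ and $g(x,y)=2x+y^8$; then
$$\nabla f(x,y)=\left[6xy,\;3x^2\right],\qquad \nabla g(x,y)=\left[2,\;8y^7\right],\qquad J=\begin{bmatrix}6xy & 3x^2\\ 2 & 8y^7\end{bmatrix}$$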
With that we arrive at the core topic of this tutorial: matrix calculus!
2. Generalizing the Jacobian
Write the parameters as a vector, $\bold x=[x_1\ x_2\ \dots\ x_n]^T$.
Likewise write the functions as a vector, $\bold y=\bold f(\bold x)=[f_1(\bold x)\ f_2(\bold x)\ \dots\ f_m(\bold x)]^T$, a vector made up of $m$ scalar functions.
In general, the Jacobian matrix is the collection of all $m \times n$ partial derivatives, i.e. the stack of the $m$ gradients with respect to $\bold x$:
$$\frac{\partial \bold y}{\partial \bold x}=\begin{bmatrix}\nabla f_1(\bold x)\\ \nabla f_2(\bold x)\\ \vdots\\ \nabla f_m(\bold x)\end{bmatrix}=\begin{bmatrix}\frac{\partial}{\partial \bold x}f_1(\bold x)\\ \frac{\partial}{\partial \bold x}f_2(\bold x)\\ \vdots\\ \frac{\partial}{\partial \bold x}f_m(\bold x)\end{bmatrix}=\begin{bmatrix}\frac{\partial}{\partial x_1}f_1(\bold x) & \frac{\partial}{\partial x_2}f_1(\bold x) & \dots & \frac{\partial}{\partial x_n}f_1(\bold x)\\ \frac{\partial}{\partial x_1}f_2(\bold x) & \frac{\partial}{\partial x_2}f_2(\bold x) & \dots & \frac{\partial}{\partial x_n}f_2(\bold x)\\ \vdots & \vdots & & \vdots\\ \frac{\partial}{\partial x_1}f_m(\bold x) & \frac{\partial}{\partial x_2}f_m(\bold x) & \dots & \frac{\partial}{\partial x_n}f_m(\bold x)\end{bmatrix}$$
As a special case, suppose $\bold f(\bold x)=\bold x$, i.e. $f_i(\bold x)=x_i$. There are $n$ functions, each of $n$ parameters, so the Jacobian is a square matrix, and it is easy to see that it is the identity matrix $I$.
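The following is a minimal sketch (Python/NumPy, not from the tutorial; the helper numerical_jacobian and the test values are illustrative) of a finite-difference Jacobian. It numerically confirms that the Jacobian of the identity map $\bold f(\bold x)=\bold x$ is the identity matrix $I$.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate the m x n Jacobian of f: R^n -> R^m at x by central differences."""
    x = np.asarray(x, dtype=float)
    y0 = np.asarray(f(x), dtype=float)
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        # Column j holds the partial derivatives with respect to x_j.
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

identity = lambda x: x                      # f(x) = x, so f_i(x) = x_i
x = np.array([1.0, 2.0, 3.0])
print(np.round(numerical_jacobian(identity, x), 6))  # ~ the 3x3 identity matrix
```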
3. Derivatives of Element-Wise Binary Operations on Vectors
There are many element-wise binary operations on vectors, such as vector addition and subtraction, element-wise multiplication, and multiplying a vector by a scalar. In general such an operation can be written in the form $\bold y = \bold f(\bold w) \circ \bold g(\bold x)$, where $\circ$ stands for any element-wise operator.
For the simple element-wise operations (addition, subtraction, multiplication, division), the partial derivatives of $\bold y$ with respect to $\bold w$ and $\bold x$ are diagonal Jacobians: for example $\frac{\partial (\bold w+\bold x)}{\partial \bold w}=I$ and $\frac{\partial (\bold w\otimes\bold x)}{\partial \bold w}=diag(\bold x)$, where $\otimes$ denotes element-wise multiplication (the original tutorial tabulates the full set of results). A quick numerical check follows below.
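The check below is a sketch (Python/NumPy; the helper and the test values are chosen only for illustration) verifying the diagonal structure for element-wise multiplication:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian of f: R^n -> R^m at x."""
    y0 = np.asarray(f(x), dtype=float)
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

w = np.array([1.0, 2.0, 3.0])
x = np.array([4.0, 5.0, 6.0])

# y = w * x (element-wise): dy/dx should be diag(w) and dy/dw should be diag(x).
print(np.allclose(numerical_jacobian(lambda x_: w * x_, x), np.diag(w)))  # True
print(np.allclose(numerical_jacobian(lambda w_: w_ * x, w), np.diag(x)))  # True
```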
4. Derivatives Involving Scalar Operations on Vectors
In short, this covers adding a scalar to a vector or multiplying a vector by a scalar and then differentiating. The operations involved are still element-wise, so the scalar can be thought of as expanded into a vector whose entries all hold the same value. The resulting derivatives are simple:
$$\frac{\partial \bold y}{\partial \bold x}=\frac{\partial (\bold x+z)}{\partial \bold x}=I$$
$$\frac{\partial \bold y}{\partial z}=\frac{\partial (\bold x+z)}{\partial z}=\bold 1$$
$$\frac{\partial \bold y}{\partial \bold x}=\frac{\partial (\bold xz)}{\partial \bold x}=Iz$$
$$\frac{\partial \bold y}{\partial z}=\frac{\partial (\bold xz)}{\partial z}=\bold x$$
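To see where $Iz$ comes from, here is a short component-wise check (not in the original notes): entry $(i,j)$ of $\frac{\partial (\bold xz)}{\partial \bold x}$ is $\frac{\partial (x_i z)}{\partial x_j}$, which equals $z$ when $i=j$ and $0$ otherwise, so the whole matrix is $zI=Iz$.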
5. Derivatives of Vector Sum Reduction
Summing the elements of a vector is an important operation in deep learning, for example when computing a network's loss function. The same approach also applies to the dot product and other operations that reduce a vector to a scalar.
$$y=sum(\bold f(\bold x))=\sum_{i=1}^{n}f_i(\bold x)$$
$$\frac{\partial y}{\partial \bold x}=\begin{bmatrix}\frac{\partial y}{\partial x_1} & \frac{\partial y}{\partial x_2} & \dots & \frac{\partial y}{\partial x_n}\end{bmatrix}$$
Some concise results:
$$y=sum(\bold x),\qquad \nabla y=\bold 1$$
$$y=sum(\bold xz),\qquad \frac{\partial y}{\partial \bold x}=\bold 1z,\qquad \frac{\partial y}{\partial z}=sum(\bold x)$$
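A quick numerical spot check of these results (a Python/NumPy sketch; the helper and the test values are illustrative, not from the tutorial):

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference gradient of a scalar-valued f: R^n -> R at x."""
    g = np.zeros_like(x)
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        g[j] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return g

x = np.array([1.0, 2.0, 3.0])
z = 4.0

print(numerical_gradient(lambda x_: np.sum(x_), x))       # ~ [1, 1, 1], the ones vector
print(numerical_gradient(lambda x_: np.sum(x_ * z), x))   # ~ [z, z, z], i.e. 1·z
# d/dz sum(x z) should be sum(x) = 6:
print((np.sum(x * (z + 1e-6)) - np.sum(x * (z - 1e-6))) / 2e-6)
```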
6. The Chain Rule
The basic matrix differentiation rules cannot handle the partial derivatives of more complicated functions, such as nested functions; for those we must combine the basic vector rules with the vector chain rule. Unfortunately, several different rules all go by the name "chain rule", so we have to be careful about which one we are using.
6.1 The Single-Variable Chain Rule
The single-variable chain rule is simply the chain rule we already know, applicable when there is only one variable.
6.2 The Single-Variable Total-Derivative Chain Rule
The single-variable chain rule has limited applicability: every intermediate variable must be a function of a single variable. The idea behind the total derivative is that, to compute $\frac{dy}{dx}$, we must add up all the possible contributions of a change in $x$ to the change in $y$. The total derivative with respect to $x$ assumes that every variable is a function of $x$ and may therefore change as $x$ changes. For example, if $f(x)=u_2(x,u_1)$ depends on $x$ both directly and indirectly through the intermediate variable $u_1(x)$, then the total derivative is:
$$\frac{dy}{dx}=\frac{\partial f(x)}{\partial x}=\frac{\partial u_2(x,u_1)}{\partial x}=\frac{\partial u_2}{\partial x}\frac{\partial x}{\partial x}+\frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x}=\frac{\partial u_2}{\partial x}+\frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x}$$
Its general form is:
$$\frac{\partial f(u_1,\dots,u_{n+1})}{\partial x}=\sum_{i=1}^{n+1}\frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial x}$$
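A small symbolic check of the total-derivative rule (a sketch using SymPy; the particular choice $u_1(x)=x^2$, $u_2(x,u_1)=x+u_1$ is hypothetical and not from the notes):

```python
import sympy as sp

x = sp.symbols('x')
u1_sym = sp.symbols('u1')

u1 = x**2           # hypothetical intermediate variable u1(x)
u2 = x + u1_sym     # hypothetical u2(x, u1), so f(x) = u2(x, u1(x)) = x + x**2

# Total-derivative chain rule: df/dx = du2/dx + (du2/du1)(du1/dx)
total = sp.diff(u2, x) + sp.diff(u2, u1_sym) * sp.diff(u1, x)

# Direct differentiation of the composed function, for comparison.
direct = sp.diff(u2.subs(u1_sym, u1), x)

print(sp.simplify(total - direct))  # 0: both give 1 + 2*x
```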
6.3 The Vector Chain Rule
We start by computing the derivative of a vector function $\bold y=\bold f(\bold g(x))$ with respect to a scalar, to see whether a general formula can be abstracted. Clearly:
$$\frac{\partial \bold y}{\partial x}=\begin{bmatrix}\frac{\partial f_1(\bold g)}{\partial x}\\ \frac{\partial f_2(\bold g)}{\partial x}\end{bmatrix}=\begin{bmatrix}\frac{\partial f_1}{\partial g_1}\frac{\partial g_1}{\partial x}+\frac{\partial f_1}{\partial g_2}\frac{\partial g_2}{\partial x}\\ \frac{\partial f_2}{\partial g_1}\frac{\partial g_1}{\partial x}+\frac{\partial f_2}{\partial g_2}\frac{\partial g_2}{\partial x}\end{bmatrix}=\begin{bmatrix}\frac{\partial f_1}{\partial g_1} & \frac{\partial f_1}{\partial g_2}\\ \frac{\partial f_2}{\partial g_1} & \frac{\partial f_2}{\partial g_2}\end{bmatrix}\begin{bmatrix}\frac{\partial g_1}{\partial x}\\ \frac{\partial g_2}{\partial x}\end{bmatrix}=\frac{\partial \bold f}{\partial \bold g}\frac{\partial \bold g}{\partial x}$$
When the variable $x$ is extended to a vector $\bold x$ and we again form the Jacobian, the complete vector chain rule becomes:
$$\frac{\partial}{\partial \bold x}\bold f(\bold g(\bold x))=\frac{\partial \bold f}{\partial \bold g}\frac{\partial \bold g}{\partial \bold x}$$
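The following sketch (Python/NumPy; the functions f and g are hypothetical, chosen only to exercise the rule) checks that the product of the two Jacobians matches a finite-difference Jacobian of the composition:

```python
import numpy as np

# Hypothetical maps: g(x) = [x1 + x2, x1 * x2] and f(g) = [g1**2, g1 + g2].
g = lambda x: np.array([x[0] + x[1], x[0] * x[1]])
f = lambda g_: np.array([g_[0] ** 2, g_[0] + g_[1]])

Jg = lambda x: np.array([[1.0, 1.0],           # dg1/dx1, dg1/dx2
                         [x[1], x[0]]])        # dg2/dx1, dg2/dx2
Jf = lambda g_: np.array([[2 * g_[0], 0.0],    # df1/dg1, df1/dg2
                          [1.0, 1.0]])         # df2/dg1, df2/dg2

x = np.array([1.5, -2.0])

# Vector chain rule: Jacobian of f(g(x)) equals Jf(g(x)) @ Jg(x).
chain = Jf(g(x)) @ Jg(x)

# Finite-difference Jacobian of the composed map, for comparison.
eps = 1e-6
fd = np.column_stack([
    (f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps) for e in np.eye(2)
])
print(np.allclose(chain, fd))  # True
```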
This equation can be simplified further, because in many cases the Jacobians are square ($m=n$) and their off-diagonal entries are all zero. This is a natural property of neural networks, which deal with functions of vectors rather than vectors of functions: for instance, the neuron's affine function is $sum(\bold w \otimes \bold x)$ and the activation function is $max(0,\bold x)$.
As discussed earlier, element-wise operations on the vectors $\bold w$ and $\bold x$ give Jacobians that are diagonal matrices built from terms like $\frac{\partial w_i}{\partial x_i}$, because element $i$ of the output is a function of $x_i$ alone and not of $x_j$ ($j\ne i$):
$$\frac{\partial \bold f}{\partial \bold g}=diag\left(\frac{\partial f_i}{\partial g_i}\right)$$
$$\frac{\partial \bold g}{\partial \bold x}=diag\left(\frac{\partial g_i}{\partial x_i}\right)$$
In that case the chain rule simplifies to:
$$\frac{\partial}{\partial \bold x}\bold f(\bold g(\bold x))=diag\left(\frac{\partial f_i}{\partial g_i}\right)diag\left(\frac{\partial g_i}{\partial x_i}\right)=diag\left(\frac{\partial f_i}{\partial g_i}\frac{\partial g_i}{\partial x_i}\right)$$
6.4 Summary
To sum up, the original tutorial gathers these cases into a table showing which product of partial-derivative terms the chain rule uses to compute the Jacobian in each situation.
7. Gradient of the Activation Function
The activation function:
$$activation(\bold x)=max(0,\bold w \cdot \bold x+b)$$
Introduce the intermediate variables
$$\bold u = \bold w \otimes \bold x,\qquad y=sum(\bold u)+b,$$
so that $y=\bold w \cdot \bold x+b$.
Partial derivatives:
$$\frac{\partial y}{\partial \bold w}=\frac{\partial}{\partial \bold w}\bold w\cdot \bold x+\frac{\partial}{\partial \bold w}b=\bold x^T+\bold 0^T=\bold x^T$$
$$\frac{\partial y}{\partial b}=\frac{\partial}{\partial b}\bold w\cdot \bold x+\frac{\partial}{\partial b}b=0+1=1$$
Partial derivatives of the activation function:
$$\frac{\partial activation}{\partial \bold w}=\begin{cases}\bold 0^T, &\bold w\cdot \bold x+b \le 0 \\ \bold x^T, & \bold w\cdot \bold x+b>0 \end{cases}$$
$$\frac{\partial activation}{\partial b}=\begin{cases}0, &\bold w\cdot \bold x+b \le 0 \\ 1, & \bold w\cdot \bold x+b>0 \end{cases}$$
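A direct transcription of these piecewise results into code (a Python/NumPy sketch; the sample values are illustrative, and the boundary case $\bold w\cdot\bold x+b=0$ follows the "$\le 0$" branch above):

```python
import numpy as np

def relu_unit_grads(w, x, b):
    """Gradients of activation = max(0, w·x + b) with respect to w and b."""
    z = np.dot(w, x) + b
    if z > 0:
        return x.copy(), 1.0        # d(activation)/dw = x^T, d(activation)/db = 1
    return np.zeros_like(x), 0.0    # the unit is "off": both gradients vanish

w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 0.5])
b = 1.0
dw, db = relu_unit_grads(w, x, b)
print(dw, db)  # here w·x + b = 0.5 > 0, so dw = x and db = 1
```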
8. Gradient of the Loss Function
Training a neural network requires the derivative of the loss (or cost) function with respect to the model parameters $\bold w$ and $b$. Since we train with multiple input vectors (e.g. multiple images) and their scalar targets (e.g. one class label per image), let
$$X=[\bold x_1,\bold x_2,\dots,\bold x_N]^T,\qquad \bold y=[target(\bold x_1),target(\bold x_2),\dots,target(\bold x_N)]^T.$$
The cost function is then:
$$C(\bold w,b,X,\bold y)=\frac{1}{N}\sum_{i=1}^{N}(y_i-activation(\bold x_i))^2=\frac{1}{N}\sum_{i=1}^{N}\bigl(y_i-max(0,\bold w\cdot \bold x_i+b)\bigr)^2$$
Now introduce the intermediate variables:
$$u(\bold w,b,\bold x)=max(0,\bold w\cdot \bold x+b)$$
$$v(y,u)=y-u$$
$$C(v)=\frac{1}{N}\sum_{i=1}^{N}v^2$$
The partial derivative with respect to the weights is therefore:
$$\frac{\partial C(v)}{\partial \bold w}=\frac{\partial}{\partial \bold w}\frac{1}{N}\sum_{i=1}^{N}v^2=\frac{1}{N}\sum_{i=1}^{N}2v\frac{\partial v}{\partial \bold w}=\frac{1}{N}\sum_{i=1}^{N}\begin{cases}\bold 0^T, &\bold w\cdot \bold x_i+b \le 0 \\ -2v\bold x_i^T, & \bold w\cdot \bold x_i+b>0 \end{cases}$$
$$=\frac{1}{N}\sum_{i=1}^{N}\begin{cases}\bold 0^T, &\bold w\cdot \bold x_i+b \le 0 \\ -2(y_i-u)\bold x_i^T, & \bold w\cdot \bold x_i+b>0 \end{cases}=\begin{cases}\bold 0^T, &\bold w\cdot \bold x_i+b \le 0 \\ \frac{2}{N}\sum_{i=1}^{N}(\bold w \cdot\bold x_i+b-y_i)\bold x_i^T, & \bold w\cdot \bold x_i+b>0 \end{cases}$$
Defining the error term $e_i=\bold w \cdot\bold x_i+b-y_i$, the partial derivative in the non-zero (active) case is
$$\frac{\partial C}{\partial \bold w}=\frac{2}{N}\sum_{i=1}^{N}e_i\bold x_i^T.$$
Notice that this result is a weighted average of all the $\bold x_i$ in $X$, with the error terms as the weights.
The partial derivative with respect to the bias is:
$$\frac{\partial C(v)}{\partial b}=\frac{\partial}{\partial b}\frac{1}{N}\sum_{i=1}^{N}v^2=\frac{1}{N}\sum_{i=1}^{N}2v\frac{\partial v}{\partial b}=\frac{1}{N}\sum_{i=1}^{N}\begin{cases}0, &\bold w\cdot \bold x_i+b \le 0 \\ -2v, & \bold w\cdot \bold x_i+b>0 \end{cases}$$
$$=\begin{cases}0, &\bold w\cdot \bold x_i+b \le 0 \\ \frac{2}{N}\sum_{i=1}^{N}(\bold w \cdot\bold x_i+b-y_i), & \bold w\cdot \bold x_i+b>0 \end{cases}=\frac{2}{N}\sum_{i=1}^{N}e_i$$
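Putting both results together in code (a Python/NumPy sketch of the formulas above; the function name mse_relu_grads and the synthetic data are illustrative):

```python
import numpy as np

def mse_relu_grads(w, b, X, y):
    """Gradients of C = (1/N) Σ (y_i - max(0, w·x_i + b))^2 with respect to w and b.

    Samples with w·x_i + b <= 0 contribute nothing; active samples contribute
    (2/N) * e_i * x_i (for w) and (2/N) * e_i (for b), with e_i = w·x_i + b - y_i.
    """
    z = X @ w + b                     # pre-activations, shape (N,)
    e = np.where(z > 0, z - y, 0.0)   # error terms, zeroed for inactive samples
    N = X.shape[0]
    dC_dw = (2.0 / N) * (e @ X)       # Σ e_i x_i^T, as a length-n vector
    dC_db = (2.0 / N) * np.sum(e)
    return dC_dw, dC_db

# Small synthetic example (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)
w = rng.normal(size=3)
b = 0.2
print(mse_relu_grads(w, b, X, y))
```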
In practice it is more convenient to fold the vector $\bold w$ and the scalar $b$ into a single vector, $\hat{\bold w}=[\bold w^T,b]^T$, and to extend the input vector $\bold x$ to $\hat{\bold x}=[\bold x^T,1]^T$, so that $\bold w \cdot\bold x+b=\hat{\bold w}\cdot\hat{\bold x}$.
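A one-line numerical confirmation of this "bias trick" (a Python/NumPy sketch with illustrative values):

```python
import numpy as np

w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 0.5])
b = 1.0

w_hat = np.append(w, b)    # ŵ = [w^T, b]^T
x_hat = np.append(x, 1.0)  # x̂ = [x^T, 1]^T

# w·x + b equals ŵ·x̂, so the bias can be folded into the weight vector.
print(np.dot(w, x) + b, np.dot(w_hat, x_hat))  # two identical values
```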