Chapter 2: How the backpropagation algorithm works
Key terms: backpropagation (BP), computational graph, chain rule.
Computational Graph
In gradient descent (GD) algorithms, we modify weights and biases by stepping against the gradient of the cost $C$:

$$w \to w' = w - \eta \frac{\partial C}{\partial w}, \qquad b \to b' = b - \eta \frac{\partial C}{\partial b}$$

where $\eta$ is the learning rate. Backpropagation is the algorithm for computing these partial derivatives efficiently.
An example of using a computational graph to compute partial derivatives:
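As a minimal sketch (the concrete graph here is my own choice, not from the original notes), consider the graph $f = (a + b)\,c$. Each node is evaluated forwards, then derivatives are propagated backwards through the graph by the chain rule:

```python
# Reverse-mode differentiation on the tiny computational graph f = (a + b) * c.
a, b, c = 2.0, 3.0, 4.0

# Forward pass: evaluate each node.
u = a + b          # intermediate node u = a + b
f = u * c          # output node f = u * c

# Backward pass: propagate df/d(node) from the output back to the inputs,
# applying the chain rule at each edge.
df_df = 1.0
df_du = df_df * c          # d(u*c)/du = c
df_dc = df_df * u          # d(u*c)/dc = u
df_da = df_du * 1.0        # d(a+b)/da = 1
df_db = df_du * 1.0        # d(a+b)/db = 1

print(df_da, df_db, df_dc)  # 4.0 4.0 5.0
```

The key point: each node's derivative is computed once and reused by everything upstream of it, which is exactly the saving backpropagation exploits.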
Notations
- Elementwise application of functions: $\sigma(v)$ denotes applying $\sigma$ to each component, i.e. $\sigma(v)_j = \sigma(v_j)$.
- Elementwise (Hadamard) product of two vectors of the same shape: $(s \odot t)_j = s_j t_j$.
- $w^l_{jk}$: the weight from neuron $k$ in layer $l-1$ to neuron $j$ in layer $l$.
- $b^l_j$: the bias of neuron $j$ in layer $l$.
- $a^l_j$: the activation of neuron $j$ in layer $l$, $a^l_j = \sigma\left(\sum_k w^l_{jk} a^{l-1}_k + b^l_j\right)$; in matrix form $a^l = \sigma(z^l)$ with weighted input $z^l = w^l a^{l-1} + b^l$.
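This notation maps directly onto NumPy. A minimal sketch (the 2-3-1 layer sizes and random weights are assumptions for illustration, not from the notes):

```python
import numpy as np

def sigmoid(z):
    # elementwise application of the sigmoid function
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# w[l] has shape (neurons in layer l, neurons in layer l-1),
# so w[l][j, k] is the weight from neuron k in layer l-1 to neuron j in layer l.
w = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
b = [rng.standard_normal((3, 1)), rng.standard_normal((1, 1))]

a = rng.standard_normal((2, 1))      # input activation a^0
for wl, bl in zip(w, b):
    z = wl @ a + bl                  # weighted input z^l = w^l a^{l-1} + b^l
    a = sigmoid(z)                   # activation a^l = sigma(z^l)

# Hadamard (elementwise) product of two same-shape vectors:
s, t = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print(s * t)                         # in NumPy, * is already elementwise
```

Note that `@` is the matrix product and `*` the Hadamard product; mixing the two up is a common source of shape bugs in backprop code.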
Backpropagation
If we computed each partial derivative by a separate forward traversal, we would revisit the same neurons many times. Backpropagation instead works backwards from the output, reusing intermediate results in the style of dynamic programming.
Try to build an intuition with the following equations.
Calculate $\delta^l$

Define the error of neuron $j$ in layer $l$ as $\delta^l_j = \partial C / \partial z^l_j$.
1) For the output layer $L$

Apply the chain rule:

$$\delta^L_j = \frac{\partial C}{\partial z^L_j} = \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z^L_j} = \frac{\partial C}{\partial a^L_j}\,\sigma'(z^L_j)$$

in shorthand:

$$\delta^L = \nabla_a C \odot \sigma'(z^L)$$
2) For a layer $l$ before the output layer

According to the chain rule, $\delta^l$ can be expressed through $\delta^{l+1}$:

$$\delta^l_j = \sum_k \frac{\partial C}{\partial z^{l+1}_k}\,\frac{\partial z^{l+1}_k}{\partial z^l_j} = \sum_k w^{l+1}_{kj}\,\delta^{l+1}_k\,\sigma'(z^l_j)$$

in shorthand:

$$\delta^l = \left((w^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l)$$
Calculate biases and weights
1) Biases

$$\frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial z^l_j}\,\frac{\partial z^l_j}{\partial b^l_j} = \delta^l_j$$

in shorthand:

$$\frac{\partial C}{\partial b^l} = \delta^l$$
2) Weights

$$\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^l_j}\,\frac{\partial z^l_j}{\partial w^l_{jk}} = a^{l-1}_k\,\delta^l_j$$
Depicted:
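The four equations above can be sketched end to end in NumPy. This is a minimal illustration, not the book's reference implementation: the 2-3-1 architecture and the quadratic cost $C = \tfrac{1}{2}\lVert a^L - y\rVert^2$ (for which $\nabla_a C = a^L - y$) are assumptions made here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(1)
w = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
b = [rng.standard_normal((3, 1)), rng.standard_normal((1, 1))]
x = rng.standard_normal((2, 1))
y = np.array([[1.0]])

# Forward pass: store every z^l and a^l for reuse in the backward pass.
a, zs, activations = x, [], [x]
for wl, bl in zip(w, b):
    z = wl @ a + bl
    zs.append(z)
    a = sigmoid(z)
    activations.append(a)

# Backward pass: the four equations in order.
grad_w = [np.zeros_like(wl) for wl in w]
grad_b = [np.zeros_like(bl) for bl in b]
delta = (activations[-1] - y) * sigmoid_prime(zs[-1])    # delta^L = grad_a C ⊙ σ'(z^L)
grad_b[-1] = delta                                       # ∂C/∂b^L = δ^L
grad_w[-1] = delta @ activations[-2].T                   # ∂C/∂w^L_{jk} = a^{L-1}_k δ^L_j
for l in range(len(w) - 2, -1, -1):
    delta = (w[l + 1].T @ delta) * sigmoid_prime(zs[l])  # δ^l = ((w^{l+1})^T δ^{l+1}) ⊙ σ'(z^l)
    grad_b[l] = delta
    grad_w[l] = delta @ activations[l].T

# Sanity check: compare one weight gradient against a finite difference.
def cost(w0_00):
    w0 = w[0].copy(); w0[0, 0] = w0_00
    a = x
    for wl, bl in zip([w0, w[1]], b):
        a = sigmoid(wl @ a + bl)
    return 0.5 * float(np.sum((a - y) ** 2))

eps = 1e-6
numeric = (cost(w[0][0, 0] + eps) - cost(w[0][0, 0] - eps)) / (2 * eps)
print(abs(numeric - grad_w[0][0, 0]) < 1e-6)  # True
```

The finite-difference check at the end is a standard way to validate a hand-written backward pass: perturb one parameter, recompute the cost, and compare the numerical slope to the analytic gradient.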