In the last chapter we saw how neural networks can learn their weights and biases using the gradient descent algorithm. There was, however, a gap in our explanation: we didn't discuss how to compute the gradient of the cost function. That's quite a gap! In this chapter I'll explain a fast algorithm for computing such gradients, an algorithm known as backpropagation.

The backpropagation algorithm was originally introduced in the 1970s, but its importance wasn't fully appreciated until a famous 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams. That paper describes several neural networks where backpropagation works far faster than earlier approaches to learning, making it possible to use neural nets to solve problems which had previously been insoluble. Today, the backpropagation algorithm is the workhorse of learning in neural networks.

This chapter is more mathematically involved than the rest of the book. If you're not crazy about mathematics you may be tempted to skip the chapter, and to treat backpropagation as a black box whose details you're willing to ignore. Why take the time to study those details?

The reason, of course, is understanding. At the heart of backpropagation is an expression for the partial derivative $\partial C / \partial w$ of the cost function $C$ with respect to any weight $w$ (or bias $b$) in the network. The expression tells us how quickly the cost changes when we change the weights and biases. And while the expression is somewhat complex, it also has a beauty to it, with each element having a natural, intuitive interpretation. And so backpropagation isn't just a fast algorithm for learning. It actually gives us detailed insights into how changing the weights and biases changes the overall behaviour of the network. That's well worth studying in detail.

With that said, if you want to skim the chapter, or jump straight to the next chapter, that's fine. I've written the rest of the book to be accessible even if you treat backpropagation as a black box. There are, of course, points later in the book where I refer back to results from this chapter. But at those points you should still be able to understand the main conclusions, even if you don't follow all the reasoning.

 

Warm up: a fast matrix-based approach to computing the output from a neural network

 

Before discussing backpropagation, let's warm up with a fast matrix-based algorithm to compute the output from a neural network. We actually already briefly saw this algorithm near the end of the last chapter, but I described it quickly, so it's worth revisiting in detail. In particular, this is a good way of getting comfortable with the notation used in backpropagation, in a familiar context.

Let's begin with a notation which lets us refer to weights in the network in an unambiguous way. We'll use $w^l_{jk}$ to denote the weight for the connection from the $k$th neuron in the $(l-1)$th layer to the $j$th neuron in the $l$th layer. So, for example, the diagram below shows the weight on a connection from the fourth neuron in the second layer to the second neuron in the third layer of a network:

[Figure: the weight $w^3_{24}$ on the connection from the fourth neuron in the second layer to the second neuron in the third layer]

This notation is cumbersome at first, and it does take some work to master. But with a little effort you'll find the notation becomes easy and natural. One quirk of the notation is the ordering of the $j$ and $k$ indices. You might think that it makes more sense to use $j$ to refer to the input neuron, and $k$ to the output neuron, not vice versa, as is actually done. I'll explain the reason for this quirk below.

 

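We use a similar notation for the network's biases and activations. Explicitly, we use $b^l_j$ for the bias of the $j$th neuron in the $l$th layer, and $a^l_j$ for the activation of the $j$th neuron in the $l$th layer. The following diagram shows examples of these notations in use: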

[Figure: a network diagram with example biases and activations labelled in this notation]

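With these notations, the activation $a^l_j$ of the $j$th neuron in the $l$th layer is related to the activations in the $(l-1)$th layer by the equation

$$a^l_j = \sigma\left(\sum_k w^l_{jk} a^{l-1}_k + b^l_j\right), \tag{23}$$

where the sum is over all neurons $k$ in the $(l-1)$th layer. To rewrite this expression in matrix form we define a weight matrix $w^l$ for each layer $l$, whose entry in the $j$th row and $k$th column is $w^l_{jk}$. Similarly, for each layer $l$ we define a bias vector $b^l$ whose components are the values $b^l_j$, and an activation vector $a^l$ whose components are the activations $a^l_j$.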

 

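The last ingredient we need to rewrite Equation (23) in matrix form is the idea of vectorizing a function such as $\sigma$, that is, applying the function to every element of a vector $v$. We write $\sigma(v)$ for this elementwise application, so the components of $\sigma(v)$ are just $\sigma(v)_j = \sigma(v_j)$. As an example, if we have the function $f(x) = x^2$ then the vectorized form of $f$ has the effect

$$f\left(\begin{bmatrix} 2 \\ 3 \end{bmatrix}\right) = \begin{bmatrix} f(2) \\ f(3) \end{bmatrix} = \begin{bmatrix} 4 \\ 9 \end{bmatrix}, \tag{24}$$

that is, the vectorized $f$ just squares every element of the vector.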
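Vectorizing a function is easy to try out in code. The following is a minimal sketch, assuming NumPy is available; the names `f`, `sigmoid`, `squared` and `activated` are illustrative choices, not names taken from the book's code.

```python
import numpy as np

def f(x):
    # f(x) = x^2; on a NumPy array, ** acts on every element.
    return x ** 2

def sigmoid(z):
    # The sigmoid function, applied elementwise when z is an array.
    return 1.0 / (1.0 + np.exp(-z))

v = np.array([2, 3])
squared = f(v)                             # the example in Equation (24)
activated = sigmoid(np.array([0.0, 2.0]))  # a vectorized sigma

print(squared)
print(activated)
```

Because NumPy arithmetic already works element by element, no explicit loop is needed to vectorize either function.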

 

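With these notations in mind, Equation (23) can be rewritten in the compact vectorized form

$$a^l = \sigma(w^l a^{l-1} + b^l). \tag{25}$$

This expression gives us a much more global way of thinking about how the activations in one layer relate to activations in the previous layer: we just apply the weight matrix to the activations, then add the bias vector, and finally apply the $\sigma$ function. It is also this expression that motivates the quirk in the $w^l_{jk}$ notation mentioned earlier: if we used $j$ to index the input neuron and $k$ to index the output neuron, we would need to replace the weight matrix in Equation (25) by its transpose. The code in the last chapter made implicit use of this expression to compute the behaviour of the network.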
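To see that the neuron-by-neuron form and the matrix form agree, here is a small sketch for a single layer. It assumes NumPy, uses 1-D arrays rather than column vectors, and uses zero-based indices; the layer sizes and random values are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_prev, n_cur = 4, 3                         # hypothetical layer sizes
w = rng.standard_normal((n_cur, n_prev))     # w[j, k]: from neuron k (layer l-1) to neuron j (layer l)
b = rng.standard_normal(n_cur)               # b[j]: bias of neuron j in layer l
a_prev = rng.standard_normal(n_prev)         # activations of layer l-1

# Componentwise form, Equation (23): one explicit sum per neuron j.
a_loop = np.array([
    sigmoid(sum(w[j, k] * a_prev[k] for k in range(n_prev)) + b[j])
    for j in range(n_cur)
])

# Matrix form, Equation (25): one matrix-vector product for the whole layer.
a_matrix = sigmoid(w @ a_prev + b)

print(np.allclose(a_loop, a_matrix))
```

Storing the weight into neuron `j` in row `j` is what lets the matrix form use `w` directly, with no transpose.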

 

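When using Equation (25) to compute $a^l$, we compute the intermediate quantity $z^l \equiv w^l a^{l-1} + b^l$ along the way. This quantity is useful enough to be worth naming: we call $z^l$ the weighted input to the neurons in layer $l$. Equation (25) can then be written as $a^l = \sigma(z^l)$, and $z^l$ has components $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$, that is, $z^l_j$ is the weighted input to the activation function for neuron $j$ in layer $l$.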
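A forward pass that keeps the weighted inputs around might look like the sketch below. It assumes NumPy and a network stored as Python lists of weight matrices and bias vectors; the function name `feedforward`, the layer sizes and the random initialisation are illustrative assumptions rather than the book's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(weights, biases, x):
    """Apply a = sigma(w a_prev + b) layer by layer, recording every z and a."""
    activations = [x]   # activations[0] is the input layer
    zs = []             # zs[i] is the weighted input to the (i+1)th stored layer
    a = x
    for w, b in zip(weights, biases):
        z = w @ a + b   # weighted input to this layer
        a = sigmoid(z)  # activation of this layer
        zs.append(z)
        activations.append(a)
    return zs, activations

sizes = [2, 3, 1]       # hypothetical network: 2 inputs, 3 hidden neurons, 1 output
rng = np.random.default_rng(1)
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(n_out) for n_out in sizes[1:]]

zs, activations = feedforward(weights, biases, np.array([0.5, -1.0]))
print(activations[-1])
```

Keeping `zs` costs almost nothing during the forward pass, and the weighted inputs are exactly what the backpropagation equations later in the chapter consume.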

 

The two assumptions we need about the cost function

 

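The goal of backpropagation is to compute the partial derivatives $\partial C / \partial w$ and $\partial C / \partial b$ of the cost function $C$ with respect to any weight $w$ or bias $b$ in the network. For backpropagation to work we need to make two main assumptions about the form of the cost function. Before stating those assumptions, it's useful to have an example cost function in mind. We'll use the quadratic cost function, which in the notation of the last section has the form

$$C = \frac{1}{2n} \sum_x \|y(x) - a^L(x)\|^2, \tag{26}$$

where: $n$ is the total number of training examples; the sum is over individual training examples, $x$; $y = y(x)$ is the corresponding desired output; $L$ denotes the number of layers in the network; and $a^L = a^L(x)$ is the vector of activations output from the network when $x$ is input.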

 

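Okay, so what assumptions do we need to make about our cost function, $C$, in order that backpropagation can be applied? The first assumption we need is that the cost function can be written as an average $C = \frac{1}{n} \sum_x C_x$ over cost functions $C_x$ for individual training examples, $x$. This is the case for the quadratic cost function, where the cost for a single training example is $C_x = \frac{1}{2} \|y - a^L\|^2$. This assumption will also hold true for all the other cost functions we'll meet in this book.

The reason we need this assumption is that what backpropagation actually lets us do is compute the partial derivatives $\partial C_x / \partial w$ and $\partial C_x / \partial b$ for a single training example. We then recover $\partial C / \partial w$ and $\partial C / \partial b$ by averaging over training examples. With this assumption in mind, we'll suppose the training example $x$ has been fixed, and drop the $x$ subscript, writing the cost $C_x$ as $C$. We'll eventually put the $x$ back in, but for now it's a notational nuisance that is better left implicit.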
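The averaging assumption can be checked on a toy example. This sketch assumes NumPy; the desired outputs `ys` and network outputs `outputs` are made-up numbers, and the function names are illustrative.

```python
import numpy as np

def quadratic_cost_single(y, a_L):
    # C_x = 1/2 * ||y - a^L||^2 for one training example.
    return 0.5 * np.sum((y - a_L) ** 2)

def quadratic_cost(ys, outputs):
    # C = (1/n) * sum over x of C_x: the average of the per-example costs.
    return np.mean([quadratic_cost_single(y, a) for y, a in zip(ys, outputs)])

# Two hypothetical training examples: desired outputs and network outputs.
ys = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
outputs = [np.array([0.8, 0.1]), np.array([0.3, 0.6])]

averaged = quadratic_cost(ys, outputs)

# The same quantity computed directly from Equation (26).
n = len(ys)
direct = sum(np.sum((y - a) ** 2) for y, a in zip(ys, outputs)) / (2 * n)

print(averaged, direct)
```

The two numbers agree, which is why it is enough to compute gradients one training example at a time and average them afterwards.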

The second assumption we make about the cost is that it can be written as a function of the outputs from the neural network:

[Figure: the cost written as a function of the network's output activations, $C = C(a^L)$]

For example, the quadratic cost function satisfies this requirement, since the quadratic cost for a single training example $x$ may be written as

$$C = \frac{1}{2} \|y - a^L\|^2 = \frac{1}{2} \sum_j (y_j - a^L_j)^2, \tag{27}$$

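and thus is a function of the output activations. Of course, this cost function also depends on the desired output $y$, and you may wonder why we're not regarding the cost also as a function of $y$. Remember, though, that the input training example $x$ is fixed, and so the output $y$ is also a fixed parameter. In particular, it's not something we can modify by changing the weights and biases, i.e., it's not something which the neural network learns. And so it makes sense to regard $C$ as a function of the output activations $a^L$ alone, with $y$ merely a parameter that helps define that function.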

The Hadamard product, s⊙t

 

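The backpropagation algorithm is based on common linear algebraic operations: things like vector addition, multiplying a vector by a matrix, and so on. But one of the operations is a little less commonly used. In particular, suppose $s$ and $t$ are two vectors of the same dimension. Then we use $s \odot t$ to denote the elementwise product of the two vectors. Thus the components of $s \odot t$ are just $(s \odot t)_j = s_j t_j$. As an example,

$$\begin{bmatrix} 1 \\ 2 \end{bmatrix} \odot \begin{bmatrix} 3 \\ 4 \end{bmatrix} = \begin{bmatrix} 1 * 3 \\ 2 * 4 \end{bmatrix} = \begin{bmatrix} 3 \\ 8 \end{bmatrix}. \tag{28}$$

This kind of elementwise multiplication is sometimes called the Hadamard product or Schur product. We'll refer to it as the Hadamard product. Good matrix libraries usually provide fast implementations of the Hadamard product, and that comes in handy when implementing backpropagation.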
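In NumPy, for instance, the Hadamard product of two arrays of the same shape is just the `*` operator (equivalently `np.multiply`). A short sketch, with the dot product shown only for contrast:

```python
import numpy as np

s = np.array([1, 2])
t = np.array([3, 4])

hadamard = s * t                   # elementwise product, as in Equation (28)
hadamard_alt = np.multiply(s, t)   # the same operation, spelled out
dot = s @ t                        # for contrast: the dot product is a single number

print(hadamard)
print(dot)
```

Keeping `*` (elementwise) and `@` (matrix product) apart is a common source of bugs when translating the backpropagation equations into code.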

 

 

The four fundamental equations behind backpropagation

 

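Backpropagation is about understanding how changing the weights and biases in a network changes the cost function. Ultimately, this means computing the partial derivatives $\partial C / \partial w^l_{jk}$ and $\partial C / \partial b^l_j$. But to compute those, we first introduce an intermediate quantity, $\delta^l_j$, which we call the error in the $j$th neuron in the $l$th layer. Backpropagation will give us a procedure to compute the error $\delta^l_j$, and then will relate $\delta^l_j$ to $\partial C / \partial w^l_{jk}$ and $\partial C / \partial b^l_j$.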

To understand how the error is defined, imagine there is a demon in our neural network:

[Figure: a demon sitting at a neuron in the network]

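The demon sits at the $j$th neuron in layer $l$. As the input to the neuron comes in, the demon messes with the neuron's operation. It adds a little change $\Delta z^l_j$ to the neuron's weighted input, so that instead of outputting $\sigma(z^l_j)$, the neuron instead outputs $\sigma(z^l_j + \Delta z^l_j)$. This change propagates through later layers in the network, finally causing the overall cost to change by an amount $\frac{\partial C}{\partial z^l_j} \Delta z^l_j$.

Now, this demon is a good demon, and is trying to help you improve the cost, i.e., they're trying to find a $\Delta z^l_j$ which makes the cost smaller. Suppose $\frac{\partial C}{\partial z^l_j}$ has a large value (either positive or negative). Then the demon can lower the cost quite a bit by choosing $\Delta z^l_j$ to have the opposite sign to $\frac{\partial C}{\partial z^l_j}$. By contrast, if $\frac{\partial C}{\partial z^l_j}$ is close to zero, then the demon can't improve the cost much at all by perturbing the weighted input $z^l_j$. So far as the demon can tell, the neuron is already pretty near optimal. And so there's a heuristic sense in which $\frac{\partial C}{\partial z^l_j}$ is a measure of the error in the neuron.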

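Motivated by this story, we define the error $\delta^l_j$ of neuron $j$ in layer $l$ by

$$\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}. \tag{29}$$

As per our usual conventions, we use $\delta^l$ to denote the vector of errors associated with layer $l$. Backpropagation will give us a way of computing $\delta^l$ for every layer, and then relating those errors to the quantities of real interest, $\partial C / \partial w^l_{jk}$ and $\partial C / \partial b^l_j$.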
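The demon's experiment can be imitated numerically: nudge one weighted input, rerun the rest of the network, and watch the cost. The sketch below assumes NumPy, the quadratic cost for a single example, and a small randomly initialised network; `cost_with_nudge` and `estimate_delta` are hypothetical helper names, and `layer` is a zero-based index into the list of weight matrices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_with_nudge(weights, biases, x, y, layer, j, dz):
    """Quadratic cost for one example when z[layer][j] is shifted by dz."""
    a = x
    for idx, (w, b) in enumerate(zip(weights, biases)):
        z = w @ a + b
        if idx == layer:
            z[j] += dz          # the demon's small change to the weighted input
        a = sigmoid(z)
    return 0.5 * np.sum((y - a) ** 2)

def estimate_delta(weights, biases, x, y, layer, j, eps=1e-6):
    """Central-difference estimate of the partial derivative of C w.r.t. z[layer][j]."""
    c_plus = cost_with_nudge(weights, biases, x, y, layer, j, eps)
    c_minus = cost_with_nudge(weights, biases, x, y, layer, j, -eps)
    return (c_plus - c_minus) / (2 * eps)

rng = np.random.default_rng(0)
sizes = [2, 3, 2]               # hypothetical network shape
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(n_out) for n_out in sizes[1:]]
x = np.array([0.5, -1.0])
y = np.array([1.0, 0.0])

delta = estimate_delta(weights, biases, x, y, layer=0, j=1)

# A small nudge dz should change the cost by roughly delta * dz.
dz = 1e-4
actual_change = (cost_with_nudge(weights, biases, x, y, 0, 1, dz)
                 - cost_with_nudge(weights, biases, x, y, 0, 1, 0.0))
predicted_change = delta * dz
print(actual_change, predicted_change)
```

This brute-force estimate needs two forward passes per neuron, so it is only useful as a check; backpropagation obtains all the errors far more cheaply.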

 

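You might wonder why the demon is changing the weighted input $z^l_j$. Surely it'd be more natural to imagine the demon changing the output activation $a^l_j$, with the result that we'd be using $\frac{\partial C}{\partial a^l_j}$ as our measure of error. In fact, if you do this things work out quite similarly to the discussion below. But it turns out to make the presentation of backpropagation a little more algebraically complicated. So we'll stick with $\delta^l_j = \frac{\partial C}{\partial z^l_j}$ as our measure of error. Note that in classification problems the term "error" is sometimes used to mean the classification failure rate. That is quite a different meaning from our $\delta$ vectors. In practice, you shouldn't have trouble telling which meaning is intended in any given usage.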

Plan of attack: Backpropagation is based around four fundamental equations. Together, those equations give us a way of computing both the error $\delta^l$ and the gradient of the cost function. I state the four equations below. Be warned, though: you shouldn't expect to instantaneously assimilate the equations. Such an expectation will lead to disappointment. In fact, the backpropagation equations are so rich that understanding them well requires considerable time and patience as you gradually delve deeper into the equations. The good news is that such patience is repaid many times over. And so the discussion in this section is merely a beginning, helping you on the way to a thorough understanding of the equations.

Here's a preview of the ways we'll delve more deeply into the equations later in the chapter: I'll give a short proof of the equations, which helps explain why they are true; we'll restate the equations in algorithmic form as pseudocode, and see how the pseudocode can be implemented as real, running Python code; and, in the final section of the chapter, we'll develop an intuitive picture of what the backpropagation equations mean, and how someone might discover them from scratch. Along the way we'll return repeatedly to the four fundamental equations, and as you deepen your understanding those equations will come to seem comfortable and, perhaps, even beautiful and natural.

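An equation for the error in the output layer, $\delta^L$: The components of $\delta^L$ are given by

$$\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j). \tag{BP1}$$

This is a very natural expression. The first term on the right, $\partial C / \partial a^L_j$, just measures how fast the cost is changing as a function of the $j$th output activation. If, for example, $C$ doesn't depend much on a particular output neuron, $j$, then $\delta^L_j$ will be small, which is what we'd expect. The second term on the right, $\sigma'(z^L_j)$, measures how fast the activation function $\sigma$ is changing at $z^L_j$.

Notice that everything in (BP1) is easily computed. In particular, we compute $z^L_j$ while computing the behaviour of the network, and it's only a small additional overhead to compute $\sigma'(z^L_j)$. The exact form of $\partial C / \partial a^L_j$ will, of course, depend on the form of the cost function. For example, if we're using the quadratic cost function then $C = \frac{1}{2} \sum_j (y_j - a^L_j)^2$, and so $\partial C / \partial a^L_j = (a^L_j - y_j)$, which obviously is easily computable.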

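Equation (BP1) is a componentwise expression for $\delta^L$. It's a perfectly good expression, but not the matrix-based form we want for backpropagation. However, it's easy to rewrite the equation in a matrix-based form, as

$$\delta^L = \nabla_a C \odot \sigma'(z^L). \tag{BP1a}$$

Here, $\nabla_a C$ is defined to be a vector whose components are the partial derivatives $\partial C / \partial a^L_j$. You can think of $\nabla_a C$ as expressing the rate of change of $C$ with respect to the output activations. Equations (BP1a) and (BP1) are equivalent, so from now on we'll use (BP1) to refer to both. As an example, in the case of the quadratic cost we have $\nabla_a C = (a^L - y)$, and so the fully matrix-based form of (BP1) becomes

$$\delta^L = (a^L - y) \odot \sigma'(z^L). \tag{30}$$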

As you can see, everything in this expression has a nice vector form, and is easily computed using a library such as Numpy.
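As a rough illustration, the output error for the quadratic cost takes one line of NumPy. The sketch assumes a sigmoid activation; the values of `z_L` and `y` are made up, and the finite-difference check at the end is only there to confirm the formula.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Derivative of the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z)).
    s = sigmoid(z)
    return s * (1.0 - s)

def output_error(z_L, y):
    # delta^L = (a^L - y) Hadamard sigma'(z^L), for the quadratic cost.
    a_L = sigmoid(z_L)
    return (a_L - y) * sigmoid_prime(z_L)

def quadratic_cost(z_L, y):
    return 0.5 * np.sum((y - sigmoid(z_L)) ** 2)

z_L = np.array([0.0, 2.0])   # hypothetical weighted inputs to the output layer
y = np.array([1.0, 0.0])     # hypothetical desired output

delta_L = output_error(z_L, y)

# Check against a central-difference estimate of dC/dz^L_j.
eps = 1e-6
numeric = np.zeros_like(z_L)
for j in range(len(z_L)):
    step = np.zeros_like(z_L)
    step[j] = eps
    numeric[j] = (quadratic_cost(z_L + step, y)
                  - quadratic_cost(z_L - step, y)) / (2 * eps)

print(delta_L)
print(numeric)
```

For the first output neuron the arithmetic can be done by hand: $\sigma(0) = 0.5$ and $\sigma'(0) = 0.25$, so $\delta^L_1 = (0.5 - 1) \times 0.25 = -0.125$.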

 

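An equation for the error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$: In particular

$$\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l), \tag{BP2}$$

where $(w^{l+1})^T$ is the transpose of the weight matrix $w^{l+1}$ for the $(l+1)$th layer. This equation appears complicated, but each element has a nice interpretation. Suppose we know the error $\delta^{l+1}$ at the $(l+1)$th layer. When we apply the transpose weight matrix, $(w^{l+1})^T$, we can think intuitively of this as moving the error backward through the network, giving us some sort of measure of the error at the output of the $l$th layer. We then take the Hadamard product $\odot \sigma'(z^l)$. This moves the error backward through the activation function in layer $l$, giving us the error $\delta^l$ in the weighted input to layer $l$.

By combining (BP2) with (BP1) we can compute the error $\delta^l$ for any layer in the network. We start by using (BP1) to compute $\delta^L$, then apply Equation (BP2) to compute $\delta^{L-1}$, then Equation (BP2) again to compute $\delta^{L-2}$, and so on, all the way back through the network.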
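The backward sweep just described can be sketched in a few lines. This is an illustrative sketch, not the book's implementation: it assumes NumPy, sigmoid activations, the quadratic cost for a single training example, and a network stored as lists of weight matrices and bias vectors; `backprop_errors` is a hypothetical name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_errors(weights, biases, x, y):
    """Return the error vectors, one per non-input layer, for one example."""
    # Forward pass: record the weighted input z of every layer.
    a = x
    zs = []
    for w, b in zip(weights, biases):
        z = w @ a + b
        zs.append(z)
        a = sigmoid(z)

    deltas = [None] * len(weights)
    # (BP1) with the quadratic cost: error in the output layer.
    deltas[-1] = (a - y) * sigmoid_prime(zs[-1])
    # (BP2): move the error backward, one layer at a time.
    for l in range(len(weights) - 2, -1, -1):
        deltas[l] = (weights[l + 1].T @ deltas[l + 1]) * sigmoid_prime(zs[l])
    return deltas

rng = np.random.default_rng(0)
sizes = [2, 3, 2]               # hypothetical network shape
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(n_out) for n_out in sizes[1:]]
x = np.array([0.5, -1.0])
y = np.array([1.0, 0.0])

deltas = backprop_errors(weights, biases, x, y)
for d in deltas:
    print(d)
```

One forward pass and one backward pass yield the errors for every layer, and each backward step is a single matrix-vector product followed by a Hadamard product.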

An equation for the rate of change of the cost with respect to any bias in the network: In particular:

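$$\frac{\partial C}{\partial b^l_j} = \delta^l_j. \tag{BP3}$$

That is, the error $\delta^l_j$ is exactly equal to the rate of change $\partial C / \partial b^l_j$.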
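Equation (BP3) is easy to test numerically: compute the errors with (BP1) and (BP2), then compare them with finite-difference estimates of the bias derivatives. The sketch below assumes NumPy, sigmoid activations, the quadratic cost for one example and a small random network; all function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def cost(weights, biases, x, y):
    # Quadratic cost for a single training example.
    a = x
    for w, b in zip(weights, biases):
        a = sigmoid(w @ a + b)
    return 0.5 * np.sum((y - a) ** 2)

def errors(weights, biases, x, y):
    # Errors from (BP1) and (BP2), one vector per non-input layer.
    a, zs = x, []
    for w, b in zip(weights, biases):
        zs.append(w @ a + b)
        a = sigmoid(zs[-1])
    deltas = [None] * len(weights)
    deltas[-1] = (a - y) * sigmoid_prime(zs[-1])
    for l in range(len(weights) - 2, -1, -1):
        deltas[l] = (weights[l + 1].T @ deltas[l + 1]) * sigmoid_prime(zs[l])
    return deltas

def numeric_bias_gradient(weights, biases, x, y, layer, eps=1e-6):
    # Central-difference estimate of dC/db for every bias in one layer.
    grad = np.zeros_like(biases[layer])
    for j in range(len(grad)):
        shifted = [b.copy() for b in biases]
        shifted[layer][j] += eps
        c_plus = cost(weights, shifted, x, y)
        shifted[layer][j] -= 2 * eps
        c_minus = cost(weights, shifted, x, y)
        grad[j] = (c_plus - c_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(2)
sizes = [3, 4, 2]               # hypothetical network shape
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(n_out) for n_out in sizes[1:]]
x = rng.standard_normal(3)
y = np.array([0.0, 1.0])

deltas = errors(weights, biases, x, y)
for layer in range(len(biases)):
    print(deltas[layer], numeric_bias_gradient(weights, biases, x, y, layer))
```

The agreement is no accident: $z^l_j$ depends on $b^l_j$ with coefficient one, so nudging the bias is the same as nudging the weighted input.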
