Week 1

What is a neural network?

It is a powerful learning algorithm inspired by how the brain works.

Example 1 - single neural network

Given data about the sizes of houses on the real estate market, you want to fit a function that predicts their price. This is a linear regression problem, because the price as a function of size is a continuous output.

We know the price can never be negative, so we use a function called the Rectified Linear Unit (ReLU), which starts at zero.
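As a toy sketch (the slope `w` and intercept `b` below are made-up numbers, not fitted values from the course), the ReLU-based price predictor could look like:

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit: 0 for negative inputs, z otherwise."""
    return np.maximum(0, z)

# Toy predictor: price = relu(w * size + b) can never go negative.
sizes = np.array([100.0, 500.0, 1000.0])
w, b = 0.5, -150.0    # made-up slope and intercept, for illustration only
prices = relu(w * sizes + b)
```

The smallest house gets a predicted price of zero rather than a negative number, which is exactly why the "neuron" uses ReLU here.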

COURSE 1 Neural Networks and Deep Learning

The input is the size of the house (x)

The output is the price (y)

The “neuron” implements the function ReLU (blue line)


Example 2 – Neural network with multiple features

The price of a house can be affected by several features, such as size, number of bedrooms, zip code, and neighborhood wealth. The role of the neural network is to predict the price, and it will automatically generate the hidden units. We only need to provide the inputs x and the output y.


Supervised learning with neural networks

In supervised learning, we are given a data set and already know what our correct output should look like,
having the idea that there is a relationship between the input and the output.
Supervised learning problems are categorized into “regression” and “classification” problems. In a
regression problem, we are trying to predict results within a continuous output, meaning that we are
trying to map input variables to some continuous function. In a classification problem, we are instead
trying to predict results in a discrete output. In other words, we are trying to map input variables into
discrete categories.

There are different types of neural networks: for example, Convolutional Neural Networks (CNNs), often used for image applications, and Recurrent Neural Networks (RNNs), used for one-dimensional sequence data such as translating English to Chinese, or data with a temporal component such as a text transcript. Autonomous driving uses a hybrid neural network architecture.

Neural Network examples


Structured vs unstructured data

Structured data refers to data with a defined meaning, such as price or age, whereas unstructured data refers to things like pixels, raw audio, and text.


Why is deep learning taking off?

Deep learning is taking off due to the large amount of data made available through the digitization of society, faster computation, and innovations in neural network algorithms.


Two things have to be considered to reach a high level of performance:

  1. Being able to train a big enough neural network
  2. A huge amount of labeled data

The process of training a neural network is iterative.


It can take a long time to train a neural network, which affects your productivity. Faster computation helps you iterate and improve algorithms more quickly.

Week 2

Binary Classification

In a binary classification problem, the result is a discrete value output (e.g. $y \in \{0, 1\}$).

Notation

  1. a training example:

    $(x, y)$, where $x \in \mathbb{R}^{n_x}$ and $y \in \{0, 1\}$

  2. m training examples:

    $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\}$, where $m = m_{\text{train}} =$ number of training examples

  3. matrix:

    $X = [x^{(1)}, x^{(2)}, \ldots, x^{(m)}] \in \mathbb{R}^{n_x \times m}$, $\quad Y = [y^{(1)}, y^{(2)}, \ldots, y^{(m)}] \in \mathbb{R}^{1 \times m}$

  4. goal:

    Given $x$, $\hat{y} = P(y = 1 \mid x)$, where $0 \le \hat{y} \le 1$

Logistic Regression

parameters

  1. The input features vector:

    $x \in \mathbb{R}^{n_x}$, where $n_x$ is the number of features

  2. The training label:

    $y \in \{0, 1\}$

  3. The weights:

    $w \in \mathbb{R}^{n_x}$, where $n_x$ is the number of features

  4. The threshold:

    $b \in \mathbb{R}$

  5. The output:

    $\hat{y} = \sigma(w^T x + b)$

  6. Sigmoid function:

    $s = \sigma(w^T x + b) = \sigma(z) = \dfrac{1}{1 + e^{-z}}$
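The sigmoid translates directly into NumPy (a quick sketch, not course-provided code):

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^-z): squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))
```

`sigmoid(0)` is exactly 0.5; large positive z approaches 1 and large negative z approaches 0, which is what lets $\sigma(z)$ be read as a probability.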

Loss (error) function:

$\mathcal{L}(\hat{y}, y) = -\big(y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})\big)$

Cost function:

$J(w, b) = \dfrac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = -\dfrac{1}{m} \sum_{i=1}^{m} \big( y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \big)$
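The loss and cost functions translate to NumPy as follows (the helper names `loss` and `cost` are my own):

```python
import numpy as np

def loss(y_hat, y):
    # Cross-entropy loss for a single example (or elementwise over arrays).
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(y_hat, y):
    # Cost J: the average loss over all m training examples.
    return np.mean(loss(y_hat, y))
```

Note that the loss applies to one example, while the cost averages the losses over the whole training set.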

Gradient Descent

Want to find w and b that minimize J(w, b)

Process

Repeat

$w := w - \alpha \dfrac{\partial J(w, b)}{\partial w}$

$b := b - \alpha \dfrac{\partial J(w, b)}{\partial b}$

Logistic Regression Gradient Descent

Recap

$z = w^T x + b$

$\hat{y} = a = \sigma(z)$

$\mathcal{L}(a, y) = -\big(y \log(a) + (1 - y) \log(1 - a)\big)$

Gradient Descent

$dz = \dfrac{\partial \mathcal{L}}{\partial z} = da \cdot a(1 - a) = a - y$

$dw_1 = \dfrac{\partial \mathcal{L}}{\partial w_1} = x_1\,dz$

$dw_2 = \dfrac{\partial \mathcal{L}}{\partial w_2} = x_2\,dz$

$\ldots$

$db = \dfrac{\partial \mathcal{L}}{\partial b} = dz$

Process

$w_1 := w_1 - \alpha\,dw_1$

$w_2 := w_2 - \alpha\,dw_2$

$\ldots$

$b := b - \alpha\,db$

Gradient Descent on m examples

Recap

$J(w, b) = \dfrac{1}{m} \sum_{i=1}^{m} \mathcal{L}(a^{(i)}, y^{(i)}) = -\dfrac{1}{m} \sum_{i=1}^{m} \big( y^{(i)} \log(a^{(i)}) + (1 - y^{(i)}) \log(1 - a^{(i)}) \big)$

$a^{(i)} = \hat{y}^{(i)} = \sigma(z^{(i)}) = \sigma(w^T x^{(i)} + b)$

Descent
$dz^{(i)} = \dfrac{\partial \mathcal{L}}{\partial z^{(i)}} = a^{(i)} - y^{(i)}$

$dw_1 = \dfrac{1}{m} \sum_{i=1}^{m} \dfrac{\partial \mathcal{L}}{\partial w_1} = \dfrac{1}{m} \sum_{i=1}^{m} x_1^{(i)}\,dz^{(i)}$

$dw_2 = \dfrac{1}{m} \sum_{i=1}^{m} x_2^{(i)}\,dz^{(i)}$

$\ldots$

$db = \dfrac{1}{m} \sum_{i=1}^{m} dz^{(i)}$

Pseudocode
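A loop-based (non-vectorized) sketch of one gradient descent step over the $m$ examples, using a hypothetical `gradient_step` helper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(X, y, w, b, alpha):
    """One gradient descent step, looping over m examples (not vectorized).

    X has shape (n_x, m); y has shape (m,); w has shape (n_x,).
    Returns the updated w and b plus the cost J at the incoming w, b.
    """
    n_x, m = X.shape
    J, dw, db = 0.0, np.zeros(n_x), 0.0
    for i in range(m):
        z_i = w @ X[:, i] + b
        a_i = sigmoid(z_i)
        J += -(y[i] * np.log(a_i) + (1 - y[i]) * np.log(1 - a_i))
        dz_i = a_i - y[i]        # dL/dz for example i
        dw += X[:, i] * dz_i     # accumulates dw_1, ..., dw_{n_x}
        db += dz_i
    J, dw, db = J / m, dw / m, db / m
    return w - alpha * dw, b - alpha * db, J
```

Each call performs one "Repeat" iteration; the explicit loop over examples is exactly what vectorization (next section) removes.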


Vectorization

Logistic Regression Derivatives


Vectorizing Logistic Regression

$X = [x^{(1)}, x^{(2)}, \ldots, x^{(m)}]$

$Y = [y^{(1)}, y^{(2)}, \ldots, y^{(m)}]$

$Z = [z^{(1)}, z^{(2)}, \ldots, z^{(m)}] = w^T X + b$

$A = [a^{(1)}, a^{(2)}, \ldots, a^{(m)}] = \sigma(Z)$
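In NumPy, stacking the examples as columns lets one matrix product compute every $z^{(i)}$ at once (the shapes below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Stack the m examples as columns: X has shape (n_x, m).
n_x, m = 3, 5
rng = np.random.default_rng(0)
X = rng.standard_normal((n_x, m))
w = rng.standard_normal((n_x, 1))
b = 0.5

Z = w.T @ X + b    # one matrix product computes every z^(i); shape (1, m)
A = sigmoid(Z)     # elementwise sigmoid gives every a^(i); shape (1, m)
```

The scalar `b` is broadcast across all m columns, so no explicit loop over examples is needed.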

Implementing Logistic Regression
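A minimal vectorized training loop, assuming a hypothetical `train` helper and the conventions above ($X$ of shape $(n_x, m)$, $Y$ of shape $(1, m)$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, alpha=0.1, iters=1000):
    """Vectorized logistic regression: X is (n_x, m), Y is (1, m)."""
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0
    for _ in range(iters):
        A = sigmoid(w.T @ X + b)   # forward pass for all m examples, (1, m)
        dZ = A - Y                 # (1, m)
        dw = (X @ dZ.T) / m        # (n_x, 1)
        db = np.sum(dZ) / m
        w -= alpha * dw
        b -= alpha * db
    return w, b
```

The only remaining loop is over gradient descent iterations; each iteration touches every example through matrix operations.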


Broadcasting in Python

General Principle

$(m, n) \; [+\,-\,*\,/] \; (1, n) \;\longrightarrow\; (m, n) \; [+\,-\,*\,/] \; (m, n)$

$(m, n) \; [+\,-\,*\,/] \; (m, 1) \;\longrightarrow\; (m, n) \; [+\,-\,*\,/] \; (m, n)$
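A small NumPy demonstration of both broadcasting cases:

```python
import numpy as np

A = np.arange(6).reshape(2, 3)    # shape (2, 3): [[0, 1, 2], [3, 4, 5]]
row = np.array([10, 20, 30])      # shape (3,), treated as (1, 3)
col = np.array([[100], [200]])    # shape (2, 1)

B = A + row    # (2,3) + (1,3): the row is stretched down the 2 rows
C = A + col    # (2,3) + (2,1): the column is stretched across the 3 columns
```

This is the same mechanism that lets a scalar `b` be added to the `(1, m)` matrix `w.T @ X` in the vectorized logistic regression.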

Week 3

Neural Networks Overview


Neural Network Representation


Computing a Neural Network’s Output

$z^{[1]} = W^{[1]} x + b^{[1]} = W^{[1]} a^{[0]} + b^{[1]}$

$a^{[1]} = \sigma(z^{[1]})$

$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$

$a^{[2]} = \sigma(z^{[2]})$

$\ldots$

Vectorizing across multiple examples

$a^{[2](i)}$: the activation of layer 2 for example $i$


Activation functions

- sigmoid: $a = \dfrac{1}{1 + e^{-z}}$, $\quad a' = a(1 - a)$
- tanh: $a = \dfrac{e^z - e^{-z}}{e^z + e^{-z}}$, $\quad a' = 1 - a^2$
- ReLU: $a = \max(0, z)$, $\quad a' = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z \ge 0 \end{cases}$
- leaky ReLU: $a = \max(0.01z, z)$, $\quad a' = \begin{cases} 0.01 & \text{if } z < 0 \\ 1 & \text{if } z \ge 0 \end{cases}$


Why do you need non-linear activation functions?

Suppose

$z^{[1]} = W^{[1]} x + b^{[1]}$

$a^{[1]} = g^{[1]}(z^{[1]}) = z^{[1]}$

$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$

$a^{[2]} = g^{[2]}(z^{[2]}) = z^{[2]}$

Then
$a^{[1]} = z^{[1]} = W^{[1]} x + b^{[1]}$

$a^{[2]} = z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$

$a^{[2]} = W^{[2]} (W^{[1]} x + b^{[1]}) + b^{[2]} = (W^{[2]} W^{[1]}) x + (W^{[2]} b^{[1]} + b^{[2]})$

This is equivalent to a single linear layer:

$a^{[2]} = W' x + b'$, where $W' = W^{[2]} W^{[1]}$ and $b' = W^{[2]} b^{[1]} + b^{[2]}$

If you use linear activation functions (also called identity activation functions), the network just outputs a linear function of the input. We will talk later about deep networks with many hidden layers; it turns out that if you use a linear activation function, or equivalently no activation function at all, then no matter how many layers your neural network has, all it is doing is computing a linear function.
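A quick numerical check of this collapse, using arbitrary random weights:

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))
x = rng.standard_normal((3, 1))

# Two layers with identity (linear) activations...
a2 = W2 @ (W1 @ x + b1) + b2

# ...collapse to a single linear layer W'x + b'.
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
```

Both paths produce the same output, which is why hidden layers with linear activations add no expressive power.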

Gradient Descent for Neural Networks

Backpropagation

$dZ^{[2]} = A^{[2]} - Y$

$dW^{[2]} = \dfrac{1}{m}\, dZ^{[2]} A^{[1]T}$

$db^{[2]} = \dfrac{1}{m}\,\texttt{np.sum}(dZ^{[2]}\texttt{, axis=1, keepdims=True})$

$dZ^{[1]} = W^{[2]T} dZ^{[2]} * g^{[1]\prime}(Z^{[1]})$

$dW^{[1]} = \dfrac{1}{m}\, dZ^{[1]} X^T$

$db^{[1]} = \dfrac{1}{m}\,\texttt{np.sum}(dZ^{[1]}\texttt{, axis=1, keepdims=True})$
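These equations can be sanity-checked with NumPy on made-up shapes (tanh hidden layer and sigmoid output, so that $dZ^{[2]} = A^{[2]} - Y$; all sizes and data below are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_x, n_h, m = 2, 3, 5    # made-up layer sizes and example count
X = rng.standard_normal((n_x, m))
Y = (rng.random((1, m)) > 0.5).astype(float)
W1, b1 = rng.standard_normal((n_h, n_x)) * 0.01, np.zeros((n_h, 1))
W2, b2 = rng.standard_normal((1, n_h)) * 0.01, np.zeros((1, 1))

# Forward pass: tanh hidden layer, sigmoid output.
Z1 = W1 @ X + b1
A1 = np.tanh(Z1)
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)

# Backward pass, mirroring the equations.
dZ2 = A2 - Y
dW2 = (dZ2 @ A1.T) / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)   # g'[1](Z1) = 1 - A1^2 for tanh
dW1 = (dZ1 @ X.T) / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m
```

Note that every gradient has the same shape as the parameter it updates, which is a useful quick check when implementing this.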

Random Initialization

If the weights are initialized to zeros, all hidden units compute the same function and receive the same gradient, so they update symmetrically. Then no matter how many nodes a layer has, it behaves as if it had a single node. Initializing the weights to small random values breaks this symmetry.
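A minimal initialization sketch (the 0.01 scale follows the course; the layer sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(42)
n_x, n_h = 3, 4    # made-up layer sizes

# Small random weights break the symmetry; biases may start at zero,
# because the random W already makes each hidden unit different.
W1 = rng.standard_normal((n_h, n_x)) * 0.01
b1 = np.zeros((n_h, 1))
```

The small scale keeps $z$ near zero, where sigmoid and tanh have large gradients, so learning starts quickly.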

Week 4

Building Blocks of Deep Neural Networks


Propagation

Forward Propagation for Layer l

Input

$a^{[l-1]}$

Cache
$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$

Output
$a^{[l]} = g^{[l]}(z^{[l]})$

Vectorized

Input

$A^{[l-1]}$

Cache
$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$

Output
$A^{[l]} = g^{[l]}(Z^{[l]})$

Backward Propagation for Layer l

Input

$da^{[l]}$

Local
$dz^{[l]} = da^{[l]} * g^{[l]\prime}(z^{[l]})$

Output
$dW^{[l]} = dz^{[l]}\, a^{[l-1]T}$

$db^{[l]} = dz^{[l]}$

$da^{[l-1]} = W^{[l]T} dz^{[l]}$

Vectorized

Input

$dA^{[l]}$

Local
$dZ^{[l]} = dA^{[l]} * g^{[l]\prime}(Z^{[l]})$

Output
$dW^{[l]} = \dfrac{1}{m}\, dZ^{[l]} A^{[l-1]T}$

$db^{[l]} = \dfrac{1}{m}\,\texttt{np.sum}(dZ^{[l]}\texttt{, axis=1, keepdims=True})$

$dA^{[l-1]} = W^{[l]T} dZ^{[l]}$

Parameters vs Hyperparameters

Parameters

$W^{[1]}, b^{[1]}; \quad W^{[2]}, b^{[2]}; \quad \ldots$

Hyperparameters

Hyperparameters are settings that control how the parameters $W$ and $b$ end up being learned:

- learning rate $\alpha$
- number of iterations
- number of hidden layers $L$
- number of hidden units $n^{[1]}, n^{[2]}, \ldots$
- choice of activation function
- momentum term
- mini-batch size
- various forms of regularization parameters
