Introduction to Neural Networks
Neural networks are made of logistic units, and depending on how you arrange those units, you get different neural network architectures. Figure: 1 depicts the schematic diagram of a logistic unit. You can consider it as a function that takes an input vector and is parameterized by a weight vector (W), which controls the relative importance of the elements of the input vector.
The logistic unit on its own works as a binary classifier, known as logistic regression. For instance, with an appropriate weight vector W, it can be used to classify emails as spam or non-spam, or to check whether a given credit card transaction is fraudulent.
Logistic and tanh are two popular non-linear functions that can be used in logistic units. Figure: 2 shows the graphs of these two functions.
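As a minimal sketch of the two non-linearities mentioned above, the following NumPy snippet evaluates the logistic function and tanh on a few sample inputs. The helper name `logistic` and the sample inputs are illustrative choices and are not taken from Figure: 2.

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Sample inputs chosen only for illustration.
z = np.array([-5.0, 0.0, 5.0])

logistic_values = logistic(z)  # values lie in (0, 1), with logistic(0) = 0.5
tanh_values = np.tanh(z)       # values lie in (-1, 1), with tanh(0) = 0

print(logistic_values)
print(tanh_values)
```

The two functions are closely related: tanh(z) equals 2 * logistic(2z) - 1, so tanh can be seen as a rescaled logistic function centred at zero.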
A single logistic unit works well on linearly separable datasets. However, it does not perform well on datasets that are not linearly separable. This can be easily demonstrated with the two synthetically generated datasets shown in Figure: 3.
So it is clear that logistic regression is capable of classifying linearly separable datasets. Unfortunately, most of the datasets we come across in practical machine learning problems are not linearly separable. Hence, we need better classifiers than logistic regression.
Building classifiers with more logistic units is an obvious and natural approach to overcoming the limitations of a single logistic unit. A classifier built this way is known as a neural network. Usually, neural networks arrange logistic units into layers, and depending on the arrangement of these layers, we get different neural network architectures. Figure: 4 shows a typical neural network with 4 layers. The first layer is known as the input layer and the last layer (denoted by 4) is known as the output layer. The layers in between the input and output layers are known as hidden layers, and neural networks with more than one hidden layer are usually known as deep neural networks.
The network in Figure: 4 is parameterized by weight matrices and vectors. Finding proper values for those parameters is crucial for the predictive performance of neural networks. We use a training dataset to estimate suitable values for those weight matrices and vectors.
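To make the layered arrangement concrete, here is a minimal sketch of a forward pass through a network with an input layer, two hidden layers and an output layer. The layer sizes, the random initialization and the names `forward`, `weights` and `biases` are assumptions made for illustration; they are not taken from Figure: 4.

```python
import numpy as np

def logistic(z):
    """Logistic non-linearity applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Propagate input x through each layer of logistic units in turn."""
    activation = x
    for W, b in zip(weights, biases):
        activation = logistic(activation @ W + b)
    return activation

rng = np.random.default_rng(0)
layer_sizes = [3, 4, 4, 1]  # hypothetical: input, two hidden layers, output

# One weight matrix and one bias vector per pair of consecutive layers.
weights = [rng.normal(size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

x = np.array([0.5, -1.0, 2.0])
output = forward(x, weights, biases)
print(output.shape)
```

With random weights the output is meaningless; training is what turns these weight matrices and bias vectors into a useful classifier.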
Training Procedure
In this section, we introduce the key steps of the neural network training process. First, we formulate neural network training as an optimization problem. Then, gradient descent is introduced as a technique for solving this optimization problem. Finally, we discuss automatic differentiation as an efficient method for calculating the error derivatives of a function w.r.t. its parameters.
Empirical Risk Minimization
As we pointed out above, neural network training can be considered as an optimization problem. The framework we use to formulate this optimization problem is known as empirical risk minimization. It is a generic principle that is useful in many areas of machine learning. For more details about empirical risk minimization, please read Chapter 4 of [1].
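Let me explain empirical risk minimization with an example. Suppose you have developed a neural network for classifying handwritten digits. During training, you feed in training images (actually, raw pixel intensities) and the network predicts the most probable digit for each image. Suppose you also have a function (say L) that quantifies the discrepancy between the predicted and the actual labels. The goal of training is to reduce the average value of L over the training set as much as possible. Mathematically, this can be written as follows.
$$\theta^{*} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} L\big(f(x^{(i)}; \theta),\, y^{(i)}\big) \qquad (1)$$
In Equation: (1), N is the number of training examples, (x^{(i)}, y^{(i)}) is the i-th training example, and f is a function parameterized by θ that maps input features to output labels.
In practice, we add an extra regularization term, Ω(θ), to Equation: (1), which gives Equation: (2).
$$\theta^{*} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} L\big(f(x^{(i)}; \theta),\, y^{(i)}\big) + \lambda\, \Omega(\theta) \qquad (2)$$
In Equation: (2), λ is a hyper-parameter and its value is estimated using a cross-validation dataset.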
So now we have formulated neural network training as an optimization problem. The next step is to find a suitable algorithm for optimizing the empirical loss function given in Equation: (2).
Gradient Descent
In this section, we discuss a simple yet powerful optimization algorithm called gradient descent. In practice, neural networks are usually trained with one of its improved versions. However, a good understanding of vanilla gradient descent is essential for understanding those algorithms. Therefore, in this tutorial we use vanilla gradient descent, and in upcoming tutorials we will use a few of its improved versions.
Let's start our discussion with a simple example: the function f(x) = x^2, whose derivative is 2x. Figure: 5 shows tangent lines at two points on this curve.
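From Figure: 5 it is clear that if you follow the negative direction of the derivative, one small step at a time, you will be moving towards the minimum of the function. This update is given in Equation: (3).
$$x \leftarrow x - \eta\, \frac{df(x)}{dx} \qquad (3)$$
Here η is the learning rate, which controls the size of each step. Iteratively applying Equation: (3) is known as gradient descent.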
Now let's move to the implementation of the gradient descent algorithm in Python.
def get_grad(x):
    """Return the derivative of the function f(x) = x^2."""
    return 2 * x

# initial guess
x = 10
# learning rate
eta = 0.01

num_iterations = 500
for i in range(num_iterations):
    x = x - eta * get_grad(x)
    if i % 50 == 0:
        print('Iteration: {:3d} x: {:.3e} f(x): {:.3e}'.format(i, x, x**2))
# report the final iterate
print('Iteration: {:3d} x: {:.3e} f(x): {:.3e}'.format(i, x, x**2))
Program 1: Finding the minimum value of f(x) = x^2 (https://github.com/upul/GNN/blob/master/chapter2/simple_gradient_descent.py)
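By just looking at Figure: 5, it is obvious that the minimum value of the function f(x) = x^2 is at x = 0, and Program 1 moves x towards that point. Gradient descent works in the same way for functions of more than one variable. A function of two variables represents a surface in 3D space, as shown in Figure: 7.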
Now let's say we are going to find the minimum point of such a function. Since the function shown in Figure: 7 has several minima, you will reach different local minima depending on the starting point. For instance, Figure: 7 shows two such possibilities. So it should be noted that in gradient descent the starting point of the optimization process plays an important role.
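The dependence on the starting point can be illustrated with a small sketch. The function f(x, y) = (x^2 - 1)^2 + y^2 used here is a hypothetical example with two minima, at (1, 0) and (-1, 0); it is not the function plotted in Figure: 7, and the learning rate and starting points are likewise chosen only for illustration.

```python
import numpy as np

def grad_f(point):
    """Gradient of the hypothetical function f(x, y) = (x^2 - 1)^2 + y^2."""
    x, y = point
    return np.array([4.0 * x * (x**2 - 1.0), 2.0 * y])

def gradient_descent(start, eta=0.05, num_iterations=500):
    """Run vanilla gradient descent from the given starting point."""
    point = np.array(start, dtype=float)
    for _ in range(num_iterations):
        point = point - eta * grad_f(point)
    return point

# Two starting points that differ only in the sign of x.
minimum_a = gradient_descent([0.5, 1.0])
minimum_b = gradient_descent([-0.5, 1.0])

print(minimum_a)
print(minimum_b)
```

Both runs use the same algorithm and the same learning rate, yet they settle in different minima because they start on opposite sides of x = 0.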
Automatic Differentiation (AD)
In previous sections, we formulated neural network training as an optimization problem. Gradient descent was also introduced for finding minimum points of the loss function given in Equation: (2) w.r.t. the model parameters (θ). In this section, we will discuss an efficient technique for calculating the derivatives of the loss function (also known as error derivatives) w.r.t. the model parameters.
Discovered independently by several different research groups in the 1970s and 1980s, the backpropagation algorithm has been used as the main tool for calculating error derivatives of the loss function w.r.t. model parameters. The key idea of the backpropagation algorithm is that error derivatives can be calculated by starting at the output layer of the network and moving towards the input layer. So the error derivatives of a given layer are calculated from the error derivatives of the layer that follows it, with the help of the chain rule.
However, in this tutorial, instead of backpropagation we will be using a more general technique called reverse-mode automatic differentiation for calculating derivatives of the loss w.r.t. model parameters. Actually, backpropagation is a specialized version of reverse-mode automatic differentiation. Unlike backpropagation, reverse-mode automatic differentiation can be used for calculating derivatives of any computational graph. Though it is heavily underused in machine learning, automatic differentiation is a well-established technique in other scientific disciplines such as fluid dynamics and nuclear engineering.
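Let's consider a simple function and represent it as a computational graph, as shown in Figure: 8. In the first phase, known as the forward pass, we start at the inputs and move forward by applying the elementary operation at each node.
In Figure: 8, we have introduced three intermediate variables, which break the function down into three elementary operations.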
The second phase, commonly known as the backward pass, starts at the bottom of the graph and moves towards the inputs. During the backward pass, we calculate derivatives of the output w.r.t. the intermediate variables and finally w.r.t. the input variables. Figure: 9 shows the backward pass for the same function.
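We start the backward pass at the bottom of Figure: 9 and first calculate the derivative of the output w.r.t. the last intermediate variable.
The next two steps propagate that derivative back through the remaining intermediate variables using the chain rule.
The final step calculates the derivative of the output w.r.t. an input variable, which evaluates to 10.39 in this example.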
Now we have a good understanding of the mechanics of reverse-mode automatic differentiation, and we are ready to use it for calculating error derivatives of neural networks. It is also worth mentioning that automatic differentiation is neither a fully analytical nor a fully numerical algorithm. At the level of elementary operations, we use analytical differentiation and keep the intermediate results numerically. We then use the chain rule for propagating derivatives from a given output towards the inputs.
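The forward and backward passes described in this section can be sketched by hand for a small function. The function f(x1, x2) = x1 * x2 + exp(x1), the input values and the name `forward_backward` are assumptions chosen for illustration; they may differ from the example shown in Figure: 8 and Figure: 9.

```python
import math

def forward_backward(x1, x2):
    """Evaluate f(x1, x2) = x1 * x2 + exp(x1) and its derivatives."""
    # Forward pass: apply one elementary operation per node.
    v1 = x1 * x2        # multiplication node
    v2 = math.exp(x1)   # exponential node
    out = v1 + v2       # addition node (the output)

    # Backward pass: start at the output and move towards the inputs.
    d_out = 1.0         # derivative of the output w.r.t. itself
    d_v1 = d_out * 1.0  # addition passes the derivative through unchanged
    d_v2 = d_out * 1.0

    # x1 feeds two nodes, so its derivative sums the contributions of
    # both paths. The derivative of exp(x1) is exp(x1), which is the
    # value v2 already stored during the forward pass.
    d_x1 = d_v1 * x2 + d_v2 * v2
    d_x2 = d_v1 * x1
    return out, d_x1, d_x2

value, d_x1, d_x2 = forward_backward(2.0, 3.0)
print(value, d_x1, d_x2)
```

Each local derivative is written analytically, but only numbers flow through the graph, and values stored during the forward pass (such as `v2`) are reused in the backward pass.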
Training Shallow Networks
In previous sections, we have discussed many of the tools needed for training neural networks. Now it's time to put those tools into practice. But before moving to full-fledged neural networks, we would like to start with a simple linear network called the softmax classifier.
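Since we are in the classification setting, it would be very nice to interpret the output values of the network as probabilities. However, the pre-activations of the output layer are not valid probabilities: they can be negative and they do not sum to one. Hence, we use the softmax function (Equation: (4)) for calculating probabilities from the pre-activations.
$$p_k = \frac{e^{a_k}}{\sum_{j} e^{a_j}} \qquad (4)$$
Where a_k is the pre-activation of the k-th output unit and p_k is the predicted probability of category k. The category with the highest p_k will be selected as the predicted category.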
Next, in order to use gradient descent for training our network, we have to devise a suitable cost function which quantifies the discrepancy between the predicted and actual classes. In the remaining part of this section, we derive a loss function called the cross-entropy loss, which we will be using in our softmax classifier.
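Technically speaking, our output layer calculates conditional probabilities: for an input x, the k-th output is the probability of category k given x. We would like the network to assign a high probability to the correct category of every training example, which is the same as minimizing the negative log-probability of the correct category. Averaging over the training set and adding a regularization term gives the cross-entropy loss in Equation: (5).
$$L(W, b) = -\frac{1}{N} \sum_{i=1}^{N} \log p^{(i)}_{y^{(i)}} + \frac{\lambda}{2} \sum_{j,k} W_{jk}^{2} \qquad (5)$$
Where N is the number of training examples, p^{(i)}_{y^{(i)}} is the probability the network assigns to the correct category of the i-th example, and λ is the regularization parameter. Note that we have included the L2 regularization loss in the above loss function.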
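As a small worked example of the softmax function and the cross-entropy data loss for a single training example, consider the sketch below. The pre-activation values, the target category and the helper name `softmax` are hypothetical and chosen only for illustration; the regularization term is left out.

```python
import numpy as np

def softmax(pre_activations):
    """Convert a vector of pre-activations into probabilities."""
    # Subtracting the maximum does not change the result but avoids overflow.
    shifted = pre_activations - np.max(pre_activations)
    exp_values = np.exp(shifted)
    return exp_values / np.sum(exp_values)

a = np.array([2.0, 1.0, 0.1])  # hypothetical pre-activations for 3 categories
target = 0                     # hypothetical correct category

p = softmax(a)
predicted = int(np.argmax(p))  # category with the highest probability
loss = -np.log(p[target])      # cross-entropy loss for this single example

print(p, predicted, loss)
```

The loss is small when the network puts most of the probability mass on the correct category and grows without bound as that probability approaches zero.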
So we have discussed a loss function, an efficient technique for calculating error derivatives of the loss function and an optimization algorithm. Hence, we are now ready to train our softmax classifier. Actually, we will be building a handwritten digit recognizer using the softmax classifier.
MNIST Dataset
For building our handwritten digit recognizer, we are going to use the MNIST dataset [2]. It is one of the most well-known datasets in the field of machine learning. The MNIST dataset consists of 60,000 training and 10,000 testing grayscale images of 28x28 pixels. Figure: 11 shows a few sample images from the MNIST dataset.
Implementing Softmax Classifier in Python/Numpy
Program: 2 shows our SoftmaxLayer implementation. It consists of three methods: forward_pass, backward_pass and update_parameters. forward_pass is easy to understand: it first calculates the pre-activations using W and b. Next, the pre-activations are converted to probabilities using Equation: 4, and finally the empirical risk is estimated using Equation: 5.
However, backward_pass is a little bit more complicated. Therefore, we use the computational flow graph shown in Figure: 12 to understand backward_pass. In order to use gradient descent, we would like to calculate the derivatives of the loss w.r.t. W and b.
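Consider a single training example with input x, correct category y and pre-activation vector a. Differentiating the data loss, -log p_y, w.r.t. the k-th pre-activation gives:
$$\frac{\partial L}{\partial a_k} = p_k - \mathbb{1}[k = y] \qquad (6)$$
Considering the complete pre-activation vector, a = xW + b, and averaging over the N training examples, the chain rule gives the derivatives w.r.t. W and b as given below.
$$\frac{\partial L}{\partial W} = \frac{1}{N} X^{T} (P - Y) + \lambda W \qquad (7)$$
$$\frac{\partial L}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} (P_{i} - Y_{i}) \qquad (8)$$
Here X is the matrix of inputs, P is the matrix of predicted probabilities and Y is the matrix of one-hot encoded labels, each with one row per example; P_i and Y_i denote their i-th rows, and λ is the regularization parameter. Since the regularization loss doesn't have a bias term, it contributes nothing to Equation: 8. Now that we have those equations in our hands, backward_pass is just a matter of converting Equation: 7 and Equation: 8 to Python/Numpy code.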
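Before trusting hand-derived derivatives, it is common to compare them against finite differences. The sketch below does this for a softmax classifier with L2 regularization. It assumes the standard softmax gradients (predicted probabilities minus one-hot targets, averaged over the examples, plus the regularization term for W). All function names, the random data and the tolerance are hypothetical, and this is not the SoftmaxLayer implementation from Program: 2.

```python
import numpy as np

def softmax_loss(W, b, X, y, reg):
    """Return the regularized cross-entropy loss and the probabilities."""
    a = X @ W + b
    a = a - np.max(a, axis=1, keepdims=True)  # numerical stability
    p = np.exp(a) / np.sum(np.exp(a), axis=1, keepdims=True)
    n = X.shape[0]
    data_loss = -np.mean(np.log(p[np.arange(n), y]))
    return data_loss + 0.5 * reg * np.sum(W * W), p

def analytic_grads(W, b, X, y, reg):
    """Derivatives of the loss w.r.t. W and b from the closed-form result."""
    n = X.shape[0]
    _, p = softmax_loss(W, b, X, y, reg)
    d_a = p.copy()
    d_a[np.arange(n), y] -= 1.0  # probabilities minus one-hot targets
    d_a /= n
    return X.T @ d_a + reg * W, np.sum(d_a, axis=0)

def numeric_grad(loss_fn, param, eps=1e-6):
    """Central finite differences, perturbing one entry of param at a time."""
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        original = param[idx]
        param[idx] = original + eps
        loss_plus = loss_fn()
        param[idx] = original - eps
        loss_minus = loss_fn()
        param[idx] = original  # restore the entry
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))        # 5 hypothetical examples, 4 features
y = rng.integers(0, 3, size=5)     # 3 hypothetical categories
W = 0.1 * rng.normal(size=(4, 3))
b = np.zeros(3)
reg = 0.01

grad_W, grad_b = analytic_grads(W, b, X, y, reg)
loss_fn = lambda: softmax_loss(W, b, X, y, reg)[0]
max_err_W = np.max(np.abs(grad_W - numeric_grad(loss_fn, W)))
max_err_b = np.max(np.abs(grad_b - numeric_grad(loss_fn, b)))
print(max_err_W, max_err_b)
```

If the two sets of derivatives disagree by more than a small tolerance, the hand-written backward pass most likely contains a mistake.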
import numpy as np

class SoftmaxLayer:
    """
    SoftmaxLayer class represents the Softmax layer.
    Parameters
    ----------
    W : matrix W represents the input to output connection weight
    b : bias vector
    reg_parameter : regularization parameter of the L2 regularizer
    num_unique_categories : number of output categories
    """
    def __init__(self, W, b, reg_parameter, num_unique_categories):
        self.W = W
        self.b = b
        self.reg_parameter = reg_parameter
        self.num_unique_categories = num_unique_categories

    def forward_pass(self, x_input, y_input):
        """
        Performs forward pass and returns x_out_prob and total_loss
        """
        # calculates pre-activation using XW + b
        x_hid = np.dot(x_input, self.W) + self.b

        # subtract the row-wise maximum from each row of x_hid
        # for numerical stability
        # details: http://www.iro.umontreal.ca/~bengioy/dlbook/numerical.html
        x_hid = x_hid - np.max(x_hid, axis=1, keepdims=True)
        # calculate output probabilities using Equation 4
        x_out_prob = np.exp(x_hid) / np.sum(np.exp(x_hid), axis=1, keepdims=True)

        # calculate data loss using -log_e(p_k)
        num_examples = x_input.shape[0]
        prob_target = x_out_prob[range(num_examples), y_input]
        data_loss_vector = -np.log(prob_target)
        data_loss = np.sum(data_loss_vector) / num_examples

        # L2 regularization loss on the weights (the bias is not regularized)
        reg_loss = self.reg_parameter * np.sum(self.W * self.W) * 0.5
        total_loss = data_loss + reg_loss

        return x_out_prob, total_loss

    def backward_pass(self, x_out_prob, x_input, y_input):
        """
        Performs backward pass and calculates error derivatives.