# An overview of artificial neural networks

## Abstract

This paper aims to provide an acquaintance with Artificial Neural Networks and two of their architectures. It first defines what an artificial neural network is and then describes the key ideas behind it, such as weights, biases, activation functions and backpropagation. We first look at the mathematical view of the feedforward neural network and how the learning process occurs, and then give a brief introduction to the Convolutional Neural Network. Finally, in order to build intuition, a case study of effectiveness on the MNIST dataset of handwritten digits is carried out, examining how parameters such as learning rate, width, depth and mini-batch size affect the network's accuracy.

## Abbreviations

Given below is the list of abbreviations used in the paper.

| Abbreviation | Meaning |
| --- | --- |
| FNN | FeedForward Neural Network |
| MLP | Multi-Layer Perceptron |
| CNN | Convolutional Neural Network |
| LR | Learning Rate |
| MNIST | Modified National Institute of Standards and Technology |

## AN OVERVIEW OF THE WORK

This two-month internship at IIST Trivandrum aimed at gaining an insight into artificial neural networks. We are, in some way or another, surrounded by the elements of Artificial Intelligence. A major part of it is Machine Learning, in which we try to make machines see the world as we humans do. A big success in this area has been achieved with the introduction of Deep Learning and Artificial Neural Networks (ANNs). A deep study was made in order to gain knowledge about ANNs. Some basic ideas like weights, biases, backpropagation and the gradient descent method were learned. The objective was to learn techniques for pattern recognition, image processing and classification, and I came to learn about FeedForward Neural Networks (FNNs) and Convolutional Neural Networks (CNNs). During this internship, I also improved my technical and coding skills, particularly in Python, about which I had no idea before coming here.

Throughout the duration of the course, theoretical study as well as hands-on practical implementation was encouraged. After doing many tutorials and readings, I learned about various activation functions and the mathematics behind them, and about different optimization techniques in neural networks. I also came to know about the Jupyter Notebook, the Keras function library and several other technical specialities. In order to gain a sound understanding of what is happening, a case study of the MNIST dataset of handwritten digits was carried out. Data collection and analysis were performed, and through this process I gained familiarity with deep learning.

Different types of Artificial Neural Networks, like Feed Forward, Convolutional, Radial Basis Function (RBFN) and Bayesian Neural Networks, were introduced. I successfully implemented an FNN and a CNN, and had introductory lessons on RBFNs and Bayesian Neural Networks without implementation. I also tried to learn how to enhance the capability of a neural network, but I did not arrive at any conclusion. In future, a detailed study of RBFNs and Bayesian neural networks, together with their implementation, is intended. I will also try to learn the mathematical aspects of enhancing neural network capability.

In all, this internship proved fruitful to me in acquiring good technical and coding skills and gaining an insight into the world of Artificial Intelligence.

## INTRODUCTION

Artificial Intelligence has been a popular topic in science fiction and news articles for at least a few decades, and is a futuristic area for research and innovation. It is the simulation of human intelligence processes, like learning, reasoning and self-correction, by machines, especially computer systems. Many algorithms of this category are indeed incorporated into people's everyday lives: self-driving cars, automatically generated image captions on search engines, recommendation algorithms, ridesharing apps like Uber, location prediction on Google Maps, social networking, online shopping and many more. A loose definition is "a program that does something smart", with the meaning of "smart" changing throughout the years. More recently, this term has overwhelmingly been used to refer to *machine learning* algorithms, which can also be called *data-driven algorithms*, since they are nothing more than algorithms which adjust some of their parameters in response to a data set. In machine learning, a system learns things without being explicitly programmed to do so. Out of these algorithms, one type is currently especially popular: *Deep Learning*, in which machines process information in a way loosely inspired by the human brain, using *Artificial Neural Networks*.

Algorithms that fall under this term are more often than not used to mimic some sort of action which could be executed by human beings, but for which we lack the mathematical tools to explicitly model in a fully applicable way. They are brain-inspired systems which are intended to replicate the way we humans learn, and they are excellent tools for finding patterns which are far too complex or numerous for a human programmer to extract and teach the machine to recognize. In the conventional approach to programming, we tell the computer what to do, breaking big problems into many small, precisely defined tasks that the computer can easily perform. By contrast, in an artificial neural network we don't tell the computer how to solve the problem. Instead, it learns from observational data, figuring out its own solution to the problem. One of the classic examples of this is *image recognition and classification*. For example, recognizing a handwritten digit regardless of its translation, rotation or specific configuration is not a big task for a human, as we carry a supercomputer in our heads. The difficulty of visual pattern recognition becomes apparent if we attempt to write a program explicitly to recognize digits. Simple intuitions about how we recognize shapes - 'a 9 has a loop at the top and a vertical stroke in the bottom right' - turn out to be not so simple to express algorithmically. When we try to make such rules precise, we quickly get lost in a morass of exceptions, caveats and special cases.

The MNIST dataset is a collection of 70,000 scanned images of handwritten digits, together with their correct classifications. Some of the digits are shown in fig 2.

The data comes in two parts. The first part contains 60,000 images to be used as training data. These images are scanned handwriting samples from 250 people, half of whom were US Census Bureau employees and half of whom were high school students. The images are greyscale and 28 by 28 pixels in size. The second part of the MNIST dataset contains 10,000 images to be used as test data, taken from a different set of 250 people, to give us confidence that our system can recognize digits from people whose writing it didn't see during training.

There are many types of artificial neural networks, such as Recurrent (often used for speech and audio recognition), Convolutional (often used for image classification), Radial Basis Function Networks (RBFN) and Bayesian Networks. This paper focuses on the FeedForward Network and the Convolutional Network.

## FeedForward Neural Network

## Structure

## Nodes and layers

A Feedforward Neural Network (FNN) is an algorithm which works in layers or waves. The models are called "feed-forward" because information flows forward through the model; there are no feedback connections in which outputs of the model are fed back into itself. An FNN consists of an input layer, an output layer and some layers between them known as *hidden layers*. The term 'hidden' has no deep philosophical or mathematical significance; it only means 'not an input or an output'. In each layer there is a given number of *nodes*, which are sometimes called *perceptrons* or even neurons (in analogy with the biological brain). A visual representation of an FNN with two hidden layers is presented in fig 3.

Nodes are simply representations of numbers stored in the program. From the input layer all the way to the output layer, the numerical value stored in a node is called the *activation* of the node. The activations in the n^{th} layer are used to compute the activations in the (n+1)^{th} layer. Mathematically, the activations stored in a layer with *k* nodes can be represented by a vector of dimension *k*.

In order to refer to the structure of a specific network, the term *depth* is commonly used for the number of hidden layers in the network. Hence, when we say 'Deep Network' or 'Deep Learning', it simply means that the network has a large number of layers. Analogously, the less commonly used term *width* refers to how many nodes there are in each layer of the network. The term is well defined for networks in which all hidden layers have the same number of nodes (the width is then this number), but not so much for networks in which the number of nodes varies from layer to layer.

## Weight, Bias and Activation Functions

Given the values stored in the n^{th} layer of the network (with, say, *k* nodes), how are the values of the next layer (with, say, *j* nodes) calculated? Each neuron has its own activation, and it is given a specific connection strength to each node in the next layer. This connection strength is called a *weight*, which is a real number. For notation, $W_{a,b}^n\;\in \mathbb{R}$ is the weight connecting the b^{th} node in the (n-1)^{th} layer to the a^{th} node in the n^{th} layer. In order to represent all the weights connecting these two layers, we may use a matrix

$W_n:=\;\begin{bmatrix}W_{1,1}^n&\cdots&W_{1,k}^n\\\vdots&\ddots&\vdots\\W_{j,1}^n&\cdots&W_{j,k}^n\end{bmatrix}$

This system is analogous to the biological system of neurons, in which one neuron firing (transmitting an electrical signal) will fire up (propagate to) nearby neurons. A significant difference between the two systems is that in biological networks a neuron is either activated or not, while in an Artificial Neural Network it may have a continuum of activations, such as a real number between 0 and 1 (as in a Sigmoid neuron), a non-negative real number (as in ReLU networks) or even a complex number. Now, we may also wish to add what is called a 'bias' or 'threshold', which is a constant $b \in \mathbb{R}$, and an 'activation function', which is a function $\sigma :\mathbb{R}\to \mathbb{R}$. A list of some commonly used activation functions with their derivatives is shown in fig 4; these can be easily accessed when working with Keras. The activation function converts the value calculated from the weights and biases into something which can be stored in the network. The bias is a measure of how easy it is to get the neuron to fire. In order to calculate the activations $A_n$ in the nodes of the n^{th} layer, we use the formula

$A_n = \sigma( W_n A_{n-1} + b_n) = \sigma \bigg( \begin{bmatrix}W_{1,1}^n&\cdots&W_{1,k}^n\\\vdots&\ddots&\vdots\\W_{j,1}^n&\cdots&W_{j,k}^n\end{bmatrix}\begin{bmatrix}a_1^{n-1}\\\vdots\\a_k^{n-1}\end{bmatrix} + \begin{bmatrix}b_1^n\\\vdots\\b_j^n \end{bmatrix} \bigg)$

In the above equation, one can see that the activation of a node depends on the weights and biases, which is of great importance: a slight change in a weight or bias causes a slight change in the activation of a node, depending on the choice of activation function. This is crucial for the learning of our network.
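As an illustration, the layer computation above can be sketched in a few lines of NumPy. The layer sizes and the sigmoid activation here are arbitrary choices for the example, not prescribed by the formula itself:

```python
import numpy as np

def sigmoid(z):
    # Sigmoid activation, applied elementwise
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(W, A_prev, b, activation=sigmoid):
    # A_n = sigma(W_n A_{n-1} + b_n)
    return activation(W @ A_prev + b)

# Example: a layer with k = 3 inputs and j = 2 outputs
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))   # weight matrix W_n (j x k)
b = rng.standard_normal((2, 1))   # bias vector b_n (j x 1)
A_prev = rng.random((3, 1))       # activations of the previous layer
A = layer_forward(W, A_prev, b)
print(A.shape)  # one activation per node of the new layer
```

With a sigmoid activation, every entry of `A` lies strictly between 0 and 1, matching the continuum of activations described above.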

Suppose we are trying to determine whether a handwritten image depicts a '9' or not. Let the image be a 28x28 greyscale pixel image, and let our network take the greyscale values stored in the image pixels as input; then we have 784 = 28x28 input neurons. Each neuron in the input layer contains an activation value, which is a number between 0 and 1 inclusive, depending on the brightness of the pixel: 0 means a white pixel and 1 means a black pixel. The activations in this layer determine the activations in the further layers. Suppose that, for a given set of values of weights and biases, our network recognizes the digit '9'; this does not mean that the same set of values will be helpful in recognizing other digits like '5'. Hence, we need to slightly adjust the weights and biases so as to recognize 9, 5 and the other digits as well. The smoothness of the activation function $\sigma$ means that small changes $\Delta W_j$ in the weights and $\Delta b$ in the bias will produce a small change $\Delta output$ in the output of a neuron. In fact, calculus tells us that $\Delta output$ is well approximated by

$\Delta output\approx\sum_{j}\frac{\partial\, output}{\partial W_j} \Delta W_j + \frac{\partial\, output}{\partial b} \Delta b$

## Learning Process

So far we have seen the basic ideas of weights, biases and activation functions. Now we want an algorithm which lets us find weights and biases so that the output from the network approximates the desired output for all the training input images. Let $x$ denote a training input. It is convenient to regard each training input $x$ as a 28x28 = 784 dimensional vector. Each entry in the vector represents the grey value of a single pixel in the image. We denote the corresponding desired output by $y = y(x)$, where $y$ is a 10-dimensional vector. For instance, if a particular training image $x$ depicts a '6', then $y(x) = (0,0,0,0,0,0,1,0,0,0)^T$ is the desired output from the network. This representation is called *one-hot encoding*. Note that $T$ here is the transpose operation, turning a row vector into a column vector.
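The one-hot encoding described above is easy to sketch in code; the helper name `one_hot` is ours, purely for illustration:

```python
import numpy as np

def one_hot(digit, num_classes=10):
    # Desired output vector y(x): 1 at the digit's index, 0 elsewhere
    y = np.zeros((num_classes, 1))
    y[digit] = 1.0
    return y

# A training image depicting a '6' gets a 1 in the seventh entry,
# matching the (0,0,0,0,0,0,1,0,0,0)^T example above
print(one_hot(6).T)
```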

## Cost function

In order to determine how good a certain prediction given by the algorithm is, we may establish a *cost function*, also called a *loss function*, which quantifies how well we are achieving this goal, or measures the discrepancy in our result. Consider this function, known as the 'Mean Square Error' or 'Quadratic Cost Function':

$C(w,b)\equiv\frac{1}{2n}\sum_{x}\| a - y(x)\|^2$

Here, $w$ denotes the collection of all weights in the network, $b$ all the biases, $n$ is the total number of training inputs, $a$ is the vector of outputs from the network when $x$ is the input, and the sum is over all the training inputs $x$. Of course, $a$ depends on $x$, $w$ and $b$. The notation $\|v\|$ is the usual length (Euclidean norm) function for a vector $v$.

Notice that this function becomes large when there is a large difference between the network output and the desired output, and small when they are close to each other. In other words, this function becomes large when our network approximates badly and small when the approximation is accurate. We have to find a set of weights and biases so as to minimize the cost function.
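The quadratic cost can be computed directly from its definition. The two 3-dimensional outputs below are made up purely as an example:

```python
import numpy as np

def quadratic_cost(outputs, targets):
    # C(w, b) = 1/(2n) * sum_x || a - y(x) ||^2
    n = len(outputs)
    return sum(np.linalg.norm(a - y) ** 2 for a, y in zip(outputs, targets)) / (2 * n)

# Two toy training examples with 3-dimensional outputs
a1, y1 = np.array([0.8, 0.1, 0.1]), np.array([1.0, 0.0, 0.0])
a2, y2 = np.array([0.2, 0.7, 0.1]), np.array([0.0, 1.0, 0.0])
print(quadratic_cost([a1, a2], [y1, y2]))  # (0.06 + 0.14) / (2 * 2) = 0.05
```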

Given below is a list of some loss functions available in the Keras documentation which can be used in computing the network's loss:

- Mean Absolute Error
- Mean Absolute Percentage Error
- Squared Hinge
- Log-Cosh
- Categorical Crossentropy
- Binary Crossentropy
- Poisson
- Cosine Proximity

## Gradient descent

Gradient descent is the standard method used in *optimization* problems [1]. It relies on concepts from multivariable calculus. In our neural network we want to minimise the cost function, so as to minimize the error and predict accurately. For this purpose, we need to learn how to minimize a multivariable function, and Gradient Descent is a prime example of such a method.

**Definition 1.** The Gradient of a differentiable function $f:\mathbb{R}^n\rightarrow\mathbb{R}$ at a point $x = (x_1,x_2,\ldots,x_n) \in \mathbb{R}^n$ is a vector of the form

$\nabla f(x) = \bigg(\frac{\partial f}{\partial x_1}(x),\frac{\partial f}{\partial x_2}(x),\ldots,\frac{\partial f}{\partial x_n}(x)\bigg)$

It is well known that, given a point $x\in \mathbb{R}^n$, the gradient at that point indicates the direction of steepest ascent. Given that $f$ is differentiable at $x$, it can be approximated linearly, and therefore the vector $-\nabla f(x)$ indicates the direction of steepest descent of the function $f$ at the point $x$.

In order to obtain the minimum value of the function, the gradient descent strategy tells us to start at a given $x_0\in\mathbb{R}^n$, calculate the value of $\nabla f(x_0)$, and then proceed to calculate a new point $x_1:= x_0- \eta \nabla f(x_0)$, where $\eta > 0$ is called the *learning rate*. We then repeat this process, creating a sequence $\{x_i\}$ defined by our initial choice of $x_0$, the learning rate $\eta$, and the rule $x_{i+1} = x_i -\eta\nabla f(x_i)$ for any $i\in\mathbb{N}$. This sequence continues until we approach a region close to our desired minimum.

To get a better intuition of what is written above, suppose $f = f(v_1,v_2,\ldots,v_n)$ is a multivariable function. Small changes in each variable cause a change in $f$ which can be approximated as:

$\Delta f \approx \frac{\partial f}{\partial v_1}\Delta v_1 + \frac{\partial f}{\partial v_2}\Delta v_2 + \ldots + \frac{\partial f}{ \partial v_n}\Delta v_n$

Let $\nabla f$ denote the gradient of $f$ and $\Delta v = (\Delta v_1,\Delta v_2,\ldots,\Delta v_n)^T$, where $T$ is again the transpose operation. Then

$\nabla f = \bigg( \frac{\partial f}{\partial v_1},\ldots,\frac{\partial f}{\partial v_n} \bigg)^T$

so we can write

$\Delta f \approx \nabla f\cdot\Delta v$

Now, we want to make $\Delta f$ negative, so let us choose

$\Delta v = -\eta\nabla f,\; \eta >0$

then $\Delta f =-\eta \|\nabla f\|^2$ which will always be negative.

To make gradient descent work correctly, we need to choose the learning rate $\eta$ to be small enough that $\Delta f \approx \nabla f\cdot\Delta v$ remains a good approximation. If we don't, we might end up with $\Delta f >0$, which obviously would not be good. At the same time, we don't want $\eta$ to be too small, since that would make the changes $\Delta v$ too small, and thus gradient descent would work slowly.

Our cost function indicates how poorly our neural network approximates a given function. By calculating the gradient of the cost function with respect to the weights and biases of the network, and adjusting these parameters in the direction opposite to the gradient, we decrease our error, and this leads us to an adequate network.
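As a toy illustration of the update rule $x_{i+1} = x_i - \eta\nabla f(x_i)$, the sketch below minimises the simple function $f(v) = \|v\|^2$; the function, starting point, step count and learning rate are arbitrary choices for the example:

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, steps=100):
    # Repeatedly apply x_{i+1} = x_i - eta * grad f(x_i)
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - eta * grad_f(x)
    return x

# Minimise f(v1, v2) = v1^2 + v2^2, whose gradient is (2 v1, 2 v2)
grad = lambda v: 2 * v
x_min = gradient_descent(grad, [3.0, -4.0], eta=0.1, steps=100)
print(x_min)  # very close to the minimum at (0, 0)
```

Each step here shrinks the distance to the minimum by a constant factor; with a larger $\eta$ the iterates would overshoot, illustrating the trade-off described above.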

## Stochastic gradient descent

Let's look back at our quadratic cost function in eq 1

$C(w,b)\equiv\frac{1}{2n}\sum_{x}\| a - y(x)\|^2$

Notice that this cost function has the form

$C = \frac{1}{n}\sum_{x}C_x$

that is, it is an average over costs $C_x \equiv \frac{\|a - y(x)\|^2}{2}$ for individual training examples.

In practice, to compute the gradient $\nabla C$ we need to compute the gradient $\nabla C_x$ separately for each training input $x$ and then average them: $\nabla C = \frac{1}{n}\sum_{x}\nabla C_x$. Unfortunately, when the number of training inputs is very large this can take a long time, and learning thus occurs slowly.

Here comes the idea of *Stochastic Gradient Descent* to speed up learning. The idea is to estimate the gradient $\nabla C$ by computing $\nabla C_x$ for a small sample of randomly chosen training inputs. By averaging over this small sample, it turns out that we can quickly get a good estimate of the true gradient $\nabla C$. Let us pick $m$ training inputs at random, label them $X_1,X_2,X_3,\ldots, X_m$, and refer to them as a *mini-batch*. Provided that the sample size is large enough, we can expect that the average value of $\nabla C_{X_j}$ will be roughly equal to the average over all $\nabla C_x$, that is,

$\frac{\sum_{j=1}^{m}\nabla C_{X_j}}{m} \approx \frac{\sum_{x} \nabla C_x}{n} = \nabla C$

where the second sum is over the entire set of training data. So

$\nabla C \approx \frac{\sum_{j=1}^{m}\nabla C_{X_j}}{m}$

confirming that we can estimate the overall gradient by computing gradients just for the randomly chosen mini-batch.

Then we pick out another randomly chosen mini-batch and train with it, and so on, until we've exhausted the training inputs, which is said to complete an *epoch* of training. Given below is a list of the optimizers available in the Keras documentation which can be used to optimize the learning process:

- SGD (Stochastic Gradient Descent)
- RMSprop
- Adagrad
- Adadelta
- Adam
- Adamax
- Nadam
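The mini-batch procedure described above can be sketched as follows; the one-parameter cost $C_x = (w - x)^2/2$ and all the numbers here are made up purely for illustration:

```python
import numpy as np

def minibatches(data, batch_size, rng):
    # One epoch: shuffle the training data and cut it into mini-batches
    idx = rng.permutation(len(data))
    for start in range(0, len(data), batch_size):
        yield [data[i] for i in idx[start:start + batch_size]]

# Toy gradient: grad C_x at parameter w for a data point x,
# where C_x = (w - x)^2 / 2, so grad C_x = w - x
grad_Cx = lambda w, x: w - x

rng = np.random.default_rng(1)
data = list(rng.standard_normal(1000))  # toy "training inputs"
w = 5.0     # initial parameter
eta = 0.5   # learning rate
for batch in minibatches(data, batch_size=10, rng=rng):
    # Estimate grad C by averaging grad C_x over the mini-batch only
    g = sum(grad_Cx(w, x) for x in batch) / len(batch)
    w -= eta * g
print(w)  # after one epoch, w has moved close to the minimiser (the data mean)
```

Each update uses only 10 of the 1000 points, yet the parameter still converges toward the full-data minimiser, which is exactly the estimate $\nabla C \approx \frac{1}{m}\sum_j \nabla C_{X_j}$ at work.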

## Backpropagation

With advancements in technology and the data science literature, artificial neural networks became a major part of Artificial Intelligence after the arrival of the "Backpropagation" technique. This technique allows us to obtain the derivative of the loss function with respect to each of the network's weights and biases; gradient descent can then be used to update these parameters in an informed manner to improve the predictive power of the network.

The explanation in this section is a simple introduction to the method of backpropagation [1] in feedforward neural networks. For a more comprehensive understanding of the backpropagation algorithm, the second chapter of the online book *Neural Networks and Deep Learning* is a great reference.

The big challenge of applying gradient descent to neural networks is calculating the partial derivatives of the cost function with respect to each individual weight and bias (namely $\frac{\partial C}{\partial W_{i,j}^l}$ and $\frac{\partial C}{\partial b_{j}^l}$). This is where backpropagation comes in. The algorithm first tells us how to calculate these values for the last layer of connections; with these results, it then inductively goes "backwards" through the network, calculating the partial derivatives of each layer until it reaches the first layer of the network. Hence the name "backpropagation".

In a previous section, we defined $A_n$ to be the vector representing the activations of the nodes in the n^{th} layer of the network. However, for the purposes of this section, it will be useful for us to consider what the values sent by the previous layer were before the activation function was applied. Consider

$z_j^l := \sum_{k}W_{j,k}^l a_{k}^{l-1} + b_{j}^l, \qquad a_j^l = \sigma(z_j^l), \qquad A^l = \sigma(Z^l)$

In the previous equation, $Z^l := \sum_{k} z_{k}^l e_k$ is a vector with entries corresponding to the values $z_{j}^l$ (the $e_k$ are the standard basis vectors). Additionally, consider the quantities

$\delta _{j}^l := \frac{\partial C}{\partial z_j^l} \qquad \text{and} \qquad \Delta ^l := \sum_{k} \delta _{k}^l e_k$

These values will be useful for propagating the algorithm backwards through the network, and are directly related to $\frac{\partial C}{\partial W_{i,j}^l}$ and $\frac{\partial C}{\partial b_{j}^l}$ by the chain rule, since

$\frac{\partial C}{\partial w_{i,j}^l} = \frac{\partial C}{\partial z_{i}^l} \frac{\partial z_{i}^l}{\partial w_{i,j}^l} = \delta _{i}^l a_{j}^{l-1} \qquad \text{and} \qquad \frac{\partial C}{\partial b_{j}^l} = \frac{\partial C}{\partial z_{j}^l} \frac{\partial z_{j}^l}{\partial b_{j}^l} = \delta _{j}^l$

Since $a_j^{l-1}$ is readily available for any node of the network, if we are able to calculate the values of the $\delta _{j}^l$'s we will have succeeded in obtaining our gradient! Our first step is calculating this value for the last layer of the network, that is, $\delta _{j}^L$ for a network with $L$ layers. It is not hard to notice that, since $a_{j}^L = \sigma(z_{j}^L)$, again by the chain rule

$\delta _{j}^L = \frac{\partial C}{\partial a_{j}^L} \frac{\partial a_{j}^L}{\partial z_{j}^L} = \frac{\partial C}{\partial a_{j}^L} \sigma '(z_{j}^L)$

Using the Mean Square Error function, noticing that the network output is $A^L = (a_1^L, \cdots, a_k^L)$ and letting the desired output be $y = (y_1, \cdots, y_k)$, we get that

$\delta _{j}^L = (a_{j}^L-y_j)\sigma'(z_j^L)$

which can easily be calculated by a computer if we know how to calculate $\sigma'$ (which should be true for any practical activation function). In order to obtain this result in vector form, it is common to write

$\Delta^L = \nabla_{A^L} C \odot \sigma'(Z^L)$

where $\nabla_{A^L} C := \bigg( \frac{\partial C}{\partial a_1^L}, \cdots, \frac{\partial C}{\partial a_k^L} \bigg)$ is the gradient of $C$ taken with respect to the elements of $A^L$, and $\odot$ is the Hadamard product, which multiplies two matrices (or vectors) elementwise.

Now we only need to "propagate" this backwards in the network in order to obtain $\delta_j^{L-1}$. To do so, apply the chain rule once again:

$\delta_j^{L-1} = \frac{\partial C}{\partial z_{j}^{L-1}} = \sum_{i=1}^{k} \frac{\partial C}{\partial z_i^L}\cdot\frac{\partial z_i^L}{\partial z_j^{L-1}} = \sum_{i=1}^{k} \delta_i^L \,\frac{\partial z_i^L}{\partial z_j^{L-1}}$

If we focus on the term $\frac{\partial z_i^L}{\partial z_j^{L-1}}$, we find that

$\frac{\partial z_i^L}{\partial z_j^{L-1}} = \frac{\partial \big(\sum_{k} w_{i,k}^L a_k^{L-1} + b_i^L\big)}{\partial z_j^{L-1}}\\ = \frac{\partial\big(\sum_{k} w_{i,k}^L \sigma(z_k^{L-1}) + b_i^L\big)}{\partial z_j^{L-1}}\\ = \frac{\partial\big(w_{i,j}^L \sigma(z_j^{L-1})\big)}{\partial z_j^{L-1}}\\ = w_{i,j}^L \cdot \sigma'(z_j^{L-1})$

which, again, can be easily calculated by a computer given the network. Therefore

$\delta_j^{L-1} = \sum_{i=1}^{k} \delta_i^L\, w_{i,j}^L\, \sigma'(z_j^{L-1})$

This formula tells us how to calculate any $\delta_j^l$ in the network, assuming we know $\Delta^{l+1}$. Since we know how to jump-start the process with $\Delta^L$ in the last layer of the network, we are done, and the algorithm is successful.

Summarizing what was done in this section, we first defined $\delta_j^l := \frac{\partial C}{\partial z_j^{l}}$, which has a direct relationship with our desired values of $\frac{\partial C}{\partial w_{i,j}^l}$ and $\frac{\partial C}{\partial b_j^l}$. We then wrote down what the $\delta_j^L$'s should be for the last layer of the network, and finally developed a way to calculate all the $\delta_j^l$'s, given that we know the values of the $\delta_j^{l+1}$'s. Thus, by propagating this method backwards through the layers of the network, we are able to find all our desired partial derivatives, and can therefore calculate the value of $\nabla C$ as a function of the weights and biases of the network and execute the method of gradient descent.
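The whole procedure can be summarised in a short NumPy sketch for a small sigmoid network with the quadratic cost. The layer sizes are arbitrary, and this is a simplified single-example version for illustration, not the paper's actual implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def backprop(x, y, weights, biases):
    """Return (dC/dW, dC/db) for one training example, for the
    quadratic cost C = ||a^L - y||^2 / 2 and sigmoid activations."""
    # Forward pass: store all z^l and a^l
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # Output layer: delta^L = (a^L - y) * sigma'(z^L)  (Hadamard product)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    grads_W[-1] = delta @ activations[-2].T
    grads_b[-1] = delta
    # Propagate backwards: delta^l = (W^{l+1}^T delta^{l+1}) * sigma'(z^l)
    for l in range(len(weights) - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * sigmoid_prime(zs[l])
        grads_W[l] = delta @ activations[l].T
        grads_b[l] = delta
    return grads_W, grads_b

# Tiny network: 3 inputs -> 4 hidden nodes -> 2 outputs
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [rng.standard_normal((4, 1)), rng.standard_normal((2, 1))]
x, y = rng.random((3, 1)), np.array([[1.0], [0.0]])
gW, gb = backprop(x, y, weights, biases)
print([g.shape for g in gW])  # gradient shapes match the weight shapes
```

A useful sanity check on such code is to compare one of the returned partial derivatives against a finite-difference estimate of the cost; the two should agree to several decimal places.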

## CONVOLUTIONAL NEURAL NETWORK

Computer vision is a field of study in which we try to enable machines to view the world as we humans do, perceive it in a similar manner, and use that knowledge for a multitude of tasks such as image and video recognition, image analysis and classification, media recreation, natural language processing, etc. One of the major advancements in this field is due to the introduction of the *Convolutional Neural Network*.

A Convolutional Neural Network (CNN or ConvNet) is a deep, feed-forward artificial neural network. It is specifically inspired by the biological visual cortex, which has small regions of cells that are sensitive to specific areas of the visual field, like edges and corners. Fig 5 shows the complete flow of a CNN processing an input image and classifying it based on the computed values: the image is fed as input to the network, goes through multiple convolutions, subsampling and a fully connected layer, and finally produces an output.

Why a CNN over an FNN? An image is a matrix of pixel values. In an FNN, we simply flatten the image and feed it to an MLP for classification. FNNs show good precision scores on extremely basic binary images, but are not accurate on complex images having pixel dependencies throughout. A CNN is able to successfully capture the spatial and temporal dependencies in an image through the application of relevant filters. The architecture fits the image dataset better due to the reduction in the number of parameters involved and the reusability of weights.

## Structure

A CNN provides an abstract representation of the data at each stage of the network, with layers designed to detect specific features of the input. The hidden layers closer to the output of a deep network have highly interpretable representations, such as faces, clothing, etc. The first layers of the network, however, detect very basic features such as corners, curves, and so on.

## Convolutional layer

The convolutional layer computes the outputs of neurons that are connected to local regions, or receptive fields, in the input, each computing a dot product between its weights and the small receptive field to which it is connected in the input volume. The element that carries out the convolution operation in the first part of a convolutional layer is called the *kernel* or *filter*. Each computation leads to the extraction of a feature map from the input image. In other words, suppose we have an image represented as a 5x5 matrix of values, and we take a 3x3 matrix and slide that 3x3 kernel around the image. The *stride* is the number of pixels by which we shift the kernel over the input matrix; when stride = 1, we move the filter 1 pixel at a time, and so on. At each position of the kernel, we multiply the values of the 3x3 window by the values in the image currently covered by the window, and sum them up. As a result, we obtain a single number that represents all the values in that window of the image. This layer performs filtering: as the window moves over the image, the filter checks for patterns in that section of the image. Fig 6 demonstrates the convolution process.

There are two types of results of the operation: one in which the convolved feature is reduced in dimensionality as compared with the input, and one in which the dimensionality either increases or remains the same, depending on how the filter fits the image. The first is achieved by applying *valid padding*, in which we drop the positions where the filter does not fit inside the picture. The second is achieved by *same padding*, in which we pad the picture with zeroes.
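A minimal sketch of the convolution operation with valid padding, for the 5x5 image and 3x3 kernel example mentioned above (the averaging kernel and the pixel values are arbitrary choices for illustration):

```python
import numpy as np

def convolve2d_valid(image, kernel, stride=1):
    # Slide the kernel over the image; at each position, multiply the
    # covered window elementwise by the kernel and sum. Valid padding:
    # positions where the kernel does not fit are simply dropped.
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(window * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # a toy 5x5 "image"
kernel = np.ones((3, 3)) / 9.0                    # a 3x3 averaging filter
feature_map = convolve2d_valid(image, kernel)
print(feature_map.shape)  # (3, 3): valid padding shrinks the 5x5 input
```

With same padding, one would first surround `image` with a border of zeroes so that the output keeps the 5x5 shape.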

## Subsampling

The objective of subsampling is to obtain a reduced-dimension representation of the input, which helps in reducing overfitting. One of the techniques of subsampling is *max pooling*. Fig 7 shows the working of different pooling techniques.

With this technique, we select the highest pixel value from each region, depending on the pooling size. In other words, max pooling takes the largest value from the window of the image currently covered by the kernel. For example, a max-pooling layer of size 2x2 will select the maximum pixel intensity value from each 2x2 region. Two other techniques are (i) *sum pooling* and (ii) *average pooling*.
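Max pooling with a 2x2 window can be sketched as follows (the feature-map values are made up for illustration):

```python
import numpy as np

def max_pool(feature_map, size=2):
    # Take the largest value in each non-overlapping size x size region
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            out[i // size, j // size] = feature_map[i:i+size, j:j+size].max()
    return out

fm = np.array([[1.0, 3.0, 2.0, 4.0],
               [5.0, 6.0, 1.0, 2.0],
               [7.0, 2.0, 9.0, 1.0],
               [3.0, 4.0, 0.0, 8.0]])
print(max_pool(fm))  # each 2x2 block collapses to its maximum
```

Sum and average pooling would replace `.max()` with `.sum()` or `.mean()` respectively.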

We have successfully enabled the model to understand the features. Moving on, we are going to flatten the final output and feed it to a regular neural network for classification purposes.

## Fully connected layer

Now that we have converted our input image into a suitable form for our Multi-Layer Perceptron, we flatten the image into a column vector. The flattened output is fed to a feed-forward neural network, and backpropagation is applied in every iteration of training. Over a series of epochs, the model learns to distinguish between dominating and certain low-level features in images, and classifies them using the Softmax classification technique.

## IMPLEMENTATION AND CASE STUDY OF EFFICIENCY

Choosing architectures for neural networks is not an easy task. We want to select a network architecture that is large enough to approximate the function of interest, but not so large that it takes an excessive amount of time to train. Another issue with large networks is that they require large amounts of data to train; you cannot train a neural network on a hundred data samples and expect it to get 99% accuracy on an unseen data set. In general, it is good practice to use multiple hidden layers as well as multiple nodes within the hidden layers, as these seem to result in the best performance.

In this section, we include a case study of the most classic example of neural network implementation: the MNIST database of handwritten digits. The MNIST database is composed of 70,000 pictures, each of which is made up of a 28x28 grid of greyscale pixels. Each pixel stores a value between 0 and 1, corresponding to the shade of the region, with 0 indicating a completely white square and 1 indicating a completely black square. Our objective is to create a neural network which is able to classify which digit is represented in a picture. This is a great application of neural networks, as it would be very difficult to write down an explicit function to recognize the digits.

## Procedure

In order to experiment with parameters such as depth, width, learning rate and mini-batch size, we use a Python 3.7 implementation of the algorithm presented at the end of the first chapter of the book Neural Networks and Deep Learning by Michael A. Nielsen. In this implementation, Stochastic Gradient Descent has been employed, which ensures that training is carried out in a randomly uniform way across the training data, making it less likely for the network to specialize in specific images and perform poorly on others.

In summary, we set aside 10,000 of the 70,000 pictures as the test set, leaving 60,000 for training. We also divide the training process into epochs, as described in the previous section. In each epoch, we shuffle the training set and partition it into groups of a given size (the mini-batch size). For each mini-batch, we run backpropagation on its pictures and update the network by adding up the contributions of the individual pictures to the gradient of the cost function; the epoch ends once every mini-batch has been processed. We repeat this process for a given number of epochs until training is done. After this step, we test the network on all 10,000 pictures in the test set without updating the network. This separation ensures that the network is not merely memorizing the desired outputs of the training set but is able to generalize beyond it. By keeping the two sets separated, we make sure that our test is independent of the training data used to shape the network.
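The training loop described above can be sketched on a toy problem. The sketch below follows the same pattern (shuffle, split into mini-batches, sum the individual gradient contributions, take a step) but fits a single weight to the line y = 2x instead of the MNIST network; the function names and the quadratic cost are assumptions made for illustration.

```python
import random

def sgd(grad, params, data, epochs, mini_batch_size, eta):
    """Stochastic gradient descent in the spirit of the procedure above."""
    for _ in range(epochs):
        random.shuffle(data)  # randomly uniform pass over the training data
        for k in range(0, len(data), mini_batch_size):
            batch = data[k:k + mini_batch_size]
            # Sum the contributions of the individual samples to the gradient.
            grads = [grad(params, x, y) for x, y in batch]
            mean = [sum(g) / len(batch) for g in zip(*grads)]
            params = [p - eta * g for p, g in zip(params, mean)]
    return params

# Toy problem: fit y = 2x with one weight w and quadratic cost (w*x - y)**2.
def grad(params, x, y):
    (w,) = params
    return [2 * (w * x - y) * x]

data = [(0.1 * i, 0.2 * i) for i in range(1, 11)]
(w,) = sgd(grad, [0.0], data, epochs=50, mini_batch_size=2, eta=0.5)
print(round(w, 3))  # close to 2.0
```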

For all networks in this section, the measure of success is the number of correct guesses on the test set after the specified number of epochs.

## Comparison Result

## Learning rate

In order to test how the learning rate of a network affects its accuracy, we used two structures:

- A shallow network structure of 2 hidden layers with 30 nodes each.
- Deep network structures with (i) 6 hidden layers of width 5, (ii) 10 hidden layers of width 5, and (iii) 10 hidden layers of width 10.

Each point in Fig 8 represents a network trained for 15 epochs with a mini-batch size of 10.

For the shallow network, accuracy increases with the learning rate at first but then stagnates. For deep networks the picture is different. The network with 6 hidden layers of width 5 achieves good accuracy at lower learning rates, and accuracy fades as the learning rate increases. For the network with 10 hidden layers of width 5, the learning rate seems to have no effect. For the network with 10 hidden layers of width 10, accuracy is higher at lower learning rates. More will be said about the influence of the learning rate on accuracy when we discuss depth.

## Width

In an analogous fashion to the previous section, we fix the depth and learning rate and vary the width of the hidden layers in order to examine what relationship this parameter has with the final network's accuracy. In Fig 9, results are shown for two network structures:

- a shallow network with 2 hidden layers of equal (varied) width, with LR = 3.
- a deep network with 10 hidden layers of equal (varied) width, for different LRs.

These results show a very clear relationship between accuracy and width of the network, where increasing the number of nodes in each layer increases the number of successes. In the context of 'thinner' networks, adding nodes to a layer makes a very big difference, but this gain diminishes as the network grows bigger.

At first sight, this might lead the reader to conclude that, in any context, having shallow but wider networks means better results. However, this is not the case. Results are less encouraging when the data from Figure 7(a) is extended to include considerably wider networks: for the shallow network, accuracy drops as the width increases further.

The accuracy of the networks does increase as a function of width, up to roughly 100 nodes per hidden layer. After this threshold, network accuracy varies significantly, in most cases for the worse. This is because a wider network has many more weights and biases to adjust, so the learning process happens more slowly. In this specific case of networks with two hidden layers of width $n$, 784 inputs and 10 outputs, the number of weights is $784n + n^2 + 10n = n^2 + 794n$ and the number of biases is $n + n + 10 = 2n + 10$, which means that the total number of parameters to be adjusted is $n^2 + 796n + 10$. Because of this squared dependency on $n$, larger networks require many more epochs of training in order to achieve their full potential.
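The parameter count above can be checked with a short helper; this is a sketch for this specific two-hidden-layer architecture only.

```python
def parameter_count(n, inputs=784, outputs=10):
    """Parameters of a network with two hidden layers of width n."""
    weights = inputs * n + n * n + n * outputs   # n^2 + 794n
    biases = n + n + outputs                     # 2n + 10
    return weights + biases                      # n^2 + 796n + 10

print(parameter_count(30))    # 24790
print(parameter_count(100))   # 89610 -- the squared term starts to dominate
```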

Not only that, but each epoch requires much more computational power in larger networks. However, the large number of parameters also means that the network is capable of expressing a larger class of functions, which is indeed shown when considering networks trained over more epochs. This is shown in Figure 10, which compares networks with 50, 100 and 200 nodes in each hidden layer across different epochs. As expected, larger networks require more training epochs to achieve their full potential, but once this is obtained, they do perform better than smaller networks. In this specific case, however, the gain is noticeable but not overwhelming, while the time required to train the larger networks is much greater.

In the case of the deep network, accuracy increases up to some threshold value, after which a change in behaviour is observed: accuracy increases for lower learning rates, and the gain is diminished for higher learning rates.

## Mini batch size

As discussed in the previous section, the mini-batch size is the number of randomly selected image samples used to calculate the overall gradient of the cost function. Fig 11 shows the dependency of the network's success on the mini-batch size. For shallow networks, the mini-batch size does not affect accuracy much, with a slight improvement as it increases. In the case of the deep network, the network predicts correctly up to a threshold value, after which fluctuations are observed.

## Depth

"Deep Learning" is a term which became popular in machine learning over the past few years, with many news articles praising successful applications of the method. This technique simply means that many layers are being used in a neural network (although what qualifies as "many" is not always well defined).

Although this method has indeed been incredibly successful in many important applications, the reader should not be misled into believing that more layers always mean better accuracy. For instance, consider the results exhibited in Fig 12, where networks with different depths were compared.

After a certain point, having more layers in the network dramatically decreases its accuracy, to a point where it converges to 10%, which is equivalent to just randomly guessing what digit is in the picture! A reasonable thought to have after reading the previous section is to conclude that this is because the network is larger, and therefore requires more epochs of training in order to achieve its true potential. However, further examination shows that even after 40 epochs, the network does not show any sign of improvement. However, observe in Fig 13, what happens when we lower the learning rate of the network.

Each curve presented in Fig 13 relates to a different region of influence of the learning rate. In the first one, represented by 0.3, the learning rate is appropriate and, given enough training epochs, results in a successful network. In the second one, represented by 3.0, the learning rate is so big that the steps taken by the gradient descent technique completely skip over any minimum of the cost function, turning the network's output into random guesses. The final one, represented by 1.0, is an intermediary region, where steps are small enough that the outcome of the network is definitely better than random guesses, but still too big, making the network oscillate intensely across one or more minima of the cost function and resulting in unreliability.

Why does the deeper network require a smaller learning rate in order to function properly when compared to shallower networks? The answer lies in the different types of operations carried out when shaping the network's function: in shallow but wide networks, the primary operations influencing the outcome are affine combinations, which are determined by the weights and biases across layers; in deep but thin networks, function composition is the key, which means that each individual weight (especially those closer to the input layer) has a very large influence on the outcome of the network. As a result, in deeper networks we need smaller increments in gradient descent in order to finely tune the network to the desired dataset.
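The three regimes can be mimicked on a toy one-dimensional cost. The sketch below runs plain gradient descent on C(w) = w^2 (gradient 2w), so each step multiplies w by (1 - 2*eta); it is an illustrative analogy under those assumptions, not the MNIST network itself.

```python
def gradient_descent(eta, steps=20, w0=1.0):
    """Iterate gradient descent on the toy cost C(w) = w**2."""
    w = w0
    for _ in range(steps):
        w -= eta * 2 * w          # each step multiplies w by (1 - 2*eta)
    return w

print(abs(gradient_descent(0.3)))  # tiny: converges to the minimum at w = 0
print(abs(gradient_descent(1.0)))  # stays at 1.0: oscillates across the minimum
print(abs(gradient_descent(3.0)))  # huge: steps overshoot and diverge
```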

## REFERENCES

- Leonardo Ferreira Guilhoto. *An Overview of Artificial Neural Networks for Mathematicians*. 2018.
- Michael A. Nielsen. *Neural Networks and Deep Learning*. Determination Press, 2018.
- B. Hammer and T. Villmann. *Mathematical Aspects of Neural Networks*. d-side publications, 2003.
- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht and Oriol Vinyals. *Understanding Deep Learning Requires Rethinking Generalization*. ICLR 2017.
- Sergios Theodoridis and Konstantinos Koutroumbas. *Pattern Recognition*. Fourth Edition, Academic Press, an imprint of Elsevier, 2009.
- Simon Haykin. *Neural Networks and Learning Machines*. Third Edition, PHI Learning Private Limited, 2012.
- Jacek M. Zurada. *Introduction to Artificial Neural Systems*. Jaico Publishing House, 2006.
- https://www.datacamp.com/community/tutorials/convolutional-neural-networks
- Understanding convolutional neural networks - medium.com

## ACKNOWLEDGEMENT

It is a pleasure to acknowledge the fine assistance and sincere guidance I received at IIST Trivandrum during the two-month internship, which helped me improve my computer and programming skills and enrich my knowledge. I would like to express my sincere gratitude to my supervisor, Dr. Deepak Mishra, Associate Professor, for providing invaluable guidance, comments and suggestions throughout the course of my work. I thank him for constantly motivating me to keep practicing and work harder.

## APPENDIX A Data

Since the algorithm employed used the stochastic gradient descent technique and is, therefore, partially random, the data displayed in the graphs of Section 4 is detailed below for the sake of completeness. In order to retain compactness, the following abbreviations are used: L.R. (Learning Rate); C.G. (Correct Guesses); E.C. (Epochs Completed); Wd. (Width); Dp. (Depth); B.S.(mini-batch size).
