Summer Research Fellowship Programme of India's Science Academies

An overview of artificial neural networks

Umashankar Sahu

Powergrid Hostel, Centre For Basic Sciences, Pt. Ravishankar Shukla University, Raipur 492010, Chhattisgarh

Dr Deepak Mishra

Indian Institute of Space Science & Technology, Thiruvananthapuram, Kerala


This paper aims to acquaint the reader with artificial neural networks and two of their main variants. It first defines what an artificial neural network is and then describes the key ideas behind it, such as weights, biases, activation functions and backpropagation. We first look at the mathematical view of a feedforward neural network and how its learning process occurs, and then give a brief introduction to convolutional neural networks. Finally, in order to build intuition, a case study on the MNIST dataset of handwritten digits is carried out, examining how parameters such as learning rate, width, depth and mini-batch size affect the network's accuracy.

Keywords: artificial neural network, convolutional network, feedforward network


Given below is the list of abbreviations used in the paper.

FNN: FeedForward Neural Network
MLP: Multi Layer Perceptron
CNN: Convolutional Neural Network
LR: Learning Rate
MNIST: Modified National Institute of Standards and Technology


This two-month internship at IIST Trivandrum aimed at gaining an insight into artificial neural networks. We are, in some way or the other, surrounded by elements of Artificial Intelligence. A major part of it is Machine Learning, in which we try to make machines see the world as we humans do. A big success in this area has been achieved due to the introduction of Deep Learning and Artificial Neural Networks (ANNs). A deep study was made in order to gain knowledge about ANNs. Some basic ideas like weights, biases, backpropagation and the gradient descent method were learned. The objective was to learn techniques for pattern recognition, image processing and classification, and I came to learn about FeedForward Neural Networks (FNNs) and Convolutional Neural Networks (CNNs). During this internship, I also improved my technical and coding skills, particularly in Python, about which I had no idea before coming here.
Throughout the duration of the course, theoretical study as well as hands-on practical implementation was encouraged. After many tutorials and readings, I learned about various activation functions and the mathematics behind them, as well as different optimization techniques in neural networks. I also came to know about Jupyter Notebook, the keras function library and several other technical specialities. In order to gain a sound understanding of what is happening, a case study of the MNIST dataset of handwritten digits was carried out. Data collection and analysis were performed, and eventually I gained a working knowledge of deep learning.

    An Overview of the Work

    Different types of Artificial Neural Network, like Feed Forward, Convolutional, Radial Basis Function (RBFN) and Bayesian Neural Networks, were introduced. I successfully implemented an FNN and a CNN, and had introductory lessons on RBFNs and Bayesian Neural Networks without implementation. I also tried to learn how to enhance the capability of a neural network but did not arrive at any conclusion. In the future, a detailed study of RBFNs and Bayesian neural networks, with their implementation, is intended. I will also try to learn the mathematical aspects of enhancing a neural network's capability.

    In all, this internship proved fruitful to me in acquiring good technical and coding skills and gaining an insight into the world of Artificial Intelligence.


    Artificial Intelligence has been a popular topic in science fiction and news articles for at least a few decades, and is a futuristic area for research and innovation. It is the simulation of human intelligence processes, like learning, reasoning and self-correction, by machines, especially computer systems. Many algorithms of this category are already incorporated in people's everyday lives: self-driving cars, automatically generated image captions on search engines, recommendation algorithms, ridesharing apps like Uber, location prediction on Google Maps, social networking, online shopping and many more. A loose definition of this field is "a program that does something smart", with the meaning of "smart" changing throughout the years. More recently the term has overwhelmingly been used to refer to machine learning algorithms, which can also be called data-driven algorithms since they are nothing more than algorithms that adjust some of their parameters in response to a data set. In machine learning, the system learns things without being explicitly programmed to do so. Among these algorithms, one type is currently especially popular: Deep Learning, in which machines think like human brains using Artificial Neural Networks.

    Algorithms that fall under this term are more often than not used to mimic some sort of action which could be executed by human beings, but for which we lack the mathematical tools to explicitly model in a fully applicable way. They are brain-inspired systems intended to replicate the way we humans learn, and they are excellent tools for finding patterns which are far too complex or numerous for a human programmer to extract and teach the machine to recognize. In the conventional approach to programming, we tell the computer what to do, breaking big problems into many small, precisely defined tasks that the computer can easily perform. By contrast, with an artificial neural network we don't tell the computer how to solve the problem. Instead it learns from observational data, figuring out its own solution to the problem. One of the classic examples of this is image recognition and classification. For example, recognizing a handwritten digit irrespective of its translation, rotation or specific configuration is not a big task for a human, as we carry a supercomputer in our head. The difficulty of visual pattern recognition becomes apparent if we attempt to write a program explicitly to recognize digits. Simple intuitions about how we recognize shapes - 'a 9 has a loop at the top and a vertical stroke in the bottom right' - turn out to be not so simple to express algorithmically. When we try to make such rules precise, we quickly get lost in a morass of exceptions, caveats and special cases.

    The MNIST dataset is a collection of 70,000 scanned images of handwritten digits, together with their correct classifications. Some of the digits are shown in fig 2.

      Example digits from the MNIST dataset.

      The data comes in two parts. The first part contains 60,000 images to be used as training data. These images are scanned handwriting samples from 250 people, half of whom were US Census Bureau employees and half of whom were high school students. The images are greyscale and 28 by 28 pixels in size. The second part of the MNIST dataset contains 10,000 images to be used as test data, taken from a different set of 250 people, to give us confidence that our system can recognize digits from people whose writing it didn't see during training.

      There are many types of artificial neural networks, such as Recurrent networks (often used for speech and audio recognition), Convolutional networks (often used for image classification), Radial Basis Function Networks (RBFN) and Bayesian Networks. This paper focuses on feedforward networks and convolutional networks.

      FeedForward Neural Network


      Nodes and layers

      A Feedforward Neural Network (FNN) is an algorithm which works in layers, or waves. These models are called "feed-forward" because information flows straight through the model; there are no feedback connections in which outputs of the model are fed back into itself. An FNN consists of an input layer, an output layer and some layers between them known as hidden layers. The term 'hidden' has no deep philosophical or mathematical significance; it only means 'not an input or an output'. In each layer there are a given number of nodes, which are sometimes called perceptrons or even neurons (in analogy with the biological brain). A visual representation of an FNN with two hidden layers is presented in fig 3.

      Nodes are simply representations of numbers stored in the program. From the input layer all the way to the output layer, the numerical value stored in a node is called the activation of the node. The activations in the nth layer are used to compute the activations in the (n+1)th layer. Mathematically, the activations stored in a layer with k nodes can be represented by a vector of dimension k.

        A visual representation of a FeedForward Neural Network

        In order to refer to the structure of a specific network, the term depth is commonly used for the number of hidden layers in the network. Hence when we say 'deep network' or 'deep learning', it simply means that a specific network has a large number of layers. Analogous to this term, the less commonly used term width refers to how many nodes there are in each layer of the network. The term is well defined for networks in which all hidden layers have the same number of nodes (the width is then this number), but not so much for networks in which the number of nodes varies from layer to layer.

        Weight, Bias and Activation Functions

        Given the values stored in the nth layer of the network (with, say, k nodes), how are the values of the next layer (with, say, j nodes) calculated? Each neuron in the nth layer has its own activation, and it is given a specific connection strength to each node in the next layer. This connection strength is called a weight, which is a real number. As notation, $W_{a,b}^n \in \mathbb{R}$ is the weight connecting the bth node in the (n-1)th layer to the ath node in the nth layer. In order to represent all the weights connecting these two layers, we may use a matrix

        $$W_n := \begin{bmatrix}W_{1,1}^n&\cdots&W_{1,k}^n\\\vdots&\ddots&\vdots\\W_{j,1}^n&\cdots&W_{j,k}^n\end{bmatrix}$$

        This system is analogous to the biological system of neurons, in which one neuron firing (transmitting an electrical signal) will fire up (propagate to) nearby neurons. A significant difference between the two systems is that in biological networks a neuron is either activated or not, while in an artificial neural network it may have a continuum of activations, such as a real number between 0 and 1 (as in sigmoid neurons), a non-negative number (as in ReLU networks) or even a complex number. Now, we may also wish to add what is called a 'bias' or 'threshold', which is a constant $b \in \mathbb{R}$, and an 'activation function', which is a function $\sigma:\mathbb{R}\rightarrow\mathbb{R}$. A list of some commonly used activation functions with their derivatives is shown in fig 4; these can be easily accessed when working with keras. The activation function converts the value calculated from the weights and biases into something which can be stored in the network. The bias is a measure of how easy it is to get the neuron to fire. In order to calculate the activations $A_n$ in the nodes of the nth layer, we use the formula

        $$A_n = \sigma(W_n A_{n-1} + b_n) = \sigma\Bigg(\begin{bmatrix}W_{1,1}^n&\cdots&W_{1,k}^n\\\vdots&\ddots&\vdots\\W_{j,1}^n&\cdots&W_{j,k}^n\end{bmatrix}\begin{bmatrix}a_1^{n-1}\\\vdots\\a_k^{n-1}\end{bmatrix} + \begin{bmatrix}b_1^n\\\vdots\\b_j^n\end{bmatrix}\Bigg)$$

        In the above equation, one can see that the activation of a node depends on the weights and biases, which is of great importance. It means that a slight change in a weight or bias causes a slight change in the activation of a node, depending on the choice of activation function. This is crucial for the network's learning.
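As a sketch, the layer update above can be implemented in a few lines of NumPy; the layer sizes (k = 5 inputs, j = 3 nodes) and the choice of sigmoid as $\sigma$ are our own illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(W, a_prev, b):
    """One layer of a feedforward network: A_n = sigma(W_n A_{n-1} + b_n)."""
    return sigmoid(W @ a_prev + b)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))   # weights: j = 3 nodes fed by k = 5 nodes
a_prev = rng.standard_normal(5)   # activations of the previous layer
b = rng.standard_normal(3)        # one bias per node in this layer

a = layer_forward(W, a_prev, b)
print(a.shape)  # (3,)
```

Chaining such calls layer by layer gives the full forward pass of the network.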

          Some commonly used Activation Functions in Neural Networks

          Suppose we are trying to determine whether a handwritten image depicts a '9' or not. Let the image be a 28x28 greyscale pixel image, and let our network take the greyscale values stored in the image pixels as input; then we'd have 784 = 28x28 input neurons. Each neuron in the input layer will contain an activation value, which is a number between 0 and 1 inclusive, depending upon the brightness of the corresponding pixel: 0 means a white pixel and 1 means a black pixel. The activations in this layer determine the activations in further layers. Suppose that for a given set of values of the weights and biases, our network recognizes the digit '9'; this does not mean that the same set of values will be helpful in recognizing other digits like '5'. Hence we need to slightly adjust the weights and biases so as to recognize 9, 5 and the other digits as well. The smoothness of the activation function $\sigma$ means that small changes $\Delta W_j$ in the weights and $\Delta b$ in the bias will produce a small change $\Delta \text{output}$ from the neuron. In fact, calculus tells us that $\Delta \text{output}$ is well approximated by

          $$\Delta \text{output} \approx \sum_j \frac{\partial\, \text{output}}{\partial W_j}\,\Delta W_j + \frac{\partial\, \text{output}}{\partial b}\,\Delta b$$

          Learning Process

          So far we have seen the basic ideas of weights, biases and activation functions. Now we want an algorithm which lets us find weights and biases so that the output from the network approximates $y(x)$ for all the training input images. Let $x$ denote a training input. It is convenient to regard each training input $x$ as a 28x28 = 784 dimensional vector. Each entry in the vector represents the grey value of a single pixel in the image. We will denote the corresponding desired output by $y = y(x)$, where $y$ is a 10 dimensional vector. For instance, if a particular training image $x$ depicts a '6', then $y(x) = (0,0,0,0,0,0,1,0,0,0)^T$ is the desired output from the network. This is the one-hot encoding method. Note that $T$ here is the transpose operation, turning a row vector into a column vector.
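The one-hot encoding described above is a one-liner in NumPy; the function name `one_hot` below is our own, not from any library.

```python
import numpy as np

def one_hot(digit, num_classes=10):
    """Encode a digit label as the 10-dimensional vector y(x)."""
    y = np.zeros(num_classes)
    y[digit] = 1.0   # a single 1 in the position of the correct digit
    return y

print(one_hot(6))  # [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
```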

          Cost function

          In order to determine how good a certain prediction given by the algorithm is, we establish a cost function, also called a loss function, which quantifies how well we are achieving our goal, i.e. measures the discrepancy in our results. Consider this function, known as the 'mean squared error' or 'quadratic cost function':

          $$C(w,b)\equiv\frac{1}{2n}\sum_{x}\| a - y(x)\|^2$$

          Here, $w$ denotes the collection of all weights in the network, $b$ all the biases, $n$ is the total number of training inputs, $a$ is the vector of outputs from the network when $x$ is the input, and the sum is over all training inputs $x$. Of course $a$ depends on $x$, $w$ and $b$. The notation $\|v\|$ is the usual length (Euclidean norm) of a vector $v$.

          Notice that this function becomes large when there is a large difference between the network's output and the desired output, and small when they are close to each other. In other words, the function is large when our network approximates badly and small when the approximation is accurate. We have to find a set of weights and biases that minimizes the cost function.
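The quadratic cost can be computed directly from its definition. The sketch below uses two toy training examples of our own making, not MNIST data:

```python
import numpy as np

def quadratic_cost(outputs, targets):
    """C = (1/2n) * sum over x of ||a - y(x)||^2."""
    n = len(outputs)
    return sum(np.sum((a - y) ** 2) for a, y in zip(outputs, targets)) / (2 * n)

# Toy data: the first prediction is perfect, the second is maximally wrong.
targets = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]
outputs = [np.array([0.0, 1.0]), np.array([0.0, 1.0])]
print(quadratic_cost(outputs, targets))  # (0 + 2) / (2 * 2) = 0.5
```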

          Given below is a list of some loss functions available in the keras documentation which can be used to compute the network's loss:

          • Mean Absolute Error
          • Mean Absolute Percentage Error
          • Squared Hinge
          • Logcosh
          • Categorical Crossentropy
          • Binary Crossentropy
          • Poisson
          • Cosine Proximity

          Gradient descent

          Gradient descent is the standard method used in optimization problems [1]. It relies on concepts from multivariable calculus. In our neural network we want to minimize the cost function so as to minimize the error and predict accurately. For this purpose we need to learn how to minimize a multivariable function, and gradient descent is one of the most prominent examples.

          Definition 1. The gradient of a differentiable function $f:\mathbb{R}^n\rightarrow\mathbb{R}$ at a point $x = (x_1,x_2,\ldots,x_n) \in \mathbb{R}^n$ is the vector

          $$\nabla f(x) = \bigg(\frac{\partial f}{\partial x_1}(x),\frac{\partial f}{\partial x_2}(x),\ldots,\frac{\partial f}{\partial x_n}(x)\bigg)$$

          It is well known that, given a point $x\in\mathbb{R}^n$, the gradient at that point indicates the direction of steepest ascent. Given that $f$ is differentiable at $x$, it can be approximated linearly, and therefore the vector $-\nabla f(x)$ indicates the direction of steepest descent of the function $f$ at the point $x$.

          In order to obtain the minimum value of the function, the gradient descent strategy tells us to start at a given $x_0\in\mathbb{R}^n$, calculate the value of $\nabla f(x_0)$, and then proceed to a new point $x_1 := x_0 - \eta\nabla f(x_0)$, where $\eta > 0$ is called the learning rate. We then repeat this process, creating a sequence $\{x_i\}$ defined by our initial choice of $x_0$, the learning rate $\eta$, and the rule $x_{i+1} = x_i - \eta\nabla f(x_i)$ for any $i\in\mathbb{N}$. This sequence continues until we approach a region close to our desired minimum.

          To get a better intuition for what is written above, suppose $f = f(v_1,v_2,\ldots,v_n)$ is a multivariable function. Small changes in each variable cause a change in $f$ which can be approximated as:

          $$\Delta f \approx \frac{\partial f}{\partial v_1}\Delta v_1 + \frac{\partial f}{\partial v_2}\Delta v_2 + \ldots + \frac{\partial f}{\partial v_n}\Delta v_n$$

          Let $\nabla f$ denote the gradient of $f$ and $\Delta v = (\Delta v_1,\Delta v_2,\ldots,\Delta v_n)^T$, where $T$ is again the transpose. Then

          $$\nabla f = \bigg(\frac{\partial f}{\partial v_1},\ldots,\frac{\partial f}{\partial v_n}\bigg)^T$$

          so we can write

          $$\Delta f \approx \nabla f \cdot \Delta v$$

          Now, we want to make $\Delta f$ negative, so let's choose
          $$\Delta v = -\eta\nabla f, \qquad \eta > 0$$

          then $\Delta f = -\eta\|\nabla f\|^2$, which is always negative.

          To make gradient descent work correctly, we need to choose the learning rate $\eta$ small enough that $\Delta f \approx \nabla f \cdot \Delta v$ remains a good approximation. If we don't, we might end up with $\Delta f > 0$, which obviously would not be good. At the same time, we don't want $\eta$ to be too small, since that will make the changes $\Delta v$ tiny, and thus gradient descent will work slowly.
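The update rule $x_{i+1} = x_i - \eta\nabla f(x_i)$ can be tried on a simple function whose minimum we know. The function $f(v) = \|v\|^2$, with gradient $2v$, is our own toy choice here:

```python
import numpy as np

def grad_f(v):
    """Gradient of f(v) = ||v||^2, which has its minimum at the origin."""
    return 2.0 * v

eta = 0.1                      # learning rate
v = np.array([3.0, -4.0])      # starting point x_0
for _ in range(100):
    v = v - eta * grad_f(v)    # x_{i+1} = x_i - eta * grad f(x_i)

print(v)  # very close to the minimum at [0, 0]
```

Rerunning the same loop with a too-large learning rate (say $\eta = 1.1$) makes the iterates diverge, illustrating the warning above.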

          Given that our cost function indicates how poorly our neural network approximates a given function, by calculating the gradient of the cost function with respect to the weights and biases of the network and adjusting these parameters in the direction opposite to the gradient, we decrease our error, and this leads us to an adequate network.

          Stochastic gradient descent

          Let's look back at our quadratic cost function in eq 1

          $$C(w,b)\equiv\frac{1}{2n}\sum_{x}\| a - y(x)\|^2$$

          Notice that this cost function has the form

          $$C = \frac{1}{n}\sum_{x}C_x$$

          that is, it is an average over the costs $C_x \equiv \frac{1}{2}\|a - y(x)\|^2$ for individual training examples.

          In practice, to compute the gradient $\nabla C$ we need to compute the gradient $\nabla C_x$ separately for each training input $x$ and then average them: $\nabla C = \frac{1}{n}\sum_{x}\nabla C_x$. Unfortunately, when the number of training inputs is very large this can take a long time, and learning thus occurs slowly.

          Here comes the idea of stochastic gradient descent, which speeds up learning. The idea is to estimate the gradient $\nabla C$ by computing $\nabla C_x$ for a small sample of randomly chosen training inputs. By averaging over this small sample, it turns out that we can quickly get a good estimate of the true gradient $\nabla C$. Let us pick $m$ training inputs at random, label them $X_1, X_2, X_3, \ldots, X_m$, and refer to them as a mini-batch. Provided that the sample size $m$ is large enough, we can expect that the average value of the $\nabla C_{X_j}$ will be roughly equal to the average over all $\nabla C_x$, that is,

          $$\frac{\sum_{j=1}^{m}\nabla C_{X_j}}{m} \approx \frac{\sum_{x} \nabla C_x}{n} = \nabla C$$

          where the second sum is over the entire set of training data. So

          $$\nabla C \approx \frac{\sum_{j=1}^{m}\nabla C_{X_j}}{m}$$

          confirming that we can estimate the overall gradient by computing gradients just for the randomly chosen mini-batch.

          Then we pick out another randomly chosen mini-batch and train with it, and so on, until we've exhausted the training inputs, which is said to complete an epoch of training. Given below is a list of optimizers available in the keras documentation which can be used to optimize the learning process:

          • SGD (Stochastic Gradient Descent)
          • RMSprop
          • Adagrad
          • Adadelta
          • Adam
          • Adamax
          • Nadam
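The mini-batch procedure can be sketched end-to-end in NumPy for a toy linear model; the data, model and hyperparameters here are our own illustrative choices, not part of the MNIST case study.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 1000, 10                       # training set size and mini-batch size
X = rng.standard_normal((n, 2))
w_true = np.array([2.0, -3.0])
y = X @ w_true                        # noiseless targets from a known rule

w = np.zeros(2)
eta = 0.1                             # learning rate
for epoch in range(20):
    order = rng.permutation(n)        # shuffle, then slice into mini-batches
    for start in range(0, n, m):
        batch = order[start:start + m]
        err = X[batch] @ w - y[batch]
        grad = X[batch].T @ err / m   # mini-batch estimate of grad C
        w -= eta * grad               # one stochastic gradient descent step

print(w)  # approximately [2, -3]
```

Each pass over all the mini-batches is one epoch, matching the description above.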


          With advancements in technology and the data science literature, artificial neural networks became a major part of Artificial Intelligence after the arrival of the "backpropagation" technique. This technique allows us to obtain the derivative of the loss function with respect to each of the network's weights and biases, so that gradient descent can then update these parameters in an informed manner to improve the predictive power of the network.

          The explanation in this section is a simple introduction to the method of backpropagation [1] in feedforward neural networks. For a more comprehensive understanding of the backpropagation algorithm, the second chapter of the online book Neural Networks and Deep Learning is a great reference.

          The big challenge in applying gradient descent to neural networks is calculating the partial derivatives of the cost function with respect to each individual weight and bias (namely $\frac{\partial C}{\partial W_{i,j}^l}$ and $\frac{\partial C}{\partial b_{j}^l}$). This is where backpropagation comes in. The algorithm first tells us how to calculate these values for the last layer of connections; with these results it then inductively goes "backwards" through the network, calculating the partial derivatives of each layer until it reaches the first layer of the network. Hence the name "backpropagation".

          In a previous section, we defined $A_n$ to be the vector representing the activations of the nodes in the nth layer of the network. However, for the purposes of this section, it will be useful to consider the values sent by the previous layer before the activation function was applied. Consider

          $$z_j^l := \sum_{k}W_{j,k}^l a_{k}^{l-1} + b_{j}^l \quad\text{and}\quad a_j^l = \sigma(z_j^l) \quad\text{and}\quad A^l = \sigma(Z^l)$$

          In the previous equation, $Z^l := \sum_{k} z_{k}^l e_k$ is a vector with entries corresponding to the values $z_j^l$ (the $e_i$ are the standard basis vectors). Additionally, consider the quantity

          $$\delta_{j}^l := \frac{\partial C}{\partial z_j^l} \quad\text{and}\quad \Delta^l := \sum_{k} \delta_{k}^l e_k$$

          These values will be useful for propagating the algorithm backwards through the network and are directly related to $\frac{\partial C}{\partial W_{i,j}^l}$ and $\frac{\partial C}{\partial b_{j}^l}$ by the chain rule, since

          $$\frac{\partial C}{\partial w_{j,i}^l} = \frac{\partial C}{\partial z_{j}^l}\frac{\partial z_{j}^l}{\partial w_{j,i}^l} = \delta_{j}^l\, a_{i}^{l-1} \quad\text{and}\quad \frac{\partial C}{\partial b_{j}^l} = \frac{\partial C}{\partial z_{j}^l}\frac{\partial z_{j}^l}{\partial b_{j}^l} = \delta_{j}^l$$

          Since the activations $a_i^{l-1}$ are readily available for any node of the network, if we are able to calculate the values of the $\delta_j^l$'s we will have succeeded in obtaining our gradient! Our first step is calculating this value for the last layer of the network, that is, $\delta_j^L$ for a network with L layers. It is not hard to see that, since $a_{j}^L = \sigma(z_{j}^L)$, again by the chain rule

          $$\delta_{j}^L = \frac{\partial C}{\partial a_{j}^L}\frac{\partial a_{j}^L}{\partial z_{j}^L} = \frac{\partial C}{\partial a_{j}^L}\,\sigma'(z_{j}^L)$$

          Using the mean squared error function, noticing that the network's output is $A^L = (a_1^L, \cdots, a_k^L)$ and letting the desired output be $y(x) = (y_1, \cdots, y_k)$, we get that

          $$\delta_{j}^L = (a_{j}^L - y_j)\,\sigma'(z_j^L)$$

          which can easily be calculated by a computer if we know how to calculate $\sigma'$ (which should be true for any practical activation function). In order to express this result in vector form, it is common to write

          $$\Delta^L = \nabla_{A^L} C \odot \sigma'(Z^L)$$

          where $\nabla_{A^L} C := \big( \frac{\partial C}{\partial a_1^L}, \cdots, \frac{\partial C}{\partial a_k^L} \big)$ is the gradient of C taken with respect to the elements of $A^L$, and $\odot$ is the Hadamard product, which multiplies two matrices (or vectors) elementwise.

          Now we only need to "propagate" this backwards in the network in order to obtain $\delta_j^{L-1}$. To do so, apply the chain rule once again:

          $$\delta_j^{L-1} = \frac{\partial C}{\partial z_{j}^{L-1}} = \nabla_{Z^L} C \cdot \frac{\partial Z^L}{\partial z_j^{L-1}} = \sum_{i=1}^{k} \frac{\partial C}{\partial z_i^L}\,\frac{\partial z_i^L}{\partial z_j^{L-1}} = \sum_{i=1}^{k} \delta_i^L\,\frac{\partial z_i^L}{\partial z_j^{L-1}}$$

          If we focus on the term $\frac{\partial z_i^L}{\partial z_j^{L-1}}$, we find that

          $$\frac{\partial z_i^L}{\partial z_j^{L-1}} = \frac{\partial \big(\sum_{k} w_{i,k}^L a_k^{L-1} + b_i^L\big)}{\partial z_j^{L-1}} = \frac{\partial\big(\sum_{k} w_{i,k}^L \sigma(z_k^{L-1}) + b_i^L\big)}{\partial z_j^{L-1}} = \frac{\partial\big(w_{i,j}^L \sigma(z_j^{L-1})\big)}{\partial z_j^{L-1}} = w_{i,j}^L\,\sigma'(z_j^{L-1})$$

          which, again, can be easily calculated by a computer given the network. Therefore

          $$\delta_j^{L-1} = \sum_{i=1}^{k} \delta_i^L\, w_{i,j}^L\, \sigma'(z_j^{L-1})$$

          This formula tells us how to calculate any $\delta_j^l$ in the network, assuming we know $\Delta^{l+1}$. Since we know how to jump-start the process with $\Delta^L$ in the last layer of the network, we are done, and the algorithm is complete.

          Summarizing what was done in this section: we first defined $\delta_j^l := \frac{\partial C}{\partial z_j^{l}}$, which is directly related to our desired values $\frac{\partial C}{\partial w_{i,j}^l}$ and $\frac{\partial C}{\partial b_j^l}$. We then wrote down what the $\delta_j^L$'s should be for the last layer of the network, and finally developed a way to calculate all the $\delta_j^l$'s, given that we know the values of the $\delta_j^{l+1}$'s. Thus, by propagating this method backwards through the layers of the network we are able to find all our desired partial derivatives, and can therefore calculate the value of $\nabla C$ as a function of the weights and biases of the network and execute the method of gradient descent.
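The whole procedure can be condensed into a short NumPy sketch for a sigmoid network with the quadratic cost; the layer sizes are arbitrary, and the code follows the equations of this section rather than any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(weights, biases, x, y):
    """Gradients of the quadratic cost for one training example."""
    # Forward pass, storing every z^l and activation a^l.
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # Output layer: delta^L = (a^L - y) * sigma'(z^L).
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grads_W = [np.outer(delta, activations[-2])]
    grads_b = [delta]
    # Backward pass: delta^{l-1} = (W^l)^T delta^l * sigma'(z^{l-1}).
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grads_W.insert(0, np.outer(delta, activations[-l - 1]))
        grads_b.insert(0, delta)
    return grads_W, grads_b

rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [rng.standard_normal(4), rng.standard_normal(2)]
gW, gb = backprop(weights, biases, rng.standard_normal(3), np.array([0.0, 1.0]))
print([g.shape for g in gW])  # [(4, 3), (2, 4)] -- one gradient per weight matrix
```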


          Computer vision is a field of study in which we try to enable machines to view the world as we humans do, perceive it in a similar manner, and use that knowledge for a multitude of tasks such as image and video recognition, image analysis and classification, media recreation, natural language processing and so on. One of the major advancements in this field is due to the introduction of the Convolutional Neural Network.

          A Convolutional Neural Network, commonly referred to as a CNN or ConvNet, is a deep, feed-forward artificial neural network. It is inspired by the biological visual cortex, which has small regions of cells that are sensitive to specific areas of the visual field, like edges and corners. Fig 5 shows the complete flow of a CNN processing an input image and classifying it based on the computed values: the image is fed as input to the network, goes through multiple convolutions, subsampling and a fully connected layer, and finally produces an output.

            Schematic diagram of a Convolutional Neural Network

            Why a CNN over an FNN? An image is a matrix of pixel values. In an FNN we simply flatten the image and feed it to an MLP for classification. An FNN shows good precision for extremely basic binary images but is not accurate for complex images having pixel dependencies throughout. A CNN is able to successfully capture the spatial and temporal dependencies in an image through the application of relevant filters. The architecture achieves a better fit to an image dataset due to the reduction in the number of parameters involved and the reusability of weights.


            A CNN provides an abstract representation of the data at each stage of the network; the layers are designed to detect specific features of the input. When we look at hidden layers closer to the output of a deep network, they have highly interpretable representations, such as faces, clothing, etc. The first layers of the network, by contrast, detect very basic features such as corners, curves, and so on.

            Convolutional layer

            The convolutional layer computes the outputs of neurons that are connected to local regions, or receptive fields, in the input, each computing a dot product between its weights and the small receptive field it is connected to in the input volume. The element carrying out the convolution operation in the first part of a convolutional layer is called the kernel or filter. Each computation extracts a feature map from the input image. In other words, suppose we have an image represented as a 5x5 matrix of values, and we take a 3x3 kernel and slide it around the image. The stride is the number of pixels shifted over the input matrix at each step: when stride = 1, we move the filter 1 pixel at a time, and so on. At each position, we multiply the values of the 3x3 window by the values in the image currently covered by the window and sum the results, obtaining a single number that represents all the values in that window of the image. This layer performs filtering: as the window moves over the image, the filter checks for patterns in that section of the image. Fig 6 demonstrates the convolution process.
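The sliding-window computation just described can be sketched directly in NumPy (a 'valid' convolution; the 5x5 image and the 3x3 averaging filter are illustrative choices of our own):

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image ('valid' positions only)."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i * stride:i * stride + kh,
                           j * stride:j * stride + kw]
            out[i, j] = np.sum(window * kernel)  # elementwise product, summed
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # a 5x5 image, as in the text
kernel = np.ones((3, 3)) / 9.0                    # a simple averaging filter
print(convolve2d(image, kernel).shape)            # (3, 3)
print(convolve2d(image, kernel, stride=2).shape)  # (2, 2)
```

Note how increasing the stride shrinks the output feature map.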

            The Convolution Process: the image matrix multiplied by the filter matrix gives the output matrix.

                There are two types of results of the operation: one in which the convolved feature is reduced in dimensionality compared to the input, and another in which the dimensionality is either increased or remains the same, depending on how the filter fits the image. The first is achieved by applying valid padding, in which we drop the part of the image where the filter does not fit. The second is achieved by same padding, in which we pad the picture with zeroes.


                The objective of subsampling is to obtain a reduced-dimension representation of the input, which helps in reducing overfitting. One of the techniques of subsampling is max pooling. Fig 7 shows the working of different pooling techniques.

                  Fig 7: Pooling techniques in CNN.

                  With this technique, we select the highest pixel value from a region, depending on its size. In other words, max pooling takes the largest value from the window of the image currently covered by the kernel. For example, a max-pooling layer of size 2x2 selects the maximum pixel intensity value from each 2x2 region. The other two techniques are (i) sum pooling and (ii) average pooling.
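The 2x2 max-pooling operation just described can be sketched directly (an illustrative snippet assuming non-overlapping windows, i.e. stride equal to the pool size; the function name and sample feature map are invented for the example):

```python
import numpy as np

def max_pool(image, size=2):
    """Keep only the largest pixel value in each non-overlapping
    `size` x `size` region of the input."""
    h, w = image.shape
    oh, ow = h // size, w // size
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = image[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

fmap = np.array([[1., 3., 2., 4.],
                 [5., 6., 7., 8.],
                 [3., 2., 1., 0.],
                 [1., 2., 3., 4.]])
pooled = max_pool(fmap)  # 2x2 output: [[6., 8.], [3., 4.]]
```

Replacing `.max()` with `.sum()` or `.mean()` gives sum pooling and average pooling respectively.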

                  We have successfully enabled the model to understand the features. Moving on, we are going to flatten the final output and feed it to a regular neural network for classification purposes.

                  Fully connected layer

                  Now that we have converted our input image into a form suitable for our Multi-Layer Perceptron, we flatten it into a column vector. The flattened output is fed to a feed-forward neural network, and backpropagation is applied at every iteration of training. Over a series of epochs, the model learns to distinguish between dominating and certain low-level features in images and to classify them using the softmax classification technique.
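The flattening step and the softmax classification can be sketched as follows (an illustration only; the feature-map shapes and random scores stand in for the real network's outputs):

```python
import numpy as np

def softmax(z):
    """Turn the final layer's raw scores into a probability distribution."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

feature_maps = np.random.rand(4, 7, 7)  # e.g. 4 pooled 7x7 feature maps
column = feature_maps.flatten()         # flattened into a column vector
scores = np.random.rand(10)             # stand-in for the dense layers' output
probs = softmax(scores)                 # class probabilities for digits 0-9
predicted_digit = int(np.argmax(probs))
```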


                  Choosing architectures for neural networks is not an easy task. We want to select a network architecture that is large enough to approximate the function of interest, but not so large that it takes an excessive amount of time to train. Another issue with large networks is that they require large amounts of data to train: you cannot train a neural network on a hundred data samples and expect it to reach 99% accuracy on an unseen data set. In general, it is good practice to use multiple hidden layers as well as multiple nodes within the hidden layers, as these tend to result in the best performance.

                  In this section, we include a case study of the most classic example of neural network implementation: the MNIST database of handwritten digits. The MNIST database is composed of 70,000 pictures, each of which is a 28x28 grid of grayscale pixels. Each pixel stores a value between 0 and 1 corresponding to the shade of the region, with 0 indicating a completely white square and 1 indicating a completely black square. Our objective is to create a neural network which is able to classify which digit is represented in a picture. This is a great application of neural networks, as it would be very difficult to write down an explicit function to recognize the digits.


                  In order to experiment with parameters such as depth, width, learning rate and mini-batch size, we use a Python 3.7 implementation of the algorithm presented at the end of the first chapter of the book Neural Networks and Deep Learning by Michael A. Nielsen. In this implementation, stochastic gradient descent is employed, which aims to ensure that training is carried out in a randomly uniform way across the training data, making it less likely for the network to specialize in specific images and perform poorly on others.

                  In summary, we set aside 10,000 of the 70,000 pictures as the test set; the remaining pictures form the training set. We also divide the training process into epochs as described in the previous section. In each epoch, we shuffle the training set and split it into groups of a given size (the mini-batch size). For each mini-batch, we go through the process of backpropagation for its pictures and update the network by adding up the contributions of the individual pictures to the gradient of the cost function. We repeat this for a given number of epochs until training is done. After this step, we test the network on all 10,000 pictures we separated into the test set, without updating the network on these pictures. This separation ensures that our network is not memorizing the desired output of the training set but is instead able to generalize to inputs outside of this set. By keeping the two sets separated, we make sure that our test is independent of the training data used to shape the network.
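In the spirit of Nielsen's implementation, one training epoch of mini-batch stochastic gradient descent can be sketched as below. This is an outline, not the paper's actual code: the `network` object and its `backprop`/`apply_gradients` methods are hypothetical names standing in for the real implementation.

```python
import random

def sgd_epoch(network, training_data, mini_batch_size, learning_rate):
    """One epoch of stochastic gradient descent: shuffle the training set,
    split it into mini-batches, and update the network once per mini-batch
    using the averaged backpropagated gradients of its pictures."""
    random.shuffle(training_data)
    mini_batches = [training_data[k:k + mini_batch_size]
                    for k in range(0, len(training_data), mini_batch_size)]
    for batch in mini_batches:
        grads = [network.backprop(x, y) for x, y in batch]
        network.apply_gradients(grads, learning_rate / len(batch))
```

Shuffling before splitting is what makes the selection of each mini-batch randomly uniform across the training data.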

                  For all networks in this section, the measure of success is chosen to be the average number of correct classifications on the test set after the specified number of epochs.

                  Comparison Result

                  Learning rate

                  In order to test how the learning rate of a network affects its accuracy, we used two structures:

                  • A shallow network structure of 2 hidden layers with 30 nodes each.
                  • A deep network structure with (i) 6 hidden layers of width 5, (ii) 10 hidden layers of width 5, and (iii) 10 hidden layers of width 10.

                  Each point in fig 8 represents a network trained for 15 epochs with a mini-batch size of 10.

                  Fig 8: LR vs success. Learning rate trials for (a) shallow network with two hidden layers of 30 nodes each; (b) deep network with (i) 6 and (ii) 10 hidden layers of 5 nodes each, and (iii) 10 hidden layers of 10 nodes. The number of successes is based on a testing sample of 10,000 pictures.

                      For the shallow network, we notice an increase in accuracy as the learning rate goes up, though the accuracy eventually stagnates. For the deep networks the case is not the same. The network with 6 hidden layers of width 5 gains good accuracy for lower learning rates, and accuracy fades as the learning rate increases. For the network with 10 hidden layers of width 5, the learning rate seems to have no effect. For the network with 10 hidden layers of width 10, accuracy seems higher for lower learning rates. More will be said about the influence of the learning rate on the accuracy of the network when we discuss depth.


                      In an analogous fashion to the previous section, we fix the depth and learning rate and vary the width of the hidden layers in order to examine what relationship this parameter has with the final network's accuracy. In fig 9, results are shown for two network structures:

                      • a shallow network with 2 hidden layers of equal (varied) width, with LR = 3.
                      • a deep network with 10 hidden layers of equal (varied) width, for different LRs.
                      Fig 9: Width trials for (a) shallow network with two hidden layers of varied width; (b) deep network with 10 hidden layers of varied width for different LRs. The number of successes is based on a testing sample of 10,000 pictures.

                           These results show a very clear relationship between the accuracy and the width of the network: increasing the number of nodes in each layer increases the number of successes. For 'thinner' networks, adding nodes to a layer makes a very big difference, but this gain diminishes as the network grows bigger.

                           At first sight, this might lead the reader to conclude that shallow but wider networks always give better results. However, this is not the case. Results are less encouraging when the data from fig 9(a) is extended to include considerably wider networks: for the shallow network, accuracy drops as the width increases further.

                             Fig 10: Epoch trials for different widths in 2 hidden layers across 50 epochs, with a learning rate of 3 and mini-batch size of 10. The number of successes is based on a testing sample of 10,000 pictures.

                             The accuracy of the networks does increase as a function of width, up to roughly 100 nodes per hidden layer. After this threshold, network accuracy varies significantly, in most cases for the worse. This is because a wider network has many more weights and biases that need to be adjusted, so the learning process happens more slowly. In this specific case of networks with two hidden layers of width n, 784 inputs and 10 outputs, the number of weights is 784n + n^2 + 10n = n^2 + 794n and the number of biases is n + n + 10 = 2n + 10, which means that the total number of parameters that need to be adjusted is n^2 + 796n + 10. Because of this squared dependency on n, larger networks will require many more epochs of training in order to achieve their full potential.
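The parameter count above can be checked with a few lines of code (an illustrative helper, not part of the paper's implementation):

```python
def count_parameters(width, n_inputs=784, n_outputs=10):
    """Total weights and biases of a network with two hidden layers of
    `width` nodes each: weights = 784n + n^2 + 10n, biases = 2n + 10."""
    n = width
    weights = n_inputs * n + n * n + n_outputs * n
    biases = n + n + n_outputs
    return weights + biases  # = n^2 + 796n + 10

print(count_parameters(30))   # 24790 parameters for width 30
print(count_parameters(100))  # 89610 parameters for width 100
```

The jump from width 30 to width 100 multiplies the parameter count several times over, which is the squared dependency discussed above.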

                             Not only that, but each epoch requires much more computational power in larger networks. However, the large number of parameters also means that the network is capable of expressing a larger class of functions, which is indeed shown when considering networks trained over more epochs. This is shown in fig 10, which compares networks with 50, 100 and 200 nodes in each hidden layer across different epochs. As expected, larger networks require more training epochs in order to achieve their full potential, but once this is obtained, they indeed perform better than smaller networks. In this specific case, however, the gain is noticeable but not overwhelming, while the time required to train larger networks is indeed much larger.

                             In the case of the deep network, the accuracy increases up to some threshold value, after which a change in behaviour is observed. Accuracy increases for lower learning rates, and the gain is diminished for higher learning rates.

                            Mini batch size

                               Fig 11: Mini-batch size trials for (a) shallow network with two hidden layers of 30 nodes each; (b) deep network with 6 and 10 hidden layers of 5 nodes each. The number of successes is based on a testing sample of 10,000 pictures.

                                 As discussed in the previous section, the mini-batch size is the number of randomly selected image samples used to calculate the overall gradient of the cost function. Fig 11 shows the dependency of the network's success on the mini-batch size. For the shallow network, the mini-batch size does not seem to affect the accuracy much, with a slight improvement as it increases. In the case of the deep network, the network predicts correctly up to a threshold value, after which fluctuations are observed.


                                "Deep Learning" is a term which became popular in machine learning over the past few years, with many news articles praising successful applications of the method. This technique simply means that many layers are being used in a neural network (although what qualifies as "many" is not always well defined).

                                   Fig 12: Depth trials for varied widths, for a network with a learning rate of 3, 15 epochs and mini-batch size of 10. The number of successes is based on a testing sample of 10,000 pictures.

                                   Although this method has indeed been incredibly successful in many important applications, the reader should not be misled into believing that more layers always means better accuracy. For instance, consider the results exhibited in fig 12, where networks of different depths are compared.

                                   After a certain point, having more layers in the network dramatically decreases its accuracy, to a point where it converges to 10%, which is equivalent to just randomly guessing what digit is in the picture! A reasonable thought to have after reading the previous section is to conclude that this is because the network is larger, and therefore requires more epochs of training in order to achieve its true potential. However, further examination shows that even after 40 epochs, the network does not show any sign of improvement. Now observe in fig 13 what happens when we lower the learning rate of the network.

                                     Fig 13: Trials for sigmoid networks of 10 hidden layers with 15 nodes each. The number of successes is based on a testing sample of 10,000 pictures.

                                     Each curve presented in fig 13 relates to a different region of influence of the learning rate. In the first one, represented by 0.3, the learning rate is appropriate and, given enough training epochs, results in a successful network. In the second one, represented by 3.0, the learning rate is so big that the steps taken by the gradient descent technique completely skip over any minimum of the cost function, turning the network's output into random guesses. The final one, represented by 1.0, is an intermediary region, where steps are small enough that the outcome of the network is definitely better than random guesses, but still too big, making the network oscillate intensely across one or more minima of the cost function, resulting in unreliability.

                                     Why does the deeper network require a smaller learning rate in order to function properly when compared to shallower networks? The answer lies in the different types of operations carried out when shaping the network's function: in shallow but wide networks, the primary operations influencing the outcome are affine combinations, which are determined by the weights and biases across layers; in deep but thin networks, function composition is key, which means that each individual weight (especially those closer to the input layer) has a very large influence on the outcome of the network. As a result, in deeper networks we need smaller steps in the gradient descent in order to finely tune the network to the desired dataset.
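The three regimes can be reproduced on a toy problem (this worked example is our own illustration, not from the case study): gradient descent on f(x) = x^2, whose update is x ← x − lr · 2x, i.e. x is multiplied by (1 − 2·lr) at each step.

```python
def gradient_descent(lr, x0=1.0, steps=50):
    """Minimise f(x) = x^2 (gradient 2x) with a fixed learning rate."""
    x = x0
    for _ in range(steps):
        x = x - lr * 2 * x
    return x

# lr = 0.3: factor (1 - 0.6) = 0.4 per step, converges to the minimum at 0
# lr = 1.0: factor -1, x oscillates between +1 and -1 and never settles
# lr = 3.0: factor -5, each step overshoots and the iterates diverge
small = gradient_descent(0.3)
oscillating = gradient_descent(1.0)
large = gradient_descent(3.0)
```

The same qualitative behaviour (convergence, oscillation, divergence) is what the 0.3, 1.0 and 3.0 curves of fig 13 exhibit on the real cost function.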


                                    • Leonardo Ferreira Guilhoto. An Overview of Artificial Neural Networks for Mathematicians. 2018.
                                    • Michael A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2018.
                                    • B. Hammer and T. Villmann. Mathematical Aspects of Neural Networks. d-side publications, 2003.
                                    • Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht and Oriol Vinyals. Understanding Deep Learning Requires Rethinking Generalization. ICLR 2017.
                                    • Sergios Theodoridis and Konstantinos Koutroumbas. Pattern Recognition, Fourth Edition. Academic Press, an imprint of Elsevier, 2009.
                                    • Simon Haykin. Neural Networks and Learning Machines, Third Edition. PHI Learning Private Limited, 2012.
                                    • Jacek M. Zurada. Introduction to Artificial Neural Systems. Jaico Publishing House, 2006.
                                    • Convolutional Neural Networks tutorial, DataCamp: https://www.datacamp.com/community/tutorials/convolutional-neural-networks
                                    • Understanding Convolutional Neural Networks, medium.com.


                                     It is a pleasure to acknowledge the fine assistance and sincere guidance I received at IIST Trivandrum during this two-month internship, which helped me improve my computer and programming skills and enrich my knowledge. I would like to express my sincere gratitude to my supervisor Dr. Deepak Mishra, Associate Professor, for providing invaluable guidance, comments and suggestions throughout the course of my work. I thank him for constantly motivating me to keep practicing and to work harder.

                                    APPENDIX A Data

                                     Since the algorithm employed uses the stochastic gradient descent technique and is therefore partially random, the data displayed in the graphs of Section 4 is detailed below for the sake of completeness. For compactness, the following abbreviations are used: L.R. (learning rate); C.G. (correct guesses); E.C. (epochs completed); Wd. (width); Dp. (depth); B.S. (mini-batch size).

                                       Table for fig 8(a)
                                       Table for fig 10
                                       Table for fig 13
                                       Table for fig 10 and fig 13
                                       Table for fig 9(a)
                                       Table for fig 11(b)
                                       Table for fig 12
                                       Table for fig 9(a), 11(b) and 12
                                       Table for fig 8(b)
                                       Table for fig 9(b)
                                       Table for fig 11(a)
                                       Table for fig 9(b) and fig 11(a)
