Summer Research Fellowship Programme of India's Science Academies

Combining pruning with structured sparsity learning to compress neural networks

Sheetal Kulkarni

Gogte Institute of Technology, Belagavi, Karnataka 590008

Dr. Sathish S Vadhiyar

Indian Institute of Science, CV Raman road, Bengaluru, Karnataka 560012


Neural networks are a popular choice for problems in Computer Vision, Natural Language Processing, Speech Processing, Robotics, and so on, because of their high accuracy. However, deploying these networks on embedded devices is difficult, since the models are very large in terms of memory consumption and computational complexity. This project focuses on overcoming these drawbacks so that neural networks become a feasible option on embedded devices. It proposes combining pruning, which eliminates unnecessary weights, with structured sparsity learning, which trims the structure of the network while keeping the remaining computations dense. The pruning is accomplished through the L0 regularization technique. Sensitivity analysis is carried out on the LeNet and VGG convolutional neural networks, trained on the MNIST, CIFAR-10 and SVHN datasets. Furthermore, readily available compressed models provided by TensorFlow are run and their timings are noted. Various techniques are tried to reduce the time taken by these readily available compressed models.

Keywords: neural network compression, deep learning, sensitivity analysis, neural network pruning, structured sparsity learning, tensorflow


NN        Neural Network
SA        Sensitivity Analysis
TF        TensorFlow
DNN       Deep Neural Network
FC Layer  Fully Connected Layer
SGD       Stochastic Gradient Descent
SSL       Structured Sparsity Learning


Neural networks (NN) have become omnipresent in domains of Artificial Intelligence such as Computer Vision, Natural Language Processing and Speech Processing. NNs are also used in several other applications, such as self-driving cars, analysing diseases in medical imaging, and playing games that demand a great deal of thought, such as Chess, Sudoku and Scramble. In nearly all these fields, neural networks have consistently outperformed other machine learning models in terms of accuracy, and accuracy keeps increasing as more complex models are formulated and used. This growth in accuracy has been fuelled by large datasets (such as the 1.3-million-image ImageNet [1], the 10-million-image Places [2], Freebase [3] and Wikidata [4]), by more powerful GPUs, and by the increasing depth and number of parameters of the networks. Neural networks are inspired by the structure of the brain, in which neurons are linked by connections called synapses. A neural network is an assembly of a large number of such neurons arranged in layers, with the neurons in one layer connected to all or some of the neurons in the next layer. The significance of each of these connections is learned from the datasets.

Sensitivity analysis (SA) determines how different values of an independent variable affect a particular dependent variable under a given set of assumptions [5]. The parameter values and assumptions of any model are subject to change and error. SA is the investigation of these changes and errors and of their impact on the conclusions drawn from the model. SA is simple to carry out, easy to grasp and easy to communicate. It is a convenient and widely used strategy for supporting decision makers [6].

TensorFlow (TF) is an interface for expressing Machine Learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones up to large-scale distributed systems of hundreds of machines and thousands of computational devices [7]. TF is a C++ based deep learning framework with Python APIs, developed under the open source Apache 2.0 License. TF uses data flow graphs to carry out numerical computations, in which the nodes correspond to mathematical operations and the edges correspond to the multidimensional data arrays communicated between them [8]. TF has a flexible architecture that supports several backends, CPU or GPU, on desktop, mobile and server platforms. In addition, TF allows users to run each node on a different computational device, which makes it exceptionally flexible. Because of TensorFlow's automatic differentiation and parameter sharing capabilities, a broad range of architectures can be defined and executed easily. TF has a rapidly growing community of users and contributors, which makes it a significant deep learning framework [8].

Uses of Neural Networks

Today, the number of layers used by neural networks can reach up to a thousand. These very deep networks are described as Deep Neural Networks (DNN). DNNs have been used in a wide variety of areas, such as:

1. Image and Video: Images and videos are the most common form of data available on the internet today. DNNs have been found useful for problems such as image recognition [9], [10], [11], object detection [12], image captioning/retrieval [13], [14], etc.

2. Natural Language and Speech: In this domain, DNNs have been used for speech recognition [15], natural language processing [16], language translation [17], etc.

3. Medical: DNNs have been used to understand the genetics of diseases such as autism and several cancers [18], [19]. DNNs are also useful for inspecting the data generated by medical diagnostic procedures to determine the presence and severity of cancers.

4. Robotics: In robotics, DNNs have been applied to planning [20], navigation [21], self-driving cars [22], etc.

Inference vs Training

Like any machine learning model, a neural network has to be trained on a dataset before it can be used. Training learns the values of the weights and biases with the help of labelled data. This is achieved using gradient descent optimizers such as Stochastic Gradient Descent (SGD), Adam, Momentum, etc. Inference uses these learned parameters on input data to produce outputs. Inference is far less computationally demanding than training, but it may have to be performed under a latency constraint. For example, in self-driving cars inference must run on less powerful on-board equipment, while training can be carried out on servers in data centres.
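The distinction between training and inference can be illustrated with a toy example. The following is a minimal sketch, not the setup used in this project: it assumes a one-parameter model y = w * x, a squared-error loss, and hypothetical data, learning rate and epoch count.

```python
# Toy illustration of training vs. inference on a one-parameter model y = w * x.
# The data, learning rate and epoch count are hypothetical.

def predict(w, x):
    """Inference: apply the learned parameter to an input."""
    return w * x

def train_sgd(samples, lr=0.05, epochs=50):
    """Training: learn w from labelled (x, y) pairs with per-sample SGD."""
    w = 0.0
    for _ in range(epochs):
        for x, y in samples:
            grad = 2.0 * (predict(w, x) - y) * x  # derivative of (w*x - y)**2 w.r.t. w
            w -= lr * grad
    return w

samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # labelled data generated by y = 2x
w = train_sgd(samples)          # many passes over the data
print(w, predict(w, 5.0))       # inference is a single forward computation
```

Training loops over the labelled data many times, whereas inference is one forward evaluation per input, which is why inference is the part that has to fit on the embedded device.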

DNNs on Embedded Systems

Inference can be performed either on embedded devices or in the cloud. Since DNN models are very large, the cloud is usually the preferred choice. For instance, AlexNet has 60 million parameters, VGG-16 has 138 million parameters and the human face classification network of [23] has 10 billion parameters. The AlexNet CaffeModel is over 200 MB and the VGG-16 CaffeModel is over 500 MB, so both consume a lot of memory. Furthermore, these models have high computational complexity. For example, [24] reports that, even after optimizations, VGG-16 takes 21.6 s per image on a mobile CPU and 6.3 s on a GPU. However, using the cloud costs considerable time and battery power to upload the data and download the result, and it depends on the network conditions of the device. Hence, for low-latency applications, running on the embedded system can be the more robust option. Additionally, since the data can include sensitive information about the user, sending it to the cloud raises privacy concerns. Therefore, it is desirable for a neural network to perform inference with a small memory and latency footprint, so that the computations can be carried out on embedded devices.
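The model sizes quoted above are consistent with a simple back-of-the-envelope estimate. The sketch below assumes that each parameter is stored as a 32-bit float (4 bytes) and that 1 MB is 10^6 bytes; both are assumptions, not details stated in the cited sources.

```python
# Rough model-size estimate: number of parameters times bytes per parameter.
# Assumes 32-bit floats (4 bytes each) and 1 MB = 10**6 bytes.
BYTES_PER_PARAM = 4

def model_size_mb(num_params, bytes_per_param=BYTES_PER_PARAM):
    return num_params * bytes_per_param / 1e6

param_counts = {"AlexNet": 60_000_000, "VGG-16": 138_000_000}
sizes = {name: model_size_mb(n) for name, n in param_counts.items()}

for name, mb in sizes.items():
    print(f"{name}: about {mb:.0f} MB")
```

Under these assumptions the estimate gives about 240 MB for AlexNet and about 552 MB for VGG-16, in line with the "over 200 MB" and "over 500 MB" figures.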

Combining Pruning and Structured Sparsity Learning techniques

This approach tries to overcome the time and space complexity of the networks used in vision problems. Neural networks are heavily parameterized. Most of the weights are held by the fully connected layers (FC layers), whereas most of the computations take place in the convolutional layers. Pruning removes the weights that do not contribute, and Structured Sparsity Learning (SSL) prunes the structure of the network. Combined, the two strategies reduce both the size and the computations of the model.


The following is a review of the literature on compressing neural networks.

Reduced Model: These strategies aim to remove parameters from a large model, resulting in a smaller, lower-complexity model that needs fewer operations per inference.

Reduced Model


DNNs have a very large number of weights, most of which are very small. The smaller the absolute value of a weight, the less it contributes to the activations of the next layer. Han et al. [25] use this idea to cut the connections whose weights are below some threshold and then retrain the network with the remaining connections. Retraining recovers the accuracy. About 89% of the weights of AlexNet can be removed by this method with just a 0.2% drop in top-5 accuracy [25]. About 91% of the weights are pruned in the fully connected layers and about 63% in the convolutional layers.

Structured sparsity learning

Wen et al. [26] observe that the method above removes weights at arbitrary, unstructured positions. Hence, removing them does not yield a proportionate speed-up, since dense matrix operations still have to be performed on the zeros too. To improve on this, they propose using the group lasso regularizer [27] instead of the L2 or L1 regularizer. The group lasso regularizer is explained in Section 3.1.1. As a result, the network architecture is pruned in a structured manner. For convolutional layers, weights are removed in groups of kernels, whereas for fully connected layers, weights are removed in groups corresponding to the inputs or outputs of a neuron. For AlexNet on ImageNet, they removed 60% of the computations with a 2% loss in accuracy.

Matrix factorization

Denton et al. [28] build on the fact that the forward propagation of an input vector through an FC layer can be expressed as a matrix-vector multiplication. Singular Value Decomposition (SVD) can be used to factorize the weight matrix. Keeping only the few largest singular values and the corresponding singular vectors, the weight matrix can be approximated as the product of two lower-dimensional matrices. [29] also uses this approach to compress the matrices at runtime. This approach can compress the ZFNet [10] network by 7-8x for a loss of 6% accuracy on ImageNet [1], as presented in [14].
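The saving from such a factorization can be seen by counting parameters. The sketch below is illustrative only: the layer size (4096 x 4096) and the retained rank (256) are hypothetical values, not figures from [28] or [29].

```python
# Parameter count of a dense m x n weight matrix versus its rank-k
# factorization W ≈ A @ B, with A of shape (m, k) and B of shape (k, n).
# The sizes below are hypothetical.

def dense_params(m, n):
    return m * n

def low_rank_params(m, n, k):
    return k * (m + n)  # m*k entries in A plus k*n entries in B

m, n, k = 4096, 4096, 256
original = dense_params(m, n)
factored = low_rank_params(m, n, k)
ratio = original / factored

print(original, factored, ratio)
```

The factorization only pays off when k is well below m*n / (m + n); in this example the two factors together hold one eighth of the original weights.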

L0 Regularization

A loss function that uses L0 regularization cannot be minimized with SGD-based optimization strategies, because the L0 norm is not a continuous function. However, L0 regularization is attractive because it identifies the parameters that can be set exactly to zero. Louizos et al. [30] proposed approximating the L0 regularizer with a continuous, differentiable function that can be minimized with SGD, which makes it possible to train L0-regularized neural networks. This is further explained in Section 3.2.1. They show that Wide ResNet [31] can be compressed on the CIFAR-10 and CIFAR-100 datasets.
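For reference, an L0-regularized objective is commonly written as follows. The symbols here (the per-sample loss \mathcal{L}, the network h, the parameter vector \theta, the number of samples N and the coefficient \lambda) are notation chosen for this sketch rather than taken from this report.

```latex
\mathcal{R}(\theta) = \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}\big(h(x_i;\theta),\, y_i\big) + \lambda \,\lVert\theta\rVert_0,
\qquad
\lVert\theta\rVert_0 = \sum_{j=1}^{|\theta|} \mathbb{I}\,[\theta_j \neq 0]
```

The penalty counts the non-zero parameters, so it is piecewise constant in \theta and its gradient is zero wherever it exists, which is why gradient-based optimizers cannot minimize it directly.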

Other Techniques

Efficient architectures

This approach designs and trains compact neural network architectures from the start. To decrease the number of channels that are input to the following layers, SqueezeNet [32] uses 1 × 1 convolution layers instead of 3 × 3 convolution layers. MobileNets [33] replaces the standard convolution layers with depth-wise separable convolutions. To decrease the number of computations, Xception [34] uses depth-wise separable convolutions instead of the Inception module [35].
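The benefit of depth-wise separable convolutions can be illustrated by counting weights. This is a minimal sketch under stated assumptions: biases are ignored, and the channel counts and kernel size are hypothetical values, not taken from [33] or [34].

```python
# Weight count of a standard k x k convolution versus a depth-wise separable
# one (a k x k depth-wise convolution followed by a 1 x 1 point-wise convolution).
# Biases are ignored; the layer sizes are hypothetical.

def standard_conv_params(c_in, c_out, k):
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

c_in, c_out, k = 128, 256, 3
standard = standard_conv_params(c_in, c_out, k)
separable = depthwise_separable_params(c_in, c_out, k)

print(standard, separable, standard / separable)
```

For these sizes the separable layer needs 33,920 weights instead of 294,912, roughly 8.7 times fewer.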

Knowledge distillation

Knowledge distillation [36] aims to train a compact student model using both the input-label pairs in the dataset and the outputs of a large, computationally expensive teacher model. This can give the compact network a much higher accuracy than it would reach with conventional training on the dataset alone. The authors of [36] show that a student model can reach an accuracy comparable to that of an ensemble of 10 much bigger neural networks.


Combining Pruning and Structured Sparsity

Most of the space of a neural network is occupied by the FC layers. For instance, the FC layers hold about 96.7% of the weights of AlexNet and about 90% of the weights of VGG-16. To make the NN compact, we therefore concentrate on removing FC layer weights. This is carried out with the pruning method of [25]. The convolutional layers contribute the most to the time complexity of inference. For instance, 92.2% of the computations in AlexNet and 99% of the computations in VGG-16 are due to convolutional layers. Since sparse matrix algorithms are slower, the structure of the convolutional layers has to be regularized rather than removing parameters at arbitrary positions. Wen et al. [26] regularize the structure of the network with a group lasso regularizer in the loss function, so that the remaining computations stay dense. To address both the space complexity and the time complexity of NNs, we merge these two approaches.


Here, we present the background on pruning and structured sparsity learning.

    Three step training for pruned networks [25]


    Because the smaller weights do not contribute much to inference, the connections whose weights are below a threshold can be pruned after a network is trained. A neuron is removed if all of its incoming connections or all of its outgoing connections have been pruned. The model is then retrained using the remaining connections. The procedure is depicted in Figure 1. After a few iterations (3-4) of training and pruning, no new neurons are pruned, and the resulting network is the output. When the network is retrained, the parameters are not re-initialized; the previously trained values are used. The reason is that gradient descent finds the minimum much more efficiently when the network is well initialized, as noted in [25]. Additionally, the learning rate used for retraining is lower than the original rate because the network is already near a minimum.
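The pruning step of this procedure can be sketched in a few lines. The following is a minimal illustration and not the project's implementation: it assumes one layer stored as a list of rows (one row of incoming weights per output neuron), and the helper names, weights and threshold are hypothetical. The retraining step is only indicated in a comment.

```python
# One pruning pass over a single layer. Each row holds the incoming weights of
# one output neuron. Names and values are hypothetical.

def prune_by_magnitude(weights, threshold):
    """Zero out weights with |w| < threshold; return (pruned weights, mask)."""
    mask = [[1 if abs(w) >= threshold else 0 for w in row] for row in weights]
    pruned = [[w if keep else 0.0 for w, keep in zip(row, mrow)]
              for row, mrow in zip(weights, mask)]
    return pruned, mask

def removable_neurons(mask):
    """Output neurons whose incoming connections have all been pruned."""
    return [i for i, row in enumerate(mask) if not any(row)]

weights = [
    [0.80, -0.02, 0.40],
    [0.01, 0.03, -0.04],
    [-0.60, 0.07, 0.00],
]
pruned, mask = prune_by_magnitude(weights, threshold=0.05)
removable = removable_neurons(mask)
print(pruned)
print(removable)

# Retraining (not shown) would start from `pruned` rather than a fresh
# initialization, update only the weights where mask == 1, and use a lower
# learning rate, as described above.
```

Here the second neuron loses all of its incoming connections and can be removed entirely, while the other two keep their large weights.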

    Structured Sparsity Learning

    With the help of group lasso regularizer in loss function, the structure of neural networks can be regularized. The loss function formulation can be written as shown in the following figure.

      Formulation of loss function

      In Figure 2, W denotes all the weights, L is the total number of convolutional layers in the network, L(W) is the total loss, LD(W) is the loss on the data, R(W) is the L2 regularization and Rg(W(l)) is the regularizer applied to the weights W(l) of layer l. The regularizer for a convolution layer is formulated as depicted in the following figure.
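With the symbols defined above, a loss of this form can be written out as below. This follows the general formulation of [26]; the coefficients \lambda and \lambda_g are notation introduced here, and it is an assumption that they correspond to the lmbda_l2 and lmbda_lasso values varied in the sensitivity analysis.

```latex
L(W) = L_D(W) + \lambda \, R(W) + \lambda_g \sum_{l=1}^{L} R_g\big(W^{(l)}\big)
```

The first two terms are the usual data loss and L2 weight decay; the last term adds one structured regularizer per convolutional layer.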

        Convolution layer regularizer

        In Figure 3, Wi is the group of weights used to compute the ith output channel, that is, the 3D filter associated with the ith output channel; Wj is the group of weights applied to the jth input channel, that is, the jth kernel of each 3D filter; and || · || denotes the L2 norm. Furthermore, Cout is the number of output channels and Cin is the number of input channels in layer l. The group lasso regularizer can drive some of these groups of weights to zero. The two groupings are combined in one regularizer because removing an output channel (or equivalently a 3D filter) in the lth layer also removes the corresponding input channel of the (l + 1)th layer (or equivalently every weight related to that channel).
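Based on the description above, a regularizer of this kind can be written as below. This is a reconstruction in the spirit of [26], not a copy of the original figure; the explicit index notation W^{(l)}_{i,:,:,:} and W^{(l)}_{:,j,:,:} is added here to distinguish the two groupings.

```latex
R_g\big(W^{(l)}\big)
  = \sum_{i=1}^{C_{out}} \big\lVert W^{(l)}_{i,:,:,:} \big\rVert
  + \sum_{j=1}^{C_{in}} \big\lVert W^{(l)}_{:,j,:,:} \big\rVert,
\qquad
\lVert w \rVert = \sqrt{\textstyle\sum_{k} w_k^{2}}
```

Because each group enters through an unsquared L2 norm, the penalty pushes whole filters or whole input channels towards zero together instead of shrinking individual weights independently.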

        Combining the two approaches

        Pruning and SSL are combined by applying pruning to the FC layers and Structured Sparsity Learning to the convolutional layers. This helps reduce both the memory footprint and the computations of the network. In practice, SSL was not observed to drive the weights exactly to zero. Therefore, a threshold is also used with SSL to remove weights. As with pruning, for the convolutional layers pruned by group lasso, the previously trained values are used to initialize the network for retraining, because SGD-based optimizers learn better from well-initialized parameters.
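The thresholding of convolutional layers can be sketched as follows. This is a minimal illustration under assumptions, not the project's PyTorch code: the convolution weights are stored as nested lists indexed [output channel][input channel][row][column], the threshold epsilon is applied to the L2 norm of each 3D filter (the report does not state exactly what the threshold is compared against), and all names and values are hypothetical.

```python
import math

# Group lasso terms and threshold-based channel removal for one conv layer.
# Weights are nested lists: conv[out_channel][in_channel][row][col].
# Applying epsilon to the filter norm is an assumption of this sketch.

def filter_norm(filt):
    """L2 norm of one 3D filter (all weights producing one output channel)."""
    return math.sqrt(sum(w * w for kernel in filt for row in kernel for w in row))

def input_channel_norm(conv, j):
    """L2 norm of the j-th kernel taken across every 3D filter."""
    return math.sqrt(sum(w * w for filt in conv for row in filt[j] for w in row))

def group_lasso(conv):
    """Sum of filter-wise and input-channel-wise group norms."""
    c_in = len(conv[0])
    return (sum(filter_norm(f) for f in conv)
            + sum(input_channel_norm(conv, j) for j in range(c_in)))

def channels_to_remove(conv, epsilon):
    """Output channels whose whole filter has a norm below epsilon."""
    return [i for i, f in enumerate(conv) if filter_norm(f) < epsilon]

conv = [
    [[[3.0, 0.0], [0.0, 4.0]]],        # filter norm 5.0
    [[[1e-4, -1e-4], [0.0, 1e-4]]],    # nearly zero after SSL training
    [[[0.0, 0.6], [0.8, 0.0]]],        # filter norm about 1.0
]
removed = channels_to_remove(conv, epsilon=0.001)
print(group_lasso(conv), removed)
```

Only the second filter, which training has shrunk close to zero without reaching it exactly, falls below the threshold and is removed as a whole channel.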


        Sensitivity analysis is performed on the LeNet and VGG convolutional neural networks, trained on the MNIST, CIFAR-10 and SVHN datasets.

        Sensitivity Analysis on LeNet5 architecture

        LeNet-5 is a convolutional network designed for handwritten and machine-printed character recognition [37]. The LeNet-5 architecture consists of two sets of convolutional and average pooling layers, followed by a flattening convolutional layer, then two FC layers and finally a softmax classifier [37]. The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used to train image processing systems [38]. Sensitivity analysis is performed on the LeNet-5 convolutional network, trained on the MNIST database.

        Results for LeNet-5 using MNIST

        The following table shows the validation accuracy, the number of channels removed from the convolutional layers and the number of parameters removed from the FC layers as epsilon (the threshold) and the lambda lasso and lambda l2 regularization coefficients are varied.

        Table 1. LeNet-5 using MNIST
        Epsilon (threshold)  lmbda_lasso  lmbda_l2  Validation Accuracy (%)  C1  C2  F1      F2
        0.1                  0.0001       0.0001    99.20                    7   25  199481  4677
        0.0001               0.00001      0.00001   99.33                    0   4   142003  1762
        0.001                0.0001       0.0001    99.37                    4   14  203180  3301
        0.01                 0.000001     0.000001  99.39                    0   8   215644  1817
        0.001                0.000001     0.000001  99.40                    0   6   118690  1089
        0.01                 0.0001       0.0001    99.41                    6   16  247473  4086
        0.00001              0.000001     0.000001  99.42                    1   0   55503   606
        0.001                0.00001      0.00001   99.43                    0   6   275415  2367

        In the above table, C1 stands for number of channels removed for 1st convolutional layer, C2 stands for number of channels removed for 2nd convolutional layer, F1 stands for number of parameters removed from 1st FC layer and F2 stands for number of parameters removed from 2nd FC layer.

        The validation accuracy is highest (99.43%) when epsilon is 0.001, lmbda_lasso is 0.00001 and lmbda_l2 is 0.00001. The same setting also gives the maximum compression, since it removes the most parameters from the 1st FC layer (275415) and therefore leaves the fewest weights in that layer.
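The selection described above can be reproduced directly from the rows of Table 1. The snippet below is a small sketch; the tuple layout and variable names are chosen here for illustration.

```python
# Rows of Table 1: (epsilon, lmbda_lasso, lmbda_l2, val_acc, C1, C2, F1, F2)
rows = [
    (0.1,     1e-4, 1e-4, 99.20, 7, 25, 199481, 4677),
    (0.0001,  1e-5, 1e-5, 99.33, 0, 4,  142003, 1762),
    (0.001,   1e-4, 1e-4, 99.37, 4, 14, 203180, 3301),
    (0.01,    1e-6, 1e-6, 99.39, 0, 8,  215644, 1817),
    (0.001,   1e-6, 1e-6, 99.40, 0, 6,  118690, 1089),
    (0.01,    1e-4, 1e-4, 99.41, 6, 16, 247473, 4086),
    (0.00001, 1e-6, 1e-6, 99.42, 1, 0,  55503,  606),
    (0.001,   1e-5, 1e-5, 99.43, 0, 6,  275415, 2367),
]

best_accuracy = max(rows, key=lambda r: r[3])         # highest validation accuracy
most_fc1_removed = max(rows, key=lambda r: r[6])      # most parameters removed from FC1
most_channels = max(rows, key=lambda r: r[4] + r[5])  # most conv channels removed

print(best_accuracy[:3], most_fc1_removed[:3], most_channels[:3])
```

Ranking by accuracy and by parameters removed from the 1st FC layer selects the same setting, whereas ranking by the number of convolutional channels removed (C1 + C2) would select the first row (epsilon 0.1), so the result depends on which measure of compression is used.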

        Sensitivity Analysis on VGG architecture using SVHN

        VGG is a convolutional neural network architecture. It was introduced by Karen Simonyan and Andrew Zisserman of the Visual Geometry Group at the University of Oxford in 2014 [39]. The SVHN (Street View House Numbers) dataset is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirements on data preprocessing and formatting [40]. Sensitivity analysis is performed on the VGG convolutional network, trained on the SVHN dataset.

        Results for VGG using SVHN dataset

        The following table shows the validation accuracy, the number of channels removed from the convolutional layers and the number of parameters removed from the FC layers as epsilon (the threshold) and the lambda lasso and lambda l2 regularization coefficients are varied.

        Table 2. VGG using SVHN
        Epsilon (threshold)  lmbda_lasso  lmbda_l2  Validation Accuracy (%)  C1  C2  C3  C4   C5   C6   F1       F2      F3
        0.001                0.00001      0.00001   95.03                    0   77  68  143  285  213  2771355  738423  4150
        0.0001               0.000001     0.000001  (yet to be found)

        In the above table, C1 stands for number of channels removed for 1st convolutional layer, C2 stands for number of channels removed for 2nd convolutional layer, and so on. F1 stands for number of parameters removed from 1st FC layer, F2 stands for number of parameters removed from 2nd FC layer and F3 stands for number of parameters removed from 3rd FC layer.

        As with the LeNet-5 architecture, the parameters that give the best compression ratio can be found by performing SA for various values of epsilon, lambda lasso and lambda l2.

        Sensitivity Analysis on VGG architecture using CIFAR-10

        The CIFAR-10 dataset (CIFAR stands for Canadian Institute For Advanced Research) is a collection of pictures commonly used to train machine learning and computer vision algorithms [41]. It is one of the most widely used datasets in machine learning research. The CIFAR-10 dataset consists of 60,000 32x32 colour images in 10 distinct classes: aeroplanes, cars, birds, cats, deer, dogs, frogs, horses, ships and trucks. There are 6,000 pictures of each class [41]. Sensitivity analysis is performed on the VGG convolutional network, trained on the CIFAR-10 dataset.

        Results for VGG using CIFAR-10 dataset

        The following table shows the validation accuracy, the number of channels removed from the convolutional layers and the number of parameters removed from the FC layers as epsilon (the threshold) and the lambda lasso and lambda l2 regularization coefficients are varied.

        Table 2. VGG using CIFAR-10
        Epsilon (threshold)  lmbda_lasso  lmbda_l2  Validation Accuracy (%)  C1  C2  C3  C4  C5   C6   F1       F2       F3
        0.001                0.00001      0.00001   90.93                    0   7   19  48  120  117  3797018  4528580  4531550
        0.0001               0.000001     0.000001  (yet to be found)

        In the above table, C1 stands for number of channels removed for 1st convolutional layer, C2 stands for number of channels removed for 2nd convolutional layer, and so on. F1 stands for number of parameters removed from 1st FC layer, F2 stands for number of parameters removed from 2nd FC layer and F3 stands for number of parameters removed from 3rd FC layer.

        As with the LeNet-5 architecture, the parameters that give the best compression ratio can be found by performing SA for various values of epsilon, lambda lasso and lambda l2.


        In this section, we look at how to classify images of cats and dogs using transfer learning from a pre-trained network. This gives higher accuracy than training the network from scratch [42].

        What is transfer learning?

        We define a pre-trained model to be a saved network that was previously trained on a large dataset [42]. We may use a readily available pre-trained model as it is, or apply transfer learning with the pre-trained convolutional network. The intuition behind transfer learning is that if a model is trained on a large and general enough dataset, it will effectively serve as a generic model of the visual world. By using such a model as the foundation of our own task-specific model, we can take advantage of the learned feature maps without having to train a large model on a large dataset. The following are the two ways of applying transfer learning with a pre-trained model:

        Feature Extraction:

        This uses the representations learned by a previous network to extract relevant features from new samples. To repurpose the previously learned feature maps for our dataset, we simply add a new classifier, trained from scratch, on top of the pre-trained model. The entire model does not need to be retrained, because the feature extraction portion of a pre-trained convolutional network tends to learn features that are generic across pictures. The classification portion of the pre-trained model, however, is usually specific to the original classification task, and hence to the set of classes on which the model was trained.

        Fine tuning:

        This involves unfreezing a few of the top layers of the frozen model base used for feature extraction, and jointly training the newly added classifier layers and these last layers of the base model. This allows us to fine-tune the higher-order feature representations together with our final classifier so that they are more relevant to the specific task.

        The common machine learning workflow is followed. First, the data is examined and understood. Next, an input pipeline is built and the model is composed by loading the pre-trained model with its pre-trained weights and stacking classification layers on top. Finally, the model is trained and evaluated.

        Acceleration of pre-trained models

        It was observed that the time taken by the model decreases when the loss function, optimizer and activation function are chosen to suit the task. The time taken is noted for 3 epochs.

        The following table shows the time taken by varying loss functions, optimizers and activation functions.

        Table 3. Time taken by the model to classify cats and dogs using pre-trained model
        Activation function  Loss function        Optimizer  Time taken
        sigmoid              binary_crossentropy  RMSprop    402.4466
        sigmoid              hinge                RMSprop    327.6621
        tanh                 hinge                RMSprop    314.6919
        tanh                 squared_hinge        RMSprop    230.6533
        tanh                 squared_hinge        Adagrad    165.6308

        The time taken by the model is lowest (165.6308) when tanh is the activation function, squared_hinge is the loss function and Adagrad is the optimizer.
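The size of the improvement can be read off the table with a short calculation. The sketch below only re-uses the timings reported in Table 3, in whatever unit the table uses; the variable names are illustrative.

```python
# Timings from Table 3, keyed by (activation, loss, optimizer).
times = {
    ("sigmoid", "binary_crossentropy", "RMSprop"): 402.4466,
    ("sigmoid", "hinge", "RMSprop"): 327.6621,
    ("tanh", "hinge", "RMSprop"): 314.6919,
    ("tanh", "squared_hinge", "RMSprop"): 230.6533,
    ("tanh", "squared_hinge", "Adagrad"): 165.6308,
}

fastest_config = min(times, key=times.get)
slowest_config = max(times, key=times.get)
speedup = times[slowest_config] / times[fastest_config]
reduction_pct = 100.0 * (1.0 - times[fastest_config] / times[slowest_config])

print(fastest_config, round(speedup, 2), round(reduction_pct, 1))
```

Relative to the slowest configuration, the fastest one takes roughly 59% less time, a speed-up of about 2.4 times.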


        Combining Pruning and Structured Sparsity Learning techniques

        The experiments are carried out on the MNIST dataset, an image classification benchmark comprising a training set of 60,000 and a test set of 10,000 gray-scale pictures of size 28 × 28. The model used is LeNet-5, which comprises two convolutional layers (20 channels and 50 channels) and two FC layers (500 neurons and 10 output neurons), with a total of about 430,000 parameters. The code is written in PyTorch and the experiments are run on a Tesla K20 GPU. The combined approach reduced the model size by 90.3% and the computations by 57.1%.
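The parameter total quoted above can be checked with a short count. The channel and neuron numbers come from the text; the 5 × 5 kernel size, the single input channel and the 800 features entering the first FC layer are assumptions of this sketch (they match a common LeNet-5 variant) and are not stated in this report.

```python
# Parameter count for the LeNet-5 variant described above.
# Assumed (not stated in the text): 5 x 5 kernels, 1 input channel,
# and 800 flattened features feeding the first FC layer.

def conv_params(c_in, c_out, k):
    return c_in * c_out * k * k + c_out  # weights plus one bias per output channel

def fc_params(n_in, n_out):
    return n_in * n_out + n_out          # weights plus one bias per neuron

layers = {
    "conv1": conv_params(1, 20, 5),
    "conv2": conv_params(20, 50, 5),
    "fc1": fc_params(800, 500),
    "fc2": fc_params(500, 10),
}
total = sum(layers.values())
fc_share = (layers["fc1"] + layers["fc2"]) / total

print(layers, total, round(fc_share, 3))
```

Under these assumptions the total is 431,080 parameters, of which about 94% sit in the FC layers, which is why pruning the FC layers dominates the reduction in model size.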


        Here, we have addressed the problem of compressing neural networks. The combined pruning and Structured Sparsity Learning approach decreases the size of the model by pruning the FC layers, and reduces the computational complexity of the model by pruning the structure of the convolutional layers. The work can be extended by developing algorithms to deploy these pruned architectures on embedded devices. Sensitivity analysis was carried out to understand how the outcomes change when the parameters are varied. The parameters that give maximum compression were noted for the LeNet-5 architecture. Similarly, SA for the VGG architectures can be carried out for different values of epsilon, lambda lasso and lambda l2 to find which parameter values give the best compression ratio. A readily available TF model was run, and attempts were made to reduce the time it takes.


        [1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei Fei, "ImageNet: A Large-Scale Hierarchical Image Database", in CVPR09, 2009.

        [2] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, "Places: A 10 million image database for scene recognition", IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 6, pp. 1452–1464, 2017.

        [3] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, "Freebase: a collaboratively created graph database for structuring human knowledge", in Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008, pp. 1247–1250.

        [4] D. Vrandečić and M. Krötzsch, "Wikidata: a free collaborative knowledge base", 2014.

        [5] Alexander, E.R. (1989), "Sensitivity analysis in complex decision models", Journal of the American Planning Association 55: 323-333.

        [6] David J. Pannell (1997), "Sensitivity analysis: strategies, methods, concepts, examples", School of Agricultural and Resource Economics, University of Western Australia, Crawley 6009, Australia.

        [7] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems", 2015. Software available from tensorflow.org.

        [8] Soheil Bahrampour, Naveen Ramakrishnan, Lukas Schott, Mohak Shah, "Comparative Study of Deep Learning Software Frameworks", Research and Technology Center, Robert Bosch LLC. arXiv:1511.06435v3 [cs.LG], 30 Mar 2016.

        [9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks", in Advances in neural information processing systems, 2012, pp. 1097–1105.

        [10] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks", in European conference on computer vision. Springer, 2014, pp. 818–833.

        [11] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition", arXiv preprint arXiv:1409.1556, 2014.

        [12] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation", in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.

        [13] A. S. Razavian, J. Sullivan, S. Carlsson, and A. Maki, "Visual instance retrieval with deep convolutional networks", ITE Transactions on Media Technology and Applications, vol. 4, no. 3, pp. 251–258, 2016.

        [14] Y. Gong, L. Wang, R. Guo, and S. Lazebnik, "Multi-scale orderless pooling of deep convolutional activation features", in European conference on computer vision. Springer, 2014, pp. 392–407.

        [15] G. Hinton, L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, B. Kingsbury et al., "Deep neural networks for acoustic modeling in speech recognition", IEEE Signal processing magazine, vol. 29, 2012.

        [16] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, "Natural language processing (almost) from scratch", Journal of machine learning research, vol. 12, no. Aug, pp. 2493–2537, 2011.

        [17] L. Deng, J. Li, J.-T. Huang, K. Yao, D. Yu, F. Seide, M. Seltzer, G. Zweig, X. He, J. Williams et al., "Recent advances in deep learning for speech research at microsoft", in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 8604–8608.

        [18] H. Y. Xiong, B. Alipanahi, L. J. Lee, H. Bretschneider, D. Merico, R. K. Yuen, Y. Hua, S. Gueroussov, H. S. Najafabadi, T. R. Hughes et al., "The human splicing code reveals new insights into the genetic determinants of disease", Science, vol. 347, no. 6218, p. 1254806, 2015.

        [19] J. Zhou and O. G. Troyanskaya, "Predicting effects of noncoding variants with deep learning–based sequence model", Nature methods, vol. 12, no. 10, p. 931, 2015.

        [20] M. Pfeiffer, M. Schaeuble, J. Nieto, R. Siegwart, and C. Cadena, "From perception to decision: A data-driven approach to end-to-end motion planning for autonomous ground robots", in 2017 IEEE international conference on robotics and automation (icra). IEEE, 2017, pp. 1527–1533.

        [21] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik, "Cognitive mapping and planning for visual navigation", in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2616–2625.

        [22] S. Shalev-Shwartz, S. Shammah, and A. Shashua, "Safe, multi-agent, reinforcement learning for autonomous driving", arXiv preprint arXiv:1610.03295, 2016.

        [23] A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, and A. Ng, "Deep learning with COTS HPC systems", in International Conference on Machine Learning, 2013, pp. 1337–1345.

        [24] L. N. Huynh, R. K. Balan, and Y. Lee, "Deepsense: A gpu-based deep convolutional neural network framework on commodity mobile devices", in Proceedings of the 2016 Workshop on Wearable Systems and Applications. ACM, 2016, pp. 25–30.

        [25] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network", in Advances in neural information processing systems, 2015, pp. 1135–1143.

        [26] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks", in Advances in Neural Information Processing Systems, 2016, pp. 2074–2082.

        [27] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables", Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp. 49–67, 2006.

        [28] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, "Exploiting linear structure within convolutional networks for efficient evaluation", in Advances in neural information processing systems, 2014, pp. 1269–1277.

        [29] N. D. Lane, S. Bhattacharya, P. Georgiev, C. Forlivesi, L. Jiao, L. Qendro, and F. Kawsar, "Deepx: A software accelerator for low-power deep learning inference on mobile devices", in Proceedings of the 15th International Conference on Information Processing in Sensor Networks. IEEE Press, 2016, p. 23.

        [30] C. Louizos, M. Welling, and D. P. Kingma, "Learning sparse neural networks through $L_0$ regularization", arXiv preprint arXiv:1712.01312, 2017. [Online]. Available: http://arxiv.org/abs/1712.01312

        [31] S. Zagoruyko and N. Komodakis, "Wide residual networks", arXiv preprint arXiv:1605.07146, 2016.

        [32] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size", arXiv preprint arXiv:1602.07360, 2016.

        [33] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "Mobilenets: Efficient convolutional neural networks for mobile vision applications", arXiv preprint arXiv:1704.04861, 2017.

        [34] F. Chollet, "Xception: Deep learning with depthwise separable convolutions", in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1251–1258.

        [35] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions", in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.

        [36] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network", arXiv preprint arXiv:1503.02531, 2015.

        [37] Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner, "Gradient-Based Learning Applied to Document Recognition", Proceedings of the IEEE, November 1998.

        [38] Wan Zhu, "Classification of MNIST Handwritten Digit Database using Neural Network", Research School of Computer Science, Australian National University, Acton, ACT 2601, Australia.

        [42] Chuanqi Tan, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, Chunfang Liu, "A Survey on Deep Transfer Learning", the 27th International Conference on Artificial Neural Networks (ICANN 2018), arXiv:1808.01974 [cs.LG]

        [40] Ian J. Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, Vinay Shet, "Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks", Google Inc., Mountain View, CA.

        [41] Yan-Yan Wang, "Image Classification on CIFAR-10 Dataset", International Journal of Scientific Research Engineering & Technology (IJSRET), ISSN 2278 – 0882, Volume 7, Issue 6, June 2018.


        It gives me immense pleasure to express my deep sense of gratitude to Dr. Sathish Vadhiyar for his invaluable guidance and support throughout my internship. Regular consultation with him made me work harder and helped me look at things from a much better perspective. I am indebted to him for responding to my queries so promptly. I am thankful to the Indian Academy of Sciences for providing me with the opportunity to learn. I would like to extend my gratitude to the research fellows for the discussions and for sharing their priceless ideas. I would also like to thank my family for their constant support and faith in me.