# Evaluation of different machine learning algorithms for multi-spectral satellite image classification

Aryaman Sinha

Indian Institute of Technology Bhubaneswar, Khordha, Odisha 752050

Prof. Uttam Kumar

Assistant Professor, International Institute of Information Technology Bangalore, Electronics City, Bangalore 560100

## Abstract

A number of classification algorithms have been developed over the past two decades to analyze the remotely sensed data that perform both binary and multi-class classification. Classification of remote sensing data is useful to obtain land cover and land use information of an area that acts as one of the important input layers for developing complex models for numerous environmental applications such as urban planning and smart city development, disaster management, biodiversity and biomass studies, water bodies mapping, agricultural and vegetation studies, climate change, etc. Remote sensing data of various spatial, spectral, radiometric and temporal resolutions acquired through different space borne satellites are available in public domain that can be classified using free and open source software (FOSS) based machine learning algorithms to obtain useful information related to the Earth’s surface. Geographic Resources Analysis Support System Geographic Information System (GRASS GIS) along with GDAL (Geospatial Data Abstraction Library), OGR (OpenGIS Simple Features Reference), and python libraries (scikit-learn) are FOSS that facilitate geospatial data management and analysis, image processing, graphics/maps production, spatial modelling, statistical computing, time series analysis and visualization of various types of data. In this work, we use Landsat-8 OLI multispectral dataset having six spectral bands in the electromagnetic (EM) spectrum (viz. Blue, Green, Red, Near Infra-Red (NIR), Shortwave Infra-Red1 - SWIR 1 and Shortwave Infra-Red2 - SWIR 2) acquired over Bangalore City during March, 2018 to classify into four land use classes – buildings (concrete roof, asbestos roof, roads, parking lots, walk ways), water (lakes, ponds), vacant land / open areas and green vegetation (trees, parks, lawns). 
Comparative evaluation of the performances of different machine learning algorithms such as Maximum Likelihood Classifier, Random Forest, Multi-layer Perceptron, Support Vector Machine, XGBoost, Stacked Denoising Auto-Encoder and Energy Based Models is performed by computing user’s, producer’s, overall accuracy, kappa statistics, k-fold cross validation and ROC curve. An ensemble classifier model was developed that renders higher classification accuracy than the individual classifiers.

Keywords: remote sensing, machine learning, GIS, spatial, spectral, algorithms

## Abbreviations

List of abbreviations
| Abbreviation | Full form |
|---|---|
| DT | Decision Tree |
| RF | Random Forest |
| SVM | Support Vector Machine |
| MLC | Maximum Likelihood Classifier |
| CART | Classification and Regression Trees |
| ROC | Receiver Operating Characteristics |
| GEOBIA | Geographic Object-based Image Analysis |
| UA | User’s Accuracy |
| PA | Producer’s Accuracy |
| OA | Overall Accuracy |

## Background

Classification of remotely sensed data for mapping and monitoring the different land cover and land use types is an important application in understanding the Earth’s system science. Land cover and land use information of an area act as one of the important input layers for developing complex models for numerous environmental applications such as urban planning and smart city development, disaster management, biodiversity and biomass studies, water bodies mapping, agricultural and vegetation studies, climate change, etc.

## Statement of the Problems

Previous studies have shown that machine learning algorithms are often more useful and accurate than traditional classification techniques, especially when the feature space is complex. A wide range of machine learning algorithms has been used for classification of land cover and land use, from unsupervised learning to supervised methods such as Maximum Likelihood, Random Forest, SVM, XGBoost, Auto-Encoders, Energy Based Models, Multi-layer Perceptron, etc. The accuracy of the classified maps can vary depending upon the choice of algorithm. Selecting the best performing algorithm is therefore an essential task, and the choice also depends on the training data, pre-processing and other auxiliary variables.

## Objectives of the Research

• To evaluate different machine learning algorithms for multi-spectral satellite image (Landsat 8 OLI) classification based on their user’s, producer’s, overall accuracy and kappa statistics.
• To develop an ensemble method using combination of strong classifiers.

## Scope

This research focuses on the efficacy of machine learning algorithms to create land use map of Bangalore City for year 2018. The four land use classes considered in this study are urban area, vegetation, water and open area.

The techniques developed in this study can be extended for historical data analysis (like 2014, 2009, 2004, 2000, etc.). This time series information can be used for prediction of city growth in near future and can aid in formulation of policies, city developmental guidelines and strategies to be followed for achieving the smart city goals.

## Information

Several studies have explored land cover / land use analysis of remotely sensed data using different classification algorithms. Otukei and Blaschke [1] performed land cover mapping and land cover change assessment from 1986 to 2001 in Pallisa District, Uganda using DTs, SVMs, and MLC. They analysed the use of data mining to find the appropriate bands for classification and the decision thresholds, and assessed the performance of the classification algorithms. The study concluded that land cover dynamics were occurring at an unpredicted rate.

Shao and Lunetta [2] compared SVMs, CART and Neural Networks for land cover classification using limited data points for training. They showed that the use of MODIS (Moderate Resolution Imaging Spectroradiometer) time-series data can increase the number of features available for classification. Varying the training data size and studying its effect revealed that SVM was more accurate than Neural Nets and CART, with the accuracy differences increasing for small training sizes. The overall accuracies differed for homogeneous and heterogeneous sub-pixel cover, while indicating high potential for regional scale operational land cover characterization.

Rodriguez-Galiano and Chica-Rivas [3] evaluated SVM, DTs, ANN and RF for land cover mapping of a Mediterranean area from multi-seasonal satellite data, comparing them on user’s, producer’s and overall accuracies, kappa statistics, noise sensitivity and Z score.

Chen et al. [4] used stacked denoising autoencoders for land cover classification and feature extraction using hyperspectral satellite images. They built a stacked autoencoder as a deep learning method to extract important features and used the pretrained encoder to train the classifier that produces the classified result.

Kanita Tangthaikwan [5] experimented with multi-class SVMs with an RBF kernel for spatial data classification from multispectral scanner (MSS) satellite data to identify areas of land use. In this study, pixel-based classification was performed according to the spectral values of the data, and the sigma of the RBF kernel was varied to obtain higher accuracy while comparing SVM with MLP and PCA.

Apurva [6] carried out geographical area mapping and classification using different machine learning algorithms and discussed different ways of utilizing land cover and land use information for sustainable city development, traffic congestion, air pollution, etc. To increase sustainability, land use should thus be planned such that environment friendly neighborhoods can be developed.

Land cover mapping at high resolution can be challenging for several reasons, such as large data volume, processing time, computational load, complexity of developing training, test and validation sets, heterogeneity in data, etc. Maxwell [7] used a combination of GEOBIA, RF machine learning, public imagery and primary data to classify barren land, where mixed classes proved difficult to map with high accuracy as assessed using UA and PA. This research suggests exploring methods that depend solely upon open source products to render land cover maps.

## Summary

The review of literature reveals the different factors that have to be considered during land cover / land use mapping, such as the choice of classification algorithm, the training data size, and the selection and use of open source products like R, Python, GRASS GIS, etc. A few studies also highlighted the problem of land cover classification from mixed pixels, which is prominent in medium to coarse spatial resolution remotely sensed data.

## Data Overview and Pre-processing

A multi-spectral dataset of Bangalore City (acquired during March 2018) was taken for the present study. It consists of six spectral bands, i.e. Blue, Green, Red, Near Infra-Red (NIR), Shortwave Infra-Red 1 (SWIR 1) and Shortwave Infra-Red 2 (SWIR 2). Each band image was of 1660 x 1529 dimension and stored in GeoTIFF (GTiff) format. Training and test data with labelled pixels of four classes were used in the subsequent classification.

During data handling, the band images were combined into one image with six channels, i.e. a numpy array of dimension 1660 x 1529 x 6. The numpy array was then converted to a pandas dataframe in Python so that the training pixels could be separated from the other pixels. The dataframe thus contained the pixel information for training the classification models as well as for predicting the output results.
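The stacking and dataframe conversion described above can be sketched roughly as follows. This is a minimal illustration and not the study's actual code: the band arrays are random stand-ins for the GeoTIFF rasters, the image is tiny instead of 1660 x 1529, and the names `band_names`, `training_map` and `label` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the six band rasters; real data would be
# read from the GeoTIFF files instead of being generated randomly.
rows, cols = 4, 5  # tiny size for illustration (the study uses 1660 x 1529)
band_names = ["blue", "green", "red", "nir", "swir1", "swir2"]
rng = np.random.default_rng(0)
bands = [rng.integers(0, 65535, size=(rows, cols)) for _ in band_names]

# Stack the single-band images into one (rows, cols, 6) array.
image = np.stack(bands, axis=-1)

# Hypothetical training map: 0 = unlabelled pixel, 1..4 = class id.
training_map = np.zeros((rows, cols), dtype=int)
training_map[0, :2] = 1
training_map[3, 3:] = 4

# One row per pixel, one column per band, plus the label column.
pixels = pd.DataFrame(image.reshape(-1, len(band_names)), columns=band_names)
pixels["label"] = training_map.ravel()

train_pixels = pixels[pixels["label"] > 0]    # labelled pixels for training
all_features = pixels[band_names].to_numpy()  # every pixel, for prediction
```

Keeping every pixel in a single table means the same feature columns can be reused both for fitting a model on `train_pixels` and for predicting the full map from `all_features`.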

Label Description
| Class Id | Class Name | Legend |
|---|---|---|
| 1 | Urban / built-up / concrete / houses / buildings / roads / pavements / walk ways | Red |
| 2 | Vegetation / Parks / Trees | Green |
| 3 | Water / Lakes | Blue |
| 4 | Open area / barren land / vacant land | Pink |

False Composite map of given dataset
RGB map of given dataset

Training map

## Software and Library

GRASS (Geographic Resources Analysis Support System) is a free and open source Geographic Information System (GIS) software suite used for geospatial data management and analysis, image processing, graphics and maps production, spatial modeling and visualization. GRASS GIS is currently used in academic and commercial settings around the world, as well as by many governmental agencies and environmental consulting companies. GRASS GIS was used for visualization of remotely sensed data and classified maps. Raster modules were used to analyse raster maps; it uses GDAL Library provided by OSGeo for python developers (Python 3.7.3).

## Maximum-likelihood Classifier (MLC)

In this classification model, we assume that the class-conditional probability p(x|y) follows a multivariate normal (Gaussian) distribution. As in logistic regression, we take the log likelihood and maximize it to estimate the distribution parameters (the mean vector and covariance matrix of each class). Maximum likelihood estimation thus yields these parameters, which in turn give the decision boundaries: each observation x is assigned to the class for which it has the highest probability. Because of the normal assumption, each cluster (class) can be described by equiprobable contours drawn around its center, and the probability decreases as we move away from the center. MLC defines a threshold distance by defining a maximum probability value. Thus, we finally get the decision boundaries for classification.

So, if we have a classification problem in which the input features x are continuous random variables, the Gaussian Discriminant Analysis model can be used (it models p(x|y) as a normal distribution). GRASS GIS has a function called i.gensig() for supervised learning, which reads the raster map layer, i.e. the training map, and uses it to develop the spectral signature file that serves as input for the maximum likelihood classifier. The spectral signature file contains the number of points in each class, the mean values per band of the class, and the semi-matrix of band-to-band covariance. The maximum likelihood classifier uses the class means and covariance matrices from the signature file generated by i.gensig(); for each pixel it calculates the statistical distance (i.e. probability value) to every class and assigns the class (cluster) accordingly. The i.maxlik() function in GRASS performs this maximum likelihood discriminant analysis classification. In short, i.gensig() creates the signature file from the training map, and i.maxlik() builds the classifier and gives the classification output.
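To make the idea concrete, the following is a small numpy sketch of a Gaussian maximum likelihood classifier. It is not the GRASS GIS implementation: the function names are hypothetical, equal class priors are assumed, and the data are synthetic two-band samples. The "signature" of each class is its mean vector and covariance matrix, and each sample is assigned to the class with the highest log-likelihood.

```python
import numpy as np

def fit_gaussian_signatures(X, y):
    """Estimate a (mean vector, covariance matrix) signature per class."""
    signatures = {}
    for c in np.unique(y):
        Xc = X[y == c]
        signatures[int(c)] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False))
    return signatures

def log_likelihood(X, mean, cov):
    """Log of the multivariate normal density p(x | class) for each row of X."""
    d = X.shape[1]
    diff = X - mean
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    # Squared Mahalanobis distance of every sample to the class center.
    maha = np.einsum("ij,jk,ik->i", diff, inv, diff)
    return -0.5 * (maha + logdet + d * np.log(2 * np.pi))

def predict(X, signatures):
    """Assign each sample to the class with the highest log-likelihood
    (equal class priors are assumed here)."""
    classes = sorted(signatures)
    scores = np.column_stack(
        [log_likelihood(X, *signatures[c]) for c in classes]
    )
    return np.array(classes)[scores.argmax(axis=1)]

# Synthetic two-band samples for two well separated classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
y = np.array([1] * 200 + [2] * 200)

signatures = fit_gaussian_signatures(X, y)
pred = predict(X, signatures)
accuracy = (pred == y).mean()
```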

## Random Forest Classifier (RFC)

Random Forest or Random Decision Forest uses ensemble learning method to classify.

A decision tree is an acyclic graph that can be used to make decisions. In each branching node, a specific feature j of the feature vector is examined. If the value of the feature is below a specific threshold, the left branch is followed; otherwise, the right branch is followed. When a leaf node is reached, the decision is made about the class to which the example belongs.

A decision tree is similar to a rule-based system. Given a training dataset with targets and features, the algorithm comes up with a set of rules. These rules can then be used to perform predictions on the test dataset.

The methods used to ensemble decision trees are bagging and boosting. Bagging draws random subsets of the original training dataset with replacement and trains one tree on each subset. Random Forest is an ensemble classifier which uses bagging (or bootstrap aggregation) and, in addition, uses a random selection of features for splitting.

Sklearn provides the Random Forest Classifier which uses some parameters to tune the model: -

• n_estimators i.e. number of estimators.
• max_features i.e. the number of features to consider when looking for the best split.
• max_depth i.e. maximum allowable depth for a tree.
• max_leaf_nodes i.e. maximum number of allowable leaf nodes.
• min_impurity_decrease i.e. a node will be split if this split induces a decrease of the impurity greater than or equal to this value.

Here, the grid search algorithm was used to find the best parameters.

The parameter settings used are: -

• n_estimators = 4
• max_features = sqrt
• max_leaf_nodes= 15
• max_depth= 10
• min_impurity_decrease = 1e-3
• random_state = 42

60% of the data was used for training and 40% for validation.
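As an illustration, the settings listed above map onto scikit-learn roughly as shown below. The data here are synthetic (six features and four classes generated with `make_classification`, standing in for the six bands and four land use classes), so the resulting accuracy says nothing about the study's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labelled pixels: 6 features, 4 classes.
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)

# 60% of the samples for training, 40% for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.6, random_state=42, stratify=y
)

rfc = RandomForestClassifier(
    n_estimators=4,
    max_features="sqrt",
    max_leaf_nodes=15,
    max_depth=10,
    min_impurity_decrease=1e-3,
    random_state=42,
)
rfc.fit(X_train, y_train)
val_accuracy = rfc.score(X_val, y_val)
```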

## Multi-Layer Perceptron (MLP)

Multi-Layer Perceptron is a class of feed-forward artificial neural network.

It consists of three kinds of layers: an input layer, one or more hidden layers and an output layer. The hidden layers and the output layer have nodes with non-linear activation functions. Each node of an MLP has weights (w), a bias value (b) and an activation function g(.); when an input x is passed through it, the output produced is g(w.x + b). An MLP consists of many such nodes arranged as a fully connected feed-forward neural network. It updates the weights using the backpropagation algorithm.

Historically, the two common activation functions were tanh and sigmoid, but in deep learning the rectified linear unit (ReLU) is now more commonly used than sigmoid, as it avoids the numerical problems related to sigmoid. Learning in a perceptron occurs by updating the weights after the input data is processed. The update is based on the error of the output compared to the actual result. Stochastic gradient descent is the classical optimization algorithm for updating the weights so that the loss approaches a minimum; the Adam optimizer is an adaptive learning rate optimization algorithm which can be used instead of classical stochastic gradient descent. The learning rate (or step size) is a hyperparameter which controls to what extent the new value of a weight replaces the old value. Many parameters and functions affect the learning of a neural network, and tuning them helps to obtain a good model for a given dataset.

Sklearn provides the MLPClassifier class, whose behaviour is determined by its parameter settings. The parameter settings used in our model are: -

• hidden_layer_sizes = (32,16,16,32)
• early_stopping = True
• max_iter = 25
• activation = relu
• solver = adam
• validation_fraction = 0.1
• tol = 1e-4
• learning_rate_init =1e-3
• nesterovs_momentum = True
• random_state = 42

Here, hidden_layer_sizes takes a tuple which gives the number of hidden layers and the number of nodes in each. If early_stopping is True, the learning process stops when the validation score does not improve by more than a tolerance value (the tol parameter). The activation function used is ReLU with the Adam optimizer (solver). nesterovs_momentum was set to True; Nesterov momentum applies the velocity to the parameters to compute interim parameters and uses these interim parameters to calculate the gradient, which can improve convergence, although in scikit-learn this setting only takes effect with the sgd solver and not with adam. The learning rate was initialized to 0.001. max_iter is the maximum number of iterations allowed for learning the model, and validation_fraction is the fraction of the dataset used for validation. Here 90% was used for training the model and the remaining 10% for validation. The model thus has an input layer of size 6 (as there are six bands) and an output layer of size 4 (as there are four classes).
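A rough sketch of this configuration in scikit-learn is given below. The data are synthetic (six features, four classes) and serve only to show the resulting layer shapes; the convergence warning that scikit-learn may raise because of the small max_iter is silenced for readability.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the labelled pixels: 6 features, 4 classes.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)

mlp = MLPClassifier(
    hidden_layer_sizes=(32, 16, 16, 32),
    activation="relu",
    solver="adam",
    early_stopping=True,
    validation_fraction=0.1,   # 10% of the training data held out
    max_iter=25,
    tol=1e-4,
    learning_rate_init=1e-3,
    nesterovs_momentum=True,   # ignored by scikit-learn unless solver="sgd"
    random_state=42,
)

with warnings.catch_warnings():
    # max_iter=25 may end training before convergence.
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    mlp.fit(X, y)

# Weight matrix shapes: 6 inputs -> 32 -> 16 -> 16 -> 32 -> 4 outputs.
layer_shapes = [w.shape for w in mlp.coefs_]
```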

## Support Vector Machine (SVM)

Support Vector Machines (SVMs) are supervised algorithms used for classification and regression. An SVM creates a hyperplane in N dimensions (N is the number of features) that distinctly classifies the data points. There can be many hyperplanes that separate two classes, but SVM finds the one with the maximum margin, i.e. the maximum distance from the nearest data points of both classes.

SVM uses kernel to transform the feature space, i.e. every dot product is replaced by the kernel function. Selecting an appropriate kernel is one of the parameters for SVM and there are several kernels available for selection in classification such as 1) Polynomial 2) RBF (Radial Basis Function) 3) Linear 4) Sigmoid Functions, etc.

SVMs need adjustment of a number of parameters for optimization: -

• Kernel Function
• Gamma parameter - it defines how far the influence of a single training point reaches. A larger value of gamma means the influence is limited to close data points, and vice versa.
• Regularization parameter (often symbolized by C in the cost function). For a large value of C a small-margin hyperplane is chosen, whereas for a small value of C a large-margin hyperplane is chosen.
• Decision function - it is used for multi-class SVM. Here, algorithm builds several binary classifiers that distinguish one label from rest. This decision type is called one-vs-rest (ovr) or one-vs-all. Other method is to build binary classifiers between pairs called one-vs-one (ovo).
• Degree - it is only defined for polynomial kernel.
• Bias on kernel function - it is used only for polynomial and sigmoid kernel.

Sklearn provides an SVM classifier, and with appropriate parameter settings one can get a good model.

The parameter settings used for our model (multi-class SVM classifier) are:

• C = 1e-6 (regularization parameter)
• decision_function_shape = ovo (one-vs-one)
• kernel = poly (Polynomial)
• degree = 1
• tol = 4e-2 (Tolerance criteria for stopping)
• random_state= 42
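For illustration, these settings correspond to the following scikit-learn call. The data are synthetic (six features, four classes), so the fit itself is not meaningful; the point is that with the one-vs-one decision function and four classes there are 4 x 3 / 2 = 6 pairwise decision values per sample.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the labelled pixels: 6 features, 4 classes.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)

svm = SVC(
    C=1e-6,                         # regularization parameter
    kernel="poly",
    degree=1,
    decision_function_shape="ovo",  # one-vs-one
    tol=4e-2,                       # tolerance for the stopping criterion
    random_state=42,
)
svm.fit(X, y)

# One decision value per pair of classes: 4 * 3 / 2 = 6 columns.
pairwise_scores = svm.decision_function(X[:5])
```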

## XGBoost Classifier

XGBoost is a fast, optimized implementation of gradient boosted trees. It addresses major inefficiencies of gradient boosting. XGBoost provides parallel tree boosting that solves many data science problems in a fast and accurate way. It is an optimized distributed gradient boosting library designed around the gradient boosting framework. The library provides several hyperparameters for tuning the model, and its algorithm is designed to reduce overfitting.

XGBoost has a scikit-learn compatible classifier called XGBClassifier, which allows every parameter to be tuned using the grid search algorithm. Commonly used parameters to tune the model are: -

• n_estimators - it’s the number of subtrees used to build the gradient boosted ensemble tree.
• max_depth - the maximum tree depth each individual tree can grow.
• learning_rate - the step size shrinkage applied to the contribution of each tree.
• reg_alpha - it’s the constant that controls the L1 regularization.
• reg_lambda - it’s the constant that controls the L2 regularization.
• gamma - minimum loss reduction to make further partition.

The parameter settings used for our model are:

• n_estimators = 100
• learning_rate = 0.17
• max_delta_step = 1
• min_child_weight = 2
• gamma = 1e-3
• reg_alpha = 1
• reg_lambda = 1
• max_depth = 3

We have used 85% data for training and 15% for validation.

## Stacked Denoising Auto-Encoder (SDAE) Classifier

A typical SDAE includes two encoder layers and two decoder layers. In the encoding part, the output of the first encoder layer serves as input to the next encoder layer. There can also be more than one hidden encoder layer. Suppose there are L encoder layers; then,

$Y^{(k+1)} = g\left(W^{(k+1)}\, Y^{(k)} + b^{(k+1)}\right), \quad k = 0, 1, \dots, L-1$

Here $Y^{(0)}$ is the input data X and $Y^{(L)}$ is the output of the last encoding layer, which extracts high level features from the input data X. Similarly, in the decoding part, the output of the first decoding layer serves as input to the next decoding layer:

$\widehat{Y}^{(k+1)} = h\left(W^{(L-k)}\, \widehat{Y}^{(k)} + b^{(L-k)}\right), \quad k = 0, 1, \dots, L-1$

Here $\widehat{Y}^{(0)} = Y^{(L)}$ is the encoded representation and $\widehat{Y}^{(L)}$ is the reconstructed output of the decoder. g(.) and h(.) are the activation functions used for the encoder and decoder layers. Here ReLU was used as g(.) and softmax as h(.).
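The layer-by-layer recursion of the encoder and decoder can be sketched in numpy as below. This is only a forward pass with random, untrained weights: the layer sizes, the Gaussian input corruption and the separate (untied) decoder weights `dec_W` are assumptions made for illustration and are not taken from the study.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    # Column-wise softmax (one column per sample), shifted for stability.
    z = z - z.max(axis=0, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
layer_sizes = [6, 4, 3]        # hypothetical: 6 input bands, L = 2 encoders
L = len(layer_sizes) - 1

# Encoder parameters W^(k+1), b^(k+1) for k = 0..L-1.
enc_W = [rng.normal(0, 0.1, (layer_sizes[k + 1], layer_sizes[k])) for k in range(L)]
enc_b = [np.zeros((layer_sizes[k + 1], 1)) for k in range(L)]
# Decoder parameters mirror the encoder sizes (separate weights assumed).
dec_W = [rng.normal(0, 0.1, (layer_sizes[L - k - 1], layer_sizes[L - k])) for k in range(L)]
dec_b = [np.zeros((layer_sizes[L - k - 1], 1)) for k in range(L)]

X = rng.random((6, 10))                      # 10 samples stored as columns
X_noisy = X + rng.normal(0, 0.05, X.shape)   # "denoising": corrupt the input

# Encoding: Y^(k+1) = g(W^(k+1) Y^(k) + b^(k+1)), starting from Y^(0).
Y = X_noisy
for k in range(L):
    Y = relu(enc_W[k] @ Y + enc_b[k])
code = Y                                     # Y^(L), the extracted features

# Decoding: each decoder layer feeds the next one, with activation h.
Y_hat = code
for k in range(L):
    Y_hat = softmax(dec_W[k] @ Y_hat + dec_b[k])

# A denoising autoencoder is trained to reconstruct the clean input X.
reconstruction_error = np.mean((Y_hat - X) ** 2)
```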

To train the SDAE, one DAE is trained at a time: the first DAE is trained separately, then the output of its encoder part is used as input to the next DAE, and so on until all DAEs have been trained one by one.

To build an SDAE classifier, the decoding part of the SDAE is removed and a softmax layer is connected for multi-class classification. First, the softmax layer is trained on the output of the last encoding layer to obtain its weights.

Next, the pre-trained SDAE and softmax layer weights are used as the initial weights of the SDAE classifier and are fine-tuned. Our proposed model contains two DAEs forming the SDAE and one softmax layer for classification, which together form the SDAE classifier as shown in Fig 4-7.

DAE-1
DAE-2
Softmax Layer
SDAE – Classifier

For training, SGD (Stochastic Gradient Descent) optimization with Nesterov momentum was used for both DAEs and the Nesterov Adam optimizer for the softmax layer. The learning rate was 0.01 for DAE-1, 0.1 for DAE-2 and 0.002 for the softmax layer. The loss function was mean squared error for the DAEs and cross-entropy for the softmax layer. For the SDAE classifier, the Nesterov Adam optimizer with cross-entropy loss and a learning rate of 0.002 was used (90% of the data for training and 10% for validation). All models were trained for 15 epochs, and the learning rate was reduced by a factor of 0.1 if the validation loss did not change (within a tolerance of 1e-4) for 5 epochs.
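The learning rate reduction rule can be illustrated with a small, framework-independent sketch. The function name `reduce_on_plateau` and the sequence of validation losses are hypothetical; the factor, patience and tolerance follow the values stated above.

```python
import math

def reduce_on_plateau(val_losses, initial_lr, factor=0.1, patience=5, tol=1e-4):
    """Return the learning rate in effect at each epoch.

    The rate is multiplied by `factor` whenever the validation loss has
    not improved by more than `tol` for `patience` consecutive epochs.
    """
    lr = initial_lr
    best = float("inf")
    stale_epochs = 0
    history = []
    for loss in val_losses:
        history.append(lr)
        if best - loss > tol:
            best = loss
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                lr *= factor
                stale_epochs = 0
    return history

# Hypothetical validation losses over 15 epochs: improvement stops early.
val_losses = [1.0, 0.9] + [0.9] * 13
lr_history = reduce_on_plateau(val_losses, initial_lr=0.002)
```

With these made-up losses, the rate stays at 0.002 until five stagnant epochs have passed and is then cut to 0.0002, and later to 0.00002.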

## Energy Based Models

Energy based learning provides a unified framework for many probabilistic as well as non-probabilistic machine learning algorithms. It can be seen as an alternative to probabilistic models for estimation, classification or decision making. As there is no requirement of proper normalization, the energy-based approach avoids the problem of estimating the normalization constant of a probability distribution.

Energy Based Models (EBMs) measure the compatibility between the input data X and a candidate output Y using the energy function E(Y, X). The basic idea of inference with an energy based model is to minimize the energy function over the set $\mathcal{Y}$ of all possible values of Y, so that we get the correct Y:

$Y^{*} = \underset{Y \in \mathcal{Y}}{\mathrm{argmin}}\; E(Y, X)$

Yann LeCun suggested simple architectures and loss functions for classification problems in A Tutorial on Energy Based Learning [8]. The basic premise is to take a defined architecture with some inference algorithm and a loss function, where the loss function is chosen or designed such that it pulls down the energy of the correct value of Y and pulls up the energies of incorrect values of Y, creating an energy gap between the correct label and the other values. For multi-class classification, Yann LeCun described a simple model architecture, shown in Fig 8 & 9.

Architecture for multi-class classification

According to the above architecture, we can design any model that has the described energy function E(W, Y, X), where Y = G(W, X) is the output of the designed inference algorithm.

Simple architecture that can be trained with energy loss function

According to LeCun et al (2006) [8] and LeCun and Huang (2005) [9], the simplest and most straightforward loss function is the energy loss (L1 norm).

$E(W, Y^{i}, X^{i}) = \left\| G(W, X^{i}) - Y^{i} \right\|$

$L(W, Y, X) = \sum_{i} E(W, Y^{i}, X^{i}) = \sum_{i} \left\| G(W, X^{i}) - Y^{i} \right\|$

Using these two architectures, two models were built with the same inference algorithm, i.e. a simple neural network that predicts Y as G(W, X), but with different energy functions and loss functions.

1) EBM-NN-1: In this model, the second architecture is used with the neural network shown in Fig 10. The L1 norm loss (i.e. the energy loss) was used as the loss function. For optimization, the Adam optimizer with a learning rate of 0.001 was used to train the network.

2) EBM-NN-2: In this model, the same neural network (Fig 10) was used with the energy function $E(W, Y, X) = \sum_{j=1}^{N} \delta(Y - j)\, g_{j}$, where $g_{j}$ is the j-th value of the output vector G(W, X) and N is the number of classes. To train the network, the hinge loss was used as the loss function; it is one of the generalized margin losses, which create an energy gap between the correct label and the incorrect labels.

$L_{\mathrm{hinge}}(W, Y^{i}, X^{i}) = \max\left(0,\; m + E(W, Y^{i}, X^{i}) - E(W, \bar{Y}^{i}, X^{i})\right)$

Here $\bar{Y}^{i}$ is the most offending incorrect answer and m is the positive margin, usually taken as 1.0. For training the model, the Adam optimizer with a learning rate of 0.001 and m = 1.0 was used.
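The energy-based inference rule and the two losses can be illustrated with a few lines of plain Python. This is a sketch under assumptions: the network output `g` is a made-up vector for one sample, labels are indexed from 0 here (rather than the class ids 1 to 4), and the helper names are hypothetical.

```python
import math

def energy(g, y):
    """E(W, Y, X) = sum_j delta(Y - j) * g_j, i.e. the y-th network output."""
    return g[y]

def infer(g):
    """Inference: pick the label with the lowest energy."""
    return min(range(len(g)), key=lambda y: energy(g, y))

def energy_loss_l1(g, target):
    """L1 energy loss ||G(W, X) - Y|| for one sample with a vector target."""
    return sum(abs(a - b) for a, b in zip(g, target))

def hinge_loss(g, y_true, margin=1.0):
    """max(0, m + E(correct) - E(most offending incorrect answer))."""
    e_correct = energy(g, y_true)
    # The most offending incorrect answer has the lowest energy
    # among all wrong labels.
    e_offending = min(energy(g, y) for y in range(len(g)) if y != y_true)
    return max(0.0, margin + e_correct - e_offending)

g = [2.0, 0.3, 1.5, 0.9]   # hypothetical network outputs for four classes

predicted = infer(g)                       # label with the lowest energy
loss_if_correct = hinge_loss(g, y_true=1)  # small: gap below the margin
loss_if_wrong = hinge_loss(g, y_true=0)    # large: correct energy is high
l1 = energy_loss_l1([0.1, 0.9, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0])
```

The hinge loss only becomes zero once the energy of the correct label is lower than that of the most offending incorrect label by at least the margin m.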

Neural Network used as Inference Algorithm

## Majority Vote Ensemble Classifier

Model ensembling represents the family of techniques that reduces the classification errors. Here we have discussed and implemented Majority Vote Ensemble which says if there is a collection of well performing models then majority voting works efficiently. One rule to select the models to build an ensemble is to take the models which are less correlated to each other for improved performance.

So, we took several combinations of the models in bundles of three at a time. The combination that gave the highest improvement was Maximum Likelihood Classifier, Support Vector Machine and XGBoost Classifier: the highest overall accuracy among the three was 91.7% and the lowest was 88.9% (Table 2), whereas their combination gave an overall accuracy of 95.7% (Table 2), i.e. an increase of 4%. The correlation between these models was only between 0.64 and 0.66. This model is named Majority Vote Ensemble-1 (MVE-1).

Another combination, of Maximum Likelihood Classifier, Multi-layer Perceptron and EBM-NN-2, did not increase the accuracy as much but showed the maximum overall accuracy among all the classifiers, i.e. 96.1% (Table 2). This combination saw only a 0.4% increase over its best member, EBM-NN-2 (95.7%), as the correlation values between its models ranged from 0.66 to 0.79. This model is referred to as Majority Vote Ensemble-2 (MVE-2) in this document.
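A majority vote over three classifiers can be sketched as follows. The per-pixel predictions are made up, and the tie-breaking rule (falling back to one chosen model when all models disagree) is an assumption, since the handling of ties is not specified here.

```python
from collections import Counter

def majority_vote(predictions, tie_breaker=0):
    """Combine per-model label lists into one list of majority labels.

    predictions: list of equally long label lists, one per model.
    tie_breaker: index of the model whose label is used when there is
                 no unique majority (an assumption made for this sketch).
    """
    ensemble = []
    for votes in zip(*predictions):
        counts = Counter(votes).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            ensemble.append(votes[tie_breaker])  # no unique majority
        else:
            ensemble.append(counts[0][0])
    return ensemble

# Hypothetical class labels predicted by three models for five pixels.
mlc_pred = [1, 2, 3, 4, 1]
svm_pred = [1, 2, 4, 4, 2]
xgb_pred = [2, 2, 4, 1, 3]

ensemble_pred = majority_vote([mlc_pred, svm_pred, xgb_pred])
```

The vote only helps when the models make different mistakes, which is why weakly correlated models are preferred for the ensemble.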

## Classification Assessment and Discussion

The assessment of the classification results for every classification algorithm was done using confusion matrices, from which user’s accuracy (UA), producer’s accuracy (PA), overall accuracy (OA) and kappa value were calculated.
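These measures can be computed from a confusion matrix as sketched below. The layout (rows are classified labels, columns are reference labels) is an assumption of this example, and the 2 x 2 matrix is made up; only the overall kappa is computed, not the per-class kappa values.

```python
import math

def accuracy_metrics(cm):
    """Compute UA, PA, OA and kappa from a square confusion matrix.

    cm[i][j] = number of samples classified as class i whose reference
    label is class j (rows: classified, columns: reference).
    """
    k = len(cm)
    n = sum(sum(row) for row in cm)
    row_totals = [sum(cm[i]) for i in range(k)]
    col_totals = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    diagonal = [cm[i][i] for i in range(k)]

    ua = [diagonal[i] / row_totals[i] for i in range(k)]  # user's accuracy
    pa = [diagonal[i] / col_totals[i] for i in range(k)]  # producer's accuracy
    oa = sum(diagonal) / n                                # overall accuracy
    # Agreement expected by chance from the marginal totals.
    pe = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
    kappa = (oa - pe) / (1 - pe)
    return ua, pa, oa, kappa

# Hypothetical two-class confusion matrix.
cm = [[45, 5],
      [10, 40]]
ua, pa, oa, kappa = accuracy_metrics(cm)
```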

According to Table 2, the overall accuracy of all the classifiers is above 91% except for Random Forest and XGBoost, i.e. both tree-based ensemble classifiers gave lower accuracy than the others. Among the individual classifiers, the result obtained by EBM-NN-2 was the most accurate, with 95.7% overall accuracy and a kappa value of 0.942, followed by the SDAE classifier with 95.3% overall accuracy and 0.937 kappa, and EBM-NN-1 with 94.5% overall accuracy and 0.927 kappa. Neural network based algorithms thus gave more promising results than tree-based algorithms. MLP, which is also a neural network based algorithm, gave a high accuracy of 92.5%. SVM rendered 91.7%, which is close to the accuracy of the neural network approaches. MLC, with its probabilistic approach, also gave a high accuracy of 91.3%. Thus, it can be concluded that neural nets, SVM and MLC outperformed the tree-based classifiers.

As MLC showed the best kappa value for class 4 (open area), it was included in the ensemble models. Thus, we got two ensemble models: MVE-1, with a high improvement (a minimum of 4% in overall accuracy), and MVE-2, which has a smaller improvement (a minimum of 0.4% in overall accuracy) but gave the highest overall accuracy of 96.1%.

Now, if we look for the best kappa values in Table 2: for class 1, SDAE gave the best value of 1.0; for class 2, SVM and MVE-1 gave the highest value of 0.997, whereas MVE-2 and MLC gave the second highest value of 0.996. For class 3, MLP and EBM-NN-1 rendered the highest value of 0.991, whereas MVE-2 and EBM-NN-2 gave the second highest value of 0.988; for class 4, MLC showed the highest value of 0.875, and MVE-2 gave the second highest value of 0.834. Thus, MVE-2 is the overall best model achieved, as it has the second highest kappa value for each class except class 1.

Overall Summary of result of every classification algorithm used.

| Metric | MLC | RFC | MLP | SVM | XGBoost | SDAE | EBM-NN-1 | EBM-NN-2 | MVE-1 | MVE-2 |
|---|---|---|---|---|---|---|---|---|---|---|
| OA | 0.913 | 0.898 | 0.925 | 0.917 | 0.889 | 0.953 | 0.945 | 0.957 | 0.957 | 0.961 |
| Kappa | 0.884 | 0.863 | 0.899 | 0.889 | 0.852 | 0.937 | 0.927 | 0.942 | 0.943 | 0.947 |
| Class 1 UA | 0.956 | 0.955 | 0.968 | 0.994 | 0.967 | 1.000 | 0.997 | 0.997 | 0.988 | 0.990 |
| Class 1 PA | 0.885 | 0.859 | 0.798 | 0.746 | 0.847 | 0.862 | 0.847 | 0.855 | 0.861 | 0.861 |
| Class 1 Kappa | 0.943 | 0.940 | 0.957 | 0.991 | 0.956 | 1.000 | 0.996 | 0.996 | 0.983 | 0.987 |
| Class 2 UA | 0.997 | 0.991 | 0.964 | 0.998 | 0.999 | 0.994 | 0.995 | 0.995 | 0.998 | 0.997 |
| Class 2 PA | 0.985 | 0.929 | 0.966 | 0.992 | 0.781 | 0.973 | 0.948 | 0.989 | 0.981 | 0.989 |
| Class 2 Kappa | 0.996 | 0.988 | 0.953 | 0.997 | 0.999 | 0.992 | 0.993 | 0.994 | 0.997 | 0.996 |
| Class 3 UA | 0.810 | 0.986 | 0.993 | 0.987 | 0.987 | 0.990 | 0.993 | 0.991 | 0.987 | 0.991 |
| Class 3 PA | 1.000 | 0.874 | 1.000 | 0.978 | 0.996 | 0.980 | 0.993 | 0.988 | 0.996 | 1.000 |
| Class 3 Kappa | 0.753 | 0.980 | 0.991 | 0.982 | 0.982 | 0.987 | 0.991 | 0.988 | 0.982 | 0.988 |
| Class 4 UA | 0.913 | 0.678 | 0.784 | 0.712 | 0.628 | 0.841 | 0.811 | 0.855 | 0.866 | 0.874 |
| Class 4 PA | 0.813 | 0.944 | 0.935 | 0.991 | 0.960 | 0.998 | 0.996 | 0.997 | 0.988 | 0.989 |
| Class 4 Kappa | 0.875 | 0.601 | 0.721 | 0.643 | 0.548 | 0.794 | 0.758 | 0.812 | 0.824 | 0.834 |

## Percentage Land Cover Assessment

Percentage land cover is a key statistic derived from the classified map produced by each classifier. It can be used to assess how the study area is distributed among the different classes.

From the classified outputs given in Sec. 4.3, it can be seen that the urban area of Bangalore is compact: most buildings, pavements and roads are in the center of the city, and the outskirts have been progressively expanding along the roads, which can be seen very clearly. Vegetation is scarce in the center of the city. The statistics in Table 3 show that building and vegetation cover mostly differ by 5-9%, which can be improved if initiatives are taken in the future. Water body cover is the lowest among the four land cover categories. Water bodies are present only in specific areas in the form of lakes or swimming pools, i.e. natural water bodies are very scarce in the city area (the maximum percentage cover was only 1.68%). As such, Bangalore has a very serious water shortage problem, and water tankers are used for water supply. According to Table 3, the city consists mostly of open area, but only outside the main city, i.e. Bangalore is growing in the form of one big cluster. Any new roads, buildings or offices can only be planned outside the city, as there is not much space left inside it, as observed in the classified output.

Table 3: Percentage land cover (% cover) for the different classified results.

| Class Id | MLC | RFC | MLP | SVM | XGBoost | SDAE | EBM-NN-1 | EBM-NN-2 | MVE-1 | MVE-2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 21.15 | 26.18 | 29.77 | 29.69 | 30.74 | 30.43 | 30.76 | 31.40 | 29.74 | 31.01 |
| 2 | 20.11 | 17.22 | 20.32 | 21.15 | 21.69 | 23.91 | 26.32 | 26.96 | 21.77 | 20.91 |
| 3 | 0.83 | 1.57 | 1.35 | 1.46 | 1.42 | 1.61 | 1.68 | 1.61 | 1.36 | 1.34 |
| 4 | 57.91 | 55.03 | 48.55 | 47.69 | 46.15 | 44.04 | 41.24 | 40.03 | 47.13 | 46.74 |

Percentage Land Cover Bar Graph for different classifiers
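Percentage cover values of the kind reported above can be obtained by counting pixels per class in a classified raster. The following is a minimal sketch, assuming the classified map is already loaded as a 2-D NumPy array of class ids 1 to 4 and that 0 marks pixels outside the study area. The function name, the nodata value and the toy array are assumptions made for illustration.

```python
import numpy as np


def percent_cover(classified, class_ids=(1, 2, 3, 4), nodata=0):
    """Percentage of valid pixels assigned to each class id."""
    classified = np.asarray(classified)
    valid = classified[classified != nodata]  # drop pixels outside the study area
    total = valid.size
    return {c: 100.0 * np.count_nonzero(valid == c) / total for c in class_ids}


# Toy classified map (hypothetical values); 0 marks a nodata pixel
toy_map = np.array([
    [1, 1, 4, 4],
    [2, 3, 4, 4],
    [0, 1, 4, 2],
])

cover = percent_cover(toy_map)
for class_id, pct in cover.items():
    print(f"class {class_id}: {pct:.2f} %")
```

Because all pixels have the same ground area, multiplying each pixel count by the pixel area (30 m x 30 m for Landsat-8 OLI multispectral bands) would give the class area instead of a percentage.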

## Classified Results

[Figures: classified maps produced by the Maximum Likelihood Classifier, Random Forest Classifier, Multi-Layer Perceptron, Support Vector Machine, XGBoost Classifier, SDAE Classifier, EBM-NN-1, EBM-NN-2, MVE-1 and MVE-2, together with the class legend.]

## CONCLUSION

A comparative evaluation of different machine learning algorithms for land cover classification was carried out using user’s, producer’s and overall accuracy.

Among the individual classifiers, the highest overall accuracies were 95.7% and 95.3%, from EBM-NN-2 and SDAE respectively. For class 1, SDAE had the maximum kappa value of 1 with 100% user’s accuracy; for class 2, XGBoost showed the maximum kappa value of 0.999 with 99.9% user’s accuracy. For class 3, both EBM-NN-1 and MLP gave the maximum kappa value of 0.991 with 99.3% user’s accuracy; for class 4, MLC showed the maximum kappa value of 0.875 with 91.3% user’s accuracy.

Thus, every classifier responds differently to each class and gives a different accuracy. The models can be compared class by class to identify the classifier that labels a particular class most reliably. An attempt was made to build an ensemble model by combining the best model for class 4 (i.e. MLC) with other models to form the overall best classification model; of these, MVE-2 gave an overall accuracy of 96.1%.

The land cover assessment suggests that Bangalore is currently far from being a smart city despite the technology available. The city lacks an adequate road system and suffers from improper water supply, traffic issues and unsustainable development. The local government should devise and implement proper policies and guidelines, and take precautionary measures during the development of the city.

## ACKNOWLEDGEMENTS

My utmost gratitude and thanks go to Prof. Uttam Kumar for encouraging and guiding me throughout the project. I would also like to thank the Indian Academy of Sciences for giving me this wonderful opportunity and for providing a good platform, AuthorCafe, which made documenting the project a great experience.
