Evaluation of different machine learning algorithms for multispectral satellite image classification
Abstract
Keywords: remote sensing, machine learning, GIS, spatial, spectral, algorithms
Abbreviations
DT  Decision Tree 
RF  Random Forest 
SVM  Support Vector Machine 
MLC  Maximum Likelihood Classifier 
CART  Classification and Regression Trees 
ROC  Receiver Operating Characteristics 
GEOBIA  Geographic Object-Based Image Analysis 
UA  User’s Accuracy 
PA  Producer’s Accuracy 
OA  Overall Accuracy 
INTRODUCTION
Background
Statement of the Problems
Previous studies have shown that machine learning algorithms are often more useful and accurate than traditional classification techniques, especially when the feature space is complex. A wide range of machine learning algorithms has been used for land cover and land use classification, from unsupervised learning to supervised methods such as Maximum Likelihood, Random Forest, SVM, XGBoost, autoencoders, energy-based models and the multilayer perceptron. The accuracy of the classified maps can vary with the choice of algorithm. Selecting the best-performing algorithm is therefore an essential task, and the outcome also depends on the training data, preprocessing and other auxiliary variables.
Objectives of the Research
 To evaluate different machine learning algorithms for multispectral satellite image (Landsat 8 OLI) classification based on their user’s, producer’s, overall accuracy and kappa statistics.
 To develop an ensemble method using combination of strong classifiers.
Scope
This research focuses on the efficacy of machine learning algorithms for creating a land use map of Bangalore City for the year 2018. The four land use classes considered in this study are urban area, vegetation, water and open area.
The techniques developed in this study can be extended to historical data analysis (e.g. 2014, 2009, 2004, 2000). Such time-series information can be used to predict city growth in the near future and can aid in formulating the policies, city development guidelines and strategies needed to achieve smart city goals.
LITERATURE REVIEW
Information
Several studies have explored land cover / land use analysis of remotely sensed data using different classification algorithms. Otukei and Blaschke ^{[1]} performed land cover mapping and land cover change assessment from 1986 to 2001 in Pallisa District, Uganda using DTs, SVMs and MLC. They analysed the use of data mining to find the appropriate bands for classification and decision thresholds, and assessed the performance of the classification algorithms. The study concluded that land cover dynamics were occurring at an unpredicted rate.
Shao and Lunetta ^{[2]} compared SVMs, CART and neural networks for land cover classification using limited training data points. They showed that MODIS (Moderate Resolution Imaging Spectroradiometer) time-series data can increase the number of features available for classification. Varying the training data size showed that SVM was more accurate than neural networks and CART, with the accuracy difference increasing for small training sizes. The overall accuracies differed for homogeneous and heterogeneous sub-pixel cover, while indicating high potential for regional-scale operational land cover characterization.
Rodriguez-Galiano and Chica-Rivas ^{[3]} evaluated SVM, DTs, ANN and RF for land cover mapping of a Mediterranean area from multi-seasonal satellite data, comparing them on user’s, producer’s and overall accuracies, kappa statistics, noise sensitivity and Z score.
Chen et al. ^{[4]} used stacked denoising autoencoders for land cover classification and feature extraction from hyperspectral satellite images. They built a stacked autoencoder as a deep learning method to extract important features and used the pretrained encoder to train the classifier that produces the classified result.
Kanita Tangthaikwan ^{[5]} experimented with multiclass SVMs with an RBF kernel for classifying spatial data from multispectral scanner (MSS) satellite data to identify areas of land use. In this study, pixel-based classification was performed according to the spectral values of the data, and the sigma of the RBF kernel was varied to obtain higher accuracy while comparing SVM with MLP and PCA.
Apurva ^{[6]} performed geographical area mapping and classification using different machine learning algorithms and discussed different ways of using land cover and land use information for sustainable city development, traffic congestion, air pollution, etc. To increase sustainability, land use should be planned so that environment-friendly neighborhoods can be developed.
Land cover mapping at high resolution can be challenging for several reasons, such as large data volume, processing time, computational load, the complexity of developing training, test and validation sets, and heterogeneity in the data. Maxwell ^{[7]} used a combination of GEOBIA, RF machine learning, public imagery and primary data to classify barren land, where mixed classes proved difficult to map with high accuracy as assessed using UA and PA. This research suggests exploring methods that depend solely on open source products to produce land cover maps.
Summary
The review of literature reveals the different factors that have to be considered during land cover / land use mapping, such as the choice of classification algorithm, the training data size, and the selection and use of open source products like R, Python and GRASS GIS. A few studies also discussed the problem of land cover classification from mixed pixels, which is prominent in medium to coarse spatial resolution remotely sensed data.
METHODOLOGY
Data Overview and Preprocessing
A multispectral dataset of Bangalore City (acquired during March 2018) was taken for the present study. It consists of six spectral bands: Blue, Green, Red, Near InfraRed (NIR), Shortwave InfraRed 1 (SWIR 1) and Shortwave InfraRed 2 (SWIR 2). Each band image was of 1660 x 1529 dimension and stored in GeoTIFF (GTiff) format. Training and test data with labelled pixels of four classes were used in the subsequent classification.
During data handling, the six band images were combined into a single image with six channels, i.e. a numpy array of dimension 1660 x 1529 x 6. The numpy array was then converted to a pandas dataframe in Python, and the training pixels were separated from the other pixels in that dataframe. Thus, the dataframe contained the pixel information for training the classification models as well as for predicting the output results.
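The band stacking and pixel separation described above can be sketched roughly as follows. This is a minimal illustration, not the study's code: the band arrays are small random stand-ins for the 1660 x 1529 rasters, the label raster and its "0 = unlabelled" convention are assumptions, and the pandas step is left out for brevity (wrapping `pixels` in a dataframe would give the tabular form mentioned in the text).

```python
import numpy as np

# Hypothetical stand-ins for the six band rasters
# (Blue, Green, Red, NIR, SWIR 1, SWIR 2). The study's rasters are
# 1660 x 1529; a tiny size is used here so the sketch runs instantly.
rows, cols = 4, 5
rng = np.random.default_rng(42)
bands = [rng.integers(0, 65535, size=(rows, cols)) for _ in range(6)]

# Stack the single-band images into one (rows, cols, 6) array.
image = np.stack(bands, axis=-1)

# Flatten to a pixel table: one row per pixel, one column per band.
pixels = image.reshape(-1, image.shape[-1])

# Hypothetical label raster: 0 = unlabelled, 1..4 = class ids.
labels = np.zeros((rows, cols), dtype=int)
labels[0, :2] = 1
labels[1, 3] = 3
flat_labels = labels.ravel()

# Labelled pixels train the models; all pixels are later classified.
train_mask = flat_labels > 0
X_train, y_train = pixels[train_mask], flat_labels[train_mask]
```

After training, a prediction vector over `pixels` can be reshaped back to `(rows, cols)` to obtain the classified map.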
Class Id  Class Name  Legend 
1  Urban / builtup / concrete / houses / buildings / roads / pavements / walk ways  Red 
2  Vegetation / Parks / Trees  Green 
3  Water / Lakes  Blue 
4  Open area / barren land / vacant land  Pink 
Software and Library
GRASS (Geographic Resources Analysis Support System) is a free and open source Geographic Information System (GIS) software suite used for geospatial data management and analysis, image processing, graphics and maps production, spatial modeling and visualization. GRASS GIS is currently used in academic and commercial settings around the world, as well as by many governmental agencies and environmental consulting companies. GRASS GIS was used for visualization of remotely sensed data and classified maps. Raster modules were used to analyse raster maps; it uses GDAL Library provided by OSGeo for python developers (Python 3.7.3).
Classification methods
Maximum-likelihood Classifier (MLC)
In this classification model, we assume that the class-conditional probability p(x|y) follows a multivariate normal (Gaussian) distribution. As in logistic regression, we take the log-likelihood and maximize it to estimate the parameters of the distribution (the mean vector and covariance matrix of each class). Maximum likelihood estimation thus solves for those parameters and yields the decision boundary: the i^{th} observation of x is assigned to the class for which it has the highest probability. Because of the normality assumption, each cluster (class) has equiprobable contours that can be drawn around its center, and the probability decreases as we move away from the center. MLC defines a threshold distance in terms of this probability value. Thus, we finally get the decision boundaries for classification.
So, if the input features x of a classification problem are continuous random variables, the Gaussian Discriminant Analysis model (which models p(x|y) as a normal distribution) can be used. GRASS GIS has a module called i.gensig for supervised learning, which reads the raster map layer (the training map) and uses it to develop the spectral signature file that serves as input for the maximum likelihood classifier. The spectral signature file contains the number of points in each class, the mean value per band of the class, and the semi-matrix of band-to-band covariance. The i.maxlik module in GRASS performs maximum likelihood discriminant analysis classification: it uses the class means and covariance matrices from the signature file generated by i.gensig to calculate the statistical distance (i.e. probability value) to each class and thus assigns the class (cluster). In short, i.gensig creates the signature file from the training map, and i.maxlik builds the classifier and gives the classification output.
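To make the decision rule concrete, here is a minimal NumPy sketch of a Gaussian maximum-likelihood classifier. It is not the GRASS implementation: the function names are hypothetical, equal class priors are assumed, and the two-feature toy data stands in for the six-band pixels.

```python
import numpy as np

def fit_signatures(X, y):
    """Estimate a (mean vector, covariance matrix) signature per class."""
    signatures = {}
    for cls in np.unique(y):
        Xc = X[y == cls]
        signatures[int(cls)] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False))
    return signatures

def log_likelihood(X, mean, cov):
    """Gaussian log-density of each row of X (constant term dropped)."""
    diff = X - mean
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    # Squared Mahalanobis distance of every row to the class mean.
    maha = np.einsum("ij,jk,ik->i", diff, inv, diff)
    return -0.5 * (logdet + maha)

def classify(X, signatures):
    """Assign each row to the class with the highest log-likelihood."""
    classes = sorted(signatures)
    scores = np.column_stack(
        [log_likelihood(X, *signatures[c]) for c in classes]
    )
    return np.array(classes)[scores.argmax(axis=1)]

# Toy data: two well-separated Gaussian clusters (assumed, for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(10.0, 1.0, (50, 2))])
y = np.array([1] * 50 + [2] * 50)

signatures = fit_signatures(X, y)
pred = classify(np.array([[0.0, 0.0], [10.0, 10.0]]), signatures)
```

The per-class mean and covariance play the role of the signature file, and the arg-max over log-likelihoods plays the role of the classification step.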
Random Forest Classifier (RFC)
Random Forest, or Random Decision Forest, uses an ensemble learning method to classify.
A Decision Tree is an acyclic graph that can be used to make decisions. In each branching node, a specific feature j of the feature vector is examined. If the value of the feature is below a specific threshold, the left branch is followed; otherwise, the right branch is followed. When a leaf node is reached, the decision is made about the class to which the example belongs.
A decision tree is similar to a rule-based system. Given a training dataset with targets and features, the algorithm comes up with a set of rules. These rules can then be used to perform prediction on the test dataset.
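The traversal described above can be illustrated with a short sketch. The tree, its feature indices and its thresholds below are made up for illustration; a real tree would be learned from the training pixels.

```python
# Hypothetical hand-built tree: internal nodes test one feature against
# a threshold, leaves carry a class label.
tree = {
    "feature": 3, "threshold": 0.30,           # e.g. a NIR-like band
    "left": {"label": "water"},
    "right": {
        "feature": 2, "threshold": 0.20,       # e.g. a Red-like band
        "left": {"label": "vegetation"},
        "right": {"label": "urban"},
    },
}

def predict(node, x):
    """Follow left when the feature is below the threshold, else right."""
    while "label" not in node:
        branch = "left" if x[node["feature"]] < node["threshold"] else "right"
        node = node[branch]
    return node["label"]

samples = [
    [0.1, 0.1, 0.1, 0.05],   # low value in feature 3
    [0.1, 0.1, 0.1, 0.60],   # high feature 3, low feature 2
    [0.3, 0.3, 0.4, 0.50],   # high feature 3, high feature 2
]
predictions = [predict(tree, x) for x in samples]
```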
Decision trees can be combined into an ensemble by bagging or boosting. Bagging draws random subsets of the original training dataset with replacement and trains one tree on each subset. Random Forest is an ensemble classifier that uses bagging (or bootstrap aggregation) and, in addition, a random selection of features at each split.
Sklearn provides the Random Forest Classifier which uses some parameters to tune the model: 
 n_estimators i.e. number of estimators.
 max_features i.e. the number of features to consider when looking for the best split.
 max_depth i.e. maximum allowable depth for a tree.
 max_leaf_nodes i.e. maximum number of leaf nodes allowed in a tree.
 min_impurity_decrease i.e. a node will be split if the split induces a decrease of the impurity greater than or equal to this value.
Here, the grid search algorithm was used to find the best parameters.
Thus, the parameter settings used are: 
 n_estimators = 4
 max_features = sqrt
 max_leaf_nodes= 15
 max_depth= 10
 min_impurity_decrease = 1e-3
 random_state = 42
60% data for training and 40% for validation.
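As a rough sketch, the settings listed above map onto scikit-learn as follows. The synthetic six-feature, four-class data is an assumption standing in for the labelled Landsat pixels, and reading `min_impurity_decrease` as 1e-3 is also an assumption (a value of 1e3 would prevent any split).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labelled pixels: 6 features, 4 classes.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)

# 60% of the data for training and 40% for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.6, random_state=42
)

clf = RandomForestClassifier(
    n_estimators=4,
    max_features="sqrt",
    max_leaf_nodes=15,
    max_depth=10,
    min_impurity_decrease=1e-3,   # assumed reading of the listed value
    random_state=42,
)
clf.fit(X_train, y_train)
val_accuracy = clf.score(X_val, y_val)
```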
Multi-Layer Perceptron (MLP)
A Multi-Layer Perceptron is a class of feedforward artificial neural network.
It consists of three kinds of layers: an input layer, one or more hidden layers and an output layer. The nodes of the hidden and output layers have nonlinear activation functions. Each node of an MLP has weights (w), a bias value (b) and an activation function g(.); for an input x it produces the output g(w.x + b). The nodes are fully connected from one layer to the next, and the weights are updated using the backpropagation algorithm.
Historically, tanh and sigmoid were the two common activation functions, but in deep learning the rectified linear unit (ReLU) is now used more often, as it avoids the numerical problems associated with the sigmoid. Learning in a perceptron occurs by updating the weights after input data is processed, based on the error between the output and the actual result. Weights are typically updated with stochastic gradient descent so that the loss moves toward a minimum; the Adam optimizer, an adaptive learning-rate optimization algorithm, can be used in place of classical stochastic gradient descent. The learning rate (or step size) is a hyperparameter that controls how far each update moves a weight from its old value. Many parameters and functions affect the learning of a neural network, and tuning them is how one arrives at the best model for a given dataset.
Sklearn provides the MLPClassifier class, whose behaviour is determined by its parameter settings. The parameter settings used in our model are: 
 hidden_layer_sizes = (32,16,16,32)
 early_stopping = True
 max_iter = 25
 activation = relu
 solver = adam
 validation_fraction = 0.1
 tol = 1e-4
 learning_rate_init = 1e-3
 nesterovs_momentum = True
 random_state = 42
Here, hidden_layer_sizes takes a tuple giving the number of hidden layers and the number of nodes in each. If early_stopping is True, training stops when the validation score fails to improve by at least the tolerance (tol). The activation function is ReLU and the solver is the Adam optimizer. nesterovs_momentum = True requests Nesterov momentum, a method that speeds up convergence by applying the velocity to the parameters to compute interim parameters, at which the gradient is then calculated; note that scikit-learn applies this setting only with the sgd solver, so it has no effect when the solver is adam. The learning rate was initialized to 0.001; max_iter is the maximum number of iterations allowed for training; validation_fraction is the fraction of the training data set aside for validation. Here 90% was used for training the model and the remaining 10% for validation. The model thus has an input layer of size 6 (one node per band) and an output layer of size 4 (one node per class).
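A minimal sketch of this configuration with scikit-learn's `MLPClassifier` is given below. The synthetic data is an assumption standing in for the six-band pixels, and `tol` and `learning_rate_init` are read as 1e-4 and 1e-3. With `max_iter = 25`, scikit-learn may emit a convergence warning, which is expected for such a short run.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the labelled pixels: 6 bands, 4 classes.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)

mlp = MLPClassifier(
    hidden_layer_sizes=(32, 16, 16, 32),
    early_stopping=True,        # holds out validation_fraction of the data
    validation_fraction=0.1,
    max_iter=25,
    activation="relu",
    solver="adam",
    tol=1e-4,
    learning_rate_init=1e-3,
    nesterovs_momentum=True,    # only takes effect with solver="sgd"
    random_state=42,
)
mlp.fit(X, y)

# Weight matrix shapes: input (6) -> 32 -> 16 -> 16 -> 32 -> output (4).
layer_shapes = [w.shape for w in mlp.coefs_]
```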
Support Vector Machine (SVM)
Support Vector Machines (SVMs) are supervised algorithms used for classification and regression. An SVM creates a hyperplane in N dimensions (N is the number of features) that distinctly separates the data points. Many hyperplanes can separate two classes, but SVM finds the one with the maximum margin, i.e. the maximum distance from the nearest data points of both classes.
SVM uses a kernel to transform the feature space, i.e. every dot product is replaced by the kernel function. Selecting an appropriate kernel is one of the choices to make for an SVM; kernels available for classification include 1) Polynomial 2) RBF (Radial Basis Function) 3) Linear 4) Sigmoid, etc.
SVMs need a number of parameters to be adjusted for optimization: 
 Kernel Function
 Gamma parameter  it defines how far the influence of a single training point reaches. A large value means the influence is confined to nearby points, and a small value means it reaches far.
 Regularization parameter (often symbolized by C in the cost function). For a large value of C a small-margin hyperplane is chosen, whereas for a small value of C a large-margin hyperplane is chosen.
 Decision function  it is used for multiclass SVM. Here, the algorithm builds several binary classifiers that each distinguish one label from the rest. This decision type is called one-vs-rest (ovr) or one-vs-all. The other method is to build binary classifiers between pairs of classes, called one-vs-one (ovo).
 Degree  it is only defined for polynomial kernel.
 Bias on kernel function  it is used only for polynomial and sigmoid kernel.
Sklearn provides SVM classifier and using appropriate parameter setting one can get good model.
The parameter setting used for our model (multiclass SVM Classifier) are:
 C = 1e6 (regularization parameter)
 decision_function_shape = ovo (one-vs-one)
 kernel = poly (Polynomial)
 degree = 1
 tol = 4e-2 (Tolerance criteria for stopping)
 random_state= 42
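The listed settings can be sketched with scikit-learn's `SVC` as follows. The four well-separated blobs are an assumption standing in for the labelled pixels, and reading the tolerance as 4e-2 is also an assumption.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Four well-separated clusters in six features (hypothetical data).
centers = [
    [0, 0, 0, 0, 0, 0],
    [8, 8, 8, 8, 8, 8],
    [0, 8, 0, 8, 0, 8],
    [8, 0, 8, 0, 8, 0],
]
X, y = make_blobs(n_samples=200, centers=centers, cluster_std=1.0,
                  random_state=42)

svm = SVC(
    C=1e6,                          # regularization parameter as listed
    decision_function_shape="ovo",  # one-vs-one
    kernel="poly",
    degree=1,
    tol=4e-2,                       # assumed reading of the listed value
    random_state=42,
)
svm.fit(X, y)

# With one-vs-one and 4 classes there are 4 * 3 / 2 = 6 pairwise scores.
scores = svm.decision_function(X[:5])
train_accuracy = svm.score(X, y)
```

A polynomial kernel of degree 1 behaves like a linear kernel, so this configuration effectively fits linear boundaries between each pair of classes.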
XGBoost Classifier
XGBoost is a fast, optimized implementation of gradient boosted trees that addresses the major inefficiencies of gradient boosting. It provides parallel tree boosting that solves many data science problems in a fast and accurate way, and is an optimized, distributed library designed around the gradient boosting framework. The library exposes several hyperparameters for tuning the model quickly, and its algorithm is designed to resist overfitting.
XGBoost has an Sklearn-compatible classifier called XGBClassifier, whose parameters can be tuned using grid search. Commonly used parameters to tune the model are: 
 n_estimators  it’s the number of subtrees used to build the gradient boosted ensemble tree.
 max_depth  the maximum tree depth each individual tree can grow.
 learning_rate
 reg_alpha  it’s the constant that controls the L1 regularization.
 reg_lambda  it’s the constant that controls the L2 regularization.
 gamma  minimum loss reduction to make further partition.
The parameter setting used for our model are
 n_estimators = 100
 learning_rate = 0.17
 max_delta_step = 1
 min_child_weight = 2
 gamma = 1e-3
 reg_alpha = 1
 reg_lambda = 1
 max_depth = 3
We have used 85% data for training and 15% for validation.
Stacked Denoising AutoEncoder (SDAE) Classifier
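A typical SDAE includes two encoder layers and two decoder layers, although there can be more hidden encoder layers. In the encoding part, the output of one encoder layer serves as the input to the next. Suppose there are L encoder layers; then the l-th layer can be written as
$$Y^{(l)} = g\left(W^{(l)} Y^{(l-1)} + b^{(l)}\right), \quad l = 1, \dots, L$$
Here Y^{(0)} is the input data X, Y^{(L)} is the output of the last encoding layer, which extracts high-level features from X, and W^{(l)} and b^{(l)} denote the weights and bias of layer l. Similarly, in the decoding part, the output of one decoding layer serves as the input to the next. Writing Z^{(l)} for the output of the l-th decoding layer, with Z^{(0)} = Y^{(L)}, this can be written as
$$Z^{(l)} = h\left(W'^{(l)} Z^{(l-1)} + b'^{(l)}\right), \quad l = 1, \dots, L, \qquad \widehat{\mathrm Y} = Z^{(L)}$$
Here $\widehat{\mathrm Y}$ is the reconstructed output of the last decoder layer, and g(.) and h(.) are the activation functions of the encoder and decoder layers. ReLU was used as g(.) and softmax as h(.).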
To train the SDAE, one DAE is trained at a time: the first DAE is trained separately, the output of its encoder part is then used as the input to the next DAE, and so on until all DAEs have been trained one by one.
To build an SDAE classifier, the decoding part of the SDAE is removed and a softmax layer is connected for multiclass classification. First, the softmax layer is trained on the output of the last encoding layer to obtain its weights.
Next, the pretrained SDAE and softmax layer weights are used as the initial weights of the SDAE classifier and are fine-tuned. Our proposed model contains two DAEs forming the SDAE and one softmax layer for classification, which together form the SDAE classifier shown in Fig 4-7.
For training, SGD (Stochastic Gradient Descent) with Nesterov momentum was used for both DAEs and the Nesterov Adam optimizer for the softmax layer. The learning rate was 0.01 for DAE1, 0.1 for DAE2 and 0.002 for the softmax layer. The loss function was mean squared error for the DAEs and cross-entropy for the softmax layer. For the SDAE classifier, the Nesterov Adam optimizer was used with cross-entropy loss and a learning rate of 0.002 (90% of the data for training and 10% for validation). All models were trained for 15 epochs, and the learning rate was reduced by a factor of 0.1 if the validation loss did not change (within a tolerance of 1e-4) for 5 epochs.
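The learning-rate schedule described above (reduce by a factor of 0.1 when the validation loss has not improved for 5 epochs) can be sketched in plain Python. This is only an illustration of the rule, with a hypothetical function name and made-up loss values; the tolerance is read as 1e-4, and deep learning frameworks ship their own implementations of such schedules.

```python
def reduce_on_plateau(val_losses, lr, factor=0.1, patience=5, tol=1e-4):
    """Return the learning rate in effect after each epoch.

    The rate is multiplied by `factor` whenever the validation loss has
    not improved by more than `tol` for `patience` consecutive epochs.
    """
    best = float("inf")
    stale = 0
    history = []
    for loss in val_losses:
        if loss < best - tol:
            best, stale = loss, 0      # improvement: reset the counter
        else:
            stale += 1
            if stale >= patience:
                lr *= factor           # plateau: shrink the step size
                stale = 0
        history.append(lr)
    return history

# Made-up validation losses: improvement for two epochs, then a plateau.
losses = [1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
schedule = reduce_on_plateau(losses, lr=0.01)
```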
Energy Based Models
Energy-based learning provides a unified framework for many probabilistic and non-probabilistic machine learning algorithms. It can be seen as an alternative to probabilistic models for estimation, classification or decision making. As no normalization is required, the energy-based approach avoids the problem of computing the normalization constant of a probability distribution.
Energy Based Models (EBMs) measure the compatibility between the input data X and a predicted value Y using the energy function E(Y, X). The basic idea of inference with an energy-based model is to choose, from the set of all possible answers, the Y that minimizes the energy function.
Yann LeCun suggested simple architectures and loss functions for classification problems in A Tutorial on Energy Based Learning ^{[8]}. The basic premise is to take a defined architecture with an inference algorithm and a loss function, where the loss function is chosen or designed so that it pulls down the energy of the correct value of Y and pulls up the energy of incorrect values, creating an energy gap between the correct label and the other values. For multiclass classification, Yann LeCun described a simple architecture, shown in Fig 8 & 9.
With the above architecture, we can design any model that has a defined energy function E(W, Y, X), where Y = G(W, X) is the output of the designed inference algorithm.
According to LeCun et al (2006) ^{[8]} and LeCun and Huang (2005) ^{[9]}, the simplest and most straightforward loss function is the energy loss (L1 norm).
Using these two architectures, two models were built with the same inference algorithm, a simple neural network that predicts Y, i.e. G(W,X), but with different energy and loss functions.
1) EBMNN1:  In this model, the second architecture with a neural network is used, as shown in Fig 10. The L1 norm loss (i.e. the energy loss) was used as the loss function, and the network was trained with the Adam optimizer at a learning rate of 0.001.
2) EBMNN2:  In this model, the architecture with the same neural network was used, as shown in Fig 10, with the energy function $E(W,Y,X) = \sum_{j}^{N} \delta(Y - j)\, g(j)$, where g(j) is the j-th value of the output vector G(W,X) and $\delta$ is 1 when its argument is zero and 0 otherwise. To train the network, the hinge loss was used, one of the generalized margin losses that create an energy gap between the correct label and the incorrect labels:
$$L_{hinge}(W, Y^{i}, X^{i}) = \max\left(0,\; m + E(W, Y^{i}, X^{i}) - E(W, \bar{Y}^{i}, X^{i})\right)$$
Here $\bar{Y}^{i}$ is the most offending incorrect answer and m is the positive margin, usually taken as 1.0. The model was trained with the Adam optimizer at a learning rate of 0.001 and m = 1.0.
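The energy function, inference rule and hinge loss for the multiclass case can be illustrated with a small pure-Python sketch. It follows the definitions in LeCun et al. ^{[8]}: the energy of label y selects the y-th network output, inference picks the lowest-energy label, and the hinge loss is max(0, m + E(correct) - E(most offending incorrect answer)). The function names, the 0-based labels and the output vector `g` are hypothetical; the network that would produce `g` is omitted.

```python
def energy(g, y):
    """E(W, Y, X) = sum_j delta(y - j) * g[j], i.e. the y-th output."""
    return sum(g_j for j, g_j in enumerate(g) if j == y)

def predict(g):
    """Inference: choose the label with the lowest energy."""
    return min(range(len(g)), key=lambda y: energy(g, y))

def energy_loss(g, y_true):
    """Energy loss: simply the energy of the correct answer."""
    return energy(g, y_true)

def hinge_loss(g, y_true, margin=1.0):
    """Penalise a correct-label energy that is not at least `margin`
    below the energy of the most offending incorrect answer."""
    incorrect = [y for y in range(len(g)) if y != y_true]
    most_offending = min(incorrect, key=lambda y: energy(g, y))
    return max(0.0, margin + energy(g, y_true) - energy(g, most_offending))

# Hypothetical network output for one pixel (one energy per class).
g = [0.2, 1.5, 3.0, 2.0]
best_label = predict(g)
loss_when_correct_is_0 = hinge_loss(g, y_true=0)
loss_when_correct_is_1 = hinge_loss(g, y_true=1)
```

When the correct label already has the lowest energy by more than the margin, the hinge loss is zero; otherwise it grows with the size of the violation.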
Majority Vote Ensemble Classifier
Model ensembling represents a family of techniques that reduce classification errors. Here we discuss and implement a Majority Vote Ensemble, which works efficiently when a collection of well-performing models is available. One rule for selecting the models of an ensemble is to take models that are less correlated with each other, for improved performance.
So, we tried several combinations of the models, three at a time. The combination that gave the highest improvement was the Maximum Likelihood Classifier, Support Vector Machine and XGBoost Classifier: the highest accuracy among the three was 91.7% and the lowest was 88.9% (Table 2), while their combination gave an overall accuracy of 95.7% (Table 2), i.e. an increase of 4%. The correlation between these models was only between 0.64 and 0.66. This model is named Majority Vote Ensemble 1 (MVE1).
Another combination, of the Maximum Likelihood Classifier, Multilayer Perceptron and EBMNN2, did not increase the accuracy much but showed the maximum overall accuracy among all the classifiers, 96.1% (Table 2). This is only a 0.4% increase over its best member, EBMNN2 (95.7%), as the correlation between the member models ranged from 0.66 to 0.79. This model is referred to as Majority Vote Ensemble 2 (MVE2) in this document.
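A majority vote over three classifiers can be sketched in a few lines. The function name and the example predictions are hypothetical, and the tie-breaking rule (fall back to the first model when all three disagree) is an assumption, since the text does not state how ties were resolved.

```python
from collections import Counter

def majority_vote(*model_predictions):
    """Combine per-pixel predictions from several models by voting.

    If no label gets more than one vote (all models disagree), the
    first model's prediction is used; this tie rule is an assumption.
    """
    voted = []
    for votes in zip(*model_predictions):
        label, count = Counter(votes).most_common(1)[0]
        voted.append(label if count > 1 else votes[0])
    return voted

# Hypothetical class ids predicted for four pixels by three models.
model_a = [1, 2, 3, 4]
model_b = [1, 2, 4, 1]
model_c = [2, 2, 4, 3]
ensemble = majority_vote(model_a, model_b, model_c)
```

Placing the model that is most trusted first in the argument list makes the tie rule favour it.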
RESULTS AND DISCUSSION
Classification Assessment and Discussion
The assessment of the classification results for every classification algorithm was done using confusion matrices, from which user’s accuracy (UA), producer’s accuracy (PA), overall accuracy (OA) and kappa value were calculated.
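These measures can be computed from a confusion matrix as in the sketch below. The convention that rows hold the classified (map) labels and columns hold the reference labels is an assumption, as are the function name and the example matrix; only the overall kappa is computed, not the per-class kappa values reported in the tables.

```python
def accuracy_report(cm):
    """Return (UA, PA, OA, kappa) from a square confusion matrix.

    cm[i][j] = number of pixels classified as class i whose reference
    class is j (rows: classified map, columns: reference data).
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    row_tot = [sum(cm[i]) for i in range(n)]
    col_tot = [sum(cm[i][j] for i in range(n)) for j in range(n)]
    diag = [cm[i][i] for i in range(n)]

    ua = [diag[i] / row_tot[i] for i in range(n)]   # user's accuracy
    pa = [diag[i] / col_tot[i] for i in range(n)]   # producer's accuracy
    oa = sum(diag) / total                          # overall accuracy
    # Agreement expected by chance, from the marginal totals.
    pe = sum(row_tot[i] * col_tot[i] for i in range(n)) / total ** 2
    kappa = (oa - pe) / (1 - pe)
    return ua, pa, oa, kappa

# Hypothetical two-class confusion matrix.
cm = [[45, 5],
      [5, 45]]
ua, pa, oa, kappa = accuracy_report(cm)
```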
According to Table 2, the overall accuracy of every classifier is above 91% except for Random Forest and XGBoost, i.e. both tree-based ensemble classifiers were less accurate than the others. The EBMNN2 result was the most accurate among the individual classifiers, with 95.7% overall accuracy and a kappa of 0.942, followed by the SDAE classifier with 95.3% and 0.937, and EBMNN1 with 94.5% and 0.927. The neural-network-structured algorithms thus gave more promising results than the tree-based algorithms. MLP, also a neural-network-structured algorithm, gave a high accuracy of 92.5%. SVM gave 91.7%, close to the accuracies of the neural network approaches, and MLC, with its probabilistic approach, gave 91.3%. Thus, it can be concluded that neural networks, SVM and MLC outperformed the tree-based classifiers.
As MLC showed the best kappa value for class 4 (open area), it was included in the ensemble models. Thus, we obtained two ensemble models: MVE1, with a high improvement (a minimum of 4% in overall accuracy), and MVE2, which shows little improvement (a minimum of 0.4% in overall accuracy) but gives the highest overall accuracy of 96.1%.
Looking at the best per-class kappa values in Table 2: for class 1, SDAE gave the best value of 1.0. For class 2, XGBoost gave the highest value of 0.999 (although with a producer’s accuracy of only 0.781), SVM and MVE1 followed with 0.997, and MVE2 and MLC with 0.996. For class 3, MLP and EBMNN1 gave the highest value of 0.991, whereas MVE2 and EBMNN2 gave the second highest value of 0.988. For class 4, MLC showed the highest value of 0.875, and MVE2 gave the second highest value of 0.834. Thus, MVE2 is the overall best model achieved, as its kappa values are at or near the second highest for every class except class 1.
Classifiers  MLC  RFC  MLP  SVM  XGBoost  SDAE  EBMNN1  EBMNN2  MVE1  MVE2  
OA  0.913  0.898  0.925  0.917  0.889  0.953  0.945  0.957  0.957  0.961  
Kappa  0.884  0.863  0.899  0.889  0.852  0.937  0.927  0.942  0.943  0.947  
Class Id  UA  PA  Kappa  UA  PA  Kappa  UA  PA  Kappa  UA  PA  Kappa  UA  PA  Kappa  UA  PA  Kappa  UA  PA  Kappa  UA  PA  Kappa  UA  PA  Kappa  UA  PA  Kappa 
1  0.956  0.885  0.943  0.955  0.859  0.940  0.968  0.798  0.957  0.994  0.746  0.991  0.967  0.847  0.956  1.000  0.862  1.000  0.997  0.847  0.996  0.997  0.855  0.996  0.988  0.861  0.983  0.990  0.861  0.987 
2  0.997  0.985  0.996  0.991  0.929  0.988  0.964  0.966  0.953  0.998  0.992  0.997  0.999  0.781  0.999  0.994  0.973  0.992  0.995  0.948  0.993  0.995  0.989  0.994  0.998  0.981  0.997  0.997  0.989  0.996 
3  0.810  1.000  0.753  0.986  0.874  0.980  0.993  1.000  0.991  0.987  0.978  0.982  0.987  0.996  0.982  0.990  0.980  0.987  0.993  0.993  0.991  0.991  0.988  0.988  0.987  0.996  0.982  0.991  1.000  0.988 
4  0.913  0.813  0.875  0.678  0.944  0.601  0.784  0.935  0.721  0.712  0.991  0.643  0.628  0.960  0.548  0.841  0.998  0.794  0.811  0.996  0.758  0.855  0.997  0.812  0.866  0.988  0.824  0.874  0.989  0.834 
Percentage Land Cover Assessment
Percent land cover is the main aspect of the classified map obtained from the classifier output. It can be used to assess the nature of the map, i.e. how the given area is covered by the different classes.
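Percent cover is simply the share of pixels assigned to each class. A minimal sketch, with a hypothetical function name and a tiny made-up classified map in place of the 1660 x 1529 output:

```python
from collections import Counter

def percent_cover(class_map):
    """Percentage of pixels per class id in a 2-D classified map."""
    counts = Counter(label for row in class_map for label in row)
    total = sum(counts.values())
    return {label: 100.0 * count / total
            for label, count in sorted(counts.items())}

# Hypothetical 2 x 4 classified map with class ids 1..4.
classified = [
    [1, 1, 4, 4],
    [2, 4, 4, 3],
]
cover = percent_cover(classified)
```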
From the classified outputs given in Sec. 4.3, it can be seen that the urban area of Bangalore is compact: most buildings, pavements and roads are in the center of the city, and the edges have been progressively expanding along the roads, which can be seen very clearly. Vegetation is scarce in the center of the city. The statistics in Table 3 show that buildings and vegetation cover mainly differ by 59%, which can be improved if initiatives are taken in future. Water body cover is the least among the four land cover categories. Water is present only in specific areas in the form of lakes or swimming pools, i.e. natural water bodies are very few in the city area (the maximum percentage cover was only 1.68%). As such, Bangalore has a very serious water shortage problem, and water tankers are used for water supply. Table 3 also shows that the city has mostly open area, but only outside the main city, i.e. Bangalore is growing in the form of one big cluster. Any planned road, building or office can only be sited outside the city, as there is not much space inside it, as observed in the classified output.
Classifiers  MLC  RFC  MLP  SVM  XGBoost  SDAE  EBMNN1  EBMNN2  MVE1  MVE2 
Class Id  % Cover  % Cover  % Cover  % Cover  % Cover  % Cover  % Cover  % Cover  % Cover  % Cover 
1  21.15  26.18  29.77  29.69  30.74  30.43  30.76  31.40  29.74  31.01 
2  20.11  17.22  20.32  21.15  21.69  23.91  26.32  26.96  21.77  20.91 
3  0.83  1.57  1.35  1.46  1.42  1.61  1.68  1.61  1.36  1.34 
4  57.91  55.03  48.55  47.69  46.15  44.04  41.24  40.03  47.13  46.74 
Classified Results
CONCLUSION
The comparative evaluation of different machine learning algorithms for land cover classification using user’s, producer’s and overall accuracy was carried out.
The greatest overall accuracies achieved by individual classifiers are 95.7% and 95.3%, from EBMNN2 and the SDAE classifier respectively. For class 1, SDAE has the maximum kappa value of 1 with 100% user’s accuracy; for class 2, XGBoost showed the maximum kappa value of 0.999 with 99.9% user’s accuracy. For class 3, both EBMNN1 and MLP gave the maximum kappa value of 0.991 with 99.3% user’s accuracy; for class 4, MLC showed the maximum kappa value of 0.875 with 91.3% user’s accuracy.
Thus, every classifier responds differently to each class and gives a different accuracy. One can compare the models class by class and identify the classifier that gives correct labels for a particular class. An attempt was made to propose an ensemble model by combining the best model for class 4 (i.e. MLC) with other models to form the overall best classification model, where MVE2 gave 96.1% accuracy.
From the land cover assessment, Bangalore city is currently not close to being a smart city despite the available technology. There is a lack of road systems, improper water supply, traffic issues and unsustainable development. The local government should devise and implement proper policies and guidelines and take precautionary measures during the development of the city.
ACKNOWLEDGEMENTS
Utmost gratitude and thanks to Prof. Uttam Kumar for encouraging and guiding me throughout the project. I would also like to thank the Indian Academy of Sciences for giving me this wonderful opportunity and for providing a good platform like AuthorCafe for the project documentation.
References

[1] J.R. Otukei & T. Blaschke, “Land cover change assessment using decision trees, support vector machines and maximum likelihood classification algorithms”, International Journal of Applied Earth Observation and Geoinformation, 12S (2010) S27–S31.
[2] Y. Shao & R. S. Lunetta, “Comparison of support vector machine, neural network, and CART algorithms for the land-cover classification using limited training data points”, ISPRS Journal of Photogrammetry and Remote Sensing 70, 2012, 78–87.
[3] V. F. Rodriguez-Galiano & M. Chica-Rivas, “Evaluation of different machine learning methods for land cover mapping of a Mediterranean area using multi-seasonal Landsat images and Digital Terrain Models”, International Journal of Digital Earth, 2012, 118, iFirst article.
[4] Chen Xing et al, “Stacked Denoise Autoencoder Based Feature Extraction and Classification for Hyperspectral Images”, Journal of Sensors, 2016, 110.
[5] K. Tangthaikwan et al, “Multiclass support vector machine for classification spatial data from satellite image”, 2017 9th International Conference on Knowledge and Smart Technology (KST), Chonburi, 2017, 111–115.
[6] A. Saksena et al, “Geographical Area Mapping and Classification Utilizing Multispectral Satellite Imagery Processing Based on Machine Learning Algorithms Classifying Land based on its use for different purposes”, 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, 2018, 1065–1070.
[7] Aaron E. Maxwell et al, “Large-Area, High Spatial Resolution Land Cover Mapping Using Random Forests, GEOBIA, and NAIP Orthophotography: Findings and Recommendations”, Remote Sens. 2019, 11(12), 1409.
[8] LeCun et al., “A Tutorial on Energy Based Learning”, January 2006, Predicting Structured Data, 2006, MIT Press: 819, 2329.
[9] LeCun & Huang, “Loss Functions for Discriminative Training of Energy Based Models”, January 2005, Proc. of the 10th International Workshop on Artificial Intelligence and Statistics (AIStats'05): 25.