# Selection of Regression Models Using Genetic Algorithms

Indian Statistical Institute, Kolkata 700108

Guided by:

Partha P. Majumder

Distinguished Professor, National Institute of Biomedical Genomics, Kalyani 741251

## Abstract

Statistical methods are widely used to solve various problems of genetics and genomics. One such use is to build a predictive model using data on hand. Multiple linear regression is such a statistical method that estimates a linear relationship between a response variable and one or more predictors. In this method we have to estimate parameters (regression coefficients) using the sample data. Among many methods of estimation, the method of least squares is popular for estimating regression coefficients. The method of least squares minimizes the sum of squares of deviations of observed values from predicted values. Usual techniques of implementation of this method require a sample of observations whose number should be greater than or equal to the number of regression coefficients. But in problems of genetics we commonly encounter a situation in which the number of possible predictors is so large that it is almost impossible to collect data on such a large number of samples. In this study our aim is to overcome this problem using alternative statistical/numerical methods.

Our first approach is to implement the method of least squares using a Genetic Algorithm (GA). A Genetic Algorithm is a probabilistic algorithm which mimics the method of natural selection to optimize a function. We suggest a GA which can implement the method of least squares and compare its performance with the usual implementation of method of least squares whenever possible. What we observed is that this GA becomes computationally costly when the number of parameters becomes large. So we proposed a second approach in which we selected an appropriate group of predictors of size less than the sample size from the larger set of predictors which can provide the description of the observations of the response variable. First we use a standard Forward Selection method with just a little modification. For selecting a more optimal model, we then suggest a GA which can search the predictor space more exhaustively than the Forward Selection. Successful implementation of these methods and algorithms provided an initial set of results that appeared reasonable and acceptable. However, we have also identified some shortcomings of the proposed methodology that require further exploration. These are mentioned at the end of this report.

Keywords: least squares, estimates, precision, stability, convergence, genetic algorithm

## Abbreviations

 GA Genetic Algorithm OLS Ordinary Least Squares SSR Regression Sum of Squares SSE Error Sum of Squares SST Total Sum of Squares AIC Akaike Information Criterion DNA Deoxyribo Nucleic Acid sd Standard Deviation

## Background

In statistics, linear regression is used in modelling a linear relationship between a response (or dependent variable) and one or more explanatory variables (or independent variables or predictors). Linear regression models are often fitted using the method of least squares. In the least squares method, we minimize the sum of squares of deviations of observed values from predicted values. Therefore, the method of least squares provides a solution to an optimization problem. Implementation of the least squares method is problematic in many real-life scenarios. Our approach to solving this problem is to use a Genetic Algorithm (GA).

Genetic algorithm is a powerful and efficient algorithm which is used to find the optimal value(s) of a function. Genetic algorithms imitate the biological processes of natural evolution to find the ‘fittest’ (the solutions which optimize that particular function) solutions. Genetic algorithms implement the optimization strategies by simulating evolution of population through selection, crossover and mutation. It can find solutions to problems that other optimization methods cannot solve due to a lack of continuity, differentiability or other similar features of the target function.

## Statement of The Problems

Whenever the number of explanatory variables or predictors $p$ exceeds the sample size $n$, we face problems in implementing the method of least squares. In real life problems in genetics, the number of observations (sample size, $n$ ) is often smaller than the number of predictor variables $(p)$. Thus, in such situations, least squares estimates of regression coefficients are difficult to obtain or are unstable. In this study, we wish to explore the use of GA to address such regression problems when n << p.

When model parameters are optimised using iterative algorithms like GA, a greater number of parameters can result in a more complex error surface to be optimised. This complexity may affect the overall convergence time of the model. Moreover, we always seek a model which is simple, easy to interpret and has greater accuracy in prediction. If removing a predictor from the model does not affect the prediction accuracy, a model without that predictor is always preferred. These are the reasons why we perform the selection of a subset of predictors from the large set of predictors before attempting to estimate the regression coefficients. We shall adopt GA approaches at both stages i.e. model selection and estimation of regression coefficients.

## Multiple Linear Regression

Given a dataset $y_{i\;},\;x_{i1,\;......,\;}x_{ip,\;}i\;=\;1,2,....,n$ of  sample size $n$, a linear regression model assumes that the relationship between the dependent variable y and the p-vector of regressors $X$ is linear. This relationship is modelled through an error variable ε — an unobserved random variable that adds "noise" to the linear relationship between the dependent variable and the regressors. Thus the model takes the form

$y_{i\;}=\;\beta_0\;+\;\beta_1x_{i1}\;+\;...\;+\;\beta_px_{ip}\;+\;\varepsilon_i\;,\forall\;i\;=\;1,2,....,n$

Often these $n$equations are stacked together and written in matrix notation as

$y\;=\;X\beta\;+\;\varepsilon$

where

$y\;=\;\begin{pmatrix}\begin{array}{c}y_1\\\vdots\\\vdots\end{array}\\y_n\end{pmatrix}\;,\;X\;=\;\begin{bmatrix}1&x_{11}&\cdots&x_{1p}\\\vdots&\vdots&\ddots&\vdots\\1&x_{n1}&\cdots&x_{np}\end{bmatrix},\;\beta\;=\;\begin{pmatrix}\begin{array}{c}\beta_0\\\vdots\\\vdots\end{array}\\\beta_p\end{pmatrix}\;$

$\varepsilon\;=\;\begin{pmatrix}\begin{array}{c}\varepsilon_1\\\vdots\\\vdots\end{array}\\\varepsilon_n\end{pmatrix}$

We would like to estimate $\beta_{j\;}'\;s\;$(called the regression coefficients) from the given data. A large number of procedures has been developed for parameters estimation and inference in linear regression. But the most common technique is ordinary least squares (OLS). Under some standard assumptions the OLS method minimizes the sum of squared residuals (the difference between the observed values and the estimated values). The assumptions are the following:

1. Weak Exogeneity - the predictor variables are treated as fixed values, rather than random variables.

2. The errors are independent and have mean $0$.

3. Homoscedasticity - The errors have equal variance $\forall i$.

4. All the $x_j$’s $(\forall j\in\{1,\;...,\;p\})$ must be uncorrelated, i.e., the matrix $X$ should have rank $(p+1)$ .

So we want to minimize the function $f\;(\beta_{0,}\;\beta_{1,}\;...,\;\beta_p)\;=\;{\textstyle\sum_1^n}(y_i\;-\;\beta_0\;-\;\beta_1x_{i1}\;-\;...\;-\;\beta_px_{ip})^2\;w.r.t.\;\beta_0,\;\beta_1,\;...,\;\beta_{p\;.}$ In case of $n>p+1$, we differentiate $f$ w.r.t. $\beta_0,\;\beta_1,\;...,\;\beta_p$ to get the normal equations in the form $X^Ty\;=\;X^TX\beta$. Now $X^TX\;$is a square matrix of order $p+1$ and rank $(X^TX)$= rank $(X)$ $=p+1$. So $(X^TX)\;$is invertible and thus we get $\widehat\beta$ (the vector of estimated regression coefficients) = $(X^TX)^{-1}(X^Ty)$​. Thus we obtain the prediction equation (called the multiple linear regression equation) $\widehat y\;=\;\widehat{\beta_0}\;+\;\widehat{\beta_1}x_1\;+\;...\;+\;\widehat{\beta_p}x_p$. It is easy to see that $\widehat{\beta_0}\;=\;\overline y\;-\;{\textstyle\sum_1^p}\widehat{\beta_j}\overline{x_j}$ i.e. we can obtain the least squares estimate of $\beta_0$ from the least squares estimates of other regression coefficients. The above mentioned model is a model with intercept $(\beta_0)$. If we subtract the mean of each variable from themselves, the regression equation becomes an equation without that intercept term $(\widehat{\beta_0})$, the other estimated coefficients being unchanged. Thus to obtain the least squares estimates of regression coefficients in another way, we first subtract the mean of each variable from them, then minimize the new function $f_1(\beta_{1,}...,\beta_p)={\textstyle\sum_1^n}(\underline{y_i}-\beta_1\underline{x_{i1}}-...-\beta_p\underline{x_{ip}})^2w.r.t.\;\beta_1,...,\beta_p\; where\;\underline{y_i}=y_{i\;}-\overline{y_i},\;\underline{x_{ij}}=x_{ij}-\overline{\;x_{ij}}\forall i\;=\;1,\;...,\;n\forall j\;=\;1,\;...,\;p.$ Finally we obtain the least squares estimate of ​​ $\beta_0$ from others'.​

The goodness of fit is assessed by a statistic called $R^2$ statistic or coefficient of determination. It measures how close the data are to the fitted observations we get by the regression model. $R^2\;=\;\frac{\sum_1^n(\;\widehat{y_i}-\;\overline y)^2}{\sum_1^n(\;y_i-\;\overline y)^2}$ where the quantity $\textstyle\sum_1^n(\;\widehat{y_i}-\;\overline y\;)^2$ is the sum of squares explained by regression (SSR) and the quantity $\textstyle\sum_1^n(\;y_i-\;\overline y\;)^2$ is the total sum of squares (SST), So $R^2$ measures what proportion of total sum of squares i.e. total variance of the response can be explained by the regression model. The more the value of $R^2$ statistic, the more the model is good.​​

But when $n< p+1$, then rank $(X^TX)$= rank $(X)$ $\leq$ $n< p+1$, so $(X^TX)\;$is not invertible and it is not possible to estimate all the regression coefficients in the previous way. In this case, we suggest an algorithm based on genetic algorithm that can do this minimization and hence the estimation​. ​

## Model Selection In Regression

Model selection is an important step of regression. In regression, some of the measured or plausible predictors may actually be completely irrelevant to prediction using a linear model. Also a large number of predictors (p >> n) lead to difficulties in obtaining the least squares estimates. By selecting only uncorrelated predictors we can get rid of the problem of multicollinearity. Selecting predictors in regression models is a complicated problem, and there are many conflicting views on which type of variable selection procedure is best.

The simplest method of model selection would be to examine all possible combinations of the variables which is computationally expensive. Other than this method there is no guaranteed way of finding the optimal predictor subset. Adequate locally optimum solutions can be found by using forward and backward stepwise multiple regression.

With these methods, a small part of the available search space is examined. In order to study the importance of uncorrelated group of variables a more global optimisation method needs to be used. One of the possible ways of doing this is to use a stochastic search method like the genetic algorithm.

The goodness of regression models with different number of predictors cannot be compared using $R^2$statistic as its value always increases whenever a new predictor is included in the model. The adjusted $R^2$statistic, which is defined as $1\;-\;\frac{(1-R^2)\;(n-1)}{(n-\;p\;-1)}$, makes us pay a penalty for adding more predictors to the model. Therefore, we can just use the adjusted $R^2$statistic to assess the goodness of models with different number of predictors. That is, according to the adjusted $R^2$criterion, the best regression model is the one with the largest adjusted $R^2$-value. To compare regression models, sometimes we also use statistics referred to as information criterion statistics. For regression models, these statistics combine information about the SSE, number of parameters in the model, and the sample size. A low value, compared to values for other possible models, is good. The information criterion that we present here is called Akaike’s Information Criterion (AIC). For a model with p predictors, it is given by $AIC_{p\;}=\;n\log_e\left(SSE\right)\;-\;n\log_e\left(n\right)\;+\;kp$, where SSE (Error Sum of Squares) = SST - SSR. Here k is the penalty parameter for adding predictors. Usually $k\;=\;2$ is taken. The larger the value of k is, the lesser the number of the predictors selected in the model with the lowest AIC.

## A Brief Overview of Genetic Algorithm

Since genetic algorithm simulates a biological process, most of the terminologies are taken from biology. The basic components of a simple genetic algorithm are

i.  A fitness function for optimization

ii. A population of chromosomes

iii. Selection of chromosomes that will participate in the process of reproduction

iv. Crossover to produce next generation of altered chromosomes

v.  Mutation on randomly selected chromosomes in new generation

Here we briefly explain these components. Actual methods of implementing these components or operators are explained in section 3. METHODOLOGY.

The fitness function is the function that the algorithm is trying to optimize. The fitness function tests and quantifies how ‘fit’ each potential solution is.

In biology, chromosomes are tiny packages which contain one DNA molecule and its associated proteins. A chromosome consists of genes, commonly referred as blocks of DNA, where each gene encodes a specific trait, for example hair colour or eye colour.

Here the term chromosome refers to a numerical value or values that represent a candidate solution to the problem that the genetic algorithm is trying to optimize. Each candidate solution is encoded as an array of parameter values. It depends on the specific problem how to translate the sample space of candidate solutions into chromosomes. One approach is to convert each parameter value into a binary string, then concatenate the parameters end-to-end like genes on a DNA strand to create the chromosomes. Chromosomes can be an array of real numbers also.

A genetic algorithm begins with a randomly generated population of chromosomes, which serves as the first generation. Then each chromosome in the population is evaluated by the fitness function for selection. Now the selection operator chooses (with preference to the fitter chromosomes) some of the chromosomes for reproduction. There are different techniques to implement selection in genetic algorithms e.g. tournament selection, roulette-wheel selection, linear rank selection etc. The selection operator chooses chromosomes with replacement, so the same chromosome can be chosen more than once. After applying selection operator, the crossover operator is used to create new chromosomes for the next generation from the existing solutions available in the current generation. The crossover operator resembles the biological crossing over and recombination of chromosomes in cell meiosis. It exchanges genetic materials between a pair of homologus chromosomes. Along with crossover, "mutation" is also used to search for the optimal solution. In biology, mutation is the permanent alteration of the nucleotide sequence of the genome of an organism. In this context, mutation is the occasional introduction of new features in to the population pool to maintain diversity in the population. Typically mutation happens with a very low probability, such as 0.1. Some algorithms implement the mutation operator before the selection and crossover operators. Typically the selection, crossover, and the mutation process continue until the number of offsprings is the same as the initial population, so that the second generation is composed entirely of new offspring and the first generation is completely replaced. However, some algorithms let highly-fit members of the first generation survive into the second generation. This is called elitism. Now the second generation is tested by the fitness function, and the cycle repeats. The chromosome (solution) with the maximum fitness along with its fitness value (functional value) is recorded from each generation. Genetic algorithms are iterated until the fitness value of the “best” chromosome stabilizes and does not change for many generations. This means the algorithm has converged.

The working of a genetic algorithm is shown in the figure below.

Working of a Genetic Algorithm

The performance of a genetic algorithm depends strongly on the method used to encode candidate solutions into chromosomes. Other important considerations are the probability of crossover, the probability of mutation, the initial size of the population, and the number of iterations. These values can be adjusted after assessing the algorithm’s performance on a few trial runs.

## A Genetic Algorithm for Least Squares Estimation of Regression Coefficients

Fitness Function:

For a model with $p$ predictors, our fitness function is $f(\beta_{0\;,\;}\beta_1,\;...,\;\beta_p)\;=\;{\textstyle\sum_1^n}(\;y_i\;-\;\beta_{0\;}-\;\beta_1x_{i1}\;-\;...\;-\;\beta_px_{ip})^2$ which is to be minimized. This is a problem with $p+1$parameters. But as $\widehat{\beta_0}$can be obtained from other predicted regression coefficients, we have considered this as a problem with $p$parameters.

Chromosomes and Population:

We have encoded each chromosome as a p-element array of real numbers. We first specified two vectors of length $p$, lower and upper, say lower $=(l_1,\;...,\;l_p)$ and upper $=(u_1,\;...,\;u_p)$, so that the algorithm searches for the optimal $\beta_j$in the interval $\lbrack l_j,u_j\rbrack$. We generated our initial population of pre-defined size by randomly choosing individuals (real numbers) from the corresponding intervals. So our chromosomes are of the form $\mathfrak B\;=\;\lbrack\beta_1,\;...,\;\beta_p\rbrack$, $l_j\leq\beta_j\leq u_j$, for $j\;=\;1,\;...,\;p$. ​

We have used $l_j\;=\;-\frac{sd(y)}{sd(x_j)}\;,\;u_j\;=\;\frac{sd(y)}{sd(x_j)}$ as our bounds of estimated regression coefficients where sd() stands for standard deviation. The rationale behind this is given in Appendix 1.

Elitism:

We have kept our population size fixed in each generation. Elitism has been maintained in this algorithm. 5% of the fitter chromosomes has been passed on to the next generation without any further operation and the remaining individuals of the new generation has been generated through selection and crossover.

Selection Operators:

The selection operator only chooses parents from the fraction of the population that is kept. The selection operators which we have used are linear rank selection and tournament selection. In the linear rank selection method, each individual in the population is ranked in the increasing order of fitness and then assigned a selection probability that is proportional to the individual’s rank. Tournament selection involves running several "tournaments" among a few individuals (or "chromosomes") chosen at random from the population. The winner of each tournament is selected for crossover. In a tournament of size $k$, $k$ chromosomes are chosen from the population at random and then the best (fittest) chromosome is chosen from the tournament with probability $p$, the second best chromosome with probability $p(1-p)$, the third best chromosome with probability $p(1-p)^2$ and so on.

Crossover Operator:

For producing two offspring using crossover, we need to select two parent chromosomes. After selecting two parent chromosomes ${\mathfrak B}_a$and   ${\mathfrak B}_b$, they produce two different offspring using crossover operator with some pre-determined probability (said to be the probability of crossover, say $\rho_1$) and with probability $1\;-\;\rho_1$, the parents goes to the next generation as their offspring. We have set $\rho_1\;=\;0.8$. We have used the whole arithmetic crossover operator. In this crossover method, a random number $\theta$, between 0 and 1, is generated randomly and then the two offspring are produced by the formula below,

${\mathfrak B}_{a1}\;=\;\theta{\mathfrak B}_a\;+\;(1-\theta){\mathfrak B}_b$

${\mathfrak B}_{b1}\;=\;\theta{\mathfrak B}_b\;+\;(1-\theta){\mathfrak B}_a$

​Mutation Operator:

After getting all chromosomes of the next generation by elitism, selection and crossover operations, the mutation operation is performed randomly with some pre-determined probability (said to be the probability of mutation, say $\rho_2$). When a parameter in a chromosome mutates, its value is replaced by a random number chosen from the corresponding lower and upper bounds. We have set $\rho_2\;=\;0.1$

## A Genetic Algorithm for Selection of Regression Model

When $n\;<\;p+1$, we can not use the Backward Stepwise procedure for model selection as it starts with the model with all the $p$ predictors. We can use a modified Forward Stepwise procedure in which we stop when it includes $n-1$ (or less) predictors. It performs multiple iterations by adding one predictor at a time. In each iteration, multiple models are built by adding each of the predictors at a time. The AIC of the models is also computed and the model that yields the lowest AIC is retained for the next iteration.

Other than this, we propose a GA for model selection which can give us the model wih as many predictors we want and with the lowest AIC.

Chromosomes and Population:

Initially if we have $p$plausible predictors, then we have used $p-bit$binary strings as our chromosomes where each chromosome refers to a particular regression model as follows. If a chromosome has 1 in its $i-th$ place then it refers to a model with the $i-th$ predictor and otherwise a model without the $i-th$ predictor. We generated our initial population of pre-defined size by randomly creating binary strings of size $p$.

Fitness Function:

Taking a $p-bit$binary string as input, the fitness function counts the number of $1's$ in it and if that number is less than or equal to $n-1$, it fits the corresponding model, calculates the AIC and gives it as output. For binary strings with number of 1 greater than or equal to $n$, it gives infinity as output (as we want to minimize the fitness function).

We have used different values of the penalty parameter $k$ in AIC for getting optimal model of different sizes ( i.e. wih different number of predictors).

Elitism and Selection:

As before, we have maintained elitism and used Tournament Selection.

Crossover Operator:

As explained earlier, crossover is implemented with probability $\rho_1\;=\;0.8$. For binary strings, we have used Single Point Crossover operator. Under this operation, a point on both parents' chromosomes is picked randomly, and designated as 'crossover point'. Bits to the right of that point are swapped between the two parent chromosomes. This results in two offsprings. Below it is shown in the diagram.

Crossover

Mutation Operator:

As before, after getting all chromosomes of the next generation by elitism, selection and crossover operations, mutation operation is performed randomly on a bit of chromosome-strings with probability $\rho_2\;=\;0.1$. When a bit of a chromosome mutates, its value is flipped from 0 to 1 or 1 to 0.​

## ​​Generation of Dataset and Implementation of Algorithms

We generated data and implemented our algorithms in statistical software R. For implementing Genetic Algorithm we have used the package GA.

## Implementing GA for Least Squares Estimation

We have generated samples of predictor variables and response variables depending on those predictors by a particular model (the regression model mentioned above) for different values of $n$ and $p$ i.e. we have fixed the values of $\beta_0,\beta_1,...,\beta_p$, generated n observations for each of the predictors $x_1,....,x_p$ independently from normal distributions, error $\varepsilon$ from standard normal distribution (so that $\epsilon$ could have mean 0) and response y by the model $y\;=\;\beta_0+\beta_1x_{1\;}+...+\beta_px_{p\;}+\varepsilon$.​

## Implementing Forward Selection and GA for Model Selection

We have generated samples of size $n$ for the $p$predictors (from multivariate normal distribution with a specific correlation structure) and for error $\varepsilon$(again from normal distribution with mean 0) and generated response $y\;$as linear combination of $m$ (known) of those $p$ predictors plus the error so that the optimal model consisted of those predictors and we could check the performance of our algorithms.

For implementing forward stepwise regression, we have used the "step" function in R.

## Least Squares Estimation

First we have observed that tournament selection gives faster convergence than linear rank selection. So we have used tournament selection for larger values of $p$. For comparing the rate of convergence of two different selection operators, we have used a dataset with $p\;=1$. Both the selection operators have yielded stable estimates of regression coefficients for every trial run but the time of convergence varied in each trial. Even with the same initial population size, tournament selection has converged much faster than linear rank selection in each trial which is shown in the table below. We have also observed that the time of convergence in linear rank selection was variable from one trial to another. For tournament selection, the time to convergence was much less variable.

Rate of Convergance of different Selection Operators
 Run No of Iterations for Linear Rank Selection No of Iterations for Tournament Selection 1 208 22 2 493 19 3 240 18

Then we gradually increased the value of $p$. We have observed that for $p>5$, as $p$ increases, the minimum intitial population size required for stable convergence of GA estimates of regression coefficients increases almost at an exponential rate.

One example (for a particular dataset) is shown in the graph below. But this is the general trend observed.

Minimum Initial Population Size required for Convergence for different values of p

For this reason, and for limitations of time and computing power, we couldn't go beyond $p>10$. But up to that point, GA has given remarkable results.

For a dataset with $p\;=\;5$ and $n\;=\;15$, the performance of GA in estimating regression coefficients is shown below.

• Population Size = 1,000
Comparison of GA Estimates with Exact Least Squares Estimates for p = 5
 Exact Least Squares Estimates GA Estimates 37.41 37.41 -16.71 -16.71 7.60 7.60 5.84 5.84 0.19 0.19

For another dataset with and , the performance of GA in estimating regression coefficients is shown below.

• Population Size = 20,000
Comparison of GA Estimates with Exact Least Squares Estimates for p = 10
 Exact Least Squares Estimates GA Estimates 11.1 11.1 2.8 2.7 10.2 10.5 5.3 5.3 7.0 6.9 18.7 18.6 8.4 8.3 13.3 13.3 5.8 5.8 9.9 10.0

So GA has given stable estimates of regression coefficients in all cases.

## Model Selection

We have generated datasets with $p\;=\;50$ and $n\;=\;25$. The $50$ predictors are from a multivariate normal distribution with the correlation matrix consists of five $10\times10$ block matrices, entries (of each block) being randomly generated from the interval $\lbrack0.4,0.6\rbrack$. Then we have generated response variable $y$ as linear combination of only $m$ predictors and error with $m\;=5\;and\;m\;=\;10$ respectively for each dataset.​

Results for dataset 1:

Response (as generated): $y=22x_1+14x_{11}+9x_{21}+15x_{31}+15x_{41}+\varepsilon$

Information of the most significant model:

• Predictors: $x_1,x_{11},x_{21},x_{31},x_{41}$
• AIC: 119.91

Information of the Optimal Model selected by Forward Selection:

• Selected Predictors:

$x_1,x_{41},x_{31},x_{11,}x_{21,}x_{26,}x_{25,}x_9,x_4,x_{30,}x_{48,}x_{42},x_8,x_{28},x_{15},x_{27},x_{16},x_{17},x_{49},x_{36},x_{22,}x_{32},x_3\;\;\;\;\;\;\;\;\;\;\;$

• AIC: -310.91

Information of the Optimal Model Selected by GA :

• Selected Predictors:

$x_1,x_2,x_3,x_4,x_5,x_{11.}x_{14,}x_{17,}x_{19,}x_{21},x_{25},x_{26,}x_{29},x_{31},x_{32},x_{37},x_{41},x_{42},x_{45},x_{46},x_{48},x_{50}$

• AIC = -344.57

Results for dataset 2:

Response (as generated): $y=22x_1+16x_5+14x_{11}+19x_{15}+9x_{21}+15x_{25}+15x_{31}+16x_{35}+15x_{41}+17x_{45}+\varepsilon$

Information of the most significant model:

• Predictors: $x_1,x_5,x_{11},x_{15},x_{21},x_{25},x_{31},x_{35},x_{41},x_{45}$
• AIC: 64.35

Information of the Optimal Model selected by Forward Selection:

• Selected Predictors:

$x_1,x_{41},x_{31},x_{11},x_{45},x_{25},x_4,x_{16},x_{35},x_{21},x_{48},x_{42},x_{14},x_{29},x_{15},x_5,x_6,x_{39},x_{22},x_{17},x_{47},x_{10},x_{20}$

• AIC: -324.12

Information of the Optimal Model selected by GA:

• Selected Predictors:

$x_1,x_5,x_6,x_{11},x_{13},x_{15},x_{18},x_{21},x_{23},x_{25},x_{31},x_{32},x_{33},x_{34},x_{35},x_{40},x_{41},x_{44},x_{45},x_{46},x_{48},x_{50}$

• AIC: - 399.85

The optimal models selected by GA have lower AIC than those selected by forward selection in both cases. It is striking that AIC's of the optimal models selected by both forward selection and GA make a huge jump (from positive to negative) from the AIC's of the most significant model. The reason behind this is that the SSE's of these optimal models are$<{10}^{-5}$ (because we know, with addition of predictors, SSE always decreases) and being proportional to the logarithm of SSE's (AIC = 25lnSSE - 25ln25 + 2 (No. of predictors) in these cases), AIC's thus become large negative.

We observed that the optimal models selected by both forward selection and GA contain those significant predictors in both cases. But along with them, they contain other non-significant predictors also. The visible reason behind it is that those selected models have lower AICs than the models containing only those significant predictors. For computing the exact least squares estimates of regression coefficients, it is not a problem as we now have a model containing all the significant predictors with the number of predictors less than the sample size. Below we give a comparison between the least squares estimates of the coefficients corresponding to the significant predictors of the significant models and the models selected by GA. It should also be noted that the coefficients corresponding to the non-significant predictors in the models selected by GA are negligible compared to the coefficients corresponding to the significant predictors (and most of the coefficients corresponding to the non-significant predictors are approximately equal to zero).

Comparison of Least Squares Estimates of Significant Coefficients by different models for data-set 1
 Significant Predictors Estimates of corresponding Coefficients by the Significant Model Estimates of corresponding Coefficients by the Model selected by GA x1 21.98 22.52 x11 14.08 14.51 x21 8.89 9.31 x31 14.87 14.71 x41 14.96 15.28
Comparison of Least Squares Estimates of Significant Coefficients by different models for data-set 2
 Significant Predictors Estimates of corresponding Coefficients by the Significant Model Estimates of corresponding Coefficients by the Model selected by GA x1 21.78 21.66 x5 16.17 16.01 x11 13.99 14.05 x15 19.30 19.19 x21 8.89 8.78 x25 15.32 15.33 x31 15.21 15.34 x35 13.89 15.55 x41 14.66 15.26 x45 15.89 17.18

Then we started to investigate why the selected models contain non-significant predictors i.e. why their AICs are lower than than the significant models. One possible reason is the moderate correlations between groups of predictors. In all practical situations, this is the most likely scenario i.e most of the time predictors happen to be correlated.

We surmised that if we increase the value of the penalty for adding predictor in AIC, namely $k$, it may lead to selection of only the significant predictors. So we changed the number of significant predictors $m$and observed at what value of $k$ we got the desired model as the model selected by GA. We observed that when the number of significant predictors is low, we need a higher value of $k$and with the increase in the number of significant predictors, the minimum required value of $k$decreases almost at a steady and low rate, when $n$ and $p$are fixed. One such dependency (for a particular dataset) is shown in the graph below:

k as a function of No of Significant Predictors when n=25,p=50

## CONCLUSIONS

We can implement the method of least squares using a Genetic Algorithm which provides stable estimates of regression coefficients even if $n.

For large values of $p$, the GA for estimating regression coefficients is computationally costly.

To reduce computational cost, we can apply Model Selection to lower the value of $p$ which can also the solve the problem of $n<.

For model selection, GA performs better than the usual Forward Selection procedure.

With the usual AIC as the optimality criterion, model selection fails to select the most significant model i.e. model only with the most significant predictors, when the predictors are moderately correlated.

Changing the value of the penalty parameter in AIC leads to obtaining of the most significant model by model selection.

To select the most significant models with different number of significant predictors $(m)$, we need to use different values of the penalty parameter $(k)$ in AIC. Thus we can think $k$ as a function of $m,n$ and $p$.

## FUTURE SCOPE

To reduce the computational complexity in estimating the least squares estimates of regression coefficients using GA. This can be done by using more precise initial intervals or by lowering the probability of crossover.

To explain more rigourously why the optimal model is not always the most significant model.

To explore what role do correlations among predictors play in this context.

To estimate the aforesaid function $k(m,n,p)$.

## APPENDIX 1

Here we will give the justification for taking the aforesaid intervals as the bounds of $\widehat{\beta_j}'s$.

Consider regression of $y$ and $x_j$ on $x_1,x_2,...,x_{j-1},x_{j+1},...,x_p$ separately and let $\widehat{\alpha_k}'s$ and $\widehat{\gamma_k}'s$ $\forall k\in\{1,...,j-1,j+1,..p\}$ be the estimates of the corresponding regression coefficients respectively. Also let $\widehat{y_i}$ and $\widehat{x_{j_i}}$ be the predicted values by regression $\forall i\in\{1,...,n\}$​.

Define $r_i=y_i-\widehat{y_i}\forall\;i\;\in\{1,...,n\}$ and $r_{j_i}={x_j}_i-\widehat{{x_j}_i}\forall\;i\;\in\{1,...,n\}$. It can be proved that $\widehat{\beta_j}=\rho_j\frac{sd(r)}{sd(r_j)}$where $\rho_{j\;}$is the partial correalation coefficient between $y$ and $x_j$. Also $sd(r)=\sqrt{(1-R_{-j}^2)}sd(y)$ where $R_{-j}$ is the multiple correlation coefficient between $y$ and $x_1,...,x_{j-1},x_{j+1},..x_p$ and $sd(r_j)=\sqrt{(1-R_j^2)}sd(x_j)$ where $R_j$ is the multiple correlation coefficient between $x_j$ and $x_1,...,x_{j-1},x_{j+1},..x_p$.

So $(\widehat{\beta_j})^2=\rho_j^2\frac{1-R_{-j}^2}{1-R_j^2}\frac{var(y)}{var(x_j)}$. Now with $-1\leq\rho_j\leq1$ (as it is, after all, a product moment correlation coefficient) and (this is true because of the fact that $x_j's\;$ are uncorrelated), we have $(\widehat{\beta_j})^2\leq\frac{var(y)}{var(x_j)}\;i.e.\;-\frac{sd(y)}{sd(x_j)}\leq\widehat{\beta_j}\leq\frac{sd(y)}{sd(x_j)}$.​

The proofs of the results and the theorems mentioned in Appendix 1 can be found in Gun A.M., Gupta M.K. and Dasgupta B. 2014. Fundamentals of Statistics vol-1 The World Press Private Limited,Kolkata pp. 285-291

## ​​REFERENCES

Carr J. 2014. An Introduction to Genetic Algorithms pp. 1-40.

Gun A.M., Gupta M.K. and Dasgupta B. 2014. Fundamentals of Statistics vol-1 The World Press Private Limited, Kolkata ch. 11-12.

Bhattachariya R.K. 2013. Introduction to Genetic Algorithms. IIT Guwahati.

Paterlini S. and Minerva T. 2010. Regression Model Selection Using Genetic Algorithm. Proceeding NN'10/EC'10/FS'10. Proceedings of the 11th WSEAS international conference on neural networks and 11th SEAS international conference on evolutionary computing and 11th WSEAS international conference on Fuzzy systems pp. 19-27

Rawlings J. O., Pantula S. G. and Dickey D. A. 1998. Applied Regression Analysis:A Research Tool. Springer ch. 3.

Broadhurst D., Goodacre R., Jones A., Rowland J. J. and Kell D. B. 1997 Genetic algorithms as a method for variable selection in multiple linear regression and partial least squares regression,with applications to pyrolysis mass spectrometry. Analytica Chimica Acta 348 pp. 71-86

Wallet B. C., Marchette D. J., Solka J. L. and Wegman E. J. 1996 A Genetic Algorithm for Best Subset Selection in Linear Regression. Computing science and statistics. Volume 28., Graph-image-vision: proceedings of the 28th Symposium on the Interface, Sydney, Australia.

Karr C. L., Stanley D.A., and Scheiner B.J. 1991. Genetic Algorithm applied to Least Squares Curve Fitting. Report of Investigations 9339, United States Department of the Interior and United States Bureau of Mines.

## ACKNOWLEDGEMENTS

I am grateful to Indian Academy of Sciences, Indian National Science Academy and The National Academy of Sciences, India for providing me this opportunity to carry out this project. I express my sincere thanks to Professor Partha P. Majumder, Distinguished Professor, National Institute of Biomedical Genomics, Kalyani for his constant advice and guidance throughout my project to make my summer days a fruitful and memorable one. I would also like to thank all the faculty members, research scholars, data scientists, trainees and other employees and staff of NIBMG, Kalyani for helping me to complete my work. It is my privilege to express my gratitude to Dr. Arnab Chakraborty, Assistant Professor, Applied Statistics Unit, Indian Statistical Institute for providing my letter of recommendation to IAS,INSA and NAS.I am thankful to AuthorCafe for providing me such a great platform to write my report and to have my work archived online.

More
Written, reviewed, revised, proofed and published with