
Summer Research Fellowship Programme of India's Science Academies

Selection of Regression Models Using Genetic Algorithms

Tamojit Sadhukhan

Indian Statistical Institute, Kolkata 700108

Guided by:

Partha P. Majumder

Distinguished Professor, National Institute of Biomedical Genomics, Kalyani 741251

Abstract

Statistical methods are widely used to solve various problems of genetics and genomics. One such use is to build a predictive model from the data at hand. Multiple linear regression is one such statistical method; it estimates a linear relationship between a response variable and one or more predictors. In this method we have to estimate parameters (regression coefficients) from the sample data. Among many methods of estimation, the method of least squares is popular for estimating regression coefficients. The method of least squares minimizes the sum of squares of deviations of observed values from predicted values. The usual techniques for implementing this method require the number of observations in the sample to be greater than or equal to the number of regression coefficients. But in problems of genetics we commonly encounter situations in which the number of possible predictors is so large that it is almost impossible to collect a sample of that size. In this study our aim is to overcome this problem using alternative statistical/numerical methods.

Our first approach is to implement the method of least squares using a Genetic Algorithm (GA). A Genetic Algorithm is a probabilistic algorithm which mimics the process of natural selection to optimize a function. We suggest a GA which can implement the method of least squares and compare its performance with the usual implementation of the method of least squares whenever possible. We observed that this GA becomes computationally costly when the number of parameters becomes large. So we proposed a second approach, in which we select from the larger set of predictors an appropriate group of predictors, of size less than the sample size, which can adequately describe the observations of the response variable. First we use a standard Forward Selection method with a small modification. For selecting a more optimal model, we then suggest a GA which can search the predictor space more exhaustively than Forward Selection. Successful implementation of these methods and algorithms provided an initial set of results that appeared reasonable and acceptable. However, we have also identified some shortcomings of the proposed methodology that require further exploration. These are mentioned at the end of this report.

Keywords: least squares, estimates, precision, stability, convergence, genetic algorithm

Abbreviations

GA: Genetic Algorithm
OLS: Ordinary Least Squares
SSR: Regression Sum of Squares
SSE: Error Sum of Squares
SST: Total Sum of Squares
AIC: Akaike Information Criterion
DNA: Deoxyribonucleic Acid
sd: Standard Deviation

INTRODUCTION

Background

In statistics, linear regression is used in modelling a linear relationship between a response (or dependent variable) and one or more explanatory variables (or independent variables or predictors). Linear regression models are often fitted using the method of least squares. In the least squares method, we minimize the sum of squares of deviations of observed values from predicted values. Therefore, the method of least squares provides a solution to an optimization problem. Implementation of the least squares method is problematic in many real-life scenarios. Our approach to solving this problem is to use a Genetic Algorithm (GA).

A genetic algorithm is a powerful and efficient algorithm which is used to find the optimal value(s) of a function. Genetic algorithms imitate the biological processes of natural evolution to find the 'fittest' solutions, i.e., the solutions which optimize that particular function. They implement this optimization strategy by simulating the evolution of a population through selection, crossover and mutation. They can find solutions to problems that other optimization methods cannot solve because of a lack of continuity, differentiability or other similar features of the target function.

Statement of the Problems

Whenever the number of explanatory variables or predictors $p$ exceeds the sample size $n$, we face problems in implementing the method of least squares. In real-life problems in genetics, the number of observations (sample size, $n$) is often smaller than the number of predictor variables $(p)$. Thus, in such situations, least squares estimates of regression coefficients are difficult to obtain or are unstable. In this study, we wish to explore the use of GA to address such regression problems when $n \ll p$.

When model parameters are optimized using iterative algorithms like GA, a greater number of parameters can result in a more complex error surface to be optimized. This complexity may affect the overall convergence time of the algorithm. Moreover, we always seek a model which is simple, easy to interpret and has high prediction accuracy. If removing a predictor from the model does not affect the prediction accuracy, the model without that predictor is always preferred. These are the reasons why we select a subset of predictors from the large set of predictors before attempting to estimate the regression coefficients. We adopt GA approaches at both stages, i.e. model selection and estimation of regression coefficients.

LITERATURE REVIEW

Multiple Linear Regression

Given a dataset $\{(y_i, x_{i1}, \ldots, x_{ip}):\ i = 1, 2, \ldots, n\}$ of sample size $n$, a linear regression model assumes that the relationship between the dependent variable $y$ and the $p$-vector of regressors is linear. This relationship is modelled through an error variable $\varepsilon$, an unobserved random variable that adds "noise" to the linear relationship between the dependent variable and the regressors. Thus the model takes the form

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i, \qquad i = 1, 2, \ldots, n$$

Often these $n$ equations are stacked together and written in matrix notation as

$$y = X\beta + \varepsilon$$

where

$$y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix}, \qquad \beta = \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_p \end{pmatrix}, \qquad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$

We would like to estimate the $\beta_j$'s (called the regression coefficients) from the given data. A large number of procedures has been developed for parameter estimation and inference in linear regression, but the most common technique is ordinary least squares (OLS). Under some standard assumptions, the OLS method minimizes the sum of squared residuals (the differences between the observed and the fitted values). The assumptions are the following:

1. Weak exogeneity: the predictor variables are treated as fixed values, rather than random variables.

2. The errors are independent and have mean $0$.

3. Homoscedasticity: the errors have equal variance for all $i$.

4. All the $x_j$'s ($j \in \{1, \ldots, p\}$) must be uncorrelated; in particular, the matrix $X$ should have full column rank $(p+1)$.

So we want to minimize the function $f(\beta_0, \beta_1, \ldots, \beta_p) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip})^2$ with respect to $\beta_0, \beta_1, \ldots, \beta_p$. When $n \ge p+1$, we differentiate $f$ with respect to $\beta_0, \beta_1, \ldots, \beta_p$ to get the normal equations $X^T y = X^T X \beta$. Now $X^T X$ is a square matrix of order $p+1$ and rank$(X^T X)$ = rank$(X) = p+1$, so $X^T X$ is invertible and we get $\hat\beta$ (the vector of estimated regression coefficients) $= (X^T X)^{-1} X^T y$. Thus we obtain the prediction equation (called the multiple linear regression equation) $\hat y = \hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_p x_p$. It is easy to see that $\hat\beta_0 = \bar y - \sum_{j=1}^p \hat\beta_j \bar x_j$, i.e., we can obtain the least squares estimate of $\beta_0$ from the least squares estimates of the other regression coefficients. The model above is a model with an intercept $(\beta_0)$. If we subtract from each variable its mean, the regression equation becomes an equation without the intercept term $(\hat\beta_0)$, the other estimated coefficients being unchanged. Thus, to obtain the least squares estimates in another way, we first subtract from each variable its mean and then minimize the new function $f_1(\beta_1, \ldots, \beta_p) = \sum_{i=1}^n (\underline{y_i} - \beta_1 \underline{x_{i1}} - \cdots - \beta_p \underline{x_{ip}})^2$ with respect to $\beta_1, \ldots, \beta_p$, where $\underline{y_i} = y_i - \bar y$ and $\underline{x_{ij}} = x_{ij} - \bar x_j$ for all $i = 1, \ldots, n$ and $j = 1, \ldots, p$. Finally, we obtain the least squares estimate of $\beta_0$ from the others.
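As a concrete check of this closed-form solution, the following small R sketch (our own illustrative code on arbitrary simulated data, not part of the study) compares the normal-equations estimate with the fit returned by R's built-in lm() function:

```r
# Illustrative check: OLS via the normal equations vs. R's lm()
set.seed(1)
n <- 30; p <- 3
X <- cbind(1, matrix(rnorm(n * p), n, p))      # design matrix with an intercept column
beta <- c(2, 1.5, -0.5, 3)                     # arbitrary "true" coefficients
y <- drop(X %*% beta + rnorm(n))               # response with standard normal error

beta_hat <- solve(t(X) %*% X, t(X) %*% y)      # solves X'X b = X'y, i.e. (X'X)^{-1} X'y
fit <- lm(y ~ X[, -1])                         # lm() adds its own intercept term

cbind(normal_equations = drop(beta_hat), lm = coef(fit))   # the two estimates agree
```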

The goodness of fit is assessed by a statistic called the $R^2$ statistic, or coefficient of determination. It measures how close the data are to the fitted values obtained from the regression model: $R^2 = \frac{\sum_{i=1}^n (\hat y_i - \bar y)^2}{\sum_{i=1}^n (y_i - \bar y)^2}$, where the numerator $\sum_{i=1}^n (\hat y_i - \bar y)^2$ is the sum of squares explained by regression (SSR) and the denominator $\sum_{i=1}^n (y_i - \bar y)^2$ is the total sum of squares (SST). So $R^2$ measures what proportion of the total sum of squares, i.e., of the total variance of the response, is explained by the regression model. The larger the value of the $R^2$ statistic, the better the model.

But when $n < p+1$, rank$(X^T X)$ = rank$(X) \le n < p+1$, so $X^T X$ is not invertible and it is not possible to estimate all the regression coefficients in the way described above. In this case, we suggest an algorithm, based on the genetic algorithm, that can carry out this minimization and hence the estimation.

Model Selection in Regression

Model selection is an important step of regression. Some of the measured or plausible predictors may actually be completely irrelevant to prediction using a linear model. Also, a large number of predictors ($p \gg n$) leads to difficulties in obtaining the least squares estimates. By selecting only uncorrelated predictors we can get rid of the problem of multicollinearity. Selecting predictors in regression models is a complicated problem, and there are many conflicting views on which type of variable selection procedure is best.

The simplest method of model selection would be to examine all possible combinations of the variables, which is computationally expensive. Other than this method, there is no guaranteed way of finding the optimal predictor subset. Adequate locally optimal solutions can be found by using forward and backward stepwise multiple regression.

With these methods, only a small part of the available search space is examined. In order to study the importance of groups of uncorrelated variables, a more global optimization method needs to be used. One possible way of doing this is to use a stochastic search method such as the genetic algorithm.

The goodness of regression models with different numbers of predictors cannot be compared using the $R^2$ statistic, as its value always increases whenever a new predictor is included in the model. The adjusted $R^2$ statistic, defined as $1 - \frac{(1-R^2)(n-1)}{n-p-1}$, makes us pay a penalty for adding more predictors to the model. Therefore, we can use the adjusted $R^2$ statistic to assess the goodness of models with different numbers of predictors: according to the adjusted $R^2$ criterion, the best regression model is the one with the largest adjusted $R^2$ value. To compare regression models, we sometimes also use statistics referred to as information criterion statistics. For regression models, these statistics combine information about the SSE, the number of parameters in the model, and the sample size; a low value, compared with the values for other candidate models, is good. The information criterion that we use here is Akaike's Information Criterion (AIC). For a model with $p$ predictors it is given by $AIC_p = n\log_e(SSE) - n\log_e(n) + kp$, where SSE (Error Sum of Squares) = SST - SSR and $k$ is the penalty parameter for adding predictors; usually $k = 2$ is taken. The larger the value of $k$, the smaller the number of predictors selected in the model with the lowest AIC.
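The AIC variant above is easy to compute directly from a fitted model. The helper below is an illustrative sketch (our own function, assuming a model fitted with lm()); it makes the role of the penalty parameter k explicit:

```r
# AIC_p = n*log(SSE) - n*log(n) + k*p, as defined above (illustrative helper)
aic_p <- function(fit, k = 2) {
  n   <- length(residuals(fit))
  sse <- sum(residuals(fit)^2)          # error sum of squares
  p   <- length(coef(fit)) - 1          # number of predictors, excluding the intercept
  n * log(sse) - n * log(n) + k * p
}

# Example usage (hypothetical fitted model `fit`):
#   aic_p(fit, k = 2)   # the usual penalty
#   aic_p(fit, k = 4)   # a harsher penalty, favouring smaller models
```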

A Brief Overview of Genetic Algorithm

Since a genetic algorithm simulates a biological process, most of its terminology is taken from biology. The basic components of a simple genetic algorithm are:

i.  A fitness function for optimization

ii. A population of chromosomes

iii. Selection of chromosomes that will participate in the process of reproduction

iv. Crossover to produce next generation of altered chromosomes

v.  Mutation on randomly selected chromosomes in new generation

Here we briefly explain these components. The actual methods of implementing these components or operators are explained in the METHODOLOGY section.

The fitness function is the function that the algorithm is trying to optimize. The fitness function tests and quantifies how ‘fit’ each potential solution is.

In biology, chromosomes are tiny packages which contain one DNA molecule and its associated proteins. A chromosome consists of genes, commonly referred to as blocks of DNA, where each gene encodes a specific trait, for example hair colour or eye colour.

Here the term chromosome refers to a numerical value or values that represent a candidate solution to the problem that the genetic algorithm is trying to optimize. Each candidate solution is encoded as an array of parameter values. How to translate the sample space of candidate solutions into chromosomes depends on the specific problem. One approach is to convert each parameter value into a binary string, then concatenate the parameters end-to-end like genes on a DNA strand to create the chromosomes. Chromosomes can also be arrays of real numbers.

A genetic algorithm begins with a randomly generated population of chromosomes, which serves as the first generation. Then each chromosome in the population is evaluated by the fitness function for selection. The selection operator then chooses (with preference for the fitter chromosomes) some of the chromosomes for reproduction. There are different techniques to implement selection in genetic algorithms, e.g. tournament selection, roulette-wheel selection, linear rank selection etc. The selection operator chooses chromosomes with replacement, so the same chromosome can be chosen more than once. After applying the selection operator, the crossover operator is used to create new chromosomes for the next generation from the existing solutions available in the current generation. The crossover operator resembles the biological crossing over and recombination of chromosomes in cell meiosis: it exchanges genetic material between a pair of homologous chromosomes. Along with crossover, "mutation" is also used to search for the optimal solution. In biology, mutation is a permanent alteration of the nucleotide sequence of the genome of an organism. In this context, mutation is the occasional introduction of new features into the population pool to maintain diversity in the population. Typically mutation happens with a low probability, such as 0.1. Some algorithms implement the mutation operator before the selection and crossover operators. Typically the selection, crossover and mutation processes continue until the number of offspring is the same as the initial population size, so that the second generation is composed entirely of new offspring and the first generation is completely replaced. However, some algorithms let highly fit members of the first generation survive into the second generation; this is called elitism. The second generation is then tested by the fitness function, and the cycle repeats. The chromosome (solution) with the maximum fitness, along with its fitness value (functional value), is recorded from each generation. Genetic algorithms are iterated until the fitness value of the "best" chromosome stabilizes and does not change for many generations, which means the algorithm has converged.
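To make this cycle concrete, here is a minimal, self-contained R sketch of the generation loop (toy fitness function, tournament selection, whole arithmetic crossover, uniform mutation and simple elitism). It only illustrates the scheme; it is not the implementation used later in this report.

```r
# Minimal GA sketch: maximize a toy fitness function over [-5, 5]^2
set.seed(42)
fitness <- function(x) -sum((x - c(1, 2))^2)          # optimum at (1, 2)
lower <- c(-5, -5); upper <- c(5, 5)
pop_size <- 50; n_gen <- 100; p_cross <- 0.8; p_mut <- 0.1; n_elite <- 2

pop <- matrix(runif(pop_size * 2, -5, 5), pop_size, 2)  # random initial population
for (g in 1:n_gen) {
  fit <- apply(pop, 1, fitness)
  elite <- pop[order(fit, decreasing = TRUE)[1:n_elite], , drop = FALSE]  # elitism
  new_pop <- elite
  while (nrow(new_pop) < pop_size) {
    # tournament selection of two parents (tournament size 3, fittest wins)
    pick <- function() { cand <- sample(pop_size, 3); pop[cand[which.max(fit[cand])], ] }
    pa <- pick(); pb <- pick()
    if (runif(1) < p_cross) {                         # whole arithmetic crossover
      th <- runif(1)
      off <- rbind(th * pa + (1 - th) * pb, th * pb + (1 - th) * pa)
    } else off <- rbind(pa, pb)
    mut <- matrix(runif(length(off)) < p_mut, nrow(off))   # uniform random mutation
    off[mut] <- runif(sum(mut), lower[col(off)[mut]], upper[col(off)[mut]])
    new_pop <- rbind(new_pop, off)
  }
  pop <- new_pop[1:pop_size, ]                        # next generation
}
pop[which.max(apply(pop, 1, fitness)), ]              # best solution found
```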

The working of a genetic algorithm is shown in the figure below.

[Figure: Working of a Genetic Algorithm]

    The performance of a genetic algorithm depends strongly on the method used to encode candidate solutions into chromosomes. Other important considerations are the probability of crossover, the probability of mutation, the initial size of the population, and the number of iterations. These values can be adjusted after assessing the algorithm’s performance on a few trial runs.

    METHODOLOGY

    A Genetic Algorithm for Least Squares Estimation of Regression Coefficients

    Fitness Function:

For a model with $p$ predictors, our fitness function is $f(\beta_0, \beta_1, \ldots, \beta_p) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip})^2$, which is to be minimized. This is a problem with $p+1$ parameters, but since $\hat\beta_0$ can be obtained from the other estimated regression coefficients, we have treated it as a problem with $p$ parameters.

    Chromosomes and Population:

We have encoded each chromosome as a $p$-element array of real numbers. We first specified two vectors of length $p$, lower and upper, say lower $= (l_1, \ldots, l_p)$ and upper $= (u_1, \ldots, u_p)$, so that the algorithm searches for the optimal $\beta_j$ in the interval $[l_j, u_j]$. We generated our initial population of pre-defined size by randomly choosing individuals (arrays of real numbers) from the corresponding intervals. So our chromosomes are of the form $\mathfrak{B} = [\beta_1, \ldots, \beta_p]$, with $l_j \le \beta_j \le u_j$ for $j = 1, \ldots, p$.

We have used $l_j = -\frac{sd(y)}{sd(x_j)}$ and $u_j = \frac{sd(y)}{sd(x_j)}$ as the bounds of the estimated regression coefficients, where $sd(\cdot)$ stands for standard deviation. The rationale behind this choice is given in Appendix 1.
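In R these bounds can be computed directly from the data; a two-line sketch, assuming the predictors are held in a matrix X and the response in a vector y:

```r
# l_j = -sd(y)/sd(x_j) and u_j = +sd(y)/sd(x_j) for the GA search space
lower <- -sd(y) / apply(X, 2, sd)
upper <-  sd(y) / apply(X, 2, sd)
```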

    Elitism:

We have kept the population size fixed in each generation. Elitism has been maintained in this algorithm: the fittest 5% of chromosomes have been passed on to the next generation without any further operation, and the remaining individuals of the new generation have been generated through selection and crossover.

    Selection Operators:

The selection operator only chooses parents from the fraction of the population that is kept. The selection operators which we have used are linear rank selection and tournament selection. In the linear rank selection method, the individuals in the population are ranked in increasing order of fitness and each is assigned a selection probability proportional to its rank. Tournament selection involves running several "tournaments" among a few individuals (chromosomes) chosen at random from the population, and the winner of each tournament is selected for crossover. In a tournament of size $k$, $k$ chromosomes are chosen from the population at random and then the best (fittest) chromosome in the tournament is chosen with probability $p$, the second best with probability $p(1-p)$, the third best with probability $p(1-p)^2$, and so on.

    Crossover Operator:

For producing two offspring using crossover, we need to select two parent chromosomes. After two parent chromosomes $\mathfrak{B}_a$ and $\mathfrak{B}_b$ are selected, they produce two different offspring through the crossover operator with some pre-determined probability (the probability of crossover, say $\rho_1$); with probability $1 - \rho_1$ the parents pass unchanged to the next generation as their offspring. We have set $\rho_1 = 0.8$. We have used the whole arithmetic crossover operator. In this crossover method, a random number $\theta$ between 0 and 1 is generated, and the two offspring are produced by the formulas below:

$$\mathfrak{B}_{a1} = \theta \mathfrak{B}_a + (1-\theta)\mathfrak{B}_b$$

$$\mathfrak{B}_{b1} = \theta \mathfrak{B}_b + (1-\theta)\mathfrak{B}_a$$
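A direct R translation of this operator (an illustrative sketch; the GA package also ships built-in crossover operators for real-valued chromosomes):

```r
# Whole arithmetic crossover: offspring are convex combinations of the two parents
wa_crossover <- function(parent_a, parent_b) {
  theta <- runif(1)                                   # random mixing weight in (0, 1)
  list(offspring1 = theta * parent_a + (1 - theta) * parent_b,
       offspring2 = theta * parent_b + (1 - theta) * parent_a)
}
```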

    ​Mutation Operator:

After all chromosomes of the next generation have been obtained through elitism, selection and crossover, the mutation operation is performed randomly with some pre-determined probability (the probability of mutation, say $\rho_2$). When a parameter in a chromosome mutates, its value is replaced by a random number chosen between the corresponding lower and upper bounds. We have set $\rho_2 = 0.1$.
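Putting these pieces together, the estimation GA can be run through the GA package roughly as follows. This is an indicative sketch, not the exact script used for the results below: the data objects X and y, their centred versions Xc and yc, and the bound vectors lower and upper are assumed to exist, the fitness is negated because ga() maximizes, and the settings follow the values stated above.

```r
library(GA)

Xc <- scale(X, scale = FALSE)                        # centre the predictors
yc <- y - mean(y)                                    # centre the response
fitness_ls <- function(b) -sum((yc - Xc %*% b)^2)    # negative SSE (ga() maximizes)

fit_ga <- ga(type = "real-valued",
             fitness    = fitness_ls,
             lower      = lower, upper = upper,      # bounds from the previous subsection
             popSize    = 1000,                      # e.g. the p = 5 setting reported later
             pcrossover = 0.8, pmutation = 0.1,
             elitism    = max(1, round(0.05 * 1000)),  # keep the fittest 5%
             maxiter    = 500, run = 100)            # stop after 100 unchanged generations

beta_ga  <- fit_ga@solution[1, ]                     # GA estimates of beta_1, ..., beta_p
beta0_ga <- mean(y) - sum(beta_ga * colMeans(X))     # recover the intercept afterwards
```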

    A Genetic Algorithm for Selection of Regression Model

When $n < p+1$, we cannot use the Backward Stepwise procedure for model selection, as it starts from the model containing all $p$ predictors. We can instead use a modified Forward Stepwise procedure in which we stop once it has included $n-1$ (or fewer) predictors. It performs multiple iterations, adding one predictor at a time: in each iteration, candidate models are built by adding each of the remaining predictors in turn, the AIC of each candidate model is computed, and the model that yields the lowest AIC is retained for the next iteration.

In addition, we propose a GA for model selection which can give us the model with as many predictors as we want and with the lowest AIC.

    Chromosomes and Population:

If we initially have $p$ plausible predictors, we use $p$-bit binary strings as our chromosomes, where each chromosome encodes a particular regression model as follows: if a chromosome has a 1 in its $i$-th position, it refers to a model that includes the $i$-th predictor, and otherwise to a model without the $i$-th predictor. We generated our initial population of pre-defined size by randomly creating binary strings of length $p$.

    Fitness Function:

Taking a $p$-bit binary string as input, the fitness function counts the number of 1's in it; if that number is less than or equal to $n-1$, it fits the corresponding model, calculates the AIC and returns it as output. For binary strings with $n$ or more 1's, it returns infinity (as we want to minimize the fitness function).

We have used different values of the penalty parameter $k$ in the AIC to obtain optimal models of different sizes (i.e. with different numbers of predictors).
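A sketch of this fitness function and the corresponding binary GA call is given below (our own illustrative code written around the GA package; the data objects y and X and the penalty k are assumed to be defined, and the AIC is negated because ga() maximizes, so "infinity" becomes a very large negative penalty):

```r
library(GA)

# Fitness for model selection: negative AIC of the model coded by the bit string,
# with a huge penalty (standing in for infinity) when it has n or more predictors
fitness_ms <- function(bits, y, X, k = 2) {
  n <- length(y)
  if (sum(bits) == 0 || sum(bits) >= n) return(-1e10)
  dat <- data.frame(y = y, X[, bits == 1, drop = FALSE])
  fit <- lm(y ~ ., data = dat)
  sse <- sum(residuals(fit)^2)
  -(n * log(sse) - n * log(n) + k * sum(bits))
}

ga_ms <- ga(type = "binary", nBits = ncol(X),
            fitness = fitness_ms, y = y, X = X, k = 2,   # extra args are passed to fitness
            popSize = 200, pcrossover = 0.8, pmutation = 0.1,
            maxiter = 500, run = 100)
which(ga_ms@solution[1, ] == 1)                          # indices of the selected predictors
```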

    Elitism and Selection:

    As before, we have maintained elitism and used Tournament Selection.

    Crossover Operator:

As explained earlier, crossover is implemented with probability $\rho_1 = 0.8$. For binary strings, we have used the Single Point Crossover operator. Under this operation, a point on both parents' chromosomes is picked randomly and designated the 'crossover point'. Bits to the right of that point are swapped between the two parent chromosomes, resulting in two offspring. This is shown in the diagram below.

[Figure: Single Point Crossover]
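As an illustration, single point crossover on bit strings can be written as follows (a sketch; the GA package provides its own binary crossover operators):

```r
# Single point crossover for two parent bit strings of equal length
sp_crossover <- function(parent_a, parent_b) {
  p <- length(parent_a)
  cut <- sample(1:(p - 1), 1)                  # randomly chosen crossover point
  list(offspring1 = c(parent_a[1:cut], parent_b[(cut + 1):p]),
       offspring2 = c(parent_b[1:cut], parent_a[(cut + 1):p]))
}

sp_crossover(c(1, 0, 1, 1, 0), c(0, 1, 0, 0, 1))
```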

      Mutation Operator:

As before, after all chromosomes of the next generation have been obtained through elitism, selection and crossover, the mutation operation is performed randomly on the bits of the chromosome strings with probability $\rho_2 = 0.1$. When a bit of a chromosome mutates, its value is flipped from 0 to 1 or from 1 to 0.

      ​​Generation of Dataset and Implementation of Algorithms

We generated data and implemented our algorithms in the statistical software R. For implementing the genetic algorithms we have used the package GA.

      Implementing GA for Least Squares Estimation

We have generated samples of predictor variables, and of a response variable depending on those predictors through a particular model (the regression model mentioned above), for different values of $n$ and $p$. That is, we fixed the values of $\beta_0, \beta_1, \ldots, \beta_p$, generated $n$ observations for each of the predictors $x_1, \ldots, x_p$ independently from normal distributions, drew the error $\varepsilon$ from the standard normal distribution (so that $\varepsilon$ has mean 0) and generated the response $y$ by the model $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon$.
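For example, a dataset of this kind can be generated as follows (an illustrative sketch; the coefficient values and the distributions of the predictors are arbitrary choices, not the exact ones used for the results below):

```r
# Simulated data for the least squares GA: n observations on p predictors
set.seed(123)
n <- 15; p <- 5
beta0 <- 10
beta  <- c(37, -17, 8, 6, 0.2)                       # arbitrary illustrative coefficients
X   <- matrix(rnorm(n * p, mean = 5, sd = 2), n, p)  # predictors from normal distributions
eps <- rnorm(n)                                      # standard normal error (mean 0)
y   <- drop(beta0 + X %*% beta + eps)                # response generated by the linear model
```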

      Implementing Forward Selection and GA for Model Selection

We have generated samples of size $n$ for the $p$ predictors (from a multivariate normal distribution with a specific correlation structure) and for the error $\varepsilon$ (again from a normal distribution with mean 0), and generated the response $y$ as a linear combination of $m$ (known) of those $p$ predictors plus the error, so that the optimal model consisted of those predictors and we could check the performance of our algorithms.
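Predictors with a block correlation structure of the kind used in the Results section can be simulated roughly as follows (a sketch using MASS::mvrnorm; the block sizes, the correlation range and the response coefficients follow the description given later for dataset 1):

```r
library(MASS)   # for mvrnorm()

set.seed(321)
n <- 25; p <- 50; block <- 10
# Block-diagonal correlation matrix: five 10 x 10 blocks with entries in [0.4, 0.6]
Sigma <- diag(p)
for (b in seq(1, p, by = block)) {
  idx <- b:(b + block - 1)
  R <- matrix(runif(block^2, 0.4, 0.6), block, block)
  R <- (R + t(R)) / 2; diag(R) <- 1            # symmetrize and set unit diagonal
  Sigma[idx, idx] <- R
}
# If mvrnorm() reports that Sigma is not positive definite, project it first,
# e.g. Sigma <- as.matrix(Matrix::nearPD(Sigma, corr = TRUE)$mat)
X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)

# Response built from m = 5 of the 50 predictors plus error (as in dataset 1)
y <- 22 * X[, 1] + 14 * X[, 11] + 9 * X[, 21] + 15 * X[, 31] + 15 * X[, 41] + rnorm(n)
```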

      For implementing forward stepwise regression, we have used the "step" function in R.
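Continuing with the simulated y and X above, the forward search can be set up with step() roughly as follows (an illustrative sketch; the number of steps is capped so that the selected model keeps fewer predictors than observations, a simple stand-in for the modification described earlier):

```r
dat <- data.frame(y = y, X)                        # response plus the p candidate predictors

null_model <- lm(y ~ 1, data = dat)                # start from the intercept-only model
full_scope <- reformulate(setdiff(names(dat), "y"), response = "y")  # all p predictors

fwd <- step(null_model,
            scope     = full_scope,
            direction = "forward",
            k         = 2,                         # AIC penalty parameter
            steps     = nrow(dat) - 2,             # cap the number of added predictors
            trace     = 0)
formula(fwd)                                       # predictors retained by forward selection
```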

      RESULTS AND DISCUSSION

      Least Squares Estimation

First, we observed that tournament selection gives faster convergence than linear rank selection, so we used tournament selection for the larger values of $p$. For comparing the rates of convergence of the two selection operators, we used a dataset with $p = 1$. Both selection operators yielded stable estimates of the regression coefficients in every trial run, but the time to convergence varied from trial to trial. Even with the same initial population size, tournament selection converged much faster than linear rank selection in each trial, as shown in the table below. We also observed that the time to convergence under linear rank selection varied considerably from one trial to another, whereas for tournament selection it was much less variable.

Rate of Convergence of Different Selection Operators
Run    No. of Iterations (Linear Rank Selection)    No. of Iterations (Tournament Selection)
1      208                                          22
2      493                                          19
3      240                                          18

Then we gradually increased the value of $p$. We observed that, for $p > 5$, as $p$ increases, the minimum initial population size required for stable convergence of the GA estimates of the regression coefficients increases almost at an exponential rate.

      One example (for a particular dataset) is shown in the graph below. But this is the general trend observed.

[Figure: Minimum initial population size required for convergence for different values of p]

For this reason, and because of limitations of time and computing power, we could not go beyond $p = 10$. But up to that point, the GA gave remarkable results.

For a dataset with $p = 5$ and $n = 15$, the performance of the GA in estimating the regression coefficients is shown below.

• Population Size = 1,000

Comparison of GA Estimates with Exact Least Squares Estimates for p = 5
Exact Least Squares Estimates    GA Estimates
37.41                            37.41
-16.71                           -16.71
7.60                             7.60
5.84                             5.84
0.19                             0.19

        For another dataset with p = 10 and n = 15, the performance of GA in estimating regression coefficients is shown below.

• Population Size = 20,000

Comparison of GA Estimates with Exact Least Squares Estimates for p = 10
Exact Least Squares Estimates    GA Estimates
11.1                             11.1
2.8                              2.7
10.2                             10.5
5.3                              5.3
7.0                              6.9
18.7                             18.6
8.4                              8.3
13.3                             13.3
5.8                              5.8
9.9                              10.0

        So GA has given stable estimates of regression coefficients in all cases.

        Model Selection

We have generated datasets with $p = 50$ and $n = 25$. The 50 predictors are drawn from a multivariate normal distribution whose correlation matrix consists of five $10 \times 10$ block matrices, the entries of each block being randomly generated from the interval $[0.4, 0.6]$. Then we generated the response variable $y$ as a linear combination of only $m$ of the predictors plus error, with $m = 5$ and $m = 10$ respectively for the two datasets.

        Results for dataset 1:

Response (as generated): $y = 22x_1 + 14x_{11} + 9x_{21} + 15x_{31} + 15x_{41} + \varepsilon$

        Information of the most significant model:

• Predictors: $x_1, x_{11}, x_{21}, x_{31}, x_{41}$
        • AIC: 119.91

        Information of the Optimal Model selected by Forward Selection:

        • Selected Predictors:

$x_1, x_{41}, x_{31}, x_{11}, x_{21}, x_{26}, x_{25}, x_9, x_4, x_{30}, x_{48}, x_{42}, x_8, x_{28}, x_{15}, x_{27}, x_{16}, x_{17}, x_{49}, x_{36}, x_{22}, x_{32}, x_3$

        • AIC: -310.91

        Information of the Optimal Model Selected by GA :

        • Selected Predictors:

$x_1, x_2, x_3, x_4, x_5, x_{11}, x_{14}, x_{17}, x_{19}, x_{21}, x_{25}, x_{26}, x_{29}, x_{31}, x_{32}, x_{37}, x_{41}, x_{42}, x_{45}, x_{46}, x_{48}, x_{50}$

• AIC: -344.57

        Results for dataset 2:

Response (as generated): $y = 22x_1 + 16x_5 + 14x_{11} + 19x_{15} + 9x_{21} + 15x_{25} + 15x_{31} + 16x_{35} + 15x_{41} + 17x_{45} + \varepsilon$

        Information of the most significant model:

• Predictors: $x_1, x_5, x_{11}, x_{15}, x_{21}, x_{25}, x_{31}, x_{35}, x_{41}, x_{45}$
        • AIC: 64.35

        Information of the Optimal Model selected by Forward Selection:

        • Selected Predictors:

$x_1, x_{41}, x_{31}, x_{11}, x_{45}, x_{25}, x_4, x_{16}, x_{35}, x_{21}, x_{48}, x_{42}, x_{14}, x_{29}, x_{15}, x_5, x_6, x_{39}, x_{22}, x_{17}, x_{47}, x_{10}, x_{20}$

        • AIC: -324.12

        Information of the Optimal Model selected by GA:

        • Selected Predictors:

$x_1, x_5, x_6, x_{11}, x_{13}, x_{15}, x_{18}, x_{21}, x_{23}, x_{25}, x_{31}, x_{32}, x_{33}, x_{34}, x_{35}, x_{40}, x_{41}, x_{44}, x_{45}, x_{46}, x_{48}, x_{50}$

• AIC: -399.85

The optimal models selected by GA have a lower AIC than those selected by forward selection in both cases. It is striking that the AICs of the optimal models selected by both forward selection and GA make a huge jump (from positive to negative) relative to the AIC of the most significant model. The reason is that the SSEs of these optimal models are $< 10^{-5}$ (we know that, with the addition of predictors, the SSE always decreases) and, the AIC being proportional to the logarithm of the SSE (here $AIC = 25\ln(SSE) - 25\ln(25) + 2 \times (\text{No. of predictors})$), the AICs become large negative numbers.
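As a rough illustration of the size of this jump: with $n = 25$, $SSE \approx 10^{-5}$ and 22 selected predictors, $AIC \approx 25\ln(10^{-5}) - 25\ln(25) + 2 \times 22 \approx -287.8 - 80.5 + 44 \approx -324$, whereas an SSE of even moderate size keeps the AIC positive (for instance, $SSE = 10^{3}$ with $m = 5$ predictors gives roughly $172.7 - 80.5 + 10 \approx 102$).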

We observed that the optimal models selected by both forward selection and GA contain the significant predictors in both cases, but along with them they also contain other, non-significant predictors. The immediate reason is that those selected models have lower AICs than the models containing only the significant predictors. For computing the exact least squares estimates of the regression coefficients this is not a problem, as we now have a model that contains all the significant predictors and has fewer predictors than the sample size. Below we compare the least squares estimates of the coefficients of the significant predictors under the significant models and under the models selected by GA. It should also be noted that the coefficients of the non-significant predictors in the models selected by GA are negligible compared with those of the significant predictors (most of them are approximately zero).

Comparison of Least Squares Estimates of Significant Coefficients by different models for dataset 1
Significant Predictor    Estimate (Significant Model)    Estimate (Model selected by GA)
x1                       21.98                           22.52
x11                      14.08                           14.51
x21                      8.89                            9.31
x31                      14.87                           14.71
x41                      14.96                           15.28

Comparison of Least Squares Estimates of Significant Coefficients by different models for dataset 2
Significant Predictor    Estimate (Significant Model)    Estimate (Model selected by GA)
x1                       21.78                           21.66
x5                       16.17                           16.01
x11                      13.99                           14.05
x15                      19.30                           19.19
x21                      8.89                            8.78
x25                      15.32                           15.33
x31                      15.21                           15.34
x35                      13.89                           15.55
x41                      14.66                           15.26
x45                      15.89                           17.18

We then started to investigate why the selected models contain non-significant predictors, i.e. why their AICs are lower than those of the significant models. One possible reason is the moderate correlation between groups of predictors. In practical situations this is the most likely scenario, i.e. most of the time predictors happen to be correlated.

We surmised that increasing the value of the penalty for adding a predictor in the AIC, namely $k$, might lead to selection of only the significant predictors. So we varied the number of significant predictors $m$ and observed at what value of $k$ the desired model was the one selected by GA. We observed that when the number of significant predictors is small, a higher value of $k$ is needed, and as the number of significant predictors increases, the minimum required value of $k$ decreases almost steadily and slowly, for fixed $n$ and $p$. One such dependency (for a particular dataset) is shown in the graph below:

[Figure: k as a function of the number of significant predictors when n = 25, p = 50]

          CONCLUSIONS

We can implement the method of least squares using a Genetic Algorithm, which provides stable estimates of regression coefficients even if $n < p$.

For large values of $p$, the GA for estimating regression coefficients is computationally costly.

To reduce computational cost, we can apply model selection to lower the value of $p$, which also solves the problem of $n \ll p$.

          For model selection, GA performs better than the usual Forward Selection procedure.

With the usual AIC as the optimality criterion, model selection fails to select the most significant model, i.e. the model containing only the most significant predictors, when the predictors are moderately correlated.

Changing the value of the penalty parameter in the AIC allows model selection to obtain the most significant model.

To select the most significant models with different numbers of significant predictors $(m)$, we need to use different values of the penalty parameter $(k)$ in the AIC. Thus we can think of $k$ as a function of $m$, $n$ and $p$.

          FUTURE SCOPE

To reduce the computational cost of obtaining the least squares estimates of regression coefficients using GA. This could be done by using more precise initial intervals or by lowering the probability of crossover.

To explain more rigorously why the optimal model is not always the most significant model.

To explore what role correlations among predictors play in this context.

To estimate the aforesaid function $k(m, n, p)$.

          APPENDIX 1

Here we give the justification for taking the aforesaid intervals as the bounds of the $\hat\beta_j$'s.

Consider the regressions of $y$ and of $x_j$ on $x_1, x_2, \ldots, x_{j-1}, x_{j+1}, \ldots, x_p$ separately, and let the $\hat\alpha_k$'s and $\hat\gamma_k$'s, $k \in \{1, \ldots, j-1, j+1, \ldots, p\}$, be the estimates of the corresponding regression coefficients respectively. Also let $\hat y_i$ and $\hat x_{ji}$ be the values predicted by these regressions, for $i \in \{1, \ldots, n\}$.

Define $r_i = y_i - \hat y_i$ and $r_{ji} = x_{ji} - \hat x_{ji}$ for $i \in \{1, \ldots, n\}$. It can be proved that $\hat\beta_j = \rho_j \frac{sd(r)}{sd(r_j)}$, where $\rho_j$ is the partial correlation coefficient between $y$ and $x_j$. Also $sd(r) = \sqrt{1 - R_{-j}^2}\, sd(y)$, where $R_{-j}$ is the multiple correlation coefficient between $y$ and $x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_p$, and $sd(r_j) = \sqrt{1 - R_j^2}\, sd(x_j)$, where $R_j$ is the multiple correlation coefficient between $x_j$ and $x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_p$.

So $(\hat\beta_j)^2 = \rho_j^2 \frac{1 - R_{-j}^2}{1 - R_j^2} \frac{var(y)}{var(x_j)}$. Now, since $-1 \le \rho_j \le 1$ (it is, after all, a product moment correlation coefficient) and $R_j^2 \le R_{-j}^2$ (this holds because the $x_j$'s are uncorrelated, so that $R_j = 0$), we have $(\hat\beta_j)^2 \le \frac{var(y)}{var(x_j)}$, i.e., $-\frac{sd(y)}{sd(x_j)} \le \hat\beta_j \le \frac{sd(y)}{sd(x_j)}$.

The proofs of the results and theorems mentioned in Appendix 1 can be found in Gun A.M., Gupta M.K. and Dasgupta B. 2014. Fundamentals of Statistics, Vol. 1. The World Press Private Limited, Kolkata, pp. 285-291.

          ​​REFERENCES

Carr J. 2014. An Introduction to Genetic Algorithms. pp. 1-40.

Gun A.M., Gupta M.K. and Dasgupta B. 2014. Fundamentals of Statistics, Vol. 1. The World Press Private Limited, Kolkata, ch. 11-12.

Bhattachariya R.K. 2013. Introduction to Genetic Algorithms. IIT Guwahati.

Paterlini S. and Minerva T. 2010. Regression Model Selection Using Genetic Algorithm. Proceedings NN'10/EC'10/FS'10: Proceedings of the 11th WSEAS International Conference on Neural Networks, 11th WSEAS International Conference on Evolutionary Computing and 11th WSEAS International Conference on Fuzzy Systems, pp. 19-27.

Rawlings J. O., Pantula S. G. and Dickey D. A. 1998. Applied Regression Analysis: A Research Tool. Springer, ch. 3.

Broadhurst D., Goodacre R., Jones A., Rowland J. J. and Kell D. B. 1997. Genetic algorithms as a method for variable selection in multiple linear regression and partial least squares regression, with applications to pyrolysis mass spectrometry. Analytica Chimica Acta 348, pp. 71-86.

Wallet B. C., Marchette D. J., Solka J. L. and Wegman E. J. 1996. A Genetic Algorithm for Best Subset Selection in Linear Regression. Computing Science and Statistics, Vol. 28: Graph-Image-Vision, Proceedings of the 28th Symposium on the Interface, Sydney, Australia.

Karr C. L., Stanley D.A. and Scheiner B.J. 1991. Genetic Algorithm applied to Least Squares Curve Fitting. Report of Investigations 9339, United States Department of the Interior and United States Bureau of Mines.

          ACKNOWLEDGEMENTS

I am grateful to the Indian Academy of Sciences, the Indian National Science Academy and The National Academy of Sciences, India for providing me this opportunity to carry out this project. I express my sincere thanks to Professor Partha P. Majumder, Distinguished Professor, National Institute of Biomedical Genomics, Kalyani, for his constant advice and guidance throughout my project, which made my summer days fruitful and memorable. I would also like to thank all the faculty members, research scholars, data scientists, trainees and other employees and staff of NIBMG, Kalyani for helping me to complete my work. It is my privilege to express my gratitude to Dr. Arnab Chakraborty, Assistant Professor, Applied Statistics Unit, Indian Statistical Institute, for providing my letter of recommendation to IAS, INSA and NAS. I am thankful to AuthorCafe for providing me such a great platform to write my report and to have my work archived online.
