
Summer Research Fellowship Programme of India's Science Academies

POPULATION GENETIC ANALYSIS OF DNA SEQUENCE VARIATIONS

Pratyashee Ojah

Department of Statistics, Cotton University, Panbazar, Guwahati 781009

Guided by:

Prof. Partha P. Majumder

Distinguished Professor, National Institute of Biomedical Genomics, Kalyani, West Bengal 741251

Abstract

Genes are the basis of inheritance and underlie the heritable characteristics of an organism. One of the salient findings of the Human Genome Project is that about 99.9% of the human genome is identical in all individuals; the remaining 0.1% of variable DNA makes every individual unique. The study of genetic variation among individuals of populations is of great importance for understanding commonality of ancestries, identifying genetic processes that have shaped genetic variation, discovering disease-causing genes, etc. Genome-wide data on variation among individuals have been generated from many global populations under the “1000 Genomes Project,” an international research effort to establish a detailed catalogue of human genetic variation. Such datasets can be analysed efficiently with the aid of computational and statistical methodologies. Statistical Genetics is the field in which methodologies are developed for drawing inferences from genomic data. This report describes the structure of the populations under study and their genetic relationships, and also the differences in the structure of the genomes across two chromosomes.

Keywords: Chromosome, Gene, Genome, Single nucleotide polymorphism, 1000 Genomes Project, Population structure, Statistical test of significance

Abbreviations

DNA: Deoxyribonucleic acid
mRNA: messenger RNA
IGSR: International Genome Sample Resource
d.f.: degrees of freedom
l.o.s.: level of significance
HWE: Hardy-Weinberg Equilibrium

INTRODUCTION

The 1000 Genomes Project ran between 2008 and 2015, creating the largest public catalogue of human genetic variation and genotype data. It aimed to provide a deep characterisation of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. The International Genome Sample Resource (IGSR), funded by the Wellcome Trust, was set up to maintain and expand the resource, to ensure the future accessibility and usability of the dataset, and to extend it with new populations that were not previously part of the 1000 Genomes Project. The final analysis of the 1000 Genomes Project, as of October 2015, incorporates 26 populations from Asia (South Asian and East Asian), Africa, Europe and the Americas. In this study, six populations are considered, viz. three South Asian (SAS) populations: Gujarati Indian from Houston, Texas (GIH), Indian Telugu from the UK (ITU) and Bengali from Bangladesh (BEB); and three European (EUR) populations: Finnish in Finland (FIN), Iberian population in Spain (IBS) and Toscani in Italia (TSI).

This study is a part of Population Genetics, which deals with genetic differences within and between populations. Statistical methodologies such as data cleaning, testing of hypotheses, chi-square applications and graphical representation have been used to gain analytical insight into the data. The dataset for this project, downloaded from the 1000 Genomes website, pertains to DNA variations on chromosome 1 (a long chromosome) and chromosome 21 (a short chromosome) for the six populations under study.

The rationale behind this study is to analyse the genotype data of the selected populations using statistical methodologies, study the various characteristics of the data, test relevant hypotheses that could answer questions about variation at the genetic level, and infer the structure of genomes across the populations for the two chromosomes.

METHODOLOGY

The project makes a comparative study of three South Asian and three European populations. The steps mentioned below are followed for these two sets of populations:

1. DATA COLLECTION

Data have been downloaded from the 1000 Genomes website pertaining to DNA variations on Chromosome 1 & 21 for the six populations. These data contain the genotype counts of 3000 genetic loci for each population.

Numbers of individuals for each population differ and are tabulated below: 

2. IDENTIFICATION OF DATA FOR ANALYSIS

This step prepares the data for further analysis. The study is based on biallelic loci: a locus is biallelic in a population if exactly two alleles occur at it. The downloaded data also contain multiallelic sites and multi-nucleotide variants; these are removed, and the remaining loci are carried forward for analysis.

3. ANALYSIS OF MONOMORPHIC LOCI

In a population, if only one allele occurs at a site or locus, it is monomorphic or monoallelic in that population. In this study, a locus at which an allele has frequency less than 0.01 or greater than 0.99 is treated as monomorphic; otherwise, it is polymorphic. A monomorphic locus occurs in the same form across the population and thus does not account for any variation, so these loci are excluded from the study. However, an analysis can be made of the distribution of the monomorphic loci in one population with respect to the other populations.
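The classification rule above can be sketched in Python. This is a minimal illustration, assuming each locus is summarised by three genotype counts (reference homozygote, heterozygote, alternate homozygote); the function names and the example counts are hypothetical and are not taken from the project's own scripts.

```python
def alt_allele_frequency(n_ref_hom, n_het, n_alt_hom):
    """Frequency of the alternate allele from genotype counts at one biallelic locus."""
    total_alleles = 2 * (n_ref_hom + n_het + n_alt_hom)
    if total_alleles == 0:
        raise ValueError("no genotyped individuals at this locus")
    # Each heterozygote carries one alternate allele, each alternate homozygote two.
    return (n_het + 2 * n_alt_hom) / total_alleles


def classify_locus(freq, threshold=0.01):
    """Label a locus using the 0.01 / 0.99 frequency cut-offs stated in the text."""
    if freq < threshold or freq > 1 - threshold:
        return "monomorphic"
    return "polymorphic"


# Hypothetical genotype counts for three loci in one population.
print(classify_locus(alt_allele_frequency(100, 0, 0)))   # no alternate allele observed
print(classify_locus(alt_allele_frequency(60, 30, 10)))  # alternate allele frequency 0.25
print(classify_locus(alt_allele_frequency(0, 1, 99)))    # alternate allele nearly fixed
```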

4. IDENTIFICATION OF POLYMORPHIC LOCI

The loci with variable proportions of both the alleles are polymorphic in nature. These loci provide information about genetic variation and are of great importance. The loci which are polymorphic across all the three populations are selected for the next step of the study. 
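Selecting the loci that are polymorphic in every population of a set amounts to a set intersection. The sketch below assumes the per-population labels are held in dictionaries keyed by a locus identifier; the identifiers and labels shown are made up for illustration.

```python
def common_polymorphic(status_by_population):
    """Return the loci labelled 'polymorphic' in every population.

    status_by_population maps a population code to a dict of
    locus identifier -> 'monomorphic' or 'polymorphic'.
    """
    polymorphic_sets = [
        {locus for locus, status in statuses.items() if status == "polymorphic"}
        for statuses in status_by_population.values()
    ]
    if not polymorphic_sets:
        return set()
    return set.intersection(*polymorphic_sets)


# Hypothetical labels for three loci in the three SAS populations.
labels = {
    "BEB": {"locus1": "polymorphic", "locus2": "polymorphic", "locus3": "monomorphic"},
    "GIH": {"locus1": "monomorphic", "locus2": "polymorphic", "locus3": "monomorphic"},
    "ITU": {"locus1": "polymorphic", "locus2": "polymorphic", "locus3": "polymorphic"},
}
shared = common_polymorphic(labels)
```

Only `locus2` is polymorphic in all three hypothetical populations, so it is the only locus that would be carried forward.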

5. TESTING FOR HARDY-WEINBERG EQUILIBRIUM USING CHI-SQUARE TEST FOR GOODNESS OF FIT

The Hardy-Weinberg Principle states that, in the absence of evolutionary influences, the allele and genotype frequencies in a population remain constant from generation to generation. One of its assumptions is that mating is random, so that the choices of male and female gametes are independent trials. Thus, it is based on the outcome of repeated and independent trials. One of the most important implications of the Hardy-Weinberg Principle is that we can calculate the expected genotype frequencies for future generations. However, in practice, various evolutionary influences such as natural selection, mutation and gene flow cause departures from the Hardy-Weinberg Principle. So, it is necessary to measure the deviation from the equilibrium and assess the validity of the model. We use the chi-square test of goodness of fit for this purpose.

Suppose that, in a population of N individuals, we study a locus with two alleles A and a; the possible genotypes are then AA, Aa and aa.

Let the observed genotype frequencies (counts) be f(AA), f(Aa) and f(aa). Let the probabilities of occurrence of A and a at the locus be p and q respectively, such that p + q = 1.

According to Hardy-Weinberg Principle, the probabilities p and q are constant from generation to generation. So, for homozygotes, the probability of two A-bearing gametes coming together is

$$P(AA) = p \times p = p^2$$

and for a-bearing gametes the probability is

$$P(aa) = q \times q = q^2.$$

For heterozygotes, the probability is

$$P(Aa) = (p \times q) + (q \times p) = 2pq.$$

Therefore, the expected genotype frequencies are given as:

$$f_E(AA) = p^2 N$$

$$f_E(Aa) = 2pqN$$

$$f_E(aa) = q^2 N$$
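As a small worked illustration of these formulas, the sketch below estimates p from observed genotype counts by allele counting and returns the expected counts. The function name and the counts (30, 40, 30) are hypothetical.

```python
def hwe_expected_counts(n_AA, n_Aa, n_aa):
    """Expected genotype counts (AA, Aa, aa) under Hardy-Weinberg Equilibrium."""
    n = n_AA + n_Aa + n_aa
    if n == 0:
        raise ValueError("no individuals observed")
    # Allele counting: each AA individual carries two copies of A, each Aa one.
    p = (2 * n_AA + n_Aa) / (2 * n)
    q = 1.0 - p
    return p * p * n, 2 * p * q * n, q * q * n


# With 30 AA, 40 Aa and 30 aa individuals, p = q = 0.5 and N = 100.
expected = hwe_expected_counts(30, 40, 30)
```

For these counts the expected values are 25, 50 and 25, which sum to N = 100 as required.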

Chi-square test of goodness of fit enables us to find if the deviation of the experiment from theory is just by chance or due to inadequacy of the theory to fit the observed data.

For a particular genetic locus, let $O_i$ be the observed genotype frequency and $E_i$ be the expected genotype frequency.

We are to test the null hypothesis,

Ho: The genetic locus under study is in Hardy-Weinberg Equilibrium,

against the alternative hypothesis,

H1: The genetic locus under study is not in Hardy-Weinberg Equilibrium. 

We can tabulate the data as:

Observed and Expected Genotype Frequencies

| GENOTYPE | $O_i$ | $E_i$ |
|---|---|---|
| AA | $f(AA)$ | $f_E(AA)$ |
| Aa | $f(Aa)$ | $f_E(Aa)$ |
| aa | $f(aa)$ | $f_E(aa)$ |
| Total | N | N |

The test statistic is

$$\chi^2 = \sum_{i=1}^{3}\frac{(O_i-E_i)^2}{E_i}, \qquad \sum_{i=1}^{3}O_i = \sum_{i=1}^{3}E_i = N,$$

where $\chi^2$ follows a chi-square distribution. The d.f. associated with $\chi^2$ is given by

n= number of independent observations – number of parameters estimated from the data - 1

Here, the number of independent observations is 3, as there are three genotype frequencies. Number of estimated parameters is 1, as we have estimated p from the observed data.

Therefore, d.f. is n=3-1-1=1

The p value associated with $\chi^2$ is the probability, under the null hypothesis, that chance alone produces a deviation between the observed and expected values at least as large as the one observed. A larger p value implies that chance alone could account for the deviation, so the data give no reason to doubt the model. A smaller p value indicates that the deviation is unlikely to be due to chance alone, and the validity of the model is undermined.

For testing the Hardy-Weinberg Principle, we use a significance threshold of 0.0001. If the p value is less than 0.0001, we reject the null hypothesis and conclude that the particular locus is not in Hardy-Weinberg Equilibrium. If the p value is greater than 0.0001, we do not reject the null hypothesis and treat the locus as being in Hardy-Weinberg Equilibrium.
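The test described in this section can be sketched with the Python standard library alone. The p value uses the fact that, for 1 d.f., the upper tail of the chi-square distribution equals erfc(sqrt(χ²/2)). The function name and the genotype counts are hypothetical, and the handling of a p value exactly equal to the threshold is an assumption, since the text does not specify it.

```python
import math


def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square goodness-of-fit test of HWE at one biallelic locus (1 d.f.)."""
    n = n_AA + n_Aa + n_aa
    if n == 0:
        raise ValueError("no individuals observed")
    p = (2 * n_AA + n_Aa) / (2 * n)  # frequency of allele A estimated from the data
    q = 1.0 - p
    observed = (n_AA, n_Aa, n_aa)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    if min(expected) == 0:
        raise ValueError("test undefined when only one allele is present")
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper-tail probability of the chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: 30 AA, 40 Aa, 30 aa.
chi2, p_value = hwe_chi_square(30, 40, 30)
# Assumption: a p value equal to the threshold is treated as "not rejected".
in_hwe = p_value >= 0.0001
```

For these counts the expected values are 25, 50 and 25, so χ² = 1 + 2 + 1 = 4 and the p value is about 0.0455; at the 0.0001 threshold the null hypothesis would not be rejected.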

In this project, testing of Hardy Weinberg Equilibrium has been done for all the six populations. The loci which follow the principle and are common to all the three populations (BEB, GIH & ITU in SAS and FIN, IBS & TSI in EUR) are carried forward for further analysis.

6. TEST OF INDEPENDENCE OF GENOTYPE FREQUENCIES USING CONTINGENCY TABLES

Hypothesis testing based on a contingency table examines the relationship between its row and column attributes. Suppose we have an r x s contingency table, i.e., a table with r rows and s columns for two attributes. We test whether the distribution over the column attribute differs between the rows.

Suppose we need to test whether the distribution of the genotypes G1, G2 and G3 is the same across a set of given populations P1, P2 and P3. Let I1, I2 and I3 be the numbers of individuals from the populations P1, P2 and P3 respectively. Here, the genotype of an individual and the population to which the individual belongs are the two attributes, and under the null hypothesis they are independent.

i.e., the null hypothesis to be tested at each genetic locus is:

Ho: There is no significant difference in the distribution of genotype frequencies across the three populations.

Against the alternative hypothesis,

H1: There is significant difference in the distribution of genotype frequencies across the three populations. 

So, we get a 3x3 contingency table as follows:

Contingency table

| Individuals \ Genotype | G1 | G2 | G3 | Total |
|---|---|---|---|---|
| I1 | (I1G1) | (I1G2) | (I1G3) | (I1) |
| I2 | (I2G1) | (I2G2) | (I2G3) | (I2) |
| I3 | (I3G1) | (I3G2) | (I3G3) | (I3) |
| Total | (G1) | (G2) | (G3) | M |

where, (Ii) = number of individuals in the ith population

(Gj) = number of individuals possessing the jth genotype

(Ii Gj) = number of individuals with jth genotype from the ith population

M = total frequency

These are the observed frequencies.

Under the null hypothesis, the attributes are independent, so the expected frequencies are:

$P[I_i]$ = probability that an individual is from the ith population = $\frac{(I_i)}{M}$

$P[G_j]$ = probability that an individual possesses the jth genotype = $\frac{(G_j)}{M}$

$P[I_iG_j]$ = probability that an individual has the jth genotype and belongs to the ith population = $\frac{(I_iG_j)}{M}$

Since $I_i$ and $G_j$ are independent,

$$P[I_iG_j] = P[I_i] \cdot P[G_j] = \frac{(I_i)}{M} \cdot \frac{(G_j)}{M}$$

$(I_iG_j)_E$ = expected number of individuals with the jth genotype from the ith population

$$(I_iG_j)_E = M \cdot \frac{(I_i)}{M} \cdot \frac{(G_j)}{M} = \frac{(I_i)(G_j)}{M}$$

The test statistic is

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{s}\frac{\left[(I_iG_j)-(I_iG_j)_E\right]^2}{(I_iG_j)_E}$$

where $\chi^2$ follows a chi-square distribution. The d.f. associated with $\chi^2$ is given by

n = (r-1)(s-1)

For a 3 x 3 contingency table, the d.f. is (3-1)(3-1) = 4.
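The contingency test can be sketched in Python as follows. To stay within the standard library, the p value uses the closed-form upper tail of the chi-square distribution that exists when the d.f. is even (which covers the 4 d.f. of a 3 x 3 table); this restriction is a simplification made for the sketch. The function names and the example table are hypothetical.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """Upper-tail probability of a chi-square variable when df is a positive even number."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form used here needs a positive even d.f.")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j  # builds (x/2)^j / j!
        total += term
    return math.exp(-half) * total


def contingency_test(table):
    """Chi-square test of independence for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    if min(row_totals) == 0 or min(col_totals) == 0:
        raise ValueError("every row and column needs a non-zero total")
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, chi2_upper_tail_even_df(chi2, df)


# Hypothetical genotype counts (G1, G2, G3) for three populations.
chi2, df, p_value = contingency_test([[10, 20, 30], [10, 20, 30], [10, 20, 30]])
```

In this made-up table the three populations have identical genotype counts, so every observed count equals its expected count, χ² is 0 and the p value is 1.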

We test the hypothesis at the 5% l.o.s., with the critical p-value adjusted using the Bonferroni correction.

WHY BONFERRONI CORRECTION?

Bonferroni correction is an adjustment made to the critical p value when many tests are performed: the level of significance is divided by the number of tests. In genetic studies, a large number of dependent or independent statistical tests are carried out on one dataset. Suppose we test one million SNPs, each at a level of significance of 0.05. Even if no SNP has any real effect, we would expect about 50,000 SNPs to appear significant by chance alone. The chance of obtaining false positive results therefore increases with the number of tests.

With this correction, only the observations with larger deviations are declared significant. However, smaller real effects escape notice, since they are not rejected along with the loci that have no effect.

Thus, if for a certain genetic locus the p value is less than the critical p value, the null hypothesis is rejected and we can conclude that there is a significant difference in the distribution of genotype frequencies across the three populations. If the p value is more than the critical p value, the null hypothesis is not rejected and it can be inferred that there is no significant difference in the distribution of genotype frequencies across the three populations.
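A minimal sketch of the correction in Python is given below. The function names, locus identifiers and p values are hypothetical; the number of tests is taken to be the number of loci tested.

```python
def bonferroni_threshold(alpha, n_tests):
    """Critical p value after Bonferroni correction for n_tests tests."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def significant_loci(p_values, alpha=0.05):
    """Loci whose p value falls below the Bonferroni-corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [locus for locus, p in p_values.items() if p < threshold]


# Hypothetical p values from two tests: the corrected threshold is 0.05 / 2 = 0.025.
hits = significant_loci({"locusA": 0.00001, "locusB": 0.03})
```

Here `locusB` would be significant at an uncorrected 5% level but is not declared significant once the threshold is divided by the number of tests.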

 7. HISTOGRAM OF ALLELE FREQUENCIES

In order to study the allele frequency distribution across different populations, the alternate allele is considered and the allele frequency distribution for both the chromosomes is plotted as a histogram. The allele frequencies are along the X-axis and their frequency of occurrence is along Y-axis. This helps to compare the structure of the genotype composition of the different populations.
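The counting step behind such a histogram can be sketched as below; the plotting itself is omitted. The choice of ten equal-width bins, the function name and the sample frequencies are assumptions made for illustration.

```python
def frequency_histogram(allele_freqs, n_bins=10):
    """Count how many loci fall into each equal-width allele-frequency bin on [0, 1]."""
    counts = [0] * n_bins
    for freq in allele_freqs:
        if not 0.0 <= freq <= 1.0:
            raise ValueError("allele frequencies must lie between 0 and 1")
        # A frequency of exactly 1.0 is placed in the last bin.
        index = min(int(freq * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical alternate-allele frequencies at four loci.
bins = frequency_histogram([0.05, 0.15, 0.15, 1.0])
```

The resulting list gives the bar heights along the Y-axis, with the bins ordered by allele frequency along the X-axis.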

8. z- TEST FOR DIFFERENCE OF PROPORTIONS

 We apply the z-test for difference of proportions to test the following null hypothesis,

Hypothesis 1

Ho: There is no significant difference in the distribution of polymorphic loci in the two sets of population across chromosome 1 and 21.

Against the alternative hypothesis,

H1: There is significant difference in the distribution of polymorphic loci in the two sets of population across chromosome 1 and 21.

We also apply the z-test for difference of proportions to test the null hypothesis,

Hypothesis 2

Ho: There is no significant difference in the distribution of polymorphic loci in the two sets of population for a particular chromosome.

Against the alternative hypothesis,

H1: There is significant difference in the distribution of polymorphic loci in the two sets of population for a particular chromosome.

The test statistic is given by,

$$z = \frac{\left|\overline{p}_1-\overline{p}_2\right|}{\sqrt{\dfrac{\overline{p}(1-\overline{p})}{n}}}$$

where $\overline{p}_1$ and $\overline{p}_2$ are the sample proportions, n is the sample size, $\overline{p}$ is the pooled sample proportion, and $z \sim N(0,1)$.
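The statistic can be sketched in Python as follows. How n is counted is not spelled out in the text, so `z_as_written` assumes that n is the combined number of observations in the two samples; that reading, and both function names, are assumptions. The more common textbook form of the pooled two-proportion test, which uses 1/n1 + 1/n2 in place of 1/n, is included only for comparison.

```python
import math


def z_as_written(x1, n1, x2, n2):
    """z statistic following the formula in the text, with n assumed to be n1 + n2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    n = n1 + n2  # assumption: n is the combined sample size
    return abs(p1 - p2) / math.sqrt(pooled * (1 - pooled) / n)


def z_textbook(x1, n1, x2, n2):
    """Common textbook pooled two-proportion z statistic, shown for comparison."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    return abs(p1 - p2) / math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))


# Counts from the SAS comparison in this report: 734 polymorphic calls out of
# 3 x 2867 on chromosome 1 and 970 out of 3 x 2897 on chromosome 21.
z_report_form = z_as_written(734, 3 * 2867, 970, 3 * 2897)
z_common_form = z_textbook(734, 3 * 2867, 970, 3 * 2897)
```

With nearly equal sample sizes, the form written in the text gives a value roughly twice that of the textbook form, so the two should not be compared against the same critical value without care.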

The calculations, statistical tests of significance and graph plots for the selected loci have been done using R programming and MS Excel.

RESULTS AND DISCUSSION

SAS POPULATION

After removing the multiallelic and multi-nucleotide variants, we have 2867 out of 3000 loci in chromosome 1 and 2897 out of 3000 loci in chromosome 21 which are biallelic and the analysis is carried forward with these loci.

1)    Analysis of monomorphic and polymorphic loci

CHROMOSOME 1 (2867 loci under study)

BEB population

Monomorphic loci: 2608 (90.97%)

Distribution of Monomorphic loci of BEB population w.r.t. GIH & ITU population

| GIH | ITU | Count |
|---|---|---|
| Monomorphic | Monomorphic | 2557 (98.04%) |
| Monomorphic | Polymorphic | 12 (0.46%) |
| Polymorphic | Monomorphic | 10 (0.38%) |
| Polymorphic | Polymorphic | 29 (1.11%) |

Polymorphic loci: 259 (9.03%)

GIH population

Monomorphic loci: 2616 (91.24%)

Distribution of Monomorphic loci of GIH population w.r.t. BEB & ITU population

| BEB | ITU | Count |
|---|---|---|
| Monomorphic | Monomorphic | 2557 (97.74%) |
| Monomorphic | Polymorphic | 10 (0.38%) |
| Polymorphic | Monomorphic | 37 (1.41%) |
| Polymorphic | Polymorphic | 12 (0.46%) |

Polymorphic loci: 251 (8.76%)

ITU population

Monomorphic loci: 2643 (92.2%)

Distribution of Monomorphic loci of ITU population w.r.t. BEB & GIH population

| BEB | GIH | Count |
|---|---|---|
| Monomorphic | Monomorphic | 2557 (96.75%) |
| Monomorphic | Polymorphic | 29 (1.10%) |
| Polymorphic | Monomorphic | 37 (1.40%) |
| Polymorphic | Polymorphic | 20 (0.76%) |

Polymorphic loci: 224 (7.8%)

CHROMOSOME 21 (2897 loci under study)

BEB population

Monomorphic loci: 2571 (88.74%)

Distribution of Monomorphic loci of BEB population w.r.t. GIH & ITU population

| GIH | ITU | Count |
|---|---|---|
| Monomorphic | Monomorphic | 2543 (98.91%) |
| Monomorphic | Polymorphic | 15 (0.58%) |
| Polymorphic | Monomorphic | 8 (0.31%) |
| Polymorphic | Polymorphic | 5 (0.19%) |

Polymorphic loci: 326 (11.26%)

GIH population

Monomorphic loci: 2579 (89%)

Distribution of Monomorphic loci of GIH population w.r.t. BEB & ITU population

| BEB | ITU | Count |
|---|---|---|
| Monomorphic | Monomorphic | 2543 (98.60%) |
| Monomorphic | Polymorphic | 15 (0.58%) |
| Polymorphic | Monomorphic | 15 (0.58%) |
| Polymorphic | Polymorphic | 6 (0.23%) |

Polymorphic loci: 318 (11%)

 ITU population

Monomorphic loci: 2571 (88.74%)

Distribution of Monomorphic loci of ITU population w.r.t. BEB & GIH population

| BEB | GIH | Count |
|---|---|---|
| Monomorphic | Monomorphic | 2543 (98.91%) |
| Monomorphic | Polymorphic | 8 (0.31%) |
| Polymorphic | Monomorphic | 15 (0.58%) |
| Polymorphic | Polymorphic | 5 (0.19%) |

 Polymorphic loci: 326 (11.26%)

 2)    Number of Polymorphic loci common to BEB , GIH & ITU Populations

| Chromosome | Common Polymorphic loci |
|---|---|
| Chromosome 1 | 190 (6.62%) |
| Chromosome 21 | 300 (10.35%) |

 3)    Number of loci that are in Hardy Weinberg Equilibrium for all the three populations

| Chromosome | Loci following HWE |
|---|---|
| Chromosome 1 | 150 (78.94%) |
| Chromosome 21 | 140 (46.67%) |

4)    Contingency test is applied for the loci obtained under step (3)

The critical p value is determined using Bonferroni Correction.

p=0.05/150=0.0003 (for Chromosome 1)

p=0.05/140=0.0003 (for Chromosome 21) 

After the contingency test it is found that there are 4 loci in Chromosome 1 and 21 locus in Chromosome 21 for which p < 0.0003.

These loci have significant difference in the genotype frequency distribution across the three populations. 

EUR POPULATION 

After removing the multiallelic and multi-nucleotide variants, we have 2867 out of 3000 loci in chromosome 1 and 2893 out of 3000 loci in chromosome 21 which are biallelic and the analysis is carried forward with these loci.

1)    Analysis of monomorphic and polymorphic loci 

CHROMOSOME 1 (2867 loci under study)

FIN population

Monomorphic loci: 2620 (91.38%)

Distribution of Monomorphic loci of FIN population w.r.t. IBS & TSI population

| IBS | TSI | Count |
|---|---|---|
| Monomorphic | Monomorphic | 2577 (98.35%) |
| Monomorphic | Polymorphic | 11 (0.42%) |
| Polymorphic | Monomorphic | 21 (0.81%) |
| Polymorphic | Polymorphic | 11 (0.42%) |

Polymorphic loci: 247 (8.62%)

IBS population

Monomorphic loci: 2638 (92.01%)

Distribution of Monomorphic loci of IBS population w.r.t. FIN & TSI population

| FIN | TSI | Count |
|---|---|---|
| Monomorphic | Monomorphic | 2577 (97.69%) |
| Monomorphic | Polymorphic | 11 (0.42%) |
| Polymorphic | Monomorphic | 38 (1.44%) |
| Polymorphic | Polymorphic | 12 (0.45%) |

Polymorphic loci: 229 (7.99%)

TSI population

Monomorphic loci: 2645 (92.26%)

Distribution of Monomorphic loci of TSI population w.r.t. FIN & IBS population

| FIN | IBS | Count |
|---|---|---|
| Monomorphic | Monomorphic | 2577 (97.42%) |
| Monomorphic | Polymorphic | 20 (0.76%) |
| Polymorphic | Monomorphic | 40 (1.52%) |
| Polymorphic | Polymorphic | 8 (0.30%) |

Polymorphic loci: 222 (7.74%)

CHROMOSOME 21 (2893 loci under study)

FIN population

Monomorphic loci: 2581 (89.21%)

Distribution of Monomorphic loci of FIN population w.r.t. IBS & TSI population

| IBS | TSI | Count |
|---|---|---|
| Monomorphic | Monomorphic | 2541 (98.45%) |
| Monomorphic | Polymorphic | 12 (0.46%) |
| Polymorphic | Monomorphic | 10 (0.39%) |
| Polymorphic | Polymorphic | 18 (0.7%) |

Polymorphic loci: 312 (10.79%)

IBS population

Monomorphic loci: 2570 (88.83%)

Distribution of Monomorphic loci of IBS population w.r.t. FIN & TSI population

| FIN | TSI | Count |
|---|---|---|
| Monomorphic | Monomorphic | 2541 (98.87%) |
| Monomorphic | Polymorphic | 12 (0.47%) |
| Polymorphic | Monomorphic | 15 (0.58%) |
| Polymorphic | Polymorphic | 2 (0.08%) |

Polymorphic loci: 323 (11.17%)

 TSI population

Monomorphic loci: 2574 (88.97%)

Distribution of Monomorphic loci of TSI population w.r.t. FIN & IBS population

| FIN | IBS | Count |
|---|---|---|
| Monomorphic | Monomorphic | 2541 (98.72%) |
| Monomorphic | Polymorphic | 10 (0.38%) |
| Polymorphic | Monomorphic | 15 (0.58%) |
| Polymorphic | Polymorphic | 8 (0.32%) |

 Polymorphic loci: 319 (11.03%)

2)    Number of Polymorphic loci common to FIN, IBS & TSI Populations

| Chromosome | Common Polymorphic loci |
|---|---|
| Chromosome 1 | 188 (6.55%) |
| Chromosome 21 | 286 (9.88%) |

3)    Number of loci that are in Hardy Weinberg Equilibrium for all the three populations

| Chromosome | Loci following HWE |
|---|---|
| Chromosome 1 | 134 (71.27%) |
| Chromosome 21 | 128 (44.75%) |

4)    Contingency test is applied for the loci obtained under the preceding step

The critical p value is determined using Bonferroni Correction.

p =0.05/134=0.0004 (for Chromosome 1)

p=0.05/128=0.0004 (for Chromosome 21)

 After the contingency test it is found that there is 1 locus in each Chromosome 1 and 21 for which p < 0.0004.

These loci have significant difference in the genotype frequency distribution across the three populations.

5)    z-TEST FOR DIFFERENCE OF PROPORTIONS

For hypothesis 1:

We consider the polymorphic loci in SAS population chromosome 1(p1) & 21(p2).

Polymorphic loci in SAS population chromosome 1 (p1) & 21 (p2)

| Population | Chromosome 1 | Chromosome 21 |
|---|---|---|
| BEB | 259 | 326 |
| GIH | 251 | 318 |
| ITU | 224 | 326 |

From the data we get,

n1 = 2867, n2 = 2897

The sample proportion $\overline{p}_1 = \frac{259+251+224}{3 \times 2867} = 0.085339$

The sample proportion $\overline{p}_2 = \frac{326+318+326}{3 \times 2897} = 0.11161$

The pooled proportion $\overline{p} = \frac{(3 \times n_1 \times \overline{p}_1) + (3 \times n_2 \times \overline{p}_2)}{3 \times (n_1 + n_2)} = \frac{734+970}{17292} = 0.0985$

The value of calculated z is,

z= 11.5931

The critical value of z is 3.3459. Since the calculated z is greater than the critical value, it is significant. Hence, we reject the null hypothesis and conclude that there is a significant difference in the distribution of polymorphic loci of the SAS population across chromosomes 1 and 21.

Again, we consider the polymorphic loci in the EUR population for Chromosome 1 (q1) and 21 (q2).

Polymorphic loci in EUR population for Chromosome 1 (q1) and 21 (q2)

| Population | Chromosome 1 | Chromosome 21 |
|---|---|---|
| FIN | 247 | 312 |
| IBS | 229 | 323 |
| TSI | 222 | 319 |

From the data we get,

m1 = 2867, m2 = 2897

The sample proportion $\overline{q}_1 = \frac{247+229+222}{3 \times 2867} = 0.08115$

The sample proportion $\overline{q}_2 = \frac{312+323+319}{3 \times 2897} = 0.1099$

The pooled proportion $\overline{q} = \frac{(3 \times m_1 \times \overline{q}_1) + (3 \times m_2 \times \overline{q}_2)}{3 \times (m_1 + m_2)} = \frac{698+955}{17292} = 0.0956$

The value of calculated z is,

z= 26.99

The critical value of z at 5% l.o.s. is 3.3459. Since the calculated z is greater than the critical value, it is significant. Hence, we reject the null hypothesis and conclude that there is a significant difference in the distribution of polymorphic loci of the EUR population across chromosomes 1 and 21.

In hypothesis 2,

For Chromosome 1, let a1 and a2 denote the polymorphic loci of SAS and EUR populations.

From the data we get,

n1 = 2867, m1 = 2867

The sample proportion $\overline{a}_1 = \frac{259+251+224}{3 \times 2867} = 0.0853$

The sample proportion $\overline{a}_2 = \frac{247+229+222}{3 \times 2867} = 0.08115$

The pooled proportion $\overline{a} = \frac{(3 \times n_1 \times \overline{a}_1) + (3 \times m_1 \times \overline{a}_2)}{3 \times (n_1 + m_1)} = \frac{734+698}{17202} = 0.083$

The value of calculated z is,

z= 2.0047

The critical value of z at 5% l.o.s. is 3.3459. Since the calculated z is less than the critical value, it is not significant. Hence, we do not reject the null hypothesis and conclude that there is no significant difference in the distribution of polymorphic loci of chromosome 1 across the SAS and EUR populations.

 For Chromosome 21, let b1 and b2 denote the polymorphic loci of SAS and EUR populations.

Polymorphic loci of SAS and EUR populations

| SAS | EUR |
|---|---|
| 326 | 312 |
| 318 | 323 |
| 326 | 319 |

 From the data we get,

n2 = 2897, m2 = 2897

The sample proportion $\overline{b}_1 = \frac{326+318+326}{3 \times 2897} = 0.11161$

The sample proportion $\overline{b}_2 = \frac{312+323+319}{3 \times 2897} = 0.1099$

The pooled proportion $\overline{b} = \frac{(3 \times n_2 \times \overline{b}_1) + (3 \times m_2 \times \overline{b}_2)}{3 \times (n_2 + m_2)} = \frac{970+955}{17382} = 0.1107$

The value of calculated z is,

z= 0.1107

The critical value of z at 5% l.o.s. is 3.3459. Since the calculated z is less than the critical value, it is not significant. Hence, we do not reject the null hypothesis and conclude that there is no significant difference in the distribution of polymorphic loci of chromosome 21 across the SAS and EUR populations.

6)    HISTOGRAM

[Figures: histograms of alternate-allele frequencies. Panels: SAS population, chromosome 1 (BEB, GIH, ITU); EUR population, chromosome 1 (FIN, IBS, TSI); SAS population, chromosome 21 (BEB, GIH, ITU); EUR population, chromosome 21 (FIN, IBS).]

CONCLUSION AND RECOMMENDATIONS

The analysis of the genotype data for the different populations indicates that, for a particular chromosome, the genetic structures within the South Asian populations as well as within the European populations are quite similar. Analysis of the allele frequencies for each evaluated locus showed that they did not differ significantly at most loci. From the results of the contingency tests, we can infer that the genotype frequency distribution is independent of the geographical location of the individuals. Also, it is seen that for a particular chromosome, the distribution of alleles is almost the same across the six populations. This is because of natural selection of genes for a particular chromosome. However, if we consider the genetic structures of Chromosome 1 and 21 for each population set, there are significant differences between the two chromosomes in both sets of populations. This can be explained by natural selection acting differently on the two chromosomes. These data can be further studied to trace the ancestries of these populations. However, further modifications and more effective statistical measures are desired for extracting more information.

REFERENCES

1. Hartl D.L., Clark A.G. (2007): Principles of Population Genetics, Fourth Edition, Sinauer Associates, Inc. Publishers (Sunderland, Massachusetts)

2. Zheng-Bradley X., Flicek P. (2016): Applications of the 1000 Genomes Project resources

3. The 1000 Genomes Project Consortium (2010): A map of human genome variation from population-scale sequencing

4. Elston R., et al. (2015): Genetic Terminology

5. Das R., Upadhyai P. (2017): Application of geographic population structure (GPS) algorithm for biogeographical analysis of populations with complex ancestries: a case study of South Asians from 1000 genomes project


ACKNOWLEDGEMENTS

I am very thankful to the Indian Academy of Sciences (Bangalore), the Indian National Science Academy (New Delhi) and the National Academy of Sciences (Allahabad) for awarding me the Focus Area Summer Research Fellowship Programme-2018 and allowing me to work under such experienced professionals. I am highly obliged to my project guide, Partha P. Majumder, Distinguished Professor at the National Institute of Biomedical Genomics, Kalyani, for giving me the privilege to work under his supervision and for guiding me throughout the project. I am also very thankful to Ms. Vijay Laxmi Ray, Ms. Srija Mukhopaddhay and Mrs. Chandrika Bhattacharya for their continuous guidance and patience.

I would like to thank Bhargob Kakoty, my friend, who informed me about this fellowship. I am highly grateful to Dibyajyoti Bora, Assistant Professor at Cotton University, for recommending that I apply for this fellowship. I am also thankful to Dr. Kamal Barman and Dr. Bandana Sharma, Associate Professors (Dept. of Statistics), who continuously supported me in carrying out this project. I have gained a lot of knowledge and experience from this summer project, which will be highly useful for my future studies. I am highly thankful to Almighty God for always blessing me. I am obliged to my parents, who are always by my side to help me, to my friends, who are my constant supporters, and to my fellow trainees for encouraging me to complete this project.

Thank you!
