Summer Research Fellowship Programme of India's Science Academies

Introduction to Bayesian statistics

Aneesh Ojha

Birla Institute of Technology Mesra, Ranchi

P Kandaswamy

Bharathiar University, Coimbatore


The purpose of this article is to present the basic principles of the Bayesian approach to statistics and to contrast them with the frequentist approach. When the process that generates the data is known, it is usually straightforward to calculate the probability of events in the sample space. In practice, however, we rarely have perfect knowledge of these processes, and it is the goal of statistical inference to determine their unknown parameters. Bayesian statistics makes probabilistic statements about parameters by using what is known (the data) and reasoning backward to a probable cause. The objectives of this article are to discuss the concepts of the Bayesian approach and its inherent limitations.

Keywords: Bayes' theorem, maximum likelihood, priors, posterior, statistical inference


Bayesian statistics is an active topic with applications in many fields. It is centred on Bayes' theorem, named after Thomas Bayes, who formulated a special case of it in a paper published in 1763. Although the idea is old, renewed development beginning in the 1960s gave the subject new life.

There is a long-running debate between frequentist and Bayesian statisticians which, narrowed down, rests on a fundamental question of philosophy: the definition of probability. This article does not attempt to resolve that debate; it introduces the basic concepts and characteristics of both approaches and indicates the points where they differ. Most statistical analysis in practice uses the frequentist approach, but Bayesian methods are now the tools of choice in many areas.


Conditional Probability

Conditional probability is the probability of an event given that another event has already occurred. Consider an event E and an event F associated with the same sample space of a random experiment. The probability of the event E given that F has already occurred is called the conditional probability of E given F and is denoted by P(E|F). We can write it as

P(E|F) = \frac{\text{Number of elementary events favourable to } E \cap F}{\text{Number of elementary events favourable to } F}

= \frac{n(E \cap F)}{n(F)}, provided P(F) ≠ 0

Dividing the numerator and the denominator by the total number of elementary events of the sample space, we see that P(E|F) can also be written as

P(E|F) = \frac{n(E \cap F)/n(S)}{n(F)/n(S)} = \frac{P(E \cap F)}{P(F)}, which is valid when P(F) ≠ 0, i.e., F ≠ ϕ
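The counting definition above can be checked directly. The sketch below uses a single roll of a die as the sample space (an illustrative choice, not an example from the text):

```python
# P(E|F) = n(E ∩ F) / n(F), computed by counting elementary events.
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}          # sample space: one roll of a die
E = {2, 4, 6}                   # event E: the roll is even
F = {4, 5, 6}                   # event F: the roll is greater than 3

# valid because F is non-empty, i.e. P(F) ≠ 0
p_E_given_F = Fraction(len(E & F), len(F))
print(p_E_given_F)              # → 2/3
```

Dividing numerator and denominator by len(S) gives P(E ∩ F)/P(F), exactly as in the second form of the formula.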

Properties of conditional probability

Let E and F be events of a sample space S of an experiment, then we have

Property 1 : P(S|F) = P(F|F) = 1

Property 2 : If A and B are any two events of a sample space S and F is an event of S such that P(F) ≠ 0, then

P((AUB)|F) = P(A|F) + P(B|F) − P((A∩B)|F)

Property 3 : P(E'|F) = 1 − P(E|F)

Multiplication Theorem on Probability

Let E and F be two events associated with a sample space S. The set E ∩ F denotes the event that both E and F have occurred or it denotes the simultaneous occurrence of both the events E and F.

Sometimes we need to find the probability of the event E ∩ F: for example, the probability of drawing both a king and a queen from a deck of 52 cards. The probability of the event E ∩ F is obtained by using conditional probability.

We know that

P(E|F) = \frac{P(E \cap F)}{P(F)}, P(F) ≠ 0

From this we can write

P(E∩F) = P(F).P(E|F)

Also, we know that

P(F|E) = \frac{P(F \cap E)}{P(E)}, P(E) ≠ 0

P(F|E) = \frac{P(E \cap F)}{P(E)} (since E ∩ F = F ∩ E)

P(E∩F) = P(E). P(F|E)

Therefore , we have

P(E∩F) = P(E). P(F|E)

= P(F). P(E|F) provided P(E) ≠ 0 and P(F) ≠ 0.

This result is known as the multiplication rule of probability.

Multiplication rule of probability for more than two events: If E, F and G are three events of a sample space, we have

P(E∩F∩G) = P(E).P(F|E)P(G|E∩F)

Similarly, the multiplication rule of probability can be extended to four or more events.
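The three-event rule can be verified numerically. A minimal sketch, again using a die roll with events of my own choosing:

```python
# Check P(E ∩ F ∩ G) = P(E)·P(F|E)·P(G|E ∩ F) by counting.
from fractions import Fraction

S = frozenset(range(1, 7))        # one roll of a fair die
E = frozenset({2, 3, 4, 5, 6})    # roll ≥ 2
F = frozenset({2, 4, 6})          # roll is even
G = frozenset({4, 5, 6})          # roll ≥ 4

def P(A):
    return Fraction(len(A), len(S))

# conditional probabilities via their definitions
p_F_given_E = P(E & F) / P(E)
p_G_given_EF = P(E & F & G) / P(E & F)

lhs = P(E & F & G)
rhs = P(E) * p_F_given_E * p_G_given_EF
print(lhs == rhs, lhs)            # → True 1/3
```

The product telescopes: each denominator cancels the preceding numerator, which is exactly why the chain rule holds.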

Independent events

Two events E and F are said to be independent, if

P(E|F) = P(E) provided P(F) ≠ 0

and P(F|E) = P(F) provided P(E) ≠ 0

Now, by the multiplication rule of probability, we have

P(E∩F) = P(E).P(F|E) ...(1)

If E and F are independent, then (1) becomes

P(E∩F) = P(E).P(F) ...(2)

Thus it can also be stated as:

Given two events E and F associated with the same random experiment, E and F are said to be independent if

P(E∩F) = P(E).P(F)
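The product criterion can be checked exhaustively for small sample spaces. A sketch with two example events of a die roll (my own choice of events):

```python
# Independence test: does P(E ∩ F) equal P(E)·P(F)?
from fractions import Fraction

S = frozenset(range(1, 7))
E = frozenset({2, 4, 6})          # even roll
F = frozenset({1, 2, 3, 4})       # roll ≤ 4

def P(A):
    return Fraction(len(A), len(S))

independent = P(E & F) == P(E) * P(F)
print(independent)                # → True: P(E∩F) = 1/3 = (1/2)·(2/3)
```

Note that independence is a property of the probabilities, not of the sets: E and F overlap yet are independent here.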


Partition of a Sample Space

A set of events E1, E2, ......, En is said to represent a partition of the sample space S if it satisfies the following:

(i) Ei ∩ Ej = ϕ, i ≠ j, i, j = 1,2,3,....n

(ii) E1 U E2 U ....... U En = S and

(iii) P(Ei) > 0 for all i = 1,2,3....,n.

The events E1, E2, ....., En represent a partition of the sample space S if they are pairwise disjoint, exhaustive and have nonzero probabilities. There can be several partitions of the same sample space.

Theorem of Total Probability

    Partition of a sample space S

    Let {E1, E2, E3, ..., En} be a partition of the sample space S, and suppose that each of the events E1, E2, ..., En has nonzero probability of occurrence. Let A be any event associated with S; then

    P(A) = P(E1)P(A|E1) + P(E2)P(A|E2) + ...... + P(En)P(A|En)

    = \sum_{j=1}^{n} P(E_j)P(A|E_j)


    Given that E1, E2, ..., En is a partition of the sample space S, we have

    S = E1 U E2 U ...... U En

    and Ei ∩ Ej = ϕ , i ≠ j , i , j = 1, 2 ,.....,n.

    Now, we know that for any event A,

    A = A ∩ S

    = A ∩(E1 U E2 U ......U En)

    = (A ∩ E1) U (A ∩ E2) U.......U (A ∩ En)

    Also, A ∩ Ei and A ∩ Ej are subsets of Ei and Ej respectively. Since Ei and Ej are disjoint for i ≠ j, the sets A ∩ Ei and A ∩ Ej are also disjoint for all i ≠ j, i, j = 1, 2, ..., n.

    Thus, P(A) = P[( A ∩ E1) U ( A ∩ E2) U ....... U ( A ∩ En)]

    = P( A ∩ E1) + P(A∩ E2) + ...... + P(A ∩ En)

    Now by multiplication rule of probability, we have

    P(A ∩ Ei) = P(Ei)P(A|Ei), since P(Ei) ≠ 0 for all i = 1, 2, 3, ..., n

    Therefore P(A) = P(E1)P(A|E1) + P(E2)P(A|E2) + ..... + P(En)P(A|En)

    or P(A) = \sum_{j=1}^{n} P(E_j)P(A|E_j)
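The theorem is a weighted sum, which is one line of code. The three partition probabilities and conditional probabilities below are invented purely for illustration:

```python
# Theorem of total probability: P(A) = Σ_j P(E_j)·P(A|E_j).
priors = [0.3, 0.5, 0.2]         # P(E1), P(E2), P(E3): a partition of S
likelihoods = [0.1, 0.4, 0.9]    # P(A|E1), P(A|E2), P(A|E3)

p_A = sum(p * l for p, l in zip(priors, likelihoods))
print(round(p_A, 2))             # → 0.41
```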

    Bayes' theorem

    If E1, E2, ..., En are n nonempty events which constitute a partition of the sample space S, i.e., E1, E2, ..., En are pairwise disjoint and E1 U E2 U ... U En = S, and A is any event of nonzero probability, then

    P(E_i|A) = \frac{P(E_i)P(A|E_i)}{\sum_{j=1}^{n}P(E_j)P(A|E_j)} for any i = 1, 2, 3, ..., n


    By the formula of conditional probability, we know that

    P(E_i|A) = \frac{P(A \cap E_i)}{P(A)}

    = \frac{P(E_i)P(A|E_i)}{P(A)} (by the multiplication rule of probability)

    = \frac{P(E_i)P(A|E_i)}{\sum_{j=1}^{n}P(E_j)P(A|E_j)} (by the theorem of total probability)


    In frequentist statistics, the sample data are assumed to be the result of an infinite number of identical repeated experiments, and conclusions are drawn as if the experiment were repeated under exactly the same conditions infinitely many times. Consider tossing a coin: in the frequentist view, the probability of getting a tail is the long-run relative frequency of tails over an infinite number of tosses under identical conditions. If we toss the coin 20 times and obtain 14 tails, a frequentist regards this as an unrepresentative sample from the infinitely repeated experiment; tossing the coin another 20 times would give a different result because a different sample has been drawn. Thus the probability of getting a tail is \frac12, which is what the relative frequency converges to over an infinite number of tosses. In the frequentist approach the data are treated as random (the whole idea centres on an experiment repeated an infinite number of times) and as arising from a fixed population. Frequentist statisticians view the parameters of a model as fixed and the data as varying; this is what mainly distinguishes the approach from the Bayesian one.


    Bayesians do not rely on the repetition of an experiment to define probability. Instead, probability is defined as a measure of the strength of belief in the occurrence of an event; the probability assigned to an event is an abstraction we use to express our uncertainty. So when a Bayesian statistician says that the probability of a coin landing heads is 50%, it does not mean the coin has actually been tossed n times and landed heads 50% of them; it means that, given the statistician's uncertainty about the event, heads and tails are judged equally likely.

    Bayesians see probability as an expression of belief, meaning it can be updated in the light of new data. The Bayesian name comes from Bayes' theorem, which is the central idea of the approach: Bayes' theorem is used to express our uncertainty about a parameter after we observe data.

    Bayesians assume the data to be fixed and view the parameters of a model as varying. In the Bayesian approach, parameters can be viewed from two perspectives: either the parameters are seen as truly varying, or our knowledge about them is regarded as imperfect; varying parameter estimates across different studies can be interpreted either way. The frequentist approach is less flexible in this sense: it treats the parameters as constant, representing the average over a long run (typically an infinite number) of identical experiments.

    Bayesians use Bayes' theorem to combine the observed data with a model describing the theory. The probabilistic description of our pre-data belief is called the prior, and the updated belief calculated after seeing the data is called the posterior.

    Prior + Data → Posterior​

    Bayes' rule tells us how to update our prior beliefs in order to derive better beliefs about a situation in light of new data.

    Consider the experiment of tossing a coin. Suppose there are two coins, one double-headed and one fair, and one of them is selected and tossed. Given that the tossed coin lands heads, what is the probability that it was the fair coin? We can apply Bayes' theorem to calculate this.

    Consider, P(Efair) = Probability of selecting a fair coin

    P(Ebiased) = Probability of selecting a double-headed coin

    P(H|Efair) = Probability of coin landing head given that it is a fair coin

    P(H|Ebiased) = Probability of coin landing head given that it is a double-headed coin

    So, we have

    P(E_{fair}|H) = \frac{P(H|E_{fair})P(E_{fair})}{P(H|E_{fair})P(E_{fair}) + P(H|E_{biased})P(E_{biased})}

    P(E_{fair}|H) = \frac{P(H|E_{fair})\,P(E_{fair})}{P(H)}

    Here P(E_{fair}|H) is the posterior, P(H|E_{fair}) the likelihood, P(E_{fair}) the prior, and the denominator P(H) is the model evidence.

    • The posterior is the value we want to calculate and is the main goal of Bayesian inference.
    • Priors are the most controversial part of the Bayesian formula. A prior is a probability distribution that represents our pre-data beliefs across the different values of the parameters in our model.
    • The likelihood is the part common to both the Bayesian and the frequentist approach.
    • The model evidence (the denominator) sums over all the ways the coin could land heads, whether it is double-headed or fair.
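The worked example above can be put into numbers. The sketch assumes each coin is equally likely to be picked, an assumption not stated explicitly in the text:

```python
# Posterior probability that the coin is fair, given it landed heads.
from fractions import Fraction

p_fair = p_biased = Fraction(1, 2)          # priors (assumed equal selection)
p_H_fair = Fraction(1, 2)                   # P(H | fair coin)
p_H_biased = Fraction(1, 1)                 # P(H | double-headed coin)

evidence = p_H_fair * p_fair + p_H_biased * p_biased   # P(H), model evidence
posterior_fair = p_H_fair * p_fair / evidence
print(posterior_fair)                       # → 1/3
```

Observing heads shifts belief away from the fair coin: the prior 1/2 drops to a posterior of 1/3, because the double-headed coin explains the data better.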


    Consider the experiment of tossing a fair coin, so the probability of the coin landing heads is θ = \frac12. Further assume that if the coin is tossed again, the result of the first flip does not affect the second, i.e., the two results are independent. Then the probability of obtaining two heads in a row is:

    Pr(H, H | θ, Model) = Pr(H | θ, Model) × Pr(H | θ, Model) = θ × θ = θ² = \left(\frac12\right)^2 = \frac14

    where the model represents the set of assumptions we make in our analysis. Similarly, we can calculate the probability for each possible outcomes :

    The probability of each outcome for a fair coin flipped twice:

                                  No heads   One head   Two heads
    Probability                     1/4        1/2        1/4
    No. of outcomes                  1          2          1
    Percentage of occurrence        25%        50%        25%

    The number of outcomes for one head is two because it can occur in two ways: head on the first flip and tail on the second, or vice versa.

    From the table, we can see that one head is more likely than two heads or no heads.
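The table above can be reproduced by enumerating all outcomes of two flips:

```python
# Enumerate the 2² outcomes of two fair-coin flips and count heads.
from itertools import product
from collections import Counter

counts = Counter(flips.count('H') for flips in product('HT', repeat=2))
n = 2 ** 2
for heads in (0, 1, 2):
    print(heads, counts[heads], counts[heads] / n)
# → 0 1 0.25
#   1 2 0.5
#   2 1 0.25
```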

    The likelihood function expresses how likely a given set of observations is for different values of the statistical parameters. Likelihood is often confused with probability, but there is a clear distinction between the two. Probability measures how certain we are that a specific event will occur out of all possible occurrences, whereas likelihood is a function over the parameter space that describes, for each parameter value, the probability of obtaining the observed data.

    Also, the area under the normal distribution curve represents the probabilities under fixed distribution whereas in likelihoods they are the y-axis value for fixed data points with distributions that can be moved. Probability quantifies anticipation (of outcome), likelihood quantifies trust (in the model).

    Suppose we have a random sample X1, X2, ..., Xn for which the probability density (or mass) function of each Xi is f(xi; θ). Then the joint probability mass (or density) function of X1, X2, ..., Xn, which we call the likelihood L(θ), is:

    L(\theta) = P(X_1 = x_1, X_2 = x_2, \dots, X_n = x_n) = f(x_1;\theta) \cdot f(x_2;\theta) \cdots f(x_n;\theta) = \prod_{i=1}^{n} f(x_i;\theta)

    In words: L(θ) is the probability of observing the given data, viewed as a function of θ.
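As a concrete sketch, take the Bernoulli model f(x; θ) = θ^x (1−θ)^(1−x) for coin flips (x = 1 for heads). The data and the θ values tried below are illustrative:

```python
# Likelihood L(θ) = Π f(x_i; θ) for Bernoulli data: same data, varying θ.
from math import prod

data = [1, 0, 1, 1, 0, 1, 1]          # 5 heads in 7 tosses

def lik(theta, xs):
    return prod(theta**x * (1 - theta)**(1 - x) for x in xs)

for theta in (0.3, 0.5, 0.7):
    print(theta, round(lik(theta, data), 5))
```

The data are held fixed while θ varies: that is exactly the distinction from probability, where θ is fixed and the outcomes vary.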

    Maximum Likelihood

    The frequentist approach to estimation is known as the method of maximum likelihood. The principle is simple: choose the parameter values that maximise the likelihood of obtaining the data sample. The point in the parameter space that maximises the likelihood function is called the maximum likelihood estimate. In frequentist inference, we determine the uncertainty in our estimates by examining the curvature of the likelihood near the maximum likelihood estimate. From the likelihood expression, we have:


    The maximum likelihood estimate of θ is the value of θ that maximises L(θ): it is the value that makes the observed data the "most probable". Rather than maximising this product, which can be quite tedious, we often use the fact that the logarithm is an increasing function, so it is equivalent to maximise the log-likelihood:

    l(\theta) = \sum_{i=1}^{n} \log f(x_i;\theta)
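A minimal sketch of maximising the log-likelihood on a grid, for Bernoulli coin data of my own choosing. For this model the analytic MLE is the sample proportion, heads/n = 5/7:

```python
# Grid-search maximisation of l(θ) = Σ [x·log θ + (1−x)·log(1−θ)].
from math import log

data = [1, 0, 1, 1, 0, 1, 1]          # 5 heads in 7 tosses

def loglik(theta):
    return sum(x * log(theta) + (1 - x) * log(1 - theta) for x in data)

grid = [i / 1000 for i in range(1, 1000)]   # avoid θ = 0 and θ = 1
mle = max(grid, key=loglik)
print(mle)                             # → 0.714, close to 5/7 ≈ 0.7143
```

In practice the maximisation is done analytically or with a numerical optimiser; the grid merely makes the principle visible.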


    Bayes' rule tells us how to convert the likelihood into a posterior probability distribution for parameters, which can be used for inference. The numerator of Bayes' rule tells us to multiply the likelihood by a weighting of each parameter value, called the prior. Priors are the most controversial part of the Bayesian approach and a perennial point of dispute between Bayesians and frequentists.

    Consider a case where a doctor gives their probability that an individual has a particular disease before the results of a blood test are available. Using their knowledge of the patient's history and their expertise on the particular condition, they assign a prior disease probability. This underlying belief, held before the evidence is available, is what we call a prior.

    Based on previous records, we usually have some idea of the underlying uncertainty in a parameter's value. The prior is always a valid probability distribution and can be used to calculate prior expectations of the parameter values.

    Why do we need priors ?

    Bayes' rule is the way to update our initial beliefs in the light of the data.

    initial belief \xrightarrow{\text{Bayes' rule + data}} new belief

    So in this sense, it is clear that we must specify an initial belief, otherwise we have nothing to update. Therefore priors help us to define or specify our initial belief of uncertainty for an event.


    In Bayesian statistics, the posterior distribution combines our pre-existing beliefs with information from observed data to produce an updated belief, which is used as the starting point for all further analyses. To calculate it we need a likelihood function that determines the influence of the data on the posterior which is then combined in a certain way using Bayes' rule with priors that represent our pre-data beliefs across the range of parameter values.

    likelihood + prior \xrightarrow{\text{Bayes' rule}} posterior

    The posterior represents our knowledge updated by combining past experience with information from the observed data.
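The likelihood + prior → posterior step can be sketched by grid approximation for a coin's head probability θ. The flat prior and the 6-heads-in-10 data below are illustrative choices:

```python
# Grid approximation: posterior ∝ prior × likelihood, then normalise.
grid = [i / 100 for i in range(101)]              # candidate θ values
prior = [1.0] * len(grid)                         # flat (uniform) prior
heads, tails = 6, 4                               # observed data
likelihood = [t**heads * (1 - t)**tails for t in grid]

unnorm = [p * l for p, l in zip(prior, likelihood)]
total = sum(unnorm)
posterior = [u / total for u in unnorm]

# with a flat prior, the posterior mode sits at the sample proportion
mode = grid[posterior.index(max(posterior))]
print(mode)                                       # → 0.6
```

With an informative prior the mode would be pulled away from 0.6 toward the prior's centre, which is precisely the "updating" the diagram describes.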


    Statistical inference is the process of drawing conclusions about certain aspects of a population based on the sample of data collected from that population. In statistical inference, we not only try to draw conclusions about the particular subjects observed in the study but also about the larger population of subjects from which the study sample was drawn. Statistical inference consists of first selecting a statistical model that generates the data and second consists of drawing propositions from the model.

    Purpose of Statistical Inference

    In inference, we want to draw conclusions based purely on the rules of probability. Statistical inference provides the logical framework we can use to test our beliefs about the noisy world against data. We formalise our beliefs in probability models; the models are probabilistic because we cannot say with certainty whether something will or will not occur. Statistical inference is broadly divided into two schools of thought: frequentist and Bayesian inference.

    The purpose of statistical inference is to quantify the uncertainty associated with estimates from a sample. Understanding how uncertain our findings are allows us to take that uncertainty into account when drawing conclusions: it allows us to provide a range of values for the true value of something in the population, and to make statements about whether our study provides evidence to reject a hypothesis.

    If we wish to summarise our evidence for a particular hypothesis, we describe this using the language of probability, as the 'probability of the hypothesis given the data obtained'. The difficulty is that when we choose a probability model to describe a situation, it enables us to calculate the probability of obtaining the data given our hypothesis being true - the opposite of what we want. The issue of statistical inference, common to both Frequentists and Bayesians, is how to invert this probability to get the desired result.

    Frequentist Inference

    The frequentist approach to statistics is based on sampling theory, in which random samples of data are taken from a process to ascertain the underlying parameter of interest. Two primary assumptions are that the process is repeatable and that the underlying parameter remains constant. Hypothesis testing and the construction of confidence intervals take a frequentist approach. In this approach, null and alternative hypotheses are constructed; the type 1 error the researcher is willing to accept is selected (i.e., the probability of rejecting the null hypothesis when it is true); the parameter is estimated; and the hypothesis is evaluated and/or a confidence interval is constructed. When constructing a 95% confidence interval, the percentage indicates that 95% of the intervals constructed this way contain the true value. In the frequentist approach there is no requirement for prior knowledge about the parameter of interest; the parameter value is estimated from the sample data. Among the numerous applications of the frequentist approach, the two most prominent are the design of experiments and regression analysis.

    Confidence interval

    The main part of the frequentist estimation process is the confidence interval. In applied research, these intervals form the main result of the paper. For example: From our research, we concluded that the percentage of penguins with red tails, RT, has a 95% confidence interval of 1% ≤ RT ≤ 5%.

    In the Frequentist approach, we imagine taking repeated samples from a population of interest, and for each of the fictitious samples, we estimate a confidence interval. A 95% confidence interval means that across the infinity of intervals that we calculate, the true value of the parameter will lie in this range 95% of the time.
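The repeated-sampling interpretation can be made concrete by simulation. The sketch below repeatedly draws samples of size 200 from a population with a true proportion of 0.3 (all numbers illustrative) and checks how often a standard normal-approximation interval covers the truth:

```python
# Coverage of 95% normal-approximation intervals for a proportion.
import random

random.seed(0)
true_p, n, trials = 0.3, 200, 2000
covered = 0
for _ in range(trials):
    x = sum(random.random() < true_p for _ in range(n))   # simulate a sample
    phat = x / n
    se = (phat * (1 - phat) / n) ** 0.5                   # standard error
    lo, hi = phat - 1.96 * se, phat + 1.96 * se
    covered += lo <= true_p <= hi

print(covered / trials)    # close to 0.95
```

The 95% refers to this long-run coverage across many samples, not to a 95% probability statement about any single computed interval.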

    Bayesian Inference

    Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more information becomes available. The Bayesian approach takes a probabilistic view of the unknown quantity (i.e., the parameter). Bayesian inference begins with the prior distribution of the parameter (i.e., before the data are seen); the posterior distribution of the parameter is then induced from the prior and the observed data. Typically the posterior distribution is summarised by its mean and quantiles. Highest posterior density intervals indicate the region of highest posterior probability. In a Bayesian 95% interval, the percentage indicates that there is a 95% probability that the parameter is contained in the interval. In the Bayesian approach, prior information about the parameter distribution is required. Applications of the Bayesian approach include probabilistic risk analysis, expert systems and pattern recognition.

    My simplistic conclusion is that the bayesian approach is well suited to estimation problems, especially where the estimate is to be used to make a practical decision. The frequentist approach is well suited to testing hypotheses about the nature of the world.

    Credible interval

    Bayesian credible intervals, in contrast to confidence intervals, describe our uncertainty in the location of the parameter values. They are calculated from the posterior density. In particular, a 95% credible region satisfies the condition that 95% of the posterior probability lies in this parameter range. The statement:

    'From our research, we concluded that the percentage of penguins with red tails, RT, has a 95% credible interval of 0% ≤ RT ≤ 4%' can be interpreted straightforwardly as 'From our research, we conclude that there is a 95% probability that the percentage of penguins with red tails lies in the range 0% ≤ RT ≤ 4%'.

    In contrast to the Frequentist confidence interval, a credible interval is more straightforward to understand. It is a statement of confidence in the location of a parameter. Also, in contrast to the Frequentist confidence intervals, the uncertainty here refers to our inherent uncertainty in the value of the parameter, estimated using the current sample, rather than an infinite number of made up samples.
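A credible interval can be read straight off the posterior. The sketch below uses a discretised posterior for a coin's head probability under a flat prior with 6 heads in 10 tosses (illustrative data) and cuts 2.5% of the posterior probability from each tail:

```python
# Central 95% credible interval from a grid posterior.
grid = [i / 1000 for i in range(1001)]
heads, tails = 6, 4
unnorm = [t**heads * (1 - t)**tails for t in grid]    # flat prior × likelihood
total = sum(unnorm)
posterior = [u / total for u in unnorm]

def quantile(q):
    """Smallest grid value whose cumulative posterior probability reaches q."""
    acc = 0.0
    for t, p in zip(grid, posterior):
        acc += p
        if acc >= q:
            return t

print(quantile(0.025), quantile(0.975))    # roughly 0.31 and 0.83
```

The resulting statement, "θ lies in this range with 95% probability", is the direct probability claim that a confidence interval does not license.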


    Bayesian statistical methods are mostly used in three main situations. The first is where a prior probability needs to be included owing to a lack of data on some aspect of a model, or because the inadequacies of some evidence have to be acknowledged by making assumptions about the biases involved. These situations can occur when a policy decision must be made on the basis of a combination of imperfect evidence from multiple sources.

    The second situation occurs with moderate-size problems with multiple sources of evidence, where hierarchical models can be constructed on the assumption of shared prior distributions whose parameters can be estimated from the data. Common application areas include meta-analysis, disease mapping, multi-center studies, and so on.

    The third case concerns where a huge joint probability model is constructed, relating possibly thousands of observations and parameters, and the only feasible way of making inferences on the unknown quantities is through taking a Bayesian approach: examples include image processing, spam filtering, signal analysis, and gene expression data.

    Recent developments in computer science have seen wide application of the Bayesian approach. It is well suited to machine learning, where this kind of inherent 'updating' is needed. Simple examples can be found in modern software for spam filtering and in systems that suggest which books or movies a user might enjoy given his or her past preferences.


    The aim of this article was to give a basic idea of the different statistical philosophies and to show that no single one of them can be used in every situation. There are areas of frequentist methodology that could be replaced by Bayesian methodology, and there are situations where the Bayesian approach does not work well. A philosophical unification of the two is unlikely, as each highlights a different aspect of statistical analysis; real-world problems are best served by drawing on both, letting each address the flaws of the other.


    I would like to thank my guide Dr. P Kandaswamy for his guidance and encouragement in carrying out this project work. I would also like to thank Dr. R Vijayraghavan for his guidance and support and giving the right direction to approach the difficulties and grasp the underlying concepts in the text. Lastly, I would like to thank the Indian Academy of Science for providing me with this research fellowship program which is a stepping stone for my aim to explore the applications of mathematics.


