Conditional Distributions, Part 2

We present more examples to further illustrate the thought process of conditional distributions. A conditional distribution is a probability distribution derived from a given probability distribution by focusing on a subset of the original sample space (we assume that the probability distribution being discussed is a model for some random experiment). The new sample space (the subset of the original one) may be some outcomes that are of interest to an experimenter in a random experiment or may reflect some new information we know about the random experiment in question. We illustrate this thought process in the previous post Conditional Distributions, Part 1 using discrete distributions. In this post, we present some continuous examples for conditional distributions. One concept illustrated by the examples in this post is the notion of mean residual life, which has an insurance interpretation (e.g. the average remaining time until death given that the life in question is alive at a certain age).

_____________________________________________________________________________________________________________________________

The Setting

The thought process of conditional distributions is discussed in the previous post Conditional Distributions, Part 1. We repeat the same discussion using continuous distributions.

Let X be a continuous random variable that describes a certain random experiment. Let S be the sample space of this random experiment. Let f(x) be its probability density function.

We assume that X is a univariate random variable, meaning that the sample space S is the real line \mathbb{R} or a subset of \mathbb{R}. Since X is a continuous random variable, we know that S would contain an interval, say, (a,b).

Suppose that in the random experiment in question, a certain event A has occurred. The probability of the event A is obtained by integrating the density function over the set A.

    \displaystyle P(A)=\int_{x \in A} f(x) \ dx

Since the event A has occurred, P(A)>0. Since we are dealing with a continuous distribution, the set A would contain an interval, say (c,d) (otherwise P(A)=0). So the new probability distribution we define is also a continuous distribution. The following is the density function defined on the new sample space A.

    \displaystyle f(x \lvert A)=\frac{f(x)}{P(A)}, \ \ \ \ \ \ \ \ \ x \in A

The above probability distribution is called the conditional distribution of X given the event A, denoted by X \lvert A. This new probability distribution incorporates new information about the results of a random experiment.

Once this new probability distribution is established, we can compute various distributional quantities (e.g. cumulative distribution function, mean, variance and other higher moments).

_____________________________________________________________________________________________________________________________

Examples

Example 1

Let X be the lifetime (in years) of a brand new computer purchased from a certain manufacturer. Suppose that the following is the density function of the random variable X.

    \displaystyle f(x)=\frac{3}{2500} \ (100x-20x^2 + x^3), \ \ \ \ \ \ \ \ 0<x<10

Suppose that you have just purchased one such computer that is 2 years old and in good working condition. We have the following questions.

  • What is the expected lifetime of this 2-year old computer?
  • What is the expected number of years of service that will be provided by this 2-year old computer?

Both calculations are conditional means since the computer in question has already survived to age 2. However, there is a slight difference between the two calculations. The first one is the expected total lifetime of the 2-year old computer (its expected age at failure), i.e., the conditional mean E(X \lvert X>2). The second one is the expected remaining lifetime of the 2-year old computer, i.e., E(X-2 \lvert X>2).

For a brand new computer, the sample space is the interval S=0<x<10. Knowing that the computer is already 2 years old, the new sample space is A=2<x<10. The total probability of the new sample space is:

    \displaystyle P(A)=P(X>2)=\int_{2}^{10} \frac{3}{2500} \ (100x-20x^2 + x^3) \ dx=\frac{2048}{2500}=0.8192

The conditional density function of X given X>2 is:

    \displaystyle \begin{aligned} f(x \lvert X>2)&=\frac{\frac{3}{2500} \ (100x-20x^2 + x^3)} {\frac{2048}{2500}} \\&=\frac{3}{2048} \ (100x-20x^2 + x^3), \ \ \ \ \ \ \ \ \ 2<x<10  \end{aligned}

The first conditional mean is:

    \displaystyle \begin{aligned} E(X \lvert X>2)&=\int_2^{10} x \ f(x \lvert X>2) \ dx \\&=\int_2^{10} \frac{3}{2048} \ x(100x-20x^2 + x^3) \ dx \\&=\int_2^{10} \frac{3}{2048} \ (100x^2-20x^3 + x^4) \ dx \\&=\frac{47104}{10240}=4.6 \end{aligned}

The second conditional mean is:

    \displaystyle E(X-2 \lvert X>2)=E(X \lvert X>2)-2=2.6

In contrast, the unconditional mean is:

    \displaystyle E(X)=\int_0^{10} \frac{3}{2500} \ (100x^2-20x^3 + x^4) \ dx=4

So if the lifetime of a computer is modeled by the density function f(x) given here, the expected lifetime of a brand new computer is 4 years. If you know that a computer has already been in use for 2 years and is in good condition, the expected lifetime is 4.6 years, of which 2 years have already passed, so the expected remaining lifetime is 2.6 years.

Note that the following calculation is not E(X \lvert X>2), though it is a calculation that some students may be tempted to make.

    \displaystyle \int_2^{10} x \ f(x) \ dx =\int_2^{10} \frac{3}{2500} \ x(100x-20x^2 + x^3) \ dx=\frac{47104}{12500}=3.76832

The above calculation does not use the conditional distribution given that X>2. Also note that the answer is less than the unconditional mean E(X).
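To close out this example, the integrals above can be checked symbolically. The following is a minimal sketch of such a check (my own illustration, assuming the Python library sympy is available); it reproduces P(X>2), the conditional mean and the unconditional mean.

    # Symbolic check of Example 1 (assumes sympy is available)
    import sympy as sp

    x = sp.symbols('x')
    f = sp.Rational(3, 2500) * (100*x - 20*x**2 + x**3)      # density on (0, 10)

    P_A = sp.integrate(f, (x, 2, 10))                        # P(X > 2)
    cond_mean = sp.integrate(x * f / P_A, (x, 2, 10))        # E(X | X > 2)
    uncond_mean = sp.integrate(x * f, (x, 0, 10))            # E(X)

    print(P_A)              # 512/625, i.e. 2048/2500 = 0.8192
    print(cond_mean)        # 23/5, i.e. 4.6
    print(cond_mean - 2)    # 13/5, i.e. 2.6 (expected remaining lifetime)
    print(uncond_mean)      # 4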

Example 2 – Exponential Distribution

Work Example 1 again by assuming that the lifetime of the type of computers in question follows the exponential distribution with mean 4 years.

The following is the density function of the lifetime X.

    \displaystyle f(x)=0.25 \ e^{-0.25 x}, \ \ \ \ \ \ 0<x<\infty

The probability that the computer has survived to age 2 is:

    \displaystyle P(X>2)=\int_2^\infty 0.25 \ e^{-0.25 x} \ dx=e^{-0.25 (2)}=e^{-0.5}

The conditional density function given that X>2 is:

    \displaystyle f(x \lvert X>2)= \frac{0.25 \ e^{-0.25 x}}{e^{-0.25 (2)}}=0.25 \ e^{-0.25 (x-2)}, \ \ \ \ \ \ \ 2<x<\infty

To compute the conditional mean E(X \lvert X>2), we have

    \displaystyle \begin{aligned} E(X \lvert X>2)&=\int_2^\infty x \ f(x \lvert X>2) \ dx \\&=\int_2^\infty 0.25 \ x \ e^{-0.25 (x-2)} \ dx \\&=\int_0^\infty 0.25 \ (u+2) \ e^{-0.25 u} \ du \ \ \ (\text{change of variable}) \\&=\int_0^\infty 0.25 \ u \ e^{-0.25 u} \ du+2\int_0^\infty 0.25 \ e^{-0.25 u} \ du \\&=\frac{1}{0.25}+2=4+2=6\end{aligned}

Then \displaystyle E(X-2 \lvert X>2)=E(X \lvert X>2)-2=6-2=4.

We have an interesting result here. The expected lifetime of a brand new computer is 4 years. Yet the remaining lifetime for a 2-year old computer is still 4 years! This is the memoryless (no-memory) property of the exponential distribution – if the lifetime of a type of machine is exponentially distributed, then no matter how old the machine is, the expected remaining lifetime is always the same as the unconditional mean! This indicates that the exponential distribution is not an appropriate model for the lifetime of machines or biological lives that wear out over time.
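The memoryless property can also be verified directly. The following is a minimal sketch (my own illustration, again assuming sympy is available) that computes E(X-t \lvert X>t) for several ages t; the answer is 4 every time.

    # Checking the memoryless property of the exponential distribution (assumes sympy)
    import sympy as sp

    x = sp.symbols('x', positive=True)
    lam = sp.Rational(1, 4)                  # rate 0.25, so the mean is 4
    f = lam * sp.exp(-lam * x)               # exponential density

    for t in [0, 2, 5, 10]:
        surv = sp.integrate(f, (x, t, sp.oo))                          # P(X > t)
        mean_resid = sp.integrate((x - t) * f / surv, (x, t, sp.oo))   # E(X - t | X > t)
        print(t, sp.simplify(mean_resid))                              # prints 4 for every t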

_____________________________________________________________________________________________________________________________

Mean Residual Life

If a 40-year old man who is a non-smoker wants to purchase a life insurance policy, the insurance company is interested in knowing the expected remaining lifetime of the prospective policyholder. This information will help determine the pricing of the life insurance policy. The expected remaining lifetime of the prospective policyholder is called the mean residual life and is the conditional mean E(X-t \lvert X>t) where X is a model for the lifetime of some life.

In engineering and manufacturing applications, probability modeling of lifetimes of objects (e.g. devices, systems or machines) is known as reliability theory. The mean residual life also plays an important role in such applications.

Thus if the random variable X is a lifetime model (lifetime of a life, system or device), then the conditional mean E(X-t \lvert X>t) is called the mean residual life and is the expected remaining lifetime of the life or system in question given that the life has survived to age t.

On the other hand, if the random variable X is a model of insurance losses, then the conditional mean E(X-t \lvert X>t) is the expected claim payment per loss given that the loss has exceeded the deductible of t. In this interpretation, the conditional mean E(X-t \lvert X>t) is called the mean excess loss function.

_____________________________________________________________________________________________________________________________

Summary

In conclusion, we summarize the approach for calculating the two conditional means demonstrated in the above examples.

Suppose X is a continuous random variable with the support being (0,\infty) (the positive real numbers), with f(x) being the density function. The following is the density function of the conditional probability distribution given that X>t.

    \displaystyle f(x \lvert X>t)=\frac{f(x)}{P(X>t)}, \ \ \ \ \ \ \ \ \ x>t

Then we have the two conditional means:

    \displaystyle E(X \lvert X>t)=\int_t^\infty x \ f(x \lvert X>t) \ dx=\int_t^\infty x \ \frac{f(x)}{P(X>t)} \ dx

    \displaystyle E(X-t \lvert X>t)=\int_t^\infty (x-t) \ f(x \lvert X>t) \ dx=\int_t^\infty (x-t) \ \frac{f(x)}{P(X>t)} \ dx

If E(X \lvert X>t) is calculated first (or is easier to calculate), then E(X-t \lvert X>t)=E(X \lvert X>t)-t, as shown in the above examples.

If X is a discrete random variable, then the integrals are replaced by summation symbols. As indicated above, the conditional mean E(X-t \lvert X>t) is called the mean residual life when X is a probability model of the lifetime of some system or life.
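The two integral formulas above translate directly into a short numerical routine. The following is a rough sketch (the function name and the use of numpy and scipy are my own choices, not from the post) for computing the mean residual life of a continuous lifetime model.

    # A rough sketch: E(X - t | X > t) by numerical integration (assumes numpy and scipy)
    import numpy as np
    from scipy.integrate import quad

    def mean_residual_life(density, t, upper=np.inf):
        # E(X - t | X > t) for a density supported on (0, upper)
        p_survive, _ = quad(density, t, upper)                       # P(X > t)
        numer, _ = quad(lambda x: (x - t) * density(x), t, upper)    # E[(X - t); X > t]
        return numer / p_survive

    # Example 1's density, supported on (0, 10): mean residual life at t = 2 is about 2.6
    f = lambda x: 3.0 / 2500.0 * (100*x - 20*x**2 + x**3)
    print(mean_residual_life(f, 2, upper=10))

    # Exponential density with mean 4: the answer is about 4 for every t
    g = lambda x: 0.25 * np.exp(-0.25 * x)
    print(mean_residual_life(g, 2))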

_____________________________________________________________________________________________________________________________

Practice Problems

Practice problems are found in the companion blog.

_____________________________________________________________________________________________________________________________

\copyright \ \text{2013 by Dan Ma}

Conditional Distributions, Part 1

We illustrate the thought process of conditional distributions with a series of examples. These examples are presented in a series of blog posts. In this post, we look at some conditional distributions derived from discrete probability distributions.

Practice problems are found in the companion blog.

_____________________________________________________________________________________________________________________________

The Setting

Suppose we have a discrete random variable X with f(x)=P(X=x) as the probability mass function. Suppose some random experiment can be modeled by the discrete random variable X. The sample space S for this probability experiment is the set of sample points with positive probability masses, i.e. S is the set of all x for which f(x)=P(X=x)>0. In the examples below, S is either a subset of the real line \mathbb{R} or a subset of the plane \mathbb{R} \times \mathbb{R}. Conceivably the sample space could be a subset of any higher-dimensional Euclidean space \mathbb{R}^n.

Suppose that we are informed that some event A in the random experiment has occurred (A is a subset of the sample space S). Given this new information, all the sample points outside of the event A are irrelevant. Or perhaps, in this random experiment, we are only interested in those outcomes that are elements of some subset A of the sample space S. In either of these scenarios, we wish to make the event A the new sample space.

The probability of the event A, denoted by P(A), is derived by summing the probabilities f(x)=P(X=x) over all the sample points x \in A. We have:

    \displaystyle P(A)=\sum_{x \in A} P(X=x)

The probability P(A) may not be 1.0. So the probability masses f(x)=P(X=x) for the sample points x \in A, if they are unadjusted, may not form a probability distribution. However, if we consider each such probability mass f(x)=P(X=x) as a proportion of the probability P(A), then the probability masses of the event A will form a probability distribution. For example, say the event A consists of two probability masses 0.2 and 0.3, which sum to 0.5. Then in the new sample space, the first probability mass is 0.4 (0.2 multiplied by \displaystyle \frac{1}{0.5} or divided by 0.5) and the second probability mass is 0.6.

We now summarize the above paragraph. Using the event A as a new sample space, the probability mass function is:

    \displaystyle f(x \lvert A)=\frac{f(x)}{P(A)}=\frac{P(X=x)}{P(A)}, \ \ \ \ \ \ \ \ \ x \in A

The above probability distribution is called the conditional distribution of X given the event A, denoted by X \lvert A. This new probability distribution incorporates new information about the results of a random experiment.

Once this new probability distribution is established, we can compute various distributional quantities (e.g. cumulative distribution function, mean, variance and other higher moments).
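The renormalization just described is mechanical enough to express in a few lines of code. Here is a minimal sketch (my own illustration, not part of the original post) that conditions a probability mass function, stored as a Python dictionary, on an event; it reproduces the 0.2/0.3 illustration above.

    # Minimal sketch: conditioning a discrete pmf on an event A (a set of sample points)
    def conditional_pmf(pmf, A):
        p_A = sum(p for x, p in pmf.items() if x in A)             # P(A)
        return {x: p / p_A for x, p in pmf.items() if x in A}      # f(x | A) = f(x) / P(A)

    # The two masses 0.2 and 0.3 rescale to 0.4 and 0.6 (the third point lies outside A)
    print(conditional_pmf({'a': 0.2, 'b': 0.3, 'c': 0.5}, {'a', 'b'}))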

_____________________________________________________________________________________________________________________________

Examples

Suppose that two students take a multiple choice test that has 5 questions. Let X be the number of correct answers of one student and Y be the number of correct answers of the other student (these can be considered as test scores for the purpose of the examples here). Assume that X and Y are independent. The following shows the probability functions.

      \displaystyle \begin{array}{ccc} \text{Count of correct answers} & P(X=x) & P(Y=y) \\ 0 & 0.4 & 0.1 \\ 1 & 0.2 & 0.1 \\ 2 & 0.1 & 0.2 \\ 3 & 0.1 & 0.2 \\ 4 & 0.1 & 0.2 \\ 5 & 0.1 & 0.2 \end{array}

Note that E(X)=1.6 and E(Y)=2.9. Without any additional information, we expect the student represented by X to get 1.6 correct answers on average and the student represented by Y to get 2.9 correct answers on average. If having 3 or more correct answers is considered passing, then the student represented by X has a 30% chance of passing while the student represented by Y has a 60% chance of passing. The following examples show how the expectation can change as soon as new information is known.

The following examples are based on these two test scores X and Y.

Example 1

In this example, we only consider the student whose correct answers are modeled by the random variable X. In addition to knowing the probability function P(X=x), we also know that this student has at least one correct answer (i.e. the new information is X>0).

In light of the new information, the new sample space is A=\left\{1,2,3,4,5 \right\}. Note that P(A)=0.6. In this new sample space, each probability mass is the original one divided by 0.6. For example, for the sample point X=1, we have \displaystyle P(X=1 \lvert X>0)=\frac{0.2}{0.6}=\frac{2}{6}. The following is the conditional probability distribution of X given X>0.

      \displaystyle P(X=1 \lvert X>0)=\frac{2}{6}

      \displaystyle P(X=2 \lvert X>0)=\frac{1}{6}

      \displaystyle P(X=3 \lvert X>0)=\frac{1}{6}

      \displaystyle P(X=4 \lvert X>0)=\frac{1}{6}

      \displaystyle P(X=5 \lvert X>0)=\frac{1}{6}

The conditional mean is the mean of the conditional distribution. We have \displaystyle E(X \lvert X>0)=\frac{16}{6}=2.67. Given that this student is knowledgeable enough to answer at least one question correctly, the expectation is higher than before knowing the additional information. Also, given the new information, the student in question has a 50% chance of passing (vs. 30% before the new information is known).
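The numbers in this example are easy to verify directly. The following small sketch (my own illustration) conditions the probability function of X on the event X>0 using exact fractions.

    # Quick check of Example 1: condition the score X on the event X > 0
    from fractions import Fraction as F

    pmf_X = {0: F(4, 10), 1: F(2, 10), 2: F(1, 10), 3: F(1, 10), 4: F(1, 10), 5: F(1, 10)}

    p_A = sum(p for x, p in pmf_X.items() if x > 0)            # P(X > 0) = 3/5
    cond = {x: p / p_A for x, p in pmf_X.items() if x > 0}     # conditional pmf of X | X > 0

    print(sum(x * p for x, p in cond.items()))                 # 8/3, about 2.67
    print(sum(p for x, p in cond.items() if x >= 3))           # 1/2, the chance of passing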

Example 2

We now look at a joint distribution that has a 2-dimensional sample space. Consider the joint distribution of test scores X and Y. If the new information is that the total number of correct answers among them is 4, how would this change our expectation of their performance?

Since X and Y are independent, every pair (x,y) with x and y ranging over 0,1,\cdots,5 has positive probability, so the sample space is the square grid of 36 points indicated in the figure below.

Figure 1 – Sample Space of Test Scores

Because the two scores are independent, the joint probability at each of these 36 sample points is the product of the individual probabilities. We have P(X=x,Y=y)=P(X=x) \times P(Y=y). The following figure shows one such joint probability.

Figure 2 – Joint Probability Function

After taking the test, suppose that we have the additional information that the two students have a total of 4 correct answers. With this new information, we can focus our attention on the new sample space that is indicated in the following figure.

Figure 3 – New Sample Space

Now we wish to discuss the conditional probability distribution of X \lvert X+Y=4 and the conditional probability distribution of Y \lvert X+Y=4. In particular, given that there are 4 correct answers between the two students, what would be their expected numbers of correct answers and what would be their chances of passing?

There are 5 sample points in the new sample space (the 5 points circled above). The conditional probability distribution is obtained by making each probability mass as a fraction of the sum of the 5 probability masses. First we calculate the 5 joint probabilities.

      \displaystyle P(X=0,Y=4)=P(X=0) \times P(Y=4) =0.4 \times 0.2=0.08

      \displaystyle P(X=1,Y=3)=P(X=1) \times P(Y=3) =0.2 \times 0.2=0.04

      \displaystyle P(X=2,Y=2)=P(X=2) \times P(Y=2) =0.1 \times 0.2=0.02

      \displaystyle P(X=3,Y=1)=P(X=3) \times P(Y=1) =0.1 \times 0.1=0.01

      \displaystyle P(X=4,Y=0)=P(X=4) \times P(Y=0) =0.1 \times 0.1=0.01

The sum of these 5 joint probabilities is P(X+Y=4)=0.16. Making each of these joint probabilities as a fraction of 0.16, we have the following two conditional probability distributions.

      \displaystyle P(X=0 \lvert X+Y=4)=\frac{8}{16} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(Y=0 \lvert X+Y=4)=\frac{1}{16}

      \displaystyle P(X=1 \lvert X+Y=4)=\frac{4}{16} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(Y=1 \lvert X+Y=4)=\frac{1}{16}

      \displaystyle P(X=2 \lvert X+Y=4)=\frac{2}{16} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(Y=2 \lvert X+Y=4)=\frac{2}{16}

      \displaystyle P(X=3 \lvert X+Y=4)=\frac{1}{16} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(Y=3 \lvert X+Y=4)=\frac{4}{16}

      \displaystyle P(X=4 \lvert X+Y=4)=\frac{1}{16} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(Y=4 \lvert X+Y=4)=\frac{8}{16}

Now we compute the conditional means given that X+Y=4, comparing them against the unconditional means.

      \displaystyle E(X \lvert X+Y=4)=\frac{0+4+4+3+4}{16}=\frac{15}{16}=0.9375 \ \ \ \ \ \ \ \ \text{vs} \ \ E(X)=1.6

      \displaystyle E(Y \lvert X+Y=4)=\frac{0+1+4+12+32}{16}=\frac{49}{16}=3.0625 \ \ \ \ \ \text{vs} \ \ E(Y)=2.9

Now compare the chances of passing.

      \displaystyle P(X \ge 3 \lvert X+Y=4)=\frac{2}{16}=0.125 \ \ \ \ \ \ \ \ \ \ \text{vs} \ \ P(X \ge 3)=0.3

      \displaystyle P(Y \ge 3 \lvert X+Y=4)=\frac{12}{16}=0.75 \ \ \ \ \ \ \ \ \text{vs} \ \ P(Y \ge 3)=0.6

Based on the new information of X+Y=4, we have a lower expectation for the student represented by X and a higher expectation for the student represented by Y. Observe that the conditional probability at X=0 increases to 0.5 from 0.4, while the conditional probability at Y=4 increases to 0.5 from 0.2.
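The arithmetic in this example can be reproduced programmatically, which also confirms the conditional means and passing probabilities above. A minimal sketch (my own illustration) follows.

    # Conditioning the joint distribution of the independent scores (X, Y) on X + Y = 4
    from fractions import Fraction as F

    pmf_X = {0: F(4, 10), 1: F(2, 10), 2: F(1, 10), 3: F(1, 10), 4: F(1, 10), 5: F(1, 10)}
    pmf_Y = {0: F(1, 10), 1: F(1, 10), 2: F(2, 10), 3: F(2, 10), 4: F(2, 10), 5: F(2, 10)}

    joint = {(x, y): px * py for x, px in pmf_X.items() for y, py in pmf_Y.items()}
    event = {pt: p for pt, p in joint.items() if sum(pt) == 4}
    p_event = sum(event.values())                              # P(X + Y = 4) = 4/25 = 0.16
    cond = {pt: p / p_event for pt, p in event.items()}        # conditional joint pmf

    print(sum(x * p for (x, y), p in cond.items()))            # E(X | X+Y=4) = 15/16
    print(sum(y * p for (x, y), p in cond.items()))            # E(Y | X+Y=4) = 49/16
    print(sum(p for (x, y), p in cond.items() if x >= 3))      # P(X >= 3 | X+Y=4) = 1/8
    print(sum(p for (x, y), p in cond.items() if y >= 3))      # P(Y >= 3 | X+Y=4) = 3/4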

Example 3

Now suppose the new information is that the two students do well on the test. Particularly, their combined number of correct answers is greater than or equal to 5, i.e., X+Y \ge 5. How would this impact the conditional distributions?

First we derive the conditional distributions of X \lvert X+Y \ge 5 and Y \lvert X+Y \ge 5. Taking the new information into account, the new sample space is shown in the following figure.

Figure 4 – New Sample Space

To derive the conditional distribution of X \lvert X+Y \ge 5, sum the joint probabilities within the new sample space for each X=x. The calculation is shown below.

      \displaystyle P(X=0 \cap X+Y \ge 5)=0.4 \times 0.2=0.08

      \displaystyle P(X=1 \cap X+Y \ge 5)=0.2 \times (0.2+0.2)=0.08

      \displaystyle P(X=2 \cap X+Y \ge 5)=0.1 \times (0.2+0.2+0.2)=0.06

      \displaystyle P(X=3 \cap X+Y \ge 5)=0.1 \times (0.2+0.2+0.2+0.2)=0.08

      \displaystyle P(X=4 \cap X+Y \ge 5)=0.1 \times (1-0.1)=0.09

      \displaystyle P(X=5 \cap X+Y \ge 5)=0.1 \times (1)=0.10

The sum of these probabilities is 0.49, which is P(X+Y \ge 5). The conditional distribution of X \lvert X+Y \ge 5 is obtained by taking each of the above probabilities as a fraction of 0.49. We have:

      \displaystyle P(X=0 \lvert X+Y \ge 5)=\frac{8}{49}=0.163

      \displaystyle P(X=1 \lvert X+Y \ge 5)=\frac{8}{49}=0.163

      \displaystyle P(X=2 \lvert X+Y \ge 5)=\frac{6}{49}=0.122

      \displaystyle P(X=3 \lvert X+Y \ge 5)=\frac{8}{49}=0.163

      \displaystyle P(X=4 \lvert X+Y \ge 5)=\frac{9}{49}=0.184

      \displaystyle P(X=5 \lvert X+Y \ge 5)=\frac{10}{49}=0.204

We have the conditional mean \displaystyle E(X \lvert X+Y \ge 5)=\frac{0+8+12+24+36+50}{49}=\frac{130}{49}=2.653 (vs. E(X)=1.6). The conditional probability of passing is \displaystyle P(X \ge 3 \lvert X+Y \ge 5)=\frac{27}{49}=0.55 (vs. P(X \ge 3)=0.3).

Note that the above conditional distribution for X \lvert X+Y \ge 5 is not as skewed as the original one for X. With the information that both test takers do well, the expected score for the student represented by X is much higher.

With a similar calculation, we have the following results for the conditional distribution of Y \lvert X+Y \ge 5.

      \displaystyle P(Y=0 \lvert X+Y \ge 5)=\frac{1}{49}=0.02

      \displaystyle P(Y=1 \lvert X+Y \ge 5)=\frac{2}{49}=0.04

      \displaystyle P(Y=2 \lvert X+Y \ge 5)=\frac{6}{49}=0.122

      \displaystyle P(Y=3 \lvert X+Y \ge 5)=\frac{8}{49}=0.163

      \displaystyle P(Y=4 \lvert X+Y \ge 5)=\frac{12}{49}=0.245

      \displaystyle P(Y=5 \lvert X+Y \ge 5)=\frac{20}{49}=0.408

We have the conditional mean \displaystyle E(Y \lvert X+Y \ge 5)=\frac{0+2+12+24+48+100}{49}=\frac{186}{49}=3.8 (vs. E(Y)=2.9). The conditional probability of passing is \displaystyle P(Y \ge 3 \lvert X+Y \ge 5)=\frac{40}{49}=0.82 (vs. P(Y \ge 3)=0.6). Indeed, with the information that both test takers do well, we can expect much higher results from each individual test taker.
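The same computation applies to the event X+Y \ge 5. The short sketch below (again my own illustration) confirms the two conditional distributions and the quantities derived from them.

    # Conditioning the joint distribution of (X, Y) on the event X + Y >= 5
    from fractions import Fraction as F

    pmf_X = {0: F(4, 10), 1: F(2, 10), 2: F(1, 10), 3: F(1, 10), 4: F(1, 10), 5: F(1, 10)}
    pmf_Y = {0: F(1, 10), 1: F(1, 10), 2: F(2, 10), 3: F(2, 10), 4: F(2, 10), 5: F(2, 10)}

    joint = {(x, y): px * py for x, px in pmf_X.items() for y, py in pmf_Y.items()}
    event = {pt: p for pt, p in joint.items() if sum(pt) >= 5}
    p_event = sum(event.values())                              # P(X + Y >= 5) = 49/100
    cond = {pt: p / p_event for pt, p in event.items()}        # conditional joint pmf

    print(sum(x * p for (x, y), p in cond.items()))            # 130/49, about 2.653
    print(sum(y * p for (x, y), p in cond.items()))            # 186/49, about 3.8
    print(sum(p for (x, y), p in cond.items() if x >= 3))      # 27/49
    print(sum(p for (x, y), p in cond.items() if y >= 3))      # 40/49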

Example 4

In Examples 2 and 3, the new information involves both test takers (both random variables). If the new information involves just one test taker, it may have no bearing on the exam score of the other student. For example, suppose that Y \ge 4. Then what is the conditional distribution for X \lvert Y \ge 4? Since X and Y are independent, the high score Y \ge 4 has no impact on the score X. However, the high joint score X+Y \ge 5 does have an impact on each of the individual scores (Example 3).

_____________________________________________________________________________________________________________________________

Summary

We conclude with a summary of the thought process of conditional distributions.

Suppose X is a discrete random variable and f(x)=P(X=x) is its probability function. Further suppose that X is the probability model of some random experiment. The sample space of this random experiment is S.

Suppose we have some new information that in this random experiment, some event A has occurred. The event A is a subset of the sample space S.

To incorporate this new information, the event A is the new sample space. The random variable incorporated with the new information, denoted by X \lvert A, has a conditional probability distribution. The following is the probability function of the conditional distribution.

    \displaystyle f(x \lvert A)=\frac{f(x)}{P(A)}=\frac{P(X=x)}{P(A)}, \ \ \ \ \ \ \ \ \ x \in A

where P(A) = \displaystyle \sum_{x \in A} P(X=x).

The thought process is that the conditional distribution is derived by taking each original probability mass as a fraction of the total probability P(A). The probability function derived in this manner reflects the new information that the event A has occurred.

Once the conditional probability function is derived, it can be used just like any other probability function, e.g. for computing various distributional quantities.

_____________________________________________________________________________________________________________________________

Practice Problems

Practice problems are found in the companion blog.

_____________________________________________________________________________________________________________________________

\copyright \ \text{2013 by Dan Ma}

An Introduction to the Bayes’ Formula

We open up a discussion of the Bayes’ formula by going through a basic example. The Bayes’ formula or theorem is a method that can be used to compute “backward” conditional probabilities such as the examples described here. The formula will be stated after we examine the calculation from Example 1. The following diagram describes Example 1. Example 2 is presented at the end of the post and is left as an exercise. For a basic discussion of the Bayes’ formula, see [1] and chapter 4 of [2].

Example 1

As indicated in the diagram, Box 1 has 1 red ball and 3 white balls and Box 2 has 2 red balls and 2 white balls. The example involves a sequence of two steps. In the first step (the green arrow in the above diagram), a box is randomly chosen from the two boxes. In the second step (the blue arrow), a ball is randomly selected from the chosen box. We assume that the identity of the chosen box is unknown to the participants of this random experiment (e.g. suppose the two boxes are identical in appearance and a box is chosen by your friend and its identity is kept from you). Let B_1 and B_2 be the events that Box 1 and Box 2, respectively, is the chosen box. Since a box is chosen at random, it is easy to see that P(B_1)=P(B_2)=0.5.

The example involves conditional probabilities. Some of the conditional probabilities are natural and are easy to see. For example, letting R be the event that the selected ball is red, if the chosen box is Box 1, it is clear that the probability of selecting a red ball is \displaystyle \frac{1}{4}, i.e. \displaystyle P(R \lvert B_1)=\frac{1}{4}. Likewise, the conditional probability P(R \lvert B_2) is \displaystyle \frac{2}{4}. These two conditional probabilities are “forward” conditional probabilities since the events R \lvert B_1 and R \lvert B_2 occur in a natural chronological order.

What about the reversed conditional probabilities P(B_1 \lvert R) and P(B_2 \lvert R)? In other words, if the selected ball from the unknown box (unknown to you) is red, what is the probability that the ball is from Box 1?

The above question seems a little backward. After the box is randomly chosen, it is fixed (though its identity is unknown to you). Since it is fixed, shouldn’t the probability of the box being Box 1 be \displaystyle \frac{1}{2}? Since the box is already chosen, how can the identity of the box be influenced by the color of the ball selected from it? Of course the box itself is not changed by observing the ball; what changes is our assessment of how likely each box is to be the chosen one.

We should not look at the chronological sequence of events. Instead, the key to understanding the example is to perform the random experiment repeatedly. Think of the experiment of choosing one box and then selecting one ball from the chosen box. Focus only on the trials that result in a red ball. For the result to be a red ball, we need to get either Box 1/Red or Box 2/Red. Compute the probabilities of these two cases. Adding these two probabilities, we obtain the probability that the selected ball is red. The following diagram illustrates this calculation.

Example 1 – Tree Diagram

The outcomes with red border in the above diagram are the outcomes that result in a red ball. The diagram shows that if we perform this experiment many times, about 37.5% of the trials will result in a red ball (on average 3 out of 8 trials will result in a red ball). In how many of these trials is Box 1 the source of the red ball? In the diagram, we see that the case Box 2/Red is twice as likely as the case Box 1/Red. We conclude that the case Box 1/Red accounts for about one third of the cases when the selected ball is red. In other words, one third of the red balls come from Box 1 and two thirds of the red balls come from Box 2. We have:

\displaystyle (1) \ \ \ \ \ P(B_1 \lvert R)=\frac{1}{3}

\displaystyle (2) \ \ \ \ \ P(B_2 \lvert R)=\frac{2}{3}

Instead of using the tree diagram or the reasoning indicated in the paragraph after the tree diagram, we could just as easily apply the Bayes’ formula:

\displaystyle \begin{aligned}(3) \ \ \ \ \ P(B_1 \lvert R)&=\frac{P(R \lvert B_1) \times P(B_1)}{P(R)} \\&=\frac{\frac{1}{4} \times \frac{1}{2}}{\frac{3}{8}} \\&=\frac{1}{3}  \end{aligned}

In the calculation in (3) (as in the tree diagram), we use the law of total probability:

\displaystyle \begin{aligned}(4) \ \ \ \ \ P(R)&=P(R \lvert B_1) \times P(B_1)+P(R \lvert B_2) \times P(B_2) \\&=\frac{1}{4} \times \frac{1}{2}+\frac{2}{4} \times \frac{1}{2} \\&=\frac{3}{8}  \end{aligned}
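The tree-diagram reasoning and the calculations in (3) and (4) can be mirrored in a few lines of code. Here is a minimal sketch (my own illustration, using only the probabilities stated above).

    # Posterior probabilities of the boxes given a red ball (Example 1)
    from fractions import Fraction as F

    prior = {'Box 1': F(1, 2), 'Box 2': F(1, 2)}             # P(B_1), P(B_2)
    red_given = {'Box 1': F(1, 4), 'Box 2': F(2, 4)}          # P(R | B_1), P(R | B_2)

    p_red = sum(red_given[b] * prior[b] for b in prior)       # law of total probability: 3/8
    posterior = {b: red_given[b] * prior[b] / p_red for b in prior}

    print(p_red)         # 3/8
    print(posterior)     # Box 1: 1/3, Box 2: 2/3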

______________________________________________________________
Remark

We are not saying that an earlier event (the choosing of the box) is altered in some way by a subsequent event (the observing of a red ball). The above probabilities are subjective. How strongly do you believe that the “unknown” box is Box 1? If you use probabilities to quantify your belief, without knowing any additional information, you would say the probability that the “unknown” box being Box 1 is \frac{1}{2}.

Suppose you reach into the “unknown” box and get a red ball. This additional information alters your belief about the chosen box. Since Box 2 has more red balls, the fact that you observe a red ball tells you that it is more likely that the “unknown” chosen box is Box 2. According to the above calculation, you update the probability of the chosen box being Box 1 to \frac{1}{3} and the probability of it being Box 2 to \frac{2}{3}.

In the language of Bayesian probability theory, the initial belief of P(B_1)=0.5 and P(B_2)=0.5 is called the prior probability distribution. After a red ball is observed, the updated belief as in the probabilities \displaystyle P(B_1 \lvert R)=\frac{1}{3} and \displaystyle P(B_2 \lvert R)=\frac{2}{3} is called the posterior probability distribution.

As demonstrated by this example, the Bayes’ formula is for updating probabilities in light of new information. Though the updated probabilities are subjective, they are not arbitrary. We can make sense of these probabilities by assessing the long run results of the experiment objectively.

______________________________________________________________
An Insurance Perspective

The example discussed here has an insurance interpretation. Suppose an insurer has two groups of policyholders, both equal in size. One group consists of low risk insureds for whom the probability of experiencing a claim in a year is \frac{1}{4} (i.e. the proportion of red balls in Box 1). The insureds in the other group, a high risk group, have a higher probability of experiencing a claim in a year, namely \frac{2}{4} (i.e. the proportion of red balls in Box 2).

Suppose someone has just purchased a policy. Initially, the risk profile of this newly insured is uncertain. So the initial belief is that he is equally likely to be in the low risk group as in the high risk group.

Suppose that during the first policy year, the insured has incurred one claim. The observation alters our belief about this insured. With the additional information of having one claim, the probability that the insured belongs to the high risk group is increased to \frac{2}{3}. The risk profile of this insured is altered based on new information. The insurance point of view described here involves exactly the same calculation as the box-and-ball example: past claims experience is used to update the assessment of future claims experience.

______________________________________________________________
Bayes’ Formula

Suppose we have a collection of mutually exclusive and exhaustive events B_1, B_2, \cdots, B_n, that is, the events form a partition of the sample space, so that the probabilities P(B_i) sum to 1.0. Suppose R is an event. Think of the events B_i as “causes” that can explain the event R, an observed result. Given that R is observed, what is the probability that the cause of R is B_k? In other words, we are interested in finding the conditional probability P(B_k \lvert R).

Before we have the observed result R, the probabilities P(B_i) are the prior probabilities of the causes. We also know the probability of observing R given a particular cause (i.e. we know P(R \lvert B_i)). The probabilities P(R \lvert B_i) are “forward” conditional probabilities.

Given that we observe R, we are interested in knowing the “backward” probabilities P(B_i \lvert R). These probabilities are called the posterior probabilities of the causes. Mathematically, the Bayes’ formula is simply an alternative way of writing the following conditional probability.

\displaystyle (5) \ \ \ \ \ P(B_k \lvert R)=\frac{P(B_k \cap R)}{P(R)}

In (5), as in the discussion of the random experiment of choosing box and selecting ball, we are restricting ourselves to only the cases where the event R is observed. Then we ask, out of all the cases where R is observed, how many of these cases are caused by the event B_k?

The numerator of (5) can be written as

\displaystyle (6) \ \ \ \ \ P(B_k \cap R)=P(R \lvert B_k) \times P(B_k)

The denominator of (5) is obtained from applying the law of total probability.

\displaystyle (7) \ \ \ \ \ P(R)=P(R \lvert B_1) P(B_1) + P(R \lvert B_2) P(B_2)+ \cdots + P(R \lvert B_n) P(B_n)

Plugging (6) and (7) into (5), we obtain a statement of the Bayes’ formula.

\displaystyle (8) \ \ \ \ \ P(B_k \lvert R)=\frac{P(R \lvert B_k) \times P(B_k)}{\sum \limits_{j=1}^n P(R \lvert B_j) \times P(B_j)} \ \ \ \ \ \ \ \text{(Bayes' Formula)}

Of course, for any computation problem involving the Bayes’ formula, it is best not to memorize the formula in (8). Instead, simply apply the thought process that gives rise to the formula (e.g. the tree diagram shown above).
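For completeness, formula (8) can also be packaged as a small function. The sketch below is illustrative only (the function name and interface are my own choices, not from the post).

    # A small illustrative implementation of the Bayes' formula (8)
    def bayes_posterior(prior, likelihood):
        # prior[i] = P(B_i); likelihood[i] = P(R | B_i); returns P(B_i | R) for each i
        total = sum(likelihood[i] * prior[i] for i in prior)       # P(R), law of total probability
        return {i: likelihood[i] * prior[i] / total for i in prior}

    # The box-and-ball example: the posterior probabilities are 1/3 and 2/3
    print(bayes_posterior({1: 0.5, 2: 0.5}, {1: 0.25, 2: 0.5}))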

The Bayes’ formula has some profound philosophical implications, evidenced by the fact that it spawned a separate school of thought called Bayesian statistics. However, our discussion here is solely on its original role in finding certain backward conditional probabilities.

______________________________________________________________
Example 2

Example 2 is left as an exercise. The event that both selected balls are red would give even more weight to Box 2. In other words, in the event that a red ball is selected twice in a row, we would believe that it is even more likely that the unknown box is Box 2.
______________________________________________________________
Reference

  1. Feller, W., An Introduction to Probability Theory and Its Applications, third edition, John Wiley & Sons, New York, 1968.
  2. Grinstead, C. M., Snell, J. L., Introduction to Probability, online book in PDF format.