Conditional Distributions, Part 2

We present more examples to further illustrate the thought process of conditional distributions. A conditional distribution is a probability distribution derived from a given probability distribution by focusing on a subset of the original sample space (we assume that the probability distribution being discussed is a model for some random experiment). The new sample space (a subset of the original one) may consist of outcomes that are of interest to an experimenter, or it may reflect new information we have about the random experiment in question. We illustrated this thought process in the previous post Conditional Distributions, Part 1 using discrete distributions. In this post, we present some continuous examples. One concept illustrated by the examples in this post is the notion of mean residual life, which has an insurance interpretation (e.g. the average remaining time until death given that the life in question is alive at a certain age).

_____________________________________________________________________________________________________________________________

The Setting

The thought process of conditional distributions is discussed in the previous post Conditional Distributions, Part 1. We repeat the same discussion using continuous distributions.

Let X be a continuous random variable that describes a certain random experiment. Let S be the sample space of this random experiment. Let f(x) be its probability density function.

We assume that X is a univariate random variable, meaning that the sample space S is the real line \mathbb{R} or a subset of \mathbb{R}. Since X is a continuous random variable, we know that S would contain an interval, say, (a,b).

Suppose that in the random experiment in question, certain event A has occurred. The probability of the event A is obtained by integrating the density function over the set A.

    \displaystyle P(A)=\int_{x \in A} f(x) \ dx

Since the event A has occurred, P(A)>0. Since we are dealing with a continuous distribution, the set A would contain an interval, say (c,d) (otherwise P(A)=0). So the new probability distribution we define is also a continuous distribution. The following is the density function defined on the new sample space A.

    \displaystyle f(x \lvert A)=\frac{f(x)}{P(A)}, \ \ \ \ \ \ \ \ \ x \in A

The above probability distribution is called the conditional distribution of X given the event A, denoted by X \lvert A. This new probability distribution incorporates new information about the results of a random experiment.

Once this new probability distribution is established, we can compute various distributional quantities (e.g. cumulative distribution function, mean, variance and other higher moments).
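The construction above can be checked numerically. The following Python sketch uses a made-up density f(x)=2x on (0,1) and the event A=(0.5,1); the integrate helper is our own simple midpoint rule, not a library routine.

```python
# Sketch of the construction above: start with a density f on S, condition
# on an event A = (c, d), and check that f(x | A) = f(x) / P(A) integrates
# to 1 over A. The density f(x) = 2x on (0, 1) is a made-up example.

def integrate(g, a, b, n=10_000):
    """Simple midpoint-rule numerical integration of g over [a, b]."""
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

f = lambda x: 2 * x            # density on S = (0, 1)
c, d = 0.5, 1.0                # the event A = (0.5, 1)

p_A = integrate(f, c, d)                    # P(A) = 0.75
f_cond = lambda x: f(x) / p_A               # conditional density on A

total = integrate(f_cond, c, d)             # integrates to 1, as required
cond_mean = integrate(lambda x: x * f_cond(x), c, d)   # E(X | A) = 7/9
```

Once the conditional density is in hand, any distributional quantity (here the conditional mean) is computed from it in the usual way.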

_____________________________________________________________________________________________________________________________

Examples

Example 1

Let X be the lifetime (in years) of a brand new computer purchased from a certain manufacturer. Suppose that the following is the density function of the random variable X.

    \displaystyle f(x)=\frac{3}{2500} \ (100x-20x^2 + x^3), \ \ \ \ \ \ \ \ 0<x<10

Suppose that you have just purchased one such computer that is 2 years old and in good working condition. We have the following questions.

  • What is the expected lifetime of this 2-year old computer?
  • What is the expected number of years of service that will be provided by this 2-year old computer?

Both calculations are conditional means since the computer in question already survived to age 2. However, there is a slight difference between the two calculations. The first one is the expected age of the 2-year old computer, i.e., the conditional mean E(X \lvert X>2). The second one is the expected remaining lifetime of the 2-year old computer, i.e., E(X-2 \lvert X>2).

For a brand new computer, the sample space is the interval S=\left\{x: 0<x<10 \right\}. Knowing that the computer is already 2 years old, the new sample space is A=\left\{x: 2<x<10 \right\}. The total probability of the new sample space is:

    \displaystyle P(A)=P(X>2)=\int_{2}^{10} \frac{3}{2500} \ (100x-20x^2 + x^3) \ dx=\frac{2048}{2500}=0.8192

The conditional density function of X given X>2 is:

    \displaystyle \begin{aligned} f(x \lvert X>2)&=\frac{\frac{3}{2500} \ (100x-20x^2 + x^3)} {\frac{2048}{2500}} \\&=\frac{3}{2048} \ (100x-20x^2 + x^3), \ \ \ \ \ \ \ \ \ 2<x<10  \end{aligned}

The first conditional mean is:

    \displaystyle \begin{aligned} E(X \lvert X>2)&=\int_2^{10} x \ f(x \lvert X>2) \ dx \\&=\int_2^{10} \frac{3}{2048} \ x(100x-20x^2 + x^3) \ dx \\&=\int_2^{10} \frac{3}{2048} \ (100x^2-20x^3 + x^4) \ dx \\&=\frac{47104}{10240}=4.6 \end{aligned}

The second conditional mean is:

    \displaystyle E(X-2 \lvert X>2)=E(X \lvert X>2)-2=2.6

In contrast, the unconditional mean is:

    \displaystyle E(X)=\int_0^{10} \frac{3}{2500} \ (100x^2-20x^3 + x^4) \ dx=4

So if the lifetime of a computer is modeled by the density function f(x) given here, the expected lifetime of a brand new computer is 4 years. If you know that a computer has already been in use for 2 years and is in good condition, the expected total lifetime is 4.6 years, of which 2 years have already passed, so the expected remaining lifetime is 2.6 years.

Note that the following calculation is not E(X \lvert X>2), though it is a calculation that some students may be tempted to do.

    \displaystyle \int_2^{10} x \ f(x) \ dx =\int_2^{10} \frac{3}{2500} \ x(100x-20x^2 + x^3) \ dx=\frac{47104}{12500}=3.76832

The above calculation does not use the conditional distribution given X>2. Also note that the answer is less than the unconditional mean E(X).
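As a quick numerical check of Example 1, the following Python sketch recomputes P(X>2), the two conditional means and the unconditional mean by simple numerical integration (the integrate helper is our own, not a library routine).

```python
# Numerical check of the quantities in Example 1.

def integrate(g, a, b, n=20_000):
    """Simple midpoint-rule numerical integration of g over [a, b]."""
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

f = lambda x: 3 / 2500 * (100 * x - 20 * x**2 + x**3)  # density on (0, 10)

p = integrate(f, 2, 10)                                # P(X > 2) = 0.8192
cond_mean = integrate(lambda x: x * f(x) / p, 2, 10)   # E(X | X > 2) = 4.6
residual = cond_mean - 2                               # E(X - 2 | X > 2) = 2.6
uncond_mean = integrate(lambda x: x * f(x), 0, 10)     # E(X) = 4
```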

Example 2 – Exponential Distribution

Work Example 1 again by assuming that the lifetime of the type of computers in question follows the exponential distribution with mean 4 years.

The following is the density function of the lifetime X.

    \displaystyle f(x)=0.25 \ e^{-0.25 x}, \ \ \ \ \ \ 0<x<\infty

The probability that the computer has survived to age 2 is:

    \displaystyle P(X>2)=\int_2^\infty 0.25 \ e^{-0.25 x} \ dx=e^{-0.25 (2)}=e^{-0.5}

The conditional density function given that X>2 is:

    \displaystyle f(x \lvert X>2)= \frac{0.25 \ e^{-0.25 x}}{e^{-0.25 (2)}}=0.25 \ e^{-0.25 (x-2)}, \ \ \ \ \ \ \ 2<x<\infty

To compute the conditional mean E(X \lvert X>2), we have

    \displaystyle \begin{aligned} E(X \lvert X>2)&=\int_2^\infty x \ f(x \lvert X>2) \ dx \\&=\int_2^\infty 0.25 \ x \ e^{-0.25 (x-2)} \ dx \\&=\int_0^\infty 0.25 \ (u+2) \ e^{-0.25 u} \ du \ \ \ (\text{change of variable}) \\&=\int_0^\infty 0.25 \ u \ e^{-0.25 u} \ du+2\int_0^\infty 0.25 \ e^{-0.25 u} \ du \\&=\frac{1}{0.25}+2=4+2=6\end{aligned}

Then \displaystyle E(X-2 \lvert X>2)=E(X \lvert X>2)-2=6-2=4.

We have an interesting result here. The expected lifetime of a brand new computer is 4 years. Yet the expected remaining lifetime for a 2-year-old computer is still 4 years! This is the memoryless (no-memory) property of the exponential distribution – if the lifetime of a type of machines is modeled by an exponential distribution, it does not matter how old the machine is, the expected remaining lifetime is always the same as the unconditional mean! This point indicates that the exponential distribution is not an appropriate model for the lifetime of machines or biological lives that wear out over time.
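The memoryless property can also be seen in a small simulation. The sketch below (sample size and seed are arbitrary choices of ours) draws exponential lifetimes with mean 4 and averages the remaining lifetime of the survivors past several ages t.

```python
import random

# Simulation of the memoryless property: for an exponential lifetime with
# mean 4, the average remaining lifetime of the machines surviving past
# age t is again about 4, whatever t is.

random.seed(0)
mean_life = 4.0
lifetimes = [random.expovariate(1 / mean_life) for _ in range(200_000)]

residuals = []
for t in (2, 5, 10):
    survivors = [x for x in lifetimes if x > t]
    residuals.append(sum(x - t for x in survivors) / len(survivors))
# each entry of residuals is close to 4, regardless of the age t
```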

_____________________________________________________________________________________________________________________________

Mean Residual Life

If a 40-year old man who is a non-smoker wants to purchase a life insurance policy, the insurance company is interested in knowing the expected remaining lifetime of the prospective policyholder. This information helps determine the pricing of the life insurance policy. The expected remaining lifetime of the prospective policyholder is called the mean residual life and is the conditional mean E(X-t \lvert X>t), where X is a model for the lifetime of some life.

In engineering and manufacturing applications, probability modeling of lifetimes of objects (e.g. devices, systems or machines) is known as reliability theory. The mean residual life also plays an important role in such applications.

Thus if the random variable X is a lifetime model (lifetime of a life, system or device), then the conditional mean E(X-t \lvert X>t) is called the mean residual life and is the expected remaining lifetime of the life or system in question given that the life has survived to age t.

On the other hand, if the random variable X is a model of insurance losses, then the conditional mean E(X-t \lvert X>t) is the expected claim payment per loss given that the loss has exceeded the deductible of t. In this interpretation, the conditional mean E(X-t \lvert X>t) is called the mean excess loss function.

_____________________________________________________________________________________________________________________________

Summary

In conclusion, we summarize the approach for calculating the two conditional means demonstrated in the above examples.

Suppose X is a continuous random variable with the support being (0,\infty) (the positive real numbers), with f(x) being the density function. The following is the density function of the conditional probability distribution given that X>t.

    \displaystyle f(x \lvert X>t)=\frac{f(x)}{P(X>t)}, \ \ \ \ \ \ \ \ \ x>t

Then we have the two conditional means:

    \displaystyle E(X \lvert X>t)=\int_t^\infty x \ f(x \lvert X>t) \ dx=\int_t^\infty x \ \frac{f(x)}{P(X>t)} \ dx

    \displaystyle E(X-t \lvert X>t)=\int_t^\infty (x-t) \ f(x \lvert X>t) \ dx=\int_t^\infty (x-t) \ \frac{f(x)}{P(X>t)} \ dx

If E(X \lvert X>t) is calculated first (or is easier to calculate), then E(X-t \lvert X>t)=E(X \lvert X>t)-t, as shown in the above examples.
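The two formulas above can be wrapped in a small helper. In the Python sketch below, mean_residual_life is our own function name (not a standard library routine); it computes E(X-t \lvert X>t) for a density f supported on (0, upper) by numerical integration, and is checked against the answer 2.6 from Example 1.

```python
# E(X - t | X > t) by numerical integration, per the summary formulas.

def integrate(g, a, b, n=20_000):
    """Simple midpoint-rule numerical integration of g over [a, b]."""
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

def mean_residual_life(f, t, upper):
    p = integrate(f, t, upper)                               # P(X > t)
    return integrate(lambda x: (x - t) * f(x), t, upper) / p

# Check against Example 1: lifetime density on (0, 10), t = 2 gives 2.6.
f = lambda x: 3 / 2500 * (100 * x - 20 * x**2 + x**3)
mrl = mean_residual_life(f, 2, 10)   # about 2.6
```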

If X is a discrete random variable, then the integrals are replaced by summation symbols. As indicated above, the conditional mean E(X-t \lvert X>t) is called the mean residual life when X is a probability model of the lifetime of some system or life.

_____________________________________________________________________________________________________________________________

Practice Problems

Practice problems are found in the companion blog.

_____________________________________________________________________________________________________________________________

\copyright \ \text{2013 by Dan Ma}


Conditional Distributions, Part 1

We illustrate the thought process of conditional distributions with a series of examples. These examples are presented in a series of blog posts. In this post, we look at some conditional distributions derived from discrete probability distributions.

Practice problems are found in the companion blog.

_____________________________________________________________________________________________________________________________

The Setting

Suppose we have a discrete random variable X with f(x)=P(X=x) as the probability mass function. Suppose some random experiment can be modeled by the discrete random variable X. The sample space S for this probability experiment is the set of sample points with positive probability masses, i.e. S is the set of all x for which f(x)=P(X=x)>0. In the examples below, S is either a subset of the real line \mathbb{R} or a subset of the plane \mathbb{R} \times \mathbb{R}. Conceivably the sample space could be a subset of a higher-dimensional Euclidean space \mathbb{R}^n.

Suppose that we are informed that some event A in the random experiment has occurred (A is a subset of the sample space S). Given this new information, all the sample points outside of the event A are irrelevant. Or perhaps, in this random experiment, we are only interested in those outcomes that are elements of some subset A of the sample space S. In either of these scenarios, we wish to make the event A the new sample space.

The probability of the event A, denoted by P(A), is derived by summing the probabilities f(x)=P(X=x) over all the sample points x \in A. We have:

    \displaystyle P(A)=\sum_{x \in A} P(X=x)

The probability P(A) may not be 1.0. So the probability masses f(x)=P(X=x) for the sample points x \in A, if they are unadjusted, may not form a probability distribution. However, if we consider each such probability mass f(x)=P(X=x) as a proportion of the probability P(A), then the probability masses of the event A will form a probability distribution. For example, say the event A consists of two probability masses 0.2 and 0.3, which sum to 0.5. Then in the new sample space, the first probability mass is 0.4 (0.2 multiplied by \displaystyle \frac{1}{0.5} or divided by 0.5) and the second probability mass is 0.6.

We now summarize the above paragraph. Using the event A as a new sample space, the probability mass function is:

    \displaystyle f(x \lvert A)=\frac{f(x)}{P(A)}=\frac{P(X=x)}{P(A)}, \ \ \ \ \ \ \ \ \ x \in A

The above probability distribution is called the conditional distribution of X given the event A, denoted by X \lvert A. This new probability distribution incorporates new information about the results of a random experiment.

Once this new probability distribution is established, we can compute various distributional quantities (e.g. cumulative distribution function, mean, variance and other higher moments).
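The renormalization described above can be written out in a few lines of Python. In the sketch below, conditional_pmf is our own helper name (not a library call); the example data reuses the two-mass illustration of 0.2 and 0.3 becoming 0.4 and 0.6.

```python
# The renormalization step: given a pmf as a dict and an event A (a set of
# sample points), divide each mass in A by P(A).

def conditional_pmf(pmf, A):
    p_A = sum(pmf[x] for x in A)         # total probability of the event
    return {x: pmf[x] / p_A for x in A}

# The two-mass illustration above: masses 0.2 and 0.3 become 0.4 and 0.6.
pmf = {"a": 0.2, "b": 0.3, "c": 0.5}
cond = conditional_pmf(pmf, {"a", "b"})
```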

_____________________________________________________________________________________________________________________________

Examples

Suppose that two students take a multiple choice test that has 5 questions. Let X be the number of correct answers of one student and Y be the number of correct answers of the other student (these can be considered as test scores for the purpose of the examples here). Assume that X and Y are independent. The following shows the probability functions.

\displaystyle \begin{array}{c|c|c} \text{Count of Correct Answers} & P(X=x) & P(Y=y) \\ \hline 0 & 0.4 & 0.1 \\ 1 & 0.2 & 0.1 \\ 2 & 0.1 & 0.2 \\ 3 & 0.1 & 0.2 \\ 4 & 0.1 & 0.2 \\ 5 & 0.1 & 0.2 \end{array}

Note that E(X)=1.6 and E(Y)=2.9. Without any additional information, we expect the student represented by X to get 1.6 correct answers on average and the student represented by Y to get 2.9. If having 3 or more correct answers is considered passing, then the student represented by X has a 30% chance of passing while the student represented by Y has a 60% chance of passing. The following examples show how these expectations can change as soon as new information is known.

The following examples are based on these two test scores X and Y.

Example 1

In this example, we only consider the student whose correct answers are modeled by the random variable X. In addition to knowing the probability function P(X=x), we also know that this student has at least one correct answer (i.e. the new information is X>0).

In light of the new information, the new sample space is A=\left\{1,2,3,4,5 \right\}. Note that P(A)=0.6. In this new sample space, each probability mass is the original one divided by 0.6. For example, for the sample point X=1, we have \displaystyle P(X=1 \lvert X>0)=\frac{0.2}{0.6}=\frac{2}{6}. The following is the conditional probability distribution of X given X>0.

      \displaystyle P(X=1 \lvert X>0)=\frac{2}{6}

      \displaystyle P(X=2 \lvert X>0)=\frac{1}{6}

      \displaystyle P(X=3 \lvert X>0)=\frac{1}{6}

      \displaystyle P(X=4 \lvert X>0)=\frac{1}{6}

      \displaystyle P(X=5 \lvert X>0)=\frac{1}{6}

The conditional mean is the mean of the conditional distribution. We have \displaystyle E(X \lvert X>0)=\frac{16}{6}=2.67. Given that this student is knowledgeable enough to answer some question correctly, the expectation is higher than before knowing the additional information. Also, given the new information, the student in question has a 50% chance of passing (vs. 30% before the new information is known).

Example 2

We now look at a joint distribution that has a 2-dimensional sample space. Consider the joint distribution of test scores X and Y. If the new information is that the total number of correct answers among them is 4, how would this change our expectation of their performance?

Since X and Y are independent, the sample space consists of the 36 points of the square grid indicated in the figure below.


Figure 1 – Sample Space of Test Scores

Because the two scores are independent, the joint probability at each of these 36 sample points is the product of the individual probabilities. We have P(X=x,Y=y)=P(X=x) \times P(Y=y). The following figure shows one such joint probability.

Figure 2 – Joint Probability Function

After taking the test, suppose that we have the additional information that the two students have a total of 4 correct answers. With this new information, we can focus our attention on the new sample space that is indicated in the following figure.

Figure 3 – New Sample Space

Now we wish to discuss the conditional probability distribution of X \lvert X+Y=4 and the conditional probability distribution of Y \lvert X+Y=4. In particular, given that there are 4 correct answers between the two students, what would be their expected numbers of correct answers and what would be their chances of passing?

There are 5 sample points in the new sample space (the 5 points circled above). The conditional probability distribution is obtained by making each probability mass as a fraction of the sum of the 5 probability masses. First we calculate the 5 joint probabilities.

      \displaystyle P(X=0,Y=4)=P(X=0) \times P(Y=4) =0.4 \times 0.2=0.08

      \displaystyle P(X=1,Y=3)=P(X=1) \times P(Y=3) =0.2 \times 0.2=0.04

      \displaystyle P(X=2,Y=2)=P(X=2) \times P(Y=2) =0.1 \times 0.2=0.02

      \displaystyle P(X=3,Y=1)=P(X=3) \times P(Y=1) =0.1 \times 0.1=0.01

      \displaystyle P(X=4,Y=0)=P(X=4) \times P(Y=0) =0.1 \times 0.1=0.01

The sum of these 5 joint probabilities is P(X+Y=4)=0.16. Making each of these joint probabilities as a fraction of 0.16, we have the following two conditional probability distributions.

      \displaystyle P(X=0 \lvert X+Y=4)=\frac{8}{16} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(Y=0 \lvert X+Y=4)=\frac{1}{16}

      \displaystyle P(X=1 \lvert X+Y=4)=\frac{4}{16} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(Y=1 \lvert X+Y=4)=\frac{1}{16}

      \displaystyle P(X=2 \lvert X+Y=4)=\frac{2}{16} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(Y=2 \lvert X+Y=4)=\frac{2}{16}

      \displaystyle P(X=3 \lvert X+Y=4)=\frac{1}{16} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(Y=3 \lvert X+Y=4)=\frac{4}{16}

      \displaystyle P(X=4 \lvert X+Y=4)=\frac{1}{16} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(Y=4 \lvert X+Y=4)=\frac{8}{16}

The following are the conditional means given X+Y=4, compared against the unconditional means.

      \displaystyle E(X \lvert X+Y=4)=\frac{0+4+4+3+4}{16}=\frac{15}{16}=0.9375 \ \ \ \ \ \ \ \ \text{vs} \ \ E(X)=1.6

      \displaystyle E(Y \lvert X+Y=4)=\frac{0+1+4+12+32}{16}=\frac{49}{16}=3.0625 \ \ \ \ \ \text{vs} \ \ E(Y)=2.9

Now compare the chances of passing.

      \displaystyle P(X \ge 3 \lvert X+Y=4)=\frac{4}{16}=0.25 \ \ \ \ \ \ \ \ \ \ \text{vs} \ \ P(X \ge 3)=0.3

      \displaystyle P(Y \ge 3 \lvert X+Y=4)=\frac{14}{16}=0.875 \ \ \ \ \ \ \ \ \text{vs} \ \ P(Y \ge 3)=0.6

Based on the new information that X+Y=4, we have a lower expectation for the student represented by X and a higher expectation for the student represented by Y. Observe that the conditional probability at X=0 increases to 0.5 from 0.4, while the conditional probability at Y=4 increases to 0.5 from 0.2.
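The numbers in this example can be reproduced with exact arithmetic using Python's fractions module. The dictionaries below transcribe the probability functions of X and Y from the table; everything else follows the renormalization recipe.

```python
from fractions import Fraction as F

# Example 2: build the joint pmf of the independent scores X and Y,
# condition on X + Y = 4, and recover the conditional means exactly.

pX = {0: F(4,10), 1: F(2,10), 2: F(1,10), 3: F(1,10), 4: F(1,10), 5: F(1,10)}
pY = {0: F(1,10), 1: F(1,10), 2: F(2,10), 3: F(2,10), 4: F(2,10), 5: F(2,10)}

event = {(x, y): pX[x] * pY[y] for x in pX for y in pY if x + y == 4}
p_event = sum(event.values())                      # P(X + Y = 4) = 0.16

cond = {pt: mass / p_event for pt, mass in event.items()}
mean_X = sum(x * m for (x, _), m in cond.items())  # E(X | X + Y = 4) = 15/16
mean_Y = sum(y * m for (_, y), m in cond.items())  # E(Y | X + Y = 4) = 49/16
```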

Example 3

Now suppose the new information is that the two students do well on the test. Particularly, their combined number of correct answers is greater than or equal to 5, i.e., X+Y \ge 5. How would this impact the conditional distributions?

First we discuss the conditional distributions for X \lvert X+Y \ge 5 and Y \lvert X+Y \ge 5. By considering the new information, the following is the new sample space.

Figure 4 – New Sample Space

To derive the conditional distribution of X \lvert X+Y \ge 5, sum the joint probabilities within the new sample space for each X=x. The calculation is shown below.

      \displaystyle P(X=0 \cap X+Y \ge 5)=0.4 \times 0.2=0.08

      \displaystyle P(X=1 \cap X+Y \ge 5)=0.2 \times (0.2+0.2)=0.08

      \displaystyle P(X=2 \cap X+Y \ge 5)=0.1 \times (0.2+0.2+0.2)=0.06

      \displaystyle P(X=3 \cap X+Y \ge 5)=0.1 \times (0.2+0.2+0.2+0.2)=0.08

      \displaystyle P(X=4 \cap X+Y \ge 5)=0.1 \times (1-0.1)=0.09

      \displaystyle P(X=5 \cap X+Y \ge 5)=0.1 \times (1)=0.10

The sum of these probabilities is 0.49, which is P(X+Y \ge 5). The conditional distribution of X \lvert X+Y \ge 5 is obtained by taking each of the above probabilities as a fraction of 0.49. We have:

      \displaystyle P(X=0 \lvert X+Y \ge 5)=\frac{8}{49}=0.163

      \displaystyle P(X=1 \lvert X+Y \ge 5)=\frac{8}{49}=0.163

      \displaystyle P(X=2 \lvert X+Y \ge 5)=\frac{6}{49}=0.122

      \displaystyle P(X=3 \lvert X+Y \ge 5)=\frac{8}{49}=0.163

      \displaystyle P(X=4 \lvert X+Y \ge 5)=\frac{9}{49}=0.184

      \displaystyle P(X=5 \lvert X+Y \ge 5)=\frac{10}{49}=0.204

We have the conditional mean \displaystyle E(X \lvert X+Y \ge 5)=\frac{0+8+12+24+36+50}{49}=\frac{130}{49}=2.653 (vs. E(X)=1.6). The conditional probability of passing is \displaystyle P(X \ge 3 \lvert X+Y \ge 5)=\frac{27}{49}=0.55 (vs. P(X \ge 3)=0.3).

Note that the above conditional distribution for X \lvert X+Y \ge 5 is not as skewed as the original one for X. With the information that both test takers do well, the expected score for the student represented by X is much higher.

With similar calculation we have the following results for the conditional distribution of Y \lvert X+Y \ge 5.

      \displaystyle P(Y=0 \lvert X+Y \ge 5)=\frac{1}{49}=0.02

      \displaystyle P(Y=1 \lvert X+Y \ge 5)=\frac{2}{49}=0.04

      \displaystyle P(Y=2 \lvert X+Y \ge 5)=\frac{6}{49}=0.122

      \displaystyle P(Y=3 \lvert X+Y \ge 5)=\frac{8}{49}=0.163

      \displaystyle P(Y=4 \lvert X+Y \ge 5)=\frac{12}{49}=0.245

      \displaystyle P(Y=5 \lvert X+Y \ge 5)=\frac{20}{49}=0.408

We have the conditional mean \displaystyle E(Y \lvert X+Y \ge 5)=\frac{0+2+12+24+48+100}{49}=\frac{186}{49}=3.8 (vs. E(Y)=2.9). The conditional probability of passing is \displaystyle P(Y \ge 3 \lvert X+Y \ge 5)=\frac{40}{49}=0.82 (vs. P(Y \ge 3)=0.6). Indeed, with the information that both test takers do well, we can expect much higher results from each individual test taker.
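The conditional means and passing probabilities of this example can likewise be verified with exact arithmetic in Python (the dictionaries below transcribe the probability functions from the table).

```python
from fractions import Fraction as F

# Example 3: condition the joint pmf of the two scores on X + Y >= 5.

pX = {0: F(4,10), 1: F(2,10), 2: F(1,10), 3: F(1,10), 4: F(1,10), 5: F(1,10)}
pY = {0: F(1,10), 1: F(1,10), 2: F(2,10), 3: F(2,10), 4: F(2,10), 5: F(2,10)}

event = {(x, y): pX[x] * pY[y] for x in pX for y in pY if x + y >= 5}
p_event = sum(event.values())                      # P(X + Y >= 5) = 0.49

cond = {pt: m / p_event for pt, m in event.items()}
mean_X = sum(x * m for (x, _), m in cond.items())         # 130/49
pass_X = sum(m for (x, _), m in cond.items() if x >= 3)   # 27/49
mean_Y = sum(y * m for (_, y), m in cond.items())         # 186/49
pass_Y = sum(m for (_, y), m in cond.items() if y >= 3)   # 40/49
```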

Example 4

In Examples 2 and 3, the new information involves both test takers (both random variables). If the new information involves just one test taker, it may have no bearing on the exam score of the other student. For example, suppose that Y \ge 4. Then what is the conditional distribution of X \lvert Y \ge 4? Since X and Y are independent, the high score Y \ge 4 has no impact on the score X. However, the high joint score X+Y \ge 5 does have an impact on each of the individual scores (Example 3).

_____________________________________________________________________________________________________________________________

Summary

We conclude with a summary of the thought process of conditional distributions.

Suppose X is a discrete random variable and f(x)=P(X=x) is its probability function. Further suppose that X is the probability model of some random experiment. The sample space of this random experiment is S.

Suppose we have some new information that in this random experiment, some event A has occurred. The event A is a subset of the sample space S.

To incorporate this new information, the event A is the new sample space. The random variable incorporated with the new information, denoted by X \lvert A, has a conditional probability distribution. The following is the probability function of the conditional distribution.

    \displaystyle f(x \lvert A)=\frac{f(x)}{P(A)}=\frac{P(X=x)}{P(A)}, \ \ \ \ \ \ \ \ \ x \in A

where P(A) = \displaystyle \sum_{x \in A} P(X=x).

The thought process is that the conditional distribution is derived by taking each original probability mass as a fraction of the total probability P(A). The probability function derived in this manner reflects the new information that the event A has occurred.

Once the conditional probability function is derived, it can be used just like any other probability function, e.g. for computing various distributional quantities.

_____________________________________________________________________________________________________________________________

Practice Problems

Practice problems are found in the companion blog.

_____________________________________________________________________________________________________________________________

\copyright \ \text{2013 by Dan Ma}

Picking Two Types of Binomial Trials

We motivate the discussion with the following example. The notation W \sim \text{binom}(n,p) denotes the statement that W has a binomial distribution with parameters n and p. In other words, W is the number of successes in a sequence of n independent Bernoulli trials where p is the probability of success in each trial.

Example 1
Suppose that a student took two multiple choice quizzes in a probability and statistics course. Each quiz has 5 questions. Each question has 4 choices and only one of the choices is correct. Suppose that the student answered all the questions by pure guessing. Furthermore, the two quizzes are independent (i.e. the results of one quiz do not affect the results of the other quiz). Let X be the number of correct answers in the first quiz and Y be the number of correct answers in the second quiz. Suppose the student was told by the instructor that she had a total of 4 correct answers in these two quizzes. What is the probability that she had 3 correct answers in the first quiz?

On the face of it, the example is all about the binomial distribution. Both X and Y are binomial random variables (both \sim \text{binom}(5,\frac{1}{4})). The sum X+Y also has a binomial distribution (\sim \text{binom}(10,\frac{1}{4})). The question being asked is a conditional probability, i.e., P(X=3 \lvert X+Y=4). Surprisingly, this conditional probability can be computed using the hypergeometric distribution. One can always work this problem from first principles using binomial distributions. As discussed below, for a problem such as Example 1, it is always possible to replace the binomial distributions with a thought process involving the hypergeometric distribution.

Here’s how to think about the problem. This student took the two quizzes and was given the news by the instructor that she had 4 correct answers in total. She now wonders what the probability of having 3 correct answers in the first quiz is. The thought process is this. She is to pick 4 questions from 10 questions (5 of them from Quiz 1 and 5 of them from Quiz 2). So she is picking 4 objects from a group of two distinct types of objects. This is akin to reaching into a jar that has 5 red balls and 5 blue balls and picking 4 balls without replacement. What is the probability of picking 3 red balls and 1 blue ball? The probability just described comes from a hypergeometric distribution. The following shows the calculation.

\displaystyle (1) \ \ \ \ P(X=3 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{3} \ \binom{5}{1}}{\displaystyle \binom{10}{4}}=\frac{50}{210}

We will show below why this works. Before we do that, let’s describe the above thought process. Whenever you have two independent binomial distributions X and Y with the same probability of success p (the number of trials does not have to be the same), the conditional distribution X \lvert X+Y=a is a hypergeometric distribution. Interestingly, the probability of success p has no bearing on this observation. For Example 1, we have the following calculation.

\displaystyle (2a) \ \ \ \ P(X=0 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{0} \ \binom{5}{4}}{\displaystyle \binom{10}{4}}=\frac{5}{210}

\displaystyle (2b) \ \ \ \ P(X=1 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{1} \ \binom{5}{3}}{\displaystyle \binom{10}{4}}=\frac{50}{210}

\displaystyle (2c) \ \ \ \ P(X=2 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{2} \ \binom{5}{2}}{\displaystyle \binom{10}{4}}=\frac{100}{210}

\displaystyle (2d) \ \ \ \ P(X=3 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{3} \ \binom{5}{1}}{\displaystyle \binom{10}{4}}=\frac{50}{210}

\displaystyle (2e) \ \ \ \ P(X=4 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{4} \ \binom{5}{0}}{\displaystyle \binom{10}{4}}=\frac{5}{210}

Interestingly, the conditional mean E(X \lvert X+Y=4)=2, while the unconditional mean E(X)=5 \times \frac{1}{4}=1.25. The fact that the conditional mean is higher is not surprising. The student was lucky enough to have obtained 4 correct answers by guessing. Given this, she had a greater chance of doing better on the first quiz.
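The table (2a)-(2e) can be verified two ways in Python: from first principles with the binomial pmfs (p = 1/4 as in Example 1), and with the hypergeometric shortcut. Exact fractions are used so the match is exact, not approximate; binom_pmf is our own helper name.

```python
from math import comb
from fractions import Fraction as F

def binom_pmf(n, p, k):
    """Binomial probability C(n, k) p^k (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

p = F(1, 4)
# P(X + Y = 4), summed from first principles over the 5 sample points.
p_total = sum(binom_pmf(5, p, j) * binom_pmf(5, p, 4 - j) for j in range(5))

# Each conditional probability matches the hypergeometric expression.
matches = all(
    binom_pmf(5, p, k) * binom_pmf(5, p, 4 - k) / p_total
    == F(comb(5, k) * comb(5, 4 - k), comb(10, 4))
    for k in range(5)
)

# The conditional mean E(X | X + Y = 4) from the hypergeometric pmf:
cond_mean = sum(
    k * F(comb(5, k) * comb(5, 4 - k), comb(10, 4)) for k in range(5)
)
```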

__________________________________________________
Why This Works

Suppose X \sim \text{binom}(5,p) and Y \sim \text{binom}(5,p) and they are independent. The joint distribution of X and Y has 36 points in the sample space. See the following diagram.

Figure 1

The probability attached to each point is

\displaystyle \begin{aligned}(3) \ \ \ \  P(X=x,Y=y)&=P(X=x) \times P(Y=y) \\&=\binom{5}{x} p^x (1-p)^{5-x} \times \binom{5}{y} p^y (1-p)^{5-y}  \end{aligned}

where x=0,1,2,3,4,5 and y=0,1,2,3,4,5.

The conditional probability P(X=k \lvert X+Y=4) involves 5 points as indicated in the following diagram.

Figure 2

The conditional probability P(X=k \lvert X+Y=4) is simply the probability of one of the 5 sample points as a fraction of the sum total of the 5 sample points encircled in the above diagram. The following is the sum total of the probabilities of the 5 points indicated in Figure 2.

\displaystyle \begin{aligned}(4) \ \ \ \  P(X+Y=4)&=P(X=0) \times P(Y=4)+P(X=1) \times P(Y=3)\\&\ \ \ \ +P(X=2) \times P(Y=2)+P(X=3) \times P(Y=1)\\&\ \ \ \ +P(X=4) \times P(Y=0)  \end{aligned}

We can plug (3) into (4) and work out the calculation. But (4) is actually equivalent to the following because X+Y \sim \text{binom}(10,p).

\displaystyle (5) \ \ \ \ P(X+Y=4)=\ \binom{10}{4} p^4 \ (1-p)^{6}

As stated earlier, the conditional probability P(X=k \lvert X+Y=4) is simply the probability of one of the 5 sample points as a fraction of the sum total of the 5 sample points encircled in Figure 2. Thus we have:

\displaystyle \begin{aligned}(6) \ \ \ \  P(X=k \lvert X+Y=4)&=\frac{P(X=k) \times P(Y=4-k)}{P(X+Y=4)} \\&=\frac{\displaystyle \binom{5}{k} p^k (1-p)^{5-k} \times \binom{5}{4-k} p^{4-k} (1-p)^{5-(4-k)}}{\displaystyle  \binom{10}{4} p^4 \ (1-p)^{6}}   \end{aligned}

After the terms involving p and 1-p cancel out, we have:

\displaystyle (7) \ \ \ \  P(X=k \lvert X+Y=4)=\frac{\displaystyle \binom{5}{k} \times \binom{5}{4-k}}{\displaystyle  \binom{10}{4}}

__________________________________________________
Summary

Suppose X \sim \text{binom}(N,p) and Y \sim \text{binom}(M,p) and they are independent. Then X+Y is also a binomial distribution, i.e., \sim \text{binom}(N+M,p). Suppose that both binomial experiments \text{binom}(N,p) and \text{binom}(M,p) have been performed and it is known that there are a successes in total. Then X \lvert X+Y=a has a hypergeometric distribution.

\displaystyle (8) \ \ \ \  P(X=k \lvert X+Y=a)=\frac{\displaystyle \binom{N}{k} \times \binom{M}{a-k}}{\displaystyle  \binom{N+M}{a}}

where k=\text{max}(0,a-M),\cdots,\text{min}(N,a).

As discussed earlier, think of the N trials in \text{binom}(N,p) as red balls and think of the M trials in \text{binom}(M,p) as blue balls in a jar. Think of the a successes as the number of balls you are about to draw from the jar. So you reach into the jar and select a balls without replacement. The calculation in (8) gives the probability that you select k red balls and a-k blue balls.

The probability of success p in the two binomial distributions has no bearing on the result, since it cancels out in the derivation. One can always work a problem like Example 1 from first principles. Once the thought process using the hypergeometric distribution is understood, it is a great way to solve this kind of problem: you can bypass the binomial distributions and go straight to the hypergeometric distribution.
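Formula (8) can be written as a function and checked against the first-principles binomial computation for a few arbitrary choices of N, M and a (the particular triples below are ours, for illustration). As noted above, the value of p is irrelevant; we use p = 1/3 here.

```python
from math import comb
from fractions import Fraction as F

def cond_binom(N, M, p, a, k):
    """P(X = k | X + Y = a) computed directly from the binomial pmfs."""
    def b(n, j):
        return comb(n, j) * p**j * (1 - p)**(n - j)
    return b(N, k) * b(M, a - k) / b(N + M, a)

def hypergeom(N, M, a, k):
    """Formula (8): the hypergeometric probability."""
    return F(comb(N, k) * comb(M, a - k), comb(N + M, a))

# The two agree exactly for every valid k, independently of p.
ok = all(
    cond_binom(N, M, F(1, 3), a, k) == hypergeom(N, M, a, k)
    for (N, M, a) in [(5, 5, 4), (6, 3, 2), (4, 7, 5)]
    for k in range(max(0, a - M), min(N, a) + 1)
)
```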

__________________________________________________
Additional Practice
Practice problems are found in the following blog post.

How to pick binomial trials