Conditional Distributions, Part 2

We present more examples to further illustrate the thought process of conditional distributions. A conditional distribution is a probability distribution derived from a given probability distribution by focusing on a subset of the original sample space (we assume that the probability distribution being discussed is a model for some random experiment). The new sample space (the subset of the original one) may be some outcomes that are of interest to an experimenter in a random experiment or may reflect some new information we know about the random experiment in question. We illustrate this thought process in the previous post Conditional Distributions, Part 1 using discrete distributions. In this post, we present some continuous examples for conditional distributions. One concept illustrated by the examples in this post is the notion of mean residual life, which has an insurance interpretation (e.g. the average remaining time until death given that the life in question is alive at a certain age).

_____________________________________________________________________________________________________________________________

The Setting

The thought process of conditional distributions is discussed in the previous post Conditional Distributions, Part 1. We repeat the same discussion using continuous distributions.

Let X be a continuous random variable that describes a certain random experiment. Let S be the sample space of this random experiment. Let f(x) be its probability density function.

We assume that X is a univariate random variable, meaning that the sample space S is the real line \mathbb{R} or a subset of \mathbb{R}. Since X is a continuous random variable, we know that S would contain an interval, say, (a,b).

Suppose that in the random experiment in question, a certain event A has occurred. The probability of the event A is obtained by integrating the density function over the set A.

    \displaystyle P(A)=\int_{x \in A} f(x) \ dx

Since the event A has occurred, P(A)>0. Since we are dealing with a continuous distribution, the set A would contain an interval, say (c,d) (otherwise P(A)=0). So the new probability distribution we define is also a continuous distribution. The following is the density function defined on the new sample space A.

    \displaystyle f(x \lvert A)=\frac{f(x)}{P(A)}, \ \ \ \ \ \ \ \ \ x \in A

The above probability distribution is called the conditional distribution of X given the event A, denoted by X \lvert A. This new probability distribution incorporates new information about the results of a random experiment.

Once this new probability distribution is established, we can compute various distributional quantities (e.g. cumulative distribution function, mean, variance and other higher moments).

_____________________________________________________________________________________________________________________________

Examples

Example 1

Let X be the lifetime (in years) of a brand new computer purchased from a certain manufacturer. Suppose that the following is the density function of the random variable X.

    \displaystyle f(x)=\frac{3}{2500} \ (100x-20x^2 + x^3), \ \ \ \ \ \ \ \ 0<x<10

Suppose that you have just purchased one such computer that is 2 years old and in good working condition. We have the following questions.

  • What is the expected lifetime of this 2-year old computer?
  • What is the expected number of years of service that will be provided by this 2-year old computer?

Both calculations are conditional means since the computer in question already survived to age 2. However, there is a slight difference between the two calculations. The first one is the expected age of the 2-year old computer, i.e., the conditional mean E(X \lvert X>2). The second one is the expected remaining lifetime of the 2-year old computer, i.e., E(X-2 \lvert X>2).

For a brand new computer, the sample space is the interval S=0<x<10. Knowing that the computer is already 2-year old, the new sample space is A=2<x<10. The total probability of the new sample space is:

    \displaystyle P(A)=P(X>2)=\int_{2}^{10} \frac{3}{2500} \ (100x-20x^2 + x^3) \ dx=\frac{2048}{2500}=0.8192

The conditional density function of X given X>2 is:

    \displaystyle \begin{aligned} f(x \lvert X>2)&=\frac{\frac{3}{2500} \ (100x-20x^2 + x^3)} {\frac{2048}{2500}} \\&=\frac{3}{2048} \ (100x-20x^2 + x^3), \ \ \ \ \ \ \ \ \ 2<x<10  \end{aligned}

The first conditional mean is:

    \displaystyle \begin{aligned} E(X \lvert X>2)&=\int_2^{10} x \ f(x \lvert X>2) \ dx \\&=\int_2^{10} \frac{3}{2048} \ x(100x-20x^2 + x^3) \ dx \\&=\int_2^{10} \frac{3}{2048} \ (100x^2-20x^3 + x^4) \ dx \\&=\frac{47104}{10240}=4.6 \end{aligned}

The second conditional mean is:

    \displaystyle E(X-2 \lvert X>2)=E(X \lvert X>2)-2=2.6

In contrast, the unconditional mean is:

    \displaystyle E(X)=\int_0^{10} \frac{3}{2500} \ (100x^2-20x^3 + x^4) \ dx=4

So if the lifetime of a computer is modeled by the density function f(x) given here, the expected lifetime of a brand new computer is 4 years. If you know that a computer has already been in use for 2 years and is in good condition, its expected lifetime is 4.6 years, of which 2 years have already passed, so the expected remaining lifetime is 2.6 years.
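These integrals are easy to check numerically. The following is a minimal sketch (assuming Python with SciPy installed) that reproduces P(X>2), E(X \lvert X>2) and E(X) by direct numerical integration.

```python
from scipy.integrate import quad

# density of the lifetime X on (0, 10) from Example 1
f = lambda x: (3 / 2500) * (100 * x - 20 * x**2 + x**3)

# P(X > 2): total probability of the new sample space (2, 10)
p_A, _ = quad(f, 2, 10)

# E(X | X > 2): integrate x times the conditional density over (2, 10)
cond_mean, _ = quad(lambda x: x * f(x) / p_A, 2, 10)

# unconditional mean E(X)
mean, _ = quad(lambda x: x * f(x), 0, 10)

print(p_A)            # 0.8192
print(cond_mean)      # 4.6
print(cond_mean - 2)  # 2.6, the expected remaining lifetime at age 2
print(mean)           # 4.0
```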

Note that the following calculation is not E(X \lvert X>2), though it is a calculation that some students may be tempted to do.

    \displaystyle \int_2^{10} x \ f(x) \ dx =\int_2^{10} \frac{3}{2500} \ x(100x-20x^2 + x^3) \ dx=\frac{47104}{12500}=3.76832

The above calculation does not use the conditional distribution that X>2. Also note that the answer is less than the unconditional mean E(X).

Example 2 – Exponential Distribution

Work Example 1 again by assuming that the lifetime of the type of computers in question follows the exponential distribution with mean 4 years.

The following is the density function of the lifetime X.

    \displaystyle f(x)=0.25 \ e^{-0.25 x}, \ \ \ \ \ \ 0<x<\infty

The probability that the computer has survived to age 2 is:

    \displaystyle P(X>2)=\int_2^\infty 0.25 \ e^{-0.25 x} \ dx=e^{-0.25 (2)}=e^{-0.5}

The conditional density function given that X>2 is:

    \displaystyle f(x \lvert X>2)= \frac{0.25 \ e^{-0.25 x}}{e^{-0.25 (2)}}=0.25 \ e^{-0.25 (x-2)}, \ \ \ \ \ \ \ 2<x<\infty

To compute the conditional mean E(X \lvert X>2), we have

    \displaystyle \begin{aligned} E(X \lvert X>2)&=\int_2^\infty x \ f(x \lvert X>2) \ dx \\&=\int_2^\infty 0.25 \ x \ e^{-0.25 (x-2)} \ dx \\&=\int_0^\infty 0.25 \ (u+2) \ e^{-0.25 u} \ du \ \ \ (\text{change of variable}) \\&=\int_0^\infty 0.25 \ u \ e^{-0.25 u} \ du+2\int_0^\infty 0.25 \ e^{-0.25 u} \ du \\&=\frac{1}{0.25}+2=4+2=6\end{aligned}

Then \displaystyle E(X-2 \lvert X>2)=E(X \lvert X>2)-2=6-2=4.

We have an interesting result here. The expected lifetime of a brand new computer is 4 years. Yet the expected remaining lifetime of a 2-year old computer is still 4 years! This is the no-memory (memoryless) property of the exponential distribution – if the lifetime of a type of machine is distributed according to an exponential distribution, it does not matter how old the machine is; the expected remaining lifetime is always the same as the unconditional mean. This point indicates that the exponential distribution is not an appropriate model for the lifetimes of machines or biological lives that wear out over time.
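The memoryless property can also be seen in a quick simulation. Below is a minimal sketch (assuming Python with NumPy) that draws a large number of exponential lifetimes with mean 4 and estimates the mean remaining lifetime of the computers surviving past age 2.

```python
import numpy as np

rng = np.random.default_rng(0)
lifetimes = rng.exponential(scale=4.0, size=1_000_000)  # exponential lifetimes, mean 4 years

survivors = lifetimes[lifetimes > 2]    # condition on X > 2
print(survivors.mean())                 # approximately 6, i.e. E(X | X > 2)
print((survivors - 2).mean())           # approximately 4, i.e. E(X - 2 | X > 2), same as E(X)
```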

_____________________________________________________________________________________________________________________________

Mean Residual Life

If a 40-year old man who is a non-smoker wants to purchase a life insurance policy, the insurance company is interested in knowing the expected remaining lifetime of the prospective policyholder. This information will help determine the pricing of the life insurance policy. The expected remaining lifetime of the prospective policyholder is called the mean residual life and is the conditional mean E(X-t \lvert X>t) where X is a model for the lifetime of some life.

In engineering and manufacturing applications, probability modeling of lifetimes of objects (e.g. devices, systems or machines) is known as reliability theory. The mean residual life also plays an important role in such applications.

Thus if the random variable X is a lifetime model (lifetime of a life, system or device), then the conditional mean E(X-t \lvert X>t) is called the mean residual life and is the expected remaining lifetime of the life or system in question given that the life has survived to age t.

On the other hand, if the random variable X is a model of insurance losses, then the conditional mean E(X-t \lvert X>t) is the expected claim payment per loss given that the loss has exceeded the deductible of t. In this interpretation, the conditional mean E(X-t \lvert X>t) is called the mean excess loss function.

_____________________________________________________________________________________________________________________________

Summary

In conclusion, we summarize the approach for calculating the two conditional means demonstrated in the above examples.

Suppose X is a continuous random variable with the support being (0,\infty) (the positive real numbers), with f(x) being the density function. The following is the density function of the conditional probability distribution given that X>t.

    \displaystyle f(x \lvert X>t)=\frac{f(x)}{P(X>t)}, \ \ \ \ \ \ \ \ \ x>t

Then we have the two conditional means:

    \displaystyle E(X \lvert X>t)=\int_t^\infty x \ f(x \lvert X>t) \ dx=\int_t^\infty x \ \frac{f(x)}{P(X>t)} \ dx

    \displaystyle E(X-t \lvert X>t)=\int_t^\infty (x-t) \ f(x \lvert X>t) \ dx=\int_t^\infty (x-t) \ \frac{f(x)}{P(X>t)} \ dx

If E(X \lvert X>t) is calculated first (or is easier to calculate), then E(X-t \lvert X>t)=E(X \lvert X>t)-t, as shown in the above examples.

If X is a discrete random variable, then the integrals are replaced by summation symbols. As indicated above, the conditional mean E(X-t \lvert X>t) is called the mean residual life when X is a probability model of the lifetime of some system or life.
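The two formulas above translate directly into a small helper function. The following is a sketch (assuming Python with SciPy; the function name mean_residual_life is ours, not a library routine) that computes E(X-t \lvert X>t) for a continuous lifetime density by numerical integration.

```python
import numpy as np
from scipy.integrate import quad

def mean_residual_life(pdf, t, upper=np.inf):
    """E(X - t | X > t) for a continuous lifetime density pdf supported on (0, upper)."""
    p_survive, _ = quad(pdf, t, upper)                     # P(X > t)
    numer, _ = quad(lambda x: (x - t) * pdf(x), t, upper)  # integral of (x - t) f(x) over (t, upper)
    return numer / p_survive

# Example 1 density: mean residual life at age 2 is 2.6
f = lambda x: (3 / 2500) * (100 * x - 20 * x**2 + x**3)
print(mean_residual_life(f, 2, upper=10))

# exponential density with mean 4: mean residual life is 4 at any age (memoryless)
g = lambda x: 0.25 * np.exp(-0.25 * x)
print(mean_residual_life(g, 2))
```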

_____________________________________________________________________________________________________________________________

Practice Problems

Practice problems are found in the companion blog.

_____________________________________________________________________________________________________________________________

\copyright \ \text{2013 by Dan Ma}

Conditional Distributions, Part 1

We illustrate the thought process of conditional distributions with a series of examples. These examples are presented in a series of blog posts. In this post, we look at some conditional distributions derived from discrete probability distributions.

Practice problems are found in the companion blog.

_____________________________________________________________________________________________________________________________

The Setting

Suppose we have a discrete random variable X with f(x)=P(X=x) as the probability mass function. Suppose some random experiment can be modeled by the discrete random variable X. The sample space S for this probability experiment is the set of sample points with positive probability masses, i.e. S is the set of all x for which f(x)=P(X=x)>0. In the examples below, S is either a subset of the real line \mathbb{R} or a subset of the plane \mathbb{R} \times \mathbb{R}. Conceivably the sample space could be a subset of any Euclidean space \mathbb{R}^n of higher dimension.

Suppose that we are informed that some event A in the random experiment has occurred (A is a subset of the sample space S). Given this new information, all the sample points outside of the event A are irrelevant. Or perhaps, in this random experiment, we are only interested in those outcomes that are elements of some subset A of the sample space S. In either of these scenarios, we wish to make the event A the new sample space.

The probability of the event A, denoted by P(A), is derived by summing the probabilities f(x)=P(X=x) over all the sample points x \in A. We have:

    \displaystyle P(A)=\sum_{x \in A} P(X=x)

The probability P(A) may not be 1.0. So the probability masses f(x)=P(X=x) for the sample points x \in A, if they are unadjusted, may not form a probability distribution. However, if we consider each such probability mass f(x)=P(X=x) as a proportion of the probability P(A), then the probability masses of the event A will form a probability distribution. For example, say the event A consists of two probability masses 0.2 and 0.3, which sum to 0.5. Then in the new sample space, the first probability mass is 0.4 (0.2 multiplied by \displaystyle \frac{1}{0.5} or divided by 0.5) and the second probability mass is 0.6.

We now summarize the above paragraph. Using the event A as a new sample space, the probability mass function is:

    \displaystyle f(x \lvert A)=\frac{f(x)}{P(A)}=\frac{P(X=x)}{P(A)}, \ \ \ \ \ \ \ \ \ x \in A

The above probability distribution is called the conditional distribution of X given the event A, denoted by X \lvert A. This new probability distribution incorporates new information about the results of a random experiment.

Once this new probability distribution is established, we can compute various distributional quantities (e.g. cumulative distribution function, mean, variance and other higher moments).

_____________________________________________________________________________________________________________________________

Examples

Suppose that two students take a multiple choice test that has 5 questions. Let X be the number of correct answers of one student and Y be the number of correct answers of the other student (these can be considered as test scores for the purpose of the examples here). Assume that X and Y are independent. The following shows the probability functions.

      Count of Correct Answers        P(X=x)        P(Y=y)
      0                               0.4           0.1
      1                               0.2           0.1
      2                               0.1           0.2
      3                               0.1           0.2
      4                               0.1           0.2
      5                               0.1           0.2

Note that E(X)=1.6 and E(Y)=2.9. Without any additional information, we expect the first student to get 1.6 correct answers on average and the second student to get 2.9 correct answers on average. If having 3 or more correct answers is considered passing, then the student represented by X has a 30% chance of passing while the student represented by Y has a 60% chance of passing. The following examples show how these expectations can change as soon as new information is known.
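The unconditional quantities just quoted are easy to verify in a few lines of code. Below is a minimal sketch in Python, where the two dictionaries are simply the probability functions from the table.

```python
# probability functions of the two test scores, taken from the table
pX = {0: 0.4, 1: 0.2, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1}
pY = {0: 0.1, 1: 0.1, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.2}

EX = sum(x * p for x, p in pX.items())           # 1.6
EY = sum(y * p for y, p in pY.items())           # 2.9
passX = sum(p for x, p in pX.items() if x >= 3)  # 0.3
passY = sum(p for y, p in pY.items() if y >= 3)  # 0.6
print(EX, EY, passX, passY)
```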

The following examples are based on these two test scores X and Y.

Example 1

In this example, we only consider the student whose correct answers are modeled by the random variable X. In addition to knowing the probability function P(X=x), we also know that this student has at least one correct answer (i.e. the new information is X>0).

In light of the new information, the new sample space is A=\left\{1,2,3,4,5 \right\}. Note that P(A)=0.6. In this new sample space, each probability mass is the original one divided by 0.6. For example, for the sample point X=1, we have \displaystyle P(X=1 \lvert X>0)=\frac{0.2}{0.6}=\frac{2}{6}. The following is the conditional probability distribution of X given X>0.

      \displaystyle P(X=1 \lvert X>0)=\frac{2}{6}

      \displaystyle P(X=2 \lvert X>0)=\frac{1}{6}

      \displaystyle P(X=3 \lvert X>0)=\frac{1}{6}

      \displaystyle P(X=4 \lvert X>0)=\frac{1}{6}

      \displaystyle P(X=5 \lvert X>0)=\frac{1}{6}

The conditional mean is the mean of the conditional distribution. We have \displaystyle E(X \lvert X>0)=\frac{16}{6}=2.67. Given that this student is knowledgeable enough to answer some question correctly, the expectation is higher than before knowing the additional information. Also, given the new information, the student in question has a 50% chance of passing (vs. 30% before the new information is known).
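The same calculation can be done as a quick check in code (a Python sketch; pX is the probability function of X from the table above).

```python
pX = {0: 0.4, 1: 0.2, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1}

# condition on the event A = {X > 0}
pA = sum(p for x, p in pX.items() if x > 0)         # 0.6
cond = {x: p / pA for x, p in pX.items() if x > 0}  # conditional probability function

print(cond)
print(sum(x * p for x, p in cond.items()))          # E(X | X > 0) = 2.666...
print(sum(p for x, p in cond.items() if x >= 3))    # P(X >= 3 | X > 0) = 0.5
```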

Example 2

We now look at a joint distribution that has a 2-dimensional sample space. Consider the joint distribution of test scores X and Y. If the new information is that the total number of correct answers among them is 4, how would this change our expectation of their performance?

The sample space of the joint distribution of X and Y consists of the 36 points (x,y) where both x and y range over 0 to 5, forming a square grid as indicated in the figure below.


Figure 1 – Sample Space of Test Scores

Because the two scores are independent, the joint probability at each of these 36 sample points is the product of the individual probabilities. We have P(X=x,Y=y)=P(X=x) \times P(Y=y). The following figure shows one such joint probability.

Figure 2 – Joint Probability Function

After taking the test, suppose that we have the additional information that the two students have a total of 4 correct answers. With this new information, we can focus our attention on the new sample space that is indicated in the following figure.

Figure 3 – New Sample Space

Now we wish to discuss the conditional probability distribution of X \lvert X+Y=4 and the conditional probability distribution of Y \lvert X+Y=4. In particular, given that there are 4 correct answers between the two students, what would be their expected numbers of correct answers and what would be their chances of passing?

There are 5 sample points in the new sample space (the 5 points circled above). The conditional probability distribution is obtained by making each probability mass as a fraction of the sum of the 5 probability masses. First we calculate the 5 joint probabilities.

      \displaystyle P(X=0,Y=4)=P(X=0) \times P(Y=4) =0.4 \times 0.2=0.08

      \displaystyle P(X=1,Y=3)=P(X=1) \times P(Y=3) =0.2 \times 0.2=0.04

      \displaystyle P(X=2,Y=2)=P(X=2) \times P(Y=2) =0.1 \times 0.2=0.02

      \displaystyle P(X=3,Y=1)=P(X=3) \times P(Y=1) =0.1 \times 0.1=0.01

      \displaystyle P(X=4,Y=0)=P(X=4) \times P(Y=0) =0.1 \times 0.1=0.01

The sum of these 5 joint probabilities is P(X+Y=4)=0.16. Making each of these joint probabilities as a fraction of 0.16, we have the following two conditional probability distributions.

      \displaystyle P(X=0 \lvert X+Y=4)=\frac{8}{16} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(Y=0 \lvert X+Y=4)=\frac{1}{16}

      \displaystyle P(X=1 \lvert X+Y=4)=\frac{4}{16} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(Y=1 \lvert X+Y=4)=\frac{1}{16}

      \displaystyle P(X=2 \lvert X+Y=4)=\frac{2}{16} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(Y=2 \lvert X+Y=4)=\frac{2}{16}

      \displaystyle P(X=3 \lvert X+Y=4)=\frac{1}{16} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(Y=3 \lvert X+Y=4)=\frac{4}{16}

      \displaystyle P(X=4 \lvert X+Y=4)=\frac{1}{16} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(Y=4 \lvert X+Y=4)=\frac{8}{16}

Now the conditional means given that X+Y=4, comparing against the unconditional means.

      \displaystyle E(X \lvert X+Y=4)=\frac{0+4+4+3+4}{16}=\frac{15}{16}=0.9375 \ \ \ \ \ \ \ \ \text{vs} \ \ E(X)=1.6

      \displaystyle E(Y \lvert X+Y=4)=\frac{0+1+4+12+32}{16}=\frac{49}{16}=3.0625 \ \ \ \ \ \text{vs} \ \ E(Y)=2.9

Now compare the chances of passing.

      \displaystyle P(X \ge 3 \lvert X+Y=4)=\frac{2}{16}=0.125 \ \ \ \ \ \ \ \ \ \ \text{vs} \ \ P(X \ge 3)=0.3

      \displaystyle P(Y \ge 3 \lvert X+Y=4)=\frac{12}{16}=0.75 \ \ \ \ \ \ \ \ \text{vs} \ \ P(Y \ge 3)=0.6

Based on the new information of X+Y=4, we have a lower expectation for the student represented by X and a higher expectation for the student represented by Y. Observe that the conditional probability at X=0 increases to 0.5 from 0.4, while the conditional probability at Y=4 increases to 0.5 from 0.2.
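The conditional distribution given X+Y=4 and the quantities above can be checked with a short script. A sketch in Python:

```python
from itertools import product

pX = {0: 0.4, 1: 0.2, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1}
pY = {0: 0.1, 1: 0.1, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.2}

# joint probability function (by independence), restricted to the event X + Y = 4
joint = {(x, y): pX[x] * pY[y] for x, y in product(pX, pY) if x + y == 4}
pA = sum(joint.values())                        # P(X + Y = 4) = 0.16
cond = {pt: p / pA for pt, p in joint.items()}  # conditional probability function

print(sum(x * p for (x, y), p in cond.items()))        # E(X | X+Y=4) = 0.9375
print(sum(y * p for (x, y), p in cond.items()))        # E(Y | X+Y=4) = 3.0625
print(sum(p for (x, y), p in cond.items() if x >= 3))  # P(X >= 3 | X+Y=4) = 0.125
print(sum(p for (x, y), p in cond.items() if y >= 3))  # P(Y >= 3 | X+Y=4) = 0.75
```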

Example 3

Now suppose the new information is that the two students do well on the test. Particularly, their combined number of correct answers is greater than or equal to 5, i.e., X+Y \ge 5. How would this impact the conditional distributions?

First we discuss the conditional distributions for X \lvert X+Y \ge 5 and Y \lvert X+Y \ge 5. By considering the new information, the following is the new sample space.

Figure 4 – New Sample Space

To derive the conditional distribution of X \lvert X+Y \ge 5, sum the joint probabilities within the new sample space for each X=x. The calculation is shown below.

      \displaystyle P(X=0 \cap X+Y \ge 5)=0.4 \times 0.2=0.08

      \displaystyle P(X=1 \cap X+Y \ge 5)=0.2 \times (0.2+0.2)=0.08

      \displaystyle P(X=2 \cap X+Y \ge 5)=0.1 \times (0.2+0.2+0.2)=0.06

      \displaystyle P(X=3 \cap X+Y \ge 5)=0.1 \times (0.2+0.2+0.2+0.2)=0.08

      \displaystyle P(X=4 \cap X+Y \ge 5)=0.1 \times (1-0.1)=0.09

      \displaystyle P(X=5 \cap X+Y \ge 5)=0.1 \times (1)=0.10

The sum of these probabilities is 0.49, which is P(X+Y \ge 5). The conditional distribution of X \lvert X+Y \ge 5 is obtained by taking each of the above probabilities as a fraction of 0.49. We have:

      \displaystyle P(X=0 \lvert X+Y \ge 5)=\frac{8}{49}=0.163

      \displaystyle P(X=1 \lvert X+Y \ge 5)=\frac{8}{49}=0.163

      \displaystyle P(X=2 \lvert X+Y \ge 5)=\frac{6}{49}=0.122

      \displaystyle P(X=3 \lvert X+Y \ge 5)=\frac{8}{49}=0.163

      \displaystyle P(X=4 \lvert X+Y \ge 5)=\frac{9}{49}=0.184

      \displaystyle P(X=5 \lvert X+Y \ge 5)=\frac{10}{49}=0.204

We have the conditional mean \displaystyle E(X \lvert X+Y \ge 5)=\frac{0+8+12+24+36+50}{49}=\frac{130}{49}=2.653 (vs. E(X)=1.6). The conditional probability of passing is \displaystyle P(X \ge 3 \lvert X+Y \ge 5)=\frac{27}{49}=0.55 (vs. P(X \ge 3)=0.3).

Note that the above conditional distribution for X \lvert X+Y \ge 5 is not as skewed as the original one for X. With the information that both test takers do well, the expected score for the student represented by X is much higher.

With similar calculation we have the following results for the conditional distribution of Y \lvert X+Y \ge 5.

      \displaystyle P(Y=0 \lvert X+Y \ge 5)=\frac{1}{49}=0.02

      \displaystyle P(Y=1 \lvert X+Y \ge 5)=\frac{2}{49}=0.04

      \displaystyle P(Y=2 \lvert X+Y \ge 5)=\frac{6}{49}=0.122

      \displaystyle P(Y=3 \lvert X+Y \ge 5)=\frac{8}{49}=0.163

      \displaystyle P(Y=4 \lvert X+Y \ge 5)=\frac{12}{49}=0.245

      \displaystyle P(Y=5 \lvert X+Y \ge 5)=\frac{20}{49}=0.408

We have the conditional mean \displaystyle E(Y \lvert X+Y \ge 5)=\frac{0+2+12+24+48+100}{49}=\frac{186}{49}=3.8 (vs. E(Y)=2.9). The conditional probability of passing is \displaystyle P(Y \ge 3 \lvert X+Y \ge 5)=\frac{40}{49}=0.82 (vs. P(Y \ge 3)=0.6). Indeed, with the information that both test takers do well, we can expect much higher results from each individual test taker.
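The same few lines of code handle this example; only the conditioning event changes. A sketch in Python:

```python
from itertools import product

pX = {0: 0.4, 1: 0.2, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1}
pY = {0: 0.1, 1: 0.1, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.2}

# joint probability function restricted to the event X + Y >= 5
joint = {(x, y): pX[x] * pY[y] for x, y in product(pX, pY) if x + y >= 5}
pA = sum(joint.values())                        # P(X + Y >= 5) = 0.49
cond = {pt: p / pA for pt, p in joint.items()}

print(sum(x * p for (x, y), p in cond.items()))        # E(X | X+Y>=5) = 2.653...
print(sum(y * p for (x, y), p in cond.items()))        # E(Y | X+Y>=5) = 3.795...
print(sum(p for (x, y), p in cond.items() if x >= 3))  # 27/49 = 0.551
print(sum(p for (x, y), p in cond.items() if y >= 3))  # 40/49 = 0.816
```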

Example 4

In Examples 2 and 3, the new information involves both test takers (both random variables). If the new information involves just one test taker, it may have no bearing on the exam score of the other student. For example, suppose that Y \ge 4. Then what is the conditional distribution for X \lvert Y \ge 4? Since X and Y are independent, the high score Y \ge 4 has no impact on the score X. However, the high joint score X+Y \ge 5 does have an impact on each of the individual scores (Example 3).

_____________________________________________________________________________________________________________________________

Summary

We conclude with a summary of the thought process of conditional distributions.

Suppose X is a discrete random variable and f(x)=P(X=x) is its probability function. Further suppose that X is the probability model of some random experiment. The sample space of this random experiment is S.

Suppose we have some new information that in this random experiment, some event A has occurred. The event A is a subset of the sample space S.

To incorporate this new information, the event A is the new sample space. The random variable incorporated with the new information, denoted by X \lvert A, has a conditional probability distribution. The following is the probability function of the conditional distribution.

    \displaystyle f(x \lvert A)=\frac{f(x)}{P(A)}=\frac{P(X=x)}{P(A)}, \ \ \ \ \ \ \ \ \ x \in A

where P(A) = \displaystyle \sum_{x \in A} P(X=x).

The thought process is that the conditional distribution is derived by taking each original probability mass as a fraction of the total probability P(A). The probability function derived in this manner reflects the new information that the event A has occurred.

Once the conditional probability function is derived, it can be used just like any other probability function, e.g. computationally for finding various distributional quantities.

_____________________________________________________________________________________________________________________________

Practice Problems

Practice problems are found in the companion blog.

_____________________________________________________________________________________________________________________________

\copyright \ \text{2013 by Dan Ma}

Looking for a Match

I recently gave an exam in my statistics course, which turned out to be an excellent example for the matching problem, a classic problem in probability. There were 66 students taking the test. I wrote down the names of the students in the order of turning in the exam. The following table shows the positions of the students in alphabetical order and in the order they turned in the exam.

For example, the first student in the alphabetical order was the 12th student who turned in the exam. The first student who turned in the exam was the 37th student in the alphabetical order. It turned out that there was a student who had the same position in both orders. Such a student is called a match (see the following table).

This example captures the essence of a classic problem in probability called the matching problem. There are many other colorful descriptions of the matching problem. All of them involve the pairing of two orderings on a set of objects. We have a set of objects or items that are ordered in some natural way. Then we scramble the items randomly (or according to some ordering that is unpredictable in advance). Whenever an item has the same position in both the original order and in the scrambled order, it is called a match.

In our exam example, one version of the matching problem asks: what is the probability that there is at least one student that is a match?

In fact, in matching two orderings, such as the one described here, a match happens more often than not. The probability of having at least one match is roughly 0.63. Specifically, when n is the number of students taking the exam, the probability of finding at least one match approaches 1-e^{-1}=0.632121 as n \rightarrow \infty. The derivation of this probability is based on the inclusion-exclusion principle and is discussed in the blog post called The Matching Problem.

Even though the probability of having at least one match is a function of n (the number of items), the probability converges to 1-e^{-1} pretty rapidly. Thus for all practical purposes, we can say that the probability of having at least one match is 0.63 (roughly two-thirds), whenever the number of items involved in the random scrambling is reasonably large (as with the 66 students taking the exam).

Instead of finding the probability of having at least one match, we can also ask for the probability of having exactly k matches, where k is any integer from 0 to n. Let X_n be the number of matches when we match the “exam turning in” order with the alphabetical order for n students. The probability function P(X_n=k) is derived in the blog post called More About the Matching Problem.

The blog post More About the Matching Problem also points out that P(X_n=k) is approximated by the Poisson distribution with parameter \alpha=1. Thus we have:

\displaystyle (1) \ \ \ \ \ P(X_n=k) \approx \frac{e^{-1}}{k!} \ \ \ \ \ \ \ \ \ \ k=0,1,2,\cdots,n

The following are the first 4 probabilities of P(X_n=k).

\displaystyle (2) \ \ \ \ \ P(X_n=0) \approx \frac{e^{-1}}{0!}=0.3679

\displaystyle (3) \ \ \ \ \ P(X_n=1) \approx \frac{e^{-1}}{1!}=0.3679

\displaystyle (4) \ \ \ \ \ P(X_n=2) \approx \frac{e^{-1}}{2!}=0.1839

\displaystyle (5) \ \ \ \ \ P(X_n=3) \approx \frac{e^{-1}}{3!}=0.0613

In the random experiment of matching two orderings on the same set of objects, about 37% of the time there is no match and about 37% of the time there is exactly one match. Having no matches and having exactly one match are the most likely scenarios (occurring about 74% of the time). Having 2 matches is possible, but only happens about 18% of the time. It is rare to have 3 or more matches.
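These percentages are easy to confirm by simulation. The following is a sketch (assuming Python with NumPy) that randomly scrambles 66 positions many times and tabulates the number of matches.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 66, 100_000
positions = np.arange(n)

# number of matches (fixed points) in each random scrambling of the n positions
matches = np.array([np.sum(rng.permutation(n) == positions) for _ in range(trials)])

print(np.mean(matches >= 1))  # about 0.632 (at least one match)
print(np.mean(matches == 0))  # about 0.368
print(np.mean(matches == 1))  # about 0.368
print(np.mean(matches == 2))  # about 0.184
```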

Another interesting observation is that if a match occurs, it is most likely that there is only one match (such as in the example discussed here).

There are many colorful descriptions of the matching problem. The possibilities are limitless. One example is that of n married couples going to a ballroom dancing class. The instructor randomly assigns the ladies to the gentlemen as dancing partners. A match occurs if a wife is assigned as the dancing partner of her husband.

A previous blog post (Tis the Season for Gift Exchange) presents an example involving gift exchange. Each person attending a party brings a gift for gift exchange. The gifts are put in a pile and each person randomly selects a gift from the pile. A match occurs if a person selects his or her own gift.

In another blog post (A lazy professor who lets students do their own grading), a professor randomly returns quizzes to the students for grading. A match occurs if a student is assigned his or her own quiz.

An Introduction to the Bayes’ Formula

We open up a discussion of the Bayes’ formula by going through a basic example. The Bayes’ formula or theorem is a method that can be used to compute “backward” conditional probabilities such as the ones described here. The formula will be stated after we examine the calculation from Example 1. The following diagram describes Example 1. Example 2 is presented at the end of the post and is left as an exercise. For a basic discussion of the Bayes’ formula, see [1] and chapter 4 of [2].

Example 1

As indicated in the diagram, Box 1 has 1 red ball and three white balls and Box 2 has 2 red balls and 2 white balls. The example involves a sequence of two steps. In the first step (the green arrow in the above diagram), a box is randomly chosen from two boxes. In the second step (the blue arrow), a ball is randomly selected from the chosen box. We assume that the identity of the chosen box is unknown to the participants of this random experiment (e.g. suppose the two boxes are identical in appearance and a box is chosen by your friend and its identity is kept from you). Since a box is chosen at random, it is easy to see that P(B_1)=P(B_2)=0.5.

The example involves conditional probabilities. Some of the conditional probabilities are natural and are easy to see. For example, if the chosen box is Box 1, it is clear that the probability of selecting a red ball is \displaystyle \frac{1}{4}, i.e. \displaystyle P(R \lvert B_1)=\frac{1}{4}. Likewise, the conditional probability P(R \lvert B_2) is \displaystyle \frac{2}{4}. These two conditional probabilities are “forward” conditional probabilities since the events R \lvert B_1 and R \lvert B_2 occur in a natural chronological order.

What about the reversed conditional probabilities P(B_1 \lvert R) and P(B_2 \lvert R)? In other words, if the selected ball from the unknown box (unknown to you) is red, what is the probability that the ball is from Box 1?

The above question seems a little backward. After the box is randomly chosen, it is fixed (though its identity is unknown to you). Since it is fixed, shouldn’t the probability of the box being Box 1 remain \displaystyle \frac{1}{2}? Since the box is already chosen, how can the identity of the box be influenced by the color of the ball selected from it? Of course, the color of the ball cannot change which box was chosen; what it can change, as the remark below explains, is our assessment of how likely it is that the chosen box is Box 1.

We should not look at the chronological sequence of events. Instead, the key to understanding the example is to perform the random experiment repeatedly. Think of the experiment of choosing one box and then selecting one ball from the chosen box. Focus only on the trials that result in a red ball. For the result to be a red ball, we need to get either Box 1/Red or Box 2/Red. Compute the probabilities of these two cases. Adding these two probabilities, we obtain the probability that the selected ball is red. The following diagram illustrates this calculation.

Example 1 – Tree Diagram

The outcomes with red border in the above diagram are the outcomes that result in a red ball. The diagram shows that if we perform this experiment many times, about 37.5% of the trials will result in a red ball (on average 3 out of 8 trials will result in a red ball). In how many of these trials is Box 1 the source of the red ball? In the diagram, we see that the case Box 2/Red is twice as likely as the case Box 1/Red. We conclude that the case Box 1/Red accounts for about one-third of the cases when the selected ball is red. In other words, one-third of the red balls come from Box 1 and two-thirds of the red balls come from Box 2. We have:

\displaystyle (1) \ \ \ \ \ P(B_1 \lvert R)=\frac{1}{3}

\displaystyle (2) \ \ \ \ \ P(B_2 \lvert R)=\frac{2}{3}

Instead of using the tree diagram or the reasoning indicated in the paragraph after the tree diagram, we could just as easily apply the Bayes’ formula:

\displaystyle \begin{aligned}(3) \ \ \ \ \ P(B_1 \lvert R)&=\frac{P(R \lvert B_1) \times P(B_1)}{P(R)} \\&=\frac{\frac{1}{4} \times \frac{1}{2}}{\frac{3}{8}} \\&=\frac{1}{3}  \end{aligned}

In the calculation in (3) (as in the tree diagram), we use the law of total probability:

\displaystyle \begin{aligned}(4) \ \ \ \ \ P(R)&=P(R \lvert B_1) \times P(B_1)+P(R \lvert B_2) \times P(B_2) \\&=\frac{1}{4} \times \frac{1}{2}+\frac{2}{4} \times \frac{1}{2} \\&=\frac{3}{8}  \end{aligned}
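The long-run frequency interpretation described above can be made concrete with a quick simulation. The following is a sketch (assuming Python with NumPy) that repeats the two-step experiment and looks only at the trials producing a red ball.

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 1_000_000

# step 1: choose a box at random; step 2: draw a ball from the chosen box
box = rng.integers(1, 3, size=trials)          # 1 or 2, each with probability 1/2
p_red = np.where(box == 1, 1 / 4, 2 / 4)       # P(red | chosen box)
red = rng.random(trials) < p_red               # True when the drawn ball is red

print(red.mean())              # about 3/8 = P(R)
print((box[red] == 1).mean())  # about 1/3 = P(B_1 | R)
print((box[red] == 2).mean())  # about 2/3 = P(B_2 | R)
```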

______________________________________________________________
Remark

We are not saying that an earlier event (the choosing of the box) is altered in some way by a subsequent event (the observing of a red ball). The above probabilities are subjective. How strongly do you believe that the “unknown” box is Box 1? If you use probabilities to quantify your belief, without knowing any additional information, you would say the probability that the “unknown” box being Box 1 is \frac{1}{2}.

Suppose you reach into the “unknown” box and get a red ball. This additional information alters your belief about the chosen box. Since Box 2 has more red balls, the fact that you observe a red ball will tell you that it is more likely that the “unknown” chosen box is Box 2. According to the above calculation, you update the probability of the chosen box being Box 1 to \frac{1}{3} and the probability of it being Box 2 as \frac{2}{3}.

In the language of Bayesian probability theory, the initial belief of P(B_1)=0.5 and P(B_2)=0.5 is called the prior probability distribution. After a red ball is observed, the updated belief as in the probabilities \displaystyle P(B_1 \lvert R)=\frac{1}{3} and \displaystyle P(B_2 \lvert R)=\frac{2}{3} is called the posterior probability distribution.

As demonstrated by this example, the Bayes’ formula is for updating probabilities in light of new information. Though the updated probabilities are subjective, they are not arbitrary. We can make sense of these probabilities by assessing the long run results of the experiment objectively.

______________________________________________________________
An Insurance Perspective

The example discussed here has an insurance interpretation. Suppose an insurer has two groups of policyholders, both equal in size. One group consists of low risk insureds where the probability of experiencing a claim in a year is \frac{1}{4} (i.e. the proportion of red balls in Box 1). The insureds in the other group, a high risk group, have a higher probability of experiencing a claim in a year, which is \frac{2}{4} (i.e. the proportion of red balls in Box 2).

Suppose someone has just purchased a policy. Initially, the risk profile of this newly insured is uncertain. So the initial belief is that it is equally likely for him to be in the low risk group as in the high risk group.

Suppose that during the first policy year, the insured has incurred one claim. The observation alters our belief about this insured. With the additional information of having one claim, the probability that the insured belongs to the high risk group is increased to \frac{2}{3}. The risk profile of this insured is altered based on new information. The insurance point of view described here involves the exact same calculation as in the box-ball example and amounts to using past claims experience to update expectations of future claims experience.

______________________________________________________________
Bayes’ Formula

Suppose we have a collection of mutually exclusive and exhaustive events B_1, B_2, \cdots, B_n, i.e., the events form a partition of the sample space so that the probabilities P(B_i) sum to 1.0. Suppose R is an event. Think of the events B_i as “causes” that can explain the event R, an observed result. Given R is observed, what is the probability that the cause of R is B_k? In other words, we are interested in finding the conditional probability P(B_k \lvert R).

Before we have the observed result R, the probabilities P(B_i) are the prior probabilities of the causes. We also know the probability of observing R given a particular cause (i.e. we know P(R \lvert B_i)). The probabilities P(R \lvert B_i) are “forward” conditional probabilities.

Given that we observe R, we are interested in knowing the “backward” probabilities P(B_i \lvert R). These probabilities are called the posterior probabilities of the causes. Mathematically, the Bayes’ formula is simply an alternative way of writing the following conditional probability.

\displaystyle (5) \ \ \ \ \ P(B_k \lvert R)=\frac{P(B_k \cap R)}{P(R)}

In (5), as in the discussion of the random experiment of choosing box and selecting ball, we are restricting ourselves to only the cases where the event R is observed. Then we ask, out of all the cases where R is observed, how many of these cases are caused by the event B_k?

The numerator of (5) can be written as

\displaystyle (6) \ \ \ \ \ P(B_k \cap R)=P(R \lvert B_k) \times P(B_k)

The denominator of (5) is obtained from applying the law of total probability.

\displaystyle (7) \ \ \ \ \ P(R)=P(R \lvert B_1) P(B_1) + P(R \lvert B_2) P(B_2)+ \cdots + P(R \lvert B_n) P(B_n)

Plugging (6) and (7) into (5), we obtain a statement of the Bayes’ formula.

\displaystyle (8) \ \ \ \ \ P(B_k \lvert R)=\frac{P(R \lvert B_k) \times P(B_k)}{\sum \limits_{j=1}^n P(R \lvert B_j) \times P(B_j)} \ \ \ \ \ \ \ \text{(Bayes' Formula)}

Of course, for any computation problem involving the Bayes’ formula, it is best not to memorize the formula in (8). Instead, simply apply the thought process that gives rise to the formula (e.g. the tree diagram shown above).
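For instance, formula (8) is only a few lines of code. The following is a sketch in Python (the function name posterior is ours) that takes the prior probabilities P(B_i) and the forward probabilities P(R \lvert B_i) and returns the posterior probabilities P(B_i \lvert R).

```python
def posterior(priors, likelihoods):
    """Bayes' formula: priors P(B_i) and likelihoods P(R | B_i) -> posteriors P(B_i | R)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(B_i) * P(R | B_i)
    total = sum(joint)                                    # P(R), by the law of total probability
    return [j / total for j in joint]

# the box-ball example: posterior probabilities 1/3 and 2/3
print(posterior([0.5, 0.5], [1 / 4, 2 / 4]))
```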

The Bayes’ formula has some profound philosophical implications, evidenced by the fact that it spawned a separate school of thought called Bayesian statistics. However, our discussion here is solely on its original role in finding certain backward conditional probabilities.

______________________________________________________________
Example 2

Example 2 is left as an exercise. The event that both selected balls are red would give even more weight to Box 2. In other words, in the event that a red ball is selected twice in a row, we would believe that it is even more likely that the unknown box is Box 2.
______________________________________________________________
Reference

  1. Feller, W., An Introduction to Probability Theory and Its Applications, third edition, John Wiley & Sons, New York, 1968.
  2. Grinstead, C. M., Snell, J. L. Introduction to Probability, Online Book in PDF format.

How To Calculate Winning Odds in California Lottery

Ever wonder how to calculate the winning odds of lottery games? The winning odds of the top prize of Fantasy 5 in California Lottery are 1 in 575,757. The winning odds of the top prize of SuperLOTTO Plus are 1 in 41,416,353. The winning odds of the top prize of Mega Millions are 1 in 175,711,536. In this post, we show how to calculate the odds for these games in the California Lottery. The calculation is an excellent exercise in combinatorics as well as in calculating hypergeometric probabilities.

All figures and data are obtained from the California Lottery.

____________________________________________________
Fantasy 5

The following figures show a playslip and a sample ticket for the game of Fantasy 5.

Figure 1

Figure 2

In the game of Fantasy 5, the player chooses 5 numbers from 1 to 39. If all 5 chosen numbers match the 5 winning numbers, the player wins the top prize which starts at $50,000 and can go up to $500,000 or more. The odds of winning the top prize are 1 in 575,757. There are lower tier prizes that are easier to win but with much lower winning amounts. The following figure shows the prize categories and the winning odds of Fantasy 5.

Figure 3

All 5 of 5
In matching the player’s chosen numbers with the winning numbers, the order of the numbers does not matter. Thus in the calculation of odds, we use combinations rather than permutations. We have:

\displaystyle (1) \ \ \ \ \ \binom{39}{5}=\frac{39!}{5! \ (39-5)!}=575757

Based on (1), the odds of matching all 5 winning numbers are 1 in 575,757 (the odds of winning the top prize).

Any 4 of 5
To match 4 out of 5 winning numbers, 4 of the player’s chosen numbers are winning numbers and 1 of the player’s chosen numbers is from the non-winning numbers (34 of them). Thus the probability of matching 4 out of 5 winning numbers is:

\displaystyle (2) \ \ \ \ \ \frac{\displaystyle \binom{5}{4} \ \binom{34}{1}}{\displaystyle \binom{39}{5}}=\frac{5 \times 34}{575757}=\frac{1}{3386.8} \ \ \text{(1 out of 3,387)}

Any 3 of 5
To find the odds for matching 3 out of 5 winning numbers, we need to find the probability that 3 of the player’s chosen numbers are from the 5 winning numbers and 2 of the selected numbers are from the 34 non-winning numbers. Thus we have:

\displaystyle (3) \ \ \ \ \ \frac{\displaystyle \binom{5}{3} \ \binom{34}{2}}{\displaystyle \binom{39}{5}}=\frac{10 \times 561}{575757}=\frac{1}{102.63} \ \ \text{(1 out of 103)}

Any 2 of 5
Similarly, the following shows how to calculate the odds of matching 2 out of 5 winning numbers:

\displaystyle (4) \ \ \ \ \ \frac{\displaystyle \binom{5}{2} \ \binom{34}{3}}{\displaystyle \binom{39}{5}}=\frac{10 \times 5984}{575757}=\frac{1}{9.6216} \ \ \text{(1 out of 10)}
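All four Fantasy 5 probabilities follow the same hypergeometric pattern, which the following sketch spells out (assuming Python 3.8 or later for math.comb; the function name fantasy5_odds is ours).

```python
from math import comb

total = comb(39, 5)  # 575,757 equally likely combinations of 5 numbers out of 39

def fantasy5_odds(k):
    """1-in-N odds of matching exactly k of the 5 winning numbers."""
    ways = comb(5, k) * comb(34, 5 - k)  # k winning numbers and 5-k non-winning numbers
    return total / ways

for k in (5, 4, 3, 2):
    print(k, "of 5: 1 in", round(fantasy5_odds(k), 1))
# 5 of 5: 1 in 575757.0, 4 of 5: 1 in 3386.8, 3 of 5: 1 in 102.6, 2 of 5: 1 in 9.6
```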

____________________________________________________
SuperLOTTO Plus

Here are the pictures of a playslip and a sample ticket of the game of SuperLOTTO Plus.

Figure 4

Figure 5

Based on the playslip (Figure 4), the player chooses 5 numbers from 1 to 47. The player also chooses an additional number called a Mega number from 1 to 27. To win the top prize, there must be a match between the player’s 5 selections and the 5 winning numbers as well as a match between the player’s Mega number and the winning Mega number (All 5 of 5 and Mega in Figure 6 below).

Figure 6

All 5 of 5 and Mega
To find the odds of the match of “All 5 of 5 and Mega”, the total number of possibilities is obtained by choosing 5 numbers from 47 numbers and choosing 1 number from 27 numbers. We have:

\displaystyle (5) \ \ \ \ \ \binom{47}{5} \times \binom{27}{1}=41,416,353

Thus the odds of matching “All 5 of 5 and Mega” are 1 in 41,416,353.

All 5 of 5
To find the odds of matching “All 5 of 5” (i.e. the player’s 5 selections match the 5 winning numbers but there is no match with the winning Mega number), we need to choose 5 numbers from the 5 winning numbers, choose 0 numbers from the 42 non-winning numbers, choose 0 numbers from the 1 winning Mega number and choose 1 number from the 26 non-winning Mega numbers. This may seem overly precise, but it makes the subsequent derivations easier to follow. We have:

\displaystyle \begin{aligned}(6) \ \ \ \ \   \frac{\displaystyle \binom{5}{5} \ \binom{42}{0} \ \binom{1}{0} \ \binom{26}{1}}{\displaystyle \binom{47}{5} \times \binom{27}{1}}&=\frac{1 \times 1 \times 1 \times 26}{41416353} \\&=\frac{1}{1592936.654} \\&\text{ } \\&=\text{1 out of 1,592,937}  \end{aligned}

Any 4 of 5 and Mega
To calculate the odds for matching “any 4 of 5 and Mega”, we need to choose 4 out of 5 winning numbers, choose 1 out of 42 non-winning numbers, choose 1 out of 1 Mega winning number, and choose 0 out of 26 non-winning Mega numbers. We have:

\displaystyle \begin{aligned}(7) \ \ \ \ \   \frac{\displaystyle \binom{5}{4} \ \binom{42}{1} \ \binom{1}{1} \ \binom{26}{0}}{\displaystyle \binom{47}{5} \times \binom{27}{1}}&=\frac{5 \times 42 \times 1 \times 1}{41416353} \\&=\frac{1}{197220.7286} \\&\text{ } \\&=\text{1 out of 197,221}  \end{aligned}

Any 4 of 5
To calculate the odds for matching “any 4 of 5” (no match for the Mega number), we need to choose 4 out of 5 winning numbers, choose 1 out of 42 non-winning numbers, choose 0 out of 1 Mega winning number, and choose 1 out of 26 non-winning Mega numbers. We have:

\displaystyle \begin{aligned}(8) \ \ \ \ \   \frac{\displaystyle \binom{5}{4} \ \binom{42}{1} \ \binom{1}{0} \ \binom{26}{1}}{\displaystyle \binom{47}{5} \times \binom{27}{1}}&=\frac{5 \times 42 \times 1 \times 26}{41416353} \\&=\frac{1}{7585.412637} \\&\text{ } \\&=\text{1 out of 7,585}  \end{aligned}

Any 3 of 5 and Mega
To calculate the odds for matching “any 3 of 5 and Mega”, we need to choose 3 out of 5 winning numbers, choose 2 out of 42 non-winning numbers, choose 1 out of 1 Mega winning number, and choose 0 out of 26 non-winning Mega numbers. We have:

\displaystyle \begin{aligned}(9) \ \ \ \ \   \frac{\displaystyle \binom{5}{3} \ \binom{42}{2} \ \binom{1}{1} \ \binom{26}{0}}{\displaystyle \binom{47}{5} \times \binom{27}{1}}&=\frac{10 \times 861 \times 1 \times 1}{41416353} \\&=\frac{1}{4810.261672} \\&\text{ } \\&=\text{1 out of 4,810}  \end{aligned}

The rest of the calculations for SuperLOTTO Plus should be routine. It is a matter of deciding how many to choose from the 5 winning numbers, how many to choose from the 42 non-winning numbers, as well as how many to choose from the 1 winning Mega number and how many to choose from the 26 non-winning Mega numbers.

Any 3 of 5
\displaystyle \begin{aligned}(10) \ \ \ \ \   \frac{\displaystyle \binom{5}{3} \ \binom{42}{2} \ \binom{1}{0} \ \binom{26}{1}}{\displaystyle \binom{47}{5} \times \binom{27}{1}}&=\frac{10 \times 861 \times 1 \times 26}{41416353} \\&=\frac{1}{185.0100643} \\&\text{ } \\&=\text{1 out of 185}  \end{aligned}

Any 2 of 5 and Mega
\displaystyle \begin{aligned}(11) \ \ \ \ \   \frac{\displaystyle \binom{5}{2} \ \binom{42}{3} \ \binom{1}{1} \ \binom{26}{0}}{\displaystyle \binom{47}{5} \times \binom{27}{1}}&=\frac{10 \times 11480 \times 1 \times 1}{41416353} \\&=\frac{1}{360.7696254} \\&\text{ } \\&=\text{1 out of 361}  \end{aligned}

Any 1 of 5 and Mega
\displaystyle \begin{aligned}(12) \ \ \ \ \   \frac{\displaystyle \binom{5}{1} \ \binom{42}{4} \ \binom{1}{1} \ \binom{26}{0}}{\displaystyle \binom{47}{5} \times \binom{27}{1}}&=\frac{5 \times 111930 \times 1 \times 1}{41416353} \\&=\frac{1}{74.00402573} \\&\text{ } \\&=\text{1 out of 74}  \end{aligned}

None of 5 only Mega
\displaystyle \begin{aligned}(13) \ \ \ \ \   \frac{\displaystyle \binom{5}{0} \ \binom{42}{5} \ \binom{1}{1} \ \binom{26}{0}}{\displaystyle \binom{47}{5} \times \binom{27}{1}}&=\frac{1 \times 850668 \times 1 \times 1}{41416353} \\&=\frac{1}{48.68685903} \\&\text{ } \\&=\text{1 out of 49}  \end{aligned}
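Every SuperLOTTO Plus tier comes from the same four binomial coefficients. The following is a sketch (assuming Python 3.8 or later; the function name superlotto_odds is ours) that reproduces the odds computed above.

```python
from math import comb

TOTAL = comb(47, 5) * comb(27, 1)  # 41,416,353 equally likely tickets

def superlotto_odds(k, mega):
    """1-in-N odds of matching k of the 5 winning numbers, with or without the Mega number."""
    ways = comb(5, k) * comb(42, 5 - k)           # regular numbers: k winning, 5-k non-winning
    ways *= comb(26, 0) if mega else comb(26, 1)  # Mega number matched or not
    return TOTAL / ways

print(superlotto_odds(5, True))   # 41,416,353 (All 5 of 5 and Mega)
print(superlotto_odds(5, False))  # about 1,592,937
print(superlotto_odds(4, True))   # about 197,221
print(superlotto_odds(3, False))  # about 185
print(superlotto_odds(0, True))   # about 48.7 (none of 5, only Mega)
```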

____________________________________________________
Mega Millions

The following are a playslip, a sample ticket and the winning odds of the game of Mega Millions.

Figure 7

Figure 8

Figure 9

Based on the playslip (Figure 7), the player chooses 5 numbers from 1 to 56. The player also chooses an additional number called a Mega number from 1 to 46. To win the top prize, there must be a match between the player’s 5 selections and the 5 winning numbers as well as a match between the player’s Mega number and the winning Mega number. The calculations of the odds indicated in Figure 9 are left as exercises.
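As a check on the figure quoted at the beginning of the post, the top-prize odds of Mega Millions can be computed in one line (a Python sketch):

```python
from math import comb

# choose 5 of the 56 regular numbers and 1 of the 46 Mega numbers
print(comb(56, 5) * comb(46, 1))  # 175,711,536 possible tickets, so the top-prize odds are 1 in this number
```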

Picking Two Types of Binomial Trials

We motivate the discussion with the following example. The notation W \sim \text{binom}(n,p) denotes the statement that W has a binomial distribution with parameters n and p. In other words, W is the number of successes in a sequence of n independent Bernoulli trials where p is the probability of success in each trial.

Example 1
Suppose that a student took two multiple choice quizzes in a course for probability and statistics. Each quiz has 5 questions. Each question has 4 choices and only one of the choices is correct. Suppose that the student answered all the questions by pure guessing. Furthermore, the two quizzes are independent (i.e. results of one quiz will not affect the results of the other quiz). Let X be the number of correct answers in the first quiz and Y be the number of correct answers in the second quiz. Suppose the student was told by the instructor that she had a total of 4 correct answers in these two quizzes. What is the probability that she had 3 correct answers in the first quiz?

On the face of it, the example is all about the binomial distribution. Both X and Y are binomial random variables (both \sim \text{binom}(5,\frac{1}{4})). The sum X+Y also has a binomial distribution (\sim \text{binom}(10,\frac{1}{4})). The question that is being asked is a conditional probability, i.e., P(X=3 \lvert X+Y=4). Surprisingly, this conditional probability can be computed using the hypergeometric distribution. One can always work this problem from first principles using binomial distributions. As discussed below, for a problem such as Example 1, it is always possible to replace the binomial distributions with a thought process involving the hypergeometric distribution.

Here’s how to think about the problem. This student took the two quizzes and was given the news by the instructor that she had 4 correct answers in total. She now wonders what the probability of having 3 correct answers in the first quiz is. The thought process is this. She is to pick 4 questions from 10 questions (5 of them are from Quiz 1 and 5 of them are from Quiz 2). So she is picking 4 objects from a group of two distinct types of objects. This is akin to reaching into a jar that has 5 red balls and 5 blue balls and picking 4 balls without replacement. What is the probability of picking 3 red balls and 1 blue ball? The probability just described is from a hypergeometric distribution. The following shows the calculation.

\displaystyle (1) \ \ \ \ P(X=3 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{3} \ \binom{5}{1}}{\displaystyle \binom{10}{4}}=\frac{50}{210}

We will show below why this works. Before we do that, let’s describe the above thought process. Whenever you have two independent binomial distributions X and Y with the same probability of success p (the number of trials does not have to be the same), the conditional distribution X \lvert X+Y=a is a hypergeometric distribution. Interestingly, the probability of success p has no bearing on this observation. For Example 1, we have the following calculation.

\displaystyle (2a) \ \ \ \ P(X=0 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{0} \ \binom{5}{4}}{\displaystyle \binom{10}{4}}=\frac{5}{210}

\displaystyle (2b) \ \ \ \ P(X=1 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{1} \ \binom{5}{3}}{\displaystyle \binom{10}{4}}=\frac{50}{210}

\displaystyle (2c) \ \ \ \ P(X=2 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{2} \ \binom{5}{2}}{\displaystyle \binom{10}{4}}=\frac{100}{210}

\displaystyle (2d) \ \ \ \ P(X=3 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{3} \ \binom{5}{1}}{\displaystyle \binom{10}{4}}=\frac{50}{210}

\displaystyle (2e) \ \ \ \ P(X=4 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{4} \ \binom{5}{0}}{\displaystyle \binom{10}{4}}=\frac{5}{210}

Interestingly, the conditional mean E(X \lvert X+Y=4)=2, while the unconditional mean E(X)=5 \times \frac{1}{4}=1.25. The fact that the conditional mean is higher is not surprising. The student was lucky enough to have obtained 4 correct answers by guessing. Given this, she had a greater chance of doing better on the first quiz.
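The probabilities in (2a) through (2e) and the conditional mean can be confirmed directly, either with binomial coefficients or with the hypergeometric distribution in SciPy. A minimal sketch in Python:

```python
from math import comb
from scipy.stats import hypergeom

# conditional distribution of X given X + Y = 4:
# population of 10 questions, 5 of which are "Quiz 1" questions, with 4 correct answers drawn
rv = hypergeom(10, 5, 4)

for k in range(5):
    direct = comb(5, k) * comb(5, 4 - k) / comb(10, 4)
    print(k, direct, rv.pmf(k))   # 5/210, 50/210, 100/210, 50/210, 5/210

print(rv.mean())                  # E(X | X + Y = 4) = 2.0
```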

__________________________________________________
Why This Works

Suppose X \sim \text{binom}(5,p) and Y \sim \text{binom}(5,p) and they are independent. The joint distribution of X and Y has 36 points in the sample space. See the following diagram.

Figure 1

The probability attached to each point is

\displaystyle \begin{aligned}(3) \ \ \ \  P(X=x,Y=y)&=P(X=x) \times P(Y=y) \\&=\binom{5}{x} p^x (1-p)^{5-x} \times \binom{5}{y} p^y (1-p)^{5-y}  \end{aligned}

where x=0,1,2,3,4,5 and y=0,1,2,3,4,5.

The conditional probability P(X=k \lvert X+Y=4) involves 5 points as indicated in the following diagram.

Figure 2

The conditional probability P(X=k \lvert X+Y=4) is simply the probability of one of the 5 sample points as a fraction of the sum total of the 5 sample points encircled in the above diagram. The following is the sum total of the probabilities of the 5 points indicated in Figure 2.

\displaystyle \begin{aligned}(4) \ \ \ \  P(X+Y=4)&=P(X=0) \times P(Y=4)+P(X=1) \times P(Y=3)\\&\ \ \ \ +P(X=2) \times P(Y=2)+P(X=3) \times P(Y=1)\\&\ \ \ \ +P(X=4) \times P(Y=0)  \end{aligned}

We can plug (3) into (4) and work out the calculation. But (4) is actually equivalent to the following because X+Y \sim \text{binom}(10,p).

\displaystyle (5) \ \ \ \ P(X+Y=4)=\ \binom{10}{4} p^4 \ (1-p)^{6}

As stated earlier, the conditional probability P(X=k \lvert X+Y=4) is simply the probability of one of the 5 sample points as a fraction of the sum total of the 5 sample points encircled in Figure 2. Thus we have:

\displaystyle \begin{aligned}(6) \ \ \ \  P(X=k \lvert X+Y=4)&=\frac{P(X=k) \times P(Y=4-k)}{P(X+Y=4)} \\&=\frac{\displaystyle \binom{5}{k} p^k (1-p)^{5-k} \times \binom{5}{4-k} p^{4-k} (1-p)^{5-(4-k)}}{\displaystyle  \binom{10}{4} p^4 \ (1-p)^{6}}   \end{aligned}

With the terms involving p and 1-p canceled out, we have:

\displaystyle (7) \ \ \ \  P(X=k \lvert X+Y=4)=\frac{\displaystyle \binom{5}{k} \times \binom{5}{4-k}}{\displaystyle  \binom{10}{4}}

__________________________________________________
Summary

Suppose X \sim \text{binom}(N,p) and Y \sim \text{binom}(M,p) and they are independent. Then X+Y is also a binomial distribution, i.e., \sim \text{binom}(N+M,p). Suppose that both binomial experiments \text{binom}(N,p) and \text{binom}(M,p) have been performed and it is known that there are a successes in total. Then X \lvert X+Y=a has a hypergeometric distribution.

\displaystyle (8) \ \ \ \  P(X=k \lvert X+Y=a)=\frac{\displaystyle \binom{N}{k} \times \binom{M}{a-k}}{\displaystyle  \binom{N+M}{a}}

where k ranges over the integers satisfying 0 \le k \le N and 0 \le a-k \le M, i.e., \text{max}(0, \ a-M) \le k \le \text{min}(N, \ a).

As discussed earlier, think of the N trials in \text{binom}(N,p) as red balls and think of the M trials in \text{binom}(M,p) as blue balls in a jar. Think of the a successes as the number of balls you are about to draw from the jar. So you reach into the jar and select a balls without replacement. The calculation in (8) gives the probability that you select k red balls and a-k blue balls.
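A simulation makes the summary concrete. The following sketch (assuming Python with NumPy) draws two independent binomial samples with the same p, conditions on the total being a, and compares the conditional frequencies with the hypergeometric probabilities in (8).

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
N, M, p, a = 5, 5, 0.25, 4

X = rng.binomial(N, p, size=1_000_000)
Y = rng.binomial(M, p, size=1_000_000)
given = X[X + Y == a]   # keep only the trials with a successes in total

for k in range(a + 1):
    simulated = np.mean(given == k)
    exact = comb(N, k) * comb(M, a - k) / comb(N + M, a)
    print(k, round(simulated, 4), round(exact, 4))
```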

The probability of success p in the two binomial distributions has no bearing on the result since it gets canceled out in the derivation. One can always work a problem like Example 1 from first principles. Once the thought process using the hypergeometric distribution is understood, it is a great way to solve this type of problem; you can bypass the binomial distributions and go straight to the hypergeometric distribution.

__________________________________________________
Additional Practice
Practice problems are found in the following blog post.

How to pick binomial trials

A lazy professor who lets students do their own grading

After reading the example discussed here, any professor or instructor who is contemplating randomly distributing quizzes back to the students for grading perhaps should reconsider the idea! This is one of several posts discussing the matching problem. Go to the end of this post to find links to these previous posts.

Consider the following problem. Suppose that a certain professor is lazy and lets his students grade their own quizzes. After he collects the quizzes from his n students, he randomly assigns the quizzes back to these n students for grading. If a student is assigned his or her own quiz, we say that it is a match. We have the following questions:

  • What is the probability that each of the n students is a match?
  • What is the probability that none of the n students is a match?
  • What is the probability that exactly k of the n students are matches?
  • What is the probability that at least one of the n students is a match?

The above problem is called the matching problem, which is a classic problem in probability. In this post we solve the last question indicated above. Though the answer is in terms of the total number of quizzes n, it turns out that the answer is essentially independent of n and is approximately 0.63 (roughly two-thirds). Thus if the professor assigns the quizzes randomly, it is more likely than not that at least one student receives his or her own quiz.

The last question above is usually stated in other matching situations. One is that there are n couples (say, each couple consists of a husband and a wife) in a class for ballroom dancing. Suppose that the dance instructor randomly matches the men to the ladies. When a husband is assigned his own wife, we say that it is a match. What is the probability that there is at least one couple that is a match?

The key to answering this question is the theorem stated in Feller (page 99 in chapter 4 of [1]). We state the theorem and make use of it in the solution of the last question above. A sketch of the proof will be given at the end. For ideas on the solutions to the first three questions above, see this previous post.

The union of n events
For any n events E_1,E_2, \cdots,E_n that are defined on the same sample space, we have the following formula:

(1) \ \ \ \ \ P[E_1 \cup E_2 \cup \cdots \cup E_n]=\sum \limits_{m=1}^{n} (-1)^{m+1} \thinspace S_m where

S_1=\sum \limits_{r=1}^{n}P[E_r],

S_2=\sum \limits_{j<k}P[E_j \cap E_k],

S_m=\sum P[E_{i(1)} \cap E_{i(2)} \cap \cdots \cap E_{i(m)}]

Note that in the general term S_m, the sum is taken over all increasing sequences i(\cdot), i.e. 1 \le i(1) < i(2) < \cdots < i(m) \le n. For n=2,3, we have the following familiar formulas:

\displaystyle P[E_1 \cup E_2]=P[E_1]+P[E_2]-P[E_1 \cap E_2]

\displaystyle \begin{aligned}P[E_1 \cup E_2 \cup E_3]=& \ \ \ \ P[E_1]+P[E_2]+P[E_3]\\&-P[E_1 \cap E_2]-P[E_1 \cap E_3]-P[E_2 \cap E_3]\\&+P[E_1 \cap E_2 \cap E_3] \end{aligned}

The Matching Problem

Suppose that the n students are labeled 1,2, \cdots, n. Let E_i be the event that the i^{th} student is assigned his or her own quiz by the professor. Then P[E_1 \cup E_2 \cup \cdots \cup E_n] is the probability that there is at least one correct match.

Note that P[E_i]=\frac{(n-1)!}{n!}=\frac{1}{n}. This is the case since we let the i^{th} student be fixed and we permute the other n-1 students. Likewise, P[E_i \cap E_j]=\frac{(n-2)!}{n!}, since we fix the i^{th} and j^{th} students and permute the other n-2 students (in (n-2)! ways). In general, whenever i(1),i(2),\cdots,i(m) are m distinct integers and are increasing, we have:

\displaystyle P[E_{i(1)} \cap E_{i(2)} \cdots \cap E_{i(m)}]=\frac{(n-m)!}{n!}

We now apply the formula (1). First we show that for each m where 1 \le m \le n, S_m=\frac{1}{m!}. Since there are \binom{n}{m} many ways to have m matches out of n students, we have:

\displaystyle \begin{aligned}S_m&=\sum P[E_{i(1)} \cap E_{i(2)} \cdots \cap E_{i(m)}]\\&=\binom{n}{m} \frac{(n-m)!}{n!}\\&=\frac{1}{m!} \end{aligned}

Applying the formula for the union of n events, we have:

\displaystyle P[E_1 \cup E_2 \cup \cdots \cup E_n]=1-\frac{1}{2!}+\frac{1}{3!}-\cdots+(-1)^{n+1}\frac{1}{n!}

\displaystyle 1-P[E_1 \cup E_2 \cup \cdots \cup E_n]=1-1+\frac{1}{2!}-\frac{1}{3!}+\cdots+(-1)^{n}\frac{1}{n!}

Note that the right-hand side of the above equality is the first n+1 terms in the expansion of e^{-1}. Thus we have:

\displaystyle \begin{aligned}\lim_{n \rightarrow \infty}P[E_1 \cup E_2 \cup \cdots \cup E_n]&=\lim_{n \rightarrow \infty}\biggl(1-\frac{1}{2!}+\frac{1}{3!}-\cdots+(-1)^{n+1}\frac{1}{n!}\biggr)\\&=1-e^{-1}\\&=0.6321205588 \end{aligned}

The above limit converges quite rapidly. Let P_n=P[E_1 \cup E_2 \cup \cdots \cup E_n]. The following table lists the first several values of P_n.

      n        P_n
      2        0.50000
      3        0.66667
      4        0.62500
      5        0.63333
      6        0.63194
      7        0.63214
      8        0.63212
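The table can be generated directly from the partial sums of the series. A sketch in Python:

```python
from math import factorial

def prob_at_least_one_match(n):
    """P(at least one match) among n items, by inclusion-exclusion."""
    return sum((-1) ** (m + 1) / factorial(m) for m in range(1, n + 1))

for n in range(2, 9):
    print(n, round(prob_at_least_one_match(n), 5))
# converges rapidly to 1 - 1/e = 0.63212...
```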

Sketch of Proof for the Formula
The key idea is that any sample point in the union E_1 \cup E_2 \cup \cdots \cup E_n is counted exactly one time in the right-hand side of (1). Suppose that a sample point is in exactly t of the events E_1,E_2,\cdots,E_n. Then the following shows the number of times the sample point is counted in each expression:

\displaystyle \sum \limits_{m=1}^{n}P[E_m] \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ t \text{ times}

\displaystyle \sum \limits_{a<b}P[E_a \cap E_b] \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \binom{t}{2} \text{ times}

\displaystyle \sum \limits_{a<b<c}P[E_a \cap E_b \cap E_c] \ \ \ \ \ \ \ \ \ \binom{t}{3} \text{ times and so on}

Thus the sample point in question will be counted exactly H times in the right hand side of the formula.

\displaystyle H=\binom{t}{1}-\binom{t}{2}+\binom{t}{3}- \cdots + (-1)^{t+1}\binom{t}{t}

The following is the derivation that H=1.

\displaystyle 0=(1-1)^t=\sum \limits_{a=0}^{t} \binom{t}{a}(-1)^{a}(1)^{t-a}=\binom{t}{0}-\binom{t}{1}+\binom{t}{2}- \cdots +(-1)^t \binom{t}{t}

\displaystyle 1=\binom{t}{0}=\binom{t}{1}-\binom{t}{2}+\binom{t}{3}- \cdots +(-1)^{t+1} \binom{t}{t}=H

Reference

  1. Feller, W., An Introduction to Probability Theory and its Applications, Vol. I, 3rd ed., John Wiley & Sons, Inc., New York, 1968

Previous Posts on The Matching Problem
The Matching Problem

More About the Matching Problem

Tis the Season for Gift Exchange