A randomized definition of the natural logarithm constant

The number e is the base of the natural logarithm. It is an important constant in mathematics, which is approximately 2.718281828. This post discusses a charming little problem involving the the number e. This number can be represented in many ways. In a calculus course, the number e may be defined as the upper limit of the following integral:

    \displaystyle \int_1^e \frac{1}{t} \ dt=1

Another representation is that it is the sum \displaystyle e=\sum_{n=1}^\infty \frac{1}{n!}. Still another is that it is the limit \displaystyle e=\lim_{n \rightarrow \infty} \biggl( 1+\frac{1}{n} \biggr)^n. According to the authors of a brief article from the Mathematics Magazine [1], these textbook definitions do not give immediate insight about the number e (article can be found here). As a result, students come away from the course without a solid understanding of the number e and may have to resort to rote memorization on what the number e is. Instead, the article gives six probability oriented ways in which the number e can occur. These occurrences of e are more interesting and, according to the authors of [1], can potentially increase students’ appreciation of the number e. In two of these six examples (Example 2 and Example 5), the number e is defined by drawing random numbers from the interval (0,1). This post discusses Example 2.


Random Experiment

Here’s the description of the random experiment that generates the number e. Randomly and successively select numbers from the interval (0, 1). The experiment terminates when the sum of the random numbers exceeds 1. What is the average number of selections in this experiment? In other words, we are interested in the average length of the experiment. According to the article [1], the average length is the number e. The goal here is to give a proof of this result.

As illustration, the experiment is carried out 1 million times using random numbers that are generated in Excel using the Rand() function. The following table summarizes the results.

    \left[\begin{array}{rrrrrr}      \text{Length of} & \text{ } & \text{ } & \text{ } & \text{Relative} \\      \text{Experiment} & \text{ } & \text{Frequency} & \text{ } & \text{Frequency} \\      \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\      2 & \text{ } & 500777 & \text{ } & 0.500777 \\      3 & \text{ } & 332736 & \text{ } & 0.332736 \\      4 & \text{ } & 124875 & \text{ } & 0.124875 \\      5 & \text{ } & 33465 & \text{ } & 0.033465 \\      6 & \text{ } & 6827 & \text{ } & 0.006827 \\      7 & \text{ } & 1130 & \text{ } & 0.001130 \\      8 & \text{ } & 172 & \text{ } & 0.000172 \\      9 & \text{ } & 14 & \text{ } & 0.000014 \\      10 & \text{ } & 4 & \text{ } & 0.000004 \\      \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\       \text{Total} & \text{ } & 1000000 & \text{ } & \text{ }      \end{array}\right]

The average length of experiment in these 1 million experiments is 2.717001. Even though the rate of convergence to the number e is fairly slow, the simulated data demonstrates that on average it takes approximately e number of simulations of numbers in the unit interval to get a sum that exceeds 1.


A proof

Let U_1,U_2,U_3,\cdots be a sequence of independent and identically distributed random variables such that the common distribution is a uniform distribution on the unit interval (0,1). Let N be defined as follows:

    \displaystyle N=\text{min}\left\{n: U_1+U_2+\cdots+U_n>1 \right\} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)

The objective is to determine E(N). On the surface, it seems that we need to describe the distribution of the independent sum X_n=U_1+U_2+\cdots+U_n for all possible n. Doing this may be possible but the result would be messy. It turns out that we do not need to do so. We need to evaluate the probability P(X_n \le 1) for all n. We show that

    \displaystyle F_n(x)=P(X_n \le x)=\frac{x^n}{n!} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)

for 0 \le x \le 1 and for n=1,2,3,\cdots. This is accomplished by an induction proof. This is true for n=1 since X_1=U_1 is a uniform distribution. Suppose (2) holds for the integer n-1. This means that \displaystyle F_{n-1}(x)=P(X_{n-1} \le x)=\frac{x^{n-1}}{(n-1)!} for 0 \le x \le 1. Note that X_n=X_{n-1}+U_n, which is an independent sum. Let’s write out the convolution formula for this independent sum:

    \displaystyle \begin{aligned} F_n(x)&=\int_{0}^x F_{n-1}(x-y) \cdot f_{U_n}(y) \ dy \\&=\int_{0}^x \frac{(x-y)^{n-1}}{(n-1)!} \cdot 1 \ dy \\&=\frac{1}{(n-1)!} \int_0^x (x-y)^{n-1} \ dy \\&=\frac{x^n}{n!}  \end{aligned}

The above derivation completes the proof of the claim. We now come back to the problem of evaluating the mean of the random variable N defined in (1). First, note that N>n if and only if X_n=U_1+U_2+\cdots+U_n \le 1. So we have the probability statement \displaystyle P(N>n)=F_n(1)=\frac{1}{n!}.

As a result, the following is the probability function of the random variable N.

    \displaystyle \begin{aligned} P(N=n)&=P(N>n-1)-P(N>n) \\&=F_{n-1}(1)-F_n(1) \\&=\frac{1}{(n-1)!}-\frac{1}{n!} \\&=\frac{n-1}{n!}  \end{aligned}

Now evaluate the mean.

    \displaystyle \begin{aligned} E(N)&=\sum \limits_{n=2}^\infty \frac{n(n-1)}{n!} \\&=\sum \limits_{n=2}^\infty \frac{1}{(n-2)!} \\&=\sum \limits_{m=0}^\infty \frac{1}{m!} \\&=e \end{aligned}

With the above derivation, the proof that e= 2.718281828… is the average number of random numbers to select in order to obtain a sum that exceeds 1 is completed.



  1. Shultz H. S., Leonard B., Unexpected Occurrences of the Number e,Mathematics Magazine, October 1989, Volume 62, Number 4, pp. 269–271.

\copyright \ \text{2015 by Dan Ma}

Conditional Distributions, Part 1

We illustrate the thought process of conditional distributions with a series of examples. These examples are presented in a series of blog posts. In this post, we look at some conditional distributions derived from discrete probability distributions.

Practice problems are found in the companion blog.


The Setting

Suppose we have a discrete random variable X with f(x)=P(X=x) as the probability mass function. Suppose some random experiment can be modeled by the discrete random variable X. The sample space S for this probability experiment is the set of sample points with positive probability masses, i.e. S is the set of all x for which f(x)=P(X=x)>0. In the examples below, S is either a subset of the real line \mathbb{R} or a subset of the plane \mathbb{R} \times \mathbb{R}. Conceivably the sample space could be subset of any Euclidean space \mathbb{R}^n in higher dimension.

Suppose that we are informed that some event A in the random experiment has occurred (A is a subset of the sample space S). Given this new information, all the sample points outside of the event A are irrelevant. Or perhaps, in this random experiment, we are only interested in those outcomes that are elements of some subset A of the sample space S. In either of these scenarios, we wish to make the event A as a new sample space.

The probability of the event A, denoted by P(A), is derived by summing the probabilities f(x)=P(X=x) over all the sample points x \in A. We have:

    \displaystyle P(A)=\sum_{x \in A} P(X=x)

The probability P(A) may not be 1.0. So the probability masses f(x)=P(X=x) for the sample points x \in A, if they are unadjusted, may not form a probability distribution. However, if we consider each such probability mass f(x)=P(X=x) as a proportion of the probability P(A), then the probability masses of the event A will form a probability distribution. For example, say the event A consists of two probability masses 0.2 and 0.3, which sum to 0.5. Then in the new sample space, the first probability mass is 0.4 (0.2 multiplied by \displaystyle \frac{1}{0.5} or divided by 0.5) and the second probability mass is 0.6.

We now summarize the above paragraph. Using the event A as a new sample space, the probability mass function is:

    \displaystyle f(x \lvert A)=\frac{f(x)}{P(A)}=\frac{P(X=x)}{P(A)}, \ \ \ \ \ \ \ \ \ x \in A

The above probability distribution is called the conditional distribution of X given the event A, denoted by X \lvert A. This new probability distribution incorporates new information about the results of a random experiment.

Once this new probability distribution is established, we can compute various distributional quantities (e.g. cumulative distribution function, mean, variance and other higher moments).



Suppose that two students take a multiple choice test that has 5 questions. Let X be the number of correct answers of one student and Y be the number of correct answers of the other student (these can be considered as test scores for the purpose of the examples here). Assume that X and Y are independent. The following shows the probability functions.

      \displaystyle \begin{bmatrix} \text{Count of}&\text{ }&\text{ }&P(X=x) &\text{ }&\text{ }&P(Y=y) \\\text{Correct Answers}&\text{ }&\text{ }&\text{ } &\text{ }&\text{ }&\text{ } \\\text{ }&\text{ }&\text{ } &\text{ }&\text{ } \\ 0&\text{ }&\text{ }&0.4&\text{ }&\text{ }&0.1 \\\text{ }&\text{ }&\text{ } &\text{ }&\text{ } \\ 1&\text{ }&\text{ }&0.2&\text{ }&\text{ }&0.1 \\\text{ }&\text{ }&\text{ } &\text{ }&\text{ } \\ 2&\text{ }&\text{ }&0.1&\text{ }&\text{ }&0.2 \\\text{ }&\text{ }&\text{ } &\text{ }&\text{ } \\ 3&\text{ }&\text{ }&0.1&\text{ }&\text{ }&0.2 \\\text{ }&\text{ }&\text{ } &\text{ }&\text{ } \\ 4&\text{ }&\text{ }&0.1 &\text{ }&\text{ }&0.2 \\\text{ }&\text{ }&\text{ } &\text{ }&\text{ }  \\ 5&\text{ }&\text{ }&0.1 &\text{ }&\text{ }&0.2  \end{bmatrix}

Note that E(X)=1.6 and E(Y)=2.9. Without knowing any additional information, we can expect on average one student gets 1.6 correct answers and one student gets 2.9 correct answers. If having 3 or more correct answers is considered passing, then the student represented by X has a 30% chance of passing while the student represented by Y has a 60% chance of passing. The following examples show how the expectation can change as soon as new information is known.

The following examples are based on these two test scores X and Y.

Example 1

In this example, we only consider the student whose correct answers are modeled by the random variable X. In addition to knowing the probability function P(X=x), we also know that this student has at least one correct answer (i.e. the new information is X>0).

In light of the new information, the new sample space is A=\left\{1,2,3,4,5 \right\}. Note that P(A)=0.6. In this new sample space, each probability mass is the original one divided by 0.6. For example, for the sample point X=1, we have \displaystyle P(X=1 \lvert X>0)=\frac{0.2}{0.6}=\frac{2}{6}. The following is the conditional probability distribution of X given X>0.

      \displaystyle P(X=1 \lvert X>0)=\frac{2}{6}

      \displaystyle P(X=2 \lvert X>0)=\frac{1}{6}

      \displaystyle P(X=3 \lvert X>0)=\frac{1}{6}

      \displaystyle P(X=4 \lvert X>0)=\frac{1}{6}

      \displaystyle P(X=5 \lvert X>0)=\frac{1}{6}

The conditional mean is the mean of the conditional distribution. We have \displaystyle E(X \lvert X>0)=\frac{16}{6}=2.67. Given that this student is knowledgeable enough to answer some question correctly, the expectation is higher than before knowing the additional information. Also, given the new information, the student in question has a 50% chance of passing (vs. 30% before the new information is known).

Example 2

We now look at a joint distribution that has a 2-dimensional sample space. Consider the joint distribution of test scores X and Y. If the new information is that the total number of correct answers among them is 4, how would this change our expectation of their performance?

Since X and Y are independent, the sample space is a square as indicated the figure below.

\text{ }

Figure 1 – Sample Space of Test Scores

Because the two scores are independent, the joint probability at each of these 36 sample points is the product of the individual probabilities. We have P(X=x,Y=y)=P(X=x) \times P(Y=y). The following figure shows one such joint probability.

Figure 2 – Joint Probability Function

After taking the test, suppose that we have the additional information that the two students have a total of 4 correct answers. With this new information, we can focus our attention on the new sample space that is indicated in the following figure.

Figure 3 – New Sample Space

Now we wish to discuss the conditional probability distribution of X \lvert X+Y=4 and the conditional probability distribution of Y \lvert X+Y=4. In particular, given that there are 4 correct answers between the two students, what would be their expected numbers of correct answers and what would be their chances of passing?

There are 5 sample points in the new sample space (the 5 points circled above). The conditional probability distribution is obtained by making each probability mass as a fraction of the sum of the 5 probability masses. First we calculate the 5 joint probabilities.

      \displaystyle P(X=0,Y=4)=P(X=0) \times P(Y=4) =0.4 \times 0.2=0.08

      \displaystyle P(X=1,Y=3)=P(X=1) \times P(Y=3) =0.2 \times 0.2=0.04

      \displaystyle P(X=2,Y=2)=P(X=2) \times P(Y=2) =0.1 \times 0.2=0.02

      \displaystyle P(X=3,Y=1)=P(X=3) \times P(Y=1) =0.1 \times 0.1=0.01

      \displaystyle P(X=4,Y=0)=P(X=4) \times P(Y=0) =0.1 \times 0.1=0.01

The sum of these 5 joint probabilities is P(X+Y=4)=0.16. Making each of these joint probabilities as a fraction of 0.16, we have the following two conditional probability distributions.

      \displaystyle P(X=0 \lvert X+Y=4)=\frac{8}{16} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(Y=0 \lvert X+Y=4)=\frac{1}{16}

      \displaystyle P(X=1 \lvert X+Y=4)=\frac{4}{16} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(Y=1 \lvert X+Y=4)=\frac{1}{16}

      \displaystyle P(X=2 \lvert X+Y=4)=\frac{2}{16} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(Y=2 \lvert X+Y=4)=\frac{2}{16}

      \displaystyle P(X=3 \lvert X+Y=4)=\frac{1}{16} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(Y=3 \lvert X+Y=4)=\frac{4}{16}

      \displaystyle P(X=4 \lvert X+Y=4)=\frac{1}{16} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(Y=4 \lvert X+Y=4)=\frac{8}{16}

Now the conditional means given that X+Y=4, comparing against the unconditional means.

      \displaystyle E(X \lvert X+Y=4)=\frac{0+4+4+3+4}{16}=\frac{15}{16}=0.9375 \ \ \ \ \ \ \ \ \text{vs} \ \ E(X)=1.6

      \displaystyle E(Y \lvert X+Y=4)=\frac{0+1+4+12+32}{16}=\frac{49}{16}=3.0625 \ \ \ \ \ \text{vs} \ \ E(Y)=2.9

Now compare the chances of passing.

      \displaystyle P(X \ge 3 \lvert X+Y=4)=\frac{4}{16}=0.25 \ \ \ \ \ \ \ \ \ \ \text{vs} \ \ P(X \ge 3)=0.3

      \displaystyle P(Y \ge 3 \lvert X+Y=4)=\frac{14}{16}=0.875 \ \ \ \ \ \ \ \ \text{vs} \ \ P(Y \ge 3)=0.6

Based on the new information of X+Y=4, we have a lower expectation for the student represented by X and a higher expectation for the student represented by Y. Observe that the conditional probability at X=0 increases to 0.5 from 0.4, while the conditional probability at X=4 increases to 0.5 from 0.2.

Example 3

Now suppose the new information is that the two students do well on the test. Particularly, their combined number of correct answers is greater than or equal to 5, i.e., X+Y \ge 5. How would this impact the conditional distributions?

First we discuss the conditional distributions for X \lvert X+Y \ge 5 and Y \lvert X+Y \ge 5. By considering the new information, the following is the new sample space.

Figure 4 – New Sample Space

To derive the conditional distribution of X \lvert X+Y \ge 5, sum the joint probabilities within the new sample space for each X=x. The calculation is shown below.

      \displaystyle P(X=0 \cap X+Y \ge 5)=0.4 \times 0.2=0.08

      \displaystyle P(X=1 \cap X+Y \ge 5)=0.2 \times (0.2+0.2)=0.08

      \displaystyle P(X=2 \cap X+Y \ge 5)=0.1 \times (0.2+0.2+0.2)=0.06

      \displaystyle P(X=3 \cap X+Y \ge 5)=0.1 \times (0.2+0.2+0.2+0.2)=0.08

      \displaystyle P(X=4 \cap X+Y \ge 5)=0.1 \times (1-0.1)=0.09

      \displaystyle P(X=5 \cap X+Y \ge 5)=0.1 \times (1)=0.10

The sum of these probabilities is 0.49, which is P(X+Y \ge 5). The conditional distribution of X \lvert X+Y \ge 5 is obtained by taking each of the above probabilities as a fraction of 0.49. We have:

      \displaystyle P(X=0 \lvert X+Y \ge 5)=\frac{8}{49}=0.163

      \displaystyle P(X=1 \lvert X+Y \ge 5)=\frac{8}{49}=0.163

      \displaystyle P(X=2 \lvert X+Y \ge 5)=\frac{6}{49}=0.122

      \displaystyle P(X=3 \lvert X+Y \ge 5)=\frac{8}{49}=0.163

      \displaystyle P(X=4 \lvert X+Y \ge 5)=\frac{9}{49}=0.184

      \displaystyle P(X=5 \lvert X+Y \ge 5)=\frac{10}{49}=0.204

We have the conditional mean \displaystyle E(X \lvert X+Y \ge 5)=\frac{0+8+12+24+36+50}{49}=\frac{130}{49}=2.653 (vs. E(X)=1.6). The conditional probability of passing is \displaystyle P(X \ge 3 \lvert X+Y \ge 5)=\frac{27}{49}=0.55 (vs. P(X \ge 3)=0.3).

Note that the above conditional distribution for X \lvert X+Y \ge 5 is not as skewed as the original one for X. With the information that both test takers do well, the expected score for the student represented by X is much higher.

With similar calculation we have the following results for the conditional distribution of Y \lvert X+Y \ge 5.

      \displaystyle P(Y=0 \lvert X+Y \ge 5)=\frac{1}{49}=0.02

      \displaystyle P(Y=1 \lvert X+Y \ge 5)=\frac{2}{49}=0.04

      \displaystyle P(Y=2 \lvert X+Y \ge 5)=\frac{6}{49}=0.122

      \displaystyle P(Y=3 \lvert X+Y \ge 5)=\frac{8}{49}=0.163

      \displaystyle P(Y=4 \lvert X+Y \ge 5)=\frac{12}{49}=0.245

      \displaystyle P(Y=5 \lvert X+Y \ge 5)=\frac{20}{49}=0.408

We have the conditional mean \displaystyle E(Y \lvert X+Y \ge 5)=\frac{0+2+12+24+48+100}{49}=\frac{186}{49}=3.8 (vs. E(Y)=2.9). The conditional probability of passing is \displaystyle P(Y \ge 3 \lvert X+Y \ge 5)=\frac{40}{49}=0.82 (vs. P(Y \ge 3)=0.6). Indeed, with the information that both test takers do well, we can expect much higher results from each individual test taker.

Example 4

In Examples 2 and 3, the new information involve both test takers (both random variables). If the new information involves just one test taker, it may be immaterial on the exam score of the other student. For example, suppose that Y \ge 4. Then what is the conditional distribution for X \lvert Y \ge 4? Since X and Y are independent, the high score Y \ge 4 has no impact on the score X. However, the high joint score X+Y \ge 5 does have an impact on each of the individual scores (Example 3).



We conclude with a summary of the thought process of conditional distributions.

Suppose X is a discrete random variable and f(x)=P(X=x) is its probability function. Further suppose that X is the probability model of some random experiment. The sample space of this random experiment is S.

Suppose we have some new information that in this random experiment, some event A has occurred. The event A is a subset of the sample space S.

To incorporate this new information, the event A is the new sample space. The random variable incorporated with the new information, denoted by X \lvert A, has a conditional probability distribution. The following is the probability function of the conditional distribution.

    \displaystyle f(x \lvert A)=\frac{f(x)}{P(A)}=\frac{P(X=x)}{P(A)}, \ \ \ \ \ \ \ \ \ x \in A

where P(A) = \displaystyle \sum_{x \in A} P(X=x).

The thought process is that in the conditional distribution is derived from taking each original probability mass as a fraction of the total probability P(A). The probability function derived in this manner reflects the new information that the event A has occurred.

Once the conditional probability function is derived, it can be used just like any other probability function, e.g. computationally for finding various distributional quantities.


Practice Problems

Practice problems are found in the companion blog.


\copyright \ \text{2013 by Dan Ma}

Picking Two Types of Binomial Trials

We motivate the discussion with the following example. The notation W \sim \text{binom}(n,p) denotes the statement that W has a binomial distribution with parameters n and p. In other words, W is the number of successes in a sequence of n independent Bernoulli trials where p is the probability of success in each trial.

Example 1
Suppose that a student took two multiple choice quizzes in a course for probability and statistics. Each quiz has 5 questions. Each question has 4 choices and only one of the choices is correct. Suppose that the student answered all the questions by pure guessing. Furthermore, the two quizzes are independent (i.e. results of one quiz will not affect the results of the other quiz). Let X be the number of correct answers in the first quiz and Y be the number of correct answers in the second quiz. Suppose the student was told by the instructor that she had a total of 4 correct answers in these two quizzes. What is the probability that she had 3 correct answers in the first quiz?

On the face of it, the example is all about binomial distribution. Both X and Y are binomial distributions (both \sim \text{binom}(5,\frac{1}{4})). The sum X+Y is also a binomial distribution (\sim \text{binom}(10,\frac{1}{4})). The question that is being asked is a conditional probability, i.e., P(X=3 \lvert X+Y=4). Surprisingly, this conditional probability can be computed using the hypergeometric distribution. One can always work this problem from first principle using binomial distributions. As discussed below, for a problem such as Example 1, it is always possible to replace the binomial distributions using a thought process involving the hypergeometric distribution.

Here’s how to think about the problem. This student took the two quizzes and was given the news by the instructor that she had 4 correct answers in total. She now wonders what the probability of having 3 correct answers in the first quiz is. The thought process is this. She is to pick 4 questions from 10 questions (5 of them are from Quiz 1 and 5 of them are from Quiz 2). So she is picking 4 objects from a group of two distinct types of objects. This is akin to reaching into a jar that has 5 red balls and 5 blue balls and pick 4 balls without replacement. What is the probability of picking 3 red balls and 1 blue ball? The probability just described is from a hypergeometric distribution. The following shows the calculation.

\displaystyle (1) \ \ \ \ P(X=3 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{3} \ \binom{5}{1}}{\displaystyle \binom{10}{4}}=\frac{50}{210}

We will show below why this works. Before we do that, let’s describe the above thought process. Whenever you have two independent binomial distributions X and Y with the same probability of success p (the number of trials does not have to be the same), the conditional distribution X \lvert X+Y=a is a hypergeometric distribution. Interestingly, the probability of success p has no bearing on this observation. For Example 1, we have the following calculation.

\displaystyle (2a) \ \ \ \ P(X=0 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{0} \ \binom{5}{4}}{\displaystyle \binom{10}{4}}=\frac{5}{210}

\displaystyle (2b) \ \ \ \ P(X=1 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{1} \ \binom{5}{3}}{\displaystyle \binom{10}{4}}=\frac{50}{210}

\displaystyle (2c) \ \ \ \ P(X=2 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{2} \ \binom{5}{2}}{\displaystyle \binom{10}{4}}=\frac{100}{210}

\displaystyle (2d) \ \ \ \ P(X=3 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{3} \ \binom{5}{1}}{\displaystyle \binom{10}{4}}=\frac{50}{210}

\displaystyle (2e) \ \ \ \ P(X=4 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{4} \ \binom{5}{0}}{\displaystyle \binom{10}{4}}=\frac{5}{210}

Interestingly, the conditional mean E(X \lvert X+Y=4)=2, while the unconditional mean E(X)=5 \times \frac{1}{4}=1.25. The fact that the conditional mean is higher is not surprising. The student was lucky enough to have obtained 4 correct answers by guessing. Given this, she had a greater chance of doing better on the first quiz.

Why This Works

Suppose X \sim \text{binom}(5,p) and Y \sim \text{binom}(5,p) and they are independent. The joint distribution of X and Y has 36 points in the sample space. See the following diagram.

Figure 1

The probability attached to each point is

\displaystyle \begin{aligned}(3) \ \ \ \  P(X=x,Y=y)&=P(X=x) \times P(Y=y) \\&=\binom{5}{x} p^x (1-p)^{5-x} \times \binom{5}{y} p^y (1-p)^{5-y}  \end{aligned}

where x=0,1,2,3,4,5 and y=0,1,2,3,4,5.

The conditional probability P(X=k \lvert X+Y=4) involves 5 points as indicated in the following diagram.

Figure 2

The conditional probability P(X=k \lvert X+Y=4) is simply the probability of one of the 5 sample points as a fraction of the sum total of the 5 sample points encircled in the above diagram. The following is the sum total of the probabilities of the 5 points indicated in Figure 2.

\displaystyle \begin{aligned}(4) \ \ \ \  P(X+Y=4)&=P(X=0) \times P(Y=4)+P(X=1) \times P(Y=3)\\&\ \ \ \ +P(X=2) \times P(Y=3)+P(X=3) \times P(Y=2)\\&\ \ \ \ +P(X=4) \times P(Y=0)  \end{aligned}

We can plug (3) into (4) and work out the calculation. But (4) is actually equivalent to the following because X+Y \sim \text{binom}(10,p).

\displaystyle (5) \ \ \ \ P(X+Y=4)=\ \binom{10}{4} p^4 \ (1-p)^{6}

As stated earlier, the conditional probability P(X=k \lvert X+Y=4) is simply the probability of one of the 5 sample points as a fraction of the sum total of the 5 sample points encircled in Figure 2. Thus we have:

\displaystyle \begin{aligned}(6) \ \ \ \  P(X=k \lvert X+Y=4)&=\frac{P(X=k) \times P(Y=4-k)}{P(X+Y=4)} \\&=\frac{\displaystyle \binom{5}{k} p^k (1-p)^{5-k} \times \binom{5}{4-k} p^{4-k} (1-p)^{5-(4-k)}}{\displaystyle  \binom{10}{4} p^4 \ (1-p)^{6}}   \end{aligned}

With the terms involving p and 1-p cancel out, we have:

\displaystyle (7) \ \ \ \  P(X=k \lvert X+Y=4)=\frac{\displaystyle \binom{5}{k} \times \binom{5}{4-k}}{\displaystyle  \binom{10}{4}}


Suppose X \sim \text{binom}(N,p) and Y \sim \text{binom}(M,p) and they are independent. Then X+Y is also a binomial distribution, i.e., \sim \text{binom}(N+M,p). Suppose that both binomial experiments \text{binom}(N,p) and \text{binom}(M,p) have been performed and it is known that there are a successes in total. Then X \lvert X+Y=a has a hypergeometric distribution.

\displaystyle (8) \ \ \ \  P(X=k \lvert X+Y=a)=\frac{\displaystyle \binom{N}{k} \times \binom{M}{a-k}}{\displaystyle  \binom{N+M}{a}}

where k=0,1,2,3,\cdots,\text{min}(N,a).

As discussed earlier, think of the N trials in \text{binom}(N,p) as red balls and think of the M trials in \text{binom}(M,p) as blue balls in a jar. Think of the a successes as the number of balls you are about to draw from the jar. So you reach into the jar and select a balls without replacement. The calculation in (8) gives the probability that you select k red balls and a-k blue balls.

The probability of success p in the two binomial distributions have no bearing on the result since it gets canceled out in the derivation. One can always work a problem like Example 1 using first principle. Once the thought process using hypergeometric distribution is understood, it is a great way to solve this problem, that is, you can by pass the binomial distributions and go straight to the hypergeometric distribution.

Additional Practice
Practice problems are found in the following blog post.

How to pick binomial trials