# The hypergeometric distribution, an example

We present an example of the hypergeometric distribution arising from the sum of two independent binomial random variables. Suppose a student takes two independent multiple choice quizzes (i.e. performance on one quiz has no bearing on the other quiz). Quiz 1 has 5 problems, each with 4 choices. Quiz 2 also has 5 problems, each with 4 choices. Suppose the student answers every question on both quizzes by pure guessing. If the student has a total of four correct answers on the two quizzes combined, what is the probability that he passes quiz 1 (at least 60% correct, i.e. at least 3 of the 5 problems)?

Let $X$ be the number of correct answers in quiz 1 and $Y$ be the number of correct answers in quiz 2. Then both $X$ and $Y$ have a binomial distribution with $n=5$ and $p=0.25$. Because $X$ and $Y$ are independent and share the same $p$, the total number of correct answers $Z=X+Y$ has a binomial distribution with $n=10$ and $p=0.25$. The problem we need to solve is $P[X \ge 3 \lvert Z=4]$.
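Before working out the exact answer, the conditional probability can be estimated by simulation. The following is a minimal sketch, not part of the original argument: the function name `estimate_pass_given_four`, the number of trials and the seed are all illustrative assumptions. It simulates pure guessing on both quizzes, keeps only the runs with exactly four correct answers in total, and reports the fraction of those runs in which quiz 1 is passed.

```python
import random


def estimate_pass_given_four(trials=100_000, seed=0):
    """Monte Carlo estimate of P[X >= 3 | X + Y = 4] under pure guessing.

    The trial count and seed are arbitrary choices for illustration.
    """
    rng = random.Random(seed)
    kept = 0    # runs with exactly four correct answers in total
    passed = 0  # kept runs in which quiz 1 has at least 3 correct answers
    for _ in range(trials):
        x = sum(rng.random() < 0.25 for _ in range(5))  # quiz 1
        y = sum(rng.random() < 0.25 for _ in range(5))  # quiz 2
        if x + y == 4:
            kept += 1
            passed += x >= 3
    return passed / kept


estimate = estimate_pass_given_four()
print(estimate)
```

The estimate fluctuates with the seed and the number of trials, so it should only be read as a rough check on the exact calculation.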

We propose that the conditional distribution of $X \lvert Z=z$ is a hypergeometric distribution. To see this intuitively, picture the ten problems as ten balls in a bowl: five green balls (the problems in quiz 1) and five yellow balls (the problems in quiz 2). Taking the two quizzes and getting a total of four correct answers is like drawing 4 balls out of this bowl without replacement, where each ball drawn marks a correctly answered problem. Then what is the probability that three of the four balls are green? This probability comes from the hypergeometric distribution (drawing 4 balls out of the bowl and getting 3 green balls and 1 yellow ball). Though not a proof, this is a good intuitive description of the approach we can take. We first do the calculation and present the proof at the end.

We now evaluate the probability function $P[X=j \lvert Z=4]$ for $j=0,1,2,3,4$. For example, $P[X=1 \lvert Z=4]$ is the probability of drawing 4 balls out of the bowl and getting 1 green ball and 3 yellow balls.

$\displaystyle P[X=0 \lvert Z=4]=\displaystyle \frac{\displaystyle \binom{5}{0} \binom{5}{4}}{\displaystyle \binom{10}{4}}=\frac{1}{42}$

$\displaystyle P[X=1 \lvert Z=4]=\frac{\displaystyle \binom{5}{1} \binom{5}{3}}{\displaystyle \binom{10}{4}}=\frac{10}{42}$

$\displaystyle P[X=2 \lvert Z=4]=\frac{\displaystyle \binom{5}{2} \binom{5}{2}}{\displaystyle \binom{10}{4}}=\frac{20}{42}$

$\displaystyle P[X=3 \lvert Z=4]=\frac{\displaystyle \binom{5}{3} \binom{5}{1}}{\displaystyle \binom{10}{4}}=\frac{10}{42}$

$\displaystyle P[X=4 \lvert Z=4]=\frac{\displaystyle \binom{5}{4} \binom{5}{0}}{\displaystyle \binom{10}{4}}=\frac{1}{42}$

Thus, $\displaystyle P[X \ge 3 \lvert Z=4]=\frac{10}{42}+\frac{1}{42}=\frac{11}{42} \approx 0.26$. Note that the unconditional probability is $P[X \ge 3]=0.1035$, using the binomial distribution with $n=5$ and $p=0.25$. It is not surprising that the conditional probability is much greater. Four correct guesses in total is well above the expected total of $10 \times 0.25=2.5$, so conditioning on $Z=4$ means the student has been lucky, and some of that luck is likely to have fallen on quiz 1.
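The arithmetic above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the helper names `hypergeom_pmf` and `binom_pmf` are our own and are not taken from any package.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(x, n, m, z):
    """Probability of x green balls when z balls are drawn from n green and m yellow."""
    return Fraction(comb(n, x) * comb(m, z - x), comb(n + m, z))


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Conditional distribution of X given Z = 4 (5 problems in each quiz).
conditional = {j: hypergeom_pmf(j, 5, 5, 4) for j in range(5)}
p_pass_given_four = conditional[3] + conditional[4]

# Unconditional probability of passing quiz 1 by guessing.
p_pass_unconditional = sum(binom_pmf(k, 5, 0.25) for k in range(3, 6))

for j, prob in conditional.items():
    print(j, prob)
print(p_pass_given_four, float(p_pass_given_four))
print(p_pass_unconditional)
```

Exact fractions are used for the conditional probabilities so that the results can be compared directly with the fractions in the text. `Fraction` reduces to lowest terms, so for instance $\frac{10}{42}$ is printed as `5/21`.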

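We now discuss the general fact. Suppose $X \sim \text{binomial}(n,p)$ and $Y \sim \text{binomial}(m,p)$ are independent, so that $Z=X+Y \sim \text{binomial}(n+m,p)$. We show that $X \lvert Z=z$ has a hypergeometric distribution. The event $\{X=x, Z=z\}$ is the same as the event $\{X=x, Y=z-x\}$, so by independence $P[X=x \lvert Z=z]=P[X=x] \thinspace P[Y=z-x] / P[Z=z]$. Substituting the three binomial probability functions gives: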

$\displaystyle P[X=x \lvert Z=z]=\frac{\displaystyle \binom{n}{x} p^x (1-p)^{n-x} \thinspace \binom{m}{z-x} p^{z-x} (1-p)^{m-(z-x)}}{\displaystyle \binom{n+m}{z} p^z (1-p)^{n+m-z}}$

The powers of $p$ and $1-p$ cancel, since $p^x p^{z-x}=p^z$ and $(1-p)^{n-x} (1-p)^{m-(z-x)}=(1-p)^{n+m-z}$. What remains is the probability function of the hypergeometric distribution:

$\displaystyle P[X=x \lvert Z=z]=\frac{\displaystyle \binom{n}{x} \binom{m}{z-x}}{\displaystyle \binom{n+m}{z}}$
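The cancellation can also be checked numerically. The sketch below (helper names are our own) evaluates the ratio of binomial probabilities for several values of $p$ and compares it with the hypergeometric formula. The values $n=m=5$ and $z=4$ are taken from the example, while the list of $p$ values is an arbitrary choice.

```python
from math import comb


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def conditional_from_binomials(x, z, n, m, p):
    """P[X = x | X + Y = z] computed as a ratio of binomial probabilities."""
    return binom_pmf(x, n, p) * binom_pmf(z - x, m, p) / binom_pmf(z, n + m, p)


def hypergeom_pmf(x, z, n, m):
    """Hypergeometric probability, which does not involve p."""
    return comb(n, x) * comb(m, z - x) / comb(n + m, z)


n, m, z = 5, 5, 4
max_gap = max(
    abs(conditional_from_binomials(x, z, n, m, p) - hypergeom_pmf(x, z, n, m))
    for p in (0.1, 0.25, 0.5, 0.9)
    for x in range(z + 1)
)
print(max_gap)  # difference is only floating-point rounding error
```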

The above probability distribution describes the situation where there are $n+m$ similar objects, with $n$ objects belonging to one class (say green balls) and $m$ objects belonging to another class (say yellow balls). We choose $z$ balls out of the $n+m$ balls without replacement. The above probability is the probability of getting $x$ green balls and $z-x$ yellow balls. There are $\binom{n}{x}$ ways of choosing $x$ green balls out of $n$ green balls. Likewise, there are $\binom{m}{z-x}$ ways of choosing $z-x$ yellow balls out of $m$ yellow balls. The total number of ways the joint operation can take place is $\binom{n}{x} \binom{m}{z-x}$. Of course, we assume that each of the $\binom{n+m}{z}$ ways of selecting $z$ balls out of $n+m$ balls is equally likely.
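The counting argument can be illustrated by brute-force enumeration for the bowl in our example (5 green, 5 yellow, 4 drawn). This is a small sketch in which the balls are numbered so that every selection is distinct. The variable names are our own.

```python
from itertools import combinations
from math import comb

# Ten distinguishable balls: (index, colour).
balls = [(i, "G") for i in range(5)] + [(i, "Y") for i in range(5, 10)]

# Every way of drawing 4 balls without replacement, ignoring order.
draws = list(combinations(balls, 4))

# Draws that contain exactly 3 green balls (and therefore 1 yellow ball).
favorable = sum(
    1 for draw in draws if sum(colour == "G" for _, colour in draw) == 3
)

print(len(draws), comb(10, 4))            # all selections
print(favorable, comb(5, 3) * comb(5, 1))  # selections with 3 green, 1 yellow
print(favorable / len(draws))
```

Each printed pair should agree, which is the statement that the numerator and denominator of the hypergeometric probability are counts of equally likely selections.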

In the conditional probability in our example, the probability of success (0.25) in the individual Bernoulli trials that make up the two binomial distributions is not used. This is because the terms for $p$ and $1-p$ cancel out. If the two quizzes have different probabilities of success, these terms no longer cancel and the resulting conditional distribution $P[X=x \lvert X+Y=z]$ is no longer hypergeometric. In that case, the conditional probability must be obtained from first principles.
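As an illustration of the first-principles calculation, the sketch below computes $P[X=x \lvert X+Y=z]$ directly from the definition of conditional probability, summing over all ways the total $z$ can be split between the two quizzes. The function names and the second success probability of 0.5 are hypothetical choices made only for this illustration.

```python
from math import comb


def binom_pmf(k, n, p):
    """Binomial probability; zero outside the range 0..n."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def conditional_pmf(x, z, n, p1, m, p2):
    """P[X = x | X + Y = z] for independent X ~ binomial(n, p1), Y ~ binomial(m, p2)."""
    numerator = binom_pmf(x, n, p1) * binom_pmf(z - x, m, p2)
    # P[X + Y = z], obtained by summing over every possible value of X.
    denominator = sum(
        binom_pmf(k, n, p1) * binom_pmf(z - k, m, p2) for k in range(z + 1)
    )
    return numerator / denominator


# Same success probability in both quizzes: recovers the hypergeometric answer.
equal_case = sum(conditional_pmf(x, 4, 5, 0.25, 5, 0.25) for x in (3, 4))

# Hypothetical case: quiz 2 has a success probability of 0.5 instead of 0.25.
unequal_case = sum(conditional_pmf(x, 4, 5, 0.25, 5, 0.5) for x in (3, 4))

print(equal_case, 11 / 42)
print(unequal_case)
```

When quiz 2 is easier to guess, more of the four correct answers are likely to have come from quiz 2, so the conditional probability of passing quiz 1 drops below the hypergeometric value.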