# How To Calculate Winning Odds in California Lottery

Ever wonder how to calculate the winning odds of lottery games? The odds of winning the top prize of Fantasy 5 in the California Lottery are 1 in 575,757. The odds of winning the top prize of SuperLOTTO Plus are 1 in 41,416,353. The odds of winning the top prize of Mega Millions are 1 in 175,711,536. In this post, we show how to calculate the odds for these games in the California Lottery. The calculation is an excellent exercise in combinatorics as well as in calculating hypergeometric probabilities.

All figures and data are obtained from the California Lottery.

Update, April 27, 2017. The calculation in this post assumes certain background knowledge of combinations and the multiplication principle (not explained here). For any reader who would like to further understand how lottery odds are calculated, see this blog post on Powerball. It is a self-contained, step-by-step explanation at the basic level of how to calculate winning odds in the Powerball game.

____________________________________________________
Fantasy 5

The following figures show a playslip and a sample ticket for the game of Fantasy 5.

Figure 1

Figure 2

In the game of Fantasy 5, the player chooses 5 numbers from 1 to 39. If all 5 chosen numbers match the 5 winning numbers, the player wins the top prize, which starts at $50,000 and can go up to $500,000 or more. The odds of winning the top prize are 1 in 575,757. There are lower tier prizes that are easier to win but with much lower winning amounts. The following figure shows the prize categories and the winning odds of Fantasy 5.

Figure 3

All 5 of 5
In matching the player’s chosen numbers with the winning numbers, the order of the numbers does not matter. Thus in the calculation of odds, we use combinations rather than permutations. We have:

$\displaystyle (1) \ \ \ \ \ \binom{39}{5}=\frac{39!}{5! \ (39-5)!}=575757$

Based on $(1)$, the odds of matching all 5 winning numbers are 1 in 575,757 (the odds of winning the top prize).

Any 4 of 5
To match 4 out of 5 winning numbers, 4 of the player’s chosen numbers are winning numbers and 1 of the player’s chosen numbers is from the non-winning numbers (34 of them). Thus the probability of matching 4 out of 5 winning numbers is:

$\displaystyle (2) \ \ \ \ \ \frac{\displaystyle \binom{5}{4} \ \binom{34}{1}}{\displaystyle \binom{39}{5}}=\frac{5 \times 34}{575757}=\frac{1}{3386.8} \ \ \text{(1 out of 3,387)}$

Any 3 of 5
To find the odds for matching 3 out of 5 winning numbers, we need to find the probability that 3 of the player’s chosen numbers are from the 5 winning numbers and 2 of the selected numbers are from the 34 non-winning numbers. Thus we have:

$\displaystyle (3) \ \ \ \ \ \frac{\displaystyle \binom{5}{3} \ \binom{34}{2}}{\displaystyle \binom{39}{5}}=\frac{10 \times 561}{575757}=\frac{1}{102.63} \ \ \text{(1 out of 103)}$

Any 2 of 5
Similarly, the following shows how to calculate the odds of matching 2 out of 5 winning numbers:

$\displaystyle (4) \ \ \ \ \ \frac{\displaystyle \binom{5}{2} \ \binom{34}{3}}{\displaystyle \binom{39}{5}}=\frac{10 \times 5984}{575757}=\frac{1}{9.6216} \ \ \text{(1 out of 10)}$
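The four Fantasy 5 calculations above follow one pattern, which can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `match_odds` and its parameters are made up for this example, and it assumes the 5-from-39 format described above.

```python
from math import comb

def match_odds(k, picks=5, pool=39):
    """Return (ways, total): the number of tickets matching exactly k
    winning numbers, and the total number of possible tickets."""
    # choose k of the winning numbers and picks - k of the non-winning numbers
    ways = comb(picks, k) * comb(pool - picks, picks - k)
    total = comb(pool, picks)
    return ways, total

for k in (5, 4, 3, 2):
    ways, total = match_odds(k)
    print(f"match {k} of 5: {ways} / {total}  (1 in {total / ways:,.2f})")
```

The numerators and denominators printed here are the same ones that appear in $(1)$ through $(4)$.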

____________________________________________________
SuperLOTTO Plus

Here are the pictures of a playslip and a sample ticket of the game of SuperLOTTO Plus.

Figure 4

Figure 5

Based on the playslip (Figure 4), the player chooses 5 numbers from 1 to 47. The player also chooses an additional number called a Mega number from 1 to 27. To win the top prize, there must be a match between the player’s 5 selections and the 5 winning numbers as well as a match between the player’s Mega number and the winning Mega number (All 5 of 5 and Mega in Figure 6 below).

Figure 6

All 5 of 5 and Mega
To find the odds of the match of “All 5 of 5 and Mega”, the total number of possibilities is obtained by choosing 5 numbers from 47 numbers and choosing 1 number from 27 numbers. We have:

$\displaystyle (5) \ \ \ \ \ \binom{47}{5} \times \binom{27}{1}=41,416,353$

Thus the odds of matching “All 5 of 5 and Mega” are 1 in 41,416,353.

All 5 of 5
To find the odds of matching “All 5 of 5” (i.e. the player’s 5 selections match the 5 winning numbers but the Mega number does not match the winning Mega number), we need to choose 5 numbers from the 5 winning numbers, choose 0 numbers from the 42 non-winning numbers, choose 0 numbers from the 1 winning Mega number and choose 1 number from the 26 non-winning Mega numbers. This may seem overly precise, but it will make the subsequent derivations easier. We have:

$\displaystyle \begin{aligned}(6) \ \ \ \ \ \frac{\displaystyle \binom{5}{5} \ \binom{42}{0} \ \binom{1}{0} \ \binom{26}{1}}{\displaystyle \binom{47}{5} \times \binom{27}{1}}&=\frac{1 \times 1 \times 1 \times 26}{41416353} \\&=\frac{1}{1592936.654} \\&\text{ } \\&=\text{1 out of 1,592,937} \end{aligned}$

Any 4 of 5 and Mega
To calculate the odds for matching “any 4 of 5 and Mega”, we need to choose 4 out of 5 winning numbers, choose 1 out of 42 non-winning numbers, choose 1 out of 1 Mega winning number, and choose 0 out of 26 non-winning Mega numbers. We have:

$\displaystyle \begin{aligned}(7) \ \ \ \ \ \frac{\displaystyle \binom{5}{4} \ \binom{42}{1} \ \binom{1}{1} \ \binom{26}{0}}{\displaystyle \binom{47}{5} \times \binom{27}{1}}&=\frac{5 \times 42 \times 1 \times 1}{41416353} \\&=\frac{1}{197220.7286} \\&\text{ } \\&=\text{1 out of 197,221} \end{aligned}$

Any 4 of 5
To calculate the odds for matching “any 4 of 5” (no match for Mega number), we need to choose 4 out of 5 winning numbers, choose 1 out of 42 non-winning numbers, choose 0 out of 1 Mega winning number, and choose 1 out of 26 non-winning Mega numbers. We have:

$\displaystyle \begin{aligned}(8) \ \ \ \ \ \frac{\displaystyle \binom{5}{4} \ \binom{42}{1} \ \binom{1}{0} \ \binom{26}{1}}{\displaystyle \binom{47}{5} \times \binom{27}{1}}&=\frac{5 \times 42 \times 1 \times 26}{41416353} \\&=\frac{1}{7585.412637} \\&\text{ } \\&=\text{1 out of 7,585} \end{aligned}$

Any 3 of 5 and Mega
To calculate the odds for matching “any 3 of 5 and Mega”, we need to choose 3 out of 5 winning numbers, choose 2 out of 42 non-winning numbers, choose 1 out of 1 Mega winning number, and choose 0 out of 26 non-winning Mega numbers. We have:

$\displaystyle \begin{aligned}(9) \ \ \ \ \ \frac{\displaystyle \binom{5}{3} \ \binom{42}{2} \ \binom{1}{1} \ \binom{26}{0}}{\displaystyle \binom{47}{5} \times \binom{27}{1}}&=\frac{10 \times 861 \times 1 \times 1}{41416353} \\&=\frac{1}{4810.261672} \\&\text{ } \\&=\text{1 out of 4,810} \end{aligned}$

The rest of the calculations for SuperLOTTO Plus are routine. It is a matter of deciding how many to choose from the 5 winning numbers, how many to choose from the 42 non-winning numbers, how many to choose from the 1 winning Mega number and how many to choose from the 26 non-winning Mega numbers.

Any 3 of 5
$\displaystyle \begin{aligned}(10) \ \ \ \ \ \frac{\displaystyle \binom{5}{3} \ \binom{42}{2} \ \binom{1}{0} \ \binom{26}{1}}{\displaystyle \binom{47}{5} \times \binom{27}{1}}&=\frac{10 \times 861 \times 1 \times 26}{41416353} \\&=\frac{1}{185.0100643} \\&\text{ } \\&=\text{1 out of 185} \end{aligned}$

Any 2 of 5 and Mega
$\displaystyle \begin{aligned}(11) \ \ \ \ \ \frac{\displaystyle \binom{5}{2} \ \binom{42}{3} \ \binom{1}{1} \ \binom{26}{0}}{\displaystyle \binom{47}{5} \times \binom{27}{1}}&=\frac{10 \times 11480 \times 1 \times 1}{41416353} \\&=\frac{1}{360.7696254} \\&\text{ } \\&=\text{1 out of 361} \end{aligned}$

Any 1 of 5 and Mega
$\displaystyle \begin{aligned}(12) \ \ \ \ \ \frac{\displaystyle \binom{5}{1} \ \binom{42}{4} \ \binom{1}{1} \ \binom{26}{0}}{\displaystyle \binom{47}{5} \times \binom{27}{1}}&=\frac{5 \times 111930 \times 1 \times 1}{41416353} \\&=\frac{1}{74.00402573} \\&\text{ } \\&=\text{1 out of 74} \end{aligned}$

None of 5 only Mega
$\displaystyle \begin{aligned}(13) \ \ \ \ \ \frac{\displaystyle \binom{5}{0} \ \binom{42}{5} \ \binom{1}{1} \ \binom{26}{0}}{\displaystyle \binom{47}{5} \times \binom{27}{1}}&=\frac{1 \times 850668 \times 1 \times 1}{41416353} \\&=\frac{1}{48.68685903} \\&\text{ } \\&=\text{1 out of 49} \end{aligned}$
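Since every SuperLOTTO Plus category uses the same four binomial coefficients, the whole prize table can be reproduced with one small function. The following is a sketch only; the name `two_pool_odds` and its arguments are invented for illustration, and the defaults assume the 5-from-47 plus 1-from-27 format described above.

```python
from math import comb

def two_pool_odds(k, mega, picks=5, pool=47, mega_pool=27):
    """Return x such that the odds of matching exactly k of the 5 main
    numbers, with (mega=True) or without (mega=False) the Mega number,
    are 1 in x."""
    main_ways = comb(picks, k) * comb(pool - picks, picks - k)
    # 1 way to match the winning Mega number, mega_pool - 1 ways to miss it
    mega_ways = 1 if mega else mega_pool - 1
    total = comb(pool, picks) * mega_pool
    return total / (main_ways * mega_ways)

categories = [(5, True), (5, False), (4, True), (4, False), (3, True),
              (3, False), (2, True), (1, True), (0, True)]
for k, mega in categories:
    label = f"{k} of 5" + (" and Mega" if mega else "")
    print(f"{label:>16}: 1 in {two_pool_odds(k, mega):,.0f}")
```

Passing different values for `pool` and `mega_pool` would adapt the same sketch to other games with a main pool and a separate bonus pool.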

____________________________________________________
Mega Millions

The following are a playslip, a sample ticket and the winning odds of the game of Mega Millions.

Figure 7

Figure 8

Figure 9

Based on the playslip (Figure 7), the player chooses 5 numbers from 1 to 56. The player also chooses an additional number called a Mega number from 1 to 46. To win the top prize, there must be a match between the player’s 5 selections and the 5 winning numbers as well as a match between the player’s Mega number and the winning Mega number. The calculations of the odds indicated in Figure 9 are left as exercises.
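As a starting point for these exercises, the top prize count follows the same reasoning as for SuperLOTTO Plus, assuming the 5-from-56 and 1-from-46 format read off the playslip:

$\displaystyle \binom{56}{5} \times \binom{46}{1}=3,819,816 \times 46=175,711,536$

The remaining categories are obtained by deciding how many to choose from the 5 winning numbers, the 51 non-winning numbers, the 1 winning Mega number and the 45 non-winning Mega numbers.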

# Picking Two Types of Binomial Trials

We motivate the discussion with the following example. The notation $W \sim \text{binom}(n,p)$ denotes the statement that $W$ has a binomial distribution with parameters $n$ and $p$. In other words, $W$ is the number of successes in a sequence of $n$ independent Bernoulli trials where $p$ is the probability of success in each trial.

Example 1
Suppose that a student took two multiple choice quizzes in a course for probability and statistics. Each quiz has 5 questions. Each question has 4 choices and only one of the choices is correct. Suppose that the student answered all the questions by pure guessing. Furthermore, the two quizzes are independent (i.e. results of one quiz will not affect the results of the other quiz). Let $X$ be the number of correct answers in the first quiz and $Y$ be the number of correct answers in the second quiz. Suppose the student was told by the instructor that she had a total of 4 correct answers in these two quizzes. What is the probability that she had 3 correct answers in the first quiz?

On the face of it, the example is all about the binomial distribution. Both $X$ and $Y$ have binomial distributions (both $\sim \text{binom}(5,\frac{1}{4})$). The sum $X+Y$ also has a binomial distribution ($\sim \text{binom}(10,\frac{1}{4})$). The question being asked is a conditional probability, i.e., $P(X=3 \lvert X+Y=4)$. Surprisingly, this conditional probability can be computed using the hypergeometric distribution. One can always work this problem from first principles using binomial distributions. As discussed below, for a problem such as Example 1, it is always possible to replace the binomial distributions with a thought process involving the hypergeometric distribution.

Here’s how to think about the problem. This student took the two quizzes and was given the news by the instructor that she had 4 correct answers in total. She now wonders what the probability of having 3 correct answers in the first quiz is. The thought process is this. She is to pick 4 questions from 10 questions (5 of them are from Quiz 1 and 5 of them are from Quiz 2). So she is picking 4 objects from a group of two distinct types of objects. This is akin to reaching into a jar that has 5 red balls and 5 blue balls and picking 4 balls without replacement. What is the probability of picking 3 red balls and 1 blue ball? The probability just described is from a hypergeometric distribution. The following shows the calculation.

$\displaystyle (1) \ \ \ \ P(X=3 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{3} \ \binom{5}{1}}{\displaystyle \binom{10}{4}}=\frac{50}{210}$

We will show below why this works. Before we do that, let’s describe the above thought process. Whenever you have two independent binomial distributions $X$ and $Y$ with the same probability of success $p$ (the number of trials does not have to be the same), the conditional distribution $X \lvert X+Y=a$ is a hypergeometric distribution. Interestingly, the probability of success $p$ has no bearing on this observation. For Example 1, we have the following calculation.

$\displaystyle (2a) \ \ \ \ P(X=0 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{0} \ \binom{5}{4}}{\displaystyle \binom{10}{4}}=\frac{5}{210}$

$\displaystyle (2b) \ \ \ \ P(X=1 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{1} \ \binom{5}{3}}{\displaystyle \binom{10}{4}}=\frac{50}{210}$

$\displaystyle (2c) \ \ \ \ P(X=2 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{2} \ \binom{5}{2}}{\displaystyle \binom{10}{4}}=\frac{100}{210}$

$\displaystyle (2d) \ \ \ \ P(X=3 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{3} \ \binom{5}{1}}{\displaystyle \binom{10}{4}}=\frac{50}{210}$

$\displaystyle (2e) \ \ \ \ P(X=4 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{4} \ \binom{5}{0}}{\displaystyle \binom{10}{4}}=\frac{5}{210}$

Interestingly, the conditional mean $E(X \lvert X+Y=4)=2$, while the unconditional mean $E(X)=5 \times \frac{1}{4}=1.25$. The fact that the conditional mean is higher is not surprising. The student was lucky enough to have obtained 4 correct answers by guessing. Given this, she had a greater chance of doing better on the first quiz.
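The five probabilities in $(2a)$ through $(2e)$ and the conditional mean can be checked with exact fractions. The sketch below uses a hypothetical helper `hypergeom_pmf` written for this example; the defaults correspond to the two 5-question quizzes and the 4 correct answers of Example 1.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(k, N=5, M=5, a=4):
    """P(X = k | X + Y = a): k of the a successes fall among the first N trials."""
    return Fraction(comb(N, k) * comb(M, a - k), comb(N + M, a))

pmf = {k: hypergeom_pmf(k) for k in range(5)}
cond_mean = sum(k * prob for k, prob in pmf.items())

for k, prob in pmf.items():
    print(f"P(X={k} | X+Y=4) = {prob}")   # Fraction prints in lowest terms
print("conditional mean:", cond_mean)
```

`Fraction` reduces to lowest terms, so 50/210 is displayed as 5/21; the values are the same as those above.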

__________________________________________________
Why This Works

Suppose $X \sim \text{binom}(5,p)$ and $Y \sim \text{binom}(5,p)$ and they are independent. The joint distribution of $X$ and $Y$ has 36 points in the sample space. See the following diagram.

Figure 1

The probability attached to each point is

$\displaystyle \begin{aligned}(3) \ \ \ \ P(X=x,Y=y)&=P(X=x) \times P(Y=y) \\&=\binom{5}{x} p^x (1-p)^{5-x} \times \binom{5}{y} p^y (1-p)^{5-y} \end{aligned}$

where $x=0,1,2,3,4,5$ and $y=0,1,2,3,4,5$.

The conditional probability $P(X=k \lvert X+Y=4)$ involves 5 points as indicated in the following diagram.

Figure 2

The conditional probability $P(X=k \lvert X+Y=4)$ is simply the probability of one of the 5 sample points as a fraction of the sum total of the 5 sample points encircled in the above diagram. The following is the sum total of the probabilities of the 5 points indicated in Figure 2.

$\displaystyle \begin{aligned}(4) \ \ \ \ P(X+Y=4)&=P(X=0) \times P(Y=4)+P(X=1) \times P(Y=3)\\&\ \ \ \ +P(X=2) \times P(Y=2)+P(X=3) \times P(Y=1)\\&\ \ \ \ +P(X=4) \times P(Y=0) \end{aligned}$

We can plug $(3)$ into $(4)$ and work out the calculation. But $(4)$ is actually equivalent to the following because $X+Y \sim \text{binom}(10,p)$.

$\displaystyle (5) \ \ \ \ P(X+Y=4)=\ \binom{10}{4} p^4 \ (1-p)^{6}$

As stated earlier, the conditional probability $P(X=k \lvert X+Y=4)$ is simply the probability of one of the 5 sample points as a fraction of the sum total of the 5 sample points encircled in Figure 2. Thus we have:

$\displaystyle \begin{aligned}(6) \ \ \ \ P(X=k \lvert X+Y=4)&=\frac{P(X=k) \times P(Y=4-k)}{P(X+Y=4)} \\&=\frac{\displaystyle \binom{5}{k} p^k (1-p)^{5-k} \times \binom{5}{4-k} p^{4-k} (1-p)^{5-(4-k)}}{\displaystyle \binom{10}{4} p^4 \ (1-p)^{6}} \end{aligned}$

With the terms involving $p$ and $1-p$ canceling out, we have:

$\displaystyle (7) \ \ \ \ P(X=k \lvert X+Y=4)=\frac{\displaystyle \binom{5}{k} \times \binom{5}{4-k}}{\displaystyle \binom{10}{4}}$
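The cancellation of $p$ can also be observed numerically. The sketch below evaluates the binomial ratio in $(6)$ directly for several values of $p$ and compares it with the hypergeometric expression in $(7)$. The function names are illustrative only, and floating point arithmetic is used, so agreement is up to rounding error.

```python
from math import comb, isclose

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def conditional_from_binomials(k, a, p, N=5, M=5):
    """P(X=k | X+Y=a) computed from the binomial probabilities, as in (6)."""
    joint = binom_pmf(k, N, p) * binom_pmf(a - k, M, p)
    total = binom_pmf(a, N + M, p)
    return joint / total

def hypergeometric(k, a, N=5, M=5):
    """The right-hand side of (7)."""
    return comb(N, k) * comb(M, a - k) / comb(N + M, a)

for p in (0.1, 0.25, 0.5, 0.9):
    row = [conditional_from_binomials(k, 4, p) for k in range(5)]
    agrees = all(isclose(v, hypergeometric(k, 4)) for k, v in enumerate(row))
    print(p, [round(v, 4) for v in row], "matches (7):", agrees)
```

Every row is the same regardless of the value of $p$ that is plugged in.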

__________________________________________________
Summary

Suppose $X \sim \text{binom}(N,p)$ and $Y \sim \text{binom}(M,p)$ and they are independent. Then $X+Y$ is also a binomial distribution, i.e., $\sim \text{binom}(N+M,p)$. Suppose that both binomial experiments $\text{binom}(N,p)$ and $\text{binom}(M,p)$ have been performed and it is known that there are $a$ successes in total. Then $X \lvert X+Y=a$ has a hypergeometric distribution.

$\displaystyle (8) \ \ \ \ P(X=k \lvert X+Y=a)=\frac{\displaystyle \binom{N}{k} \times \binom{M}{a-k}}{\displaystyle \binom{N+M}{a}}$

where $k=0,1,2,3,\cdots,\text{min}(N,a)$.

As discussed earlier, think of the $N$ trials in $\text{binom}(N,p)$ as red balls and think of the $M$ trials in $\text{binom}(M,p)$ as blue balls in a jar. Think of the $a$ successes as the number of balls you are about to draw from the jar. So you reach into the jar and select $a$ balls without replacement. The calculation in $(8)$ gives the probability that you select $k$ red balls and $a-k$ blue balls.

The probability of success $p$ in the two binomial distributions has no bearing on the result since it gets canceled out in the derivation. One can always work a problem like Example 1 from first principles. Once the thought process using the hypergeometric distribution is understood, it is a great way to solve this kind of problem: you can bypass the binomial distributions and go straight to the hypergeometric distribution.

__________________________________________________
Practice problems are found in the following blog post.

How to pick binomial trials

# The capture-recapture method

The capture-recapture method is one of the methods for estimating the size of wildlife populations and is based on the hypergeometric distribution. Recall that the hypergeometric distribution is a three-parameter family of discrete distributions and one of the parameters, denoted by $N$ in this post, is the size of the population. We show that the estimate for the parameter $N$ that is obtained from the capture-recapture method is the value of the parameter $N$ that makes the observed data “more likely” than any other possible values of $N$. Thus, the capture-recapture method produces the maximum likelihood estimate of the population size parameter $N$ of the hypergeometric distribution.

Let’s start with an example. In order to estimate the size of the population of bluegills (a species of freshwater fish) in a small lake in Missouri, a total of $w=250$ bluegills were captured, tagged and then released. After allowing sufficient time for the tagged fish to disperse, a sample of $n=150$ bluegills was caught. It was found that $y=16$ bluegills in the sample were tagged. Estimate the size of the bluegill population in this lake.

Let $N$ be the size of the bluegill population in this lake. The population proportion of the tagged bluegills is $\frac{w}{N}$. The sample proportion of the tagged bluegills is $\frac{y}{n}$. In the capture-recapture method, the population proportion and the sample proportion are set equal. Then we solve for $N$ and round down to a whole number.

$\displaystyle \frac{w}{N}=\frac{y}{n} \Rightarrow N=\frac{w n}{y}=\frac{250(150)}{16}=2343.75 \approx 2343$
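The estimate is a one-line computation. The following sketch wraps it in a hypothetical function, `capture_recapture_estimate`, that takes the integer part of $\frac{w n}{y}$ and guards against a sample with no tagged animals, in which case the ratio is undefined.

```python
def capture_recapture_estimate(w, n, y):
    """Estimate the population size from w tagged animals, a second sample
    of size n, and y tagged animals found in that sample."""
    if y <= 0:
        raise ValueError("need at least one tagged animal in the sample")
    return (w * n) // y   # integer part of w*n/y

print(capture_recapture_estimate(250, 150, 16))
```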

Now, the connection to the hypergeometric distribution. After $w=250$ bluegills were captured, tagged and released, the population is separated into two distinct classes, tagged and non-tagged. When a sample of $n=150$ bluegills was selected without replacement, we let $Y$ be the number of bluegills in the sample that were tagged. The distribution of $Y$ is the hypergeometric distribution. The following is the probability function of $Y$.

$\displaystyle P[Y=y]=\frac{\binom{w}{y} \thinspace \binom{N-w}{n-y}}{\binom{N}{n}}$

In the hypergeometric distribution described here, the parameters $w$ and $n$ are known ($w=250$ and $n=150$). We now show that the estimate of $N=2343$ is the estimate that makes the observed value of $y=16$ “most likely” (i.e. the estimate of $N=2343$ is a maximum likelihood estimate of $N$). To show this, we consider the ratio of the hypergeometric probabilities for two successive values of $N$.

$\displaystyle \frac{P(N)}{P(N-1)}=\frac{(N-w)(N-n)}{N(N-w-n+y)}$

where $\displaystyle P(N)=\frac{\binom{w}{y} \thinspace \binom{N-w}{n-y}}{\binom{N}{n}}$ and $\displaystyle P(N-1)=\frac{\binom{w}{y} \thinspace \binom{N-1-w}{n-y}}{\binom{N-1}{n}}$
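The ratio can be checked by simplifying two quotients of binomial coefficients separately (the common factor $\binom{w}{y}$ cancels):

$\displaystyle \frac{\binom{N-w}{n-y}}{\binom{N-1-w}{n-y}}=\frac{N-w}{N-w-n+y} \ \ \ \text{and} \ \ \ \frac{\binom{N-1}{n}}{\binom{N}{n}}=\frac{N-n}{N}$

Multiplying these two quotients gives the expression for $\frac{P(N)}{P(N-1)}$ stated above.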

Note that $1<\frac{P(N)}{P(N-1)}$ or $P(N-1)<P(N)$ if and only if the following holds:

$\displaystyle N(N-w-n+y)<(N-w)(N-n)$

$\displaystyle N<\frac{w n}{y}$

Note that $\frac{w n}{y}$ is the estimate from the capture-recapture method. It is also an upper bound for the population size $N$ such that the probability $P(N)$ is greater than $P(N-1)$. Thus $P(N)$ increases as long as $N<\frac{w n}{y}$ and decreases afterwards, which implies that the maximum likelihood estimate of $N$ is the largest integer not exceeding $\frac{w n}{y}$, i.e. $\hat{N}=\frac{w n}{y}$ rounded down.

As an illustration, we compute the probabilities $\displaystyle P(N)=\frac{\binom{250}{16} \thinspace \binom{N-250}{150-16}}{\binom{N}{150}}$ for several values of $N$ above and below $N=2343$. The following matrix illustrates that the maximum likelihood is achieved at $N=2343$.

$\displaystyle \begin{pmatrix} N&P(N) \\{2340}&0.1084918 \\{2341}&0.1084929 \\{2342}&0.1084935 \\{2343}&0.1084938 \\{2344}&0.1084937 \\{2345}&0.1084933 \\{2346}&0.1084924\end{pmatrix}$
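A table like this can be generated with a short script. The sketch below uses exact rational arithmetic so that the comparison between neighboring values of $N$ is not affected by rounding. The function name `likelihood` and the search range 2300 to 2400 are choices made for this illustration only.

```python
from fractions import Fraction
from math import comb

def likelihood(N, w=250, n=150, y=16):
    """Hypergeometric probability of y tagged fish in a sample of n,
    when the population has N fish of which w are tagged."""
    return Fraction(comb(w, y) * comb(N - w, n - y), comb(N, n))

# search a window of candidate population sizes around w*n/y
candidates = range(2300, 2401)
best = max(candidates, key=likelihood)
print("maximum likelihood estimate in range:", best)

for N in range(2340, 2347):
    print(N, float(likelihood(N)))
```

The maximizer found by the search agrees with the estimate obtained by rounding $\frac{w n}{y}$ down.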

# The hypergeometric distribution

Consider a large bowl with $n+m$ balls, $n$ of which are green and $m$ of which are yellow. We draw $z$ balls out of the bowl without replacement. Let $X$ be the number of green balls in the $z$ many draws (i.e. getting a green ball is a success). The distribution of $X$ is called the hypergeometric distribution. This distribution has three parameters $n,m,z$ and its probability function is:

$\displaystyle P[X=x]=\frac{\displaystyle \binom{n}{x} \thinspace \binom{m}{z-x}}{\displaystyle \binom{n+m}{z}}$ where $x=0,1,2, \cdots, min(n,z)$

In the denominator, we have the number of different ways of drawing $z$ balls out of $n+m$ balls. In the numerator, we have the number of ways of drawing $x$ balls out of $n$ green balls multiplied by the number of ways of drawing $z-x$ balls out of $m$ yellow balls. We presented an example of the hypergeometric distribution in this post (The hypergeometric distribution, an example). In this post, we continue the discussion and we also derive the mean and variance.

When both $n$ (the number of green balls) and $m$ (the number of yellow balls) approach infinity while at the same time the proportion $p$ of the green balls remains constant, then the hypergeometric distribution approaches the binomial distribution with parameters $z$ and $p$. This result makes intuitive sense. As the number of balls in the bowl becomes larger and larger and is much greater than the sample size ($z$), the difference between sampling without replacement and sampling with replacement is negligible. Thus when population size is much larger than the sample size, for all practical purposes, the hypergeometric probabilities can be estimated with the binomial probabilities.
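To see the convergence numerically, the sketch below compares the hypergeometric and binomial probability functions for bowls of increasing size while the proportion of green balls stays at $p=0.5$ and the number of draws stays at $z=4$. The function names and the particular bowl sizes are assumptions made for this illustration.

```python
from math import comb

def hypergeom_pmf(x, n, m, z):
    """P[X=x] when drawing z balls without replacement from n green and m yellow."""
    return comb(n, x) * comb(m, z - x) / comb(n + m, z)

def binom_pmf(x, z, p):
    """P[X=x] when drawing z balls with replacement, green with probability p."""
    return comb(z, x) * p**x * (1 - p)**(z - x)

def max_gap(total, z=4, p=0.5):
    """Largest pointwise difference between the two probability functions
    for a bowl holding `total` balls."""
    n = int(total * p)   # green balls
    m = total - n        # yellow balls
    return max(abs(hypergeom_pmf(x, n, m, z) - binom_pmf(x, z, p))
               for x in range(z + 1))

for total in (10, 100, 1000, 10000):
    print(total, max_gap(total))
```

The gap shrinks as the bowl grows, in line with the intuition that sampling with and without replacement become indistinguishable for a large population.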

If we draw green balls with replacement, we work with the binomial distribution with parameters $z$ and $p=\frac{n}{n+m}$. Then the mean number of green balls drawn is $zp=\frac{z n}{n+m}$. Interestingly, the mean of the hypergeometric distribution under discussion is also $E[X]=\frac{z n}{n+m}$. Sampling without replacement produces the same mean as sampling with replacement. However, the variance is different between sampling with and without replacement. The following is the derivation for the mean of the hypergeometric distribution.

$\displaystyle E[X]=\binom{n+m}{z}^{-1} \sum \limits_{j=0}^{k} j \binom{n}{j} \binom{m}{z-j}$ where $k=min(n,z)$

$\displaystyle E[X]=\binom{n+m}{z}^{-1} \sum \limits_{j=1}^{k} n \binom{n-1}{j-1} \binom{m}{z-j}$

$\displaystyle E[X]=n \binom{n+m}{z}^{-1} \sum \limits_{j=0}^{k-1} \binom{n-1}{j} \binom{m}{z-1-j}$

$\displaystyle E[X]=n \binom{n+m}{z}^{-1} \binom{n+m-1}{z-1}$

$\displaystyle E[X]=\frac{z n}{n+m}$

In the second to the last step above, we use Vandermonde’s identity:

$\displaystyle \sum \limits_{j=0}^{min(n,z)} \binom{n}{j} \binom{m}{z-j}=\binom{n+m}{z}$

We now derive the variance. We first derive $E[X(X-1)]$.

$\displaystyle E[X(X-1)]=\binom{n+m}{z}^{-1} \sum \limits_{j=0}^{k} j(j-1) \binom{n}{j} \binom{m}{z-j}$ where $k=min(n,z)$

$\displaystyle E[X(X-1)]=\binom{n+m}{z}^{-1} \sum \limits_{j=2}^{k} n(n-1) \binom{n-2}{j-2} \binom{m}{z-j}$

$\displaystyle E[X(X-1)]=\frac{n(n-1)}{\displaystyle \binom{n+m}{z}} \sum \limits_{j=0}^{k-2} \binom{n-2}{j} \binom{m}{z-2-j}$

$\displaystyle E[X(X-1)]=\frac{n(n-1)}{\displaystyle \binom{n+m}{z}} \binom{n+m-2}{z-2}$

$\displaystyle E[X(X-1)]=\frac{n(n-1) z(z-1)}{(n+m) (n+m-1)}$

Again, the second to the last step uses the same convolution identity for binomial coefficients (Vandermonde’s identity). To find the variance, we use the following:

$\displaystyle Var[X]=E[X(X-1)]+E[X]-E[X]^2$

$\displaystyle Var[X]=\frac{\displaystyle z \thinspace n \thinspace m (m+n-z)}{\displaystyle (n+m)^2 (n+m-1)}$

$\displaystyle Var[X]=z p (1-p) \frac{n+m-z}{n+m-1}$ where $\displaystyle p=\frac{n}{n+m}$
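The mean and variance formulas can be sanity-checked against a direct computation from the probability function. The sketch below does this with exact fractions for a bowl of $n=5$ green balls and $m=5$ yellow balls with $z=4$ draws; the helper name `hypergeom_moments` is made up for this example.

```python
from fractions import Fraction
from math import comb

def hypergeom_moments(n, m, z):
    """Mean and variance computed directly from the probability function."""
    total = comb(n + m, z)
    pmf = {x: Fraction(comb(n, x) * comb(m, z - x), total)
           for x in range(min(n, z) + 1)}
    mean = sum(x * q for x, q in pmf.items())
    var = sum(x * x * q for x, q in pmf.items()) - mean**2
    return mean, var

n, m, z = 5, 5, 4
mean, var = hypergeom_moments(n, m, z)
formula_mean = Fraction(z * n, n + m)
formula_var = Fraction(z * n * m * (n + m - z), (n + m) ** 2 * (n + m - 1))
print("mean:", mean, "formula:", formula_mean)
print("variance:", var, "formula:", formula_var)
```

For comparison, the binomial variance $zp(1-p)$ with the same $z$ and $p=\frac{1}{2}$ is 1, so the factor $\frac{n+m-z}{n+m-1}=\frac{6}{9}$ accounts for the reduction due to sampling without replacement.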

# The hypergeometric distribution, an example

We present an example of the hypergeometric distribution seen through an independent sum of two binomial distributions. Suppose a student takes two independent multiple choice quizzes (i.e. performance on one quiz has no bearing on the other quiz). Quiz 1 has 5 problems where each problem has 4 choices. Quiz 2 has 5 problems with 4 choices for each problem. Suppose the student answers each question in each of the two quizzes by pure guessing. If the student has a total of four correct answers for the two quizzes combined, what is the probability that he passes quiz 1 (at least 60% correct)?

Suppose that $X$ is the number of correct answers in quiz 1 and $Y$ is the number of correct answers in quiz 2. Then both $X$ and $Y$ have binomial distribution with $n=5$ and $p=0.25$. Then $Z=X+Y$ is the total number of correct answers and $Z$ has a binomial distribution with $n=10$ and $p=0.25$. The problem we need to solve is $P[X \ge 3 \lvert Z=4]$.

We propose that the conditional distribution of $X \lvert Z=z$ is a hypergeometric distribution. To see this intuitively, think of a bowl with five green balls (the five questions in quiz 1) and five yellow balls (the five questions in quiz 2). Taking these two quizzes and getting a total of four correct answers would be like drawing 4 balls out of this bowl without replacement. Then what is the probability that three of the four balls are green? This is a probability obtained by the hypergeometric distribution (drawing 4 balls out of the bowl and resulting in 3 green balls and 1 yellow ball). Though not a proof, this is a good intuitive description of the approach we can take. We first do the calculation and present the proof at the end.

We now evaluate the probability function $P[X=j \lvert Z=4]$. For example, $P[X=1 \lvert Z=4]$ is the probability of drawing 4 balls out of the bowl and resulting in 1 green ball and 3 yellow balls.

$\displaystyle P[X=0 \lvert Z=4]=\displaystyle \frac{\displaystyle \binom{5}{0} \binom{5}{4}}{\displaystyle \binom{10}{4}}=\frac{1}{42}$

$\displaystyle P[X=1 \lvert Z=4]=\frac{\displaystyle \binom{5}{1} \binom{5}{3}}{\displaystyle \binom{10}{4}}=\frac{10}{42}$

$\displaystyle P[X=2 \lvert Z=4]=\frac{\displaystyle \binom{5}{2} \binom{5}{2}}{\displaystyle \binom{10}{4}}=\frac{20}{42}$

$\displaystyle P[X=3 \lvert Z=4]=\frac{\displaystyle \binom{5}{3} \binom{5}{1}}{\displaystyle \binom{10}{4}}=\frac{10}{42}$

$\displaystyle P[X=4 \lvert Z=4]=\frac{\displaystyle \binom{5}{4} \binom{5}{0}}{\displaystyle \binom{10}{4}}=\frac{1}{42}$

Thus, $\displaystyle P[X \ge 3 \lvert Z=4]=\frac{11}{42} \approx 0.26$. Note that the unconditional probability is $P[X \ge 3]=0.1035$, using the binomial distribution with $n=5$ and $p=0.25$. It is not surprising that the conditional probability is much greater. The conditional probability $P[X \ge 3 \lvert Z=4]$ is greater because the student is lucky enough to have four correct guesses.
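Both probabilities in this comparison can be reproduced with exact arithmetic. The following sketch is only a check of the numbers above; the variable names are chosen for this example.

```python
from fractions import Fraction
from math import comb

# conditional probability: 3 or 4 green balls among the 4 drawn
conditional = sum(Fraction(comb(5, k) * comb(5, 4 - k), comb(10, 4))
                  for k in (3, 4))

# unconditional probability from binomial(5, 1/4): 3, 4 or 5 correct guesses
p = Fraction(1, 4)
unconditional = sum(comb(5, k) * p**k * (1 - p)**(5 - k) for k in (3, 4, 5))

print(conditional, float(conditional))
print(unconditional, float(unconditional))
```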

We now discuss the general fact. Suppose $X \sim binomial(n,p)$ and $Y \sim binomial(m,p)$. With $Z=X+Y$ an independent sum, we show that $X \lvert Z$ has a hypergeometric distribution.

$\displaystyle P[X=x \lvert Z=z]=\frac{\displaystyle \binom{n}{x} p^x (1-p)^{n-x} \thinspace \binom{m}{z-x} p^{z-x} (1-p)^{m-(z-x)}}{\displaystyle \binom{n+m}{z} p^z (1-p)^{n+m-z}}$

After canceling out the terms for $p$ and $1-p$, the following is the probability function for the hypergeometric distribution:

$\displaystyle P[X=x \lvert Z=z]=\frac{\displaystyle \binom{n}{x} \binom{m}{z-x}}{\displaystyle \binom{n+m}{z}}$

The above probability distribution describes the situation where there are $n+m$ similar objects, with $n$ objects belonging to one class (say green balls) and $m$ objects belonging to another class (say yellow balls). We choose $z$ balls out of $n+m$ balls without replacement. The above probability is the probability of having a result of $x$ green balls and $z-x$ yellow balls. There are $\binom{n}{x}$ many ways of choosing $x$ green balls out of $n$ green balls. Likewise, there are $\binom{m}{z-x}$ many ways of choosing $z-x$ yellow balls out of $m$ yellow balls. The total number of ways the joint operation can take place is $\binom{n}{x} \binom{m}{z-x}$. Of course, we assume that each of the $\binom{n+m}{z}$ ways of selecting $z$ balls out of $n+m$ balls is equally likely.

In the conditional probability in our example, the probability of success (0.25) in the individual Bernoulli trials that make up the two binomial distributions is not used. This is because the terms for $p$ and $1-p$ are canceled out. If the two quizzes have different probabilities of success, then the resulting conditional distribution $P[X=x \lvert X+Y=z]$ is no longer hypergeometric. In that case, the conditional probability must be obtained from first principles.
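The effect of unequal success probabilities can be illustrated with a first-principles computation. In the sketch below, the function `conditional_pmf` is a hypothetical helper that divides each joint probability by their sum, and the second success probability of 0.5 is an arbitrary value chosen only for illustration.

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def conditional_pmf(z, n, m, p1, p2):
    """P[X=x | X+Y=z] for x = 0..z, from first principles, where
    X ~ binomial(n, p1) and Y ~ binomial(m, p2) are independent."""
    joint = [binom_pmf(x, n, p1) * binom_pmf(z - x, m, p2) for x in range(z + 1)]
    total = sum(joint)
    return [j / total for j in joint]

hypergeometric = [comb(5, x) * comb(5, 4 - x) / comb(10, 4) for x in range(5)]

print("equal p:       ", [round(v, 4) for v in conditional_pmf(4, 5, 5, 0.25, 0.25)])
print("unequal p:     ", [round(v, 4) for v in conditional_pmf(4, 5, 5, 0.25, 0.50)])
print("hypergeometric:", [round(v, 4) for v in hypergeometric])
```

With equal probabilities the first row reproduces the hypergeometric values; with unequal probabilities the conditional distribution shifts toward the quiz with the higher chance of success.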