# The hypergeometric distribution

Consider a large bowl with $n+m$ balls, $n$ of which are green and $m$ of which are yellow. We draw $z$ balls out of the bowl without replacement. Let $X$ be the number of green balls among the $z$ draws (i.e. getting a green ball is a success). The distribution of $X$ is called the hypergeometric distribution. This distribution has three parameters $n,m,z$ and its probability function is:

$\displaystyle P[X=x]=\frac{\displaystyle \binom{n}{x} \thinspace \binom{m}{z-x}}{\displaystyle \binom{n+m}{z}}$ where $x=0,1,2, \cdots, \min(n,z)$, with the convention that $\binom{m}{z-x}=0$ whenever $z-x>m$

In the denominator, we have the number of different ways of drawing $z$ balls out of $n+m$ balls. In the numerator, we have the number of ways of drawing $x$ balls out of $n$ green balls multiplied by the number of ways of drawing $z-x$ balls out of $m$ yellow balls. We presented an example of the hypergeometric distribution in an earlier post (The hypergeometric distribution, an example). In this post, we continue the discussion and also derive the mean and variance.
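The counting argument above translates directly into code. The following is a minimal sketch using only the Python standard library (`math.comb` needs Python 3.8 or later). The function name `hypergeom_pmf` and the values $n=5$, $m=7$, $z=4$ are illustrative assumptions, not taken from the original post.

```python
from math import comb

def hypergeom_pmf(x, n, m, z):
    """P[X = x] when z balls are drawn without replacement
    from a bowl of n green and m yellow balls."""
    # Impossible outcomes get probability zero.
    if x < 0 or x > min(n, z) or z - x > m:
        return 0.0
    # ways to pick x green * ways to pick z - x yellow / ways to pick any z balls
    return comb(n, x) * comb(m, z - x) / comb(n + m, z)

# Illustrative values (an assumption): 5 green, 7 yellow, 4 draws.
n, m, z = 5, 7, 4
pmf = [hypergeom_pmf(x, n, m, z) for x in range(min(n, z) + 1)]
total = sum(pmf)  # the probabilities should add up to 1, up to rounding

for x, prob in enumerate(pmf):
    print(f"P[X={x}] = {prob:.4f}")
```

Summing the probabilities over $x=0,1,\cdots,\min(n,z)$ is a quick sanity check that the function is a valid probability function.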

When both $n$ (the number of green balls) and $m$ (the number of yellow balls) approach infinity while the proportion $p=\frac{n}{n+m}$ of green balls remains constant, the hypergeometric distribution approaches the binomial distribution with parameters $z$ and $p$. This result makes intuitive sense. As the number of balls in the bowl grows much larger than the sample size $z$, the difference between sampling without replacement and sampling with replacement becomes negligible. Thus when the population size is much larger than the sample size, the hypergeometric probabilities can, for all practical purposes, be approximated by the binomial probabilities.
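This limiting behavior can be observed numerically. The sketch below keeps the proportion of green balls fixed at $p=0.4$ and the sample size at $z=5$ while scaling up the bowl, then records the largest gap between the hypergeometric and binomial probabilities. The helper names and the chosen bowl sizes are assumptions made for illustration only.

```python
from math import comb

def hypergeom_pmf(x, n, m, z):
    """Hypergeometric probability of x green balls in z draws."""
    if x < 0 or x > min(n, z) or z - x > m:
        return 0.0
    return comb(n, x) * comb(m, z - x) / comb(n + m, z)

def binom_pmf(x, z, p):
    """Binomial probability of x successes in z independent trials."""
    return comb(z, x) * p**x * (1 - p) ** (z - x)

def max_gap(n, m, z):
    """Largest absolute difference between the two pmfs over x = 0..z."""
    p = n / (n + m)
    return max(
        abs(hypergeom_pmf(x, n, m, z) - binom_pmf(x, z, p))
        for x in range(z + 1)
    )

z = 5
# Bowls of 50, 500 and 5000 balls, always 40% green (illustrative sizes).
gaps = [max_gap(4 * s, 6 * s, z) for s in (5, 50, 500)]
for size, gap in zip((50, 500, 5000), gaps):
    print(f"n+m = {size:5d}: max |hypergeometric - binomial| = {gap:.6f}")
```

The gap should shrink as the bowl grows, in line with the intuition that removing a handful of balls barely changes the composition of a large bowl.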

If we draw balls with replacement, we work with the binomial distribution with parameters $z$ and $p=\frac{n}{n+m}$. Then the mean number of green balls drawn is $zp=\frac{z n}{n+m}$. Interestingly, the mean of the hypergeometric distribution under discussion is also $E[X]=\frac{z n}{n+m}$. Sampling without replacement produces the same mean as sampling with replacement. However, the variance differs between sampling with and without replacement. The following is the derivation of the mean of the hypergeometric distribution.

$\displaystyle E[X]=\binom{n+m}{z}^{-1} \sum \limits_{j=0}^{k} j \binom{n}{j} \binom{m}{z-j}$ where $k=\min(n,z)$

$\displaystyle E[X]=\binom{n+m}{z}^{-1} \sum \limits_{j=1}^{k} n \binom{n-1}{j-1} \binom{m}{z-j}$

$\displaystyle E[X]=n \binom{n+m}{z}^{-1} \sum \limits_{j=0}^{k-1} \binom{n-1}{j} \binom{m}{z-1-j}$

$\displaystyle E[X]=n \binom{n+m}{z}^{-1} \binom{n+m-1}{z-1}$

$\displaystyle E[X]=\frac{z n}{n+m}$

In the second-to-last step above, we use Vandermonde's identity:

$\displaystyle \sum \limits_{j=0}^{\min(n,z)} \binom{n}{j} \binom{m}{z-j}=\binom{n+m}{z}$
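Both the summation identity above and the resulting mean can be checked by brute force. The following is a small sketch using exact integer arithmetic for the identity; the function names and the test values are hypothetical and chosen only for illustration.

```python
from math import comb

def convolution_sum(n, m, z):
    """Left-hand side of the identity: sum of C(n, j) * C(m, z - j)."""
    # math.comb(a, b) returns 0 when b > a, so terms with z - j > m vanish.
    return sum(comb(n, j) * comb(m, z - j) for j in range(min(n, z) + 1))

def hypergeom_mean(n, m, z):
    """E[X] computed directly from the definition of expectation."""
    weighted = sum(j * comb(n, j) * comb(m, z - j) for j in range(min(n, z) + 1))
    return weighted / comb(n + m, z)

# Illustrative values (an assumption): 5 green, 7 yellow, 4 draws.
n, m, z = 5, 7, 4
identity_holds = convolution_sum(n, m, z) == comb(n + m, z)
mean_direct = hypergeom_mean(n, m, z)
mean_formula = z * n / (n + m)
print(identity_holds, mean_direct, mean_formula)
```

A numerical check of this kind does not replace the derivation, but it is a cheap way to catch an index error in the sums.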

We now derive the variance. We first derive $E[X(X-1)]$.

$\displaystyle E[X(X-1)]=\binom{n+m}{z}^{-1} \sum \limits_{j=0}^{k} j(j-1) \binom{n}{j} \binom{m}{z-j}$ where $k=\min(n,z)$

$\displaystyle E[X(X-1)]=\binom{n+m}{z}^{-1} \sum \limits_{j=2}^{k} n(n-1) \binom{n-2}{j-2} \binom{m}{z-j}$

$\displaystyle E[X(X-1)]=\frac{n(n-1)}{\displaystyle \binom{n+m}{z}} \sum \limits_{j=0}^{k-2} \binom{n-2}{j} \binom{m}{z-2-j}$

$\displaystyle E[X(X-1)]=\frac{n(n-1)}{\displaystyle \binom{n+m}{z}} \binom{n+m-2}{z-2}$

$\displaystyle E[X(X-1)]=\frac{n(n-1) z(z-1)}{(n+m) (n+m-1)}$

Again, the second-to-last step uses the same summation identity as before (Vandermonde's identity). To find the variance, we use the following:

$\displaystyle Var[X]=E[X(X-1)]+E[X]-E[X]^2$
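The algebra that leads from this identity to the closed form can be filled in as follows. Here $N=n+m$ is a shorthand introduced only for this intermediate step; it is not notation used elsewhere in the post.

$\displaystyle Var[X]=\frac{n(n-1) z(z-1)}{N(N-1)}+\frac{z n}{N}-\frac{z^2 n^2}{N^2}$

$\displaystyle Var[X]=\frac{z n}{N^2 (N-1)} \biggl[ N(n-1)(z-1)+N(N-1)-z n (N-1) \biggr]$

$\displaystyle Var[X]=\frac{z n}{N^2 (N-1)} \biggl[ N^2-N n-N z+z n \biggr]=\frac{z n (N-n)(N-z)}{N^2 (N-1)}$

Since $N-n=m$ and $N-z=n+m-z$, substituting back gives the variance in terms of $n$, $m$ and $z$.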

$\displaystyle Var[X]=\frac{\displaystyle z \thinspace n \thinspace m (m+n-z)}{\displaystyle (n+m)^2 (n+m-1)}$

$\displaystyle Var[X]=z p (1-p) \frac{n+m-z}{n+m-1}$ where $\displaystyle p=\frac{n}{n+m}$
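The last form shows that the hypergeometric variance is the binomial variance $zp(1-p)$ multiplied by the factor $\frac{n+m-z}{n+m-1}$, which is less than 1 whenever $z>1$. The sketch below compares the variance computed directly from the probability function with this closed form. The function name and the values of $n$, $m$, $z$ are assumptions for illustration.

```python
from math import comb

def hypergeom_moments(n, m, z):
    """Mean and variance computed directly from the probability function."""
    total = comb(n + m, z)
    probs = {
        x: comb(n, x) * comb(m, z - x) / total
        for x in range(min(n, z) + 1)
    }
    mean = sum(x * p for x, p in probs.items())
    var = sum((x - mean) ** 2 * p for x, p in probs.items())
    return mean, var

# Illustrative values (an assumption): 5 green, 7 yellow, 4 draws.
n, m, z = 5, 7, 4
p = n / (n + m)
mean, var = hypergeom_moments(n, m, z)

closed_form_var = z * p * (1 - p) * (n + m - z) / (n + m - 1)
binomial_var = z * p * (1 - p)  # variance when sampling with replacement

print(f"direct variance      = {var:.6f}")
print(f"closed-form variance = {closed_form_var:.6f}")
print(f"binomial variance    = {binomial_var:.6f}")
```

For these values the variance without replacement comes out smaller than the binomial variance, which matches the intuition that each draw without replacement reveals information about the remaining balls.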

2. Hello. I have a question about an application of the hypergeometric distribution. I am wondering if one can calculate the sum $\sum_{k=0}^l\frac{{l \choose k}{2n-l \choose n-k}(2k-l)^q}{{2n\choose n}}$ using this distribution.