# The birthday problem

Consider this random experiment. You ask people (one at a time) of their birthdays (month and day only). The process continues until there is a repeat in the series of birthdays, in other words, until two people you’ve surveyed share the same birthday. How many people you have to ask before finding a repeat? What is the probability that you will have to ask $k$ people before finding a repeat? What is the median number of people you have to ask? In this post, we discuss this random variable and how this random variable relates to the birthday problem.

In the problem at hand, we ignore leap year and assume that each of the 365 days in the year is equally likely to be the birthday for a randomly chosen person. The birthday problem is typically the following question. How many people do we need to choose in order to have a 50% or better chance of having a common birthday among the selected individuals?

The random experiment stated at the beginning can be restated as follows. Suppose that balls are randomly thrown (one at a time) into $n$ cells (e.g. boxes). The random process continues until a ball is thrown into a cell that already has one ball (i.e. until a repeat occurs). Let $X_n$ be the number of balls that are required to obtain a repeat. Some of the problems we discuss include the mean (the average number of balls that are to be thrown to get a repeat) and the probability function. We will also show how this random variable is linked to the birthday problem when $n=365$.

___________________________________________________________________________

The Birthday Problem

First, we start with the birthday problem. The key is to derive the probability that in a group of $k$ randomly selected people, there are at least two who share the same birthday. It is easier to do the complement – the probability of having different birthdays among the group of $k$ people. We call this probability $p_k$.

\displaystyle \begin{aligned} p_k&=\frac{365}{365} \ \frac{364}{365} \ \frac{363}{365} \cdots \frac{365-(k-1)}{365} \\&=\frac{364}{365} \ \frac{363}{365} \cdots \frac{365-(k-1)}{365} \\&=\biggl[1-\frac{1}{365} \biggr] \ \biggl[1-\frac{2}{365} \biggr] \cdots \biggl[1-\frac{k-1}{365} \biggr] \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1) \end{aligned}

where $k=2,3,4,\cdots,365$. The reasoning for the first line is that there are 365 choices to be picked in the first selection. Each subsequent random selection has to avoid the previous birthday, thus 364 choices for the second person and only $365-(k-1)$ choices for the $k$th person.

To answer the birthday problem, just plug in values of $k$ to compute $p_k$ and $1-p_k$ until reaching the smallest $k$ such that $p_k<0.5$ and $1-p_k>0.5$. The calculation should be done using software (Excel for example). The smallest $k$ is 23 since

$p_{23}= 0.492702766$

$1-p_{23}= 0.507297234$

In a random group of 23 people, there is a less than 50% chance of having distinct birthdays, and thus a more than 50% chance of having at least one identical birthday. This may be a surprising result. Without the benefit of formula (1), some people may think that it will take a larger sample to obtain a repeat.

The benefit of (1) extends beyond the birthday problem. Let’s consider the case for $n$, i.e. randomly pick numbers from the set $\left\{1,2,3,\cdots,n-1,n \right\}$ with replacement until a number is chosen twice (until a repeat occurs). Similarly, let $p_{n,k}$ be the probability that in $k$ draws all chosen numbers are distinct. The probability $p_{n,k}$ is obtained by replacing 365 with $n$.

\displaystyle \begin{aligned} p_{n,k}&=\frac{n}{n} \ \frac{n-1}{n} \ \frac{n-2}{n} \cdots \frac{n-(k-1)}{n} \\&=\frac{n-1}{n} \ \frac{n-2}{n} \cdots \frac{n-(k-1)}{n} \\&=\biggl[1-\frac{1}{n} \biggr] \ \biggl[1-\frac{2}{n} \biggr] \cdots \biggl[1-\frac{k-1}{n} \biggr] \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2) \end{aligned}

Formula (2) will be useful in the next section.

___________________________________________________________________________

The Random Variable

We now look into the random variable discussed at the beginning, either the one for picking people at random until a repeated birthday or throwing balls into cells until one cell has two balls. To illustrate the idea, let’s look at an example.

Example 1
Roll a fair die until obtaining a repeated face value. Let $X_6$ be the number of rolls to obtain the repeated value. Find the probability function $P[X_6=k]$ where $k=2,3,4,5,6,7$.

Note that $P[X_6=k]$ is the probability that it will take $k$ rolls to get a repeated die value.

$\displaystyle P(X_6=2)=\frac{6}{6} \times \frac{1}{6}=\frac{1}{6}$

$\displaystyle P(X_6=3)=\frac{6}{6} \times \frac{5}{6} \times \frac{2}{6}=\frac{10}{6^2}$

$\displaystyle P(X_6=4)=\frac{6}{6} \times \frac{5}{6} \times \frac{4}{6} \times \frac{3}{6}=\frac{60}{6^3}$

$\displaystyle P(X_6=5)=\frac{6}{6} \times \frac{5}{6} \times \frac{4}{6} \times \frac{3}{6} \times \frac{4}{6}=\frac{240}{6^4}$

$\displaystyle P(X_6=6)=\frac{6}{6} \times \frac{5}{6} \times \frac{4}{6} \times \frac{3}{6} \times \frac{2}{6} \times \frac{5}{6}=\frac{600}{6^5}$

$\displaystyle P(X_6=7)=\frac{6}{6} \times \frac{5}{6} \times \frac{4}{6} \times \frac{3}{6} \times \frac{2}{6} \times \frac{1}{6} \times \frac{6}{6}=\frac{720}{6^6}$

To get a repeat in 2 rolls, there are 6 choices for the first roll and the second has only one choice – the value of the first roll. To get a repeat in 3 rolls, there are 6 choices for the first roll, 5 choices for the second roll and the third roll must be out of the 2 previous two distinct values. The idea is that the first $k-1$ rolls are distinct and the last roll must be one of the previous values. $\square$

The reasoning process leads nicely to the general case. In the general case, let’s consider the occupancy interpretation. In throwing balls into $n$ cells, let $X_n$ be defined as above, i.e. the number of balls that are required to obtain a repeat. The following gives the probability $P[X_n=k]$.

\displaystyle \begin{aligned} P[X_n=k]&=\frac{n}{n} \times \frac{n-1}{n} \times \cdots \times \frac{n-(k-2)}{n} \times \frac{k-1}{n} \\&=\frac{n-1}{n} \times \cdots \times \frac{n-(k-2)}{n} \times \frac{k-1}{n} \\&=\frac{(n-1) \times (n-2) \times \cdots \times (n-(k-2)) \times (k-1)}{n^{k-1}} \\&=\biggl[1-\frac{1}{n} \biggr] \times \biggl[1-\frac{2}{n} \biggr] \times \cdots \times \biggl[1-\frac{k-2}{n}\biggr] \times \frac{k-1}{n} \ \ \ \ \ \ \ (3) \end{aligned}

where $k=2,3,\cdots,n+1$.

The reasoning is similar to Example 1. To get a repeat in throwing $k$ balls, the first $k-1$ balls must be go into different cells while the last ball would go into one of the $k-1$ occupied cells. For the first $k-1$ balls to go into different cells, there are $n (n-1) \cdots (n-(k-2))$ ways. There are $k-1$ cells for the last ball to land. Thus the product of these two quantities is in the numerator of (3).

Once the probability function (3) is obtained, the mean $E[X_n]$ can be derived accordingly. For the case of $n=365$, $E[X_{365}]=24.62$ (calculated by programming the probability function in Excel). On average it will be required to sample about 25 people to obtain a repeated birthday.

Another interesting quantity is $P[X_n>k]$. This is the probability that it will take throwing more than $k$ balls to get a repeat. Mathematically this can be obtained by first calculating $P[X_n \le k]$ by summing the individual probabilities via (3). This is a workable approach using software. There is another way that is more informative. For the event $X_n>k$ to happen, the first $k$ throws must be in different cells (no repeat). The event $X_n>k$ is identical to the event that there is no repeat in the first $k$ throws of balls. This is how the random variable $X_n$ is linked to the birthday problem since the probability $P[X_n>k]$ should agree with the probability $p_{n,k}$ in (2).

$\displaystyle P[X_n>k]=\biggl[1-\frac{1}{n} \biggr] \ \biggl[1-\frac{2}{n} \biggr] \cdots \biggl[1-\frac{k-1}{n} \biggr] \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4)$

___________________________________________________________________________

Back to The Birthday Problem

Consider the case for $X_{365}$. What is the median of $X_{365}$? That would be the median number of people surveyed to obtain a pair of identical birthday. The median of $X_{365}$ would be the least $k$ such that $P[X_{365} \le k]$ is at least 0.5. Note that $P[X_{365}>k]$ is identical to $p_k$ in (1). The above calculation shows that $p_{23}=0.4927$ and $1-p_{23}=0.5073$. Thus the median of $X_{365}$ is 23. Thus when performing the random sampling of surveying birthday, about half of the times you can expect to survey 23 or fewer than 23 people.

The birthday problem is equivalently about finding the median of the random variable $X_{365}$. A little more broadly, the birthday problem is connected to the percentiles of the variable $X_{n}$. In contrast, the mean of $X_{365}$ is $E[X_{365}]=24.62$. The following lists several percentiles for the random variable $X_{365}$.

$\begin{array}{ccccccc} k & \text{ } & P[X_{365}>k] & \text{ } & P[X_{365} \le k] & \text{ } & \text{Percentile} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } & \\ 14 & \text{ } & 0.77690 & & 0.22310 & \\ 15 & \text{ } & 0.74710 & & 0.25290 & \text{ } & \text{25th Percentile} \\ 16 & \text{ } & 0.71640 & & 0.28360 & \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } & \\ 22 & \text{ } & 0.52430 & & 0.47570 & \\ 23 & \text{ } & 0.49270 & & 0.50730 & \text{ } & \text{50th Percentile} \\ 24 & \text{ } & 0.46166 & & 0.53834 & \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } & \\ 31 & \text{ } & 0.26955 & & 0.73045 & \\ 32 & \text{ } & 0.24665 & & 0.75335 & \text{ } & \text{75th Percentile} \\ 33 & \text{ } & 0.22503 & & 0.77497 & \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } & \\ 40 & \text{ } & 0.10877 & & 0.89123 & \\ 41 & \text{ } & 0.09685 & & 0.90315 & \text{ } & \text{90th Percentile} \\ 42 & \text{ } & 0.08597 & & 0.91403 & \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } & \\ 46 & \text{ } & 0.05175 & & 0.94825 & \\ 47 & \text{ } & 0.04523 & & 0.95477 & \text{ } & \text{95th Percentile} \\ 48 & \text{ } & 0.03940 & & 0.96060 & \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } & \\ 56 & \text{ } & 0.01167 & & 0.98833 & \\ 57 & \text{ } & 0.00988 & & 0.99012 & \text{ } & \text{99th Percentile} \\ 58 & \text{ } & 0.00834 & & 0.99166 & \\ \end{array}$

It is clear that in a group of 366 people, it is certain that there will be at least one repeated birthday (again ignoring leap year). This is due to the pigeon hole principle. As the percentiles in the above table shows, you do not need to survey anywhere close to 366 to get a repeat. The median is 23 as discussed. The 75th percentile of $X_{365}$ is 32.

The preceding calculation shows that you do not need a large group to have a repeated birthday. About 50% of the times, you will survey 23 or fewer people, about 75% of the time, 32 or fewer people, About 99% of the time, you will survey 57 or fewer people, much fewer than 365 or 366. So with around 50 in a random group, there is a near certainty of finding a shared birthday. In a random group of 100 people, there should be an almost absolute certainty that there is a shared birthday.

For a further demonstration, we simulated the random variable $X_{365}$ 10,000 times. The range of the simulated values is 2 to 78. Thus the odds for 100 people to survey is smaller than 1 in 10,000. To get a simulated value of 100, we will have to simulate more than 10,000 values of $X_{365}$. The median of the 10,000 simulated results is 23. The following table summarizes the results.

$\begin{array}{rrrrrr} \text{Interval} & \text{ } & \text{Count} & \text{ } & \text{Proportion} & \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } & \\ \text{2 to 9} & \text{ } &953 & & 0.09530 & \\ \text{10 to 19} & \text{ } & 2870 & & 0.28700 & \\ \text{20 to 29} & \text{ } & 3055 & & 0.30550 & \\ \text{30 to 39} & \text{ } & 1922 & & 0.19220 & \\ \text{40 to 49} & \text{ } & 886 & & 0.08860 & \\ \text{50 to 59} & \text{ } & 253 & & 0.02530 & \\ \text{60 to 69} & \text{ } & 48 & & 0.00480 & \\ \text{70 to 78} & \text{ } & 13 & & 0.00130 & \\ \end{array}$

Not shown in the table is that 33 of the simulated results are the value of 2. Thus it is possible to ask two people and they both have the same birthday. But the odds of that happening is 33 out of 10,000 according to this particular set of simulations (probability 0.0033). The theoretical probability of 2 is 1/365 = 0.002739726. There are 2 instances of 78 in the 10,000 simulated values. Thus the odds are 2 out of 10,000 with probability 0.0002. The theoretical probability is 0.000037 using (3).

___________________________________________________________________________
$\copyright \ 2017 \text{ by Dan Ma}$

# How long does it take to collect all coupons?

This post discusses the coupon collector problem, a classical problem in probability.

___________________________________________________________________________

The Coupon Collector Problem

The problem is usually stated as a coupon collector trying to collect the entire set of coupons. For example, each time the coupon collector buys a product (e.g. a box of breakfast cereal), he receives a coupon, which is a prize that can be a toy or a baseball card or other interesting item. Suppose that there are $n$ different types of coupons (prizes) and that the coupon collector wishes to collect the entire set. How many units of the product does the coupon collector have to buy in order to collect the entire set?

A simplified discussion of the coupon collector problem is found in another blog post. This post is more detailed discussion.

This blog post in another blog discusses the coupon collector problem from a simulation perspective.

As shown below, if there are 5 different coupons, it takes 12 purchases on average to get all coupons. If there are 10 different types of coupons, on average it would take 30 purchases. If there are 50 different types of coupons, it would take on average 225 purchases collect the entire set. The first few coupons are obtained fairly quickly. As more and more coupons are accumulated, it is harder to get the remaining coupons. For example, for the 50-coupon case, after the 49 coupons are obtained, it takes on average 50 purchases to get the last coupon.

Suppose that the coupon collector does not want to collect the entire set and only wishes to collect $r$ distinct coupons where $r \le n$. It turns out that this special case only requires a minor tweak to the case of collecting the entire set. Our strategy then is to focus on the main case. The special case will be discussed at the end of the post.

We first consider the main case that the coupon collector wishes to collect the entire set. The problem can be cast as a random sampling from the population $S=\left\{1,2,3,\cdots,n \right\}$. Selecting a number at random from $S$ with replacement is equivalent to the coupon collector receiving a coupon. Let $X_n$ be the minimum number of selections such that each number in $S$ is picked at least once. In this post we discuss the probability function of $X_n$ as well as its mean and variance.

Another interpretation of the problem is that it can be viewed as an occupancy problem, which involves throwing balls into $n$ cells at random. The random quantity of interest is the number of balls that are required to be thrown such that each cell has at least one ball. Clearly this formulation is identical to the coupon interpretation and the random sampling interpretation. The angle of occupancy problem is useful since we can leverage the formulas developed in this previous post. A description of the occupancy problem is given here.

Regardless of the interpretation, the goal is obtain information on the random variable $X_n$, the minimum number of random selections from $S=\left\{1,2,3,\cdots,n \right\}$ in order to have the complete set of $n$ distinct values represented in the sample.

___________________________________________________________________________

Mean and Variance

The mean and variance of $X_n$ are easier to obtain. So that is where we begin. The key is to break up $X_n$ into a sum as follows:

$X_n=C_{1}+C_{2}+C_{3}+\cdots+C_{i-1}+C_{i}+\cdots+C_{n} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (0)$

where $C_{i}$ is the additional selections from $S=\left\{1,2,3,\cdots,n \right\}$ to get a number that is distinct from the $i-1$ distinct numbers that have been chosen. For example, $C_3$ is the number of random selections to get a number that is distinct from the two distinct numbers obtained up to that point.

Note that each $C_i$ involves repeated sampling until some criterion is reached, thus resembling a geometric random variable. Indeed they are. As the sampling continues and as more distinct values are obtained, it is not as easy to obtain a new number. After $i-1$ distinct numbers have been obtained, the probability of drawing a new distinct number is $p=\frac{n-(i-1)}{n}$. As geometric random variables, each $C_i$ has the following mean and variance.

$\displaystyle E[C_i]=\frac{1}{p}=\frac{n}{n-(i-1)}$

$\displaystyle Var[C_i]=\frac{1-p}{p^2}=\frac{n(i-1)}{[n-(i-1)]^2}$

where $i=1,2,3,\cdots,n$. Note that the random variables $C_i$ are independent. The value of $C_i$ does not depend on how many trials it takes to draw the previous $i-1$ distinct numbers. The following gives the mean and variance of $X_n$.

$\displaystyle E[X_n]=\sum \limits_{i=1}^n E[C_i]=\sum \limits_{i=1}^n \frac{n}{n-(i-1)} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)$

$\displaystyle Var[X_n]=\sum \limits_{i=1}^n Var[C_i]=\sum \limits_{i=1}^n \frac{n(i-1)}{[n-(i-1)]^2} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)$

The expectation $E[X_n]$ can be rearranged as follows to give more information.

$\displaystyle E[X_n]=n \ \biggl[\frac{1}{n}+ \cdots + \frac{1}{3}+ \frac{1}{2} + 1\biggr]=n \ H_n \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3)$

The quantity $H_n$ is the partial sum of the harmonic series. Note that $H_n \rightarrow \infty$ as $n \rightarrow \infty$. Thus $E[X_n] \rightarrow \infty$ as $n \rightarrow \infty$. The quantity $H_n$ can be interpreted as the average number of units of product that are required to purchase per coupon. The following table lists out the expected values for selected values of $n$.

Table 1

$\begin{array}{cccccc} \text{Number of} & \text{ } & \text{Expected Number of Trials} & \text{ } & \text{Expected Total Number} & \\ \text{Coupons} & \text{ } & \text{per Coupon} & \text{ } & \text{of Trials (rounded up)} & \\ n & \text{ } & E[H_n] & \text{ } & E[X_n] & \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } & \\ 1 & \text{ } & 1.0000 & & 1 & \\ 2 & \text{ } & 1.5000 & & 3 & \\ 3 & \text{ } & 1.8333 & & 6 & \\ 4 & \text{ } & 2.0833 & & 9 & \\ 5 & \text{ } & 2.2833 & & 12 & \\ 6 & \text{ } & 2.4500 & & 15 & \\ 7 & \text{ } & 2.5929 & & 19 & \\ 8 & \text{ } & 2.7179 & & 22 & \\ 9 & \text{ } & 2.8290 & & 26 & \\ 10 & \text{ } & 2.9290 & & 30 & \\ 20 & \text{ } & 3.5977 & & 72 & \\ 30 & \text{ } & 3.9950 & & 120 & \\ 40 & \text{ } & 4.2785 & & 172 & \\ 50 & \text{ } & 4.4992 & & 225 & \\ 60 & \text{ } & 4.6799 & & 281 & \\ 70 & \text{ } & 4.8328 & & 339 & \\ 80 & \text{ } & 4.9655 & & 398 & \\ 90 & \text{ } & 5.0826 & & 458 & \\ 100 & \text{ } & 5.1874 & & 519 & \\ \end{array}$

Table 1 gives an estimate on how long to expect to collect the entire set of coupons for selected coupon sizes. The third column gives the expected total number of purchases to obtain the entire coupon set. The second column gives an estimate of how many purchases on average to obtain one coupon. For the 50-coupon case, it takes on average about 4.5 purchases to obtain one coupon. However, it does not tell the whole story. To get the 50th coupon, it takes on average 50 trials. Note that $E[C_{50}]=50$ in the 50-coupon case in formula (1). In a simulation of the 50-coupon problem, it took 54 trials to obtain the 50th coupon. To get the 49th coupon, it takes on average 50/2 = 25 trials.

___________________________________________________________________________

The Occupancy Problem

We now view the coupon collector problem as an occupancy problem in order to leverage a formula from a previous post. Suppose that we randomly throw $k$ balls into $n$ cells. Let $Y_{k,n}$ be the number of empty cells as a result of randomly assigning $k$ balls into $n$ cells. The following gives the probabilities $P(Y_{k,n}=j)$ where $j=0,1,2,\cdots,n-1$.

$\displaystyle P(Y_{k,n}=0)=\sum \limits_{i=0}^{n} (-1)^{i} \binom{n}{i} \biggl(1-\frac{i}{n}\biggr)^{k} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4)$

$\displaystyle P(Y_{k,n}=j)=\binom{n}{j} \sum \limits_{i=0}^{n-j} (-1)^{i} \binom{n-j}{i} \biggl(1-\frac{j+i}{n}\biggr)^{k} \ \ \ \ \ \ \ \ \ \ \ \ \ (5)$

where $j=1,2, \cdots, n-1$.

The notation $\binom{n}{j}$ is the binomial coefficient, which is the number of ways to choose $j$ objects out of $n$ objects where order does not matter. The calculation is defined by $\displaystyle \binom{n}{j}=\frac{n!}{j! (n-j)!}$.

The formula (5) gives the probability of having $j$ empty cells. In throwing $k$ balls into $n$ cells, the probability of having $w$ occupied cells is then $P[Y_{k,n}=n-w]$.

___________________________________________________________________________

The Probability Function

We now discuss the probability function of the random variable $X_n$, namely $P[X_n=k]$ for $k=n,n+1,n+2,\cdots$. The event $X_n=k$ means that all $n$ cells are occupied after throwing $k$ balls with the first $k-1$ balls landing in $n-1$ cells. Putting it in another way, there are zero empty cells after throwing $k$ balls and there is 1 empty cell after throwing the first $k-1$ balls. This can be stated using the notation in the preceding section on the occupancy problem as follows:

$P(X_n=k]=P[Y_{k,n}=0 \text{ and } Y_{k-1,n}=1] \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (6)$

Consider the following derivation.

\displaystyle \begin{aligned} P(X_n=k]&=P[Y_{k,n}=0 \text{ and } Y_{k-1,n}=1] \\&=P[Y_{k,n}=0 \ \lvert \ Y_{k-1,n}=1] \times P[Y_{k-1,n}=1] \\&=\frac{1}{n} \times \binom{n}{1} \sum \limits_{i=0}^{n-1} (-1)^i \binom{n-1}{i} \biggl[ 1-\frac{1+i}{n} \biggr]^{k-1} \\&=\sum \limits_{i=0}^{n-1} (-1)^i \binom{n-1}{i} \biggl[ 1-\frac{1+i}{n} \biggr]^{k-1} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (7) \end{aligned}

where $k=n,n+1,n+2,\cdots$.

Rather than memorizing the probability function in (7), a better way is to focus on the thought process that is inherent in (6).

One comment about the calculation for (7). The summation for $P[X_n=k]$ has $n$ terms. A given probability may involve multiple values of $k$, e.g.

$P[X_6>15]=1-P[6 \le X_6 \le 14]$.

Unless the number of values for $k$ is very small, the calculation should be done using software. Microsoft Excel is an excellent way to perform the calculation. The calculations for the examples below are programmed in Excel.

___________________________________________________________________________

Examples

Example 1
Suppose that a fair die is rolled until all 6 faces have appeared. Find the mean number of rolls and the variance of the number of rolls. What is the probability that it will take at least 12 rolls? What is the probability that it will take more than 15 rolls?

Using the notation developed above, the random variable of interest is $X_6$. Its mean and variance are:

$\displaystyle E[X_6]=6 \ \biggl[ 1 + \frac{1}{2}+\frac{1}{3}+\frac{1}{4}+\frac{1}{5}+\frac{1}{6} \biggr]=6 \times 2.45 = 14.7 \approx 15$

$\displaystyle Var[X_6]=\sum \limits_{i=1}^6 \frac{6(i-1)}{[6-(i-1)]^2}=38.99$

The following is the probability function for $X_6$.

$\displaystyle P[X_6=k]=\sum \limits_{i=0}^{5} (-1)^i \binom{5}{i} \biggl[ 1-\frac{1+i}{6} \biggr]^{k-1}$

where $k=6,7,8,\cdots$

For each $k$, the quantity $P[X_6=k]$ requires 6 calculations. Performing the calculations in Excel, the desired probabilities are:

$P[X_6 \ge 12]=1-P[6 \le X_6 \le 11]=1-0.356206419=0.643793581$

$P[X_6 > 15]=1-P[6 \le X_6 \le 15]=1-0.644212739=0.355787261$

Even though the average number of trials is 15, there is still a significant probability that it will take more than 15 trials. This is because the variance is quite large. $\square$

Example 2
An Internet startup is rapidly hiring new employees. What is the expected number of new employees until all birth months are represented? Assume that the 12 birth months are equally likely. What is the probability that the company will have to hire more than 25 employees? If the company currently has more than 25 employees with less than 12 birth months, what is the probability that it will have to hire more than 35 employees to have all 12 birth months represented in the company?

The random variable of interest is $X_{12}$. The following shows the mean and probability function.

$\displaystyle E[X_{12}]=12 \ \biggl[ 1 + \frac{1}{2}+\frac{1}{3}+\cdots +\frac{1}{12} \biggr]=37.23852814 \approx 38$

$\displaystyle P[X_{12}=k]=\sum \limits_{i=0}^{11} (-1)^i \binom{11}{i} \biggl[ 1-\frac{1+i}{12} \biggr]^{k-1}$

where $k=12,13,14,\cdots$

Performing the calculation in Excel, we obtain the following probabilities.

$P[X_{12} > 25]=1-P[12 \le X_{12} \le 25]=1-0.181898592=0.818101408$

$P[X_{12} > 35]=1-P[12 \le X_{12} \le 35]=1-0.531821149=0.468178851$

$\displaystyle P[X_{12} > 35 \ \lvert \ X_{12} > 25]=\frac{P[X_{12} > 35]}{P[X_{12} > 25]}=\frac{0.468178851}{0.818101408}=0.572274838$

___________________________________________________________________________

A Special Case

We now consider the special case that the coupon collector only wishes to collect $r$ distinct coupons where $r \le n$. Of course, $n$ is the total number of distinct coupon types. Let $X_{n,r}$ be the minimum number of purchases such that $r$ distinct coupons have been obtained. In the random sampling interpretation, $X_{n,r}$ would be the minimum sample size such that $r$ distinct elements have been chosen from the sample space $S=\left\{1,2,3,\cdots,n \right\}$. The mean and variance of $X_{n,r}$ follow from the same idea. Each $X_{n,r}$ is the independent sum of geometric random variables as in (0).

$X_{n,r}=C_{1}+C_{2}+C_{3}+\cdots+C_{i-1}+C_{i}+\cdots+C_{r} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (8)$

where $r \le n$.

Thus $E[X_{n,r}]$ and $Var[X_{n,r}]$ would be like (1) and (2) except that the summation is through $r$ instead of $n$.

For the probability function of $X_{n,r}$, we only need to tweak the thought process expressed in (6) slightly. For the event $X_{n,r}=k$ to happen, exactly $r$ cells are occupied after throwing $k$ balls with the first $k-1$ balls landing in $r-1$ cells. In other words, there are exactly $n-r$ empty cells after throwing $k$ balls and there are exactly $n-r+1$ empty cells after throwing $k-1$ balls. The following expresses this condition in terms of the occupancy problem, i.e. similar to (6).

$P[X_{n,r}=k]=P[Y_{k,n}=n-r \text{ and } Y_{k-1,n}=n-r+1] \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (9)$

Here’s the important components that need to go into $P[X_{n,r}=k]$ with the first one coming from the occupancy formula (5).

$\displaystyle P[Y_{k-1,n}=n-r+1]=\binom{n}{n-r+1} \sum \limits_{i=0}^{r-1} (-1)^i \binom{r-1}{i} \biggl[ 1-\frac{n-r+1+i}{n} \biggr]^{k-1}$

$\displaystyle P[Y_{k,n}=n-r \ \lvert \ Y_{k-1,n}=n-r+1]=\frac{n-r+1}{n}$

Multiply the above two probabilities together produces the desired probability $P[X_{n,r}=k]$ for $k=r,r+1,r+2,\cdots$.

$\displaystyle P[X_{n,r}=k]=\binom{n-1}{r-1} \sum \limits_{i=0}^{r-1} (-1)^i \binom{r-1}{i} \biggl[ 1-\frac{n-r+1+i}{n} \biggr]^{k-1} \ \ \ \ \ \ \ \ (10)$

Note that when $n=r$ (collecting the entire set of coupons), formula (10) would be identical to (7). The following example demonstrates the calculation.

Example 3
Consider the 6-coupon case discussed in Example 1. Suppose that the coupon collector is interested in collecting $r=4$ coupons. What is the expected number of purchases to get 4 coupons? What is the probability that it will take more than 6 purchases to get 4 coupons? What is the probability that it will take more than 8 purchases to get 4 coupons? Compare these results with Example 1.

The random variable of interest is $X_{6,4}$. The mean is:

$\displaystyle E[X_{6,4}]=6 \biggl[ \frac{1}{6}+\frac{1}{5}+\frac{1}{4}+\frac{1}{3} \biggr]=5.7 \approx 6$

Note that it is much faster to obtain 4 coupons than the entire set of six. The following gives the probability function for $X_{6,4}$.

$\displaystyle P[X_{6,4}=k]=\binom{5}{3} \sum \limits_{i=0}^{3} (-1)^i \binom{3}{i} \biggl[ 1-\frac{3+i}{6} \biggr]^{k-1}$

where $k=4,5,6,\cdots$

Performing the calculations in Excel gives the following probabilities.

$P[X_{6,4} > 6]=1-P[12 \le X_{12} \le 25]=1-0.74845679=0.25154321$

$P[X_{6,4} > 8]=1-P[12 \le X_{12} \le 35]=1-0.928712277=0.071287723$

The first probability shows there is still a good chance that it will take more then the mean number of trials to get 4 coupons. The wait time is much less than in Example 1 since the probability of wait time more than 8 is fairly small. $\square$

___________________________________________________________________________

Moment Generating Function

One distributional quantity that is easy to obtain is the moment generating function (MGF) for the random variable $X_n$ (the case of collecting the entire set of coupons) as well as for the random variable $X_{n,r}$ (the partial case). Since both of these random variables are the independent sum of geometric random variables, their MGFs would simply be the product of the individual geometric MGFs. The following gives the results.

$\displaystyle M_{X_n}(t)=\prod \limits_{i=1}^n \frac{[n-(i-1)] e^t}{n-(i-1) e^t} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (11)$

$\displaystyle M_{X_{n,r}}(t)=\prod \limits_{i=1}^r \frac{[n-(i-1)] e^t}{n-(i-1) e^t} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (12)$

___________________________________________________________________________
$\copyright \ 2017 \text{ by Dan Ma}$