The birthday problem

Consider this random experiment. You ask people (one at a time) of their birthdays (month and day only). The process continues until there is a repeat in the series of birthdays, in other words, until two people you’ve surveyed share the same birthday. How many people you have to ask before finding a repeat? What is the probability that you will have to ask k people before finding a repeat? What is the median number of people you have to ask? In this post, we discuss this random variable and how this random variable relates to the birthday problem.

In the problem at hand, we ignore leap year and assume that each of the 365 days in the year is equally likely to be the birthday for a randomly chosen person. The birthday problem is typically the following question. How many people do we need to choose in order to have a 50% or better chance of having a common birthday among the selected individuals?

The random experiment stated at the beginning can be restated as follows. Suppose that balls are randomly thrown (one at a time) into n cells (e.g. boxes). The random process continues until a ball is thrown into a cell that already has one ball (i.e. until a repeat occurs). Let X_n be the number of balls that are required to obtain a repeat. Some of the problems we discuss include the mean (the average number of balls that are to be thrown to get a repeat) and the probability function. We will also show how this random variable is linked to the birthday problem when n=365.

___________________________________________________________________________

The Birthday Problem

First, we start with the birthday problem. The key is to derive the probability that in a group of k randomly selected people, there are at least two who share the same birthday. It is easier to do the complement – the probability of having different birthdays among the group of k people. We call this probability p_k.

    \displaystyle \begin{aligned} p_k&=\frac{365}{365} \ \frac{364}{365} \ \frac{363}{365} \cdots \frac{365-(k-1)}{365} \\&=\frac{364}{365} \ \frac{363}{365} \cdots \frac{365-(k-1)}{365} \\&=\biggl[1-\frac{1}{365} \biggr] \ \biggl[1-\frac{2}{365} \biggr] \cdots \biggl[1-\frac{k-1}{365} \biggr] \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1) \end{aligned}

where k=2,3,4,\cdots,365. The reasoning for the first line is that there are 365 choices to be picked in the first selection. Each subsequent random selection has to avoid the previous birthday, thus 364 choices for the second person and only 365-(k-1) choices for the kth person.

To answer the birthday problem, just plug in values of k to compute p_k and 1-p_k until reaching the smallest k such that p_k<0.5 and 1-p_k>0.5. The calculation should be done using software (Excel for example). The smallest k is 23 since

    p_{23}= 0.492702766

    1-p_{23}= 0.507297234

In a random group of 23 people, there is a less than 50% chance of having distinct birthdays, and thus a more than 50% chance of having at least one identical birthday. This may be a surprising result. Without the benefit of formula (1), some people may think that it will take a larger sample to obtain a repeat.

The benefit of (1) extends beyond the birthday problem. Let’s consider the case for n, i.e. randomly pick numbers from the set \left\{1,2,3,\cdots,n-1,n \right\} with replacement until a number is chosen twice (until a repeat occurs). Similarly, let p_{n,k} be the probability that in k draws all chosen numbers are distinct. The probability p_{n,k} is obtained by replacing 365 with n.

    \displaystyle \begin{aligned} p_{n,k}&=\frac{n}{n} \ \frac{n-1}{n} \ \frac{n-2}{n} \cdots \frac{n-(k-1)}{n} \\&=\frac{n-1}{n} \ \frac{n-2}{n} \cdots \frac{n-(k-1)}{n} \\&=\biggl[1-\frac{1}{n} \biggr] \ \biggl[1-\frac{2}{n} \biggr] \cdots \biggl[1-\frac{k-1}{n} \biggr] \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2) \end{aligned}

Formula (2) will be useful in the next section.

___________________________________________________________________________

The Random Variable

We now look into the random variable discussed at the beginning, either the one for picking people at random until a repeated birthday or throwing balls into cells until one cell has two balls. To illustrate the idea, let’s look at an example.

Example 1
Roll a fair die until obtaining a repeated face value. Let X_6 be the number of rolls to obtain the repeated value. Find the probability function P[X_6=k] where k=2,3,4,5,6,7.

Note that P[X_6=k] is the probability that it will take k rolls to get a repeated die value.

    \displaystyle P(X_6=2)=\frac{6}{6} \times \frac{1}{6}=\frac{1}{6}

    \displaystyle P(X_6=3)=\frac{6}{6} \times \frac{5}{6} \times \frac{2}{6}=\frac{10}{6^2}

    \displaystyle P(X_6=4)=\frac{6}{6} \times \frac{5}{6} \times \frac{4}{6} \times \frac{3}{6}=\frac{60}{6^3}

    \displaystyle P(X_6=5)=\frac{6}{6} \times \frac{5}{6} \times \frac{4}{6} \times \frac{3}{6} \times \frac{4}{6}=\frac{240}{6^4}

    \displaystyle P(X_6=6)=\frac{6}{6} \times \frac{5}{6} \times \frac{4}{6} \times \frac{3}{6} \times \frac{2}{6} \times \frac{5}{6}=\frac{600}{6^5}

    \displaystyle P(X_6=7)=\frac{6}{6} \times \frac{5}{6} \times \frac{4}{6} \times \frac{3}{6} \times \frac{2}{6} \times \frac{1}{6} \times \frac{6}{6}=\frac{720}{6^6}

To get a repeat in 2 rolls, there are 6 choices for the first roll and the second has only one choice – the value of the first roll. To get a repeat in 3 rolls, there are 6 choices for the first roll, 5 choices for the second roll and the third roll must be out of the 2 previous two distinct values. The idea is that the first k-1 rolls are distinct and the last roll must be one of the previous values. \square

The reasoning process leads nicely to the general case. In the general case, let’s consider the occupancy interpretation. In throwing balls into n cells, let X_n be defined as above, i.e. the number of balls that are required to obtain a repeat. The following gives the probability P[X_n=k].

    \displaystyle \begin{aligned} P[X_n=k]&=\frac{n}{n} \times \frac{n-1}{n} \times \cdots \times \frac{n-(k-2)}{n} \times \frac{k-1}{n} \\&=\frac{n-1}{n} \times \cdots \times \frac{n-(k-2)}{n} \times \frac{k-1}{n} \\&=\frac{(n-1) \times (n-2) \times \cdots \times (n-(k-2)) \times (k-1)}{n^{k-1}} \\&=\biggl[1-\frac{1}{n} \biggr] \times \biggl[1-\frac{2}{n} \biggr] \times \cdots \times \biggl[1-\frac{k-2}{n}\biggr] \times \frac{k-1}{n}  \ \ \ \ \ \ \ (3) \end{aligned}

where k=2,3,\cdots,n+1.

The reasoning is similar to Example 1. To get a repeat in throwing k balls, the first k-1 balls must be go into different cells while the last ball would go into one of the k-1 occupied cells. For the first k-1 balls to go into different cells, there are n (n-1) \cdots (n-(k-2)) ways. There are k-1 cells for the last ball to land. Thus the product of these two quantities is in the numerator of (3).

Once the probability function (3) is obtained, the mean E[X_n] can be derived accordingly. For the case of n=365, E[X_{365}]=24.62 (calculated by programming the probability function in Excel). On average it will be required to sample about 25 people to obtain a repeated birthday.

Another interesting quantity is P[X_n>k]. This is the probability that it will take throwing more than k balls to get a repeat. Mathematically this can be obtained by first calculating P[X_n \le k] by summing the individual probabilities via (3). This is a workable approach using software. There is another way that is more informative. For the event X_n>k to happen, the first k throws must be in different cells (no repeat). The event X_n>k is identical to the event that there is no repeat in the first k throws of balls. This is how the random variable X_n is linked to the birthday problem since the probability P[X_n>k] should agree with the probability p_{n,k} in (2).

    \displaystyle P[X_n>k]=\biggl[1-\frac{1}{n} \biggr] \ \biggl[1-\frac{2}{n} \biggr] \cdots \biggl[1-\frac{k-1}{n} \biggr] \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4)

___________________________________________________________________________

Back to The Birthday Problem

Consider the case for X_{365}. What is the median of X_{365}? That would be the median number of people surveyed to obtain a pair of identical birthday. The median of X_{365} would be the least k such that P[X_{365} \le k] is at least 0.5. Note that P[X_{365}>k] is identical to p_k in (1). The above calculation shows that p_{23}=0.4927 and 1-p_{23}=0.5073. Thus the median of X_{365} is 23. Thus when performing the random sampling of surveying birthday, about half of the times you can expect to survey 23 or fewer than 23 people.

The birthday problem is equivalently about finding the median of the random variable X_{365}. A little more broadly, the birthday problem is connected to the percentiles of the variable X_{n}. In contrast, the mean of X_{365} is E[X_{365}]=24.62. The following lists several percentiles for the random variable X_{365}.

    \begin{array}{ccccccc}    k &  \text{ } & P[X_{365}>k]  & \text{ } & P[X_{365} \le k] & \text{ } & \text{Percentile} \\    \text{ } &   \text{ } & \text{ } & \text{ } & \text{ } &  \\   14 &   \text{ } & 0.77690 &  & 0.22310 &  \\   15 &   \text{ } & 0.74710 &  & 0.25290 & \text{ } & \text{25th Percentile} \\   16 &   \text{ } & 0.71640 &  & 0.28360 &  \\  \text{ } &   \text{ } & \text{ } & \text{ } & \text{ } &  \\   22 &   \text{ } & 0.52430 &  & 0.47570 &  \\   23 &   \text{ } & 0.49270 &  & 0.50730 & \text{ } & \text{50th Percentile}  \\   24 &   \text{ } & 0.46166 &  & 0.53834 &  \\  \text{ } &   \text{ } & \text{ } & \text{ } & \text{ } &  \\   31 &   \text{ } & 0.26955 &  & 0.73045 &  \\   32 &   \text{ } & 0.24665 &  & 0.75335 & \text{ } & \text{75th Percentile} \\         33 &   \text{ } & 0.22503 &  & 0.77497 &  \\  \text{ } &   \text{ } & \text{ } & \text{ } & \text{ } &  \\   40 &   \text{ } & 0.10877 &  & 0.89123 &  \\   41 &   \text{ } & 0.09685 &  & 0.90315 & \text{ } & \text{90th Percentile} \\   42 &   \text{ } & 0.08597 &  & 0.91403 &  \\  \text{ } &   \text{ } & \text{ } & \text{ } & \text{ } &  \\   46 &   \text{ } & 0.05175 &  & 0.94825 &  \\   47 &   \text{ } & 0.04523 &  & 0.95477 & \text{ } & \text{95th Percentile} \\    48 &  \text{ } & 0.03940 &  & 0.96060 &  \\  \text{ } &   \text{ } & \text{ } & \text{ } & \text{ } &  \\    56 &  \text{ } & 0.01167 &  & 0.98833 &  \\  57 &  \text{ } & 0.00988 &  & 0.99012 & \text{ } & \text{99th Percentile} \\  58 &  \text{ } & 0.00834 &  & 0.99166 &  \\      \end{array}

It is clear that in a group of 366 people, it is certain that there will be at least one repeated birthday (again ignoring leap year). This is due to the pigeon hole principle. As the percentiles in the above table shows, you do not need to survey anywhere close to 366 to get a repeat. The median is 23 as discussed. The 75th percentile of X_{365} is 32.

The preceding calculation shows that you do not need a large group to have a repeated birthday. About 50% of the times, you will survey 23 or fewer people, about 75% of the time, 32 or fewer people, About 99% of the time, you will survey 57 or fewer people, much fewer than 365 or 366. So with around 50 in a random group, there is a near certainty of finding a shared birthday. In a random group of 100 people, there should be an almost absolute certainty that there is a shared birthday.

For a further demonstration, we simulated the random variable X_{365} 10,000 times. The range of the simulated values is 2 to 78. Thus the odds for 100 people to survey is smaller than 1 in 10,000. To get a simulated value of 100, we will have to simulate more than 10,000 values of X_{365}. The median of the 10,000 simulated results is 23. The following table summarizes the results.

    \begin{array}{rrrrrr}    \text{Interval} &  \text{ } & \text{Count}  & \text{ } & \text{Proportion} & \\    \text{ } &   \text{ } & \text{ } & \text{ } & \text{ } &  \\   \text{2 to 9} &   \text{ } &953 &  & 0.09530 &  \\   \text{10 to 19} &   \text{ } & 2870 &  & 0.28700 &  \\   \text{20 to 29} &   \text{ } & 3055 &  & 0.30550 &  \\     \text{30 to 39} &   \text{ } & 1922 &  & 0.19220 &  \\   \text{40 to 49} &   \text{ } & 886 &  & 0.08860 &   \\   \text{50 to 59} &   \text{ } & 253 &  & 0.02530 &  \\     \text{60 to 69} &   \text{ } & 48 &  & 0.00480 &  \\   \text{70 to 78} &   \text{ } & 13 &  & 0.00130 &  \\              \end{array}

Not shown in the table is that 33 of the simulated results are the value of 2. Thus it is possible to ask two people and they both have the same birthday. But the odds of that happening is 33 out of 10,000 according to this particular set of simulations (probability 0.0033). The theoretical probability of 2 is 1/365 = 0.002739726. There are 2 instances of 78 in the 10,000 simulated values. Thus the odds are 2 out of 10,000 with probability 0.0002. The theoretical probability is 0.000037 using (3).

___________________________________________________________________________
\copyright \ 2017 \text{ by Dan Ma}

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s