# The birthday problem

Consider this random experiment. You ask people (one at a time) about their birthdays (month and day only). The process continues until there is a repeat in the series of birthdays, in other words, until two people you’ve surveyed share the same birthday. How many people do you have to ask before finding a repeat? What is the probability that you will have to ask $k$ people before finding a repeat? What is the median number of people you have to ask? In this post, we discuss this random variable and how it relates to the birthday problem.

In the problem at hand, we ignore leap year and assume that each of the 365 days in the year is equally likely to be the birthday for a randomly chosen person. The birthday problem is typically the following question. How many people do we need to choose in order to have a 50% or better chance of having a common birthday among the selected individuals?

The random experiment stated at the beginning can be restated as follows. Suppose that balls are randomly thrown (one at a time) into $n$ cells (e.g. boxes). The random process continues until a ball is thrown into a cell that already has one ball (i.e. until a repeat occurs). Let $X_n$ be the number of balls that are required to obtain a repeat. Some of the problems we discuss include the mean (the average number of balls that are to be thrown to get a repeat) and the probability function. We will also show how this random variable is linked to the birthday problem when $n=365$.

___________________________________________________________________________

The Birthday Problem

First, we start with the birthday problem. The key is to derive the probability that in a group of $k$ randomly selected people, there are at least two who share the same birthday. It is easier to do the complement – the probability of having different birthdays among the group of $k$ people. We call this probability $p_k$.

\displaystyle \begin{aligned} p_k&=\frac{365}{365} \ \frac{364}{365} \ \frac{363}{365} \cdots \frac{365-(k-1)}{365} \\&=\frac{364}{365} \ \frac{363}{365} \cdots \frac{365-(k-1)}{365} \\&=\biggl[1-\frac{1}{365} \biggr] \ \biggl[1-\frac{2}{365} \biggr] \cdots \biggl[1-\frac{k-1}{365} \biggr] \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1) \end{aligned}

where $k=2,3,4,\cdots,365$. The reasoning for the first line is that there are 365 possible birthdays for the first person selected. Each subsequent random selection has to avoid all of the previous birthdays, leaving 364 choices for the second person and only $365-(k-1)$ choices for the $k$th person.

To answer the birthday problem, just plug in values of $k$ to compute $p_k$ and $1-p_k$ until reaching the smallest $k$ such that $p_k<0.5$ and $1-p_k>0.5$. The calculation should be done using software (Excel for example). The smallest $k$ is 23 since

$p_{23}= 0.492702766$

$1-p_{23}= 0.507297234$

In a random group of 23 people, there is a less than 50% chance of having distinct birthdays, and thus a more than 50% chance of having at least one identical birthday. This may be a surprising result. Without the benefit of formula (1), some people may think that it will take a larger sample to obtain a repeat.
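The search for the smallest such $k$ is easy to script. The following is a minimal Python sketch of the calculation in (1) (the post itself uses Excel; the function name here is ours).

```python
# A minimal sketch of formula (1): the probability that k randomly chosen
# birthdays are all distinct (365 equally likely days, ignoring leap year).
def prob_all_distinct(k, n=365):
    p = 1.0
    for i in range(1, k):
        p *= 1 - i / n
    return p

# find the smallest k with a better-than-even chance of a shared birthday
k = 2
while 1 - prob_all_distinct(k) <= 0.5:
    k += 1
print(k, prob_all_distinct(k), 1 - prob_all_distinct(k))
# 23 0.4927... 0.5072...
```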

The benefit of (1) extends beyond the birthday problem. Let’s consider the general case with $n$ possibilities, i.e. randomly pick numbers from the set $\left\{1,2,3,\cdots,n-1,n \right\}$ with replacement until a number is chosen twice (until a repeat occurs). Similarly, let $p_{n,k}$ be the probability that in $k$ draws all chosen numbers are distinct. The probability $p_{n,k}$ is obtained by replacing 365 with $n$.

\displaystyle \begin{aligned} p_{n,k}&=\frac{n}{n} \ \frac{n-1}{n} \ \frac{n-2}{n} \cdots \frac{n-(k-1)}{n} \\&=\frac{n-1}{n} \ \frac{n-2}{n} \cdots \frac{n-(k-1)}{n} \\&=\biggl[1-\frac{1}{n} \biggr] \ \biggl[1-\frac{2}{n} \biggr] \cdots \biggl[1-\frac{k-1}{n} \biggr] \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2) \end{aligned}

Formula (2) will be useful in the next section.

___________________________________________________________________________

The Random Variable

We now look into the random variable discussed at the beginning, either the one for picking people at random until a repeated birthday or throwing balls into cells until one cell has two balls. To illustrate the idea, let’s look at an example.

Example 1
Roll a fair die until obtaining a repeated face value. Let $X_6$ be the number of rolls to obtain the repeated value. Find the probability function $P[X_6=k]$ where $k=2,3,4,5,6,7$.

Note that $P[X_6=k]$ is the probability that it will take $k$ rolls to get a repeated die value.

$\displaystyle P(X_6=2)=\frac{6}{6} \times \frac{1}{6}=\frac{1}{6}$

$\displaystyle P(X_6=3)=\frac{6}{6} \times \frac{5}{6} \times \frac{2}{6}=\frac{10}{6^2}$

$\displaystyle P(X_6=4)=\frac{6}{6} \times \frac{5}{6} \times \frac{4}{6} \times \frac{3}{6}=\frac{60}{6^3}$

$\displaystyle P(X_6=5)=\frac{6}{6} \times \frac{5}{6} \times \frac{4}{6} \times \frac{3}{6} \times \frac{4}{6}=\frac{240}{6^4}$

$\displaystyle P(X_6=6)=\frac{6}{6} \times \frac{5}{6} \times \frac{4}{6} \times \frac{3}{6} \times \frac{2}{6} \times \frac{5}{6}=\frac{600}{6^5}$

$\displaystyle P(X_6=7)=\frac{6}{6} \times \frac{5}{6} \times \frac{4}{6} \times \frac{3}{6} \times \frac{2}{6} \times \frac{1}{6} \times \frac{6}{6}=\frac{720}{6^6}$

To get a repeat in 2 rolls, there are 6 choices for the first roll and the second roll has only one choice – the value of the first roll. To get a repeat in 3 rolls, there are 6 choices for the first roll, 5 choices for the second roll, and the third roll must be one of the previous two distinct values. The idea is that the first $k-1$ rolls are distinct and the last roll must be one of the previous values. $\square$
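The product reasoning in Example 1 is easy to verify numerically. Here is a short Python sketch (ours, not part of the original example) that reproduces the six probabilities above.

```python
# P(X_6 = k): the first k-1 rolls are distinct and the k-th roll repeats one of them.
def prob_repeat_at(k, n=6):
    p = 1.0
    for i in range(1, k - 1):
        p *= (n - i) / n      # roll i+1 avoids the i distinct values seen so far
    return p * (k - 1) / n    # the last roll hits one of the k-1 values seen

for k in range(2, 8):
    print(k, prob_repeat_at(k))
# matches 1/6, 10/6^2, 60/6^3, 240/6^4, 600/6^5, 720/6^6
```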

The reasoning process leads nicely to the general case. In the general case, let’s consider the occupancy interpretation. In throwing balls into $n$ cells, let $X_n$ be defined as above, i.e. the number of balls that are required to obtain a repeat. The following gives the probability $P[X_n=k]$.

\displaystyle \begin{aligned} P[X_n=k]&=\frac{n}{n} \times \frac{n-1}{n} \times \cdots \times \frac{n-(k-2)}{n} \times \frac{k-1}{n} \\&=\frac{n-1}{n} \times \cdots \times \frac{n-(k-2)}{n} \times \frac{k-1}{n} \\&=\frac{(n-1) \times (n-2) \times \cdots \times (n-(k-2)) \times (k-1)}{n^{k-1}} \\&=\biggl[1-\frac{1}{n} \biggr] \times \biggl[1-\frac{2}{n} \biggr] \times \cdots \times \biggl[1-\frac{k-2}{n}\biggr] \times \frac{k-1}{n} \ \ \ \ \ \ \ (3) \end{aligned}

where $k=2,3,\cdots,n+1$.

The reasoning is similar to Example 1. To get a repeat in throwing $k$ balls, the first $k-1$ balls must go into different cells while the last ball goes into one of the $k-1$ occupied cells. For the first $k-1$ balls to go into different cells, there are $n (n-1) \cdots (n-(k-2))$ ways. There are $k-1$ occupied cells in which the last ball can land. Thus the product of these two quantities is the numerator of (3).

Once the probability function (3) is obtained, the mean $E[X_n]$ can be derived accordingly. For the case of $n=365$, $E[X_{365}]=24.62$ (calculated by programming the probability function in Excel). On average, about 25 people need to be sampled to obtain a repeated birthday.
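The Excel computation just mentioned can be mimicked in a few lines. The following Python sketch (ours) computes the mean directly from the probability function (3).

```python
# E[X_n] computed from the probability function (3)
def pmf(k, n):
    p = (k - 1) / n           # the last ball lands in one of the k-1 occupied cells
    for i in range(1, k - 1):
        p *= 1 - i / n        # the first k-1 balls land in distinct cells
    return p

n = 365
mean = sum(k * pmf(k, n) for k in range(2, n + 2))
print(round(mean, 2))         # 24.62
```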

Another interesting quantity is $P[X_n>k]$. This is the probability that it will take throwing more than $k$ balls to get a repeat. Mathematically this can be obtained by first calculating $P[X_n \le k]$ by summing the individual probabilities via (3). This is a workable approach using software. There is another way that is more informative. For the event $X_n>k$ to happen, the first $k$ throws must be in different cells (no repeat). The event $X_n>k$ is identical to the event that there is no repeat in the first $k$ throws of balls. This is how the random variable $X_n$ is linked to the birthday problem since the probability $P[X_n>k]$ should agree with the probability $p_{n,k}$ in (2).

$\displaystyle P[X_n>k]=\biggl[1-\frac{1}{n} \biggr] \ \biggl[1-\frac{2}{n} \biggr] \cdots \biggl[1-\frac{k-1}{n} \biggr] \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4)$

___________________________________________________________________________

Back to The Birthday Problem

Consider the case of $X_{365}$. What is the median of $X_{365}$? That would be the median number of people surveyed to obtain a pair of identical birthdays. The median of $X_{365}$ is the least $k$ such that $P[X_{365} \le k]$ is at least 0.5. Note that $P[X_{365}>k]$ is identical to $p_k$ in (1). The above calculation shows that $p_{23}=0.4927$ and $1-p_{23}=0.5073$. Thus the median of $X_{365}$ is 23. So when performing this birthday survey, about half of the time you can expect to survey 23 or fewer people.

The birthday problem is equivalently about finding the median of the random variable $X_{365}$. A little more broadly, the birthday problem is connected to the percentiles of the variable $X_{n}$. In contrast, the mean of $X_{365}$ is $E[X_{365}]=24.62$. The following lists several percentiles for the random variable $X_{365}$.

$\begin{array}{ccccccc} k & \text{ } & P[X_{365}>k] & \text{ } & P[X_{365} \le k] & \text{ } & \text{Percentile} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } & \\ 14 & \text{ } & 0.77690 & & 0.22310 & \\ 15 & \text{ } & 0.74710 & & 0.25290 & \text{ } & \text{25th Percentile} \\ 16 & \text{ } & 0.71640 & & 0.28360 & \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } & \\ 22 & \text{ } & 0.52430 & & 0.47570 & \\ 23 & \text{ } & 0.49270 & & 0.50730 & \text{ } & \text{50th Percentile} \\ 24 & \text{ } & 0.46166 & & 0.53834 & \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } & \\ 31 & \text{ } & 0.26955 & & 0.73045 & \\ 32 & \text{ } & 0.24665 & & 0.75335 & \text{ } & \text{75th Percentile} \\ 33 & \text{ } & 0.22503 & & 0.77497 & \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } & \\ 40 & \text{ } & 0.10877 & & 0.89123 & \\ 41 & \text{ } & 0.09685 & & 0.90315 & \text{ } & \text{90th Percentile} \\ 42 & \text{ } & 0.08597 & & 0.91403 & \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } & \\ 46 & \text{ } & 0.05175 & & 0.94825 & \\ 47 & \text{ } & 0.04523 & & 0.95477 & \text{ } & \text{95th Percentile} \\ 48 & \text{ } & 0.03940 & & 0.96060 & \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } & \\ 56 & \text{ } & 0.01167 & & 0.98833 & \\ 57 & \text{ } & 0.00988 & & 0.99012 & \text{ } & \text{99th Percentile} \\ 58 & \text{ } & 0.00834 & & 0.99166 & \\ \end{array}$

It is clear that in a group of 366 people, it is certain that there will be at least one repeated birthday (again ignoring leap year). This is due to the pigeonhole principle. As the percentiles in the above table show, you do not need to survey anywhere close to 366 people to get a repeat. The median is 23 as discussed. The 75th percentile of $X_{365}$ is 32.

The preceding calculation shows that you do not need a large group to have a repeated birthday. About 50% of the time, you will survey 23 or fewer people; about 75% of the time, 32 or fewer people; and about 99% of the time, 57 or fewer people, far fewer than 365 or 366. So in a random group of around 50 people, a shared birthday is nearly certain. In a random group of 100 people, a shared birthday is all but guaranteed.

For a further demonstration, we simulated the random variable $X_{365}$ 10,000 times. The range of the simulated values is 2 to 78. Thus the chance of needing to survey 100 people appears to be smaller than 1 in 10,000; to see a simulated value of 100, we would likely have to simulate more than 10,000 values of $X_{365}$. The median of the 10,000 simulated results is 23. The following table summarizes the results.

$\begin{array}{rrrrrr} \text{Interval} & \text{ } & \text{Count} & \text{ } & \text{Proportion} & \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } & \\ \text{2 to 9} & \text{ } &953 & & 0.09530 & \\ \text{10 to 19} & \text{ } & 2870 & & 0.28700 & \\ \text{20 to 29} & \text{ } & 3055 & & 0.30550 & \\ \text{30 to 39} & \text{ } & 1922 & & 0.19220 & \\ \text{40 to 49} & \text{ } & 886 & & 0.08860 & \\ \text{50 to 59} & \text{ } & 253 & & 0.02530 & \\ \text{60 to 69} & \text{ } & 48 & & 0.00480 & \\ \text{70 to 78} & \text{ } & 13 & & 0.00130 & \\ \end{array}$

Not shown in the table is that 33 of the simulated results are the value of 2. Thus it is possible to ask two people and find that they share the same birthday, but the odds of that happening are 33 out of 10,000 according to this particular set of simulations (probability 0.0033). The theoretical probability is 1/365 = 0.002739726. There are 2 instances of 78 among the 10,000 simulated values. Thus the odds are 2 out of 10,000 (probability 0.0002). The theoretical probability is 0.000037 using (3).
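A simulation along these lines can be reproduced with a short script. The following Python sketch (ours, not the original simulation) draws 10,000 values of $X_{365}$; the counts will of course vary from run to run.

```python
import random

def sample_X(n=365):
    """Draw one value of X_n: sample days at random until a repeat occurs."""
    seen = set()
    count = 0
    while True:
        count += 1
        day = random.randrange(n)
        if day in seen:
            return count
        seen.add(day)

sims = sorted(sample_X() for _ in range(10000))
print(sims[0], sims[-1], sims[len(sims) // 2])   # min, max, and (approximate) median
```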

___________________________________________________________________________
$\copyright \ 2017 \text{ by Dan Ma}$

# How long does it take to collect all coupons?

This post discusses the coupon collector problem, a classical problem in probability.

___________________________________________________________________________

The Coupon Collector Problem

The problem is usually stated as a coupon collector trying to collect the entire set of coupons. For example, each time the coupon collector buys a product (e.g. a box of breakfast cereal), he receives a coupon, which is a prize that can be a toy or a baseball card or other interesting item. Suppose that there are $n$ different types of coupons (prizes) and that the coupon collector wishes to collect the entire set. How many units of the product does the coupon collector have to buy in order to collect the entire set?

A simplified discussion of the coupon collector problem is found in another blog post. This post is a more detailed discussion.

This blog post in another blog discusses the coupon collector problem from a simulation perspective.

As shown below, if there are 5 different coupons, it takes 12 purchases on average to get all coupons. If there are 10 different types of coupons, on average it would take 30 purchases. If there are 50 different types of coupons, it would take on average 225 purchases to collect the entire set. The first few coupons are obtained fairly quickly. As more and more coupons are accumulated, it is harder to get the remaining coupons. For example, in the 50-coupon case, after the first 49 coupons are obtained, it takes on average 50 purchases to get the last coupon.

Suppose that the coupon collector does not want to collect the entire set and only wishes to collect $r$ distinct coupons where $r \le n$. It turns out that this special case only requires a minor tweak to the case of collecting the entire set. Our strategy then is to focus on the main case. The special case will be discussed at the end of the post.

We first consider the main case that the coupon collector wishes to collect the entire set. The problem can be cast as a random sampling from the population $S=\left\{1,2,3,\cdots,n \right\}$. Selecting a number at random from $S$ with replacement is equivalent to the coupon collector receiving a coupon. Let $X_n$ be the minimum number of selections such that each number in $S$ is picked at least once. In this post we discuss the probability function of $X_n$ as well as its mean and variance.

Another interpretation of the problem is that it can be viewed as an occupancy problem, which involves throwing balls into $n$ cells at random. The random quantity of interest is the number of balls that are required to be thrown such that each cell has at least one ball. Clearly this formulation is identical to the coupon interpretation and the random sampling interpretation. The angle of occupancy problem is useful since we can leverage the formulas developed in this previous post. A description of the occupancy problem is given here.

Regardless of the interpretation, the goal is to obtain information on the random variable $X_n$, the minimum number of random selections from $S=\left\{1,2,3,\cdots,n \right\}$ in order to have the complete set of $n$ distinct values represented in the sample.

___________________________________________________________________________

Mean and Variance

The mean and variance of $X_n$ are easier to obtain. So that is where we begin. The key is to break up $X_n$ into a sum as follows:

$X_n=C_{1}+C_{2}+C_{3}+\cdots+C_{i-1}+C_{i}+\cdots+C_{n} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (0)$

where $C_{i}$ is the number of additional selections from $S=\left\{1,2,3,\cdots,n \right\}$ needed to get a number that is distinct from the $i-1$ distinct numbers that have already been chosen. For example, $C_3$ is the number of random selections needed to get a number that is distinct from the two distinct numbers obtained up to that point.

Note that each $C_i$ involves repeated sampling until some criterion is reached, thus resembling a geometric random variable. Indeed they are. As the sampling continues and as more distinct values are obtained, it is not as easy to obtain a new number. After $i-1$ distinct numbers have been obtained, the probability of drawing a new distinct number is $p=\frac{n-(i-1)}{n}$. As geometric random variables, each $C_i$ has the following mean and variance.

$\displaystyle E[C_i]=\frac{1}{p}=\frac{n}{n-(i-1)}$

$\displaystyle Var[C_i]=\frac{1-p}{p^2}=\frac{n(i-1)}{[n-(i-1)]^2}$

where $i=1,2,3,\cdots,n$. Note that the random variables $C_i$ are independent. The value of $C_i$ does not depend on how many trials it takes to draw the previous $i-1$ distinct numbers. The following gives the mean and variance of $X_n$.

$\displaystyle E[X_n]=\sum \limits_{i=1}^n E[C_i]=\sum \limits_{i=1}^n \frac{n}{n-(i-1)} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)$

$\displaystyle Var[X_n]=\sum \limits_{i=1}^n Var[C_i]=\sum \limits_{i=1}^n \frac{n(i-1)}{[n-(i-1)]^2} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)$

The expectation $E[X_n]$ can be rearranged as follows to give more information.

$\displaystyle E[X_n]=n \ \biggl[\frac{1}{n}+ \cdots + \frac{1}{3}+ \frac{1}{2} + 1\biggr]=n \ H_n \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3)$

The quantity $H_n$ is the partial sum of the harmonic series. Note that $H_n \rightarrow \infty$ as $n \rightarrow \infty$. Thus $E[X_n] \rightarrow \infty$ as $n \rightarrow \infty$. The quantity $H_n$ can be interpreted as the average number of purchases required per coupon. The following table lists the expected values for selected values of $n$.

Table 1

$\begin{array}{cccccc} \text{Number of} & \text{ } & \text{Expected Number of Trials} & \text{ } & \text{Expected Total Number} & \\ \text{Coupons} & \text{ } & \text{per Coupon} & \text{ } & \text{of Trials (rounded up)} & \\ n & \text{ } & E[H_n] & \text{ } & E[X_n] & \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } & \\ 1 & \text{ } & 1.0000 & & 1 & \\ 2 & \text{ } & 1.5000 & & 3 & \\ 3 & \text{ } & 1.8333 & & 6 & \\ 4 & \text{ } & 2.0833 & & 9 & \\ 5 & \text{ } & 2.2833 & & 12 & \\ 6 & \text{ } & 2.4500 & & 15 & \\ 7 & \text{ } & 2.5929 & & 19 & \\ 8 & \text{ } & 2.7179 & & 22 & \\ 9 & \text{ } & 2.8290 & & 26 & \\ 10 & \text{ } & 2.9290 & & 30 & \\ 20 & \text{ } & 3.5977 & & 72 & \\ 30 & \text{ } & 3.9950 & & 120 & \\ 40 & \text{ } & 4.2785 & & 172 & \\ 50 & \text{ } & 4.4992 & & 225 & \\ 60 & \text{ } & 4.6799 & & 281 & \\ 70 & \text{ } & 4.8328 & & 339 & \\ 80 & \text{ } & 4.9655 & & 398 & \\ 90 & \text{ } & 5.0826 & & 458 & \\ 100 & \text{ } & 5.1874 & & 519 & \\ \end{array}$

Table 1 gives an estimate on how long to expect to collect the entire set of coupons for selected coupon sizes. The third column gives the expected total number of purchases to obtain the entire coupon set. The second column gives an estimate of how many purchases on average to obtain one coupon. For the 50-coupon case, it takes on average about 4.5 purchases to obtain one coupon. However, it does not tell the whole story. To get the 50th coupon, it takes on average 50 trials. Note that $E[C_{50}]=50$ in the 50-coupon case in formula (1). In a simulation of the 50-coupon problem, it took 54 trials to obtain the 50th coupon. To get the 49th coupon, it takes on average 50/2 = 25 trials.
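Table 1 is straightforward to reproduce. The following Python sketch (ours) computes $H_n$ and the rounded-up value of $n H_n$ for a few of the listed values of $n$.

```python
import math

def harmonic(n):
    return sum(1 / i for i in range(1, n + 1))

for n in [5, 10, 50, 100]:
    print(n, round(harmonic(n), 4), math.ceil(n * harmonic(n)))
# 5   2.2833  12
# 10  2.929   30
# 50  4.4992  225
# 100 5.1874  519
```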

___________________________________________________________________________

The Occupancy Problem

We now view the coupon collector problem as an occupancy problem in order to leverage a formula from a previous post. Suppose that we randomly throw $k$ balls into $n$ cells. Let $Y_{k,n}$ be the number of empty cells as a result of randomly assigning $k$ balls into $n$ cells. The following gives the probabilities $P(Y_{k,n}=j)$ where $j=0,1,2,\cdots,n-1$.

$\displaystyle P(Y_{k,n}=0)=\sum \limits_{i=0}^{n} (-1)^{i} \binom{n}{i} \biggl(1-\frac{i}{n}\biggr)^{k} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4)$

$\displaystyle P(Y_{k,n}=j)=\binom{n}{j} \sum \limits_{i=0}^{n-j} (-1)^{i} \binom{n-j}{i} \biggl(1-\frac{j+i}{n}\biggr)^{k} \ \ \ \ \ \ \ \ \ \ \ \ \ (5)$

where $j=1,2, \cdots, n-1$.

The notation $\binom{n}{j}$ is the binomial coefficient, which is the number of ways to choose $j$ objects out of $n$ objects where order does not matter. The calculation is defined by $\displaystyle \binom{n}{j}=\frac{n!}{j! (n-j)!}$.

The formula (5) gives the probability of having $j$ empty cells. In throwing $k$ balls into $n$ cells, the probability of having $w$ occupied cells is then $P[Y_{k,n}=n-w]$.
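For readers who want to experiment, here is a direct transcription (ours) of the occupancy formula (5) in Python; setting $j=0$ recovers formula (4).

```python
from math import comb

def occupancy_pmf(j, k, n):
    """P(Y_{k,n} = j): probability of exactly j empty cells
    after throwing k balls into n cells, per formula (5)."""
    return comb(n, j) * sum(
        (-1) ** i * comb(n - j, i) * (1 - (j + i) / n) ** k
        for i in range(n - j + 1)
    )

# sanity check: with k >= 1 balls the number of empty cells is one of 0,...,n-1
print(sum(occupancy_pmf(j, k=10, n=6) for j in range(6)))   # ~1.0
```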

___________________________________________________________________________

The Probability Function

We now discuss the probability function of the random variable $X_n$, namely $P[X_n=k]$ for $k=n,n+1,n+2,\cdots$. The event $X_n=k$ means that all $n$ cells are occupied after throwing $k$ balls with the first $k-1$ balls landing in $n-1$ cells. Putting it in another way, there are zero empty cells after throwing $k$ balls and there is 1 empty cell after throwing the first $k-1$ balls. This can be stated using the notation in the preceding section on the occupancy problem as follows:

$P[X_n=k]=P[Y_{k,n}=0 \text{ and } Y_{k-1,n}=1] \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (6)$

Consider the following derivation.

\displaystyle \begin{aligned} P[X_n=k]&=P[Y_{k,n}=0 \text{ and } Y_{k-1,n}=1] \\&=P[Y_{k,n}=0 \ \lvert \ Y_{k-1,n}=1] \times P[Y_{k-1,n}=1] \\&=\frac{1}{n} \times \binom{n}{1} \sum \limits_{i=0}^{n-1} (-1)^i \binom{n-1}{i} \biggl[ 1-\frac{1+i}{n} \biggr]^{k-1} \\&=\sum \limits_{i=0}^{n-1} (-1)^i \binom{n-1}{i} \biggl[ 1-\frac{1+i}{n} \biggr]^{k-1} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (7) \end{aligned}

where $k=n,n+1,n+2,\cdots$.

Rather than memorizing the probability function in (7), a better way is to focus on the thought process that is inherent in (6).

One comment about the calculation for (7). The summation for $P[X_n=k]$ has $n$ terms. A given probability may involve multiple values of $k$, e.g.

$P[X_6>15]=1-P[6 \le X_6 \le 14]$.

Unless the number of values for $k$ is very small, the calculation should be done using software. Microsoft Excel is an excellent way to perform the calculation. The calculations for the examples below are programmed in Excel.
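In place of a spreadsheet, the probability function (7) and the associated cumulative probabilities can be programmed in a few lines. The following Python sketch (ours) shows the kind of calculation used in the examples below.

```python
from math import comb

def collector_pmf(k, n):
    """P(X_n = k) from formula (7)."""
    return sum(
        (-1) ** i * comb(n - 1, i) * (1 - (1 + i) / n) ** (k - 1)
        for i in range(n)
    )

def prob_more_than(k, n):
    """P(X_n > k) = 1 - P(n <= X_n <= k)."""
    return 1 - sum(collector_pmf(j, n) for j in range(n, k + 1))

print(prob_more_than(15, 6))   # ~0.3558, the value found in Example 1 below
```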

___________________________________________________________________________

Examples

Example 1
Suppose that a fair die is rolled until all 6 faces have appeared. Find the mean number of rolls and the variance of the number of rolls. What is the probability that it will take at least 12 rolls? What is the probability that it will take more than 15 rolls?

Using the notation developed above, the random variable of interest is $X_6$. Its mean and variance are:

$\displaystyle E[X_6]=6 \ \biggl[ 1 + \frac{1}{2}+\frac{1}{3}+\frac{1}{4}+\frac{1}{5}+\frac{1}{6} \biggr]=6 \times 2.45 = 14.7 \approx 15$

$\displaystyle Var[X_6]=\sum \limits_{i=1}^6 \frac{6(i-1)}{[6-(i-1)]^2}=38.99$

The following is the probability function for $X_6$.

$\displaystyle P[X_6=k]=\sum \limits_{i=0}^{5} (-1)^i \binom{5}{i} \biggl[ 1-\frac{1+i}{6} \biggr]^{k-1}$

where $k=6,7,8,\cdots$

For each $k$, the quantity $P[X_6=k]$ requires 6 calculations. Performing the calculations in Excel, the desired probabilities are:

$P[X_6 \ge 12]=1-P[6 \le X_6 \le 11]=1-0.356206419=0.643793581$

$P[X_6 > 15]=1-P[6 \le X_6 \le 15]=1-0.644212739=0.355787261$

Even though the average number of trials is 15, there is still a significant probability that it will take more than 15 trials. This is because the variance is quite large. $\square$

Example 2
An Internet startup is rapidly hiring new employees. What is the expected number of new employees hired until all birth months are represented? Assume that the 12 birth months are equally likely. What is the probability that the company will have to hire more than 25 employees? If the company has already hired more than 25 employees and fewer than 12 birth months are represented, what is the probability that it will have to hire more than 35 employees to have all 12 birth months represented in the company?

The random variable of interest is $X_{12}$. The following shows the mean and probability function.

$\displaystyle E[X_{12}]=12 \ \biggl[ 1 + \frac{1}{2}+\frac{1}{3}+\cdots +\frac{1}{12} \biggr]=37.23852814 \approx 38$

$\displaystyle P[X_{12}=k]=\sum \limits_{i=0}^{11} (-1)^i \binom{11}{i} \biggl[ 1-\frac{1+i}{12} \biggr]^{k-1}$

where $k=12,13,14,\cdots$

Performing the calculation in Excel, we obtain the following probabilities.

$P[X_{12} > 25]=1-P[12 \le X_{12} \le 25]=1-0.181898592=0.818101408$

$P[X_{12} > 35]=1-P[12 \le X_{12} \le 35]=1-0.531821149=0.468178851$

$\displaystyle P[X_{12} > 35 \ \lvert \ X_{12} > 25]=\frac{P[X_{12} > 35]}{P[X_{12} > 25]}=\frac{0.468178851}{0.818101408}=0.572274838$

___________________________________________________________________________

A Special Case

We now consider the special case that the coupon collector only wishes to collect $r$ distinct coupons where $r \le n$. Of course, $n$ is the total number of distinct coupon types. Let $X_{n,r}$ be the minimum number of purchases such that $r$ distinct coupons have been obtained. In the random sampling interpretation, $X_{n,r}$ would be the minimum sample size such that $r$ distinct elements have been chosen from the sample space $S=\left\{1,2,3,\cdots,n \right\}$. The mean and variance of $X_{n,r}$ follow from the same idea. Each $X_{n,r}$ is the independent sum of geometric random variables as in (0).

$X_{n,r}=C_{1}+C_{2}+C_{3}+\cdots+C_{i-1}+C_{i}+\cdots+C_{r} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (8)$

where $r \le n$.

Thus $E[X_{n,r}]$ and $Var[X_{n,r}]$ would be like (1) and (2) except that the summation is through $r$ instead of $n$.

For the probability function of $X_{n,r}$, we only need to tweak the thought process expressed in (6) slightly. For the event $X_{n,r}=k$ to happen, exactly $r$ cells are occupied after throwing $k$ balls with the first $k-1$ balls landing in $r-1$ cells. In other words, there are exactly $n-r$ empty cells after throwing $k$ balls and there are exactly $n-r+1$ empty cells after throwing $k-1$ balls. The following expresses this condition in terms of the occupancy problem, i.e. similar to (6).

$P[X_{n,r}=k]=P[Y_{k,n}=n-r \text{ and } Y_{k-1,n}=n-r+1] \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (9)$

Here are the important components that go into $P[X_{n,r}=k]$, with the first one coming from the occupancy formula (5).

$\displaystyle P[Y_{k-1,n}=n-r+1]=\binom{n}{n-r+1} \sum \limits_{i=0}^{r-1} (-1)^i \binom{r-1}{i} \biggl[ 1-\frac{n-r+1+i}{n} \biggr]^{k-1}$

$\displaystyle P[Y_{k,n}=n-r \ \lvert \ Y_{k-1,n}=n-r+1]=\frac{n-r+1}{n}$

Multiplying the above two probabilities together produces the desired probability $P[X_{n,r}=k]$ for $k=r,r+1,r+2,\cdots$.

$\displaystyle P[X_{n,r}=k]=\binom{n-1}{r-1} \sum \limits_{i=0}^{r-1} (-1)^i \binom{r-1}{i} \biggl[ 1-\frac{n-r+1+i}{n} \biggr]^{k-1} \ \ \ \ \ \ \ \ (10)$

Note that when $n=r$ (collecting the entire set of coupons), formula (10) would be identical to (7). The following example demonstrates the calculation.

Example 3
Consider the 6-coupon case discussed in Example 1. Suppose that the coupon collector is interested in collecting $r=4$ coupons. What is the expected number of purchases to get 4 coupons? What is the probability that it will take more than 6 purchases to get 4 coupons? What is the probability that it will take more than 8 purchases to get 4 coupons? Compare these results with Example 1.

The random variable of interest is $X_{6,4}$. The mean is:

$\displaystyle E[X_{6,4}]=6 \biggl[ \frac{1}{6}+\frac{1}{5}+\frac{1}{4}+\frac{1}{3} \biggr]=5.7 \approx 6$

Note that it is much faster to obtain 4 coupons than the entire set of six. The following gives the probability function for $X_{6,4}$.

$\displaystyle P[X_{6,4}=k]=\binom{5}{3} \sum \limits_{i=0}^{3} (-1)^i \binom{3}{i} \biggl[ 1-\frac{3+i}{6} \biggr]^{k-1}$

where $k=4,5,6,\cdots$

Performing the calculations in Excel gives the following probabilities.

$P[X_{6,4} > 6]=1-P[4 \le X_{6,4} \le 6]=1-0.74845679=0.25154321$

$P[X_{6,4} > 8]=1-P[4 \le X_{6,4} \le 8]=1-0.928712277=0.071287723$

The first probability shows there is still a good chance that it will take more than the mean number of trials to get 4 coupons. However, the wait is much shorter than in Example 1, since the probability of needing more than 8 purchases is fairly small. $\square$
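As a numerical check on formula (10), the following Python sketch (ours) reproduces the Example 3 probabilities.

```python
from math import comb

def partial_collector_pmf(k, n, r):
    """P(X_{n,r} = k) from formula (10)."""
    return comb(n - 1, r - 1) * sum(
        (-1) ** i * comb(r - 1, i) * (1 - (n - r + 1 + i) / n) ** (k - 1)
        for i in range(r)
    )

p_up_to_6 = sum(partial_collector_pmf(k, 6, 4) for k in range(4, 7))
p_up_to_8 = sum(partial_collector_pmf(k, 6, 4) for k in range(4, 9))
print(1 - p_up_to_6, 1 - p_up_to_8)   # ~0.2515 and ~0.0713
```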

___________________________________________________________________________

Moment Generating Function

One distributional quantity that is easy to obtain is the moment generating function (MGF) for the random variable $X_n$ (the case of collecting the entire set of coupons) as well as for the random variable $X_{n,r}$ (the partial case). Since both of these random variables are the independent sum of geometric random variables, their MGFs would simply be the product of the individual geometric MGFs. The following gives the results.

$\displaystyle M_{X_n}(t)=\prod \limits_{i=1}^n \frac{[n-(i-1)] e^t}{n-(i-1) e^t} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (11)$

$\displaystyle M_{X_{n,r}}(t)=\prod \limits_{i=1}^r \frac{[n-(i-1)] e^t}{n-(i-1) e^t} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (12)$

___________________________________________________________________________
$\copyright \ 2017 \text{ by Dan Ma}$

# Every character is known but password is hard to crack

In this post, we discuss an example in which you are given a password (every character of it) and yet it is still very hard (or even impossible) to crack. Anyone who understands this example has a solid understanding of the binomial distribution. Here’s the example:

Your friend John tells you that the password to his online bank account has 26 characters. The first character is the first letter in the English alphabet, the second character is the second letter in the English alphabet, the third character is the third letter in the English alphabet and so on.

Now that your friend John has given you the key to his account, does that mean you can log onto his account to find out how much money he has, to make financial transactions on his behalf, or to enrich yourself?

If this example sounds too good to be true, what is the catch?

Even though every character in the 26-character password is known, it is indeed a very strong password. How could this be? You may want to stop here and think about it.

Indeed, if every character in John’s password is lower case or if every character is upper case, then his bank account is toast. But John’s password can be made very strong and very hard to crack if the password is case sensitive. The password given by John is not just one password, but a large collection of passwords. In fact, there are over 67 million possible passwords (67,108,864 to be exact). The following are two of the most obvious ones.

a b c d e f g h i j k l m n o p q r s t u v w x y z (all lower case)

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z (all upper case)

The following is another possible password. If this is the one John uses, it will be difficult to crack.

a b C d e f G h i j k l M N o p Q R s T U v w X Y z (10 upper case letters)

Here’s another possibility.

A B c D E f G H I J k l M N o P q R S t u v w X y z (14 upper case letters)

Each character in the password has two possibilities – lower case or upper case. Across all 26 characters, there are $2^{26}$ possibilities. This number is 67,108,864. So 2 raised to 26 is a little over 67 million. The password given by John is not just one password, but a generic one with over 67 million possibilities. There is a one in 67 million chance of correctly guessing the password if John chooses the upper case letters at random. These are much better odds than winning the Powerball lottery, one in 292,201,338, which is about one in 292 million. But it is still an undeniably strong password.

So John tells you the password, but has in fact not given up much secret. This is the case if he makes the case sensitivity truly random. Of course, once he sets his password, unless he has a great memory, he will need to write down the positions that are upper case. Otherwise, he will not be able to log back on. But that is a totally separate story.

___________________________________________________________________________

The binomial angle

A good way to represent the passwords is to view each one as a 26-character string of U and L (U stands for upper case and L stands for lower case). Then the above two passwords (the one with 10 upper case letters and the one with 14 upper case letters) are represented as follows:

L L U L L L U L L L L L U U L L U U L U U L L U U L (10 U’s)

U U L U U L U U U U L L U U L U L U U L L L L U L L (14 U’s)

As discussed, there are 67,108,864 such strings. We can also think of each such string as the record of a random experiment consisting of tossing a coin 26 times. We record a U when we get a head and record an L when the coin toss results in a tail. In fact, this is the kind of record that John would need to keep when he sets the password. Such a string would tell him which characters in the password are in upper case and which characters are in lower case. On the other hand, hacking the password would essentially be trying to pinpoint one such string out of 67,108,864 possibilities.

We know 67,108,864 is a large number. Let’s further demonstrate the challenge. In each string, the number of U’s can range from 0 to 26. How many of the strings have 0 U’s? Precisely one (all letters are L). This would be what most people would try first (all the letters are lower case). How many of the strings have exactly 1 U? Thinking about it carefully, we should arrive at the conclusion that there are 26 possibilities (the single U can be in any of the 26 positions). So if the would-be hacker knows that there is only one upper case letter, then it would be easy to break the password. How many of the strings have exactly 2 U’s? If you try to write out all the possible cases, it may take some effort. There are 325 possibilities. So just having two upper case letters in the password seems to make it something approaching a strong password. But the problem is that the two U’s may be guessed easily. John may put the upper case letters at his initials (if his name is John Smith, he may make J and S upper case), or at other obvious letters.

How many of the strings have exactly 3 U’s? This would be really hard to write out by hand. There are 2,600 possibilities. Why stop at having just 3 upper case letters? It is clear that the more upper case letters used in the password, the stronger it is and the harder it is to crack.

How do we know the precise number of possibilities for a given $k$, the number of U’s? The idea is that of choosing $k$ letters out of the 26 letters.

The number of ways of choosing $k$ objects from a total of $n$ objects is denoted by the notation $\binom{n}{k}$. Sometimes the notations $C(n,k)$, $_nC_k$ and $C_{n,k}$ are used. Regardless of the notation, the calculation is

$\displaystyle \binom{n}{k}=\frac{n!}{k! (n-k)!}$

The notation $n!$ is the product of all the positive integers up to and including $n$ (this is called $n$ factorial). Thus $1!=1$, $2!=2$, $3!=6$, $4!=24$. To make the formula work correctly, we make $0!=1$.

The following gives the first several calculations in the 26-character password example.

$\displaystyle \binom{26}{2}=\frac{26!}{2! \ 24!}=\frac{26 \cdot 25}{2}=325$

$\displaystyle \binom{26}{3}=\frac{26!}{3! \ 23!}=\frac{26 \cdot 25 \cdot 24}{6}=2600$

$\displaystyle \binom{26}{4}=\frac{26!}{4! \ 22!}=\frac{26 \cdot 25 \cdot 24 \cdot 23}{24}=14950$

If the desire is to see the patterns, the remaining calculations can be done by using software (or at least a hand held calculator). The following table shows the results.

$\displaystyle \begin{array}{rrr} k &\text{ } & \displaystyle \binom{26}{k} \\ \text{ } & \text{ } & \text{ } \\ 0 &\text{ } & 1 \\ 1 &\text{ } & 26 \\ 2 &\text{ } & 325 \\ 3 &\text{ } & 2,600 \\ 4 &\text{ } & 14,950 \\ 5 &\text{ } & 65,780 \\ 6 &\text{ } & 230,230 \\ 7 &\text{ } & 657,800 \\ 8 &\text{ } & 1,562,275 \\ 9 &\text{ } & 3,124,550 \\ 10 &\text{ } & 5,311,735 \\ 11 &\text{ } & 7,726,160 \\ 12 &\text{ } & 9,657,700 \\ 13 &\text{ } & 10,400,600 \\ 14 &\text{ } & 9,657,700 \\ 15 &\text{ } & 7,726,160 \\ 16 &\text{ } & 5,311,735 \\ 17 &\text{ } & 3,124,550 \\ 18 &\text{ } & 1,562,275 \\ 19 &\text{ } & 657,800 \\ 20 &\text{ } & 230,230 \\ 21 &\text{ } & 65,780 \\ 22 &\text{ } & 14,950 \\ 23 &\text{ } & 2,600 \\ 24 &\text{ } & 325 \\ 25 &\text{ } & 26 \\ 26 &\text{ } & 1 \\ \end{array}$

The pattern is symmetrical. Having too few U’s or too many U’s produces weak passwords that may be easy to guess. Having 6 or 7 U’s seems to give strong passwords. Having half of the letters upper case (13 U’s) is optimal, with the most possibilities (over 10 million). Even if you are given partial information such as “half of the letters are in upper case”, you are still left with over 10 million possibilities to work with!

Dividing each of the above counts by 67,108,864 gives the relative weight (probability) of each case of having exactly $k$ U’s.

$\displaystyle \begin{array}{rrr} k &\text{ } & P[X=k] \\ \text{ } & \text{ } & \text{ } \\ 0 &\text{ } & 0.00000001 \\ 1 &\text{ } & 0.00000039 \\ 2 &\text{ } & 0.00000484 \\ 3 &\text{ } & 0.00003874 \\ 4 &\text{ } & 0.00022277 \\ 5 &\text{ } & 0.00098020 \\ 6 &\text{ } & 0.00343069 \\ 7 &\text{ } & 0.00980198 \\ 8 &\text{ } & 0.02327971 \\ 9 &\text{ } & 0.04655942 \\ 10 &\text{ } & 0.07915102 \\ 11 &\text{ } & 0.11512876 \\ 12 &\text{ } & 0.14391094 \\ 13 &\text{ } & 0.15498102 \\ 14 &\text{ } & 0.14391094 \\ 15 &\text{ } & 0.11512876 \\ 16 &\text{ } & 0.07915102 \\ 17 &\text{ } & 0.04655942 \\ 18 &\text{ } & 0.02327971 \\ 19 &\text{ } & 0.00980198 \\ 20 &\text{ } & 0.00343069 \\ 21 &\text{ } & 0.00098020 \\ 22 &\text{ } & 0.00022277 \\ 23 &\text{ } & 0.00003874 \\ 24 &\text{ } & 0.00000484 \\ 25 &\text{ } & 0.00000039 \\ 26 &\text{ } & 0.00000001 \\ \end{array}$

The cases of $X=k$ where $k=11,12,13,14,15$ add up to 67.3% of the 67,108,864 possibilities.
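The two tables above are easy to regenerate. Here is a short Python sketch (ours) that computes the counts and probabilities and verifies the 67.3% figure.

```python
from math import comb

total = 2 ** 26                        # 67,108,864 possible passwords
for k in range(27):
    print(k, comb(26, k), round(comb(26, k) / total, 8))

middle = sum(comb(26, k) for k in range(11, 16))
print(middle / total)                  # ~0.673 for the cases k = 11,...,15
```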

___________________________________________________________________________

The binomial distribution

Another way to look at it is that in setting the password, John is performing a sequence of 26 independent Bernoulli trials. Here, each trial has two outcomes, a or A, b or B, c or C and so on. For example, the lower case or upper case can be determined by a coin toss. Let $X$ be the number of upper case letters in the 26-character password. Then the random variable $X$ has a binomial distribution with $n=26$ (26 Bernoulli trials) and the probability of success $p=0.5$ in each trial, which is the probability that a character is upper case, assuming that he determines the upper/lower case by a coin toss. The following is the probability function:

$\displaystyle P(X=x)=\binom{26}{x} \biggl[\frac{1}{2}\biggr]^x \biggl[\frac{1}{2}\biggr]^{26-x}=\binom{26}{x} \biggl[\frac{1}{2}\biggr]^{26}$

where $x=0,1,2,\cdots,25,26$. The quantity $\displaystyle P(X=x)$ is the probability that the number of upper case letters is $x$. Here, $\binom{26}{x}$ is the number of ways to choose $x$ letters out of 26 letters and is computed by the formula indicated earlier.

Since the upper/lower case is determined randomly, another way to state the probability function of the random variable $X$ is:

$\displaystyle P(X=x)=\displaystyle \frac{\binom{26}{x}}{2^{26}}=\frac{\binom{26}{x}}{67108864} \ \ \ \ \ \ \ \ \ x=0,1,2,3,\cdots,24,25,26$

The expected value of this random variable $X$ is 13. This is the average number of upper case letters if the case is determined randomly, and it produces the strongest passwords on average. If John does not determine the cases at random, the password may not be as strong, or a would-be hacker may be able to guess it.

Stepping away from the 26-character password example, here’s the probability function of a binomial distribution in general.

$\displaystyle P(X=x)=\binom{n}{x} \ p^x \ (1-p)^{n-x} \ \ \ \ \ \ \ \ x=0,1,2,3,\cdots,n-1,n$

This model describes the random experiment of running $n$ independent trials, where each trial has two outcomes (the technical term is Bernoulli trial). In each trial, the probability of one outcome (called success) is $p$ and the probability of the other outcome (called failure) is $1-p$. The random variable $X$ counts the number of successes whenever such an experiment is performed. The probability $P(X=x)$ gives the likelihood of achieving $x$ successes.

As an example, if John has a bias toward lower case letters, then the probability of success (upper case) may be $p=0.4$ (assuming that the random variable $X$ still counts the number of upper case letters). Then the average number of upper case letters in a randomly determined password is 26 x 0.4 = 10.4.
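For completeness, the general binomial probability function takes only a line or two of code. The following Python sketch (ours) also checks the mean for the biased case $p=0.4$.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for a binomial distribution with parameters n and p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# mean number of upper case letters when p = 0.4
print(sum(x * binom_pmf(x, 26, 0.4) for x in range(27)))   # 10.4
```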

___________________________________________________________________________

Interesting binomial distribution problems

The problem of points and the dice problem are two famous probability problems that are in the history book as a result of a gambler seeking help from Pascal and Fermat.

___________________________________________________________________________
$\copyright \ 2016 \text{ by Dan Ma}$

# The problem of points

There are two celebrated problems in probability that originated from the French professional gambler Chevalier de Méré (1607-1684). The problems were solved jointly by Blaise Pascal (1623-1662) and Pierre de Fermat (1601-1665) in a series of letters. The ideas discussed in these letters are often credited with having started probability theory. In a previous post, we discuss one of the problems posed by Chevalier de Méré to Pascal (the dice problem). In this post, we discuss the second problem – the problem of points.

___________________________________________________________________________

Describing the Problem

Here’s a description of the famous problem of points. Two players play a game of chance with the agreement that each player puts up equal stakes and that the first player who wins a certain number of rounds (or points) will collect the entire stakes. Suppose that the game is interrupted before either player has won. How do the players divide the stakes fairly?

It is clear that the player who is closer to winning should get a larger share of the stakes. Since the player who had won more points is closer to winning, the player with more points should receive a larger share of the stakes. How do we quantify the differential?

To describe the problem in a little more detail, suppose that two players, A and B, play a series of points in a game such that player A wins each point with probability $p$ and that player B wins each point with probability $1-p$. The first player to win $T$ points wins the game. Suppose that the game is stopped for some reason. At the time of stopping, player A has won $a$ points and player B has won $b$ points with $a<T$ and $b<T$. How do they divide the stakes? Note that the pot is contributed equally by the two players.

In attacking the problem, Pascal’s idea is that the share of the stakes that is received by a player should be proportional to his/her probability of winning if the game were to continue at the point of stopping. Let’s explore this idea by looking at examples.

___________________________________________________________________________

Looking at the Problem thru Examples

The following examples are based on the following rule. Let’s say two players (A and B) play a series of points with equal probability of winning a point at each round. Each player puts up a stake of 32. The first player who wins four points takes the entire stakes.

Example 1 (Fermat’s Approach)
Suppose that player A has won 2 points and player B has won one point right before the termination of the game. How can the stakes be divided fairly?

In the analysis, we assume that the game continues. Then player A needs 2 more points to win while player B needs 3 more points to win. We would like to calculate the probability that player A wins 2 points before player B wins 3 points. Consider the next 2 + 3 – 1 = 4 rounds (assuming one point per round). If player A wins at least 2 points in the next 4 rounds, player A wins the game. The complement of this probability would be the probability that player B wins the game.

Let S (success) be the event that player A wins a point and let F (failure) be the event that player A loses a point (i.e. player B winning the point). Let’s write out all the outcomes of playing 4 points (this was the approach of Fermat). There are 16 such outcomes.

$\begin{array}{rrrrrr} 1 & SSSS & * & & \text{ } & \\ 2 & SSSF & * & & \text{ } & \\ 3 & SSFS & * & & \text{ } & \\ 4 & SSFF & * & & \text{ } & \\ 5 & SFSS & * & & \text{ } & \\ 6 & SFSF & * & & \text{ } & \\ 7 & SFFS & * & & \text{ } & \\ 8 & SFFF & \text{ } & & \text{ } & \\ 9 & FSSS & * & & \text{ } & \\ 10 & FSSF & * & & \text{ } & \\ 11 & FSFS & * & & \text{ } & \\ 12 & FSFF & \text{ } & & \text{ } & \\ 13 & FFSS & * & & \text{ } & \\ 14 & FFSF & \text{ } & & \text{ } & \\ 15 & FFFS & \text{ } & & \text{ } & \\ 16 & FFFF & \text{ } & & \text{ } & \\ \end{array}$

Player A wins in eleven of the outcomes (the ones marked with an asterisk). Note that each marked outcome has at least 2 S’s. Thus the probability of player A winning is 11/16 = 0.6875. At the time the game is stopped, player A has a 68.75% chance of winning (if the game were to continue). The share for player A is 0.6875 x 64 = 44.
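Fermat’s enumeration can also be carried out by brute force on a computer. The following Python sketch (ours) lists the 16 outcomes and counts the ones in which player A wins at least 2 of the 4 points.

```python
from itertools import product

outcomes = list(product("SF", repeat=4))                     # S = player A wins the point
wins = [o for o in outcomes if o.count("S") >= 2]
print(len(wins), len(outcomes), len(wins) / len(outcomes))   # 11 16 0.6875
```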

The example demonstrates the approach taken by Fermat. He essentially converted the original problem of points into an equivalent problem, i.e. finding the probability of player A winning the game if the game were to continue. Then he used combinatorial methods to count the number of cases that result in player A winning. In this example, the additional four points that are to be played are fictitious moves (the moves don’t have to be made) but are useful for finding the solution. The only drawback in Fermat’s approach is that it relies on counting. What if the number of points involved is large?

Example 2 (Pascal’s Approach)
The specifics of the example are the same as in Example 1. Listing out all possible cases, as in Example 1, makes the solution easy to see. But if the number of points is large, the counting could become difficult to manage. What we need is an algorithm that is easy to use and easy to implement on a computer.

Pascal essentially had the same thinking as Fermat, i.e. to base the solution on the probability of winning if the game were to continue. Pascal also understood that the original problem of points is equivalent to the problem of playing an additional series of points. In this example, that means playing an additional 2 + 3 – 1 = 4 points. As in Example 1, we find the probability that player A wins 2 or more points in this series of 4 points. Pascal’s way of finding this probability was based on what are now known as Pascal’s triangle and the binomial distribution. We would use the following modern-day notation:

$\displaystyle \sum \limits_{j=2}^4 \ \binom{4}{j} \ \frac{1}{2^4}=6 \times \frac{1}{2^4}+4 \times \frac{1}{2^4}+\frac{1}{2^4}=\frac{11}{16}=0.6875$

Note that the above probability of 0.6875 is the probability of having at least 2 successes in 4 trials (with 0.5 probability of success in each trial). Anyone with a good understanding of the binomial distribution can carry out the calculation (or use software). Of course, this mathematical construct came from Pascal! To the contemporaries of Pascal and Fermat, this concept was definitely not commonplace.

Example 3
Suppose that player A has won 1 point and player B has won no point at the time of termination of the game. How can the prize money be divided fairly?

Based on the discussion of Example 1 and Example 2, player A needs to win at least 3 points to win the game and player B needs to win 4 or more points to win the game. The extended series of points would have 3 + 4 – 1 = 6 points. Player A then needs to win at least 3 points out of 6 points (at least 3 successes out of 6 trials). The following gives the probability of player A winning the extended series of plays.

$\displaystyle \sum \limits_{j=3}^6 \ \binom{6}{j} \ \frac{1}{2^6}=\frac{42}{64}=0.65625$

With the total stakes being 64, the share for player A would be 64 x 42/64 = 42. The share for player B would be 22.

___________________________________________________________________________

General Discussion

We now discuss the ideas that are brought up in the examples. As indicated above, two players, A and B, contribute equally to the stakes and then play a series of points until one of them wins $T$ points. The probability that player A wins a round (one point in each round) is $p$ and the probability that player B wins a point is $1-p$. Suppose that the game is stopped for some reason before either player has won. At the time of stopping, player A has won $a$ points and player B has won $b$ points with $a<T$ and $b<T$. Let $n=T-a$ and $m=T-b$. The key to solving the problem of points is to look at an extended play of $n+m-1$ points.

Here’s the great insight that came from Pascal and Fermat. They looked forward and not backward. They did not base the solution on the number of points that are already won. Instead, they focused on an extended series of points to determine the share of the winning. This additional play of points is “fictitious” but it helps clarify the process. In essence, they turned the original problem of points into a problem about this additional play of $n+m-1$ points.

The original problem is: what is the fair share for player A when the game is stopped prematurely with player A having won $a$ points and player B having won $b$ points? The equivalent problem is: what is the probability of player A winning at least $n$ points of the next $n+m-1$ points, where $n=T-a$ and $m=T-b$ (assuming the game were to continue)? Let’s call this probability $P(n,m)$. Let’s examine the setting of this probability. Each point is like a Bernoulli trial – either a success (player A winning it) or a failure (player B winning it). There are $n+m-1$ trials. The probability of success is $p$ in each trial. We wish to find the probability that there are at least $n$ successes. What is being described is a binomial distribution. The probability being sought is:

$\displaystyle P(n,m)=\sum \limits_{j=n}^{n+m-1} \ \binom{n+m-1}{j} \ p^j \ (1-p)^{n+m-1-j} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)$

where $n=T-a$, $m=T-b$ and that $a$ is the number of points won by player A and $b$ is the number of points won by player B at the time the game is terminated.

The probability $P(n,m)$ is the probability of player A winning the game if the game were to continue at the point of termination. This probability $P(n,m)$ is the proportion of the stakes that would be awarded to player A. Of course, the proportion that should be awarded to player B would be $1-P(n,m)$. The quantity $P(n,m)$ can be obtained from the given parameters and by using any software that has a function for the binomial distribution.

The problem of points seems to have an easy solution since the answer $P(n,m)$ is so accessible. Anyone who understands the binomial distribution can comprehend it. It is also easy to compute probabilities for a binomial distribution using a calculator or software. One thing to keep in mind is that the solution looks accessible now because of the tools and concepts that came from the trails blazed by Pascal and Fermat (and because of the computing tools that we have). Tools and concepts such as Pascal’s triangle and the binomial distribution were not commonplace at the time of Pascal and Fermat.
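Computing $P(n,m)$ from (1) takes only a few lines of code. The following Python sketch (ours) reproduces the shares computed in the examples.

```python
from math import comb

def share_of_A(n, m, p=0.5):
    """P(n, m): probability that player A wins n points before B wins m points."""
    total = n + m - 1
    return sum(comb(total, j) * p ** j * (1 - p) ** (total - j)
               for j in range(n, total + 1))

print(share_of_A(2, 3))            # 0.6875   (Examples 1 and 2)
print(share_of_A(3, 4))            # 0.65625  (Example 3)
print(share_of_A(3, 6, p=0.48))    # ~0.8276  (Example 5 below)
```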

___________________________________________________________________________

Working Backward

For us, the calculation in $(1)$ is easily done using a calculator or software. Pascal did not calculate $P(n,m)$ directly; instead he performed the calculation backward by using the following formula.

$P(n,m)=p \times P(n-1,m)+(1-p) \times P(n,m-1) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)$

The formula can be derived mathematically. But doing that is not necessary. The quantity $P(n,m)$ is the probability that player A will win $n$ points before player B winning $m$ points. We can derive the above formula by conditioning on the outcome of the first point. The quantity $P(2,3)$ is calculated in Example 1 and Example 2. It is the average of two similar probabilities with smaller parameters.

$P(2,3)=\frac{1}{2} \times P(1,3)+\frac{1}{2} \times P(2,2)$

Based on the recursive formula in $(2)$, Pascal built up the answer backward, similar to the way a computer program is written. In fact, this recursive approach allows us to solve not just for one scenario, but for all the scenarios in a game of $T$ points.

Example 4
We now revisit the problem in Examples 1 and 2. Recall that the game is to play for 4 points, i.e. the first player winning 4 points collects the entire stakes of 64 (32 is contributed by each player). Each player earns a point with probability 0.5. We now show how to divide the stakes when the game is stopped at every possible stopping point.

The following diagram (Figure 1) shows the table for the share awarded to player A. The table is empty except for the top row and rightmost column (the numbers in red). The number 64 shown in the last column would be the amount awarded to player A because player A has won 4 points. The number 0 in the top row means that player B has won 4 points, so player A gets nothing. Note that the bottom row highlighted in orange shows the number of points that have been won by player A. The bottom row highlighted in blue shows the remaining points that player A needs to win in order to win the entire stakes (these are the fictitious points). Similarly, the columns highlighted in orange and blue on the left show the same information for player B.

Figure 1 – The share of the stakes awarded to player A

Now, we can use the recursive formula in $(2)$ to fill the table. Basically each cell is the average of the number above it and the number to the right. The parameter $n$ in $P(n,m)$ is a number in the blue row. The parameter $m$ in $P(n,m)$ is a number in the blue column. The following shows the results.

Figure 2 – The share of the stakes awarded to player A

For example, when player A has won 2 points and player B has won 1 point, the share for player A is 44 (the average of 32 and 56), the same answer as in Example 1 and Example 2. When player A has won 2 points and player B has won 2 points, both players are in equally competitive positions, so each player gets 32. When player A has won 2 points and player B has won 3 points, the share for player A is 16 (the average of 0 and 32).

Essentially the formula in $(2)$ is the idea of using smaller steps rather than the entire extended play of $n+m-1$ points. This idea of smaller steps was preferred by Pascal. At the point where player A needs $n$ more points to win and player B needs $m$ more points to win, the idea is to play one more point. Suppose that the players know the awards to the two players after one more round. Then they should split the difference between the future awards. The calculation should begin at the point where each player only needs one more point to win (the cell with 32 in Figure 2). In that cell, we know the awards after one additional round. Taking the average of 0 and 64 gives 32. From that cell, we move down and move left on the table. Keep repeating the process until the entire table is filled in.
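The backward-filling procedure is easy to program. The following Python sketch (ours) builds the table in Figure 2 from formula (2) for a game to $T=4$ points with total stakes of 64; the winning probability $p$ is a parameter, so the same code handles the unequal case discussed next.

```python
T, stakes, p = 4, 64, 0.5
# shares[a][b] = award to player A when A has won a points and B has won b points
shares = [[0.0] * (T + 1) for _ in range(T + 1)]
for b in range(T):
    shares[T][b] = stakes                     # player A has already won the game
for a in range(T):
    shares[a][T] = 0                          # player B has already won the game
for a in range(T - 1, -1, -1):
    for b in range(T - 1, -1, -1):
        # formula (2): condition on the outcome of the next point
        shares[a][b] = p * shares[a + 1][b] + (1 - p) * shares[a][b + 1]

print(shares[2][1])   # 44.0, as in Examples 1 and 2
print(shares[1][0])   # 42.0, as in Example 3
```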

There is a way to tweak the table approach to work for unequal winning probabilities of a point. Let's say the probability of player A winning a point is 0.6. Then the probability of player B winning a point is 0.4. The value of a given cell in the table would be the weighted average of the cell to the right (with weight 0.6) and the cell above it (with weight 0.4). When we know the results from playing one more round, we assign weight 0.6 to the result of player A winning the point and weight 0.4 to the result of player B winning the point. The following table shows the results.

Figure 3 – The share of the stakes awarded to player A (with 0.6 weight)

The direct formula $(1)$ or the table approach using the recursive formula $(2)$ can easily be programmed on a computer. For Pascal, the table approach is a very attractive option since the calculation can be built up from lower parameter values. In the above configuration, simply fill in the right column (the entire stakes going to player A) and the top row (player A getting nothing). Then the remaining cells are obtained by weighted averages as described in formula $(2)$.
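To make the idea concrete, here is a minimal Python sketch of the table approach, assuming the layout described above. The function name `share` and the use of recursion with memoization are illustrative choices, not part of Pascal's original presentation.

```python
# A sketch of the table approach based on formula (2).
# share(n, m, p) is the fraction of the stakes awarded to player A
# when A needs n more points, B needs m more points, and A wins each
# point with probability p.
from functools import lru_cache

@lru_cache(maxsize=None)
def share(n, m, p=0.5):
    if n == 0:      # player A has already won the game
        return 1.0
    if m == 0:      # player B has already won the game
        return 0.0
    # condition on the next point, as in formula (2)
    return p * share(n - 1, m, p) + (1 - p) * share(n, m - 1, p)

# Game to 4 points with stakes of 64 (Examples 1 and 2):
# A has won 2 points and B has won 1, so A needs 2 more and B needs 3 more.
print(64 * share(2, 3))        # 44.0, matching Figure 2
print(64 * share(2, 2))        # 32.0
print(64 * share(2, 3, 0.6))   # the corresponding entry with 0.6 weight (cf. Figure 3)
```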

We present one more example.

Example 5
Suppose that player B is a casino and player A is a visitor at the casino playing for 12 points. The house edge is 2%, so the probability of player A winning a round is 0.48. If player A desires to leave after winning 9 points while the house has won 6 points, what is the proportion of the stakes that should be awarded to player A?

Playing for 12 points, player A needs 3 more points to win and the house needs 6 more points to win. So we need to analyze an extended play of 3 + 6 – 1 = 8 points. For player A to win the extended play, he needs to win at least 3 points.

$\displaystyle P(3,6)=\sum \limits_{j=3}^8 \ \binom{8}{j} \ 0.48^j \ 0.52^{8-j}=0.827631917$

The answer can be obtained by computing each term in the sum (from $j=3$ to $j=8$). Another way is to use the BINOMDIST function in Excel as follows:

= 1-BINOMDIST(2, 8, 0.48, TRUE) = 1-0.172368083 = 0.827631917

Based on the fair division method discussed in this blog post, player A deserves 82.76% of the stakes.
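As a quick check of Example 5, the binomial sum above can be evaluated in a few lines of Python (a sketch, using only the standard library):

```python
# Verify P(3, 6) for Example 5 by summing the binomial probabilities directly.
from math import comb

p = 0.48   # probability that player A wins a point
prob = sum(comb(8, j) * p**j * (1 - p)**(8 - j) for j in range(3, 9))
print(prob)   # approximately 0.8276, matching the Excel calculation
```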

___________________________________________________________________________

Remarks

In their correspondence, Pascal and Fermat came up with a convincing and consistent solution to the problem of points. The earlier solutions to the problem of points were not satisfactory (to all concerned) and were sometimes inconsistent. Dividing the stakes based only on the number of points each player has already won can produce extreme results. For example, if player A has won 1 point and player B has won no points, then player A would get the entire stakes. For the game described in Figure 2, in the same scenario, player A gets 42 out of 64 (42 / 64 = 0.65625), which is far from 100%.

For more detailed information on the history of the problem of points, see “A History of Mathematics” by Victor J. Katz.

___________________________________________________________________________
$\copyright \ 2016 \text{ by Dan Ma}$

# When a gambler asked a mathematician for help

When a gambler consistently loses large sums of money, what can he or she do? When one particular gambler, Chevalier de Méré (1607-1684), was losing a big fortune, he called a “mathematical help line”. In fact, his correspondence with Blaise Pascal (1623-1662) earned him a place in the history books. The problems that were presented by de Méré, jointly worked on by Pascal and Pierre de Fermat (1601-1665), are regarded as the beginning of the emerging academic field of probability. Chevalier de Méré was in need of help with two problems – the problem of points and the dice problem that now bears his name. In this post we discuss the dice problem. The problem of points is discussed in the next post.

___________________________________________________________________________

The Dice Problem

There were two dice problems from Chevalier de Méré. The first game involves four rolls of a fair die. In this game, de Méré bet with even odds on rolling at least one six when a fair die is rolled four times. His reasoning was that since the chance of getting a six in one roll of a die is $\frac{1}{6}$ (correct), the chance of getting a six in four rolls of a die would be $4 \times \frac{1}{6}=\frac{2}{3}$ (incorrect). With favorable odds of 67% of winning, he reasoned that betting with even odds would be a profitable proposition. Though his calculation was incorrect, he made a considerable amount of money over many years playing this game.

The second game involves twenty-four rolls of a pair of fair dice. The success in the first game emboldened de Méré to make even bets on rolling one or more double sixes in twenty-four rolls of a pair of dice. His reasoning was that the chance of getting a double six in one roll of a pair of dice is $\frac{1}{36}$ (correct). Then the chance of getting a double six in twenty-four rolls of a pair of dice would be $24 \times \frac{1}{36}=\frac{2}{3}$ (incorrect). He again reasoned that betting with even odds would be profitable.

But experience showed otherwise. As he lost a lot of money, he realized something was not quite right with the second game. In 1654, he challenged his renowned friend Blaise Pascal to find an explanation. The solution emerged in a series of letters between Pascal and Fermat. Out of this joint effort, a foundation was laid for the idea of probability as an academic subject. One particular idea that emerged was the Pascal triangle. Another one was the binomial distribution. In fact, anyone who understands the binomial distribution can very quickly see the faulty reasoning of de Méré.

___________________________________________________________________________

The Simulation

Before we get to the calculation, let's simulate the games played by de Méré. Using random numbers generated by the Rand() function in Excel, we simulated 100,000 iterations of each of the games. In our 100,000 simulations of the first game – rolling a die four times – there are 51,380 iterations with at least one six. This suggests that de Méré's position would win more than half of the time, though not with the $\frac{2}{3}$ odds that he believed. But it was profitable for him nonetheless.

In our 100,000 simulations of the second game – rolling a pair of dice 24 times – there are only 49,211 iterations with at least one double six. This seems to support the view that de Méré's position is a losing proposition, i.e. that he would lose his bets more than half the time.
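The simulation described above can also be done outside of Excel. The following is a small Python sketch of the two games; the iteration count and function names are illustrative choices.

```python
# Simulate de Méré's two dice games.
import random

def game1():
    # at least one six in four rolls of a fair die
    return any(random.randint(1, 6) == 6 for _ in range(4))

def game2():
    # at least one double six in 24 rolls of a pair of fair dice
    return any(random.randint(1, 6) == 6 and random.randint(1, 6) == 6
               for _ in range(24))

trials = 100_000
print(sum(game1() for _ in range(trials)) / trials)  # close to the true value 0.5177
print(sum(game2() for _ in range(trials)) / trials)  # close to the true value 0.4914
```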

Of course, de Méré could have done a similar simulation, though on a much smaller scale, by rolling the dice himself (say, 100 times). He could have seen the light much sooner.

___________________________________________________________________________

The Calculation

Let's see why the first game was profitable for de Méré and why the second game was not.

The First Game
In a roll of a die, there are six possible outcomes: 1, 2, 3, 4, 5, 6. If the die is fair, the probability of getting a six is $\frac{1}{6}$. Likewise, the probability of getting no six in one roll of a fair die is $\frac{5}{6}$.

The probability of getting no six in four rolls is:

$\displaystyle P(\text{no six in four rolls})=\frac{5}{6} \times \frac{5}{6} \times \frac{5}{6} \times \frac{5}{6}=\biggl(\frac{5}{6}\biggr)^4=0.482253$.

Thus in four rolls of a fair die, the probability of getting at least one six is:

\displaystyle \begin{aligned}P(\text{at least one six in four rolls})&=\displaystyle 1 - P(\text{no six in four rolls})\\&=1 - 0.482253\\&=0.517747\end{aligned}

Thus the probability of getting at least one six in four rolls of a fair die is 0.517747. Out of 100 games, de Méré would on average win 52 games. Out of 1000 games, he would on average win 518 games. Suppose each bet is one French franc. Then de Méré would gain 36 francs for every 1000 francs wagered. Thus he effectively had a house's edge of about 3.6%.

The Second Game
In a roll of a pair of dice, there are a total of 36 possible outcomes (i.e. the six outcomes of the first die combined with the six outcomes of the second die). Out of these 36 outcomes, only one of them is a double six. So, the probability of getting a double six is $\frac{1}{36}$ in rolling a pair of dice. Likewise, the probability of not getting a double six is $\frac{35}{36}$.

The probability of getting no double six in 24 rolls of a pair of dice is:

\displaystyle \begin{aligned}P(\text{no double six in 24 rolls})= \biggl(\frac{35}{36}\biggr)^{24}=0.5086\end{aligned}

Thus the probability of getting at least one double six in 24 rolls is:

\displaystyle \begin{aligned}P(\text{at least one double six in 24 rolls})&=\displaystyle 1 - P(\text{no double six in 24 rolls})\\&=1 - 0.5086\\&=0.4914\end{aligned}

Thus the probability of getting at least one double six in 24 rolls of a pair of fair dice is 0.4914. On average, de Méré would only win about 49 games out of 100 and his opposing side would win about 51 games out of 100. Out of 1000 games, he would on average win 491 games (the opposing side would win on average 509 games). With each bet being one franc, the opposing side of de Méré would win 18 francs for every 1000 francs wagered (thus the opposing side had a house's edge of about 1.8%).
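Both probabilities can be reproduced in one short calculation (a Python sketch):

```python
# Exact probabilities for the two games.
p_game1 = 1 - (5 / 6) ** 4      # at least one six in four rolls
p_game2 = 1 - (35 / 36) ** 24   # at least one double six in 24 rolls
print(p_game1)   # 0.51774...
print(p_game2)   # 0.49140...
```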

The odds indicated by the simulations discussed above are in line with the calculated results. It would be interesting to know what action de Méré took after learning the answers. Maybe he stopped playing the second game and only played the first game. Maybe he modified the second game so that the odds of winning for him were at least even (or better).

___________________________________________________________________________
$\copyright \ 2016 \text{ by Dan Ma}$

# A randomized definition of the natural logarithm constant

The number $e$ is the base of the natural logarithm. It is an important constant in mathematics, approximately 2.718281828. This post discusses a charming little problem involving the number $e$. This number can be represented in many ways. In a calculus course, the number $e$ may be defined as the upper limit of the following integral:

$\displaystyle \int_1^e \frac{1}{t} \ dt=1$

Another representation is that it is the sum $\displaystyle e=\sum_{n=0}^\infty \frac{1}{n!}$. Still another is that it is the limit $\displaystyle e=\lim_{n \rightarrow \infty} \biggl( 1+\frac{1}{n} \biggr)^n$. According to the authors of a brief article from the Mathematics Magazine [1], these textbook definitions do not give immediate insight about the number $e$ (the article can be found here). As a result, students come away from the course without a solid understanding of the number $e$ and may have to resort to rote memorization of what the number $e$ is. Instead, the article gives six probability-oriented ways in which the number $e$ can occur. These occurrences of $e$ are more interesting and, according to the authors of [1], can potentially increase students' appreciation of the number $e$. In two of these six examples (Example 2 and Example 5), the number $e$ is defined by drawing random numbers from the interval $(0,1)$. This post discusses Example 2.

________________________________________________________________________

Random Experiment

Here’s the description of the random experiment that generates the number $e$. Randomly and successively select numbers from the interval $(0, 1)$. The experiment terminates when the sum of the random numbers exceeds 1. What is the average number of selections in this experiment? In other words, we are interested in the average length of the experiment. According to the article [1], the average length is the number $e$. The goal here is to give a proof of this result.

As illustration, the experiment is carried out 1 million times using random numbers that are generated in Excel using the Rand() function. The following table summarizes the results.

$\left[\begin{array}{rrrrrr} \text{Length of} & \text{ } & \text{ } & \text{ } & \text{Relative} \\ \text{Experiment} & \text{ } & \text{Frequency} & \text{ } & \text{Frequency} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ 2 & \text{ } & 500777 & \text{ } & 0.500777 \\ 3 & \text{ } & 332736 & \text{ } & 0.332736 \\ 4 & \text{ } & 124875 & \text{ } & 0.124875 \\ 5 & \text{ } & 33465 & \text{ } & 0.033465 \\ 6 & \text{ } & 6827 & \text{ } & 0.006827 \\ 7 & \text{ } & 1130 & \text{ } & 0.001130 \\ 8 & \text{ } & 172 & \text{ } & 0.000172 \\ 9 & \text{ } & 14 & \text{ } & 0.000014 \\ 10 & \text{ } & 4 & \text{ } & 0.000004 \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{Total} & \text{ } & 1000000 & \text{ } & \text{ } \end{array}\right]$

The average length of experiment in these 1 million experiments is 2.717001. Even though the rate of convergence to the number $e$ is fairly slow, the simulated data demonstrates that on average it takes approximately $e$ number of simulations of numbers in the unit interval to get a sum that exceeds 1.
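For readers who wish to reproduce the simulation, the following is a small Python sketch of the experiment (the function name and the number of trials are illustrative choices):

```python
# Draw uniform(0,1) numbers until the running sum exceeds 1 and
# record how many draws were needed; average over many trials.
import random

def length_of_experiment():
    total, count = 0.0, 0
    while total <= 1:
        total += random.random()
        count += 1
    return count

trials = 1_000_000
average = sum(length_of_experiment() for _ in range(trials)) / trials
print(average)   # close to e = 2.718281828...
```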

________________________________________________________________________

A proof

Let $U_1,U_2,U_3,\cdots$ be a sequence of independent and identically distributed random variables such that the common distribution is a uniform distribution on the unit interval $(0,1)$. Let $N$ be defined as follows:

$\displaystyle N=\text{min}\left\{n: U_1+U_2+\cdots+U_n>1 \right\} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)$

The objective is to determine $E(N)$. On the surface, it seems that we need to describe the distribution of the independent sum $X_n=U_1+U_2+\cdots+U_n$ for all possible $n$. Doing this may be possible but the result would be messy. It turns out that we do not need to do so. We only need to evaluate the probability $P(X_n \le 1)$ for all $n$. We show that

$\displaystyle F_n(x)=P(X_n \le x)=\frac{x^n}{n!} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)$

for $0 \le x \le 1$ and for $n=1,2,3,\cdots$. This is accomplished by an induction proof. This is true for $n=1$ since $X_1=U_1$ is a uniform distribution. Suppose (2) holds for the integer $n-1$. This means that $\displaystyle F_{n-1}(x)=P(X_{n-1} \le x)=\frac{x^{n-1}}{(n-1)!}$ for $0 \le x \le 1$. Note that $X_n=X_{n-1}+U_n$, which is an independent sum. Let’s write out the convolution formula for this independent sum:

\displaystyle \begin{aligned} F_n(x)&=\int_{0}^x F_{n-1}(x-y) \cdot f_{U_n}(y) \ dy \\&=\int_{0}^x \frac{(x-y)^{n-1}}{(n-1)!} \cdot 1 \ dy \\&=\frac{1}{(n-1)!} \int_0^x (x-y)^{n-1} \ dy \\&=\frac{x^n}{n!} \end{aligned}

The above derivation completes the proof of the claim. We now come back to the problem of evaluating the mean of the random variable $N$ defined in (1). First, note that $N>n$ if and only if $X_n=U_1+U_2+\cdots+U_n \le 1$. So we have the probability statement $\displaystyle P(N>n)=F_n(1)=\frac{1}{n!}$.

As a result, the following is the probability function of the random variable $N$.

\displaystyle \begin{aligned} P(N=n)&=P(N>n-1)-P(N>n) \\&=F_{n-1}(1)-F_n(1) \\&=\frac{1}{(n-1)!}-\frac{1}{n!} \\&=\frac{n-1}{n!} \end{aligned}

Now evaluate the mean.

\displaystyle \begin{aligned} E(N)&=\sum \limits_{n=2}^\infty \frac{n(n-1)}{n!} \\&=\sum \limits_{n=2}^\infty \frac{1}{(n-2)!} \\&=\sum \limits_{m=0}^\infty \frac{1}{m!} \\&=e \end{aligned}

With the above derivation, the proof is complete: the average number of random numbers that must be selected in order to obtain a sum exceeding 1 is $e=2.718281828 \cdots$.
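The result can also be checked numerically from the probability function just derived. The following sketch sums the first few terms of $P(N=n)=(n-1)/n!$ and of the series for $E(N)$:

```python
# Numerical check: the probabilities (n-1)/n! sum to 1 and
# the partial sums of E(N) converge to e.
from math import factorial, e

expected = sum(n * (n - 1) / factorial(n) for n in range(2, 30))
total_prob = sum((n - 1) / factorial(n) for n in range(2, 30))
print(expected)     # 2.718281828...
print(e)            # 2.718281828459045
print(total_prob)   # 1.0
```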

________________________________________________________________________

Reference

1. Shultz H. S., Leonard B., Unexpected Occurrences of the Number e, Mathematics Magazine, October 1989, Volume 62, Number 4, pp. 269-271.

________________________________________________________________________
$\copyright \ \text{2015 by Dan Ma}$

# Introducing the Lognormal Distribution

This post introduces the lognormal distribution and discusses some of its basic properties. The lognormal distribution is a transformation of the normal distribution through exponentiation. The basic properties of the lognormal distribution discussed here are derived from the normal distribution. The normal distribution is applicable in many situations but not in all situations. The normal density curve is a bell-shaped curve and is thus not appropriate in phenomena that are skewed to the right. In such situations, the lognormal distribution can be a good alternative to the normal distribution.

________________________________________________________________________

Defining the distribution

In this post, the notation $\log$ refers to the natural log function, i.e., logarithm to the base $e=2.718281828459 \cdots$.

A random variable $Y$ is said to follow a lognormal distribution if $\log(Y)$ follows a normal distribution. A lognormal distribution has two parameters $\mu$ and $\sigma$, which are the mean and standard deviation of the normal random variable $\log(Y)$. To be more precise, the definition is restated as follows:

A random variable $Y$ is said to follow a lognormal distribution with parameters $\mu$ and $\sigma$ if $\log(Y)$ follows a normal distribution with mean $\mu$ and standard deviation $\sigma$.

Many useful probability distributions are transformations of other known distributions. The above definition shows that a normal distribution is the transformation of a lognormal distribution under the natural logarithm: start with a lognormal distribution, and taking the natural log of it gives a normal distribution. The other direction is actually more informative, i.e., a lognormal distribution is the transformation of a normal distribution by the exponential function: start with a normal random variable $X$, and exponentiating it gives a lognormal random variable, i.e., $Y=e^{X}$ has a lognormal distribution. The following summarizes these two transformations.

$Y \text{ is lognormal} \longrightarrow X=\log(Y) \text{ is normal}$

$X \text{ is normal} \longrightarrow Y=e^{X} \text{ is lognormal}$

Since the exponential function gives positive values, the lognormal distribution always takes on positive real values. The following diagram shows the probability density functions of the standard normal distribution and the corresponding lognormal distribution. Recall that the standard normal distribution is the normal distribution with mean 0 and standard deviation 1.

Figure 1 – normal and lognormal density curves

In Figure 1, the standard normal density curve is a symmetric bell-shaped curve, with mean and median located at x = 0. The standard lognormal density (it is called standard since it is derived from the standard normal distribution) is located entirely over the positive x-axis. This is because the exponential function always gives positive values regardless of the sign of the argument. The lognormal density curve in Figure 1 is not symmetric and is a uni-modal curve skewed toward the right. All the standard normal values on the negative x-axis, when exponentiated, fall in the interval (0, 1). Thus in Figure 1, the lower half of the lognormal probabilities lie in the interval from x = 0 to x = 1 (i.e. the median of this lognormal distribution is x = 1). The other half of the lognormal probabilities lie in the interval $(1, \infty)$. Such lopsided assignment of probabilities shows that the lognormal distribution is a positively skewed distribution (skewed to the right).

In the above paragraph, the lower half of the normal distribution on $(-\infty,0)$ is matched with the lognormal distribution on the interval $(0, 1)$. Such interval matching can tell us a great deal about the lognormal distribution. Another example: about 75% of the standard normal distributional values lie below x = 0.67. Thus in Figure 1, about 75% of the lognormal probabilities lie in the interval $(0, 1.95)$ where $e^{0.67} \approx 1.95$. Another example: what is the probability that the lognormal distribution in Figure 1 lies between 1 and 3.5? The matching normal interval is $(0, 1.25)$ where $\log(3.5) \approx 1.25$. The normal probability in this interval is 0.8944 – 0.5 = 0.3944. Thus if a value is randomly generated from the standard lognormal distribution, there is a 39.44% chance that it is between 1 and 3.5.

The interval matching idea is very useful for computing lognormal probabilities (e.g. cumulative distribution function) and for finding lognormal percentiles. This idea is discussed further below to make it work for any lognormal distribution, not just the standard lognormal distribution.

Though the lognormal distribution is a skewed distribution, some lognormal distributions are less skewed than others. The lognormal distributions with larger values of the parameter $\sigma$ tend to be more skewed. The following is a diagram of three lognormal density curves that demonstrates this point. Note that the curve with the small $\sigma$ of 0.25 closely resembles a bell curve.

Figure 2 – three lognormal density curves

________________________________________________________________________

How to compute lognormal probabilities and percentiles

Let $Y$ be a random variable that follows a lognormal distribution with parameters $\mu$ and $\sigma$. Then the related normal random variable is $X=\log(Y)$, which has mean $\mu$ and standard deviation $\sigma$. If we raise $e$ to $X$, we get back the lognormal $Y$.

Continuing the interval matching idea, the lognormal interval $Y \le y$ will match with the normal interval $X \le \log(y)$. Both intervals receive the same probability in their respective distributions. The following states this more clearly.

$\displaystyle P\biggl(Y \le y\biggr)=P\biggl(X=\log(Y) \le \log(y)\biggr) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)$

On the other hand, the normal interval $X \le x$ will match with the lognormal interval $Y \le e^x$. The same probability is assigned to both intervals in their respective distributions. This idea is stated as follows:

$\displaystyle P\biggl(X \le x\biggr)=P\biggl(Y=e^X \le e^x\biggr) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)$

The idea of $(1)$ gives the cumulative distribution of the lognormal distribution (argument $y$), which is evaluated as the CDF of the corresponding normal distribution at $\log(y)$. One obvious application of $(2)$ is to have an easy way to find percentiles for the lognormal distribution. It is relatively easy to find the corresponding percentile of the normal distribution. Then the lognormal percentile is $e$ raised to the corresponding percentile of the normal distribution. For example, the median of the normal distribution is at the mean $\mu$. Then the median of the lognormal distribution is $e^{\mu}$.

The calculations in both $(1)$ and $(2)$ involve finding normal probabilities, which can be obtained using software or using a table of probability values of the standard normal distribution. To use the table approach, each normal CDF is converted to the standard normal CDF as follows:

$\displaystyle P\biggl(X \le x \biggr)=P\biggl(\frac{X-\mu}{\sigma} \le \frac{x-\mu}{\sigma} \biggr)=\Phi(z) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3)$

where $z$ is the z-score which is the ratio $(x-\mu)/\sigma$ and $\Phi(z)$ is the cumulative distribution function of the standard normal distribution, which can be looked up from a table based on the z-score. In light of this, $(1)$ can be expressed as follows:

$\displaystyle P\biggl(Y \le y\biggr)=\Phi\biggl(\frac{\log(y)-\mu}{\sigma} \biggr) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4)$

A quick example to demonstrate how this works.

Example 1
If $Y$ is lognormally distributed with parameters $\mu=2.5$ and $\sigma=1.5$,

• what is the probability $P(1.9 \le Y \le 31.34)$?
• what is the 95th percentile of this lognormal distribution?

The first answer is $P(1.9 \le Y \le 31.34)=P(Y \le 31.34)-P(Y \le 1.9)$, which is calculated as follows:

$\displaystyle P(Y \le 31.34)=\Phi \biggl( \frac{\log(31.34)-2.5}{1.5}\biggr)=\Phi(0.63)=0.7357$

$\displaystyle P(Y \le 1.9)=\Phi \biggl( \frac{\log(1.9)-2.5}{1.5}\biggr)=\Phi(-1.24)=1-0.8925=0.1075$

\displaystyle \begin{aligned} P(1.9 \le Y \le 31.34)&=P(Y \le 31.34)-P(Y \le 1.9) \\&=0.7357-0.1075 \\&=0.6282 \end{aligned}

The z-score for the 95th percentile for the standard normal distribution is z = 1.645. Then the 95th percentile for the normal distribution with mean 2.5 and standard deviation 1.5 is x = 2.5 + 1.645 (1.5) = 4.9675. Then apply the exponential function to obtain $e^{4.9675} \approx 143.67$, which is the desired lognormal 95th percentile. $\square$

As $(2)$ and Example 1 suggest, to find a lognormal percentile, first find the percentile for the corresponding normal distribution. If $x_p$ is the $100p$th percentile of the normal distribution, then $e^{x_p}$ is the $100p$th percentile of the lognormal distribution. Usually, we can first find the $100p$th percentile for the standard normal distribution $z_p$. Then the normal percentile we need is $x=\mu + z_p \cdot \sigma$. The lognormal percentile is then:

$\displaystyle e^{\displaystyle \mu + z_p \cdot \sigma}=\text{lognormal } 100p \text{th percentile} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (5)$
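The interval matching in $(4)$ and the percentile formula $(5)$ translate directly into a short calculation. The following Python sketch redoes Example 1, assuming scipy is available for the standard normal CDF and quantile function:

```python
# Lognormal probabilities and percentiles via the standard normal distribution.
from math import log, exp
from scipy.stats import norm

mu, sigma = 2.5, 1.5

# formula (4): lognormal CDF evaluated through the standard normal CDF
def lognormal_cdf(y):
    return norm.cdf((log(y) - mu) / sigma)

print(lognormal_cdf(31.34) - lognormal_cdf(1.9))   # about 0.628

# formula (5): lognormal percentile = exp(mu + z_p * sigma)
z95 = norm.ppf(0.95)              # about 1.645
print(exp(mu + z95 * sigma))      # about 143.6, the 95th percentile
```

The answers here differ slightly from the table-based values in Example 1 because no rounding of z-scores is involved.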

The above discussion shows that the explicit form of the lognormal density curve is not needed in computing lognormal probabilities and percentiles. For the sake of completeness, the following shows the probability density functions of both the normal distribution and the lognormal distribution.

Normal PDF
$\displaystyle f_X(x)=\frac{1}{\sigma \sqrt{2 \pi}} \ \ \text{\Large e}^{\displaystyle -\frac{(x-\mu)^2}{2 \sigma^2}} \ \ \ \ \ \ \ \ \ \ \ -\infty<x<\infty \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (6)$

Lognormal PDF
$\displaystyle f_Y(y)=\frac{1}{y \ \sigma \sqrt{2 \pi}} \ \ \text{\Large e}^{\displaystyle -\frac{(\log(y)-\mu)^2}{2 \sigma^2}} \ \ \ \ \ \ \ \ \ \ \ 0<y<\infty \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (7)$

The cumulative distribution function for the lognormal distribution is then

$\displaystyle F_Y(y)=\int_0^y \frac{1}{t \ \sigma \sqrt{2 \pi}} \ \ \text{\Large e}^{\displaystyle -\frac{(\log(t)-\mu)^2}{2 \sigma^2}} \ dt \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (8)$

Of course, we do not have to use $(8)$ since the lognormal CDF can be obtained based on the corresponding normal CDF.

One application of the lognormal PDF in $(7)$ is to use it to find the mode (by taking its derivative and finding the critical value). The mode of the lognormal distribution with parameters $\mu$ and $\sigma$ is $\displaystyle e^{\mu - \sigma^2}$.

________________________________________________________________________

How to find lognormal moments

To find the mean and higher moments of the lognormal distribution, we once again rely on basic information about normal distribution. For any random variable $T$ (normal or otherwise), its moment generating function is defined by $M_T(t)=E(e^{t \ T})$. The following is the moment generating function of the normal distribution with mean $\mu$ and standard deviation $\sigma$.

$\displaystyle M_X(t)=\text{\Large e}^{\displaystyle \biggl[ \mu t + (1/2) \sigma^2 t^2 \biggr]} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (9)$

As before, let $Y$ be a random variable that follows a lognormal distribution with parameters $\mu$ and $\sigma$. Then $Y=e^X$ where $X$ is normal with mean $\mu$ and standard deviation $\sigma$. Then $E(Y)=E(e^X)$ is simply the normal moment generating function evaluated at 1. In fact, the kth moment of $Y$, $E(Y^k)=E(e^{k X})$, is simply the normal mgf evaluated at $k$. Because the mgf of the normal distribution is defined at any real number, all moments for the lognormal distribution exist. The following gives the moments explicitly.

$E(Y)=\text{\Large e}^{\displaystyle \biggl[ \mu+(1/2) \sigma^2 \biggr]}$

$E(Y^k)=\text{\Large e}^{\displaystyle \biggl[ k \mu+(k^2/2) \sigma^2 \biggr]} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (10)$

In particular, the variance and standard deviation are:

\displaystyle \begin{aligned}Var(Y)&=\text{\Large e}^{\displaystyle \biggl[ 2 \mu+2 \sigma^2 \biggr]}-\text{\Large e}^{\displaystyle \biggl[ 2 \mu+\sigma^2 \biggr]} \\&=\biggl(\text{\Large e}^{\displaystyle \sigma^2}-1\biggr) \text{\Large e}^{\displaystyle \biggl[ 2 \mu+\sigma^2 \biggr]} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (11) \end{aligned}

$\displaystyle \sigma_Y=\sqrt{\text{\Large e}^{\displaystyle \sigma^2}-1} \ \ \text{\Large e}^{\displaystyle \biggl[ \mu+\frac{1}{2} \sigma^2 \biggr]}=\sqrt{\text{\Large e}^{\displaystyle \sigma^2}-1} \ E(Y) \ \ \ \ \ \ \ \ \ \ \ \ (12)$

The formulas $(11)$ and $(12)$ give the variance and standard deviation if the parameters $\mu$ and $\sigma$ are known. They do not need to be committed to memory, since they can always be generated from knowing the moments in $(10)$. As indicated before, the lognormal median is $e^{\mu}$, which is always less than the mean, which is $e$ raised to $\mu+(1/2) \sigma^2$. So the mean is greater than the median by a factor of $e$ raised to $(1/2) \sigma^2$. The mean being greater than the median is another sign that the lognormal distribution is skewed right.

Example 2
Suppose $Y$ follows a lognormal distribution with mean 12.18 and variance 255.02. Determine the probability that $Y$ is greater than its mean.

With the given information, we have:

$E(Y)=\text{\Large e}^{\displaystyle \biggl[ \mu+(1/2) \sigma^2 \biggr]}=12.18$

$\biggl(\text{\Large e}^{\displaystyle \sigma^2}-1\biggr) \text{\Large e}^{\displaystyle \biggl[ 2 \mu+\sigma^2 \biggr]}=\biggl(\text{\Large e}^{\displaystyle \sigma^2}-1\biggr) \biggl(\text{\Large e}^{\displaystyle \biggl[ \mu+(1/2) \sigma^2 \biggr]}\biggr)^2=255.02$

\displaystyle \begin{aligned}Var(Y)&=\biggl(\text{\Large e}^{\displaystyle \sigma^2}-1\biggr) \text{\Large e}^{\displaystyle \biggl[ 2 \mu+\sigma^2 \biggr]} \\&=\biggl(\text{\Large e}^{\displaystyle \sigma^2}-1\biggr) \biggl(\text{\Large e}^{\displaystyle \biggl[ \mu+(1/2) \sigma^2 \biggr]}\biggr)^2 \\&= \biggl(\text{\Large e}^{\displaystyle \sigma^2}-1\biggr) 12.18^2=255.02 \end{aligned}

From the last equation, we can solve for $\sigma$. The following shows the derivation.

\displaystyle \begin{aligned} &\biggl(\text{\Large e}^{\displaystyle \sigma^2}-1\biggr)=\frac{255.02}{12.18^2} \\&\text{\Large e}^{\displaystyle \sigma^2}=2.719014994 \\&\sigma^2=\log(2.719014994)=1.00026968 \end{aligned}

Thus we can take $\sigma=1$. Then plug $\sigma=1$ into $E(Y)$ to get $\mu=2$. The desired probability is:

\displaystyle \begin{aligned} P(Y>12.18)&=1-\Phi \biggl(\frac{\log(12.18)-2}{1} \biggr) \\&=1-\Phi(0.499795262) \\&=1-\Phi(0.5)=1-0.6915=0.3085 \ \square \end{aligned}
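The steps in Example 2 can be checked with a short calculation. The following Python sketch recovers $\mu$ and $\sigma$ from the given mean and variance and then computes the desired probability, assuming scipy is available:

```python
# Recover the lognormal parameters from the mean and variance, then
# compute P(Y > 12.18).
from math import log, sqrt
from scipy.stats import norm

mean, var = 12.18, 255.02
sigma2 = log(var / mean**2 + 1)   # from Var(Y) = (e^{sigma^2} - 1) * E(Y)^2
sigma = sqrt(sigma2)
mu = log(mean) - sigma2 / 2       # from E(Y) = e^{mu + sigma^2 / 2}

print(mu, sigma)                                 # approximately 2 and 1
print(1 - norm.cdf((log(12.18) - mu) / sigma))   # approximately 0.3085
```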

________________________________________________________________________

Linear transformations

For any random variable $W$, a linear transformation of $W$ is the random variable $aW+b$ where $a$ and $b$ are real constants. It is well known that if $X$ follows a normal distribution, any linear transformation of $X$ also follows a normal distribution. Does this apply to lognormal distribution? A linear transformation of a lognormal distribution may not have distributional values over the entire positive x-axis. For example, if $Y$ is lognormal, then $Y+1$ is technically not lognormal since the values of $Y+1$ lie in $(1, \infty)$ and not $(0, \infty)$. Instead, we focus on the transformations $cY$ where $c>0$ is a constant. We have the following fact.

If $Y$ has a lognormal distribution with parameters $\mu$ and $\sigma$, then $cY$ has a lognormal distribution with parameters $\mu+\log(c)$ and $\sigma$.

The effect of the constant adjustment of the lognormal distribution is on the $\mu$ parameter, which is adjusted by adding the natural log of the constant $c$. Note that the adjustment on $\mu$ is addition and not multiplication. The $\sigma$ parameter is unchanged.

One application of the transformation $cY$ is that of inflation. For example, suppose $Y$ represents claim amounts in a given calendar year arising from a group of insurance policies. If the insurance company expects that the claims amounts in the next year will increase by 10%, then $1.1Y$ is the random variable that models next year’s claim amounts. If $Y$ is assumed to be lognormal, then the effect of the 10% inflation is on the $\mu$ parameter as indicated above.

To see why the inflation on $Y$ works as described, let’s look at the cumulative distribution function of $T=cY$.

\displaystyle \begin{aligned} P(T \le y)&=P(Y \le \frac{y}{c})=F_Y(y/c) \\&=\int_0^{y/c} \frac{1}{t \ \sigma \sqrt{2 \pi}} \ \ \text{\Large e}^{\displaystyle -\frac{(\log(t)-\mu)^2}{2 \sigma^2}} \ dt \end{aligned}

Taking derivative of the last item above, we obtain the probability density function $f_T(y)$.

\displaystyle \begin{aligned} f_T(y)&=\frac{1}{y/c \ \sigma \sqrt{2 \pi}} \ \ \text{\Large e}^{\displaystyle -\frac{(\log(y/c)-\mu)^2}{2 \sigma^2}} (1/c) \\&=\frac{1}{y \ \sigma \sqrt{2 \pi}} \ \ \text{\Large e}^{\displaystyle -\frac{\biggl(\log(y)-(\mu+\log(c))\biggr)^2}{2 \sigma^2}} \end{aligned}

Comparing with the density function $(7)$, the last line is the lognormal density function with parameters $\mu + \log(c)$ and $\sigma$.
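The scaling fact can also be illustrated with a quick simulation. The following sketch (assuming numpy is available) generates lognormal claim amounts, applies the 10% inflation, and checks that the log of the inflated amounts has mean $\mu + \log(c)$ and standard deviation $\sigma$:

```python
# Simulation check: if Y is lognormal(mu, sigma), then c*Y is
# lognormal(mu + log(c), sigma). Here c = 1.1 models 10% inflation.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, c = 2.0, 1.0, 1.1

Y = rng.lognormal(mean=mu, sigma=sigma, size=1_000_000)
logT = np.log(c * Y)

print(logT.mean(), logT.std())   # close to mu + log(c) = 2.0953 and sigma = 1
print(mu + np.log(c))            # 2.0953...
```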

________________________________________________________________________

Distributional quantities involving higher moments

As the formula $(10)$ shows, all moments exist for the lognormal distribution. As a result, any distributional quantity that is defined using moments can be described explicitly in terms of the parameters $\mu$ and $\sigma$. We highlight three such distributional quantities: coefficient of variation, coefficient of skewness and kurtosis. The following shows their definitions. The calculation is done by plugging in the moments obtained from $(10)$.

$\displaystyle CV=\frac{\sigma_Y}{\mu_Y} \ \ \ \ \ \text{(Coefficient of variation)}$

\displaystyle \begin{aligned} \gamma_1&=\frac{E[ (Y-\mu_Y)^3 ]}{\sigma_Y^3} \\&=\frac{E(Y^3)-3 \mu_Y \sigma_Y^2-\mu_Y^3}{(\sigma_Y^2)^{\frac{3}{2}}} \ \ \ \ \ \text{(Coefficient of skewness)} \end{aligned}

\displaystyle \begin{aligned} \beta_2&=\frac{E[ (Y-\mu_Y)^4 ]}{\biggl(E[ (Y-\mu_Y)^2]\biggr)^2} \\&=\frac{E(Y^4)-4 \ \mu_Y \ E(Y^3) + 6 \ \mu_Y^2 \ E(Y^2) - 3 \ \mu_Y^4}{(\sigma_Y^2)^{2}} \ \ \ \ \ \text{(Kurtosis)} \end{aligned}

The above definitions are made for any random variable $Y$. The notations $\mu_Y$ and $\sigma_Y$ are the mean and standard deviation of $Y$, respectively. The coefficient of variation is the ratio of the standard deviation to the mean. It is a standardized measure of dispersion of a probability distribution or frequency distribution. The coefficient of skewness is defined as the ratio of the third central moment about the mean to the cube of the standard deviation. The second line in the above definition is an equivalent form in terms of the mean, variance and the third raw moment, which may be easier to calculate in some circumstances. Kurtosis is defined to be the ratio of the fourth central moment about the mean to the square of the second central moment about the mean. The second line in the definition gives an equivalent form in terms of the mean, variance and the second, third and fourth raw moments.

The above general definitions of CV, $\gamma_1$ and $\beta_2$ can be obtained for the lognormal distribution. The mean and variance and higher raw moments can be obtained by using $(10)$. Then it is a matter of plugging in the relevant items into the above definitions. The following example shows how this is done.

Example 3
Determine the CV, $\gamma_1$ and $\beta_2$ of the lognormal distribution in Example 2.

The calculation in Example 2 shows that the lognormal parameters are $\mu=2$ and $\sigma=1$. Now use formula $(10)$ to get the ingredients.

$\mu_Y=e^{2.5}$

$Var(Y)=(e-1) e^5 \ \ \ \ \ \ \ \ \sigma_Y=\sqrt{e-1} \ e^{2.5}$

$E(Y^2)=e^6$

$E(Y^3)=e^{10.5}$

$E(Y^4)=e^{16}$

Right away, CV = $\sqrt{e-1}=1.31$. The following shows the calculation for skewness and kurtosis.

\displaystyle \begin{aligned} \gamma_1&=\frac{E(Y^3)-3 \mu_Y \sigma_Y^2-\mu_Y^3}{(\sigma_Y^2)^{\frac{3}{2}}} \\&=\frac{e^{10.5}-3 \ e^{2.5} \ (e-1) \ e^5-(e^{2.5})^3}{[(e-1) \ e^5]^{\frac{3}{2}}}\\&=6.1849 \end{aligned}

\displaystyle \begin{aligned} \beta_2&=\frac{E(Y^4)-4 \ \mu_Y \ E(Y^3) + 6 \ \mu_Y^2 \ E(Y^2) - 3 \ \mu_Y^4}{(\sigma_Y^2)^{2}} \\&=\frac{e^{16}-4 \ e^{2.5} \ e^{10.5} + 6 \ (e^{2.5})^2 \ e^{6} - 3 \ (e^{2.5})^4}{[(e-1) \ e^5]^{2}}\\&=113.9364 \end{aligned}
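As a sketch, the calculation in Example 3 can be reproduced by coding the raw moments from formula $(10)$ and plugging them into the definitions above:

```python
# CV, skewness and kurtosis of the lognormal distribution with mu = 2, sigma = 1.
from math import exp, sqrt

mu, sigma = 2.0, 1.0

def raw_moment(k):
    # formula (10): E(Y^k) = exp(k*mu + (k^2/2)*sigma^2)
    return exp(k * mu + 0.5 * k**2 * sigma**2)

m1, m2, m3, m4 = (raw_moment(k) for k in (1, 2, 3, 4))
var = m2 - m1**2
sd = sqrt(var)

cv = sd / m1
skew = (m3 - 3 * m1 * var - m1**3) / var**1.5
kurt = (m4 - 4 * m1 * m3 + 6 * m1**2 * m2 - 3 * m1**4) / var**2

print(cv)     # about 1.31
print(skew)   # about 6.1849
print(kurt)   # about 113.9364
```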

________________________________________________________________________

Is there a moment generating function for the lognormal distribution?

Because the normal distribution has a moment generating function, all moments exist for the lognormal distribution (see formula $(10)$ above). Does the moment generating function exist for the lognormal distribution? Whenever the mgf exists for a distribution, its moments can be derived from the mgf. What about the converse? That is, when all moments exist for a given distribution, does it mean that its moment generating function would always exist? The answer is no. It turns out that the lognormal distribution is a counterexample. We conclude this post by showing this fact.

Let $Y$ be the standard lognormal distribution, i.e., $X=\log(Y)$ has the standard normal distribution. We show that the expectation $E(e^{tY})$ diverges to infinity for any $t>0$.

\displaystyle \begin{aligned} E(e^{t \ Y})&=E(e^{t \ e^X}) \\&=\frac{1}{\sqrt{2 \pi}} \ \int_{-\infty}^\infty e^{t \ e^x} \ e^{-0.5 x^2} \ dx \\&=\frac{1}{\sqrt{2 \pi}} \ \int_{-\infty}^\infty e^{t \ e^x - 0.5x^2} \ dx \\&> \frac{1}{\sqrt{2 \pi}} \ \int_{0}^\infty e^{t \ e^x - 0.5x^2} \ dx \\&> \frac{1}{\sqrt{2 \pi}} \ \int_{0}^\infty e^{\biggl(t \ ( \displaystyle 1+x+\frac{x^2}{2!}+\frac{x^3}{3!}) - 0.5x^2 \biggr)} \ dx=\infty \end{aligned}

The last integral in the above derivation diverges to infinity. Note that the Taylor series expansion of $e^x$ is $\displaystyle 1+x+\frac{x^2}{2!}+\frac{x^3}{3!}+\frac{x^4}{4!}+\cdots$, so for $x>0$ we have $\displaystyle e^x > 1+x+\frac{x^2}{2!}+\frac{x^3}{3!}$, which justifies the last inequality above. The exponent in the last integral is then a third degree polynomial in $x$ with a positive coefficient for the $x^3$ term, so it goes to infinity as $x$ goes to infinity. Since the last integral diverges, the mgf $E(e^{t \ Y})$ does not exist for any $t>0$.

________________________________________________________________________

Practice problems

Practice problems to reinforce the calculation are found in the companion blog to this blog.

________________________________________________________________________
$\copyright \ \text{2015 by Dan Ma}$