Every character is known but password is hard to crack

In this post, we discuss an example in which you are given a password (every character of it) and yet it is still very hard (or even impossible) to crack. Anyone who understands this example has a solid understanding of the binomial distribution. Here’s the example:

    Your friend John tells you that the password to his online bank account has 26 characters. The first character is the first letter in the English alphabet, the second character is the second letter in the English alphabet, the third character is the third letter in the English alphabet and so on.

    Now that your friend John has given you the key to his account, does that mean you can log onto his account to find out how much money he has, or to make financial transactions on his behalf or to enrich yourself?

    If this example sounds too good to be true, what is the catch?

Even though every character in the 26-character password is known, it is indeed a very strong password. How could this be? You may want to stop here and think about it.

Indeed, if every character in John’s password is lower case or if every character is upper case, then his bank account is toast. But John’s password can be made very strong and very hard to crack if the password is case sensitive. The password given by John is not just one password, but a large collection of passwords. In fact, there are over 67 million possible passwords (67,108,864 to be exact). The following are two of the most obvious ones.

    a b c d e f g h i j k l m n o p q r s t u v w x y z (all lower case)

    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z (all upper case)

The following is another possible password. If this is the one John uses, it will be difficult to crack.

    a b C d e f G h i j k l M N o p Q R s T U v w X Y z (10 upper case letters)

Here’s another possibility.

    A B c D E f G H I J k l M N o P q R S t u v w X y z (14 upper case letters)

Each character in the password has two possibilities – lower case or upper case. Across all 26 characters, there are 2^{26} possibilities. This number is 67,108,864; that is, 2 raised to 26 is a little over 67 million. So the password given by John is not just one password, but a generic one with over 67 million possibilities. If John chooses the upper case letters randomly, there is a one in 67 million chance of guessing the correct password. These are much better odds than winning the Powerball lottery (one in 292,201,338, or about one in 292 million), but it is still an undeniably strong password.

So John tells you the password, but has in fact not given up much of a secret. This is the case if he makes the case of each letter truly random. Of course, once he sets his password, unless he has a great memory, he will need to write down the positions that are upper case. Otherwise, he will not be able to log back on. But that is a totally separate story.
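
To make the idea concrete, here is a short Python sketch (purely illustrative, not part of John’s scheme) that randomly assigns a case to each of the 26 letters and counts the total number of possible case patterns.

    import random
    import string

    # Randomly decide the case of each of the 26 letters a-z.
    # Each letter is a "coin toss": upper case or lower case.
    letters = string.ascii_lowercase          # 'abcdefghijklmnopqrstuvwxyz'
    password = "".join(
        ch.upper() if random.random() < 0.5 else ch
        for ch in letters
    )
    print(password)                            # e.g. 'aBcdEFghIjkLmnOpqrStuvWxyZ'

    # Number of possible case patterns: 2 choices for each of 26 positions.
    print(2 ** 26)                             # 67108864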

___________________________________________________________________________

The binomial angle

A good way to represent the passwords is to view each one as a 26-character string of U and L (U stands for upper case and L stands for lower case). Then the above two passwords (the one with 10 upper case letters and the one with 14 upper case letters) are represented as follows:

    L L U L L L U L L L L L U U L L U U L U U L L U U L (10 U’s)

    U U L U U L U U U U L L U U L U L U U L L L L U L L (14 U’s)

As discussed, there are 67,108,864 such strings. We can also think of each such string as the record of a random experiment consisting of tossing a coin 26 times. We record a U when we get a head and record an L when the coin toss results in a tail. In fact, this is the kind of record that John would need to keep when he sets the password. Such a string would tell him which characters in the password are in upper case and which characters are in lower case. On the other hand, hacking the password would essentially be trying to pinpoint one such string out of 67,108,864 possibilities.

We know 67,108,864 is a large number. Let’s further demonstrate the challenge. In each string, the number of U’s can range from 0 to 26. How many of the strings have 0 U’s? Precisely one (all letters are L). This would be what most people would try first (all the letters are lower case). How many of the strings have exactly 1 U? Thinking about it carefully, we should arrive at the conclusion that there are 26 possibilities (the single U can be in any of the 26 positions). So if a would-be hacker knows that there is only one upper case letter, then it would be easy to break the password. How many of the strings have exactly 2 U’s? If you try to write out all the possible cases, it may take some effort. There are 325 possibilities. So having just two upper case letters already starts to approach a strong password. But the problem is that the two U’s may be guessed easily. John may put the upper case letters at his initials (if his name is John Smith, he may make J and S upper case), or at other obvious letters.

How many of the strings have exactly 3 U’s? This will be really hard to write out by hand. There are 2,600 possibilities. Why stop at having just 3 upper case letters? It is clear that the more upper case letters used in the password, the stronger it is and the harder it is to crack (at least up to the point of having half the letters in upper case, as the table below shows).

How do we know the precise number of possibilities for a given k, the number of U’s? The idea is that of choosing k letters out of 26.

The number of ways of choosing k objects from a total of n objects is denoted by the notation \binom{n}{k}. Sometimes the notations C(n,k), _nC_k and C_{n,k} are used. Regardless of the notation, the calculation is

    \displaystyle \binom{n}{k}=\frac{n!}{k! (n-k)!}

The notation n! is the product of all the positive integers up to and including n (this is called n factorial). Thus 1!=1, 2!=2, 3!=6, 4!=24. To make the formula work correctly, we make 0!=1.

The following gives the first several calculations in the 26-character password example.

    \displaystyle \binom{26}{2}=\frac{26!}{2! \ 24!}=\frac{26 \cdot 25}{2}=325

    \displaystyle \binom{26}{3}=\frac{26!}{3! \ 23!}=\frac{26 \cdot 25 \cdot 24}{6}=2600

    \displaystyle \binom{26}{4}=\frac{26!}{4! \ 22!}=\frac{26 \cdot 25 \cdot 24 \cdot 23}{24}=14950

To see the pattern, the remaining calculations can be done using software (or at least a handheld calculator). The following table shows the results.

    \displaystyle \begin{array}{rrr} k &\text{ } & \displaystyle \binom{26}{k}  \\  \text{ } & \text{ } & \text{ }  \\  0 &\text{ } & 1  \\     1 &\text{ } & 26   \\  2 &\text{ } & 325   \\  3 &\text{ } & 2,600   \\  4 &\text{ } & 14,950   \\  5 &\text{ } & 65,780   \\  6 &\text{ } & 230,230   \\  7 &\text{ } & 657,800   \\  8 &\text{ } & 1,562,275   \\  9 &\text{ } & 3,124,550   \\  10 &\text{ } & 5,311,735   \\  11 &\text{ } & 7,726,160   \\  12 &\text{ } & 9,657,700   \\  13 &\text{ } & 10,400,600   \\  14 &\text{ } & 9,657,700   \\  15 &\text{ } & 7,726,160   \\  16 &\text{ } & 5,311,735   \\  17 &\text{ } & 3,124,550   \\  18 &\text{ } & 1,562,275   \\  19 &\text{ } & 657,800   \\  20 &\text{ } & 230,230   \\  21 &\text{ } & 65,780   \\  22 &\text{ } & 14,950   \\  23 &\text{ } & 2,600   \\  24 &\text{ } & 325   \\  25 &\text{ } & 26   \\  26 &\text{ } & 1   \\  \end{array}
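
The counts in the table can be reproduced with a few lines of Python (a minimal sketch, assuming Python 3.8 or later so that math.comb is available); this is just a convenience for checking the numbers.

    from math import comb

    # Number of 26-character case patterns with exactly k upper case letters.
    for k in range(27):
        print(k, comb(26, k))

    # Sanity check: the counts over all k must add up to 2^26.
    print(sum(comb(26, k) for k in range(27)))   # 67108864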

The pattern is symmetrical. Having too few U’s or too many U’s produces weak passwords that may be easy to guess. Even having 6 or 7 U’s already gives hundreds of thousands of possibilities. Having half of the letters upper case (13 U’s) is optimal, with the most possibilities (over 10 million). Even if you are given partial information such as “half of the letters are in upper case”, you are still left with over 10 million possibilities to work with!

Dividing each of the above counts by 67,108,864 gives the relative weight (probability) of each case of having exactly k U’s.

    \displaystyle \begin{array}{rrr} k &\text{ } & P[X=k]  \\  \text{ } & \text{ } & \text{ }  \\  0 &\text{ } & 0.00000001  \\     1 &\text{ } & 0.00000039   \\  2 &\text{ } & 0.00000484   \\  3 &\text{ } & 0.00003874   \\  4 &\text{ } & 0.00022277   \\  5 &\text{ } & 0.00098020   \\  6 &\text{ } & 0.00343069   \\  7 &\text{ } & 0.00980198   \\  8 &\text{ } & 0.02327971   \\  9 &\text{ } & 0.04655942   \\  10 &\text{ } & 0.07915102   \\  11 &\text{ } & 0.11512876   \\  12 &\text{ } & 0.14391094   \\  13 &\text{ } & 0.15498102   \\  14 &\text{ } & 0.14391094   \\  15 &\text{ } & 0.11512876   \\  16 &\text{ } & 0.07915102   \\  17 &\text{ } & 0.04655942   \\  18 &\text{ } & 0.02327971   \\  19 &\text{ } & 0.00980198   \\  20 &\text{ } & 0.00343069   \\  21 &\text{ } & 0.00098020   \\  22 &\text{ } & 0.00022277   \\  23 &\text{ } & 0.00003874   \\  24 &\text{ } & 0.00000484   \\  25 &\text{ } & 0.00000039   \\  26 &\text{ } & 0.00000001   \\  \end{array}

The cases of X=k where k=11,12,13,14,15 add up to 67.3% of the 67,108,864 possibilities.
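
The probabilities in the table, and the 67.3% figure, can be checked with a similar sketch (again assuming math.comb from Python 3.8+).

    from math import comb

    total = 2 ** 26                                  # 67,108,864 possible passwords

    # P(X = k): probability of exactly k upper case letters under random casing.
    probs = [comb(26, k) / total for k in range(27)]
    print(round(probs[13], 8))                       # 0.15498102

    # Probability that the number of upper case letters is 11, 12, 13, 14 or 15.
    print(round(sum(probs[11:16]), 3))               # 0.673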

___________________________________________________________________________

The binomial distribution

Another way to look at it is that in setting the password, John is performing a sequence of 26 independent Bernoulli trials. Here, each trial has two outcomes, a or A, b or B, c or C and so on. For example, the lower case or upper case can be determined by a coin toss. Let X be the number of upper case letters in the 26-character password. Then the random variable X has a binomial distribution with n=26 (26 Bernoulli trials) and probability of success p=0.5 in each trial, where a success means that a character is upper case (assuming John determines the case of each character by a fair coin toss). The following is the probability function:

    \displaystyle P(X=x)=\binom{26}{x} \biggl[\frac{1}{2}\biggr]^x \biggl[\frac{1}{2}\biggr]^{26-x}=\binom{26}{x} \biggl[\frac{1}{2}\biggr]^{26}

where x=0,1,2,\cdots,25,26. The quantity \displaystyle P(X=x) is the probability that the number of upper case letters is x. Here, \binom{26}{x} is the number of ways to choose x letters out of 26 letters and is computed by the formula indicated earlier.

Since the upper/lower case is determined randomly, another way to state the probability function of the random variable X is:

    \displaystyle P(X=x)=\displaystyle \frac{\binom{26}{x}}{2^{26}}=\frac{\binom{26}{x}}{67108864} \ \ \ \ \ \ \ \ \ x=0,1,2,3,\cdots,24,25,26

The expected value of this random variable X is 13, i.e., np=26(0.5). This is the average number of upper case letters if the case is determined randomly, and it is the choice that gives the most possibilities. If John does not determine the case at random, the password may not be as strong, and a would-be hacker may be able to guess it.

Stepping away from the 26-character password example, here’s the probability function of a binomial distribution in general.

    \displaystyle P(X=x)=\binom{n}{x} \ p^x \ (1-p)^{n-x} \ \ \ \ \ \ \ \ x=0,1,2,3,\cdots,n-1,n

This model describes the random experiment of running n independent trials, where each trial has two outcomes (the technical term is Bernoulli trial). In each trial, the probability of one outcome (called success) is p and the probability of the other outcome (called failure) is 1-p. The random variable X counts the number of successes whenever such an experiment is performed. The probability P(X=x) gives the likelihood of achieving x successes.

As an example, if John has a bias toward lower case letters, then the probability of success (upper case) may be p=0.4 (assuming that the random variable X still counts the number of upper case letters). Then the average number of upper case letters in a randomly determined password is 26 \times 0.4 = 10.4.

___________________________________________________________________________

Interesting binomial distribution problems

The problem of points and the dice problem are two famous probability problems that are in the history books as a result of a gambler seeking help from Pascal and Fermat.

___________________________________________________________________________
\copyright \ 2016 \text{ by Dan Ma}


Defining the Poisson distribution

The Poisson distribution is a family of discrete distributions with positive probabilities on the non-negative integers 0,1,2,\cdots. Each distribution in this family is indexed by a positive number \lambda>0. One way to define this distribution is to give its probability function in terms of the parameter \lambda and then derive various distributional quantities such as the mean and variance. Along with other mathematical facts, it can be shown that both the mean and the variance are \lambda. In this post, we take a different tack. We look at two viewpoints that give rise to the Poisson distribution. Taking this approach will make it easier to appreciate some of the possible applications of the Poisson distribution. The first viewpoint is that the Poisson distribution is the limiting case of the binomial distribution. The second viewpoint is through the Poisson process, a stochastic process that, under some conditions, counts the number of events and the time points at which these events occur in a given time (or physical) interval.

________________________________________________________________________

Poisson as a limiting case of binomial

A binomial distribution where the number of trials n is large and the probability of success p is small such that np is moderate in size can be approximated using the Poisson distribution with mean \lambda=np. This fact follows from Theorem 1, which indicates that the Poisson distribution is the limiting case of the binomial distribution.

Theorem 1
Let \lambda be a fixed positive constant. Then for each integer x=0,1,2,\cdots, the following is true:

    \displaystyle \lim_{n \rightarrow \infty} \binom{n}{x} \ p^x \  (1-p)^{n-x}=\lim_{n \rightarrow \infty} \frac{n!}{x! \ (n-x)!} \ p^x \  (1-p)^{n-x}=\frac{e^{-\lambda} \ \lambda^x}{x!}

where \displaystyle p=\frac{\lambda}{n}.

Proof of Theorem 1
We start with a binomial distribution with n trials and with \displaystyle p=\frac{\lambda}{n} being the probability of success, where n>\lambda. Let X_n be the count of the number of successes in these n Bernoulli trials. The following is the probability that X_n=k.

    \displaystyle \begin{aligned} P(X_n=k)&=\binom{n}{k} \biggl(\frac{\lambda}{n}\biggr)^k \biggr(1-\frac{\lambda}{n}\biggr)^{n-k} \\&=\frac{n!}{k! (n-k)!} \biggl(\frac{\lambda}{n}\biggr)^k \biggr(1-\frac{\lambda}{n}\biggr)^{n-k} \\&=\frac{n(n-1)(n-2) \cdots (n-k+1)}{n^k} \biggl(\frac{\lambda^k}{k!}\biggr) \biggr(1-\frac{\lambda}{n}\biggr)^{n} \biggr(1-\frac{\lambda}{n}\biggr)^{-k} \\&=\biggl(\frac{\lambda^k}{k!}\biggr) \ \biggl[ \frac{n(n-1)(n-2) \cdots (n-k+1)}{n^k} \ \biggr(1-\frac{\lambda}{n}\biggr)^{n} \ \biggr(1-\frac{\lambda}{n}\biggr)^{-k} \biggr] \end{aligned}

In the last step, the terms that contain n are inside the square brackets. Let’s see what they are when n approaches infinity.

    \displaystyle \lim \limits_{n \rightarrow \infty} \ \frac{n(n-1)(n-2) \cdots (n-k+1)}{n^k}=1

    \displaystyle \lim \limits_{n \rightarrow \infty} \biggr(1-\frac{\lambda}{n}\biggr)^{n}=e^{-\lambda}

    \displaystyle \lim \limits_{n \rightarrow \infty} \biggr(1-\frac{\lambda}{n}\biggr)^{-k}=1

The reason that the first result is true is that the numerator is a polynomial where the leading term is n^k. Upon dividing by n^k and taking the limit, we get 1. The second result is true since the following limit is one of the definitions of the exponential function e^x.

    \displaystyle \lim \limits_{n \rightarrow \infty} \biggr(1+\frac{x}{n}\biggr)^{n}=e^{x}

The third result is true since the exponent -k is a constant. Thus the following is the limit of the probability P(X_n=k) as n \rightarrow \infty.

    \displaystyle \begin{aligned} \lim \limits_{n \rightarrow \infty} P(X_n=k)&= \biggl(\frac{\lambda^k}{k!}\biggr) \ \lim \limits_{n \rightarrow \infty} \biggl[ \frac{n(n-1)(n-2) \cdots (n-k+1)}{n^k} \ \biggr(1-\frac{\lambda}{n}\biggr)^{n} \ \biggr(1-\frac{\lambda}{n}\biggr)^{-k} \biggr] \\&=\biggl(\frac{\lambda^k}{k!}\biggr) \cdot 1 \cdot e^{-\lambda} \cdot 1 \\&=\frac{e^{-\lambda} \lambda^k}{k!} \end{aligned}

The above derivation completes the proof. \blacksquare

In a given binomial distribution, whenever the number of trials n is large and the probability p of success in each trial is small (i.e. each of the Bernoulli trial rarely results in a success), Theorem 1 tells us that we can use the Poisson distribution with parameter \lambda=np to estimate the binomial distribution.
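
As a quick numerical illustration of Theorem 1 (a sketch assuming the scipy library is available), the binomial probabilities with np = \lambda held fixed get closer and closer to the corresponding Poisson probabilities as n grows.

    from scipy.stats import binom, poisson

    lam, k = 2.0, 3          # fixed lambda, evaluate P(X = 3)
    for n in [10, 100, 1000, 10000]:
        p = lam / n
        print(n, binom.pmf(k, n, p))
    print("Poisson:", poisson.pmf(k, lam))   # the limit as n -> infinity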

Example 1
The probability of being dealt a full house in a hand of poker is approximately 0.001441. Out of 5000 hands of poker that are dealt at a certain casino, what is the probability that there will be at most 4 full houses?

Let X be the number of full houses in these 5000 poker hands. The exact distribution for X is the binomial distribution with n= 5000 and p= 0.001441. This example deals with a large number of trials where success in each trial is a rare event, so the Poisson estimation is applicable. Let \lambda= 5000(0.001441) = 7.205. Then P(X \le 4) can be approximated by the Poisson random variable Y with parameter \lambda. The following is the probability function of Y:

    \displaystyle P(Y=y)=e^{-7.205} \ \frac{7.205^y}{y!}

The following is the approximation of P(X \le 4):

    \displaystyle \begin{aligned} P(X \le 4)&\approx P(Y \le 4) \\&=P(Y=0)+P(Y=1)+P(Y=2)+P(Y=3)+P(Y=4) \\&= e^{-7.205} \biggl[ 1+7.205+\frac{7.205^2}{2!}+\frac{7.205^3}{3!}+\frac{7.205^4}{4!}\biggr] \\&=0.155098087  \end{aligned}

The following is a side-by-side comparison between the binomial distribution and its Poisson approximation. For all practical purposes, they are indistinguishable from one another.

    \displaystyle \begin{bmatrix} \text{Count of}&\text{ }&\text{ }&\text{Binomial } &\text{ }&\text{ }&\text{Poisson } \\\text{Full Houses}&\text{ }&\text{ }&P(X \le x) &\text{ }&\text{ }&P(Y \le x) \\\text{ }&\text{ }&\text{ }&n=5000 &\text{ }&\text{ }&\lambda=7.205 \\\text{ }&\text{ }&\text{ }&p=0.001441 &\text{ }&\text{ }&\text{ } \\\text{ }&\text{ }&\text{ } &\text{ }&\text{ } \\ 0&\text{ }&\text{ }&0.000739012&\text{ }&\text{ }&0.000742862 \\ 1&\text{ }&\text{ }&0.006071278&\text{ }&\text{ }&0.006095184 \\ 2&\text{ }&\text{ }&0.025304641&\text{ }&\text{ }&0.025376925 \\ 3&\text{ }&\text{ }&0.071544923&\text{ }&\text{ }&0.071685238 \\ 4&\text{ }&\text{ }&0.154905379&\text{ }&\text{ }&0.155098087  \\ 5&\text{ }&\text{ }&0.275104906&\text{ }&\text{ }&0.275296003 \\ 6&\text{ }&\text{ }&0.419508250&\text{ }&\text{ }&0.419633667 \\ 7&\text{ }&\text{ }&0.568176421 &\text{ }&\text{ }&0.568198363   \\ 8&\text{ }&\text{ }&0.702076190 &\text{ }&\text{ }&0.701999442 \\ 9&\text{ }&\text{ }&0.809253326&\text{ }&\text{ }&0.809114639 \\ 10&\text{ }&\text{ }&0.886446690&\text{ }&\text{ }&0.886291139 \\ 11&\text{ }&\text{ }&0.936980038&\text{ }&\text{ }&0.936841746 \\ 12&\text{ }&\text{ }&0.967298041&\text{ }&\text{ }&0.967193173 \\ 13&\text{ }&\text{ }&0.984085073&\text{ }&\text{ }&0.984014868 \\ 14&\text{ }&\text{ }&0.992714372&\text{ }&\text{ }&0.992672033 \\ 15&\text{ }&\text{ }&0.996853671&\text{ }&\text{ }&0.996830358 \end{bmatrix}

The above table is calculated using the functions BINOM.DIST and POISSON.DIST in Excel. The following shows how it is done. The parameter TRUE indicates that the result is a cumulative distribution. When it is set to FALSE, the formula gives the probability function.

    P(X \le x)=\text{BINOM.DIST(x, 5000, 0.001441, TRUE)}

    P(Y \le x)=\text{POISSON.DIST(x, 7.205, TRUE)}
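
For readers not using Excel, the same comparison can be produced in Python (a sketch assuming scipy is available; the function names below come from scipy.stats, not from the original post).

    from scipy.stats import binom, poisson

    n, p = 5000, 0.001441
    lam = n * p                       # 7.205

    # Cumulative probabilities, analogous to BINOM.DIST(..., TRUE)
    # and POISSON.DIST(..., TRUE) in the table above.
    for x in range(16):
        print(x, binom.cdf(x, n, p), poisson.cdf(x, lam))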

________________________________________________________________________

The Poisson distribution

The limit in Theorem 1 is a probability function and the resulting distribution is called the Poisson distribution. We now give the formal definition. A random variable X that takes on one of the numbers 0,1,2,\cdots is said to be a Poisson random variable with parameter \lambda>0 if

    \displaystyle P(X=x)=\frac{e^{-\lambda} \ \lambda^x}{x!} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ x=0,1,2,\cdots

It can be shown that the above function is indeed a probability function, i.e., the probabilities sum to 1. Any random variable that has a probability function of the above form is said to follow (or to have) a Poisson distribution. Furthermore, it can be shown that E(X)=var(X)=\lambda, i.e., the Poisson parameter is both the mean and variance. Thus the Poisson distribution may be a good fit if the observed data indicate that the sample mean and the sample variance are nearly identical.

The following is the moment generating function of the Poisson distribution with parameter \lambda.

    \displaystyle M(t)=E(e^{tX})=e^{\lambda \ (e^t-1)}

One consequence of the Poisson moment generating function is that any independent sum of Poisson random variables is again Poisson: if X and Y are independent Poisson random variables with parameters \lambda_1 and \lambda_2, the moment generating function of X+Y is e^{\lambda_1 (e^t-1)} \ e^{\lambda_2 (e^t-1)}=e^{(\lambda_1+\lambda_2) (e^t-1)}, which is the moment generating function of a Poisson distribution with parameter \lambda_1+\lambda_2.

________________________________________________________________________

The Poisson process

Another way, and the more important way, to look at the Poisson distribution is from the viewpoint of the Poisson process. Consider an experiment in which events that are of interest occur at random in a time interval. The goal here is to record the time of the occurrence of each random event and, for the purpose at hand, count the number of random events occurring in a fixed time interval. Starting at time 0, we note the time at which the first event occurs, then the time at which the second event occurs, and so on. From these measurements, we can derive the lengths of time between the occurrences of any two consecutive random events; such waiting times are continuous random variables. In this post, we focus on the discrete random variable of the count of the random events in a fixed time interval.

A good example of a Poisson process is the well known experiment in radioactivity conducted by Rutherford and Geiger in 1910. In this experiment, \alpha-particles were emitted from a polonium source and the number of \alpha-particles was counted during intervals of 7.5 seconds (2,608 such time intervals were observed). A Poisson process is a random process that satisfies several criteria. We will show that in a Poisson process, the number of these random occurrences in the fixed time interval will follow a Poisson distribution. First, we discuss the criteria to which a Poisson process must conform.

One of the criteria is that in a very short time interval, the chance of having more than one random event is essentially zero. So either one random event will occur or none will occur in a very short time interval. Considering the occurrence of a random event as a success, there is either a success or a failure in a very short time interval. So a very short time interval in a Poisson process can be regarded as a Bernoulli trial.

The second criterion is that the experiment remains constant over time. Specifically, this means that the probability of a random event occurring in a given subinterval is proportional to the length of that subinterval and does not depend on where the subinterval lies within the original interval. For example, in the 1910 radioactivity study, \alpha-particles were emitted at the rate of \lambda= 3.87 per 7.5 seconds. So the probability of one \alpha-particle being emitted from the radioactive source in a one-second interval is 3.87/7.5 = 0.516. Then the probability of observing one \alpha-particle in a half-second interval is 0.516/2 = 0.258. For a quarter-second interval, the probability is 0.258/2 = 0.129. So if we observe half as long, it will be half as likely to observe the occurrence of a random event. On the other hand, it does not matter where the quarter-second subinterval is, whether at the beginning or toward the end of the original interval of 7.5 seconds.

The third criterion is that non-overlapping subintervals are mutually independent in the sense that what happens in one subinterval (i.e. the occurrence or non-occurrence of a random event) will have no influence on the occurrence of a random event in another subinterval. To summarize, the following are the three criteria of a Poisson process:

    Suppose that on average \lambda random events occur in a time interval of length 1.

    1. The probability of having more than one random event occurring in a very short time interval is essentially zero.
    2. For a very short subinterval of length \frac{1}{n} where n is a sufficiently large integer, the probability of a random event occurring in this subinterval is \frac{\lambda}{n}.
    3. The numbers of random events occurring in non-overlapping time intervals are independent.

Consider a Poisson process in which the average rate is \lambda random events per unit time interval. Let Y be the number of random events occurring in the unit time interval. In the 1910 radioactivity study, the unit time interval is 7.5 seconds and Y is the count of the number of \alpha-particles emitted in 7.5 seconds. It follows that Y has a Poisson distribution with parameter \lambda. To see this, subdivide the unit interval into n non-overlapping subintervals of equal length where n is a sufficiently large integer. Let X_{n,j} be the number of random events in the jth subinterval (1 \le j \le n). Based on the three criteria, X_{n,1},X_{n,2},\cdots,X_{n,n} are independent Bernoulli random variables, where the probability of success for each X_{n,j} is \frac{\lambda}{n}. Then X_n=X_{n,1}+X_{n,2}+\cdots+X_{n,n} has a binomial distribution with parameters n and p=\frac{\lambda}{n}. Theorem 1 tells us that the limiting case of the binomial distributions for X_n is the Poisson distribution with parameter \lambda. This Poisson distribution should agree with the distribution of Y. The Poisson distribution is also discussed in more detail in the previous post called Poisson as a Limiting Case of Binomial Distribution.

We now examine the 1910 radioactivity study a little more closely.

Example 2
The basic idea of the 1910 radioactivity study conducted by Rutherford and Geiger is that a polonium source was placed a short distance from an observation point. The number of \alpha-particles emitted from the source was counted in 7.5-second intervals, 2608 times in all. The following table shows the results.

    \displaystyle \begin{bmatrix}   \text{Number of alpha particles}&\text{ }&\text{Observed}   \\ \text{recorded per 7.5 seconds }&\text{ }&\text{counts}    \\ \text{ }&\text{ }&\text{ }   \\ 0&\text{ }&57    \\ 1&\text{ }&203    \\ 2&\text{ }&383    \\ 3&\text{ }&525    \\ 4&\text{ }&532    \\ 5&\text{ }&408   \\ 6&\text{ }&273   \\ 7&\text{ }&139   \\ 8&\text{ }&45   \\ 9&\text{ }&27   \\ 10&\text{ }&10  \\ 11+&\text{ }&6  \\ \text{ }&\text{ }&\text{ }  \\ \text{Total }&\text{ }&2608    \end{bmatrix}

What is the average number of particles observed per 7.5 seconds? The total number of \alpha-particles in these 2608 periods is

    0 \times 57+1 \times 203+2 \times 383+ 3 \times 525 + \cdots=10097.

The mean count per period is \lambda=\frac{10097}{2608}=3.87. Consider the Poisson distribution with parameter 3.87. The following is its probability function.

    \displaystyle P(X=x)=\frac{e^{-3.87} \ 3.87^x}{x!} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ x=0,1,2,\cdots

Out of 2608 periods, the expected number of periods with x particles in emission is 2608P(X=x). The following is a side by side comparison in the observed counts and the expected counts.

    \displaystyle \begin{bmatrix}   \text{Number of alpha particles}&\text{ }&\text{Observed}&\text{ }&\text{Expected}    \\ \text{recorded per 7.5 seconds }&\text{ }&\text{counts}&\text{ }&\text{counts}    \\ \text{ }&\text{ }&\text{ }&\text{ }&2608 \times P(X=x)  \\ \text{ }&\text{ }&\text{ }&\text{ }&\text{ }   \\ 0&\text{ }&57&\text{ }&54.40    \\ 1&\text{ }&203&\text{ }&210.52  \\ 2&\text{ }&383&\text{ }&407.36    \\ 3&\text{ }&525&\text{ }&525.50    \\ 4&\text{ }&532&\text{ }&508.42    \\ 5&\text{ }&408&\text{ }&393.52   \\ 6&\text{ }&273&\text{ }&253.82   \\ 7&\text{ }&139&\text{ }&140.32   \\ 8&\text{ }&45&\text{ }&67.88   \\ 9&\text{ }&27&\text{ }&29.19   \\ 10&\text{ }&10&\text{ }&11.30  \\ 11+&\text{ }&6&\text{ }&5.78  \\ \text{ }&\text{ }&\text{ }&\text{ }&\text{ }  \\ \text{Total }&\text{ }&2608&\text{ }&2608    \end{bmatrix}

The expected counts are quite close to the observed counts, showing that the Poisson distribution is a very good fit to the observed data from the 1910 study.
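
The expected counts in the table can be recomputed with a short scipy-based sketch (an assumption of this illustration is that scipy is available; the “11+” row groups the tail just as the table does).

    from scipy.stats import poisson

    lam, periods = 3.87, 2608

    # Expected number of 7.5-second periods with exactly x particles, x = 0..10.
    for x in range(11):
        print(x, round(periods * poisson.pmf(x, lam), 2))

    # Expected number of periods with 11 or more particles.
    print("11+", round(periods * poisson.sf(10, lam), 2))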

________________________________________________________________________

More comments about the Poisson process

We have described the Poisson process as the distribution of random events in a time interval. The same idea can be used to describe random events occurring along a spatial interval, i.e. intervals in terms of distance or volume or other spatial measurements (see Examples 5 and 6 below).

Another point to make is that sometimes it may be necessary to consider an interval other than the unit length. Instead of counting the random events occurring in an interval of length 1, we may want to count the random events in an interval of length t. As before, let \lambda be the rate of occurrences in a unit interval. Then the rate of occurrences of the random events over an interval of length t is \lambda t. The same argument shows that the number of occurrences of the random events of interest in an interval of length t has a Poisson distribution with parameter \lambda t. The following is its probability function.

    \displaystyle P(X_t=x)=\frac{e^{-\lambda t} \ (\lambda t)^x}{x!} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ x=0,1,2,\cdots

where X_t is the count of the random events in an interval of length t.

For example, in the 1910 radioactive study, the unit length is 7.5 seconds. The number of \alpha-particles observed in a half unit interval (3.75 seconds) will follow a Poisson distribution with parameter 0.5 \lambda= 0.5(3.87) = 1.935 with the following probability function:

    \displaystyle P(X_{0.5}=x)=\frac{e^{-1.935} \ (1.935)^x}{x!}  \ \ \ \ \ \ \ \ \ \ \ \ \ x=0,1,2,\cdots

________________________________________________________________________

More examples

Example 3
A radioactive source is metered for 5 hours. During this period, 9638 \alpha-particles are counted. What is the probability that during the next minute, between 30 and 34 particles (both inclusive) will be counted?

The average number of \alpha-particles counted per minute is \lambda=\frac{9638}{300}=32.12. Let X be the number of \alpha-particles counted per minute. Then X has a Poisson distribution with parameter \lambda=32.12. The following calculates P(30 \le X \le 34).

    \displaystyle \begin{aligned} P(30 \le X \le 34)&=e^{-32.12} \biggl[ \frac{32.12^{30}}{30!}+\frac{32.12^{31}}{31!}+\frac{32.12^{32}}{32!}+\frac{32.12^{33}}{33!}+\frac{32.12^{34}}{34!}   \biggr] \\&=0.341118569  \end{aligned}

Alternatively, the POISSON.DIST function in Excel can be used as follows:

    \displaystyle \begin{aligned} P(30 \le X \le 34)&=P(X \le 34)-P(X \le 29) \\&=\text{POISSON.DIST(34,32.12,TRUE)} \\& \ \ \ \ \ \ -\text{POISSON.DIST(29,32.12,TRUE)} \\&=0.671501917-0.330383348 \\&=0.341118569  \end{aligned}
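
The same calculation can be done in Python (a sketch mirroring the Excel formulas above, assuming scipy is available).

    from scipy.stats import poisson

    lam = 32.12                # 9638 / 300, rounded as in the example above

    # P(30 <= X <= 34) = P(X <= 34) - P(X <= 29)
    print(poisson.cdf(34, lam) - poisson.cdf(29, lam))   # about 0.3411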

Example 4
The side effect of dry mouth is known to be experienced, on the average, by 5 out of 10,000 individuals taking a certain medication. About 20,000 patients are expected to take this medication next year. What is the probability that between 12 and 16 (both inclusive) patients will experience the side effect of dry mouth? What is the exact probability model that can also be used to work this problem?

The exact model is a binomial distribution. The number of trials is n= 20000 and the probability of success in each trial is p= 0.0005 (experiencing the side effect). Here, we use the Poisson distribution to estimate the binomial. The average number of patients experiencing the side effect is \lambda=20000(0.0005)=10. Let X be the number of patients experiencing the side effect. The following calculates the Poisson probability for P(12 \le X \le 16) in two different ways.

    \displaystyle \begin{aligned} P(12 \le X \le 16)&=e^{-10} \biggl[ \frac{10^{12}}{12!}+\frac{10^{13}}{13!}+\frac{10^{14}}{14!}+\frac{10^{15}}{15!}+\frac{10^{16}}{16!}   \biggr] \\&=0.276182244  \end{aligned}
    \displaystyle \begin{aligned} P(12 \le X \le 16)&=P(X \le 16)-P(X \le 11) \\&=\text{POISSON.DIST(16,10,TRUE)} \\& \ \ \ \ \ \ -\text{POISSON.DIST(11,10,TRUE)} \\&=0.97295839-0.696776146 \\&=0.276182244  \end{aligned}

Example 5
In a 10-mile stretch of a highway, car troubles (e.g. tire punctures, dead batteries, and mechanical breakdowns) occur at a rate of 1.5 per hour. A tow truck driver can respond to such car troubles and offer roadside assistance, which can include towing and minor repair. Assume that the number of such incidents per hour follows a Poisson distribution. At the beginning of the hour, three tow trucks (and their drivers) are available to respond to any car troubles in this stretch of highway. What is the probability that in the next hour all three tow truck drivers will be busy helping motorists with car troubles in this stretch of highway?

Let X be the number of car troubles that occur in this 10-mile stretch of highway in the one-hour period in question. If in this one hour there are 3 or more car troubles (X \ge 3), then all three tow truck drivers will be busy.

    \displaystyle \begin{aligned} P(X \ge 3)&=1-P(X \le 2) \\&=1-e^{-1.5} \biggl[ 1+1.5+\frac{1.5^{2}}{2!}   \biggr] \\&=1-0.808846831\\&=0.191153169  \end{aligned}

Example 6
Continuing Example 5. Considering that there is only a 19% chance that all 3 tow truck drivers will be busy, there is a good chance that the resources are underutilized. What if one of the drivers is assigned to another stretch of highway?

With only two tow trucks available for this 10-mile stretch of highway, the following is the probability that both tow truck drivers will be busy:

    \displaystyle \begin{aligned} P(X \ge 2)&=1-P(X \le 1) \\&=1-e^{-1.5} \biggl[ 1+1.5   \biggr] \\&=1-0.5578254\\&=0.4421746  \end{aligned}

Assigning one driver to another area seems to be a better way to make good use of the available resources. With only two tow truck drivers available, there is a much reduced chance (56%) that at least one of the drivers will be idle, and a much increased chance (44%) that both available drivers will be busy.

________________________________________________________________________

Remarks

The Poisson distribution is one of the most important of all probability models and has been shown to be an excellent model for a wide array of phenomena such as

  • the number of \alpha-particles emitted from radioactive source in a given amount of time,
  • the number of vehicles passing a particular location on a busy highway,
  • the number of traffic accidents in a stretch of highway in a given period of time,
  • the number of phone calls arriving at a particular point in a telephone network in a fixed time period,
  • the number of insurance losses/claims in a given period of time,
  • the number of customers arriving at a ticket window,
  • the number of earthquakes occurring in a fixed period of time,
  • the number of mutations on a strand of DNA,
  • the number of hurricanes in a year that originate in the Atlantic Ocean.

Why is the Poisson distribution so widely applicable in these and many other seemingly different and diverse phenomena? What is the commonality that ties them all together? The commonality is that all these phenomena are basically a series of independent Bernoulli trials. If a phenomenon is a binomial model where n is large and p is small, then it has a strong mathematical connection to the Poisson model through Theorem 1 above (i.e. it has a Poisson approximation). On the other hand, if the random phenomenon follows the criteria of a Poisson process, then the phenomenon is approximately a binomial model, which in the limiting case is Poisson.

In both viewpoints discussed in this post, the Poisson distribution can be regarded as a binomial distribution taken at a very granular level. This connection with the binomial distribution points to a vast array of problems that can be solved using the Poisson distribution.

________________________________________________________________________

Exercises

Practice problems for the Poisson concepts discussed above can be found in the companion blog. Working on these exercises is strongly encouraged (you don’t know it until you can do it).

________________________________________________________________________
\copyright \ \text{2015 by Dan Ma}

A natural look at the negative binomial survival function

The negative binomial distribution is a discrete distribution with two parameters r and p where r>0 and 0<p<1. It has positive probabilities at the non-negative integers 0,1,2,\cdots. So it can potentially be used as a model for the random count of a phenomenon of interest. In some cases, the negative binomial distribution has a natural interpretation. In fact, the natural interpretation should be how the negative binomial distribution is introduced. With the parameter r being a positive integer, the negative binomial distribution can be naturally interpreted as the discrete waiting time until the rth success. This natural interpretation does not apply to the general case where r is a positive real number that is not necessarily an integer. However, once the natural case where r is a positive integer is understood, it is easy to make the leap to the general case where r is any positive real number. In this post, we focus on the “natural” version of the negative binomial distribution, where r is a positive integer. In this case, there is a natural way to look at the cumulative distribution function (cdf) and the survival function. The discussion in this post complements other posts on the negative binomial distribution.

________________________________________________________________________

The probability functions

A Bernoulli trial is a random experiment in which there are two distinct outcomes. For convenience, they are called success (S) and failure (F). Consider performing a sequence of independent Bernoulli trials such that each trial has the same probability of success p. Fix a positive integer r. In some cases, we would like to count the number of trials required to produce the rth success. Let’s call this count X_r. In other cases, we may want instead to count the number of failures before getting the rth success. Let’s call this count Y_r. The probability functions of X_r and Y_r are:

    \displaystyle P(X_r=x)= \binom{x-1}{r-1} p^r (1-p)^{x-r} \ \ \ \ \ \ \ x=r,r+1,r+2,\cdots \ \ \ \ \ \ \ (1)

    \displaystyle P(Y_r=y)=\binom{y+r-1}{y} p^r (1-p)^y \ \ \ \ \ \ \ y=0,1,2,\cdots \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)

When r=1, the resulting distribution is called the geometric distribution.

    \displaystyle P(X_1=x)= p (1-p)^{x-1} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ x=1,2,3,\cdots \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1a)

    \displaystyle P(Y_1=y)= p (1-p)^y \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ y=0,1,2,\cdots \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2a)

Because the parameter r is a positive integer, all the above probability functions have a combinatorial explanation. For the probability function in (1), there are x trials. The last Bernoulli trial is the rth success. Then the preceding x-1 trials contain exactly r-1 successes. Hence we have the binomial coefficient \binom{x-1}{r-1} in (1). For the probability function in (2), there are y+r trials (r successes and y failures). Again, the last Bernoulli trial is the rth success. So in the preceding y+r-1 trials, there are exactly r-1 successes and exactly y failures. Hence we have the binomial coefficient \binom{y+r-1}{y} in (2). Let’s look at some examples.

Example 1
Four different prizes are randomly put into boxes of cereal. One of the prizes is a free ticket to the local zoo. Suppose that a family of four (called Family A) decides to buy this cereal until obtaining four free tickets to the zoo. What is the probability that the family will have to buy 10 boxes of cereal to obtain the four free tickets to the zoo? What is the probability that the family will have to buy 16 boxes of cereal to obtain the four free tickets to the zoo?

In this example, the success is a box of cereal with a free ticket to the zoo. So getting a ticket to the zoo is considered a success. Any one of the other three prizes is considered undesirable or a failure. Any one of the four prizes is equally likely. The negative binomial distribution in this example has r=4 and p=0.25. The count of the boxes of cereal to be purchased is the random variable X_4 as described in (1) above. The following gives the answers.

    \displaystyle \begin{aligned} P(X_4=10)&=\binom{10-1}{3} \ (0.25)^4 \ (0.75)^{10-4} \\&=\binom{9}{3} \ (0.25)^4 \ (0.75)^{6} \\&=84 \ (0.25)^4 \ (0.75)^{6} \\&=0.0583992  \end{aligned}

    \displaystyle \begin{aligned} P(X_4=16)&=\binom{16-1}{3} \ (0.25)^4 \ (0.75)^{16-4} \\&=\binom{15}{3} \ (0.25)^4 \ (0.75)^{12} \\&=455 \ (0.25)^4 \ (0.75)^{12} \\&=0.056299766  \end{aligned}
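
These probabilities can be verified with scipy’s negative binomial functions (a sketch, assuming scipy is available; note that scipy.stats.nbinom counts failures, i.e. the variable Y_r here, so P(X_4 = x) corresponds to x - 4 failures).

    from scipy.stats import nbinom

    r, p = 4, 0.25

    # scipy's nbinom counts failures before the r-th success (the Y_r version).
    # P(X_4 = 10) means 10 - 4 = 6 failures before the 4th success.
    print(nbinom.pmf(6, r, p))    # about 0.0583992
    print(nbinom.pmf(12, r, p))   # about 0.0562998, i.e. P(X_4 = 16)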

Example 2 (Example 1 continued)
Suppose Family A agrees to give any one of the undesirable prizes away to another family (called Family B). What is the probability that Family A will give 5 undesirable prizes to Family B before obtaining the four desirable tickets to the zoo? What is the probability that Family A will give 12 undesirable prizes to Family B before obtaining the four tickets to the zoo?

The negative binomial distribution in this example has r=4 and p=0.25. The interest here is to count the number failures (undesirable prizes) before getting 4 successes. Thus the random variable of interest is Y_4 as described in (2) above. The following gives the answers.

    \displaystyle \begin{aligned} P(Y_4=5)&=\binom{5+4-1}{5} \ (0.25)^4 \ (0.75)^{5} \\&=\binom{8}{5} \ (0.25)^4 \ (0.75)^{5} \\&=56 \ (0.25)^4 \ (0.75)^{5} \\&=0.0519104  \end{aligned}

    \displaystyle \begin{aligned} P(Y_4=12)&=\binom{12+4-1}{12} \ (0.25)^4 \ (0.75)^{12} \\&=\binom{15}{12} \ (0.25)^4 \ (0.75)^{12} \\&=455 \ (0.25)^4 \ (0.75)^{12} \\&=0.056299766  \end{aligned}

Here’s the mean and variance for both examples.

    \displaystyle E(X_4)=4 \ \frac{1}{0.25}=16

    \displaystyle Var(X_4)=Var(Y_4)=4 \ \frac{0.75}{0.25^2}=48

    \displaystyle E(Y_4)=4 \ \frac{0.75}{0.25}=12

Thus Family A is expected to buy 16 boxes of cereal to get the 4 tickets to the zoo and is expected to give 12 prizes to the other family. However, the variance is fairly large. As a result, the actual number of boxes purchased can vary from the mean by a large amount.

_______________________________________________________________________

The survival functions and the cumulative distribution functions

For any random variable T, the cumulative distribution function is P(T \le t) where T can range over all real numbers t or a relevant set of real numbers t. The survival function is P(T > t). The term survival comes from the interpretation that P(T > t) is the probability of a life surviving beyond time t if T is meant to model the lifetime of a biological life or some system. Even when T is not a lifetime random variable, we still use the term survival function for P(T > t).

Examples 1 and 2 ask for the probability of requiring a certain number of trials to get the rth success and the probability of a certain number of failures before getting the rth success. Sometimes it is more informative to know how many trials are likely to be required in order to achieve one’s goal. For example, it may be useful to know the mean number of trials or the probability of achieving the goal in x trials or fewer. In some cases, it may take time and resources to perform the random Bernoulli trials. It will be helpful to know ahead of time the likelihood of achieving one’s goal given the resources that are available. In the above examples, it will be helpful for Family A to have a clearer sense of how many boxes of cereal they may have to purchase. Therefore, it is worthwhile to look at the cdf and the survival function of the negative binomial distribution.

From first principles, here are the survival functions for the distributions described in (1) and (2).

    \displaystyle \begin{aligned} P(X_r>x)&=\sum \limits_{k=x+1}^\infty P(X_r=k) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ x=r,r+1,r+2,\cdots \\&=\sum \limits_{k=x+1}^\infty \binom{k-1}{r-1} p^r (1-p)^{k-r} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3) \end{aligned}

    \displaystyle \begin{aligned} P(Y_r>y)&=\sum \limits_{j=y+1}^\infty P(Y_r=j) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ y= 0,1,2,3,\cdots \\&=\sum \limits_{j=y+1}^\infty \binom{j+r-1}{j} p^r (1-p)^{j} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4) \end{aligned}

The cumulative distribution functions can be defined by basic principle or by taking the complement as follows:

    \displaystyle \begin{aligned} P(X_r \le x)&=\sum \limits_{k=r}^x P(X_r=k) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ x=r,r+1,r+2,\cdots \\&=1-P(X_r>x)  \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (5)\end{aligned}

    \displaystyle \begin{aligned} P(Y_r \le y)&=\sum \limits_{j=0}^y P(Y_r=j) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ y=0,1,2,3,\cdots \\&=1-P(Y_r>y)  \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (6)\end{aligned}

Example 3 (the above examples continued)
What is the probability that Family A will buy more than 16 boxes in order to obtain the 4 free tickets to the zoo? What is the probability that Family A will give away at most 10 undesirable prizes before obtaining the 4 free tickets to the zoo?

Working from the definition, here are the answers:

    \displaystyle P(X_4>16)=1-P(X_4 \le 16)=0.40498711

    \displaystyle P(Y_4 \le 10)=0.47866004

The mean number of boxes of cereal purchased is 16 as indicated earlier. Since the variance is large, there is still a significant chance (about 40%) that Family A will have to buy more than 16 boxes of cereal before achieving their goal. On the other hand, there is a good chance (about 48%) that Family A will give away at most 10 undesirable prizes.

Note that the calculation of P(X_4>16) is based on P(X_4 \le 16), which requires computing 17 probabilities, and the calculation of P(Y_4 \le 10) requires 11 probabilities. Such calculations can of course be done by software. There is, however, a natural way of looking at the calculations in (3), (4), (5) and (6). This alternative approach gives much better insight into the negative binomial distribution.

_______________________________________________________________________

A more natural way of interpreting the survival function

We now discuss a better way to look at the survival functions in (3) and (4). Consider the event X_r>x and the event Y_r>y. We will see that the negative binomial survival function can be related to the cdf of a binomial distribution.

For the event X_r>x to occur, the rth success occurs after performing x trials. So it will take x+1 trials or more to get the rth success. This means that in the first x trials, there are at most r-1 successes. The following highlights the equivalent statements.

    \displaystyle \begin{aligned} X_r>x&\equiv \text{the } r \text{th success occurs after performing } x \text{ trials} \\&\equiv \text{it takes at least } x+1 \text{ trials to get the } r \text{th success}  \\&\equiv \text{in the first } x \text{ trials, there are at most } r-1 \text{ successes}  \end{aligned}

The last statement is a binomial distribution. Specifically, it is the binomial distribution with x trials and the probability of success p. Let’s denote the count of successes of this binomial distribution by B_{x,p}. Thus we can relate the survival function in (3) with the cdf of B_{x,p} as follows:

    \displaystyle \begin{aligned} P(X_r>x)&=P(B_{x,p} \le r-1) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ x=r,r+1,r+2,\cdots \\&=\sum \limits_{k=0}^{r-1} \ P(B_{x,p}=k)  \\&=\sum \limits_{k=0}^{r-1} \ \binom{x}{k} \ p^k (1-p)^{x-k} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (7)   \end{aligned}

The advantage of (7) is that it gives us the insight of relating the negative binomial distribution with the binomial distribution. In terms of computation, (7) only requires r many binomial probabilities. Thus Example 3 only requires the computation of 4 probabilities (versus 17 previously).

Note that X_r=Y_r+r. Thus the event Y_r>y is the same as the event X_r>y+r. So we can just piggyback on the work done in (7). For the sake of clarity, here is a translation of the event Y_r>y.

    \displaystyle \begin{aligned} Y_r>y&\equiv X_r>y+r \\&\equiv \text{the } r \text{th success occurs after performing } y+r \text{ trials} \\&\equiv \text{it takes at least } y+r+1 \text{ trials to get the } r \text{th success}  \\&\equiv \text{in the first } y+r \text{ trials, there are at most } r-1 \text{ successes}  \end{aligned}

As before, let B_{n,p} denote the number of successes in performing n Bernoulli trials with p as the probability of success. Based on the above translation, the following gives the survival function for the negative binomial random variable Y_r.

    \displaystyle \begin{aligned} P(Y_r>y)&=P(X_r>y+r)   \\&=P(B_{y+r,p} \le r-1) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ y=0,1,2,\cdots \\&=\sum \limits_{j=0}^{r-1} \ P(B_{y+r,p}=j)  \\&=\sum \limits_{j=0}^{r-1} \ \binom{y+r}{j} \ p^j (1-p)^{y+r-j} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (8)   \end{aligned}

For both negative binomial random variables X_r and Y_r, the survival functions can be computed using (7) and (8), respectively. Rote memorization of the formulas (7) and (8) is not recommended. Instead, focus on the thought process that translate the events X_r>x and Y_r>y into a binomial distribution.

Example 4 (the above examples continued)
We now rework Example 3 using the ideas presented in (7) and (8).

    \displaystyle \begin{aligned} P(X_4>16)&=\sum \limits_{k=0}^{3} \ \binom{16}{k} \ 0.25^k \ 0.75^{16-k}=0.40498711    \end{aligned}

    \displaystyle \begin{aligned} P(Y_4 \le 10)&=1-P(Y_4>10)   \\&=1-\sum \limits_{j=0}^{3} \ \binom{14}{j} \ 0.25^j \ 0.75^{14-j} \\&=1-0.52133996  \\&=0.47866004   \end{aligned}
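
The identity behind (7) and (8) is easy to check numerically (a sketch assuming scipy is available): the negative binomial survival function agrees with a binomial cdf.

    from scipy.stats import nbinom, binom

    r, p = 4, 0.25

    # P(X_4 > 16) = P(Y_4 > 12) = P(at most 3 successes in 16 trials)
    print(nbinom.sf(12, r, p))       # about 0.40498711
    print(binom.cdf(r - 1, 16, p))   # same value via formula (7)

    # P(Y_4 <= 10) = 1 - P(at most 3 successes in 14 trials)
    print(nbinom.cdf(10, r, p))          # about 0.47866004
    print(1 - binom.cdf(r - 1, 14, p))   # same value via formula (8)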

Example 5 (the above examples continued)
What is the median number of boxes of cereal purchased by Family A in order to obtain 4 boxes with the prize of free ticket to the zoo? What is the median number of boxes of cereal with undesirable prizes that are purchased by Family A?

We have the following probabilities.

    \displaystyle P(X_4>14)=\sum \limits_{k=0}^{3} \ \binom{14}{k} \ 0.25^k \ 0.75^{14-k}=0.52133996

    \displaystyle P(X_4>15)=\sum \limits_{k=0}^{3} \ \binom{15}{k} \ 0.25^k \ 0.75^{15-k}=0.461286876

    \displaystyle P(X_4 \le 14)=1-0.52133996=0.47866004

    \displaystyle P(X_4 \le 15)=1-0.461286876=0.538713124

This means that the median number of boxes to be purchased is 15. One way to look at it is that x= 15 is the first number such that P(X_4 \le x) is greater than 0.5. Then the median number of boxes with undesirable prizes is 11 (15 less 4).

_______________________________________________________________________

Comparing with the Gamma distribution

The thought process discussed in (7) and (8) certainly gives a more efficient way to calculate the cumulative distribution function and the survival function of the negative binomial distribution. Even though the negative binomial cdf can be calculated easily by software, the ideas in (7) and (8) provide a formulation that gives more insight into the negative binomial distribution.

The thought process in (7) and (8) is analogous to the relationship between the Gamma distribution and the Poisson distribution. Consider the Gamma distribution where the shape parameter n is a positive integer and the rate parameter \beta can be any positive real number. Then the following is the density function of the Gamma distribution under consideration:

    \displaystyle f(x)=\frac{\beta^n}{(n-1)!} \ x^{n-1} \ e^{-\beta \ x} \ \ \ \ \ x>0 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (9)

The Gamma distribution described by the density function in (9) can be interpreted as a waiting time – the waiting time until the nth change in a Poisson process. Thus, if the nth change takes place after time t, there can be at most n-1 arrivals in the time interval [0,t]. Thus the survival function of this Gamma distribution is the same as the cdf of a Poisson distribution. The survival function in (7) is analogous to the following relation.

    \displaystyle P(X>t)=\int_t^\infty \frac{\beta^n}{(n-1)!} \ x^{n-1} \ e^{-\beta x} \ dx=\sum \limits_{j=0}^{n-1} \frac{e^{-\beta t} \ (\beta t)^j}{j!} \ \ \ \ \ \ \ \ \ \ \ (10)

The idea for (7) and (8) is the waiting time until the rth success, where each success or failure comes from a Bernoulli process; the resulting distribution is a discrete waiting-time distribution. The idea for (10) is the waiting time until the nth change, where each change comes from a Poisson counting process; the resulting distribution is a continuous waiting-time distribution. It is not necessary to memorize these formulas. It is easy to reproduce (7), (8) and (10) from the underlying thought processes.
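
Relation (10) can likewise be verified numerically (a scipy-based sketch; note that scipy’s gamma distribution is parameterized by a scale, which is 1/\beta here, and the values of n, \beta and t below are arbitrary test values, not from the text).

    from scipy.stats import gamma, poisson

    n, beta, t = 5, 2.0, 1.7     # shape n, rate beta, time t (arbitrary test values)

    # P(X > t) for the Gamma(n, beta) waiting time until the n-th change ...
    print(gamma.sf(t, a=n, scale=1 / beta))

    # ... equals P(at most n - 1 arrivals by time t) in a Poisson process of rate beta.
    print(poisson.cdf(n - 1, beta * t))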

________________________________________________________________________
\copyright \ \text{2015 by Dan Ma}

Deriving some facts of the negative binomial distribution

The previous post called The Negative Binomial Distribution gives a fairly comprehensive discussion of the negative binomial distribution. In this post, we fill in some of the details that are glossed over in that previous post. We cover the following points:

  • Discuss the several versions of the negative binomial distribution.
  • The negative binomial probabilities sum to one, i.e., the negative binomial probability function is a valid one.
  • Derive the moment generating function of the negative binomial distribution.
  • Derive the first and second moments and the variance of the negative binomial distribution.
  • An observation about independent sum of negative binomial distributions.

________________________________________________________________________

Three versions

The negative binomial distribution has two parameters r and p, where r is a positive real number and 0<p<1. The first two versions arise from the case that r is a positive integer, which can be interpreted as the random experiment of a sequence of independent Bernoulli trials until the rth success (the trials have the same probability of success p). In this interpretation, there are two ways of recording the random experiment:

    X = the number of Bernoulli trials required to get the rth success.
    Y = the number of Bernoulli trials that end in failure before getting the rth success.

The other parameter p is the probability of success in each Bernoulli trial. The binomial coefficient \binom{m}{n}, where m and n are non-negative integers with m \ge n, is defined as:

    \displaystyle \binom{m}{n}=\frac{m!}{n! \ (m-n)!}=\frac{m(m-1) \cdots (m-(n-1))}{n!} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (0)

With this in mind, the following are the probability functions of the random variables X and Y.

    \displaystyle P(X=x)= \binom{x-1}{r-1} p^r (1-p)^{x-r} \ \ \ \ \ \ \ x=r,r+1,r+2,\cdots \ \ \ \ \ \ \ (1)

    \displaystyle P(Y=y)=\binom{y+r-1}{y} p^r (1-p)^y \ \ \ \ \ \ \ y=0,1,2,\cdots \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)

The thought process for (1) is that for the event X=x to happen, there are exactly r-1 successes in the first x-1 trials, with one additional success occurring in the last trial (the xth trial). The thought process for (2) is that for the event Y=y to happen, there are y+r trials (y failures and r successes). In the first y+r-1 trials, there are exactly y failures (or equivalently exactly r-1 successes). Note that X=Y+r. Thus knowing the mean of Y will give the mean of X, a fact we will use below.

Instead of memorizing the probability functions (1) and (2), it is better to understand and remember the thought processes involved. Because of the natural interpretation of performing Bernoulli trials until the rth success, it is a good idea to introduce the negative binomial distribution via the distributions described by (1) and (2), i.e., the case where the parameter r is a positive integer. When r=1, the random experiment is a sequence of independent Bernoulli trials until the first success (this is called the geometric distribution).

Of course, (1) and (2) can also simply be used as counting distributions without any connection with a series of Bernoulli trials (e.g. used in an insurance context as the number of losses or claims arising from a group of insurance policies).

The binomial coefficient in (0) is defined when both numbers are non-negative integers with the top one greater than or equal to the bottom one. However, the rightmost term in (0) can be calculated even when the top number m is not a non-negative integer. Thus when m is any real number, the rightmost term in (0) can be calculated provided that the bottom number n is a positive integer. For convenience we define \binom{m}{0}=1. With this in mind, the binomial coefficient \binom{m}{n} is defined for any real number m and any non-negative integer n.
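
The rightmost term in (0) is straightforward to compute for any real m. The following small Python sketch implements this generalized binomial coefficient (the numerical examples are chosen only for illustration).

from math import comb, factorial

def gen_binom(m, n):
    # generalized binomial coefficient: m can be any real number,
    # n must be a non-negative integer; the empty product gives C(m, 0) = 1
    product = 1.0
    for i in range(n):  # n factors: m, m-1, ..., m-(n-1)
        product *= m - i
    return product / factorial(n)

assert abs(gen_binom(7, 3) - comb(7, 3)) < 1e-9  # agrees with the usual coefficient
print(gen_binom(2.5, 3))  # 2.5 * 1.5 * 0.5 / 3! = 0.3125, a non-integer example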

The third version of the negative binomial distribution arises from the relaxation of the binomial coefficient \binom{m}{n} just discussed. With this in mind, the probability function in (2) can be defined for any positive real number r:

    \displaystyle P(Y=y)=\binom{y+r-1}{y} p^r (1-p)^y \ \ \ \ \ \ \ y=0,1,2,\cdots \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3)

where \displaystyle \binom{y+r-1}{y}=\frac{(y+r-1)(y+r-2) \cdots (r+1)r}{y!}.

Of course, when r is a positive integer, versions (2) and (3) are identical. When r is a positive real number but not an integer, the distribution cannot be interpreted as the number of failures until the occurrence of the rth success. Instead, it is used as a counting distribution.

________________________________________________________________________

The probabilities sum to one

Do the probabilities in (1), (2) or (3) sum to one? For the interpretations of (1) and (2), is it possible to repeatedly perform Bernoulli trials and never get the rth success? For r=1, is it possible to never even get a success? In tossing a fair coin repeatedly, soon enough you will get a head, and even if r is a large number, you will eventually accumulate r heads. Here we wish to prove this fact mathematically.

To show that (1), (2) and (3) are indeed probability functions, we use the Maclaurin series expansion of the function (1-x)^{-r}, a fact that is covered in a calculus course. In the following two results, r is a fixed positive real number and y is any non-negative integer:

    \displaystyle \binom{y+r-1}{y}=(-1)^y \ \binom{-r}{y} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4)

    \displaystyle \sum \limits_{y=0}^\infty (-1)^y \ \binom{-r}{y} \ x^y=(1-x)^{-r} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (5)

The result (4) rearranges the binomial coefficient in the probability function (3) into a binomial coefficient with a negative top number. This is why there is the word “negative” in negative binomial distribution. The result (5) is the Maclaurin series expansion of the function (1-x)^{-r}. We first derive these two facts and then use them to show that the negative binomial probabilities in (3) sum to one. The following derives (4).

    \displaystyle \begin{aligned} \binom{y+r-1}{y}&=\frac{(y+r-1)(y+r-2) \cdots (r+1)r}{y!} \\&=(-1)^y \ \frac{(-r)(-r-1) \cdots (-r-(y-1))}{y!} \\&=(-1)^y \ \binom{-r}{y}  \end{aligned}

To derive (5), let f(x)=(1-x)^{-r}. Based on a theorem that can be found in most calculus texts, the function f(x) has the following Maclaurin series expansion (a Maclaurin series is simply a Taylor series with center 0).

    \displaystyle (1-x)^{-r}=f(0)+f^{'}(0)x+\frac{f^{(2)}(0)}{2!}x^2+\frac{f^{(3)}(0)}{3!}x^3+\cdots + \frac{f^{(n)}(0)}{n!}x^n+\cdots

where -1<x<1. Now, filling in the derivatives f^{(n)}(0), we have the following derivation.

    \displaystyle \begin{aligned} (1-x)^{-r}&=1+rx+\frac{(r+1)r}{2!}x^2+\frac{(r+2)(r+1)r}{3!}x^3 \\& \ \ \ \ \ \ \ \ +\cdots+\frac{(r+y-1)(r+y-2) \cdots (r+1)r}{y!}x^y +\cdots \\&=1+(-1)^1 (-r)x+(-1)^2\frac{(-r)(-r-1)}{2!}x^2 \\& \ \ \ \ \ \ +(-1)^3 \frac{(-r)(-r-1)(-r-2)}{3!}x^3 +\cdots \\& \ \ \ \ \ \ +(-1)^y \frac{(-r)(-r-1) \cdots (-r-y+2)(-r-y+1)}{y!}x^y +\cdots  \\&=(-1)^0 \binom{-r}{0}x^0 +(-1)^1 \binom{-r}{1}x^1+(-1)^2 \binom{-r}{2}x^2 \\& \ \ \ \ \ \ +(-1)^3 \binom{-r}{3}x^3+\cdots +(-1)^y \binom{-r}{y}x^y+\cdots    \\&=\sum \limits_{y=0}^\infty (-1)^y \ \binom{-r}{y} \ x^y \end{aligned}

We can now show that the negative binomial probabilities in (3) sum to one. Let q=1-p.

    \displaystyle \begin{aligned} \sum \limits_{y=0}^\infty \binom{y+r-1}{y} \ p^r \ q^y &=p^r \ \sum \limits_{y=0}^\infty (-1)^y \ \binom{-r}{y} \ q^y \ \ \ \ \ \ \ \ \ \ \ \text{using } (4) \\&=p^r \ (1-q)^{-r} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{using } (5)\\&=p^r p^{-r} \\&=1 \end{aligned}
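
The following Python sketch checks this numerically for a non-integer value of r (the parameter values are hypothetical; the computation is done in log space to avoid overflow for large y).

from math import exp, lgamma, log

r, p = 2.7, 0.35  # hypothetical parameters; r is deliberately not an integer

def nb_pmf(y):
    # probability function (3); the binomial coefficient is gamma(y+r) / (gamma(r) * y!)
    log_coeff = lgamma(y + r) - lgamma(r) - lgamma(y + 1)
    return exp(log_coeff + r * log(p) + y * log(1 - p))

total = sum(nb_pmf(y) for y in range(500))
print(total)  # prints a value extremely close to 1; the tail beyond y = 500 is negligible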

________________________________________________________________________

The moment generating function

We now derive the moment generating function of the negative binomial distribution according to (3). The moment generating function is M(t)=E(e^{tY}), defined over all real numbers t for which the expectation exists. The following derivation does the job.

    \displaystyle \begin{aligned} M(t)&=E(e^{tY}) \\&=\sum \limits_{y=0}^\infty \ e^{t y} \ \binom{y+r-1}{y} \ p^r \ (1-p)^y \\&=p^r \ \sum \limits_{y=0}^\infty  \ \binom{y+r-1}{y} \ [(1-p) e^t]^y \\&=p^r \ \sum \limits_{y=0}^\infty  \ (-1)^y \binom{-r}{y} \ [(1-p) e^t]^y \ \ \ \ \ \ \ \ \ \ \ \text{using } (4) \\&=p^r \ [1-(1-p) \ e^t]^{-r} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{using } (5) \\&=\frac{p^r}{[1-(1-p) \ e^t]^{r}}\end{aligned}

The above moment generating function works for the negative binomial distribution with respect to (3) and thus to (2). For the distribution in (1), note that X=Y+r. Thus E(e^{tX})=E(e^{t(Y+r)})=e^{tr} \ E(e^{tY}). The moment generating function of (1) is simply the above moment generating function multiplied by the factor e^{tr}. To summarize, the moment generating functions for the three versions are:

    \displaystyle M_X(t)=E[e^{tX}]=\frac{p^r \ e^{tr}}{[1-(1-p) \ e^t]^{r}} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{for } (1)

    \displaystyle M_Y(t)=E[e^{tY}]=\frac{p^r}{[1-(1-p) \ e^t]^{r}} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{for } (2) \text{ and } (3)

The domain of the moment generating function is the set of all t for which M_X(t) or M_Y(t) is defined. Based on the form that it takes, we need 1-(1-p) \ e^t>0. This leads to the domain t<-\text{ln}(1-p).
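
As a sanity check, the following Python sketch compares the closed form of M_Y(t) with the series \sum e^{ty} P(Y=y) for a value of t inside the domain (the parameter values and the value of t are hypothetical).

from math import comb, exp

r, p, t = 3, 0.4, 0.2  # hypothetical; note t = 0.2 < -ln(1-p), which is about 0.51

def nb_pmf(y):
    # probability function (2) with integer r
    return comb(y + r - 1, y) * p**r * (1 - p)**y

mgf_closed = p**r / (1 - (1 - p) * exp(t))**r
mgf_series = sum(exp(t * y) * nb_pmf(y) for y in range(2000))
print(mgf_closed, mgf_series)  # the two values agree to many decimal places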

________________________________________________________________________

The mean and the variance

With the moment generating function derived in the above section, we can now focus on finding the moments of the negative binomial distribution. To find the moments, simply take the derivatives of the moment generating function and evaluate at t=0. For the distribution represented by the probability function in (3), we calculate the following:

    E(Y)=M_Y^{'}(0)

    E(Y^2)=M_Y^{(2)}(0)

    Var(Y)=E(Y^2)-E(Y)^2

After taking the first and second derivatives and evaluating them at t=0, the first and second moments are:

    \displaystyle E(Y)=r \ \frac{1-p}{p}

    \displaystyle E(Y^2)=\frac{r(1-p)[1+r(1-p)]}{p^2}

The following derives the variance.

    \displaystyle \begin{aligned} Var(Y)&=E(Y^2)-E(Y)^2 \\&=\frac{r(1-p)[1+r(1-p)]}{p^2}-\frac{r^2(1-p)^2}{p^2} \\&=\frac{r(1-p)[1+r(1-p)-r(1-p)]}{p^2} \\&=\frac{r(1-p)}{p^2}  \end{aligned}

The above formula is the variance for the three versions (1), (2) and (3). Note that Var(Y)>E(Y). In contrast, the variance of the Poisson distribution is identical to its mean. Thus in the situation where the variance of observed data is greater than the sample mean, the negative binomial distribution should be a better fit than the Poisson distribution.
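
The formulas for the mean and the variance can be confirmed numerically with scipy (a sketch with hypothetical parameter values; scipy.stats.nbinom uses the failure-counting versions (2)/(3)).

from scipy.stats import nbinom

r, p = 4, 0.3  # hypothetical parameter values
mean, var = nbinom.stats(r, p, moments='mv')
print(mean, r * (1 - p) / p)     # both print 9.333...
print(var, r * (1 - p) / p**2)   # both print 31.111...; note that var > mean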

________________________________________________________________________

The independent sum

There is an easy consequence that follows from the moment generating function derived above. The sum of several independent negative binomial distributions is also a negative binomial distribution. For example, suppose T_1,T_2, \cdots,T_n are independent negative binomial random variables (version (3)). Suppose each T_j has parameters r_j and p (the second parameter is identical). The moment generating function of the independent sum is the product of the individual moment generating functions. Thus the following is the moment generating function of T=T_1+\cdots+T_n.

    \displaystyle M_T(t)=E[e^{tT}]=\frac{p^g}{[1-(1-p) \ e^t]^{g}}

where g=r_1+\cdots+r_n. The moment generating function uniquely identifies the distribution. The above M_T(t) is that of a negative binomial distribution with parameters g and p according to (3).

A special case is that the sum of n independent geometric distributions is a negative binomial distribution with the r parameter being r=n. The following is the moment generating function of the sum W of n independent geometric distributions.

    \displaystyle M_W(t)=E[e^{tW}]=\frac{p^n}{[1-(1-p) \ e^t]^{n}}
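
The following Python simulation sketch illustrates this special case (the choices of n, p, the sample size and the seed are arbitrary): summing n independent geometric failure counts and comparing the empirical frequencies with the negative binomial probabilities.

from scipy.stats import nbinom

n, p = 5, 0.3  # hypothetical: sum of 5 independent geometric random variables
# each geometric variable counts the failures before the 1st success,
# i.e., it is a negative binomial random variable with r = 1
samples = nbinom.rvs(1, p, size=(100000, n), random_state=12345).sum(axis=1)

for k in range(5):
    print(k, (samples == k).mean(), nbinom.pmf(k, n, p))  # empirical vs theoretical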

________________________________________________________________________
\copyright \ \text{2015 by Dan Ma}

Picking Two Types of Binomial Trials

We motivate the discussion with the following example. The notation W \sim \text{binom}(n,p) denotes the statement that W has a binomial distribution with parameters n and p. In other words, W is the number of successes in a sequence of n independent Bernoulli trials where p is the probability of success in each trial.

Example 1
Suppose that a student took two multiple choice quizzes in a course for probability and statistics. Each quiz has 5 questions. Each question has 4 choices and only one of the choices is correct. Suppose that the student answered all the questions by pure guessing. Furthermore, the two quizzes are independent (i.e. results of one quiz will not affect the results of the other quiz). Let X be the number of correct answers in the first quiz and Y be the number of correct answers in the second quiz. Suppose the student was told by the instructor that she had a total of 4 correct answers in these two quizzes. What is the probability that she had 3 correct answers in the first quiz?

On the face of it, the example is all about the binomial distribution. Both X and Y have binomial distributions (both \sim \text{binom}(5,\frac{1}{4})). The sum X+Y also has a binomial distribution (\sim \text{binom}(10,\frac{1}{4})). The question that is being asked is a conditional probability, i.e., P(X=3 \lvert X+Y=4). Surprisingly, this conditional probability can be computed using the hypergeometric distribution. One can always work this problem from first principles using binomial distributions. As discussed below, for a problem such as Example 1, it is always possible to replace the binomial distributions with a thought process involving the hypergeometric distribution.

Here’s how to think about the problem. This student took the two quizzes and was given the news by the instructor that she had 4 correct answers in total. She now wonders what the probability of having 3 correct answers in the first quiz is. The thought process is this. She is to pick 4 questions from 10 questions (5 of them are from Quiz 1 and 5 of them are from Quiz 2). So she is picking 4 objects from a group of two distinct types of objects. This is akin to reaching into a jar that has 5 red balls and 5 blue balls and picking 4 balls without replacement. What is the probability of picking 3 red balls and 1 blue ball? The probability just described comes from a hypergeometric distribution. The following shows the calculation.

\displaystyle (1) \ \ \ \ P(X=3 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{3} \ \binom{5}{1}}{\displaystyle \binom{10}{4}}=\frac{50}{210}

We will show below why this works. Before we do that, let’s describe the above thought process. Whenever you have two independent binomial distributions X and Y with the same probability of success p (the number of trials does not have to be the same), the conditional distribution X \lvert X+Y=a is a hypergeometric distribution. Interestingly, the probability of success p has no bearing on this observation. For Example 1, we have the following calculation.

\displaystyle (2a) \ \ \ \ P(X=0 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{0} \ \binom{5}{4}}{\displaystyle \binom{10}{4}}=\frac{5}{210}

\displaystyle (2b) \ \ \ \ P(X=1 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{1} \ \binom{5}{3}}{\displaystyle \binom{10}{4}}=\frac{50}{210}

\displaystyle (2c) \ \ \ \ P(X=2 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{2} \ \binom{5}{2}}{\displaystyle \binom{10}{4}}=\frac{100}{210}

\displaystyle (2d) \ \ \ \ P(X=3 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{3} \ \binom{5}{1}}{\displaystyle \binom{10}{4}}=\frac{50}{210}

\displaystyle (2e) \ \ \ \ P(X=4 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{4} \ \binom{5}{0}}{\displaystyle \binom{10}{4}}=\frac{5}{210}

Interestingly, the conditional mean E(X \lvert X+Y=4)=2, while the unconditional mean E(X)=5 \times \frac{1}{4}=1.25. The fact that the conditional mean is higher is not surprising. The student was lucky enough to have obtained 4 correct answers by guessing. Given this, she had a greater chance of doing better on the first quiz.
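
The following Python sketch double-checks Example 1 both ways (assuming scipy is available): from first principles with binomial probabilities and with scipy.stats.hypergeom, whose arguments are the population size, the number of objects of the first type, and the number of draws.

from scipy.stats import binom, hypergeom

p = 0.25  # probability of guessing a question correctly

# first principles: P(X = 3 | X + Y = 4) with X, Y independent binom(5, p)
first_principles = binom.pmf(3, 5, p) * binom.pmf(1, 5, p) / binom.pmf(4, 10, p)

# hypergeometric shortcut: 10 questions total, 5 from Quiz 1, 4 drawn (the 4 correct answers)
shortcut = hypergeom.pmf(3, 10, 5, 4)

print(first_principles, shortcut)  # both print 50/210 = 0.238095...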

__________________________________________________
Why This Works

Suppose X \sim \text{binom}(5,p) and Y \sim \text{binom}(5,p) and they are independent. The joint distribution of X and Y has 36 points in the sample space. See the following diagram.

Figure 1

The probability attached to each point is

\displaystyle \begin{aligned}(3) \ \ \ \  P(X=x,Y=y)&=P(X=x) \times P(Y=y) \\&=\binom{5}{x} p^x (1-p)^{5-x} \times \binom{5}{y} p^y (1-p)^{5-y}  \end{aligned}

where x=0,1,2,3,4,5 and y=0,1,2,3,4,5.

The conditional probability P(X=k \lvert X+Y=4) involves 5 points as indicated in the following diagram.

Figure 2

The conditional probability P(X=k \lvert X+Y=4) is simply the probability of one of the 5 sample points as a fraction of the sum total of the 5 sample points encircled in the above diagram. The following is the sum total of the probabilities of the 5 points indicated in Figure 2.

\displaystyle \begin{aligned}(4) \ \ \ \  P(X+Y=4)&=P(X=0) \times P(Y=4)+P(X=1) \times P(Y=3)\\&\ \ \ \ +P(X=2) \times P(Y=2)+P(X=3) \times P(Y=1)\\&\ \ \ \ +P(X=4) \times P(Y=0)  \end{aligned}

We can plug (3) into (4) and work out the calculation. But (4) is actually equivalent to the following because X+Y \sim \text{binom}(10,p).

\displaystyle (5) \ \ \ \ P(X+Y=4)=\ \binom{10}{4} p^4 \ (1-p)^{6}

As stated earlier, the conditional probability P(X=k \lvert X+Y=4) is simply the probability of one of the 5 sample points as a fraction of the sum total of the 5 sample points encircled in Figure 2. Thus we have:

\displaystyle \begin{aligned}(6) \ \ \ \  P(X=k \lvert X+Y=4)&=\frac{P(X=k) \times P(Y=4-k)}{P(X+Y=4)} \\&=\frac{\displaystyle \binom{5}{k} p^k (1-p)^{5-k} \times \binom{5}{4-k} p^{4-k} (1-p)^{5-(4-k)}}{\displaystyle  \binom{10}{4} p^4 \ (1-p)^{6}}   \end{aligned}

After the terms involving p and 1-p cancel out, we have:

\displaystyle (7) \ \ \ \  P(X=k \lvert X+Y=4)=\frac{\displaystyle \binom{5}{k} \times \binom{5}{4-k}}{\displaystyle  \binom{10}{4}}

__________________________________________________
Summary

Suppose X \sim \text{binom}(N,p) and Y \sim \text{binom}(M,p) and they are independent. Then X+Y is also a binomial distribution, i.e., \sim \text{binom}(N+M,p). Suppose that both binomial experiments \text{binom}(N,p) and \text{binom}(M,p) have been performed and it is known that there are a successes in total. Then X \lvert X+Y=a has a hypergeometric distribution.

\displaystyle (8) \ \ \ \  P(X=k \lvert X+Y=a)=\frac{\displaystyle \binom{N}{k} \times \binom{M}{a-k}}{\displaystyle  \binom{N+M}{a}}

where k=0,1,2,3,\cdots,\text{min}(N,a).

As discussed earlier, think of the N trials in \text{binom}(N,p) as red balls and think of the M trials in \text{binom}(M,p) as blue balls in a jar. Think of the a successes as the number of balls you are about to draw from the jar. So you reach into the jar and select a balls without replacement. The calculation in (8) gives the probability that you select k red balls and a-k blue balls.

The probability of success p in the two binomial distributions has no bearing on the result since it gets canceled out in the derivation. One can always work a problem like Example 1 from first principles. But once the thought process using the hypergeometric distribution is understood, it is a great way to solve this type of problem: you can bypass the binomial distributions and go straight to the hypergeometric distribution.
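
Here is a short Python sketch of formula (8) that also confirms the observation that p cancels out (the sizes N, M, a and the values of p are hypothetical).

from math import comb
from scipy.stats import binom

def conditional_pmf(k, N, M, a):
    # formula (8): P(X = k | X + Y = a) as a hypergeometric probability
    return comb(N, k) * comb(M, a - k) / comb(N + M, a)

N, M, a = 6, 9, 5  # hypothetical sizes
for p in (0.1, 0.5, 0.8):  # the conditional probability should not depend on p
    for k in range(min(N, a) + 1):
        direct = binom.pmf(k, N, p) * binom.pmf(a - k, M, p) / binom.pmf(a, N + M, p)
        assert abs(direct - conditional_pmf(k, N, M, a)) < 1e-12
print("formula (8) matches the first-principles binomial calculation")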

__________________________________________________
Additional Practice
Practice problems are found in the following blog post.

How to pick binomial trials

Poisson as a Limiting Case of Binomial Distribution

In many binomial problems, the number of Bernoulli trials n is large, relatively speaking, and the probability of success p is small such that n p is of moderate magnitude. For example, consider problems that deal with rare events where the probability of occurrence is small (as a concrete example, counting the number of people with July 1 as their birthday out of a random sample of 1000 people). It is often convenient to approximate such binomial problems using the Poisson distribution. The justification for using the Poisson approximation is that the Poisson distribution is a limiting case of the binomial distribution. Now that cheap computing power is widely available, it is quite easy to use a computer or other computing devices to obtain exact binomial probabilities for experiments up to 1000 trials or more. Though the Poisson approximation may no longer be necessary for such problems, knowing how to get from binomial to Poisson is important for understanding the Poisson distribution itself.

Consider a counting process that describes the occurrences of a certain type of event of interest in a unit time interval subject to three simplifying assumptions (discussed below). We are interested in counting the number of occurrences of the event of interest in a unit time interval. As a concrete example, consider the number of cars arriving at an observation point on a certain highway in a period of time, say one hour. We wish to model the probability distribution of the number of cars that will arrive at the observation point on this particular highway in one hour. Let X be the random variable described by this probability distribution. We wish to know the probability that there are k cars arriving in one hour. We start by using a binomial distribution as an approximation to the probability P(X=k). We will see that upon letting n \rightarrow \infty, P(X=k) becomes a Poisson probability.

Suppose that we know E(X)=\alpha, perhaps an average obtained after observing cars at the observation points for many hours. The simplifying assumptions alluded to earlier are the following:

  1. The numbers of cars arriving in nonoverlapping time intervals are independent.
  2. The probability of one car arriving in a very short time interval of length h is \alpha h.
  3. The probability of having more than one car arriving in a very short time interval is essentially zero.

Assumption 1 means that a large number of cars arriving in one period does not imply that fewer cars will arrive in the next period, and vice versa. In other words, the number of cars that arrive in any one given moment does not affect the number of cars that will arrive subsequently. Knowing how many cars arrive in one minute will not help predict the number of cars arriving in the next minute.

Assumption 2 means that the rate of cars arriving is dependent only on the length of the time interval and not on when the time interval occurs (e.g. not on whether it is at the beginning of the hour or toward the end of the hour).

Assumption 3 allows us to think of a very short period of time as a Bernoulli trial. Thinking of the arrival of a car as a success, each short time interval results in either one success or one failure.

Assumption 2 tells us that non-overlapping short time intervals of the same length have identical probability of success. Overall, all three assumptions imply that non-overlapping short intervals of the same length are independent Bernoulli trials with identical probability of success, giving us the basis for applying the binomial distribution.

To start, we can break up the hour into 60 minutes (into 60 Bernoulli trials). We then consider the binomial distribution with n=60 and p=\frac{\alpha}{60}. So the following is an approximation to our desired probability distribution.

\displaystyle (1) \ \ \ \ \ P(X=k) \approx \binom{60}{k} \biggl(\frac{\alpha}{60}\biggr)^k \biggr(1-\frac{\alpha}{60}\biggr)^{60-k} \ \ \ \ \ k=0,1,2,\cdots, 60

Conceivably, there can be more than 1 car arriving in a minute and observing cars in a one-minute interval may not be a Bernoulli trial. For a one-minute interval to qualify as a Bernoulli trial, there is either no car arriving or 1 car arriving in that one minute. So we can break up an hour into 3,600 seconds (into 3,600 Bernoulli trials). We now consider the binomial distribution with n=3600 and p=\frac{\alpha}{3600}.

\displaystyle (2) \ \ \ \ \ P(X=k) \approx \binom{3600}{k} \biggl(\frac{\alpha}{3600}\biggr)^k \biggr(1-\frac{\alpha}{3600}\biggr)^{3600-k} \ \ \ \ \ k=0,1,2,\cdots, 3600

It is also conceivable that more than 1 car can arrive in one second and observing cars in one-second interval may still not qualify as a Bernoulli trial. So we need to get more granular. We can divide up the hour into n equal subintervals, each of length \frac{1}{n}. Assumption 3 ensures that each subinterval is a Bernoulli trial (either it is a success or a failure; one car arriving or no car arriving). Assumption 2 ensures that the non-overlapping subintervals all have the same probability of success. Assumption 1 tells us that all the n subintervals are independent. So breaking up the hour into n moments and counting the number of moments that are successes will result in a binomial distribution with parameters n and p=\frac{\alpha}{n}. So we are ready to proceed with the following approximation to our probability distribution P(X=k).

\displaystyle (3) \ \ \ \ \ P(X=k) \approx \binom{n}{k} \biggl(\frac{\alpha}{n}\biggr)^k \biggr(1-\frac{\alpha}{n}\biggr)^{n-k} \ \ \ \ \ k=0,1,2,\cdots, n

As we get more granular, n \rightarrow \infty. We show that the limit of the binomial probability in (3) is the Poisson distribution with parameter \alpha. We show the following.

\displaystyle (4) \ \ \ \ \ P(X=k) = \lim \limits_{n \rightarrow \infty} \binom{n}{k} \biggl(\frac{\alpha}{n}\biggr)^k \biggr(1-\frac{\alpha}{n}\biggr)^{n-k}=\frac{e^{-\alpha} \alpha^k}{k!} \ \ \ \ \ \ k=0,1,2,\cdots

In the derivation of (4), we need the following two mathematical tools. The statement (5) is one of the definitions of the mathematical constant e. In the statement (6), the integer n in the numerator is greater than the integer k in the denominator. It says that whenever we work with such a ratio of two factorials, the result is the product of n with the smaller integers down to (n-(k-1)). There are exactly k terms in the product.

\displaystyle (5) \ \ \ \ \ \lim \limits_{n \rightarrow \infty} \biggl(1+\frac{x}{n}\biggr)^n=e^x

\displaystyle (6) \ \ \ \ \ \frac{n!}{(n-k)!}=n(n-1)(n-2) \cdots (n-k+1) \ \ \ \ \ \ \ \  k<n

The following is the derivation of (4).

\displaystyle \begin{aligned}(7) \ \ \ \ \  P(X=k)&=\lim \limits_{n \rightarrow \infty} \binom{n}{k} \biggl(\frac{\alpha}{n}\biggr)^k \biggr(1-\frac{\alpha}{n}\biggr)^{n-k} \\&=\lim \limits_{n \rightarrow \infty} \ \frac{n!}{k! (n-k)!} \biggl(\frac{\alpha}{n}\biggr)^k \biggr(1-\frac{\alpha}{n}\biggr)^{n-k} \\&=\lim \limits_{n \rightarrow \infty} \ \frac{n(n-1)(n-2) \cdots (n-k+1)}{n^k} \biggl(\frac{\alpha^k}{k!}\biggr) \biggr(1-\frac{\alpha}{n}\biggr)^{n} \biggr(1-\frac{\alpha}{n}\biggr)^{-k} \\&=\biggl(\frac{\alpha^k}{k!}\biggr) \lim \limits_{n \rightarrow \infty} \ \frac{n(n-1)(n-2) \cdots (n-k+1)}{n^k} \\&\times \ \ \ \lim \limits_{n \rightarrow \infty} \biggr(1-\frac{\alpha}{n}\biggr)^{n} \ \lim \limits_{n \rightarrow \infty} \biggr(1-\frac{\alpha}{n}\biggr)^{-k} \\&=\frac{e^{-\alpha} \alpha^k}{k!} \end{aligned}

In (7), we have \displaystyle \lim \limits_{n \rightarrow \infty} \ \frac{n(n-1)(n-2) \cdots (n-k+1)}{n^k}=1. The reason being that the numerator is a polynomial where the leading term is n^k. Upon dividing by n^k and taking the limit, we get 1. Based on (5), we have \displaystyle \lim \limits_{n \rightarrow \infty} \biggr(1-\frac{\alpha}{n}\biggr)^{n}=e^{-\alpha}. For the last limit in the derivation we have \displaystyle \lim \limits_{n \rightarrow \infty} \biggr(1-\frac{\alpha}{n}\biggr)^{-k}=1.

We conclude with some comments. As the above derivation shows, the Poisson distribution is at heart a binomial distribution. When we divide the unit time interval into more and more subintervals (as the subintervals get more and more granular), the resulting binomial distribution behaves more and more like the Poisson distribution.
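
The following Python sketch illustrates this convergence numerically (the rate \alpha and the count k are chosen only for illustration): as n increases with p=\alpha / n, the binomial probability approaches the Poisson probability.

from scipy.stats import binom, poisson

alpha, k = 2.0, 3  # hypothetical rate and count
for n in (60, 3600, 100000):
    print(n, binom.pmf(k, n, alpha / n))  # approaches the Poisson value as n grows
print("Poisson:", poisson.pmf(k, alpha))  # e^(-2) * 2^3 / 3! = about 0.1804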

The three assumptions used in the derivation are called the Poisson postulates, which are the underlying assumptions that govern a Poisson process. Such a random process describes the occurrences of some type of event of interest (e.g. the arrivals of cars in our example) in a fixed period of time. The positive constant \alpha indicated in Assumption 2 is the parameter of the Poisson process, which can be interpreted as the rate of occurrences of the event of interest (or rate of changes, or rate of arrivals) in a unit time interval, meaning that the positive constant \alpha is the mean number of occurrences in the unit time interval. The derivation in (7) shows that whenever a certain type of event occurs according to a Poisson process with parameter \alpha, the counting variable of the number of occurrences in the unit time interval is distributed according to the Poisson distribution as indicated in (4).

If we observe the occurrences of events over intervals of length other than unit length, say, in an interval of length t, the counting process is governed by the same three postulates, with the modification to Assumption 2 that the rate of changes of the process is now \alpha t. The mean number of occurrences in the time interval of length t is now \alpha t. Assumption 2 now states that for any very short subinterval that constitutes a fraction h of the interval of length t, the probability of having one occurrence of the event in that subinterval is (\alpha t)h. Applying the same derivation, it can be shown that the number of occurrences (X_t) in a time interval of length t has the Poisson distribution with the following probability mass function.

\displaystyle (8) \ \ \ \ \ P(X_t=k)=\frac{e^{-\alpha t} \ (\alpha t)^k}{k!} \ \ \ \ \ \ \ \  k=0,1,2,\cdots

Relating Binomial and Negative Binomial

The negative binomial distribution has a natural interpretation as a waiting time until the arrival of the rth success (when the parameter r is a positive integer). The waiting time refers to the number of independent Bernoulli trials needed to reach the rth success. This interpretation of the negative binomial distribution gives us a good way of relating it to the binomial distribution. For example, if more than k failed Bernoulli trials are needed before the rth success arrives, then there can be at most r-1 successes in the first k+r trials. This tells us that the survival function of the negative binomial distribution is the cumulative distribution function (cdf) of a binomial distribution. In this post, we give the details behind this observation. A previous post on the negative binomial distribution is found here.

A random experiment resulting in two distinct outcomes (success or failure) is called a Bernoulli trial (e.g. head or tail in a coin toss, whether or not the birthday of a customer is the first of January, whether an insurance claim is above or below a given threshold, etc). Suppose a series of independent Bernoulli trials is performed until reaching the rth success, where the probability of success in each trial is p. Let X_r be the number of failures before the occurrence of the rth success. The following is the probability mass function of X_r.

\displaystyle (1) \ \ \ \ P(X_r=k)=\binom{k+r-1}{k} p^r (1-p)^k \ \ \ \ \ \ k=0,1,2,3,\cdots

By definition, the survival function and cdf of X_r are:

\displaystyle (2) \ \ \ \ P(X_r > k)=\sum \limits_{j=k+1}^\infty \binom{j+r-1}{j} p^r (1-p)^j \ \ \ \ \ \ k=0,1,2,3,\cdots

\displaystyle (3) \ \ \ \ P(X_r \le k)=\sum \limits_{j=0}^k \binom{j+r-1}{j} p^r (1-p)^j \ \ \ \ \ \ k=0,1,2,3,\cdots

For each non-negative integer k, let Y_{r+k} be the number of successes in performing a sequence of r+k independent Bernoulli trials where p is the probability of success. In other words, Y_{r+k} has a binomial distribution with parameters r+k and p.

If the random experiment requires more than k failures to reach the rth success, there are at most r-1 successes in the first k+r trials. Thus the survival function of X_r is the same as the cdf of a binomial distribution. Equivalently, the cdf of X_r is the same as the survival function of a binomial distribution. We have the following:

\displaystyle \begin{aligned}(4) \ \ \ \ P(X_r > k)&=P(Y_{k+r} \le r-1) \\&=\sum \limits_{j=0}^{r-1} \binom{k+r}{j} p^j (1-p)^{k+r-j} \ \ \ \ \ \ k=0,1,2,3,\cdots \end{aligned}

\displaystyle \begin{aligned}(5) \ \ \ \ P(X_r \le k)&=P(Y_{k+r} > r-1) \ \ \ \ \ \ k=0,1,2,3,\cdots \end{aligned}
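
The relations (4) and (5) can be checked numerically with scipy (a sketch with hypothetical values of r and p): scipy.stats.nbinom counts failures before the rth success, so its survival function is exactly P(X_r > k).

from scipy.stats import binom, nbinom

r, p = 4, 0.3  # hypothetical parameter values
for k in range(10):
    lhs = nbinom.sf(k, r, p)          # P(X_r > k)
    rhs = binom.cdf(r - 1, k + r, p)  # P(Y_{k+r} <= r - 1)
    assert abs(lhs - rhs) < 1e-12
print("relation (4) holds for the values checked")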

Remark
The relation (4) is analogous to the relationship between the Gamma distribution and the Poisson distribution. Recall that a Gamma distribution with shape parameter n and rate parameter \alpha, where n is a positive integer, can be interpreted as the waiting time until the nth change in a Poisson process with rate \alpha. Thus, if the nth change takes place after time t, there can be at most n-1 arrivals in the time interval [0,t]. Thus the survival function of this Gamma distribution is the same as the cdf of a Poisson distribution. The relation (4) is analogous to the following relation.

\displaystyle (6) \ \ \ \ \int_t^\infty \frac{\alpha^n}{(n-1)!} \ x^{n-1} \ e^{-\alpha x} \ dx=\sum \limits_{j=0}^{n-1} \frac{e^{-\alpha t} \ (\alpha t)^j}{j!}
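
This relation can also be verified numerically (a sketch assuming scipy is available; the values of \alpha, n and t are hypothetical). In scipy, the Gamma distribution with shape n and rate \alpha is specified with scale = 1/\alpha.

from scipy.stats import gamma, poisson

alpha, n, t = 1.5, 4, 2.0  # hypothetical rate, shape and time
lhs = gamma.sf(t, n, scale=1 / alpha)  # P(waiting time until the nth change > t)
rhs = poisson.cdf(n - 1, alpha * t)    # P(at most n - 1 changes in [0, t])
print(lhs, rhs)  # the two values agree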

A previous post on the negative binomial distribution is found here.