Every character is known but password is hard to crack

In this post, we discusses an example in which you are given a password (every character of it) and yet it is still very hard (or even impossible) to crack. Anyone who understands this example has a solid understanding of the binomial distribution. Here’s the example:

    Your friend John tells you that the password to his online bank account has 26 characters. The first character is the first letter in the English alphabet, the second character is the second letter in the English alphabet, the third character is the third letter in the English alphabet and so on.

    Now that your friend John has given you the key to his account, does that mean you can log onto his account to find out how much money he has, or to make financial transaction on his behalf or to enrich yourself?

    If this example sounds too good to be true, what is the catch?

Even though every character in the 26-character password is known, it is indeed a very strong password. How could this be? You may want to stop here and think about.

Indeed, if every character in John’s password is lower case or if every character is upper case, then his bank account is toast. But John’s password can be made very strong and very hard to crack if the password is case sensitive. The password given by John is not just one password, but is a large collection of passwords. In fact, there are over 67 millions possible passwords (67,108,864 to be exact). The following are two of the most obvious ones.

    a b c d e f g h i j k l m n o p q r s t u v w x y z (all lower case)

    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z (all upper case)

The following is another possible password. If this is the one John uses, it will be difficult to crack.

    a b C d e f G h i j k l M N o p Q R s T U v w X Y z (10 upper case letters)

Here’s another possibility.

    A B c D E f G H I J k l M N o P q R S t u v w X y z (14 upper case letters)

Each character in the password has two possibilities – lower case or upper case. Across all 26 characters, there are 2^{26} possibilities. This number is 67,108,864. So 2 raised to 26 is a little over 67 millions. So the password given by John is not just one password, but is a generic one with over 67 million possibilities. There is a one in 67 million chance in correctly guessing the correct password if John chooses the upper case letters randomly. This is much better odds than winning the Powerball lottery, one in 292,201,338, which one in 292 million. But it is still an undeniably strong password.

So John tells you the password, but has in fact not given up much secret. This is the case if he makes the case sensitivity truly random. Of course, once he sets his password, unless he has a great memory, he will need to write down the positions that are upper case. Otherwise, he will not be able to log back on. But that is a totally separate story.

___________________________________________________________________________

The binomial angle

A good way to represent the passwords is to view each one as a 26-character string of U and L (U stands for upper case and L stands for lower case). Then the above two passwords (the one with 10 upper case letters and the one with 14 upper case letters) are represented as follows:

    L L U L L L U L L L L L U U L L U U L U U L L U U L (10 U’s)

    U U L U U L U U U U L L U U L U L U U L L L L U L L (14 U’s)

As discussed, there are 67,108,864 many such strings. We can also think of each such string as the record of a random experiment consisting of tossing a coin 26 times. We record a U when we get a head and record an L when the coin toss results in a tail. In fact, this is the kind of records that John would need to keep when he sets the password. Such a string would tell him which characters in the password are in upper case and which characters are in lower case. On the other hand, hacking the password would be essentially trying to pinpoint one such string out of 67,108,864 many possibilities.

We know 67,108,864 is a large number. Let’s further demonstrate the challenge. In each string, the number of U’s can range from 0 to 26. How many of the strings have 0 U’s? Precisely one (all letters are L). This would be what most people would try first (all the letters are lower case). How many of the strings have exactly 1 U? Thinking about it carefully, we should arrive at the conclusion that there are 26 possibilities (the single U can be in any of the 26 positions). So if the would be hacker knows there there is only one upper case letter, then it would be easy to break the password. How many of the strings have exactly 2 U’s? If you try to write out all the possible cases, it may take some effort. There are 325 possibilities. So just having two upper case letters in the password seems to make it something that is approaching a strong password. But the problem is that the two U’s may be guessed easily. John may put the upper case letters on his initials (if his name is John Smith, he may make J and S upper case), or in other obvious letters.

How many of the strings have exactly 3 U’s? This will be really hard to write out by hand. There are 2,600 many possibilities. Why stop at having just 3 upper case letters? It is clear that the more upper case letters used in the password, the stronger it is and the harder it is to crack.

How do we know the precise number of possibilities for a given k, the number of U’s? The idea is that of choosing k number of letters out of 26 letters.

The number of ways of choosing k objects from a total of n objects is denoted by the notation \binom{n}{k}. Sometimes the notations C(n,k), _nC_k and C_{n,k} are used. Regardless of the notation, the calculation is

    \displaystyle \binom{n}{k}=\frac{n!}{k! (n-k)!}

The notation n! is the product of all the positive integers up to and including n (this is called n factorial). Thus 1!=1, 2!=2, 3!=6, 4!=24. To make the formula work correctly, we make 0!=1.

The following gives the first several calculations in the 26-character password example.

    \displaystyle \binom{26}{2}=\frac{26!}{2! \ 24!}=\frac{26 \cdot 25}{2}=325

    \displaystyle \binom{26}{3}=\frac{26!}{3! \ 23!}=\frac{26 \cdot 25 \cdot 24}{6}=2600

    \displaystyle \binom{26}{4}=\frac{26!}{4! \ 22!}=\frac{26 \cdot 25 \cdot 24 \cdot 23}{24}=14950

If the desire is to see the patterns, the remaining calculations can be done by using software (or at least a hand held calculator). The following table shows the results.

    \displaystyle \begin{array}{rrr} k &\text{ } & \displaystyle \binom{26}{k}  \\  \text{ } & \text{ } & \text{ }  \\  0 &\text{ } & 1  \\     1 &\text{ } & 26   \\  2 &\text{ } & 325   \\  3 &\text{ } & 2,600   \\  4 &\text{ } & 14,950   \\  5 &\text{ } & 65,780   \\  6 &\text{ } & 230,230   \\  7 &\text{ } & 657,800   \\  8 &\text{ } & 1,562,275   \\  9 &\text{ } & 3,124,550   \\  10 &\text{ } & 5,311,735   \\  11 &\text{ } & 7,726,160   \\  12 &\text{ } & 9,657,700   \\  13 &\text{ } & 10,400,600   \\  14 &\text{ } & 9,657,700   \\  15 &\text{ } & 7,726,160   \\  16 &\text{ } & 5,311,735   \\  17 &\text{ } & 3,124,550   \\  18 &\text{ } & 1,562,275   \\  19 &\text{ } & 657,800   \\  20 &\text{ } & 230,230   \\  21 &\text{ } & 65,780   \\  22 &\text{ } & 14,950   \\  23 &\text{ } & 2,600   \\  24 &\text{ } & 325   \\  25 &\text{ } & 26   \\  26 &\text{ } & 1   \\  \end{array}

The pattern is symmetrical. Having too few U’s or too many U’s produces weak passwords that may be easy to guess. Having 6 or 7 U’s seems to give strong passwords. Having half of the letters upper case (13 U’s) is the optimal, with the most possibilities (over 10 millions). Even if you are given partial information such as “half of the letters are in upper case”, you are still left with over 10 million possibilities to work with!

Dividing each of the above counts by 67,108,864 gives the relative weight (probability) of each case of having exactly k U’s.

    \displaystyle \begin{array}{rrr} k &\text{ } & P[X=k]  \\  \text{ } & \text{ } & \text{ }  \\  0 &\text{ } & 0.00000001  \\     1 &\text{ } & 0.00000039   \\  2 &\text{ } & 0.00000484   \\  3 &\text{ } & 0.00003874   \\  4 &\text{ } & 0.00022277   \\  5 &\text{ } & 0.00098020   \\  6 &\text{ } & 0.00343069   \\  7 &\text{ } & 0.00980198   \\  8 &\text{ } & 0.02327971   \\  9 &\text{ } & 0.04655942   \\  10 &\text{ } & 0.07915102   \\  11 &\text{ } & 0.11512876   \\  12 &\text{ } & 0.14391094   \\  13 &\text{ } & 0.15498102   \\  14 &\text{ } & 0.14391094   \\  15 &\text{ } & 0.11512876   \\  16 &\text{ } & 0.07915102   \\  17 &\text{ } & 0.04655942   \\  18 &\text{ } & 0.02327971   \\  19 &\text{ } & 0.00980198   \\  20 &\text{ } & 0.00343069   \\  21 &\text{ } & 0.00098020   \\  22 &\text{ } & 0.00022277   \\  23 &\text{ } & 0.00003874   \\  24 &\text{ } & 0.00000484   \\  25 &\text{ } & 0.00000039   \\  26 &\text{ } & 0.00000001   \\  \end{array}

The cases of X=k where k=11,12,13,14,15 add up to 67.3% of the 67,108,864 possibilities.

___________________________________________________________________________

The binomial distribution

Another way to look at it is that in setting the password, John is performing a sequence of 26 independent Bernoulli trials. Here, each trial has two outcomes, a or A, b or B, c or C and so on. For example, the lower case or upper case can be determined by a coin toss. Let X be the number of upper case letters in the 26-character password. Then the random variable X has a binomial distribution with n=26 (26 Bernoulli trials) and the probability of success p=0.5 in each trial, which is the probability that a character is upper case, assuming that he determines the upper/lower case by a coin toss. The following is the probability function:

    \displaystyle P(X=x)=\binom{26}{x} \biggl[\frac{1}{2}\biggr]^x \biggl[\frac{1}{2}\biggr]^{26-x}=\binom{26}{x} \biggl[\frac{1}{2}\biggr]^{26}

where x=0,1,2,\cdots,25,26. The quantity \displaystyle P(X=x) is the probability that the number of upper case letters is x. Here, \binom{26}{x} is the number of ways to choose x letters out of 26 letters and is computed by the formula indicated earlier.

Since the upper/lower case is determined randomly, another way to state the probability function of the random variable X is:

    \displaystyle P(X=x)=\displaystyle \frac{\binom{26}{x}}{2^{26}}=\frac{\binom{26}{x}}{67108864} \ \ \ \ \ \ \ \ \ x=0,1,2,3,\cdots,24,25,26

The expected value of this random variable X is 13. This is the average number of upper case letters if the case is determined randomly. This obviously produces the most optimally strong password. If John determines the case not at random, the security may not be as strong or the would be hacker may be able to guess.

Stepping away from the 26-character password example, here’s the probability function of a binomial distribution in general.

    \displaystyle P(X=x)=\binom{n}{x} \ p^x \ (1-p)^{n-x} \ \ \ \ \ \ \ \ x=0,1,2,3,\cdots,n-1,n

This model describes the random experiment of running n independent trials, where each trial has two outcomes (the technical term is Bernoulli trial). In each trial, the probability of one outcome (called success) is p and the probability of the other outcome (called failure) is 1-p. The random variable X counts the number of successes whenever such an experiment is performed. The probability P(X=x) gives the likelihood of achieving x successes.

As an example, if John has a bias toward lower case letter, then the probability of success (upper case) may be p=0.4 (assuming that the random variable X still counts the number of upper case letters). Then the average number of upper case letters in a randomly determined password is 26 x 0.4 = 10.4.

___________________________________________________________________________

Interesting binomial distribution problems

The problem of points and the dice problem are two famous probability problems that are in the history book as a result of a gambler seeking help from Pascal and Fermat.

___________________________________________________________________________
\copyright \ 2016 \text{ by Dan Ma}

Advertisements

The problem of points

There are two celebrated problems in probability that originated from the French professional gambler Chevalier de Méré (1607-1684). The problems were solved jointly by Blaise Pascal (1623-1662) and Pierre de Fermat (1601-1665) in a series of letters. The ideas discussed in these letters were often credited with having started probability theory. In a previous post, we discuss one of the problems posed by Chevalier de Méré to Pascal (the dice problem). In this post, we discuss the second problem – the problem of points.

___________________________________________________________________________

Describing the Problem

Here’s a description of the famous problem of points. Two players play a game of chance with the agreement that each player puts up equal stakes and that the first player who wins a certain number of rounds (or points) will collect the entire stakes. Suppose that the game is interrupted before either player has won. How do the players divide the stakes fairly?

It is clear that the player who is closer to winning should get a larger share of the stakes. Since the player who had won more points is closer to winning, the player with more points should receive a larger share of the stakes. How do we quantify the differential?

To describe the problem in a little more details, suppose that two players, A and B, play a series of points in a game such that player A wins each point with probability p and that player B wins each point with probability 1-p. The first player to win T points wins the game. Suppose that the game is stopped for some reason. At the time of stopping, player A has won a points and player B has won b points with a<T and b<T. How do they divide the stakes? Note that the pot is contributed equally by the two players.

In attacking the problem, Pascal’s idea is that the share of the stakes that is received by a player should be proportional to his/her probability of winning if the game were to continue at the point of stopping. Let’s explore this idea by looking at examples.

___________________________________________________________________________

Looking at the Problem thru Examples

The following examples are based on the following rule. Let’s say two players (A and B) play a series of points with equal probability of winning a point at each round. Each player puts up a stake of 32. The first player who wins four points take the entire stakes.

Example 1 (Fermat’s Approach)
Suppose that player A has won 2 points and player B has won one point right before the termination of the game. How can the stakes be divided fairly?

In the analysis, we assume that the game continues. Then player A needs 2 more points to win while player B needs 3 more points to win. We would like to calculate the probability that player A wins 2 points before player B winning 3 points. Consider the next 2 + 3 – 1 = 4 rounds (assuming one point per round). If player A wins at least 2 points in the next 4 rounds, player A win the game. The complement of this probability would be the probability that player B wins the game.

Let S (success) be the event that player A wins a point and let F (failure) be the event that player A loses a point (i.e. player B winning the point). Let’s write out all the outcomes of playing 4 points (this was the approach of Fermat). There are 16 such outcomes.

    \begin{array}{rrrrrr}          1 &  SSSS & * &  & \text{ } &  \\   2 &   SSSF & * &  & \text{ } &  \\   3 &   SSFS & * &  & \text{ } &  \\   4 &   SSFF & * &  & \text{ } &  \\   5 &   SFSS & * &  & \text{ } &  \\   6 &   SFSF & * &  & \text{ } &  \\   7 &   SFFS & * &  & \text{ } &  \\   8 &   SFFF & \text{ } &  & \text{ } &  \\         9 &   FSSS & * &  & \text{ } &  \\   10 &   FSSF & * &  & \text{ } &  \\   11 &   FSFS & * &  & \text{ } &  \\   12 &   FSFF & \text{ } &  & \text{ } &  \\   13 &   FFSS & * &  & \text{ } &  \\   14 &   FFSF & \text{ } &  & \text{ } &  \\    15 &  FFFS & \text{ } &  & \text{ } &  \\    16 &  FFFF & \text{ } &  & \text{ } &  \\    \end{array}

Player A wins In eleven of the outcomes (the ones with asterisk). Note that there are at least 2 S’s in the outcomes with asterisk. Thus the probability of player A winning is 11/16 = 0.6875. At the time of stopping the game, player A has a 68.75% chance of winning (if the game were to continue). The share for player A is 0.6875 x 64 = 44.

The example demonstrates the approach taken by Fermat. He essentially converted the original problem of points into an equivalent problem, i.e. finding the probability of player A winning the game if the game were to continue. Then he used combinatorial methods to count the number of cases that result in player A winning. In this example, the additional four points that are to be played are fictitious moves (the moves don’t have to be made) but are useful for finding the solution. The only draw back in Fermat’s approach is that he used counting. What if the number of points involved is large?

Example 2 (Pascal’s Approach)
The specifics of the example are the same as in Example 1. The listing out all possible cases in Example 1 makes the solution easy to see. But if the number of points is large, then the counting could become difficult to manage. What we need is an algorithm that is easy to use and is easy to implement on a computer.

Pascal essentially had the same thinking as Fermat, i.e. to base the solution on the probability of winning if the game were to continue. Pascal also understood that the original problem of points is equivalent to the problem of playing an additional series of points. In this example, playing additional 2 + 3 – 1 = 4 points. As in Example 1, we find the probability that player A wins 2 or more points in this series of 4 points. Pascal’s way to find this probability was based on what are now known as the Pascal’s triangle and the binomial distribution. We would use the following modern day notation:

    \displaystyle \sum \limits_{j=2}^4 \ \binom{4}{j} \ \frac{1}{2^4}=6 \times \frac{1}{2^4}+4 \times \frac{1}{2^4}+\frac{1}{2^4}=\frac{11}{16}=0.6875

Note that the above probability of 0.6875 is the probability of having at least 2 successes in 4 trials (with 0.5 probability of success in each trial). Anyone with a good understanding of the binomial distribution can carry out the calculation (or use software). Of course, this mathematical construct came from Pascal! To the contemporaries of Pascal and Fermat, this concept was definitely not commonplace.

Example 3
Suppose that player A has won 1 point and player B has won no point at the time of termination of the game. How can the prize money be divided fairly?

Based on the discussion of Example 1 and Example 2, player A needs to win at least 3 points to win the game and player B needs to win 4 or more points to win the game. The extended series of points would have 3 + 4 – 1 = 6 points. Player A then needs to win at least 3 points out of 6 points (at least 3 successes out of 6 trials). The following gives the probability of player A winning the extended series of plays.

    \displaystyle \sum \limits_{j=3}^6 \ \binom{6}{j} \ \frac{1}{2^6}=\frac{42}{64}=0.65625

With the total stakes being 64, the share for player A would be 64 x 42/64 = 42. The share for player B would be 22.

___________________________________________________________________________

General Discussion

We now discuss the ideas that are brought up in the examples. As indicated above, two players, A and B, contribute equally to the stakes and then play a series of points until one of them wins T points. The probability that player A wins a round (one point in each round) is p and the probability that player B wins a point is 1-p. Suppose that the game is stopped for some reason before either player has won. At the time of stopping, player A has won a points and player B has won b points with a<T and b<T. Let n=T-a and m=T-b. The key to solving the problem of points is to look at an extended play of n+m-1 points.

Here’s the great insight that came from Pascal and Fermat. They looked forward and not backward. They did not base the solution on the number of points that are already won. Instead, they focused on an extended series of points to determine the share of the winning. This additional play of points is “fictitious” but it helps clarify the process. In essence, they turned the original problem of points into a problem about this additional play of n+m-1 points.

The original problem is: what is the fair share for player A when the game is stopped prematurely with player A having won a points and player B having won b points? The equivalent problem is: what is the probability of player A winning at least n points of the next n+m-1 points where n=T-a and m=T-b (assuming the game has not stopped). Let’s call this probability P(n,m). Let’s examine the setting of this probability. Each point is like a Bernoulli trial – either a success (player A winning it) or a failure (player B winning it). There are n+m-1 trials. The probability of success is p in each trial. We wish to find the probability that are at least n successes. What is being described is a binomial distribution. The probability being sought is:

    \displaystyle P(n,m)=\sum \limits_{j=n}^{n+m-1} \ \binom{n+m-1}{j} \ p^j \ (1-p)^{n+m-1-j} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)

where n=T-a, m=T-b and that a is the number of points won by player A and b is the number of points won by player B at the time the game is terminated.

The probability P(n,m) is the probability of player A winning the game if the game were to continue at the point of termination. This probability P(n,m) is the proportion of the stakes that would be awarded to player A. Of course, the proportion that should be awarded to player B would be 1-P(n,m). The quantity P(n,m) can be obtained from the given parameters and by using any software that has a function for the binomial distribution.

The problem of points seems to have an easy solution since the answer P(n,m) is so accessible. Any one who understands the binomial distribution can comprehend. It is also easy to compute probabilities for a binomial distribution using calculator or software. One thing to keep in mind is that the solution looks accessible now because of the tools and concepts that came from trails blazed by Pascal and Fermat (and because of the computing tools that we have). Tools and concepts such as Pascal’s triangle and binomial distribution were unknown to people at the time of Pascal and Fermat.

___________________________________________________________________________

Working Backward

For us, the calculation in (1) is easily done using calculator or software. Pascal did not calculate P(n,m) directly and instead perform the calculation backward by using the following formula.

    P(n,m)=p \times P(n-1,m)+(1-p) \times P(n,m-1) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)

The formula can be derived mathematically. But doing that is not necessary. The quantity P(n,m) is the probability that player A will win n points before player B winning m points. We can derive the above formula by conditioning on the outcome of the first point. The quantity P(2,3) is calculated in Example 1 and Example 2. It is the average of two similar probabilities with smaller parameters.

    P(2,3)=\frac{1}{2} \times P(1,3)+\frac{1}{2} \times P(2,2)

Based on the recursive formula in (1), Pascal built up the answer backward, similar to the way a computer program is written. In fact, this recursive approach allows us to solve not just for one scenario, but for all the scenarios in a game of T points.

Example 4
We now revisit the problem in Examples 1 and 2. Recall that the game is to play for 4 points, i.e. the first player winning 4 points collects the entire stakes of 64 (32 is contributed by each player). Each player earns a point with probability 0.5. We now show how to divide the stakes when the game is stopped at every possible stopping point.

The following diagram (Figure 1) shows the table for the share awarded to player A. The table is empty except for the top row and rightmost column (the numbers in red). The number 64 shown in the last column would be the amount awarded to player A because player A has won 4 points. The number 0 in the top means that player B has won 4 games. So player A gets nothing. Note that the bottom row highlighted in orange shows the numbers of points that have been won player A. The bottom row highlighted in blue shows the remaining points that player A needs to win in order to win the entire stakes (these are the fictitious points). Similarly, the columns highlighted in orange and blue on the left show the same information for player B.

Figure 1 – The share of the stakes awarded to player A
problem-of-points-by-table-1a

Now, we can use the recursive formula in (1) to fill the table. Basically each cell is the average of the number above it and the number to the right. The parameter n in P(n,m) is a number in the blue row. The parameter m in P(n,m) is a number in the blue column. The following shows the results.

Figure 2 – The share of the stakes awarded to player A
problem-of-points-by-table-2a

For example, when player A has won 2 points and player B has won 1 point, the share for player A is 44 (the average of 32 and 56), the same answer as in Example 1 and Example 2. When player A has won 2 points and player B as won 2 points, both players are in equally competitive positions. Then each player gets 32. When player A has won 2 points and player B as won 3 points, the share for player A is 16 (the average of 0 and 32).

Essentially the formula in (2) is the idea of using smaller steps rather than the entire extended play of n+m-1 points. This idea of smaller steps was preferred by Pascal. At the point where player A needs n more points to win and player B needs m more points to win, the idea is to play one more point. Suppose that the players know the awards to the two players after one more round. Then they should split the difference between the future awards. The calculation should begin at the point where each player only needs one more point to win (the cell with 32 in Figure 2). In that cell, we know the awards after one additional round. Taking the average of 0 and 64 gives 32. From that cell, we move down and move left on the table. Keep repeating the process until the entire table is filled in.

There is a way to tweak the table approach to work for unequal winning probability of a point. Let’s say the probability of player A winning a point is 0.6. Then the probability of player B winning a point is 0.4. The value of a given cell in the table would be the weighted average of the cell on the right (0.6 weight) with the cell above it (0.4 weight). When we know the results from playing one more round, we assign 0.6 to the result of player A winning and 0.4 to the result of player B winning. The following table shows the results.

Figure 3 – The share of the stakes awarded to player A (with 0.6 weight)
problem-of-points-by-table-3a

The direct formula (1) or the table approach using the recursive formula (2) can be easily programmed in a computer. For Pascal, the table approach is a very attractive option since the calculation can be built up from lower parameter values. In the above configuration, simply fill in the right column (the entire stakes going to player A) and the top row (player A getting nothing). Then the remaining cells are obtained by weighted average as described in formula (2).

We present one more example.

Example 5
Suppose that player B is a casino and player A is a visitor at the casino playing for 12 points. The house edge is 2% so that the probability of player A winning a round is 0.48. If player A desires to leave after player A has won 9 points and the house has won 6 points, what is the proportion of the stakes that should be awarded to player A?

Playing for 12 points, player A needs 3 more points to win and the house needs 6 more points to win. So we need to analyze an extended play of 3 + 6 – 1 = 8 points. For player A to win the extended play, he needs to win at least 3 points.

    \displaystyle P(3,6)=\sum \limits_{j=3}^8 \ \binom{8}{j} \ 0.48^j \ 0.52^{8-j}=0.827631917

The answer can be obtained by computing each term in the sum (from j=3 to j=8). Another way is to use the BINOMDIST function in Excel as follows:

    = 1-BINOMDIST(2, 8, 0.48, TRUE) = 1-0.172368083 = 0.827631917

Based on the fair division method discussed in this blog post, player A deserves 82.76% of the stakes.

___________________________________________________________________________

Remarks

In their correspondence, Pascal and Fermat came up with convincing and consistent solution to the problem of points. The earlier solutions to the problem of points were not satisfactory (to all concerned) and are sometimes inconsistent. Division of stakes only basing on the numbers points that have been won may produce extreme results. For example, if player A has won 1 point and player B has won no points, then player A would get the entire stakes. For the game described in Figure 2, for the same scenario, player A gets 42 out of 64 (42 / 64 = 0.65625), which is far from 100%.

For more detailed information on the history of the problem of points, see “A History of Mathematics” by Victor J. Katz.

___________________________________________________________________________
\copyright \ 2016 \text{ by Dan Ma}

When a gambler asked a mathematician for help

When a gambler consistently loses large sum of money, what can he or she do? When one particular gambler, Chevalier de Méré (1607-1684), was losing a big fortune, he called a “mathematical help line”. In fact, his correspondence with Blaise Pascal (1623-1662) earned him a place in the history book. The problems that were presented by de Méré, jointly worked on by Pascal and Pierre de Fermat (1601-1665), are regarded as the beginning of the emerging academic field of probability. Chevalier de Méré was in need of help for two problems – the problem of points and on the dice problem that now bears his name. In this post we discuss the dice problem. The problem of points is discussed in the next post.

___________________________________________________________________________

The Dice Problem

There were two dice problems from Chevalier de Méré. The first game involves four rolls of a fair die. In this game, de Méré made bet with even odds on rolling at least one six when a fair die is rolled four times. His reasoning was that since getting a six in one roll of a die is \frac{1}{6} (correct), the chance of getting a six in four rolls of a die would be 4 \times \frac{1}{6}=\frac{2}{3} (incorrect). With the favorable odds of 67% of winning, he reasoned that betting with even odds would be a profitable proposition. Though his calculation was incorrect, he made considerable amount of money over many years playing this game.

The second game involves twenty four rolls of a pair of fair dice. The success in the first game emboldened de Méré to make even bet on rolling one or more double sixes in twenty four rolls of a pair of dice. His reasoning was that the chance for getting a double six in one roll of a pair of dice is \frac{1}{36} (correct). Then the chance of getting a double six in twenty four rolls of a pair of dice would be 24 \times \frac{1}{36}=\frac{2}{3} (incorrect). He again reasoned that betting with even odds would be profitable too.

But experience showed otherwise. As he lost a lot of money, he realized something was not quite right with the second game. In 1654, he challenged his renowned friend Blaise Pascal to find an explanation. The solution emerged in a series of letters between Pascal and Fermat. Out of this joint effort, a foundation was laid for the idea of probability as an academic subject. One particular idea that emerged was the Pascal triangle. Another one was the binomial distribution. In fact, anyone who understand the binomial distribution can very quickly see the faulty reasoning of de Méré.

___________________________________________________________________________

The Simulation

Before we get to the calculation, let’s simulate the games played by de Méré. Using random numbers generated from using the Rand() function in Excel, we simulated 100,000 iterations of each of the games. In our 100,000 simulations of the first game – rolling a die four times, there are 51,380 iterations with at least one six. This suggests that de Méré’s position would win more than half of the time, though not the \frac{2}{3} odds that he believed. But it was profitable for him nonetheless.

In our 100,000 simulations of the second game – rolling a pair of dice 24 times, there are only 49,211 iterations with at least one double six. This seems to support that de Méré’s position is a losing proposition, that he would be losing his bets more than half the time.

Of course, de Méré could have done similar simulation, though in a much smaller scale, by rolling the dice himself (say, 100 times). He could have seen the light much sooner.

___________________________________________________________________________

The Calculation

Let’s see why the first game was profitable for de Méré and why the second game was not.

The First Game
In a roll of a die, there are six possible outcomes: 1, 2, 3, 4, 5, 6. If the die is fair, the probability of getting a six is \frac{1}{6}. Likewise, the probability of getting no six in one roll of a fair die is \frac{5}{6}.

The probability of getting no six in four rolls is:

    \displaystyle P(\text{no six in four rolls})=\frac{5}{6} \times \frac{5}{6} \times \frac{5}{6} \times \frac{5}{6}=\biggl(\frac{5}{6}\biggr)^4=0.482253.

Thus in four rolls of a fair die, the probability of getting at least one six is:

    \displaystyle \begin{aligned}P(\text{at least one six in four rolls})&=\displaystyle 1 - P(\text{no six in four rolls})\\&=1 - 0.482253\\&=0.517747\end{aligned}

Thus the probability of getting at least one six in four rolls of a fair die is 0.517747. Out of 100 games, de Méré would on average win 52 games. Out of 1000 games, he would on average win 518 games. Suppose each bet is one French franc. Then de Méré would gain 36 francs for each 1000 francs in wagered. Thus he had the house’s edge of about 3.6%.

The Second Game
In a roll of a pair of dice, there are a total of 36 possible outcomes (i.e. the six outcomes of the first die combined with the six outcomes of the second die). Out of these 36 outcomes, only one of them is a double six. So, the probability of getting a double six is \frac{1}{36} in rolling a pair of dice. Likewise, the probability of not getting a double six is \frac{35}{36}.

The probability of getting no double six in 24 rolls of a pair of dice is:

\displaystyle \begin{aligned}P(\text{no double six in 24 rolls})= \biggl(\frac{35}{36}\biggr)^{24}=0.5086\end{aligned}

Thus the probability of getting at least one double six in 24 rolls is:

\displaystyle \begin{aligned}P(\text{at least one double six in 24 rolls})&=\displaystyle 1 - P(\text{no double six in 24 rolls})\\&=1 - 0.5086\\&=0.4914\end{aligned}

Thus the probability of getting at least one double six in 24 rolls of a pair of fair dice is 0.4914. On average, de Méré would only win about 49 games out of 100 and his opposing side would win about 51 games out of 100 games. Out of 1000 games, he would on average win 491 games (the opposing side would win on average 509 games). With each bet as one franc, the opposing side of de Méré would win 18 francs for each 1000 francs wagered (thus the opposing side having the house’s edge of about 1.8%).

The odds indicated by the simulations discussed above are in line with the calculated results. It would be interesting to known what action did de Méré take after learning the answers. Maybe he stopped playing the second game and only played the first game. Maybe he modified the second game so that the odds of winning for him was at least even (or better).

___________________________________________________________________________
\copyright \ 2016 \text{ by Dan Ma}

Calculating order statistics using multinomial probabilities

Consider a random sample X_1,X_2,\cdots,X_n drawn from a continuous distribution. Rank the sample items in increasing order, resulting in a ranked sample Y_1<Y_2<\cdots <Y_n where Y_1 is the smallest sample item, Y_2 is the second smallest sample item and so on. The items in the ranked sample are called the order statistics. Recently the author of this blog was calculating a conditional probability such as P(Y_2>4 \ | \ Y_2>3). One way to do this is to calculate the distribution function P(Y_2 \le t). What about the probability P(Y_5>4 \ | \ Y_2>3)? Since this one involves two order statistics, the author of this blog initially thought that calculating P(Y_5>4 \ | \ Y_2>3) would require knowing the joint probability distribution of the order statistics Y_1,Y_2,\cdots ,Y_n. It turns out that a joint distribution may not be needed. Instead, we can calculate a conditional probability such as P(Y_5>4 \ | \ Y_2>3) using multinomial probabilities. In this post, we demonstrate how this is done using examples. Practice problems are found in here.

The calculation described here can be lengthy and tedious if the sample size is large. To make the calculation more manageable, the examples here have relatively small sample size. To keep the multinomial probabilities easier to calculate, the random samples are drawn from a uniform distribution. The calculation for larger sample sizes from other distributions is better done using a technology solution. In any case, the calculation described here is a great way to practice working with order statistics and multinomial probabilities.

________________________________________________________________________

The multinomial angle

In this post, the order statistics Y_1<Y_2<\cdots <Y_n are resulted from ranking the random sample X_1,X_2,\cdots,X_n, which is drawn from a continuous distribution with distribution function F(x)=P(X \le x). For the jth order statistic, the calculation often begins with its distribution function P(Y_j \le t).

Here’s the thought process for calculating P(Y_j \le t). In drawing the random sample X_1,X_2,\cdots,X_n, make a note of the items \le t and the items >t. For the event Y_j \le t to happen, there must be at least j many sample items X_i that are \le t. For the event Y_j > t to happen, there can be only at most j-1 many sample items X_i \le t. So to calculate P(Y_j \le t), simply find out the probability of having j or more sample items \le t. To calculate P(Y_j > t), find the probability of having at most j-1 sample items \le t.

    \displaystyle P(Y_j \le t)=\sum \limits_{k=j}^n \binom{n}{k} \ \biggl[ F(t) \biggr]^k \ \biggl[1-F(x) \biggr]^{n-k} \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)

    \displaystyle P(Y_j > t)=\sum \limits_{k=0}^{j-1} \binom{n}{k} \ \biggl[ F(t) \biggr]^k \ \biggl[1-F(x) \biggr]^{n-k} \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)

Both (1) and (2) involve binomial probabilities and are discussed in this previous post. The probability of success is F(t)=P(X \le t) since we are interested in how many sample items that are \le t. Both the calculations (1) and (2) are based on counting the number of sample items in the two intervals \le t and >t. It turns out that when the probability that is desired involves more than one Y_j, we can also count the number of sample items that fall into some appropriate intervals and apply some appropriate multinomial probabilities. Let’s use an example to illustrate the idea.

Example 1
Draw a random sample X_1,X_2,\cdots,X_{10} from the uniform distribution U(0,4). The resulting order statistics are Y_1<Y_2< \cdots <Y_{10}. Find the following probabilities:

  • P(Y_4<2<Y_5<Y_6<3<Y_7)
  • P(Y_4<2<Y_6<3<Y_7)

For both probabilities, the range of the distribution is broken up into 3 intervals, (0, 2), (2, 3) and (3, 4). Each sample item has probabilities \frac{2}{4}, \frac{1}{4}, \frac{1}{4} of falling into these intervals, respectively. Multinomial probabilities are calculated on these 3 intervals. It is a matter of counting the numbers of sample items falling into each interval.

The first probability involves the event that there are 4 sample items in the interval (0, 2), 2 sample items in the interval (2, 3) and 4 sample items in the interval (3, 4). Thus the first probability is the following multinomial probability:

    \displaystyle \begin{aligned} P(Y_4<2<Y_5<Y_6<3<Y_7)&=\frac{10!}{4! \ 2! \ 4!} \biggl[\frac{2}{4} \biggr]^4 \ \biggl[\frac{1}{4} \biggr]^2 \ \biggl[\frac{1}{4} \biggr]^4 \\&\text{ } \\&=\frac{50400}{1048567} \\&\text{ } \\&=0.0481  \end{aligned}

For the second probability, Y_5 does not have to be greater than 2. Thus there could be 5 sample items less than 2. So we need to add one more case to the above probability (5 sample items to the first interval, 1 sample item to the second interval and 4 sample items to the third interval).

    \displaystyle \begin{aligned} P(Y_4<2<Y_6<3<Y_7)&=\frac{10!}{4! \ 2! \ 4!} \biggl[\frac{2}{4} \biggr]^4 \ \biggl[\frac{1}{4} \biggr]^2 \ \biggl[\frac{1}{4} \biggr]^4 \\& \ \ \ \ + \frac{10!}{5! \ 1! \ 4!} \biggl[\frac{2}{4} \biggr]^5 \ \biggl[\frac{1}{4} \biggr]^1 \ \biggl[\frac{1}{4} \biggr]^4 \\&\text{ } \\&=\frac{50400+40320}{1048567} \\&\text{ } \\&=\frac{90720}{1048567} \\&\text{ } \\&=0.0865  \end{aligned}

Example 2
Draw a random sample X_1,X_2,\cdots,X_6 from the uniform distribution U(0,4). The resulting order statistics are Y_1<Y_2< \cdots <Y_6. Find the probability P(1<Y_2<Y_4<3).

In this problem the range of the distribution is broken up into 3 intervals (0, 1), (1, 3) and (3, 4). Each sample item has probabilities \frac{1}{4}, \frac{2}{4}, \frac{1}{4} of falling into these intervals, respectively. Multinomial probabilities are calculated on these 3 intervals. It is a matter of counting the numbers of sample items falling into each interval. The counting is a little bit more involved here than in the previous example.

The example is to find the probability that both the second order statistic Y_2 and the fourth order statistic Y_4 fall into the interval (1,3). To solve this, determine how many sample items that fall into the interval (0,1), (1,3) and (3,4). The following points detail the counting.

  • For the event 1<Y_2 to happen, there can be at most 1 sample item in the interval (0,1).
  • For the event Y_4<3 to happen, there must be at least 4 sample items in the interval (0,3). Thus if the interval (0,1) has 1 sample item, the interval (1,3) has at least 3 sample items. If the interval (0,1) has no sample item, the interval (1,3) has at least 4 sample items.

The following lists out all the cases that satisfy the above two bullet points. The notation [a, b, c] means that a sample items fall into (0,1), b sample items fall into the interval (1,3) and c sample items fall into the interval (3,4). So a+b+c=6. Since the sample items are drawn from U(0,4), the probabilities of a sample item falling into intervals (0,1), (1,3) and (3,4) are \frac{1}{4}, \frac{2}{4} and \frac{1}{4}, respectively.

    [0, 4, 2]
    [0, 5, 1]
    [0, 6, 0]
    [1, 3, 2]
    [1, 4, 1]
    [1, 5, 0]

    \displaystyle \begin{aligned} \frac{6!}{a! \ b! \ c!} \ \biggl[\frac{1}{4} \biggr]^a \ \biggl[\frac{2}{4} \biggr]^b \ \biggl[\frac{1}{4} \biggr]^c&=\frac{6!}{0! \ 4! \ 2!} \ \biggl[\frac{1}{4} \biggr]^0 \ \biggl[\frac{2}{4} \biggr]^4 \ \biggl[\frac{1}{4} \biggr]^2=\frac{240}{4096} \\&\text{ } \\&=\frac{6!}{0! \ 5! \ 1!} \ \biggl[\frac{1}{4} \biggr]^0 \ \biggl[\frac{2}{4} \biggr]^5 \ \biggl[\frac{1}{4} \biggr]^1=\frac{192}{4096} \\&\text{ } \\&=\frac{6!}{0! \ 6! \ 0!} \ \biggl[\frac{1}{4} \biggr]^0 \ \biggl[\frac{2}{4} \biggr]^6 \ \biggl[\frac{1}{4} \biggr]^0=\frac{64}{4096} \\&\text{ } \\&=\frac{6!}{1! \ 3! \ 2!} \ \biggl[\frac{1}{4} \biggr]^1 \ \biggl[\frac{2}{4} \biggr]^3 \ \biggl[\frac{1}{4} \biggr]^2=\frac{480}{4096} \\&\text{ } \\&=\frac{6!}{1! \ 4! \ 1!} \ \biggl[\frac{1}{4} \biggr]^1 \ \biggl[\frac{2}{4} \biggr]^4 \ \biggl[\frac{1}{4} \biggr]^1=\frac{480}{4096} \\&\text{ } \\&=\frac{6!}{1! \ 5! \ 0!} \ \biggl[\frac{1}{4} \biggr]^1 \ \biggl[\frac{2}{4} \biggr]^5 \ \biggl[\frac{1}{4} \biggr]^0=\frac{192}{4096} \\&\text{ } \\&=\text{sum of probabilities }=\frac{1648}{4096}=0.4023\end{aligned}

So in randomly drawing 6 items from the uniform distribution U(0,4), there is a 40% chance that the second order statistic and the fourth order statistic are between 1 and 3.

________________________________________________________________________

More examples

The method described by the above examples is this. When looking at the event described by the probability problem, the entire range of distribution is broken up into several intervals. Imagine the sample items X_i are randomly being thrown into these interval (i.e. we are sampling from a uniform distribution). Then multinomial probabilities are calculated to account for all the different ways sample items can land into these intervals. The following examples further illustrate this idea.

Example 3
Draw a random sample X_1,X_2,\cdots,X_7 from the uniform distribution U(0,5). The resulting order statistics are Y_1<Y_2< \cdots <Y_7. Find the following probabilities:

  • P(1<Y_1<3<Y_4<4)
  • P(3<Y_4<4 \ | \ 1<Y_1<3)

The range is broken up into the intervals (0, 1), (1, 3), (3, 4) and (4, 5). The sample items fall into these intervals with probabilities \frac{1}{5}, \frac{2}{5}, \frac{1}{5} and \frac{1}{5}. The following details the counting for the event 1<Y_1<3<Y_4<4:

  • There are no sample items in (0, 1) since 1<Y_1.
  • Based on Y_1<3<Y_4, there are at least one sample item and at most 3 sample items in (0, 3). Thus in the interval (1, 3), there are at least one sample item and at most 3 sample items since there are none in (0, 1).
  • Based on Y_4<4, there are at least 4 sample items in the interval (0, 4). Thus the count in (3, 4) combines with the count in (1, 3) must be at least 4.
  • The interval (4, 5) simply receives the left over sample items not in the previous intervals.

The notation [a, b, c, d] lists out the counts in the 4 intervals. The following lists out all the cases described by the above 5 bullet points along with the corresponding multinomial probabilities, with two of the probabilities set up.

    \displaystyle [0, 1, 3, 3] \ \ \ \ \ \ \frac{280}{78125}=\frac{7!}{0! \ 1! \ 3! \ 3!} \ \biggl[\frac{1}{5} \biggr]^0 \ \biggl[\frac{2}{5} \biggr]^1 \ \biggl[\frac{1}{5} \biggr]^3 \ \biggl[\frac{1}{5} \biggr]^3

    \displaystyle [0, 1, 4, 2] \ \ \ \ \ \ \frac{210}{78125}

    \displaystyle [0, 1, 5, 1] \ \ \ \ \ \ \frac{84}{78125}

    \displaystyle [0, 1, 6, 0] \ \ \ \ \ \ \frac{14}{78125}

    \displaystyle [0, 2, 2, 3] \ \ \ \ \ \ \frac{840}{78125}

    \displaystyle [0, 2, 3, 2] \ \ \ \ \ \ \frac{840}{78125}

    \displaystyle [0, 2, 4, 1] \ \ \ \ \ \ \frac{420}{78125}

    \displaystyle [0, 2, 5, 0] \ \ \ \ \ \ \frac{84}{78125}

    \displaystyle [0, 3, 1, 3] \ \ \ \ \ \ \frac{1120}{78125}=\frac{7!}{0! \ 3! \ 1! \ 3!} \ \biggl[\frac{1}{5} \biggr]^0 \ \biggl[\frac{2}{5} \biggr]^3 \ \biggl[\frac{1}{5} \biggr]^1 \ \biggl[\frac{1}{5} \biggr]^3

    \displaystyle [0, 3, 2, 2] \ \ \ \ \ \ \frac{1680}{78125}

    \displaystyle [0, 3, 3, 1] \ \ \ \ \ \ \frac{1120}{78125}

    \displaystyle [0, 3, 4, 0] \ \ \ \ \ \ \frac{280}{78125}

Summing all the probabilities, \displaystyle P(1<Y_1<3<Y_4<4)=\frac{6972}{78125}=0.08924. Out of the 78125 many different ways the 7 sample items can land into these 4 intervals, 6972 of them would satisfy the event 1<Y_1<3<Y_4<4.

++++++++++++++++++++++++++++++++++

We now calculate the second probability in Example 3.

    \displaystyle P(3<Y_4<4 \ | \ 1<Y_1<3)=\frac{P(1<Y_1<3<Y_4<4)}{P(1<Y_1<3)}

First calculate P(1<Y_1<3)=P(Y_1<3)-P(Y_1<1). The probability P(Y_1<t) is the probability of having at least 1 sample item less than t, which is the complement of the probability of all sample items greater than t.

    \displaystyle \begin{aligned} P(1<Y_1<3)&=P(Y_1<3)-P(Y_1<1) \\&=1-\biggl( \frac{2}{5} \biggr)^7 -\biggl[1-\biggl( \frac{4}{5} \biggr)^7 \biggr] \\&=\frac{77997-61741}{78125} \\&=\frac{16256}{78125} \end{aligned}

The event 1<Y_1<3 can occur in 16256 ways. Out of these many ways, 6972 of these satisfy the event 1<Y_1<3<Y_4<4. Thus we have:

    \displaystyle P(3<Y_4<4 \ | \ 1<Y_1<3)=\frac{6972}{16256}=0.4289

Example 4
Draw a random sample X_1,X_2,X_3,X_4,X_5 from the uniform distribution U(0,5). The resulting order statistics are Y_1<Y_2<Y_3<Y_4 <Y_5. Consider the conditional random variable Y_4 \ | \ Y_2 >3. For this conditional distribution, find the following:

  • P( Y_4 \le t \ | \ Y_2 >3)
  • f_{Y_4}(t \ | \ Y_2 >3)
  • E(Y_4 \ | \ Y_2 >3)

where 3<t<5. Note that f_{Y_4}(t | \ Y_2 >3) is the density function of Y_4 \ | \ Y_2 >3.

Note that

    \displaystyle P( Y_4 \le t \ | \ Y_2 >3)=\frac{P(3<Y_2<Y_4 \le t)}{P(Y_2 >3)}

In finding P(3<Y_2<Y_4 \le t), the range (0, 5) is broken up into 3 intervals (0, 3), (3, t) and (t, 5). The sample items fall into these intervals with probabilities \frac{3}{5}, \frac{t-3}{5} and \frac{5-t}{5}.

Since Y_2 >3, there is at most 1 sample item in the interval (0, 3). Since Y_4 \le t, there are at least 4 sample items in the interval (0, t). So the count in the interval (3, t) and the count in (0, 3) should add up to 4 or more items. The following shows all the cases for the event 3<Y_2<Y_4 \le t along with the corresponding multinomial probabilities.

    \displaystyle [0, 4, 1] \ \ \ \ \ \ \frac{5!}{0! \ 4! \ 1!} \ \biggl[\frac{3}{5} \biggr]^0 \ \biggl[\frac{t-3}{5} \biggr]^4 \ \biggl[\frac{5-t}{5} \biggr]^1

    \displaystyle [0, 5, 0] \ \ \ \ \ \ \frac{5!}{0! \ 5! \ 0!} \ \biggl[\frac{3}{5} \biggr]^0 \ \biggl[\frac{t-3}{5} \biggr]^5 \ \biggl[\frac{5-t}{5} \biggr]^0

    \displaystyle [1, 3, 1] \ \ \ \ \ \ \frac{5!}{1! \ 3! \ 1!} \ \biggl[\frac{3}{5} \biggr]^1 \ \biggl[\frac{t-3}{5} \biggr]^3 \ \biggl[\frac{5-t}{5} \biggr]^1

    \displaystyle [1, 4, 0] \ \ \ \ \ \ \frac{5!}{1! \ 4! \ 0!} \ \biggl[\frac{3}{5} \biggr]^1 \ \biggl[\frac{t-3}{5} \biggr]^4 \ \biggl[\frac{5-t}{5} \biggr]^0

After carrying the algebra and simplifying, we have the following:

    \displaystyle P(3<Y_2<Y_4 \le t)=\frac{-4t^5+25t^4+180t^3-1890t^2+5400t-5103}{3125}

For the event Y_2 >3 to happen, there is at most 1 sample item less than 3. So we have:

    \displaystyle P(Y_2 >3)=\binom{5}{0} \ \biggl[\frac{3}{5} \biggr]^0 \ \biggl[\frac{2}{5} \biggr]^5 +\binom{5}{1} \ \biggl[\frac{3}{5} \biggr]^1 \ \biggl[\frac{2}{5} \biggr]^4=\frac{272}{3125}

    \displaystyle P( Y_4 \le t \ | \ Y_2 >3)=\frac{-4t^5+25t^4+180t^3-1890t^2+5400t-5103}{272}

Then the conditional density is obtained by differentiating P( Y_4 \le t \ | \ Y_2 >3).

    \displaystyle f_{Y_4}(t \ | \ Y_2 >3)=\frac{-20t^4+100t^3+540t^2-3750t+5400}{272}

The following gives the conditional mean E(Y_4 \ | \ Y_2 >3).

    \displaystyle \begin{aligned} E(Y_4 \ | \ Y_2 >3)&=\frac{1}{272} \ \int_3^5 t(-20t^4+100t^3+540t^2-3750t+5400) \ dt \\&=\frac{215}{51}=4.216 \end{aligned}

To contrast, the following gives the information on the unconditional distribution of Y_4.

    \displaystyle f_{Y_4}(t)=\frac{5!}{3! \ 1! \ 1!} \ \biggl[\frac{t}{5} \biggr]^3 \ \biggl[\frac{1}{5} \biggr] \ \biggl[ \frac{5-t}{5} \biggr]^1=\frac{20}{3125} \ (5t^3-t^4)

    \displaystyle E(Y_4)=\frac{20}{3125} \ \int_0^5 t(5t^3-t^4) \ dt=\frac{10}{3}=3.33

The unconditional mean of Y_4 is about 3.33. With the additional information that Y_2 >3, the average of Y_4 is now 4.2. So a higher value of Y_2 pulls up the mean of Y_4.

________________________________________________________________________

Practice problems

Practice problems to reinforce the calculation are found in the problem blog, a companion blog to this blog.

________________________________________________________________________
\copyright \ \text{2015 by Dan Ma}

Confidence intervals for San Francisco rainfall

When estimating population percentiles, there is a way to do it that is distribution free. Draw a random sample from the population of interest and take the middle element in the random sample as an estimate of the population median. Furthermore, we can even attach a confidence interval to this estimate of median without knowing (or assuming) a probability distribution of the underlying phenomenon. This “distribution free” method is shown in the post called Confidence intervals for percentiles. In this post, we give an additional example using annual rainfall data in San Francisco to illustrate this approach of non-parametric inference using order statistics.

________________________________________________________________________

San Francisco rainfall data

The following table shows the annual rainfall data in San Francisco (in inches) from 1960-2013 (data source). The table consits of 54 measurements and is sorted in increasing order from left to right (and from top to bottom). Each annual rainfall measurement is from July of that year to June of the following year. The driest year (7.97 inches) is 1975, the period from July 1975 to June 1976. The wettest year (47.22 inches) is 1997, which is the period from July 1997 to June 1998. The most recent data point is the fifth measurement 12.54 inches (the period from July 2013 to June 2014).

    \displaystyle \begin{bmatrix} 7.97&\text{ }&11.06&\text{ } &11.06&\text{ }&12.32&\text{ }&12.54  \\ 13.86&\text{ }&13.87&\text{ } &14.08&\text{ }&14.32&\text{ }&14.46    \\ 15.22&\text{ }&15.39&\text{ } &15.64&\text{ }&16.33&\text{ }&16.61   \\ 16.89&\text{ }&17.43&\text{ } &17.50&\text{ }&17.65&\text{ }&17.74   \\ 18.11&\text{ }&18.26&\text{ } &18.74&\text{ }&18.79&\text{ }&19.20  \\ 19.47&\text{ }&20.01&\text{ } &20.54&\text{ }&20.80&\text{ }&22.15  \\ 22.29&\text{ }&22.47&\text{ } &22.63&\text{ }&23.49&\text{ }&23.87  \\ 24.09&\text{ }&24.49&\text{ } &24.89&\text{ }&24.89&\text{ }&25.03  \\ 25.09&\text{ }&26.66&\text{ } &26.87&\text{ }&27.76&\text{ }&28.68  \\ 28.87&\text{ }&29.41&\text{ }&31.87&\text{ } &34.02&\text{ }&34.36  \\ 34.43&\text{ }&37.10&\text{ }&38.17&\text{ } &47.22&\text{ }&\text{ }    \end{bmatrix}

Using the above data, estimate the median, the lower quartile (25th percentile) and the upper quartile (the 75th percentile) of the annual rainfall in San Francisco. Then find a reasonably good confidence interval for each of the three population percentiles.

________________________________________________________________________

Basic facts about order statistics

Let’s recall some basic facts from the following previous posts:

Let’s say we have a random sample X_1,X_2,\cdots,X_n drawn from a population whose percentiles are unknown and we wish to estimate them. Rank the items of the random sample to obtain the order statistics Y_1<Y_2<\cdots < Y_n. In an ideal setting, the measurements are supposed to arise from a continuous distribution. So the chance of a tie among the Y_j is zero. But this assumption may not hold on occasions. There are some ties in the San Francisco rainfall data (e.g. the second and third data point). The small number of ties will not affect the calculation performed below.

The reason that we can use the order statistics Y_j to estimate the population percentiles is that the expected percentage of the population below Y_j is about the same as the percentage of the sample items less than Y_j. According to the explanation in the second post listed above (link), the order statistic Y_j is expected to be above 100p percent of the population where p=\frac{j}{n+1}. In fact, the order statistics Y_1<Y_2<\cdots < Y_n are expected to divide the population in roughly equal segments. More specifically the expected percentage of the population in between Y_{j-1} and Y_j is 100h where h=\frac{1}{n+1}.

The above explanation justifies the use of the order statistic Y_j as the sample 100pth percentile where p=\frac{j}{n+1}.

The sample size is n= 54 in the San Francisco rainfall data. Thus the order statistic Y_{11} is the sample 20th percentile and can be taken as an estimate of the population 20th percentile for the San Francisco annual rainfall. Here the realized value of Y_{11} is 15.22.

With \frac{45}{54+1}=0.818, the order statistic Y_{45} is the sample 82nd percentile and is taken as an estimate of the population 82nd percentile for the San Francisco annual rainfall. The realized value of Y_{45} is 28.68 inches.

The key for constructing confidence interval for percentiles is to calculate the probability P(Y_i < \tau_p < Y_j). This is the probability that the 100pth percentile, where 0<p<1, is in between Y_i and Y_j. Let's look at the median \tau_{0.5}. For Y_i<\tau_{0.5} to happen, there must be at least i many sample items less than the median \tau_{0.5}. For \tau_{0.5}<Y_j to happen, there can be at most j-1 many sample items less than the median \tau_{0.5}. Thus in the random draws of the sample items, in order for the event Y_i < \tau_{0.5} < Y_j to occur, there must be at least i sample items and at most j-1 sample items that are less than \tau_{0.5}. In other words, in n Bernoulli trials, there at at least i and at most j-1 successes where the probability of success is P(X<\tau_{0.5})= 0.5. The following is the probability P(Y_i < \tau_{0.5} < Y_j):

    \displaystyle P(Y_i < \tau_{0.5} < Y_j)=\sum \limits_{k=i}^{j-1} \binom{n}{k} \ 0.5^k \ 0.5^{n-k}=1 - \alpha

Then interval Y_i < \tau_{0.5} < Y_j is taken to be the 100(1-\alpha)% confidence interval for the unknown population median \tau_{0.5}. Note that this confidence interval is constructed without knowing (or assuming) anything about the underlying distribution of the population.

Consider the 100pth percentile where 0<p<1. In order for the event Y_i < \tau_{p} < Y_j to occur, there must be at least i sample items and at most j-1 sample items that are less than \tau_{p}. This is equivalent to n Bernoulli trials resulting in at least i successes and at most j-1 successes where the probability of success is P(X<\tau_{p})=p.

    \displaystyle P(Y_i < \tau_{p} < Y_j)=\sum \limits_{k=i}^{j-1} \binom{n}{k} \ p^k \ (1-p)^{n-k}=1 - \alpha

Then interval Y_i < \tau_{p} < Y_j is taken to be the 100(1-\alpha)% confidence interval for the unknown population percentile \tau_{p}. As mentioned earlier, this confidence interval does not need to rely on any information about the distribution of the population and is said to be distribution free. It only relies on a probability statement that involves the binomial distribution in describing the positioning of the sample items. In the past, people used normal approximation to the binomial to estimate this probability. The normal approximation should be no longer needed as computing software is now easily available. For example, binomial probabilities can be computed in Excel for number of trials a million or more.

________________________________________________________________________

Percentiles of annual rainfall

Using the above data, estimate the median, the lower quartile (25th percentile) and the upper quartile (the 75th percentile) of the annual rainfall in San Francisco. Then find a reasonably good confidence interval for each of the three population percentiles.

The sample size is n= 54. The middle two data elements in the sample is y_{27}=20.01 and y_{28}=20.54. They are realizations of the order statistics Y_{27} and Y_{28}. So in this example, \frac{27}{54+1}=0.49 and \frac{28}{54+1}=0.509. Thus the order statistic Y_{27} is expected to be greater than about 49% of the population and Y_{28} is expected to be greater than about 51% of the population. So neither Y_{27} nor Y_{28} is an exact fit. So we take the average of the two as an estimate of the population median:

    \displaystyle \hat{\tau}_{0.5}=\frac{20.01+20.54}{2}=20.275

Looking for confidence intervals, we consider the intervals (Y_{21},Y_{34}), (Y_{20},Y_{35}), (Y_{19},Y_{36}) and (Y_{18},Y_{37}). The following shows the confidence levels.

    \displaystyle P(Y_{21} < \tau_{0.5} < Y_{34})=\sum \limits_{k=21}^{33} \binom{54}{k} \ 0.5^k \ (0.5)^{54-k}=0.924095271

    \displaystyle P(Y_{20} < \tau_{0.5} < Y_{35})=\sum \limits_{k=20}^{34} \binom{54}{k} \ 0.5^k \ (0.5)^{54-k}=0.959776436

    \displaystyle P(Y_{19} < \tau_{0.5} < Y_{36})=\sum \limits_{k=19}^{35} \binom{54}{k} \ 0.5^k \ (0.5)^{54-k}=0.980165673

    \displaystyle P(Y_{18} < \tau_{0.5} < Y_{37})=\sum \limits_{k=18}^{36} \binom{54}{k} \ 0.5^k \ (0.5)^{54-k}=0.99092666

The above calculation is done in Excel. The binomial probabilities are done using the function BINOM.DIST. So we have the following confidence intervals for the median annual San Francisco rainfall in inches.

    Median

    \displaystyle \hat{\tau}_{0.5}=\frac{20.01+20.54}{2}=20.275

    (Y_{21},Y_{34}) = (18.11, 23.49) with approximately 92% confidence

    (Y_{20},Y_{35}) = (17.74, 23.87) with approximately 96% confidence

    (Y_{19},Y_{36}) = (17.65, 24.09) with approximately 98% confidence

    (Y_{18},Y_{37}) = (17.50, 24.49) with approximately 99% confidence

For the lower quartile and upper quartile, the following are the results. The reader is invited to confirm the calculation.

    Lower quartile

    \displaystyle \hat{\tau}_{0.25}=15.985, average of Y_{13} and Y_{14}

    (Y_{7},Y_{20}) = (13.87, 17.74) with approximately 96% confidence

    (Y_{6},Y_{21}) = (13.86, 18.11) with approximately 98% confidence

    (Y_{5},Y_{22}) = (12.54, 18.26) with approximately 99% confidence

    Upper quartile

    \displaystyle \hat{\tau}_{0.75}=25.875, average of Y_{41} and Y_{42}

    (Y_{36},Y_{47}) = (24.09, 29.41) with approximately 91% confidence

    (Y_{35},Y_{48}) = (23.87, 31.87) with approximately 96% confidence

    (Y_{34},Y_{49}) = (23.49, 34.02) with approximately 98% confidence

The following shows the calculation for two of the confidence intervals, one for \tau_{0.25} and one for \tau_{0.75}.

    \displaystyle P(Y_{6} < \tau_{0.25} < Y_{21})=\sum \limits_{k=6}^{20} \binom{54}{k} \ 0.25^k \ (0.25)^{54-k}=0.979889918

    \displaystyle P(Y_{34} < \tau_{0.75} < Y_{49})=\sum \limits_{k=34}^{38} \binom{54}{k} \ 0.75^k \ (0.75)^{54-k}=0.979889918

________________________________________________________________________
\copyright \ \text{2015 by Dan Ma}

Defining the Poisson distribution

The Poisson distribution is a family of discrete distributions with positive probabilities on the non-negative numbers 0,1,2,\cdots. Each distribution in this family is indexed by a positive number \lambda>0. One way to define this distribution is to give its probability function given the parameter \lambda and then derive various distributional quantities such as mean and variance. Along with other mathematical facts, it can be shown that both the mean and the variance are \lambda. In this post, we take a different tack. We look at two view points that give rise to the Poisson distribution. Taking this approach will make it easier to appreciate some of the possible applications of the Poisson distribution. The first view point is that the Poisson distribution is the limiting case of the binomial distribution. The second view point is through the Poisson process, a stochastic process that, under some conditions, counts the number of events and the time points at which these events occur in a given time (or physical) interval.

________________________________________________________________________

Poisson as a limiting case of binomial

A binomial distribution where the number of trials n is large and the probability of success p is small such that np is moderate in size can be approximated using the Poisson distribution with mean \lambda=np. This fact follows from Theorem 1, which indicates that the Poisson distribution is the limiting case of the binomial distribution.

Theorem 1
Let \lambda be a fixed positive constant. Then for each integer x=0,1,2,\cdots, the following is true:

    \displaystyle \lim_{n \rightarrow \infty} \binom{n}{x} \ p^x \  (1-p)^{n-x}=\lim_{n \rightarrow \infty} \frac{n!}{x! \ (n-x)!} \ p^x \  (1-p)^{n-x}=\frac{e^{-\lambda} \ \lambda^x}{x!}

where \displaystyle p=\frac{\lambda}{n}.

Proof of Theorem 1
We start with a binomial distribution with n trials and with \displaystyle p=\frac{\lambda}{n} being the probability of success, where n>\lambda. Let X_n be the count of the number of successes in these n Bernoulli trials. The following is the probability that X_n=k.

    \displaystyle \begin{aligned} P(X_n=k)&=\binom{n}{k} \biggl(\frac{\lambda}{n}\biggr)^k \biggr(1-\frac{\lambda}{n}\biggr)^{n-k} \\&=\frac{n!}{k! (n-k)!} \biggl(\frac{\lambda}{n}\biggr)^k \biggr(1-\frac{\lambda}{n}\biggr)^{n-k} \\&=\frac{n(n-1)(n-2) \cdots (n-k+1)}{n^k} \biggl(\frac{\lambda^k}{k!}\biggr) \biggr(1-\frac{\lambda}{n}\biggr)^{n} \biggr(1-\frac{\lambda}{n}\biggr)^{-k} \\&=\biggl(\frac{\lambda^k}{k!}\biggr) \ \biggl[ \frac{n(n-1)(n-2) \cdots (n-k+1)}{n^k} \ \biggr(1-\frac{\lambda}{n}\biggr)^{n} \ \biggr(1-\frac{\lambda}{n}\biggr)^{-k} \biggr] \end{aligned}

In the last step, the terms that contain n are inside the square brackets. Let’s see what they are when n approaches infinity.

    \displaystyle \lim \limits_{n \rightarrow \infty} \ \frac{n(n-1)(n-2) \cdots (n-k+1)}{n^k}=1

    \displaystyle \lim \limits_{n \rightarrow \infty} \biggr(1-\frac{\lambda}{n}\biggr)^{n}=e^{-\lambda}

    \displaystyle \lim \limits_{n \rightarrow \infty} \biggr(1-\frac{\lambda}{n}\biggr)^{-k}=1

The reason that the first result is true is that the numerator is a polynomial where the leading term is n^k. Upon dividing by n^k and taking the limit, we get 1. The second result is true since the following limit is one of the definitions of the exponential function e^x.

    \displaystyle \lim \limits_{n \rightarrow \infty} \biggr(1+\frac{x}{n}\biggr)^{n}=e^{x}

The third result is true since the exponent -k is a constant. Thus the following is the limit of the probability P(X_n=k) as n \rightarrow \infty.

    \displaystyle \begin{aligned} \lim \limits_{n \rightarrow \infty} P(X_n=k)&= \biggl(\frac{\lambda^k}{k!}\biggr) \ \lim \limits_{n \rightarrow \infty} \biggl[ \frac{n(n-1)(n-2) \cdots (n-k+1)}{n^k} \ \biggr(1-\frac{\lambda}{n}\biggr)^{n} \ \biggr(1-\frac{\lambda}{n}\biggr)^{-k} \biggr] \\&=\biggl(\frac{\lambda^k}{k!}\biggr) \cdot 1 \cdot e^{-\lambda} \cdot 1 \\&=\frac{e^{-\lambda} \lambda^k}{k!} \end{aligned}

This above derivation completes the proof. \blacksquare

In a given binomial distribution, whenever the number of trials n is large and the probability p of success in each trial is small (i.e. each of the Bernoulli trial rarely results in a success), Theorem 1 tells us that we can use the Poisson distribution with parameter \lambda=np to estimate the binomial distribution.

Example 1
The probability of being dealt a full house in a hand of poker is approximately 0.001441. Out of 5000 hands of poker that are dealt at a certain casino, what is the probability that there will be at most 4 full houses?

Let X be the number of full houses in these 5000 poker hands. The exact distribution for X is the binomial distribution with n= 5000 and p= 0.001441. Thus example deals with a large number of trials where each trial is a rare event. So the Poisson estimation is applicable. Let \lambda= 5000(0.001441) = 7.205. Then P(X \le 4) can be approximated by the Poisson random variable Y with parameter \lambda. The following is the probability function of Y:

    \displaystyle P(Y=y)=e^{-7.205} \ \frac{7.205^y}{y!}

The following is the approximation of P(X \le 4):

    \displaystyle \begin{aligned} P(X \le 4)&\approx P(Y \le 4) \\&=P(Y=0)+P(Y=1)+P(Y=2)+P(Y=3)+P(Y=4) \\&= e^{-7.205} \biggl[ 1+7.205+\frac{7.205^2}{2!}+\frac{7.205^3}{3!}+\frac{7.205^4}{4!}\biggr] \\&=0.155098087  \end{aligned}

The following is a side by side comparison between the binomial distribution and its Poisson approximation. For all practical purposes, they are indistingusihable from one another.

    \displaystyle \begin{bmatrix} \text{Count of}&\text{ }&\text{ }&\text{Binomial } &\text{ }&\text{ }&\text{Poisson } \\\text{Full Houses}&\text{ }&\text{ }&P(X \le x) &\text{ }&\text{ }&P(Y \le x) \\\text{ }&\text{ }&\text{ }&n=5000 &\text{ }&\text{ }&\lambda=7.205 \\\text{ }&\text{ }&\text{ }&p=0.001441 &\text{ }&\text{ }&\text{ } \\\text{ }&\text{ }&\text{ } &\text{ }&\text{ } \\ 0&\text{ }&\text{ }&0.000739012&\text{ }&\text{ }&0.000742862 \\ 1&\text{ }&\text{ }&0.006071278&\text{ }&\text{ }&0.006095184 \\ 2&\text{ }&\text{ }&0.025304641&\text{ }&\text{ }&0.025376925 \\ 3&\text{ }&\text{ }&0.071544923&\text{ }&\text{ }&0.071685238 \\ 4&\text{ }&\text{ }&0.154905379&\text{ }&\text{ }&0.155098087  \\ 5&\text{ }&\text{ }&0.275104906&\text{ }&\text{ }&0.275296003 \\ 6&\text{ }&\text{ }&0.419508250&\text{ }&\text{ }&0.419633667 \\ 7&\text{ }&\text{ }&0.568176421 &\text{ }&\text{ }&0.568198363   \\ 8&\text{ }&\text{ }&0.702076190 &\text{ }&\text{ }&0.701999442 \\ 9&\text{ }&\text{ }&0.809253326&\text{ }&\text{ }&0.809114639 \\ 10&\text{ }&\text{ }&0.886446690&\text{ }&\text{ }&0.886291139 \\ 11&\text{ }&\text{ }&0.936980038&\text{ }&\text{ }&0.936841746 \\ 12&\text{ }&\text{ }&0.967298041&\text{ }&\text{ }&0.967193173 \\ 13&\text{ }&\text{ }&0.984085073&\text{ }&\text{ }&0.984014868 \\ 14&\text{ }&\text{ }&0.992714372&\text{ }&\text{ }&0.992672033 \\ 15&\text{ }&\text{ }&0.996853671&\text{ }&\text{ }&0.996830358 \end{bmatrix}

The above table is calculated using the functions BINOM.DIST and POISSON.DIST in Excel. The following shows how it is done. The parameter TRUE indicates that the result is a cumulative distribution. When it is set to FALSE, the formula gives the probability function.

    P(X \le x)=\text{BINOM.DIST(x, 5000, 0.001441, TRUE)}

    P(Y \le x)=\text{POISSON.DIST(x, 7.205, TRUE)}

________________________________________________________________________

The Poisson distribution

The limit in Theorem 1 is a probability function and the resulting distribution is called the Poisson distribution. We now gives the formal definition. A random variable X that takes on one of the numbers 0,1,2,\cdots is said to be a Poisson random variable with parameter \lambda>0 if

    \displaystyle P(X=x)=\frac{e^{-\lambda} \ \lambda^x}{x!} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ x=0,1,2,\cdots

It can be shown that the above function is indeed a probability function, i.e., the probabilities sum to 1. Any random variable that has a probability function of the above form is said to follow (or to have) a Poisson distribution. Furthermore, it can be shown that E(X)=var(X)=\lambda, i.e., the Poisson parameter is both the mean and variance. Thus the Poisson distribution may be a good fit if the observed data indicate that the sample mean and the sample variance are nearly identical.

The following is the moment generating function of the Poisson distribution with parameter \lambda.

    \displaystyle M(t)=E(e^{tX})=e^{\lambda \ (e^t-1)}

One consequence of the Poisson moment generating function is that any independent sum of Poisson distributions is again a Poisson distribution.

________________________________________________________________________

The Poisson process

Another way, the more important way, to look at the Poisson distribution is the view point of the Poisson process. Consider an experiment in which events that are of interest occur at random in a time interval. The goal here is to record the time of the occurrence of each random event and for the purpose at hand, count the number of random events occurring in a fixed time interval. Starting at time 0, note the time of the occurrence of the first event. Then the time at which the second random event occurs and so on. Out of these measurements, we can derive the length of time between the occurrences of any two consecutive random events. Such measurements belong to a continuous random variable. In this post, we focus on the discrete random variable of the count of the random events in a fixed time interval.

A good example of a Poisson process is the well known experiment in radioactivity conducted by Rutherford and Geiger in 1910. In this experiment, \alpha-particles were emitted from a polonium source and the number of \alpha-particles were counted during an interval of 7.5 seconds (2608 many such time intervals were observed). A Poisson process is a random process in which several criteria are satisfied. We will show that in a Poisson process, the number of these random occurrences in the fixed time interval will follow a Poisson distribution. First, we discuss the criteria to which a Poisson process must conform.

One of the criteria is that in a very short time interval, the chance of having more than one random event is essentially zero. So either one random event will occur or none will occur in a very short time interval. Considering the occurrence of a random event as a success, there is either a success or a failure in a very short time interval. So a very short time interval in a Poisson process can be regarded as a Bernoulli trial.

The second criterion is that the experiment remains constant over time. Specifically this means that the probability of a random event occurring in a given subinterval is proportional to the length of that subinterval and not on where the subinterval is in the original interval. For example, in the 1910 radioactivity study, \alpha-particles were emitted at the rate of \lambda= 3.87 per 7.5 seconds. So the probability of one \alpha-particle emitted from the radioactive source in a one-second interval is 3.87/7.5 = 0.516. Then the probability of observing one \alpha-particle in a half-second interval is 0.516/2 = 0.258. For a quarter-second interval, the probability is 0.258/2 = 0.129. So if we observe half as long, it will be half as likely to observe the occurrence of a random event. On the other hand, it does not matter when the quarter-second subinterval is, whether at the beginning or toward the end of the original interval of 7.5 seconds.

The third criterion is that non-overlapping subintervals are mutually independent in the sense that what happens in one subinterval (i.e. the occurrence or non-occurrence of a random event) will have no influence on the occurrence of a random event in another subinterval. To summarize, the following are the three criteria of a Poisson process:

    Suppose that on average \lambda random events occur in a time interval of length 1.

    1. The probability of having more than one random event occurring in a very short time interval is essentially zero.
    2. For a very short subinterval of length \frac{1}{n} where n is a sufficiently large integer, the probability of a random event occurring in this subinterval is \frac{\lambda}{n}.
    3. The numbers of random events occurring in non-overlapping time intervals are independent.

Consider a Poisson process in which the average rate is \lambda random events per unit time interval. Let Y be the number of random events occurring in the unit time interval. In the 1910 radioactivity study, the unit time interval is 7.5 seconds and Y is the count of the number of \alpha-particles emitted in 7.5 seconds. It follows that Y has a Poisson distribution with parameter \lambda. To see this, subdivide the unit interval into n non-overlapping subintervals of equal length where n is a sufficiently large integer. Let X_{n,j} be the number of random events in the the jth subinterval (1 \le j \le n). Based on the three assumptions, X_{n,1},X_{n,2},\cdots,X_{n,n} are independent Bernoulli random variables, where the probability of success for each X_{n,j} is \frac{\lambda}{n}. Then X_n=X_{n,1}+X_{n,2}+\cdots+X_{n,n} has a binomial distribution with parameters n and p=\frac{\lambda}{n}. Theorem 1 tells us that the limiting case of the binomial distributions for X_n is the Poisson distribution with parameter \lambda. This Poisson distribution should agree with the distribution for Y. The Poisson is also discussed in quite a lot of details in the previous post called Poisson as a Limiting Case of Binomial Distribution.

We now examine the 1910 radioactivity study a little more closely.

Example 2
The basic idea of the 1910 radioactivity study conducted by Rutherford and Geiger is that a polonium source was placed a short distance from an observation point. The number of \alpha-particles emitted from the source were counted in 7.5-second intervals for 2608 times. The following is the tabulated results.

    \displaystyle \begin{bmatrix}   \text{Number of alpha particles}&\text{ }&\text{Observed}   \\ \text{recorded per 7.5 seconds }&\text{ }&\text{counts}    \\ \text{ }&\text{ }&\text{ }   \\ 0&\text{ }&57    \\ 1&\text{ }&203    \\ 2&\text{ }&383    \\ 3&\text{ }&525    \\ 4&\text{ }&532    \\ 5&\text{ }&408   \\ 6&\text{ }&273   \\ 7&\text{ }&139   \\ 8&\text{ }&45   \\ 9&\text{ }&27   \\ 10&\text{ }&10  \\ 11+&\text{ }&6  \\ \text{ }&\text{ }&\text{ }  \\ \text{Total }&\text{ }&2608    \end{bmatrix}

What is the average number of particles observed per 7.5 seconds? The total number of \alpha-particles in these 2608 periods is

    0 \times 57+1 \times 203+2 \times 383+ 3 \times 525 + \cdots=10097.

The mean count per period is \lambda=\frac{10097}{2608}=3.87. Consider the Poisson distribution with parameter 3.87. The following is its probability function.

    \displaystyle P(X=x)=\frac{e^{-3.87} \ 3.87^x}{x!} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ x=0,1,2,\cdots

Out of 2608 periods, the expected number of periods with x particles in emission is 2608P(X=x). The following is a side by side comparison in the observed counts and the expected counts.

    \displaystyle \begin{bmatrix}   \text{Number of alpha particles}&\text{ }&\text{Observed}&\text{ }&\text{Expected}    \\ \text{recorded per 7.5 seconds }&\text{ }&\text{counts}&\text{ }&\text{counts}    \\ \text{ }&\text{ }&\text{ }&\text{ }&2608 \times P(X=x)  \\ \text{ }&\text{ }&\text{ }&\text{ }&\text{ }   \\ 0&\text{ }&57&\text{ }&54.40    \\ 1&\text{ }&203&\text{ }&210.52  \\ 2&\text{ }&383&\text{ }&407.36    \\ 3&\text{ }&525&\text{ }&525.50    \\ 4&\text{ }&532&\text{ }&508.42    \\ 5&\text{ }&408&\text{ }&393.52   \\ 6&\text{ }&273&\text{ }&253.82   \\ 7&\text{ }&139&\text{ }&140.32   \\ 8&\text{ }&45&\text{ }&67.88   \\ 9&\text{ }&27&\text{ }&29.19   \\ 10&\text{ }&10&\text{ }&11.30  \\ 11+&\text{ }&6&\text{ }&5.78  \\ \text{ }&\text{ }&\text{ }&\text{ }&\text{ }  \\ \text{Total }&\text{ }&2608&\text{ }&2608    \end{bmatrix}

The expected counts are quite close to the observed counts, showing that the Poisson distribution is a very good fit to the observed data from the 1910 study.

________________________________________________________________________

More comments about the Poisson process

We have described the Poisson process as the distribution of random events in a time interval. The same idea can be used to describe random events occurring along a spatial interval, i.e. intervals in terms of distance or volume or other spatial measurements (see Examples 5 and 6 below).

Another point to make is that sometimes it may be necessary to consider an interval other than the unit length. Instead of counting the random events occurring in an interval of length 1, we may want to count the random events in an interval of length t. As before, let \lambda be the rate of occurrences in a unit interval. Then the rate of occurrences of the random events is over the interval of length t is \lambda t. The same idea will derive that fact that the number of occurrences of the random events of interest in the interval of length t is a Poisson distribution with parameter \lambda t. The following is its probability function.

    \displaystyle P(X_t=x)=\frac{e^{-\lambda t} \ (\lambda t)^x}{x!} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ x=0,1,2,\cdots

where X_t is the count of the random events in an interval of length t.

For example, in the 1910 radioactive study, the unit length is 7.5 seconds. The number of \alpha-particles observed in a half unit interval (3.75 seconds) will follow a Poisson distribution with parameter 0.5 \lambda= 0.5(3.87) = 1.935 with the following probability function:

    \displaystyle P(X_{0.5}=x)=\frac{e^{-1.935} \ (1.935)^x}{x!}  \ \ \ \ \ \ \ \ \ \ \ \ \ x=0,1,2,\cdots

________________________________________________________________________

More examples

Example 3
A radioactive source is metered for 5 hours. During this period, 9638 \alpha-particles are counted. What is the probability that during the next minute, between 30 and 34 particles (both inclusive ) will be counted?

The average number of \alpha-particles counted per minute is \lambda=\frac{9638}{300}=32.12. Let X be the number of \alpha-particles counted per minute. Then X has a Poisson distribution with parameter \lambda=32.12. The following calculates P(30 \le X \le 34).

    \displaystyle \begin{aligned} P(30 \le X \le 34)&=e^{-32.12} \biggl[ \frac{32.12^{30}}{30!}+\frac{32.12^{31}}{31!}+\frac{32.12^{32}}{32!}+\frac{32.12^{33}}{33!}+\frac{32.12^{34}}{34!}   \biggr] \\&=0.341118569  \end{aligned}

Alternatively, the POISSON.DIST function in Excel can be used as follows:

    \displaystyle \begin{aligned} P(30 \le X \le 34)&=P(X \le 34)-P(X \le 29) \\&=\text{POISSON.DIST(34,32.12,TRUE)} \\& \ \ \ \ \ \ -\text{POISSON.DIST(29,32.12,TRUE)} \\&=0.671501917-0.330383348 \\&=0.341118569  \end{aligned}

Example 4
The side effect of dry mouth is known to be experienced, on the average, by 5 out of 10,000 individuals taking a certain medication. About 20,000 patients are expected to take this medication next year. What is the probability that between 12 and 16 (both inclusive) patients will experience the side effect of dry mouth? What is the exact probability model that can also be used to work this problem?

The exact model is a binomial distribution. The number of trials n= 20000 and the probability of success in each trial is p= 0.0005 (experiencing the side effect). Here, we use Poisson to estimate the binomial. The average number of patients experiencing side effect is \lambda=20000(0.0005)=10. Let X be the number of patients experiencing the side effect. The following calculates the Poisson probability for P(12 \le X \le 16) in two different ways.

    \displaystyle \begin{aligned} P(12 \le X \le 16)&=e^{-10} \biggl[ \frac{10^{12}}{12!}+\frac{10^{13}}{13!}+\frac{10^{14}}{14!}+\frac{10^{15}}{15!}+\frac{10^{16}}{16!}   \biggr] \\&=0.276182244  \end{aligned}
    \displaystyle \begin{aligned} P(12 \le X \le 16)&=P(X \le 11)-P(X \le 16) \\&=\text{POISSON.DIST(16,10,TRUE)} \\& \ \ \ \ \ \ -\text{POISSON.DIST(11,10,TRUE)} \\&=0.97295839-0.696776146 \\&=0.276182244  \end{aligned}

Example 5
In a 10-mile stretch of a highway, car troubles (e.g. tire punctures, dead batteries, and mechanical breakdown) occur at a rate of 1.5 per hour. A tow truck driver can respond to such car troubles and offer roadside assistance, which can include towing and minor repair. Assume that the number of such incidences per hour follows a Poisson distribution. At the beginning of the hour, three tow trucks (and their drivers) are available to respond to any car troubles in this stretch of highway. What is the probability that in the next hour all three tow trick drivers will be busy helping motorists with car troubles in this stretch of highway?

Let X be the number of car troubles that occur in this 10-mile stretch of highway in the one-hour period in question. If in this one hour there are 3 or more car troubles (X \ge 3), then all three tow truck drivers will be busy.

    \displaystyle \begin{aligned} P(X \ge 3)&=1-P(X \le 2) \\&=1-e^{-1.5} \biggl[ 1+1.5+\frac{1.5^{2}}{2!}   \biggr] \\&=1-0.808846831\\&=0.191153169  \end{aligned}

Example 6
Continuing Example 5. Considering that there is only 19% chance that all 3 tow truck drivers will be busy, there is a good chance that the resources are under utilized. What if one of the drivers is assigned to another stretch of highway?

With only two tow trucks available for this 10-mile stretch of highway, the following is the probability that all two tow truck drivers will be busy:

    \displaystyle \begin{aligned} P(X \ge 2)&=1-P(X \le 1) \\&=1-e^{-1.5} \biggl[ 1+1.5   \biggr] \\&=1-0.5578254\\&=0.4421746  \end{aligned}

Assigning one driver to another area seems to be a better way to make good use of the available resources. With only two tow truck drivers available, there is much reduced chance (56%) that one of the drivers will be idle, and there is a much increased chance (44%) that all available drivers will be busy.

________________________________________________________________________

Remarks

The Poisson distribution is one of the most important of all probability models and has shown to be an excellent model for a wide array of phenomena such as

  • the number of \alpha-particles emitted from radioactive source in a given amount of time,
  • the number of vehicles passing a particular location on a busy highway,
  • the number of traffic accidents in a stretch of highway in a given period of time,
  • the number of phone calls arriving at a particular point in a telephone network in a fixed time period,
  • the number of insurance losses/claims in a given period of time,
  • the number of customers arriving at a ticket window,
  • the number of earthquakes occurring in a fixed period of time,
  • the number of mutations on a strand of DNA.
  • the number of hurricanes in a year that originate in the Atlantic ocean.

What is the Poisson distribution so widely applicable in these and many other seemingly different and diverse phenomena? What is the commonality that ties all these different and diverse phenomena? The commonality is that all these phenomena are basically a series of independent Bernoulli trials. If a phenomenon is a Binomial model where n is large and p is small, then it has a strong connection to Poisson model mathematically through Theorem 1 above (i.e. it has a Poisson approximation). On the other hand, if the random phenomenon follows the criteria in a Poisson process, then the phenomenon is also approximately a Binomial model, which means that in the limiting case it is Poisson.

In both view points discussed in this post, the Poisson distribution can be regarded as a binomial distribution taken at a very granular level. This connection with the binomial distribution points to a vast arrays of problems that can be solved using the Poisson distribution.

________________________________________________________________________

Exercises

Practice problems for the Poisson concepts discussed above can be found in the companion blog (go there via the following link). Working on these exercises is strongly encouraged (you don’t know it until you can do it).

________________________________________________________________________
\copyright \ \text{2015 by Dan Ma}

Picking Two Types of Binomial Trials

We motivate the discussion with the following example. The notation W \sim \text{binom}(n,p) denotes the statement that W has a binomial distribution with parameters n and p. In other words, W is the number of successes in a sequence of n independent Bernoulli trials where p is the probability of success in each trial.

Example 1
Suppose that a student took two multiple choice quizzes in a course for probability and statistics. Each quiz has 5 questions. Each question has 4 choices and only one of the choices is correct. Suppose that the student answered all the questions by pure guessing. Furthermore, the two quizzes are independent (i.e. results of one quiz will not affect the results of the other quiz). Let X be the number of correct answers in the first quiz and Y be the number of correct answers in the second quiz. Suppose the student was told by the instructor that she had a total of 4 correct answers in these two quizzes. What is the probability that she had 3 correct answers in the first quiz?

On the face of it, the example is all about binomial distribution. Both X and Y are binomial distributions (both \sim \text{binom}(5,\frac{1}{4})). The sum X+Y is also a binomial distribution (\sim \text{binom}(10,\frac{1}{4})). The question that is being asked is a conditional probability, i.e., P(X=3 \lvert X+Y=4). Surprisingly, this conditional probability can be computed using the hypergeometric distribution. One can always work this problem from first principle using binomial distributions. As discussed below, for a problem such as Example 1, it is always possible to replace the binomial distributions using a thought process involving the hypergeometric distribution.

Here’s how to think about the problem. This student took the two quizzes and was given the news by the instructor that she had 4 correct answers in total. She now wonders what the probability of having 3 correct answers in the first quiz is. The thought process is this. She is to pick 4 questions from 10 questions (5 of them are from Quiz 1 and 5 of them are from Quiz 2). So she is picking 4 objects from a group of two distinct types of objects. This is akin to reaching into a jar that has 5 red balls and 5 blue balls and pick 4 balls without replacement. What is the probability of picking 3 red balls and 1 blue ball? The probability just described is from a hypergeometric distribution. The following shows the calculation.

\displaystyle (1) \ \ \ \ P(X=3 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{3} \ \binom{5}{1}}{\displaystyle \binom{10}{4}}=\frac{50}{210}

We will show below why this works. Before we do that, let’s describe the above thought process. Whenever you have two independent binomial distributions X and Y with the same probability of success p (the number of trials does not have to be the same), the conditional distribution X \lvert X+Y=a is a hypergeometric distribution. Interestingly, the probability of success p has no bearing on this observation. For Example 1, we have the following calculation.

\displaystyle (2a) \ \ \ \ P(X=0 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{0} \ \binom{5}{4}}{\displaystyle \binom{10}{4}}=\frac{5}{210}

\displaystyle (2b) \ \ \ \ P(X=1 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{1} \ \binom{5}{3}}{\displaystyle \binom{10}{4}}=\frac{50}{210}

\displaystyle (2c) \ \ \ \ P(X=2 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{2} \ \binom{5}{2}}{\displaystyle \binom{10}{4}}=\frac{100}{210}

\displaystyle (2d) \ \ \ \ P(X=3 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{3} \ \binom{5}{1}}{\displaystyle \binom{10}{4}}=\frac{50}{210}

\displaystyle (2e) \ \ \ \ P(X=4 \lvert X+Y=4)=\frac{\displaystyle \binom{5}{4} \ \binom{5}{0}}{\displaystyle \binom{10}{4}}=\frac{5}{210}

Interestingly, the conditional mean E(X \lvert X+Y=4)=2, while the unconditional mean E(X)=5 \times \frac{1}{4}=1.25. The fact that the conditional mean is higher is not surprising. The student was lucky enough to have obtained 4 correct answers by guessing. Given this, she had a greater chance of doing better on the first quiz.

__________________________________________________
Why This Works

Suppose X \sim \text{binom}(5,p) and Y \sim \text{binom}(5,p) and they are independent. The joint distribution of X and Y has 36 points in the sample space. See the following diagram.

Figure 1

The probability attached to each point is

\displaystyle \begin{aligned}(3) \ \ \ \  P(X=x,Y=y)&=P(X=x) \times P(Y=y) \\&=\binom{5}{x} p^x (1-p)^{5-x} \times \binom{5}{y} p^y (1-p)^{5-y}  \end{aligned}

where x=0,1,2,3,4,5 and y=0,1,2,3,4,5.

The conditional probability P(X=k \lvert X+Y=4) involves 5 points as indicated in the following diagram.

Figure 2

The conditional probability P(X=k \lvert X+Y=4) is simply the probability of one of the 5 sample points as a fraction of the sum total of the 5 sample points encircled in the above diagram. The following is the sum total of the probabilities of the 5 points indicated in Figure 2.

\displaystyle \begin{aligned}(4) \ \ \ \  P(X+Y=4)&=P(X=0) \times P(Y=4)+P(X=1) \times P(Y=3)\\&\ \ \ \ +P(X=2) \times P(Y=3)+P(X=3) \times P(Y=2)\\&\ \ \ \ +P(X=4) \times P(Y=0)  \end{aligned}

We can plug (3) into (4) and work out the calculation. But (4) is actually equivalent to the following because X+Y \sim \text{binom}(10,p).

\displaystyle (5) \ \ \ \ P(X+Y=4)=\ \binom{10}{4} p^4 \ (1-p)^{6}

As stated earlier, the conditional probability P(X=k \lvert X+Y=4) is simply the probability of one of the 5 sample points as a fraction of the sum total of the 5 sample points encircled in Figure 2. Thus we have:

\displaystyle \begin{aligned}(6) \ \ \ \  P(X=k \lvert X+Y=4)&=\frac{P(X=k) \times P(Y=4-k)}{P(X+Y=4)} \\&=\frac{\displaystyle \binom{5}{k} p^k (1-p)^{5-k} \times \binom{5}{4-k} p^{4-k} (1-p)^{5-(4-k)}}{\displaystyle  \binom{10}{4} p^4 \ (1-p)^{6}}   \end{aligned}

With the terms involving p and 1-p cancel out, we have:

\displaystyle (7) \ \ \ \  P(X=k \lvert X+Y=4)=\frac{\displaystyle \binom{5}{k} \times \binom{5}{4-k}}{\displaystyle  \binom{10}{4}}

__________________________________________________
Summary

Suppose X \sim \text{binom}(N,p) and Y \sim \text{binom}(M,p) and they are independent. Then X+Y is also a binomial distribution, i.e., \sim \text{binom}(N+M,p). Suppose that both binomial experiments \text{binom}(N,p) and \text{binom}(M,p) have been performed and it is known that there are a successes in total. Then X \lvert X+Y=a has a hypergeometric distribution.

\displaystyle (8) \ \ \ \  P(X=k \lvert X+Y=a)=\frac{\displaystyle \binom{N}{k} \times \binom{M}{a-k}}{\displaystyle  \binom{N+M}{a}}

where k=0,1,2,3,\cdots,\text{min}(N,a).

As discussed earlier, think of the N trials in \text{binom}(N,p) as red balls and think of the M trials in \text{binom}(M,p) as blue balls in a jar. Think of the a successes as the number of balls you are about to draw from the jar. So you reach into the jar and select a balls without replacement. The calculation in (8) gives the probability that you select k red balls and a-k blue balls.

The probability of success p in the two binomial distributions have no bearing on the result since it gets canceled out in the derivation. One can always work a problem like Example 1 using first principle. Once the thought process using hypergeometric distribution is understood, it is a great way to solve this problem, that is, you can by pass the binomial distributions and go straight to the hypergeometric distribution.

__________________________________________________
Additional Practice
Practice problems are found in the following blog post.

How to pick binomial trials