# Every character is known but password is hard to crack

In this post, we discuss an example in which you are given a password (every character of it) and yet it is still very hard (or even impossible) to crack. Anyone who understands this example has a solid understanding of the binomial distribution. Here’s the example:

Your friend John tells you that the password to his online bank account has 26 characters. The first character is the first letter in the English alphabet, the second character is the second letter in the English alphabet, the third character is the third letter in the English alphabet and so on.

Now that your friend John has given you the key to his account, does that mean you can log onto his account to find out how much money he has, or to make financial transactions on his behalf, or to enrich yourself?

If this example sounds too good to be true, what is the catch?

Even though every character in the 26-character password is known, it is indeed a very strong password. How could this be? You may want to stop here and think about it.

Indeed, if every character in John’s password is lower case or if every character is upper case, then his bank account is toast. But John’s password can be made very strong and very hard to crack if the password is case sensitive. The password given by John is not just one password, but is a large collection of passwords. In fact, there are over 67 million possible passwords (67,108,864 to be exact). The following are two of the most obvious ones.

a b c d e f g h i j k l m n o p q r s t u v w x y z (all lower case)

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z (all upper case)

The following is another possible password. If this is the one John uses, it will be difficult to crack.

a b C d e f G h i j k l M N o p Q R s T U v w X Y z (10 upper case letters)

Here’s another possibility.

A B c D E f G H I J k l M N o P q R S t u v w X y z (14 upper case letters)

Each character in the password has two possibilities – lower case or upper case. Across all 26 characters, there are $2^{26}$ possibilities. This number is 67,108,864. So 2 raised to 26 is a little over 67 million. So the password given by John is not just one password, but is a generic one with over 67 million possibilities. There is a one in 67 million chance of correctly guessing the password if John chooses the upper case letters randomly. These are much better odds than winning the Powerball lottery (one in 292,201,338, roughly one in 292 million), but it is still an undeniably strong password.

So John tells you the password, but he has in fact not given up much of a secret. This is the case if he makes the case of each letter truly random. Of course, once he sets his password, unless he has a great memory, he will need to write down the positions that are upper case. Otherwise, he will not be able to log back on. But that is a totally separate story.

___________________________________________________________________________

The binomial angle

A good way to represent the passwords is to view each one as a 26-character string of U and L (U stands for upper case and L stands for lower case). Then the above two passwords (the one with 10 upper case letters and the one with 14 upper case letters) are represented as follows:

L L U L L L U L L L L L U U L L U U L U U L L U U L (10 U’s)

U U L U U L U U U U L L U U L U L U U L L L L U L L (14 U’s)

As discussed, there are 67,108,864 such strings. We can also think of each such string as the record of a random experiment consisting of tossing a coin 26 times. We record a U when we get a head and record an L when the coin toss results in a tail. In fact, this is the kind of record that John would need to keep when he sets the password. Such a string would tell him which characters in the password are in upper case and which characters are in lower case. On the other hand, hacking the password would essentially be trying to pinpoint one such string out of 67,108,864 possibilities.

We know 67,108,864 is a large number. Let’s further demonstrate the challenge. In each string, the number of U’s can range from 0 to 26. How many of the strings have 0 U’s? Precisely one (all letters are L). This would be what most people would try first (all the letters are lower case). How many of the strings have exactly 1 U? Thinking about it carefully, we should arrive at the conclusion that there are 26 possibilities (the single U can be in any of the 26 positions). So if the would-be hacker knows there is only one upper case letter, then it would be easy to break the password. How many of the strings have exactly 2 U’s? If you try to write out all the possible cases, it may take some effort. There are 325 possibilities. So having just two upper case letters in the password starts to approach a strong password. But the problem is that the two U’s may be guessed easily. John may put the upper case letters at his initials (if his name is John Smith, he may make J and S upper case), or at other obvious letters.

How many of the strings have exactly 3 U’s? This will be really hard to write out by hand. There are 2,600 possibilities. Why stop at having just 3 upper case letters? It is clear that the more upper case letters used in the password, the stronger it is and the harder it is to crack.

How do we know the precise number of possibilities for a given $k$, the number of U’s? The idea is that of choosing $k$ letters out of the 26 letters.

The number of ways of choosing $k$ objects from a total of $n$ objects is denoted by the notation $\binom{n}{k}$. Sometimes the notations $C(n,k)$, $_nC_k$ and $C_{n,k}$ are used. Regardless of the notation, the calculation is

$\displaystyle \binom{n}{k}=\frac{n!}{k! (n-k)!}$

The notation $n!$ is the product of all the positive integers up to and including $n$ (this is called $n$ factorial). Thus $1!=1$, $2!=2$, $3!=6$, $4!=24$. To make the formula work correctly, we make $0!=1$.

The following gives the first several calculations in the 26-character password example.

$\displaystyle \binom{26}{2}=\frac{26!}{2! \ 24!}=\frac{26 \cdot 25}{2}=325$

$\displaystyle \binom{26}{3}=\frac{26!}{3! \ 23!}=\frac{26 \cdot 25 \cdot 24}{6}=2600$

$\displaystyle \binom{26}{4}=\frac{26!}{4! \ 22!}=\frac{26 \cdot 25 \cdot 24 \cdot 23}{24}=14950$

To see the pattern, the remaining calculations can be done using software (or at least a handheld calculator). The following table shows the results.

$\displaystyle \begin{array}{rrr} k &\text{ } & \displaystyle \binom{26}{k} \\ \text{ } & \text{ } & \text{ } \\ 0 &\text{ } & 1 \\ 1 &\text{ } & 26 \\ 2 &\text{ } & 325 \\ 3 &\text{ } & 2,600 \\ 4 &\text{ } & 14,950 \\ 5 &\text{ } & 65,780 \\ 6 &\text{ } & 230,230 \\ 7 &\text{ } & 657,800 \\ 8 &\text{ } & 1,562,275 \\ 9 &\text{ } & 3,124,550 \\ 10 &\text{ } & 5,311,735 \\ 11 &\text{ } & 7,726,160 \\ 12 &\text{ } & 9,657,700 \\ 13 &\text{ } & 10,400,600 \\ 14 &\text{ } & 9,657,700 \\ 15 &\text{ } & 7,726,160 \\ 16 &\text{ } & 5,311,735 \\ 17 &\text{ } & 3,124,550 \\ 18 &\text{ } & 1,562,275 \\ 19 &\text{ } & 657,800 \\ 20 &\text{ } & 230,230 \\ 21 &\text{ } & 65,780 \\ 22 &\text{ } & 14,950 \\ 23 &\text{ } & 2,600 \\ 24 &\text{ } & 325 \\ 25 &\text{ } & 26 \\ 26 &\text{ } & 1 \\ \end{array}$

The pattern is symmetrical. Having too few U’s or too many U’s produces weak passwords that may be easy to guess. Having 6 or 7 U’s seems to give strong passwords. Having half of the letters upper case (13 U’s) is optimal, with the most possibilities (over 10 million). Even if you are given partial information such as “half of the letters are in upper case”, you are still left with over 10 million possibilities to work with!

Dividing each of the above counts by 67,108,864 gives the relative weight (probability) of each case of having exactly $k$ U’s.

$\displaystyle \begin{array}{rrr} k &\text{ } & P[X=k] \\ \text{ } & \text{ } & \text{ } \\ 0 &\text{ } & 0.00000001 \\ 1 &\text{ } & 0.00000039 \\ 2 &\text{ } & 0.00000484 \\ 3 &\text{ } & 0.00003874 \\ 4 &\text{ } & 0.00022277 \\ 5 &\text{ } & 0.00098020 \\ 6 &\text{ } & 0.00343069 \\ 7 &\text{ } & 0.00980198 \\ 8 &\text{ } & 0.02327971 \\ 9 &\text{ } & 0.04655942 \\ 10 &\text{ } & 0.07915102 \\ 11 &\text{ } & 0.11512876 \\ 12 &\text{ } & 0.14391094 \\ 13 &\text{ } & 0.15498102 \\ 14 &\text{ } & 0.14391094 \\ 15 &\text{ } & 0.11512876 \\ 16 &\text{ } & 0.07915102 \\ 17 &\text{ } & 0.04655942 \\ 18 &\text{ } & 0.02327971 \\ 19 &\text{ } & 0.00980198 \\ 20 &\text{ } & 0.00343069 \\ 21 &\text{ } & 0.00098020 \\ 22 &\text{ } & 0.00022277 \\ 23 &\text{ } & 0.00003874 \\ 24 &\text{ } & 0.00000484 \\ 25 &\text{ } & 0.00000039 \\ 26 &\text{ } & 0.00000001 \\ \end{array}$

The cases of $X=k$ where $k=11,12,13,14,15$ add up to 67.3% of the 67,108,864 possibilities.
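These counts and probabilities are easy to reproduce with a few lines of code. The following is a minimal Python sketch using only the standard library (`math.comb`); it regenerates both tables and the 67.3% figure just mentioned.

```python
# Reproduce the two tables above: the number of 26-character strings with
# exactly k upper case letters, and the probability of each count when the
# case of every letter is decided by a fair coin toss.
from math import comb

total = 2 ** 26  # 67,108,864 possible passwords

for k in range(27):
    count = comb(26, k)      # number of strings with exactly k U's
    prob = count / total     # probability of exactly k U's
    print(f"{k:2d}  {count:>12,d}  {prob:.8f}")

# The middle cases k = 11, ..., 15 carry about 67.3% of the possibilities.
middle = sum(comb(26, k) for k in range(11, 16)) / total
print(f"P(11 <= X <= 15) = {middle:.4f}")
```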

___________________________________________________________________________

The binomial distribution

Another way to look at it is that in setting the password, John is performing a sequence of 26 independent Bernoulli trials. Here, each trial has two outcomes, a or A, b or B, c or C and so on. For example, the lower case or upper case can be determined by a coin toss. Let $X$ be the number of upper case letters in the 26-character password. Then the random variable $X$ has a binomial distribution with $n=26$ (26 Bernoulli trials) and the probability of success $p=0.5$ in each trial, which is the probability that a character is upper case, assuming that he determines the upper/lower case by a coin toss. The following is the probability function:

$\displaystyle P(X=x)=\binom{26}{x} \biggl[\frac{1}{2}\biggr]^x \biggl[\frac{1}{2}\biggr]^{26-x}=\binom{26}{x} \biggl[\frac{1}{2}\biggr]^{26}$

where $x=0,1,2,\cdots,25,26$. The quantity $\displaystyle P(X=x)$ is the probability that the number of upper case letters is $x$. Here, $\binom{26}{x}$ is the number of ways to choose $x$ letters out of 26 letters and is computed by the formula indicated earlier.

Since the upper/lower case is determined randomly, another way to state the probability function of the random variable $X$ is:

$\displaystyle P(X=x)=\displaystyle \frac{\binom{26}{x}}{2^{26}}=\frac{\binom{26}{x}}{67108864} \ \ \ \ \ \ \ \ \ x=0,1,2,3,\cdots,24,25,26$

The expected value of this random variable $X$ is 13. This is the average number of upper case letters if the case is determined randomly. Setting the case randomly tends to produce the strongest passwords. If John does not determine the case at random, the password may be weaker and a would-be hacker may be able to guess it.

Stepping away from the 26-character password example, here’s the probability function of a binomial distribution in general.

$\displaystyle P(X=x)=\binom{n}{x} \ p^x \ (1-p)^{n-x} \ \ \ \ \ \ \ \ x=0,1,2,3,\cdots,n-1,n$

This model describes the random experiment of running $n$ independent trials, where each trial has two outcomes (the technical term is Bernoulli trial). In each trial, the probability of one outcome (called success) is $p$ and the probability of the other outcome (called failure) is $1-p$. The random variable $X$ counts the number of successes whenever such an experiment is performed. The probability $P(X=x)$ gives the likelihood of achieving $x$ successes.

As an example, if John has a bias toward lower case letters, then the probability of success (upper case) may be $p=0.4$ (assuming that the random variable $X$ still counts the number of upper case letters). Then the average number of upper case letters in a randomly determined password is 26 x 0.4 = 10.4.
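As a quick illustration of the general formula, here is a small Python sketch of the binomial probability function (the function name `binom_pmf` is just for illustration); it reproduces the biased-coin example above.

```python
# General binomial probability function P(X = x) for n trials with success probability p.
from math import comb

def binom_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent Bernoulli trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 26, 0.4                 # a bias toward lower case: upper case with probability 0.4
print(n * p)                   # expected number of upper case letters: 10.4
print(binom_pmf(10, n, p))     # probability of exactly 10 upper case letters
```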

___________________________________________________________________________

Interesting binomial distribution problems

The problem of points and the dice problem are two famous probability problems that are in the history books as a result of a gambler seeking help from Pascal and Fermat.

___________________________________________________________________________
$\copyright \ 2016 \text{ by Dan Ma}$

# A randomized definition of the natural logarithm constant

The number $e$ is the base of the natural logarithm. It is an important constant in mathematics, approximately 2.718281828. This post discusses a charming little problem involving the number $e$. This number can be represented in many ways. In a calculus course, the number $e$ may be defined as the upper limit of the following integral:

$\displaystyle \int_1^e \frac{1}{t} \ dt=1$

Another representation is that it is the sum $\displaystyle e=\sum_{n=0}^\infty \frac{1}{n!}$. Still another is that it is the limit $\displaystyle e=\lim_{n \rightarrow \infty} \biggl( 1+\frac{1}{n} \biggr)^n$. According to the authors of a brief article from the Mathematics Magazine [1], these textbook definitions do not give immediate insight about the number $e$ (the article can be found here). As a result, students come away from the course without a solid understanding of the number $e$ and may have to resort to rote memorization on what the number $e$ is. Instead, the article gives six probability oriented ways in which the number $e$ can occur. These occurrences of $e$ are more interesting and, according to the authors of [1], can potentially increase students’ appreciation of the number $e$. In two of these six examples (Example 2 and Example 5), the number $e$ is defined by drawing random numbers from the interval $(0,1)$. This post discusses Example 2.

________________________________________________________________________

Random Experiment

Here’s the description of the random experiment that generates the number $e$. Randomly and successively select numbers from the interval $(0, 1)$. The experiment terminates when the sum of the random numbers exceeds 1. What is the average number of selections in this experiment? In other words, we are interested in the average length of the experiment. According to the article [1], the average length is the number $e$. The goal here is to give a proof of this result.

As illustration, the experiment is carried out 1 million times using random numbers that are generated in Excel using the Rand() function. The following table summarizes the results.

$\left[\begin{array}{rrrrrr} \text{Length of} & \text{ } & \text{ } & \text{ } & \text{Relative} \\ \text{Experiment} & \text{ } & \text{Frequency} & \text{ } & \text{Frequency} \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ 2 & \text{ } & 500777 & \text{ } & 0.500777 \\ 3 & \text{ } & 332736 & \text{ } & 0.332736 \\ 4 & \text{ } & 124875 & \text{ } & 0.124875 \\ 5 & \text{ } & 33465 & \text{ } & 0.033465 \\ 6 & \text{ } & 6827 & \text{ } & 0.006827 \\ 7 & \text{ } & 1130 & \text{ } & 0.001130 \\ 8 & \text{ } & 172 & \text{ } & 0.000172 \\ 9 & \text{ } & 14 & \text{ } & 0.000014 \\ 10 & \text{ } & 4 & \text{ } & 0.000004 \\ \text{ } & \text{ } & \text{ } & \text{ } & \text{ } \\ \text{Total} & \text{ } & 1000000 & \text{ } & \text{ } \end{array}\right]$

The average length of experiment in these 1 million experiments is 2.717001. Even though the rate of convergence to the number $e$ is fairly slow, the simulated data demonstrate that on average it takes approximately $e$ selections of numbers from the unit interval to get a sum that exceeds 1.
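For readers who prefer code to a spreadsheet, the following is a minimal Python sketch of the same simulation (the function name `experiment_length` is chosen only for illustration); the average it reports should be close to the simulated value above.

```python
# Draw uniform(0,1) numbers until the running sum exceeds 1 and record how
# many draws were needed; the average count over many runs should be near e.
import random

def experiment_length() -> int:
    total, count = 0.0, 0
    while total <= 1:
        total += random.random()   # one uniform(0,1) selection
        count += 1
    return count

trials = 1_000_000
average = sum(experiment_length() for _ in range(trials)) / trials
print(average)   # typically around 2.718
```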

________________________________________________________________________

A proof

Let $U_1,U_2,U_3,\cdots$ be a sequence of independent and identically distributed random variables such that the common distribution is a uniform distribution on the unit interval $(0,1)$. Let $N$ be defined as follows:

$\displaystyle N=\text{min}\left\{n: U_1+U_2+\cdots+U_n>1 \right\} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)$

The objective is to determine $E(N)$. On the surface, it seems that we need to describe the distribution of the independent sum $X_n=U_1+U_2+\cdots+U_n$ for all possible $n$. Doing this may be possible but the result would be messy. It turns out that we do not need to do so. We only need to evaluate the probability $P(X_n \le 1)$ for all $n$. We show that

$\displaystyle F_n(x)=P(X_n \le x)=\frac{x^n}{n!} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)$

for $0 \le x \le 1$ and for $n=1,2,3,\cdots$. This is accomplished by an induction proof. This is true for $n=1$ since $X_1=U_1$ is a uniform distribution. Suppose (2) holds for the integer $n-1$. This means that $\displaystyle F_{n-1}(x)=P(X_{n-1} \le x)=\frac{x^{n-1}}{(n-1)!}$ for $0 \le x \le 1$. Note that $X_n=X_{n-1}+U_n$, which is an independent sum. Let’s write out the convolution formula for this independent sum:

\displaystyle \begin{aligned} F_n(x)&=\int_{0}^x F_{n-1}(x-y) \cdot f_{U_n}(y) \ dy \\&=\int_{0}^x \frac{(x-y)^{n-1}}{(n-1)!} \cdot 1 \ dy \\&=\frac{1}{(n-1)!} \int_0^x (x-y)^{n-1} \ dy \\&=\frac{x^n}{n!} \end{aligned}

The above derivation completes the proof of the claim. We now come back to the problem of evaluating the mean of the random variable $N$ defined in (1). First, note that $N>n$ if and only if $X_n=U_1+U_2+\cdots+U_n \le 1$. So we have the probability statement $\displaystyle P(N>n)=F_n(1)=\frac{1}{n!}$.

As a result, the following is the probability function of the random variable $N$.

\displaystyle \begin{aligned} P(N=n)&=P(N>n-1)-P(N>n) \\&=F_{n-1}(1)-F_n(1) \\&=\frac{1}{(n-1)!}-\frac{1}{n!} \\&=\frac{n-1}{n!} \end{aligned}

Now evaluate the mean.

\displaystyle \begin{aligned} E(N)&=\sum \limits_{n=2}^\infty \frac{n(n-1)}{n!} \\&=\sum \limits_{n=2}^\infty \frac{1}{(n-2)!} \\&=\sum \limits_{m=0}^\infty \frac{1}{m!} \\&=e \end{aligned}

The above derivation completes the proof that $e=2.718281828 \cdots$ is the average number of random numbers one needs to select in order to obtain a sum that exceeds 1.
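As a quick numerical sanity check of the derivation (not part of the proof), the probability function $P(N=n)=(n-1)/n!$ can be summed directly:

```python
# Check that the probabilities (n-1)/n! sum to 1 and that the mean is e.
from math import e, factorial

ns = range(2, 40)
probs = [(n - 1) / factorial(n) for n in ns]
print(sum(probs))                             # essentially 1
print(sum(n * p for n, p in zip(ns, probs)))  # essentially e
print(e)                                      # 2.718281828459045
```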

________________________________________________________________________

Reference

1. Shultz H. S., Leonard B., Unexpected Occurrences of the Number e, Mathematics Magazine, October 1989, Volume 62, Number 4, pp. 269–271.

________________________________________________________________________
$\copyright \ \text{2015 by Dan Ma}$

# Introducing the Lognormal Distribution

This post introduces the lognormal distribution and discusses some of its basic properties. The lognormal distribution is a transformation of the normal distribution through exponentiation. The basic properties of the lognormal distribution discussed here are derived from the normal distribution. The normal distribution is applicable in many situations but not in all situations. The normal density curve is a bell-shaped curve and is thus not appropriate in phenomena that are skewed to the right. In such situations, the lognormal distribution can be a good alternative to the normal distribution.

________________________________________________________________________

Defining the distribution

In this post, the notation $\log$ refers to the natural log function, i.e., logarithm to the base $e=2.718281828459 \cdots$.

A random variable $Y$ is said to follow a lognormal distribution if $\log(Y)$ follows a normal distribution. A lognormal distribution has two parameters $\mu$ and $\sigma$, which are the mean and standard deviation of the normal random variable $\log(Y)$. To be more precise, the definition is restated as follows:

A random variable $Y$ is said to follow a lognormal distribution with parameters $\mu$ and $\sigma$ if $\log(Y)$ follows a normal distribution with mean $\mu$ and standard deviation $\sigma$.

Many useful probability distributions are transformations of other known distributions. The above definition shows that a normal distribution is the transformation of a lognormal distribution under the natural logarithm: start with a lognormal distribution, and taking its natural log gives a normal distribution. The other direction is actually more informative, i.e., a lognormal distribution is the transformation of a normal distribution by the exponential function: start with a normal random variable $X$, and exponentiating it gives a lognormal distribution, i.e., $Y=e^{X}$ is lognormal. The following summarizes these two transformations.

$Y \text{ is lognormal} \longrightarrow X=\log(Y) \text{ is normal}$

$X \text{ is normal} \longrightarrow Y=e^{X} \text{ is lognormal}$

Since the exponential function gives positive values, the lognormal distribution always takes on positive real values. The following diagram shows the probability density functions of the standard normal distribution and the corresponding lognormal distribution. Recall that the standard normal distribution is the normal distribution with mean 0 and standard deviation 1.

Figure 1 – normal and lognormal density curves

In Figure 1, the standard normal density curve is a symmetric bell-shaped curve, with mean and median located at x = 0. The standard lognormal density (it is called standard since it is derived from the standard normal distribution) is located entirely over the positive x-axis. This is because the exponential function always gives positive values regardless of the sign of the argument. The lognormal density curve in Figure 1 is not symmetric and is a uni-modal curve skewed toward the right. All the standard normal values on the negative x-axis, when exponentiated, are in the interval (0, 1). Thus in Figure 1, the lower half of the lognormal probabilities lie in the interval x = 0 to x = 1 (i.e. the median of this lognormal distribution is x = 1). The other half of the lognormal probabilities lie in the interval $(1, \infty)$. Such a lopsided assignment of probabilities shows that the lognormal distribution is a positively skewed distribution (skewed to the right).

In the above paragraph, the lower half of the normal distribution on $(-\infty,0)$ is matched with the lognormal distribution on the interval $(0, 1)$. Such interval matching can tell us a great deal about the lognormal distribution. Another example: about 75% of the standard normal distributional values lie below x = 0.67. Thus in Figure 1, about 75% of the lognormal probabilities lie in the interval $(0, 1.95)$ where $e^{0.67} \approx 1.95$. Another example: what is the probability that the lognormal distribution in Figure 1 lies between 1 and 3.5? The matching normal interval is $(0, 1.25)$ where $\log(3.5) \approx 1.25$. The normal probability in this interval is 0.8944 – 0.5 = 0.3944. Thus if a value is randomly generated from the standard lognormal distribution, there is a 39.44% chance that it is between 1 and 3.5.

The interval matching idea is very useful for computing lognormal probabilities (e.g. cumulative distribution function) and for finding lognormal percentiles. This idea is discussed further below to make it work for any lognormal distribution, not just the standard lognormal distribution.

Though the lognormal distribution is always skewed, some lognormal distributions are less skewed than others. The lognormal distributions with larger parameter value of $\sigma$ tend to be more skewed. The following is a diagram of three lognormal density curves that demonstrates this point. Note that the curve with the small $\sigma$ of 0.25 closely resembles a bell curve.

Figure 2 – three lognormal density curves

________________________________________________________________________

How to compute lognormal probabilities and percentiles

Let $Y$ be a random variable that follows a lognormal distribution with parameters $\mu$ and $\sigma$. Then the related normal random variable is $X=\log(Y)$, which has mean $\mu$ and standard deviation $\sigma$. If we raise $e$ to $X$, we get back the lognormal $Y$.

Continuing the interval matching idea, the lognormal interval $Y \le y$ will match with the normal interval $X \le \log(y)$. Both intervals receive the same probability in their respective distributions. The following states this more clearly.

$\displaystyle P\biggl(Y \le y\biggr)=P\biggl(X=\log(Y) \le \log(y)\biggr) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)$

On the other hand, the normal interval $X \le x$ will match with the lognormal interval $Y \le e^x$. The same probability is assigned to both intervals in their respective distributions. This idea is stated as follows:

$\displaystyle P\biggl(X \le x\biggr)=P\biggl(Y=e^X \le e^x\biggr) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)$

The idea of $(1)$ gives the cumulative distribution of the lognormal distribution (argument $y$), which is evaluated as the CDF of the corresponding normal distribution at $\log(y)$. One obvious application of $(2)$ is to have an easy way to find percentiles for the lognormal distribution. It is relatively easy to find the corresponding percentile of the normal distribution. Then the lognormal percentile is $e$ raised to the corresponding percentile of the normal distribution. For example, the median of the normal distribution is at the mean $\mu$. Then the median of the lognormal distribution is $e^{\mu}$.

The calculation in both $(1)$ and $(2)$ involves finding normal probabilities, which can be obtained using software or using a table of probability values of the standard normal distribution. To do the table approach, each normal CDF is converted to the standard normal CDF as follows:

$\displaystyle P\biggl(X \le x \biggr)=P\biggl(\frac{X-\mu}{\sigma} \le \frac{x-\mu}{\sigma} \biggr)=\Phi(z) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3)$

where $z$ is the z-score which is the ratio $(x-\mu)/\sigma$ and $\Phi(z)$ is the cumulative distribution function of the standard normal distribution, which can be looked up from a table based on the z-score. In light of this, $(1)$ can be expressed as follows:

$\displaystyle P\biggl(Y \le y\biggr)=\Phi\biggl(\frac{\log(y)-\mu}{\sigma} \biggr) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4)$

A quick example to demonstrate how this works.

Example 1
If $Y$ is lognormally distributed with parameters $\mu=2.5$ and $\sigma=1.5$,

• what is the probability $P(1.9 \le Y \le 31.34)$?
• what is the 95th percentile of this lognormal distribution?

The first answer is $P(1.9 \le Y \le 31.34)=P(Y \le 31.34)-P(Y \le 1.9)$, which is calculated as follows:

$\displaystyle P(Y \le 31.34)=\Phi \biggl( \frac{\log(31.34)-2.5}{1.5}\biggr)=\Phi(0.63)=0.7357$

$\displaystyle P(Y \le 1.9)=\Phi \biggl( \frac{\log(1.9)-2.5}{1.5}\biggr)=\Phi(-1.24)=1-0.8925=0.1075$

\displaystyle \begin{aligned} P(1.9 \le Y \le 31.34)&=P(Y \le 31.34)-P(Y \le 1.9) \\&=0.7357-0.1075 \\&=0.6282 \end{aligned}

The z-score for the 95th percentile for the standard normal distribution is z = 1.645. Then the 95th percentile for the normal distribution with mean 2.5 and standard deviation 1.5 is x = 2.5 + 1.645 (1.5) = 4.9675. Then apply the exponential function to obtain $e^{4.9675} \approx 143.67$, which is the desired lognormal 95th percentile. $\square$
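Example 1 can also be checked with a few lines of Python using the standard library’s `statistics.NormalDist` (a sketch; any normal CDF routine would do). The small differences from the answers above come from rounding the z-values to two decimal places in the table approach.

```python
# Check Example 1: lognormal probabilities are normal probabilities at log(y),
# and lognormal percentiles are e raised to the corresponding normal percentiles.
from math import exp, log
from statistics import NormalDist

mu, sigma = 2.5, 1.5
X = NormalDist(mu, sigma)              # the normal random variable log(Y)

prob = X.cdf(log(31.34)) - X.cdf(log(1.9))
print(prob)                            # about 0.628 (table-based answer: 0.6282)

x95 = X.inv_cdf(0.95)                  # normal 95th percentile, about 4.967
print(exp(x95))                        # about 143.6 (table-based answer: 143.67)
```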

As $(2)$ and Example 1 suggest, to find a lognormal percentile, first find the percentile for the corresponding normal distribution. If $x_p$ is the $100p$th percentile of the normal distribution, then $e^{x_p}$ is the $100p$th percentile of the lognormal distribution. Usually, we can first find the $100p$th percentile for the standard normal distribution $z_p$. Then the normal percentile we need is $x=\mu + z_p \cdot \sigma$. The lognormal percentile is then:

$\displaystyle e^{\displaystyle \mu + z_p \cdot \sigma}=\text{lognormal } 100p \text{th percentile} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (5)$

The above discussion shows that the explicit form of the lognormal density curve is not needed in computing lognormal probabilities and percentiles. For the sake of completeness, the following shows the probability density functions of both the normal distribution and the lognormal distribution.

Normal PDF
$\displaystyle f_X(x)=\frac{1}{\sigma \sqrt{2 \pi}} \ \ \text{\Large e}^{\displaystyle -\frac{(x-\mu)^2}{2 \sigma^2}} \ \ \ \ \ \ \ \ \ \ \ -\infty<x<\infty \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (6)$

Lognormal PDF
$\displaystyle f_Y(y)=\frac{1}{y \ \sigma \sqrt{2 \pi}} \ \ \text{\Large e}^{\displaystyle -\frac{(\log(y)-\mu)^2}{2 \sigma^2}} \ \ \ \ \ 0<y<\infty \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (7)$

The cumulative distribution function for the lognormal distribution is then

$\displaystyle F_Y(y)=\int_0^y \frac{1}{t \ \sigma \sqrt{2 \pi}} \ \ \text{\Large e}^{\displaystyle -\frac{(\log(t)-\mu)^2}{2 \sigma^2}} \ dt \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (8)$

Of course, we do not have to use $(8)$ since the lognormal CDF can be obtained based on the corresponding normal CDF.

One application of the lognormal PDF in $(7)$ is to use it to find the mode (by taking its derivative and finding the critical value). The mode of the lognormal distribution with parameters $\mu$ and $\sigma$ is $\displaystyle e^{\mu - \sigma^2}$.

________________________________________________________________________

How to find lognormal moments

To find the mean and higher moments of the lognormal distribution, we once again rely on basic information about normal distribution. For any random variable $T$ (normal or otherwise), its moment generating function is defined by $M_T(t)=E(e^{t \ T})$. The following is the moment generating function of the normal distribution with mean $\mu$ and standard deviation $\sigma$.

$\displaystyle M_X(t)=\text{\Large e}^{\displaystyle \biggl[ \mu t + (1/2) \sigma^2 t^2 \biggr]} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (9)$

As before, let $Y$ be a random variable that follows a lognormal distribution with parameters $\mu$ and $\sigma$. Then $Y=e^X$ where $X$ is normal with mean $\mu$ and standard deviation $\sigma$. Then $E(Y)=E(e^X)$ is simply the normal moment generating function evaluated at 1. In fact, the kth moment of $Y$, $E(Y^k)=E(e^{k X})$, is simply the normal mgf evaluated at $k$. Because the mgf of the normal distribution is defined at any real number, all moments for the lognormal distribution exist. The following gives the moments explicitly.

$E(Y)=\text{\Large e}^{\displaystyle \biggl[ \mu+(1/2) \sigma^2 \biggr]}$

$E(Y^k)=\text{\Large e}^{\displaystyle \biggl[ k \mu+(k^2/2) \sigma^2 \biggr]} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (10)$

In particular, the variance and standard deviation are:

\displaystyle \begin{aligned}Var(Y)&=\text{\Large e}^{\displaystyle \biggl[ 2 \mu+2 \sigma^2 \biggr]}-\text{\Large e}^{\displaystyle \biggl[ 2 \mu+\sigma^2 \biggr]} \\&=\biggl(\text{\Large e}^{\displaystyle \sigma^2}-1\biggr) \text{\Large e}^{\displaystyle \biggl[ 2 \mu+\sigma^2 \biggr]} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (11) \end{aligned}

$\displaystyle \sigma_Y=\sqrt{\text{\Large e}^{\displaystyle \sigma^2}-1} \ \ \text{\Large e}^{\displaystyle \biggl[ \mu+\frac{1}{2} \sigma^2 \biggr]}=\sqrt{\text{\Large e}^{\displaystyle \sigma^2}-1} \ E(Y) \ \ \ \ \ \ \ \ \ \ \ \ (12)$

The formulas $(11)$ and $(12)$ give the variance and standard deviation if the parameters $\mu$ and $\sigma$ are known. They do not need to be committed to memory, since they can always be generated from knowing the moments in $(10)$. As indicated before, the lognormal median is $e^{\mu}$, which is always less than the mean, which is $e$ raised to $\mu+(1/2) \sigma^2$. So the mean is greater than the median by a factor of $e$ raised to $(1/2) \sigma^2$. The mean being greater than the median is another sign that the lognormal distribution is skewed right.

Example 2
Suppose $Y$ follows a lognormal distribution with mean 12.18 and variance 255.02. Determine the probability that $Y$ is greater than its mean.

With the given information, we have:

$E(Y)=\text{\Large e}^{\displaystyle \biggl[ \mu+(1/2) \sigma^2 \biggr]}=12.18$

$\biggl(\text{\Large e}^{\displaystyle \sigma^2}-1\biggr) \text{\Large e}^{\displaystyle \biggl[ 2 \mu+\sigma^2 \biggr]}=\biggl(\text{\Large e}^{\displaystyle \sigma^2}-1\biggr) \biggl(\text{\Large e}^{\displaystyle \biggl[ \mu+(1/2) \sigma^2 \biggr]}\biggr)^2=255.02$

\displaystyle \begin{aligned}Var(Y)&=\biggl(\text{\Large e}^{\displaystyle \sigma^2}-1\biggr) \text{\Large e}^{\displaystyle \biggl[ 2 \mu+\sigma^2 \biggr]} \\&=\biggl(\text{\Large e}^{\displaystyle \sigma^2}-1\biggr) \biggl(\text{\Large e}^{\displaystyle \biggl[ \mu+(1/2) \sigma^2 \biggr]}\biggr)^2 \\&= \biggl(\text{\Large e}^{\displaystyle \sigma^2}-1\biggr) 12.18^2=255.02 \end{aligned}

From the last equation, we can solve for $\sigma$. The following shows the derivation.

\displaystyle \begin{aligned} &\biggl(\text{\Large e}^{\displaystyle \sigma^2}-1\biggr)=\frac{255.02}{12.18^2} \\&\text{\Large e}^{\displaystyle \sigma^2}=2.719014994 \\&\sigma^2=\log(2.719014994)=1.00026968 \end{aligned}

Thus we can take $\sigma=1$. Then plug $\sigma=1$ into $E(Y)$ to get $\mu=2$. The desired probability is:

\displaystyle \begin{aligned} P(Y>12.18)&=1-\Phi \biggl(\frac{\log(12.18)-2}{1} \biggr) \\&=1-\Phi(0.499795262) \\&=1-\Phi(0.5)=1-0.6915=0.3085 \ \square \end{aligned}
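The same calculation can be scripted. The following Python sketch recovers $\mu$ and $\sigma$ from the given mean and variance using $(10)$ and $(11)$, and then computes the desired probability (again the table-based answer differs slightly because of rounding).

```python
# Check Example 2: recover mu and sigma from E(Y) and Var(Y), then find P(Y > E(Y)).
from math import log, sqrt
from statistics import NormalDist

mean_Y, var_Y = 12.18, 255.02

sigma_sq = log(var_Y / mean_Y**2 + 1)   # from (11): Var(Y) = (e^{sigma^2} - 1) E(Y)^2
sigma = sqrt(sigma_sq)                  # approximately 1
mu = log(mean_Y) - sigma_sq / 2         # from E(Y) = e^{mu + sigma^2/2}, approximately 2
print(mu, sigma)

print(1 - NormalDist(mu, sigma).cdf(log(mean_Y)))   # about 0.3085
```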

________________________________________________________________________

Linear transformations

For any random variable $W$, a linear transformation of $W$ is the random variable $aW+b$ where $a$ and $b$ are real constants. It is well known that if $X$ follows a normal distribution, any linear transformation of $X$ also follows a normal distribution. Does this apply to lognormal distribution? A linear transformation of a lognormal distribution may not have distributional values over the entire positive x-axis. For example, if $Y$ is lognormal, then $Y+1$ is technically not lognormal since the values of $Y+1$ lie in $(1, \infty)$ and not $(0, \infty)$. Instead, we focus on the transformations $cY$ where $c>0$ is a constant. We have the following fact.

If $Y$ has a lognormal distribution with parameters $\mu$ and $\sigma$, then $cY$ has a lognormal distribution with parameters $\mu+\log(c)$ and $\sigma$.

The effect of the constant adjustment of the lognormal distribution is on the $\mu$ parameter, which is adjusted by adding the natural log of the constant $c$. Note that the adjustment on $\mu$ is addition and not multiplication. The $\sigma$ parameter is unchanged.

One application of the transformation $cY$ is that of inflation. For example, suppose $Y$ represents claim amounts in a given calendar year arising from a group of insurance policies. If the insurance company expects that the claims amounts in the next year will increase by 10%, then $1.1Y$ is the random variable that models next year’s claim amounts. If $Y$ is assumed to be lognormal, then the effect of the 10% inflation is on the $\mu$ parameter as indicated above.

To see why the inflation on $Y$ works as described, let’s look at the cumulative distribution function of $T=cY$.

\displaystyle \begin{aligned} P(T \le y)&=P(Y \le \frac{y}{c})=F_Y(y/c) \\&=\int_0^{y/c} \frac{1}{t \ \sigma \sqrt{2 \pi}} \ \ \text{\Large e}^{\displaystyle -\frac{(\log(t)-\mu)^2}{2 \sigma^2}} \ dt \end{aligned}

Taking derivative of the last item above, we obtain the probability density function $f_T(y)$.

\displaystyle \begin{aligned} f_T(y)&=\frac{1}{y/c \ \sigma \sqrt{2 \pi}} \ \ \text{\Large e}^{\displaystyle -\frac{(\log(y/c)-\mu)^2}{2 \sigma^2}} (1/c) \\&=\frac{1}{y \ \sigma \sqrt{2 \pi}} \ \ \text{\Large e}^{\displaystyle -\frac{\biggl(\log(y)-(\mu+\log(c))\biggr)^2}{2 \sigma^2}} \end{aligned}

Comparing with the density function $(7)$, the last line is the lognormal density function with parameters $\mu + \log(c)$ and $\sigma$.
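A quick Monte Carlo check of this fact is easy to run (a sketch, with an arbitrary seed and sample size): simulate $1.1Y$ and compare its empirical CDF with the lognormal CDF that has parameters $\mu+\log(1.1)$ and $\sigma$.

```python
# Simulate c*Y where Y is lognormal(mu, sigma), and compare with the claimed
# lognormal distribution with parameters mu + log(c) and sigma.
import random
from math import log
from statistics import NormalDist

mu, sigma, c = 2.0, 1.0, 1.1
rng = random.Random(2015)

sample = [c * rng.lognormvariate(mu, sigma) for _ in range(200_000)]
claimed = NormalDist(mu + log(c), sigma)       # normal distribution of log(c*Y)

for y in (2, 5, 10, 20, 50):
    empirical = sum(v <= y for v in sample) / len(sample)
    print(y, round(empirical, 3), round(claimed.cdf(log(y)), 3))
```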

________________________________________________________________________

Distributional quantities involving higher moments

As the formula $(10)$ shows, all moments exist for the lognormal distribution. As a result, any distributional quantity that is defined using moments can be described explicitly in terms of the parameters $\mu$ and $\sigma$. We highlight three such distributional quantities: coefficient of variation, coefficient of skewness and kurtosis. The following shows their definitions. The calculation is done by plugging in the moments obtained from $(10)$.

$\displaystyle CV=\frac{\sigma_Y}{\mu_Y} \ \ \ \ \ \text{(Coefficient of variation)}$

\displaystyle \begin{aligned} \gamma_1&=\frac{E[ (Y-\mu_Y)^3 ]}{\sigma_Y^3} \\&=\frac{E(Y^3)-3 \mu_Y \sigma_Y^2-\mu_Y^3}{(\sigma_Y^2)^{\frac{3}{2}}} \ \ \ \ \ \text{(Coefficient of skewness)} \end{aligned}

\displaystyle \begin{aligned} \beta_2&=\frac{E[ (Y-\mu_Y)^4 ]}{\biggl(E[ (Y-\mu_Y)^2]\biggr)^2} \\&=\frac{E(Y^4)-4 \ \mu_Y \ E(Y^3) + 6 \ \mu_Y^2 \ E(Y^2) - 3 \ \mu_Y^4}{(\sigma_Y^2)^{2}} \ \ \ \ \ \text{(Kurtosis)} \end{aligned}

The above definitions are made for any random variable $Y$. The notations $\mu_Y$ and $\sigma_Y$ are the mean and standard deviation of $Y$, respectively. The coefficient of variation is the ratio of the standard deviation to the mean. It is a standardized measure of dispersion of a probability distribution or frequency distribution. The coefficient of skewness is defined as the ratio of the third central moment about the mean to the cube of the standard deviation. The second line in the above definition is an equivalent form that is in terms of the mean, variance and the third raw moment, which may be easier to calculate in some circumstances. Kurtosis is defined to be the ratio of the fourth central moment about the mean to the square of the second central moment about the mean. The second line in the definition gives an equivalent form that is in terms of the mean, variance and the second, third and fourth raw moments.

The above general definitions of CV, $\gamma_1$ and $\beta_2$ can be obtained for the lognormal distribution. The mean and variance and higher raw moments can be obtained by using $(10)$. Then it is a matter of plugging in the relevant items into the above definitions. The following example shows how this is done.

Example 3
Determine the CV, $\gamma_1$ and $\beta_2$ of the lognormal distribution in Example 2.

The calculation in Example 2 shows that the lognormal parameters are $\mu=2$ and $\sigma=1$. Now use formula $(10)$ to get the ingredients.

$\mu_Y=e^{2.5}$

$Var(Y)=(e-1) e^5 \ \ \ \ \ \ \ \ \sigma_Y=\sqrt{e-1} \ e^{2.5}$

$E(Y^2)=e^6$

$E(Y^3)=e^{10.5}$

$E(Y^4)=e^{16}$

Right away, CV = $\sqrt{e-1}=1.31$. The following shows the calculation for skewness and kurtosis.

\displaystyle \begin{aligned} \gamma_1&=\frac{E(Y^3)-3 \mu_Y \sigma_Y^2-\mu_Y^3}{(\sigma_Y^2)^{\frac{3}{2}}} \\&=\frac{e^{10.5}-3 \ e^{2.5} \ (e-1) \ e^5-(e^{2.5})^3}{[(e-1) \ e^5]^{\frac{3}{2}}}\\&=6.1849 \end{aligned}

\displaystyle \begin{aligned} \beta_2&=\frac{E(Y^4)-4 \ \mu_Y \ E(Y^3) + 6 \ \mu_Y^2 \ E(Y^2) - 3 \ \mu_Y^4}{(\sigma_Y^2)^{2}} \\&=\frac{e^{16}-4 \ e^{2.5} \ e^{10.5} + 6 \ (e^{2.5})^2 \ e^{6} - 3 \ (e^{2.5})^4}{[(e-1) \ e^5]^{2}}\\&=113.9364 \end{aligned}
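The arithmetic in Example 3 is easy to mechanize. The following Python sketch computes the raw moments from $(10)$ and plugs them into the definitions above (the helper name `raw_moment` is for illustration only).

```python
# CV, skewness and kurtosis of the lognormal distribution with mu = 2, sigma = 1.
from math import exp, sqrt

mu, sigma = 2.0, 1.0

def raw_moment(k: int) -> float:
    return exp(k * mu + (k**2 / 2) * sigma**2)   # E(Y^k) from formula (10)

m1, m2, m3, m4 = (raw_moment(k) for k in (1, 2, 3, 4))
var = m2 - m1**2
sd = sqrt(var)

print(sd / m1)                                              # CV, about 1.3108
print((m3 - 3 * m1 * var - m1**3) / var**1.5)               # skewness, about 6.1849
print((m4 - 4*m1*m3 + 6*m1**2*m2 - 3*m1**4) / var**2)       # kurtosis, about 113.9364
```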

________________________________________________________________________

Is there a moment generating function for the lognormal distribution?

Because the normal distribution has a moment generating function, all moments exist for the lognormal distribution (see formula $(10)$ above). Does the moment generating function exist for the lognormal distribution? Whenever the mgf exists for a distribution, its moments can be derived from the mgf. What about the converse? That is, when all moments exist for a given distribution, does it mean that its moment generating function would always exist? The answer is no. It turns out that the lognormal distribution is a counterexample. We conclude this post by showing this fact.

Let $Y$ be the standard lognormal distribution, i.e., $X=\log(Y)$ has the standard normal distribution. We show that the expectation $E(e^{tY})$ diverges to infinity for any $t>0$.

\displaystyle \begin{aligned} E(e^{t \ Y})&=E(e^{t \ e^X}) \\&=\frac{1}{\sqrt{2 \pi}} \ \int_{-\infty}^\infty e^{t \ e^x} \ e^{-0.5 x^2} \ dx \\&=\frac{1}{\sqrt{2 \pi}} \ \int_{-\infty}^\infty e^{t \ e^x - 0.5x^2} \ dx \\&> \frac{1}{\sqrt{2 \pi}} \ \int_{0}^\infty e^{t \ e^x - 0.5x^2} \ dx \\&> \frac{1}{\sqrt{2 \pi}} \ \int_{0}^\infty e^{\biggl(t \ ( \displaystyle 1+x+\frac{x^2}{2!}+\frac{x^3}{3!}) - 0.5x^2 \biggr)} \ dx=\infty \end{aligned}

The last integral in the above derivation diverges to infinity. Note that the Taylor series expansion of $e^x$ is $\displaystyle 1+x+\frac{x^2}{2!}+\frac{x^3}{3!}+\frac{x^4}{4!}+\cdots$. In the last step, $e^x$ is replaced by $\displaystyle 1+x+\frac{x^2}{2!}+\frac{x^3}{3!}$. Then the exponent in the last integral is a third degree polynomial with a positive coefficient for the $x^3$ term. Thus this third degree polynomial goes to infinity as $x$ goes to infinity. Since the last integral diverges to infinity, the mgf $E(e^{t \ Y})$ diverges to infinity as well.

________________________________________________________________________

Practice problems

Practice problems to reinforce the calculation are found in the companion blog to this blog.

________________________________________________________________________
$\copyright \ \text{2015 by Dan Ma}$

# The skewness of a probability distribution

In this post, we discuss how to calculate the moment coefficient of skewness and also discuss some issues surrounding the notion of skewness.

________________________________________________________________________

Looking at graphs

One informal but useful way of checking the skewness of a distribution is to look at the density curve (or a histogram). Consider the following density functions.

Figure 1

Figure 2

The density curve in Figure 1 has a longer tail to the right than to the left. The example in Figure 1 is a distribution that is skewed to the right. It is also said to be positively skewed since its coefficient of skewness is positive. The density curve in Figure 2 has a longer tail to the left than to the right. The example in Figure 2 is a distribution that is skewed to the left. It is also said to be negatively skewed since the skewness coefficient is negative. If a density curve looks the same to the left and to the right (such as the bell curve for the normal distribution), then it is a symmetric distribution and the skewness coefficient is zero.

The distribution in Figure 1 is a right skewed distribution (the longer tail is on the right). It is a gamma distribution with mean 2 and median approximately 1.678347. The mode (the highest peak) is at x = 1. The distribution in Figure 2 is a left skewed distribution (the longer tail is on the left) with mean and median approximately 0.909 and 0.9213562, respectively. The mode is at 0.95.

________________________________________________________________________

In the distribution for Figure 1, we can say that “mode < median < mean”. In the distribution for Figure 2, we can say that “mean < median < mode”. A common conception is that these simple rules characterize all skewed distributions, i.e., the mean is to the right of the median, which in turn is to the right of the mode in a right skewed distribution, and the mean is to the left of the median, which in turn is to the left of the mode in a left skewed distribution. Such rules are certainly easy to remember and are stated in some statistics textbooks. In the above two figures, this rule of thumb is certainly true. It turns out that this rule of thumb does not hold in many instances. The above two graphs are “textbook” demonstrations of skewness. They are gamma distributions and beta distributions and they behave well according to the usual notion of how skewed distributions should look. In a later section of this post, we will discuss this issue in greater detail. First we define the coefficient of skewness.

________________________________________________________________________

Pearson moment coefficient of skewness

The measure of skewness defined here is called the Pearson moment coefficient of skewness. This measure provides information about the amount and direction of the departure from symmetry. Its value can be positive or negative, or even undefined. The higher the absolute value of the skewness measure, the more asymmetric the distribution. The skewness measure of a symmetric distribution is zero or near zero.

To help put the definition of skewness in context, we first define raw moments and central moments of a random variable $X$. The $k$th raw moment of $X$ is $E(X^k)$, the expected value of the $k$th power of the random variable $X$. The first raw moment is the mean of the random variable and is usually denoted by $\mu$.

The $k$th central moment of a random variable $X$ is $E[(X-\mu)^k]$, the expected value of the $k$th power of the deviation of the variable from its mean. The moment $E[(X-\mu)^k]$ is usually denoted by $\mu_k$. The second central moment is usually called the variance and is denoted by $\sigma^2$. The square root of $\sigma^2$, $\sigma$, is the standard deviation.

The ratio of the standard deviation to the mean, $\displaystyle \frac{\sigma}{\mu}$, is called the coefficient of variation.

The ratio of the third central moment to the cube of the standard deviation is called Pearson’s moment coefficient of skewness (or the coefficient of skewness) and is denoted by $\gamma_1$.

$\displaystyle \gamma_1=\frac{E[ (X-\mu)^3 ]}{\sigma^3}=\frac{\mu_3}{\sigma^3} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)$

The skewness in (1) can be expanded to derive a version that can be calculated more easily:

\displaystyle \begin{aligned} \gamma_1&=\frac{E[ (X-\mu)^3 ]}{\sigma^3} \\&=\frac{E(X^3)-3 \mu E(X^2)+3 \mu^2 E(X)-\mu^3}{\sigma^3} \\&=\frac{E(X^3)-3 \mu [E(X^2)-\mu E(X)]-\mu^3}{\sigma^3} \\&=\frac{E(X^3)-3 \mu \sigma^2-\mu^3}{\sigma^3} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2) \\&=\frac{E(X^3)-3 \mu \sigma^2-\mu^3}{(\sigma^2)^{\frac{3}{2}}} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3) \end{aligned}

The last version (3) is in terms of the first raw moment $\mu$, the second central moment $\sigma^2$ and the third raw moment $E(X^3)$. Essentially, the coefficient $\gamma_1$ can be obtained via (3) by first computing the first three raw moments.
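Version (3) is convenient to code up. The following small Python helper (the name `skewness_from_raw_moments` is just for illustration) takes the first three raw moments and returns $\gamma_1$; it is checked here against the exponential distribution, which (as a gamma distribution with $\alpha=1$, see Example 1 below) has skewness 2.

```python
# Moment coefficient of skewness computed from the first three raw moments, as in (3).
def skewness_from_raw_moments(m1: float, m2: float, m3: float) -> float:
    variance = m2 - m1**2
    return (m3 - 3 * m1 * variance - m1**3) / variance**1.5

# Sanity check: an exponential distribution with rate 1 has raw moments 1, 2, 6.
print(skewness_from_raw_moments(1, 2, 6))   # 2.0
```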

Even though kurtosis is not the focus of this post, we would like to state it to complete the brief discussion on moments. The ratio of the fourth central moment to the fourth power of the standard deviation, $\displaystyle \gamma_2=\frac{\mu_4}{\sigma^4}$, is called the kurtosis.

________________________________________________________________________

Examples

In this section, we discuss the skewness in two familiar families of continuous distributions – gamma and beta. We also demonstrate how exponentiation can affect skewness.

Example 1 – Gamma Distribution
The following is the probability density function of the gamma distribution.

$\displaystyle f(x)=\frac{\beta^\alpha}{\Gamma(\alpha)} \ x^{\alpha-1} \ e^{-\beta x} \ \ \ \ \ \ \ \ \ x>0$

where $\Gamma(\cdot)$ is the gamma function, and $\alpha$ and $\beta$ are parameters such that $\alpha>0$ and $\beta>0$. The number $\alpha$ is the shape parameter and the number $\beta$ here is the rate parameter. Figure 1 shows the gamma distribution with $\alpha=2$ and $\beta=1$. When $\alpha=1$, we obtain the exponential distribution. When $\beta=\frac{1}{2}$ and $\alpha=\frac{k}{2}$ where $k$ is a positive integer, we obtain the chi square distribution with $k$ degrees of freedom.

Let $X$ be a random variable with the above gamma density function. The raw moments $E(X^k)$, where $k=1,2,3,\cdots$, are:

$\displaystyle E(X^k)=\frac{(\alpha+k-1)(\alpha+k-2) \cdots \alpha}{\beta^k}$

Using the first two raw moments to calculate the variance as well as the third moment, the following calculates the moment coefficient of skewness, based on the form in (3):

$\displaystyle \gamma_1=\frac{\displaystyle \frac{(\alpha+2)(\alpha+1)\alpha}{\beta^3}-3 \frac{\alpha}{\beta} \frac{\alpha}{\beta^3}-\frac{\alpha^3}{\beta^3}}{\biggl( \displaystyle \frac{\alpha}{\beta^2} \biggr)^{\frac{3}{2}}}=\frac{2}{\sqrt{\alpha}}$

The above calculation shows that the rate parameter $\beta$ has no effect on skewness. The example in Figure 1 has $\alpha=2$, giving a coefficient of skewness of $\sqrt{2}$ = 1.414213562. In general, the gamma distribution is skewed positively. However, the gamma distribution becomes more and more symmetric as the shape parameter $\alpha \rightarrow \infty$. The following graphs the gamma densities for $\alpha=1, 2, 3, 5, 6$ and $\beta=1$.

Figure 3

In Figure 3, the light blue density with $\alpha=1$ is an exponential distribution. The red one with $\alpha=2$ is the density in Figure 1. With $\alpha=6$, the gamma density already looks very symmetric (the dark blue).

On the other hand, as the shape parameter $\alpha \rightarrow 0$, the gamma distribution becomes increasingly positively skewed. When $\alpha=\frac{1}{n}$, $\gamma_1=2 \sqrt{n}$. When $n \rightarrow \infty$, $\gamma_1 \rightarrow \infty$.

Example 2 – Beta Distribution
The following is the PDF of a beta distribution:

$\displaystyle f(x)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \Gamma(\beta)} \ x^{\alpha-1} \ (1-x)^{\beta-1} \ \ \ \ \ \ \ \ \ \ \ \ 0<x<1$

where $\Gamma(\cdot)$ is the gamma function, and $\alpha$ and $\beta$ are parameters such that $\alpha>0$ and $\beta>0$. Both $\alpha$ and $\beta$ are shape parameters.

In the beta family of distributions, the skewness can range from positive to negative. If the $\alpha$ parameter dominates (i.e. $x$ is raised to a higher power and $1-x$ is raised to a smaller power in the density function), then the beta distribution has a negative skew (skewed to the left). This is because the function $x^n$ has left skew and the function $(1-x)^n$ has right skew. Then the skewness of the beta distribution follows the one that dominates. If the $\beta$ parameter dominates, the beta distribution is skewed to the right. If both parameters are roughly equal, the beta distribution is close to symmetric. For example, when $\alpha=20$ and $\beta=2$, the beta distribution is left skewed (its density curve is in Figure 2). As in the gamma case, the skewness of the beta distribution has a closed form. The following formula confirms the intuition about the skewness of the beta distribution.

$\displaystyle \gamma_1=\frac{2(\beta-\alpha) \ \sqrt{\alpha+\beta+1}}{(\alpha+\beta+2) \ \sqrt{\alpha \ \beta}}$

Thus the beta distribution with $\alpha=20$ and $\beta=2$ has skewness coefficient -1.137431317. The following figure further demonstrates the role the shape parameters play in changing the skewness of the beta distribution.

Figure 4

In Figure 4, as the $\alpha$ parameter goes from 2 to 20, the skewness goes from 1.137431317 to 0.659393193 to 0 to -0.659393193 to -1.137431317.

Example 3 – Exponentiation
Symmetric distributions have zero coefficient of skewness. Raising a symmetric distribution to a positive power can produce a skewed distribution. For example, let $X$ be the standard normal random variable (mean 0 and variance 1). Let $Y=X^2$. Then $Y$ has a chi-square distribution with 1 degree of freedom, which means that it is a gamma distribution with $\alpha=\frac{1}{2}$ and $\beta=\frac{1}{2}$. According to Example 1 above, the skewness coefficient is $\frac{2}{\sqrt{0.5}}=2 \sqrt{2}=2.828$. Thus squaring a standard normal distribution produces a very strongly positively skewed distribution.

Example 4 – Exponentiation
Raising a positively skewed distribution to a positive power can produce a more strongly positively skewed distribution. For example, let $X$ be an exponential random variable. Example 1 shows that exponential distributions have skewness coefficient 2. We show that the coefficient of skewness for $Y=X^2$ is approximately 6.619.

The density function for the exponential random variable $X$ is $f(x)=\beta e^{-\beta x}$ where $\beta>0$ is the rate parameter. It can be shown that the raw moments of $X$ are:

$\displaystyle E(X^k)=\frac{k!}{\beta^k} \ \ \ \ \ \ \ \ \ \ \ k=1,2,3,\cdots$

Then the first three moments of $Y$ are:

$\displaystyle E(Y)=E(X^2)=\frac{2}{\beta^2}$

$\displaystyle E(Y^2)=E(X^4)=\frac{24}{\beta^4}$

$\displaystyle E(Y^3)=E(X^6)=\frac{720}{\beta^6}$

With the first two raw moments, calculate the variance of $Y$. Then compute $\gamma_1$ via formula (3).

$\displaystyle \gamma_1=\frac{\displaystyle \frac{720}{\beta^6}-3 \ \frac{2}{\beta^2} \ \frac{20}{\beta^4}-\frac{8}{\beta^6}}{\biggl( \displaystyle \frac{20}{\beta^4} \biggr)^{\frac{3}{2}}}=\frac{74}{5 \sqrt{5}}=6.61876$
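The numbers in Example 4 can be confirmed with a short computation from the raw moments of $Y=X^2$ (a sketch; the rate parameter cancels out of the skewness, so it is set to 1 here).

```python
# Skewness of Y = X^2 where X is exponential: the raw moments of Y are E(X^2), E(X^4), E(X^6).
beta = 1.0
m1, m2, m3 = 2 / beta**2, 24 / beta**4, 720 / beta**6

variance = m2 - m1**2                                     # 20 / beta^4
print((m3 - 3 * m1 * variance - m1**3) / variance**1.5)   # 6.61876..., i.e. 74/(5*sqrt(5))
```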

Example 5 – Exponentiation
Raising a left skewed distribution to a positive power can produce a distribution that is less left skewed. The use of increasing exponents eventually produces a positively skewed distribution. Let $X$ be the beta random variable with $\alpha=5$ and $\beta=1$. The density function for $X$ is $f(x)=5x^4$ where $0<x<1$. Using the formula shown in Example 2 above, the coefficient of skewness is

$\displaystyle \gamma_1=\frac{2(1-5) \sqrt{5+1+1}}{(5+1+2) \sqrt{5}}=-1.183215957$

We wish to calculate the coefficient of skewness for $X^2$. To do that, it will be helpful to have a formula for the raw moments of $X$. It is easy to verify that:

$\displaystyle E(X^k)=\frac{5}{5+k} \ \ \ \ \ \ \ \ \ k=1,2,3,\cdots$

The first three moments of $Y=X^2$ are:

$\displaystyle E(Y)=E(X^2)=\frac{5}{7}$

$\displaystyle E(Y^2)=E(X^4)=\frac{5}{9}$

$\displaystyle E(Y^3)=E(X^6)=\frac{5}{11}$

Via formula (3), the following is the coefficient of skewness for $Y=X^2$.

$\displaystyle \gamma_1=\frac{\displaystyle \frac{5}{11}-3 \ \frac{5}{7} \ \frac{20}{441}-\frac{125}{7^3}}{\biggl( \displaystyle \frac{20}{441} \biggr)^{\frac{3}{2}}}=\frac{-18}{11 \sqrt{5}}=-0.731804065$

In this example, squaring the beta distribution with skewness -1.1832 produces a negatively skewed distribution with a smaller skew. Let’s raise $X$ to higher powers. The following shows the results:

$\displaystyle X^3 \ \ \ \ \ \ \gamma_1=\frac{-2 \sqrt{11}}{7 \sqrt{5}}=-0.423782771$

$\displaystyle X^4 \ \ \ \ \ \ \gamma_1=\frac{-2 \sqrt{13}}{17 \sqrt{5}}=-0.189700182$

$\displaystyle X^5 \ \ \ \ \ \ \gamma_1=0 \ \ \ \ \text{(in fact } X^5 \text{ is uniform on } (0,1) \text{)}$

Raising the beta distribution with $\alpha=5$ and $\beta=1$ to the fifth power produces the uniform distribution (skewness 0), and raising it to still higher powers produces positively skewed distributions. This is an interesting example, though this observation probably should not be taken as a rule.
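The whole sequence of powers in Example 5 can be generated at once from the raw moments $E(X^j)=5/(5+j)$, as in the following Python sketch (helper names are for illustration only).

```python
# Skewness of X^k for k = 1, ..., 6 when X is beta with alpha = 5 and beta = 1.
def raw_moment(j: int) -> float:
    return 5 / (5 + j)              # E(X^j) for the beta(5, 1) distribution

for k in range(1, 7):
    m1, m2, m3 = raw_moment(k), raw_moment(2 * k), raw_moment(3 * k)
    variance = m2 - m1**2
    skew = (m3 - 3 * m1 * variance - m1**3) / variance**1.5
    print(k, round(skew, 4))
# k = 1, ..., 4 give progressively milder negative skew, k = 5 gives 0
# (the uniform distribution), and k = 6 and beyond give positive skew.
```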

________________________________________________________________________

Counterexamples

All the examples discussed previously are good “textbook” examples in that they help build intuition on how skewness behaves in familiar distributions. However, it is also easy to take the wrong lessons from these examples. The above examples can serve as a good introduction to the topic of skewness, but a caveat is in order: some of the commonly drawn lessons are not appropriate in all circumstances.

As indicated earlier, one wrong lesson from Figure 1 and Figure 2 is that a density curve such as Figure 1 may suggest that “mode < median < mean” for a right skewed distribution and that Figure 2 may suggest that “mean < median < mode” for a left skewed distribution. In both Figure 1 and Figure 2, the mean is further out in the long tail than the median. In certain textbooks, these two observations are even stated as characterizations of right skew and left skew. Such a rule of thumb is easy to state and easy to apply. For some students, such a rule provides a lot of clarity about how skewness should work. For such students, checking for skewness is simply a matter of finding the relative position of the mean and median (e.g. in such thinking, if the mean is greater than the median, then it is a right skew).

Any discussion of skewness should point out that the simple rule described in the above paragraph, though useful in many situations, is imperfect and may not apply outside of certain familiar distributions. For a good discussion on this issue, see this article.

We highlight one example found in the article mentioned above. This example demonstrates a clear violation of the common misconception indicated above. The following is the density function of the example.

$\displaystyle f_p(x)=\left\{\begin{matrix} \displaystyle (1-p)\biggl(1+\frac{1-p}{2p} x \biggr)&\ \ \ \ \ \ \frac{-2p}{1-p} \le x \le 0 \\{\text{ }}& \\ (1-p) e^{-x}&\ \ \ \ \ \ x>0 \end{matrix}\right.$

where $0<p<1$. To facilitate the discussion, let $X_p$ be the random variable whose PDF is $f_p(x)$ defined above. The above density function is a juxtaposition of a triangular density and an exponential density. This triangular-exponential distribution has a positive coefficient of skewness when $p$ is not too large. Yet within this range for $p$, the mean can be made to be on either side of the median. We consider three cases where $p=0.7$, $p=0.6$ and $p=0.9$.

Example 6
First the case $p=0.7$.

$\displaystyle f_{0.7}(x)=\left\{\begin{matrix} \displaystyle 0.3\biggl(1+\frac{3}{14} x \biggr)&\ \ \ \ \ \ \frac{-14}{3} \le x \le 0 \\{\text{ }}& \\ 0.3 e^{-x}&\ \ \ \ \ \ x>0 \end{matrix}\right.$

The following is the graph of the density curve $f_{0.7}(x)$. The right tail is long since the exponential distribution is on the right side. However, the left side is heavier (with 70% of the weight on the triangle on the left side).

Figure 5

The following shows the results for the density function $f_{0.7}(x)$.

$\displaystyle E(X_{0.7})=\frac{-21.3}{27}=-0.7889$

$\displaystyle \text{median of } X_{0.7} = -0.722613478$

$\displaystyle E(X_{0.7}^2)=\frac{254.4}{81}=3.140740741$

$\displaystyle Var(X_{0.7})=\frac{1835.91}{3^6}$

$\displaystyle E(X_{0.7}^3)=\frac{-2152.2}{405}$

$\displaystyle \gamma_1=\frac{111906.63}{5 \cdot 1835.91^{1.5}}=0.284517335$

The calculation confirms the positive skew (0.2845), which is a moderately strong positive skewness. Note that the mean is to the left of the median. Both the mean and median are to the left of the mode (at $x=0$). The right tail in Figure 5 is infinitely long, so this is a positively skewed distribution (as confirmed by the calculation of $\gamma_1$). According to the common notion of how right skew should work, the mean should be further out on the right tail. But this is not the case. The mean is further out on the left side than the median. This violation of the common conception of skewness can occur when one tail is long but the other side is heavier.

Example 7
Now the case $p=0.6$.

$\displaystyle f_{0.6}(x)=\left\{\begin{matrix} \displaystyle 0.4\biggl(1+\frac{1}{3} x \biggr)&\ \ \ \ \ \ -3 \le x \le 0 \\{\text{ }}& \\ 0.4 e^{-x}&\ \ \ \ \ \ x>0 \end{matrix}\right.$

The following is the graph of the density curve $f_{0.6}(x)$. The right tail is long since the exponential distribution is on the right side. The left side is still heavy but a little less so than in the previous example (with 60% of the weight on the triangle on the left side).

Figure 6

The following shows the results for the density function $f_{0.6}(x)$.

$\displaystyle E(X_{0.6})=-0.2$

$\displaystyle \text{median of } X_{0.6} = -0.261387212$

$\displaystyle E(X_{0.6}^2)=1.7$

$\displaystyle Var(X_{0.6})=1.66$

$\displaystyle E(X_{0.6}^3)=0.78$

$\displaystyle \gamma_1=0.834128035$

The density curve $f_{0.6}(x)$ has a stronger positive skew than the previous example as there is a little more weight on the exponential side (the right side). Even though the mean in this case is to the right of the median, both the mean and median are not on the right tail but on the left triangular side (the heavier side). In any case, the mean is definitely not further out on the longer tail (the right tail) as the common rule of thumb would suggest.

Both Example 6 and Example 7 are right skewed distributions that do not conform to the common expectation about right skewed distributions. The following example dispels the notion that a long right tail always goes with a right skew.

Example 8
Here we use $p=0.9$ so that there is still a long right tail but 90% of the weight is on the other side.

$\displaystyle f_{0.9}(x)=\left\{\begin{matrix} \displaystyle 0.1\biggl(1+\frac{1}{18} x \biggr)&\ \ \ \ \ \ -18 \le x \le 0 \\{\text{ }}& \\ 0.1 e^{-x}&\ \ \ \ \ \ x>0 \end{matrix}\right.$

The overall shape of the $f_{0.9}(x)$ is similar to Figure 5 and Figure 6. The following shows the results for the density function $f_{0.9}(x)$.

$\displaystyle E(X_{0.9})=-5.3$

$\displaystyle E(X_{0.9}^2)=48.8$

$\displaystyle Var(X_{0.9})=20.71$

$\displaystyle E(X_{0.9}^3)=-524.28$

$\displaystyle \gamma_1=-0.489285839$

Because there is so little weight on the right tail, the skewness is actually negative (-0.4893). Here we have a right-skewed-looking distribution that is actually skewed to the left!
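To see the three examples at a glance, here is a small Python sketch (again leaning on SciPy; this is just one way to organize the computation) that evaluates the mean and the median of $X_p$ numerically. Running it for $p=0.7$, $p=0.6$ and $p=0.9$ reproduces the means and medians above and makes the mean-median comparison easy to inspect.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def pdf(p, x):
    """Triangular-exponential density f_p."""
    a = 2 * p / (1 - p)
    if -a <= x <= 0:
        return (1 - p) * (1 + x / a)      # triangular piece on [-a, 0]
    if x > 0:
        return (1 - p) * np.exp(-x)       # exponential piece on (0, infinity)
    return 0.0

def mean_median(p):
    a = 2 * p / (1 - p)
    mean = (quad(lambda x: x * pdf(p, x), -a, 0)[0]
            + quad(lambda x: x * pdf(p, x), 0, np.inf)[0])
    def cdf(x):
        if x <= 0:
            return quad(lambda t: pdf(p, t), -a, x)[0]
        return p + quad(lambda t: pdf(p, t), 0, x)[0]   # the triangle carries weight p
    median = brentq(lambda x: cdf(x) - 0.5, -a, 50.0)
    return mean, median

for p in (0.7, 0.6, 0.9):
    m, med = mean_median(p)
    print(f"p={p}: mean={m:.4f}, median={med:.4f}")
```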

________________________________________________________________________

Remarks

Examples 6 through 8 demonstrate that when one tail is long but the other side is heavy, the common conception of right skew and left skew does not apply. The common conception, as discussed earlier, is that both the mean and the median are located in the longer tail and that the mean is further out in the long tail than the median. The article mentioned earlier is easy to read and gives a fuller discussion of the issues in dealing with the notion of skewness. The common conception of skewness can also be easily violated in discrete distributions, especially when the weights on both sides of the median are not equal. All the above examples are unimodal distributions. According to the quoted article, bimodal or multimodal distributions can be problematic too.

Of course, the caveat presented here is not meant to discourage anyone from discussing the common conception about skewness. The common conception about the locations of mode, mean and median conveys useful intuition and we should continue to make use of it. But the common rule of thumb should definitely not be presented as gospel truth as some textbooks have done. Instead, it should be pointed out that the rule is imperfect, and it would be helpful to discuss why the rule is imperfect.

________________________________________________________________________

Practice problems

Practice problems to reinforce the calculation are found in the companion blog to this blog.

________________________________________________________________________
$\copyright \ \text{2015 by Dan Ma}$

# Calculating order statistics using multinomial probabilities

Consider a random sample $X_1,X_2,\cdots,X_n$ drawn from a continuous distribution. Rank the sample items in increasing order, resulting in a ranked sample $Y_1<Y_2<\cdots<Y_n$ where $Y_1$ is the smallest sample item, $Y_2$ is the second smallest sample item and so on. The items in the ranked sample are called the order statistics. Recently the author of this blog was calculating a conditional probability such as $P(Y_2>4 \ | \ Y_2>3)$. One way to do this is to calculate the distribution function $P(Y_2 \le t)$. What about the probability $P(Y_5>4 \ | \ Y_2>3)$? Since this one involves two order statistics, the author of this blog initially thought that calculating $P(Y_5>4 \ | \ Y_2>3)$ would require knowing the joint probability distribution of the order statistics $Y_1,Y_2,\cdots ,Y_n$. It turns out that a joint distribution may not be needed. Instead, we can calculate a conditional probability such as $P(Y_5>4 \ | \ Y_2>3)$ using multinomial probabilities. In this post, we demonstrate how this is done using examples. Practice problems are found here.

The calculation described here can be lengthy and tedious if the sample size is large. To make the calculation more manageable, the examples here have relatively small sample size. To keep the multinomial probabilities easier to calculate, the random samples are drawn from a uniform distribution. The calculation for larger sample sizes from other distributions is better done using a technology solution. In any case, the calculation described here is a great way to practice working with order statistics and multinomial probabilities.

________________________________________________________________________

The multinomial angle

In this post, the order statistics $Y_1<Y_2<\cdots<Y_n$ result from ranking the random sample $X_1,X_2,\cdots,X_n$, which is drawn from a continuous distribution with distribution function $F(x)=P(X \le x)$. For the $j$th order statistic, the calculation often begins with its distribution function $P(Y_j \le t)$.

Here’s the thought process for calculating $P(Y_j \le t)$. In drawing the random sample $X_1,X_2,\cdots,X_n$, make a note of the items $\le t$ and the items $>t$. For the event $Y_j \le t$ to happen, there must be at least $j$ sample items $X_i$ that are $\le t$. For the event $Y_j > t$ to happen, there can be at most $j-1$ sample items $X_i$ that are $\le t$. So to calculate $P(Y_j \le t)$, simply find the probability of having $j$ or more sample items $\le t$. To calculate $P(Y_j > t)$, find the probability of having at most $j-1$ sample items $\le t$.

$\displaystyle P(Y_j \le t)=\sum \limits_{k=j}^n \binom{n}{k} \ \biggl[ F(t) \biggr]^k \ \biggl[1-F(t) \biggr]^{n-k} \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)$

$\displaystyle P(Y_j > t)=\sum \limits_{k=0}^{j-1} \binom{n}{k} \ \biggl[ F(t) \biggr]^k \ \biggl[1-F(t) \biggr]^{n-k} \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)$

Both (1) and (2) involve binomial probabilities and are discussed in this previous post. The probability of success is $F(t)=P(X \le t)$ since we are interested in how many sample items are $\le t$. Both the calculations (1) and (2) are based on counting the number of sample items in the two intervals $\le t$ and $>t$. It turns out that when the desired probability involves more than one $Y_j$, we can also count the number of sample items that fall into appropriate intervals and apply appropriate multinomial probabilities. Let’s use an example to illustrate the idea.
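As a side note, formula (1) is easy to evaluate with software. The following Python snippet (using scipy.stats.binom; the helper name is ours) computes $P(Y_j \le t)$ for a sample of size 10 from the uniform distribution $U(0,4)$, the setting of Example 1 below, with $j=4$ and $t=2$ so that $F(t)=0.5$.

```python
from scipy.stats import binom

def order_stat_cdf(j, n, F_t):
    """P(Y_j <= t) by formula (1): at least j of the n sample items are <= t,
    where F_t = F(t) = P(X <= t) is the probability of success."""
    return sum(binom.pmf(k, n, F_t) for k in range(j, n + 1))

print(order_stat_cdf(4, 10, 0.5))   # 0.828125
```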

Example 1
Draw a random sample $X_1,X_2,\cdots,X_{10}$ from the uniform distribution $U(0,4)$. The resulting order statistics are $Y_1<Y_2<\cdots<Y_{10}$. Find the following probabilities:

• $P(Y_4<2<Y_5<Y_6<3<Y_7)$
• $P(Y_4<2<Y_6<3<Y_7)$

For both probabilities, the range of the distribution is broken up into 3 intervals, (0, 2), (2, 3) and (3, 4). Each sample item has probabilities $\frac{2}{4}$, $\frac{1}{4}$, $\frac{1}{4}$ of falling into these intervals, respectively. Multinomial probabilities are calculated on these 3 intervals. It is a matter of counting the numbers of sample items falling into each interval.

The first probability involves the event that there are 4 sample items in the interval (0, 2), 2 sample items in the interval (2, 3) and 4 sample items in the interval (3, 4). Thus the first probability is the following multinomial probability:

\displaystyle \begin{aligned} P(Y_4<2<Y_5<Y_6<3<Y_7)&=\frac{10!}{4! \ 2! \ 4!} \ \biggl[\frac{2}{4} \biggr]^{4} \ \biggl[\frac{1}{4} \biggr]^{2} \ \biggl[\frac{1}{4} \biggr]^{4} \\&=\frac{3150}{65536}=0.04807 \end{aligned}

For the second probability, $Y_5$ does not have to be greater than 2. Thus there could be 5 sample items less than 2. So we need to add one more case to the above probability (5 sample items to the first interval, 1 sample item to the second interval and 4 sample items to the third interval).

\displaystyle \begin{aligned} P(Y_4<2<Y_6<3<Y_7)&=\frac{10!}{4! \ 2! \ 4!} \ \biggl[\frac{2}{4} \biggr]^{4} \ \biggl[\frac{1}{4} \biggr]^{2} \ \biggl[\frac{1}{4} \biggr]^{4}+\frac{10!}{5! \ 1! \ 4!} \ \biggl[\frac{2}{4} \biggr]^{5} \ \biggl[\frac{1}{4} \biggr]^{1} \ \biggl[\frac{1}{4} \biggr]^{4} \\&=\frac{3150}{65536}+\frac{2520}{65536} \\&=\frac{5670}{65536}=0.08652 \end{aligned}
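Both answers can be checked with a quick simulation. The following Python sketch (a Monte Carlo check, not part of the original calculation) draws many samples of size 10 from $U(0,4)$, sorts each sample to get the order statistics, and estimates the two probabilities by counting.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.uniform(0, 4, size=(1_000_000, 10))
y = np.sort(samples, axis=1)      # each row holds the order statistics Y_1,...,Y_10

# Y_4 < 2 < Y_5 < Y_6 < 3 < Y_7  (0-based columns: Y_j is column j-1)
event1 = (y[:, 3] < 2) & (2 < y[:, 4]) & (y[:, 5] < 3) & (3 < y[:, 6])
# Y_4 < 2 < Y_6 < 3 < Y_7
event2 = (y[:, 3] < 2) & (2 < y[:, 5]) & (y[:, 5] < 3) & (3 < y[:, 6])

print(event1.mean())   # should be close to 3150/65536 = 0.0481
print(event2.mean())   # should be close to 5670/65536 = 0.0865
```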

Example 2
Draw a random sample $X_1,X_2,\cdots,X_6$ from the uniform distribution $U(0,4)$. The resulting order statistics are $Y_1<Y_2<\cdots<Y_6$. Find the probability $P(1<Y_2<Y_4<3)$.

In this problem the range of the distribution is broken up into 3 intervals (0, 1), (1, 3) and (3, 4). Each sample item has probabilities $\frac{1}{4}$, $\frac{2}{4}$, $\frac{1}{4}$ of falling into these intervals, respectively. Multinomial probabilities are calculated on these 3 intervals. It is a matter of counting the numbers of sample items falling into each interval. The counting is a little bit more involved here than in the previous example.

The example is to find the probability that both the second order statistic $Y_2$ and the fourth order statistic $Y_4$ fall into the interval $(1,3)$. To solve this, determine how many sample items fall into each of the intervals $(0,1)$, $(1,3)$ and $(3,4)$. The following points detail the counting.

• For the event $1<Y_2$ to happen, there can be at most 1 sample item in the interval $(0,1)$.
• For the event $Y_4<3$ to happen, there must be at least 4 sample items in the interval $(0,3)$. Thus if the interval $(0,1)$ has 1 sample item, the interval $(1,3)$ has at least 3 sample items. If the interval $(0,1)$ has no sample item, the interval $(1,3)$ has at least 4 sample items.

The following lists out all the cases that satisfy the above two bullet points. The notation $[a, b, c]$ means that $a$ sample items fall into $(0,1)$, $b$ sample items fall into the interval $(1,3)$ and $c$ sample items fall into the interval $(3,4)$. So $a+b+c=6$. Since the sample items are drawn from $U(0,4)$, the probabilities of a sample item falling into intervals $(0,1)$, $(1,3)$ and $(3,4)$ are $\frac{1}{4}$, $\frac{2}{4}$ and $\frac{1}{4}$, respectively.

[0, 4, 2]
[0, 5, 1]
[0, 6, 0]
[1, 3, 2]
[1, 4, 1]
[1, 5, 0]

Each case $[a, b, c]$ contributes the multinomial probability $\displaystyle \frac{6!}{a! \ b! \ c!} \ \biggl[\frac{1}{4} \biggr]^a \ \biggl[\frac{2}{4} \biggr]^b \ \biggl[\frac{1}{4} \biggr]^c$. Summing over the six cases gives the desired probability.

\displaystyle \begin{aligned} P(1<Y_2<Y_4<3)&=\frac{6!}{0! \ 4! \ 2!} \ \biggl[\frac{1}{4} \biggr]^0 \ \biggl[\frac{2}{4} \biggr]^4 \ \biggl[\frac{1}{4} \biggr]^2+\frac{6!}{0! \ 5! \ 1!} \ \biggl[\frac{1}{4} \biggr]^0 \ \biggl[\frac{2}{4} \biggr]^5 \ \biggl[\frac{1}{4} \biggr]^1 \\& \ \ \ \ +\frac{6!}{0! \ 6! \ 0!} \ \biggl[\frac{1}{4} \biggr]^0 \ \biggl[\frac{2}{4} \biggr]^6 \ \biggl[\frac{1}{4} \biggr]^0+\frac{6!}{1! \ 3! \ 2!} \ \biggl[\frac{1}{4} \biggr]^1 \ \biggl[\frac{2}{4} \biggr]^3 \ \biggl[\frac{1}{4} \biggr]^2 \\& \ \ \ \ +\frac{6!}{1! \ 4! \ 1!} \ \biggl[\frac{1}{4} \biggr]^1 \ \biggl[\frac{2}{4} \biggr]^4 \ \biggl[\frac{1}{4} \biggr]^1+\frac{6!}{1! \ 5! \ 0!} \ \biggl[\frac{1}{4} \biggr]^1 \ \biggl[\frac{2}{4} \biggr]^5 \ \biggl[\frac{1}{4} \biggr]^0 \\&=\frac{240+192+64+480+480+192}{4096}=\frac{1648}{4096}=0.4023 \end{aligned}

So in randomly drawing 6 items from the uniform distribution $U(0,4)$, there is a 40% chance that the second order statistic and the fourth order statistic are between 1 and 3.
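The same answer can be obtained by handing the six cases to a multinomial probability function. Here is a minimal Python sketch using scipy.stats.multinomial (the variable names are ours).

```python
from scipy.stats import multinomial

# Probabilities of one U(0,4) draw landing in (0,1), (1,3), (3,4)
probs = [1/4, 2/4, 1/4]
cases = [(0, 4, 2), (0, 5, 1), (0, 6, 0), (1, 3, 2), (1, 4, 1), (1, 5, 0)]
total = sum(multinomial.pmf(c, n=6, p=probs) for c in cases)
print(total)   # 1648/4096 = 0.40234375
```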

________________________________________________________________________

More examples

The method described by the above examples is this. Based on the event described by the probability problem, the entire range of the distribution is broken up into several intervals. Imagine the sample items $X_i$ being randomly thrown into these intervals (in these examples we are sampling from a uniform distribution, so the probability of landing in an interval is proportional to its length). Then multinomial probabilities are calculated to account for all the different ways the sample items can land in these intervals. The following examples further illustrate this idea.

Example 3
Draw a random sample $X_1,X_2,\cdots,X_7$ from the uniform distribution $U(0,5)$. The resulting order statistics are $Y_1<Y_2<\cdots<Y_7$. Find the following probabilities:

• $P(1<Y_1<3<Y_4<4)$
• $P(3<Y_4<4 \ | \ 1<Y_1<3)$

The range is broken up into the intervals (0, 1), (1, 3), (3, 4) and (4, 5). The sample items fall into these intervals with probabilities $\frac{1}{5}$, $\frac{2}{5}$, $\frac{1}{5}$ and $\frac{1}{5}$. The following details the counting for the event $1<Y_1<3<Y_4<4$:

• There are no sample items in the interval (0, 1) since $1<Y_1$.
• Based on $Y_1<3$ and $3<Y_4$, there is at least one sample item and there are at most 3 sample items in the interval (0, 3). Thus the interval (1, 3) contains at least one and at most 3 sample items, since there are none in (0, 1).
• Based on $Y_4<4$, there are at least 4 sample items in the interval (0, 4). Thus the count in (3, 4) combined with the count in (1, 3) must be at least 4.
• The interval (4, 5) simply receives the leftover sample items not in the previous intervals.

The notation $[a, b, c, d]$ lists out the counts in the 4 intervals. The following lists out all the cases described by the above bullet points along with the corresponding multinomial probabilities, with two of the probabilities set up in full.

$\displaystyle [0, 1, 3, 3] \ \ \ \ \ \ \frac{280}{78125}=\frac{7!}{0! \ 1! \ 3! \ 3!} \ \biggl[\frac{1}{5} \biggr]^0 \ \biggl[\frac{2}{5} \biggr]^1 \ \biggl[\frac{1}{5} \biggr]^3 \ \biggl[\frac{1}{5} \biggr]^3$

$\displaystyle [0, 1, 4, 2] \ \ \ \ \ \ \frac{210}{78125}$

$\displaystyle [0, 1, 5, 1] \ \ \ \ \ \ \frac{84}{78125}$

$\displaystyle [0, 1, 6, 0] \ \ \ \ \ \ \frac{14}{78125}$

$\displaystyle [0, 2, 2, 3] \ \ \ \ \ \ \frac{840}{78125}$

$\displaystyle [0, 2, 3, 2] \ \ \ \ \ \ \frac{840}{78125}$

$\displaystyle [0, 2, 4, 1] \ \ \ \ \ \ \frac{420}{78125}$

$\displaystyle [0, 2, 5, 0] \ \ \ \ \ \ \frac{84}{78125}$

$\displaystyle [0, 3, 1, 3] \ \ \ \ \ \ \frac{1120}{78125}=\frac{7!}{0! \ 3! \ 1! \ 3!} \ \biggl[\frac{1}{5} \biggr]^0 \ \biggl[\frac{2}{5} \biggr]^3 \ \biggl[\frac{1}{5} \biggr]^1 \ \biggl[\frac{1}{5} \biggr]^3$

$\displaystyle [0, 3, 2, 2] \ \ \ \ \ \ \frac{1680}{78125}$

$\displaystyle [0, 3, 3, 1] \ \ \ \ \ \ \frac{1120}{78125}$

$\displaystyle [0, 3, 4, 0] \ \ \ \ \ \ \frac{280}{78125}$

Summing all the probabilities, $\displaystyle P(1<Y_1<3<Y_4<4)=\frac{6972}{78125}=0.0892$. Out of the 78125 many different ways the 7 sample items can land in these 4 intervals, 6972 of them satisfy the event $1<Y_1<3<Y_4<4$.

++++++++++++++++++++++++++++++++++

We now calculate the second probability in Example 3.

$\displaystyle P(3<Y_4<4 \ | \ 1<Y_1<3)=\frac{P(1<Y_1<3<Y_4<4)}{P(1<Y_1<3)}$

First calculate $P(1<Y_1<3)$. The probability $P(Y_1<t)$ is the probability of having at least 1 sample item less than $t$, which is the complement of the probability of all sample items being greater than $t$.

\displaystyle \begin{aligned} P(1<Y_1<3)&=P(Y_1<3)-P(Y_1<1) \\&=\biggl[1-\biggl(\frac{2}{5}\biggr)^7 \biggr]-\biggl[1-\biggl(\frac{4}{5}\biggr)^7 \biggr] \\&=\frac{16384}{78125}-\frac{128}{78125} \\&=\frac{16256}{78125} \end{aligned}

The event $1<Y_1<3$ can occur in 16256 ways. Out of these many ways, 6972 of them satisfy the event $1<Y_1<3<Y_4<4$. Thus we have:

$\displaystyle P(3<Y_4<4 \ | \ 1<Y_1<3)=\frac{6972}{16256}=0.4289$

Example 4
Draw a random sample $X_1,X_2,X_3,X_4,X_5$ from the uniform distribution $U(0,5)$. The resulting order statistics are $Y_1<Y_2<Y_3<Y_4<Y_5$. Consider the conditional random variable $Y_4 \ | \ Y_2 >3$. For this conditional distribution, find the following:

• $P( Y_4 \le t \ | \ Y_2 >3)$
• $f_{Y_4}(t \ | \ Y_2 >3)$
• $E(Y_4 \ | \ Y_2 >3)$

where $3<t<5$. Note that $f_{Y_4}(t \ | \ Y_2 >3)$ is the density function of $Y_4 \ | \ Y_2 >3$.

Note that

$\displaystyle P( Y_4 \le t \ | \ Y_2 >3)=\frac{P(3<Y_2<Y_4 \le t)}{P(Y_2 >3)}$

In finding $P(3<Y_2<Y_4 \le t)$, the range (0, 5) is broken up into 3 intervals (0, 3), (3, t) and (t, 5). The sample items fall into these intervals with probabilities $\frac{3}{5}$, $\frac{t-3}{5}$ and $\frac{5-t}{5}$.

Since $Y_2 >3$, there is at most 1 sample item in the interval (0, 3). Since $Y_4 \le t$, there are at least 4 sample items in the interval (0, t). So the count in the interval (3, t) and the count in (0, 3) should add up to 4 or more items. The following shows all the cases for the event $3<Y_2<Y_4 \le t$ along with the corresponding multinomial probabilities.

$\displaystyle [0, 4, 1] \ \ \ \ \ \ \frac{5!}{0! \ 4! \ 1!} \ \biggl[\frac{3}{5} \biggr]^0 \ \biggl[\frac{t-3}{5} \biggr]^4 \ \biggl[\frac{5-t}{5} \biggr]^1$

$\displaystyle [0, 5, 0] \ \ \ \ \ \ \frac{5!}{0! \ 5! \ 0!} \ \biggl[\frac{3}{5} \biggr]^0 \ \biggl[\frac{t-3}{5} \biggr]^5 \ \biggl[\frac{5-t}{5} \biggr]^0$

$\displaystyle [1, 3, 1] \ \ \ \ \ \ \frac{5!}{1! \ 3! \ 1!} \ \biggl[\frac{3}{5} \biggr]^1 \ \biggl[\frac{t-3}{5} \biggr]^3 \ \biggl[\frac{5-t}{5} \biggr]^1$

$\displaystyle [1, 4, 0] \ \ \ \ \ \ \frac{5!}{1! \ 4! \ 0!} \ \biggl[\frac{3}{5} \biggr]^1 \ \biggl[\frac{t-3}{5} \biggr]^4 \ \biggl[\frac{5-t}{5} \biggr]^0$

After carrying out the algebra and simplifying, we have the following:

$\displaystyle P(3<Y_2<Y_4 \le t)=\frac{-4t^5+25t^4+180t^3-1890t^2+5400t-5103}{3125}$

For the event $Y_2 >3$ to happen, there is at most 1 sample item less than 3. So we have:

$\displaystyle P(Y_2 >3)=\binom{5}{0} \ \biggl[\frac{3}{5} \biggr]^0 \ \biggl[\frac{2}{5} \biggr]^5 +\binom{5}{1} \ \biggl[\frac{3}{5} \biggr]^1 \ \biggl[\frac{2}{5} \biggr]^4=\frac{272}{3125}$

$\displaystyle P( Y_4 \le t \ | \ Y_2 >3)=\frac{-4t^5+25t^4+180t^3-1890t^2+5400t-5103}{272}$

Then the conditional density is obtained by differentiating $P( Y_4 \le t \ | \ Y_2 >3)$.

$\displaystyle f_{Y_4}(t \ | \ Y_2 >3)=\frac{-20t^4+100t^3+540t^2-3780t+5400}{272}$

The following gives the conditional mean $E(Y_4 \ | \ Y_2 >3)$.

\displaystyle \begin{aligned} E(Y_4 \ | \ Y_2 >3)&=\frac{1}{272} \ \int_3^5 t(-20t^4+100t^3+540t^2-3780t+5400) \ dt \\&=\frac{215}{51}=4.216 \end{aligned}

To contrast, the following gives the information on the unconditional distribution of $Y_4$.

$\displaystyle f_{Y_4}(t)=\frac{5!}{3! \ 1! \ 1!} \ \biggl[\frac{t}{5} \biggr]^3 \ \biggl[\frac{1}{5} \biggr] \ \biggl[ \frac{5-t}{5} \biggr]^1=\frac{20}{3125} \ (5t^3-t^4)$

$\displaystyle E(Y_4)=\frac{20}{3125} \ \int_0^5 t(5t^3-t^4) \ dt=\frac{10}{3}=3.33$

The unconditional mean of $Y_4$ is about 3.33. With the additional information that $Y_2 >3$, the average of $Y_4$ is now about 4.22. So a higher value of $Y_2$ pulls up the mean of $Y_4$.
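A short simulation gives a quick sanity check of these two means. The following Python sketch (a Monte Carlo check under the same $U(0,5)$ setup) estimates both the conditional mean $E(Y_4 \ | \ Y_2>3)$ and the unconditional mean $E(Y_4)$.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.uniform(0, 5, size=(1_000_000, 5))
y = np.sort(samples, axis=1)            # order statistics Y_1,...,Y_5 per row

keep = y[:, 1] > 3                      # condition on Y_2 > 3
print(y[keep, 3].mean())                # conditional mean of Y_4, about 215/51 = 4.216
print(y[:, 3].mean())                   # unconditional mean of Y_4, about 10/3 = 3.33
```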

________________________________________________________________________

Practice problems

Practice problems to reinforce the calculation are found in the problem blog, a companion blog to this blog.

________________________________________________________________________
$\copyright \ \text{2015 by Dan Ma}$

# Confidence intervals for San Francisco rainfall

When estimating population percentiles, there is a way to do it that is distribution free. Draw a random sample from the population of interest and take the middle element in the random sample as an estimate of the population median. Furthermore, we can even attach a confidence interval to this estimate of median without knowing (or assuming) a probability distribution of the underlying phenomenon. This “distribution free” method is shown in the post called Confidence intervals for percentiles. In this post, we give an additional example using annual rainfall data in San Francisco to illustrate this approach of non-parametric inference using order statistics.

________________________________________________________________________

San Francisco rainfall data

The following table shows the annual rainfall data in San Francisco (in inches) from 1960-2013 (data source). The table consists of 54 measurements and is sorted in increasing order from left to right (and from top to bottom). Each annual rainfall measurement is from July of that year to June of the following year. The driest year (7.97 inches) is 1975, the period from July 1975 to June 1976. The wettest year (47.22 inches) is 1997, which is the period from July 1997 to June 1998. The most recent data point is the fifth measurement 12.54 inches (the period from July 2013 to June 2014).

$\displaystyle \begin{bmatrix} 7.97&\text{ }&11.06&\text{ } &11.06&\text{ }&12.32&\text{ }&12.54 \\ 13.86&\text{ }&13.87&\text{ } &14.08&\text{ }&14.32&\text{ }&14.46 \\ 15.22&\text{ }&15.39&\text{ } &15.64&\text{ }&16.33&\text{ }&16.61 \\ 16.89&\text{ }&17.43&\text{ } &17.50&\text{ }&17.65&\text{ }&17.74 \\ 18.11&\text{ }&18.26&\text{ } &18.74&\text{ }&18.79&\text{ }&19.20 \\ 19.47&\text{ }&20.01&\text{ } &20.54&\text{ }&20.80&\text{ }&22.15 \\ 22.29&\text{ }&22.47&\text{ } &22.63&\text{ }&23.49&\text{ }&23.87 \\ 24.09&\text{ }&24.49&\text{ } &24.89&\text{ }&24.89&\text{ }&25.03 \\ 25.09&\text{ }&26.66&\text{ } &26.87&\text{ }&27.76&\text{ }&28.68 \\ 28.87&\text{ }&29.41&\text{ }&31.87&\text{ } &34.02&\text{ }&34.36 \\ 34.43&\text{ }&37.10&\text{ }&38.17&\text{ } &47.22&\text{ }&\text{ } \end{bmatrix}$

Using the above data, estimate the median, the lower quartile (25th percentile) and the upper quartile (the 75th percentile) of the annual rainfall in San Francisco. Then find a reasonably good confidence interval for each of the three population percentiles.

________________________________________________________________________

Let’s recall some basic facts about order statistics discussed in previous posts.

Let’s say we have a random sample $X_1,X_2,\cdots,X_n$ drawn from a population whose percentiles are unknown and we wish to estimate them. Rank the items of the random sample to obtain the order statistics $Y_1<Y_2<\cdots<Y_n$. In an ideal setting, the measurements are supposed to arise from a continuous distribution. So the chance of a tie among the $Y_j$ is zero. But this assumption may not hold on occasion. There are some ties in the San Francisco rainfall data (e.g. the second and third data points). The small number of ties will not affect the calculation performed below.

The reason that we can use the order statistics $Y_j$ to estimate the population percentiles is that the expected percentage of the population below $Y_j$ is about the same as the percentage of the sample items less than $Y_j$. According to the explanation in one of the previous posts mentioned above, the order statistic $Y_j$ is expected to be above $100p$ percent of the population where $p=\frac{j}{n+1}$. In fact, the order statistics $Y_1<Y_2<\cdots<Y_n$ are expected to divide the population into roughly equal segments. More specifically, the expected percentage of the population in between $Y_{j-1}$ and $Y_j$ is $100h$ where $h=\frac{1}{n+1}$.

The above explanation justifies the use of the order statistic $Y_j$ as the sample $100p$th percentile where $p=\frac{j}{n+1}$.

The sample size is $n=$ 54 in the San Francisco rainfall data. Thus the order statistic $Y_{11}$ is the sample 20th percentile and can be taken as an estimate of the population 20th percentile for the San Francisco annual rainfall. Here the realized value of $Y_{11}$ is 15.22.

With $\frac{45}{54+1}=0.818$, the order statistic $Y_{45}$ is the sample 82nd percentile and is taken as an estimate of the population 82nd percentile for the San Francisco annual rainfall. The realized value of $Y_{45}$ is 28.68 inches.

The key for constructing a confidence interval for percentiles is to calculate the probability $P(Y_i < \tau_p < Y_j)$. This is the probability that the $100p$th percentile, where $0<p<1$, is in between $Y_i$ and $Y_j$. Let's look at the median $\tau_{0.5}$. For $Y_i<\tau_{0.5}$ to happen, there must be at least $i$ many sample items less than the median $\tau_{0.5}$. For $\tau_{0.5}<Y_j$ to happen, there can be at most $j-1$ many sample items less than the median $\tau_{0.5}$. Thus in the random draws of the sample items, in order for the event $Y_i < \tau_{0.5} < Y_j$ to occur, there must be at least $i$ sample items and at most $j-1$ sample items that are less than $\tau_{0.5}$. In other words, in $n$ Bernoulli trials, there are at least $i$ and at most $j-1$ successes where the probability of success is $P(X<\tau_{0.5})=$ 0.5. The following is the probability $P(Y_i < \tau_{0.5} < Y_j)$:

$\displaystyle P(Y_i < \tau_{0.5} < Y_j)=\sum \limits_{k=i}^{j-1} \binom{n}{k} \ 0.5^k \ 0.5^{n-k}=1 - \alpha$

Then the interval $Y_i < \tau_{0.5} < Y_j$ is taken to be the $100(1-\alpha)$% confidence interval for the unknown population median $\tau_{0.5}$. Note that this confidence interval is constructed without knowing (or assuming) anything about the underlying distribution of the population.

Consider the $100p$th percentile where $0<p<1$. In order for the event $Y_i < \tau_{p} < Y_j$ to occur, there must be at least $i$ sample items and at most $j-1$ sample items that are less than $\tau_{p}$. This is equivalent to $n$ Bernoulli trials resulting in at least $i$ successes and at most $j-1$ successes where the probability of success is $P(X<\tau_{p})=p$.

$\displaystyle P(Y_i < \tau_{p} < Y_j)=\sum \limits_{k=i}^{j-1} \binom{n}{k} \ p^k \ (1-p)^{n-k}=1 - \alpha$

Then the interval $Y_i < \tau_{p} < Y_j$ is taken to be the $100(1-\alpha)$% confidence interval for the unknown population percentile $\tau_{p}$. As mentioned earlier, this confidence interval does not need to rely on any information about the distribution of the population and is said to be distribution free. It only relies on a probability statement that involves the binomial distribution in describing the positioning of the sample items. In the past, people used the normal approximation to the binomial to estimate this probability. The normal approximation is no longer needed as computing software is now easily available. For example, binomial probabilities can be computed in Excel for a number of trials of a million or more.
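For instance, the following Python snippet (using scipy.stats.binom; any binomial calculator works equally well) evaluates this probability for two of the intervals used below with the San Francisco rainfall data, where $n=54$.

```python
from scipy.stats import binom

def coverage(i, j, n, p):
    """P(Y_i < tau_p < Y_j): at least i and at most j-1 sample items below tau_p."""
    return binom.cdf(j - 1, n, p) - binom.cdf(i - 1, n, p)

print(coverage(20, 35, 54, 0.50))   # about 0.9598 for the median interval (Y_20, Y_35)
print(coverage(6, 21, 54, 0.25))    # about 0.9799 for the lower quartile interval (Y_6, Y_21)
```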

________________________________________________________________________

Percentiles of annual rainfall

Using the above data, estimate the median, the lower quartile (25th percentile) and the upper quartile (the 75th percentile) of the annual rainfall in San Francisco. Then find a reasonably good confidence interval for each of the three population percentiles.

The sample size is $n=$ 54. The middle two data elements in the sample are $y_{27}=20.01$ and $y_{28}=20.54$. They are realizations of the order statistics $Y_{27}$ and $Y_{28}$. So in this example, $\frac{27}{54+1}=0.49$ and $\frac{28}{54+1}=0.509$. Thus the order statistic $Y_{27}$ is expected to be greater than about 49% of the population and $Y_{28}$ is expected to be greater than about 51% of the population. So neither $Y_{27}$ nor $Y_{28}$ is an exact fit. So we take the average of the two as an estimate of the population median:

$\displaystyle \hat{\tau}_{0.5}=\frac{20.01+20.54}{2}=20.275$

Looking for confidence intervals, we consider the intervals $(Y_{21},Y_{34})$, $(Y_{20},Y_{35})$, $(Y_{19},Y_{36})$ and $(Y_{18},Y_{37})$. The following shows the confidence levels.

$\displaystyle P(Y_{21} < \tau_{0.5} < Y_{34})=\sum \limits_{k=21}^{33} \binom{54}{k} \ 0.5^k \ (0.5)^{54-k}=0.924095271$

$\displaystyle P(Y_{20} < \tau_{0.5} < Y_{35})=\sum \limits_{k=20}^{34} \binom{54}{k} \ 0.5^k \ (0.5)^{54-k}=0.959776436$

$\displaystyle P(Y_{19} < \tau_{0.5} < Y_{36})=\sum \limits_{k=19}^{35} \binom{54}{k} \ 0.5^k \ (0.5)^{54-k}=0.980165673$

$\displaystyle P(Y_{18} < \tau_{0.5} < Y_{37})=\sum \limits_{k=18}^{36} \binom{54}{k} \ 0.5^k \ (0.5)^{54-k}=0.99092666$

The above calculation is done in Excel. The binomial probabilities are done using the function BINOM.DIST. So we have the following confidence intervals for the median annual San Francisco rainfall in inches.

Median

$\displaystyle \hat{\tau}_{0.5}=\frac{20.01+20.54}{2}=20.275$

$(Y_{21},Y_{34})$ = (18.11, 23.49) with approximately 92% confidence

$(Y_{20},Y_{35})$ = (17.74, 23.87) with approximately 96% confidence

$(Y_{19},Y_{36})$ = (17.65, 24.09) with approximately 98% confidence

$(Y_{18},Y_{37})$ = (17.50, 24.49) with approximately 99% confidence

For the lower quartile and upper quartile, the following are the results. The reader is invited to confirm the calculation.

Lower quartile

$\displaystyle \hat{\tau}_{0.25}=15.985$, average of $Y_{13}$ and $Y_{14}$

$(Y_{7},Y_{20})$ = (13.87, 17.74) with approximately 96% confidence

$(Y_{6},Y_{21})$ = (13.86, 18.11) with approximately 98% confidence

$(Y_{5},Y_{22})$ = (12.54, 18.26) with approximately 99% confidence

Upper quartile

$\displaystyle \hat{\tau}_{0.75}=25.875$, average of $Y_{41}$ and $Y_{42}$

$(Y_{36},Y_{47})$ = (24.09, 29.41) with approximately 91% confidence

$(Y_{35},Y_{48})$ = (23.87, 31.87) with approximately 96% confidence

$(Y_{34},Y_{49})$ = (23.49, 34.02) with approximately 98% confidence

The following shows the calculation for two of the confidence intervals, one for $\tau_{0.25}$ and one for $\tau_{0.75}$.

$\displaystyle P(Y_{6} < \tau_{0.25} < Y_{21})=\sum \limits_{k=6}^{20} \binom{54}{k} \ 0.25^k \ (0.75)^{54-k}=0.979889918$

$\displaystyle P(Y_{34} < \tau_{0.75} < Y_{49})=\sum \limits_{k=34}^{48} \binom{54}{k} \ 0.75^k \ (0.25)^{54-k}=0.979889918$

________________________________________________________________________
$\copyright \ \text{2015 by Dan Ma}$

# Defining the Poisson distribution

The Poisson distribution is a family of discrete distributions with positive probabilities on the non-negative integers $0,1,2,\cdots$. Each distribution in this family is indexed by a positive number $\lambda>0$. One way to define this distribution is to give its probability function given the parameter $\lambda$ and then derive various distributional quantities such as mean and variance. Along with other mathematical facts, it can be shown that both the mean and the variance are $\lambda$. In this post, we take a different tack. We look at two view points that give rise to the Poisson distribution. Taking this approach will make it easier to appreciate some of the possible applications of the Poisson distribution. The first view point is that the Poisson distribution is the limiting case of the binomial distribution. The second view point is through the Poisson process, a stochastic process that, under some conditions, counts the number of events and the time points at which these events occur in a given time (or physical) interval.

________________________________________________________________________

Poisson as a limiting case of binomial

A binomial distribution where the number of trials $n$ is large and the probability of success $p$ is small such that $np$ is moderate in size can be approximated using the Poisson distribution with mean $\lambda=np$. This fact follows from Theorem 1, which indicates that the Poisson distribution is the limiting case of the binomial distribution.

Theorem 1
Let $\lambda$ be a fixed positive constant. Then for each integer $x=0,1,2,\cdots$, the following is true:

$\displaystyle \lim_{n \rightarrow \infty} \binom{n}{x} \ p^x \ (1-p)^{n-x}=\lim_{n \rightarrow \infty} \frac{n!}{x! \ (n-x)!} \ p^x \ (1-p)^{n-x}=\frac{e^{-\lambda} \ \lambda^x}{x!}$

where $\displaystyle p=\frac{\lambda}{n}$.

Proof of Theorem 1
We start with a binomial distribution with $n$ trials and with $\displaystyle p=\frac{\lambda}{n}$ being the probability of success, where $n>\lambda$. Let $X_n$ be the count of the number of successes in these $n$ Bernoulli trials. The following is the probability that $X_n=k$.

\displaystyle \begin{aligned} P(X_n=k)&=\binom{n}{k} \biggl(\frac{\lambda}{n}\biggr)^k \biggr(1-\frac{\lambda}{n}\biggr)^{n-k} \\&=\frac{n!}{k! (n-k)!} \biggl(\frac{\lambda}{n}\biggr)^k \biggr(1-\frac{\lambda}{n}\biggr)^{n-k} \\&=\frac{n(n-1)(n-2) \cdots (n-k+1)}{n^k} \biggl(\frac{\lambda^k}{k!}\biggr) \biggr(1-\frac{\lambda}{n}\biggr)^{n} \biggr(1-\frac{\lambda}{n}\biggr)^{-k} \\&=\biggl(\frac{\lambda^k}{k!}\biggr) \ \biggl[ \frac{n(n-1)(n-2) \cdots (n-k+1)}{n^k} \ \biggr(1-\frac{\lambda}{n}\biggr)^{n} \ \biggr(1-\frac{\lambda}{n}\biggr)^{-k} \biggr] \end{aligned}

In the last step, the terms that contain $n$ are inside the square brackets. Let’s see what they are when $n$ approaches infinity.

$\displaystyle \lim \limits_{n \rightarrow \infty} \ \frac{n(n-1)(n-2) \cdots (n-k+1)}{n^k}=1$

$\displaystyle \lim \limits_{n \rightarrow \infty} \biggr(1-\frac{\lambda}{n}\biggr)^{n}=e^{-\lambda}$

$\displaystyle \lim \limits_{n \rightarrow \infty} \biggr(1-\frac{\lambda}{n}\biggr)^{-k}=1$

The reason that the first result is true is that the numerator is a polynomial where the leading term is $n^k$. Upon dividing by $n^k$ and taking the limit, we get 1. The second result is true since the following limit is one of the definitions of the exponential function $e^x$.

$\displaystyle \lim \limits_{n \rightarrow \infty} \biggr(1+\frac{x}{n}\biggr)^{n}=e^{x}$

The third result is true since the exponent $-k$ is a constant. Thus the following is the limit of the probability $P(X_n=k)$ as $n \rightarrow \infty$.

\displaystyle \begin{aligned} \lim \limits_{n \rightarrow \infty} P(X_n=k)&= \biggl(\frac{\lambda^k}{k!}\biggr) \ \lim \limits_{n \rightarrow \infty} \biggl[ \frac{n(n-1)(n-2) \cdots (n-k+1)}{n^k} \ \biggr(1-\frac{\lambda}{n}\biggr)^{n} \ \biggr(1-\frac{\lambda}{n}\biggr)^{-k} \biggr] \\&=\biggl(\frac{\lambda^k}{k!}\biggr) \cdot 1 \cdot e^{-\lambda} \cdot 1 \\&=\frac{e^{-\lambda} \lambda^k}{k!} \end{aligned}

The above derivation completes the proof. $\blacksquare$

In a given binomial distribution, whenever the number of trials $n$ is large and the probability $p$ of success in each trial is small (i.e. each of the Bernoulli trial rarely results in a success), Theorem 1 tells us that we can use the Poisson distribution with parameter $\lambda=np$ to estimate the binomial distribution.

Example 1
The probability of being dealt a full house in a hand of poker is approximately 0.001441. Out of 5000 hands of poker that are dealt at a certain casino, what is the probability that there will be at most 4 full houses?

Let $X$ be the number of full houses in these 5000 poker hands. The exact distribution for $X$ is the binomial distribution with $n=$ 5000 and $p=$ 0.001441. This example deals with a large number of trials where each trial is a rare event. So the Poisson estimation is applicable. Let $\lambda=$ 5000(0.001441) = 7.205. Then $P(X \le 4)$ can be approximated by the Poisson random variable $Y$ with parameter $\lambda$. The following is the probability function of $Y$:

$\displaystyle P(Y=y)=e^{-7.205} \ \frac{7.205^y}{y!}$

The following is the approximation of $P(X \le 4)$:

\displaystyle \begin{aligned} P(X \le 4)&\approx P(Y \le 4) \\&=P(Y=0)+P(Y=1)+P(Y=2)+P(Y=3)+P(Y=4) \\&= e^{-7.205} \biggl[ 1+7.205+\frac{7.205^2}{2!}+\frac{7.205^3}{3!}+\frac{7.205^4}{4!}\biggr] \\&=0.155098087 \end{aligned}

The following is a side by side comparison between the binomial distribution and its Poisson approximation. For all practical purposes, they are indistinguishable from one another.

$\displaystyle \begin{bmatrix} \text{Count of}&\text{ }&\text{ }&\text{Binomial } &\text{ }&\text{ }&\text{Poisson } \\\text{Full Houses}&\text{ }&\text{ }&P(X \le x) &\text{ }&\text{ }&P(Y \le x) \\\text{ }&\text{ }&\text{ }&n=5000 &\text{ }&\text{ }&\lambda=7.205 \\\text{ }&\text{ }&\text{ }&p=0.001441 &\text{ }&\text{ }&\text{ } \\\text{ }&\text{ }&\text{ } &\text{ }&\text{ } \\ 0&\text{ }&\text{ }&0.000739012&\text{ }&\text{ }&0.000742862 \\ 1&\text{ }&\text{ }&0.006071278&\text{ }&\text{ }&0.006095184 \\ 2&\text{ }&\text{ }&0.025304641&\text{ }&\text{ }&0.025376925 \\ 3&\text{ }&\text{ }&0.071544923&\text{ }&\text{ }&0.071685238 \\ 4&\text{ }&\text{ }&0.154905379&\text{ }&\text{ }&0.155098087 \\ 5&\text{ }&\text{ }&0.275104906&\text{ }&\text{ }&0.275296003 \\ 6&\text{ }&\text{ }&0.419508250&\text{ }&\text{ }&0.419633667 \\ 7&\text{ }&\text{ }&0.568176421 &\text{ }&\text{ }&0.568198363 \\ 8&\text{ }&\text{ }&0.702076190 &\text{ }&\text{ }&0.701999442 \\ 9&\text{ }&\text{ }&0.809253326&\text{ }&\text{ }&0.809114639 \\ 10&\text{ }&\text{ }&0.886446690&\text{ }&\text{ }&0.886291139 \\ 11&\text{ }&\text{ }&0.936980038&\text{ }&\text{ }&0.936841746 \\ 12&\text{ }&\text{ }&0.967298041&\text{ }&\text{ }&0.967193173 \\ 13&\text{ }&\text{ }&0.984085073&\text{ }&\text{ }&0.984014868 \\ 14&\text{ }&\text{ }&0.992714372&\text{ }&\text{ }&0.992672033 \\ 15&\text{ }&\text{ }&0.996853671&\text{ }&\text{ }&0.996830358 \end{bmatrix}$

The above table is calculated using the functions BINOM.DIST and POISSON.DIST in Excel. The following shows how it is done. The parameter TRUE indicates that the result is a cumulative distribution. When it is set to FALSE, the formula gives the probability function.

$P(X \le x)=\text{BINOM.DIST(x, 5000, 0.001441, TRUE)}$

$P(Y \le x)=\text{POISSON.DIST(x, 7.205, TRUE)}$
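The same comparison can be generated outside of Excel. Here is a minimal Python sketch using scipy.stats (only the first few rows of the table are printed).

```python
from scipy.stats import binom, poisson

n, p = 5000, 0.001441
lam = n * p                                 # 7.205
for x in range(5):
    print(x, binom.cdf(x, n, p), poisson.cdf(x, lam))
# For example, at x = 4: binomial 0.154905..., Poisson 0.155098...
```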

________________________________________________________________________

The Poisson distribution

The limit in Theorem 1 is a probability function and the resulting distribution is called the Poisson distribution. We now give the formal definition. A random variable $X$ that takes on one of the numbers $0,1,2,\cdots$ is said to be a Poisson random variable with parameter $\lambda>0$ if

$\displaystyle P(X=x)=\frac{e^{-\lambda} \ \lambda^x}{x!} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ x=0,1,2,\cdots$

It can be shown that the above function is indeed a probability function, i.e., the probabilities sum to 1. Any random variable that has a probability function of the above form is said to follow (or to have) a Poisson distribution. Furthermore, it can be shown that $E(X)=var(X)=\lambda$, i.e., the Poisson parameter is both the mean and variance. Thus the Poisson distribution may be a good fit if the observed data indicate that the sample mean and the sample variance are nearly identical.

The following is the moment generating function of the Poisson distribution with parameter $\lambda$.

$\displaystyle M(t)=E(e^{tX})=e^{\lambda \ (e^t-1)}$

One consequence of the Poisson moment generating function is that any independent sum of Poisson random variables is again a Poisson random variable.

________________________________________________________________________

The Poisson process

Another way, the more important way, to look at the Poisson distribution is the view point of the Poisson process. Consider an experiment in which events that are of interest occur at random in a time interval. The goal here is to record the time of the occurrence of each random event and for the purpose at hand, count the number of random events occurring in a fixed time interval. Starting at time 0, note the time of the occurrence of the first event. Then the time at which the second random event occurs and so on. Out of these measurements, we can derive the length of time between the occurrences of any two consecutive random events. Such measurements belong to a continuous random variable. In this post, we focus on the discrete random variable of the count of the random events in a fixed time interval.

A good example of a Poisson process is the well known experiment in radioactivity conducted by Rutherford and Geiger in 1910. In this experiment, $\alpha$-particles were emitted from a polonium source and the number of $\alpha$-particles was counted during intervals of 7.5 seconds (2608 such time intervals were observed). A Poisson process is a random process in which several criteria are satisfied. We will show that in a Poisson process, the number of these random occurrences in the fixed time interval will follow a Poisson distribution. First, we discuss the criteria to which a Poisson process must conform.

One of the criteria is that in a very short time interval, the chance of having more than one random event is essentially zero. So either one random event will occur or none will occur in a very short time interval. Considering the occurrence of a random event as a success, there is either a success or a failure in a very short time interval. So a very short time interval in a Poisson process can be regarded as a Bernoulli trial.

The second criterion is that the experiment remains constant over time. Specifically this means that the probability of a random event occurring in a given subinterval is proportional to the length of that subinterval and does not depend on where the subinterval is in the original interval. For example, in the 1910 radioactivity study, $\alpha$-particles were emitted at the rate of $\lambda=$ 3.87 per 7.5 seconds. So the probability of one $\alpha$-particle being emitted from the radioactive source in a one-second interval is 3.87/7.5 = 0.516. Then the probability of observing one $\alpha$-particle in a half-second interval is 0.516/2 = 0.258. For a quarter-second interval, the probability is 0.258/2 = 0.129. So if we observe half as long, it will be half as likely to observe the occurrence of a random event. On the other hand, it does not matter where the quarter-second subinterval is, whether at the beginning or toward the end of the original interval of 7.5 seconds.

The third criterion is that non-overlapping subintervals are mutually independent in the sense that what happens in one subinterval (i.e. the occurrence or non-occurrence of a random event) will have no influence on the occurrence of a random event in another subinterval. To summarize, the following are the three criteria of a Poisson process:

Suppose that on average $\lambda$ random events occur in a time interval of length 1.

1. The probability of having more than one random event occurring in a very short time interval is essentially zero.
2. For a very short subinterval of length $\frac{1}{n}$ where $n$ is a sufficiently large integer, the probability of a random event occurring in this subinterval is $\frac{\lambda}{n}$.
3. The numbers of random events occurring in non-overlapping time intervals are independent.

Consider a Poisson process in which the average rate is $\lambda$ random events per unit time interval. Let $Y$ be the number of random events occurring in the unit time interval. In the 1910 radioactivity study, the unit time interval is 7.5 seconds and $Y$ is the count of the number of $\alpha$-particles emitted in 7.5 seconds. It follows that $Y$ has a Poisson distribution with parameter $\lambda$. To see this, subdivide the unit interval into $n$ non-overlapping subintervals of equal length where $n$ is a sufficiently large integer. Let $X_{n,j}$ be the number of random events in the $j$th subinterval ($1 \le j \le n$). Based on the three assumptions, $X_{n,1},X_{n,2},\cdots,X_{n,n}$ are independent Bernoulli random variables, where the probability of success for each $X_{n,j}$ is $\frac{\lambda}{n}$. Then $X_n=X_{n,1}+X_{n,2}+\cdots+X_{n,n}$ has a binomial distribution with parameters $n$ and $p=\frac{\lambda}{n}$. Theorem 1 tells us that the limiting case of the binomial distributions for $X_n$ is the Poisson distribution with parameter $\lambda$. This Poisson distribution should agree with the distribution for $Y$. This connection is also discussed in quite a lot of detail in the previous post called Poisson as a Limiting Case of Binomial Distribution.
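The limiting argument can also be watched in a simulation. The following Python sketch (an illustration, not a proof) chops the unit interval into many small subintervals, places an event in each subinterval independently with probability $\lambda/n$, and checks that the resulting counts have mean and variance close to $\lambda$, as a Poisson distribution should.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 3.87          # rate per unit time interval (7.5 seconds in the 1910 study)
n = 10_000          # number of subintervals in one unit interval
trials = 100_000    # number of simulated unit intervals

# Each unit interval is a sum of n independent Bernoulli(lam/n) subintervals.
counts = rng.binomial(n, lam / n, size=trials)
print(counts.mean(), counts.var())   # both should be close to 3.87
```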

We now examine the 1910 radioactivity study a little more closely.

Example 2
The basic idea of the 1910 radioactivity study conducted by Rutherford and Geiger is that a polonium source was placed a short distance from an observation point. The number of $\alpha$-particles emitted from the source was counted in 7.5-second intervals, 2608 times in all. The following shows the tabulated results.

$\displaystyle \begin{bmatrix} \text{Number of alpha particles}&\text{ }&\text{Observed} \\ \text{recorded per 7.5 seconds }&\text{ }&\text{counts} \\ \text{ }&\text{ }&\text{ } \\ 0&\text{ }&57 \\ 1&\text{ }&203 \\ 2&\text{ }&383 \\ 3&\text{ }&525 \\ 4&\text{ }&532 \\ 5&\text{ }&408 \\ 6&\text{ }&273 \\ 7&\text{ }&139 \\ 8&\text{ }&45 \\ 9&\text{ }&27 \\ 10&\text{ }&10 \\ 11+&\text{ }&6 \\ \text{ }&\text{ }&\text{ } \\ \text{Total }&\text{ }&2608 \end{bmatrix}$

What is the average number of particles observed per 7.5 seconds? The total number of $\alpha$-particles in these 2608 periods is

$0 \times 57+1 \times 203+2 \times 383+ 3 \times 525 + \cdots=10097$.

The mean count per period is $\lambda=\frac{10097}{2608}=3.87$. Consider the Poisson distribution with parameter 3.87. The following is its probability function.

$\displaystyle P(X=x)=\frac{e^{-3.87} \ 3.87^x}{x!} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ x=0,1,2,\cdots$

Out of 2608 periods, the expected number of periods with $x$ particles in emission is $2608P(X=x)$. The following is a side by side comparison in the observed counts and the expected counts.

$\displaystyle \begin{bmatrix} \text{Number of alpha particles}&\text{ }&\text{Observed}&\text{ }&\text{Expected} \\ \text{recorded per 7.5 seconds }&\text{ }&\text{counts}&\text{ }&\text{counts} \\ \text{ }&\text{ }&\text{ }&\text{ }&2608 \times P(X=x) \\ \text{ }&\text{ }&\text{ }&\text{ }&\text{ } \\ 0&\text{ }&57&\text{ }&54.40 \\ 1&\text{ }&203&\text{ }&210.52 \\ 2&\text{ }&383&\text{ }&407.36 \\ 3&\text{ }&525&\text{ }&525.50 \\ 4&\text{ }&532&\text{ }&508.42 \\ 5&\text{ }&408&\text{ }&393.52 \\ 6&\text{ }&273&\text{ }&253.82 \\ 7&\text{ }&139&\text{ }&140.32 \\ 8&\text{ }&45&\text{ }&67.88 \\ 9&\text{ }&27&\text{ }&29.19 \\ 10&\text{ }&10&\text{ }&11.30 \\ 11+&\text{ }&6&\text{ }&5.78 \\ \text{ }&\text{ }&\text{ }&\text{ }&\text{ } \\ \text{Total }&\text{ }&2608&\text{ }&2608 \end{bmatrix}$

The expected counts are quite close to the observed counts, showing that the Poisson distribution is a very good fit to the observed data from the 1910 study.
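The expected counts in the table can be reproduced with a few lines of code. The following Python sketch (using scipy.stats.poisson; the observed counts are typed in from the table above) prints the observed and expected counts side by side.

```python
from scipy.stats import poisson

lam, n_periods = 3.87, 2608
observed = [57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 10, 6]   # counts for 0,1,...,10 and 11+

for x in range(11):
    print(x, observed[x], round(n_periods * poisson.pmf(x, lam), 2))
print("11+", observed[11], round(n_periods * poisson.sf(10, lam), 2))   # P(X >= 11)
```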

________________________________________________________________________

We have described the Poisson process as the distribution of random events in a time interval. The same idea can be used to describe random events occurring along a spatial interval, i.e. intervals in terms of distance or volume or other spatial measurements (see Examples 5 and 6 below).

Another point to make is that sometimes it may be necessary to consider an interval other than the unit length. Instead of counting the random events occurring in an interval of length 1, we may want to count the random events in an interval of length $t$. As before, let $\lambda$ be the rate of occurrences in a unit interval. Then the rate of occurrences of the random events over the interval of length $t$ is $\lambda t$. The same idea can be used to derive the fact that the number of occurrences of the random events of interest in the interval of length $t$ has a Poisson distribution with parameter $\lambda t$. The following is its probability function.

$\displaystyle P(X_t=x)=\frac{e^{-\lambda t} \ (\lambda t)^x}{x!} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ x=0,1,2,\cdots$

where $X_t$ is the count of the random events in an interval of length $t$.

For example, in the 1910 radioactivity study, the unit length is 7.5 seconds. The number of $\alpha$-particles observed in a half unit interval (3.75 seconds) will follow a Poisson distribution with parameter $0.5 \lambda=$ 0.5(3.87) = 1.935 with the following probability function:

$\displaystyle P(X_{0.5}=x)=\frac{e^{-1.935} \ (1.935)^x}{x!} \ \ \ \ \ \ \ \ \ \ \ \ \ x=0,1,2,\cdots$

________________________________________________________________________

More examples

Example 3
A radioactive source is metered for 5 hours. During this period, 9638 $\alpha$-particles are counted. What is the probability that during the next minute, between 30 and 34 particles (both inclusive) will be counted?

The average number of $\alpha$-particles counted per minute is $\lambda=\frac{9638}{300}=32.12$. Let $X$ be the number of $\alpha$-particles counted per minute. Then $X$ has a Poisson distribution with parameter $\lambda=32.12$. The following calculates $P(30 \le X \le 34)$.

\displaystyle \begin{aligned} P(30 \le X \le 34)&=e^{-32.12} \biggl[ \frac{32.12^{30}}{30!}+\frac{32.12^{31}}{31!}+\frac{32.12^{32}}{32!}+\frac{32.12^{33}}{33!}+\frac{32.12^{34}}{34!} \biggr] \\&=0.341118569 \end{aligned}

Alternatively, the POISSON.DIST function in Excel can be used as follows:

\displaystyle \begin{aligned} P(30 \le X \le 34)&=P(X \le 34)-P(X \le 29) \\&=\text{POISSON.DIST(34,32.12,TRUE)} \\& \ \ \ \ \ \ -\text{POISSON.DIST(29,32.12,TRUE)} \\&=0.671501917-0.330383348 \\&=0.341118569 \end{aligned}

Example 4
The side effect of dry mouth is known to be experienced, on the average, by 5 out of 10,000 individuals taking a certain medication. About 20,000 patients are expected to take this medication next year. What is the probability that between 12 and 16 (both inclusive) patients will experience the side effect of dry mouth? What is the exact probability model that can also be used to work this problem?

The exact model is a binomial distribution. The number of trials $n=$ 20000 and the probability of success in each trial is $p=$ 0.0005 (experiencing the side effect). Here, we use Poisson to estimate the binomial. The average number of patients experiencing side effect is $\lambda=20000(0.0005)=10$. Let $X$ be the number of patients experiencing the side effect. The following calculates the Poisson probability for $P(12 \le X \le 16)$ in two different ways.

\displaystyle \begin{aligned} P(12 \le X \le 16)&=e^{-10} \biggl[ \frac{10^{12}}{12!}+\frac{10^{13}}{13!}+\frac{10^{14}}{14!}+\frac{10^{15}}{15!}+\frac{10^{16}}{16!} \biggr] \\&=0.276182244 \end{aligned}
\displaystyle \begin{aligned} P(12 \le X \le 16)&=P(X \le 16)-P(X \le 11) \\&=\text{POISSON.DIST(16,10,TRUE)} \\& \ \ \ \ \ \ -\text{POISSON.DIST(11,10,TRUE)} \\&=0.97295839-0.696776146 \\&=0.276182244 \end{aligned}

Example 5
In a 10-mile stretch of a highway, car troubles (e.g. tire punctures, dead batteries, and mechanical breakdowns) occur at a rate of 1.5 per hour. A tow truck driver can respond to such car troubles and offer roadside assistance, which can include towing and minor repair. Assume that the number of such incidents per hour follows a Poisson distribution. At the beginning of the hour, three tow trucks (and their drivers) are available to respond to any car troubles in this stretch of highway. What is the probability that in the next hour all three tow truck drivers will be busy helping motorists with car troubles in this stretch of highway?

Let $X$ be the number of car troubles that occur in this 10-mile stretch of highway in the one-hour period in question. If in this one hour there are 3 or more car troubles ($X \ge 3$), then all three tow truck drivers will be busy.

\displaystyle \begin{aligned} P(X \ge 3)&=1-P(X \le 2) \\&=1-e^{-1.5} \biggl[ 1+1.5+\frac{1.5^{2}}{2!} \biggr] \\&=1-0.808846831\\&=0.191153169 \end{aligned}

Example 6
Continuing Example 5. Considering that there is only a 19% chance that all 3 tow truck drivers will be busy, there is a good chance that the resources are underutilized. What if one of the drivers is assigned to another stretch of highway?

With only two tow trucks available for this 10-mile stretch of highway, the following is the probability that all two tow truck drivers will be busy:

\displaystyle \begin{aligned} P(X \ge 2)&=1-P(X \le 1) \\&=1-e^{-1.5} \biggl[ 1+1.5 \biggr] \\&=1-0.5578254\\&=0.4421746 \end{aligned}

Assigning one driver to another area seems to be a better way to make good use of the available resources. With only two tow truck drivers available, there is a much reduced chance (56%) that one of the drivers will be idle, and there is a much increased chance (44%) that all available drivers will be busy.

________________________________________________________________________

Remarks

The Poisson distribution is one of the most important of all probability models and has been shown to be an excellent model for a wide array of phenomena such as

• the number of $\alpha$-particles emitted from radioactive source in a given amount of time,
• the number of vehicles passing a particular location on a busy highway,
• the number of traffic accidents in a stretch of highway in a given period of time,
• the number of phone calls arriving at a particular point in a telephone network in a fixed time period,
• the number of insurance losses/claims in a given period of time,
• the number of customers arriving at a ticket window,
• the number of earthquakes occurring in a fixed period of time,
• the number of mutations on a strand of DNA,
• the number of hurricanes in a year that originate in the Atlantic ocean.

Why is the Poisson distribution so widely applicable in these and many other seemingly different and diverse phenomena? What is the commonality that ties all these different and diverse phenomena together? The commonality is that all these phenomena are basically a series of independent Bernoulli trials. If a phenomenon is a binomial model where $n$ is large and $p$ is small, then it has a strong connection to the Poisson model mathematically through Theorem 1 above (i.e. it has a Poisson approximation). On the other hand, if the random phenomenon follows the criteria in a Poisson process, then the phenomenon is also approximately a binomial model, which means that in the limiting case it is Poisson.

In both view points discussed in this post, the Poisson distribution can be regarded as a binomial distribution taken at a very granular level. This connection with the binomial distribution points to a vast array of problems that can be solved using the Poisson distribution.

________________________________________________________________________

Exercises

Practice problems for the Poisson concepts discussed above can be found in the companion blog. Working on these exercises is strongly encouraged (you don’t know it until you can do it).

________________________________________________________________________
$\copyright \ \text{2015 by Dan Ma}$