Introducing the Lognormal Distribution

This post introduces the lognormal distribution and discusses some of its basic properties. The lognormal distribution is a transformation of the normal distribution through exponentiation, and the basic properties discussed here are derived from the normal distribution. The normal distribution is applicable in many situations but not in all. The normal density curve is a symmetric bell-shaped curve and is thus not appropriate for phenomena that are skewed to the right. In such situations, the lognormal distribution can be a good alternative to the normal distribution.


Defining the distribution

In this post, the notation \log refers to the natural log function, i.e., logarithm to the base e=2.718281828459 \cdots.

A random variable Y is said to follow a lognormal distribution if \log(Y) follows a normal distribution. A lognormal distribution has two parameters \mu and \sigma, which are the mean and standard deviation of the normal random variable \log(Y). To be more precise, the definition is restated as follows:

    A random variable Y is said to follow a lognormal distribution with parameters \mu and \sigma if \log(Y) follows a normal distribution with mean \mu and standard deviation \sigma.

Many useful probability distributions are transformations of other known distributions. The above definition shows that a normal distribution is the transformation of a lognormal distribution under the natural logarithm: start with a lognormal random variable, and taking its natural log gives a normal random variable. The other direction is actually more informative, i.e., a lognormal distribution is the transformation of a normal distribution by the exponential function. Start with a normal random variable X; its exponentiation Y=e^{X} has a lognormal distribution. The following summarizes these two transformations.

    Y \text{ is lognormal}  \longrightarrow X=\log(Y) \text{ is normal}

    X \text{ is normal} \longrightarrow  Y=e^{X} \text{ is lognormal}

Since the exponential function gives positive values, the lognormal distribution always takes on positive real values. The following diagram shows the probability density functions of the standard normal distribution and the corresponding lognormal distribution. Recall that the standard normal distribution is the normal distribution with mean 0 and standard deviation 1.

    Figure 1 – normal and lognormal density curves

In Figure 1, the standard normal density curve is a symmetric bell-shaped curve, with mean and median located at x = 0. The standard lognormal density (called standard because it is derived from the standard normal distribution) is located entirely over the positive x-axis. This is because the exponential function always gives positive values regardless of the sign of the argument. The lognormal density curve in Figure 1 is not symmetric; it is a unimodal curve skewed toward the right. All the standard normal values on the negative x-axis, when exponentiated, fall in the interval (0, 1). Thus in Figure 1, the lower half of the lognormal probability lies in the interval from x = 0 to x = 1 (i.e. the median of this lognormal distribution is x = 1). The other half of the lognormal probability lies in the interval (1, \infty). Such a lopsided assignment of probability shows that the lognormal distribution is positively skewed (skewed to the right).

In the above paragraph, the lower half of the normal distribution on (-\infty,0) is matched with the lognormal distribution on the interval (0, 1). Such interval matching can tell us a great deal about the lognormal distribution. For example, about 75% of the standard normal values lie below x = 0.67. Thus in Figure 1, about 75% of the lognormal probability lies in the interval (0, 1.95), where e^{0.67} \approx 1.95. As another example, what is the probability that the lognormal distribution in Figure 1 takes a value between 1 and 3.5? The matching normal interval is (0, 1.25), where \log(1)=0 and \log(3.5) \approx 1.25. The normal probability in this interval is 0.8944 - 0.5 = 0.3944. Thus a randomly generated value from the standard lognormal distribution has a 39.44% chance of falling between 1 and 3.5.
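The interval matching above can be sketched in a few lines of Python. This is only an illustration: it uses `NormalDist` from the standard library's `statistics` module in place of a normal table, and the helper names `std_lognormal_prob` and `std_lognormal_percentile` are made up for this sketch.

```python
from math import exp, log
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1).
Z = NormalDist(mu=0.0, sigma=1.0)

def std_lognormal_prob(a, b):
    """P(a < Y <= b) for the standard lognormal Y = e^X, with 0 < a < b."""
    # Match the lognormal interval (a, b) with the normal interval (log a, log b).
    return Z.cdf(log(b)) - Z.cdf(log(a))

def std_lognormal_percentile(p):
    """100p-th percentile of the standard lognormal distribution, 0 < p < 1."""
    # Exponentiate the matching standard normal percentile.
    return exp(Z.inv_cdf(p))

prob_1_to_3_5 = std_lognormal_prob(1.0, 3.5)   # about 0.39
q75 = std_lognormal_percentile(0.75)           # a little under 2
```

The results differ slightly from the figures quoted in the text because the text rounds the normal values to two decimal places before looking them up.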

The interval matching idea is very useful for computing lognormal probabilities (e.g. cumulative distribution function) and for finding lognormal percentiles. This idea is discussed further below to make it work for any lognormal distribution, not just the standard lognormal distribution.

Though every lognormal distribution is skewed, some are less skewed than others. Lognormal distributions with larger values of the parameter \sigma tend to be more skewed. The following diagram of three lognormal density curves demonstrates this point. Note that the curve with the small \sigma of 0.25 comes fairly close to resembling a bell curve.

    Figure 2 – three lognormal density curves


How to compute lognormal probabilities and percentiles

Let Y be a random variable that follows a lognormal distribution with parameters \mu and \sigma. Then the related normal random variable is X=\log(Y), which has mean \mu and standard deviation \sigma. If we raise e to X, we get back the lognormal Y.

Continuing the interval matching idea, the lognormal interval Y \le y will match with the normal interval X \le \log(y). Both intervals receive the same probability in their respective distributions. The following states this more clearly.

    \displaystyle P\biggl(Y \le y\biggr)=P\biggl(X=\log(Y) \le \log(y)\biggr) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)

On the other hand, the normal interval X \le x will match with the lognormal interval Y \le e^x. The same probability is assigned to both intervals in their respective distributions. This idea is stated as follows:

    \displaystyle P\biggl(X \le x\biggr)=P\biggl(Y=e^X \le e^x\biggr) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)

Statement (1) gives the cumulative distribution function (CDF) of the lognormal distribution at the argument y: it equals the CDF of the corresponding normal distribution evaluated at \log(y). One obvious application of (2) is an easy way to find percentiles of the lognormal distribution. It is relatively easy to find the corresponding percentile of the normal distribution; the lognormal percentile is then e raised to that normal percentile. For example, the median of the normal distribution is at the mean \mu, so the median of the lognormal distribution is e^{\mu}.

The calculations in both (1) and (2) involve finding normal probabilities, which can be obtained using software or using a table of probability values of the standard normal distribution. To use the table approach, each normal CDF is converted to the standard normal CDF as follows:

    \displaystyle P\biggl(X \le x \biggr)=P\biggl(\frac{X-\mu}{\sigma} \le \frac{x-\mu}{\sigma} \biggr)=\Phi(z) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3)

where z is the z-score which is the ratio (x-\mu)/\sigma and \Phi(z) is the cumulative distribution function of the standard normal distribution, which can be looked up from a table based on the z-score. In light of this, (1) can be expressed as follows:

    \displaystyle P\biggl(Y \le y\biggr)=\Phi\biggl(\frac{\log(y)-\mu}{\sigma} \biggr) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4)

The following quick example demonstrates how this works.

Example 1
If Y is lognormally distributed with parameters \mu=2.5 and \sigma=1.5,

  • what is the probability P(1.9 \le Y \le 31.34)?
  • what is the 95th percentile of this lognormal distribution?

The first answer is P(1.9 \le Y \le 31.34)=P(Y \le 31.34)-P(Y \le 1.9), which is calculated as follows:

    \displaystyle P(Y \le 31.34)=\Phi \biggl( \frac{\log(31.34)-2.5}{1.5}\biggr)=\Phi(0.63)=0.7357

    \displaystyle P(Y \le 1.9)=\Phi \biggl( \frac{\log(1.9)-2.5}{1.5}\biggr)=\Phi(-1.24)=1-0.8925=0.1075

    \displaystyle \begin{aligned} P(1.9 \le Y \le 31.34)&=P(Y \le 31.34)-P(Y \le 1.9) \\&=0.7357-0.1075 \\&=0.6282  \end{aligned}

The z-score for the 95th percentile for the standard normal distribution is z = 1.645. Then the 95th percentile for the normal distribution with mean 2.5 and standard deviation 1.5 is x = 2.5 + 1.645 (1.5) = 4.9675. Then apply the exponential function to obtain e^{4.9675} \approx 143.67, which is the desired lognormal 95th percentile. \square

As (2) and Example 1 suggest, to find a lognormal percentile, first find the percentile for the corresponding normal distribution. If x_p is the 100pth percentile of the normal distribution, then e^{x_p} is the 100pth percentile of the lognormal distribution. Usually, we can first find the 100pth percentile for the standard normal distribution z_p. Then the normal percentile we need is x=\mu + z_p \cdot \sigma. The lognormal percentile is then:

    \displaystyle e^{\displaystyle \mu + z_p \cdot \sigma}=\text{lognormal } 100p \text{th percentile} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (5)
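Formulas (4) and (5) translate directly into code. The following is a minimal sketch that re-does Example 1 with the standard library's `NormalDist` instead of a table; the function names `lognormal_cdf` and `lognormal_percentile` are chosen here for illustration.

```python
from math import exp, log
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def lognormal_cdf(y, mu, sigma):
    """P(Y <= y) via formula (4); y must be positive."""
    return STD_NORMAL.cdf((log(y) - mu) / sigma)

def lognormal_percentile(p, mu, sigma):
    """100p-th percentile via formula (5), for 0 < p < 1."""
    z_p = STD_NORMAL.inv_cdf(p)      # standard normal percentile
    return exp(mu + z_p * sigma)     # exponentiate the normal percentile

# Example 1: mu = 2.5, sigma = 1.5
mu, sigma = 2.5, 1.5
prob = lognormal_cdf(31.34, mu, sigma) - lognormal_cdf(1.9, mu, sigma)
p95 = lognormal_percentile(0.95, mu, sigma)
print(round(prob, 3), round(p95, 1))
```

The printed values agree with Example 1 up to the rounding of z-scores that the table lookup requires.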

The above discussion shows that the explicit form of the lognormal density curve is not needed in computing lognormal probabilities and percentiles. For the sake of completeness, the following shows the probability density functions of both the normal distribution and the lognormal distribution.

    Normal PDF
    \displaystyle f_X(x)=\frac{1}{\sigma \sqrt{2 \pi}} \ \ \text{\Large e}^{\displaystyle -\frac{(x-\mu)^2}{2 \sigma^2}} \ \ \ \ \ \ \ \ \ \ \ -\infty<x<\infty \ \ \ \ \ \ \ \ \ \ (6)

    Lognormal PDF
    \displaystyle f_Y(y)=\frac{1}{y \ \sigma \sqrt{2 \pi}} \ \ \text{\Large e}^{\displaystyle -\frac{(\log(y)-\mu)^2}{2 \sigma^2}} \ \ \ \ \ 0<y<\infty \ \ \ \ \ \ \ \ \ \ \ \ \ (7)

The cumulative distribution function for the lognormal distribution is then

    \displaystyle F_Y(y)=\int_0^y \frac{1}{t \ \sigma \sqrt{2 \pi}} \ \ \text{\Large e}^{\displaystyle -\frac{(\log(t)-\mu)^2}{2 \sigma^2}} \ dt \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (8)

Of course, we do not have to use (8) since the lognormal CDF can be obtained based on the corresponding normal CDF.

One application of the lognormal PDF in (7) is to use it to find the mode (by taking its derivative and finding the critical value). The mode of the lognormal distribution with parameters \mu and \sigma is \displaystyle e^{\mu - \sigma^2}.
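As a rough numerical check of the mode formula, the sketch below evaluates the density (7) on a grid and compares the location of the maximum with e^{\mu - \sigma^2}. The parameter values and the grid are arbitrary choices made for this illustration, and a grid search is only a crude stand-in for the calculus argument.

```python
from math import exp, log, pi, sqrt

def lognormal_pdf(y, mu, sigma):
    """Lognormal density (7) for y > 0."""
    z = (log(y) - mu) / sigma
    return exp(-0.5 * z * z) / (y * sigma * sqrt(2.0 * pi))

mu, sigma = 2.5, 1.5
# Crude grid search over (0, 10] with step 0.001 (the grid is an arbitrary choice).
grid = [i / 1000.0 for i in range(1, 10001)]
numeric_mode = max(grid, key=lambda y: lognormal_pdf(y, mu, sigma))
formula_mode = exp(mu - sigma ** 2)   # mode stated in the text
```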


How to find lognormal moments

To find the mean and higher moments of the lognormal distribution, we once again rely on basic information about the normal distribution. For any random variable T (normal or otherwise), its moment generating function is defined by M_T(t)=E(e^{t \ T}). The following is the moment generating function of the normal distribution with mean \mu and standard deviation \sigma.

    \displaystyle M_X(t)=\text{\Large e}^{\displaystyle \biggl[ \mu t + (1/2) \sigma^2 t^2 \biggr]} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (9)

As before, let Y be a random variable that follows a lognormal distribution with parameters \mu and \sigma. Then Y=e^X where X is normal with mean \mu and standard deviation \sigma. Then E(Y)=E(e^X) is simply the normal moment generating function evaluated at 1. In fact, the kth moment of Y, E(Y^k)=E(e^{k X}), is simply the normal mgf evaluated at k. Because the mgf of the normal distribution is defined at any real number, all moments for the lognormal distribution exist. The following gives the moments explicitly.

    E(Y)=\text{\Large e}^{\displaystyle \biggl[ \mu+(1/2) \sigma^2 \biggr]}

    E(Y^k)=\text{\Large e}^{\displaystyle \biggl[ k \mu+(k^2/2) \sigma^2 \biggr]} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (10)

In particular, the variance and standard deviation are:

    \displaystyle \begin{aligned}Var(Y)&=\text{\Large e}^{\displaystyle \biggl[ 2 \mu+2 \sigma^2 \biggr]}-\text{\Large e}^{\displaystyle \biggl[ 2 \mu+\sigma^2 \biggr]} \\&=\biggl(\text{\Large e}^{\displaystyle \sigma^2}-1\biggr) \text{\Large e}^{\displaystyle \biggl[ 2 \mu+\sigma^2 \biggr]} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (11) \end{aligned}

    \displaystyle \sigma_Y=\sqrt{\text{\Large e}^{\displaystyle \sigma^2}-1} \ \ \text{\Large e}^{\displaystyle \biggl[ \mu+\frac{1}{2} \sigma^2 \biggr]}=\sqrt{\text{\Large e}^{\displaystyle \sigma^2}-1} \ E(Y) \ \ \ \ \ \ \ \ \ \ \ \ (12)

The formulas (11) and (12) give the variance and standard deviation if the parameters \mu and \sigma are known. They do not need to be committed to memory, since they can always be generated from knowing the moments in (10). As indicated before, the lognormal median is e^{\mu}, which is always less than the mean, which is e raised to \mu+(1/2) \sigma^2. So the mean is greater than the median by a factor of e raised to (1/2) \sigma^2. The mean being greater than the median is another sign that the lognormal distribution is skewed right.
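The moment formulas are easy to turn into code. The sketch below computes raw moments from (10) and builds the mean and variance from them; the function names and the parameter values \mu=2, \sigma=1 are chosen only for illustration.

```python
from math import exp

def lognormal_raw_moment(k, mu, sigma):
    """E(Y^k) from formula (10)."""
    return exp(k * mu + 0.5 * k * k * sigma * sigma)

def lognormal_mean_var(mu, sigma):
    """Mean and variance built from the first two raw moments."""
    mean = lognormal_raw_moment(1, mu, sigma)
    var = lognormal_raw_moment(2, mu, sigma) - mean ** 2
    return mean, var

mu, sigma = 2.0, 1.0          # illustrative parameter values
mean, var = lognormal_mean_var(mu, sigma)
median = exp(mu)              # lognormal median
ratio = mean / median         # should equal e^{sigma^2 / 2}
```

The variance computed this way matches the closed form (11), and the ratio of mean to median equals e raised to (1/2) \sigma^2, as noted above.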

Example 2
Suppose Y follows a lognormal distribution with mean 12.18 and variance 255.02. Determine the probability that Y is greater than its mean.

With the given information, we have:

    E(Y)=\text{\Large e}^{\displaystyle \biggl[ \mu+(1/2) \sigma^2 \biggr]}=12.18

    \biggl(\text{\Large e}^{\displaystyle \sigma^2}-1\biggr) \text{\Large e}^{\displaystyle \biggl[ 2 \mu+\sigma^2 \biggr]}=\biggl(\text{\Large e}^{\displaystyle \sigma^2}-1\biggr) \biggl(\text{\Large e}^{\displaystyle \biggl[ \mu+(1/2) \sigma^2 \biggr]}\biggr)^2=255.02

    \displaystyle \begin{aligned}Var(Y)&=\biggl(\text{\Large e}^{\displaystyle \sigma^2}-1\biggr) \text{\Large e}^{\displaystyle \biggl[ 2 \mu+\sigma^2 \biggr]} \\&=\biggl(\text{\Large e}^{\displaystyle \sigma^2}-1\biggr) \biggl(\text{\Large e}^{\displaystyle \biggl[ \mu+(1/2) \sigma^2 \biggr]}\biggr)^2  \\&=  \biggl(\text{\Large e}^{\displaystyle \sigma^2}-1\biggr) 12.18^2=255.02    \end{aligned}

From the last equation, we can solve for \sigma. The following shows the derivation.

    \displaystyle \begin{aligned} &\biggl(\text{\Large e}^{\displaystyle \sigma^2}-1\biggr)=\frac{255.02}{12.18^2}   \\&\text{\Large e}^{\displaystyle \sigma^2}=2.719014994  \\&\sigma^2=\log(2.719014994)=1.00026968 \end{aligned}

Thus we can take \sigma=1. Then plug \sigma=1 into E(Y)=e^{\mu+(1/2) \sigma^2}=12.18 to get \mu=\log(12.18)-0.5 \approx 2. The desired probability is:

    \displaystyle \begin{aligned} P(Y>12.18)&=1-\Phi \biggl(\frac{\log(12.18)-2}{1}  \biggr) \\&=1-\Phi(0.499795262)  \\&=1-\Phi(0.5)=1-0.6915=0.3085  \ \square    \end{aligned}
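The algebra in Example 2 can be packaged as a small helper that recovers \mu and \sigma from a given mean and variance. This is a sketch under the same assumptions as the example; the function name `lognormal_params_from_mean_var` is made up here.

```python
from math import log, sqrt
from statistics import NormalDist

def lognormal_params_from_mean_var(mean, var):
    """Solve the mean formula in (10) and the variance formula (11) for (mu, sigma)."""
    sigma_sq = log(1.0 + var / mean ** 2)   # from (e^{sigma^2} - 1) * mean^2 = var
    mu = log(mean) - 0.5 * sigma_sq         # from mean = e^{mu + sigma^2 / 2}
    return mu, sqrt(sigma_sq)

mu, sigma = lognormal_params_from_mean_var(12.18, 255.02)
# P(Y > mean) through the matching normal interval.
prob_above_mean = 1.0 - NormalDist().cdf((log(12.18) - mu) / sigma)
```

Without rounding \sigma to 1 and \mu to 2, the probability still comes out at about 0.3085.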


Linear transformations

For any random variable W, a linear transformation of W is the random variable aW+b where a and b are real constants. It is well known that if X follows a normal distribution, any linear transformation of X also follows a normal distribution. Does this apply to the lognormal distribution? A linear transformation of a lognormal distribution may not have distributional values over the entire positive x-axis. For example, if Y is lognormal, then Y+1 is technically not lognormal since the values of Y+1 lie in (1, \infty) and not (0, \infty). Instead, we focus on the transformations cY where c>0 is a constant. We have the following fact.

    If Y has a lognormal distribution with parameters \mu and \sigma, then cY has a lognormal distribution with parameters \mu+\log(c) and \sigma.

The effect of the constant adjustment of the lognormal distribution is on the \mu parameter, which is adjusted by adding the natural log of the constant c. Note that the adjustment on \mu is addition and not multiplication. The \sigma parameter is unchanged.

One application of the transformation cY is that of inflation. For example, suppose Y represents claim amounts in a given calendar year arising from a group of insurance policies. If the insurance company expects that the claim amounts in the next year will increase by 10%, then 1.1Y is the random variable that models next year’s claim amounts. If Y is assumed to be lognormal, then the effect of the 10% inflation is on the \mu parameter as indicated above.

To see why the inflation on Y works as described, let’s look at the cumulative distribution function of T=cY.

    \displaystyle \begin{aligned} P(T \le y)&=P(Y \le \frac{y}{c})=F_Y(y/c) \\&=\int_0^{y/c} \frac{1}{t \ \sigma \sqrt{2 \pi}} \ \ \text{\Large e}^{\displaystyle -\frac{(\log(t)-\mu)^2}{2 \sigma^2}} \ dt  \end{aligned}

Taking the derivative of the last expression with respect to y (the chain rule contributes the factor 1/c), we obtain the probability density function f_T(y).

    \displaystyle \begin{aligned} f_T(y)&=\frac{1}{y/c \ \sigma \sqrt{2 \pi}} \ \ \text{\Large e}^{\displaystyle -\frac{(\log(y/c)-\mu)^2}{2 \sigma^2}} (1/c) \\&=\frac{1}{y \ \sigma \sqrt{2 \pi}} \ \ \text{\Large e}^{\displaystyle -\frac{\biggl(\log(y)-(\mu+\log(c))\biggr)^2}{2 \sigma^2}}  \end{aligned}

Comparing with the density function (7), the last line is the lognormal density function with parameters \mu + \log(c) and \sigma.
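The scaling fact can also be checked numerically through the CDF: P(cY \le y)=P(Y \le y/c) should agree with the CDF of a lognormal distribution with parameters \mu+\log(c) and \sigma. The sketch below does this at a few points; the parameter values and test points are arbitrary, and c = 1.1 mirrors the 10% inflation example.

```python
from math import log
from statistics import NormalDist

def lognormal_cdf(y, mu, sigma):
    """P(Y <= y) for a lognormal Y with parameters mu and sigma; y > 0."""
    return NormalDist().cdf((log(y) - mu) / sigma)

mu, sigma, c = 2.5, 1.5, 1.1   # illustrative values
gaps = []
for y in (1.0, 10.0, 50.0, 200.0):
    lhs = lognormal_cdf(y / c, mu, sigma)        # P(cY <= y) = P(Y <= y/c)
    rhs = lognormal_cdf(y, mu + log(c), sigma)   # CDF with the shifted mu parameter
    gaps.append(abs(lhs - rhs))
max_gap = max(gaps)   # zero up to floating-point rounding
```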


Distributional quantities involving higher moments

As the formula (10) shows, all moments exist for the lognormal distribution. As a result, any distributional quantity that is defined using moments can be described explicitly in terms of the parameters \mu and \sigma. We highlight three such distributional quantities: coefficient of variation, coefficient of skewness and kurtosis. The following shows their definitions. The calculation is done by plugging in the moments obtained from (10).

    \displaystyle CV=\frac{\sigma_Y}{\mu_Y} \ \ \ \ \ \text{(Coefficient of variation)}

    \displaystyle \begin{aligned} \gamma_1&=\frac{E[ (Y-\mu_Y)^3 ]}{\sigma_Y^3}  \\&=\frac{E(Y^3)-3 \mu_Y \sigma_Y^2-\mu_Y^3}{(\sigma_Y^2)^{\frac{3}{2}}}  \ \ \ \ \ \text{(Coefficient of skewness)}    \end{aligned}

    \displaystyle \begin{aligned} \beta_2&=\frac{E[ (Y-\mu_Y)^4 ]}{\biggl(E[ (Y-\mu_Y)^2]\biggr)^2}  \\&=\frac{E(Y^4)-4 \ \mu_Y \ E(Y^3) + 6 \ \mu_Y^2 \ E(Y^2) - 3 \ \mu_Y^4}{(\sigma_Y^2)^{2}}  \ \ \ \ \ \text{(Kurtosis)}    \end{aligned}

The above definitions are made for any random variable Y. The notations \mu_Y and \sigma_Y are the mean and standard deviation of Y, respectively. The coefficient of variation is the ratio of the standard deviation to the mean. It is a standardized measure of dispersion of a probability distribution or frequency distribution. The coefficient of skewness is defined as the ratio of the third central moment to the cube of the standard deviation. The second line in the above definition is an equivalent form that is in terms of the mean, variance and the third raw moment, which may be easier to calculate in some circumstances. Kurtosis is defined to be the ratio of the fourth central moment to the square of the second central moment (the variance). The second line in the definition gives an equivalent form that is in terms of the mean, the variance and the second, third and fourth raw moments.

The above general definitions of CV, \gamma_1 and \beta_2 can be evaluated for the lognormal distribution. The mean, the variance and the higher raw moments can be obtained by using (10). Then it is a matter of plugging the relevant items into the above definitions. The following example shows how this is done.

Example 3
Determine the CV, \gamma_1 and \beta_2 of the lognormal distribution in Example 2.

The calculation in Example 2 shows that the lognormal parameters are \mu=2 and \sigma=1. Now use formula (10) to get the ingredients.


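    E(Y)=e^{2.5} \ \ \ \ \ \ \ \ E(Y^2)=e^{6} \ \ \ \ \ \ \ \ E(Y^3)=e^{10.5} \ \ \ \ \ \ \ \ E(Y^4)=e^{16}

    Var(Y)=(e-1) e^5 \ \ \ \ \ \ \ \ \sigma_Y=\sqrt{e-1} \ e^{2.5}
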
Right away, CV = \sqrt{e-1}=1.31. The following shows the calculation for skewness and kurtosis.

    \displaystyle \begin{aligned} \gamma_1&=\frac{E(Y^3)-3 \mu_Y \sigma_Y^2-\mu_Y^3}{(\sigma_Y^2)^{\frac{3}{2}}}  \\&=\frac{e^{10.5}-3 \ e^{2.5} \ (e-1) \ e^5-(e^{2.5})^3}{[(e-1) \ e^5]^{\frac{3}{2}}}\\&=6.1849     \end{aligned}

    \displaystyle \begin{aligned} \beta_2&=\frac{E(Y^4)-4 \ \mu_Y \ E(Y^3) + 6 \ \mu_Y^2 \ E(Y^2) - 3 \ \mu_Y^4}{(\sigma_Y^2)^{2}}  \\&=\frac{e^{16}-4 \ e^{2.5} \ e^{10.5} + 6 \ (e^{2.5})^2 \ e^{6} - 3 \ (e^{2.5})^4}{[(e-1) \ e^5]^{2}}\\&=113.9364    \end{aligned}
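The same plug-in calculation can be sketched in code. The function below takes the raw moments from (10) and applies the three definitions as written above; the names `raw_moment` and `lognormal_shape` are made up for this illustration.

```python
from math import exp, sqrt

def raw_moment(k, mu, sigma):
    """E(Y^k) from formula (10)."""
    return exp(k * mu + 0.5 * k * k * sigma * sigma)

def lognormal_shape(mu, sigma):
    """Return (CV, coefficient of skewness, kurtosis) from the raw moments."""
    m1, m2, m3, m4 = (raw_moment(k, mu, sigma) for k in (1, 2, 3, 4))
    var = m2 - m1 ** 2
    cv = sqrt(var) / m1
    skew = (m3 - 3 * m1 * var - m1 ** 3) / var ** 1.5
    kurt = (m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4) / var ** 2
    return cv, skew, kurt

# Parameters from Example 2: mu = 2, sigma = 1
cv, skew, kurt = lognormal_shape(2.0, 1.0)
```

For these parameters the three values agree with the figures in Example 3 to the displayed precision.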


Is there a moment generating function for the lognormal distribution?

Because the normal distribution has a moment generating function, all moments exist for the lognormal distribution (see formula (10) above). Does the moment generating function exist for the lognormal distribution? Whenever the mgf exists for a distribution, its moments can be derived from the mgf. What about the converse? That is, when all moments exist for a given distribution, does it mean that its moment generating function would always exist? The answer is no. It turns out that the lognormal distribution is a counterexample. We conclude this post by showing this fact.

Let Y be the standard lognormal distribution, i.e., X=\log(Y) has the standard normal distribution. We show that the expectation E(e^{tY}) is infinite for every t>0.

    \displaystyle \begin{aligned} E(e^{t \ Y})&=E(e^{t \ e^X}) \\&=\frac{1}{\sqrt{2 \pi}} \ \int_{-\infty}^\infty e^{t \ e^x} \ e^{-0.5 x^2} \ dx \\&=\frac{1}{\sqrt{2 \pi}} \ \int_{-\infty}^\infty e^{t \ e^x - 0.5x^2} \ dx \\&> \frac{1}{\sqrt{2 \pi}} \ \int_{0}^\infty e^{t \ e^x - 0.5x^2} \ dx \\&> \frac{1}{\sqrt{2 \pi}} \ \int_{0}^\infty e^{\biggl(t \ ( \displaystyle 1+x+\frac{x^2}{2!}+\frac{x^3}{3!}) - 0.5x^2 \biggr)} \ dx=\infty \end{aligned}

The last integral in the above derivation diverges to infinity. Note that the Taylor series expansion of e^x is \displaystyle 1+x+\frac{x^2}{2!}+\frac{x^3}{3!}+\frac{x^4}{4!}+\cdots. In the last step, e^x is replaced by the partial sum \displaystyle 1+x+\frac{x^2}{2!}+\frac{x^3}{3!}, which is smaller than e^x for x>0. Then the exponent in the last integral is a third degree polynomial with a positive coefficient for the x^3 term (namely t/6). Thus this third degree polynomial goes to infinity as x goes to infinity, and so does the integrand. Since the last integral is infinite, E(e^{t \ Y}) is infinite as well for every t>0, so the lognormal distribution has no moment generating function.
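To get a feel for how quickly the integrand blows up, the sketch below tabulates the exponent t e^x - 0.5 x^2 for a small positive t. Only the exponent is computed, since exponentiating it would overflow; the value t = 0.01 and the sample points are arbitrary choices, and a table of values is of course not a proof.

```python
from math import exp

def log_integrand(t, x):
    """Exponent t*e^x - x^2/2 of the integrand of E(e^{tY}) for the standard lognormal Y."""
    return t * exp(x) - 0.5 * x * x

t = 0.01  # a small positive t; 0.01 is an arbitrary choice
sample_points = (5.0, 10.0, 15.0, 20.0)
exponents = [log_integrand(t, x) for x in sample_points]
for x, value in zip(sample_points, exponents):
    print(x, value)
```

Even with t this small, the exponent is already in the millions at x = 20, so the integrand cannot be integrable over the positive axis.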


Practice problems

Practice problems to reinforce the calculation are found in the companion blog to this blog.

\copyright \ \text{2015 by Dan Ma}

The Poisson Distribution

Let \alpha be a positive constant. Consider the following probability distribution:

\displaystyle (1) \ \ \ \ \ P(X=j)=\frac{e^{-\alpha} \alpha^j}{j!} \ \ \ \ \ j=0,1,2,\cdots

The above distribution is said to be a Poisson distribution with parameter \alpha. The Poisson distribution is usually used to model the random number of events occurring in a fixed time interval. As will be shown below, E(X)=\alpha. Thus the parameter \alpha is the rate of occurrence of the random events; it indicates on average how many events occur per unit of time. Examples of random events that may be modeled by the Poisson distribution include the number of alpha particles emitted by a radioactive substance counted in a prescribed area during a fixed period of time, the number of auto accidents in a fixed period of time or the number of losses arising from a group of insureds during a policy period.

Each of the above examples can be thought of as a process that generates a number of arrivals or changes in a fixed period of time. If such a counting process leads to a Poisson distribution, then the process is said to be a Poisson process.

We now discuss some basic properties of the Poisson distribution. Using the Taylor series expansion of e^{\alpha}, the following shows that (1) is indeed a probability distribution.

\displaystyle \sum \limits_{j=0}^\infty \frac{e^{-\alpha} \alpha^j}{j!}=e^{-\alpha} \sum \limits_{j=0}^\infty \frac{\alpha^j}{j!}=e^{-\alpha} e^{\alpha}=1

The generating function of the Poisson distribution is g(z)=e^{\alpha (z-1)} (see The generating function). The mean and variance can be calculated using the generating function.

\displaystyle \begin{aligned}(2) \ \ \ \ \ &E(X)=g'(1)=\alpha \\&\text{ } \\&E[X(X-1)]=g^{(2)}(1)=\alpha^2 \\&\text{ } \\&Var(X)=E[X(X-1)]+E(X)-E(X)^2=\alpha^2+\alpha-\alpha^2=\alpha \end{aligned}
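The facts above (total probability 1, mean \alpha, variance \alpha) can be checked numerically by truncating the infinite sums. This is a sketch only: the rate 3.5 and the truncation point 100 are arbitrary choices, and `poisson_pmf` is a helper defined here.

```python
from math import exp, factorial

def poisson_pmf(j, alpha):
    """P(X = j) for a Poisson distribution with parameter alpha."""
    return exp(-alpha) * alpha ** j / factorial(j)

alpha = 3.5      # arbitrary illustrative rate
terms = 100      # truncation point; the tail beyond it is negligible for this alpha
total = sum(poisson_pmf(j, alpha) for j in range(terms))
mean = sum(j * poisson_pmf(j, alpha) for j in range(terms))
second_factorial = sum(j * (j - 1) * poisson_pmf(j, alpha) for j in range(terms))
variance = second_factorial + mean - mean ** 2   # same identity as in (2)
```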

The Poisson distribution can also be interpreted as an approximation to the binomial distribution. It is well known that the Poisson distribution is the limiting case of binomial distributions (see [1] or this post).

\displaystyle (3) \ \ \ \ \ \lim \limits_{n \rightarrow \infty} \binom{n}{j} \biggl(\frac{\alpha}{n}\biggr)^j \biggl(1-\frac{\alpha}{n}\biggr)^{n-j}=\frac{e^{-\alpha} \alpha^j}{j!}

One application of (3) is that we can use Poisson probabilities to approximate binomial probabilities. The approximation is reasonably good when the number of trials n in a binomial distribution is large and the probability of success p is small. The binomial mean is n p and the variance is n p (1-p). When p is small, 1-p is close to 1 and the binomial variance is approximately equal to the mean, i.e., n p (1-p) \approx np. Whenever the mean of a discrete distribution is approximately equal to its variance, the Poisson approximation is quite good. As a rule of thumb, we can use Poisson to approximate binomial if n \ge 100 and p \le 0.01.

As an example, we use the Poisson distribution to estimate the probability that at most 1 person out of 1000 will have a birthday on New Year's Day. Let n=1000 and p=365^{-1}, assuming that all 365 days are equally likely birthdays. So we use the Poisson distribution with \alpha=1000 \cdot 365^{-1} \approx 2.74. The following is an estimate using the Poisson distribution.

\displaystyle P(X \le 1)=e^{-\alpha}+\alpha e^{-\alpha}=(1+\alpha) e^{-\alpha}=0.2415
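To see how close the estimate is, the sketch below compares it with the exact binomial probability for n = 1000 and p = 1/365. The helper name `binom_at_most_1` is made up for this illustration.

```python
from math import comb, exp

n, p = 1000, 1 / 365
alpha = n * p   # Poisson parameter matching the binomial mean

def binom_at_most_1(n, p):
    """Exact binomial P(X <= 1) for n trials with success probability p."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in (0, 1))

poisson_estimate = (1 + alpha) * exp(-alpha)
binomial_exact = binom_at_most_1(n, p)
print(poisson_estimate, binomial_exact)
```

The two probabilities differ by well under 0.001, which is consistent with the rule of thumb for large n and small p.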

Another useful property is that the independent sum of Poisson distributions also has a Poisson distribution. Specifically, if each X_i has a Poisson distribution with parameter \alpha_i, then the independent sum X=X_1+\cdots+X_n has a Poisson distribution with parameter \alpha=\alpha_1+\cdots+\alpha_n. One way to see this is that the product of Poisson generating functions has the same general form as g(z)=e^{\alpha (z-1)} (see The generating function). One interpretation of this property is that when merging several arrival processes, each of which follows a Poisson distribution, the result is still a Poisson distribution.

For example, suppose that at an airline ticket counter, the arrival of first class customers follows a Poisson process with a mean arrival rate of 8 per 15 minutes and the arrival of customers flying coach follows a Poisson process with a mean rate of 12 per 15 minutes. Then the arrival of customers of either type has a Poisson distribution with a mean rate of 20 per 15 minutes or 80 per hour.
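The merging property can be checked directly by convolving two Poisson probability functions. The sketch below uses the rates 8 and 12 from the ticket counter example; the helper names and the range of values checked are choices made for this illustration.

```python
from math import exp, factorial, isclose

def poisson_pmf(j, alpha):
    """P(X = j) for a Poisson distribution with parameter alpha."""
    return exp(-alpha) * alpha ** j / factorial(j)

def sum_pmf(k, a1, a2):
    """P(X1 + X2 = k) for independent Poisson X1 and X2, by direct convolution."""
    return sum(poisson_pmf(i, a1) * poisson_pmf(k - i, a2) for i in range(k + 1))

first_class, coach = 8.0, 12.0   # mean arrivals per 15 minutes
checks = [
    isclose(sum_pmf(k, first_class, coach),
            poisson_pmf(k, first_class + coach),
            rel_tol=1e-9)
    for k in range(41)
]
all_match = all(checks)   # convolution agrees with Poisson(20) at every k checked
```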

A Poisson distribution with a large mean can be thought of as an independent sum of Poisson distributions. For example, a Poisson distribution with a mean of 50 is the independent sum of 50 Poisson distributions each with mean 1. Because of the central limit theorem, when the mean is large, we can approximate the Poisson using the normal distribution.

In addition to merging several Poisson distributions into one combined Poisson distribution, we can also split a Poisson into several Poisson distributions. For example, suppose that a stream of customers arrives according to a Poisson distribution with parameter \alpha and each customer can be classified into one of two types (e.g. no purchase vs. purchase) with probabilities p_1 and p_2, respectively, where p_1+p_2=1. Then the number of “no purchase” customers and the number of “purchase” customers are independent Poisson random variables with parameters \alpha p_1 and \alpha p_2, respectively. For more details on the splitting of Poisson, see Splitting a Poisson Distribution.
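The splitting property can be illustrated with a short calculation. The sketch assumes that each customer is classified independently of the others, so that given the total count, the number of “no purchase” customers is binomial; it then checks that the joint probability factors into a product of two Poisson probabilities. The values \alpha=6 and p_1=0.7 are arbitrary.

```python
from math import comb, exp, factorial, isclose

def poisson_pmf(j, alpha):
    """P(X = j) for a Poisson distribution with parameter alpha."""
    return exp(-alpha) * alpha ** j / factorial(j)

alpha, p1 = 6.0, 0.7   # arbitrary illustrative values
p2 = 1.0 - p1

def joint_pmf(a, b):
    """P(a 'no purchase' and b 'purchase' customers): Poisson total, then a binomial split."""
    n = a + b
    return poisson_pmf(n, alpha) * comb(n, a) * p1 ** a * p2 ** b

# Independence with Poisson marginals means the joint pmf equals the product below.
factors = all(
    isclose(joint_pmf(a, b),
            poisson_pmf(a, alpha * p1) * poisson_pmf(b, alpha * p2),
            rel_tol=1e-9)
    for a in range(15)
    for b in range(15)
)
```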


  1. Feller W. An Introduction to Probability Theory and Its Applications, Third Edition, John Wiley & Sons, New York, 1968

The sign test

What kind of significance tests do we use for doing inference on the mean of an obviously non-normal population? If the sample is large, we can still use the t-test since the sampling distribution of the sample mean \overline{X} is close to normal and the t-procedure is robust. If the sample size is small and the underlying distribution is clearly not normal (e.g. is extremely skewed), what significance test do we use? Let’s take the example of a matched pairs data problem. The matched pairs t-test is to test the hypothesis that there is “no difference” between two continuous random variables X and Y that are paired. If the underlying distributions are normal or if the sample size is large, the matched pairs t-test is an excellent test. However, absent normality or large samples, the sign test is an alternative to the matched pairs t-test. In this post, we discuss how the sign test works and present some examples. Examples 1 and 2 are shown in this post. Examples 3, 4 and 5 are shown in the next post, The sign test, more examples.

The sign test and the confidence intervals for percentiles (discussed in the previous post Confidence intervals for percentiles) are examples of distribution-free inference procedures. They are called distribution-free because no assumptions are made about the underlying distributions of the data measurements. For more information about distribution-free inferences, see [Hollander & Wolfe].

We discuss two types of problems for which the sign test is applicable – one-sample location problems and matched pairs data problems. In the one-sample problems, the sign test is to test whether the location (median) of the data has shifted. In the matched pairs problems, the sign test is to test whether the location (median) of one variable has shifted in relation to the matched variable. Thus, the test hypotheses must be restated in terms of the median if the sign test is to be used as an alternative to the t-test. With the sign test, the question is “has the median changed?” whereas the question is “has the mean changed?” for the t-test.

The sign test is one of the simplest distribution-free procedures. It is an excellent choice for a significance test when the sample size is small and the data are highly skewed or have outliers. In such cases, the sign test is preferred over the t-test. However, the sign test is generally less powerful than the t-test. For the matched pairs problems, the sign test only looks at the signs of the differences of the data pairs. The magnitude of the differences is not taken into account. Because the sign test does not use all the available information contained in the data, it is less powerful than the t-test when the population is close to normal.

How the sign test works
Suppose that (X,Y) is a pair of continuous random variables. Suppose that a random sample of paired data (X_1,Y_1),(X_2,Y_2), \cdots, (X_n,Y_n) is obtained. We omit the observations (X_i,Y_i) with X_i=Y_i. Let m be the number of pairs for which X_i \ne Y_i. For each of these m pairs, we make a note of the sign of the difference X_i-Y_i (+ if X_i>Y_i and - if X_i<Y_i). Let W be the number of + signs out of these m pairs. The sign test gets its name from the fact that its test statistic W is simply a count of signs. Thus we are only considering the signs of the differences in the paired data and not the magnitude of the differences. The sign test is also called the binomial test since the statistic W has a binomial distribution.

Let p=P[X>Y]. Note that this is the probability that a data pair (X,Y) has a + sign. If p=\frac{1}{2}, then any random pair (X,Y) has an equal chance of being a + or a - sign. The null hypothesis H_0:p=\frac{1}{2} is the hypothesis of “no difference”. Under this hypothesis, there is no difference between the two measurements X and Y. The sign test tests the null hypothesis H_0:p=\frac{1}{2} against any one of the following alternative hypotheses:

\displaystyle H_1:p<\frac{1}{2} \ \ \ \ \ \text{(Left-tailed)}
\displaystyle H_1:p>\frac{1}{2} \ \ \ \ \ \text{(Right-tailed)}
\displaystyle H_1:p \ne \frac{1}{2} \ \ \ \ \ \text{(Two-tailed)}

The statistic W counts the successes in a series of m independent trials, each of which has probability of success p=P[X>Y]. Thus W \sim binomial(m,p). When H_0 is true, W \sim binomial(m,\frac{1}{2}). Thus the binomial distribution is used for calculating significance. For an observed value w of W, the left-tailed P-value is P[W \le w] and the right-tailed P-value is P[W \ge w], both computed under H_0. The two-tailed P-value is twice the smaller of the two one-sided P-values.
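The procedure just described can be sketched as a short function. This is an illustration rather than a reference implementation: the function name `sign_test_pvalue` and the sample differences are hypothetical, and the two-tailed P-value is taken as twice the smaller one-sided P-value, capped at 1.

```python
from math import comb

def sign_test_pvalue(diffs, alternative="greater"):
    """P-value of the sign test for paired differences X_i - Y_i.

    alternative: "greater" (p > 1/2), "less" (p < 1/2) or "two-sided".
    """
    signs = [d for d in diffs if d != 0]     # omit ties, as in the text
    m = len(signs)
    w = sum(1 for d in signs if d > 0)       # number of + signs
    # Binomial(m, 1/2) probabilities under the null hypothesis.
    pmf = [comb(m, k) * 0.5 ** m for k in range(m + 1)]
    right = sum(pmf[w:])                     # P[W >= w]
    left = sum(pmf[: w + 1])                 # P[W <= w]
    if alternative == "greater":
        return right
    if alternative == "less":
        return left
    return min(1.0, 2.0 * min(left, right))  # two-tailed

# Hypothetical differences, for illustration only: 7 plus signs, 1 minus sign, 1 tie.
example_diffs = [1.2, 0.4, -0.3, 2.1, 0.0, 0.9, 1.5, 0.7, 3.0]
p_right = sign_test_pvalue(example_diffs, "greater")
p_two = sign_test_pvalue(example_diffs, "two-sided")
```

With these made-up differences, m = 8 and w = 7, so the right-tailed P-value is the probability of 7 or 8 plus signs out of 8 under H_0.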

The sign test can also be viewed as testing the hypothesis that the median of the differences is zero. Let m_d be the median of the differences X-Y. The null hypothesis H_0:p=\frac{1}{2} is equivalent to the hypothesis H_0:m_d=0. For the alternative hypotheses, we have the following equivalences:

\displaystyle H_1:p<\frac{1}{2} \ \ \ \equiv \ \ \ H_1:m_d<0

\displaystyle H_1:p>\frac{1}{2} \ \ \ \equiv \ \ \ H_1:m_d>0

\displaystyle H_1:p \ne \frac{1}{2} \ \ \ \equiv \ \ \ H_1:m_d \ne 0

Example 1
A running club conducts a 6-week training program to prepare 20 middle-aged amateur runners for a 5K running race. The following matrix shows the running times (in minutes) before and after the training program. Note that five kilometers is about 3.1 miles.

\displaystyle \begin{pmatrix} \text{Runner}&\text{Pre-training}&\text{Post-training}&\text{Diff} \\{1}&57.5&54.9&2.6 \\{2}&52.4&53.5&-1.1 \\{3}&59.2&49.0&10.2 \\{4}&27.0&24.5&2.5 \\{5}&55.8&50.7&5.1 \\{6}&60.8&57.5&3.3 \\{7}&40.6&37.2&3.4 \\{8}&47.3&42.3&5.0 \\{9}&43.9&47.3&-3.4 \\{10}&43.7&34.8&8.9 \\{11}&60.8&53.3&7.5 \\{12}&43.9&33.8&10.1 \\{13}&45.6&41.7&3.9 \\{14}&40.6&41.5&-0.9 \\{15}&54.1&52.5&1.6 \\{16}&50.7&52.4&-1.7 \\{17}&25.4&25.9&-0.5 \\{18}&57.5&54.7&2.8 \\{19}&43.9&38.7&5.2 \\{20}&43.9&39.9&4.0 \end{pmatrix}

The difference is taken to be pre-training time minus post-training time. Use the sign test to test whether the training program improves run time.

For a given runner, let X be a random pre-training running time and Y be a random post-training running time. The hypotheses to be tested are:

\displaystyle H_0:p=\frac{1}{2} \ \ \ \ \ H_1:p>\frac{1}{2} \ \ \ \text{where} \ p=P[X>Y]

Under the null hypothesis H_0, there is no difference between the pre-training run time and post-training run time. The difference is equally likely to be a plus sign or a minus sign. Let W be the number of runners in the sample for which X_i-Y_i>0. Then W \sim \text{Binomial}(20,0.5). The observed value of the statistic W is w=15. Since this is a right-tailed test, the following is the P-value:

\displaystyle \text{P-value}=P[W \ge 15]=\sum \limits_{k=15}^{20} \binom{20}{k} \biggl(\frac{1}{2}\biggr)^{20}=0.02069

Because of the small P-value, the result of 15 out of 20 runners having improved run time is unlikely to be due to random chance alone. So we reject H_0 (for example, at the significance level \alpha=0.05) and we have good reason to believe that the training program reduces run time.
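The calculation in Example 1 can be reproduced with a short script. The running times are copied from the matrix above; the variable names are ours, and the script is a sketch rather than a general-purpose routine.

```python
from math import comb

pre = [57.5, 52.4, 59.2, 27.0, 55.8, 60.8, 40.6, 47.3, 43.9, 43.7,
       60.8, 43.9, 45.6, 40.6, 54.1, 50.7, 25.4, 57.5, 43.9, 43.9]
post = [54.9, 53.5, 49.0, 24.5, 50.7, 57.5, 37.2, 42.3, 47.3, 34.8,
        53.3, 33.8, 41.7, 41.5, 52.5, 52.4, 25.9, 54.7, 38.7, 39.9]

# Pre-training time minus post-training time, with ties (if any) omitted.
diffs = [x - y for x, y in zip(pre, post) if x != y]
m = len(diffs)                  # number of non-tied pairs
w = sum(d > 0 for d in diffs)   # number of runners who improved

# Right-tailed P-value P[W >= w] for W ~ Binomial(m, 1/2).
p_value = sum(comb(m, k) for k in range(w, m + 1)) / 2**m
print(m, w, round(p_value, 5))  # 20 15 0.02069
```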

Example 2
A car owner is curious about the effect of oil changes on gas mileage. For each of 17 oil changes, he recorded data for miles per gallon (MPG) prior to the oil change and after the oil change. The following matrix shows the data:

\displaystyle \begin{pmatrix} \text{Oil Change}&\text{MPG (Pre)}&\text{MPG (Post)}&\text{Diff} \\{1}&24.24&27.45&3.21 \\{2}&24.33&24.60&0.27 \\{3}&24.45&28.27&3.82 \\{4}&23.37&22.49&-0.88 \\{5}&26.73&28.67&1.94 \\{6}&30.40&27.51&-2.89 \\{7}&29.57&29.28&-0.29 \\{8}&22.27&23.18&0.91 \\{9}&27.00&27.64&0.64 \\{10}&24.95&26.01&1.06 \\{11}&27.12&27.39&0.27 \\{12}&28.53&28.67&0.14 \\{13}&27.55&30.27&2.72 \\{14}&30.17&27.83&-2.34 \\{15}&26.00&27.78&1.78 \\{16}&27.52&29.18&1.66 \\{17}&34.61&33.04&-1.57\end{pmatrix}

Regular oil changes are obviously crucial to maintaining the overall health of the car. It seems to make sense that oil changes would improve gas mileage. Is there evidence that this is the case? Do the analysis using the sign test.

In this example we set the hypotheses in terms of the median. For a given oil change, let X be the post oil change MPG and Y be the pre oil change MPG. Consider the differences X-Y. Let m_d be the median of the differences X-Y. We test the null hypothesis that there is no change in MPG before and after oil change against the alternative hypothesis that the median of the post oil change MPG has shifted to the right in relation to the pre oil change MPG. We have the following hypotheses:

\displaystyle H_0:m_d=0 \ \ \ \ \ H_1:m_d>0

Let W be the number of oil changes with positive differences in MPG (post minus pre). Then W \sim \text{Binomial}(17,0.5). The observed value of the statistic W is w=12. Since this is a right-tailed test, the following is the P-value:

\displaystyle \text{P-value}=P[W \ge 12]=\sum \limits_{k=12}^{17} \binom{17}{k} \biggl(\frac{1}{2}\biggr)^{17}=0.07173

At the significance level of \alpha=0.10, we reject the null hypothesis. However, we would like to add a caveat. The value of this example is that it is an excellent demonstration of the sign test. The 17 oil changes are not controlled. For example, the data are just records of mileage and gas usage for 17 oil changes (both pre and post). No effort was made to make sure that the driving conditions are similar for the pre oil change MPG and post oil change MPG (freeway vs. local streets, weather conditions, etc). With more care in producing the data, we can conceivably derive a more definite answer.
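Since Example 2 is phrased in terms of the median, it may help to look at the sample median of the differences alongside the test. The sketch below uses the data from the matrix above. The variable names are ours, and the comparison at \alpha=0.05 is added only to show how sensitive the conclusion is to the choice of significance level.

```python
from math import comb
from statistics import median

pre = [24.24, 24.33, 24.45, 23.37, 26.73, 30.40, 29.57, 22.27, 27.00,
       24.95, 27.12, 28.53, 27.55, 30.17, 26.00, 27.52, 34.61]
post = [27.45, 24.60, 28.27, 22.49, 28.67, 27.51, 29.28, 23.18, 27.64,
        26.01, 27.39, 28.67, 30.27, 27.83, 27.78, 29.18, 33.04]

# Post minus pre, rounded to two decimals to match the matrix.
diffs = [round(b - a, 2) for a, b in zip(pre, post)]
nonzero = [d for d in diffs if d != 0]   # ties (none here) would be omitted
m = len(nonzero)
w = sum(d > 0 for d in nonzero)

# Right-tailed P-value P[W >= w] for W ~ Binomial(m, 1/2).
p_value = sum(comb(m, k) for k in range(w, m + 1)) / 2**m

print(median(diffs))   # sample median of the differences
print(m, w, p_value)
for alpha in (0.10, 0.05):
    print(alpha, "reject H_0" if p_value < alpha else "do not reject H_0")
```

The sample median of the differences is positive (0.64 MPG), which agrees in direction with the alternative hypothesis. The P-value falls between 0.05 and 0.10, so the rejection at \alpha=0.10 would not carry over to \alpha=0.05.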

Myles Hollander and Douglas A. Wolfe, Non-parametric Statistical Methods, Second Edition, Wiley (1999)

A note about the Student’s t distribution

The Student’s t distribution is an important distribution in statistics. It is the basis for the Student’s t statistic and arises in the problem of estimating the mean of a normal population. The Student’s t distribution (t distribution for short) is usually defined as the following ratio

\displaystyle T=\frac{Z}{\sqrt{\frac{U}{n}}}=\frac{Z \sqrt{n}}{\sqrt{U}}

where Z \sim N(0,1) and U has a chi-square distribution with n degrees of freedom. For this derivation, see [1].

In this post we discuss another way of deriving the t distribution. The alternative view is through the notion of mixture (compounding in some texts). Suppose that X has a normal distribution with mean 0 and variance \Theta^{-1}. That is, X \sim N(0,\Theta^{-1}). There is uncertainty in the variance \Theta^{-1}. Further suppose that \Theta follows a gamma distribution with shape parameter \alpha and rate parameter \beta where \alpha=\beta and \alpha is a positive integer. Then the unconditional distribution of X is a Student’s t distribution with n=2 \alpha degrees of freedom. In the language of mixture in probability, we say that the Student’s t distribution is a mixture of normal distributions with gamma mixing weights.
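The mixture description translates directly into a two-stage simulation: first draw \Theta from the gamma distribution, then draw X from a normal distribution with variance 1/\Theta. The following sketch uses only the Python standard library. The function name, the seed and the choice \alpha=5 are ours. Note that random.gammavariate takes a shape and a scale, so a rate of \alpha is passed as a scale of 1/\alpha. The two checks rely on standard facts about the t distribution with n=10 degrees of freedom: its variance is n/(n-2)=1.25 and its two-sided 5% critical value is about 2.228.

```python
import random
from math import sqrt

def sample_mixture(alpha, size, seed=0):
    """Draw from the normal-gamma mixture: Theta ~ gamma, then X ~ N(0, 1/Theta)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(size):
        # gammavariate(shape, scale); rate alpha corresponds to scale 1/alpha.
        theta = rng.gammavariate(alpha, 1.0 / alpha)
        draws.append(rng.gauss(0.0, 1.0 / sqrt(theta)))
    return draws

alpha = 5
n = 2 * alpha                      # degrees of freedom of the claimed t distribution
xs = sample_mixture(alpha, 100_000)

sample_var = sum(x * x for x in xs) / len(xs)            # the mean of X is 0
tail_frac = sum(abs(x) > 2.228 for x in xs) / len(xs)    # should be near 0.05

print(sample_var, n / (n - 2))
print(tail_frac)
```

A simulation like this does not prove the result, but the sample variance and the tail fraction should land close to the values expected for a t distribution with 10 degrees of freedom.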

The joint density of X and \Theta is f_{X \lvert \Theta}(x \lvert \theta) \thinspace h_{\Theta}(\theta), where the conditional density of X given \Theta=\theta is

\displaystyle f_{X \lvert \Theta}(x \lvert \theta)=\frac{\sqrt{\theta}}{\sqrt{2 \pi}} \thinspace e^{-\frac{\theta x^2}{2}}

and the density of \Theta is

\displaystyle h_{\Theta}(\theta)=\frac{\alpha^{\alpha}}{\Gamma(\alpha)} \thinspace \theta^{\alpha-1} \thinspace e^{-\alpha \theta}

The marginal (unconditional) density function of X is obtained by integrating out the parameter \theta. The resulting density is that of the Student’s t distribution. The following is the derivation.

\displaystyle f_{X}(x)=\int_{0}^{\infty} f_{X \lvert \Theta}(x \lvert \theta) \thinspace h_{\Theta}(\theta) \thinspace d \theta

\displaystyle =\int_{0}^{\infty} \frac{\sqrt{\theta}}{\sqrt{2 \pi}} \thinspace e^{-\frac{\theta x^2}{2}} \thinspace \frac{\alpha^{\alpha}}{\Gamma(\alpha)} \thinspace \theta^{\alpha-1} \thinspace e^{-\alpha \theta} \thinspace d \theta

\displaystyle =\frac{\alpha^{\alpha}}{\sqrt{2 \pi} \Gamma(\alpha)} \int_{0}^{\infty} \theta^{\alpha+\frac{1}{2}-1} \thinspace e^{-(\alpha+\frac{x^2}{2}) \theta} \thinspace d \theta

\displaystyle =\frac{\alpha^{\alpha}}{\sqrt{2 \pi} \Gamma(\alpha)} \frac{\Gamma(\alpha+\frac{1}{2})}{(\alpha+\frac{x^2}{2})^{\alpha+\frac{1}{2}}}\int_{0}^{\infty} \frac{(\alpha+\frac{x^2}{2})^{\alpha+\frac{1}{2}}}{\Gamma(\alpha+\frac{1}{2})} \theta^{\alpha+\frac{1}{2}-1} \thinspace e^{-(\alpha+\frac{x^2}{2}) \theta} \thinspace d \theta

\displaystyle =\frac{\alpha^{\alpha}}{\sqrt{2 \pi} \Gamma(\alpha)} \frac{\Gamma(\alpha+\frac{1}{2})}{(\alpha+\frac{x^2}{2})^{\alpha+\frac{1}{2}}}

Now, let n=2 \alpha. Then the above density function becomes:

\displaystyle f_X(x)=\frac{\Gamma(\frac{n+1}{2})}{\sqrt{\pi n} \thinspace \Gamma(\frac{n}{2})} \biggl(\frac{n}{n+x^2}\biggr)^{\frac{n+1}{2}}

The above density function is that of a Student’s t distribution with n degrees of freedom. It is interesting to note that because of the uncertainty in the parameter \Theta, the Student’s t distribution has heavier tails than the conditional normal distribution used in the beginning of the derivation.
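The derivation can be checked numerically by evaluating the mixture integral with a crude quadrature rule and comparing it with the closed-form t density. The sketch below uses only the Python standard library. The function names, the midpoint rule, the truncation point and the number of steps are our own choices, and the agreement is only approximate.

```python
from math import gamma, sqrt, pi, exp

def t_density(x, n):
    """Closed-form Student's t density with n degrees of freedom."""
    const = gamma((n + 1) / 2) / (sqrt(pi * n) * gamma(n / 2))
    return const * (n / (n + x * x)) ** ((n + 1) / 2)

def mixture_density(x, alpha, upper=60.0, steps=60_000):
    """Midpoint-rule approximation of the integral of f(x|theta) h(theta) over theta."""
    h = upper / steps
    const = alpha**alpha / (sqrt(2 * pi) * gamma(alpha))
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * h
        # Integrand after combining the normal and gamma factors.
        total += theta ** (alpha - 0.5) * exp(-(alpha + x * x / 2) * theta)
    return const * total * h

alpha = 3          # so n = 2 * alpha = 6 degrees of freedom
for x in (0.0, 1.0, 2.5):
    print(x, mixture_density(x, alpha), t_density(x, 2 * alpha))
```

For each x the two columns should agree to several decimal places, which is what the derivation predicts.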


  1. Feller W., An Introduction to Probability Theory and Its Applications, Vol II, Second Edition, John Wiley & Sons (1971)