# The skewness of a probability distribution

In this post, we discuss how to calculate the moment coefficient of skewness and also discuss some issues surrounding the notion of skewness.

________________________________________________________________________

Looking at graphs

One informal but useful way of checking the skewness of a distribution is to look at the density curve (or a histogram). Consider the following density functions.

Figure 1

Figure 2

The density curve in Figure 1 has a longer tail to the right than to the left. The example in Figure 1 is a distribution that is skewed to the right. It is also said to be positively skewed since its coefficient of skewness is positive. The density curve in Figure 2 has a longer tail to the left than to the right. The example in Figure 2 is a distribution that is skewed to the left. It is also said to be negatively skewed since the skewness coefficient is negative. If a density curve looks the same to the left and to the right (such as the bell curve for the normal distribution), then it is a symmetric distribution and the skewness coefficient is zero.

The distribution in Figure 1 is a right skewed distribution (the longer tail is on the right). It is a gamma distribution with mean 2 and median approximately 1.678347. The mode (the highest peak) is at x = 1. The distribution in Figure 2 is a left skewed distribution (the longer tail is on the left) with mean and median approximately 0.909 and 0.9213562, respectively. The mode is at 0.95.

________________________________________________________________________

In the distribution for Figure 1, we can say that "mode < median < mean". In the distribution for Figure 2, we can say that "mean < median < mode". A common conception is that these simple rules characterize all skewed distributions, i.e., the mean is to the right of the median, which in turn is to the right of the mode in a right skewed distribution, and the mean is to the left of the median, which in turn is to the left of the mode in a left skewed distribution. Such rules are certainly easy to remember and are stated in some statistics textbooks. In the above two figures, this rule of thumb holds. It turns out, however, that it fails in many instances. The above two graphs are "textbook" demonstrations of skewness: they are gamma and beta distributions, and they behave according to the usual notion of how skewed distributions should look. In a later section of this post, we discuss this issue in greater detail. First we define the coefficient of skewness.

________________________________________________________________________

Pearson moment coefficient of skewness

The measure of skewness defined here is called the Pearson moment coefficient of skewness. It provides information about the amount and the direction of the departure from symmetry. Its value can be positive or negative, or even undefined. The higher the absolute value of the skewness measure, the more asymmetric the distribution. The skewness measure of a symmetric distribution is zero or near zero.

To help put the definition of skewness in context, we first define raw moments and central moments of a random variable $X$. The $k$th raw moment of $X$ is $E(X^k)$, the expected value of the $k$th power of the random variable $X$. The first raw moment is the mean of the random variable and is usually denoted by $\mu$.

The $k$th central moment of a random variable $X$ is $E[(X-\mu)^k]$, the expected value of the $k$th power of the deviation of the variable from its mean. The moment $E[(X-\mu)^k]$ is usually denoted by $\mu_k$. The second central moment is usually called the variance and is denoted by $\sigma^2$. The square root of $\sigma^2$, $\sigma$, is the standard deviation.

The ratio of the standard deviation to the mean, $\displaystyle \frac{\sigma}{\mu}$, is called the coefficient of variation.

The ratio of the third central moment to the cube of the standard deviation is called Pearson’s moment coefficient of skewness (or the coefficient of skewness) and is denoted by $\gamma_1$.

$\displaystyle \gamma_1=\frac{E[ (X-\mu)^3 ]}{\sigma^3}=\frac{\mu_3}{\sigma^3} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)$

The skewness in (1) can be expanded to derive a version that can be calculated more easily:

\displaystyle \begin{aligned} \gamma_1&=\frac{E[ (X-\mu)^3 ]}{\sigma^3} \\&=\frac{E(X^3)-3 \mu E(X^2)+3 \mu^2 E(X)-\mu^3}{\sigma^3} \\&=\frac{E(X^3)-3 \mu [E(X^2)-\mu E(X)]-\mu^3}{\sigma^3} \\&=\frac{E(X^3)-3 \mu \sigma^2-\mu^3}{\sigma^3} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2) \\&=\frac{E(X^3)-3 \mu \sigma^2-\mu^3}{(\sigma^2)^{\frac{3}{2}}} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3) \end{aligned}

The last version (3) is in terms of the first raw moment $\mu$, the second central moment $\sigma^2$ and the third raw moment $E(X^3)$. Essentially, the coefficient $\gamma_1$ can be obtained via (3) by first computing the first three raw moments.
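Formula (3) is easy to put into code. The following Python sketch (the helper name is mine, not from the post) computes $\gamma_1$ from the first three raw moments; the exponential distribution, whose raw moments with rate 1 are $1, 2, 6$, serves as a check, since its skewness is 2 (the gamma case $\alpha=1$ below).

```python
# Sketch of formula (3): gamma_1 from the first three raw moments.
# The function name is my own, not from the post.

def skewness_from_raw_moments(m1, m2, m3):
    """gamma_1 = (E[X^3] - 3*mu*sigma^2 - mu^3) / (sigma^2)^(3/2)."""
    variance = m2 - m1 ** 2            # second central moment
    return (m3 - 3 * m1 * variance - m1 ** 3) / variance ** 1.5

# Exponential with rate 1: E[X^k] = k!, so the moments are 1, 2, 6.
gamma1_exponential = skewness_from_raw_moments(1.0, 2.0, 6.0)
```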

Even though kurtosis is not the focus of this post, we state it here to complete the brief discussion on moments. The ratio of the fourth central moment to the fourth power of the standard deviation, $\displaystyle \gamma_2=\frac{\mu_4}{\sigma^4}$, is called the kurtosis.

________________________________________________________________________

Examples

In this section, we discuss the skewness in two familiar families of continuous distributions – gamma and beta. We also demonstrate how exponentiation can affect skewness.

Example 1 – Gamma Distribution
The following is the probability density function of the gamma distribution.

$\displaystyle f(x)=\frac{\beta^\alpha}{\Gamma(\alpha)} \ x^{\alpha-1} \ e^{-\beta x} \ \ \ \ \ \ \ \ \ x>0$

where $\Gamma(\cdot)$ is the gamma function, and $\alpha$ and $\beta$ are parameters such that $\alpha>0$ and $\beta>0$. The number $\alpha$ is the shape parameter and the number $\beta$ here is the rate parameter. Figure 1 shows the gamma distribution with $\alpha=2$ and $\beta=1$. When $\alpha=1$, we obtain the exponential distribution. When $\beta=\frac{1}{2}$ and $\alpha=\frac{k}{2}$ where $k$ is a positive integer, we obtain the chi square distribution with $k$ degrees of freedom.

Let $X$ be a random variable with the above gamma density function. The raw moments $E(X^k)$, where $k=1,2,3,\cdots$, are:

$\displaystyle E(X^k)=\frac{(\alpha+k-1)(\alpha+k-2) \cdots \alpha}{\beta^k}$

Using the first two raw moments to calculate the variance as well as the third moment, the following calculates the moment coefficient of skewness, based on the form in (3):

$\displaystyle \gamma_1=\frac{\displaystyle \frac{(\alpha+2)(\alpha+1)\alpha}{\beta^3}-3 \frac{\alpha}{\beta} \frac{\alpha}{\beta^2}-\frac{\alpha^3}{\beta^3}}{\biggl( \displaystyle \frac{\alpha}{\beta^2} \biggr)^{\frac{3}{2}}}=\frac{2}{\sqrt{\alpha}}$

The above calculation shows that the rate parameter $\beta$ has no effect on skewness. The example in Figure 1 has $\alpha=2$, giving a coefficient of skewness of $\sqrt{2}=1.414213562$. In general, the gamma distribution is positively skewed. However, it becomes more and more symmetric as the shape parameter $\alpha \rightarrow \infty$. The following graphs the gamma densities for $\alpha=1, 2, 3, 5, 6$ and $\beta=1$.

Figure 3

In Figure 3, the light blue density with $\alpha=1$ is an exponential distribution. The red one with $\alpha=2$ is the density in Figure 1. With $\alpha=6$, the gamma density already looks very symmetric (the dark blue).

On the other hand, as the shape parameter $\alpha \rightarrow 0$, the gamma distribution becomes increasingly positively skewed. When $\alpha=\frac{1}{n}$, $\gamma_1=2 \sqrt{n}$. When $n \rightarrow \infty$, $\gamma_1 \rightarrow \infty$.
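The claim that $\beta$ has no effect can be made concrete in code. The following sketch (helper names are my own) builds $\gamma_1$ from the raw-moment formula above, for several rates:

```python
import math

def gamma_raw_moment(alpha, beta, k):
    """E[X^k] = (alpha+k-1)(alpha+k-2)...alpha / beta^k (formula above)."""
    product = 1.0
    for i in range(k):
        product *= alpha + i
    return product / beta ** k

def gamma_skewness(alpha, beta):
    m1, m2, m3 = (gamma_raw_moment(alpha, beta, k) for k in (1, 2, 3))
    var = m2 - m1 ** 2
    return (m3 - 3 * m1 * var - m1 ** 3) / var ** 1.5

# For alpha = 2 the skewness is 2/sqrt(2) = sqrt(2), whatever the rate.
skews = [gamma_skewness(2.0, beta) for beta in (0.5, 1.0, 4.0)]
```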

Example 2 – Beta Distribution
The following is the PDF of a beta distribution:

$\displaystyle f(x)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \Gamma(\beta)} \ x^{\alpha-1} \ (1-x)^{\beta-1} \ \ \ \ \ \ \ \ \ \ \ \ 0<x<1$

where $\Gamma(\cdot)$ is the gamma function, and $\alpha$ and $\beta$ are parameters such that $\alpha>0$ and $\beta>0$. Both $\alpha$ and $\beta$ are shape parameters.

In the beta family of distributions, the skewness can range from positive to negative. If the $\alpha$ parameter dominates (i.e. $x$ is raised to a higher power than $1-x$ in the density function), then the beta distribution has a negative skew (skewed to the left). This is because the function $x^n$ is left skewed and the function $(1-x)^n$ is right skewed; the skewness of the beta distribution follows whichever factor dominates. If the $\beta$ parameter dominates, the beta distribution is skewed to the right. If both parameters are roughly equal, the beta distribution is close to symmetric. For example, when $\alpha=20$ and $\beta=2$, the beta distribution is left skewed (its density curve is in Figure 2). As in the gamma case, the skewness of the beta distribution has a closed form. The following formula confirms the intuition about the skewness of the beta distribution (found here).

$\displaystyle \gamma_1=\frac{2(\beta-\alpha) \ \sqrt{\alpha+\beta+1}}{(\alpha+\beta+2) \ \sqrt{\alpha \ \beta}}$

Thus the beta distribution with $\alpha=20$ and $\beta=2$ has skewness coefficient -1.137431317. The following figure further demonstrates the role the shape parameters play in changing the skewness of the beta distribution.

Figure 4

In Figure 4, as the $\alpha$ parameter goes from 2 to 20, the skewness goes from 1.137431317 to 0.659393193 to 0 to -0.659393193 to -1.137431317.
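The closed form is easy to evaluate. This short sketch (the function name is mine) reproduces the value quoted for $\alpha=20$, $\beta=2$, along with its positively skewed mirror image and a symmetric case:

```python
import math

def beta_skewness(a, b):
    """Closed-form skewness of the beta distribution (formula above)."""
    return 2 * (b - a) * math.sqrt(a + b + 1) / ((a + b + 2) * math.sqrt(a * b))

left_skewed = beta_skewness(20, 2)   # negative: alpha dominates
mirror = beta_skewness(2, 20)        # same magnitude, positive
symmetric = beta_skewness(3, 3)      # zero for equal parameters
```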

Example 3 – Exponentiation
Symmetric distributions have zero coefficient of skewness. Raising a symmetric distribution to a positive power can produce a skewed distribution. For example, let $X$ be the standard normal random variable (mean 0 and variance 1). Let $Y=X^2$. Then $Y$ has a chi-square distribution with 1 degree of freedom, which means that it is a gamma distribution with $\alpha=\frac{1}{2}$ and $\beta=\frac{1}{2}$. According to Example 1 above, the skewness coefficient is $\frac{2}{\sqrt{0.5}}=2 \sqrt{2}=2.828$. Thus squaring a standard normal distribution produces a very strongly positively skewed distribution.

Example 4 – Exponentiation
Raising a positively skewed distribution to a positive power can produce a more strongly positively skewed distribution. For example, let $X$ be an exponential random variable. Example 1 shows that exponential distributions have skewness coefficient 2. We show that the coefficient of skewness for $Y=X^2$ is approximately 6.619.

The density function for the exponential random variable $X$ is $f(x)=\beta e^{-\beta x}$ where $\beta>0$ is the rate parameter. It can be shown that the raw moments of $X$ are:

$\displaystyle E(X^k)=\frac{k!}{\beta^k} \ \ \ \ \ \ \ \ \ \ \ k=1,2,3,\cdots$

Then the first three moments of $Y$ are:

$\displaystyle E(Y)=E(X^2)=\frac{2}{\beta^2}$

$\displaystyle E(Y^2)=E(X^4)=\frac{24}{\beta^4}$

$\displaystyle E(Y^3)=E(X^6)=\frac{720}{\beta^6}$

With the first two raw moments, calculate the variance of $Y$. Then compute $\gamma_1$ via formula (3).

$\displaystyle \gamma_1=\frac{\displaystyle \frac{720}{\beta^6}-3 \ \frac{2}{\beta^2} \ \frac{20}{\beta^4}-\frac{8}{\beta^6}}{\biggl( \displaystyle \frac{20}{\beta^4} \biggr)^{\frac{3}{2}}}=\frac{74}{5 \sqrt{5}}=6.61876$
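As a numeric check of this calculation (taking $\beta=1$, which the formula shows is immaterial), plugging the three moments of $Y$ into the raw-moment form of the skewness:

```python
# Raw moments of Y = X^2 for an exponential X with rate beta = 1:
# E[Y] = 2, E[Y^2] = 24, E[Y^3] = 720 (from E[X^k] = k!).
m1, m2, m3 = 2.0, 24.0, 720.0
var = m2 - m1 ** 2                                   # 24 - 4 = 20
gamma1 = (m3 - 3 * m1 * var - m1 ** 3) / var ** 1.5  # (720-120-8)/20^1.5
```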

Example 5 – Exponentiation
Raising a left skewed distribution to a positive power can produce a distribution that is less left skewed. The use of increasing exponents eventually produces a positively skewed distribution. Let $X$ be the beta random variable with $\alpha=5$ and $\beta=1$. The density function for $X$ is $f(x)=5x^4$ where $0<x<1$. Using the formula shown in Example 2 above, the coefficient of skewness is

$\displaystyle \gamma_1=\frac{2(1-5) \sqrt{5+1+1}}{(5+1+2) \sqrt{5}}=-1.183215957$

We wish to calculate the coefficient of skewness for $X^2$. To do that, it will be helpful to have a formula for the raw moments of $X$. It is easy to verify that:

$\displaystyle E(X^k)=\frac{5}{5+k} \ \ \ \ \ \ \ \ \ k=1,2,3,\cdots$

The first three moments of $Y=X^2$ are:

$\displaystyle E(Y)=E(X^2)=\frac{5}{7}$

$\displaystyle E(Y^2)=E(X^4)=\frac{5}{9}$

$\displaystyle E(Y^3)=E(X^6)=\frac{5}{11}$

Via formula (3), the following is the coefficient of skewness for $Y=X^2$.

$\displaystyle \gamma_1=\frac{\displaystyle \frac{5}{11}-3 \ \frac{5}{7} \ \frac{20}{441}-\frac{125}{7^3}}{\biggl( \displaystyle \frac{20}{441} \biggr)^{\frac{3}{2}}}=\frac{-18}{11 \sqrt{5}}=-0.731804065$

In this example, squaring the beta distribution with skewness -1.1832 produces a negatively skewed distribution with a smaller skew in absolute value. Let’s raise $X$ to higher powers. The following shows the results:

$\displaystyle X^3 \ \ \ \ \ \ \gamma_1=\frac{-2 \sqrt{11}}{7 \sqrt{5}}=-0.423782771$

$\displaystyle X^4 \ \ \ \ \ \ \gamma_1=\frac{-2 \sqrt{13}}{17 \sqrt{5}}=-0.189700182$

$\displaystyle X^5 \ \ \ \ \ \ \gamma_1=0 \ \ \ \ \ \ (X^5 \text{ is in fact uniform on } (0,1))$

$\displaystyle X^6 \ \ \ \ \ \ \gamma_1=\frac{2 \sqrt{17}}{23 \sqrt{5}}=0.160340$

Raising the beta distribution with $\alpha=5$ and $\beta=1$ to the fifth power yields the uniform distribution (the distribution function of $X$ is $x^5$, so $P(X^5 \le y)=y$), and raising it to still higher powers produces positively skewed distributions. This is an interesting example, though this observation probably should not be taken as a rule.
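The pattern in this example can be checked quickly in code. Since $E(X^k)=5/(5+k)$, the raw moments of $Y=X^n$ are $E(Y^k)=E(X^{nk})=5/(5+nk)$, and formula (3) gives the skewness for any power (the helper name is mine):

```python
def skew_of_power(n):
    """Skewness of Y = X^n, X with density 5x^4 on (0,1):
    E[Y^k] = E[X^(n*k)] = 5/(5 + n*k)."""
    m = [5.0 / (5 + n * k) for k in (1, 2, 3)]
    var = m[1] - m[0] ** 2
    return (m[2] - 3 * m[0] * var - m[0] ** 3) / var ** 1.5

# The skewness climbs toward zero as the exponent grows; the fifth
# power is exactly uniform on (0,1), so its skewness is zero, and
# higher powers come out positively skewed.
values = {n: skew_of_power(n) for n in (1, 2, 3, 4, 5, 6)}
```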

________________________________________________________________________

Counterexamples

All the examples discussed previously are good “textbook” examples in that they help build intuition on how skewness behaves in familiar distributions. However, it is easy to take the wrong lessons from them. The above examples can serve as a good introduction to the topic of skewness, but it is also important to point out that some of the commonly drawn lessons do not hold in all circumstances.

As indicated earlier, one wrong lesson from Figure 1 and Figure 2 is that “mode < median < mean" characterizes a right skewed distribution and that "mean < median < mode" characterizes a left skewed distribution. In both Figure 1 and Figure 2, the mean is further out in the long tail than the median. In certain textbooks, these two observations are even stated as characterizations of right skew and left skew. Such a rule of thumb is easy to state and easy to apply. For some students, it seems to clarify how skewness works: checking for skewness becomes simply a matter of finding the relative position of the mean and median (in such thinking, if the mean is greater than the median, then the distribution is right skewed).

Any discussion of skewness should point out that the simple rule described in the above paragraph, though useful in many situations, is imperfect and may not apply outside of certain familiar distributions. For a good discussion on this issue, see this article.

We highlight one example found in the article mentioned above. This example demonstrates a clear violation of the common misconception indicated above. The following is the density function of the example.

$\displaystyle f_p(x)=\left\{\begin{matrix} \displaystyle (1-p)\biggl(1+\frac{1-p}{2p} x \biggr)&\ \ \ \ \ \ \frac{-2p}{1-p} \le x \le 0 \\{\text{ }}& \\ (1-p) e^{-x}&\ \ \ \ \ \ x>0 \end{matrix}\right.$

where $0<p<1$. To facilitate the discussion, let $X_p$ be the random variable whose PDF is $f_p(x)$ defined above. The above density function is a juxtaposition of a triangular density and an exponential density. This triangular-exponential distribution has a positive coefficient of skewness for sufficiently small $p$ (including the values $p=0.6$ and $p=0.7$ used below). Yet within this range for $p$, the mean can be made to fall on either side of the median. We consider three cases: $p=0.7$, $p=0.6$ and $p=0.9$.

Example 6
First the case $p=0.7$.

$\displaystyle f_{0.7}(x)=\left\{\begin{matrix} \displaystyle 0.3\biggl(1+\frac{3}{14} x \biggr)&\ \ \ \ \ \ \frac{-14}{3} \le x \le 0 \\{\text{ }}& \\ 0.3 e^{-x}&\ \ \ \ \ \ x>0 \end{matrix}\right.$

The following is the graph of the density curve $f_{0.7}(x)$. The right tail is long since the exponential distribution is on the right side. However, the left side is heavier (with 70% of the weight on the triangle on the left side).

Figure 5

The following shows the results for the density function $f_{0.7}(x)$.

$\displaystyle E(X_{0.7})=\frac{-21.3}{27}=-0.7889$

$\displaystyle \text{median of } X_{0.7} = -0.722613478$

$\displaystyle E(X_{0.7}^2)=\frac{254.4}{81}=3.140740741$

$\displaystyle Var(X_{0.7})=\frac{1835.91}{3^6}$

$\displaystyle E(X_{0.7}^3)=\frac{-2152.2}{405}$

$\displaystyle \gamma_1=\frac{111906.63}{5 \cdot 1835.91^{1.5}}=0.284517335$

The calculation confirms the positive skew (0.2845), a moderate positive skewness. Note that the mean is to the left of the median. Both the mean and the median are to the left of the mode (at x = 0). In Figure 5, the right side is infinitely long, so this is a positively skewed distribution (as confirmed by the calculation of $\gamma_1$). According to the common notion of how right skew should work, the mean should be further out on the right tail. But this is not the case: the mean is further out on the left side than the median. The violation of the common conception of skewness can occur when one tail is long but the other side is heavier.
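The moment calculations in these counterexamples can be packaged in a short script. Integrating each piece of $f_p$ directly (my derivation, not from the post), the triangle on $[-c, 0]$ with $c=2p/(1-p)$ contributes $(1-p)(-1)^k c^{k+1}/((k+1)(k+2))$ to the $k$th raw moment, and the exponential tail contributes $(1-p) \, k!$:

```python
import math

def tri_exp_raw_moments(p):
    """First three raw moments of f_p: triangle part on [-c, 0] with
    c = 2p/(1-p), plus the exponential tail on (0, infinity)."""
    c = 2 * p / (1 - p)
    return [(1 - p) * ((-1) ** k * c ** (k + 1) / ((k + 1) * (k + 2))
                       + math.factorial(k))
            for k in (1, 2, 3)]

def tri_exp_skewness(p):
    m1, m2, m3 = tri_exp_raw_moments(p)
    var = m2 - m1 ** 2
    return (m3 - 3 * m1 * var - m1 ** 3) / var ** 1.5
```

The same function reproduces the later cases $p=0.6$ and $p=0.9$ as well.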

Example 7
Now the case $p=0.6$.

$\displaystyle f_{0.6}(x)=\left\{\begin{matrix} \displaystyle 0.4\biggl(1+\frac{1}{3} x \biggr)&\ \ \ \ \ \ -3 \le x \le 0 \\{\text{ }}& \\ 0.4 e^{-x}&\ \ \ \ \ \ x>0 \end{matrix}\right.$

The following is the graph of the density curve $f_{0.6}(x)$. The right tail is long since the exponential distribution is on the right side. The left side is still heavy but a little less so than in the previous example (with 60% of the weight on the triangle on the left side).

Figure 6

The following shows the results for the density function $f_{0.6}(x)$.

$\displaystyle E(X_{0.6})=-0.2$

$\displaystyle \text{median of } X_{0.6} = -0.261387212$

$\displaystyle E(X_{0.6}^2)=1.7$

$\displaystyle Var(X_{0.6})=1.66$

$\displaystyle E(X_{0.6}^3)=0.78$

$\displaystyle \gamma_1=0.834128035$

The density curve $f_{0.6}(x)$ has a stronger positive skew than the previous example as there is a little more weight on the exponential side (the right side). Even though the mean in this case is to the right of the median, both the mean and median are not on the right tail but on the left triangular side (the heavier side). In any case, the mean is definitely not further out on the longer tail (the right tail) as the common rule of thumb would suggest.

Both Example 6 and Example 7 are right skewed distributions that do not conform to the common expectation about right skewed distributions. The following example shows that even the direction of the skew can defy expectations.

Example 8
Here we use $p=0.9$ so that there is still a long right tail but 90% of the weight is on the other side.

$\displaystyle f_{0.9}(x)=\left\{\begin{matrix} \displaystyle 0.1\biggl(1+\frac{1}{18} x \biggr)&\ \ \ \ \ \ -18 \le x \le 0 \\{\text{ }}& \\ 0.1 e^{-x}&\ \ \ \ \ \ x>0 \end{matrix}\right.$

The overall shape of the $f_{0.9}(x)$ is similar to Figure 5 and Figure 6. The following shows the results for the density function $f_{0.9}(x)$.

$\displaystyle E(X_{0.9})=-5.3$

$\displaystyle E(X_{0.9}^2)=48.8$

$\displaystyle Var(X_{0.9})=20.71$

$\displaystyle E(X_{0.9}^3)=-524.28$

$\displaystyle \gamma_1=-0.489285839$

Because there is so little weight on the right tail, the skewness is actually negative (-0.48928). Here we have a right skewed looking distribution that is actually skewed to the left!

________________________________________________________________________

Remarks

Examples 6 through 8 demonstrate that when one tail is long but the other side is heavy, the common conception of right skew and left skew does not apply. The common conception, as discussed earlier, is that both the mean and the median are located in the longer tail, with the mean further out in the long tail than the median. The article mentioned earlier is easy to read and gives a fuller discussion of the issues involved in the notion of skewness. The common conception of skewness can also be easily violated in discrete distributions, especially when the weights on both sides of the median are not equal. All the above examples are unimodal distributions. According to the quoted article, bimodal or multimodal distributions can be problematic too.

Of course, the caveat presented here is not meant to discourage anyone from discussing the common conception about skewness. The common conception about the locations of mode, mean and median conveys useful intuition and we should continue to teach it. But the rule of thumb should definitely not be presented as gospel truth, as some textbooks have done. Instead, it should be pointed out that the rule is imperfect, and it would be helpful to discuss why.

________________________________________________________________________

Practice problems

Practice problems to reinforce the calculation are found in the companion blog to this blog.

________________________________________________________________________
$\copyright \ \text{2015 by Dan Ma}$

# Calculating order statistics using multinomial probabilities

Consider a random sample $X_1,X_2,\cdots,X_n$ drawn from a continuous distribution. Rank the sample items in increasing order, resulting in a ranked sample $Y_1<Y_2<\cdots<Y_n$ where $Y_1$ is the smallest sample item, $Y_2$ is the second smallest sample item and so on. The items in the ranked sample are called the order statistics. Recently the author of this blog was calculating a conditional probability such as $P(Y_2>4 \ | \ Y_2>3)$. One way to do this is to calculate the distribution function $P(Y_2 \le t)$. What about the probability $P(Y_5>4 \ | \ Y_2>3)$? Since this one involves two order statistics, the author of this blog initially thought that calculating $P(Y_5>4 \ | \ Y_2>3)$ would require knowing the joint probability distribution of the order statistics $Y_1,Y_2,\cdots ,Y_n$. It turns out that a joint distribution may not be needed. Instead, we can calculate a conditional probability such as $P(Y_5>4 \ | \ Y_2>3)$ using multinomial probabilities. In this post, we demonstrate how this is done using examples. Practice problems are found here.

The calculation described here can be lengthy and tedious if the sample size is large. To make the calculation more manageable, the examples here have relatively small sample size. To keep the multinomial probabilities easier to calculate, the random samples are drawn from a uniform distribution. The calculation for larger sample sizes from other distributions is better done using a technology solution. In any case, the calculation described here is a great way to practice working with order statistics and multinomial probabilities.

________________________________________________________________________

The multinomial angle

In this post, the order statistics $Y_1<Y_2<\cdots<Y_n$ result from ranking the random sample $X_1,X_2,\cdots,X_n$, which is drawn from a continuous distribution with distribution function $F(x)=P(X \le x)$. For the $j$th order statistic, the calculation often begins with its distribution function $P(Y_j \le t)$.

Here’s the thought process for calculating $P(Y_j \le t)$. In drawing the random sample $X_1,X_2,\cdots,X_n$, make a note of the items $\le t$ and the items $>t$. For the event $Y_j \le t$ to happen, there must be at least $j$ sample items $X_i$ that are $\le t$. For the event $Y_j > t$ to happen, there can be at most $j-1$ sample items $X_i$ that are $\le t$. So to calculate $P(Y_j \le t)$, simply find the probability of having $j$ or more sample items $\le t$. To calculate $P(Y_j > t)$, find the probability of having at most $j-1$ sample items $\le t$.

$\displaystyle P(Y_j \le t)=\sum \limits_{k=j}^n \binom{n}{k} \ \biggl[ F(t) \biggr]^k \ \biggl[1-F(t) \biggr]^{n-k} \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)$

$\displaystyle P(Y_j > t)=\sum \limits_{k=0}^{j-1} \binom{n}{k} \ \biggl[ F(t) \biggr]^k \ \biggl[1-F(t) \biggr]^{n-k} \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)$

Both (1) and (2) involve binomial probabilities and are discussed in this previous post. The probability of success is $F(t)=P(X \le t)$ since we are interested in how many sample items that are $\le t$. Both the calculations (1) and (2) are based on counting the number of sample items in the two intervals $\le t$ and $>t$. It turns out that when the probability that is desired involves more than one $Y_j$, we can also count the number of sample items that fall into some appropriate intervals and apply some appropriate multinomial probabilities. Let’s use an example to illustrate the idea.
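Formulas (1) and (2) translate directly into code. The following sketch (the helper name is mine) computes $P(Y_j \le t)$ by summing binomial probabilities with success probability $F(t)$:

```python
from math import comb

def order_stat_cdf(n, j, Ft):
    """P(Y_j <= t): probability of j or more of the n sample items
    landing at or below t, where Ft = F(t) -- formula (1)."""
    return sum(comb(n, k) * Ft ** k * (1 - Ft) ** (n - k)
               for k in range(j, n + 1))

# Sanity checks: the minimum and maximum have simple closed forms.
p_min = order_stat_cdf(10, 1, 0.5)    # 1 - (1 - 0.5)**10
p_max = order_stat_cdf(10, 10, 0.5)   # 0.5**10
```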

Example 1
Draw a random sample $X_1,X_2,\cdots,X_{10}$ from the uniform distribution $U(0,4)$. The resulting order statistics are $Y_1<Y_2<\cdots<Y_{10}$. Find the following probabilities:

• $P(Y_4<2<Y_5 \text{ and } Y_6<3<Y_7)$
• $P(Y_4<2<Y_6<3<Y_7)$

For both probabilities, the range of the distribution is broken up into 3 intervals, (0, 2), (2, 3) and (3, 4). Each sample item has probabilities $\frac{2}{4}$, $\frac{1}{4}$, $\frac{1}{4}$ of falling into these intervals, respectively. Multinomial probabilities are calculated on these 3 intervals. It is a matter of counting the numbers of sample items falling into each interval.

The first probability involves the event that there are 4 sample items in the interval (0, 2), 2 sample items in the interval (2, 3) and 4 sample items in the interval (3, 4). Thus the first probability is the following multinomial probability:

\displaystyle \begin{aligned} P(Y_4<2<Y_5 \text{ and } Y_6<3<Y_7)&=\frac{10!}{4! \ 2! \ 4!} \ \biggl[\frac{2}{4} \biggr]^4 \ \biggl[\frac{1}{4} \biggr]^2 \ \biggl[\frac{1}{4} \biggr]^4 \\&=\frac{3150}{65536}=0.048065 \end{aligned}

For the second probability, $Y_5$ does not have to be greater than 2. Thus there could be 5 sample items less than 2. So we need to add one more case to the above probability (5 sample items to the first interval, 1 sample item to the second interval and 4 sample items to the third interval).

\displaystyle \begin{aligned} P(Y_4<2<Y_6<3<Y_7)&=\frac{10!}{4! \ 2! \ 4!} \ \biggl[\frac{2}{4} \biggr]^4 \ \biggl[\frac{1}{4} \biggr]^2 \ \biggl[\frac{1}{4} \biggr]^4+\frac{10!}{5! \ 1! \ 4!} \ \biggl[\frac{2}{4} \biggr]^5 \ \biggl[\frac{1}{4} \biggr]^1 \ \biggl[\frac{1}{4} \biggr]^4 \\&=\frac{3150}{65536}+\frac{2520}{65536} \\&=\frac{5670}{65536}=0.086517 \end{aligned}
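Both multinomial probabilities can be verified with a few lines of Python (the helper is my own; `scipy.stats.multinomial` would do the same job):

```python
from math import factorial

def multinomial_prob(counts, probs):
    """Probability that n = sum(counts) sample items land in the
    intervals with the given counts."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)
    out = float(coef)
    for c, q in zip(counts, probs):
        out *= q ** c
    return out

interval_probs = (2 / 4, 1 / 4, 1 / 4)        # (0,2), (2,3), (3,4)
first = multinomial_prob((4, 2, 4), interval_probs)
second = first + multinomial_prob((5, 1, 4), interval_probs)
```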

Example 2
Draw a random sample $X_1,X_2,\cdots,X_6$ from the uniform distribution $U(0,4)$. The resulting order statistics are $Y_1<Y_2<\cdots<Y_6$. Find the probability $P(1<Y_2<Y_4<3)$.

In this problem the range of the distribution is broken up into 3 intervals (0, 1), (1, 3) and (3, 4). Each sample item has probabilities $\frac{1}{4}$, $\frac{2}{4}$, $\frac{1}{4}$ of falling into these intervals, respectively. Multinomial probabilities are calculated on these 3 intervals. It is a matter of counting the numbers of sample items falling into each interval. The counting is a little bit more involved here than in the previous example.

The example is to find the probability that both the second order statistic $Y_2$ and the fourth order statistic $Y_4$ fall into the interval $(1,3)$. To solve this, determine how many sample items that fall into the interval $(0,1)$, $(1,3)$ and $(3,4)$. The following points detail the counting.

• For the event $1<Y_2$ to happen, there can be at most 1 sample item in the interval $(0,1)$.
• For the event $Y_4<3$ to happen, there must be at least 4 sample items in the interval $(0,3)$. Thus if the interval $(0,1)$ has 1 sample item, the interval $(1,3)$ has at least 3 sample items. If the interval $(0,1)$ has no sample item, the interval $(1,3)$ has at least 4 sample items.

The following lists out all the cases that satisfy the above two bullet points. The notation $[a, b, c]$ means that $a$ sample items fall into $(0,1)$, $b$ sample items fall into the interval $(1,3)$ and $c$ sample items fall into the interval $(3,4)$. So $a+b+c=6$. Since the sample items are drawn from $U(0,4)$, the probabilities of a sample item falling into intervals $(0,1)$, $(1,3)$ and $(3,4)$ are $\frac{1}{4}$, $\frac{2}{4}$ and $\frac{1}{4}$, respectively.

[0, 4, 2]
[0, 5, 1]
[0, 6, 0]
[1, 3, 2]
[1, 4, 1]
[1, 5, 0]

$\displaystyle [0, 4, 2] \ \ \ \ \ \ \frac{6!}{0! \ 4! \ 2!} \ \biggl[\frac{1}{4} \biggr]^0 \ \biggl[\frac{2}{4} \biggr]^4 \ \biggl[\frac{1}{4} \biggr]^2=\frac{240}{4096}$

$\displaystyle [0, 5, 1] \ \ \ \ \ \ \frac{6!}{0! \ 5! \ 1!} \ \biggl[\frac{1}{4} \biggr]^0 \ \biggl[\frac{2}{4} \biggr]^5 \ \biggl[\frac{1}{4} \biggr]^1=\frac{192}{4096}$

$\displaystyle [0, 6, 0] \ \ \ \ \ \ \frac{6!}{0! \ 6! \ 0!} \ \biggl[\frac{1}{4} \biggr]^0 \ \biggl[\frac{2}{4} \biggr]^6 \ \biggl[\frac{1}{4} \biggr]^0=\frac{64}{4096}$

$\displaystyle [1, 3, 2] \ \ \ \ \ \ \frac{6!}{1! \ 3! \ 2!} \ \biggl[\frac{1}{4} \biggr]^1 \ \biggl[\frac{2}{4} \biggr]^3 \ \biggl[\frac{1}{4} \biggr]^2=\frac{480}{4096}$

$\displaystyle [1, 4, 1] \ \ \ \ \ \ \frac{6!}{1! \ 4! \ 1!} \ \biggl[\frac{1}{4} \biggr]^1 \ \biggl[\frac{2}{4} \biggr]^4 \ \biggl[\frac{1}{4} \biggr]^1=\frac{480}{4096}$

$\displaystyle [1, 5, 0] \ \ \ \ \ \ \frac{6!}{1! \ 5! \ 0!} \ \biggl[\frac{1}{4} \biggr]^1 \ \biggl[\frac{2}{4} \biggr]^5 \ \biggl[\frac{1}{4} \biggr]^0=\frac{192}{4096}$

$\displaystyle P(1<Y_2<Y_4<3)=\frac{240+192+64+480+480+192}{4096}=\frac{1648}{4096}=0.4023$

So in randomly drawing 6 items from the uniform distribution $U(0,4)$, there is a 40% chance that the second order statistic and the fourth order statistic are between 1 and 3.
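Rather than listing the cases by hand, the two constraints (at most 1 item in (0, 1), at least 4 items below 3) can be enumerated programmatically. A sketch, with helper names of my choosing:

```python
from math import factorial

def multinomial_prob(counts, probs):
    """Multinomial probability of the given interval counts."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)
    out = float(coef)
    for c, q in zip(counts, probs):
        out *= q ** c
    return out

# Cases [a, b, c]: a items in (0,1), b in (1,3), c in (3,4), n = 6,
# subject to a <= 1 (so 1 < Y_2) and a + b >= 4 (so Y_4 < 3).
cases = [(a, b, 6 - a - b) for a in (0, 1)
         for b in range(0, 7 - a) if a + b >= 4]
answer = sum(multinomial_prob(c, (1 / 4, 2 / 4, 1 / 4)) for c in cases)
```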

________________________________________________________________________

More examples

The method described by the above examples is as follows. Based on the event described by the probability in question, the entire range of the distribution is broken up into several intervals. Imagine the sample items $X_i$ being randomly thrown into these intervals (i.e. we are sampling from a uniform distribution). Then multinomial probabilities are calculated to account for all the different ways sample items can land in these intervals. The following examples further illustrate this idea.

Example 3
Draw a random sample $X_1,X_2,\cdots,X_7$ from the uniform distribution $U(0,5)$. The resulting order statistics are $Y_1<Y_2<\cdots<Y_7$. Find the following probabilities:

• $P(1<Y_1<3<Y_4<4)$
• $P(3<Y_4<4 \ | \ 1<Y_1<3)$

The range is broken up into the intervals (0, 1), (1, 3), (3, 4) and (4, 5). The sample items fall into these intervals with probabilities $\frac{1}{5}$, $\frac{2}{5}$, $\frac{1}{5}$ and $\frac{1}{5}$. The following details the counting for the event $1<Y_1<3<Y_4<4$:

• There are no sample items in (0, 1) since $1<Y_1$.
• Based on $Y_1<3<Y_4$, there are at least one and at most 3 sample items in (0, 3). Thus in the interval (1, 3), there are at least one and at most 3 sample items, since there are none in (0, 1).
• Based on $Y_4<4$, there are at least 4 sample items in the interval (0, 4). Thus the count in (3, 4) combines with the count in (1, 3) must be at least 4.
• The interval (4, 5) simply receives the left over sample items not in the previous intervals.

The notation $[a, b, c, d]$ lists out the counts in the 4 intervals. The following lists out all the cases described by the above bullet points along with the corresponding multinomial probabilities, with two of the probabilities set up.

$\displaystyle [0, 1, 3, 3] \ \ \ \ \ \ \frac{280}{78125}=\frac{7!}{0! \ 1! \ 3! \ 3!} \ \biggl[\frac{1}{5} \biggr]^0 \ \biggl[\frac{2}{5} \biggr]^1 \ \biggl[\frac{1}{5} \biggr]^3 \ \biggl[\frac{1}{5} \biggr]^3$

$\displaystyle [0, 1, 4, 2] \ \ \ \ \ \ \frac{210}{78125}$

$\displaystyle [0, 1, 5, 1] \ \ \ \ \ \ \frac{84}{78125}$

$\displaystyle [0, 1, 6, 0] \ \ \ \ \ \ \frac{14}{78125}$

$\displaystyle [0, 2, 2, 3] \ \ \ \ \ \ \frac{840}{78125}$

$\displaystyle [0, 2, 3, 2] \ \ \ \ \ \ \frac{840}{78125}$

$\displaystyle [0, 2, 4, 1] \ \ \ \ \ \ \frac{420}{78125}$

$\displaystyle [0, 2, 5, 0] \ \ \ \ \ \ \frac{84}{78125}$

$\displaystyle [0, 3, 1, 3] \ \ \ \ \ \ \frac{1120}{78125}=\frac{7!}{0! \ 3! \ 1! \ 3!} \ \biggl[\frac{1}{5} \biggr]^0 \ \biggl[\frac{2}{5} \biggr]^3 \ \biggl[\frac{1}{5} \biggr]^1 \ \biggl[\frac{1}{5} \biggr]^3$

$\displaystyle [0, 3, 2, 2] \ \ \ \ \ \ \frac{1680}{78125}$

$\displaystyle [0, 3, 3, 1] \ \ \ \ \ \ \frac{1120}{78125}$

$\displaystyle [0, 3, 4, 0] \ \ \ \ \ \ \frac{280}{78125}$

Summing all the probabilities, $\displaystyle P(1<Y_1<3<Y_4<4)=\frac{6972}{78125}$. Out of the 78125 many different ways the 7 sample items can land in these 4 intervals, 6972 of them satisfy the event $1<Y_1<3<Y_4<4$.
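The case enumeration above can be checked programmatically. The following is a minimal Python sketch (variable and function names are my own) that enumerates every way the 7 sample items can fall into the 4 intervals and, using exact arithmetic, sums the multinomial probabilities over the cases satisfying the event:

```python
from fractions import Fraction
from math import factorial
from itertools import product

# Interval probabilities for (0,1), (1,3), (3,4), (4,5) under U(0,5)
probs = [Fraction(1, 5), Fraction(2, 5), Fraction(1, 5), Fraction(1, 5)]
n = 7

def multinomial(counts):
    # multinomial coefficient n! / (a! b! c! d!)
    r = factorial(n)
    for c in counts:
        r //= factorial(c)
    return r

total = Fraction(0)
for counts in product(range(n + 1), repeat=4):
    a, b, c, d = counts
    if a + b + c + d != n:
        continue
    # Event 1 < Y_1 < 3 < Y_4 < 4:
    # no items in (0,1); 1 to 3 items in (1,3); at least 4 items in (0,4)
    if a == 0 and 1 <= b <= 3 and b + c >= 4:
        p = multinomial(counts)
        for cnt, pr in zip(counts, probs):
            p *= pr ** cnt
        total += p

print(total)  # 6972/78125
```

The 12 qualifying cases found by the loop are exactly the 12 tuples listed above.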

++++++++++++++++++++++++++++++++++

We now calculate the second probability in Example 3.

$\displaystyle P(3<Y_4<4 \ | \ 1<Y_1<3)=\frac{P(1<Y_1<3<Y_4<4)}{P(1<Y_1<3)}$

First calculate $P(1<Y_1<3)$. The probability $P(Y_1<t)$ is the probability of having at least 1 sample item less than $t$, which is the complement of the probability that all sample items are greater than $t$.

\displaystyle \begin{aligned} P(1<Y_1<3)&=P(Y_1<3)-P(Y_1<1) \\&=\biggl[1-\biggl(\frac{2}{5}\biggr)^7 \biggr]-\biggl[1-\biggl(\frac{4}{5}\biggr)^7 \biggr] \\&=\biggl(\frac{4}{5}\biggr)^7-\biggl(\frac{2}{5}\biggr)^7 \\&=\frac{16384-128}{78125}=\frac{16256}{78125} \end{aligned}

The event $1<Y_1<3$ can occur in 16256 ways. Out of these many ways, 6972 of them satisfy the event $1<Y_1<3<Y_4<4$. Thus we have:

$\displaystyle P(3<Y_4<4 \ | \ 1<Y_1<3)=\frac{6972/78125}{16256/78125}=\frac{6972}{16256}=0.4289$
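As a check, the denominator $P(1<Y_1<3)$ can be computed directly from the distribution of the sample minimum. A short Python sketch with exact fractions (the helper name `cdf_min` is mine; the joint probability 6972/78125 is taken from the sum in the text):

```python
from fractions import Fraction

# For a sample of size 7 from U(0,5): P(Y_1 < t) = 1 - ((5-t)/5)^7
def cdf_min(t):
    return 1 - Fraction(5 - t, 5) ** 7

p_joint = Fraction(6972, 78125)      # P(1 < Y_1 < 3 < Y_4 < 4) from the text
p_min = cdf_min(3) - cdf_min(1)      # P(1 < Y_1 < 3)
print(p_min)            # 16256/78125
print(p_joint / p_min)  # 1743/4064, i.e. 6972/16256 reduced
```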

Example 4
Draw a random sample $X_1,X_2,X_3,X_4,X_5$ from the uniform distribution $U(0,5)$. The resulting order statistics are $Y_1<Y_2<Y_3<Y_4<Y_5$. Consider the conditional random variable $Y_4 \ | \ Y_2 >3$. For this conditional distribution, find the following:

• $P( Y_4 \le t \ | \ Y_2 >3)$
• $f_{Y_4}(t \ | \ Y_2 >3)$
• $E(Y_4 \ | \ Y_2 >3)$

where $3<t<5$. Note that $f_{Y_4}(t \ | \ Y_2 >3)$ is the density function of $Y_4 \ | \ Y_2 >3$.

Note that

$\displaystyle P( Y_4 \le t \ | \ Y_2 >3)=\frac{P(3<Y_2 \text{ and } Y_4 \le t)}{P(Y_2 >3)}$

In finding $P(3<Y_2 \text{ and } Y_4 \le t)$, the range (0, 5) is broken up into 3 intervals (0, 3), (3, t) and (t, 5). The sample items fall into these intervals with probabilities $\frac{3}{5}$, $\frac{t-3}{5}$ and $\frac{5-t}{5}$.

Since $Y_2 >3$, there is at most 1 sample item in the interval (0, 3). Since $Y_4 \le t$, there are at least 4 sample items in the interval (0, t). So the count in the interval (3, t) and the count in (0, 3) should add up to 4 or more items. The following shows all the cases for the event $3<Y_2 \text{ and } Y_4 \le t$ along with the corresponding multinomial probabilities.

$\displaystyle [0, 4, 1] \ \ \ \ \ \ \frac{5!}{0! \ 4! \ 1!} \ \biggl[\frac{3}{5} \biggr]^0 \ \biggl[\frac{t-3}{5} \biggr]^4 \ \biggl[\frac{5-t}{5} \biggr]^1$

$\displaystyle [0, 5, 0] \ \ \ \ \ \ \frac{5!}{0! \ 5! \ 0!} \ \biggl[\frac{3}{5} \biggr]^0 \ \biggl[\frac{t-3}{5} \biggr]^5 \ \biggl[\frac{5-t}{5} \biggr]^0$

$\displaystyle [1, 3, 1] \ \ \ \ \ \ \frac{5!}{1! \ 3! \ 1!} \ \biggl[\frac{3}{5} \biggr]^1 \ \biggl[\frac{t-3}{5} \biggr]^3 \ \biggl[\frac{5-t}{5} \biggr]^1$

$\displaystyle [1, 4, 0] \ \ \ \ \ \ \frac{5!}{1! \ 4! \ 0!} \ \biggl[\frac{3}{5} \biggr]^1 \ \biggl[\frac{t-3}{5} \biggr]^4 \ \biggl[\frac{5-t}{5} \biggr]^0$

After carrying out the algebra and simplifying, we have the following:

$\displaystyle P(3<Y_2 \text{ and } Y_4 \le t)=\frac{-4t^5+25t^4+180t^3-1890t^2+5400t-5103}{3125}$

For the event $Y_2 >3$ to happen, there is at most 1 sample item less than 3. So we have:

$\displaystyle P(Y_2 >3)=\binom{5}{0} \ \biggl[\frac{3}{5} \biggr]^0 \ \biggl[\frac{2}{5} \biggr]^5 +\binom{5}{1} \ \biggl[\frac{3}{5} \biggr]^1 \ \biggl[\frac{2}{5} \biggr]^4=\frac{272}{3125}$

$\displaystyle P( Y_4 \le t \ | \ Y_2 >3)=\frac{-4t^5+25t^4+180t^3-1890t^2+5400t-5103}{272}$

Then the conditional density is obtained by differentiating $P( Y_4 \le t \ | \ Y_2 >3)$.

$\displaystyle f_{Y_4}(t \ | \ Y_2 >3)=\frac{-20t^4+100t^3+540t^2-3780t+5400}{272}$

The following gives the conditional mean $E(Y_4 \ | \ Y_2 >3)$.

\displaystyle \begin{aligned} E(Y_4 \ | \ Y_2 >3)&=\frac{1}{272} \ \int_3^5 t(-20t^4+100t^3+540t^2-3780t+5400) \ dt \\&=\frac{215}{51}=4.216 \end{aligned}

To contrast, the following gives the information on the unconditional distribution of $Y_4$.

$\displaystyle f_{Y_4}(t)=\frac{5!}{3! \ 1! \ 1!} \ \biggl[\frac{t}{5} \biggr]^3 \ \biggl[\frac{1}{5} \biggr] \ \biggl[ \frac{5-t}{5} \biggr]^1=\frac{20}{3125} \ (5t^3-t^4)$

$\displaystyle E(Y_4)=\frac{20}{3125} \ \int_0^5 t(5t^3-t^4) \ dt=\frac{10}{3}=3.33$

The unconditional mean of $Y_4$ is about 3.33. With the additional information that $Y_2 >3$, the average of $Y_4$ is now about 4.22. So a higher value of $Y_2$ pulls up the mean of $Y_4$.
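A quick Monte Carlo check of the conditional mean is sketched below (the simulation size and seed are arbitrary choices of mine):

```python
import random

random.seed(2015)
trials = 200_000
total = hits = 0
for _ in range(trials):
    # draw 5 items from U(0,5) and sort to get the order statistics
    ys = sorted(random.uniform(0, 5) for _ in range(5))
    if ys[1] > 3:          # condition on Y_2 > 3
        total += ys[3]     # accumulate Y_4
        hits += 1

print(hits / trials)       # ≈ P(Y_2 > 3) = 272/3125 ≈ 0.087
print(total / hits)        # ≈ 215/51 ≈ 4.216
```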

________________________________________________________________________

Practice problems

Practice problems to reinforce the calculation are found in the problem blog, a companion blog to this blog.

________________________________________________________________________
$\copyright \ \text{2015 by Dan Ma}$

# Confidence intervals for San Francisco rainfall

When estimating population percentiles, there is a way to do it that is distribution free. Draw a random sample from the population of interest and take the middle element in the random sample as an estimate of the population median. Furthermore, we can even attach a confidence interval to this estimate of median without knowing (or assuming) a probability distribution of the underlying phenomenon. This “distribution free” method is shown in the post called Confidence intervals for percentiles. In this post, we give an additional example using annual rainfall data in San Francisco to illustrate this approach of non-parametric inference using order statistics.

________________________________________________________________________

San Francisco rainfall data

The following table shows the annual rainfall data in San Francisco (in inches) from 1960-2013 (data source). The table consists of 54 measurements and is sorted in increasing order from left to right (and from top to bottom). Each annual rainfall measurement is from July of that year to June of the following year. The driest year (7.97 inches) is 1975, the period from July 1975 to June 1976. The wettest year (47.22 inches) is 1997, which is the period from July 1997 to June 1998. The most recent data point is the fifth measurement 12.54 inches (the period from July 2013 to June 2014).

$\displaystyle \begin{bmatrix} 7.97&\text{ }&11.06&\text{ } &11.06&\text{ }&12.32&\text{ }&12.54 \\ 13.86&\text{ }&13.87&\text{ } &14.08&\text{ }&14.32&\text{ }&14.46 \\ 15.22&\text{ }&15.39&\text{ } &15.64&\text{ }&16.33&\text{ }&16.61 \\ 16.89&\text{ }&17.43&\text{ } &17.50&\text{ }&17.65&\text{ }&17.74 \\ 18.11&\text{ }&18.26&\text{ } &18.74&\text{ }&18.79&\text{ }&19.20 \\ 19.47&\text{ }&20.01&\text{ } &20.54&\text{ }&20.80&\text{ }&22.15 \\ 22.29&\text{ }&22.47&\text{ } &22.63&\text{ }&23.49&\text{ }&23.87 \\ 24.09&\text{ }&24.49&\text{ } &24.89&\text{ }&24.89&\text{ }&25.03 \\ 25.09&\text{ }&26.66&\text{ } &26.87&\text{ }&27.76&\text{ }&28.68 \\ 28.87&\text{ }&29.41&\text{ }&31.87&\text{ } &34.02&\text{ }&34.36 \\ 34.43&\text{ }&37.10&\text{ }&38.17&\text{ } &47.22&\text{ }&\text{ } \end{bmatrix}$

Using the above data, estimate the median, the lower quartile (25th percentile) and the upper quartile (the 75th percentile) of the annual rainfall in San Francisco. Then find a reasonably good confidence interval for each of the three population percentiles.

________________________________________________________________________

Let’s recall some basic facts from the following previous posts:

Let’s say we have a random sample $X_1,X_2,\cdots,X_n$ drawn from a population whose percentiles are unknown and we wish to estimate them. Rank the items of the random sample to obtain the order statistics $Y_1<Y_2<\cdots<Y_n$. In an ideal setting, the measurements are supposed to arise from a continuous distribution, so the chance of a tie among the $Y_j$ is zero. But this assumption may not always hold in practice. There are some ties in the San Francisco rainfall data (e.g. the second and third data points). The small number of ties will not affect the calculation performed below.

The reason that we can use the order statistics $Y_j$ to estimate the population percentiles is that the expected percentage of the population below $Y_j$ is about the same as the percentage of the sample items less than $Y_j$. According to the explanation in the second post listed above (link), the order statistic $Y_j$ is expected to be above $100p$ percent of the population where $p=\frac{j}{n+1}$. In fact, the order statistics $Y_1<Y_2<\cdots<Y_n$ are expected to divide the population into roughly equal segments. More specifically, the expected percentage of the population in between $Y_{j-1}$ and $Y_j$ is $100h$ where $h=\frac{1}{n+1}$.

The above explanation justifies the use of the order statistic $Y_j$ as the sample $100p$th percentile where $p=\frac{j}{n+1}$.

The sample size is $n=$ 54 in the San Francisco rainfall data. Thus the order statistic $Y_{11}$ is the sample 20th percentile and can be taken as an estimate of the population 20th percentile for the San Francisco annual rainfall. Here the realized value of $Y_{11}$ is 15.22.

With $\frac{45}{54+1}=0.818$, the order statistic $Y_{45}$ is the sample 82nd percentile and is taken as an estimate of the population 82nd percentile for the San Francisco annual rainfall. The realized value of $Y_{45}$ is 28.68 inches.

The key for constructing confidence intervals for percentiles is to calculate the probability $P(Y_i < \tau_p < Y_j)$. This is the probability that the $100p$th percentile, where $0<p<1$, is in between $Y_i$ and $Y_j$. Let's look at the median $\tau_{0.5}$. For $Y_i<\tau_{0.5}$ to happen, there must be at least $i$ many sample items less than the median $\tau_{0.5}$. For $\tau_{0.5}<Y_j$ to happen, there can be at most $j-1$ many sample items less than the median $\tau_{0.5}$. Thus in the random draws of the sample items, in order for the event $Y_i < \tau_{0.5} < Y_j$ to occur, there must be at least $i$ sample items and at most $j-1$ sample items that are less than $\tau_{0.5}$. In other words, in $n$ Bernoulli trials, there are at least $i$ and at most $j-1$ successes, where the probability of success is $P(X<\tau_{0.5})=$ 0.5. The following is the probability $P(Y_i < \tau_{0.5} < Y_j)$:

$\displaystyle P(Y_i < \tau_{0.5} < Y_j)=\sum \limits_{k=i}^{j-1} \binom{n}{k} \ 0.5^k \ 0.5^{n-k}=1 - \alpha$

Then the interval $Y_i < \tau_{0.5} < Y_j$ is taken to be the $100(1-\alpha)$% confidence interval for the unknown population median $\tau_{0.5}$. Note that this confidence interval is constructed without knowing (or assuming) anything about the underlying distribution of the population.

Consider the $100p$th percentile where $0<p<1$. In order for the event $Y_i < \tau_{p} < Y_j$ to occur, there must be at least $i$ sample items and at most $j-1$ sample items that are less than $\tau_{p}$. This is equivalent to $n$ Bernoulli trials resulting in at least $i$ successes and at most $j-1$ successes where the probability of success is $P(X<\tau_{p})=p$.

$\displaystyle P(Y_i < \tau_{p} < Y_j)=\sum \limits_{k=i}^{j-1} \binom{n}{k} \ p^k \ (1-p)^{n-k}=1 - \alpha$

Then the interval $Y_i < \tau_{p} < Y_j$ is taken to be the $100(1-\alpha)$% confidence interval for the unknown population percentile $\tau_{p}$. As mentioned earlier, this confidence interval does not rely on any information about the distribution of the population and is said to be distribution free. It only relies on a probability statement that involves the binomial distribution in describing the positioning of the sample items. In the past, people used the normal approximation to the binomial to estimate this probability. The normal approximation is no longer needed as computing software is now readily available. For example, binomial probabilities can be computed in Excel for a number of trials of a million or more.
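Since the confidence level is just a binomial sum, it can be computed in a few lines of any language with exact binomial coefficients. The following Python sketch (the function name is my own) is equivalent to the Excel approach mentioned above:

```python
from math import comb

def ci_confidence(n, i, j, p):
    """P(Y_i < tau_p < Y_j): probability that at least i and at most
    j-1 of the n sample items fall below the 100p-th percentile."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(i, j))

# Example: the confidence level of (Y_21, Y_34) for the median with n = 54
print(round(ci_confidence(54, 21, 34, 0.5), 4))  # 0.9241
```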

________________________________________________________________________

Percentiles of annual rainfall

Using the above data, estimate the median, the lower quartile (25th percentile) and the upper quartile (the 75th percentile) of the annual rainfall in San Francisco. Then find a reasonably good confidence interval for each of the three population percentiles.

The sample size is $n=$ 54. The middle two data elements in the sample are $y_{27}=20.01$ and $y_{28}=20.54$. They are realizations of the order statistics $Y_{27}$ and $Y_{28}$. In this example, $\frac{27}{54+1}=0.49$ and $\frac{28}{54+1}=0.509$. Thus the order statistic $Y_{27}$ is expected to be greater than about 49% of the population and $Y_{28}$ is expected to be greater than about 51% of the population. So neither $Y_{27}$ nor $Y_{28}$ is an exact fit. We take the average of the two as an estimate of the population median:

$\displaystyle \hat{\tau}_{0.5}=\frac{20.01+20.54}{2}=20.275$

Looking for confidence intervals, we consider the intervals $(Y_{21},Y_{34})$, $(Y_{20},Y_{35})$, $(Y_{19},Y_{36})$ and $(Y_{18},Y_{37})$. The following shows the confidence levels.

$\displaystyle P(Y_{21} < \tau_{0.5} < Y_{34})=\sum \limits_{k=21}^{33} \binom{54}{k} \ 0.5^k \ (0.5)^{54-k}=0.924095271$

$\displaystyle P(Y_{20} < \tau_{0.5} < Y_{35})=\sum \limits_{k=20}^{34} \binom{54}{k} \ 0.5^k \ (0.5)^{54-k}=0.959776436$

$\displaystyle P(Y_{19} < \tau_{0.5} < Y_{36})=\sum \limits_{k=19}^{35} \binom{54}{k} \ 0.5^k \ (0.5)^{54-k}=0.980165673$

$\displaystyle P(Y_{18} < \tau_{0.5} < Y_{37})=\sum \limits_{k=18}^{36} \binom{54}{k} \ 0.5^k \ (0.5)^{54-k}=0.99092666$

The above calculation is done in Excel. The binomial probabilities are done using the function BINOM.DIST. So we have the following confidence intervals for the median annual San Francisco rainfall in inches.

Median

$\displaystyle \hat{\tau}_{0.5}=\frac{20.01+20.54}{2}=20.275$

$(Y_{21},Y_{34})$ = (18.11, 23.49) with approximately 92% confidence

$(Y_{20},Y_{35})$ = (17.74, 23.87) with approximately 96% confidence

$(Y_{19},Y_{36})$ = (17.65, 24.09) with approximately 98% confidence

$(Y_{18},Y_{37})$ = (17.50, 24.49) with approximately 99% confidence

For the lower quartile and upper quartile, the following are the results. The reader is invited to confirm the calculation.

Lower quartile

$\displaystyle \hat{\tau}_{0.25}=15.985$, average of $Y_{13}$ and $Y_{14}$

$(Y_{7},Y_{20})$ = (13.87, 17.74) with approximately 96% confidence

$(Y_{6},Y_{21})$ = (13.86, 18.11) with approximately 98% confidence

$(Y_{5},Y_{22})$ = (12.54, 18.26) with approximately 99% confidence

Upper quartile

$\displaystyle \hat{\tau}_{0.75}=25.875$, average of $Y_{41}$ and $Y_{42}$

$(Y_{36},Y_{47})$ = (24.09, 29.41) with approximately 91% confidence

$(Y_{35},Y_{48})$ = (23.87, 31.87) with approximately 96% confidence

$(Y_{34},Y_{49})$ = (23.49, 34.02) with approximately 98% confidence

The following shows the calculation for two of the confidence intervals, one for $\tau_{0.25}$ and one for $\tau_{0.75}$.

$\displaystyle P(Y_{6} < \tau_{0.25} < Y_{21})=\sum \limits_{k=6}^{20} \binom{54}{k} \ 0.25^k \ (0.75)^{54-k}=0.979889918$

$\displaystyle P(Y_{34} < \tau_{0.75} < Y_{49})=\sum \limits_{k=34}^{48} \binom{54}{k} \ 0.75^k \ (0.25)^{54-k}=0.979889918$
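For readers wishing to confirm the quartile calculations, the two sums can be evaluated with a short Python sketch (the helper name is mine). By the symmetry $k \mapsto 54-k$, the two probabilities are equal:

```python
from math import comb

def ci_confidence(n, i, j, p):
    # P(Y_i < tau_p < Y_j) = P(at least i and at most j-1 items below tau_p)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(i, j))

lower = ci_confidence(54, 6, 21, 0.25)   # P(Y_6  < tau_0.25 < Y_21)
upper = ci_confidence(54, 34, 49, 0.75)  # P(Y_34 < tau_0.75 < Y_49)
print(round(lower, 5), round(upper, 5))  # 0.97989 0.97989
```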

________________________________________________________________________
$\copyright \ \text{2015 by Dan Ma}$

# Defining the Poisson distribution

The Poisson distribution is a family of discrete distributions with positive probabilities on the non-negative integers $0,1,2,\cdots$. Each distribution in this family is indexed by a positive parameter $\lambda>0$. One way to define this distribution is to give its probability function given the parameter $\lambda$ and then derive various distributional quantities such as mean and variance. Along with other mathematical facts, it can be shown that both the mean and the variance are $\lambda$. In this post, we take a different tack. We look at two viewpoints that give rise to the Poisson distribution. Taking this approach will make it easier to appreciate some of the possible applications of the Poisson distribution. The first viewpoint is that the Poisson distribution is the limiting case of the binomial distribution. The second viewpoint is through the Poisson process, a stochastic process that, under some conditions, counts the number of events and the time points at which these events occur in a given time (or physical) interval.

________________________________________________________________________

Poisson as a limiting case of binomial

A binomial distribution where the number of trials $n$ is large and the probability of success $p$ is small such that $np$ is moderate in size can be approximated using the Poisson distribution with mean $\lambda=np$. This fact follows from Theorem 1, which indicates that the Poisson distribution is the limiting case of the binomial distribution.

Theorem 1
Let $\lambda$ be a fixed positive constant. Then for each integer $x=0,1,2,\cdots$, the following is true:

$\displaystyle \lim_{n \rightarrow \infty} \binom{n}{x} \ p^x \ (1-p)^{n-x}=\lim_{n \rightarrow \infty} \frac{n!}{x! \ (n-x)!} \ p^x \ (1-p)^{n-x}=\frac{e^{-\lambda} \ \lambda^x}{x!}$

where $\displaystyle p=\frac{\lambda}{n}$.

Proof of Theorem 1
We start with a binomial distribution with $n$ trials and with $\displaystyle p=\frac{\lambda}{n}$ being the probability of success, where $n>\lambda$. Let $X_n$ be the count of the number of successes in these $n$ Bernoulli trials. The following is the probability that $X_n=k$.

\displaystyle \begin{aligned} P(X_n=k)&=\binom{n}{k} \biggl(\frac{\lambda}{n}\biggr)^k \biggr(1-\frac{\lambda}{n}\biggr)^{n-k} \\&=\frac{n!}{k! (n-k)!} \biggl(\frac{\lambda}{n}\biggr)^k \biggr(1-\frac{\lambda}{n}\biggr)^{n-k} \\&=\frac{n(n-1)(n-2) \cdots (n-k+1)}{n^k} \biggl(\frac{\lambda^k}{k!}\biggr) \biggr(1-\frac{\lambda}{n}\biggr)^{n} \biggr(1-\frac{\lambda}{n}\biggr)^{-k} \\&=\biggl(\frac{\lambda^k}{k!}\biggr) \ \biggl[ \frac{n(n-1)(n-2) \cdots (n-k+1)}{n^k} \ \biggr(1-\frac{\lambda}{n}\biggr)^{n} \ \biggr(1-\frac{\lambda}{n}\biggr)^{-k} \biggr] \end{aligned}

In the last step, the terms that contain $n$ are inside the square brackets. Let’s see what they are when $n$ approaches infinity.

$\displaystyle \lim \limits_{n \rightarrow \infty} \ \frac{n(n-1)(n-2) \cdots (n-k+1)}{n^k}=1$

$\displaystyle \lim \limits_{n \rightarrow \infty} \biggr(1-\frac{\lambda}{n}\biggr)^{n}=e^{-\lambda}$

$\displaystyle \lim \limits_{n \rightarrow \infty} \biggr(1-\frac{\lambda}{n}\biggr)^{-k}=1$

The reason that the first result is true is that the numerator is a polynomial where the leading term is $n^k$. Upon dividing by $n^k$ and taking the limit, we get 1. The second result is true since the following limit is one of the definitions of the exponential function $e^x$.

$\displaystyle \lim \limits_{n \rightarrow \infty} \biggr(1+\frac{x}{n}\biggr)^{n}=e^{x}$

The third result is true since the exponent $-k$ is a constant. Thus the following is the limit of the probability $P(X_n=k)$ as $n \rightarrow \infty$.

\displaystyle \begin{aligned} \lim \limits_{n \rightarrow \infty} P(X_n=k)&= \biggl(\frac{\lambda^k}{k!}\biggr) \ \lim \limits_{n \rightarrow \infty} \biggl[ \frac{n(n-1)(n-2) \cdots (n-k+1)}{n^k} \ \biggr(1-\frac{\lambda}{n}\biggr)^{n} \ \biggr(1-\frac{\lambda}{n}\biggr)^{-k} \biggr] \\&=\biggl(\frac{\lambda^k}{k!}\biggr) \cdot 1 \cdot e^{-\lambda} \cdot 1 \\&=\frac{e^{-\lambda} \lambda^k}{k!} \end{aligned}

The above derivation completes the proof. $\blacksquare$

In a given binomial distribution, whenever the number of trials $n$ is large and the probability $p$ of success in each trial is small (i.e. each Bernoulli trial rarely results in a success), Theorem 1 tells us that we can use the Poisson distribution with parameter $\lambda=np$ to estimate the binomial distribution.
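Theorem 1 can be illustrated numerically. The sketch below (the values of $\lambda$ and $k$ are arbitrary choices for illustration) shows the binomial probability approaching the Poisson probability as $n$ grows:

```python
from math import comb, exp, factorial

lam, k = 2.0, 3  # arbitrary illustration: lambda = 2, P(X = 3)

poisson = exp(-lam) * lam**k / factorial(k)
for n in (10, 100, 1000, 10000):
    p = lam / n  # p shrinks as n grows, keeping np = lambda fixed
    binom = comb(n, k) * p**k * (1 - p)**(n - k)
    print(n, binom)
print("poisson limit:", poisson)
```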

Example 1
The probability of being dealt a full house in a hand of poker is approximately 0.001441. Out of 5000 hands of poker that are dealt at a certain casino, what is the probability that there will be at most 4 full houses?

Let $X$ be the number of full houses in these 5000 poker hands. The exact distribution for $X$ is the binomial distribution with $n=$ 5000 and $p=$ 0.001441. This example deals with a large number of trials where each trial is a rare event. So the Poisson estimation is applicable. Let $\lambda=$ 5000(0.001441) = 7.205. Then $P(X \le 4)$ can be approximated by $P(Y \le 4)$, where $Y$ is a Poisson random variable with parameter $\lambda$. The following is the probability function of $Y$:

$\displaystyle P(Y=y)=e^{-7.205} \ \frac{7.205^y}{y!}$

The following is the approximation of $P(X \le 4)$:

\displaystyle \begin{aligned} P(X \le 4)&\approx P(Y \le 4) \\&=P(Y=0)+P(Y=1)+P(Y=2)+P(Y=3)+P(Y=4) \\&= e^{-7.205} \biggl[ 1+7.205+\frac{7.205^2}{2!}+\frac{7.205^3}{3!}+\frac{7.205^4}{4!}\biggr] \\&=0.155098087 \end{aligned}

The following is a side by side comparison between the binomial distribution and its Poisson approximation. For all practical purposes, they are indistinguishable from one another.

$\displaystyle \begin{bmatrix} \text{Count of}&\text{ }&\text{ }&\text{Binomial } &\text{ }&\text{ }&\text{Poisson } \\\text{Full Houses}&\text{ }&\text{ }&P(X \le x) &\text{ }&\text{ }&P(Y \le x) \\\text{ }&\text{ }&\text{ }&n=5000 &\text{ }&\text{ }&\lambda=7.205 \\\text{ }&\text{ }&\text{ }&p=0.001441 &\text{ }&\text{ }&\text{ } \\\text{ }&\text{ }&\text{ } &\text{ }&\text{ } \\ 0&\text{ }&\text{ }&0.000739012&\text{ }&\text{ }&0.000742862 \\ 1&\text{ }&\text{ }&0.006071278&\text{ }&\text{ }&0.006095184 \\ 2&\text{ }&\text{ }&0.025304641&\text{ }&\text{ }&0.025376925 \\ 3&\text{ }&\text{ }&0.071544923&\text{ }&\text{ }&0.071685238 \\ 4&\text{ }&\text{ }&0.154905379&\text{ }&\text{ }&0.155098087 \\ 5&\text{ }&\text{ }&0.275104906&\text{ }&\text{ }&0.275296003 \\ 6&\text{ }&\text{ }&0.419508250&\text{ }&\text{ }&0.419633667 \\ 7&\text{ }&\text{ }&0.568176421 &\text{ }&\text{ }&0.568198363 \\ 8&\text{ }&\text{ }&0.702076190 &\text{ }&\text{ }&0.701999442 \\ 9&\text{ }&\text{ }&0.809253326&\text{ }&\text{ }&0.809114639 \\ 10&\text{ }&\text{ }&0.886446690&\text{ }&\text{ }&0.886291139 \\ 11&\text{ }&\text{ }&0.936980038&\text{ }&\text{ }&0.936841746 \\ 12&\text{ }&\text{ }&0.967298041&\text{ }&\text{ }&0.967193173 \\ 13&\text{ }&\text{ }&0.984085073&\text{ }&\text{ }&0.984014868 \\ 14&\text{ }&\text{ }&0.992714372&\text{ }&\text{ }&0.992672033 \\ 15&\text{ }&\text{ }&0.996853671&\text{ }&\text{ }&0.996830358 \end{bmatrix}$

The above table is calculated using the functions BINOM.DIST and POISSON.DIST in Excel. The following shows how it is done. The parameter TRUE indicates that the result is a cumulative distribution. When it is set to FALSE, the formula gives the probability function.

$P(X \le x)=\text{BINOM.DIST(x, 5000, 0.001441, TRUE)}$

$P(Y \le x)=\text{POISSON.DIST(x, 7.205, TRUE)}$
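For readers without Excel, the same comparison can be made using only the Python standard library (a sketch; the variable names are mine):

```python
from math import comb, exp, factorial

n, p = 5000, 0.001441
lam = n * p  # 7.205

# P(X <= 4) exactly (binomial) and approximately (Poisson)
binom_cdf = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(5))
poisson_cdf = sum(exp(-lam) * lam**k / factorial(k) for k in range(5))
print(binom_cdf)    # ≈ 0.154905
print(poisson_cdf)  # ≈ 0.155098
```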

________________________________________________________________________

The Poisson distribution

The limit in Theorem 1 is a probability function and the resulting distribution is called the Poisson distribution. We now give the formal definition. A random variable $X$ that takes on one of the numbers $0,1,2,\cdots$ is said to be a Poisson random variable with parameter $\lambda>0$ if

$\displaystyle P(X=x)=\frac{e^{-\lambda} \ \lambda^x}{x!} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ x=0,1,2,\cdots$

It can be shown that the above function is indeed a probability function, i.e., the probabilities sum to 1. Any random variable that has a probability function of the above form is said to follow (or to have) a Poisson distribution. Furthermore, it can be shown that $E(X)=var(X)=\lambda$, i.e., the Poisson parameter is both the mean and variance. Thus the Poisson distribution may be a good fit if the observed data indicate that the sample mean and the sample variance are nearly identical.

The following is the moment generating function of the Poisson distribution with parameter $\lambda$.

$\displaystyle M(t)=E(e^{tX})=e^{\lambda \ (e^t-1)}$

One consequence of the Poisson moment generating function is that the sum of independent Poisson random variables is again a Poisson random variable, with parameter equal to the sum of the individual parameters.

________________________________________________________________________

The Poisson process

Another way, the more important way, to look at the Poisson distribution is the view point of the Poisson process. Consider an experiment in which events that are of interest occur at random in a time interval. The goal here is to record the time of the occurrence of each random event and, for the purpose at hand, count the number of random events occurring in a fixed time interval. Starting at time 0, note the time of the occurrence of the first event. Then note the time at which the second random event occurs, and so on. Out of these measurements, we can derive the length of time between the occurrences of any two consecutive random events. Such measurements belong to a continuous random variable. In this post, we focus on the discrete random variable of the count of the random events in a fixed time interval.

A good example of a Poisson process is the well known experiment in radioactivity conducted by Rutherford and Geiger in 1910. In this experiment, $\alpha$-particles were emitted from a polonium source and the number of $\alpha$-particles was counted during intervals of 7.5 seconds (2608 such time intervals were observed). A Poisson process is a random process in which several criteria are satisfied. We will show that in a Poisson process, the number of these random occurrences in a fixed time interval follows a Poisson distribution. First, we discuss the criteria to which a Poisson process must conform.

One of the criteria is that in a very short time interval, the chance of having more than one random event is essentially zero. So either one random event will occur or none will occur in a very short time interval. Considering the occurrence of a random event as a success, there is either a success or a failure in a very short time interval. So a very short time interval in a Poisson process can be regarded as a Bernoulli trial.

The second criterion is that the experiment remains constant over time. Specifically, this means that the probability of a random event occurring in a given subinterval depends only on the length of that subinterval and not on where the subinterval is within the original interval. For example, in the 1910 radioactivity study, $\alpha$-particles were emitted at the rate of $\lambda=$ 3.87 per 7.5 seconds. So the probability of one $\alpha$-particle being emitted from the radioactive source in a one-second interval is 3.87/7.5 = 0.516. Then the probability of observing one $\alpha$-particle in a half-second interval is 0.516/2 = 0.258. For a quarter-second interval, the probability is 0.258/2 = 0.129. So if we observe half as long, it is half as likely that a random event occurs. On the other hand, it does not matter where the quarter-second subinterval is, whether at the beginning or toward the end of the original interval of 7.5 seconds.

The third criterion is that non-overlapping subintervals are mutually independent in the sense that what happens in one subinterval (i.e. the occurrence or non-occurrence of a random event) will have no influence on the occurrence of a random event in another subinterval. To summarize, the following are the three criteria of a Poisson process:

Suppose that on average $\lambda$ random events occur in a time interval of length 1.

1. The probability of having more than one random event occurring in a very short time interval is essentially zero.
2. For a very short subinterval of length $\frac{1}{n}$ where $n$ is a sufficiently large integer, the probability of a random event occurring in this subinterval is $\frac{\lambda}{n}$.
3. The numbers of random events occurring in non-overlapping time intervals are independent.

Consider a Poisson process in which the average rate is $\lambda$ random events per unit time interval. Let $Y$ be the number of random events occurring in the unit time interval. In the 1910 radioactivity study, the unit time interval is 7.5 seconds and $Y$ is the count of the number of $\alpha$-particles emitted in 7.5 seconds. It follows that $Y$ has a Poisson distribution with parameter $\lambda$. To see this, subdivide the unit interval into $n$ non-overlapping subintervals of equal length where $n$ is a sufficiently large integer. Let $X_{n,j}$ be the number of random events in the $j$th subinterval ($1 \le j \le n$). Based on the three assumptions, $X_{n,1},X_{n,2},\cdots,X_{n,n}$ are independent Bernoulli random variables, where the probability of success for each $X_{n,j}$ is $\frac{\lambda}{n}$. Then $X_n=X_{n,1}+X_{n,2}+\cdots+X_{n,n}$ has a binomial distribution with parameters $n$ and $p=\frac{\lambda}{n}$. Theorem 1 tells us that the limiting case of the binomial distributions for $X_n$ is the Poisson distribution with parameter $\lambda$. This Poisson distribution agrees with the distribution for $Y$. The Poisson process is also discussed in greater detail in the previous post called Poisson as a Limiting Case of Binomial Distribution.

We now examine the 1910 radioactivity study a little more closely.

Example 2
The basic idea of the 1910 radioactivity study conducted by Rutherford and Geiger is that a polonium source was placed a short distance from an observation point. The number of $\alpha$-particles emitted from the source was counted in 7.5-second intervals, 2608 times in all. The following table shows the results.

$\displaystyle \begin{bmatrix} \text{Number of alpha particles}&\text{ }&\text{Observed} \\ \text{recorded per 7.5 seconds }&\text{ }&\text{counts} \\ \text{ }&\text{ }&\text{ } \\ 0&\text{ }&57 \\ 1&\text{ }&203 \\ 2&\text{ }&383 \\ 3&\text{ }&525 \\ 4&\text{ }&532 \\ 5&\text{ }&408 \\ 6&\text{ }&273 \\ 7&\text{ }&139 \\ 8&\text{ }&45 \\ 9&\text{ }&27 \\ 10&\text{ }&10 \\ 11+&\text{ }&6 \\ \text{ }&\text{ }&\text{ } \\ \text{Total }&\text{ }&2608 \end{bmatrix}$

What is the average number of particles observed per 7.5 seconds? The total number of $\alpha$-particles in these 2608 periods is

$0 \times 57+1 \times 203+2 \times 383+ 3 \times 525 + \cdots=10097$.

The mean count per period is $\lambda=\frac{10097}{2608}=3.87$. Consider the Poisson distribution with parameter 3.87. The following is its probability function.

$\displaystyle P(X=x)=\frac{e^{-3.87} \ 3.87^x}{x!} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ x=0,1,2,\cdots$

Out of 2608 periods, the expected number of periods with $x$ particles in emission is $2608 P(X=x)$. The following is a side by side comparison between the observed counts and the expected counts.

$\displaystyle \begin{bmatrix} \text{Number of alpha particles}&\text{ }&\text{Observed}&\text{ }&\text{Expected} \\ \text{recorded per 7.5 seconds }&\text{ }&\text{counts}&\text{ }&\text{counts} \\ \text{ }&\text{ }&\text{ }&\text{ }&2608 \times P(X=x) \\ \text{ }&\text{ }&\text{ }&\text{ }&\text{ } \\ 0&\text{ }&57&\text{ }&54.40 \\ 1&\text{ }&203&\text{ }&210.52 \\ 2&\text{ }&383&\text{ }&407.36 \\ 3&\text{ }&525&\text{ }&525.50 \\ 4&\text{ }&532&\text{ }&508.42 \\ 5&\text{ }&408&\text{ }&393.52 \\ 6&\text{ }&273&\text{ }&253.82 \\ 7&\text{ }&139&\text{ }&140.32 \\ 8&\text{ }&45&\text{ }&67.88 \\ 9&\text{ }&27&\text{ }&29.19 \\ 10&\text{ }&10&\text{ }&11.30 \\ 11+&\text{ }&6&\text{ }&5.78 \\ \text{ }&\text{ }&\text{ }&\text{ }&\text{ } \\ \text{Total }&\text{ }&2608&\text{ }&2608 \end{bmatrix}$

The expected counts are quite close to the observed counts, showing that the Poisson distribution is a very good fit to the observed data from the 1910 study.
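The expected counts in the table can be reproduced with a short Python sketch using the mean 3.87 from the text (the 11+ row is taken as the complement of the first eleven probabilities; variable names are mine):

```python
from math import exp, factorial

lam, periods = 3.87, 2608

def poisson_pmf(k):
    return exp(-lam) * lam**k / factorial(k)

# Expected counts for 0 through 10 particles, then the 11+ row
expected = [periods * poisson_pmf(k) for k in range(11)]
expected.append(periods * (1 - sum(poisson_pmf(k) for k in range(11))))
for k, e in enumerate(expected):
    print(k if k < 11 else "11+", round(e, 2))
```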

________________________________________________________________________

We have described the Poisson process as the distribution of random events in a time interval. The same idea can be used to describe random events occurring along a spatial interval, i.e. intervals in terms of distance or volume or other spatial measurements (see Examples 5 and 6 below).

Another point to make is that sometimes it may be necessary to consider an interval other than the unit length. Instead of counting the random events occurring in an interval of length 1, we may want to count the random events in an interval of length $t$. As before, let $\lambda$ be the rate of occurrences in a unit interval. Then the rate of occurrences of the random events over the interval of length $t$ is $\lambda t$. The same argument derives the fact that the number of occurrences of the random events of interest in an interval of length $t$ has a Poisson distribution with parameter $\lambda t$. The following is its probability function.

$\displaystyle P(X_t=x)=\frac{e^{-\lambda t} \ (\lambda t)^x}{x!} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ x=0,1,2,\cdots$

where $X_t$ is the count of the random events in an interval of length $t$.

For example, in the 1910 radioactive study, the unit length is 7.5 seconds. The number of $\alpha$-particles observed in a half unit interval (3.75 seconds) follows a Poisson distribution with parameter $0.5 \lambda=0.5(3.87)=1.935$, with the following probability function:

$\displaystyle P(X_{0.5}=x)=\frac{e^{-1.935} \ (1.935)^x}{x!} \ \ \ \ \ \ \ \ \ \ \ \ \ x=0,1,2,\cdots$
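As a quick numeric illustration (a minimal Python sketch, not part of the original study; the rate 3.87 per 7.5 seconds is the one quoted above), the probability function for a scaled interval can be computed directly:

```python
from math import exp, factorial

def poisson_pmf(x, rate):
    """P(X = x) for a Poisson distribution with the given rate."""
    return exp(-rate) * rate**x / factorial(x)

# Rate per 7.5-second unit interval, from the 1910 study
lam = 3.87

# Counting over half a unit interval (3.75 seconds): parameter is 0.5 * lam
rate_half = 0.5 * lam   # 1.935
for x in range(4):
    print(x, poisson_pmf(x, rate_half))
```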

________________________________________________________________________

More examples

Example 3
A radioactive source is monitored for 5 hours. During this period, 9638 $\alpha$-particles are counted. What is the probability that during the next minute, between 30 and 34 particles (both inclusive) will be counted?

The average number of $\alpha$-particles counted per minute is $\lambda=\frac{9638}{300}=32.12$. Let $X$ be the number of $\alpha$-particles counted per minute. Then $X$ has a Poisson distribution with parameter $\lambda=32.12$. The following calculates $P(30 \le X \le 34)$.

\displaystyle \begin{aligned} P(30 \le X \le 34)&=e^{-32.12} \biggl[ \frac{32.12^{30}}{30!}+\frac{32.12^{31}}{31!}+\frac{32.12^{32}}{32!}+\frac{32.12^{33}}{33!}+\frac{32.12^{34}}{34!} \biggr] \\&=0.341118569 \end{aligned}

Alternatively, the POISSON.DIST function in Excel can be used as follows:

\displaystyle \begin{aligned} P(30 \le X \le 34)&=P(X \le 34)-P(X \le 29) \\&=\text{POISSON.DIST(34,32.12,TRUE)} \\& \ \ \ \ \ \ -\text{POISSON.DIST(29,32.12,TRUE)} \\&=0.671501917-0.330383348 \\&=0.341118569 \end{aligned}
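The same probability can be checked outside Excel. The following Python sketch sums the Poisson probability function directly, using a multiplicative recurrence to avoid computing large powers and factorials separately:

```python
from math import exp

lam = 32.12   # alpha-particles per minute

# Sum P(X = x) for x = 30,...,34 using pmf(x+1) = pmf(x) * lam / (x + 1)
term = exp(-lam)   # pmf at x = 0
prob = 0.0
for x in range(35):
    if x >= 30:
        prob += term
    term *= lam / (x + 1)

print(prob)   # about 0.3411, matching the calculation above
```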

Example 4
The side effect of dry mouth is known to be experienced, on the average, by 5 out of 10,000 individuals taking a certain medication. About 20,000 patients are expected to take this medication next year. What is the probability that between 12 and 16 (both inclusive) patients will experience the side effect of dry mouth? What is the exact probability model that can also be used to work this problem?

The exact model is a binomial distribution with number of trials $n=$ 20000 and probability of success $p=$ 0.0005 (experiencing the side effect) in each trial. Here, we use the Poisson distribution to approximate the binomial. The average number of patients experiencing the side effect is $\lambda=20000(0.0005)=10$. Let $X$ be the number of patients experiencing the side effect. The following calculates the Poisson probability $P(12 \le X \le 16)$ in two different ways.

\displaystyle \begin{aligned} P(12 \le X \le 16)&=e^{-10} \biggl[ \frac{10^{12}}{12!}+\frac{10^{13}}{13!}+\frac{10^{14}}{14!}+\frac{10^{15}}{15!}+\frac{10^{16}}{16!} \biggr] \\&=0.276182244 \end{aligned}
\displaystyle \begin{aligned} P(12 \le X \le 16)&=P(X \le 16)-P(X \le 11) \\&=\text{POISSON.DIST(16,10,TRUE)} \\& \ \ \ \ \ \ -\text{POISSON.DIST(11,10,TRUE)} \\&=0.97295839-0.696776146 \\&=0.276182244 \end{aligned}
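To gauge the quality of the approximation, the following Python sketch computes both the exact binomial probability and the Poisson estimate; the closeness tolerance used in the comparison is just an illustrative choice:

```python
from math import comb, exp, factorial

n, p = 20000, 0.0005
lam = n * p   # 10

# Exact binomial probability P(12 <= X <= 16)
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(12, 17))

# Poisson approximation with the same mean
approx = sum(exp(-lam) * lam**k / factorial(k) for k in range(12, 17))

print(exact, approx)   # the two values are very close
```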

Example 5
In a 10-mile stretch of a highway, car troubles (e.g. tire punctures, dead batteries, and mechanical breakdowns) occur at a rate of 1.5 per hour. A tow truck driver can respond to such car troubles and offer roadside assistance, which can include towing and minor repair. Assume that the number of such incidents per hour follows a Poisson distribution. At the beginning of the hour, three tow trucks (and their drivers) are available to respond to any car troubles in this stretch of highway. What is the probability that in the next hour all three tow truck drivers will be busy helping motorists with car troubles in this stretch of highway?

Let $X$ be the number of car troubles that occur in this 10-mile stretch of highway in the one-hour period in question. If in this one hour there are 3 or more car troubles ($X \ge 3$), then all three tow truck drivers will be busy.

\displaystyle \begin{aligned} P(X \ge 3)&=1-P(X \le 2) \\&=1-e^{-1.5} \biggl[ 1+1.5+\frac{1.5^{2}}{2!} \biggr] \\&=1-0.808846831\\&=0.191153169 \end{aligned}

Example 6
Continuing Example 5. Considering that there is only a 19% chance that all 3 tow truck drivers will be busy, there is a good chance that the resources are underutilized. What if one of the drivers is assigned to another stretch of highway?

With only two tow trucks available for this 10-mile stretch of highway, the following is the probability that all two tow truck drivers will be busy:

\displaystyle \begin{aligned} P(X \ge 2)&=1-P(X \le 1) \\&=1-e^{-1.5} \biggl[ 1+1.5 \biggr] \\&=1-0.5578254\\&=0.4421746 \end{aligned}

Assigning one driver to another area seems to be a better way to make good use of the available resources. With only two tow truck drivers available, there is a much reduced chance (about 56%) that one of the drivers will be idle, and a much increased chance (about 44%) that all available drivers will be busy.
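The tail probabilities in Examples 5 and 6 follow the same pattern; here is a small Python sketch that reproduces both:

```python
from math import exp, factorial

def poisson_tail(k, rate):
    """P(X >= k) for a Poisson distribution with the given rate."""
    cdf = sum(exp(-rate) * rate**j / factorial(j) for j in range(k))
    return 1 - cdf

lam = 1.5   # car troubles per hour on the 10-mile stretch

print(poisson_tail(3, lam))   # all three drivers busy, about 0.1912
print(poisson_tail(2, lam))   # both of two drivers busy, about 0.4422
```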

________________________________________________________________________

Remarks

The Poisson distribution is one of the most important of all probability models and has been shown to be an excellent model for a wide array of phenomena such as

• the number of $\alpha$-particles emitted from radioactive source in a given amount of time,
• the number of vehicles passing a particular location on a busy highway,
• the number of traffic accidents in a stretch of highway in a given period of time,
• the number of phone calls arriving at a particular point in a telephone network in a fixed time period,
• the number of insurance losses/claims in a given period of time,
• the number of customers arriving at a ticket window,
• the number of earthquakes occurring in a fixed period of time,
• the number of mutations on a strand of DNA,
• the number of hurricanes in a year that originate in the Atlantic Ocean.

Why is the Poisson distribution so widely applicable in these and many other seemingly different and diverse phenomena? What is the commonality that ties all these phenomena together? The commonality is that all these phenomena are basically a series of independent Bernoulli trials. If a phenomenon is a binomial model where $n$ is large and $p$ is small, then it has a strong connection to the Poisson model mathematically through Theorem 1 above (i.e. it has a Poisson approximation). On the other hand, if the random phenomenon follows the criteria of a Poisson process, then the phenomenon is also approximately a binomial model, which means that in the limiting case it is Poisson.

In both viewpoints discussed in this post, the Poisson distribution can be regarded as a binomial distribution taken at a very granular level. This connection with the binomial distribution points to a vast array of problems that can be solved using the Poisson distribution.

________________________________________________________________________

Exercises

Practice problems for the Poisson concepts discussed above can be found in the companion blog (go there via the following link). Working on these exercises is strongly encouraged (you don’t know it until you can do it).

________________________________________________________________________
$\copyright \ \text{2015 by Dan Ma}$

# A natural look at the negative binomial survival function

The negative binomial distribution is a discrete distribution with two parameters $r$ and $p$, where $r>0$ and $0<p<1$. It has positive probabilities at the non-negative integers $0,1,2,\cdots$. So it can potentially be used as a model for the random count of a phenomenon of interest. In some cases, the negative binomial distribution has a natural interpretation. In fact, the natural interpretation should be how the negative binomial distribution is introduced. With the parameter $r$ being a positive integer, the negative binomial distribution can be naturally interpreted as the discrete waiting time until the $r$th success. But this natural interpretation does not apply to the general case where $r$ is a positive real number that is not necessarily an integer. However, once the natural case of $r$ being a positive integer is understood, it is easy to make the leap to the general case that $r$ is any positive real number. In this post, we focus on the “natural” version of the negative binomial distribution, where $r$ is a positive integer. In this case, there is a natural way to look at the cumulative distribution function (cdf) and the survival function. The discussion in this post will complement the following posts on the negative binomial distribution.

________________________________________________________________________

The probability functions

A Bernoulli trial is a random experiment in which there are two distinct outcomes. For convenience, they are called success (S) and failure (F). Consider performing a sequence of independent Bernoulli trials such that each trial has the same probability of success $p$. Fix a positive integer $r$. In some cases, we would like to count the number of trials required to produce the $r$th success. Let’s call this count $X_r$. In other cases, we may want instead to count the number of failures before getting the $r$th success. Let’s call this count $Y_r$. According to the discussion in the two posts listed above, the probability functions of $X_r$ and $Y_r$ are:

$\displaystyle P(X_r=x)= \binom{x-1}{r-1} p^r (1-p)^{x-r} \ \ \ \ \ \ \ x=r,r+1,r+2,\cdots \ \ \ \ \ \ \ (1)$

$\displaystyle P(Y_r=y)=\binom{y+r-1}{y} p^r (1-p)^y \ \ \ \ \ \ \ y=0,1,2,\cdots \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)$

When $r=1$, the resulting distribution is called the geometric distribution.

$\displaystyle P(X_1=x)= p (1-p)^{x-1} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ x=1,2,3,\cdots \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1a)$

$\displaystyle P(Y_1=y)= p (1-p)^y \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ y=0,1,2,\cdots \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2a)$

Because the parameter $r$ is a positive integer, all the above probability functions have a combinatorial explanation. For the probability function in (1), there are $x$ trials in total. The last Bernoulli trial is the $r$th success. Then the preceding $x-1$ trials contain exactly $r-1$ successes. Hence we have the binomial coefficient $\binom{x-1}{r-1}$ in (1). For the probability function in (2), there are $y+r$ trials ($r$ successes and $y$ failures). Again, the last Bernoulli trial is the $r$th success. So in the preceding $y+r-1$ trials, there are exactly $r-1$ successes and exactly $y$ failures. Hence we have the binomial coefficient $\binom{y+r-1}{y}$ in (2). Let’s look at some examples.

Example 1
Four different prizes are randomly put into boxes of cereal. One of the prizes is a free ticket to the local zoo. Suppose that a family of four (called Family A) decides to buy this cereal until obtaining four free tickets to the zoo. What is the probability that the family will have to buy 10 boxes of cereal to obtain the four free tickets to the zoo? What is the probability that the family will have to buy 16 boxes of cereal to obtain the four free tickets to the zoo?

In this example, the success is a box of cereal with a free ticket to the zoo. So getting a ticket to the zoo is considered a success. Any one of the other three prizes is considered undesirable or a failure. Any one of the four prizes is equally likely. The negative binomial distribution in this example has $r=4$ and $p=0.25$. The count of the boxes of cereal to be purchased is the random variable $X_4$ as described in (1) above. The following gives the answers.

\displaystyle \begin{aligned} P(X_4=10)&=\binom{10-1}{3} \ (0.25)^4 \ (0.75)^{10-4} \\&=\binom{9}{3} \ (0.25)^4 \ (0.75)^{6} \\&=84 \ (0.25)^4 \ (0.75)^{6} \\&=0.0583992 \end{aligned}

\displaystyle \begin{aligned} P(X_4=16)&=\binom{16-1}{3} \ (0.25)^4 \ (0.75)^{16-4} \\&=\binom{15}{3} \ (0.25)^4 \ (0.75)^{12} \\&=455 \ (0.25)^4 \ (0.75)^{12} \\&=0.056299766 \end{aligned}
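The probability function (1) is easy to compute directly. The following Python sketch reproduces both answers:

```python
from math import comb

def nb_pmf_trials(x, r, p):
    """Formula (1): P(X_r = x), the r-th success occurring on trial x."""
    return comb(x - 1, r - 1) * p**r * (1 - p)**(x - r)

print(nb_pmf_trials(10, 4, 0.25))   # about 0.0584
print(nb_pmf_trials(16, 4, 0.25))   # about 0.0563
```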

Example 2 (Example 1 continued)
Suppose Family A agrees to give any one of the undesirable prizes away to another family (called Family B). What is the probability that Family A will give 5 undesirable prizes to Family B before obtaining the four desirable tickets to the zoo? What is the probability that Family A will give 12 undesirable prizes to Family B before obtaining the four tickets to the zoo?

The negative binomial distribution in this example has $r=4$ and $p=0.25$. The interest here is to count the number of failures (undesirable prizes) before getting 4 successes. Thus the random variable of interest is $Y_4$ as described in (2) above. The following gives the answers.

\displaystyle \begin{aligned} P(Y_4=5)&=\binom{5+4-1}{5} \ (0.25)^4 \ (0.75)^{5} \\&=\binom{8}{5} \ (0.25)^4 \ (0.75)^{5} \\&=56 \ (0.25)^4 \ (0.75)^{5} \\&=0.0519104 \end{aligned}

\displaystyle \begin{aligned} P(Y_4=12)&=\binom{12+4-1}{12} \ (0.25)^4 \ (0.75)^{12} \\&=\binom{15}{12} \ (0.25)^4 \ (0.75)^{12} \\&=455 \ (0.25)^4 \ (0.75)^{12} \\&=0.056299766 \end{aligned}

Here’s the mean and variance for both examples.

$\displaystyle E(X_4)=4 \ \frac{1}{0.25}=16$

$\displaystyle Var(X_4)=Var(Y_4)=4 \ \frac{0.75}{0.25^2}=48$

$\displaystyle E(Y_4)=4 \ \frac{0.75}{0.25}=12$

Thus Family A is expected to buy 16 boxes of cereal to get the 4 tickets to the zoo and is expected to give 12 prizes to the other family. However, the variance is fairly large. As a result, the actual number of boxes purchased can vary from the mean by a large amount.
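The mean and variance formulas can be sanity-checked numerically by summing the probability function (1) over enough terms (a Python sketch; the cutoff of 600 terms is an illustrative choice that is far more than needed for $p=0.25$):

```python
from math import comb

r, p = 4, 0.25

m1 = m2 = 0.0
for x in range(r, 600):   # the tail beyond 600 is negligible here
    pmf = comb(x - 1, r - 1) * p**r * (1 - p)**(x - r)
    m1 += x * pmf
    m2 += x * x * pmf

print(m1, m2 - m1 * m1)   # close to E(X_4) = 16 and Var(X_4) = 48
```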

_______________________________________________________________________

The survival functions and the cumulative distribution functions

For any random variable $T$, the cumulative distribution function is $P(T \le t)$ where $T$ can range over all real numbers $t$ or a relevant set of real numbers $t$. The survival function is $P(T > t)$. The term survival comes from the interpretation that $P(T > t)$ is the probability of a life surviving beyond time $t$ if $T$ is meant to model the lifetime of a biological life or some system. Even when $T$ is not a lifetime random variable, we still use the term survival function for $P(T > t)$.

Example 1 asks for the probability of needing a certain number of trials to get the $r$th success, and Example 2 asks for the probability of a certain number of failures before getting the $r$th success. Sometimes it is more informative to know how many trials are likely to be required to achieve one’s goal. For example, it may be useful to know the mean number of trials or the probability of achieving the goal in $x$ trials or fewer. In some cases, it takes time and resources to perform the random Bernoulli trials, so it is helpful to know ahead of time the likelihood of achieving one’s goal given the available resources. In the above examples, it would be helpful for Family A to have a clearer sense of how many boxes of cereal are to be purchased. Therefore, it is worthwhile to look at the cdf and the survival function of the negative binomial distribution.

In terms of basic principles, here are the survival functions for the distributions described in (1) and (2).

\displaystyle \begin{aligned} P(X_r>x)&=\sum \limits_{k=x+1}^\infty P(X_r=k) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ x=r,r+1,r+2,\cdots \\&=\sum \limits_{k=x+1}^\infty \binom{k-1}{r-1} p^r (1-p)^{k-r} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3) \end{aligned}

\displaystyle \begin{aligned} P(Y_r>y)&=\sum \limits_{j=y+1}^\infty P(Y_r=j) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ y= 0,1,2,3,\cdots \\&=\sum \limits_{j=y+1}^\infty \binom{j+r-1}{j} p^r (1-p)^{j} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4) \end{aligned}

The cumulative distribution functions can be defined by basic principle or by taking the complement as follows:

\displaystyle \begin{aligned} P(X_r \le x)&=\sum \limits_{k=r}^x P(X_r=k) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ x=r,r+1,r+2,\cdots \\&=1-P(X_r>x) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (5)\end{aligned}

\displaystyle \begin{aligned} P(Y_r \le y)&=\sum \limits_{j=0}^y P(Y_r=j) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ y=0,1,2,3,\cdots \\&=1-P(Y_r>y) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (6)\end{aligned}

Example 3 (the above examples continued)
What is the probability that Family A will buy more than 16 boxes in order to obtain the 4 free tickets to the zoo? What is the probability that Family A will give away at most 10 undesirable prizes before obtaining the 4 free tickets to the zoo?

Working from the definition, here’s the answers:

$\displaystyle P(X_4>16)=1-P(X_4 \le 16)=0.40498711$

$\displaystyle P(Y_4 \le 10)=0.47866004$

The mean number of boxes of cereal purchased is 16 as indicated earlier. Since the variance is large, there is still a significant chance (about 40%) that Family A will have to buy more than 16 boxes of cereal before achieving their goal. On the other hand, there is a good chance (about 48%) that Family A will give away at most 10 undesirable prizes.

Note that the calculation for $P(X_4>16)$ is based on $P(X_4 \le 16)$, which requires the calculation of 13 probabilities (for $x=4$ to $x=16$). The calculation for $P(Y_4 \le 10)$ requires 11 probabilities. Such calculations can of course be done by software. There is a more natural way of looking at the calculations in (3), (4), (5) and (6). This alternative approach gives much better insight on the negative binomial distribution.

_______________________________________________________________________

A more natural way of interpreting the survival function

We now discuss a better way to look at the survival functions in (3) and (4). Consider the event $X_r>x$ and the event $Y_r>y$. We will see that the negative binomial survival function can be related to the cdf of a binomial distribution.

For the event $X_r>x$ to occur, the $r$th success occurs after performing $x$ trials. So it will take $x+1$ trials or more to get the $r$th success. This means that in the first $x$ trials, there are at most $r-1$ successes. The following highlights the equivalent statements.

\displaystyle \begin{aligned} X_r>x&\equiv \text{the } r \text{th success occurs after performing } x \text{ trials} \\&\equiv \text{it takes at least } x+1 \text{ trials to get the } r \text{th success} \\&\equiv \text{in the first } x \text{ trials, there are at most } r-1 \text{ successes} \end{aligned}

The last statement is a binomial distribution. Specifically, it is the binomial distribution with $x$ trials and the probability of success $p$. Let’s denote the count of successes of this binomial distribution by $B_{x,p}$. Thus we can relate the survival function in (3) with the cdf of $B_{x,p}$ as follows:

\displaystyle \begin{aligned} P(X_r>x)&=P(B_{x,p} \le r-1) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ x=r,r+1,r+2,\cdots \\&=\sum \limits_{k=0}^{r-1} \ P(B_{x,p}=k) \\&=\sum \limits_{k=0}^{r-1} \ \binom{x}{k} \ p^k (1-p)^{x-k} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (7) \end{aligned}

The advantage of (7) is that it gives us the insight of relating the negative binomial distribution to the binomial distribution. In terms of computation, (7) requires only $r$ binomial probabilities. Thus Example 3 only requires the computation of 4 probabilities (versus 13 previously).

Note that $X_r=Y_r+r$. Thus the event $Y_r>y$ is the same as the event $X_r>y+r$. So we can piggyback on the work done in (7). For the sake of clarity, here’s a translation for the event $Y_r>y$.

\displaystyle \begin{aligned} Y_r>y&\equiv X_r>y+r \\&\equiv \text{the } r \text{th success occurs after performing } y+r \text{ trials} \\&\equiv \text{it takes at least } y+r+1 \text{ trials to get the } r \text{th success} \\&\equiv \text{in the first } y+r \text{ trials, there are at most } r-1 \text{ successes} \end{aligned}

As before, let $B_{n,p}$ denote the number of successes in performing $n$ Bernoulli trials with $p$ as the probability of success. Based on the above translation, the following gives the survival function for the negative binomial random variable $Y_r$.

\displaystyle \begin{aligned} P(Y_r>y)&=P(X_r>y+r) \\&=P(B_{y+r,p} \le r-1) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ y=0,1,2,\cdots \\&=\sum \limits_{j=0}^{r-1} \ P(B_{y+r,p}=j) \\&=\sum \limits_{j=0}^{r-1} \ \binom{y+r}{j} \ p^j (1-p)^{y+r-j} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (8) \end{aligned}

For both negative binomial random variables $X_r$ and $Y_r$, the survival functions can be computed using (7) and (8), respectively. Rote memorization of the formulas (7) and (8) is not recommended. Instead, focus on the thought process that translates the events $X_r>x$ and $Y_r>y$ into a binomial distribution.

Example 4 (the above examples continued)
We now rework Example 3 using the ideas presented in (7) and (8).

\displaystyle \begin{aligned} P(X_4>16)&=\sum \limits_{k=0}^{3} \ \binom{16}{k} \ 0.25^k \ 0.75^{16-k}=0.40498711 \end{aligned}

\displaystyle \begin{aligned} P(Y_4 \le 10)&=1-P(Y_4>10) \\&=1-\sum \limits_{j=0}^{3} \ \binom{14}{j} \ 0.25^j \ 0.75^{14-j} \\&=1-0.52133996 \\&=0.47866004 \end{aligned}
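The binomial reformulation in (7) and (8) is also easy to program. This Python sketch reproduces the two answers, each with only four binomial probabilities:

```python
from math import comb

def binom_cdf(k, n, p):
    """P(B_{n,p} <= k) for a binomial distribution with n trials."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

r, p = 4, 0.25

# (7): P(X_4 > 16) = P(at most 3 successes in 16 trials)
print(binom_cdf(r - 1, 16, p))          # about 0.4050

# (8): P(Y_4 <= 10) = 1 - P(at most 3 successes in 14 trials)
print(1 - binom_cdf(r - 1, 10 + r, p))  # about 0.4787
```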

Example 5 (the above examples continued)
What is the median number of boxes of cereal purchased by Family A in order to obtain 4 boxes with the prize of free ticket to the zoo? What is the median number of boxes of cereal with undesirable prizes that are purchased by Family A?

We have the following probabilities.

$\displaystyle P(X_4>14)=\sum \limits_{k=0}^{3} \ \binom{14}{k} \ 0.25^k \ 0.75^{14-k}=0.52133996$

$\displaystyle P(X_4>15)=\sum \limits_{k=0}^{3} \ \binom{15}{k} \ 0.25^k \ 0.75^{15-k}=0.461286876$

$\displaystyle P(X_4 \le 14)=1-0.52133996=0.47866004$

$\displaystyle P(X_4 \le 15)=1-0.461286876=0.538713124$

This means that the median number of boxes to be purchased is 15. One way to look at it is that $x=$ 15 is the first number such that $P(X_4 \le x)$ is greater than 0.5. Then the median number of boxes with undesirable prizes is 11 (15 less 4).
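The median search can be carried out with the same binomial translation (a Python sketch):

```python
from math import comb

def binom_cdf(k, n, p):
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

r, p = 4, 0.25

# Median of X_4: the smallest x with P(X_4 <= x) > 0.5, where
# P(X_4 <= x) = 1 - P(at most r-1 successes in x trials)
x = r
while 1 - binom_cdf(r - 1, x, p) <= 0.5:
    x += 1

print(x)   # 15 boxes of cereal
```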

_______________________________________________________________________

Comparing with the Gamma distribution

The thought process discussed in (7) and (8) certainly gives a more efficient way to calculate the cumulative distribution function and the survival function of the negative binomial distribution. Even though the negative binomial cdf can be calculated easily by software, the ideas in (7) and (8) provide a formulation that gives more insight into the negative binomial distribution.

The thought process in (7) and (8) is analogous to the relationship between the Gamma distribution and the Poisson distribution. Consider the Gamma distribution whose shape parameter $n$ is a positive integer and whose rate parameter $\beta$ can be any positive real number. Then the following is the density function of the Gamma distribution under consideration:

$\displaystyle f(x)=\frac{\beta^n}{(n-1)!} \ x^{n-1} \ e^{-\beta \ x} \ \ \ \ \ x>0 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (9)$

The Gamma distribution described by the density function in (9) can be interpreted as a waiting time – the waiting time until the $n$th change in a Poisson process. Thus, if the $n$th change takes place after time $t$, there can be at most $n-1$ arrivals in the time interval $[0,t]$. Thus the survival function of this Gamma distribution is the same as the cdf of a Poisson distribution. The survival function in (7) is analogous to the following relation.

$\displaystyle P(X>t)=\int_t^\infty \frac{\beta^n}{(n-1)!} \ x^{n-1} \ e^{-\beta x} \ dx=\sum \limits_{j=0}^{n-1} \frac{e^{-\beta t} \ (\beta t)^j}{j!} \ \ \ \ \ \ \ \ \ \ \ (10)$

The idea behind (7) and (8) is the waiting time until the $r$th success, where each success or failure comes from a Bernoulli process. The resulting probability distribution is a discrete waiting-time distribution. The idea behind (10) is the waiting time until the $n$th change, where the changes come from a Poisson counting process. The resulting probability distribution is a continuous waiting-time distribution. It is not necessary to memorize these formulas. It is easy to reproduce (7), (8) and (10) from the underlying thought processes.
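The Gamma-Poisson relation (10) can be verified numerically. The following Python sketch compares a Simpson's-rule integration of the gamma density with the Poisson sum, for one illustrative choice of parameters ($n=3$, $\beta=2$, $t=1.5$ are assumptions, not values from the post):

```python
from math import exp, factorial

n, beta, t = 3, 2.0, 1.5

def gamma_pdf(x):
    """Density (9) with integer shape n and rate beta."""
    return beta**n / factorial(n - 1) * x**(n - 1) * exp(-beta * x)

# Left side of (10): integrate the density from t to a large cutoff (Simpson's rule)
upper, steps = 30.0, 20000   # even number of subintervals
h = (upper - t) / steps
integral = gamma_pdf(t) + gamma_pdf(upper)
for i in range(1, steps):
    integral += (4 if i % 2 else 2) * gamma_pdf(t + i * h)
integral *= h / 3

# Right side of (10): the Poisson cdf P(N <= n-1) with rate beta*t
poisson_cdf = sum(exp(-beta * t) * (beta * t)**j / factorial(j) for j in range(n))

print(integral, poisson_cdf)   # the two sides agree
```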

________________________________________________________________________
$\copyright \ \text{2015 by Dan Ma}$

# Deriving some facts of the negative binomial distribution

The previous post called The Negative Binomial Distribution gives a fairly comprehensive discussion of the negative binomial distribution. In this post, we fill in some of the details that are glossed over in that previous post. We discuss and derive the following points:

• The several versions of the negative binomial distribution.
• The negative binomial probabilities sum to one, i.e., the negative binomial probability function is a valid probability function.
• The moment generating function of the negative binomial distribution.
• The first and second moments and the variance of the negative binomial distribution.
• An observation about independent sums of negative binomial distributions.

________________________________________________________________________

Three versions

The negative binomial distribution has two parameters $r$ and $p$, where $r$ is a positive real number and $0<p<1$. The first two versions arise from the case where $r$ is a positive integer, which can be interpreted as the random experiment of a sequence of independent Bernoulli trials until the $r$th success (the trials have the same probability of success $p$). In this interpretation, there are two ways of recording the random experiment:

$X =$ the number of Bernoulli trials required to get the $r$th success.
$Y =$ the number of Bernoulli trials that end in failure before getting the $r$th success.

The other parameter $p$ is the probability of success in each Bernoulli trial. The binomial coefficient $\binom{m}{n}$, where $m$ and $n$ are non-negative integers with $m \ge n$, is defined as:

$\displaystyle \binom{m}{n}=\frac{m!}{n! \ (m-n)!}=\frac{m(m-1) \cdots (m-(n-1))}{n!} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (0)$

With this in mind, the following are the probability functions of the random variables $X$ and $Y$.

$\displaystyle P(X=x)= \binom{x-1}{r-1} p^r (1-p)^{x-r} \ \ \ \ \ \ \ x=r,r+1,r+2,\cdots \ \ \ \ \ \ \ (1)$

$\displaystyle P(Y=y)=\binom{y+r-1}{y} p^r (1-p)^y \ \ \ \ \ \ \ y=0,1,2,\cdots \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)$

The thought process for (1) is that for the event $X=x$ to happen, there must be exactly $r-1$ successes in the first $x-1$ trials, with one additional success occurring in the last trial (the $x$th trial). The thought process for (2) is that for the event $Y=y$ to happen, there are $y+r$ trials ($y$ failures and $r$ successes). In the first $y+r-1$ trials, there are exactly $y$ failures (or equivalently, exactly $r-1$ successes). Note that $X=Y+r$. Thus knowing the mean of $Y$ gives the mean of $X$, a fact we will use below.

Instead of memorizing the probability functions (1) and (2), it is better to understand and remember the thought processes involved. Because of the natural interpretation of performing Bernoulli trials until the $r$th success, it is a good idea to introduce the negative binomial distribution via the distributions described by (1) and (2), i.e., the case where the parameter $r$ is a positive integer. When $r=1$, the random experiment is a sequence of independent Bernoulli trials until the first success (this is called the geometric distribution).

Of course, (1) and (2) can also simply be used as counting distributions without any connection with a series of Bernoulli trials (e.g. used in an insurance context as the number of losses or claims arising from a group of insurance policies).

The binomial coefficient in (0) is defined when both numbers are non-negative integers and the top one is greater than or equal to the bottom one. However, the rightmost term in (0) can be calculated even when the top number $m$ is not a non-negative integer. Thus when $m$ is any real number, the rightmost term in (0) can be calculated provided that the bottom number $n$ is a positive integer. For convenience we define $\binom{m}{0}=1$. With this in mind, the binomial coefficient $\binom{m}{n}$ is defined for any real number $m$ and any non-negative integer $n$.

The third version of the negative binomial distribution arises from the relaxation of the binomial coefficient $\binom{m}{n}$ just discussed. With this in mind, the probability function in (2) can be defined for any positive real number $r$:

$\displaystyle P(Y=y)=\binom{y+r-1}{y} p^r (1-p)^y \ \ \ \ \ \ \ y=0,1,2,\cdots \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3)$

where $\displaystyle \binom{y+r-1}{y}=\frac{(y+r-1)(y+r-2) \cdots (r+1)r}{y!}$.

Of course when $r$ is a positive integer, versions (2) and (3) are identical. When $r$ is a positive real number but is not an integer, the distribution cannot be interpreted as the number of failures until the occurrence of $r$th success. Instead, it is used as a counting distribution.

________________________________________________________________________

The probabilities sum to one

Do the probabilities in (1), (2) or (3) sum to one? For the interpretations of (1) and (2), is it possible to repeatedly perform Bernoulli trials and never get the $r$th success? For $r=1$, is it possible to never even get a success? In tossing a fair coin repeatedly, soon enough you will get a head, and even if $r$ is a large number, you will eventually get $r$ heads. Here we wish to prove this fact mathematically.

To show that (1), (2) and (3) are indeed probability functions, we use a fact concerning Maclaurin’s series expansion of the function $(1-x)^{-r}$, a fact that is covered in a calculus course. In the following two results, $r$ is a fixed positive real number and $y$ is any non-negative integer:

$\displaystyle \binom{y+r-1}{y}=(-1)^y \ \binom{-r}{y} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4)$

$\displaystyle \sum \limits_{y=0}^\infty (-1)^y \ \binom{-r}{y} \ x^y=(1-x)^{-r} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (5)$

Result (4) rearranges the binomial coefficient in the probability function (3) into a binomial coefficient with a negative top number. This is where the word “negative” in negative binomial distribution comes from. Result (5) is the Maclaurin’s series expansion for the function $(1-x)^{-r}$. We first derive these two facts and then use them to show that the negative binomial probabilities in (3) sum to one. The following derives (4).

\displaystyle \begin{aligned} \binom{y+r-1}{y}&=\frac{(y+r-1)(y+r-2) \cdots (r+1)r}{y!} \\&=(-1)^y \ \frac{(-r)(-r-1) \cdots (-r-(y-1))}{y!} \\&=(-1)^y \ \binom{-r}{y} \end{aligned}

To derive (5), let $f(x)=(1-x)^{-r}$. Based on a theorem that can be found in most calculus texts, the function $f(x)$ has the following Maclaurin’s series expansion (Maclaurin’s series is simply Taylor’s series with center = 0).

$\displaystyle (1-x)^{-r}=f(0)+f^{'}(0)x+\frac{f^{(2)}(0)}{2!}x^2+\frac{f^{(3)}(0)}{3!}x^3+\cdots + \frac{f^{(n)}(0)}{n!}x^n+\cdots$

where $-1<x<1$. Now, filling in the derivatives $f^{(n)}(0)$, we have the following derivation.

\displaystyle \begin{aligned} (1-x)^{-r}&=1+rx+\frac{(r+1)r}{2!}x^2+\frac{(r+2)(r+1)r}{3!}x^3 \\& \ \ \ \ \ \ \ \ +\cdots+\frac{(r+y-1)(r+y-2) \cdots (r+1)r}{y!}x^y +\cdots \\&=1+(-1)^1 (-r)x+(-1)^2\frac{(-r)(-r-1)}{2!}x^2 \\& \ \ \ \ \ \ +(-1)^3 \frac{(-r)(-r-1)(-r-2)}{3!}x^3 +\cdots \\& \ \ \ \ \ \ +(-1)^y \frac{(-r)(-r-1) \cdots (-r-y+2)(-r-y+1)}{y!}x^y +\cdots \\&=(-1)^0 \binom{-r}{0}x^0 +(-1)^1 \binom{-r}{1}x^1+(-1)^2 \binom{-r}{2}x^2 \\& \ \ \ \ \ \ +(-1)^3 \binom{-r}{3}x^3+\cdots +(-1)^y \binom{-r}{y}x^y+\cdots \\&=\sum \limits_{y=0}^\infty (-1)^y \ \binom{-r}{y} \ x^y \end{aligned}

We can now show that the negative binomial probabilities in (3) sum to one. Let $q=1-p$.

\displaystyle \begin{aligned} \sum \limits_{y=0}^\infty \binom{y+r-1}{y} \ p^r \ q^y &=p^r \ \sum \limits_{y=0}^\infty (-1)^y \ \binom{-r}{y} \ q^y \ \ \ \ \ \ \ \ \ \ \ \text{using } (4) \\&=p^r \ (1-q)^{-r} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{using } (5)\\&=p^r p^{-r} \\&=1 \end{aligned}
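The sum-to-one property holds even when $r$ is not an integer. The following Python sketch checks it numerically, using the term recurrence $P(Y=y+1)=P(Y=y) \cdot \frac{r+y}{y+1} (1-p)$ to evaluate (3); the values $r=2.5$ and $p=0.3$ are just illustrative:

```python
r, p = 2.5, 0.3   # a non-integer r, to exercise version (3)

# pmf(0) = p^r; recurrence: pmf(y+1) = pmf(y) * (r + y) / (y + 1) * (1 - p)
pmf = p**r
total = 0.0
for y in range(400):   # the tail beyond 400 is negligible here
    total += pmf
    pmf *= (r + y) / (y + 1) * (1 - p)

print(total)   # very close to 1
```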

________________________________________________________________________

The moment generating function

We now derive the moment generating function of the negative binomial distribution according to (3). The moment generating function is $M(t)=E(e^{tY})$, defined for all real numbers $t$ for which the expectation exists. The following derivation does the job.

\displaystyle \begin{aligned} M(t)&=E(e^{tY}) \\&=\sum \limits_{y=0}^\infty \ e^{t y} \ \binom{y+r-1}{y} \ p^r \ (1-p)^y \\&=p^r \ \sum \limits_{y=0}^\infty \ \binom{y+r-1}{y} \ [(1-p) e^t]^y \\&=p^r \ \sum \limits_{y=0}^\infty \ (-1)^y \binom{-r}{y} \ [(1-p) e^t]^y \ \ \ \ \ \ \ \ \ \ \ \text{using } (4) \\&=p^r \ [1-(1-p) \ e^t]^{-r} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{using } (5) \\&=\frac{p^r}{[1-(1-p) \ e^t]^{r}}\end{aligned}

The above moment generating function works for the negative binomial distribution with respect to (3) and thus to (2). For the distribution in (1), note that $X=Y+r$. Thus $E(e^{tX})=E(e^{t(Y+r)})=e^{tr} \ E(e^{tY})$. The moment generating function of (1) is simply the above moment generating function multiplied by the factor $e^{tr}$. To summarize, the moment generating functions for the three versions are:

$\displaystyle M_X(t)=E[e^{tX}]=\frac{p^r \ e^{tr}}{[1-(1-p) \ e^t]^{r}} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{for } (1)$

$\displaystyle M_Y(t)=E[e^{tY}]=\frac{p^r}{[1-(1-p) \ e^t]^{r}} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{for } (2) \text{ and } (3)$

The domain of the moment generating function is the set of all $t$ for which $M_X(t)$ or $M_Y(t)$ is defined. Based on the form that the functions take, we need $1-(1-p) \ e^t>0$. This leads to the domain $t<-\text{ln}(1-p)$.
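A quick numerical check of $M_Y(t)$ is possible: at $t=0$ the function should equal 1, and a finite-difference estimate of $M_Y^{'}(0)$ should match the mean $r(1-p)/p$. The parameters $r=4$, $p=0.4$ and the step size $h$ below are illustrative choices.

```python
import math

# M_Y(t) = p^r / (1 - (1-p) e^t)^r, defined for t < -ln(1-p).
r, p = 4, 0.4

def mgf(t):
    return p ** r / (1 - (1 - p) * math.exp(t)) ** r

print(mgf(0.0))          # 1.0
print(-math.log(1 - p))  # right endpoint of the domain, about 0.51

# Central-difference estimate of M'(0), which should be E(Y) = r(1-p)/p.
h = 1e-6
deriv = (mgf(h) - mgf(-h)) / (2 * h)
print(deriv)             # approximately 6.0
```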

________________________________________________________________________

The mean and the variance

With the moment generating function derived in the above section, we can now focus on finding the moments of the negative binomial distribution. To find the moments, simply take the derivatives of the moment generating function and evaluate at $t=0$. For the distribution represented by the probability function in (3), we calculate the following:

$E(Y)=M_Y^{'}(0)$

$E(Y^2)=M_Y^{(2)}(0)$

$Var(Y)=E(Y^2)-E(Y)^2$

After taking the first and second derivatives and evaluating at $t=0$, the first and the second moments are:

$\displaystyle E(Y)=r \ \frac{1-p}{p}$

$\displaystyle E(Y^2)=\frac{r(1-p)[1+r(1-p)]}{p^2}$

The following derives the variance.

\displaystyle \begin{aligned} Var(Y)&=E(Y^2)-E(Y)^2 \\&=\frac{r(1-p)[1+r(1-p)]}{p^2}-\frac{r^2(1-p)^2}{p^2} \\&=\frac{r(1-p)+r^2(1-p)^2-r^2(1-p)^2}{p^2} \\&=\frac{r(1-p)}{p^2} \end{aligned}

The above formula is the variance for all three versions (1), (2) and (3). Note that $Var(Y)>E(Y)$. In contrast, the variance of the Poisson distribution is identical to its mean. Thus in situations where the variance of observed count data is greater than the sample mean, the negative binomial distribution is often a better fit than the Poisson distribution.
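The closed forms for the mean and the variance can be checked against direct sums of the probability function (3). The parameters $r=3$, $p=0.25$ and the cutoff below are illustrative choices.

```python
from math import comb

# Direct numerical moments of the negative binomial distribution (3);
# the tail beyond the cutoff is negligible.
r, p = 3, 0.25
pmf = [comb(y + r - 1, y) * p ** r * (1 - p) ** y for y in range(5000)]
mean = sum(y * f for y, f in enumerate(pmf))
second = sum(y * y * f for y, f in enumerate(pmf))
var = second - mean ** 2
print(mean)  # approximately r(1-p)/p = 9.0
print(var)   # approximately r(1-p)/p^2 = 36.0
```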

________________________________________________________________________

The independent sum

There is an easy consequence that follows from the moment generating function derived above. The sum of several independent negative binomial random variables sharing the same parameter $p$ is also a negative binomial random variable. For example, suppose $T_1,T_2, \cdots,T_n$ are independent negative binomial random variables (version (3)). Suppose each $T_j$ has parameters $r_j$ and $p$ (the second parameter is identical across the $T_j$). The moment generating function of the independent sum is the product of the individual moment generating functions. Thus the following is the moment generating function of $T=T_1+\cdots+T_n$.

$\displaystyle M_T(t)=E[e^{tT}]=\frac{p^g}{[1-(1-p) \ e^t]^{g}}$

where $g=r_1+\cdots+r_n$. The moment generating function uniquely identifies the distribution. The above $M_T(t)$ is that of a negative binomial distribution with parameters $g$ and $p$ according to (3).

A special case is that the sum of $n$ independent geometric random variables (each with the same parameter $p$) is a negative binomial distribution with the $r$ parameter being $r=n$. The following is the moment generating function of the sum $W$ of $n$ independent geometric random variables.

$\displaystyle M_W(t)=E[e^{tW}]=\frac{p^n}{[1-(1-p) \ e^t]^{n}}$
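This special case can be illustrated by simulation: summing $n$ independent geometric counts (failures before the first success) should reproduce the negative binomial mean $n(1-p)/p$. The seed, $n=4$, $p=0.5$, and the number of trials below are illustrative choices.

```python
import random

random.seed(42)

def geometric_failures(p):
    """Number of failures before the first success."""
    count = 0
    while random.random() > p:
        count += 1
    return count

# Sum n independent geometric variables; the sum is negative binomial
# with parameters r = n and p.
n, p, trials = 4, 0.5, 200_000
samples = [sum(geometric_failures(p) for _ in range(n)) for _ in range(trials)]
mean = sum(samples) / trials
print(mean)  # close to n(1-p)/p = 4.0
```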

________________________________________________________________________
$\copyright \ \text{2015 by Dan Ma}$

# Conditional Distributions, Part 2

We present more examples to further illustrate the thought process of conditional distributions. A conditional distribution is a probability distribution derived from a given probability distribution by focusing on a subset of the original sample space (we assume that the probability distribution being discussed is a model for some random experiment). The new sample space (the subset of the original one) may be some outcomes that are of interest to an experimenter in a random experiment or may reflect some new information we know about the random experiment in question. We illustrate this thought process in the previous post Conditional Distributions, Part 1 using discrete distributions. In this post, we present some continuous examples for conditional distributions. One concept illustrated by the examples in this post is the notion of mean residual life, which has an insurance interpretation (e.g. the average remaining time until death given that the life in question is alive at a certain age).

_____________________________________________________________________________________________________________________________

The Setting

The thought process of conditional distributions is discussed in the previous post Conditional Distributions, Part 1. We repeat the same discussion using continuous distributions.

Let $X$ be a continuous random variable that describes a certain random experiment. Let $S$ be the sample space of this random experiment. Let $f(x)$ be its probability density function.

We assume that $X$ is a univariate random variable, meaning that the sample space $S$ is the real line $\mathbb{R}$ or a subset of $\mathbb{R}$. Since $X$ is a continuous random variable, we know that $S$ would contain an interval, say, $(a,b)$.

Suppose that in the random experiment in question, a certain event $A$ has occurred. The probability of the event $A$ is obtained by integrating the density function over the set $A$.

$\displaystyle P(A)=\int_{x \in A} f(x) \ dx$

Since the event $A$ has occurred, $P(A)>0$. Since we are dealing with a continuous distribution, the set $A$ would contain an interval, say $(c,d)$ (otherwise $P(A)=0$). So the new probability distribution we define is also a continuous distribution. The following is the density function defined on the new sample space $A$.

$\displaystyle f(x \lvert A)=\frac{f(x)}{P(A)}, \ \ \ \ \ \ \ \ \ x \in A$

The above probability distribution is called the conditional distribution of $X$ given the event $A$, denoted by $X \lvert A$. This new probability distribution incorporates new information about the results of a random experiment.

Once this new probability distribution is established, we can compute various distributional quantities (e.g. cumulative distribution function, mean, variance and other higher moments).

_____________________________________________________________________________________________________________________________

Examples

Example 1

Let $X$ be the lifetime (in years) of a brand new computer purchased from a certain manufacturer. Suppose that the following is the density function of the random variable $X$.

$\displaystyle f(x)=\frac{3}{2500} \ (100x-20x^2 + x^3), \ \ \ \ \ \ \ \ 0<x<10$

Suppose that you have just purchased one such computer that is 2 years old and in good working condition. We have the following questions.

• What is the expected lifetime of this 2-year old computer?
• What is the expected number of years of service that will be provided by this 2-year old computer?

Both calculations are conditional means since the computer in question already survived to age 2. However, there is a slight difference between the two calculations. The first one is the expected age of the 2-year old computer, i.e., the conditional mean $E(X \lvert X>2)$. The second one is the expected remaining lifetime of the 2-year old computer, i.e., $E(X-2 \lvert X>2)$.

For a brand new computer, the sample space is the interval $S=(0, 10)$. Knowing that the computer is already 2 years old, the new sample space is $A=(2, 10)$. The total probability of the new sample space is:

$\displaystyle P(A)=P(X>2)=\int_{2}^{10} \frac{3}{2500} \ (100x-20x^2 + x^3) \ dx=\frac{2048}{2500}=0.8192$

The conditional density function of $X$ given $X>2$ is:

\displaystyle \begin{aligned} f(x \lvert X>2)&=\frac{\frac{3}{2500} \ (100x-20x^2 + x^3)} {\frac{2048}{2500}} \\&=\frac{3}{2048} \ (100x-20x^2 + x^3), \ \ \ \ \ \ \ \ \ 2<x<10 \end{aligned}

The first conditional mean is:

\displaystyle \begin{aligned} E(X \lvert X>2)&=\int_2^{10} x \ f(x \lvert X>2) \ dx \\&=\int_2^{10} \frac{3}{2048} \ x(100x-20x^2 + x^3) \ dx \\&=\int_2^{10} \frac{3}{2048} \ (100x^2-20x^3 + x^4) \ dx \\&=\frac{47104}{10240}=4.6 \end{aligned}

The second conditional mean is:

$\displaystyle E(X-2 \lvert X>2)=E(X \lvert X>2)-2=2.6$

In contrast, the unconditional mean is:

$\displaystyle E(X)=\int_0^{10} \frac{3}{2500} \ (100x^2-20x^3 + x^4) \ dx=4$

So if the lifetime of a computer is modeled by the density function $f(x)$ given here, the expected lifetime of a brand new computer is 4 years. If you know that a computer has already been in use for 2 years and is in good condition, its expected total lifetime is 4.6 years, of which 2 years have already passed; the expected remaining lifetime is therefore 2.6 years.

Note that the following calculation is not $E(X \lvert X>2)$, though it is a calculation some students may be tempted to make.

$\displaystyle \int_2^{10} x \ f(x) \ dx =\int_2^{10} \frac{3}{2500} \ x(100x-20x^2 + x^3) \ dx=\frac{47104}{12500}=3.76832$

The above calculation does not use the conditional distribution given that $X>2$. Also note that the answer is less than the unconditional mean $E(X)$.
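The integrals in Example 1 can be double-checked numerically. The following sketch uses a simple composite Simpson's rule (our own helper), which is essentially exact for the polynomial integrands here.

```python
def simpson(g, a, b, n=10_000):
    """Composite Simpson's rule with an even number n of subintervals."""
    h = (b - a) / n
    s = g(a) + g(b)
    s += 4 * sum(g(a + (2 * k - 1) * h) for k in range(1, n // 2 + 1))
    s += 2 * sum(g(a + 2 * k * h) for k in range(1, n // 2))
    return s * h / 3

# Density from Example 1, supported on (0, 10).
def f(x):
    return (3 / 2500) * (100 * x - 20 * x ** 2 + x ** 3)

p_a = simpson(f, 2, 10)                               # P(X > 2)
cond_mean = simpson(lambda x: x * f(x), 2, 10) / p_a  # E(X | X > 2)
print(p_a, cond_mean, cond_mean - 2)  # 0.8192, 4.6, 2.6
```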

Example 2 – Exponential Distribution

Work Example 1 again by assuming that the lifetime of the type of computers in question follows the exponential distribution with mean 4 years.

The following is the density function of the lifetime $X$.

$\displaystyle f(x)=0.25 \ e^{-0.25 x}, \ \ \ \ \ \ x>0$

The probability that the computer has survived to age 2 is:

$\displaystyle P(X>2)=\int_2^\infty 0.25 \ e^{-0.25 x} \ dx=e^{-0.25 (2)}=e^{-0.5}$

The conditional density function given that $X>2$ is:

$\displaystyle f(x \lvert X>2)= \frac{0.25 \ e^{-0.25 x}}{e^{-0.25 (2)}}=0.25 \ e^{-0.25 (x-2)}, \ \ \ \ \ \ \ x>2$

To compute the conditional mean $E(X \lvert X>2)$, we have

\displaystyle \begin{aligned} E(X \lvert X>2)&=\int_2^\infty x \ f(x \lvert X>2) \ dx \\&=\int_2^\infty 0.25 \ x \ e^{-0.25 (x-2)} \ dx \\&=\int_0^\infty 0.25 \ (u+2) \ e^{-0.25 u} \ du \ \ \ (\text{change of variable}) \\&=\int_0^\infty 0.25 \ u \ e^{-0.25 u} \ du+2\int_0^\infty 0.25 \ e^{-0.25 u} \ du \\&=\frac{1}{0.25}+2=4+2=6\end{aligned}

Then $\displaystyle E(X-2 \lvert X>2)=E(X \lvert X>2)-2=6-2=4$.

We have an interesting result here. The expected lifetime of a brand new computer is 4 years. Yet the expected remaining lifetime of a 2-year old computer is still 4 years! This is the memoryless (no-memory) property of the exponential distribution: if the lifetime of a type of machine is exponentially distributed, then no matter how old the machine is, the expected remaining lifetime is always the same as the unconditional mean. This indicates that the exponential distribution is not appropriate for modeling the lifetimes of machines or biological lives that wear out over time.
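The memoryless property can also be seen in simulation: among exponential lifetimes with mean 4 that survive past age 2, the average remaining life is still about 4 years. The seed and sample size below are illustrative choices.

```python
import random

random.seed(7)
rate, trials = 0.25, 500_000
# Exponential lifetimes with mean 1/rate = 4 years.
lifetimes = [random.expovariate(rate) for _ in range(trials)]
# Remaining life of the computers that survive past age 2.
survivors = [x - 2 for x in lifetimes if x > 2]
mrl = sum(survivors) / len(survivors)
print(mrl)  # close to 4.0
```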

_____________________________________________________________________________________________________________________________

Mean Residual Life

If a 40-year old man who is a non-smoker wants to purchase a life insurance policy, the insurance company is interested in knowing the expected remaining lifetime of the prospective policyholder. This information will help determine the pricing of the life insurance policy. The expected remaining lifetime of the prospective policyholder is called the mean residual life and is the conditional mean $E(X-t \lvert X>t)$, where $X$ is a model for the lifetime of some life.

In engineering and manufacturing applications, probability modeling of lifetimes of objects (e.g. devices, systems or machines) is known as reliability theory. The mean residual life also plays an important role in such applications.

Thus if the random variable $X$ is a lifetime model (lifetime of a life, system or device), then the conditional mean $E(X-t \lvert X>t)$ is called the mean residual life and is the expected remaining lifetime of the life or system in question given that the life has survived to age $t$.

On the other hand, if the random variable $X$ is a model of insurance losses, then the conditional mean $E(X-t \lvert X>t)$ is the expected claim payment per loss given that the loss has exceeded the deductible of $t$. In this interpretation, the conditional mean $E(X-t \lvert X>t)$ is called the mean excess loss function.

_____________________________________________________________________________________________________________________________

Summary

In conclusion, we summarize the approach for calculating the two conditional means demonstrated in the above examples.

Suppose $X$ is a continuous random variable with the support being $(0,\infty)$ (the positive real numbers), with $f(x)$ being the density function. The following is the density function of the conditional probability distribution given that $X>t$.

$\displaystyle f(x \lvert X>t)=\frac{f(x)}{P(X>t)}, \ \ \ \ \ \ \ \ \ x>t$

Then we have the two conditional means:

$\displaystyle E(X \lvert X>t)=\int_t^\infty x \ f(x \lvert X>t) \ dx=\int_t^\infty x \ \frac{f(x)}{P(X>t)} \ dx$

$\displaystyle E(X-t \lvert X>t)=\int_t^\infty (x-t) \ f(x \lvert X>t) \ dx=\int_t^\infty (x-t) \ \frac{f(x)}{P(X>t)} \ dx$

If $E(X \lvert X>t)$ is calculated first (or is easier to calculate), then $E(X-t \lvert X>t)=E(X \lvert X>t)-t$, as shown in the above examples.

If $X$ is a discrete random variable, then the integrals are replaced by summation symbols. As indicated above, the conditional mean $E(X-t \lvert X>t)$ is called the mean residual life when $X$ is a probability model of the lifetime of some system or life.
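The recipe in this summary can be packaged as a small numerical routine. The midpoint rule, the truncation point `upper`, and the exponential example below are our own illustrative choices; the routine assumes the tail of the density beyond `upper` is negligible.

```python
import math

def mean_residual_life(f, t, upper, n=20_000):
    """Midpoint-rule approximation of E(X - t | X > t) for a density f
    supported on (0, infinity), truncating the integrals at `upper`."""
    h = (upper - t) / n
    xs = [t + (i + 0.5) * h for i in range(n)]
    survival = sum(f(x) for x in xs) * h          # approximates P(X > t)
    numer = sum((x - t) * f(x) for x in xs) * h   # approximates the numerator
    return numer / survival

# Exponential with mean 4: memoryless, so the answer is about 4 for any t.
expo = lambda x: 0.25 * math.exp(-0.25 * x)
print(mean_residual_life(expo, 2, 200))  # approximately 4.0
```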

_____________________________________________________________________________________________________________________________

Practice Problems

Practice problems are found in the companion blog.

_____________________________________________________________________________________________________________________________

$\copyright \ \text{2013 by Dan Ma}$