The skewness of a probability distribution

In this post, we discuss how to calculate the moment coefficient of skewness and also discuss some issues surrounding the notion of skewness.

________________________________________________________________________

Looking at graphs

One informal but useful way of checking the skewness of a distribution is to look at the density curve (or a histogram). Consider the following density functions.

    Figure 1
    Right skewed

    Figure 2
    Left skewed

The density curve in Figure 1 has a longer tail to the right than to the left. The example in Figure 1 is a distribution that is skewed to the right. It is also said to be positively skewed since its coefficient of skewness is positive. The density curve in Figure 2 has a longer tail to the left than to the right. The example in Figure 2 is a distribution that is skewed to the left. It is also said to be negatively skewed since the skewness coefficient is negative. If a density curve looks the same to the left and to the right (such as the bell curve for the normal distribution), then it is a symmetric distribution and the skewness coefficient is zero.

The distribution in Figure 1 is a right skewed distribution (the longer tail is on the right). It is a gamma distribution with mean 2 and median approximately 1.678347. The mode (the highest peak) is at x = 1. The distribution in Figure 2 is a left skewed distribution (the longer tail is on the left) with mean and median approximately 0.909 and 0.9213562, respectively. The mode is at 0.95.

________________________________________________________________________

Common conception about skewness

In the distribution for Figure 1, we can say that “mode < median < mean". In the distribution for Figure 2, we can say that "mean < median < mode". A common conception is that these simple rules characterize all skewed distribution, i.e., the mean is to the right of the median, which in turn is to the right of the mode in a right skewed distribution and that the mean is to the left of the median, which in turn is to the left of the mode in a left skewed distribution. Such rules are certainly easy to remember and are stated in some statistics textbooks. In the above two figures, this rule of thumb is certainly true. It turns out that this rule of thumb does not hold in many instances. The above two graphs are "textbook" demonstrations of skewness. They are gamma distributions and beta distributions and they behave well according to the usual notion of how skewed distributions should look like. In a later section of this post, we will discuss this issue in greater details. First we define the coefficient of skewness.

________________________________________________________________________

Pearson moment coefficient of skewness

The measure of skewness defined here is called the Pearson moment coefficient of skewness. This measure provides information about the amount and direction of the departure from symmetry. Its value can be positive or negative, or even undefined. The higher the absolute value of the skewness measure, the more asymmetric the distribution. The skewness measure of symmetric distributions is, or near, zero.

To help put the definition of skewness in context, we first define raw moments and central moments of a random variable X. The kth raw moment of X is E(X^k), the expected value of the kth power of the random variable X. The first raw moment is the mean of the random variable and is usually denoted by \mu.

The kth central moment of a random variable X is E[(X-\mu)^k], the expected value of the kth power of the deviation of the variable from its mean. The moment E[(X-\mu)^k] is usually denoted by \mu_k. The second central moment is usually called the variance and is denoted by \sigma^2. The square root of \sigma^2, \sigma, is the standard deviation.

The ratio of the standard deviation to the mean, \displaystyle  \frac{\sigma}{\mu}, is called the coefficient of variation.

The ratio of the third central moment to the cube of the standard deviation is called Pearson’s moment coefficient of skewness (or the coefficient of skewness) and is denoted by \gamma_1.

    \displaystyle \gamma_1=\frac{E[ (X-\mu)^3 ]}{\sigma^3}=\frac{\mu_3}{\sigma^3} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)

The skewness in (1) can be expanded to derive a version that can be calculated more easily:

    \displaystyle \begin{aligned} \gamma_1&=\frac{E[ (X-\mu)^3 ]}{\sigma^3} \\&=\frac{E(X^3)-3 \mu E(X^2)+3 \mu^2 E(X)-\mu^3}{\sigma^3} \\&=\frac{E(X^3)-3 \mu [E(X^2)+\mu E(X)]-\mu^3}{\sigma^3} \\&=\frac{E(X^3)-3 \mu \sigma^2-\mu^3}{\sigma^3} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2) \\&=\frac{E(X^3)-3 \mu \sigma^2-\mu^3}{(\sigma^2)^{\frac{3}{2}}} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3) \end{aligned}

The last version (3) is in terms of the first raw moment \mu, the second central moment \sigma^2 and the third raw moment E(X^3). Essentially, the coefficient \gamma_1 can be obtained via (3) by first computing the first three raw moments.

Even though kurtosis is not the focus of this post, we would like to state it to complete the the brief discussion on moments. The ratio of the fourth central moment to the fourth power of the standard deviation, \displaystyle \gamma_2=\frac{\mu_4}{\sigma^4}, is called the kurtosis.

________________________________________________________________________

Examples

In this section, we discuss the skewness in two familiar families of continuous distributions – gamma and beta. We also demonstrate how exponentiation can affect skewness.

Example 1Gamma Distribution
The following is the probability density function of the gamma distribution.

    \displaystyle f(x)=\frac{\beta^\alpha}{\Gamma(\alpha)} \ x^{\alpha-1} \ e^{-\beta x} \ \ \ \ \ \ \ \ \ x>0

where \Gamma(\cdot) is the gamma function, and \alpha and \beta are parameters such that \alpha>0 and \beta>0. The number \alpha is the shape parameter and the number \beta here is the rate parameter. Figure 1 shows the gamma distribution with \alpha=2 and \beta=1. When \alpha=1, we obtain the exponential distribution. When \beta=\frac{1}{2} and \alpha=\frac{k}{2} where k is a positive integer, we obtain the chi square distribution with k degrees of freedom.

Let X be a random variable with the above gamma density function. The raw moments E(X^k), where k=1,2,3,\cdots, are:

    \displaystyle E(X^k)=\frac{(\alpha+k-1)(\alpha+k-2) \cdots \alpha}{\beta^k}

Using the first two raw moments to calculate the variance as well as the third moment, the following calculates the moment coefficient of skewness, based on the form in (3):

    \displaystyle \gamma_1=\frac{\displaystyle \frac{(\alpha+2)(\alpha+1)\alpha}{\beta^3}-3 \frac{\alpha}{\beta} \frac{\alpha}{\beta^3}-\frac{\alpha^3}{\beta^3}}{\biggl( \displaystyle \frac{\alpha}{\beta^2} \biggr)^{\frac{3}{2}}}=\frac{2}{\sqrt{\alpha}}

The above calculation shows that the rate parameter \beta has no effect on skewness. The example in Figure 1 has \alpha=2, giving a coefficient of skewness of \sqrt{2} = 1.414213562. In general, the gamma distribution is skewed positively. However, the gamma distribution becomes more and more symmetric as the shape parameter \alpha \rightarrow \infty. The following graph the gamma densities for \alpha=1, 2, 3, 5, 6 and \beta=1.

    Figure 3
    Gamma densities alpha 1-6

In Figure 3, the light blue density with \alpha=1 is an exponential distribution. The red one with \alpha=2 is the density in Figure 1. With \alpha=6, the gamma density already looks very symmetric (the dark blue).

On the other hand, as the shape parameter \alpha \rightarrow 0, the gamma distribution becomes increasingly positively skewed. When \alpha=\frac{1}{n}, \gamma_1=2 \sqrt{n}. When n \rightarrow \infty, \gamma_1 \rightarrow \infty.

Example 2Beta Distribution
The following is the PDF of a beta distribution:

    \displaystyle f(x)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \Gamma(\beta)} \ x^{\alpha-1} \ (1-x)^{\beta-1} \ \ \ \ \ \ \ \ \ \ \ \ 0<x<1

where \Gamma(\cdot) is the gamma function, and \alpha and \beta are parameters such that \alpha>0 and \beta>0. Both \alpha and \beta are shape parameters.

In the beta family of distributions, the skewness can range from positive to negative. If the \alpha parameter dominates (i.e. x is to a higher power and 1-x is to a small power in the density function), then the beta distribution has a negative skew (skewed to the left). This is because the function x^n has left skew and the function (1-x)^n has right skew. Then the skewness of the beta distribution follows the one that dominates. If the \beta parameter dominates, the beta distribution is skewed to the right. If both parameters are roughly equal, the beta distribution is close to symmetric. For example when \alpha=20 and \beta=2, the beta distribution is left skewed (its density curve is in Figure 2). As in the gamma case, the skewness of the beta distribution has a close form. The following formula confirms the intuition about the skewness of the beta distribution (found here).

    \displaystyle \gamma_1=\frac{2(\beta-\alpha) \ \sqrt{\alpha+\beta+1}}{(\alpha+\beta+2) \ \sqrt{\alpha \ \beta}}

Thus the beta distribution with \alpha=20 and \beta=2 has skewness coefficient -1.137431317. The following figure further demonstrates the role of the shape parameters play in changing the skewness of the beta distribution.

    Figure 4
    Uniform densities

In Figure 4, as the \alpha parameter goes from 2 to 20, the skewness goes from 1.137431317 to 0.659393193 to 0 to -0.659393193 to -1.137431317.

Example 3Exponentiation
Symmetric distributions have zero coefficient of skewness. Raising a symmetric distribution to a positive power can produce a skewed distribution. For example, let X be the standard normal random variable (mean 0 and variance 1). Let Y=X^2. Then Y has a chi-square distribution with 1 degree of freedom, which means that it is a gamma distribution with \alpha=\frac{1}{2} and \beta=\frac{1}{2}. According to Example 1 above, the skewness coefficient is \frac{2}{\sqrt{0.5}}=2 \sqrt{2}=2.828. Thus squaring a standard normal distribution produces a very strongly positively skewed distribution.

Example 4Exponentiation
When raising a positively skewed distribution to positive power can produce a more strongly positively skewed distribution. For example, let X be an exponential random variable. Example 1 shows that exponential distributions have skewness coefficient 2. We show that the coefficient of skewness for Y=X^2 is approximately 6.619.

The density function for the exponential random variable X is f(x)=\beta e^{-\beta x} where \beta>0 is the rate parameter. It can be shown that the raw moments of X is:

    \displaystyle E(X^k)=\frac{k!}{\beta^k} \ \ \ \ \ \ \ \ \ \ \ k=1,2,3,\cdots

Then the first three moments of Y are:

    \displaystyle E(Y)=E(X^2)=\frac{2}{\beta^2}

    \displaystyle E(Y^2)=E(X^4)=\frac{24}{\beta^4}

    \displaystyle E(Y^3)=E(X^6)=\frac{720}{\beta^6}

With the first two raw moments, calculate the variance of Y. Then compute \gamma_1 via formula (3).

    \displaystyle \gamma_1=\frac{\displaystyle \frac{720}{\beta^6}-3 \ \frac{2}{\beta^2} \ \frac{20}{\beta^4}-\frac{8}{\beta^6}}{\biggl( \displaystyle \frac{20}{\beta^4} \biggr)^{\frac{3}{2}}}=\frac{74}{5 \sqrt{5}}=6.61876

Example 5Exponentiation
Raising a left skewed distribution to a positive power can produce a distribution that is less left skewed. The use of increasing exponents eventually produces a positively skewed distribution. Let X be the beta random variable with \alpha=5 and \beta=1. The density function for X is f(x)=5x^4 where 0<x<1. Using the formula shown in Example 2 above, the coefficient of skewness is

    \displaystyle \gamma_1=\frac{2(1-5) \sqrt{5+1+1}}{(5+1+2) \sqrt{5}}=-1.183215957

We wish to calculate the coefficient of skewness for X^2. To do that, it will be helpful to have a formula for the raw moments of X. It is easy to verify that:

    \displaystyle E(X^k)=\frac{5}{5+k} \ \ \ \ \ \ \ \ \ k=1,2,3,\cdots

The first three moments of Y=X^2 are:

    \displaystyle E(Y)=E(X^2)=\frac{5}{7}

    \displaystyle E(Y^2)=E(X^4)=\frac{5}{9}

    \displaystyle E(Y^3)=E(X^6)=\frac{5}{11}

Via formula (3), the following is the coefficient of skewness for Y=X^2.

    \displaystyle \gamma_1=\frac{\displaystyle \frac{5}{11}-3 \ \frac{5}{7} \ \frac{20}{441}-\frac{125}{7^3}}{\biggl( \displaystyle \frac{20}{441} \biggr)^{\frac{3}{2}}}=\frac{-18}{11 \sqrt{5}}=-0.731804065

In this example, squaring the beta distribution with skewness -1.1832 produces a distribution a negatively skewed distribution but with a smaller skew. Let’s raise X to higher powers. The following shows the results:

    \displaystyle X^3 \ \ \ \ \ \ \gamma_1=\frac{-2 \sqrt{11}}{7 \sqrt{5}}=-0.423782771

    \displaystyle X^4 \ \ \ \ \ \ \gamma_1=\frac{-2 \sqrt{13}}{17 \sqrt{5}}=-0.189700182

    \displaystyle X^5 \ \ \ \ \ \ \gamma_1=6 \sqrt{2}=8.485281374

Raising the beta distribution with \alpha=5 and \beta=1 to higher powers eventually produces a positively skewed distribution. This is an interesting example, though this observation probably should not be taken as a rule.

________________________________________________________________________

Counterexamples

All the examples discussed previously are good “textbook” examples in that they help build intuition on how skewness behaves in familiar distributions. However, it is also easy to take the wrong lessons from these examples. The above examples can serve as good introduction to the topic of skewness. It is also important to attempt to provide a caveat that some of the commonly drawn lessons are not appropriate in all circumstances.

As indicated earlier, one wrong lesson from Figure 1 and Figure 2 is that a density curve such as Figure 1 may suggest that “mode < median < mean" for a right skewed distribution and that Figure 2 may suggest that "mean < median < mode" for a left skewed distribution. In both Figure 1 and Figure 2, the mean is further out in the long tail than the median. In certain textbooks, these two observations are even stated as characterizations of right skew and left skew. Such a rule of thumb is easy to state and easy to apply. For some students, such rule provides a lot of clarity about how skewness should work. For such students, checking for skewness is simply a matter of finding the relative position of the mean and median (e.g. in such thinking, if mean is greater than the median, then it is a right skew).

Any discussion of skewness should point out that the simple rule described in the above paragraph, though useful in many situations, is imperfect and may not apply outside of certain familiar distributions. For a good discussion on this issue, see this article.

We highlight one example found in the article mentioned above. This example demonstrates a clear violation of the common misconception indicated above. The following is the density function of the example.

    \displaystyle f_p(x)=\left\{\begin{matrix} \displaystyle (1-p)\biggl(1+\frac{1-p}{2p} x \biggr)&\ \ \ \ \ \ \frac{-2p}{1-p} \le x \le 0 \\{\text{ }}& \\ (1-p) e^{-x}&\ \ \ \ \ \ x>0   \end{matrix}\right.

where 0<p<1. To facilitate the discussion, let X_p be the random variable whose PDF is f_p(x) defined above. The above density function is a juxtaposition of a triangular density and an exponential density. This triangular-exponential distribution has positive coefficient of skewness when 0<p<0.755. Yet within this range for p, the mean can be made to be on either side of the median. We consider three cases where p=0.7, p=0.6 and p=0.9.

Example 6
First the case p=0.7.

    \displaystyle f_{0.7}(x)=\left\{\begin{matrix} \displaystyle 0.3\biggl(1+\frac{3}{14} x \biggr)&\ \ \ \ \ \ \frac{-14}{3} \le x \le 0 \\{\text{ }}& \\ 0.3 e^{-x}&\ \ \ \ \ \ x>0   \end{matrix}\right.

The following is the graph of the density curve f_{0.7}(x). The right tail is long since the exponential distribution is on the right side. However, the left side is heavier (with 70% of the weight on the triangle on the left side).

    Figure 5
    Triangular-exponential 1

The following shows the results for the density function f_{0.7}(x).

    \displaystyle E(X_{0.7})=\frac{-21.3}{27}=-0.7889

    \displaystyle \text{median of } X_{0.7} = -0.722613478

    \displaystyle E(X_{0.7}^2)=\frac{254.4}{81}=3.140740741

    \displaystyle Var(X_{0.7})=\frac{1835.91}{3^6}

    \displaystyle E(X_{0.7}^3)=\frac{-2152.2}{405}

    \displaystyle \gamma_1=\frac{111906.63}{5 1835.91^{1.5}}=0.284517335

The calculation confirms the positive skew (0.2845), which is a moderately strong positive skewness. Note that the mean is to the left of the median. Both the mean and median are to the left of the mode (at x = 0). In Figure 5, the right side is infinitely long, thus a positively skewed distribution (and is confirmed by the calculation of \gamma_1). According to the common notion of how right skew should work, the mean should be further out on the right tail. But this is not the case. The mean is further out on the left side than the median. The violation of the common conception of skewness can occur when one tail is long but the other side is heavier.

Example 7
Now the case p=0.6.

    \displaystyle f_{0.6}(x)=\left\{\begin{matrix} \displaystyle 0.4\biggl(1+\frac{1}{3} x \biggr)&\ \ \ \ \ \ -3 \le x \le 0 \\{\text{ }}& \\ 0.4 e^{-x}&\ \ \ \ \ \ x>0   \end{matrix}\right.

The following is the graph of the density curve f_{0.6}(x). The right tail is long since the exponential distribution is on the right side. The left side is still heavy but a little less heavier than in the previous example (with 60% of the weight on the triangle on the left side).

    Figure 6
    Triangular-exponential 2

The following shows the results for the density function f_{0.6}(x).

    \displaystyle E(X_{0.6})=-0.2

    \displaystyle \text{median of } X_{0.6} = -0.261387212

    \displaystyle E(X_{0.6}^2)=1.7

    \displaystyle Var(X_{0.6})=1.66

    \displaystyle E(X_{0.6}^3)=0.78

    \displaystyle \gamma_1=0.834128035

The density curve f_{0.6}(x) has a stronger positive skew than the previous example as there is a little more weight on the exponential side (the right side). Even though the mean in this case is to the right of the median, both the mean and median are not on the right tail but on the left triangular side (the heavier side). In any case, the mean is definitely not further out on the longer tail (the right tail) as the common rule of thumb would suggest.

Both Example 6 and Example 7 are right skewed distributions that do not conform to the common expectation about right skewed distributions. The following example will dispel the notion about the direction of the skew.

Example 8
Here we use p=0.9 so that there is a still a long right tail but 90% of the weight is on the other side.

    \displaystyle f_{0.9}(x)=\left\{\begin{matrix} \displaystyle 0.1\biggl(1+\frac{1}{18} x \biggr)&\ \ \ \ \ \ -18 \le x \le 0 \\{\text{ }}& \\ 0.1 e^{-x}&\ \ \ \ \ \ x>0   \end{matrix}\right.

The overall shape of the f_{0.9}(x) is similar to Figure 5 and Figure 6. The following shows the results for the density function f_{0.9}(x).

    \displaystyle E(X_{0.9})=-5.3

    \displaystyle E(X_{0.9}^2)=48.8

    \displaystyle Var(X_{0.9})=20.71

    \displaystyle E(X_{0.9}^3)=-524.28

    \displaystyle \gamma_1=-0.489285839

Because there is so little weight on the right tail, the skewness is actually negative (-0.48928). Here we have a right skewed looking distribution that is actually skewed to the left!

________________________________________________________________________

Remarks

Examples 5 through 7 demonstrate that when one tail is long but the other side is heavy, the common conception of right skew and left skew do not apply. The common conception, as discussed earlier, is that the both the mean and the median are located in the longer tail and that the mean is further out in the long tail than the median. The article mentioned earlier is easy to read and gives a fuller discussion of the issues when dealing with the notion of skewness. The common conception of skewness can be easily violated in discrete distributions, especially when the weights on both sides of the median are not equal. All the above examples are unimodal distributions. According the quoted article, bimodal or multimodal distributions can be problematic too.

Of course, the caveat presented here is not meant to discourage anyone from discussing the common conception about skewness. The common conception about the locations of mode, mean and median conveys useful intuition and we should continue to focus on it. But the common rule of thumb should definitely be not be presented as gospel truth as some textbooks had done. Instead, it should be pointed out that the common rule of thumb is imperfect and it would be helpful to have a discussion why the rule is imperfect.

________________________________________________________________________

Practice problems

Practice problems to reinforce the calculation are found in the companion blog to this blog.

________________________________________________________________________
\copyright \ \text{2015 by Dan Ma}

Relating Binomial and Negative Binomial

The negative binomial distribution has a natural intepretation as a waiting time until the arrival of the rth success (when the parameter r is a positive integer). The waiting time refers to the number of independent Bernoulli trials needed to reach the rth success. This interpretation of the negative binomial distribution gives us a good way of relating it to the binomial distribution. For example, if the rth success takes place after k failed Bernoulli trials (for a total of k+r trials), then there can be at most r-1 successes in the first k+r trials. This tells us that the survival function of the negative binomial distribution is the cumulative distribution function (cdf) of a binomial distribution. In this post, we gives the details behind this observation. A previous post on the negative binomial distribution is found here.

A random experiment resulting in two distinct outcomes (success or failure) is called a Bernoulli trial (e.g. head or tail in a coin toss, whether or not the birthday of a customer is the first of January, whether an insurance claim is above or below a given threshold etc). Suppose a series of independent Bernoulli trials are performed until reaching the rth success where the probability of success in each trial is p. Let X_r be the number of failures before the occurrence of the rth success. The following is the probablity mass function of X_r.

\displaystyle (1) \ \ \ \ P(X_r=k)=\binom{k+r-1}{k} p^r (1-p)^k \ \ \ \ \ \ k=0,1,2,3,\cdots

Be definition, the survival function and cdf of X_r are:

\displaystyle (2) \ \ \ \ P(X_r > k)=\sum \limits_{j=k+1}^\infty \binom{j+r-1}{j} p^r (1-p)^j \ \ \ \ \ \ k=0,1,2,3,\cdots

\displaystyle (3) \ \ \ \ P(X_r \le k)=\sum \limits_{j=0}^k \binom{j+r-1}{j} p^r (1-p)^j \ \ \ \ \ \ k=0,1,2,3,\cdots

For each positive integer k, let Y_{r+k} be the number of successes in performing a sequence of r+k independent Bernoulli trials where p is the probability of success. In other words, Y_{r+k} has a binomial distribution with parameters r+k and p.

If the random experiment requires more than k failures to reach the rth success, there are at most r-1 successes in the first k+r trails. Thus the survival function of X_r is the same as the cdf of a binomial distribution. Equivalently, the cdf of X_r is the same as the survival function of a binomial distribution. We have the following:

\displaystyle \begin{aligned}(4) \ \ \ \ P(X_r > k)&=P(Y_{k+r} \le r-1) \\&=\sum \limits_{j=0}^{r-1} \binom{k+r}{j} p^j (1-p)^{k+r-j} \ \ \ \ \ \ k=0,1,2,3,\cdots \end{aligned}

\displaystyle \begin{aligned}(5) \ \ \ \ P(X_r \le k)&=P(Y_{k+r} > r-1) \ \ \ \ \ \ k=0,1,2,3,\cdots \end{aligned}

Remark
The relation (4) is analogous to the relationship between the Gamma distribution and the Poisson distribution. Recall that a Gamma distribution with shape parameter \alpha and scale parameter n, where n is a positive integer, can be interpreted as the waiting time until the nth change in a Poisson process. Thus, if the nth change takes place after time t, there can be at most n-1 arrivals in the time interval [0,t]. Thus the survival function of this Gamma distribution is the same as the cdf of a Poisson distribution. The relation (4) is analogous to the following relation.

\displaystyle (5) \ \ \ \ \int_t^\infty \frac{\alpha^n}{(n-1)!} \ x^{n-1} \ e^{-\alpha x} \ dx=\sum \limits_{j=0}^{n-1} \frac{e^{-\alpha t} \ (\alpha t)^j}{j!}

A previous post on the negative binomial distribution is found here.

The Negative Binomial Distribution

A counting distribution is a discrete distribution with probabilities only on the nonnegative integers. Such distributions are important in insurance applications since they can be used to model the number of events such as losses to the insured or claims to the insurer. Though playing a prominent role in statistical theory, the Poisson distribution is not appropriate in all situations, since it requires that the mean and the variance are equaled. Thus the negative binomial distribution is an excellent alternative to the Poisson distribution, especially in the cases where the observed variance is greater than the observed mean.

The negative binomial distribution arises naturally from a probability experiment of performing a series of independent Bernoulli trials until the occurrence of the rth success where r is a positive integer. From this starting point, we discuss three ways to define the distribution. We then discuss several basic properties of the negative binomial distribution. Emphasis is placed on the close connection between the Poisson distribution and the negative binomial distribution.

________________________________________________________________________

Definitions
We define three versions of the negative binomial distribution. The first two versions arise from the view point of performing a series of independent Bernoulli trials until the rth success where r is a positive integer. A Bernoulli trial is a probability experiment whose outcome is random such that there are two possible outcomes (success or failure).

Let X_1 be the number of Bernoulli trials required for the rth success to occur where r is a positive integer. Let p is the probability of success in each trial. The following is the probability function of X_1:

\displaystyle (1) \ \ \ \ \ P(X_1=x)= \binom{x-1}{r-1} p^r (1-p)^{x-r} \ \ \ \ \ \ \ x=r,r+1,r+2,\cdots

The idea for (1) is that for X_1=x to happen, there must be r-1 successes in the first x-1 trials and one additional success occurring in the last trial (the xth trial).

A more common version of the negative binomial distribution is the number of Bernoulli trials in excess of r in order to produce the rth success. In other words, we consider the number of failures before the occurrence of the rth success. Let X_2 be this random variable. The following is the probability function of X_2:

\displaystyle (2) \ \ \ \ \ P(X_2=x)=\binom{x+r-1}{x} p^r (1-p)^x \ \ \ \ \ \ \ x=0,1,2,\cdots

The idea for (2) is that there are x+r trials and in the first x+r-1 trials, there are x failures (or equivalently r-1 successes).

In both (1) and (2), the binomial coefficient is defined by

\displaystyle (3) \ \ \ \ \ \binom{y}{k}=\frac{y!}{k! \ (y-k)!}=\frac{y(y-1) \cdots (y-(k-1))}{k!}

where y is a positive integer and k is a nonnegative integer. However, the right-hand-side of (3) can be calculated even if y is not a positive integer. Thus the binomial coefficient \displaystyle \binom{y}{k} can be expanded to work for all real number y. However k must still be nonnegative integer.

\displaystyle (4) \ \ \ \ \ \binom{y}{k}=\frac{y(y-1) \cdots (y-(k-1))}{k!}

For convenience, we let \displaystyle \binom{y}{0}=1. When the real number y>k-1, the binomial coefficient in (4) can be expressed as:

\displaystyle (5) \ \ \ \ \ \binom{y}{k}=\frac{\Gamma(y+1)}{\Gamma(k+1) \Gamma(y-k+1)}

where \Gamma(\cdot) is the gamma function.

With the more relaxed notion of binomial coefficient, the probability function in (2) above can be defined for all real number r. Thus the general version of the negative binomial distribution has two parameters r and p, both real numbers, such that 0<p<1. The following is its probability function.

\displaystyle (6) \ \ \ \ \ P(X=x)=\binom{x+r-1}{x} p^r (1-p)^x \ \ \ \ \ \ \ x=0,1,2,\cdots

Whenever r in (6) is a real number that is not a positive integer, the interpretation of counting the number of failures until the occurrence of the rth success is no longer important. Instead we can think of it simply as a count distribution.

The following alternative parametrization of the negative binomial distribution is also useful.

\displaystyle (6a) \ \ \ \ \ P(X=x)=\binom{x+r-1}{x} \biggl(\frac{\alpha}{\alpha+1}\biggr)^r \biggl(\frac{1}{\alpha+1}\biggr)^x \ \ \ \ \ \ \ x=0,1,2,\cdots

The parameters in this alternative parametrization are r and \alpha>0. Clearly, the ratio \frac{\alpha}{\alpha+1} takes the place of p in (6). Unless stated otherwise, we use the parametrization of (6).
________________________________________________________________________

What is negative about the negative binomial distribution?
What is negative about this distribution? What is binomial about this distribution? The name is suggested by the fact that the binomial coefficient in (6) can be rearranged as follows:

\displaystyle \begin{aligned}(7) \ \ \ \ \ \binom{x+r-1}{x}&=\frac{(x+r-1)(x+r-2) \cdots r}{x!} \\&=(-1)^x \frac{(-r-(x-1))(-r-(x-2)) \cdots (-r)}{x!} \\&=(-1)^x \frac{(-r)(-r-1) \cdots (-r-(x-1))}{x!} \\&=(-1)^x \binom{-r}{x} \end{aligned}

The calculation in (7) can be used to verify that (6) is indeed a probability function, that is, all the probabilities sum to 1.

\displaystyle \begin{aligned}(8) \ \ \ \ \ 1&=p^r p^{-r}\\&=p^r (1-q)^{-r} \\&=p^r \sum \limits_{x=0}^\infty \binom{-r}{x} (-q)^x \ \ \ \ \ \ \ \ (8.1) \\&=p^r \sum \limits_{x=0}^\infty (-1)^x \binom{-r}{x} q^x \\&=\sum \limits_{x=0}^\infty \binom{x+r-1}{x} p^r q^x \end{aligned}

In (8), we take q=1-p. The step (8.1) above uses the following formula known as the Newton’s binomial formula.

\displaystyle (9) \ \ \ \ \ (1+t)^w=\sum \limits_{k=0}^\infty \binom{w}{k} t^k

For a detailed discussion of (8) with all the details worked out, see the post called Deriving some facts of the negative binomial distribution.

________________________________________________________________________

The Generating Function
By definition, the following is the generating function of the negative binomial distribution, using :

\displaystyle (10) \ \ \ \ \ g(z)=\sum \limits_{x=0}^\infty \binom{r+x-1}{x} p^r q^x z^x

where q=1-p. Using a similar calculation as in (8), the generating function can be simplified as:

\displaystyle (11) \ \ \ \ \ g(z)=p^r (1-q z)^{-r}=\frac{p^r}{(1-q z)^r}=\frac{p^r}{(1-(1-p) z)^r}; \ \ \ \ \ z<\frac{1}{1-p}

As a result, the moment generating function of the negative binomial distribution is:

\displaystyle (12) \ \ \ \ \ M(t)=\frac{p^r}{(1-(1-p) e^t)^r}; \ \ \ \ \ \ \ t<-ln(1-p)

For a detailed discussion of (12) with all the details worked out, see the post called Deriving some facts of the negative binomial distribution.

________________________________________________________________________

Independent Sum

One useful property of the negative binomial distribution is that the independent sum of negative binomial random variables, all with the same parameter p, also has a negative binomial distribution. Let Y=Y_1+Y_2+\cdots+Y_n be an independent sum such that each X_i has a negative binomial distribution with parameters r_i and p. Then the sum Y=Y_1+Y_2+\cdots+Y_n has a negative binomial distribution with parameters r=r_1+\cdots+r_n and p.

Note that the generating function of an independent sum is the product of the individual generating functions. The following shows that the product of the individual generating functions is of the same form as (11), thus proving the above assertion.

\displaystyle (13) \ \ \ \ \ h(z)=\frac{p^{\sum \limits_{i=1}^n r_i}}{(1-(1-p) z)^{\sum \limits_{i=1}^n r_i}}
________________________________________________________________________

Mean and Variance
The mean and variance can be obtained from the generating function. From E(X)=g'(1) and E(X^2)=g'(1)+g^{(2)}(1), we have:

\displaystyle (14) \ \ \ \ \ E(X)=\frac{r(1-p)}{p} \ \ \ \ \ \ \ \ \ \ \ \ \ Var(X)=\frac{r(1-p)}{p^2}

Note that Var(X)=\frac{1}{p} E(X)>E(X). Thus when the sample data suggest that the variance is greater than the mean, the negative binomial distribution is an excellent alternative to the Poisson distribution. For example, suppose that the sample mean and the sample variance are 3.6 and 7.1. In exploring the possibility of fitting the data using the negative binomial distribution, we would be interested in the negative binomial distribution with this mean and variance. Then plugging these into (14) produces the negative binomial distribution with r=3.7 and p=0.507.
________________________________________________________________________

The Poisson-Gamma Mixture
One important application of the negative binomial distribution is that it is a mixture of a family of Poisson distributions with Gamma mixing weights. Thus the negative binomial distribution can be viewed as a generalization of the Poisson distribution. The negative binomial distribution can be viewed as a Poisson distribution where the Poisson parameter is itself a random variable, distributed according to a Gamma distribution. Thus the negative binomial distribution is known as a Poisson-Gamma mixture.

In an insurance application, the negative binomial distribution can be used as a model for claim frequency when the risks are not homogeneous. Let N has a Poisson distribution with parameter \theta, which can be interpreted as the number of claims in a fixed period of time from an insured in a large pool of insureds. There is uncertainty in the parameter \theta, reflecting the risk characteristic of the insured. Some insureds are poor risks (with large \theta) and some are good risks (with small \theta). Thus the parameter \theta should be regarded as a random variable \Theta. The following is the conditional distribution of the random variable N (conditional on \Theta=\theta):

\displaystyle (15) \ \ \ \ \ P(N=n \lvert \Theta=\theta)=\frac{e^{-\theta} \ \theta^n}{n!} \ \ \ \ \ \ \ \ \ \ n=0,1,2,\cdots

Suppose that \Theta has a Gamma distribution with scale parameter \alpha and shape parameter \beta. The following is the probability density function of \Theta.

\displaystyle (16) \ \ \ \ \ g(\theta)=\frac{\alpha^\beta}{\Gamma(\beta)} \theta^{\beta-1} e^{-\alpha \theta} \ \ \ \ \ \ \ \ \ \ \theta>0

Then the joint density of N and \Theta is:

\displaystyle (17) \ \ \ \ \ P(N=n \lvert \Theta=\theta) \ g(\theta)=\frac{e^{-\theta} \ \theta^n}{n!} \ \frac{\alpha^\beta}{\Gamma(\beta)} \theta^{\beta-1} e^{-\alpha \theta}

The unconditional distribution of N is obtained by summing out \theta in (17).

\displaystyle \begin{aligned}(18) \ \ \ \ \ P(N=n)&=\int_0^\infty P(N=n \lvert \Theta=\theta) \ g(\theta) \ d \theta \\&=\int_0^\infty \frac{e^{-\theta} \ \theta^n}{n!} \ \frac{\alpha^\beta}{\Gamma(\beta)} \ \theta^{\beta-1} \ e^{-\alpha \theta} \ d \theta \\&=\int_0^\infty \frac{\alpha^\beta}{n! \ \Gamma(\beta)} \ \theta^{n+\beta-1} \ e^{-(\alpha+1) \theta} d \theta \\&=\frac{\alpha^\beta}{n! \ \Gamma(\beta)} \ \frac{\Gamma(n+\beta)}{(\alpha+1)^{n+\beta}} \int_0^\infty \frac{(\alpha+1)^{n+\beta}}{\Gamma(n+\beta)} \theta^{n+\beta-1} \ e^{-(\alpha+1) \theta} d \theta \\&=\frac{\alpha^\beta}{n! \ \Gamma(\beta)} \ \frac{\Gamma(n+\beta)}{(\alpha+1)^{n+\beta}} \\&=\frac{\Gamma(n+\beta)}{\Gamma(n+1) \ \Gamma(\beta)} \ \biggl( \frac{\alpha}{\alpha+1}\biggr)^\beta \ \biggl(\frac{1}{\alpha+1}\biggr)^n \\&=\binom{n+\beta-1}{n} \ \biggl( \frac{\alpha}{\alpha+1}\biggr)^\beta \ \biggl(\frac{1}{\alpha+1}\biggr)^n \ \ \ \ \ \ \ \ \ n=0,1,2,\cdots \end{aligned}

Note that the integral in the fourth step in (18) is 1.0 since the integrand is the pdf of a Gamma distribution. The above probability function is that of a negative binomial distribution. It is of the same form as (6a). Equivalently, it is also of the form (6) with parameter r=\beta and p=\frac{\alpha}{\alpha+1}.

The variance of the negative binomial distribution is greater than the mean. In a Poisson distribution, the mean equals the variance. Thus the unconditional distribution of N is more dispersed than its conditional distributions. This is a characteristic of mixture distributions. The uncertainty in the parameter variable \Theta has the effect of increasing the unconditional variance of the mixture distribution of N. The variance of a mixture distribution has two components, the weighted average of the conditional variances and the variance of the conditional means. The second component represents the additional variance introduced by the uncertainty in the parameter \Theta (see The variance of a mixture).

________________________________________________________________________

The Poisson Distribution as Limit of Negative Binomial
There is another connection to the Poisson distribution, that is, the Poisson distribution is a limiting case of the negative binomial distribution. We show that the generating function of the Poisson distribution can be obtained by taking the limit of the negative binomial generating function as r \rightarrow \infty. Interestingly, the Poisson distribution is also the limit of the binomial distribution.

In this section, we use the negative binomial parametrization of (6a). By replacing \frac{\alpha}{\alpha+1} for p, the following are the mean, variance, and the generating function for the probability function in (6a):

\displaystyle \begin{aligned}(19) \ \ \ \ \ \ &E(X)=\frac{r}{\alpha} \\&\text{ }\\&Var(X)=\frac{\alpha+1}{\alpha} \ \frac{r}{\alpha}=\frac{r(\alpha+1)}{\alpha^2} \\&\text{ } \\&g(z)=\frac{1}{[1-\frac{1}{\alpha}(z-1)]^r} \ \ \ \ \ \ \ z<\alpha+1 \end{aligned}

Let r goes to infinity and \displaystyle \frac{1}{\alpha} goes to zero and at the same time keeping their product constant. Thus \displaystyle \mu=\frac{r}{\alpha} is constant (this is the mean of the negative binomial distribution). We show the following:

\displaystyle (20) \ \ \ \ \ \lim \limits_{r \rightarrow \infty} [1-\frac{\mu}{r}(z-1)]^{-r}=e^{\mu (z-1)}

The right-hand side of (20) is the generating function of the Poisson distribution with mean \mu. The generating function in the left-hand side is that of a negative binomial distribution with mean \displaystyle \mu=\frac{r}{\alpha}. The following is the derivation of (20).

\displaystyle \begin{aligned}(21) \ \ \ \ \ \lim \limits_{r \rightarrow \infty} [1-\frac{\mu}{r}(z-1)]^{-r}&=\lim \limits_{r \rightarrow \infty} e^{\displaystyle \biggl(ln[1-\frac{\mu}{r}(z-1)]^{-r}\biggr)} \\&=\lim \limits_{r \rightarrow \infty} e^{\displaystyle \biggl(-r \ ln[1-\frac{\mu}{r}(z-1)]\biggr)} \\&=e^{\displaystyle \biggl(\lim \limits_{r \rightarrow \infty} -r \ ln[1-\frac{\mu}{r}(z-1)]\biggr)} \end{aligned}

We now focus on the limit in the exponent.

\displaystyle \begin{aligned}(22) \ \ \ \ \ \lim \limits_{r \rightarrow \infty} -r \ ln[1-\frac{\mu}{r}(z-1)]&=\lim \limits_{r \rightarrow \infty} \frac{ln(1-\frac{\mu}{r} (z-1))^{-1}}{r^{-1}} \\&=\lim \limits_{r \rightarrow \infty} \frac{(1-\frac{\mu}{r} (z-1)) \ \mu (z-1) r^{-2}}{r^{-2}} \\&=\mu (z-1) \end{aligned}

The middle step in (22) uses the L’Hopital’s Rule. The result in (20) is obtained by combining (21) and (22).

________________________________________________________________________

Reference

  1. Klugman S.A., Panjer H. H., Wilmot G. E. Loss Models, From Data to Decisions, Second Edition., Wiley-Interscience, a John Wiley & Sons, Inc., New York, 2004

________________________________________________________________________
\copyright \ \text{2011-2015 by Dan Ma}

A note on the F-distribution

Some important distributions in statistical applications are of the form T=\frac{X}{Y} where X and Y are independent and Y >0. When X=\frac{U}{m} and Y=\frac{V}{n} where U and V have chi-square distributions with degrees of freedom m and n, respectively, the ratio \displaystyle T has an F distribution and this ratio is called the F-statistic.

Let f(x) and g(y) be the density functions of the independent random variables X and Y, respectively. Let F(x) and G(y) be the distribution functions of X and Y, respectively. Let \displaystyle T=\frac{X}{Y} where Y>0. We have P[T \leq t]=P[X \leq tY]. Integrating over the region of x \leq ty, we obtain the following.

\displaystyle \begin{aligned}F_T(t)=P[X \leq tY]&=\int_0^\infty \int_0^{ty} f(x) \thinspace g(y) \thinspace dx \thinspace dy\\&=\int_0^\infty g(y) \int_0^{ty} f(x) dx \thinspace dy\\&=\int_0^\infty F(ty) \thinspace g(y) \thinspace dy\end{aligned}

Taking the derivative, we obtain \displaystyle f_T(t)=F_T^{'}(t)=\int_0^\infty y \thinspace f(ty) \thinspace g(y) \thinspace dy.

This above derivation of the density function f_T is found in [1].

The F-statistic
Note that a chi-square distribution with k degrees of freedom (denoted by \chi^2(k)) is a gamma distribution with parameters \alpha=\frac{k}{2} and \beta=\frac{1}{2} where k is a positive integer. Suppose X=\frac{U}{m} and Y=\frac{V}{n} where U and V have chi-square distributions with m and n as degrees of freedom, respectively. The following are the pdfs of U and V:

\displaystyle f_U(y)=\frac{0.5^{0.5m}}{\Gamma(0.5m)} y^{0.5m-1} e^{-0.5y}

\displaystyle f_V(y)=\frac{0.5^{0.5n}}{\Gamma(0.5n)} y^{0.5n-1} e^{-0.5y}

In general, if Z=a W where a \neq 0 is a constant, then we have this relationship for the density functions between Z and W: f_Z(z)=f_W(\frac{z}{a}) \frac{1}{a}. Then the following are the density functions of X and Y. Note that X has a gamma distribution with parameters \alpha=\frac{m}{2} and \beta=\frac{m}{2}. For Y, it is a gamma distribution with parameters \alpha=\frac{n}{2} and \beta=\frac{n}{2}.

\displaystyle f_X(y)=f_U(my)m=m \frac{0.5^{0.5m}}{\Gamma(0.5m)} (my)^{0.5m-1} e^{-0.5my}
\displaystyle =\frac{(0.5m)^{0.5m}}{\Gamma(0.5m)} y^{0.5m-1} e^{-(0.5m)y}

\displaystyle f_Y(y)=f_V(ny)n=\frac{(0.5n)^{0.5n}}{\Gamma(0.5n)} y^{0.5n-1} e^{-(0.5n)y}

Using \displaystyle f_T(t)=\int_0^\infty y \thinspace f_X(ty) \thinspace f_Y(y) \thinspace dy, the following is the derivation of the density function of T=\frac{X}{Y}:

\displaystyle f_T(t)=\int_0^{\infty} y \frac{(0.5m)^{0.5m}}{\Gamma(0.5m)} (ty)^{0.5m-1} e^{-(0.5m)ty} \frac{(0.5n)^{0.5n}}{\Gamma(0.5n)} y^{0.5n-1} e^{-(0.5n)y} dy

\displaystyle =\int_0^{\infty} \frac{(0.5m)^{0.5m}}{\Gamma(0.5m)} \frac{(0.5n)^{0.5n}}{\Gamma(0.5n)} t^{0.5m-1} y^{0.5m+0.5n-1} e^{-(0.5mt+0.5n)y} dy

\displaystyle =\frac{(0.5m)^{0.5m}}{\Gamma(0.5m)} \frac{(0.5n)^{0.5n}}{\Gamma(0.5n)} t^{0.5m-1} \frac{\Gamma(0.5m+0.5n)}{(0.5mt+0.5n)^{0.5mt+0.5n}} \times

\displaystyle \int_0^{\infty} \frac{(0.5mt+0.5n)^{0.5m+0.5n}}{\Gamma(0.5m+0.5n)}y^{0.5m+0.5n-1} e^{-(0.5mt+0.5n)y} dy

\displaystyle =\frac{(0.5m)^{0.5m}}{\Gamma(0.5m)} \frac{(0.5n)^{0.5n}}{\Gamma(0.5n)} t^{0.5m-1} \frac{\Gamma(0.5m+0.5n)}{(0.5mt+0.5n)^{0.5m+0.5n}}

Simplifying the above, we obtain:

\displaystyle f_T(t)=\frac{\Gamma(\frac{m+n}{2})}{\Gamma(\frac{m}{2}) \Gamma(\frac{n}{2})} \thinspace \frac{(\frac{m}{n})^{\frac{m}{2}} \thinspace t^{\frac{m}{2}-1}}{(\frac{m}{n}t+1)^{\frac{m+n}{2}}}

The distributon derived above is said to be an F distribution with m and n degrees of freedom. The first parameter is the degrees of freedom in the numerator and the second parameter is the degrees of freedom in the denominator.

Reference

  1. Feller W., An Introduction to Probability Theory and Its Applications, Vol II, Second Edition, John Wiley & Sons (1971)