Deriving some facts of the negative binomial distribution

The previous post called The Negative Binomial Distribution gives a fairly comprehensive discussion of the negative binomial distribution. In this post, we fill in some of the details that are glossed over in that previous post. We cover the following points:

  • Discuss the several versions of the negative binomial distribution.
  • Show that the negative binomial probabilities sum to one, i.e., that the negative binomial probability function is a valid one.
  • Derive the moment generating function of the negative binomial distribution.
  • Derive the first and second moments and the variance of the negative binomial distribution.
  • Make an observation about the independent sum of negative binomial distributions.

________________________________________________________________________

Three versions

The negative binomial distribution has two parameters r and p, where r is a positive real number and 0<p<1. The first two versions arise from the case where r is a positive integer, which can be interpreted as the random experiment of performing a sequence of independent Bernoulli trials until the rth success (each trial having the same probability of success p). In this interpretation, there are two ways of recording the random experiment:

    X = the number of Bernoulli trials required to get the rth success.
    Y = the number of Bernoulli trials that end in failure before getting the rth success.

The other parameter p is the probability of success in each Bernoulli trial. The notation \binom{m}{n} denotes the binomial coefficient; for non-negative integers m and n with m \ge n, it is defined as:

    \displaystyle \binom{m}{n}=\frac{m!}{n! \ (m-n)!}=\frac{m(m-1) \cdots (m-(n-1))}{n!} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (0)

With this in mind, the following are the probability functions of the random variables X and Y.

    \displaystyle P(X=x)= \binom{x-1}{r-1} p^r (1-p)^{x-r} \ \ \ \ \ \ \ x=r,r+1,r+2,\cdots \ \ \ \ \ \ \ (1)

    \displaystyle P(Y=y)=\binom{y+r-1}{y} p^r (1-p)^y \ \ \ \ \ \ \ y=0,1,2,\cdots \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)

The thought process for (1) is that for the event X=x to happen, there can only be r-1 successes in the first x-1 trials and one additional success occurring in the last trial (the xth trial). The thought process for (2) is that for the event Y=y to happen, there are y+r trials (y failures and r successes). In the first y+r-1 trials, there can be only y failures (or equivalently r-1 successes). Note that X=Y+r. Thus the mean of X follows immediately from the mean of Y, a fact we will use below.
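The thought processes for (1) and (2), and the relation X = Y + r, can be spot-checked numerically. The following is a minimal sketch in Python; the helper names pmf_X and pmf_Y are our own, not from the post.

```python
# A numerical check of (1) and (2) and the relation X = Y + r.
# pmf_X and pmf_Y are our own helper names.
from math import comb

def pmf_X(x, r, p):
    # P(X = x): r-1 successes in the first x-1 trials, then a success
    return comb(x - 1, r - 1) * p**r * (1 - p)**(x - r)

def pmf_Y(y, r, p):
    # P(Y = y): exactly y failures before the r-th success
    return comb(y + r - 1, y) * p**r * (1 - p)**y

r, p = 3, 0.4
# Since X = Y + r, the two probability functions agree after a shift.
for y in range(10):
    assert abs(pmf_X(y + r, r, p) - pmf_Y(y, r, p)) < 1e-12
```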

Instead of memorizing the probability functions (1) and (2), it is better to understand and remember the thought processes involved. Because of the natural interpretation of performing Bernoulli trials until the rth success, it is a good idea to introduce the negative binomial distribution via the distributions described by (1) and (2), i.e., the case where the parameter r is a positive integer. When r=1, the random experiment is a sequence of independent Bernoulli trials until the first success (this is called the geometric distribution).

Of course, (1) and (2) can also simply be used as counting distributions without any connection with a series of Bernoulli trials (e.g. used in an insurance context as the number of losses or claims arising from a group of insurance policies).

The binomial coefficient in (0) is defined when both numbers are non-negative integers and the top one is greater than or equal to the bottom one. However, the rightmost term in (0) can be calculated even when the top number m is not a non-negative integer. Thus when m is any real number, the rightmost term in (0) can be calculated provided that the bottom number n is a positive integer. For convenience we define \binom{m}{0}=1. With this in mind, the binomial coefficient \binom{m}{n} is defined for any real number m and any non-negative integer n.

The third version of the negative binomial distribution arises from the relaxation of the binomial coefficient \binom{m}{n} just discussed. With this in mind, the probability function in (2) can be defined for any positive real number r:

    \displaystyle P(Y=y)=\binom{y+r-1}{y} p^r (1-p)^y \ \ \ \ \ \ \ y=0,1,2,\cdots \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3)

where \displaystyle \binom{y+r-1}{y}=\frac{(y+r-1)(y+r-2) \cdots (r+1)r}{y!}.

Of course when r is a positive integer, versions (2) and (3) are identical. When r is a positive real number but is not an integer, the distribution cannot be interpreted as the number of failures until the occurrence of rth success. Instead, it is used as a counting distribution.
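When r is not an integer, the binomial coefficient in (3) must be computed from the product form displayed above. A small sketch in Python (gen_binom is a hypothetical helper name of our own):

```python
# A sketch of the generalized binomial coefficient used in (3);
# gen_binom is our own helper name, valid for any real r > 0.
from math import comb, factorial

def gen_binom(y, r):
    # (y+r-1)(y+r-2) ... (r+1)(r) divided by y!  (y factors on top)
    num = 1.0
    for k in range(y):
        num *= r + k
    return num / factorial(y)

def pmf_Y(y, r, p):
    return gen_binom(y, r) * p**r * (1 - p)**y

# With an integer r, this reduces to the ordinary binomial coefficient.
assert abs(gen_binom(4, 3) - comb(6, 4)) < 1e-9
# With a non-integer r, it is still well defined.
print(gen_binom(2, 2.5))   # (2.5)(3.5)/2! = 4.375
```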

________________________________________________________________________

The probabilities sum to one

Do the probabilities in (1), (2) or (3) sum to one? For the interpretations of (1) and (2), is it possible to repeatedly perform Bernoulli trials and never get the rth success? For r=1, is it possible to never even get a success? In tossing a fair coin repeatedly, soon enough you will get a head, and even if r is a large number, you will eventually get r heads. Here we wish to prove this fact mathematically.

To show that (1), (2) and (3) are indeed probability functions, we use a fact concerning Maclaurin’s series expansion of the function (1-x)^{-r}, a fact that is covered in a calculus course. In the following two results, r is a fixed positive real number and y is any non-negative integer:

    \displaystyle \binom{y+r-1}{y}=(-1)^y \ \binom{-r}{y} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4)

    \displaystyle \sum \limits_{y=0}^\infty (-1)^y \ \binom{-r}{y} \ x^y=(1-x)^{-r} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (5)

Result (4) rearranges the binomial coefficient in probability function (3) into a binomial coefficient with a negative number on top. This is the reason for the word “negative” in negative binomial distribution. Result (5) is the Maclaurin’s series expansion for the function (1-x)^{-r}. We first derive these two facts and then use them to show that the negative binomial probabilities in (3) sum to one. The following derives (4).

    \displaystyle \begin{aligned} \binom{y+r-1}{y}&=\frac{(y+r-1)(y+r-2) \cdots (r+1)r}{y!} \\&=(-1)^y \ \frac{(-r)(-r-1) \cdots (-r-(y-1))}{y!} \\&=(-1)^y \ \binom{-r}{y}  \end{aligned}
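Identity (4) can also be spot-checked numerically for a non-integer r. Below, binom_real is our own helper implementing the generalized binomial coefficient from the previous section.

```python
# A numerical spot-check of identity (4) for a non-integer r.
# binom_real is our own helper for the generalized binomial coefficient.
from math import factorial

def binom_real(m, n):
    # m(m-1)...(m-n+1) / n!, defined for any real m and integer n >= 0
    num = 1.0
    for k in range(n):
        num *= m - k
    return num / factorial(n)

r = 2.5
for y in range(8):
    lhs = binom_real(y + r - 1, y)          # binom(y+r-1, y)
    rhs = (-1)**y * binom_real(-r, y)       # (-1)^y binom(-r, y)
    assert abs(lhs - rhs) < 1e-9
```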

To derive (5), let f(x)=(1-x)^{-r}. Based on a theorem that can be found in most calculus texts, the function f(x) has the following Maclaurin’s series expansion (Maclaurin’s series is simply Taylor’s series with center = 0).

    \displaystyle (1-x)^{-r}=f(0)+f^{'}(0)x+\frac{f^{(2)}(0)}{2!}x^2+\frac{f^{(3)}(0)}{3!}x^3+\cdots + \frac{f^{(n)}(0)}{n!}x^n+\cdots

where -1<x<1. Now, filling in the derivatives f^{(n)}(0), we have the following derivation.

    \displaystyle \begin{aligned} (1-x)^{-r}&=1+rx+\frac{(r+1)r}{2!}x^2+\frac{(r+2)(r+1)r}{3!}x^3 \\& \ \ \ \ \ \ \ \ +\cdots+\frac{(r+y-1)(r+y-2) \cdots (r+1)r}{y!}x^y +\cdots \\&=1+(-1)^1 (-r)x+(-1)^2\frac{(-r)(-r-1)}{2!}x^2 \\& \ \ \ \ \ \ +(-1)^3 \frac{(-r)(-r-1)(-r-2)}{3!}x^3 +\cdots \\& \ \ \ \ \ \ +(-1)^y \frac{(-r)(-r-1) \cdots (-r-y+2)(-r-y+1)}{y!}x^y +\cdots  \\&=(-1)^0 \binom{-r}{0}x^0 +(-1)^1 \binom{-r}{1}x^1+(-1)^2 \binom{-r}{2}x^2 \\& \ \ \ \ \ \ +(-1)^3 \binom{-r}{3}x^3+\cdots +(-1)^y \binom{-r}{y}x^y+\cdots    \\&=\sum \limits_{y=0}^\infty (-1)^y \ \binom{-r}{y} \ x^y \end{aligned}

We can now show that the negative binomial probabilities in (3) sum to one. Let q=1-p.

    \displaystyle \begin{aligned} \sum \limits_{y=0}^\infty \binom{y+r-1}{y} \ p^r \ q^y &=p^r \ \sum \limits_{y=0}^\infty (-1)^y \ \binom{-r}{y} \ q^y \ \ \ \ \ \ \ \ \ \ \ \text{using } (4) \\&=p^r \ (1-q)^{-r} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{using } (5)\\&=p^r p^{-r} \\&=1 \end{aligned}
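The same conclusion can be checked numerically. Successive probabilities in (3) satisfy the recursion P(Y=y+1) = P(Y=y) \cdot (y+r)(1-p)/(y+1), so a truncated sum of the probabilities should be numerically close to one, even for a non-integer r. A quick sketch:

```python
# The probabilities in (3) satisfy the term-by-term recursion
# P(Y=y+1) = P(Y=y) * (y+r)(1-p)/(y+1); summing enough terms
# should give (numerically) one, even for a non-integer r.
r, p = 2.5, 0.4
prob = p**r            # P(Y = 0)
total = 0.0
for y in range(500):
    total += prob
    prob *= (y + r) * (1 - p) / (y + 1)
assert abs(total - 1.0) < 1e-10
```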

________________________________________________________________________

The moment generating function

We now derive the moment generating function of the negative binomial distribution according to (3). The moment generating function is M(t)=E(e^{tY}), defined for all real numbers t for which the expectation exists. The following derivation does the job.

    \displaystyle \begin{aligned} M(t)&=E(e^{tY}) \\&=\sum \limits_{y=0}^\infty \ e^{t y} \ \binom{y+r-1}{y} \ p^r \ (1-p)^y \\&=p^r \ \sum \limits_{y=0}^\infty  \ \binom{y+r-1}{y} \ [(1-p) e^t]^y \\&=p^r \ \sum \limits_{y=0}^\infty  \ (-1)^y \binom{-r}{y} \ [(1-p) e^t]^y \ \ \ \ \ \ \ \ \ \ \ \text{using } (4) \\&=p^r \ [1-(1-p) \ e^t]^{-r} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{using } (5) \\&=\frac{p^r}{[1-(1-p) \ e^t]^{r}}\end{aligned}

The above moment generating function works for the negative binomial distribution with respect to (3) and thus to (2). For the distribution in (1), note that X=Y+r. Thus E(e^{tX})=E(e^{t(Y+r)})=e^{tr} \ E(e^{tY}). The moment generating function of (1) is simply the above moment generating function multiplied by the factor e^{tr}. To summarize, the moment generating functions for the three versions are:

    \displaystyle M_X(t)=E[e^{tX}]=\frac{p^r \ e^{tr}}{[1-(1-p) \ e^t]^{r}} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{for } (1)

    \displaystyle M_Y(t)=E[e^{tY}]=\frac{p^r}{[1-(1-p) \ e^t]^{r}} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{for } (2) \text{ and } (3)

The domain of the moment generating function is the set of all t for which M_X(t) or M_Y(t) is defined and positive. Based on the form it takes, we focus on making sure that 1-(1-p) \ e^t>0. This leads to the domain t<-\text{ln}(1-p).
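The closed form for M_Y(t) can be compared against a truncated version of the defining series E(e^{tY}), for a t inside the domain just described. A minimal check:

```python
# Comparing the closed-form M_Y(t) with a truncated version of the
# defining series E(e^{tY}), for a t inside the domain t < -ln(1-p).
from math import exp, log

def mgf_closed(t, r, p):
    return p**r / (1 - (1 - p) * exp(t))**r

r, p, t = 3, 0.4, 0.2
assert t < -log(1 - p)           # domain check: -ln(0.6) is about 0.51

# build E(e^{tY}) term by term, using the pmf recursion for (3)
prob = p**r                      # P(Y = 0)
direct = 0.0
for y in range(400):
    direct += exp(t * y) * prob
    prob *= (y + r) * (1 - p) / (y + 1)

assert abs(direct - mgf_closed(t, r, p)) < 1e-10
```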

________________________________________________________________________

The mean and the variance

With the moment generating function derived in the above section, we can now focus on finding the moments of the negative binomial distribution. To find the moments, simply take the derivatives of the moment generating function and evaluate at t=0. For the distribution represented by the probability function in (3), we calculate the following:

    E(Y)=M_Y^{'}(0)

    E(Y^2)=M_Y^{(2)}(0)

    Var(Y)=E(Y^2)-E(Y)^2

After taking the first and second derivatives and evaluating at t=0, the first and the second moments are:

    \displaystyle E(Y)=r \ \frac{1-p}{p}

    \displaystyle E(Y^2)=\frac{r(1-p)[1+r(1-p)]}{p^2}

The following derives the variance.

    \displaystyle \begin{aligned} Var(Y)&=E(Y^2)-E(Y)^2 \\&=\frac{r(1-p)[1+r(1-p)]}{p^2}-\frac{r^2(1-p)^2}{p^2} \\&=\frac{r(1-p)[1+r(1-p)-r(1-p)]}{p^2} \\&=\frac{r(1-p)}{p^2}  \end{aligned}

The above formula is the variance for all three versions (1), (2) and (3), since X=Y+r and shifting by a constant does not change the variance. The mean of version (1), however, is E(X)=E(Y)+r=r/p. Note that Var(Y)>E(Y). In contrast, the variance of the Poisson distribution is identical to its mean. Thus in situations where the variance of observed data is greater than the sample mean, the negative binomial distribution may be a better fit than the Poisson distribution.
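The mean and variance formulas can be verified by summing the pmf in (3) directly, again using the term-by-term recursion for the probabilities:

```python
# Verifying E(Y) = r(1-p)/p and Var(Y) = r(1-p)/p^2 by summing the
# pmf in (3) directly, via the recursion
# P(Y=y+1) = P(Y=y) * (y+r)(1-p)/(y+1).
r, p = 2.5, 0.4
prob = p**r                      # P(Y = 0)
mean = second = 0.0
for y in range(600):
    mean += y * prob
    second += y * y * prob
    prob *= (y + r) * (1 - p) / (y + 1)

var = second - mean**2
assert abs(mean - r * (1 - p) / p) < 1e-8       # 3.75
assert abs(var - r * (1 - p) / p**2) < 1e-8     # 9.375
```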

________________________________________________________________________

The independent sum

There is an easy consequence that follows from the moment generating function derived above. The sum of several independent negative binomial distributions is also a negative binomial distribution, provided they share the same parameter p. For example, suppose T_1,T_2, \cdots,T_n are independent negative binomial random variables (version (3)). Suppose each T_j has parameters r_j and p (the second parameter is identical). The moment generating function of the independent sum is the product of the individual moment generating functions. Thus the following is the moment generating function of T=T_1+\cdots+T_n.

    \displaystyle M_T(t)=E[e^{tT}]=\frac{p^g}{[1-(1-p) \ e^t]^{g}}

where g=r_1+\cdots+r_n. The moment generating function uniquely identifies the distribution. The above M_T(t) is that of a negative binomial distribution with parameters g and p according to (3).

A special case is that the sum of n independent geometric distributions (each with the same p) is a negative binomial distribution with parameters r=n and p. The following is the moment generating function of the sum W of n independent geometric distributions.

    \displaystyle M_W(t)=E[e^{tW}]=\frac{p^n}{[1-(1-p) \ e^t]^{n}}
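The special case can also be illustrated by simulation: summing n independent geometric counts (failures before the first success) and comparing the sample mean and variance with the negative binomial formulas for r=n. This is a Monte Carlo sketch, so the assertions use loose tolerances.

```python
# Monte Carlo sketch: the sum of n independent geometric counts
# (failures before the first success) should behave like a negative
# binomial with parameters r = n and p.
import random

random.seed(0)

def geometric_failures(p):
    # number of failures before the first success
    count = 0
    while random.random() > p:   # failure occurs with probability 1-p
        count += 1
    return count

n, p, trials = 4, 0.4, 200_000
samples = [sum(geometric_failures(p) for _ in range(n))
           for _ in range(trials)]
mean = sum(samples) / trials
var = sum((s - mean) ** 2 for s in samples) / trials

assert abs(mean - n * (1 - p) / p) < 0.05    # E = 6
assert abs(var - n * (1 - p) / p**2) < 0.5   # Var = 15
```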

________________________________________________________________________
\copyright \ \text{2015 by Dan Ma}