A natural look at the negative binomial survival function

The negative binomial distribution is a discrete distribution with two parameters r and p where r>0 and 0<p<1. It has positive probabilities at the non-negative integers 0,1,2,\cdots. So it can potentially be used as a model for the random count of a phenomenon of interest. In some cases, the negative binomial distribution has a natural interpretation. In fact, the natural interpretation should be how the negative binomial distribution is introduced. With the parameter r being a positive integer, the negative binomial distribution can be naturally interpreted as the discrete waiting time until the rth success. But this natural interpretation does not apply to the general case of r being only a positive real number but not necessarily an integer. However, once the natural case that r being a positive integer is understood, it is easy to make a leap to the general case that r is any positive real number. In this post, we focus on the “natural” version of the negative binomial distribution, where r is a positive integer. In this case, there is a natural way to look at the cumulative distribution function (cdf) and the survival function. The discussion in this post will complement the following posts on the negative binomial distribution.

________________________________________________________________________

The probability functions

A Bernoulli trial is a random experiment in which there are two distinct outcomes. For convenience, they are called success (S) and failure (F). Consider performing a sequence of independent Bernoulli trials such that each trial has the same probability of success p. Fix a positive integer r. In some cases, we would like to count the number of trials that produce the rth success. Let’s call this count X_r. In other cases, we may want instead to count the number of failures before getting the rth success. Let’s call this count Y_r. According to the discussion in the two posts listed above, the probability functions of X_r and Y_r are:

    \displaystyle P(X_r=x)= \binom{x-1}{r-1} p^r (1-p)^{x-r} \ \ \ \ \ \ \ x=r,r+1,r+2,\cdots \ \ \ \ \ \ \ (1)

    \displaystyle P(Y_r=y)=\binom{y+r-1}{y} p^r (1-p)^y \ \ \ \ \ \ \ y=0,1,2,\cdots \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)

When r=1, the resulting distribution is called the geometric distribution.

    \displaystyle P(X_1=x)= p (1-p)^{x-1} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ x=1,2,3,\cdots \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1a)

    \displaystyle P(Y_1=y)= p (1-p)^y \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ y=0,1,2,\cdots \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2a)

Because the parameter r is a positive integer, all the above probability functions have an combinatorial explanation. For the probability function in (1), there are x many trials. The last Bernoulli trial is the rth success. Then all the preceding x-1 trails have at most r-1 successes. Hence we have the binomial coefficient \binom{x-1}{r-1} in (1). For the probability function in (2), there are y+r trials (r successes and y failures). Again, the last Bernoulli trial is the rth success. So in the preceding y+r-1 trials, there are exactly r-1 successes and exactly y failures. Hence we have the binomial coefficient \binom{y+r-1}{y} in (2). Let’s look at some examples.

Example 1
Four different prizes are randomly put into boxes of cereal. One of the prizes is a free ticket to the local zoo. Suppose that a family of four (called Family A) decides to buy this cereal until obtaining four free tickets to the zoo. What is the probability that the family will have to buy 10 boxes of cereal to obtain the four free tickets to the zoo? What is the probability that the family will have to buy 16 boxes of cereal to obtain the four free tickets to the zoo?

In this example, the success is a box of cereal with a free ticket to the zoo. So getting a ticket to the zoo is considered a success. Any one of the other three prizes is considered undesirable or a failure. Any one of the four prizes is equally likely. The negative binomial distribution in this example has r=4 and p=0.25. The count of the boxes of cereal to be purchased is the random variable X_4 as described in (1) above. The following gives the answers.

    \displaystyle \begin{aligned} P(X_4=10)&=\binom{10-1}{3} \ (0.25)^4 \ (0.75)^{10-4} \\&=\binom{9}{3} \ (0.25)^4 \ (0.75)^{6} \\&=84 \ (0.25)^4 \ (0.75)^{6} \\&=0.0583992  \end{aligned}

    \displaystyle \begin{aligned} P(X_4=16)&=\binom{16-1}{3} \ (0.25)^4 \ (0.75)^{16-4} \\&=\binom{15}{3} \ (0.25)^4 \ (0.75)^{12} \\&=455 \ (0.25)^4 \ (0.75)^{12} \\&=0.056299766  \end{aligned}

Example 2 (Example 1 continued)
Suppose Family A agrees to give any one of the undesirable prizes away to another family (called Family B). What is the probability that Family A will give 5 undesirable prizes to Family B before obtaining the four desirable tickets to the zoo? What is the probability that Family A will give 12 undesirable prizes to Family B before obtaining the four tickets to the zoo?

The negative binomial distribution in this example has r=4 and p=0.25. The interest here is to count the number failures (undesirable prizes) before getting 4 successes. Thus the random variable of interest is Y_4 as described in (2) above. The following gives the answers.

    \displaystyle \begin{aligned} P(Y_4=5)&=\binom{5+4-1}{5} \ (0.25)^4 \ (0.75)^{5} \\&=\binom{8}{5} \ (0.25)^4 \ (0.75)^{5} \\&=56 \ (0.25)^4 \ (0.75)^{5} \\&=0.0519104  \end{aligned}

    \displaystyle \begin{aligned} P(Y_4=12)&=\binom{12+4-1}{10} \ (0.25)^4 \ (0.75)^{12} \\&=\binom{15}{12} \ (0.25)^4 \ (0.75)^{12} \\&=455 \ (0.25)^4 \ (0.75)^{12} \\&=0.056299766  \end{aligned}

Here’s the mean and variance for both examples.

    \displaystyle E(X_4)=4 \ \frac{1}{0.25}=16

    \displaystyle Var(X_4)=Var(Y_4)=4 \ \frac{0.75}{0.25^2}=48

    \displaystyle E(Y_4)=4 \ \frac{0.75}{0.25}=12

Thus Family A is expected to buy 16 boxes of cereal to get the 4 tickets to the zoo and is expected to give 12 prizes to the other family. However, the variance is fairly large. As a result, the actual number of boxes purchased can vary from the mean by a large amount.

_______________________________________________________________________

The survival functions and the cumulative distribution functions

For any random variable T, the cumulative distribution function is P(T \le t) where T can range over all real numbers t or a relevant set of real numbers t. The survival function is P(T > t). The term survival comes from the interpretation that P(T > t) is the probability of a life surviving beyond time t if T is meant to model the lifetime of a biological life or some system. Even when T is not a lifetime random variable, we still use the term survival function for P(T > t).

Example 1 asks the probability of a certain number of trials in order to get rth success and the probability of a certain number of failures before getting the rth success. Sometimes it is more informative to know how many trials that are required to be performed in order to achieve one’s goal. For example, it may be useful to know the mean number of trials or the probability of achieving the goal in x trials or less. In some cases, it may take time and resource to perform the random Bernoulli trials. It will be helpful to know ahead of time the likelihood of achieving one’s goal given the resources that are available. In the above examples, it will be helpful for Family A to have a better and clearer sense of how many boxes of cereal are to be purchased. Therefore, it is worthwhile to look at the cdf and the survival function of the negative binomial distribution.

In terms of basic principle, here’s the survival functions for the distribution described in (1) and (2).

    \displaystyle \begin{aligned} P(X_r>x)&=\sum \limits_{k=x+1}^\infty P(X_r=k) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ x=r,r+1,r+2,\cdots \\&=\sum \limits_{k=x+1}^\infty \binom{k-1}{r-1} p^r (1-p)^{k-r} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3) \end{aligned}

    \displaystyle \begin{aligned} P(Y_r>y)&=\sum \limits_{j=y+1}^\infty P(Y_r=j) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ y= 0,1,2,3,\cdots \\&=\sum \limits_{j=y+1}^\infty \binom{j+r-1}{j} p^r (1-p)^{j} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4) \end{aligned}

The cumulative distribution functions can be defined by basic principle or by taking the complement as follows:

    \displaystyle \begin{aligned} P(X_r \le x)&=\sum \limits_{k=r}^x P(X_r=k) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ x=r,r+1,r+2,\cdots \\&=1-P(X_r>x)  \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (5)\end{aligned}

    \displaystyle \begin{aligned} P(Y_r \le y)&=\sum \limits_{j=0}^y P(Y_r=j) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ y=0,1,2,3,\cdots \\&=1-P(Y_r>y)  \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (6)\end{aligned}

Example 3 (the above examples continued)
What is the probability that Family A will buy more than 16 boxes in order to obtain the 4 free tickets to the zoo? What is the probability that Family A will give away at most 10 undesirable prizes before obtaining the 4 free tickets to the zoo?

Working from the definition, here’s the answers:

    \displaystyle P(X_4>16)=1-P(X_4 \le 16)=0.40498711

    \displaystyle P(Y_4 \le 10)=0.47866004

The mean number of boxes of cereal purchased is 16 as indicated earlier. Since the variance is large, there is still a significance chance (about 40%) that Family A will have to buy more than 16 boxes of cereal before achieving their goal. On the other hand, there is good chance (about 48%) that Family will give away at most 10 undesirable prizes.

Note that the calculation for P(X_4>16) is based on P(X_4 \le 16), which requires the calculation of 17 probabilities. The calculation for P(Y_4 \le 10), which requires 11 probabilities. Such calculation can be done by software of course. There is a natural way of looking at the calculation for (3), (4), (5) and (6). This alternative approach will give much better insight on the negative binomial distribution.

_______________________________________________________________________

A more natural way of interpreting the survival function

We now discuss a better way to look at the survival functions in (3) and (4). Consider the event X_r>x and the event Y_r>y. We will see that the negative binomial survival function can be related to the cdf of a binomial distribution.

For the event X_r>x to occur, the rth success occurs after performing x trials. So it will take x+1 trials or more to get the rth success. This means that in the first x trials, there are at most r-1 successes. The following highlights the equivalent statements.

    \displaystyle \begin{aligned} X_r>x&\equiv \text{the } r \text{th success occurs after performing } x \text{ trials} \\&\equiv \text{it takes at least } x+1 \text{ trials to get the } r \text{th success}  \\&\equiv \text{in the first } x \text{ trials, there are at most } r-1 \text{ successes}  \end{aligned}

The last statement is a binomial distribution. Specifically, it is the binomial distribution with x trials and the probability of success p. Let’s denote the count of successes of this binomial distribution by B_{x,p}. Thus we can relate the survival function in (3) with the cdf of B_{x,p} as follows:

    \displaystyle \begin{aligned} P(X_r>x)&=P(B_{x,p} \le r-1) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ x=r,r+1,r+2,\cdots \\&=\sum \limits_{k=0}^{r-1} \ P(B_{x,p}=k)  \\&=\sum \limits_{k=0}^{r-1} \ \binom{x}{k} \ p^k (1-p)^{x-k} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (7)   \end{aligned}

The advantage of (7) is that it gives us the insight of relating the negative binomial distribution with the binomial distribution. In terms of computation, (7) only requires r many binomial probabilities. Thus Example 3 only requires the computation of 4 probabilities (versus 17 previously).

Note that X_r=Y_r+r. Thus the event Y_r>y is the same as the event X_r>y+r. So we can just piggy back on the work done in (7). For the sake of the more clarity, here’s a translation for the event Y_r>y.

    \displaystyle \begin{aligned} Y_r>y&\equiv X_r>y+r \\&\equiv \text{the } r \text{th success occurs after performing } y+r \text{ trials} \\&\equiv \text{it takes at least } y+r+1 \text{ trials to get the } r \text{th success}  \\&\equiv \text{in the first } y+r \text{ trials, there are at most } r-1 \text{ successes}  \end{aligned}

As before, let B_{n,p} denote the number of successes in performing n Bernoulli trials with p as the probability of success. Based on the above translation, the following gives the survival function for the negative binomial random variable Y_r.

    \displaystyle \begin{aligned} P(Y_r>y)&=P(X_r>y+r)   \\&=P(B_{y+r,p} \le r-1) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ y=0,1,2,\cdots \\&=\sum \limits_{j=0}^{r-1} \ P(B_{y+r,p}=j)  \\&=\sum \limits_{j=0}^{r-1} \ \binom{y+r}{j} \ p^j (1-p)^{y+r-j} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (8)   \end{aligned}

For both negative binomial random variables X_r and Y_r, the survival functions can be computed using (7) and (8), respectively. Rote memorization of the formulas (7) and (8) is not recommended. Instead, focus on the thought process that translate the events X_r>x and Y_r>y into a binomial distribution.

Example 4 (the above examples continued)
We now rework Example 3 using the ideas presented in (7) and (8).

    \displaystyle \begin{aligned} P(X_4>16)&=\sum \limits_{k=0}^{3} \ \binom{16}{k} \ 0.25^k \ 0.75^{16-k}=0.40498711    \end{aligned}

    \displaystyle \begin{aligned} P(Y_4 \le 10)&=1-P(Y_4>10)   \\&=1-\sum \limits_{j=0}^{3} \ \binom{14}{j} \ 0.25^j \ 0.75^{14-j} \\&=1-0.52133996  \\&=0.47866004   \end{aligned}

Example 5 (the above examples continued)
What is the median number of boxes of cereal purchased by Family A in order to obtain 4 boxes with the prize of free ticket to the zoo? What is the median number of boxes of cereal with undesirable prizes that are purchased by Family A?

We have the following probabilities.

    \displaystyle P(X_4>14)=\sum \limits_{k=0}^{3} \ \binom{14}{k} \ 0.25^k \ 0.75^{16-k}=0.52133996

    \displaystyle P(X_4>15)=\sum \limits_{k=0}^{3} \ \binom{15}{k} \ 0.25^k \ 0.75^{16-k}=0.461286876

    \displaystyle P(X_4 \le 14)=1-0.52133996=0.47866004

    \displaystyle P(X_4 \le 15)=1-0.461286876=0.538713124

This means that the median number of boxes to be purchased is 15. One way to look at it is that x= 15 is the first number such that P(X_4 \le x) is greater than 0.5. Then the median number of boxes with undesirable prizes is 11 (15 less 4).

_______________________________________________________________________

Comparing with the Gamma distribution

The thought process discussed in (7) and (8) certainly gives a more efficient way to calculate the cumulative distribution function and the survival function of the negative binomial distribution. Even though the negative binomial cdf can be calculated easily by software, the ideas in (7) and (8) provides a formulation that gives more insight on the negative binomial distribution.

The though process in (7) and (8) is analogous to the relationship between the Gamma distribution and the Poisson distribution. Consider the Gamma distribution where the shape parameter n is a positive integer and the rate parameter \beta can be any positive real number. Then the following is the density function of the Gamma distribution under consideration:

    \displaystyle f(x)=\frac{\beta^n}{(n-1)!} \ x^{n-1} \ e^{-\beta \ x} \ \ \ \ \ x>0 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (9)

The Gamma distribution described by the density function in (9) can be interpreted as a waiting time – the waiting time until the nth change in a Poisson process. Thus, if the nth change takes place after time t, there can be at most n-1 arrivals in the time interval [0,t]. Thus the survival function of this Gamma distribution is the same as the cdf of a Poisson distribution. The survival function in (7) is analogous to the following relation.

    \displaystyle P(X>t)=\int_t^\infty \frac{\beta^n}{(n-1)!} \ x^{n-1} \ e^{-\beta x} \ dx=\sum \limits_{j=0}^{n-1} \frac{e^{-\beta t} \ (\beta t)^j}{j!} \ \ \ \ \ \ \ \ \ \ \ (10)

The idea for (7) and (8) is the waiting time until the rth success where each success or failure is based on a Bernoulli process. The resulting probability distribution is a discrete waiting time process. The idea for (10) is the waiting time until the nth change where each change is based on a Poisson counting process. The resulting probability distribution is a continuous waiting time process. It is not necessary to memorize these formulas. It is easy to reproduce (7), (8) and (10) from the underlying thought processes.

________________________________________________________________________
\copyright \ \text{2015 by Dan Ma}

Advertisements

Deriving some facts of the negative binomial distribution

The previous post called The Negative Binomial Distribution gives a fairly comprehensive discussion of the negative binomial distribution. In this post, we fill in some of the details that are glossed over in that previous post. We derive the following points:

  • Discuss the several versions of the negative binomial distribution.
  • The negative binomial probabilities sum to one, i.e., the negative binomial probability function is a valid one.
  • Derive the moment generating function of the negative binomial distribution.
  • Derive the first and second moments and the variance of the negative binomial distribution.
  • An observation about independent sum of negative binomial distributions.

________________________________________________________________________

Three versions

The negative binomial distribution has two parameters r and p, where r is a positive real number and 0<p<1. The first two versions arise from the case that r is a positive integer, which can be interpreted as the random experiment of a sequence of independent Bernoulli trials until the rth success (the trials have the same probability of success p). In this interpretation, there are two ways of recording the random experiment:

    X = the number of Bernoulli trials required to get the rth success.
    Y = the number of Bernoulli trials that end in failure before getting the rth success.

The other parameter p is the probability of success in each Bernoulli trial. The notation \binom{m}{n} is the binomial coefficient where m and n are non-negative integers and m \ge n is defined as:

    \displaystyle \binom{m}{n}=\frac{m!}{n! \ (m-n)!}=\frac{m(m-1) \cdots (m-(n-1))}{n!} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (0)

With this in mind, the following are the probability functions of the random variables X and Y.

    \displaystyle P(X=x)= \binom{x-1}{r-1} p^r (1-p)^{x-r} \ \ \ \ \ \ \ x=r,r+1,r+2,\cdots \ \ \ \ \ \ \ (1)

    \displaystyle P(Y=y)=\binom{y+r-1}{y} p^r (1-p)^y \ \ \ \ \ \ \ y=0,1,2,\cdots \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)

The thought process for (1) is that for the event X=x to happen, there can only be r-1 successes in the first x-1 trials and one additional success occurring in the last trial (the xth trial). The thought process for (2) is that for the event Y=y to happen, there are y+r trials (y failures and r successes). In the first y+r-1 trials, there can be only y failures (or equivalently r-1 successes). Note that X=Y+r. Thus knowing the mean of Y will derive the mean of X, a fact we will use below.

Instead of memorizing the probability functions (1) and (2), it is better to understand and remember the thought processes involved. Because of the natural interpretation of performing Bernoulli trials until the rth success, it is a good idea to introduce the negative binomial distribution via the distributions described by (1) and (2), i.e., the case where the parameter r is a positive integer. When r=1, the random experiment is a sequence of independent Bernoulli trials until the first success (this is called the geometric distribution).

Of course, (1) and (2) can also simply be used as counting distributions without any connection with a series of Bernoulli trials (e.g. used in an insurance context as the number of losses or claims arising from a group of insurance policies).

The binomial coefficient in (0) is defined when both numbers are non-negative integers and that the top one is greater than or equal to the bottom one. However, the rightmost term in (0) can be calculated even when the top number m is not a non-negative integer. Thus when m is any real number, the rightmost term (0) can be calculated provided that the bottom number n is a positive integer. For convenience we define \binom{m}{0}=1. With this in mind, the binomial coefficient \binom{m}{n} is defined for any real number m and any non-negative integer n.

The third version of the negative binomial distribution arises from the relaxation of the binomial coefficient \binom{m}{n} just discussed. With this in mind, the probability function in (2) can be defined for any positive real number r:

    \displaystyle P(Y=y)=\binom{y+r-1}{y} p^r (1-p)^y \ \ \ \ \ \ \ y=0,1,2,\cdots \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3)

where \displaystyle \binom{y+r-1}{y}=\frac{(y+r-1)(y+r-2) \cdots (r+1)r}{y!}.

Of course when r is a positive integer, versions (2) and (3) are identical. When r is a positive real number but is not an integer, the distribution cannot be interpreted as the number of failures until the occurrence of rth success. Instead, it is used as a counting distribution.

________________________________________________________________________

The probabilities sum to one

Do the probabilities in (1), (2) or (3) sum to one? For the interpretations of (1) and (2), is it possible to repeatedly perform Bernoulli trials and never get the rth success? For r=1, is it possible to never even get a success? In tossing a fair coin repeatedly, soon enough you will get a head and even if r is a large number, you will eventually get r number of heads. Here we wish to prove this fact mathematically.

To show that (1), (2) and (3) are indeed probability functions, we use a fact concerning Maclaurin’s series expansion of the function (1-x)^{-r}, a fact that is covered in a calculus course. In the following two results, r is a fixed positive real number and y is any non-negative integer:

    \displaystyle \binom{y+r-1}{y}=(-1)^y \ \binom{-r}{y} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4)

    \displaystyle \sum \limits_{y=0}^\infty (-1)^y \ \binom{-r}{y} \ x^y=(1-x)^{-r} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (5)

The result (4) is to rearrange the binomial coefficient in probability function (3) to another binomial coefficient with a negative number. This is why there is the word “negative” in negative binomial distribution. The result (5) is the Maclaurin’s series expansion for the function (1-x)^{-r}. We first derive these two facts and then use them to show that the negative binomial probabilities in (3) sum to one. The following derives (4).

    \displaystyle \begin{aligned} \binom{y+r-1}{y}&=\frac{(y+r-1)(y+r-2) \cdots (r+1)r}{y!} \\&=(-1)^y \ \frac{(-r)(-r-1) \cdots (-r-(y-1))}{y!} \\&=(-1)^y \ \binom{-r}{y}  \end{aligned}

To derive (5), let f(x)=(1-x)^{-r}. Based on a theorem that can be found in most calculus text, the function f(x) has the following Maclaurin’s series expansion (Maclaurin’s series is simply Taylor’s series with center = 0).

    \displaystyle (1-x)^{-r}=f(0)+f^{'}(0)x+\frac{f^{(2)}(0)}{2!}x^2+\frac{f^{(3)}(0)}{3!}x^3+\cdots + \frac{f^{(n)}(0)}{n!}x^n+\cdots

where -1<x<1. Now, filling in the derivatives f^{(n)}(0), we have the following derivation.

    \displaystyle \begin{aligned} (1-x)^{-r}&=1+rx+\frac{(r+1)r}{2!}x^2+\frac{(r+2)(r+1)r}{3!}x^3 \\& \ \ \ \ \ \ \ \ +\cdots+\frac{(r+y-1)(r+y-2) \cdots (r+1)r}{y!}x^y +\cdots \\&=1+(-1)^1 (-r)x+(-1)^2\frac{(-r)(-r-1)}{2!}x^2 \\& \ \ \ \ \ \ +(-1)^3 \frac{(-r)(-r-1)(-r-2)}{3!}x^3 +\cdots \\& \ \ \ \ \ \ +(-1)^y \frac{(-r)(-r-1) \cdots (-r-y+2)(-r-y+1)}{y!}x^y +\cdots  \\&=(-1)^0 \binom{-r}{0}x^0 +(-1)^1 \binom{-r}{1}x^1+(-1)^2 \binom{-r}{2}x^2 \\& \ \ \ \ \ \ +(-1)^3 \binom{-r}{3}x^3+\cdots +(-1)^y \binom{-r}{y}x^y+\cdots    \\&=\sum \limits_{y=0}^\infty (-1)^y \ \binom{-r}{y} \ x^y \end{aligned}

We can now show that the negative binomial probabilities in (3) sum to one. Let q=1-p.

    \displaystyle \begin{aligned} \sum \limits_{y=0}^\infty \binom{y+r-1}{y} \ p^r \ q^y &=p^r \ \sum \limits_{y=0}^\infty (-1)^y \ \binom{-r}{y} \ q^y \ \ \ \ \ \ \ \ \ \ \ \text{using } (4) \\&=p^r \ (1-q)^{-r} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{using } (5)\\&=p^r p^{-r} \\&=1 \end{aligned}

________________________________________________________________________

The moment generating function

We now derive the moment generating function of the negative binomial distribution according to (3). The moment generation function is M(t)=E(e^{tY}) over all real numbers t for which M(t) is defined. The following derivation does the job.

    \displaystyle \begin{aligned} M(t)&=E(e^{tY}) \\&=\sum \limits_{y=0}^\infty \ e^{t y} \ \binom{y+r-1}{y} \ p^r \ (1-p)^y \\&=p^r \ \sum \limits_{y=0}^\infty  \ \binom{y+r-1}{y} \ [(1-p) e^t]^y \\&=p^r \ \sum \limits_{y=0}^\infty  \ (-1)^y \binom{-r}{y} \ [(1-p) e^t]^y \ \ \ \ \ \ \ \ \ \ \ \text{using } (4) \\&=p^r \ [1-(1-p) \ e^t]^{-r} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{using } (5) \\&=\frac{p^r}{[1-(1-p) \ e^t]^{r}}\end{aligned}

The above moment generating function works for the negative binomial distribution with respect to (3) and thus to (2). For the distribution in (1), note that X=Y+r. Thus E(e^{tX})=E(e^{t(Y+r)})=e^{tr} \ E(e^{tY}). The moment generating function of (1) is simply the above moment generating function multiplied by the factor e^{tr}. To summarize, the moment generating functions for the three versions are:

    \displaystyle M_X(t)=E[e^{tX}]=\frac{p^r \ e^{tr}}{[1-(1-p) \ e^t]^{r}} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{for } (1)

    \displaystyle M_Y(t)=E[e^{tY}]=\frac{p^r}{[1-(1-p) \ e^t]^{r}} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \text{for } (2) \text{ and } (3)

The domain of the moment generating function is the set of all t that for which M_X(t) or M_Y(t) is defined and is positive. Based on the form that it takes, we focus on making sure that 1-(1-p) \ e^t>0. This leads to the domain t<-\text{ln}(1-p).

________________________________________________________________________

The mean and the variance

With the moment generating function derived in the above section, we can now focus on finding the moments of the negative binomial distribution. To find the moments, simply take the derivatives of the moment generating function and evaluate at t=0. For the distribution represented by the probability function in (3), we calculate the following:

    E(Y)=M_Y^{'}(0)

    E(Y^2)=M_Y^{(2)}(0)

    Var(Y)=E(Y^2)-E(Y)^2

After taking the first and second derivatives and evaluate at t=0, the first and the second moments are:

    \displaystyle E(Y)=r \ \frac{1-p}{p}

    \displaystyle E(Y^2)=\frac{r(1-p)[1+(1-p)]}{p^2}

The following derives the variance.

    \displaystyle \begin{aligned} Var(Y)&=E(Y^2)-E(Y)^2 \\&=\frac{r(1-p)[1+(1-p)]}{p^2}-\frac{(1-p)^2}{p^2} \\&=\frac{r(1-p)[1+r(1-p)-r(1-p)]}{p^2} \\&=\frac{r(1-p)}{p^2}  \end{aligned}

The above formula is the variance for the three versions (1), (2) and (3). Note that Var(Y)>E(Y). In contrast, the variance of the Poisson distribution is identical to its mean. Thus in the situation where the variance of observed data is greater than the sample mean, the negative binomial distribution should be a better fit than the Poisson distribution.

________________________________________________________________________

The independent sum

There is an easy consequence that follows from the moment generating function derived above. The sum of several independent negative binomial distributions is also a negative binomial distribution. For example, suppose T_1,T_2, \cdots,T_n are independent negative binomial random variables (version (3)). Suppose each T_j has parameters r_j and p (the second parameter is identical). The moment generating function of the independent sum is the product of the individual moment generating functions. Thus the following is the moment generating function of T=T_1+\cdots+T_n.

    \displaystyle M_T(t)=E[e^{tT}]=\frac{p^g}{[1-(1-p) \ e^t]^{g}}

where g=r_1+\cdots+r_n. The moment generating function uniquely identifies the distribution. The above M_T(t) is that of a negative binomial distribution with parameters g and p according to (3).

A special case is that the sum of n independent geometric distributions is a negative binomial distribution with the r parameter being r=n. The following is the moment generating function of the sum W of n independent geometric distributions.

    \displaystyle M_W(t)=E[e^{tW}]=\frac{p^n}{[1-(1-p) \ e^t]^{n}}

________________________________________________________________________
\copyright \ \text{2015 by Dan Ma}

Relating Binomial and Negative Binomial

The negative binomial distribution has a natural intepretation as a waiting time until the arrival of the rth success (when the parameter r is a positive integer). The waiting time refers to the number of independent Bernoulli trials needed to reach the rth success. This interpretation of the negative binomial distribution gives us a good way of relating it to the binomial distribution. For example, if the rth success takes place after k failed Bernoulli trials (for a total of k+r trials), then there can be at most r-1 successes in the first k+r trials. This tells us that the survival function of the negative binomial distribution is the cumulative distribution function (cdf) of a binomial distribution. In this post, we gives the details behind this observation. A previous post on the negative binomial distribution is found here.

A random experiment resulting in two distinct outcomes (success or failure) is called a Bernoulli trial (e.g. head or tail in a coin toss, whether or not the birthday of a customer is the first of January, whether an insurance claim is above or below a given threshold etc). Suppose a series of independent Bernoulli trials are performed until reaching the rth success where the probability of success in each trial is p. Let X_r be the number of failures before the occurrence of the rth success. The following is the probablity mass function of X_r.

\displaystyle (1) \ \ \ \ P(X_r=k)=\binom{k+r-1}{k} p^r (1-p)^k \ \ \ \ \ \ k=0,1,2,3,\cdots

Be definition, the survival function and cdf of X_r are:

\displaystyle (2) \ \ \ \ P(X_r > k)=\sum \limits_{j=k+1}^\infty \binom{j+r-1}{j} p^r (1-p)^j \ \ \ \ \ \ k=0,1,2,3,\cdots

\displaystyle (3) \ \ \ \ P(X_r \le k)=\sum \limits_{j=0}^k \binom{j+r-1}{j} p^r (1-p)^j \ \ \ \ \ \ k=0,1,2,3,\cdots

For each positive integer k, let Y_{r+k} be the number of successes in performing a sequence of r+k independent Bernoulli trials where p is the probability of success. In other words, Y_{r+k} has a binomial distribution with parameters r+k and p.

If the random experiment requires more than k failures to reach the rth success, there are at most r-1 successes in the first k+r trails. Thus the survival function of X_r is the same as the cdf of a binomial distribution. Equivalently, the cdf of X_r is the same as the survival function of a binomial distribution. We have the following:

\displaystyle \begin{aligned}(4) \ \ \ \ P(X_r > k)&=P(Y_{k+r} \le r-1) \\&=\sum \limits_{j=0}^{r-1} \binom{k+r}{j} p^j (1-p)^{k+r-j} \ \ \ \ \ \ k=0,1,2,3,\cdots \end{aligned}

\displaystyle \begin{aligned}(5) \ \ \ \ P(X_r \le k)&=P(Y_{k+r} > r-1) \ \ \ \ \ \ k=0,1,2,3,\cdots \end{aligned}

Remark
The relation (4) is analogous to the relationship between the Gamma distribution and the Poisson distribution. Recall that a Gamma distribution with shape parameter \alpha and scale parameter n, where n is a positive integer, can be interpreted as the waiting time until the nth change in a Poisson process. Thus, if the nth change takes place after time t, there can be at most n-1 arrivals in the time interval [0,t]. Thus the survival function of this Gamma distribution is the same as the cdf of a Poisson distribution. The relation (4) is analogous to the following relation.

\displaystyle (5) \ \ \ \ \int_t^\infty \frac{\alpha^n}{(n-1)!} \ x^{n-1} \ e^{-\alpha x} \ dx=\sum \limits_{j=0}^{n-1} \frac{e^{-\alpha t} \ (\alpha t)^j}{j!}

A previous post on the negative binomial distribution is found here.

The Negative Binomial Distribution

A counting distribution is a discrete distribution with probabilities only on the nonnegative integers. Such distributions are important in insurance applications since they can be used to model the number of events such as losses to the insured or claims to the insurer. Though playing a prominent role in statistical theory, the Poisson distribution is not appropriate in all situations, since it requires that the mean and the variance are equaled. Thus the negative binomial distribution is an excellent alternative to the Poisson distribution, especially in the cases where the observed variance is greater than the observed mean.

The negative binomial distribution arises naturally from a probability experiment of performing a series of independent Bernoulli trials until the occurrence of the rth success where r is a positive integer. From this starting point, we discuss three ways to define the distribution. We then discuss several basic properties of the negative binomial distribution. Emphasis is placed on the close connection between the Poisson distribution and the negative binomial distribution.

________________________________________________________________________

Definitions
We define three versions of the negative binomial distribution. The first two versions arise from the view point of performing a series of independent Bernoulli trials until the rth success where r is a positive integer. A Bernoulli trial is a probability experiment whose outcome is random such that there are two possible outcomes (success or failure).

Let X_1 be the number of Bernoulli trials required for the rth success to occur where r is a positive integer. Let p is the probability of success in each trial. The following is the probability function of X_1:

\displaystyle (1) \ \ \ \ \ P(X_1=x)= \binom{x-1}{r-1} p^r (1-p)^{x-r} \ \ \ \ \ \ \ x=r,r+1,r+2,\cdots

The idea for (1) is that for X_1=x to happen, there must be r-1 successes in the first x-1 trials and one additional success occurring in the last trial (the xth trial).

A more common version of the negative binomial distribution is the number of Bernoulli trials in excess of r in order to produce the rth success. In other words, we consider the number of failures before the occurrence of the rth success. Let X_2 be this random variable. The following is the probability function of X_2:

\displaystyle (2) \ \ \ \ \ P(X_2=x)=\binom{x+r-1}{x} p^r (1-p)^x \ \ \ \ \ \ \ x=0,1,2,\cdots

The idea for (2) is that there are x+r trials and in the first x+r-1 trials, there are x failures (or equivalently r-1 successes).

In both (1) and (2), the binomial coefficient is defined by

\displaystyle (3) \ \ \ \ \ \binom{y}{k}=\frac{y!}{k! \ (y-k)!}=\frac{y(y-1) \cdots (y-(k-1))}{k!}

where y is a positive integer and k is a nonnegative integer. However, the right-hand-side of (3) can be calculated even if y is not a positive integer. Thus the binomial coefficient \displaystyle \binom{y}{k} can be expanded to work for all real number y. However k must still be nonnegative integer.

\displaystyle (4) \ \ \ \ \ \binom{y}{k}=\frac{y(y-1) \cdots (y-(k-1))}{k!}

For convenience, we let \displaystyle \binom{y}{0}=1. When the real number y>k-1, the binomial coefficient in (4) can be expressed as:

\displaystyle (5) \ \ \ \ \ \binom{y}{k}=\frac{\Gamma(y+1)}{\Gamma(k+1) \Gamma(y-k+1)}

where \Gamma(\cdot) is the gamma function.

With the more relaxed notion of binomial coefficient, the probability function in (2) above can be defined for all real number r. Thus the general version of the negative binomial distribution has two parameters r and p, both real numbers, such that 0<p<1. The following is its probability function.

\displaystyle (6) \ \ \ \ \ P(X=x)=\binom{x+r-1}{x} p^r (1-p)^x \ \ \ \ \ \ \ x=0,1,2,\cdots

Whenever r in (6) is a real number that is not a positive integer, the interpretation of counting the number of failures until the occurrence of the rth success is no longer important. Instead we can think of it simply as a count distribution.

The following alternative parametrization of the negative binomial distribution is also useful.

\displaystyle (6a) \ \ \ \ \ P(X=x)=\binom{x+r-1}{x} \biggl(\frac{\alpha}{\alpha+1}\biggr)^r \biggl(\frac{1}{\alpha+1}\biggr)^x \ \ \ \ \ \ \ x=0,1,2,\cdots

The parameters in this alternative parametrization are r and \alpha>0. Clearly, the ratio \frac{\alpha}{\alpha+1} takes the place of p in (6). Unless stated otherwise, we use the parametrization of (6).
________________________________________________________________________

What is negative about the negative binomial distribution?
What is negative about this distribution? What is binomial about this distribution? The name is suggested by the fact that the binomial coefficient in (6) can be rearranged as follows:

\displaystyle \begin{aligned}(7) \ \ \ \ \ \binom{x+r-1}{x}&=\frac{(x+r-1)(x+r-2) \cdots r}{x!} \\&=(-1)^x \frac{(-r-(x-1))(-r-(x-2)) \cdots (-r)}{x!} \\&=(-1)^x \frac{(-r)(-r-1) \cdots (-r-(x-1))}{x!} \\&=(-1)^x \binom{-r}{x} \end{aligned}

The calculation in (7) can be used to verify that (6) is indeed a probability function, that is, all the probabilities sum to 1.

\displaystyle \begin{aligned}(8) \ \ \ \ \ 1&=p^r p^{-r}\\&=p^r (1-q)^{-r} \\&=p^r \sum \limits_{x=0}^\infty \binom{-r}{x} (-q)^x \ \ \ \ \ \ \ \ (8.1) \\&=p^r \sum \limits_{x=0}^\infty (-1)^x \binom{-r}{x} q^x \\&=\sum \limits_{x=0}^\infty \binom{x+r-1}{x} p^r q^x \end{aligned}

In (8), we take q=1-p. The step (8.1) above uses the following formula known as the Newton’s binomial formula.

\displaystyle (9) \ \ \ \ \ (1+t)^w=\sum \limits_{k=0}^\infty \binom{w}{k} t^k

For a detailed discussion of (8) with all the details worked out, see the post called Deriving some facts of the negative binomial distribution.

________________________________________________________________________

The Generating Function
By definition, the following is the generating function of the negative binomial distribution, using :

\displaystyle (10) \ \ \ \ \ g(z)=\sum \limits_{x=0}^\infty \binom{r+x-1}{x} p^r q^x z^x

where q=1-p. Using a similar calculation as in (8), the generating function can be simplified as:

\displaystyle (11) \ \ \ \ \ g(z)=p^r (1-q z)^{-r}=\frac{p^r}{(1-q z)^r}=\frac{p^r}{(1-(1-p) z)^r}; \ \ \ \ \ z<\frac{1}{1-p}

As a result, the moment generating function of the negative binomial distribution is:

\displaystyle (12) \ \ \ \ \ M(t)=\frac{p^r}{(1-(1-p) e^t)^r}; \ \ \ \ \ \ \ t<-ln(1-p)

For a detailed discussion of (12) with all the details worked out, see the post called Deriving some facts of the negative binomial distribution.

________________________________________________________________________

Independent Sum

One useful property of the negative binomial distribution is that the independent sum of negative binomial random variables, all with the same parameter p, also has a negative binomial distribution. Let Y=Y_1+Y_2+\cdots+Y_n be an independent sum such that each X_i has a negative binomial distribution with parameters r_i and p. Then the sum Y=Y_1+Y_2+\cdots+Y_n has a negative binomial distribution with parameters r=r_1+\cdots+r_n and p.

Note that the generating function of an independent sum is the product of the individual generating functions. The following shows that the product of the individual generating functions is of the same form as (11), thus proving the above assertion.

\displaystyle (13) \ \ \ \ \ h(z)=\frac{p^{\sum \limits_{i=1}^n r_i}}{(1-(1-p) z)^{\sum \limits_{i=1}^n r_i}}
________________________________________________________________________

Mean and Variance
The mean and variance can be obtained from the generating function. From E(X)=g'(1) and E(X^2)=g'(1)+g^{(2)}(1), we have:

\displaystyle (14) \ \ \ \ \ E(X)=\frac{r(1-p)}{p} \ \ \ \ \ \ \ \ \ \ \ \ \ Var(X)=\frac{r(1-p)}{p^2}

Note that Var(X)=\frac{1}{p} E(X)>E(X). Thus when the sample data suggest that the variance is greater than the mean, the negative binomial distribution is an excellent alternative to the Poisson distribution. For example, suppose that the sample mean and the sample variance are 3.6 and 7.1. In exploring the possibility of fitting the data using the negative binomial distribution, we would be interested in the negative binomial distribution with this mean and variance. Then plugging these into (14) produces the negative binomial distribution with r=3.7 and p=0.507.
________________________________________________________________________

The Poisson-Gamma Mixture
One important application of the negative binomial distribution is that it is a mixture of a family of Poisson distributions with Gamma mixing weights. Thus the negative binomial distribution can be viewed as a generalization of the Poisson distribution. The negative binomial distribution can be viewed as a Poisson distribution where the Poisson parameter is itself a random variable, distributed according to a Gamma distribution. Thus the negative binomial distribution is known as a Poisson-Gamma mixture.

In an insurance application, the negative binomial distribution can be used as a model for claim frequency when the risks are not homogeneous. Let N has a Poisson distribution with parameter \theta, which can be interpreted as the number of claims in a fixed period of time from an insured in a large pool of insureds. There is uncertainty in the parameter \theta, reflecting the risk characteristic of the insured. Some insureds are poor risks (with large \theta) and some are good risks (with small \theta). Thus the parameter \theta should be regarded as a random variable \Theta. The following is the conditional distribution of the random variable N (conditional on \Theta=\theta):

\displaystyle (15) \ \ \ \ \ P(N=n \lvert \Theta=\theta)=\frac{e^{-\theta} \ \theta^n}{n!} \ \ \ \ \ \ \ \ \ \ n=0,1,2,\cdots

Suppose that \Theta has a Gamma distribution with scale parameter \alpha and shape parameter \beta. The following is the probability density function of \Theta.

\displaystyle (16) \ \ \ \ \ g(\theta)=\frac{\alpha^\beta}{\Gamma(\beta)} \theta^{\beta-1} e^{-\alpha \theta} \ \ \ \ \ \ \ \ \ \ \theta>0

Then the joint density of N and \Theta is:

\displaystyle (17) \ \ \ \ \ P(N=n \lvert \Theta=\theta) \ g(\theta)=\frac{e^{-\theta} \ \theta^n}{n!} \ \frac{\alpha^\beta}{\Gamma(\beta)} \theta^{\beta-1} e^{-\alpha \theta}

The unconditional distribution of N is obtained by summing out \theta in (17).

\displaystyle \begin{aligned}(18) \ \ \ \ \ P(N=n)&=\int_0^\infty P(N=n \lvert \Theta=\theta) \ g(\theta) \ d \theta \\&=\int_0^\infty \frac{e^{-\theta} \ \theta^n}{n!} \ \frac{\alpha^\beta}{\Gamma(\beta)} \ \theta^{\beta-1} \ e^{-\alpha \theta} \ d \theta \\&=\int_0^\infty \frac{\alpha^\beta}{n! \ \Gamma(\beta)} \ \theta^{n+\beta-1} \ e^{-(\alpha+1) \theta} d \theta \\&=\frac{\alpha^\beta}{n! \ \Gamma(\beta)} \ \frac{\Gamma(n+\beta)}{(\alpha+1)^{n+\beta}} \int_0^\infty \frac{(\alpha+1)^{n+\beta}}{\Gamma(n+\beta)} \theta^{n+\beta-1} \ e^{-(\alpha+1) \theta} d \theta \\&=\frac{\alpha^\beta}{n! \ \Gamma(\beta)} \ \frac{\Gamma(n+\beta)}{(\alpha+1)^{n+\beta}} \\&=\frac{\Gamma(n+\beta)}{\Gamma(n+1) \ \Gamma(\beta)} \ \biggl( \frac{\alpha}{\alpha+1}\biggr)^\beta \ \biggl(\frac{1}{\alpha+1}\biggr)^n \\&=\binom{n+\beta-1}{n} \ \biggl( \frac{\alpha}{\alpha+1}\biggr)^\beta \ \biggl(\frac{1}{\alpha+1}\biggr)^n \ \ \ \ \ \ \ \ \ n=0,1,2,\cdots \end{aligned}

Note that the integral in the fourth step in (18) is 1.0 since the integrand is the pdf of a Gamma distribution. The above probability function is that of a negative binomial distribution. It is of the same form as (6a). Equivalently, it is also of the form (6) with parameter r=\beta and p=\frac{\alpha}{\alpha+1}.

The variance of the negative binomial distribution is greater than the mean. In a Poisson distribution, the mean equals the variance. Thus the unconditional distribution of N is more dispersed than its conditional distributions. This is a characteristic of mixture distributions. The uncertainty in the parameter variable \Theta has the effect of increasing the unconditional variance of the mixture distribution of N. The variance of a mixture distribution has two components, the weighted average of the conditional variances and the variance of the conditional means. The second component represents the additional variance introduced by the uncertainty in the parameter \Theta (see The variance of a mixture).

________________________________________________________________________

The Poisson Distribution as Limit of Negative Binomial
There is another connection to the Poisson distribution, that is, the Poisson distribution is a limiting case of the negative binomial distribution. We show that the generating function of the Poisson distribution can be obtained by taking the limit of the negative binomial generating function as r \rightarrow \infty. Interestingly, the Poisson distribution is also the limit of the binomial distribution.

In this section, we use the negative binomial parametrization of (6a). By replacing \frac{\alpha}{\alpha+1} for p, the following are the mean, variance, and the generating function for the probability function in (6a):

\displaystyle \begin{aligned}(19) \ \ \ \ \ \ &E(X)=\frac{r}{\alpha} \\&\text{ }\\&Var(X)=\frac{\alpha+1}{\alpha} \ \frac{r}{\alpha}=\frac{r(\alpha+1)}{\alpha^2} \\&\text{ } \\&g(z)=\frac{1}{[1-\frac{1}{\alpha}(z-1)]^r} \ \ \ \ \ \ \ z<\alpha+1 \end{aligned}

Let r goes to infinity and \displaystyle \frac{1}{\alpha} goes to zero and at the same time keeping their product constant. Thus \displaystyle \mu=\frac{r}{\alpha} is constant (this is the mean of the negative binomial distribution). We show the following:

\displaystyle (20) \ \ \ \ \ \lim \limits_{r \rightarrow \infty} [1-\frac{\mu}{r}(z-1)]^{-r}=e^{\mu (z-1)}

The right-hand side of (20) is the generating function of the Poisson distribution with mean \mu. The generating function in the left-hand side is that of a negative binomial distribution with mean \displaystyle \mu=\frac{r}{\alpha}. The following is the derivation of (20).

\displaystyle \begin{aligned}(21) \ \ \ \ \ \lim \limits_{r \rightarrow \infty} [1-\frac{\mu}{r}(z-1)]^{-r}&=\lim \limits_{r \rightarrow \infty} e^{\displaystyle \biggl(ln[1-\frac{\mu}{r}(z-1)]^{-r}\biggr)} \\&=\lim \limits_{r \rightarrow \infty} e^{\displaystyle \biggl(-r \ ln[1-\frac{\mu}{r}(z-1)]\biggr)} \\&=e^{\displaystyle \biggl(\lim \limits_{r \rightarrow \infty} -r \ ln[1-\frac{\mu}{r}(z-1)]\biggr)} \end{aligned}

We now focus on the limit in the exponent.

\displaystyle \begin{aligned}(22) \ \ \ \ \ \lim \limits_{r \rightarrow \infty} -r \ ln[1-\frac{\mu}{r}(z-1)]&=\lim \limits_{r \rightarrow \infty} \frac{ln(1-\frac{\mu}{r} (z-1))^{-1}}{r^{-1}} \\&=\lim \limits_{r \rightarrow \infty} \frac{(1-\frac{\mu}{r} (z-1)) \ \mu (z-1) r^{-2}}{r^{-2}} \\&=\mu (z-1) \end{aligned}

The middle step in (22) uses the L’Hopital’s Rule. The result in (20) is obtained by combining (21) and (22).

________________________________________________________________________

Reference

  1. Klugman S.A., Panjer H. H., Wilmot G. E. Loss Models, From Data to Decisions, Second Edition., Wiley-Interscience, a John Wiley & Sons, Inc., New York, 2004

________________________________________________________________________
\copyright \ \text{2011-2015 by Dan Ma}

A geometric waiting time occupancy problem

In the provious post (The birthday problem as a waiting time occupancy problem), we presented a waiting time occupancy problem. We like to present another example of a waiting time occupancy problem. This time we also use a birthday problem as illustration. Suppose the proprietor of a restaurant decides that free dessert is offered to any customer dining at the restaurant on his or her birthday. On any given day, how long does it have to wait for the first birthday customer to show up? On average how many customers are expected to visit the restaurant until the first free dessert to be awarded? As we will see in this post, the average number of customers until the arrival of the first birthday customer is 365. For a small restaurant with a daily customer count under three to four hundred, free dessert may not have to be offered at all. We will also show that the median number of customers until the arrival of the first birthday customer is 253. So if the daily customer count is less than 253, there is a greater than 50% chance that a free dessert will not have to be offered. To obtain these results, we ignore leap year and assume that any day of the year is equally likely to be the birthday of a random customer. The results in this post will not hold if the free dessert offer is widely known and many customers come to the restaurant for the purpose of taking advantage of the free dessert offer.

Obviously, this birthday problem is a waiting time problem. We observe the number of customers until the arrival of the first birthday customer. We can also view this as an occupancy problem. The 365 days in the year can be considered as cells. The customer stream can be viewed as a series of balls being randomly placed into the 365 cells. We fix one cell in advance and we keep placing balls into the 365 cells until the a ball is placed into the specified cell. Once a ball reaches the specified cell, we continue to randomly assign balls into the cells. In the restaurant example, after the arrival of the first birthday customer, we continue to observe customers until the arrival of the next birthday customer.

We now discuss the problem in the context of the occupancy problem. There are n cells and we keep assigning balls (one at a time) into the n cells until a ball is placed into a cell specfied in advance. The random placing of each ball into a cell can be viewed as a Bernoulli trial with probability of success p=\frac{1}{n}. Success here means a ball reaches the cell specified in advance. Let X_1 be the number of trials (placements of balls) until the first success. Then X_1 has a geometric distribution with probability of success p=\frac{1}{n}. We make use of some basic facts about the geometric distribution in discussing the stated birthday problem. Let w be any integer greater than 1 and let X_w be the number of trials until the w balls are placed into the cell specified in advance. Then X_w has a negative binomial distribution. We will also discuss this fact with respect to the birthday problem at hand.

The Geometric Distribution
A random experiment resulting in two distinct outcomes (success or failure) is called a Bernoulli trial (e.g. coin tosses, whether or not the birthday of a customer is the first of January). Suppose that the probability of success in each trial is p and we observe a sequence of Bernoulli trials. Let X_1 be the number of Bernoulli trials we observe until a trial resulting in a success. Then X_1 has a geometric distribution. The following are some basic facts:

    \displaystyle P(X_1=k)=(1-p)^{k-1} \thinspace p \ \ \ \ \text{where }k=1,2,3, \cdots

    \displaystyle P(X_1>k)=(1-p)^k \ \ \ \ \ \ \ \ \ \text{where }k=1,2,3, \cdots

    \displaystyle E(X_1)=\frac{1}{p}

    \displaystyle Var(X_1)=\frac{1-p}{p^2}

If the first success occurs after k trials, then the first k trials are all failures. Thus P(X_1>k) is the same as the probability that there are no successes in the first k trials. Thus P(X_1>k) is as indicated above.

With respect to the occupancy problem, the probability of success (a ball is placed in a cell apecified in advance) is p=\frac{1}{n}. The above basic facts for the geometric distribution can be restated as follows:

    \displaystyle P(X_1=k)=\biggl(1-\frac{1}{n} \biggr)^{k-1} \thinspace \frac{1}{n} \ \ \ \ \text{where }k=1,2,3, \cdots

    \displaystyle P(X_1>k)=\biggl(1-\frac{1}{n} \biggr)^k \ \ \ \ \ \ \ \ \ \text{where }k=1,2,3, \cdots

    \displaystyle E(X_1)=\frac{1}{\frac{1}{n}}=n

    \displaystyle Var(X_1)=\frac{1-\frac{1}{n}}{\frac{1}{n^2}}=n^2-n

The Negative Binomial Distribution
Suppose that we observe the Bernoulli trials until we have w successes where w>1. Let X_w be the number of trials to obtain w successes. The random variable X_w follows the negative binomial distribution with parameters w and p. The following lists some basic facts:

    \displaystyle P(X_w=k)=\binom{k-1}{w-1} (1-p)^{k-w} p^w \ \ \ \ \ \ \text{where }k=w,w+1,w+2, \cdots

    \displaystyle P(X_w>k)=\sum \limits_{r=0}^{w-1} \binom{k}{r}p^r (1-p)^{k-r} \ \ \ \ \ \ \text{where }k=w,w+1,w+2, \cdots

    \displaystyle E(X_w)=w E(X_1)=\frac{w}{p}

    \displaystyle Var(X_w)=w Var(X_1)=w \thinspace \frac{1-p}{p^2}

If the w^{th} success takes place after k trials, then there can be at most w-1 successes in the first k trials. This derives P(X_w>k). Note that X_w is the independent sum of w many random variables, each of which has a geometric distribution. This fact derives the mean and variance indicated above.

With respect to the occupancy problem at hand, the following restates the basic facts:

    \displaystyle P(X_w=k)=\binom{k-1}{w-1} \biggl(1-\frac{1}{n} \biggr)^{k-w} \biggl(\frac{1}{n}\biggr)^w \ \ \ \ \ \ \text{where }k=w,w+1,w+2, \cdots

    \displaystyle P(X_w>k)=\sum \limits_{r=0}^{w-1} \binom{k}{r}\biggl(\frac{1}{n}\biggr)^r \biggl(1-\frac{1}{n}\biggr)^{k-r} \ \ \ \ \ \ \text{where }k=w,w+1,w+2, \cdots

    \displaystyle E(X_w)=w E(X_1)=w \thinspace n

    \displaystyle Var(X_w)=w Var(X_1)=w \thinspace (n^2-n)

Some Observations about the Birthday Problem
The number of arrivals until the first birthday customer has a geometric distribution. For each customer, the probability of success is \frac{1}{365} (the probability that today is his or her birthday). Thus the mean number of customers until we see a birthday customer is 365. On the other hand, the median number of customers until the first birthday customer is 253.

    \displaystyle P(X_1 \le 253)=1-P(X_1>253)=1-\biggl(1-\frac{1}{365}\biggr)^{253}=0.500477154

    \displaystyle P(X_1 \le 252)=1-P(X_1>252)=1-\biggl(1-\frac{1}{365}\biggr)^{252}=0.4991048385

Thus, if the number of customers on a given day is less than 253, there is a greater than 50% chance that no free dessert will be given out.

If we consider the number of customers until the second birthday customer, the mean is 730. The median is 613.

    \displaystyle \begin{aligned}P(X_2 \le 613)&=1-P(X_2>613)\\&=1-\sum \limits_{r=0}^{1} \binom{613}{r} \biggl(\frac{1}{365}\biggr)^r \biggl(1-\frac{1}{365}\biggr)^{613-r}\\&=0.500638\end{aligned}

    \displaystyle \begin{aligned}P(X_2 \le 612)&=1-P(X_2>612)\\&=1-\sum \limits_{r=0}^{1} \binom{612}{r} \biggl(\frac{1}{365}\biggr)^r \biggl(1-\frac{1}{365}\biggr)^{612-r}\\&=0.499779\end{aligned}

If the birthdays are equally probable across the 365 days in the year and if the arrivals of customers are truly random, then on any given day it may not be easy to find a birthday customer.

One interesting fact about the waiting time until the first birthday customer is that the geometric distribution has the “no memory” property. On a given day, suppose that 45 customers have arrived and there is no birthday customer. Would this mean that it is more likely that there will be a birthday customer in the next 10 customers than at the beginning of the business day? It turns out that the probability of receiving a birthday customer among the next ten customers is the same as the unconditional probability of having a birthday customer in the next ten customers. It is just as likely to have a birthday customer in the next ten customers when 45 customers have arrived as in the case that no customers have arrived. So the fact that 45 customers have arrived makes no difference.

Reference

  1. Feller, W., An Introduction to Probability Theory and its Applications, Vol. I, 3rd ed., John Wiley & Sons, Inc., New York, 1968