The capture-recapture method

The capture-recapture method is a standard way to estimate the size of a wildlife population, and it is based on the hypergeometric distribution. Recall that the hypergeometric distribution is a three-parameter family of discrete distributions and that one of the parameters, denoted by N in this post, is the size of the population. We show that the estimate of N obtained from the capture-recapture method is the value of N that makes the observed data “more likely” than any other possible value of N. Thus, the capture-recapture method produces the maximum likelihood estimate of the population size parameter N of the hypergeometric distribution.

Let’s start with an example. In order to estimate the size of the population of bluegills (a species of freshwater fish) in a small lake in Missouri, a total of w=250 bluegills are captured, tagged and then released. After allowing sufficient time for the tagged fish to disperse, a sample of n=150 bluegills is caught. It is found that y=16 bluegills in the sample are tagged. Estimate the size of the bluegill population in this lake.

Let N be the size of the bluegill population in this lake. The population proportion of tagged bluegills is \frac{w}{N}. The sample proportion of tagged bluegills is \frac{y}{n}. In the capture-recapture method, the population proportion and the sample proportion are set equal to each other. Then we solve for N.

\displaystyle \frac{w}{N}=\frac{y}{n} \Rightarrow N=\frac{w n}{y}=\frac{250(150)}{16}=2343.75 \approx 2343
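The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name capture_recapture_estimate is made up for this post, and rounding down to a whole number of fish is the convention used in the example, not a general rule of the method.

```python
from fractions import Fraction
from math import floor

def capture_recapture_estimate(w, n, y):
    """Return w*n/y exactly (hypothetical helper, not from any library).

    w: number of tagged animals released
    n: size of the second sample
    y: number of tagged animals found in the second sample
    """
    if y <= 0:
        raise ValueError("the sample must contain at least one tagged animal")
    return Fraction(w * n, y)

estimate = capture_recapture_estimate(250, 150, 16)
print(float(estimate))   # 2343.75
print(floor(estimate))   # 2343, the estimate rounded down to a whole fish
```

Using Fraction keeps the quotient exact, so the rounding step does not depend on floating-point behavior.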

Now for the connection to the hypergeometric distribution. After w=250 bluegills are captured, tagged and released, the population is separated into two distinct classes, tagged and non-tagged. When a sample of n=150 bluegills is selected without replacement, we let Y be the number of bluegills in the sample that are tagged. The distribution of Y is the hypergeometric distribution. The following is the probability function of Y.

\displaystyle P[Y=y]=\frac{\binom{w}{y} \thinspace \binom{N-w}{n-y}}{\binom{N}{n}}
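As a rough sketch, the probability function above can be written directly with binomial coefficients from the Python standard library. The name hypergeom_pmf and the argument order are choices made for this illustration. The small population used in the check (N=20, w=7, n=5) is also made up, purely to confirm that the probabilities sum to 1.

```python
from math import comb

def hypergeom_pmf(y, N, w, n):
    """P[Y = y] when n items are drawn without replacement from a
    population of N items, of which w are tagged (illustrative helper)."""
    # Outside the support the probability is zero; math.comb rejects
    # negative arguments, so those cases are handled here first.
    if y < 0 or y > w or n - y < 0 or n - y > N - w:
        return 0.0
    return comb(w, y) * comb(N - w, n - y) / comb(N, n)

# Sanity check on a small made-up population: probabilities add up to 1.
total = sum(hypergeom_pmf(y, 20, 7, 5) for y in range(0, 6))
print(abs(total - 1.0) < 1e-12)

# Probability of observing 16 tagged fish if the lake held 2343 bluegills.
print(hypergeom_pmf(16, 2343, 250, 150))
```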

In the hypergeometric distribution described here, the parameters w and n are known (w=250 and n=150). We now show that the estimate of N=2343 is the estimate that makes the observed value of y=16 “most likely” (i.e. the estimate of N=2343 is a maximum likelihood estimate of N). To show this, we consider the ratio of the hypergeometric probabilities for two successive values of N.

\displaystyle \frac{P(N)}{P(N-1)}=\frac{(N-w)(N-n)}{N(N-w-n+y)}

where \displaystyle P(N)=\frac{\binom{w}{y} \thinspace \binom{N-w}{n-y}}{\binom{N}{n}} and \displaystyle P(N-1)=\frac{\binom{w}{y} \thinspace \binom{N-1-w}{n-y}}{\binom{N-1}{n}}
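The ratio formula can be checked numerically. The sketch below, with helper names invented for this post, uses exact rational arithmetic to compare the ratio of the two hypergeometric probabilities against the closed form for a range of N around the bluegill estimate.

```python
from fractions import Fraction
from math import comb

def likelihood(N, w, n, y):
    """P(N) as an exact fraction (illustrative helper)."""
    return Fraction(comb(w, y) * comb(N - w, n - y), comb(N, n))

def ratio_closed_form(N, w, n, y):
    """The simplified ratio (N-w)(N-n) / (N(N-w-n+y))."""
    return Fraction((N - w) * (N - n), N * (N - w - n + y))

w, n, y = 250, 150, 16
for N in range(2300, 2400):
    direct = likelihood(N, w, n, y) / likelihood(N - 1, w, n, y)
    assert direct == ratio_closed_form(N, w, n, y)

print(ratio_closed_form(2343, w, n, y) > 1)   # True: P(2343) > P(2342)
print(ratio_closed_form(2344, w, n, y) < 1)   # True: P(2344) < P(2343)
```

Exact fractions are used here because the ratio is extremely close to 1 near the maximum, and an exact comparison avoids any doubt about rounding.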

Note that 1<\frac{P(N)}{P(N-1)} or P(N-1)<P(N) if and only if the following holds:

\displaystyle N(N-w-n+y)<(N-w)(N-n)

\displaystyle N^2-Nw-Nn+Ny<N^2-Nn-Nw+wn

\displaystyle Ny<wn

\displaystyle N<\frac{w n}{y}

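Note that \frac{w n}{y} is the estimate from the capture-recapture method. The inequality says that P(N) exceeds P(N-1) exactly when N is below \frac{w n}{y}, so P(N) increases as long as N<\frac{w n}{y} and decreases once N>\frac{w n}{y}. The likelihood is therefore maximized at the largest integer not exceeding \frac{w n}{y}, and the maximum likelihood estimate is \hat{N}=\lfloor \frac{w n}{y} \rfloor, which is 2343 in the bluegill example. (When \frac{w n}{y} is itself an integer, P(N)=P(N-1) at that value, and both \frac{w n}{y} and \frac{w n}{y}-1 maximize the likelihood.)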

As an illustration, we compute the probabilities \displaystyle P(N)=\frac{\binom{250}{16} \thinspace \binom{N-250}{150-16}}{\binom{N}{150}} for several values of N above and below N=2343. The following table shows that the likelihood is largest at N=2343.

\displaystyle \begin{pmatrix} N&P(N) \\{2340}&0.1084918 \\{2341}&0.1084929 \\{2342}&0.1084935 \\{2343}&0.1084938 \\{2344}&0.1084937 \\{2345}&0.1084933 \\{2346}&0.1084924\end{pmatrix}
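A table like the one above could be produced with a short script along the following lines. This is a sketch using only the Python standard library. The function name and default arguments are chosen for this example, and the search is restricted to the same seven values of N shown in the table.

```python
from math import comb

def likelihood(N, w=250, n=150, y=16):
    """P(N) for the bluegill data (illustrative helper)."""
    # Python divides the two large integers exactly before rounding to a
    # float, so no precision is lost in the intermediate binomials.
    return comb(w, y) * comb(N - w, n - y) / comb(N, n)

candidates = range(2340, 2347)

# Print N alongside P(N) to seven decimal places.
for N in candidates:
    print(N, f"{likelihood(N):.7f}")

# Pick the candidate with the largest likelihood.
best = max(candidates, key=likelihood)
print("maximizer:", best)
```

By the inequality derived earlier, the likelihood rises up to N=2343 and falls from N=2344 onward, so the maximizer reported by the script is 2343.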