Confidence intervals for San Francisco rainfall

Population percentiles can be estimated in a way that is distribution free. Draw a random sample from the population of interest and take the middle element of the ordered sample as an estimate of the population median. Furthermore, we can attach a confidence interval to this estimate without knowing (or assuming) a probability distribution for the underlying phenomenon. This “distribution free” method is shown in the post called Confidence intervals for percentiles. In this post, we give an additional example, using annual rainfall data in San Francisco, to illustrate this approach to non-parametric inference using order statistics.

________________________________________________________________________

San Francisco rainfall data

The following table shows the annual rainfall in San Francisco (in inches) from 1960 to 2013 (data source). The table consists of 54 measurements, sorted in increasing order from left to right and from top to bottom. Each annual measurement covers July of that year through June of the following year. The driest year (7.97 inches) is 1975, the period from July 1975 to June 1976. The wettest year (47.22 inches) is 1997, the period from July 1997 to June 1998. The most recent data point is the fifth measurement, 12.54 inches (the period from July 2013 to June 2014).

    \displaystyle \begin{bmatrix} 7.97&\text{ }&11.06&\text{ } &11.06&\text{ }&12.32&\text{ }&12.54  \\ 13.86&\text{ }&13.87&\text{ } &14.08&\text{ }&14.32&\text{ }&14.46    \\ 15.22&\text{ }&15.39&\text{ } &15.64&\text{ }&16.33&\text{ }&16.61   \\ 16.89&\text{ }&17.43&\text{ } &17.50&\text{ }&17.65&\text{ }&17.74   \\ 18.11&\text{ }&18.26&\text{ } &18.74&\text{ }&18.79&\text{ }&19.20  \\ 19.47&\text{ }&20.01&\text{ } &20.54&\text{ }&20.80&\text{ }&22.15  \\ 22.29&\text{ }&22.47&\text{ } &22.63&\text{ }&23.49&\text{ }&23.87  \\ 24.09&\text{ }&24.49&\text{ } &24.89&\text{ }&24.89&\text{ }&25.03  \\ 25.09&\text{ }&26.66&\text{ } &26.87&\text{ }&27.76&\text{ }&28.68  \\ 28.87&\text{ }&29.41&\text{ }&31.87&\text{ } &34.02&\text{ }&34.36  \\ 34.43&\text{ }&37.10&\text{ }&38.17&\text{ } &47.22&\text{ }&\text{ }    \end{bmatrix}

Using the above data, estimate the median, the lower quartile (25th percentile) and the upper quartile (the 75th percentile) of the annual rainfall in San Francisco. Then find a reasonably good confidence interval for each of the three population percentiles.

________________________________________________________________________

Basic facts about order statistics

Let’s recall some basic facts about order statistics from previous posts.

Suppose we have a random sample X_1,X_2,\cdots,X_n drawn from a population whose percentiles are unknown and we wish to estimate them. Rank the items of the random sample to obtain the order statistics Y_1<Y_2<\cdots < Y_n. In an ideal setting, the measurements arise from a continuous distribution, so the chance of a tie among the Y_j is zero. This assumption may not hold on occasion. There are some ties in the San Francisco rainfall data (e.g. the second and third data points). The small number of ties will not affect the calculations performed below.

The reason that we can use the order statistics Y_j to estimate the population percentiles is that the expected percentage of the population below Y_j is about the same as the percentage of the sample items less than Y_j. As explained in an earlier post, the order statistic Y_j is expected to be above 100p percent of the population where p=\frac{j}{n+1}. In fact, the order statistics Y_1<Y_2<\cdots < Y_n are expected to divide the population into roughly equal segments. More specifically, the expected percentage of the population in between Y_{j-1} and Y_j is 100h where h=\frac{1}{n+1}.
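The claim that Y_j is expected to sit above a fraction j/(n+1) of the population can be checked by simulation. The following is only an illustrative sketch under an assumption the post does not make: samples are drawn from the uniform distribution on (0,1), for which the fraction of the population below a value x is x itself. The function name mean_fraction_below, the number of trials and the seed are arbitrary choices of ours.

```python
import random

def mean_fraction_below(j, n, trials=10_000, seed=0):
    """Average, over many samples of size n, of the population fraction below Y_j.

    Samples come from Uniform(0, 1) (an assumption made for illustration),
    so the population fraction below a value x is simply x.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = sorted(rng.random() for _ in range(n))
        total += sample[j - 1]  # j-th order statistic (1-based rank)
    return total / trials

n = 54
for j in (11, 27, 45):
    # Simulated average next to the theoretical value j/(n+1)
    print(j, round(mean_fraction_below(j, n), 3), round(j / (n + 1), 3))
```

The two printed columns should agree closely for each rank, which is the sense in which Y_j serves as the sample 100pth percentile with p=j/(n+1).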

The above explanation justifies the use of the order statistic Y_j as the sample 100pth percentile where p=\frac{j}{n+1}.

The sample size is n= 54 in the San Francisco rainfall data. Thus the order statistic Y_{11} is the sample 20th percentile and can be taken as an estimate of the population 20th percentile for the San Francisco annual rainfall. Here the realized value of Y_{11} is 15.22.

With \frac{45}{54+1}=0.818, the order statistic Y_{45} is the sample 82nd percentile and is taken as an estimate of the population 82nd percentile for the San Francisco annual rainfall. The realized value of Y_{45} is 28.68 inches.
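The rank-to-percentile bookkeeping above is easy to reproduce. The following sketch copies the 54 sorted measurements from the table; the variable name rainfall and the helper sample_percentile_rank are hypothetical names introduced here for illustration.

```python
# Annual rainfall totals (inches), sorted in increasing order, copied from the table.
rainfall = [
    7.97, 11.06, 11.06, 12.32, 12.54,
    13.86, 13.87, 14.08, 14.32, 14.46,
    15.22, 15.39, 15.64, 16.33, 16.61,
    16.89, 17.43, 17.50, 17.65, 17.74,
    18.11, 18.26, 18.74, 18.79, 19.20,
    19.47, 20.01, 20.54, 20.80, 22.15,
    22.29, 22.47, 22.63, 23.49, 23.87,
    24.09, 24.49, 24.89, 24.89, 25.03,
    25.09, 26.66, 26.87, 27.76, 28.68,
    28.87, 29.41, 31.87, 34.02, 34.36,
    34.43, 37.10, 38.17, 47.22,
]

n = len(rainfall)  # sample size

def sample_percentile_rank(j, n):
    """Fraction p = j/(n+1) attached to the j-th order statistic (1-based rank)."""
    return j / (n + 1)

for j in (11, 45):
    p = sample_percentile_rank(j, n)
    # rainfall[j - 1] is the realized value of Y_j because Python lists are 0-based
    print(f"Y_{j} = {rainfall[j - 1]:.2f} inches, p = {p:.3f}")
```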

The key to constructing a confidence interval for a percentile is to calculate the probability P(Y_i < \tau_p < Y_j). This is the probability that the 100pth percentile, where 0<p<1, is in between Y_i and Y_j. Let's look at the median \tau_{0.5}. For Y_i<\tau_{0.5} to happen, there must be at least i sample items less than the median \tau_{0.5}. For \tau_{0.5}<Y_j to happen, there can be at most j-1 sample items less than the median \tau_{0.5}. Thus in the random draws of the sample items, in order for the event Y_i < \tau_{0.5} < Y_j to occur, there must be at least i sample items and at most j-1 sample items that are less than \tau_{0.5}. In other words, in n Bernoulli trials, there are at least i and at most j-1 successes, where the probability of success is P(X<\tau_{0.5})= 0.5. The following is the probability P(Y_i < \tau_{0.5} < Y_j):

    \displaystyle P(Y_i < \tau_{0.5} < Y_j)=\sum \limits_{k=i}^{j-1} \binom{n}{k} \ 0.5^k \ 0.5^{n-k}=1 - \alpha

Then the interval Y_i < \tau_{0.5} < Y_j is taken to be the 100(1-\alpha)% confidence interval for the unknown population median \tau_{0.5}. Note that this confidence interval is constructed without knowing (or assuming) anything about the underlying distribution of the population.

Consider the 100pth percentile where 0<p<1. In order for the event Y_i < \tau_{p} < Y_j to occur, there must be at least i sample items and at most j-1 sample items that are less than \tau_{p}. This is equivalent to n Bernoulli trials resulting in at least i successes and at most j-1 successes where the probability of success is P(X<\tau_{p})=p.

    \displaystyle P(Y_i < \tau_{p} < Y_j)=\sum \limits_{k=i}^{j-1} \binom{n}{k} \ p^k \ (1-p)^{n-k}=1 - \alpha

Then the interval Y_i < \tau_{p} < Y_j is taken to be the 100(1-\alpha)% confidence interval for the unknown population percentile \tau_{p}. As mentioned earlier, this confidence interval does not rely on any information about the distribution of the population and is said to be distribution free. It relies only on a probability statement that involves the binomial distribution in describing the positioning of the sample items. In the past, people used the normal approximation to the binomial to estimate this probability. The normal approximation should no longer be needed, as computing software is now easily available. For example, binomial probabilities can be computed in Excel for a number of trials of a million or more.
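The binomial sum above translates directly into a few lines of code. The following is a minimal sketch using only the Python standard library; the function name coverage is our own choice, not something used in the post.

```python
from math import comb

def coverage(n, i, j, p):
    """P(Y_i < tau_p < Y_j) = sum over k = i, ..., j-1 of C(n, k) p^k (1-p)^(n-k)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(i, j))

# Small sanity check: with n = 5, the interval (Y_1, Y_5) misses the median only
# when all five items fall on the same side of it, so the coverage is 1 - 2/32.
print(coverage(5, 1, 5, 0.5))  # 0.9375
```

Because math.comb works with exact integers, the sum does not need the normal approximation even for fairly large n.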

________________________________________________________________________

Percentiles of annual rainfall

Recall the problem posed earlier: using the rainfall data, estimate the median, the lower quartile (25th percentile) and the upper quartile (75th percentile) of the annual rainfall in San Francisco, and then find a reasonably good confidence interval for each of the three population percentiles.

The sample size is n= 54. The middle two data elements in the sample are y_{27}=20.01 and y_{28}=20.54, the realizations of the order statistics Y_{27} and Y_{28}. In this example, \frac{27}{54+1}=0.491 and \frac{28}{54+1}=0.509. Thus the order statistic Y_{27} is expected to be greater than about 49% of the population and Y_{28} is expected to be greater than about 51% of the population. Neither Y_{27} nor Y_{28} is an exact fit for the 50th percentile, so we take the average of the two as an estimate of the population median:

    \displaystyle \hat{\tau}_{0.5}=\frac{20.01+20.54}{2}=20.275

Looking for confidence intervals, we consider the intervals (Y_{21},Y_{34}), (Y_{20},Y_{35}), (Y_{19},Y_{36}) and (Y_{18},Y_{37}). The following shows the confidence levels.

    \displaystyle P(Y_{21} < \tau_{0.5} < Y_{34})=\sum \limits_{k=21}^{33} \binom{54}{k} \ 0.5^k \ (0.5)^{54-k}=0.924095271

    \displaystyle P(Y_{20} < \tau_{0.5} < Y_{35})=\sum \limits_{k=20}^{34} \binom{54}{k} \ 0.5^k \ (0.5)^{54-k}=0.959776436

    \displaystyle P(Y_{19} < \tau_{0.5} < Y_{36})=\sum \limits_{k=19}^{35} \binom{54}{k} \ 0.5^k \ (0.5)^{54-k}=0.980165673

    \displaystyle P(Y_{18} < \tau_{0.5} < Y_{37})=\sum \limits_{k=18}^{36} \binom{54}{k} \ 0.5^k \ (0.5)^{54-k}=0.99092666

The above calculations were done in Excel, with the binomial probabilities computed by the function BINOM.DIST. They give the following confidence intervals for the median annual San Francisco rainfall (in inches).

    Median

    \displaystyle \hat{\tau}_{0.5}=\frac{20.01+20.54}{2}=20.275

    (Y_{21},Y_{34}) = (18.11, 23.49) with approximately 92% confidence

    (Y_{20},Y_{35}) = (17.74, 23.87) with approximately 96% confidence

    (Y_{19},Y_{36}) = (17.65, 24.09) with approximately 98% confidence

    (Y_{18},Y_{37}) = (17.50, 24.49) with approximately 99% confidence
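The confidence levels for the median can also be reproduced outside of Excel. The sketch below mimics the spreadsheet approach by taking a difference of two cumulative binomial probabilities; binom_cdf stands in for a cumulative BINOM.DIST call, and all function names here are hypothetical. It also searches for the narrowest interval of the symmetric form (Y_i, Y_{n+1-i}) that reaches a target confidence level, which is one reasonable way (an assumption on our part) to pick among the candidates.

```python
from math import comb

def binom_cdf(x, n, p):
    """P(K <= x) for K ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(0, x + 1))

def median_interval_confidence(n, i):
    """Confidence level of (Y_i, Y_j) for the median, with j = n + 1 - i."""
    j = n + 1 - i
    # P(i <= K <= j - 1) written as a difference of cumulative probabilities
    return binom_cdf(j - 1, n, 0.5) - binom_cdf(i - 1, n, 0.5)

def narrowest_symmetric_interval(n, target):
    """Ranks (i, j) of the narrowest symmetric interval reaching the target level."""
    for i in range((n + 1) // 2, 0, -1):  # widen the interval one rank at a time
        if median_interval_confidence(n, i) >= target:
            return i, n + 1 - i
    return None  # even (Y_1, Y_n) falls short of the target

n = 54
for i in (21, 20, 19, 18):
    print(f"(Y_{i}, Y_{n + 1 - i}): {median_interval_confidence(n, i):.6f}")
print(narrowest_symmetric_interval(n, 0.95))
```

For a 95% target the search lands on the ranks (20, 35), the second interval in the list above.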

For the lower quartile and upper quartile, the following are the results. The reader is invited to confirm the calculation.

    Lower quartile

    \displaystyle \hat{\tau}_{0.25}=15.985, average of Y_{13} and Y_{14}

    (Y_{7},Y_{20}) = (13.87, 17.74) with approximately 96% confidence

    (Y_{6},Y_{21}) = (13.86, 18.11) with approximately 98% confidence

    (Y_{5},Y_{22}) = (12.54, 18.26) with approximately 99% confidence

    Upper quartile

    \displaystyle \hat{\tau}_{0.75}=25.875, average of Y_{41} and Y_{42}

    (Y_{36},Y_{47}) = (24.09, 29.41) with approximately 91% confidence

    (Y_{35},Y_{48}) = (23.87, 31.87) with approximately 96% confidence

    (Y_{34},Y_{49}) = (23.49, 34.02) with approximately 98% confidence

The following shows the calculation for two of the confidence intervals, one for \tau_{0.25} and one for \tau_{0.75}.

    \displaystyle P(Y_{6} < \tau_{0.25} < Y_{21})=\sum \limits_{k=6}^{20} \binom{54}{k} \ 0.25^k \ (0.75)^{54-k}=0.979889918

    \displaystyle P(Y_{34} < \tau_{0.75} < Y_{49})=\sum \limits_{k=34}^{48} \binom{54}{k} \ 0.75^k \ (0.25)^{54-k}=0.979889918
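Readers who want to confirm the quartile results can do so with a short script. The sketch below recomputes the point estimates and the confidence levels for the six quartile intervals listed above. The names rainfall, coverage and point_estimate are ours, and point_estimate simply mirrors the averaging used in this post (it assumes (n+1)p is not a whole number, which holds for the three percentiles considered here).

```python
from math import comb, floor

# Annual rainfall totals (inches), sorted in increasing order, copied from the table.
rainfall = [
    7.97, 11.06, 11.06, 12.32, 12.54, 13.86, 13.87, 14.08, 14.32, 14.46,
    15.22, 15.39, 15.64, 16.33, 16.61, 16.89, 17.43, 17.50, 17.65, 17.74,
    18.11, 18.26, 18.74, 18.79, 19.20, 19.47, 20.01, 20.54, 20.80, 22.15,
    22.29, 22.47, 22.63, 23.49, 23.87, 24.09, 24.49, 24.89, 24.89, 25.03,
    25.09, 26.66, 26.87, 27.76, 28.68, 28.87, 29.41, 31.87, 34.02, 34.36,
    34.43, 37.10, 38.17, 47.22,
]
n = len(rainfall)

def coverage(i, j, p):
    """P(Y_i < tau_p < Y_j) as a binomial sum over k = i, ..., j-1."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(i, j))

def point_estimate(p):
    """Average of the two order statistics whose ranks bracket (n + 1) * p."""
    lo = floor((n + 1) * p)  # 1-based rank of the lower neighbor
    return (rainfall[lo - 1] + rainfall[lo]) / 2

quartile_intervals = {
    0.25: [(7, 20), (6, 21), (5, 22)],
    0.75: [(36, 47), (35, 48), (34, 49)],
}
for p, pairs in quartile_intervals.items():
    print(f"p = {p}: estimate {point_estimate(p):.3f}")
    for i, j in pairs:
        print(f"  (Y_{i}, Y_{j}) = ({rainfall[i - 1]:.2f}, {rainfall[j - 1]:.2f}), "
              f"confidence {coverage(i, j, p):.4f}")
```

The two displayed probabilities are equal for a reason: counting the items above \tau_{0.75} instead of below it turns the sum for (Y_{34},Y_{49}) with p=0.75 into the sum for (Y_{6},Y_{21}) with p=0.25.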

________________________________________________________________________
\copyright \ \text{2015 by Dan Ma}
