# Confidence intervals for San Francisco rainfall

Population percentiles can be estimated in a way that is distribution free. Draw a random sample from the population of interest and take the middle element of the sample as an estimate of the population median. We can even attach a confidence interval to this estimate of the median without knowing (or assuming) a probability distribution for the underlying phenomenon. This “distribution free” method is shown in the post called Confidence intervals for percentiles. In this post, we give an additional example, using annual rainfall data in San Francisco, to illustrate this approach to non-parametric inference using order statistics.

________________________________________________________________________

San Francisco rainfall data

The following table shows the annual rainfall data in San Francisco (in inches) from 1960 to 2013 (data source). The table consists of 54 measurements and is sorted in increasing order from left to right (and from top to bottom). Each annual rainfall measurement is from July of that year to June of the following year. The driest year (7.97 inches) is 1975, the period from July 1975 to June 1976. The wettest year (47.22 inches) is 1997, the period from July 1997 to June 1998. The most recent data point is the fifth measurement, 12.54 inches (the period from July 2013 to June 2014).

$\displaystyle \begin{bmatrix} 7.97&\text{ }&11.06&\text{ } &11.06&\text{ }&12.32&\text{ }&12.54 \\ 13.86&\text{ }&13.87&\text{ } &14.08&\text{ }&14.32&\text{ }&14.46 \\ 15.22&\text{ }&15.39&\text{ } &15.64&\text{ }&16.33&\text{ }&16.61 \\ 16.89&\text{ }&17.43&\text{ } &17.50&\text{ }&17.65&\text{ }&17.74 \\ 18.11&\text{ }&18.26&\text{ } &18.74&\text{ }&18.79&\text{ }&19.20 \\ 19.47&\text{ }&20.01&\text{ } &20.54&\text{ }&20.80&\text{ }&22.15 \\ 22.29&\text{ }&22.47&\text{ } &22.63&\text{ }&23.49&\text{ }&23.87 \\ 24.09&\text{ }&24.49&\text{ } &24.89&\text{ }&24.89&\text{ }&25.03 \\ 25.09&\text{ }&26.66&\text{ } &26.87&\text{ }&27.76&\text{ }&28.68 \\ 28.87&\text{ }&29.41&\text{ }&31.87&\text{ } &34.02&\text{ }&34.36 \\ 34.43&\text{ }&37.10&\text{ }&38.17&\text{ } &47.22&\text{ }&\text{ } \end{bmatrix}$

Using the above data, estimate the median, the lower quartile (25th percentile) and the upper quartile (the 75th percentile) of the annual rainfall in San Francisco. Then find a reasonably good confidence interval for each of the three population percentiles.

________________________________________________________________________

Let’s recall some basic facts from previous posts.

Let’s say we have a random sample $X_1,X_2,\cdots,X_n$ drawn from a population whose percentiles are unknown and we wish to estimate them. Rank the items of the random sample to obtain the order statistics $Y_1<Y_2<\cdots<Y_n$. In an ideal setting, the measurements arise from a continuous distribution, so the chance of a tie among the $Y_j$ is zero. This assumption may not hold on occasion. There are some ties in the San Francisco rainfall data (e.g. the second and third data points). The small number of ties will not affect the calculations performed below.

The reason that we can use the order statistics $Y_j$ to estimate the population percentiles is that the expected percentage of the population below $Y_j$ is about the same as the percentage of the sample items less than $Y_j$. According to the explanation in a previous post, the order statistic $Y_j$ is expected to be above $100p$ percent of the population where $p=\frac{j}{n+1}$. In fact, the order statistics $Y_1<Y_2<\cdots<Y_n$ are expected to divide the population into roughly equal segments. More specifically, the expected percentage of the population in between $Y_{j-1}$ and $Y_j$ is $100h$ where $h=\frac{1}{n+1}$.

The above explanation justifies the use of the order statistic $Y_j$ as the sample $100p$th percentile where $p=\frac{j}{n+1}$.
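A quick simulation can make this claim concrete. The sketch below assumes a Uniform(0, 1) population, chosen only because the fraction of that population below a value $y$ is $y$ itself. The function name `mean_fraction_below`, the number of repetitions and the seed are illustrative choices, not part of the original discussion.

```python
import random

def mean_fraction_below(j, n, reps=5000, seed=0):
    """Average fraction of a Uniform(0, 1) population lying below Y_j."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        sample = sorted(rng.random() for _ in range(n))
        # For Uniform(0, 1), the fraction of the population below y is y itself.
        total += sample[j - 1]
    return total / reps

n = 54
for j in (11, 27, 45):
    # Compare the simulated average with the claimed value j / (n + 1).
    print(j, round(mean_fraction_below(j, n), 3), round(j / (n + 1), 3))
```

The two printed columns should agree closely for each $j$, which is what the rule $p=\frac{j}{n+1}$ predicts.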

The sample size is $n=54$ in the San Francisco rainfall data. With $\frac{11}{54+1}=0.2$, the order statistic $Y_{11}$ is the sample 20th percentile and can be taken as an estimate of the population 20th percentile for the San Francisco annual rainfall. Here the realized value of $Y_{11}$ is 15.22 inches.

With $\frac{45}{54+1}=0.818$, the order statistic $Y_{45}$ is the sample 82nd percentile and is taken as an estimate of the population 82nd percentile for the San Francisco annual rainfall. The realized value of $Y_{45}$ is 28.68 inches.
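The two lookups above can be sketched in Python. The list `rainfall` is copied from the table above, and the helper name `order_stat` is a hypothetical name introduced only for this illustration.

```python
# Annual rainfall in inches, already sorted (copied from the table above).
rainfall = [
    7.97, 11.06, 11.06, 12.32, 12.54,
    13.86, 13.87, 14.08, 14.32, 14.46,
    15.22, 15.39, 15.64, 16.33, 16.61,
    16.89, 17.43, 17.50, 17.65, 17.74,
    18.11, 18.26, 18.74, 18.79, 19.20,
    19.47, 20.01, 20.54, 20.80, 22.15,
    22.29, 22.47, 22.63, 23.49, 23.87,
    24.09, 24.49, 24.89, 24.89, 25.03,
    25.09, 26.66, 26.87, 27.76, 28.68,
    28.87, 29.41, 31.87, 34.02, 34.36,
    34.43, 37.10, 38.17, 47.22,
]

n = len(rainfall)

def order_stat(j):
    """Return the realized order statistic y_j (1-based index)."""
    return rainfall[j - 1]

for j in (11, 45):
    p = j / (n + 1)  # Y_j is treated as the sample 100p-th percentile
    print(f"Y_{j} = {order_stat(j)} inches, sample {100 * p:.1f}th percentile")
```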

The key to constructing a confidence interval for a percentile is to calculate the probability $P(Y_i < \tau_p < Y_j)$. This is the probability that the $100p$th percentile $\tau_p$, where $0<p<1$, is in between $Y_i$ and $Y_j$. Let's look at the median $\tau_{0.5}$. For $Y_i<\tau_{0.5}$ to happen, there must be at least $i$ sample items less than the median $\tau_{0.5}$. For $\tau_{0.5}<Y_j$ to happen, there can be at most $j-1$ sample items less than the median $\tau_{0.5}$. Thus in the random draws of the sample items, in order for the event $Y_i < \tau_{0.5} < Y_j$ to occur, there must be at least $i$ sample items and at most $j-1$ sample items that are less than $\tau_{0.5}$. In other words, in $n$ Bernoulli trials, there are at least $i$ and at most $j-1$ successes, where the probability of success is $P(X<\tau_{0.5})=0.5$. The following is the probability $P(Y_i < \tau_{0.5} < Y_j)$:

$\displaystyle P(Y_i < \tau_{0.5} < Y_j)=\sum \limits_{k=i}^{j-1} \binom{n}{k} \ 0.5^k \ 0.5^{n-k}=1 - \alpha$

Then the interval $Y_i < \tau_{0.5} < Y_j$ is taken to be the $100(1-\alpha)$% confidence interval for the unknown population median $\tau_{0.5}$. Note that this confidence interval is constructed without knowing (or assuming) anything about the underlying distribution of the population.

Consider the $100p$th percentile $\tau_p$ where $0<p<1$. In order for the event $Y_i < \tau_{p} < Y_j$ to occur, there must be at least $i$ sample items and at most $j-1$ sample items that are less than $\tau_{p}$. This is equivalent to $n$ Bernoulli trials resulting in at least $i$ successes and at most $j-1$ successes, where the probability of success is $P(X<\tau_{p})=p$.

$\displaystyle P(Y_i < \tau_{p} < Y_j)=\sum \limits_{k=i}^{j-1} \binom{n}{k} \ p^k \ (1-p)^{n-k}=1 - \alpha$

Then the interval $Y_i < \tau_{p} < Y_j$ is taken to be the $100(1-\alpha)$% confidence interval for the unknown population percentile $\tau_{p}$. As mentioned earlier, this confidence interval does not rely on any information about the distribution of the population and is said to be distribution free. It relies only on a probability statement that involves the binomial distribution in describing the positioning of the sample items. In the past, people used the normal approximation to the binomial to estimate this probability. The normal approximation should no longer be needed, as computing software is now easily available. For example, binomial probabilities can be computed in Excel for a number of trials of a million or more.
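The binomial sum above is short enough to compute directly. The following is a minimal sketch using only the Python standard library. The function name `coverage` and the small case $n=10$, $i=2$, $j=9$ are illustrative assumptions, chosen so the result can be checked by hand.

```python
from math import comb

def coverage(n, p, i, j):
    """P(Y_i < tau_p < Y_j): between i and j-1 of the n sample items fall below tau_p."""
    # range(i, j) runs over k = i, ..., j-1, matching the limits of the sum.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(i, j))

# Small hand-checkable case: with n = 10 and p = 0.5, the terms k = 2..8
# leave out k = 0, 1, 9, 10, so the sum is (1024 - 22) / 1024.
print(coverage(10, 0.5, 2, 9))
```

For the median, the hand calculation is $1-\frac{1+10+10+1}{2^{10}}=\frac{1002}{1024}$, so $(Y_2,Y_9)$ would be roughly a 98% confidence interval for the median in a sample of size 10.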

________________________________________________________________________

Percentiles of annual rainfall

Using the above data, estimate the median, the lower quartile (25th percentile) and the upper quartile (the 75th percentile) of the annual rainfall in San Francisco. Then find a reasonably good confidence interval for each of the three population percentiles.

The sample size is $n=54$. The middle two data elements in the sample are $y_{27}=20.01$ and $y_{28}=20.54$. They are realizations of the order statistics $Y_{27}$ and $Y_{28}$. In this example, $\frac{27}{54+1}=0.491$ and $\frac{28}{54+1}=0.509$. Thus the order statistic $Y_{27}$ is expected to be greater than about 49% of the population and $Y_{28}$ is expected to be greater than about 51% of the population. Neither $Y_{27}$ nor $Y_{28}$ is an exact fit, so we take the average of the two as an estimate of the population median:

$\displaystyle \hat{\tau}_{0.5}=\frac{20.01+20.54}{2}=20.275$
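The averaging step can be written as a small helper. This is a sketch under the assumption that, whenever $p(n+1)$ is not a whole number, the estimate is the plain average of the two neighboring order statistics, as done for the median here. The name `estimate_percentile` is hypothetical.

```python
import math

# Annual rainfall in inches, already sorted (copied from the table above).
rainfall = [
    7.97, 11.06, 11.06, 12.32, 12.54,
    13.86, 13.87, 14.08, 14.32, 14.46,
    15.22, 15.39, 15.64, 16.33, 16.61,
    16.89, 17.43, 17.50, 17.65, 17.74,
    18.11, 18.26, 18.74, 18.79, 19.20,
    19.47, 20.01, 20.54, 20.80, 22.15,
    22.29, 22.47, 22.63, 23.49, 23.87,
    24.09, 24.49, 24.89, 24.89, 25.03,
    25.09, 26.66, 26.87, 27.76, 28.68,
    28.87, 29.41, 31.87, 34.02, 34.36,
    34.43, 37.10, 38.17, 47.22,
]

def estimate_percentile(sorted_data, p):
    """Estimate the 100p-th percentile from the order statistics near position p(n+1)."""
    n = len(sorted_data)
    pos = round(p * (n + 1), 9)  # rounding guards against floating-point noise
    lo, hi = math.floor(pos), math.ceil(pos)
    if lo < 1 or hi > n:
        raise ValueError("p is too extreme for this sample size")
    # If pos is a whole number, lo == hi and the estimate is Y_pos itself;
    # otherwise average the two neighboring order statistics.
    return (sorted_data[lo - 1] + sorted_data[hi - 1]) / 2

print(round(estimate_percentile(rainfall, 0.5), 3))
```

The same helper can be called with `p=0.25` or `p=0.75` to estimate the quartiles.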

Looking for confidence intervals, we consider the intervals $(Y_{21},Y_{34})$, $(Y_{20},Y_{35})$, $(Y_{19},Y_{36})$ and $(Y_{18},Y_{37})$. The following shows the confidence levels.

$\displaystyle P(Y_{21} < \tau_{0.5} < Y_{34})=\sum \limits_{k=21}^{33} \binom{54}{k} \ 0.5^k \ (0.5)^{54-k}=0.924095271$

$\displaystyle P(Y_{20} < \tau_{0.5} < Y_{35})=\sum \limits_{k=20}^{34} \binom{54}{k} \ 0.5^k \ (0.5)^{54-k}=0.959776436$

$\displaystyle P(Y_{19} < \tau_{0.5} < Y_{36})=\sum \limits_{k=19}^{35} \binom{54}{k} \ 0.5^k \ (0.5)^{54-k}=0.980165673$

$\displaystyle P(Y_{18} < \tau_{0.5} < Y_{37})=\sum \limits_{k=18}^{36} \binom{54}{k} \ 0.5^k \ (0.5)^{54-k}=0.99092666$

The above calculations were done in Excel, with the binomial probabilities computed using the function BINOM.DIST. So we have the following confidence intervals for the median annual San Francisco rainfall in inches.

Median

$\displaystyle \hat{\tau}_{0.5}=\frac{20.01+20.54}{2}=20.275$

$(Y_{21},Y_{34})$ = (18.11, 23.49) with approximately 92% confidence

$(Y_{20},Y_{35})$ = (17.74, 23.87) with approximately 96% confidence

$(Y_{19},Y_{36})$ = (17.65, 24.09) with approximately 98% confidence

$(Y_{18},Y_{37})$ = (17.50, 24.49) with approximately 99% confidence
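The confidence levels for the median intervals can also be reproduced outside Excel. The following is a minimal sketch using Python's standard library. The function name `coverage` is an illustrative choice, and the index pairs are the four intervals considered above.

```python
from math import comb

def coverage(n, p, i, j):
    """P(Y_i < tau_p < Y_j) as a binomial sum over k = i, ..., j-1."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(i, j))

n = 54
for i, j in [(21, 34), (20, 35), (19, 36), (18, 37)]:
    print(f"(Y_{i}, Y_{j}): confidence level {coverage(n, 0.5, i, j):.9f}")
```

Widening the interval by one order statistic on each side raises the confidence level, which is the trade-off visible in the list above.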

For the lower quartile and upper quartile, the following are the results. The reader is invited to confirm the calculation.

Lower quartile

$\displaystyle \hat{\tau}_{0.25}=15.985$, average of $Y_{13}$ and $Y_{14}$

$(Y_{7},Y_{20})$ = (13.87, 17.74) with approximately 96% confidence

$(Y_{6},Y_{21})$ = (13.86, 18.11) with approximately 98% confidence

$(Y_{5},Y_{22})$ = (12.54, 18.26) with approximately 99% confidence

Upper quartile

$\displaystyle \hat{\tau}_{0.75}=25.875$, average of $Y_{41}$ and $Y_{42}$

$(Y_{36},Y_{47})$ = (24.09, 29.41) with approximately 91% confidence

$(Y_{35},Y_{48})$ = (23.87, 31.87) with approximately 96% confidence

$(Y_{34},Y_{49})$ = (23.49, 34.02) with approximately 98% confidence

The following shows the calculation for two of the confidence intervals, one for $\tau_{0.25}$ and one for $\tau_{0.75}$.

$\displaystyle P(Y_{6} < \tau_{0.25} < Y_{21})=\sum \limits_{k=6}^{20} \binom{54}{k} \ 0.25^k \ (0.75)^{54-k}=0.979889918$

$\displaystyle P(Y_{34} < \tau_{0.75} < Y_{49})=\sum \limits_{k=34}^{48} \binom{54}{k} \ 0.75^k \ (0.25)^{54-k}=0.979889918$
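Readers who wish to confirm the quartile results can adapt the same binomial sum. The sketch below is one way to do so in Python. The function name `coverage` and the layout of the index pairs are illustrative assumptions.

```python
from math import comb

def coverage(n, p, i, j):
    """P(Y_i < tau_p < Y_j) as a binomial sum over k = i, ..., j-1."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(i, j))

n = 54
lower_pairs = [(7, 20), (6, 21), (5, 22)]     # intervals for the lower quartile
upper_pairs = [(36, 47), (35, 48), (34, 49)]  # intervals for the upper quartile

for p, pairs in ((0.25, lower_pairs), (0.75, upper_pairs)):
    for i, j in pairs:
        print(f"p = {p}: (Y_{i}, Y_{j}) has confidence level {coverage(n, p, i, j):.4f}")
```

The two displayed probabilities are equal because swapping success and failure maps the sum over $k=6,\ldots,20$ with $p=0.25$ onto the sum over $k=34,\ldots,48$ with $p=0.75$.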

________________________________________________________________________
$\copyright \ \text{2015 by Dan Ma}$