When estimating population percentiles, there is a distribution-free way to do it. Draw a random sample from the population of interest and take the middle element of the ordered sample as an estimate of the population median. Furthermore, we can attach a confidence interval to this estimate of the median without knowing (or assuming) a probability distribution for the underlying phenomenon. This “distribution free” method is shown in the post called Confidence intervals for percentiles. In this post, we give an additional example using annual rainfall data in San Francisco to illustrate this approach of non-parametric inference using order statistics.
________________________________________________________________________
San Francisco rainfall data
The following table shows the annual rainfall data in San Francisco (in inches) from 1960 to 2013 (data source). The table consists of 54 measurements and is sorted in increasing order from left to right (and from top to bottom). Each annual rainfall measurement is from July of that year to June of the following year. The driest year (7.97 inches) is 1975, the period from July 1975 to June 1976. The wettest year (47.22 inches) is 1997, which is the period from July 1997 to June 1998. The most recent data point is the fifth measurement, 12.54 inches (the period from July 2013 to June 2014).
Using the above data, estimate the median, the lower quartile (25th percentile) and the upper quartile (the 75th percentile) of the annual rainfall in San Francisco. Then find a reasonably good confidence interval for each of the three population percentiles.
________________________________________________________________________
Basic facts about order statistics
Let’s recall some basic facts from the following previous posts:
Let’s say we have a random sample $X_1, X_2, \cdots, X_n$ drawn from a population whose percentiles are unknown and we wish to estimate them. Rank the items of the random sample to obtain the order statistics $Y_1 < Y_2 < \cdots < Y_n$. In an ideal setting, the measurements are supposed to arise from a continuous distribution. So the chance of a tie among the $Y_j$ is zero. But this assumption may not hold on occasion. There are some ties in the San Francisco rainfall data (e.g. the second and third data points). The small number of ties will not affect the calculation performed below.
The reason that we can use the order statistics to estimate the population percentiles is that the expected percentage of the population below $Y_j$ is about the same as the percentage of the sample items less than $Y_j$. According to the explanation in the second post listed above (link), the order statistic $Y_j$ is expected to be above $100p$ percent of the population where $p=\frac{j}{n+1}$. In fact, the order statistics are expected to divide the population into roughly equal segments. More specifically, the expected percentage of the population in between $Y_{j-1}$ and $Y_j$ is $100h$ where $h=\frac{1}{n+1}$.
The above explanation justifies the use of the order statistic $Y_j$ as the sample $100p$th percentile where $p=\frac{j}{n+1}$.
The sample size is $n=54$ in the San Francisco rainfall data. Since $\frac{11}{55}=0.2$, the order statistic $Y_{11}$ is the sample 20th percentile and can be taken as an estimate of the population 20th percentile for the San Francisco annual rainfall. Here the realized value of $Y_{11}$ is 15.22 inches.
With $\frac{45}{55} \approx 0.82$, the order statistic $Y_{45}$ is the sample 82nd percentile and is taken as an estimate of the population 82nd percentile for the San Francisco annual rainfall. The realized value of $Y_{45}$ is 28.68 inches.
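The rank-to-percentile rule above is easy to check numerically. The following is a minimal Python sketch; the function names `expected_fraction_below` and `bracketing_ranks` are made up for this illustration and are not from any library. The second function inverts the rule: given a target $p$, it returns the ranks of the two order statistics whose $\frac{j}{n+1}$ values bracket $p$.

```python
import math

def expected_fraction_below(j, n):
    """Expected fraction of the population below the j-th order statistic
    in a sample of size n, using the j/(n+1) rule."""
    return j / (n + 1)

def bracketing_ranks(p, n):
    """Ranks of the order statistics that bracket the 100p-th percentile.

    The target position is p*(n+1). If it is not a whole number, the two
    neighboring ranks are returned; if it is, both ranks coincide.
    """
    position = p * (n + 1)
    return math.floor(position), math.ceil(position)

n = 54  # sample size of the rainfall data
print(round(expected_fraction_below(11, n), 2))  # 0.2
print(round(expected_fraction_below(45, n), 2))  # 0.82
print(bracketing_ranks(0.5, n))                  # (27, 28)
```

For the median with $n=54$, the target position is $0.5 \times 55 = 27.5$, so no single order statistic sits exactly at the 50% mark.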
The key for constructing a confidence interval for a percentile is to calculate the probability $P(Y_i < \tau_p < Y_j)$. This is the probability that the $100p$th percentile $\tau_p$, where $0<p<1$, is in between $Y_i$ and $Y_j$. Let's look at the median $\tau_{0.5}$. For $Y_i < \tau_{0.5}$ to happen, there must be at least $i$ sample items less than the median $\tau_{0.5}$. For $\tau_{0.5} < Y_j$ to happen, there can be at most $j-1$ sample items less than the median $\tau_{0.5}$. Thus in the $n$ random draws of the sample items, in order for the event $Y_i < \tau_{0.5} < Y_j$ to occur, there must be at least $i$ sample items and at most $j-1$ sample items that are less than $\tau_{0.5}$. In other words, in $n$ Bernoulli trials, there are at least $i$ and at most $j-1$ successes where the probability of success is 0.5. The following is the probability $P(Y_i < \tau_{0.5} < Y_j)$:
$$P(Y_i < \tau_{0.5} < Y_j)=\sum_{k=i}^{j-1} \binom{n}{k} \ 0.5^k \ 0.5^{n-k}=1-\alpha$$
Then the interval $(Y_i, Y_j)$ is taken to be the $100(1-\alpha)$% confidence interval for the unknown population median $\tau_{0.5}$. Note that this confidence interval is constructed without knowing (or assuming) anything about the underlying distribution of the population.
Consider the $100p$th percentile $\tau_p$ where $0<p<1$. In order for the event $Y_i < \tau_p < Y_j$ to occur, there must be at least $i$ sample items and at most $j-1$ sample items that are less than $\tau_p$. This is equivalent to $n$ Bernoulli trials resulting in at least $i$ successes and at most $j-1$ successes where the probability of success is $p$.
$$P(Y_i < \tau_p < Y_j)=\sum_{k=i}^{j-1} \binom{n}{k} \ p^k \ (1-p)^{n-k}=1-\alpha$$
Then the interval $(Y_i, Y_j)$ is taken to be the $100(1-\alpha)$% confidence interval for the unknown population percentile $\tau_p$. As mentioned earlier, this confidence interval does not rely on any information about the distribution of the population and is said to be distribution free. It only relies on a probability statement that involves the binomial distribution in describing the positioning of the sample items. In the past, people used the normal approximation to the binomial to estimate this probability. The normal approximation is no longer needed now that computing software is easily available. For example, Excel can compute binomial probabilities when the number of trials is a million or more.
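The binomial probability described above can also be computed directly in a few lines of Python. The following is a minimal sketch using only the standard library (`math.comb` requires Python 3.8 or later); the function name `coverage` is chosen here for illustration.

```python
from math import comb

def coverage(n, p, i, j):
    """P(Y_i < 100p-th percentile < Y_j) for a random sample of size n.

    This is the chance that a Binomial(n, p) count of "successes"
    (sample items falling below the percentile) lands in i, ..., j-1.
    """
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(i, j))

if __name__ == "__main__":
    # Median (p = 0.5) with n = 54, between the 21st and 34th order statistics.
    print(round(coverage(54, 0.5, 21, 34), 2))  # 0.92
```

Because the sum is exact (up to floating point rounding), no normal approximation is involved. For very large $n$ a library routine for the binomial distribution would likely be a better choice than this direct sum.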
________________________________________________________________________
Percentiles of annual rainfall
Using the above data, estimate the median, the lower quartile (25th percentile) and the upper quartile (the 75th percentile) of the annual rainfall in San Francisco. Then find a reasonably good confidence interval for each of the three population percentiles.
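The sample size is 54. The middle two data elements in the sample are the 27th and 28th values in the sorted table. They are realizations of the order statistics $Y_{27}$ and $Y_{28}$. So in this example, $\frac{27}{55} \approx 0.49$ and $\frac{28}{55} \approx 0.51$. Thus the order statistic $Y_{27}$ is expected to be greater than about 49% of the population and $Y_{28}$ is expected to be greater than about 51% of the population. So neither $Y_{27}$ nor $Y_{28}$ is an exact fit. So we take the average of the two as an estimate $\hat{\tau}_{0.5}$ of the population median:
$$\hat{\tau}_{0.5}=\frac{Y_{27}+Y_{28}}{2}$$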
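Looking for confidence intervals, we consider the intervals $(Y_{21}, Y_{34})$, $(Y_{20}, Y_{35})$, $(Y_{19}, Y_{36})$ and $(Y_{18}, Y_{37})$. The following shows the confidence levels, where $\tau_{0.5}$ denotes the population median.
$$P(Y_{21} < \tau_{0.5} < Y_{34})=\sum_{k=21}^{33} \binom{54}{k} \ 0.5^{54} \approx 0.92$$
$$P(Y_{20} < \tau_{0.5} < Y_{35})=\sum_{k=20}^{34} \binom{54}{k} \ 0.5^{54} \approx 0.96$$
$$P(Y_{19} < \tau_{0.5} < Y_{36})=\sum_{k=19}^{35} \binom{54}{k} \ 0.5^{54} \approx 0.98$$
$$P(Y_{18} < \tau_{0.5} < Y_{37})=\sum_{k=18}^{36} \binom{54}{k} \ 0.5^{54} \approx 0.99$$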
The above calculation is done in Excel. The binomial probabilities are done using the function BINOM.DIST. So we have the following confidence intervals for the median annual San Francisco rainfall in inches.
Median
$(Y_{21}, Y_{34})$ = (18.11, 23.49) with approximately 92% confidence
$(Y_{20}, Y_{35})$ = (17.74, 23.87) with approximately 96% confidence
$(Y_{19}, Y_{36})$ = (17.65, 24.09) with approximately 98% confidence
$(Y_{18}, Y_{37})$ = (17.50, 24.49) with approximately 99% confidence
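A list like the one above can be produced by widening the interval one rank at a time until the desired confidence is reached. Here is a minimal Python sketch of that search, under the assumption that only intervals of the symmetric form $(Y_i, Y_{n+1-i})$ are considered; the function name `symmetric_median_interval` is hypothetical.

```python
from math import comb

def symmetric_median_interval(n, level):
    """Narrowest interval (Y_i, Y_j) with j = n + 1 - i whose probability
    of containing the population median is at least `level`.

    Returns (i, j, coverage), or None if even (Y_1, Y_n) falls short.
    """
    # Binomial(n, 0.5) probabilities for 0, 1, ..., n items below the median.
    pmf = [comb(n, k) * 0.5**n for k in range(n + 1)]
    for i in range((n + 1) // 2, 0, -1):  # start at the middle, widen outward
        j = n + 1 - i
        cov = sum(pmf[i:j])               # P(i <= count <= j - 1)
        if cov >= level:
            return i, j, cov
    return None

i, j, cov = symmetric_median_interval(54, 0.95)
print(i, j, round(cov, 2))  # 20 35 0.96
```

Since the binomial distribution is discrete, the achieved confidence level usually overshoots the requested one. Asking for 95% with $n=54$ yields the ranks 20 and 35 with roughly 96% confidence.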
For the lower quartile and upper quartile, the following are the results. The reader is invited to confirm the calculation.
Lower quartile
Estimate $\hat{\tau}_{0.25}$: average of $Y_{13}$ and $Y_{14}$
$(Y_{7}, Y_{20})$ = (13.87, 17.74) with approximately 96% confidence
$(Y_{6}, Y_{21})$ = (13.86, 18.11) with approximately 98% confidence
$(Y_{5}, Y_{22})$ = (12.54, 18.26) with approximately 99% confidence
Upper quartile
Estimate $\hat{\tau}_{0.75}$: average of $Y_{41}$ and $Y_{42}$
$(Y_{36}, Y_{47})$ = (24.09, 29.41) with approximately 91% confidence
$(Y_{35}, Y_{48})$ = (23.87, 31.87) with approximately 96% confidence
$(Y_{34}, Y_{49})$ = (23.49, 34.02) with approximately 98% confidence
The following shows the calculation for two of the confidence intervals.
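For readers who want to confirm the quartile results outside a spreadsheet, here is a minimal Python sketch. It mirrors the spreadsheet approach of subtracting two cumulative binomial probabilities (what Excel's BINOM.DIST returns when its cumulative argument is TRUE). The helper names `binom_cdf` and `level` are made up for this illustration, and the ranks used (7 and 20 for the lower quartile, 35 and 48 for the upper quartile) are an assumption inferred from the stated confidence levels and the sorted data values.

```python
from math import comb

def binom_cdf(x, n, p):
    """P(B <= x) for B ~ Binomial(n, p), by direct summation."""
    if x < 0:
        return 0.0
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(min(x, n) + 1))

def level(n, p, i, j):
    """P(Y_i < 100p-th percentile < Y_j) = P(B <= j - 1) - P(B <= i - 1)."""
    return binom_cdf(j - 1, n, p) - binom_cdf(i - 1, n, p)

n = 54
checks = {
    "lower quartile, ranks 7 and 20": level(n, 0.25, 7, 20),
    "upper quartile, ranks 35 and 48": level(n, 0.75, 35, 48),
}
for name, value in checks.items():
    print(f"{name}: {value:.2f}")
```

Both calls give about 0.96. The two values agree because an interval for the upper quartile with ranks $(a, b)$ has the same binomial probability as an interval for the lower quartile with ranks $(n+1-b, n+1-a)$.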
________________________________________________________________________