The order statistics play an important role in both descriptive statistics and non-parametric inference. Sample percentiles (median, quartiles, etc.) can be defined using the order statistics and can be used as point estimates for the corresponding percentiles in the population. For example, with a random sample of size $n$, the order statistic $Y_i$ is the sample $100p$th percentile, where $p = \frac{i}{n+1}$, and is an estimate of the unknown population percentile $\tau_p$. The justification is that the area under the density curve of the distribution and to the left of $Y_i$ is on average $\frac{i}{n+1}$ (see the discussion below). The order statistics can also be used for constructing confidence intervals for unknown population percentiles. Such confidence intervals are often called distribution-free confidence intervals because no information about the underlying distribution is used in the construction. In the previous post (The order statistics and the uniform distribution), an example was given demonstrating how confidence intervals for percentiles of a continuous distribution are constructed. In this post, we describe the general algorithm in greater detail and present another example. For more information about distribution-free inference, see [Hollander & Wolfe].
Let $X_1, X_2, \ldots, X_n$ be a random sample drawn from a continuous distribution with $X$, $F(x)$ and $f(x)$ denoting the common random variable, the common distribution function and probability density function, respectively. Let $Y_1 < Y_2 < \cdots < Y_n$ be the associated order statistics. Let $W_i = F(Y_i)$. Note that $W_i$ can be interpreted as an area under the density curve:
$$W_i = F(Y_i) = \int_{-\infty}^{Y_i} f(x) \, dx$$
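In the previous post (The order statistics and the uniform distribution), we showed that $E[W_i] = \frac{i}{n+1}$, where $W_i = F(Y_i)$ is the area under the density curve to the left of the order statistic $Y_i$. On this basis, $Y_i$ is defined as the sample $100p$th percentile where $p = \frac{i}{n+1}$ and is used as an estimate for the unknown population $100p$th percentile.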
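The claim that the area to the left of the $i$th order statistic averages $\frac{i}{n+1}$ can be checked numerically. The following is a minimal simulation sketch, not part of the original derivation. The exponential population with rate 1, the function name `mean_area_left_of_order_stat`, the trial count and the seed are all assumptions chosen for illustration; any continuous distribution should give the same average.

```python
import math
import random


def mean_area_left_of_order_stat(n, i, trials=20000, seed=0):
    """Estimate E[F(Y_i)] by simulation.

    Assumption: the population is exponential with rate 1, so that
    F(y) = 1 - exp(-y). The result should not depend on this choice.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Draw a sample of size n and sort it to get the order statistics.
        sample = sorted(rng.expovariate(1.0) for _ in range(n))
        y_i = sample[i - 1]              # i-th order statistic (1-indexed)
        total += 1.0 - math.exp(-y_i)    # area under the density left of Y_i
    return total / trials


if __name__ == "__main__":
    n, i = 9, 3
    print("simulated:", mean_area_left_of_order_stat(n, i))
    print("i/(n+1):  ", i / (n + 1))
```

With $n = 9$ and $i = 3$, the simulated average should be close to $3/10$.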
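The construction of confidence intervals for percentiles is based on the probability $P[Y_i < \tau_p < Y_j]$ where $\tau_p$ is the $100p$th percentile. Let's take the median as an example and consider $P[Y_2 < \tau_{0.5} < Y_8]$ for a sample of size $n$. For $Y_2 < \tau_{0.5}$ to happen, there must be at least 2 sample items that are less than $\tau_{0.5}$. For $\tau_{0.5} < Y_8$ to happen, there can be no more than 7 sample items that are less than $\tau_{0.5}$. In drawing each sample item, consider $X < \tau_{0.5}$ as a success. The probability of a success is thus $p = 0.5$. We are interested in the probability of having at least 2 and at most 7 successes. Thus we have:
$$P[Y_2 < \tau_{0.5} < Y_8] = \sum_{k=2}^{7} \binom{n}{k} (0.5)^k (0.5)^{n-k}$$
Then the interval $(Y_2, Y_8)$ is taken to be a confidence interval for the unknown population median, with confidence level equal to the above probability.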
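The binomial sum for the median example can be evaluated directly. The sketch below is only an illustration: the sample size `n = 10` is an assumed value, and the function name `median_interval_confidence` is hypothetical.

```python
from math import comb


def median_interval_confidence(n, i, j):
    """P[Y_i < median < Y_j] for a continuous population.

    This is the probability of at least i and at most j - 1 successes
    in n independent trials with success probability 0.5.
    """
    return sum(comb(n, k) for k in range(i, j)) / 2**n


if __name__ == "__main__":
    n = 10  # assumed sample size, for illustration only
    level = median_interval_confidence(n, 2, 8)
    print(f"P[Y_2 < median < Y_8] with n = {n}: {level:.4f}")
```

Under the assumed $n = 10$, the sum is $957/1024$, so $(Y_2, Y_8)$ would be roughly a 93% confidence interval for the median.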
The above discussion can easily be generalized. The following computes the probability $P[Y_i < \tau_p < Y_j]$ where $\tau_p$ is the $100p$th percentile and $0 < p < 1$.
$$P[Y_i < \tau_p < Y_j] = \sum_{k=i}^{j-1} \binom{n}{k} p^k (1-p)^{n-k} = 1 - \alpha$$
Then the interval $(Y_i, Y_j)$ is taken to be the $100(1-\alpha)$% confidence interval for the unknown population percentile $\tau_p$. The above probability is based on the binomial distribution with parameters $n$ and $p$. Its mean is $np$ and its variance is $np(1-p)$. This fact becomes useful when using the normal approximation of the above probability.
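The general probability and its normal approximation can be sketched as follows. The function names are hypothetical, and the continuity correction of 0.5 is a common choice assumed here rather than something stated in this post.

```python
from math import comb, erf, sqrt


def exact_confidence(n, i, j, p):
    """P[Y_i < tau_p < Y_j]: between i and j - 1 successes in n trials."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(i, j))


def normal_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def approx_confidence(n, i, j, p):
    """Normal approximation with mean np and variance np(1 - p).

    Assumption: a continuity correction of 0.5 is applied, so that
    P(i <= B <= j - 1) is approximated by P(i - 0.5 < Z' < j - 0.5).
    """
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    upper = normal_cdf((j - 0.5 - mean) / sd)
    lower = normal_cdf((i - 0.5 - mean) / sd)
    return upper - lower


if __name__ == "__main__":
    # Assumed values for illustration: n = 30, median, interval (Y_10, Y_21).
    n, i, j, p = 30, 10, 21, 0.5
    print("exact :", exact_confidence(n, i, j, p))
    print("normal:", approx_confidence(n, i, j, p))
```

For moderately large $n$ the two values should agree to about two decimal places, which is why the mean and variance above are convenient.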
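Note that the wider the interval estimate, the more confidence can be attached. On the other hand, the more precise the interval estimate, the less confidence can be attached to the interval. This is true for parametric methods and is also true for the non-parametric method at hand. Though this is clear, we would like to call this out for the sake of completeness. For example, as confidence intervals for the median, the interval $(Y_i, Y_j)$ has a higher confidence level than the narrower interval $(Y_{i+1}, Y_{j-1})$ nested inside it (with $i + 1 < j - 1$). Note that of the two probabilities below, the first one is higher because its sum contains every term of the second.
$$P[Y_i < \tau_{0.5} < Y_j] = \sum_{k=i}^{j-1} \binom{n}{k} (0.5)^n$$
$$P[Y_{i+1} < \tau_{0.5} < Y_{j-1}] = \sum_{k=i+1}^{j-2} \binom{n}{k} (0.5)^n$$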
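The trade-off between width and confidence can be seen by comparing a pair of nested intervals numerically. In this sketch the sample size `n = 10` and the index pairs are assumed values for illustration, and `confidence` is a hypothetical helper name.

```python
from math import comb


def confidence(n, i, j, p=0.5):
    """P[Y_i < tau_p < Y_j] from the binomial distribution with parameters n and p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(i, j))


if __name__ == "__main__":
    n = 10  # assumed sample size
    wide = confidence(n, 2, 9)    # interval (Y_2, Y_9)
    narrow = confidence(n, 3, 8)  # interval (Y_3, Y_8), nested inside the first
    print(f"(Y_2, Y_9): {wide:.4f}")
    print(f"(Y_3, Y_8): {narrow:.4f}")
```

The wider interval sums over more binomial terms, so its confidence level cannot be lower than that of the nested one.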
The following matrix contains a random sample of grocery purchase amounts of a certain family in 2009. The data are arranged in increasing order on each row from left to right.
Find the sample median and the sample upper quartile. Construct an approximate 96% confidence interval for the population median.
The sample median , the grocery purchase. The upper quartile (75th percentile) is the grocery purchase 101.81.
To construct a confidence interval for the median, we need to compute the probability $P[Y_i < \tau_{0.5} < Y_j]$ for candidate indices $i < j$. We use the interval because of the following probability:
Thus the interval is an approximate 96% confidence interval for the median grocery purchase amount for this family. The above calculation is made using an Excel spreadsheet. Let's compare this answer with a normal approximation. With $n$ denoting the sample size and $p = 0.5$, the mean of the binomial distribution in question is $np = n/2$ and the variance is $np(1-p) = n/4$. Consider the following:
The normal approximation is quite good.
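The search for a suitable pair of order statistics, done above in a spreadsheet, can also be sketched in code. The version below assumes a symmetric interval $(Y_i, Y_{n+1-i})$ and returns the narrowest one whose exact binomial confidence level is at least the target. The symmetric restriction, the function name `median_interval_indices` and the sample size used in the demonstration are assumptions for illustration and are not taken from the grocery data.

```python
from math import comb


def median_interval_indices(n, target=0.96):
    """Find the narrowest symmetric interval (Y_i, Y_j), j = n + 1 - i,
    whose confidence level for the median is at least `target`.

    Returns (i, j, level), or None if even (Y_1, Y_n) falls short.
    """
    best = None
    for i in range(1, n // 2 + 1):
        j = n + 1 - i
        # Probability of at least i and at most j - 1 successes, p = 0.5.
        level = sum(comb(n, k) for k in range(i, j)) / 2**n
        if level >= target:
            best = (i, j, level)  # still meets the target, so keep narrowing
        else:
            break                 # the level only decreases as i grows
    return best


if __name__ == "__main__":
    n = 10  # assumed sample size, not the size of the grocery sample
    print(median_interval_indices(n, target=0.96))
```

Because the binomial distribution is discrete, the attained level is usually somewhat above the target rather than exactly equal to it, which is why the interval in the example is described as approximate.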
[Hollander & Wolfe] Myles Hollander and Douglas A. Wolfe, Nonparametric Statistical Methods, Second Edition, Wiley (1999)