Confidence intervals for percentiles

Order statistics play an important role in both descriptive statistics and nonparametric inference. Sample percentiles (median, quartiles, etc.) can be defined using the order statistics and used as point estimates of the corresponding population percentiles. For example, with a random sample of size n=24, the 6^{th} order statistic Y_6 is the sample 24^{th} percentile and is an estimate of the unknown population 24^{th} percentile \tau_{0.24}. The justification is that the area under the density curve of the distribution to the left of Y_6 is, on average, \frac{6}{24+1}=0.24 (see the discussion below). The order statistics can also be used to construct confidence intervals for unknown population percentiles. Such confidence intervals are often called distribution-free confidence intervals because no information about the underlying distribution is used in the construction. In the previous post (The order statistics and the uniform distribution), an example was given demonstrating how confidence intervals for percentiles of a continuous distribution are constructed. In this post, we describe the general algorithm in greater detail and present another example. For more information about distribution-free inference, see [Hollander & Wolfe].

Let X_1,X_2, \cdots, X_n be a random sample drawn from a continuous distribution with X, F(x) and f(x) denoting the common random variable, the common distribution function and probability density function, respectively. Let Y_1<Y_2< \cdots <Y_n be the associated order statistics. Let W_i=F(Y_i). Note that F(Y_i) can be interpreted as an area under the density curve:

    \displaystyle W_i=F(Y_i)=\int_{-\infty}^{Y_i}f(x) dx

In the previous post (The order statistics and the uniform distribution), we showed that \displaystyle E[W_i]=\frac{i}{n+1}. On this basis, Y_i is defined as the sample (100p)^{th} percentile where \displaystyle p=\frac{i}{n+1} and is used as an estimate for the unknown population (100p)^{th} percentile.
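The identity E[W_i]=\frac{i}{n+1} can be checked numerically. The following Python sketch is not part of the original derivation. It assumes an exponential population with rate 1 (so that F(x)=1-e^{-x}), uses n=24 and i=6 from the opening example, and picks the number of trials and the seed purely for illustration. The function name is hypothetical.

```python
import math
import random

def mean_area_left_of_order_stat(n, i, trials=20000, seed=0):
    """Estimate E[F(Y_i)] by simulation for an Exponential(1) population.

    The population, the trial count and the seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = sorted(rng.expovariate(1.0) for _ in range(n))
        y_i = sample[i - 1]              # i-th order statistic (1-based index)
        total += 1.0 - math.exp(-y_i)    # F(y) = 1 - exp(-y) for Exponential(1)
    return total / trials

estimate = mean_area_left_of_order_stat(n=24, i=6)
print(estimate, 6 / (24 + 1))   # the two values should be close
```

Swapping in another continuous population (together with its own F) should leave the estimate near \frac{i}{n+1}, which is the sense in which the method is distribution-free.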

The construction of confidence intervals for percentiles is based on the probability P[Y_i < \tau_p < Y_j] where \tau_p is the (100p)^{th} percentile. Let’s take the median as an example and consider P[Y_2 < \tau_{0.5} < Y_8]. For Y_2 < \tau_{0.5} to happen, there must be at least 2 sample items X_k that are less than \tau_{0.5}. For \tau_{0.5} < Y_8 to happen, there can be no more than 7 sample items X_k that are less than \tau_{0.5} (if 8 or more were, then Y_8 itself would be less than \tau_{0.5}). In drawing each sample item, consider X < \tau_{0.5} as a success. The probability of a success is thus p=P[X<\tau_{0.5}]=0.5. We are interested in the probability of having at least 2 and at most 7 successes in n trials. Thus we have:

    \displaystyle P[Y_2 < \tau_{0.5} < Y_8]=\sum \limits_{k=2}^{7} \binom{n}{k} \biggl(\frac{1}{2}\biggr)^k \biggl(\frac{1}{2}\biggr)^{n-k}=1-\alpha

Then the interval (Y_2,Y_8) is taken to be the 100(1-\alpha) % confidence interval for the unknown population median.

The above discussion can easily be generalized. The following computes the probability P[Y_i < \tau_p < Y_j] where \tau_p is the (100p)^{th} percentile and p=P[X < \tau_p].

    \displaystyle P[Y_i < \tau_{p} < Y_j]=\sum \limits_{k=i}^{j-1} \binom{n}{k} p^k (1-p)^{n-k}=1-\alpha

Then the interval (Y_i,Y_j) is taken to be the 100(1-\alpha) % confidence interval for the unknown population percentile \tau_p. The above probability is based on the binomial distribution with parameters n and p=P(X<\tau_p). Its mean is np and its variance is np(1-p). This fact becomes useful when using a normal approximation to the above probability.
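The binomial sum for P[Y_i < \tau_p < Y_j] is easy to evaluate directly. The following is a minimal Python sketch. The function name coverage and the sample size n=10 in the usage line are assumptions made for illustration, since the discussion of (Y_2,Y_8) above leaves n unspecified.

```python
from math import comb

def coverage(n, i, j, p):
    """Return P[Y_i < tau_p < Y_j] for a sample of size n.

    This is the binomial sum over k = i, ..., j-1 given in the text.
    """
    if not (1 <= i < j <= n):
        raise ValueError("need 1 <= i < j <= n")
    if not (0.0 < p < 1.0):
        raise ValueError("need 0 < p < 1")
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(i, j))

# Hypothetical sample size n = 10 for the interval (Y_2, Y_8) around the median.
conf_level = coverage(n=10, i=2, j=8, p=0.5)
print(conf_level)   # 957/1024 = 0.9345703125
```

Note that range(i, j) stops at j-1, which matches the upper summation limit: at most j-1 sample items may fall below \tau_p.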

Note that the wider the interval estimate, the more confidence can be attached to it. Conversely, the narrower (more precise) the interval estimate, the less confidence can be attached to it. This is true for parametric methods and is also true for the non-parametric method at hand. Though this is clear, we call it out for the sake of completeness. For example, for a fixed sample size n, the interval (Y_2,Y_{15}) has a higher confidence level for the median than the interval (Y_6,Y_{10}), because the second sum below contains only a subset of the terms in the first.

    \displaystyle P[Y_2 < \tau_{0.5} < Y_{15}]=\sum \limits_{k=2}^{14} \binom{n}{k} \biggl(\frac{1}{2}\biggr)^k \biggl(\frac{1}{2}\biggr)^{n-k}

    \displaystyle P[Y_6 < \tau_{0.5} < Y_{10}]=\sum \limits_{k=6}^{9} \binom{n}{k} \biggl(\frac{1}{2}\biggr)^k \biggl(\frac{1}{2}\biggr)^{n-k}
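To make the comparison concrete, here is a short Python sketch. The text does not fix n for these two intervals, so n=15 is an assumption (any n of at least 15 would do, so that Y_{15} exists). The helper name median_coverage is hypothetical.

```python
from math import comb

def median_coverage(n, i, j):
    """P[Y_i < tau_0.5 < Y_j]: binomial sum over k = i, ..., j-1 with p = 1/2."""
    return sum(comb(n, k) for k in range(i, j)) / 2**n

n = 15  # assumed sample size; the text does not fix n for these two intervals

wide = median_coverage(n, 2, 15)     # interval (Y_2, Y_15)
narrow = median_coverage(n, 6, 10)   # interval (Y_6, Y_10)
print(f"(Y_2, Y_15): {wide:.4f}   (Y_6, Y_10): {narrow:.4f}")

# Nested symmetric intervals (Y_i, Y_{n+1-i}) lose confidence as they shrink.
for i in range(1, (n + 1) // 2):
    j = n + 1 - i
    print(f"(Y_{i}, Y_{j}): {median_coverage(n, i, j):.4f}")
```

The loop lists the trade-off in one place: each step inward drops two terms from the sum, so the confidence level can only decrease.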

Example
The following matrix contains a random sample of n=15 grocery purchase amounts for a certain family in 2009. The data are arranged in increasing order, reading each row from left to right.

    \displaystyle \begin{pmatrix} 3.34&14.70&45.71&47.69&48.25 \\ 52.22&57.25&60.79&63.87&66.85 \\ 88.13&101.81&147.33&165.10&168.28 \end{pmatrix}

Find the sample median and the sample upper quartile. Construct an approximate 96% confidence interval for the population median.

The sample median is \hat{\tau}_{0.5}=60.79, the 8^{th} order statistic, since \frac{8}{15+1}=0.5. The sample upper quartile (75^{th} percentile) is the 12^{th} order statistic, 101.81, since \frac{12}{15+1}=0.75.
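The rule i=p(n+1) for locating a sample percentile can be sketched in Python as follows. The function name sample_percentile is hypothetical. This post does not say what to do when p(n+1) is not an integer, so the sketch simply raises an error in that case rather than guessing at an interpolation rule.

```python
purchases = [
    3.34, 14.70, 45.71, 47.69, 48.25,
    52.22, 57.25, 60.79, 63.87, 66.85,
    88.13, 101.81, 147.33, 165.10, 168.28,
]

def sample_percentile(data, p):
    """Return Y_i with i = p(n+1), the sample (100p)th percentile.

    Raises ValueError when p(n+1) is not an integer between 1 and n;
    no interpolation is attempted (that case is an open assumption).
    """
    ordered = sorted(data)
    n = len(ordered)
    position = p * (n + 1)
    i = round(position)
    if abs(position - i) > 1e-9 or not (1 <= i <= n):
        raise ValueError(f"p(n+1) = {position} is not an order-statistic index")
    return ordered[i - 1]   # lists are 0-based, order statistics are 1-based

median = sample_percentile(purchases, 0.5)
upper_quartile = sample_percentile(purchases, 0.75)
print(median, upper_quartile)   # 60.79 101.81
```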

To construct a confidence interval for the median, we need to compute the probability P[Y_i< \tau_{0.5} < Y_j]. We use the interval (Y_{4},Y_{12}) because of the following probability:

    \displaystyle P[Y_{4} < \tau_{0.5} < Y_{12}]=\sum \limits_{k=4}^{11} \binom{15}{k} \biggl(\frac{1}{2}\biggr)^k \biggl(\frac{1}{2}\biggr)^{15-k}=0.96484375

Thus the interval (47.69,101.81) is an approximate 96% confidence interval for the median grocery purchase amount for this family. The above calculation was made using an Excel spreadsheet. Let's compare this answer with a normal approximation. The mean of the binomial distribution in question is 15(0.5)=7.5 and the variance is 15(0.5)(0.5)=3.75. Because the binomial sum runs over the integers k=4 to k=11, we apply a continuity correction of 0.5 and take the area under the normal curve from 3.5 to 11.5. Consider the following:

    \displaystyle \Phi \biggl(\frac{11.5-7.5}{\sqrt{3.75}}\biggr)-\Phi \biggl(\frac{3.5-7.5}{\sqrt{3.75}}\biggr)

    \displaystyle =\Phi \biggl(2.07\biggr)-\Phi \biggl(-2.07\biggr)=0.9808-0.0192=0.9616

The normal approximation is quite good: 0.9616 compared with the exact value 0.9648.
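The comparison between the exact binomial probability and the normal approximation can be reproduced with the Python sketch below. The function names are hypothetical, and the standard normal distribution function \Phi is computed from math.erf. The limits 3.5 and 11.5 come from a continuity correction: the sum runs over the integers k=i to k=j-1, so the normal area is taken from i-0.5 to (j-1)+0.5.

```python
from math import comb, erf, sqrt

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def exact_coverage(n, i, j, p):
    """Exact P[Y_i < tau_p < Y_j] from the binomial sum over k = i, ..., j-1."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(i, j))

def normal_coverage(n, i, j, p):
    """Normal approximation with a continuity correction of 0.5 on each limit."""
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    upper = (j - 1 + 0.5 - mean) / sd   # largest k in the sum is j - 1
    lower = (i - 0.5 - mean) / sd       # smallest k in the sum is i
    return phi(upper) - phi(lower)

exact = exact_coverage(15, 4, 12, 0.5)
approx = normal_coverage(15, 4, 12, 0.5)
print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
```

Because z is not rounded to 2.07 here, the printed approximation may differ slightly from 0.9616 in the later decimal places.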

Reference
Myles Hollander and Douglas A. Wolfe, Nonparametric Statistical Methods, Second Edition, Wiley (1999)

________________________________________________________________________
\copyright \ \text{2010 to 2015 by Dan Ma}


2 thoughts on “Confidence intervals for percentiles”

  1. I see that the order statistic is bracketed by Y4 and Y12.

    How does one know to use 11.5 and 3.5 when constructing the normal approximation?

    Thanks,

    Richard thornton

    thornton dot richard at gmail dot com

  2. I really like the simple calculations presented here. What I find puzzling is that I cannot find a reference for the exact CI coverage calculations. There are many books that cover the asymptotic approximation. Even Hollander (1999) referenced here does not show this method. Any idea for a reference? I would like to use this method but have no reference for it.

    The other question I have is: how do you choose the bounds for percentiles other than the median? There are potentially multiple intervals with at least 95% coverage. What I did was determine the upper bound for the two-sided 95% confidence interval by calculating the lower one-sided 97.5% confidence interval. The lower bound would correspond to the bound of the upper one-sided 97.5% confidence interval.
