# Confidence intervals for percentiles

The order statistics play an important role in both descriptive statistics and nonparametric inference. Sample percentiles (median, quartiles, etc.) can be defined using the order statistics and can be used as point estimates for the corresponding percentiles in the population. For example, with a random sample of size $n=24$, the $6^{th}$ order statistic $Y_6$ is the sample $24^{th}$ percentile and is an estimate of the unknown population $24^{th}$ percentile $\tau_{0.24}$. The justification is that the area under the density curve of the distribution and to the left of $Y_6$ is, on average, $\frac{6}{24+1}=0.24$ (see the discussion below). The order statistics can also be used for constructing confidence intervals for unknown population percentiles. Such confidence intervals are often called distribution-free confidence intervals because no information about the underlying distribution, other than continuity, is used in the construction. In the previous post (The order statistics and the uniform distribution), an example was given demonstrating how confidence intervals for percentiles of a continuous distribution are constructed. In this post, we describe the general algorithm in greater detail and present another example. For more information about distribution-free inference, see [Hollander & Wolfe].

Let $X_1,X_2, \cdots, X_n$ be a random sample drawn from a continuous distribution with $X$, $F(x)$ and $f(x)$ denoting the common random variable, the common distribution function and the probability density function, respectively. Let $Y_1 < Y_2 < \cdots < Y_n$ be the associated order statistics. Let $W_i=F(Y_i)$. Note that $F(Y_i)$ can be interpreted as an area under the density curve:

$\displaystyle W_i=F(Y_i)=\int_{-\infty}^{Y_i}f(x) dx$

In the previous post (The order statistics and the uniform distribution), we showed that $\displaystyle E[W_i]=\frac{i}{n+1}$. On this basis, $Y_i$ is defined as the sample $(100p)^{th}$ percentile where $\displaystyle p=\frac{i}{n+1}$ and is used as an estimate for the unknown population $(100p)^{th}$ percentile.
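The identity $E[W_i]=\frac{i}{n+1}$ can be checked numerically. The following is a minimal simulation sketch, not part of the original derivation. It assumes an exponential distribution with rate 1 (so that $F(y)=1-e^{-y}$) purely for illustration; the function name, the number of trials and the seed are hypothetical choices.

```python
import math
import random


def mean_area_left_of_order_stat(n, i, trials=20000, seed=0):
    """Estimate E[F(Y_i)] by simulation for an Exp(1) population (an assumed example)."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    total = 0.0
    for _ in range(trials):
        sample = sorted(rng.expovariate(1.0) for _ in range(n))
        y_i = sample[i - 1]              # i-th order statistic (1-indexed)
        total += 1.0 - math.exp(-y_i)    # W_i = F(Y_i) for the Exp(1) distribution
    return total / trials


estimate = mean_area_left_of_order_stat(n=24, i=6)
print(estimate, 6 / (24 + 1))  # the two numbers should be close
```

Because the result is distribution-free, swapping in any other continuous distribution (with its own $F$) should give an average close to the same value.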

The construction of confidence intervals for percentiles is based on the probability $P[Y_i < \tau_p < Y_j]$ where $\tau_p$ is the $(100p)^{th}$ percentile. Let's take the median as an example and consider $P[Y_2 < \tau_{0.5} < Y_8]$. For $Y_2 < \tau_{0.5}$ to happen, there must be at least 2 sample items $X_k$ that are less than $\tau_{0.5}$. For $\tau_{0.5} < Y_8$ to happen, there can be at most 7 sample items $X_k$ that are less than $\tau_{0.5}$ (if 8 or more were, $Y_8$ itself would fall below $\tau_{0.5}$). In drawing each sample item, consider $X < \tau_{0.5}$ as a success. The probability of a success is thus $p=P[X<\tau_{0.5}]=0.5$. We are interested in the probability of having at least 2 and at most 7 successes in $n$ trials. Thus we have:

$\displaystyle P[Y_2 < \tau_{0.5} < Y_8]=\sum \limits_{k=2}^{7} \binom{n}{k} \biggl(\frac{1}{2}\biggr)^k \biggl(\frac{1}{2}\biggr)^{n-k}=1-\alpha$

Then the interval $(Y_2,Y_8)$ is taken to be the $100(1-\alpha)$ % confidence interval for the unknown population median.

The above discussion can easily be generalized. The following computes the probability $P[Y_i < \tau_p < Y_j]$ where $\tau_p$ is the $(100p)^{th}$ percentile and $p=P[X < \tau_p]$.

$\displaystyle P[Y_i < \tau_{p} < Y_j]=\sum \limits_{k=i}^{j-1} \binom{n}{k} p^k (1-p)^{n-k}=1-\alpha$

Then the interval $(Y_i,Y_j)$ is taken to be the $100(1-\alpha)$ % confidence interval for the unknown population percentile $\tau_p$. The above probability is based on the binomial distribution with parameters $n$ and $p=P(X<\tau_p)$. Its mean is $np$ and its variance is $np(1-p)$. These two quantities are what is needed when the above probability is approximated by a normal distribution.
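The binomial sum above is straightforward to evaluate directly. The following is a small sketch of that calculation; the function name `coverage` is an illustrative choice, and the sample call uses $n=15$, $p=0.5$, $i=4$, $j=12$, the values from the example later in this post.

```python
from math import comb


def coverage(n, p, i, j):
    """P[Y_i < tau_p < Y_j] = sum_{k=i}^{j-1} C(n,k) p^k (1-p)^(n-k)."""
    if not 1 <= i < j <= n:
        raise ValueError("need 1 <= i < j <= n")
    if not 0.0 < p < 1.0:
        raise ValueError("need 0 < p < 1")
    # k counts the sample items that fall below tau_p
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(i, j))


print(coverage(15, 0.5, 4, 12))  # 0.96484375
```

The sum stops at $j-1$ because $\tau_p < Y_j$ requires fewer than $j$ sample items below $\tau_p$.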

Note that the wider the interval estimate, the more confidence can be attached to it. Conversely, the more precise (narrower) the interval estimate, the less confidence can be attached to it. This is true for parametric methods and is also true for the nonparametric method at hand. Though this is clear, we call it out for the sake of completeness. For example, as a confidence interval for the median, $(Y_2,Y_{15})$ has a higher confidence level than the interval $(Y_6,Y_{10})$, because the second sum below contains only a subset of the terms of the first. Of the two probabilities below, the first one is higher.

$\displaystyle P[Y_2 < \tau_{0.5} < Y_{15}]=\sum \limits_{k=2}^{14} \binom{n}{k} \biggl(\frac{1}{2}\biggr)^k \biggl(\frac{1}{2}\biggr)^{n-k}$

$\displaystyle P[Y_6 < \tau_{0.5} < Y_{10}]=\sum \limits_{k=6}^{9} \binom{n}{k} \biggl(\frac{1}{2}\biggr)^k \biggl(\frac{1}{2}\biggr)^{n-k}$
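To put numbers on this comparison, the sketch below evaluates both sums. The text leaves $n$ unspecified, so $n=15$ is assumed here only for illustration (it is the smallest sample size for which $Y_{15}$ exists); the helper name `median_coverage` is likewise hypothetical.

```python
from math import comb


def median_coverage(n, i, j):
    """P[Y_i < tau_0.5 < Y_j]: binomial sum with success probability 1/2."""
    return sum(comb(n, k) for k in range(i, j)) / 2**n


n = 15  # assumed sample size, not given in the text
wide = median_coverage(n, 2, 15)    # interval (Y_2, Y_15)
narrow = median_coverage(n, 6, 10)  # interval (Y_6, Y_10)
print(wide, narrow)  # the wider interval has the larger coverage
```

Under this assumption the wide interval covers the median with probability above 0.999, while the narrow one drops to roughly 0.70.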

Example
The following matrix contains a random sample of $n=15$ grocery purchase amounts of a certain family in 2009. The data are arranged in increasing order, reading each row from left to right.

$\displaystyle \begin{pmatrix} 3.34&14.70&45.71&47.69&48.25 \\{52.22}&57.25&60.79&63.87&66.85 \\{88.13}&101.81&147.33&165.10&168.28 \end{pmatrix}$

Find the sample median and the sample upper quartile. Construct an approximate 96% confidence interval for the population median.

The sample median is $\hat{\tau}_{0.5}=60.79$, the $8^{th}$ ordered grocery purchase, since $\frac{8}{15+1}=0.5$. The upper quartile ($75^{th}$ percentile) is the $12^{th}$ ordered grocery purchase, 101.81, since $\frac{12}{15+1}=0.75$.
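The lookup of these two sample percentiles can be sketched in a few lines, using the convention $p=\frac{i}{n+1}$ from earlier in the post. The function name is illustrative. The sketch handles only the case where $p(n+1)$ is a whole number, since the post does not describe how to interpolate between order statistics.

```python
purchases = [
    3.34, 14.70, 45.71, 47.69, 48.25,
    52.22, 57.25, 60.79, 63.87, 66.85,
    88.13, 101.81, 147.33, 165.10, 168.28,
]  # already sorted in increasing order


def sample_percentile(sorted_data, p):
    """Return Y_i where i = p(n+1); only whole-number i is supported here."""
    n = len(sorted_data)
    i = p * (n + 1)
    if abs(i - round(i)) > 1e-9 or not 1 <= round(i) <= n:
        raise ValueError("p(n+1) must be a whole number between 1 and n")
    return sorted_data[round(i) - 1]  # convert 1-indexed rank to list index


print(sample_percentile(purchases, 0.50))  # 60.79
print(sample_percentile(purchases, 0.75))  # 101.81
```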

To construct a confidence interval for the median, we need to compute the probability $P[Y_i< \tau_{0.5} < Y_j]$. We use the interval $(Y_{4},Y_{12})$ because of the following probability:

$\displaystyle P[Y_{4} < \tau_{0.5} < Y_{12}]=\sum \limits_{k=4}^{11} \binom{15}{k} \biggl(\frac{1}{2}\biggr)^k \biggl(\frac{1}{2}\biggr)^{15-k}=0.96484375$

Thus the interval $(47.69,101.81)$ is an approximate 96% confidence interval for the median grocery purchase amount for this family. The above calculation was made using an Excel spreadsheet. Let's compare this answer with a normal approximation. The mean of the binomial distribution in question is $15(0.5)=7.5$ and the variance is $15(0.5)(0.5)=3.75$. With a continuity correction applied to the range $k=4,\cdots,11$, consider the following:

$\displaystyle \Phi \biggl(\frac{11.5-7.5}{\sqrt{3.75}}\biggr)-\Phi \biggl(\frac{3.5-7.5}{\sqrt{3.75}}\biggr)$

$\displaystyle =\Phi \biggl(2.07\biggr)-\Phi \biggl(-2.07\biggr)=0.9808-0.0192=0.9616$

The normal approximation, 0.9616, is quite good: it is close to the exact value 0.9648.
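The endpoints 11.5 and 3.5 come from the continuity correction: the binomial sum runs over the integers $k=4,\cdots,11$, and each integer $k$ is treated as the interval from $k-0.5$ to $k+0.5$ on the normal scale, so the range becomes 3.5 to 11.5. The following sketch carries out that calculation with the standard library only; the function names are illustrative.

```python
from math import erf, sqrt


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def normal_approx_coverage(n, p, i, j):
    """Normal approximation to sum_{k=i}^{j-1} C(n,k) p^k (1-p)^(n-k)."""
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    lower = (i - 0.5 - mean) / sd        # continuity-corrected lower endpoint
    upper = (j - 1 + 0.5 - mean) / sd    # continuity-corrected upper endpoint
    return phi(upper) - phi(lower)


approx = normal_approx_coverage(15, 0.5, 4, 12)
print(approx)
```

Without rounding the standardized value to 2.07, the result is roughly 0.961, slightly below the table-based figure in the text and still close to the exact probability.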

Reference
Myles Hollander and Douglas A. Wolfe, Nonparametric Statistical Methods, Second Edition, Wiley, 1999.

________________________________________________________________________
$\copyright \ \text{2010 to 2015 by Dan Ma}$

## 3 thoughts on “Confidence intervals for percentiles”

1. I see that the order statistic is bracketed by Y4 and Y12.

How does one know to use 11.5 and 3.5 when constructing the normal approximation?

Thanks,

Richard thornton

thornton dot richard at gmail dot com

2. I really like the simple calculations presented here. What I find puzzling is that I cannot find a reference for the exact CI coverage calculations. There are many books that cover the asymptotic approximation. Even Hollander (1999) referenced here does not show this method. Any idea for a reference? I would like to use this method but have no reference for it.

The other question I have is: how do you choose the bounds for percentiles other than the median? There are potentially multiple intervals with at least 95% coverage. What I did was to determine the upper bound for the two-sided 95% confidence interval by calculating the lower one-sided 97.5% confidence interval. The lower bound would correspond to the bound of the upper one-sided 97.5% confidence interval.