Confidence intervals for San Francisco rainfall

When estimating population percentiles, there is a way to do it that is distribution free. Draw a random sample from the population of interest and take the middle element in the random sample as an estimate of the population median. Furthermore, we can even attach a confidence interval to this estimate of median without knowing (or assuming) a probability distribution of the underlying phenomenon. This “distribution free” method is shown in the post called Confidence intervals for percentiles. In this post, we give an additional example using annual rainfall data in San Francisco to illustrate this approach of non-parametric inference using order statistics.

________________________________________________________________________

San Francisco rainfall data

The following table shows the annual rainfall data in San Francisco (in inches) from 1960-2013 (data source). The table consits of 54 measurements and is sorted in increasing order from left to right (and from top to bottom). Each annual rainfall measurement is from July of that year to June of the following year. The driest year (7.97 inches) is 1975, the period from July 1975 to June 1976. The wettest year (47.22 inches) is 1997, which is the period from July 1997 to June 1998. The most recent data point is the fifth measurement 12.54 inches (the period from July 2013 to June 2014).

    \displaystyle \begin{bmatrix} 7.97&\text{ }&11.06&\text{ } &11.06&\text{ }&12.32&\text{ }&12.54  \\ 13.86&\text{ }&13.87&\text{ } &14.08&\text{ }&14.32&\text{ }&14.46    \\ 15.22&\text{ }&15.39&\text{ } &15.64&\text{ }&16.33&\text{ }&16.61   \\ 16.89&\text{ }&17.43&\text{ } &17.50&\text{ }&17.65&\text{ }&17.74   \\ 18.11&\text{ }&18.26&\text{ } &18.74&\text{ }&18.79&\text{ }&19.20  \\ 19.47&\text{ }&20.01&\text{ } &20.54&\text{ }&20.80&\text{ }&22.15  \\ 22.29&\text{ }&22.47&\text{ } &22.63&\text{ }&23.49&\text{ }&23.87  \\ 24.09&\text{ }&24.49&\text{ } &24.89&\text{ }&24.89&\text{ }&25.03  \\ 25.09&\text{ }&26.66&\text{ } &26.87&\text{ }&27.76&\text{ }&28.68  \\ 28.87&\text{ }&29.41&\text{ }&31.87&\text{ } &34.02&\text{ }&34.36  \\ 34.43&\text{ }&37.10&\text{ }&38.17&\text{ } &47.22&\text{ }&\text{ }    \end{bmatrix}

Using the above data, estimate the median, the lower quartile (25th percentile) and the upper quartile (the 75th percentile) of the annual rainfall in San Francisco. Then find a reasonably good confidence interval for each of the three population percentiles.

________________________________________________________________________

Basic facts about order statistics

Let’s recall some basic facts from the following previous posts:

Let’s say we have a random sample X_1,X_2,\cdots,X_n drawn from a population whose percentiles are unknown and we wish to estimate them. Rank the items of the random sample to obtain the order statistics Y_1<Y_2<\cdots < Y_n. In an ideal setting, the measurements are supposed to arise from a continuous distribution. So the chance of a tie among the Y_j is zero. But this assumption may not hold on occasions. There are some ties in the San Francisco rainfall data (e.g. the second and third data point). The small number of ties will not affect the calculation performed below.

The reason that we can use the order statistics Y_j to estimate the population percentiles is that the expected percentage of the population below Y_j is about the same as the percentage of the sample items less than Y_j. According to the explanation in the second post listed above (link), the order statistic Y_j is expected to be above 100p percent of the population where p=\frac{j}{n+1}. In fact, the order statistics Y_1<Y_2<\cdots < Y_n are expected to divide the population in roughly equal segments. More specifically the expected percentage of the population in between Y_{j-1} and Y_j is 100h where h=\frac{1}{n+1}.

The above explanation justifies the use of the order statistic Y_j as the sample 100pth percentile where p=\frac{j}{n+1}.

The sample size is n= 54 in the San Francisco rainfall data. Thus the order statistic Y_{11} is the sample 20th percentile and can be taken as an estimate of the population 20th percentile for the San Francisco annual rainfall. Here the realized value of Y_{11} is 15.22.

With \frac{45}{54+1}=0.818, the order statistic Y_{45} is the sample 82nd percentile and is taken as an estimate of the population 82nd percentile for the San Francisco annual rainfall. The realized value of Y_{45} is 28.68 inches.

The key for constructing confidence interval for percentiles is to calculate the probability P(Y_i < \tau_p < Y_j). This is the probability that the 100pth percentile, where 0<p<1, is in between Y_i and Y_j. Let's look at the median \tau_{0.5}. For Y_i<\tau_{0.5} to happen, there must be at least i many sample items less than the median \tau_{0.5}. For \tau_{0.5}<Y_j to happen, there can be at most j-1 many sample items less than the median \tau_{0.5}. Thus in the random draws of the sample items, in order for the event Y_i < \tau_{0.5} < Y_j to occur, there must be at least i sample items and at most j-1 sample items that are less than \tau_{0.5}. In other words, in n Bernoulli trials, there at at least i and at most j-1 successes where the probability of success is P(X<\tau_{0.5})= 0.5. The following is the probability P(Y_i < \tau_{0.5} < Y_j):

    \displaystyle P(Y_i < \tau_{0.5} < Y_j)=\sum \limits_{k=i}^{j-1} \binom{n}{k} \ 0.5^k \ 0.5^{n-k}=1 - \alpha

Then interval Y_i < \tau_{0.5} < Y_j is taken to be the 100(1-\alpha)% confidence interval for the unknown population median \tau_{0.5}. Note that this confidence interval is constructed without knowing (or assuming) anything about the underlying distribution of the population.

Consider the 100pth percentile where 0<p<1. In order for the event Y_i < \tau_{p} < Y_j to occur, there must be at least i sample items and at most j-1 sample items that are less than \tau_{p}. This is equivalent to n Bernoulli trials resulting in at least i successes and at most j-1 successes where the probability of success is P(X<\tau_{p})=p.

    \displaystyle P(Y_i < \tau_{p} < Y_j)=\sum \limits_{k=i}^{j-1} \binom{n}{k} \ p^k \ (1-p)^{n-k}=1 - \alpha

Then interval Y_i < \tau_{p} < Y_j is taken to be the 100(1-\alpha)% confidence interval for the unknown population percentile \tau_{p}. As mentioned earlier, this confidence interval does not need to rely on any information about the distribution of the population and is said to be distribution free. It only relies on a probability statement that involves the binomial distribution in describing the positioning of the sample items. In the past, people used normal approximation to the binomial to estimate this probability. The normal approximation should be no longer needed as computing software is now easily available. For example, binomial probabilities can be computed in Excel for number of trials a million or more.

________________________________________________________________________

Percentiles of annual rainfall

Using the above data, estimate the median, the lower quartile (25th percentile) and the upper quartile (the 75th percentile) of the annual rainfall in San Francisco. Then find a reasonably good confidence interval for each of the three population percentiles.

The sample size is n= 54. The middle two data elements in the sample is y_{27}=20.01 and y_{28}=20.54. They are realizations of the order statistics Y_{27} and Y_{28}. So in this example, \frac{27}{54+1}=0.49 and \frac{28}{54+1}=0.509. Thus the order statistic Y_{27} is expected to be greater than about 49% of the population and Y_{28} is expected to be greater than about 51% of the population. So neither Y_{27} nor Y_{28} is an exact fit. So we take the average of the two as an estimate of the population median:

    \displaystyle \hat{\tau}_{0.5}=\frac{20.01+20.54}{2}=20.275

Looking for confidence intervals, we consider the intervals (Y_{21},Y_{34}), (Y_{20},Y_{35}), (Y_{19},Y_{36}) and (Y_{18},Y_{37}). The following shows the confidence levels.

    \displaystyle P(Y_{21} < \tau_{0.5} < Y_{34})=\sum \limits_{k=21}^{33} \binom{54}{k} \ 0.5^k \ (0.5)^{54-k}=0.924095271

    \displaystyle P(Y_{20} < \tau_{0.5} < Y_{35})=\sum \limits_{k=20}^{34} \binom{54}{k} \ 0.5^k \ (0.5)^{54-k}=0.959776436

    \displaystyle P(Y_{19} < \tau_{0.5} < Y_{36})=\sum \limits_{k=19}^{35} \binom{54}{k} \ 0.5^k \ (0.5)^{54-k}=0.980165673

    \displaystyle P(Y_{18} < \tau_{0.5} < Y_{37})=\sum \limits_{k=18}^{36} \binom{54}{k} \ 0.5^k \ (0.5)^{54-k}=0.99092666

The above calculation is done in Excel. The binomial probabilities are done using the function BINOM.DIST. So we have the following confidence intervals for the median annual San Francisco rainfall in inches.

    Median

    \displaystyle \hat{\tau}_{0.5}=\frac{20.01+20.54}{2}=20.275

    (Y_{21},Y_{34}) = (18.11, 23.49) with approximately 92% confidence

    (Y_{20},Y_{35}) = (17.74, 23.87) with approximately 96% confidence

    (Y_{19},Y_{36}) = (17.65, 24.09) with approximately 98% confidence

    (Y_{18},Y_{37}) = (17.50, 24.49) with approximately 99% confidence

For the lower quartile and upper quartile, the following are the results. The reader is invited to confirm the calculation.

    Lower quartile

    \displaystyle \hat{\tau}_{0.25}=15.985, average of Y_{13} and Y_{14}

    (Y_{7},Y_{20}) = (13.87, 17.74) with approximately 96% confidence

    (Y_{6},Y_{21}) = (13.86, 18.11) with approximately 98% confidence

    (Y_{5},Y_{22}) = (12.54, 18.26) with approximately 99% confidence

    Upper quartile

    \displaystyle \hat{\tau}_{0.75}=25.875, average of Y_{41} and Y_{42}

    (Y_{36},Y_{47}) = (24.09, 29.41) with approximately 91% confidence

    (Y_{35},Y_{48}) = (23.87, 31.87) with approximately 96% confidence

    (Y_{34},Y_{49}) = (23.49, 34.02) with approximately 98% confidence

The following shows the calculation for two of the confidence intervals, one for \tau_{0.25} and one for \tau_{0.75}.

    \displaystyle P(Y_{6} < \tau_{0.25} < Y_{21})=\sum \limits_{k=6}^{20} \binom{54}{k} \ 0.25^k \ (0.25)^{54-k}=0.979889918

    \displaystyle P(Y_{34} < \tau_{0.75} < Y_{49})=\sum \limits_{k=34}^{38} \binom{54}{k} \ 0.75^k \ (0.75)^{54-k}=0.979889918

________________________________________________________________________
\copyright \ \text{2015 by Dan Ma}

Advertisements

The sign test, more examples

This is a continuation of the previous post The sign test. Examples 1 and 2 are presented in the previous post. In this post we present three more examples. Example 3 is a matched pairs problem and is an example demonstrating that the sign test may not as powerful as the t-test when the population is close to normal. Example 4 is a one-sample location problem. Example 5 is an example of an application of the sign test when the outcomes of the study or experiment are not numerical. For more information about distribution-free inferences, see [Hollander & Wolfe].

Example 3
Courses in introductory statistics are increasingly popular at community colleges across the United States. These are statistics courses that teach basic concepts of descriptive statistics, probability notions and basic inferential statistical procedures such as one and two-sample t procedures. A certain teacher of statistics at a local community college believes that taking such a course improves students’ quantitative skills. At the beginning of one semester, this professor administered a quantitative diagnostic test to a group of 15 students taking an introductory statistics course. At the end of the semester, the professor administered a second quantitative diagnostic test. The maximum possible score on each test is 50. Though the second test was at a similar level of difficulty as the first test, the questions in the second test were different and the contexts of the problems were different. Thus simply taking the first test should not improve the second test. The following matrices show the scores before and after taking the statistics course:

\displaystyle \begin{pmatrix} \text{Student}&\text{Pre-Statistics}&\text{Post-Statistics}&\text{Diff} \\{1}&17&21&4 \\{2}&26&26&0 \\{3}&16&19&3 \\{4}&28&26&-2 \\{5}&23&30&7 \\{6}&35&40&5 \\{7}&41&43&2 \\{8}&18&15&-3 \\{9}&30&29&-1 \\{10}&29&31&2 \\{11}&45&46&1 \\{12}&8&7&-1 \\{13}&38&43&5 \\{14}&31&31&0 \\{15}&36&37&1 \end{pmatrix}

Is there evidence that taking introductory statistics course at community colleges improves students’ quantitative skills? Do the analysis using the sign test.

For a given student, let X be the post-statistics score on the diagnostic test and let Y be the pre-statistics score on the disgnostic test. Let p=P[X>Y]. This is the probability that the student has an improvement on the quantitative test after taking a one-semester introductory statistics course. The test hypotheses are as follows:

\displaystyle H_0:p=\frac{1}{2} \ \ \ \ H_1:p>\frac{1}{2}

Another interpretation of the above alternative hypothesis is that the median of the post-statistics quantitative scores has moved upward. Let W be the number of students with an improvement between the post and pre scores. Since there are two students with a zero difference, under H_0, W \sim \text{binomial}(13,0.5). Then the observed value of W is w=9. The following is the P-value:

\displaystyle \text{P-value}=P[W \ge 9]=\sum \limits_{k=9}^{13} \binom{13}{k} \biggl(\frac{1}{2}\biggr)^{13}=0.1334

If we want to set the probability of a type I error at 0.10, we would not reject the null hypothesis H_0. Thus based on the sign test, it appears that merely taking an introductory statistics course may not improve a student’s quantitative skills.

The data set for the differences in scores appears symmetric and has no strong skewness and no obvious outliers. So it should be safe to use the t-test. With \mu_d being the mean of X-Y, the hypotheses for the t-test are:

\displaystyle H_0:\mu_d=0 \ \ \ \ H_1:\mu_d>0

We obtain: t-score=2.08 and the P-value=0.028. Thus with the t-test, we would reject the null hypothesis and have the opposite conclusion. Because the sign test does not use all the available information in the data, it is not as powerful as the t-test.

Example 4
Acid rain is an environmental challenge in many places around the world. It refers to rain or any other form of precipitation that is unusually acidic, i.e. rainwater having elevated levels of hydrogen ions (low pH). The measure of pH is a measure of the acidity or basicity of a solution and has a scale ranging from 0 to 14. Distilled water, with carbon dioxide removed, has a neutral pH level of 7. Liquids with a pH less than 7 are acidic. However, even unpolluted rainwater is slightly acidic with pH varying between 5.2 to 6.0 due to the fact that carbon dioxide and water in the air react together to form carbonic acid. Thus, rainwater is only considered acidic if the pH level is less than 5.2.

In a remote region in Washington state, an enviromental biologist measured the pH levels of rainwater and obtained the following data for 16 rainwater samples on 16 different dates:

\displaystyle \begin{pmatrix} 4.73&4.79&4.87&4.88 \\{5.04}&5.06&5.07&5.09 \\{5.11}&5.16&5.18&5.21 \\{5.23}&5.24&5.25&5.25 \end{pmatrix}

Is there reason to believe that the rainwater from this region is considered acidic (less than 5.2)? Use the sign test to perform the analysis.

Let X be the pH level of a sample of rainwater in this region of Washington state. Let p=P[5.2>X]=P[5.2-X>0]. Thus p is the probability of a plus sign when comparing the each data measurement and 5.2. The hypotheses to be tested are:

\displaystyle H_0:p=\frac{1}{2} \ \ \ \ H_1:p>\frac{1}{2}

The null hypothesis H_0 is equivalent to the statement that the median pH level is 5.2. If the median pH level is less than 5.2, then a data measurement will be more likely to have a plus sign. Thus the above alternative hypothesis is the statement that the median pH level is less than 5.2.

Let W be the number of plus signs (i.e. 5.2-X>0). Then W \sim \text{binomial}(16,0.5). There are 11 data measurements with plus signs (w=11). Thus the P-value is:

\displaystyle \text{P-value}=P[W \ge 11]=\sum \limits_{k=11}^{16} \binom{16}{k} \biggl(\frac{1}{2}\biggr)^{16}=0.1051

At the level of significance \alpha=0.05, the null hypothesis is not rejected. We still believe that the rainwater in this region is not acidic.

Example 5
There are two statistics instructors who are both sought after by students in a local college. Let’s call them instructor A and instructor B. The math department conducted a survey to find out who is more popular with the students. In surveying 15 students, the department found that 11 of the students prefer instructor B over instructor A. Use the sign test to test the hypothesis of no difference in popularity against the alternative hypothesis that instructor B is more popular.

More than \frac{2}{3} of the students in the sample prefer instructor B over A. This seems like convincing evidence that B is indeed more popular. Let perform some calculation to confirm this. Let W be the number of students in the sample who prefer B over A. The null hypothesis is that A and B are equally popular. The alternative hypothesis is that B is more popular. If the null hypothesis is true, then W \sim \text{binomial}(15,0.5). Then the P-value is:

\displaystyle \text{P-value}=P[W \ge 11]=\sum \limits_{k=11}^{15} \binom{15}{k} \biggl(\frac{1}{2}\biggr)^{15}=0.05923

This P-value suggests that we have strong evidence that instructor B is more popular among the students.

Reference
Myles Hollander and Douglas A. Wolfe, Non-parametric Statistical Methods, Second Edition, Wiley (1999)

The sign test

What kind of significance tests do we use for doing inference on the mean of an obviously non-normal population? If the sample is large, we can still use the t-test since the sampling distribution of the sample mean \overline{X} is close to normal and the t-procedure is robust. If the sample size is small and the underlying distribution is clearly not normal (e.g. is extremely skewed), what significance test do we use? Let’s take the example of a matched pairs data problem. The matched pairs t-test is to test the hypothesis that there is “no difference” between two continuous random variables X and Y that are paired. If the underlying distributions are normal or if the sample size is large, the matched pairs t-test are an excellent test. However, absent normality or large samples, the sign test is an alternative to the matched pairs t-test. In this post, we discuss how the sign test works and present some examples. Examples 1 and 2 are shown in this post. Examples 3, 4 and 5 are shown in the next post
The sign test, more examples.

The sign test and the confidence intervals for percentiles (discussed in the previous post Confidence intervals for percentiles) are examples of distribution-free inference procedures. They are called distribution-free because no assumptions are made about the underlying distributions of the data measurements. For more information about distribution-free inferences, see [Hollander & Wolfe].

We discuss two types of problems for which the sign test is applicable – one-sample location problems and matched pairs data problems. In the one-sample problems, the sign test is to test whether the location (median) of the data has shifted. In the matched pairs problems, the sign test is to test whether the location (median) of one variable has shifted in relation to the matched variable. Thus, the test hypotheses must be restated in terms of the median if the sign test is to be used as an alternative to the t-test. With the sign test, the question is “has the median changed?” whereas the question is “has the mean changed?” for the t-test.

The sign test is one of the simplest distribution-free procedures. It is an excellent choice for a significance test when the sample size is small and the data are highly skewed or have outliers. In such cases, the sign test is preferred over the t-test. However, the sign test is generally less powerful than the t-test. For the matched pairs problems, the sign test only looks at the signs of the differences of the data pairs. The magnitude of the differences is not taken into account. Because the sign test does not use all the available information contained in the data, it is less powerful than the t-test when the population is close to normal.

How the sign test works
Suppose that (X,Y) is a pair of continuous random variables. Suppose that a random sample of paired data (X_1,Y_1),(X_2,Y_2), \cdots, (X_n,Y_n) is obtained. We omit the observations (X_i,Y_i) with X_i=Y_i. Let m be the number of pairs for which X_i \ne Y_i. For each of these m pairs, we make a note of the sign of the difference X_i-Y_i (+ if X_i>Y_i and - if X_i<Y_i). Let W be the number of + signs out of these m pairs. The sign test gets its name from the fact that the statistic W is the test statistic of the sign test. Thus we are only considering the signs of the differences in the paired data and not the magnitude of the differences. The sign test is also called the binomial test since the statistic W has a binomial distribution.

Let p=P[X>Y]. Note that this is the probability that a data pair (X,Y) has a + sign. If p=\frac{1}{2}, then any random pair (X,Y) has an equal chance of being a + or a - sign. The null hypothesis H_0:p=\frac{1}{2} is the hypothesis of “no difference”. Under this hypothesis, there is no difference between the two measurements X and Y. The sign test is test the null hypothesis H_0:p=\frac{1}{2} against any one of the following alternative hypotheses:

\displaystyle H_1:p<\frac{1}{2} \ \ \ \ \ \text{(Left-tailed)}
\displaystyle H_1:p>\frac{1}{2} \ \ \ \ \ \text{(Right-tailed)}
\displaystyle H_1:p \ne \frac{1}{2} \ \ \ \ \ \text{(Two-tailed)}

The statistic W can be considered a series of m independent trials, each of which has probability of success p=P[X>Y]. Thus W \sim binomial(m,p). When H_0 is true, W \sim binomial(m,\frac{1}{2}). Thus the binomial distribution is used for calculating significance. The left-tailed P-value is of the form P[W \le w] and the right-tailed P-value is P[W \ge w]. Then the two-tailed P-value is twice the one-sided P-value.

The sign test can also be viewed as testing the hypothesis that the median of the differences is zero. Let m_d be the median of the differences X-Y. The null hypothesis H_0:p=\frac{1}{2} is equivalent to the hypothesis H_0:m_d=0. For the alternative hypotheses, we have the following equivalences:

\displaystyle H_1:p<\frac{1}{2} \ \ \ \equiv \ \ \ H_1:m_d<0

\displaystyle H_1:p>\frac{1}{2} \ \ \ \equiv \ \ \ H_1:m_d>0

\displaystyle H_1:p \ne \frac{1}{2} \ \ \ \equiv \ \ \ H_1:m_d \ne 0

Example 1
A running club conducts a 6-week training program in preparing 20 middle aged amateur runners for a 5K running race. The following matrix shows the running times (in minutes) before and after the training program. Note that five kilometers = 3.1 miles.

\displaystyle \begin{pmatrix} \text{Runner}&\text{Pre-training}&\text{Post-training}&\text{Diff} \\{1}&57.5&54.9&2.6 \\{2}&52.4&53.5&-1.1 \\{3}&59.2&49.0&10.2 \\{4}&27.0&24.5&2.5 \\{5}&55.8&50.7&5.1 \\{6}&60.8&57.5&3.3 \\{7}&40.6&37.2&3.4 \\{8}&47.3&42.3&5.0 \\{9}&43.9&47.3&-3.4 \\{10}&43.7&34.8&8.9 \\{11}&60.8&53.3&7.5 \\{12}&43.9&33.8&10.1 \\{13}&45.6&41.7&3.9 \\{14}&40.6&41.5&-0.9 \\{15}&54.1&52.5&1.6 \\{16}&50.7&52.4&-1.7 \\{17}&25.4&25.9&-0.5 \\{18}&57.5&54.7&2.8 \\{19}&43.9&38.7&5.2 \\{20}&43.9&39.9&4.0 \end{pmatrix}

The difference is taken to be pre-training time minus post-training time. Use the sign test to test whether the training program improves run time.

For a given runner, let X be a random pre-training running time and Y be a random post-training running time. The hypotheses to be tested are:

\displaystyle H_0:p=\frac{1}{2} \ \ \ \ \ H_1:p>\frac{1}{2} \ \ \ \text{where} \ p=P[X>Y]

Under the null hypothesis H_0, there is no difference between the pre-training run time and post-training run time. The difference is equally likely to be a plus sign or a minus sign. Let W be the number of runners in the sample for which X_i-Y_i>0. Then W \sim \text{Binomial}(20,0.5). The observed value of the statistic W is w=15. Since this is a right-tailed test, the following is the P-value:

\displaystyle \text{P-value}=P[W \ge 15]=\sum \limits_{k=15}^{20} \binom{20}{k} \biggl(\frac{1}{2}\biggr)^{20}=0.02069

Because of the small P-value, the result of 15 out of 20 runners having improved run time cannot be due to random chance alone. So we reject H_0 and we have good reason to believe that the training program reduces run time.

Example 2
A car owner is curoius about the effect of oil changes on gas mileage. For each of 17 oil changes, he recorded data for miles per gallon (MPG) prior to the oil change and after the oil change. The following matrix shows the data:

\displaystyle \begin{pmatrix} \text{Oil Change}&\text{MPG (Pre)}&\text{MPG (Post)}&\text{Diff} \\{1}&24.24&27.45&3.21 \\{2}&24.33&24.60&0.27 \\{3}&24.45&28.27&3.82 \\{4}&23.37&22.49&-0.88 \\{5}&26.73&28.67&1.94 \\{6}&30.40&27.51&-2.89 \\{7}&29.57&29.28&-0.29 \\{8}&22.27&23.18&0.91 \\{9}&27.00&27.64&0.64 \\{10}&24.95&26.01&1.06 \\{11}&27.12&27.39&0.27 \\{12}&28.53&28.67&0.14 \\{13}&27.55&30.27&2.72 \\{14}&30.17&27.83&-2.34 \\{15}&26.00&27.78&1.78 \\{16}&27.52&29.18&1.66 \\{17}&34.61&33.04&-1.57\end{pmatrix}

Regular oil changes are obviously crucial to maintaining the overall health of the car. It seems to make sense that oil changes would improve gas mileage. Is there evidence that this is the case? Do the analysis using the sign test.

In this example we set the hypotheses in terms of the median. For a given oil change, let X be the post oil change MPG and Y be the pre oil change MPG. Consider the differences X-Y. Let m_d be the median of the differences X-Y. We test the null hypothesis that there is no change in MPG before and after oil change against the alternative hypothesis that the median of the post oil change MPG has shifted to the right in relation to the pre oil change MPG. We have the following hypotheses:

\displaystyle H_0:m_d=0 \ \ \ \ \ H_1:m_d>0

Let W be the number of oil changes with positive differences in MPG (post minus pre). Then W \sim \text{Binomial}(17,0.5). The observed value of the statistic W is w=12. Since this is a right-tailed test, the following is the P-value:

\displaystyle \text{P-value}=P[W \ge 12]=\sum \limits_{k=12}^{17} \binom{17}{k} \biggl(\frac{1}{2}\biggr)^{17}=0.07173

At the significance level of \alpha=0.10, we reject the null hypothesis. However, we would like to add a caveat. The value of this example is that it is an excellent demonstration of the sign test. The 17 oil changes are not controlled. For example, the data are just records of mileage and gas usage for 17 oil changes (both pre and post). No effort was made to make sure that the driving conditions are similar for the pre oil change MPG and post oil change MPG (freeway vs. local streets, weather conditions, etc). With more care in producing the data, we can conceivably derive a more definite answer.

Reference
Myles Hollander and Douglas A. Wolfe, Non-parametric Statistical Methods, Second Edition, Wiley (1999)

Confidence intervals for percentiles

The order statistics play an important role in both descriptive statistics and non-parametric inferences. Sample percentiles (median, quartiles, etc) can be defined using the order statistics and can be used as point estimates for the corresponding percentiles in the population. For example, with a random sample of size n=24, the 6^{th} order statistic Y_6 is the sample 24^{th} percentile and is an estimate of the unknown population 24^{th} percentile \tau_{0.24}. The justification is that the area under the density curve of the distribution and to the left of Y_6 is on average =\frac{6}{24+1}=0.24 (see the discussion below). The order statistics can also be used for constructing confidence intervals for unknown population percentiles. Such confidence intervals are often called distribution-free confidence intervals because no information about the underlying distribution is used in the construction. In the previous post (The order statistics and the uniform distribution), an example was given demonstrating how confidence intervals for percentiles of a continuous distribution are constructed. In this post, we describe the general algorithm in greater details and present another example. For more information about distribution-free inferences, see [Hollander & Wolfe].

Let X_1,X_2, \cdots, X_n be a random sample drawn from a continuous distribution with X, F(x) and f(x) denoting the common random variable, the common distribution function and probability density function, respectively. Let Y_1<Y_2< \cdots <Y_n be the associated order statistics. Let W_i=F(Y_i). Note that F(Y_i) can be interpreted as an area under the density curve:

    \displaystyle W_i=F(Y_i)=\int_{-\infty}^{Y_i}f(x) dx

In the previous post (The order statistics and the uniform distribution), we showed that \displaystyle E[W_i]=\frac{i}{n+1}. On this basis, Y_i is defined as the sample (100p)^{th} percentile where \displaystyle p=\frac{i}{n+1} and is used as an estimate for the unknown population (100p)^{th} percentile.

The construction of confidence intervals for percentiles is based on the probability P[Y_i < \tau_p < Y_j] where \tau_p is the (100p)^{th} percentile. Let’s take median as an example and consider P[Y_2 < \tau_{0.5} < Y_8]. For Y_2 < \tau_{0.5} to happen, there must be at least two sample items X_k that are less than \tau_{0.5}. For \tau_{0.5} < Y_8 to happen, there can be no more than 8 sample items X_k that are less than \tau_{0.5}. In drawing each sample item, consider X < \tau_{0.5} as a success. The probability of a success is thus p=P[X<\tau_{0.5}]=0.5. We are interested in the probability of having at least 2 and at most 7 successes. Thus we have:

    \displaystyle P[Y_2 < \tau_{0.5} < Y_8]=\sum \limits_{k=2}^{7} \binom{n}{k} \biggl(\frac{1}{2}\biggr)^k \biggl(\frac{1}{2}\biggr)^{n-k}=1-\alpha

Then the interval (Y_2,Y_8) is taken to be the 100(1-\alpha) % confidence interval for the unknown population median.

The above discussion can easily be generalized. The following computes the probability P[Y_i < \tau_p < Y_j] where \tau_p is the (100p)^{th} percentile and p=P[X < \tau_p].

    \displaystyle P[Y_i < \tau_{p} < Y_j]=\sum \limits_{k=i}^{j-1} \binom{n}{k} p^k (1-p)^{n-k}=1-\alpha

Then the interval (Y_i,Y_j) is taken to be the 100(1-\alpha) % confidence interval for the unknown population percentile \tau_p. The above probability is based on the binomial distribution with parameters n and p=P(X<\tau_p). Its mean is np and its variance is np(1-p). This fact becomes useful when using normal approximation of the above probability.

Note that the wider the interval estimates, the more confidence can be attached. On the other hand, the more precise the interval estimate, the less confidence can be attached to the interval. This is true for parametric methods and is also true for the non-parametric method at hand. Though this is clear, we would like to call this out for the sake of completeness. For example, as confidence intervals for the median, (Y_2,Y_{15}) has a higher confidence level than the inteval (Y_6,Y_{10}). Note that of the two probabilities below, the first one is higher.

    \displaystyle P[Y_2 < \tau_{0.5} < Y_{15}]=\sum \limits_{k=2}^{14} \binom{n}{k} \biggl(\frac{1}{2}\biggr)^k \biggl(\frac{1}{2}\biggr)^{n-k}

    \displaystyle P[Y_6 < \tau_{0.5} < Y_{10}]=\sum \limits_{k=6}^{9} \binom{n}{k} \biggl(\frac{1}{2}\biggr)^k \biggl(\frac{1}{2}\biggr)^{n-k}

Example
The following matrix contains a random sample of n=15 grocery purchased amounts of a certain family in 2009. The data are arranged in increasing order on each row from left to right.

    \displaystyle \begin{pmatrix} 3.34&14.70&45.71&47.69&48.25 \\{52.22}&57.25&60.79&63.87&66.85 \\{88.13}&101.81&147.33&165.10&168.28 \end{pmatrix}

Find the sample median and the sample upper quartile. Construct an approximate 96% confidence interval for the population median.

The sample median \hat{\tau}_{0.5}=60.79, the 8^{th} grocery purchase. The upper quartile (75^{th} percentile) is the 12^{th} grocery purchase 101.81.

To construct a confidence interval for the median, we need to compute the probability P[Y_i< \tau_{0.5} < Y_j]. We use the interval (Y_{4},Y_{12}) because of the following probability:

    \displaystyle P[Y_{4} < \tau_{0.5} < Y_{12}]=\sum \limits_{k=4}^{11} \binom{15}{k} \biggl(\frac{1}{2}\biggr)^k \biggl(\frac{1}{2}\biggr)^{n-k}=0.96484375

Thus the interval (47.69,101.81) is an approximate 96% confidence interval for the median grocery purchase amount for this family. The above calculation is made using an Excel spread sheet. Let's compare this answer with a normal approximation. The mean of the binomial distribution in question is 15(0.5)=7.5 and the variance is 15(0.5)(0.5)=3.75. Consider the following:

    \displaystyle \Phi \biggl(\frac{11.5-7.5}{\sqrt{3.75}}\biggr)-\Phi \biggl(\frac{3.5-7.5}{\sqrt{3.75}}\biggr)

    \displaystyle =\Phi \biggl(2.07\biggr)-\Phi \biggl(-2.07\biggr)=0.9808-0.0192=0.9616

The normal approximation is quite good.

Reference
Myles Hollander and Douglas A. Wolfe, Non-parametric Statistical Methods, Second Edition, Wiley (1999)

________________________________________________________________________
\copyright \ \text{2010 to 2015 by Dan Ma}

The order statistics and the uniform distribution

In this post, we show that the order statistics of the uniform distribution on the unit interval are distributed according to the beta distributions. This leads to a discussion on estimation of percentiles using order statistics. We also present an example of using order statistics to construct confidence intervals of population percentiles. For a discussion on the distributions of order statistics of random samples drawn from a continuous distribution, see the previous post The distributions of the order statistics.

Suppose that we have a random sample X_1,X_2,\cdots,X_n of size n from a continuous distribution with common distribution function F_X(x)=F(x) and common density function f_X(x)=f(x). The order statistics Y_1<Y_2< \cdots <Y_n are obtained by ordering the sample X_1,X_2,\cdots,X_n in ascending order. In other words, Y_1 is the smallest item in the sample and Y_2 is the second smallest item in the sample and so on. Since this is random sampling from a continuous distribution, we assume that the probability of a tie between two order statistics is zero. In the previous post The distributions of the order statistics, we derive the probability density function of the i^{th} order statistic:

    \displaystyle f_{Y_i}(y)=\frac{n!}{(i-1)! (n-i)!} \thinspace F(y)^{i-1} \thinspace [1-F(y)]^{n-i} f(y)

The Order Statistics of the Uniform Distribution
Suppose that the random sample X_1,X_2, \cdots, X_n are drawn from U(0,1). Since the distribution function of U(0,1) is F(y)=y where 0<y<1, the probability density function of the i^{th} order statistic is:

    \displaystyle f_{Y_i}(y)=\frac{n!}{(i-1)! (n-i)!} \thinspace y^{i-1} \thinspace [1-y]^{n-i} where 0<y<1.

The above density function is from the family of beta distributions. In general, the pdf of a beta distribution and its mean and variance are:

    \displaystyle f_{W}(w)=\frac{\Gamma(a+b)}{\Gamma(a) \Gamma(b)} \thinspace w^{a-1} \thinspace [1-w]^{b-1} where 0<w<1 where \Gamma(\cdot) is the gamma function.

      \displaystyle E[W]=\frac{a}{a+b}

      \displaystyle Var[W]=\frac{ab}{(a+b)^2 (a+b+1)}

Then, the following shows the pdf of the i^{th} order statistic of the uniform distribution on the unit interval and its mean and variance:

    \displaystyle f_{Y_i}(y)=\frac{\Gamma(n+1)}{\Gamma(i) \Gamma(n-i+1)} \thinspace y^{i-1} \thinspace [1-y]^{(n-i+1)-1} where 0<y<1.

      \displaystyle E[Y_i]=\frac{i}{i+(n-i+1)}=\frac{i}{n+1}

      \displaystyle Var[Y_i]=\frac{i(n-i+1)}{(n+1)^2 (n+2)}

Estimation of Percentiles
In descriptive statistics, we define the sample percentiles using the order statistics (even though the term order statistics may not be used in a non-calculus based introductory statistics course). For example, if sample size is an odd integer n=2m+1, then the sample median is the order statistic Y_{m+1}. The preceding discussion on the order statistics of the uniform distribution can show us that this approach is a sound one.

Suppose we have a random sample of size n from an arbitrary continuous distribution. The order statistics listed in ascending order are:

    \displaystyle Y_1<Y_2<Y_3< \cdots <Y_n

For each i \le n, consider W_i=F(Y_i). Since the distribution function F(x) is a non-decreasing function, the W_i are also increasing:

    \displaystyle W_1<W_2<W_3< \cdots <W_n

It can be shown that if F(x) is a distribution function of a continuous random variable X, then the transformation F(X) follows the uniform distribution U(0,1). Then the following transformed random sample:

    \displaystyle F(X_1),F(X_2), \cdots, F(X_n)

are drawn from the uniform distribution U(0,1). Furthermore, W_i are the order statistics for this random sample. By the preceding discussion, \displaystyle E[W_i]=E[F(Y_i)]=\frac{i}{n+1}. Note that F(Y_i) is the area under the density function f(x) and to the left of Y_i. Thus F(Y_i) is a random area and E[W_i]=E[F(Y_i)] is the expected area under the density curve f(x) to the left of Y_i. Recall that f(x) is the common density function of the original sample X_1,X_2,\cdots,X_n.

For example, suppose the sample size n is an odd integer where n=2m+1. Then the sample median is Y_{m+1}. Note that \displaystyle E[W_{m+1}]=\frac{m+1}{n+1}=\frac{1}{2}. Thus if we choose Y_{m+1} as a point estimate for the population median, Y_{m+1} is expected to be above the bottom 50% of the population and is expected to be below the upper 50% of the population.

Furthermore, E[W_i - W_{i-1}] is the expected area under the density curve and between Y_i and Y_{i-1}. This expected area is:

    \displaystyle E[W_i - W_{i-1}]=E[F(Y_i)]-E[F(Y_{i-1})]=\frac{i}{n+1}-\frac{i-1}{n+1}=\frac{1}{n+1}

The expected area under the density curve and above the maximum order statistic Y_n is:

    \displaystyle E[1-F(Y_n)]=1-\frac{n}{n+1}=\frac{1}{n+1}

Consequently here is an interesting observation about the order statistics Y_1<Y_2<Y_3< \cdots <Y_n. The order statistics Y_i divides the the area under the density curve f(x) and above the x-axis into n+1 areas. On average each of these area is \displaystyle \frac{1}{n+1}.

As a result, it makes sense to use order statistics as estimator of percentiles. For example, we can use Y_i as the (100p)^{th} percentile of the sample where \displaystyle p=\frac{i}{n+1}. Then Y_i is an estimator of the population percentile \tau_{p} where the area under the density curve f(x) and to the left of \tau_{p} is p. In the case that (n+1)p is not an integer, then we interpolate between two order statistics. For example, if (n+1)p=5.7, then we interpolate between Y_5 and Y_6.

Example
Suppose we have a random sample of size n=11 drawn from a continuous distribution. Find estimators for the median, first quartile and second quartile. Find an estimate for the 85^{th} percentile. Construct an 87% confidence interval for the 40^{th} percentile.

The estimator for the median is Y_6. The estimator for the first quartile (25^{th} percentile) is third order statistic Y_3. The estimator for the second quartile (75^{th} percentile) is the ninth order statistic Y_9. Based on the preceding discussion, the expected area under the density curve f(x) to the left of Y_3,Y_6,Y_9 are 0.25, 0.5 and 0.75, respectively.

To find the 85^{th} percentile, note that (n+1)p=12(0.85)=10.2. Thus we interpolate Y_{10} and Y_{11}. In our example, we use linear interpolation, though taking the arithmetic average of Y_{10} and Y_{11} is also a valid approach. The following is an estimate of the 85^{th} percentile.

    \displaystyle \hat{\tau}_{0.85}=0.8Y_{10}+0.2Y_{11}

To find the confidence interval, consider the probability P[Y_2 < \tau_{0.4} < Y_7] where \tau_{0.4} is the 40^{th} percentile. Consider the event X \le \tau_{0.4} as a success with probability of success p=0.4. For Y_2 < \tau_{0.4} < Y_7 to happen, there must be at least 2 successes and fewer than 7 success in the binomial distribution with n=11 and p=0.4. Thus we have:

    \displaystyle P[Y_2 < \tau_{0.4} < Y_7]=\sum \limits_{j=2}^{6} \binom{11}{j} 0.4^{j} 0.6^{11-j}=0.8704

Thus the interval (Y_2,Y_7) can be taken as the 87% confidence interval for \tau_{0.4}. This is an example of a distribution-free confidence interval because nothing is assumed about the underlying distribution in the construction of the confidence interval.

________________________________________________________________________
\copyright \ \text{2010 - 2015 by Dan Ma}

The distributions of the order statistics

Sample statistics such as sample median, sample quartiles and sample minimum and maximum play a prominent role in the analysis using empirical data (e.g. in descriptive statistics and exploratory data analysis (EDA)). In this post we discuss order statistics and their distributions. The order statistics are the items from a randon sample arranged in increasing order. The focus here is to present the distribution functions and probability density functions of order statistics. The order statistics are important tools in non-parametric statistical inferences. In subsequent posts, we will present examples of applications in non-parametric methods.

In this post, we only consider random samples obtained from a continuous distribution (i.e. the distribution function is a continuous function). Let X_1,X_2, \cdots, X_n be a random sample of size n from a continuous distribution with distribution function F(x). We order the random sample in increasing order and obtain Y_1,Y_2, \cdots, Y_n. In other words, we have:

    Y_1= the smallest of X_1,X_2, \cdots, X_n
    Y_2= the second smallest of X_1,X_2, \cdots, X_n
    \cdot
    \cdot
    \cdot
    Y_n= the largest of X_1,X_2, \cdots, X_n

We set Y_{min}=Y_1 and Y_{max}=Y_n. The order statistic Y_i is called the i^{th} order statistic. Since we are working with a continuous distribution, we assume that the probability of two sample items being equal is zero. Thus we can assume that Y_1<Y_2< \cdots <Y_n. That is, the probability of a tie is zero among the order statistics.

The Distribution Functions of the Order Statistics
The distribution function of Y_i is an upper tail of a binomial distribution. If the event Y_i \le y occurs, then there are at least i many X_j in the sample that are less than or equal to y. Consider the event that X \le y as a success and F(y)=P[X \le y] as the probability of success. Then the drawing of each sample item becomes a Bernoulli trial (a success or a failure). We are interested in the probability of having at least i many successes. Thus the following is the distribution function of Y_i:

    \displaystyle F_{Y_i}(y)=P[Y_i \le y]=\sum \limits_{k=i}^{n} \binom{n}{k} F(y)^k [1-F(y)]^{n-k}\ \ \ \ \ \ \ \ \ \ \ \  (1)

The following relationship is used in deriving the probability density function:

    \displaystyle F_{Y_i}(y)=F_{Y_{i-1}}(y)-\binom{n}{i-1} F(y)^{i-1} [1-F(y)]^{n-i+1} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)

The Probability Density Functions of the Order Statistics
The probability density function of Y_i is given by:

    \displaystyle f_{Y_i}(y)=\frac{n!}{(i-1)! (n-i)!} \thinspace F(y)^{i-1} \thinspace [1-F(y)]^{n-i} f_X(y) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3)

We prove this by induction. Consider i=1. Note that F_{Y_1}(y) is the probability that at least one X_j \le y and is the complement of the probability of having no X_j \le y. Thus F_{Y_1}(y)=1-[1-F(y)]^n. By taking derivative, we have:

    \displaystyle f_{Y_1}(y)=F_{Y_1}^{-1}(y)=n [1-F(y)]^{n-1} f_X(y)

Suppose we derive the pdf of Y_{i-1} using (3) and obtain the following:

    \displaystyle f_{Y_{i-1}}(y)=\frac{n!}{(i-2)! (n-i+1)!} \thinspace F(y)^{i-2} \thinspace [1-F(y)]^{n-i+1} f_X(y)

Now we take the derivative of (2) above and we have:

    \displaystyle f_{Y_i}(y)=f_{Y_{i-1}}(y)-\biggl[(i-1)\binom{n}{i-1} F(y)^{i-2} f_X(y)[1-F(y)]^{n-i+1}
    \displaystyle -\ \ \ \ \ \ \ \ \ \ \binom{n}{i-1}F(y)^{i-1}(n-i+1)[1-F(y)]^{n-i} f_X(y) \biggr]

After simplifying the right hand side, we obtain the pdf of Y as in (3).

Comments
We would like to make two comments. One is that in terms of problem solving, it may be better to rely on the distribution function in (1) above to derive the pdf. The thought process behind (1) is clear. The second is that the last three terms in the pdf in (3) are very instructive. Let’s arrange these three terms as follows:.

    \displaystyle F(y)^{i-1} \thinspace f_X(y) \thinspace [1-F(y)]^{n-i}

Note that the first term is the probability that there are i-1 sample items below y. The middle term indicates that one sample item is right around y. The third term indicates that there are n-i items above y. Thus the following multinomial probability is the pdf in (3):

    \displaystyle f_{Y_i}(y)=\frac{n!}{(i-1)! 1! (n-i)!} \thinspace F(y)^{i-1} \thinspace f_X(y) \thinspace [1-F(y)]^{n-i} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4)

This heuristic approach is further described here.

Example
Suppose that a sample of size n=11 is drawn from the uniform distribution on the interval (0, \theta). Find the pdfs for Y_{min}=Y_1, Y_{max}=Y_{11} and Y_6. Find E[Y_6].

Let X \sim uniform(0,\theta). The distribution function and pdf of X are:

    \displaystyle F(y)=\left\{\begin{matrix}0&\thinspace y<0\\{\displaystyle \frac{y}{\theta}}&\thinspace 0 \le y < \theta\\{1}&\thinspace y \ge \theta\end{matrix}\right.

    \displaystyle f(y)=\left\{\begin{matrix}\displaystyle \frac{1}{\theta}&\thinspace 0<y<\theta\\{0}&\thinspace otherwise\end{matrix}\right.

Using (3), the following are the pdfs of Y_1, Y_{11} and Y_6.

    \displaystyle f_{Y_1}(y)=\frac{11}{\theta^{11}} (\theta-y)^{10}

    \displaystyle f_{Y_{11}}(y)=\frac{11}{\theta^{11}} y^{10}

    \displaystyle f_{Y_6}(y)=2772 \biggl(\frac{y}{\theta}\biggr)^5 \biggl(1-\frac{y}{\theta}\biggr)^5 \frac{1}{\theta}

In this example, Y_6 is the sample median and serves as a point estimate for the population median \frac{\theta}{2}. As an estimator of the median, we prefer Y_6 not to overestimate or underestimate \frac{\theta}{2} (we call such estimator as unbiased estimator). In this particular example, the sample median Y_6 is an unbiased estimator of \frac{\theta}{2}. To see this we show E[Y_6]=\frac{\theta}{2}.

    \displaystyle E[Y_6]=\int_0^{\theta}2772 y \biggl(\frac{y}{\theta}\biggr)^5 \biggl(1-\frac{y}{\theta}\biggr)^5 \frac{1}{\theta} dy

    By substituting w=\frac{y}{\theta}, we have the following beta integral.

    \displaystyle E[Y_6]=2772 \theta \int_0^1 w^{7-1} (1-w)^{6-1} dw

    \displaystyle E[Y_6]=2772 \theta \thinspace \frac{\Gamma(7) \Gamma(6)}{\Gamma(13)}=2772 \theta \thinspace \frac{6! \thinspace 5!}{12!}=\frac{\theta}{2}

________________________________________________________________________

Practice problems

Practice problems are found here in a companion blog.

________________________________________________________________________
\copyright \ \text{2010 - 2015 by Dan Ma} Revised April 6, 2015.