The sign test, more examples

This is a continuation of the previous post The sign test. Examples 1 and 2 are presented in that post. In this post we present three more examples. Example 3 is a matched pairs problem and demonstrates that the sign test may not be as powerful as the t-test when the population is close to normal. Example 4 is a one-sample location problem. Example 5 is an application of the sign test when the outcomes of the study or experiment are not numerical. For more information about distribution-free inferences, see [Hollander & Wolfe].

Example 3
Courses in introductory statistics are increasingly popular at community colleges across the United States. These are courses that teach basic concepts of descriptive statistics, probability notions and basic inferential procedures such as one-sample and two-sample t procedures. A certain statistics professor at a local community college believes that taking such a course improves students’ quantitative skills. At the beginning of one semester, this professor administered a quantitative diagnostic test to a group of 15 students taking an introductory statistics course. At the end of the semester, the professor administered a second quantitative diagnostic test. The maximum possible score on each test is 50. Though the second test was at a similar level of difficulty as the first, the questions and the contexts of the problems were different, so simply having taken the first test should not improve performance on the second. The following matrices show the scores before and after taking the statistics course:

\displaystyle \begin{pmatrix} \text{Student}&\text{Pre-Statistics}&\text{Post-Statistics}&\text{Diff} \\{1}&17&21&4 \\{2}&26&26&0 \\{3}&16&19&3 \\{4}&28&26&-2 \\{5}&23&30&7 \\{6}&35&40&5 \\{7}&41&43&2 \\{8}&18&15&-3 \\{9}&30&29&-1 \\{10}&29&31&2 \\{11}&45&46&1 \\{12}&8&7&-1 \\{13}&38&43&5 \\{14}&31&31&0 \\{15}&36&37&1 \end{pmatrix}

Is there evidence that taking an introductory statistics course at a community college improves students’ quantitative skills? Do the analysis using the sign test.

For a given student, let X be the post-statistics score on the diagnostic test and let Y be the pre-statistics score on the diagnostic test. Let p=P[X>Y]. This is the probability that the student improves on the quantitative test after taking a one-semester introductory statistics course. The test hypotheses are as follows:

\displaystyle H_0:p=\frac{1}{2} \ \ \ \ H_1:p>\frac{1}{2}

Another interpretation of the above alternative hypothesis is that the median of the post-statistics quantitative scores has moved upward. Let W be the number of students whose post-statistics score is higher than their pre-statistics score. Since two students have a zero difference, these observations are discarded and, under H_0, W \sim \text{binomial}(13,0.5). The observed value of W is w=9. The following is the P-value:

\displaystyle \text{P-value}=P[W \ge 9]=\sum \limits_{k=9}^{13} \binom{13}{k} \biggl(\frac{1}{2}\biggr)^{13}=0.1334

If we want to set the probability of a type I error at 0.10, we would not reject the null hypothesis H_0. Thus based on the sign test, it appears that merely taking an introductory statistics course may not improve a student’s quantitative skills.

The data set for the differences in scores appears symmetric and has no strong skewness and no obvious outliers. So it should be safe to use the t-test. With \mu_d being the mean of X-Y, the hypotheses for the t-test are:

\displaystyle H_0:\mu_d=0 \ \ \ \ H_1:\mu_d>0

We obtain a t-statistic of 2.08 and a P-value of 0.028. Thus with the t-test we would reject the null hypothesis, reaching the opposite conclusion. Because the sign test does not use all the available information in the data, it is not as powerful as the t-test when the population is close to normal.
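The calculations in this example are easy to reproduce. The following is a minimal sketch in Python (assuming NumPy and SciPy are available; the variable names are ours); it recovers both the sign-test P-value of 0.1334 and the one-sided paired t-test P-value of about 0.028.

    import numpy as np
    from math import comb
    from scipy import stats

    # Pre- and post-statistics diagnostic scores for the 15 students in Example 3
    pre  = np.array([17, 26, 16, 28, 23, 35, 41, 18, 30, 29, 45,  8, 38, 31, 36])
    post = np.array([21, 26, 19, 26, 30, 40, 43, 15, 29, 31, 46,  7, 43, 31, 37])
    diff = post - pre

    # Sign test: drop the zero differences, count plus signs, take a binomial upper tail
    nonzero = diff[diff != 0]
    m, w = len(nonzero), int(np.sum(nonzero > 0))                  # m = 13, w = 9
    sign_p = sum(comb(m, k) for k in range(w, m + 1)) / 2**m
    print(f"sign test: w = {w} of {m}, P-value = {sign_p:.4f}")    # 0.1334

    # One-sided paired t-test computed directly from the differences
    n, d_bar, s_d = len(diff), diff.mean(), diff.std(ddof=1)
    t_stat = d_bar / (s_d / np.sqrt(n))
    t_p = stats.t.sf(t_stat, df=n - 1)
    print(f"t-test: t = {t_stat:.2f}, P-value = {t_p:.3f}")        # t = 2.08, P-value = 0.028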

Example 4
Acid rain is an environmental challenge in many places around the world. It refers to rain or any other form of precipitation that is unusually acidic, i.e. rainwater having elevated levels of hydrogen ions (low pH). pH is a measure of the acidity or basicity of a solution, on a scale ranging from 0 to 14. Distilled water, with carbon dioxide removed, has a neutral pH level of 7. Liquids with a pH less than 7 are acidic. However, even unpolluted rainwater is slightly acidic, with pH varying between 5.2 and 6.0, because carbon dioxide and water in the air react to form carbonic acid. Thus, rainwater is only considered acidic if the pH level is less than 5.2.

In a remote region in Washington state, an environmental biologist measured the pH levels of rainwater and obtained the following data for 16 rainwater samples on 16 different dates:

\displaystyle \begin{pmatrix} 4.73&4.79&4.87&4.88 \\{5.04}&5.06&5.07&5.09 \\{5.11}&5.16&5.18&5.21 \\{5.23}&5.24&5.25&5.25 \end{pmatrix}

Is there reason to believe that the rainwater from this region is considered acidic (less than 5.2)? Use the sign test to perform the analysis.

Let X be the pH level of a sample of rainwater in this region of Washington state. Let p=P[5.2>X]=P[5.2-X>0]. Thus p is the probability of a plus sign when comparing each data measurement with 5.2. The hypotheses to be tested are:

\displaystyle H_0:p=\frac{1}{2} \ \ \ \ H_1:p>\frac{1}{2}

The null hypothesis H_0 is equivalent to the statement that the median pH level is 5.2. If the median pH level is less than 5.2, then a data measurement will be more likely to have a plus sign. Thus the above alternative hypothesis is the statement that the median pH level is less than 5.2.

Let W be the number of plus signs (i.e. 5.2-X>0). Then W \sim \text{binomial}(16,0.5). There are 11 data measurements with plus signs (w=11). Thus the P-value is:

\displaystyle \text{P-value}=P[W \ge 11]=\sum \limits_{k=11}^{16} \binom{16}{k} \biggl(\frac{1}{2}\biggr)^{16}=0.1051

At the level of significance \alpha=0.05, the null hypothesis is not rejected. The data do not provide sufficient evidence to conclude that the rainwater in this region is acidic.
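For completeness, here is a minimal sketch of the same calculation in Python (standard library only); the pH readings are typed in from the matrix above.

    from math import comb

    ph = [4.73, 4.79, 4.87, 4.88, 5.04, 5.06, 5.07, 5.09,
          5.11, 5.16, 5.18, 5.21, 5.23, 5.24, 5.25, 5.25]

    # A plus sign corresponds to 5.2 - X > 0, i.e. a pH reading below 5.2
    n = len(ph)
    w = sum(1 for x in ph if x < 5.2)                              # w = 11
    p_value = sum(comb(n, k) for k in range(w, n + 1)) / 2**n      # P[W >= 11]
    print(f"w = {w}, P-value = {p_value:.4f}")                     # 0.1051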

Example 5
There are two statistics instructors who are both sought after by students in a local college. Let’s call them instructor A and instructor B. The math department conducted a survey to find out who is more popular with the students. In surveying 15 students, the department found that 11 of the students prefer instructor B over instructor A. Use the sign test to test the hypothesis of no difference in popularity against the alternative hypothesis that instructor B is more popular.

More than \frac{2}{3} of the students in the sample prefer instructor B over A. This seems like convincing evidence that B is indeed more popular. Let's perform some calculations to confirm this. Let W be the number of students in the sample who prefer B over A. The null hypothesis is that A and B are equally popular. The alternative hypothesis is that B is more popular. If the null hypothesis is true, then W \sim \text{binomial}(15,0.5). Then the P-value is:

\displaystyle \text{P-value}=P[W \ge 11]=\sum \limits_{k=11}^{15} \binom{15}{k} \biggl(\frac{1}{2}\biggr)^{15}=0.05923

This P-value is below 0.10, so we have evidence that instructor B is more popular among the students, though the evidence falls just short of significance at the 0.05 level.

Reference
Myles Hollander and Douglas A. Wolfe, Non-parametric Statistical Methods, Second Edition, Wiley (1999)

The sign test

What kind of significance tests do we use for doing inference on the mean of an obviously non-normal population? If the sample is large, we can still use the t-test since the sampling distribution of the sample mean \overline{X} is close to normal and the t-procedure is robust. If the sample size is small and the underlying distribution is clearly not normal (e.g. is extremely skewed), what significance test do we use? Let’s take the example of a matched pairs data problem. The matched pairs t-test is to test the hypothesis that there is “no difference” between two continuous random variables X and Y that are paired. If the underlying distributions are normal or if the sample size is large, the matched pairs t-test is an excellent test. However, absent normality or large samples, the sign test is an alternative to the matched pairs t-test. In this post, we discuss how the sign test works and present some examples. Examples 1 and 2 are shown in this post. Examples 3, 4 and 5 are shown in the next post
The sign test, more examples.

The sign test and the confidence intervals for percentiles (discussed in the previous post Confidence intervals for percentiles) are examples of distribution-free inference procedures. They are called distribution-free because no assumptions are made about the underlying distributions of the data measurements. For more information about distribution-free inferences, see [Hollander & Wolfe].

We discuss two types of problems for which the sign test is applicable – one-sample location problems and matched pairs data problems. In the one-sample problems, the sign test is to test whether the location (median) of the data has shifted. In the matched pairs problems, the sign test is to test whether the location (median) of one variable has shifted in relation to the matched variable. Thus, the test hypotheses must be restated in terms of the median if the sign test is to be used as an alternative to the t-test. With the sign test, the question is “has the median changed?” whereas the question is “has the mean changed?” for the t-test.

The sign test is one of the simplest distribution-free procedures. It is an excellent choice for a significance test when the sample size is small and the data are highly skewed or have outliers. In such cases, the sign test is preferred over the t-test. However, the sign test is generally less powerful than the t-test. For the matched pairs problems, the sign test only looks at the signs of the differences of the data pairs. The magnitude of the differences is not taken into account. Because the sign test does not use all the available information contained in the data, it is less powerful than the t-test when the population is close to normal.

How the sign test works
Suppose that (X,Y) is a pair of continuous random variables. Suppose that a random sample of paired data (X_1,Y_1),(X_2,Y_2), \cdots, (X_n,Y_n) is obtained. We omit the observations (X_i,Y_i) with X_i=Y_i. Let m be the number of pairs for which X_i \ne Y_i. For each of these m pairs, we note the sign of the difference X_i-Y_i (+ if X_i>Y_i and - if X_i<Y_i). Let W be the number of + signs out of these m pairs. The sign test gets its name from the fact that only the signs of the differences are used, and not their magnitudes; the statistic W is the test statistic of the sign test. The sign test is also called the binomial test since the statistic W has a binomial distribution.

Let p=P[X>Y]. Note that this is the probability that a data pair (X,Y) has a + sign. If p=\frac{1}{2}, then any random pair (X,Y) has an equal chance of being a + or a - sign. The null hypothesis H_0:p=\frac{1}{2} is the hypothesis of “no difference”. Under this hypothesis, there is no difference between the two measurements X and Y. The sign test tests the null hypothesis H_0:p=\frac{1}{2} against any one of the following alternative hypotheses:

\displaystyle H_1:p<\frac{1}{2} \ \ \ \ \ \text{(Left-tailed)}
\displaystyle H_1:p>\frac{1}{2} \ \ \ \ \ \text{(Right-tailed)}
\displaystyle H_1:p \ne \frac{1}{2} \ \ \ \ \ \text{(Two-tailed)}

The statistic W counts the number of successes in a series of m independent trials, each of which has probability of success p=P[X>Y]. Thus W \sim binomial(m,p). When H_0 is true, W \sim binomial(m,\frac{1}{2}). Thus the binomial distribution is used for calculating significance. The left-tailed P-value is of the form P[W \le w] and the right-tailed P-value is P[W \ge w]. The two-tailed P-value is twice the smaller of the two one-sided P-values.
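The mechanics just described can be packaged into a small helper. The following is a sketch in Python (the function name and interface are our own); it takes the differences X_i-Y_i, discards the zeros, and returns the binomial P-value for the requested alternative.

    from math import comb

    def sign_test(differences, alternative="two-sided"):
        """Sign test P-value based on the binomial(m, 1/2) null distribution."""
        signs = [d for d in differences if d != 0]      # drop ties (zero differences)
        m = len(signs)
        w = sum(1 for d in signs if d > 0)              # number of plus signs
        upper = sum(comb(m, k) for k in range(w, m + 1)) / 2**m    # P[W >= w]
        lower = sum(comb(m, k) for k in range(0, w + 1)) / 2**m    # P[W <= w]
        if alternative == "greater":                    # H1: p > 1/2
            return upper
        if alternative == "less":                       # H1: p < 1/2
            return lower
        return min(1.0, 2 * min(upper, lower))          # two-sided: twice the smaller tail

For example, calling sign_test with the 20 differences of Example 1 below and alternative="greater" returns the P-value 0.02069 computed there.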

The sign test can also be viewed as testing the hypothesis that the median of the differences is zero. Let m_d be the median of the differences X-Y. The null hypothesis H_0:p=\frac{1}{2} is equivalent to the hypothesis H_0:m_d=0. For the alternative hypotheses, we have the following equivalences:

\displaystyle H_1:p<\frac{1}{2} \ \ \ \equiv \ \ \ H_1:m_d<0

\displaystyle H_1:p>\frac{1}{2} \ \ \ \equiv \ \ \ H_1:m_d>0

\displaystyle H_1:p \ne \frac{1}{2} \ \ \ \equiv \ \ \ H_1:m_d \ne 0

Example 1
A running club conducts a 6-week training program to prepare 20 middle-aged amateur runners for a 5K running race. The following matrix shows the running times (in minutes) before and after the training program. Note that five kilometers = 3.1 miles.

\displaystyle \begin{pmatrix} \text{Runner}&\text{Pre-training}&\text{Post-training}&\text{Diff} \\{1}&57.5&54.9&2.6 \\{2}&52.4&53.5&-1.1 \\{3}&59.2&49.0&10.2 \\{4}&27.0&24.5&2.5 \\{5}&55.8&50.7&5.1 \\{6}&60.8&57.5&3.3 \\{7}&40.6&37.2&3.4 \\{8}&47.3&42.3&5.0 \\{9}&43.9&47.3&-3.4 \\{10}&43.7&34.8&8.9 \\{11}&60.8&53.3&7.5 \\{12}&43.9&33.8&10.1 \\{13}&45.6&41.7&3.9 \\{14}&40.6&41.5&-0.9 \\{15}&54.1&52.5&1.6 \\{16}&50.7&52.4&-1.7 \\{17}&25.4&25.9&-0.5 \\{18}&57.5&54.7&2.8 \\{19}&43.9&38.7&5.2 \\{20}&43.9&39.9&4.0 \end{pmatrix}

The difference is taken to be pre-training time minus post-training time. Use the sign test to test whether the training program improves run time.

For a given runner, let X be a random pre-training running time and Y be a random post-training running time. The hypotheses to be tested are:

\displaystyle H_0:p=\frac{1}{2} \ \ \ \ \ H_1:p>\frac{1}{2} \ \ \ \text{where} \ p=P[X>Y]

Under the null hypothesis H_0, there is no difference between the pre-training run time and post-training run time. The difference is equally likely to be a plus sign or a minus sign. Let W be the number of runners in the sample for which X_i-Y_i>0. Then W \sim \text{Binomial}(20,0.5). The observed value of the statistic W is w=15. Since this is a right-tailed test, the following is the P-value:

\displaystyle \text{P-value}=P[W \ge 15]=\sum \limits_{k=15}^{20} \binom{20}{k} \biggl(\frac{1}{2}\biggr)^{20}=0.02069

Because of the small P-value, it is unlikely that 15 out of 20 runners would show improved run times due to random chance alone. So we reject H_0 and we have good reason to believe that the training program reduces run time.

Example 2
A car owner is curious about the effect of oil changes on gas mileage. For each of 17 oil changes, he recorded data for miles per gallon (MPG) prior to the oil change and after the oil change. The following matrix shows the data:

\displaystyle \begin{pmatrix} \text{Oil Change}&\text{MPG (Pre)}&\text{MPG (Post)}&\text{Diff} \\{1}&24.24&27.45&3.21 \\{2}&24.33&24.60&0.27 \\{3}&24.45&28.27&3.82 \\{4}&23.37&22.49&-0.88 \\{5}&26.73&28.67&1.94 \\{6}&30.40&27.51&-2.89 \\{7}&29.57&29.28&-0.29 \\{8}&22.27&23.18&0.91 \\{9}&27.00&27.64&0.64 \\{10}&24.95&26.01&1.06 \\{11}&27.12&27.39&0.27 \\{12}&28.53&28.67&0.14 \\{13}&27.55&30.27&2.72 \\{14}&30.17&27.83&-2.34 \\{15}&26.00&27.78&1.78 \\{16}&27.52&29.18&1.66 \\{17}&34.61&33.04&-1.57\end{pmatrix}

Regular oil changes are obviously crucial to maintaining the overall health of the car. It seems to make sense that oil changes would improve gas mileage. Is there evidence that this is the case? Do the analysis using the sign test.

In this example we set the hypotheses in terms of the median. For a given oil change, let X be the post oil change MPG and Y be the pre oil change MPG. Consider the differences X-Y. Let m_d be the median of the differences X-Y. We test the null hypothesis that there is no change in MPG before and after oil change against the alternative hypothesis that the median of the post oil change MPG has shifted to the right in relation to the pre oil change MPG. We have the following hypotheses:

\displaystyle H_0:m_d=0 \ \ \ \ \ H_1:m_d>0

Let W be the number of oil changes with positive differences in MPG (post minus pre). Then W \sim \text{Binomial}(17,0.5). The observed value of the statistic W is w=12. Since this is a right-tailed test, the following is the P-value:

\displaystyle \text{P-value}=P[W \ge 12]=\sum \limits_{k=12}^{17} \binom{17}{k} \biggl(\frac{1}{2}\biggr)^{17}=0.07173

At the significance level of \alpha=0.10, we reject the null hypothesis. However, we would like to add a caveat. The value of this example lies mainly in demonstrating the sign test. The 17 oil changes were not controlled; the data are simply records of mileage and gas usage surrounding 17 oil changes (both pre and post). No effort was made to ensure that the driving conditions were similar for the pre oil change MPG and post oil change MPG (freeway vs. local streets, weather conditions, etc.). With more care in producing the data, we could conceivably arrive at a more definite answer.

Reference
Myles Hollander and Douglas A. Wolfe, Non-parametric Statistical Methods, Second Edition, Wiley (1999)

Confidence intervals for percentiles

The order statistics play an important role in both descriptive statistics and non-parametric inferences. Sample percentiles (median, quartiles, etc) can be defined using the order statistics and can be used as point estimates for the corresponding percentiles in the population. For example, with a random sample of size n=24, the 6^{th} order statistic Y_6 is the sample 24^{th} percentile and is an estimate of the unknown population 24^{th} percentile \tau_{0.24}. The justification is that the area under the density curve of the distribution and to the left of Y_6 is on average =\frac{6}{24+1}=0.24 (see the discussion below). The order statistics can also be used for constructing confidence intervals for unknown population percentiles. Such confidence intervals are often called distribution-free confidence intervals because no information about the underlying distribution is used in the construction. In the previous post (The order statistics and the uniform distribution), an example was given demonstrating how confidence intervals for percentiles of a continuous distribution are constructed. In this post, we describe the general algorithm in greater detail and present another example. For more information about distribution-free inferences, see [Hollander & Wolfe].

Let X_1,X_2, \cdots, X_n be a random sample drawn from a continuous distribution with X, F(x) and f(x) denoting the common random variable, the common distribution function and probability density function, respectively. Let Y_1<Y_2< \cdots <Y_n be the associated order statistics. Let W_i=F(Y_i). Note that F(Y_i) can be interpreted as an area under the density curve:

    \displaystyle W_i=F(Y_i)=\int_{-\infty}^{Y_i}f(x) dx

In the previous post (The order statistics and the uniform distribution), we showed that \displaystyle E[W_i]=\frac{i}{n+1}. On this basis, Y_i is defined as the sample (100p)^{th} percentile where \displaystyle p=\frac{i}{n+1} and is used as an estimate for the unknown population (100p)^{th} percentile.

The construction of confidence intervals for percentiles is based on the probability P[Y_i < \tau_p < Y_j] where \tau_p is the (100p)^{th} percentile. Let’s take the median as an example and consider P[Y_2 < \tau_{0.5} < Y_8]. For Y_2 < \tau_{0.5} to happen, there must be at least two sample items X_k that are less than \tau_{0.5}. For \tau_{0.5} < Y_8 to happen, there can be no more than 7 sample items X_k that are less than \tau_{0.5}. In drawing each sample item, consider X < \tau_{0.5} as a success. The probability of a success is thus p=P[X<\tau_{0.5}]=0.5. We are interested in the probability of having at least 2 and at most 7 successes. Thus we have:

    \displaystyle P[Y_2 < \tau_{0.5} < Y_8]=\sum \limits_{k=2}^{7} \binom{n}{k} \biggl(\frac{1}{2}\biggr)^k \biggl(\frac{1}{2}\biggr)^{n-k}=1-\alpha

Then the interval (Y_2,Y_8) is taken to be the 100(1-\alpha) % confidence interval for the unknown population median.

The above discussion can easily be generalized. The following computes the probability P[Y_i < \tau_p < Y_j] where \tau_p is the (100p)^{th} percentile and p=P[X < \tau_p].

    \displaystyle P[Y_i < \tau_{p} < Y_j]=\sum \limits_{k=i}^{j-1} \binom{n}{k} p^k (1-p)^{n-k}=1-\alpha

Then the interval (Y_i,Y_j) is taken to be the 100(1-\alpha) % confidence interval for the unknown population percentile \tau_p. The above probability is based on the binomial distribution with parameters n and p=P(X<\tau_p). Its mean is np and its variance is np(1-p). This fact becomes useful when using normal approximation of the above probability.
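A sketch of this computation in Python (standard library only; the function names are ours) is given below. The coverage function evaluates the binomial sum above, and percentile_ci scans pairs of order-statistic indices for the narrowest interval whose exact coverage meets a requested level.

    from math import comb

    def coverage(i, j, n, p):
        """P[Y_i < tau_p < Y_j] for a sample of size n, where p = P[X < tau_p]."""
        return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(i, j))

    def percentile_ci(n, p, level):
        """Narrowest (by index span) pair (i, j) with coverage at least the given level."""
        best = None
        for i in range(1, n + 1):
            for j in range(i + 1, n + 1):
                c = coverage(i, j, n, p)
                if c >= level and (best is None or (j - i) < (best[1] - best[0])):
                    best = (i, j, c)
        return best

    # The median example discussed below: n = 15 and p = 0.5
    print(coverage(4, 12, 15, 0.5))        # 0.96484375, the coverage of (Y_4, Y_12)
    print(percentile_ci(15, 0.5, 0.95))    # (4, 12, 0.96484375)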

Note that the wider the interval estimate, the more confidence can be attached to it. On the other hand, the more precise the interval estimate, the less confidence can be attached. This is true for parametric methods and is also true for the non-parametric method at hand. Though this is clear, we would like to call it out for the sake of completeness. For example, as confidence intervals for the median, (Y_2,Y_{15}) has a higher confidence level than the interval (Y_6,Y_{10}). Note that of the two probabilities below, the first one is higher.

    \displaystyle P[Y_2 < \tau_{0.5} < Y_{15}]=\sum \limits_{k=2}^{14} \binom{n}{k} \biggl(\frac{1}{2}\biggr)^k \biggl(\frac{1}{2}\biggr)^{n-k}

    \displaystyle P[Y_6 < \tau_{0.5} < Y_{10}]=\sum \limits_{k=6}^{9} \binom{n}{k} \biggl(\frac{1}{2}\biggr)^k \biggl(\frac{1}{2}\biggr)^{n-k}

Example
The following matrix contains a random sample of n=15 grocery purchase amounts of a certain family in 2009. The data are arranged in increasing order on each row from left to right.

    \displaystyle \begin{pmatrix} 3.34&14.70&45.71&47.69&48.25 \\{52.22}&57.25&60.79&63.87&66.85 \\{88.13}&101.81&147.33&165.10&168.28 \end{pmatrix}

Find the sample median and the sample upper quartile. Construct an approximate 96% confidence interval for the population median.

The sample median \hat{\tau}_{0.5}=60.79, the 8^{th} grocery purchase. The upper quartile (75^{th} percentile) is the 12^{th} grocery purchase 101.81.

To construct a confidence interval for the median, we need to compute the probability P[Y_i< \tau_{0.5} < Y_j]. We use the interval (Y_{4},Y_{12}) because of the following probability:

    \displaystyle P[Y_{4} < \tau_{0.5} < Y_{12}]=\sum \limits_{k=4}^{11} \binom{15}{k} \biggl(\frac{1}{2}\biggr)^k \biggl(\frac{1}{2}\biggr)^{15-k}=0.96484375

Thus the interval (47.69,101.81) is an approximate 96% confidence interval for the median grocery purchase amount for this family. The above calculation is made using an Excel spreadsheet. Let's compare this answer with a normal approximation. The mean of the binomial distribution in question is 15(0.5)=7.5 and the variance is 15(0.5)(0.5)=3.75. Consider the following:

    \displaystyle \Phi \biggl(\frac{11.5-7.5}{\sqrt{3.75}}\biggr)-\Phi \biggl(\frac{3.5-7.5}{\sqrt{3.75}}\biggr)

    \displaystyle =\Phi \biggl(2.07\biggr)-\Phi \biggl(-2.07\biggr)=0.9808-0.0192=0.9616

The normal approximation is quite good.
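The exact coverage and the normal approximation above are easy to reproduce; the following is a small sketch in Python (SciPy assumed for the normal CDF).

    from math import comb, sqrt
    from scipy.stats import norm

    n, p = 15, 0.5
    exact = sum(comb(n, k) * 0.5**n for k in range(4, 12))    # P[Y_4 < median < Y_12]

    mu, sigma = n * p, sqrt(n * p * (1 - p))                  # 7.5 and sqrt(3.75)
    approx = norm.cdf((11.5 - mu) / sigma) - norm.cdf((3.5 - mu) / sigma)

    print(exact, approx)
    # exact = 0.96484375; approx is about 0.961 (the text's 0.9616 uses z rounded to 2.07)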

Reference
Myles Hollander and Douglas A. Wolfe, Non-parametric Statistical Methods, Second Edition, Wiley (1999)


The order statistics and the uniform distribution

In this post, we show that the order statistics of the uniform distribution on the unit interval are distributed according to the beta distributions. This leads to a discussion on estimation of percentiles using order statistics. We also present an example of using order statistics to construct confidence intervals of population percentiles. For a discussion on the distributions of order statistics of random samples drawn from a continuous distribution, see the previous post The distributions of the order statistics.

Suppose that we have a random sample X_1,X_2,\cdots,X_n of size n from a continuous distribution with common distribution function F_X(x)=F(x) and common density function f_X(x)=f(x). The order statistics Y_1<Y_2< \cdots <Y_n are obtained by ordering the sample X_1,X_2,\cdots,X_n in ascending order. In other words, Y_1 is the smallest item in the sample and Y_2 is the second smallest item in the sample and so on. Since this is random sampling from a continuous distribution, we assume that the probability of a tie between two order statistics is zero. In the previous post The distributions of the order statistics, we derive the probability density function of the i^{th} order statistic:

    \displaystyle f_{Y_i}(y)=\frac{n!}{(i-1)! (n-i)!} \thinspace F(y)^{i-1} \thinspace [1-F(y)]^{n-i} f(y)

The Order Statistics of the Uniform Distribution
Suppose that the random sample X_1,X_2, \cdots, X_n is drawn from U(0,1). Since the distribution function of U(0,1) is F(y)=y where 0<y<1, the probability density function of the i^{th} order statistic is:

    \displaystyle f_{Y_i}(y)=\frac{n!}{(i-1)! (n-i)!} \thinspace y^{i-1} \thinspace [1-y]^{n-i} where 0<y<1.

The above density function is from the family of beta distributions. In general, the pdf of a beta distribution and its mean and variance are:

    \displaystyle f_{W}(w)=\frac{\Gamma(a+b)}{\Gamma(a) \Gamma(b)} \thinspace w^{a-1} \thinspace [1-w]^{b-1} where 0<w<1 where \Gamma(\cdot) is the gamma function.

      \displaystyle E[W]=\frac{a}{a+b}

      \displaystyle Var[W]=\frac{ab}{(a+b)^2 (a+b+1)}

Then, the following shows the pdf of the i^{th} order statistic of the uniform distribution on the unit interval and its mean and variance:

    \displaystyle f_{Y_i}(y)=\frac{\Gamma(n+1)}{\Gamma(i) \Gamma(n-i+1)} \thinspace y^{i-1} \thinspace [1-y]^{(n-i+1)-1} where 0<y<1.

      \displaystyle E[Y_i]=\frac{i}{i+(n-i+1)}=\frac{i}{n+1}

      \displaystyle Var[Y_i]=\frac{i(n-i+1)}{(n+1)^2 (n+2)}
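As a quick numerical sanity check of these formulas, the sketch below (Python with NumPy and SciPy; the choices n = 10 and i = 3 are arbitrary) simulates many uniform samples, takes the i^{th} order statistic from each, and compares the empirical mean and variance with i/(n+1) and the beta-based variance.

    import numpy as np
    from scipy.stats import beta

    rng = np.random.default_rng(0)
    n, i = 10, 3                                  # sample size and order-statistic index

    samples = rng.uniform(size=(200_000, n))
    y_i = np.sort(samples, axis=1)[:, i - 1]      # the i-th smallest value in each sample

    a, b = i, n - i + 1                           # Y_i ~ beta(i, n - i + 1)
    print(y_i.mean(), beta.mean(a, b))            # both close to i/(n+1) = 3/11
    print(y_i.var(), beta.var(a, b))              # both close to i(n-i+1)/((n+1)^2 (n+2))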

Estimation of Percentiles
In descriptive statistics, we define the sample percentiles using the order statistics (even though the term order statistics may not be used in a non-calculus based introductory statistics course). For example, if sample size is an odd integer n=2m+1, then the sample median is the order statistic Y_{m+1}. The preceding discussion on the order statistics of the uniform distribution can show us that this approach is a sound one.

Suppose we have a random sample of size n from an arbitrary continuous distribution. The order statistics listed in ascending order are:

    \displaystyle Y_1<Y_2<Y_3< \cdots <Y_n

For each i \le n, consider W_i=F(Y_i). Since the distribution function F(x) is a non-decreasing function, the W_i are also increasing:

    \displaystyle W_1<W_2<W_3< \cdots <W_n

It can be shown that if F(x) is a distribution function of a continuous random variable X, then the transformation F(X) follows the uniform distribution U(0,1). Then the following transformed random sample:

    \displaystyle F(X_1),F(X_2), \cdots, F(X_n)

is a random sample drawn from the uniform distribution U(0,1). Furthermore, the W_i are the order statistics for this random sample. By the preceding discussion, \displaystyle E[W_i]=E[F(Y_i)]=\frac{i}{n+1}. Note that F(Y_i) is the area under the density function f(x) and to the left of Y_i. Thus F(Y_i) is a random area and E[W_i]=E[F(Y_i)] is the expected area under the density curve f(x) to the left of Y_i. Recall that f(x) is the common density function of the original sample X_1,X_2,\cdots,X_n.

For example, suppose the sample size n is an odd integer where n=2m+1. Then the sample median is Y_{m+1}. Note that \displaystyle E[W_{m+1}]=\frac{m+1}{n+1}=\frac{1}{2}. Thus if we choose Y_{m+1} as a point estimate for the population median, Y_{m+1} is expected to be above the bottom 50% of the population and is expected to be below the upper 50% of the population.

Furthermore, E[W_i - W_{i-1}] is the expected area under the density curve and between Y_i and Y_{i-1}. This expected area is:

    \displaystyle E[W_i - W_{i-1}]=E[F(Y_i)]-E[F(Y_{i-1})]=\frac{i}{n+1}-\frac{i-1}{n+1}=\frac{1}{n+1}

The expected area under the density curve and above the maximum order statistic Y_n is:

    \displaystyle E[1-F(Y_n)]=1-\frac{n}{n+1}=\frac{1}{n+1}

Consequently, here is an interesting observation about the order statistics Y_1<Y_2<Y_3< \cdots <Y_n. The order statistics Y_i divide the area under the density curve f(x) and above the x-axis into n+1 areas. On average, each of these areas is \displaystyle \frac{1}{n+1}.

As a result, it makes sense to use order statistics as estimators of percentiles. For example, we can use Y_i as the (100p)^{th} percentile of the sample where \displaystyle p=\frac{i}{n+1}. Then Y_i is an estimator of the population percentile \tau_{p}, where the area under the density curve f(x) and to the left of \tau_{p} is p. In the case that (n+1)p is not an integer, we interpolate between two order statistics. For example, if (n+1)p=5.7, then we interpolate between Y_5 and Y_6.

Example
Suppose we have a random sample of size n=11 drawn from a continuous distribution. Find estimators for the median, the first quartile and the third quartile. Find an estimate for the 85^{th} percentile. Construct an 87% confidence interval for the 40^{th} percentile.

The estimator for the median is Y_6. The estimator for the first quartile (25^{th} percentile) is the third order statistic Y_3. The estimator for the third quartile (75^{th} percentile) is the ninth order statistic Y_9. Based on the preceding discussion, the expected areas under the density curve f(x) to the left of Y_3,Y_6,Y_9 are 0.25, 0.5 and 0.75, respectively.

To find the 85^{th} percentile, note that (n+1)p=12(0.85)=10.2. Thus we interpolate between Y_{10} and Y_{11}. In our example, we use linear interpolation, though taking the arithmetic average of Y_{10} and Y_{11} is also a valid approach. The following is an estimate of the 85^{th} percentile.

    \displaystyle \hat{\tau}_{0.85}=0.8Y_{10}+0.2Y_{11}

To find the confidence interval, consider the probability P[Y_2 < \tau_{0.4} < Y_7] where \tau_{0.4} is the 40^{th} percentile. Consider the event X \le \tau_{0.4} as a success with probability of success p=0.4. For Y_2 < \tau_{0.4} < Y_7 to happen, there must be at least 2 successes and fewer than 7 successes in the binomial distribution with n=11 and p=0.4. Thus we have:

    \displaystyle P[Y_2 < \tau_{0.4} < Y_7]=\sum \limits_{j=2}^{6} \binom{11}{j} 0.4^{j} 0.6^{11-j}=0.8704

Thus the interval (Y_2,Y_7) can be taken as the 87% confidence interval for \tau_{0.4}. This is an example of a distribution-free confidence interval because nothing is assumed about the underlying distribution in the construction of the confidence interval.
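The binomial sum above is simple to verify; a minimal sketch in Python (standard library only):

    from math import comb

    n, p = 11, 0.4
    coverage = sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(2, 7))
    print(round(coverage, 4))    # 0.8704, so (Y_2, Y_7) is roughly an 87% confidence interval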


The distributions of the order statistics

Sample statistics such as the sample median, sample quartiles and the sample minimum and maximum play a prominent role in the analysis of empirical data (e.g. in descriptive statistics and exploratory data analysis (EDA)). In this post we discuss order statistics and their distributions. The order statistics are the items from a random sample arranged in increasing order. The focus here is to present the distribution functions and probability density functions of order statistics. The order statistics are important tools in non-parametric statistical inferences. In subsequent posts, we will present examples of applications in non-parametric methods.

In this post, we only consider random samples obtained from a continuous distribution (i.e. the distribution function is a continuous function). Let X_1,X_2, \cdots, X_n be a random sample of size n from a continuous distribution with distribution function F(x). We order the random sample in increasing order and obtain Y_1,Y_2, \cdots, Y_n. In other words, we have:

    Y_1= the smallest of X_1,X_2, \cdots, X_n
    Y_2= the second smallest of X_1,X_2, \cdots, X_n
    \cdot
    \cdot
    \cdot
    Y_n= the largest of X_1,X_2, \cdots, X_n

We set Y_{min}=Y_1 and Y_{max}=Y_n. The order statistic Y_i is called the i^{th} order statistic. Since we are working with a continuous distribution, we assume that the probability of two sample items being equal is zero. Thus we can assume that Y_1<Y_2< \cdots <Y_n. That is, the probability of a tie is zero among the order statistics.

The Distribution Functions of the Order Statistics
The distribution function of Y_i is an upper tail of a binomial distribution. If the event Y_i \le y occurs, then there are at least i many X_j in the sample that are less than or equal to y. Consider the event that X \le y as a success and F(y)=P[X \le y] as the probability of success. Then the drawing of each sample item becomes a Bernoulli trial (a success or a failure). We are interested in the probability of having at least i many successes. Thus the following is the distribution function of Y_i:

    \displaystyle F_{Y_i}(y)=P[Y_i \le y]=\sum \limits_{k=i}^{n} \binom{n}{k} F(y)^k [1-F(y)]^{n-k}\ \ \ \ \ \ \ \ \ \ \ \  (1)

The following relationship is used in deriving the probability density function:

    \displaystyle F_{Y_i}(y)=F_{Y_{i-1}}(y)-\binom{n}{i-1} F(y)^{i-1} [1-F(y)]^{n-i+1} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)

The Probability Density Functions of the Order Statistics
The probability density function of Y_i is given by:

    \displaystyle f_{Y_i}(y)=\frac{n!}{(i-1)! (n-i)!} \thinspace F(y)^{i-1} \thinspace [1-F(y)]^{n-i} f_X(y) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3)

We prove this by induction. Consider i=1. Note that F_{Y_1}(y) is the probability that at least one X_j \le y and is the complement of the probability of having no X_j \le y. Thus F_{Y_1}(y)=1-[1-F(y)]^n. By taking the derivative, we have:

    \displaystyle f_{Y_1}(y)=F_{Y_1}^{\prime}(y)=n [1-F(y)]^{n-1} f_X(y)

Suppose that the pdf of Y_{i-1} is given by (3), i.e. suppose that:

    \displaystyle f_{Y_{i-1}}(y)=\frac{n!}{(i-2)! (n-i+1)!} \thinspace F(y)^{i-2} \thinspace [1-F(y)]^{n-i+1} f_X(y)

Now we take the derivative of (2) above and we have:

    \displaystyle f_{Y_i}(y)=f_{Y_{i-1}}(y)-\biggl[(i-1)\binom{n}{i-1} F(y)^{i-2} f_X(y)[1-F(y)]^{n-i+1}
    \displaystyle -\ \ \ \ \ \ \ \ \ \ \binom{n}{i-1}F(y)^{i-1}(n-i+1)[1-F(y)]^{n-i} f_X(y) \biggr]

After simplifying the right hand side, we obtain the pdf of Y_i as in (3).

Comments
We would like to make two comments. The first is that, in terms of problem solving, it may be better to rely on the distribution function in (1) above to derive the pdf; the thought process behind (1) is clear. The second is that the last three factors in the pdf in (3) are very instructive. Let’s arrange these three factors as follows:

    \displaystyle F(y)^{i-1} \thinspace f_X(y) \thinspace [1-F(y)]^{n-i}

Note that the first factor is the probability that there are i-1 sample items below y. The middle factor indicates that one sample item is right around y. The third factor indicates that there are n-i items above y. Thus the pdf in (3) is the following multinomial probability:

    \displaystyle f_{Y_i}(y)=\frac{n!}{(i-1)! 1! (n-i)!} \thinspace F(y)^{i-1} \thinspace f_X(y) \thinspace [1-F(y)]^{n-i} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4)

This heuristic approach is further described here.
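As a sanity check of formula (3) (equivalently (4)), the sketch below in Python (NumPy and SciPy assumed) computes the density of the 3rd order statistic in a sample of n = 5 standard exponential variables from the formula, verifies that it integrates to 1, and compares its mean with a simulation; the choice of parent distribution and sample size is arbitrary.

    import numpy as np
    from math import factorial
    from scipy.stats import expon
    from scipy.integrate import quad

    n, i = 5, 3
    F, f = expon.cdf, expon.pdf                   # standard exponential as the parent

    def pdf_order_stat(y):
        """Formula (3): pdf of the i-th order statistic of a sample of size n."""
        c = factorial(n) / (factorial(i - 1) * factorial(n - i))
        return c * F(y)**(i - 1) * (1 - F(y))**(n - i) * f(y)

    total, _ = quad(pdf_order_stat, 0, np.inf)                 # should be 1
    mean, _ = quad(lambda y: y * pdf_order_stat(y), 0, np.inf)

    rng = np.random.default_rng(0)
    sim = np.sort(rng.exponential(size=(200_000, n)), axis=1)[:, i - 1]
    print(total, mean, sim.mean())                # 1.0, and the two means agree closely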

Example
Suppose that a sample of size n=11 is drawn from the uniform distribution on the interval (0, \theta). Find the pdfs for Y_{min}=Y_1, Y_{max}=Y_{11} and Y_6. Find E[Y_6].

Let X \sim uniform(0,\theta). The distribution function and pdf of X are:

    \displaystyle F(y)=\left\{\begin{matrix}0&\thinspace y<0\\{\displaystyle \frac{y}{\theta}}&\thinspace 0 \le y < \theta\\{1}&\thinspace y \ge \theta\end{matrix}\right.

    \displaystyle f(y)=\left\{\begin{matrix}\displaystyle \frac{1}{\theta}&\thinspace 0<y<\theta\\{0}&\thinspace otherwise\end{matrix}\right.

Using (3), the following are the pdfs of Y_1, Y_{11} and Y_6.

    \displaystyle f_{Y_1}(y)=\frac{11}{\theta^{11}} (\theta-y)^{10}

    \displaystyle f_{Y_{11}}(y)=\frac{11}{\theta^{11}} y^{10}

    \displaystyle f_{Y_6}(y)=2772 \biggl(\frac{y}{\theta}\biggr)^5 \biggl(1-\frac{y}{\theta}\biggr)^5 \frac{1}{\theta}

In this example, Y_6 is the sample median and serves as a point estimate for the population median \frac{\theta}{2}. As an estimator of the median, we prefer that Y_6 neither overestimate nor underestimate \frac{\theta}{2} on average (such an estimator is called an unbiased estimator). In this particular example, the sample median Y_6 is an unbiased estimator of \frac{\theta}{2}. To see this, we show E[Y_6]=\frac{\theta}{2}.

    \displaystyle E[Y_6]=\int_0^{\theta}2772 y \biggl(\frac{y}{\theta}\biggr)^5 \biggl(1-\frac{y}{\theta}\biggr)^5 \frac{1}{\theta} dy

    By substituting w=\frac{y}{\theta}, we have the following beta integral.

    \displaystyle E[Y_6]=2772 \theta \int_0^1 w^{7-1} (1-w)^{6-1} dw

    \displaystyle E[Y_6]=2772 \theta \thinspace \frac{\Gamma(7) \Gamma(6)}{\Gamma(13)}=2772 \theta \thinspace \frac{6! \thinspace 5!}{12!}=\frac{\theta}{2}
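The integral can also be checked numerically; here is a small sketch in Python (SciPy assumed for the integration), with \theta set to 1 for convenience so that the answer should be 0.5.

    from scipy.integrate import quad

    theta = 1.0
    integrand = lambda y: 2772 * y * (y / theta)**5 * (1 - y / theta)**5 / theta
    value, _ = quad(integrand, 0, theta)
    print(value)    # 0.5, i.e. theta/2, confirming that Y_6 is unbiased for the median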

________________________________________________________________________

Practice problems

Practice problems are found here in a companion blog.


The matching problem

One of the classic problems in probability theory is the “matching problem.” This problem has many variations and dates back to the early 18th century. There are many ways to describe the problem. One such description is the example of matching letters with envelopes. Suppose there are n letters with n matching envelopes (assume that each letter has only one matching envelope). Further suppose that the secretary stuffs the letters randomly into envelopes. What is the probability that every letter is matched correctly, that no letter is matched correctly, or that exactly k letters are stuffed into the correct envelopes? In this post we solve a simpler problem: what is the probability that at least one letter is stuffed into a correct envelope? Though the answer is in terms of the total number of letters n, it turns out that the answer is practically independent of n and is roughly 0.63.

We first state the formula for the union of n events. After applying this formula to obtain the answer to the matching problem, we give a sketch of a proof of the formula.

The union of n events
For any n events E_1,E_2, \cdots,E_n that are defined on the same sample space, we have the following formula:

P[E_1 \cup E_2 \cup \cdots \cup E_n]=\sum \limits_{m=1}^{n} (-1)^{m+1} \thinspace S_m where

S_1=\sum \limits_{r=1}^{n}P[E_r],

S_2=\sum \limits_{j<k}P[E_j \cap E_k],

S_m=\sum P[E_{i(1)} \cap E_{i(2)} \cap \cdots \cap E_{i(m)}]

Note that in the general term S_m, the sum is taken over all increasing sequences i(\cdot), i.e. 1 \le i(1) < i(2) < \cdots < i(m) \le n. For n=2,3, we have the following familiar formulas:

\displaystyle P[E_1 \cup E_2]=P[E_1]+P[E_2]-P[E_1 \cap E_2]

\displaystyle P[E_1 \cup E_2 \cup E_3]=
\displaystyle P[E_1]+P[E_2]+P[E_3]-P[E_1 \cap E_2]-P[E_1 \cap E_3]-P[E_2 \cap E_3]+P[E_1 \cap E_2 \cap E_3]

The matching problem
Suppose that the n letters are numbered 1,2, \cdots, n. Let E_i be the event that the i^{th} letter is stuffed into the correct envelope. Then P[E_1 \cup E_2 \cup \cdots \cup E_n] is the probability that at least one letter is matched with the correct envelope.

The probability of the intersection of m events is:

\displaystyle P[E_{i(1)} \cap E_{i(2)} \cap \cdots \cap E_{i(m)}]=\frac{(n-m)!}{n!}.

For m=1, we have P[E_i]=\frac{(n-1)!}{n!}=\frac{1}{n}. Note that the i^{th} position is fixed and we permute the other n-1 positions. For m=2, we have P[E_i \cap E_j]=\frac{(n-2)!}{n!}. Here, the i^{th} and j^{th} positions are fixed and we permute the other n-2 positions.

We now show that S_m=\frac{1}{m!}. The sum S_m has \binom{n}{m} terms, one for each choice of m letters out of n. Thus we have:

\displaystyle S_m=\binom{n}{m} \frac{(n-m)!}{n!}=\frac{1}{m!}

Using the formula for the union of n events, we have:

\displaystyle P[\bigcup_{m=1}^{n} E_m]=1-\frac{1}{2!}+\frac{1}{3!}- \cdots + (-1)^{n+1}\frac{1}{n!}

\displaystyle 1-P[\bigcup_{m=1}^{n} E_m]=1-1+\frac{1}{2!}-\frac{1}{3!}+ \cdots + (-1)^{n}\frac{1}{n!}

Note that the right hand side of the above is the first n+1 terms of the expansion of e^{-1}. Thus we have:

\displaystyle \lim_{n \rightarrow \infty}P[\bigcup_{m=1}^{n} E_m]=\lim_{n \rightarrow \infty}\biggl(1-\frac{1}{2!}+\frac{1}{3!}- \cdots + (-1)^{n+1}\frac{1}{n!}\biggr)

=1-e^{-1}=0.632121

The above limit converges quite rapidly. Let P_n=P[\bigcup_{m=1}^{n} E_m]. The following matrix shows the first several terms of this limit.

\displaystyle \begin{pmatrix} n&P_n \\{2}&0.50000 \\{3}&0.66667 \\{4}&0.62500 \\{5}&0.63333 \\{6}&0.63194 \\{7}&0.63214 \\{8}&0.63212\end{pmatrix}
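The convergence can be seen both from the series and from a direct simulation. The following is a sketch in Python (NumPy assumed for the random permutations):

    import numpy as np
    from math import factorial

    def p_at_least_one_match(n):
        """P[at least one letter in its correct envelope], from the inclusion-exclusion series."""
        return sum((-1) ** (m + 1) / factorial(m) for m in range(1, n + 1))

    for n in range(2, 9):
        print(n, round(p_at_least_one_match(n), 5))      # reproduces the table above

    # Monte Carlo check for n = 8: a random permutation has a match if any letter stays put
    rng = np.random.default_rng(0)
    n, trials = 8, 200_000
    hits = sum(np.any(rng.permutation(n) == np.arange(n)) for _ in range(trials))
    print(hits / trials, 1 - np.exp(-1))                 # both close to 0.632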

Sketch of proof
We now discuss the union formula:

P[E_1 \cup E_2 \cup \cdots \cup E_n]=\sum \limits_{m=1}^{n} (-1)^{m+1} \thinspace S_m

The key idea is that any sample point in the union E_1 \cup \cdots \cup E_n is counted exactly one time on the right hand side of the formula. Suppose that a sample point is in exactly t of the events, say E_{i(1)}, \cdots, E_{i(t)}. Then the following shows the number of times the sample point is counted in each expression:

\displaystyle \sum \limits_{m=1}^{n}P[E_m] \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ t times

\displaystyle \sum \limits_{a<b}P[E_a \cap E_b] \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \binom{t}{2} times

\displaystyle \sum \limits_{a<b<c}P[E_a \cap E_b \cap E_c] \ \ \ \ \ \ \ \ \ \binom{t}{3} times and so on

Thus the sample point in question will be counted exactly H times in the right hand side of the formula.

\displaystyle H=\binom{t}{1}-\binom{t}{2}+\binom{t}{3}- \cdots + (-1)^{t+1}\binom{t}{t}

The following is the derivation that H=1.

\displaystyle 0=(1-1)^t=\sum \limits_{a=0}^{t} \binom{t}{a}(-1)^{a}(1)^{t-a}=\binom{t}{0}-\binom{t}{1}+\binom{t}{2}- \cdots +(-1)^t \binom{t}{t}

\displaystyle 1=\binom{t}{0}=\binom{t}{1}-\binom{t}{2}+ \cdots +(-1)^{t+1} \binom{t}{t}=H

The capture-recapture method

The capture-recapture method is one of the methods for estimating the size of wildlife populations and is based on the hypergeometric distribution. Recall that the hypergeometric distribution is a three-parameter family of discrete distributions and one of the parameters, denoted by N in this post, is the size of the population. We show that the estimate for the parameter N that is obtained from the capture-recapture method is the value of the parameter N that makes the observed data “more likely” than any other possible values of N. Thus, the capture-recapture method produces the maximum likelihood estimate of the population size parameter N of the hypergeometric distribution.

Let’s start with an example. In order to estimate the size of the population of bluegills (a species of fresh water fish) in a small lake in Missouri, a total of w=250 bluegills are captured, tagged and then released. After allowing sufficient time for the tagged fish to disperse, a sample of n=150 bluegills was caught. It was found that y=16 bluegills in the sample were tagged. Estimate the size of the bluegill population in this lake.

Let N be the size of the bluegill population in this lake. The population proportion of the tagged bluegills is \frac{w}{N}. The sample proportion of the tagged bluegills is \frac{y}{n}. In the capture-recapture method, the population proportion and the sample proportion are set equaled. Then we solve for N.

\displaystyle \frac{w}{N}=\frac{y}{n} \Rightarrow N=\frac{w n}{y}=\frac{250(150)}{16}=2343.75 \approx 2343

Now, the connection to the hypergeometric distribution. After w=250 bluegills were captured, tagged and released, the population is separated into two distinct classes, tagged and non-tagged. When a sample of n=150 bluegills was selected without replacement, we let Y be the number of bluegills in the sample that were tagged. The distribution of Y is the hypergeometric distribution. The following is the probability function of Y.

\displaystyle P[Y=y]=\frac{\binom{w}{y} \thinspace \binom{N-w}{n-y}}{\binom{N}{n}}

In the hypergeometric distribution described here, the parameters w and n are known (w=250 and n=150). We now show that the estimate of N=2343 is the estimate that makes the observed value of y=16 “most likely” (i.e. the estimate of N=2343 is a maximum likelihood estimate of N). To show this, we consider the ratio of the hypergeometric probabilities for two successive values of N.

\displaystyle \frac{P(N)}{P(N-1)}=\frac{(N-w)(N-n)}{N(N-w-n+y)}

where \displaystyle P(N)=\frac{\binom{w}{y} \thinspace \binom{N-w}{n-y}}{\binom{N}{n}} and \displaystyle P(N-1)=\frac{\binom{w}{y} \thinspace \binom{N-1-w}{n-y}}{\binom{N-1}{n}}

Note that 1<\frac{P(N)}{P(N-1)} or P(N-1)<P(N) if and only if the following holds:

\displaystyle N(N-w-n+y)<(N-w)(N-n)

\displaystyle N<\frac{w n}{y}

Note that \frac{w n}{y} is the estimate from the capture-recapture method. It is also an upper bound for the population size N such that the probability P(N) is greater than P(N-1). This implies that the maximum likelihood estimate of N is achieved when the estimate is \hat{N}=\frac{w n}{y}.

As an illustration, we compute the probabilities \displaystyle P(N)=\frac{\binom{250}{16} \thinspace \binom{N-250}{150-16}}{\binom{N}{150}} for several values of N above and below N=2343. The following matrix illustrates that the maximum likelihood is achieved at N=2343.

\displaystyle \begin{pmatrix} N&P(N) \\{2340}&0.1084918 \\{2341}&0.1084929 \\{2342}&0.1084935 \\{2343}&0.1084938 \\{2344}&0.1084937 \\{2345}&0.1084933 \\{2346}&0.1084924\end{pmatrix}
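The table can be reproduced with a direct search over N. The sketch below (Python with SciPy’s hypergeometric distribution) confirms that P(N) peaks at N=2343.

    from scipy.stats import hypergeom

    w, n, y = 250, 150, 16        # tagged fish, sample size, tagged fish in the sample

    def likelihood(pop_size):
        # SciPy's hypergeom.pmf arguments: (k, population size, number tagged, sample size)
        return hypergeom.pmf(y, pop_size, w, n)

    candidates = range(2300, 2400)
    N_hat = max(candidates, key=likelihood)
    print(N_hat, likelihood(N_hat))    # 2343, the floor of w*n/y = 2343.75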