The sign test, more examples

This is a continuation of the previous post The sign test. Examples 1 and 2 are presented in the previous post. In this post we present three more examples. Example 3 is a matched pairs problem and demonstrates that the sign test may not be as powerful as the t-test when the population is close to normal. Example 4 is a one-sample location problem. Example 5 is an application of the sign test when the outcomes of the study or experiment are not numerical. For more information about distribution-free inferences, see [Hollander & Wolfe].

Example 3
Courses in introductory statistics are increasingly popular at community colleges across the United States. These are statistics courses that teach basic concepts of descriptive statistics, probability notions and basic inferential statistical procedures such as one and two-sample t procedures. A certain teacher of statistics at a local community college believes that taking such a course improves students’ quantitative skills. At the beginning of one semester, this professor administered a quantitative diagnostic test to a group of 15 students taking an introductory statistics course. At the end of the semester, the professor administered a second quantitative diagnostic test. The maximum possible score on each test is 50. Though the second test was at a similar level of difficulty as the first test, the questions in the second test were different and the contexts of the problems were different. Thus simply having taken the first test should not improve a student’s score on the second test. The following matrix shows the scores before and after taking the statistics course:

\displaystyle \begin{pmatrix} \text{Student}&\text{Pre-Statistics}&\text{Post-Statistics}&\text{Diff} \\{1}&17&21&4 \\{2}&26&26&0 \\{3}&16&19&3 \\{4}&28&26&-2 \\{5}&23&30&7 \\{6}&35&40&5 \\{7}&41&43&2 \\{8}&18&15&-3 \\{9}&30&29&-1 \\{10}&29&31&2 \\{11}&45&46&1 \\{12}&8&7&-1 \\{13}&38&43&5 \\{14}&31&31&0 \\{15}&36&37&1 \end{pmatrix}

Is there evidence that taking an introductory statistics course at a community college improves students’ quantitative skills? Do the analysis using the sign test.

For a given student, let X be the post-statistics score on the diagnostic test and let Y be the pre-statistics score on the diagnostic test. Let p=P[X>Y]. This is the probability that the student has an improvement on the quantitative test after taking a one-semester introductory statistics course. The test hypotheses are as follows:

\displaystyle H_0:p=\frac{1}{2} \ \ \ \ H_1:p>\frac{1}{2}

Another interpretation of the above alternative hypothesis is that the median of the post-statistics quantitative scores has moved upward. Let W be the number of students whose score improved from the pre-statistics test to the post-statistics test. The two students with a zero difference are omitted, leaving 13 students, so under H_0, W \sim \text{binomial}(13,0.5). The observed value of W is w=9. The following is the P-value:

\displaystyle \text{P-value}=P[W \ge 9]=\sum \limits_{k=9}^{13} \binom{13}{k} \biggl(\frac{1}{2}\biggr)^{13}=0.1334
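The calculation above can be reproduced with a short Python sketch. It uses only the standard library; the variable names are illustrative and the scores are copied from the matrix in this example.

```python
from math import comb

# Scores copied from the matrix above (students 1 through 15).
pre  = [17, 26, 16, 28, 23, 35, 41, 18, 30, 29, 45, 8, 38, 31, 36]
post = [21, 26, 19, 26, 30, 40, 43, 15, 29, 31, 46, 7, 43, 31, 37]

# Differences are post minus pre; zero differences (ties) are omitted.
diffs = [x - y for x, y in zip(post, pre)]
nonzero = [d for d in diffs if d != 0]

m = len(nonzero)                    # number of untied pairs
w = sum(d > 0 for d in nonzero)     # number of plus signs

# Right-tailed P-value: P[W >= w] with W ~ binomial(m, 1/2).
p_value = sum(comb(m, k) for k in range(w, m + 1)) / 2 ** m

print(m, w, round(p_value, 4))  # 13 9 0.1334
```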

If we set the probability of a type I error at 0.10, we would not reject the null hypothesis H_0, since the P-value exceeds 0.10. Thus based on the sign test, the data do not provide sufficient evidence that taking an introductory statistics course improves a student’s quantitative skills.

The data set for the differences in scores appears symmetric and has no strong skewness and no obvious outliers. So it should be safe to use the t-test. With \mu_d being the mean of X-Y, the hypotheses for the t-test are:

\displaystyle H_0:\mu_d=0 \ \ \ \ H_1:\mu_d>0

We obtain a t-score of 2.08 (with 14 degrees of freedom) and a P-value of 0.028. Thus with the t-test, we would reject the null hypothesis at the 0.10 level and reach the opposite conclusion. Because the sign test does not use all the available information in the data, it is not as powerful as the t-test.
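As a rough check on the t-score, the following Python sketch computes the matched pairs t statistic from the 15 differences (post minus pre) using only the standard library. It stops at the statistic. The P-value is the upper-tail area of the t distribution with 14 degrees of freedom, which needs a t table or statistical software and is not computed here.

```python
from math import sqrt
from statistics import mean, stdev

# Differences (post minus pre) copied from the Diff column of the matrix.
diffs = [4, 0, 3, -2, 7, 5, 2, -3, -1, 2, 1, -1, 5, 0, 1]

n = len(diffs)
standard_error = stdev(diffs) / sqrt(n)   # sample standard deviation / sqrt(n)
t_score = mean(diffs) / standard_error    # test statistic for H_0: mu_d = 0

print(n - 1, round(t_score, 2))  # 14 2.08
```

Unlike the sign test, the t statistic uses all 15 differences, including the two zeros, and it uses their magnitudes as well as their signs.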

Example 4
Acid rain is an environmental challenge in many places around the world. It refers to rain or any other form of precipitation that is unusually acidic, i.e. rainwater having elevated levels of hydrogen ions (low pH). The measure of pH is a measure of the acidity or basicity of a solution and has a scale ranging from 0 to 14. Distilled water, with carbon dioxide removed, has a neutral pH level of 7. Liquids with a pH less than 7 are acidic. However, even unpolluted rainwater is slightly acidic, with pH varying between 5.2 and 6.0, because carbon dioxide and water in the air react to form carbonic acid. Thus, rainwater is only considered acidic if the pH level is less than 5.2.

In a remote region in Washington state, an environmental biologist measured the pH levels of rainwater and obtained the following data for 16 rainwater samples on 16 different dates:

\displaystyle \begin{pmatrix} 4.73&4.79&4.87&4.88 \\{5.04}&5.06&5.07&5.09 \\{5.11}&5.16&5.18&5.21 \\{5.23}&5.24&5.25&5.25 \end{pmatrix}

Is there reason to believe that the rainwater from this region is considered acidic (less than 5.2)? Use the sign test to perform the analysis.

Let X be the pH level of a sample of rainwater in this region of Washington state. Let p=P[5.2>X]=P[5.2-X>0]. Thus p is the probability of a plus sign when comparing each data measurement with 5.2. The hypotheses to be tested are:

\displaystyle H_0:p=\frac{1}{2} \ \ \ \ H_1:p>\frac{1}{2}

The null hypothesis H_0 is equivalent to the statement that the median pH level is 5.2. If the median pH level is less than 5.2, then a data measurement will be more likely to have a plus sign. Thus the above alternative hypothesis is the statement that the median pH level is less than 5.2.

Let W be the number of plus signs (i.e. 5.2-X>0). Then W \sim \text{binomial}(16,0.5). There are 11 data measurements with plus signs (w=11). Thus the P-value is:

\displaystyle \text{P-value}=P[W \ge 11]=\sum \limits_{k=11}^{16} \binom{16}{k} \biggl(\frac{1}{2}\biggr)^{16}=0.1051
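The one-sample version of the calculation can be sketched in Python as follows. This is a minimal illustration using only the standard library; the name `median_0` for the hypothesized median is an assumption made for this sketch.

```python
from math import comb

# pH measurements copied from the matrix above.
ph = [4.73, 4.79, 4.87, 4.88, 5.04, 5.06, 5.07, 5.09,
      5.11, 5.16, 5.18, 5.21, 5.23, 5.24, 5.25, 5.25]
median_0 = 5.2   # hypothesized median under H_0

# Measurements equal to the hypothesized median would be omitted (none here).
kept = [x for x in ph if x != median_0]

m = len(kept)
w = sum(x < median_0 for x in kept)   # plus signs: 5.2 - X > 0

# Right-tailed P-value: P[W >= w] with W ~ binomial(m, 1/2).
p_value = sum(comb(m, k) for k in range(w, m + 1)) / 2 ** m

print(m, w, round(p_value, 4))  # 16 11 0.1051
```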

At the level of significance \alpha=0.05, the null hypothesis is not rejected. The data do not provide sufficient evidence that the rainwater in this region is acidic.

Example 5
There are two statistics instructors who are both sought after by students in a local college. Let’s call them instructor A and instructor B. The math department conducted a survey to find out who is more popular with the students. In surveying 15 students, the department found that 11 of the students prefer instructor B over instructor A. Use the sign test to test the hypothesis of no difference in popularity against the alternative hypothesis that instructor B is more popular.

More than \frac{2}{3} of the students in the sample prefer instructor B over A. This seems like convincing evidence that B is indeed more popular. Let us perform a calculation to check this. Let W be the number of students in the sample who prefer B over A. The null hypothesis is that A and B are equally popular. The alternative hypothesis is that B is more popular. If the null hypothesis is true, then W \sim \text{binomial}(15,0.5). Then the P-value is:

\displaystyle \text{P-value}=P[W \ge 11]=\sum \limits_{k=11}^{15} \binom{15}{k} \biggl(\frac{1}{2}\biggr)^{15}=0.05923
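
For readers who want to check the arithmetic, the binomial coefficients in the sum can be written out term by term, using 2^{15}=32768:

\displaystyle \sum \limits_{k=11}^{15} \binom{15}{k} \biggl(\frac{1}{2}\biggr)^{15}=\frac{1365+455+105+15+1}{32768}=\frac{1941}{32768} \approx 0.05923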

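This P-value is slightly above 0.05. So the null hypothesis is rejected at the significance level \alpha=0.10 but not at \alpha=0.05. The data provide moderate, though not strong, evidence that instructor B is more popular among the students.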

Reference
Myles Hollander and Douglas A. Wolfe, Non-parametric Statistical Methods, Second Edition, Wiley (1999)

The sign test

What kind of significance tests do we use for doing inference on the mean of an obviously non-normal population? If the sample is large, we can still use the t-test since the sampling distribution of the sample mean \overline{X} is close to normal and the t-procedure is robust. If the sample size is small and the underlying distribution is clearly not normal (e.g. is extremely skewed), what significance test do we use? Let’s take the example of a matched pairs data problem. The matched pairs t-test is used to test the hypothesis that there is “no difference” between two continuous random variables X and Y that are paired. If the underlying distributions are normal or if the sample size is large, the matched pairs t-test is an excellent test. However, absent normality or large samples, the sign test is an alternative to the matched pairs t-test. In this post, we discuss how the sign test works and present some examples. Examples 1 and 2 are shown in this post. Examples 3, 4 and 5 are shown in the next post The sign test, more examples.

The sign test and the confidence intervals for percentiles (discussed in the previous post Confidence intervals for percentiles) are examples of distribution-free inference procedures. They are called distribution-free because no assumptions are made about the underlying distributions of the data measurements. For more information about distribution-free inferences, see [Hollander & Wolfe].

We discuss two types of problems for which the sign test is applicable – one-sample location problems and matched pairs data problems. In the one-sample problems, the sign test is to test whether the location (median) of the data has shifted. In the matched pairs problems, the sign test is to test whether the location (median) of one variable has shifted in relation to the matched variable. Thus, the test hypotheses must be restated in terms of the median if the sign test is to be used as an alternative to the t-test. With the sign test, the question is “has the median changed?” whereas the question is “has the mean changed?” for the t-test.

The sign test is one of the simplest distribution-free procedures. It is an excellent choice for a significance test when the sample size is small and the data are highly skewed or have outliers. In such cases, the sign test is preferred over the t-test. However, the sign test is generally less powerful than the t-test. For the matched pairs problems, the sign test only looks at the signs of the differences of the data pairs. The magnitude of the differences is not taken into account. Because the sign test does not use all the available information contained in the data, it is less powerful than the t-test when the population is close to normal.

How the sign test works
Suppose that (X,Y) is a pair of continuous random variables. Suppose that a random sample of paired data (X_1,Y_1),(X_2,Y_2), \cdots, (X_n,Y_n) is obtained. We omit the observations (X_i,Y_i) with X_i=Y_i. Let m be the number of pairs for which X_i \ne Y_i. For each of these m pairs, we make a note of the sign of the difference X_i-Y_i (+ if X_i>Y_i and - if X_i<Y_i). Let W be the number of + signs out of these m pairs. The sign test gets its name from the fact that its test statistic W is a count of signs. Thus we are only considering the signs of the differences in the paired data and not the magnitude of the differences. The sign test is also called the binomial test since the statistic W has a binomial distribution.

Let p=P[X>Y]. Note that this is the probability that a data pair (X,Y) has a + sign. If p=\frac{1}{2}, then any random pair (X,Y) has an equal chance of being a + or a - sign. The null hypothesis H_0:p=\frac{1}{2} is the hypothesis of “no difference”. Under this hypothesis, there is no difference between the two measurements X and Y. The sign test tests the null hypothesis H_0:p=\frac{1}{2} against any one of the following alternative hypotheses:

\displaystyle H_1:p<\frac{1}{2} \ \ \ \ \ \text{(Left-tailed)}
\displaystyle H_1:p>\frac{1}{2} \ \ \ \ \ \text{(Right-tailed)}
\displaystyle H_1:p \ne \frac{1}{2} \ \ \ \ \ \text{(Two-tailed)}

The statistic W counts the successes in a series of m independent trials, each of which has probability of success p=P[X>Y]. Thus W \sim \text{binomial}(m,p). When H_0 is true, W \sim \text{binomial}(m,\frac{1}{2}). Thus the binomial distribution is used for calculating significance. The left-tailed P-value is of the form P[W \le w] and the right-tailed P-value is P[W \ge w], where w is the observed value of W. The two-tailed P-value is twice the smaller of the two one-sided P-values.
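The procedure described so far can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. The function names `upper_tail`, `lower_tail` and `sign_test` are hypothetical, only the standard library is used, and capping the two-tailed P-value at 1 is a convention added for this sketch.

```python
from math import comb

def upper_tail(w, m):
    """P[W >= w] for W ~ binomial(m, 1/2)."""
    return sum(comb(m, k) for k in range(w, m + 1)) / 2 ** m

def lower_tail(w, m):
    """P[W <= w] for W ~ binomial(m, 1/2)."""
    return sum(comb(m, k) for k in range(0, w + 1)) / 2 ** m

def sign_test(diffs, alternative="greater"):
    """Return (w, m, p_value) for a list of paired differences X_i - Y_i."""
    nonzero = [d for d in diffs if d != 0]   # omit pairs with X_i = Y_i
    m = len(nonzero)                         # number of untied pairs
    w = sum(d > 0 for d in nonzero)          # number of + signs
    if alternative == "greater":             # H_1: p > 1/2
        p_value = upper_tail(w, m)
    elif alternative == "less":              # H_1: p < 1/2
        p_value = lower_tail(w, m)
    elif alternative == "two-sided":         # H_1: p != 1/2
        # Twice the smaller one-sided P-value, capped at 1 (assumed convention).
        p_value = min(1.0, 2 * min(upper_tail(w, m), lower_tail(w, m)))
    else:
        raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")
    return w, m, p_value

# Toy differences, made up for illustration: one tie, three plus signs, one minus sign.
print(sign_test([1, 2, -1, 0, 3]))               # (3, 4, 0.3125)
print(sign_test([1, 2, -1, 0, 3], "two-sided"))  # (3, 4, 0.625)
```

With only four untied pairs, three plus signs are not unusual under H_0, which is why the P-values in the toy example are large.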

The sign test can also be viewed as testing the hypothesis that the median of the differences is zero. Let m_d be the median of the differences X-Y. The null hypothesis H_0:p=\frac{1}{2} is equivalent to the hypothesis H_0:m_d=0. For the alternative hypotheses, we have the following equivalences:

\displaystyle H_1:p<\frac{1}{2} \ \ \ \equiv \ \ \ H_1:m_d<0

\displaystyle H_1:p>\frac{1}{2} \ \ \ \equiv \ \ \ H_1:m_d>0

\displaystyle H_1:p \ne \frac{1}{2} \ \ \ \equiv \ \ \ H_1:m_d \ne 0

Example 1
A running club conducts a 6-week training program to prepare 20 middle-aged amateur runners for a 5K running race. The following matrix shows the running times (in minutes) before and after the training program. Note that five kilometers is approximately 3.1 miles.

\displaystyle \begin{pmatrix} \text{Runner}&\text{Pre-training}&\text{Post-training}&\text{Diff} \\{1}&57.5&54.9&2.6 \\{2}&52.4&53.5&-1.1 \\{3}&59.2&49.0&10.2 \\{4}&27.0&24.5&2.5 \\{5}&55.8&50.7&5.1 \\{6}&60.8&57.5&3.3 \\{7}&40.6&37.2&3.4 \\{8}&47.3&42.3&5.0 \\{9}&43.9&47.3&-3.4 \\{10}&43.7&34.8&8.9 \\{11}&60.8&53.3&7.5 \\{12}&43.9&33.8&10.1 \\{13}&45.6&41.7&3.9 \\{14}&40.6&41.5&-0.9 \\{15}&54.1&52.5&1.6 \\{16}&50.7&52.4&-1.7 \\{17}&25.4&25.9&-0.5 \\{18}&57.5&54.7&2.8 \\{19}&43.9&38.7&5.2 \\{20}&43.9&39.9&4.0 \end{pmatrix}

The difference is taken to be pre-training time minus post-training time. Use the sign test to test whether the training program improves run time.

For a given runner, let X be a random pre-training running time and Y be a random post-training running time. The hypotheses to be tested are:

\displaystyle H_0:p=\frac{1}{2} \ \ \ \ \ H_1:p>\frac{1}{2} \ \ \ \text{where} \ p=P[X>Y]

Under the null hypothesis H_0, there is no difference between the pre-training run time and post-training run time. The difference is equally likely to be a plus sign or a minus sign. Let W be the number of runners in the sample for which X_i-Y_i>0. Then W \sim \text{Binomial}(20,0.5). The observed value of the statistic W is w=15. Since this is a right-tailed test, the following is the P-value:

\displaystyle \text{P-value}=P[W \ge 15]=\sum \limits_{k=15}^{20} \binom{20}{k} \biggl(\frac{1}{2}\biggr)^{20}=0.02069
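The count of plus signs and the P-value can be checked with a short Python sketch. It uses only the standard library, and the running times are copied from the matrix in this example.

```python
from math import comb

# Running times in minutes, copied from the matrix above (runners 1 through 20).
pre  = [57.5, 52.4, 59.2, 27.0, 55.8, 60.8, 40.6, 47.3, 43.9, 43.7,
        60.8, 43.9, 45.6, 40.6, 54.1, 50.7, 25.4, 57.5, 43.9, 43.9]
post = [54.9, 53.5, 49.0, 24.5, 50.7, 57.5, 37.2, 42.3, 47.3, 34.8,
        53.3, 33.8, 41.7, 41.5, 52.5, 52.4, 25.9, 54.7, 38.7, 39.9]

# A plus sign means the pre-training time exceeds the post-training time.
m = sum(x != y for x, y in zip(pre, post))   # untied pairs (no ties here)
w = sum(x > y for x, y in zip(pre, post))    # number of plus signs

# Right-tailed P-value: P[W >= w] with W ~ binomial(m, 1/2).
p_value = sum(comb(m, k) for k in range(w, m + 1)) / 2 ** m

print(m, w, round(p_value, 5))  # 20 15 0.02069
```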

Because of the small P-value, the result of 15 out of 20 runners having improved run time is unlikely to be due to random chance alone. So we reject H_0 and we have good reason to believe that the training program reduces run time.

Example 2
A car owner is curious about the effect of oil changes on gas mileage. For each of 17 oil changes, he recorded data for miles per gallon (MPG) prior to the oil change and after the oil change. The following matrix shows the data:

\displaystyle \begin{pmatrix} \text{Oil Change}&\text{MPG (Pre)}&\text{MPG (Post)}&\text{Diff} \\{1}&24.24&27.45&3.21 \\{2}&24.33&24.60&0.27 \\{3}&24.45&28.27&3.82 \\{4}&23.37&22.49&-0.88 \\{5}&26.73&28.67&1.94 \\{6}&30.40&27.51&-2.89 \\{7}&29.57&29.28&-0.29 \\{8}&22.27&23.18&0.91 \\{9}&27.00&27.64&0.64 \\{10}&24.95&26.01&1.06 \\{11}&27.12&27.39&0.27 \\{12}&28.53&28.67&0.14 \\{13}&27.55&30.27&2.72 \\{14}&30.17&27.83&-2.34 \\{15}&26.00&27.78&1.78 \\{16}&27.52&29.18&1.66 \\{17}&34.61&33.04&-1.57\end{pmatrix}

Regular oil changes are obviously crucial to maintaining the overall health of the car. It seems to make sense that oil changes would improve gas mileage. Is there evidence that this is the case? Do the analysis using the sign test.

In this example we set the hypotheses in terms of the median. For a given oil change, let X be the post oil change MPG and Y be the pre oil change MPG. Consider the differences X-Y. Let m_d be the median of the differences X-Y. We test the null hypothesis that there is no change in MPG before and after oil change against the alternative hypothesis that the median of the post oil change MPG has shifted to the right in relation to the pre oil change MPG. We have the following hypotheses:

\displaystyle H_0:m_d=0 \ \ \ \ \ H_1:m_d>0

Let W be the number of oil changes with positive differences in MPG (post minus pre). Then W \sim \text{Binomial}(17,0.5). The observed value of the statistic W is w=12. Since this is a right-tailed test, the following is the P-value:

\displaystyle \text{P-value}=P[W \ge 12]=\sum \limits_{k=12}^{17} \binom{17}{k} \biggl(\frac{1}{2}\biggr)^{17}=0.07173
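The following Python sketch reproduces this P-value from the Diff column of the matrix and compares it with two significance levels. It uses only the standard library. The 0.05 level is included only for comparison and is an addition made for this sketch.

```python
from math import comb

# Differences (post minus pre MPG), copied from the Diff column above.
diffs = [3.21, 0.27, 3.82, -0.88, 1.94, -2.89, -0.29, 0.91, 0.64,
         1.06, 0.27, 0.14, 2.72, -2.34, 1.78, 1.66, -1.57]

m = sum(d != 0 for d in diffs)   # untied pairs (no zero differences here)
w = sum(d > 0 for d in diffs)    # oil changes with a positive difference

# Right-tailed P-value: P[W >= w] with W ~ binomial(m, 1/2).
p_value = sum(comb(m, k) for k in range(w, m + 1)) / 2 ** m

for alpha in (0.10, 0.05):
    decision = "reject H_0" if p_value <= alpha else "do not reject H_0"
    print(f"alpha={alpha:.2f}: P-value={p_value:.5f}, {decision}")
```

The decision depends on the chosen significance level, so the level should be fixed before looking at the data.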

At the significance level of \alpha=0.10, we reject the null hypothesis. However, we would like to add a caveat. The value of this example is that it is an excellent demonstration of the sign test. The 17 oil changes are not controlled. The data are just records of mileage and gas usage for 17 oil changes (both pre and post). No effort was made to make sure that the driving conditions are similar for the pre oil change MPG and post oil change MPG (freeway vs. local streets, weather conditions, etc). With more care in producing the data, we could conceivably derive a more definitive answer.

Reference
Myles Hollander and Douglas A. Wolfe, Non-parametric Statistical Methods, Second Edition, Wiley (1999)