Calculating order statistics using multinomial probabilities

Consider a random sample X_1,X_2,\cdots,X_n drawn from a continuous distribution. Rank the sample items in increasing order, resulting in a ranked sample Y_1<Y_2<\cdots <Y_n where Y_1 is the smallest sample item, Y_2 is the second smallest sample item and so on. The items in the ranked sample are called the order statistics. Recently the author of this blog was calculating a conditional probability such as P(Y_2>4 \ | \ Y_2>3). One way to do this is to calculate the distribution function P(Y_2 \le t). What about the probability P(Y_5>4 \ | \ Y_2>3)? Since this one involves two order statistics, the author of this blog initially thought that calculating P(Y_5>4 \ | \ Y_2>3) would require knowing the joint probability distribution of the order statistics Y_1,Y_2,\cdots ,Y_n. It turns out that a joint distribution may not be needed. Instead, we can calculate a conditional probability such as P(Y_5>4 \ | \ Y_2>3) using multinomial probabilities. In this post, we demonstrate how this is done using examples. Practice problems are found here.

The calculation described here can be lengthy and tedious if the sample size is large. To make the calculation more manageable, the examples here have relatively small sample size. To keep the multinomial probabilities easier to calculate, the random samples are drawn from a uniform distribution. The calculation for larger sample sizes from other distributions is better done using a technology solution. In any case, the calculation described here is a great way to practice working with order statistics and multinomial probabilities.

________________________________________________________________________

The multinomial angle

In this post, the order statistics Y_1<Y_2<\cdots <Y_n result from ranking the random sample X_1,X_2,\cdots,X_n, which is drawn from a continuous distribution with distribution function F(x)=P(X \le x). For the jth order statistic, the calculation often begins with its distribution function P(Y_j \le t).

Here’s the thought process for calculating P(Y_j \le t). In drawing the random sample X_1,X_2,\cdots,X_n, make a note of the items \le t and the items >t. For the event Y_j \le t to happen, there must be at least j sample items X_i that are \le t. For the event Y_j > t to happen, there can be at most j-1 sample items X_i that are \le t. So to calculate P(Y_j \le t), simply find the probability of having j or more sample items \le t. To calculate P(Y_j > t), find the probability of having at most j-1 sample items \le t.

    \displaystyle P(Y_j \le t)=\sum \limits_{k=j}^n \binom{n}{k} \ \biggl[ F(t) \biggr]^k \ \biggl[1-F(t) \biggr]^{n-k} \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)

    \displaystyle P(Y_j > t)=\sum \limits_{k=0}^{j-1} \binom{n}{k} \ \biggl[ F(t) \biggr]^k \ \biggl[1-F(t) \biggr]^{n-k} \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)
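Formulas (1) and (2) are short enough to sketch in code. The following Python snippet is only an illustration: the function names `order_stat_cdf` and `order_stat_sf` are made up for this post, and the argument `p` stands for F(t), which the caller has to supply.

```python
from math import comb

def order_stat_cdf(n, j, p):
    """P(Y_j <= t) as in formula (1); p stands for F(t)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(j, n + 1))

def order_stat_sf(n, j, p):
    """P(Y_j > t) as in formula (2); p stands for F(t)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(0, j))

# Hypothetical illustration: n = 5 draws from U(0, 5) and t = 3, so F(t) = 3/5.
p = 3 / 5
upper = order_stat_sf(5, 2, p)    # P(Y_2 > 3)
lower = order_stat_cdf(5, 2, p)   # P(Y_2 <= 3)
print(upper, lower, upper + lower)
```

The two functions split the same binomial sum at k = j, so their values should add up to 1 apart from floating point rounding.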

Both (1) and (2) involve binomial probabilities and are discussed in this previous post. The probability of success is F(t)=P(X \le t) since we are interested in how many sample items are \le t. Both calculations (1) and (2) are based on counting the number of sample items in the two intervals \le t and >t. It turns out that when the desired probability involves more than one Y_j, we can also count the number of sample items that fall into some appropriate intervals and apply the appropriate multinomial probabilities. Let’s use an example to illustrate the idea.

Example 1
Draw a random sample X_1,X_2,\cdots,X_{10} from the uniform distribution U(0,4). The resulting order statistics are Y_1<Y_2< \cdots <Y_{10}. Find the following probabilities:

  • P(Y_4<2<Y_5<Y_6<3<Y_7)
  • P(Y_4<2<Y_6<3<Y_7)

For both probabilities, the range of the distribution is broken up into 3 intervals, (0, 2), (2, 3) and (3, 4). Each sample item has probabilities \frac{2}{4}, \frac{1}{4}, \frac{1}{4} of falling into these intervals, respectively. Multinomial probabilities are calculated on these 3 intervals. It is a matter of counting the numbers of sample items falling into each interval.

The first probability involves the event that there are 4 sample items in the interval (0, 2), 2 sample items in the interval (2, 3) and 4 sample items in the interval (3, 4). Thus the first probability is the following multinomial probability:

    \displaystyle \begin{aligned} P(Y_4<2<Y_5<Y_6<3<Y_7)&=\frac{10!}{4! \ 2! \ 4!} \biggl[\frac{2}{4} \biggr]^4 \ \biggl[\frac{1}{4} \biggr]^2 \ \biggl[\frac{1}{4} \biggr]^4 \\&\text{ } \\&=\frac{50400}{1048576} \\&\text{ } \\&=0.0481  \end{aligned}

For the second probability, Y_5 does not have to be greater than 2. Thus there could be 5 sample items less than 2. So we need to add one more case to the above probability (5 sample items to the first interval, 1 sample item to the second interval and 4 sample items to the third interval).

    \displaystyle \begin{aligned} P(Y_4<2<Y_6<3<Y_7)&=\frac{10!}{4! \ 2! \ 4!} \biggl[\frac{2}{4} \biggr]^4 \ \biggl[\frac{1}{4} \biggr]^2 \ \biggl[\frac{1}{4} \biggr]^4 \\& \ \ \ \ + \frac{10!}{5! \ 1! \ 4!} \biggl[\frac{2}{4} \biggr]^5 \ \biggl[\frac{1}{4} \biggr]^1 \ \biggl[\frac{1}{4} \biggr]^4 \\&\text{ } \\&=\frac{50400+40320}{1048576} \\&\text{ } \\&=\frac{90720}{1048576} \\&\text{ } \\&=0.0865  \end{aligned}
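The two probabilities in Example 1 can be checked with exact fractions. The snippet below is a minimal sketch; the helper name `multinomial_pmf` is our own choice and not part of any library. Note that Python reduces fractions to lowest terms, so the printed fractions will not show the denominator 4^{10} directly.

```python
from fractions import Fraction
from math import factorial

def multinomial_pmf(counts, probs):
    """Probability that the interval counts equal `counts` (hypothetical helper)."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)          # exact: each partial quotient is an integer
    prob = Fraction(coef)
    for c, p in zip(counts, probs):
        prob *= Fraction(p) ** c
    return prob

# Intervals (0, 2), (2, 3), (3, 4) under U(0, 4).
probs = [Fraction(2, 4), Fraction(1, 4), Fraction(1, 4)]

first = multinomial_pmf([4, 2, 4], probs)            # P(Y_4<2<Y_5<Y_6<3<Y_7)
second = first + multinomial_pmf([5, 1, 4], probs)   # P(Y_4<2<Y_6<3<Y_7)
print(first, round(float(first), 4))
print(second, round(float(second), 4))
```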

Example 2
Draw a random sample X_1,X_2,\cdots,X_6 from the uniform distribution U(0,4). The resulting order statistics are Y_1<Y_2< \cdots <Y_6. Find the probability P(1<Y_2<Y_4<3).

In this problem the range of the distribution is broken up into 3 intervals (0, 1), (1, 3) and (3, 4). Each sample item has probabilities \frac{1}{4}, \frac{2}{4}, \frac{1}{4} of falling into these intervals, respectively. Multinomial probabilities are calculated on these 3 intervals. It is a matter of counting the numbers of sample items falling into each interval. The counting is a little bit more involved here than in the previous example.

The problem is to find the probability that both the second order statistic Y_2 and the fourth order statistic Y_4 fall into the interval (1,3). To solve this, determine how many sample items fall into each of the intervals (0,1), (1,3) and (3,4). The following points detail the counting.

  • For the event 1<Y_2 to happen, there can be at most 1 sample item in the interval (0,1).
  • For the event Y_4<3 to happen, there must be at least 4 sample items in the interval (0,3). Thus if the interval (0,1) has 1 sample item, the interval (1,3) has at least 3 sample items. If the interval (0,1) has no sample item, the interval (1,3) has at least 4 sample items.

The following lists out all the cases that satisfy the above two bullet points. The notation [a, b, c] means that a sample items fall into (0,1), b sample items fall into the interval (1,3) and c sample items fall into the interval (3,4). So a+b+c=6. Since the sample items are drawn from U(0,4), the probabilities of a sample item falling into intervals (0,1), (1,3) and (3,4) are \frac{1}{4}, \frac{2}{4} and \frac{1}{4}, respectively.

    [0, 4, 2]
    [0, 5, 1]
    [0, 6, 0]
    [1, 3, 2]
    [1, 4, 1]
    [1, 5, 0]

    \displaystyle \begin{aligned} \frac{6!}{a! \ b! \ c!} \ \biggl[\frac{1}{4} \biggr]^a \ \biggl[\frac{2}{4} \biggr]^b \ \biggl[\frac{1}{4} \biggr]^c&=\frac{6!}{0! \ 4! \ 2!} \ \biggl[\frac{1}{4} \biggr]^0 \ \biggl[\frac{2}{4} \biggr]^4 \ \biggl[\frac{1}{4} \biggr]^2=\frac{240}{4096} \\&\text{ } \\&=\frac{6!}{0! \ 5! \ 1!} \ \biggl[\frac{1}{4} \biggr]^0 \ \biggl[\frac{2}{4} \biggr]^5 \ \biggl[\frac{1}{4} \biggr]^1=\frac{192}{4096} \\&\text{ } \\&=\frac{6!}{0! \ 6! \ 0!} \ \biggl[\frac{1}{4} \biggr]^0 \ \biggl[\frac{2}{4} \biggr]^6 \ \biggl[\frac{1}{4} \biggr]^0=\frac{64}{4096} \\&\text{ } \\&=\frac{6!}{1! \ 3! \ 2!} \ \biggl[\frac{1}{4} \biggr]^1 \ \biggl[\frac{2}{4} \biggr]^3 \ \biggl[\frac{1}{4} \biggr]^2=\frac{480}{4096} \\&\text{ } \\&=\frac{6!}{1! \ 4! \ 1!} \ \biggl[\frac{1}{4} \biggr]^1 \ \biggl[\frac{2}{4} \biggr]^4 \ \biggl[\frac{1}{4} \biggr]^1=\frac{480}{4096} \\&\text{ } \\&=\frac{6!}{1! \ 5! \ 0!} \ \biggl[\frac{1}{4} \biggr]^1 \ \biggl[\frac{2}{4} \biggr]^5 \ \biggl[\frac{1}{4} \biggr]^0=\frac{192}{4096} \\&\text{ } \\&=\text{sum of probabilities }=\frac{1648}{4096}=0.4023\end{aligned}

So in randomly drawing 6 items from the uniform distribution U(0,4), there is a 40% chance that the second order statistic and the fourth order statistic are between 1 and 3.
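The case listing in Example 2 can also be produced by brute force: loop over every way of splitting the 6 sample items among the 3 intervals and keep the splits that satisfy the two counting conditions. This is a sketch under the same setup as the example; the helper `multinomial_pmf` is a name we made up.

```python
from fractions import Fraction
from math import factorial

def multinomial_pmf(counts, probs):
    """Multinomial probability of the given interval counts (hypothetical helper)."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)
    prob = Fraction(coef)
    for c, p in zip(counts, probs):
        prob *= p ** c
    return prob

n = 6
probs = [Fraction(1, 4), Fraction(2, 4), Fraction(1, 4)]  # (0,1), (1,3), (3,4)

cases = []
total = Fraction(0)
for a in range(n + 1):
    for b in range(n - a + 1):
        c = n - a - b
        # 1 < Y_2 needs at most 1 item in (0,1);
        # Y_4 < 3 needs at least 4 items in (0,3).
        if a <= 1 and a + b >= 4:
            cases.append((a, b, c))
            total += multinomial_pmf([a, b, c], probs)

print(cases)
print(total, round(float(total), 4))
```

The loop should recover the same 6 cases listed above, and the total is the probability P(1<Y_2<Y_4<3) as a reduced fraction.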

________________________________________________________________________

More examples

The method described by the above examples is this. When looking at the event described by the probability problem, the entire range of the distribution is broken up into several intervals. Imagine the sample items X_i are randomly being thrown into these intervals (i.e. we are sampling from a uniform distribution). Then multinomial probabilities are calculated to account for all the different ways sample items can land in these intervals. The following examples further illustrate this idea.

Example 3
Draw a random sample X_1,X_2,\cdots,X_7 from the uniform distribution U(0,5). The resulting order statistics are Y_1<Y_2< \cdots <Y_7. Find the following probabilities:

  • P(1<Y_1<3<Y_4<4)
  • P(3<Y_4<4 \ | \ 1<Y_1<3)

The range is broken up into the intervals (0, 1), (1, 3), (3, 4) and (4, 5). The sample items fall into these intervals with probabilities \frac{1}{5}, \frac{2}{5}, \frac{1}{5} and \frac{1}{5}. The following details the counting for the event 1<Y_1<3<Y_4<4:

  • There are no sample items in (0, 1) since 1<Y_1.
  • Based on Y_1<3<Y_4, there is at least 1 sample item and there are at most 3 sample items in (0, 3). Thus the interval (1, 3) contains at least 1 and at most 3 sample items, since there are none in (0, 1).
  • Based on Y_4<4, there are at least 4 sample items in the interval (0, 4). Thus the count in (3, 4) combined with the count in (1, 3) must be at least 4.
  • The interval (4, 5) simply receives the left over sample items not in the previous intervals.

The notation [a, b, c, d] lists out the counts in the 4 intervals. The following lists out all the cases described by the above 4 bullet points along with the corresponding multinomial probabilities, with two of the probabilities set up.

    \displaystyle [0, 1, 3, 3] \ \ \ \ \ \ \frac{280}{78125}=\frac{7!}{0! \ 1! \ 3! \ 3!} \ \biggl[\frac{1}{5} \biggr]^0 \ \biggl[\frac{2}{5} \biggr]^1 \ \biggl[\frac{1}{5} \biggr]^3 \ \biggl[\frac{1}{5} \biggr]^3

    \displaystyle [0, 1, 4, 2] \ \ \ \ \ \ \frac{210}{78125}

    \displaystyle [0, 1, 5, 1] \ \ \ \ \ \ \frac{84}{78125}

    \displaystyle [0, 1, 6, 0] \ \ \ \ \ \ \frac{14}{78125}

    \displaystyle [0, 2, 2, 3] \ \ \ \ \ \ \frac{840}{78125}

    \displaystyle [0, 2, 3, 2] \ \ \ \ \ \ \frac{840}{78125}

    \displaystyle [0, 2, 4, 1] \ \ \ \ \ \ \frac{420}{78125}

    \displaystyle [0, 2, 5, 0] \ \ \ \ \ \ \frac{84}{78125}

    \displaystyle [0, 3, 1, 3] \ \ \ \ \ \ \frac{1120}{78125}=\frac{7!}{0! \ 3! \ 1! \ 3!} \ \biggl[\frac{1}{5} \biggr]^0 \ \biggl[\frac{2}{5} \biggr]^3 \ \biggl[\frac{1}{5} \biggr]^1 \ \biggl[\frac{1}{5} \biggr]^3

    \displaystyle [0, 3, 2, 2] \ \ \ \ \ \ \frac{1680}{78125}

    \displaystyle [0, 3, 3, 1] \ \ \ \ \ \ \frac{1120}{78125}

    \displaystyle [0, 3, 4, 0] \ \ \ \ \ \ \frac{280}{78125}

Summing all the probabilities, \displaystyle P(1<Y_1<3<Y_4<4)=\frac{6972}{78125}=0.08924. Out of the 78125 many different ways the 7 sample items can land into these 4 intervals, 6972 of them would satisfy the event 1<Y_1<3<Y_4<4.
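With 4 intervals the case list is longer, so a brute-force enumeration is a useful cross-check. The following sketch translates the counting conditions for the event 1<Y_1<3<Y_4<4 into conditions on [a, b, c, d]; the helper name `multinomial_pmf` is hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import factorial

def multinomial_pmf(counts, probs):
    """Multinomial probability of the given interval counts (hypothetical helper)."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)
    prob = Fraction(coef)
    for c, p in zip(counts, probs):
        prob *= p ** c
    return prob

n = 7
# Intervals (0,1), (1,3), (3,4), (4,5) under U(0, 5).
probs = [Fraction(1, 5), Fraction(2, 5), Fraction(1, 5), Fraction(1, 5)]

cases = {}
for a, b, c in product(range(n + 1), repeat=3):
    d = n - a - b - c
    if d < 0:
        continue
    # 1 < Y_1: nothing in (0,1). Y_1 < 3 < Y_4: 1 to 3 items in (0,3).
    # Y_4 < 4: at least 4 items in (0,4).
    if a == 0 and 1 <= a + b <= 3 and a + b + c >= 4:
        cases[(a, b, c, d)] = multinomial_pmf([a, b, c, d], probs)

total = sum(cases.values())
print(len(cases), total, round(float(total), 5))
```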

++++++++++++++++++++++++++++++++++

We now calculate the second probability in Example 3.

    \displaystyle P(3<Y_4<4 \ | \ 1<Y_1<3)=\frac{P(1<Y_1<3<Y_4<4)}{P(1<Y_1<3)}

First calculate P(1<Y_1<3)=P(Y_1<3)-P(Y_1<1). The probability P(Y_1<t) is the probability of having at least 1 sample item less than t, which is the complement of the probability of all sample items greater than t.

    \displaystyle \begin{aligned} P(1<Y_1<3)&=P(Y_1<3)-P(Y_1<1) \\&=1-\biggl( \frac{2}{5} \biggr)^7 -\biggl[1-\biggl( \frac{4}{5} \biggr)^7 \biggr] \\&=\frac{77997-61741}{78125} \\&=\frac{16256}{78125} \end{aligned}

The event 1<Y_1<3 can occur in 16256 ways. Out of these many ways, 6972 of these satisfy the event 1<Y_1<3<Y_4<4. Thus we have:

    \displaystyle P(3<Y_4<4 \ | \ 1<Y_1<3)=\frac{6972}{16256}=0.4289
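A simulation gives an independent sanity check on this conditional probability. The sketch below draws many samples of size 7 from U(0,5), keeps those with 1<Y_1<3, and reports the fraction of the kept samples that also have 3<Y_4<4. The trial count and the seed are arbitrary choices made for this illustration.

```python
import random

def simulate(trials=100_000, seed=2015):
    """Estimate P(3<Y_4<4 | 1<Y_1<3) for samples of size 7 from U(0, 5)."""
    rng = random.Random(seed)
    cond = both = 0
    for _ in range(trials):
        y = sorted(rng.uniform(0, 5) for _ in range(7))
        if 1 < y[0] < 3:            # y[0] is Y_1
            cond += 1
            if 3 < y[3] < 4:        # y[3] is Y_4
                both += 1
    return both / cond

estimate = simulate()
print(estimate)   # should land in the neighborhood of 6972/16256
```

The estimate fluctuates with the seed and the number of trials, so it is expected to agree with the exact answer only to about two decimal places.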

Example 4
Draw a random sample X_1,X_2,X_3,X_4,X_5 from the uniform distribution U(0,5). The resulting order statistics are Y_1<Y_2<Y_3<Y_4 <Y_5. Consider the conditional random variable Y_4 \ | \ Y_2 >3. For this conditional distribution, find the following:

  • P( Y_4 \le t \ | \ Y_2 >3)
  • f_{Y_4}(t \ | \ Y_2 >3)
  • E(Y_4 \ | \ Y_2 >3)

where 3<t<5. Note that f_{Y_4}(t | \ Y_2 >3) is the density function of Y_4 \ | \ Y_2 >3.

Note that

    \displaystyle P( Y_4 \le t \ | \ Y_2 >3)=\frac{P(3<Y_2<Y_4 \le t)}{P(Y_2 >3)}

In finding P(3<Y_2<Y_4 \le t), the range (0, 5) is broken up into 3 intervals (0, 3), (3, t) and (t, 5). The sample items fall into these intervals with probabilities \frac{3}{5}, \frac{t-3}{5} and \frac{5-t}{5}.

Since Y_2 >3, there is at most 1 sample item in the interval (0, 3). Since Y_4 \le t, there are at least 4 sample items in the interval (0, t). So the count in the interval (3, t) and the count in (0, 3) should add up to 4 or more items. The following shows all the cases for the event 3<Y_2<Y_4 \le t along with the corresponding multinomial probabilities.

    \displaystyle [0, 4, 1] \ \ \ \ \ \ \frac{5!}{0! \ 4! \ 1!} \ \biggl[\frac{3}{5} \biggr]^0 \ \biggl[\frac{t-3}{5} \biggr]^4 \ \biggl[\frac{5-t}{5} \biggr]^1

    \displaystyle [0, 5, 0] \ \ \ \ \ \ \frac{5!}{0! \ 5! \ 0!} \ \biggl[\frac{3}{5} \biggr]^0 \ \biggl[\frac{t-3}{5} \biggr]^5 \ \biggl[\frac{5-t}{5} \biggr]^0

    \displaystyle [1, 3, 1] \ \ \ \ \ \ \frac{5!}{1! \ 3! \ 1!} \ \biggl[\frac{3}{5} \biggr]^1 \ \biggl[\frac{t-3}{5} \biggr]^3 \ \biggl[\frac{5-t}{5} \biggr]^1

    \displaystyle [1, 4, 0] \ \ \ \ \ \ \frac{5!}{1! \ 4! \ 0!} \ \biggl[\frac{3}{5} \biggr]^1 \ \biggl[\frac{t-3}{5} \biggr]^4 \ \biggl[\frac{5-t}{5} \biggr]^0

After carrying out the algebra and simplifying, we have the following:

    \displaystyle P(3<Y_2<Y_4 \le t)=\frac{-4t^5+25t^4+180t^3-1890t^2+5400t-5103}{3125}

For the event Y_2 >3 to happen, there is at most 1 sample item less than 3. So we have:

    \displaystyle P(Y_2 >3)=\binom{5}{0} \ \biggl[\frac{3}{5} \biggr]^0 \ \biggl[\frac{2}{5} \biggr]^5 +\binom{5}{1} \ \biggl[\frac{3}{5} \biggr]^1 \ \biggl[\frac{2}{5} \biggr]^4=\frac{272}{3125}

    \displaystyle P( Y_4 \le t \ | \ Y_2 >3)=\frac{-4t^5+25t^4+180t^3-1890t^2+5400t-5103}{272}

Then the conditional density is obtained by differentiating P( Y_4 \le t \ | \ Y_2 >3).

    \displaystyle f_{Y_4}(t \ | \ Y_2 >3)=\frac{-20t^4+100t^3+540t^2-3780t+5400}{272}

The following gives the conditional mean E(Y_4 \ | \ Y_2 >3).

    \displaystyle \begin{aligned} E(Y_4 \ | \ Y_2 >3)&=\frac{1}{272} \ \int_3^5 t(-20t^4+100t^3+540t^2-3780t+5400) \ dt \\&=\frac{215}{51}=4.216 \end{aligned}

To contrast, the following gives the information on the unconditional distribution of Y_4.

    \displaystyle f_{Y_4}(t)=\frac{5!}{3! \ 1! \ 1!} \ \biggl[\frac{t}{5} \biggr]^3 \ \biggl[\frac{1}{5} \biggr] \ \biggl[ \frac{5-t}{5} \biggr]^1=\frac{20}{3125} \ (5t^3-t^4)

    \displaystyle E(Y_4)=\frac{20}{3125} \ \int_0^5 t(5t^3-t^4) \ dt=\frac{10}{3}=3.33

The unconditional mean of Y_4 is about 3.33. With the additional information that Y_2 >3, the average of Y_4 is now 4.2. So a higher value of Y_2 pulls up the mean of Y_4.
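The polynomial work in Example 4 is easy to get wrong by hand, so here is a sketch that redoes it with exact fractions. It starts from the numerator of P(3<Y_2<Y_4 \le t) given above, differentiates it to get the numerator of the conditional density, and integrates t times the density over (3, 5). The helper names `poly_eval`, `derivative` and `antiderivative` are made up for this illustration.

```python
from fractions import Fraction

# Numerator of P(3 < Y_2 < Y_4 <= t), coefficients from highest power to constant.
cdf_num = [-4, 25, 180, -1890, 5400, -5103]

def poly_eval(coeffs, x):
    """Evaluate a polynomial (highest power first) by Horner's rule."""
    result = Fraction(0)
    for c in coeffs:
        result = result * x + c
    return result

def derivative(coeffs):
    deg = len(coeffs) - 1
    return [c * (deg - i) for i, c in enumerate(coeffs[:-1])]

def antiderivative(coeffs):
    deg = len(coeffs)
    return [Fraction(c, deg - i) for i, c in enumerate(coeffs)] + [Fraction(0)]

density_num = derivative(cdf_num)      # numerator of the conditional density
t_times_density = density_num + [0]    # multiplying by t shifts every power up by one
G = antiderivative(t_times_density)
mean = (poly_eval(G, Fraction(5)) - poly_eval(G, Fraction(3))) / 272

print(poly_eval(cdf_num, 3), poly_eval(cdf_num, 5))   # endpoints of the CDF numerator
print(density_num)
print(mean, round(float(mean), 3))
```

The CDF numerator should evaluate to 0 at t=3 and to 272 at t=5, which confirms that the conditional distribution function runs from 0 to 1 on the interval (3, 5).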

________________________________________________________________________

Practice problems

Practice problems to reinforce the calculation are found in the problem blog, a companion blog to this blog.

________________________________________________________________________
\copyright \ \text{2015 by Dan Ma}

Confidence intervals for San Francisco rainfall

When estimating population percentiles, there is a way to do it that is distribution free. Draw a random sample from the population of interest and take the middle element in the random sample as an estimate of the population median. Furthermore, we can even attach a confidence interval to this estimate of median without knowing (or assuming) a probability distribution of the underlying phenomenon. This “distribution free” method is shown in the post called Confidence intervals for percentiles. In this post, we give an additional example using annual rainfall data in San Francisco to illustrate this approach of non-parametric inference using order statistics.

________________________________________________________________________

San Francisco rainfall data

The following table shows the annual rainfall data in San Francisco (in inches) from 1960-2013 (data source). The table consists of 54 measurements and is sorted in increasing order from left to right (and from top to bottom). Each annual rainfall measurement is from July of that year to June of the following year. The driest year (7.97 inches) is 1975, the period from July 1975 to June 1976. The wettest year (47.22 inches) is 1997, which is the period from July 1997 to June 1998. The most recent data point is the fifth measurement 12.54 inches (the period from July 2013 to June 2014).

    \displaystyle \begin{bmatrix} 7.97&\text{ }&11.06&\text{ } &11.06&\text{ }&12.32&\text{ }&12.54  \\ 13.86&\text{ }&13.87&\text{ } &14.08&\text{ }&14.32&\text{ }&14.46    \\ 15.22&\text{ }&15.39&\text{ } &15.64&\text{ }&16.33&\text{ }&16.61   \\ 16.89&\text{ }&17.43&\text{ } &17.50&\text{ }&17.65&\text{ }&17.74   \\ 18.11&\text{ }&18.26&\text{ } &18.74&\text{ }&18.79&\text{ }&19.20  \\ 19.47&\text{ }&20.01&\text{ } &20.54&\text{ }&20.80&\text{ }&22.15  \\ 22.29&\text{ }&22.47&\text{ } &22.63&\text{ }&23.49&\text{ }&23.87  \\ 24.09&\text{ }&24.49&\text{ } &24.89&\text{ }&24.89&\text{ }&25.03  \\ 25.09&\text{ }&26.66&\text{ } &26.87&\text{ }&27.76&\text{ }&28.68  \\ 28.87&\text{ }&29.41&\text{ }&31.87&\text{ } &34.02&\text{ }&34.36  \\ 34.43&\text{ }&37.10&\text{ }&38.17&\text{ } &47.22&\text{ }&\text{ }    \end{bmatrix}

Using the above data, estimate the median, the lower quartile (25th percentile) and the upper quartile (the 75th percentile) of the annual rainfall in San Francisco. Then find a reasonably good confidence interval for each of the three population percentiles.

________________________________________________________________________

Basic facts about order statistics

Let’s recall some basic facts from the following previous posts:

Let’s say we have a random sample X_1,X_2,\cdots,X_n drawn from a population whose percentiles are unknown and we wish to estimate them. Rank the items of the random sample to obtain the order statistics Y_1<Y_2<\cdots < Y_n. In an ideal setting, the measurements are supposed to arise from a continuous distribution. So the chance of a tie among the Y_j is zero. But this assumption may not hold on occasions. There are some ties in the San Francisco rainfall data (e.g. the second and third data point). The small number of ties will not affect the calculation performed below.

The reason that we can use the order statistics Y_j to estimate the population percentiles is that the expected percentage of the population below Y_j is about the same as the percentage of the sample items less than Y_j. According to the explanation in the second post listed above (link), the order statistic Y_j is expected to be above 100p percent of the population where p=\frac{j}{n+1}. In fact, the order statistics Y_1<Y_2<\cdots < Y_n are expected to divide the population in roughly equal segments. More specifically the expected percentage of the population in between Y_{j-1} and Y_j is 100h where h=\frac{1}{n+1}.

The above explanation justifies the use of the order statistic Y_j as the sample 100pth percentile where p=\frac{j}{n+1}.

The sample size is n= 54 in the San Francisco rainfall data. Thus the order statistic Y_{11} is the sample 20th percentile and can be taken as an estimate of the population 20th percentile for the San Francisco annual rainfall. Here the realized value of Y_{11} is 15.22.

With \frac{45}{54+1}=0.818, the order statistic Y_{45} is the sample 82nd percentile and is taken as an estimate of the population 82nd percentile for the San Francisco annual rainfall. The realized value of Y_{45} is 28.68 inches.
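The rule p=\frac{j}{n+1} is a one-line computation. The following sketch evaluates it for the ranks used in this post; the function name `percentile_level` is our own.

```python
from fractions import Fraction

def percentile_level(j, n):
    """Expected fraction of the population below the j-th order statistic."""
    return Fraction(j, n + 1)

n = 54  # sample size of the rainfall data
for j in (11, 27, 28, 45):
    p = percentile_level(j, n)
    print(j, p, round(float(p), 3))
```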

The key to constructing confidence intervals for percentiles is to calculate the probability P(Y_i < \tau_p < Y_j). This is the probability that the 100pth percentile, where 0<p<1, is in between Y_i and Y_j. Let's look at the median \tau_{0.5}. For Y_i<\tau_{0.5} to happen, there must be at least i sample items less than the median \tau_{0.5}. For \tau_{0.5}<Y_j to happen, there can be at most j-1 sample items less than the median \tau_{0.5}. Thus in the random draws of the sample items, in order for the event Y_i < \tau_{0.5} < Y_j to occur, there must be at least i sample items and at most j-1 sample items that are less than \tau_{0.5}. In other words, in n Bernoulli trials, there are at least i and at most j-1 successes where the probability of success is P(X<\tau_{0.5})= 0.5. The following is the probability P(Y_i < \tau_{0.5} < Y_j):

    \displaystyle P(Y_i < \tau_{0.5} < Y_j)=\sum \limits_{k=i}^{j-1} \binom{n}{k} \ 0.5^k \ 0.5^{n-k}=1 - \alpha

Then the interval Y_i < \tau_{0.5} < Y_j is taken to be the 100(1-\alpha)% confidence interval for the unknown population median \tau_{0.5}. Note that this confidence interval is constructed without knowing (or assuming) anything about the underlying distribution of the population.

Consider the 100pth percentile where 0<p<1. In order for the event Y_i < \tau_{p} < Y_j to occur, there must be at least i sample items and at most j-1 sample items that are less than \tau_{p}. This is equivalent to n Bernoulli trials resulting in at least i successes and at most j-1 successes where the probability of success is P(X<\tau_{p})=p.

    \displaystyle P(Y_i < \tau_{p} < Y_j)=\sum \limits_{k=i}^{j-1} \binom{n}{k} \ p^k \ (1-p)^{n-k}=1 - \alpha

Then the interval Y_i < \tau_{p} < Y_j is taken to be the 100(1-\alpha)% confidence interval for the unknown population percentile \tau_{p}. As mentioned earlier, this confidence interval does not rely on any information about the distribution of the population and is said to be distribution free. It only relies on a probability statement that involves the binomial distribution in describing the positioning of the sample items. In the past, people used the normal approximation to the binomial to estimate this probability. The normal approximation should no longer be needed as computing software is now easily available. For example, binomial probabilities can be computed in Excel for a million or more trials.
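The binomial sum for P(Y_i < \tau_{p} < Y_j) takes only a few lines of code. The following Python sketch is one way to do it outside of a spreadsheet; the function name `confidence_level` is made up for this post.

```python
from math import comb

def confidence_level(n, i, j, p):
    """P(Y_i < tau_p < Y_j): at least i and at most j-1 of the n items fall below tau_p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(i, j))

# Hypothetical small case: n = 15, the median, and the interval (Y_4, Y_12).
print(confidence_level(15, 4, 12, 0.5))
```

Note that `range(i, j)` stops at j-1, which matches the upper limit of the sum.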

________________________________________________________________________

Percentiles of annual rainfall

Using the above data, estimate the median, the lower quartile (25th percentile) and the upper quartile (the 75th percentile) of the annual rainfall in San Francisco. Then find a reasonably good confidence interval for each of the three population percentiles.

The sample size is n= 54. The middle two data elements in the sample are y_{27}=20.01 and y_{28}=20.54. They are realizations of the order statistics Y_{27} and Y_{28}. So in this example, \frac{27}{54+1}=0.49 and \frac{28}{54+1}=0.509. Thus the order statistic Y_{27} is expected to be greater than about 49% of the population and Y_{28} is expected to be greater than about 51% of the population. So neither Y_{27} nor Y_{28} is an exact fit. So we take the average of the two as an estimate of the population median:

    \displaystyle \hat{\tau}_{0.5}=\frac{20.01+20.54}{2}=20.275

Looking for confidence intervals, we consider the intervals (Y_{21},Y_{34}), (Y_{20},Y_{35}), (Y_{19},Y_{36}) and (Y_{18},Y_{37}). The following shows the confidence levels.

    \displaystyle P(Y_{21} < \tau_{0.5} < Y_{34})=\sum \limits_{k=21}^{33} \binom{54}{k} \ 0.5^k \ (0.5)^{54-k}=0.924095271

    \displaystyle P(Y_{20} < \tau_{0.5} < Y_{35})=\sum \limits_{k=20}^{34} \binom{54}{k} \ 0.5^k \ (0.5)^{54-k}=0.959776436

    \displaystyle P(Y_{19} < \tau_{0.5} < Y_{36})=\sum \limits_{k=19}^{35} \binom{54}{k} \ 0.5^k \ (0.5)^{54-k}=0.980165673

    \displaystyle P(Y_{18} < \tau_{0.5} < Y_{37})=\sum \limits_{k=18}^{36} \binom{54}{k} \ 0.5^k \ (0.5)^{54-k}=0.99092666

The above calculation is done in Excel. The binomial probabilities are done using the function BINOM.DIST. So we have the following confidence intervals for the median annual San Francisco rainfall in inches.

    Median

    \displaystyle \hat{\tau}_{0.5}=\frac{20.01+20.54}{2}=20.275

    (Y_{21},Y_{34}) = (18.11, 23.49) with approximately 92% confidence

    (Y_{20},Y_{35}) = (17.74, 23.87) with approximately 96% confidence

    (Y_{19},Y_{36}) = (17.65, 24.09) with approximately 98% confidence

    (Y_{18},Y_{37}) = (17.50, 24.49) with approximately 99% confidence
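For readers who prefer code to a spreadsheet, the following sketch reproduces the median results directly from the rainfall table. The list `rain` is the table above typed in by hand, and `confidence_level` is a helper name of our own.

```python
from math import comb

# The 54 sorted annual rainfall measurements (inches) from the table above.
rain = [
    7.97, 11.06, 11.06, 12.32, 12.54, 13.86, 13.87, 14.08, 14.32, 14.46,
    15.22, 15.39, 15.64, 16.33, 16.61, 16.89, 17.43, 17.50, 17.65, 17.74,
    18.11, 18.26, 18.74, 18.79, 19.20, 19.47, 20.01, 20.54, 20.80, 22.15,
    22.29, 22.47, 22.63, 23.49, 23.87, 24.09, 24.49, 24.89, 24.89, 25.03,
    25.09, 26.66, 26.87, 27.76, 28.68, 28.87, 29.41, 31.87, 34.02, 34.36,
    34.43, 37.10, 38.17, 47.22,
]

def confidence_level(n, i, j, p):
    """P(Y_i < tau_p < Y_j) as a binomial sum over k = i, ..., j-1."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(i, j))

n = len(rain)
median_estimate = (rain[26] + rain[27]) / 2   # average of y_27 and y_28
print(n, median_estimate)

levels = {}
for i, j in [(21, 34), (20, 35), (19, 36), (18, 37)]:
    levels[(i, j)] = confidence_level(n, i, j, 0.5)
    # Python lists are 0-indexed, so Y_i is rain[i - 1].
    print((i, j), (rain[i - 1], rain[j - 1]), round(levels[(i, j)], 4))
```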

For the lower quartile and upper quartile, the following are the results. The reader is invited to confirm the calculation.

    Lower quartile

    \displaystyle \hat{\tau}_{0.25}=15.985, average of Y_{13} and Y_{14}

    (Y_{7},Y_{20}) = (13.87, 17.74) with approximately 96% confidence

    (Y_{6},Y_{21}) = (13.86, 18.11) with approximately 98% confidence

    (Y_{5},Y_{22}) = (12.54, 18.26) with approximately 99% confidence

    Upper quartile

    \displaystyle \hat{\tau}_{0.75}=25.875, average of Y_{41} and Y_{42}

    (Y_{36},Y_{47}) = (24.09, 29.41) with approximately 91% confidence

    (Y_{35},Y_{48}) = (23.87, 31.87) with approximately 96% confidence

    (Y_{34},Y_{49}) = (23.49, 34.02) with approximately 98% confidence

The following shows the calculation for two of the confidence intervals, one for \tau_{0.25} and one for \tau_{0.75}.

    \displaystyle P(Y_{6} < \tau_{0.25} < Y_{21})=\sum \limits_{k=6}^{20} \binom{54}{k} \ 0.25^k \ (0.75)^{54-k}=0.979889918

    \displaystyle P(Y_{34} < \tau_{0.75} < Y_{49})=\sum \limits_{k=34}^{48} \binom{54}{k} \ 0.75^k \ (0.25)^{54-k}=0.979889918
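The two probabilities above agree because the two sums are mirror images of each other: having k sample items below \tau_{0.25} with success probability 0.25 corresponds to having 54-k sample items below \tau_{0.75} with success probability 0.75. The following sketch computes the confidence levels for the quartile intervals listed above; `confidence_level` is a helper name made up for this illustration.

```python
from math import comb

def confidence_level(n, i, j, p):
    """P(Y_i < tau_p < Y_j) as a binomial sum over k = i, ..., j-1."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(i, j))

n = 54
lower = {(i, j): confidence_level(n, i, j, 0.25) for i, j in [(7, 20), (6, 21), (5, 22)]}
upper = {(i, j): confidence_level(n, i, j, 0.75) for i, j in [(36, 47), (35, 48), (34, 49)]}

for ranks, level in {**lower, **upper}.items():
    print(ranks, round(level, 4))

# Mirror-image pair: ranks (6, 21) at p = 0.25 and ranks (34, 49) at p = 0.75.
print(abs(lower[(6, 21)] - upper[(34, 49)]))
```

The last printed value should be zero up to floating point rounding.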

________________________________________________________________________
\copyright \ \text{2015 by Dan Ma}

Confidence intervals for percentiles

The order statistics play an important role in both descriptive statistics and non-parametric inference. Sample percentiles (median, quartiles, etc.) can be defined using the order statistics and can be used as point estimates for the corresponding percentiles in the population. For example, with a random sample of size n=24, the 6^{th} order statistic Y_6 is the sample 24^{th} percentile and is an estimate of the unknown population 24^{th} percentile \tau_{0.24}. The justification is that the area under the density curve of the distribution and to the left of Y_6 is on average \frac{6}{24+1}=0.24 (see the discussion below). The order statistics can also be used for constructing confidence intervals for unknown population percentiles. Such confidence intervals are often called distribution-free confidence intervals because no information about the underlying distribution is used in the construction. In the previous post (The order statistics and the uniform distribution), an example was given demonstrating how confidence intervals for percentiles of a continuous distribution are constructed. In this post, we describe the general algorithm in greater detail and present another example. For more information about distribution-free inference, see [Hollander & Wolfe].

Let X_1,X_2, \cdots, X_n be a random sample drawn from a continuous distribution with X, F(x) and f(x) denoting the common random variable, the common distribution function and probability density function, respectively. Let Y_1<Y_2< \cdots <Y_n be the associated order statistics. Let W_i=F(Y_i). Note that F(Y_i) can be interpreted as an area under the density curve:

    \displaystyle W_i=F(Y_i)=\int_{-\infty}^{Y_i}f(x) dx

In the previous post (The order statistics and the uniform distribution), we showed that \displaystyle E[W_i]=\frac{i}{n+1}. On this basis, Y_i is defined as the sample (100p)^{th} percentile where \displaystyle p=\frac{i}{n+1} and is used as an estimate for the unknown population (100p)^{th} percentile.
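The fact E[W_i]=\frac{i}{n+1} can be checked with exact arithmetic. The sketch below assumes (as shown in the post on the uniform distribution) that W_i=F(Y_i) is distributed like the i^{th} order statistic of a sample of size n from U(0,1), whose density is \frac{n!}{(i-1)! (n-i)!} w^{i-1} (1-w)^{n-i}. It expands (1-w)^{n-i} by the binomial theorem and integrates term by term. The function name `expected_area` is hypothetical.

```python
from fractions import Fraction
from math import comb, factorial

def expected_area(i, n):
    """E[W_i], computed as the integral of w * density(w) over (0, 1)."""
    m = n - i
    const = Fraction(factorial(n), factorial(i - 1) * factorial(m))
    # Integral of w^i (1-w)^m over (0,1), using (1-w)^m = sum_r C(m,r) (-w)^r.
    integral = sum(Fraction((-1) ** r * comb(m, r), i + r + 1) for r in range(m + 1))
    return const * integral

print(expected_area(6, 24))                          # 6/(24+1)
print([expected_area(i, 10) for i in range(1, 11)])  # i/11 for i = 1, ..., 10
```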

The construction of confidence intervals for percentiles is based on the probability P[Y_i < \tau_p < Y_j] where \tau_p is the (100p)^{th} percentile. Let’s take the median as an example and consider P[Y_2 < \tau_{0.5} < Y_8]. For Y_2 < \tau_{0.5} to happen, there must be at least two sample items X_k that are less than \tau_{0.5}. For \tau_{0.5} < Y_8 to happen, there can be no more than 7 sample items X_k that are less than \tau_{0.5}. In drawing each sample item, consider X < \tau_{0.5} as a success. The probability of a success is thus p=P[X<\tau_{0.5}]=0.5. We are interested in the probability of having at least 2 and at most 7 successes. Thus we have:

    \displaystyle P[Y_2 < \tau_{0.5} < Y_8]=\sum \limits_{k=2}^{7} \binom{n}{k} \biggl(\frac{1}{2}\biggr)^k \biggl(\frac{1}{2}\biggr)^{n-k}=1-\alpha

Then the interval (Y_2,Y_8) is taken to be the 100(1-\alpha) % confidence interval for the unknown population median.

The above discussion can easily be generalized. The following computes the probability P[Y_i < \tau_p < Y_j] where \tau_p is the (100p)^{th} percentile and p=P[X < \tau_p].

    \displaystyle P[Y_i < \tau_{p} < Y_j]=\sum \limits_{k=i}^{j-1} \binom{n}{k} p^k (1-p)^{n-k}=1-\alpha

Then the interval (Y_i,Y_j) is taken to be the 100(1-\alpha) % confidence interval for the unknown population percentile \tau_p. The above probability is based on the binomial distribution with parameters n and p=P(X<\tau_p). Its mean is np and its variance is np(1-p). This fact becomes useful when using normal approximation of the above probability.

Note that the wider the interval estimate, the more confidence can be attached. On the other hand, the more precise the interval estimate, the less confidence can be attached to the interval. This is true for parametric methods and is also true for the non-parametric method at hand. Though this is clear, we would like to call this out for the sake of completeness. For example, as confidence intervals for the median, (Y_2,Y_{15}) has a higher confidence level than the interval (Y_6,Y_{10}). Note that of the two probabilities below, the first one is higher.

    \displaystyle P[Y_2 < \tau_{0.5} < Y_{15}]=\sum \limits_{k=2}^{14} \binom{n}{k} \biggl(\frac{1}{2}\biggr)^k \biggl(\frac{1}{2}\biggr)^{n-k}

    \displaystyle P[Y_6 < \tau_{0.5} < Y_{10}]=\sum \limits_{k=6}^{9} \binom{n}{k} \biggl(\frac{1}{2}\biggr)^k \biggl(\frac{1}{2}\biggr)^{n-k}

Example
The following matrix contains a random sample of n=15 grocery purchase amounts of a certain family in 2009. The data are arranged in increasing order on each row from left to right.

    \displaystyle \begin{pmatrix} 3.34&14.70&45.71&47.69&48.25 \\{52.22}&57.25&60.79&63.87&66.85 \\{88.13}&101.81&147.33&165.10&168.28 \end{pmatrix}

Find the sample median and the sample upper quartile. Construct an approximate 96% confidence interval for the population median.

The sample median \hat{\tau}_{0.5}=60.79, the 8^{th} grocery purchase. The upper quartile (75^{th} percentile) is the 12^{th} grocery purchase 101.81.

To construct a confidence interval for the median, we need to compute the probability P[Y_i< \tau_{0.5} < Y_j]. We use the interval (Y_{4},Y_{12}) because of the following probability:

    \displaystyle P[Y_{4} < \tau_{0.5} < Y_{12}]=\sum \limits_{k=4}^{11} \binom{15}{k} \biggl(\frac{1}{2}\biggr)^k \biggl(\frac{1}{2}\biggr)^{15-k}=0.96484375

Thus the interval (47.69,101.81) is an approximate 96% confidence interval for the median grocery purchase amount for this family. The above calculation is made using an Excel spreadsheet. Let's compare this answer with a normal approximation. The mean of the binomial distribution in question is 15(0.5)=7.5 and the variance is 15(0.5)(0.5)=3.75. Consider the following:

    \displaystyle \Phi \biggl(\frac{11.5-7.5}{\sqrt{3.75}}\biggr)-\Phi \biggl(\frac{3.5-7.5}{\sqrt{3.75}}\biggr)

    \displaystyle =\Phi \biggl(2.07\biggr)-\Phi \biggl(-2.07\biggr)=0.9808-0.0192=0.9616

The normal approximation is quite good.
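For readers without a spreadsheet at hand, the comparison above can be reproduced with a few lines of Python. This is only a minimal sketch using the standard library; the variable names are our own choices and are not part of the original calculation.

```python
from math import comb, sqrt
from statistics import NormalDist

n = 15            # sample size in the grocery example
lo, hi = 4, 12    # the interval (Y_4, Y_12)

# Exact coverage: P[4 <= B <= 11] where B is binomial with n = 15, p = 1/2
exact = sum(comb(n, k) for k in range(lo, hi)) / 2**n

# Normal approximation with continuity correction, z carried without rounding
mean, sd = n * 0.5, sqrt(n * 0.25)
normal = NormalDist(mean, sd)
approx = normal.cdf(hi - 0.5) - normal.cdf(lo - 0.5)

print(exact)    # 0.96484375
print(approx)
```

Because z is not rounded to 2.07 here, the approximate value comes out slightly below the 0.9616 obtained from the table lookup.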

Reference
Myles Hollander and Douglas A. Wolfe, Non-parametric Statistical Methods, Second Edition, Wiley (1999)

________________________________________________________________________
\copyright \ \text{2010 to 2015 by Dan Ma}

The order statistics and the uniform distribution

In this post, we show that the order statistics of the uniform distribution on the unit interval are distributed according to the beta distributions. This leads to a discussion on estimation of percentiles using order statistics. We also present an example of using order statistics to construct confidence intervals of population percentiles. For a discussion on the distributions of order statistics of random samples drawn from a continuous distribution, see the previous post The distributions of the order statistics.

Suppose that we have a random sample X_1,X_2,\cdots,X_n of size n from a continuous distribution with common distribution function F_X(x)=F(x) and common density function f_X(x)=f(x). The order statistics Y_1<Y_2< \cdots <Y_n are obtained by ordering the sample X_1,X_2,\cdots,X_n in ascending order. In other words, Y_1 is the smallest item in the sample and Y_2 is the second smallest item in the sample and so on. Since this is random sampling from a continuous distribution, we assume that the probability of a tie between two order statistics is zero. In the previous post The distributions of the order statistics, we derive the probability density function of the i^{th} order statistic:

    \displaystyle f_{Y_i}(y)=\frac{n!}{(i-1)! (n-i)!} \thinspace F(y)^{i-1} \thinspace [1-F(y)]^{n-i} f(y)

The Order Statistics of the Uniform Distribution
Suppose that the random sample X_1,X_2, \cdots, X_n is drawn from U(0,1). Since the distribution function of U(0,1) is F(y)=y where 0<y<1, the probability density function of the i^{th} order statistic is:

    \displaystyle f_{Y_i}(y)=\frac{n!}{(i-1)! (n-i)!} \thinspace y^{i-1} \thinspace [1-y]^{n-i} where 0<y<1.

The above density function is from the family of beta distributions. In general, the pdf of a beta distribution and its mean and variance are:

    \displaystyle f_{W}(w)=\frac{\Gamma(a+b)}{\Gamma(a) \Gamma(b)} \thinspace w^{a-1} \thinspace [1-w]^{b-1} where 0<w<1 and \Gamma(\cdot) is the gamma function.

      \displaystyle E[W]=\frac{a}{a+b}

      \displaystyle Var[W]=\frac{ab}{(a+b)^2 (a+b+1)}

Then, the following shows the pdf of the i^{th} order statistic of the uniform distribution on the unit interval and its mean and variance:

    \displaystyle f_{Y_i}(y)=\frac{\Gamma(n+1)}{\Gamma(i) \Gamma(n-i+1)} \thinspace y^{i-1} \thinspace [1-y]^{(n-i+1)-1} where 0<y<1.

      \displaystyle E[Y_i]=\frac{i}{i+(n-i+1)}=\frac{i}{n+1}

      \displaystyle Var[Y_i]=\frac{i(n-i+1)}{(n+1)^2 (n+2)}
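The mean and variance above can be checked empirically. The following is a rough simulation sketch, not a proof; the helper name simulate_order_statistic, the choice n=5, i=2, the number of trials and the seed are all arbitrary assumptions made for illustration.

```python
import random
from statistics import fmean, pvariance

def simulate_order_statistic(i, n, trials=20000, seed=0):
    """Draw `trials` samples of size n from U(0,1); return the i-th smallest of each."""
    rng = random.Random(seed)
    return [sorted(rng.random() for _ in range(n))[i - 1] for _ in range(trials)]

n, i = 5, 2
draws = simulate_order_statistic(i, n)

# Mean and variance of the beta distribution with a = i and b = n - i + 1
theory_mean = i / (n + 1)
theory_var = i * (n - i + 1) / ((n + 1) ** 2 * (n + 2))

print(fmean(draws), theory_mean)
print(pvariance(draws), theory_var)
```

The simulated values should land close to the theoretical ones, with the gap shrinking as the number of trials grows.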

Estimation of Percentiles
In descriptive statistics, we define the sample percentiles using the order statistics (even though the term order statistics may not be used in a non-calculus based introductory statistics course). For example, if the sample size is an odd integer n=2m+1, then the sample median is the order statistic Y_{m+1}. The preceding discussion on the order statistics of the uniform distribution can show us that this approach is a sound one.

Suppose we have a random sample of size n from an arbitrary continuous distribution. The order statistics listed in ascending order are:

    \displaystyle Y_1<Y_2<Y_3< \cdots <Y_n

For each i \le n, consider W_i=F(Y_i). Since the distribution function F(x) is a non-decreasing function, the W_i are also in increasing order (and, with probability one, the inequalities are strict):

    \displaystyle W_1<W_2<W_3< \cdots <W_n

It can be shown that if F(x) is a distribution function of a continuous random variable X, then the transformation F(X) follows the uniform distribution U(0,1). Then the following transformed random sample:

    \displaystyle F(X_1),F(X_2), \cdots, F(X_n)

is drawn from the uniform distribution U(0,1). Furthermore, the W_i are the order statistics for this random sample. By the preceding discussion, \displaystyle E[W_i]=E[F(Y_i)]=\frac{i}{n+1}. Note that F(Y_i) is the area under the density function f(x) and to the left of Y_i. Thus F(Y_i) is a random area and E[W_i]=E[F(Y_i)] is the expected area under the density curve f(x) to the left of Y_i. Recall that f(x) is the common density function of the original sample X_1,X_2,\cdots,X_n.

For example, suppose the sample size n is an odd integer where n=2m+1. Then the sample median is Y_{m+1}. Note that \displaystyle E[W_{m+1}]=\frac{m+1}{n+1}=\frac{1}{2}. Thus if we choose Y_{m+1} as a point estimate for the population median, Y_{m+1} is expected to be above the bottom 50% of the population and is expected to be below the upper 50% of the population.

Furthermore, E[W_i - W_{i-1}] is the expected area under the density curve and between Y_i and Y_{i-1}. This expected area is:

    \displaystyle E[W_i - W_{i-1}]=E[F(Y_i)]-E[F(Y_{i-1})]=\frac{i}{n+1}-\frac{i-1}{n+1}=\frac{1}{n+1}

The expected area under the density curve and above the maximum order statistic Y_n is:

    \displaystyle E[1-F(Y_n)]=1-\frac{n}{n+1}=\frac{1}{n+1}

Consequently, here is an interesting observation about the order statistics Y_1<Y_2<Y_3< \cdots <Y_n. The order statistics Y_i divide the area under the density curve f(x) and above the x-axis into n+1 areas. On average, each of these areas is \displaystyle \frac{1}{n+1}.
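The observation does not depend on the underlying distribution, and a small simulation can illustrate this. The sketch below assumes, purely for illustration, an exponential distribution with rate 1, so that F(x)=1-e^{-x}; the function name expected_areas, the sample size n=4 and the seed are our own choices.

```python
import math
import random

def expected_areas(n, trials=20000, seed=1):
    """Average area under the density between consecutive order statistics.

    Samples come from an exponential distribution with rate 1, whose
    distribution function is F(x) = 1 - exp(-x).
    """
    rng = random.Random(seed)
    totals = [0.0] * (n + 1)
    for _ in range(trials):
        # W_i = F(Y_i), obtained by sorting the transformed sample
        w = sorted(1 - math.exp(-rng.expovariate(1.0)) for _ in range(n))
        cuts = [0.0] + w + [1.0]
        for k in range(n + 1):
            totals[k] += cuts[k + 1] - cuts[k]
    return [t / trials for t in totals]

areas = expected_areas(n=4)
print(areas)   # each of the n+1 = 5 entries should be close to 1/5
```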

As a result, it makes sense to use order statistics as estimators of percentiles. For example, we can use Y_i as the (100p)^{th} percentile of the sample where \displaystyle p=\frac{i}{n+1}. Then Y_i is an estimator of the population percentile \tau_{p} where the area under the density curve f(x) and to the left of \tau_{p} is p. In the case that (n+1)p is not an integer, we interpolate between two order statistics. For example, if (n+1)p=5.7, then we interpolate between Y_5 and Y_6.

Example
Suppose we have a random sample of size n=11 drawn from a continuous distribution. Find estimators for the median, first quartile and third quartile. Find an estimate for the 85^{th} percentile. Construct an 87% confidence interval for the 40^{th} percentile.

The estimator for the median is Y_6. The estimator for the first quartile (25^{th} percentile) is the third order statistic Y_3. The estimator for the third quartile (75^{th} percentile) is the ninth order statistic Y_9. Based on the preceding discussion, the expected areas under the density curve f(x) to the left of Y_3,Y_6,Y_9 are 0.25, 0.5 and 0.75, respectively.

To estimate the 85^{th} percentile, note that (n+1)p=12(0.85)=10.2. Thus we interpolate between Y_{10} and Y_{11}. In our example, we use linear interpolation, though taking the arithmetic average of Y_{10} and Y_{11} is also a valid approach. The following is an estimate of the 85^{th} percentile.

    \displaystyle \hat{\tau}_{0.85}=0.8Y_{10}+0.2Y_{11}
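The interpolation rule can be written as a short function. The following is a sketch under the convention used in this post (position (n+1)p with linear interpolation); the function name percentile_estimate and the sample values 1 through 11 are hypothetical and serve only to make the weights visible.

```python
import math

def percentile_estimate(sample, p):
    """Estimate the 100p-th percentile by interpolating at position (n+1)p."""
    y = sorted(sample)        # order statistics Y_1 <= ... <= Y_n
    n = len(y)
    pos = (n + 1) * p
    k = math.floor(pos)       # index of the lower order statistic
    frac = pos - k            # weight placed on the upper order statistic
    if k < 1 or k > n or (k == n and frac > 0):
        raise ValueError("(n+1)p must lie between 1 and n")
    if frac == 0:
        return y[k - 1]
    return (1 - frac) * y[k - 1] + frac * y[k]

sample = list(range(1, 12))   # hypothetical data with Y_i = i, so n = 11
print(percentile_estimate(sample, 0.85))   # close to 0.8(10) + 0.2(11) = 10.2
print(percentile_estimate(sample, 0.5))    # the median Y_6
```

Other software packages use different interpolation conventions, so their percentile functions may return slightly different values for the same data.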

To find the confidence interval, consider the probability P[Y_2 < \tau_{0.4} < Y_7] where \tau_{0.4} is the 40^{th} percentile. Consider the event X \le \tau_{0.4} as a success with probability of success p=0.4. For Y_2 < \tau_{0.4} < Y_7 to happen, there must be at least 2 successes and fewer than 7 successes in the binomial distribution with n=11 and p=0.4. Thus we have:

    \displaystyle P[Y_2 < \tau_{0.4} < Y_7]=\sum \limits_{j=2}^{6} \binom{11}{j} 0.4^{j} 0.6^{11-j}=0.8704

Thus the interval (Y_2,Y_7) can be taken as an 87% confidence interval for \tau_{0.4}. This is an example of a distribution-free confidence interval because nothing is assumed about the underlying distribution in the construction of the confidence interval.
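Since the coverage probability is just a binomial sum, it is easy to evaluate for any candidate pair of order statistics. The following is a minimal sketch; the function name coverage and its argument order are our own assumptions.

```python
from math import comb

def coverage(n, p, i, j):
    """P[Y_i < tau_p < Y_j]: at least i and fewer than j of the n draws fall below tau_p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(i, j))

print(round(coverage(11, 0.4, 2, 7), 4))   # 0.8704
```

In practice one can evaluate this function over several pairs (i,j) and pick the pair whose coverage is closest to the desired confidence level.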

________________________________________________________________________

The distributions of the order statistics

Sample statistics such as sample median, sample quartiles and sample minimum and maximum play a prominent role in the analysis using empirical data (e.g. in descriptive statistics and exploratory data analysis (EDA)). In this post we discuss order statistics and their distributions. The order statistics are the items from a random sample arranged in increasing order. The focus here is to present the distribution functions and probability density functions of order statistics. The order statistics are important tools in non-parametric statistical inferences. In subsequent posts, we will present examples of applications in non-parametric methods.

In this post, we only consider random samples obtained from a continuous distribution (i.e. the distribution function is a continuous function). Let X_1,X_2, \cdots, X_n be a random sample of size n from a continuous distribution with distribution function F(x). We order the random sample in increasing order and obtain Y_1,Y_2, \cdots, Y_n. In other words, we have:

    Y_1= the smallest of X_1,X_2, \cdots, X_n
    Y_2= the second smallest of X_1,X_2, \cdots, X_n
    \cdot
    \cdot
    \cdot
    Y_n= the largest of X_1,X_2, \cdots, X_n

We set Y_{min}=Y_1 and Y_{max}=Y_n. The order statistic Y_i is called the i^{th} order statistic. Since we are working with a continuous distribution, we assume that the probability of two sample items being equal is zero. Thus we can assume that Y_1<Y_2< \cdots <Y_n. That is, the probability of a tie is zero among the order statistics.

The Distribution Functions of the Order Statistics
The distribution function of Y_i is an upper tail of a binomial distribution. If the event Y_i \le y occurs, then there are at least i many X_j in the sample that are less than or equal to y. Consider the event that X \le y as a success and F(y)=P[X \le y] as the probability of success. Then the drawing of each sample item becomes a Bernoulli trial (a success or a failure). We are interested in the probability of having at least i many successes. Thus the following is the distribution function of Y_i:

    \displaystyle F_{Y_i}(y)=P[Y_i \le y]=\sum \limits_{k=i}^{n} \binom{n}{k} F(y)^k [1-F(y)]^{n-k}\ \ \ \ \ \ \ \ \ \ \ \  (1)

The following relationship is used in deriving the probability density function:

    \displaystyle F_{Y_i}(y)=F_{Y_{i-1}}(y)-\binom{n}{i-1} F(y)^{i-1} [1-F(y)]^{n-i+1} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)
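Formulas (1) and (2) can be checked numerically. The sketch below takes u=F(y) as an input so that it works for any continuous distribution; the function name order_stat_cdf and the values n=11, u=0.3 and i=6 are arbitrary choices for illustration.

```python
from math import comb

def order_stat_cdf(i, n, u):
    """Formula (1): P[Y_i <= y], where u = F(y) is the probability of success."""
    return sum(comb(n, k) * u**k * (1 - u) ** (n - k) for k in range(i, n + 1))

n, u = 11, 0.3

# Extreme cases: the minimum and the maximum
print(order_stat_cdf(1, n, u), 1 - (1 - u) ** n)
print(order_stat_cdf(n, n, u), u**n)

# Relationship (2) for i = 6: drop the k = i-1 term from the sum for Y_{i-1}
i = 6
rhs = order_stat_cdf(i - 1, n, u) - comb(n, i - 1) * u ** (i - 1) * (1 - u) ** (n - i + 1)
print(order_stat_cdf(i, n, u), rhs)
```

Each printed pair should agree up to floating point rounding.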

The Probability Density Functions of the Order Statistics
The probability density function of Y_i is given by:

    \displaystyle f_{Y_i}(y)=\frac{n!}{(i-1)! (n-i)!} \thinspace F(y)^{i-1} \thinspace [1-F(y)]^{n-i} f_X(y) \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (3)

We prove this by induction. Consider i=1. Note that F_{Y_1}(y) is the probability that at least one X_j \le y and is the complement of the probability of having no X_j \le y. Thus F_{Y_1}(y)=1-[1-F(y)]^n. By taking derivative, we have:

    \displaystyle f_{Y_1}(y)=F_{Y_1}^{\prime}(y)=n [1-F(y)]^{n-1} f_X(y)

As the induction hypothesis, suppose that (3) holds for Y_{i-1}, so that the pdf of Y_{i-1} is the following:

    \displaystyle f_{Y_{i-1}}(y)=\frac{n!}{(i-2)! (n-i+1)!} \thinspace F(y)^{i-2} \thinspace [1-F(y)]^{n-i+1} f_X(y)

Now we take the derivative of (2) above and we have:

    \displaystyle f_{Y_i}(y)=f_{Y_{i-1}}(y)-\biggl[(i-1)\binom{n}{i-1} F(y)^{i-2} f_X(y)[1-F(y)]^{n-i+1}
    \displaystyle -\ \ \ \ \ \ \ \ \ \ \binom{n}{i-1}F(y)^{i-1}(n-i+1)[1-F(y)]^{n-i} f_X(y) \biggr]

After simplifying the right hand side, we obtain the pdf of Y_i as in (3).
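For readers who want the intermediate step, the simplification can be sketched as follows. It uses only the induction hypothesis and the two factorial identities below.

    \displaystyle (i-1)\binom{n}{i-1}=\frac{n!}{(i-2)! (n-i+1)!} \ \ \ \ \ \text{and} \ \ \ \ \ (n-i+1)\binom{n}{i-1}=\frac{n!}{(i-1)! (n-i)!}

By the first identity, the first term inside the bracket is identical to f_{Y_{i-1}}(y), so the two cancel. By the second identity, the remaining term (which carries a positive sign once the bracket is removed) is:

    \displaystyle f_{Y_i}(y)=\frac{n!}{(i-1)! (n-i)!} \thinspace F(y)^{i-1} \thinspace [1-F(y)]^{n-i} f_X(y)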

Comments
We would like to make two comments. One is that in terms of problem solving, it may be better to rely on the distribution function in (1) above to derive the pdf. The thought process behind (1) is clear. The second is that the last three terms in the pdf in (3) are very instructive. Let's arrange these three terms as follows:

    \displaystyle F(y)^{i-1} \thinspace f_X(y) \thinspace [1-F(y)]^{n-i}

Note that the first term is the probability that there are i-1 sample items below y. The middle term indicates that one sample item is right around y. The third term indicates that there are n-i items above y. Thus the following multinomial probability is the pdf in (3):

    \displaystyle f_{Y_i}(y)=\frac{n!}{(i-1)! 1! (n-i)!} \thinspace F(y)^{i-1} \thinspace f_X(y) \thinspace [1-F(y)]^{n-i} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4)

This heuristic approach is further described here.
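Both comments can be illustrated numerically: differentiating the distribution function in (1) should reproduce the multinomial form in (4). The following sketch assumes, for illustration only, an exponential distribution with rate 1; the function names and the evaluation point y=0.7 are our own choices.

```python
from math import comb, exp, factorial

def cdf_order_stat(i, n, y, F):
    """Formula (1): binomial upper tail with probability of success F(y)."""
    u = F(y)
    return sum(comb(n, k) * u**k * (1 - u) ** (n - k) for k in range(i, n + 1))

def pdf_order_stat(i, n, y, F, f):
    """Formula (4): multinomial coefficient times the three terms."""
    coef = factorial(n) // (factorial(i - 1) * factorial(1) * factorial(n - i))
    return coef * F(y) ** (i - 1) * f(y) * (1 - F(y)) ** (n - i)

# Assumed test distribution: exponential with rate 1
F = lambda y: 1 - exp(-y)
f = lambda y: exp(-y)

n, i, y, h = 11, 6, 0.7, 1e-5
# Central difference approximation to the derivative of (1)
numeric = (cdf_order_stat(i, n, y + h, F) - cdf_order_stat(i, n, y - h, F)) / (2 * h)
print(numeric, pdf_order_stat(i, n, y, F, f))
```

The two printed numbers should agree to several decimal places, which is consistent with (though of course not a proof of) the derivation above.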

Example
Suppose that a sample of size n=11 is drawn from the uniform distribution on the interval (0, \theta). Find the pdfs for Y_{min}=Y_1, Y_{max}=Y_{11} and Y_6. Find E[Y_6].

Let X \sim uniform(0,\theta). The distribution function and pdf of X are:

    \displaystyle F(y)=\left\{\begin{matrix}0&\thinspace y<0\\{\displaystyle \frac{y}{\theta}}&\thinspace 0 \le y < \theta\\{1}&\thinspace y \ge \theta\end{matrix}\right.

    \displaystyle f(y)=\left\{\begin{matrix}\displaystyle \frac{1}{\theta}&\thinspace 0<y<\theta\\{0}&\thinspace otherwise\end{matrix}\right.

Using (3), the following are the pdfs of Y_1, Y_{11} and Y_6.

    \displaystyle f_{Y_1}(y)=\frac{11}{\theta^{11}} (\theta-y)^{10}

    \displaystyle f_{Y_{11}}(y)=\frac{11}{\theta^{11}} y^{10}

    \displaystyle f_{Y_6}(y)=2772 \biggl(\frac{y}{\theta}\biggr)^5 \biggl(1-\frac{y}{\theta}\biggr)^5 \frac{1}{\theta}

In this example, Y_6 is the sample median and serves as a point estimate for the population median \frac{\theta}{2}. As an estimator of the median, we prefer that Y_6 neither overestimate nor underestimate \frac{\theta}{2} on average (we call such an estimator an unbiased estimator). In this particular example, the sample median Y_6 is an unbiased estimator of \frac{\theta}{2}. To see this we show E[Y_6]=\frac{\theta}{2}.

    \displaystyle E[Y_6]=\int_0^{\theta}2772 y \biggl(\frac{y}{\theta}\biggr)^5 \biggl(1-\frac{y}{\theta}\biggr)^5 \frac{1}{\theta} dy

By substituting w=\frac{y}{\theta}, we have the following beta integral.

    \displaystyle E[Y_6]=2772 \theta \int_0^1 w^{7-1} (1-w)^{6-1} dw

    \displaystyle E[Y_6]=2772 \theta \thinspace \frac{\Gamma(7) \Gamma(6)}{\Gamma(13)}=2772 \theta \thinspace \frac{6! \thinspace 5!}{12!}=\frac{\theta}{2}
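The arithmetic in the last line can be confirmed with exact rational numbers. This is a small sketch using Python's fractions module; the variable names are our own.

```python
from fractions import Fraction
from math import factorial

n, i = 11, 6

# Constant in the pdf of Y_6: 11! / (5! 5!)
coef = factorial(n) // (factorial(i - 1) * factorial(n - i))

# Beta integral of w^(7-1) (1-w)^(6-1) over (0,1): Gamma(7) Gamma(6) / Gamma(13) = 6! 5! / 12!
beta_integral = Fraction(factorial(6) * factorial(5), factorial(12))

ratio = coef * beta_integral   # this is E[Y_6] divided by theta

print(coef)    # 2772
print(ratio)   # 1/2
```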

________________________________________________________________________

Practice problems

Practice problems are found here in a companion blog.

________________________________________________________________________
\copyright \ \text{2010 - 2015 by Dan Ma} Revised April 6, 2015.