# Conditional Distributions, Part 1

We illustrate the thought process of conditional distributions with a series of examples. These examples are presented in a series of blog posts. In this post, we look at some conditional distributions derived from discrete probability distributions.

Practice problems are found in the companion blog.

_____________________________________________________________________________________________________________________________

The Setting

Suppose we have a discrete random variable $X$ with $f(x)=P(X=x)$ as the probability mass function. Suppose some random experiment can be modeled by the discrete random variable $X$. The sample space $S$ for this probability experiment is the set of sample points with positive probability masses, i.e. $S$ is the set of all $x$ for which $f(x)=P(X=x)>0$. In the examples below, $S$ is either a subset of the real line $\mathbb{R}$ or a subset of the plane $\mathbb{R} \times \mathbb{R}$. Conceivably the sample space could be a subset of a higher-dimensional Euclidean space $\mathbb{R}^n$.

Suppose that we are informed that some event $A$ in the random experiment has occurred ($A$ is a subset of the sample space $S$). Given this new information, all the sample points outside of the event $A$ are irrelevant. Or perhaps, in this random experiment, we are only interested in those outcomes that are elements of some subset $A$ of the sample space $S$. In either of these scenarios, we wish to treat the event $A$ as the new sample space.

The probability of the event $A$, denoted by $P(A)$, is derived by summing the probabilities $f(x)=P(X=x)$ over all the sample points $x \in A$. We have:

$\displaystyle P(A)=\sum_{x \in A} P(X=x)$

The probability $P(A)$ may not be 1.0. So the probability masses $f(x)=P(X=x)$ for the sample points $x \in A$, if they are unadjusted, may not form a probability distribution. However, if we consider each such probability mass $f(x)=P(X=x)$ as a proportion of the probability $P(A)$, then the probability masses of the event $A$ will form a probability distribution. For example, say the event $A$ consists of two probability masses 0.2 and 0.3, which sum to 0.5. Then in the new sample space, the first probability mass is 0.4 (0.2 multiplied by $\displaystyle \frac{1}{0.5}$ or divided by 0.5) and the second probability mass is 0.6.

We now summarize the above paragraph. Using the event $A$ as a new sample space, the probability mass function is:

$\displaystyle f(x \lvert A)=\frac{f(x)}{P(A)}=\frac{P(X=x)}{P(A)}, \ \ \ \ \ \ \ \ \ x \in A$

The above probability distribution is called the conditional distribution of $X$ given the event $A$, denoted by $X \lvert A$. This new probability distribution incorporates new information about the results of a random experiment.

Once this new probability distribution is established, we can compute various distributional quantities (e.g. cumulative distribution function, mean, variance and other higher moments).
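
The rescaling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `conditional_pmf` and `mean` are hypothetical, a distribution is stored as a dictionary from sample points to exact fractions, and the three-point distribution `toy` is made up so that the event $A$ carries the masses 0.2 and 0.3 mentioned earlier.

```python
from fractions import Fraction

def conditional_pmf(pmf, event):
    """Return the pmf of X given A, where A = {x : event(x) is True}."""
    # P(A): sum the original masses over the sample points in A.
    p_event = sum(p for x, p in pmf.items() if event(x))
    if p_event == 0:
        raise ValueError("cannot condition on an event of probability zero")
    # Rescale each mass in A by 1 / P(A); points outside A are dropped.
    return {x: p / p_event for x, p in pmf.items() if event(x)}

def mean(pmf):
    """Mean of a distribution stored as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

# Hypothetical distribution: the event A = {1, 2} holds the masses 0.2 and 0.3.
toy = {1: Fraction(2, 10), 2: Fraction(3, 10), 3: Fraction(5, 10)}
toy_given_a = conditional_pmf(toy, lambda x: x <= 2)

print(toy_given_a)        # masses 2/5 and 3/5, i.e. 0.4 and 0.6
print(mean(toy_given_a))  # 8/5
```

Exact fractions are used so that the rescaled masses sum to exactly 1 with no floating-point rounding.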

_____________________________________________________________________________________________________________________________

Examples

Suppose that two students take a multiple choice test that has 5 questions. Let $X$ be the number of correct answers of one student and $Y$ be the number of correct answers of the other student (these can be considered as test scores for the purpose of the examples here). Assume that $X$ and $Y$ are independent. The following shows the probability functions.

$\displaystyle \begin{array}{ccc} \text{Count of Correct Answers} & P(X=x) & P(Y=y) \\ \hline 0 & 0.4 & 0.1 \\ 1 & 0.2 & 0.1 \\ 2 & 0.1 & 0.2 \\ 3 & 0.1 & 0.2 \\ 4 & 0.1 & 0.2 \\ 5 & 0.1 & 0.2 \end{array}$

Note that $E(X)=1.6$ and $E(Y)=2.9$. Without any additional information, we expect the student represented by $X$ to get 1.6 correct answers on average and the student represented by $Y$ to get 2.9. If having 3 or more correct answers is considered passing, then the student represented by $X$ has a 30% chance of passing while the student represented by $Y$ has a 60% chance of passing. The following examples show how these expectations can change as soon as new information is known.
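
As a quick check of these unconditional figures, the two probability functions can be entered as dictionaries and summed directly. The names `pmf_x`, `pmf_y`, `mean`, `prob` and `passing` are illustrative choices, not part of the original discussion.

```python
from fractions import Fraction as F

# Probability functions from the table above, stored as exact fractions.
pmf_x = {0: F(4, 10), 1: F(2, 10), 2: F(1, 10), 3: F(1, 10), 4: F(1, 10), 5: F(1, 10)}
pmf_y = {0: F(1, 10), 1: F(1, 10), 2: F(2, 10), 3: F(2, 10), 4: F(2, 10), 5: F(2, 10)}

def mean(pmf):
    """Expected value: sum of value times probability."""
    return sum(x * p for x, p in pmf.items())

def prob(pmf, event):
    """Probability of the set of sample points where event(x) is True."""
    return sum(p for x, p in pmf.items() if event(x))

def passing(score):
    """Passing means 3 or more correct answers."""
    return score >= 3

print(float(mean(pmf_x)), float(mean(pmf_y)))                    # 1.6 2.9
print(float(prob(pmf_x, passing)), float(prob(pmf_y, passing)))  # 0.3 0.6
```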

The following examples are based on these two test scores $X$ and $Y$.

Example 1

In this example, we only consider the student whose correct answers are modeled by the random variable $X$. In addition to knowing the probability function $P(X=x)$, we also know that this student has at least one correct answer (i.e. the new information is $X>0$).

In light of the new information, the new sample space is $A=\left\{1,2,3,4,5 \right\}$. Note that $P(A)=0.6$. In this new sample space, each probability mass is the original one divided by 0.6. For example, for the sample point $X=1$, we have $\displaystyle P(X=1 \lvert X>0)=\frac{0.2}{0.6}=\frac{2}{6}$. The following is the conditional probability distribution of $X$ given $X>0$.

$\displaystyle P(X=1 \lvert X>0)=\frac{2}{6}$

$\displaystyle P(X=2 \lvert X>0)=\frac{1}{6}$

$\displaystyle P(X=3 \lvert X>0)=\frac{1}{6}$

$\displaystyle P(X=4 \lvert X>0)=\frac{1}{6}$

$\displaystyle P(X=5 \lvert X>0)=\frac{1}{6}$

The conditional mean is the mean of the conditional distribution. We have $\displaystyle E(X \lvert X>0)=\frac{16}{6}=2.67$. Given that this student is knowledgeable enough to answer some question correctly, the expectation is higher than before knowing the additional information. Also, given the new information, the student in question has a 50% chance of passing (vs. 30% before the new information is known).
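
The calculation in Example 1 can be reproduced with a short sketch. The variable names below are illustrative only, and exact fractions are used so that the results can be compared with the fractions above.

```python
from fractions import Fraction as F

# Probability function of X from the table of test scores.
pmf_x = {0: F(4, 10), 1: F(2, 10), 2: F(1, 10), 3: F(1, 10), 4: F(1, 10), 5: F(1, 10)}

# Condition on the event A = {X > 0}.
p_a = sum(p for x, p in pmf_x.items() if x > 0)              # P(A) = 3/5
x_given_a = {x: p / p_a for x, p in pmf_x.items() if x > 0}  # rescaled masses

cond_mean = sum(x * p for x, p in x_given_a.items())
cond_pass = sum(p for x, p in x_given_a.items() if x >= 3)

print(x_given_a[1])  # 1/3, i.e. 2/6
print(cond_mean)     # 8/3, i.e. 16/6
print(cond_pass)     # 1/2
```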

Example 2

We now look at a joint distribution that has a 2-dimensional sample space. Consider the joint distribution of test scores $X$ and $Y$. If the new information is that the total number of correct answers among them is 4, how would this change our expectation of their performance?

Since $X$ and $Y$ are independent, the sample space is a square as indicated in the figure below.


Figure 1 – Sample Space of Test Scores

Because the two scores are independent, the joint probability at each of these 36 sample points is the product of the individual probabilities. We have $P(X=x,Y=y)=P(X=x) \times P(Y=y)$. The following figure shows one such joint probability.

Figure 2 – Joint Probability Function

After taking the test, suppose that we have the additional information that the two students have a total of 4 correct answers. With this new information, we can focus our attention on the new sample space that is indicated in the following figure.

Figure 3 – New Sample Space

Now we wish to discuss the conditional probability distribution of $X \lvert X+Y=4$ and the conditional probability distribution of $Y \lvert X+Y=4$. In particular, given that there are 4 correct answers between the two students, what would be their expected numbers of correct answers and what would be their chances of passing?

There are 5 sample points in the new sample space (the 5 points circled above). The conditional probability distribution is obtained by expressing each probability mass as a fraction of the sum of the 5 probability masses. First we calculate the 5 joint probabilities.

$\displaystyle P(X=0,Y=4)=P(X=0) \times P(Y=4) =0.4 \times 0.2=0.08$

$\displaystyle P(X=1,Y=3)=P(X=1) \times P(Y=3) =0.2 \times 0.2=0.04$

$\displaystyle P(X=2,Y=2)=P(X=2) \times P(Y=2) =0.1 \times 0.2=0.02$

$\displaystyle P(X=3,Y=1)=P(X=3) \times P(Y=1) =0.1 \times 0.1=0.01$

$\displaystyle P(X=4,Y=0)=P(X=4) \times P(Y=0) =0.1 \times 0.1=0.01$

The sum of these 5 joint probabilities is $P(X+Y=4)=0.16$. Expressing each of these joint probabilities as a fraction of 0.16, we have the following two conditional probability distributions.

$\displaystyle P(X=0 \lvert X+Y=4)=\frac{8}{16} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(Y=0 \lvert X+Y=4)=\frac{1}{16}$

$\displaystyle P(X=1 \lvert X+Y=4)=\frac{4}{16} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(Y=1 \lvert X+Y=4)=\frac{1}{16}$

$\displaystyle P(X=2 \lvert X+Y=4)=\frac{2}{16} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(Y=2 \lvert X+Y=4)=\frac{2}{16}$

$\displaystyle P(X=3 \lvert X+Y=4)=\frac{1}{16} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(Y=3 \lvert X+Y=4)=\frac{4}{16}$

$\displaystyle P(X=4 \lvert X+Y=4)=\frac{1}{16} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(Y=4 \lvert X+Y=4)=\frac{8}{16}$

Now we compute the conditional means given that $X+Y=4$ and compare them against the unconditional means.

$\displaystyle E(X \lvert X+Y=4)=\frac{0+4+4+3+4}{16}=\frac{15}{16}=0.9375 \ \ \ \ \ \ \ \ \text{vs} \ \ E(X)=1.6$

$\displaystyle E(Y \lvert X+Y=4)=\frac{0+1+4+12+32}{16}=\frac{49}{16}=3.0625 \ \ \ \ \ \text{vs} \ \ E(Y)=2.9$

Now compare the chances of passing.

$\displaystyle P(X \ge 3 \lvert X+Y=4)=\frac{2}{16}=0.125 \ \ \ \ \ \ \ \ \ \ \text{vs} \ \ P(X \ge 3)=0.3$

$\displaystyle P(Y \ge 3 \lvert X+Y=4)=\frac{12}{16}=0.75 \ \ \ \ \ \ \ \ \text{vs} \ \ P(Y \ge 3)=0.6$

Based on the new information of $X+Y=4$, we have a lower expectation for the student represented by $X$ and a higher expectation for the student represented by $Y$. Observe that the conditional probability at $X=0$ increases to 0.5 from 0.4, while the conditional probability at $Y=4$ increases to 0.5 from 0.2.
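
For readers who want to verify Example 2 numerically, the following is a minimal sketch that builds the 36-point joint distribution, restricts it to the event $X+Y=4$, and reads off the two conditional distributions. The helpers `condition`, `marginal`, `mean` and `prob` are hypothetical names introduced here for illustration.

```python
from fractions import Fraction as F

pmf_x = {0: F(4, 10), 1: F(2, 10), 2: F(1, 10), 3: F(1, 10), 4: F(1, 10), 5: F(1, 10)}
pmf_y = {0: F(1, 10), 1: F(1, 10), 2: F(2, 10), 3: F(2, 10), 4: F(2, 10), 5: F(2, 10)}

# Independence: each joint mass is the product of the two marginal masses.
joint = {(x, y): px * py for x, px in pmf_x.items() for y, py in pmf_y.items()}

def condition(joint_pmf, event):
    """Restrict a joint pmf to the points where event(x, y) holds and rescale."""
    p_event = sum(p for (x, y), p in joint_pmf.items() if event(x, y))
    return {(x, y): p / p_event for (x, y), p in joint_pmf.items() if event(x, y)}

def marginal(joint_pmf, axis):
    """Sum a joint pmf down to one coordinate (axis 0 for X, axis 1 for Y)."""
    out = {}
    for point, p in joint_pmf.items():
        out[point[axis]] = out.get(point[axis], 0) + p
    return out

def mean(pmf):
    return sum(v * p for v, p in pmf.items())

def prob(pmf, event):
    return sum(p for v, p in pmf.items() if event(v))

given_total_4 = condition(joint, lambda x, y: x + y == 4)
x_cond = marginal(given_total_4, 0)  # distribution of X given X + Y = 4
y_cond = marginal(given_total_4, 1)  # distribution of Y given X + Y = 4

print(mean(x_cond), mean(y_cond))    # 15/16 49/16

# Conditional chances of passing (3 or more correct answers).
pass_x = prob(x_cond, lambda s: s >= 3)
pass_y = prob(y_cond, lambda s: s >= 3)
print(pass_x, pass_y)
```

Conditioning is done on the joint distribution first and the marginals are taken afterwards, which mirrors the order of the steps in the example.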

Example 3

Now suppose the new information is that the two students do well on the test. Particularly, their combined number of correct answers is greater than or equal to 5, i.e., $X+Y \ge 5$. How would this impact the conditional distributions?

First we discuss the conditional distributions for $X \lvert X+Y \ge 5$ and $Y \lvert X+Y \ge 5$. By considering the new information, the following is the new sample space.

Figure 4 – New Sample Space

To derive the conditional distribution of $X \lvert X+Y \ge 5$, sum the joint probabilities within the new sample space for each $X=x$. The calculation is shown below.

$\displaystyle P(X=0 \cap X+Y \ge 5)=0.4 \times 0.2=0.08$

$\displaystyle P(X=1 \cap X+Y \ge 5)=0.2 \times (0.2+0.2)=0.08$

$\displaystyle P(X=2 \cap X+Y \ge 5)=0.1 \times (0.2+0.2+0.2)=0.06$

$\displaystyle P(X=3 \cap X+Y \ge 5)=0.1 \times (0.2+0.2+0.2+0.2)=0.08$

$\displaystyle P(X=4 \cap X+Y \ge 5)=0.1 \times (1-0.1)=0.09$

$\displaystyle P(X=5 \cap X+Y \ge 5)=0.1 \times (1)=0.10$

The sum of these probabilities is 0.49, which is $P(X+Y \ge 5)$. The conditional distribution of $X \lvert X+Y \ge 5$ is obtained by taking each of the above probabilities as a fraction of 0.49. We have:

$\displaystyle P(X=0 \lvert X+Y \ge 5)=\frac{8}{49}=0.163$

$\displaystyle P(X=1 \lvert X+Y \ge 5)=\frac{8}{49}=0.163$

$\displaystyle P(X=2 \lvert X+Y \ge 5)=\frac{6}{49}=0.122$

$\displaystyle P(X=3 \lvert X+Y \ge 5)=\frac{8}{49}=0.163$

$\displaystyle P(X=4 \lvert X+Y \ge 5)=\frac{9}{49}=0.184$

$\displaystyle P(X=5 \lvert X+Y \ge 5)=\frac{10}{49}=0.204$

We have the conditional mean $\displaystyle E(X \lvert X+Y \ge 5)=\frac{0+8+12+24+36+50}{49}=\frac{130}{49}=2.653$ (vs. $E(X)=1.6$). The conditional probability of passing is $\displaystyle P(X \ge 3 \lvert X+Y \ge 5)=\frac{27}{49}=0.55$ (vs. $P(X \ge 3)=0.3$).

Note that the above conditional distribution for $X \lvert X+Y \ge 5$ is not as skewed as the original one for $X$. With the information that both test takers do well, the expected score for the student represented by $X$ is much higher.

A similar calculation gives the following results for the conditional distribution of $Y \lvert X+Y \ge 5$.

$\displaystyle P(Y=0 \lvert X+Y \ge 5)=\frac{1}{49}=0.02$

$\displaystyle P(Y=1 \lvert X+Y \ge 5)=\frac{2}{49}=0.04$

$\displaystyle P(Y=2 \lvert X+Y \ge 5)=\frac{6}{49}=0.122$

$\displaystyle P(Y=3 \lvert X+Y \ge 5)=\frac{8}{49}=0.163$

$\displaystyle P(Y=4 \lvert X+Y \ge 5)=\frac{12}{49}=0.245$

$\displaystyle P(Y=5 \lvert X+Y \ge 5)=\frac{20}{49}=0.408$

We have the conditional mean $\displaystyle E(Y \lvert X+Y \ge 5)=\frac{0+2+12+24+48+100}{49}=\frac{186}{49}=3.8$ (vs. $E(Y)=2.9$). The conditional probability of passing is $\displaystyle P(Y \ge 3 \lvert X+Y \ge 5)=\frac{40}{49}=0.82$ (vs. $P(Y \ge 3)=0.6$). Indeed, with the information that both test takers do well, we can expect much higher results from each individual test taker.
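
The figures in Example 3 can be checked with a sketch that follows the same route as the hand calculation: for each value of one score, multiply its probability by the probability that the other score is large enough to bring the total to at least 5. The variable names are illustrative assumptions.

```python
from fractions import Fraction as F

pmf_x = {0: F(4, 10), 1: F(2, 10), 2: F(1, 10), 3: F(1, 10), 4: F(1, 10), 5: F(1, 10)}
pmf_y = {0: F(1, 10), 1: F(1, 10), 2: F(2, 10), 3: F(2, 10), 4: F(2, 10), 5: F(2, 10)}

# P(X = x and X + Y >= 5) = P(X = x) * P(Y >= 5 - x), by independence.
joint_x = {x: px * sum(py for y, py in pmf_y.items() if x + y >= 5)
           for x, px in pmf_x.items()}
# P(Y = y and X + Y >= 5) = P(Y = y) * P(X >= 5 - y), by independence.
joint_y = {y: py * sum(px for x, px in pmf_x.items() if x + y >= 5)
           for y, py in pmf_y.items()}

p_event = sum(joint_x.values())  # P(X + Y >= 5)
x_cond = {x: p / p_event for x, p in joint_x.items()}
y_cond = {y: p / p_event for y, p in joint_y.items()}

def mean(pmf):
    return sum(v * p for v, p in pmf.items())

def prob_pass(pmf):
    return sum(p for v, p in pmf.items() if v >= 3)

print(p_event)                               # 49/100
print(mean(x_cond), mean(y_cond))            # 130/49 186/49
print(prob_pass(x_cond), prob_pass(y_cond))  # 27/49 40/49
```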

Example 4

In Examples 2 and 3, the new information involves both test takers (both random variables). If the new information involves just one test taker, it may be immaterial to the exam score of the other student. For example, suppose that $Y \ge 4$. Then what is the conditional distribution for $X \lvert Y \ge 4$? Since $X$ and $Y$ are independent, $\displaystyle P(X=x \lvert Y \ge 4)=\frac{P(X=x) \times P(Y \ge 4)}{P(Y \ge 4)}=P(X=x)$, so the high score $Y \ge 4$ has no impact on the score $X$. However, the high joint score $X+Y \ge 5$ does have an impact on each of the individual scores (Example 3).
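
The claim in Example 4 can also be checked numerically. The sketch below conditions the joint distribution on $Y \ge 4$ and compares the resulting distribution of $X$ with the original one; the variable names are illustrative assumptions.

```python
from fractions import Fraction as F

pmf_x = {0: F(4, 10), 1: F(2, 10), 2: F(1, 10), 3: F(1, 10), 4: F(1, 10), 5: F(1, 10)}
pmf_y = {0: F(1, 10), 1: F(1, 10), 2: F(2, 10), 3: F(2, 10), 4: F(2, 10), 5: F(2, 10)}

# Joint pmf under independence.
joint = {(x, y): px * py for x, px in pmf_x.items() for y, py in pmf_y.items()}

# Condition on the event Y >= 4, which involves only one of the two scores.
p_event = sum(p for (x, y), p in joint.items() if y >= 4)  # P(Y >= 4) = 2/5

x_given_high_y = {}
for (x, y), p in joint.items():
    if y >= 4:
        # Accumulate the rescaled joint masses for each value of X.
        x_given_high_y[x] = x_given_high_y.get(x, 0) + p / p_event

print(x_given_high_y == pmf_x)  # True: conditioning on Y >= 4 leaves X unchanged
```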

_____________________________________________________________________________________________________________________________

Summary

We conclude with a summary of the thought process of conditional distributions.

Suppose $X$ is a discrete random variable and $f(x)=P(X=x)$ is its probability function. Further suppose that $X$ is the probability model of some random experiment. The sample space of this random experiment is $S$.

Suppose we have some new information that in this random experiment, some event $A$ has occurred. The event $A$ is a subset of the sample space $S$.

To incorporate this new information, the event $A$ becomes the new sample space. The random variable that incorporates the new information, denoted by $X \lvert A$, has a conditional probability distribution. The following is the probability function of the conditional distribution.

$\displaystyle f(x \lvert A)=\frac{f(x)}{P(A)}=\frac{P(X=x)}{P(A)}, \ \ \ \ \ \ \ \ \ x \in A$

where $P(A)$ = $\displaystyle \sum_{x \in A} P(X=x)$.

The thought process is that the conditional distribution is derived by taking each original probability mass as a fraction of the total probability $P(A)$. The probability function derived in this manner reflects the new information that the event $A$ has occurred.

Once the conditional probability function is derived, it can be used just like any other probability function, e.g. computationally for finding various distributional quantities.

_____________________________________________________________________________________________________________________________

Practice Problems

Practice problems are found in the companion blog.

_____________________________________________________________________________________________________________________________

$\copyright \ \text{2013 by Dan Ma}$