Week 4: Probability

Learning goals

By the end of this session, you should be able to…

  • Calculate probabilities by hand
  • Calculate expectation and variance of a random variable
  • Calculate the expectation and variance of a sum of n random variables and relate this to the terms:
    • SD of a population
    • SE of a sample

Exercises

Problem A

A1.

\(X\) and \(Y\) are independent random variables and \(a\) and \(b\) are constants. You have the following scaled random variables:

\[U = X + Y\] \[V = a \cdot X\] \[W = a \cdot X + b\]

a) Calculate the expectation of these random variables!

\[E[X + Y] = E[X] + E[Y]\]

\[E[a \cdot X] = a \cdot E[X]\] because a constant factor can be pulled out of the expectation

\[E[a \cdot X + b] = a \cdot E[X] + E[b] = a \cdot E[X] + b\] because the expectation of a constant is the constant itself
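To make these rules concrete, here is a minimal simulation sketch in Python (using NumPy; the distributions of \(X\) and \(Y\) and the constants \(a\) and \(b\) are arbitrary choices for illustration, not part of the exercise). With many draws, the sample averages should land close to the theoretical expectations.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000                                # many draws so averages approximate expectations well

# Arbitrary example distributions and constants, chosen only for illustration
X = rng.normal(loc=2.0, scale=1.5, size=n)   # E[X] = 2
Y = rng.exponential(scale=3.0, size=n)       # E[Y] = 3
a, b = 4.0, 10.0

print(np.mean(X + Y))      # close to E[X] + E[Y]  = 5
print(np.mean(a * X))      # close to a * E[X]     = 8
print(np.mean(a * X + b))  # close to a * E[X] + b = 18
```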

b) Calculate the variance and relate these results to the table on page 85 (effect of scaling a statistical measurement)!

\[V[U] = V[X] + V[Y]\]

\[V[a \cdot X] = a^2 \cdot V[X]\] - the variance of the random variable is multiplied by the square of the constant

\[V[a \cdot X + b] = a^2 \cdot V[X]\] - adding a constant doesn’t change the variance

If you look at the table on page 85, it basically shows you these exact same rules.

Proofs: \[ \begin{aligned} V[X + Y] &= E[(X+Y)^2] - (E[X+Y])^2 \\ &= E[X^2 + 2 \cdot X \cdot Y + Y^2] - (E[X] + E[Y])^2 \\ &= E[X^2] + 2 \cdot E[X \cdot Y] + E[Y^2] - ((E[X])^2 + 2 \cdot E[X] \cdot E[Y] + (E[Y])^2) \\ &= E[X^2] + 2 \cdot E[X \cdot Y] + E[Y^2] - (E[X])^2 - 2 \cdot E[X] \cdot E[Y] - (E[Y])^2 \\ &= (E[X^2]-(E[X])^2) + (E[Y^2]-(E[Y])^2) + 2 \cdot (E[X \cdot Y] - E[X] \cdot E[Y]) \\ &= V[X] + V[Y] + 2 \cdot Cov(X,Y) \end{aligned} \] The \(Cov(X,Y)\) part is called the covariance, and it expresses how our variables \(X\) and \(Y\) vary together. In this case, however, \(X\) and \(Y\) are independent, so the covariance is 0 - hence we ultimately arrive at the formula \(V[X+Y] = V[X] + V[Y]\).

\[ \begin{aligned} V[a \cdot X] &= E[(a \cdot X)^2] - (E[a \cdot X])^2 \\ &= E[a^2 \cdot X^2] - (a \cdot E[X])^2 \\ &= a^2 \cdot E[X^2] - a^2 \cdot (E[X])^2 \\ &= a^2 \cdot (E[X^2] - (E[X])^2) \\ &= a^2 \cdot V[X] \end{aligned} \]

\[ \begin{aligned} V[a \cdot X + b] &= E[(a \cdot X + b)^2] - (E[a \cdot X + b])^2 \\ &= E[a^2 \cdot X^2 + 2 \cdot a \cdot b \cdot X + b^2] - (a \cdot E[X] + b)^2 \\ &= a^2 \cdot E[X^2] + 2 \cdot a \cdot b \cdot E[X] + b^2 - a^2 \cdot (E[X])^2 - 2 \cdot a \cdot b \cdot E[X] - b^2 \\ &= a^2 \cdot E[X^2] - a^2 \cdot (E[X])^2 \\ &= a^2 \cdot (E[X^2] - (E[X])^2) \\ &= a^2 \cdot V[X] \end{aligned} \]
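The same kind of simulation sketch (again with arbitrarily chosen distributions and constants, not from the exercise) checks the three variance rules numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Independent X and Y with known variances, chosen only for illustration
X = rng.normal(loc=2.0, scale=1.5, size=n)   # V[X] = 1.5^2 = 2.25
Y = rng.exponential(scale=3.0, size=n)       # V[Y] = 3^2   = 9
a, b = 4.0, 10.0

print(np.var(X + Y))      # close to V[X] + V[Y] = 11.25 (the covariance term is ~0 for independent X, Y)
print(np.var(a * X))      # close to a^2 * V[X]  = 36
print(np.var(a * X + b))  # also close to 36: adding the constant b changes nothing
```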

A2.

Imagine that you have \(n\) measurements (or observations) that are independent of each other, and that each observation comes from the same underlying population. We define a random variable for each observation: \(X_1\), \(X_2\),…, \(X_n\). All random variables \(X_i\) are independent and have the same underlying probability distribution. Let \(M = \frac{X_1 + X_2 + \dots + X_n}{n}\) denote the sample mean.

Calculate \(E[M]\) and discuss what this implies when we estimate the mean of a population from the sample mean.

\[ \begin{aligned} E[M] &= E\left[\frac{X_1 + \dots + X_n}{n}\right] \\ &= \frac{1}{n} \cdot (E[X_1] + \dots + E[X_n]) \\ &= \frac{1}{n} \cdot (\mu + \dots + \mu) \\ &= \frac{1}{n} \cdot (n \cdot \mu) \\ &= \mu \end{aligned} \]

This means that the expected value of the sample mean is exactly the same as the mean \(\mu\) of the individual \(X_i\): the sample mean is an unbiased estimator of the population mean, so whenever we want to estimate a population mean, calculating the mean of our sample is a natural choice.
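A small simulation sketch illustrates this unbiasedness (the normal population with \(\mu = 5\), \(\sigma = 2\) and the sample size \(n = 25\) are hypothetical choices, not from the exercise): the average of many simulated sample means sits right at \(\mu\).

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 5.0, 2.0          # hypothetical population mean and SD
n = 25                        # sample size
n_repeats = 100_000           # number of independent samples we draw

# Draw many samples of size n and compute the sample mean M of each one
samples = rng.normal(loc=mu, scale=sigma, size=(n_repeats, n))
M = samples.mean(axis=1)

print(M.mean())               # close to mu = 5: on average, the sample mean hits the population mean
```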

Calculate \(V[M]\) and \(\sqrt{V[M]}\) and relate your result to the formula in the book for the \(SE\) of the mean in a sample of size \(n\).

\[ \begin{aligned} V[M] &= V\left[\frac{X_1 + \dots + X_n}{n}\right] \\ &= \frac{1}{n^2} \cdot V[X_1 + X_2 + \dots + X_n] \\ &= \frac{1}{n^2} \cdot (\sigma^2_1 + \sigma^2_2 + \dots + \sigma^2_n) \\ &= \frac{1}{n^2} \cdot n \cdot \sigma^2 \\ &= \frac{\sigma^2}{n} \end{aligned} \]

Taking the square root…

\[\sqrt{V[M]} = \frac{\sigma}{\sqrt{n}}\]

Notice that this is exactly how we calculated the standard error of the mean!

In these proofs we were able to substitute \(E[X_i] = \mu\) and \(V[X_i] = \sigma^2\) because the observations come from the same underlying probability distribution, and we could split the variance of the sum into a sum of variances because the observations are independent. The random variables are independent and identically distributed, so they all have the same mean and variance.
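Reusing the same hypothetical setup as above (normal population with \(\mu = 5\), \(\sigma = 2\), samples of size \(n = 25\)), the spread of the simulated sample means should match \(\sigma/\sqrt{n}\):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n = 5.0, 2.0, 25   # hypothetical population and sample size, as above
n_repeats = 100_000

# Many samples of size n; M holds the sample mean of each one
samples = rng.normal(loc=mu, scale=sigma, size=(n_repeats, n))
M = samples.mean(axis=1)

print(M.std())                # close to the theoretical SE of the mean
print(sigma / np.sqrt(n))     # sigma / sqrt(n) = 2 / 5 = 0.4
```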


Problem B

Let \(X\) be a random variable recording the genotype of an individual for a given SNP in the genome.

An individual is either AA, AT or TT and we map the genotype to the number of A alleles so if we have an individual being TT then \(X=0\), AT then \(X=1\), and AA then \(X=2\).

Imagine that we sample a very large (infinite) population and that the frequency of T in the population is 0.15. Imagine also that individuals are mating at random.

Calculate the following probabilities:

\(X = 0\) means that the individual has the TT genotype.
\[Pr(X = 0) = Pr[T] \cdot Pr[T] = 0.15 \cdot 0.15 = 0.0225\]

\(X = 1\) means that the individual has the AT genotype.
\[Pr(X = 1) = (Pr[A] \cdot Pr[T]) + (Pr[T] \cdot Pr[A]) = 2 \cdot 0.85 \cdot 0.15 = 0.255\]

\(X = 2\) means that the individual has the AA genotype.
\[Pr(X = 2) = Pr[A] \cdot Pr[A] = 0.85 \cdot 0.85 = 0.7225\]
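As a quick numerical check, a few lines of Python (only the allele frequency of 0.15 comes from the exercise; the variable names are my own) reproduce these three probabilities and confirm that they sum to 1:

```python
# Allele frequencies from the exercise
p_T = 0.15
p_A = 1 - p_T                  # = 0.85

p_TT = p_T * p_T               # Pr(X = 0)
p_AT = 2 * p_A * p_T           # Pr(X = 1): A and T can come from either parent, hence the factor 2
p_AA = p_A * p_A               # Pr(X = 2)

print(p_TT, p_AT, p_AA)        # ~0.0225  0.255  0.7225
print(p_TT + p_AT + p_AA)      # ~1.0, since the three genotypes cover all possibilities
```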

Calculate \(E[X]\) and \(V[X]\).

\[ \begin{aligned} E[X] &= 0 \cdot Pr[X = 0] + 1 \cdot Pr[X = 1]+2 \cdot Pr[X = 2] \\ &= 0 \cdot 0.0225 + 1 \cdot 0.255 + 2 \cdot 0.7225 = 1.7 \end{aligned} \]

Fine, but what does this mean? According to this calculation the expectation of \(X\) is 1.7 - however, we don’t even have this option. \(X\) can be either 0, 1 or 2, corresponding to the genotypes TT, AT and AA. Our result is closest to 2, so we can say that we’d expect genotype AA to be the “typical” one. If we’re looking at the probabilities, this actually makes sense, as the probability of the AA genotype is the largest, 0.7225, so we wouldn’t be very surprised to observe that genotype the most often.

Quick sidenote: if you think about 0, 1 and 2 as numbers here, you might be led astray because then you might start thinking that when we’re multiplying \(Pr[X = 0]\) by 0, we’re essentially “losing” or “disregarding” it. This is not true. Let’s imagine a population where there are only TT individuals (so no A alleles whatsoever). Then the calculation would look like this:

\[Pr(X = 0) = Pr[T] \cdot Pr[T] = 1 \cdot 1 = 1\] \[Pr(X = 1) = (Pr[A] \cdot Pr[T]) + (Pr[T] \cdot Pr[A]) = 2 \cdot 0 \cdot 1 = 0\]

\[Pr(X = 2) = Pr[A] \cdot Pr[A] = 0 \cdot 0 = 0\] \[ \begin{aligned} E[X] &= 0 \cdot Pr[X = 0] + 1 \cdot Pr[X = 1]+2 \cdot Pr[X = 2] \\ &= 0 \cdot 1 + 1 \cdot 0 + 2 \cdot 0 = 0 \end{aligned}\] Does this mean that we calculated that the expectation is nothing? No. What we calculated is that the expectation is 0, corresponding to genotype TT, meaning that the typical individual has the TT genotype, which makes sense if there are only T alleles in the population. So in this case think about the values of \(X\) as “indicators” for the genotypes.
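Plugging \(Pr[T] = 1\) into the same sketch as above reproduces this edge case:

```python
# Edge case: a population with only T alleles
p_T = 1.0
p_A = 1 - p_T                  # = 0

probs = [p_T * p_T, 2 * p_A * p_T, p_A * p_A]        # Pr(X = 0), Pr(X = 1), Pr(X = 2)
E_X = sum(x * p for x, p in zip([0, 1, 2], probs))   # expectation of X

print(probs)   # [1.0, 0.0, 0.0]
print(E_X)     # 0.0, i.e. the "typical" genotype is TT
```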

Let’s get back to the original exercise now and calculate the variance:

\[ \begin{aligned} V[X] &= E[X^2] - (E[X])^2 \\ &= (0\cdot0.0225 + 1\cdot0.255 + 4\cdot0.7225) - 1.7^2 \\ &= 3.145 - 2.89 = 0.255 \end{aligned}\] Notice what you need to square here! \((E[X])^2\) is fairly simple - you just take what you calculated in the previous exercise, square it and call it a day.
\(E[X^2]\) is a bit trickier - it’s the expectation of \(X^2\). The probabilities remain the same, but you need to multiply them by the values of \(X^2\) instead of \(X\) (i.e. 0, 1, and 4 instead of 0, 1 and 2).
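The same bookkeeping in a few lines of Python (values and probabilities taken straight from the exercise), computing \(E[X]\), \(E[X^2]\) and \(V[X]\) from the probability table:

```python
x = [0, 1, 2]                   # number of A alleles for genotypes TT, AT, AA
p = [0.0225, 0.255, 0.7225]     # Pr(X = 0), Pr(X = 1), Pr(X = 2)

E_X  = sum(xi * pi for xi, pi in zip(x, p))        # E[X]:   weight the values 0, 1, 2 by their probabilities
E_X2 = sum(xi**2 * pi for xi, pi in zip(x, p))     # E[X^2]: same probabilities, but values 0, 1, 4
V_X  = E_X2 - E_X**2                               # V[X] = E[X^2] - (E[X])^2

print(E_X)    # ~1.7
print(V_X)    # ~0.255
```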
