8.1 About the variance

Simply put, the variance is a measure of spread around the mean. The main idea behind the calculation of the variance is to quantify the typical concentration of values around the mean. The way this is done is by averaging the squared deviations from the mean.

\[ var(X) = \frac{(x_1 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

Let’s dissect the terms and operations involved in the formula of the variance.

  • the main terms are the deviations from the mean \((x_i - \bar{x})\), that is, the difference between each observation \(x_i\) and the mean \(\bar{x}\).

  • conceptually speaking, we want to know the average size of the deviations around the mean.

  • simply averaging the deviations won’t work because their sum is zero (i.e. the sum of deviations around the mean will cancel out because the mean is the balancing point).

  • this is why we square each deviation: \((x_i - \bar{x})^2\), which literally means getting the squared distance from \(x_i\) to \(\bar{x}\).

  • finally, having squared all the deviations, we average them to get the variance.
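
The steps above can be sketched in R; the vector `x` below is a made-up toy example (the numbers are arbitrary):

```r
# toy vector (arbitrary numbers, just for illustration)
x <- c(2, 4, 6, 8, 10)

# step 1: deviations from the mean (they always sum to zero)
devs <- x - mean(x)
sum(devs)
#> [1] 0

# step 2: square the deviations to avoid the cancellation
sq_devs <- devs^2

# step 3: average the squared deviations to get the variance
variance <- mean(sq_devs)
variance
#> [1] 8
```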

Because the variance has squared units, we need to take the square root to “recover” the original units in which \(X\) is expressed. This gives us the standard deviation

\[ sd(X) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]

In this sense, you can say that the standard deviation is roughly the average distance that the data points vary from the mean.
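
This relationship can be checked directly in R, again with a made-up toy vector and the \(\frac{1}{n}\) factor:

```r
# toy vector (arbitrary numbers)
x <- c(2, 4, 6, 8, 10)

# variance with the 1/n factor, and its square root
variance <- mean((x - mean(x))^2)
std_dev <- sqrt(variance)
std_dev
#> [1] 2.828427
```

Keep in mind that R’s built-in function sd() divides by \(n-1\), so sd(x) will give a slightly larger value than the one computed here.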

8.1.1 Sample Variance

In practice, you will often find two versions of the formula for the variance: one in which the sum of squared deviations is divided by \(n\), and another one in which the division is done by \(n-1\). Each version is associated with the statistical inference view of variance: whether the data constitute the entire population, or just a sample from the population.

The population variance is obtained by dividing by \(n\):

\[ \textsf{population variance:} \quad \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

The sample variance is obtained by dividing by \(n - 1\) instead of \(n\). The reason for doing this is to get an unbiased estimator of the population variance:

\[ \textsf{sample variance:} \quad \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
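
To make the contrast concrete, here is a hypothetical helper function variance2() (not part of R) that returns both versions at once:

```r
# hypothetical function returning both versions of the variance
variance2 <- function(x) {
  n <- length(x)
  ssd <- sum((x - mean(x))^2)  # sum of squared deviations
  c(population = ssd / n, sample = ssd / (n - 1))
}

variance2(c(2, 4, 6, 8, 10))
#> population     sample
#>          8         10
```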

It is important to note that most statistical software computes the variance with the unbiased version. This is also the case in R with the function var(). For instance, consider the toy data set in the following matrix X:

# data matrix
X <- matrix(c(150, 172, 180, 49, 77, 80), nrow = 3, ncol = 2)
rownames(X) <- c("Leia", "Luke", "Han")
colnames(X) <- c("weight", "height")
X
     weight height
Leia    150     49
Luke    172     77
Han     180     80

The unbiased variance (i.e. sample variance) of weight is:

# variance of weight
var(X[,1])
[1] 241.3333

Compare it to the biased variance:

# biased variance of weight
(nrow(X) - 1) / (nrow(X)) * var(X[,1])
[1] 160.8889

As you can tell, there is a noticeable difference between the two values. This is because the number of observations, \(n = 3\) in this case, is small. However, as the sample size increases, the difference between dividing by \(n-1\) and dividing by \(n\) becomes negligible.
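
A quick simulation illustrates this; the normal samples, the sample sizes, and the seed are all arbitrary choices:

```r
# the gap between the biased and unbiased versions shrinks as n grows
set.seed(123)  # arbitrary seed, for reproducibility
for (n in c(3, 30, 3000)) {
  x <- rnorm(n)
  unbiased <- var(x)               # divides by n - 1
  biased <- (n - 1) / n * var(x)   # divides by n
  cat(sprintf("n = %4d  unbiased = %.4f  biased = %.4f\n",
              n, unbiased, biased))
}
```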

If you implement your own functions and plan to compare them against other software, then it is crucial to know which convention other programmers use for computing the variance. Otherwise, your results might differ slightly from those produced by other people’s code.

In this book, to keep notation as simple as possible, we will use the factor \(\frac{1}{n}\) for the rest of the formulas. However, keep in mind that most variance-based computations in R use \(\frac{1}{n-1}\).