4.2 Standardization
When variables are quantitative, they may be measured on the same scale, but they may also be measured on different scales. For example, consider four variables: 1) Weight measured in kilograms, 2) Height measured in centimeters, 3) Income measured in dollars, and 4) Temperature measured in degrees Celsius. When you use a method whose comparisons or calculations depend on the variance of the variables, you will face an interesting phenomenon: the variable with the largest variance, typically the one expressed in the largest numbers, will dominate the variability in the data. This can be a serious problem.
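To see this effect on a small scale, here is a minimal sketch that uses the height (in meters) and weight (in kilograms) values from the working example introduced later in this section. The object names `height`, `weight`, and `shares` are chosen here only for illustration.

```r
# heights (meters) and weights (kilograms) of the six individuals
height <- c(1.88, 1.65, 1.72, 1.50, 1.93, 1.82)
weight <- c(84.0, 45.0, 77.0, 49.0, 88.5, 77.0)

# variances on the original scales
var_height <- var(height)
var_weight <- var(weight)

# share of the total variance contributed by each variable
shares <- c(height = var_height, weight = var_weight) / (var_height + var_weight)
round(shares, 4)
```

Weight accounts for almost all of the total variance, simply because kilograms produce much larger numbers than meters for these individuals.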
To compensate for the differences in scales, something must be done. The question is then: how can you balance the contributions of the variables so that they can be compared fairly? The key is to put them all on a common scale. Let’s review four options to standardize a variable:
- by standard deviations from the mean (i.e. standard units)
- by the overall range
- by chosen percentiles
- by the mean
As a working example, let’s use the same data from chapter Data Matrix.
                 gender height weight     jedi
Anakin Skywalker   male   1.88   84.0 yes_jedi
Padme Amidala    female   1.65   45.0  no_jedi
Luke Skywalker     male   1.72   77.0 yes_jedi
Leia Organa      female   1.50   49.0  no_jedi
Qui-Gon Jinn       male   1.93   88.5 yes_jedi
Obi-Wan Kenobi     male   1.82   77.0 yes_jedi
We will use the variable height, denoted by \(X\).
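The code in the rest of this section assumes a numeric vector `x` that holds the heights. A minimal way to create it, assuming the values are typed directly from the table above rather than extracted from the data frame of the earlier chapter, could be:

```r
# heights (in meters), in the same order as the rows of the table
x <- c(1.88, 1.65, 1.72, 1.50, 1.93, 1.82)
x
```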
4.2.1 By Standard Units
One way to obtain a common scale is to standardize the variables by the number of standard deviations from the mean. The goal is to convert a variable \(X\) into a variable \(Z\) in standard units. You do this by subtracting the mean \(\bar{x}\) from every value \(x_i\), and then dividing by the standard deviation \(s\). The conversion formula is:
\[ z_i = \frac{x_i - \bar{x}}{s} \]
where \(\bar{x}\) is the mean of \(X\), and \(s\) is the standard deviation of \(X\). A value that is expressed in standard units measures how far that value is from the mean, in terms of standard deviations.
In vector notation, the standardized vector \(\mathbf{z}\) of \(\mathbf{x}\) is:
\[ \mathbf{z} = \frac{1}{s} (\mathbf{x} - \mathbf{\bar{x}}) \]
Here’s some code in R that shows step-by-step the operations to obtain a variable in standard units:
# mean
x_mean <- mean(x)
x_mean
[1] 1.75
# standard deviation
x_sd <- sd(x)
x_sd
[1] 0.16
# height in standard units
std_units <- (x - x_mean) / x_sd
std_units
[1] 0.814 -0.626 -0.188 -1.565 1.127 0.438
The standardized vector std_units now has a mean of 0 and a standard deviation of 1.
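The mean and standard deviation of the values in standard units can be checked directly. The sketch below recreates `x` from the table so that it runs on its own, and it also compares the result with base R's `scale()`, which centers by the mean and divides by the standard deviation by default. The object `std_scale` is a name introduced here for illustration.

```r
# heights (in meters) from the working example
x <- c(1.88, 1.65, 1.72, 1.50, 1.93, 1.82)

# height in standard units
std_units <- (x - mean(x)) / sd(x)

# the mean is 0 up to floating-point rounding
mean(std_units)

# the standard deviation is 1
sd(std_units)

# scale() returns a one-column matrix; as.vector() drops its attributes
std_scale <- as.vector(scale(x))
all.equal(std_units, std_scale)
```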
4.2.2 By the Overall Range
Another kind of standardization is to subtract the minimum value from every value \(x_i\), and then divide by the overall range: \(\max(x) - \min(x)\).
\[ z_i = \frac{x_i - \min(x)}{\max(x) - \min(x)} \]
Notice that this type of standardization is a linear transformation that produces a variable \(Z\) ranging from 0 to 1, where 0 corresponds to the minimum of \(X\) and 1 to its maximum.
# maximum
x_max <- max(x)
x_max
[1] 1.93
# minimum
x_min <- min(x)
x_min
[1] 1.5
# height standardized by overall range
std_range <- (x - x_min) / (x_max - x_min)
std_range
[1] 0.884 0.349 0.512 0.000 1.000 0.744
The standardized vector std_range now ranges from 0 to 1.
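A quick check of this claim, written as a self-contained sketch that recreates `x` from the table, is to look at the range of the standardized values. Because the transformation is linear and increasing, the ordering of the individuals should also be unchanged; the object `same_order` is a name introduced here for illustration.

```r
# heights (in meters) from the working example
x <- c(1.88, 1.65, 1.72, 1.50, 1.93, 1.82)

# height standardized by overall range
std_range <- (x - min(x)) / (max(x) - min(x))

# minimum and maximum of the standardized values
range(std_range)

# the ranking of the individuals is preserved
same_order <- identical(order(x), order(std_range))
same_order
```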
4.2.3 By Chosen Percentiles
A more general type of range-based standardization is to divide by a different type of range, not just the overall range. This is done by using the range given by any pair of symmetric percentiles. For example, you can choose the 25th and the 75th percentiles, also known as the first quartile \(Q_1\) and third quartile \(Q_3\), respectively. In that case, you standardize by the interquartile range or IQR:
\[ z_i = \frac{x_i}{Q_3 - Q_1} = \frac{x_i}{\text{IQR}(x)} \]
Why standardize by chosen percentiles other than the maximum and the minimum? Because the overall range is sensitive to outliers; to have a less sensitive scale, we divide by a more robust range. As another example of this type of standardization, you can divide by the range between the 5th and the 95th percentiles.
# inter-quartile range
x_iqr <- IQR(x)
# height standardized by IQR
std_iqr <- x / x_iqr
std_iqr
[1] 9.52 8.35 8.71 7.59 9.77 9.22
The standardized vector std_iqr now has an IQR of 1.
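The sketch below checks the IQR of the standardized values and also illustrates the other range mentioned above, the one between the 5th and the 95th percentiles. It recreates `x` from the table, relies on R's default quantile definition in `quantile()` and `IQR()`, and uses object names (`p`, `x_prange`, `std_prange`, `new_prange`) that are introduced here only for illustration.

```r
# heights (in meters) from the working example
x <- c(1.88, 1.65, 1.72, 1.50, 1.93, 1.82)

# height standardized by IQR: the resulting IQR is 1
std_iqr <- x / IQR(x)
IQR(std_iqr)

# range between the 5th and the 95th percentiles
p <- quantile(x, probs = c(0.05, 0.95))
x_prange <- unname(p[2] - p[1])

# height standardized by that percentile range
std_prange <- x / x_prange

# the same percentile range of the standardized values is 1
new_prange <- unname(diff(quantile(std_prange, probs = c(0.05, 0.95))))
new_prange
```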
4.2.4 By the Mean
A less common but equally interesting type of standardization is to divide the values \(x_i\) by their mean \(\bar{x}\). The transformed values \(z_i\) then have a mean of 1 and a standard deviation equal to the coefficient of variation of \(X\), that is, \(s / \bar{x}\):
\[ z_i = \frac{x_i}{\bar{x}} \]
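Following the pattern of the previous subsections, a minimal sketch of this standardization in R could look as follows. It recreates `x` from the table, and the names `std_mean` and `x_cv` are introduced here for illustration.

```r
# heights (in meters) from the working example
x <- c(1.88, 1.65, 1.72, 1.50, 1.93, 1.82)

# height standardized by the mean
std_mean <- x / mean(x)
std_mean

# coefficient of variation of the original values
x_cv <- sd(x) / mean(x)

# the standard deviation of the standardized values equals the CV
all.equal(sd(std_mean), x_cv)

# the standardized values have a mean of 1
mean(std_mean)
```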
4.2.5 About Standardization
All the previous standardizations can be expressed, after an optional shift (subtracting the mean for standard units, or the minimum for the overall range), as a form of weighting given by the following formula:
\[ z_i = w \times x_i \]
where \(w\) represents a weight or scaling factor that can be the reciprocal of:
- the standard deviation: \(w = 1 / s\)
- the overall range: \(w = 1 / (\max(x) - \min(x))\)
- the IQR: \(w = 1 / \text{IQR}(x)\)
- the mean: \(w = 1 / \bar{x}\)
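The weighting view can be sketched in R by collecting the scaling factors in a named vector and multiplying `x` by each of them. Note that each weight is the reciprocal of the corresponding quantity, since the formulas in this section divide by it. This sketch covers only the scaling step: standard units and the overall range additionally subtract the mean or the minimum, which is left out here. The names `w` and `z` are introduced for illustration.

```r
# heights (in meters) from the working example
x <- c(1.88, 1.65, 1.72, 1.50, 1.93, 1.82)

# scaling factors: reciprocals of the four measures of scale
w <- c(
  std_units = 1 / sd(x),
  range     = 1 / (max(x) - min(x)),
  iqr       = 1 / IQR(x),
  mean      = 1 / mean(x)
)

# one column per scaling factor: z = w * x
z <- sapply(w, function(wi) wi * x)
round(z, 3)
```

The `iqr` and `mean` columns reproduce the standardizations by the IQR and by the mean exactly, while the `std_units` and `range` columns differ from their counterparts only by a constant shift.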