4.2 Standardization

Quantitative variables may be measured on the same scale, but they are often measured on different scales. For example, consider four variables: 1) Weight measured in kilograms, 2) Height measured in centimeters, 3) Income measured in dollars, and 4) Temperature measured in degrees Celsius. When you use a method whose comparisons or calculations depend on the variance of the variables, you will face an interesting phenomenon: the variable with the largest numeric magnitude, and hence the largest variance, will dominate the variability in the data. This can be a serious problem, because the dominance reflects the choice of units rather than the importance of the variable.
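To make the phenomenon concrete, here is a minimal sketch in R that uses the height (in meters) and weight (in kilograms) values from the working example introduced later in this section. The object names (`height_m`, `weight_kg`, `weight_share`) are chosen here for illustration only.

```r
# Values copied from the working example in this section:
# height in meters, weight in kilograms
height_m  <- c(1.88, 1.65, 1.72, 1.50, 1.93, 1.82)
weight_kg <- c(84.0, 45.0, 77.0, 49.0, 88.5, 77.0)

# variance of each variable on its original scale
var_height <- var(height_m)
var_weight <- var(weight_kg)

# share of the total variance that comes from weight
weight_share <- var_weight / (var_height + var_weight)
weight_share

# the same heights expressed in centimeters:
# the variance is multiplied by 100^2
var_height_cm <- var(height_m * 100)
var_height_cm / var_height
```

On these data, weight accounts for more than 99% of the total variance simply because of its units. Re-expressing height in centimeters changes its variance by a factor of 10,000 even though the information in the variable is the same.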

To compensate for the differences in scales, something must be done. The question is then: how can you balance the contributions of the variables so that they can be compared fairly? The key is to put them all on a common scale. Let’s review four options to standardize a variable:

  • by standard deviations from the mean (i.e. standard units)
  • by the overall range
  • by chosen percentiles
  • by the mean

As a working example, let’s use the same data as in the chapter Data Matrix.

                 gender height weight     jedi
Anakin Skywalker   male   1.88   84.0 yes_jedi
Padme Amidala    female   1.65   45.0  no_jedi
Luke Skywalker     male   1.72   77.0 yes_jedi
Leia Organa      female   1.50   49.0  no_jedi
Qui-Gon Jinn       male   1.93   88.5 yes_jedi
Obi-Wan Kenobi     male   1.82   77.0 yes_jedi

We will use the variable height, denoted by \(X\):

x <- dat$height
x
[1] 1.88 1.65 1.72 1.50 1.93 1.82

4.2.1 By Standard Units

One way to obtain a common scale is to standardize the variables by the number of standard deviations from the mean. The goal is to convert a variable \(X\) into a variable \(Z\) in standard units. You do this by subtracting the mean \(\bar{x}\) from every value \(x_i\), and then dividing by the standard deviation \(s\). The conversion formula is:

\[ z_i = \frac{x_i - \bar{x}}{s} \]

where \(\bar{x}\) is the mean of \(X\), and \(s\) is the standard deviation of \(X\). A value that is expressed in standard units measures how far that value is from the mean, in terms of standard deviations.

In vector notation, with \(\mathbf{\bar{x}}\) denoting the vector whose entries all equal the mean \(\bar{x}\), the standardized vector \(\mathbf{z}\) of \(\mathbf{x}\) is:

\[ \mathbf{z} = \frac{1}{s} (\mathbf{x} - \mathbf{\bar{x}}) \]

Here’s some code in R that shows step-by-step the operations to obtain a variable in standard units:

# mean
x_mean <- mean(x)
x_mean
[1] 1.75

# standard deviation
x_sd <- sd(x)
x_sd
[1] 0.16

# height in standard units
std_units <- (x - x_mean) / x_sd
std_units
[1]  0.814 -0.626 -0.188 -1.565  1.127  0.438

The standardized vector std_units now has a mean of 0 (up to floating-point rounding) and a standard deviation of 1:

mean(std_units)
[1] -2.73e-16
sd(std_units)
[1] 1
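As a cross-check, the same result can be obtained with the base R function `scale()`, which by default centers by the mean and divides by the standard deviation. The sketch below redefines `x` with the height values so that it runs on its own. The name `std_units_scale` is introduced here for illustration.

```r
# heights from the working example
x <- c(1.88, 1.65, 1.72, 1.50, 1.93, 1.82)

# manual standardization, as in the step-by-step code
std_units <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix with extra attributes;
# as.numeric() drops them so that the result is a plain vector
std_units_scale <- as.numeric(scale(x))

# both approaches agree up to floating-point tolerance
all.equal(std_units, std_units_scale)
```

Note that `scale()` is designed for matrices and data frames, so it can standardize several columns in a single call.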

4.2.2 By the Overall Range

Another kind of standardization is to subtract the minimum value from every value \(x_i\), and then divide by the overall range, \(\max(x) - \min(x)\):

\[ z_i = \frac{x_i - \min(x)}{\max(x) - \min(x)} \]

Notice that this standardization is a linear transformation of \(X\) that produces a variable \(Z\) ranging from 0 to 1, where 0 corresponds to the minimum of \(X\) and 1 to its maximum.

# maximum
x_max <- max(x)
x_max
[1] 1.93

# minimum
x_min <- min(x)
x_min
[1] 1.5

# height standardized by overall range
std_range <- (x - x_min) / (x_max - x_min)
std_range
[1] 0.884 0.349 0.512 0.000 1.000 0.744

The standardized vector std_range now ranges from 0 to 1:

min(std_range)
[1] 0
max(std_range)
[1] 1
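When several variables need the same treatment, the steps above can be wrapped in a small function. The following is only a sketch: `std_by_range()` is a hypothetical helper, not a function from base R or from the original text, and the check for a zero range is an added assumption about how a constant variable should be handled.

```r
# hypothetical helper: standardize a numeric vector by its overall range
std_by_range <- function(x) {
  x_range <- max(x) - min(x)
  # a constant variable has a range of zero, so the ratio is undefined
  if (x_range == 0) {
    stop("all values are equal, so the overall range is zero")
  }
  (x - min(x)) / x_range
}

# height and weight from the working example
height <- c(1.88, 1.65, 1.72, 1.50, 1.93, 1.82)
weight <- c(84.0, 45.0, 77.0, 49.0, 88.5, 77.0)

std_height <- std_by_range(height)
std_weight <- std_by_range(weight)

# both variables now share the same 0-to-1 scale
range(std_height)
range(std_weight)
```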

4.2.3 By Chosen Percentiles

A more general type of range-based standardization divides by a range other than the overall range. Any range defined by a pair of symmetric percentiles can be used. For example, you can choose the 25th and the 75th percentiles, also known as the first quartile \(Q_1\) and third quartile \(Q_3\), respectively. In that case the variable is standardized by the interquartile range, or IQR:

\[ z_i = \frac{x_i}{Q_3 - Q_1} = \frac{x_i}{\text{IQR}(x)} \]

Why standardize by percentiles other than the minimum and the maximum? Because the overall range is sensitive to outliers: a single extreme value is enough to change it. To obtain a less sensitive scale, we divide by a more robust range. As another example of this type of standardization, you can divide by the range between the 5th and the 95th percentiles.

# inter-quartile range
x_iqr <- IQR(x)

# height standardized by IQR
std_iqr <- x / x_iqr
std_iqr
[1] 9.52 8.35 8.71 7.59 9.77 9.22

The standardized vector std_iqr now has an IQR of 1:

IQR(std_iqr)
[1] 1
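The other percentile range mentioned above, between the 5th and the 95th percentiles, can be handled in the same way. Here is a minimal sketch that uses `quantile()` with R's default quantile definition. The names `x_p_range` and `std_pct` are introduced here for illustration.

```r
# heights from the working example
x <- c(1.88, 1.65, 1.72, 1.50, 1.93, 1.82)

# 5th and 95th percentiles (default quantile type in R)
p <- quantile(x, probs = c(0.05, 0.95), names = FALSE)

# range between the two percentiles
x_p_range <- p[2] - p[1]

# height standardized by the 5th-95th percentile range
std_pct <- x / x_p_range
std_pct

# after the transformation, the 5th-95th percentile range equals 1
diff(quantile(std_pct, probs = c(0.05, 0.95), names = FALSE))
```

With only six observations these percentiles are interpolated between the most extreme values, so the gain in robustness is limited here. The approach is more useful with larger samples.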

4.2.4 By the Mean

A less common but equally interesting type of standardization is to divide the values \(x_i\) by their mean \(\bar{x}\). The transformed values \(z_i\) then have a mean of 1 and a standard deviation equal to the coefficient of variation of \(X\), that is, \(s / \bar{x}\):

\[ z_i = \frac{x_i}{\bar{x}} \]
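In the same style as the previous subsections, here is a minimal sketch of this standardization in R. It redefines `x` with the height values so that it runs on its own. The names `std_mean` and `cv` are introduced here for illustration.

```r
# heights from the working example
x <- c(1.88, 1.65, 1.72, 1.50, 1.93, 1.82)

# height standardized by the mean
std_mean <- x / mean(x)
std_mean

# the standardized vector has a mean of 1 (up to rounding)
mean(std_mean)

# coefficient of variation of the original variable
cv <- sd(x) / mean(x)

# the standard deviation of std_mean matches the coefficient of variation
all.equal(sd(std_mean), cv)
```

This standardization only makes sense when the mean is not zero, and it is most natural for variables that take positive values.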

4.2.5 About Standardization

All the previous standardizations can be put in terms of a form of weighting given by the following formula:

\[ z_i = w \times x_i \]

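where \(w\) represents a weight or scaling factor, namely the reciprocal of a measure of scale:

  • the standard deviation: \(w = 1 / s\)
  • the overall range: \(w = 1 / (\max(x) - \min(x))\)
  • the IQR: \(w = 1 / \text{IQR}(x)\)
  • the mean: \(w = 1 / \bar{x}\)

The first two standardizations also subtract a constant (the mean or the minimum) before scaling; this shifts the values but does not affect their spread.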
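The weighting view can be sketched in R as follows. Each scaling factor is taken as the reciprocal of the corresponding measure of scale, so that a variable with a larger spread is scaled down more. The object names (`scales`, `w`, `z`, `spread`) are introduced here for illustration.

```r
# heights from the working example
x <- c(1.88, 1.65, 1.72, 1.50, 1.93, 1.82)

# the four measures of scale discussed in this section
scales <- c(
  sd    = sd(x),
  range = max(x) - min(x),
  iqr   = IQR(x),
  mean  = mean(x)
)

# scaling factors: reciprocals of the measures of scale
w <- 1 / scales

# apply z = w * x for each factor;
# the result is a matrix with one column per option
z <- sapply(w, function(w_k) w_k * x)

# each scaled variable has a value of 1 for its own measure of scale
spread <- c(
  sd    = sd(z[, "sd"]),
  range = diff(range(z[, "range"])),
  iqr   = IQR(z[, "iqr"]),
  mean  = mean(z[, "mean"])
)
spread
```

The columns for standard units and for the overall range differ from the earlier `std_units` and `std_range` only by a constant shift, since this sketch omits the subtraction of the mean or the minimum.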