4.1 Dummy Variables

As we saw in the previous chapter, categorical variables are characteristics or qualities observed on individuals. An example of a categorical variable is Sex, with possible values male and female. Most of the time, categorical variables are coded in a non-numeric way, usually as strings or character values. However, it is not unusual to find numeric codes such as “female” = 1 and “male” = 2.

More often than not, it is convenient to code categorical variables as dummy variables, that is, to decompose a categorical variable into one or more indicator variables that take on the values 0 or 1. For example, suppose that Sex has two values, male and female, and that we observe three individuals: one female followed by two males. This variable can be coded as two dummy variables: a female indicator with values [1 0 0] and a male indicator with values [0 1 1]:

\[ \left[\begin{array}{c} \text{female} \\ \text{male} \\ \text{male} \\ \end{array}\right] = \left[\begin{array}{cc} 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ \end{array}\right] \]
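The coding above can be sketched in a few lines of plain Python. This is only a minimal illustration: the helper name `dummy_code` and the explicit `levels` list are assumptions introduced here, not part of the original text.

```python
# Observed categories for three individuals (same as the example above).
sex = ["female", "male", "male"]

# Fix the column order explicitly so the columns are (female, male).
levels = ["female", "male"]

def dummy_code(values, levels):
    """Return one row per observation with a 0/1 indicator per level."""
    return [[1 if value == level else 0 for level in levels] for value in values]

dummies = dummy_code(sex, levels)
for label, row in zip(sex, dummies):
    print(label, row)
```

Each row contains exactly one 1, placed in the column of the category that the individual belongs to.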

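For some statistical learning methods such as regression analysis, only one of these two dummies can be used: when the model includes an intercept, the two dummy columns add up to the constant column, so their coefficients cannot be estimated separately. This means that you would drop one of the dummy columns: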

\[ \left[\begin{array}{c} \text{female} \\ \text{male} \\ \text{male} \\ \end{array}\right] = \left[\begin{array}{c} 1 \\ 0 \\ 0 \\ \end{array}\right] \]
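One way to see why the second dummy is redundant is to compare the dummy columns with a column of ones. The sketch below assumes the regression includes an intercept (an assumption made here for illustration); the variable names are hypothetical.

```python
# Full dummy coding for the three individuals: columns are (female, male).
dummies = [[1, 0], [0, 1], [0, 1]]

# A regression with an intercept adds a column of ones to the design matrix.
intercept = [1, 1, 1]

# Each row has exactly one 1, so the dummy columns sum to the intercept column.
row_sums = [sum(row) for row in dummies]
print(row_sums == intercept)

# Dropping the male column removes the redundancy and leaves a single dummy.
female_only = [row[0] for row in dummies]
print(female_only)
```

Because the male column can be recovered as the intercept column minus the female column, it carries no extra information once the intercept is present.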

The resulting single dummy variable is interpreted as the difference between female and male, keeping male (the category coded as 0) as the reference value.
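To make this interpretation concrete, the following sketch fits a simple least-squares line with a single dummy (1 for female, 0 for male) and compares the estimates with the group means. The response values `y` are hypothetical toy numbers invented purely for illustration, and the closed-form formulas are the usual ones for simple linear regression.

```python
from statistics import mean

# Hypothetical toy data, invented for illustration only.
sex = ["female", "male", "male", "female", "male"]
y = [10.0, 7.0, 9.0, 12.0, 8.0]

# Single dummy: 1 for female, 0 for male.
x = [1 if s == "female" else 0 for s in sex]

# Closed-form simple least squares: slope = S_xy / S_xx.
x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Group means of the response, for comparison.
mean_female = mean(yi for yi, xi in zip(y, x) if xi == 1)
mean_male = mean(yi for yi, xi in zip(y, x) if xi == 0)

print(round(intercept, 6), round(mean_male, 6))
print(round(slope, 6), round(mean_female - mean_male, 6))
```

In this setup the intercept matches the mean response of the group coded 0, and the slope matches the difference between the mean of the group coded 1 and the mean of the group coded 0.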