14 Factors

One of the nicest features of R is that it provides a data structure exclusively designed to handle categorical data: factors.

As mentioned before, vectors are the most essential type of data structure in R. They are atomic structures, meaning that a vector can contain only one type of data: integers, real numbers, logical values, characters, or complex numbers.

Closely related to vectors is another important data structure in R: the factor. A factor represents a categorical variable, that is, a variable whose values fall into a limited set of categories.

14.1 Example

player <- c('Thompson', 'Curry', 'Green', 'Durant', 'Pachulia')
position <- c('SG', 'PG', 'PF', 'SF', 'C')
salary <- c(16663575, 12112359, 15330435, 26540100, 2898000)
ppg <- c(22.3, 25.3, 10.2, 25.1, 6.1)
rookie <- rep(FALSE, 5)

14.1.1 Creating Factors

To create a factor you use the function of the same name, factor(), which takes a vector as input. The vector can be numeric, character, or logical.

Looking at the available variables, we can treat position and rookie as categorical variables. This means that we can convert the corresponding vectors, position and rookie, into factors.

# convert to factor
position <- factor(position)
position
#> [1] SG PG PF SF C 
#> Levels: C PF PG SF SG

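rookie <- factor(rookie)
rookie
#> [1] FALSE FALSE FALSE FALSE FALSE
#> Levels: FALSE

Notice how position and rookie are displayed. Even though the elements are the same in both the vector and the factor, they are printed in different formats. The character values in position are printed without quotes, and each factor is followed by a line listing its levels.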

14.1.2 How does R store factors?

Under the hood, a factor is stored using two arrays (R vectors): an integer array containing one code per element, and the “levels”, a character array holding the category names to which those codes are mapped.

One way to confirm that the categories are stored as integers is to use the function storage.mode():

# storage of factor
storage.mode(position)
#> [1] "integer"
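
The two pieces described above can also be inspected separately. The following is a minimal sketch that rebuilds the position factor so it runs on its own; the names codes and lev are introduced here only for illustration:

```r
position <- factor(c('SG', 'PG', 'PF', 'SF', 'C'))

# integer codes: each value is the index of its level
codes <- as.integer(position)
codes
#> [1] 5 3 2 4 1

# the levels are stored once, as a character vector
lev <- levels(position)
lev
#> [1] "C"  "PF" "PG" "SF" "SG"

# indexing the levels by the codes recovers the original labels
lev[codes]
#> [1] "SG" "PG" "PF" "SF" "C"
```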

14.1.3 Manipulating Factors

Because factors are internally stored as integer vectors, you can subset them with square brackets like any other vector:

position[1:5]
#> [1] SG PG PF SF C 
#> Levels: C PF PG SF SG
position[c(1, 3, 5)]
#> [1] SG PF C 
#> Levels: C PF PG SF SG
position[rep(1, 5)]
#> [1] SG SG SG SG SG
#> Levels: C PF PG SF SG
rookie[player == 'Iguodala']
#> factor(0)
#> Levels: FALSE
rookie[player == 'McCaw']
#> factor(0)
#> Levels: FALSE
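
As the outputs above suggest, subsetting a factor keeps all of its levels, even the ones that no longer appear in the result. If that is not what you want, base R's droplevels() removes the unused levels. A small sketch, with sub_position and trimmed as illustrative names:

```r
position <- factor(c('SG', 'PG', 'PF', 'SF', 'C'))

# subsetting keeps every level, even those no longer present
sub_position <- position[c(1, 3, 5)]
levels(sub_position)
#> [1] "C"  "PF" "PG" "SF" "SG"

# droplevels() removes the unused levels
trimmed <- droplevels(sub_position)
levels(trimmed)
#> [1] "C"  "PF" "SG"
```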

14.1.4 Why use R factors?

When and why should you use factors? The simplest answer is: use R factors when you want to handle categorical data as such. Statisticians often think of variables as categorical data measured on one of several scales: binary, nominal, or ordinal. R lets you handle this type of data through factors. Many functions in R are specifically designed for factors, and you can (and should) take advantage of that behavior.

14.2 What is an R factor?

The term factor, as used in R for handling categorical variables, comes from the terminology used in Analysis of Variance, commonly referred to as ANOVA. In this statistical method, a categorical variable is commonly referred to as a factor and its categories are known as levels. Perhaps this is not the best terminology, but it is the one R uses, which reflects its distinctly statistical origins. Especially for users without a background in statistics, this is one of R’s idiosyncrasies that can seem disconcerting at the beginning. But as long as you keep in mind that a factor is just the object that allows you to handle a qualitative variable, you’ll be fine. In case you need it, here’s a short mantra to remember: “factors have levels”.

14.2.1 Creating Factors

To create a factor in R you use the function of the same name, factor(), which takes a vector as input. The vector can be numeric, character, or logical. Let’s see our first example:

# numeric vector
num_vector <- c(1, 2, 3, 1, 2, 3, 2)

# creating a factor from num_vector
first_factor <- factor(num_vector)

first_factor
#> [1] 1 2 3 1 2 3 2
#> Levels: 1 2 3

As you can tell from the previous code snippet, factor() converts the numeric vector num_vector into a factor (i.e. a categorical variable) with 3 categories, the so-called levels.

You can also obtain a factor from a string vector:

# string vector
str_vector <- c('a', 'b', 'c', 'b', 'c', 'a', 'c', 'b')

str_vector
#> [1] "a" "b" "c" "b" "c" "a" "c" "b"

# creating a factor from str_vector
second_factor <- factor(str_vector)

second_factor
#> [1] a b c b c a c b
#> Levels: a b c

Notice how str_vector and second_factor are displayed. Even though the elements are the same in both the vector and the factor, they are printed in different formats. The letters in the string vector are displayed with quotes, while the letters in the factor are printed without quotes.

And of course, you can use a logical vector to generate a factor as well:

# logical vector
log_vector <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

# creating a factor from log_vector
third_factor <- factor(log_vector)

third_factor
#> [1] TRUE  FALSE TRUE  TRUE  FALSE
#> Levels: FALSE TRUE

14.2.2 How does R treat factors?

If you’re curious and check the technical R Language Definition, available online (https://cran.r-project.org/manuals.html), you’ll find that R factors are referred to as compound objects. According to the manual:

“Factors are currently implemented using an integer array to specify the actual levels and a second array of names that are mapped to the integers.”

Essentially, a factor is internally stored using two arrays: one is an integer array containing the codes of the categories, and the other is the “levels”, which holds the category names mapped to those integers.

Under the hood, R stores a factor as a vector of integer values. One way to confirm this is to use the function typeof():

typeof(first_factor)
#> [1] "integer"

This means that we can manipulate factors in many of the same ways we manipulate vectors. In addition, many functions for vectors can be applied to factors. For instance, we can use the function length() to get the number of elements in a factor:

# factors have length
length(first_factor)
#> [1] 7

We can also use the square brackets [ ] to extract or select elements of a factor. Inside the brackets we specify vectors of indices such as numeric vectors, logical vectors, and sometimes even character vectors.

# first element
first_factor[1]

# third element
first_factor[3]

# second to fourth elements
first_factor[2:4]

# last element
first_factor[length(first_factor)]

# logical subsetting
first_factor[rep(c(TRUE, FALSE), length.out = 7)]

If you have a factor with named elements, you can also specify the names of the elements within the brackets:

names(first_factor) <- letters[seq_along(first_factor)]
first_factor

first_factor[c('b', 'd', 'f')]

So what makes a factor different from a vector?

Well, it turns out that factors have an additional attribute that vectors don’t: levels. And as you can expect, the class of a factor is indeed "factor" (not "vector").

# attributes of a factor
attributes(first_factor)
#> $levels
#> [1] "1" "2" "3"
#> 
#> $class
#> [1] "factor"

Another feature that makes factors special is that their values, stored as integer codes, are mapped to a set of character labels (the levels) for display purposes. This might seem like a minor feature, but it has an important consequence: factors provide a way to store character values efficiently. Why? Because each unique character value is stored only once, in the levels, and the data itself is stored as a vector of integers.

Notice how the numeric value 1 was mapped to the character value "1". The same happens for the other values, 2 and 3, which are mapped to the characters "2" and "3".
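
This mapping has a practical consequence when a factor is built from numbers: converting it back with as.numeric() returns the integer codes, not the original values. The sketch below uses a hypothetical factor, num_factor, whose values differ from their codes so that the difference is visible:

```r
# hypothetical factor built from numbers other than 1, 2, 3
num_factor <- factor(c(10, 20, 10, 30))

# as.numeric() returns the integer codes, not the original numbers
as.numeric(num_factor)
#> [1] 1 2 1 3

# convert the labels to character first to recover the numbers
as.numeric(as.character(num_factor))
#> [1] 10 20 10 30
```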

14.2.3 What is the advantage of R factors?

Every time I teach about factors, there is inevitably one student who asks a very pertinent question: Why do we want to use factors? Isn’t it redundant to have a factor object when there are already character or integer vectors?

I have two answers to this question.

The first has to do with the storage of factors. Storing a factor as integers will usually be more efficient than storing a character vector. This matters especially when the data are of considerable size.

The second reason has to do with ordinal variables.
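
Before turning to ordinal variables, the storage argument can be checked roughly with object.size() from the utils package. This is only a sketch with made-up data (the names sizes_chr and sizes_fct are hypothetical), and the exact sizes reported depend on your platform and R version:

```r
set.seed(1)  # arbitrary seed, used only for reproducibility

# hypothetical data: one million values drawn from three labels
sizes_chr <- sample(c('small', 'medium', 'large'), 1e6, replace = TRUE)
sizes_fct <- factor(sizes_chr)

# approximate memory used by each representation
utils::object.size(sizes_chr)
utils::object.size(sizes_fct)
```

On a typical 64-bit installation the factor should be reported as noticeably smaller, since each element is a 4-byte integer code rather than an 8-byte pointer to a string.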

Qualitative data can be classified into nominal and ordinal variables. Nominal variables can be easily handled with character vectors. In fact, nominal means name (values are just names or labels), and there’s no natural order among the categories. A different case is when we have ordinal variables, like sizes "small", "medium", "large", or college years "freshman", "sophomore", "junior", "senior".

In these cases we are still using names of categories, but they can be arranged in increasing or decreasing order. In other words, we can rank the categories since they have a natural order: small is less than medium which is less than large. Likewise, freshman comes first, then sophomore, followed by junior, and finally senior.

So here’s an important question: How do we keep the order of categories in an ordinal variable? We can use a character vector to store the values. But a character vector does not allow us to store the ranking of categories. The solution in R comes via factors. We can use factors to define ordinal variables, like the following example:

sizes <- factor(c('sm', 'md', 'lg', 'sm', 'md'),
                levels = c('sm', 'md', 'lg'),
                ordered = TRUE)

sizes
#> [1] sm md lg sm md
#> Levels: sm < md < lg

As you can tell, sizes has ordered levels, clearly identifying the first category "sm", the second one "md", and the third one "lg".
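
Once the order is declared, comparison and sorting follow the order of the levels rather than alphabetical order. A short sketch, repeating the definition of sizes so that it runs on its own:

```r
sizes <- factor(c('sm', 'md', 'lg', 'sm', 'md'),
                levels = c('sm', 'md', 'lg'),
                ordered = TRUE)

# comparisons follow the level order, not alphabetical order
sizes < 'lg'
#> [1]  TRUE  TRUE FALSE  TRUE  TRUE

# sorting also respects the declared order
sort(sizes)
#> [1] sm sm md md lg
#> Levels: sm < md < lg

# the largest category present
max(sizes)
#> [1] lg
#> Levels: sm < md < lg
```

With a plain character vector, the same operations would fall back to alphabetical order, placing "lg" before "md" and "sm".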

Another advantage of factors is that many functions in R have been designed to work with factors. They expect the input to be a factor, and will act accordingly.
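
Two base R functions, table() and tapply(), illustrate this. The sketch below uses hypothetical data (group and score are made-up names) and declares a level 'c' that never occurs, to show that these functions work from the levels rather than from the values that happen to be present:

```r
# hypothetical data: a score for each observation and its group
group <- factor(c('a', 'b', 'a', 'b'), levels = c('a', 'b', 'c'))
score <- c(10, 20, 30, 40)

# table() counts every level, including the unused level 'c'
counts <- table(group)
counts
#> group
#> a b c 
#> 2 2 0

# tapply() computes a summary within each level
means <- tapply(score, group, mean)
means
#>  a  b  c 
#> 20 30 NA
```

A character vector has no way of recording that 'c' is a possible category, so it would simply be missing from both results.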