5 Factors

I’m one of those with the humble opinion that great software for data science and analytics should have a data structure dedicated to handling categorical data. Luckily, one of the nicest features of R is that it provides a data object exclusively designed to handle categorical data: factors.

The term “factor”, as used in R for handling categorical variables, comes from the terminology used in Analysis of Variance, commonly referred to as ANOVA. In this statistical method, a categorical variable is commonly referred to as, surprise-surprise, a factor, and its categories are known as levels. Perhaps this is not the best terminology, but it is the one R uses, which reflects its distinctive statistical origins. Especially for users without a background in statistics, this is one of R’s idiosyncrasies that seems disconcerting at the beginning. But as long as you keep in mind that a factor is just the object that allows you to handle a qualitative variable, you’ll be fine. In case you need it, here’s a short mantra to remember:

factors have levels

5.1 Creating Factors

To create a factor in R you use the function of the same name, factor(), which takes a vector as input. The vector can be numeric, character, or logical. Let’s see our first example:

# numeric vector
num_vector <- c(1, 2, 3, 1, 2, 3, 2)

# creating a factor from num_vector
first_factor <- factor(num_vector)

first_factor
> [1] 1 2 3 1 2 3 2
> Levels: 1 2 3

As you can tell from the previous code snippet, factor() converts the numeric vector num_vector into a factor (i.e. a categorical variable) with 3 categories, the so-called levels.

You can also obtain a factor from a string vector:

# string vector
str_vector <- c('a', 'b', 'c', 'b', 'c', 'a', 'c', 'b')

str_vector
> [1] "a" "b" "c" "b" "c" "a" "c" "b"

# creating a factor from str_vector
second_factor <- factor(str_vector)

second_factor
> [1] a b c b c a c b
> Levels: a b c

Notice how str_vector and second_factor are displayed. Even though the elements are the same in both the vector and the factor, they are printed in different formats. The letters in the string vector are displayed with quotes, while the letters in the factor are printed without quotes.

And of course, you can use a logical vector to generate a factor as well:

# logical vector
log_vector <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

# creating a factor from log_vector
third_factor <- factor(log_vector)

third_factor
> [1] TRUE  FALSE TRUE  TRUE  FALSE
> Levels: FALSE TRUE

5.2 How R treats factors

Technically speaking, R factors are referred to as compound objects. According to the “R Language Definition” manual:

“Factors are currently implemented using an integer array to specify the actual levels and a second array of names that are mapped to the integers.”

What does this mean?

Essentially, a factor is internally stored using two ingredients: one is an integer vector containing the values of categories, the other is a vector with the “levels” which has the names of categories which are mapped to the integers.
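As a minimal sketch of these two ingredients (the object names f, codes, lev, and rebuilt are only illustrative), the integer codes and the levels can be pulled apart with as.integer() and levels(). Indexing the levels by the codes gives back the original strings:

```r
# a small factor built from strings
f <- factor(c("a", "b", "c", "b"))

# ingredient 1: the integer codes
codes <- as.integer(f)

# ingredient 2: the names of the categories
lev <- levels(f)

# indexing the levels by the codes recovers the original values
rebuilt <- lev[codes]
rebuilt
```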

Under the hood, R stores factors as vectors of integer values. One way to confirm this is to use the function storage.mode():

# storage of factor
storage.mode(first_factor)
> [1] "integer"

This means that we can manipulate factors just like we manipulate vectors. In addition, many functions for vectors can be applied to factors. For instance, we can use the function length() to get the number of elements in a factor:

# factors have length
length(first_factor)
> [1] 7

We can also use the square brackets [ ] to extract or select elements of a factor. Inside the brackets we specify vectors of indices such as numeric vectors, logical vectors, and sometimes even character vectors.

# first element
first_factor[1]

# third element
first_factor[3]

# second to fourth elements
first_factor[2:4]

# last element
first_factor[length(first_factor)]

# logical subsetting
first_factor[rep(c(TRUE, FALSE), length.out = 7)]
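One detail worth knowing when subsetting: by default the result keeps the full set of levels, even those that no longer appear among the selected elements. The sketch below (f, kept, and dropped are illustrative names) contrasts the default with the drop = TRUE argument accepted by the subsetting operator for factors:

```r
f <- factor(c(1, 2, 3, 1, 2, 3, 2))

# default subsetting keeps all three levels
kept <- f[1]
levels(kept)

# drop = TRUE discards the levels that are no longer used
dropped <- f[1, drop = TRUE]
levels(dropped)
```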

If you have a factor with named elements, you can also specify the names of the elements within the brackets:

names(first_factor) <- letters[1:length(first_factor)]
first_factor
> a b c d e f g 
> 1 2 3 1 2 3 2 
> Levels: 1 2 3

first_factor[c('b', 'd', 'f')]
> b d f 
> 2 1 3 
> Levels: 1 2 3

However, you should know that factors are NOT really vectors. To see this you can check the behavior of the functions is.factor() and is.vector() on a factor:

# factors are not vectors
is.vector(first_factor)
> [1] FALSE

# factors are factors
is.factor(first_factor)
> [1] TRUE
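
The reason is.vector() returns FALSE here is that it only returns TRUE for objects that have no attributes other than names, and a factor carries the levels and class attributes. A small sketch, with f as an illustrative name:

```r
f <- factor(c("a", "b", "a"))

# the factor itself is not considered a vector
is.vector(f)

# its bare integer codes are a vector
is.vector(as.integer(f))

# and so is the character vector obtained with as.character()
is.vector(as.character(f))
```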

Even a single element of a factor is also a factor:

class(first_factor[1])
> [1] "factor"

So what makes a factor different from a vector?

Well, it turns out that factors have an additional attribute that vectors don’t: levels. And as you can expect, the class of a factor is indeed "factor" (not "vector").

# attributes of a factor
attributes(first_factor)
> $levels
> [1] "1" "2" "3"
> 
> $class
> [1] "factor"
> 
> $names
> [1] "a" "b" "c" "d" "e" "f" "g"

Another feature that makes factors so special is that their values (the integer codes) are mapped to a set of character values, the levels, for display purposes. In the output of attributes() above, notice how the numeric value 1 was mapped into the character value "1", and likewise the values 2 and 3 were mapped into the characters "2" and "3".

This might seem like a minor feature, but it has an important consequence: factors provide a way to store character values efficiently. Why? Because each unique character value is stored only once, and the data itself is stored as a vector of integers.

What is the advantage of R factors?

Every time I teach about factors, there is inevitably one student who asks a very pertinent question: Why do we want to use factors? Isn’t it redundant to have a factor object when there are already character or integer vectors?

I have two answers to this question.

The first has to do with the storage of factors. Storing a factor as integers will usually be more efficient than storing a character vector. This matters especially when the data to be encoded into a factor is of considerable size.
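
One rough way to check the storage claim is to compare object.size() on a character vector and on the same data encoded as a factor. This is only a sketch: the names sizes_chr and sizes_fac are made up for the illustration, and the exact numbers (and the size of the gap) depend on your R version and platform.

```r
# a long character vector with only three distinct values
sizes_chr <- rep(c("small", "medium", "large"), length.out = 1e5)

# the same data encoded as a factor
sizes_fac <- factor(sizes_chr)

# memory footprint of each object (exact numbers depend on platform)
object.size(sizes_chr)
object.size(sizes_fac)
```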

The second reason has to do with categorical variables of ordinal nature. Qualitative data can be classified into nominal and ordinal variables. Nominal variables could be easily handled with character vectors. In fact, nominal means name (values are just names or labels), and there’s no natural order among the categories.

A different story is when we have ordinal variables, like sizes "small", "medium", "large" or college years "freshman", "sophomore", "junior", "senior". In these cases we are still using names of categories, but they can be arranged in increasing or decreasing order. In other words, we can rank the categories since they have a natural order: small is less than medium which is less than large. Likewise, freshman comes first, then sophomore, followed by junior, and finally senior.

So here’s an important question: How do we keep the order of categories in an ordinal variable? We can use a character vector to store the values. But a character vector does not allow us to store the ranking of categories. The solution in R comes via factors. We can use factors to define ordinal variables, like the following example:

sizes <- factor(
  x = c('sm', 'md', 'lg', 'sm', 'md'),
  levels = c('sm', 'md', 'lg'),
  ordered = TRUE)

sizes
> [1] sm md lg sm md
> Levels: sm < md < lg

As you can tell, sizes has ordered levels, clearly identifying the first category "sm", the second one "md", and the third one "lg".
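
Because the ranking is stored in the factor, operations that depend on order become meaningful. The following sketch rebuilds the sizes factor and tries a comparison, a maximum, and a sort; the names smaller_than_lg, largest, and sorted are only illustrative.

```r
sizes <- factor(
  x = c("sm", "md", "lg", "sm", "md"),
  levels = c("sm", "md", "lg"),
  ordered = TRUE)

# element-wise comparison against a level
smaller_than_lg <- sizes < "lg"
smaller_than_lg

# the largest category present
largest <- max(sizes)
largest

# sorting follows the order of the levels, not the alphabet
sorted <- sort(sizes)
sorted
```

With a plain character vector, the same comparison and sort would follow alphabetical order instead of the sm, md, lg ranking.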

5.3 A closer look at `factor()`

Since working with categorical data in R typically involves working with factors, you should become familiar with the variety of functions related to them. In the following sections we’ll cover a bunch of details about factors so you can be better prepared to deal with any type of categorical data.

5.3.1 Function `factor()`

Given the fundamental role played by the function factor(), we need to take a closer look at its arguments. If you check the documentation (see help(factor)), you’ll see that the usage of the function factor() is:

  factor(x = character(), levels, labels = levels,
         exclude = NA, ordered = is.ordered(x), nmax = NA)

with the following arguments:

x: a vector of data
levels: an optional vector for the categories
labels: an optional character vector of labels for the levels
exclude: a vector of values to be excluded when forming the set of levels
ordered: logical value to indicate if the levels should be regarded as ordered
nmax: an upper bound on the number of levels

The main argument of factor() is the input vector x. The next argument is levels, followed by labels, both of which are optional arguments. Although you won’t always be providing values for levels and labels, it is important to understand how R handles these arguments by default.

Argument `levels`

If levels is not provided (which is what happens in most cases), then R uses the sorted unique values in x as the category levels.

For example, consider our numeric vector from the first example: num_vector contains unique values 1, 2, and 3.

# numeric vector
num_vector <- c(1, 2, 3, 1, 2, 3, 2)

# creating a factor from num_vector
first_factor <- factor(num_vector)

first_factor
> [1] 1 2 3 1 2 3 2
> Levels: 1 2 3
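
Note that the default levels are the sorted unique values, not the values in order of appearance. The sketch below (x, default_levels, and appearance_levels are illustrative names) shows the difference, and one possible way to keep the order of first appearance by passing unique(x) as levels:

```r
x <- c("b", "c", "a", "b")

# default: levels are the sorted unique values
default_levels <- levels(factor(x))
default_levels

# levels in order of first appearance instead
appearance_levels <- levels(factor(x, levels = unique(x)))
appearance_levels
```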

Now imagine we want to have levels 1, 2, 3, 4, and 5. This is how you can define the factor with an extended set of levels:

# numeric vector
num_vector
> [1] 1 2 3 1 2 3 2

# defining levels
one_factor <- factor(num_vector, levels = 1:5)
one_factor
> [1] 1 2 3 1 2 3 2
> Levels: 1 2 3 4 5

Although the created factor only has values between 1 and 3, the levels range from 1 to 5. This can be useful if we plan to add elements whose values are not in the input vector num_vector. For instance, you can append two more elements to one_factor with values 4 and 5 like this:

# adding values 4 and 5
one_factor[c(8, 9)] <- c(4, 5)
one_factor
> [1] 1 2 3 1 2 3 2 4 5
> Levels: 1 2 3 4 5

If you attempt to insert an element having a value that is not in the predefined set of levels, R will insert a missing value (<NA>) instead, and you’ll get a warning message like the one below:

# attempting to add value 6 (not in levels)
one_factor[1] <- 6
> Warning in `[<-.factor`(`*tmp*`, 1, value = 6): invalid factor level, NA
> generated
one_factor
> [1] <NA> 2    3    1    2    3    2    4    5   
> Levels: 1 2 3 4 5
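
If you do need a value outside the predefined levels, one possible approach is to extend the set of levels before assigning. This is a sketch with an illustrative factor f, not the only way to do it:

```r
f <- factor(c(1, 2, 3), levels = 1:5)

# extend the set of levels first ...
levels(f) <- c(levels(f), "6")

# ... then the assignment no longer produces NA
f[1] <- "6"
f
```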

Argument `labels`

Another very useful argument is labels, which allows you to provide a string vector for naming the levels in a different way from the values in x. Let’s take the numeric vector num_vector again, and say we want to use words as labels instead of numeric values. Here’s how you can create a factor with predefined labels:

# defining labels
num_word_vector <- factor(num_vector, labels = c("one", "two", "three"))

num_word_vector
> [1] one   two   three one   two   three two  
> Levels: one two three
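
Labels are matched to levels by position, so it is often safer to spell out levels and labels together. A small sketch, assuming a hypothetical vector answers coded as 0 and 1:

```r
answers <- c(1, 0, 0, 1, 1)

# labels are matched to levels by position: 0 -> "no", 1 -> "yes"
yes_no <- factor(answers, levels = c(0, 1), labels = c("no", "yes"))
yes_no
```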

Argument `exclude`

If you want to ignore some values of the input vector x, you can use the exclude argument. You just need to provide those values which will be removed from the set of levels.

# excluding level 3
factor(num_vector, exclude = 3)
> [1] 1    2    <NA> 1    2    <NA> 2   
> Levels: 1 2

# excluding levels 1 and 3
factor(num_vector, exclude = c(1,3))
> [1] <NA> 2    <NA> <NA> 2    <NA> 2   
> Levels: 2

The side effect of exclude is that it returns a missing value (<NA>) for each element that was excluded, which is not always what we want. Here’s one way to remove the missing values when excluding 3:

# excluding level 3
num_fac12 <- factor(num_vector, exclude = 3)

# oops, we have some missing values
num_fac12
> [1] 1    2    <NA> 1    2    <NA> 2   
> Levels: 1 2
# removing missing values
num_fac12[!is.na(num_fac12)]
> [1] 1 2 1 2 2
> Levels: 1 2
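
Another option, sketched here with the illustrative names via_exclude and via_filter, is to filter the input vector before building the factor. Both routes end up with the same elements and the same levels:

```r
num_vector <- c(1, 2, 3, 1, 2, 3, 2)

# approach from the text: exclude, then drop the NAs
via_exclude <- factor(num_vector, exclude = 3)
via_exclude <- via_exclude[!is.na(via_exclude)]

# alternative: filter the vector before building the factor
via_filter <- factor(num_vector[num_vector != 3])
via_filter
```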

5.3.2 Unclassing factors

We’ve mentioned that factors are stored as vectors of integers (for efficiency reasons). But we also said that factors are more than vectors. Even though a factor is displayed with string labels, it is stored internally as integers. Why is this important to know? Because there will be occasions in which you’ll need to know exactly which integer is associated with each level.

Imagine you have a factor with levels 11, 22, 33, 44.

# factor
xfactor <- factor(c(22, 11, 44, 33, 11, 22, 44))
xfactor
> [1] 22 11 44 33 11 22 44
> Levels: 11 22 33 44

To obtain the integer vector associated with xfactor you can use the function unclass():

# unclassing a factor
unclass(xfactor)
> [1] 2 1 4 3 1 2 4
> attr(,"levels")
> [1] "11" "22" "33" "44"

As you can see, the levels "11", "22", "33", "44" were mapped to the vector of integers (1 2 3 4).

An alternative option is to apply as.numeric() or as.integer(), which return the same integer codes as unclass() but without the levels attribute:

# same integer codes as unclass(), without the levels
as.integer(xfactor)
> [1] 2 1 4 3 1 2 4

# same integer codes as unclass(), without the levels
as.numeric(xfactor)
> [1] 2 1 4 3 1 2 4

Although it is rarely needed, there can be some cases in which you have to go back from the integer codes to the original numeric values. This is only possible when the levels of the factor are themselves numbers. To accomplish this, use the following command:

# recovering numeric levels
as.numeric(levels(xfactor))[xfactor]
> [1] 22 11 44 33 11 22 44
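
A common pitfall is to call as.numeric() directly on such a factor and expect the original numbers back. The sketch below (codes and values are illustrative names) contrasts that with converting through as.character() first, which is another way to recover the values:

```r
xfactor <- factor(c(22, 11, 44, 33, 11, 22, 44))

# pitfall: this returns the integer codes, not the original numbers
codes <- as.numeric(xfactor)
codes

# going through character first recovers the original values
values <- as.numeric(as.character(xfactor))
values
```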

5.4 Ordinal Factors

By default, factor() creates a nominal categorical variable, not an ordinal one. One way to check whether a factor is nominal or ordinal is to use the function is.ordered(), which returns TRUE only if its argument is an ordinal factor.

# ordinal factor?
is.ordered(first_factor)
> [1] FALSE

If you want to specify an ordinal factor you must use the ordered argument of factor(). This is how you can generate an ordinal factor from num_vector:

# ordinal factor from numeric vector
ordinal_num <- factor(num_vector, ordered = TRUE)
ordinal_num
> [1] 1 2 3 1 2 3 2
> Levels: 1 < 2 < 3

As you can tell from the snippet above, the levels of ordinal_num are displayed with less-than symbols (<), which means that the levels have an increasing order. We can also get an ordinal factor from our string vector:

# ordinal factor from character vector
ordinal_str <- factor(str_vector, ordered = TRUE)
ordinal_str
> [1] a b c b c a c b
> Levels: a < b < c

In fact, when you set ordered = TRUE without specifying levels, R sorts the unique values in alphanumeric order and uses that as the ranking (the exact collation can depend on your locale). If you have the following alphanumeric vector ("a1", "1a", "1b", "b1"), what do you think will be the generated ordered factor? Let’s check the answer:

# alphanumeric vector
alphanum <- c("a1", "1a", "1b", "b1")

# ordinal factor from character vector
ordinal_alphanum <- factor(alphanum, ordered = TRUE)
ordinal_alphanum
> [1] a1 1a 1b b1
> Levels: 1a < 1b < a1 < b1

An alternative way to specify an ordinal variable is by using the function ordered(), which is just a convenient wrapper for factor(x, ..., ordered = TRUE):

# ordinal factor with ordered()
ordered(num_vector)
> [1] 1 2 3 1 2 3 2
> Levels: 1 < 2 < 3

# same as using 'ordered' argument
factor(num_vector, ordered = TRUE)
> [1] 1 2 3 1 2 3 2
> Levels: 1 < 2 < 3

A word of caution. Don’t confuse the function ordered() with order(). They are not equivalent. order() returns the permutation of indices that arranges a vector in ascending or descending order (it is sort() that returns the sorted vector). ordered(), as we’ve seen, is used to get ordinal factors.
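
To make the distinction concrete, here is a small sketch with an illustrative numeric vector x, comparing order(), sort(), and ordered():

```r
x <- c(30, 10, 20)

# order() gives the positions that would sort x
idx <- order(x)
idx

# sort() gives the sorted values themselves
sorted <- sort(x)
sorted

# indexing by the result of order() is equivalent to sort()
same <- x[idx]

# ordered() builds an ordinal factor
ord <- ordered(x)
ord
```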

Of course, you won’t always be using the default order provided by the functions factor(..., ordered = TRUE) or ordered(). Sometimes you want to determine categories according to a different order.

For example, let’s take the values of str_vector and let’s assume that we want them in descending order, that is, c < b < a. How can you do that? Easy, you just need to specify the levels in the order you want them and set ordered = TRUE (or use ordered()):

# setting levels with specified order
factor(str_vector, levels = c("c", "b", "a"), ordered = TRUE)
> [1] a b c b c a c b
> Levels: c < b < a

# equivalently
ordered(str_vector, levels = c("c", "b", "a"))
> [1] a b c b c a c b
> Levels: c < b < a

Here’s another example. Consider a set of size values "xs" extra-small, "sm" small, "md" medium, "lg" large, and "xl" extra-large. If you have a vector with size values you can create an ordinal variable as follows:

# vector of sizes
sizes <- c("sm", "xs", "xl", "lg", "xs", "lg")

# setting levels with specified order
ordered(sizes, levels = c("xs", "sm", "md", "lg", "xl"))
> [1] sm xs xl lg xs lg
> Levels: xs < sm < md < lg < xl
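
Declaring all five levels also pays off when summarizing the data. In this sketch (size_fac and counts are illustrative names), table() reports the counts in level order and includes the "md" category even though it never occurs:

```r
sizes <- c("sm", "xs", "xl", "lg", "xs", "lg")
size_fac <- ordered(sizes, levels = c("xs", "sm", "md", "lg", "xl"))

# counts are reported in level order, including the empty "md"
counts <- table(size_fac)
counts
```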

Notice that when you create an ordinal factor, the given levels will always be considered in an increasing order. This means that the first value of levels will be the smallest one, then the second one, and so on. The last category, in turn, is taken as the one at the top of the scale.

Now that we have several nominal and ordinal factors, we can compare the behavior of is.ordered() on two factors:

# is.ordered() on an ordinal factor
ordinal_str
> [1] a b c b c a c b
> Levels: a < b < c
is.ordered(ordinal_str)
> [1] TRUE

# is.ordered() on a nominal factor
second_factor
> [1] a b c b c a c b
> Levels: a b c
is.ordered(second_factor)
> [1] FALSE
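
Another way to see the difference is to inspect the class of each object. In this sketch, with the illustrative names nominal and ordinal, the ordinal factor carries the extra class "ordered" while still being a factor:

```r
nominal <- factor(c("a", "b", "c"))
ordinal <- factor(c("a", "b", "c"), ordered = TRUE)

# an ordinal factor carries the extra class "ordered"
class(nominal)
class(ordinal)

# it is still a factor
is.factor(ordinal)
```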