9 Factors
I’m one of those who hold the humble opinion that great software for data science and analytics should have a data structure dedicated to handling categorical data. Luckily, one of the nicest features of R is that it provides a data object designed exclusively for categorical data: factors.
The term “factor”, as used in R for handling categorical variables, comes from the terminology of Analysis of Variance, commonly referred to as ANOVA. In this statistical method, a categorical variable is called, surprise-surprise, a factor, and its categories are known as levels. Perhaps this is not the best terminology, but it is the one R uses, which reflects its distinctive statistical origins. Especially for users without a background in statistics, this is one of R’s idiosyncrasies that seems disconcerting at the beginning. But as long as you keep in mind that a factor is just the object that allows you to handle a qualitative variable, you’ll be fine. In case you need it, here’s a short mantra to remember:
factors have levels
9.1 Creating Factors
To create a factor in R you use the function of the same name, factor(), which takes a vector as input. The vector can be numeric, character, or logical. Let’s see our first example:
# numeric vector
num_vector <- c(1, 2, 3, 1, 2, 3, 2)
# creating a factor from num_vector
first_factor <- factor(num_vector)
first_factor
> [1] 1 2 3 1 2 3 2
> Levels: 1 2 3
As you can tell from the previous code snippet, factor() converts the numeric vector num_vector into a factor (i.e. a categorical variable) with 3 categories, the so-called levels.
You can also obtain a factor from a string vector:
# string vector
str_vector <- c('a', 'b', 'c', 'b', 'c', 'a', 'c', 'b')
str_vector
> [1] "a" "b" "c" "b" "c" "a" "c" "b"
# creating a factor from str_vector
second_factor <- factor(str_vector)
second_factor
> [1] a b c b c a c b
> Levels: a b c
Notice how str_vector and second_factor are displayed. Even though the
elements are the same in both the vector and the factor, they are printed in
different formats. The letters in the string vector are displayed with quotes,
while the letters in the factor are printed without quotes.
And of course, you can use a logical vector to generate a factor as well:
# logical vector
log_vector <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
# creating a factor from log_vector
third_factor <- factor(log_vector)
third_factor
> [1] TRUE FALSE TRUE TRUE FALSE
> Levels: FALSE TRUE
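As a quick check of the mantra, the levels of any factor can be retrieved with levels() and counted with nlevels(), both base R functions. Here is a minimal sketch that rebuilds the three factors of this section so it can run on its own:

```r
# factors built from numeric, character, and logical vectors
first_factor <- factor(c(1, 2, 3, 1, 2, 3, 2))
second_factor <- factor(c('a', 'b', 'c', 'b', 'c', 'a', 'c', 'b'))
third_factor <- factor(c(TRUE, FALSE, TRUE, TRUE, FALSE))

# levels are always stored as character strings,
# whatever the type of the input vector
levels(first_factor)
levels(second_factor)
levels(third_factor)

# number of levels
nlevels(third_factor)
```

Note that the levels come back as character strings even when the input vector was numeric or logical.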
9.2 How R treats factors
Technically speaking, R factors are referred to as compound objects. According to the “R Language Definition” manual:
“Factors are currently implemented using an integer array to specify the actual levels and a second array of names that are mapped to the integers.”
What does this mean?
Essentially, a factor is internally stored using two ingredients: one is an integer vector containing the values of categories, the other is a vector with the “levels” which has the names of categories which are mapped to the integers.
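Both ingredients can be inspected directly. The following is a minimal sketch, assuming a small made-up factor called fruit (the name and its values are only for illustration):

```r
# a small factor (hypothetical example values)
fruit <- factor(c("pear", "apple", "pear", "kiwi"))

# ingredient 1: the integer codes, one per element
codes <- as.integer(fruit)
codes

# ingredient 2: the levels, each stored only once
lev <- levels(fruit)
lev

# indexing the levels with the codes recovers the original values
lev[codes]
```

Each integer code is simply the position of the element's category in the vector of levels.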
Under the hood, the way R stores factors is as vectors of integer values.
One way to confirm this is using the function storage.mode():
# storage of factor
storage.mode(first_factor)
> [1] "integer"
This means that we can manipulate factors much like we manipulate vectors. In addition, many functions for vectors can be applied to factors. For instance, we can use the function length() to get the number of elements in a factor:
# factors have length
length(first_factor)
> [1] 7
We can also use the square brackets [ ] to extract or select elements of a factor. Inside the brackets we specify vectors of indices such as numeric vectors, logical vectors, and sometimes even character vectors.
# first element
first_factor[1]
# third element
first_factor[3]
# second to fourth elements
first_factor[2:4]
# last element
first_factor[length(first_factor)]
# logical subsetting
first_factor[rep(c(TRUE, FALSE), length.out = 7)]
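One detail worth seeing in action is what happens to the levels when you subset. The sketch below recreates first_factor so it runs on its own, and shows that a subset keeps the full set of levels unless you ask for unused ones to be dropped with the drop argument of the bracket operator:

```r
first_factor <- factor(c(1, 2, 3, 1, 2, 3, 2))

# a subset is still a factor and keeps all three levels,
# even though the value 3 is no longer present
sub <- first_factor[1:2]
sub
levels(sub)

# drop = TRUE discards the levels that no longer appear
sub_dropped <- first_factor[1:2, drop = TRUE]
levels(sub_dropped)

# negative indices exclude elements, as with vectors
first_factor[-1]
```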
If you have a factor with named elements, you can also specify the names of the elements within the brackets:
names(first_factor) <- letters[1:length(first_factor)]
first_factor
> a b c d e f g
> 1 2 3 1 2 3 2
> Levels: 1 2 3
first_factor[c('b', 'd', 'f')]
> b d f
> 2 1 3
> Levels: 1 2 3
However, you should know that factors are NOT really vectors. To see this you can check the behavior of the functions is.factor() and is.vector() on a factor:
# factors are not vectors
is.vector(first_factor)
> [1] FALSE
# factors are factors
is.factor(first_factor)
> [1] TRUE
Even a single element of a factor is also a factor:
class(first_factor[1])
> [1] "factor"
So what makes a factor different from a vector?
Well, it turns out that factors have an additional attribute that vectors don’t: levels. And as you can expect, the class of a factor is indeed "factor" (not "vector").
# attributes of a factor
attributes(first_factor)
> $levels
> [1] "1" "2" "3"
>
> $class
> [1] "factor"
>
> $names
> [1] "a" "b" "c" "d" "e" "f" "g"
Notice how the numeric value 1 was mapped into the character value "1". The same happens for the other values, 2 and 3, which are mapped into the characters "2" and "3".
Another feature that makes factors special is that their values, stored internally as integers, are mapped to a set of character values (the levels) for display purposes. This might seem like a minor feature, but it has an important consequence: factors provide a way to store character values efficiently. Why? Because each unique character value is stored only once, and the data itself is stored as a vector of integers.
What is the advantage of R factors?
Every time I teach about factors, there is inevitably one student who asks a very pertinent question: Why do we want to use factors? Isn’t it redundant to have a factor object when there are already character or integer vectors?
I have two answers to this question.
The first has to do with the storage of factors. Storing a factor as integers will usually be more efficient than storing a character vector. As we’ve seen, this is an important issue especially when the data—to be encoded into a factor—is of considerable size.
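A rough way to see the storage argument is to compare the memory footprint of a long character vector with that of the equivalent factor. This is only a sketch: the data are simulated, the category names are made up, and the exact sizes reported by object.size() depend on your R version and platform. R also keeps a shared pool of strings internally, so the saving is smaller than you might guess; on a typical 64-bit installation the factor takes roughly half the space.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# 100,000 values drawn from three hypothetical categories
chr <- sample(c("small", "medium", "large"), size = 1e5, replace = TRUE)
fac <- factor(chr)

# approximate memory used by each object
utils::object.size(chr)
utils::object.size(fac)
```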
The second reason has to do with categorical variables of ordinal nature. Qualitative data can be classified into nominal and ordinal variables. Nominal variables could be easily handled with character vectors. In fact, nominal means name (values are just names or labels), and there’s no natural order among the categories.
A different story is when we have ordinal variables, like sizes "small", "medium", "large" or college years "freshman", "sophomore", "junior", "senior". In these cases we are still using names of categories, but they
can be arranged in increasing or decreasing order. In other words, we can rank
the categories since they have a natural order: small is less than medium which
is less than large. Likewise, freshman comes first, then sophomore, followed by
junior, and finally senior.
So here’s an important question: How do we keep the order of categories in an ordinal variable? We can use a character vector to store the values. But a character vector does not allow us to store the ranking of categories. The solution in R comes via factors. We can use factors to define ordinal variables, like the following example:
sizes <- factor(
  x = c('sm', 'md', 'lg', 'sm', 'md'),
  levels = c('sm', 'md', 'lg'),
  ordered = TRUE)
sizes
> [1] sm md lg sm md
> Levels: sm < md < lg
As you can tell, sizes has ordered levels, clearly identifying the first category "sm", the second one "md", and the third one "lg".
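What the stored ranking buys you in practice is that comparisons and sorting follow the order of the levels rather than the alphabet. A minimal sketch, recreating the sizes factor so the snippet is self-contained:

```r
sizes <- factor(
  x = c('sm', 'md', 'lg', 'sm', 'md'),
  levels = c('sm', 'md', 'lg'),
  ordered = TRUE)

# comparisons respect the order of the levels
sizes < "lg"

# smallest and largest categories present in the data
min(sizes)
max(sizes)

# sorting follows the level order, not alphabetical order
sort(sizes)
```

With a plain character vector, sort() would place "lg" before "md" and "sm", which is not the ranking we have in mind.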
9.3 A closer look at factor()
Since working with categorical data in R typically involves working with factors, you should become familiar with the variety of functions related to them. In the following sections we’ll cover a number of details about factors so you can be better prepared to deal with any type of categorical data.
9.3.1 Function factor()
Given the fundamental role played by the function factor(), we need to take a closer look at its arguments. If you check the documentation (see help(factor)), you’ll see that the usage of the function factor() is:
factor(x = character(), levels, labels = levels,
exclude = NA, ordered = is.ordered(x), nmax = NA)
with the following arguments:
x: a vector of data
levels: an optional vector for the categories
labels: an optional character vector of labels for the levels
exclude: a vector of values to be excluded when forming the set of levels
ordered: logical value to indicate if the levels should be regarded as ordered
nmax: an upper bound on the number of levels
The main argument of factor() is the input vector x. The next argument is levels, followed by labels, both of which are optional arguments. Although you won’t always be providing values for levels and labels, it is important to understand how R handles these arguments by default.
Argument levels
If levels is not provided (which is what happens in most cases), then R assigns the sorted unique values in x as the category levels.
For example, consider our numeric vector from the first example: num_vector contains unique values 1, 2, and 3.
# numeric vector
num_vector <- c(1, 2, 3, 1, 2, 3, 2)
# creating a factor from num_vector
first_factor <- factor(num_vector)
first_factor
> [1] 1 2 3 1 2 3 2
> Levels: 1 2 3
Now imagine we want to have levels 1, 2, 3, 4, and 5. This is how you can define the factor with an extended set of levels:
# numeric vector
num_vector
> [1] 1 2 3 1 2 3 2
# defining levels
one_factor <- factor(num_vector, levels = 1:5)
one_factor
> [1] 1 2 3 1 2 3 2
> Levels: 1 2 3 4 5
Although the created factor only has values between 1 and 3, the levels range from 1 to 5. This can be useful if we plan to add elements whose values are not in the input vector num_vector. For instance, you can append two more elements to one_factor with values 4 and 5 like this:
# adding values 4 and 5
one_factor[c(8, 9)] <- c(4, 5)
one_factor
> [1] 1 2 3 1 2 3 2 4 5
> Levels: 1 2 3 4 5
If you attempt to insert an element having a value that is not in the predefined set of levels, R will insert a missing value (<NA>) instead, and you’ll get a warning message like the one below:
# attempting to add value 6 (not in levels)
one_factor[1] <- 6
> Warning in `[<-.factor`(`*tmp*`, 1, value = 6): invalid factor level, NA
> generated
one_factor
> [1] <NA> 2 3 1 2 3 2 4 5
> Levels: 1 2 3 4 5
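Three related behaviors of the levels argument are worth trying yourself. The sketch below rebuilds one_factor from scratch; the approach of extending the levels before assigning a new value is one possible way to avoid the warning, not the only one:

```r
num_vector <- c(1, 2, 3, 1, 2, 3, 2)
one_factor <- factor(num_vector, levels = 1:5)

# unused levels show up in a frequency table with a count of zero
table(one_factor)

# values of x that are not in 'levels' become NA at creation time,
# and in this case there is no warning
factor(c(1, 2, 6), levels = 1:5)

# to store a new category, extend the set of levels first
levels(one_factor) <- c(levels(one_factor), "6")
one_factor[1] <- 6
one_factor
```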
Argument labels
Another very useful argument is labels, which allows you to provide a string vector for naming the levels in a different way from the values in x. Let’s take the numeric vector num_vector again, and say we want to use words as labels instead of numeric values. Here’s how you can create a factor with predefined labels:
# defining labels
num_word_vector <- factor(num_vector, labels = c("one", "two", "three"))
num_word_vector
> [1] one two three one two three two
> Levels: one two three
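Labels are paired with levels position by position, so when the default (sorted) order of the levels is not obvious it can be safer to give levels and labels together. A minimal sketch with made-up survey codes (the vector sex_codes and its values are hypothetical):

```r
# hypothetical survey codes: "f" and "m"
sex_codes <- c("m", "f", "f", "m")

# the first label goes with the first level, the second with the second
sex <- factor(sex_codes,
              levels = c("f", "m"),
              labels = c("female", "male"))
sex
levels(sex)
```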
Argument exclude
If you want to ignore some values of the input vector x, you can use the exclude argument. You just need to provide those values which will be removed from the set of levels.
# excluding level 3
factor(num_vector, exclude = 3)
> [1] 1 2 <NA> 1 2 <NA> 2
> Levels: 1 2
# excluding levels 1 and 3
factor(num_vector, exclude = c(1,3))
> [1] <NA> 2 <NA> <NA> 2 <NA> 2
> Levels: 2
The side effect of exclude is that it returns a missing value (<NA>) for each element that was excluded, which is not always what we want. Here’s one way to remove the missing values when excluding 3:
# excluding level 3
num_fac12 <- factor(num_vector, exclude = 3)
# oops, we have some missing values
num_fac12
> [1] 1 2 <NA> 1 2 <NA> 2
> Levels: 1 2
# removing missing values
num_fac12[!is.na(num_fac12)]
> [1] 1 2 1 2 2
> Levels: 1 2
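If the factor already exists, a different route to the same result is to subset it and then remove the unused level with the base R function droplevels(). This is an alternative to exclude rather than something the argument does for you; a sketch:

```r
num_vector <- c(1, 2, 3, 1, 2, 3, 2)
first_factor <- factor(num_vector)

# keep only the elements different from "3"
kept <- first_factor[first_factor != "3"]
kept
levels(kept)   # level "3" is still listed

# droplevels() removes the levels that have no elements
kept <- droplevels(kept)
levels(kept)
```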
9.3.2 Unclassing factors
We’ve mentioned that factors are stored as vectors of integers (for efficiency reasons). But we also said that factors are more than vectors. Even though a factor is displayed with string labels, it is stored internally as integers. Why is this important to know? Because there will be occasions in which you’ll need to know exactly which number is associated with each level.
Imagine you have a factor with levels 11, 22, 33, 44.
# factor
xfactor <- factor(c(22, 11, 44, 33, 11, 22, 44))
xfactor
> [1] 22 11 44 33 11 22 44
> Levels: 11 22 33 44
To obtain the integer vector associated with xfactor you can use the function unclass():
# unclassing a factor
unclass(xfactor)
> [1] 2 1 4 3 1 2 4
> attr(,"levels")
> [1] "11" "22" "33" "44"
As you can see, the levels "11", "22", "33", "44" were mapped to the integers 1, 2, 3, 4.
An alternative option is to simply apply as.numeric() or as.integer() instead of using unclass(). Both return the same integer codes, without the levels attribute:
# integer codes (levels attribute dropped)
as.integer(xfactor)
> [1] 2 1 4 3 1 2 4
# same codes, stored as doubles
as.numeric(xfactor)
> [1] 2 1 4 3 1 2 4
Although rarely needed, there are cases in which you want to go the other way and recover the original numeric values from the integer codes. This is only possible when the levels of the factor are themselves numbers. To accomplish this use the following command:
# recovering numeric levels
as.numeric(levels(xfactor))[xfactor]
> [1] 22 11 44 33 11 22 44
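This conversion is a classic source of bugs, so it helps to see the wrong and the right versions side by side. A minimal sketch, recreating xfactor so it runs on its own:

```r
xfactor <- factor(c(22, 11, 44, 33, 11, 22, 44))

# integer codes, NOT the original numbers
as.numeric(xfactor)

# going through the character labels recovers the values
as.numeric(as.character(xfactor))

# same result, converting only the levels and then indexing
as.numeric(levels(xfactor))[xfactor]
```

The last form converts each distinct level once instead of converting every element, which is why it is often suggested for long factors.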
9.4 Ordinal Factors
By default, factor() creates a nominal categorical variable, not an ordinal one. One way to check that you have a nominal factor is to use the function is.ordered(), which returns TRUE if its argument is an ordinal factor.
# ordinal factor?
is.ordered(first_factor)
> [1] FALSE
If you want to specify an ordinal factor you must use the ordered argument of factor(). This is how you can generate an ordinal factor from num_vector:
# ordinal factor from numeric vector
ordinal_num <- factor(num_vector, ordered = TRUE)
ordinal_num
> [1] 1 2 3 1 2 3 2
> Levels: 1 < 2 < 3
As you can tell from the snippet above, the levels of ordinal_num are displayed with less-than symbols (<), which means that the levels have an increasing order. We can also get an ordinal factor from our string vector:
# ordinal factor from character vector
ordinal_str <- factor(str_vector, ordered = TRUE)
ordinal_str
> [1] a b c b c a c b
> Levels: a < b < c
In fact, when you don’t supply levels, R sorts the unique values of the input in alphanumeric order (the exact ordering depends on your locale), whether or not you set ordered = TRUE. If you have the following alphanumeric vector ("a1", "1a", "1b", "b1"), what do you think will be the generated ordered factor? Let’s check the answer:
# alphanumeric vector
alphanum <- c("a1", "1a", "1b", "b1")
# ordinal factor from character vector
ordinal_alphanum <- factor(alphanum, ordered = TRUE)
ordinal_alphanum
> [1] a1 1a 1b b1
> Levels: 1a < 1b < a1 < b1
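It is worth checking that the sorting of levels is not something ordered = TRUE switches on: a nominal factor built from the same vector gets exactly the same levels. A small sketch (the object names nominal and ordinal are just for illustration):

```r
alphanum <- c("a1", "1a", "1b", "b1")

nominal <- factor(alphanum)
ordinal <- factor(alphanum, ordered = TRUE)

# both factors end up with the same sorted levels
levels(nominal)
levels(ordinal)
identical(levels(nominal), levels(ordinal))

# the default levels are the sorted unique values of the input
identical(levels(nominal), as.character(sort(unique(alphanum))))
```

The only difference between the two objects is that the ordinal one treats that sequence of levels as a ranking.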
An alternative way to specify an ordinal variable is by using the function ordered(), which is just a convenient wrapper for factor(x, ..., ordered = TRUE):
# ordinal factor with ordered()
ordered(num_vector)
> [1] 1 2 3 1 2 3 2
> Levels: 1 < 2 < 3
# same as using 'ordered' argument
factor(num_vector, ordered = TRUE)
> [1] 1 2 3 1 2 3 2
> Levels: 1 < 2 < 3
A word of caution. Don’t confuse the function ordered() with order(). They are not equivalent. order() returns the positions (indices) that would arrange a vector into ascending or descending order; to get the sorted values themselves you use sort(). ordered(), as we’ve seen, is used to get ordinal factors.
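The difference between these similarly named functions is easiest to see on a tiny vector. A minimal sketch with made-up values:

```r
x <- c(30, 10, 20)   # hypothetical values

# order() returns the positions that would sort x
order(x)

# indexing with those positions, or calling sort(), gives the sorted values
x[order(x)]
sort(x)

# ordered() returns an ordinal factor; the elements keep their positions
ordered(x)
```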
Of course, you won’t always be using the default order provided by the functions factor(..., ordered = TRUE) or ordered(). Sometimes you want to determine categories according to a different order.
For example, let’s take the values of str_vector and let’s assume that we want them in descending order, that is, c < b < a. How can you do that? Easy, you just need to specify the levels in the order you want them and set ordered = TRUE (or use ordered()):
# setting levels with specified order
factor(str_vector, levels = c("c", "b", "a"), ordered = TRUE)
> [1] a b c b c a c b
> Levels: c < b < a
# equivalently
ordered(str_vector, levels = c("c", "b", "a"))
> [1] a b c b c a c b
> Levels: c < b < a
Here’s another example. Consider a set of size values: "xs" extra-small, "sm" small, "md" medium, "lg" large, and "xl" extra-large. If you have a vector with size values you can create an ordinal variable as follows:
# vector of sizes
sizes <- c("sm", "xs", "xl", "lg", "xs", "lg")
# setting levels with specified order
ordered(sizes, levels = c("xs", "sm", "md", "lg", "xl"))
> [1] sm xs xl lg xs lg
> Levels: xs < sm < md < lg < xl
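Once the scale is defined, the ordinal factor can be summarized and compared against any category of the scale, including "md", which never occurs in the data. A sketch, assuming we store the result in a new object called ord_sizes (a name introduced here only for illustration):

```r
sizes <- c("sm", "xs", "xl", "lg", "xs", "lg")
size_levels <- c("xs", "sm", "md", "lg", "xl")
ord_sizes <- ordered(sizes, levels = size_levels)

# "md" is absent from the data but is still a valid category
table(ord_sizes)

# the integer codes give the rank of each element on the scale
as.integer(ord_sizes)

# which elements are at least "lg"?
ord_sizes >= "lg"
```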
Notice that when you create an ordinal factor, the given levels will always be considered in an increasing order. This means that the first value of levels will be the smallest one, then the second one, and so on. The last category, in turn, is taken as the one at the top of the scale.
Now that we have several nominal and ordinal factors, we can compare the behavior of is.ordered() on two factors:
# is.ordered() on an ordinal factor
ordinal_str
> [1] a b c b c a c b
> Levels: a < b < c
is.ordered(ordinal_str)
> [1] TRUE
# is.ordered() on a nominal factor
second_factor
> [1] a b c b c a c b
> Levels: a b c
is.ordered(second_factor)
> [1] FALSE
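Behind is.ordered() there is simply an extra class: an ordinal factor has class "ordered" in addition to "factor". The sketch below also uses the base R function as.ordered(), which this chapter has not covered, to turn an existing nominal factor into an ordinal one while keeping its current levels:

```r
str_vector <- c('a', 'b', 'c', 'b', 'c', 'a', 'c', 'b')
second_factor <- factor(str_vector)
ordinal_str <- factor(str_vector, ordered = TRUE)

# an ordinal factor carries two classes
class(second_factor)
class(ordinal_str)

# as.ordered() converts a nominal factor, keeping its levels as they are
converted <- as.ordered(second_factor)
is.ordered(converted)
levels(converted)
```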