# 5 Factors

I’m one of those with the humble opinion that great software for data science
and analytics should have a data structure dedicated to handling categorical data.
Luckily, one of the nicest features of R is that it provides a data
object exclusively designed to handle categorical data: **factors**.

The term “factor”, as used in R for handling categorical variables, comes from
the terminology used in *Analysis of Variance*, commonly referred to as ANOVA.
In this statistical method, a categorical variable is commonly referred to as,
surprise-surprise, a *factor*, and its categories are known as *levels*. Perhaps
this is not the best terminology but it is the one R uses, which reflects its
distinctly statistical origins. Especially for users without a background
in statistics, this is one of R’s idiosyncrasies that can seem disconcerting at
the beginning. But as long as you keep in mind that a factor is just the object
that allows you to handle a qualitative variable, you’ll be fine. In case you
need it, here’s a short mantra to remember:

factors have levels

## 5.1 Creating Factors

To create a factor in R you use the function of the same name, `factor()`,
which takes a vector as input. The vector can be numeric, character, or
logical. Let’s see our first example:

```
# numeric vector
num_vector <- c(1, 2, 3, 1, 2, 3, 2)

# creating a factor from num_vector
first_factor <- factor(num_vector)

first_factor
> [1] 1 2 3 1 2 3 2
> Levels: 1 2 3
```

As you can tell from the previous code snippet, `factor()` converts the numeric
vector `num_vector` into a factor (i.e. a categorical variable) with 3
categories, the so-called `levels`.
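To inspect the categories of a factor directly, one option is the pair `levels()` and `nlevels()`. The sketch below rebuilds the same data under hypothetical names (`toy_vector`, `toy_factor`) so that it runs on its own:

```r
# hypothetical toy data with the same values as the first example
toy_vector <- c(1, 2, 3, 1, 2, 3, 2)
toy_factor <- factor(toy_vector)

# the categories, returned as a character vector
levels(toy_factor)

# the number of categories
nlevels(toy_factor)
```

Note that `levels()` returns character strings even when the input vector was numeric.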

You can also obtain a factor from a string vector:

```
# string vector
str_vector <- c('a', 'b', 'c', 'b', 'c', 'a', 'c', 'b')

str_vector
> [1] "a" "b" "c" "b" "c" "a" "c" "b"

# creating a factor from str_vector
second_factor <- factor(str_vector)

second_factor
> [1] a b c b c a c b
> Levels: a b c
```

Notice how `str_vector` and `second_factor` are displayed. Even though the
elements are the same in both the vector and the factor, they are printed in
different formats. The letters in the string vector are displayed with quotes,
while the letters in the factor are printed without quotes.

And of course, you can use a logical vector to generate a factor as well:

```
# logical vector
log_vector <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

# creating a factor from log_vector
third_factor <- factor(log_vector)

third_factor
> [1] TRUE FALSE TRUE TRUE FALSE
> Levels: FALSE TRUE
```

## 5.2 How R treats factors

Technically speaking, R factors are referred to as *compound objects*. According
to the “R Language Definition” manual:

“Factors are currently implemented using an integer array to specify the actual levels and a second array of names that are mapped to the integers.”

What does this mean?

Essentially, a factor is internally stored using two ingredients: one is an integer vector containing the values of categories, the other is a vector with the “levels” which has the names of categories which are mapped to the integers.

Under the hood, the way R stores factors is as vectors of integer values.
One way to confirm this is using the function `storage.mode()`:

```
# storage of factor
storage.mode(first_factor)
> [1] "integer"
```
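The two ingredients can be pulled apart and put back together. The following is a small sketch with a made-up factor (`colors_fac`, `codes`, `lookup`, and `rebuilt` are hypothetical names): `as.integer()` exposes the integer codes, `levels()` returns the lookup table, and indexing the levels by the codes rebuilds the displayed values.

```r
# hypothetical factor used only for illustration
colors_fac <- factor(c("red", "blue", "red", "green"))

# ingredient 1: the integer codes
codes <- as.integer(colors_fac)

# ingredient 2: the levels, i.e. the names mapped to the integers
lookup <- levels(colors_fac)

# indexing the levels by the codes recovers the displayed values
rebuilt <- lookup[codes]
identical(rebuilt, as.character(colors_fac))
```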

This means that we can manipulate factors in much the same way we manipulate
vectors. In addition, many functions for vectors can be applied to factors.
For instance, we can use the function `length()` to get the number of elements
in a factor:

```
# factors have length
length(first_factor)
> [1] 7
```

We can also use the square brackets `[ ]` to extract or select elements of a
factor. Inside the brackets we specify vectors of indices such as numeric
vectors, logical vectors, and sometimes even character vectors.

```
# first element
first_factor[1]

# third element
first_factor[3]

# second to fourth elements
first_factor[2:4]

# last element
first_factor[length(first_factor)]

# logical subsetting
first_factor[rep(c(TRUE, FALSE), length.out = 7)]
```
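One detail worth knowing when subsetting: by default the result keeps the full set of levels, even those that no longer appear among the selected elements. A minimal sketch, using a made-up factor `f` (a hypothetical name) so the snippet is self-contained:

```r
# hypothetical factor with three levels
f <- factor(c(1, 2, 3, 1, 2, 3, 2))

# plain subsetting keeps all the original levels
kept <- f[1:2]
levels(kept)

# drop = TRUE discards the levels that are no longer present
dropped <- f[1:2, drop = TRUE]
levels(dropped)
```

Whether you want the unused levels around depends on the task: they are handy for consistent tables across subsets, and a nuisance when they clutter a summary.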

If you have a factor with named elements, you can also specify the names of the elements within the brackets:

```
names(first_factor) <- letters[1:length(first_factor)]

first_factor
> a b c d e f g
> 1 2 3 1 2 3 2
> Levels: 1 2 3

first_factor[c('b', 'd', 'f')]
> b d f
> 2 1 3
> Levels: 1 2 3
```

However, you should know that factors are NOT really vectors. To see this you
can check the behavior of the functions `is.factor()` and `is.vector()` on a
factor:

```
# factors are not vectors
is.vector(first_factor)
> [1] FALSE
# factors are factors
is.factor(first_factor)
> [1] TRUE
```

Even a single element of a factor is also a factor:

```
class(first_factor[1])
> [1] "factor"
```

#### So what makes a factor different from a vector?

Well, it turns out that factors have an additional attribute that vectors don’t:
`levels`. And as you can expect, the class of a factor is indeed `"factor"`
(not `"vector"`).

```
# attributes of a factor
attributes(first_factor)
> $levels
> [1] "1" "2" "3"
>
> $class
> [1] "factor"
>
> $names
> [1] "a" "b" "c" "d" "e" "f" "g"
```

Another feature that makes factors special is that their integer codes are mapped to a set of character values (the levels) for display purposes. This might seem like a minor feature but it has an important consequence: factors provide a way to store repeated character values compactly. Why? Because each unique character value is stored only once, in the levels, and the data itself is stored as a vector of integers.

Notice how the numeric value `1` was mapped into the character value `"1"`. And
the same happens for the other values `2` and `3` that are mapped into the
characters `"2"` and `"3"`.

#### What is the advantage of R factors?

Every time I teach about factors, there is inevitably one student who asks a very pertinent question: Why do we want to use factors? Isn’t it redundant to have a factor object when there are already character or integer vectors?

I have two answers to this question.

The first has to do with the storage of factors. Storing a factor as integers will usually be more efficient than storing a character vector. As we’ve seen, this is an important issue especially when the data—to be encoded into a factor—is of considerable size.
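A rough way to see the storage difference is to compare `object.size()` for a long character vector with few distinct values against its factor version. This is only a sketch: the vector `answers` is made up for illustration, and the exact byte counts depend on your R build and platform, so none are shown here.

```r
# hypothetical survey answers: many elements, only three distinct values
set.seed(1)
answers <- sample(c("agree", "neutral", "disagree"), size = 1e5, replace = TRUE)
answers_fac <- factor(answers)

# memory used by each representation (in bytes)
size_chr <- object.size(answers)
size_fac <- object.size(answers_fac)

# ratio of factor size to character size
as.numeric(size_fac) / as.numeric(size_chr)
```

On a typical 64-bit installation the ratio should come out well below 1, since each element of the factor is a 4-byte integer. The gain shrinks when most values are distinct, because every distinct string still has to be kept in the levels.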

The second reason has to do with categorical variables of *ordinal* nature.
Qualitative data can be classified into nominal and ordinal variables. Nominal
variables could be easily handled with character vectors. In fact, *nominal*
means name (values are just names or labels), and there’s no natural order
among the categories.

A different story is when we have ordinal variables, like sizes `"small"`,
`"medium"`, `"large"` or college years `"freshman"`, `"sophomore"`, `"junior"`,
`"senior"`. In these cases we are still using names of categories, but they
can be arranged in increasing or decreasing order. In other words, we can rank
the categories since they have a natural order: small is less than medium which
is less than large. Likewise, freshman comes first, then sophomore, followed by
junior, and finally senior.

So here’s an important question: How do we keep the order of categories in an ordinal variable? We can use a character vector to store the values. But a character vector does not allow us to store the ranking of categories. The solution in R comes via factors. We can use factors to define ordinal variables, like the following example:

```
sizes <- factor(
  x = c('sm', 'md', 'lg', 'sm', 'md'),
  levels = c('sm', 'md', 'lg'),
  ordered = TRUE)

sizes
> [1] sm md lg sm md
> Levels: sm < md < lg
```

As you can tell, `sizes` has ordered levels, clearly identifying the first
category `"sm"`, the second one `"md"`, and the third one `"lg"`.
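The practical payoff of storing the ranking is that comparisons respect it. The sketch below uses a made-up ordered factor (`shirt_sizes` is a hypothetical name, chosen so the snippet does not depend on earlier code) and contrasts it with a plain character vector, where `<` compares strings alphabetically:

```r
# hypothetical ordered factor of sizes
shirt_sizes <- factor(
  x = c('sm', 'md', 'lg', 'sm', 'md'),
  levels = c('sm', 'md', 'lg'),
  ordered = TRUE)

# which elements rank below large?
smaller_than_lg <- shirt_sizes < 'lg'
smaller_than_lg

# the highest-ranked size present in the data
largest <- max(shirt_sizes)
largest

# with plain strings the comparison is alphabetical, not by size
chr_compare <- c('sm', 'md') < 'lg'
chr_compare
```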

## 5.3 A closer look at `factor()`

Since working with categorical data in R typically involves working with factors, you should become familiar with the variety of functions related to them. In the following sections we’ll cover a number of details about factors so you can be better prepared to deal with any type of categorical data.

### 5.3.1 Function `factor()`

Given the fundamental role played by the function `factor()` we need to take a
closer look at its arguments. If you check the documentation (see
`help(factor)`) you’ll see that the usage of the function `factor()` is:

```
factor(x = character(), levels, labels = levels,
       exclude = NA, ordered = is.ordered(x), nmax = NA)
```

with the following arguments:

- `x`: a vector of data
- `levels`: an optional vector for the categories
- `labels`: an optional character vector of labels for the levels
- `exclude`: a vector of values to be excluded when forming the set of levels
- `ordered`: logical value to indicate if the levels should be regarded as ordered
- `nmax`: an upper bound on the number of levels

The main argument of `factor()` is the input vector `x`. The next argument is
`levels`, followed by `labels`, both of which are optional arguments. Although
you won’t always be providing values for `levels` and `labels`, it is important
to understand how R handles these arguments by default.

#### Argument `levels`

If `levels` is not provided (which is what happens in most cases), then R
assigns the sorted unique values in `x` as the category levels.

For example, consider our numeric vector from the first example: `num_vector`
contains unique values 1, 2, and 3.

```
# numeric vector
num_vector <- c(1, 2, 3, 1, 2, 3, 2)

# creating a factor from num_vector
first_factor <- factor(num_vector)

first_factor
> [1] 1 2 3 1 2 3 2
> Levels: 1 2 3
```
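The default can be reproduced by hand. As a small sketch with made-up data (`scores` is a hypothetical vector, deliberately unsorted), the levels that `factor()` picks match the sorted unique values converted to character:

```r
# hypothetical unsorted data
scores <- c(30, 10, 20, 30, 10)

# levels chosen by factor() when 'levels' is not supplied
default_levels <- levels(factor(scores))
default_levels

# the same thing computed manually
manual_levels <- as.character(sort(unique(scores)))
identical(default_levels, manual_levels)
```

So the order of the levels follows the sorted values, not the order in which values first appear in the input.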

Now imagine we want to have `levels` 1, 2, 3, 4, and 5. This is how you can
define the factor with an extended set of levels:

```
# numeric vector
num_vector
> [1] 1 2 3 1 2 3 2

# defining levels
one_factor <- factor(num_vector, levels = 1:5)

one_factor
> [1] 1 2 3 1 2 3 2
> Levels: 1 2 3 4 5
```

Although the created factor only has values between 1 and 3, the `levels` range
from 1 to 5. This can be useful if we plan to add elements whose values are not
in the input vector `num_vector`. For instance, you can append two more elements
to `one_factor` with values `4` and `5` like this:

```
# adding values 4 and 5
one_factor[c(8, 9)] <- c(4, 5)

one_factor
> [1] 1 2 3 1 2 3 2 4 5
> Levels: 1 2 3 4 5
```
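Declared levels also matter when no element uses them yet. In the sketch below (with a hypothetical factor named `ratings`, built before any 4 or 5 is added), `table()` reports the unused levels with a count of zero, and `droplevels()` removes them if they are not wanted:

```r
# hypothetical factor with levels 4 and 5 declared but unused
ratings <- factor(c(1, 2, 3, 1, 2, 3, 2), levels = 1:5)

# frequency table: unused levels appear with zero counts
counts <- table(ratings)
counts

# droplevels() keeps only the levels that actually occur
trimmed <- droplevels(ratings)
levels(trimmed)
```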

If you attempt to insert an element having a value that is not in the
predefined set of levels, R will insert a missing value (`<NA>`) instead, and
you’ll get a warning message like the one below:

```
# attempting to add value 6 (not in levels)
one_factor[1] <- 6
> Warning in `[<-.factor`(`*tmp*`, 1, value = 6): invalid factor level, NA
> generated

one_factor
> [1] <NA> 2 3 1 2 3 2 4 5
> Levels: 1 2 3 4 5
```
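If the new value is legitimate, one way around the warning is to extend the set of levels before assigning. A minimal sketch, using a small hypothetical factor `f` rather than `one_factor`:

```r
# hypothetical factor with levels 1 to 5
f <- factor(c(1, 2, 3), levels = 1:5)

# extend the set of levels first
levels(f) <- c(levels(f), "6")

# now the assignment succeeds and no NA is generated
f[1] <- 6
f
```

Appending to `levels(f)` adds the new category at the end; the existing elements keep their values.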

#### Argument `labels`

Another very useful argument is `labels`, which allows you to provide a string
vector for naming the `levels` in a different way from the values in `x`. Let’s
take the numeric vector `num_vector` again, and say we want to use words as
labels instead of numeric values. Here’s how you can create a factor with
predefined `labels`:

```
# defining labels
num_word_vector <- factor(num_vector, labels = c("one", "two", "three"))

num_word_vector
> [1] one two three one two three two
> Levels: one two three
```
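Labels are matched to levels by position, so it can be safer to spell out `levels` and `labels` together instead of relying on the default sorted order. The following sketch uses made-up survey codes (`codes` and `answers` are hypothetical names):

```r
# hypothetical survey codes: 1 = low, 2 = mid, 3 = high
codes <- c(3, 1, 2, 3)

# pair each level with its label explicitly, position by position
answers <- factor(codes,
                  levels = c(1, 2, 3),
                  labels = c("low", "mid", "high"))
answers
```

After this call the levels of the factor are the labels themselves (`"low"`, `"mid"`, `"high"`); the original codes are no longer stored as level names.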

#### Argument `exclude`

If you want to ignore some values of the input vector `x`, you can use the
`exclude` argument. You just need to provide those values which will be removed
from the set of `levels`.

```
# excluding level 3
factor(num_vector, exclude = 3)
> [1] 1 2 <NA> 1 2 <NA> 2
> Levels: 1 2
# excluding levels 1 and 3
factor(num_vector, exclude = c(1,3))
> [1] <NA> 2 <NA> <NA> 2 <NA> 2
> Levels: 2
```

The side effect of `exclude` is that it returns a missing value (`<NA>`) for
each element that was excluded, which is not always what we want. Here’s one
way to remove the missing values when excluding 3:

```
# excluding level 3
num_fac12 <- factor(num_vector, exclude = 3)

# oops, we have some missing values
num_fac12
> [1] 1 2 <NA> 1 2 <NA> 2
> Levels: 1 2

# removing missing values
num_fac12[!is.na(num_fac12)]
> [1] 1 2 1 2 2
> Levels: 1 2
```
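The default `exclude = NA` is also the reason missing values in `x` never become a level. If you do want `NA` treated as a category of its own, one option is `exclude = NULL`. A short sketch with a made-up vector (`with_missing` is a hypothetical name):

```r
# hypothetical data containing a missing value
with_missing <- c(1, 2, NA, 1)

# default behavior: NA is not part of the levels
default_fac <- factor(with_missing)
nlevels(default_fac)

# exclude = NULL: NA becomes a level of its own
na_level_fac <- factor(with_missing, exclude = NULL)
nlevels(na_level_fac)
```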

### 5.3.2 Unclassing factors

We’ve mentioned that factors are stored as vectors of integers (for efficiency reasons). But we also said that factors are more than vectors. Even though a factor is displayed with string labels, it is stored internally as integers. Why is this important to know? Because there will be occasions in which you’ll need to know exactly which integer is associated with each level.

Imagine you have a factor with `levels` 11, 22, 33, 44.

```
# factor
xfactor <- factor(c(22, 11, 44, 33, 11, 22, 44))

xfactor
> [1] 22 11 44 33 11 22 44
> Levels: 11 22 33 44
```

To obtain the integer vector associated to `xfactor` you can use the function
`unclass()`:

```
# unclassing a factor
unclass(xfactor)
> [1] 2 1 4 3 1 2 4
> attr(,"levels")
> [1] "11" "22" "33" "44"
```

As you can see, the levels `"11"`, `"22"`, `"33"`, `"44"` were mapped to the
integers 1, 2, 3, and 4, and each element of the factor is stored as the
integer of its level.

An alternative option is to simply apply `as.numeric()` or `as.integer()`
instead of using `unclass()`:

```
# same integer codes as unclass(), without the levels attribute
as.integer(xfactor)
> [1] 2 1 4 3 1 2 4
# same codes again, returned as doubles
as.numeric(xfactor)
> [1] 2 1 4 3 1 2 4
```

Although rarely needed, there are cases in which you want to go from the integer codes back to the original numeric values. This is only possible when the levels of the factor represent numbers. To accomplish this use the following command:

```
# recovering numeric levels
as.numeric(levels(xfactor))[xfactor]
> [1] 22 11 44 33 11 22 44
```
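This matters because calling `as.numeric()` directly on such a factor returns the integer codes, not the numbers you see printed. The sketch below rebuilds the same factor under a hypothetical name (`xf`) and compares the direct conversion with two ways of recovering the values:

```r
# hypothetical copy of the factor with numeric-looking levels
xf <- factor(c(22, 11, 44, 33, 11, 22, 44))

# direct conversion gives the integer codes, NOT the original values
codes <- as.numeric(xf)
codes

# recovering the values by indexing the converted levels
via_levels <- as.numeric(levels(xf))[xf]

# recovering the values by going through character first
via_character <- as.numeric(as.character(xf))

identical(via_levels, via_character)
```

Both recovery routes give the same result; indexing the levels converts each distinct level only once, which can be preferable for long factors.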

## 5.4 Ordinal Factors

By default, `factor()` creates a *nominal* categorical variable, not an ordinal
one. One way to check that you have a nominal factor is to use the function
`is.ordered()`, which returns `TRUE` if its argument is an ordinal factor.

```
# ordinal factor?
is.ordered(first_factor)
> [1] FALSE
```

If you want to specify an ordinal factor you must use the `ordered` argument of
`factor()`. This is how you can generate an ordinal factor from `num_vector`:

```
# ordinal factor from numeric vector
ordinal_num <- factor(num_vector, ordered = TRUE)

ordinal_num
> [1] 1 2 3 1 2 3 2
> Levels: 1 < 2 < 3
```

As you can tell from the snippet above, the levels of `ordinal_num` are
displayed with less-than symbols `<`, which means that the levels have an
increasing order. We can also get an ordinal factor from our string vector:

```
# ordinal factor from character vector
ordinal_str <- factor(str_vector, ordered = TRUE)

ordinal_str
> [1] a b c b c a c b
> Levels: a < b < c
```

In fact, when you don’t supply `levels`, R sorts the unique values of the input
(for strings, in the alphanumeric order of your locale), and `ordered = TRUE`
takes that sorted sequence as the ranking. If you have the following
alphanumeric vector `("a1", "1a", "1b", "b1")`, what do you think will be the
generated ordered factor? Let’s check the answer:

```
# alphanumeric vector
alphanum <- c("a1", "1a", "1b", "b1")

# ordinal factor from character vector
ordinal_alphanum <- factor(alphanum, ordered = TRUE)

ordinal_alphanum
> [1] a1 1a 1b b1
> Levels: 1a < 1b < a1 < b1
```

An alternative way to specify an ordinal variable is by using the function
`ordered()`, which is just a convenient wrapper for
`factor(x, ..., ordered = TRUE)`:

```
# ordinal factor with ordered()
ordered(num_vector)
> [1] 1 2 3 1 2 3 2
> Levels: 1 < 2 < 3
# same as using 'ordered' argument
factor(num_vector, ordered = TRUE)
> [1] 1 2 3 1 2 3 2
> Levels: 1 < 2 < 3
```

A word of caution. Don’t confuse the function `ordered()` with `order()`. They
are not equivalent. `order()` returns the permutation of indices that would
arrange a vector in ascending or descending order (it is `sort()` that returns
the sorted vector). `ordered()`, as we’ve seen, is used to get ordinal factors.
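To make the distinction concrete, here is a small sketch with a made-up numeric vector `x` (a hypothetical name), showing what `order()` returns next to `sort()`:

```r
# hypothetical numeric vector
x <- c(30, 10, 20)

# order() returns the positions that would sort x
idx <- order(x)
idx

# sort() returns the sorted values themselves
sorted <- sort(x)
sorted

# indexing x by the result of order() gives the same sorted values
identical(x[idx], sorted)
```

Neither of these functions creates a factor; only `ordered()` (or `factor()` with `ordered = TRUE`) does that.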

Of course, you won’t always be using the default order provided by the
functions `factor(..., ordered = TRUE)` or `ordered()`. Sometimes you want to
determine categories according to a different order.

For example, let’s take the values of `str_vector` and let’s assume that we
want them in descending order, that is, `c < b < a`. How can you do that? Easy,
you just need to specify the `levels` in the order you want them and set
`ordered = TRUE` (or use `ordered()`):

```
# setting levels with specified order
factor(str_vector, levels = c("c", "b", "a"), ordered = TRUE)
> [1] a b c b c a c b
> Levels: c < b < a
# equivalently
ordered(str_vector, levels = c("c", "b", "a"))
> [1] a b c b c a c b
> Levels: c < b < a
```

Here’s another example. Consider a set of size values: `"xs"` extra-small,
`"sm"` small, `"md"` medium, `"lg"` large, and `"xl"` extra-large. If you have a
vector with size values you can create an ordinal variable as follows:

```
# vector of sizes
sizes <- c("sm", "xs", "xl", "lg", "xs", "lg")

# setting levels with specified order
ordered(sizes, levels = c("xs", "sm", "md", "lg", "xl"))
> [1] sm xs xl lg xs lg
> Levels: xs < sm < md < lg < xl
```

Notice that when you create an ordinal factor, the given `levels` will always
be considered in an increasing order. This means that the first value of `levels`
will be the smallest one, then the second one, and so on. The last category,
in turn, is taken as the one at the top of the scale.

Now that we have several nominal and ordinal factors, we can compare the
behavior of `is.ordered()` on two of them:

```
# is.ordered() on an ordinal factor
ordinal_str
> [1] a b c b c a c b
> Levels: a < b < c

is.ordered(ordinal_str)
> [1] TRUE

# is.ordered() on a nominal factor
second_factor
> [1] a b c b c a c b
> Levels: a b c

is.ordered(second_factor)
> [1] FALSE
```
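Another way to tell the two kinds apart is to look at their class. In the sketch below (`nominal` and `ordinal` are hypothetical names for two small factors), the ordinal factor carries the extra class `"ordered"` in front of `"factor"`:

```r
# hypothetical nominal and ordinal factors built from the same values
nominal <- factor(c("a", "b", "c"))
ordinal <- factor(c("a", "b", "c"), ordered = TRUE)

# a nominal factor has a single class
class(nominal)

# an ordinal factor has two classes
class(ordinal)

# so an ordinal factor is still a factor
is.factor(ordinal)
```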