I’m one of those with the humble opinion that great software for data science and analytics should have a data structure dedicated to handling categorical data. Luckily, one of the nicest features of R is that it provides a data object exclusively designed for categorical data: **factors**.

The term “factor”, as used in R for handling categorical variables, comes from the terminology used in *Analysis of Variance*, commonly referred to as ANOVA. In this statistical method, a categorical variable is commonly referred to as, surprise-surprise, a *factor*, and its categories are known as *levels*. Perhaps this is not the best terminology, but it is the one R uses, which reflects its distinctive statistical origins. Especially for users without a background in statistics, this is one of R’s idiosyncrasies that can seem disconcerting at the beginning. But as long as you keep in mind that a factor is just the object that allows you to handle a qualitative variable, you’ll be fine. In case you need it, here’s a short mantra to remember:

factors have levels

To create a factor in R you use the function of the same name, `factor()`, which takes a vector as input. The vector can be numeric, character, or logical. Let’s see our first example:

```
# numeric vector
num_vector <- c(1, 2, 3, 1, 2, 3, 2)

# creating a factor from num_vector
first_factor <- factor(num_vector)
first_factor
```

```
[1] 1 2 3 1 2 3 2
Levels: 1 2 3
```

As you can tell from the previous code snippet, `factor()` converts the numeric vector `num_vector` into a factor (i.e. a categorical variable) with 3 categories, the so-called `levels`.

You can also obtain a factor from a string vector:

```
# string vector
str_vector <- c('a', 'b', 'c', 'b', 'c', 'a', 'c', 'b')
str_vector
```

`[1] "a" "b" "c" "b" "c" "a" "c" "b"`

```
# creating a factor from str_vector
second_factor <- factor(str_vector)
second_factor
```

```
[1] a b c b c a c b
Levels: a b c
```

Notice how `str_vector` and `second_factor` are displayed. Even though the elements are the same in both the vector and the factor, they are printed in different formats. The letters in the string vector are displayed with quotes, while the letters in the factor are printed without quotes.

And of course, you can use a logical vector to generate a factor as well:

```
# logical vector
log_vector <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

# creating a factor from log_vector
third_factor <- factor(log_vector)
third_factor
```

```
[1] TRUE FALSE TRUE TRUE FALSE
Levels: FALSE TRUE
```

Technically speaking, R factors are referred to as *compound objects*. According to the “R Language Definition” manual:

“Factors are currently implemented using an integer array to specify the actual levels and a second array of names that are mapped to the integers.”

What does this mean?

Essentially, a factor is internally stored using two ingredients: one is an integer vector containing the values of categories, the other is a vector with the “levels” which has the names of categories which are mapped to the integers.

Under the hood, R stores factors as vectors of integer values. One way to confirm this is to use the function `storage.mode()`:

```
# storage of factor
storage.mode(first_factor)
```

`[1] "integer"`
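To make the two ingredients concrete, here is a small sketch that pulls apart the integer codes and the level names of a factor and then puts them back together. The object names `f`, `codes`, and `labs` are made up for this illustration.

```r
# a small factor built from a character vector
f <- factor(c("b", "a", "c", "a"))

# ingredient 1: the integer codes
codes <- as.integer(f)
codes

# ingredient 2: the level names that the codes point to
labs <- levels(f)
labs

# indexing the level names by the codes recovers the original values
labs[codes]
```

Each integer code is simply the position of the corresponding value within the vector of levels.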

This means that we can manipulate factors much as we manipulate vectors. In addition, many functions for vectors can be applied to factors. For instance, we can use the function `length()` to get the number of elements in a factor:

```
# factors have length
length(first_factor)
```

`[1] 7`

We can also use the square brackets `[ ]` to extract or select elements of a factor. Inside the brackets we specify vectors of indices such as numeric vectors, logical vectors, and sometimes even character vectors.

```
# first element
first_factor[1]

# third element
first_factor[3]

# second to fourth elements
first_factor[2:4]

# last element
first_factor[length(first_factor)]

# logical subsetting
first_factor[rep(c(TRUE, FALSE), length.out = 7)]
```

If you have a factor with named elements, you can also specify the names of the elements within the brackets:

```
names(first_factor) <- letters[1:length(first_factor)]
first_factor
```

```
a b c d e f g
1 2 3 1 2 3 2
Levels: 1 2 3
```

`first_factor[c('b', 'd', 'f')]`

```
b d f
2 1 3
Levels: 1 2 3
```
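One detail worth knowing when subsetting: the result keeps the full set of levels, even those that no longer appear among the selected elements. The sketch below, which uses a fresh factor `f` (a name chosen only for illustration), shows this together with the `drop` argument of the bracket operator for factors, which discards unused levels.

```r
f <- factor(c(1, 2, 3, 1, 2, 3, 2))

# subsetting keeps all three levels, even though only "1" and "2" remain
kept <- f[1:2]
levels(kept)

# drop = TRUE removes the levels that are no longer used
dropped <- f[1:2, drop = TRUE]
levels(dropped)
```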

However, you should know that factors are NOT really vectors. To see this you can check the behavior of the functions `is.factor()` and `is.vector()` on a factor:

```
# factors are not vectors
is.vector(first_factor)
```

`[1] FALSE`

```
# factors are factors
is.factor(first_factor)
```

`[1] TRUE`

Even a single element of a factor is also a factor:

`class(first_factor[1])`

`[1] "factor"`

Well, it turns out that factors have an additional attribute that vectors don’t: `levels`. And as you can expect, the class of a factor is indeed `"factor"` (not `"vector"`).

```
# attributes of a factor
attributes(first_factor)
```

```
$levels
[1] "1" "2" "3"
$class
[1] "factor"
$names
[1] "a" "b" "c" "d" "e" "f" "g"
```

Another feature that makes factors so special is that their values (the levels) are mapped to a set of character values for display purposes. This might seem like a minor feature but it has two important consequences. On the one hand, this implies that factors provide a way to store character values very efficiently. Why? Because each unique character value is stored only once, and the data itself is stored as a vector of integers.

Notice how the numeric value `1` was mapped into the character value `"1"`. The same happens for the other values, `2` and `3`, which are mapped into the characters `"2"` and `"3"`.
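If you only need the levels, you don't have to go through the full list of attributes. As a quick sketch (with `f` as an illustrative name), the accessor functions `levels()` and `nlevels()` give you the level names and their count directly:

```r
f <- factor(c(1, 2, 3, 1, 2, 3, 2))

# the levels are stored as character strings
levels(f)

# number of levels
nlevels(f)

# the same information through the attribute itself
attr(f, "levels")
```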

Every time I teach about factors, there is inevitably one student who asks a very pertinent question: Why do we want to use factors? Isn’t it redundant to have a factor object when there are already character or integer vectors?

I have two answers to this question.

The first has to do with the storage of factors. Storing a factor as integers will usually be more efficient than storing a character vector. As we’ve seen, this is an important issue especially when the data—to be encoded into a factor—is of considerable size.
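Here is a rough way to see the storage difference for yourself. This is only a sketch: it assumes a typical 64-bit build of R, and the exact byte counts depend on your platform and R version, so no output is shown. The names `chr` and `fac` are made up for the example.

```r
# a long character vector with only three distinct values
set.seed(1)
chr <- sample(c("small", "medium", "large"), size = 1e5, replace = TRUE)

# the same data encoded as a factor
fac <- factor(chr)

# compare the approximate memory used by each object
object.size(chr)
object.size(fac)
```

On such a build the factor should come out noticeably smaller, since it holds one integer per element plus a short vector of levels.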

The second reason has to do with categorical variables of *ordinal* nature. Qualitative data can be classified into nominal and ordinal variables. Nominal variables could be easily handled with character vectors. In fact, *nominal* means name (values are just names or labels), and there’s no natural order among the categories.

A different story is when we have ordinal variables, like sizes `"small"`, `"medium"`, `"large"` or college years `"freshman"`, `"sophomore"`, `"junior"`, `"senior"`. In these cases we are still using names of categories, but they can be arranged in increasing or decreasing order. In other words, we can rank the categories since they have a natural order: small is less than medium which is less than large. Likewise, freshman comes first, then sophomore, followed by junior, and finally senior.

So here’s an important question: How do we keep the order of categories in an ordinal variable? We can use a character vector to store the values. But a character vector does not allow us to store the ranking of categories. The solution in R comes via factors. We can use factors to define ordinal variables, like the following example:

```
sizes <- factor(
  x = c('sm', 'md', 'lg', 'sm', 'md'),
  levels = c('sm', 'md', 'lg'),
  ordered = TRUE)
sizes
```

```
[1] sm md lg sm md
Levels: sm < md < lg
```

As you can tell, `sizes` has ordered levels, clearly identifying the first category `"sm"`, the second one `"md"`, and the third one `"lg"`.
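To see why the ranking matters, compare what happens when you sort the raw character values with what happens when you sort the ordered factor. This is a minimal sketch; `values` is a helper name introduced just for the comparison.

```r
values <- c('sm', 'md', 'lg', 'sm', 'md')

# a character vector sorts alphabetically, so "lg" comes first
sort(values)

# an ordered factor sorts by the order of its levels, so "sm" comes first
sizes <- factor(values, levels = c('sm', 'md', 'lg'), ordered = TRUE)
sort(sizes)
```

The character vector has no way of knowing that small comes before medium; the factor does, because the ranking lives in its levels.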

`factor()`

Since working with categorical data in R typically involves working with factors, you should become familiar with the variety of functions related with them. In the following sections we’ll cover a bunch of details about factors so you can be better prepared to deal with any type of categorical data.

`factor()`

Given the fundamental role played by the function `factor()`, we need to take a closer look at its arguments. If you check the documentation (see `help(factor)`), you’ll see that the usage of the function `factor()` is:

```
factor(x = character(), levels, labels = levels,
exclude = NA, ordered = is.ordered(x), nmax = NA)
```

with the following arguments:

- `x`: a vector of data
- `levels`: an optional vector for the categories
- `labels`: an optional character vector of labels for the levels
- `exclude`: a vector of values to be excluded when forming the set of levels
- `ordered`: logical value to indicate if the levels should be regarded as ordered
- `nmax`: an upper bound on the number of levels

The main argument of `factor()` is the input vector `x`. The next argument is `levels`, followed by `labels`, both of which are optional arguments. Although you won’t always be providing values for `levels` and `labels`, it is important to understand how R handles these arguments by default.

`levels`

If `levels` is not provided (which is what happens in most cases), then R assigns the sorted unique values in `x` as the category levels.

For example, consider our numeric vector from the first example: `num_vector` contains unique values 1, 2, and 3.

```
# numeric vector
num_vector <- c(1, 2, 3, 1, 2, 3, 2)

# creating a factor from num_vector
first_factor <- factor(num_vector)
first_factor
```

```
[1] 1 2 3 1 2 3 2
Levels: 1 2 3
```
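In this example the values happen to appear in increasing order already, which hides one detail: the default levels are the unique values in sorted order, not in order of appearance. A small sketch with a shuffled vector (`x` is an illustrative name) makes this visible:

```r
x <- c(3, 1, 2, 3, 1)

# levels are the sorted unique values, not the order of appearance
levels(factor(x))

# roughly what factor() does by default to build the levels
as.character(sort(unique(x)))
```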

Now imagine we want to have `levels` 1, 2, 3, 4, and 5. This is how you can define the factor with an extended set of levels:

```
# numeric vector
num_vector
```

`[1] 1 2 3 1 2 3 2`

```
# defining levels
one_factor <- factor(num_vector, levels = 1:5)
one_factor
```

```
[1] 1 2 3 1 2 3 2
Levels: 1 2 3 4 5
```

Although the created factor only has values between 1 and 3, the `levels` range from 1 to 5. This can be useful if we plan to add elements whose values are not in the input vector `num_vector`. For instance, you can append two more elements to `one_factor` with values `4` and `5` like this:

```
# adding values 4 and 5
one_factor[c(8, 9)] <- c(4, 5)
one_factor
```

```
[1] 1 2 3 1 2 3 2 4 5
Levels: 1 2 3 4 5
```

If you attempt to insert an element having a value that is not in the predefined set of levels, R will insert a missing value (`<NA>`) instead, and you’ll get a warning message like the one below:

```
# attempting to add value 6 (not in levels)
one_factor[1] <- 6
```

```
Warning in `[<-.factor`(`*tmp*`, 1, value = 6): invalid factor level, NA
generated
```

`one_factor`

```
[1] <NA> 2 3 1 2 3 2 4 5
Levels: 1 2 3 4 5
```
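Levels that are declared but never observed still count as categories. As a quick sketch (using a fresh factor `f`, a name chosen for illustration), `table()` reports them with zero counts, and `droplevels()` removes them if you decide you don't want them:

```r
f <- factor(c(1, 2, 3, 1, 2, 3, 2), levels = 1:5)

# table() reports every level, including those with zero counts
table(f)

# droplevels() removes the levels that never occur
levels(droplevels(f))
```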

`labels`

Another very useful argument is `labels`, which allows you to provide a string vector for naming the `levels` in a different way from the values in `x`. Let’s take the numeric vector `num_vector` again, and say we want to use words as labels instead of numeric values. Here’s how you can create a factor with predefined `labels`:

```
# defining labels
num_word_vector <- factor(num_vector, labels = c("one", "two", "three"))
num_word_vector
```

```
[1] one two three one two three two
Levels: one two three
```
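Keep in mind that labels are matched to levels by position. The following sketch combines both arguments to show the pairing; the names `codes` and `f` and the labels `"high"`, `"mid"`, `"low"` are made up for the example.

```r
codes <- c(1, 2, 3, 1, 2, 3, 2)

# labels pair with levels by position: 3 -> "high", 2 -> "mid", 1 -> "low"
f <- factor(codes, levels = c(3, 2, 1), labels = c("high", "mid", "low"))
f
levels(f)
```

When `levels` is omitted, the labels are paired with the default (sorted) levels, which is why `"one"`, `"two"`, `"three"` lined up with 1, 2, 3 above.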

`exclude`

If you want to ignore some values of the input vector `x`, you can use the `exclude` argument. You just need to provide those values which will be removed from the set of `levels`.

```
# excluding level 3
factor(num_vector, exclude = 3)
```

```
[1] 1 2 <NA> 1 2 <NA> 2
Levels: 1 2
```

```
# excluding levels 1 and 3
factor(num_vector, exclude = c(1,3))
```

```
[1] <NA> 2 <NA> <NA> 2 <NA> 2
Levels: 2
```

The side effect of `exclude` is that it returns a missing value (`<NA>`) for each element that was excluded, which is not always what we want. Here’s one way to remove the missing values when excluding 3:

```
# excluding level 3
num_fac12 <- factor(num_vector, exclude = 3)

# oops, we have some missing values
num_fac12
```

```
[1] 1 2 <NA> 1 2 <NA> 2
Levels: 1 2
```

```
# removing missing values
num_fac12[!is.na(num_fac12)]
```

```
[1] 1 2 1 2 2
Levels: 1 2
```
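Another option, if you know in advance which values you don't want, is to filter the input vector before calling `factor()`. This is just an alternative sketch, and `num_fac12_alt` is a hypothetical name:

```r
num_vector <- c(1, 2, 3, 1, 2, 3, 2)

# drop the unwanted values first, then build the factor
num_fac12_alt <- factor(num_vector[num_vector != 3])
num_fac12_alt
```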

We’ve mentioned that factors are stored as vectors of integers (for efficiency reasons). But we also said that factors are more than vectors. Even though a factor is displayed with string labels, it is stored internally as integers. Why is this important to know? Because there will be occasions in which you’ll need to know exactly which integer is associated with each level.

Imagine you have a factor with `levels` 11, 22, 33, 44.

```
# factor
xfactor <- factor(c(22, 11, 44, 33, 11, 22, 44))
xfactor
```

```
[1] 22 11 44 33 11 22 44
Levels: 11 22 33 44
```

To obtain the integer vector associated with `xfactor` you can use the function `unclass()`:

```
# unclassing a factor
unclass(xfactor)
```

```
[1] 2 1 4 3 1 2 4
attr(,"levels")
[1] "11" "22" "33" "44"
```

As you can see, the levels `"11"`, `"22"`, `"33"`, `"44"` were mapped to the vector of integers `(1 2 3 4)`.

An alternative option is to simply apply `as.numeric()` or `as.integer()` instead of using `unclass()`:

```
# equivalent to unclass
as.integer(xfactor)
```

`[1] 2 1 4 3 1 2 4`

```
# equivalent to unclass
as.numeric(xfactor)
```

`[1] 2 1 4 3 1 2 4`

Although rarely needed, there are cases in which you want to convert the integer codes back into the original numeric values. This is only possible when the levels of the factor are themselves numeric. To accomplish this, use the following command:

```
# recovering numeric levels
as.numeric(levels(xfactor))[xfactor]
```

`[1] 22 11 44 33 11 22 44`
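This matters because of a common pitfall: calling `as.numeric()` directly on a factor gives you the integer codes, not the numbers you see printed. The sketch below contrasts the two, using the more readable (if slightly longer) route through `as.character()`:

```r
xfactor <- factor(c(22, 11, 44, 33, 11, 22, 44))

# as.numeric() on the factor returns the integer codes, not the values
as.numeric(xfactor)

# going through character first recovers the original numbers
as.numeric(as.character(xfactor))
```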

By default, `factor()` creates a *nominal* categorical variable, not an ordinal one. One way to check that you have a nominal factor is to use the function `is.ordered()`, which returns `TRUE` if its argument is an ordinal factor.

```
# ordinal factor?
is.ordered(first_factor)
```

`[1] FALSE`

If you want to specify an ordinal factor you must use the `ordered` argument of `factor()`. This is how you can generate an ordinal factor from `num_vector`:

```
# ordinal factor from numeric vector
ordinal_num <- factor(num_vector, ordered = TRUE)
ordinal_num
```

```
[1] 1 2 3 1 2 3 2
Levels: 1 < 2 < 3
```

As you can tell from the snippet above, the levels of `ordinal_num` are displayed with less-than symbols `<`, which means that the levels have an increasing order. We can also get an ordinal factor from our string vector:

```
# ordinal factor from character vector
ordinal_str <- factor(str_vector, ordered = TRUE)
ordinal_str
```

```
[1] a b c b c a c b
Levels: a < b < c
```

In fact, when you set `ordered = TRUE` without supplying `levels`, R sorts the unique values in alphanumeric order to form the levels. If you have the following alphanumeric vector `("a1", "1a", "1b", "b1")`, what do you think will be the generated ordered factor? Let’s check the answer:

```
# alphanumeric vector
alphanum <- c("a1", "1a", "1b", "b1")

# ordinal factor from character vector
ordinal_alphanum <- factor(alphanum, ordered = TRUE)
ordinal_alphanum
```

```
[1] a1 1a 1b b1
Levels: 1a < 1b < a1 < b1
```

An alternative way to specify an ordinal variable is by using the function `ordered()`, which is just a convenient wrapper for `factor(x, ..., ordered = TRUE)`:

```
# ordinal factor with ordered()
ordered(num_vector)
```

```
[1] 1 2 3 1 2 3 2
Levels: 1 < 2 < 3
```

```
# same as using 'ordered' argument
factor(num_vector, ordered = TRUE)
```

```
[1] 1 2 3 1 2 3 2
Levels: 1 < 2 < 3
```

A word of caution. Don’t confuse the function `ordered()` with `order()`. They are not equivalent. `order()` returns the permutation of indices that arranges a vector into ascending or descending order (it is `sort()` that returns the sorted vector). `ordered()`, as we’ve seen, is used to get ordinal factors.
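A short side-by-side sketch may help keep these similarly named functions apart (the vector `x` is just an illustration):

```r
x <- c(30, 10, 20)

# order() returns the positions that would sort x
order(x)

# using those positions as indices gives the sorted vector, same as sort()
x[order(x)]
sort(x)

# ordered() returns an ordinal factor
ordered(x)
```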

Of course, you won’t always be using the default order provided by the functions `factor(..., ordered = TRUE)` or `ordered()`. Sometimes you want to determine categories according to a different order.

For example, let’s take the values of `str_vector` and let’s assume that we want them in descending order, that is, `c < b < a`. How can you do that? Easy, you just need to specify the `levels` in the order you want them and set `ordered = TRUE` (or use `ordered()`):

```
# setting levels with specified order
factor(str_vector, levels = c("c", "b", "a"), ordered = TRUE)
```

```
[1] a b c b c a c b
Levels: c < b < a
```

```
# equivalently
ordered(str_vector, levels = c("c", "b", "a"))
```

```
[1] a b c b c a c b
Levels: c < b < a
```

Here’s another example. Consider a set of size values: `"xs"` extra-small, `"sm"` small, `"md"` medium, `"lg"` large, and `"xl"` extra-large. If you have a vector with size values you can create an ordinal variable as follows:

```
# vector of sizes
sizes <- c("sm", "xs", "xl", "lg", "xs", "lg")
# setting levels with specified order
ordered(sizes, levels = c("xs", "sm", "md", "lg", "xl"))
```

```
[1] sm xs xl lg xs lg
Levels: xs < sm < md < lg < xl
```

Notice that when you create an ordinal factor, the given `levels` will always be considered in increasing order. This means that the first value of `levels` will be the smallest one, then the second one, and so on. The last category, in turn, is taken as the one at the top of the scale.
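Because the levels are ranked, an ordinal factor supports comparisons that would be meaningless for a nominal one. Here is a minimal sketch, assuming the same size values as above; `ord_sizes` is a name introduced for the example.

```r
sizes <- c("sm", "xs", "xl", "lg", "xs", "lg")
ord_sizes <- ordered(sizes, levels = c("xs", "sm", "md", "lg", "xl"))

# which elements are at least medium?
ord_sizes >= "md"

# smallest and largest observed categories
min(ord_sizes)
max(ord_sizes)
```

Note that `"md"` can be used in the comparison even though it never occurs in the data, since it is one of the declared levels.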

Now that we have several nominal and ordinal factors, we can compare the behavior of `is.ordered()` on two factors:

```
# is.ordered() on an ordinal factor
ordinal_str
```

```
[1] a b c b c a c b
Levels: a < b < c
```

`is.ordered(ordinal_str)`

`[1] TRUE`

```
# is.ordered() on a nominal factor
second_factor
```

```
[1] a b c b c a c b
Levels: a b c
```

`is.ordered(second_factor)`

`[1] FALSE`
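Another way to tell the two kinds of factors apart is to look at their class. In this small sketch (`nominal` and `ordinal` are illustrative names), the ordinal factor carries an extra `"ordered"` class on top of `"factor"`:

```r
nominal <- factor(c("a", "b", "c"))
ordinal <- factor(c("a", "b", "c"), ordered = TRUE)

# a nominal factor has a single class
class(nominal)

# an ordinal factor carries an extra "ordered" class
class(ordinal)

# both are still factors
is.factor(nominal)
is.factor(ordinal)
```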