13 Vectors

In this chapter, you will learn about vectors, the building blocks for storing and handling data in R. You will also learn about factors. Virtually all other data structures in R are based on or derived from vectors, so learning how to manipulate data structures in R starts with learning how to manipulate vectors.

13.1 Motivation

As our main working example, we are going to consider the 2016-2017 starting lineup for the basketball team Golden State Warriors (GSW):

Player     Position  Salary      Points  PPG   Rookie
Thompson   SG        16,663,575  1742    22.3  FALSE
Curry      PG        12,112,359  1999    25.3  FALSE
Green      PF        15,330,435   776    10.2  FALSE
Durant     SF        26,540,100  1555    25.1  FALSE
Pachulia   C          2,898,000   426     6.1  FALSE

From the statistical point of view, we can say that there are six variables measured on five individuals. One concern, from the data scientist's standpoint, has to do with the kind of each variable: Which variables would you characterize as quantitative, and which as qualitative?

From the programming point of view, you also need to consider the data type to be used for each variable: character, boolean, integer, real?

Figure 13.1: Abstract view of data in analyst’s mind and objects in R

There are several ways in which the GSW data can be implemented in R, and we will discuss them in the following chapters. For now, let’s start with vectors.

13.2 What is an R vector?

A vector is the most basic type of data structure in R. To give you an abstract visual representation of a vector, think of it as contiguous cells containing data (see diagram below). A vector can be of any length (including zero).

Figure 13.2: Abstract view of vectors

Creating vectors with c()

Among the main functions to work with vectors we have the combine function c(). This is the workhorse function to create vectors in R. Here's how to create a vector player with the players' last names:

player <- c('Thompson', 'Curry', 'Green', 'Durant', 'Pachulia')

player
#> [1] "Thompson" "Curry"    "Green"    "Durant"   "Pachulia"

Basically, you call c() and you type in the values, separating them by commas.

The simplest vectors are those containing a single element. For example, the following objects player1, points1 and rookie1 are all vectors with just one element:

player1 <- 'Thompson'
points1 <- 1742
rookie1 <- FALSE

In most other languages, a number like 5 or a boolean like TRUE is usually considered to be a "scalar". Likewise, most programming languages provide four main (data) types of scalars, namely integer, double, character, and boolean. R is a bit different. R does not have the concept of "scalar"; instead, the simplest data structure is the vector.
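To see that single values really are vectors, you can ask R for their length. The snippet below is a small sketch that reuses the three objects defined above and relies only on the base functions length(), is.vector() and identical():

```r
# single values are vectors of length one
player1 <- 'Thompson'
points1 <- 1742
rookie1 <- FALSE

length(player1)      # 1
length(points1)      # 1
is.vector(rookie1)   # TRUE

# wrapping a single value in c() gives exactly the same object
identical(points1, c(1742))   # TRUE
```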

What about the concept of data types in R? Like any programming language, R does have data types such as integer, double, character, and boolean. The way these are handled in R is through vectors. In other words, R has different flavors of vectors, depending on the data type that we use:

# integer
x <- 1L

# double (real)
y <- 5

# complex
z <- 3 + 5i

# logical (boolean)
a <- TRUE

# character
b <- "yosemite"

Notice the format to specify integers, e.g. 1L. This is not a typo. To indicate that a number (with no decimals) is an integer, you should append an upper case letter L at the end. Simply typing a number with no decimals, 30, doesn’t make it into an integer; you need to type 30L for it to be a data type integer.

In summary, the list below shows the 4+1 different data types in R, implemented in vectors (again, recall that R does not have scalars):

  • A double vector stores regular (i.e. real) numbers
  • An integer vector stores integers (no decimal component)
  • A character vector stores text
  • A logical vector stores TRUE and FALSE values
  • A complex vector stores complex numbers

On a technical note, we should mention that there's an extra type of R vector: "raw". This is a native type in R for storing raw bytes. We won't use it in this book, nor will we use the "complex" type.

There are some special values with reserved names:

  • NULL is the null object (it has length zero)
  • Missing values are referred to by the symbol NA (there are different types of NA: logical, integer, double, character)
  • Inf indicates positive infinity
  • -Inf indicates negative infinity
  • NaN indicates Not a Number (don’t confuse NaN with NA)
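The following sketch shows how these special values typically arise and how to test for them. The object names (pos_inf, neg_inf, not_num) are made up for this illustration, and typeof() is the type-inspection function discussed later in this chapter:

```r
# NULL has length zero
length(NULL)           # 0

# dividing by zero produces infinities; 0/0 is not a number
pos_inf <- 1 / 0       # Inf
neg_inf <- -1 / 0      # -Inf
not_num <- 0 / 0       # NaN

# NA comes in typed flavors
typeof(NA)             # "logical"
typeof(NA_integer_)    # "integer"
typeof(NA_real_)       # "double"
typeof(NA_character_)  # "character"

# NaN is treated as missing, but NA is not NaN
is.na(not_num)         # TRUE
is.nan(NA)             # FALSE
```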

Going back to our working example, here's how to keep using c() to create vectors for the other variables position, salary, ppg, and rookie:

position <- c('SG', 'PG', 'PF', 'SF', 'C')

salary <- c(16663575, 12112359, 15330435, 26540100, 2898000)

ppg <- c(22.3, 25.3, 10.2, 25.1, 6.1)

rookie <- c(FALSE, FALSE, FALSE, FALSE, FALSE)

13.3 Vectors are Atomic structures

The first thing you should learn about R vectors is that they are atomic structures, which is just the fancy name to indicate that all the elements of a vector must be of the same data type, either all integers, all reals (or doubles), all characters, or all logical values.

How do you know that a given vector is of a certain data type? For better or worse, there is more than one function that allows you to answer this question:

  • typeof()
  • mode()

Although it is not the function most commonly used within the R community, we recommend typeof() to determine the data type of a vector. The reason is that typeof() returns the data types listed previously, which are what most other languages use:

typeof(player)
typeof(salary)
typeof(ppg)
typeof(rookie)

You should know that among the R community, most useRs don’t really talk about types. Instead, because of historical reasons related to the S language—on which R is based—you will often hear useRs talking about modes as given by the mode() function:

mode(player)
mode(salary)
mode(ppg)
mode(rookie)

mode() gives the mode of an object, and it actually relies on the output of typeof(). When applied to vectors, the main difference between mode() and typeof() is that mode() groups together types "double" and "integer" into a single mode called "numeric".
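The difference is easiest to see side by side. In this small sketch, x_int and x_dbl are names invented for the illustration:

```r
x_int <- 1L   # integer
x_dbl <- 1    # double

# typeof() distinguishes integer from double
typeof(x_int)   # "integer"
typeof(x_dbl)   # "double"

# mode() reports both as "numeric"
mode(x_int)     # "numeric"
mode(x_dbl)     # "numeric"

# for character and logical vectors, both functions agree
typeof("a"); mode("a")     # "character" "character"
typeof(TRUE); mode(TRUE)   # "logical" "logical"
```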

What happens if we try to create a vector mixing different data types? Say we take all the values of the first player and put them in a vector:

mixed <- c('Thompson', 'SG', 16663575, 1742, 22.3, FALSE)
mixed
#> [1] "Thompson" "SG"       "16663575" "1742"     "22.3"     "FALSE"

13.4 Coercion

The way R makes sure that a vector is of a single data type is by using what is called coercion rules.

There are two kinds of coercion:

  • implicit coercion
  • explicit coercion

Implicit coercion is what R does when we type a command like this:

mixed <- c('Thompson', 'SG', 16663575, 1742, 22.3, FALSE)
mixed
#> [1] "Thompson" "SG"       "16663575" "1742"     "22.3"     "FALSE"

We are mixing different data types, but R has decided to convert everything into type "character". Technically speaking, R has implicitly coerced the values to character, without asking us and without even letting us know that it did so.

If you are not familiar with implicit coercion rules, you may get an initial impression that R is acting weirdly, in a nonsensical way. As you become more familiar with R, you will notice some patterns. But you don't need to struggle figuring out what R will do. You just have to remember the following hierarchy:

\[ \mathsf{character > double > integer > logical} \]

Here’s how R works in terms of coercion:

  • characters have priority over other data types: as long as one element is a character, all other elements are coerced into characters

  • if a vector has numbers (double and integer) and logicals, double will dominate

  • finally, when mixing integers and logicals, integers will dominate
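The three rules above can be checked with a few short calls to c(). This is a minimal sketch; the object names are invented for the example:

```r
# integers and logicals: integer wins (TRUE -> 1L, FALSE -> 0L)
int_lgl <- c(1L, TRUE, FALSE)
int_lgl            # 1 1 0
typeof(int_lgl)    # "integer"

# doubles, integers and logicals: double wins
dbl_mix <- c(2.5, 1L, TRUE)
dbl_mix            # 2.5 1.0 1.0
typeof(dbl_mix)    # "double"

# anything mixed with a character: character wins
chr_mix <- c("a", 2.5, TRUE)
chr_mix            # "a" "2.5" "TRUE"
typeof(chr_mix)    # "character"

# the same logical-to-number coercion happens in arithmetic
sum(c(TRUE, FALSE, TRUE))   # 2
```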

The other type of coercion, known as explicit coercion, is done when you explicitly tell R to convert a certain type of vector into a different data type by using explicit coercion functions such as as.integer(), as.double() (or its synonym as.numeric()), as.character(), and as.logical(). Depending on the type of input vector, and the coercion function, you may achieve what you want, or R will fail to convert things accordingly.

We can take salary, which is of type double, and convert it into integers with no issues:

as.integer(salary)
#> [1] 16663575 12112359 15330435 26540100  2898000

However, trying to convert player into an integer type is useless, because names cannot be interpreted as numbers and every element becomes NA:

as.integer(player)
#> Warning: NAs introduced by coercion
#> [1] NA NA NA NA NA
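A few more explicit coercions, as a rough sketch of what to expect from the base as.*() functions (the input values are arbitrary):

```r
# doubles are truncated toward zero, not rounded
as.integer(22.9)            # 22
as.integer(-22.9)           # -22

# strings that look like numbers can be converted
as.numeric("3.14")          # 3.14

# numbers to text
as.character(c(1L, 2L))     # "1" "2"

# numbers to logical: zero is FALSE, anything else is TRUE
as.logical(c(0, 1, 2))      # FALSE TRUE TRUE

# only certain strings are understood as logical values
as.logical("TRUE")          # TRUE
as.logical("yes")           # NA
```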

13.5 Manipulating Vectors: Subsetting

In addition to creating vectors, you should also learn how to do some basic manipulation of vectors. The most common type of manipulation is called subsetting, also known as indexing or subscripting, which refers to extracting elements of a vector (or another R object). To do so, you use what is known as bracket notation. This implies using (square) brackets [ ] to get access to the elements of a vector.

To subset a vector, you type the name of the vector, followed by an opening and a closing bracket. Inside the brackets you specify one or more numeric values that correspond to the position(s) of the vector element(s):

# first element
player[1]
#> [1] "Thompson"

# first three elements
player[1:3]
#> [1] "Thompson" "Curry"    "Green"

What type of things can you specify inside the brackets? Basically:

  • numeric vectors
  • logical vectors (the length of the logical vector should match the length of the vector to be subset; a shorter one is recycled)
  • character vectors (if the elements have names)

In addition to the brackets [], some common functions that you can use on vectors are:

  • length() gives the number of values
  • sort() sorts the values in increasing or decreasing ways
  • rev() reverses the values
  • unique() extracts unique elements

length(player)
salary[length(player)]
sort(player, decreasing = TRUE)
rev(salary)
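The calls above do not exercise unique(). Here is a minimal sketch that uses the position values of the eight players that appear later in this chapter; position_all is a name made up for this illustration:

```r
# positions of eight players, some of them repeated
position_all <- c('SG', 'PG', 'PF', 'SF', 'C', 'SF', 'SG', 'C')

# unique() keeps the first occurrence of each value
unique(position_all)           # "SG" "PG" "PF" "SF" "C"

# number of distinct positions
length(unique(position_all))   # 5

# sort() in both directions
sort(c(3, 1, 2))                      # 1 2 3
sort(c(3, 1, 2), decreasing = TRUE)   # 3 2 1
```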

13.5.1 Subsetting with Numeric Indices

Here are some subsetting examples using a numeric vector inside the brackets:

# fourth element of 'player'
player[4]

# numeric range
player[2:4]

# numeric vector
player[c(1, 3)]

# different order
player[c(3, 1, 2)]

# third element (four times)
player[rep(3, 4)]
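Numeric indices do not have to be positive and in range. The sketch below covers three cases that the examples above do not show but that appear in the exercises at the end of the chapter: negative indices, the index zero, and an index beyond the length of the vector.

```r
player <- c('Thompson', 'Curry', 'Green', 'Durant', 'Pachulia')

# negative indices exclude elements
player[-1]          # "Curry" "Green" "Durant" "Pachulia"
player[-c(1, 2)]    # "Green" "Durant" "Pachulia"

# index zero returns an empty vector of the same type
player[0]           # character(0)

# an index past the end returns NA
player[6]           # NA
```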

13.5.2 Subsetting with Logical Indices

Logical subsetting involves using a logical vector inside the brackets. This type of subsetting is very powerful because it allows you to extract elements based on some logical condition.

To do logical subsetting, the vector that you put inside the brackets should match the length of the manipulated vector.

Here are some examples of logical subsetting:

# dummy vector
a <- c(5, 6, 7, 8)

# logical subsetting
a[c(TRUE, FALSE, TRUE, FALSE)]
#> [1] 5 7

Notice that the elements that are extracted are those in the positions where the logical vector is TRUE. If you pass a logical vector that is shorter than the manipulated vector, R will apply its recycling rules.

# your turn
player[c(TRUE, TRUE, TRUE, TRUE, TRUE)]
player[c(TRUE, TRUE, TRUE, FALSE, FALSE)]
player[c(FALSE, FALSE, FALSE, TRUE, TRUE)]
player[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
player[c(FALSE, FALSE, FALSE, FALSE, FALSE)]

When subsetting a vector logically, most of the time you won't really be providing an explicit vector of TRUEs and FALSEs. Just imagine having a vector of 100 or 1000 or 1000000 elements, and trying to do logical subsetting by manually creating a logical vector of the same length. That would be very tedious. Instead, you will be providing a logical condition or a comparison operation that returns a logical vector.

A comparison operation occurs when you use comparison operators such as:

  • > greater than
  • >= greater than or equal
  • < less than
  • <= less than or equal
  • == equal
  • != not equal

Notice that a comparison operation always returns a logical vector:

# example with '=='
player == 'Durant'

# example with '>'
ppg > 24

Here are some examples of logical subsetting based on comparisons:

# salary of Durant
salary[player == 'Durant']

# name of players with more than 24 points per game
player[ppg > 24]

In addition to using comparison operators, you can also use logical operators to produce a logical vector. The most common logical operators are:

  • & AND
  • | OR
  • ! negation

Run the following commands to see what R does:

# AND
TRUE & TRUE
TRUE & FALSE
FALSE & FALSE

# OR
TRUE | TRUE
TRUE | FALSE
FALSE | FALSE

# NOT
!TRUE
!FALSE

More examples with comparisons and logical operators:

# name of players with salary between 10 and 20 million (exclusive)
player[salary > 10000000 & salary < 20000000]

# name of players with salary between 10 and 20 million (inclusive)
player[salary >= 10000000 & salary <= 20000000]
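It can help to store the logical vector first and inspect it before using it as an index. The following sketch repeats the exclusive-range example with the original five players; in_range is a name introduced only for this illustration:

```r
player <- c('Thompson', 'Curry', 'Green', 'Durant', 'Pachulia')
salary <- c(16663575, 12112359, 15330435, 26540100, 2898000)

# the condition is evaluated element by element
in_range <- salary > 10000000 & salary < 20000000
in_range             # TRUE TRUE TRUE FALSE FALSE

# use it as an index
player[in_range]     # "Thompson" "Curry" "Green"

# negate it to get everyone else
player[!in_range]    # "Durant" "Pachulia"

# TRUE counts as 1, so sum() counts the matches
sum(in_range)        # 3
```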

13.5.3 Subsetting with Character Vectors

A third type of subsetting involves passing a character vector inside brackets. When you do this, the characters are supposed to be names of the manipulated vector.

None of the vectors player, salary, and ppg have names. You can confirm that with the names() function applied to any of the vectors:

names(salary)
#> NULL

Create a new vector millions by converting salary into millions, and then assign player as the names of millions:

# create 'millions', rounded to 2 decimals
millions <- round(salary / 1000000, 2)

# assign 'player' as names of 'millions'
names(millions) <- player

You should have a vector millions with named elements. Now you can use character subsetting:

millions["Durant"]
#> Durant 
#>  26.54 

millions[c("Green", "Curry", "Pachulia")]
#>    Green    Curry Pachulia 
#>    15.33    12.11     2.90
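One thing to watch for with character subsetting is asking for a name that does not exist. R does not raise an error; it returns NA. The sketch below builds a small named vector directly with c() (the two values are the rounded salaries of Thompson and Curry) to show this:

```r
# a named vector can also be created directly with c()
millions_small <- c(Thompson = 16.66, Curry = 12.11)
names(millions_small)                   # "Thompson" "Curry"

# a name that is not present gives NA rather than an error
missing_player <- millions_small["Durant"]
is.na(missing_player)                   # TRUE

# check for a name before subsetting
"Durant" %in% names(millions_small)     # FALSE

# unname() drops the names and keeps the values
unname(millions_small["Curry"])         # 12.11
```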

13.5.4 Adding more elements

Related to subsetting, you can consider adding more elements to a given vector. For example, say you want to include data for three more players: Iguodala, McCaw, and Jones:

Player     Position  Salary      Points  PPG   Rookie
Iguodala   SF        11,131,368   574     7.6  FALSE
McCaw      SG           543,471   282     4.0  TRUE
Jones      C          1,171,560    19     1.9  TRUE

You can use bracket notation to add more elements:

player[6] <- 'Iguodala'
player[7] <- 'McCaw'
player[8] <- 'Jones'

Another option is to use c() to combine a vector with more values like this:

position <- c(position, 'SF', 'SG', 'C')
rookie <- c(rookie, FALSE, TRUE, TRUE)

Of course, you can combine both options:

salary[6] <- 11131368
salary <- c(salary, 543471, 1171560)
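When you add elements with bracket notation, the position does not need to be the next free one. If you skip positions, R fills the gap with NA. A small sketch with a throwaway vector v (not part of the GSW data):

```r
v <- c(10, 20)

# assign to position 5, skipping positions 3 and 4
v[5] <- 50
v            # 10 20 NA NA 50
length(v)    # 5

# c() always appends at the end
v <- c(v, 60)
length(v)    # 6
```

Because of this, after extending several related vectors (such as player, position and salary) it may be worth checking with length() that they all ended up with the same number of elements.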

13.6 Vectorization

Say you want to create a vector log_salary by taking the logarithm of salaries:

log_salary <- log(salary)

When you create the vector log_salary, what you’re doing is applying a function to a vector, which in turn acts on all elements of the vector.

This is called Vectorization in R parlance. Most functions that operate with vectors in R are vectorized functions. This means that an action is applied to all elements of the vector without the need to explicitly type commands to traverse all the elements.

In many other programming languages, you would have to use a set of commands to loop over each element of a vector (or list of numbers) to transform them. But not in R.
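To make the contrast concrete, here is a sketch of the same computation written twice: once with an explicit loop, the way it would be done in a language without vectorization, and once with a single vectorized call. The names sqrt_loop and sqrt_vec are invented for the example:

```r
ppg <- c(22.3, 25.3, 10.2, 25.1, 6.1)

# explicit loop: visit each element and store its square root
sqrt_loop <- numeric(length(ppg))
for (i in seq_along(ppg)) {
  sqrt_loop[i] <- sqrt(ppg[i])
}

# vectorized: one call acts on all elements
sqrt_vec <- sqrt(ppg)

# both approaches give the same result
all.equal(sqrt_loop, sqrt_vec)   # TRUE
```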

Another example of vectorization would be the calculation of the square root of all the points per game ppg:

sqrt(ppg)

Or the conversion of salary into millions:

salary / 1000000 

Why should you care about vectorization?

If you are new to programming, learning about R's vectorization will be very natural (you won't stop to think about it too much). If you have some previous programming experience in other languages (e.g. C, Python, Perl), you know that vectorization does not tend to be a native thing.

Vectorization is essential in R. It saves you from typing many lines of code, and you will exploit vectorization with other useful functions known as the apply family functions (we’ll talk about them later in the course).

13.7 Recycling

Closely related to the concept of vectorization we have the notion of Recycling. To explain recycling let's see an example.

salary is given in dollars, but what if you need to obtain the salaries in euros? Let's create a new vector salary_euros with the converted salaries in euros. To convert from dollars to euros we could use the following conversion: 1 dollar = 0.9 euro

salary_euros <- salary * 0.9

What you just did (assuming that you did things correctly) relies on Recycling. To understand this concept, you need to remember that R does not have a data structure for scalars (single numbers). Scalars are in reality vectors of length 1.

Converting dollars to euros requires this operation: salary * 0.9. Although it may not be obvious, we are multiplying two vectors: salary and 0.9. Moreover (and more importantly) we are multiplying two vectors of different lengths! So how does R know what to do in this case?

Well, R uses the recycling rule, which takes the shorter vector (in this case 0.9) and recycles its elements to form a temporary vector that matches the length of the longer vector (i.e. salary). Recycling also applies to logical subsetting when the logical vector inside the brackets is shorter than the vector being subset:

# logical subsetting with recycling
player[TRUE]
#> [1] "Thompson" "Curry"    "Green"    "Durant"   "Pachulia"
player[c(TRUE, FALSE)]
#> [1] "Thompson" "Green"    "Pachulia"

Another recycling example

Here's another example of recycling. Salaries in odd-numbered positions will be divided by two; salaries in even-numbered positions will be divided by 10:

units <- c(1/2, 1/10)
new_salary <- salary * units

The elements of units are recycled and repeated as many times as elements in salary. The previous command is equivalent to this:

new_units <- rep(c(1/2, 1/10), length.out = length(salary))
salary * new_units
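One detail worth knowing: when the length of the longer vector is not a multiple of the length of the shorter one (for instance, five salaries and two units), R still recycles but also issues a warning. The sketch below uses small made-up numbers and captures the warning message with tryCatch() so that it can be inspected:

```r
# lengths 4 and 2: recycling is silent
c(10, 20, 30, 40) * c(1/2, 1/10)    # 5 2 15 4

# lengths 5 and 2: recycling works, but R warns about the mismatch
warn_msg <- tryCatch(
  c(10, 20, 30, 40, 50) * c(1/2, 1/10),
  warning = function(w) conditionMessage(w)
)
warn_msg
# "longer object length is not a multiple of shorter object length"
```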

13.7.1 Sequences

It is very common to generate sequences of numbers. For that R provides:

  • the colon operator ":"
  • sequence function seq()

# colon operator
1:5
1:10
-3:7
10:1

# sequence function
seq(from = 1, to = 10)
seq(from = 1, to = 10, by = 1)
seq(from = 1, to = 10, by = 2)
seq(from = -5, to = 5, by = 1)
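Besides by, seq() accepts a length.out argument, and base R also offers the helpers seq_len() and seq_along(), which are convenient when writing loops. They are not discussed above, so take this as a brief sketch rather than a full tour:

```r
# step of 2
seq(from = 1, to = 10, by = 2)           # 1 3 5 7 9

# ask for a number of values instead of a step size
seq(from = 0, to = 1, length.out = 5)    # 0.00 0.25 0.50 0.75 1.00

# 1, 2, ..., n
seq_len(5)                               # 1 2 3 4 5

# one index per element of a vector
seq_along(c('a', 'b', 'c'))              # 1 2 3

# the colon operator produces integers
typeof(1:5)                              # "integer"
```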

13.7.2 Repeated Vectors

To create vectors with repeated values, R provides the function rep(). It takes a vector as the main input, and then it optionally takes various arguments: times, length.out, and each.

rep(1, times = 5)        # repeat 1 five times
#> [1] 1 1 1 1 1
rep(c(1, 2), times = 3)  # repeat 1 2 three times
#> [1] 1 2 1 2 1 2
rep(c(1, 2), each = 2)
#> [1] 1 1 2 2
rep(c(1, 2), length.out = 5)
#> [1] 1 2 1 2 1

Here is a more complex example that combines times and each:

rep(c(3, 2, 1), times = 3, each = 2)
#>  [1] 3 3 2 2 1 1 3 3 2 2 1 1 3 3 2 2 1 1
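The times argument can also be a vector with one count per element of the input, in which case each element is repeated its own number of times. A short sketch:

```r
# repeat "a" three times and "b" once
rep(c("a", "b"), times = c(3, 1))    # "a" "a" "a" "b"

# with each and times together, 'each' is applied first
rep(c(3, 2, 1), each = 2)            # 3 3 2 2 1 1
length(rep(c(3, 2, 1), times = 3, each = 2))   # 18
```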

13.8 Exercises

1) Consider the following two vectors: x and y.

x <- c(2, 4, 6, 8, 10)
y <- c("a", "e", "i", "o", "u")

What is the output of the following R commands? (BTW: they are all valid commands). Try to answer these parts without running the code in R.

a)  y[x/x]

b)  y[!(x > 5)]

c)  y[x < 10 & x != 2]

d)  y[x[-4][2]]

e)  y[as.logical(x)]

f)  y[6 - (x/2)]

2) Consider the following R code:

# peanut butter jelly sandwich
peanut <- TRUE
peanut[2] <- FALSE
yummy <- mean(peanut)
butter <- peanut + 1L
jelly <- tolower("JELLY")
sandwich <- c(peanut, butter, jelly)

What is the output of the following commands? Try to answer these parts without running the code in R.

  1. "jelly" != jelly

  2. peanut & butter

  3. typeof(yummy[peanut])

  4. sandwich[2]

  5. peanut[butter]

  6. peanut %in% peanut

  7. typeof(!yummy)

  8. length(list(peanut, butter, as.factor(jelly)))

3) Consider the following two vectors: x and y.

x <- c(1, 2, 3, 4, 5)
y <- c("a", "b", "c", "d", "e")

Match the following commands with their corresponding output. Try to answer these parts without running the code in R.

a)  y[x == 1]              ___  "a" "b" "c" "d" "e"

b)  y[x]                   ___  "e"

c)  y[x < 3]               ___  character(0)

d)  y[x/x]                 ___  "d"

e)  y[x[5]]                ___  "c" "d" "e"

f)  y['b']                 ___  NA

g)  y[0]                   ___  "a" "b"

h)  y[!(x < 3)]            ___  "c"

i)  y[x[-2][3]]            ___  "a"

j)  y[x[x[3]]]             ___  "a" "a" "a" "a" "a"

4) Which command will fail to return the first five elements of a vector x? (assume x has more than 5 elements).

  1. x[1:5]

  2. x[c(1,2,3,4,5)]

  3. head(x, n = 5)

  4. x[seq(1, 5)]

  5. x(1:5)

5) Explain the concept of atomic structures in R.

6) Explain the concept of vectorization a.k.a. vectorized operations.