21 R Data Frames
The most common format/structure for a data set is a tabular format, with rows and columns (like a spreadsheet). When your data is in this shape, most of the time you will work with R data frames (or similar rectangular structures like a "matrix", "table", "tibble", etc.).
Learning how to manipulate data frames is among the most important data computing skills in R. Because there are multiple ways in which you can manipulate a data frame, we want to expose you to both “traditional” and “modern” approaches for working with data frames and friends.
The “traditional” way of manipulating data frames in R is based on bracket notation, e.g. dat[ , ], to select specific rows, columns, or cells. Also, the use of the dollar $ operator to handle columns is fundamental, and the use of double brackets dat[[ ]] comes in handy from time to time.
For the “modern” approach to manipulating data frames, there is the plyr framework devised by Hadley Wickham. From his doctoral research, the first plyr tools were available in the packages "plyr" and "reshape". Nowadays, we have the "reshape2" package, and the extremely popular package "dplyr" (among other packages). We introduced "dplyr" in chapter EDA, and we will dive deeper in chapter dplyr pipelines.
To make the most of the content covered in the next sections, we assume that you are familiar with the data objects covered in the previous part of the book (i.e. IV Data Objects in R).
21.1 R Data Frames
A data frame is a special type of R list. In most cases, a data frame is internally stored as a list of vectors or factors, in which each vector (or factor) corresponds to a column. This implies that columns in a data frame are typically atomic structures: all elements in a given column are of the same data type. However, since a data frame is a list, you can technically have any kind of object as a column. In practice, though, having data frames with columns that are not vectors or factors is something that does not make much sense.
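A quick way to check these claims is sketched below. The use of the built-in mtcars data frame is our choice for illustration:

```r
# a data frame is stored internally as a list
is.list(mtcars)
#> [1] TRUE
typeof(mtcars)
#> [1] "list"

# the list has one element per column
length(mtcars) == ncol(mtcars)
#> [1] TRUE

# in mtcars, every column is an atomic vector
all(sapply(mtcars, is.atomic))
#> [1] TRUE
```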
From the data manipulation point of view, data frames behave like a hybrid object. On one hand, they are lists and can be manipulated like any other list using double brackets dat[[ ]] and dollar operator dat$name. On the other hand, because data frames are designed as tabular or 2-dimensional objects, they also behave like two-dimensional arrays or matrices, admitting bracket notation dat[ , ]. For these reasons, there is a wide array of functions that allows you to manipulate data frames in very convenient ways. But to the inexperienced user, all these functions may feel overwhelming.
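The hybrid behavior can be sketched with the built-in mtcars data frame (again, an illustrative choice). The helper names mpg_a and mpg_b are hypothetical:

```r
# list-style access: both forms return the column as a vector
mpg_a <- mtcars[["mpg"]]
mpg_b <- mtcars$mpg
identical(mpg_a, mpg_b)
#> [1] TRUE

# matrix-style access: the cell in row 1, column 1
mtcars[1, 1]
#> [1] 21

# matrix-style access to a whole column gives the same vector
identical(mtcars[ , "mpg"], mpg_a)
#> [1] TRUE
```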
21.2 Inspecting data frames
One of the basic tasks when working with data frames involves inspecting their contents. Especially in the early stages of data exploration, when dealing with a new data frame for the first time, you will need to inspect things like its overall structure, which includes its dimensions (number of rows and columns), the data types of its columns, and the names of columns and rows. You will also want to take a peek at some of its first or last rows, and usually obtain a summary of each column.
Let’s see an example with one of the built-in data frames in R: mtcars. Just a few rows and columns of mtcars are displayed below:
#> mpg cyl disp hp drat wt
#> Mazda RX4 21.0 6 160 110 3.90 2.62
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.88
#> Datsun 710 22.8 4 108 93 3.85 2.32
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.21
#> Hornet Sportabout 18.7 8 360 175 3.15 3.44
The main function to explore the structure of not just a data frame, but of any kind of object, is str(). When applied to data frames, str() returns a report of the dimensions of the data frame, a list with the names of all the variables, and their data types (e.g. chr for character, num for real, etc.).
str(mtcars, vec.len = 1)
#> 'data.frame': 32 obs. of 11 variables:
#> $ mpg : num 21 21 ...
#> $ cyl : num 6 6 ...
#> $ disp: num 160 160 ...
#> $ hp : num 110 110 ...
#> $ drat: num 3.9 3.9 ...
#> $ wt : num 2.62 ...
#> $ qsec: num 16.5 ...
#> $ vs : num 0 0 ...
#> $ am : num 1 1 ...
#> $ gear: num 4 4 ...
#> $ carb: num 4 4 ...
The argument vec.len = 1 is optional but we like to use it because it indicates that just the first elements in each column should be displayed.
Observe the output returned by str(). The first line tells us that mtcars is an object of class 'data.frame' with 32 observations (rows) and 11 variables (columns). Then, the set of 11 variables is listed below, each line starting with the dollar $ operator, followed by the name of the variable, followed by a colon :, the data mode (all numeric num variables in this case), and then a couple of values in each variable.
It is especially useful to check the data type of each column in order to catch potential issues and avoid disastrous consequences or bugs in subsequent stages.
Here’s a list of useful functions to inspect a data frame:
- str(): overall structure
- head(): first rows
- tail(): last rows
- summary(): descriptive statistics
- dim(): dimensions
- nrow(): number of rows
- ncol(): number of columns
- names(): names of list elements (i.e. column names)
- colnames(): column names
- rownames(): row names
- dimnames(): list with column and row names
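A few of these functions applied to mtcars, as a short sketch (head() is used here only to keep the printed output brief):

```r
# dimensions: number of rows, then number of columns
dim(mtcars)
#> [1] 32 11
nrow(mtcars)
#> [1] 32
ncol(mtcars)
#> [1] 11

# first three column names
head(names(mtcars), 3)
#> [1] "mpg"  "cyl"  "disp"

# first two row names
head(rownames(mtcars), 2)
#> [1] "Mazda RX4"     "Mazda RX4 Wag"

# first rows, last rows, and per-column summaries (output omitted)
first_rows <- head(mtcars, 3)
last_rows <- tail(mtcars, 3)
col_summary <- summary(mtcars)
```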
On a technical side, we should mention that a data frame is a list with special attributes: an attribute names for column names, an attribute row.names for row names, and of course its attribute class:
attributes(mtcars)
#> $names
#> [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
#> [11] "carb"
#>
#> $row.names
#> [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
#> [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
#> [7] "Duster 360" "Merc 240D" "Merc 230"
#> [10] "Merc 280" "Merc 280C" "Merc 450SE"
#> [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
#> [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
#> [19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
#> [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
#> [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
#> [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
#> [31] "Maserati Bora" "Volvo 142E"
#>
#> $class
#> [1] "data.frame"
21.3 Creating data frames
Most of the (raw) data tables you will be working with will already be in some data file. However, from time to time you will face the need to create some sort of data table in R. In these situations, you will likely have to create such a table as a data frame. So let’s look at various ways to “manually” create a data frame.
Option 1: The primary option to build a data frame is with data.frame(). You pass a series of vectors (or factors), of the same length, separated by commas. Each vector (or factor) will become a column in the generated data frame. Preferably, give names to each column like in the example below:
dat <- data.frame(
name = c('Anakin', 'Padme', 'Luke', 'Leia'),
gender = c('male', 'female', 'male', 'female'),
height = c(1.88, 1.65, 1.72, 1.50),
weight = c(84, 45, 77, 49)
)
dat
#> name gender height weight
#> 1 Anakin male 1.88 84
#> 2 Padme female 1.65 45
#> 3 Luke male 1.72 77
#> 4 Leia female 1.50 49
Option 2: Another way to create data frames is with a list containing vectors or factors (of the same length), which you then convert into a data frame with data.frame():
# another way to create a basic data frame
lst <- list(
name = c('Anakin', 'Padme', 'Luke', 'Leia'),
gender = c('male', 'female', 'male', 'female'),
height = c(1.88, 1.65, 1.72, 1.50),
weight = c(84, 45, 77, 49)
)
tbl <- data.frame(lst)
tbl
#> name gender height weight
#> 1 Anakin male 1.88 84
#> 2 Padme female 1.65 45
#> 3 Luke male 1.72 77
#> 4 Leia female 1.50 49
Remember that a data.frame is nothing more than a list. So as long as the elements in the list (vectors or factors) are of the same length, we can simply convert the list into a data frame.
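In versions of R prior to 4.0.0, data.frame() converts character vectors into factors by default (since R 4.0.0 the default is stringsAsFactors = FALSE, so strings stay as strings). The output below was produced under the older default. You can check how your columns were stored by examining the structure of the data frame with str():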
str(tbl)
#> 'data.frame': 4 obs. of 4 variables:
#> $ name : Factor w/ 4 levels "Anakin","Leia",..: 1 4 3 2
#> $ gender: Factor w/ 2 levels "female","male": 2 1 2 1
#> $ height: num 1.88 1.65 1.72 1.5
#> $ weight: num 84 45 77 49
To prevent data.frame() from converting strings into factors (the default behavior before R 4.0.0), use the argument stringsAsFactors = FALSE:
# strings as strings (not as factors)
dat <- data.frame(
name = c('Anakin', 'Padme', 'Luke', 'Leia'),
gender = c('male', 'female', 'male', 'female'),
height = c(1.88, 1.65, 1.72, 1.50),
weight = c(84, 45, 77, 49),
stringsAsFactors = FALSE
)
str(dat)
#> 'data.frame': 4 obs. of 4 variables:
#> $ name : chr "Anakin" "Padme" "Luke" "Leia"
#> $ gender: chr "male" "female" "male" "female"
#> $ height: num 1.88 1.65 1.72 1.5
#> $ weight: num 84 45 77 49
21.4 Basic Operations with Data Frames
Now that you have seen some ways to create data frames, let’s discuss a number of basic manipulations of data frames. We will show you examples of various operations, and then you’ll have the chance to put them into practice with exercises at the end of the chapter.
- Selecting table elements:
- select a given cell
- select a set of cells
- select a given row
- select a set of rows
- select a given column
- select a set of columns
- Adding a new column
- Deleting a column
- Renaming a column
- Moving a column
- Transforming a column
Let’s say you have a data frame dat with the following content:
dat <- data.frame(
name = c('Leia', 'Luke', 'Han'),
gender = c('female', 'male', 'male'),
height = c(1.50, 1.72, 1.80),
jedi = c(FALSE, TRUE, FALSE),
stringsAsFactors = FALSE
)
dat
#> name gender height jedi
#> 1 Leia female 1.50 FALSE
#> 2 Luke male 1.72 TRUE
#> 3 Han male 1.80 FALSE
21.4.1 Selecting elements
The data frame dat is a 2-dimensional object: the 1st dimension corresponds to the rows, while the 2nd dimension corresponds to the columns. Because dat has two dimensions, the bracket notation involves working with data frames in this form: dat[ , ]. In other words, you have to specify values inside the brackets for the 1st index, and the 2nd index: dat[index1, index2].
Selecting cells
# select value in row 1 and column 1
dat[1,1]
#> [1] "Leia"
# select value in row 2 and column 3
dat[2,3]
#> [1] 1.72
# select values in these cells
dat[1:2,3:4]
#> height jedi
#> 1 1.50 FALSE
#> 2 1.72 TRUE
It is also possible to exclude certain rows-and-columns by passing negative numeric indices:
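A minimal sketch of negative indexing follows. The data frame dat is recreated here, with the same content as above, only so that the block runs on its own; the names without_first and only_third are hypothetical:

```r
# example data frame (same content as dat above)
dat <- data.frame(
  name = c('Leia', 'Luke', 'Han'),
  gender = c('female', 'male', 'male'),
  height = c(1.50, 1.72, 1.80),
  jedi = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# exclude the first row and the second column (gender)
without_first <- dat[-1, -2]
names(without_first)
#> [1] "name"   "height" "jedi"

# exclude rows 1 and 2, keeping all columns
only_third <- dat[-c(1, 2), ]
only_third$name
#> [1] "Han"
```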
Selecting rows
If no value is specified for index1 then all rows are included. Likewise, if no value is specified for index2 then all columns are included.
# selecting first row
dat[1, ]
#> name gender height jedi
#> 1 Leia female 1.5 FALSE
# selecting third row
dat[3, ]
#> name gender height jedi
#> 3 Han male 1.8 FALSE
Selecting columns
# selecting second column
dat[ ,2]
#> [1] "female" "male" "male"
# selecting columns 2 to 4
dat[ ,2:4]
#> gender height jedi
#> 1 female 1.50 FALSE
#> 2 male 1.72 TRUE
#> 3 male 1.80 FALSE
More Options to Access Columns
The dollar sign also works for selecting a column of a data frame using its name:
mtcars$mpg
#> [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
#> [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
#> [31] 15.0 21.4
You don’t need to use quote marks, but you can if you want. The following calls are equivalent.
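As a sketch of what such equivalent calls can look like (the specific quoting variants shown are our illustration), each of the following selects the same column of mtcars:

```r
# no quotes, double quotes, single quotes, and backticks
a <- mtcars$mpg
b <- mtcars$"mpg"
c <- mtcars$'mpg'
d <- mtcars$`mpg`

# all four calls return the same vector
identical(a, b)
#> [1] TRUE
identical(a, c)
#> [1] TRUE
identical(a, d)
#> [1] TRUE
```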
21.4.2 Adding a column
Perhaps the simplest way to add a column is with the dollar operator $. You just need to give a name for the new column, and assign a vector (or factor):
# adding a column
dat$new_column <- c('a', 'e', 'i')
dat
#> name gender height jedi new_column
#> 1 Leia female 1.50 FALSE a
#> 2 Luke male 1.72 TRUE e
#> 3 Han male 1.80 FALSE i
Another way to add a column is with the column binding function cbind():
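A minimal sketch with cbind(), assuming a new weight column whose values (49, 77, 85) are taken from the outputs shown later in this chapter. The data frame is recreated so that the block runs on its own:

```r
# example data frame (same content as the original dat)
dat <- data.frame(
  name = c('Leia', 'Luke', 'Han'),
  gender = c('female', 'male', 'male'),
  height = c(1.50, 1.72, 1.80),
  jedi = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# vector to be added as a new column; it takes the name of the variable
weight <- c(49, 77, 85)
dat <- cbind(dat, weight)
names(dat)
#> [1] "name"   "gender" "height" "jedi"   "weight"
```

Note that cbind() returns a new data frame, so the result has to be assigned back to dat.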
21.4.3 Deleting a column
The inverse operation of adding a column consists of deleting a column. This is possible with the $ dollar operator. For instance, say you want to remove the column new_column. Use the $ operator to select this column, and assign it the value NULL (think of this as NULLifying a column):
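A minimal sketch of this deletion, with a small data frame recreated (including new_column) so that the block runs on its own:

```r
# example data frame that includes the column to be removed
dat <- data.frame(
  name = c('Leia', 'Luke', 'Han'),
  jedi = c(FALSE, TRUE, FALSE),
  new_column = c('a', 'e', 'i'),
  stringsAsFactors = FALSE
)

# deleting a column by assigning NULL to it
dat$new_column <- NULL
names(dat)
#> [1] "name" "jedi"
```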
21.4.4 Renaming a column
What if you want to rename a column? There are various options to do this. One way is by changing the column names attribute:
# attributes
attributes(dat)
#> $names
#> [1] "name" "gender" "height" "jedi" "weight"
#>
#> $row.names
#> [1] 1 2 3
#>
#> $class
#> [1] "data.frame"
which is more commonly accessed with the names() function:
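A sketch of that call, with dat recreated in the state shown by the attributes output above (the weight values are taken from the outputs later in this chapter):

```r
# example data frame with the five columns listed above
dat <- data.frame(
  name = c('Leia', 'Luke', 'Han'),
  gender = c('female', 'male', 'male'),
  height = c(1.50, 1.72, 1.80),
  jedi = c(FALSE, TRUE, FALSE),
  weight = c(49, 77, 85),
  stringsAsFactors = FALSE
)

# column names
names(dat)
#> [1] "name"   "gender" "height" "jedi"   "weight"

# same vector as the names attribute
identical(names(dat), attributes(dat)$names)
#> [1] TRUE
```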
Notice that dat has a list of attributes. The element names is the vector of column names. You can directly modify the vector of names; for example let’s change gender to sex:
# changing gender to sex
attributes(dat)$names[2] <- "sex"
# display column names
names(dat)
#> [1] "name" "sex" "height" "jedi" "weight"
By the way: this approach of changing the name of a variable is very low level, and probably unfamiliar to most useRs.
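A more conventional alternative, sketched here as one possible option, is to assign through names() and match the old name rather than rely on its position. The small data frame is recreated so that the block runs on its own:

```r
# example data frame with a column to rename
dat <- data.frame(
  name = c('Leia', 'Luke', 'Han'),
  gender = c('female', 'male', 'male'),
  stringsAsFactors = FALSE
)

# rename by matching the old column name
names(dat)[names(dat) == "gender"] <- "sex"
names(dat)
#> [1] "name" "sex"
```

Matching on the name keeps the code working even if the column order changes.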
21.4.5 Moving a column
A more challenging operation is when you want to move a column to a different position. What if you want to move sex to the last position (last column)? One option is to create a vector of column names in the desired order, and then use this vector (for the index of columns) to reassign the data frame like this:
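A minimal sketch of this reordering, which moves the sex column to the end. The target order matches the outputs shown in the next section, and the name reordered_names is hypothetical:

```r
# example data frame before reordering
dat <- data.frame(
  name = c('Leia', 'Luke', 'Han'),
  sex = c('female', 'male', 'male'),
  height = c(1.50, 1.72, 1.80),
  jedi = c(FALSE, TRUE, FALSE),
  weight = c(49, 77, 85),
  stringsAsFactors = FALSE
)

# vector of column names in the desired order
reordered_names <- c("name", "jedi", "height", "weight", "sex")

# reassign the data frame with its columns in that order
dat <- dat[ , reordered_names]
names(dat)
#> [1] "name"   "jedi"   "height" "weight" "sex"
```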
21.4.6 Transforming a column
A more common operation than deleting or moving a column is to transform the values in a column. This can be easily accomplished with the $ operator. For instance, let’s say that we want to transform height from meters to centimeters:
# converting height to centimeters
dat$height <- dat$height * 100
dat
#> name jedi height weight sex
#> 1 Leia FALSE 150 49 female
#> 2 Luke TRUE 172 77 male
#> 3 Han FALSE 180 85 male
Likewise, instead of using the $ operator, you can refer to the column using bracket notation. Here’s how to transform weight from kilograms to pounds (1 kg = 2.20462 pounds):
# weight into pounds
dat[ ,"weight"] <- dat[ ,"weight"] * 2.20462
dat
#> name jedi height weight sex
#> 1 Leia FALSE 150 108 female
#> 2 Luke TRUE 172 170 male
#> 3 Han FALSE 180 187 male
There is also the transform() function, which returns a modified copy of the data frame and leaves the original untouched; in that sense its changes are only temporary:
# transform weight from pounds back to kgs (1 pound = 0.453592 kg)
transform(dat, weight = weight * 0.453592)
#> name jedi height weight sex
#> 1 Leia FALSE 150 49 female
#> 2 Luke TRUE 172 77 male
#> 3 Han FALSE 180 85 male
transform() does its job of modifying the values of weight but only temporarily; if you inspect dat you’ll see what this means:
# did weight really change?
dat
#> name jedi height weight sex
#> 1 Leia FALSE 150 108 female
#> 2 Luke TRUE 172 170 male
#> 3 Han FALSE 180 187 male
To make the changes permanent with transform(), you need to reassign them to the data frame:
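A minimal sketch of the reassignment, assuming a small data frame whose weight column is in pounds (values rounded from the output above) and the conversion 1 pound = 0.453592 kg:

```r
# example data frame with weight in pounds
dat <- data.frame(
  name = c('Leia', 'Luke', 'Han'),
  weight = c(108, 170, 187),
  stringsAsFactors = FALSE
)

# transform and reassign: dat itself now stores weight in kilograms
dat <- transform(dat, weight = weight * 0.453592)
dat$weight
```

Without the assignment, dat would keep its original values, because transform() only returns a modified copy.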