9 Data Frames
The most common format/structure for a data set is a tabular format:
with rows and columns (like a spreadsheet). When your data is in this shape,
most of the time you will work with R data frames (or similar rectangular
structures like a
Learning how to manipulate data frames is among the most important data computing skills in R. Nowadays, there are two primary approaches for manipulating data frames. One is what I call the “traditional” or “classic” approach which is what I present in this chapter. The other is the “tidy” approach which you can think of as a modern version based on the tidy data framework mainly developed by Hadley Wickham. We leave the discussion of this alternative approach for later.
To make the most of the content covered in the next sections, I am assuming that you are familiar with the rest of data objects covered in the previous chapters of part “II Data Objects in R”.
9.1 R Data Frames
A data frame is a special type of R list. In most cases, a data frame is internally stored as a list of vectors or factors, in which each vector (or factor) corresponds to a column. This implies that columns in a data frame are typically atomic structures: all elements in a given column are of the same data type. However, since a data frame is a list, you can technically have any kind of object as a column. In practice, though, having data frames with columns that are not vectors or factors is something that does not make much sense.
From the data manipulation point of view, data frames behave like a hybrid
object. On one hand, they are lists and can be manipulated like any other list
using double brackets
dat[[ ]] and dollar operator
On the other hand, because data frames are designed as tabular or 2-dimensional
objects, they also behave like two-dimensional arrays or matrices, admitting
dat[ , ]. For these reasons, there is a wide array of
functions that allows you to manipulate data frames in very convenient ways.
But to the inexperienced user, all these functions may feel overwhelming.
9.2 Inspecting data frames
One of the basic tasks when working with data frames involves inspecting its contents. Specially in the early stages of data exploration, when dealing for the first time with a new data frame, you will need to inspect things like its overall structure, which includes its dimensions (number of rows and columns), the data types of its columns, the names of columns and rows, and also be able to take a peak to some of its first or last rows, and usually obtain a summary of each column.
Let’s see an example with one of the built-in data frames in R:
Just a few rows and columns of
mtcars are displayed below:
> mpg cyl disp hp drat wt > Mazda RX4 21.0 6 160 110 3.90 2.620 > Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 > Datsun 710 22.8 4 108 93 3.85 2.320 > Hornet 4 Drive 21.4 6 258 110 3.08 3.215 > Hornet Sportabout 18.7 8 360 175 3.15 3.440
The main function to explore the structure of not just a data frame, but of
any kind of object, is
str(). When applied to data frames,
str() returns a
report of the dimensions of the data frame, a list with the name of all the
variables, and their data types (e.g.
num real, etc).
str(mtcars, vec.len = 1) > 'data.frame': 32 obs. of 11 variables: > $ mpg : num 21 21 ... > $ cyl : num 6 6 ... > $ disp: num 160 160 ... > $ hp : num 110 110 ... > $ drat: num 3.9 3.9 ... > $ wt : num 2.62 ... > $ qsec: num 16.5 ... > $ vs : num 0 0 ... > $ am : num 1 1 ... > $ gear: num 4 4 ... > $ carb: num 4 4 ...
vec.len = 1 is optional but we like to use it because it
indicates that just the first elements in each column should be displayed.
Observe the output returned by
str(). The first line tells us that
is an object of class
'data.frame' with 32 observations (rows) and 11 variables
(columns). Then, the set of 11 variables is listed below, each line starting
with the dollar
$ operator, followed by the name of the variable, followed
by a colon
:, the data mode (all numeric
num variables in this case),
and then a couple of values in each variable.
It is specially useful to check the data type of each column in order to catch potential issues and avoid disastrous consequences or bugs in subsequent stages.
Here’s a list of useful functions to inspect a data frame:
str(): overall structure
head(): first rows
tail(): last rows
summary(): descriptive statistics
nrow(): number of rows
ncol(): number of columns
names(): names of list elements (i.e. column names)
colnames(): column names
rownames(): row names
dimnames(): list with column and row names
On a technical side, we should mention that a data frame is a list with special
attributes: an attribute
names for column names, an attribute
row.names for row names, and of course its attribute
attributes(mtcars) > $names >  "mpg" "cyl" "disp" "hp" "drat" "wt" >  "qsec" "vs" "am" "gear" "carb" > > $row.names >  "Mazda RX4" "Mazda RX4 Wag" >  "Datsun 710" "Hornet 4 Drive" >  "Hornet Sportabout" "Valiant" >  "Duster 360" "Merc 240D" >  "Merc 230" "Merc 280" >  "Merc 280C" "Merc 450SE" >  "Merc 450SL" "Merc 450SLC" >  "Cadillac Fleetwood" "Lincoln Continental" >  "Chrysler Imperial" "Fiat 128" >  "Honda Civic" "Toyota Corolla" >  "Toyota Corona" "Dodge Challenger" >  "AMC Javelin" "Camaro Z28" >  "Pontiac Firebird" "Fiat X1-9" >  "Porsche 914-2" "Lotus Europa" >  "Ford Pantera L" "Ferrari Dino" >  "Maserati Bora" "Volvo 142E" > > $class >  "data.frame"
9.3 Creating data frames
Most of the (raw) data tables you will be working with will already be in some data file. However, from time to time you will face the need to create some sort of data table in R. In these situations, you will likely have to create such table with a data frame. So let’s look at various ways to “manually”” create a data frame.
Option 1: The primary option to build a data frame is with
You pass a series of vectors (or factors), of the same length, separated by commas.
Each vector (or factor) will become a column in the generated data frame.
Preferably, give names to each column like in the example below:
<- data.frame( dat name = c('Anakin', 'Padme', 'Luke', 'Leia'), gender = c('male', 'female', 'male', 'female'), height = c(1.88, 1.65, 1.72, 1.50), weight = c(84, 45, 77, 49) ) dat> name gender height weight > 1 Anakin male 1.88 84 > 2 Padme female 1.65 45 > 3 Luke male 1.72 77 > 4 Leia female 1.50 49
Option 2: Another way to create data frames is with a
vectors or factors (of the same length), which you then convert into a data
# another way to create a basic data frame <- list( lst name = c('Anakin', 'Padme', 'Luke', 'Leia'), gender = c('male', 'female', 'male', 'female'), height = c(1.88, 1.65, 1.72, 1.50), weight = c(84, 45, 77, 49) ) <- data.frame(lst) tbl tbl> name gender height weight > 1 Anakin male 1.88 84 > 2 Padme female 1.65 45 > 3 Luke male 1.72 77 > 4 Leia female 1.50 49
Remember that a
data.frame is nothing more than a
list. So as long as the
elements in the list (vectors or factors) are of the same length, we can simply
convert the list into a data frame.
Keep in mind that in old versions of R (3.1.0 or older),
data.frame() used to
convert character vectors into factors. You can always check the data type of
each column in a data frame with
str(tbl) > 'data.frame': 4 obs. of 4 variables: > $ name : chr "Anakin" "Padme" "Luke" "Leia" > $ gender: chr "male" "female" "male" "female" > $ height: num 1.88 1.65 1.72 1.5 > $ weight: num 84 45 77 49
data.frame() from converting strings into factors, you must use
stringsAsFactors = FALSE
# strings as strings (not as factors) <- data.frame( dat name = c('Anakin', 'Padme', 'Luke', 'Leia'), gender = c('male', 'female', 'male', 'female'), height = c(1.88, 1.65, 1.72, 1.50), weight = c(84, 45, 77, 49), stringsAsFactors = FALSE ) str(dat) > 'data.frame': 4 obs. of 4 variables: > $ name : chr "Anakin" "Padme" "Luke" "Leia" > $ gender: chr "male" "female" "male" "female" > $ height: num 1.88 1.65 1.72 1.5 > $ weight: num 84 45 77 49
9.4 Basic Operations with Data Frames
Now that you have seen some ways to create data frames, let’s discuss a number of basic manipulations of data frames. We will show you examples of various operations, and then you’ll have the chance to put them in practice with some exercises listed at the end of the chapter.
- Selecting table elements:
- select a given cell
- select a set of cells
- select a given row
- select a set of rows
- select a given column
- select a set of columns
- Adding a new column
- Deleting a column
- Renaming a column
- Moving a column
- Transforming a column
Let’s say you have a data frame
dat with the following content:
<- data.frame( dat name = c('Leia', 'Luke', 'Han'), gender = c('female', 'male', 'male'), height = c(1.50, 1.72, 1.80), jedi = c(FALSE, TRUE, FALSE), stringsAsFactors = FALSE ) dat name gender height jedi1 Leia female 1.50 FALSE 2 Luke male 1.72 TRUE 3 Han male 1.80 FALSE
9.4.1 Selecting elements
The data frame
dat is a 2-dimensional object: the 1st dimension corresponds
to the rows, while the 2nd dimension corresponds to the columns.
dat has two dimensions, the bracket notation involves
working with data frames in this form:
dat[ , ].
In other words, you have to specify values inside the
brackets for the 1st index, and the 2nd index:
# select value in row 1 and column 1 1,1] dat[>  "Leia" # select value in row 2 and column 3 2,3] dat[>  1.72 # select values in these cells 1:2,3:4] dat[> height jedi > 1 1.50 FALSE > 2 1.72 TRUE
It is also possible to exclude certain rows-and-columns by passing negative numeric indices:
If no value is specified for
index1 then all rows are included. Likewise,
if no value is specified for
index2 then all columns are included.
# selecting first row 1, ] dat[> name gender height jedi > 1 Leia female 1.5 FALSE # selecting third row 3, ] dat[> name gender height jedi > 3 Han male 1.8 FALSE
# selecting second column 2] dat[ ,>  "female" "male" "male" # selecting columns 2 to 4 2:4] dat[ ,> gender height jedi > 1 female 1.50 FALSE > 2 male 1.72 TRUE > 3 male 1.80 FALSE
More Options to Access Columns
The dollar sign also works for selecting a column of a data frame using its name
$mpg mtcars>  21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 >  19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 >  30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 >  30.4 15.8 19.7 15.0 21.4
You don’t need to use quote marks, but you can if you want. The following calls are equivalent.
$'mpg' mtcars$"mpg" mtcars$`mpg`mtcars
9.4.2 Adding a column
Perhaps the simplest way to add a column is with the dollar operator
You just need to give a name for the new column, and assign a vector (or factor):
# adding a column $new_column <- c('a', 'e', 'i') dat dat> name gender height jedi new_column > 1 Leia female 1.50 FALSE a > 2 Luke male 1.72 TRUE e > 3 Han male 1.80 FALSE i
Another way to add a column is with the column binding function
# vector of weights <- c(49, 77, 85) weight # adding weights to dat <- cbind(dat, weight) dat dat> name gender height jedi new_column weight > 1 Leia female 1.50 FALSE a 49 > 2 Luke male 1.72 TRUE e 77 > 3 Han male 1.80 FALSE i 85
9.4.3 Deleting a column
The inverse operation of adding a column consists of deleting a column.
This is possible with the
$ dollar operator. For instance, say you want to
remove the column
new_column. Use the
$ operator to select this column,
and assign it the value
NULL (think of this as NULLifying a column):
# deleting a column $new_column <- NULL dat dat> name gender height jedi weight > 1 Leia female 1.50 FALSE 49 > 2 Luke male 1.72 TRUE 77 > 3 Han male 1.80 FALSE 85
9.4.4 Renaming a column
What if you want to rename a column? There are various options to do this.
One way is by changing the column
# attributes attributes(dat) > $names >  "name" "gender" "height" "jedi" "weight" > > $row.names >  1 2 3 > > $class >  "data.frame"
which is more commonly accessed with the
# column names names(dat) >  "name" "gender" "height" "jedi" "weight"
dat has a list of attributes. The element
names is the vector
of column names.
You can directly modify the vector of
names; for example let’s change
# changing rookie to rooky attributes(dat)$names <- "sex" # display column names names(dat) >  "name" "sex" "height" "jedi" "weight"
By the way: this approach of changing the name of a variable is very low level, and probably unfamiliar to most useRs.
9.4.5 Moving a column
A more challenging operation is when you want to move a column to a different
position. What if you want to move
salary to the last position (last column)?
One option is to create a vector of column names in the desired order, and then
use this vector (for the index of columns) to reassign the data frame like this:
<- c("name", "jedi", "height", "weight", "sex") reordered_names <- dat[ ,reordered_names] dat dat> name jedi height weight sex > 1 Leia FALSE 1.50 49 female > 2 Luke TRUE 1.72 77 male > 3 Han FALSE 1.80 85 male
9.4.6 Transforming a column
A more common operation than deleting or moving a column, is to transform the
values in a column. This can be easily accomplished with the
For instance, let’s say that we want to transform
height from meters to
# converting height to centimeters $height <- dat$height * 100 dat dat> name jedi height weight sex > 1 Leia FALSE 150 49 female > 2 Luke TRUE 172 77 male > 3 Han FALSE 180 85 male
Likewise, instead of using the
$ operator, you can refer to the column using
bracket notation. Here’s how to transform weight from kilograms to pounds
(1 kg = 2.20462 pounds):
# weight into pounds "weight"] <- dat[ ,"weight"] * 2.20462 dat[ , dat> name jedi height weight sex > 1 Leia FALSE 150 108.0264 female > 2 Luke TRUE 172 169.7557 male > 3 Han FALSE 180 187.3927 male
There is also the
transform() function which transform values interactively,
that is, temporarily:
# transform weight to kgs transform(dat, weight = weight / 0.453592) > name jedi height weight sex > 1 Leia FALSE 150 238.1576 female > 2 Luke TRUE 172 374.2476 male > 3 Han FALSE 180 413.1305 male
transform() does its job of modifying the values of
weight but only
temporarily; if you inspect
dat you’ll see what this means:
# did weight really change? dat> name jedi height weight sex > 1 Leia FALSE 150 108.0264 female > 2 Luke TRUE 172 169.7557 male > 3 Han FALSE 180 187.3927 male
To make the changes permanent with
transform(), you need to reassign them
to the data frame:
# transform weight to inches (permanently) <- transform(dat, weight = weight / 0.453592) dat dat> name jedi height weight sex > 1 Leia FALSE 150 238.1576 female > 2 Luke TRUE 172 374.2476 male > 3 Han FALSE 180 413.1305 male
1) Consider the following data frame
first last gender born spell 1 Harry Potter male 1980 sectumsempra 2 Hermione Granger female 1979 alohomora 3 Ron Weasley male 1980 riddikulus 4 Luna Lovegood female 1981 episkey
What commands will fail to return the data of individuals born in 1980?
df[c(TRUE, FALSE, TRUE, FALSE), ]
df[df[,4] == 1980, ]
df[df$born == 1980]
df[df$born == 1980, ]
df[ ,df$born == 1980]
Select the command that does not provide information about the data frame
Your friend is trying to display the first three rows on columns 1 (
first) and 2 (
last), by unsuccessfully using the following command. Why does the command print all columns?
1:3, 1 & 2] df[> first last gender born spell > 1 Harry Potter male 1980 sectumsempra > 2 Hermione Granger female 1979 alohomora > 3 Ron Weasley male 1980 riddikulus
Write a command that would correctly display the first two columns.
Write a command that would give you the following data from
spell first 1 sectumsempra Harry 2 alohomora Hermione 3 riddikulus Ron 4 episkey Luna
2) Consider the following data frame
first last gender title 1 Jon Snow male lord 2 Arya Stark female princess 3 Tyrion Lannister male master 4 Daenerys Targaryen female khaleesi 5 Yara Greyjoy female princess gpa 1 2.8 2 3.5 3 2.9 4 3.7 5 NA
One of your friends wrote the following R code. Help your friend find all the errors and explain what’s wrong.
# value of 'first' associated to maximum 'gpa' max_gpa <- max(dat$gpa, na.rm = TRUE) which_max_gpa <- dat$gpa = max_gpa dat$first(which_max_gpa) # gpa of title lord dat$gpa[dat[ ,title] = "lord"] # median gpa (of each gender) which_males <- dat$gender == 'male' which_females <- dat$gender == 'female' median_females <- median(dat$gpa[which_males]) median_males <- median(dat$gpa[which_males])