5 Data Manipulation with "dplyr"
In the preceding chapter, we introduced some of the key ideas behind the
Grammar of Graphics and their implementation through "ggplot2"
. In this
chapter we move on with some of the basic notions behind "dplyr"
, specifically
we introduce the set of functions commonly referred to as the basic dplyr verbs.
To continue the discussion, we use the sample data for tropical cyclones that formed during the month of September 2010 in the North Atlantic:
<- tibble(
sep2010 name = c('Gaston', 'Hermine', 'Igor', 'Julia', 'Karl', 'Lisa', 'Matthew', 'Nicole'),
wind = c(35, 60, 135, 120, 110, 75, 50, 40),
pressure = c(1005, 989, 924, 948, 956, 982, 998, 994),
category = c('ts', 'ts', 'cat4', 'cat4', 'cat3', 'cat1', 'ts', 'ts'),
days = c(1L, 4L, 13L, 8L, 4L, 6L, 3L, 1L)
)
sep2010
# A tibble: 8 × 5
name wind pressure category days
<chr> <dbl> <dbl> <chr> <int>
1 Gaston 35 1005 ts 1
2 Hermine 60 989 ts 4
3 Igor 135 924 cat4 13
4 Julia 120 948 cat4 8
5 Karl 110 956 cat3 4
6 Lisa 75 982 cat1 6
7 Matthew 50 998 ts 3
8 Nicole 40 994 ts 1
5.1 Basic "dplyr"
verbs
In no particular order of importance, let’s start with the set of elementary
"dplyr"
functions known as the basic dplyr verbs:
filter()
: select (i.e. keeping) rows that match a conditionslice()
: subset rows using their positionsselect()
: keep or drop columns using their names and typesmutate()
: modify existing variables, or add new variablesarrange()
: reorder rows using column namessummarise()
: reduce variables to valuesgroup_by()
: grouped (aggregated) operations
5.1.1 Filter
filter()
allows you to select rows by defining a condition (which could be
simple or compound). For example, we can filter sep2010
by selecting those
tropical cyclones of category tropical storm (ts
)
# using a single condition
filter(sep2010, category == "ts")
## # A tibble: 4 × 5
## name wind pressure category days
## <chr> <dbl> <dbl> <chr> <int>
## 1 Gaston 35 1005 ts 1
## 2 Hermine 60 989 ts 4
## 3 Matthew 50 998 ts 3
## 4 Nicole 40 994 ts 1
We can also use a compound logical condition: tropical storms having a wind speed greater than 40
filter(sep2010, category == "ts" & wind > 40)
## # A tibble: 2 × 5
## name wind pressure category days
## <chr> <dbl> <dbl> <chr> <int>
## 1 Hermine 60 989 ts 4
## 2 Matthew 50 998 ts 3
5.1.2 Slice
slice()
allows you to subset rows using their positions, for example:
# first row
slice(sep2010, 1)
## # A tibble: 1 × 5
## name wind pressure category days
## <chr> <dbl> <dbl> <chr> <int>
## 1 Gaston 35 1005 ts 1
# third row
slice(sep2010, 3)
## # A tibble: 1 × 5
## name wind pressure category days
## <chr> <dbl> <dbl> <chr> <int>
## 1 Igor 135 924 cat4 13
# first three rows
= slice(sep2010, 1:3)
three_rows three_rows
## # A tibble: 3 × 5
## name wind pressure category days
## <chr> <dbl> <dbl> <chr> <int>
## 1 Gaston 35 1005 ts 1
## 2 Hermine 60 989 ts 4
## 3 Igor 135 924 cat4 13
# specify a negative position
slice(sep2010, -1)
## # A tibble: 7 × 5
## name wind pressure category days
## <chr> <dbl> <dbl> <chr> <int>
## 1 Hermine 60 989 ts 4
## 2 Igor 135 924 cat4 13
## 3 Julia 120 948 cat4 8
## 4 Karl 110 956 cat3 4
## 5 Lisa 75 982 cat1 6
## 6 Matthew 50 998 ts 3
## 7 Nicole 40 994 ts 1
We should note that slice()
comes with a set of sibling functions such as:
slice_head()
to subset the first rowsslice_tail()
to subset the last rowsslice_min()
to subset rows with lowest values of a variableslice_max()
to subset rows with highest values of a variableslice_sample()
to randomly subset rows
5.1.3 Select
select()
allows you to select one or more columns by name:
select(sep2010, name, wind, pressure)
## # A tibble: 8 × 3
## name wind pressure
## <chr> <dbl> <dbl>
## 1 Gaston 35 1005
## 2 Hermine 60 989
## 3 Igor 135 924
## 4 Julia 120 948
## 5 Karl 110 956
## 6 Lisa 75 982
## 7 Matthew 50 998
## 8 Nicole 40 994
select()
also allows you to specify a range of consecutive columns via the :
operator
# consecutive columns
select(sep2010, name:pressure)
## # A tibble: 8 × 3
## name wind pressure
## <chr> <dbl> <dbl>
## 1 Gaston 35 1005
## 2 Hermine 60 989
## 3 Igor 135 924
## 4 Julia 120 948
## 5 Karl 110 956
## 6 Lisa 75 982
## 7 Matthew 50 998
## 8 Nicole 40 994
select()
also allows you to exclude one or more columns by negating their
names:
# exclude columns by name
select(sep2010, -(wind:category))
## # A tibble: 8 × 2
## name days
## <chr> <int>
## 1 Gaston 1
## 2 Hermine 4
## 3 Igor 13
## 4 Julia 8
## 5 Karl 4
## 6 Lisa 6
## 7 Matthew 3
## 8 Nicole 1
Interestingly, you can also select()
columns by their position
# columns by position
select(sep2010, 1:3)
## # A tibble: 8 × 3
## name wind pressure
## <chr> <dbl> <dbl>
## 1 Gaston 35 1005
## 2 Hermine 60 989
## 3 Igor 135 924
## 4 Julia 120 948
## 5 Karl 110 956
## 6 Lisa 75 982
## 7 Matthew 50 998
## 8 Nicole 40 994
5.1.4 Mutate
Another basic verb is mutate()
which allows you to: 1) transform existing
variables, but also 2) add new variables.
Wind speed values—column wind
—are expressed in knots. But what if we are
interested in having wind speed expressed in miles per hour? Enter mutate()
.
Here is how to add a new column wind_mph
(1 knot = 1.15078 mph)
mutate(sep2010, wind_mph = wind * 1.15078)
## # A tibble: 8 × 6
## name wind pressure category days wind_mph
## <chr> <dbl> <dbl> <chr> <int> <dbl>
## 1 Gaston 35 1005 ts 1 40.3
## 2 Hermine 60 989 ts 4 69.0
## 3 Igor 135 924 cat4 13 155.
## 4 Julia 120 948 cat4 8 138.
## 5 Karl 110 956 cat3 4 127.
## 6 Lisa 75 982 cat1 6 86.3
## 7 Matthew 50 998 ts 3 57.5
## 8 Nicole 40 994 ts 1 46.0
If we just want to transform the existing values in column wind
so that it is
expressed in mph, then we simply mutate()
it using the same variable name:
mutate(sep2010, wind = wind * 1.15078)
## # A tibble: 8 × 5
## name wind pressure category days
## <chr> <dbl> <dbl> <chr> <int>
## 1 Gaston 40.3 1005 ts 1
## 2 Hermine 69.0 989 ts 4
## 3 Igor 155. 924 cat4 13
## 4 Julia 138. 948 cat4 8
## 5 Karl 127. 956 cat3 4
## 6 Lisa 86.3 982 cat1 6
## 7 Matthew 57.5 998 ts 3
## 8 Nicole 46.0 994 ts 1
Recall that, in order to permanently change the data, you need to assign the changes to an object:
<- mutate(sep2010, wind_mph = wind * 1.15078)
sep2010b sep2010b
## # A tibble: 8 × 6
## name wind pressure category days wind_mph
## <chr> <dbl> <dbl> <chr> <int> <dbl>
## 1 Gaston 35 1005 ts 1 40.3
## 2 Hermine 60 989 ts 4 69.0
## 3 Igor 135 924 cat4 13 155.
## 4 Julia 120 948 cat4 8 138.
## 5 Karl 110 956 cat3 4 127.
## 6 Lisa 75 982 cat1 6 86.3
## 7 Matthew 50 998 ts 3 57.5
## 8 Nicole 40 994 ts 1 46.0
5.2 Reordering rows: arrange()
The next basic verb of "dplyr"
is arrange()
which allows you to reorder
rows. For example, here’s how to arrange the rows of sep2010
by wind
# order rows by wind (increasingly)
arrange(sep2010, wind)
## # A tibble: 8 × 5
## name wind pressure category days
## <chr> <dbl> <dbl> <chr> <int>
## 1 Gaston 35 1005 ts 1
## 2 Nicole 40 994 ts 1
## 3 Matthew 50 998 ts 3
## 4 Hermine 60 989 ts 4
## 5 Lisa 75 982 cat1 6
## 6 Karl 110 956 cat3 4
## 7 Julia 120 948 cat4 8
## 8 Igor 135 924 cat4 13
By default arrange()
sorts rows in increasing order. To arrange rows in
descending order you need to use the auxiliary function desc()
.
# order rows by wind (decreasingly)
arrange(sep2010, desc(wind))
## # A tibble: 8 × 5
## name wind pressure category days
## <chr> <dbl> <dbl> <chr> <int>
## 1 Igor 135 924 cat4 13
## 2 Julia 120 948 cat4 8
## 3 Karl 110 956 cat3 4
## 4 Lisa 75 982 cat1 6
## 5 Hermine 60 989 ts 4
## 6 Matthew 50 998 ts 3
## 7 Nicole 40 994 ts 1
## 8 Gaston 35 1005 ts 1
Last but not least, you can also pass more than one column for arranging purposes:
# order rows by days, and then pressure
arrange(sep2010, days, pressure)
## # A tibble: 8 × 5
## name wind pressure category days
## <chr> <dbl> <dbl> <chr> <int>
## 1 Nicole 40 994 ts 1
## 2 Gaston 35 1005 ts 1
## 3 Matthew 50 998 ts 3
## 4 Karl 110 956 cat3 4
## 5 Hermine 60 989 ts 4
## 6 Lisa 75 982 cat1 6
## 7 Julia 120 948 cat4 8
## 8 Igor 135 924 cat4 13
5.3 Summarizing values with summarise()
The next verb is summarise()
, which is also available as summarize()
.
Conceptually, this involves applying a function on one or more columns, in
order to summarize values. This is probably easier to understand with one
example.
Say you are interested in calculating the average duration of all storms
To do this “a la dplyr” you use summarise()
as follows:
# average duration
summarise(sep2010, avg_days = mean(days))
## # A tibble: 1 × 1
## avg_days
## <dbl>
## 1 5
What if you want to calculate some summary statistics for days
: min,
median, mean, and max?
# some stats for days (dplyr)
summarise(
sep2010, min = min(days),
median = median(days),
avg = mean(days),
max = max(days)
)
## # A tibble: 1 × 4
## min median avg max
## <int> <dbl> <dbl> <int>
## 1 1 4 5 13
5.4 Grouped operations
To better appreciate the power of summarise()
, we need to introduce another
major "dplyr"
basic verb: group_by()
. This is the function that
allows you to perform data aggregations, or grouped operations.
Let’s see the combination of summarise()
and group_by()
to calculate the
average number of days by category:
# average days, grouped by category
summarise(
group_by(sep2010, category),
avg_days = mean(days)
)
Here’s a less simpler example: average days, by category, displayed in descending order by average days:
arrange(
summarise(
group_by(sep2010, category),
avg_days = mean(days)),
desc(avg_days)
)