5 Data Manipulation with "dplyr"

In the preceding chapter, we introduced some of the key ideas behind the Grammar of Graphics and their implementation through "ggplot2". In this chapter we move on with some of the basic notions behind "dplyr", specifically we introduce the set of functions commonly referred to as the basic dplyr verbs.

To continue the discussion, we use the sample data for tropical cyclones that formed during the month of September 2010 in the North Atlantic:

sep2010 <- tibble(
  name = c('Gaston', 'Hermine', 'Igor', 'Julia', 'Karl', 'Lisa', 'Matthew', 'Nicole'),
  wind = c(35, 60, 135, 120, 110, 75, 50, 40),
  pressure = c(1005, 989, 924, 948, 956, 982, 998, 994),
  category = c('ts', 'ts', 'cat4', 'cat4', 'cat3', 'cat1', 'ts', 'ts'),
  days = c(1L, 4L, 13L, 8L, 4L, 6L, 3L, 1L)
)

sep2010
# A tibble: 8 × 5
  name     wind pressure category  days
  <chr>   <dbl>    <dbl> <chr>    <int>
1 Gaston     35     1005 ts           1
2 Hermine    60      989 ts           4
3 Igor      135      924 cat4        13
4 Julia     120      948 cat4         8
5 Karl      110      956 cat3         4
6 Lisa       75      982 cat1         6
7 Matthew    50      998 ts           3
8 Nicole     40      994 ts           1

5.1 Basic "dplyr" verbs

In no particular order of importance, let’s start with the set of elementary "dplyr" functions known as the basic dplyr verbs:

  • filter(): select (i.e. keeping) rows that match a condition

  • slice(): subset rows using their positions

  • select(): keep or drop columns using their names and types

  • mutate(): modify existing variables, or add new variables

  • arrange(): reorder rows using column names

  • summarise(): reduce variables to values

  • group_by(): grouped (aggregated) operations

5.1.1 Filter

filter() allows you to select rows by defining a condition (which could be simple or compound). For example, we can filter sep2010 by selecting those tropical cyclones of category tropical storm (ts)

# using a single condition
filter(sep2010, category == "ts")
## # A tibble: 4 × 5
##   name     wind pressure category  days
##   <chr>   <dbl>    <dbl> <chr>    <int>
## 1 Gaston     35     1005 ts           1
## 2 Hermine    60      989 ts           4
## 3 Matthew    50      998 ts           3
## 4 Nicole     40      994 ts           1

We can also use a compound logical condition: tropical storms having a wind speed greater than 40

filter(sep2010, category == "ts" & wind > 40)
## # A tibble: 2 × 5
##   name     wind pressure category  days
##   <chr>   <dbl>    <dbl> <chr>    <int>
## 1 Hermine    60      989 ts           4
## 2 Matthew    50      998 ts           3

5.1.2 Slice

slice() allows you to subset rows using their positions, for example:

# first row
slice(sep2010, 1)
## # A tibble: 1 × 5
##   name    wind pressure category  days
##   <chr>  <dbl>    <dbl> <chr>    <int>
## 1 Gaston    35     1005 ts           1
# third row
slice(sep2010, 3)
## # A tibble: 1 × 5
##   name   wind pressure category  days
##   <chr> <dbl>    <dbl> <chr>    <int>
## 1 Igor    135      924 cat4        13
# first three rows
three_rows = slice(sep2010, 1:3)
three_rows
## # A tibble: 3 × 5
##   name     wind pressure category  days
##   <chr>   <dbl>    <dbl> <chr>    <int>
## 1 Gaston     35     1005 ts           1
## 2 Hermine    60      989 ts           4
## 3 Igor      135      924 cat4        13
# specify a negative position
slice(sep2010, -1)
## # A tibble: 7 × 5
##   name     wind pressure category  days
##   <chr>   <dbl>    <dbl> <chr>    <int>
## 1 Hermine    60      989 ts           4
## 2 Igor      135      924 cat4        13
## 3 Julia     120      948 cat4         8
## 4 Karl      110      956 cat3         4
## 5 Lisa       75      982 cat1         6
## 6 Matthew    50      998 ts           3
## 7 Nicole     40      994 ts           1

We should note that slice() comes with a set of sibling functions such as:

  • slice_head() to subset the first rows

  • slice_tail() to subset the last rows

  • slice_min() to subset rows with lowest values of a variable

  • slice_max() to subset rows with highest values of a variable

  • slice_sample() to randomly subset rows

5.1.3 Select

select() allows you to select one or more columns by name:

select(sep2010, name, wind, pressure)
## # A tibble: 8 × 3
##   name     wind pressure
##   <chr>   <dbl>    <dbl>
## 1 Gaston     35     1005
## 2 Hermine    60      989
## 3 Igor      135      924
## 4 Julia     120      948
## 5 Karl      110      956
## 6 Lisa       75      982
## 7 Matthew    50      998
## 8 Nicole     40      994

select() also allows you to specify a range of consecutive columns via the : operator

# consecutive columns
select(sep2010, name:pressure)
## # A tibble: 8 × 3
##   name     wind pressure
##   <chr>   <dbl>    <dbl>
## 1 Gaston     35     1005
## 2 Hermine    60      989
## 3 Igor      135      924
## 4 Julia     120      948
## 5 Karl      110      956
## 6 Lisa       75      982
## 7 Matthew    50      998
## 8 Nicole     40      994

select() also allows you to exclude one or more columns by negating their names:

# exclude columns by name
select(sep2010, -(wind:category))
## # A tibble: 8 × 2
##   name     days
##   <chr>   <int>
## 1 Gaston      1
## 2 Hermine     4
## 3 Igor       13
## 4 Julia       8
## 5 Karl        4
## 6 Lisa        6
## 7 Matthew     3
## 8 Nicole      1

Interestingly, you can also select() columns by their position

# columns by position
select(sep2010, 1:3)
## # A tibble: 8 × 3
##   name     wind pressure
##   <chr>   <dbl>    <dbl>
## 1 Gaston     35     1005
## 2 Hermine    60      989
## 3 Igor      135      924
## 4 Julia     120      948
## 5 Karl      110      956
## 6 Lisa       75      982
## 7 Matthew    50      998
## 8 Nicole     40      994

5.1.4 Mutate

Another basic verb is mutate() which allows you to: 1) transform existing variables, but also 2) add new variables.

Wind speed values—column wind—are expressed in knots. But what if we are interested in having wind speed expressed in miles per hour? Enter mutate(). Here is how to add a new column wind_mph (1 knot = 1.15078 mph)

mutate(sep2010, wind_mph = wind * 1.15078)
## # A tibble: 8 × 6
##   name     wind pressure category  days wind_mph
##   <chr>   <dbl>    <dbl> <chr>    <int>    <dbl>
## 1 Gaston     35     1005 ts           1     40.3
## 2 Hermine    60      989 ts           4     69.0
## 3 Igor      135      924 cat4        13    155. 
## 4 Julia     120      948 cat4         8    138. 
## 5 Karl      110      956 cat3         4    127. 
## 6 Lisa       75      982 cat1         6     86.3
## 7 Matthew    50      998 ts           3     57.5
## 8 Nicole     40      994 ts           1     46.0

If we just want to transform the existing values in column wind so that it is expressed in mph, then we simply mutate() it using the same variable name:

mutate(sep2010, wind = wind * 1.15078)
## # A tibble: 8 × 5
##   name     wind pressure category  days
##   <chr>   <dbl>    <dbl> <chr>    <int>
## 1 Gaston   40.3     1005 ts           1
## 2 Hermine  69.0      989 ts           4
## 3 Igor    155.       924 cat4        13
## 4 Julia   138.       948 cat4         8
## 5 Karl    127.       956 cat3         4
## 6 Lisa     86.3      982 cat1         6
## 7 Matthew  57.5      998 ts           3
## 8 Nicole   46.0      994 ts           1

Recall that, in order to permanently change the data, you need to assign the changes to an object:

sep2010b <- mutate(sep2010, wind_mph = wind * 1.15078)
sep2010b
## # A tibble: 8 × 6
##   name     wind pressure category  days wind_mph
##   <chr>   <dbl>    <dbl> <chr>    <int>    <dbl>
## 1 Gaston     35     1005 ts           1     40.3
## 2 Hermine    60      989 ts           4     69.0
## 3 Igor      135      924 cat4        13    155. 
## 4 Julia     120      948 cat4         8    138. 
## 5 Karl      110      956 cat3         4    127. 
## 6 Lisa       75      982 cat1         6     86.3
## 7 Matthew    50      998 ts           3     57.5
## 8 Nicole     40      994 ts           1     46.0

5.2 Reordering rows: arrange()

The next basic verb of "dplyr" is arrange() which allows you to reorder rows. For example, here’s how to arrange the rows of sep2010 by wind

# order rows by wind (increasingly)
arrange(sep2010, wind)
## # A tibble: 8 × 5
##   name     wind pressure category  days
##   <chr>   <dbl>    <dbl> <chr>    <int>
## 1 Gaston     35     1005 ts           1
## 2 Nicole     40      994 ts           1
## 3 Matthew    50      998 ts           3
## 4 Hermine    60      989 ts           4
## 5 Lisa       75      982 cat1         6
## 6 Karl      110      956 cat3         4
## 7 Julia     120      948 cat4         8
## 8 Igor      135      924 cat4        13

By default arrange() sorts rows in increasing order. To arrange rows in descending order you need to use the auxiliary function desc().

# order rows by wind (decreasingly)
arrange(sep2010, desc(wind))
## # A tibble: 8 × 5
##   name     wind pressure category  days
##   <chr>   <dbl>    <dbl> <chr>    <int>
## 1 Igor      135      924 cat4        13
## 2 Julia     120      948 cat4         8
## 3 Karl      110      956 cat3         4
## 4 Lisa       75      982 cat1         6
## 5 Hermine    60      989 ts           4
## 6 Matthew    50      998 ts           3
## 7 Nicole     40      994 ts           1
## 8 Gaston     35     1005 ts           1

Last but not least, you can also pass more than one column for arranging purposes:

# order rows by days, and then pressure
arrange(sep2010, days, pressure)
## # A tibble: 8 × 5
##   name     wind pressure category  days
##   <chr>   <dbl>    <dbl> <chr>    <int>
## 1 Nicole     40      994 ts           1
## 2 Gaston     35     1005 ts           1
## 3 Matthew    50      998 ts           3
## 4 Karl      110      956 cat3         4
## 5 Hermine    60      989 ts           4
## 6 Lisa       75      982 cat1         6
## 7 Julia     120      948 cat4         8
## 8 Igor      135      924 cat4        13

5.3 Summarizing values with summarise()

The next verb is summarise(), which is also available as summarize(). Conceptually, this involves applying a function on one or more columns, in order to summarize values. This is probably easier to understand with one example.

Say you are interested in calculating the average duration of all storms To do this “a la dplyr” you use summarise() as follows:

# average duration
summarise(sep2010, avg_days = mean(days))
## # A tibble: 1 × 1
##   avg_days
##      <dbl>
## 1        5

What if you want to calculate some summary statistics for days: min, median, mean, and max?

# some stats for days (dplyr)
summarise(
  sep2010, 
  min = min(days),
  median = median(days),
  avg = mean(days),
  max = max(days)
)
## # A tibble: 1 × 4
##     min median   avg   max
##   <int>  <dbl> <dbl> <int>
## 1     1      4     5    13

5.4 Grouped operations

To better appreciate the power of summarise(), we need to introduce another major "dplyr" basic verb: group_by(). This is the function that allows you to perform data aggregations, or grouped operations.

Let’s see the combination of summarise() and group_by() to calculate the average number of days by category:

# average days, grouped by category
summarise(
  group_by(sep2010, category),
  avg_days = mean(days)
)

Here’s a less simpler example: average days, by category, displayed in descending order by average days:

arrange(
  summarise(
    group_by(sep2010, category),
    avg_days = mean(days)),
  desc(avg_days)
)