9 Exploratory Data Analysis with dplyr

9.1 Introduction

In this chapter and the next one, you will start learning a couple of approaches to manipulate tables and create basic statistical graphics. To manipulate tables, we are going to use the functionality of the package "dplyr".

# in this chapter we use the package "dplyr"
library(dplyr)
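
In case "dplyr" is not yet installed on your machine, you can install it first (a one-time step):

# install "dplyr" (only needed once)
install.packages("dplyr")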

This package provides a grammar of data manipulation for working with tabular data. Although the package itself is fairly recent, it is based on more than a decade of research and work led by Hadley Wickham.

Later in the book you will also have the opportunity to learn more low-level manipulation tasks.

9.2 Atlantic Hurricane Data

"dplyr" comes with a data set called storms. This data is one of the data sets available in the National Hurricane Center (NHC) Data Archive, which is part of the National Oceanic and Atmospheric Administration (NOAA). In particular, the data set storms refers to the Atlantic hurricane database best track data:

http://www.nhc.noaa.gov/data/#hurdat

The data includes the positions and attributes of 198 tropical storms, measured every six hours during the lifetime of each storm.

Assuming that you’ve loaded this package, simply type the name of the data object:

storms
#> # A tibble: 10,010 x 13
#>    name   year month   day  hour   lat  long status category  wind pressure
#>    <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <ord>    <int>    <int>
#>  1 Amy    1975     6    27     0  27.5 -79   tropi… -1          25     1013
#>  2 Amy    1975     6    27     6  28.5 -79   tropi… -1          25     1013
#>  3 Amy    1975     6    27    12  29.5 -79   tropi… -1          25     1013
#>  4 Amy    1975     6    27    18  30.5 -79   tropi… -1          25     1013
#>  5 Amy    1975     6    28     0  31.5 -78.8 tropi… -1          25     1012
#>  6 Amy    1975     6    28     6  32.4 -78.7 tropi… -1          25     1012
#>  7 Amy    1975     6    28    12  33.3 -78   tropi… -1          25     1011
#>  8 Amy    1975     6    28    18  34   -77   tropi… -1          30     1006
#>  9 Amy    1975     6    29     0  34.4 -75.8 tropi… 0           35     1004
#> 10 Amy    1975     6    29     6  34   -74.8 tropi… 0           40     1002
#> # … with 10,000 more rows, and 2 more variables: ts_diameter <dbl>,
#> #   hu_diameter <dbl>

You can find a technical description of storms by taking a peek at its manual (or help) documentation: ?storms.

As you can tell from the displayed output, storms is a tibble object, which is one of the data objects in R that handles data in tabular format. On a technical note, tibbles are not a native R object: they come from the package of the same name, "tibble", and are a modern version of data frames, the native R object for handling general data tables.
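
You can verify this layered design by asking for the class of the object; tibbles carry the data.frame class in addition to their own:

# a tibble is a data frame with extra classes on top
class(storms)
#> [1] "tbl_df"     "tbl"        "data.frame"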

The way tibbles are printed is very interesting. The number of rows displayed is limited to 10; also, depending on the width of the printing space, only as many columns are shown as fit that width. Notice that underneath the name of each column there is a three-letter abbreviation inside angle brackets; this abbreviation indicates the data type R uses to store the values. For instance, the first column, name, has type <chr>, which stands for character data; the second column, year, is of type <dbl> or double, that is, real numbers (numbers with decimal digits). The fourth column, day, is of type <int> or integer (numbers with no decimal digits).
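
If you want to see every column and its data type at once, regardless of the printing width, one option is the function glimpse(), available once "dplyr" is loaded; it prints one line per column:

# transposed view of storms: one line per column, with its type
glimpse(storms)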

Here’s a full description of all the columns:

  • name: Storm name
  • year, month, and day: Date of report
  • hour: Hour of report (in UTC)
  • lat: Latitude
  • long: Longitude
  • status: Storm classification (Tropical Depression, Tropical Storm, or Hurricane)
  • category: Saffir-Simpson storm category (estimated from wind speed. -1 = Tropical Depression, 0 = Tropical Storm)
  • wind: Storm’s maximum sustained wind speed (in knots)
  • pressure: Air pressure at the storm’s center (in millibars)
  • ts_diameter: Diameter of the area experiencing tropical storm strength winds (34 knots or above)
  • hu_diameter: Diameter of the area experiencing hurricane strength winds (64 knots or above)

Some Remarks

  • The data table is already in R; later you will learn how to import tables into R.
  • The table is already clean: there’s no need to fix weird values or to convert columns from one data type to another.
  • Not only is the table clean, it is also tidy, which is the technical term to indicate that:
    • each variable forms a column
    • each observation forms a row

To better understand what tidy data implies, it is also useful to describe in what ways data sets are messy:

  • column headers are values, not variable names
  • multiple variables are stored in one column
  • variables are stored in both rows and columns

Keep in mind that we are using the storms example to get you up and running analyzing data tables, focusing on what is called Exploratory Data Analysis (EDA). Later in the book we will also describe typical tasks that are performed before EDA, such as importing data, preparation of data, cleaning, tidying, etc.

9.3 Exploratory Data Analysis

Recall the diagram of the Data Analysis Cycle:

Figure 9.1: Exploratory Data Analysis in DAC

Exploring data is one of those tasks that you will use in both the Data Preparation stage and the Core Analysis stage. EDA has a main purpose: get to know your data. EDA is very similar to when you go to the doctor and they do an initial exploration (measure your height, your weight, temperature, blood pressure; listen to your heart and lungs; look at your eyes, throat, ears; ask you questions about your eating habits, physical activity habits, etc).

To keep things relatively simple, we won’t perform a full exploration of every single variable (i.e. column) in the data. However, we encourage you to play with the functions to go beyond what we cover in this chapter. In real life, you will have to do such exploration.

Another important comment: in most courses, books, and workshops, EDA tends to receive very little attention. But don’t underestimate the power of exploring your data. It may not be as glamorous as running simulations, fitting models, making predictions, or other more sophisticated activities. But that does not mean EDA is unimportant. Quite the opposite! You should spend enough time understanding the data (variables, individuals, measurements), the context in which the data was collected, and, if possible, a bit about the field or discipline related to the data. This is something that can’t be easily taught in a book; you should spend time with your clients, colleagues, and experts to learn as much as possible about the data you will work with.

9.3.1 Basic inspection of year

When you type storms, R displays the first 10 rows, which belong to storm Amy in 1975. From this, we can infer that the data contains at least one storm from 1975. We also know, from the manual documentation of storms, that there are supposed to be 198 storms. But we don’t know for what years. So in a more or less arbitrary way, let’s begin inspecting storms by focusing on the year column. Our first question is: For what years have the data been collected?

To answer this question, we need to work with the column year. There are several ways in R to manipulate a column from a tabular object. Using "dplyr", there are two basic functions to extract variables: pull() and select().

Figure 9.2: Extracting a column with dplyr functions “pull” and “select”

Let’s do a sanity check of year. We can use the function pull(), which pulls or extracts an entire column. Because there are 10,010 elements in year, let’s also use unique() to find out the set of unique year values in the data. First we pull the year, and then we identify unique occurrences:

unique(pull(storms, year))
#>  [1] 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
#> [16] 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004
#> [31] 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015

The same operation can be accomplished with select(). The difference between pull() and select() is in the way the output is handled: select() returns output in a table format, whereas pull() returns the output as an R vector:

unique(select(storms, year))
#> # A tibble: 41 x 1
#>     year
#>    <dbl>
#>  1  1975
#>  2  1976
#>  3  1977
#>  4  1978
#>  5  1979
#>  6  1980
#>  7  1981
#>  8  1982
#>  9  1983
#> 10  1984
#> # … with 31 more rows

Based on the previous answers, we can see that storms has records during a 41-year period from 1975 to 2015.
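
A quick way to double-check this period, without listing all the years, is to look at the range of the column:

# smallest and largest year in the data
range(pull(storms, year))
#> [1] 1975 2015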

9.3.2 Basic inspection of month

What about month? We can apply the same commands to see whether there are storms in all months:

unique(pull(storms, month))
#>  [1]  6  7  8  9 10 11 12  5  4  1

In this case, it would be better if we sort() them:

sort(unique(pull(storms, month)))
#>  [1]  1  4  5  6  7  8  9 10 11 12

Observe that not all months have recorded storms; this is the case for February (2) and March (3). Is this something to be concerned about? How is it possible that there are no recorded data for February and March? For the inexperienced analyst, asking this type of question is fundamental. As a data scientist, you will be working with data sets for which you are not necessarily an expert in the particular field of application. Since you will also be interacting with subject-matter experts, you should ask them as many questions as possible to clarify your understanding of the data and its context.

The reason there are no storms in February and March is that these months correspond to the end of winter and the beginning of spring in the North Atlantic, a period during which tropical storms do not form. In fact, spring months such as April and May are not typical hurricane months either. So a further thing to explore could involve also computing the number of storms in April and May, as sketched below.
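
As a rough sketch of that follow-up (the output is left for you to inspect), you could filter the April and May records, and then list the distinct name-year combinations:

# sketch: which storms have records in April or May?
distinct(filter(storms, month %in% c(4, 5)), name, year)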

9.3.3 Basic inspection of day

The same type of exploration can also be applied to column day, just to make sure that its contents make sense:

sort(unique(pull(storms, day)))
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30 31

Based on this output, the numeric values of day look good too: we have day numbers from 1 to 31, which is the valid range for days. Had we found a value of 32 or 33 or greater, that would have raised a lot of suspicion. Now, just because the day numbers look good, that does not automatically mean they are flawless. In this case there’s nothing to worry about, but in other situations you might need to check that the full date (year-month-day) is valid.
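
For example, one way to sketch such a check in base R is to build the full dates with as.Date(), which returns NA for impossible combinations (such as "1975-02-30"), and then count the failures:

# sketch: validate the full year-month-day combinations
dates <- as.Date(paste(pull(storms, year), pull(storms, month), pull(storms, day), sep = "-"))
sum(is.na(dates))  # 0 means every row forms a valid date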

9.3.4 Basic inspection of storms in 1975

Let’s focus on those storms recorded in 1975. How do we select them? Computationally, this operation involves a logical condition: year == 1975. This condition means that, from all the available year values, we get those that match 1975. This is done via the "dplyr" function filter().

Figure 9.3: Extracting a row with dplyr function “filter”

First, let’s create a subset storms75 by filtering those rows with year equal to 1975:

storms75 <- filter(storms, year == 1975)
storms75
#> # A tibble: 86 x 13
#>    name   year month   day  hour   lat  long status category  wind pressure
#>    <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <ord>    <int>    <int>
#>  1 Amy    1975     6    27     0  27.5 -79   tropi… -1          25     1013
#>  2 Amy    1975     6    27     6  28.5 -79   tropi… -1          25     1013
#>  3 Amy    1975     6    27    12  29.5 -79   tropi… -1          25     1013
#>  4 Amy    1975     6    27    18  30.5 -79   tropi… -1          25     1013
#>  5 Amy    1975     6    28     0  31.5 -78.8 tropi… -1          25     1012
#>  6 Amy    1975     6    28     6  32.4 -78.7 tropi… -1          25     1012
#>  7 Amy    1975     6    28    12  33.3 -78   tropi… -1          25     1011
#>  8 Amy    1975     6    28    18  34   -77   tropi… -1          30     1006
#>  9 Amy    1975     6    29     0  34.4 -75.8 tropi… 0           35     1004
#> 10 Amy    1975     6    29     6  34   -74.8 tropi… 0           40     1002
#> # … with 76 more rows, and 2 more variables: ts_diameter <dbl>,
#> #   hu_diameter <dbl>

Once we have the set of storms that occurred in 1975, one possible question to ask is what unique() storms happened in that year:

unique(pull(storms75, name))
#> [1] "Amy"      "Caroline" "Doris"

From the returned output, we see that only three storms were recorded in 1975.

A similar result can be obtained with distinct(), the difference being the way in which the output is returned, in this case in the format of a tibble:

distinct(storms75, name)
#> # A tibble: 3 x 1
#>   name    
#>   <chr>   
#> 1 Amy     
#> 2 Caroline
#> 3 Doris

Now that we know there are three storms for 1975, it would be nice to count the number of rows or records for each of them. "dplyr" allows us to do this with count(), passing the name of the table, and then the name of the column for which we want to get the counts or frequencies:

count(storms75, name)
#> # A tibble: 3 x 2
#>   name         n
#>   <chr>    <int>
#> 1 Amy         30
#> 2 Caroline    33
#> 3 Doris       23

For illustration purposes, here are some commands with filtering examples. Say you want to select data for years 1975 and 1976; here are various ways to do that:

# storm records in 1975 and 1976
storms75and76 <- filter(storms, year == 1975 | year == 1976)

# equivalent: storm records in 1975 and 1976
filter(storms, year %in% c(1975, 1976))

# equivalent: storm records in 1975 and 1976
filter(storms, year %in% 1975:1976)

# equivalent: because the years start at 1975, we can also do
filter(storms, year <= 1976)
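
Yet another equivalent option is the "dplyr" helper function between(), whose bounds are inclusive:

# equivalent: storm records in 1975 and 1976
filter(storms, between(year, 1975, 1976))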

Another example: say you want to select storms for years 1975 to 1979:

# storm records from 1975 to 1979
filter(storms, year %in% 1975:1979)

What if you want to select the rows for storm Caroline in 1975? This involves combining two simple conditions: on one hand year == 1975, and at the same time name == "Caroline", joined with the logical AND operator &:

# 1975 Caroline storm records
filter(storms, year == 1975 & name == "Caroline")

# equivalent: 1975 Caroline storm records
filter(storms, name == "Caroline" & year == 1975)
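
By the way, when you pass several conditions to filter() as separate arguments, they are combined with an implicit logical AND; so the following call is also equivalent:

# equivalent: comma-separated conditions are joined with AND
filter(storms, year == 1975, name == "Caroline")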

9.3.5 Group-by operations

Another common task when exploring data has to do with computations applied to certain groups or categories of data. "dplyr" provides the function group_by(), which takes a data table and the column(s) on which rows will be grouped. Most of the time, you will use group_by() together with the function summarise(), which, as its name indicates, allows you to compute a certain summary on the specified columns.

Figure 9.4: Group-by operations

For example, we may be interested in calculating the average wind speed and average pressure of each storm in 1975. First we need to group by name, and then we use summarise() to indicate that we want to get the mean() of wind and pressure, like this:

summarise(
  group_by(storms75, name),
  avg_wind = mean(wind),
  avg_pressure = mean(pressure)
)
#> # A tibble: 3 x 3
#>   name     avg_wind avg_pressure
#>   <chr>       <dbl>        <dbl>
#> 1 Amy          46.5         995.
#> 2 Caroline     38.9        1002.
#> 3 Doris        73.7         983.

Sometimes, you’ll find it convenient to assign the output to its own table:

avg_wind_pressure_75 <- summarise(
  group_by(storms75, name),
  avg_wind = mean(wind),
  avg_pressure = mean(pressure)
)
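
A small extension of this summary, in case you also want the number of records per storm, is the "dplyr" counter n(); the resulting counts should match those obtained earlier with count():

# same summary, also counting the number of records per storm
summarise(
  group_by(storms75, name),
  avg_wind = mean(wind),
  avg_pressure = mean(pressure),
  records = n()
)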

Right now, the table of summary means is ordered alphabetically by name. But perhaps you may want to organize its contents by avg_wind or by avg_pressure. Let’s see how to do this in the next subsection.

9.3.6 Arrange operations

Besides group_by() operations, another common type of manipulation is the arrangement of rows based on the values of one or more columns. In "dplyr", this can easily be achieved with the function arrange(). You pass the name of the table to this function, and then specify one or more columns whose values determine the order of the rows.

Figure 9.5: Arranging rows

Say you want to arrange the contents of the average summary table by taking into account the column avg_wind:

arrange(avg_wind_pressure_75, avg_wind)
#> # A tibble: 3 x 3
#>   name     avg_wind avg_pressure
#>   <chr>       <dbl>        <dbl>
#> 1 Caroline     38.9        1002.
#> 2 Amy          46.5         995.
#> 3 Doris        73.7         983.

Likewise, you can also arrange the averages by avg_pressure:

arrange(avg_wind_pressure_75, avg_pressure)
#> # A tibble: 3 x 3
#>   name     avg_wind avg_pressure
#>   <chr>       <dbl>        <dbl>
#> 1 Doris        73.7         983.
#> 2 Amy          46.5         995.
#> 3 Caroline     38.9        1002.

The default behavior of arrange() is to organize rows in increasing order. But what if you want to organize rows in decreasing order? No problem, just use the auxiliary function desc() to indicate that rows should be arranged decreasingly:

arrange(avg_wind_pressure_75, desc(avg_wind))
#> # A tibble: 3 x 3
#>   name     avg_wind avg_pressure
#>   <chr>       <dbl>        <dbl>
#> 1 Doris        73.7         983.
#> 2 Amy          46.5         995.
#> 3 Caroline     38.9        1002.
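
Note that arrange() accepts several columns as well, using each additional column to break ties in the previous one. For instance, a possible way to order the 1975 records by month, and by decreasing wind speed within each month:

# arrange by month, and by decreasing wind speed within each month
arrange(storms75, month, desc(wind))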

9.3.7 Inspecting 1975 storm Amy

Let’s focus on a specific storm, for example storm Amy in 1975. For the sake of simplicity, we are going to create a table amy75 containing the values of this storm:

amy75 <- filter(storms75, name == "Amy")
amy75
#> # A tibble: 30 x 13
#>    name   year month   day  hour   lat  long status category  wind pressure
#>    <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <ord>    <int>    <int>
#>  1 Amy    1975     6    27     0  27.5 -79   tropi… -1          25     1013
#>  2 Amy    1975     6    27     6  28.5 -79   tropi… -1          25     1013
#>  3 Amy    1975     6    27    12  29.5 -79   tropi… -1          25     1013
#>  4 Amy    1975     6    27    18  30.5 -79   tropi… -1          25     1013
#>  5 Amy    1975     6    28     0  31.5 -78.8 tropi… -1          25     1012
#>  6 Amy    1975     6    28     6  32.4 -78.7 tropi… -1          25     1012
#>  7 Amy    1975     6    28    12  33.3 -78   tropi… -1          25     1011
#>  8 Amy    1975     6    28    18  34   -77   tropi… -1          30     1006
#>  9 Amy    1975     6    29     0  34.4 -75.8 tropi… 0           35     1004
#> 10 Amy    1975     6    29     6  34   -74.8 tropi… 0           40     1002
#> # … with 20 more rows, and 2 more variables: ts_diameter <dbl>,
#> #   hu_diameter <dbl>

Here’s a couple of questions that we could investigate:

  • which are the status categories for Amy?
  • during which months was Amy active? and for how many days?
  • what are the basic summary statistics for wind and pressure?

# which are the `status` categories for Amy?
distinct(amy75, status)
#> # A tibble: 2 x 1
#>   status             
#>   <chr>              
#> 1 tropical depression
#> 2 tropical storm

# during which months was Amy active?
distinct(amy75, month)
#> # A tibble: 2 x 1
#>   month
#>   <dbl>
#> 1     6
#> 2     7

# for how many days was Amy active?
count(distinct(amy75, day))
#> # A tibble: 1 x 1
#>       n
#>   <int>
#> 1     8

# summary statistics for wind
summary(select(amy75, wind))
#>       wind     
#>  Min.   :25.0  
#>  1st Qu.:31.2  
#>  Median :50.0  
#>  Mean   :46.5  
#>  3rd Qu.:60.0  
#>  Max.   :60.0

# summary statistics for pressure
summary(select(amy75, pressure))
#>     pressure   
#>  Min.   : 981  
#>  1st Qu.: 986  
#>  Median : 987  
#>  Mean   : 995  
#>  3rd Qu.:1006  
#>  Max.   :1013
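
As a side note, similar summaries can also be computed with "dplyr" itself, using summarise() on the entire table (no group_by() needed); here is a sketch for wind:

# min, mean, and max wind speed of Amy via summarise()
summarise(
  amy75,
  min_wind = min(wind),
  avg_wind = mean(wind),
  max_wind = max(wind)
)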

9.3.8 Summary

So far, we’ve covered several functions from "dplyr", as well as some other functions in R:

  • functions from "dplyr"
    • pull() and select()
    • filter()
    • group_by()
    • arrange() and desc()
    • count(), distinct(), summarise()
  • functions in base R
    • unique(), sort(), mean(), summary()

9.4 Exercises

1) Use "dplyr" functions/commands to create a table (e.g. tibble) storm_names_1980s containing the name and year of storms recorded during the 1980s (i.e. from 1980 to 1989).

2) Use "dplyr" functions/commands to create a table (e.g. tibble) storms_per_year containing the number of storms recorded in each year (i.e. counts or frequencies of storms in each year). This table should contain two columns: year values in the first column, and number of storms in the second column.

3) Use "dplyr" functions/commands to create a table (e.g. tibble) storm_records_per_year containing three columns: 1) name of storm, 2) year of storm, and 3) count for number of records (of the corresponding storm).

4) Use "dplyr" functions/commands to display the different (unique) types of storm status.

5) Use "dplyr" functions/commands to display the different types of storm categories.

6) Use "dplyr" functions/commands to create a table (e.g. tibble) storms_categ5 containing the name and year of those storms of category 5.

7) Use "dplyr" functions/commands to display a table showing the status, avg_pressure (average pressure), and avg_wind (average wind speed), for each type of storm category. This table should contain four columns: 1) category, 2) status, 3) avg_pressure, and 4) avg_wind.

8) Use "dplyr" functions/commands to create a table (e.g. tibble) max_wind_per_storm containing three columns: 1) year of storm, 2) name of storm, and 3) max_wind maximum wind speed record (for that storm).

9) Use "dplyr" functions/commands to create a table (e.g. tibble) max_wind_per_year containing three columns: 1) year of storm, 2) name of storm, and 3) wind maximum wind speed record (for that year). Arrange rows by wind speed in decreasing order.