3 Data Set storms

For a gentle introduction, we are going to use the data set storms, which comes with one of the most popular tidyverse packages: "dplyr". This package provides a large number of functions that allow us to manipulate tables in a consistent and user-friendly way. We will also start using some of the graphing functions from the package "ggplot2" to produce simple visualizations.

3.1 Atlantic Hurricane Data

The aforementioned data set storms is a curated table from the so-called Hurricane Databases (HURDAT), which is a collection of databases managed by the National Hurricane Center (NHC).

  • HURDAT involves two databases: one for storms occurring in the Atlantic Ocean, and another one for storms occurring in the Eastern Pacific Ocean.

  • HURDAT contains records from 1851 to the present.

  • Keep in mind that in the past (before the 1970s?), tropical depressions that did not develop into tropical storms or hurricanes were not included in the database.

An interesting note from Wikipedia: around 1963, NASA’s Apollo space program requested data on the climatological impacts of tropical cyclones on launches of space vehicles at the Kennedy Space Center. The basic data was taken from the National Weather Records North Atlantic Tropical to include data from 1886–1968. As a result of this work, the need for a computerized tropical cyclone database at the National Hurricane Center (NHC) was recognized.

https://en.wikipedia.org/wiki/HURDAT

3.1.1 Data storms

The package "dplyr" contains a data set called storms, which is a subset of the NOAA Atlantic hurricane database best track data. This database is one of several data sets available in the Data Archive of the National Hurricane Center (NHC), which is part of the National Oceanic and Atmospheric Administration (NOAA). If you are curious about the specifications and format of this type of data, you can visit the following link:

http://www.nhc.noaa.gov/data/#hurdat

The data storms includes the positions and attributes of tropical systems in the North Atlantic, measured every six hours during the lifetime of a storm. The period covered depends on the installed version of "dplyr": the records start in 1975, and newer versions add more recent years. The output shown in this chapter has 19,066 rows and runs through 2021.
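Since the contents of storms change between releases, it can help to check which version of "dplyr" you have and which years your copy covers. The following is a minimal sketch; the object name year_range is only illustrative, and the values you see depend on your installation.

```r
library(dplyr)

# Version of "dplyr" currently installed
packageVersion("dplyr")

# First and last year covered by the installed copy of storms
year_range <- range(storms$year)
year_range
```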

Assuming that you’ve loaded "tidyverse" (or "dplyr") in R, when you type the name of the data object, you would get something like this:

storms
## # A tibble: 19,066 × 13
##    name   year month   day  hour   lat  long status      category  wind pressure
##    <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>          <dbl> <int>    <int>
##  1 Amy    1975     6    27     0  27.5 -79   tropical d…       NA    25     1013
##  2 Amy    1975     6    27     6  28.5 -79   tropical d…       NA    25     1013
##  3 Amy    1975     6    27    12  29.5 -79   tropical d…       NA    25     1013
##  4 Amy    1975     6    27    18  30.5 -79   tropical d…       NA    25     1013
##  5 Amy    1975     6    28     0  31.5 -78.8 tropical d…       NA    25     1012
##  6 Amy    1975     6    28     6  32.4 -78.7 tropical d…       NA    25     1012
##  7 Amy    1975     6    28    12  33.3 -78   tropical d…       NA    25     1011
##  8 Amy    1975     6    28    18  34   -77   tropical d…       NA    30     1006
##  9 Amy    1975     6    29     0  34.4 -75.8 tropical s…       NA    35     1004
## 10 Amy    1975     6    29     6  34   -74.8 tropical s…       NA    40     1002
## # ℹ 19,056 more rows
## # ℹ 2 more variables: tropicalstorm_force_diameter <int>,
## #   hurricane_force_diameter <int>

Let’s describe what’s going on in the above output.

  • As you can tell, storms is a tibble object, which is one of the data objects in R that handles data in tabular format.

  • tibbles are not a native R object; instead they are a modern version of data frames, and their related functions come from the package of the same name, "tibble".

The way tibbles are printed or displayed is very interesting.

  • the number of rows that are displayed is limited to 10;

  • depending on the width of the printing space, you will only see as many columns as fit in that width;

  • underneath the name of each column there is a short abbreviation (usually three letters) inside angle brackets;

  • this abbreviation indicates the data type used by R to store the values:

    • <chr> stands for character data
    • <dbl> means double (i.e. real numbers or numbers with decimal digits)
    • <int> means integer (numbers with no decimal digits)
    • <fct> indicates a factor, which is how R handles categorical data
    • <ord> indicates an ordered factor, used for categories that have a natural order
    • <lgl> indicates logical or boolean values (i.e. TRUE and FALSE)
  • notice that the last three lines indicate the number of additional rows as well as the number of additional columns and their names.
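The printing behavior described above can be adjusted. As a rough sketch, the print() method for tibbles accepts the arguments n (number of rows) and width, and glimpse() gives a transposed view with one line per column. The particular values used here (5 rows, unlimited width) are arbitrary choices for illustration.

```r
library(dplyr)

# storms carries the tibble classes on top of data.frame
class(storms)

# Show only the first 5 rows, but let every column be printed
print(storms, n = 5, width = Inf)

# One line per column: name, type abbreviation, and first values
glimpse(storms)
```

With width = Inf no columns are relegated to the footer, which is handy when the console is narrow.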

3.1.2 storms Documentation

You can find a more technical description of storms by taking a peek at its manual (or help) documentation. All you need to do is run this command:

?storms

Here’s a full description of all the columns:

  • name: Storm name

  • year, month, and day: Date of report

  • hour: Hour of report (in UTC)

  • lat: Latitude

  • long: Longitude

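  • status: Storm classification (Tropical Depression, Tropical Storm, or Hurricane); in the version printed above, this is a factor with 9 levels that also includes values such as "disturbance" and "other low"

  • category: Saffir-Simpson hurricane category, estimated from wind speed. In the version printed above, the value is NA for systems that are not hurricanes, such as tropical depressions and tropical storms; older versions of "dplyr" used -1 = Tropical Depression and 0 = Tropical Storm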

  • wind: storm’s maximum sustained wind speed (in knots)

  • pressure: Air pressure at the storm’s center (in millibars)

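  • tropicalstorm_force_diameter (ts_diameter in older versions of "dplyr"): Diameter of the area experiencing tropical storm strength winds (34 knots or above)

  • hurricane_force_diameter (hu_diameter in older versions of "dplyr"): Diameter of the area experiencing hurricane strength winds (64 knots or above)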

Because the columns can differ between versions of "dplyr", it is worth checking the help documentation of your installed version to confirm the description of the variables in storms.
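Besides reading the help page, you can check the columns directly in your own session. The sketch below lists the column names and tabulates the values of status and category; the exact names and values you get depend on the installed version of "dplyr", so no output is shown here.

```r
library(dplyr)

# Column names in the installed copy of storms
names(storms)

# How many six-hourly reports fall under each storm classification
table(storms$status)

# Distribution of category, keeping missing values visible
table(storms$category, useNA = "ifany")
```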

Some Remarks

  • The data table storms is already in R; later you will learn how to import tables into R.

  • The table is already clean; there’s no need to fix weird values or convert from one data type to another.

  • Not only is the table clean, it is also tidy, which is the technical term to indicate that:

    • every column is a variable.
    • every row is an observation.
    • every cell is a single value.

3.2 General Inspection

When dealing with a data table, especially for the first time, I like to do a quick inspection of the general structure of the data: the number of rows and columns, and the name and data type of each column. Sometimes I also take a quick look at a few rows at the top or at the bottom of the table. There is a handful of functions for doing all these things:

  • str(): to get a summary of the table’s structure

  • dim(): to get the dimensions (number of rows and columns)

  • nrow(): to get just the number of rows

  • ncol(): to get just the number of columns

  • names(): to get the column names; there’s also colnames()

  • head(): to look at a few first rows

  • tail(): to look at a few last rows

For instance, to get a general summary of the table’s structure, we can use str() and its argument vec.len = 1 to reduce the amount of output:

str(storms, vec.len = 1)
## tibble [19,066 × 13] (S3: tbl_df/tbl/data.frame)
##  $ name                        : chr [1:19066] "Amy" ...
##  $ year                        : num [1:19066] 1975 ...
##  $ month                       : num [1:19066] 6 6 ...
##  $ day                         : int [1:19066] 27 27 ...
##  $ hour                        : num [1:19066] 0 6 ...
##  $ lat                         : num [1:19066] 27.5 28.5 ...
##  $ long                        : num [1:19066] -79 -79 ...
##  $ status                      : Factor w/ 9 levels "disturbance",..: 7 7 ...
##  $ category                    : num [1:19066] NA NA ...
##  $ wind                        : int [1:19066] 25 25 ...
##  $ pressure                    : int [1:19066] 1013 1013 ...
##  $ tropicalstorm_force_diameter: int [1:19066] NA NA ...
##  $ hurricane_force_diameter    : int [1:19066] NA NA ...

Likewise, to explore the dimensions, that is, the number of rows and columns, you can invoke dim():

dim(storms)
## [1] 19066    13

Alternatively, you can also call nrow() or ncol() if you prefer to get just the number of rows or just the number of columns:

nrow(storms)
## [1] 19066
ncol(storms)
## [1] 13
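The column names can be inspected in a similar fashion with names() or colnames(). The following sketch stores them in a variable, here called col_names purely for illustration, and checks that both functions agree and that there is one name per column.

```r
library(dplyr)

# Character vector with the name of every column
col_names <- names(storms)
col_names

# For a data frame or tibble, colnames() returns the same vector
identical(col_names, colnames(storms))

# There is exactly one name per column
length(col_names) == ncol(storms)
```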

Often, I like to use head() and/or tail() to see the first and/or the last rows of a table. In this way I can get an idea of what the data looks like without having to print all the entries.

tail(storms)
## # A tibble: 6 × 13
##   name   year month   day  hour   lat  long status       category  wind pressure
##   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>           <dbl> <int>    <int>
## 1 Wanda  2021    11     6    18  37.1 -38   tropical st…       NA    35     1002
## 2 Wanda  2021    11     7     0  37.4 -37.4 tropical st…       NA    35     1003
## 3 Wanda  2021    11     7     6  38.1 -36.4 tropical st…       NA    35     1004
## 4 Wanda  2021    11     7    12  39.2 -34.9 other low          NA    35     1006
## 5 Wanda  2021    11     7    18  40.9 -32.8 other low          NA    40     1006
## 6 Wanda  2021    11     8     0  43.2 -29.7 other low          NA    40     1006
## # ℹ 2 more variables: tropicalstorm_force_diameter <int>,
## #   hurricane_force_diameter <int>
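By default head() and tail() return six rows, but both accept an argument n to change that number. Here is a small sketch; the choice of three rows and the names first_rows and last_rows are just for illustration.

```r
library(dplyr)

# First three rows of the table
first_rows <- head(storms, n = 3)
first_rows

# Last three rows of the table
last_rows <- tail(storms, n = 3)
last_rows
```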