For a gentle introduction, we are going to use the data set storms that comes in one of the most popular tidyverse packages: "dplyr". This package contains a large number of functions that allow us to manipulate tables in a consistent and user-friendly way. We will also start using some of the graphing functions from the package "ggplot2" to produce simple visualizations.
Atlantic Hurricane Data
The aforementioned data set storms is a curated table from the so-called Hurricane Databases (HURDAT), a collection of databases managed by the National Hurricane Center (NHC).
HURDAT involves two databases: one for storms occurring in the Atlantic Ocean, and another one for storms occurring in the Eastern Pacific Ocean.
HURDAT contains records from 1851 to the present.
Keep in mind that in the past (possibly before the 1970s), tropical depressions that did not develop into tropical storms or hurricanes were not included in the database.
An interesting note from Wikipedia: around 1963, NASA’s Apollo space program requested data on the climatological impacts of tropical cyclones on launches of space vehicles at the Kennedy Space Center. The basic data was taken from the National Weather Records North Atlantic Tropical to include data from 1886–1968. As a result of this work, a requirement for a computerized tropical cyclone database at the National Hurricane Center (NHC) was realized.
https://en.wikipedia.org/wiki/HURDAT
Data storms
The package "dplyr" contains a dataset called storms, which is a subset of the NOAA Atlantic hurricane database best track data. This database is one of several data sets available in the National Hurricane Center (NHC) Data Archive, which is part of the National Oceanic and Atmospheric Administration (NOAA). In case you are curious about the specifications and format of this type of data, you can visit the following link:
http://www.nhc.noaa.gov/data/#hurdat
The data storms
includes the positions and attributes of tropical systems in the North Atlantic. If you are using a version of "dplyr"
greater than or equal to 1.0.10
, the storms are from the period 1975 to 2020, measured every six hours during the lifetime of a storm.
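If you are unsure which version you have, a quick check along the following lines may help. This is only a sketch: it assumes "dplyr" is installed, and the range of years you see depends on the installed version.

```r
library(dplyr)

# Installed version of "dplyr"
packageVersion("dplyr")

# TRUE if the installed version is at least 1.0.10
packageVersion("dplyr") >= "1.0.10"

# First and last year covered by storms in your installation
range(storms$year)
```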
Assuming that you’ve loaded "tidyverse" (or "dplyr") in R, when you type the name of the data object, you get something like this:
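A minimal sketch of that step, assuming "dplyr" is installed:

```r
# Load the package that ships the storms data
library(dplyr)

# Typing the name of the object prints it
storms
```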
# A tibble: 19,066 × 13
   name   year month   day  hour   lat  long status      category  wind pressure
   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>          <dbl> <int>    <int>
 1 Amy    1975     6    27     0  27.5 -79   tropical d…       NA    25     1013
 2 Amy    1975     6    27     6  28.5 -79   tropical d…       NA    25     1013
 3 Amy    1975     6    27    12  29.5 -79   tropical d…       NA    25     1013
 4 Amy    1975     6    27    18  30.5 -79   tropical d…       NA    25     1013
 5 Amy    1975     6    28     0  31.5 -78.8 tropical d…       NA    25     1012
 6 Amy    1975     6    28     6  32.4 -78.7 tropical d…       NA    25     1012
 7 Amy    1975     6    28    12  33.3 -78   tropical d…       NA    25     1011
 8 Amy    1975     6    28    18  34   -77   tropical d…       NA    30     1006
 9 Amy    1975     6    29     0  34.4 -75.8 tropical s…       NA    35     1004
10 Amy    1975     6    29     6  34   -74.8 tropical s…       NA    40     1002
# ℹ 19,056 more rows
# ℹ 2 more variables: tropicalstorm_force_diameter <int>,
#   hurricane_force_diameter <int>
Let’s describe what’s going on in the above output.
As you can tell, storms is a tibble object, which is one of the data objects in R that handles data in tabular format.
Tibbles are not a native R object; instead they are a modern version of data frames, and their related functions come from the package of the same name, "tibble".
The way tibbles are printed or displayed is very interesting:
the number of rows that are displayed is limited to 10;
depending on the width of the printing space, you will only see as many columns as fit in that width;
underneath the name of each column there is a three-letter abbreviation inside angle brackets; this abbreviation indicates the data type used by R to store the values:
<chr> stands for character data
<dbl> means double (i.e. real numbers or numbers with decimal digits)
<int> means integer (numbers with no decimal digits)
<fct> indicates a factor, which is how R handles categorical data (<ord> indicates an ordered factor)
<lgl> indicates logical or boolean values (i.e. TRUE and FALSE)
notice that the last three lines indicate the number of additional rows as well as the number of additional columns and their names.
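The printing behavior described above can be adjusted, and the column types can also be queried directly. The following is a minimal sketch, assuming "dplyr" is installed; the name col_types is just an illustrative variable.

```r
library(dplyr)

# Confirm that storms is a tibble (it is also a data frame)
class(storms)

# Print only the first 5 rows, but show every column
print(storms, n = 5, width = Inf)

# Data type (first class) of each column, as a named character vector
col_types <- sapply(storms, function(col) class(col)[1])
col_types
```

Here n controls how many rows are printed and width = Inf asks for all columns regardless of the width of the console.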
storms Documentation
You can find a more technical description of storms by taking a peek at its manual (or help) documentation. All you need to do is run this command:
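The command itself is not reproduced here. A minimal sketch, assuming "dplyr" is installed, is to ask R's help system for the storms topic:

```r
# Open the help page of the storms data set that ships with "dplyr"
help("storms", package = "dplyr")

# In an interactive session, with "dplyr" loaded, the shorthand is:
# ?storms
```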
Here’s a full description of all the columns:
name: Storm name
year, month, and day: Date of report
hour: Hour of report (in UTC)
lat: Latitude
long: Longitude
status: Storm classification (e.g. tropical depression, tropical storm, or hurricane; in the version shown above it is a factor with 9 levels)
category: Saffir-Simpson hurricane category, estimated from wind speed (in the version shown above it is NA for systems that are not hurricanes; older versions of "dplyr" used -1 = Tropical Depression, 0 = Tropical Storm)
wind: Storm’s maximum sustained wind speed (in knots)
pressure: Air pressure at the storm’s center (in millibars)
tropicalstorm_force_diameter (ts_diameter in older versions of "dplyr"): Diameter of the area experiencing tropical storm strength winds (34 knots or above)
hurricane_force_diameter (hu_diameter in older versions of "dplyr"): Diameter of the area experiencing hurricane strength winds (64 knots or above)
You can take a look at the manual (or help) documentation to confirm the description of the variables in the data storms.
General Inspection
When dealing with a data table, especially for the first time, I like to do a quick inspection of the general structure of the data, meaning the number of rows and columns, the name and data type of each column, and sometimes also a few rows either at the top or at the bottom of the table. To do all these things there is a handful of functions:
str(): to get a summary of the table’s structure
dim(): to get the dimensions (number of rows and columns)
nrow(): to get just the number of rows
ncol(): to get just the number of columns
names(): to get the column names; there’s also colnames()
head(): to look at the first few rows
tail(): to look at the last few rows
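As a small illustration of the column-name functions in the list above, here is a sketch that assumes "dplyr" is installed (col_names is just an illustrative variable):

```r
library(dplyr)

# Column names of the table
col_names <- names(storms)
col_names

# For a data frame or tibble, colnames() returns the same vector
identical(col_names, colnames(storms))
```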
For instance, to get a general summary of the table’s structure, we can use str() and its argument vec.len = 1 to simplify the amount of output:
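A minimal sketch of that call, assuming "dplyr" is installed so that storms is available:

```r
library(dplyr)

# vec.len = 1 shortens the preview of values shown for each column
str(storms, vec.len = 1)
```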
tibble [19,066 × 13] (S3: tbl_df/tbl/data.frame)
 $ name                        : chr [1:19066] "Amy" ...
 $ year                        : num [1:19066] 1975 ...
 $ month                       : num [1:19066] 6 6 ...
 $ day                         : int [1:19066] 27 27 ...
 $ hour                        : num [1:19066] 0 6 ...
 $ lat                         : num [1:19066] 27.5 28.5 ...
 $ long                        : num [1:19066] -79 -79 ...
 $ status                      : Factor w/ 9 levels "disturbance",..: 7 7 ...
 $ category                    : num [1:19066] NA NA ...
 $ wind                        : int [1:19066] 25 25 ...
 $ pressure                    : int [1:19066] 1013 1013 ...
 $ tropicalstorm_force_diameter: int [1:19066] NA NA ...
 $ hurricane_force_diameter    : int [1:19066] NA NA ...
Likewise, to explore the dimensions, that is, the number of rows and columns, you can invoke dim():
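A minimal sketch, assuming "dplyr" is installed:

```r
library(dplyr)

# Integer vector of length 2: number of rows, then number of columns
dim(storms)
```

For the version of the data printed earlier, this would report 19066 rows and 13 columns; other versions of "dplyr" may give a different number of rows.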
Alternatively, you can also call nrow() or ncol() if you prefer to get just the number of rows or just the number of columns:
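Again as a sketch that assumes "dplyr" is installed:

```r
library(dplyr)

# Number of rows only
nrow(storms)

# Number of columns only
ncol(storms)
```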
Often, I like to use head() and/or tail() to see the first and/or the last rows of a table. In this way I can get an idea of what the data looks like without having to print all the entries.
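The calls can be as short as the following sketch, which assumes "dplyr" is installed. By default both functions return six rows, and a second argument changes that number; the printout that follows is of the kind produced by tail().

```r
library(dplyr)

# First six rows (the default)
head(storms)

# First three rows only
head(storms, 3)

# Last six rows (the default)
tail(storms)
```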
# A tibble: 6 × 13
  name   year month   day  hour   lat  long status       category  wind pressure
  <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>           <dbl> <int>    <int>
1 Wanda  2021    11     6    18  37.1 -38   tropical st…       NA    35     1002
2 Wanda  2021    11     7     0  37.4 -37.4 tropical st…       NA    35     1003
3 Wanda  2021    11     7     6  38.1 -36.4 tropical st…       NA    35     1004
4 Wanda  2021    11     7    12  39.2 -34.9 other low          NA    35     1006
5 Wanda  2021    11     7    18  40.9 -32.8 other low          NA    40     1006
6 Wanda  2021    11     8     0  43.2 -29.7 other low          NA    40     1006
# ℹ 2 more variables: tropicalstorm_force_diameter <int>,
#   hurricane_force_diameter <int>