3 Data Set storms
For a gentle introduction, we are going to use the data set storms that comes with one of the most popular tidyverse packages: "dplyr". This package contains a large number of functions that allow us to manipulate tables in a consistent and user-friendly way. We will also start using some of the graphing functions from the package "ggplot2" to produce simple visualizations.
3.1 Atlantic Hurricane Data
The aforementioned data set storms is a curated table from the so-called Hurricane Databases (HURDAT), a collection of databases managed by the National Hurricane Center (NHC).
HURDAT involves two databases: one for storms occurring in the Atlantic Ocean, and another one for storms occurring in the Eastern Pacific Ocean.
HURDAT contains records from 1851 to the present.
Keep in mind that in the past (before the 1970s?), tropical depressions that did not develop into tropical storms or hurricanes were not included in the database.
An interesting note from Wikipedia: around 1963, NASA’s Apollo space program requested data on the climatological impacts of tropical cyclones on launches of space vehicles at the Kennedy Space Center. The basic data was taken from the National Weather Records North Atlantic Tropical to include data from 1886–1968. As a result of this work, a requirement for a computerized tropical cyclone database at the National Hurricane Center (NHC) was realized.
https://en.wikipedia.org/wiki/HURDAT
3.1.1 Data storms
The package "dplyr" contains a data set called storms, which is a subset of the NOAA Atlantic hurricane database best track data. This database is one of several data sets available in the National Hurricane Center (NHC) Data Archive, which is part of the National Oceanic and Atmospheric Administration (NOAA). In case you are curious about the specifications and format of this type of data, you can visit the following link:
http://www.nhc.noaa.gov/data/#hurdat
The data storms includes the positions and attributes of tropical systems in the North Atlantic, measured every six hours during the lifetime of a storm. The period covered depends on your version of "dplyr": with version 1.0.10 the records run from 1975 to 2020, while the version used to produce the output in this chapter also includes storms from 2021.
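Because the contents of storms depend on the installed version of "dplyr", it can help to check what your own copy covers. The following is a minimal sketch; the object name year_range is just an illustrative choice, and the values you get may differ from the ones discussed in this chapter.

```r
library(dplyr)

# Version of dplyr currently installed
packageVersion("dplyr")

# First and last year covered by the copy of storms shipped with that version
year_range <- range(storms$year)
year_range
```

If the last year you see is different from the one shown in this chapter, the row counts in the outputs below will differ as well.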
Assuming that you’ve loaded "tidyverse" (or "dplyr") in R, when you type the name of the data object, you get something like this:
storms
## # A tibble: 19,066 × 13
## name year month day hour lat long status category wind pressure
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
## 1 Amy 1975 6 27 0 27.5 -79 tropical d… NA 25 1013
## 2 Amy 1975 6 27 6 28.5 -79 tropical d… NA 25 1013
## 3 Amy 1975 6 27 12 29.5 -79 tropical d… NA 25 1013
## 4 Amy 1975 6 27 18 30.5 -79 tropical d… NA 25 1013
## 5 Amy 1975 6 28 0 31.5 -78.8 tropical d… NA 25 1012
## 6 Amy 1975 6 28 6 32.4 -78.7 tropical d… NA 25 1012
## 7 Amy 1975 6 28 12 33.3 -78 tropical d… NA 25 1011
## 8 Amy 1975 6 28 18 34 -77 tropical d… NA 30 1006
## 9 Amy 1975 6 29 0 34.4 -75.8 tropical s… NA 35 1004
## 10 Amy 1975 6 29 6 34 -74.8 tropical s… NA 40 1002
## # ℹ 19,056 more rows
## # ℹ 2 more variables: tropicalstorm_force_diameter <int>,
## # hurricane_force_diameter <int>
Let’s describe what’s going on in the above output.
As you can tell, storms is a tibble object, which is one of the data objects in R that handle data in tabular format. Tibbles are not a native R object; instead they are a modern version of data frames, and their related functions come from the package of the same name, "tibble".
The way tibbles are printed or displayed is very interesting:
- the number of rows that are displayed is limited to 10;
- depending on the width of the printing space, only as many columns are shown as fit that width;
- underneath the name of each column there is a three-letter abbreviation inside angle brackets; this abbreviation indicates the data type used by R to store the values:
  - <chr> stands for character data
  - <dbl> means double (i.e. real numbers or numbers with decimal digits)
  - <int> means integer (numbers with no decimal digits)
  - <fct> indicates a factor, which is how R handles categorical data, and <ord> indicates an ordered (ordinal) factor
  - <lgl> indicates logical or boolean values (i.e. TRUE and FALSE)
- the last three lines indicate the number of additional rows, as well as the number of additional columns and their names.
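The printing behavior described above can be adjusted. Here is a small sketch, assuming "dplyr" is installed: it confirms that storms is a tibble, prints fewer rows while showing all columns, and lists the class of each column. The names is_tbl and col_classes are only illustrative.

```r
library(dplyr)

# A tibble carries the classes "tbl_df" and "tbl" on top of "data.frame"
class(storms)
is_tbl <- inherits(storms, "tbl_df")

# Print only 3 rows, and all columns regardless of the console width
print(storms, n = 3, width = Inf)

# First class of each column; these correspond to the <...> abbreviations
col_classes <- vapply(storms, function(x) class(x)[1], character(1))
col_classes
```

The arguments n and width of print() control the number of rows and the printing width, respectively; width = Inf asks for every column to be displayed.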
3.1.2 storms Documentation
You can find a more technical description of storms by taking a peek at its manual (or help) documentation. All you need to do is run this command:
?storms
Here’s a description of all the columns:
- name: Storm name
- year, month, and day: Date of report
- hour: Hour of report (in UTC)
- lat: Latitude
- long: Longitude
- status: Storm classification (for example tropical depression, tropical storm, or hurricane)
- category: Saffir-Simpson hurricane category, estimated from wind speed; in the output shown in this chapter it is NA for systems that are not hurricanes (older versions of the data used -1 = Tropical Depression, 0 = Tropical Storm)
- wind: Storm’s maximum sustained wind speed (in knots)
- pressure: Air pressure at the storm’s center (in millibars)
- tropicalstorm_force_diameter: Diameter of the area experiencing tropical storm strength winds (34 knots or above); named ts_diameter in older versions of "dplyr"
- hurricane_force_diameter: Diameter of the area experiencing hurricane strength winds (64 knots or above); named hu_diameter in older versions of "dplyr"
You can take a look at the manual (or help) documentation to confirm the description of the variables in data storms.
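Besides reading the help page, you can compare the descriptions with the data itself. The sketch below, which assumes "dplyr" is loaded, lists the distinct values of status and counts how often each value of category occurs, including missing values. The names status_values and category_counts are illustrative, and the exact values depend on your version of the data.

```r
library(dplyr)

# Distinct storm classifications present in the data
status_values <- unique(as.character(storms$status))
status_values

# How often each category occurs; useNA = "ifany" also counts NA entries
category_counts <- table(storms$category, useNA = "ifany")
category_counts
```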
Some Remarks
- The data table storms is already in R; later you will learn how to import tables in R.
- The table is already clean: there’s no need to fix weird values or to transform from one data type to another.
- Not only is the table clean, it is also tidy, which is the technical term to indicate that:
  - every column is a variable.
  - every row is an observation.
  - every cell is a single value.
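Tidiness is a matter of how the data is organized, so code cannot fully verify it. Still, a rough check is possible. The sketch below, with the illustrative names has_list_cols and na_counts, tests whether any column is a list (which would allow a cell to hold more than one value) and counts the missing values in each column.

```r
library(dplyr)

# A list-column would mean that some cells can hold more than one value
has_list_cols <- any(vapply(storms, is.list, logical(1)))
has_list_cols

# Number of missing values (NA) in each column
na_counts <- colSums(is.na(storms))
na_counts
```

Keep in mind that a clean table can still contain missing values; NA simply marks entries that were not recorded or do not apply.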
3.2 General Inspection
When dealing with a data table, especially for the first time, I like to do a quick inspection of the general structure of the data: the number of rows and columns, the name and data type of each column, and sometimes also a few rows at the top or at the bottom of the table. To do all these things there is a handful of functions:
- str(): to get a summary of the table’s structure
- dim(): to get the dimensions (number of rows and columns)
- nrow(): to get just the number of rows
- ncol(): to get just the number of columns
- names(): to get the column names; there’s also colnames()
- head(): to look at the first few rows
- tail(): to look at the last few rows
For instance, to get a general summary of the table’s structure, we can use str() and its argument vec.len = 1 to reduce the amount of output:
str(storms, vec.len = 1)
## tibble [19,066 × 13] (S3: tbl_df/tbl/data.frame)
## $ name : chr [1:19066] "Amy" ...
## $ year : num [1:19066] 1975 ...
## $ month : num [1:19066] 6 6 ...
## $ day : int [1:19066] 27 27 ...
## $ hour : num [1:19066] 0 6 ...
## $ lat : num [1:19066] 27.5 28.5 ...
## $ long : num [1:19066] -79 -79 ...
## $ status : Factor w/ 9 levels "disturbance",..: 7 7 ...
## $ category : num [1:19066] NA NA ...
## $ wind : int [1:19066] 25 25 ...
## $ pressure : int [1:19066] 1013 1013 ...
## $ tropicalstorm_force_diameter: int [1:19066] NA NA ...
## $ hurricane_force_diameter : int [1:19066] NA NA ...
Likewise, to explore the dimensions, that is, the number of rows and columns, you can invoke dim():
dim(storms)
## [1] 19066 13
Alternatively, you can also call nrow() or ncol() if you prefer to get just the number of rows or just the number of columns:
nrow(storms)
## [1] 19066
ncol(storms)
## [1] 13
Often, I like to use head() and/or tail() to see the first and/or the last rows of a table. In this way I can get an idea of what the data looks like without having to print all the entries.
tail(storms)
## # A tibble: 6 × 13
## name year month day hour lat long status category wind pressure
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
## 1 Wanda 2021 11 6 18 37.1 -38 tropical st… NA 35 1002
## 2 Wanda 2021 11 7 0 37.4 -37.4 tropical st… NA 35 1003
## 3 Wanda 2021 11 7 6 38.1 -36.4 tropical st… NA 35 1004
## 4 Wanda 2021 11 7 12 39.2 -34.9 other low NA 35 1006
## 5 Wanda 2021 11 7 18 40.9 -32.8 other low NA 40 1006
## 6 Wanda 2021 11 8 0 43.2 -29.7 other low NA 40 1006
## # ℹ 2 more variables: tropicalstorm_force_diameter <int>,
## # hurricane_force_diameter <int>
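The functions names() and head() from the list of inspection functions work in the same way. Here is a short sketch, assuming "dplyr" is loaded; the names col_names and first_rows are only illustrative.

```r
library(dplyr)

# Column names; colnames(storms) returns the same character vector
col_names <- names(storms)
col_names

# First 3 rows instead of the default 6
first_rows <- head(storms, n = 3)
first_rows
```

Both head() and tail() accept the argument n to control how many rows are returned.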