11 Summary

So far, we’ve covered several functions from "dplyr", as well as some other functions in R:

functions from "dplyr"
- pull() and select()
- filter()
- group_by()
- arrange() and desc()
- count(), distinct(), summarise()
functions in base R
- unique(), sort(), mean(), summary()

11.1 Number of Storms per Year

If you recall, our first ggplot involved a barchart for the values in column year

ggplot(data = storms) + 
  geom_bar(aes(x = year))

We discovered that the 41-year period of recorded data from 1975 to 2015. We can take a further step and ask: how many storms are there in each year?

To answer this question, we need to do some data manipulation with "dplyr". Our general recommendation when working with "dplyr"’s functions, especially when you are learning about them, is to do computations step by step, deciding which columns you need to use, which rows to consider, which functions to call, and so on.

Think about the columns that we need to select to find the number of unique storms per year. We obviously need year, but this column alone it’s not enough because for any given storm we have multiple records with the same year. Therefore, we also need column name.

For illustration purposes, we are going to build the data manipulation pipeline step by step. As you get more comfortable with "dplyr" and other functions, you won’t have the need to disect every single command.

A first step is to select() variables year and name:

select(storms, year, name)

## # A tibble: 19,066 × 2
##     year name 
##    <dbl> <chr>
##  1  1975 Amy  
##  2  1975 Amy  
##  3  1975 Amy  
##  4  1975 Amy  
##  5  1975 Amy  
##  6  1975 Amy  
##  7  1975 Amy  
##  8  1975 Amy  
##  9  1975 Amy  
## 10  1975 Amy  
## # ℹ 19,056 more rows

Next, we need to group_by() year. At first glance, the previous output and the output below seem identical. But notice the tiny difference: the output below has a second line of text with some relevant information: # Groups: year [41], telling us that the values are grouped by year.

group_by(select(storms, year, name), year)

## # A tibble: 19,066 × 2
## # Groups:   year [47]
##     year name 
##    <dbl> <chr>
##  1  1975 Amy  
##  2  1975 Amy  
##  3  1975 Amy  
##  4  1975 Amy  
##  5  1975 Amy  
##  6  1975 Amy  
##  7  1975 Amy  
##  8  1975 Amy  
##  9  1975 Amy  
## 10  1975 Amy  
## # ℹ 19,056 more rows

Then, we identify the distinct() values (combination of year-name):

distinct(group_by(select(storms, year, name), year))

## # A tibble: 639 × 2
## # Groups:   year [47]
##     year name    
##    <dbl> <chr>   
##  1  1975 Amy     
##  2  1975 Blanche 
##  3  1975 Caroline
##  4  1975 Doris   
##  5  1975 Eloise  
##  6  1975 Faye    
##  7  1975 Gladys  
##  8  1975 Hallie  
##  9  1976 Belle   
## 10  1976 Dottie  
## # ℹ 629 more rows

For convenience purposes, let’s assign this table into its own object, which we can call storms_year_name

storms_year_name <- distinct(group_by(select(storms, year, name), year))

Finally, we need to count() how many storms are in each year:

count(storms_year_name, year)

## # A tibble: 47 × 2
## # Groups:   year [47]
##     year     n
##    <dbl> <int>
##  1  1975     8
##  2  1976     7
##  3  1977     6
##  4  1978    11
##  5  1979     8
##  6  1980    11
##  7  1981    11
##  8  1982     5
##  9  1983     4
## 10  1984    12
## # ℹ 37 more rows

All the previous commands can be assembled together with various embedded lines of code:

storms_per_year <- storms |>
  select(year, name) |>
  group_by(year) |>
  distinct() |>
  count()

storms_per_year

## # A tibble: 47 × 2
## # Groups:   year [47]
##     year     n
##    <dbl> <int>
##  1  1975     8
##  2  1976     7
##  3  1977     6
##  4  1978    11
##  5  1979     8
##  6  1980    11
##  7  1981    11
##  8  1982     5
##  9  1983     4
## 10  1984    12
## # ℹ 37 more rows

Now that we have the counts or frequencies, we can make our next barchart. In this case, we will use the table storms_year_name as the input data for ggplot():

ggplot(data = storms_year_name) + 
  geom_bar(aes(x = year))

By looking at the chart, there are some fairly tall bars. Although it’s hard to see exactly which years have a considerably large number of storms, eyeballing things out it seems that around 1995, 2003, 2005, and 2010 there are 20 or more storms. We can find the actual answer by using arrange(), specifying the counts to be shown in descending order—with desc():

arrange(storms_per_year, desc(n))

## # A tibble: 47 × 2
## # Groups:   year [47]
##     year     n
##    <dbl> <int>
##  1  2020    30
##  2  2005    29
##  3  1995    21
##  4  2003    21
##  5  2010    21
##  6  2011    20
##  7  2021    20
##  8  2012    19
##  9  2000    18
## 10  2017    18
## # ℹ 37 more rows

As you can tell, in the 41-year period from 1975 to 2015, there are two years, 1995 and 2005, with a maximum number of storms equal to 21.

11.2 Your Turn

Use "dplyr" functions/commands to create a table (e.g. tibble) storm_records_per_year containing three columns: 1) name of storm, 2) year of storm, and 3) count for number of recorded valued (of the corresponding storm).
Use "dplyr" functions/commands to create a table (e.g. tibble) storms_categ5 containing the name and year of those storms of category 5.
Use "dplyr" functions/commands to display a table showing the status, avg_pressure (average pressure), and avg_wind (average wind speed), for each type of storm category. This table should contain four columns: 1) category, 2) status, 3) avg_pressure, and 4) avg_wind.
Use "dplyr" functions/commands to create a table (e.g. tibble) max_wind_per_storm containing three columns: 1) year of storm, 2) name of storm, and 3) max_wind maximum wind speed record (for that storm).
Use "dplyr" functions/commands to create a table (e.g. tibble) max_wind_per_year containing three columns: 1) year of storm, 2) name of storm, and 3) wind maximum wind speed record (for that year). Arrange rows by wind speed in decreasing order.