Data Pipelines

STAT 20: Introduction to Probability and Statistics

Adapted by Gaston Sanchez

Example: Toy Data

library(dplyr)  # provides filter(), summarize(), select(), group_by(), arrange(), mutate()

wizards <- data.frame(
  name = c("Harry", "Bellatrix", "Hermione", "Draco"),
  house = c("Gryffindor", "Slytherin", "Gryffindor",  "Slytherin"),
  height = c(1.78, 1.57, 1.65, 1.75),
  spells = c(60, 75, 70, 55)
)

wizards
       name      house height spells
1     Harry Gryffindor   1.78     60
2 Bellatrix  Slytherin   1.57     75
3  Hermione Gryffindor   1.65     70
4     Draco  Slytherin   1.75     55

Data Pipelines


Goal: Calculate the average height of characters from Gryffindor


Let’s look at three ways to solve this.

Nesting

summarize(filter(wizards, house == "Gryffindor"),
          mean(height))
  mean(height)
1        1.715


  • Must be read from the inside out

  • Hard to keep track of arguments

Step-by-step

wizards2 <- filter(wizards, house == "Gryffindor")
summarize(wizards2, mean(height))
  mean(height)
1        1.715


  • Have to repeat data frame names

  • Creates unnecessary objects

Using the Pipe Operator

wizards |> 
  filter(house == "Gryffindor") |> 
  summarize(mean(height))
  mean(height)
1        1.715


  • Can be read like an English paragraph

  • Only type the data once

  • No leftover objects
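
The claim that the piped form is just another way of writing the nested form can be checked directly. The sketch below assumes R 4.1 or later (for the native `|>` operator) and that dplyr is installed; the object names `nested` and `piped` are introduced here for illustration only.

```r
library(dplyr)

wizards <- data.frame(
  name = c("Harry", "Bellatrix", "Hermione", "Draco"),
  house = c("Gryffindor", "Slytherin", "Gryffindor", "Slytherin"),
  height = c(1.78, 1.57, 1.65, 1.75),
  spells = c(60, 75, 70, 55)
)

# The parser rewrites a piped expression into a nested call;
# quote() shows the rewritten call without running it
quote(wizards |> filter(house == "Gryffindor"))

# Nested and piped forms of the same computation
nested <- summarize(filter(wizards, house == "Gryffindor"), mean(height))
piped <- wizards |>
  filter(house == "Gryffindor") |>
  summarize(mean(height))

# Compare the two results
identical(nested, piped)
```

The left-hand side of `|>` becomes the first argument of the call on the right, which is why each dplyr verb in the pipeline omits the data frame name.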

Understanding your pipeline

It’s good practice to check the output of each step by breaking the pipe: run the pipeline one line at a time and inspect each intermediate result.


# bad pipe!
wizards |> 
  select(house) |> 
  filter(mean(height))
Error in `filter()`:
ℹ In argument: `mean(height)`.
Caused by error:
! object 'height' not found

# inspecting the pipe: select(house) keeps only the house column,
# so height no longer exists when filter() runs
wizards |> 
  select(house)
       house
1 Gryffindor
2  Slytherin
3 Gryffindor
4  Slytherin
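
One way to break a pipe is to store an intermediate step and look at its columns. The sketch below does that for the failing pipeline. The names `step1` and `fixed` are hypothetical, and the repair shown is only one possibility, since the intended result of the original pipeline is an assumption.

```r
library(dplyr)

wizards <- data.frame(
  name = c("Harry", "Bellatrix", "Hermione", "Draco"),
  house = c("Gryffindor", "Slytherin", "Gryffindor", "Slytherin"),
  height = c(1.78, 1.57, 1.65, 1.75),
  spells = c(60, 75, 70, 55)
)

# Run only the first step and check which columns survive
step1 <- wizards |>
  select(house)
names(step1)
"height" %in% names(step1)

# Hypothetical repair (assumes the goal was an average height):
# keep height in the selection, then summarize it
fixed <- wizards |>
  select(house, height) |>
  summarize(avg_height = mean(height))
fixed
```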

Grouped Operations

wizards
       name      house height spells
1     Harry Gryffindor   1.78     60
2 Bellatrix  Slytherin   1.57     75
3  Hermione Gryffindor   1.65     70
4     Draco  Slytherin   1.75     55


Goal: Calculate the average height of characters in each house.

group_by()

Flags the rows of a data frame as belonging to groups defined by a categorical variable, for use in downstream operations.

wizards |> 
  group_by(house)
# A tibble: 4 × 4
# Groups:   house [2]
  name      house      height spells
  <chr>     <chr>       <dbl>  <dbl>
1 Harry     Gryffindor   1.78     60
2 Bellatrix Slytherin    1.57     75
3 Hermione  Gryffindor   1.65     70
4 Draco     Slytherin    1.75     55

group_by() with summarize()

wizards |> 
  summarize(mean(height))
  mean(height)
1       1.6875

group_by() with summarize()

wizards |> 
  group_by(house) |> 
  summarize(mean(height))
# A tibble: 2 × 2
  house      `mean(height)`
  <chr>               <dbl>
1 Gryffindor           1.72
2 Slytherin            1.66
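
The summary column above is labelled with the backticked expression `mean(height)`, which is awkward to refer to later. A minimal sketch of naming the summaries instead, assuming dplyr is installed; the names `avg_height`, `n_wizards`, and `house_summary` are chosen here for illustration.

```r
library(dplyr)

wizards <- data.frame(
  name = c("Harry", "Bellatrix", "Hermione", "Draco"),
  house = c("Gryffindor", "Slytherin", "Gryffindor", "Slytherin"),
  height = c(1.78, 1.57, 1.65, 1.75),
  spells = c(60, 75, 70, 55)
)

# One row per house, with named summary columns;
# n() counts the rows in each group
house_summary <- wizards |>
  group_by(house) |>
  summarize(
    avg_height = mean(height),
    n_wizards = n()
  )
house_summary
```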

group_by() with filter()

wizards |> 
  filter(height == max(height))
   name      house height spells
1 Harry Gryffindor   1.78     60

group_by() with filter()

wizards |> 
  group_by(house) |> 
  filter(height == max(height))
# A tibble: 2 × 4
# Groups:   house [2]
  name  house      height spells
  <chr> <chr>       <dbl>  <dbl>
1 Harry Gryffindor   1.78     60
2 Draco Slytherin    1.75     55
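
Note that the filtered result is still grouped (see the `# Groups:` line), so later verbs will keep operating within each house. A sketch of dropping the grouping with `ungroup()`, assuming dplyr is installed; the names `tallest` and `tallest_ungrouped` are introduced here for illustration.

```r
library(dplyr)

wizards <- data.frame(
  name = c("Harry", "Bellatrix", "Hermione", "Draco"),
  house = c("Gryffindor", "Slytherin", "Gryffindor", "Slytherin"),
  height = c(1.78, 1.57, 1.65, 1.75),
  spells = c(60, 75, 70, 55)
)

# Tallest wizard in each house; the result keeps its grouping
tallest <- wizards |>
  group_by(house) |>
  filter(height == max(height))
group_vars(tallest)

# ungroup() removes the grouping, so later verbs act on all rows
tallest_ungrouped <- tallest |>
  ungroup()
group_vars(tallest_ungrouped)

# A single overall mean rather than one per house
tallest_ungrouped |>
  summarize(mean(height))
```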

group_by() with arrange()

wizards |> 
  arrange(desc(height))
       name      house height spells
1     Harry Gryffindor   1.78     60
2     Draco  Slytherin   1.75     55
3  Hermione Gryffindor   1.65     70
4 Bellatrix  Slytherin   1.57     75

group_by() with arrange()

wizards |> 
  group_by(house) |> 
  arrange(desc(height))
# A tibble: 4 × 4
# Groups:   house [2]
  name      house      height spells
  <chr>     <chr>       <dbl>  <dbl>
1 Harry     Gryffindor   1.78     60
2 Draco     Slytherin    1.75     55
3 Hermione  Gryffindor   1.65     70
4 Bellatrix Slytherin    1.57     75


By default, arrange() ignores group_by() and sorts the whole data frame; set .by_group = TRUE to sort within groups.
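
To sort within groups instead, `arrange()` accepts a `.by_group` argument. The sketch below assumes dplyr is installed; `sorted` is a name introduced here for illustration.

```r
library(dplyr)

wizards <- data.frame(
  name = c("Harry", "Bellatrix", "Hermione", "Draco"),
  house = c("Gryffindor", "Slytherin", "Gryffindor", "Slytherin"),
  height = c(1.78, 1.57, 1.65, 1.75),
  spells = c(60, 75, 70, 55)
)

# .by_group = TRUE sorts by the grouping variable first,
# then by descending height within each house
sorted <- wizards |>
  group_by(house) |>
  arrange(desc(height), .by_group = TRUE)
sorted
```

With this option the rows come out as Harry, Hermione, Draco, Bellatrix: the Gryffindor rows first, tallest to shortest, followed by the Slytherin rows.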

group_by() with mutate()

wizards |> 
  mutate(height_z = (height - mean(height)) / sd(height))
       name      house height spells   height_z
1     Harry Gryffindor   1.78     60  0.9630715
2 Bellatrix  Slytherin   1.57     75 -1.2233611
3  Hermione Gryffindor   1.65     70 -0.3904344
4     Draco  Slytherin   1.75     55  0.6507240

group_by() with mutate()

wizards |> 
  group_by(house) |> 
  mutate(height_z = (height - mean(height)) / sd(height))
# A tibble: 4 × 5
# Groups:   house [2]
  name      house      height spells height_z
  <chr>     <chr>       <dbl>  <dbl>    <dbl>
1 Harry     Gryffindor   1.78     60    0.707
2 Bellatrix Slytherin    1.57     75   -0.707
3 Hermione  Gryffindor   1.65     70   -0.707
4 Draco     Slytherin    1.75     55    0.707
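
Every grouped z-score above is ±0.707 because each house has only two members: with two values, each one sits half the gap from the group mean, and `sd()` (which divides by n - 1) equals the gap divided by √2, so the ratio is always ±1/√2. A quick base R check on the Gryffindor heights; the vector name `gryffindor` is introduced here for illustration.

```r
# Heights of the two Gryffindor wizards from the toy data
gryffindor <- c(Harry = 1.78, Hermione = 1.65)

# Same formula as in the grouped mutate(), applied to one group by hand
z <- (gryffindor - mean(gryffindor)) / sd(gryffindor)
z

# The magnitude every two-member group will produce
1 / sqrt(2)
```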