Practice: Linear Regression

Instructions: starting code

In a Quarto document file, run the following commands:

# packages
library(tidyverse)

Data Frame states

We are going to use the following data frame, states, with some statistics about U.S. states from the 1970s.

# data set with some statistics of U.S. states
states1 = data.frame(name = rownames(state.x77))
states2 = as.data.frame(state.x77[ ,2:7])
states = cbind(states1, states2)
colnames(states) = c("name", "income", "illiteracy", 
                  "life_exp", "murder", "hs_grad", "frost")
rownames(states) = 1:nrow(states)

slice(states, 1:10)
          name income illiteracy life_exp murder hs_grad frost
1      Alabama   3624        2.1    69.05   15.1    41.3    20
2       Alaska   6315        1.5    69.31   11.3    66.7   152
3      Arizona   4530        1.8    70.55    7.8    58.1    15
4     Arkansas   3378        1.9    70.66   10.1    39.9    65
5   California   5114        1.1    71.71   10.3    62.6    20
6     Colorado   4884        0.7    72.06    6.8    63.9   166
7  Connecticut   5348        1.1    72.48    3.1    56.0   139
8     Delaware   4809        0.9    70.06    6.2    54.6   103
9      Florida   4815        1.3    70.66   10.7    52.6    11
10     Georgia   4091        2.0    68.54   13.9    40.6    60


Variables in Data:

  • income: per capita income (1974)

  • illiteracy: illiteracy (1970, percent of population)

  • life_exp: life expectancy in years (1969–71)

  • murder: murder and non-negligent manslaughter rate per 100,000 population (1976)

  • hs_grad: percent high-school graduates (1970)

  • frost: mean number of days with minimum temperature below freezing (1931–1960) in capital or large city

Relationship between income and illiteracy

We want to study the relationship between per-capita income and the percentage of illiteracy.

Scatter plot

Use ggplot commands to make a scatter plot to visualize the relationship between income (x-axis) and illiteracy (y-axis), displaying the name of each state.

ggplot(_______, aes(x = _________, y = ___________)) + 
  geom_point() +
  geom_text(aes(label = _______), size = 3, alpha = 0.5)
Show answer
ggplot(states, aes(x = income, y = illiteracy)) + 
  geom_point() +
  geom_text(aes(label = name), size = 3, alpha = 0.5)

Correlation Coefficient

Complete the following pipeline to compute the correlation between income and illiteracy with cor().

# correlation coefficient
states |> summarize(___________________)
Show answer
# correlation coefficient
states |>
  summarize(correl = cor(income, illiteracy))

True or False: based on the states data and the preceding scatter plot …

  1. Having a larger proportion of literate people in a state creates an increase in per-capita income.

  2. States with larger per-capita incomes tend to have lower percentages of illiterate population.

  3. High levels of per-capita income produce more educated people.

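  1. False. The verb “creates” implies causation.
  2. True. The statement describes a tendency (an association), which is what the negative correlation supports.
  3. False. The verb “produce” implies causation.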
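
One reason statements 1 and 3 cannot be read off the data is that the correlation coefficient is symmetric: it gives the same value whichever variable is listed first, so it carries no information about direction. The following is a minimal sketch of that check in base R. The names r_xy and r_yx are introduced here only for illustration, and the data frame is rebuilt in a compact way so the block runs on its own.

```r
# Rebuild the states data so this block runs on its own
states <- data.frame(name = rownames(state.x77), state.x77[, 2:7])
colnames(states) <- c("name", "income", "illiteracy",
                      "life_exp", "murder", "hs_grad", "frost")

# cor() gives the same value regardless of which variable comes first
r_xy <- cor(states$income, states$illiteracy)
r_yx <- cor(states$illiteracy, states$income)

r_xy
r_yx

# TRUE: swapping the variables does not change the correlation
all.equal(r_xy, r_yx)
```

A negative value here matches the downward trend in the scatter plot, but it only describes how the two variables move together.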

Linear Regression Model

Use the lm() function to fit a linear model between income (x) and illiteracy (y)

mod1 <- lm(____________________, data = ______)
mod1
Show answer
mod1 = lm(formula = illiteracy ~ income, data = states)
mod1

Based on the output of your model mod1, provide a verbal interpretation of the obtained coefficients.

Show answer
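# Answer: For every additional dollar of per-capita income, the percentage of
# illiterate population tends to decrease by about 0.0004 percentage points.
# The intercept (about 3.09) is the predicted illiteracy percentage for a
# per-capita income of zero, which lies far outside the observed incomes.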
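
Because the slope is so small per dollar, it can help to look at predictions for incomes that are further apart. The sketch below is one possible way to do this, using coef() and predict(). The income values 4000 and 5000 are hypothetical, chosen only for illustration, and the names b, new_incomes, and preds are not part of the original exercise.

```r
# Rebuild the states data so this block runs on its own
states <- data.frame(name = rownames(state.x77), state.x77[, 2:7])
colnames(states) <- c("name", "income", "illiteracy",
                      "life_exp", "murder", "hs_grad", "frost")

mod1 <- lm(illiteracy ~ income, data = states)

# named vector with elements "(Intercept)" and "income"
b <- coef(mod1)
b

# predicted illiteracy at two hypothetical income levels, 1000 dollars apart
new_incomes <- data.frame(income = c(4000, 5000))
preds <- predict(mod1, newdata = new_incomes)
preds

# the gap between the two predictions equals 1000 times the slope
diff(preds)
1000 * b["income"]
```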

Scatter Plot with Regression Line

  1. Re-graph the scatter plot of income (x) and illiteracy (y), adding the line of the linear model with geom_abline().
# visualize regression line
ggplot(______, aes(x = ______, y = _______)) + 
  geom_point() +
  geom_text(aes(label = ______), size = 3, alpha = 0.5) +
  geom_abline(slope = ______, intercept = ______)
Show answer
# visualize regression line
ggplot(states, aes(x = income, y = illiteracy)) + 
  geom_point() +
  geom_text(aes(label = name), size = 3, alpha = 0.5) +
  geom_abline(slope = -0.00043, 
              intercept = 3.0932)
  2. Instead of using geom_abline(), use geom_smooth(method = "lm", se = FALSE, formula = y ~ x) and compare the result with your previous graphic.
Show answer
# visualize regression line
ggplot(states, aes(x = income, y = illiteracy)) + 
  geom_point() +
  geom_text(aes(label = name), size = 3, alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x)
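
The geom_abline() answer above types the slope and intercept by hand, which loses digits and has to be redone if the model changes. A possible alternative, sketched here under the assumption that the model is refit in the same block, is to pull both values from the fitted object with coef(). The names b and p are introduced only for this example.

```r
library(tidyverse)

# Rebuild the states data so this block runs on its own
states <- data.frame(name = rownames(state.x77), state.x77[, 2:7])
colnames(states) <- c("name", "income", "illiteracy",
                      "life_exp", "murder", "hs_grad", "frost")

mod1 <- lm(illiteracy ~ income, data = states)
b <- coef(mod1)

# same plot as before, with slope and intercept taken from the model
p <- ggplot(states, aes(x = income, y = illiteracy)) +
  geom_point() +
  geom_text(aes(label = name), size = 3, alpha = 0.5) +
  geom_abline(slope = b[["income"]], intercept = b[["(Intercept)"]])
p
```

The resulting line should coincide with the one drawn by geom_smooth(method = "lm"), since both come from the same least-squares fit.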

Predicting life_exp in terms of murder and hs_plus55

Now let’s predict life expectancy (life_exp) in terms of murder rate (murder) and an indicator variable derived from high-school graduates, specifically percent of HS graduates greater than or equal to 55 (hs_plus55).

This analysis requires mutating the states data to add the categorical variable hs_plus55, which indicates whether a state had a high-school graduation rate greater than or equal to 55 percent. The mutation uses the logical condition hs_grad >= 55.

Show answer
# mutate: add variable "hs_plus55"
states <- states |> 
  mutate(hs_plus55 = hs_grad >= 55)
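
After a mutation like this, it is worth confirming that the new column is logical and seeing how many states fall on each side of the cutoff. The following is a small sketch of such a check with count(). It repeats the mutation so that the block is self-contained.

```r
library(tidyverse)

# Rebuild the states data so this block runs on its own
states <- data.frame(name = rownames(state.x77), state.x77[, 2:7])
colnames(states) <- c("name", "income", "illiteracy",
                      "life_exp", "murder", "hs_grad", "frost")

states <- states |>
  mutate(hs_plus55 = hs_grad >= 55)

# the new column holds TRUE/FALSE values
class(states$hs_plus55)

# number of states below (FALSE) and at or above (TRUE) the 55 percent cutoff
states |> count(hs_plus55)
```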

Scatter plot

Make a scatter plot of "murder" (x-axis) and "life_exp" (y-axis), color coding the points by "hs_plus55"

# scatter plot, color coding by hs_plus55
Show answer
# scatter plot, color coding by hs_plus55
ggplot(states, aes(x = murder, y = life_exp, color = hs_plus55)) +
  geom_point()

Multiple Regression Model

Fit a linear model predicting "life_exp" in terms of "murder" and "hs_plus55".

Show answer
# linear model predicting "life_exp" in terms of "murder" and "hs_plus55"
m2 = lm(life_exp ~ murder + hs_plus55, data = states)
m2

Provide a verbal interpretation for the coefficients in the above model.

Show answer
# Verbal interpretation of "murder" coefficient:
# For two states with the same "hs_plus55" level, a one-unit increase in
# "murder" is associated with a decrease of 0.2555 years in "life_exp"

# Verbal interpretation of "hs_plus55" coefficient:
# For states with the same "murder" rate, having a HS-graduation percentage
# greater than or equal to 55% tends to be associated with an increase of
# 0.5588 years in "life_exp"
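
The "same murder rate" reading of the hs_plus55 coefficient can be checked numerically. The sketch below predicts life expectancy for two hypothetical states that share a murder rate (the value 8 is arbitrary) and differ only in hs_plus55. The names pair, preds, and b are assumptions made for this illustration.

```r
# Rebuild the states data so this block runs on its own
states <- data.frame(name = rownames(state.x77), state.x77[, 2:7])
colnames(states) <- c("name", "income", "illiteracy",
                      "life_exp", "murder", "hs_grad", "frost")
states$hs_plus55 <- states$hs_grad >= 55

m2 <- lm(life_exp ~ murder + hs_plus55, data = states)

# coefficient names: "(Intercept)", "murder", "hs_plus55TRUE"
b <- coef(m2)
b

# two hypothetical states with the same murder rate, differing only in hs_plus55
pair <- data.frame(murder = c(8, 8), hs_plus55 = c(FALSE, TRUE))
preds <- predict(m2, newdata = pair)
preds

# the gap between the two predictions equals the hs_plus55 coefficient
diff(preds)
b["hs_plus55TRUE"]
```

Changing the shared murder rate moves both predictions by the same amount, so the gap between them stays the same. This is why the model corresponds to two parallel lines.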

Plotting Challenge

Obtain a scatter plot to visualize the parallel lines associated with the regression model fitted above. Hint: Use two layers of geom_abline(), one layer per regression line.

Show answer
# challenge: scatter plot with fitted lines
ggplot(states, aes(x = murder, y = life_exp, 
                   color = hs_plus55, group = hs_plus55)) +
  geom_point() +
  geom_abline(slope = m2$coefficients[2], 
              intercept = m2$coefficients[1] + m2$coefficients[3],
              color = "turquoise") +
  geom_abline(slope = m2$coefficients[2], 
              intercept = m2$coefficients[1],
              color = "tomato")
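
Another possible route to the same picture, sketched here as an alternative rather than as the intended answer, is to store the model's fitted values in the data and connect them with geom_line(). The column name fitted_life_exp and the object names states_fit and p are hypothetical.

```r
library(tidyverse)

# Rebuild the states data so this block runs on its own
states <- data.frame(name = rownames(state.x77), state.x77[, 2:7])
colnames(states) <- c("name", "income", "illiteracy",
                      "life_exp", "murder", "hs_grad", "frost")
states <- states |>
  mutate(hs_plus55 = hs_grad >= 55)

m2 <- lm(life_exp ~ murder + hs_plus55, data = states)

# add the fitted values of the model as a new column
states_fit <- states |>
  mutate(fitted_life_exp = fitted(m2))

# points show the observed data; one fitted line is drawn per hs_plus55 group
p <- ggplot(states_fit, aes(x = murder, y = life_exp, color = hs_plus55)) +
  geom_point() +
  geom_line(aes(y = fitted_life_exp))
p
```

With this approach the line colors follow the same color scale as the points, and each line spans only the range of murder rates observed in its group, whereas geom_abline() extends across the whole panel.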