In a Quarto document file, run the following commands:
# packages
library(tidyverse)
Data Frame states
We are going to use the following data frame, states, with some statistics about U.S. states (from the 1970s).
# data set with some statistics of U.S. states
states1 = data.frame(name = rownames(state.x77))
states2 = as.data.frame(state.x77[ , 2:7])
states = cbind(states1, states2)
colnames(states) = c("name", "income", "illiteracy", "life_exp", "murder", "hs_grad", "frost")
rownames(states) = 1:nrow(states)
slice(states, 1:10)
name: name of the state
income: per capita income (1974)
illiteracy: illiteracy (1970, percent of population)
life_exp: life expectancy in years (1969–71)
murder: murder and non-negligent manslaughter rate per 100,000 population (1976)
hs_grad: percent high-school graduates (1970)
frost: mean number of days with minimum temperature below freezing (1931–1960) in capital or large city
Relationship between income and illiteracy
We want to study the relationship between the percent of illiteracy and per-capita income, using income as the predictor.
Scatter plot
Use ggplot commands to make a scatter plot to visualize the relationship between income (x-axis) and illiteracy (y-axis), displaying the name of each state.
Instead of using geom_abline(), use geom_smooth(method = "lm", se = FALSE, formula = y ~ x) and compare the result with your previous graphic.
Show answer
# visualize regression line
ggplot(states, aes(x = income, y = illiteracy)) +
  geom_point() +
  geom_text(aes(label = name), size = 3, alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x)
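For comparison, the geom_abline() version that this step refers to might look like the sketch below. It is an illustration rather than the original solution: the model name m1 and the plot name p_abline are hypothetical, and the block rebuilds states so that it can run on its own.

```r
library(ggplot2)

# rebuild the states data frame as in the section above
states <- cbind(
  data.frame(name = rownames(state.x77)),
  as.data.frame(state.x77[, 2:7])
)
colnames(states) <- c("name", "income", "illiteracy", "life_exp",
                      "murder", "hs_grad", "frost")
rownames(states) <- 1:nrow(states)

# simple regression of illiteracy on income (m1 is a hypothetical name)
m1 <- lm(illiteracy ~ income, data = states)

# draw the fitted line from its estimated intercept and slope
p_abline <- ggplot(states, aes(x = income, y = illiteracy)) +
  geom_point() +
  geom_text(aes(label = name), size = 3, alpha = 0.5) +
  geom_abline(intercept = unname(coef(m1)[1]),
              slope = unname(coef(m1)[2]),
              color = "tomato")
p_abline
```

Both approaches draw the same least-squares line. One visible difference is that geom_abline() extends the line across the whole panel, while geom_smooth() by default draws it only over the observed range of income.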
Predicting life_exp in terms of murder and hs_plus55
Now let’s predict life expectancy (life_exp) in terms of murder rate (murder) and an indicator variable derived from high-school graduates, specifically percent of HS graduates greater than or equal to 55 (hs_plus55).
This analysis requires mutating the states data to add the categorical variable hs_plus55, which indicates whether a state had a high-school graduation rate greater than or equal to 55 percent. This mutation requires the logical condition hs_grad >= 55.
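The mutation described above could be done along these lines. This is a minimal sketch that assumes hs_plus55 is stored as a logical (TRUE/FALSE) column; the block rebuilds states so that it is self-contained.

```r
library(dplyr)

# rebuild the states data frame as in the section above
states <- cbind(
  data.frame(name = rownames(state.x77)),
  as.data.frame(state.x77[, 2:7])
)
colnames(states) <- c("name", "income", "illiteracy", "life_exp",
                      "murder", "hs_grad", "frost")
rownames(states) <- 1:nrow(states)

# add the indicator: TRUE when hs_grad is at least 55 percent
states <- mutate(states, hs_plus55 = hs_grad >= 55)

# how many states fall in each group
count(states, hs_plus55)
```

Keeping the indicator as a logical means ggplot() treats it as a discrete variable for color coding, and lm() reports its coefficient under the name hs_plus55TRUE.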
Make a scatter plot of "murder" (x-axis) and "life_exp" (y-axis), color coding the points by "hs_plus55"
# scatter plot, color coding by hs_plus55
Show answer
# scatter plot, color coding by hs_plus55
ggplot(states, aes(x = murder, y = life_exp, color = hs_plus55)) +
  geom_point()
Multiple Regression Model
Fit a linear model predicting "life_exp" in terms of "murder" and "hs_plus55".
Show answer
# linear model predicting "life_exp" in terms of "murder" and "hs_plus55"
m2 = lm(life_exp ~ murder + hs_plus55, data = states)
m2
Provide a verbal interpretation for the coefficients in the above model.
Show answer
# Verbal interpretation of "murder" coefficient:
# For two states with the same "hs_plus55" level, a unit increase in
# "murder" is associated with a decrease of 0.2555 years in "life_exp"
#
# Verbal interpretation of "hs_plus55" coefficient:
# For states with the same "murder" rate, having a HS-graduation percentage
# greater than or equal to 55% tends to be associated with an increase of
# 0.5588 years in "life_exp"
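One way to check these interpretations is to compare predictions for hypothetical states that differ in only one predictor. The sketch below assumes hs_plus55 is a logical column; the data frame new_states and its murder values (5 and 6) are made up for illustration.

```r
# rebuild the data and the model so the block runs on its own
states <- cbind(
  data.frame(name = rownames(state.x77)),
  as.data.frame(state.x77[, 2:7])
)
colnames(states) <- c("name", "income", "illiteracy", "life_exp",
                      "murder", "hs_grad", "frost")
states$hs_plus55 <- states$hs_grad >= 55
m2 <- lm(life_exp ~ murder + hs_plus55, data = states)

# three hypothetical states (illustrative values, not real data)
new_states <- data.frame(
  murder = c(5, 6, 5),
  hs_plus55 = c(FALSE, FALSE, TRUE)
)
pred <- predict(m2, newdata = new_states)

# one extra unit of murder, same hs_plus55 level:
# the difference equals the "murder" coefficient
murder_effect <- unname(pred[2] - pred[1])

# same murder rate, hs_plus55 switched from FALSE to TRUE:
# the difference equals the "hs_plus55TRUE" coefficient
hs_effect <- unname(pred[3] - pred[1])

c(murder = murder_effect, hs_plus55 = hs_effect)
coef(m2)
```

Because the model has no interaction term, the murder effect is the same in both hs_plus55 groups, which is why the two fitted lines are parallel.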
Plotting Challenge
Obtain a scatter plot to visualize the parallel lines associated with the regression model fitted above. Hint: Use two layers of geom_abline(), one layer per regression line.
Show answer
# challenge: scatter plot with fitted lines
ggplot(states, aes(x = murder, y = life_exp, color = hs_plus55, group = hs_plus55)) +
  geom_point() +
  geom_abline(slope = m2$coefficients[2],
              intercept = m2$coefficients[1] + m2$coefficients[3],
              color = "turquoise") +
  geom_abline(slope = m2$coefficients[2],
              intercept = m2$coefficients[1],
              color = "tomato")
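As an alternative to computing the two intercepts by hand, the parallel lines can also be drawn from the model's fitted values. This is a sketch of a different technique from the geom_abline() hint, not the original solution; the column name fitted_life_exp and the plot name p_lines are hypothetical.

```r
library(ggplot2)

# rebuild the data and the model so the block runs on its own
states <- cbind(
  data.frame(name = rownames(state.x77)),
  as.data.frame(state.x77[, 2:7])
)
colnames(states) <- c("name", "income", "illiteracy", "life_exp",
                      "murder", "hs_grad", "frost")
states$hs_plus55 <- states$hs_grad >= 55
m2 <- lm(life_exp ~ murder + hs_plus55, data = states)

# fitted life expectancy for every state (hypothetical column name)
states$fitted_life_exp <- predict(m2)

# connect the fitted values within each hs_plus55 group
p_lines <- ggplot(states, aes(x = murder, y = life_exp,
                              color = hs_plus55, group = hs_plus55)) +
  geom_point() +
  geom_line(aes(y = fitted_life_exp))
p_lines
```

With this approach the line colors follow the same scale as the points, so no colors need to be hard-coded. The lines only span the observed murder rates in each group, whereas geom_abline() extends across the whole panel.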