library(tidyverse)
theme_set(theme_classic(12))
set.seed(123)Learning objectives
In today’s lab you will…
- Fit logistic regression models to binary (0/1, no/yes) outcomes
- Interpret coefficients
- Visualize predictions
Complete sections Logistic model notation and R code and Read and explore the data before class.
Logistic model notation and R code
In statistical notation, logistic models take the form:
\[ \begin{align} \text{BinaryOutcome} &\sim Binomial(1, p) \\ logit(p) &= \beta_0 + \beta_1 \text{Predictor} \end{align} \]
In R, you can fit a logistic model like this:
logistic_model <- glm(binary_outcome ~ predictor,
data = my_data,
family = binomial(link = "logit"))In both the notation and code above, identify the:
Response family
Link function
Response
Predictor
Read and explore the data
Read the dataset in drinking_water.csv.
Describe what you think each of the columns represents.
| Column | Meaning |
|---|---|
pwsid |
|
pwsname |
|
pctpov |
|
pcthisp |
|
health |
|
medianincomehouse |
|
violation |
|
pcthisp_bin |
|
pctpov_bin |
Create three figures summarizing the relationships between race, class, and SDWA violations.
Figure 1: A scatter plot showing the proportion of districts with SDWA violations in each percent Hispanic bin.
Figure 2: A scatter plot showing the proportion of districts with SDWA violations in each percent below-poverty bin.
Figure 3: A scatter plot showing the correlation between percent Hispanic and percent below-poverty at the district level.
Fit and interpret model
Our question is:
How are SDWA violations distributed among communities by class and race?
Let’s begin by investigating class in isolation. Fit a logistic model to see how violation responds to pctpov.
Just as we did with linear regression, you can inspect coefficient estimates, p-values, and confidence intervals using summary() and confint().
What is the best estimate for the % Below Poverty coefficient?
Is the coefficient significantly different than 0? Assume critical threshold of 0.05.
What is the 95% CI for the coefficient?
Based on the sign, what is the relationship between class and SDWA violations?
Make predictions
Interpreting the coefficients directly is pretty difficult with logistic regression! What is the difference between a coefficient of 0.1 and 0.2? Rather than interpreting the coefficient directly, it’s often easier to compare predictions.
Once you have a model, you can make predictions using predict(). When calling predict(), the following parameters are important:
objectThe model you want to generate predictions fromnewdataA data frame of predictors your want predictions fortypeDo you want predictions on the link or response scale? By default,predict()shows the value of \(logit(p) = \beta_0 + \beta_1 x\) (i.e., link-scale). To get the predicted probability (i.e., response-scale) settypeto “response”.se.fitIfse.fitisTRUE,predict()will calculate the confidence interval of the response on the link scale. You can use that to get the confidence interval of \(p\). That will make a ribbon, similar to the confidence interval of \(\mu\) you saw in linear regression.
Here’s how to generate predictions (with CI) for a single predictor.
I use expand_grid() instead of tibble() when making the data frame of predictors because it gives you every combination of the values in the columns. That will be important when you have more than one predictor!
That was just the best estimate. The following code adds the 95% CI for \(p\). It uses the inverse link function to go from link-scale to response-scale.
When constructing the 95% CI for \(p\), one thing remains the same and one thing has changed from constructing the 95% CI for \(\mu\).
What remains the same: the sampling distribution is normal, so once you have the standard error you can use qnorm() to get the 2.5% and 97.5% quantiles.
What’s different: the sampling distribution exists on the link scale. In other words, the standard error is \(SE(logit(p))\) rather than \(SE(p)\). So you have to get the best estimate and standard error on the link scale, then invert the link function to get back to \(p\).
And here’s the fit with the 95% CI presented as the familiar ribbon.
What is the probability of a SDWA violation in a district where 5% of households are below the poverty line? When 50% are? Show the best estimate and 95% CI.
Your turn
Switzer and Teodoro demonstrated that the interaction between race and class is important for understanding the environmental justice dimensions of drinking water. You’ll investigate that here through the following steps.
- Create a new model with an interaction term between
pctpovandpcthisp. - Use the summary and confidence intervals to answer the following questions:
What is the 95% CI for the % Hispanic coefficient?
Is the interaction between class and race significantly different than 0? - Create a new predictor data frame with combinations of both predictors.
- Generate predictions, including both the best estimate and 95% CI.
- Plot the predictions. In your plot, put
pcthispon the x-axis and probability of SDWA violations on the y-axis. Indicatepctpovusing color and fill. To make this work, you’ll have to filter your predictions to a few levels ofpctpov(e.g., 0, 25, 50, and 75).
The coefficient for an interaction term in a logistic regression model is very difficult to interpret on its own! But that value has important consequences for the lives of millions of people. Predictions are usually more straightforward to interpret than coefficients. Use your figure and the outputs of summary() and confint() to reflect on the following prompt. Reply to the corresponding thread on Slack with your reflection to get 1 token.
In plain language, what is the relationship between race, class, and SDWA violations? Which communities face the greatest risks associated with drinking water? How would you use the statistical outputs (model summary, prediction figure) to describe that risk quantitatively? How could those statistical outputs be misinterpreted?