Resources:

Online Tutorial: UCLAstats

Readings Recommended by Lee for this script:

Rogosa, D. (1980). Comparing nonparallel regression lines. Psychological Bulletin, 88, 307–321.

Bauer, D. J., & Curran, P. J. (2005). Probing interactions in fixed and multilevel regression: Inferential and graphical techniques. Multivariate Behavioral Research, 40(3), 373–400.

Set working directory.

setwd("...your_working_directory/")

Load the following libraries. If not yet installed, use install.packages("package_name")

library("car")
library("data.table")
library("tidyverse")
library("pals")
library("jtools")
library("ggplot2")
library("ggiraphExtra")
library("emmeans")
library("corrgram")
library("Matrix")

Exploring Interactions

In last week’s lab, we explored linear models that contained only main effects. In other words, we focused on models that evaluated the distinct impact of two predictor variables on the response. However, it may be the case that the magnitude of the effect of a predictor variable (x1) on the response y is modified by a second predictor variable (x2; or some grouping factor). For example, a large difference in y may be observed between two levels of x2 at lower values of x1, but that difference may disappear at higher values of x1. Interaction terms ask whether the effect of one predictor on the response (Y ~ X1) changes across levels of a second predictor (X2). Visually, interactions can be illustrated in many ways, but examples frequently draw on non-parallel lines in xy plots (e.g., intersecting lines).

Figure 1: Different Illustrations of an Interaction

What exactly are we trying to evaluate when we ask “Is there an interaction?”

Note in this lab, we will focus on effect sizes when interpreting the outputs - not null hypothesis testing (e.g. p-values/significance testing). The first section focuses on interpreting the coef output. In other words, understanding the effect sizes as they relate to linear models with interactions. In the second section, you will manipulate interactions through simulations.

Section 1: Interpreting Outputs

Note: This section covers ONLY multiplicative interactions. We will explore non-multiplicative interactions using simulated data in Section 2.

Interactions with two categorical predictors (each with two levels)

We will use the Salaries data set for this example. Salaries comes with the carData package (attached when you load car) and contains salaries for professors of different rank (assistant professor, associate professor, and professor), from two disciplines (A and B), along with a binary sex variable (female, male), among other variables. Below we explore whether sex modifies the relationship between salary and discipline. In other words, perhaps males earn higher salaries in one discipline and females in the other.

salary_model <- lm(salary ~ discipline*sex, data = Salaries)
coef(salary_model)
##         (Intercept)         disciplineB             sexMale disciplineB:sexMale 
##            89064.94            22169.58            21635.04           -14109.19
interaction.plot(Salaries$discipline, Salaries$sex, Salaries$salary)

Remember, coef returns the effect sizes (here, differences in group means). The interpretation of coef is as follows (a quick code check of these group means appears after the list).

  • Intercept or β0: the mean salary for females in discipline A. Females in discipline A make, on average, $89,064.94 (see eq. A below).
  • disciplineB or β1: the effect (the change in mean salary) of discipline B for females. On average, females make $22,169.58 more in discipline B than in discipline A, so the mean salary for females in discipline B is $111,234.52 (89,064.94 + 22,169.58) (see eq. B below).
  • sexMale or β2: the effect (the change in mean salary) of being male in discipline A. Males in discipline A make, on average, $21,635.04 more than females in discipline A, so the mean salary for males in discipline A is $110,699.98 (89,064.94 + 21,635.04) (see eq. C below).
  • disciplineB:sexMale or β3: the change in the discipline effect when sex is male, or equivalently the change in the sex effect when the discipline is B. The mean salary for males in discipline B is obtained by adding all the coefficients together: $118,760.37 (89,064.94 + 22,169.58 + 21,635.04 - 14,109.19) (see eq. D below).
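
If you want to check these group means directly in R, here is a quick sketch (using only objects already created above); the raw cell means and the model's estimated marginal means should both match the coefficient sums:

aggregate(salary ~ discipline + sex, data = Salaries, FUN = mean) #raw group means
emmeans(salary_model, ~ discipline * sex) #model-based cell means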

If those interpretations make perfect sense to you, skip ahead to the next section. I like to remind myself what is happening under the hood to get these outputs, and I find it a helpful exercise, especially if you are newer to stats, to write out the mathematical formula. The way R codes a model can sometimes detach us from the mathematical model.

First a quick review of model components:

Figure 2: Components of a linear model

Now we will create a regression equation that solves for each of the conclusions above.

Calculating the intercept (salary for females in discipline A):

Remember, using the lm() function with two categorical variables is the same as performing a 2x2 ANOVA (since ANOVAs are linear models, you can get R to summarize aov() and lm() models in the same way using summary.aov()). With categorical variables, R dummy codes the variables, and those dummy codes determine the structure of the regression equation. Here are the dummy codes relevant to our model above:

discipline: A = 0; B = 1
sex: female = 0; male = 1

How do I know that is the way R dummy codes these variables? By default, R treats the first level of a factor (alphabetical order, unless you change it) as the reference level, coded 0, and codes the other level 1. The coef headings disciplineB and sexMale also tell you that those two levels were assigned dummy code = 1 for their respective variables.

Let’s plug in the coefficients and dummy code accordingly. If we are interested in the salary for females in discipline A, the dummy codes are discipline = 0 and sexMale = 0. All terms cancel except the intercept. Yay! It makes sense that the intercept represents the mean salary for females in discipline A.
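
Written out (a reconstruction of eq. A from the coefficients above):

salary = β0 + β1*(disciplineB) + β2*(sexMale) + β3*(disciplineB*sexMale)
       = 89,064.94 + 22,169.58*(0) + 21,635.04*(0) + (-14,109.19)*(0*0)
       = 89,064.94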

Calculating salary for females in discipline B:

If we are interested in the salary for females in discipline B, the dummy codes are discipline = 1 and sexMale = 0. Only the sex term and the interaction term cancel this time, so we add the effect of discipline to the intercept.
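
Written out (a reconstruction of eq. B):

salary = 89,064.94 + 22,169.58*(1) + 21,635.04*(0) + (-14,109.19)*(1*0)
       = 89,064.94 + 22,169.58
       = 111,234.52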

Calculating salary for males in discipline A:
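
For males in discipline A, the dummy codes are discipline = 0 and sexMale = 1, so the discipline term and the interaction term cancel and we add the sex effect to the intercept (a reconstruction of eq. C):

salary = 89,064.94 + 22,169.58*(0) + 21,635.04*(1) + (-14,109.19)*(0*1)
       = 89,064.94 + 21,635.04
       = 110,699.98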

Calculating salary for males in discipline B:
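
For males in discipline B, the dummy codes are discipline = 1 and sexMale = 1, so nothing cancels and every coefficient contributes (a reconstruction of eq. D):

salary = 89,064.94 + 22,169.58*(1) + 21,635.04*(1) + (-14,109.19)*(1*1)
       = 118,760.37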

In summary, the effect of sex on salary is modified by the discipline to which the professors belong. Male professors generally make a higher salary than female professors (ugh), and this disparity is greater in discipline A than in discipline B.

Need Review? Video1 OR Video2

Interaction with one continuous and one categorical predictor

Let’s grab some example data from the web (the classic auto-mpg data set from the UCI Machine Learning Repository).

autompg = read.table(
    "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data", 
    quote = "\"", 
    comment.char = "", 
    stringsAsFactors = FALSE) |>
  
  setNames(c("mpg", "cyl", "disp", "hp", "wt", "acc", "year", "origin", "name")) |> #name the columns
  
  subset(hp   != "?") |> #remove missing data, which is stored as "?"
  subset(name != "plymouth reliant") |> #remove the plymouth reliant, as it causes some issues
  subset(!(cyl %in% c(3, 5))) |> #remove 3 and 5 cylinder cars (which are very rare.)
  select(-name) |> #remove the "name" column
  
  within({
    hp       <- as.numeric(hp) #change horsepower from character to numeric
    domestic <- as.numeric(origin == 1) #create a dummy variable: domestic cars = 1, foreign cars = 0
    cyl      <- as.factor(cyl) #change cyl from numeric to factor
  }) 

Now, using autompg, let's model mpg as a function of one continuous predictor (disp, engine displacement) and one categorical predictor (domestic):

#There are two ways to code the same interaction in R using the lm() function. One uses the `*` operator and the other the `:` operator. 
model.disp_dom1 <- lm(mpg ~ disp*domestic, data = autompg)
model.disp_dom2 <- lm(mpg ~ disp + domestic + disp:domestic, data = autompg) 

#R interprets * as a combination of + (addition) and : (interaction). R should see these models as exactly the same

#Yay! outputs yield same results-just as we'd expect.
summary(model.disp_dom1)
## 
## Call:
## lm(formula = mpg ~ disp * domestic, data = autompg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.8332  -2.8956  -0.8332   2.2828  18.7749 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    46.05484    1.80582  25.504  < 2e-16 ***
## disp           -0.15692    0.01668  -9.407  < 2e-16 ***
## domestic      -12.57547    1.95644  -6.428 3.90e-10 ***
## disp:domestic   0.10252    0.01692   6.060 3.29e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.308 on 379 degrees of freedom
## Multiple R-squared:  0.7011, Adjusted R-squared:  0.6987 
## F-statistic: 296.3 on 3 and 379 DF,  p-value: < 2.2e-16
summary(model.disp_dom2)
## 
## Call:
## lm(formula = mpg ~ disp + domestic + disp:domestic, data = autompg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.8332  -2.8956  -0.8332   2.2828  18.7749 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    46.05484    1.80582  25.504  < 2e-16 ***
## disp           -0.15692    0.01668  -9.407  < 2e-16 ***
## domestic      -12.57547    1.95644  -6.428 3.90e-10 ***
## disp:domestic   0.10252    0.01692   6.060 3.29e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.308 on 379 degrees of freedom
## Multiple R-squared:  0.7011, Adjusted R-squared:  0.6987 
## F-statistic: 296.3 on 3 and 379 DF,  p-value: < 2.2e-16

Let’s look at this visually:

#Define intercepts. 
int_for = coef(model.disp_dom1)[1]
int_dom = coef(model.disp_dom1)[1] + coef(model.disp_dom1)[3] #Consider the formula above that we worked out by hand. The calculation for int_dom should make sense now. 

#Define slopes
slope_for = coef(model.disp_dom1)[2]
slope_dom = coef(model.disp_dom1)[2] + coef(model.disp_dom1)[4]#Consider the formula above that we worked out by hand. The calculation for slope_dom should make sense now. 

#plot the coefficients and slopes using base R functions
plot(mpg ~ disp, data = autompg, col = domestic + 1, pch = domestic + 1, xlim = c(0, 500), ylim = c(0, 50))
abline(int_for, slope_for, col = 1, lty = 1, lwd = 2) # line for foreign cars
abline(int_dom, slope_dom, col = 2, lty = 2, lwd = 2) # line for domestic cars
legend("topright", c("Foreign", "Domestic"), pch = c(1, 2), col = c(1, 2))

#or if you don't want to do the math
ggplot(autompg, aes(disp, mpg, color = as.factor(domestic)))+
  geom_point()+
  geom_smooth(method = lm)
## `geom_smooth()` using formula = 'y ~ x'

Again, since we are emphasizing effect sizes (not significance testing), let’s just look at the coef output.

coef(model.disp_dom1)
##   (Intercept)          disp      domestic disp:domestic 
##    46.0548423    -0.1569239   -12.5754714     0.1025184

Interpreting Model Outputs:

Remember from the ANOVA example: we cannot interpret the main effects without reference to a given level of the other variable. disp = -0.16 is not interpreted as an overall effect on mpg. Instead, to interpret disp = -0.16 we must take into account the two levels of domestic.

Means and slopes for each group:
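
Computed from the coefficients above:

  • Foreign cars (domestic = 0): intercept = 46.05, slope = -0.157
  • Domestic cars (domestic = 1): intercept = 46.05 - 12.58 = 33.48, slope = -0.157 + 0.103 = -0.054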

Skip ahead if you can interpret the coef output easily and those interpretations make perfect sense. We will now write out the model like before and show how we came to the conclusions above.

Let’s interpret coefficients by implementing dummy coding.

Let’s first consider the distinct effect of engine displacement (disp) on mpg for foreign cars. To do so, we dummy code domestic as 0. Solving the equation reduces it to mpg = 46.05 + (-0.16*disp). We can now interpret the coefficient estimate for disp in the summary output above as: in foreign cars, the estimated change in average fuel efficiency (mpg) is a reduction of 0.16 for every one-unit increase in engine displacement. The mean mpg for foreign cars with zero displacement is 46.05.

Now let’s solve for the distinct effect of engine displacement (disp) on mpg for domestic cars. To do so, we dummy code domestic as 1. Solving the equation, we end up with: β1 + β3 = -0.054 and β0 + β2 = 33.48. We can now interpret the coefficient estimate for disp as: in domestic cars, the estimated change in average fuel efficiency (mpg) is a reduction of about 0.05 for every one-unit increase in engine displacement. We also see that the intercept for domestic cars is 33.48.

Summary: Foreign cars have a stronger negative relationship between engine displacement and miles per gallon than domestic cars. At lower values of displacement, foreign cars have greater fuel efficiency than domestic cars. However, as displacement increases (above roughly 120), domestic cars have greater fuel efficiency.

Review: Video 1

Check for Understanding

  1. Fit the following model, which includes a categorical variable with 3 levels (cyl), and interpret the effect sizes: lm(mpg ~ disp*cyl, data = autompg).

Interaction with two continuous predictors

Now let’s consider two continuous predictor variables (disp and hp, engine displacement and horsepower) and their effect on miles per gallon.

# hp = horsepower
model.disp_hp <- lm(mpg ~ disp * hp, data = autompg)
coef(model.disp_hp)
##   (Intercept)          disp            hp       disp:hp 
## 52.4081997848 -0.1001737655 -0.2198199720  0.0005658269

As we did before, let’s write out the equation (eq. a). First, we can rearrange terms by factoring out x1 (eq. b). Plugging in the coef (eq. c), we see that for a one-unit increase in x1 (disp), the mean of Y (mpg) changes by β1 + β3*x2. In other words, the change in mean Y for each unit increase in x1 will differ across values of x2 (hp) (eq. d).
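
Sketching that algebra with the coefficients above (rounded):

Y = β0 + β1*x1 + β2*x2 + β3*x1*x2
  = β0 + β2*x2 + (β1 + β3*x2)*x1

mpg = 52.41 + (-0.22)*hp + (-0.10 + 0.00057*hp)*disp

So the slope for disp is β1 + β3*hp = -0.10 + 0.00057*hp, which depends on hp.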

Let’s solve for x2 (eq. e-g). For a one-unit increase in x2 (hp), the mean of Y (mpg) changes by β2 + β3*x1 (eq. g). This indicates that the change in mean Y for each unit increase in x2 will differ across values of x1 (disp).
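
Factoring out x2 instead:

Y = β0 + β1*x1 + (β2 + β3*x1)*x2

mpg = 52.41 + (-0.10)*disp + (-0.22 + 0.00057*disp)*hp

So the slope for hp is β2 + β3*disp = -0.22 + 0.00057*disp, which depends on disp.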

To plot this, we can bin the data; you will see an example of this later in this script using simulated data, where values are grouped by quantiles. For now, let’s interpret the coef.

coef(model.disp_hp)
##   (Intercept)          disp            hp       disp:hp 
## 52.4081997848 -0.1001737655 -0.2198199720  0.0005658269
  • β0 = 52.408 is the estimated average mpg for a car with 0 disp and 0 hp.
  • β1 = -0.10 is the estimated change in average mpg for a one-unit increase in disp, for a car with 0 hp. As hp increases, the effect of disp on mpg becomes less negative by 0.00057 for every one-unit increase in hp.
  • β2 = -0.22 is the estimated change in average mpg for a one-unit increase in hp, for a car with 0 disp. As disp increases, the effect of hp on mpg becomes less negative by 0.00057 for every one-unit increase in disp.
  • β3 = 0.00057 is the estimated modification to the change in average mpg for an increase in disp, for a car of a given hp (or vice versa).

Need Review? Video1

You may be asking by now: surely there’s an automated way to get those slopes for interactions? Of course there is! But the approach above hopefully built a better understanding. Use emmeans::emtrends() for a quick output of the slopes at each level of the moderator variable. For example:

emtrends(model.disp_dom1, ~ domestic, var = "disp")
##  domestic disp.trend      SE  df lower.CL upper.CL
##         0    -0.1569 0.01670 379  -0.1897  -0.1241
##         1    -0.0544 0.00282 379  -0.0599  -0.0489
## 
## Confidence level used: 0.95
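
emtrends() also works when the moderator is continuous; you just tell it which values of the moderator to evaluate the slope at. A sketch for the disp-by-hp model (the hp values 100, 150, and 200 are arbitrary choices for illustration):

emtrends(model.disp_hp, ~ hp, var = "disp", at = list(hp = c(100, 150, 200)))
#each disp.trend should equal β1 + β3*hp from the hand calculation above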

Section 2: The Basics of Interactions with Simulated Data

These are some functions that will come in handy later. split_data() takes a variable var in data and breaks it into a specified number of quantile groups (n_breaks); it adds a column break_num to data giving the quantile group to which each observation belongs. ggint() makes an interaction plot where colors represent quantile groups, which will save us from copy-pasting this code over and over.

split_data <- function(data, n_breaks, var){
  
  breaks <- quantile(data[,var], seq(0, 1, by = 1/n_breaks)) #named vector of quantiles for the input column
  data$break_num <- as.factor(sapply(data[,var], \(i) max(which(breaks <= i)))) #the quantile for each point
  
  #fixes max value returning n_breaks + 1 when it should be n_breaks
  return(within(data,{break_num[data[,var] == max(data[,var])] = n_breaks})) 
}

ggint <- function(data){
  return(
    ggplot(data) + #plot the data argument (was hard-coded to the global slopes_plot)
      geom_point(aes(x1, y, fill = break_num), pch = 21, color = "black") + 
      geom_smooth(aes(x1, y, group = break_num), color = "black", se = FALSE, method = lm, linewidth = 1.5)+
      geom_smooth(aes(x1, y, color = break_num), se = FALSE, method = "lm", linewidth = 1) + 
      scale_fill_manual(values = ocean.haline(6)) + #pals::ocean contains colorblind friendly palettes
      scale_color_manual(values = ocean.haline(6)) + 
      labs(fill = "x2 quantile", color = "x2 quantile") + 
      theme_classic())
}

Main Effects Model - No Interactions

Before we examine interactions, let’s start with a dataset with no interactions, only additive effects. The code below creates that data frame. It is important to remember that we are simulating data; therefore, we are defining what predicts the response variable (y). This is in contrast to when you have data in hand and fit a model to find out what predicts y. Here we define y as y <- 5*x1 + -2*x2 + rnorm(1000, 0, 2), which is to say we are simulating y to depend on two main effects, x1 and x2, plus an error term (noise) that is normally distributed. Notice that by making our errors normally distributed we are “forcing” our model to meet the assumptions of linear models.

set.seed(335)

# simulate two uncorrelated predictor variables: x1 and x2. If we wanted to simulate correlated variables,
# we would do something like this: x2 <- x1 + rnorm(1000, 0, 0.2)
additive <- data.frame('x1' = rnorm(1000, 0, 1), 'x2' = rnorm(1000, 0, 1)) |>
  within({y <- 5*x1 + -2 *x2 + rnorm(1000, 0, 2)}) #generate a response variable that is predicted by x1 and x2, with noise.
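
As a quick sanity check (not required, but reassuring), fitting an interaction model to these simulated data should return coefficients close to the values we used to build y: about 5 for x1, about -2 for x2, and about 0 for the interaction.

coef(lm(y ~ x1*x2, data = additive)) #expect roughly 5, -2, and 0 for x1, x2, and x1:x2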

To look for interactions in each data set, we will examine whether the relationship between x1 and y changes across different subsets of the data. Since x2 is a continuous variable, we will make it categorical using the split_data() function, which breaks x2 into 6 quantile groups (because we specify n_breaks = 6). The result is a new data frame, slopes_plot, with the added column break_num: break_num = 1 represents the lowest values of x2, while break_num = 6 represents the highest values of x2.

slopes_plot <- split_data(additive, 6, "x2")

Let’s plot the data. We see that the relationship between x1 and y is independent of x2. We know this because the slope of the line does not change whether looking exclusively at the low values of x2 or the high values of x2. No interactions here.

ggint(slopes_plot)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Let’s view this relationship another way by plotting the slope estimates for each x2 quantile. Note that here we use n_breaks = 20 rather than the n_breaks = 6 from above; the latter would leave us with only 6 points on the graph.

# `coef_plot_data` is a data frame of slope estimates (aka. effect sizes or coefficients) for the effect of X1 on Y for each group (here, quantile of X2). 
slopes_plot <- split_data(additive, 20, "x2")

coef_plot_data <- data.frame(
  'x1' = sapply(1:20, \(i) coef(lm(y ~ x1, data = subset(slopes_plot, break_num == i)))[[2]]),
  'x2' = 1:20)

ggplot(coef_plot_data, aes(x1, x2)) + 
 geom_point() + 
 labs(x = "Effect size of x1 on y", y = "Quantile of x2") + 
 theme_classic()

Here we see how the slope estimate varies across subsets of the data. Again, no pattern here. In other words, there is no interaction between x1 and x2: the effect of x1 on y is not modified by x2. As we simulated, the two predictors have purely additive effects.

While the plot above shows compelling visual evidence of no interaction, let’s verify by regressing the x2 quantile on the slope estimates. The effect size is basically zero (slope for x1 = 0.21, SE = 4.1).

summary(lm(x2 ~ x1, data = coef_plot_data))
## 
## Call:
## lm(formula = x2 ~ x1, data = coef_plot_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5538 -4.8155  0.0106  4.6486  9.5761 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   9.4441    20.6794   0.457    0.653
## x1            0.2101     4.1059   0.051    0.960
## 
## Residual standard error: 6.078 on 18 degrees of freedom
## Multiple R-squared:  0.0001455,  Adjusted R-squared:  -0.0554 
## F-statistic: 0.002619 on 1 and 18 DF,  p-value: 0.9598

Model With Interaction Effect

Now we add a multiplicative interaction, keeping the predictors uncorrelated. Same steps as above, but note the interaction built into the simulated response via the term 2*x1*x2.

multiplicative <- data.frame('x1' = rnorm(1000, 0, 1), 'x2' = rnorm(1000, 0, 1)) |>
  within({y <- 5*x1 + 2*x2 + 2*x1*x2 + rnorm(1000, 0, 2)}) #here you can see the interaction (2*x1*x2)
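
The same sanity check as before should now recover the interaction we built in: roughly 5 for x1, 2 for x2, and 2 for x1:x2.

coef(lm(y ~ x1*x2, data = multiplicative)) #expect roughly 5, 2, and 2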

slopes_plot <- split_data(multiplicative, 6, "x2")

#Now we see an interaction. As the values of x2 increase, so does the strength
#of the relationship between x1 and y.
ggint(slopes_plot)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

We can see here that for different levels of x2, the effect (or slope) of x1 on y differs.

Now, when we plot the slopes for different quantiles of x2, we should expect to see some relationship. A pattern indicates that the effect of one variable on the response will vary depending on the value of the other predictor.

#Now this coefficient plot might be more interesting. Let's examine the relationship
#between the values of x2 and the slope between x1 and y.
slopes_plot <- split_data(multiplicative, 20, "x2")

coef_plot_data <- data.frame(
  'x1' = sapply(1:20, \(i) coef(lm(y ~ x1, data = subset(slopes_plot, break_num == i)))[[2]]),
  'x2' = 1:20)

#We see a linear relationship, like we simulated!
ggplot(coef_plot_data, aes(x1, x2)) + 
 geom_point() + 
 labs(x = "Effect size of x1 on y", y = "Quantile of x2") + 
 theme_classic()
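
As a final check, we can mirror the additive example and regress the x2 quantile on the slope estimates; this time the slope should be clearly positive rather than near zero.

summary(lm(x2 ~ x1, data = coef_plot_data))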