Online Tutorial: UCLAstats
Readings Recommended by Lee for this script:
Rogosa, D. (1980). Comparing nonparallel regression lines. Psychological Bulletin, 88, 307-321.
Bauer, D. J., & Curran, P. J. (2005). Probing interactions in fixed and multilevel regression: Inferential and graphical techniques. Multivariate Behavioral Research, 40(3), 373-400.
Set working directory.
setwd("...your_working_directory/")
Load the following libraries. If not yet installed, use
install.packages("package_name")
library("car")
library("data.table")
library("tidyverse")
library("pals")
library("jtools")
library("ggplot2")
library("ggiraphExtra")
library("emmeans")
library("corrgram")
library("Matrix")
In last week's lab, we explored linear models that contained main effects. In other words, we focused on models that evaluated the separate effects of two predictor variables on the response. However, the magnitude of the effect of one predictor (x1) on the response y may be modified by a second predictor (x2; or some grouping factor). For example, a large difference in y may be observed between two levels of x2 at lower values of x1, but that difference disappears at higher values of x1. Interaction terms test whether the effect of one predictor on the response (Y ~ X1) changes across levels of a second predictor (X2). Visually, interactions can be illustrated in many ways, but examples frequently draw on non-parallel lines in xy plots (e.g. intersecting lines).
What exactly are we trying to evaluate when we ask “Is there an interaction?”
With categorical predictor variables, as is the case with ANOVA, we are asking: "How does one factor affect the 'performance' of the other factor in explaining variance in the response y?"
Similarly, in linear regression where at least one predictor is continuous, we are asking: “How does the effect of the continuous predictor variable on the response differ across levels of the other (continuous or categorical) predictor?”.
Note that in this lab we will focus on effect sizes when interpreting the outputs, not null hypothesis testing (e.g. p-values/significance testing). The first section focuses on interpreting the coef output, in other words, understanding the effect sizes as they relate to linear models with interactions. In the second section, you will manipulate interactions through simulations.
Note: This section covers ONLY multiplicative interactions. We will explore non-multiplicative interactions using simulated data in Section 2.
We will use the Salaries data set for the example. Salaries comes with the car package loaded above and has the salary for professors of different rank (assistant professor, associate professor, and professor), from different disciplines (two levels: A and B), and binary sex data (male, female), among other variables. Below we explore whether sex modifies the relationship between salary and discipline. In other words, maybe males make a higher salary in one discipline and females in the other.
salary_model <- lm(salary ~ discipline*sex, data = Salaries)
coef(salary_model)
## (Intercept) disciplineB sexMale disciplineB:sexMale
## 89064.94 22169.58 21635.04 -14109.19
interaction.plot(Salaries$discipline, Salaries$sex, Salaries$salary)
Remember, coef returns the effect sizes (aka differences in means). The interpretation of coef is as follows.

- Intercept or β0: the mean salary for females in discipline A. Females in discipline A make on average $89,064.94 (refer to eq. A below).
- disciplineB or β1: the effect (or the change in mean salary) for females in discipline B. On average, females make $22,169.58 more in discipline B than they do in discipline A. Mean salary for females in discipline B is $111,234.52 (89,064.94 + 22,169.58) (refer to eq. B below).
- sexMale or β2: the effect (or the change in mean salary) for males in discipline A. Males in discipline A make on average $21,635.04 more than females in discipline A. Average salary for males in discipline A is $110,699.98 (89,064.94 + 21,635.04) (refer to eq. C below).
- disciplineB:sexMale or β3: the change in the discipline effect if sex is male, OR the change in the sex effect if the individual teaches in discipline B. The mean salary for males in discipline B can be calculated by adding all the coefficients together (89,064.94 + 22,169.58 + 21,635.04 - 14,109.19 = $118,760.37) (refer to eq. D below).

If those interpretations make perfect sense to you, feel free to skip ahead to the next section. I like to remind myself what is happening under the hood to get those outputs, and I find it a helpful exercise, especially if you are newer to stats, to write out the mathematical formula. The way R codes a model can sometimes detach us from the mathematical model.
First a quick review of model components:
Now we will create a regression equation that solves for each of the conclusions above.
Calculating the Intercept / salary for females in discipline A:
Remember, use of the lm() function with two categorical variables is the same as performing a 2x2 ANOVA (since ANOVAs are linear models, you can get R to summarize aov() and lm() models in the same way using summary.aov()). With categorical variables, R dummy codes the levels, and that coding determines the structure of the regression formula. Here are the dummy codes relevant to our model above:

- discipline: A = 0; B = 1
- sex: female = 0; male = 1

How do I know that is the way R dummy codes these variables? By default, R treats the first level of a factor (alphabetical order, unless you re-level it) as the reference and codes it 0; every other level gets its own dummy variable coded 1. The coef headings disciplineB and sexMale also tell you that those two levels were assigned dummy code = 1 for their respective variables.
Let’s input the coefficients, and dummy code accordingly. If we are interested in Salary for females in discipline A then this would imply the dummy code for discipline = 0 and sexM = 0. All terms cancel with the exception of the intercept. Yay! It makes sense the intercept represents the mean salary for females in discipline A.
Calculating salary for females in discipline B:
If we are interested in Salary for females in discipline B then this would imply the dummy code for discipline = 1 and sexM = 0. Only the sex term and interaction term cancel this time. Hence, we add the effect of discipline to the intercept.
Calculating salary for males in discipline A:
Calculating salary for males in discipline B:
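In code, those calculations (eq. A-D) amount to plugging the dummy codes into the fitted equation salary = β0 + β1(disciplineB) + β2(sexMale) + β3(disciplineB x sexMale). A quick sketch using the salary_model object fitted above:

b <- coef(salary_model) # b[1] = intercept, b[2] = disciplineB, b[3] = sexMale, b[4] = interaction
b[1]                      # eq. A: females, discipline A (both dummies = 0)
b[1] + b[2]               # eq. B: females, discipline B (disciplineB = 1, sexMale = 0)
b[1] + b[3]               # eq. C: males, discipline A (disciplineB = 0, sexMale = 1)
b[1] + b[2] + b[3] + b[4] # eq. D: males, discipline B (both dummies = 1, so the interaction term stays)
# the same four cell means can be obtained with emmeans (loaded above)
emmeans(salary_model, ~ discipline * sex)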
In summary, the effect of sex on salary is modified by the discipline in which professors teach. Male professors generally make a higher salary than female professors (ugh), and this disparity is greater in discipline A than in discipline B.
Let’s grab some example data from online.
autompg = read.table(
"http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data",
quote = "\"",
comment.char = "",
stringsAsFactors = FALSE) |>
setNames(c("mpg", "cyl", "disp", "hp", "wt", "acc", "year", "origin", "name")) |> #name the columns
subset(hp != "?") |> #remove missing data, which is stored as "?"
subset(name != "plymouth reliant") |> #remove the plymouth reliant, as it causes some issues
subset(!(cyl %in% c(3, 5))) |> #remove 3 and 5 cylinder cars (which are very rare.)
select(-name) |> #remove the "name" column
within({
hp <- as.numeric(hp) #change horsepower from character to numeric
domestic <- as.numeric(origin == 1) #create a dummy variable for car origin: 1 = domestic, 0 = foreign
cyl <- as.factor(cyl) #change cyl from numeric to factor
})
In autompg:

- mpg: miles per gallon (continuous dependent variable)
- disp: engine displacement (continuous predictor variable)
- domestic: domestic/foreign origin (categorical predictor variable)

#There are two ways to code the same interaction in R using the lm() function. One uses the `*` operator and the other the `:` operator.
model.disp_dom1 <- lm(mpg ~ disp*domestic, data = autompg)
model.disp_dom2 <- lm(mpg ~ disp + domestic + disp:domestic, data = autompg)
#R interprets * as a combination of + (addition) and : (interaction). R should see these models as exactly the same
#Yay! outputs yield same results-just as we'd expect.
summary(model.disp_dom1)
##
## Call:
## lm(formula = mpg ~ disp * domestic, data = autompg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.8332 -2.8956 -0.8332 2.2828 18.7749
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.05484 1.80582 25.504 < 2e-16 ***
## disp -0.15692 0.01668 -9.407 < 2e-16 ***
## domestic -12.57547 1.95644 -6.428 3.90e-10 ***
## disp:domestic 0.10252 0.01692 6.060 3.29e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.308 on 379 degrees of freedom
## Multiple R-squared: 0.7011, Adjusted R-squared: 0.6987
## F-statistic: 296.3 on 3 and 379 DF, p-value: < 2.2e-16
summary(model.disp_dom2)
##
## Call:
## lm(formula = mpg ~ disp + domestic + disp:domestic, data = autompg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.8332 -2.8956 -0.8332 2.2828 18.7749
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.05484 1.80582 25.504 < 2e-16 ***
## disp -0.15692 0.01668 -9.407 < 2e-16 ***
## domestic -12.57547 1.95644 -6.428 3.90e-10 ***
## disp:domestic 0.10252 0.01692 6.060 3.29e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.308 on 379 degrees of freedom
## Multiple R-squared: 0.7011, Adjusted R-squared: 0.6987
## F-statistic: 296.3 on 3 and 379 DF, p-value: < 2.2e-16
Let’s look at this visually:
#Define intercepts.
int_for = coef(model.disp_dom1)[1]
int_dom = coef(model.disp_dom1)[1] + coef(model.disp_dom1)[3] #Consider the formula above that we worked out by hand. The calculation for int_dom should make sense now.
#Define slopes
slope_for = coef(model.disp_dom1)[2]
slope_dom = coef(model.disp_dom1)[2] + coef(model.disp_dom1)[4] #Consider the formula above that we worked out by hand. The calculation for slope_dom should make sense now.
#plot the coefficients and slopes using base R functions
plot(mpg ~ disp, data = autompg, col = domestic + 1, pch = domestic + 1, xlim = c(0, 500), ylim = c(0, 50))
abline(int_for, slope_for, col = 1, lty = 1, lwd = 2) # line for foreign cars
abline(int_dom, slope_dom, col = 2, lty = 2, lwd = 2) # line for domestic cars
legend("topright", c("Foreign", "Domestic"), pch = c(1, 2), col = c(1, 2))
#or if you don't want to do the math
ggplot(autompg, aes(disp, mpg, color = as.factor(domestic)))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula = 'y ~ x'
Again, since we are emphasizing effect sizes (not significance testing), let's just look at the coef output:
coef(model.disp_dom1)
## (Intercept) disp domestic disp:domestic
## 46.0548423 -0.1569239 -12.5754714 0.1025184
Interpreting Model Outputs:

Remember from the ANOVA example, we cannot interpret the main effects without respect to a given level of the other variable. disp = -0.16 is not interpreted as an overall effect on mpg. Instead, to interpret disp = -0.16 we must take into account the two levels of domestic.

- Intercept: mean mpg of foreign cars with 0 disp.
- disp: effect of disp on mpg for foreign cars.
- domestic: the effect of being a domestic car on mean mpg (at disp = 0).
- disp:domestic: change in the disp effect on mpg if the car is domestic, OR (equivalently) the change in the domestic effect on mpg for each unit increase in disp.

Means and slopes for each group:

- Mean mpg of foreign cars with 0 disp = 46.05 mpg.
- For foreign cars, average fuel efficiency (mpg) decreases by 0.16 for every one-unit increase in disp.
- Mean mpg of domestic cars when disp = 0 is 33.48 mpg.
- For domestic cars, the estimated change in average fuel efficiency (mpg) is a reduction of 0.06 for every one-unit increase in engine displacement (disp).

Skip ahead if you can interpret the coef output easily and those interpretations make perfect sense. We will now write out the model like before and show how we came to the conclusions above.
Let’s interpret coefficients by implementing dummy coding.
domestic: 0 = foreign; 1 = domestic.

Let's first consider the distinct effect of engine displacement (disp) on mpg for foreign cars. To do so, we dummy code domestic as 0. Solving the equation reduces it to mpg = 46.05 + (-0.16*disp). We can now interpret the coefficient estimate for disp in the summary output above as: in foreign cars, the estimated change in average fuel efficiency (mpg) is a reduction of 0.16 for every one-unit increase in engine displacement. The mean mpg for foreign cars with no displacement is 46.05.

Now let's solve for the distinct effect of engine displacement (disp) on mpg for domestic cars. To do so, we dummy code domestic as 1. Solving the equation, we end up computing β1 + β3 = -0.06 and β0 + β2 = 33.48. We can now interpret the coefficient estimate for disp in the summary output above as: in domestic cars, the estimated change in average fuel efficiency (mpg) is a reduction of 0.06 for every one-unit increase in engine displacement. We also see that the intercept for domestic cars is 33.48.
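If you want to double-check those hand calculations, predict() returns the same numbers. A small sketch (the grid of new values is just for illustration):

newdat <- expand.grid(disp = c(0, 1), domestic = c(0, 1)) #all four combinations of disp = 0/1 and origin
newdat$pred_mpg <- predict(model.disp_dom1, newdata = newdat)
newdat
# the disp = 0 rows recover each group's intercept (46.05 foreign, 33.48 domestic)
# the difference between the disp = 1 and disp = 0 rows recovers each group's slope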
Summary: Foreign cars have a stronger negative relationship between engine displacement and miles per gallon than domestic cars. At lower values of displacement, foreign cars have greater fuel efficiency than domestic cars. However, as displacement increases (above roughly 120), domestic cars have greater fuel efficiency.
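Where does that rough crossover value come from? The two fitted lines intersect where the domestic offset is cancelled by the interaction, which we can compute directly from the coefficients:

# lines cross where beta2 + beta3*disp = 0, i.e. disp = -beta2/beta3
-coef(model.disp_dom1)["domestic"] / coef(model.disp_dom1)["disp:domestic"]
# roughly 123: below this displacement foreign cars get better mileage, above it domestic cars do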
Review: Video 1
Check for Understanding
- Repeat the steps above using number of cylinders (cyl) as the categorical modifier of disp: lm(mpg ~ disp*cyl). A starter sketch follows below.
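A possible starting point for that exercise (a sketch, not a full answer; recall that cyl was converted to a factor when we built autompg):

model.disp_cyl <- lm(mpg ~ disp*cyl, data = autompg) #separate disp slope for 4, 6, and 8 cylinder cars
coef(model.disp_cyl)
#visualize: one fitted line per cylinder group
ggplot(autompg, aes(disp, mpg, color = cyl)) +
geom_point() +
geom_smooth(method = lm, se = FALSE)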
Now let's consider two continuous predictor variables (disp and hp, engine displacement and horsepower) and their effect on miles per gallon.
# hp = horsepower
model.disp_hp <- lm(mpg ~ disp * hp, data = autompg)
coef(model.disp_hp)
## (Intercept) disp hp disp:hp
## 52.4081997848 -0.1001737655 -0.2198199720 0.0005658269
As we did before, let's write out the equation (eq. a). First, we can rearrange terms by factoring out x1 (eq. b). Once we input the coef estimates (eq. c), we see that for a one-unit increase in x1 (disp), the mean of Y (mpg) changes by the term β1 + β3*x2. This indicates that the change in mean Y for each unit increase in x1 will differ across values of x2 (hp) (eq. d).
Let's solve for x2 (eq. e-g). For a one-unit increase in x2 (hp), the mean of Y (mpg) changes by the term β2 + β3*x1 (eq. g). This indicates that the change in mean Y for each unit increase in x2 will differ across values of x1 (disp).
To plot this we can bin the data; you will see an example of this later in the script using simulated data, where values are grouped by quantiles. For now, let's interpret the coef output.
coef(model.disp_hp)
## (Intercept) disp hp disp:hp
## 52.4081997848 -0.1001737655 -0.2198199720 0.0005658269
- Intercept: mean mpg for a car with 0 disp and 0 hp.
- disp: change in mpg for a one-unit increase in disp, for a car with 0 hp. When hp takes on any value other than zero, the influence of disp on mpg increases by 0.0005 for every unit of hp.
- hp: change in mpg for a one-unit increase in hp, for a car with 0 disp. When disp takes on values above zero, the influence of hp on mpg increases by 0.0005 for every unit of disp.
- disp:hp: the change in the effect of disp on mpg for a one-unit increase in hp (or vice versa).

Need Review? Video 1
You may be asking: surely there is an automated way to get those interaction slopes? Of course there is! But the approach above hopefully built a better understanding. Use emmeans::emtrends() for a quick output of the slope at each level of the modifier variable. For example:
emtrends(model.disp_dom1, ~ domestic, var = "disp")
## domestic disp.trend SE df lower.CL upper.CL
## 0 -0.1569 0.01670 379 -0.1897 -0.1241
## 1 -0.0544 0.00282 379 -0.0599 -0.0489
##
## Confidence level used: 0.95
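emtrends() also handles the continuous-by-continuous case; supply the modifier values you want through the at argument (the horsepower values below are just illustrative):

# slope of disp on mpg evaluated at a few chosen horsepower values
emtrends(model.disp_hp, ~ hp, var = "disp", at = list(hp = c(50, 100, 150, 200)))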
These are some functions that will come in handy later. split_data() takes a variable var in data and breaks it into a specified number of quantiles, n_breaks. It adds a column break_num to data, which is the quantile to which each observation belongs. ggint() makes an interaction plot where colors represent quantiles, which will save us from copy-pasting this code over and over.
split_data <- function(data, n_breaks, var){
breaks <- quantile(data[,var], seq(0, 1, by = 1/n_breaks)) #named vector of quantiles for the input column
data$break_num <- as.factor(sapply(data[,var], \(i) max(which(breaks <= i)))) #the quantile for each point
#fixes max value returning n_breaks + 1 when it should be n_breaks
return(within(data,{break_num[data[,var] == max(data[,var])] = n_breaks}))
}
ggint <- function(data){
return(
ggplot(data) + #plot the data passed to the function
geom_point(aes(x1, y, fill = break_num), pch = 21, color = "black") +
geom_smooth(aes(x1, y, group = break_num), color = "black", se = FALSE, method = lm, linewidth = 1.5)+
geom_smooth(aes(x1, y, color = break_num), se = FALSE, method = "lm", linewidth = 1) +
scale_fill_manual(values = ocean.haline(6)) + #pals::ocean contains colorblind friendly palettes
scale_color_manual(values = ocean.haline(6)) +
labs(fill = "x2 quantile", color = "x2 quantile") +
theme_classic())
}
Main Effects Model - No Interactions
Before we examine interactions let’s start with a dataset with no
interactions - only additive effects. The code below creates that data
frame. It is important to remember that we are simulating data,
therefore, we are defining what predicts the response variable (y). This
is in contrast to when you have data in hand and you fit a model to find
what predicts y. Here we are defining y as
y <- 5*x1 + -2*x2 + rnorm(1000, 0, 2)
. Which is to say
we are simulating y to be dependent on two main effects: x1
and x2
with an error term (noise) that is normally
distributed. Notice that by making our errors normally distributed we
are “forcing” our model to meet the assumptions of linear models.
set.seed(335)
# simulate two uncorrelated predictor variables: x1 and x2. If we wanted to simulate correlated variables,
# we would do something like this: x2 <- x1 + rnorm(1000, 0, 0.2)
additive <- data.frame('x1' = rnorm(1000, 0, 1), 'x2' = rnorm(1000, 0, 1)) |>
within({y <- 5*x1 + -2 *x2 + rnorm(1000, 0, 2)}) #generate a response variable that is predicted by x1 and x2, with noise.
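Because we built y ourselves, a fitted main-effects model should recover coefficients close to the values we used (5 for x1 and -2 for x2), and the predictors should be essentially uncorrelated. A quick sanity check:

coef(lm(y ~ x1 + x2, data = additive)) #estimates should land near 5 and -2
cor(additive$x1, additive$x2) #close to zero, as simulated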
To look for interactions in each data set, we will examine whether
the relationship between x1
and y
changes for
different subsets of the data. Since x2
is currently a
continuous variable, we will make it categorical using the
split_data
function. This breaks x2
into 6
parts (because we specified n_breaks = 6
). The new column
break_num
is created in a data frame
slopes_plot
. A break_num = 1
represents the
lowest values of x2
, while break_num = 6
represents the highest values of x2
. Here we break the data
into 6 sections.
slopes_plot <- split_data(additive, 6, "x2")
Let’s plot the data. We see that the relationship between
x1
and y
is independent of x2
. We
know this because the slope of the line does not change whether looking
exclusively at the low values of x2
or the high values of
x2
. No interactions here.
ggint(slopes_plot)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Let’s view this relationship another way by plotting the slope
estimates for each x2
quantile. Note here we have
n_breaks = 20
rather than n_breaks = 6
from
above. The latter would only leave us with 6 points on the graph.
# `coef_plot_data` is a data frame of slope estimates (aka. effect sizes or coefficients) for the effect of X1 on Y for each group (here, quantile of X2).
slopes_plot <- split_data(additive, 20, "x2")
coef_plot_data <- data.frame(
'x1' = sapply(1:20, \(i) coef(lm(y ~ x1, data = subset(slopes_plot, break_num == i)))[[2]]),
'x2' = 1:20)
ggplot(coef_plot_data, aes(x1, x2)) +
geom_point() +
labs(x = "Effect size of x1 on y", y = "Quantile of x2") +
theme_classic()
Here we see how the slope estimate of the line varies by subset of
the data. Again, no pattern here. In other words, there is no
interaction among x1
and x2
. The effect of
x1
on y
is not modified by x2
. As
we simulated, these predictor variables are independent of each
other.
While the plot above shows compelling visual evidence that the effect of x1 on y does not depend on x2, let's verify with a linear model that regresses the x2 quantile on the slope estimates. The effect size is basically zero (0.21 +/- 4.1).
summary(lm(x2 ~ x1, data = coef_plot_data))
##
## Call:
## lm(formula = x2 ~ x1, data = coef_plot_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5538 -4.8155 0.0106 4.6486 9.5761
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.4441 20.6794 0.457 0.653
## x1 0.2101 4.1059 0.051 0.960
##
## Residual standard error: 6.078 on 18 degrees of freedom
## Multiple R-squared: 0.0001455, Adjusted R-squared: -0.0554
## F-statistic: 0.002619 on 1 and 18 DF, p-value: 0.9598
Model With Interaction Effect
Now we add a multiplicative interaction, while keeping the predictors uncorrelated. Same steps as above, but note the interaction added to the simulated response through the term 2*x1*x2.
multiplicative <- data.frame('x1' = rnorm(1000, 0, 1), 'x2' = rnorm(1000, 0, 1)) |>
within({y <- 5*x1 + 2*x2 + 2*x1*x2 + rnorm(1000, 0, 2)}) #here you can see the interaction (2*x1*x2)
slopes_plot <- split_data(multiplicative, 6, "x2")
#Now we see an interaction. As the values of x2 increase, so does the strength
#of the relationship between x1 and y.
ggint(slopes_plot)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
We can see here that for different levels of
x2
, the effect
(or slope) of x1
on y
differs.
Now, when we plot the slopes for different quantiles of
x2
, we should expect to see some relationship. A pattern
indicates that the effect of one variable on the response will vary
depending on the value of the other predictor.
#Now this coefficient plot might be more interesting. Let's examine the relationship
#between the values of x2 and the slope between x1 and y.
slopes_plot <- split_data(multiplicative, 20, "x2")
coef_plot_data <- data.frame(
'x1' = sapply(1:20, \(i) coef(lm(y ~ x1, data = subset(slopes_plot, break_num == i)))[[2]]),
'x2' = 1:20)
#We see a linear relationship, like we simulated!
ggplot(coef_plot_data, aes(x1, x2)) +
geom_point() +
labs(x = "Effect size of x1 on y", y = "Quantile of x2") +
theme_classic()
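As a final check, we can fit the full interaction model to the simulated data; the estimates should land close to the values used to build y (5, 2, and 2), and emtrends() shows the x1 slope increasing across values of x2, just like the coefficient plot:

interaction_model <- lm(y ~ x1*x2, data = multiplicative)
coef(interaction_model) #x1, x2, and x1:x2 estimates near 5, 2, and 2
emtrends(interaction_model, ~ x2, var = "x1", at = list(x2 = c(-1, 0, 1))) #x1 slope grows with x2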