In this last lab we are going to examine structural equation modeling and then return to a principle that we have already mentioned in this class, that in statistics there are often multiple paths to the same answer.
Resources:
Book: Cause and Correlation in Biology, Bill Shipley 2016
Website: Introduction of the Framework
Website: Video and 3-part seminar * I highly recommend going through these seminars if you are interested in this method. The question often pops up “Isn’t SEM just linear regression” and I think this does a great job showing how these methods differ with respect to covariance in residual errors.
Structural equation modeling is a linear model framework that models both simultaneous regression equations with latent variables. Two types of models within this framework include: confirmatory factor analysis and path analysis. The working definition here, is that structural equation modeling is a model selection approach to path analysis that either includes latent variables or statistically compares different path models with the goal of comparing different causal hypotheses and examining indirect effects.
Latent variables are key components of SEM. Latent variables are immeasurable or unobservable variables but whose influence can be summarized through observable indicator variables - variables that are measured directly. That is to say, we can study things which are inherently impossible to measure by analyzing the relationships of things they cause. An example of a latent variable in ecology is climate change. Since no one has invented a climate-changeometer, we use measures of climate variables like temperature and precipitation, or rates of change in these variables, as indicators for the latent construct - climate change. These methods are widely used outside of ecology, in part because fields such as psychology use survey data items to construct latent variables of interest such as depression, anxiety and intelligence, but the theory and methods are also very useful to apply to, and consider for, ecological questions.
SEM accounts for the following relationships:
Let’s visualize the components of a path diagram in a heuristic visualization, to understand the components of a path diagram and strength of the SEM approach.
There’s a lot to take in here, and not all of these symbols will be used in every SEM diagram.
Exogenous variable: represented by (x1). These are independent variables (either observed or latent) that explains an endogenous variable. Arrows point away from them, not at them.
Endogenous variables: represented by (y1 and y2). These are dependent variables (either observed or latent) that are explained by exogenous variables in the model. Arrows point toward them and these represent causal paths.
Latent variables: represented by (L1). These are unobserved variables that predict exogenous variables in the model.
Arrows: (also called edges) are relationships among variables. Sometimes arrows are used to denote positive relationships while blunted lines represent negative relationships. Single headed arrows represent a causal relationship, while double headed arrows represent covariance, i.e. a relationship where we don’t know the direction of effect.
Vertices: (also called nodes) are the variables themselves.
Let’s look at a simplified version of this model:
In the above model we are saying that x1 is a predictor of y1 and y2. Additionally, we are saying that y1 is a predictor of y2. This is another key feature of SEMs. We can separate the direct effect of x1 on y1 and the indirect effect of x1 on y2 mediated by y1. Each line in the diagram represents a statistical model. For instance the model predicting y1 looks like this y1 = x1 + error. Each response variable gets its own error (ζ), although these are not always shown in diagrams.
Also note that we can’t have arrows that form loops in our models (the specific terminology is that these are acyclic graphs), so a model like this is invalid:
This is why SEMs are so useful for causal analysis. The basic idea of causality, that causes drive effects and not the other way around, is built into the assumptions of the model.
Typically, before conducting the analysis, it is useful to explicitly draw out the metamodel (model with the variables of interest including (+/-) relationships to represent your hypotheses.
Frequently used syntax in lavaan
here’s an example of how these characters are used when building a model in R:
model <- '
# regressions
y1 + y2 ~ f1 + f2 + x1 + x2
f1 ~ f2 + f3
f2 ~ f3 + x1 + x2
# latent variable definitions
f1 =~ y1 + y2 + y3
f2 =~ y4 + y5 + y6
f3 =~ y7 + y8 + y9 + y10
# variances and covariances
y1 ~~ y1
y1 ~~ y2
f1 ~~ f2
# intercepts
y1 ~ 1
f1 ~ 1'
Let’s get on to working with real data. For this lab we will simulate everything.
#Load libraries
library(lavaan)
library(lavaanPlot)
library(piecewiseSEM)
library(tidyverse)
library(rjags)
library(jagsUI)
library(coda)
library(matrixStats)
library(pals)
Here we will simulate data of caterpillar richness measured over 60 years. First I simulate that temperature is increasing. Then I simulate that caterpillar richness is declining as a function of both temperature and year. Then I add a precipitation variable with no trends and no effect on caterpillars.
set.seed(2002)
semdata <- data.frame('years' = 1:60) |> #60 years of data
within({
# precip is simulated from a random distribution with a mean of 100, standard devation of 10 . 60 values will be taken from that distribution (one for each year).
precip <- rnorm(60, 100, 10)
#temperature is modeled with an intercept of 0.05 a positive slope of magnitude 0.05 to indicate it is increasing each year by 0.05 degrees Celsius. Plus some error/ noise. Noise is taken from a Gaussian distribtuion.
temperature <- 0.05 + 0.05*years + rnorm(60, 1, 0.5)
#simulate the caterpillar richness as a function of years and temps
catrich <- 350 - 2*years -50*temperature + rnorm(60, 50, 50)
})
#visualize trend of temperature across the 60 years
ggplot(semdata, aes(years, temperature))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula = 'y ~ x'
#visualize caterpillar richness across years
ggplot(semdata, aes(years, catrich))+
geom_point()+
geom_smooth(method = lm)
## `geom_smooth()` using formula = 'y ~ x'
In the case where we didn’t simulate data and we are hypothesizing about these patterns, we would draw a metamodel of the relationships we are about to test (it can still be nice to do this even when you aren’t planning on using an SEM). Below is the meta model for the model we are about to write. Note that this meta model is very simplified in that it only displays the edges that represent regression or path coefficients. When creating your model, it is important to consider correlated error structure. The covariance of residuals and residual variances of the variables are typically not displayed in a path diagram, but the model output will describe this information.
Let’s first examine the observed or sample covariance matrix to get a sense of these relationships.
cov(semdata)
## years catrich temperature precip
## years 305.00000 -1358.42825 15.3312769 -23.4048821
## catrich -1358.42825 8825.22098 -82.1936621 90.2473145
## temperature 15.33128 -82.19366 1.0431215 0.2863871
## precip -23.40488 90.24731 0.2863871 97.8564803
Now let’s run our hypothesized SEM. First, we need to write a model
statement. Finally we run the model using sem()
and plot it
using lavaanPlot()
.
##Write the model
# In the model below the exogenous variables are years and temperature
# Endogenous variables are caterpillar richness and precipitation
# Precipitation is predicted by years, caterpillar richness is predicted by years and temperature, temperature is predicted by years
model1 <- '
precip + temperature ~ years
catrich ~ years + temperature'
# fitting the sem model to your data
model1.fit <- sem(model1, data = semdata)
## Warning: lavaan->lav_data_full():
## some observed variances are (at least) a factor 1000 times larger than others; use varTable(fit) to investigate
# get summary of model fits (since we specified `fit.measures=TRUE`), standardized estimates (since we specified `standardized =TRUE`) this will return the same coefficients had we scaled all the data and an R-squared value
summary(model1.fit, rsq = TRUE, fit.measures = TRUE, standardized = TRUE)
## lavaan 0.6-19 ended normally after 21 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 8
##
## Number of observations 60
##
## Model Test User Model:
##
## Test statistic 5.118
## Degrees of freedom 1
## P-value (Chi-square) 0.024
##
## Model Test Baseline Model:
##
## Test statistic 175.155
## Degrees of freedom 6
## P-value 0.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 0.976
## Tucker-Lewis Index (TLI) 0.854
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -580.210
## Loglikelihood unrestricted model (H1) -577.651
##
## Akaike (AIC) 1176.420
## Bayesian (BIC) 1193.175
## Sample-size adjusted Bayesian (SABIC) 1168.013
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.262
## 90 Percent confidence interval - lower 0.077
## 90 Percent confidence interval - upper 0.504
## P-value H_0: RMSEA <= 0.050 0.034
## P-value H_0: RMSEA >= 0.080 0.948
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.054
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## precip ~
## years -0.077 0.072 -1.059 0.290 -0.077 -0.135
## temperature ~
## years 0.050 0.004 13.027 0.000 0.050 0.860
## catrich ~
## years -1.702 0.652 -2.609 0.009 -1.702 -0.314
## temperature -54.746 11.124 -4.921 0.000 -54.746 -0.592
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .precip ~~
## .catrich 64.990 57.207 1.136 0.256 64.990 0.148
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .precip 94.459 17.246 5.477 0.000 94.459 0.982
## .temperature 0.268 0.049 5.477 0.000 0.268 0.261
## .catrich 2034.063 371.367 5.477 0.000 2034.063 0.231
##
## R-Square:
## Estimate
## precip 0.018
## temperature 0.739
## catrich 0.769
#PLot model
lavaanPlot(name = "model1", model1.fit, coefs = TRUE) # prints path diagram that specifies relationships specified in `model1` along with the estimated path coefficients from the fitted model (here, the unstandardized coefficient estimates in the summary output.)
Let’s break down the output: