In this last lab we are going to examine structural equation modeling and then return to a principle that we have already mentioned in this class, that in statistics there are often multiple paths to the same answer.

Introduction to SEM

Resources:

Book: Cause and Correlation in Biology, Bill Shipley 2016

Website: Introduction of the Framework

Website: Video and 3-part seminar. I highly recommend going through these seminars if you are interested in this method. The question often pops up, “Isn’t SEM just linear regression?”, and I think this series does a great job of showing how these methods differ with respect to covariance in residual errors.

PDF: lavaan reference sheet

Structural Equation Modeling (SEM)

Structural equation modeling is a linear model framework that combines simultaneous regression equations with latent variables. Two types of models within this framework are confirmatory factor analysis and path analysis. The working definition here is that structural equation modeling is a model selection approach to path analysis that either includes latent variables or statistically compares different path models, with the goal of comparing different causal hypotheses and examining indirect effects.

Latent variables are key components of SEM. A latent variable is one that cannot be measured or observed directly, but whose influence can be summarized through observable indicator variables - variables that are measured directly. That is to say, we can study things that are inherently impossible to measure by analyzing the relationships among things they cause. An example of a latent variable in ecology is climate change. Since no one has invented a climate-changeometer, we use measures of climate variables like temperature and precipitation, or rates of change in these variables, as indicators for the latent construct - climate change. These methods are widely used outside of ecology; fields such as psychology use survey items to construct latent variables of interest such as depression, anxiety, and intelligence. The theory and methods are also very useful to apply to, and consider for, ecological questions.
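To make this concrete, here is a sketch of how a latent climate-change construct might be written in lavaan model syntax. The indicator names (`temp_trend`, `precip_trend`, `extreme_days`) are hypothetical column names invented for illustration, not variables from any real dataset.

```r
# A sketch of a latent-variable definition in lavaan syntax.
# The indicator names below (temp_trend, precip_trend, extreme_days)
# are hypothetical, standing in for measured climate variables.
climate_model <- '
  # latent variable: climate change, summarized by observed indicators
  climate_change =~ temp_trend + precip_trend + extreme_days

  # the latent variable can then act as a predictor of a response
  richness ~ climate_change
'
```

Note that the model is just a character string until it is passed to a fitting function such as `sem()` along with data containing the indicator columns.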

SEM can account for several kinds of relationships among variables, including direct effects, indirect effects, and covariances. These are best illustrated with path diagrams.

Path Diagrams

Let’s walk through a heuristic visualization to understand the components of a path diagram and the strength of the SEM approach.

Figure 1: example structural equation model

There’s a lot to take in here, and not all of these symbols will be used in every SEM diagram.

Let’s look at a simplified version of this model:

Figure 2: a simple SEM

In the above model we are saying that x1 is a predictor of y1 and y2. Additionally, we are saying that y1 is a predictor of y2. This is another key feature of SEMs: we can separate the direct effect of x1 on y2 from the indirect effect of x1 on y2 mediated by y1. Each line in the diagram represents a statistical model. For instance, the model predicting y1 looks like this: y1 = x1 + error. Each response variable gets its own error term (ζ), although these are not always shown in diagrams.
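In lavaan syntax, paths can be given labels, and the `:=` operator defines derived quantities such as the indirect effect (the product of the labeled path coefficients). Below is a sketch of the Figure 2 model written this way; it is just a model string, not fit to any data, and the labels `a`, `b`, and `c` are arbitrary names we chose.

```r
# Labeled paths for the simple mediation model in Figure 2.
# a, b, and c are user-chosen labels; := defines derived quantities.
mediation_model <- '
  y1 ~ a * x1           # x1 -> y1
  y2 ~ b * y1 + c * x1  # y1 -> y2, plus the direct x1 -> y2 path

  indirect := a * b     # indirect effect of x1 on y2, via y1
  total    := c + a * b # total effect of x1 on y2
'
```

When a model like this is fit with `sem()`, the summary output reports estimates, standard errors, and tests for the defined quantities alongside the regular path coefficients.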

Also note that we can’t have arrows that form loops in our models (the specific terminology is that these are acyclic graphs), so a model like this is invalid:

Figure 3: An example of an invalid SEM

This is why SEMs are so useful for causal analysis. The basic idea of causality, that causes drive effects and not the other way around, is built into the assumptions of the model.

Typically, before conducting the analysis, it is useful to explicitly draw out the metamodel (a diagram of the variables of interest, including the hypothesized (+/-) relationships), to represent your hypotheses.

lavaan: an R package to explore SEM

Frequently used syntax in lavaan

lavaan model statements are plain character strings built from a handful of operators: `~` for regressions, `=~` for latent variable definitions, `~~` for variances and covariances, and `~ 1` for intercepts.

Here’s an example of how these characters are used when building a model in R:

model <- '
# regressions
y1 + y2 ~ f1 + f2 + x1 + x2
f1 ~ f2 + f3
f2 ~ f3 + x1 + x2

# latent variable definitions
f1 =~ y1 + y2 + y3
f2 =~ y4 + y5 + y6
f3 =~ y7 + y8 + y9 + y10

# variances and covariances
y1 ~~ y1
y1 ~~ y2
f1 ~~ f2

# intercepts
y1 ~ 1
f1 ~ 1'

Practice

Let’s get on to working with data. For this lab we will simulate everything.

# Load libraries
library(lavaan)
library(lavaanPlot)
library(piecewiseSEM)
library(tidyverse)
library(rjags)
library(jagsUI)
library(coda)
library(matrixStats)
library(pals)

Here we will simulate data on caterpillar richness measured over 60 years. First I simulate that temperature is increasing. Then I simulate that caterpillar richness is declining as a function of both temperature and year. Then I add a precipitation variable with no trends and no effect on caterpillars.

set.seed(2002)

semdata <- data.frame('years' = 1:60) |> # 60 years of data
  within({
    # precip is simulated from a random distribution with a mean of 100 and a standard deviation of 10. 60 values will be drawn from that distribution (one for each year).
    precip <- rnorm(60, 100, 10) 
    
    # temperature is modeled with an intercept of 0.05 and a positive slope of magnitude 0.05, so it increases by 0.05 degrees Celsius each year, plus some error/noise. Noise is drawn from a Gaussian distribution.
    temperature <- 0.05 + 0.05*years + rnorm(60, 1, 0.5) 

    # simulate caterpillar richness as a function of years and temperature
    catrich <- 350 - 2*years - 50*temperature + rnorm(60, 50, 50)
  })
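As a quick sanity check on the simulation (a side exercise, not part of the SEM workflow), ordinary regressions should recover the simulated trends: a temperature slope near 0.05 per year and a declining richness trend. The block below re-creates the data so it runs on its own.

```r
set.seed(2002)

# Re-create the simulated data so this block is self-contained
semdata <- data.frame(years = 1:60) |>
  within({
    precip <- rnorm(60, 100, 10)
    temperature <- 0.05 + 0.05 * years + rnorm(60, 1, 0.5)
    catrich <- 350 - 2 * years - 50 * temperature + rnorm(60, 50, 50)
  })

# The fitted slope of temperature on years should be close to the
# simulated value of 0.05, and richness should decline over years
temp_slope <- coef(lm(temperature ~ years, data = semdata))["years"]
rich_slope <- coef(lm(catrich ~ years, data = semdata))["years"]
temp_slope # expected to be near 0.05
rich_slope # expected to be negative
```

Note that the richness slope from this simple regression is the *total* effect of years (direct plus the part mediated by temperature), which is exactly the kind of quantity the SEM will let us decompose.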


#visualize trend of temperature across the 60 years
ggplot(semdata, aes(years, temperature))+
  geom_point()+
  geom_smooth(method = lm)
## `geom_smooth()` using formula = 'y ~ x'

#visualize caterpillar richness across years
ggplot(semdata, aes(years, catrich))+
  geom_point()+
  geom_smooth(method = lm)
## `geom_smooth()` using formula = 'y ~ x'

In the case where we didn’t simulate data and were instead hypothesizing about these patterns, we would draw a metamodel of the relationships we are about to test (it can still be nice to do this even when you aren’t planning on using an SEM). Below is the metamodel for the model we are about to write. Note that this metamodel is very simplified, in that it only displays the edges that represent regression or path coefficients. When creating your model, it is important to consider correlated error structure. The covariances of residuals and the residual variances of the variables are typically not displayed in a path diagram, but the model output will describe this information.

Figure 4: A metamodel showing the causal hypotheses we want to test
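If we wanted to encode a correlated-error hypothesis explicitly, the `~~` operator is how a residual covariance between two endogenous variables is written. Below is a sketch extending the model statement we are about to fit (just the string, not fit to data here); in practice, `sem()` adds covariances among the residuals of observed endogenous variables by default, so writing the line out mainly makes the hypothesis explicit.

```r
# Same regressions as the model fit below, plus an explicit residual
# covariance between caterpillar richness and precipitation (the ~~ line)
model1_corr <- '
  precip + temperature ~ years
  catrich ~ years + temperature
  catrich ~~ precip  # correlated residual errors
'
```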

Let’s first examine the observed or sample covariance matrix to get a sense of these relationships.

cov(semdata)
##                   years     catrich temperature      precip
## years         305.00000 -1358.42825  15.3312769 -23.4048821
## catrich     -1358.42825  8825.22098 -82.1936621  90.2473145
## temperature    15.33128   -82.19366   1.0431215   0.2863871
## precip        -23.40488    90.24731   0.2863871  97.8564803
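Because the variables are on very different scales, the covariance matrix can be hard to read; base R’s `cov2cor()` rescales a covariance matrix into a correlation matrix, which is often easier to interpret. A standalone sketch using the same simulation:

```r
set.seed(2002)

# Re-create the simulated data so this block is self-contained
semdata <- data.frame(years = 1:60) |>
  within({
    precip <- rnorm(60, 100, 10)
    temperature <- 0.05 + 0.05 * years + rnorm(60, 1, 0.5)
    catrich <- 350 - 2 * years - 50 * temperature + rnorm(60, 50, 50)
  })

# Rescale the sample covariance matrix to a correlation matrix
round(cov2cor(cov(semdata)), 2)
```

The strong positive years-temperature correlation and the negative years-richness and temperature-richness correlations mirror the trends we simulated.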

Now let’s run our hypothesized SEM. First, we write a model statement. Then we fit the model using sem() and plot it using lavaanPlot().

## Write the model
# In the model below the only exogenous variable is years
# Endogenous variables are precipitation, temperature, and caterpillar richness
# Precipitation is predicted by years; temperature is predicted by years; caterpillar richness is predicted by years and temperature

model1 <- '
  precip + temperature ~ years
  catrich ~ years + temperature'

# fitting the sem model to your data
model1.fit <- sem(model1, data = semdata) 
## Warning: lavaan->lav_data_full():  
##    some observed variances are (at least) a factor 1000 times larger than others; use varTable(fit) to investigate
# get a summary of the model: fit measures (since we specified `fit.measures = TRUE`), standardized estimates (since we specified `standardized = TRUE`; these are the same coefficients we would get had we scaled all the data), and R-squared values (since we specified `rsq = TRUE`)
summary(model1.fit, rsq = TRUE, fit.measures = TRUE, standardized = TRUE) 
## lavaan 0.6-19 ended normally after 21 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         8
## 
##   Number of observations                            60
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 5.118
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.024
## 
## Model Test Baseline Model:
## 
##   Test statistic                               175.155
##   Degrees of freedom                                 6
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.976
##   Tucker-Lewis Index (TLI)                       0.854
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)               -580.210
##   Loglikelihood unrestricted model (H1)       -577.651
##                                                       
##   Akaike (AIC)                                1176.420
##   Bayesian (BIC)                              1193.175
##   Sample-size adjusted Bayesian (SABIC)       1168.013
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.262
##   90 Percent confidence interval - lower         0.077
##   90 Percent confidence interval - upper         0.504
##   P-value H_0: RMSEA <= 0.050                    0.034
##   P-value H_0: RMSEA >= 0.080                    0.948
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.054
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   precip ~                                                              
##     years            -0.077    0.072   -1.059    0.290   -0.077   -0.135
##   temperature ~                                                         
##     years             0.050    0.004   13.027    0.000    0.050    0.860
##   catrich ~                                                             
##     years            -1.702    0.652   -2.609    0.009   -1.702   -0.314
##     temperature     -54.746   11.124   -4.921    0.000  -54.746   -0.592
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##  .precip ~~                                                             
##    .catrich          64.990   57.207    1.136    0.256   64.990    0.148
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .precip           94.459   17.246    5.477    0.000   94.459    0.982
##    .temperature       0.268    0.049    5.477    0.000    0.268    0.261
##    .catrich        2034.063  371.367    5.477    0.000 2034.063    0.231
## 
## R-Square:
##                    Estimate
##     precip            0.018
##     temperature       0.739
##     catrich           0.769
# Plot model
lavaanPlot(name = "model1", model1.fit, coefs = TRUE) # draws the path diagram of the relationships specified in `model1`, labeling each path with the unstandardized coefficient estimates from the fitted model
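Beyond the plot, lavaan provides accessor functions for pulling pieces of the fit programmatically, which is handy for reporting. The sketch below refits the model so it runs standalone, then uses `parameterEstimates()` for a tidy table of paths and `fitMeasures()` for named fit statistics.

```r
library(lavaan)

set.seed(2002)

# Re-create the simulated data and refit so this block is self-contained
semdata <- data.frame(years = 1:60) |>
  within({
    precip <- rnorm(60, 100, 10)
    temperature <- 0.05 + 0.05 * years + rnorm(60, 1, 0.5)
    catrich <- 350 - 2 * years - 50 * temperature + rnorm(60, 50, 50)
  })

model1 <- '
  precip + temperature ~ years
  catrich ~ years + temperature'
model1.fit <- sem(model1, data = semdata)

# Tidy data frame of paths, estimates, SEs, and confidence intervals
head(parameterEstimates(model1.fit, standardized = TRUE))

# Pull out specific fit statistics by name
fitMeasures(model1.fit, c("cfi", "rmsea", "srmr"))
```

Working from the `parameterEstimates()` data frame rather than the printed summary makes it easy to build tables or filter to the paths you care about.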

Let’s break down the output: