#Load libraries
library(hillR)
library(tidyverse)
library(vegan)
library(asbio)
library(gridExtra)
library(cowplot)
library(nFactors)

Overview:

  1. Calculate diversity indices to reduce the dimensionality of species columns used as response variables. As part of this, you will review richness, Simpson's diversity, and Shannon's diversity, along with the associated concept of Hill numbers.
  2. Explore multivariate methods, including Multivariate Analysis of Variance (MANOVA), ordination, and multivariate regression.

Lab Introduction: The Analysis of Multivariate Data

Welcome back! So far the data sets we have been working with have not contained more than two predictor or response variables. However, that is often NOT the case in ecology and instead we are interested in how multiple response variables are related simultaneously to one or more predictors. What happens if we are interested in the impacts of nitrogen addition on a plant community and we measure the cover of all plants we find? Suddenly instead of nitrogen predicting one plant we are using nitrogen to predict MANY plants. What if we are interested in the impacts of many climate variables on plant germination?

Today’s lab will show how different multivariate methods can help to identify patterns AND reduce the dimensionality of our data. This includes useful summary statistics such as richness, diversity indices, and Hill numbers. This also includes multivariate techniques such as PCA, RDA, factor analysis, and random forest. This is certainly not an exhaustive list, but these are commonly used techniques that can help us better understand data sets with many variables.

Figure 1: Summary & Comparison of Multivariate Methods

Summarizing Multivariate Response Variables: Richness, Diversity, and Hill Numbers

Background

Brief Overview

This section focuses on alpha diversity. Richness, diversity indices, and Hill numbers are indices that summarize multivariate data across some unit of collection. These indices are most often used in reference to species but, because they represent counts of unique entities (richness) or counts of unique entities weighted by their abundance (evenness), they have broad applications (e.g. DNA sequences, semiotics, OTUs, etc.). These summary statistics can serve as either predictors or response variables, depending on the data and the questions being asked. The key takeaway of this section, although we go into a bit of detail about measures of diversity, is that diversity indices effectively reduce the dimensionality of what would otherwise be multiple columns of species abundance data.

Richness, Simpson’s, and Shannon’s Diversity

As mentioned above, richness refers to the number of unique entities within a group. Methods that take abundance into account, and thereby represent measures of evenness, include Simpson’s and Shannon’s diversity. Both of these indices use the proportion of individuals belonging to a given species (pi) relative to the total number of individuals. Because of differences in their calculations, however, Simpson’s diversity puts more weight on common species and evenness, while Shannon’s diversity is more sensitive to species richness and hence to the presence of rare species. One is not better than the other, and it is often good practice to report values for each since they add different layers of information. These are commonly plotted as diversity profile curves, using richness, Shannon, and Simpson values transformed into Hill numbers.
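
For reference, the two indices shown in Figure 2 (and computed by hand later in this lab) are typically written as:

$$H' = -\sum_{i=1}^{S} p_i \ln(p_i) \quad \text{(Shannon)} \qquad\qquad D = 1 - \sum_{i=1}^{S} p_i^2 \quad \text{(Simpson, Gini-Simpson form)}$$

where $p_i$ is the proportion of individuals belonging to species $i$ and $S$ is the number of species present.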

Figure 2: Shannon’s and Simpson’s diversity indices

Hill Numbers

Following an important publication (Jost et al. 2006), Hill numbers have become, in many cases, the preferred way to report diversity. Hill numbers are a convenient algebraic transformation of richness and other diversity indices (e.g. Simpson’s and Shannon’s), such that values for diversity can be intuitively interpreted as “equivalent numbers of species”. For example, if diversity = 4 in group A and diversity = 8 in group B, we can say group B is twice as diverse as group A when the metric for diversity is expressed in equivalent species. If you did not use Hill numbers, you could deduce that diversity is greater in group B, but not by an intuitive or comparable magnitude. When you transform values for diversity into effective species, you linearize the relationship between different measures of diversity (Figure 3). Click here if you are new to the concept of Hill numbers. Click this link if you are interested in learning more about this transformation. Hill numbers are calculated using an equation that contains an order parameter, q. By increasing q we increase the importance of abundant species: q = 0 is species richness, which ignores abundance, while q = 1 and q = 2 correspond to Shannon’s diversity (its exponential) and the inverse of Simpson’s diversity, respectively. Hill numbers are expressed as the effective number of species (or of whatever else you are measuring), which should be thought of as the equivalent number of equally abundant species.
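
For reference, the general Hill number of order q is defined (standard definition, consistent with the hillR calculations used later in this lab) as:

$$^{q}D = \left( \sum_{i=1}^{S} p_i^{\,q} \right)^{1/(1-q)}$$

with the q = 1 case taken as the limit, $^{1}D = \exp\!\left(-\sum_i p_i \ln p_i\right) = \exp(H')$, and $^{2}D = 1/\sum_i p_i^2$, the inverse of Simpson’s concentration.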

Figure 3: Linearizing the relationship between diversity indices to make for more intuitive comparison

Image taken from here

Estimating Diversity

We do not get into detail here, but if you are less familiar with methods of calculating and estimating diversity, check out Chapter 13: The Measurement of Biodiversity in A Primer of Ecological Statistics (Gotelli & Ellison 2013). In the code to follow, we will calculate species diversity to describe the different elevations from which we collected data. It is important to remember that if we wish to compare the diversity of two groups thoroughly, sampling effects need to be taken into account. Sampling effect refers to the idea that:

  1. The diversity of species sampled from a group is correlated with abundance in that group.
  2. We collect only a sample of what is there, such that if we were to continue collecting, we would expect to accumulate more species.

Making inferences about the diversity of an assemblage and getting confidence intervals around our estimates typically involves: 1) creating rarefaction curves, 2) comparing these rarefaction curves statistically or by gauging overlap in their confidence intervals, and 3) extrapolating species richness beyond the range of sampling units through the use of asymptotic estimators (Anne Chao does a lot of work on this); a minimal sketch of this workflow is given below. In Figure 4 below, we see a diversity profile with rarefaction curves. Each rarefaction curve has an interpolated and an extrapolated component. Each line represents a different measure of diversity, and the y axis can be interpreted as species equivalents. We see that diversity, for all measures including richness, is not very different between the girdled and logged groups.
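
We will not run a rarefaction/extrapolation analysis in this lab, but for reference, a minimal sketch of how diversity profiles like those in Figure 4 could be produced with the iNEXT package (not loaded above; `abund_list` is a hypothetical named list of abundance vectors, one per forest type) might look like:

library(iNEXT)  # not loaded at the top of this lab
# `abund_list` is a hypothetical named list of abundance vectors, one per group (e.g. girdled, logged)
out <- iNEXT(abund_list, q = c(0, 1, 2), datatype = "abundance")
ggiNEXT(out, type = 1)  # sample-size-based rarefaction/extrapolation curves with confidence intervals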

Figure 4: Diversity profiles for two treatments (forest types - logged and girdled) for all three diversity measures. These diversity profiles indicate interpolated and extrapolated diversity (using a Chao asymptotic estimator) and include their 95% CI.

Practice

The data we will work with contain abundances of 20 plant species recorded in plots at 3 elevations, along with columns identifying the plot and elevation. There are 20 plots at each elevation, for a total of 60 plots. We want to know how plant communities change with elevation. In this case we have 20 response variables, so what can we do? We could run 20 different linear models, but is there a more efficient way? We probably do not want to have to discuss 20 models in a paper. This is where summaries like richness and diversity come in.

community_data <- read.csv("community_data.csv")
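
A quick structural check (output not shown) confirms the layout assumed in the code below: 60 rows (plots) and 22 columns, with elevation and plot first and the 20 species abundance columns in positions 3 through 22.

dim(community_data)          # should be 60 rows x 22 columns
head(community_data[, 1:5])  # elevation, plot, and the first few species columns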

There are lots of packages in R to calculate diversity and hill numbers. A few popular ones include:

  • vegan
  • iNEXT
  • SpadeR

We will do it manually to show under-the-hood processes. Below we create 3 functions to calculate richness and diversity manually.

Richness

#The function calculates the total number of unique species present. It uses the `length` function to return the number of entries in `data` that are greater than zero (i.e. not equal to 0, via `!= 0`)
richness <- function(data){
  return(length(data[data != 0]))
}

#using `apply()` to apply `richness()` (the function we made above) to each row of columns 3 through 22. `rich` is a vector of the number of species present in each row - in other words, the number of unique species in each plot at each elevation
rich <- apply(community_data[,3:22], 1, richness)
rich
##  [1]  5  5  5  5  5  5  4  5  5  4  5  4  5  5  5  5  5  5  5  4 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
## [38] 20 20 20 13 12 11 13 14 14 14 14 13 12 14 16 14 16 12 14 16 15 14 13

Shannon’s Diversity

#the function calculates Shannon's diversity. Recall that Shannon's diversity is calculated by summing, across all species present in a community, each species' proportion of individuals multiplied by the natural log of that proportion, and then multiplying the sum by -1. 
shannon <- function(data){
  p <- data[data != 0]/sum(data)
  return(-sum(p*log(p)))
}

#applying `shannon()` using `apply()`. The second argument (1) indicates that we apply `shannon()` to each row of columns 3 through 22 
shan <- apply(community_data[,3:22], 1, shannon) 
shan
##  [1] 1.432260 1.419891 1.494874 1.556983 1.452994 1.313304 1.353525 1.520321 1.467236 1.340860 1.409487 1.382378
## [13] 1.385268 1.510398 1.505855 1.414772 1.419642 1.399484 1.428677 1.277404 2.944390 2.954487 2.942325 2.962073
## [25] 2.934473 2.961329 2.952017 2.956611 2.960266 2.951858 2.948055 2.949616 2.948878 2.947339 2.946155 2.953208
## [37] 2.913105 2.974281 2.929710 2.966129 1.847007 1.740696 1.595708 1.976775 1.732708 1.902856 1.962446 2.018403
## [49] 1.800655 1.814542 1.814255 2.017998 1.741357 2.010400 1.715785 1.851572 2.030285 1.993660 1.815625 1.759718

Simpson’s Diversity

#the function calculates Simpson's diversity: 1 minus the sum of the squared species proportions
simpson <- function(data){
  p <- data/sum(data) # proportion of individuals belonging to each species
  return(1 - sum(p^2))
}

simp <- apply(community_data[,3:22], 1, simpson)
simp
##  [1] 0.7328000 0.7242798 0.7538644 0.7786961 0.7382271 0.6776406 0.7350000 0.7681756 0.7428571 0.7266667 0.7180900
## [12] 0.7480469 0.7235996 0.7673469 0.7600000 0.7355556 0.7283737 0.7145062 0.7219388 0.6995398 0.9450986 0.9459146
## [23] 0.9445538 0.9464592 0.9436399 0.9465318 0.9458115 0.9460317 0.9462132 0.9460267 0.9453663 0.9456055 0.9455661
## [34] 0.9450607 0.9451993 0.9453979 0.9418262 0.9479921 0.9433013 0.9471548 0.7659172 0.7477551 0.7376602 0.8110723
## [45] 0.7596953 0.7866391 0.8064000 0.8177778 0.7750865 0.7805493 0.7700617 0.8044898 0.7585323 0.8057958 0.7543750
## [56] 0.7812500 0.8143210 0.8048907 0.7690625 0.7620408

If we did the same calculations as above using the vegan package, it would look like:

rich.vegan <- specnumber(community_data[,3:22]) #richness
rich.vegan
##  [1]  5  5  5  5  5  5  4  5  5  4  5  4  5  5  5  5  5  5  5  4 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
## [38] 20 20 20 13 12 11 13 14 14 14 14 13 12 14 16 14 16 12 14 16 15 14 13
shan.vegan <- diversity(community_data[,3:22], index = "shannon") # shannon diversity
shan.vegan
##  [1] 1.432260 1.419891 1.494874 1.556983 1.452994 1.313304 1.353525 1.520321 1.467236 1.340860 1.409487 1.382378
## [13] 1.385268 1.510398 1.505855 1.414772 1.419642 1.399484 1.428677 1.277404 2.944390 2.954487 2.942325 2.962073
## [25] 2.934473 2.961329 2.952017 2.956611 2.960266 2.951858 2.948055 2.949616 2.948878 2.947339 2.946155 2.953208
## [37] 2.913105 2.974281 2.929710 2.966129 1.847007 1.740696 1.595708 1.976775 1.732708 1.902856 1.962446 2.018403
## [49] 1.800655 1.814542 1.814255 2.017998 1.741357 2.010400 1.715785 1.851572 2.030285 1.993660 1.815625 1.759718
simp.vegan <- diversity(community_data[,3:22], index = "simpson") # simpson diversity
simp.vegan
##  [1] 0.7328000 0.7242798 0.7538644 0.7786961 0.7382271 0.6776406 0.7350000 0.7681756 0.7428571 0.7266667 0.7180900
## [12] 0.7480469 0.7235996 0.7673469 0.7600000 0.7355556 0.7283737 0.7145062 0.7219388 0.6995398 0.9450986 0.9459146
## [23] 0.9445538 0.9464592 0.9436399 0.9465318 0.9458115 0.9460317 0.9462132 0.9460267 0.9453663 0.9456055 0.9455661
## [34] 0.9450607 0.9451993 0.9453979 0.9418262 0.9479921 0.9433013 0.9471548 0.7659172 0.7477551 0.7376602 0.8110723
## [45] 0.7596953 0.7866391 0.8064000 0.8177778 0.7750865 0.7805493 0.7700617 0.8044898 0.7585323 0.8057958 0.7543750
## [56] 0.7812500 0.8143210 0.8048907 0.7690625 0.7620408
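
vegan can also return the inverse Simpson form directly, which is handy to keep in mind when we look at Hill numbers below:

invsimp.vegan <- diversity(community_data[,3:22], index = "invsimpson") # 1/sum(p^2), i.e. 1/(1 - simp)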

Next we will combine into a data frame.

diversity_data <- cbind(community_data[c('elevation', 'plot')],
                        rich, shan, simp)

head(diversity_data)
##   elevation  plot rich     shan      simp
## 1      high Plot1    5 1.432260 0.7328000
## 2      high Plot2    5 1.419891 0.7242798
## 3      high Plot3    5 1.494874 0.7538644
## 4      high Plot4    5 1.556983 0.7786961
## 5      high Plot5    5 1.452994 0.7382271
## 6      high Plot6    5 1.313304 0.6776406

Notice we have effectively reduced the number of response variables in our data set from 20 to 3. This is much more manageable for running linear models. Of course, by doing this we change our response variable. So instead of being able to say something like “this plant is more abundant at lower elevations,” we say something like “richness and diversity differ with elevation.”

Recall that Hill numbers are transformed versions of richness, Simpson’s, and Shannon’s diversity.

Here I use the hillR package to calculate them.

# calculates richness
diversity_data$h0 <- hillR::hill_taxa(community_data[,3:22], q = 0)
# calculates effective Shannon's diversity (i.e. exp(shan))
diversity_data$h1 <- hillR::hill_taxa(community_data[,3:22], q = 1) 
# calculates inverse Simpson's diversity (i.e. 1/(1 - simp))
diversity_data$h2 <- hillR::hill_taxa(community_data[,3:22], q = 2) 
diversity_data
##    elevation   plot rich     shan      simp h0        h1        h2
## 1       high  Plot1    5 1.432260 0.7328000  5  4.188152  3.742515
## 2       high  Plot2    5 1.419891 0.7242798  5  4.136668  3.626866
## 3       high  Plot3    5 1.494874 0.7538644  5  4.458773  4.062802
## 4       high  Plot4    5 1.556983 0.7786961  5  4.744486  4.518672
## 5       high  Plot5    5 1.452994 0.7382271  5  4.275897  3.820106
## 6       high  Plot6    5 1.313304 0.6776406  5  3.718440  3.102128
## 7       high  Plot7    4 1.353525 0.7350000  4  3.871048  3.773585
## 8       high  Plot8    5 1.520321 0.7681756  5  4.573693  4.313609
## 9       high  Plot9    5 1.467236 0.7428571  5  4.337229  3.888889
## 10      high Plot10    4 1.340860 0.7266667  4  3.822329  3.658537
## 11      high Plot11    5 1.409487 0.7180900  5  4.093855  3.547231
## 12      high Plot12    4 1.382378 0.7480469  4  3.984365  3.968992
## 13      high Plot13    5 1.385268 0.7235996  5  3.995897  3.617940
## 14      high Plot14    5 1.510398 0.7673469  5  4.528532  4.298246
## 15      high Plot15    5 1.505855 0.7600000  5  4.508007  4.166667
## 16      high Plot16    5 1.414772 0.7355556  5  4.115549  3.781513
## 17      high Plot17    5 1.419642 0.7283737  5  4.135641  3.681529
## 18      high Plot18    5 1.399484 0.7145062  5  4.053107  3.502703
## 19      high Plot19    5 1.428677 0.7219388  5  4.173176  3.596330
## 20      high Plot20    4 1.277404 0.6995398  4  3.587314  3.328228
## 21       mid  Plot1   20 2.944390 0.9450986 20 18.999077 18.214476
## 22       mid  Plot2   20 2.954487 0.9459146 20 19.191880 18.489281
## 23       mid  Plot3   20 2.942325 0.9445538 20 18.959874 18.035505
## 24       mid  Plot4   20 2.962073 0.9464592 20 19.338019 18.677342
## 25       mid  Plot5   20 2.934473 0.9436399 20 18.811584 17.743058
## 26       mid  Plot6   20 2.961329 0.9465318 20 19.323645 18.702715
## 27       mid  Plot7   20 2.952017 0.9458115 20 19.144527 18.454113
## 28       mid  Plot8   20 2.956611 0.9460317 20 19.232691 18.529412
## 29       mid  Plot9   20 2.960266 0.9462132 20 19.303115 18.591906
## 30       mid Plot10   20 2.951858 0.9460267 20 19.141488 18.527679
## 31       mid Plot11   20 2.948055 0.9453663 20 19.068823 18.303716
## 32       mid Plot12   20 2.949616 0.9456055 20 19.098615 18.384224
## 33       mid Plot13   20 2.948878 0.9455661 20 19.084538 18.370907
## 34       mid Plot14   20 2.947339 0.9450607 20 19.055185 18.201908
## 35       mid Plot15   20 2.946155 0.9451993 20 19.032637 18.247937
## 36       mid Plot16   20 2.953208 0.9453979 20 19.167351 18.314322
## 37       mid Plot17   20 2.913105 0.9418262 20 18.413881 17.189854
## 38       mid Plot18   20 2.974281 0.9479921 20 19.575536 19.227848
## 39       mid Plot19   20 2.929710 0.9433013 20 18.722196 17.637097
## 40       mid Plot20   20 2.966129 0.9471548 20 19.416607 18.923185
## 41       low  Plot1   13 1.847007 0.7659172 13  6.340811  4.271992
## 42       low  Plot2   12 1.740696 0.7477551 12  5.701311  3.964401
## 43       low  Plot3   11 1.595708 0.7376602 11  4.931819  3.811849
## 44       low  Plot4   13 1.976775 0.8110723 13  7.219423  5.293030
## 45       low  Plot5   14 1.732708 0.7596953 14  5.655949  4.161383
## 46       low  Plot6   14 1.902856 0.7866391 14  6.705014  4.686895
## 47       low  Plot7   14 1.962446 0.8064000 14  7.116716  5.165289
## 48       low  Plot8   14 2.018403 0.8177778 14  7.526295  5.487805
## 49       low  Plot9   13 1.800655 0.7750865 13  6.053613  4.446154
## 50       low Plot10   12 1.814542 0.7805493 12  6.138265  4.556831
## 51       low Plot11   14 1.814255 0.7700617 14  6.136501  4.348993
## 52       low Plot12   16 2.017998 0.8044898 16  7.523249  5.114823
## 53       low Plot13   14 1.741357 0.7585323 14  5.705080  4.141340
## 54       low Plot14   16 2.010400 0.8057958 16  7.466304  5.149220
## 55       low Plot15   12 1.715785 0.7543750 12  5.561042  4.071247
## 56       low Plot16   14 1.851572 0.7812500 14  6.369827  4.571429
## 57       low Plot17   16 2.030285 0.8143210 16  7.616253  5.385638
## 58       low Plot18   15 1.993660 0.8048907 15  7.342357  5.125333
## 59       low Plot19   14 1.815625 0.7690625 14  6.144913  4.330176
## 60       low Plot20   13 1.759718 0.7620408 13  5.810798  4.202401
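
Because Hill numbers are just algebraic transformations of the indices we already computed, we can verify the q = 1 and q = 2 columns directly:

# h1 should equal exp(Shannon) and h2 should equal inverse Simpson, 1/(1 - simp);
# both checks should return TRUE (up to floating-point tolerance)
all.equal(diversity_data$h1, exp(diversity_data$shan))
all.equal(diversity_data$h2, 1/(1 - diversity_data$simp))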

We can plot these to examine the differences in simulated diversity at the three elevations. Here we see that diversity is greatest at mid elevations, where richness is 20 and the higher-order q values are around 18. This tells us that the mid-elevation diversity is equivalent to a community containing about 18 equally abundant species. Compare this with the low elevations: there we also see fairly high richness, but the drop-off between q = 0 and q = 1 tells us that those communities are uneven and dominated by a few species.

plot_data <- diversity_data |>
  select(elevation, plot, h0, h1, h2) |>
  gather(3:5, key = "var", value = "val") |>
  transform(elevation = factor(elevation, levels = c('low','mid','high')))

ggplot(plot_data, aes(var, val))+
  geom_errorbar(stat = 'summary', width = .5)+
  geom_point(stat = 'summary')+
  labs(x = "q", y = "Effective number of species")+
  scale_x_discrete(labels = c("0", "1", "2"))+
  facet_grid(cols=vars(elevation))+
  theme_classic()

Setting the Stage for Multivariate Predictors: A Quick Review on Multicollinearity

Multicollinearity is common when data sets contain many variables, yet regression assumes that the predictors are not strongly correlated with one another. Further, when multiple response variables are taken from the same individual they are not independent of each other. Essentially, if a linear model contains predictors that are too highly correlated, the model will have trouble fitting their parameters. Let’s simulate some data and take a quick look.

Let’s say we are interested in the physical strain exerted by a protagonist as they flee the evil beings (Trollocs, Fades, etc.). To do this we measure their running speed, their fear, and their heart rate.

set.seed(100) 

#simulate 50 observations, with a mean of 10 and standard deviation of 2
running_speed <- rnorm(50, 10, 2) 

#simulate a variable `fear`. Fear will be correlated with running speed `0.5*running_speed` and we will add some variation to that with `rnorm(50,0,0.15)`
fear <- 0.5*running_speed + rnorm(50, 0, 0.15) 

#simulate a variable `heart_rate`. Heart rate will also be correlated with running speed via `8.5*running_speed`, and we add some variation with `rnorm(50,0,1)`. We add a baseline of 70 so that the values are realistic in beats per minute
heart_rate <- 70 + 8.5*running_speed + rnorm(50, 0, 1)

We want to know what drives up their heart rate more, their fear or running speed? We plot each predictor fear and running_speed against heart_rate.

plot(running_speed, heart_rate) 

plot(fear, heart_rate)

Here we see that both variables, fear and running_speed, are linearly related to heart rate. Are these two predictors correlated with each other?

cor(running_speed, fear)
## [1] 0.9825859

Looks like they are. Now we will fit separate linear models for each predictor and look at their coefficient estimates. We are doing this so that we can next compare these coefficients to those from a model that includes both predictors.

speed_model <- lm(heart_rate ~ running_speed) # simple linear model with running speed predicting heart rate
summary(speed_model)
## 
## Call:
## lm(formula = heart_rate ~ running_speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0306 -0.4018 -0.1688  0.5249  1.7252 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   69.92361    0.71962   97.17   <2e-16 ***
## running_speed  8.49770    0.06992  121.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8018 on 48 degrees of freedom
## Multiple R-squared:  0.9968, Adjusted R-squared:  0.9967 
## F-statistic: 1.477e+04 on 1 and 48 DF,  p-value: < 2.2e-16
fear_model <- lm(heart_rate ~ fear)
summary(fear_model)
## 
## Call:
## lm(formula = heart_rate ~ fear)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2175 -1.9266 -0.1752  1.4859  5.7363 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  79.1539     2.2581   35.05   <2e-16 ***
## fear         15.2129     0.4387   34.68   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.76 on 48 degrees of freedom
## Multiple R-squared:  0.9616, Adjusted R-squared:  0.9608 
## F-statistic:  1203 on 1 and 48 DF,  p-value: < 2.2e-16

Now, let’s check out the coefficients for the model that considers fear and running_speed together as predictors in the model.

full_model <- lm(heart_rate ~ running_speed + fear)
summary(full_model)
## 
## Call:
## lm(formula = heart_rate ~ running_speed + fear)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0103 -0.4106 -0.1734  0.5309  1.6958 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    69.8555     0.7800   89.56   <2e-16 ***
## running_speed   8.5875     0.3801   22.59   <2e-16 ***
## fear           -0.1665     0.6927   -0.24    0.811    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8097 on 47 degrees of freedom
## Multiple R-squared:  0.9968, Adjusted R-squared:  0.9966 
## F-statistic:  7240 on 2 and 47 DF,  p-value: < 2.2e-16

Here we see the problem with collinearity. The relationship between fear and heart rate disappears, and the standard errors (and therefore the p-values) of both estimates are much larger than in the single-predictor models. Multicollinearity can cause small changes in the model to have drastic impacts on the coefficient estimates. So what does this have to do with multivariate data sets? If we have data sets with many predictor variables, some of them are bound to be correlated. Also, all of the models we have run so far have one response variable - what if we have more?
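
One common way to quantify this problem is the variance inflation factor (VIF): regress each predictor on the remaining predictors and compute 1/(1 - R^2); values much above roughly 5-10 signal troublesome collinearity. A minimal sketch in base R (the vif() function in the car package automates this):

# regress one predictor on the other, then convert the R^2 to a VIF
r2_fear  <- summary(lm(fear ~ running_speed))$r.squared
vif_fear <- 1 / (1 - r2_fear)
vif_fear  # very large here, because cor(fear, running_speed) is about 0.98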

Multivariate Analyses: PCA, Factor Analysis, RDA

Matrix algebra is central to understanding the math behind multivariate analyses. If you are not yet familiar with it, it is highly recommended that you take time to read/watch the recommended resources to get a true understanding of these methods. I give an in-depth example for MANOVA using Excel to explicitly show the matrix math involved, but do not go into that depth for the other methods. Helpful review includes:

Variance, Covariance, and Correlation Matrices

It’s no surprise that if we are interested in modeling two or more response variables, or in reducing multiple variables into a representative component, a natural step is to ask how, mathematically, those variables covary with each other. Click here for a review of the concept of covariance. The structure of variance, covariance, and correlation among multiple variables in a multivariate data set is a major component of the under-the-hood math involved in many multivariate statistics. Three matrices that you will read about repeatedly when conducting such analyses are:

  1. the covariance matrix (also called the variance-covariance matrix or variance matrix) (Fig. 5A)
  2. the matrix of sample standard deviations (Fig. 5B)
  3. the matrix of sample correlations (Fig. 5C)
Figure 5: Outlining components of a covariance, standard deviation and sample correlation matrix
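
These matrices are easy to compute directly in R; for example, using the simulated running-speed, fear, and heart-rate variables from the previous section:

strain <- data.frame(running_speed, fear, heart_rate)
cov(strain)            # variance-covariance matrix (variances on the diagonal)
cor(strain)            # correlation matrix (covariances standardized by the standard deviations)
cov2cor(cov(strain))   # same result: converting the covariance matrix into a correlation matrix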

Types of multivariate analyses

Gotelli & Ellison group multivariate analyses into 4 categories. While I think factor analysis fits better in a lesson on structural equation modeling, it is mentioned here under ordination, because it can be used, like PCA, as a data reduction technique.

  1. MANOVA (multivariate ANOVA): Compares multivariate means. Similar to the ANOVA design, where you compute AMONG and WITHIN group variance, MANOVA extends this to cases where you have multiple response variables (and the predictor is categorical) - hence, matrix algebra is necessary.
  2. Ordination: Creates new variables called principal axes along which samples are ordered or scored, effectively reducing the dimensionality of the data set (i.e. data reduction techniques). Ordination methods include: Principal Component Analysis (PCA), Factor Analysis, Principal Coordinates Analysis, Correspondence Analysis, Non-metric Multidimensional Scaling, and many others. These methods are useful for data exploration and pattern recognition. The resulting bi-plots (graphs that plot samples along two principal axes) allow us to explore relationships among environmental variables and species assemblages. Additionally, the scores or factor loadings can be used as response or predictor variables in a model.
  3. Classification: Grouping similar objects into classes that are distinguishable from neighboring classes. Classification methods include: Cluster Analysis and Discriminant Analysis.
  4. Multivariate multiple regression: Regression with multiple response variables. These methods include: Redundancy Analysis and Canonical Correspondence Analysis.

MANOVA

Mechanics of MANOVA

Let’s read in some data. I chose the same pitcher plant data set referenced in Chapter 12 of Primer of Ecological Statistics (Gotelli and Ellison 2012). The data set has 10 morphological measurements of the pitcher plant (Darlingtonia californica) taken at 4 sites. The four sites will be treated as our categorical predictor variable. Later we will subset the 10 variables to the 7 variables referenced in the chapter. By using a MANOVA we are evaluating whether the morphological measurements for Darlingtonia differ significantly among sites.

Figure 6: Null hypotheses for the pitcher plant example and a picture of Darlingtonia californica along with the measurements taken
pitcher_data <- read.csv("manova_example.csv", header=TRUE)
head(pitcher_data)
##   site height mouth.diam tube.diam keel.diam wing1.length wing2.length wingspread hoodarea wingarea tubearea
## 1  TJH    654       38.4      16.6       6.4           85           76         55    63.77    33.65    87.15
## 2  TJH    413       22.2      17.2       5.9           55           26         60    21.10     7.36    44.78
## 3  TJH    610       31.2      19.9       6.7           62           60         78    28.47    15.75    56.64
## 4  TJH    546       34.4      20.8       6.3           84           79         95    48.82    30.47    76.31
## 5  TJH    665       30.5      20.4       6.6           60           51         30    29.48    11.33   100.22
## 6  TJH    665       33.6      19.5       6.6           84           66         82    55.67    27.54   106.12
pitcher_long <- gather(pitcher_data, variable, value, height:tubearea, factor_key=TRUE)
head(pitcher_long)
##   site variable value
## 1  TJH   height   654
## 2  TJH   height   413
## 3  TJH   height   610
## 4  TJH   height   546
## 5  TJH   height   665
## 6  TJH   height   665

Let’s examine the data visually

ggplot(pitcher_long, aes(site, value, fill = site)) + 
  geom_boxplot(outlier.shape = NA)+
  facet_wrap(. ~ variable, scales = 'free', shrink = TRUE)+
  xlab('')+
  ylab('')

Assumptions and preliminary tests

MANOVA makes several assumptions about the data (these are listed under Considerations for MANOVA near the end of this section). This section is not focused on meeting those assumptions; instead I focus on comparing the process of conducting a MANOVA in R and by hand. First, a quick check of the sample size per variable:

pitcher_long |>
  group_by(variable) |>
  summarise(N = n())
## # A tibble: 10 × 2
##    variable         N
##    <fct>        <int>
##  1 height          84
##  2 mouth.diam      84
##  3 tube.diam       84
##  4 keel.diam       84
##  5 wing1.length    84
##  6 wing2.length    84
##  7 wingspread      84
##  8 hoodarea        84
##  9 wingarea        84
## 10 tubearea        84

There are 84 observations per variable.

Evaluate Multivariate normality

The null hypothesis of the Doornik-Hansen test for multivariate normality is that the variables are multivariate normal (Doornik and Hansen 2008).

#we are making a matrix of the response (dependent) variables of interest. I chose to work with the variables in columns 2 through 8 of `pitcher_data` for consistency with the example in the Gotelli & Ellison chapter 12. `manova()`, which we use below, requires the dependent variables to be in a matrix of their own; `vars` would be a data.frame if I did not use `as.matrix()`
vars <- as.matrix(pitcher_data[2:8]) 
head(vars)
##      height mouth.diam tube.diam keel.diam wing1.length wing2.length wingspread
## [1,]    654       38.4      16.6       6.4           85           76         55
## [2,]    413       22.2      17.2       5.9           55           26         60
## [3,]    610       31.2      19.9       6.7           62           60         78
## [4,]    546       34.4      20.8       6.3           84           79         95
## [5,]    665       30.5      20.4       6.6           60           51         30
## [6,]    665       33.6      19.5       6.6           84           66         82
DH.test(pitcher_data[2:8], Y.names = NULL) #Doornik-Hansen test for multivariate normality
## $multi
##          E df P(Chi > E)
## 1 1920.485 14          0
## 
## $univ
##            E df   P(Chi > E)
## Y1 308.49811  2 1.024440e-67
## Y2 330.63864  2 1.594896e-72
## Y3 326.21447  2 1.456896e-71
## Y4 331.54489  2 1.013776e-72
## Y5 216.22224  2 1.116699e-47
## Y6 318.52362  2 6.815139e-70
## Y7  88.84302  2 5.104847e-20

The data do not meet the assumption of multivariate normality. We could consider running the MANOVA after transforming the outcome variables. We could also proceed regardless, as MANOVA is fairly robust to deviations from normality.

Evaluate Multicollinearity

cor(vars) # you can run options(scipen = 999) if R is outputting parts of this in scientific notation
##                  height  mouth.diam     tube.diam  keel.diam  wing1.length wing2.length wingspread
## height        1.0000000  0.63006322  1.571517e-01 -0.1421400  3.362256e-01   0.31731146  0.1718566
## mouth.diam    0.6300632  1.00000000 -4.686177e-02 -0.3917945  5.729505e-01   0.43080463  0.2413659
## tube.diam     0.1571517 -0.04686177  1.000000e+00  0.5169862 -6.306916e-05   0.08470713  0.2038861
## keel.diam    -0.1421400 -0.39179448  5.169862e-01  1.0000000 -3.358236e-01  -0.27446240 -0.1902531
## wing1.length  0.3362256  0.57295054 -6.306916e-05 -0.3358236  1.000000e+00   0.82140285  0.6149062
## wing2.length  0.3173115  0.43080463  8.470713e-02 -0.2744624  8.214029e-01   1.00000000  0.6999089
## wingspread    0.1718566  0.24136594  2.038861e-01 -0.1902531  6.149062e-01   0.69990888  1.0000000

Wing length 1 and 2 are highly correlated.

Run MANOVA function

The manova() function accepts a formula with the dependent variables formatted as a matrix on the left of the ~ and the grouping factor on the right. In milliseconds manova() does a bunch of matrix math that essentially builds and summarizes variance-covariance (sums of squares & cross products) matrices. These matrices summarize the within- and among-group variation in the multivariate data.

pitcher.manova <- manova(vars ~ pitcher_data$site) # run the  model
pitcher.manova
## Call:
##    manova(vars ~ pitcher_data$site)
## 
## Terms:
##                 pitcher_data$site Residuals
## height                    79790.1  655148.8
## mouth.diam                 1187.2    2080.3
## tube.diam                   215.7     566.6
## keel.diam                   113.2     256.5
## wing1.length              11670.9   27697.9
## wing2.length               6949.2   36111.2
## wingspread                20489.7   80820.2
## Deg. of Freedom                 3        80
## 
## Residual standard errors: 90.49508 5.099353 2.661245 1.790527 18.60709 21.24595 31.78447
## Estimated effects may be unbalanced

The pitcher_data$site values are the diagonal elements of the hypothesis matrix (H). The hypothesis matrix, more generally referred to as a sums of squares & cross products matrix, essentially shows how the group means vary around the overall mean. This is analogous to the “among group variance” in ANOVA.

The Residuals values are the diagonal elements (variances) of the error matrix (E), a sums of squares & cross products matrix that represents within-group variance.

Together, the H and E matrices are used to generate test statistics (e.g. Wilks’ lambda, Pillai’s trace, Hotelling-Lawley trace, Roy’s greatest root) that help us evaluate whether a significant difference exists. All of them essentially compare the error matrix (E) to the total variation (E + H), and an approximate F statistic can be derived from each. Below we see that the default test statistic used when summarizing the manova() model is Pillai’s trace, which is noted for being the most robust to violations of the MANOVA assumptions (e.g. multivariate normality).
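
For concreteness, two of these statistics can be written directly in terms of H and E (standard definitions, not specific to this lab):

$$\Lambda_{\text{Wilks}} = \frac{\det(E)}{\det(H + E)} \qquad\qquad V_{\text{Pillai}} = \operatorname{tr}\!\left[ H \, (H + E)^{-1} \right]$$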

pitcher.manova<-summary(manova(vars ~ pitcher_data$site))# get a summary of the model 
pitcher.manova
##                   Df Pillai approx F num Df den Df    Pr(>F)    
## pitcher_data$site  3 1.1156    6.428     21    228 4.273e-14 ***
## Residuals         80                                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output indicates there are significant differences in pitcher plant measurements among the sites.

In the code below we can extract the Hypothesis Matrix (H) and Error matrix (E) after running the summary function.

pitcher.manova$SS
## $`pitcher_data$site`
##                 height mouth.diam   tube.diam   keel.diam wing1.length wing2.length wingspread
## height       79790.128  8531.8580 -1200.14200 -2791.22071   26539.6879   16569.6700 26676.7264
## mouth.diam    8531.858  1187.1745  -352.02552  -353.65717    2812.9177    1450.2677  1077.1965
## tube.diam    -1200.142  -352.0255   215.67971    95.29806    -589.2937    -208.4982  1003.1442
## keel.diam    -2791.221  -353.6572    95.29806   113.24999   -1036.7754    -634.1192  -598.4751
## wing1.length 26539.688  2812.9177  -589.29370 -1036.77541   11670.9094    8523.6188  9579.4701
## wing2.length 16569.670  1450.2677  -208.49824  -634.11924    8523.6188    6949.1870  8188.8191
## wingspread   26676.726  1077.1965  1003.14423  -598.47513    9579.4701    8188.8191 20489.7484
## 
## $Residuals
##                  height  mouth.diam tube.diam  keel.diam wing1.length wing2.length wingspread
## height       655148.765 22343.64200 4968.2170  448.16000   30651.9550   39878.5800 20217.3450
## mouth.diam    22343.642  2080.27218  277.1055  -76.97283    3685.3490    3659.7657  3314.2368
## tube.diam      4968.217   277.10552  566.5778  182.73444     588.9437     700.1232   811.9058
## keel.diam       448.160   -76.97283  182.7344  256.47894    -244.4603    -461.0058  -565.9177
## wing1.length  30651.955  3685.34900  588.9437 -244.46030   27697.9002   25296.2145 29254.4347
## wing2.length  39878.580  3659.76567  700.1232 -461.00576   25296.2145   36111.2297 38039.3476
## wingspread    20217.345  3314.23683  811.9058 -565.91773   29254.4347   38039.3476 80820.2039
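
As a sanity check on the matrix logic described above, Pillai's trace can be reproduced by hand from these two matrices (a quick sketch; H is the first element of the SS list and E is the Residuals element):

H <- pitcher.manova$SS[[1]]        # hypothesis (among-site) SSCP matrix
E <- pitcher.manova$SS$Residuals   # error (within-site) SSCP matrix
sum(diag(H %*% solve(H + E)))      # Pillai's trace; should match the ~1.1156 reported above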

We can explore each variable separately - “univariate results”.

summary(aov(vars ~ pitcher_data$site))
##  Response height :
##                   Df Sum Sq Mean Sq F value  Pr(>F)  
## pitcher_data$site  3  79790 26596.7  3.2477 0.02614 *
## Residuals         80 655149  8189.4                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response mouth.diam :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## pitcher_data$site  3 1187.2  395.72  15.218 6.355e-08 ***
## Residuals         80 2080.3   26.00                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response tube.diam :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## pitcher_data$site  3 215.68  71.893  10.151 9.715e-06 ***
## Residuals         80 566.58   7.082                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response keel.diam :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## pitcher_data$site  3 113.25  37.750  11.775 1.815e-06 ***
## Residuals         80 256.48   3.206                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response wing1.length :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## pitcher_data$site  3  11671  3890.3  11.236 3.143e-06 ***
## Residuals         80  27698   346.2                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response wing2.length :
##                   Df Sum Sq Mean Sq F value   Pr(>F)   
## pitcher_data$site  3   6949 2316.40  5.1317 0.002687 **
## Residuals         80  36111  451.39                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response wingspread :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## pitcher_data$site  3  20490  6829.9  6.7606 0.0004025 ***
## Residuals         80  80820  1010.3                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This output tells us there are significant differences in the means for ALL of the morphological variables among the 4 sites.

Considerations for MANOVA:

MANOVA has multiple assumptions to consider, which may limit its applicability:

  1. observations are independent and randomly sampled
  2. within-group errors are equal among groups and normally distributed
  3. covariances are equal among groups
  4. errors of the multivariate variables conform to a multivariate normal distribution

If residuals are not multivariate normal (refer to Gotelli & Ellison 2012 pg. 394-98 for a helpful discussion of how to evaluate multivariate normality), then PERMANOVA (permutational multivariate ANOVA) is a non-parametric alternative to MANOVA.

PERMANOVA (permutational multivariate analysis of variance), aka distance-based MANOVA, aka non-parametric MANOVA

Click here for a more detailed discussion of PERMANOVA.

This method has earned itself many names, most commonly PERMANOVA, and each name emphasizes a different attribute of the design. Like ANOVA, MANOVA, and regression, this method tests whether groups differ in a measure of centrality (here, centroids) and spread (here, dispersion). The method was developed relatively recently, after Anderson (2001) showed that the sums of squares could be calculated directly from the distances among data points, rather than from the distances of data points to the mean (as in ANOVA and MANOVA). A minimal example of running one is sketched below.
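
As a quick illustration (a minimal sketch, not a full analysis), a PERMANOVA on the pitcher plant measurements could be run with vegan's adonis2(), using Euclidean distances since these are morphological measurements rather than community counts; with variables on very different scales you would likely also want to standardize them first.

# permutational MANOVA on the same response matrix used above;
# `method` sets the distance measure and 999 permutations are used by default
perm.mod <- adonis2(vars ~ site, data = pitcher_data, method = "euclidean")
perm.mod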