Watch this at home
Regression is just one (the simplest) of many types of linear models that can describe linear relationships (relationships where effects are additive).
The larger family of related methods is called the General Linear Model
Family of statistical models of the form:
\[Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} +\ ... +\ \beta_nX_{ni} + \epsilon_i\]
where:
\(Y_i\) is the value of the response variable for observation \(i\)
\(\beta_0\) is the intercept
\(\beta_1, ..., \beta_n\) are the coefficients (slopes) for the \(n\) explanatory variables
\(\epsilon_i\) is the error term, assumed to be distributed as \(N(0, \sigma^2)\)
Includes many common statistical methods, such as linear regression, ANOVA, and ANCOVA.
All of these use the same function in R, called lm()
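For example (a sketch using R's built-in iris data), the same lm() call handles regression, ANOVA, and models with both kinds of predictors; only the formula changes:

```r
# Simple regression: continuous response ~ continuous predictor
reg <- lm(Petal.Length ~ Sepal.Length, data = iris)

# One-way ANOVA: continuous response ~ categorical predictor
aov_fit <- lm(Petal.Length ~ Species, data = iris)

# Both predictor types in one model (ANCOVA-style)
both <- lm(Petal.Length ~ Sepal.Length + Species, data = iris)

coef(reg)      # intercept and slope
coef(aov_fit)  # intercept (baseline group) plus offsets for the other groups
```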
First decide which variables are explanatory and which is the response, and how many of each you have.
Response variable is always continuous.
Is a person’s response time (seconds) on a cognitive task explained by their age (years)?
Is brain size (cc) in primates explained by body size (g) and the type of diet consumed (frugivore vs. folivore vs. insectivore)?
Is the diameter of the femoral head (mm) explained by body mass (g) and age (years)?
Is a person's income ($) predicted by the city they live in and their level of education (high school, bachelor's, post-grad degree)?
Compare the means of groups that have been sampled randomly, to test whether or not they differ.
The different levels of the categorical variable representing different groups are called treatments.
Each observation is called a replicate and is assumed to be independent from all other replicates.
So how do we compare group means?
Colored bars at top represent variances within groups.
Black bars at top represent variances ignoring groups.
\[Y_{ij}=\mu + A_i + \epsilon_{ij}\]
\(\mu\) is the population grand mean (the sample grand mean \(\bar{Y}\) is an unbiased estimator of \(\mu\))
\(A_i\) is the additive linear effect of treatment \(i\), expressed as a deviation from the grand mean (the \(A_i\) sum to 0).
\(\epsilon_{ij}\) is the error term (the deviation of an individual point from its treatment group mean). These are assumed to be distributed as \(N(0,\sigma^2)\).
\[Y_{ij}=\mu + A_i + \epsilon_{ij}\]
Let’s see how changing model parameters affects the y variable. We will start by defining a few variables that will be kept constant.
grandMean <- 10
groups <- rep(c("A", "B"), 100)
myError <- rnorm(200, mean = 0, sd=3)
y <- grandMean + c(-5, 5) + myError
plot(y ~ factor(groups))
y <- grandMean + c(-0.3, 0.3) + myError
plot(y ~ factor(groups))
As you could tell from your reading, calculating the appropriate sum of squares is complicated, and depends on your experimental design. Be careful (or just use Monte Carlo)!
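One Monte Carlo option is a permutation test: shuffle the group labels so the null hypothesis is true by construction, then compare the observed F-ratio to the distribution of F-ratios from the shuffled data. A sketch using the iris data:

```r
set.seed(42)
# Observed F-ratio for the real grouping
obs_F <- anova(lm(Petal.Length ~ Species, data = iris))$`F value`[1]

# Null distribution: re-fit after randomly shuffling the group labels
perm_F <- replicate(1000, {
  shuffled <- sample(iris$Species)   # breaks any group/response association
  anova(lm(iris$Petal.Length ~ shuffled))$`F value`[1]
})

# p-value: proportion of null F-ratios at least as large as the observed one
p_value <- mean(perm_F >= obs_F)
p_value
```

This sidesteps the sums-of-squares bookkeeping entirely, at the cost of computation.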
Example data from Gotelli Table 10.1
##   unmanipulated control treatment
## 1            10       9        12
## 2            12      11        13
## 3            12      11        15
## 4            13      12        16
These data represent 12 experimental plots to see the effect of soil temperature on the flowering period in larkspur flowers. The treatment plots had heating coils installed and activated. The control plot had heating coils installed but never turned on. The unmanipulated plots were untouched.
We want to know if there is a difference in mean flowering length between these groups.
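These data can be entered directly and analyzed with lm() and anova() (a sketch; the values come from the printout above, and the column names `days` and `plot_type` are made up for illustration):

```r
# Gotelli Table 10.1 data, entered by hand from the printout above
flowering <- data.frame(
  days = c(10, 12, 12, 13,    # unmanipulated
            9, 11, 11, 12,    # control
           12, 13, 15, 16),   # treatment (heated)
  plot_type = factor(rep(c("unmanipulated", "control", "treatment"),
                         each = 4))
)

larkMod <- lm(days ~ plot_type, data = flowering)
anova(larkMod)  # F test for a difference in mean flowering length
```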
| Source | degrees of freedom | SS | Mean Square (MS) | Expected Mean Square | F-ratio |
|---|---|---|---|---|---|
| Among groups | \(a-1\) | \(\sum\limits_{i=1}^a\sum\limits_{j=1}^n(\bar{Y}_i - \bar{Y})^2\) | \(\frac{SS_{among}}{(a-1)}\) | \(\sigma^2 + n\sigma^2_A\) | \(\frac{MS_{among}}{MS_{within}}\) |
| Within groups (residual) | \(a(n-1)\) | \(\sum\limits_{i=1}^a\sum\limits_{j=1}^n(Y_{ij} - \bar{Y}_i)^2\) | \(\frac{SS_{within}}{a(n-1)}\) | \(\sigma^2\) | |
| Total | \(an-1\) | \(\sum\limits_{i=1}^a\sum\limits_{j=1}^n(Y_{ij} - \bar{Y})^2\) | \(\frac{SS_{total}}{(an-1)}\) | \(\sigma^2_Y\) | |
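To make the table concrete, here is a sketch computing each quantity by hand for the (balanced) iris data and forming the F-ratio, which can be checked against anova():

```r
y <- iris$Petal.Length
g <- iris$Species

grand_mean  <- mean(y)
group_means <- tapply(y, g, mean)
n_per_group <- tapply(y, g, length)

# Sums of squares, following the formulas in the table
SS_among  <- sum(n_per_group * (group_means - grand_mean)^2)
SS_within <- sum((y - group_means[g])^2)
SS_total  <- sum((y - grand_mean)^2)   # = SS_among + SS_within

# Mean squares and F-ratio (balanced design: a groups of n each)
a <- nlevels(g)
n <- unique(n_per_group)               # 50 per species
MS_among  <- SS_among / (a - 1)
MS_within <- SS_within / (a * (n - 1))
F_ratio   <- MS_among / MS_within      # matches anova(lm(y ~ g))
```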
myMod <- lm(iris$Petal.Length ~ iris$Species)
summary(myMod)
##
## Call:
## lm(formula = iris$Petal.Length ~ iris$Species)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -1.260 -0.258  0.038  0.240  1.348
##
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)             1.46200    0.06086   24.02   <2e-16 ***
## iris$Speciesversicolor  2.79800    0.08607   32.51   <2e-16 ***
## iris$Speciesvirginica   4.09000    0.08607   47.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4303 on 147 degrees of freedom
## Multiple R-squared: 0.9414, Adjusted R-squared: 0.9406
## F-statistic: 1180 on 2 and 147 DF, p-value: < 2.2e-16
anova(myMod)
## Analysis of Variance Table
##
## Response: iris$Petal.Length
##               Df Sum Sq Mean Sq F value    Pr(>F)
## iris$Species   2 437.10 218.551  1180.2 < 2.2e-16 ***
## Residuals    147  27.22   0.185
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(iris$Petal.Length ~ iris$Species)
# Homogeneity of variances
bartlett.test(iris$Petal.Length ~ iris$Species)
##
## Bartlett test of homogeneity of variances
##
## data:  iris$Petal.Length by iris$Species
## Bartlett's K-squared = 55.423, df = 2, p-value = 9.229e-13
plot(myMod)
Life gets more complicated when we are considering more than one factor
\[Y_{ij}=\mu + A_i + B_j + AB_{ij} + \epsilon_{ij}\]
With more than one factor, we have the possibility of non-additive effects of one factor on the other. These are called interaction terms
data(warpbreaks)
model <- lm(warpbreaks$breaks ~ warpbreaks$tension + warpbreaks$wool)
anova(model)
## Analysis of Variance Table
##
## Response: warpbreaks$breaks
##                    Df Sum Sq Mean Sq F value   Pr(>F)
## warpbreaks$tension  2 2034.3 1017.13  7.5367 0.001378 **
## warpbreaks$wool     1  450.7  450.67  3.3393 0.073614 .
## Residuals          50 6747.9  134.96
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We have the possibility of non-additive effects, where the effect of factor A depends on the level of factor B.
modelInteraction <- lm(warpbreaks$breaks ~ warpbreaks$tension * warpbreaks$wool)
anova(modelInteraction)
## Analysis of Variance Table
##
## Response: warpbreaks$breaks
##                                    Df Sum Sq Mean Sq F value    Pr(>F)
## warpbreaks$tension                  2 2034.3 1017.13  8.4980 0.0006926 ***
## warpbreaks$wool                     1  450.7  450.67  3.7653 0.0582130 .
## warpbreaks$tension:warpbreaks$wool  2 1002.8  501.39  4.1891 0.0210442 *
## Residuals                          48 5745.1  119.69
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
interaction.plot(warpbreaks$tension, warpbreaks$wool, warpbreaks$breaks)
If your ANOVA is unbalanced (different sample sizes in different groups), then the order in which you specify the factors matters!
anova(lm(response ~ pred1 * pred2))
Is not the same as
anova(lm(response ~ pred2 * pred1))
This is only true if the sample sizes differ.
This behavior differs from many other stats packages, because R uses sequential (Type I) sums of squares by default. You can read more about the ANOVA in R. You can use the car package to compute the ANOVA in the same way as SPSS and other packages.
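A sketch of the car approach (assumes the car package is installed via install.packages("car")). Type II sums of squares test each main effect after the other main effects, so factor order no longer matters:

```r
# Type II sums of squares with car::Anova()
library(car)  # assumed installed: install.packages("car")

model2 <- lm(breaks ~ tension * wool, data = warpbreaks)
Anova(model2, type = "II")  # order-invariant, unlike base anova()
```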
ANOVA with a continuous covariate
It’s a hybrid of regression and ANOVA
Think of it as doing an ANOVA on the residuals of a regression
\[Y_{ij}=\mu + A_i + \beta_i(X_{ij} - \bar{X_i}) + \epsilon_{ij}\]
Question: does Femoral Head Diameter differ in male and female baboons after accounting for differences in overall size?
Is the slope of FHD ~ BM the same in the two groups?
Using baboons data
baboons <- read.table("https://stats.are-awesome.com/datasets/baboons.txt", header=T)
head(baboons)
##       BM SEX  FHD
## 1   8.62   F 19.9
## 2   9.07   F 17.9
## 3   9.98   F 18.9
## 4  10.43   F 18.8
## 5  10.89   F 20.3
## 6  11.34   F 19.5
Just add in the covariate to test whether the intercepts differ between sexes
summary(lm(baboons$FHD ~ baboons$SEX + baboons$BM))
##
## Call:
## lm(formula = baboons$FHD ~ baboons$SEX + baboons$BM)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.35995 -0.54163  0.05521  0.53846  1.28351
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)   17.6474     0.8665  20.367  9.1e-16 ***
## baboons$SEXM   2.3996     0.8404   2.855   0.0092 **
## baboons$BM     0.1594     0.0721   2.210   0.0378 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7614 on 22 degrees of freedom
## Multiple R-squared: 0.8955, Adjusted R-squared: 0.886
## F-statistic: 94.27 on 2 and 22 DF, p-value: 1.622e-11
Use the * to test for interaction (i.e. different slopes between sexes)
summary(lm(baboons$FHD ~ baboons$SEX * baboons$BM))
##
## Call:
## lm(formula = baboons$FHD ~ baboons$SEX * baboons$BM)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.35157 -0.48111  0.06238  0.51091  1.16405
##
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)              16.9481     1.3085  12.953 1.76e-11 ***
## baboons$SEXM              4.1274     2.5473   1.620   0.1201
## baboons$BM                0.2195     0.1109   1.979   0.0611 .
## baboons$SEXM:baboons$BM  -0.1059     0.1472  -0.720   0.4798
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7699 on 21 degrees of freedom
## Multiple R-squared: 0.898, Adjusted R-squared: 0.8834
## F-statistic: 61.64 on 3 and 21 DF, p-value: 1.401e-10
brains <- read.csv("https://stats.are-awesome.com/datasets/Boddy_2012_data.txt")
head(brains)
##                 Species.Name           Order Brain.Mass..g. Body.Mass..g.   EQ
## 1               Homo sapiens        Primates        1250.43      65142.86 5.72
## 2          Phocoena phocoena Cetartiodactyla        1735.00     142430.00 4.43
## 3 Lagenorhynchus obliquidens Cetartiodactyla        1136.75      89750.00 4.10
## 4      Stenella coeruleoalba Cetartiodactyla         883.75      63500.00 4.12
## 5         Tursiops truncatus Cetartiodactyla        1573.00     170480.00 3.51
## 6          Delphinus delphis Cetartiodactyla         797.26      65086.96 3.65
##   Independent.contrast.derived.EQ PGLS.ML.dervived.EQ
## 1                       1.1491544           12.633650
## 2                       0.9732993           10.962980
## 3                       0.8534335            9.476202
## 4                       0.8253700            9.066825
## 5                       0.7877916            8.923060
## 6                       0.7330865            8.059241
##                                                  Reference Notes
## 1                          Sherwood/this study; Nowak 1991
## 2               Crile & Quiring 1940; Silva & Downing 1995
## 3                                      Ridgway et al. 1966
## 4              Pilleri & Busnel 1969; Silva & Downing 1995
## 5                         Marino 2000; Ridgway et al. 1966
## 6 Marino 2000; Pilleri & Busnel 1969; Silva & Downing 1995
Do primates have a different average brain size from other mammals, when controlling for body size?
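One way to frame this as an ANCOVA (a sketch: the column names follow the head(brains) output above, the `isPrimate` indicator is a new column created here, and log-log scaling is the usual choice for brain-body allometry):

```r
brains <- read.csv("https://stats.are-awesome.com/datasets/Boddy_2012_data.txt")

# Indicator for the group of interest
brains$isPrimate <- brains$Order == "Primates"

# Covariate (log body mass) plus the group effect
brainMod <- lm(log(Brain.Mass..g.) ~ log(Body.Mass..g.) + isPrimate,
               data = brains)
summary(brainMod)  # the isPrimateTRUE coefficient is the primate offset
```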