Phylogenetic Comparative Methods

Skills Bootcamp - Joining datasets

Suppose you had a dataset that looks like this:

date	individual	direction of travel	individual natal group
Jan 1	LeRoy	North	Green Stars
Jan 3	LeRoy	North	Green Stars
Jan 1	Lucinda	South	Black Stars
Jan 3	Lucinda	North	Black Stars

It is redundant to keep track of the natal group for each individual in this format, as the natal group never changes.

Its not so bad in small datasets, but gets super redundant when you gets 100s of rows

Skills Bootcamp - Joining datasets

Its better to keep them separate, and only join them together when you need to…

Directions of travel

date	individual	direction of travel
Jan 1	LeRoy	North
Jan 3	LeRoy	North
Jan 1	Lucinda	South
Jan 3	Lucinda	North

Natal groups

individual	natal group
LeRoy	Green Stars
Lucinda	Black Stars

Skills Bootcamp - Joining datasets

Luckily, there is a dplyr function to do that…

library(dplyr)
left_join(table1, table2)

This will join the tables based on any columns that have shared names (in this case individual).

If you need to manually specify (because columns have different names) you could do this

left_join(table1, table2, by = c("colname_from_table1" = "colname_from_table1"))

Comparative Methods

Anatomy of a phylogenetic tree

nodes are the blue circles

tips are the brown squares

edges connect nodes (and tips)

edges have a length (known as a branch length)

trees are hierarchical, the pattern of branching is the topology, and the deepest node is the root node

Topology

The topology represents a series of branching events, so these trees have identical topologies.

The differences between these trees are purely aesthetic

Still same old topology

Note: all these trees are made using the basic plot function in R. We will see more later.

Phylogenetic Trees in R

The ape package provides functions for interacting with phylogenetic trees.

The most common file format for phylogenetic trees is the Newick format (which is incorporated in the Nexus format).

textTree <- 
  "(Gibbon:1.6, (Gorilla:1, (Chimp:0.2, Human:0.2):0.8):0.6);"

Notice sister taxa are separated by a comma, enclosed in parentheses. Branch lengths go after a colon after the taxon name.

The structure is hierarchical, so read it from inside out.

Phylogenetic Trees in R

Trees in Newick format can be read with read.tree()

Tree on previous slide can be read and plotted as follows:

plot(read.tree(text=textTree))

Trees in Nexus format (e.g. from the program Mesquite) can be read with read.nexus()

Note: both of these functions are in the ape package.

Phylogenetic Trees in R

Like most everything, a phylogenetic tree in R is just a special type of list (of class phylo)

str(myTree)

## List of 4
##  $ edge       : int [1:6, 1:2] 5 5 6 6 7 7 1 6 2 7 ...
##  $ edge.length: num [1:6] 1.6 0.6 1 0.8 0.2 0.2
##  $ Nnode      : int 3
##  $ tip.label  : chr [1:4] "Gibbon" "Gorilla" "Chimp" "Human"
##  - attr(*, "class")= chr "phylo"
##  - attr(*, "order")= chr "cladewise"

Phylogenetic Trees in R

drop.tip() - drops named tips from a tree

myTree

## 
## Phylogenetic tree with 4 tips and 3 internal nodes.
## 
## Tip labels:
##   Gibbon, Gorilla, Chimp, Human
## 
## Rooted; includes branch lengths.

Phylogenetic Trees in R

drop.tip() - drops named tips from a tree

drop.tip(myTree, c("Chimp", "Human"))

## 
## Phylogenetic tree with 2 tips and 1 internal nodes.
## 
## Tip labels:
##   Gibbon, Gorilla
## 
## Rooted; includes branch lengths.

Question

What is the #1 assumption we have made in parametric stats throughout the semester?

Answer

Each observation is independent, and therefore residuals are independent and (hopefully) normally distributed

The Problem - Graphically

statistics assumes

The Problem - Graphically

evolution provides

The Problem - Graphically

Regression assumes residuals are independent, but, sometimes residuals more likely to be (\(\pm\)) based on phylo.

This is called phylogenetic autocorrelation and causes problems

The Problem is Insidious

This example is extreme…some people call this a grade shift

Folks have long been wary of grade shifts (solution: separate analyses for groups)

The Problem is Insidious

Even without obvious grade shifts, phylogenetic autocorrelation of residuals causes 2 problems:

increases variance of parameter estimates (e.g. slopes and intercepts)
increases Type I (false positive) error rate

Recall the General Linear Model

\[Y_i = \beta_0 + \beta_1X_i + \beta_2X_i +\ ... +\ \beta_nX_i + \epsilon_i\]

where:

\(\beta_0\) is the y-intercept (value of y where x= 0)
\(\beta_1X_i\) is the slope value for the 1st x variable
\(\epsilon_i\) is the error term, distributed as a normal random variable

The solution to the problem of phylogenetic autocorrelation is to relax assumptions about the error term.

Generalized Linear Model

Similar in structure to general linear models, but allows specification of the assumed residual error structure

This error structure is represented as an expected variance/covariance matrix

Requires branch lengths and a particular model of evolution

The most common assumed model of evolution is Brownian motion

Brownian Motion

Traits evolve in a random direction at each time step, independent of previous changes

Assumes no natural selection

Assumes constant rate of change

Brownian motion is assumed in most studies you will read which are trying to “correct for phylogeny” (not a very useful way to talk about it)

Phylogenetic Generalized Linear Model

Homo and chimp have much shorter branches connecting them than chimp and gorilla

Branch lengths represented as variance/covariance matrix using functions in ape

We use this phyloVCV in pGLM, instead of a normal error term. (Often called PGLS, but pGLM is more general term)

Quantifying Phylogenetic Signal

Residual autocorrelation in proportion to the Brownian VCV is a reasonable starting assumption, but we don’t want to always assume this

Better to estimate how much phylogenetic signal exists

\(\lambda\) provides this estimate

varies between 0 and 1, and scales the branch lengths of the tree (and thus the VCV matrix)

lambda - branch length transformations

Estimating lambda

when doing PGLS, you can estimate the most appropriate value of lambda for your data

if \(\lambda = 0\) then PGLS is equivalent to general linear model (non-phylogenetic)

if \(\lambda = 1\) then PGLS is equivalent to phylogenetically independent contrasts (an older way of “correcting” for phylogeny)

Other models of evolution

Even when we estimate lambda, we are only slightly relaxing our assumption that our trait evolved by Brownian motion.

There are other models of trait evolution such as:

Ornstein-Uhlenbeck: like Brownian motion but with stabilizing selection around some ‘optimum’ value (or multiple ‘optima’)

Early Burst: model of trait evolution where evolution happens faster near the root of the tree, slower towards tips

We won’t consider these further, but if you are serious about comparative stats, you need to explore whether other models besides Brownian are more appropriate.

PGLS in R

caper package is most user friendly

Three steps:

Read in your data and your tree
Use the comparative.data() function to match up your tree with your dataframe
Use the pgls() function, specifying that lambda should be estimated by maximum likelihood

Example:

pgls(response ~ predictor1 + predictor2, data=myCompData, lambda="ML")

Generalized LS ain’t just for phylogenies

Phylogenetic autocorrelation is just one kind of residual autocorrelation. Spatial autocorrelation is another type……

Generalized Least Squares is the big family of models which have non normal error terms.

Challenge 1 - on your own time

Read in this data on ungulate brain and body size https://stats.are-awesome.com/datasets/ungs.txt

Replace the spaces in the Species columns with underscores.

you will use this data frame in the next part of the challenge.

Challenge 2 - on your own time

Read in this phylogenetic tree using the ape::read.nexus() function https://stats.are-awesome.com/datasets/BinindaEmondsBestDatesUngulates.nex

Plot the tree

Use the caper:comparative.data() function to make the comparative data object.

Do a PGLS testing the hypothesis that ungulate brain size is a function of body size and diet, while controlling for phylogeny assuming a brownian motion model of evolution.

What is the maximum likelihood of lambda? Is diet a significant predictor of brain size? What about body size?

Practice on your own time - Joining two datasets

Read in these two datasets: https://stats.are-awesome.com/datasets/barr_astrag_2014.txt & https://stats.are-awesome.com/datasets/ungulateMasses.txt
Join these two dataframes together
What is the slope of the linear relationship between the B variable and the body mass variable?