Skills Bootcamp - Joining datasets

Suppose you had a dataset that looks like this:

date individual direction of travel individual natal group
Jan 1 LeRoy North Green Stars
Jan 3 LeRoy North Green Stars
Jan 1 Lucinda South Black Stars
Jan 3 Lucinda North Black Stars

It is redundant to keep track of the natal group for each individual in this format, as the natal group never changes.

Its not so bad in small datasets, but gets super redundant when you gets 100s of rows

Skills Bootcamp - Joining datasets

Its better to keep them separate, and only join them together when you need to…

Directions of travel

date individual direction of travel
Jan 1 LeRoy North
Jan 3 LeRoy North
Jan 1 Lucinda South
Jan 3 Lucinda North

Natal groups

individual natal group
LeRoy Green Stars
Lucinda Black Stars

Skills Bootcamp - Joining datasets

Luckily, there is a dplyr function to do that…

library(dplyr)
left_join(table1, table2)

This will join the tables based on any columns that have shared names (in this case individual).

If you need to manually specify (because columns have different names) you could do this

left_join(table1, table2, by = c("colname_from_table1" = "colname_from_table1"))

Comparative Methods

Anatomy of a phylogenetic tree

nodes are the blue circles

tips are the brown squares

edges connect nodes (and tips)

edges have a length (known as a branch length)

trees are hierarchical, the pattern of branching is the topology, and the deepest node is the root node

Topology

The topology represents a series of branching events, so these trees have identical topologies.

The differences between these trees are purely aesthetic

Still same old topology

Note: all these trees are made using the basic plot function in R. We will see more later.

Phylogenetic Trees in R

  • The ape package provides functions for interacting with phylogenetic trees.
  • The most common file format for phylogenetic trees is the Newick format (which is incorporated in the Nexus format).
textTree <- 
  "(Gibbon:1.6, (Gorilla:1, (Chimp:0.2, Human:0.2):0.8):0.6);"
  • Notice sister taxa are separated by a comma, enclosed in parentheses. Branch lengths go after a colon after the taxon name.
  • The structure is hierarchical, so read it from inside out.

Phylogenetic Trees in R

Phylogenetic Trees in R

Trees in Newick format can be read with read.tree()

Tree on previous slide can be read and plotted as follows:

plot(read.tree(text=textTree))

Trees in Nexus format (e.g. from the program Mesquite) can be read with read.nexus()

Note: both of these functions are in the ape package.

Phylogenetic Trees in R

Like most everything, a phylogenetic tree in R is just a special type of list (of class phylo)

str(myTree)
## List of 4
##  $ edge       : int [1:6, 1:2] 5 5 6 6 7 7 1 6 2 7 ...
##  $ edge.length: num [1:6] 1.6 0.6 1 0.8 0.2 0.2
##  $ Nnode      : int 3
##  $ tip.label  : chr [1:4] "Gibbon" "Gorilla" "Chimp" "Human"
##  - attr(*, "class")= chr "phylo"
##  - attr(*, "order")= chr "cladewise"

Phylogenetic Trees in R

drop.tip() - drops named tips from a tree

myTree
## 
## Phylogenetic tree with 4 tips and 3 internal nodes.
## 
## Tip labels:
##   Gibbon, Gorilla, Chimp, Human
## 
## Rooted; includes branch lengths.

Phylogenetic Trees in R

drop.tip() - drops named tips from a tree

drop.tip(myTree, c("Chimp", "Human"))
## 
## Phylogenetic tree with 2 tips and 1 internal nodes.
## 
## Tip labels:
##   Gibbon, Gorilla
## 
## Rooted; includes branch lengths.

Question

What is the #1 assumption we have made in parametric stats throughout the semester?

Answer

Each observation is independent, and therefore residuals are independent and (hopefully) normally distributed

The Problem - Graphically

statistics assumes

The Problem - Graphically

evolution provides

The Problem - Graphically

Regression assumes residuals are independent, but, sometimes residuals more likely to be (\(\pm\)) based on phylo.

This is called phylogenetic autocorrelation and causes problems

The Problem is Insidious

This example is extreme…some people call this a grade shift

Folks have long been wary of grade shifts (solution: separate analyses for groups)

The Problem is Insidious

Even without obvious grade shifts, phylogenetic autocorrelation of residuals causes 2 problems:

  • increases variance of parameter estimates (e.g. slopes and intercepts)
  • increases Type I (false positive) error rate

Recall the General Linear Model

\[Y_i = \beta_0 + \beta_1X_i + \beta_2X_i +\ ... +\ \beta_nX_i + \epsilon_i\]

where:

  • \(\beta_0\) is the y-intercept (value of y where x= 0)
  • \(\beta_1X_i\) is the slope value for the 1st x variable
  • \(\epsilon_i\) is the error term, distributed as a normal random variable

The solution to the problem of phylogenetic autocorrelation is to relax assumptions about the error term.

Generalized Linear Model

  • Similar in structure to general linear models, but allows specification of the assumed residual error structure
  • This error structure is represented as an expected variance/covariance matrix
  • Requires branch lengths and a particular model of evolution
  • The most common assumed model of evolution is Brownian motion

Brownian Motion

  • Traits evolve in a random direction at each time step, independent of previous changes
  • Assumes no natural selection
  • Assumes constant rate of change
  • Brownian motion is assumed in most studies you will read which are trying to “correct for phylogeny” (not a very useful way to talk about it)

Phylogenetic Generalized Linear Model

  • Homo and chimp have much shorter branches connecting them than chimp and gorilla
  • Branch lengths represented as variance/covariance matrix using functions in ape
  • We use this phyloVCV in pGLM, instead of a normal error term. (Often called PGLS, but pGLM is more general term)

Quantifying Phylogenetic Signal

  • Residual autocorrelation in proportion to the Brownian VCV is a reasonable starting assumption, but we don’t want to always assume this
  • Better to estimate how much phylogenetic signal exists
  • \(\lambda\) provides this estimate
  • varies between 0 and 1, and scales the branch lengths of the tree (and thus the VCV matrix)

lambda - branch length transformations

Estimating lambda

  • when doing PGLS, you can estimate the most appropriate value of lambda for your data
  • if \(\lambda = 0\) then PGLS is equivalent to general linear model (non-phylogenetic)
  • if \(\lambda = 1\) then PGLS is equivalent to phylogenetically independent contrasts (an older way of “correcting” for phylogeny)

Other models of evolution

  • Even when we estimate lambda, we are only slightly relaxing our assumption that our trait evolved by Brownian motion.
  • There are other models of trait evolution such as:
  • Ornstein-Uhlenbeck: like Brownian motion but with stabilizing selection around some ‘optimum’ value (or multiple ‘optima’)
  • Early Burst: model of trait evolution where evolution happens faster near the root of the tree, slower towards tips
  • We won’t consider these further, but if you are serious about comparative stats, you need to explore whether other models besides Brownian are more appropriate.

PGLS in R

caper package is most user friendly

Three steps:

  1. Read in your data and your tree
  2. Use the comparative.data() function to match up your tree with your dataframe
  3. Use the pgls() function, specifying that lambda should be estimated by maximum likelihood

Example:

pgls(response ~ predictor1 + predictor2, data=myCompData, lambda="ML")

Generalized LS ain’t just for phylogenies

  • Phylogenetic autocorrelation is just one kind of residual autocorrelation. Spatial autocorrelation is another type……
  • Generalized Least Squares is the big family of models which have non normal error terms.

Challenge 1 - on your own time

Challenge 2 - on your own time

  • Plot the tree
  • Use the caper:comparative.data() function to make the comparative data object.
  • Do a PGLS testing the hypothesis that ungulate brain size is a function of body size and diet, while controlling for phylogeny assuming a brownian motion model of evolution.
  • What is the maximum likelihood of lambda? Is diet a significant predictor of brain size? What about body size?

Practice on your own time - Joining two datasets