Multivariate Stats II

PCA video

Discriminant Function Analysis

start with \(p\) variables measured for \(m\) groups.

DFA finds a linear combination of the \(p\) variables that maximizes the distance between groups

\[Z = a_1X_1 + a_2X_2 + ... + a_pX_p\]

DFA tries to maximise the F ratio of between group to within group variation (\(M_B/M_W\))

This is an eigenvalue problem

Discriminant Function Analysis

Assuming you have more measurements than groups, there will be \(m - 1\) canonical discriminant functions that maximize the ratio \(M_B/M_W\).

These are indicated by \(Z_1\), \(Z_2\), \(...\), \(Z_{m-1}\).

\(Z_1\) captures as much distance between groups as possible.

\(Z_2\) captures as much variation as possible, subject to the condition that the variation captured is uncorrelated (orthogonal) to \(Z_1\), and so on with the remaining canonical discriminant functions.

Discriminant Function Analysis

First two discriminant functions often captures majority of group differences.

If so, we can use reduced set of variables to visualize \(p\) dimensional dataset in 2 dimensions.

Learn By Doing - DFA in R

Do a DFA of the iris data using the mass::lda() function

library(MASS)
iris_lda <- lda(Species ~ Sepal.Width + Sepal.Length + Petal.Width + Petal.Length, data=iris)
iris_lda
## Call:
## lda(Species ~ Sepal.Width + Sepal.Length + Petal.Width + Petal.Length, 
##     data = iris)
## 
## Prior probabilities of groups:
##     setosa versicolor  virginica 
##  0.3333333  0.3333333  0.3333333 
## 
## Group means:
##            Sepal.Width Sepal.Length Petal.Width Petal.Length
## setosa           3.428        5.006       0.246        1.462
## versicolor       2.770        5.936       1.326        4.260
## virginica        2.974        6.588       2.026        5.552
## 
## Coefficients of linear discriminants:
##                     LD1         LD2
## Sepal.Width   1.5344731 -2.16452123
## Sepal.Length  0.8293776 -0.02410215
## Petal.Width  -2.8104603 -2.83918785
## Petal.Length -2.2012117  0.93192121
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9912 0.0088

Plotting the results from DFA

To get the scores from the model object, we need use the predict() function. This is a generic function that can be use to get the predicted values from most kinds of model objects.

For the DFA object, the predict() function will out put the predicted class (i.e. the species), as well as the posterior probabiliy of each observation belonging to the diffrent classes, and finally, it returns the predicted values of each specimen for each discriminant function (e.g. LD1 and LD2). These values are what we want, and they are accessed in the x slot in the predicted object.

predicted<-predict(iris_lda)
head(predicted$class)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
head(predicted$x)
##        LD1        LD2
## 1 8.061800 -0.3004206
## 2 7.128688  0.7866604
## 3 7.489828  0.2653845
## 4 6.813201  0.6706311
## 5 8.132309 -0.5144625
## 6 7.701947 -1.4617210

Plotting the results from DFA

Make your own dataframe withpredicted LD values, and species values from original iris dataset

library(ggplot2)
forPlot <- data.frame(predicted$x, Species=iris$Species)
ggplot(data=forPlot, aes(x=LD1, y=LD2, color=Species)) + geom_point()

Correspondence Analysis

  • A method for visualizing a 2-way contingency table
  • The goal is to have rows (often taxa) and colums (often sites) appear in same ordination plot
  • Often called reciprocal averaging
  • Site scores are weighted averages of species values, and species scores are a weighted > average of site values
  • useful for count data and presence/absence

Correspondence Analysis

bovids <- read.table("https://stats.are-awesome.com/datasets/bovid_occurrences.txt", header=TRUE, sep="\t")
library(tidyr)
bovids <- pivot_wider(bovids, names_from = site, values_from=count)
bovids <- as.data.frame(bovids) #convert to plain dataframe
row.names(bovids) <- bovids$taxon
bovids<-bovids[,2:9]
head(bovids)
##              site1 site2 site3 site4 site5 site6 site7 site8
## Gazella        184   291   313   295    95    98    61   114
## Connochaetes   185   281   297   276   136   159    86   172
## Tragelaphus     91   145   145   155   229   260   137   295
## Aepyceros      155   219   214   183   295   330   185   369

Correspondence Analysis

  • ca package in R
  • Row points (blue) appear close to rows with similar column values
  • Column points (red) appear close to columns with similar row values

Correspondence Analysis

Code to do CA and make plot on previous page

library("ca")
myCA <- ca(bovids)
forPlot <- data.frame(rowcoordX=myCA$rowcoord[,1], rowcoordY=myCA$rowcoord[,2],rowlabels=rownames(myCA$rowcoord),
                      colcoordX=myCA$colcoord[,1], colcoordY=myCA$colcoord[,2],collabels=rownames(myCA$colcoord))
ggplot(data=forPlot) + 
  geom_text(aes(x=rowcoordX,y=rowcoordY, label=rowlabels), color="blue", size=8) + 
  geom_text(aes(x=colcoordX,y=colcoordY, label=collabels), color="red", size=8) +
  labs(x="Dimension 1", y="Dimension 2", title="Corresponce Analysis - Bovid Abundances") + 
  scale_x_continuous(limits=c(-2, 2))