start with \(p\) variables measured for \(m\) groups.
DFA finds a linear combination of the \(p\) variables that maximizes the distance between groups
\[Z = a_1X_1 + a_2X_2 + ... + a_pX_p\]
DFA tries to maximise the F ratio of between group to within group variation (\(M_B/M_W\))
This is an eigenvalue problem
Assuming you have more measurements than groups, there will be \(m - 1\) canonical discriminant functions that maximize the ratio \(M_B/M_W\).
These are indicated by \(Z_1\), \(Z_2\), \(...\), \(Z_{m-1}\).
\(Z_1\) captures as much distance between groups as possible.
\(Z_2\) captures as much variation as possible, subject to the condition that the variation captured is uncorrelated (orthogonal) to \(Z_1\), and so on with the remaining canonical discriminant functions.
First two discriminant functions often captures majority of group differences.
If so, we can use reduced set of variables to visualize \(p\) dimensional dataset in 2 dimensions.
Do a DFA of the iris data using the mass::lda() function
library(MASS) iris_lda <- lda(Species ~ Sepal.Width + Sepal.Length + Petal.Width + Petal.Length, data=iris) iris_lda
## Call: ## lda(Species ~ Sepal.Width + Sepal.Length + Petal.Width + Petal.Length, ## data = iris) ## ## Prior probabilities of groups: ## setosa versicolor virginica ## 0.3333333 0.3333333 0.3333333 ## ## Group means: ## Sepal.Width Sepal.Length Petal.Width Petal.Length ## setosa 3.428 5.006 0.246 1.462 ## versicolor 2.770 5.936 1.326 4.260 ## virginica 2.974 6.588 2.026 5.552 ## ## Coefficients of linear discriminants: ## LD1 LD2 ## Sepal.Width 1.5344731 -2.16452123 ## Sepal.Length 0.8293776 -0.02410215 ## Petal.Width -2.8104603 -2.83918785 ## Petal.Length -2.2012117 0.93192121 ## ## Proportion of trace: ## LD1 LD2 ## 0.9912 0.0088
To get the scores from the model object, we need use the predict() function. This is a generic function that can be use to get the predicted values from most kinds of model objects.
For the DFA object, the predict() function will out put the predicted class (i.e. the species), as well as the posterior probabiliy of each observation belonging to the diffrent classes, and finally, it returns the predicted values of each specimen for each discriminant function (e.g. LD1 and LD2). These values are what we want, and they are accessed in the x slot in the predicted object.
predicted<-predict(iris_lda) head(predicted$class)
## [1] setosa setosa setosa setosa setosa setosa ## Levels: setosa versicolor virginica
head(predicted$x)
## LD1 LD2 ## 1 8.061800 -0.3004206 ## 2 7.128688 0.7866604 ## 3 7.489828 0.2653845 ## 4 6.813201 0.6706311 ## 5 8.132309 -0.5144625 ## 6 7.701947 -1.4617210
Make your own dataframe withpredicted LD values, and species values from original iris dataset
library(ggplot2) forPlot <- data.frame(predicted$x, Species=iris$Species) ggplot(data=forPlot, aes(x=LD1, y=LD2, color=Species)) + geom_point()
bovids <- read.table("https://stats.are-awesome.com/datasets/bovid_occurrences.txt", header=TRUE, sep="\t")
library(tidyr)
bovids <- pivot_wider(bovids, names_from = site, values_from=count)
bovids <- as.data.frame(bovids) #convert to plain dataframe
row.names(bovids) <- bovids$taxon
bovids<-bovids[,2:9]
head(bovids)
## site1 site2 site3 site4 site5 site6 site7 site8 ## Gazella 184 291 313 295 95 98 61 114 ## Connochaetes 185 281 297 276 136 159 86 172 ## Tragelaphus 91 145 145 155 229 260 137 295 ## Aepyceros 155 219 214 183 295 330 185 369
ca package in RCode to do CA and make plot on previous page
library("ca")
myCA <- ca(bovids)
forPlot <- data.frame(rowcoordX=myCA$rowcoord[,1], rowcoordY=myCA$rowcoord[,2],rowlabels=rownames(myCA$rowcoord),
colcoordX=myCA$colcoord[,1], colcoordY=myCA$colcoord[,2],collabels=rownames(myCA$colcoord))
ggplot(data=forPlot) +
geom_text(aes(x=rowcoordX,y=rowcoordY, label=rowlabels), color="blue", size=8) +
geom_text(aes(x=colcoordX,y=colcoordY, label=collabels), color="red", size=8) +
labs(x="Dimension 1", y="Dimension 2", title="Corresponce Analysis - Bovid Abundances") +
scale_x_continuous(limits=c(-2, 2))