

Often we collect many different variables
We want to answer questions like:
These questions are the domain of multivariate statistics
Recall the sample covariance for two variables, calculated from the sum of cross-products
\[\ s_{XY} = \frac{\sum\limits_{i=1}^n(X_i - \bar{X})(Y_i - \bar{Y})}{(n-1)}\]
With a multivariate sample comprising \(p\) variables, we can define a sample covariance matrix:
\[ S = \left[\begin{array}{ccc} s_{11} & s_{12} & .. & s_{1p}\\ s_{21}& s_{22} & .. & s_{2p} \\ .. & .. & .. & .. \\ s_{p1} & s_{p2} & .. & s_{pp} \end{array}\right]\]
What do the diagonal elements represent?
var(iris[,-5])
## Sepal.Length Sepal.Width Petal.Length Petal.Width ## Sepal.Length 0.6856935 -0.0424340 1.2743154 0.5162707 ## Sepal.Width -0.0424340 0.1899794 -0.3296564 -0.1216394 ## Petal.Length 1.2743154 -0.3296564 3.1162779 1.2956094 ## Petal.Width 0.5162707 -0.1216394 1.2956094 0.5810063
Recall the correlation coeffiecient for two variables, which is a scaled version of the covariance.
\[ correlation\ coeffiecient = \frac{cov\ XY}{(sd\ X \times sd\ Y)}\]
or more formally:
\[ r = \frac{s_{xy}}{(s_{X} \times s_{Y})}\]
We can then compute the correlation matrix for a multivariate sample of \(p\) variables.
\[ R = \left[\begin{array}{cccc} 1 & r_{12} & .. & r_{1p}\\ r_{21}& 1 & .. & r_{2p} \\ .. & .. & .. & .. \\ r_{p1} & r_{p2} & .. & 1 \end{array}\right]\]
Why are the diagonal elements all 1?
cor(iris[,-5])
## Sepal.Length Sepal.Width Petal.Length Petal.Width ## Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411 ## Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259 ## Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654 ## Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
With a single variable (e.g. femur length) it is easy to conceptualize how far apart two observations are.
As we add more measurements, it becomes less obvious to tell how “far apart” two specimens are.
If the Humerus of indivual A is 15cm and the Humerus of indiviaual B is 22cm, then how far apart are individuals A and B?
We need a multivariate distance metric.

\[c = \sqrt{a^2 + b^2 }\]

Assuming 2 variables, we can compute the distance as the hypotenuse of a triangle:
\[d_{ij} =\sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2}\]
We can compute a Euclidian distance for any number of variables:
\[d_{ij} = \sum\limits_{k=1}^p \sqrt{ (x_{ik})^2 - (x_{jk})^2 }\]
Euclidian distances can easily be swamped by large scale measurements.
To avoid this, you can calculate a z score by subtracting the mean of the measurement and dividing by the standard deviation.
The dist() function in R calculates Euclidian distances by default.
Calculates the distance of an observation from its multivariate centroid (mean of all measurements), taking into account the covariance of the variables.
\[D^2_{ij} = \sum\limits_{r=1}^p \sum\limits_{s=1}^p (x_{r} - \mu_{r})\ v^{rs} (x_{s} - \mu_{s})\]
where \(v^{rs}\) is the covariance between variables \(r\) and \(s\)
means <- c(mean(iris$Sepal.Length), mean(iris$Sepal.Width), mean(iris$Petal.Length), mean(iris$Petal.Width)) VCV <- var(iris[,-5])
mhdists <- mahalanobis(iris[,-5], means, VCV) mhdists[1:10]
## [1] 2.134468 2.849119 2.081339 2.452382 2.462155 3.883418 2.862108 1.833300 ## [9] 3.384073 2.375218
Several multivariate techiques derive directly from matrices of distances.
dist(iris[1:5,-5])
## 1 2 3 4 ## 2 0.5385165 ## 3 0.5099020 0.3000000 ## 4 0.6480741 0.3316625 0.2449490 ## 5 0.1414214 0.6082763 0.5099020 0.6480741
start with a distance matrix
assume each individual is in a group of 1
join individuals within a given distance into a group
continue joining groups until there is a single group
visualize with a tree (dendrogram)
library(dplyr) #load dplyr just to have the pipe %>% iris[,-5] %>% dist %>% hclust %>% plot
Goal is to “summarize” variation in multivariate data
Reducing dimensionality of dataset is called ordination and many multivariate techniques fall in to this category
Reduce the number of dimensions needed to describe most of the variance
Start with some data points

Examine the variance of the points along some axis…

This axis covers much more variance in our points…

Imagine these points in some XY coordinate system….

We can still find the axis of maximum variation….

Then we find the orthogonal axis, 90\(^\circ\) from major axis

Next, we rotate the data to use the two new principal components axes we found…

Consider a system of linear equations:
\[a_{11}x_1 + a_{12}x_2 + ... + a_{1n}x_n = \lambda x_1 \\ a_{21}x_1 + a_{22}x_2 + ... + a_{2n}x_n = \lambda x_2\\ ... \\ a_{n1}x_1 + a_{n2}x_2 + ... + a_{nn}x_n = \lambda x_n\]
These equations only hold true for some values of \(\lambda\), which are called the eigenvalues.
There are up to \(n\) eigenvalues.
These equations can be solved for a given eigenvalue (e.g., the \(ith\)), and the resulting set of values is called the \(ith\) eigenvector.
I don’t expect this to be crystal clear if you haven’t studied linear algebra!
There are 2 functions in R
prcomp()princomp()These differ in their implementation and default arguments, but provide similar results.
However, prcomp() is preferred for numerical accuracy
These are bivariate plots of the familiar iris dataset, with 4 numerical variables. Lets do a PCA to summarize all the variation in a smaller number of variables.
There are two main ways to call the PCA function:
I like the second option, because it is more explicit exactly which variables are going in.
irisPCA <- prcomp(iris[,1:4],
scale=TRUE, center=TRUE)
#note both of these are equivalent
irisPCA <- prcomp(~Sepal.Length + Petal.Length + Petal.Width + Sepal.Width,
data=iris,
scale=TRUE, center=TRUE)
Now that we have a PCA object, we can look at the output:
irisPCA
## Standard deviations (1, .., p=4): ## [1] 1.7083611 0.9560494 0.3830886 0.1439265 ## ## Rotation (n x k) = (4 x 4): ## PC1 PC2 PC3 PC4 ## Sepal.Length 0.5210659 0.37741762 -0.7195664 0.2612863 ## Petal.Length 0.5804131 0.02449161 0.1421264 -0.8014492 ## Petal.Width 0.5648565 0.06694199 0.6342727 0.5235971 ## Sepal.Width -0.2693474 0.92329566 0.2443818 -0.1235096
summary(irisPCA)
## Importance of components: ## PC1 PC2 PC3 PC4 ## Standard deviation 1.7084 0.9560 0.38309 0.14393 ## Proportion of Variance 0.7296 0.2285 0.03669 0.00518 ## Cumulative Proportion 0.7296 0.9581 0.99482 1.00000
To plot irisPCA you can use the base biplot(irisPCA) function which is ugly but effective, but using the ggfortify package is much easier.
library(ggfortify) autoplot(irisPCA, data=iris, colour="Species", loadings=TRUE, loadings.label=TRUE)
## PCA - Example cont.
Notice on the previous example that we are color coding by Species, which is a column that exists in the original iris dataset, but DOES NOT exist in the PCA model object (because PCA only works on numeric columns). This is why you have to use data=iris, colour="Species" to get the color coding right. Also ggfortify isn’t as forgiving as ggplot2 is on how you spell ‘colour’.
I prefer to plot my own PCA plots using ggplot2.
If you look inside the irisPCA object using the str() function you will see that there are a bunch of components. You are most interested in the one called x because this contains the new coordinates for each observation in the new PCA space.
head(irisPCA$x)
## PC1 PC2 PC3 PC4 ## 1 -2.257141 0.4784238 -0.12727962 0.024087508 ## 2 -2.074013 -0.6718827 -0.23382552 0.102662845 ## 3 -2.356335 -0.3407664 0.04405390 0.028282305 ## 4 -2.291707 -0.5953999 0.09098530 -0.065735340 ## 5 -2.381863 0.6446757 0.01568565 -0.035802870 ## 6 -2.068701 1.4842053 0.02687825 0.006586116
Now we can make a new dataframe for our plot, using the x values from the PCA, and the Species column from the original data
forPlot <- data.frame(irisPCA$x, Species=iris$Species) ggplot(data = forPlot, mapping=aes(x=PC1, y=PC2, color=Species)) + geom_point()
Note…it is harder to plot the eigenvectors (the arrows) if you are doing it yourself. I never use them because I don’t find them that helpful. If you do, I suggest going with ggfortify