Categorical data: data that fall into a discrete number of categories.
The different possible values of a categorical variable are called levels.
bird_sightings <-
factor(c("pigeon", "pigeon", "goshawk",
"eagle", "goshawk"))
levels(bird_sightings)
## [1] "eagle" "goshawk" "pigeon"
myData <- data.frame(
  site = sample(c("forest1", "meadow2"), 1000, replace = TRUE),
  species = sample(c("Homo_sapiens", "Bison_bison", "Canis_latrans"), 1000, replace = TRUE)
)
head(myData)
##      site       species
## 1 forest1 Canis_latrans
## 2 meadow2   Bison_bison
## 3 meadow2  Homo_sapiens
## 4 meadow2 Canis_latrans
## 5 forest1   Bison_bison
## 6 meadow2 Canis_latrans
R has useful tools for counting occurrences, such as table():
table(myData$site,myData$species)
##          
##           Bison_bison Canis_latrans Homo_sapiens
##   forest1         158           165          176
##   meadow2         174           153          174
This is called a contingency table. Analyses of categorical data typically operate on contingency tables.
A more realistic (but still made-up) example:
| Site | A. afarensis | Ar. ramidus | Aepyceros melampus |
|---|---|---|---|
| Hadar | 120 | 0 | 600 |
| Aramis | 0 | 90 | 220 |
Contingency tables are made up of counts or frequencies of observations in each category.
Rows are indexed by \(i\) and columns are indexed by \(j\).
There are \(n\) rows in a table and \(m\) columns.
Analyzing contingency tables requires the raw counts, not percentages, proportions, etc.
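As a minimal sketch, the made-up fossil counts above could be entered in R as a matrix of raw counts. The object names `fossils` and `fossils_totals` are assumptions chosen for illustration; they do not come from the original material. `addmargins()` appends the row and column totals.

```r
# Hypothetical object name; counts copied from the made-up table above
fossils <- matrix(
  c(120,  0, 600,
      0, 90, 220),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    site    = c("Hadar", "Aramis"),
    species = c("A. afarensis", "Ar. ramidus", "Aepyceros melampus")
  )
)

# Raw counts with row and column totals appended as a "Sum" row and column
fossils_totals <- addmargins(fossils)
print(fossils_totals)
```

Keeping the data as raw counts, rather than converting to proportions first, preserves the sample size that the tests below depend on.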
Null hypothesis: there is no association between the \(site\) variable and the \(species\) variable.
Alternative hypothesis: there is an association between the \(site\) variable and the \(species\) variable.
To test the null hypothesis, we need to ask: what are the expected values of the cells, assuming no association?
Intuitively, what would you expect the value of each cell to be, assuming the row and column variables are unrelated?
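Going back to probability, being an A. afarensis at Hadar is a shared event made up of two simple events: being found at Hadar (\(P = 720/1030\)) and being an A. afarensis (\(P = 120/1030\)).
If the two variables are unrelated, we simply multiply these probabilities together and then multiply by the sample size.
The expected value of this cell is \(\frac{720}{1030} \times \frac{120}{1030} \times 1030 \approx 0.0814 \times 1030 \approx 83.88\)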
Shortcut for computing expected cell frequencies:
\[\hat{Y}_{i,j} = \frac{\text{row total}\times\text{column total}}{\text{sample size}} = \frac{\sum\limits_{j=1}^{m}Y_{i,j}\times\sum\limits_{i=1}^{n}Y_{i,j}}{N}\]
Volunteer: calculate by hand on board!
| Site | A. afarensis | Ar. ramidus | Aepyceros melampus |
|---|---|---|---|
| Hadar | 120 | 0 | 600 |
| Aramis | 0 | 90 | 220 |
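To check a hand calculation, the shortcut formula (row total times column total, divided by sample size) can be sketched in R. The names `observed` and `expected` are assumptions for illustration. `outer()` multiplies every row total by every column total, giving all cells at once.

```r
# Made-up fossil counts from the table above (hypothetical object name)
observed <- matrix(
  c(120,  0, 600,
      0, 90, 220),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    site    = c("Hadar", "Aramis"),
    species = c("A. afarensis", "Ar. ramidus", "Aepyceros melampus")
  )
)

N <- sum(observed)  # total sample size

# Expected count for cell (i, j) = row total i * column total j / N
expected <- outer(rowSums(observed), colSums(observed)) / N
print(round(expected, 2))
```

The expected table has the same row and column totals as the observed table; only the distribution of counts within the table changes.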
Karl Pearson came up with a test statistic to quantify how much the observed counts differ from the expected values:
\[X^2_{Pearson} = \sum\limits_{\text{all cells}}\frac{(\text{Observed}-\text{Expected})^2}{\text{Expected}}\]
This is analogous to the residual sum of squares in linear modeling.
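Under the null hypothesis, the statistic approximately follows a chi-square distribution with \((n-1)(m-1)\) degrees of freedom.
This known distribution can be used to calculate p-values.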
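The statistic and its p-value can also be computed step by step. The following is a minimal sketch using the made-up fossil counts; the variable names (`observed`, `expected`, `x2`, `df`, `p_value`) are assumptions for illustration.

```r
# Made-up fossil counts (rows: Hadar, Aramis; columns: three species)
observed <- matrix(
  c(120,  0, 600,
      0, 90, 220),
  nrow = 2, byrow = TRUE
)

# Expected counts under no association
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson's statistic: sum over all cells of (O - E)^2 / E
x2 <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

print(c(X2 = x2, df = df, p = p_value))
```

For a table larger than 2 x 2, these values should agree with what `chisq.test(observed)` reports, since no continuity correction is applied in that case.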
myTable <- table(myData$site, myData$species)
chisq.test(myTable)
## 
##  Pearson's Chi-squared test
## 
## data:  myTable
## X-squared = 1.2313, df = 2, p-value = 0.5403
Note that chisq.test() accepts either two vectors of data or a pre-made contingency table.
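A short sketch of the two calling styles, which should give the same result. The seed value and the names `from_vectors` and `from_table` are arbitrary assumptions added here so the example is reproducible; because the data are randomly generated, the numbers will differ from the output shown above.

```r
set.seed(42)  # arbitrary seed, only for reproducibility of this sketch

myData <- data.frame(
  site = sample(c("forest1", "meadow2"), 1000, replace = TRUE),
  species = sample(c("Homo_sapiens", "Bison_bison", "Canis_latrans"),
                   1000, replace = TRUE)
)

# Style 1: pass the two vectors directly
from_vectors <- chisq.test(myData$site, myData$species)

# Style 2: build the contingency table first, then pass it
from_table <- chisq.test(table(myData$site, myData$species))

# Both routes cross-tabulate the same data, so the statistics match
print(all.equal(unname(from_vectors$statistic), unname(from_table$statistic)))
```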
Fisher's exact test is more appropriate when sample sizes are low.
A general rule of thumb is to use Fisher's test if the expected value for any cell is less than 5.
fisher.test(myTable)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  myTable
## p-value = 0.5424
## alternative hypothesis: two.sided
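The rule of thumb about expected values below 5 can be applied programmatically. This is a sketch only: the counts in `small_table` are invented for illustration and are not from any real dataset, and the site and species labels are placeholders.

```r
# Hypothetical small-sample counts, invented for illustration only
small_table <- matrix(
  c(3, 1,
    2, 6),
  nrow = 2,
  dimnames = list(site = c("siteA", "siteB"), species = c("sp1", "sp2"))
)

# Expected counts under no association: row total * column total / N
expected <- outer(rowSums(small_table), colSums(small_table)) / sum(small_table)

# Rule of thumb: use Fisher's exact test if any expected count is below 5
if (any(expected < 5)) {
  result <- fisher.test(small_table)
} else {
  result <- chisq.test(small_table)
}

print(result$method)
print(result$p.value)
```

Computing the expected counts directly avoids calling `chisq.test()` just to inspect them, which would emit an approximation warning on a table this small.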
Goodness-of-fit tests assess how closely observed data fit some underlying distribution (e.g., binomial, uniform, normal).
For discrete cases, chi-square can be used as a goodness-of-fit statistic.
For instance, say we counted the frequency of A. afarensis in 4 different geological strata through time.
afarensis <- c(24, 32, 19, 36)
We can use chi-square to test how well this fits a uniform distribution:
chisq.test(afarensis)
## 
##  Chi-squared test for given probabilities
## 
## data:  afarensis
## X-squared = 6.3694, df = 3, p-value = 0.09496
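The uniform goodness-of-fit result above can be reproduced by hand, which makes clear where the expected counts come from. This is a minimal sketch; the names `p_null`, `expected`, `x2`, `df`, and `p_value` are assumptions for illustration.

```r
afarensis <- c(24, 32, 19, 36)

# Uniform null: each of the 4 strata is equally likely
p_null <- rep(1 / length(afarensis), length(afarensis))

# Expected count per stratum = total count * null probability
expected <- sum(afarensis) * p_null

# Same Pearson statistic as for contingency tables
x2 <- sum((afarensis - expected)^2 / expected)

# Degrees of freedom: number of categories minus 1
df <- length(afarensis) - 1

p_value <- pchisq(x2, df = df, lower.tail = FALSE)
print(round(c(X2 = x2, df = df, p = p_value), 4))
```

Swapping `p_null` for any other vector of probabilities that sums to 1 gives the test against a different hypothesized distribution.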
We could specify some other distribution by passing a vector of probabilities.
chisq.test(afarensis, p=c(.4, .1, .1, .4))
## 
##  Chi-squared test for given probabilities
## 
## data:  afarensis
## X-squared = 55.937, df = 3, p-value = 4.333e-12
The Kolmogorov-Smirnov (KS) test is a commonly used goodness-of-fit test for continuous data.
The KS test compares the empirical cumulative distribution function (CDF) of a set of observed data to the CDF of a theoretical distribution.
The single largest absolute deviation of the empirical CDF from the theoretical CDF is the KS statistic, \(D\). This is used to compute a p-value.
The test can be used with any fully specified continuous distribution, not just the normal distribution.
ks.test(rnorm(100), "pnorm")
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  rnorm(100)
## D = 0.056277, p-value = 0.9094
## alternative hypothesis: two-sided
ks.test(rnorm(100)^2, "pnorm")
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  rnorm(100)^2
## D = 0.5, p-value < 2.2e-16
## alternative hypothesis: two-sided
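To make the KS statistic concrete, the largest gap between the empirical and theoretical CDFs can be computed directly. This is a sketch under stated assumptions: the seed is arbitrary, and the names `theo`, `d_plus`, `d_minus`, and `D` are chosen for illustration. Because the empirical CDF is a step function, the gap is checked just after and just before each jump.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

x <- sort(rnorm(100))  # sorted sample
n <- length(x)

theo <- pnorm(x)  # theoretical CDF evaluated at each observation

# Empirical CDF jumps from (i - 1)/n to i/n at the i-th sorted value
d_plus  <- max(seq_len(n) / n - theo)        # ECDF above theoretical CDF
d_minus <- max(theo - (seq_len(n) - 1) / n)  # ECDF below theoretical CDF

# KS statistic: the single largest deviation in either direction
D <- max(d_plus, d_minus)

print(D)
print(unname(ks.test(x, "pnorm")$statistic))  # should agree with D
```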
Read in this dataset on political party affiliation and gender in the USA. https://stats.are-awesome.com/datasets/party_affiliation.txt
Make a plot to visualize the data that looks like the following
Use chi-square to test the hypothesis that there is a gendered difference in party affiliation.
Check out the built-in dataset trees.
Are the tree heights drawn from a normal distribution? Test this hypothesis using the appropriate goodness of fit test.