Watch this at home
Regression is just one (the simplest) of many types of linear models that can describe linear relationships (relationships where effects are additive).
The larger family of related methods is called the General Linear Model
Family of statistical models of the form:
\[Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} +\ ... +\ \beta_nX_{ni} + \epsilon_i\]
where:
\(Y_i\) is the value of the response variable for observation \(i\)
\(\beta_0\) is the intercept
\(\beta_1, ..., \beta_n\) are the coefficients (slopes) for the \(n\) explanatory variables
\(\epsilon_i\) is the error term, assumed to be distributed as \(N(0, \sigma^2)\)
Includes many common statistical methods, such as linear regression, ANOVA, and ANCOVA.
All of these use the same function in R, called lm()
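For example (a sketch using R's built-in iris data), the same lm() call handles regression, ANOVA, and models with both kinds of predictors; only the formula changes:

```r
# Simple regression: continuous response ~ continuous predictor
reg <- lm(Petal.Length ~ Sepal.Length, data = iris)

# One-way ANOVA: continuous response ~ categorical predictor
aov_fit <- lm(Petal.Length ~ Species, data = iris)

# Both predictor types in one model (ANCOVA-style)
both <- lm(Petal.Length ~ Sepal.Length + Species, data = iris)

coef(reg)      # intercept and slope
coef(aov_fit)  # intercept (baseline group) plus offsets for the other groups
```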
First decide which variables are explanatory and which is the response, and how many of each you have.
Response variable is always continuous.
Is a person’s response time (seconds) on a cognitive task explained by their age (years)?
Is brain size (cc) in primates explained by body size (g) and the type of diet consumed (frugivore vs. folivore vs. insectivore)?
Is the diameter of the femoral head (mm) explained by body mass (g) and age (years)?
Is a person's income ($) predicted by the city they live in and their level of education (high school, bachelor's, post-grad degree)?
Compare the means of groups that have been sampled randomly, to test whether or not they differ.
The different levels of the categorical variable representing different groups are called treatments.
Each observation is called a replicate and is assumed to be independent from all other replicates.
So how do we compare group means?
Colored bars at top represent variances within groups.
Black bars at top represent variances ignoring groups.
\[Y_{ij}=\mu + A_i + \epsilon_{ij}\]
\(\mu\) is the population grand mean (the sample grand mean \(\bar{Y}\) is an unbiased estimator of \(\mu\))
\(A_i\) is the additive linear effect of treatment \(i\), expressed as a deviation from the grand mean (the \(A_i\) sum to 0).
\(\epsilon_{ij}\) is the error term (the deviation of an individual point from its treatment group mean). These are assumed to be distributed as \(N(0,\sigma^2)\).
\[Y_{ij}=\mu + A_i + \epsilon_{ij}\]
Let’s see how changing model parameters affects the y variable. We will start by defining a few variables that will be kept constant.
grandMean <- 10
groups <- rep(c("A", "B"), 100)
myError <- rnorm(200, mean = 0, sd=3)
y <- grandMean + c(-5, 5) + myError
plot(y ~ factor(groups))
y <- grandMean + c(-0.3, 0.3) + myError
plot(y ~ factor(groups))
As you could tell from your reading, calculating the appropriate sum of squares is complicated, and depends on your experimental design. Be careful (or just use Monte Carlo)!
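One Monte Carlo option is a permutation test: shuffle the group labels so the null hypothesis is true by construction, then compare the observed F-ratio to the distribution of F-ratios from the shuffled data. A sketch using the iris data:

```r
set.seed(42)
# Observed F-ratio for the real grouping
obs_F <- anova(lm(Petal.Length ~ Species, data = iris))$`F value`[1]

# Null distribution: re-fit after randomly shuffling the group labels
perm_F <- replicate(1000, {
  shuffled <- sample(iris$Species)   # breaks any group/response association
  anova(lm(iris$Petal.Length ~ shuffled))$`F value`[1]
})

# p-value: proportion of null F-ratios at least as large as the observed one
p_value <- mean(perm_F >= obs_F)
p_value
```

This sidesteps the sums-of-squares bookkeeping entirely, at the cost of computation.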
Example data from Gotelli Table 10.1
##   unmanipulated control treatment
## 1            10       9        12
## 2            12      11        13
## 3            12      11        15
## 4            13      12        16
These data represent 12 experimental plots to see the effect of soil temperature on the flowering period in larkspur flowers. The treatment plots had heating coils installed and activated. The control plot had heating coils installed but never turned on. The unmanipulated plots were untouched.
We want to know if there is a difference in mean flowering length between these groups.
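These data can be entered directly and analyzed with lm() and anova() (a sketch; the values come from the printout above, and the column names `days` and `plot_type` are made up for illustration):

```r
# Gotelli Table 10.1 data, entered by hand from the printout above
flowering <- data.frame(
  days = c(10, 12, 12, 13,    # unmanipulated
            9, 11, 11, 12,    # control
           12, 13, 15, 16),   # treatment (heated)
  plot_type = factor(rep(c("unmanipulated", "control", "treatment"),
                         each = 4))
)

larkMod <- lm(days ~ plot_type, data = flowering)
anova(larkMod)  # F test for a difference in mean flowering length
```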
| Source | degrees of freedom | SS | Mean Square (MS) | Expected Mean Square | F-ratio |
|---|---|---|---|---|---|
| Among groups | \(a-1\) | \(\sum\limits_{i=1}^a\sum\limits_{j=1}^n(\bar{Y}_i - \bar{Y})^2\) | \(\frac{SS_{among}}{(a-1)}\) | \(\sigma^2 + n\sigma^2_A\) | \(\frac{MS_{among}}{MS_{within}}\) |
| Within groups (residual) | \(a(n-1)\) | \(\sum\limits_{i=1}^a\sum\limits_{j=1}^n(Y_{ij} - \bar{Y}_i)^2\) | \(\frac{SS_{within}}{a(n-1)}\) | \(\sigma^2\) | |
| Total | \(an-1\) | \(\sum\limits_{i=1}^a\sum\limits_{j=1}^n(Y_{ij} - \bar{Y})^2\) | \(\frac{SS_{total}}{(an-1)}\) | \(\sigma^2_Y\) | |
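To make the table concrete, here is a sketch computing each quantity by hand for the (balanced) iris data and forming the F-ratio, which can be checked against anova():

```r
y <- iris$Petal.Length
g <- iris$Species

grand_mean  <- mean(y)
group_means <- tapply(y, g, mean)
n_per_group <- tapply(y, g, length)

# Sums of squares, following the formulas in the table
SS_among  <- sum(n_per_group * (group_means - grand_mean)^2)
SS_within <- sum((y - group_means[g])^2)
SS_total  <- sum((y - grand_mean)^2)   # = SS_among + SS_within

# Mean squares and F-ratio (balanced design: a groups of n each)
a <- nlevels(g)
n <- unique(n_per_group)               # 50 per species
MS_among  <- SS_among / (a - 1)
MS_within <- SS_within / (a * (n - 1))
F_ratio   <- MS_among / MS_within      # matches anova(lm(y ~ g))
```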
myMod <- lm(iris$Petal.Length ~ iris$Species)
summary(myMod)
##
## Call:
## lm(formula = iris$Petal.Length ~ iris$Species)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -1.260 -0.258  0.038  0.240  1.348
##
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)             1.46200    0.06086   24.02   <2e-16 ***
## iris$Speciesversicolor  2.79800    0.08607   32.51   <2e-16 ***
## iris$Speciesvirginica   4.09000    0.08607   47.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4303 on 147 degrees of freedom
## Multiple R-squared: 0.9414, Adjusted R-squared: 0.9406
## F-statistic: 1180 on 2 and 147 DF, p-value: < 2.2e-16
anova(myMod)
## Analysis of Variance Table
##
## Response: iris$Petal.Length
##               Df Sum Sq Mean Sq F value    Pr(>F)
## iris$Species   2 437.10 218.551  1180.2 < 2.2e-16 ***
## Residuals    147  27.22   0.185
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(iris$Petal.Length ~ iris$Species)
# Homogeneity of variances
bartlett.test(iris$Petal.Length ~ iris$Species)
##
## Bartlett test of homogeneity of variances
##
## data:  iris$Petal.Length by iris$Species
## Bartlett's K-squared = 55.423, df = 2, p-value = 9.229e-13
plot(myMod)
Life gets more complicated when we are considering more than one factor
\[Y_{ij}=\mu + A_i + B_j + AB_{ij} + \epsilon_{ij}\]
With more than one factor, we have the possibility of non-additive effects of one factor on the other. These are called interaction terms
data(warpbreaks)
model <- lm(warpbreaks$breaks ~ warpbreaks$tension + warpbreaks$wool)
anova(model)
## Analysis of Variance Table
##
## Response: warpbreaks$breaks
##                    Df Sum Sq Mean Sq F value   Pr(>F)
## warpbreaks$tension  2 2034.3 1017.13  7.5367 0.001378 **
## warpbreaks$wool     1  450.7  450.67  3.3393 0.073614 .
## Residuals          50 6747.9  134.96
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We have the possibility of non-additive effects, where the effect of factor A depends on the level of factor B.
modelInteraction <- lm(warpbreaks$breaks ~ warpbreaks$tension * warpbreaks$wool)
anova(modelInteraction)
## Analysis of Variance Table
##
## Response: warpbreaks$breaks
##                                    Df Sum Sq Mean Sq F value    Pr(>F)
## warpbreaks$tension                  2 2034.3 1017.13  8.4980 0.0006926 ***
## warpbreaks$wool                     1  450.7  450.67  3.7653 0.0582130 .
## warpbreaks$tension:warpbreaks$wool  2 1002.8  501.39  4.1891 0.0210442 *
## Residuals                          48 5745.1  119.69
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
interaction.plot(warpbreaks$tension, warpbreaks$wool, warpbreaks$breaks)
If your ANOVA is unbalanced (different sample sizes in different groups), then the order in which you specify the factors matters!
anova(lm(response ~ pred1 * pred2))
Is not the same as
anova(lm(response ~ pred2 * pred1))
This is only true if the sample sizes differ.
This behavior differs from many other stats packages, because R uses sequential (Type I) sums of squares by default. You can read more about the ANOVA in R. You can use the car package to compute the ANOVA in the same way as SPSS and other packages.
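A sketch of the car approach (assumes the car package is installed via install.packages("car")). Type II sums of squares test each main effect after the other main effects, so factor order no longer matters:

```r
# Type II sums of squares with car::Anova()
library(car)  # assumed installed: install.packages("car")

model2 <- lm(breaks ~ tension * wool, data = warpbreaks)
Anova(model2, type = "II")  # order-invariant, unlike base anova()
```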
ANOVA with a continuous covariate
It’s a hybrid of regression and ANOVA
Think of it as doing an ANOVA on the residuals of a regression
\[Y_{ij}=\mu + A_i + \beta_i(X_{ij} - \bar{X_i}) + \epsilon_{ij}\]
Question: does Femoral Head Diameter differ in male and female baboons after accounting for differences in overall size?
Is the slope of FHD ~ BM the same in the two groups?
Using baboons data
baboons <- read.table("https://stats.are-awesome.com/datasets/baboons.txt", header=T)
head(baboons)
##       BM SEX  FHD
## 1   8.62   F 19.9
## 2   9.07   F 17.9
## 3   9.98   F 18.9
## 4  10.43   F 18.8
## 5  10.89   F 20.3
## 6  11.34   F 19.5
Just add in the covariate to test whether the intercepts differ between sexes
summary(lm(baboons$FHD ~ baboons$SEX + baboons$BM))
##
## Call:
## lm(formula = baboons$FHD ~ baboons$SEX + baboons$BM)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.35995 -0.54163  0.05521  0.53846  1.28351
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)   17.6474     0.8665  20.367  9.1e-16 ***
## baboons$SEXM   2.3996     0.8404   2.855   0.0092 **
## baboons$BM     0.1594     0.0721   2.210   0.0378 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7614 on 22 degrees of freedom
## Multiple R-squared: 0.8955, Adjusted R-squared: 0.886
## F-statistic: 94.27 on 2 and 22 DF, p-value: 1.622e-11
Use the * to test for interaction (i.e. different slopes between sexes)
summary(lm(baboons$FHD ~ baboons$SEX * baboons$BM))
##
## Call:
## lm(formula = baboons$FHD ~ baboons$SEX * baboons$BM)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.35157 -0.48111  0.06238  0.51091  1.16405
##
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)              16.9481     1.3085  12.953 1.76e-11 ***
## baboons$SEXM              4.1274     2.5473   1.620   0.1201
## baboons$BM                0.2195     0.1109   1.979   0.0611 .
## baboons$SEXM:baboons$BM  -0.1059     0.1472  -0.720   0.4798
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7699 on 21 degrees of freedom
## Multiple R-squared: 0.898, Adjusted R-squared: 0.8834
## F-statistic: 61.64 on 3 and 21 DF, p-value: 1.401e-10
brains <- read.csv("https://stats.are-awesome.com/datasets/Boddy_2012_data.txt")
head(brains)
##                 Species.Name           Order Brain.Mass..g. Body.Mass..g.   EQ
## 1               Homo sapiens        Primates        1250.43      65142.86 5.72
## 2          Phocoena phocoena Cetartiodactyla        1735.00     142430.00 4.43
## 3 Lagenorhynchus obliquidens Cetartiodactyla        1136.75      89750.00 4.10
## 4      Stenella coeruleoalba Cetartiodactyla         883.75      63500.00 4.12
## 5         Tursiops truncatus Cetartiodactyla        1573.00     170480.00 3.51
## 6          Delphinus delphis Cetartiodactyla         797.26      65086.96 3.65
##   Independent.contrast.derived.EQ PGLS.ML.dervived.EQ
## 1                       1.1491544           12.633650
## 2                       0.9732993           10.962980
## 3                       0.8534335            9.476202
## 4                       0.8253700            9.066825
## 5                       0.7877916            8.923060
## 6                       0.7330865            8.059241
##                                                  Reference Notes
## 1                          Sherwood/this study; Nowak 1991
## 2               Crile & Quiring 1940; Silva & Downing 1995
## 3                                      Ridgway et al. 1966
## 4              Pilleri & Busnel 1969; Silva & Downing 1995
## 5                         Marino 2000; Ridgway et al. 1966
## 6 Marino 2000; Pilleri & Busnel 1969; Silva & Downing 1995
Do primates have a different average brain size from other mammals, when controlling for body size?
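One way to frame this as an ANCOVA (a sketch: the column names follow the head(brains) output above, the `isPrimate` indicator is a new column created here, and log-log scaling is the usual choice for brain-body allometry):

```r
brains <- read.csv("https://stats.are-awesome.com/datasets/Boddy_2012_data.txt")

# Indicator for the group of interest
brains$isPrimate <- brains$Order == "Primates"

# Covariate (log body mass) plus the group effect
brainMod <- lm(log(Brain.Mass..g.) ~ log(Body.Mass..g.) + isPrimate,
               data = brains)
summary(brainMod)  # the isPrimateTRUE coefficient is the primate offset
```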