Review

Who can explain this slide for me!?

Hypothesis Testing

Inductive versus deductive reasoning

Induction is a bottom-up approach to reasoning, proceeding from specific observations to general explanations.

Deduction goes the other way: from general to specific.

All methods of doing science use both inductive and deductive reasoning, but the emphasis that they receive differs.

Deduction

Deduction

  • Emphasis is on falsification
  • Requires multiple working hypotheses
  • In the end, there is (hopefully) only one that hasn’t been falsified
  • Limitation: “correct” hypothesis MUST be among the alternatives studied

Induction

Induction

  • Emphasis is on confirmation
  • Builds and modifies hypotheses based on previous knowledge
  • Limitation: may “get off on wrong foot” if hypothesis is just plain wrong

Testing Hypotheses

Null Hypothesis

The null hypothesis is the simplest possible explanation for a phenomenon. This explanation is usually that random variation is responsible for any apparent pattern.

Question: In the case of our baboons, what is the null hypothesis?

Alternatives to the Null

Next, we create one or more statistical alternative hypotheses.

Alternative = difference between male and female femoral head diameters is too great to be accounted for by chance alone.

Most of the time, we don’t explicitly specify the alternative; we simply say that the alternative is “not \(H_0\)”

The alternative hypothesis is focused on the pattern in the data…not the cause of it.

The all powerful p-value

In terms of conditional probability, the p value represents

\[ P(Data\ |\ H_0) \]

The all powerful p-value

For baboons, we might use a T-test to compare means.

Like all parametric statistics, the T-test returns a test statistic.

Because we are assuming that the null hypothesis is true for the moment, we can compute an expected T distribution.

The all powerful p-value

The all powerful p-value

Which R function would tell me the cumulative amount of probability density in this distribution up to our observed value?

The all powerful p-value

Our T-test would return a very low p value.

Highly unlikely to get two sex groups with means as different as ours if the variation were attributable only to chance. We know this because we know the distribution of the T statistic when the null hypothesis is true.
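A minimal sketch of this comparison using simulated data (the sample sizes, means, and standard deviations below are invented for illustration; they are not the real baboon measurements):

```r
set.seed(1)

# Hypothetical femoral head diameters (mm); numbers are made up
male   <- rnorm(25, mean = 30, sd = 2)
female <- rnorm(25, mean = 26, sd = 2)

# With a true difference this large relative to the spread,
# the returned p-value is tiny
t.test(male, female)
```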

If you only remember one thing from this class…

The p-value is an estimate of how likely our data are, assuming the null hypothesis is true.

Low p values do not mean the alternative hypothesis has been proven.

Errors in Hypothesis Testing

| The Truth | Retain \(H_0\) | Reject \(H_0\) |
|---|---|---|
| \(H_0\) True | Correct! | Type I error (\(\alpha\)) |
| \(H_0\) False | Type II error (\(\beta\)) | Correct! |

Errors in Hypothesis Testing

Which type of error is worse?

Errors in Hypothesis Testing

Statistical power is related to Type II errors, and is calculated as \(1 - \beta\).

This tells us how likely we are to detect an effect when one actually exists.

Before starting a study it is worth doing a power analysis to determine the rate at which an effect of a given size will be detected with a given sample size.
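Base R’s `power.t.test()` does this for the two-sample t-test. A sketch with made-up numbers (an effect of 1 unit against a standard deviation of 2, i.e. a moderate standardized effect):

```r
# Power to detect a difference of 1 (sd = 2) with 20 per group
power.t.test(n = 20, delta = 1, sd = 2, sig.level = 0.05)$power

# Larger samples buy more power
power.t.test(n = 100, delta = 1, sd = 2, sig.level = 0.05)$power
```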

Errors in Hypothesis Testing

The risks of Type I and Type II errors are inversely related to one another.

This relationship is not simple, though. Type II error rates depend on a lot of things, like the sample size and strength of the effect.

Statistical Significance versus Significance

Given very large sample sizes, the risk of Type II errors is low: nearly any real effect, however small, will be detected.

Tiny, virtually meaningless differences can become statistically significant.

When you find statistical significance, you have to ask

  • how big is the effect?
  • should I care?

Statistical Significance versus Significance

  • Two populations (n=1,000,000)
  • differ in height by about 0.01% (p<0.0000000001)
  • Should I care? Is this biologically meaningful?
  • Depends….but for many questions the answer is no.
  • ‘Significant’ means differences beyond chance expectations
  • ‘Effect size’ is how big the difference is. This is what is biologically important
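The example above can be simulated directly. A sketch with invented numbers (a 0.1% difference rather than 0.01%, so a single seeded run shows the pattern clearly):

```r
set.seed(42)
a <- rnorm(1e6, mean = 100.0, sd = 10)
b <- rnorm(1e6, mean = 100.1, sd = 10)  # means differ by ~0.1%

t.test(a, b)$p.value        # tiny: "statistically significant"
(mean(b) - mean(a)) / 10    # standardized effect size: ~0.01, trivially small
```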

Statistical Significance versus Significance

Don’t be a blind follower of \(p<0.05\) !!!!!

p-values don’t measure effect size!!!!

Statistical significance versus effect size

Many bacon lovers learned the difference between effect size and significance a couple of years ago…

Three Hypothesis Testing Frameworks

We will walk through testing the same hypothesis in all three frameworks

  • Monte Carlo
  • Parametric
  • Bayesian

Our example data

Ant data from Gotelli CH 5

Read this data into R yourself!

Let’s test the hypothesis that habitat type influences ant distributions. Based on this hypothesis, we would expect that the number of ant nests in forests differs from the number of ant nests in fields.

What is the null hypothesis here?
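If you don’t have the file handy, the counts shown later in these slides can be entered by hand. Note: the 6-forest / 4-field split below is inferred from the group means (7 vs. 10.75, a difference of 3.75), so double-check it against Gotelli’s Chapter 5 data:

```r
# Hand-entered sketch of the ant nest counts; group assignment is inferred
ants <- data.frame(
  Habitat = c(rep("Forest", 6), rep("Field", 4)),
  n_Ants  = c(9, 6, 4, 6, 7, 10, 12, 9, 12, 10)
)

tapply(ants$n_Ants, ants$Habitat, mean)
```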

Monte Carlo

Town of Monte Carlo

Monte Carlo refers to a world-famous casino, in a town on the French Riviera (in the Principality of Monaco).

In Monte Carlo analysis, data are randomly reshuffled over and over to specify the null hypothesis, and these reshufflings are compared against the observed data.

Monte Carlo

There are four steps:

  1. Decide on a test statistic
  2. By reshuffling, create a distribution of the test statistic that would be expected under the null hypothesis
  3. Decide on a one-tailed or two-tailed test
  4. Compare the observed to the null distribution and calculate the p value.

Monte Carlo - Step 1 - Test statistic

We will use the absolute value of the difference in group means. We will need to calculate this over and over, so we will make a function to do it.

library(dplyr)

# Absolute difference between the two habitat means
abs.mean <- function(df){
  means <- df %>%
    group_by(Habitat) %>%
    summarize(meanAnts = mean(n_Ants))
  abs(means$meanAnts[1] - means$meanAnts[2])
}

abs.mean(ants)
## [1] 3.75

Monte Carlo - Step 2 - Shuffle

Now, we reshuffle the ant nest counts with the sample() function. Note: by default, the function draws a sample of the same size as the original vector, without replacement.

ants$n_Ants
##  [1]  9  6  4  6  7 10 12  9 12 10
sample(ants$n_Ants)
##  [1] 12  7  6 10  9 10  9  4 12  6

Monte Carlo - Step 2 - Shuffle

We can do this 1000 times by using a for loop

results <- numeric(1000)

for(i in 1:1000) {
  
  mix <- ants
  mix$n_Ants <- sample(mix$n_Ants)
  results[i] <- abs.mean(mix)
}

Monte Carlo - Step 2 - Shuffle

We can look at the results with a histogram
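A self-contained sketch (this recreates the reshuffled differences with `replicate()` rather than the loop, and assumes the 6-forest / 4-field split of sites):

```r
# Recreate the null distribution of reshuffled mean differences
obs <- c(9, 6, 4, 6, 7, 10, 12, 9, 12, 10)   # the ant nest counts
results <- replicate(1000, {
  mix <- sample(obs)                          # reshuffle counts across sites
  abs(mean(mix[1:6]) - mean(mix[7:10]))       # 6 forest vs. 4 field sites
})

hist(results,
     main = "Null distribution of |mean difference|",
     xlab = "Absolute difference in group means")
abline(v = 3.75, lwd = 2)   # our observed difference
```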

Monte Carlo - Step 3 - one or two tailed?

When we compute the p value, we will ask how many mean differences computed on the randomized data are as extreme or more extreme compared to our observed mean difference. This is known as the tail probability.

Should we look at just one side of the distribution or both?

We are looking at the absolute value of the difference, so we want a one-tailed test.

Monte Carlo - Step 4 - Calculate p

To get our one-tailed p-value, we just count up how many of the mean differences in the randomized data are as big as (or bigger than) our observed mean difference (3.75).

count_extreme_diffs <- sum(results >= 3.75)
count_extreme_diffs / 1000
## [1] 0.034

Shortcut…

mean(results >= 3.75)
## [1] 0.034

Who can explain how this works?
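Hint: when a logical vector is handed to an arithmetic function, `TRUE` is coerced to 1 and `FALSE` to 0, so the mean of a logical vector is the proportion of `TRUE`s:

```r
x <- c(TRUE, FALSE, TRUE, TRUE)
sum(x)    # 3
mean(x)   # 0.75
```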

Monte Carlo

What does our p value of 0.034 mean?

If there is no relationship between habitat and nest count, we would expect to observe a mean difference as extreme or more extreme than ours about 3.4% of the time.

If you only remember one thing from this class…

The p-value is an estimate of how likely our data are, assuming the null hypothesis is true.

Low p values do not mean the alternative hypothesis has been proven.

Monte Carlo Pros and Cons

Pros:

  • fewest assumptions
  • null hypothesis is very clear and easy to understand

Cons:

  • you have to “roll-your-own”
  • you don’t get the same answer every time

Parametric

Parametric

Assumes data were sampled from a specified distribution (usually normal distribution).

The parameters of this distribution are used to calculate the tail probability of the data given a null hypothesis.

\[P(data\ |\ hypothesis)\]

Question….

Which R function computes tail (i.e. cumulative) probability for the normal distribution?

dnorm()?

pnorm()?

qnorm()?

rnorm()?
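If you want to check your answer, each member of the `d`/`p`/`q`/`r` family is easy to probe at the console:

```r
dnorm(0)       # density: height of the curve at 0
pnorm(1.96)    # cumulative probability up to 1.96 (~0.975)
qnorm(0.975)   # quantile: the inverse of pnorm (~1.96)
rnorm(3)       # three random draws
```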

Parametric

  1. Specify the test statistic
  2. Specify the null distribution
  3. Calculate the tail probability

Parametric - Step 1 - Test Statistic

F statistic or F ratio - useful for comparing whether group means differ

Basis for ANOVA

\[ F = \frac{variance\ between\ groups}{variance\ within\ groups}\]

Parametric - Step 1 - Test Statistic

We will spend a lot of time talking about how to calculate the F ratio

Our F-ratio in the case of the ant nests is

\[ F = 33.75 / 3.84 = 8.78 \]
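As a check, R’s `aov()` reproduces this, assuming the same hand-entered data sketch (the 6-forest / 4-field split is inferred from the slides’ group means):

```r
# Hand-entered sketch of the ant data; group assignment is inferred
ants <- data.frame(
  Habitat = c(rep("Forest", 6), rep("Field", 4)),
  n_Ants  = c(9, 6, 4, 6, 7, 10, 12, 9, 12, 10)
)

# One-way ANOVA: the F value should be ~8.78
summary(aov(n_Ants ~ Habitat, data = ants))
```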

Parametric - Step 2 - Specify null dist

The null hypothesis is that the data were drawn from a single population.

If true, then our F ratio tends to be about 1 (similar variance between and within groups)

Larger and larger values of F are less and less likely to be observed under the null hypothesis.

The F statistic follows a known F distribution.

Parametric - Step 2 - Specify null dist

Parametric - Step 3 - Compute the p-value

The upper tail probability can be computed easily because we are dealing with a known distribution.

1 - pf(8.78, df1=1, df2=8)
## [1] 0.0180603
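Equivalently, `pf()` can return the upper tail directly via its `lower.tail` argument, which avoids the subtraction:

```r
# Same upper-tail probability, without the 1 - ... step
pf(8.78, df1 = 1, df2 = 8, lower.tail = FALSE)
```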

Parametric - Step 3 - Compute the p-value


Who can tell me what this p-value of 0.018 means?

If you only remember one thing from this class…

The p-value is an estimate of how likely our data are, assuming the null hypothesis is true.

Low p values do not mean the alternative hypothesis has been proven.

Parametric Assumptions

  • data independent
  • data come from specified distribution
  • additional tests make additional assumptions, and these are often the most critical

Parametric Pros and Cons

Pros

  • ubiquitous in standard stats packages, textbooks, articles
  • extremely well understood (by statisticians)

Cons

  • easy to misuse
  • p values misinterpreted
  • assumptions not always tested
  • less flexible than Monte Carlo

Bayesian

Bayes’ theorem was discovered by the Rev. Thomas Bayes in the mid-18th century

Based on inverse probability

Bayesian



\[P(hypothesis\ |\ data) = \frac{P(hypothesis) * P(data\ |\ hypothesis)}{P(data)}\]

\[Posterior\ Probability = \frac{Prior\ Probability * Likelihood}{Marginal\ Likelihood}\]
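The arithmetic is easy to see with a discrete toy example (the two hypotheses and all the numbers here are invented):

```r
# Two competing hypotheses with prior probabilities
prior <- c(H1 = 0.5, H2 = 0.5)

# Probability of the observed data under each hypothesis (the likelihood)
likelihood <- c(H1 = 0.2, H2 = 0.6)

# Marginal likelihood: P(data), summed over the hypotheses
marginal <- sum(prior * likelihood)

# Bayes' theorem: posterior = prior * likelihood / marginal
posterior <- prior * likelihood / marginal
posterior   # H1 = 0.25, H2 = 0.75
```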

Bayesian

  1. Specify the hypothesis
  2. Specify the parameters as random variables
  3. Specify the priors
  4. Calculate the likelihood
  5. Calculate the posterior
  6. Interpret

Bayesian - Step 1 - Hypothesis

Turns the question around compared to parametric. Considering our ants:

Parametric asks: \[P(F_{obs} = 8.78\ |\ F_{theoretical})\]

Bayesian asks: \[P(F\geq5.32\ |\ F_{obs} = 8.78)\]

Note: 5.32 is the critical value from this F distribution, used to mathematically specify the alternative hypothesis.

Bayesian - Step 2 - Parameters

Specify parameters as random variables, instead of assuming there is a single fixed (but unknown) value of parameters

So the mean of the ants in the field would be: \[ \mu_{field} \sim N(\lambda_{field}, \sigma^2)\]

Bayesian - Step 3 - Priors

\[Posterior\ Probability = \frac{Prior\ Probability * Likelihood}{Marginal\ Likelihood}\]

What are the expected probability distributions for our parameters before we do the experiment?

Can either use uninformative flat priors, or base them on our previous knowledge about ant colony densities in field and forest.

Bayesian - Step 3 - Priors

Incorporating prior knowledge into hypothesis testing is the fundamental distinction of the Bayesian framework.

Bayesian - Step 4 - Likelihood

Next we estimate the likelihood.

\[Posterior\ Probability = \frac{Prior\ Probability * Likelihood}{Marginal\ Likelihood}\]

The standard probability \(P(data\ |\ hypothesis)\) is the probability of our data given the null hypothesis.

Likelihood is proportional to probability, but:

  • based on a single dataset, can be recalculated based on many different parameter values
  • doesn’t sum to 1, so it is not a true probability distribution
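A sketch of both points with made-up data: evaluate the likelihood of the same two observations across a grid of candidate means, and note that the values don’t sum to 1:

```r
x <- c(7, 10.75)              # two hypothetical observations (sd assumed = 2)
mu_grid <- seq(0, 20, by = 0.5)

# Likelihood of the SAME data, recalculated at each candidate value of mu
lik <- sapply(mu_grid, function(m) prod(dnorm(x, mean = m, sd = 2)))

sum(lik)                  # not 1: likelihood is not a distribution over mu
mu_grid[which.max(lik)]   # the maximum-likelihood value of mu
```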

Bayesian - Step 5 - Posterior

\[Posterior\ Probability = \frac{Prior\ Probability * Likelihood}{Marginal\ Likelihood}\]

Now we just do the math to calculate the posterior distributions for our parameters. We can then use Monte Carlo sampling to produce an expected distribution of F values from our posterior.

Bayesian - Step 6 - Interpret

Tail probability of observed F in our simulated F distribution is P = 0.67

Much higher than our parametric or Monte Carlo estimates

Would change if we used different priors

Bayesian

Pros

  • Explicitly builds on previous work (which reflects scientific progress, and the way we think)
  • Intuitive interpretation, less prone to cookbook critical values of p

Cons

  • It is really friggin’ complicated
  • Use of different priors yields different answers, which can be subjective

Summary

Most of what you do will be parametric stats: be sure to think about assumptions

Be very careful about \(p < 0.05\)

Bayesian offers a powerful (if complicated) alternative

Monte Carlo is flexible and awesome, and you should do it!