Probability distributions are models which describe different types of datasets that are commonly observed in the real world.
They are mathematical functions which offer an easy way to estimate summary statistics, and to estimate how likely a given value is in a given dataset.
Probability distributions form the basis of most statistical hypothesis tests.
A random variable –in statistical terms– is a variable whose value depends on random chance.
Each random variable has one or more parameters governing the probability of different outcomes.
There are two types of random variables:
discrete random variables which have a limited number of possible discrete outcomes and
continuous random variables which have a (theoretically) unlimited number of possible outcomes.
Think of an example of a discrete random variables and an example of a continuous random variables for your area of scientific interest. What is the sample space for the random variable?
If we plot the values from a random variable using a histogram, the shape of the histogram reflects the shape of the probability distribution of that random variable.
The various distributions form the basis of parametric statistics.
A Bernouili trial describes an event that has exactly two possible outcomes: success and failure.
Success occurs with a given probability, for example \(p = 0.2\).
We know how likely a trial is to result in success, but any given trial may result in either success or failure, and this cannot be predicted for a single trial (the outcome is stochastic).
The result of a series of \(n\) Bernoulili trials with \(X\) successful outcomes results in a binomial random variable.
The expected value of the binomial distribution is \(E(x) = np\)
Similar to the binomial distribution, but describes rare events, when the number of trials \(n\) is unknown.
Requires a single rate parameter \(\lambda\).
It is commonly pops up when examining the number of events occuring through time (e.g., the number of pieces of mail recieved per day, or the number of speciation events occuring per millenium).
The expected value and variance for the Poisson distribution are both equal to lambda
At large values of lambda, the Poisson approximates a normal distribution with mean \(\lambda\).
The uniform distribution represents a function in which the probability density is equal for each sub-interval across the a given range.
This results in a flat frequency distribution.
The expected value over the range \(a\) to \(b\) is \((a + b)/2\).
An example might be the spatial coordinates of plants which are competing for nutrients and light.
The normal distribution (or Gaussian distribution) is the familiar bell-curve shaped distribution that is symmetrical around the mean, with diminishing tails.
Many phenomena in nature are distributed as a normal distribution, especially continuous measurement values.
The normal distribution has two parameters, the mean (\(\mu\)) and the standard deviation (\(\sigma\)).
The log-normal distribution resembles a normal distribution when it is logged.
Many biological characteristics are log-normally distributed.
The exponential distribution is the continuous version of the Poisson distribution. It is goverend by a single rate parameter.
There are four main R functions associated with each probability distribution. In the functions below, you will replace dist with the abbreviation for that distribution (norm, binom, etc.)
rdist() - random values drawn from the dist functionddist() - probability density of dist at a particular pointpdist() - cumulative (tail) probability of the dist functionqdist() - quantiles of dist (quantiles are the converse of cumulative probability)Simulate data from a binomial distribution
sim <- rbinom(n = 5000, size=5000, prob = 0.5) qplot(sim, main="binomial distribution p = 0.5")
Expected outcome of 5000 trials is to have have approximately 2500 successful outcomes.
It is less likely to get many more or less than 2500 successes.
Note: the binomial distribution approximates the normal distribution at very large values of \(n\).
the quantile function answers the question: at what point of \(x\) has the cumulative probability reached a particular value?
The value of x at which the cumulative probaility function hits 0.5 is the quantile value for 0.5 (aka. the 50th percentile).
Consider a normal distribution with a mean of 10 and a standard deviation of 1.5.
We can calculate the cumulative probability up to a value of 5 like this:
pnorm(q = 5, mean = 10, sd = 1.5)
## [1] 0.0004290603
If we were dealing with real data, we would be forced to say that an observation of 5 is very uncommon in this distribution, and the cumulative probability gives us an estimate of just how unlikely
This is fundamental, as this is exactly what a p value represents, but more on that next week.
Challenge: What is the cumulative probability at a value of 10 in the same normal distribution?
Quantiles are like cumulative probability turned on its head. Cumulative probabilty asks “how much probability density occurs at less than or equal to a value \(x\)”. A quantile is the opposite: “what is the value of \(x\) at which a given proportion of the probability density occurs?”
The most familiar examples are percentiles. Take a normal distribution of exam scores.
set.seed(1237) grades <- rnorm(1000, mean=75, sd=10)
qplot(grades, color=I("white"))
Where is 90th percentile?. At this \(x\) value, 90% of observations are less than or equal to \(x\).
ninety_percentile <- qnorm(0.9, mean=75, sd = 10)
qplot(grades, main="grades distribution with 90th percentile indicated", color=I("white")) +
geom_vline(xintercept=ninety_percentile, color="red")
We can check to see if this matches by counting how many observations are less than or equal to the ninety_percentile calculted from theqnorm() function.
sum(grades <= ninety_percentile) / length(grades)
## [1] 0.893
Why isn’t this number exactly 0.9?
It is often very useful to test an empirical distribution of values to see how closely it approximates a normal distribution. There are two main ways to do this:
The Q-Q plot compares your empirical cumulative distribution to the theoretical normal cumulative distribution. If the data are normal, then the Q-Q plot will look like a straight line.
qqnorm(rnorm(1000))
If your data are not normally distributed, the Q-Q plot will look curved, or banana shaped. It will take practice to figure out how far a Q-Q plot can deviate from straight line without creating problems.
qqnorm(rnorm(1000)^2)
qqnorm(rnorm(1000)^3)
The Shapiro-Wilk Test yields a test statistic W, by finding the largest deviation from the expected line in a qqplot. This is a powerful test for normality, but does not work well when there are many ties in the data.
shapiro.test(rnorm(1000))
## ## Shapiro-Wilk normality test ## ## data: rnorm(1000) ## W = 0.99704, p-value = 0.0608
shapiro.test(rnorm(1000)^2)
## ## Shapiro-Wilk normality test ## ## data: rnorm(1000)^2 ## W = 0.70506, p-value < 2.2e-16