Tidy Data

Data Shapes - short versus long

Short

animal    nToes   size     smell
chicken   4       small    so-so
cow       2       big      objectionable
pig       2       medium   more_objectionable

Long

animal    variable   value
cow       size       big
cow       nToes      2
cow       smell      objectionable
pig       size       medium
pig       nToes      2
pig       smell      more_objectionable
chicken   size       small
chicken   nToes      4
chicken   smell      so-so
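
The two tables above hold the same information. As a minimal sketch (not part of the original slides), the short table can be rebuilt by hand and reshaped with pivot_longer(). The data frame name animals is an assumption made for illustration. Because nToes is numeric while size and smell are text, the values are converted to character so they can share one column. The rows may come out in a different order than in the long table above.

```r
library(tidyr)

# Hypothetical data frame reproducing the short table above
animals <- data.frame(
  animal = c("chicken", "cow", "pig"),
  nToes  = c(4, 2, 2),
  size   = c("small", "big", "medium"),
  smell  = c("so-so", "objectionable", "more_objectionable")
)

# Collapse every column except animal into variable/value pairs.
# values_transform converts each value to character so that
# numbers and text can live in the same value column.
animals_long <- pivot_longer(
  animals,
  cols = -animal,
  names_to = "variable",
  values_to = "value",
  values_transform = list(value = as.character)
)

animals_long
```

Three animals times three measured variables gives nine rows, one per animal-variable pair.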

But the long form is stupid!

Why would I ever use it??

I’m glad you asked.

Two reasons come up most often:

You want to take advantage of group_by() in the dplyr package to compute summaries or transformations for each group

You want to take advantage of facets in ggplot2

dplyr and ggplot2 just work better on the long form

(which makes sense, as all these packages are part of the tidyverse)
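
To make the group_by() point concrete, here is a small sketch on made-up long-form data. The data frame scores_long and its values are hypothetical, not the dataset used in these slides. Because the exam name is stored as a value in a column, a single group_by() handles every exam at once.

```r
library(dplyr)

# Hypothetical long-form scores: one row per student per exam
scores_long <- data.frame(
  student = rep(c("A", "B"), each = 3),
  test    = rep(c("exam1", "exam2", "exam3"), times = 2),
  grade   = c(60, 70, 80, 50, 60, 70)
)

# One grouped summary covers all exams, however many there are
exam_means <- scores_long %>%
  group_by(test) %>%
  summarise(mean_grade = mean(grade), .groups = "drop")

exam_means
```

With the short form you would have to name each exam column separately. Here a fourth exam would just mean more rows, and the code would not change.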

Example data

filepath <- "https://stats.are-awesome.com/datasets/studentgrades.txt"
grades <- read.table(filepath, header=TRUE, sep=',')
head(grades)

##            students            TA exam1 exam2 exam3
## 1           Guy May       Raphael 57.18 56.79 74.91
## 2   Genevieve Clark       Raphael 56.51 64.14 61.45
## 3     Jody Phillips     Donatello 71.49 71.95 71.14
## 4 Claire Fitzgerald Michaelangelo 80.93 71.33 68.11
## 5    Everett Graves       Raphael 62.64 45.31 55.21
## 6       Lance Logan      Leonardo 70.18 64.03 64.81

Two main functions

pivot_longer() takes multiple columns and collapses them into two columns: a key column and a value column

pivot_wider() takes two columns (key - value pairs) and spreads them into separate columns

pivot_longer()

Collapses multiple columns into a pair of key - value columns. Check out ?tidyr_tidy_select to see all the options for how you can select columns

library(tidyverse)
head(grades)

##            students            TA exam1 exam2 exam3
## 1           Guy May       Raphael 57.18 56.79 74.91
## 2   Genevieve Clark       Raphael 56.51 64.14 61.45
## 3     Jody Phillips     Donatello 71.49 71.95 71.14
## 4 Claire Fitzgerald Michaelangelo 80.93 71.33 68.11
## 5    Everett Graves       Raphael 62.64 45.31 55.21
## 6       Lance Logan      Leonardo 70.18 64.03 64.81

grades <- pivot_longer(grades, cols = c(exam1, exam2, exam3), names_to = "test", values_to = "grade")
head(grades)

## # A tibble: 6 × 4
##   students        TA      test  grade
##   <chr>           <chr>   <chr> <dbl>
## 1 Guy May         Raphael exam1  57.2
## 2 Guy May         Raphael exam2  56.8
## 3 Guy May         Raphael exam3  74.9
## 4 Genevieve Clark Raphael exam1  56.5
## 5 Genevieve Clark Raphael exam2  64.1
## 6 Genevieve Clark Raphael exam3  61.4
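
The cols argument accepts any tidy-select expression, so listing every column by name is only one option. The sketch below uses a hypothetical two-row stand-in called toy (not the real grades file) to show three selections that should give the same result.

```r
library(tidyr)

# Hypothetical two-row stand-in for the grades data
toy <- data.frame(
  students = c("A", "B"),
  TA       = c("Raphael", "Donatello"),
  exam1    = c(57.2, 71.5),
  exam2    = c(56.8, 72.0),
  exam3    = c(74.9, 71.1)
)

# Select by name prefix
by_prefix <- pivot_longer(toy, cols = starts_with("exam"),
                          names_to = "test", values_to = "grade")

# Select a contiguous range of columns
by_range <- pivot_longer(toy, cols = exam1:exam3,
                         names_to = "test", values_to = "grade")

# Select everything except the identifier columns
by_drop <- pivot_longer(toy, cols = -c(students, TA),
                        names_to = "test", values_to = "grade")
```

Selecting by prefix or by exclusion tends to hold up better than a fixed list when more exam columns are added later.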

Now you can use the test variable in facets, to make multiple plots easily

pivot_longer()

ggplot(aes(x=TA, y=grade, fill=TA), data=grades) + 
  geom_boxplot() + 
  facet_wrap(~test) + 
  theme_bw(15) + 
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

pivot_wider()

Does the opposite of pivot_longer(). Spreads a pair of key - value columns across multiple columns

head(grades)

## # A tibble: 6 × 4
##   students        TA      test  grade
##   <chr>           <chr>   <chr> <dbl>
## 1 Guy May         Raphael exam1  57.2
## 2 Guy May         Raphael exam2  56.8
## 3 Guy May         Raphael exam3  74.9
## 4 Genevieve Clark Raphael exam1  56.5
## 5 Genevieve Clark Raphael exam2  64.1
## 6 Genevieve Clark Raphael exam3  61.4

grades %>% pivot_wider(names_from=test, values_from=grade)

## # A tibble: 50 × 5
##    students          TA            exam1 exam2 exam3
##    <chr>             <chr>         <dbl> <dbl> <dbl>
##  1 Guy May           Raphael        57.2  56.8  74.9
##  2 Genevieve Clark   Raphael        56.5  64.1  61.4
##  3 Jody Phillips     Donatello      71.5  72.0  71.1
##  4 Claire Fitzgerald Michaelangelo  80.9  71.3  68.1
##  5 Everett Graves    Raphael        62.6  45.3  55.2
##  6 Lance Logan       Leonardo       70.2  64.0  64.8
##  7 Anne Craig        Michaelangelo  74.4  79.1  72.6
##  8 Ted Hunt          Leonardo       73.0  79.8  87.5
##  9 Angelina Wright   Michaelangelo  70.1  69.7  62.6
## 10 Janet Padilla     Michaelangelo  71.1  69.1  80.1
## # ℹ 40 more rows
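
Since pivot_wider() undoes pivot_longer(), a round trip should return the original table. The sketch below checks this on a hypothetical two-row data frame named wide (an assumption for illustration, not the real grades file).

```r
library(tidyr)

# Hypothetical wide table with the same column layout as grades
wide <- data.frame(
  students = c("A", "B"),
  TA       = c("Raphael", "Donatello"),
  exam1    = c(57.2, 71.5),
  exam2    = c(56.8, 72.0),
  exam3    = c(74.9, 71.1)
)

# Wide -> long: exam columns become test/grade pairs
long <- pivot_longer(wide, cols = c(exam1, exam2, exam3),
                     names_to = "test", values_to = "grade")

# Long -> wide: test values become column names again
wide_again <- pivot_wider(long, names_from = test, values_from = grade)

# Compare as plain data frames, since pivot_wider() returns a tibble
all.equal(as.data.frame(wide_again), wide)
```

Note that the pivot_wider() call in the slide prints the wide table without assigning it, so grades itself stays in long form.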

Challenge - using this dataset

filepath <- "https://stats.are-awesome.com/datasets/barr_astrag_2014.txt"
astrag <- read.table(filepath, header=TRUE, sep="\t")
head(astrag)

##   individual                 Taxon Habitat    ACF  APD     B DistRad DMTD
## 1  AMNH81690    Aepyceros_melampus      LC 384.72 6.89 15.16    9.22 1.78
## 2  AMNH82050    Aepyceros_melampus      LC     NA 6.09 14.00    8.73 1.64
## 3  AMNH83534    Aepyceros_melampus      LC     NA 6.14 17.22    9.19 1.70
## 4  AMNH85150    Aepyceros_melampus      LC     NA 5.80 15.27    8.85 1.72
## 5 AMNH233038 Alcelaphus_buselaphus       O 607.83 7.46 18.47   11.33 2.26
## 6  AMNH34717 Alcelaphus_buselaphus       O 465.91 6.56 15.47    9.79 1.78
##   DTArea   LML   MIN   MML PMTD ProxRad  PTArea   WAF   WAT
## 1 553.14 35.78 28.04 34.39 6.33   10.00  915.50 21.39 21.83
## 2 558.17 35.46 27.28 33.18 6.54   10.50  841.79 22.47 21.11
## 3 554.00 39.15 29.99 36.94 7.63   10.65  976.25 22.07 21.54
## 4 498.67 36.61 28.02 34.09 6.97   10.00  841.71 22.55 21.03
## 5 900.76 45.25 36.03 43.55 6.83   13.85 1557.05 28.98 27.07
## 6 649.47 36.81 30.80 35.95 4.28   10.91 1139.50 24.04 25.84

Challenge - Make this Plot