Lesson 1: Introduction to R, RStudio, and the Tidyverse
Alexandra Emmons, PhD
Bioinformatics Training and Education Program (BTEP)
8 lessons directed toward data wrangling
• L1: Introduction to R, RStudio, and the Tidyverse
• L2: Getting Started, the basics
• L3: Loading and reshaping data
• L4: Data visualization with ggplot2
• L5: The pipe, filtering, and joining data tables
• L6: Split, apply, combine
• L7: Introduction to Bioconductor -omics classes (containers)
• L8: Data Wrangling Review and Practice
What is R?
• A language and statistical computing environment
• Open source, for and by scientists
• Widespread community
• Extended use through package installation
• R packages are collections of R functions, compiled code, and sample data
Why should we use R?
• Great for statistical analysis, data visualization, and report generation
• Supports large scale data analysis
• Removes some of the human error associated with Excel
• Ever growing community
• Many ways to get help
• Field specific packages and workflows
• Problems are “googlable”
Where do we find R packages?
• Comprehensive R Archive Network (CRAN)
• Bioconductor
• GitHub
Check out METACRAN
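As a minimal sketch, installing from each of the three sources usually looks something like the following. The package names (dplyr, Biostrings) are only illustrative choices and not part of the lesson; BiocManager and remotes are helper packages that themselves come from CRAN. Running this requires an internet connection.

```r
# CRAN: install.packages() is built into R
install.packages("dplyr")

# Bioconductor: packages are installed through the BiocManager helper
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("Biostrings")  # illustrative Bioconductor package

# GitHub: one common route is the remotes package ("owner/repository")
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("tidyverse/dplyr")  # development version, illustrative

# Installing is done once; loading is done in every new R session
library(dplyr)
```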
RStudio: an integrated development environment (IDE) for R
Course registrants, please fill out this form with your DNAnexus
information.
Let’s take a tour of the RStudio IDE
Data Wrangling
Best Practices for data analysis
1. Keep raw data separate from analyzed data.
2. Keep spreadsheet data tidy (or as tidy as possible).
3. Trust but verify.
--- From https://datacarpentry.org/genomics-r-intro/03-basics-factors-dataframes/index.html
What is tidy data?
**Having tidy data is useful but not always necessary. Do not worry about strict adherence to the rules. Your data should be in whatever format makes your life easier for analysis.**
Image from Lowndes and Horst 2020: Tidy Data for Efficiency, Reproducibility, and Collaboration
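To make the idea concrete, here is a small sketch that assumes the usual tidy data convention: each variable is a column, each observation is a row, and each value is a cell. The table and its column names are hypothetical. It uses pivot_longer() from tidyr to reshape a wide table, with one column per sample, into a tidy long table.

```r
library(tidyr)

# Hypothetical wide table: one row per gene, one column per sample
wide <- data.frame(
  gene    = c("geneA", "geneB"),
  sample1 = c(10, 5),
  sample2 = c(12, 7)
)

# Tidy version: one row per gene/sample combination,
# with "sample" and "count" as separate variables
long <- pivot_longer(
  wide,
  cols      = c(sample1, sample2),
  names_to  = "sample",
  values_to = "count"
)

print(long)
```

Whether the wide or the long layout is more convenient depends on the analysis; the long form is what most Tidyverse functions expect.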
Guidelines to keep spreadsheets tidy
• Be consistent
• Choose meaningful names for things; no spaces
• Write dates as YYYY-MM-DD
• No empty cells
• Put just one thing in a cell
• Don’t use font color or highlighting as data
• Save the data as plain text files
--- https://jhudatascience.org/tidyversecourse/intro.html
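A short base R sketch of why these guidelines pay off: a plain-text table with space-free column names, no empty cells, and YYYY-MM-DD dates can be read and type-converted with almost no cleanup. The data and column names below are made up for illustration.

```r
# Hypothetical plain-text (CSV) content that follows the guidelines
csv_text <- "sample_id,collection_date,weight_g
S01,2023-01-15,12.5
S02,2023-02-03,13.1"

# read.csv() can read from a file path or, as here, from a text string
samples <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Dates written as YYYY-MM-DD convert without specifying a format
samples$collection_date <- as.Date(samples$collection_date)

# Quick checks: no spaces in column names, no missing values
stopifnot(!any(grepl(" ", names(samples))))
stopifnot(!any(is.na(samples)))

str(samples)
```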
Tidyverse
• An opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. --- tidyverse.org
• Core packages:
• dplyr, ggplot2, forcats, tibble, readr, stringr, tidyr, and purrr
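As a brief sketch, and assuming the tidyverse package is already installed, the core packages can be attached with a single library() call. The tidyverse_packages() helper lists every package that belongs to the collection.

```r
# Attach the core Tidyverse packages in one step
library(tidyverse)

# List all packages that are part of the Tidyverse
pkgs <- tidyverse_packages()
print(pkgs)
```

Individual packages can also be loaded on their own, for example library(dplyr), when only one is needed.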
What is data wrangling?
• Data wrangling is a catch-all phrase for cleaning, transforming, and summarizing data
[Figure labels: Tidy, Transform, Summarize]
• Example: browseVignettes(package = "dplyr")
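The three wrangling steps of cleaning, transforming, and summarizing might look like this with dplyr. This is only a sketch on a made-up table; the column names and values are hypothetical. Function calls are written one step at a time, without the pipe.

```r
library(dplyr)

# Hypothetical example data: one count per sample, with a missing value
counts <- data.frame(
  sample    = c("s1", "s2", "s3", "s4"),
  treatment = c("control", "control", "treated", "treated"),
  count     = c(10, NA, 30, 50)
)

# Clean: drop rows with missing counts
cleaned <- filter(counts, !is.na(count))

# Transform: add a log2-scaled column
transformed <- mutate(cleaned, log_count = log2(count + 1))

# Summarize: mean count per treatment group
summary_tbl <- summarize(
  group_by(transformed, treatment),
  mean_count = mean(count)
)

print(summary_tbl)
```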
Tutorials
• Coursera: Modeling Data in the Tidyverse
• Dataquest
• Many others… (e.g., glittr)