Lesson 1

The document outlines a course on data wrangling with R using the Tidyverse, consisting of eight lessons covering topics from R basics to data visualization and Bioconductor. It emphasizes the importance of tidy data practices and provides resources for learning and support. Additionally, it introduces RStudio as an integrated development environment and highlights best practices for data analysis.

Uploaded by

RaushanYadav

Data Wrangling with R: Using the Tidyverse
Alexandra Emmons, PhD
Bioinformatics Training and Education Program (BTEP)
8 lessons directed toward data wrangling
• L1: Introduction to R, RStudio, and the Tidyverse
• L2: Getting Started, the basics
• L3: Loading and reshaping data
• L4: Data visualization with ggplot2
• L5: The pipe, filtering, and joining data tables
• L6: Split, apply, combine
• L7: Introduction to Bioconductor -omics classes (containers)
• L8: Data Wrangling Review and Practice

What is R?
• A language and statistical computing environment
• Open source, for and by scientists
• Widespread community
• Extended use through package installation
• R packages are collections of R functions, compiled code, and sample data
Why should we use R?
• Great for statistical analysis, data visualization, and report generation
• Supports large scale data analysis
• Removes some of the human error associated with Excel
• Ever growing community
• Many ways to get help
• Field specific packages and workflows
• Problems are “googlable”
Where do we find R packages?
• Comprehensive R Archive Network (CRAN)
• GitHub
• Bioconductor
• Check out METACRAN
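As a rough sketch of how each source is typically used, the block below shows one common install command per repository. The helper name `is_installed` and the example package names are illustrative assumptions, not part of the course materials, and the install commands are commented out so that nothing is downloaded when the block runs.

```r
# Hypothetical helper (our own name): TRUE if a package can be loaded
is_installed <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

# CRAN is the default repository for install.packages():
# install.packages("dplyr")

# Bioconductor packages are installed through BiocManager (itself on CRAN):
# install.packages("BiocManager")
# BiocManager::install("SummarizedExperiment")

# GitHub packages can be installed with the remotes package (also on CRAN):
# install.packages("remotes")
# remotes::install_github("tidyverse/dplyr")

# stats ships with R, so this check should succeed on any installation
is_installed("stats")
```

Checking before installing avoids reinstalling a package that is already available in your library.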
What is RStudio?
• An integrated development environment (IDE) for R
• Includes a console, code editor, and tools for plotting, history, debugging, and workspace management
• Open-source and can be installed locally or used through a browser (RStudio Server, Posit Cloud)
DNAnexus
• A cloud-based platform for next-generation sequencing analysis, for which CCR has a site license
• We will be using this platform to provide a uniform, stable,
preinstalled interface for R training.
• Uses RStudio Server
• Integrates course-notes
• R packages installed and ready to use
• The data ready to use and in one place; no need to download

Course registrants, please fill out this form with your DNAnexus
information.
Let’s take a tour of the RStudio IDE
Data Wrangling
Best Practices for data analysis
1. Keep raw data separate from analyzed data.
2. Keep spreadsheet data Tidy (or as tidy as possible)
3. Trust but Verify
--- From https://datacarpentry.org/genomics-r-intro/03-basics-factors-dataframes/index.html
What is tidy data?
**Having tidy data is useful but not always necessary. Do not worry about strict adherence to the rules. Your data should be in whatever format makes your life easier for analysis.**
Image from Lowndes and Horst 2020: Tidy Data for Efficiency, Reproducibility, and Collaboration
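In the tidyverse sense, tidy data puts each variable in its own column, each observation in its own row, and each value in its own cell. A minimal sketch of what that looks like in practice, assuming the tidyr package is installed; the table and its column names are made up for illustration:

```r
library(tidyr)

# Hypothetical "wide" table: one row per gene, one column per sample
wide <- data.frame(
  gene = c("geneA", "geneB"),
  sample1 = c(10, 0),
  sample2 = c(12, 3)
)

# Tidy ("long") form: one row per gene-sample observation,
# with the sample name and the count each in their own column
long <- pivot_longer(
  wide,
  cols = c(sample1, sample2),
  names_to = "sample",
  values_to = "count"
)
long
```

Neither layout is wrong; the long form is simply the one most tidyverse functions expect.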
Guidelines to keep spreadsheets tidy
• Be consistent
• Choose meaningful names for things; no spaces
• Write dates as YYYY-MM-DD
• No empty cells
• Put just one thing in a cell
• Don’t use font color or highlighting as data
• Save the data as plain text files
--- https://jhudatascience.org/tidyversecourse/intro.html
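A few of these guidelines can be illustrated in base R. This is only a sketch: the column names, dates, and file are hypothetical, and the CSV is written to a temporary file so nothing in your working directory is touched.

```r
# Hypothetical column names as they might arrive from a spreadsheet
raw_names <- c("Sample ID", "Collection Date", "read count")

# Meaningful names without spaces: replace whitespace with underscores
clean_names <- tolower(gsub("\\s+", "_", raw_names))
clean_names

# Dates written as YYYY-MM-DD parse without needing a format string
collected <- as.Date(c("2023-01-15", "2023-02-01"))
diff(collected)  # time elapsed between the two dates

# Save the data as a plain text (CSV) file
samples <- data.frame(sample_id = c("s1", "s2"), collection_date = collected)
out_file <- tempfile(fileext = ".csv")
write.csv(samples, out_file, row.names = FALSE)
```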
Tidyverse
• An opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. --- tidyverse.org
• Core packages:
• dplyr, ggplot2, forcats, tibble, readr, stringr, tidyr, and purrr
What is data wrangling?
• Data wrangling is a catch-all phrase for cleaning, transforming, and summarizing data
• The primary packages we will focus on for this purpose are tidyr and dplyr.
[Figure labels: Tidy, Transform, Summarize, Visualize & Model]
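To make the three verbs concrete, here is a small sketch that cleans, transforms, and summarizes a made-up table. It assumes the dplyr package is installed; the data and column names are hypothetical, and the calls are nested rather than piped because the pipe is covered in a later lesson.

```r
library(dplyr)

# Hypothetical measurements; the NA stands in for a failed measurement
counts <- data.frame(
  sample = c("s1", "s2", "s3", "s4"),
  group  = c("control", "control", "treated", "treated"),
  count  = c(10, NA, 30, 50)
)

# Clean: drop rows with a missing count
cleaned <- filter(counts, !is.na(count))

# Transform: add a log2-scaled column
transformed <- mutate(cleaned, log_count = log2(count))

# Summarize: one row per group with the mean count
summarized <- summarize(
  group_by(transformed, group),
  mean_count = mean(count)
)
summarized
```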
Getting Help
Stack Overflow and other forums
• Public Q&A platform
• Ex: https://stackoverflow.com/questions/65095565/make-some-sample-names-unique-according-to-conditions
Vignettes
• Ex, browseVignettes(package="dplyr")
Tutorials
Coursera
• JHU Tidyverse Skills for Data Science in R Specialization
• Introduction to the Tidyverse
• Importing Data in the Tidyverse
• Wrangling Data in the Tidyverse
• Visualizing Data in the Tidyverse
• Modeling Data in the Tidyverse
Dataquest
• Intro to data analysis in R
• Data Visualization in R
• Data Cleaning in R
Bioconductor tutorials / workflows
Many others…
• glittr

For a Coursera or Dataquest license go to https://bioinformatics.ccr.cancer.gov/btep/self-learning/


Course Materials
Materials for each lesson will be found at
https://btep.ccr.cancer.gov/docs/data-wrangle-with-r-2023/.

Course materials will be updated prior to each lesson.


BTEP
Other R courses
• R Introductory Series
• Data Visualization with R
Email us
• ncibtep@nih.gov
BTEP Coding Club
• Once a month
• Bioinformatics training tailored to the NCI community.
• 1-hour demo / tutorial of a bioinformatics tool, software, skill, or
platform.
• Ranges in experience level from beginner to advanced.
• Email us at ncibtep@nih.gov if there is a specific topic you would
like to see featured.

Check out past events here:


https://bioinformatics.ccr.cancer.gov/docs/btep-coding-club/
Helpful things to know before getting started
Terms to Know
• Function - code written to perform a specific task
• Example: getwd()
• String – a sequence of one or more characters
• Enclosed by quotation marks (single or double)
• Data frame – object that stores tabular data; all variables are of the same length
• Directory – location where files are stored
• Working directory – your current directory
• Package – the fundamental unit of shareable code, bundling together code, data, documentation, and tests. This is how we extend the use of R.
• Library – a directory of installed packages
• Example: library(dplyr) loads the dplyr package from your library
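A short sketch of several of these terms in base R. The object names and values are made up for illustration, and the stats package is used in place of dplyr only because it ships with every R installation.

```r
# Function: getwd() performs one task, returning the working directory
wd <- getwd()
is.function(getwd)
is.character(wd)   # the returned path is itself a string

# String: characters enclosed in quotation marks
gene <- "TP53"
nchar(gene)        # number of characters in the string

# Data frame: tabular data whose columns all have the same length
df <- data.frame(id = c("a", "b", "c"), value = c(1.5, 2.0, 3.5))
dim(df)            # number of rows, then number of columns

# Package: library() loads an installed package from your library
library(stats)
```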
Directory Structures
• A file path shows us the location of a file. These are nested structures.
• .libPaths()
• Will show us the location of installed R packages
• For example:
• [1] "/Library/Frameworks/R.framework/Versions/4.1/Resources/library"
• Absolute file path
• The complete file path
• Relative file path
• A shortcut path from some other directory
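The difference between the two kinds of path can be sketched in base R. The folder and file names here are hypothetical, and no file is actually read or created.

```r
# Where installed packages live (the output differs between machines)
.libPaths()

# A relative path is interpreted starting from the working directory
rel_path <- file.path("data", "counts.csv")   # hypothetical file

# Prepending the working directory gives the matching absolute path
abs_path <- file.path(getwd(), rel_path)

rel_path
abs_path
```

Relative paths keep scripts portable as long as the working directory is set consistently; absolute paths are unambiguous but tied to one machine.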
In summary
Today we…
• Learned about advantages of R and RStudio
• Navigated the RStudio environment (DNAnexus)
• Learned about concepts related to data wrangling
• Reviewed resources available for getting help
Next time…
• Get ready for some coding fun in RStudio
• Learn R basics
