Week2 Cheat Sheet Data Wrangling With Tidyverse

The document is a cheat sheet for data wrangling using the Tidyverse in R, detailing various commands and their syntax along with descriptions and examples. It covers package installation, data manipulation functions, handling missing values, data normalization, and visualization techniques. Additionally, it includes a changelog and authorship information.

Uploaded by

moonb4115

CheatSheet - Data Wrangling with Tidyverse

Commands Syntax Description Example


install package
  Syntax: install.packages("packagename")
  Description: install.packages() downloads packages from a repository such as CRAN and installs them into the local R library.
  Example: install.packages("tidyverse")
load package
  Syntax: library(packagename)
  Description: library() loads an installed package from the R library.
  Example: library(tidyverse)
download.file
  Syntax: download.file(url, destfile, method, quiet = FALSE, mode = "w", cacheOK = TRUE, headers = NULL, ...)
  Description: download.file() downloads a file and saves it locally.
    url: a character string naming the URL of a resource to be downloaded.
    destfile: a character string with the name where the downloaded file is saved.
  Example: download.file(url, destfile = "lax_to_jfk.tar.gz")
untar
  Syntax: untar(tarfile)
  Description: untar(), from the utils package, extracts files from a tar archive.
  Example: untar("lax_to_jfk.tar.gz")
read_csv
  Syntax: read_csv(file)
  Description: read_csv() reads a CSV file using the readr package.
  Example: read_csv("lax_to_jfk/lax_to_jfk.csv")
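
The untar and read_csv steps above can be tried without a network connection. The cheat sheet does not give the download URL, so this sketch builds a small archive locally instead; the names toy_flights and work_dir and the data values are made up for illustration, and the readr package is assumed to be installed.

```r
library(readr)

# Work in a temporary directory so nothing is left behind (hypothetical names).
work_dir <- tempfile("wrangle_")
dir.create(work_dir)
old_wd <- setwd(work_dir)

# Build a tiny CSV and pack it into a .tar.gz, standing in for the downloaded file.
dir.create("toy_flights")
write_csv(
  data.frame(arrdelay = c(5, -2), carrierdelay = c(NA, 3)),
  "toy_flights/toy_flights.csv"
)
tar("toy_flights.tar.gz", files = "toy_flights", compression = "gzip")
unlink("toy_flights", recursive = TRUE)

# Extract the archive and read the CSV, as the cheat sheet does.
untar("toy_flights.tar.gz")
flights <- read_csv("toy_flights/toy_flights.csv", show_col_types = FALSE)

setwd(old_wd)
```

With a real archive, the first half would be replaced by a single download.file() call.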
Missing Values and Formatting
is.na
  Syntax: is.na(x)
  Description: is.na(x) returns a logical vector that is TRUE where the corresponding element of x is NA and FALSE otherwise.
  Example: is.na(c(1, NA)) # FALSE TRUE
anyNA
  Syntax: anyNA(x, recursive = FALSE)
  Description: anyNA() returns TRUE if x contains any NAs and FALSE otherwise.
  Example: anyNA(c(1, NA)) # TRUE
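sum
  Syntax: sum(object)
  Description: sum() returns the sum of its arguments. Applied to a logical vector such as is.na(x), it counts the TRUE values.
  Example: sum(is.na(sub_airline$carrierdelay))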
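
As a quick check of how is.na(), anyNA() and sum() fit together, here is a minimal base R sketch on a made-up vector (x and its values are illustrative, not from the airline data).

```r
# A toy vector with two missing values.
x <- c(1, NA, 3, NA)

missing_flags <- is.na(x)    # one logical flag per element
has_missing <- anyNA(x)      # a single TRUE/FALSE for the whole vector
n_missing <- sum(is.na(x))   # TRUE counts as 1, so this counts the NAs
```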
summarize
  Syntax: summarize(.data, ...)
  Description: dplyr's summarize() reduces a data frame to a summary of just one vector or value.
  Example: sub_airline %>% summarize(count = sum(is.na(carrierdelay)))
  Note: the longer signature summarize(X, by, FUN, ..., stat.name = deparse(substitute(X)), type = c('variables', 'matrix'), subset = TRUE, keepcolnames = FALSE) belongs to the Hmisc function of the same name, not to dplyr. Its arguments are:
    X: a vector or matrix capable of being operated on by the function specified as the FUN argument.
    by: one or more stratification variables. If a single variable, by may be a vector; otherwise it should be a list.
    FUN: a function of a single vector argument, used to create the statistical summaries for summarize. FUN may compute any number of statistics.
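
A minimal sketch of counting missing values with dplyr's summarize(), assuming dplyr is installed. The data frame toy_airline and its values are made up; only the column name carrierdelay comes from the cheat sheet.

```r
library(dplyr)

# Toy stand-in for the airline data (hypothetical values).
toy_airline <- tibble(
  carrierdelay = c(5, NA, 12, NA),
  weatherdelay = c(NA, 0, 3, 1)
)

# Collapse the whole data frame to a single row holding the NA count.
na_count <- toy_airline %>%
  summarize(count = sum(is.na(carrierdelay)))
```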
map
  Syntax: map(.x, .f, ...)
  Description: The map() functions transform their input by applying a function to each element and returning an object the same length as the input. map() itself always returns a list.
  Example: map(sub_airline, ~sum(is.na(.)))
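
To see what map() returns when applied to a data frame, the following sketch counts NAs per column, assuming purrr is installed. toy_airline is a made-up data frame, and map_int() is shown as one typed alternative when an integer vector is more convenient than a list.

```r
library(purrr)

# Toy stand-in for sub_airline (hypothetical values).
toy_airline <- data.frame(
  carrierdelay = c(5, NA, 12, NA),
  weatherdelay = c(NA, 0, 3, 1)
)

# A data frame is a list of columns, so the function runs once per column.
na_per_column <- map(toy_airline, ~ sum(is.na(.)))

# Same computation, returned as a named integer vector instead of a list.
na_per_column_int <- map_int(toy_airline, ~ sum(is.na(.)))
```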
dim
  Syntax: dim(object)
  Description: dim() returns the dimensions of a matrix, array, or data frame.
  Example: dim(sub_airline)
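drop_na
  Syntax: drop_na(data, ...)
  Description: drop_na() drops rows containing missing values in the named columns, or in any column if none are named.
  Example: sub_airline %>% drop_na(carrierdelay)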
replace_na
  Syntax: replace_na(data, replace, ...)
  Description: replace_na() replaces missing values.
    data: a data frame or vector.
    replace: if data is a data frame, a named list giving the value to replace NA with for each column. If data is a vector, a single value used for replacement.
  Example: sub_airline %>% replace_na(list(carrierdelay = 0, weatherdelay = 0, nasdelay = 0, securitydelay = 0, lateaircraftdelay = 0))
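
The two ways of handling missing values described above, dropping rows and filling in a value, can be compared side by side. This is a sketch on a made-up data frame, assuming tidyr is installed; only the column names come from the cheat sheet.

```r
library(tidyr)

# Toy stand-in for the airline data (hypothetical values).
toy_airline <- data.frame(
  carrierdelay = c(5, NA, 12),
  weatherdelay = c(NA, 0, 3)
)

# Drop only the rows where carrierdelay is missing.
dropped <- drop_na(toy_airline, carrierdelay)

# Keep every row and replace NA with 0 in the listed columns.
filled <- replace_na(toy_airline, list(carrierdelay = 0, weatherdelay = 0))
```

Dropping shrinks the data set, while filling keeps its size but changes any statistic computed on the filled columns.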
mean
  Syntax: mean(x, na.rm = FALSE)
  Description: mean() calculates the arithmetic mean of the elements of the numeric vector passed to it. Set na.rm = TRUE to ignore missing values.
  Example: mean(drop_na_rows$carrierdelay)
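
The effect of the na.rm argument is easiest to see on a small vector. The values below are made up for illustration.

```r
carrierdelay <- c(4, NA, 10)

# With the default na.rm = FALSE, a single NA makes the result NA.
mean_with_na <- mean(carrierdelay)

# With na.rm = TRUE, the NA is dropped first: (4 + 10) / 2 = 7.
mean_without_na <- mean(carrierdelay, na.rm = TRUE)
```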
mutate, mutate_all, mutate_if
  Syntax: mutate(data, ...)
  Description: mutate() adds new columns to a data frame or modifies existing ones. mutate_all() applies a function to every column, and mutate_if() applies it only to columns that satisfy a predicate.
  Example: date_airline %>% select(year, month, day) %>% mutate_all(type.convert) %>% mutate_if(is.character, as.numeric)
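
A short sketch of mutate() and mutate_if() on a made-up tibble, assuming dplyr is installed. The column delayed and the data values are illustrative only.

```r
library(dplyr)

# Toy data: year is stored as text, arrdelay as a number (hypothetical values).
toy_airline <- tibble(
  year = c("2003", "2004"),
  arrdelay = c(10, -4)
)

result <- toy_airline %>%
  mutate(delayed = arrdelay > 0) %>%       # add a new logical column
  mutate_if(is.character, as.numeric)      # convert only the character columns
```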
Data Normalization
Simple scaling
  Syntax: xnew = xold / xmax
  Description: Simple scaling divides each value by the maximum value in a feature. For non-negative data the new range is between 0 and 1.
  Example: sub_airline$arrdelay / max(sub_airline$arrdelay)
Min-max
  Syntax: xnew = (xold - xmin) / (xmax - xmin)
  Description: Min-max subtracts the minimum value from the original and divides by the maximum minus the minimum. The minimum becomes 0 and the maximum becomes 1.
  Example: (sub_airline$arrdelay - min(sub_airline$arrdelay)) / (max(sub_airline$arrdelay) - min(sub_airline$arrdelay))
Z-score
  Syntax: xnew = (xold - μ) / σ
  Description: Standardization (Z-score) subtracts the mean (μ) of the feature and divides by the standard deviation (σ).
  Example: (sub_airline$arrdelay - mean(sub_airline$arrdelay)) / sd(sub_airline$arrdelay)
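
The three normalization methods can be compared on one small vector in base R. The values of arrdelay below are made up, and a negative delay is included on purpose to show that simple scaling only stays within 0 to 1 for non-negative data.

```r
# Toy arrival delays in minutes (hypothetical values).
arrdelay <- c(-10, 0, 20, 50)

# Simple scaling: divide by the maximum.
simple_scaled <- arrdelay / max(arrdelay)

# Min-max: shift by the minimum, then divide by the range.
min_max <- (arrdelay - min(arrdelay)) / (max(arrdelay) - min(arrdelay))

# Z-score: centre on the mean, then divide by the standard deviation.
z_score <- (arrdelay - mean(arrdelay)) / sd(arrdelay)
```

Here simple_scaled starts at -0.2, while min_max runs exactly from 0 to 1 and z_score has mean 0 and standard deviation 1.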
Binning Data
ggplot
  Syntax: ggplot(df, aes(x, y, other aesthetics))
  Description: ggplot2 is a plotting package that makes it simple to create complex plots from data in a data frame.
  Example: ggplot(data = sub_airline, mapping = aes(x = arrdelay)) + geom_histogram(bins = 100, color = "white", fill = "red")
ntile
  Syntax: ntile(x, n)
  Description: ntile() divides the values of x into n bins of roughly equal size and returns each value's bin rank.
  Example: sub_airline %>% mutate(quantile_rank = ntile(arrdelay, 4))
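
A minimal sketch of binning with ntile(), assuming dplyr is installed. The tibble toy_airline and its eight delay values are made up so that the four bins come out with two rows each.

```r
library(dplyr)

# Eight toy arrival delays, already sorted for readability (hypothetical values).
toy_airline <- tibble(arrdelay = c(-5, 0, 3, 8, 15, 22, 40, 90))

# Assign each row to one of 4 roughly equal-sized bins, ranked by arrdelay.
binned <- toy_airline %>%
  mutate(quantile_rank = ntile(arrdelay, 4))
```

Inside mutate() the column can be referred to by name, so sub_airline$arrdelay is not needed.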
geom_histogram
  Syntax: geom_histogram(...)
  Description: geom_histogram() displays the counts with bars. It is added as a layer to a ggplot() call.
  Example: ggplot(sub_airline, aes(x = arrdelay)) + geom_histogram(bins = 4, color = "white", fill = "red")
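
The ggplot() and geom_histogram() entries combine into a single plot object. This sketch assumes ggplot2 is installed and uses randomly generated delays in a made-up data frame toy_airline rather than the real airline data.

```r
library(ggplot2)

# Simulated arrival delays (hypothetical data, fixed seed for repeatability).
set.seed(1)
toy_airline <- data.frame(arrdelay = rnorm(200, mean = 10, sd = 30))

# Build the plot: the aesthetics map arrdelay to x, the layer draws 4 bins.
p <- ggplot(data = toy_airline, mapping = aes(x = arrdelay)) +
  geom_histogram(bins = 4, color = "white", fill = "red")

# In an interactive session, print(p) draws the histogram.
```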
Indicator variable
spread
  Syntax: spread(data, key, value)
  Description: spread() spreads a key-value pair across multiple columns.
    data: your data frame of interest.
    key: the column whose values will become variable names.
    value: the column whose values will fill in under the new variables created from key.
  Example: sub_airline %>% spread(reporting_airline, arrdelay)
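
One way to turn spread() into an indicator-variable step is to spread a constant column and fill the gaps with 0. This is a sketch under assumptions, not necessarily the course's exact method: tidyr is assumed to be installed, and toy_airline, flight_id and dummy are made-up names.

```r
library(tidyr)

# Toy data with a row identifier so each row stays distinct (hypothetical values).
toy_airline <- data.frame(
  flight_id = 1:3,
  reporting_airline = c("AA", "DL", "AA"),
  arrdelay = c(5, -2, 10)
)

# Add a constant column, then spread it: one 0/1 column per airline.
toy_airline$dummy <- 1
indicators <- spread(toy_airline, reporting_airline, dummy, fill = 0)
```

Each airline code becomes its own column holding 1 where the row belongs to that airline and 0 elsewhere.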
slice
  Syntax: slice(data, num1:num5)
  Description: slice() selects rows by their position.
  Example: sub_airline %>% slice(1:5)
factor
  Syntax: factor(x)
  Description: factor() encodes a vector as a factor. If the argument ordered is TRUE, the factor levels are assumed to be ordered.
  Example: sub_airline %>% mutate(reporting_airline = factor(reporting_airline, labels = c("aa", "as", "dl", "ua", "b6", "pa (1)", "hp", "tw", "vx")))
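
How labels and ordered behave is easier to see on a short vector. The following base R sketch uses made-up airline codes and size categories.

```r
airline <- c("DL", "AA", "UA", "AA")

# Levels are the sorted unique values: "AA", "DL", "UA".
airline_factor <- factor(airline)

# labels are matched to the levels in that sorted order, not to the data order.
relabelled <- factor(airline, labels = c("aa", "dl", "ua"))

# With ordered = TRUE the levels carry an order, so comparisons work.
size <- factor(c("low", "high", "mid"),
               levels = c("low", "mid", "high"),
               ordered = TRUE)
low_is_below_high <- size[1] < size[2]
```

Because labels follow the sorted levels, a labels vector in a different order silently mislabels the data.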

Author(s)
D.M. Naidu
Changelog
Date Version Changed by Change Description
2023-05-11 1.1 Eric Hao & Vladislav Boyko Updated Page Frames
2020-08-11 1.0 D.M. Naidu Initial Version
