Ashish Srivastava r Lab File
PRACTICAL FILE
SUBMITTED BY:
Ayush Kumar
Roll No. I-24028
Section-I MBA(BA)
SUBMITTED TO:
Dr Shruti Traymbak
Associate Professor
INDEX
1. Introduction to R, Introduction to RStudio, R data types (vectors, lists, matrices, data frames, and factors), basic R commands.
2. Installation of RStudio, loading and exploring datasets, sub-setting and filtering data, renaming and reordering columns.
3. Introduction to libraries in RStudio: dplyr, tidyr and tidyverse. Creating new variables using the dplyr package.
4. Introduction to reshaping data and using pivoting (long to wide and vice versa) to reshape data.
5. Introduction to data manipulation using the tidyr package (e.g., gather, spread, separate, unite).
6. Introduction to ggplot2: create visualizations using the popular R package ggplot2.
7. Introduction to scatter plots, line plots, bar charts, histograms, and boxplots. Create different types of plots with the help of ggplot2 (scatter plots, line plots, bar charts, histograms, and boxplots).
8. Introduction to Enhancing Visuals: Customizing Titles, Labels, Colors, and Themes
Building Complexity: Adding Layers with Geoms
Breaking Down Data: Faceting for Multivariable Insights
Preserving Work: Saving and Exporting Plots
9. Introduction to advanced visualization techniques like creating multi-panel plots, and handling overplotting using transparency and jittering.
10. Introduction to joins in R, different types of joins (inner, left, right, full), merging datasets using dplyr functions (inner_join, left_join, etc.).
11. Introduction to hypothesis testing, types of t-tests (one-sample, independent two-sample, paired sample).
12. Assumptions of t-tests, conducting t-tests in R, interpreting results, and understanding p-values and confidence intervals.
Introduction to R
R programming is a leading tool for machine learning, statistics, and data analysis, allowing for
the easy creation of objects, functions, and packages. Designed by Ross Ihaka and Robert Gentleman
at the University of Auckland and developed by the R Development Core Team, R Language is
platform-independent and open-source, making it accessible for use across all operating systems
without licensing costs. Beyond its capabilities as a statistical package, R integrates with other
languages like C and C++, facilitating interaction with various data sources and statistical tools.
With a growing community of users and high demand in the Data Science job market, R is one of
the most sought-after programming languages today. Originating as an implementation of the S
programming language with influences from Scheme, R has evolved since its conception in 1992,
with version 1.0.0, its first stable release, issued in 2000.
The R Language stands out as a powerful tool in the modern era of statistical computing and data
analysis. Widely embraced by statisticians, data scientists, and researchers, the R Language offers
an extensive suite of packages and libraries tailored for data manipulation, statistical modeling, and
visualization. This file covers the features, benefits, and applications of the R
programming language and shows why it has become an indispensable asset for data-driven
professionals across various industries.
The core R language is augmented by a large number of extension packages, containing reusable
code, documentation, and sample data.
R is free and open-source software. It is part of the GNU Project and is available
under the GNU General Public License. It is written primarily in C, Fortran, and R
itself. Precompiled executables are provided for various operating systems.
Features of R Programming Language
1. Open-Source
You don’t have to pay any money to download R on your computer. It is free and open-source
software. Furthermore, you can contribute towards the development of R, customize its packages,
and add more features.
6. Data Wrangling
The process of cleansing large and inconsistent data sets in order to facilitate computation and
further analysis is known as data wrangling. This is a time-consuming process. R's broad tool
collection can be utilized for database management and wrangling.
7. No Compilation
The R language is interpreted rather than compiled. As a result, no separate compiler is required to turn
code into an executable program. The R interpreter evaluates the code step by step as it runs.
This removes the compile step that would otherwise come before running an R script.
Additionally, it may be integrated with programs written in FORTRAN, C, C++, Java, and Python,
among other computer languages.
Programming in R:
Since R is syntactically similar to other widely used languages, it is easy to learn and to code
in. R programs can be written in any of the widely used IDEs, such as RStudio, Rattle, and Tinn-R.
After writing the program, save the file with the extension .r.
Applications of R
R is widely used in data science. It offers a wide range of statistics-related libraries and an
environment for statistical computation and design. Many quantitative analysts use R as their
programming language, since it helps with data import and cleansing. In environmental science,
R is used to analyze and simulate environmental, climate, and ecological data. R is used by a
large number of data analysts and research programmers, and it serves as a fundamental tool
in finance.
-Click on the "install R for the first time" link to download the R executable (.exe) file.
-Run the R executable file to start the installation, and allow the app to make changes to your device.
-Follow the installation instructions.
R has now been successfully installed on your Windows OS. Open the RGui to start writing R code.
DATA TYPES IN R
DATA STRUCTURES IN R
VECTORS
A vector is simply a list of items that are of the same type. To combine the list of items into a vector, use
the c() function and separate the items by a comma.
Using c() function
Using seq() function
Using rep() function
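The three ways of building a vector named above can be sketched as follows. The variable names and values are made up for illustration and are not taken from the original lab screenshots.

```r
# Three ways to build a vector (illustrative values only)
marks <- c(72, 85, 91)                 # c() combines items of the same type
evens <- seq(2, 10, by = 2)            # seq() generates a regular sequence
flags <- rep(c("a", "b"), times = 2)   # rep() repeats a vector

print(marks)
print(evens)
print(flags)

class(marks)    # type of the elements
length(evens)   # number of elements
```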
MATRIX
Matrices are the R objects in which the elements are arranged in a two dimensional rectangular
layout.
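A minimal sketch of creating and indexing a matrix, assuming the numbers 1 to 6 as sample data:

```r
# Fill a 2 x 3 matrix column by column (R's default order)
m <- matrix(1:6, nrow = 2, ncol = 3)
print(m)

dim(m)     # number of rows and columns
m[2, 3]    # element in row 2, column 3
t(m)       # transpose, giving a 3 x 2 matrix
```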
FACTOR
A factor is a way to store categorical data in R. It groups values into categories called levels.
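As a small illustration, the sketch below turns a character vector into a factor with explicitly ordered levels. The size labels are hypothetical sample data.

```r
# Categorical data stored as a factor with chosen level order
sizes <- c("small", "large", "medium", "small", "large")
size_factor <- factor(sizes, levels = c("small", "medium", "large"))

levels(size_factor)       # the categories
table(size_factor)        # count of values per level
as.integer(size_factor)   # internal integer codes of each value
```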
DATA FRAME
A data frame is like a table. It is used to store data in rows and columns, where each column can have
a different data type.
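The sketch below builds a small data frame whose columns hold character, numeric, and logical data. The student records are invented for illustration.

```r
# Each column has its own data type
students <- data.frame(
  name   = c("Asha", "Ravi", "Meera"),
  age    = c(23, 25, 24),
  passed = c(TRUE, FALSE, TRUE)
)

str(students)                 # structure: column names and types
students$age                  # one column as a vector
students[students$passed, ]   # rows where passed is TRUE
```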
LIST
A list in R is a collection of objects and it can hold different data types together, unlike vectors that
hold only one type.
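A short sketch of a list holding three different types at once. The element names and values are hypothetical.

```r
# A list can mix a string, a numeric vector, and a logical value
record <- list(name = "Asha", scores = c(72, 85, 91), passed = TRUE)

record$name              # access an element by name
record[["scores"]][2]    # second score inside the list
length(record)           # number of elements in the list
```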
dplyr Package in R
The dplyr package is one of the most popular and powerful tools in R for data manipulation.
It provides a set of simple and intuitive functions for data transformation using a "grammar of data
manipulation."
select()
• Purpose: Choose specific columns from a dataset.
• Syntax: select(data, column1, column2, ...)
filter()
• Purpose: Filter rows based on conditions.
• Syntax: filter(data, condition)
mutate()
• Purpose: Create or modify columns.
• Syntax: mutate(data, new_column = expression)
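The three verbs above can be tried on R's built-in mtcars dataset, as in the sketch below. It assumes the dplyr package is installed. The column name wt_kg is made up for this example.

```r
library(dplyr)

# select(): keep only three columns
cars <- select(mtcars, mpg, cyl, wt)

# filter(): keep rows where mileage is above 25 mpg
efficient <- filter(cars, mpg > 25)

# mutate(): add a new column (wt is recorded in units of 1000 lbs)
with_kg <- mutate(cars, wt_kg = wt * 453.592)

head(with_kg)
```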
arrange()
• Purpose: Sort rows by one or more columns.
• Syntax: arrange(data, column1, desc(column2))
summarise()
• Purpose: Summarize data by applying functions like mean, sum, etc.
• Syntax: summarise(data, summary = function(column))
group_by()
• Purpose: Group data by one or more variables.
• Syntax: group_by(data, column)
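A minimal sketch of sorting, grouping, and summarising, again on the built-in mtcars dataset and assuming dplyr is installed. The summary column names mean_mpg and n_cars are chosen only for this example.

```r
library(dplyr)

# arrange(): sort by cylinders, then by mileage in descending order
sorted <- arrange(mtcars, cyl, desc(mpg))

# group_by() + summarise(): one summary row per cylinder count
by_cyl <- group_by(mtcars, cyl)
summary_tbl <- summarise(by_cyl, mean_mpg = mean(mpg), n_cars = n())

print(summary_tbl)
```

group_by() on its own does not change the data. It only marks the groups that summarise() then works on.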
Tidyr and tidyverse package in R
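Tidy data follows a standard structure: each variable is a column, each observation is a row, and each type of
observational unit forms a table. This standard structure helps ensure compatibility with data manipulation and visualization tools.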
Importance of tidyr package
• Consistency- Tidy data ensures compatibility with R's tidyverse tools, allowing smooth workflows.
• Readability- A well-organized dataset is easier to understand, troubleshoot, and share.
• Flexibility- With tidy data, you can quickly switch between different tools for manipulation,
visualization, or modeling.
• Error Reduction- Clean, structured data minimizes the risk of errors during analysis.
• Seamless Integration- Tidyverse functions are designed to work together. For example, you can
use tidyr to clean data and pass it directly to ggplot2 for visualization.
Advantages of tidyr
Common functions in tidyr -
separate()
• splits one column into multiple columns
• Syntax- separate(data, col, into, sep)
unite()
• The unite() function combines multiple columns into a single column, using a specified delimiter.
• It is useful for creating a single identifier or combining related information.
• Syntax- unite(data, col, ..., sep = "delimiter")
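A small sketch of separate() and unite() working as inverses of each other, assuming the tidyr package is installed. The data frame and its column names are hypothetical.

```r
library(tidyr)

people <- data.frame(full_name = c("Asha_Verma", "Ravi_Singh"))

# separate(): split one column into two at the underscore
split_names <- separate(people, col = full_name,
                        into = c("first_name", "last_name"), sep = "_")

# unite(): combine the two columns again, this time with a space
rejoined <- unite(split_names, col = "full_name",
                  first_name, last_name, sep = " ")

print(split_names)
print(rejoined)
```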
• In the tidyr package, the gather() and spread() functions are used to reshape data. They support
the tidy data principles, where each variable is a column, each observation is a row, and each type of
observational unit forms a table.
• Although these functions have been replaced by pivot_longer() and pivot_wider() in modern
versions of tidyr, they are still worth learning for legacy code.
gather()
• The gather() function is used to convert data from wide format to long format.
• Syntax- gather(data, key, value, ..., na.rm = FALSE, factor_key = FALSE)
spread()
• The spread() function is used to convert data from long format to wide format.
• Syntax- spread(data, key, value)
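The round trip from wide to long and back can be sketched as below, assuming tidyr is installed. The marks table is invented sample data.

```r
library(tidyr)

# Wide format: one column per subject
wide <- data.frame(student = c("A", "B"),
                   maths   = c(80, 65),
                   science = c(75, 90))

# gather(): stack the subject columns into key/value pairs
long <- gather(wide, key = "subject", value = "marks", maths, science)

# spread(): turn the key column back into separate columns
back_to_wide <- spread(long, key = subject, value = marks)

print(long)
print(back_to_wide)
```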
What is pivot_longer() and pivot_wider() in R?
In R, pivot_longer() and pivot_wider() are functions from the tidyr package used to reshape data.
They support the tidy data principles: each variable in a column, each observation in a row, and
one type of observational unit per table.
pivot_longer()
• Converts wide format data to long format.
• Combines multiple columns into two columns: one for the variable names and one for the
corresponding values.
• Syntax- pivot_longer(data, cols, names_to, values_to)
pivot_wider()
• Converts long format data to wide format.
• Expands rows into columns: a specific column's values become column names, and another column
provides the values to populate these columns.
• Syntax- pivot_wider(data, names_from, values_from)
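A minimal sketch of both pivot functions, assuming tidyr is installed. The quarterly sales table and the names quarter and revenue are hypothetical.

```r
library(tidyr)

# Wide format: one column per quarter
sales <- data.frame(region = c("North", "South"),
                    q1 = c(100, 150),
                    q2 = c(120, 130))

# Wide to long: quarter names go to one column, values to another
long <- pivot_longer(sales, cols = c(q1, q2),
                     names_to = "quarter", values_to = "revenue")

# Long to wide: quarter values become column names again
wide_again <- pivot_wider(long, names_from = quarter, values_from = revenue)

print(long)
print(wide_again)
```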
DATA VISUALIZATION IN R
The ggplot2 package in R is one of the most popular and versatile tools for creating high-quality
visualizations. It is based on the grammar of graphics, where you build plots layer by layer by
specifying data, aesthetics, and geometry.
➢ Basic Template
• Syntax- ggplot(data, aes(x = variable1, y = variable2)) + geom_type() + additional_layers()
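One possible instance of the template, assuming ggplot2 is installed and using the built-in mtcars dataset. The title and axis labels are chosen only for illustration.

```r
library(ggplot2)

# data + aesthetics, then a geometry layer, then additional layers
p <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  labs(title = "Mileage vs horsepower",
       x = "Horsepower", y = "Miles per gallon") +
  theme_minimal()

# Typing p at the console (or calling print(p)) draws the plot
```

Each `+` adds one layer, so a plot can be built up and adjusted step by step.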
A. Scatter Plot
• A scatter plot is a graphical representation of the relationship between two continuous variables. It
is often used to identify patterns, trends, and possible correlations between the variables.
• In R, scatter plots are simple to create using base R or advanced packages like ggplot2 for enhanced
visualization.
• Axes: The x-axis represents one variable, and the y-axis represents the other variable.
• Points: Each point in the plot represents an observation from the dataset.
Fig. Basic Scatter Plot using base R
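A sketch of a scatter plot with ggplot2 on the built-in mtcars dataset, assuming ggplot2 is installed. The choice of car weight and mileage as the two variables is only an example. The base R equivalent would be plot(mtcars$wt, mtcars$mpg).

```r
library(ggplot2)

# One point per car: weight on the x-axis, mileage on the y-axis
scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

# Typing scatter at the console (or calling print(scatter)) draws the plot
```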
B. Line Plot
• A line plot is a type of graph used to visualize data points connected by straight lines. It is
particularly useful for displaying trends over time or continuous data.
• Line plots are widely used in time series analysis, where the x-axis represents time, and the y-axis
represents a numerical value. They help in visualizing patterns, trends, and fluctuations over time.
• In R, line plots can be created using the base plotting system or more advanced plotting systems
like ggplot2.
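A time series line plot can be sketched with the economics dataset that ships with ggplot2. The sketch assumes ggplot2 is installed. The unemploy column is picked only as an example series.

```r
library(ggplot2)

# Time on the x-axis, a numerical value on the y-axis
line_plot <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line() +
  labs(x = "Date", y = "Unemployed (thousands)")

# Typing line_plot at the console (or calling print(line_plot)) draws the plot
```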
C. Bar Chart
• A bar chart (also known as a bar plot) is a graphical representation of categorical data using
rectangular bars. The length or height of each bar is proportional to the value or frequency of the
category it represents. Bar charts are widely used for comparing quantities across different
categories.
• In R, you can create bar charts using both base R and the ggplot2 package. Bar charts are
particularly useful for visualizing the distribution of categorical data or comparing different groups.
Versatile: Bar charts can display counts, percentages, or other numerical values for each category.
Customization: With R, especially ggplot2, bar charts can be highly customized for various
visualization needs.
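As an illustration, the sketch below counts the cars in mtcars by number of cylinders and draws one bar per category. It assumes ggplot2 is installed.

```r
library(ggplot2)

# Frequency of each category, computed directly
cyl_counts <- table(mtcars$cyl)
print(cyl_counts)

# geom_bar() counts the rows in each category and draws one bar per category
bar_plot <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar() +
  labs(x = "Number of cylinders", y = "Number of cars")

# Typing bar_plot at the console (or calling print(bar_plot)) draws the plot
```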
jitter() Function
In R programming, jittering means adding a small amount of random noise to a numeric vector.
The jitter() function does this, and plotting the result shows overlapping points spread slightly apart.
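A minimal sketch of jittering, first on a plain vector with base R's jitter() and then inside a plot with ggplot2's geom_jitter(). The noise amounts and the use of the mpg dataset from ggplot2 are assumptions made for this example.

```r
library(ggplot2)

set.seed(42)   # make the random noise reproducible

# Base R: add uniform noise of at most 0.1 to each value
x <- c(1, 1, 2, 2, 3, 3)
x_jittered <- jitter(x, amount = 0.1)
print(x_jittered)

# ggplot2: spread overlapping points horizontally
jitter_plot <- ggplot(mpg, aes(x = cyl, y = hwy)) +
  geom_jitter(width = 0.2, height = 0)

# Typing jitter_plot at the console (or calling print(jitter_plot)) draws the plot
```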
Transparency
The alpha() function that comes built in with the ggplot2 package, or the alpha argument of a geom,
specifies the transparency that should be used for points on a ggplot2 scatterplot.
Note that alpha takes a value between 0 and 1. A value
of 0 makes the points completely transparent, while a value of 1 makes the points
completely opaque.
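A sketch of using transparency against overplotting, assuming ggplot2 is installed and using its mpg dataset. The alpha value of 0.3 is an arbitrary choice for illustration.

```r
library(ggplot2)

# Semi-transparent points: darker areas show where many points overlap
alpha_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.3)

# Typing alpha_plot at the console (or calling print(alpha_plot)) draws the plot
```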
MEAN, MEDIAN and MODE
Mean
It is calculated by taking the sum of the values and dividing by the number of values in a data
series. The function mean() is used to calculate this in R.
Median
The middle value in a sorted data series is called the median. The median() function is used in R to
calculate this value.
Mode
The mode is the value that has the highest number of occurrences in a set of data. Unlike the mean and
median, the mode can be found for both numeric and character data. R does not have a standard built-in function
to calculate the mode, so we create a user function to calculate the mode of a data set in R. This function
takes a vector as input and gives the mode value as output.
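The three measures can be sketched as below. The function name get_mode and the sample values are made up for this example. When several values are tied for the highest count, this version returns the one that appears first.

```r
values <- c(2, 4, 4, 6, 9)

mean(values)     # 5
median(values)   # 4

# User-defined mode: the value with the highest number of occurrences
get_mode <- function(v) {
  unique_values <- unique(v)
  counts <- tabulate(match(v, unique_values))   # occurrences of each unique value
  unique_values[which.max(counts)]
}

get_mode(values)                     # 4
get_mode(c("red", "blue", "red"))    # "red"
```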
VARIANCE
The var() function in R computes the sample variance of a vector. Variance measures how far the
values are spread around the mean.
STANDARD DEVIATION
The sd() function is used to compute the standard deviation of given values in R. It is the square root of
the variance.
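A short sketch that computes both measures and checks the square-root relationship between them. The data values are illustrative only.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

sample_var <- var(x)   # sample variance: sum of squared deviations / (n - 1)
sample_sd  <- sd(x)    # standard deviation

# The same variance computed by hand
manual_var <- sum((x - mean(x))^2) / (length(x) - 1)

all.equal(sample_var, manual_var)        # TRUE
all.equal(sample_sd, sqrt(sample_var))   # TRUE
```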
Introduction to joins in R
Joins are a fundamental operation to combine data from multiple datasets based on common
columns. R, with its powerful data manipulation capabilities, offers several methods to perform joins
effectively.
Inner Join:
Retains only the rows that have matching values in both datasets. Think of it as the intersection of two sets.
Left Join:
Retains all rows from the left dataset and adds the matching rows from the right. Rows in the left
dataset with no match in the right dataset get NA values in the right dataset's columns.
Right Join:
A right join in R returns all rows from the right (second) data frame, and the matching rows from
the left (first) data frame. If a row in the right data frame has no match in the left data frame, it is
still included, with NA values for the left data frame's columns.
Semi Join:
Returns all rows from the left table where there is a match in the right table.
Anti Join:
Returns all rows from the left table where there is no match in the right table.
Full Join:
A full join in R combines two data frames, retaining all rows from both data frames. If a row doesn't
have a match in the other data frame, it's included with missing values (NA) for the columns from
the other data frame.
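The six joins can be compared on two small tables, as in the sketch below. It assumes dplyr is installed. The employees and departments tables and their columns are hypothetical.

```r
library(dplyr)

employees   <- data.frame(id = c(1, 2, 3), name = c("Asha", "Ravi", "Meera"))
departments <- data.frame(id = c(2, 3, 4), dept = c("Finance", "HR", "IT"))

inner <- inner_join(employees, departments, by = "id")   # ids 2 and 3
left  <- left_join(employees, departments, by = "id")    # ids 1, 2, 3; dept is NA for id 1
right <- right_join(employees, departments, by = "id")   # ids 2, 3, 4; name is NA for id 4
full  <- full_join(employees, departments, by = "id")    # ids 1, 2, 3, 4
semi  <- semi_join(employees, departments, by = "id")    # employees with a match
anti  <- anti_join(employees, departments, by = "id")    # employees without a match

print(full)
```

semi_join() and anti_join() only filter the left table, so their results contain no columns from the right table.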
Hypothesis Testing
Hypothesis testing is a statistical method used to determine whether a hypothesis about a population
parameter is likely to be true. It involves collecting sample data, analyzing it, and making inferences
about the population.
1. Formulate Hypotheses:
o Null Hypothesis (H₀): A statement of no effect or no difference.
o Alternative Hypothesis (H₁): A statement that contradicts the null hypothesis.
2. Set the Significance Level (α): This is the probability of rejecting the null hypothesis when it's
actually true (Type I error). Common values are 0.05 and 0.01.
4. Calculate the Test Statistic: This is a numerical value that summarizes the sample data and is used
to assess the evidence against the null hypothesis.
5. Determine the P-value: The p-value is the probability of obtaining a test statistic as extreme or
more extreme than the observed one, assuming the null hypothesis is true.
6. Make a Decision:
o Reject H₀: If the p-value is less than the significance level (α), we reject the null hypothesis in
favor of the alternative hypothesis.
o Fail to Reject H₀: If the p-value is greater than or equal to α, we fail to reject the null hypothesis.
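The decision rule in the last step can be sketched in R as follows. The steps above do not name a particular test, so the sketch assumes a one-sample t-test. The sample weights, the hypothesized mean of 50, and the variable names are all invented for illustration.

```r
alpha <- 0.05   # significance level

# H0: the population mean is 50; H1: it is not 50
sample_weights <- c(49.2, 50.1, 50.8, 51.3, 49.9, 50.6, 51.0, 50.4)

result <- t.test(sample_weights, mu = 50)

result$statistic   # test statistic
result$p.value     # p-value

# Compare the p-value with the significance level
decision <- if (result$p.value < alpha) "Reject H0" else "Fail to reject H0"
print(decision)
```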
Assumptions of T-Tests
Types of T-Tests
T-tests are a specific type of hypothesis test used when the population standard deviation is
unknown. There are three main types:
1. One-Sample T-Test:
2. Independent Two-Sample T-Test:
3. Paired-Sample T-Test:
o Compares the means of two related samples (e.g., before-and-after measurements, matched pairs).
o Used to determine if there is a significant difference between the means of the two related
populations.
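All three types can be run with base R's t.test() function, as in the sketch below. Every data vector, the hypothesized mean of 70, and the variable names are made up for illustration.

```r
# One-sample: compare a sample mean with a hypothesized value
scores <- c(68, 74, 71, 77, 65, 72, 70, 75)
one_sample <- t.test(scores, mu = 70)

# Independent two-sample: compare the means of two unrelated groups
# (t.test() uses the Welch version by default, which does not assume equal variances)
group_a <- c(12.1, 11.8, 13.0, 12.6, 12.9)
group_b <- c(10.9, 11.5, 11.2, 12.0, 11.1)
two_sample <- t.test(group_a, group_b)

# Paired: compare before-and-after measurements on the same subjects
before <- c(200, 195, 210, 188, 205, 198, 202, 193)
after  <- c(192, 190, 204, 185, 199, 196, 195, 190)
paired <- t.test(after, before, paired = TRUE)

print(one_sample)
print(two_sample)
print(paired)
```

Each result contains the test statistic, the degrees of freedom, the p-value, and a confidence interval, which can be read off the printed output or accessed with `$p.value` and `$conf.int`.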