
Ashish Srivastava r Lab File

The document is a practical file for an MBA course on data management using R, detailing various aspects of R programming, including installation, data types, data manipulation with packages like dplyr and tidyr, and data visualization with ggplot2. It covers fundamental concepts such as reshaping data, hypothesis testing, and advanced visualization techniques. The file is submitted by a student and outlines the structure and content of the practical work undertaken in the course.

Uploaded by

ayushkumarfbd02
Copyright
© © All Rights Reserved

Section-I MBA(BA)

BASICS OF DATA MANAGEMENT WITH “R”


(BMBA 152)

PRACTICAL FILE

SUBMITTED BY:
Ayush Kumar
Roll No. I-24028
Section-I MBA(BA)

SUBMITTED TO:
Dr Shruti Traymbak
Associate Professor

INDEX

S.No Content Date Page No. Sign

1. Introduction to R, Introduction to
RStudio, R data types (vectors,
lists, matrices, data frames, and
factors), basic R commands.
2. Installation of R studio, Loading
and exploring datasets, sub-setting
and filtering data, renaming and
reordering columns.
3. Introduction to library in R Studio -
dplyr, tidyr and tidyverse. Creating
new variables using the dplyr
package.
4. Introduction to Reshaping data and
using pivoting (long to wide and
vice versa) to reshape data
5. Introduction to data manipulation
using the tidyr package (e.g.,
gather, spread, separate, unite).
6. Introduction to ggplot2-create
visualizations using the popular R
package ggplot2
7. Introduction to Scatter Plot, line
plots, bar charts, histograms, and
boxplots Create different types of
plots with help of ggplot2 (scatter
plots, line plots, bar charts,
histograms, and boxplots)
8. Introduction to Enhancing Visuals:
Customizing Titles, Labels, Colors,
and Themes
Building Complexity: Adding
Layers with Geoms
Breaking Down Data: Faceting for
Multivariable Insights
Preserving Work: Saving and
Exporting Plots
9. Introduction to advanced
visualization techniques like
creating multi-panel plots, and
handling overplotting using
transparency and jittering

10. Introduction to joins in R, different
types of joins (inner, left, right,
full), merging datasets using dplyr
functions (inner_join, left_join,
etc.),
11. Introduction to hypothesis testing,
types of t-tests (one-sample,
independent two-sample, paired
sample),
12. Assumptions of t-tests, conducting
t-tests in R, interpreting results, and
understanding p-values and
confidence intervals

Introduction to R

R programming is a leading tool for machine learning, statistics, and data analysis, allowing for
the easy creation of objects, functions, and packages. Designed by Ross Ihaka and Robert Gentleman
at the University of Auckland and developed by the R Development Core Team, R Language is
platform-independent and open-source, making it accessible for use across all operating systems
without licensing costs. Beyond its capabilities as a statistical package, R integrates with other
languages like C and C++, facilitating interaction with various data sources and statistical tools.

With a growing community of users and high demand in the Data Science job market, R is one of
the most sought-after programming languages today. Originating as an implementation of the S
programming language with influences from Scheme, R has evolved since its conception in 1992,
with its first stable beta version released in 2000.

The R Language stands out as a powerful tool in the modern era of statistical computing and data
analysis. Widely embraced by statisticians, data scientists, and researchers, the R Language offers
an extensive suite of packages and libraries tailored for data manipulation, statistical modeling, and
visualization. This file explores the features, benefits, and applications of the R Programming
Language and shows why it has become an important tool for data-driven professionals across
various industries.

The R programming language is an implementation of the S programming language, combined
with lexical scoping semantics inspired by Scheme. The project was conceived in 1992,
with an initial version released in 1995 and a stable beta version in 2000.

The core R language is augmented by a large number of extension packages, containing reusable
code, documentation, and sample data.

R is free and open-source software. It is part of the GNU Project and is available
under the GNU General Public License. It is written primarily in C, Fortran, and R
itself. Precompiled executables are provided for various operating systems.

Features of R Programming Language

1. Open-Source
You don’t have to pay any money to download R on your computer. It is free and open-source
software. Furthermore, you can contribute towards the development of R, customize its packages,
and add more features.

2. Strong Ability to Design Graphics


R has rich graphics libraries that make it possible to create both static and interactive graphics.
As a result, data visualization and representation are relatively simple. R can generate a wide
range of plots, from straightforward charts to intricate, interactive ones.

3. Extensive Range of Packages


CRAN, or the Comprehensive R Archive Network, contains over 10,000 different packages and
extensions that help handle a wide range of data science challenges. R contains a large set of
packages for many subjects, such as astronomy, biology, and so forth. While R was developed for
academic objectives, it is now also utilized in industry.

4. Efficient in Software Development


R has an extensive development environment, which means it may be used for both statistical
computing and software development. R supports object-oriented programming. It also
has a powerful package called Shiny that can be used to create full-fledged web apps.

5. Computing in a Distributed Environment


In distributed computing, tasks are split across numerous processing nodes to minimize processing
time and boost efficiency. R offers tools like "ddR" and "multidplyr" that allow it to process big
data sets using distributed computing.

6. Data Wrangling
The process of cleansing large and inconsistent data sets in order to facilitate computation and
further analysis is known as data wrangling. This is a time-consuming process. R's broad tool
collection can be utilized for database management and wrangling.

7. No Compilation
The R language is interpreted rather than compiled. As a result, no compiler is required to turn
code into an executable program. R code is evaluated step by step by the interpreter. This removes
the separate compile step, so an R script can be run as soon as it is written.

8. Enables Quick Calculations


R supports a wide range of complex operations on vectors, arrays, data frames, and other data
objects of various sizes. Many of these operations are vectorized, so they run quickly without
explicit loops. R includes a variety of operators to carry out these calculations.

9. Integration with Other Technologies


Many other technologies, frameworks, software programs, and programming languages can be
combined with R. For example, R can be linked with Hadoop to use Hadoop's distributed computing
capabilities. Additionally, it can be integrated with programs written in FORTRAN, C, C++, Java,
and Python, among other computer languages.

10. Compatibility with Multiple Platforms


R allows for cross-platform compatibility. It runs on the major operating systems, including
Windows, macOS, and Linux, without the need for platform-specific workarounds.

Programming in R:

Since R is syntactically similar to other widely used languages, it is relatively easy to learn and
code in. Programs can be written in widely used tools such as RStudio, Rattle, Tinn-R, etc.
After writing the program, save the file with the extension .R.

Applications of R

R is used for Data Science. It offers a wide range of statistics-related libraries, along with an
environment for statistical computation and graphics. Many quantitative analysts use R as their
programming language because it aids in data import and cleaning. In environmental science,
R is used to analyze and simulate environmental data, climate data, and ecological data. R is one
of the most common languages among data analysts and research programmers, and it is also
widely used as an analysis tool in finance.

Installing R on your machine

-Go to the CRAN website.

-Click on "Download R for Windows".

-Click on "install R for the first time" link to download the R executable(.exe) file.

-Run the R executable file to start installation, and allow the app to make changes to your device.

-Select the installation language.

-Follow the installation instructions.

-Click on Finish to exit the installation setup.

R has now been successfully installed on your Windows OS. Open the RGui to start writing R code.

DATA TYPES IN R

Basic Data Types    Values                                                   Examples

Numeric             Set of all real numbers                                  numeric_value <- 3.14
Integer             Set of all integers, Z                                   integer_value <- 42L
Logical             TRUE and FALSE                                           logical_value <- TRUE
Complex             Set of complex numbers                                   complex_value <- 1 + 2i
Character           "a", "b", "c", …, "@", "#", "$", …, "1", "2", … etc.     character_value <- "Hello Geeks"
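The data types in the table can be checked with class(). The following is a minimal sketch that reuses the example values from the table; the `types` vector is only a name chosen here for illustration.

```r
# One value of each basic data type, as in the table above
numeric_value   <- 3.14
integer_value   <- 42L          # the L suffix makes the value an integer
logical_value   <- TRUE
complex_value   <- 1 + 2i
character_value <- "Hello Geeks"

# class() reports the type of each object
types <- c(
  numeric   = class(numeric_value),
  integer   = class(integer_value),
  logical   = class(logical_value),
  complex   = class(complex_value),
  character = class(character_value)
)
print(types)
```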

DATA STRUCTURES IN R

VECTORS
A vector is simply a list of items that are of the same type. To combine the list of items into a vector, use
the c() function and separate the items with commas.

There are different ways to create a vector in R, which are as follows: -

Using c() function
Using seq() function
Using rep() function

Let’s take an example: -
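The three ways of creating a vector can be sketched as follows. The variable names and values are chosen here only for illustration.

```r
# c(): combine individual items into a vector
marks <- c(72, 85, 90)

# seq(): generate a regular sequence from 1 to 10 in steps of 2
odd_numbers <- seq(1, 10, by = 2)

# rep(): repeat a value a given number of times
labels <- rep("MBA", times = 3)

print(marks)
print(odd_numbers)
print(labels)
```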

MATRIX
Matrices are the R objects in which the elements are arranged in a two dimensional rectangular
layout.
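As a small illustration (the values are arbitrary), a matrix can be built with matrix(), which fills the elements column by column unless told otherwise:

```r
# A 2 x 3 matrix filled column by column with the numbers 1 to 6
m <- matrix(1:6, nrow = 2, ncol = 3)
print(m)

# Elements are accessed as m[row, column]
print(dim(m))     # number of rows and columns
print(m[2, 3])    # element in row 2, column 3
```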

FACTOR
A factor is a way to store categorical data in R. It groups values into categories called levels.
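A short sketch of a factor, using made-up category values. The explicit `levels` argument is optional and is used here only to fix the order of the categories.

```r
# Categorical data stored as a factor with three ordered-by-hand levels
rating <- factor(
  c("low", "high", "medium", "low"),
  levels = c("low", "medium", "high")
)

print(levels(rating))   # the categories
print(table(rating))    # how many values fall in each category
```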

DATA FRAME
A data frame is like a table. It is used to store data in rows and columns, where each column can have
a different data type.
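A minimal sketch of a data frame with three columns of different types. The student names and values are hypothetical.

```r
# Each column holds a different data type
students <- data.frame(
  name   = c("Asha", "Ravi", "Meena"),
  age    = c(23L, 24L, 22L),
  passed = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

print(students)
print(sapply(students, class))  # type of each column
```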

LIST
A list in R is a collection of objects and it can hold different data types together, unlike vectors that
hold only one type.
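The difference from a vector can be seen in a small sketch. The element names and values below are invented for illustration.

```r
# A list holding a character value, a numeric vector, and a logical value
record <- list(
  name   = "Asha",
  scores = c(90, 85),
  passed = TRUE
)

print(record$name)          # access an element by name
print(record[["scores"]])   # access an element with double brackets
print(length(record))       # number of elements in the list
```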

INTRODUCTION TO LIBRARY IN R STUDIO


To install and load a package in R -
• Use the command install.packages("package name")
• Then library(package name)

dplyr Package in R
The dplyr package is one of the most popular and powerful tools in R for data manipulation.
It provides a set of simple and intuitive functions for data transformation using a "grammar of data
manipulation."

Key Functions in dplyr-

select()
• Purpose: Choose specific columns from a dataset.
• Syntax: select(data, column1, column2, ...)

filter()
• Purpose: Filter rows based on conditions.
• Syntax: filter(data, condition)

mutate()
• Purpose: Create or modify columns.
• Syntax: mutate(data, new_column = expression)
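The three functions above can be tried on R's built-in mtcars dataset. This is a sketch only; the dataset and the new column name `hp_per_cyl` are choices made here, not part of the original file.

```r
library(dplyr)

# select(): keep only three columns
cars_small <- select(mtcars, mpg, cyl, hp)

# filter(): keep rows where miles per gallon is above 25
efficient <- filter(cars_small, mpg > 25)

# mutate(): add a new column computed from existing ones
with_ratio <- mutate(cars_small, hp_per_cyl = hp / cyl)

print(head(with_ratio))
```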

arrange()
• Purpose: Sort rows by one or more columns.
• Syntax: arrange(data, column1, desc(column2))

summarise()
• Purpose: Summarize data by applying functions like mean, sum, etc.
• Syntax: summarise(data, summary = function(column))

group_by()
• Purpose: Group data by one or more variables.
• Syntax: group_by(data, column)
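A sketch of arrange(), group_by(), and summarise() on the built-in mtcars dataset. The dataset and the summary column names are assumptions made for this example.

```r
library(dplyr)

# arrange(): sort by cylinders, then by mpg in descending order
sorted <- arrange(mtcars, cyl, desc(mpg))

# group_by() + summarise(): one summary row per number of cylinders
by_cyl  <- group_by(mtcars, cyl)
avg_mpg <- summarise(by_cyl, mean_mpg = mean(mpg), cars = n())

print(avg_mpg)
```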

Tidyr and tidyverse package in R

What is tidy data?


Tidy data is a structured way of organizing datasets to make them easier to work with for analysis.

A dataset is tidy when:


Each variable is a column.
Each observation is a row.
Each value is a single cell.

This standard structure helps ensure compatibility with data manipulation and visualization tools.

Why is tidy data important?


• Simplifies data exploration and analysis.
• Facilitates the use of R’s powerful tools like ggplot2 for visualization or dplyr for manipulation.
• Reduces errors in data transformation processes.

What is tidyverse package?


• The tidyverse is a collection of R packages designed for data science. It provides tools to import,
clean, transform, visualize, and model data in a consistent and user-friendly way.

• Key Packages in the tidyverse


i. dplyr : Data manipulation (filtering, grouping, summarizing).
ii. tidyr: Reshaping and tidying datasets.
iii. ggplot2: Data visualization.
iv. readr: Importing data from files (e.g., CSV, TSV).
v. tibble: Enhanced data frames for better readability.
vi. purrr: Functional programming with lists.
vii. stringr: String manipulation.
viii. forcats: Working with categorical (factor) data.

What is tidyr package?


The tidyr package, part of the tidyverse, is specifically designed to clean and reshape datasets into
tidy format.

It includes tools to:


• Reshape data:
-Convert wide data into long format (pivot_longer).
-Convert long data into wide format (pivot_wider).
• Handle missing values:
-Remove missing data (drop_na).
-Fill or replace missing values (fill, replace_na).
• Split and combine columns:
-Separate a column into multiple columns (separate).
-Unite multiple columns into one (unite).
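The missing-value helpers listed above can be sketched on a tiny data frame. The data are invented, and replacing NA with 0 is only one possible choice.

```r
library(tidyr)

scores <- data.frame(id = 1:4, score = c(10, NA, 30, NA))

# drop_na(): remove rows that contain missing values
complete_rows <- drop_na(scores)

# replace_na(): replace missing values with a chosen value (0 here)
zero_filled <- replace_na(scores, list(score = 0))

# fill(): carry the last non-missing value downwards
carried_down <- fill(scores, score)

print(complete_rows)
print(zero_filled)
print(carried_down)
```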

Importance of tidyr package

• Consistency- Tidy data ensures compatibility with R's tidyverse tools, allowing smooth workflows.
• Readability- A well-organized dataset is easier to understand, troubleshoot, and share.
• Flexibility- With tidy data, you can quickly switch between different tools for manipulation,
visualization, or modeling.
• Error Reduction- Clean, structured data minimizes the risk of errors during analysis.
• Seamless Integration- Tidyverse functions are designed to work together. For example, you can
use tidyr to clean data and pass it directly to ggplot2 for visualization.

Advantages of tidyr

• Simplifies reshaping and cleaning messy datasets.


• Works seamlessly with other tidyverse packages like dplyr.
• Improves readability and maintainability of data transformation code.

Installation and loading

Common functions in tidyr -

separate()
• splits one column into multiple columns
• Syntax- separate(data, col, into, sep)

unite()
• The unite() function combines multiple columns into a single column, using a specified delimiter.
• It is useful for creating a single identifier or combining related information.
• Syntax- unite(data, col, ..., sep = "delimiter")
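A minimal sketch of separate() and unite() working as inverses of each other. The `date` column and its values are hypothetical.

```r
library(tidyr)

events <- data.frame(
  id   = 1:2,
  date = c("2024-01-15", "2023-12-31"),
  stringsAsFactors = FALSE
)

# separate(): split one column into year, month and day
split_up <- separate(events, col = date, into = c("year", "month", "day"), sep = "-")

# unite(): join the three columns back into a single column
joined <- unite(split_up, col = "date", year, month, day, sep = "-")

print(split_up)
print(joined)
```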

• In the tidyr package, the gather() and spread() functions are used to reshape data. They are part of
the tidy data principles where each variable is a column, each observation is a row, and each type of
observational unit forms a table.
• Although these functions have been replaced by pivot_longer() and pivot_wider() in modern
versions of tidyr, they are still worth learning for legacy code.

gather()
• The gather() function is used to convert data from wide format to long format.
• Syntax- gather(data, key, value, ..., na.rm = FALSE, factor_key = FALSE)

spread()
• The spread() function is used to convert data from long format to wide format.
• Syntax- spread(data, key, value)
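The two legacy functions can be sketched on a small table of marks. The data and column names are invented for illustration.

```r
library(tidyr)

wide <- data.frame(
  student = c("A", "B"),
  math    = c(90, 80),
  science = c(85, 95),
  stringsAsFactors = FALSE
)

# gather(): wide to long, one row per student and subject
long <- gather(wide, key = "subject", value = "score", math, science)

# spread(): long back to wide, one column per subject
wide_again <- spread(long, key = subject, value = score)

print(long)
print(wide_again)
```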

Introduction to Reshaping Data

Pipe Operator (%>%)


• The pipe operator (%>%) is used to chain multiple dplyr functions together, making the code
cleaner and more readable.
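To see what the pipe buys, the sketch below writes the same calculation twice, once nested and once piped. It assumes the built-in mtcars dataset.

```r
library(dplyr)

# Nested calls: read from the inside out
nested <- summarise(filter(mtcars, cyl == 4), mean_mpg = mean(mpg))

# Piped calls: read from top to bottom
piped <- mtcars %>%
  filter(cyl == 4) %>%
  summarise(mean_mpg = mean(mpg))

print(piped)
```

Both versions give the same result; the piped version lists the steps in the order they are carried out.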

What is pivot_longer() and pivot_wider() in R?
In R, pivot_longer() and pivot_wider() are functions from the tidyr package used to reshape data.
They are part of the tidy data principles: each variable in a column, each observation in a row, and
one type of observational unit per table.

What does each function do?

pivot_longer()
• Converts wide format data to long format.
• Combines multiple columns into two columns: one for the variable names and one for the
corresponding values.
• Syntax- pivot_longer(data, cols, names_to, values_to)

pivot_wider()
• Converts long format data to wide format.
• Expands rows into columns: a specific column's values become column names, and another column
provides the values to populate these columns.
• Syntax- pivot_wider(data, names_from, values_from)
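A short sketch of both functions on a small table of marks. The data and the names `subject` and `score` are assumptions made for this example.

```r
library(tidyr)

wide <- data.frame(
  student = c("A", "B"),
  math    = c(90, 80),
  science = c(85, 95),
  stringsAsFactors = FALSE
)

# pivot_longer(): the subject columns become rows
long <- pivot_longer(wide, cols = c(math, science),
                     names_to = "subject", values_to = "score")

# pivot_wider(): the subject values become columns again
wide_again <- pivot_wider(long, names_from = subject, values_from = score)

print(long)
print(wide_again)
```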

DATA VISUALIZATION IN R

The ggplot2 package in R is one of the most popular and versatile tools for creating high-quality
visualizations. It is based on the grammar of graphics, where you build plots layer by layer by
specifying data, aesthetics, and geometry.

➢ Load the ggplot2 Library


• Use the command install.packages("ggplot2")
• Then library(ggplot2)

➢ Understand the ggplot2 structure


• Start with ggplot(data, aes(...)) where data is your dataset and aes maps variables to visual
properties like x, y, color, etc.

➢ Basic Template
• Syntax- ggplot(data, aes(x = variable1, y = variable2)) + geom_type() + additional_layers()
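The template can be filled in as follows. This is a sketch using the built-in iris dataset; the dataset, labels, and theme are choices made here for illustration.

```r
library(ggplot2)

# data + aesthetic mapping, then a geom, then additional layers
p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() +
  labs(title = "Petal length vs sepal length",
       x = "Sepal length (cm)", y = "Petal length (cm)") +
  theme_minimal()

print(p)  # draws the plot
```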

INTRODUCTION TO BASIC PLOTS IN R

A. Scatter Plot
• A scatter plot is a graphical representation of the relationship between two continuous variables. It
is often used to identify patterns, trends, and possible correlations between the variables.
• In R, scatter plots are simple to create using base R or advanced packages like ggplot2 for enhanced
visualization.
• Axes: The x-axis represents one variable, and the y-axis represents the other variable.
• Points: Each point in the plot represents an observation from the dataset.

• When to use scatter plots-

o When analyzing the relationship between two numeric variables
o When looking for linear or non-linear trends in the data
o When assessing the spread or clustering of data points.

• Advantages of scatter plot-

o Easy to visualize relationships between two variables.
o Helpful in identifying trends and outliers.
o Supports overlays like trend lines or multiple variable comparison.

• Useful for identifying:

o Positive or negative correlations.
o Outliers.
o Non-linear relationships.

Fig. Basic Scatter Plot using base R

Fig. Basic Scatter Plot using ggplot
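The original figures are not reproduced here. Plots of this kind could be drawn as sketched below, assuming the built-in mtcars dataset (car weight against fuel efficiency); the data actually used in the figures is not known.

```r
library(ggplot2)

# Basic scatter plot using base R
plot(mtcars$wt, mtcars$mpg,
     main = "Basic Scatter Plot", xlab = "Weight", ylab = "Miles per gallon")

# Basic scatter plot using ggplot2
p_scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Basic Scatter Plot", x = "Weight", y = "Miles per gallon")

print(p_scatter)
```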

B. Line Plot
• A line plot is a type of graph used to visualize data points connected by straight lines. It is
particularly useful for displaying trends over time or continuous data.
• Line plots are widely used in time series analysis, where the x-axis represents time, and the y-axis
represents a numerical value. They help in visualizing patterns, trends, and fluctuations over time.

• In R, line plots can be created using the base plotting system or more advanced plotting systems
like ggplot2.

• Advantages of line plots-

o Great for showing trends over time or continuous variables.
o Useful in identifying patterns such as periodic behavior, outliers, or changes in trends
o Easy to understand and intuitive for representing sequential data

• When to use line plots-

o To display time-series data.
o When the data has an inherent order (e.g., measurements taken at different time points).
o To visualize trends, patterns, or fluctuations in data over a continuous range (such as days,
months, or years).
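A line plot of this kind can be sketched as follows. The monthly sales figures are made up purely for illustration.

```r
library(ggplot2)

# Hypothetical monthly sales for one year
sales <- data.frame(
  month = 1:12,
  units = c(120, 135, 150, 145, 160, 175, 190, 185, 170, 165, 180, 200)
)

# Line plot using base R
plot(sales$month, sales$units, type = "l",
     xlab = "Month", ylab = "Units sold")

# Line plot using ggplot2, with points marking each observation
p_line <- ggplot(sales, aes(x = month, y = units)) +
  geom_line() +
  geom_point() +
  labs(x = "Month", y = "Units sold")

print(p_line)
```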

C. Bar Chart
• A bar chart (also known as a bar plot) is a graphical representation of categorical data using
rectangular bars. The length or height of each bar is proportional to the value or frequency of the
category it represents. Bar charts are widely used for comparing quantities across different
categories.
• In R, you can create bar charts using both base R and the ggplot2 package. Bar charts are
particularly useful for visualizing the distribution of categorical data or comparing different groups.

• Advantages of Bar charts-

o Clear Comparison: Bar charts make it easy to compare different categories side by side.
o Simple & Intuitive: They are easy to understand and interpret, even for non-experts.
o Versatile: Bar charts can display counts, percentages, or other numerical values for each category.
o Customization: With R, especially ggplot2, bar charts can be highly customized for various
visualization needs.

• When to use bar charts-

o To compare the frequency or count of different categories
o To display categorical data with a clear comparison between groups
o To visualize the summary statistics of categorical variables (e.g., count, mean, etc.)
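Both uses, counts and summary statistics, can be sketched with the built-in mtcars dataset (an assumption made for this example). geom_bar() counts rows per category, while geom_col() plots values that have already been computed.

```r
library(ggplot2)

# Bar chart of counts: number of cars for each cylinder count
p_count <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar() +
  labs(x = "Cylinders", y = "Number of cars")

# Bar chart of a summary statistic: mean mpg for each cylinder count
avg_mpg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
p_mean <- ggplot(avg_mpg, aes(x = factor(cyl), y = mpg)) +
  geom_col() +
  labs(x = "Cylinders", y = "Mean miles per gallon")

print(p_count)
print(p_mean)
```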

jitter() Function

In R programming, jittering means adding a small amount of random noise to a numeric vector.
The jitter() function does this, which helps separate overlapping points when the data are plotted.

Syntax: jitter(x, factor)


Parameters:
x: a numeric vector
factor: a numeric value that scales the amount of noise added
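A minimal sketch of jitter() on a vector with repeated values. The seed is set only so that the example is reproducible, and the values are arbitrary.

```r
set.seed(42)  # make the random noise reproducible

x <- c(1, 1, 2, 2, 3, 3)

# Add a small amount of random noise to each value
x_jittered <- jitter(x, factor = 1)

print(x)
print(x_jittered)

# Repeated values no longer fall on exactly the same position when plotted
plot(x_jittered, rep(1, length(x)), xlab = "Jittered x", ylab = "")
```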

Transparency (alpha)

In ggplot2, the alpha argument specifies the transparency used for points on a scatterplot,
for example geom_point(alpha = 0.5).
Note that you can supply a value between 0 and 1 for the alpha argument. A value
of 0 will cause the points to be completely transparent while a value of 1 will cause the points to be
completely opaque.
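Transparency and jittering are often combined to handle overplotting. The sketch below assumes the built-in mtcars dataset; the alpha and width values are arbitrary choices.

```r
library(ggplot2)

# Semi-transparent points: overlapping points appear darker
p_alpha <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(alpha = 0.4, size = 3)

# Jitter plus transparency for a variable with few distinct x values
p_jitter <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.5)

print(p_alpha)
print(p_jitter)
```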

MEAN, MEDIAN and MODE

Mean

It is calculated by taking the sum of the values and dividing by the number of values in a data
series. The function mean() is used to calculate this in R.

# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

# Find Mean.
result.mean <- mean(x)
print(result.mean)

When we execute the above code, it produces the following result −
[1] 8.22

Median

The middle most value in a data series is called the median. The median() function is used in R to
calculate this value.

# Create the vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

# Find the median.
median.result <- median(x)
print(median.result)

When we execute the above code, it produces the following result −
[1] 5.6

Mode

The mode is the value that has highest number of occurrences in a set of data. Unlike mean and
median, mode can have both numeric and character data. R does not have a standard in-built function
to calculate mode. So we create a user function to calculate mode of a data set in R. This function
takes the vector as input and gives the mode value as output.
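The user function itself is not reproduced in the file. One common way to write it is sketched below; the name `getmode` and the sample vectors are assumptions, and when several values tie for the highest count this version returns the one that appears first.

```r
# Return the most frequent value of a vector (numeric or character)
getmode <- function(v) {
  uniqv  <- unique(v)                      # distinct values
  counts <- tabulate(match(v, uniqv))      # occurrences of each distinct value
  uniqv[which.max(counts)]                 # value with the highest count
}

numbers <- c(2, 1, 2, 3, 1, 2, 5)
print(getmode(numbers))

letters_seen <- c("a", "b", "b", "c")
print(getmode(letters_seen))
```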

VARIANCE

The var() function in R computes the sample variance of a vector. Variance measures how
far the values are spread from the mean.

Computing variance of a vector


# R program to illustrate
# variance of vector
# Create example vector
x <- c(1, 2, 3, 4, 5, 6, 7)
# Apply var function in R
print(var(x))
Output: [1] 4.666667

STANDARD DEVIATION

The sd() function is used to compute the standard deviation of given values in R. It is the square root of
the variance.

Computing standard deviation of a vector


# R program to illustrate
# standard deviation of vector
# Create example vector
x2 <- c(1, 2, 3, 4, 5, 6, 7)
# Apply sd function in R
print(sd(x2))
Output: [1] 2.160247

Introduction to joins in R

Joins are a fundamental operation to combine data from multiple datasets based on common
columns. R, with its powerful data manipulation capabilities, offers several methods to perform joins
effectively.

Inner Join:
Retains rows that have matching values in both datasets. Think of it as the intersection of two sets.

Left Join:
Retains all rows from the left dataset and the matching rows from the right. Left rows with no match
in the right dataset get NA values in the right dataset's columns.

Right Join:
A right join in R returns all rows from the right (second) data frame, and the matching rows from
the left (first) data frame. If a row in the right data frame has no match in the left data frame, it is
included with NA values for the left data frame's columns.

Semi Join:
Returns all rows from the left table where there is a match in the right table.

Anti Join:
Returns all rows from the left table where there is no match in the right table.

Full Join :
A full join in R combines two data frames, retaining all rows from both data frames. If a row doesn't
have a match in the other data frame, it's included with missing values (NA) for the columns from
the other data frame.
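The join types above can be compared side by side using the dplyr join functions. The two small data frames are invented for illustration and share the key column `id`.

```r
library(dplyr)

students <- data.frame(id = c(1, 2, 3), name = c("Asha", "Ravi", "Meena"),
                       stringsAsFactors = FALSE)
marks    <- data.frame(id = c(2, 3, 4), score = c(88, 75, 91))

inner <- inner_join(students, marks, by = "id")  # ids present in both
left  <- left_join(students, marks, by = "id")   # all students, score may be NA
right <- right_join(students, marks, by = "id")  # all marks, name may be NA
full  <- full_join(students, marks, by = "id")   # every id from either table
semi  <- semi_join(students, marks, by = "id")   # students that have a mark
anti  <- anti_join(students, marks, by = "id")   # students without a mark

print(full)
```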

Hypothesis Testing

Hypothesis testing is a statistical method used to determine whether a hypothesis about a population
parameter is likely to be true. It involves collecting sample data, analyzing it, and making inferences
about the population.

The General Process of Hypothesis Testing

1. Formulate Hypotheses:
o Null Hypothesis (H₀): A statement of no effect or no difference.
o Alternative Hypothesis (H₁): A statement that contradicts the null hypothesis.

2. Set the Significance Level (α): This is the probability of rejecting the null hypothesis when it's
actually true (Type I error). Common values are 0.05 and 0.01.

3. Collect Sample Data: Gather a representative sample from the population.

4. Calculate the Test Statistic: This is a numerical value that summarizes the sample data and is used
to assess the evidence against the null hypothesis.

5. Determine the P-value: The p-value is the probability of obtaining a test statistic as extreme or
more extreme than the observed one, assuming the null hypothesis is true.

6. Make a Decision:
o Reject H₀: If the p-value is less than the significance level (α), we reject the null hypothesis in
favor of the alternative hypothesis.
o Fail to Reject H₀: If the p-value is greater than or equal to α, we fail to reject the null hypothesis.
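The six steps can be followed in R with a one-sample t-test. This is a sketch under stated assumptions: the sample values, the hypothesized mean of 5, and the 0.05 significance level are all made up for illustration.

```r
# Step 1: H0: population mean = 5, H1: population mean != 5
mu0 <- 5

# Step 2: significance level
alpha <- 0.05

# Step 3: sample data (hypothetical measurements)
scores <- c(5.1, 4.9, 5.6, 5.8, 6.0, 5.5, 5.3, 5.7)

# Steps 4 and 5: test statistic and p-value
result <- t.test(scores, mu = mu0)
print(result$statistic)
print(result$p.value)

# Step 6: decision
decision <- if (result$p.value < alpha) "Reject H0" else "Fail to reject H0"
print(decision)
```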

Assumptions of T-Tests

To ensure the validity of t-tests, certain assumptions must be met:


1. Independence: Observations within each group should be independent of each other.
2. Normality: The data within each group should be approximately normally distributed.
3. Equal Variance (for independent t-tests): The variances of the two populations being compared
should be equal.
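Normality and equal variance can be checked in R before running a t-test. The sketch below uses shapiro.test() and var.test() from base R on two invented groups; these are common checks, not necessarily the ones used in the course. Note that t.test() does not assume equal variances unless var.equal = TRUE is set.

```r
group_a <- c(20, 22, 19, 24, 21, 23)
group_b <- c(30, 31, 29, 33, 32, 28)

# Normality: Shapiro-Wilk test for each group
# (a small p-value suggests the data are not normally distributed)
normal_a <- shapiro.test(group_a)
normal_b <- shapiro.test(group_b)

# Equal variance: F test comparing the two variances
# (a small p-value suggests the variances differ)
equal_var <- var.test(group_a, group_b)

print(normal_a$p.value)
print(normal_b$p.value)
print(equal_var$p.value)

# Classic two-sample t-test that assumes equal variances
pooled <- t.test(group_a, group_b, var.equal = TRUE)
print(pooled)
```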

Types of T-Tests

T-tests are a specific type of hypothesis test used when the population standard deviation is
unknown. There are three main types:

1. One-Sample T-Test:

o Compares the mean of a single sample to a known or hypothesized population mean.


o Used to determine if the sample mean is significantly different from the population mean.

2. Independent Two-Sample T-Test:

o Compares the means of two independent samples.


o Used to determine if there is a significant difference between the means of two populations.

3. Paired Sample T-Test:

o Compares the means of two related samples (e.g., before-and-after measurements, matched pairs).
o Used to determine if there is a significant difference between the means of the two related
populations.
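All three types can be run with the t.test() function. The data below are hypothetical and chosen only to show the three calls; by default the two-sample call uses Welch's version, which does not assume equal variances.

```r
# 1. One-sample t-test: is the mean weight different from 50?
weights <- c(52, 48, 51, 50, 49, 53, 47, 50)
one_sample <- t.test(weights, mu = 50)

# 2. Independent two-sample t-test: do two groups differ?
group_a <- c(20, 22, 19, 24, 21, 23)
group_b <- c(30, 31, 29, 33, 32, 28)
two_sample <- t.test(group_a, group_b)

# 3. Paired t-test: before-and-after measurements on the same subjects
before <- c(200, 190, 210, 220, 205, 215)
after  <- c(190, 178, 202, 209, 196, 202)
paired <- t.test(before, after, paired = TRUE)

print(one_sample$p.value)
print(two_sample$p.value)
print(paired$p.value)
print(paired$conf.int)   # confidence interval for the mean difference
```

A p-value below the chosen significance level leads to rejecting the null hypothesis; the confidence interval shows the range of plausible values for the difference in means.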
