Unit - 2: Data Manipulation With R & Data Visualization in Watson Studio

This document discusses data manipulation and visualization techniques in R. It covers commonly used R packages for data manipulation like dplyr, data.table, tidyr, and lubridate. It also discusses creating visualizations using ggplot2 package in R, including creating scatter plots, bar plots, box plots, and histograms. The document provides examples of using functions from these packages to filter, arrange, mutate and summarize data, as well as examples of plotting different graph types.

Uploaded by

Kundan Vanama

UNIT - 2

Data manipulation with R & Data visualization in Watson Studio
• In predictive modelling, data manipulation is an important and unavoidable phase.
• Machine learning algorithms alone are not sufficient to build a robust predictive model.
• The approach must be to understand the business problem, understand the data, perform data manipulations, and then extract business insights.
• The majority of the time is spent understanding the data and manipulating it as required.
Data Manipulation
• Data Manipulation is also known as Data Exploration, Data Wrangling, or Data Cleaning.
• Data Manipulation is done to improve the accuracy and precision of the data.
• Data Manipulation is a mandatory step in predictive modelling because the data collection process has many faults, caused by the many uncontrollable factors involved in collecting data.
Following are some of the points you need to consider for Data Manipulation
• You can use the inbuilt functions in R to manipulate data.
• You can use the packages available on CRAN. These packages are widely tried and tested, and are often more efficient than hand-written code.
• You can also use ML algorithms. For example, tree-based boosting algorithms can take care of missing data and outliers. This saves time, but you still need a very thorough understanding of the data.
dplyr Package
• dplyr is a powerful R package that transforms and summarizes tabular data with rows and columns.
• It includes 5 major data manipulation commands:
– filter – filters the rows of the data based on a condition
– select – selects columns of interest from a data set
– arrange – arranges data set values in ascending or descending order
– mutate – creates new variables from existing variables
– summarise (with group_by) – summarizes the data, per group, using common operations such as min, max, mean, and count
• filter(df, condition)
• select(df, var1, var2, ...)
• arrange(df, var1, desc(var2), ...)
library(dplyr)
# Sample data frame
dat <- data.frame(
  Sno = c(2, 1, 4, 3, 5),
  name = c("bb", "aa", "cc", "dd", "aa"),
  dept = c("CSE", "ECE", "IT", "ECE", "CSE")
)
# mutate: add a marks column
dat1 <- mutate(dat, marks = c(50, 60, 70, 80, 90))
print(dat1)
#   Sno name dept marks
# 1   2   bb  CSE    50
# 2   1   aa  ECE    60
# 3   4   cc   IT    70
# 4   3   dd  ECE    80
# 5   5   aa  CSE    90
summarise(dat1, avg = mean(marks))   # average of marks
sample_n(dat1, 2)                    # 2 randomly chosen rows
sample_frac(dat1, 0.4)               # a random 40% of the rows
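The remaining dplyr verbs listed above (filter, select, arrange, and summarise with group_by) can be tried on the same sample data. The following is a minimal sketch, assuming the dplyr package is installed; the result names (cse_only, name_marks, sorted, dept_summary) are illustrative and not part of the original slides.

```r
library(dplyr)

# Same sample data as on the slide, with the marks column included
dat1 <- data.frame(
  Sno   = c(2, 1, 4, 3, 5),
  name  = c("bb", "aa", "cc", "dd", "aa"),
  dept  = c("CSE", "ECE", "IT", "ECE", "CSE"),
  marks = c(50, 60, 70, 80, 90)
)

# filter: keep only the rows where dept is CSE
cse_only <- filter(dat1, dept == "CSE")

# select: keep only the name and marks columns
name_marks <- select(dat1, name, marks)

# arrange: sort by dept ascending, then by marks descending
sorted <- arrange(dat1, dept, desc(marks))

# group_by + summarise: average marks, top marks and row count per dept
dept_summary <- dat1 %>%
  group_by(dept) %>%
  summarise(avg = mean(marks), top = max(marks), n = n())

print(dept_summary)
```

Note that with this particular data every department happens to have the same average, so the top column is the one that differs between groups.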
data.table package
• A data table is a group of related facts arranged in labeled rows and columns, used to record information.
• data.table can be used for faster manipulation of a data set. Using data.table typically reduces computing time compared to data.frame.
• A data.table query has 3 parts, written DT[i, j, by]. Here we instruct R to subset the rows using ‘i’, then calculate ‘j’, grouped by ‘by’. Most of the time, ‘by’ refers to a categorical variable.
data.table package
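library(data.table)
dat <- data.frame(
  Sno = c(2, 1, 4, 3, 5),
  name = c("bb", "aa", "cc", "dd", "aa"),
  dept = c("CSE", "ECE", "IT", "ECE", "CSE")
)
dat1 <- data.table(dat)           # convert the data frame to a data.table
class(dat1)                       # "data.table" "data.frame"
dat1[2:4]                         # rows 2 to 4
dat1[dept == 'CSE']               # rows where dept is CSE
dat1[dept %in% c('CSE', 'IT')]    # rows where dept is CSE or IT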
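The examples above use only the ‘i’ part of DT[i, j, by]. The sketch below, assuming the data.table package is installed, adds ‘j’ and ‘by’. The marks column is borrowed from the dplyr example, and the result names are illustrative only.

```r
library(data.table)

dt <- data.table(
  Sno   = c(2, 1, 4, 3, 5),
  name  = c("bb", "aa", "cc", "dd", "aa"),
  dept  = c("CSE", "ECE", "IT", "ECE", "CSE"),
  marks = c(50, 60, 70, 80, 90)
)

# i only: subset rows
cse_rows <- dt[dept == "CSE"]

# j only: compute on a column over all rows
overall_avg <- dt[, mean(marks)]

# j with by: average and top marks for each dept
by_dept <- dt[, .(avg = mean(marks), top = max(marks)), by = dept]

# i, j and by together: count the rows with marks >= 60 in each dept
high_by_dept <- dt[marks >= 60, .(count = .N), by = dept]

print(by_dept)
print(high_by_dept)
```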
reshape2 Package
• reshape2 is an R package written by Hadley Wickham that makes it easy to transform data between wide and long formats.
• Use the reshape2 package to reshape your data. Using the reshape2 package, we can combine features that have unique values. It has 2 functions, namely melt and cast.
• melt: converts data from wide format to long format. It is a form of restructuring where multiple columns are ‘melted’ into unique rows.
• cast: converts data from long format to wide format. It starts with melted data and reshapes it into wide format. It is the reverse of the melt function. It has two variants, namely dcast and acast.
– dcast returns a data frame as output.
– acast returns a vector/matrix/array as output.
dat
sno  name
1    aa
2    bb
3    cc
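A melt followed by a dcast can be sketched as below, assuming the reshape2 package is installed. The math and physics columns and their values are hypothetical; they are added to the small dat table only so that there is something to reshape.

```r
library(reshape2)

# Wide format: one row per student, one column per subject (hypothetical marks)
wide <- data.frame(
  sno     = c(1, 2, 3),
  name    = c("aa", "bb", "cc"),
  math    = c(70, 80, 90),
  physics = c(65, 75, 85)
)

# melt: wide -> long, one row per (student, subject) pair
long <- melt(wide,
             id.vars = c("sno", "name"),
             variable.name = "subject",
             value.name = "marks")

# dcast: long -> wide again, returned as a data frame
wide_again <- dcast(long, sno + name ~ subject, value.var = "marks")

print(long)
print(wide_again)
```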
readr Package
• The readr package was also developed by Hadley Wickham, to read large flat files quickly.
• ‘readr’ is used to read various forms of flat-file data into R. It is very fast. Character columns are not converted to factors. It helps in reading the following data:
• Delimited files with read_delim(), read_csv(), read_tsv(), and read_csv2().
• Fixed-width files with read_fwf(), and whitespace-separated files with read_table().
• Web log files with read_log().
library(readr)
dum <- read_csv("D:/dum1.csv")   # read the CSV file
dum                              # print the data
class(dum)                       # check the class of the result
dum$name                         # only the name column
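Because D:/dum1.csv is not available to the reader, the following sketch writes a small temporary CSV file first and then reads it back. It assumes the readr package is installed; the file contents are made up for illustration.

```r
library(readr)

# Write a small CSV file to a temporary location (illustrative contents)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("sno,name", "1,aa", "2,bb", "3,cc"), csv_path)

# Read it back with read_csv
dum <- read_csv(csv_path)

class(dum)               # a tibble, which is also a data.frame
dum$name                 # only the name column
is.character(dum$name)   # character data stays character, not factor
```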
tidyr Package
• tidyr is a package developed by Hadley Wickham that makes it easy to tidy your data.
• To make the data neat and tidy, use the tidyr package. The package has 4 major functions. You can use these functions, along with dplyr, if you are stuck in the data exploration phase.
• gather() – ‘gathers’ multiple columns and converts them into key:value pairs. This function transforms the wide form of data to the long form. It can be used as an alternative to ‘melt’ in the reshape2 package.
• spread() – does the reverse of gather. It accepts a key:value pair and converts it into separate columns.
• separate() – splits a column into multiple columns.
• unite() – does the reverse of separate. It unites multiple columns into a single column.
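library(tidyr)
library(dplyr)   # provides the %>% pipe
sno <- c(1, 2)
name <- c('Nusrath Khan', 'MV Kamal')
ddate <- c('22/12/2019', '22/12/2019')
dat <- data.frame(sno, name, ddate)
# Split the date into day, month and year
sdat <- separate(dat, ddate, c('day', 'month', 'year'))
# The same split written with pipes, then split name into first and second name
sdat <- dat %>%
  separate(ddate, c('day', 'month', 'year')) %>%
  separate(name, c('FN', 'SN'))
sdat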
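The other three tidyr functions (gather, spread, and unite) can be sketched in the same way, assuming the tidyr package is installed. The subject columns and their values are hypothetical. In recent tidyr releases gather() and spread() are marked as superseded by pivot_longer() and pivot_wider(), but they still work as shown.

```r
library(tidyr)

# Wide data: one column per subject (hypothetical marks)
wide <- data.frame(
  name    = c("aa", "bb"),
  math    = c(70, 80),
  physics = c(65, 75)
)

# gather: wide -> long, as key:value pairs
long <- gather(wide, key = "subject", value = "marks", math, physics)

# spread: long -> wide, the reverse of gather
wide_again <- spread(long, key = subject, value = marks)

# unite: combine day, month and year columns into a single date column
parts <- data.frame(
  day   = c("22", "23"),
  month = c("12", "12"),
  year  = c("2019", "2019")
)
united <- unite(parts, "ddate", day, month, year, sep = "/")

print(long)
print(wide_again)
print(united)
```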
Lubridate Package
• The lubridate package makes it easier to work with dates and times.
• Use the lubridate package to reduce the issues of working with date-time variables in R. The inbuilt functions of this package help in easy parsing of dates and times. Lubridate is used with data that contains time-stamped values.
• Three basic tasks are accomplished using lubridate: the update function, the duration functions, and the date extraction functions.
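The three tasks named above can be illustrated roughly as follows, assuming the lubridate package is installed. The date is the one used in the tidyr example; the variable names are illustrative.

```r
library(lubridate)

# Parse a day/month/year string into a Date
d <- dmy("22/12/2019")

# Date extraction: pull out individual components
yr <- year(d)
mo <- month(d)
dy <- day(d)

# Update: change some components of an existing date
d_updated <- update(d, year = 2020, month = 1)

# Duration: a fixed length of time that can be added to a date
ten_days <- ddays(10)
d_later <- d + ten_days

print(c(yr, mo, dy))
print(d_updated)
print(d_later)
```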
Working with Base R Graphics
• ggplot2 Package: ggplot2 offers a wide range of colors and patterns.
• ggplot2 is included in the tidyverse package.
• You must be proficient with plotting at least 3 graphs:
– Scatter Plot
– Bar Plot
– Histogram
Elements of ggplot2
• Data: the data set for which we want to plot a graph.
• Aesthetics: the visual properties onto which we map our data. We can map x axis, y axis, fill, col (colour), shape, and size.
• Geometry: the visual elements used to plot the data, such as points, bars, or lines.
• Facet: the groups by which we divide the data.
• ggplot2 functions like data in the 'long' format, i.e., a column for every dimension and a row for every observation.
• Well-structured data will save you lots of time when making figures with ggplot2.
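The four elements can be seen together in one short plot. This is a minimal sketch, assuming the ggplot2 package is installed. The sem and per values are the ones used later in these slides, while the year column is a hypothetical grouping added only to demonstrate facets.

```r
library(ggplot2)

# Data: long format, one row per observation
ds <- data.frame(
  sem  = c(1, 2, 3, 4, 5, 6, 7, 8),
  per  = c(73, 78, 86, 67, 84, 60, 80, 74),
  year = rep(c("Year 1", "Year 2", "Year 3", "Year 4"), each = 2)  # hypothetical
)

p <- ggplot(data = ds,                                   # Data
            aes(x = sem, y = per, colour = year)) +      # Aesthetics
  geom_point(size = 3) +                                 # Geometry
  facet_wrap(~ year) +                                   # Facet
  labs(title = "sem vs per", x = "sem", y = "per")

print(p)
```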
Box Plot
To save a plot
• ggsave("name_of_file.png", Surveys_plot, width = 15, height = 10)
• Here Surveys_plot is the plot object to be saved.
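A box plot can be built and saved in a few lines. The sketch below assumes the ggplot2 package is installed and reuses the dept and marks sample data from the dplyr section. The file name and the choice of centimetres as the unit are assumptions for illustration.

```r
library(ggplot2)

dat1 <- data.frame(
  dept  = c("CSE", "ECE", "IT", "ECE", "CSE"),
  marks = c(50, 60, 70, 80, 90)
)

# Box plot: distribution of marks within each dept
box_plot <- ggplot(dat1, aes(x = dept, y = marks)) +
  geom_boxplot() +
  labs(title = "Marks by dept", x = "dept", y = "marks")

# Save the plot to a PNG file in the session's temporary directory
out_file <- file.path(tempdir(), "marks_boxplot.png")
ggsave(out_file, box_plot, width = 15, height = 10, units = "cm")
```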
Scatter Plot
• A Scatter Plot is a graph in which the values of two variables are plotted along two axes, the pattern of the resulting points revealing any correlation present.
• With scatter plots we can explain how the variables relate to each other. This relationship is called correlation. Positive, Negative, and None (no correlation) are the three types of correlation.
Advantages of a Scatter Diagram
• The relationship between two variables can be viewed.
• It is a good method for revealing non-linear patterns.
• Maximum and minimum values can be easily determined.
• Observations are easy to read and understand.
• Plotting the diagram is very simple.
Limitations of a Scatter Diagram
• With scatter diagrams we cannot get the exact extent of correlation.
• A quantitative measure of the relationship between the variables cannot be read from the diagram. It shows the relationship only visually.
• A scatter diagram can show the relationship between only two variables.
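Since the exact extent of correlation cannot be read from the diagram, it can be computed separately. The sketch below uses base R's cor() on the sem and per values from the plotting examples in these slides, and repeats the calculation by hand using the Pearson formula as a check.

```r
sem <- c(1, 2, 3, 4, 5, 6, 7, 8)
per <- c(73, 78, 86, 67, 84, 60, 80, 74)

# Pearson correlation coefficient, always between -1 and 1
r <- cor(sem, per)

# The same value computed directly from the Pearson formula
dx <- sem - mean(sem)
dy <- per - mean(per)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

print(r)
print(r_manual)
```

A value near +1 or -1 indicates a strong positive or negative linear relationship, and a value near 0 indicates little or no linear relationship.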
Bar Plot
• A barplot (or barchart) is one of the most common types of graphic. It shows the relationship between a numeric variable and a categoric variable.
• Graphs are commonly classified into four types: bar graph (or bar chart), line graph, pie chart, and diagram.
Advantages of Bar plot:
• Bar charts are easy to understand and interpret.
• The relationship between bar size and value makes comparison easy.
• They are simple to create.
• They can help in presenting very large or very small values easily.
Histogram
• A histogram represents the frequency distribution of continuous variables, while a bar graph is a diagrammatic comparison of discrete variables. A histogram presents numerical data, whereas a bar graph shows categorical data. The histogram is drawn in such a way that there is no gap between the bars.
• Advantages of Histogram: A histogram helps to identify the different values in the data, how frequently they occur in the data set, and categories that are difficult to interpret in tabular form. It helps to visualize the distribution of the data.
• Limitations of Histogram: A histogram can be misleading, because its appearance depends on the number of bars (bins) chosen. Only two sets of data are used, but analyzing certain types of statistical data requires more than two sets of data.
# Sample data: percentage of marks (per) for each semester (sem)
sem <- c(1, 2, 3, 4, 5, 6, 7, 8)
per <- c(73, 78, 86, 67, 84, 60, 80, 74)
ds <- data.frame(sem, per)
# Scatter plot: sem on the x axis, per on the y axis
plot(ds$per ~ ds$sem, xlab = 'sem', ylab = 'per', main = 'sem vs per', col = 'blue', pch = 16)
# Bar plot: one bar per semester
barplot(per, names.arg = sem)
# Histogram of the percentages
hist(ds$per, xlab = 'ds$per', col = "orange", main = 'percentage of Marks')
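The bar plot and histogram above use base R graphics. Rough ggplot2 equivalents might look like the sketch below, assuming the ggplot2 package is installed. The bin width of 5 and the colours are arbitrary choices for illustration.

```r
library(ggplot2)

sem <- c(1, 2, 3, 4, 5, 6, 7, 8)
per <- c(73, 78, 86, 67, 84, 60, 80, 74)
ds <- data.frame(sem, per)

# Bar plot: sem treated as a category, bar height taken from per
bar_plot <- ggplot(ds, aes(x = factor(sem), y = per)) +
  geom_col(fill = "steelblue") +
  labs(title = "per by sem", x = "sem", y = "per")

# Histogram: frequency distribution of per, in bins 5 units wide
hist_plot <- ggplot(ds, aes(x = per)) +
  geom_histogram(binwidth = 5, fill = "orange", colour = "black") +
  labs(title = "percentage of Marks", x = "per", y = "count")

print(bar_plot)
print(hist_plot)
```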
