
Unit - 2: Data Manipulation With R & Data Visualization in Watson Studio

This document discusses data manipulation and visualization techniques in R. It covers commonly used R packages for data manipulation like dplyr, data.table, tidyr, and lubridate. It also discusses creating visualizations using ggplot2 package in R, including creating scatter plots, bar plots, box plots, and histograms. The document provides examples of using functions from these packages to filter, arrange, mutate and summarize data, as well as examples of plotting different graph types.

Uploaded by

Kundan Vanama
Copyright: © All Rights Reserved

UNIT - 2

Data manipulation with R & Data visualization in Watson Studio
• When it comes to predictive modelling, data manipulation is an important and unavoidable phase.
• Machine learning algorithms alone are not sufficient to build a robust predictive model.
• The approach must be to understand the business problem, understand the data, perform data manipulations, and then extract business insights.
• The majority of the time is spent understanding the data and manipulating it as required.
Data Manipulation
• Data manipulation is also known as data wrangling or data cleaning, and it goes hand in hand with data exploration.
• Data manipulation is done to improve the accuracy and precision of the data.
• Data manipulation is a mandatory step in predictive modelling because the data collection process is prone to faults, many of which come from uncontrollable factors.
Following are some of the points you need to consider for Data Manipulation
• You can use the built-in functions in R to manipulate data.
• You can use the packages available on CRAN. As these packages are tried and tested, they are often more efficient.
• You can also use ML algorithms. For example, tree-based boosting algorithms can take care of missing data and outliers. This approach saves time, but you still need a very thorough understanding of the data.
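As a rough illustration of the first point, the sketch below uses only built-in R functions to count missing values, drop incomplete rows, sort, and summarize. The `marks` data frame and its values are hypothetical and exist only for this example.

```r
# Hypothetical marks data with one missing value (illustration only)
marks <- data.frame(
  name = c("aa", "bb", "cc", "dd"),
  score = c(50, NA, 70, 80)
)

# Count the missing values in the score column
n_missing <- sum(is.na(marks$score))

# Keep only the rows that have no missing values
complete <- marks[complete.cases(marks), ]

# Sort the remaining rows by score, highest first
ranked <- complete[order(-complete$score), ]

# Mean that skips missing values instead of returning NA
avg <- mean(marks$score, na.rm = TRUE)

print(ranked)
print(avg)
```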
dplyr Package
• dplyr is a powerful R package for transforming and summarizing tabular data with rows and columns.
• It includes 5 major data manipulation commands:
– filter – filters the data based on a condition
– select – selects columns of interest from a data set
– arrange – arranges the rows of a data set in ascending or descending order
– mutate – creates new variables from existing variables
– summarise (with group_by) – performs analysis with commonly used operations such as min, max, mean, count, etc.
• filter(df, condition)
• select(df, var1, var2, …)
• arrange(df, var1, desc(var2), …)
• Example: build a data frame, add a column with mutate, then summarize and sample it
library(dplyr)
dat <- data.frame(
  Sno = c(2, 1, 4, 3, 5),
  name = c("bb", "aa", "cc", "dd", "aa"),
  dept = c("CSE", "ECE", "IT", "ECE", "CSE")
)
# Add a marks column to the data frame
dat1 <- mutate(dat, marks = c(50, 60, 70, 80, 90))
print(dat1)
#   Sno name dept marks
# 1   2   bb  CSE    50
# 2   1   aa  ECE    60
# 3   4   cc   IT    70
# 4   3   dd  ECE    80
# 5   5   aa  CSE    90
summarize(dat1, avg = mean(marks))  # average of the marks column
sample_n(dat1, 2)                   # 2 rows chosen at random
sample_frac(dat1, 0.4)              # a random 40% of the rows
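The remaining verbs listed above (filter, select, arrange, and summarise with group_by) can be sketched on the same data. This is a minimal illustration that rebuilds the `dat1` data frame so it runs on its own; the result names (`cse`, `name_marks`, `ordered`, `by_dept`) are chosen here for the example.

```r
library(dplyr)

# Same data as the example above, with the marks column included
dat1 <- data.frame(
  Sno = c(2, 1, 4, 3, 5),
  name = c("bb", "aa", "cc", "dd", "aa"),
  dept = c("CSE", "ECE", "IT", "ECE", "CSE"),
  marks = c(50, 60, 70, 80, 90)
)

# filter: keep only the rows where dept is CSE
cse <- filter(dat1, dept == "CSE")

# select: keep only the name and marks columns
name_marks <- select(dat1, name, marks)

# arrange: sort by dept ascending, then by marks descending
ordered <- arrange(dat1, dept, desc(marks))

# group_by + summarise: average marks and row count per department
by_dept <- dat1 %>%
  group_by(dept) %>%
  summarise(avg = mean(marks), n = n())

print(by_dept)
```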
data.table package
• A data table is a group of related facts arranged in labeled rows and columns and is used to record information.
• data.table can be used to perform faster manipulation of a data set. Using data.table often reduces computing time compared to data.frame, especially on large data sets.
• A data.table query has 3 parts, written as DT[i, j, by]. Here, we are instructing R to subset the rows using ‘i’, then to calculate ‘j’, grouped by ‘by’. Most of the time, ‘by’ refers to a categorical variable.
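library(data.table)
dat <- data.frame(
  Sno = c(2, 1, 4, 3, 5),
  name = c("bb", "aa", "cc", "dd", "aa"),
  dept = c("CSE", "ECE", "IT", "ECE", "CSE")
)
dat1 <- data.table(dat)            # convert the data frame to a data.table
class(dat1)                        # "data.table" "data.frame"
dat1[2:4]                          # rows 2 to 4
dat1[dept == 'CSE']                # rows where dept is CSE
dat1[dept %in% c('CSE', 'IT')]     # rows where dept is CSE or IT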
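The examples above use only the ‘i’ part. A minimal sketch of the full DT[i, j, by] form follows, assuming a `marks` column like the one used in the dplyr example; the names `dt` and `res` are chosen here for illustration.

```r
library(data.table)

# Same data as above, with an assumed marks column added
dt <- data.table(
  Sno = c(2, 1, 4, 3, 5),
  name = c("bb", "aa", "cc", "dd", "aa"),
  dept = c("CSE", "ECE", "IT", "ECE", "CSE"),
  marks = c(50, 60, 70, 80, 90)
)

# i:  keep rows with marks of at least 60
# j:  compute the average marks and the row count (.N)
# by: do the calculation separately for each dept
res <- dt[marks >= 60, .(avg = mean(marks), n = .N), by = dept]

print(res)
```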
reshape2 Package
• reshape2 is an R package written by Hadley Wickham that makes it easy to transform data between wide and long formats.
• Use the reshape2 package to reshape your data. Using the reshape2 package, we can combine features that have unique values. It has 2 main functions, namely melt and cast.
• melt: Converts data from wide format to long format. It is a form of restructuring where multiple categorical columns are ‘melted’ into unique rows.
• cast: Converts data from long format to wide format. It starts with melted data and reshapes it into wide format. It is the reverse of the melt function. It comes in two variants, namely dcast and acast.
– dcast returns a data frame as output.
– acast returns a vector/matrix/array as the output.
• Example data frame dat:
  sno name
    1   aa
    2   bb
    3   cc
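A minimal sketch of melt, dcast, and acast follows. It extends the small `sno`/`name` table with two hypothetical measurement columns, `sem1` and `sem2`, which are assumptions made only so there is something to reshape.

```r
library(reshape2)

# Wide format: one row per student, one column per semester (hypothetical values)
wide <- data.frame(
  sno = c(1, 2, 3),
  name = c("aa", "bb", "cc"),
  sem1 = c(73, 78, 86),
  sem2 = c(67, 84, 60)
)

# melt: wide -> long, one row per student and semester
long <- melt(wide, id.vars = c("sno", "name"),
             variable.name = "sem", value.name = "per")

# dcast: long -> wide again, returned as a data frame
wide_again <- dcast(long, sno + name ~ sem, value.var = "per")

# acast: long -> wide, returned as a matrix with names as row labels
mat <- acast(long, name ~ sem, value.var = "per")

print(long)
print(wide_again)
```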
readr Package
• The readr package, also developed by Hadley Wickham, reads large flat files quickly.
• ‘readr’ is used to read various forms of data into R. It is very fast, and character columns are not converted to factors. It helps in reading the following data:
• Delimited files with read_delim(), read_csv(), read_tsv(), and read_csv2().
• Fixed width files with read_fwf() and read_table().
• Web log files with read_log().
library(readr)
dum <- read_csv("D:/dum1.csv")  # read the CSV file into a tibble
dum
class(dum)
dum$name  # only the name column
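Because the file `D:/dum1.csv` is not available here, the sketch below writes a small CSV to a temporary file first and then reads it back. The file contents are hypothetical; the point is only to show that read_csv returns a tibble and keeps text columns as character vectors.

```r
library(readr)

# Write a small hypothetical CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("sno,name", "1,aa", "2,bb", "3,cc"), csv_path)

# Read it back; col_types = cols() guesses the types without printing a message
dum <- read_csv(csv_path, col_types = cols())

class(dum)               # a tibble, which is also a data.frame
dum$name                 # only the name column
is.character(dum$name)   # text is kept as character, not converted to factor
```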
tidyr Package
• tidyr is a package developed by Hadley Wickham that makes it easy to tidy your data.
• To make the data neat and tidy, use the tidyr package. The package has 4 major functions. You can use these functions, along with dplyr, if you are stuck in the data exploration phase.
• gather() – ‘gathers’ multiple columns and converts them into key:value pairs. This function transforms the wide form of data to the long form. It can be used as an alternative to ‘melt’ in the reshape2 package.
• spread() – Does the reverse of gather. It accepts a key:value pair and converts it into separate columns.
• separate() – Splits a column into multiple columns.
• unite() – Does the reverse of separate. It unites multiple columns into a single column.
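library(tidyr)
library(dplyr)  # provides the %>% pipe
sno <- c(1, 2)
name <- c('Nusrath Khan', 'MV Kamal')
ddate <- c('22/12/2019', '22/12/2019')
dat <- data.frame(sno, name, ddate)
# Split ddate into day, month, and year columns
sdat <- separate(dat, ddate, c('day', 'month', 'year'))
# Same split using pipes, then split name into first and second name
sdat <- dat %>%
  separate(ddate, c('day', 'month', 'year')) %>%
  separate(name, c('FN', 'SN'))
sdat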
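The other three functions can be sketched in the same way. The `wide` and `parts` data frames below are hypothetical. Note that in current tidyr releases gather() and spread() are marked as superseded by pivot_longer() and pivot_wider(), although they remain available.

```r
library(tidyr)

# Hypothetical wide data: one column per semester
wide <- data.frame(
  name = c("aa", "bb"),
  sem1 = c(73, 78),
  sem2 = c(86, 67),
  stringsAsFactors = FALSE
)

# gather: wide -> long, producing key (sem) and value (per) columns
long <- gather(wide, key = "sem", value = "per", sem1, sem2)

# spread: long -> wide, the reverse of gather
back <- spread(long, key = sem, value = per)

# unite: combine day, month, and year columns into one ddate column
parts <- data.frame(day = "22", month = "12", year = "2019",
                    stringsAsFactors = FALSE)
united <- unite(parts, "ddate", day, month, year, sep = "/")

print(long)
print(back)
print(united)
```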
Lubridate Package
• The lubridate package makes it easier to work with dates and times.
• Use the lubridate package to reduce the issues that come up when working with date-time variables in R. The built-in functions of this package make it easy to parse dates and times. Lubridate is used with data that contains time-based values.
• Three basic tasks are accomplished using lubridate: updating date-times (the update function), working with durations (the duration functions), and extracting date components (the date extraction functions).
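The three tasks named above can be sketched as follows, using the date string from the tidyr example. The variable names are chosen here for illustration.

```r
library(lubridate)

# Parse a day/month/year string into a Date
d <- dmy("22/12/2019")

# Extraction functions: pull out individual components
yr  <- year(d)
mon <- month(d)
dy  <- day(d)

# update: change one component and keep the rest
d_2020 <- update(d, year = 2020)

# Duration: add an exact span of 10 days
later <- d + ddays(10)

print(c(yr, mon, dy))
print(d_2020)
print(later)
```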
Working with Base R Graphics
• ggplot2 Package: ggplot2 offers a wide range of colors and patterns.
• ggplot2 is included in the tidyverse collection of packages.
• You must be proficient with plotting at least 3
graphs –
– Scatter Plot
– Bar Plot
– Histogram.
Elements of ggplot2
• Data: The data set for which we want to plot a graph.
• Aesthetics: The mappings from variables in the data to visual properties of the plot. We can map the x-axis, y-axis, fill, col, shape, and size.
• Geometry: The visual elements used to plot the data.
• Facet: The groups by which we divide the data.
• ggplot2 functions prefer data in the 'long' format, i.e., a column for every dimension and a row for every observation.
• Well-structured data will save you lots of time when making figures with ggplot2.
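The four elements can be seen together in a small sketch. The `marks` data frame, including the `dept` grouping, is hypothetical and is already in long format.

```r
library(ggplot2)

# Data: hypothetical long-format marks, one row per observation
marks <- data.frame(
  sem = rep(1:4, times = 2),
  per = c(73, 78, 86, 67, 84, 60, 80, 74),
  dept = rep(c("CSE", "ECE"), each = 4)
)

# Aesthetics: sem on x, per on y, colour by dept
# Geometry:   points
# Facet:      one panel per dept
p <- ggplot(marks, aes(x = sem, y = per, colour = dept)) +
  geom_point(size = 3) +
  facet_wrap(~ dept)

print(p)
```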
Box Plot
To Save a Plot
• ggsave("name_of_file.png", Surveys_plot, width = 15, height = 10)
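As a sketch of how a box plot might be drawn and then saved with ggsave, the example below uses a hypothetical `marks` data frame and writes the image to a temporary directory. The object name `box` and the file name are assumptions for this illustration, not the `Surveys_plot` object referenced above.

```r
library(ggplot2)

# Hypothetical percentages for three departments
marks <- data.frame(
  dept = rep(c("CSE", "ECE", "IT"), each = 4),
  per = c(73, 78, 86, 67, 84, 60, 80, 74, 65, 90, 71, 77)
)

# Box plot: distribution of per within each dept
box <- ggplot(marks, aes(x = dept, y = per)) +
  geom_boxplot()

# Save the plot; width and height are in inches by default
out_file <- file.path(tempdir(), "marks_boxplot.png")
ggsave(out_file, box, width = 6, height = 4)
```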
Scatter Plot
• A scatter plot is a graph in which the values of two variables are plotted along two axes, with the pattern of the resulting points revealing any correlation present.
• With scatter plots we can explain how the variables relate to each other, which is known as correlation. Positive, negative, and none (no correlation) are the three types of correlation.
Advantages of a Scatter Diagram
• The relationship between two variables can be viewed.
• It is well suited to revealing non-linear patterns.
• Maximum and minimum values can be easily determined.
• Observations and readings are easy to understand.
• Plotting the diagram is very simple.
Limitations of a Scatter Diagram
• With scatter diagrams we cannot get the exact extent of correlation.
• A quantitative measure of the relationship between the variables cannot be read from the plot. It gives only a qualitative picture.
• The relationship can be shown for only two variables at a time.
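To get a number for the extent of correlation that the plot alone cannot give, base R's cor() can be used alongside the scatter plot. This is a small sketch using the `sem` and `per` values that appear in the plotting examples of this unit.

```r
# Semester numbers and percentage of marks
sem <- c(1, 2, 3, 4, 5, 6, 7, 8)
per <- c(73, 78, 86, 67, 84, 60, 80, 74)

# Pearson correlation coefficient, a value between -1 and 1
r <- cor(sem, per)

# The sign gives the direction, the magnitude gives the strength
direction <- if (r > 0) "positive" else if (r < 0) "negative" else "none"

print(r)
print(direction)
```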
Bar Plot
• A barplot (or barchart) is one of the most common types of graphic. It shows the relationship between a numeric variable and a categorical variable.
• Bar plots belong to a family of common chart types that also includes line graphs, pie charts, and diagrams.
Advantages of Bar plot:
• Bar charts are easy to understand and interpret.
• The relationship between bar size and value makes comparison easy.
• They're simple to create.
• They can help in presenting very large or very small values easily.
Histogram
• A histogram represents the frequency distribution of a continuous variable, while a bar graph is a diagrammatic comparison of discrete variables. A histogram presents numerical data, whereas a bar graph shows categorical data. The histogram is drawn in such a way that there is no gap between the bars.
• Advantages of Histogram: A histogram helps to identify how the values are spread, how frequently they occur in the dataset, and patterns that are difficult to interpret in a tabular form. It helps to visualize the distribution of the data.
• Limitations of Histogram: A histogram can be misleading, because its appearance depends on how many bars are used. Only two sets of data are used, but to analyze certain types of statistical data, more than two sets of data are necessary.
# Shared data: percentage of marks (per) for each semester (sem)
sem <- c(1, 2, 3, 4, 5, 6, 7, 8)
per <- c(73, 78, 86, 67, 84, 60, 80, 74)
ds <- data.frame(sem, per)
# Scatter plot: sem on the x-axis, per on the y-axis
plot(ds$per ~ ds$sem, xlab = 'sem', ylab = 'per', main = 'sem vs per', col = 'blue', pch = 16)
# Bar plot: one bar per semester
barplot(per, names.arg = sem)
# Histogram: distribution of the percentages
hist(ds$per, xlab = 'per', col = "orange", main = 'Percentage of Marks')
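The plots above use base R graphics. A rough ggplot2 equivalent of the same three graphs might look like the sketch below; the object names and the choice of 5 bins are assumptions made for this example.

```r
library(ggplot2)

# Same data as the base R examples
ds <- data.frame(
  sem = c(1, 2, 3, 4, 5, 6, 7, 8),
  per = c(73, 78, 86, 67, 84, 60, 80, 74)
)

# Scatter plot: sem on x, per on y
scatter <- ggplot(ds, aes(x = sem, y = per)) +
  geom_point(colour = "blue") +
  labs(title = "sem vs per")

# Bar plot: one bar per semester, bar height taken from per
bars <- ggplot(ds, aes(x = factor(sem), y = per)) +
  geom_col() +
  labs(x = "sem")

# Histogram: distribution of per, split into 5 bins
histo <- ggplot(ds, aes(x = per)) +
  geom_histogram(bins = 5, fill = "orange", colour = "white") +
  labs(title = "Percentage of Marks")

print(scatter)
print(bars)
print(histo)
```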
