UNIT 3 - Exploratory Graphs

Uploaded by

VIGNESH BABU T R

VCET R 2021 21PCS02- Exploratory Data Analysis 2024

UNIT 3 – EXPLORING THE DATA VISUALLY

• Principles of Analytic Graphics
• Show comparisons
• Show multivariate data
• Exploratory Graphs: Characteristics of exploratory graphs
• Boxplot, Histogram, Barplot, Scatterplots
• Plotting Systems: The Base Plotting System
• The ggplot2 System
• CO3: Experiment with the statistics and group the nature of the data [K3]

Exploratory Graphs

• Visualizing data with graphics is especially important in the early stages of data analysis: it helps you understand basic properties of the data, find simple patterns, and suggest possible modeling strategies.
• In later stages of an analysis, graphics can be used to “debug” the analysis when an unexpected (but not necessarily wrong) result occurs and, ultimately, to communicate your findings to others.

Characteristics of Exploratory Graphs

• Exploratory graphs are usually made very quickly, and many of them are made while checking out the data.
• The goal is usually to develop a personal understanding of the data and to prioritize tasks for follow-up.
• Graphs can reveal patterns, outliers, and relationships that are not immediately apparent from the raw data.
• Details such as axis orientation and legends are present but rough; they are cleaned up and prettified only if the graph will later be used for communication.
• Color and plot symbol size are often used to convey additional dimensions of information.
• Some common exploratory graphs that can be created with ggplot2 in R include:
1. Scatter plot: visualizes the relationship between two variables. A scatter plot matrix extends this to all pairs of variables in a data set.
2. Histogram: visualizes the distribution of a single variable.
3. Box plot: visualizes the distribution of a variable and identifies outliers.
4. Bar plot: represents categories of data with rectangular bars whose lengths are proportional to the values they represent. Bars can be drawn horizontally or vertically.

Simple Summaries: One Dimension

• For one-dimensional summaries, there are a number of options in R.
• Five-number summary: gives the minimum, 25th percentile, median, 75th percentile, and maximum of the data, and is a quick check on the distribution of the data (see the fivenum() function).
• Boxplots: a visual representation of the five-number summary plus a bit more information. In particular, boxplots commonly plot outliers that go beyond the bulk of the data. This is implemented via the boxplot() function.
• Barplot: useful for visualizing categorical data, with the height of each bar proportional to the number of entries in that category. Think “pie chart” but actually useful. The barplot can be made with the barplot() function.
• Histograms: show the complete empirical distribution of the data, beyond the five data points shown by the boxplot. Here you can easily check skewness, symmetry, multi-modality, and other features. The hist() function makes a histogram, and a handy companion is the rug() function.
• Density plot: the density() function computes a non-parametric estimate of the distribution of a variable.
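A minimal sketch of three of these one-dimensional summaries follows. It assumes the built-in cars data set purely for illustration (any numeric vector would do); the variable names x, five and d are placeholders, not part of the original slides.

```r
# Illustrative data: stopping distances (ft) from the built-in cars data set
data(cars)
x <- cars$dist

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)
print(five)

# Histogram of the values, with a rug marking each individual observation
hist(x, col = "grey", main = "Stopping distance", xlab = "Distance (ft)")
rug(x)

# Non-parametric (kernel) density estimate of the same variable
d <- density(x)
plot(d, main = "Density of stopping distance")
```

The rug is useful because the histogram hides where exactly the observations fall inside each bin.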

Five-number summary

• The five-number summary belongs to descriptive statistics, the part of statistics that deals with collecting, analyzing, interpreting, and presenting data in an organized manner.
Calculating the five-number summary
• To find the five-number summary, the data must be sorted. If it is not, sort it in ascending order first.
• Minimum value: the smallest number in the data, i.e. the first number once the data is sorted in ascending order.
• Maximum value: the largest number in the data, i.e. the last number once the data is sorted in ascending order.

• Median: the middle value of the sorted data. With n the count of numbers in the dataset, its position is given by
• Median = ((n + 1)/2)th term
• Quartile 1: the middle value between the minimum and the median. For a small dataset you can simply pick out this middle value; for a large dataset it is better to use the formula
• Quartile 1 = ((n + 1)/4)th term
• Quartile 3: the middle value between the median and the maximum.
• Quartile 3 = (3(n + 1)/4)th term

Example

Question: Find the five-number summary for the data 10, 20, 5, 15, 25, 30, 8.
• Solution:
• Step 1: Sort the data in ascending order.
• 5, 8, 10, 15, 20, 25, 30
• Step 2: Read the minimum, maximum, and median from the sorted data. With n = 7, the median is the ((7 + 1)/2)th = 4th term.
• Minimum = 5
• Maximum = 30
• Median = 15

• Now find the 1st and 3rd quartiles, either by using the formula or by picking the center value. For this data both give the same result.
• The formula for Quartile 1 is ((n + 1)/4)th term, where n is the count of numbers in the dataset.
• n = 7 because there are 7 numbers in the data.
• Quartile 1 = ((7 + 1)/4)th term
  = (8/4)th term
  = 2nd term
• The 2nd term is 8, so Quartile 1 = 8.
• In the same way, find Quartile 3 using the formula (3(n + 1)/4)th term.
• Quartile 3 = (3(7 + 1)/4)th term
  = (3(8)/4)th term
  = (24/4)th term
  = 6th term
• The 6th term is 25, so Quartile 3 = 25.
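The hand calculation can be checked in R. The sketch below is only an illustration: it indexes the sorted data directly at the (n + 1)-based positions, which works here because those positions are whole numbers for n = 7. Be aware that R's built-in fivenum() uses Tukey's hinges and that quantile() defaults to a different interpolation rule, so both can return slightly different quartiles from the hand method; quantile() with type = 6 uses the (n + 1)-based positions. The names s, q1, med, q3 and by_hand are placeholders.

```r
x <- c(10, 20, 5, 15, 25, 30, 8)

# Step 1: sort in ascending order
s <- sort(x)
n <- length(s)

# Step 2: pick terms by position (positions are whole numbers for n = 7)
q1  <- s[(n + 1) / 4]        # 2nd term
med <- s[(n + 1) / 2]        # 4th term
q3  <- s[3 * (n + 1) / 4]    # 6th term

by_hand <- c(min = s[1], Q1 = q1, median = med, Q3 = q3, max = s[n])
print(by_hand)               # 5 8 15 25 30

# quantile() with type = 6 follows the same (n + 1)-based rule
print(quantile(x, probs = c(0.25, 0.5, 0.75), type = 6))

# fivenum() uses Tukey's hinges and gives 5 9 15 22.5 30 for this data
print(fivenum(x))
```

The difference between 8 and 9 for the first quartile is not an error: several quartile conventions exist, and they agree only for some sample sizes.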

Boxplot

• A box plot shows five descriptive statistics of the data: the minimum and maximum values (excluding outliers), the median, and the first and third quartiles.
Syntax: boxplot(data, xlab, ylab)
• where:
• data is the data vector to be represented on the y-axis
• xlab is the label given to the x-axis
• ylab is the label given to the y-axis

• The basic command is boxplot(), and it accepts a range of further options.
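As a hedged illustration of the syntax, the sketch below draws boxplots from the built-in airquality data set (chosen here as an assumption; it is not the data used on the original slide). The second call uses the formula interface to place one box per month side by side, which is a simple way to show comparisons.

```r
data(airquality)

# One boxplot for a single numeric vector (missing values are ignored)
boxplot(airquality$Ozone, xlab = "All months", ylab = "Ozone (ppb)",
        col = "lightblue")

# Side-by-side boxplots: Ozone split by Month using a formula
stats <- boxplot(Ozone ~ Month, data = airquality,
                 xlab = "Month", ylab = "Ozone (ppb)", col = "lightblue")

# boxplot() invisibly returns what it drew; each column of stats$stats holds
# the lower whisker, lower hinge, median, upper hinge and upper whisker
print(stats$stats)
```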



Histogram

• A histogram is a graph whose bars represent the frequency of grouped (binned) data in a vector. It looks like a bar chart; the difference is that a histogram shows how often values fall into each group rather than the values themselves.
Syntax: hist(x, col, border, main, xlab, ylab)
• where:
• x is the data vector
• col specifies the fill color of the bars
• border specifies the border color of the bars
• main specifies the title of the histogram
• xlab specifies the x-axis label
• ylab specifies the y-axis label
• The basic command in R is hist().
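The parameters listed above can be tried out as follows. This is a sketch under the assumption that the built-in airquality data set is an acceptable stand-in for the slide's original example; the breaks argument in the second call is an additional hist() option, not one listed above.

```r
data(airquality)

# Histogram using the parameters described above
h <- hist(airquality$Temp,
          col = "lightgreen", border = "darkgreen",
          main = "Daily maximum temperature",
          xlab = "Temperature (degrees F)", ylab = "Number of days")

# The grouping can be made finer by suggesting more bins via breaks
hist(airquality$Temp, breaks = 20, col = "lightgreen",
     main = "Daily maximum temperature (finer bins)",
     xlab = "Temperature (degrees F)")

# hist() returns the bin boundaries and the count in each bin
print(h$breaks)
print(h$counts)
```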

Barplot

• A bar plot (bar chart) in R represents the values in a data vector as the heights of bars. The data vector passed to the function is plotted along the y-axis of the graph.
• A bar chart can show frequencies, much like a histogram, if you pass the output of the table() function instead of a raw data vector.
• Syntax: barplot(data, xlab, ylab)
• where:
• data is the data vector to be represented on the y-axis
• xlab is the label given to the x-axis
• ylab is the label given to the y-axis
• The basic command for creating bar charts is barplot(). Both vertical and horizontal bars can be drawn.
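A short sketch of both uses, assuming the built-in mtcars data set for illustration: a vector of values drawn as bars (vertically, then horizontally with horiz = TRUE), and a frequency table produced by table().

```r
data(mtcars)

# Values as bar heights: mean miles per gallon for each cylinder count
avg_mpg <- tapply(mtcars$mpg, mtcars$cyl, mean)
barplot(avg_mpg, xlab = "Cylinders", ylab = "Mean miles per gallon")

# The same bars drawn horizontally
barplot(avg_mpg, horiz = TRUE, xlab = "Mean miles per gallon",
        ylab = "Cylinders")

# Frequencies: table() counts the cars in each category
cyl_counts <- table(mtcars$cyl)
barplot(cyl_counts, xlab = "Cylinders", ylab = "Number of cars")
```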

Scatter plot

• A scatter plot is used when you have two variables to plot against one another. R has a basic command for this task: plot().
• As usual with R, there are many additional parameters that you can add to customize your plots.
• The basic command is:
• plot(x, y, pch, xlab, ylab, xlim, ylim, col, bg, ...)
Where:
• x, y – the names of the variables (you can also use a formula of the form y ~ x to “tell” R how to present the data).
• pch – a number giving the plotting symbol to use. The default (1) produces an open circle (try values 0–25).
• xlab, ylab – character strings to use as axis labels.
• xlim, ylim – the limits of the axes in the form c(start, end).
• col – the colour for the plotting symbols.
• bg – the fill (background) colour for the fillable symbols (pch 21 to 25).
• … – there are many additional parameters that you might use.


Plotting Systems: The Base Plotting System

• There are three different plotting systems in R, each with different characteristics and modes of operation. The three systems are the base plotting system, the lattice system, and the ggplot2 system.
• The Base Plotting System
• The base plotting system is the original plotting system for R. The basic model is sometimes referred to as the “artist’s palette” model.
• The idea is that you start with a blank canvas and build up from there.
• In more R-specific terms, you typically start with the plot function (or a similar plot-creating function) to initiate a plot and then annotate it with various annotation functions (text, lines, points, axis).
• The base plotting system is often the most convenient plotting system to use because it mirrors how we sometimes think of building plots and analyzing data.
• If we don’t have a completely well-formed idea of how we want to look at some data, we’ll often start by “throwing some data on the page” and then slowly add more information to it as our thought process evolves.

Example: Cars Dataset

> data(cars)
> ## Create the plot / draw canvas
> with(cars, plot(speed, dist))
> ## Add annotation
> title("Speed vs. Stopping distance")
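To illustrate the "build up from a blank canvas" idea a little further, the sketch below adds more annotation layers to the same cars plot. The fitted line, the highlighted subset and the label position are choices made here for illustration, not part of the original example; abline() is a base annotation function alongside text, lines, points and axis.

```r
data(cars)

# Initiate the plot
with(cars, plot(speed, dist,
                xlab = "Speed (mph)", ylab = "Stopping distance (ft)"))
title("Speed vs. Stopping distance")

# Annotate: add a least-squares line
fit <- lm(dist ~ speed, data = cars)
abline(fit, col = "red", lwd = 2)

# Annotate: overplot the fastest cars with filled blue points
fast <- subset(cars, speed > 20)
points(fast$speed, fast$dist, pch = 19, col = "blue")

# Annotate: add an explanatory label inside the plot region
text(5, 110, "blue = speed above 20 mph", pos = 4)
```

Each call after plot() draws on top of the existing picture, so the order of the calls determines what ends up in front.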

The Lattice System

• The lattice plotting system is implemented in the lattice package, which comes with every installation of R (although it is not loaded by default).
• To use the lattice plotting functions, you must first load the lattice package with the library function.
• > library(lattice)
• With the lattice system, plots are created with a single function call, such as xyplot or bwplot.
• There is no real distinction between functions that create or initiate plots and functions that annotate plots, because it all happens at once.
• Lattice plots tend to be most useful for conditioning types of plots, i.e. looking at how y changes with x across levels of z.
• These types of plots are useful for looking at multidimensional data and often allow you to squeeze a lot of information into a single window or page.
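A conditioning plot of this kind might look like the sketch below, which assumes the built-in mtcars data set: fuel economy (y) against weight (x), with one panel per number of cylinders (z). Everything is specified in the single xyplot call.

```r
library(lattice)
data(mtcars)

# y ~ x | z: one panel of mpg against wt for each level of cyl
p <- xyplot(mpg ~ wt | factor(cyl), data = mtcars,
            layout = c(3, 1),
            xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
            main = "Fuel economy vs. weight, by number of cylinders")

# Lattice functions return an object; printing it draws the plot
print(p)
```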

The ggplot2 System

• The ggplot2 plotting system attempts to split the difference between base and lattice in a number of ways.
• The ggplot2 system automatically deals with spacing, text, and titles, but also allows you to annotate by “adding” to a plot.
• The ggplot2 system is implemented in the ggplot2 package, which is available from CRAN (it does not come with R).
• You can install it from CRAN via
• > install.packages("ggplot2")
• and then load it into R via the library function.
• > library(ggplot2)
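Once the package is installed, a plot is built by "adding" layers with the + operator. The sketch below reuses the cars data set as an assumed example and presumes ggplot2 is already installed; the axis labels are illustrative.

```r
library(ggplot2)
data(cars)

# Start with the data and the aesthetic mapping, then add a point layer
p <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point()

# Annotate by adding to the existing plot object
p <- p + labs(title = "Speed vs. Stopping distance",
              x = "Speed (mph)", y = "Stopping distance (ft)")

# Printing the object draws the plot
print(p)
```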

Thank You
