UNIT 3 - Exploratory Graphs
Show comparisons
The ggplot2
CO3: Experiment with the statistics and group the nature of the data [K3]
VCET R 2021 21PCS02- Exploratory Data Analysis 2024
Exploratory Graphs
Visualizing the data via graphics can be important at the beginning stages of data analysis:
to understand basic properties of the data,
to find simple patterns in the data, and to suggest possible modeling strategies.
In later stages of an analysis, graphics can be used to “debug” an analysis
if an unexpected (but not necessarily wrong) result occurs, or ultimately to communicate your findings to
others.
Exploratory graphs are usually made very quickly and a lot of them are made in the process of checking out the data.
The goal of making exploratory graphs is usually to develop a personal understanding of the data
and to prioritize tasks for follow-up.
Graphs can reveal patterns, outliers, and relationships within the data that may not be immediately apparent from the raw data.
Details like axis orientation or legends, while present, are generally cleaned up and prettified if the graph is going to be used for
communication later.
Often color and plot symbol size are used to convey various dimensions of information.
Some common examples of exploratory graphs that can be created in R, with either base graphics or ggplot2, include:
1. Scatter plots: Used to visualize the relationship between two variables. A matrix of scatter plots can show the relationships
between all pairs of variables in a data set.
2. Histograms: Used to visualize the distribution of a single variable.
3. Box plots: Used to visualize the distribution of a variable and to identify outliers.
4. Bar plots: Represent categories of data with rectangular bars whose lengths are proportional to the
values they represent. Bar plots can be drawn horizontally or vertically.
Five-number summary
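The five-number summary is a concept from statistics, the field that deals with collecting, analyzing,
interpreting, and presenting data in an organized manner. It describes a dataset with five values:
the minimum, the first quartile, the median, the third quartile, and the maximum.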
Calculating 5 number summary
To find the five-number summary, the data must be sorted. If it is not, sort it in ascending order first.
Minimum Value: It is the smallest number in the given data, and the first number when it is sorted in ascending order.
Maximum Value: It is the largest number in the given data, and the last number when it is sorted in ascending order.
Median: The middle value of the sorted data. For an odd number of values n, the position is given by
Median = ((n + 1)/2)th term
For an even n, the median is the average of the (n/2)th and (n/2 + 1)th terms.
Quartile 1: The middle value between the minimum and the median. For a small dataset we can simply pick the
middle value between the minimum and the median. For a large dataset it is better to use the formula
Quartile 1 = ((n + 1)/4)th term
Quartile 3: Middle/center value between median and maximum value.
Quartile 3 = (3(n + 1)/4)th term
Example
Question: Find the 5 number summary for the given data 10, 20, 5, 15, 25, 30, 8
Solution:
Step-1 Sort the given data in ascending order.
5, 8, 10, 15, 20, 25, 30
Step-2
The minimum is the first value, the maximum is the last value, and the median is the ((7 + 1)/2)th = 4th term of the sorted data.
So, Minimum = 5
Maximum = 30
Median = 15
Now find the 1st and 3rd quartiles, either by using the formula or by picking the center value. Both give the same result here.
For Quartile-1 Formula is ((n + 1)/4)th term where n is the count of numbers in the dataset.
n = 7 because there are 7 numbers in the data.
Quartile-1 = ((7 + 1)/4)th term
= (8/4)th term
= 2nd term
The 2nd term is 8, so Quartile-1 = 8
In the same way find the quartile-3 using the formula (3(n + 1)/4) th term.
Quartile 3 = (3(7 + 1)/4)th term
= (3(8)/4)th term
= (24/4)th term
= 6th term
The 6th term is 25, so Quartile-3 = 25
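The worked example above can be checked in R. The following is a minimal sketch: the variable names are illustrative, and indexing the sorted vector directly only works when (n + 1)/4 is a whole number, as it is for n = 7. To my understanding, the "(n + 1)th term" rule corresponds to `type = 6` in R's `quantile()`, whereas `fivenum()` and the default `quantile()` use different conventions and so report slightly different quartiles for the same data.

```r
# Data from the worked example
x <- c(10, 20, 5, 15, 25, 30, 8)

# Step 1: sort in ascending order
sorted <- sort(x)
n <- length(sorted)

# Step 2: pick the terms given by the position formulas
# (valid here because the positions are whole numbers for n = 7)
five <- c(
  minimum = sorted[1],
  q1      = sorted[(n + 1) / 4],
  median  = sorted[(n + 1) / 2],
  q3      = sorted[3 * (n + 1) / 4],
  maximum = sorted[n]
)
print(five)   # 5, 8, 15, 25, 30

# The same quartiles via quantile(), using the (n + 1) convention
q6 <- quantile(x, probs = c(0.25, 0.5, 0.75), type = 6)
print(q6)     # 8, 15, 25

# Tukey's five-number summary uses hinges, so Q1 and Q3 differ
print(fivenum(x))   # 5, 9, 15, 22.5, 30
```

The difference between 8/25 and 9/22.5 is not an error: several quartile conventions are in common use, and they agree only for some sample sizes.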
Boxplot
A box plot is a data plot type that shows a set of five descriptive statistics of the data: the minimum and maximum
values (excluding the outliers), the median, and the first and third quartiles.
Syntax: boxplot(data, xlab, ylab)
where:
data is the data vector to be represented on y-axis
xlab is the label given to x-axis
ylab is the label given to y-axis
Example
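A minimal sketch of a box plot in base R, reusing the seven values from the five-number summary example. The axis labels and the variable name `bp` are illustrative choices. Note that `boxplot()` draws the box at Tukey's hinges, so the box edges (9 and 22.5) differ slightly from the quartiles found by hand with the (n + 1) formulas (8 and 25).

```r
# Data from the five-number summary example
x <- c(10, 20, 5, 15, 25, 30, 8)

# Draw the box plot; the values are shown along the y-axis
bp <- boxplot(x, xlab = "Sample", ylab = "Value")

# boxplot() invisibly returns the statistics it drew:
# lower whisker, lower hinge, median, upper hinge, upper whisker
print(bp$stats)   # 5, 9, 15, 22.5, 30

# Points beyond the whiskers would be listed here (none for this data)
print(bp$out)
```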
Histogram
A histogram is a graph whose bars represent the frequency of grouped (binned) data
in a vector. It resembles a bar chart, but the difference is that a histogram shows the frequency of
grouped data rather than the data values themselves.
Syntax: hist(x, col, border, main, xlab, ylab)
where:
x is data vector
col specifies the color of the bars to be filled
border specifies the color of border of bars
main specifies the title name of histogram
xlab specifies the x-axis label
ylab specifies the y-axis label
The basic command in R is hist()
Example
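A minimal sketch of `hist()` using the arguments listed above. It assumes the built-in `cars` data set (also used later in these notes); the colours, title, and labels are illustrative choices.

```r
# Built-in data set: stopping distances (ft) for 50 cars
data(cars)

# Draw the histogram with fill colour, border colour, title, and axis labels
h <- hist(cars$dist,
          col    = "lightblue",
          border = "black",
          main   = "Distribution of stopping distance",
          xlab   = "Stopping distance (ft)",
          ylab   = "Frequency")

# hist() invisibly returns the bin boundaries and the count in each bin
print(h$breaks)
print(h$counts)
```

Every observation falls into exactly one bin, so the counts add up to the number of rows in `cars`.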
Barplot
A bar plot (or bar chart) in R represents the values in a data vector as the heights of bars. The data vector
passed to the function is shown along the y-axis of the graph.
A bar chart can behave like a histogram when the output of the table() function is passed instead of a raw data vector.
Syntax: barplot(data, xlab, ylab)
where:
data is the data vector to be represented on y-axis
xlab is the label given to x-axis
ylab is the label given to y-axis
The basic command is the barplot() function. Both vertical and horizontal bars can be drawn.
Example
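A minimal sketch of `barplot()`. The numeric values are reused from the five-number summary example; the bar labels and the `grades` vector are hypothetical, made up purely for illustration. The `horiz = TRUE` argument draws horizontal bars, and `table()` turns raw categories into frequencies.

```r
# Heights of the bars (values from the five-number summary example)
values <- c(10, 20, 5, 15, 25, 30, 8)
names(values) <- paste0("obs", seq_along(values))   # hypothetical bar labels

# Vertical bars: the values are shown along the y-axis
mids <- barplot(values, xlab = "Observation", ylab = "Value")

# Horizontal bars: the same data with horiz = TRUE
barplot(values, horiz = TRUE, xlab = "Value", ylab = "Observation")

# Histogram-like use: count each category with table() first
grades <- c("A", "B", "A", "C", "B", "A")   # hypothetical categorical data
freq <- table(grades)
barplot(freq, xlab = "Grade", ylab = "Frequency")
```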
Scatter plot
A scatter plot is used when you have two variables to plot against one another. R has a basic command to perform this task.
The command is plot().
As usual with R there are many additional parameters that you can add to customize your plots.
The basic command is:
plot(x, y, pch, xlab, ylab, xlim, ylim, col, bg, ...)
Where:
x, y – the names of the variables (you can also use a formula of the form y ~ x to “tell” R how to present the data).
pch – a number giving the plotting symbol to use. The default (1) produces an open circle (try values 0–25).
xlab, ylab – character strings to use as axis labels.
xlim, ylim – the limits of the axes in the form c(start, end).
col – the colour for the plotting symbols.
bg – the fill (background) colour for the fillable symbols (pch 21–25).
… – there are many additional parameters that you might use.
Example
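A minimal sketch of `plot()` using the parameters described above. It assumes the built-in `cars` data set; the symbol, colours, labels, and axis limits are illustrative choices.

```r
# Built-in data set: speed (mph) and stopping distance (ft) of 50 cars
data(cars)

# Scatter plot with a filled circle (pch = 21): col sets the outline,
# bg sets the fill colour
plot(cars$speed, cars$dist,
     pch  = 21,
     col  = "black",
     bg   = "steelblue",
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)",
     xlim = c(0, 30),
     ylim = c(0, 130))

# The same relationship using the formula interface y ~ x
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
```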
References
Roger D. Peng, “Exploratory Data Analysis with R”, 1st edition, Leanpub, 2020.
https://bookdown.org/rdpeng/exdata/principles-of-analytic-graphics.html
https://www.geeksforgeeks.org/iraq-war/
https://www.analyticsvidhya.com/blog/2021/05/five-number-summary-for-analysis/
There are three different plotting systems in R, each with its own characteristics and mode of operation.
The three systems are the base plotting system, the lattice system, and the ggplot2 system.
The Base Plotting System
The base plotting system is the original plotting system for R. The basic model is sometimes referred to as the
“artist’s palette” model.
The idea is that you start with a blank canvas and build up from there.
In more R-specific terms, you typically start with the plot() function (or a similar plot-creating function) to initiate a plot
and then annotate the plot with various annotation functions (text, lines, points, axis).
The base plotting system is often the most convenient plotting system to use because it mirrors how we sometimes
think of building plots and analyzing data.
If we don’t have a completely well-formed idea of how we want to look at some data, often we’ll start by “throwing
some data on the page” and then slowly add more information to it as our thought process evolves.
Plotting Systems
data(cars)

## Create the plot / draw canvas
with(cars, plot(speed, dist))

## Add annotation
title("Speed vs. Stopping distance")
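The "build up from a blank canvas" idea can be taken a step further with the annotation functions named earlier (lines, points, text). The following is a sketch only: the smoothing curve, the highlighted point, and the label position are illustrative choices, not part of the original example.

```r
data(cars)

## Create the plot / draw canvas
with(cars, plot(speed, dist,
                xlab = "Speed (mph)", ylab = "Stopping distance (ft)"))

## Add a title
title("Speed vs. Stopping distance")

## Add a smooth trend curve on top of the existing points
smooth <- lowess(cars$speed, cars$dist)
lines(smooth, col = "red", lwd = 2)

## Highlight the observation with the longest stopping distance
longest <- cars[which.max(cars$dist), ]
points(longest$speed, longest$dist, pch = 19, col = "blue")

## Label the highlighted point (pos = 2 places the text to its left)
text(longest$speed, longest$dist, labels = "longest stop", pos = 2)
```

Each call adds to the plot that is already on the device, which is why the order of the calls matters in the base system.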
Thank You