
Lecture#06 - EDA2 - Graphical Data Analysis

Lecture #06 focuses on Exploratory Data Analysis through graphical techniques, emphasizing the importance of data visualization for analyzing relationships and characteristics. It covers various graphical presentations for univariate, bivariate, and spatial data, including bar charts, histograms, scatter plots, and box plots, along with guidelines on selecting appropriate graphs based on data type and analysis goals. The lecture concludes with a discussion on the effectiveness of visual aids in understanding complex data.

Uploaded by nafij.alam2000

Lecture #06: Exploratory Data Analysis 2: Graphical Data Analysis
Menu
• Introduction
• Visualization of Data
• Graphical Presentation of Univariate Data
  - Bar Chart
  - Histogram
  - Dot Plots
  - Stem-and-Leaf Plots
  - Box-and-Whisker Plots
  - Pie and Donut Diagrams
• Graphical Presentation of Bivariate Data
  - Scatter Plots
• Graphical Presentation of Spatial Data
  - Contour Plots
  - Matrix Plots
• Other Types of Graphs
  - Star/Kite Plot
  - Chernoff Faces
• Concluding Remarks
Visualization
• Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.
• Visualization of data is one of the most powerful and appealing techniques for data exploration.
  - Humans have a well-developed ability to analyze large amounts of information that is presented visually
  - Can detect general patterns and trends
  - Can detect outliers and unusual patterns
Example: Sea Surface Temperature
• The following shows the Sea Surface Temperature (SST) for July 1982
  - Tens of thousands of data points are summarized in a single figure
Representation
• Is the mapping of information to a visual format
• Data objects, their attributes, and the relationships among data objects are translated into graphical elements such as points, lines, shapes, and colors.
• Example:
  - Objects are often represented as points
  - Their attribute values can be represented as the position of the points or the characteristics of the points, e.g., color, size, and shape
  - If position is used, then the relationships of points, i.e., whether they form groups or a point is an outlier, are easily perceived.
Arrangement
• Is the placement of visual elements within a display
• Can make a large difference in how easy it is to understand the data
• Example:
Selection
• Is the elimination or the de-emphasis of certain objects and attributes
• Selection may involve choosing a subset of attributes
  - Dimensionality reduction is often used to reduce the number of dimensions to two or three
  - Alternatively, pairs of attributes can be considered
• Selection may also involve choosing a subset of objects
  - A region of the screen can only show so many points
  - Can sample, but want to preserve points in sparse areas
Goals of Graphing

1. Presentation of Descriptive Statistics
2. Presentation of Evidence
3. Some people understand subject matter better with visual aids
4. Provide a sense of the underlying data generating process (scatter plots)
Which graph to use?
• Depends on type of data
• Depends on what you want to illustrate
• Depends on available statistical software
Bar Chart
[Figure: bar chart, "Birth Order of Spring 1998 Stat 250 Students". x-axis: Birth Order (Middle, Oldest, Only, Youngest); y-axis: Percent (ticks 10 to 40); n = 92 students.]
Bar Chart
• Summarizes categorical data.
• Horizontal axis represents categories, while vertical axis represents either counts (“frequencies”) or percentages (“relative frequencies”).
• Used to illustrate the differences in percentages (or counts) between categories.
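The counts and relative frequencies that a bar chart displays can be tabulated directly. The following is a minimal sketch, assuming a small made-up sample of birth-order labels (not the Stat 250 data); the function names are illustrative only.

```python
from collections import Counter

# Hypothetical categorical observations (not the Stat 250 data).
birth_order = ["Oldest", "Youngest", "Only", "Oldest", "Middle",
               "Youngest", "Oldest", "Middle", "Youngest", "Oldest"]

def relative_frequencies(values):
    """Return {category: (count, percent)} for categorical data."""
    counts = Counter(values)
    n = len(values)
    return {cat: (c, 100.0 * c / n) for cat, c in sorted(counts.items())}

def text_bar_chart(values, width=40):
    """Render one text bar per category, scaled by its percentage."""
    lines = []
    for cat, (count, pct) in relative_frequencies(values).items():
        bar = "#" * round(width * pct / 100)
        lines.append(f"{cat:<9}{bar} {pct:.0f}% (n={count})")
    return "\n".join(lines)

print(text_bar_chart(birth_order))
```

Plotting percentages rather than counts makes groups of different sizes comparable, which is why the vertical axis is often a relative frequency.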
Visualization Techniques: Histograms
• Histogram
  - Usually shows the distribution of values of a single variable
  - Divide the values into bins and show a bar plot of the number of objects in each bin.
  - The height of each bar indicates the number of objects in that bin
  - Shape of histogram depends on the number of bins
• Example: Petal Width (10 and 20 bins, respectively)
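To see how the number of bins changes the shape, the binning step can be sketched as follows. This is only an illustration on made-up numbers (not the Iris petal widths), and it assumes equal-width bins spanning the range of the data.

```python
def histogram_counts(values, n_bins):
    """Count values in n_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; bin width would be zero")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical measurements with a gap in the middle.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 7, 8, 8, 9, 9, 9, 10]

for n_bins in (3, 9):
    print(n_bins, histogram_counts(data, n_bins))
```

With 3 bins the gap around 6 is hidden inside a wide bin; with 9 bins it shows up as an empty bar.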
Two-Dimensional Histograms
• Show the joint distribution of the values of two attributes
• Example: petal width and petal length
  - What does this tell us?
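A joint distribution can be tabulated by counting how many (x, y) pairs fall in each cell of a grid. The sketch below uses hypothetical width and length pairs and hand-picked bin edges; both are assumptions for illustration.

```python
from collections import Counter

def joint_histogram(xs, ys, x_edges, y_edges):
    """Count (x, y) pairs per grid cell; edges are ascending bin boundaries."""
    def bin_index(v, edges):
        # Bin i covers [edges[i], edges[i+1]); the last bin also includes its upper edge.
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                return i
        return len(edges) - 2 if v == edges[-1] else None

    cells = Counter()
    for x, y in zip(xs, ys):
        i, j = bin_index(x, x_edges), bin_index(y, y_edges)
        if i is not None and j is not None:
            cells[(i, j)] += 1
    return cells

# Hypothetical (width, length) pairs; small widths go with small lengths.
widths = [0.2, 0.3, 0.2, 1.4, 1.5, 2.1, 2.3]
lengths = [1.4, 1.5, 1.3, 4.5, 4.7, 5.8, 6.0]

cells = joint_histogram(widths, lengths, [0, 1, 2, 3], [0, 2, 4, 6])
for cell, n in sorted(cells.items()):
    print(cell, n)
```

When most of the counts sit in a few cells along a diagonal, as they do here, the two attributes tend to increase together.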
Histogram
[Figure: histogram, "Age of Spring 1998 Stat 250 Students". x-axis: Age (in years), 18 to 27; y-axis: Frequency (Count), ticks 10 to 50; n = 92 students.]
Histogram
• Divide measurement up into equal-sized categories.
• Determine number (or percentage) of measurements falling into each category.
• Draw a bar for each category so bars’ heights represent number (or percent) falling into the categories.
• Label and title appropriately.
Too few categories
[Figure: histogram, "Age of Spring 1998 Stat 250 Students", drawn with only a few wide categories. x-axis: Age (in years), ticks at 18, 23, 28; y-axis: Frequency (Count), ticks 10 to 60; n = 92 students.]
Too many categories
[Figure: histogram, "GPAs of Spring 1998 Stat 250 Students", drawn with many narrow categories. x-axis: GPA, 2 to 4; y-axis: Frequency (Count); n = 92 students.]
Histograms
[Figure: histogram panels titled "Histogram". Panels with a frequency axis from 0 to 25 use bins 3.4 units wide (labels from "31 to 34.4" through "61.6 to 65"); a further panel with a frequency axis from 0 to 30 uses bins 4 units wide (labels from "34 to 38" through "70 to 74").]
Dot Plot

[Figure: dot plots, "Fastest Ever Driving Speed", 226 Stat 100 Students, Fall '98. One row of dots for 100 men and one for 126 women; x-axis: Speed, 70 to 160.]
Dot Plot
• Summarizes measurement data.
• Horizontal axis represents measurement scale.
• Plot one dot for each data point.
Stem-and-Leaf Plot
Stem-and-leaf of Shoes N = 139 Leaf Unit = 1.0

12 0 223334444444
63 0 555555555555566666666677777778888888888888999999999
(33) 1 000000000000011112222233333333444
43 1 555555556667777888
25 2 0000000000023
12 2 5557
8 3 0023
4 3
4 4 00
2 4
2 5 0
1 5
1 6
1 6
1 7
1 7 5
Stem and Leaf plots
Before:
N = 10   Median = 3.1   Quartiles = 3, 3.3
 2 : 88
 3 : 00112333

During:
N = 10   Median = 5.2   Quartiles = 2.3, 8.8
-1 : 0
-0 :
 0 :
 1 : 5
 2 : 33
 3 : 3
 4 :
 5 :
 6 :
 7 : 1
 8 : 68
 9 : 15

After:
N = 10   Median = 5.5   Quartiles = 4.1, 6.7
 3 : 24
 4 : 125
 5 :
 6 : 567
 7 : 3
High: 17
Stem-and-Leaf Plot
• Summarizes measurement data.
• Each data point is broken down into a “stem” and a “leaf.”
• First, “stems” are aligned in a column.
• Then, “leaves” are attached to the stems.
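The stem-and-leaf construction can be sketched in a few lines. This example assumes non-negative values recorded to one decimal place, so the integer part is the stem and the tenths digit is the leaf; the data are the ten values read off the "Before" display earlier in this lecture, and the function name is illustrative.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (integer part) to its sorted leaves (tenths digit).

    Assumes non-negative values recorded to one decimal place.
    """
    plot = defaultdict(str)
    for v in sorted(values):
        tenths = round(v * 10)            # e.g. 3.1 -> 31
        stem, leaf = divmod(tenths, 10)   # 31 -> stem 3, leaf 1
        plot[stem] += str(leaf)
    return dict(plot)

# Values read off the "Before" display (N = 10).
before = [2.8, 2.8, 3.0, 3.0, 3.1, 3.1, 3.2, 3.3, 3.3, 3.3]

for stem, leaves in stem_and_leaf(before).items():
    print(f"{stem} : {leaves}")
```

Unlike a full display, this sketch omits stems that have no leaves; printing empty stems as well keeps gaps in the data visible.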
Visualization Techniques: Box Plots
• Box Plots
  - Invented by J. Tukey
  - Another way of displaying the distribution of data
  - The following figure shows the basic parts of a box plot

[Figure: box plot with labeled parts, from top to bottom: outlier, 90th percentile, 75th percentile, 50th percentile, 25th percentile, 10th percentile.]
Example of Box Plots
• Box plots can be used to compare attributes
Box Plot
• Summarizes measurement data.
• Vertical (or horizontal) axis represents measurement scale.
• Lines in box represent the 25th percentile (“first quartile”), the 50th percentile (“median”), and the 75th percentile (“third quartile”), respectively.
An aside...
• Roughly speaking:
  - The “25th percentile” is the number such that 25% of the data points fall below the number.
  - The “median” or “50th percentile” is the number such that half of the data points fall below the number.
  - The “75th percentile” is the number such that 75% of the data points fall below the number.
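As a rough illustration of these definitions, the sketch below computes the quartiles as the medians of the lower and upper halves of the sorted data (Tukey's hinges). That convention is an assumption chosen here; statistical packages use several slightly different percentile rules. The data are the ten "Before" values from the stem-and-leaf display earlier in this lecture.

```python
from statistics import median

def hinges(values):
    """Return (Q1, median, Q3), taking Q1 and Q3 as the medians of each half.

    For an odd count the middle value is included in both halves.
    """
    data = sorted(values)
    half = (len(data) + 1) // 2
    return median(data[:half]), median(data), median(data[-half:])

before = [2.8, 2.8, 3.0, 3.0, 3.1, 3.1, 3.2, 3.3, 3.3, 3.3]
q1, med, q3 = hinges(before)
print(q1, med, q3)

# Share of points strictly below Q1: only "roughly" 25% when values are tied.
share_below_q1 = sum(v < q1 for v in before) / len(before)
print(share_below_q1)
```

The result agrees with the summary printed with that display (median 3.1, quartiles 3 and 3.3), while the share of points strictly below the first quartile is 20% rather than exactly 25% because of ties.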
Box Plot (cont’d)
• “Whiskers” are drawn to the most extreme data points that are not more than 1.5 times the length of the box beyond either quartile.
  - Whiskers are useful for identifying outliers.
• “Outliers,” or extreme observations, are denoted by asterisks.
  - Generally, data points falling beyond the whiskers are considered outliers.
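The whisker rule can be written out as a short calculation. This is a sketch under the same assumption as before (quartiles taken as the medians of each half), applied to the ten "After" values from the stem-and-leaf display, where 17 is listed as a high value; the function and key names are illustrative.

```python
from statistics import median

def box_plot_summary(values):
    """Quartiles, whisker ends, and outliers by the 1.5 x box-length rule."""
    data = sorted(values)
    half = (len(data) + 1) // 2
    q1, q3 = median(data[:half]), median(data[-half:])
    step = 1.5 * (q3 - q1)   # 1.5 times the length of the box
    inside = [v for v in data if q1 - step <= v <= q3 + step]
    outliers = [v for v in data if v < q1 - step or v > q3 + step]
    return {"q1": q1, "q3": q3,
            "whiskers": (inside[0], inside[-1]),
            "outliers": outliers}

# Values read off the "After" display (N = 10).
after = [3.2, 3.4, 4.1, 4.2, 4.5, 6.5, 6.6, 6.7, 7.3, 17.0]
summary = box_plot_summary(after)
print(summary)
```

Here the box runs from 4.1 to 6.7, so any point more than about 3.9 units beyond either quartile is flagged; only the value 17 qualifies, and the whiskers stop at 3.2 and 7.3.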
Using Box Plots to Compare
[Figure: side-by-side box plots, "Fastest Ever Driving Speed", 226 Stat 100 Students, Fall 1998. x-axis: Gender (female, male); y-axis: Fastest Speed (mph), 60 to 160.]
Pie Charts:
[Figure: pie chart, "Proportions of Donut-Eating Professors by Weight Class". Legend: 130-150, 151-185, 186-210, 211-240, 241-270, 271-310, 311+.]
Actually, why not use a donut graph? Duh!
[Figure: donut chart, "Proportions of Donut-Eating Professors by Weight Class". Legend: 130-150, 151-185, 186-210, 211-240, 241-270, 271-310, 311+.]

See Excel for other options!!!!


Which graph to use when?
• Stem-and-leaf plots and dot plots are good for small data sets, while histograms and box plots are good for large data sets.
• Box plots and dot plots are good for comparing two groups.
• Box plots are good for identifying outliers.
• Histograms and box plots are good for identifying “shape” of data.
Visual Displays of Bivariate Data

Variable 1     Variable 2     Display
Categorical    Categorical    Crosstabs
Categorical    Continuous     Box plots
Continuous     Continuous     Scatter plots


Visualization Techniques: Scatter Plots
• Scatter plots
  - Attribute values determine the position
  - Two-dimensional scatter plots are most common, but three-dimensional scatter plots are also possible
  - Often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects
  - Arrays of scatter plots can compactly summarize the relationships of several pairs of attributes
    - See example on the next slide
Scatter Plot Array of Iris Attributes
With Outlier and Out of Range Value
[Figure: 2 x 2 panel display for DURING and AFTER.
  Normal quantile plot of DURING (N = 10): M = 5.15, Sd = 3.67, Sk = -0.19, K = -1.51
  Scatter plot of AFTER vs. DURING: r = -0.57, B = -0.6, t = -1.97, p = 0.08, N = 10
  Scatter plot of DURING vs. AFTER: r = -0.57, B = -0.55, t = -1.97, p = 0.08, N = 10
  Normal quantile plot of AFTER (N = 10): M = 6.35, Sd = 3.82, Sk = 2.01*, K = 3.12*]
Without Outlier
[Figure: 2 x 2 panel display for DURING and AFTnew.
  Normal quantile plot of DURING (N = 10): M = 5.15, Sd = 3.67, Sk = -0.19, K = -1.51
  Scatter plot of AFTnew vs. DURING: r = -0.92, B = -0.37, t = -6.33, p = 0, N = 9
  Scatter plot of DURING vs. AFTnew: r = -0.92, B = -2.3, t = -6.33, p = 0, N = 9
  Normal quantile plot of AFTnew (N = 9): M = 5.17, Sd = 1.50, Sk = 0.10, K = -1.67]
With Corrected Out of Range Value
[Figure: 2 x 2 panel display for AFTnew and DURnew.
  Normal quantile plot of AFTnew (N = 9): M = 5.17, Sd = 1.50, Sk = 0.10, K = -1.67
  Scatter plot of DURnew vs. AFTnew: r = -0.92, B = -2.09, t = -6.4, p = 0, N = 9
  Scatter plot of AFTnew vs. DURnew: r = -0.92, B = -0.41, t = -6.4, p = 0, N = 9
  Normal quantile plot of DURnew (N = 10): M = 5.35, Sd = 3.37, Sk = 0.00, K = -1.81]
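The panels above report r = -0.57 with the outlier and r = -0.92 without it. The sensitivity of the correlation coefficient to a single point can be reproduced in spirit with a small sketch. The paired values below are hypothetical (the original pairing of DURING and AFTER values is not recoverable from the slides), and `pearson_r` is an illustrative helper written for this example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical pairs with a perfect negative trend (not the DURING/AFTER data).
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [9, 8, 7, 6, 5, 4, 3, 2, 1]

r_clean = pearson_r(xs, ys)
r_outlier = pearson_r(xs + [10], ys + [17])   # add one out-of-range point
print(round(r_clean, 2), round(r_outlier, 2))
```

One extreme point is enough to pull a perfect negative correlation close to zero, which is why it helps to look at the scatter plot before trusting the summary statistic.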
Scatter Plots
[Figure: scatter plot, "Foot sizes of Spring 1998 Stat 250 students". x-axis: Left foot (in cm), 22 to 31; y-axis: Right foot (in cm), 22 to 31; n = 88 students.]
Scatter Plots
• Summarizes the relationship between two measurement variables.
• Horizontal axis represents one variable and vertical axis represents second variable.
• Plot one point for each pair of measurements.
No relationship
[Figure: scatter plot, "Lengths of left forearms and head circumferences of Spring 1998 Stat 250 Students". x-axis: Head circumference (in cm), 52 to 62; y-axis: Left forearm (in cm), 22 to 32; n = 89 students.]
Scatter-Plots – positive and negative correlations
[Figure: two scatter plots, one illustrating a positive correlation and one a negative correlation; x-axes run from 0 to 60.]
Visualization Techniques: Contour Plots
• Contour plots
  - Useful when a continuous attribute is measured on a spatial grid
  - They partition the plane into regions of similar values
  - The contour lines that form the boundaries of these regions connect points with equal values
  - The most common example is contour maps of elevation
  - Can also display temperature, rainfall, air pressure, etc.
    - An example for Sea Surface Temperature (SST) is provided on the next slide
Contour Plot Example: SST Dec, 1998

[Figure: contour plot of SST for December 1998; color scale in degrees Celsius.]
Visualization Techniques: Matrix Plots
• Matrix plots
  - Can plot the data matrix
  - This can be useful when objects are sorted according to class
  - Typically, the attributes are normalized to prevent one attribute from dominating the plot
  - Plots of similarity or distance matrices can also be useful for visualizing the relationships between objects
  - Examples of matrix plots are presented on the next two slides
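One common way to normalize attributes before plotting a data matrix is to standardize each column to mean 0 and standard deviation 1. The sketch below assumes that choice (other normalizations, such as min-max scaling, are equally possible) and uses a small hypothetical matrix; it also assumes no column is constant.

```python
from statistics import mean, pstdev

def standardize_columns(rows):
    """Rescale each column to mean 0 and standard deviation 1 (z-scores).

    Assumes every column has at least two distinct values.
    """
    columns = list(zip(*rows))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]

# Hypothetical data matrix: rows are objects, columns are attributes
# measured on very different scales.
matrix = [[5.0, 200.0],
          [6.0, 400.0],
          [7.0, 600.0]]

z = standardize_columns(matrix)
for row in z:
    print([round(v, 2) for v in row])
```

After standardization both columns span the same range, so the attribute with the larger raw values no longer dominates the color scale.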
Visualization of the Iris Data Matrix

[Figure: plot of the Iris data matrix; color scale in units of standard deviation.]
Visualization of the Iris Correlation Matrix
Visualization Techniques: Parallel Coordinates
• Parallel Coordinates
  - Used to plot the attribute values of high-dimensional data
  - Instead of using perpendicular axes, use a set of parallel axes
  - The attribute values of each object are plotted as a point on each corresponding coordinate axis and the points are connected by a line
  - Thus, each object is represented as a line
  - Often, the lines representing a distinct class of objects group together, at least for some attributes
  - Ordering of attributes is important in seeing such groupings
Parallel Coordinates Plots
Other Visualization Techniques
• Star Plots
  - Similar approach to parallel coordinates, but axes radiate from a central point
  - The line connecting the values of an object is a polygon
• Chernoff Faces
  - Approach created by Herman Chernoff
  - This approach associates each attribute with a characteristic of a face
  - The values of each attribute determine the appearance of the corresponding facial characteristic
  - Each object becomes a separate face
  - Relies on humans’ ability to distinguish faces
Star Plots for Iris Data

[Figure: star plots of Iris flowers, grouped by species: Setosa, Versicolour, Virginica.]
Chernoff Faces
[Figure: Chernoff faces of Iris flowers, grouped by species: Setosa, Versicolour, Virginica.]
Scales of Graphs
• It is very important to pay attention to the scale that you are using when you are plotting.
• Compare the following graphs created from identical data
Summary
• Examine all your variables thoroughly and carefully before you begin analysis
• Use visual displays whenever possible
• Transform each variable as necessary to deal with mistakes, outliers, and distributions
Concluding Remarks
• Many possible types of graphs.
• Use common sense in reading graphs.
• When creating graphs, don’t summarize your data too much or too little.
• When creating graphs, label everything for others.
• Remember you are trying to communicate something to others!
