Lecture#06 - EDA2 - Graphical Data Analysis
Data Analysis 2:
Graphical Data Analysis
Menu
Introduction
Visualization of Data
Graphical Presentation of Univariate Data
Bar Chart
Histogram
Dot Plots
Stem and Leaf Plots
Box and Whisker Plots
Pie and Donut Diagrams
Graphical Presentation of Bivariate Data
Scatter Plots
Graphical Presentation of Spatial Data
Contour Plots
Matrix Plots
Other Types of Graphs
Star/Kite Plot
Chernoff Faces
Concluding Remarks
Visualization
Visualization is the conversion of data into a
visual or tabular format so that the
characteristics of the data and the
relationships among data items or attributes
can be analyzed or reported.
Visualization of data is one of the most
powerful and appealing techniques for data
exploration.
Humans have a well-developed ability to
analyze large amounts of information that is
presented visually.
Can detect general patterns and trends
Can detect outliers and unusual patterns
Example: Sea Surface Temperature
The following shows the Sea Surface
Temperature (SST) for July 1982
Tens of thousands of data points are
summarized in a single figure
Representation
Is the mapping of information to a visual format
Data objects, their attributes, and the relationships
among data objects are translated into graphical
elements such as points, lines, shapes, and
colors.
Example:
Objects are often represented as points
Their attribute values can be represented as the
position of the points or the characteristics of the
points, e.g., color, size, and shape
If position is used, then the relationships of points, i.e.,
whether they form groups or a point is an outlier, are
easily perceived.
Arrangement
Is the placement of visual elements within a
display
Can make a large difference in how easy it
is to understand the data
Example:
Selection
Is the elimination or the de-emphasis of certain
objects and attributes
Selection may involve choosing a subset of
attributes
Dimensionality reduction is often used to reduce the
number of dimensions to two or three
Alternatively, pairs of attributes can be considered
Selection may also involve choosing a subset of
objects
A region of the screen can only show so many points
Can sample, but want to preserve points in sparse
areas
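One way to sample while preserving points in sparse areas is to cap the number of points kept per grid cell. The sketch below is only an illustration of that idea: the function name, the cell size, the cap, and the data are all assumptions, not something the lecture prescribes.

```python
import random
from collections import defaultdict

def sample_preserving_sparse(points, cell_size, cap, seed=0):
    """Keep at most `cap` points per grid cell; sparse cells keep everything."""
    rng = random.Random(seed)            # fixed seed so the sketch is repeatable
    cells = defaultdict(list)
    for x, y in points:
        # Assign each point to a square cell of side `cell_size`.
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    kept = []
    for members in cells.values():
        if len(members) <= cap:
            kept.extend(members)                    # sparse area: keep every point
        else:
            kept.extend(rng.sample(members, cap))   # dense area: thin it out
    return kept

# Hypothetical data: a dense cluster of 200 points plus one isolated point.
rng = random.Random(1)
cluster = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(200)]
isolated = (9.5, 9.5)
kept = sample_preserving_sparse(cluster + [isolated], cell_size=1.0, cap=20)
print(len(kept), isolated in kept)
```

The dense cluster is thinned to the cap while the isolated point survives, so a possible outlier is not lost by sampling.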
Goals of Graphing
[Figure: histograms of Age (in years), 18 to 27, for n=92 students; vertical axes: Percent and Frequency (Count)]
Histogram
Divide measurement up into equal-sized
categories.
Determine number (or percentage) of
measurements falling into each category.
Draw a bar for each category so bars’
heights represent number (or percent)
falling into the categories.
Label and title appropriately.
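The steps above can be sketched in a few lines. This is a minimal text-only version; the ages are hypothetical (not the class data from the figures), and the function names are made up for illustration.

```python
def histogram_counts(values, low, width, n_bins):
    """Count values in n_bins equal-sized categories [low + i*width, low + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        i = int((v - low) // width)
        if 0 <= i < n_bins:          # values outside the categories are ignored
            counts[i] += 1
    return counts

def print_histogram(values, low, width, n_bins, title):
    """Draw one text bar per category and label it with count and percent."""
    counts = histogram_counts(values, low, width, n_bins)
    print(title)
    for i, c in enumerate(counts):
        left = low + i * width
        pct = 100 * c / len(values)
        print(f"{left:>4} to {left + width:<4} | {'#' * c} ({c}, {pct:.0f}%)")

# Hypothetical ages, for illustration only.
ages = [18, 19, 19, 20, 20, 20, 21, 21, 21, 21, 22, 22, 23, 25, 27]
print_histogram(ages, low=18, width=2, n_bins=5, title="Age (in years), n=15")
print_histogram(ages, low=18, width=5, n_bins=2, title="Same data, too few categories")
```

The second call uses only two categories, which hides the shape that the first call shows.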
Too few categories
[Figure: histogram, "Age of Spring 1998 Stat 250 Students"; horizontal axis: Age (in years), with ticks at 18, 23, 28; vertical axis: Frequency (Count), 10 to 60; n=92 students]
Too many categories
[Figure: histogram, "GPAs of Spring 1998 Stat 250 Students"; horizontal axis: GPA, 2 to 4; vertical axis: Frequency (Count); n=92 students]
Histograms
[Figure: three panels titled "Histogram". Two use bins running from 31 to 65 with counts from 0 to 25; the third uses bins of width 4 from 34 to 74 with counts from 0 to 30.]
Dot Plot
[Figure: dot plots of Speed for Men (100) and Women (126); horizontal axis from 70 to 160]
Dot Plot
Summarizes measurement data.
Horizontal axis represents measurement
scale.
Plot one dot for each data point.
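A dot plot is simple enough to draw as text. The following is a rough sketch that assumes integer measurements; the data (hours of sleep) and the function name are hypothetical.

```python
from collections import Counter

def dot_plot_rows(values):
    """Return text rows, top to bottom: one 'o' per data point, then the axis."""
    counts = Counter(values)
    axis = list(range(min(values), max(values) + 1))   # integer measurement scale
    w = max(len(str(v)) for v in axis)                 # column width for alignment
    rows = []
    for level in range(max(counts.values()), 0, -1):
        # A column gets a dot on this row if its value occurs at least `level` times.
        rows.append(" ".join(("o" if counts[v] >= level else " ").center(w)
                             for v in axis))
    rows.append(" ".join(str(v).rjust(w) for v in axis))
    return rows

# Hypothetical data: hours of sleep reported by nine students.
sleep = [5, 6, 6, 7, 7, 7, 8, 8, 9]
for row in dot_plot_rows(sleep):
    print(row)
```

Each stack of dots shows how often a value occurs, so the display doubles as a small histogram with no binning.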
Stem-and-Leaf Plot
Stem-and-leaf of Shoes N = 139 Leaf Unit = 1.0
12 0 223334444444
63 0 555555555555566666666677777778888888888888999999999
(33) 1 000000000000011112222233333333444
43 1 555555556667777888
25 2 0000000000023
12 2 5557
8 3 0023
4 3
4 4 00
2 4
2 5 0
1 5
1 6
1 6
1 7
1 7 5
Stem and Leaf plots
Before:
N = 10 Median = 3.1 Quartiles = 3, 3.3
2 : 88
3 : 00112333
During:
N = 10 Median = 5.2 Quartiles = 2.3, 8.8
-1 : 0
-0 :
0 :
1 : 5
2 : 33
3 : 3
4 :
5 :
6 :
7 : 1
8 : 68
9 : 15
After:
N = 10 Median = 5.5 Quartiles = 4.1, 6.7
3 : 24
4 : 125
5 :
6 : 567
7 : 3
High: 17
Stem-and-Leaf Plot
Summarizes measurement data.
Each data point is broken down into a
“stem” and a “leaf.”
First, “stems” are aligned in a column.
Then, “leaves” are attached to the stems.
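The stem and leaf construction can be sketched as follows. The ten values are read off the "Before" display in this lecture; the function name and the restriction to non-negative data with one-digit leaves are simplifying assumptions.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=0.1):
    """Return display lines for non-negative data; each leaf is a single digit."""
    # Express every value in leaf units, e.g. 2.8 -> 28 when leaf_unit is 0.1.
    scaled = sorted(round(v / leaf_unit) for v in values)
    leaves = defaultdict(list)
    for s in scaled:
        stem, leaf = divmod(s, 10)       # 28 -> stem 2, leaf 8
        leaves[stem].append(str(leaf))
    lines = []
    # List every stem from smallest to largest, including empty ones.
    for stem in range(scaled[0] // 10, scaled[-1] // 10 + 1):
        lines.append(f"{stem} : {''.join(leaves[stem])}".rstrip())
    return lines

before = [2.8, 2.8, 3.0, 3.0, 3.1, 3.1, 3.2, 3.3, 3.3, 3.3]
for line in stem_and_leaf(before):
    print(line)
```

Empty stems are kept on purpose: gaps in the column of stems are part of what the display shows.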
Visualization Techniques: Box Plots
Box Plots
Invented by J. Tukey
Another way of displaying the distribution of data
The following figure shows the basic parts of a box plot
[Figure: box plot annotated, from top to bottom: outlier, 90th percentile, 75th percentile, 50th percentile, 25th percentile, 10th percentile]
Example of Box Plots
Box plots can be used to compare
attributes
Box Plot
Summarizes measurement data.
Vertical (or horizontal) axis represents
measurement scale.
Lines in box represent the 25th percentile
(“first quartile”), the 50th percentile
(“median”), and the 75th percentile (“third
quartile”), respectively.
An aside...
Roughly speaking:
The “25th percentile” is the number such that
25% of the data points fall below the number.
The “median” or “50th percentile” is the
number such that half of the data points fall
below the number.
The “75th percentile” is the number such that
75% of the data points fall below the number.
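The rough definition above needs a tie-breaking rule when no data point sits exactly at the requested position. The sketch below uses linear interpolation between sorted data points, which is one common convention among several; the function name and the data are illustrative assumptions.

```python
def percentile(values, p):
    """Percentile by linear interpolation between sorted data points."""
    data = sorted(values)
    pos = (len(data) - 1) * p / 100      # fractional index into the sorted data
    lower = int(pos)
    frac = pos - lower
    if lower + 1 < len(data):
        return data[lower] + frac * (data[lower + 1] - data[lower])
    return data[lower]                   # p = 100: the largest data point

# Hypothetical data.
scores = [1, 2, 3, 4, 5]
print(percentile(scores, 25), percentile(scores, 50), percentile(scores, 75))
```

Other conventions can give slightly different quartiles for small samples, so software packages do not always agree.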
Box Plot (cont’d)
“Whiskers” are drawn to the most extreme
data points that are not more than 1.5
times the length of the box beyond either
quartile.
Whiskers are useful for identifying outliers.
“Outliers,” or extreme observations, are
denoted by asterisks.
Generally, data points falling beyond the
whiskers are considered outliers.
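The whisker rule can be checked numerically. This sketch uses `statistics.quantiles` from the Python standard library with the "inclusive" method, which is one of several quartile conventions; the speeds are hypothetical and the function name is made up for illustration.

```python
import statistics

def box_plot_summary(values):
    """Quartiles, whisker ends, and outliers under the 1.5 box-length rule."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    box_length = q3 - q1                         # length of the box
    low_fence = q1 - 1.5 * box_length
    high_fence = q3 + 1.5 * box_length
    # Whiskers end at the most extreme data points still inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whiskers": (inside[0], inside[-1]),
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Hypothetical fastest driving speeds (mph).
speeds = [70, 85, 90, 95, 100, 100, 105, 110, 120, 160]
summary = box_plot_summary(speeds)
print(summary)
```

Here the box runs from 91.25 to 108.75, the whiskers reach 70 and 120, and 160 lies beyond the upper whisker, so it would be drawn as an outlier.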
Using Box Plots to Compare
[Figure: side-by-side box plots, "Fastest Ever Driving Speed", 226 Stat 100 Students, Fall 1998; vertical axis: Fastest Speed (mph), 60 to 160; horizontal axis: Gender (female, male)]
Pie Charts:
[Figure: pie chart, "Proportions of Donut-Eating Professors by Weight Class"; classes: 130-150, 151-185, 186-210, 211-240, 241-270, 271-310, 311+]
Actually, why not use a donut
graph? Duh!
[Figure: donut chart, "Proportions of Donut-Eating Professors by Weight Class"; same weight classes]
[Figure: panels pairing DURING with AFTER (N=10 each), some plotted against Standard Normal Quantiles. Annotations: r=-0.57, B=-0.55, t=-1.97, p=0.08, N=10; M=6.35, Sd=3.82, Sk=2.01*, K=3.12*]
[Figure: the same kind of panels for DURING and DURnew (N=10) paired with AFTnew (N=9). Annotations: r=-0.92, B=-0.41, t=-6.4, p=0, N=9; M=5.35, Sd=3.37, Sk=0.00, K=-1.81]
[Figure: scatter plot of Right foot (in cm) against Left foot (in cm); both axes from 22 to 31; n=88 students]
Scatter Plots
Summarizes the relationship between two
measurement variables.
Horizontal axis represents one variable
and vertical axis represents second
variable.
Plot one point for each pair of
measurements.
No relationship
[Figure: scatter plot, "Lengths of left forearms and head circumferences of Spring 1998 Stat 250 Students"; vertical axis: Left forearm (in cm), 22 to 32; horizontal axis: Head circumference (in cm), 52 to 62; n=89 students]
Scatter Plots: positive and negative correlations
[Figure: two scatter plots, one showing a positive and one a negative correlation; horizontal axes from 0 to 60; vertical axes from -150 to 200 and from 0 to 600]
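The direction of a relationship seen in a scatter plot can be summarized by the Pearson correlation coefficient r. The sketch below computes it from its textbook definition; the data are hypothetical and only mimic a rising and a falling point cloud.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: positive when y rises with x, negative when it falls."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical measurements.
x = [0, 10, 20, 30, 40, 50, 60]
rising = [210, 250, 330, 360, 450, 480, 560]
falling = [150, 90, 70, -10, -40, -90, -140]
print(round(pearson_r(x, rising), 2), round(pearson_r(x, falling), 2))
```

A value near +1 or -1 indicates a nearly linear cloud; a value near 0 corresponds to the "no relationship" picture.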
Visualization Techniques: Contour Plots
Contour plots
Useful when a continuous attribute is measured on a
spatial grid
They partition the plane into regions of similar values
The contour lines that form the boundaries of these
regions connect points with equal values
The most common example is contour maps of
elevation
Can also display temperature, rainfall, air pressure,
etc.
An example for Sea Surface Temperature (SST) is provided
on the next slide
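The partitioning into regions of similar values can be sketched on a small grid. The temperatures, the contour levels, and the function name below are illustrative assumptions; a real contour plot would also interpolate between grid points to draw smooth lines.

```python
from bisect import bisect_right

def contour_bands(grid, levels):
    """Replace each grid value by the index of the band it falls in."""
    # Band 0 is below levels[0], band 1 is between levels[0] and levels[1], etc.
    return [[bisect_right(levels, v) for v in row] for row in grid]

# Hypothetical temperatures (degrees Celsius) on a 4 x 5 spatial grid.
grid = [
    [2, 4, 7, 11, 14],
    [5, 8, 12, 16, 19],
    [9, 13, 17, 22, 26],
    [14, 18, 23, 27, 29],
]
levels = [10, 20]          # contour values separating the bands
bands = contour_bands(grid, levels)
for row in bands:
    print("".join(".+#"[band] for band in row))
```

The contour lines run along the boundaries where the printed symbol changes.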
Contour Plot Example: SST Dec, 1998
[Figure: contour plot of sea surface temperature; scale in degrees Celsius]
Visualization Techniques: Matrix Plots
Matrix plots
Can plot the data matrix
This can be useful when objects are sorted according
to class
Typically, the attributes are normalized to prevent one
attribute from dominating the plot
Plots of similarity or distance matrices can also be
useful for visualizing the relationships between objects
Examples of matrix plots are presented on the next two
slides
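Normalizing the attributes before plotting a data matrix can be sketched as follows. This version rescales each column to mean 0 and standard deviation 1, which is one common choice; the data and the function name are hypothetical.

```python
import statistics

def standardize_columns(matrix):
    """Rescale every column (attribute) to mean 0 and standard deviation 1."""
    columns = list(zip(*matrix))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]   # a constant column would give 0 here
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in matrix]

# Hypothetical objects (rows) with two attributes on very different scales.
data = [
    [5.0, 1200.0],
    [6.0, 1500.0],
    [7.0, 1800.0],
]
for row in standardize_columns(data):
    print(row)
```

After rescaling, both attributes span the same range, so neither dominates the color scale of the plot.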
Visualization of the Iris Data Matrix
[Figure: matrix plot of the Iris data; color scale in units of standard deviation]
Visualization of the Iris Correlation Matrix
Visualization Techniques: Parallel Coordinates
Parallel Coordinates
Used to plot the attribute values of high-dimensional
data
Instead of using perpendicular axes, use a set of
parallel axes
The attribute values of each object are plotted as a
point on each corresponding coordinate axis and the
points are connected by a line
Thus, each object is represented as a line
Often, the lines representing a distinct class of objects
group together, at least for some attributes
Ordering of attributes is important in seeing such
groupings
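The construction described above can be sketched by turning each object into a list of vertices, one per parallel axis. Scaling every axis to the range 0 to 1 is an assumption made here so that the axes are comparable; the data and the function name are hypothetical.

```python
def parallel_coordinate_lines(objects):
    """Turn each object into a polyline: one (axis_index, height) vertex per attribute."""
    columns = list(zip(*objects))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    lines = []
    for obj in objects:
        line = []
        for axis, value in enumerate(obj):
            span = highs[axis] - lows[axis]
            # Height along the axis, scaled to [0, 1]; constant attributes sit mid-axis.
            height = (value - lows[axis]) / span if span else 0.5
            line.append((axis, height))
        lines.append(line)
    return lines

# Hypothetical objects with three attributes each.
objects = [
    (5.0, 3.5, 1.5),
    (6.5, 3.0, 4.5),
    (7.0, 3.0, 6.0),
]
lines = parallel_coordinate_lines(objects)
for line in lines:
    print(line)
```

Drawing each list of vertices as a connected line gives one line per object; reordering the attributes changes which axes are adjacent and therefore which groupings are visible.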
Parallel Coordinates Plots
Other Visualization Techniques
Star Plots
Similar approach to parallel coordinates, but axes
radiate from a central point
The line connecting the values of an object is a
polygon
Chernoff Faces
Approach created by Herman Chernoff
This approach associates each attribute with a
characteristic of a face
The values of each attribute determine the appearance
of the corresponding facial characteristic
Each object becomes a separate face
Relies on humans' ability to distinguish faces
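For star plots, the polygon of an object follows directly from the description above: each attribute gets an axis at its own angle, and the value sets the distance from the center. The sketch assumes the attribute values have already been scaled to the range 0 to 1; the function name and the data are hypothetical.

```python
import math

def star_polygon(scaled_values):
    """Vertices of a star plot polygon for attribute values scaled to [0, 1]."""
    k = len(scaled_values)
    vertices = []
    for i, r in enumerate(scaled_values):
        angle = 2 * math.pi * i / k        # axes spread evenly around the center
        vertices.append((r * math.cos(angle), r * math.sin(angle)))
    return vertices

# Hypothetical object with four scaled attributes.
flower = [1.0, 0.5, 1.0, 0.5]
for x, y in star_polygon(flower):
    print(round(x, 3), round(y, 3))
```

Connecting the vertices in order and closing the shape gives the polygon for one object.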
Star Plots for Iris Data
[Figure: star plots for three groups of Iris flowers: Setosa, Versicolour, Virginica]
Chernoff Faces
[Figure: Chernoff faces for three groups of Iris flowers: Setosa, Versicolour, Virginica]
Scales of Graphs
It is very important to pay attention to the
scale that you are using when you are
plotting.
Compare the following graphs created
from identical data
Summary
Examine all your variables thoroughly and
carefully before you begin analysis
Use visual displays whenever possible
Transform each variable as necessary to
deal with mistakes, outliers, and
distributions
Concluding Remarks
Many possible types of graphs.
Use common sense in reading graphs.
When creating graphs, don’t summarize
your data too much or too little.
When creating graphs, label everything for
others.
Remember you are trying to communicate
something to others!