2 2 Data
— Chapter 2 —
Dataset
A Set of Unit Price Data for Items Sold at a Branch of AllElectronics
Quantile Plot
Displays all of the data (allowing the user to assess both
the overall behavior and unusual occurrences)
Plots quantile information
For data x_i sorted in increasing order, f_i
indicates that approximately 100*f_i % of the data are
below or equal to the value x_i
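The pairing of each sorted value with its fraction can be sketched in a few lines. This is a minimal illustration, not the slide's own method: the convention f_i = (i - 0.5) / n is an assumption (one common choice), and the `prices` list is hypothetical data, not the AllElectronics set.

```python
def quantile_points(values):
    """Return (f_i, x_i) pairs suitable for a quantile plot."""
    xs = sorted(values)  # x_1 <= x_2 <= ... <= x_n
    n = len(xs)
    # Assumed convention: f_i = (i - 0.5) / n for i = 1..n
    return [((i - 0.5) / n, x) for i, x in enumerate(xs, start=1)]


# Hypothetical unit prices, for illustration only
prices = [40, 43, 47, 74, 75, 78, 115, 117, 120]
points = quantile_points(prices)

for f, x in points:
    print(f"f = {f:.3f}  x = {x}")
```

Plotting x against f gives the quantile plot; with this convention the middle value of an odd-length sample lands at f = 0.5.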
Histogram Analysis
Histogram: Graph display of tabulated frequencies, shown as bars
It shows what proportion of cases fall into each of several categories
The categories are usually specified as non-overlapping intervals of
some variable. The categories (bars) must be adjacent
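The tabulation behind a histogram can be sketched as follows. This is an illustrative sketch under stated assumptions: bins are equal-width, half-open intervals [low + k*width, low + (k+1)*width), and the `values` list is hypothetical.

```python
def histogram(values, low, width, n_bins):
    """Count how many values fall into each adjacent, non-overlapping bin."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v - low) // width)  # index of the bin containing v
        if 0 <= idx < n_bins:          # values outside the range are ignored
            counts[idx] += 1
    return counts


# Hypothetical data, for illustration only
values = [12000, 25000, 31000, 38000, 52000, 55000, 58000, 71000, 99000]
counts = histogram(values, low=0, width=20000, n_bins=5)
proportions = [c / len(values) for c in counts]
print(counts)
```

Dividing each count by the number of cases gives the proportion per category, which is what the bar heights convey.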
Scatter plot
Provides a first look at bivariate data to see clusters of points,
outliers, etc.
Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
Determines the relationship, pattern, or trend between two
numeric attributes.
Positively and Negatively Correlated Data
Uncorrelated Data
Example
Chapter 2: Getting to Know Your Data
Summary
Similarity and Dissimilarity
Similarity
Numerical measure of how alike two data objects are
Dissimilarity
Numerical measure of how different two data objects are
Lower when objects are more alike
Data Matrix and Dissimilarity Matrix
Data matrix
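n data points with p dimensions
Two modes
\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]
Dissimilarity matrix
n data points, but registers only the distance
A triangular matrix
Single mode
\[
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]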
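The step from a data matrix to a lower-triangular dissimilarity matrix can be sketched like this. It is a minimal sketch: Euclidean distance via `math.dist` is an assumed choice of d(i, j), and the three 2-dimensional points are hypothetical.

```python
import math


def dissimilarity_matrix(data, dist=math.dist):
    """Build the lower triangle: row i holds d(i, 0), ..., d(i, i)."""
    n = len(data)
    return [[dist(data[i], data[j]) for j in range(i + 1)] for i in range(n)]


# Hypothetical data matrix: n = 3 objects, p = 2 dimensions
data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
matrix = dissimilarity_matrix(data)

for row in matrix:
    print(row)
```

Only the lower triangle is stored because d(i, j) = d(j, i) and d(i, i) = 0, which is why the matrix needs a single mode rather than two.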
Proximity Measure for Nominal Attributes
Example: Minkowski Distance
point   attribute 1   attribute 2
x1      1             2
x2      3             5
x3      2             0
x4      4             5

Dissimilarity Matrices

Manhattan (L1)
L1    x1     x2     x3     x4
x1    0
x2    5      0
x3    3      6      0
x4    6      1      7      0

Euclidean (L2)
L2    x1     x2     x3     x4
x1    0
x2    3.61   0
x3    2.24   5.1    0
x4    4.24   1      5.39   0

Supremum (L∞)
L∞    x1     x2     x3     x4
x1    0
x2    3      0
x3    2      5      0
x4    3      1      5      0
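The three matrices above can be checked with a short script. This is a sketch, not the deck's own code: the function name `minkowski` and the use of `math.inf` to select the supremum distance are illustrative choices.

```python
import math


def minkowski(a, b, h):
    """Minkowski distance of order h; h = math.inf gives the supremum distance."""
    diffs = [abs(p - q) for p, q in zip(a, b)]
    if math.isinf(h):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1.0 / h)


# The four points from the example
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}
names = list(points)

for label, h in [("Manhattan (L1)", 1), ("Euclidean (L2)", 2), ("Supremum", math.inf)]:
    print(label)
    for i, row in enumerate(names):
        # Lower triangle only, including the zero diagonal
        cells = [round(minkowski(points[row], points[names[j]], h), 2) for j in range(i + 1)]
        print(row, cells)
```

With h = 1 the distance sums the per-attribute differences, with h = 2 it is the straight-line distance, and as h grows the largest single difference dominates.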
Ordinal Variables
Example
Example
Cosine Similarity
A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as a keyword) or
phrase in the document.
Example: Cosine Similarity
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 · d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*0+2*1+0*0+0*1 = 25
||d1|| = (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5
= 6.481
||d2|| = (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)^0.5 = (17)^0.5
= 4.12
cos(d1, d2) = 25 / (6.481 * 4.12) = 0.94
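The arithmetic above can be reproduced with a small function. This is a minimal sketch using only the standard library; the function name `cosine_similarity` is an illustrative choice, and it assumes neither vector is all zeros.

```python
import math


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||); assumes both vectors are nonzero."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


# Term-frequency vectors from the example
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

print(round(cosine_similarity(d1, d2), 2))
```

Because the measure depends only on the angle between the vectors, scaling either document's frequencies by a constant leaves the result unchanged.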
Summary
Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled
Many types of data sets, e.g., numerical, text, graph, Web, image.
Gain insight into the data by:
Basic statistical data description: central tendency, dispersion,
graphical displays
Measure data similarity
The steps above are the beginning of data preprocessing.
Many methods have been developed, but this is still an active area of research.