
2 2 Data

Chapter 2 of 'Data Mining: Concepts and Techniques' discusses various graphical displays for basic statistical descriptions, including boxplots, quantile plots, histograms, and scatter plots. It also covers data attributes, similarity and dissimilarity measures, and distance metrics like Minkowski distance, providing foundational knowledge for data preprocessing. The chapter emphasizes the importance of understanding data types and relationships to gain insights for further analysis.

Uploaded by

wasiqbarat

Data Mining:

Concepts and Techniques

— Chapter 2 —

Jiawei Han, Micheline Kamber, and Jian Pei


University of Illinois at Urbana-Champaign
Simon Fraser University
©2011 Han, Kamber, and Pei. All rights reserved.
Graphic Displays of Basic Statistical Descriptions

• Boxplot: graphic display of the five-number summary
• Quantile plot: each value x_i is paired with f_i, indicating that approximately 100·f_i % of the data are ≤ x_i
• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
• Histogram: the x-axis shows values, the y-axis shows frequencies
• Scatter plot: each pair of values is treated as a pair of coordinates and plotted as a point in the plane

Dataset

• A set of unit price data for items sold at a branch of AllElectronics

Quantile Plot

• Displays all of the data, allowing the user to assess both the overall behavior and unusual occurrences
• Plots quantile information
  - For data x_i sorted in increasing order, f_i indicates that approximately 100·f_i % of the data are less than or equal to the value x_i

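As a rough illustration of how the (x_i, f_i) pairs of a quantile plot might be computed, the sketch below assumes the common convention f_i = (i - 0.5)/N, which the slide itself does not state. The function name `quantile_pairs` and the sample prices are made up for illustration.

```python
def quantile_pairs(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / N (assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return [(x, (i - 0.5) / n) for i, x in enumerate(ordered, start=1)]


# Hypothetical unit prices, not taken from the AllElectronics data set.
prices = [74, 40, 47, 75, 43]
pairs = quantile_pairs(prices)
for x, f in pairs:
    print(f"x = {x:3d}  f = {f:.2f}")
```

Plotting f on the horizontal axis and x on the vertical axis would give the quantile plot; the middle pair corresponds to the median.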


Quantile-Quantile (Q-Q) Plot

• Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
• View: is there a shift in going from one distribution to the other?
• The example shows the unit price of items sold at Branch 1 vs. Branch 2 for each quantile. Unit prices of items sold at Branch 1 tend to be lower than those at Branch 2.

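One way to obtain the points of a q-q plot is to compute the same quantiles for both samples and pair them up. The sketch below uses `statistics.quantiles` from the Python standard library. The branch prices are invented for illustration, and `qq_pairs` is a hypothetical helper name.

```python
from statistics import quantiles


def qq_pairs(sample_a, sample_b, n=4):
    """Pair the n-1 cut points of sample_a with those of sample_b."""
    return list(zip(quantiles(sample_a, n=n), quantiles(sample_b, n=n)))


# Hypothetical unit prices for two branches (illustrative only).
branch_1 = [40, 45, 52, 60, 68, 75, 80, 95]
branch_2 = [48, 55, 60, 72, 78, 88, 96, 110]

pairs = qq_pairs(branch_1, branch_2)
lower_in_branch_1 = sum(1 for a, b in pairs if a < b)
print(pairs)
print(f"{lower_in_branch_1} of {len(pairs)} quantiles are lower at Branch 1")
```

If every pair has a smaller first coordinate, the points lie on one side of the line y = x, which is the kind of shift between distributions the plot is meant to reveal.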
Histogram Analysis

• Histogram: graphical display of tabulated frequencies, shown as bars
• It shows what proportion of cases falls into each of several categories
• The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent

[Figure: histogram; x-axis ticks at 10000, 30000, 50000, 70000, 90000; y-axis from 0 to 40]

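The tabulation behind a histogram can be sketched as counting values in adjacent, non-overlapping, equal-width intervals. The data values, the bin width, and the name `histogram_counts` below are assumptions chosen for illustration. They are not the data behind the figure.

```python
def histogram_counts(values, low, high, bin_width):
    """Count values in adjacent half-open bins [low, low + w), [low + w, low + 2w), ..."""
    n_bins = int((high - low) / bin_width)
    counts = [0] * n_bins
    for v in values:
        if low <= v < high:
            counts[int((v - low) // bin_width)] += 1
    return counts


# Hypothetical values (illustrative only).
values = [12000, 18000, 25000, 31000, 33000, 47000, 52000, 58000, 76000, 91000]
counts = histogram_counts(values, low=0, high=100000, bin_width=20000)
proportions = [c / len(values) for c in counts]
print(counts)
print(proportions)
```

Dividing each count by the number of values gives the proportion of cases per category, which is what the bar heights convey.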
Scatter Plot

• Provides a first look at bivariate data to see clusters of points, outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as a point in the plane
• Helps determine the relationship, pattern, or trend between two numeric attributes

Positively and Negatively Correlated Data

• The left half of the figure shows positively correlated data
• The right half shows negatively correlated data

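The slide judges correlation visually. One common way to quantify it, not named on the slide and used here only as an assumption, is Pearson's correlation coefficient: positive for an upward trend, negative for a downward trend. The data below are invented.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical attribute values (illustrative only).
xs = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 6]    # tends to increase with xs
falling = [6, 4, 5, 4, 2]   # tends to decrease with xs

r_pos = pearson_r(xs, rising)
r_neg = pearson_r(xs, falling)
print(round(r_pos, 3), round(r_neg, 3))
```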
Uncorrelated Data

Example

Chapter 2: Getting to Know Your Data

• Data Objects and Attribute Types
• Basic Statistical Descriptions of Data
• Measuring Data Similarity and Dissimilarity
• Summary

Similarity and Dissimilarity

• Similarity
  - Numerical measure of how alike two data objects are
  - Value is higher when objects are more alike
  - Often falls in the range [0, 1]
• Dissimilarity (e.g., distance)
  - Numerical measure of how different two data objects are
  - Lower when objects are more alike
  - Minimum dissimilarity is often 0
  - Upper limit varies
• Proximity refers to either a similarity or a dissimilarity

Data Matrix and Dissimilarity Matrix

• Data matrix
  - n data points with p dimensions
  - Two modes

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

• Dissimilarity matrix
  - n data points, but registers only the distance
  - A triangular matrix
  - Single mode

$$
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots &        &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$

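To make the relation between the two matrices concrete, the sketch below builds the lower-triangular dissimilarity matrix from a small data matrix. Euclidean distance (`math.dist`) is used as the default only as an assumption, since any dissimilarity function could be passed in. The data and the name `dissimilarity_matrix` are illustrative.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def dissimilarity_matrix(data, distance=dist):
    """Return the lower triangle: row i holds d(i, 0), ..., d(i, i) = 0."""
    return [[distance(data[i], data[j]) for j in range(i + 1)]
            for i in range(len(data))]


# Hypothetical data matrix: n = 3 objects (rows), p = 2 attributes (columns).
data = [(1, 2), (3, 5), (2, 0)]
matrix = dissimilarity_matrix(data)
for row in matrix:
    print([round(d, 2) for d in row])
```

Only the lower triangle is stored because d(i, j) = d(j, i) and d(i, i) = 0, so the rest of the matrix carries no extra information.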
Proximity Measure for Nominal Attributes

• Nominal attributes can take 2 or more states, e.g., red, yellow, blue, green (a generalization of a binary attribute)
• Method 1: Simple matching
  - m: number of matches, p: total number of variables
  - $d(i, j) = \frac{p - m}{p}$
• Method 2: Use a large number of binary attributes
  - Create a new binary attribute for each of the M nominal states

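Both methods can be sketched in a few lines. The first function implements simple matching, d(i, j) = (p - m)/p. The second shows the binary encoding of Method 2. The objects, attribute values, and function names are hypothetical.

```python
def simple_matching_distance(obj_i, obj_j):
    """d(i, j) = (p - m) / p, where m is the number of matching attributes."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p


def one_hot(value, states):
    """Method 2: one binary attribute per nominal state."""
    return [1 if value == s else 0 for s in states]


# Hypothetical objects described by three nominal attributes.
obj_a = ("red", "small", "round")
obj_b = ("red", "large", "round")

d_ab = simple_matching_distance(obj_a, obj_b)
encoded = one_hot("blue", ["red", "yellow", "blue", "green"])
print(round(d_ab, 2), encoded)
```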
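Proximity Measure for Binary Attributes

• A contingency table for binary data

                   Object j
                   1        0        sum
  Object i   1     q        r        q + r
             0     s        t        s + t
             sum   q + s    r + t    p

• Distance measure for symmetric binary variables:
  $d(i, j) = \frac{r + s}{q + r + s + t}$
• Distance measure for asymmetric binary variables:
  $d(i, j) = \frac{r + s}{q + r + s}$
• Similarity measure for asymmetric binary variables (Jaccard coefficient):
  $sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$
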
Dissimilarity between Binary Variables

• Example

  Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack  M       Y      N      P       N       N       N
  Mary  F       Y      N      P       N       P       N
  Jim   M       Y      P      N       N       N       N

• Gender is a symmetric attribute
• The remaining attributes are asymmetric binary
• Let the values Y and P be 1, and the value N be 0

  $d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$
  $d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$
  $d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$

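The three distances in the example can be checked with a short sketch. It assumes the usual contingency counts: q is the number of attributes where both objects are 1, r where only the first is 1, s where only the second is 1, and t where both are 0. Gender is left out because only the asymmetric attributes enter this distance. The function names are illustrative.

```python
def binary_counts(i, j):
    """Return (q, r, s, t) for two equally long 0/1 vectors."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)
    return q, r, s, t


def asymmetric_distance(i, j):
    """(r + s) / (q + r + s): joint absences (t) are ignored."""
    q, r, s, _ = binary_counts(i, j)
    return (r + s) / (q + r + s)


def symmetric_distance(i, j):
    """(r + s) / (q + r + s + t): both states count equally."""
    q, r, s, t = binary_counts(i, j)
    return (r + s) / (q + r + s + t)


# Fever, Cough, Test-1, Test-2, Test-3, Test-4 with Y/P -> 1 and N -> 0.
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim = (1, 1, 0, 0, 0, 0)

d_jack_mary = asymmetric_distance(jack, mary)
d_jack_jim = asymmetric_distance(jack, jim)
d_jim_mary = asymmetric_distance(jim, mary)
print(round(d_jack_mary, 2), round(d_jack_jim, 2), round(d_jim_mary, 2))
```

The Jaccard similarity for the same pair is simply 1 minus the asymmetric distance, since q/(q + r + s) and (r + s)/(q + r + s) sum to 1.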
Distance on Numeric Data: Minkowski Distance

• Minkowski distance: a popular distance measure

  $d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$

  where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
• Properties
  - d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
  - d(i, j) = d(j, i) (symmetry)
  - d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
• A distance that satisfies these properties is a metric

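A minimal sketch of the Minkowski distance, assuming the usual definition (the h-th root of the sum of the h-th powers of the absolute attribute differences), followed by a spot check of the three metric properties on a few made-up points. A spot check on sample points illustrates the properties but does not prove them.

```python
def minkowski(i, j, h):
    """Order-h Minkowski distance between two equally long numeric vectors."""
    return sum(abs(a - b) ** h for a, b in zip(i, j)) ** (1 / h)


# Hypothetical 2-dimensional objects.
i, j, k = (1, 2), (3, 5), (2, 0)
h = 2

d_ij = minkowski(i, j, h)
print(round(d_ij, 2))
print("positive definiteness:", d_ij > 0 and minkowski(i, i, h) == 0)
print("symmetry:", d_ij == minkowski(j, i, h))
print("triangle inequality:", d_ij <= minkowski(i, k, h) + minkowski(k, j, h))
```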
Special Cases of Minkowski Distance

• h = 1: Manhattan (city block, L1 norm) distance
  - E.g., the Hamming distance: the number of bits that are different between two binary vectors
• h = 2: Euclidean (L2 norm) distance
• h → ∞: "supremum" (Lmax norm, L∞ norm, Chebyshev) distance
  - This is the maximum difference between any component (attribute) of the vectors

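The special cases can be seen numerically: for binary vectors the h = 1 distance counts differing bits, and as h grows the Minkowski distance approaches the largest single-attribute difference. The sketch below assumes the standard Minkowski definition; the vectors are made up.

```python
def minkowski(i, j, h):
    """Order-h Minkowski distance (standard definition, assumed here)."""
    return sum(abs(a - b) ** h for a, b in zip(i, j)) ** (1 / h)


def supremum(i, j):
    """Limit h -> infinity: the maximum difference over all attributes."""
    return max(abs(a - b) for a, b in zip(i, j))


x1, x2 = (1, 2), (3, 5)
approach = {h: minkowski(x1, x2, h) for h in (1, 2, 10, 100)}
for h, d in approach.items():
    print(f"h = {h:3d}  d = {d:.4f}")
print("supremum:", supremum(x1, x2))

# Hamming distance as the h = 1 case on binary vectors.
hamming = minkowski((1, 0, 1, 1), (1, 1, 0, 1), 1)
print("hamming:", hamming)
```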
Example: Minkowski Distance

  point  attribute 1  attribute 2
  x1     1            2
  x2     3            5
  x3     2            0
  x4     4            5

Dissimilarity Matrices

Manhattan (L1)
  L1   x1    x2    x3    x4
  x1   0
  x2   5     0
  x3   3     6     0
  x4   6     1     7     0

Euclidean (L2)
  L2   x1    x2    x3    x4
  x1   0
  x2   3.61  0
  x3   2.24  5.1   0
  x4   4.24  1     5.39  0

Supremum (L∞)
  L∞   x1    x2    x3    x4
  x1   0
  x2   3     0
  x3   2     5     0
  x4   3     1     5     0

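The three matrices can be reproduced from the four points with a short sketch. The helper name `lower_triangle` is hypothetical, and values are rounded to two decimals to match the tables.

```python
from math import dist  # Euclidean distance (Python 3.8+)

points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}


def manhattan(a, b):
    return sum(abs(u - v) for u, v in zip(a, b))


def supremum(a, b):
    return max(abs(u - v) for u, v in zip(a, b))


def lower_triangle(metric):
    """Lower-triangular dissimilarity matrix over `points`, rounded to 2 decimals."""
    names = list(points)
    return [[round(metric(points[names[r]], points[names[c]]), 2)
             for c in range(r + 1)]
            for r in range(len(names))]


l1 = lower_triangle(manhattan)
l2 = lower_triangle(dist)
l_max = lower_triangle(supremum)
for name, matrix in (("L1", l1), ("L2", l2), ("Lmax", l_max)):
    print(name, matrix)
```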
Ordinal Variables

• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled variables:
  - Replace x_if by its rank: $r_{if} \in \{1, \ldots, M_f\}$
  - M_f is the number of possible states that variable f has
  - Map the range of each variable onto [0, 1] by replacing the rank of the i-th object in the f-th variable with $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
  - Compute the dissimilarity using methods for interval-scaled variables

Example

• Show the dissimilarity matrix for the above 4 objects.

Example

• f = {fair, good, excellent}
• M_f = 3, so z_1 = 0.0 (fair), z_2 = 0.5 (good), z_3 = 1.0 (excellent)
• The following dissimilarity matrix is obtained using the Euclidean distance:

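The rank normalization can be sketched as follows, assuming the standard mapping z = (r - 1)/(M_f - 1), which reproduces the z values listed above. The four object values are hypothetical stand-ins, since the original table of objects is not reproduced here. With a single attribute, the Euclidean distance reduces to the absolute difference of the z values.

```python
def normalize_rank(rank, m_f):
    """Map a rank in 1..M_f onto [0, 1] via (rank - 1) / (M_f - 1)."""
    return (rank - 1) / (m_f - 1)


states = ["fair", "good", "excellent"]  # ordered from lowest to highest
z = {state: normalize_rank(rank, len(states))
     for rank, state in enumerate(states, start=1)}

# Hypothetical values of the ordinal attribute for four objects.
objects = ["excellent", "fair", "good", "excellent"]

matrix = [[abs(z[objects[i]] - z[objects[j]]) for j in range(i + 1)]
          for i in range(len(objects))]
print(z)
for row in matrix:
    print(row)
```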
Cosine Similarity

• A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document.
• Cosine measure: if d1 and d2 are two vectors (e.g., term-frequency vectors), then
  $\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$,
  where $\cdot$ indicates the vector dot product and $\|d\|$ is the norm (length) of vector d

Example: Cosine Similarity

• $\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$, where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector d
• Ex: Find the similarity between documents 1 and 2.

  d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
  d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

  d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
  ||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
  ||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.12
  cos(d1, d2) = 25 / (6.481 × 4.12) = 0.94

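The worked example can be checked with a few lines of Python. The function name `cosine_similarity` is illustrative; the two vectors are the ones given above.

```python
from math import sqrt


def cosine_similarity(d1, d2):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm_1 = sqrt(sum(a * a for a in d1))
    norm_2 = sqrt(sum(b * b for b in d2))
    return dot / (norm_1 * norm_2)


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

similarity = cosine_similarity(d1, d2)
print(round(similarity, 2))
```

A value close to 1 means the two term-frequency vectors point in nearly the same direction, regardless of document length.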
Summary

• Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled
• Many types of data sets, e.g., numerical, text, graph, Web, image
• Gain insight into the data by:
  - Basic statistical data description: central tendency, dispersion, graphical displays
  - Measuring data similarity
• The above steps are the beginning of data preprocessing
• Many methods have been developed, but this is still an active area of research

