2 2 Data

Chapter 2 of 'Data Mining: Concepts and Techniques' discusses various graphical displays for basic statistical descriptions, including boxplots, quantile plots, histograms, and scatter plots. It also covers data attributes, similarity and dissimilarity measures, and distance metrics like Minkowski distance, providing foundational knowledge for data preprocessing. The chapter emphasizes the importance of understanding data types and relationships to gain insights for further analysis.

Uploaded by

wasiqbarat


Data Mining:

Concepts and Techniques

— Chapter 2 —

Jiawei Han, Micheline Kamber, and Jian Pei

University of Illinois at Urbana-Champaign
Simon Fraser University
©2011 Han, Kamber, and Pei. All rights reserved.
Graphic Displays of Basic Statistical Descriptions

• Boxplot: graphic display of the five-number summary
• Quantile plot: each value x_i is paired with f_i, indicating that approximately 100·f_i % of the data are ≤ x_i
• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
• Histogram: the x-axis shows values, the y-axis shows frequencies
• Scatter plot: each pair of values is treated as a pair of coordinates and plotted as a point in the plane

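The five-number summary behind a boxplot is the minimum, first quartile, median, third quartile, and maximum. The sketch below is one way to compute it; the sample values are hypothetical, and the quartile convention (medians of the lower and upper halves) is an assumption, since several conventions are in common use.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a non-empty list of numbers."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]           # values before the median position
    upper = data[half + n % 2:]   # values after the median position
    # Assumed convention: Q1 and Q3 are the medians of the two halves.
    q1 = median(lower) if lower else data[0]
    q3 = median(upper) if upper else data[-1]
    return data[0], q1, median(data), q3, data[-1]

# Hypothetical sample, not taken from the slides.
summary = five_number_summary([1, 3, 4, 6, 8, 9, 12])
print(summary)
```
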
Dataset
• A set of unit price data for items sold at a branch of AllElectronics

Quantile Plot
• Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
• Plots quantile information
  ◦ For data x_i sorted in increasing order, f_i indicates that approximately 100·f_i % of the data are below or equal to the value x_i

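The pairing of each x_i with f_i can be sketched as follows. The slide does not say how f_i is computed; the formula f_i = (i - 0.5) / N used here is an assumed, commonly used convention, and the unit prices are hypothetical.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with its fraction f_i.

    Assumption: f_i = (i - 0.5) / N for i = 1..N.
    """
    data = sorted(values)
    n = len(data)
    return [((i - 0.5) / n, x) for i, x in enumerate(data, start=1)]

# Hypothetical unit prices, not the AllElectronics data.
points = quantile_plot_points([75, 40, 47, 43, 74])
for f, x in points:
    print(f"f = {f:.1f}  x = {x}")
```

Plotting f on the horizontal axis against x on the vertical axis gives the quantile plot.
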

Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
• View: is there a shift in going from one distribution to another?
• The example shows the unit price of items sold at Branch 1 vs. Branch 2 for each quantile. Unit prices of items sold at Branch 1 tend to be lower than those at Branch 2.

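A minimal sketch of how the points of a q-q plot could be computed is shown below. The two samples are hypothetical, and the linear-interpolation quantile is an assumed convention rather than the one used for the slide's figure.

```python
def quantile(sorted_data, f):
    """Linear-interpolation quantile of pre-sorted data, for 0 <= f <= 1."""
    pos = f * (len(sorted_data) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_data) - 1)
    frac = pos - lo
    return sorted_data[lo] + frac * (sorted_data[hi] - sorted_data[lo])

def qq_points(sample_a, sample_b, fractions=(0.25, 0.5, 0.75)):
    """Pair the quantiles of two samples at the same fractions."""
    a, b = sorted(sample_a), sorted(sample_b)
    return [(quantile(a, f), quantile(b, f)) for f in fractions]

# Hypothetical unit prices for two branches.
branch1 = [40, 45, 50, 60, 70]
branch2 = [50, 55, 62, 70, 85, 90, 100]
pairs = qq_points(branch1, branch2)
print(pairs)
```

If every pair has a smaller first coordinate than second, the points lie above the line y = x, which corresponds to the shift described on the slide (Branch 1 prices tending to be lower).
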
Histogram Analysis
• Histogram: graphical display of tabulated frequencies, shown as bars
• It shows what proportion of cases fall into each of several categories
• The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent
[Figure: histogram; y-axis ticks from 0 to 40 in steps of 5, x-axis ticks at 10000, 30000, 50000, 70000, 90000]

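The tabulated frequencies behind a histogram can be sketched with equal-width, adjacent, non-overlapping bins. The values and the bin width below are hypothetical and are not read off the slide's figure.

```python
def histogram_counts(values, low, width, n_bins):
    """Count values in n_bins adjacent intervals [low + k*width, low + (k+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v - low) // width)
        if 0 <= idx < n_bins:   # values outside the covered range are ignored
            counts[idx] += 1
    return counts

# Hypothetical values on a 0..100000 scale.
values = [12000, 25000, 31000, 38000, 45000, 52000, 58000, 61000, 79000, 95000]
counts = histogram_counts(values, low=0, width=20000, n_bins=5)
print(counts)
```

Each count becomes the height of one bar; dividing by the number of values gives proportions instead of frequencies.
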
Scatter plot
• Provides a first look at bivariate data to see clusters of points, outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as a point in the plane
• Helps reveal the relationship, pattern, or trend between two numeric attributes

Positively and Negatively Correlated Data

• The left half of the figure is positively correlated
• The right half is negatively correlated

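The slides identify correlation visually. One common way to quantify it, which the slides do not name, is the Pearson correlation coefficient; the sketch below uses it as an assumption, with hypothetical data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: one rising trend, one falling trend.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 6]
falling = [6, 4, 5, 4, 2]
print(pearson_r(x, rising), pearson_r(x, falling))
```

A positive value matches an upward-sloping scatter plot, a negative value a downward-sloping one, and a value near 0 suggests no linear relationship.
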
Uncorrelated Data

Example

Chapter 2: Getting to Know Your Data

• Data Objects and Attribute Types
• Basic Statistical Descriptions of Data
• Measuring Data Similarity and Dissimilarity
• Summary

Similarity and Dissimilarity
• Similarity
  ◦ Numerical measure of how alike two data objects are
  ◦ Value is higher when objects are more alike
  ◦ Often falls in the range [0, 1]
• Dissimilarity (e.g., distance)
  ◦ Numerical measure of how different two data objects are
  ◦ Lower when objects are more alike
  ◦ Minimum dissimilarity is often 0
  ◦ Upper limit varies
• Proximity refers to a similarity or dissimilarity

Data Matrix and Dissimilarity Matrix
• Data matrix
  ◦ n data points with p dimensions
  ◦ Two modes

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

• Dissimilarity matrix
  ◦ n data points, but registers only the distance
  ◦ A triangular matrix
  ◦ Single mode

$$
\begin{bmatrix}
0 & & & & \\
d(2,1) & 0 & & & \\
d(3,1) & d(3,2) & 0 & & \\
\vdots & \vdots & \vdots & & \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$

Proximity Measure for Nominal Attributes

• Nominal attributes can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary attribute)
• Method 1: Simple matching
  ◦ m: number of matches, p: total number of variables

$$d(i, j) = \frac{p - m}{p}$$

• Method 2: Use a large number of binary attributes
  ◦ Create a new binary attribute for each of the M nominal states

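Both methods can be sketched in a few lines. The objects and attribute values below are hypothetical, and the helper names `simple_matching_dissimilarity` and `one_hot` are illustrative only.

```python
def simple_matching_dissimilarity(obj_i, obj_j):
    """Method 1: d(i, j) = (p - m) / p, with m matches among p attributes."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

def one_hot(value, states):
    """Method 2: one binary attribute per nominal state."""
    return [1 if value == s else 0 for s in states]

# Hypothetical objects described by three nominal attributes.
a = ("red", "small", "round")
b = ("red", "large", "round")
d = simple_matching_dissimilarity(a, b)   # one mismatch out of three

colors = ["red", "yellow", "blue", "green"]
encoded = one_hot("blue", colors)
print(d, encoded)
```
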
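Proximity Measure for Binary Attributes

• A contingency table for binary data

                 Object j
                 1       0       sum
Object i   1     q       r       q + r
           0     s       t       s + t
           sum   q + s   r + t   p

• Distance measure for symmetric binary variables:

$$d(i, j) = \frac{r + s}{q + r + s + t}$$

• Distance measure for asymmetric binary variables:

$$d(i, j) = \frac{r + s}{q + r + s}$$

• Similarity measure for asymmetric binary variables (Jaccard coefficient):

$$sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$$
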
Dissimilarity between Binary Variables
• Example

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N

• Gender is a symmetric attribute
• The remaining attributes are asymmetric binary
• Let the values Y and P be 1, and the value N be 0

$$d(\text{Jack}, \text{Mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$

$$d(\text{Jack}, \text{Jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$

$$d(\text{Jim}, \text{Mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$

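The three values in this example can be checked with a short script. It encodes Y and P as 1 and N as 0, leaves out the symmetric attribute Gender, and applies the asymmetric form (mismatches divided by all positions where at least one object is 1), which is the form the worked fractions follow. The function names are illustrative.

```python
def binary_counts(i, j):
    """Return (q, r, s, t): counts of 1/1, 1/0, 0/1 and 0/0 attribute pairs."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)
    return q, r, s, t

def asymmetric_binary_dissimilarity(i, j):
    """d(i, j) = (r + s) / (q + r + s); negative matches (t) are ignored."""
    q, r, s, _ = binary_counts(i, j)
    return (r + s) / (q + r + s)   # undefined if both objects are all zeros

# Fever, Cough, Test-1, Test-2, Test-3, Test-4 (Y/P -> 1, N -> 0).
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

d_jack_mary = asymmetric_binary_dissimilarity(jack, mary)
d_jack_jim = asymmetric_binary_dissimilarity(jack, jim)
d_jim_mary = asymmetric_binary_dissimilarity(jim, mary)
print(round(d_jack_mary, 2), round(d_jack_jim, 2), round(d_jim_mary, 2))
```
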
Distance on Numeric Data: Minkowski Distance
• Minkowski distance: a popular distance measure

$$d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$$

where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
• Properties
  ◦ d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
  ◦ d(i, j) = d(j, i) (symmetry)
  ◦ d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
• A distance that satisfies these properties is a metric

Special Cases of Minkowski Distance
• h = 1: Manhattan (city block, L1 norm) distance
  ◦ E.g., the Hamming distance: the number of bits that are different between two binary vectors
• h = 2: Euclidean (L2 norm) distance
• h → ∞: "supremum" (Lmax norm, L∞ norm, Chebyshev) distance
  ◦ This is the maximum difference between any component (attribute) of the vectors

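The three special cases can be sketched with a single function. The name `minkowski` and the two binary vectors are illustrative choices, not taken from the slides; the supremum case is handled separately because the limit h → ∞ reduces to the largest component difference.

```python
from math import inf

def minkowski(i, j, h):
    """Minkowski distance of order h between two equal-length vectors."""
    diffs = [abs(a - b) for a, b in zip(i, j)]
    if h == inf:
        return max(diffs)                      # supremum (Chebyshev) distance
    return sum(d ** h for d in diffs) ** (1 / h)

# Hypothetical binary vectors differing in three positions.
u = [1, 0, 1, 1, 0]
v = [0, 0, 1, 0, 1]

hamming = minkowski(u, v, 1)      # on binary vectors, L1 equals the Hamming distance
euclidean = minkowski(u, v, 2)
supremum = minkowski(u, v, inf)
print(hamming, euclidean, supremum)
```
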
Example: Minkowski Distance

Data points

point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5

Dissimilarity Matrices

Manhattan (L1)

L1   x1    x2    x3    x4
x1   0
x2   5     0
x3   3     6     0
x4   6     1     7     0

Euclidean (L2)

L2   x1    x2    x3    x4
x1   0
x2   3.61  0
x3   2.24  5.1   0
x4   4.24  1     5.39  0

Supremum (L∞)

L∞   x1    x2    x3    x4
x1   0
x2   3     0
x3   2     5     0
x4   3     1     5     0

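The three matrices in this example can be reproduced from the four points. The sketch below builds only the lower triangle, since the matrix is symmetric with zeros on the diagonal; the helper names are illustrative.

```python
from math import inf

points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

def minkowski(i, j, h):
    """Minkowski distance of order h; h = inf gives the supremum distance."""
    diffs = [abs(a - b) for a, b in zip(i, j)]
    if h == inf:
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)

def dissimilarity_matrix(objs, h):
    """Lower-triangular dissimilarity matrix, rounded to two decimals."""
    names = list(objs)
    return [
        [round(minkowski(objs[a], objs[b], h), 2) for b in names[: row + 1]]
        for row, a in enumerate(names)
    ]

manhattan = dissimilarity_matrix(points, 1)
euclidean = dissimilarity_matrix(points, 2)
supremum = dissimilarity_matrix(points, inf)
for row in euclidean:
    print(row)
```
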
Ordinal Variables

• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled variables
  ◦ Replace x_if by its rank: $r_{if} \in \{1, \ldots, M_f\}$, where M_f is the number of possible states that variable f has
  ◦ Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

  ◦ Compute the dissimilarity using methods for interval-scaled variables

Example

• Show the dissimilarity matrix for the above 4 objects.

Example

• f = {fair, good, excellent}
• M_f = 3, so z_1 = 0.0 (fair), z_2 = 0.5 (good), z_3 = 1.0 (excellent)
• The following dissimilarity matrix is obtained using the Euclidean distance:

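The rank-and-rescale step can be sketched as below, using the three states from this example. The mapping z = (rank - 1) / (M_f - 1) is inferred from the values 0.0, 0.5, 1.0 given for M_f = 3; the two objects compared at the end are hypothetical, because the slide's object table did not survive extraction.

```python
def ordinal_to_unit_interval(value, ordered_states):
    """Map an ordinal value onto [0, 1] via its rank (needs at least two states)."""
    m_f = len(ordered_states)
    rank = ordered_states.index(value) + 1   # rank in 1..M_f
    return (rank - 1) / (m_f - 1)

states = ["fair", "good", "excellent"]       # ordered from lowest to highest
z = {s: ordinal_to_unit_interval(s, states) for s in states}
print(z)

# Hypothetical pair of objects: with a single attribute, the Euclidean
# distance is just the absolute difference of the rescaled values.
d = abs(z["excellent"] - z["fair"])
print(d)
```
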
Cosine Similarity
• A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document.
• Cosine measure: if d1 and d2 are two vectors (e.g., term-frequency vectors), then

$$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$$

where · indicates the vector dot product and ||d|| is the norm of vector d

Example: Cosine Similarity
• cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||), where · indicates the vector dot product and ||d|| is the length of vector d
• Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = 25 / (6.481 × 4.12) = 0.94

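The arithmetic in this example can be checked with a few lines of Python. The function name `cosine_similarity` is an illustrative choice; the two vectors are the ones given above.

```python
from math import sqrt

def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||) for non-zero vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = sqrt(sum(a * a for a in d1))
    norm2 = sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

similarity = cosine_similarity(d1, d2)
print(round(similarity, 2))
```
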
Summary
• Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled
• Many types of data sets, e.g., numerical, text, graph, Web, image
• Gain insight into the data by:
  ◦ Basic statistical data description: central tendency, dispersion, graphical displays
  ◦ Measuring data similarity
• The above steps are the beginning of data preprocessing
• Many methods have been developed, but this is still an active area of research

