The document discusses concept description in data mining, focusing on characterization and comparison of concepts. It describes data generalization and summarization techniques for characterization, including attribute-oriented induction, which generalizes data attributes. Examples illustrate the characterization of student data and the comparison of graduate and undergraduate students.

Data Mining:

Concepts and Techniques

— Chapter 4 —

Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber, All rights reserved

06/03/24 Data Mining: Concepts and Techniques 1


What is Concept Description?
 Descriptive vs. predictive data mining
  Descriptive mining: describes concepts or task-relevant data sets in concise, summarative, informative, discriminative forms
  Predictive mining: based on the data and its analysis, constructs models for the database and predicts the trends and properties of unknown data
 Concept description:
  Characterization: provides a concise and succinct summarization of a given collection of data
  Comparison: provides descriptions comparing two or more collections of data
Data Generalization and Summarization-Based Characterization
 Data generalization
  A process that abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones.
  [Figure: a ladder of conceptual levels, 1 through 5]
 Approaches:
  Data cube approach (OLAP approach)
  Attribute-oriented induction approach
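Concept-hierarchy climbing, the mechanism behind data generalization, can be sketched in a few lines. The city/province/country hierarchy below is a made-up illustration, not data from the text:

```python
# Minimal sketch of data generalization via concept-hierarchy climbing.
# The hierarchy (city -> province -> country) is an illustrative assumption.
HIERARCHY = {
    "Vancouver": "British Columbia",
    "Richmond": "British Columbia",
    "British Columbia": "Canada",
    "Seattle": "Washington",
    "Washington": "USA",
}

def generalize(value, levels=1):
    """Replace a value by its ancestor `levels` steps up the hierarchy."""
    for _ in range(levels):
        value = HIERARCHY.get(value, value)  # values at the top stay put
    return value

print(generalize("Vancouver", 2))  # Canada
print(generalize("Seattle", 2))    # USA
```

Each call climbs one conceptual level per step, which is exactly the low-to-high abstraction the slide describes.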


Attribute-Oriented Induction
 Proposed in 1989 (KDD '89 workshop)
 Not confined to categorical data nor particular measures
 How is it done?
  Collect the task-relevant data (initial relation) using a relational database query
  Perform generalization by attribute removal or attribute generalization
  Apply aggregation by merging identical generalized tuples and accumulating their respective counts
  Interactive presentation with users



Basic Principles of Attribute-Oriented Induction
 Data focusing: select the task-relevant data, including dimensions; the result is the initial relation
 Attribute removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A's higher-level concepts are expressed in terms of other attributes
 Attribute generalization: if there is a large set of distinct values for A and there exists a set of generalization operators on A, then select an operator and generalize A
 Attribute generalization threshold control: typically 2-8, user-specified or default
 Generalized relation threshold control: controls the final relation/rule size
Attribute-Oriented Induction: Basic Algorithm
 InitialRel: query processing of the task-relevant data, deriving the initial relation.
 PreGen: based on the analysis of the number of distinct values in each attribute, determine a generalization plan for each attribute: removal, or how high to generalize.
 PrimeGen: based on the PreGen plan, perform generalization to the right level to derive a "prime generalized relation", accumulating the counts.
 Presentation: user interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, and visualization presentations.

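The InitialRel → PreGen → PrimeGen steps above can be sketched as follows. The generalization operators, the threshold value, and the toy rows are illustrative assumptions, not the book's data:

```python
from collections import Counter

# One-step generalization operators per attribute (illustrative assumption).
GEN = {
    "major":      {"CS": "Science", "Physics": "Science", "MBA": "Business"},
    "birth_city": {"Vancouver": "Canada", "Montreal": "Canada", "Seattle": "Foreign"},
}
THRESHOLD = 2  # attribute generalization threshold (typically 2-8)

def aoi(rows, attrs):
    # PreGen: per attribute, decide keep vs. generalize vs. remove.
    plan = {}
    for a in attrs:
        distinct = {r[a] for r in rows}
        if len(distinct) <= THRESHOLD:
            plan[a] = "keep"
        elif a in GEN:
            plan[a] = "generalize"
        else:
            plan[a] = "remove"  # many values, no generalization operator
    # PrimeGen: apply the plan, then merge identical tuples and count them.
    out = Counter()
    for r in rows:
        key = tuple(GEN[a].get(r[a], r[a]) if plan[a] == "generalize" else r[a]
                    for a in attrs if plan[a] != "remove")
        out[key] += 1
    return out

rows = [
    {"name": "Jim",   "major": "CS",      "birth_city": "Vancouver"},
    {"name": "Scott", "major": "CS",      "birth_city": "Montreal"},
    {"name": "Laura", "major": "Physics", "birth_city": "Seattle"},
]
print(aoi(rows, ["name", "major", "birth_city"]))
# name is removed (many values, no operator); birth_city is generalized;
# identical generalized tuples merge with accumulated counts.
```

This is a single-pass sketch; the full algorithm iterates generalization until the relation also satisfies the generalized relation threshold.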


Example
 DMQL: Describe general characteristics of graduate
students in the Big-University database
use Big_University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major, birth_place,
birth_date, residence, phone#, gpa
from student
where status in “graduate”
 Corresponding SQL statement:
Select name, gender, major, birth_place, birth_date,
residence, phone#, gpa
from student
where status in {“Msc”, “MA”, “MBA”, “PhD” }
Class Characterization: An Example

Initial working relation (collection of task-relevant data):

Name           | Gender | Major   | Birth-Place           | Birth_date | Residence                | Phone #  | GPA
Jim Woodman    | M      | CS      | Vancouver, BC, Canada | 8-12-76    | 3511 Main St., Richmond  | 687-4598 | 3.67
Scott Lachance | M      | CS      | Montreal, Que, Canada | 28-7-75    | 345 1st Ave., Richmond   | 253-9106 | 3.70
Laura Lee      | F      | Physics | Seattle, WA, USA      | 25-8-70    | 125 Austin Ave., Burnaby | 420-5232 | 3.83
…              | …      | …       | …                     | …          | …                        | …        | …

Generalization plan: Name removed; Gender retained; Major generalized to {Sci, Eng, Bus}; Birth-Place to country; Birth_date to age range; Residence to city; Phone # removed; GPA to {Excl, VG, …}.

Prime generalized relation:

Gender | Major   | Birth_region | Age_range | Residence | GPA       | Count
M      | Science | Canada       | 20-25     | Richmond  | Very-good | 16
F      | Science | Foreign      | 25-30     | Burnaby   | Excellent | 22
…      | …       | …            | …         | …         | …         | …

Crosstab of Gender vs. Birth_Region:

Gender | Canada | Foreign | Total
M      | 16     | 14      | 30
F      | 10     | 22      | 32
Total  | 26     | 36      | 62


Presentation of Generalized Results
 Generalized relation:
  Relations where some or all attributes are generalized, with counts or other aggregation values accumulated.
 Cross tabulation:
  Mapping results into cross-tabulation form (similar to contingency tables).
 Visualization techniques:
  Pie charts, bar charts, curves, cubes, and other visual forms.
 Quantitative characteristic rules:
  Mapping a generalized result into characteristic rules with associated quantitative information, e.g.,
   ∀X, grad(X) ∧ male(X) ⇒
     birth_region(X) = "Canada" [t: 53%] ∨ birth_region(X) = "foreign" [t: 47%]
Mining Class Comparisons
 Comparison: comparing two or more classes
 Method:
  Data collection
  Dimension relevance analysis
  Synchronous generalization:
   prime target class relation
   prime contrasting class relation
  Presentation of the derived comparison
 Relevance analysis:
  Find attributes (features) which best distinguish different classes
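The slide leaves the relevance measure open; information gain, as used in analytical characterization, is one common choice. A sketch on made-up rows (the data and attribute names are illustrative assumptions):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, class_attr="class"):
    """How much knowing `attr` reduces uncertainty about the class."""
    base = entropy([r[class_attr] for r in rows])
    n = len(rows)
    split = 0.0
    for v in {r[attr] for r in rows}:
        part = [r[class_attr] for r in rows if r[attr] == v]
        split += len(part) / n * entropy(part)
    return base - split

rows = [  # toy data: does `major` or `gender` better separate the classes?
    {"gender": "M", "major": "Science",  "class": "grad"},
    {"gender": "F", "major": "Science",  "class": "grad"},
    {"gender": "M", "major": "Business", "class": "undergrad"},
    {"gender": "F", "major": "Business", "class": "undergrad"},
]
print(info_gain(rows, "major"))   # 1.0 -> perfectly distinguishing
print(info_gain(rows, "gender"))  # 0.0 -> irrelevant to the comparison
```

Attributes with the highest gain are kept for the comparison; those near zero are dropped before generalization.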


 Compare the general properties between the graduate
students and the undergraduate students at Big
University, given the attributes name, gender, major,
birth place, birth date, residence, phone#, and gpa.



Initial working relation for the target class (graduate students)

Initial working relation for the contrasting class (undergraduate students)



Prime generalized relation for the target class (graduate students)

Prime generalized relation for the contrasting class (undergraduate students)



Quantitative Discriminant Rules
 Cj = target class
 qa = a generalized tuple that covers some tuples of the target class, but may also cover some tuples of the contrasting classes
 d-weight
  range: [0, 1]
  d_weight = count(qa ∈ Cj) / Σ_{i=1..m} count(qa ∈ Ci)
 Quantitative discriminant rule form
  ∀X, target_class(X) ⇐ condition(X) [d: d_weight]



Example: Quantitative Discriminant Rule

Status        | Major   | Age_range | Gpa  | Count
Graduate      | Science | 21-25     | Good | 90
Undergraduate | Science | 21-25     | Good | 210

Count distribution between graduate and undergraduate students for a generalized tuple

 Quantitative discriminant rule:
  ∀X, graduate_student(X) ⇐
    major(X) = "Science" ∧ age_range(X) = "21-25" ∧ gpa(X) = "good" [d: 30%]
 where d_weight = 90 / (90 + 210) = 30%
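The d-weight formula reduces to a one-liner; with the counts from the example table above it reproduces d = 30%:

```python
# d-weight of a generalized tuple: its count in the target class divided
# by its total count over all classes (target plus contrasting).
def d_weight(target_count, all_counts):
    return target_count / sum(all_counts)

# Counts from the graduate/undergraduate example: 90 vs. 210.
print(d_weight(90, [90, 210]))  # 0.3, i.e. d = 30%
```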


Class Description
 Quantitative characteristic rule
  ∀X, target_class(X) ⇒ condition(X) [t: t_weight]
  a necessary condition of the target class
 Quantitative discriminant rule
  ∀X, target_class(X) ⇐ condition(X) [d: d_weight]
  a sufficient condition of the target class
 Quantitative description rule
  ∀X, target_class(X) ⇔
    condition_1(X) [t: w1, d: w′1] ∨ … ∨ condition_n(X) [t: wn, d: w′n]
  a necessary and sufficient condition



Example: Quantitative Description Rule

Location/item |        TV            |      Computer        |     Both_items
              | Count  t-wt    d-wt  | Count  t-wt    d-wt  | Count  t-wt   d-wt
Europe        |   80   25%     40%   |  240   75%     30%   |  320   100%   32%
N_Am          |  120   17.65%  60%   |  560   82.35%  70%   |  680   100%   68%
Both_regions  |  200   20%     100%  |  800   80%     100%  | 1000   100%   100%

Crosstab showing the associated t-weight and d-weight values and the total number (in thousands) of TVs and computers sold at AllElectronics in 1998

 Quantitative description rule for target class Europe:
  ∀X, Europe(X) ⇔
    (item(X) = "TV") [t: 25%, d: 40%] ∨ (item(X) = "computer") [t: 75%, d: 30%]

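Both weights in the crosstab above can be recomputed from the raw cell counts. A sketch using only the Europe/N_Am cells (the Both_ row and column are derived totals):

```python
# Cell counts (in thousands) from the AllElectronics crosstab above.
counts = {
    ("Europe", "TV"): 80,  ("Europe", "computer"): 240,
    ("N_Am",   "TV"): 120, ("N_Am",   "computer"): 560,
}

def t_weight(cls, item):
    """Cell count / row total: share of the class covered by this item."""
    row = sum(v for (c, _), v in counts.items() if c == cls)
    return counts[(cls, item)] / row

def d_weight(cls, item):
    """Cell count / column total over all classes."""
    col = sum(v for (_, i), v in counts.items() if i == item)
    return counts[(cls, item)] / col

print(t_weight("Europe", "TV"))  # 0.25 -> t: 25%
print(d_weight("Europe", "TV"))  # 0.4  -> d: 40%
```

These match the Europe/TV cell of the crosstab, and the same two functions reproduce every other t-wt and d-wt entry.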


Summary
 Efficient algorithms for computing data cubes
  Multiway array aggregation
  BUC
  H-cubing
  Star-cubing
  High-D OLAP by minimal cubing
 Further development of data cube technology
  Discovery-driven cubes
  Multi-feature cubes
  Cube-gradient analysis
 Another generalization approach: attribute-oriented induction



References (I)
 S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan,
and S. Sarawagi. On the computation of multidimensional aggregates. VLDB’96
 D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data
warehouses. SIGMOD’97
 R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE’97
 K. Beyer and R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg CUBEs. SIGMOD'99
 Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, Multi-Dimensional Regression
Analysis of Time-Series Data Streams, VLDB'02
 G. Dong, J. Han, J. Lam, J. Pei, K. Wang. Mining Multi-dimensional Constrained Gradients in Data Cubes. VLDB'01
 J. Han, Y. Cai and N. Cercone, Knowledge Discovery in Databases: An Attribute-Oriented
Approach, VLDB'92
 J. Han, J. Pei, G. Dong, K. Wang. Efficient Computation of Iceberg Cubes With Complex
Measures. SIGMOD’01
References (II)
 L. V. S. Lakshmanan, J. Pei, and J. Han, Quotient Cube: How to Summarize the
Semantics of a Data Cube, VLDB'02
 X. Li, J. Han, and H. Gonzalez, High-Dimensional OLAP: A Minimal Cubing Approach,
VLDB'04
 K. Ross and D. Srivastava. Fast computation of sparse datacubes. VLDB’97
 K. A. Ross, D. Srivastava, and D. Chatziantoniou. Complex aggregation at multiple
granularities. EDBT'98
 S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data
cubes. EDBT'98
 G. Sathe and S. Sarawagi. Intelligent Rollups in Multidimensional OLAP Data. VLDB'01
 D. Xin, J. Han, X. Li, B. W. Wah, Star-Cubing: Computing Iceberg Cubes by Top-Down
and Bottom-Up Integration, VLDB'03
 D. Xin, J. Han, Z. Shao, H. Liu, C-Cubing: Efficient Computation of Closed Cubes by
Aggregation-Based Checking, ICDE'06
 W. Wang, H. Lu, J. Feng, J. X. Yu, Condensed Cube: An Effective Approach to Reducing
Data Cube Size. ICDE’02
 Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for
simultaneous multidimensional aggregates. SIGMOD’97
