Chap3 Basic Classification New 2

Uploaded by

abeeralammar11

Data Mining

Classification: Basic Concepts and Techniques

Lecture Notes for Chapter 3

Introduction to Data Mining, 2nd Edition


by
Tan, Steinbach, Karpatne, Kumar

2/1/2021 Introduction to Data Mining, 2 nd Edition 1


Classification: Definition

• Given a collection of records (training set)
  – Each record is characterized by a tuple (x, y), where x is the attribute set and y is the class label
    • x: attribute, predictor, independent variable, input
    • y: class, response, dependent variable, output

• Task:
  – Learn a model that maps each attribute set x into one of the predefined class labels y



Examples of Classification Task

Task                        | Attribute set, x                                         | Class label, y
Categorizing email messages | Features extracted from email message header and content | spam or non-spam
Identifying tumor cells     | Features extracted from x-rays or MRI scans              | malignant or benign cells
Cataloging galaxies         | Features extracted from telescope images                 | elliptical, spiral, or irregular-shaped galaxies



General Approach for Building Classification Model



Classification Techniques

• Base Classifiers
  – Decision Tree based Methods
  – Rule-based Methods
  – Nearest-neighbor
  – Naïve Bayes and Bayesian Belief Networks
  – Support Vector Machines (SVM) Text mining
  – Neural Networks, Deep Neural Nets

• Ensemble Classifiers
  – Boosting, Bagging, Random Forests



Example of a Decision Tree

Training Data
(attribute types: Home Owner and Marital Status are categorical, Annual Income is continuous, Defaulted Borrower is the class)

ID | Home Owner | Marital Status | Annual Income | Defaulted Borrower
1  | Yes        | Single         | 125K          | No
2  | No         | Married        | 100K          | No
3  | No         | Single         | 70K           | No
4  | Yes        | Married        | 120K          | No
5  | No         | Divorced       | 95K           | Yes
6  | No         | Married        | 60K           | No
7  | Yes        | Divorced       | 220K          | No
8  | No         | Single         | 85K           | Yes
9  | No         | Married        | 75K           | No
10 | No         | Single         | 90K           | Yes

Model: Decision Tree
(splitting attributes: Home Owner, MarSt, Income)

Home Owner?
  Yes: NO
  No: MarSt?
    Single, Divorced: Income?
      < 80K: NO
      > 80K: YES
    Married: NO
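
The tree above can be read as a chain of nested tests. The following is a minimal sketch, not part of the original slides, that encodes it as a Python function and checks it against the ten training records. The names `TRAINING` and `classify` are made up for illustration, and the handling of an income of exactly 80K is an assumption, because the slide only labels the branches "< 80K" and "> 80K".

```python
# Training records from the slide:
# (home_owner, marital_status, income_in_thousands, defaulted)
TRAINING = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]


def classify(home_owner, marital_status, income_k):
    """Walk the slide's tree from the root and return the leaf label."""
    if home_owner == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Assumption: exactly 80K is not covered by the slide's branch labels;
    # this sketch sends it to the "No" leaf.
    return "Yes" if income_k > 80 else "No"


predictions = [classify(h, m, inc) for h, m, inc, _ in TRAINING]
labels = [label for *_, label in TRAINING]
num_correct = sum(p == y for p, y in zip(predictions, labels))
print(f"{num_correct}/{len(TRAINING)} training records classified correctly")
```

Under these assumptions the tree reproduces the class label of every training record.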



Apply Model to Test Data

Test Data

Home Owner | Marital Status | Annual Income | Defaulted Borrower
No         | Married        | 80K           | ?

Start from the root of tree.

Home Owner?
  Yes: NO
  No: MarSt?
    Single, Divorced: Income?
      < 80K: NO
      > 80K: YES
    Married: NO

Apply Model to Test Data

Test Data

Home Owner | Marital Status | Annual Income | Defaulted Borrower
No         | Married        | 80K           | ?

Home Owner?
  Yes: NO
  No: MarSt?
    Single, Divorced: Income?
      < 80K: NO
      > 80K: YES
    Married: NO   (Assign Defaulted to “No”)
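
The walk from the root to a leaf can be sketched in code. The example below is an illustration only, not taken from the slides: it stores the tree as nested dictionaries and records each test taken for the test record. The names `TREE`, `outcome`, and `trace` are hypothetical, and the treatment of an income of exactly 80K is an assumption (it does not matter for this record, which reaches a leaf before the income test).

```python
# Internal nodes are dicts: {"attr": name, "branches": {outcome: subtree}}.
# Leaves are plain strings holding the predicted class label.
TREE = {
    "attr": "Home Owner",
    "branches": {
        "Yes": "No",
        "No": {
            "attr": "Marital Status",
            "branches": {
                "Married": "No",
                "Single, Divorced": {
                    "attr": "Annual Income",
                    "branches": {"< 80K": "No", "> 80K": "Yes"},
                },
            },
        },
    },
}


def outcome(attr, value):
    """Map a raw attribute value to the branch label used in the tree."""
    if attr == "Marital Status":
        return "Married" if value == "Married" else "Single, Divorced"
    if attr == "Annual Income":
        # Assumption: exactly 80K is not covered by the slide's labels;
        # here it follows the "< 80K" branch.
        return "> 80K" if value > 80 else "< 80K"
    return value


def trace(tree, record):
    """Return (predicted label, list of tests taken from root to leaf)."""
    path = []
    node = tree
    while isinstance(node, dict):
        attr = node["attr"]
        branch = outcome(attr, record[attr])
        path.append(f"{attr} = {branch}")
        node = node["branches"][branch]
    return node, path


test_record = {"Home Owner": "No", "Marital Status": "Married", "Annual Income": 80}
label, path = trace(TREE, test_record)
print(" -> ".join(path), "=>", label)
```

For the test record this follows Home Owner = No, then Marital Status = Married, and ends at the leaf "No", matching the slide.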



Another Example of Decision Tree

Training Data
(attribute types: Home Owner and Marital Status are categorical, Annual Income is continuous, Defaulted Borrower is the class)

ID | Home Owner | Marital Status | Annual Income | Defaulted Borrower
1  | Yes        | Single         | 125K          | No
2  | No         | Married        | 100K          | No
3  | No         | Single         | 70K           | No
4  | Yes        | Married        | 120K          | No
5  | No         | Divorced       | 95K           | Yes
6  | No         | Married        | 60K           | No
7  | Yes        | Divorced       | 220K          | No
8  | No         | Single         | 85K           | Yes
9  | No         | Married        | 75K           | No
10 | No         | Single         | 90K           | Yes

MarSt?
  Married: NO
  Single, Divorced: Home Owner?
    Yes: NO
    No: Income?
      < 80K: NO
      > 80K: YES

There could be more than one tree that fits the same data!
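
As a rough check of that remark, the sketch below (not from the slides) encodes both example trees and compares their predictions over every combination of the categorical values, with incomes on both sides of the 80K threshold. The function names `tree_a` and `tree_b` are made up for illustration, and both functions share the same assumption about an income of exactly 80K.

```python
from itertools import product


def tree_a(home_owner, marital_status, income_k):
    """First example tree: Home Owner at the root."""
    if home_owner == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Assumption: exactly 80K goes to the "No" leaf in both trees.
    return "Yes" if income_k > 80 else "No"


def tree_b(home_owner, marital_status, income_k):
    """Second example tree: Marital Status at the root."""
    if marital_status == "Married":
        return "No"
    if home_owner == "Yes":
        return "No"
    return "Yes" if income_k > 80 else "No"


# Every combination of categorical values, with incomes below, at, and above 80K.
grid = list(product(["Yes", "No"], ["Single", "Married", "Divorced"], [60, 80, 100]))
disagreements = [r for r in grid if tree_a(*r) != tree_b(*r)]
print(len(grid), "records checked,", len(disagreements), "disagreements")
```

These two particular trees apply the same tests in a different order, so they agree on every record, not only on the training data. In general, different trees that fit the same training data may still disagree on unseen records.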



Decision Tree Classification Task

Training Set

Tid | Attrib1 | Attrib2 | Attrib3 | Class
1   | Yes     | Large   | 125K    | No
2   | No      | Medium  | 100K    | No
3   | No      | Small   | 70K     | No
4   | Yes     | Medium  | 120K    | No
5   | No      | Large   | 95K     | Yes
6   | No      | Medium  | 60K     | No
7   | Yes     | Large   | 220K    | No
8   | No      | Small   | 85K     | Yes
9   | No      | Medium  | 75K     | No
10  | No      | Small   | 90K     | Yes

Test Set

Tid | Attrib1 | Attrib2 | Attrib3 | Class
11  | No      | Small   | 55K     | ?
12  | Yes     | Medium  | 80K     | ?
13  | Yes     | Large   | 110K    | ?
14  | No      | Small   | 95K     | ?
15  | No      | Large   | 67K     | ?

Induction: Training Set -> Tree Induction algorithm -> Learn Model -> Model (Decision Tree)
Deduction: Model (Decision Tree) -> Apply Model -> Test Set



Decision Tree Induction

• Many Algorithms:
  – Hunt’s Algorithm (one of the earliest)
  – CART
  – ID3, C4.5
  – SLIQ, SPRINT



Design Issues of Decision Tree Induction

• How should training records be split?
  – Method for expressing test condition
    • depending on attribute types
  – Measure for evaluating the goodness of a test condition

• How should the splitting procedure stop?
  – Stop splitting if all the records belong to the same class or have identical attribute values
  – Early termination
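
The first stopping rule above can be written as a small predicate. This is a minimal sketch under assumed conventions, not part of the slides: each record is an `(attributes, class_label)` pair, and the function name `should_stop` is hypothetical. Early termination is not modeled here, since the slides do not say which criterion it uses.

```python
def should_stop(records):
    """records: list of (attributes_tuple, class_label) pairs at one node."""
    labels = {label for _, label in records}
    attribute_sets = {attrs for attrs, _ in records}
    # Stop when the node is pure, or when no test can separate the records.
    return len(labels) <= 1 or len(attribute_sets) <= 1


pure = [(("No", "Married"), "No"), (("Yes", "Single"), "No")]
identical = [(("No", "Single"), "Yes"), (("No", "Single"), "No")]
mixed = [(("No", "Single"), "Yes"), (("Yes", "Single"), "No")]
print(should_stop(pure), should_stop(identical), should_stop(mixed))
```

Only the `mixed` node still has both different labels and different attribute values, so it is the only one that would be split further.
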
Methods for Expressing Test Conditions

• Depends on attribute types
  – Binary
  – Nominal
  – Ordinal
  – Continuous



Test Condition for Nominal Attributes

• Multi-way split:
  – Use as many partitions as distinct values.

  Marital Status: Single | Divorced | Married

• Binary split:
  – Divides values into two subsets

  Marital Status: {Married} | {Single, Divorced}
  OR
  Marital Status: {Single} | {Married, Divorced}
  OR
  Marital Status: {Single, Married} | {Divorced}
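
The three binary splits listed above are all the ways to divide three nominal values into two non-empty subsets. The sketch below, which is an illustration and not part of the slides, enumerates them; the function name `binary_splits` is made up. For k distinct values it yields 2^(k-1) - 1 splits.

```python
from itertools import combinations


def binary_splits(values):
    """Yield each way to divide `values` into two non-empty subsets, once."""
    values = list(values)
    first, rest = values[0], values[1:]
    # Pin the first value to the left subset so mirror images are not repeated.
    for size in range(len(rest) + 1):
        for extra in combinations(rest, size):
            left = {first, *extra}
            right = set(values) - left
            if right:
                yield left, right


splits = list(binary_splits(["Single", "Divorced", "Married"]))
for left, right in splits:
    print(sorted(left), "|", sorted(right))
```

For Marital Status this produces three splits, the same groupings as on the slide.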



Test Condition for Ordinal Attributes

• Multi-way split:
  – Use as many partitions as distinct values

  Shirt Size: Small | Medium | Large | Extra Large

• Binary split:
  – Divides values into two subsets
  – Preserve order property among attribute values

  Shirt Size: {Small, Medium} | {Large, Extra Large}
  Shirt Size: {Small} | {Medium, Large, Extra Large}
  Shirt Size: {Small, Large} | {Medium, Extra Large}   (This grouping violates order property)
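
One way to picture the order property: a binary split of an ordinal attribute is valid only if it cuts the ordered list of values at a single position. The sketch below is an illustration under that reading, not part of the slides; `ORDER`, `ordered_binary_splits`, and `preserves_order` are hypothetical names.

```python
ORDER = ["Small", "Medium", "Large", "Extra Large"]


def ordered_binary_splits(order):
    """Cut the ordered value list at each position: k values give k - 1 splits."""
    return [(order[:i], order[i:]) for i in range(1, len(order))]


def preserves_order(left, right, order):
    """True if one subset lies entirely below the other in `order`."""
    rank = {value: i for i, value in enumerate(order)}
    left_ranks = [rank[v] for v in left]
    right_ranks = [rank[v] for v in right]
    return max(left_ranks) < min(right_ranks) or max(right_ranks) < min(left_ranks)


for left, right in ordered_binary_splits(ORDER):
    print(left, "|", right)

print(preserves_order(["Small", "Medium"], ["Large", "Extra Large"], ORDER))
print(preserves_order(["Small", "Large"], ["Medium", "Extra Large"], ORDER))
```

The first check passes, while {Small, Large} versus {Medium, Extra Large} fails because the two subsets interleave in the size ordering.
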
Test Condition for Continuous Attributes

(i) Binary split

  Annual Income > 80K?: Yes | No

(ii) Multi-way split

  Annual Income?: < 10K | [10K,25K) | [25K,50K) | [50K,80K) | > 80K
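
Both kinds of test on a continuous attribute can be sketched as small functions. This example is an illustration, not part of the slides: the names `binary_test`, `multiway_test`, `EDGES`, and `LABELS` are made up, incomes are expressed in thousands, and the placement of exactly 80K in the top bin is an assumption because the slide's ranges leave that value uncovered.

```python
from bisect import bisect_right


def binary_test(income_k, threshold_k=80):
    """Binary split: answer the question 'Annual Income > 80K?'."""
    return "Yes" if income_k > threshold_k else "No"


# Bin edges in thousands, taken from the slide's multi-way split.
EDGES = [10, 25, 50, 80]
# Assumption: the slide labels the top range "> 80K"; it is written ">= 80K"
# here because this sketch places exactly 80K in that bin.
LABELS = ["< 10K", "[10K,25K)", "[25K,50K)", "[50K,80K)", ">= 80K"]


def multiway_test(income_k):
    """Multi-way split: return the range that contains the income."""
    return LABELS[bisect_right(EDGES, income_k)]


for income in (5, 10, 60, 80, 125):
    print(income, binary_test(income), multiway_test(income))
```

The multi-way version amounts to discretizing the attribute into ranges first and then treating the range as an ordinal value.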



Decision Tree Based Classification
• Advantages:
  – Relatively inexpensive to construct
  – Extremely fast at classifying unknown records
  – Easy to interpret for small-sized trees
  – Robust to noise (especially when methods to avoid overfitting are employed)
  – Can easily handle redundant attributes
  – Can easily handle irrelevant attributes (unless the attributes are interacting)
• Disadvantages:
  – Due to the greedy nature of the splitting criterion, interacting attributes (that can distinguish between classes together but not individually) may be passed over in favor of other attributes that are less discriminating.
  – Each decision boundary involves only a single attribute
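
The point about interacting attributes can be seen on XOR-style data. The sketch below is a toy illustration, not taken from the slides: the four records, the function names, and the use of misclassification count as the goodness measure are all assumptions made for the example.

```python
from collections import Counter

# XOR-style data: the class is "+" exactly when the two attributes differ.
RECORDS = [((0, 0), "-"), ((0, 1), "+"), ((1, 0), "+"), ((1, 1), "-")]


def errors(records):
    """Records misclassified when a node predicts its majority class."""
    counts = Counter(label for _, label in records)
    return len(records) - max(counts.values()) if records else 0


def errors_after_split(records, attr_index):
    """Total errors after one split on the attribute at `attr_index`."""
    total = 0
    for value in {attrs[attr_index] for attrs, _ in records}:
        child = [r for r in records if r[0][attr_index] == value]
        total += errors(child)
    return total


print("no split:", errors(RECORDS))
print("split on x1:", errors_after_split(RECORDS, 0))
print("split on x2:", errors_after_split(RECORDS, 1))

# Splitting on x1 and then x2 isolates each record, so every leaf is pure.
leaves = {}
for attrs, label in RECORDS:
    leaves.setdefault(attrs, []).append((attrs, label))
print("split on x1 then x2:", sum(errors(leaf) for leaf in leaves.values()))
```

Neither attribute reduces the error on its own, so a greedy criterion sees no reason to pick either one, even though using both together separates the classes perfectly.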
