Classification Part 1
CLASSIFICATION
Task of assigning objects to one of several predefined categories.
Examples
Predicting tumor cells as benign or malignant
Classifying credit card transactions as legitimate or fraudulent
Detecting spam email messages based on message header and content
Categorizing news stories as finance, weather, entertainment, sports, etc.
Classifying galaxies based on their shape
CLASSIFICATION
[Figure: general classification framework — a learning algorithm induces a model from the labelled training set, and the model is then applied to a test set of records whose class labels are unknown.]
CONFUSION MATRIX
[Figure: confusion matrix — a table tabulating the counts of correct and incorrect predictions for each class.]
DECISION TREE
The series of questions and their possible answers can be organized in the form of a decision tree.
It is a hierarchical structure that consists of nodes and directed edges.
A tree has three types of nodes:
• Root node: exactly one, with no incoming edges and zero or more outgoing edges.
• Internal nodes: exactly one incoming edge and two or more outgoing edges.
• Leaf (terminal) nodes: exactly one incoming edge and no outgoing edges.
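The three node types can be represented with a minimal sketch (the Node class and its field names are illustrative, not from the slides):

```python
# Illustrative Node class: a node with a class label is a leaf; a node
# with a test and outgoing edges (children) is internal; the node with
# no incoming edge is the root.
class Node:
    def __init__(self, test=None, label=None):
        self.test = test        # attribute tested at an internal/root node
        self.children = {}      # outcome -> child Node (directed edges)
        self.label = label      # class label, set only at leaf nodes

    def is_leaf(self):
        return self.label is not None

# Root with two outgoing edges, each leading to a leaf.
root = Node(test="Refund")
root.children["Yes"] = Node(label="No")
root.children["No"] = Node(label="Yes")
print(root.is_leaf(), root.children["Yes"].is_leaf())  # False True
```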
HOW DOES A DECISION TREE WORK?
[Figure: example training data and a decision tree induced from it.]
ANOTHER EXAMPLE OF DECISION TREE
[Figure: an alternative decision tree induced from the same training set and applied to the test set — more than one tree can fit the same data.]
APPLY MODEL TO TEST DATA
Start from the root of the tree.
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
[Figure: decision tree — Refund (Yes → leaf NO; No → MarSt); MarSt (Married → leaf NO; Single, Divorced → TaxInc); TaxInc (< 80K → leaf NO; > 80K → leaf YES).]
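The traversal can be sketched directly in code; a hedged sketch, with attribute and label names following the slide's tree (Refund, MarSt, TaxInc):

```python
# Walk the slide's tree for one test record: each test routes the
# record down a branch until a leaf's class label is returned.
def predict(record):
    if record["Refund"] == "Yes":
        return "No"                              # Refund = Yes -> leaf NO
    if record["Marital Status"] == "Married":
        return "No"                              # MarSt = Married -> leaf NO
    return "Yes" if record["Taxable Income"] > 80 else "No"   # TaxInc test

test_record = {"Refund": "No", "Marital Status": "Married", "Taxable Income": 80}
print(predict(test_record))  # "No" -- the record reaches the Married leaf
```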
APPLY MODEL TO TEST DATA
Following Refund = No to the MarSt node, and Marital Status = Married to the leaf labelled NO, the traversal ends: assign Cheat to "No".
DECISION TREE CLASSIFICATION TASK
[Figure: decision tree classification task — a tree-induction algorithm learns a decision tree from the training set, and the tree is then applied as the model to the test set records with unknown class labels.]
DECISION TREE INDUCTION
Many Algorithms:
Hunt’s Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ, SPRINT
GENERAL STRUCTURE OF HUNT’S ALGORITHM
Let Dt be the set of training records that reach a node t.
If all records in Dt belong to the same class, then t is a leaf node labelled with that class.
If Dt contains records that belong to more than one class, use an attribute test condition to split the data into smaller subsets. Recursively apply the procedure to each subset.
[Figure: node t with records Dt and an unresolved class label "?", alongside the training data table (Tid, Refund, Marital Status, Taxable Income, Cheat).]
HUNT’S ALGORITHM
[Figure: Hunt's algorithm growing the tree on the loan data. Step 1: a single leaf, Don't Cheat. Step 2: split on Refund (Yes → Don't Cheat; No → Don't Cheat). Step 3: under Refund = No, split on Marital Status (Married → Don't Cheat; Single, Divorced → Cheat). Step 4: under Single, Divorced, split on Taxable Income (< 80K → Don't Cheat; >= 80K → Cheat). The training table (Tid 1–6 shown) lists Refund, Marital Status, Taxable Income, Cheat.]
HUNT'S ALGORITHM
Predicting whether a loan applicant will repay her loan obligations or become delinquent and subsequently default on the loan.
• The initial tree contains a single node with class label Defaulted = "No", which means most borrowers successfully repaid their loans.
• However, the tree needs refinement.
HUNT’S ALGORITHM
Hunt’s algorithm will work if every combination of attribute values is present in the
training data and each combination has a unique class label.
These assumptions are too stringent for use in most practical scenarios.
Additional conditions are needed to handle the following cases:
It is possible for some of the child nodes created to be empty, i.e., there are no records associated with these nodes. This can happen if none of the training records has the combination of attribute values associated with such nodes.
In this case, the node is declared a leaf node with the same class label as the majority class of training records associated with its parent node.
HUNT’S ALGORITHM
If all the records associated with Dt have identical attribute values (except for the
class label), then it is not possible to split the records any further.
In this case, the node is declared a leaf node with the same class label as the
majority class of training records associated with this node.
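Hunt's recursive procedure, together with the two extra conditions above, can be sketched as follows (a hedged sketch: the attribute choice is a placeholder, since attribute selection is covered later):

```python
# Sketch of Hunt's algorithm: an empty child becomes a leaf with the
# parent's majority class; records with identical attribute values (or
# no attributes left) become a leaf with the node's majority class.
from collections import Counter

def majority(records, target):
    return Counter(r[target] for r in records).most_common(1)[0][0]

def hunt(records, attributes, target, parent_majority=None):
    if not records:                                  # empty node
        return parent_majority
    classes = {r[target] for r in records}
    if len(classes) == 1:                            # pure node: leaf
        return classes.pop()
    if not attributes or all(
        r[a] == records[0][a] for r in records for a in attributes
    ):                                               # cannot split further
        return majority(records, target)
    attr = attributes[0]   # placeholder choice; a real implementation
                           # picks the attribute with the best split
    node = {}
    for value in {r[attr] for r in records}:
        subset = [r for r in records if r[attr] == value]
        node[(attr, value)] = hunt(subset, attributes[1:], target,
                                   majority(records, target))
    return node

data = [{"Refund": "Yes", "Cheat": "No"},
        {"Refund": "No",  "Cheat": "Yes"},
        {"Refund": "No",  "Cheat": "Yes"}]
tree = hunt(data, ["Refund"], "Cheat")
print(tree)  # {('Refund', 'Yes'): 'No', ('Refund', 'No'): 'Yes'} (key order may vary)
```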
TREE INDUCTION
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
Continue expanding a node until either all records belong to the same class or all
records have identical attribute values
METHODS FOR EXPRESSING ATTRIBUTE TEST CONDITIONS
Binary split: divides the values into two subsets; the optimal partitioning must be found.
[Figure: two binary splits on CarType — {Sports, Luxury} vs {Family}, or {Family, Luxury} vs {Sports}.]
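Finding the optimal partitioning means considering every two-way grouping; for a k-valued nominal attribute there are 2^(k-1) − 1 distinct binary splits. A small sketch (the function name is illustrative):

```python
# Enumerate the distinct two-way partitions of a nominal attribute's
# values; for k values there are 2**(k-1) - 1 of them.
from itertools import combinations

def binary_splits(values):
    values = sorted(values)
    splits = []
    for size in range(1, len(values)):
        for left in combinations(values, size):
            right = tuple(v for v in values if v not in left)
            if left < right:       # keep each unordered pair only once
                splits.append((left, right))
    return splits

for left, right in binary_splits({"Family", "Sports", "Luxury"}):
    print(set(left), "vs", set(right))   # three splits in total
```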
SPLITTING BASED ON ORDINAL ATTRIBUTES
[Figure: splits on the ordinal attribute Size (Small, Medium, Large) — a multiway split into Small / Medium / Large; binary splits {Small, Medium} vs {Large} or {Small} vs {Medium, Large}, which preserve the order; and {Small, Large} vs {Medium}, which violates it.]
SPLITTING BASED ON CONTINUOUS ATTRIBUTES
[Figure: splits on the continuous attribute Taxable Income — a binary split (Taxable Income > 80K? Yes / No) or a multiway split into ordered ranges such as < 10K, ..., > 80K.]
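For the binary case, candidate thresholds are commonly taken as midpoints between consecutive distinct sorted values; a sketch with illustrative data:

```python
# Candidate split thresholds for a continuous attribute: midpoints
# between consecutive distinct values after sorting.
def candidate_thresholds(values):
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]   # illustrative data
print(candidate_thresholds(incomes)[:3])  # [65.0, 72.5, 80.0]
```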
The measures developed for selecting the best split are often based on the degree of impurity of the child nodes.
The smaller the degree of impurity, the more skewed the class distribution.
A node with class distribution (0,1) has zero impurity and a node with uniform class
distribution (0.5,0.5) has the highest impurity.
MEASURES OF NODE IMPURITY
Gini Index
Entropy
Misclassification error
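The three measures can be sketched over a node's class probability distribution (a minimal sketch; the function names are illustrative):

```python
# Sketch of the three impurity measures, each taking a node's class
# probability distribution p (entries summing to 1).
import math

def gini(p):
    return 1 - sum(pi ** 2 for pi in p)

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def misclassification_error(p):
    return 1 - max(p)

# (0, 1) is pure; (0.5, 0.5) is maximally impure under all three measures.
print(gini([0.0, 1.0]), gini([0.5, 0.5]))       # 0.0 0.5
print(entropy([0.5, 0.5]))                      # 1.0 (the maximum for 2 classes)
print(misclassification_error([0.5, 0.5]))      # 0.5
```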
ALGORITHM
The best attribute is selected and used as the test at the root node of the tree.
A descendant of the root node is then created for each possible value of this
attribute, and the training examples are sorted to the appropriate descendant
node (i.e., down the branch corresponding to the example's value for this
attribute).
The entire process is then repeated using the training examples associated with
each descendant node to select the best attribute to test at that point in the
tree.
This forms a greedy search for an acceptable decision tree, in which the
algorithm never backtracks to reconsider earlier choices.
ID3 ALGORITHM
Examples: Training Examples
Target-attribute: Attribute to be predicted
Attributes: List of predictors
We would like to select the attribute that is most useful for classifying examples.
We will define a statistical property, called information gain, that
measures how well a given attribute separates the training examples
according to their target classification
ENTROPY
A measure used from Information Theory in the ID3 algorithm and popularly used in
decision tree construction is that of Entropy.
Entropy of a dataset measures the impurity of the dataset.
Informally, Entropy can be considered to find out how disordered the dataset is.
ENTROPY
It has been shown that there is a relationship between entropy and information.
That is, higher the uncertainty or entropy of some data, implies more information is
required to completely describe that data.
In building a decision tree, the aim is to decrease the entropy of the dataset until we
reach leaf nodes at which point the subset that we are left with is pure, or has zero
entropy and represents instances all of one class.
ENTROPY
Let p(j | t) denote the fraction of records belonging to class j at node t. The entropy of node t is defined as
Entropy(t) = − Σj p(j | t) log2 p(j | t)
Decision tree induction algorithms choose a test condition that maximizes the gain
(∆).
Since I(parent) is the same for all test conditions, maximizing the gain is equivalent to minimizing the weighted average impurity of the child nodes.
When entropy is used as the impurity measure, the difference in entropy is known as the information gain, ∆gain.
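The gain computation can be sketched over class-count vectors (a hedged sketch; the parent counts 9/5 match the play-tennis example later in these slides):

```python
# Information gain = parent entropy minus the weighted average entropy
# of the children, written over (class count) vectors.
import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent, children):
    n = sum(parent)
    weighted = sum(sum(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# Parent with counts (9 yes, 5 no) split three ways.
print(round(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # 0.247
```

(The slides report 0.246 for this split because intermediate terms are rounded before subtracting.)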
HOW TO FIND THE BEST SPLIT
Before splitting, the node has class counts C0: N00 and C1: N01, with impurity M0.
[Figure: two candidate splits — attribute A (Yes/No children with impurities M1 and M2, weighted average M12) and attribute B (Yes/No children with impurities M3 and M4, weighted average M34).]
Gain = M0 − M12 vs. M0 − M34: choose the attribute with the larger gain.
EXAMPLE
WHICH ATTRIBUTE TO SELECT?
[Figure: the 14-record play-tennis dataset with attributes Outlook, Temperature, Humidity, Wind and the target class.]
EXAMPLE
There are two classes of the target, play tennis: "yes" and "no". Therefore the entropy can be calculated as:
Entropy(S) = -pyes log2(pyes) - pno log2(pno)
Now, the next step is to select the most significant of the four input variables (Outlook, Temperature, Humidity, Wind), i.e., the one that splits the data most purely.
For this, we calculate the information gain that would result over the entire dataset after splitting on each attribute (Outlook, Temperature, Humidity, Wind).
EXAMPLE
Infogain(S | Outlook) = Entropy(S) - 5/14 Entropy(S | Outlook = Sunny) - 4/14 Entropy(S | Outlook = Overcast) - 5/14 Entropy(S | Outlook = Rain)
= 0.94 - (5/14)(-pyes log2 pyes - pno log2 pno) - (4/14)(-pyes log2 pyes - pno log2 pno) - (5/14)(-pyes log2 pyes - pno log2 pno)
= 0.94 - (5/14)(-2/5 log2 2/5 - 3/5 log2 3/5) - (4/14)(-4/4 log2 4/4) - (5/14)(-3/5 log2 3/5 - 2/5 log2 2/5)
= 0.94 - 0.347 - 0 - 0.347
= 0.246 bits
EXAMPLE
Infogain(S | Temperature) = Entropy(S) - 4/14 Entropy(S | Temperature = Hot) - 6/14 Entropy(S | Temperature = Mild) - 4/14 Entropy(S | Temperature = Cold)
= 0.94 - (4/14)(-pyes log2 pyes - pno log2 pno) - (6/14)(-pyes log2 pyes - pno log2 pno) - (4/14)(-pyes log2 pyes - pno log2 pno)
= 0.94 - (4/14)(-2/4 log2 2/4 - 2/4 log2 2/4) - (6/14)(-4/6 log2 4/6 - 2/6 log2 2/6) - (4/14)(-3/4 log2 3/4 - 1/4 log2 1/4)
= 0.94 - 0.286 - 0.392 - 0.233
= 0.029 bits
EXAMPLE
Infogain(S | Humidity) = Entropy(S) - 7/14 Entropy(S | Humidity = High) - 7/14 Entropy(S | Humidity = Normal)
= 0.94 - (7/14)(-pyes log2 pyes - pno log2 pno) - (7/14)(-pyes log2 pyes - pno log2 pno)
= 0.94 - (7/14)(-3/7 log2 3/7 - 4/7 log2 4/7) - (7/14)(-6/7 log2 6/7 - 1/7 log2 1/7)
= 0.94 - 0.493 - 0.296 = 0.151 bits
EXAMPLE
Infogain(S | Wind) = Entropy(S) - 6/14 Entropy(S | Wind = Strong) - 8/14 Entropy(S | Wind = Weak)
= 0.94 - (6/14)(-pyes log2 pyes - pno log2 pno) - (8/14)(-pyes log2 pyes - pno log2 pno)
= 0.94 - (6/14)(-1/2 log2 1/2 - 1/2 log2 1/2) - (8/14)(-6/8 log2 6/8 - 2/8 log2 2/8)
= 0.94 - 0.428 - 0.465 = 0.047 bits
EXAMPLE
Now, we select the attribute for the root node which will result in the highest
reduction in entropy.
Infogain(S | Outlook) = 0.246 bits
Infogain(S | Temperature) = 0.029 bits
Infogain(S | Humidity) = 0.151 bits
Infogain(S | Wind) = 0.047 bits
We can clearly see that the attribute Outlook results in the highest reduction in
entropy or the highest information gain.
We would therefore select Outlook at the root node, splitting the data up into subsets
corresponding to all the different values for the Outlook attribute.
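The four gains above can be checked numerically from the class counts implied by the fractions in the slides (9 "yes" and 5 "no" records overall); small differences in the third decimal come from rounding intermediate terms:

```python
# Recompute the four information gains from the per-value class counts
# (yes, no) implied by the fractions used above.
import math

def H(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def gain(children):                      # parent: 9 yes, 5 no, 14 records
    return H([9, 5]) - sum(sum(c) / 14 * H(c) for c in children)

gains = {
    "Outlook":     gain([[2, 3], [4, 0], [3, 2]]),   # Sunny, Overcast, Rain
    "Temperature": gain([[2, 2], [4, 2], [3, 1]]),   # Hot, Mild, Cold
    "Humidity":    gain([[3, 4], [6, 1]]),           # High, Normal
    "Wind":        gain([[3, 3], [6, 2]]),           # Strong, Weak
}
print({k: round(v, 3) for k, v in gains.items()})
```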
HOLDOUT METHOD
The original data with labelled examples is partitioned into two disjoint sets, called
the training set and test set.
A classification model is then induced from the training set and its performance is
evaluated on the test set.
The proportion of data reserved for training and for testing is typically at the discretion of the analyst (e.g., 50-50, or two-thirds for training and one-third for testing).
The accuracy of the classifier can be estimated based on the accuracy of the induced
model on the test set.
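The holdout partition can be sketched in a few lines (a minimal sketch; the function name and seed are illustrative):

```python
# Holdout method: shuffle the labelled data once, then reserve
# two-thirds for training and one-third for testing.
import random

def holdout_split(records, train_fraction=2 / 3, seed=0):
    records = list(records)
    random.Random(seed).shuffle(records)   # fixed seed for repeatability
    cut = round(len(records) * train_fraction)
    return records[:cut], records[cut:]

train, test = holdout_split(range(30))
print(len(train), len(test))  # 20 10
```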
HOLDOUT METHOD
Limitations
Fewer labelled examples are available for training because some of the records are
withheld for testing. The induced model may not be as good as when all the labelled
examples are used for training.
The model may be highly dependent on the composition of the training and test sets.
The smaller the training set, the larger the variance of the model.
On the other hand, if the training set is too large, then the estimated accuracy computed
from the smaller test set is less reliable. Such an estimate is said to have a wide
confidence interval.
HOLDOUT METHOD
Limitations
The training and test sets are no longer independent of each other. Because the training and test sets are subsets of the original data, a class that is overrepresented in one subset will be underrepresented in the other, and vice versa.
RANDOM SUBSAMPLING
Limitations
It does not utilize as much data as possible for training.
It has no control over the number of times each record is used for testing and
training. Consequently, some records might be used for training more often than
others.
CROSS-VALIDATION
In this approach, each record is used the same
number of times for training and exactly once for
testing.
Example:
Partition the data into three equal-sized subsets A, B, and C.
Use two of the subsets (A, B) for training and the remaining one (C) for testing.
Then use the combination (A, C) for training and B for testing.
Finally, use the remaining combination (B, C) for training and A for testing.
This approach is called three-fold cross-validation.
CROSS-VALIDATION
The k-fold cross validation method generalizes this approach by segmenting the data
into k-equal sized partitions.
During each run, one of the partitions is used for testing, while the rest are used for training.
This procedure is repeated k times so that each partition is used for testing exactly
once.
The total error is found by summing up the errors for all the k runs.
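The procedure can be sketched as follows (a minimal sketch; the interleaved fold assignment is one simple way to get k roughly equal partitions):

```python
# k-fold cross-validation: each run tests on one partition and trains
# on the remaining k - 1, so every record is tested exactly once.
def k_fold_splits(records, k):
    records = list(records)
    folds = [records[i::k] for i in range(k)]       # k interleaved partitions
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test

for train, test in k_fold_splits(range(9), k=3):
    print(len(train), len(test))  # 6 3, printed for each of the three runs
```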
CROSS-VALIDATION
A special case of k-fold cross-validation method sets k=N, the size of the dataset.
Called leave-one-out approach, each test set contains only one record.
This approach has the advantage of utilizing as much data as possible for training.
In addition, the test sets are mutually exclusive and they effectively cover the
entire data set.
The drawback of this approach is that it is computationally expensive to repeat the
procedure N times.
Furthermore, since each test set contains only one record, the variance of the estimated performance metric tends to be high.
BOOTSTRAP
The methods so far assume that the training records are sampled without
replacement. As a result, there are no duplicate records in the training and test sets.
Bootstrap Approach: Training records sampled with replacement.
If the original data has N records, it can be shown that, on average, a bootstrap sample of size N contains about 63.2% of the distinct records of the original data: the probability that a record is chosen at least once is 1 − (1 − 1/N)^N, which approaches 1 − 1/e ≈ 0.632 for large N.
Records that are not included in the bootstrap sample become part of the test set.
The model induced from the training set is then applied to the test set to obtain an estimate of the accuracy of the bootstrap sample, εi.
The sampling procedure is then repeated b times to generate b bootstrap samples.
BOOTSTRAP
.632 Bootstrap
It computes the overall accuracy by combining the accuracy of each bootstrap sample (εi) with the accuracy computed from the training set that contains all labelled examples in the original data (acc_s), i.e., no repeated labelled examples:
acc_boot = (1/b) Σi=1..b (0.632 · εi + 0.368 · acc_s)
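Both the 63.2% figure and the accuracy combination can be sketched numerically (the accuracy values εi and acc_s below are illustrative, not from the slides):

```python
# Sketch: sampling N records with replacement leaves roughly
# 1 - 1/e ~ 63.2% distinct records in the bootstrap sample, and the
# .632 bootstrap combines each sample accuracy eps_i with acc_s.
import random

rng = random.Random(0)
N = 10_000
sample = [rng.randrange(N) for _ in range(N)]   # one bootstrap sample
print(len(set(sample)) / N)                     # close to 0.632

def acc_boot(eps, acc_s):
    # average of 0.632 * eps_i + 0.368 * acc_s over the b samples
    return sum(0.632 * e + 0.368 * acc_s for e in eps) / len(eps)

print(round(acc_boot([0.70, 0.75, 0.72], acc_s=0.90), 3))  # 0.788
```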