0% found this document useful (0 votes)
4 views12 pages

03-Data Mining The Classification Problem

The document discusses the classification problem in data mining, focusing on building classifiers from pre-classified examples to predict the class of new instances. It outlines various classification models, applications such as credit approval and medical diagnosis, and emphasizes the importance of simplicity in algorithm design. Additionally, it introduces the OneR algorithm as a simple method for classification and mentions the Weka software for implementing machine learning tasks.

Uploaded by

anas
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
4 views12 pages

03-Data Mining The Classification Problem

The document discusses the classification problem in data mining, focusing on building classifiers from pre-classified examples to predict the class of new instances. It outlines various classification models, applications such as credit approval and medical diagnosis, and emphasizes the importance of simplicity in algorithm design. Additionally, it introduces the OneR algorithm as a simple method for classification and mentions the Weka software for implementing machine learning tasks.

Uploaded by

anas
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 12

Knowledge Discovery and Data Mining

The Classification Problem

EL Moukhtar ZEMMOURI
ENSAM-Meknès
2023-2024

The classification problem


• Task:
• Given a set of pre-classified examples, build a model or classifier to classify new
cases.
• Classes are known for the examples used to build the classifier.
• è Supervised learning problem
• è Learn by example
• A classifier can be :
• a set of rules, a decision tree, a neural network, etc.
E. Zemmouri

• Typical applications:
• Credit approval, direct marketing, fraud detection, medical diagnosis, …
2
The classification problem

• Learn a method for predicting the instance class from pre-labeled


(classified) instances

A2 Given a set of points from classes


what is the class of new point ?

E. Zemmouri
A1
3

The classification problem


• Formal definition :
• Let ! = #! = #!" , #!# , … , #!$ / ' = 1. . * be a multidimensional dataset of
* examples, characterized by , attributes -" , -# , … , -% (predictors)
• è ! is an "×$ data matrix

• Let . = /! / ' = 1. . * be a set of labels (classes) for !


• è Each data point %! ∈ ! is labeled by a class '! ∈ (
• Where C = {c" , … , c# } is a finite set of 1 predefined classes (discrete)

• The classification problem consists of using the training set !, . to learn a


E. Zemmouri

model M (classifier) that allows to predict the class / of new data point #
• If 0-12(4) = 2, then the problem is a binary classification (7 = 2)
4
Classification vs Regression

• (", $) is the training set


• Classification problem : the predicted attribute is discrete

• è predicting a discrete class label.

• Regression problem : the predicted attribute is numeric continuous

• è predicting a continuous quantity

• Predictors may be nominal or numeric

E. Zemmouri
5

Examples of applications

• Customer target marketing (direct marketing)

• Labels correspond to the user interest in a particular product,

• The attributes may correspond to the demographic profiles of the customers,

• Training examples contain also previous buying behavior,

• The training examples are used to learn whether a customer, with a known demographic
profile, but unknown buying behavior, may be interested in a particular product or not.
E. Zemmouri

6
Examples of applications

• Document categorization and filtering

• Many online applications require real-time classification of documents,

• These are used to organize the documents under specific topics of a portal,

• Previous examples of manually classified documents from each topic may be available,

• The features correspond to the words in the document,

• The class labels correspond to the various topics : politics, sports, current events, …

E. Zemmouri
7

Examples of applications

• Medical disease management

• The features may be extracted from patient medical tests and treatments,

• The class label may correspond to treatment outcomes,

• The task is to predict treatment outcomes.

• Intrusion detection

• The sequences of customer activity in a computer system may be used to predict the
possibility of intrusions.
E. Zemmouri

8
Classification models
• Various models have been designed for data classification
• Decision trees,
• Rule-based classifiers,
• Probabilistic models,
• Instance-based classifiers,
• Support vector machines,
• Neural networks,
• …

E. Zemmouri
9

The weather data


Attributes
• Fictitious dataset ! Outlook Temp Humidity Windy Play

1 Sunny Hot High False No


• It concerns the conditions that are suitable
2 Sunny Hot High True No
for playing some unspecified game.
3 Overcast Hot High False Yes
• Given data of past days, construct a model or 4 Rainy Mild High False Yes
rules to predict what to do in a new day. 5 Rainy Cool Normal False Yes
6 Rainy Cool Normal True No
Instances

• Example :
7 Overcast Cool Normal True Yes
• (sunny, cool, normal, true) 8 Sunny Mild High False No

• Play or not ? 9 Sunny Cool Normal False Yes

10 Rainy Mild Normal False Yes


E. Zemmouri

11 Sunny Mild Normal True Yes

12 Overcast Mild High True Yes

13 Overcast Hot Normal False Yes

14 Rainy Mild High True No

10
The weather data

• Weather data with mixed


Outlook Temp Humidity Windy Play

1 Sunny 85 85 False No
attributes 2 Sunny 80 90 True No
3 Overcast 83 86 False Yes

4 Rainy 70 96 False Yes

5 Rainy 68 80 False Yes


6 Rainy 65 70 True No
7 Overcast 64 65 True Yes

8 Sunny 72 95 False No

9 Sunny 69 70 False Yes

10 Rainy 75 80 False Yes


75

E. Zemmouri
11 Sunny 70 True Yes

12 Overcast 72 90 True Yes

13 Overcast 81 75 False Yes

14 Rainy 71 91 True No

11

The contact lenses data


Spectacle Tear production Recommended
Age Astigmatism
prescription rate lenses
Young Myope No Reduced None
Young Myope No Normal Soft
Young Myope Yes Reduced None
Young Myope Yes Normal Hard
Young Hypermetrope No Reduced None
Young Hypermetrope No Normal Soft
Young Hypermetrope Yes Reduced None
Young Hypermetrope Yes Normal hard
Pre-presbyopic Myope No Reduced None
Pre-presbyopic Myope No Normal Soft
Pre-presbyopic Myope Yes Reduced None
Pre-presbyopic Myope Yes Normal Hard
Pre-presbyopic Hypermetrope No Reduced None
Pre-presbyopic Hypermetrope No Normal Soft
Pre-presbyopic Hypermetrope Yes Reduced None
Pre-presbyopic Hypermetrope Yes Normal None
Presbyopic Myope No Reduced None
Presbyopic Myope No Normal None
E. Zemmouri

Presbyopic Myope Yes Reduced None


Presbyopic Myope Yes Normal Hard
Presbyopic Hypermetrope No Reduced None
Presbyopic Hypermetrope No Normal Soft
Presbyopic Hypermetrope Yes Reduced None
Presbyopic Hypermetrope Yes Normal None
12
Question
Can you imagine your own classification method ?

The Classification Problem

Simplicity first
Simplicity first
• Simple algorithms often work very well !
• Many kinds of simple structures in data :
• One attribute does all the work
• All attributes contribute equally & independently
• A weighted linear combination might do
• Instance-based
• …
• But success of a method depends on the domain

E. Zemmouri
• Data Mining is an Experimental science !

15

Simplicity first : OneR


• One Rule algorithm Attribute

• Learn rudimentary rules that all test only one attribute e va


lu lu
va value e
• è equivalent to 1 level decision tree
• How that works ? Class Class Class

• Generates one rule for each attribute (predictor) in the data


• construct a frequency table for each predictor against the target (class)
• then selects the rule with the smallest total error as its "one rule".
• Reference :
E. Zemmouri

• R.C. Holte (1993). “Very simple classification rules perform well on most
commonly used datasets.” Machine Learning.
16
OneR pseudo code

For each attribute (predictor) :


For each value of that attribute, make a rule as follows:
Count how often each class appears
Find the most frequent class
Make the rule assign that class to this attribute value
Calculate the error rate of this attribute’s rules
Choose the predictor with the smallest total error.

E. Zemmouri
17

OneR : example
Frequency Tables
Weather Problem
Which one is the best predictor ? Outlook Yes No Rule Errors Total Errors
Sunny
Outlook Temp Humidity Windy Play
Overcast
Sunny Hot High False No
Rainy
Sunny Hot High True No
Overcast Hot High False Yes
Temp Yes No Rule Errors Total Errors
Rainy Mild High False Yes
Hot
Rainy Cool Normal False Yes
Mild
Rainy Cool Normal True No
Cool
Overcast Cool Normal True Yes

Sunny Mild High False No


Humidity Yes No Rule Errors Total Errors
Sunny Cool Normal False Yes High
E. Zemmouri

Rainy Mild Normal False Yes


Normal
Sunny Mild Normal True Yes

Overcast Mild High True Yes Windy Yes No Rule Errors Total Errors
Overcast Hot Normal False Yes False

Rainy Mild High True No True 18


OneR : example
Frequency Tables
Weather Problem
Which one is the best predictor ? Outlook Yes No Rule Errors Total Errors
Sunny 2 3 Sunny à No 2/5
Outlook Temp Humidity Windy Play
Sunny Hot High False No
Overcast 4 0 Overcast à Yes 0/4 4/14

Sunny Hot High True No


Rainy 2 3 Rainy à No 2/5

Overcast Hot High False Yes


Temp Yes No Rule Errors Total Errors
Rainy Mild High False Yes
Hot 2 2 Hot à Yes 2/4
Rainy Cool Normal False Yes
Mild 4 2 Mild à Yes 2/6 5/14
Rainy Cool Normal True No
Cool 3 1 Cool à Yes 1/4
Overcast Cool Normal True Yes

Sunny Mild High False No


Humidity Yes No Rule Errors Total Errors
Sunny Cool Normal False Yes High 3 4 High à No 3/7
4/14

E. Zemmouri
Rainy Mild Normal False Yes
Normal 6 1 Normal à Yes 1/7
Sunny Mild Normal True Yes

Overcast Mild High True Yes Windy Yes No Rule Errors Total Errors
Overcast Hot Normal False Yes False 6 2 False à Yes 2/8
5/15
Rainy Mild High True No True 3 3 True à No 3/6 19

OneR : example
Weather Problem
Which one is the best predictor ?
• The best predictor is Outlook
Outlook Temp Humidity Windy Play
Sunny Hot High False No • Rule :
Sunny Hot High True No
Overcast Hot High False Yes
• Outlook = sunny à Play = no
Rainy Mild High False Yes
• Outlook = overcast à Play = yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No • Outlook = rainy à Play = yes
Overcast Cool Normal True Yes

Sunny Mild High False No


• With accuracy : 0.71
Sunny Cool Normal False Yes
• How to classify this day :
E. Zemmouri

Rainy Mild Normal False Yes

Sunny Mild Normal True Yes • (sunny, cool, normal, true)


Overcast Mild High True Yes

Overcast Hot Normal False Yes

Rainy Mild High True No 20


The Classification Problem

Weka

Weka Software
• Waikato Environment for Knowledge Analysis
• Developed in Java at the University of Waikato, New Zealand
• Open source software (GNU General Public License)
• A collection of machine learning algorithms for data mining tasks :
• data preparation,
• classification,
• regression,
• clustering,
• association rules mining,
E. Zemmouri

• and visualization.

22
Weka Software

E. Zemmouri
23

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy