03-Data Mining The Classification Problem
EL Moukhtar ZEMMOURI
ENSAM-Meknès
2023-2024
• Typical applications:
• Credit approval, direct marketing, fraud detection, medical diagnosis, …
The classification problem
E. Zemmouri
• Learn a model M (classifier) that predicts the class y of a new data point x
• If the set of class labels has cardinality 2, then the problem is a binary classification (two classes)
Classification vs Regression
Examples of applications
• The training examples are used to learn whether a customer, with a known demographic
profile, but unknown buying behavior, may be interested in a particular product or not.
Examples of applications
• These are used to organize the documents under specific topics of a portal,
• Previous examples of manually classified documents from each topic may be available,
• The class labels correspond to the various topics : politics, sports, current events, …
Examples of applications
• The features may be extracted from patient medical tests and treatments,
• Intrusion detection
• The sequences of customer activity in a computer system may be used to predict the
possibility of intrusions.
Classification models
• Various models have been designed for data classification
• Decision trees,
• Rule-based classifiers,
• Probabilistic models,
• Instance-based classifiers,
• Support vector machines,
• Neural networks,
• …
• Example :
• (sunny, cool, normal, true)

No.  Outlook   Temp  Humidity  Windy  Play
7    Overcast  Cool  Normal    True   Yes
8    Sunny     Mild  High      False  No
The weather data
No.  Outlook   Temp  Humidity  Windy  Play
1    Sunny     85    85        False  No
2    Sunny     80    90        True   No
3    Overcast  83    86        False  Yes
…
8    Sunny     72    95        False  No
…
11   Sunny     …     70        True   Yes
…
14   Rainy     71    91        True   No
Simplicity first
• Simple algorithms often work very well !
• Many kinds of simple structures in data :
• One attribute does all the work
• All attributes contribute equally & independently
• A weighted linear combination might do
• Instance-based
• …
• But success of a method depends on the domain
• Data Mining is an Experimental science !
• R.C. Holte (1993). “Very simple classification rules perform well on most
commonly used datasets.” Machine Learning.
OneR pseudo code
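As a rough illustration of the OneR idea (build one rule per attribute value, then keep the attribute whose rules make the fewest errors), a minimal Python sketch could look like the following. The function name `one_r`, the row-as-dict layout, and the tie-breaking choice (first attribute wins) are assumptions made for this example. The data is the standard 14-instance nominal weather dataset, assumed to match the one used in these slides.

```python
from collections import Counter, defaultdict


def one_r(rows, attributes, target):
    """Return (best_attribute, rules, errors) for a list of dict rows."""
    best = None
    for attr in attributes:
        # Frequency table: attribute value -> Counter of class labels.
        table = defaultdict(Counter)
        for row in rows:
            table[row[attr]][row[target]] += 1
        # One rule per value: predict the most frequent class for that value.
        rules = {value: counts.most_common(1)[0][0]
                 for value, counts in table.items()}
        # Errors: instances that are not in the majority class of their value.
        errors = sum(sum(counts.values()) - max(counts.values())
                     for counts in table.values())
        # Strict '<' keeps the first attribute when two attributes tie
        # (an assumption of this sketch).
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best


ATTRIBUTES = ["outlook", "temp", "humidity", "windy"]

# Assumed data: the standard nominal weather dataset (14 instances).
WEATHER = """\
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
"""

rows = [dict(zip(ATTRIBUTES + ["play"], line.split()))
        for line in WEATHER.splitlines()]

best_attr, rules, errors = one_r(rows, ATTRIBUTES, "play")
print(best_attr, rules, f"{errors}/{len(rows)}")
```

On this data the sketch selects outlook with 4 errors out of 14. Humidity also reaches 4/14, so the result depends on how ties are broken.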
OneR : example
Weather Problem
Which one is the best predictor ?

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
…
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

Frequency Tables

Outlook   Yes  No  Rule  Errors  Total Errors
Sunny     …
Overcast  …
Rainy     …

Temp      Yes  No  Rule  Errors  Total Errors
Hot       …
Mild      …
Cool      …

Humidity  Yes  No  Rule          Errors  Total Errors
High      …
Normal    6    1   Normal → Yes  1/7

Windy     Yes  No  Rule          Errors  Total Errors
False     6    2   False → Yes   2/8     5/14
True      3    3   True → No     3/6
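The Windy frequency table can be checked with a few lines of Python. This is only a sketch: the (windy, play) pairs are rebuilt from the Windy counts given above (6 Yes / 2 No for False, 3 Yes / 3 No for True), and the variable names are chosen for this example.

```python
from collections import Counter

# (windy, play) pairs rebuilt from the Windy counts: 6 + 2 + 3 + 3 = 14 instances.
pairs = ([("False", "Yes")] * 6 + [("False", "No")] * 2
         + [("True", "Yes")] * 3 + [("True", "No")] * 3)

# Frequency table: Windy value -> Counter of Play labels.
table = {}
for value, label in pairs:
    table.setdefault(value, Counter())[label] += 1

total_errors = 0
for value, counts in table.items():
    # Majority class for this value; on a tie (Windy = True, 3 vs 3) either
    # class gives the same error count, so the predicted label is arbitrary.
    predicted, hits = counts.most_common(1)[0]
    n = sum(counts.values())
    total_errors += n - hits
    print(f"Windy = {value} -> Play = {predicted}, errors {n - hits}/{n}")

print(f"total errors {total_errors}/{len(pairs)}")
```

The per-value errors are 2/8 and 3/6, so the rule set for Windy makes 5 errors over the 14 instances.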
OneR : example
Weather Problem
Which one is the best predictor ?
• The best predictor is Outlook
• Rule :
• Outlook = sunny → Play = no
• Outlook = overcast → Play = yes
• Outlook = rainy → Play = yes
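Once learned, the OneR rule set is just a lookup on a single attribute. The short sketch below applies the Outlook rules to the instance (sunny, cool, normal, true) from the earlier example. The names `RULE` and `predict_play`, and the tuple layout (outlook, temp, humidity, windy), are assumptions made for illustration.

```python
# The OneR rule set for the weather problem, stored as a lookup on Outlook.
RULE = {"sunny": "no", "overcast": "yes", "rainy": "yes"}


def predict_play(instance):
    """instance is (outlook, temp, humidity, windy); only outlook is used."""
    outlook = instance[0]
    return RULE[outlook.lower()]


prediction = predict_play(("sunny", "cool", "normal", "true"))
print(prediction)  # no
```

The other three attributes are ignored entirely, which is what makes the classifier so simple to read and to apply.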
Weka
Weka Software
• Waikato Environment for Knowledge Analysis
• Developed in Java at the University of Waikato, New Zealand
• Open source software (GNU General Public License)
• A collection of machine learning algorithms for data mining tasks :
• data preparation,
• classification,
• regression,
• clustering,
• association rules mining,
• and visualization.