Unit - III
Classification
❑ Classification and Prediction
❑ Basic concepts
❑ Decision tree induction
❑ Bayesian classification
❑ Rule–based classification
❑ Lazy learner.
Supervised vs. Unsupervised Learning (1)
❑ Supervised learning (classification)
❑ Supervision: The training data (observations or measurements) are accompanied by labels indicating the classes to which they belong
❑ New data is classified based on the model built from the training set
Training data with class labels (the training instances are fed to a learning algorithm, which builds the model):
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
Test instances (fed to the learned model, which predicts a positive or negative class for each):
age income student credit_rating buys_computer
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
(Figure: training instances → learning → model; test instances → model → prediction.)
Supervised vs. Unsupervised Learning (2)
❑ Unsupervised learning (clustering)
❑ The class labels of training data are unknown
❑ Given a set of observations or measurements, establish the possible existence
of classes or clusters in the data
Prediction Problems: Classification vs. Numeric Prediction
❑ Classification
❑ Predict categorical class labels (discrete or nominal)
❑ Construct a model based on the training set and the class labels (the values in a
classifying attribute) and use it in classifying new data
❑ Numeric prediction
❑ Model continuous-valued functions (i.e., predict unknown or missing values)
❑ Typical applications of classification
❑ Credit/loan approval
❑ Medical diagnosis: whether a tumor is cancerous or benign
❑ Fraud detection: whether a transaction is fraudulent
❑ Web page categorization: which category a page belongs to
Classification—Model Construction, Validation and Testing
❑ Model construction
❑ Each sample is assumed to belong to a predefined class (as indicated by its class label)
❑ The set of samples used for model construction is the training set
❑ Model: represented as decision trees, rules, mathematical formulas, or other forms
❑ Model Validation and Testing
❑ Test: estimate the accuracy of the model
❑ The known label of each test sample is compared with the class predicted by the model
❑ Accuracy: % of test-set samples that are correctly classified by the model
❑ The test set is independent of the training set
❑ Validation: if a held-out set is used to select or refine models, it is called a validation (or development) set
❑ Model Deployment: if the accuracy is acceptable, use the model to classify new data (see the code sketch below)
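To make this construction/validation/testing workflow concrete, here is a minimal sketch using scikit-learn; the library, the synthetic dataset, and the choice of a decision tree classifier are assumptions for illustration, not part of the slides.

```python
# Minimal train / test / deploy sketch (assumes scikit-learn is installed)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Labeled samples: each observation comes with a class label (supervised learning)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Keep the test set independent of the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Model construction on the training set
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Testing: compare the known labels of the test samples with the model's predictions
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Deployment: if the accuracy is acceptable, classify new, unlabeled data with model.predict(...)
```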
Decision Tree Induction: An Example
❑ Training data set: Who buys a computer? (adapted from the “Playing Tennis” example of R. Quinlan)
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
❑ Decision tree construction: a top-down, recursive, divide-and-conquer process
❑ Resulting tree:
age? = <=30 → student? (no → Not-buy, yes → Buy)
age? = 31..40 → Buy
age? = >40 → credit_rating? (excellent → Not-buy, fair → Buy)
From Entropy to Info Gain: A Brief Review of Entropy
❑ Entropy (Information Theory)
❑ A measure of the uncertainty associated with a random variable
❑ Calculation: for a discrete random variable Y taking m distinct values {y1, y2, …, ym} with probabilities pi = P(Y = yi):
   H(Y) = −Σi pi log2(pi)
❑ Interpretation
❑ Higher entropy → higher uncertainty
❑ Lower entropy → lower uncertainty
❑ Conditional entropy: H(Y|X) = Σx P(X = x) H(Y|X = x)
(Figure: entropy of a binary variable (m = 2) as a function of p, maximal at p = 0.5. A small Python sketch of these definitions follows.)
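A minimal Python sketch of these definitions; the function names and the toy label lists are illustrative, not from the slides.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) = -sum_i p_i log2(p_i) over the observed label distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(xs, ys):
    """H(Y|X): entropy of Y within each group of identical X values, weighted by group size."""
    n = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    return sum(len(g) / n * entropy(g) for g in groups.values())

print(entropy(["yes", "no"]))             # 1.0   (binary variable with p = 0.5: maximum uncertainty)
print(entropy(["yes"] * 9 + ["no"] * 5))  # ~0.940 (the buys_computer class distribution)
```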
Information Gain: An Attribute Selection Measure
❑ Select the attribute with the highest information gain (used in the typical decision tree induction algorithms ID3/C4.5)
❑ Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D|/|D|
❑ Expected information (entropy) needed to classify a tuple in D:
   Info(D) = −Σi pi log2(pi)
❑ Information still needed after using attribute A to split D into v partitions D1, …, Dv:
   InfoA(D) = Σj (|Dj|/|D|) × Info(Dj)
❑ Information gained by branching on A:
   Gain(A) = Info(D) − InfoA(D)
Example: Attribute Selection with Information Gain
❑ Class P: buys_computer = “yes” (9 tuples); Class N: buys_computer = “no” (5 tuples)
   Info(D) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
❑ Splitting on age partitions D into <=30 (2 yes, 3 no), 31…40 (4 yes, 0 no), and >40 (3 yes, 2 no):
   Infoage(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
   Gain(age) = Info(D) − Infoage(D) = 0.246
❑ Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048
❑ age has the highest information gain, so it is chosen as the splitting attribute at the root
Decision Tree Induction: Algorithm
❑ Basic algorithm (a compact code sketch follows this slide)
❑ The tree is constructed in a top-down, recursive, divide-and-conquer manner
❑ At the start, all the training examples are at the root
❑ Examples are partitioned recursively based on selected attributes
❑ At each node, the splitting attribute is chosen based on the training examples at that node, using a heuristic or statistical measure (e.g., information gain)
❑ Conditions for stopping partitioning
❑ All samples at a given node belong to the same class
❑ There are no remaining attributes for further partitioning
❑ There are no samples left
❑ Prediction
❑ Majority voting among the training examples at a leaf determines its class label
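The following is a minimal ID3-style sketch of this algorithm in Python, using information gain as the selection measure; the function and variable names are illustrative, and the data is the buys_computer table from the earlier slides.

```python
from collections import Counter
from math import log2

def entropy(rows, target):
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in Counter(r[target] for r in rows).values())

def info_gain(rows, attr, target):
    n, parts = len(rows), {}
    for r in rows:
        parts.setdefault(r[attr], []).append(r)
    return entropy(rows, target) - sum(len(p) / n * entropy(p, target) for p in parts.values())

def build_tree(rows, attrs, target):
    labels = [r[target] for r in rows]
    # Stopping conditions: all samples in one class, or no attributes left -> majority vote
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a, target))   # attribute selection
    branches = {}
    for value in set(r[best] for r in rows):                      # divide and conquer
        subset = [r for r in rows if r[best] == value]
        branches[value] = build_tree(subset, [a for a in attrs if a != best], target)
    return (best, branches)

cols = ["age", "income", "student", "credit_rating", "buys_computer"]
data = [dict(zip(cols, row)) for row in [
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no")]]
print(build_tree(data, cols[:-1], "buys_computer"))  # root split on 'age', as on the earlier slide
```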
How to Handle Continuous-Valued Attributes?
❑ Method 1: Discretize continuous values and treat them as categorical values
❑ E.g., age: < 20, 20..30, 30..40, 40..50, > 50
❑ Method 2: Determine the best split point for a continuous-valued attribute A (see the sketch below)
❑ Sort the values of A in increasing order, e.g., 15, 18, 21, 22, 24, 25, 29, 31, …
❑ Candidate split points: the midpoint between each pair of adjacent values
❑ (ai + ai+1)/2 is the midpoint between the values of ai and ai+1
❑ E.g., (15+18)/2 = 16.5, then 19.5, 21.5, 23, 24.5, 27, 30, …
❑ The point with the maximum information gain for A is selected as the split point for A
❑ Split: based on split point P
❑ The set of tuples in D satisfying A ≤ P vs. those with A > P
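A minimal sketch of Method 2 in Python; the toy ages and labels are made up for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Try the midpoint of every pair of adjacent sorted values; keep the one
    that yields the highest information gain for the binary split A <= P vs. A > P."""
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    base, n = entropy(ys), len(ys)
    best_p, best_gain = None, -1.0
    for i in range(n - 1):
        if xs[i] == xs[i + 1]:
            continue  # no midpoint between equal values
        p = (xs[i] + xs[i + 1]) / 2
        left, right = ys[:i + 1], ys[i + 1:]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if gain > best_gain:
            best_p, best_gain = p, gain
    return best_p, best_gain

ages   = [15, 18, 21, 22, 24, 25, 29, 31]
labels = ["no", "no", "no", "yes", "yes", "yes", "yes", "no"]
print(best_split_point(ages, labels))  # for these toy labels, the best split point is 21.5
```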
Gain Ratio: A Refined Measure for Attribute Selection
❑ The information gain measure is biased towards attributes with a large number of values
❑ Gain ratio overcomes this problem (as a normalization of information gain):
   SplitInfoA(D) = −Σj (|Dj|/|D|) log2(|Dj|/|D|)
   GainRatio(A) = Gain(A) / SplitInfoA(D)
❑ The attribute with the maximum gain ratio is selected as the splitting attribute
❑ Gain ratio is used in the popular algorithm C4.5 (a successor of ID3) by R. Quinlan
❑ Example: income splits D into 4 “low”, 6 “medium”, and 4 “high” tuples
   SplitInfoincome(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557
   GainRatio(income) = 0.029 / 1.557 = 0.019
(A short computational check follows.)
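A few lines of Python to verify the example; the helper name split_info is illustrative.

```python
from math import log2

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum_j (|Dj|/|D|) log2(|Dj|/|D|) over the partitions induced by A."""
    n = sum(partition_sizes)
    return -sum(s / n * log2(s / n) for s in partition_sizes)

si = split_info([4, 6, 4])      # income: 4 low, 6 medium, 4 high
print(round(si, 3))             # 1.557
print(round(0.029 / si, 3))     # 0.019  (GainRatio(income) = Gain(income) / SplitInfo_income(D))
```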
Another Measure: Gini Index
❑ Gini index: used in CART, and also in IBM IntelligentMiner
❑ If a data set D contains examples from n classes, the gini index gini(D) is defined as
   gini(D) = 1 − Σj pj², where pj is the relative frequency of class j in D
❑ If D is split on attribute A into two subsets D1 and D2, the gini index of the split is defined as
   giniA(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)
❑ Reduction in impurity:
   Δgini(A) = gini(D) − giniA(D)
❑ The attribute that provides the smallest ginisplit(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node (this requires enumerating all possible splitting points for each attribute)
Computation of Gini Index
❑ Example: D has 9 tuples with buys_computer = “yes” and 5 with “no”
❑ Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}
   giniincome ∈ {low,medium}(D) = (10/14) gini(D1) + (4/14) gini(D2)
      = (10/14)(1 − (7/10)² − (3/10)²) + (4/14)(1 − (2/4)² − (2/4)²)
      = 0.443
      = giniincome ∈ {high}(D)
❑ Gini{low,high} is 0.458 and Gini{medium,high} is 0.450
❑ Thus, split on {low, medium} (vs. {high}), since it has the lowest Gini index (a code check follows)
❑ All attributes are assumed continuous-valued
❑ May need other tools, e.g., clustering, to get the possible split values
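A small sketch that reproduces these numbers; the (yes, no) class counts per income subset are read off the buys_computer table, and the function names are illustrative.

```python
def gini(counts):
    """gini(D) = 1 - sum_j p_j^2, given the class counts of one partition."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted gini index of a split; partitions = list of per-subset class-count lists."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

# (yes, no) counts per income subset of the buys_computer table
print(f"{gini_split([[7, 3], [2, 2]]):.3f}")  # {low,medium} vs {high}    -> 0.443
print(f"{gini_split([[5, 3], [4, 2]]):.3f}")  # {low,high}   vs {medium}  -> 0.458
print(f"{gini_split([[6, 4], [3, 1]]):.3f}")  # {medium,high} vs {low}    -> 0.450
```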
Other Attribute Selection Measures
❑ Minimal Description Length (MDL) principle
❑ Philosophy: The simplest solution is preferred
❑ The best tree as the one that requires the fewest # of bits to both (1) encode
the tree, and (2) encode the exceptions to the tree
❑ CHAID: a popular decision tree algorithm, measure based on χ2 test for
independence
❑ Multivariate splits (partition based on multiple variable combinations)
❑ CART: finds multivariate splits based on a linear combination of attributes
❑ There are many other measures proposed in research and applications
❑ E.g., G-statistics, C-SEP
❑ Which attribute selection measure is the best?
❑ Most give good results; none is significantly superior to the others
Overfitting and Tree Pruning
❑ Overfitting: An induced tree may overfit the training data
❑ Too many branches, some may reflect anomalies due to noise or
outliers
❑ Poor accuracy for unseen samples
❑ Two approaches to avoid overfitting
❑ Prepruning: Halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
❑ Difficult to choose an appropriate threshold
❑ Postpruning: Remove branches from a “fully grown” tree, yielding a sequence of progressively pruned trees
❑ Use a set of data different from the training data to decide which is the “best pruned tree”
What Is Bayesian Classification?
❑ A statistical classifier
❑ Performs probabilistic prediction (i.e., predicts class membership probabilities)
❑ Foundation—Based on Bayes’ Theorem
❑ Performance
❑ A simple Bayesian classifier, naïve Bayesian classifier, has comparable
performance with decision tree and selected neural network classifiers
❑ Incremental
❑ Each training example can incrementally increase/decrease the probability that
a hypothesis is correct—prior knowledge can be combined with observed data
❑ Theoretical Standard
❑ Even when Bayesian methods are computationally intractable, they can provide
a standard of optimal decision making against which other methods can be
measured
Bayes’ Theorem: Basics
❑ Total probability theorem:
   P(B) = Σi P(B | Ai) P(Ai)
❑ Bayes’ theorem (X is a data tuple, the “evidence”; H is a hypothesis, e.g., that X belongs to class C):
   P(H | X) = P(X | H) P(H) / P(X) ∝ P(X | H) P(H)
Naïve Bayes Classifier: Training Dataset
❑ Training data: the 14-tuple buys_computer table (attributes age, income, student, credit_rating; class buys_computer) used throughout this unit
❑ Tuple to classify: X = (age <= 30, income = medium, student = yes, credit_rating = fair)
Naïve Bayes Classifier: An Example
❑ Training data: the buys_computer table (9 “yes” tuples, 5 “no” tuples)
❑ P(Ci):
   P(buys_computer = “yes”) = 9/14 = 0.643
   P(buys_computer = “no”) = 5/14 = 0.357
❑ Compute P(X|Ci) for each class:
   P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
   P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
   P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
   P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
   P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
   P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
   P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
   P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
❑ For X = (age <= 30, income = medium, student = yes, credit_rating = fair):
   P(X | buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
   P(X | buys_computer = “no”) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
   P(X | buys_computer = “yes”) × P(buys_computer = “yes”) = 0.028
   P(X | buys_computer = “no”) × P(buys_computer = “no”) = 0.007
❑ Therefore, X belongs to the class buys_computer = “yes” (a code sketch of this computation follows)
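A minimal sketch of this computation in Python; the function names are illustrative, and the counts come from the buys_computer table above.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, target):
    """Estimate P(C) and P(attribute = value | C) by relative frequencies."""
    class_counts = Counter(r[target] for r in rows)
    cond = defaultdict(Counter)          # (attribute, class) -> counts of attribute values
    for r in rows:
        for attr, val in r.items():
            if attr != target:
                cond[(attr, r[target])][val] += 1
    return class_counts, cond, len(rows)

def predict(x, class_counts, cond, n):
    """Return argmax_C P(C) * prod_k P(x_k | C), together with the scores."""
    scores = {}
    for c, c_count in class_counts.items():
        p = c_count / n
        for attr, val in x.items():
            p *= cond[(attr, c)][val] / c_count
        scores[c] = p
    return max(scores, key=scores.get), scores

cols = ["age", "income", "student", "credit_rating", "buys_computer"]
data = [dict(zip(cols, row)) for row in [
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no")]]

X = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}
print(predict(X, *train_naive_bayes(data, "buys_computer")))  # ('yes', {'yes': ~0.028, 'no': ~0.007})
```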
Avoiding the Zero-Probability Problem
❑ Naïve Bayesian prediction requires each conditional probability be non-zero; otherwise, the predicted probability will be zero:
   P(X | Ci) = Πk P(xk | Ci) = P(x1 | Ci) × P(x2 | Ci) × … × P(xn | Ci)
❑ Example: suppose a dataset with 1000 tuples where income = low (0 tuples), income = medium (990), and income = high (10)
❑ Use the Laplacian correction (or Laplacian estimator): add 1 to each case
   Prob(income = low) = 1/(1000 + 3)
   Prob(income = medium) = (990 + 1)/(1000 + 3)
   Prob(income = high) = (10 + 1)/(1000 + 3)
❑ The “corrected” probability estimates are close to their “uncorrected” counterparts (see the snippet below)
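A few lines of Python for the Laplacian correction; the helper name laplace_probs is illustrative.

```python
def laplace_probs(counts, k=1):
    """Laplacian correction: add k to every count before normalizing."""
    total = sum(counts.values()) + k * len(counts)
    return {value: (c + k) / total for value, c in counts.items()}

income_counts = {"low": 0, "medium": 990, "high": 10}
for value, p in laplace_probs(income_counts).items():
    print(value, f"{p:.4f}")   # low 0.0010, medium 0.9880, high 0.0110 -- no zero probabilities
```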
Naïve Bayes Classifier: Strength vs. Weakness
❑ Strength
❑ Easy to implement
❑ Good results obtained in most of the cases
❑ Weakness
❑ Assumption: class-conditional independence of attributes, which causes a loss of accuracy
❑ In practice, dependencies exist among variables
❑ E.g., Patients: Profile: age, family history, etc.
Symptoms: fever, cough etc.
Disease: lung cancer, diabetes, etc.
❑ Dependencies among these cannot be modeled by Naïve Bayes Classifier
❑ How to deal with these dependencies?
❑ Use Bayesian Belief Networks (to be covered in the next chapter)
Lazy vs. Eager Learning
❑ Lazy vs. eager learning
❑ Lazy learning (e.g., instance-based learning): simply stores the training data (or performs only minor processing) and waits until it is given a test tuple
❑ Eager learning (the methods discussed above): given a set of training tuples, constructs a classification model before receiving new (e.g., test) data to classify
❑ Lazy learners spend less time in training but more time in prediction
❑ Accuracy
❑ Lazy method effectively uses a richer hypothesis space since it uses many local
linear functions to form an implicit global approximation to the target function
❑ Eager: must commit to a single hypothesis that covers the entire instance space
Lazy Learner: Instance-Based Methods
❑ Instance-based learning:
❑ Store training examples and delay the processing (“lazy evaluation”) until a
new instance must be classified
❑ Typical approaches
❑ k-nearest neighbor approach
❑ Instances represented as points in a Euclidean space.
❑ Locally weighted regression
❑ Constructs local approximation
❑ Case-based reasoning
❑ Uses symbolic representations and knowledge-based inference
The k-Nearest Neighbor Algorithm
❑ All instances correspond to points in an n-D space
❑ The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
❑ The target function can be discrete- or real-valued
❑ For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to xq
❑ Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
(Figure: Voronoi cells around the “+” and “−” training points, with query point xq. A code sketch follows.)
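A minimal k-NN sketch in Python; the 2-D coordinates for the “+” and “−” points are made up for illustration.

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_classify(query, examples, k=3):
    """Majority vote among the k training examples closest to the query point.
    examples: list of (point, label) pairs; points are equal-length numeric tuples."""
    neighbors = sorted(examples, key=lambda e: dist(query, e[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Toy 2-D data with '+' and '-' classes, as in the slide's sketch
train = [((1, 1), "+"), ((1, 2), "+"), ((2, 2), "+"),
         ((5, 5), "-"), ((6, 5), "-"), ((6, 6), "-")]
print(knn_classify((2, 1), train, k=3))  # '+'
print(knn_classify((5, 6), train, k=3))  # '-'
```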
Discussion on the k-NN Algorithm
❑ k-NN for real-valued prediction for a given unknown tuple
❑ Returns the mean values of the k nearest neighbors
❑ Distance-weighted nearest neighbor algorithm (see the sketch below)
❑ Weight the contribution of each of the k neighbors according to its distance to the query xq
❑ Give greater weight to closer neighbors
❑ Robust to noisy data by averaging the k nearest neighbors
❑ Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes
❑ To overcome it, stretch the axes or eliminate the least relevant attributes
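A short sketch of distance-weighted k-NN; the 1/d² weighting scheme and the toy data are illustrative choices, not prescribed by the slides.

```python
from collections import defaultdict
from math import dist

def weighted_knn_classify(query, examples, k=3):
    """Distance-weighted k-NN: each of the k nearest neighbors votes with weight 1/d^2."""
    neighbors = sorted(examples, key=lambda e: dist(query, e[0]))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = dist(query, point)
        if d == 0:
            return label          # exact match: return its label directly
        votes[label] += 1.0 / d ** 2
    return max(votes, key=votes.get)

train = [((1, 1), "+"), ((2, 2), "+"), ((6, 6), "-")]
print(weighted_knn_classify((2, 1), train, k=3))  # '+' (the two nearby '+' points outweigh the far '-')
```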
Case-Based Reasoning (CBR)
❑ CBR: Uses a database of problem solutions to solve new problems
❑ Store symbolic description (tuples or cases)—not points in a Euclidean space
❑ Applications: Customer-service (product-related diagnosis), legal ruling
❑ Methodology
❑ Instances represented by rich symbolic descriptions (e.g., function graphs)
❑ Search for similar cases, multiple retrieved cases may be combined
❑ Tight coupling between case retrieval, knowledge-based reasoning, and problem
solving
❑ Challenges
❑ Finding a good similarity metric
❑ Indexing based on a syntactic similarity measure and, when retrieval fails, backtracking and adapting to additional cases
END OF UNIT - III