
AI512PE: DATA MINING (PE - I)

Unit - 3
Unit - III
Classification
❑ Classification and Prediction
❑ Basic concepts
❑ Decision tree induction
❑ Bayesian classification
❑ Rule–based classification
❑ Lazy learner.

2
Classification: Basic Concepts
❑ Classification: Basic Concepts

❑ Decision Tree Induction

❑ Bayes Classification Methods

❑ Linear Classifier

❑ Model Evaluation and Selection

❑ Techniques to Improve Classification Accuracy: Ensemble Methods

❑ Additional Concepts on Classification

❑ Summary

3
Supervised vs. Unsupervised Learning (1)
❑ Supervised learning (classification)
❑ Supervision: The training data (e.g., observations or measurements) are
accompanied by labels indicating the classes to which they belong
❑ New data is classified based on the models built from the training set
Training data with class label:

age      income   student  credit_rating  buys_computer
<=30     high     no       fair           no
<=30     high     no       excellent      no
31…40    high     no       fair           yes
>40      medium   no       fair           yes
>40      low      yes      fair           yes
>40      low      yes      excellent      no
31…40    low      yes      excellent      yes
<=30     medium   no       fair           no
<=30     low      yes      fair           yes
>40      medium   yes      fair           yes
<=30     medium   yes      excellent      yes
31…40    medium   no       excellent      yes
31…40    high     yes      fair           yes
>40      medium   no       excellent      no

[Figure: training instances are fed to a learning algorithm to build the model; test instances are then passed to the model, which predicts the class (positive or negative).]
4
Supervised vs. Unsupervised Learning (2)
❑ Unsupervised learning (clustering)
❑ The class labels of training data are unknown
❑ Given a set of observations or measurements, establish the possible existence
of classes or clusters in the data
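To make the contrast concrete, here is a minimal, hedged sketch: a classifier is trained with class labels (supervised), while a clustering algorithm groups the same points without labels (unsupervised). The tiny 2-D toy data and the choice of scikit-learn estimators are illustrative assumptions, not part of the slides.

# Illustrative sketch: supervised classification vs. unsupervised clustering.
# The toy data and estimator choices are assumptions for demonstration only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X = np.array([[25, 1], [28, 0], [35, 1], [45, 0], [52, 1], [58, 0]])  # e.g., (age, student flag)
y = np.array(["no", "no", "yes", "yes", "yes", "no"])                 # class labels (buys_computer)

# Supervised: the labels y guide model construction.
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[30, 1]]))          # predicted class for a new instance

# Unsupervised: only X is given; the algorithm proposes groupings (clusters).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                      # cluster assignments, with no class semantics attached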

5
Prediction Problems: Classification vs. Numeric Prediction
❑ Classification
❑ Predict categorical class labels (discrete or nominal)
❑ Construct a model based on the training set and the class labels (the values in a
classifying attribute) and use it in classifying new data
❑ Numeric prediction
❑ Model continuous-valued functions (i.e., predict unknown or missing values)
❑ Typical applications of classification
❑ Credit/loan approval
❑ Medical diagnosis: if a tumor is cancerous or benign
❑ Fraud detection: if a transaction is fraudulent
❑ Web page categorization: which category it is
6
Classification—Model Construction, Validation and Testing
❑ Model construction
❑ Each sample is assumed to belong to a predefined class (shown by the class label)
❑ The set of samples used for model construction is training set
❑ Model: Represented as decision trees, rules, mathematical formulas, or other forms
❑ Model Validation and Testing
❑ Test: Estimate accuracy of the model
❑ The known label of test sample is compared with the classified result from the
model
❑ Accuracy: % of test set samples that are correctly classified by the model
❑ Test set is independent of training set
❑ Validation: If a held-out set is used to select or refine models, it is called a validation (or
development) set rather than a true test set
❑ Model Deployment: If the accuracy is acceptable, use the model to classify new data
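A minimal sketch of the construct → validate → test workflow described above, assuming scikit-learn and a synthetic dataset (both are illustrative assumptions, not prescribed by the slides):

# Hedged sketch: train / validation / test split and accuracy estimation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold out an independent test set; carve a validation set out of the remainder.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

# Model construction on the training set; the validation set is used to select a model.
best_model, best_acc = None, -1.0
for depth in (2, 4, 6, None):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    acc = accuracy_score(y_val, model.predict(X_val))
    if acc > best_acc:
        best_model, best_acc = model, acc

# Final test: accuracy = % of test-set samples classified correctly by the chosen model.
print("test accuracy:", accuracy_score(y_test, best_model.predict(X_test)))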
7
Classification: Basic Concepts
❑ Classification: Basic Concepts

❑ Decision Tree Induction

❑ Bayes Classification Methods

❑ Linear Classifier

❑ Model Evaluation and Selection

❑ Techniques to Improve Classification Accuracy: Ensemble Methods

❑ Additional Concepts on Classification

❑ Summary

8
Decision Tree Induction: An Example
❑ Decision tree construction: a top-down, recursive, divide-and-conquer process
❑ Training data set (Who buys a computer?): the same 14-tuple table as on slide 4
  (age, income, student, credit_rating, buys_computer)
❑ Resulting tree:

  age?
  ├─ <=30   → student?
  │            ├─ no  → Not-buy
  │            └─ yes → Buy
  ├─ 31..40 → Buy
  └─ >40    → credit_rating?
               ├─ excellent → Not-buy
               └─ fair      → Buy

Note: The data set is adapted from the “Playing Tennis” example of R. Quinlan
9
From Entropy to Info Gain: A Brief Review of Entropy
❑ Entropy (Information Theory)
❑ A measure of the uncertainty associated with a random variable
❑ Calculation: For a discrete random variable Y taking m distinct values {y1, y2, …, ym}
  with probabilities pi = P(Y = yi):

      H(Y) = − Σ_{i=1..m} pi log2(pi)

❑ Interpretation
❑ Higher entropy → higher uncertainty
❑ Lower entropy → lower uncertainty
❑ Conditional entropy

      H(Y | X) = Σ_x p(x) H(Y | X = x)

[Figure: the binary-entropy curve (m = 2): H(Y) is maximal, 1 bit, when the two values are equally likely.]
10
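The entropy definition above can be checked numerically; a small sketch in plain Python (the helper name is illustrative):

# Entropy of a discrete distribution: H(Y) = -sum_i p_i * log2(p_i).
from math import log2

def entropy(probs):
    """Entropy in bits; terms with p = 0 contribute 0 by convention."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit  -> maximum uncertainty for m = 2
print(entropy([0.9, 0.1]))   # ~0.469   -> lower uncertainty
print(entropy([1.0]))        # 0.0      -> no uncertainty at all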
Information Gain: An Attribute Selection Measure
❑ Select the attribute with the highest information gain (used in typical
decision tree induction algorithm: ID3/C4.5)
❑ Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|
❑ Expected information (entropy) needed to classify a tuple in D:

      Info(D) = − Σ_{i=1..m} pi log2(pi)

❑ Information needed (after using A to split D into v partitions) to classify D:

      Info_A(D) = Σ_{j=1..v} (|Dj| / |D|) × Info(Dj)

❑ Information gained by branching on attribute A:

      Gain(A) = Info(D) − Info_A(D)
11
Example: Attribute Selection with Information Gain
❑ Class P: buys_computer = “yes”
❑ Class N: buys_computer = “no”

Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

age      pi   ni   I(pi, ni)
<=30     2    3    0.971
31…40    4    0    0
>40      3    2    0.971

(Training data: the same 14-tuple buys_computer table as on slide 4.)

Gain(income)        = 0.029
Gain(student)       = 0.151
Gain(credit_rating) = 0.048

12
Example: Attribute Selection with Information Gain (continued)

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Gain(age) = Info(D) − Info_age(D) = 0.940 − 0.694 = 0.246

Since Gain(age) is the largest of the four gains, age is selected as the splitting attribute at the root.
13
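The numbers above can be reproduced in a few lines; this is an illustrative sketch, with the class counts per age partition taken from the table on slide 12:

# Reproduce Info(D), Info_age(D) and Gain(age) for the buys_computer data.
from math import log2

def info(counts):
    """Expected information (entropy) of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

info_D = info([9, 5])                        # 9 "yes", 5 "no"  -> ~0.940

# Class counts (yes, no) inside each age partition.
partitions = {"<=30": (2, 3), "31...40": (4, 0), ">40": (3, 2)}
n = 14
info_age = sum((sum(c) / n) * info(c) for c in partitions.values())   # ~0.694

# Prints roughly 0.94 0.694 0.247 (the slides round Gain(age) to 0.246).
print(round(info_D, 3), round(info_age, 3), round(info_D - info_age, 3))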
Decision Tree Induction: Algorithm
❑ Basic algorithm
❑ Tree is constructed in a top-down, recursive, divide-and-conquer manner
❑ At start, all the training examples are at the root
❑ Examples are partitioned recursively based on selected attributes
❑ At each node, attributes are selected based on the training examples at that
node, using a heuristic or statistical measure (e.g., information gain)
❑ Conditions for stopping partitioning
❑ All samples for a given node belong to the same class
❑ There are no remaining attributes for further partitioning
❑ There are no samples left
❑ Prediction
❑ Majority voting is employed for classifying the leaf
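A hedged sketch of the same top-down induction using a library implementation: scikit-learn's CART-style DecisionTreeClassifier with the entropy criterion on the slide's training table. The one-hot encoding and column names are assumptions needed to make the categorical data usable; the library will not reproduce ID3 exactly.

# Fit a decision tree on the buys_computer table using an entropy-based criterion.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "age":           ["<=30", "<=30", "31..40", ">40", ">40", ">40", "31..40",
                      "<=30", "<=30", ">40", "<=30", "31..40", "31..40", ">40"],
    "income":        ["high", "high", "high", "medium", "low", "low", "low",
                      "medium", "low", "medium", "medium", "medium", "high", "medium"],
    "student":       ["no", "no", "no", "no", "yes", "yes", "yes",
                      "no", "yes", "yes", "yes", "no", "yes", "no"],
    "credit_rating": ["fair", "excellent", "fair", "fair", "fair", "excellent", "excellent",
                      "fair", "fair", "fair", "excellent", "excellent", "fair", "excellent"],
    "buys_computer": ["no", "no", "yes", "yes", "yes", "no", "yes",
                      "no", "yes", "yes", "yes", "yes", "yes", "no"],
})

X = pd.get_dummies(data.drop(columns="buys_computer"))   # one-hot encode the categorical attributes
y = data["buys_computer"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # text rendering of the induced tree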

14
How to Handle Continuous-Valued Attributes?
❑ Method 1: Discretize continuous values and treat them as categorical values
❑ E.g., age: < 20, 20..30, 30..40, 40..50, > 50
❑ Method 2: Determine the best split point for continuous-valued attribute A
❑ Sort the values of A in increasing order, e.g., 15, 18, 21, 22, 24, 25, 29, 31, …
❑ Possible split point: the midpoint between each pair of adjacent values
❑ (ai + ai+1)/2 is the midpoint between the values of ai and ai+1
❑ e.g., (15+18)/2 = 16.5, 19.5, 21.5, 23, 24.5, 27, 30, …
❑ The point with the maximum information gain for A is selected as the split-
point for A
❑ Split: Based on split point P
❑ The set of tuples in D satisfying A ≤ P vs. those with A > P
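A small sketch of Method 2 (candidate midpoints for a continuous attribute); the values are the ones used in the example above:

# Candidate split points for a continuous-valued attribute: midpoints of adjacent sorted values.
values = sorted([15, 18, 21, 22, 24, 25, 29, 31])
midpoints = [(a + b) / 2 for a, b in zip(values, values[1:])]
print(midpoints)   # [16.5, 19.5, 21.5, 23.0, 24.5, 27.0, 30.0]
# Each midpoint P is then evaluated via the information gain of the split A <= P vs. A > P,
# and the midpoint with the maximum gain becomes the split point for A.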
15
Gain Ratio: A Refined Measure for Attribute Selection
❑ Information gain measure is biased towards attributes with a large number of
values
❑ Gain ratio: Overcomes the problem (as a normalization to information gain)

      SplitInfo_A(D) = − Σ_{j=1..v} (|Dj| / |D|) × log2(|Dj| / |D|)

❑ GainRatio(A) = Gain(A) / SplitInfo_A(D)
❑ The attribute with the maximum gain ratio is selected as the splitting attribute
❑ Gain ratio is used in a popular algorithm C4.5 (a successor of ID3) by R. Quinlan
❑ Example
❑ SplitInfo_income(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557
❑ GainRatio(income) = 0.029/1.557 = 0.019
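These two values can be reproduced with a short sketch; the partition sizes 4/6/4 come from the income values (high/medium/low) in the training table:

# SplitInfo_income(D) and GainRatio(income) for the 14-tuple training set.
from math import log2

def split_info(partition_sizes):
    n = sum(partition_sizes)
    return -sum(s / n * log2(s / n) for s in partition_sizes if s > 0)

si = split_info([4, 6, 4])       # income = high (4), medium (6), low (4) -> ~1.557
gain_income = 0.029              # from the information-gain example above
print(round(si, 3), round(gain_income / si, 3))   # 1.557 0.019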

16
Another Measure: Gini Index
❑ Gini index: Used in CART, and also in IBM IntelligentMiner
❑ If a data set D contains examples from n classes, the gini index gini(D) is defined as

      gini(D) = 1 − Σ_{j=1..n} pj²

  where pj is the relative frequency of class j in D
❑ If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as

      gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)

❑ Reduction in impurity:

      Δgini(A) = gini(D) − gini_A(D)

❑ The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is
  chosen to split the node (all possible splitting points must be enumerated for each attribute)
17
Computation of Gini Index
❑ Example: D has 9 tuples with buys_computer = “yes” and 5 with “no”

      gini(D) = 1 − (9/14)² − (5/14)² = 0.459

❑ Suppose the attribute income partitions D into D1 = {low, medium} (10 tuples) and D2 = {high} (4 tuples)

      gini_income∈{low,medium}(D) = (10/14) gini(D1) + (4/14) gini(D2)
                                  = (10/14)(1 − (7/10)² − (3/10)²) + (4/14)(1 − (2/4)² − (2/4)²)
                                  = 0.443
                                  = gini_income∈{high}(D)

❑ Gini{low,high} is 0.458; Gini{medium,high} is 0.450
❑ Thus, split on {low, medium} (and {high}) since it has the lowest Gini index
❑ All attributes are assumed continuous-valued
❑ May need other tools, e.g., clustering, to get the possible split values
❑ Can be modified for categorical attributes
18
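A sketch reproducing the Gini computation above; the class counts per income partition (7 yes / 3 no and 2 yes / 2 no) are read off the training table:

# Gini index for D and for the binary split income in {low, medium} vs. {high}.
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

gini_D = gini([9, 5])                                  # ~0.459

d1 = (7, 3)   # income in {low, medium}: 10 tuples, 7 "yes", 3 "no"
d2 = (2, 2)   # income = high:            4 tuples, 2 "yes", 2 "no"
n = 14
gini_income = (sum(d1) / n) * gini(d1) + (sum(d2) / n) * gini(d2)
print(round(gini_D, 3), round(gini_income, 3))         # 0.459 0.443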


Comparing Three Attribute Selection Measures
❑ The three measures, in general, return good results but
❑ Information gain:
❑ biased towards multivalued attributes
❑ Gain ratio:
❑ tends to prefer unbalanced splits in which one partition is much smaller than
the others
❑ Gini index:
❑ biased to multivalued attributes
❑ has difficulty when # of classes is large
❑ tends to favor tests that result in equal-sized partitions and purity in both
partitions

19
Other Attribute Selection Measures
❑ Minimal Description Length (MDL) principle
❑ Philosophy: The simplest solution is preferred
❑ The best tree as the one that requires the fewest # of bits to both (1) encode
the tree, and (2) encode the exceptions to the tree
❑ CHAID: a popular decision tree algorithm, measure based on χ2 test for
independence
❑ Multivariate splits (partition based on multiple variable combinations)
❑ CART: finds multivariate splits based on a linear combination of attributes
❑ There are many other measures proposed in research and applications
❑ E.g., G-statistics, C-SEP
❑ Which attribute selection measure is the best?
❑ Most give good results; none is significantly superior to the others
20
Overfitting and Tree Pruning
❑ Overfitting: An induced tree may overfit the training data
❑ Too many branches, some may reflect anomalies due to noise or
outliers
❑ Poor accuracy for unseen samples
❑ Two approaches to avoid overfitting
❑ Prepruning: Halt tree construction early: do not split a node if this
would result in the goodness measure falling below a threshold
❑ Difficult to choose an appropriate threshold
❑ Postpruning: Remove branches from a “fully grown” tree—get a
sequence of progressively pruned trees
❑ Use a set of data different from the training data to decide which is the “best pruned tree”
21
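A hedged sketch of both pruning flavors using scikit-learn (a library choice the slides do not prescribe): prepruning via a minimum-leaf-size threshold, and postpruning via cost-complexity pruning, where a held-out set picks the "best pruned tree".

# Prepruning vs. postpruning with scikit-learn decision trees (illustrative sketch).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)

# Prepruning: stop splitting early when a node would become too small.
pre = DecisionTreeClassifier(min_samples_leaf=10, random_state=0).fit(X_train, y_train)

# Postpruning: compute the cost-complexity pruning path, fit one tree per alpha,
# and keep the tree that scores best on data not used for training.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best = max(
    (DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_train, y_train) for a in path.ccp_alphas),
    key=lambda t: t.score(X_hold, y_hold),
)
print("prepruned:", pre.score(X_hold, y_hold), "postpruned:", best.score(X_hold, y_hold))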
Classification: Basic Concepts
❑ Classification: Basic Concepts

❑ Decision Tree Induction

❑ Bayes Classification Methods

❑ Linear Classifier

❑ Model Evaluation and Selection

❑ Techniques to Improve Classification Accuracy: Ensemble Methods

❑ Additional Concepts on Classification

❑ Summary

22
What Is Bayesian Classification?
❑ A statistical classifier
❑ Perform probabilistic prediction (i.e., predict class membership probabilities)
❑ Foundation—Based on Bayes’ Theorem
❑ Performance
❑ A simple Bayesian classifier, naïve Bayesian classifier, has comparable
performance with decision tree and selected neural network classifiers
❑ Incremental
❑ Each training example can incrementally increase/decrease the probability that
a hypothesis is correct—prior knowledge can be combined with observed data
❑ Theoretical Standard
❑ Even when Bayesian methods are computationally intractable, they can provide
a standard of optimal decision making against which other methods can be
measured
23
Bayes’ Theorem: Basics
❑ Total probability theorem:

      P(B) = Σ_i P(B | Ai) P(Ai)

❑ Bayes’ theorem:

      P(H | X) = P(X | H) P(H) / P(X)  ∝  P(X | H) P(H)

      posterior probability (what we conclude)  ∝  likelihood (what we observe)  ×  prior probability (what we knew previously)

❑ X: a data sample (“evidence”); H: the hypothesis that X belongs to class C
❑ Prediction can be done based on Bayes’ theorem; classification derives the maximum a posteriori (MAP) class
24
Naïve Bayes Classifier: Making a Naïve Assumption
❑ Practical difficulty of Bayesian inference: it requires initial knowledge of many
probabilities, which may not be available or may involve significant computational cost
❑ A naïve special case
❑ Make an additional assumption to simplify the model, but achieve comparable
performance: attributes are conditionally independent given the class
(i.e., no dependence relation between attributes)

      P(X | Ci) = Π_k P(xk | Ci) = P(x1 | Ci) · P(x2 | Ci) · … · P(xn | Ci)

❑ Only need to count the class distribution w.r.t. each feature


25
Naïve Bayes Classifier: Categorical vs. Continuous-Valued Features
❑ If feature xk is categorical, P(xk = vk | Ci) is the number of tuples in Ci with xk = vk,
divided by |Ci,D| (the number of tuples of Ci in D)

      P(X | Ci) = Π_k P(xk | Ci) = P(x1 | Ci) · P(x2 | Ci) · … · P(xn | Ci)

❑ If feature xk is continuous-valued, P(xk = vk | Ci) is usually computed based on a
Gaussian distribution with mean μ_Ci and standard deviation σ_Ci:

      P(xk = vk | Ci) = N(xk | μ_Ci, σ_Ci) = (1 / (√(2π) σ_Ci)) · exp(−(xk − μ_Ci)² / (2σ_Ci²))
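For the continuous case, a small sketch of the Gaussian class-conditional density; in practice μ and σ are estimated from the training tuples of class Ci, and the numbers below are made-up illustrative values:

# Gaussian class-conditional likelihood p(x_k | C_i) = N(x_k | mu_Ci, sigma_Ci).
from math import exp, pi, sqrt

def gaussian_likelihood(x, mu, sigma):
    return (1.0 / (sqrt(2 * pi) * sigma)) * exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Example: suppose class C_i has mean age 38 with standard deviation 12 (assumed values).
print(gaussian_likelihood(x=35, mu=38.0, sigma=12.0))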

26
Naïve Bayes Classifier: Training Dataset

❑ Classes: C1: buys_computer = ‘yes’; C2: buys_computer = ‘no’
❑ Training data: the same 14-tuple buys_computer table as on slide 4
  (age, income, student, credit_rating, buys_computer)
❑ Data to be classified:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

27
Naïve Bayes Classifier: An Example
❑ P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
        P(buys_computer = “no”) = 5/14 = 0.357
❑ Compute P(X | Ci) for each class:
  P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
  P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.600
  P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
  P(income = “medium” | buys_computer = “no”) = 2/5 = 0.400
  P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
  P(student = “yes” | buys_computer = “no”) = 1/5 = 0.200
  P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
  P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.400
❑ X = (age <= 30, income = medium, student = yes, credit_rating = fair)
  P(X | buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X | buys_computer = “no”) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
  P(X | buys_computer = “yes”) × P(buys_computer = “yes”) = 0.028
  P(X | buys_computer = “no”) × P(buys_computer = “no”) = 0.007
  Therefore, X belongs to class “buys_computer = yes”
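The computation above can be replayed directly; a sketch with the conditional probabilities hard-coded from the slide (the dictionary keys are illustrative names):

# Naive Bayes decision for X = (age<=30, income=medium, student=yes, credit_rating=fair).
prior = {"yes": 9 / 14, "no": 5 / 14}
cond = {
    "yes": {"age<=30": 2 / 9, "income=medium": 4 / 9, "student=yes": 6 / 9, "credit=fair": 6 / 9},
    "no":  {"age<=30": 3 / 5, "income=medium": 2 / 5, "student=yes": 1 / 5, "credit=fair": 2 / 5},
}

scores = {}
for c in prior:
    likelihood = 1.0
    for feature_value in cond[c]:
        likelihood *= cond[c][feature_value]      # naive conditional-independence assumption
    scores[c] = likelihood * prior[c]             # P(X | C_i) * P(C_i)

print({c: round(s, 3) for c, s in scores.items()})       # {'yes': 0.028, 'no': 0.007}
print("predicted class:", max(scores, key=scores.get))   # yes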
28
Avoiding the Zero-Probability Problem
❑ Naïve Bayesian prediction requires each conditional probability be non-zero
❑ Otherwise, the predicted probability will be zero
      P(X | Ci) = Π_k P(xk | Ci) = P(x1 | Ci) · P(x2 | Ci) · … · P(xn | Ci)
❑ Example. Suppose a dataset with 1000 tuples:
income = low (0), income= medium (990), and income = high (10)
❑ Use Laplacian correction (or Laplacian estimator)
❑ Adding 1 to each case
Prob(income = low) = 1/(1000 + 3)
Prob(income = medium) = (990 + 1)/(1000 + 3)
Prob(income = high) = (10 + 1)/(1000 + 3)
❑ The “corrected” probability estimates are close to their “uncorrected”
counterparts
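A one-line sketch of the Laplacian correction for the income example (1000 tuples, counts 0/990/10, and q = 3 distinct attribute values):

# Laplacian (add-one) correction: add 1 to each count and q to the denominator.
counts = {"low": 0, "medium": 990, "high": 10}
n, q = sum(counts.values()), len(counts)
smoothed = {v: (c + 1) / (n + q) for v, c in counts.items()}
print(smoothed)   # low ~0.001, medium ~0.988, high ~0.011 -- no probability is zero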
29
Naïve Bayes Classifier: Strength vs. Weakness
❑ Strength
❑ Easy to implement
❑ Good results obtained in most of the cases
❑ Weakness
❑ Assumption: class-conditional independence of attributes, which causes a loss of accuracy
❑ Practically, dependencies exist among variables
❑ E.g., Patients: Profile: age, family history, etc.
Symptoms: fever, cough etc.
Disease: lung cancer, diabetes, etc.
❑ Dependencies among these cannot be modeled by Naïve Bayes Classifier
❑ How to deal with these dependencies?
❑ Use Bayesian Belief Networks (to be covered in the next chapter)
30
Classification: Advanced Methods
❑ Bayesian Belief Networks

❑ Support Vector Machines

❑ Neural Networks and Deep Learning

❑ Pattern-Based Classification

❑ Lazy Learners and K-Nearest Neighbors

❑ Other Classification Methods

❑ Summary

31
Lazy vs. Eager Learning
❑ Lazy vs. eager learning
❑ Lazy learning (e.g., instance-based learning): Simply stores training data (or only
minor processing) and waits until it is given a test tuple
❑ Eager learning (the above discussed methods): Given a set of training tuples,
constructs a classification model before receiving new (e.g., test) data to classify
❑ Lazy: less time in training but more time in predicting
❑ Accuracy
❑ Lazy method effectively uses a richer hypothesis space since it uses many local
linear functions to form an implicit global approximation to the target function
❑ Eager: must commit to a single hypothesis that covers the entire instance space

32
Lazy Learner: Instance-Based Methods
❑ Instance-based learning:
❑ Store training examples and delay the processing (“lazy evaluation”) until a
new instance must be classified
❑ Typical approaches
❑ k-nearest neighbor approach
❑ Instances represented as points in a Euclidean space.
❑ Locally weighted regression
❑ Constructs local approximation
❑ Case-based reasoning
❑ Uses symbolic representations and knowledge-based inference

33
The k-Nearest Neighbor Algorithm
❑ All instances correspond to points in the n-D space
❑ The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
❑ Target function could be discrete- or real- valued
❑ For discrete-valued, k-NN returns the most common value among the k training
examples nearest to xq
❑ Voronoi diagram: the decision surface induced by 1-NN for a typical set of
training examples

[Figure: a 2-D example showing the query point xq surrounded by “+” and “−” training instances.]
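A hedged k-NN sketch using scikit-learn (an assumed library; the toy points are made up): Euclidean distance and a majority vote among the k nearest training points, plus the distance-weighted variant discussed on the next slide.

# k-nearest-neighbor classification with Euclidean distance and majority vote.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [4.0, 4.2], [4.1, 3.9], [3.8, 4.0]])
y_train = np.array(["-", "-", "-", "+", "+", "+"])

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean").fit(X_train, y_train)
x_q = np.array([[3.5, 3.6]])          # the query point
print(knn.predict(x_q))               # most common class among the 3 nearest neighbors -> "+"

# Distance-weighted variant: closer neighbors contribute more to the vote.
knn_w = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_train, y_train)
print(knn_w.predict(x_q))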
34
Discussion on the k-NN Algorithm
❑ k-NN for real-valued prediction for a given unknown tuple
❑ Returns the mean values of the k nearest neighbors
❑ Distance-weighted nearest neighbor algorithm
❑ Weight the contribution of each of the k neighbors according to their distance
to the query xq
❑ Give greater weight to closer neighbors
❑ Robust to noisy data by averaging k-nearest neighbors
❑ Curse of dimensionality: distance between neighbors could be dominated by
irrelevant attributes
❑ To overcome it, stretch the axes or eliminate the least relevant attributes

35
Case-Based Reasoning (CBR)
❑ CBR: Uses a database of problem solutions to solve new problems
❑ Store symbolic description (tuples or cases)—not points in a Euclidean space
❑ Applications: Customer-service (product-related diagnosis), legal ruling
❑ Methodology
❑ Instances represented by rich symbolic descriptions (e.g., function graphs)
❑ Search for similar cases, multiple retrieved cases may be combined
❑ Tight coupling between case retrieval, knowledge-based reasoning, and problem
solving
❑ Challenges
❑ Find a good similarity metric
❑ Indexing based on syntactic similarity measures and, when this fails, backtracking
and adapting to additional cases
36
END OF UNIT - III

37
