Part 2
Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University
© Prentice Hall 2
Classification Outline
Goal: Provide an overview of the classification
problem and introduce some of the basic
algorithms
Classification Problem Overview
Classification Techniques
– Regression
– Distance
– Decision Trees
– Rules
– Neural Networks
© Prentice Hall 3
Classification Problem
Given a database D={t1,t2,…,tn} and a set
of classes C={C1,…,Cm}, the
Classification Problem is to define a
mapping f: D → C where each ti is assigned
to one class.
Actually divides D into equivalence
classes.
Prediction is similar, but may be viewed
as having infinite number of classes.
© Prentice Hall 4
Classification Examples
Teachers classify students’ grades as
A, B, C, D, or F.
Identify mushrooms as poisonous or
edible.
Predict when a river will flood.
Identify individuals with credit risks.
Speech recognition
Pattern recognition
© Prentice Hall 5
Classification Ex: Grading
If x >= 90 then grade = A.
If 80 <= x < 90 then grade = B.
If 70 <= x < 80 then grade = C.
If 60 <= x < 70 then grade = D.
If x < 60 then grade = F.
[Figure: the same rules drawn as a decision tree that repeatedly splits on x at 90, 80, 70, and 60.]
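As code, these rules are just a chain of comparisons; a minimal Python sketch (function name illustrative):

def grade(x):
    # Assign a letter grade from a numeric score, following the rules above.
    if x >= 90:
        return "A"
    elif x >= 80:
        return "B"
    elif x >= 70:
        return "C"
    elif x >= 60:
        return "D"
    else:
        return "F"

print(grade(85))  # B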
© Prentice Hall 6
Classification Ex: Letter
Recognition
View letters as constructed from 5 components:
[Figure: letters A through F, each drawn from the five components.]
© Prentice Hall 7
Classification Techniques
Approach:
1. Create specific model by evaluating
training data (or using domain
experts’ knowledge).
2. Apply model developed to new data.
Classes must be predefined
Most common techniques use DTs,
NNs, or are based on distances or
statistical methods.
© Prentice Hall 8
Defining Classes
Distance Based
Partitioning Based
© Prentice Hall 9
Issues in Classification
Missing Data
– Ignore
– Replace with assumed value
Measuring Performance
– Classification accuracy on test data
– Confusion matrix
– OC Curve
© Prentice Hall 10
Height Example Data
Name Gender Height Output1 Output2
Kristina F 1.6m Short Medium
Jim M 2m Tall Medium
Maggie F 1.9m Medium Tall
Martha F 1.88m Medium Tall
Stephanie F 1.7m Short Medium
Bob M 1.85m Medium Medium
Kathy F 1.6m Short Medium
Dave M 1.7m Short Medium
Worth M 2.2m Tall Tall
Steven M 2.1m Tall Tall
Debbie F 1.8m Medium Medium
Todd M 1.95m Medium Medium
Kim F 1.9m Medium Tall
Amy F 1.8m Medium Medium
Wynette F 1.75m Medium Medium
© Prentice Hall 11
Classification Performance
© Prentice Hall 12
Confusion Matrix Example
Actual membership   Assigned Short   Assigned Medium   Assigned Tall
Short               0                4                 0
Medium              0                5                 3
Tall                0                1                 2
© Prentice Hall 13
Operating Characteristic Curve
© Prentice Hall 14
Regression
Assume data fits a predefined function
Determine best values for regression
coefficients c0,c1,…,cn.
Assume an error: y = c0 + c1x1 + … + cnxn + ε
Estimate the error using the mean squared error over the
training set:
MSE = (1/n) Σi (yi − (c0 + c1xi1 + … + cnxin))²
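A minimal sketch of fitting such a function by least squares, assuming NumPy and an input matrix X of predictor values (names illustrative):

import numpy as np

def fit_linear_regression(X, y):
    # X: (n_samples, n_features) predictors, y: (n_samples,) targets.
    Xb = np.column_stack([np.ones(len(X)), X])       # prepend 1s so c0 is the intercept
    coeffs, *_ = np.linalg.lstsq(Xb, y, rcond=None)  # best [c0, c1, ..., cn]
    mse = np.mean((Xb @ coeffs - y) ** 2)            # mean squared error on the training set
    return coeffs, mse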
© Prentice Hall 15
Linear Regression Poor Fit
© Prentice Hall 16
Classification Using Regression
Division: Use regression function to
divide area into regions.
Prediction: Use regression function to
predict a class membership function.
Input includes desired class.
© Prentice Hall 17
Division
© Prentice Hall 18
Prediction
© Prentice Hall 19
Classification Using Distance
Place items in class to which they are
“closest”.
Must determine distance between an item
and a class.
Classes represented by
– Centroid: Central value.
– Medoid: Representative point.
– Individual points
Algorithm: KNN
© Prentice Hall 20
K Nearest Neighbor (KNN):
Training set includes classes.
Examine K items near item to be
classified.
New item placed in the class with the greatest
number of close items.
O(q) for each tuple to be classified.
(Here q is the size of the training set.)
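A minimal KNN sketch, assuming the training set is a list of (tuple, class) pairs and dist is any distance function (names illustrative):

from collections import Counter

def knn_classify(train, item, k, dist):
    neighbors = sorted(train, key=lambda tc: dist(tc[0], item))[:k]  # K closest training items
    votes = Counter(c for _, c in neighbors)                         # tally their classes
    return votes.most_common(1)[0][0]                                # majority class wins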
© Prentice Hall 21
KNN
© Prentice Hall 22
KNN Algorithm
© Prentice Hall 23
Classification Using Decision
Trees
Partitioning based: Divide search
space into rectangular regions.
Tuple placed into class based on the
region within which it falls.
DT approaches differ in how the tree is
built: DT Induction
Internal nodes associated with attribute
and arcs with values for that attribute.
Algorithms: ID3, C4.5, CART
© Prentice Hall 24
Decision Tree
Given:
– D = {t1, …, tn} where ti=<ti1, …, tih>
– Database schema contains {A1, A2, …, Ah}
– Classes C={C1, …., Cm}
Decision or Classification Tree is a tree associated
with D such that
– Each internal node is labeled with an attribute, Ai
– Each arc is labeled with a predicate which can be
applied to the attribute at the parent
– Each leaf node is labeled with a class, Cj
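A minimal sketch of this structure, with the arc predicates stored on each internal node (class and function names illustrative):

class DTNode:
    def __init__(self, attribute=None, branches=None, label=None):
        self.attribute = attribute      # Ai tested at an internal node
        self.branches = branches or []  # list of (predicate, child) pairs, one per arc
        self.label = label              # class Cj at a leaf

def classify(node, t):
    # Walk from the root, following the arc whose predicate holds for t's attribute value.
    if node.label is not None:
        return node.label
    value = t[node.attribute]
    for predicate, child in node.branches:
        if predicate(value):
            return classify(child, t)
    return None  # no arc applies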
© Prentice Hall 25
DT Induction
© Prentice Hall 26
DT Splits Area
[Figure: the Gender/Height space split into rectangular regions, M vs. F on one axis and height ranges on the other.]
© Prentice Hall 27
Comparing DTs
Balanced
Deep
© Prentice Hall 28
DT Issues
Choosing Splitting Attributes
Ordering of Splitting Attributes
Splits
Tree Structure
Stopping Criteria
Training Data
Pruning
© Prentice Hall 29
Decision Tree Induction is often based on
Information Theory
© Prentice Hall 30
Information
© Prentice Hall 31
DT Induction
When all the marbles in the bowl are
mixed up, little information is given.
When the marbles in the bowl are all
from one class and those in the other
two classes are on either side, more
information is given.
© Prentice Hall 33
Entropy
© Prentice Hall 34
ID3
Creates tree using information theory
concepts and tries to reduce the expected
number of comparisons.
ID3 chooses the split attribute with the highest
information gain:
Gain(D, S) = H(D) − Σi P(Di) H(Di), where H is entropy and D1, …, Ds are the subsets of D produced by split S.
© Prentice Hall 35
ID3 Example (Output1)
Starting state entropy:
4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384
Gain using gender:
– Female: 3/9 log(9/3)+6/9 log(9/6)=0.2764
– Male: 1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) =
0.4392
– Weighted sum: (9/15)(0.2764) + (6/15)(0.4392) =
0.34152
– Gain: 0.4384 – 0.34152 = 0.09688
Gain using height:
0.4384 – (2/15)(0.301) = 0.3983
Choose height as first splitting attribute
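The same computation as code; the slides use base-10 logarithms, which is why the starting entropy comes out as 0.4384 (helper names illustrative):

import math

def entropy(counts):
    total = sum(counts)
    return sum((c / total) * math.log10(total / c) for c in counts if c > 0)

def gain(parent_counts, subsets):
    # subsets: one class-count list per value of the candidate split attribute
    total = sum(parent_counts)
    weighted = sum((sum(s) / total) * entropy(s) for s in subsets)
    return entropy(parent_counts) - weighted

print(entropy([4, 8, 3]))                       # starting state: ~0.4384
print(gain([4, 8, 3], [[3, 6, 0], [1, 2, 3]]))  # split on gender: ~0.0969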
© Prentice Hall 36
C4.5
ID3 favors attributes with a large number of
divisions
Improved version of ID3:
– Missing Data
– Continuous Data
– Pruning
– Rules
– GainRatio: GainRatio(D, S) = Gain(D, S) / H(|D1|/|D|, …, |Ds|/|D|)
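A self-contained sketch of GainRatio under the same base-10 convention as the ID3 sketch above (names illustrative):

import math

def entropy(counts):
    total = sum(counts)
    return sum((c / total) * math.log10(total / c) for c in counts if c > 0)

def gain_ratio(parent_counts, subsets):
    total = sum(parent_counts)
    g = entropy(parent_counts) - sum((sum(s) / total) * entropy(s) for s in subsets)
    split_info = entropy([sum(s) for s in subsets])  # H(|D1|/|D|, ..., |Ds|/|D|)
    return g / split_info if split_info > 0 else 0.0

print(gain_ratio([4, 8, 3], [[3, 6, 0], [1, 2, 3]]))  # gender split on Output1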
© Prentice Hall 37
CART
Create Binary Tree
Uses entropy
Formula to choose split point, s, for node t:
Φ(s/t) = 2 PL PR Σj |P(Cj | tL) − P(Cj | tR)|, where PL and PR are the fractions of tuples sent to the left and right subtrees and the sum runs over the classes.
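A minimal sketch of this measure, taking the per-class counts on each side of a candidate split (helper name illustrative):

def phi(left_counts, right_counts):
    # left_counts[j], right_counts[j]: number of class-j tuples sent left / right by split s
    nL, nR = sum(left_counts), sum(right_counts)
    if nL == 0 or nR == 0:
        return 0.0
    n = nL + nR
    spread = sum(abs(l / nL - r / nR) for l, r in zip(left_counts, right_counts))
    return 2 * (nL / n) * (nR / n) * spread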
© Prentice Hall 38
CART Example
At the start, there are six choices for
split point (right branch on equality):
– P(Gender)=2(6/15)(9/15)(2/15 + 4/15 + 3/15)=0.224
– P(1.6) = 0
– P(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169
– P(1.8) = 2(5/15)(10/15)(4/15 + 6/15 + 3/15) = 0.385
– P(1.9) = 2(9/15)(6/15)(4/15 + 2/15 + 3/15) = 0.256
– P(2.0) = 2(12/15)(3/15)(4/15 + 8/15 + 3/15) = 0.32
Split at 1.8
© Prentice Hall 39
Classification Using Neural
Networks
Typical NN structure for classification:
– One output node per class
– Output value is class membership function value
Supervised learning
For each tuple in training set, propagate it
through NN. Adjust weights on edges to
improve future classification.
Algorithms: Propagation, Backpropagation,
Gradient Descent
© Prentice Hall 40
NN Issues
Number of source nodes
Number of hidden layers
Training data
Number of sinks
Interconnections
Weights
Activation Functions
Learning Technique
When to stop learning
© Prentice Hall 41
Decision Tree vs. Neural
Network
© Prentice Hall 42
Propagation
[Figure: a tuple's attribute values fed to the input nodes and propagated forward to the output.]
© Prentice Hall 43
NN Propagation Algorithm
© Prentice Hall 44
Example Propagation
© Prentice Hall 45
NN Learning
Adjust weights to perform better with the
associated test data.
Supervised: Use feedback from
knowledge of correct classification.
Unsupervised: No knowledge of
correct classification needed.
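A minimal sketch of one supervised adjustment for a single sigmoid node: nudge each weight against the error gradient (learning rate and names illustrative):

import math

def update_weights(weights, inputs, desired, rate=0.1):
    y = 1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(weights, inputs))))  # sigmoid output
    delta = (desired - y) * y * (1.0 - y)   # gradient term for squared error
    return [w + rate * delta * x for w, x in zip(weights, inputs)]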
© Prentice Hall 46
NN Supervised Learning
© Prentice Hall 47
Supervised Learning
Possible error values assuming output from
node i is yi but should be di:
© Prentice Hall 49
Backpropagation
Error
© Prentice Hall 50
Backpropagation Algorithm
© Prentice Hall 51
Gradient Descent
© Prentice Hall 52
Gradient Descent Algorithm
© Prentice Hall 53
Output Layer Learning
© Prentice Hall 54
Hidden Layer Learning
© Prentice Hall 55
Types of NNs
Different NN structures used for
different problems.
Perceptron
Self Organizing Feature Map
Radial Basis Function Network
© Prentice Hall 56
Perceptron
© Prentice Hall 57
Perceptron Example
Suppose:
– Summation: S=3x1+2x2-6
– Activation: if S>0 then 1 else 0
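The same perceptron as code:

def perceptron(x1, x2):
    S = 3 * x1 + 2 * x2 - 6        # summation
    return 1 if S > 0 else 0       # step activation

print(perceptron(1, 1))  # S = -1 -> 0
print(perceptron(2, 1))  # S =  2 -> 1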
© Prentice Hall 58
Self Organizing Feature Map
(SOFM)
Competitive Unsupervised Learning
Observe how neurons work in brain:
– Firing impacts firing of those near
– Neurons far apart inhibit each other
– Neurons have specific nonoverlapping
tasks
Ex: Kohonen Network
© Prentice Hall 59
Kohonen Network
© Prentice Hall 60
Kohonen Network
Competitive Layer – viewed as 2D grid
Similarity between competitive nodes and
input nodes:
– Input: X = <x1, …, xh>
– Weights: <w1i, … , whi>
– Similarity defined based on dot product
Competitive node most similar to input “wins”
Winning node weights (as well as surrounding
node weights) increased.
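A minimal sketch of one training step under these rules, assuming NumPy, a small 2D grid of weight vectors, and illustrative rate/radius values:

import numpy as np

def kohonen_step(weights, x, rate=0.1, radius=1):
    # weights: (rows, cols, h) grid of node weight vectors; x: length-h input tuple
    scores = np.tensordot(weights, x, axes=([2], [0]))           # dot-product similarity
    wr, wc = np.unravel_index(np.argmax(scores), scores.shape)   # winning node
    for r in range(weights.shape[0]):
        for c in range(weights.shape[1]):
            if abs(r - wr) <= radius and abs(c - wc) <= radius:  # winner and its neighbours
                weights[r, c] += rate * (x - weights[r, c])      # pull toward the input
    return weights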
© Prentice Hall 61
Radial Basis Function Network
RBF function has Gaussian shape
RBF Networks
– Three Layers
– Hidden layer – Gaussian activation
function
– Output layer – Linear activation function
© Prentice Hall 62
Radial Basis Function Network
© Prentice Hall 63
Classification Using Rules
Perform classification using If-Then
rules
Classification Rule: r = <a, c>, where a is the
antecedent and c is the consequent
May generate rules from other techniques
(DT, NN) or generate them directly.
Algorithms: Gen, RX, 1R, PRISM
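A minimal sketch of rules as <antecedent, consequent> pairs, with the antecedent a list of attribute tests (thresholds and names purely illustrative):

def rule_applies(antecedent, t):
    return all(pred(t[attr]) for attr, pred in antecedent)

def classify_with_rules(rules, t, default=None):
    for antecedent, consequent in rules:   # first matching rule wins
        if rule_applies(antecedent, t):
            return consequent
    return default

rules = [([("Height", lambda h: h >= 2.0)], "Tall"),
         ([("Height", lambda h: h < 1.7)], "Short")]
print(classify_with_rules(rules, {"Height": 1.95}, default="Medium"))  # Medium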
© Prentice Hall 64
Generating Rules from DTs
© Prentice Hall 65
Generating Rules Example
© Prentice Hall 66
Generating Rules from NNs
© Prentice Hall 67
1R Algorithm
© Prentice Hall 68
1R Example
© Prentice Hall 69
PRISM Algorithm
© Prentice Hall 70
PRISM Example
© Prentice Hall 71
Decision Tree vs. Rules
Decision Tree: has an implied order in which the splitting is performed, and the tree is created by looking at all classes.
Rules: have no ordering of predicates, and only one class need be examined to generate its rules.
© Prentice Hall 72
Clustering Outline
Goal: Provide an overview of the clustering
problem and introduce some of the basic
algorithms
© Prentice Hall 73
Clustering Examples
Segment customer database based on
similar buying patterns.
Group houses in a town into
neighborhoods based on similar
features.
Identify new plant species
Identify similar Web usage patterns
© Prentice Hall 74
Clustering Example
© Prentice Hall 75
Clustering Houses
[Figure: the same houses clustered two ways, Size Based vs. Geographic Distance Based.]
© Prentice Hall 76
Clustering vs. Classification
No prior knowledge
– Number of clusters
– Meaning of clusters
Unsupervised learning
© Prentice Hall 77
Clustering Issues
Outlier handling
Dynamic data
Interpreting results
Evaluating results
Number of clusters
Data to be used
Scalability
© Prentice Hall 78
Impact of Outliers on
Clustering
© Prentice Hall 79
Clustering Problem
Given a database D={t1,t2,…,tn} of tuples
and an integer value k, the Clustering
Problem is to define a mapping
f: D → {1, …, k} where each ti is assigned to
one cluster Kj, 1 <= j <= k.
A Cluster, Kj, contains precisely those
tuples mapped to it.
Unlike classification problem, clusters
are not known a priori.
© Prentice Hall 80
Types of Clustering
Hierarchical – Nested set of clusters
created.
Partitional – One set of clusters
created.
Incremental – Each element handled
one at a time.
Simultaneous – All elements handled
together.
Overlapping/Non-overlapping
© Prentice Hall 81
Clustering Approaches
© Prentice Hall 82
Cluster Parameters
© Prentice Hall 83
Distance Between Clusters
Single Link: smallest distance between points
Complete Link: largest distance between points
Average Link: average distance between points
Centroid: distance between centroids
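Minimal sketches of the four measures, for clusters given as lists of points and any point-to-point distance function dist (names illustrative):

def single_link(K1, K2, dist):
    return min(dist(a, b) for a in K1 for b in K2)

def complete_link(K1, K2, dist):
    return max(dist(a, b) for a in K1 for b in K2)

def average_link(K1, K2, dist):
    return sum(dist(a, b) for a in K1 for b in K2) / (len(K1) * len(K2))

def centroid_distance(K1, K2, dist):
    c1 = [sum(xs) / len(K1) for xs in zip(*K1)]  # component-wise centroid of K1
    c2 = [sum(xs) / len(K2) for xs in zip(*K2)]
    return dist(c1, c2)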
© Prentice Hall 84
Hierarchical Clustering
Clusters are created in levels, yielding a
set of clusters at each level.
Agglomerative
– Initially each item in its own cluster
– Iteratively clusters are merged together
– Bottom Up
Divisive
– Initially all items in one cluster
– Large clusters are successively divided
– Top Down
© Prentice Hall 85
Hierarchical Algorithms
Single Link
MST Single Link
Complete Link
Average Link
© Prentice Hall 86
Dendrogram
Dendrogram: a tree data
structure which illustrates
hierarchical clustering
techniques.
Each level shows clusters
for that level.
– Leaf – individual clusters
– Root – one cluster
A cluster at level i is the
union of its child clusters
at level i+1.
© Prentice Hall 87
Levels of Clustering
© Prentice Hall 88
Agglomerative Example
Distance matrix:
    A  B  C  D  E
A   0  1  2  2  3
B   1  0  2  4  3
C   2  2  0  1  5
D   2  4  1  0  3
E   3  3  5  3  0
[Figure: the items A–E drawn as a graph, with the resulting dendrogram over A, B, C, D, E at distance thresholds 1 through 5.]
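The same example can be reproduced with SciPy (shown only for illustration): single-link agglomerative clustering on the distance matrix above.

from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

D = [[0, 1, 2, 2, 3],
     [1, 0, 2, 4, 3],
     [2, 2, 0, 1, 5],
     [2, 4, 1, 0, 3],
     [3, 3, 5, 3, 0]]
merges = linkage(squareform(D), method="single")  # each row: clusters merged, merge distance, new size
print(merges)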
© Prentice Hall 89
MST Example
    A  B  C  D  E
A   0  1  2  2  3
B   1  0  2  4  3
C   2  2  0  1  5
D   2  4  1  0  3
E   3  3  5  3  0
[Figure: a minimum spanning tree over the items A–E.]
© Prentice Hall 90
Agglomerative Algorithm
© Prentice Hall 91
Single Link
View all items with links (distances)
between them.
Finds maximal connected components
in this graph.
Two clusters are merged if there is at
least one edge which connects them.
Uses threshold distances at each level.
Could be agglomerative or divisive.
© Prentice Hall 92
MST Single Link Algorithm
© Prentice Hall 93
Single Link Clustering
© Prentice Hall 94
Partitional Clustering
Nonhierarchical
Creates clusters in one step as opposed
to several steps.
Since only one set of clusters is output,
the user normally has to input the
desired number of clusters, k.
Usually deals with static sets.
© Prentice Hall 95
Partitional Algorithms
MST
Squared Error
K-Means
Nearest Neighbor
PAM
BEA
GA
© Prentice Hall 96
MST Algorithm
© Prentice Hall 97
Squared Error
Minimize the squared error between each tuple and the center of its cluster:
seK = Σj Σ(ti in Kj) ||ti − Cj||², where Cj is the center of cluster Kj
© Prentice Hall 98
Squared Error Algorithm
© Prentice Hall 99
K-Means
Initial set of clusters randomly chosen.
Iteratively, items are moved among sets
of clusters until the desired set is
reached.
High degree of similarity among
elements in a cluster is obtained.
Given a cluster Ki={ti1,ti2,…,tim}, the
cluster mean is mi = (1/m)(ti1 + … + tim)
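A minimal K-means sketch for one-dimensional items such as the heights above (iteration count and names illustrative):

import random

def kmeans(items, k, iterations=100):
    means = random.sample(items, k)                  # initial cluster means chosen at random
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for t in items:
            j = min(range(k), key=lambda i: abs(t - means[i]))  # nearest mean
            clusters[j].append(t)
        means = [sum(c) / len(c) if c else means[i]  # recompute mi = (1/m)(ti1 + ... + tim)
                 for i, c in enumerate(clusters)]
    return clusters, means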
Apriori Algorithm
1. C1 = Itemsets of size one in I;
2. Determine all large itemsets of size 1, L1;
3. i = 1;
4. Repeat
5. i = i + 1;
6. Ci = Apriori-Gen(Li-1);
7. Count Ci to determine Li;
8. until no more large itemsets found;
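A minimal sketch of this loop, with Apriori-Gen done as a join of Li-1 with itself followed by subset pruning; transactions are assumed to be sets of items (names illustrative):

from itertools import combinations

def apriori(transactions, min_support):
    n = len(transactions)
    items = {i for t in transactions for i in t}
    L = [frozenset([i]) for i in items
         if sum(i in t for t in transactions) / n >= min_support]          # L1
    large = list(L)
    while L:
        C = {a | b for a in L for b in L if len(a | b) == len(a) + 1}      # join step
        C = {c for c in C                                                  # prune step
             if all(frozenset(s) in L for s in combinations(c, len(c) - 1))}
        L = [c for c in C
             if sum(c <= t for t in transactions) / n >= min_support]      # count Ci -> Li
        large.extend(L)
    return large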
© Prentice Hall 146
Sampling Algorithm
1. Ds = sample of Database D;
2. PL = Large itemsets in Ds using smalls;
3. C = PL ∪ BD⁻(PL);
4. Count C in Database using s;
5. ML = large itemsets in BD⁻(PL);
6. If ML = ∅ then done
7. else C = repeated application of BD⁻;
8. Count C in Database;
© Prentice Hall 147
Sampling Example
Find AR assuming s = 20%
Ds = { t1,t2}
Smalls = 10%
PL = {{Bread}, {Jelly}, {PeanutButter},
{Bread,Jelly}, {Bread,PeanutButter}, {Jelly,
PeanutButter}, {Bread,Jelly,PeanutButter}}
BD⁻(PL) = {{Beer}, {Milk}}
ML = {{Beer}, {Milk}}
Repeated application of BD⁻ generates all
remaining itemsets
Partitioning Example
Partition D2, with s = 10%:
L2 = {{Bread}, {Milk}, {PeanutButter}, {Bread,Milk}, {Bread,PeanutButter}, {Milk,PeanutButter}, {Bread,Milk,PeanutButter}, {Beer}, {Beer,Bread}, {Beer,Milk}}
© Prentice Hall 152
Partitioning Adv/Disadv
Advantages:
– Adapts to available main memory
– Easily parallelized
– Maximum number of database scans is
two.
Disadvantages:
– May have many candidates during second
scan.