Lecture 9
Ben Kiage
Chapter 9. Classification: Advanced Methods
Neural Network as a Classifier
Weakness
Long training time
Require a number of parameters typically best determined
empirically, e.g., the network topology or “structure.”
Poor interpretability: Difficult to interpret the symbolic meaning
behind the learned weights and the “hidden units” in the network
Strength
High tolerance to noisy data
Ability to classify untrained patterns
Well-suited for continuous-valued inputs and outputs
Successful on an array of real-world data, e.g., hand-written letters
Algorithms are inherently parallel
Techniques have recently been developed for the extraction of
rules from trained neural networks
A Multi-Layer Feed-Forward Neural Network
[Figure: a multi-layer feed-forward neural network. The input vector X feeds the input layer; weighted connections wij feed the hidden layer, and the weighted hidden outputs feed the output layer, which emits the output vector. Weights are adjusted by updates of the form wj(k+1) = wj(k) + λ (yi − ŷi(k)) xij.]
How A Multi-Layer Neural Network Works
The inputs to the network correspond to the attributes measured
for each training tuple
Inputs are fed simultaneously into the units making up the input
layer
They are then weighted and fed simultaneously to a hidden layer
The number of hidden layers is arbitrary, although usually only one is used
The weighted outputs of the last hidden layer are input to units
making up the output layer, which emits the network's prediction
The network is feed-forward: None of the weights cycles back to
an input unit or to an output unit of a previous layer
From a statistical point of view, networks perform nonlinear
regression: Given enough hidden units and enough training
samples, they can closely approximate any function
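As a rough illustration of the forward pass described above, the sketch below (plain Python/NumPy) pushes one input vector through a single hidden layer and an output layer. The sigmoid activation, layer sizes, and random weights are assumptions made for the example, not part of the slides.

# Minimal sketch of a forward pass through a feed-forward network
# (one hidden layer; sigmoid activation and layer sizes are illustrative).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

n_inputs, n_hidden, n_outputs = 3, 4, 2                   # assumed topology
W_h = rng.normal(scale=0.5, size=(n_hidden, n_inputs))    # input -> hidden weights
b_h = np.zeros(n_hidden)                                  # hidden biases
W_o = rng.normal(scale=0.5, size=(n_outputs, n_hidden))   # hidden -> output weights
b_o = np.zeros(n_outputs)                                 # output biases

x = np.array([0.2, 0.7, 0.1])         # one training tuple (attribute values)
hidden = sigmoid(W_h @ x + b_h)       # weighted inputs fed to the hidden layer
output = sigmoid(W_o @ hidden + b_o)  # weighted hidden outputs fed to the output layer
print(output)                         # the network's prediction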
Defining a Network Topology
Decide the network topology: Specify # of units in the
input layer, # of hidden layers (if > 1), # of units in each
hidden layer, and # of units in the output layer
Normalize the input values for each attribute measured in
the training tuples to [0.0, 1.0]
For a discrete-valued attribute, one input unit per domain value may be used, each initialized to 0
For classification with more than two classes, one output unit per class is used
If a trained network's accuracy is unacceptable, repeat the
training process with a different network topology or a
different set of initial weights
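A small illustration of the normalization step above: min-max scaling of one attribute to [0.0, 1.0]. The attribute values are made up, and a real implementation would guard against a zero range.

# Min-max normalization of an attribute's values to [0.0, 1.0]
# (illustrative values only).
import numpy as np

income = np.array([30_000.0, 49_000.0, 75_000.0, 120_000.0])
normalized = (income - income.min()) / (income.max() - income.min())
print(normalized)   # 30K -> 0.0, 120K -> 1.0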
Backpropagation
Iteratively process a set of training tuples & compare the network's
prediction with the actual known target value
For each training tuple, the weights are modified to minimize the
mean squared error between the network's prediction and the actual
target value
Modifications are made in the “backwards” direction: from the output
layer, through each hidden layer down to the first hidden layer, hence
“backpropagation”
Steps
Initialize weights to small random numbers, associated with biases
Propagate the inputs forward (by applying activation function)
Backpropagate the error (by updating weights and biases)
Terminating condition (when error is very small, etc.)
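The sketch below walks through the four steps listed above for a tiny one-hidden-layer network on a toy XOR-like dataset. The sigmoid activation, squared-error loss, learning rate, and stopping threshold are illustrative choices, not prescriptions from the slides.

# Minimal backpropagation sketch: sigmoid units, squared error,
# one hidden layer, plain gradient descent (all sizes illustrative).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # training tuples
y = np.array([[0], [1], [1], [0]], dtype=float)              # known target values

rng = np.random.default_rng(1)
W1 = rng.normal(scale=0.5, size=(2, 4))   # step 1: small random weights
b1 = np.zeros(4)                          # ... and associated biases
W2 = rng.normal(scale=0.5, size=(4, 1))
b2 = np.zeros(1)
lr = 0.5

for epoch in range(20_000):
    # Step 2: propagate the inputs forward (apply the activation function)
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Step 3: backpropagate the error (update weights and biases)
    err_out = (out - y) * out * (1 - out)
    err_h = (err_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ err_out
    b2 -= lr * err_out.sum(axis=0)
    W1 -= lr * X.T @ err_h
    b1 -= lr * err_h.sum(axis=0)

    # Step 4: terminating condition (stop when the error is very small)
    if np.mean((out - y) ** 2) < 1e-3:
        break

print(np.round(out, 2))   # predictions should approach the targets 0, 1, 1, 0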
Neuron: A Hidden/Output Layer Unit
[Figure: a single hidden/output layer unit. Inputs x0, x1, …, xn arrive over weights w0, w1, …, wn; the unit forms the weighted sum, applies the bias μk, and passes the result through an activation function f to produce the output y, for example y = sign(Σi wi xi − μk).]
Linear Classification
Binary classification problem, e.g. x1 = # of occurrences of the word “homepage”, x2 = # of occurrences of the word “welcome”
Mathematically, x ∈ X = ℝn, y ∈ Y = {+1, –1}; we want to derive a function f: X → Y
[Figure: a scatter of ‘x’ and ‘o’ points separated by a red line; data above the red line belongs to class ‘x’.]
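To make the neuron and the linear-classification example concrete, the sketch below classifies a page from the two word counts mentioned above using y = sign(w · x − μ). The particular weights and threshold are invented for illustration.

# A single perceptron-style unit: y = sign(sum_i w_i * x_i - mu)
# (the weights and threshold below are made-up illustration values).
import numpy as np

def neuron(x, w, mu):
    return 1 if np.dot(w, x) - mu > 0 else -1   # output in Y = {+1, -1}

w = np.array([1.0, 0.8])    # weights for x1 = count of "homepage", x2 = count of "welcome"
mu = 2.0                    # bias / threshold

print(neuron(np.array([3, 2]), w, mu))   # +1: above the decision line
print(neuron(np.array([0, 1]), w, mu))   # -1: below the decision line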
Criticism
Long training time
Difficult to understand the learned function (weights)
Not easy to incorporate domain knowledge
SVM—Support Vector Machines
A relatively new classification method for both linear and
nonlinear data
It uses a nonlinear mapping to transform the original
training data into a higher dimension
With the new dimension, it searches for the linear optimal
separating hyperplane (i.e., “decision boundary”)
With an appropriate nonlinear mapping to a sufficiently
high dimension, data from two classes can always be
separated by a hyperplane
SVM finds this hyperplane using support vectors
(“essential” training tuples) and margins (defined by the
support vectors)
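As a hedged illustration of these ideas, the snippet below fits an off-the-shelf SVM (scikit-learn's SVC, assumed to be installed) to a nonlinearly separable toy dataset and reports its support vectors; the dataset and parameters are arbitrary choices.

# Illustrative SVM fit on nonlinearly separable toy data
# (assumes scikit-learn is available; parameters are arbitrary).
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

clf = SVC(kernel="rbf", C=1.0)   # nonlinear mapping via the RBF kernel
clf.fit(X, y)

print("number of support vectors:", len(clf.support_vectors_))
print("training accuracy:", clf.score(X, y))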
SVM—History and Applications
Vapnik and colleagues (1992)—groundwork from Vapnik
& Chervonenkis’ statistical learning theory in 1960s
Features: training can be slow but accuracy is high owing
to their ability to model complex nonlinear decision
boundaries (margin maximization)
Used for: classification and numeric prediction
Applications:
handwritten digit recognition, object recognition,
speaker identification, benchmarking time-series
prediction tests
SVM—General Philosophy
SVM—Margins and Support Vectors
Let data D be (X1, y1), …, (X|D|, y|D|), where Xi is a training tuple and yi is its associated class label
There are infinitely many lines (hyperplanes) separating the two classes, but we want to
find the best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., maximum
marginal hyperplane (MMH)
SVM—Linearly Separable
A separating hyperplane can be written as
W · X + b = 0
where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
For 2-D it can be written as
w0 + w1 x1 + w2 x2 = 0
The hyperplanes defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1
Any training tuples that fall on hyperplanes H1 or H2 (i.e., the
sides defining the margin) are support vectors
This becomes a constrained (convex) quadratic optimization
problem: Quadratic objective function and linear constraints
Solved by Quadratic Programming (QP) using Lagrangian multipliers
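A small numeric check of the conditions above, assuming a hand-picked weight vector and bias for a toy 2-D dataset: it verifies yi (W · Xi + b) ≥ 1 for every tuple, flags the tuples lying on H1 or H2 (the support vectors), and computes the margin width 2/||W||. The data and the candidate (W, b) are illustrative only.

# Check the margin constraints y_i (W.X_i + b) >= 1 for a toy linearly
# separable dataset and a hand-picked hyperplane (illustrative values).
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [4.0, 1.0],     # class +1
              [0.0, 0.0], [-1.0, 1.0], [1.0, -1.0]])  # class -1
y = np.array([+1, +1, +1, -1, -1, -1])

W = np.array([0.5, 0.5])   # candidate weight vector
b = -1.0                   # candidate bias

scores = y * (X @ W + b)
print("constraints satisfied:", bool(np.all(scores >= 1 - 1e-9)))
print("support vectors (on H1/H2):", X[np.isclose(scores, 1.0)])
print("margin width 2/||W||:", 2 / np.linalg.norm(W))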
Why Is SVM Effective on High Dimensional Data?
SVM—Linearly Inseparable
SVM: Different Kernel functions
Instead of computing the dot product on the transformed
data, it is mathematically equivalent to apply a kernel function
K(Xi, Xj) to the original data, i.e., K(Xi, Xj) = Φ(Xi) · Φ(Xj)
Typical kernel functions:
Polynomial kernel of degree h: K(Xi, Xj) = (Xi · Xj + 1)^h
Gaussian radial basis function kernel: K(Xi, Xj) = exp(−||Xi − Xj||² / 2σ²)
Sigmoid kernel: K(Xi, Xj) = tanh(κ Xi · Xj − δ)
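The snippet below illustrates the equality K(Xi, Xj) = Φ(Xi) · Φ(Xj) for a degree-2 polynomial kernel, with the explicit map Φ(x) = (x1², √2·x1·x2, x2²) written out by hand, and also evaluates a Gaussian RBF kernel. The example vectors and σ are arbitrary.

# Kernel trick illustration: a degree-2 polynomial kernel equals the dot
# product of an explicit feature map Phi (example vectors are arbitrary).
import numpy as np

def poly2_kernel(xi, xj):
    return np.dot(xi, xj) ** 2                     # K(Xi, Xj) = (Xi . Xj)^2

def phi(x):                                        # explicit mapping to 3-D
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def rbf_kernel(xi, xj, sigma=1.0):                 # Gaussian RBF kernel
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2 * sigma ** 2))

a, b = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(poly2_kernel(a, b), np.dot(phi(a), phi(b)))  # same value, no explicit Phi needed
print(rbf_kernel(a, b))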
Scaling SVM by Hierarchical Micro-Clustering
CF-Tree: Hierarchical Micro-cluster
Read the data set once, construct a statistical summary of the data
(i.e., hierarchical clusters) given a limited amount of memory
Micro-clustering: a hierarchical indexing structure that
provides finer samples closer to the boundary and coarser
samples farther from the boundary
Selective Declustering: Ensure High Accuracy
CB-SVM Algorithm: Outline
Chapter 9. Classification: Advanced Methods
Scalability issue
It is computationally infeasible to generate all feature combinations and filter them with an information gain threshold
Empirical Results
[Figure: empirical results plotting information gain (InfoGain) and its upper bound (IG_UpperBnd) against pattern support.]
Feature Selection
Given a set of frequent patterns, both non-discriminative
and redundant patterns exist, which can cause overfitting
We want to single out the discriminative patterns and
remove redundant ones
The notion of Maximal Marginal Relevance (MMR) is
borrowed
A document has high marginal relevance if it is both
relevant to the query and contains minimal marginal
similarity to previously selected documents
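A rough sketch of an MMR-style selection over mined patterns, assuming each pattern already has a relevance score (e.g., information gain) and a pairwise similarity measure; the Jaccard similarity, the λ trade-off, and the scores below are placeholders, not the method of any specific paper.

# MMR-style selection sketch: repeatedly pick the pattern that is relevant
# (e.g., high information gain) yet least similar to patterns already chosen.
def jaccard(p, q):
    p, q = set(p), set(q)
    return len(p & q) / len(p | q)

def mmr_select(patterns, relevance, k, lam=0.7):
    selected = []
    candidates = list(patterns)
    while candidates and len(selected) < k:
        def mmr_score(p):
            redundancy = max((jaccard(p, s) for s in selected), default=0.0)
            return lam * relevance[p] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)   # discriminative yet non-redundant
        selected.append(best)
        candidates.remove(best)
    return selected

patterns = [("a",), ("a", "b"), ("c",), ("a", "b", "c")]
relevance = {("a",): 0.9, ("a", "b"): 0.85, ("c",): 0.6, ("a", "b", "c"): 0.8}
print(mmr_select(patterns, relevance, k=2))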
Experimental Results
Scalability Tests
DDPMine: Branch-and-Bound Search
[Figure: branch-and-bound search tree, where node a is a constant (a parent node) and node b is a variable (a descendant node), illustrating the association between information gain and frequency.]
DDPMine Efficiency: Runtime
[Figure: runtime comparison of pattern-based classification algorithms, including PatClass and Harmony.]
Chapter 9. Classification: Advanced Methods
Instance-based learning:
Store training examples and delay the processing
(“lazy evaluation”) until a new instance must be
classified
Typical approaches
k-nearest neighbor approach: instances represented as points in a Euclidean space
Locally weighted regression
Case-based reasoning: uses symbolic representations and knowledge-based inference
The k-Nearest Neighbor Algorithm
All instances correspond to points in the n-D space
The nearest neighbors are defined in terms of
Euclidean distance, dist(X1, X2)
The target function could be discrete- or real-valued
For discrete-valued targets, k-NN returns the most common
value among the k training examples nearest to xq
Voronoi diagram: the decision surface induced by 1-NN
for a typical set of training examples
[Figure: ‘+’ and ‘−’ training examples around a query point xq, and the Voronoi diagram (decision surface) induced by 1-NN.]
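A minimal sketch of the algorithm above for a discrete-valued target: Euclidean distance and a majority vote among the k nearest training tuples. The toy points and k = 3 are arbitrary.

# k-nearest neighbor sketch: Euclidean distance, majority vote among the
# k nearest training tuples (toy data and k = 3 are illustrative).
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, xq, k=3):
    dists = np.linalg.norm(X_train - xq, axis=1)            # dist(X1, X2) in n-D space
    nearest = np.argsort(dists)[:k]                          # indices of the k closest tuples
    return Counter(y_train[nearest]).most_common(1)[0][0]    # most common class value

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [0.9, 1.1]])
y_train = np.array(["-", "-", "+", "+", "-"])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0])))   # "-"
print(knn_predict(X_train, y_train, np.array([5.1, 5.0])))   # "+"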
Discussion on the k-NN Algorithm
Fuzzy Set Approaches
Fuzzy logic uses truth values between 0.0 and 1.0 to represent the
degree of membership (such as in a fuzzy membership graph)
Attribute values are converted to fuzzy values. Ex.:
Income, x, is assigned a fuzzy membership value to each of the
discrete categories {low, medium, high}, e.g. $49K belongs to
“medium income” with fuzzy value 0.15 but belongs to “high
income” with fuzzy value 0.96
Fuzzy membership values do not have to sum to 1.
Each applicable rule contributes a vote for membership in the
categories
Typically, the truth values for each predicted category are summed,
and these sums are combined
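The sketch below shows one common way to turn an income value into fuzzy membership degrees for {low, medium, high} using trapezoidal membership functions. The breakpoints are invented for illustration and are not meant to reproduce the 0.15/0.96 figures above; note that the resulting degrees need not sum to 1.

# Fuzzy membership sketch: convert an income value into degrees of
# membership in {low, medium, high}. Breakpoints are illustrative only.
def trapezoid(x, a, b, c, d):
    """Membership rises from a to b, stays 1 from b to c, and falls to 0 at d."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

def fuzzify_income(x):
    return {
        "low":    trapezoid(x, -1, 0, 20_000, 40_000),
        "medium": trapezoid(x, 20_000, 35_000, 55_000, 75_000),
        "high":   trapezoid(x, 40_000, 90_000, 10**9, 10**9 + 1),
    }

print(fuzzify_income(49_000))   # partial membership in more than one category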
Chapter 9. Classification: Advanced Methods
H(X, C2) = 3, H(X, C3) = 3, H(X, C4) = 1; thus C4 is chosen as the label for X
Use it to label the unlabeled data, and the tuples with the most confident
label predictions are added to the set of labeled data
Teach each other: The tuple having the most confident prediction
from f1 is added to the set of labeled data for f2, & vice versa
Other methods, e.g., joint probability distribution of features and labels
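Below is a rough sketch of the self-training idea described above: train on the labeled data, label the unlabeled pool, and move the most confident predictions into the labeled set. The base classifier (scikit-learn's LogisticRegression), the 0.95 confidence threshold, and the toy data are assumptions for illustration.

# Self-training sketch: train on labeled data, label the unlabeled pool,
# and move the most confident predictions into the labeled set.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_lab = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(4, 1, (10, 2))])
y_lab = np.array([0] * 10 + [1] * 10)
X_unl = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])

for _ in range(5):                                # a few self-training rounds
    clf = LogisticRegression().fit(X_lab, y_lab)
    if len(X_unl) == 0:
        break
    proba = clf.predict_proba(X_unl)
    confident = proba.max(axis=1) > 0.95          # most confident predictions
    if not confident.any():
        break
    X_lab = np.vstack([X_lab, X_unl[confident]])
    y_lab = np.concatenate([y_lab, proba.argmax(axis=1)[confident]])
    X_unl = X_unl[~confident]

print("labeled set grew to", len(X_lab), "tuples;", len(X_unl), "left unlabeled")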
Active Learning
Chapter 9. Classification: Advanced Methods
References
Surplus Slides
What Is Prediction?
(Numerical) prediction is similar to classification: first construct a model, then use the model to predict a continuous or ordered value for a given input
Major approach: regression (linear and non-linear)
Linear regression, y = w0 + w1 x, with coefficients estimated by the method of least squares:
w1 = Σi=1..|D| (xi − x̄)(yi − ȳ) / Σi=1..|D| (xi − x̄)²
w0 = ȳ − w1 x̄
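A direct transcription of the least-squares formulas above, on a small made-up dataset:

# Least-squares estimates for simple linear regression y = w0 + w1 * x
# (toy data; the formulas mirror the ones above).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.9])

x_bar, y_bar = x.mean(), y.mean()
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
w0 = y_bar - w1 * x_bar
print(w0, w1)   # intercept and slope of the fitted line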
Regression Trees and Model Trees
Regression tree: proposed in CART system (Breiman et al. 1984)
CART: Classification And Regression Trees
Each leaf stores a continuous-valued prediction
It is the average value of the predicted attribute for the training
tuples that reach the leaf
Model tree: proposed by Quinlan (1992)
Each leaf holds a regression model—a multivariate linear equation
for the predicted attribute
A more general case than regression tree
Regression and model trees tend to be more accurate than linear
regression when the data are not represented well by a simple linear
model
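As a hedged illustration of a regression tree (not a model tree), the snippet below fits scikit-learn's DecisionTreeRegressor, whose leaves store the average target value of the training tuples that reach them; the data and tree depth are arbitrary.

# Regression tree illustration: each leaf predicts the average target value
# of the training tuples that reach it (assumes scikit-learn; toy data).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)   # nonlinear target

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(tree.predict([[2.0], [8.0]]))   # piecewise-constant (leaf-average) predictions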
Predictive Modeling in Multidimensional Databases
Prediction: Categorical Data
SVM—Introductory Literature
“Statistical Learning Theory” by Vapnik: difficult to understand,
containing many errors.
C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern
Recognition. Knowledge Discovery and Data Mining, 2(2), 1998.
Easier than Vapnik’s book, but still not introductory level; the
examples are not so intuitive
The book An Introduction to Support Vector Machines by Cristianini
and Shawe-Taylor
Not introductory level, but the explanation of Mercer's
Theorem is better than in the references above
Neural Networks and Learning Machines by Haykin
Contains a nice chapter on SVM introduction
Associative Classification Can Achieve High
Accuracy and Efficiency (Cong et al. SIGMOD05)
A Closer Look at CMAR
CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01)
Efficiency: Uses an enhanced FP-tree that maintains the distribution of
class labels among tuples satisfying each frequent itemset
Rule pruning whenever a rule is inserted into the tree
Given two rules, R1 and R2, if the antecedent of R1 is more general than that of R2 and conf(R1) ≥ conf(R2), then R2 is pruned
Perceptron: update W additively
Winnow: update W multiplicatively