Machine Learning 04 - Bayes
DECISION THEORY
Statistical nature of feature vectors:
$x = [x_1, x_2, \ldots, x_\ell]^T$
Assign the pattern represented by the feature vector $x$ to the class $\omega_i$ for which the a-posteriori probability $P(\omega_i \mid x)$ is maximum.
Computation of a-posteriori probabilities
Assume known:
• the a-priori probabilities $P(\omega_1), P(\omega_2), \ldots, P(\omega_M)$
• the class-conditional densities $p(x \mid \omega_i),\ i = 1, 2, \ldots, M$
The Bayes rule (M=2)
$p(x)\,P(\omega_i \mid x) = p(x \mid \omega_i)\,P(\omega_i)$
$P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\,P(\omega_i)}{p(x)}$
where
$p(x) = \sum_{i=1}^{2} p(x \mid \omega_i)\,P(\omega_i)$
The Bayes classification rule (for two classes, M=2)
Given $x$, classify it according to the rule
If $P(\omega_1 \mid x) > P(\omega_2 \mid x)$, then $x \to \omega_1$
If $P(\omega_2 \mid x) > P(\omega_1 \mid x)$, then $x \to \omega_2$
Equivalently: classify $x$ according to the rule
$p(x \mid \omega_1)\,P(\omega_1) \gtrless p(x \mid \omega_2)\,P(\omega_2)$
For equiprobable classes the test becomes
$p(x \mid \omega_1) \gtrless p(x \mid \omega_2)$
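As a quick illustration of the rule above, the sketch below computes the posteriors via the Bayes rule and picks the most probable class. It is a minimal example assuming one-dimensional Gaussian class-conditional densities; the priors and Gaussian parameters are illustrative placeholders, not values from the slides.

```python
import numpy as np
from scipy.stats import norm

# Minimal sketch of the Bayes classification rule for M=2 classes.
# Priors and class-conditional densities below are illustrative placeholders.
priors = np.array([0.6, 0.4])                     # P(w1), P(w2)
likelihoods = [norm(loc=0.0, scale=1.0),          # p(x | w1)
               norm(loc=2.0, scale=1.0)]          # p(x | w2)

def posteriors(x):
    """Bayes rule: P(wi|x) = p(x|wi) P(wi) / p(x), with p(x) = sum_i p(x|wi) P(wi)."""
    joint = np.array([lik.pdf(x) * P for lik, P in zip(likelihoods, priors)])
    return joint / joint.sum()

def classify(x):
    """Assign x to the class with the larger a-posteriori probability."""
    return np.argmax(posteriors(x)) + 1           # 1 -> w1, 2 -> w2

print(posteriors(0.8))   # approximately [0.69, 0.31]
print(classify(0.8))     # 1
```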
The feature space is divided into the decision regions $R_1$ ($\to \omega_1$) and $R_2$ ($\to \omega_2$).
Equivalently, in words: divide the space into two regions:
If $x \in R_1$, $x$ is classified to $\omega_1$
If $x \in R_2$, $x$ is classified to $\omega_2$
For M=2
• Define the loss matrix
$L = \begin{pmatrix} \lambda_{11} & \lambda_{12} \\ \lambda_{21} & \lambda_{22} \end{pmatrix}$
• $\lambda_{12}$ is the penalty term for deciding class $\omega_2$, although the pattern belongs to $\omega_1$, etc.
Risk with respect to $\omega_1$:
$r_1 = \lambda_{11}\int_{R_1} p(x \mid \omega_1)\,dx + \lambda_{12}\int_{R_2} p(x \mid \omega_1)\,dx$
Risk with respect to $\omega_2$:
$r_2 = \lambda_{21}\int_{R_1} p(x \mid \omega_2)\,dx + \lambda_{22}\int_{R_2} p(x \mid \omega_2)\,dx$
These are probabilities of wrong decisions, weighted by the penalty terms.
Average risk:
$r = r_1 P(\omega_1) + r_2 P(\omega_2)$
Choose $R_1$ and $R_2$ so that $r$ is minimized.
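To make the average risk concrete, the sketch below evaluates $r$ for a one-dimensional problem in which $R_1 = (-\infty, t)$ and $R_2 = (t, \infty)$, and then searches numerically for the threshold $t$ that minimizes $r$. The densities, priors and loss values are illustrative placeholders, not values from the slides.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Average risk r = r1 P(w1) + r2 P(w2) for regions R1 = (-inf, t), R2 = (t, inf).
# Densities, priors and the loss matrix are illustrative placeholders.
p1, p2 = norm(0.0, 1.0).pdf, norm(2.0, 1.0).pdf   # p(x|w1), p(x|w2)
P1, P2 = 0.5, 0.5
lam = np.array([[0.0, 0.5],                        # [[l11, l12],
                [1.0, 0.0]])                       #  [l21, l22]]

def average_risk(t):
    r1 = lam[0, 0] * quad(p1, -np.inf, t)[0] + lam[0, 1] * quad(p1, t, np.inf)[0]
    r2 = lam[1, 0] * quad(p2, -np.inf, t)[0] + lam[1, 1] * quad(p2, t, np.inf)[0]
    return r1 * P1 + r2 * P2

best = minimize_scalar(average_risk, bounds=(-5.0, 5.0), method="bounded")
print(best.x, best.fun)   # minimizing threshold (~0.65 here) and the minimum risk
```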
If $P(\omega_1) = P(\omega_2) = \frac{1}{2}$ and $\lambda_{11} = \lambda_{22} = 0$:
$x \to \omega_1$ if $p(x \mid \omega_1) > \frac{\lambda_{21}}{\lambda_{12}}\,p(x \mid \omega_2)$
$x \to \omega_2$ if $p(x \mid \omega_2) > \frac{\lambda_{12}}{\lambda_{21}}\,p(x \mid \omega_1)$
If $\lambda_{21} = \lambda_{12}$, this is the minimum classification error probability rule.
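The rule above can be applied pointwise: decide $\omega_1$ whenever the expected loss of that decision is the smaller one. A minimal sketch, assuming $\lambda_{11} = \lambda_{22} = 0$; the densities and priors are supplied by the caller.

```python
# Minimum-risk decision for two classes, assuming lambda_11 = lambda_22 = 0.
# p1, p2 are callables returning p(x|w1), p(x|w2); P1, P2 are the priors;
# lam12, lam21 are the off-diagonal entries of the loss matrix.
def min_risk_decision(x, p1, p2, P1, P2, lam12, lam21):
    risk_if_w1 = lam21 * p2(x) * P2    # expected loss of deciding w1 at x
    risk_if_w2 = lam12 * p1(x) * P1    # expected loss of deciding w2 at x
    return 1 if risk_if_w1 < risk_if_w2 else 2
```

With $\lambda_{21} = \lambda_{12}$ this reduces to comparing $p(x \mid \omega_1)P(\omega_1)$ with $p(x \mid \omega_2)P(\omega_2)$, i.e. the minimum-error Bayes rule.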
An example:
$p(x \mid \omega_1) = \frac{1}{\sqrt{\pi}}\exp(-x^2)$
$p(x \mid \omega_2) = \frac{1}{\sqrt{\pi}}\exp(-(x-1)^2)$
$P(\omega_1) = P(\omega_2) = \frac{1}{2}$
$L = \begin{pmatrix} 0 & 0.5 \\ 1.0 & 0 \end{pmatrix}$
Then the threshold value is:
$x_0$ for minimum $P_e$:
$x_0:\ \exp(-x^2) = \exp(-(x-1)^2) \;\Rightarrow\; x_0 = \frac{1}{2}$
Threshold $\hat{x}_0$ for minimum $r$:
$\hat{x}_0:\ \exp(-x^2) = 2\exp(-(x-1)^2) \;\Rightarrow\; \hat{x}_0 = \frac{1 - \ln 2}{2} < \frac{1}{2}$
Thus $\hat{x}_0$ moves to the left of $x_0 = \frac{1}{2}$ (WHY?)
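The two thresholds of this example can be checked numerically; the sketch below uses the same densities and loss values as above (only the variable names are ours).

```python
import numpy as np

# Numerical check of the worked example: minimum-error and minimum-risk thresholds.
p1 = lambda x: np.exp(-x**2) / np.sqrt(np.pi)          # p(x|w1)
p2 = lambda x: np.exp(-(x - 1)**2) / np.sqrt(np.pi)    # p(x|w2)
lam12, lam21 = 0.5, 1.0                                # off-diagonal losses

x0 = 0.5                       # minimum-error threshold: p1(x0) = p2(x0)
x0_hat = (1 - np.log(2)) / 2   # minimum-risk threshold: p1 = (lam21/lam12) p2

print(np.isclose(p1(x0), p2(x0)))                            # True
print(np.isclose(p1(x0_hat), (lam21 / lam12) * p2(x0_hat)))  # True
print(x0_hat)                                                # ~0.153, left of 0.5
```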
DISCRIMINANT FUNCTIONS
DECISION SURFACES
If $R_i$, $R_j$ are contiguous: $g(x) \equiv P(\omega_i \mid x) - P(\omega_j \mid x) = 0$
$R_i:\ P(\omega_i \mid x) > P(\omega_j \mid x)$
$R_j:\ P(\omega_j \mid x) > P(\omega_i \mid x)$
The decision surface $g(x) = 0$ separates the two regions: $g(x)$ is positive on the $R_i$ side and negative on the $R_j$ side.
If $f(\cdot)$ is monotonically increasing, the rule remains the same if we use:
$x \to \omega_i$ if: $f(P(\omega_i \mid x)) > f(P(\omega_j \mid x))\ \ \forall j \neq i$
$g_i(x) \equiv f(P(\omega_i \mid x))$ is a discriminant function.
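A one-line check that a monotonically increasing $f$ (here the logarithm) does not change the decision; the posterior values below are illustrative.

```python
import numpy as np

# A monotonically increasing transform preserves the argmax over classes.
posteriors = np.array([0.2, 0.5, 0.3])          # illustrative P(wi|x) values
g = np.log(posteriors)                           # g_i(x) = ln P(wi|x)
assert np.argmax(posteriors) == np.argmax(g)     # same winning class
```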
For Gaussian (normal) class-conditional densities:
$p(x \mid \omega_i) = \frac{1}{(2\pi)^{\ell/2}\,|\Sigma_i|^{1/2}}\exp\left(-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1}(x - \mu_i)\right)$
$\mu_i = E[x]$: the mean vector of class $\omega_i$
$\Sigma_i = E[(x - \mu_i)(x - \mu_i)^T]$: called the covariance matrix
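For reference, a direct evaluation of this density; the mean and covariance below reuse the values of the numerical example later in these notes.

```python
import numpy as np

# Evaluate the l-dimensional Gaussian class-conditional density p(x|wi).
def gaussian_pdf(x, mu, Sigma):
    l = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (l / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

x = np.array([1.0, 2.2])
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.1, 0.3], [0.3, 1.9]])
print(gaussian_pdf(x, mu, Sigma))   # same value as scipy's multivariate_normal(mu, Sigma).pdf(x)
```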
$\ln(\cdot)$ is monotonic. Define:
$g_i(x) = \ln(p(x \mid \omega_i)\,P(\omega_i)) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$
$g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1}(x - \mu_i) + \ln P(\omega_i) + C_i$
$C_i = -\frac{\ell}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i|$
Example: $\Sigma_i = \begin{pmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{pmatrix}$
$g_i(x) = -\frac{1}{2\sigma^2}(x_1^2 + x_2^2) + \frac{1}{\sigma^2}(\mu_{i1}x_1 + \mu_{i2}x_2) - \frac{1}{2\sigma^2}(\mu_{i1}^2 + \mu_{i2}^2) + \ln P(\omega_i) + C_i$
That is, $g_i(x)$ is quadratic in $x$, and the decision surfaces are $g_i(x) - g_j(x) = 0$. Here the quadratic term $-\frac{1}{2\sigma^2}(x_1^2 + x_2^2)$ is the same for every class, so it cancels in the difference (see below).
Decision Hyperplanes
If all classes share the same covariance matrix, $\Sigma_i = \Sigma$, the quadratic terms $x^T \Sigma^{-1} x$ are common to all $g_i(x)$ and can be dropped. Then
$g_i(x) = w_i^T x + w_{i0}$, with
$w_i = \Sigma^{-1}\mu_i$
$w_{i0} = \ln P(\omega_i) - \frac{1}{2}\mu_i^T \Sigma^{-1}\mu_i$
The discriminant functions are LINEAR.
Let in addition:
• $\Sigma = \sigma^2 I$. Then
$g_i(x) = \frac{1}{\sigma^2}\mu_i^T x + w_{i0}$
• $g_{ij}(x) \equiv g_i(x) - g_j(x) = 0 \;\Rightarrow\; w^T(x - x_0) = 0$
• $w = \mu_i - \mu_j$,
• $x_0 = \frac{1}{2}(\mu_i + \mu_j) - \sigma^2 \ln\left(\frac{P(\omega_i)}{P(\omega_j)}\right)\frac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|^2}$
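Putting the linear case together: the sketch below builds $w_i$ and $w_{i0}$ for each class and classifies a point by the largest $g_i(x)$. It assumes a shared covariance matrix; the numeric values reuse those of the example further below, with equal priors.

```python
import numpy as np

# Linear discriminants g_i(x) = w_i^T x + w_i0 for Gaussian classes sharing Sigma.
def linear_discriminant(mu_i, Sigma, prior_i):
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ mu_i                                   # w_i = Sigma^-1 mu_i
    w0 = np.log(prior_i) - 0.5 * mu_i @ Sigma_inv @ mu_i   # w_i0
    return w, w0

mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
priors = [0.5, 0.5]
Sigma = np.array([[1.1, 0.3], [0.3, 1.9]])

x = np.array([1.0, 2.2])
params = [linear_discriminant(mu, Sigma, P) for mu, P in zip(mus, priors)]
g = [w @ x + w0 for w, w0 in params]
print(np.argmax(g) + 1)   # 1: x is assigned to w1
```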
Nondiagonal case: $\Sigma \neq \sigma^2 I$. Then
• $g_{ij}(x) = w^T(x - x_0) = 0$
• $w = \Sigma^{-1}(\mu_i - \mu_j)$
• $x_0 = \frac{1}{2}(\mu_i + \mu_j) - \ln\left(\frac{P(\omega_i)}{P(\omega_j)}\right)\frac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|_{\Sigma^{-1}}^2}$
where $\|x\|_{\Sigma^{-1}} \equiv (x^T \Sigma^{-1} x)^{1/2}$.
The decision hyperplane is normal to $\Sigma^{-1}(\mu_i - \mu_j)$, not to $\mu_i - \mu_j$.
Minimum Distance Classifiers
For equiprobable classes, $P(\omega_i) = \frac{1}{M}$, and a common covariance matrix $\Sigma$:
$g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma^{-1}(x - \mu_i)$
Euclidean distance: if $\Sigma = \sigma^2 I$, assign $x \to \omega_i$ with the smaller $d_E = \|x - \mu_i\|$
Mahalanobis distance: if $\Sigma \neq \sigma^2 I$, assign $x \to \omega_i$ with the smaller $d_m = \left((x - \mu_i)^T \Sigma^{-1}(x - \mu_i)\right)^{1/2}$
Example:
Given $\omega_1, \omega_2$ with $P(\omega_1) = P(\omega_2)$, $p(x \mid \omega_1) = N(\mu_1, \Sigma)$ and $p(x \mid \omega_2) = N(\mu_2, \Sigma)$, where
$\mu_1 = \begin{pmatrix} 0 \\ 0 \end{pmatrix},\quad \mu_2 = \begin{pmatrix} 3 \\ 3 \end{pmatrix},\quad \Sigma = \begin{pmatrix} 1.1 & 0.3 \\ 0.3 & 1.9 \end{pmatrix}$,
classify the vector $x = \begin{pmatrix} 1.0 \\ 2.2 \end{pmatrix}$ using Bayesian classification.
$\Sigma^{-1} = \begin{pmatrix} 0.95 & -0.15 \\ -0.15 & 0.55 \end{pmatrix}$
Compute the Mahalanobis distances $d_m$ from $\mu_1$ and $\mu_2$:
$d_{m,1}^2 = (1.0,\ 2.2)\,\Sigma^{-1}\begin{pmatrix} 1.0 \\ 2.2 \end{pmatrix} = 2.952$
$d_{m,2}^2 = (-2.0,\ -0.8)\,\Sigma^{-1}\begin{pmatrix} -2.0 \\ -0.8 \end{pmatrix} = 3.672$
Since $d_{m,1}^2 < d_{m,2}^2$, classify $x \to \omega_1$.
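A short check reproducing these numbers, using the same $x$, means and covariance as above.

```python
import numpy as np

# Reproducing the example: squared Mahalanobis distances of x from mu_1 and mu_2.
x = np.array([1.0, 2.2])
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
Sigma = np.array([[1.1, 0.3], [0.3, 1.9]])
Sigma_inv = np.linalg.inv(Sigma)                # [[0.95, -0.15], [-0.15, 0.55]]

d2_1 = (x - mu1) @ Sigma_inv @ (x - mu1)        # ~2.952
d2_2 = (x - mu2) @ Sigma_inv @ (x - mu2)        # ~3.672
print(d2_1, d2_2)                               # d2_1 < d2_2, so x -> w1
```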
For example, if $\ell = 6$, then we could assume:
$p(x_6 \mid x_5, \ldots, x_1) = p(x_6 \mid x_5, x_4)$
Then:
$A_6 = \{x_5, x_4\} \subset \{x_5, \ldots, x_1\}$
A graphical way to portray conditional dependencies is given below. According to this figure we have that:
• x6 is conditionally dependent on x4, x5
• x5 on x4
• x4 on x1, x2
• x3 on x2
• x1, x2 are conditionally independent of the other variables.
Once a DAG has been constructed, the joint probability can be obtained by multiplying the marginal (root nodes) and the conditional (non-root nodes) probabilities.
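A sketch of this factorization for the DAG of the earlier figure, where the joint is $p(x_1)p(x_2)p(x_3 \mid x_2)p(x_4 \mid x_1, x_2)p(x_5 \mid x_4)p(x_6 \mid x_4, x_5)$. The probability tables below are hypothetical placeholders for binary variables; only the factorization structure comes from the figure.

```python
# Joint probability from the DAG: product of root marginals and conditionals.
# All probability tables below are hypothetical placeholders (binary variables);
# dictionary keys list the child value first, followed by its parents' values.
P_x1 = {0: 0.4, 1: 0.6}
P_x2 = {0: 0.7, 1: 0.3}
P_x3_given_x2 = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}
P_x4_given_x12 = {(0, 0, 0): 0.5, (1, 0, 0): 0.5, (0, 0, 1): 0.3, (1, 0, 1): 0.7,
                  (0, 1, 0): 0.6, (1, 1, 0): 0.4, (0, 1, 1): 0.1, (1, 1, 1): 0.9}
P_x5_given_x4 = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.4, (1, 1): 0.6}
P_x6_given_x45 = {(0, 0, 0): 0.9, (1, 0, 0): 0.1, (0, 0, 1): 0.5, (1, 0, 1): 0.5,
                  (0, 1, 0): 0.3, (1, 1, 0): 0.7, (0, 1, 1): 0.2, (1, 1, 1): 0.8}

def joint(x1, x2, x3, x4, x5, x6):
    """p(x1,...,x6) = p(x1) p(x2) p(x3|x2) p(x4|x1,x2) p(x5|x4) p(x6|x4,x5)."""
    return (P_x1[x1] * P_x2[x2] * P_x3_given_x2[(x3, x2)] *
            P_x4_given_x12[(x4, x1, x2)] * P_x5_given_x4[(x5, x4)] *
            P_x6_given_x45[(x6, x4, x5)])

print(joint(1, 0, 0, 1, 1, 0))
```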
For a), a set of calculations is required that propagates from node x to node w. It turns out that P(w0|x1) = 0.63.
Complexity: for singly connected graphs, message-passing algorithms have a complexity that is linear in the number of nodes.
Naïve BAYESIAN
The features are treated as statistically independent:
$p(x_1, x_2, \ldots, x_\ell) = p(x_1)\,p(x_2)\cdots p(x_\ell)$
In the Naïve Bayes classifier this independence is applied to the class-conditional densities, i.e. $p(x \mid \omega_i) = \prod_{j=1}^{\ell} p(x_j \mid \omega_i)$.
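A minimal Naïve Bayes sketch along these lines: each class-conditional density is a product of one-dimensional densities, here taken to be Gaussian with illustrative per-feature parameters.

```python
import numpy as np
from scipy.stats import norm

# Naive Bayes: p(x|wi) = prod_j p(x_j|wi), so the score factorizes per feature.
# Priors and per-feature Gaussian parameters are illustrative placeholders.
priors = [0.5, 0.5]
means = [[0.0, 1.0, -1.0], [2.0, 0.0, 1.0]]   # means[i][j]: feature j, class i
stds  = [[1.0, 2.0,  1.0], [1.0, 1.0, 2.0]]   # stds[i][j]

def class_score(x, i):
    """ln P(wi) + sum_j ln p(x_j|wi) under the independence assumption."""
    return np.log(priors[i]) + sum(
        norm(means[i][j], stds[i][j]).logpdf(xj) for j, xj in enumerate(x))

def classify(x):
    return int(np.argmax([class_score(x, i) for i in range(len(priors))]))

print(classify([1.5, 0.2, 0.5]))   # index of the winning class (0 or 1)
```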