Machine Learning 04 - Bayes

The document discusses Bayesian classifiers based on decision theory, focusing on the computation of a-posteriori probabilities and the Bayes classification rule for both two and multiple classes. It explains the minimization of classification error and average risk through the use of loss matrices and discriminant functions. Additionally, it covers the application of Bayesian classifiers to normal distributions and decision hyperplanes, including examples of minimum distance classifiers and their calculations.


CLASSIFIERS BASED ON BAYES DECISION THEORY

• Statistical nature of the feature vectors:
$$x = [x_1, x_2, \ldots, x_\ell]^T$$

• Assign the pattern represented by the feature vector $x$ to the most probable of the available classes $\omega_1, \omega_2, \ldots, \omega_M$.

  That is, $x \rightarrow \omega_i$ such that $P(\omega_i \mid x)$ is maximum.
• Computation of a posteriori probabilities

• Assume known:
  - the a priori probabilities $P(\omega_1), P(\omega_2), \ldots, P(\omega_M)$
  - the class-conditional densities $p(x \mid \omega_i),\ i = 1, 2, \ldots, M$, also known as the likelihood of $x$ with respect to $\omega_i$.
• The Bayes rule (M = 2):
$$p(x)\,P(\omega_i \mid x) = p(x \mid \omega_i)\,P(\omega_i) \;\Rightarrow\; P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\,P(\omega_i)}{p(x)}$$
  where
$$p(x) = \sum_{i=1}^{2} p(x \mid \omega_i)\,P(\omega_i)$$
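As an illustration of the rule above, here is a minimal sketch (not from the original slides) that evaluates the posteriors $P(\omega_i \mid x)$ for two hypothetical one-dimensional Gaussian class likelihoods; the function name `posteriors` and the chosen means are assumptions made only for the example.

```python
import numpy as np
from scipy.stats import norm

def posteriors(x, likelihoods, priors):
    """Bayes rule: P(w_i | x) = p(x | w_i) P(w_i) / p(x)."""
    joint = np.array([lik(x) for lik in likelihoods]) * np.array(priors)
    return joint / joint.sum()   # divide by p(x) = sum_i p(x | w_i) P(w_i)

# Hypothetical 1-D example: two Gaussian likelihoods, equal priors.
likelihoods = [norm(loc=0.0, scale=1.0).pdf, norm(loc=1.0, scale=1.0).pdf]
print(posteriors(0.2, likelihoods, priors=[0.5, 0.5]))   # the two posteriors sum to 1
```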
• The Bayes classification rule (for two classes, M = 2):
  Given $x$, classify it according to the rule:
$$\text{If } P(\omega_1 \mid x) > P(\omega_2 \mid x) \Rightarrow x \in \omega_1$$
$$\text{If } P(\omega_2 \mid x) > P(\omega_1 \mid x) \Rightarrow x \in \omega_2$$

• Equivalently: classify $x$ according to the rule
$$p(x \mid \omega_1)\,P(\omega_1) \gtrless p(x \mid \omega_2)\,P(\omega_2)$$

• For equiprobable classes the test becomes
$$p(x \mid \omega_1) \gtrless p(x \mid \omega_2)$$
  [Figure: the two class densities and the decision regions $R_1$ (class $\omega_1$) and $R_2$ (class $\omega_2$), separated at the threshold $x_0$.]
• Equivalently, in words: divide the space into two regions:
$$\text{If } x \in R_1 \Rightarrow x \in \omega_1$$
$$\text{If } x \in R_2 \Rightarrow x \in \omega_2$$

• Probability of error for equiprobable classes (the total shaded area in the figure):
$$P_e = \frac{1}{2}\int_{-\infty}^{x_0} p(x \mid \omega_2)\,dx + \frac{1}{2}\int_{x_0}^{+\infty} p(x \mid \omega_1)\,dx$$

• The Bayesian classifier is OPTIMAL with respect to minimising the classification error probability.

• Indeed: moving the threshold away from $x_0$, the total shaded area INCREASES by the extra "grey" area.
• The Bayes classification rule for many (M > 2) classes:
  Given $x$, classify it to $\omega_i$ if:
$$P(\omega_i \mid x) > P(\omega_j \mid x) \quad \forall j \neq i$$

• Such a choice also minimizes the classification error probability.

• Minimizing the average risk:
  For each wrong decision, a penalty term is assigned, since some decisions are more sensitive (costly) than others.
• For M = 2:
  - Define the loss matrix
$$L = \begin{pmatrix} \lambda_{11} & \lambda_{12} \\ \lambda_{21} & \lambda_{22} \end{pmatrix}$$
  - $\lambda_{12}$ is the penalty term for deciding class $\omega_2$ although the pattern belongs to $\omega_1$, etc.

• Risk with respect to $\omega_1$:
$$r_1 = \lambda_{11}\int_{R_1} p(x \mid \omega_1)\,dx + \lambda_{12}\int_{R_2} p(x \mid \omega_1)\,dx$$
2
 Risk with respect to

r2 21 p ( x 2 )d x  22 p ( x 2 )d x
R1 R2


 Probabilities of wrong
decisions, weighted by the
penalty terms
 Average risk

r r1 P (1 )  r2 P ( 2 )
10
• Choose $R_1$ and $R_2$ so that $r$ is minimized.

• Then assign $x$ to $\omega_1$ if
$$\ell_1 \equiv \lambda_{11}\,p(x \mid \omega_1)P(\omega_1) + \lambda_{21}\,p(x \mid \omega_2)P(\omega_2) \;<\; \ell_2 \equiv \lambda_{12}\,p(x \mid \omega_1)P(\omega_1) + \lambda_{22}\,p(x \mid \omega_2)P(\omega_2)$$

• Equivalently: assign $x$ to $\omega_1$ ($\omega_2$) if
$$\ell_{12} \equiv \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \;>\;(<)\; \frac{P(\omega_2)}{P(\omega_1)}\cdot\frac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}}$$
  $\ell_{12}$ is the likelihood ratio.

• If $P(\omega_1) = P(\omega_2) = \frac{1}{2}$ and $\lambda_{11} = \lambda_{22} = 0$:
$$x \rightarrow \omega_1 \ \text{if} \ p(x \mid \omega_1) > \frac{\lambda_{21}}{\lambda_{12}}\,p(x \mid \omega_2)$$
$$x \rightarrow \omega_2 \ \text{if} \ p(x \mid \omega_2) > \frac{\lambda_{12}}{\lambda_{21}}\,p(x \mid \omega_1)$$
  If $\lambda_{21} = \lambda_{12}$, this gives the minimum classification error probability.
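The minimum-risk test above is straightforward to code. Below is a minimal sketch (not part of the original slides) of the two-class likelihood-ratio rule with a loss matrix; the function name and the sample numbers are illustrative assumptions.

```python
def min_risk_decision(px_w1, px_w2, P1, P2, L):
    """Two-class minimum-risk rule via the likelihood ratio test.
    L is the loss matrix [[l11, l12], [l21, l22]]."""
    l12 = px_w1 / px_w2                                        # likelihood ratio
    threshold = (P2 * (L[1][0] - L[1][1])) / (P1 * (L[0][1] - L[0][0]))
    return 1 if l12 > threshold else 2                         # decided class

# Illustrative call, using the loss matrix of the example that follows.
print(min_risk_decision(px_w1=0.3, px_w2=0.2, P1=0.5, P2=0.5,
                        L=[[0.0, 0.5], [1.0, 0.0]]))
# -> 2, since the ratio 1.5 is below the threshold 2
```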
• An example:
$$p(x \mid \omega_1) = \frac{1}{\sqrt{\pi}}\exp(-x^2), \qquad p(x \mid \omega_2) = \frac{1}{\sqrt{\pi}}\exp\!\big(-(x-1)^2\big)$$
$$P(\omega_1) = P(\omega_2) = \frac{1}{2}, \qquad L = \begin{pmatrix} 0 & 0.5 \\ 1.0 & 0 \end{pmatrix}$$

• Then the threshold $x_0$ for minimum $P_e$ is:
$$x_0:\ \exp(-x^2) = \exp\!\big(-(x-1)^2\big) \;\Rightarrow\; x_0 = \frac{1}{2}$$

• The threshold $\hat{x}_0$ for minimum $r$:
$$\hat{x}_0:\ \exp(-x^2) = 2\exp\!\big(-(x-1)^2\big) \;\Rightarrow\; \hat{x}_0 = \frac{1-\ln 2}{2} < \frac{1}{2}$$

• Thus $\hat{x}_0$ moves to the left of $x_0 = \frac{1}{2}$ (WHY?)
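A quick numerical check of the two thresholds in this example; this snippet is not from the slides, and the bracketing interval passed to `brentq` is an assumption.

```python
import numpy as np
from scipy.optimize import brentq

# Class-conditional densities of the example above.
p1 = lambda x: np.exp(-x**2) / np.sqrt(np.pi)
p2 = lambda x: np.exp(-(x - 1)**2) / np.sqrt(np.pi)

# Minimum-Pe threshold: p(x|w1) = p(x|w2).
x0 = brentq(lambda x: p1(x) - p2(x), -5.0, 5.0)
# Minimum-risk threshold: p(x|w1) = 2 p(x|w2), since the ratio-test threshold is 2.
x0_hat = brentq(lambda x: p1(x) - 2.0 * p2(x), -5.0, 5.0)

print(x0)                            # 0.5
print(x0_hat, (1 - np.log(2)) / 2)   # both ~0.1534, i.e. to the left of x0
```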
DISCRIMINANT FUNCTIONS AND DECISION SURFACES

• If $R_i, R_j$ are contiguous:
$$g(x) \equiv P(\omega_i \mid x) - P(\omega_j \mid x) = 0$$
$$R_i:\ P(\omega_i \mid x) > P(\omega_j \mid x) \quad (+)$$
$$R_j:\ P(\omega_j \mid x) > P(\omega_i \mid x) \quad (-)$$

• $g(x) = 0$ is the surface separating the regions. On one side it is positive (+), on the other negative (-). It is known as the decision surface.

• If $f(\cdot)$ is monotonic, the rule remains the same if we use:
$$x \rightarrow \omega_i \ \text{if:}\ f\big(P(\omega_i \mid x)\big) > f\big(P(\omega_j \mid x)\big) \quad \forall j \neq i$$

• $g_i(x) \equiv f\big(P(\omega_i \mid x)\big)$ is a discriminant function.

• In general, discriminant functions can be defined independently of the Bayesian rule. They lead to suboptimal solutions, yet, if chosen appropriately, they can be computationally more tractable.
BAYESIAN CLASSIFIER FOR NORMAL DISTRIBUTIONS

• Multivariate Gaussian pdf:
$$p(x \mid \omega_i) = \frac{1}{(2\pi)^{\ell/2}\,|\Sigma_i|^{1/2}}\exp\!\left(-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1}(x - \mu_i)\right)$$

• $\mu_i = E[x]$ is the mean vector of class $\omega_i$

• $\Sigma_i = E\!\left[(x - \mu_i)(x - \mu_i)^T\right]$ is called the covariance matrix of class $\omega_i$
 ln()is monotonic. Define:


g i ( x) ln( p ( x i ) P (i )) 
ln p ( x  i )  ln P (i )

1 T 1
g i ( x)  ( x   i )  i ( x   i )  ln P (i )  Ci
2
 1
Ci  ( ) ln 2  ( ) ln  i
 Example: 2 2

 2 0 
 i  
2
 0   19
1 1
 g i ( x)  2
1
2
(x  x ) 
2 ( i1 x1  i 2 x2 )
2 2
 2

1
 ( i21  i22 )  ln( Pi )  Ci
2 2

Thatg iis,
(x) is quadratic and the
surfaces g i ( x)  g j ( x) 0

quadrics, ellipsoids, parabolas,


hyperbolas,
pairs of lines.
For example:

20
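To make the discriminant concrete, here is a minimal sketch (not from the slides) of evaluating $g_i(x)$ for general Gaussian classes and picking the maximizing class; the function names are illustrative.

```python
import numpy as np

def g_i(x, mu_i, Sigma_i, prior_i):
    """g_i(x) = -1/2 (x - mu_i)^T Sigma_i^{-1} (x - mu_i) + ln P(w_i) + C_i."""
    d = x - mu_i
    ell = len(mu_i)
    _, logdet = np.linalg.slogdet(Sigma_i)
    C_i = -0.5 * ell * np.log(2 * np.pi) - 0.5 * logdet
    return -0.5 * d @ np.linalg.solve(Sigma_i, d) + np.log(prior_i) + C_i

def classify(x, mus, Sigmas, priors):
    """Assign x to the class with the largest discriminant value."""
    return int(np.argmax([g_i(x, m, S, P)
                          for m, S, P in zip(mus, Sigmas, priors)]))
```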
• Decision Hyperplanes

• The quadratic terms are $x^T \Sigma_i^{-1} x$. If ALL $\Sigma_i = \Sigma$ (the same), the quadratic terms are not of interest: they are not involved in the comparisons. Then, equivalently, we can write:
$$g_i(x) = w_i^T x + w_{i0}$$
$$w_i = \Sigma^{-1}\mu_i, \qquad w_{i0} = \ln P(\omega_i) - \frac{1}{2}\mu_i^T \Sigma^{-1}\mu_i$$
  The discriminant functions are LINEAR.

• Let, in addition, $\Sigma = \sigma^2 I$. Then:
$$g_i(x) = \frac{1}{\sigma^2}\mu_i^T x + w_{i0}$$
  - $g_{ij}(x) \equiv g_i(x) - g_j(x) = 0 \;\Rightarrow\; w^T(x - x_0) = 0$
  - $w = \mu_i - \mu_j$
  - $x_0 = \dfrac{1}{2}(\mu_i + \mu_j) - \sigma^2 \ln\!\left(\dfrac{P(\omega_i)}{P(\omega_j)}\right)\dfrac{\mu_i - \mu_j}{\lVert\mu_i - \mu_j\rVert^2}$
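A minimal sketch (not from the slides) of the equal-covariance linear discriminant and of the hyperplane parameters $w$, $x_0$ for the $\Sigma = \sigma^2 I$ case; the function names are illustrative.

```python
import numpy as np

def linear_discriminant(mu_i, Sigma, prior_i):
    """Equal covariances: g_i(x) = w_i^T x + w_i0."""
    w_i = np.linalg.solve(Sigma, mu_i)          # Sigma^{-1} mu_i
    w_i0 = np.log(prior_i) - 0.5 * mu_i @ w_i   # ln P(w_i) - 1/2 mu_i^T Sigma^{-1} mu_i
    return w_i, w_i0

def hyperplane_sigma2I(mu_i, mu_j, sigma2, P_i, P_j):
    """Decision hyperplane w^T (x - x0) = 0 when Sigma = sigma^2 I."""
    w = mu_i - mu_j
    x0 = 0.5 * (mu_i + mu_j) - sigma2 * np.log(P_i / P_j) * w / (w @ w)
    return w, x0
```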
• Non-diagonal $\Sigma$ ($\Sigma \neq \sigma^2 I$):
  - $g_{ij}(x) = w^T(x - x_0) = 0$
  - $w = \Sigma^{-1}(\mu_i - \mu_j)$
  - $x_0 = \dfrac{1}{2}(\mu_i + \mu_j) - \ln\!\left(\dfrac{P(\omega_i)}{P(\omega_j)}\right)\dfrac{\mu_i - \mu_j}{\lVert\mu_i - \mu_j\rVert_{\Sigma^{-1}}^2}$
  where $\lVert x \rVert_{\Sigma^{-1}} \equiv \big(x^T \Sigma^{-1} x\big)^{1/2}$

• The decision hyperplane is normal to $\Sigma^{-1}(\mu_i - \mu_j)$, not normal to $\mu_i - \mu_j$.
• Minimum Distance Classifiers

• For equiprobable classes, $P(\omega_i) = \dfrac{1}{M}$, so
$$g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma^{-1}(x - \mu_i)$$

• Euclidean distance: if $\Sigma = \sigma^2 I$, assign $x \rightarrow \omega_i$ with the smaller
$$d_E = \lVert x - \mu_i \rVert$$

• Mahalanobis distance: if $\Sigma \neq \sigma^2 I$, assign $x \rightarrow \omega_i$ with the smaller
$$d_m = \big((x - \mu_i)^T \Sigma^{-1}(x - \mu_i)\big)^{1/2}$$
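A minimal sketch (not from the slides) of the two minimum-distance rules; the helper names are illustrative.

```python
import numpy as np

def euclidean(x, mu):
    return np.linalg.norm(x - mu)

def mahalanobis(x, mu, Sigma):
    d = x - mu
    return np.sqrt(d @ np.linalg.solve(Sigma, d))

def min_distance_classify(x, mus, Sigma=None):
    """Assign x to the class with the nearest mean (Mahalanobis if Sigma is given)."""
    if Sigma is None:
        dists = [euclidean(x, m) for m in mus]
    else:
        dists = [mahalanobis(x, m, Sigma) for m in mus]
    return int(np.argmin(dists))
```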
• Example: Given $\omega_1, \omega_2$ with $P(\omega_1) = P(\omega_2)$ and
$$p(x \mid \omega_1) = N(\mu_1, \Sigma), \qquad p(x \mid \omega_2) = N(\mu_2, \Sigma),$$
$$\mu_1 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \mu_2 = \begin{pmatrix} 3 \\ 3 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} 1.1 & 0.3 \\ 0.3 & 1.9 \end{pmatrix},$$
  classify the vector $x = \begin{pmatrix} 1.0 \\ 2.2 \end{pmatrix}$ using Bayesian classification.

• $$\Sigma^{-1} = \begin{pmatrix} 0.95 & -0.15 \\ -0.15 & 0.55 \end{pmatrix}$$

• Compute the Mahalanobis distances from $\mu_1, \mu_2$:
$$d_{m,1}^2 = (1.0,\ 2.2)\,\Sigma^{-1}\begin{pmatrix} 1.0 \\ 2.2 \end{pmatrix} = 2.952, \qquad d_{m,2}^2 = (-2.0,\ -0.8)\,\Sigma^{-1}\begin{pmatrix} -2.0 \\ -0.8 \end{pmatrix} = 3.672$$

• Classify $x \rightarrow \omega_1$. Observe that $d_{E,2} < d_{E,1}$: the Euclidean distance alone would have favored $\omega_2$.
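The numbers of this example can be checked with a few lines of NumPy (not part of the slides):

```python
import numpy as np

mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
Sigma = np.array([[1.1, 0.3],
                  [0.3, 1.9]])
x = np.array([1.0, 2.2])

def mahalanobis_sq(x, mu, Sigma):
    d = x - mu
    return d @ np.linalg.solve(Sigma, d)

print(mahalanobis_sq(x, mu1, Sigma))                      # 2.952
print(mahalanobis_sq(x, mu2, Sigma))                      # 3.672
print(np.linalg.norm(x - mu1), np.linalg.norm(x - mu2))   # 2.417 > 2.154
```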
BAYESIAN NETWORKS

• Bayes probability chain rule:
$$p(x_1, x_2, \ldots, x_\ell) = p(x_\ell \mid x_{\ell-1}, \ldots, x_1)\,p(x_{\ell-1} \mid x_{\ell-2}, \ldots, x_1)\cdots p(x_2 \mid x_1)\,p(x_1)$$

• Assume now that the conditional dependence of each $x_i$ is limited to a subset of the features appearing in each of the product terms. That is:
$$p(x_1, x_2, \ldots, x_\ell) = p(x_1)\prod_{i=2}^{\ell} p(x_i \mid A_i)$$
  where
$$A_i \subseteq \{x_{i-1}, x_{i-2}, \ldots, x_1\}$$
• For example, if ℓ = 6, then we could assume
$$p(x_6 \mid x_5, \ldots, x_1) = p(x_6 \mid x_5, x_4)$$
  Then
$$A_6 = \{x_5, x_4\} \subset \{x_5, \ldots, x_1\}$$

• The above is a generalization of the Naïve Bayes model. For Naïve Bayes the assumption is:
$$A_i = \emptyset, \quad \text{for } i = 1, 2, \ldots, \ell$$
• A graphical way to portray conditional dependencies is given in the figure below (a DAG over $x_1, \ldots, x_6$).

• According to this figure we have that:
  - $x_6$ is conditionally dependent on $x_4$, $x_5$
  - $x_5$ on $x_4$
  - $x_4$ on $x_1$, $x_2$
  - $x_3$ on $x_2$
  - $x_1$, $x_2$ are conditionally independent of the other variables
• For this case:
$$p(x_1, x_2, \ldots, x_6) = p(x_6 \mid x_5, x_4)\,p(x_5 \mid x_4)\,p(x_4 \mid x_2, x_1)\,p(x_3 \mid x_2)\,p(x_2)\,p(x_1)$$
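As a concrete illustration of this factorization, here is a minimal sketch with made-up conditional probability tables for binary $x_1, \ldots, x_6$ (the numbers are assumptions, not values from the slides); the final check confirms the factored joint sums to 1.

```python
from itertools import product

# Made-up CPTs: each entry gives P(variable = 1 | parent values).
P_x1, P_x2 = 0.3, 0.6                                   # root marginals
P_x3_x2   = {0: 0.2, 1: 0.7}                            # P(x3=1 | x2)
P_x4_x1x2 = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.9}
P_x5_x4   = {0: 0.3, 1: 0.8}
P_x6_x4x5 = {(0, 0): 0.05, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.95}

def bern(p, v):                                         # P(variable = v) for Bernoulli(p)
    return p if v == 1 else 1.0 - p

def joint(x1, x2, x3, x4, x5, x6):
    """p(x1..x6) = p(x6|x5,x4) p(x5|x4) p(x4|x2,x1) p(x3|x2) p(x2) p(x1)."""
    return (bern(P_x1, x1) * bern(P_x2, x2)
            * bern(P_x3_x2[x2], x3)
            * bern(P_x4_x1x2[(x1, x2)], x4)
            * bern(P_x5_x4[x4], x5)
            * bern(P_x6_x4x5[(x4, x5)], x6))

print(sum(joint(*cfg) for cfg in product([0, 1], repeat=6)))  # -> 1.0
```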
• Bayesian Networks

• Definition: A Bayesian Network is a directed acyclic graph (DAG) where the nodes correspond to random variables. Each node is associated with a set of conditional probabilities (densities), $p(x_i \mid A_i)$, where $x_i$ is the variable associated with the node and $A_i$ is the set of its parents in the graph.

• A Bayesian Network is specified by:
  - The marginal probabilities of its root nodes.
  - The conditional probabilities of the non-root nodes, given their parents, for ALL possible combinations.
• The figure below is an example of a Bayesian Network corresponding to a paradigm from the medical applications field.

• This Bayesian network models conditional dependencies for an example concerning smokers (S), tendencies to develop cancer (C) and heart disease (H), together with variables corresponding to heart (H1, H2) and cancer (C1, C2) medical tests.
• Once a DAG has been constructed, the joint probability can be obtained by multiplying the marginal (root nodes) and the conditional (non-root nodes) probabilities.

• Training: Once a topology is given, the probabilities are estimated from the training data set. There are also methods that learn the topology.

• Probability Inference: This is the most common task that Bayesian networks help us solve efficiently. Given the values of some of the variables in the graph, known as evidence, the goal is to compute the conditional probabilities of some of the other variables, given the evidence.
• Example: Consider the Bayesian network of the figure:

  a) If x is measured to be x = 1 (x1), compute P(w = 0 | x = 1) [P(w0 | x1)].

  b) If w is measured to be w = 1 (w1), compute P(x = 0 | w = 1) [P(x0 | w1)].
• For a), a set of calculations is required that propagates from node x to node w. It turns out that P(w0 | x1) = 0.63.

• For b), the propagation is reversed in direction. It turns out that P(x0 | w1) = 0.4.

• In general, the required inference information is computed via a combined process of "message passing" among the nodes of the DAG.

• Complexity: for singly connected graphs, message-passing algorithms have a complexity that is linear in the number of nodes.
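Since the conditional probability tables of the network in the figure are not reproduced here, the following minimal sketch uses a hypothetical two-node chain x → w with made-up tables, just to illustrate the forward and backward (Bayes-rule) inference of the kind asked in a) and b).

```python
# Hypothetical two-node network: x -> w, binary variables.
P_x = {0: 0.4, 1: 0.6}                       # marginal of the root node x
P_w_given_x = {0: {0: 0.7, 1: 0.3},          # P(w | x=0)
               1: {0: 0.2, 1: 0.8}}          # P(w | x=1)

# a) Forward: P(w=0 | x=1) is read directly from the conditional table.
print(P_w_given_x[1][0])                     # 0.2

# b) Backward: P(x=0 | w=1) via the Bayes rule,
#    P(x=0 | w=1) = P(w=1 | x=0) P(x=0) / sum_x P(w=1 | x) P(x).
num = P_w_given_x[0][1] * P_x[0]
den = sum(P_w_given_x[x][1] * P_x[x] for x in (0, 1))
print(num / den)                             # 0.12 / 0.60 = 0.2
```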
NAÏVE BAYES

$$p(x_1, x_2, \ldots, x_\ell) = p(x_1)\,p(x_2)\cdots p(x_\ell)$$
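A minimal sketch of this independence assumption (names and numbers are illustrative): the joint is just the product of the per-feature marginals.

```python
import numpy as np

def naive_joint(marginal_pmfs, values):
    """p(x1,...,xl) = p(x1) p(x2) ... p(xl) under the Naive Bayes assumption A_i = {}."""
    return np.prod([pmf[v] for pmf, v in zip(marginal_pmfs, values)])

# Three binary features with made-up marginal tables P(x_i = v).
pmfs = [{0: 0.7, 1: 0.3}, {0: 0.4, 1: 0.6}, {0: 0.9, 1: 0.1}]
print(naive_joint(pmfs, (1, 0, 1)))   # 0.3 * 0.4 * 0.1 = 0.012
```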
