Lecture 3

CS 343: Artificial Intelligence
Probabilistic Reasoning and Naïve Bayes
Raymond J. Mooney
University of Texas at Austin

Need for Probabilistic Reasoning

• Most everyday reasoning is based on uncertain evidence and inferences.
• Classical logic, which only allows conclusions to be strictly true or strictly false, does not account for this uncertainty or the need to weigh and combine conflicting evidence.
• Straightforward application of probability theory is impractical since the large number of probability parameters required are rarely, if ever, available.
• Therefore, early expert systems employed fairly ad hoc methods for reasoning under uncertainty and for combining evidence.
• Recently, methods more rigorously founded in probability theory that attempt to decrease the number of conditional probabilities required have flourished.

Axioms of Probability Theory

• All probabilities between 0 and 1:
  0 ≤ P(A) ≤ 1
• True proposition has probability 1, false has probability 0:
  P(true) = 1    P(false) = 0
• The probability of disjunction is:
  P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
  (Venn diagram: overlapping regions A, A ∧ B, B)

Conditional Probability

• P(A | B) is the probability of A given B.
• Assumes that B is all and only information known.
• Defined by:
  P(A | B) = P(A ∧ B) / P(B)

Independence

• A and B are independent iff:
  P(A | B) = P(A)
  P(B | A) = P(B)
  (These two constraints are logically equivalent.)
• Therefore, if A and B are independent:
  P(A | B) = P(A ∧ B) / P(B) = P(A)
  P(A ∧ B) = P(A) P(B)

Classification (Categorization)

• Given:
  – A description of an instance, x ∈ X, where X is the instance language or instance space.
  – A fixed set of categories: C = {c1, c2, …, cn}
• Determine:
  – The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.
  – If c(x) is a binary function C = {0, 1} ({true, false}, {positive, negative}) then it is called a concept.

Learning for Categorization

• A training example is an instance x ∈ X, paired with its correct category c(x): <x, c(x)>, for an unknown categorization function, c.
• Given a set of training examples, D.
• Find a hypothesized categorization function, h(x), such that:
  ∀ <x, c(x)> ∈ D : h(x) = c(x)   (consistency)

Sample Category Learning Problem

• Instance language: <size, color, shape>
  – size ∈ {small, medium, large}
  – color ∈ {red, blue, green}
  – shape ∈ {square, circle, triangle}
• C = {positive, negative}
• D:

  Example  Size   Color  Shape     Category
  1        small  red    circle    positive
  2        large  red    circle    positive
  3        small  red    triangle  negative
  4        large  blue   circle    negative
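As a concrete illustration, the training set D and the consistency condition above can be written down directly. A minimal Python sketch (the hypothesis h is a hypothetical stand-in, not something defined on the slides):

  # Training set D from the sample problem: (<size, color, shape>, category) pairs.
  D = [
      (("small", "red", "circle"), "positive"),
      (("large", "red", "circle"), "positive"),
      (("small", "red", "triangle"), "negative"),
      (("large", "blue", "circle"), "negative"),
  ]

  def consistent(h, D):
      """Consistency: h(x) = c(x) for every <x, c(x)> in D."""
      return all(h(x) == c for x, c in D)

  # A hypothetical hypothesis: red circles are positive, everything else negative.
  h = lambda x: "positive" if (x[1] == "red" and x[2] == "circle") else "negative"
  print(consistent(h, D))  # True: h agrees with c(x) on all four training examples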

Joint Distribution

• The joint probability distribution for a set of random variables X1, …, Xn gives the probability of every combination of values: P(X1, …, Xn). This is an n-dimensional array with v^n values if all variables are discrete with v values, and all v^n values must sum to 1.

  positive:
         circle  square
  red    0.20    0.02
  blue   0.02    0.01

  negative:
         circle  square
  red    0.05    0.30
  blue   0.20    0.20

• The probability of all possible conjunctions (assignments of values to some subset of variables) can be calculated by summing the appropriate subset of values from the joint distribution.
  P(red ∧ circle) = 0.20 + 0.05 = 0.25
  P(red) = 0.20 + 0.02 + 0.05 + 0.30 = 0.57
• Therefore, all conditional probabilities can also be calculated.
  P(positive | red ∧ circle) = P(positive ∧ red ∧ circle) / P(red ∧ circle) = 0.20 / 0.25 = 0.80

Probabilistic Classification

• Let Y be the random variable for the class, which takes values {y1, y2, …, ym}.
• Let X be the random variable describing an instance consisting of a vector of values for n features <X1, X2, …, Xn>; let xk be a possible value for X and xij a possible value for Xi.
• For classification, we need to compute P(Y=yi | X=xk) for i = 1…m.
• However, given no other assumptions, this requires a table giving the probability of each category for each possible instance in the instance space, which is impossible to accurately estimate from a reasonably-sized training set.
  – Assuming Y and all Xi are binary, we need 2^n entries to specify P(Y=pos | X=xk) for each of the 2^n possible xk's, since P(Y=neg | X=xk) = 1 − P(Y=pos | X=xk).
  – Compared to 2^(n+1) − 1 entries for the joint distribution P(Y, X1, X2, …, Xn).
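A small Python sketch (illustrative variable names) that stores the joint distribution from the Joint Distribution slide above and recovers the marginal and conditional probabilities computed there:

  # Joint distribution P(Category, Color, Shape) from the positive/negative tables.
  joint = {
      ("positive", "red", "circle"): 0.20, ("positive", "red", "square"): 0.02,
      ("positive", "blue", "circle"): 0.02, ("positive", "blue", "square"): 0.01,
      ("negative", "red", "circle"): 0.05, ("negative", "red", "square"): 0.30,
      ("negative", "blue", "circle"): 0.20, ("negative", "blue", "square"): 0.20,
  }

  def prob(**given):
      """Sum the joint entries consistent with a partial assignment of values."""
      names = ("category", "color", "shape")
      return sum(p for values, p in joint.items()
                 if all(dict(zip(names, values))[k] == v for k, v in given.items()))

  print(round(prob(color="red", shape="circle"), 2))   # 0.25
  print(round(prob(color="red"), 2))                   # 0.57
  print(round(prob(category="positive", color="red", shape="circle")
              / prob(color="red", shape="circle"), 2)) # 0.8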

Bayes Theorem

  P(H | E) = P(E | H) P(H) / P(E)

Simple proof from the definition of conditional probability:

  P(H | E) = P(H ∧ E) / P(E)    (def. cond. prob.)
  P(E | H) = P(H ∧ E) / P(H)    (def. cond. prob.)
  P(H ∧ E) = P(E | H) P(H)
  QED: P(H | E) = P(E | H) P(H) / P(E)

Bayesian Categorization

• Determine the category of xk by determining for each yi:
  P(Y=yi | X=xk) = P(Y=yi) P(X=xk | Y=yi) / P(X=xk)
• P(X=xk) can be determined since categories are complete and disjoint:
  Σ_{i=1..m} P(Y=yi | X=xk) = Σ_{i=1..m} P(Y=yi) P(X=xk | Y=yi) / P(X=xk) = 1
  P(X=xk) = Σ_{i=1..m} P(Y=yi) P(X=xk | Y=yi)
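As a quick numeric check in Python: the conditional P(positive | red ∧ circle) = 0.80 computed directly on the Joint Distribution slide can also be obtained via Bayes' rule with H = positive and E = red ∧ circle:

  # Numbers taken from the joint-distribution tables above.
  p_pos = 0.20 + 0.02 + 0.02 + 0.01          # P(H) = P(positive) = 0.25
  p_red_circle = 0.20 + 0.05                 # P(E) = P(red ∧ circle) = 0.25
  p_e_given_h = 0.20 / p_pos                 # P(E | H) = P(red ∧ circle | positive) = 0.80

  print(round(p_e_given_h * p_pos / p_red_circle, 2))  # 0.8, matching the direct calculation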

Bayesian Categorization (cont.)

• Need to know:
  – Priors: P(Y=yi)
  – Conditionals: P(X=xk | Y=yi)
• P(Y=yi) are easily estimated from data.
  – If ni of the examples in D are in yi, then P(Y=yi) = ni / |D|.
• Too many possible instances (e.g. 2^n for binary features) to estimate all P(X=xk | Y=yi).
• Still need to make some sort of independence assumptions about the features to make learning tractable.

Generative Probabilistic Models

• Assume a simple (usually unrealistic) probabilistic method by which the data was generated.
• For categorization, each category has a different parameterized generative model that characterizes that category.
• Training: Use the data for each category to estimate the parameters of the generative model for that category.
  – Maximum Likelihood Estimation (MLE): Set parameters to maximize the probability that the model produced the given training data.
  – If Mλ denotes a model with parameter values λ and Dk is the training data for the kth class, find the model parameters for class k (λk) that maximize the likelihood of Dk:
    λk = argmax_λ P(Dk | Mλ)
• Testing: Use Bayesian analysis to determine the category model that most likely generated a specific test instance.
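A tiny illustration of the MLE idea in Python, using a hypothetical Bernoulli (coin-flip) generative model for one class; the data Dk here is made up for the example:

  # Hypothetical model M_lambda: each observation in Dk is Bernoulli(lambda).
  Dk = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]   # assumed training data for the k-th class

  def likelihood(lam, data):
      """P(data | M_lambda) for independent Bernoulli(lambda) observations."""
      p = 1.0
      for x in data:
          p *= lam if x == 1 else (1.0 - lam)
      return p

  # Grid search for lambda_k = argmax_lambda P(Dk | M_lambda).
  lam_k = max((i / 100 for i in range(1, 100)), key=lambda lam: likelihood(lam, Dk))
  print(lam_k, sum(Dk) / len(Dk))       # 0.7 0.7 -- the MLE equals the observed frequency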

Naïve Bayes Generative Model

(Figure: a Category node is drawn from {pos, neg}; each category has its own bags of Size, Color, and Shape values (lg/med/sm, red/blue/grn, circ/sqr/tri), and an instance is generated by drawing one value from each bag for the chosen category.)

Naïve Bayes Inference Problem

(Figure: the same generative model with the category unknown (??); given an observed instance such as <lg, red, circ>, infer which category most likely generated it.)

Naïve Bayesian Categorization

• If we assume the features of an instance are independent given the category (conditionally independent):
  P(X | Y) = P(X1, X2, …, Xn | Y) = Π_{i=1..n} P(Xi | Y)
• Therefore, we then only need to know P(Xi | Y) for each possible pair of a feature value and a category.
• If Y and all Xi are binary, this requires specifying only 2n parameters:
  – P(Xi=true | Y=true) and P(Xi=true | Y=false) for each Xi
  – P(Xi=false | Y) = 1 − P(Xi=true | Y)
• Compared to specifying 2^n parameters without any independence assumptions.

Naïve Bayes Categorization Example

  Probability       positive  negative
  P(Y)              0.5       0.5
  P(small | Y)      0.4       0.4
  P(medium | Y)     0.1       0.2
  P(large | Y)      0.5       0.4
  P(red | Y)        0.9       0.3
  P(blue | Y)       0.05      0.3
  P(green | Y)      0.05      0.4
  P(square | Y)     0.05      0.4
  P(triangle | Y)   0.05      0.3
  P(circle | Y)     0.9       0.3

  Test Instance: <medium, red, circle>

Naïve Bayes Categorization Example (cont.)

  Probability      positive  negative
  P(Y)             0.5       0.5
  P(medium | Y)    0.1       0.2
  P(red | Y)       0.9       0.3
  P(circle | Y)    0.9       0.3

  Test Instance: <medium, red, circle>

  P(positive | X) = P(positive) * P(medium | positive) * P(red | positive) * P(circle | positive) / P(X)
                  = 0.5 * 0.1 * 0.9 * 0.9 / P(X) = 0.0405 / P(X) = 0.0405 / 0.0495 = 0.8181
  P(negative | X) = P(negative) * P(medium | negative) * P(red | negative) * P(circle | negative) / P(X)
                  = 0.5 * 0.2 * 0.3 * 0.3 / P(X) = 0.009 / P(X) = 0.009 / 0.0495 = 0.1818

  P(positive | X) + P(negative | X) = 0.0405 / P(X) + 0.009 / P(X) = 1
  P(X) = (0.0405 + 0.009) = 0.0495

Naïve Bayes Diagnosis Example

• C = {allergy, cold, well}
• e1 = sneeze; e2 = cough; e3 = fever
• E = {sneeze, cough, ¬fever}

  Prob          Well  Cold  Allergy
  P(ci)         0.9   0.05  0.05
  P(sneeze|ci)  0.1   0.9   0.9
  P(cough|ci)   0.1   0.8   0.7
  P(fever|ci)   0.01  0.7   0.4
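The worked categorization example above (P(positive | X) ≈ 0.818) can be checked with a few lines of Python, using the parameter values from the table:

  import math

  prior = {"positive": 0.5, "negative": 0.5}
  cond = {                                   # P(value | Y) for the test instance's values
      "positive": {"medium": 0.1, "red": 0.9, "circle": 0.9},
      "negative": {"medium": 0.2, "red": 0.3, "circle": 0.3},
  }
  x = ["medium", "red", "circle"]            # test instance

  # Unnormalized scores P(Y) * prod_i P(x_i | Y); their sum is P(X).
  score = {y: prior[y] * math.prod(cond[y][v] for v in x) for y in prior}
  p_x = sum(score.values())                  # 0.0405 + 0.009 = 0.0495
  for y in score:
      print(y, round(score[y] / p_x, 4))     # positive 0.8182, negative 0.1818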

Naïve Bayes Diagnosis Example (cont.)

  Probability      Well  Cold  Allergy
  P(ci)            0.9   0.05  0.05
  P(sneeze | ci)   0.1   0.9   0.9
  P(cough | ci)    0.1   0.8   0.7
  P(fever | ci)    0.01  0.7   0.4

  E = {sneeze, cough, ¬fever}

  P(well | E)    = (0.9)(0.1)(0.1)(0.99) / P(E) = 0.0089 / P(E)
  P(cold | E)    = (0.05)(0.9)(0.8)(0.3) / P(E) = 0.01 / P(E)
  P(allergy | E) = (0.05)(0.9)(0.7)(0.6) / P(E) = 0.019 / P(E)

  Most probable category: allergy
  P(E) = 0.0089 + 0.01 + 0.019 = 0.0379
  P(well | E) = 0.23
  P(cold | E) = 0.26
  P(allergy | E) = 0.50

Estimating Probabilities

• Normally, probabilities are estimated based on observed frequencies in the training data.
• If D contains nk examples in category yk, and nijk of these nk examples have the jth value for feature Xi, xij, then:
  P(Xi=xij | Y=yk) = nijk / nk
• However, estimating such probabilities from small training sets is error-prone.
• If, due only to chance, a rare feature Xi is always false in the training data, then ∀yk: P(Xi=true | Y=yk) = 0.
• If Xi=true then occurs in a test example X, the result is that ∀yk: P(X | Y=yk) = 0 and hence ∀yk: P(Y=yk | X) = 0.
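A minimal Python sketch of the frequency estimate P(Xi=xij | Y=yk) = nijk / nk, applied to the four-example training set from earlier (the data layout is assumed for illustration):

  from collections import Counter, defaultdict

  D = [("small", "red", "circle", "positive"),
       ("large", "red", "circle", "positive"),
       ("small", "red", "triangle", "negative"),
       ("large", "blue", "circle", "negative")]

  n_k = Counter(cat for *_, cat in D)            # n_k: examples per category
  n_ijk = defaultdict(Counter)                   # n_ijk: value counts per (feature, category)
  for *features, cat in D:
      for i, value in enumerate(features):
          n_ijk[(i, cat)][value] += 1

  def p(value, i, category):
      """Unsmoothed estimate P(X_i = value | Y = category) = n_ijk / n_k."""
      return n_ijk[(i, category)][value] / n_k[category]

  print(p("red", 1, "positive"))     # 1.0
  print(p("medium", 0, "positive"))  # 0.0 -- the zero-count problem described above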

Probability Estimation Example

  Ex  Size   Color  Shape     Category
  1   small  red    circle    positive
  2   large  red    circle    positive
  3   small  red    triangle  negative
  4   large  blue   circle    negative

  Probability       positive  negative
  P(Y)              0.5       0.5
  P(small | Y)      0.5       0.5
  P(medium | Y)     0.0       0.0
  P(large | Y)      0.5       0.5
  P(red | Y)        1.0       0.5
  P(blue | Y)       0.0       0.5
  P(green | Y)      0.0       0.0
  P(square | Y)     0.0       0.0
  P(triangle | Y)   0.0       0.5
  P(circle | Y)     1.0       0.5

  Test Instance X: <medium, red, circle>
  P(positive | X) = 0.5 * 0.0 * 1.0 * 1.0 / P(X) = 0
  P(negative | X) = 0.5 * 0.0 * 0.5 * 0.5 / P(X) = 0

Smoothing

• To account for estimation from small samples, probability estimates are adjusted or smoothed.
• Laplace smoothing using an m-estimate assumes that each feature is given a prior probability, p, that is assumed to have been previously observed in a "virtual" sample of size m:
  P(Xi=xij | Y=yk) = (nijk + m p) / (nk + m)
• For binary features, p is simply assumed to be 0.5.

Laplace Smoothing Example

• Assume the training set contains 10 positive examples:
  – 4: small
  – 0: medium
  – 6: large
• Estimate parameters as follows (if m = 1, p = 1/3):
  – P(small | positive) = (4 + 1/3) / (10 + 1) = 0.394
  – P(medium | positive) = (0 + 1/3) / (10 + 1) = 0.03
  – P(large | positive) = (6 + 1/3) / (10 + 1) = 0.576
  – P(small or medium or large | positive) = 1.0

Text Categorization Applications

• Web pages
  – Recommending
  – Yahoo-like classification
• Newsgroup/blog messages
  – Recommending
  – Spam filtering
  – Sentiment analysis for marketing
• News articles
  – Personalized newspaper
• Email messages
  – Routing
  – Prioritizing
  – Folderizing
  – Spam filtering
  – Advertising on Gmail
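Returning to the Laplace smoothing example: a quick Python sketch of the m-estimate from the Smoothing slide reproduces the numbers above:

  def m_estimate(n_ijk, n_k, p, m):
      """Smoothed estimate (n_ijk + m*p) / (n_k + m)."""
      return (n_ijk + m * p) / (n_k + m)

  counts = {"small": 4, "medium": 0, "large": 6}     # 10 positive examples
  smoothed = {v: m_estimate(c, 10, p=1/3, m=1) for v, c in counts.items()}
  print({v: round(x, 3) for v, x in smoothed.items()})
  # {'small': 0.394, 'medium': 0.03, 'large': 0.576}
  print(round(sum(smoothed.values()), 3))            # 1.0 -- the estimates still sum to one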

Text Categorization Methods

• The most common representation of a document is a "bag of words," i.e. the set of words with their frequencies; word order is ignored.
• Gives a high-dimensional vector representation (one feature for each word).
• Vectors are sparse since most words are rare.
  – Zipf's law and heavy-tailed distributions

Naïve Bayes for Text

• Modeled as generating a bag of words for a document in a given category by repeatedly sampling with replacement from a vocabulary V = {w1, w2, …, wm} based on the probabilities P(wj | ci).
• Smooth probability estimates with Laplace m-estimates assuming a uniform distribution over all words (p = 1/|V|) and m = |V|.
  – Equivalent to a virtual sample of seeing each word in each category exactly once.
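A minimal bag-of-words sketch in Python (the sample sentence is illustrative):

  from collections import Counter

  # Keep word frequencies, ignore word order.
  doc = "the spam filter flagged the spam message"
  bag = Counter(doc.lower().split())
  print(bag)   # Counter({'the': 2, 'spam': 2, 'filter': 1, 'flagged': 1, 'message': 1})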

Naïve Bayes Generative Model for Text

(Figure: a Category node chooses spam or legit; each category has its own bag of words, e.g. the spam bag is dominated by words like "Viagra", "lottery", "win", "Nigeria", "hot", "deal", "$", "!", and the legit bag by words like "science", "PM", "computer", "Friday", "test", "homework", "March", "score", "exam"; a document is generated by repeatedly drawing words from the chosen category's bag.)

Naïve Bayes Text Classification

(Figure: the same model run in reverse; given the document "Win lottery $ !", infer whether the spam or the legit model (??) most likely generated it.)

Text Naïve Bayes Algorithm (Train)

  Let V be the vocabulary of all words in the documents in D
  For each category ci ∈ C
      Let Di be the subset of documents in D in category ci
      P(ci) = |Di| / |D|
      Let Ti be the concatenation of all the documents in Di
      Let ni be the total number of word occurrences in Ti
      For each word wj ∈ V
          Let nij be the number of occurrences of wj in Ti
          Let P(wj | ci) = (nij + 1) / (ni + |V|)

Text Naïve Bayes Algorithm (Test)

  Given a test document X
  Let n be the number of word occurrences in X
  Return the category:
      argmax_{ci ∈ C} P(ci) Π_{i=1..n} P(ai | ci)
  where ai is the word occurring in the ith position in X
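A direct transcription of the Train and Test pseudocode into Python (the data structures and the tiny corpus are illustrative, not from the slides; words absent from the vocabulary are simply skipped at test time):

  import math
  from collections import Counter

  def train(D):
      """D: list of (list_of_words, category). Returns P(ci) and P(wj | ci)."""
      V = {w for words, _ in D for w in words}                    # vocabulary
      prior, cond = {}, {}
      for ci in {c for _, c in D}:
          Di = [words for words, c in D if c == ci]               # documents in ci
          prior[ci] = len(Di) / len(D)                            # P(ci) = |Di| / |D|
          Ti = [w for words in Di for w in words]                 # concatenation of Di
          counts, ni = Counter(Ti), len(Ti)
          cond[ci] = {wj: (counts[wj] + 1) / (ni + len(V)) for wj in V}  # Laplace smoothing
      return prior, cond, V

  def classify(words, prior, cond, V):
      """argmax_ci P(ci) * prod_i P(ai | ci) over the words ai of X that appear in V."""
      return max(prior, key=lambda ci:
                 prior[ci] * math.prod(cond[ci][a] for a in words if a in V))

  D = [("win lottery now".split(), "spam"),
       ("exam score posted".split(), "legit")]
  prior, cond, V = train(D)
  print(classify("win lottery".split(), prior, cond, V))          # spam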

Underflow Prevention

• Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
• Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
• Class with highest final un-normalized log probability score is still the most probable.

Comments on Naïve Bayes

• Makes probabilistic inference tractable by making a strong assumption of conditional independence.
• Tends to work fairly well despite this strong assumption.
• Experiments show it to be quite competitive with other classification methods on standard datasets.
• Particularly popular for text categorization, e.g. spam filtering.
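A small variant of the scoring step using summed log probabilities, as the Underflow Prevention slide suggests (reusing the prior and cond tables from the training sketch above):

  import math

  def log_score(ci, words, prior, cond):
      """log P(ci) + sum_i log P(ai | ci); the argmax over classes is unchanged."""
      return math.log(prior[ci]) + sum(math.log(cond[ci][a]) for a in words if a in cond[ci])

  # With many word factors the raw product underflows to 0.0, but the log score does not.
  p = 1e-5
  print(p ** 100)              # 0.0 (floating-point underflow)
  print(100 * math.log(p))     # -1151.29..., still perfectly usable for comparison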
