UNCERTAINTY
1. Propositions:
▪ Degrees of belief are always applied to propositions.
▪ Probability theory typically uses a language that is slightly more expressive than propositional logic.
(This section describes that language.)
2. Atomic events
• An atomic event is a complete specification of the state of the world about which the agent is uncertain.
• It can be thought of as an assignment of particular values to all the variables of which the world is composed.
Ex: If my world consists of only the Boolean variables Cavity and Toothache,
then there are just four distinct atomic events;
the proposition Cavity = false ∧ Toothache = true is one such event.
Atomic events have two important properties:
(1) They are mutually exclusive: at most one can actually be the case.
Ex: cavity ∧ toothache and cavity ∧ ¬toothache cannot both be the case.
(2) The set of all possible atomic events is exhaustive: at least one must be the case.
-- That is, the disjunction of all atomic events is logically equivalent to true.
❑ Preferences, as expressed by utilities, are combined with probabilities in the general theory of rational decisions
called decision theory:
Decision theory = Probability theory + Utility theory
The fundamental idea of decision theory is that an agent is rational if and only if it chooses the action that yields the
highest expected utility, averaged over all the possible outcomes of the action.
Figure 13.1 sketches the structure of an agent that uses decision theory to select actions.
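• As a small Python sketch of this principle (the actions, outcome probabilities, and utilities below are made up for illustration, not taken from the text):

```python
# Minimal sketch of decision-theoretic action selection.
# Actions, outcome probabilities, and utilities are illustrative only.

def expected_utility(outcomes):
    """Expected utility = sum over outcomes of P(outcome) * U(outcome)."""
    return sum(p * u for p, u in outcomes)

# Each action maps to a list of (probability, utility) pairs over its possible outcomes.
actions = {
    "go_to_dentist": [(0.7, 10), (0.3, -20)],   # probably fixes the cavity, may hurt
    "wait_and_see":  [(0.4, 0),  (0.6, -40)],   # cavity may get worse
}

# A rational agent picks the action with the highest expected utility.
best_action = max(actions, key=lambda a: expected_utility(actions[a]))
print(best_action, expected_utility(actions[best_action]))  # go_to_dentist 1.0
```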
• A joint probability distribution that covers the complete set of random variables is called the full joint probability
distribution.
Ex: if the world consists of just the variables Cavity, Toothache, and Weather,
then the full joint distribution is given by P(Cavity, Toothache, Weather).
⮚ This joint distribution can be represented as a 2 × 2 × 4 table with 16 entries.
⮚ So, any probabilistic query can be answered from the full joint distribution.
• For continuous variables, it is not possible to write out the entire distribution as a table, because there are
infinitely many values.
• Instead, one usually defines the probability that a random variable takes on some value x as a parameterized
function of x.
Ex: Let the random variable X denote tomorrow's maximum temperature in Berkeley.
The sentence P(X = x) = U[18, 26](x) expresses the belief that X is distributed uniformly between 18 and 26 degrees Celsius.
• Probability distributions for continuous variables are called probability density functions.
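• A minimal sketch of such a density, using the U[18, 26] example above (the function name uniform_density is ours):

```python
# Density of the uniform distribution U[18, 26]: constant 1/(26-18) inside the
# interval and 0 outside, so it integrates to 1.
def uniform_density(x, low=18.0, high=26.0):
    return 1.0 / (high - low) if low <= x <= high else 0.0

print(uniform_density(20.5))  # 0.125
print(uniform_density(30.0))  # 0.0
```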
❑ As an example, consider applying the product rule to each case where the propositions a and b assert particular
values of X and Y respectively.
❑ We obtain the following equations:
P(X = x1 ∧ Y = y1) = P(X = x1 | Y = y1) P(Y = y1)
P(X = x1 ∧ Y = y2) = P(X = x1 | Y = y2) P(Y = y2)
...
These can be combined into the single equation P(X, Y) = P(X | Y) P(Y).
❖ Another useful fact, which follows from the axioms of probability and can be extended from the Boolean case to the general
discrete case, is P(a) + P(¬a) = 1.
Let the discrete variable D have the domain <d1, ..., dn>; then Σ_i P(D = di) = 1.
We now turn to probabilistic inference -- that is, the computation from observed evidence of posterior probabilities for query propositions.
We will use the full joint distribution as the "knowledge base" from which answers to all questions may be derived.
To compute the probability of any proposition a, identify those atomic events in which the proposition is true and add up their probabilities:
P(a) = Σ P(ei), where the sum is over the atomic events ei in which a holds.
Ex: There are six atomic events in which cavity ∨ toothache holds (i.e., in which cavity ∨ toothache is true):
P(cavity ∨ toothache) = 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28
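⮚ The same computation can be sketched in Python. The eight entries below are the Figure 13.3 values; the two not quoted in the text (0.144 and 0.576) are assumed here so that the table sums to 1:

```python
# Full joint distribution P(Toothache, Catch, Cavity), assumed Figure 13.3 values.
joint = {
    # (toothache, catch, cavity): probability
    (True,  True,  True):  0.108,
    (True,  False, True):  0.012,
    (False, True,  True):  0.072,
    (False, False, True):  0.008,
    (True,  True,  False): 0.016,
    (True,  False, False): 0.064,
    (False, True,  False): 0.144,
    (False, False, False): 0.576,
}

def prob(holds):
    """Sum the probabilities of the atomic events in which the proposition holds."""
    return sum(p for event, p in joint.items() if holds(*event))

# P(cavity V toothache): six atomic events, summing to 0.28
print(round(prob(lambda toothache, catch, cavity: cavity or toothache), 3))  # 0.28
# P(cavity): marginalizing out the other variables gives 0.2
print(round(prob(lambda toothache, catch, cavity: cavity), 3))  # 0.2
```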
One common task is to extract the distribution over some subset of variables or a single variable.
Ex: Adding the entries in the first row gives the unconditional or marginal probability of cavity:
P(cavity) = 0.108 + 0.012 + 0.072 + 0.008 = 0.2.
This process is called marginalization, or summing out, because the variables other than Cavity are summed out.
• That is, a distribution over Y can be obtained by summing out all the other variables z
from any joint distribution containing Y: P(Y) = Σ_z P(Y, z).
A variant of this rule, called conditioning, involves conditional probabilities instead of joint probabilities, using the product rule:
P(Y) = Σ_z P(Y | z) P(z).
For example, P(cavity | toothache) = P(cavity ∧ toothache) / P(toothache) = (0.108 + 0.012) / 0.2 = 0.6, and
P(¬cavity | toothache) = (0.016 + 0.064) / 0.2 = 0.4.
Notice that 1/P(toothache) remains constant in both calculations. In fact, it can be viewed as a normalization constant for the distribution P(Cavity | toothache),
ensuring that it adds up to 1. We use α to denote such normalization constants.
A general inference procedure loops over the values of the query variable X and the values of the hidden variables Y to enumerate all possible atomic events with the evidence e fixed,
adds up their probabilities from the joint table, and normalizes the result:
P(X | e) = α P(X, e) = α Σ_y P(X, e, y).
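⮚ A minimal sketch of this enumerate-and-normalize procedure over the same (assumed) Figure 13.3 table; the function name enumerate_ask is ours:

```python
# Inference by enumeration: P(Query | evidence) = alpha * sum over hidden-variable
# values of the joint entries consistent with the evidence.
joint = {
    (True,  True,  True):  0.108, (True,  False, True):  0.012,
    (False, True,  True):  0.072, (False, False, True):  0.008,
    (True,  True,  False): 0.016, (True,  False, False): 0.064,
    (False, True,  False): 0.144, (False, False, False): 0.576,
}
VARS = ("toothache", "catch", "cavity")  # order of the tuple keys above

def enumerate_ask(query_var, evidence):
    """Return the distribution P(query_var | evidence) by summing and normalizing."""
    dist = {}
    for value in (True, False):
        total = 0.0
        for event, p in joint.items():
            assignment = dict(zip(VARS, event))
            if assignment[query_var] == value and all(
                    assignment[var] == val for var, val in evidence.items()):
                total += p  # hidden variables are summed out implicitly
        dist[value] = total
    alpha = 1.0 / sum(dist.values())             # normalization constant
    return {value: alpha * p for value, p in dist.items()}

print(enumerate_ask("cavity", {"toothache": True}))  # ≈ {True: 0.6, False: 0.4}
```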
❑ Let us expand the full joint distribution in Figure 13.3 by adding a fourth variable, Weather.
❑ The full joint distribution then becomes P(Toothache, Catch, Cavity, Weather),
which has 32 entries (because Weather has four values).
❑ It contains four "editions" of the table shown in Figure 13.3, one for each kind of weather.
❑ It seems natural to ask what relationship these editions have to each other and to the original three-variable table.
For example, how are
P(toothache, catch, cavity, Weather = cloudy) and P(toothache, catch, cavity) related?
One way to answer this question is to use the product rule P(a ∧ b) = P(a | b) P(b):
P(toothache, catch, cavity, Weather = cloudy) = P(Weather = cloudy | toothache, catch, cavity) P(toothache, catch, cavity).
❖ One should not imagine that one's dental problems influence the weather, so it seems safe to assert that
P(Weather = cloudy | toothache, catch, cavity) = P(Weather = cloudy).
⮚ A similar equation exists for every entry in P(Toothache, Catch, Cavity, Weather).
❖ In fact, we can write the general equation
P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather). ------ (1)
⮚ Thus, the 32-element table for four variables can be constructed from one 8-element table and
one four-element table.
❖ This decomposition is illustrated schematically in Figure 13.5(a).
The property we used in writing Equation (1) is called independence (also marginal independence and absolute
independence).
⮚ Independence between variables X and Y can be written as follows (again, these are all equivalent):
P(X | Y) = P(X)   or   P(Y | X) = P(Y)   or   P(X, Y) = P(X) P(Y).
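⮚ A small sketch of this decomposition: the 32-entry table is rebuilt by multiplying an 8-entry dental table (the assumed Figure 13.3 values) by a 4-entry Weather prior (illustrative numbers, not from the text):

```python
# P(Toothache, Catch, Cavity): 8 entries (assumed Figure 13.3 values).
dental = {
    (True,  True,  True):  0.108, (True,  False, True):  0.012,
    (False, True,  True):  0.072, (False, False, True):  0.008,
    (True,  True,  False): 0.016, (True,  False, False): 0.064,
    (False, True,  False): 0.144, (False, False, False): 0.576,
}
# P(Weather): 4 entries (illustrative numbers only).
weather = {"sunny": 0.6, "rain": 0.1, "cloudy": 0.29, "snow": 0.01}

# Independence: P(Toothache, Catch, Cavity, Weather)
#             = P(Toothache, Catch, Cavity) * P(Weather)
full_joint = {
    dental_event + (w,): p_dental * p_w
    for dental_event, p_dental in dental.items()
    for w, p_w in weather.items()
}

print(len(full_joint))                                    # 32 entries
print(abs(sum(full_joint.values()) - 1.0) < 1e-9)         # True: still a distribution
print(round(full_joint[(True, True, True, "cloudy")], 5)) # 0.108 * 0.29 = 0.03132
```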
We defined the product rule and pointed out that it can be written in two forms because of the commutativity of
conjunction:
P(a ∧ b) = P(a | b) P(b)
P(a ∧ b) = P(b | a) P(a).
Equating the two right-hand sides and dividing by P(a), we get
P(b | a) = P(a | b) P(b) / P(a).
This equation is known as Bayes' rule (also Bayes' law or Bayes' theorem).
⮚ The more general case of multivalued variables can be written in the P notation as
P(Y | X) = P(X | Y) P(Y) / P(X),
where again this is to be taken as representing a set of equations, each dealing with specific values of the
variables.
⮚ A more general version conditionalized on some background evidence e:
P(Y | X, e) = P(X | Y, e) P(Y | e) / P(X | e).
▪ Bayes' rule is useful in practice because there are many cases where we do have
good probability estimates for these three numbers and need to compute the fourth.
▪ In a task such as medical diagnosis,
we often have conditional probabilities on causal relationships and want to derive a diagnosis.
▪ A doctor knows that the disease meningitis causes the patient to have a stiff neck, say, 50% of the time.
▪ The doctor also knows some unconditional facts:
the prior probability that a patient has meningitis is 1/50,000, and
the prior probability that any patient has a stiff neck is 1/20.
▪ Letting s be the proposition that the patient has a stiff neck and
m be the proposition that the patient has meningitis, we have
P(s | m) = 0.5, P(m) = 1/50000, P(s) = 1/20,
P(m | s) = P(s | m) P(m) / P(s) = (0.5 × 1/50000) / (1/20) = 0.0002.
⮚ That is, we expect only 1 in 5000 patients with a stiff neck to have meningitis.
⮚ Notice that, even though a stiff neck is quite strongly indicated by meningitis (with probability 0.5),
the probability of meningitis in the patient remains small.
⮚ This is because the prior probability on stiff necks is much higher than that on meningitis.
▪ Sometimes P(s) is not available directly. In that case we can instead compute a posterior probability for each value of the query variable and normalize:
P(M | s) = α <P(s | m) P(m), P(s | ¬m) P(¬m)>.
Thus, in order to use this approach we need to estimate P(s | ¬m) instead of P(s).
⮚ The general form of Bayes' rule with normalization is P(Y | X) = α P(X | Y) P(Y),
where α is the normalization constant needed to make the entries in P(Y | X) sum to 1.
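▪ The meningitis calculation can be checked with a few lines of Python, using only the numbers quoted above:

```python
# Bayes' rule: P(m | s) = P(s | m) * P(m) / P(s), with the numbers from the text.
p_s_given_m = 0.5          # stiff neck given meningitis
p_m = 1 / 50000            # prior probability of meningitis
p_s = 1 / 20               # prior probability of a stiff neck

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)         # 0.0002, i.e. 1 in 5000
```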
⮚ What can the dentist conclude if her steel probe catches in the aching tooth of a patient?
If we know the full joint distribution (Figure 13.3), we can read off the answer:
P(Cavity | toothache ∧ catch) = α <0.108, 0.016> ≈ <0.871, 0.129>.
⮚ We know, however, that such an approach will not scale up to larger numbers of variables.
⮚ We can try using Bayes' rule to reformulate the problem:
P(Cavity | toothache ∧ catch) = α P(toothache ∧ catch | Cavity) P(Cavity). ------ (2)
⮚ For this reformulation to work, we need the conditional probabilities of the conjunction toothache ∧ catch for each value of Cavity, which again does not scale to many evidence variables.
⮚ These variables are independent, however, given the presence or the absence of a cavity.
⮚ Each is directly caused by the cavity, but neither has a direct effect on the other:
toothache depends on the state of the nerves in the tooth, whereas
the probe's accuracy depends on the dentist's skill, to which the toothache is irrelevant.
Mathematically, this independence given the value of Cavity is written as
P(toothache ∧ catch | Cavity) = P(toothache | Cavity) P(catch | Cavity). ------ (3)
This equation expresses the conditional independence of toothache and catch given Cavity.
The general definition of conditional independence of two variables X and Y, given a third variable Z, is
P(X, Y | Z) = P(X | Z) P(Y | Z).
⮚ Just as absolute independence assertions allow the full joint distribution to be decomposed, so do conditional independence assertions.
For example, given the assertion in Equation (3), we can derive a decomposition as follows:
P(Toothache, Catch, Cavity) = P(Toothache, Catch | Cavity) P(Cavity)
= P(Toothache | Cavity) P(Catch | Cavity) P(Cavity).
⮚ In this way, the original large table is decomposed into three smaller tables.
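⮚ A minimal sketch of this decomposition, with the three smaller tables derived from the assumed Figure 13.3 joint (so here the product recovers the original 8 entries exactly); the decomposed form needs only 1 + 2 + 2 = 5 independent numbers instead of 7:

```python
# Rebuild P(Toothache, Catch, Cavity) from three smaller tables using
#   P(Toothache, Catch, Cavity) = P(Toothache|Cavity) P(Catch|Cavity) P(Cavity).
# Numbers are derived from the assumed Figure 13.3 table.
p_cavity = {True: 0.2, False: 0.8}
p_toothache_given_cavity = {True: 0.6, False: 0.1}   # P(toothache=true | Cavity)
p_catch_given_cavity = {True: 0.9, False: 0.2}       # P(catch=true | Cavity)

def joint(toothache, catch, cavity):
    pt = p_toothache_given_cavity[cavity] if toothache else 1 - p_toothache_given_cavity[cavity]
    pc = p_catch_given_cavity[cavity] if catch else 1 - p_catch_given_cavity[cavity]
    return pt * pc * p_cavity[cavity]

print(round(joint(True, True, True), 3))     # 0.6 * 0.9 * 0.2 = 0.108
print(round(joint(False, False, False), 3))  # 0.9 * 0.8 * 0.8 = 0.576
```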
• The decision tree is really describing a relationship between the goal predicate WillWait and some logical combination of attribute
values.
• Decision trees can express any function of the input attributes. For Boolean functions, each row of the truth table corresponds to a path
to a leaf.
• If the function is the parity function, which returns 1 if and only if an even number of inputs are 1, then an
exponentially large decision tree will be needed. It is also difficult to use a decision tree to represent a majority
function, which returns 1 if more than half of its inputs are 1.
• The truth table has 2^n rows, because each input case is described by n attributes.
We can consider the "answer" column of the table as a 2^n-bit number that defines the function.
A trivial solution for the problem of finding a decision tree that agrees with the training set:
• Construct a decision tree that has one path to a leaf for each example,
where the path tests each attribute in turn and follows the value for the example and the leaf has the
classification of the example.
• When given the same example again, the decision tree will come up with the right classification.
• Unfortunately, it will not have much to say about any other cases!
• We then decide which attribute to use as the first test in the tree.
• Figure 18.4(a) shows that Type is a poor attribute, because it leaves us with four possible outcomes,
each of which has the same number of positive and negative examples.
• On the other hand, in Figure 18.4(b) we see that Patrons is a fairly important attribute, because if the value is
None or Some, then we are left with example sets for which we can answer definitively (No and Yes,
respectively).
• In general, after the first attribute test splits up the examples, each outcome is a new decision tree learning
problem in itself, with fewer examples and one fewer attribute.
1. If there are some positive and some negative examples, then choose the best attribute to split them.
(Figure 18.4(b) shows Hungry being used to split the remaining examples.)
2. If all the remaining examples are positive (or all negative), then we are done: we can answer Yes or No.
(Figure 18.4(b) shows examples of this in the None and Some cases.)
3. If there are no examples left, it means that no such example has been observed, and we return a default value
calculated from the majority classification at the node's parent.
4. If there are no attributes left, but both positive and negative examples, we have a problem.
⮚ It means that these examples have exactly the same description, but different classifications.
⮚ This happens when some of the data are incorrect; we say there is noise in the data.
⮚ It also happens either when the attributes do not give enough information to describe the situation fully, or when
the domain is truly nondeterministic. One simple way out of the problem is to use a majority vote.
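• The four cases above amount to a recursive learning procedure. A minimal Python sketch is given below; the attribute-selection heuristic is passed in as a parameter, branches cover only values seen in the data, and names such as learn_tree and plurality are ours rather than the book's pseudocode:

```python
from collections import Counter

def plurality(examples):
    """Majority classification among the examples (cases 3 and 4)."""
    return Counter(ex["class"] for ex in examples).most_common(1)[0][0]

def learn_tree(examples, attributes, parent_examples, choose_attribute):
    # Case 3: no examples left -> default from the parent's majority class.
    if not examples:
        return plurality(parent_examples)
    classes = {ex["class"] for ex in examples}
    # Case 2: all remaining examples have the same classification -> done.
    if len(classes) == 1:
        return classes.pop()
    # Case 4: no attributes left but mixed classes (noise) -> majority vote.
    if not attributes:
        return plurality(examples)
    # Case 1: choose the best attribute and split the examples on it.
    best = choose_attribute(attributes, examples)
    tree = {"test": best, "branches": {}}
    for value in {ex[best] for ex in examples}:          # values seen in the data
        subset = [ex for ex in examples if ex[best] == value]
        tree["branches"][value] = learn_tree(
            subset, [a for a in attributes if a != best], examples, choose_attribute)
    return tree

# Tiny usage example (toy data, not the restaurant set from Figure 18.3):
examples = [
    {"Patrons": "Some", "Hungry": "Yes", "class": "Yes"},
    {"Patrons": "None", "Hungry": "No",  "class": "No"},
    {"Patrons": "Full", "Hungry": "Yes", "class": "Yes"},
    {"Patrons": "Full", "Hungry": "No",  "class": "No"},
]
first_attribute = lambda attrs, exs: attrs[0]   # stand-in for an information-gain chooser
print(learn_tree(examples, ["Patrons", "Hungry"], examples, first_attribute))
```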
• The measure should have its maximum value when the attribute is perfect and its minimum value when the
attribute is of no use at all.
• One suitable measure is the expected amount of information provided by the attribute.
Ex: Whether a coin will come up heads.
-- The amount of information contained in the answer depends on one's prior knowledge.
-- The less you know, the more information is provided.
• One bit of information is enough to answer a yes/no question about which one has no idea, such as the flip of a
fair coin.
• In general, if the possible answers vi have probabilities P(vi), the information content (entropy) of the actual answer is
I(P(v1), ..., P(vn)) = Σ_i -P(vi) log2 P(vi),
so for a fair coin I(1/2, 1/2) = 1 bit.
• If the coin is loaded to give 99% heads, we get I (1/100,99/100) = 0.08 bits, and
as the probability of heads goes to 1, the information of the actual answer goes to 0.
• An estimate of the probabilities of the possible answers before any of the attributes have been tested is given by
the proportions of positive and negative examples in the training set.
• Suppose the training set contains p positive examples and n negative examples.
• Then an estimate of the information contained in a correct answer is
I(p/(p + n), n/(p + n)) = -(p/(p + n)) log2(p/(p + n)) - (n/(p + n)) log2(n/(p + n)).
• The restaurant training set in Figure 18.3 has p = n = 6, so we need 1 bit of information.
• Now a test on a single attribute A will not usually tell us this much information, but it will give us some of it.
• We can measure exactly how much by looking at how much information we still need after the attribute test.
• Suppose attribute A has v distinct values and splits the training set E into subsets E1, ..., Ev, where subset Ei has pi positive and ni negative examples.
• A randomly chosen example from the training set has the ith value for the attribute
with probability (pi + ni)/(p + n).
⮚ So on average, after testing attribute A, we will need
Remainder(A) = Σ_{i=1..v} ((pi + ni)/(p + n)) I(pi/(pi + ni), ni/(pi + ni))
bits of information to classify the example.
• The information gain from the attribute test is the difference between the original information requirement and the new requirement:
Gain(A) = I(p/(p + n), n/(p + n)) - Remainder(A).
• The heuristic used in the CHOOSE-ATTRIBUTE function is just to choose the attribute with the largest gain.
• Returning to the attributes considered in Figure 18.4, we have
Gain(Patrons) = 1 - [(2/12) I(0, 1) + (4/12) I(1, 0) + (6/12) I(2/6, 4/6)] ≈ 0.541 bits
Gain(Type) = 1 - [(2/12) I(1/2, 1/2) + (2/12) I(1/2, 1/2) + (4/12) I(2/4, 2/4) + (4/12) I(2/4, 2/4)] = 0 bits,
confirming that Patrons is the better attribute to split on first.
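• These quantities are easy to compute directly. A short sketch, using the (pi, ni) counts for Patrons and Type quoted above for the 12-example restaurant set (function names are ours):

```python
from math import log2

def information(*probs):
    """I(P(v1), ..., P(vn)) = sum of -P(vi) log2 P(vi), with 0 log 0 taken as 0."""
    return sum(-p * log2(p) for p in probs if p > 0)

def remainder(splits, p, n):
    """splits: list of (pi, ni) counts, one pair per value of the attribute."""
    return sum((pi + ni) / (p + n) * information(pi / (pi + ni), ni / (pi + ni))
               for pi, ni in splits)

def gain(splits, p, n):
    return information(p / (p + n), n / (p + n)) - remainder(splits, p, n)

p, n = 6, 6                                   # restaurant set: 6 positive, 6 negative
patrons = [(0, 2), (4, 0), (2, 4)]            # None, Some, Full
type_   = [(1, 1), (1, 1), (2, 2), (2, 2)]    # French, Italian, Thai, Burger

print(round(gain(patrons, p, n), 3))  # ≈ 0.541 bits
print(round(gain(type_, p, n), 3))    # 0.0 bits
```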
❖ Uncertainty arises in the wumpus world because the agent's sensors give only partial, local information about the
world.
▪ Figure 13.6 shows a situation in which each of the three reachable squares -- [1,3], [2,2], and [3,1] -- might contain a pit.
▪ Pure logical inference can conclude nothing about which square is most likely to be safe,
so a logical agent might be forced to choose randomly.
▪ A probabilistic agent can do much better than the logical agent.
• We use Boolean variables Pij (true iff square [i, j] contains a pit) and Bij (true iff square [i, j] is breezy); the Bij variables are included only for the observed squares -- [1,1], [1,2], and [2,1].
• The product rule factors the full joint distribution as
P(P1,1, ..., P4,4, B1,1, B1,2, B2,1) = P(B1,1, B1,2, B2,1 | P1,1, ..., P4,4) P(P1,1, ..., P4,4).
• The first term (on the RHS): the conditional probability of a breeze configuration, given a pit configuration
-- this is 1 if the breezes are adjacent to the pits and 0 otherwise.
• The second term: the prior probability of a pit configuration.
-- Each square contains a pit with probability 0.2, independently of the other squares.
Hence,
P(P1,1, ..., P4,4) = Π_{i,j} P(Pi,j),
so a configuration with exactly n pits has prior probability 0.2^n × 0.8^(16-n).
⮚ Let known = ¬p1,1 ∧ ¬p1,2 ∧ ¬p2,1 denote the known pit-free squares, and let b = ¬b1,1 ∧ b1,2 ∧ b2,1 denote the observed breeze facts.
To answer a query such as P(P1,3 | known, b), we can follow the standard approach suggested by the equation P(X | e) = α Σ_y P(X, e, y):
P(P1,3 | known, b) = α Σ_unknown P(P1,3, unknown, known, b), ------ (4)
where unknown ranges over the pit variables of the remaining squares.
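▪ A brute-force sketch of this computation, assuming (as in Figure 13.6) that breezes were perceived in [1,2] and [2,1] and not in [1,1]. Pits outside the three frontier squares cannot affect these breezes, so they sum out of Equation (4) and we need only enumerate frontier configurations; square labels and helper names are ours:

```python
from itertools import product

PIT_PRIOR = 0.2
# Frontier squares that could explain the observed breezes at [1,2] and [2,1].
FRONTIER = ["1,3", "2,2", "3,1"]

def consistent(pits):
    """Breeze model: a breeze is observed iff some adjacent square has a pit.
    Breeze at [1,2] needs a pit in [1,3] or [2,2]; breeze at [2,1] needs a pit
    in [2,2] or [3,1]. (No breeze at [1,1] is already covered by the known
    pit-free squares.)"""
    return (pits["1,3"] or pits["2,2"]) and (pits["2,2"] or pits["3,1"])

def pit_probability(query_square):
    """P(pit in query_square | known, b), enumerating frontier configurations."""
    total = with_pit = 0.0
    for values in product([True, False], repeat=len(FRONTIER)):
        pits = dict(zip(FRONTIER, values))
        if not consistent(pits):
            continue                     # zero weight: breezes not explained
        weight = 1.0
        for square in FRONTIER:
            weight *= PIT_PRIOR if pits[square] else 1 - PIT_PRIOR
        total += weight
        if pits[query_square]:
            with_pit += weight
    return with_pit / total              # normalization over consistent configurations

print(round(pit_probability("1,3"), 2))  # ≈ 0.31
print(round(pit_probability("2,2"), 2))  # ≈ 0.86
```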