Unit II
PROBABILISTIC REASONING
Acting under uncertainty – Bayesian inference – Naïve Bayes models. Probabilistic reasoning – Bayesian networks – exact inference in BN – approximate inference in BN – causal networks.
Part-A
1. Why does uncertainty arise? CO2 May 2023
Agents almost never have access to the whole truth about their environment.
Uncertainty arises because of both laziness and ignorance. It is inescapable in complex, nondeterministic, or partially observable environments.
Agents cannot find a categorical answer.
Uncertainty can also arise because of incompleteness or incorrectness in the agent's understanding of the properties of the environment.
2. Define Bayes' theorem. CO2
In probability theory and its applications, Bayes' theorem (alternatively called Bayes' law or Bayes' rule) links a conditional probability to its inverse:
P(a | b) = P(b | a) P(a) / P(b)
A prior probability is written as P(a). For example, if the probability that Ram has a cavity is 0.1, the prior probability is written as P(Cavity = true) = 0.1.
8. What is reasoning by default? CO2
We can do qualitative reasoning using techniques like default reasoning. Default reasoning
treats conclusions not as "believed to a certain degree", but as "believed until a better reason
is found to believe something else".
9. Define Conditional Probability CO2
1) When an agent obtains evidence concerning previously unknown random variables in the domain, the prior probabilities alone are no longer used. Based on the new information, conditional or posterior probabilities are calculated.
2) The notation P(a | b) is read as "the probability of a given that all we know is b". That is, when b is known it indicates the probability of a.
To specify a hybrid network, we have to specify two new kinds of distributions. The
conditional distribution for a continuous variable given discrete or continuous parents; and
the conditional distribution for a discrete variable given continuous parents.
20. In a class, 70% of the students like English and 40% of the students like both English and Mathematics. What percent of the students who like English also like Mathematics? CO2
Let A be the event that a student likes Mathematics and B be the event that a student likes English.
P(A | B) = P(A ∧ B) / P(B) = 0.4 / 0.7 = 0.57
Hence, 57% of the students who like English also like Mathematics.
21. List some applications of Bayes' theorem CO2
It is used to calculate the next step of the robot when the already executed step
is given.
Bayes' theorem is helpful in weather forecasting.
It can solve the Monty Hall problem.
22. What is the probability that a patient has the disease meningitis given a stiff neck? CO2
Given Data:
A doctor is aware that disease meningitis causes a patient to have a stiff neck, and it
occurs 80% of the time. He is also aware of some more facts, which are given as
follows:
o The known probability that a patient has the meningitis disease is 1/30,000.
o The known probability that a patient has a stiff neck is 2%.
Let a be the proposition that the patient has a stiff neck and b be the proposition that the patient has meningitis. We can then calculate the following:
P(a|b) = 0.8
P(b) = 1/30000
P(a) = 0.02
Applying Bayes' theorem: P(b | a) = P(a | b) P(b) / P(a) = (0.8 × 1/30000) / 0.02 ≈ 0.0013
Hence, we can assume that about 1 patient out of 750 patients with a stiff neck has the meningitis disease.
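The same Bayes' rule calculation can be scripted. Below is a minimal Python sketch; the function name bayes_rule and the variable names are our own illustration, not part of the syllabus:

def bayes_rule(p_a_given_b, p_b, p_a):
    # Bayes' theorem: P(b | a) = P(a | b) * P(b) / P(a)
    return p_a_given_b * p_b / p_a

# Meningitis example: P(stiff neck | meningitis) = 0.8,
# P(meningitis) = 1/30000, P(stiff neck) = 0.02
print(bayes_rule(0.8, 1 / 30000, 0.02))   # ~0.00133, i.e. about 1 in 750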
23. Define Probability CO2
Probability: Probability can be defined as the chance that an uncertain event will occur. It is the numerical measure of the likelihood that an event will occur. The value of probability always lies between 0 and 1, representing ideal uncertainty and certainty.
0 ≤ P(A) ≤ 1, where P(A) is the probability of an event A.
P(A) = 0, indicates total uncertainty in an event A.
P(A) =1, indicates total certainty in an event A.
We can find the probability of an uncertain event by using the formula below:
Probability of occurrence = Number of desired outcomes / Total number of outcomes
In the real world, there are lots of scenarios, where the certainty of something
is not confirmed, such as "It will rain today," "behavior of someone for some
situations," "A match between two teams or two players." These are probable
sentences for which we can assume that it will happen but not sure about it, so
here we use probabilistic reasoning.
25. What is the need for probabilistic reasoning in AI? CO2
When there are unpredictable outcomes.
When specifications or possibilities of predicates become too large to handle.
When an unknown error occurs during an experiment.
In probabilistic reasoning, there are two ways to solve problems with uncertain
knowledge:
o Bayes' rule
o Bayesian Statistics
Part-B
1. Explain acting under uncertainty with relevant examples CO2
An agent working in a real-world environment almost never has access to the whole truth about its environment. Therefore, the agent needs to act under uncertainty.
The agents we have seen earlier make the epistemological commitment that facts (expressed as propositions) are either true, false, or unknown. When an agent knows enough facts about its environment, the logical approach enables it to derive plans that are guaranteed to work.
But when an agent works with uncertain knowledge, it might be impossible to construct a complete and correct description of how its actions will work. If a logical agent cannot conclude that any particular course of action achieves its goal, then it will be unable to act.
The best a logical agent can do in this situation is to take a rational decision. The rational decision depends on the following things:
• The relative importance of various goals.
• The likelihood and the degree to which, goals will be achieved.
Utility Theory
Utility theory is used to represent and reason with preferences. The term utility in the current context means "the quality of being useful".
Utility theory says that every state has a degree of usefulness, called utility, and the agent will prefer the states with higher utility.
The utility of a state is relative to the agent for which the utility function is calculated, on the basis of that agent's preferences.
For example: The payoff functions for games are utility functions. The utility of a state in which black has won a game of chess is obviously high for the agent playing black and low for the agent playing white.
There is no universal measure of tastes or preferences: someone loves deep chocolate ice cream and someone else loves choco-chip ice cream. A utility function can account for altruistic behavior, simply by including the welfare of others as one of the factors contributing to the agent's own utility.
Decision theory
Preferences, as expressed by utilities, are combined with probabilities for making rational decisions. This theory of rational decision making is called decision theory.
Decision theory can be summarized as,
Decision theory = Probability theory + Utility theory.
• The principle of Maximum Expected Utility (MEU):
Decision theory says that the agent is rational if and only if it chooses the action that yields the highest expected utility, averaged over all the possible outcomes of the action.
• Design for a decision theoretic agent:
Following algorithm sketches the structure of an agent that uses decision
theory to select actions.
The algorithm
Function: DT-AGENT (percept) returns an action.
Static: belief-state, probabilistic beliefs about the current state of the world.
action, the agent's action.
- Update belief-state based on action and percept
- Calculate outcome probabilities for actions, given actions descriptions
and current belief-state
- Select action with highest expected utility given probabilities of
outcomes and utility information
- Return action.
A decision-theoretic agent that selects rational actions.
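To make the control flow of DT-AGENT concrete, here is a minimal Python sketch. The helpers update_belief, outcome_probabilities and utility are hypothetical placeholders for the problem-specific probability and utility models; they are not defined in the syllabus:

def dt_agent(percept, belief_state, actions,
             update_belief, outcome_probabilities, utility):
    # Update probabilistic beliefs about the current state from the new percept
    belief_state = update_belief(belief_state, percept)
    best_action, best_eu = None, float("-inf")
    for action in actions:
        # Expected utility = sum over outcomes of P(outcome) * U(outcome)
        eu = sum(p * utility(outcome)
                 for outcome, p in outcome_probabilities(action, belief_state))
        # Keep the action with the highest expected utility (MEU principle)
        if eu > best_eu:
            best_action, best_eu = action, eu
    return best_action, belief_state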
The decision theoretic agent is identical, at an abstract level, to the logical
agent. The primary difference is that the decision theoretic agent's knowledge
of the current state is uncertain; the agent's belief state is a representation of the
probabilities of all possible actual states of the world.
As time passes, the agent accumulates more evidence and its belief state
changes. Given the belief state, the agent can make probabilistic predictions of
action outcomes and hence select the action with highest expected utility.
3) A prior probability is written as P(a).
For example: If the probability that Ram has a cavity is 0.1, then the prior probability is written as P(Cavity = true) = 0.1 or P(cavity) = 0.1.
4) It should be noted that as soon as new information is received, one should reason with the conditional probability of 'a' given the new information.
5) When it is required to express the probabilities of all the possible values of a random variable, a vector of values is used. It is represented using P(a), which gives the probabilities of each individual state of 'a'.
For example: P(Weather) = <0.7, 0.2, 0.08, 0.02> represents four equations, one for each possible value of Weather.
6) The expression P(a) is then said to define a prior probability distribution for the random variable 'a'.
9) A joint probability distribution that covers the complete set of random variables is called the full joint probability distribution.
If the problem world consists of the 3 random variables Weather, Cavity and Toothache, then the full joint probability distribution would be P(Weather, Cavity, Toothache).
i) For a continuous random variable it is not feasible to represent a vector of all possible values, because the values are infinite. For a continuous random variable the probability is instead defined as a function with parameter x, giving the probability that the random variable takes the value x.
For example: Let the random variable X denote tomorrow's temperature in Chennai. It would be represented as P(X = x) = U[25, 37](x).
This sentence expresses the belief that X is distributed uniformly between 25 and 37 degrees Celsius.
ii) The probability distribution for continuous random variable has probability density
function.
Conditional Probability
The notation P(a | b) is read as "the probability of a given that all we know is b". That is, when b is known it indicates the probability of a.
For example, P(Cavity = true | Toothache = true) = 0.8 means that if a patient has a toothache (and no other information is known) then the chance of having a cavity is 0.8.
Conditional probabilities can be defined in terms of unconditional probabilities using the product rule: P(a ∧ b) = P(a | b) P(b). In other words, for 'a' and 'b' to be true we need 'b' to be true and we need 'a' to be true given b. It can also be written as P(a ∧ b) = P(b | a) P(a).
6) The P notation can also be used for conditional distributions. P(X | Y) gives the values of P(X = xi | Y = yj) for each possible pair i, j.
1) All probabilities are between 0 and 1. For any proposition a, 0 ≤ P(a) ≤1.
2) Necessarily true (i.e, valid) propositions have probability 1, and necessarily false
(i.e., unsatisfiable) propositions have probability 0.
P(true) = 1 P(false) = 0
3) P(a ∨ b) = P(a) + P(b) − P(a ∧ b)
This axiom connects the probabilities of logically related propositions. It states that the cases where 'a' holds, together with the cases where 'b' holds, certainly cover all the cases where 'a ∨ b' holds; but summing the two sets of cases counts their intersection twice, so we need to subtract P(a ∧ b).
Equating the right-hand sides of equations (7.1.6) and (7.1.7) and dividing by P(a), we obtain Bayes' rule: P(b | a) = P(a | b) P(b) / P(a).
For example: The probability that a patient with low sugar has high blood pressure is 50%.
Then we have,
P(s | m) = 0.5
P(m) = 1/50000
P(s) = 1/20
P(m | s) = P(s | m) P(m) / P(s) = (0.5 × 1/50000) / (1/20) = 0.0002
That is, we can expect that 1 in 5000 patients with high blood pressure will have low sugar.
For example: When both pieces of evidence, toothache and catch, are available, the probability of a cavity can be computed by applying Bayes' rule, which can be represented as
P(Cavity | toothache ∧ catch) = α P(toothache ∧ catch | Cavity) P(Cavity)
For this reformulation to work, we need to know the conditional probabilities of the conjunction Toothache ∧ Catch for each value of Cavity. That might be feasible for just two evidence variables, but it will not scale up.
If there are n possible evidence variables (X-rays, diet, oral hygiene, etc.), then there are 2^n possible combinations of observed values for which we would need to know conditional probabilities.
The notion of independence can be used here. These variables are independent,
however, given the presence or the absence of a cavity. Each is directly caused by the
cavity, but neither has a direct effect on the other. Toothache depends on the state of
the nerves in the tooth, where as the probe's accuracy depends on the dentist's skill, to
which the toothache is irrelevant.
P(toothache ∧ catch | Cavity) = P(toothache | Cavity) P(catch | Cavity)
This equation expresses the conditional independence of toothache and catch, given Cavity.
Now, the information requirements are the same as for inference using each piece of evidence separately: the prior probability P(Cavity) for the query variable and the conditional probability of each effect, given its cause.
Conditional independence assertions allow probabilistic systems to scale up; moreover, they are much more commonly available than absolute independence assertions.
When there are 'n' evidence variables that are all conditionally independent given the cause, the size of the representation grows as O(n) instead of O(2^n).
For example: P(Cause, Effect1, ..., Effectn) = P(Cause) Πi P(Effecti | Cause), which is the naive Bayes model.
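As a quick illustration of how this O(n) representation is used, the sketch below builds one entry of the joint P(Cavity, Toothache, Catch) from a prior and two per-effect conditionals. The numbers are illustrative assumptions, not values given in the text:

# Prior and one small conditional table per effect, instead of a 2^3-entry joint
p_cavity = 0.2                                  # assumed P(Cavity = true)
p_toothache_given = {True: 0.6, False: 0.1}     # assumed P(toothache | Cavity)
p_catch_given = {True: 0.9, False: 0.2}         # assumed P(catch | Cavity)

def joint_entry(cavity, toothache, catch):
    # P(Cavity, Toothache, Catch) = P(Cavity) P(Toothache | Cavity) P(Catch | Cavity)
    pc = p_cavity if cavity else 1 - p_cavity
    pt = p_toothache_given[cavity] if toothache else 1 - p_toothache_given[cavity]
    pk = p_catch_given[cavity] if catch else 1 - p_catch_given[cavity]
    return pc * pt * pk

print(joint_entry(True, True, True))   # one joint entry built from O(n) numbers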
• One particularly common task in inferencing is to extract the distribution over some subset of variables or over a single variable. This distribution is called the marginal probability.
For example: P(Cavity) = 0.108 +0.012+ 0.072 + 0.008. = 0.2
This process is called marginalization or summing out, because the variables other than Cavity (the variable whose probability is being computed) are summed out.
The general rule is that the distribution over Y can be obtained by summing out all the other variables from any joint distribution containing Y.
P(cavity | toothache) = (0.108 + 0.012) / (0.108 + 0.012 + 0.016 + 0.064) = 0.6
Just to check we can also compute the probability that there is no cavity given a
toothache:
Notice that in these two calculations the term 1/P (toothache) remains constant, no
matter which value of cavity we calculate. With this notation we can write above two
equations in one.
Consider the case in which the query involves a single variable. The notation used is: let X be the query variable (Cavity in the example), let E be the set of evidence variables (just Toothache in the example), let e be the observed values for them, and let Y be the remaining unobserved variables (just Catch in the example). The query is P(X | e) and can be evaluated as
P(X | e) = α P(X, e) = α Σy P(X, e, y)
where the summation is over all possible ys (i.e. all possible combinations of values of
the unobserved variables 'y'). Notice that together the variables, X, E and Y constitute
the complete set of variables for the domain, so P (X, e, y) is simply a subset of
probabilities from the full joint distribution.
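A small Python sketch of this computation on the dentistry domain is given below. The four joint entries quoted in the text are used; the remaining entries are the standard values of this example and should be read as assumptions here:

# Full joint P(Cavity, Toothache, Catch); key order: (cavity, toothache, catch)
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

def cavity_given_toothache():
    # P(Cavity | toothache) = alpha * sum over the hidden variable Catch
    num = {c: sum(joint[(c, True, k)] for k in (True, False)) for c in (True, False)}
    alpha = 1 / sum(num.values())            # 1 / P(toothache)
    return {c: alpha * v for c, v in num.items()}

print(cavity_given_toothache())   # {True: 0.6, False: 0.4}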
Independence
Independence is a relationship between two different sets of variables in the full joint distribution. It is also called marginal or absolute independence of the variables. Independence indicates whether the two sets of variables affect each other's probabilities.
• For example: The weather is independent of one's dental problem, which can be shown by the equation below.
P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather).
2) Intermediate results are stored and summations over each variable are done only for
those portions of the expression that depends on the variable.
3) Factors: Each part of the expression is annotated with the name of the associated
variable, these parts are called factors.
Steps in algorithm:
i) The factor for M, P(m | a), does not require summing over M. The probability, given each value of a, is stored in a two-element factor f_M(A).
ii) Similarly, the factor for J, P(j | a), is stored as a two-element factor f_J(A).
iii) The factor for A is P(a | B, E), which will be a 2×2×2 matrix f_A(A, B, E).
iv) Sum out A from the product of these three factors. This gives a 2×2 matrix whose indices range over just B and E. We put a bar over A in the name of the matrix to indicate that A has been summed out:
f̄_AJM(B, E) = Σa f_A(a, B, E) × f_J(a) × f_M(a)
v) Process E in the same way, i.e., sum out E from the product of f_E and f̄_AJM:
f̄_EAJM(B) = Σe f_E(e) × f̄_AJM(B, e)
vi) Compute the answer simply by multiplying the factor for B (i.e., f_B(B) = P(B)) by the accumulated matrix f̄_EAJM(B):
P(B | j, m) = α f_B(B) × f̄_EAJM(B)
From the above sequence of steps it can be noticed that two computational operations are required: computing the pointwise product of a pair of factors, and summing out a variable from a product of factors.
For example: Given two factors f1(A, B) and f2(B, C) with probability distributions shown below, the pointwise product f1 × f2 is a new factor over A, B and C.
Matrices are not multiplied until we need to sum out a variable from the accumulated product. At that point, we multiply those matrices that include the variable to be summed out.
The procedure for pointwise product and summing out, which together make up the variable elimination algorithm, is sketched below.
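Since the original listing is not reproduced here, the following Python sketch shows the two core operations, with a factor stored as a dictionary from assignments (tuples of truth values) to numbers. This is a simplification of the matrix form used above, not the textbook's exact algorithm:

from itertools import product

def pointwise_product(f1, vars1, f2, vars2):
    # The product factor ranges over the union of the variables of f1 and f2
    out_vars = list(dict.fromkeys(vars1 + vars2))
    out = {}
    for assign in product([True, False], repeat=len(out_vars)):
        a = dict(zip(out_vars, assign))
        out[assign] = f1[tuple(a[v] for v in vars1)] * f2[tuple(a[v] for v in vars2)]
    return out, out_vars

def sum_out(var, f, vars_):
    # Sum the factor over both values of one variable (marginalise it away)
    i = vars_.index(var)
    out = {}
    for assign, val in f.items():
        key = assign[:i] + assign[i + 1:]
        out[key] = out.get(key, 0.0) + val
    return out, vars_[:i] + vars_[i + 1:]

# Example: f1(A, B) x f2(B, C), then sum out B (illustrative numbers)
f1 = {(True, True): 0.3, (True, False): 0.7, (False, True): 0.9, (False, False): 0.1}
f2 = {(True, True): 0.2, (True, False): 0.8, (False, True): 0.6, (False, False): 0.4}
f3, v3 = pointwise_product(f1, ["A", "B"], f2, ["B", "C"])
print(sum_out("B", f3, v3))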
Σm P(m | a) is equal to 1.
Note: The variable M is irrelevant to this query. The result of the query P(Jcalls | Burglary = true) is unchanged if we remove Mcalls from the network. We can remove any leaf node which is not a query variable or an evidence variable. After its removal, there may be more such leaf nodes, and they may also be irrelevant. Eventually we find that every variable that is not an ancestor of a query variable or evidence variable is irrelevant to the query. A variable elimination algorithm can remove all these variables before evaluating the query.
1) The basic element in any sampling algorithm is the generation of samples from a
known probability distribution.
For example: An unbiased coin can be thought of as a random variable Coin with values (heads, tails) and a prior distribution P(Coin) = <0.5, 0.5>. Sampling from this distribution is exactly like flipping the coin: with probability 0.5 it will return heads, and with probability 0.5 it will return tails. Given a source of random numbers in the range [0, 1], it is a simple matter to sample any distribution on a single variable.
2) The simplest kind of random sampling process for Bayesian networks generates
events from a network that has no evidence associated with it. The idea is to sample
each variable in turn, in topological order.
3) The probability distribution from which the value is sampled is conditioned on the
values already assigned to the variable's parents.
Function PRIOR-SAMPLE(bn) returns an event sampled from the prior specified by bn
x ← an event with n elements
for i = 1 to n do
xi ← a random sample from P(Xi | parents(Xi))
return x
i) Sample from P(Cloudy) = <0.5, 0.5>; suppose this returns true.
ii) Sample from P(Sprinkler | Cloudy = true) = <0.1, 0.9>; suppose this returns false.
iii) Sample from P(Rain | Cloudy = true) = <0.8, 0.2>; suppose this returns true.
iv) Sample from P(WetGrass | Sprinkler = false, Rain = true) = <0.9, 0.1>; suppose this returns true.
In this case PRIOR-SAMPLE returns the event [true, false, true, true].
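A minimal Python sketch of PRIOR-SAMPLE for this sprinkler network follows. The conditional probabilities quoted above are used; entries not quoted in the text (for example P(Sprinkler | Cloudy = false) and the full WetGrass table) are the standard values of this example and should be treated as assumptions:

import random

def prior_sample():
    # Sample each variable in topological order, conditioned on its sampled parents
    cloudy = random.random() < 0.5
    sprinkler = random.random() < (0.1 if cloudy else 0.5)
    rain = random.random() < (0.8 if cloudy else 0.2)
    p_wet = {(True, True): 0.99, (True, False): 0.90,
             (False, True): 0.90, (False, False): 0.0}[(sprinkler, rain)]
    wet_grass = random.random() < p_wet
    return cloudy, sprinkler, rain, wet_grass

print(prior_sample())   # e.g. (True, False, True, True)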
Each sampling step depends only on the parent values, so PRIOR-SAMPLE generates a particular event with probability S_PS(x1, ..., xn) = Πi P(xi | parents(Xi)). This expression is also the probability of the event according to the Bayesian net's representation of the joint distribution; that is, S_PS(x1, ..., xn) = P(x1, ..., xn).
7) In any sampling algorithm, the answers are computed by counting the actual
samples generated. Suppose there are N total samples, and let N (x 1,...,xn) be the
frequency of the specific event x 1,...,xn. We expect this frequency to converge in the
limit, to its expected value according to the sampling probability:-
lim N→∞ N_PS(x1, ..., xn) / N = S_PS(x1, ..., xn) = P(x1, ..., xn)    ... (7.3.4)
For example: Consider the event produced earlier: [true, false, true, true]. The sampling probability for this event is
S_PS(true, false, true, true) = 0.5 × 0.9 × 0.8 × 0.9 = 0.324
Hence, in the limit of large N, we expect 32.4 % of the samples to be of this event.
For example: One can produce a consistent estimate of the probability of any partially specified event x1, ..., xm, where m ≤ n, as follows:
P(x1, ..., xm) ≈ N_PS(x1, ..., xm) / N
That is, the probability of the event can be estimated as the fraction of all complete events generated by the sampling process that match the partially specified event.
For example: If we generate 1000 samples from the sprinkler network, and 511 of
them have, Rain = true, then the estimated probability of rain, written as P(Rain =
true), is 0.511.
2) It can be used to compute conditional probabilities; that is, to determine P(X | e).
for j = 1 to N do
x ← PRIOR-SAMPLE(bn)
if x is consistent with e then N[x] ← N[x] + 1, where x is the value of X in x
return NORMALIZE(N[X])
The rejection sampling algorithm for answering queries given evidence in a Bayesian network.
• Working of algorithm:
i) It generates samples from the prior distribution specified by the network.
ii) It rejects all those samples that do not match the evidence.
iii) Finally, the estimate P(X = x | e) is obtained by counting how often X = x occurs
in the remaining samples.
4) Let P̂(X | e) be the estimated distribution that the algorithm returns. From the definition of the algorithm, we have
P̂(X | e) = α N_PS(X, e) = N_PS(X, e) / N_PS(e)
That is, rejection sampling produces a consistent estimate of the true probability.
5) Applying the algorithm to the network in Fig. 7.3.8 (a), let us assume that we wish to estimate P(Rain | Sprinkler = true), using 100 samples. Of the 100 that we generate, suppose that 73 have Sprinkler = false and are rejected, while 27 have Sprinkler = true; of the 27, 8 have Rain = true and 19 have Rain = false. Hence,
P(Rain | Sprinkler = true) ≈ NORMALIZE(<8, 19>) = <0.296, 0.704>
6) As more samples are collected, the estimate converges to the true answer. The standard deviation of the error in each probability will be proportional to 1/√n, where 'n' is the number of samples used in the estimate.
7) The biggest problem with rejection sampling is that it rejects so many samples !
The fraction of samples consistent with the evidence 'e' drops exponentially as the
number of evidence variables grows, so the procedure is simply unusable for complex
problems.
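A Python sketch of rejection sampling for P(Rain | Sprinkler = true) on the same sprinkler network, reusing the prior_sample function sketched earlier:

def rejection_sampling_rain_given_sprinkler(n):
    counts = {True: 0, False: 0}
    for _ in range(n):
        cloudy, sprinkler, rain, wet_grass = prior_sample()
        if sprinkler:                  # keep only samples consistent with the evidence
            counts[rain] += 1
    total = counts[True] + counts[False]
    if total == 0:
        return None                    # every sample was rejected
    return {value: c / total for value, c in counts.items()}

print(rejection_sampling_rain_given_sprinkler(10000))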
for j = 1 to N do
x, w ← WEIGHTED-SAMPLE(bn, e)
W[x] ← W[x] + w, where x is the value of X in x
In WEIGHTED-SAMPLE, w ← 1 and for i = 1 to n do
if Xi has a value xi in e then w ← w × P(Xi = xi | parents(Xi))
else xi ← a random sample from P(Xi | parents(Xi))
return x, w
i) It fixes the values for the evidence variables E and samples only the remaining
variables X and Y. This guarantees that each event generated is consistent with the
evidence.
ii) Not all events are equal, however. Before tallying the counts in the distribution for the query variable, each event is weighted by the likelihood that the event accords to the evidence, as measured by the product of the conditional probabilities for each evidence variable, given its parents.
iii) Intuitively, events in which the actual evidence appears unlikely should be given
less weight.
3) Applying the algorithm to the network in Fig. 7.3.8 (a), with the query P(Rain | Sprinkler = true, WetGrass = true), the process goes as follows:
iv) The weight is low because the event describes a cloudy day, which makes the
sprinkler unlikely to be on.
iv) WEIGHTED-SAMPLE samples each variable in Z given its parent values:
S_WS(z, e) = Πi P(zi | parents(Zi))
Notice that Parents(Zi) can include both hidden variables and evidence variables. Unlike the prior distribution P(z), the distribution S_WS pays some attention to the evidence: the sampled values for each Zi will be influenced by evidence among Zi's ancestors.
v) On the other hand, S_WS pays less attention to the evidence than does the true posterior distribution P(z | e), because the sampled values for each Zi ignore evidence among Zi's non-ancestors.
vi) The likelihood weight w makes up for the difference between the actual and desired sampling distributions. The weight for a given sample x, composed from z and e, is the product of the likelihoods for each evidence variable given its parents (some or all of which may be among the Zi's):
w(z, e) = Πi P(ei | parents(Ei))
vii) Multiplying equations (7.3.4) and (7.3.5), we see that the weighted probability of a sample has the particularly convenient form
S_WS(z, e) w(z, e) = P(z, e)
because the two products cover all the variables in the network, allowing us to use equation (7.3.1) for the joint probability.
viii) It is easy to show that likelihood weighting estimates are consistent. For any particular value x of X, the estimated posterior probability can be calculated as follows:
P̂(x | e) = α Σy N_WS(x, y, e) w(x, y, e)   [from LIKELIHOOD-WEIGHTING]
≈ α' Σy S_WS(x, y, e) w(x, y, e)   [for large N]
= α' Σy P(x, y, e)
= α' P(x, e) = P(x | e)
Hence likelihood weighting returns consistent estimates.
5) Performance of the algorithm:
i) Likelihood weighting uses all the samples generated, therefore it can be much more efficient than rejection sampling.
ii) However, it suffers a degradation in performance as the number of evidence variables increases.
iii) This is because most samples will have very low weights, and hence the weighted estimate will be dominated by the tiny fraction of samples that accord more than an infinitesimal likelihood to the evidence.
iv) The problem is exacerbated if the evidence variables occur late in the variable ordering, because then the hidden variables are sampled in simulations that bear little resemblance to the reality suggested by the evidence.
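A Python sketch of likelihood weighting for the query P(Rain | Sprinkler = true, WetGrass = true) on the sprinkler network; as before, CPT entries not quoted in the text are the standard values of this example and are assumptions here:

import random

def weighted_sample():
    # Evidence Sprinkler = true and WetGrass = true are fixed; only the hidden
    # variables Cloudy and Rain are sampled, and the weight accumulates the
    # likelihood of each evidence variable given its parents.
    w = 1.0
    cloudy = random.random() < 0.5
    w *= 0.1 if cloudy else 0.5            # P(Sprinkler = true | Cloudy)
    rain = random.random() < (0.8 if cloudy else 0.2)
    w *= 0.99 if rain else 0.90            # P(WetGrass = true | Sprinkler = true, Rain)
    return rain, w

def likelihood_weighting(n):
    weights = {True: 0.0, False: 0.0}
    for _ in range(n):
        rain, w = weighted_sample()
        weights[rain] += w
    total = weights[True] + weights[False]
    return {value: wt / total for value, wt in weights.items()}

print(likelihood_weighting(10000))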
P(x1, ..., xn) = Πi P(xi | parents(Xi)), where parents(Xi) denotes the specific values of the variables in Parents(Xi).
5) Thus, each entry in the joint distribution is represented by the product of the
appropriate elements of the Conditional Probability Tables (CPTs) in the Bayesian
network. The CPTs therefore, provide a decomposed representation of the joint
distribution.
We can calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has occurred, and both J and M call. Using single-letter names for the variables:
P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j | a) P(m | a) P(a | ¬b ∧ ¬e) P(¬b) P(¬e)
6) Remember that the full joint distribution can be used to answer any query about the
domain.
3) The identity P(x1, ..., xn) = Πi P(xi | xi−1, ..., x1) holds true for any set of random variables and is called the chain rule. Comparing it with equation (7.3.1), we see that the specification of the joint distribution is equivalent to the general assertion that, for every variable Xi in the network,
P(Xi | Xi−1, ..., X1) = P(Xi | Parents(Xi))
provided that Parents(Xi) ⊆ {Xi−1, ..., X1}. This last condition is satisfied by labeling the nodes in any order that is consistent with the partial order implicit in the graph structure.
5) In order to construct a Bayesian network with the correct structure for the domain, we need to choose parents for each node such that this property holds. Intuitively, the parents of node Xi should contain all those nodes in X1, ..., Xi−1 that directly influence Xi.
2) A node is conditionally independent of all other nodes in the network, given its parents, children, and children's parents, that is, given its Markov blanket.
For example: Burglary is independent of J calls and M calls given Alarm and
Earthquake.
A node X is conditionally independent of all other nodes in the network given its Markov blanket (the gray area in the figure).
If the maximum number of parents k is smallish, filling in the CPT for a node would require up to O(2^k) numbers.
For example: A patient could have cold (parent) but not exhibit a fever (child).
The noisy-OR model requires that all the possible causes of the effect are listed, and that the inhibition of each parent is independent of the inhibition of any other parent.
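A short Python sketch of the noisy-OR model: each parent that is present fails to cause the effect independently with its own inhibition probability, so P(effect | parents) = 1 minus the product of the inhibition probabilities of the parents that are present. The fever example and its inhibition numbers below are illustrative assumptions:

# Inhibition probabilities: chance that a cause, acting alone, fails to produce fever
inhibit = {"cold": 0.6, "flu": 0.2, "malaria": 0.1}   # assumed values

def noisy_or_fever(present_causes):
    # P(fever | causes) = 1 - product of the inhibition probabilities of present causes
    q = 1.0
    for cause in present_causes:
        q *= inhibit[cause]
    return 1.0 - q

print(noisy_or_fever(["cold"]))           # 0.4
print(noisy_or_fever(["cold", "flu"]))    # 1 - 0.6 * 0.2 = 0.88
print(noisy_or_fever([]))                 # 0.0: no cause present, no fever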
• Notation Revisited:
The notation used in inferencing is same as the one used in probability theory.
X: Query variable.
E: The set of evidence variables E1, ..., Em, and 'e' is the particular observed event.
Y: The set of non-evidence variables Y1, Y2, ..., Yk (non-evidence variables are also called hidden variables).
• Generally the query requires the posterior probability distribution P(X | e) [assuming
that query variable is not among the evidence variables, if it is, then posterior
distribution for X simply gives probability 1 to the observed value]. [Note that query
can contain more than one variable. For study purpose we are assuming single
variable].
• Example: In the burglary case, if the observed event is Jcalls = true and Mcalls = true, the query is 'Has a burglary occurred?'
2. Inference by Enumeration
• The algorithm
The algorithm ENUMERATE-JOINT-ASK gives inference by enumerating on full
joint distribution.
Characteristics of algorithm:
1) It takes as input a full joint distribution P and looks up values in it. [The same algorithm can be modified to take a Bayesian network as input, looking up joint entries by multiplying the corresponding conditional probability table entries from the Bayesian network.]
3) The drawback of the algorithm is that it keeps re-evaluating repeated sub-expressions, which results in wasted computation time.
• The algorithm
Function ENUMERATION-ASK (X, e, bn) returns a distribution over X.
bn, a Bayes net with variables {X} ∪ E ∪ Y  /* Y = hidden variables */
Q(X) ← a distribution over X, initially empty
for each value xi of X do
extend e with value xi for X
Q(xi) ← ENUMERATE-ALL(VARS(bn), e)
return NORMALIZE(Q(X))
In ENUMERATE-ALL(vars, e): Y ← FIRST(vars)
if Y has a value y in e
then return P(y | parents(Y)) × ENUMERATE-ALL(REST(vars), e)
else return Σy P(y | parents(Y)) × ENUMERATE-ALL(REST(vars), ey)
Example:
Consider the query P(B | j, m) = P(Burglary | Jcalls = true, Mcalls = true).
Hence, evaluating the expression by enumeration gives P(B | j, m) ≈ <0.284, 0.716>.
That is, the chance of burglary, given calls from both neighbours, is about 28%.
Note: In Fig. 7.3.7 the evaluation proceeds top to bottom, multiplying values along each path and summing at the '+' nodes. Observe that there is repetition of the paths for j and m.
10. Explain the Complexity Involved in Exact Inferencing CO2
The variable elimination algorithm is more efficient than the enumeration algorithm
because it avoids repeated computations as well as drops irrelevant variables.
The variable elimination algorithm constructs factors during its operation. The space and time complexity of variable elimination is directly dependent on the size of the largest factor constructed during the operation. The factor construction is determined by the order of elimination of the variables and by the structure of the network, which affects both space and time complexity.
More efficient inference is possible on singly connected networks, which are also called polytrees. In a singly connected network there is at most one undirected path between any two nodes. Singly connected networks have the property that the time and space complexity of exact inference is linear in the size of the network, where the size is defined as the number of CPT entries. If the number of parents of each node is bounded by a constant, then the complexity will also be linear in the number of nodes.
For example: The burglary network shown in Fig. 7.3.2 is a polytree.
In multiply connected networks (in which there can be multiple undirected paths between two nodes, and more than one directed path between some pairs of nodes), variable elimination takes exponential time and space in the worst case, even when the number of parents per node is bounded. It should be noted that variable elimination includes inference in propositional logic as a special case, and that inference in Bayesian networks is NP-hard; in fact, it is strictly harder than NP-complete problems.
Clustering algorithm:
1) Clustering algorithms (also known as join tree algorithms) can reduce inferencing time to O(n). In clustering, individual nodes of the network are joined to form cluster nodes in such a way that the resulting network is a polytree.
For example: The multiply connected network shown in Fig. 7.3.8 (a) can be converted into a polytree by combining the Sprinkler and Rain nodes into a cluster node called Sprinkler + Rain, as shown in Fig. 7.3.8 (b). The two Boolean nodes are replaced by a meganode that takes on four possible values: TT, TF, FT, FF. The meganode has only one parent, the Boolean variable Cloudy, so there are two conditioning cases.
Peculiarities of algorithm:
2) With careful book keeping, this algorithm is able to compute posterior probabilities
for all the non-evidence nodes in the network in time O(n), where n is now the size of
the modified network.
11. Consider an incandescent bulb manufacturing unit. Machines M1, M2 and M3 make 30%, 30% and 40% of the total output of bulbs, and assume that 2%, 3% and 4% of their respective outputs are defective. A bulb is drawn at random and is found to be defective. What is the probability that the bulb was made by machine M1, M2 or M3? CO2
Solution:
• Let E1, E2 and E3 be the events that a bulb selected at random is made by machine
M1, M2 and M3.
• Let Q denote that it is defective.
Prob (E1) = 0.3
Prob (E2) = 0.3 and Prob (E3) = 0.4 (given data),
These represent the prior probabilities.
Prob(Q | E1) = 0.02, Prob(Q | E2) = 0.03, Prob(Q | E3) = 0.04
Prob(E1 | Q) = Prob(E1) × Prob(Q | E1) / Σ(i=1 to 3) Prob(Ei) × Prob(Q | Ei)
= (0.3 × 0.02) / ((0.3 × 0.02) + (0.3 × 0.03) + (0.4 × 0.04))
= 0.1935
Similarly,
Prob(E2 | Q) = (0.3 × 0.03) / ((0.3 × 0.02) + (0.3 × 0.03) + (0.4 × 0.04))
= 0.2903
Prob(E3 | Q) = 1 − (0.1935 + 0.2903)
= 0.5162
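The same posterior computation expressed as a short Python sketch:

priors = {"M1": 0.30, "M2": 0.30, "M3": 0.40}        # Prob(Ei)
defect_rate = {"M1": 0.02, "M2": 0.03, "M3": 0.04}   # Prob(Q | Ei)

# Total probability of drawing a defective bulb
p_defective = sum(priors[m] * defect_rate[m] for m in priors)

# Posterior Prob(Ei | Q) for each machine, by Bayes' theorem
posterior = {m: priors[m] * defect_rate[m] / p_defective for m in priors}
print(posterior)   # M1: 0.1935..., M2: 0.2903..., M3: 0.5161...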
12. Explain Bayesian Belief Network in artificial intelligence CO2
A Bayesian belief network is a key computer technology for dealing with probabilistic events and for solving problems that involve uncertainty. We can define a Bayesian network as a probabilistic graphical model which represents a set of variables and their conditional dependencies using a directed acyclic graph.
Bayesian networks are probabilistic, because these networks are built from a
probability distribution, and also use probability theory for prediction and anomaly
detection.
Real world applications are probabilistic in nature, and to represent the relationship
between multiple events, we need a Bayesian network. It can also be used in various
tasks including prediction, anomaly detection, diagnostics, automated insight,
reasoning, time series prediction, and decision making under uncertainty.
Bayesian Network can be used for building models from data and experts opinions,
and it consists of two parts:
The generalized form of a Bayesian network that represents and solves decision problems under uncertain knowledge is known as an influence diagram.
A Bayesian network graph is made up of nodes and Arcs (directed links), where:
o Each node corresponds to the random variables, and a variable can
be continuous or discrete.
o Arc or directed arrows represent the causal relationship or conditional
probabilities between random variables. These directed links or arrows
connect the pair of nodes in the graph.
These links represent that one node directly influences the other node; if there is no directed link between two nodes, they are independent of each other.
o In the above diagram, A, B, C, and D are random variables
represented by the nodes of the network graph.
o If we are considering node B, which is connected with node A by
a directed arrow, then node A is called the parent of Node B.
o Node C is independent of node A.
Note: The Bayesian network graph does not contain any cycles. Hence, it is known as a directed acyclic graph or DAG.
o Causal Component
o Actual numbers
Each node in the Bayesian network has a conditional probability distribution P(Xi | Parents(Xi)), which determines the effect of the parents on that node.
If we have variables x1, x2, x3, ..., xn, then the probabilities of the different combinations of x1, x2, x3, ..., xn are known as the joint probability distribution.
P[x1, x2, x3, ..., xn] can be written in the following way in terms of the joint probability distribution:
P[x1, x2, ..., xn] = P[x1 | x2, ..., xn] P[x2 | x3, ..., xn] ... P[xn−1 | xn] P[xn]
In general, for each variable Xi we can write the equation as:
P(Xi | Xi−1, ..., X1) = P(Xi | Parents(Xi))
Part-C
1. Calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has occurred, and David and Sophia have both called Harry. CO2
Solution:
o The Bayesian network for the above problem is given below. The network structure shows that Burglary and Earthquake are the parent nodes of Alarm and directly affect the probability of the alarm going off, while David's and Sophia's calls depend on the alarm probability.
o The network represents that our neighbours do not directly perceive the burglary and do not notice minor earthquakes, and they do not confer before calling.
o The conditional distributions for each node are given as conditional
probabilities table or CPT.
o Each row in the CPT must sum to 1, because all the entries in the table represent an exhaustive set of cases for the variable.
o In a CPT, a Boolean variable with k Boolean parents contains 2^k probabilities. Hence, if there are two parents, then the CPT will contain 4 probability values.
o Burglary (B)
o Earthquake(E)
o Alarm(A)
o David Calls(D)
o Sophia calls(S)
We can write the events of the problem statement in the form of probability as P[D, S, A, ¬B, ¬E]. We can rewrite this probability statement using the joint probability distribution:
P[D, S, A, ¬B, ¬E] = P[D | A] P[S | A] P[A | ¬B, ¬E] P[¬B] P[¬E]
Let's take the observed probabilities for the Burglary and Earthquake components:
P(B = True) = 0.002, the probability of a burglary.
P(B = False) = 0.998, the probability of no burglary.
P(E = True) = 0.001, the probability of a minor earthquake.
P(E = False) = 0.999, which is the probability that an earthquake has not occurred.
The conditional probability that David will call depends on the probability of the Alarm: P(D = True | A = True) = 0.91.
The conditional probability that Sophia calls depends on its parent node Alarm: P(S = True | A = True) = 0.75.
The conditional probability of the alarm given that neither a burglary nor an earthquake has occurred is P(A = True | ¬B, ¬E) = 0.001.
Substituting these values,
P[D, S, A, ¬B, ¬E] = 0.91 × 0.75 × 0.001 × 0.998 × 0.999
= 0.00068045
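The same joint-probability calculation as a minimal Python sketch, using the CPT entries listed above:

# CPT entries used in the calculation above
p_not_b = 0.998        # P(¬B)
p_not_e = 0.999        # P(¬E)
p_a = 0.001            # P(A | ¬B, ¬E)
p_d_given_a = 0.91     # P(D | A)
p_s_given_a = 0.75     # P(S | A)

# P[D, S, A, ¬B, ¬E] = P(D|A) P(S|A) P(A|¬B,¬E) P(¬B) P(¬E)
p = p_d_given_a * p_s_given_a * p_a * p_not_b * p_not_e
print(round(p, 8))     # 0.00068045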
Hence, a Bayesian network can answer any query about the domain by using
Joint distribution.
There are two ways to understand the semantics of a Bayesian network, which are given below:
1. To see the network as a representation of the joint probability distribution.
2. To see the network as an encoding of a collection of conditional independence statements.
The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which can be described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of the other features.
o Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on
the conditional probability.
o The formula for Bayes' theorem is given as:
P(A | B) = P(B | A) P(A) / P(B)
where P(A | B) is the posterior probability, P(B | A) is the likelihood probability, P(A) is the prior probability, and P(B) is the marginal probability.
Working of Naïve Bayes' Classifier can be understood with the help of the below
example:
3. If the weather is sunny, should the player play or not? Calculate using the Naïve Bayes algorithm. CO2
     Outlook      Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table for the Weather Conditions:
Weather Yes No
Overcast 5 0
Rainy 2 2
Sunny 3 2
Total 10 4
Likelihood table of the weather conditions:
Weather     No            Yes           Total
Overcast    0             5             5/14 = 0.35
Rainy       2             2             4/14 = 0.29
Sunny       2             3             5/14 = 0.35
All         4/14 = 0.29   10/14 = 0.71
Applying Bayes' theorem:
P(Yes | Sunny) = P(Sunny | Yes) × P(Yes) / P(Sunny)
P(Sunny | Yes) = 3/10 = 0.3, P(Sunny) = 0.35, P(Yes) = 0.71
So P(Yes | Sunny) = 0.3 × 0.71 / 0.35 = 0.60
P(No | Sunny) = P(Sunny | No) × P(No) / P(Sunny)
P(Sunny | No) = 2/4 = 0.5, P(No) = 0.29, P(Sunny) = 0.35
So P(No | Sunny) = 0.5 × 0.29 / 0.35 = 0.41
Since P(Yes | Sunny) > P(No | Sunny), on a sunny day the player can play the game.
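The same Naïve Bayes calculation can be written as a small Python sketch over the frequency table (here the two class scores are normalised, so they sum to 1):

# Counts from the frequency table: outlook -> (yes_count, no_count)
counts = {"Overcast": (5, 0), "Rainy": (2, 2), "Sunny": (3, 2)}
total_yes, total_no = 10, 4
total = total_yes + total_no

def posterior(outlook):
    # Score for each class: P(outlook | class) * P(class); P(outlook) cancels out
    yes_count, no_count = counts[outlook]
    score_yes = (yes_count / total_yes) * (total_yes / total)
    score_no = (no_count / total_no) * (total_no / total)
    alpha = 1 / (score_yes + score_no)
    return {"Yes": alpha * score_yes, "No": alpha * score_no}

print(posterior("Sunny"))   # Yes = 0.6, No = 0.4, so the player can play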
o Naïve Bayes is one of the fast and easy ML algorithms to predict a class of
datasets.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other
Algorithms.
o It is the most popular choice for text classification problems.