Unit II

The document discusses probabilistic reasoning, focusing on concepts such as Bayesian inference, uncertainty, and Bayesian networks. It explains the causes of uncertainty, defines key terms like Bayes' theorem and prior probability, and outlines various approximation methods. Additionally, it covers applications of Bayesian networks, Naïve Bayes classifiers, and the importance of probabilistic reasoning in artificial intelligence.

Uploaded by

RAJASEKAR M

Unit – II

PROBABILISTIC REASONING
Acting under uncertainty – Bayesian inference – naïve bayes models. Probabilistic reasoning – Bayesian networks
– exact inference in BN – approximate inference in BN – causal networks.
Part-A
1. Why does uncertainty arise? CO2 May 2023
Agents almost never have access to the whole truth about their environment.
Uncertainty arises because of both laziness and ignorance. It is inescapable in complex,
nondeterministic, or partially observable environments.
 Agents cannot find a categorical answer.
 Uncertainty can also arise because of incompleteness or incorrectness in the agent's
understanding of the properties of the environment.
2. Define Bayes' theorem. CO2
In probability theory and its applications, Bayes' theorem (alternatively called Bayes'
law or Bayes' rule) links a conditional probability to its inverse.

P(b | a) = P(a | b) P(b) / P(a)

This equation is called Bayes' rule or Bayes' theorem.
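The rule can be checked numerically; a minimal sketch (the three input probabilities below are purely illustrative):

```python
# Bayes' rule: P(b | a) = P(a | b) * P(b) / P(a)
def bayes(p_a_given_b, p_b, p_a):
    """Return P(b | a) from the inverse conditional P(a | b) and the two priors."""
    return p_a_given_b * p_b / p_a

# Illustrative numbers only: P(a | b) = 0.8, P(b) = 0.01, P(a) = 0.05
posterior = bayes(0.8, 0.01, 0.05)
print(posterior)  # ≈ 0.16
```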


3. Define prior probability. CO2
The prior (unconditional) probability is associated with a proposition 'a'. The prior
probability is the degree of belief accorded to a proposition in the absence of any other
information.

It is written as P(a). For example, if the probability that Ram has a cavity is 0.1, then the
prior probability is written as,

P(Cavity = true) = 0.1 or P(cavity) = 0.1


4. State the types of approximation methods. CO2
 For approximate inference, randomized sampling (Monte Carlo) algorithms are
used. There are two approximation methods used in randomized sampling:
1) direct sampling algorithms and 2) Markov chain sampling algorithms.
 In direct sampling, samples are generated from a known probability
distribution. In Markov chain sampling, each event is generated by making a
random change to the preceding event.
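A minimal sketch of direct sampling on a two-variable network (Cloudy → Rain); the network and its probabilities are made up for illustration:

```python
import random

# Direct sampling: generate each variable in topological order,
# sampling from its (conditional) distribution given its parents.
def prior_sample(rng):
    cloudy = rng.random() < 0.5                  # P(Cloudy) = 0.5
    p_rain = 0.8 if cloudy else 0.2              # P(Rain | Cloudy)
    rain = rng.random() < p_rain
    return cloudy, rain

rng = random.Random(0)
samples = [prior_sample(rng) for _ in range(10000)]
# Fraction of samples with Rain=true estimates P(rain) = 0.5*0.8 + 0.5*0.2 = 0.5
est = sum(1 for _, r in samples if r) / len(samples)
print(est)
```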

5. What do you mean by hybrid Bayesian network? CO2

A network with both discrete and continuous variables is called a hybrid Bayesian
network. In a hybrid Bayesian network, a continuous variable is represented by
discretizing it into intervals, because it can take infinitely many values.

To specify a hybrid network, two kinds of distributions are needed: the conditional
distribution for a continuous variable given discrete or continuous parents, and the
conditional distribution for a discrete variable given continuous parents.

6. Give the full specification of a Bayesian network. CO2

Definition: It is a data structure, a graph in which each node is annotated with
quantitative probability information.
The nodes and edges of the graph are specified as follows:
1) A set of random variables makes up the nodes of the network. Variables may be
discrete or continuous.
2) A set of directed links or arrows connects pairs of nodes. If there is an arrow from
node X to node Y, then X is said to be a parent of Y.
3) Each node Xi has a conditional probability distribution P(Xi | Parents(Xi)) that
quantifies the effect of the parents on the node.
4) The graph has no directed cycles (and hence is a directed acyclic graph, or DAG).
The set of nodes and links is called the topology of the network.
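The four parts of the specification can be mirrored directly in a small data structure; a sketch using a fragment of the classic Burglary network (the CPT numbers follow the standard textbook example):

```python
# Each node: a list of parents plus a CPT mapping parent-value tuples -> P(node = true).
# Topology: Burglary -> Alarm <- Earthquake (a DAG, no directed cycles).
network = {
    "Burglary":   {"parents": [], "cpt": {(): 0.001}},
    "Earthquake": {"parents": [], "cpt": {(): 0.002}},
    "Alarm": {
        "parents": ["Burglary", "Earthquake"],
        "cpt": {(True, True): 0.95, (True, False): 0.94,
                (False, True): 0.29, (False, False): 0.001},
    },
}

def p_true(node, assignment):
    """P(node = true | parent values taken from assignment)."""
    spec = network[node]
    key = tuple(assignment[p] for p in spec["parents"])
    return spec["cpt"][key]

print(p_true("Alarm", {"Burglary": True, "Earthquake": False}))  # 0.94
```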
7. Define uncertainty. CO2
An agent working in a real-world environment almost never has access to the whole
truth about its environment. Therefore, the agent needs to work under uncertainty.

The agents we have seen earlier make the epistemological commitment that the facts
(expressed as propositions) are true, false, or else unknown. When an agent
knows enough facts about its environment, the logical approach enables it to derive
plans that are guaranteed to work.

But when an agent works with uncertain knowledge, it might be impossible to
construct a complete and correct description of how its actions will work. If a logical
agent cannot conclude that any particular course of action achieves its goal, then it will
be unable to act.
8. What is reasoning by default? CO2
We can do qualitative reasoning using techniques such as default reasoning. Default reasoning
treats conclusions not as "believed to a certain degree", but as "believed until a better reason
is found to believe something else".
9. Define Conditional Probability CO2
1) When an agent obtains evidence concerning previously unknown random variables in the
domain, prior probabilities are no longer used. Based on the new information, conditional or
posterior probabilities are calculated.

2) The notation is P(a | b), where a and b are any propositions.

P(a | b) is read as "the probability of a given that all we know is b". That is, when b is known,
it indicates the probability of a.
10. State the axioms of probability. CO2

Axioms give the semantics of probability statements. The basic axioms (Kolmogorov's axioms)
serve to define the probability scale and its end points.

1) All probabilities are between 0 and 1. For any proposition a, 0 ≤ P(a) ≤ 1.

2) Necessarily true (i.e., valid) propositions have probability 1, and necessarily false (i.e.,
unsatisfiable) propositions have probability 0.

P(true) = 1, P(false) = 0

3) The probability of a disjunction is given by

P(a v b) = P(a) + P(b) - P(a Ʌ b)
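The disjunction axiom can be checked on a concrete case; the die-roll events below are illustrative:

```python
# Check axiom 3 on a die roll: a = "roll is even", b = "roll <= 3".
omega = {1, 2, 3, 4, 5, 6}
a = {2, 4, 6}
b = {1, 2, 3}
P = lambda event: len(event) / len(omega)

# P(a v b) = P(a) + P(b) - P(a ^ b): the intersection {2} would be counted twice.
lhs = P(a | b)                      # set union = disjunction
rhs = P(a) + P(b) - P(a & b)        # set intersection = conjunction
print(lhs, rhs)  # both 5/6
```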

11. Define a Hybrid Bayesian Network CO2

A network with both discrete and continuous variables is called a hybrid Bayesian
network. A continuous variable is represented by discretizing it into intervals
(because it can take infinitely many values).

To specify a hybrid network, we have to specify two new kinds of distributions: the
conditional distribution for a continuous variable given discrete or continuous parents, and
the conditional distribution for a discrete variable given continuous parents.

12. Specify the notation used in exact inference CO2

The notation used in inference is the same as that used in probability theory.

X: the query variable.

E: the set of evidence variables E1, ..., Em, and 'e' is the particular observed event.

Y: the set of non-evidence variables Y1, Y2, ..., Yk [non-evidence variables are also called
hidden variables].

X: the complete set of all variables, where X = {X} U E U Y.

13. Define Inference by Enumeration CO2

A Bayesian network gives a complete representation of the full joint distribution. This full
joint distribution can be written as a product of conditional probabilities from the Bayesian
network.

A query can be answered using a Bayesian network by computing sums of products of
conditional probabilities from the network.
14. What is a causal network? CO2

 A causal network is a directed network which illustrates the causal dependencies of all
the components in the network.
 A causal relationship exists when one variable in a data set has a direct influence on
another variable. Thus, one event triggers the occurrence of another event. A causal
relationship is also referred to as cause and effect.
 The ability to identify truly causal relationships is fundamental to developing
impactful interventions in medicine, policy, business, and other domains.

15. State the working of the Variable Elimination algorithm CO2

1) It works by evaluating expressions such as P(b | j, m) = α P(b) Σe P(e) Σa P(a | b, e)
P(j | a) P(m | a) in right-to-left order.

2) Intermediate results are stored, and summations over each variable are done only for those
portions of the expression that depend on the variable.

For example: In the Burglary network, the expression above is evaluated in exactly this way.

3) Factors: Each part of the expression is annotated with the name of the associated
variable; these parts are called factors.
16. Mention the working of the clustering algorithm CO2

1) In clustering algorithms (also known as join tree algorithms) the inference time can be
reduced to O(n). In clustering, individual nodes of the network are joined to form cluster
nodes in such a way that the resulting network is a polytree.

2) The variable elimination algorithm is efficient for answering individual
queries. But if posterior probabilities are to be computed for all the variables in the
network, it can be less efficient: even in a polytree network it needs to issue O(n) queries
costing O(n) each, for a total of O(n²) time. The clustering algorithm improves on this.

17. What is Rejection Sampling in Bayesian Networks? CO2

1) Rejection sampling is a method for producing samples from a hard-to-sample distribution
given an easy-to-sample distribution.

2) It can be used to compute conditional probabilities; that is, to determine P(X | e).

3) The Rejection Sampling algorithm is:

function REJECTION-SAMPLING (X, e, bn, N) returns an estimate of P(X | e)

inputs: X, the query variable

e, evidence specified as an event

bn, a Bayesian network

N, the total number of samples to be generated

local variables: C, a vector of counts over the values of X, initially zero

for j = 1 to N do

x ← PRIOR-SAMPLE (bn)
if x is consistent with e then C[x] ← C[x] + 1, where x is the value of X in x

return NORMALIZE (C)

The rejection sampling algorithm for answering queries given evidence in a Bayesian
network.
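A runnable sketch of the algorithm on a tiny two-variable network (Cloudy → Rain), estimating P(Cloudy | Rain = true); the network and its probabilities are made up for illustration:

```python
import random

def prior_sample(rng):
    """Sample one complete event from the (illustrative) network."""
    cloudy = rng.random() < 0.5                        # P(Cloudy) = 0.5
    rain = rng.random() < (0.8 if cloudy else 0.2)     # P(Rain | Cloudy)
    return {"Cloudy": cloudy, "Rain": rain}

def rejection_sampling(query, evidence, n, rng):
    counts = {True: 0, False: 0}
    for _ in range(n):
        event = prior_sample(rng)
        # Reject any sample inconsistent with the evidence
        if all(event[var] == val for var, val in evidence.items()):
            counts[event[query]] += 1
    total = counts[True] + counts[False]
    return {v: c / total for v, c in counts.items()}   # the NORMALIZE step

rng = random.Random(1)
est = rejection_sampling("Cloudy", {"Rain": True}, 50000, rng)
# Exact answer: P(cloudy | rain) = (0.5 * 0.8) / 0.5 = 0.8
print(est[True])
```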

18. Mention the advantages of Bayesian Networks CO2

 Bayesian networks offer a graphical representation that is reasonably interpretable and
easily explainable.
 Relationships captured between variables in a Bayesian network are more complex,
yet hopefully more informative, than in a conventional model.
 Models can reflect both statistically significant information (learned from the data)
and domain expertise simultaneously.

19. Define the naïve Bayes classifier algorithm CO2

 The naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes'
theorem and used for solving classification problems.
 It is mainly used in text classification, which involves high-dimensional training
datasets.
 The naïve Bayes classifier is one of the simplest and most effective classification
algorithms; it helps in building fast machine learning models that can make
quick predictions.
 It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
 Some popular applications of the naïve Bayes algorithm are spam filtering,
sentiment analysis, and classifying articles.
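A from-scratch sketch of the classifier on a toy spam-filtering task; the tiny training set and word lists are made up purely for illustration:

```python
from collections import Counter, defaultdict
import math

# Toy training data: (words, label); purely illustrative.
train = [
    (["win", "money", "now"], "spam"),
    (["win", "prize"], "spam"),
    (["meeting", "today"], "ham"),
    (["project", "meeting", "now"], "ham"),
]

# Class priors and per-class word frequencies.
class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
vocab = set()
for words, label in train:
    word_counts[label].update(words)
    vocab.update(words)

def predict(words):
    """Pick the class maximizing log P(c) + Σ log P(w | c), with Laplace smoothing."""
    best, best_score = None, -math.inf
    for c, cc in class_counts.items():
        total = sum(word_counts[c].values())
        score = math.log(cc / len(train))
        for w in words:
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

print(predict(["win", "money"]))   # spam
print(predict(["meeting"]))        # ham
```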

20. In a class, 70% of the students like English and 40% of the students like both CO2
English and mathematics. What percentage of the students who like English
also like mathematics?
Let A be the event that a student likes Mathematics, and
B be the event that a student likes English.

P(A | B) = P(A Ʌ B) / P(B) = 0.4 / 0.7 ≈ 0.57

Hence, 57% of the students who like English also like Mathematics.
21. List some applications of Bayes' theorem CO2

 It is used to calculate the next step of a robot when the already executed step
is given.
 Bayes' theorem is helpful in weather forecasting.
 It can solve the Monty Hall problem.

22. What is the probability that a patient has the disease meningitis with a stiff neck? CO2
Given Data:
A doctor is aware that the disease meningitis causes a patient to have a stiff neck 80%
of the time. He is also aware of some more facts, which are given as follows:
o The known probability that a patient has meningitis is 1/30,000.

o The known probability that a patient has a stiff neck is 2%.

Let a be the proposition that the patient has a stiff neck and b be the proposition that the
patient has meningitis, so we can calculate the following:
P(a | b) = 0.8
P(b) = 1/30,000
P(a) = 0.02

P(b | a) = P(a | b) P(b) / P(a) = (0.8 × 1/30,000) / 0.02 = 1/750 ≈ 0.00133

Hence, we can assume that 1 patient out of 750 patients has meningitis with
a stiff neck.
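The computation can be checked directly with the figures from the question:

```python
# P(b | a) = P(a | b) * P(b) / P(a)
p_a_given_b = 0.8        # P(stiff neck | meningitis)
p_b = 1 / 30000          # P(meningitis)
p_a = 0.02               # P(stiff neck)

p_b_given_a = p_a_given_b * p_b / p_a
print(p_b_given_a)       # ≈ 0.00133, i.e. 1/750
```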
23. Define Probability CO2
Probability: Probability can be defined as the chance that an uncertain event will occur. It is
the numerical measure of the likelihood that an event will occur. The value of a probability
always lies between 0 and 1.
 0 ≤ P(A) ≤ 1, where P(A) is the probability of an event A.
 P(A) = 0 indicates that event A is impossible.
 P(A) = 1 indicates total certainty in an event A.
We can find the probability of an uncertain event by using the formula:

Probability of occurrence = Number of desired outcomes / Total number of outcomes

 P(¬A) = probability of event A not happening.

 P(¬A) + P(A) = 1.

24. What is Probabilistic Reasoning? CO2

 Probabilistic reasoning is a way of knowledge representation in which we apply
the concept of probability to indicate the uncertainty in knowledge. In
probabilistic reasoning, we combine probability theory with logic to handle
uncertainty.

 We use probability in probabilistic reasoning because it provides a way to
handle the uncertainty that results from laziness and ignorance.

 In the real world, there are many scenarios where the certainty of something
is not confirmed, such as "It will rain today", "the behaviour of someone in some
situation", or "a match between two teams or two players". These are probable
sentences: we can assume they will happen but cannot be sure of it, so
here we use probabilistic reasoning.
25. What is the need for probabilistic reasoning in AI? CO2

 When there are unpredictable outcomes.
 When the specifications or possibilities of predicates become too large to handle.
 When an unknown error occurs during an experiment.
In probabilistic reasoning, there are two ways to solve problems with uncertain
knowledge:
o Bayes' rule

o Bayesian statistics

Part-B
1. Explain acting under uncertainty with relevant examples CO2
An agent working in a real-world environment almost never has access to the whole
truth about its environment. Therefore, the agent needs to work under uncertainty.
The agents we have seen earlier make the epistemological commitment that
the facts (expressed as propositions) are true, false, or else unknown.
When an agent knows enough facts about its environment, the logical approach
enables it to derive plans that are guaranteed to work.
But when an agent works with uncertain knowledge, it might be impossible to
construct a complete and correct description of how its actions will work. If a
logical agent cannot conclude that any particular course of action achieves its
goal, then it will be unable to act.
The right thing a logical agent can do is take a rational decision. The rational
decision depends on the following things:
• The relative importance of the various goals.
• The likelihood of, and the degree to which, the goals will be achieved.

Acting Under Uncertainty


An agent would possess some basic early knowledge of the world (assume
that the knowledge is represented in first-order logic sentences). Using first-order
logic to handle real-world problem domains fails for three main reasons, as
discussed below:
1) Laziness:
It is too much work to list the complete set of antecedents or consequents
needed to ensure an exceptionless rule, and too hard to use such rules.
2) Theoretical ignorance:
A particular problem may not have a complete theory for the domain.
3) Practical ignorance:
Even if all the rules are known, particular aspects of the problem may not have
been checked yet, or some details may not have been considered at all (missing
details).
The agent's knowledge can provide it with a degree of belief in the relevant
sentences. To this degree of belief, probability theory is applied. Probability
assigns a numerical degree of belief between 0 and 1 to each sentence.
Probability provides a way of summarizing the uncertainty that comes from
our laziness and ignorance.
Assigning a probability of 0 to a given sentence corresponds to an unequivocal
belief that the sentence is false. Assigning a probability of 1 corresponds to
an unequivocal belief that the sentence is true. Probabilities between 0
and 1 correspond to intermediate degrees of belief in the truth of the sentence.
The beliefs depend completely on the percepts of the agent at a particular time.
These percepts constitute the evidence on which probability assertions are based.
Assigning a probability to a proposition is analogous to saying whether
the given logical sentence (or its negation) is entailed by the knowledge base,
rather than whether it is true or not. As more sentences are added to the
knowledge base, the entailments keep changing. Similarly, the probability
would also keep changing with additional knowledge.
All probability statements must therefore indicate the evidence with respect to
which the probability is being assessed. As the agent receives new percepts, its
probability assessments are updated to reflect the new evidence. Before the
evidence is obtained, we talk about prior or unconditional probability; after the
evidence is obtained, we talk about posterior or conditional probability. In
most cases, an agent will have some evidence from its percepts and will be
interested in computing the posterior probabilities of the outcomes it cares
about.
Uncertainty and rational decisions:
The presence of uncertainty drastically changes the way an agent makes
decisions. At a particular time an agent can have various available decisions,
from which it has to make a choice. To make such choices an agent must have
preferences between the different possible outcomes of the various plans.
A particular outcome is a completely specified state, along with the expected
factors related to the outcome.
For example: Consider a car-driving agent who wants to reach the airport by a
specific time, say 7.30 pm.
Here, factors like whether the agent arrived at the airport on time, and how long
the wait at the airport was, are attached to the outcome.

Utility Theory
Utility theory is used to represent and reason with preferences. The term utility
in the current context is used as "the quality of being useful".
Utility theory says that every state has a degree of usefulness, called utility.
The agent will prefer states with higher utility.
The utility of a state is relative to the agent whose preferences the utility
function is calculated from.
For example: The payoff functions for games are utility functions. The utility
of a state in which black has won a game of chess is obviously high for the
agent playing black and low for the agent playing white.
There is no measure that can account for tastes or preferences. Someone loves
deep chocolate ice cream and someone loves choco-chip ice cream. A utility
function can even account for altruistic behavior, simply by including the welfare
of others as one of the factors contributing to the agent's own utility.
Decision theory
Preferences, as expressed by utilities, are combined with probabilities for
making rational decisions. This theory of rational decision making is called
decision theory.
Decision theory can be summarized as,
Decision theory = Probability theory + Utility theory.
• The principle of Maximum Expected Utility (MEU):
Decision theory says that the agent is rational if and only if it chooses the
action that yields the highest expected utility, averaged over all the possible
outcomes of the action.
• Design for a decision-theoretic agent:
The following algorithm sketches the structure of an agent that uses decision
theory to select actions.
The algorithm
Function: DT-AGENT (percept) returns an action.
Static: belief-state, probabilistic beliefs about the current state of the world.
action, the agent's action.
- Update belief-state based on action and percept
- Calculate outcome probabilities for actions, given action descriptions
and current belief-state
- Select the action with highest expected utility given probabilities of
outcomes and utility information
- Return action.
A decision-theoretic agent that selects rational actions.
The decision-theoretic agent is identical, at an abstract level, to the logical
agent. The primary difference is that the decision-theoretic agent's knowledge
of the current state is uncertain; the agent's belief state is a representation of the
probabilities of all possible actual states of the world.
As time passes, the agent accumulates more evidence and its belief state
changes. Given the belief state, the agent can make probabilistic predictions of
action outcomes and hence select the action with the highest expected utility.
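The "select the action with highest expected utility" step can be sketched concretely; the actions, outcome probabilities, and utilities below are purely illustrative (echoing the airport example):

```python
# Each action maps to a list of (probability, utility) outcome pairs.
actions = {
    "leave_at_6pm": [(0.95, 100), (0.05, -1000)],   # almost surely on time, long wait
    "leave_at_7pm": [(0.60, 120), (0.40, -1000)],   # shorter wait, riskier
}

def expected_utility(outcomes):
    """Utility averaged over all possible outcomes of the action."""
    return sum(p * u for p, u in outcomes)

# MEU principle: choose the action with the highest expected utility.
best = max(actions, key=lambda a: expected_utility(actions[a]))
print(best, expected_utility(actions[best]))
```

Here EU(leave_at_6pm) = 0.95×100 − 0.05×1000 = 45, while EU(leave_at_7pm) = −328, so the cautious action is chosen despite its longer wait.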

2. Explain unconditional and conditional probability CO2

1) The prior (unconditional) probability is associated with a proposition 'a'.

2) It is the degree of belief accorded to a proposition in the absence of any other information.

3) It is written as P(a).

For example: If the probability that Ram has a cavity is 0.1, then the prior probability is
written as, P(Cavity = true) = 0.1 or P(Cavity) = 0.1

4) It should be noted that as soon as new information is received, one should reason
with the conditional probability of 'a' given the new information.

5) When it is required to express the probabilities of all the possible values of a random
variable, a vector of values is used. It is represented using P(a). This gives values for the
probabilities of each individual state of 'a'.

For example: P (Weather) = < 0.7, 0.2, 0.08, 0.02 > is representing four equations

P (Weather = Sunny) = 0.7

P (Weather = Rain) = 0.2

P (Weather = Cloudy) = 0.08

P (Weather = Cold) = 0.02

6) The expression P(a) is said to define the prior probability distribution for the
random variable 'a'.

7) To denote the probabilities of all combinations of random variables, the expression
P(a1, a2) can be used. This is called the joint probability distribution of the random
variables a1, a2. Any number of random variables can be mentioned in the expression.

8) A simple example of a joint probability distribution:

P(Weather, Cavity) can be represented as a 4×2 table of probabilities

(Weather's probabilities × Cavity's probabilities).

9) A joint probability distribution that covers the complete set of random variables is
called the full joint probability distribution.

10) A simple example of a full joint probability distribution:

If the problem world consists of the 3 random variables Weather, Cavity and Toothache,
then the full joint probability distribution would be

P(Weather, Cavity, Toothache)

It will be represented as a 4×2×2 table of probabilities.

11) Prior probability for a continuous random variable:

i) For a continuous random variable it is not feasible to represent a vector of all possible
values, because the values are infinite. For a continuous random variable the probability
is defined as a function with parameter x, which indicates that the random variable takes
some value x.

For example: Let the random variable X denote tomorrow's temperature in Chennai. It
would be represented as

P(X = x) = U[25, 37](x).

This sentence expresses the belief that X is distributed uniformly between 25 and 37
degrees Celsius.

ii) The probability distribution for a continuous random variable is a probability density
function.

Conditional Probability

1) When an agent obtains evidence concerning previously unknown random variables in
the domain, prior probabilities are no longer used. Based on the new information,
conditional or posterior probabilities are calculated.

2) The notation is P(a | b), where a and b are any propositions.

P(a | b) is read as "the probability of a given that all we know is b". That is, when b is
known, it indicates the probability of a.

For example: P(Cavity | Toothache) = 0.8

It means that if the patient has a toothache (and no other information is known), then the
chances of having a cavity are 0.8.

3) Prior probabilities are in fact a special case of conditional probabilities. P(a) can be
viewed as the probability of 'a' conditioned on no evidence.

4) Conditional probability can be defined in terms of unconditional probabilities. The
equation is

P(a | b) = P(a Ʌ b) / P(b), which holds whenever P(b) > 0 …(7.1.1)

The above equation can also be written as

P(a Ʌ b) = P(a | b) P(b)

This is called the product rule. In other words it says, for 'a' and 'b' to be true we need
'b' to be true and we need 'a' to be true given b. It can also be written as

P(a Ʌ b) = P(b | a) P(a).

5) Conditional probabilities are used for probabilistic inference.

6) The P notation can be used for conditional distributions. P(X | Y) gives the values of
P(X = xi | Y = yj) for each possible i, j. The following are the individual equations:

P(X = x1 ^ Y = y1) = P(X = x1 | Y = y1) P(Y = y1)

P(X = x1 ^ Y = y2) = P(X = x1 | Y = y2) P(Y = y2)

These can be combined into a single equation:

P(X, Y) = P(X | Y) P(Y)

7) Conditional probabilities should not be treated as logical implications. That is,
"when 'b' holds, the probability of 'a' is something" is a conditional probability statement,
not to be mistaken for a logical implication. The two differ on two points. First, P(a)
always denotes a prior probability and requires no evidence. Second, P(a | b) = 0.7 is
immediately relevant only when b is the available evidence, and it keeps altering as
information is updated, whereas logical implications do not change over time.

The Probability Axioms

Axioms give the semantics of probability statements. The basic axioms (Kolmogorov's
axioms) serve to define the probability scale and its end points.

1) All probabilities are between 0 and 1. For any proposition a, 0 ≤ P(a) ≤ 1.

2) Necessarily true (i.e., valid) propositions have probability 1, and necessarily false
(i.e., unsatisfiable) propositions have probability 0.

P(true) = 1, P(false) = 0

3) The probability of a disjunction is given by

P(a v b) = P(a) + P(b) - P(a Ʌ b)

This axiom connects the probabilities of logically related propositions. The rule states
that the cases where 'a' holds, together with the cases where 'b' holds, certainly cover
all the cases where 'a v b' holds; but summing the two sets of cases counts their
intersection twice, so we need to subtract P(a Ʌ b).

3. Discuss in detail about Bayes' Rule CO2

Bayes' rule is derived from the product rule.

The product rule can be written as,

P(a Ʌ b) = P(a | b) P(b) .…(7.1.6)

P(a Ʌ b) = P(b | a) P(a) .... (7.1.7)

[because conjunction is commutative]

Equating the right sides of equations (7.1.6) and (7.1.7) and dividing by P(a),

P(b | a) = P(a | b) P(b) / P(a)

This equation is called Bayes' rule (or Bayes' theorem or Bayes' law). This rule is
very useful in probabilistic inference.

The generalized Bayes' rule is

P(Y | X) = P(X | Y) P(Y) / P(X)

(where P has the same meaning)

We can have a more general version, conditionalized on some background evidence e:

P(Y | X, e) = P(X | Y, e) P(Y | e) / P(X | e)

The general form of Bayes' rule with normalization is

P(y | x) = α P(x | y) P(y).

Applying Bayes' Rule:

1) It requires a total of three terms (one conditional probability and two unconditional
probabilities) for computing one conditional probability.

For example: The probability that a patient with low sugar has high blood pressure is
50%.

Let, m be the proposition 'patient has low sugar',

s be the proposition 'patient has high blood pressure'.

Suppose we assume that the doctor knows the following unconditional facts:

i) Prior probability of m = 1/50,000.

ii) Prior probability of s = 1/20.

Then we have,

P(s | m) = 0.5

P(m) = 1/50,000

P(s) = 1/20

P(m | s) = P(s | m) P(m) / P(s)

= (0.5 × 1/50,000) / (1/20)

= 0.0002

That is, we can expect that 1 in 5,000 patients with high blood pressure will have low
sugar.

2) Combining evidence in Bayes' rule.

Bayes' rule is helpful for answering queries conditioned on multiple evidences.

For example: When both the Toothache and Catch evidences are available, a cavity is
highly likely, which can be represented as

P(Cavity | Toothache ^ Catch) = α <0.108, 0.016> ≈ <0.871, 0.129>

Using Bayes' rule to reformulate the problem:

P(Cavity | Toothache ^ Catch) = α P(Toothache ^ Catch | Cavity) P(Cavity) ……(7.1.8)

For this reformulation to work, we need to know the conditional probabilities of the
conjunction Toothache ^ Catch for each value of Cavity. That might be feasible for just
two evidence variables, but again it will not scale up.

If there are n possible evidence variables (X-rays, diet, oral hygiene, etc.), then there
are 2^n possible combinations of observed values for which we would need to know
conditional probabilities.

The notion of independence can be used here. These variables are independent of each
other, given the presence or the absence of a cavity. Each is directly caused by the
cavity, but neither has a direct effect on the other: Toothache depends on the state of
the nerves in the tooth, whereas the probe's accuracy depends on the dentist's skill, to
which the toothache is irrelevant.

Mathematically, this property is written as,

P(Toothache ^ Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity) .…(7.1.9)

This equation expresses the conditional independence of Toothache and Catch, given
Cavity.

Substituting equation (7.1.9) into (7.1.8), we obtain the probability of a cavity:

P(Cavity | Toothache ^ Catch) = α P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)

Now the information requirements are the same as for inference using each piece of
evidence separately: the prior probability P(Cavity) for the query variable and the
conditional probability of each effect, given its cause.

Conditional independence assertions can allow probabilistic systems to scale up;
moreover, they are much more commonly available than absolute independence
assertions. When there are 'n' variables, given that they are all conditionally
independent, the size of the representation grows as O(n) instead of O(2^n).

For example:

Consider the dentistry example, in which a single cause directly influences a number of
effects, all of which are conditionally independent, given the cause.

The full joint distribution can be written as

P(Cause, Effect1, ..., Effectn) = P(Cause) Π i P(Effecti | Cause).

Such a probability distribution is called a naive Bayes model - "naive" because it is
often used (as a simplifying assumption) in cases where the "effect" variables are not
actually conditionally independent given the cause variable. The naive Bayes model is
sometimes called a Bayesian classifier.
4. Explain Inference using the Full Joint Distribution CO2
Probabilistic inference means the computation, from observed evidence, of posterior
probabilities for query propositions. The knowledge base used for answering the
query is represented as a full joint distribution. Consider a simple example consisting of
three boolean variables: Toothache, Cavity, Catch. The full joint distribution is a 2×2×2
table, as shown below.

Note that the probabilities in the joint distribution sum to 1.

• One particularly common task in inference is to extract the distribution over some
subset of variables or a single variable. This distribution over some variables or a single
variable is called a marginal probability.
For example: P(cavity) = 0.108 + 0.012 + 0.072 + 0.008 = 0.2

This process is called marginalization or summing out, because the variables other than
Cavity (whose probability is being computed) are summed out.

• The general marginalization rule is as follows. For any sets of variables Y and Z,

P(Y) = Σz P(Y, z) ….(7.1.3)

It indicates that the distribution over Y can be obtained by summing out all the other
variables from any joint distribution containing Y.

• A variant of this rule involves conditional probabilities instead of joint
probabilities, using the product rule:

P(Y) = Σz P(Y | z) P(z) ….(7.1.4)

This rule is called conditioning.

For example, computing the probability of a cavity, given evidence of a toothache, is
as follows:

P(Cavity | Toothache) = P(Cavity ∧ Toothache) / P(Toothache)
= (0.108 + 0.012) / (0.108 + 0.012 + 0.016 + 0.064) = 0.6
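Marginalization and conditioning can be checked directly against the full joint table from the text. A minimal sketch, storing the joint as a dictionary keyed by (toothache, catch, cavity) truth values:

```python
from itertools import product

# Full joint distribution P(Toothache, Catch, Cavity) from the text,
# keyed by (toothache, catch, cavity).
JOINT = {
    (True,  True,  True): 0.108, (True,  False, True): 0.012,
    (False, True,  True): 0.072, (False, False, True): 0.008,
    (True,  True,  False): 0.016, (True,  False, False): 0.064,
    (False, True,  False): 0.144, (False, False, False): 0.576,
}

def marginal_cavity(cavity):
    """P(Cavity) by summing out Toothache and Catch (equation 7.1.3)."""
    return sum(JOINT[(t, c, cavity)] for t, c in product([True, False], repeat=2))

def cavity_given_toothache(cavity):
    """P(Cavity | toothache) = P(Cavity ∧ toothache) / P(toothache)."""
    num = sum(JOINT[(True, c, cavity)] for c in (True, False))
    den = sum(JOINT[(True, c, cv)] for c in (True, False) for cv in (True, False))
    return num / den

# marginal_cavity(True) ≈ 0.2, cavity_given_toothache(True) ≈ 0.6, as in the text.
```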

• Normalization constant: It is a constant for the distribution which ensures that the
distribution sums to 1. The symbol α is used to denote such a constant.
For example, we can compute the probability of a cavity, given evidence of a
toothache, as follows:

P(Cavity | Toothache) = P(Cavity ∧ Toothache) / P(Toothache)
= (0.108 + 0.012) / (0.108 + 0.012 + 0.016 + 0.064) = 0.6

Just to check, we can also compute the probability that there is no cavity given a
toothache:

P(¬Cavity | Toothache) = P(¬Cavity ∧ Toothache) / P(Toothache)
= (0.016 + 0.064) / (0.108 + 0.012 + 0.016 + 0.064) = 0.4

Notice that in these two calculations the term 1/P (toothache) remains constant, no
matter which value of cavity we calculate. With this notation we can write above two
equations in one.

P(Cavity | Toothache) = αP(Cavity, Toothache)

= α [P(Cavity, Toothache, Catch) + P(Cavity, Toothache, ¬ Catch)]

= α [< 0.108, 0.016> + <0.012, 0.064>]

= α <0.12, 0.08> = <0.6, 0.4>

From above one can extract a general inference procedure.

Consider the case in which the query involves a single variable. Let X be the query
variable (Cavity in the example), let E be the set of evidence variables (just
Toothache in the example), let e be the observed values for them, and let Y be the
remaining unobserved variables (just Catch in the example). The query is P(X | e) and
can be evaluated as

P(X | e) = α P(X, e) = α Σy P(X, e, y) ……(7.1.5)

where the summation is over all possible y's (i.e., all possible combinations of
values of the unobserved variables Y). Notice that together the variables X, E and Y
constitute the complete set of variables for the domain, so P(X, e, y) is simply a
subset of probabilities from the full joint distribution.
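Equation (7.1.5) can be sketched as a generic query procedure over a full joint table. This is an illustrative implementation, run against the toothache/catch/cavity joint from the text, with events keyed as (toothache, catch, cavity):

```python
# Generic sketch of equation (7.1.5): P(X | e) = alpha * sum_y P(X, e, y).
JOINT = {
    (True,  True,  True): 0.108, (True,  False, True): 0.012,
    (False, True,  True): 0.072, (False, False, True): 0.008,
    (True,  True,  False): 0.016, (True,  False, False): 0.064,
    (False, True,  False): 0.144, (False, False, False): 0.576,
}

def query(joint, x_index, evidence):
    """evidence maps variable index -> observed value; the hidden variables Y
    are summed out automatically by iterating over all matching joint entries."""
    dist = {}
    for event, p in joint.items():
        if all(event[i] == v for i, v in evidence.items()):
            dist[event[x_index]] = dist.get(event[x_index], 0.0) + p
    alpha = 1.0 / sum(dist.values())       # normalization constant
    return {value: alpha * p for value, p in dist.items()}

# P(Cavity | Toothache = true); Catch plays the role of the hidden variable Y.
posterior = query(JOINT, x_index=2, evidence={0: True})   # ≈ {True: 0.6, False: 0.4}
```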

Independence

It is a relationship between two different subsets of a full joint distribution, also
called marginal or absolute independence of the variables. Independence indicates
whether two variables affect each other's probability.

The independence between variables X and Y can be written as follows:

P(X | Y) = P(X) or P(Y | X) = P(Y) or P(X, Y) = P(X) P(Y)

• For example, the weather is independent of one's dental problem, which can be shown
in the equation below.

P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather)

• The following diagram shows factoring a large joint distribution into smaller
distributions, using absolute independence. Weather and dental problems are
independent.
5. Explain variable elimination algorithm for answering queries on Bayesian CO2
networks.
The Variable Elimination Algorithm

The enumeration algorithm can be improved substantially by eliminating repeated
calculations of subexpressions in the tree. The calculation can be done once and the
results saved for later use. This is a form of dynamic programming.

Working of variable elimination algorithm

1) It works by evaluating expressions such as
P(b | j, m) = α P(b) Σe P(e) Σa P(a | b, e) P(j | a) P(m | a)
in right-to-left order.

2) Intermediate results are stored, and summations over each variable are done only
for those portions of the expression that depend on the variable.

For example, consider the Burglary network; we evaluate the expression above.

3) Factors: Each part of the expression is annotated with the name of the associated
variable; these parts are called factors.
Steps in the algorithm:

i) The factor for M, P(m | a), does not require summing over M. Its probability, given
each value of a, is stored in a two-element factor fM(A).

Note: the subscript M means that M was used to produce f.

ii) Store the factor for J as the two-element vector fJ(A).

iii) The factor for A is P(a | B, e), which will be a 2×2×2 matrix fA(A, B, E).

iv) Sum out A from the product of these three factors. This gives a 2×2 matrix whose
indices range over just B and E. A bar is put over A in the name of the matrix to
indicate that A has been summed out:

fĀJM(B, E) = Σa fA(a, B, E) × fJ(a) × fM(a)
= fA(a, B, E) × fJ(a) × fM(a) + fA(¬a, B, E) × fJ(¬a) × fM(¬a)

The multiplication process used is called a pointwise product.

v) Process E in the same way, i.e., sum out E from the product of fE(E) and
fĀJM(B, E):

fĒĀJM(B) = fE(e) × fĀJM(B, e) + fE(¬e) × fĀJM(B, ¬e)

vi) Compute the answer simply by multiplying the factor for B, i.e., fB(B) = P(B), by
the accumulated matrix fĒĀJM(B):

P(B | j, m) = α fB(B) × fĒĀJM(B)

From the above sequence of steps it can be noticed that two computational operations
are required.

a) Pointwise product of a pair of factors.

b) Summing out a variable from a product of factors.

a) Pointwise product of a pair of factors: The pointwise product of two factors f1 and
f2 yields a new factor f whose variables are the union of the variables in f1 and f2.
Suppose the two factors have variables Y1, ..., Yk in common. Then we have
f(X1, ..., Xj, Y1, ..., Yk, Z1, ..., Zl) = f1(X1, ..., Xj, Y1, ..., Yk) f2(Y1, ..., Yk, Z1, ..., Zl).
If all the variables are binary, then f1 and f2 have 2^(j+k) and 2^(k+l) entries, and
the pointwise product has 2^(j+k+l) entries.

For example: Given two factors f1(A, B) and f2(B, C) with probability distributions
shown below, the pointwise product f1 × f2 is given as f(A, B, C).

b) Summing out a variable from a product of factors: It is a straightforward
computation. Any factor that does not depend on the variable to be summed out can be
moved outside the summation process.

For example:

Σe fE(e) × fA(A, B, e) × fJ(A) × fM(A) = fJ(A) × fM(A) × Σe fE(e) × fA(A, B, e)

Now, the pointwise product inside the summation is computed and the variable is summed
out of the resulting matrix:

fJ(A) × fM(A) × Σe fE(e) × fA(A, B, e) = fJ(A) × fM(A) × fĒA(A, B)

Matrices are not multiplied until we need to sum out a variable from the accumulated
product. At that point, we multiply those matrices that include the variable to be
summed out.
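The two factor operations can be sketched as follows, assuming binary variables and factors stored as dictionaries keyed by assignment tuples. The f1 and f2 numbers are illustrative:

```python
from itertools import product

def pointwise_product(f1, vars1, f2, vars2):
    """Multiply two factors; the result's variables are the union of vars1 and vars2."""
    out_vars = vars1 + [v for v in vars2 if v not in vars1]
    out = {}
    for assign in product([True, False], repeat=len(out_vars)):
        env = dict(zip(out_vars, assign))
        out[assign] = (f1[tuple(env[v] for v in vars1)] *
                       f2[tuple(env[v] for v in vars2)])
    return out, out_vars

def sum_out(var, f, vars_):
    """Sum a variable out of a factor, shrinking it by one dimension."""
    i = vars_.index(var)
    out_vars = vars_[:i] + vars_[i + 1:]
    out = {}
    for assign, val in f.items():
        key = assign[:i] + assign[i + 1:]
        out[key] = out.get(key, 0.0) + val
    return out, out_vars

# Illustrative factors f1(A, B) and f2(B, C):
f1 = {(True, True): 0.3, (True, False): 0.7, (False, True): 0.9, (False, False): 0.1}
f2 = {(True, True): 0.2, (True, False): 0.8, (False, True): 0.6, (False, False): 0.4}
f3, v3 = pointwise_product(f1, ['A', 'B'], f2, ['B', 'C'])   # f3(A, B, C)
f4, v4 = sum_out('B', f3, v3)                                # f3 with B summed out
```

For instance, f3[(True, True, True)] = f1[(T, T)] × f2[(T, T)] = 0.3 × 0.2 = 0.06, and f4[(True, True)] = 0.06 + 0.7 × 0.6 = 0.48.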

The variable elimination algorithm, built from these two operations, is shown below:

function ELIMINATION-ASK(X, e, bn) returns a distribution over X
  inputs: X, the query variable
          e, evidence specified as an event
          bn, a Bayesian network specifying joint distribution P(X1, ..., Xn)

  factors ← [ ]; vars ← REVERSE(VARS[bn])
  for each var in vars do
    factors ← [MAKE-FACTOR(var, e) | factors]
    if var is a hidden variable then
      factors ← SUM-OUT(var, factors)
  return NORMALIZE(POINTWISE-PRODUCT(factors))

Consider the query P(J calls | Burglary = true). The first step is to write out the
nested summation:

P(J | b) = α P(b) Σe P(e) Σa P(a | b, e) P(J | a) Σm P(m | a)

Evaluating this expression from right to left, Σm P(m | a) is equal to 1.

Note: The variable M is irrelevant to this query. The result of the query
P(J calls | Burglary = true) is unchanged if we remove M calls from the network. We
can remove any leaf node which is not a query variable or an evidence variable. After
its removal, there may be more such leaf nodes, and they too may be irrelevant.
Eventually we find that every variable that is not an ancestor of a query variable or
evidence variable is irrelevant to the query. A variable elimination algorithm can
remove all these variables before evaluating the query.

6. Explain direct sampling in Approximate Inference CO2


Direct Sampling Algorithm

1) The basic element in any sampling algorithm is the generation of samples from a
known probability distribution.

For example: An unbiased coin can be thought of as a random variable Coin with values
<heads, tails> and a prior distribution P(Coin) = <0.5, 0.5>. Sampling from this
distribution is exactly like flipping the coin: with probability 0.5 it will return
heads, and with probability 0.5 it will return tails. Given a source of random numbers
in the range [0, 1], it is a simple matter to sample any distribution on a single
variable.

2) The simplest kind of random sampling process for Bayesian networks generates
events from a network that has no evidence associated with it. The idea is to sample
each variable in turn, in topological order.

3) The probability distribution from which the value is sampled is conditioned on the
values already assigned to the variable's parents.

4) The sampling algorithm:

function PRIOR-SAMPLE(bn) returns an event sampled from the prior specified by bn
  inputs: bn, a Bayesian network specifying joint distribution P(X1, ..., Xn)

  x ← an event with n elements
  for i = 1 to n do
    xi ← a random sample from P(Xi | parents(Xi))
  return x

5) Applying the operations of the algorithm on the network in Fig. 7.3.8 (a), assuming
an ordering [Cloudy, Sprinkler, Rain, WetGrass]:

i) Sample from P(Cloudy) = <0.5, 0.5>; suppose this returns true.

ii) Sample from P(Sprinkler | Cloudy = true) = <0.1, 0.9>; suppose this returns false.

iii) Sample from P(Rain | Cloudy = true) = <0.8, 0.2>; suppose this returns true.

iv) Sample from P(WetGrass | Sprinkler = false, Rain = true) = <0.9, 0.1>; suppose
this returns true.

In this case PRIOR-SAMPLE returns the event [true, false, true, true].

6) PRIOR-SAMPLE generates samples from the prior joint distribution specified by the
network. First, let SPS(x1, ..., xn) be the probability that a specific event is
generated by the PRIOR-SAMPLE algorithm. Just looking at the sampling process, we have

SPS(x1, ..., xn) = Πⁿi=1 P(xi | parents(Xi))

because each sampling step depends only on the parent values. This expression is also
the probability of the event according to the Bayesian network's representation of the
joint distribution. We have

SPS(x1, ..., xn) = P(x1, ..., xn)

7) In any sampling algorithm, the answers are computed by counting the actual samples
generated. Suppose there are N total samples, and let NPS(x1, ..., xn) be the
frequency of the specific event x1, ..., xn. We expect this frequency to converge, in
the limit, to its expected value according to the sampling probability:

limN→∞ NPS(x1, ..., xn)/N = SPS(x1, ..., xn) = P(x1, ..., xn) ... (7.3.4)

For example, consider the event produced earlier: [true, false, true, true]. The
sampling probability for this event is

SPS(true, false, true, true) = 0.5 × 0.9 × 0.8 × 0.9 = 0.324

Hence, in the limit of large N, we expect 32.4 % of the samples to be of this event.

8) Whenever we use an approximate equality ("≈"), we mean it in exactly this sense:
that the estimated probability becomes exact in the large-sample limit. Such an
estimate is called consistent.

For example, one can produce a consistent estimate of the probability of any partially
specified event x1, ..., xm, where m ≤ n, as follows:

P(x1, ..., xm) ≈ NPS(x1, ..., xm)/N ….(7.3.5)

That is, the probability of the event can be estimated as the fraction of all complete
events generated by the sampling process that match the partially specified event.

For example: If we generate 1000 samples from the sprinkler network, and 511 of
them have, Rain = true, then the estimated probability of rain, written as P(Rain =
true), is 0.511.
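The sampling walkthrough above can be sketched directly. A minimal PRIOR-SAMPLE for the sprinkler network, using the CPT values quoted in the text (P(Sprinkler = true | ¬Cloudy) = 0.5 and the WetGrass entries other than <0.9, 0.1> are the standard values for this example, assumed here):

```python
import random

P_SPRINKLER = {True: 0.1, False: 0.5}     # P(Sprinkler = true | Cloudy)
P_RAIN = {True: 0.8, False: 0.2}          # P(Rain = true | Cloudy)
P_WET = {(True, True): 0.99, (True, False): 0.90,
         (False, True): 0.90, (False, False): 0.0}   # P(WetGrass = true | S, R)

def prior_sample(rng):
    """Sample each variable in topological order, conditioned on its parents."""
    cloudy = rng.random() < 0.5
    sprinkler = rng.random() < P_SPRINKLER[cloudy]
    rain = rng.random() < P_RAIN[cloudy]
    wet = rng.random() < P_WET[(sprinkler, rain)]
    return cloudy, sprinkler, rain, wet

rng = random.Random(42)
N = 100_000
n_rain = sum(1 for _ in range(N) if prior_sample(rng)[2])
estimate = n_rain / N      # converges to P(Rain = true) = 0.5 as N grows
```

Here P(Rain = true) = 0.5 × 0.8 + 0.5 × 0.2 = 0.5, and the frequency estimate approaches it, illustrating consistency in the sense of equation (7.3.5).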

2. Rejection Sampling in Bayesian Networks

1) Rejection sampling is a method for producing samples from a hard-to-sample
distribution given an easy-to-sample distribution.

2) It can be used to compute conditional probabilities; that is, to determine P(X | e).

3) The rejection sampling algorithm is:

function REJECTION-SAMPLING(X, e, bn, N) returns an estimate of P(X | e)
  inputs: X, the query variable
          e, evidence specified as an event
          bn, a Bayesian network
          N, the total number of samples to be generated
  local variables: N, a vector of counts over X, initially zero

  for j = 1 to N do
    x ← PRIOR-SAMPLE(bn)
    if x is consistent with e then
      N[x] ← N[x] + 1 where x is the value of X in x
  return NORMALIZE(N[X])

The rejection sampling algorithm for answering queries given evidence in a Bayesian
network.
• Working of algorithm:
i) It generates samples from the prior distribution specified by the network.

ii) It rejects all those samples that do not match the evidence.

iii) Finally, the estimate P(X = x | e) is obtained by counting how often X = x occurs
in the remaining samples.

4) Let P̂(X | e) be the estimated distribution that the algorithm returns. From the
definition of the algorithm, we have

P̂(X | e) = α NPS(X, e) = NPS(X, e)/NPS(e)

From equation (7.3.5) this becomes

P̂(X | e) ≈ P(X, e)/P(e) = P(X | e)

That is, rejection sampling produces a consistent estimate of the true probability.

5) Applying the operations of the algorithm on the network in Fig. 7.3.8 (a), let us
assume that we wish to estimate P(Rain | Sprinkler = true), using 100 samples. Of the
100 that we generate, suppose 8 have Rain = true and 19 have Rain = false (the rest
are rejected). Hence,

P(Rain | Sprinkler = true) ≈ NORMALIZE(<8, 19>) = <0.296, 0.704>

The true answer is <0.3, 0.7>.

6) As more samples are collected, the estimate will converge to the true answer. The
standard deviation of the error in each probability will be proportional to 1/√n,
where n is the number of samples used in the estimate.

7) The biggest problem with rejection sampling is that it rejects so many samples!
The fraction of samples consistent with the evidence e drops exponentially as the
number of evidence variables grows, so the procedure is simply unusable for complex
problems.
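The rejection sampling estimate for P(Rain | Sprinkler = true) can be sketched as follows (CPT values as quoted in the text; the true answer is <0.3, 0.7>):

```python
import random

P_SPRINKLER = {True: 0.1, False: 0.5}    # P(Sprinkler = true | Cloudy)
P_RAIN = {True: 0.8, False: 0.2}         # P(Rain = true | Cloudy)

def prior_sample(rng):
    cloudy = rng.random() < 0.5
    sprinkler = rng.random() < P_SPRINKLER[cloudy]
    rain = rng.random() < P_RAIN[cloudy]
    return cloudy, sprinkler, rain

def rejection_sample_rain(n, rng):
    """Estimate P(Rain | Sprinkler = true) by discarding inconsistent samples."""
    counts = {True: 0, False: 0}
    for _ in range(n):
        _, sprinkler, rain = prior_sample(rng)
        if sprinkler:                # keep only samples consistent with the evidence
            counts[rain] += 1
        # samples with Sprinkler = false are rejected
    total = counts[True] + counts[False]
    return {v: c / total for v, c in counts.items()}

est = rejection_sample_rain(100_000, random.Random(1))
# est[True] converges to P(Rain = true | Sprinkler = true) = 0.3
```

Note that roughly 70 % of the samples are rejected here, which illustrates the inefficiency described in point 7.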

3. Likelihood Weighting in Bayesian Networks

1) Likelihood weighting avoids the inefficiency of rejection sampling by generating
only events that are consistent with the evidence e.

2) The likelihood weighting algorithm is:

function LIKELIHOOD-WEIGHTING(X, e, bn, N) returns an estimate of P(X | e)
  inputs: X, the query variable
          e, evidence specified as an event
          bn, a Bayesian network
          N, the total number of samples to be generated
  local variables: W, a vector of weighted counts over X, initially zero

  for j = 1 to N do
    x, w ← WEIGHTED-SAMPLE(bn, e)
    W[x] ← W[x] + w where x is the value of X in x
  return NORMALIZE(W[X])

function WEIGHTED-SAMPLE(bn, e) returns an event and a weight
  x ← an event with n elements; w ← 1
  for i = 1 to n do
    if Xi has a value xi in e
      then w ← w × P(Xi = xi | parents(Xi))
      else xi ← a random sample from P(Xi | parents(Xi))
  return x, w

Notes on the likelihood weighting algorithm:

i) It fixes the values for the evidence variables E and samples only the remaining
variables X and Y. This guarantees that each event generated is consistent with the
evidence.

ii) Not all events are equal, however. Before tallying the counts in the distribution
for the query variable, each event is weighted by the likelihood that the event
accords to the evidence, as measured by the product of the conditional probabilities
for each evidence variable, given its parents.

iii) Intuitively, events in which the actual evidence appears unlikely should be given
less weight.

3) Applying the operations of the algorithm on the network of Fig. 7.3.8 (a), with the
query P(Rain | Sprinkler = true, WetGrass = true), the process goes as follows:

i) The weight w is set to 1.0.

ii) Now an event is generated:

• Sample from P(Cloudy) = <0.5, 0.5>; suppose this returns true.
• Sprinkler is an evidence variable with value true. Therefore, we set
w ← w × P(Sprinkler = true | Cloudy = true) = 0.1.
• Sample from P(Rain | Cloudy = true) = <0.8, 0.2>; suppose this returns true.
• WetGrass is an evidence variable with value true. Therefore, we set
w ← w × P(WetGrass = true | Sprinkler = true, Rain = true) = 0.099.

iii) Here WEIGHTED-SAMPLE returns the event [true, true, true, true] with weight
0.099, and this is tallied under Rain = true.

iv) The weight is low because the event describes a cloudy day, which makes the
sprinkler unlikely to be on.
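The walkthrough above can be sketched as a small simulation. This is an illustrative implementation for the sprinkler network (the WetGrass CPT entries beyond those quoted in the text are the standard values for this example, assumed here); the exact posterior for this query is about <0.32, 0.68>:

```python
import random

P_SPRINKLER = {True: 0.1, False: 0.5}    # P(Sprinkler = true | Cloudy)
P_RAIN = {True: 0.8, False: 0.2}         # P(Rain = true | Cloudy)
P_WET = {(True, True): 0.99, (True, False): 0.90,
         (False, True): 0.90, (False, False): 0.0}   # P(WetGrass = true | S, R)

def weighted_sample(rng):
    """Fix the evidence variables; weight by their likelihood given parents."""
    w = 1.0
    cloudy = rng.random() < 0.5
    w *= P_SPRINKLER[cloudy]             # evidence: Sprinkler = true
    rain = rng.random() < P_RAIN[cloudy]
    w *= P_WET[(True, rain)]             # evidence: WetGrass = true
    return rain, w

def likelihood_weighting_rain(n, rng):
    W = {True: 0.0, False: 0.0}
    for _ in range(n):
        rain, w = weighted_sample(rng)
        W[rain] += w
    total = W[True] + W[False]
    return {v: x / total for v, x in W.items()}

est = likelihood_weighting_rain(100_000, random.Random(1))
# est[True] converges to the exact posterior, roughly 0.32
```

Every sample is used, but note how the weights range from 0.09 up to 0.495 depending on the sampled values, which is exactly the weighting behaviour described in the example.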

4) How likelihood weighting works:

i) Examine the sampling distribution SWS for WEIGHTED-SAMPLE.

ii) The evidence variables E are fixed with values e.

iii) Call the other variables Z; that is, Z = {X} ∪ Y.

iv) The algorithm samples each variable in Z in turn, given its parent values:

SWS(z, e) = Πˡi=1 P(zi | parents(Zi)) ... (7.3.6)

Notice that parents(Zi) can include both hidden variables and evidence variables.
Unlike the prior distribution P(z), the distribution SWS pays some attention to the
evidence: the sampled values for each Zi will be influenced by evidence among Zi's
ancestors.

v) On the other hand, SWS pays less attention to the evidence than does the true
posterior distribution P(z | e), because the sampled values for each Zi ignore
evidence among Zi's non-ancestors.

vi) The likelihood weight w makes up for the difference between the actual and desired
sampling distributions. The weight for a given sample x, composed from z and e, is the
product of the likelihoods for each evidence variable given its parents (some or all
of which may be among the Zi's):

w(z, e) = Πᵐi=1 P(ei | parents(Ei)) .... (7.3.7)

vii) Multiplying equations (7.3.6) and (7.3.7), we see that the weighted probability
of a sample has the particularly convenient form

SWS(z, e) w(z, e) = Πˡi=1 P(zi | parents(Zi)) Πᵐi=1 P(ei | parents(Ei)) = P(z, e) ....(7.3.8)

because the two products cover all the variables in the network, allowing us to use
equation (7.3.1) for the joint probability.

viii) It is easy to show that likelihood weighting estimates are consistent. For any
particular value x of X, the estimated posterior probability can be calculated as
follows:

P̂(x | e) = α Σy NWS(x, y, e) w(x, y, e) from LIKELIHOOD-WEIGHTING
= α′ Σy SWS(x, y, e) w(x, y, e) for large N
= α′ Σy P(x, y, e)
= α′ P(x, e) = P(x | e) by equation (7.3.8)

Hence, likelihood weighting returns consistent estimates.

5) Performance of the algorithm:

i) Likelihood weighting uses all the samples generated; therefore, it can be much more
efficient than rejection sampling.

ii) It will, however, suffer a degradation in performance as the number of evidence
variables increases.

iii) This is because most samples will have very low weights, and hence the weighted
estimate will be dominated by the tiny fraction of samples that accord more than an
infinitesimal likelihood to the evidence.

iv) The problem is exacerbated if the evidence variables occur late in the variable
ordering, because then the samples will be simulations that bear little resemblance to
the reality suggested by the evidence.

7. Explain the semantics used in Bayesian networks CO2


There are two ways in which the semantics (meaning) of a Bayesian network can be
understood.

One way is to view the network as a representation of the joint probability
distribution. This view helps in constructing networks. The second way is to view the
network as an encoding of a collection of conditional independence statements. This
view helps in designing inference procedures. Semantically, both views are equivalent.

Understanding the semantics of a Bayesian network:

• Representing the full joint distribution:

1) Every entry in the full joint probability distribution (hereafter abbreviated as
"joint") can be calculated from the information in the network.

2) A generic entry in the joint distribution is the probability of a conjunction of
particular assignments to each variable, such as P(X1 = x1 ∧ ... ∧ Xn = xn).

3) The notation P(x1, ..., xn) is used as an abbreviation for this.

4) The value of this entry is given by the formula

P(x1, ..., xn) = Πⁿi=1 P(xi | parents(Xi)) ……(7.3.1)

where parents(Xi) denotes the specific values of the variables in Parents(Xi).

5) Thus, each entry in the joint distribution is represented by the product of the
appropriate elements of the Conditional Probability Tables (CPTs) in the Bayesian
network. The CPTs therefore provide a decomposed representation of the joint
distribution.

We can calculate the probability that the alarm has sounded, but neither a burglary
nor an earthquake has occurred, and both J and M call. We use single-letter names for
the variables:

P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j | a) P(m | a) P(a | ¬b ∧ ¬e) P(¬b) P(¬e)
= 0.90 × 0.70 × 0.001 × 0.999 × 0.998 ≈ 0.00062
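The calculation above is just a product of CPT entries, which can be verified in a few lines (the CPT values are those quoted in the text for the burglary network):

```python
# P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) as a product of CPT entries (equation 7.3.1).
P_B, P_E = 0.001, 0.002                          # P(Burglary), P(Earthquake)
P_A_NOT_B_NOT_E = 0.001                          # P(Alarm | ¬B, ¬E)
P_J_GIVEN_A = 0.90                               # P(J calls | Alarm)
P_M_GIVEN_A = 0.70                               # P(M calls | Alarm)

p = P_J_GIVEN_A * P_M_GIVEN_A * P_A_NOT_B_NOT_E * (1 - P_B) * (1 - P_E)
# p ≈ 0.000628
```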

6) Remember that the full joint distribution can be used to answer any query about the
domain. If a Bayesian network is a representation of the joint distribution, then it
too can be used to answer any query, by summing all the relevant joint entries.

A method for constructing Bayesian networks:

1) We rewrite the joint distribution in terms of a conditional probability, using the
product rule:

P(x1, ..., xn) = P(xn | xn-1, ..., x1) P(xn-1, ..., x1)

2) Then the process is repeated, reducing each conjunctive probability to a
conditional probability and a smaller conjunction. We end up with one big product:

P(x1, ..., xn) = P(xn | xn-1, ..., x1) P(xn-1 | xn-2, ..., x1) ... P(x2 | x1) P(x1)
= Πⁿi=1 P(xi | xi-1, ..., x1) …..(7.3.2)

3) The above identity holds true for any set of random variables and is called the
chain rule. Comparing it with equation (7.3.1), we see that the specification of the
joint distribution is equivalent to the general assertion that, for every variable Xi
in the network,

P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))

provided that Parents(Xi) ⊆ {Xi-1, ..., X1}. This last condition is satisfied by
labeling the nodes in any order that is consistent with the partial order implicit in
the graph structure.

4) A Bayesian network is a correct representation of the domain only if each node is
conditionally independent of its predecessors in the node ordering, given its parents.

5) In order to construct a Bayesian network with the correct structure for the domain,
we need to choose parents for each node such that this property holds. Intuitively,
the parents of node Xi should contain all those nodes in X1, ..., Xi-1 that directly
influence Xi.

8. Discuss Conditional Independence Relations in Bayesian Networks CO2


One can start from a "topological" semantics that specifies the conditional
independence relationships encoded by the graph structure, and from these we can
derive the "numerical" semantics. The topological semantics is given by either of the
following specifications, which are equivalent:

1) A node is conditionally independent of its non-descendants, given its parents. For
example, in Fig. 7.3.2, J calls is independent of Burglary and Earthquake, given the
value of Alarm.

2) A node is conditionally independent of all other nodes in the network, given its
parents, children, and children's parents - that is, given its Markov blanket.

For example: Burglary is independent of J calls and M calls, given Alarm and
Earthquake.

These specifications are illustrated in the following figures. From these conditional
independence assertions and the CPTs, the full joint distribution can be
reconstructed; thus the "numerical" semantics and the "topological" semantics are
equivalent.

A node X is conditionally independent of its non-descendants (the Zij's) given its
parents (the Ui's, shown in the gray area).

A node X is conditionally independent of all other nodes in the network given its
Markov blanket (the gray area).

Efficient representation of conditional distributions:

Even if the maximum number of parents k is small, filling in the CPT for a node
requires up to O(2^k) numbers.

To avoid this large number of entries, canonical distributions are used. Here a
complete table is specified by naming a standard pattern supplied with a few
parameters, which can describe the relationships that are necessary. The simplest case
is a deterministic node: a node whose value is exactly specified by the values of its
parents, with no uncertainty.

Uncertain relationships are often represented by noisy-OR relationships. [Noisy-OR is
a generalization of logical OR.] The noisy-OR model allows uncertainty about the
ability of each parent to cause the child to be true, i.e., the causal relationship
between parent and child may be inhibited.

For example: A patient could have a cold (parent) but not exhibit a fever (child).

The noisy-OR model requires that all the possible causes of the child be listed, and
it assumes that inhibition of one parent is independent of inhibition of the others.
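A noisy-OR CPT can be sketched as follows. Under independent inhibition, P(child = false | parents) is the product of the inhibition probabilities q of the parents that are present; the q values for cold, flu and malaria causing fever below are illustrative assumptions, not values from the text:

```python
from itertools import product

# Assumed inhibition probabilities: P(no fever | only this cause present).
Q = {'cold': 0.6, 'flu': 0.2, 'malaria': 0.1}
PARENTS = ['cold', 'flu', 'malaria']

def noisy_or(present_parents):
    """P(child = true | parents): each present parent independently fails
    to cause the child with its inhibition probability q."""
    p_false = 1.0
    for parent in present_parents:
        p_false *= Q[parent]
    return 1.0 - p_false

# Build the full 2**3-entry CPT from just three parameters:
cpt = {}
for bits in product([True, False], repeat=3):
    present = [name for name, b in zip(PARENTS, bits) if b]
    cpt[bits] = noisy_or(present)
```

For instance, with cold and flu present, P(fever) = 1 − 0.6 × 0.2 = 0.88, and with no cause present the child is false with certainty.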

9. Explain exact inference in Bayesian networks CO2


Exact Inference in Bayesian Networks

For inference in a probabilistic system, it is required to calculate the posterior
probability distribution for a set of query variables, given some observed event.
[That is, we have some values attached to the evidence variables.]

• Notation revisited:

The notation used in inference is the same as the one used in probability theory.

X: the query variable.

E: the set of evidence variables E1, ..., Em; e is the particular observed event.

Y: the set of non-evidence variables Y1, Y2, ..., Yk [non-evidence variables are also
called hidden variables].

X: the complete set of all variables, where X = {X} ∪ E ∪ Y.

• Generally the query asks for the posterior probability distribution P(X | e)
[assuming that the query variable is not among the evidence variables; if it is, then
the posterior distribution for X simply gives probability 1 to the observed value].
[Note that a query can contain more than one variable; for study purposes we assume a
single variable.]
• Example: In the burglary case, if the observed event is J calls = true and
M calls = true, the query is 'Has a burglary occurred?'

The probability distribution for this situation would be

P(Burglary | J calls = true, M calls = true) = <0.284, 0.716>

2. Inference by Enumeration

A Bayesian network gives a complete representation of the full joint distribution.
This full joint distribution can be written as a product of conditional probabilities
from the Bayesian network.

A query can be answered using a Bayesian network by computing sums of products of
conditional probabilities from the network.

• The algorithm:

The algorithm ENUMERATE-JOINT-ASK gives inference by enumerating over the full joint
distribution.

Characteristics of the algorithm:

1) It takes as input a full joint distribution P and looks up values in it. [The same
algorithm can be modified to take as input a Bayesian network, looking up joint
entries by multiplying the corresponding conditional probability table entries from
the network.]

2) ENUMERATE-JOINT-ASK uses the ENUMERATION-ASK (EA) algorithm, which evaluates
expressions using depth-first recursion. Therefore, the space complexity of EA is only
linear in the number of variables; the algorithm sums over the full joint distribution
without ever constructing it explicitly. The time complexity for a network with n
Boolean variables is always O(2^n), which is better than the O(n·2^n) required by the
simple inference approach (using posterior probability).

3) The drawback of the algorithm is that it keeps evaluating repeated subexpressions,
which results in wasted computation time.

The enumeration algorithm for answering queries on Bayesian networks is given below.

• The algorithm:

function ENUMERATION-ASK(X, e, bn) returns a distribution over X
  inputs: X, the query variable
          e, observed values for variables E
          bn, a Bayes net with variables {X} ∪ E ∪ Y  /* Y = hidden variables */

  Q(X) ← a distribution over X, initially empty
  for each value xi of X do
    extend e with value xi for X
    Q(xi) ← ENUMERATE-ALL(VARS[bn], e)
  return NORMALIZE(Q(X))

function ENUMERATE-ALL(vars, e) returns a real number
  if EMPTY?(vars) then return 1.0
  Y ← FIRST(vars)
  if Y has value y in e
    then return P(y | parents(Y)) × ENUMERATE-ALL(REST(vars), e)
    else return Σy P(y | parents(Y)) × ENUMERATE-ALL(REST(vars), ey)
      where ey is e extended with Y = y

Example:

Consider the query P(Burglary | J calls = true, M calls = true).

The hidden variables in the query are Earthquake and Alarm.

Using the query equation:

P(Burglary | j, m) = α P(Burglary, j, m) = α Σe Σa P(Burglary, e, a, j, m)

The semantics of Bayesian networks (equation 7.3.1) then gives us an expression in
terms of CPT entries. For simplicity, we do this just for Burglary = true:

P(b | j, m) = α Σe Σa P(b) P(e) P(a | b, e) P(j | a) P(m | a)

• To compute this expression, we have to add four terms, each computed by multiplying
five numbers.
• In the worst case, where we have to sum out almost all the variables, the complexity
of the algorithm for a network with n Boolean variables is O(n·2^n).
• An improvement can be obtained from the following simple observations. The P(b) term
is a constant and can be moved outside the summations over a and e, and the P(e) term
can be moved outside the summation over a. Hence, we have

P(b | j, m) = α P(b) Σe P(e) Σa P(a | b, e) P(j | a) P(m | a)

This expression can be evaluated by looping through the variables in order,
multiplying CPT entries as we go. For each summation, we also need to loop over the
variable's possible values. The structure of this computation is shown in the
following diagram. Using the numbers from Fig. 7.3.2, we obtain
P(b | j, m) = α × 0.00059224. The corresponding computation for ¬b yields
α × 0.0014919. Hence,

P(B | j, m) = α <0.00059224, 0.0014919> ≈ <0.284, 0.716>

That is, the chance of a burglary, given calls from both neighbours, is about 28 %.

Note: In Fig. 7.3.7, the evaluation proceeds top to bottom, multiplying values along
each path and summing at the "+" nodes. Observe that there is repetition of the paths
for j and m.
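The nested-summation evaluation above can be sketched directly. A minimal enumeration for P(Burglary | j, m) on the burglary network, using the standard CPT values for this example (the entries not quoted explicitly in the text, such as P(Alarm | b, e) = 0.95, are the usual values consistent with the results 0.00059224 and 0.0014919):

```python
from itertools import product

P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(Alarm | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(J calls | Alarm)
P_M = {True: 0.70, False: 0.01}                      # P(M calls | Alarm)

def query_burglary():
    """P(Burglary | j, m) = alpha * P(b) * sum_e P(e) sum_a P(a|b,e)P(j|a)P(m|a)."""
    scores = {}
    for b in (True, False):
        p_b = P_B if b else 1 - P_B
        total = 0.0
        for e, a in product([True, False], repeat=2):   # sum out E and A
            p_e = P_E if e else 1 - P_E
            p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
            total += p_b * p_e * p_a * P_J[a] * P_M[a]
        scores[b] = total
    alpha = 1.0 / sum(scores.values())
    return {b: alpha * s for b, s in scores.items()}

post = query_burglary()
# post[True] ≈ 0.284, matching the text's <0.284, 0.716>
```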
10. Explain the Complexity Involved in Exact Inferencing CO2
The variable elimination algorithm is more efficient than the enumeration algorithm
because it avoids repeated computations and drops irrelevant variables.

The variable elimination algorithm constructs factors during its operation. The space
and time complexity of variable elimination is directly dependent on the size of the
largest factor constructed during the operation. Factor construction is determined by
the order of elimination of variables and by the structure of the network, which
affects both space and time complexity.

For a more efficient process we can use singly connected networks, which are also
called polytrees. In a singly connected network, there is at most one undirected path
between any two nodes in the network. Singly connected networks have the property that
the time and space complexity of exact inference in polytrees is linear in the size of
the network. Here the size is defined as the number of CPT entries. If the number of
parents of each node is bounded by a constant, then the complexity will also be linear
in the number of nodes.

For example: The Burglary network shown in Fig. 7.3.2 is a polytree.

[Note that not every problem can be represented as a polytree.]

In multiply connected networks [in which there can be multiple undirected paths
between any two nodes, and more than one directed path between some pairs of nodes],
variable elimination takes exponential time and space in the worst case, even when the
number of parents per node is bounded. It should be noted that variable elimination
includes inference in propositional logic as a special case, and inference in Bayesian
networks is NP-hard; in fact, it is strictly harder than NP-complete problems.

Clustering algorithm:
1) Clustering algorithms (also known as join tree algorithms) can reduce inference
time to O(n). In clustering, individual nodes of the network are joined to form
cluster nodes in such a way that the resulting network is a polytree.

2) The variable elimination algorithm is efficient for answering individual
queries, but it can be less efficient when posterior probabilities are needed for
all the variables in the network: even in a polytree it must issue O(n) queries
costing O(n) each, for a total of O(n2) time. The clustering algorithm improves
on this.

For example: The multiply connected network shown in Fig. 7.3.8 (a) can be
converted into a polytree by combining the Sprinkler and Rain nodes into a cluster
node called Sprinkler + Rain, as shown in Fig. 7.3.8 (b). The two Boolean nodes are
replaced by a meganode that takes on four possible values: TT, TF, FT and FF. The
meganode has only one parent, the Boolean variable Cloudy, so there are two
conditioning cases.
Peculiarities of algorithm:

1) Once the network is in polytree form, a special-purpose inference algorithm is
applied. Essentially, the algorithm is a form of constraint propagation in which
the constraints ensure that neighbouring clusters agree on the posterior
probability of any variables they have in common.

2) With careful bookkeeping, this algorithm can compute posterior probabilities
for all the non-evidence nodes in the network in time O(n), where n is now the
size of the modified network.

3) However, the NP-hardness of the problem remains: if a network requires
exponential time and space with variable elimination, then the CPTs in the
clustered network will require exponential time and space to construct.

11. Consider an incandescent bulb manufacturing unit. Here machines M1, M2 and CO2
M3 make 30 %, 30 % and 40 % of the total bulbs; of their output, assume
that 2 %, 3 % and 4 % respectively are defective. A bulb is drawn at random and
is found defective. What is the probability that the bulb was made by machine
M1, M2 or M3?

Solution:

• Let E1, E2 and E3 be the events that a bulb selected at random is made by machine
M1, M2 and M3.
• Let Q denote that it is defective.
Prob (E1) = 0.3
Prob (E2) = 0.3 and Prob (E3) = 0.4 (given data),
These represent the prior probabilities.

• Probability of drawing a defective bulb made by M1 = Prob (Q/E1) = 0.02


• Probability of drawing a defective bulb made by M2 = Prob (Q/E2) = 0.03
• Probability of drawing a defective bulb made by M3 = Prob (Q/E3) = 0.04
• These values are the likelihoods (the probability of a defective bulb given each machine), not posterior probabilities.
Therefore,

Prob (E1/Q) = Prob (E1) * Prob (Q/E1) / Σ3i=1 Prob (Ei) * Prob (Q/Ei)

= 0.3 * 0.02 / ((0.3 * 0.02) + (0.3 * 0.03) + (0.4 * 0.04))

= 0.1935

Similarly,

Prob (E2/Q) = 0.3 * 0.03 / ((0.3 * 0.02) + (0.3 * 0.03) + (0.4 * 0.04))

= 0.2903

Prob (E3/Q) = 1 - (Prob(E1/Q) + Prob(E2/Q))

= 1 - (0.1935 + 0.2903)
= 0.5162
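The calculation above can be sketched in a few lines of Python (a minimal illustration; the dictionary names are our own, not part of the problem statement):

```python
# Posterior P(machine | defective) via Bayes' theorem.
# Priors and likelihoods are the figures given in the problem statement.
priors = {"M1": 0.30, "M2": 0.30, "M3": 0.40}       # P(Ei)
likelihoods = {"M1": 0.02, "M2": 0.03, "M3": 0.04}  # P(Q | Ei)

# Total probability of drawing a defective bulb: P(Q) = sum of P(Q|Ei) * P(Ei)
evidence = sum(priors[m] * likelihoods[m] for m in priors)

# Posterior for each machine: P(Ei | Q) = P(Q|Ei) * P(Ei) / P(Q)
posteriors = {m: priors[m] * likelihoods[m] / evidence for m in priors}

for m, p in posteriors.items():
    print(f"P({m} | defective) = {p:.4f}")
```

Running this reproduces the three posteriors (0.1935, 0.2903, 0.5161), which necessarily sum to 1.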
12. Explain Bayesian Belief Network in artificial intelligence CO2
Bayesian belief network is a key computer technology for dealing with probabilistic
events and for solving problems that involve uncertainty. We can define a Bayesian
network as:

"A Bayesian network is a probabilistic graphical model which represents a set of


variables and their conditional dependencies using a directed acyclic graph."

It is also called a Bayes network, belief network, decision network, or Bayesian


model.

Bayesian networks are probabilistic, because these networks are built from a
probability distribution, and also use probability theory for prediction and anomaly
detection.

Real world applications are probabilistic in nature, and to represent the relationship
between multiple events, we need a Bayesian network. It can also be used in various
tasks including prediction, anomaly detection, diagnostics, automated insight,
reasoning, time series prediction, and decision making under uncertainty.

Bayesian Network can be used for building models from data and experts opinions,
and it consists of two parts:

o Directed Acyclic Graph


o Table of conditional probabilities.

The generalized form of a Bayesian network that represents and solves decision
problems under uncertain knowledge is known as an influence diagram.

A Bayesian network graph is made up of nodes and Arcs (directed links), where:
o Each node corresponds to the random variables, and a variable can
be continuous or discrete.
o Arc or directed arrows represent the causal relationship or conditional
probabilities between random variables. These directed links or arrows
connect the pair of nodes in the graph.
These links represent that one node directly influences the other; if there is
no directed link, the nodes are independent of each other.
o Consider an example graph in which A, B, C, and D are random
variables represented by the nodes of the network.
o If node B is connected to node A by a directed arrow, then node A
is called the parent of node B.
o Node C is independent of node A.

Note: A Bayesian network graph does not contain any directed cycles. Hence, it is
known as a directed acyclic graph, or DAG.

The Bayesian network has mainly two components:

o Causal Component
o Actual numbers

Each node in the Bayesian network has a conditional probability distribution
P(Xi | Parents(Xi)), which quantifies the effect of the parents on that node.

Bayesian network is based on Joint probability distribution and conditional


probability. So let's first understand the joint probability distribution:

Joint probability distribution:

If we have variables x1, x2, x3, ..., xn, then the probability distribution over
the different combinations of their values is known as the joint probability
distribution.

P[x1, x2, x3, ..., xn] can be expanded by the chain rule of probability:

= P[x1| x2, x3,....., xn]P[x2, x3,....., xn]

= P[x1| x2, x3,....., xn]P[x2|x3,....., xn]....P[xn-1|xn]P[xn].

In general for each variable Xi, we can write the equation as:

P(Xi|Xi-1,........., X1) = P(Xi |Parents(Xi ))
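A minimal sketch of this factorisation, using a hypothetical three-node chain A → B → C with made-up CPT values, shows that the product of the per-node conditional probabilities forms a valid joint distribution:

```python
# Hypothetical chain network A -> B -> C with invented CPT values.
# The Bayesian-network factorisation is P(a, b, c) = P(a) * P(b|a) * P(c|b).
P_A = {True: 0.2, False: 0.8}                  # P(A)
P_B = {True: {True: 0.7, False: 0.3},          # P(B | A), keyed by value of A
       False: {True: 0.1, False: 0.9}}
P_C = {True: {True: 0.9, False: 0.1},          # P(C | B), keyed by value of B
       False: {True: 0.4, False: 0.6}}

def joint(a, b, c):
    """Joint probability from the product of conditional probabilities."""
    return P_A[a] * P_B[a][b] * P_C[b][c]

# All eight joint entries must sum to 1 if the CPTs are consistent.
total = sum(joint(a, b, c)
            for a in (True, False)
            for b in (True, False)
            for c in (True, False))
print(total)
```

The node names and numbers here are purely illustrative; the point is that any set of consistent CPTs combined this way yields a proper joint distribution.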

Part-C
1. Calculate the probability that alarm has sounded, but there is neither a burglary, CO2
nor an earthquake occurred, and David and Sophia both called the Harry.

Solution:

o The Bayesian network for this problem is given below. The network structure
shows that Burglary and Earthquake are the parent nodes of Alarm and directly
affect the probability of the alarm going off, while David's and Sophia's
calls depend only on the alarm.
o The network encodes the assumptions that the callers do not directly perceive
the burglary, do not notice the minor earthquake, and do not confer before
calling.
o The conditional distribution for each node is given as a conditional
probability table, or CPT.
o Each row in a CPT must sum to 1, because the entries in a row represent an
exhaustive set of cases for the variable.
o In a CPT, a Boolean variable with k Boolean parents has 2^k rows of
probabilities. Hence, if there are two parents, the CPT will contain 4
probability values.

List of all events occurring in this network:

o Burglary (B)
o Earthquake(E)
o Alarm(A)
o David Calls(D)
o Sophia calls(S)

We can write the events of the problem statement in the form of a probability,
P[D, S, A, B, E], and rewrite it using the joint probability distribution:

P[D, S, A, B, E]= P[D | S, A, B, E]. P[S, A, B, E]

=P[D | S, A, B, E]. P[S | A, B, E]. P[A, B, E]

= P [D| A]. P [ S| A, B, E]. P[ A, B, E]

= P[D | A]. P[ S | A]. P[A| B, E]. P[B, E]

= P[D | A ]. P[S | A]. P[A| B, E]. P[B |E]. P[E]


Let's take the observed probability for the Burglary and earthquake component:

P(B= True) = 0.002, which is the probability of burglary.

P(B= False)= 0.998, which is the probability of no burglary.

P(E= True)= 0.001, which is the probability of a minor earthquake

P(E= False)= 0.999, Which is the probability that an earthquake not occurred.

We can provide the conditional probabilities as per the below tables:

Conditional probability table for Alarm A:

The Conditional probability of Alarm A depends on Burglar and earthquake:

B E P(A= True) P(A= False)


True True 0.94 0.06

True False 0.95 0.05

False True 0.31 0.69

False False 0.001 0.999


Conditional probability table for David Calls:

The conditional probability that David calls depends on the state of the alarm.

A P(D= True) P(D= False)

True 0.91 0.09

False 0.05 0.95


Conditional probability table for Sophia Calls:

The conditional probability that Sophia calls depends on its parent node, "Alarm."

A P(S= True) P(S= False)

True 0.75 0.25

False 0.02 0.98


From the formula of joint distribution, we can write the problem statement in the form
of probability distribution:

P(S, D, A, ¬B, ¬E) = P (S|A) *P (D|A)*P (A|¬B ^ ¬E) *P (¬B) *P (¬E).

= 0.75* 0.91* 0.001* 0.998*0.999

= 0.00068045.
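As a sanity check, the arithmetic above can be reproduced in a few lines of Python (the variable names are ours; the probabilities come from the CPTs in the text):

```python
# P(S, D, A, ¬B, ¬E) = P(S|A) * P(D|A) * P(A|¬B,¬E) * P(¬B) * P(¬E)
p_not_B = 0.998              # P(¬B): no burglary
p_not_E = 0.999              # P(¬E): no earthquake
p_A_given_notB_notE = 0.001  # P(A | ¬B, ¬E): alarm with neither cause
p_D_given_A = 0.91           # P(D | A): David calls given alarm
p_S_given_A = 0.75           # P(S | A): Sophia calls given alarm

p = p_S_given_A * p_D_given_A * p_A_given_notB_notE * p_not_B * p_not_E
print(round(p, 8))  # 0.00068045
```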

Hence, a Bayesian network can answer any query about the domain by using the
joint distribution.

The semantics of Bayesian Network:

There are two ways to understand the semantics of the Bayesian network, which is
given below:

1. To understand the network as a representation of the joint probability
distribution. This view is helpful for understanding how to construct the
network.

2. To understand the network as an encoding of a collection of conditional
independence statements.
2. Discuss the Naïve Bayes classifier algorithm in detail. CO2

o Naïve Bayes algorithm is a supervised learning algorithm, based on Bayes'
theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional
training dataset.
o Naïve Bayes Classifier is one of the simple and most effective Classification
algorithms which helps in building the fast machine learning models that can
make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration,
Sentimental analysis, and classifying articles.

Why is it called Naïve Bayes?

The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which can be
described as:

o Naïve: It is called Naïve because it assumes that the occurrence of a certain
feature is independent of the occurrence of other features. For example, if a
fruit is identified on the basis of colour, shape, and taste, then a red,
spherical, and sweet fruit is recognized as an apple. Each feature individually
contributes to identifying it as an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes'
Theorem.

Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on
the conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) * P(A) / P(B)

Where,

P(A|B) is Posterior probability: Probability of hypothesis A on the observed event


B.

P(B|A) is Likelihood probability: Probability of the evidence given that the


probability of a hypothesis is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of Evidence.

Working of Naïve Bayes' Classifier:

Working of Naïve Bayes' Classifier can be understood with the help of the below
example:

Suppose we have a dataset of weather conditions and a corresponding target variable
"Play". Using this dataset, we need to decide whether we should play on a
particular day according to the weather conditions. To solve this problem, we
follow the steps below:

1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.

3. If the weather is sunny, should the player play or not? Calculate using the CO2
Naïve Bayes algorithm.

Outlook Play

0 Rainy Yes

1 Sunny Yes

2 Overcast Yes

3 Overcast Yes

4 Sunny No

5 Rainy Yes

6 Sunny Yes

7 Overcast Yes

8 Rainy No

9 Sunny No

10 Sunny Yes

11 Rainy No

12 Overcast Yes

13 Overcast Yes
Frequency table for the Weather Conditions:

Weather Yes No

Overcast 5 0

Rainy 2 2

Sunny 3 2

Total 10 4
Likelihood table for the weather conditions:

Weather No Yes P(Weather)

Overcast 0 5 5/14 = 0.35

Rainy 2 2 4/14 = 0.29

Sunny 2 3 5/14 = 0.35

All 4/14 = 0.29 10/14 = 0.71


Applying Bayes'theorem:

P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)

P(Sunny|Yes)= 3/10= 0.3

P(Sunny)= 0.35

P(Yes)=0.71

So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60

P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)

P(Sunny|NO)= 2/4=0.5

P(No)= 0.29

P(Sunny)= 0.35

So P(No|Sunny)= 0.5*0.29/0.35 = 0.41

So as we can see from the above calculation that P(Yes|Sunny)>P(No|Sunny)

Hence on a Sunny day, Player can play the game.
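The same calculation can be sketched in Python using the counts from the frequency table (10 "Yes" days, 4 "No" days, 14 total; the variable names are our own). Note that with unrounded fractions the two posteriors come out as exactly 0.60 and 0.40, slightly different from the hand-rounded 0.41 above:

```python
# Counts from the frequency table above.
n_yes, n_no, n_total = 10, 4, 14
sunny_yes, sunny_no, sunny_total = 3, 2, 5

p_yes = n_yes / n_total                 # P(Yes)   = 10/14
p_no = n_no / n_total                   # P(No)    = 4/14
p_sunny = sunny_total / n_total         # P(Sunny) = 5/14
p_sunny_given_yes = sunny_yes / n_yes   # P(Sunny | Yes) = 3/10
p_sunny_given_no = sunny_no / n_no      # P(Sunny | No)  = 2/4

# Bayes' theorem for each class.
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny

print(round(p_yes_given_sunny, 2), round(p_no_given_sunny, 2))
# P(Yes|Sunny) > P(No|Sunny), so the player can play.
```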

Advantages of Naïve Bayes Classifier:

o Naïve Bayes is a fast and easy ML algorithm for predicting the class of a
dataset.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other
Algorithms.
o It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:

o Naïve Bayes assumes that all features are independent or unrelated, so it
cannot learn relationships between features.

Applications of Naïve Bayes Classifier:

o It is used for credit scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an
eager learner.
o It is used in Text classification such as Spam filtering and Sentiment
analysis.

4. Explain the method of performing exact inference in Bayesian Networks. CO2


Exact Inference in Bayesian Networks

For inferencing in a probabilistic system, we must compute the posterior
probability distribution for a set of query variables, given some observed
events (that is, given values attached to the evidence variables).

• Notation revisited:
The notation used in inferencing is the same as that used in probability theory.

X: The query variable.

E: The set of evidence variables E1, ..., Em; e is the particular observed event.

Y: The set of non-evidence variables Y1, Y2, ..., Yk (non-evidence variables are
also called hidden variables).

The complete set of variables is X = {X} U E U Y.

• Generally the query asks for the posterior probability distribution P(X | e)
(assuming the query variable is not among the evidence variables; if it is, the
posterior distribution for X simply assigns probability 1 to the observed value).
Note that a query can contain more than one variable; for study purposes we
assume a single query variable.
• Example: In the burglary case, if the observed event is JCalls = true and
MCalls = true, the query is 'Has a burglary occurred?'

The probability distribution for this situation would be:

P(Burglary | JCalls = true, MCalls = true) = < 0.284, 0.716 >


2. Inference by Enumeration

A Bayesian network gives a complete representation of the full joint
distribution, which can be written as a product of conditional probabilities
from the network.

A query can therefore be answered using a Bayesian network by computing sums of
products of conditional probabilities from the network.

• The algorithm
The algorithm ENUMERATE-JOINT-ASK performs inference by enumerating over the
full joint distribution.

Characteristics of algorithm:

1) It takes as input a full joint distribution P and looks up values in it. (The
same algorithm can be modified to take a Bayesian network as input, computing
joint entries by multiplying the corresponding conditional probability table
entries from the network.)

2) ENUMERATE-JOINT-ASK uses the ENUMERATION-ASK (EA) algorithm, which evaluates
expressions using depth-first recursion. Therefore, the space complexity of EA is
only linear in the number of variables: the algorithm sums over the full joint
distribution without ever constructing it explicitly. The time complexity for a
network with n Boolean variables is always O(2^n), which is better than the
O(n * 2^n) required by the simple inferencing approach (using posterior
probability).

3) The drawback of the algorithm is that it keeps re-evaluating repeated
subexpressions, which wastes computation time.

The enumeration algorithm for answering queries on a Bayesian network:

• The algorithm

function ENUMERATION-ASK(X, e, bn) returns a distribution over X
  inputs: X, the query variable
          e, observed values for variables E
          bn, a Bayes net with variables {X} U E U Y  /* Y = hidden variables */

  Q(X) <- a distribution over X, initially empty
  for each value xi of X do
      extend e with value xi for X
      Q(xi) <- ENUMERATE-ALL(VARS[bn], e)
  return NORMALIZE(Q(X))

function ENUMERATE-ALL(vars, e) returns a real number
  if EMPTY?(vars) then return 1.0
  Y <- FIRST(vars)
  if Y has value y in e
      then return P(y | parents(Y)) x ENUMERATE-ALL(REST(vars), e)
      else return Σy P(y | parents(Y)) x ENUMERATE-ALL(REST(vars), ey)
           where ey is e extended with Y = y
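The enumeration pseudocode above can be sketched in Python for the alarm network of this unit. This is an illustrative implementation, not the textbook's code; the CPT values are taken from the tables earlier in this document, and the dictionary/function names are our own:

```python
# Variables in topological order; each CPT entry maps the current partial
# assignment (the evidence dict) to P(variable = True).
CPTS = {
    "B": lambda ev: 0.002,   # P(Burglary)
    "E": lambda ev: 0.001,   # P(Earthquake)
    "A": lambda ev: {(True, True): 0.94, (True, False): 0.95,
                     (False, True): 0.31, (False, False): 0.001}[(ev["B"], ev["E"])],
    "D": lambda ev: 0.91 if ev["A"] else 0.05,   # P(David calls | Alarm)
    "S": lambda ev: 0.75 if ev["A"] else 0.02,   # P(Sophia calls | Alarm)
}
ORDER = ["B", "E", "A", "D", "S"]

def enumerate_all(variables, ev):
    """Sum the joint probability over all hidden variables (ENUMERATE-ALL)."""
    if not variables:
        return 1.0
    y, rest = variables[0], variables[1:]
    p_true = CPTS[y](ev)
    if y in ev:  # evidence variable: use its observed value
        p = p_true if ev[y] else 1 - p_true
        return p * enumerate_all(rest, ev)
    # hidden variable: sum over both of its values
    return sum(p * enumerate_all(rest, {**ev, y: val})
               for val, p in ((True, p_true), (False, 1 - p_true)))

def enumeration_ask(query, evidence):
    """Return the normalised distribution <P(q=True|e), P(q=False|e)>."""
    dist = [enumerate_all(ORDER, {**evidence, query: v}) for v in (True, False)]
    z = sum(dist)
    return [d / z for d in dist]

print(enumeration_ask("B", {"D": True, "S": True}))
```

As a cross-check against the Part-C calculation, evaluating `enumerate_all` with every variable fixed (B = E = False, A = D = S = True) reproduces the joint probability 0.00068045 computed earlier.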
