EM Algorithm

The Expectation-Maximization algorithm is an iterative method for finding maximum likelihood estimates of parameters in statistical models that depend on unobserved latent variables. It alternates between an expectation (E) step, which computes the expectation of the likelihood including the latent variables, and a maximization (M) step, which computes parameter estimates that maximize the expected likelihood found in the E step. This process is repeated until convergence. The algorithm is useful for problems with missing or hidden data, such as clustering and estimating Hidden Markov Models. The likelihood is guaranteed not to decrease with each iteration, and for many problems the E and M steps are easy to implement. However, it can be slow to converge and only finds local optima.


Expectation-Maximization Algorithm
In real-world applications of machine learning, it is very common that many relevant features are available for learning but only a small subset of them is observable. For a variable that is sometimes observable and sometimes not, we can use the instances in which it is observed for learning and then predict its value in the instances in which it is not observed. The Expectation-Maximization algorithm can also be used for latent variables (variables that are never directly observed and are instead inferred from the values of other observed variables) to predict their values, provided that the general form of the probability distribution governing those latent variables is known to us. This algorithm is at the base of many unsupervised clustering algorithms in machine learning.
It was proposed, explained, and given its name in a 1977 paper by Arthur Dempster, Nan Laird, and Donald Rubin. It is used to find local maximum-likelihood parameters of a statistical model in cases where latent variables are involved and the data is missing or incomplete.
Algorithm:
1. Given a set of incomplete data, start from a set of initial parameters.
2. Expectation step (E-step): Using the observed data of the dataset, estimate (guess) the values of the missing data.
3. Maximization step (M-step): Use the complete data generated in the expectation (E) step to update the parameters.
4. Repeat step 2 and step 3 until convergence (a minimal code sketch of these steps is given just below).
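To make these four steps concrete, below is a minimal, illustrative Python/NumPy sketch of EM for a two-component, one-dimensional Gaussian mixture. The function name, starting values, and tolerance are assumptions made for illustration; they are not prescribed by the text above.

import numpy as np

def em_gmm_1d(x, n_iter=100, tol=1e-6):
    # Step 1: start from a set of guessed parameters (weights, means, variances).
    w = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: estimate the missing component labels as responsibilities
        # r[i, k] = P(component k | x_i, current parameters).
        pdf = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        weighted = w * pdf
        ll = np.log(weighted.sum(axis=1)).sum()            # current log-likelihood
        r = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the "completed" data.
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        # Step 4: stop once the log-likelihood has stopped improving.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return w, mu, var

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])
print(em_gmm_1d(data))

Each iteration the E-step fills in the hidden component memberships and the M-step re-fits the parameters, so the log-likelihood never decreases.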

The essence of the Expectation-Maximization algorithm is to use the available observed data of the dataset to estimate the missing data, and then use that completed data to update the values of the parameters. Let us understand the EM algorithm in detail.
 Initially, a set of initial values of the parameters is considered. A set of incomplete observed data is given to the system, with the assumption that the observed data comes from a specific model.
 The next step is known as the "Expectation" step or E-step. In this step, we use the observed data to estimate or guess the values of the missing or incomplete data. It is basically used to update the variables.
 The next step is known as the "Maximization" step or M-step. In this step, we use the complete data generated in the preceding "Expectation" step to update the values of the parameters. It is basically used to update the hypothesis.
 Finally, in the fourth step, it is checked whether the values are converging or not. If yes, we stop; otherwise we repeat step 2 and step 3, i.e. the "Expectation" step and the "Maximization" step, until convergence occurs.
 Flow chart for the EM algorithm (figure not included here).

Usage of EM algorithm –
 It can be used to fill in missing data in a sample.
 It can be used as the basis of unsupervised learning of clusters.
 It can be used for estimating the parameters of a Hidden Markov Model (HMM).
 It can be used for discovering the values of latent variables.
Advantages of EM algorithm –
 It is guaranteed that the likelihood will not decrease with each iteration.
 The E-step and M-step are often quite easy to implement for many problems.
 Solutions to the M-step often exist in closed form.

Disadvantages of EM algorithm –
 It has slow convergence.
 It converges to a local optimum only.
 It requires both the forward and backward probabilities (numerical optimization requires only the forward probability).

Bayesian Belief Network


A Bayesian Belief Network is a graphical representation of the probabilistic relationships among the random variables in a particular set. Each variable is conditionally independent of its non-descendants given its parents. Because the network factorizes the joint probability, each probability in a Bayesian Belief Network is derived from a condition, P(attribute | parent), i.e. the probability of an attribute given its parent attributes.
 Consider this example:

 In this example, we have an alarm 'A' (a node) installed in the house of a person 'gfg', which rings upon two events, burglary 'B' and fire 'F', which are the parent nodes of the alarm node. The alarm node is in turn the parent node of two person nodes, 'P1' and 'P2'.
 Upon hearing the alarm, 'P1' and 'P2' call the person 'gfg'. But there are a few caveats in this case: sometimes 'P1' may forget to call the person 'gfg' even after hearing the alarm, as he has a tendency to forget things quickly. Similarly, 'P2' sometimes fails to call the person 'gfg', as he is only able to hear the alarm from a certain distance.
Q) Find the probability that 'P1' is true (P1 has called 'gfg') and 'P2' is true (P2 has called 'gfg') when the alarm 'A' rang, but no burglary 'B' and no fire 'F' have occurred.
=> P(P1, P2, A, ~B, ~F) [where P1, P2, and A are 'true' events and '~B' and '~F' are 'false' events]
Burglary ‘B’ –
 P (B=T) = 0.001 (‘B’ is true i.e burglary has occurred)
 P (B=F) = 0.999  (‘B’ is false i.e burglary has not occurred)
Fire ‘F’ –
 P (F=T) = 0.002 (‘F’ is true i.e fire has occurred)
 P (F=F) = 0.998 (‘F’ is false i.e fire has not occurred)
Alarm ‘A’ –
B      F      P(A=T)    P(A=F)
T      T      0.95      0.05
T      F      0.94      0.06
F      T      0.29      0.71
F      F      0.001     0.999

 The alarm 'A' node can be 'true' or 'false' (i.e. it may have rung or may not have rung). It has two parent nodes, burglary 'B' and fire 'F', which can be 'true' or 'false' (i.e. may have occurred or may not have occurred) depending on different conditions.
 Person ‘P1’ –

A      P(P1=T)    P(P1=F)
T      0.95       0.05
F      0.05       0.95

 The person 'P1' node can be 'true' or 'false' (i.e. it may have called the person 'gfg' or not). It has a parent node, the alarm 'A', which can be 'true' or 'false' (i.e. it may have rung or may not have rung, upon burglary 'B' or fire 'F').

Person ‘P2’ –
A      P(P2=T)    P(P2=F)
T      0.80       0.20
F      0.01       0.99

 The person 'P2' node can be 'true' or 'false' (i.e. it may have called the person 'gfg' or not). It has a parent node, the alarm 'A', which can be 'true' or 'false' (i.e. it may have rung or may not have rung, upon burglary 'B' or fire 'F').
Solution: Considering the conditional probability tables above –
With respect to the question, P(P1, P2, A, ~B, ~F), we need the probability of 'P1', which we find with regard to its parent node, the alarm 'A'. To get the probability of 'P2', we again condition on its parent node, the alarm 'A'. We find the probability of the alarm 'A' node with regard to '~B' and '~F', since burglary 'B' and fire 'F' are the parent nodes of alarm 'A'.
From the conditional probability tables, we can deduce:
P(P1, P2, A, ~B, ~F)
= P(P1|A) * P(P2|A) * P(A|~B, ~F) * P(~B) * P(~F)
= 0.95 * 0.80 * 0.001 * 0.999 * 0.998
≈ 0.00076
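The same joint probability can be verified with a short script. This is only a sketch that hard-codes the conditional probability tables listed above; the dictionary names are illustrative.

# Conditional probability tables from the example above.
P_B = {True: 0.001, False: 0.999}                    # P(B)
P_F = {True: 0.002, False: 0.998}                    # P(F)
P_A = {(True, True): 0.95, (True, False): 0.94,      # P(A=T | B, F)
       (False, True): 0.29, (False, False): 0.001}
P_P1 = {True: 0.95, False: 0.05}                     # P(P1=T | A)
P_P2 = {True: 0.80, False: 0.01}                     # P(P2=T | A)

# Joint probability of P1=T, P2=T, A=T, B=F, F=F, factored along the network:
b, f, a = False, False, True
joint = P_P1[a] * P_P2[a] * P_A[(b, f)] * P_B[b] * P_F[f]
print(round(joint, 6))                               # ~0.000758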

Naïve Bayes Classifier

o The Naïve Bayes algorithm is a supervised learning algorithm, based on Bayes' theorem, that is used for solving classification problems.
o It is mainly used in text classification, which involves high-dimensional training datasets.
o The Naïve Bayes Classifier is one of the simplest and most effective classification algorithms; it helps in building fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
o Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.
Why is it called Naïve Bayes?
The name Naïve Bayes combines two words, Naïve and Bayes, which can be described as:

o Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of the other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying it as an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.

Bayes' Theorem:
o Bayes' theorem is also known as Bayes' rule or Bayes' law. It is used to determine the probability of a hypothesis with prior knowledge, and it depends on conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) * P(A) / P(B)

Where,

P(A|B) is the Posterior probability: the probability of hypothesis A given the observed event B.

P(B|A) is the Likelihood: the probability of the evidence given that the hypothesis is true.

P(A) is the Prior probability: the probability of the hypothesis before observing the evidence.

P(B) is the Marginal probability: the probability of the evidence.


Working of Naïve Bayes' Classifier:
The working of the Naïve Bayes classifier can be understood with the help of the example below:

Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we follow the steps below:

1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.

Problem: If the weather is sunny, should the player play or not?

Solution: To solve this, first consider the dataset below:

     Outlook    Play
0    Rainy      Yes
1    Sunny      Yes
2    Overcast   Yes
3    Overcast   Yes
4    Sunny      No
5    Rainy      Yes
6    Sunny      Yes
7    Overcast   Yes
8    Rainy      No
9    Sunny      No
10   Sunny      Yes
11   Rainy      No
12   Overcast   Yes
13   Overcast   Yes
Frequency table for the weather conditions:

Weather    Yes   No
Overcast   5     0
Rainy      2     2
Sunny      3     2
Total      10    4

Likelihood table for the weather conditions:

Weather    No            Yes           P(Weather)
Overcast   0             5             5/14 = 0.35
Rainy      2             2             4/14 = 0.29
Sunny      2             3             5/14 = 0.35
All        4/14 = 0.29   10/14 = 0.71

Applying Bayes' theorem:

P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)

P(Sunny|Yes) = 3/10 = 0.3

P(Sunny) = 5/14 = 0.35

P(Yes) = 10/14 = 0.71

So P(Yes|Sunny) = 0.3 * 0.71 / 0.35 ≈ 0.60

P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)

P(Sunny|No) = 2/4 = 0.5

P(No) = 4/14 = 0.29

P(Sunny) = 5/14 = 0.35

So P(No|Sunny) = 0.5 * 0.29 / 0.35 ≈ 0.41

As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).

Hence, on a sunny day, the player can play the game.
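The same counting argument can be checked with a few lines of Python. This is only an illustrative sketch: the two lists simply transcribe the dataset table, and with exact fractions it gives P(Yes|Sunny) = 0.60 and P(No|Sunny) = 0.40, matching the rounded figures above.

from collections import Counter

# Weather dataset from the table above.
outlook = ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy",
           "Sunny", "Overcast", "Rainy", "Sunny", "Sunny", "Rainy",
           "Overcast", "Overcast"]
play = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes",
        "No", "No", "Yes", "No", "Yes", "Yes"]

n = len(play)
class_counts = Counter(play)                  # {'Yes': 10, 'No': 4}
joint_counts = Counter(zip(outlook, play))    # e.g. ('Sunny', 'Yes'): 3

def posterior(weather, label):
    # P(label | weather) via Bayes' theorem on the frequency tables.
    prior = class_counts[label] / n                              # P(label)
    likelihood = joint_counts[(weather, label)] / class_counts[label]
    evidence = sum(1 for o in outlook if o == weather) / n       # P(weather)
    return likelihood * prior / evidence

print(posterior("Sunny", "Yes"))   # 0.6
print(posterior("Sunny", "No"))    # 0.4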

Advantages of Naïve Bayes Classifier:

o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
o It can be used for binary as well as multi-class classification.
o It performs well in multi-class predictions compared to many other algorithms.
o It is a very popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:


o Naive Bayes assumes that all features are independent or
unrelated, so it cannot learn the relationship between features.

Applications of Naïve Bayes Classifier:

o It is used for Credit Scoring.


o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes
Classifier is an eager learner.
o It is used in Text classification such as Spam
filtering and Sentiment analysis.

Types of Naïve Bayes Model:

There are three types of Naïve Bayes model, which are given below:

o Gaussian: The Gaussian model assumes that features follow a normal distribution. This means that if predictors take continuous values instead of discrete ones, the model assumes these values are sampled from a Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e. determining which category a particular document belongs to, such as sports, politics, education, etc. The classifier uses the frequencies of words as the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present in a document or not. This model is also well known for document classification tasks. All three variants are sketched in code below.
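For reference, all three variants are available in scikit-learn (assuming that library is installed). The toy feature matrices below are made-up values used purely for illustration.

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

# Continuous features -> Gaussian Naive Bayes.
X_cont = np.array([[1.0, 2.1], [0.9, 1.8], [5.2, 6.0], [5.0, 5.9]])
print(GaussianNB().fit(X_cont, y).predict([[1.1, 2.0]]))       # likely class 0

# Word-count features -> Multinomial Naive Bayes.
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 2], [0, 3, 1]])
print(MultinomialNB().fit(X_counts, y).predict([[0, 2, 1]]))   # likely class 1

# Binary word-presence features -> Bernoulli Naive Bayes.
X_bin = (X_counts > 0).astype(int)
print(BernoulliNB().fit(X_bin, y).predict([[1, 0, 0]]))        # likely class 0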

Bayes Theorem
Machine Learning is one of the most rapidly emerging technologies of Artificial Intelligence. We are living in the 21st century, which is driven by new technologies and gadgets, some of which are yet to be used and some of which are already at their full potential. Many concepts make machine learning a better technology, such as supervised learning, unsupervised learning, reinforcement learning, perceptron models, neural networks, etc. In this article, "Bayes Theorem in Machine Learning", we will discuss another very important concept of machine learning: Bayes' theorem. Before starting this topic, you should gain an essential understanding of this theorem, such as what exactly Bayes' theorem is, why it is used in machine learning, and examples of Bayes' theorem in machine learning. So, let's start with a brief introduction to Bayes' theorem.

Introduction to Bayes Theorem in Machine Learning
Bayes' theorem is named after an English statistician, philosopher, and Presbyterian minister, Thomas Bayes, who worked in the 18th century. Bayes contributed to decision theory, which is extensively used in important mathematical concepts such as probability. Bayes' theorem is also widely used in machine learning, where we need to predict classes precisely and accurately. An important concept based on Bayes' theorem, the Bayesian method, is used to calculate conditional probability in machine learning applications that include classification tasks. Further, a simplified version of Bayes' theorem (Naïve Bayes classification) is also used to reduce computation time and the average cost of projects.

Bayes' theorem is also known by other names, such as Bayes' rule or Bayes' law. Bayes' theorem helps to determine the probability of an event based on prior knowledge. It is used to calculate the probability of one event occurring given that another event has already occurred. It relates conditional probability and marginal probability.
In simple words, we can say that Bayes' theorem helps produce more accurate results.

Bayes' theorem is used to estimate the precision of values and provides a method for calculating conditional probability. Although it is, on paper, a simple calculation, it is used to easily calculate the conditional probability of events where intuition often fails. Some data scientists assume that Bayes' theorem is mostly used in the financial industries, but this is not the case. Beyond finance, Bayes' theorem is also extensively applied in health and medicine, research and survey industries, the aeronautical sector, etc.

What is Bayes Theorem?

Bayes' theorem is one of the most popular machine learning concepts. It helps to calculate the probability of one event occurring, with uncertain knowledge, given that another event has already occurred.

Bayes' theorem can be derived using the product rule and the conditional probability of event X given event Y:

o According to the product rule, we can express the probability of event X together with a known event Y as follows:

P(X ∩ Y) = P(X|Y) P(Y)       {equation 1}

o Further, the probability of event Y together with a known event X is:

P(X ∩ Y) = P(Y|X) P(X)       {equation 2}

Mathematically, Bayes' theorem is obtained by equating the right-hand sides of the two equations and dividing by P(Y):

P(X|Y) = P(Y|X) P(X) / P(Y)

The above equation is called Bayes' rule or Bayes' theorem.

o P(X|Y) is called the posterior, which we need to calculate. It is defined as the updated probability after considering the evidence.
o P(Y|X) is called the likelihood. It is the probability of the evidence when the hypothesis is true.
o P(X) is called the prior probability, the probability of the hypothesis before considering the evidence.
o P(Y) is called the marginal probability. It is defined as the probability of the evidence under any consideration.

Hence, Bayes Theorem can be written as:

posterior = likelihood * prior / evidence
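Written as code, this relationship is a one-line function. The numbers in the example are arbitrary illustrative values, not figures taken from the text.

def bayes_posterior(likelihood, prior, evidence):
    # posterior = likelihood * prior / evidence
    return likelihood * prior / evidence

# Example: P(Y|X) = 0.9, P(X) = 0.01, P(Y) = 0.05  ->  P(X|Y) = 0.18
print(bayes_posterior(0.9, 0.01, 0.05))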

Prerequisites for Bayes Theorem

While studying Bayes' theorem, we need to understand a few important concepts. These are as follows:

1. Experiment

An experiment is defined as a planned operation carried out under controlled conditions, such as tossing a coin, drawing a card, or rolling a die.

2. Sample Space
What we get as a result during an experiment is called an outcome, and the set of all possible outcomes of an experiment is known as the sample space. For example, if we are rolling a die, the sample space will be:

S1 = {1, 2, 3, 4, 5, 6}

Similarly, if our experiment consists of tossing a coin and recording its outcome, then the sample space will be:

S2 = {Head, Tail}

3. Event

An event is defined as a subset of the sample space of an experiment. Further, it is also called a set of outcomes.

Assume that in our experiment of rolling a die, there are two events A and B such that:

A = Event that an even number is obtained = {2, 4, 6}

B = Event that a number greater than 4 is obtained = {5, 6}

o Probability of the event A, P(A) = Number of favourable outcomes / Total number of possible outcomes
P(A) = 3/6 = 1/2 = 0.5
o Similarly, probability of the event B, P(B) = Number of favourable outcomes / Total number of possible outcomes
P(B) = 2/6 = 1/3 ≈ 0.333
o Union of events A and B:
A ∪ B = {2, 4, 5, 6}
o Intersection of events A and B:
A ∩ B = {6}

o Disjoint events: If the intersection of events A and B is the empty set, then such events are known as disjoint events, also called mutually exclusive events.

4. Random Variable:
A random variable is a real-valued function that maps the sample space of an experiment to the real line. A random variable takes on various values, each with some probability. Despite its name, it is neither random nor a variable: it behaves as a function, and it can be discrete, continuous, or a combination of both.

5. Exhaustive Events:

As the name suggests, a set of events in which at least one event must occur is called an exhaustive set of events for an experiment.

Thus, two events A and B are said to be exhaustive if either A or B definitely occurs; if they are also mutually exclusive, exactly one of them occurs. For example, while tossing a coin, the outcome will be either a Head or a Tail.

6. Independent Events:

Two events are said to be independent when the occurrence of one event does not affect the occurrence of the other. In simple words, the probability of the outcome of one event does not depend on the other.

Mathematically, two events A and B are said to be independent if:

P(A ∩ B) = P(AB) = P(A) * P(B)

7. Conditional Probability:

Conditional probability is defined as the probability of an event A, given that another event B has already occurred (i.e. A conditioned on B). It is written P(A|B) and defined as:

P(A|B) = P(A ∩ B) / P(B)

8. Marginal Probability:

Marginal probability is defined as the probability of an event A occurring irrespective of any other event B. It is the probability of the evidence under any consideration.

P(A) = P(A|B) * P(B) + P(A|~B) * P(~B)

Here ~B represents the event that B does not occur.
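Using the die events A = {2, 4, 6} and B = {5, 6} defined earlier, both the conditional-probability and the marginal-probability formulas can be checked with a small illustrative sketch:

from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}                 # sample space of a fair die
A = {2, 4, 6}                          # even number
B = {5, 6}                             # number greater than 4
not_B = S - B

P = lambda E: Fraction(len(E), len(S)) # uniform probability on the die

# Conditional probability: P(A|B) = P(A ∩ B) / P(B)
print(P(A & B) / P(B))                                                # 1/2

# Marginal probability: P(A) = P(A|B) P(B) + P(A|~B) P(~B)
print(P(A & B) / P(B) * P(B) + P(A & not_B) / P(not_B) * P(not_B))    # 1/2

Both lines print 1/2, which matches P(A) = 3/6 computed earlier.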

How to apply Bayes Theorem or Bayes rule in Machine Learning?
Bayes' theorem lets us calculate the single term P(B|A) in terms of P(A|B), P(B), and P(A). This rule is very helpful in scenarios where we have good estimates of P(A|B), P(B), and P(A) and need to determine the fourth term.

The Naïve Bayes classifier is one of the simplest applications of Bayes' theorem, used in classification algorithms to assign data points to classes quickly and accurately.

Let's understand the use of Bayes' theorem in machine learning with the example below.

Suppose we have a feature vector A with i attributes, that is:

A = A1, A2, A3, A4, ..., Ai

Further, we have n classes, represented as C1, C2, C3, C4, ..., Cn.

These two pieces are given to us, and our machine learning classifier has to predict the class of A, choosing the best possible class. With the help of Bayes' theorem, we can write this as:

P(Ci|A) = [ P(A|Ci) * P(Ci) ] / P(A)

Here:

P(A) is the class-independent term.

P(A) remains constant across all classes, meaning its value does not change with the class. Therefore, to maximize P(Ci|A), we only have to maximize the term P(A|Ci) * P(Ci).

With n classes on the probability list, let's assume that every class is equally likely to be the right answer. Considering this, we can say that:

P(C1) = P(C2) = P(C3) = P(C4) = ... = P(Cn)

This assumption helps us reduce the computation cost as well as the time.


This is how Bayes' theorem plays a significant role in machine learning, and the Naïve Bayes assumption simplifies the conditional probability computation without greatly affecting precision. Under that assumption, we can conclude that:

P(A|Ci) = P(A1|Ci) * P(A2|Ci) * P(A3|Ci) * ... * P(An|Ci)

Hence, by using Bayes' theorem in machine learning, we can easily express the probability of a larger event in terms of the probabilities of smaller events.
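A small sketch of that maximization, assuming the per-class conditional probabilities P(Aj|Ci) have already been estimated; every number below is an illustrative placeholder rather than a value from the text.

# Illustrative priors and per-attribute conditional probabilities.
priors = {"C1": 0.5, "C2": 0.5}        # P(Ci), here assumed equal
cond = {                               # P(Aj | Ci), made-up values
    "C1": [0.8, 0.3, 0.6],
    "C2": [0.2, 0.7, 0.4],
}

def score(ci):
    # P(A|Ci) * P(Ci) under the naive independence assumption.
    p = priors[ci]
    for p_attr in cond[ci]:
        p *= p_attr
    return p

# P(A) is the same for every class, so the class maximizing score(Ci)
# also maximizes the posterior P(Ci|A).
best = max(priors, key=score)
print(best, score(best))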
