EM Algorithm
In real-world applications of machine learning, it is very common that many relevant features are available for learning, but only a small subset of them are observable. For a variable that is sometimes observable and sometimes not, we can use the instances in which it is observed for learning, and then predict its value in the instances in which it is not observed. The Expectation-Maximization (EM) algorithm, on the other hand, can also be used for latent variables (variables that are never directly observed and are instead inferred from the values of the other observed variables) in order to predict their values, with the condition that the general form of the probability distribution governing those latent variables is known to us. This algorithm is at the base of many unsupervised clustering algorithms in the field of machine learning.
It was explained, proposed, and given its name in a paper published in 1977 by Arthur Dempster, Nan Laird, and Donald Rubin. It is used to find the local maximum-likelihood parameters of a statistical model in cases where latent variables are involved and the data is missing or incomplete.
Algorithm:
1. Given a set of incomplete data, consider a set of starting
parameters.
2. Expectation step (E-step): Using the observed data of the dataset, estimate (guess) the values of the missing data.
3. Maximization step (M-step): The complete data generated after the expectation (E) step is used to update the parameters.
4. Repeat steps 2 and 3 until convergence.
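As a concrete illustration of these four steps, here is a minimal sketch that runs EM on a two-component one-dimensional Gaussian mixture, where the missing data are the unobserved component labels. The synthetic data, starting values, and fixed iteration count are illustrative assumptions, not part of the algorithm description above.

import numpy as np

# Illustrative data: a 1-D sample drawn from two overlapping Gaussians.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])

# Step 1: starting parameters (means, variances, mixing weights).
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])
weight = np.array([0.5, 0.5])

def gaussian(x, mean, variance):
    return np.exp(-(x - mean) ** 2 / (2 * variance)) / np.sqrt(2 * np.pi * variance)

for _ in range(50):  # Step 4: repeat the E and M steps until convergence
    # Step 2 (E-step): estimate the responsibility of each component for
    # every point, i.e. "guess" the values of the missing labels.
    resp = np.stack([w * gaussian(x, m, v)
                     for w, m, v in zip(weight, mu, var)], axis=1)
    resp /= resp.sum(axis=1, keepdims=True)

    # Step 3 (M-step): use the completed data to update the parameters.
    nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    weight = nk / len(x)

print("means:", mu, "variances:", var, "weights:", weight)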
Usage of EM algorithm –
It can be used to fill the missing data in a sample.
It can be used as the basis of unsupervised learning of
clusters.
It can be used for the purpose of estimating the
parameters of Hidden Markov Model (HMM).
It can be used for discovering the values of latent
variables.
Advantages of EM algorithm –
It is guaranteed that the likelihood will not decrease with each iteration.
The E-step and M-step are often quite easy to implement for many problems.
Solutions to the M-step often exist in closed form.
Disadvantages of EM algorithm –
It has slow convergence.
It converges to a local optimum only.
It requires both the forward and backward probabilities (numerical optimization requires only the forward probability).
Example: Consider a Bayesian network in which an alarm 'A' depends on two parent nodes, burglary 'B' and fire 'F', and two persons 'P1' and 'P2' may call a person 'gfg' when the alarm rings. We want to compute P (P1, P2, A, ~B, ~F).
Alarm 'A' –
B F P (A=T) P (A=F)
T T 0.95 0.05
T F 0.94 0.06
F T 0.29 0.71
F F 0.001 0.999
The alarm 'A' node can be 'true' or 'false' (i.e. it may or may not have rung). It has two parent nodes, burglary 'B' and fire 'F', which can each be 'true' or 'false' (i.e. may or may not have occurred) depending upon different conditions.
Person ‘P1’ –
A P (P1=T) P (P1=F)
T 0.95 0.05
F 0.05 0.95
The person 'P1' node can be 'true' or 'false' (i.e. P1 may or may not have called the person 'gfg'). It has a parent node, the alarm 'A', which can be 'true' or 'false' (i.e. may or may not have rung upon burglary 'B' or fire 'F').
Person ‘P2’ –
A P (P2=T) P (P2=F)
T 0.80 0.20
F 0.01 0.99
The person 'P2' node can be 'true' or 'false' (i.e. P2 may or may not have called the person 'gfg'). It has a parent node, the alarm 'A', which can be 'true' or 'false' (i.e. may or may not have rung upon burglary 'B' or fire 'F').
Solution: Considering the observed probability tables –
With respect to the question, P (P1, P2, A, ~B, ~F), we need to get the probability of 'P1', which we find with regard to its parent node, the alarm 'A'. To get the probability of 'P2', we likewise find it with regard to its parent node, the alarm 'A'.
We find the probability of the alarm 'A' node with regard to '~B' and '~F', since burglary 'B' and fire 'F' are the parent nodes of alarm 'A'.
From the tables above, we can deduce –
P (P1, P2, A, ~B, ~F)
= P (P1|A) * P (P2|A) * P (A|~B,~F) * P (~B) * P (~F)
= 0.95 * 0.80 * 0.001 * 0.999 * 0.998
= 0.00075
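The same calculation can be reproduced with a few dictionaries holding the tables above. This is only a sketch; the prior values P(B=T) = 0.001 and P(F=T) = 0.002 are read off from the P(~B) and P(~F) factors used in the calculation.

# Conditional probability tables taken from the example above.
p_b = {True: 0.001, False: 0.999}   # burglary prior, P(~B) = 0.999
p_f = {True: 0.002, False: 0.998}   # fire prior, P(~F) = 0.998
p_a_true = {(True, True): 0.95, (True, False): 0.94,
            (False, True): 0.29, (False, False): 0.001}   # P(A=T | B, F)
p_p1_true = {True: 0.95, False: 0.05}   # P(P1=T | A)
p_p2_true = {True: 0.80, False: 0.01}   # P(P2=T | A)

# P(P1, P2, A, ~B, ~F) = P(P1|A) * P(P2|A) * P(A|~B,~F) * P(~B) * P(~F)
joint = (p_p1_true[True] * p_p2_true[True]
         * p_a_true[(False, False)] * p_b[False] * p_f[False])
print(joint)   # about 0.00076, which the text rounds to 0.00075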
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law,
which is used to determine the probability of a hypothesis with
prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is the posterior probability: the probability of hypothesis A given the observed event B.
P(B|A) is the likelihood probability: the probability of the evidence B given that hypothesis A is true.
P(A) is the prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the marginal probability: the probability of the evidence.
Problem: If the weather is sunny, should the player play or not?
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table of the weather conditions:
Weather Yes No
Overcast 5 0
Rainy 2 2
Sunny 3 2
Total 10 4
Likelihood table of the weather conditions:
Weather No Yes P(Weather)
Overcast 0 5 5/14=0.36
Rainy 2 2 4/14=0.29
Sunny 2 3 5/14=0.35
All 4/14=0.29 10/14=0.71
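Both tables can be derived mechanically from the 14-row dataset above. A minimal sketch (the dataset is simply re-typed here as a Python list):

from collections import Counter

# The 14 (Outlook, Play) rows from the dataset above.
rows = [("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"),
        ("Sunny", "No"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Rainy", "No"), ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
        ("Overcast", "Yes"), ("Overcast", "Yes")]

freq = Counter(rows)                          # (outlook, play) -> count
outlook_totals = Counter(o for o, _ in rows)  # outlook -> count

# Print one likelihood-table row per outlook: Yes count, No count, P(Weather).
for outlook, n in outlook_totals.items():
    print(outlook, freq[(outlook, "Yes")], freq[(outlook, "No")], round(n / len(rows), 2))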
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes)*P(Yes)/P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.3*0.71/0.35 = 0.60
P(No|Sunny) = P(Sunny|No)*P(No)/P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5*0.29/0.35 = 0.41
Since P(Yes|Sunny) > P(No|Sunny), the player should play on a sunny day.
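The comparison can also be checked directly from the counts in the frequency table. The sketch below recomputes both posteriors from the raw counts; the small differences from the hand calculation above come only from rounding.

# Play counts per outlook, copied from the frequency table above.
counts = {"Overcast": {"Yes": 5, "No": 0},
          "Rainy":    {"Yes": 2, "No": 2},
          "Sunny":    {"Yes": 3, "No": 2}}
total = 14
n_play = {"Yes": 10, "No": 4}

def posterior(play, outlook):
    # Bayes' theorem: P(play | outlook) = P(outlook | play) * P(play) / P(outlook)
    likelihood = counts[outlook][play] / n_play[play]
    prior = n_play[play] / total
    evidence = sum(counts[outlook].values()) / total
    return likelihood * prior / evidence

print(posterior("Yes", "Sunny"))   # 0.6  -> the player should play
print(posterior("No", "Sunny"))    # 0.4  (0.41 with the rounded figures above)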
Bayes Theorem
Machine Learning is one of the most rapidly emerging technologies of Artificial Intelligence. We are living in the 21st century, which is completely driven by new technologies and gadgets, some of which are yet to be used and few of which are being used to their full potential. Similarly, Machine Learning is also a technology that is still in its developing phase. There are lots of concepts that make machine learning a better technology, such as supervised learning, unsupervised learning, reinforcement learning, perceptron models, neural networks, etc. In this article, "Bayes Theorem in Machine Learning", we will discuss another important concept of Machine Learning: Bayes' theorem. Before starting this topic, you should gain an essential understanding of this theorem: what exactly Bayes' theorem is, why it is used in Machine Learning, examples of Bayes' theorem in Machine Learning, and more. So, let's start with a brief introduction to Bayes' theorem.
Bayes' theorem is also known by other names, such as Bayes' rule or Bayes' law. Bayes' theorem helps to determine the probability of an event given prior knowledge. It is used to calculate the probability of one event occurring when another event has already occurred. It is the best method for relating conditional probability and marginal probability.
In simple words, we can say that Bayes' theorem helps to produce more accurate results by updating probabilities as new evidence is observed.
o The probability of event X with known event Y:
P(X ∩ Y) = P(X|Y) * P(Y) {equation 1}
o Further, the probability of event Y with known event X:
P(X ∩ Y) = P(Y|X) * P(X) {equation 2}
o Equating the right-hand sides of equations 1 and 2 and solving for P(Y|X) gives Bayes' theorem:
P(Y|X) = P(X|Y) * P(Y) / P(X)
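A quick numerical check of this algebra, using hypothetical values for P(Y), P(X|Y), and P(X) (these numbers are not from the text and are only for illustration):

# Hypothetical probabilities, chosen only to verify the algebra above.
p_y = 0.4           # P(Y)
p_x_given_y = 0.75  # P(X|Y)
p_x = 0.5           # P(X)

p_xy = p_x_given_y * p_y   # equation 1: P(X ∩ Y) = P(X|Y) * P(Y)
p_y_given_x = p_xy / p_x   # from equation 2: P(Y|X) = P(X ∩ Y) / P(X)
print(round(p_xy, 2), round(p_y_given_x, 2))   # 0.3 0.6, i.e. P(Y|X) = P(X|Y) * P(Y) / P(X)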
Some important terms used in Bayes' theorem:
1. Experiment
An experiment is a planned operation carried out under controlled conditions, such as tossing a coin or rolling a dice.
2. Sample Space
What we get as a result during an experiment is called a possible outcome, and the set of all possible outcomes of an experiment is known as the sample space. For example, if we are rolling a dice, the sample space will be:
S1 = {1, 2, 3, 4, 5, 6}
and if we are tossing a coin, the sample space will be:
S2 = {Head, Tail}
3. Event
An event is a subset of the sample space of an experiment. Assume that in our experiment of rolling a dice there is an event A such that:
A = Event when an even number is obtained = {2, 4, 6}
4. Random Variable:
It is a real-valued function that maps the sample space of an experiment onto the real line. A random variable takes on random values, each with some probability. However, it is neither random nor a variable; it behaves as a function, which can be discrete, continuous, or a combination of both.
5. Exhaustive Event:
As the name suggests, a set of events in which at least one event occurs at a time is called an exhaustive event of an experiment.
6. Independent Event:
Two events are said to be independent when the occurrence of one event does not affect the occurrence of the other.
7. Conditional Probability:
Conditional probability is the probability of an event A given that another event B has already occurred, written as P(A|B).
8. Marginal Probability:
Marginal probability is the probability of an event occurring irrespective of the outcome of any other event, e.g. P(A).
For classification with classes C1, C2, ..., Cn and an observed attribute set A, Bayes' theorem gives:
P(Ci|A) = P(A|Ci) * P(Ci) / P(A)
Here,
P(A) will remain constant throughout the classes, meaning it does not change its value with respect to a change in class. Therefore, to maximize P(Ci|A), we have to maximize the value of the term P(A|Ci) * P(Ci).
With n classes on the probability list, let's assume that the possibility of any class being the correct answer is equally likely. Considering this factor, we can say that:
P(C1) = P(C2) = P(C3) = P(C4) = ... = P(Cn).
In that case, maximizing P(A|Ci) * P(Ci) reduces to simply maximizing P(A|Ci).
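A small sketch of what this means in code, using hypothetical likelihood values P(A|Ci) (the class names and numbers below are made up purely for illustration): with equal priors, picking the class that maximizes P(A|Ci) * P(Ci) is the same as picking the class with the largest P(A|Ci).

# Hypothetical class-conditional likelihoods P(A | Ci) for one observation A.
likelihoods = {"C1": 0.10, "C2": 0.45, "C3": 0.25}

n = len(likelihoods)
prior = 1.0 / n   # equal priors: P(C1) = P(C2) = ... = P(Cn)

# P(A) is constant across classes, so it can be dropped from the comparison.
scores = {c: lik * prior for c, lik in likelihoods.items()}
best = max(scores, key=scores.get)
print(best)   # "C2", the class with the largest P(A|Ci)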