
UNIT-4

Uncertainty Measure: Probability Theory

Uncertainty in artificial intelligence (AI) refers to the inherent limitations in predictions, decisions, or
classifications due to incomplete, ambiguous, or noisy data, as well as model limitations. AI systems, especially
those employing machine learning, often encounter uncertainty when dealing with real-world data that may
be imperfect or incomplete. Managing uncertainty is crucial to ensure robust, reliable, and accurate
performance in AI applications.

There are several ways to model uncertainty in AI. Bayesian approaches quantify uncertainty by treating
model parameters as probabilistic entities, offering confidence intervals or probability distributions for
predictions. Fuzzy logic addresses uncertainty by allowing partial truth values between 0 and 1, making it
useful for systems where binary decisions (true/false) are inadequate. Probabilistic graphical models like
Hidden Markov Models or Bayesian Networks handle uncertainty by modelling relationships between
variables and their likelihoods.

Additionally, deep learning models handle uncertainty through techniques like dropout as a regularization
method, which can be interpreted to provide uncertainty estimates in predictions. Uncertainty measures play
a critical role in applications like autonomous systems, healthcare, and decision-making processes, where
incorrect or overconfident predictions can have significant consequences.
Uncertainty in artificial intelligence (AI) refers to the lack of complete information or the presence of
variability in data and models. Understanding and modeling uncertainty is crucial for making informed
decisions and improving the robustness of AI systems. There are several types of uncertainty in AI, including:
1. Aleatoric Uncertainty: This type of uncertainty arises from the inherent randomness or variability
in data. It is often referred to as “data uncertainty.” For example, in a classification task, aleatoric
uncertainty may arise from variations in sensor measurements or noisy labels.
2. Epistemic Uncertainty: Epistemic uncertainty is related to the lack of knowledge or information
about a model. It represents uncertainty that can potentially be reduced with more data or better
modeling techniques. It is also known as “model uncertainty” and arises from model limitations, such
as simplifications or assumptions.
3. Parameter Uncertainty: This type of uncertainty is specific to probabilistic models, such as Bayesian
neural networks. It reflects uncertainty about the values of model parameters and is characterized by
probability distributions over those parameters.
4. Uncertainty in Decision-Making: Uncertainty in AI systems can affect the decision-making process.
For instance, in reinforcement learning, agents often need to make decisions in environments with
uncertain outcomes, leading to decision-making uncertainty.
5. Uncertainty in Natural Language Understanding: In natural language processing (NLP),
understanding and generating human language can be inherently uncertain due to language
ambiguity, polysemy (multiple meanings), and context-dependent interpretations.
6. Uncertainty in Probabilistic Inference: Bayesian methods and probabilistic graphical models are
commonly used in AI to model uncertainty. Uncertainty can arise from the process of probabilistic
inference itself, affecting the reliability of model predictions.
7. Uncertainty in Reinforcement Learning: In reinforcement learning, uncertainty may arise from the
stochasticity of the environment or the exploration-exploitation trade-off. Agents must make
decisions under uncertainty about the outcomes of their actions.
8. Uncertainty in Autonomous Systems: Autonomous systems, such as self-driving cars or drones,
must navigate uncertain and dynamic environments. This uncertainty can pertain to the movement
of other objects, sensor measurements, and control actions.
9. Uncertainty in Safety-Critical Systems: In applications where safety is paramount, such as
healthcare or autonomous vehicles, managing uncertainty is critical. Failure to account for uncertainty
can lead to dangerous consequences.
10. Uncertainty in Transfer Learning: When transferring a pre-trained AI model to a new domain or
task, uncertainty can arise due to domain shift or differences in data distributions. Understanding this
uncertainty is vital for adapting the model effectively.
11. Uncertainty in Human-AI Interaction: When AI systems interact with humans, there can be
uncertainty in understanding and responding to human input, as well as uncertainty in predicting
human behavior and preferences.
Addressing and quantifying these various types of uncertainty is an ongoing research area in AI, and
techniques such as probabilistic modeling, Bayesian inference, and Monte Carlo methods are commonly used
to manage and mitigate uncertainty in AI systems.
Techniques for Addressing Uncertainty in AI
We’ve just discussed the different types of uncertainty in AI. Now, let’s switch gears and learn techniques for
addressing uncertainty in AI. It’s like going from understanding the problem to finding solutions for it.

Probabilistic Logic Programming


Probabilistic logic programming (PLP) combines logic programming with probability theory to handle uncertainty in computer programs. This is useful when programmers are not completely certain about the facts and rules they are working with. PLP attaches probabilities to facts and rules, which helps systems make decisions and learn from data. It can be realized through several formalisms, such as Bayesian logic programs or Markov logic networks. PLP is applied in various areas of artificial intelligence, such as reasoning under uncertainty, planning under risk, and building probabilistic relational and graphical models.
Fuzzy Logic Programming
To deal with uncertainty in logic programming, there’s a method called fuzzy logic programming (FLP). FLP
combines regular logic with something called “fuzzy” logic. This helps programmers express things that are
a bit unclear or not black and white. FLP also helps them make decisions and learn from this uncertain
information. They can use different ways to do FLP, like fuzzy Prolog, fuzzy answer set programming, and
fuzzy description logic. FLP is useful in various areas of artificial intelligence, like understanding language,
working with images, and making decisions when things are not very clear.
Probability Theory
Introduction to Probabilistic Reasoning
Probabilistic reasoning provides a mathematical framework for representing and manipulating uncertainty.
Unlike deterministic systems, which operate under the assumption of complete and exact information,
probabilistic systems acknowledge that the real world is fraught with uncertainties. By employing
probabilities, AI systems can make informed decisions even in the face of ambiguity.
Need for Probabilistic Reasoning in AI
Probabilistic reasoning in artificial intelligence is important for many tasks, such as:
• Machine Learning: Helps algorithms learn from data that may be incomplete or noisy.
• Robotics: Gives robots the capability to act in, and interact with, dynamic and uncertain environments.
• Natural Language Processing: Gives computers an understanding of human language in all its ambiguity and sensitivity to context.
• Decision-Making Systems: Enables AI systems to make well-informed decisions and judgments by considering the likelihood of alternative outcomes.
Probabilistic reasoning allows uncertainty to be represented explicitly, so the AI system can operate sensibly in the real world and make effective predictions.
Key Concepts in Probabilistic Reasoning
1. Bayesian Networks
• Imagine a detective board that links suspects, motives, and evidence with strings. That, in a nutshell, is the intuition behind a Bayesian network: a graphical model showing the relationships between variables and their conditional probabilities.
• Advantages: Bayesian networks are very effective for expressing cause and effect and for reasoning about missing information. They have found wide application in medical diagnosis, where symptoms are modeled as variables that have different degrees of association with diseases, which are modeled as other variables.
2. Markov Models
• Consider a weather forecast. A Markov model predicts the future state of a system from its current state alone (the Markov property: the past is irrelevant once the current state is known). For instance, in a simple Markov model of weather, the probability that a sunny day will be followed by another sunny day is greater than the probability that it will be followed by a rainy day.
• Advantages: Markov models are effective and easy to implement. They are widely used, for example in speech recognition and in language modeling, where the probability of the next word is predicted from the preceding word(s); a minimal simulation sketch follows this list.
3. Hidden Markov Models (HMMs)
• Consider a weather-prediction scenario in which some influences, such as humidity, cannot be observed directly. Hidden Markov Models are a generalization of Markov models in which the underlying states are hidden and only their effects are observed.
• Advantages: HMMs are very powerful when hidden variables must be taken into account. Typical tasks include stock market prediction, where the factors that drive prices are not fully observable.
4. Probabilistic Graphical Models
• Probabilistic Graphical Models give a broader framework encompassing both Bayesian networks and
HMMs. In general, PGMs are an approach for representation and reasoning in a framework of
uncertain information, given in graphical structure.
• Advantages: PGMs offer a powerful, flexible, and expressive language for doing probabilistic
reasoning, which is well suited for complex relationships that may capture many different types of
uncertainty.
These techniques are not mutually exclusive; they can be combined and extended to handle increasingly specific problems in AI. The particular technique chosen will depend on the character of the uncertainty and the type of result that is sought. In turn, probabilistic reasoning allows AI systems to make predictions whose uncertainty is quantified, leading to more robust and reliable decision-making.
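To make the Markov property concrete, here is a minimal simulation sketch in Python of the sunny/rainy weather model described above. The transition probabilities are invented for illustration only.

import random

# Assumed (illustrative) transition probabilities: P(next state | current state)
transition = {
    "Sunny": {"Sunny": 0.8, "Rainy": 0.2},
    "Rainy": {"Sunny": 0.4, "Rainy": 0.6},
}

def next_state(current):
    # Sample the next state using only the current state (the Markov property).
    r, cumulative = random.random(), 0.0
    for state, prob in transition[current].items():
        cumulative += prob
        if r < cumulative:
            return state
    return state  # guard against floating-point rounding

def simulate(start="Sunny", days=10):
    states = [start]
    for _ in range(days - 1):
        states.append(next_state(states[-1]))
    return states

print(simulate())  # e.g. ['Sunny', 'Sunny', 'Rainy', 'Rainy', 'Sunny', ...]

Note that the simulation never looks further back than the current day; that single restriction is what makes the model a Markov model.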
Techniques in Probabilistic Reasoning
1. Inference: The process of computing the probability distribution of certain variables given known
values of other variables. Exact inference methods include variable elimination and the junction tree
algorithm, while approximate inference methods include Markov Chain Monte Carlo (MCMC) and
belief propagation.
2. Learning: Involves updating the parameters and structure of probabilistic models based on observed
data. Techniques include maximum likelihood estimation, Bayesian estimation, and expectation-
maximization (EM).
3. Decision Making: Utilizing probabilistic models to make decisions that maximize expected utility.
Techniques involve computing expected rewards and selecting actions accordingly, often
implemented using frameworks like POMDPs.
How Probabilistic Reasoning Empowers AI Systems?
Imagine finding yourself in a maze with nothing but an out-of-focus map. Traditional rule-based reasoning would grind to a halt, unable to reason about the likelihood of a dead end or an unclear path. Probabilistic reasoning is like a powerful flashlight that can illuminate the path ahead even in circumstances of uncertainty.
This is the way in which probabilistic reasoning empowers AI systems:
• Quantifying Uncertainty: Probabilistic reasoning does not shrink from uncertainty. It turns to the
tools of probability theory to represent uncertainty by attaching degrees of likelihood. For example,
instead of a simple “true” or “false” to whether it will rain tomorrow, probabilistic reasoning might
assign a 60% chance that it will.
• Reasoning with Evidence: AI systems cannot enjoy the luxury of making decisions in isolation. They
have to consider the available evidence and act accordingly to help refine the probabilities. For
example, the probability for a rainy day can be refined to increase to 80% if dark clouds come in the
afternoon.
• Learning from Past Experience: AI systems can learn from past experience. Probabilistic reasoning factors prior knowledge into decisions. For example, an AI system trained on historical weather data for your location might take seasonal trends into account when calculating the probability of rain.
• Effective Decision-Making: Probabilistic reasoning enables AI systems to make effective, well-informed decisions based on quantified uncertainty, evidence, and prior knowledge. Returning to the maze analogy, the AI can weigh the probability of different paths, given the map and whatever it has already explored, making it much more likely to reach the goal.
Probabilistic reasoning is not about achieving perfection in a world full of uncertainty; it is about recognizing the limits of perfect knowledge and making the best possible use of the information available. This lets AI systems operate soundly in the real world, where information is vague and, in general, incomplete.
Applications of Probabilistic Reasoning in AI
Probabilistic reasoning is widely applicable in a variety of domains:
1. Robotics: Probabilistic reasoning enables robots to navigate and interact with uncertain
environments. For instance, simultaneous localization and mapping (SLAM) algorithms use
probabilistic techniques to construct maps of unknown environments while tracking the robot’s
location.
2. Healthcare: In medical diagnosis, probabilistic models help in assessing the likelihood of diseases
given symptoms and test results. Bayesian networks, for example, can model the relationships between
various medical conditions and diagnostic indicators.
3. Natural Language Processing (NLP): Probabilistic models, such as HMMs and Conditional
Random Fields (CRFs), are used for tasks like part-of-speech tagging, named entity recognition, and
machine translation.
4. Finance: Probabilistic reasoning is used to model market behavior, assess risks, and make investment
decisions. Techniques like Bayesian inference and Monte Carlo simulations are commonly employed
in financial modeling.
Advantages of Probabilistic Reasoning
• Flexibility: Probabilistic models can handle a wide range of uncertainties and are adaptable to various
domains.
• Robustness: These models are robust to noise and incomplete data, making them reliable in real-
world applications.
• Interpretable: Probabilistic models provide a clear framework for understanding and quantifying
uncertainty, which can aid in transparency and explainability.
Conclusion
Probabilistic reasoning is one of the most important methods for empowering AI applications: it is widely used to deal with uncertainty and still reach logical decisions. With probabilities built in, AI systems can navigate the complexities of the real world, ultimately improving both reliability and performance.
Bayesian Belief Networks

Bayesian Belief Network in artificial intelligence


A Bayesian belief network is a key technique for dealing with probabilistic events and for solving problems that involve uncertainty. We can define a Bayesian network as:
"A Bayesian network is a probabilistic graphical model which represents a set of variables and their conditional
dependencies using a directed acyclic graph."
It is also called a Bayes network, belief network, decision network, or Bayesian model.
Bayesian networks are probabilistic, because these networks are built from a probability distribution, and
also use probability theory for prediction and anomaly detection.
Real world applications are probabilistic in nature, and to represent the relationship between multiple events,
we need a Bayesian network. It can also be used in various tasks including prediction, anomaly detection,
diagnostics, automated insight, reasoning, time series prediction, and decision making under
uncertainty.
A Bayesian network can be used to build models from data and experts' opinions, and it consists of two parts:
o Directed Acyclic Graph
o Table of conditional probabilities.
The generalized form of a Bayesian network that represents and solves decision problems under uncertain knowledge is known as an influence diagram.
A Bayesian network graph is made up of nodes and arcs (directed links), where:

o Each node corresponds to a random variable, and a variable can be continuous or discrete.
o Arcs (directed arrows) represent the causal relationships or conditional dependencies between random variables. These directed links connect pairs of nodes in the graph.
A link indicates that one node directly influences the other; if there is no directed link between two nodes, neither directly influences the other.
o For example, consider a graph whose nodes represent the random variables A, B, C, and D.
o If node B is connected to node A by a directed arrow from A to B, then node A is called the parent of node B.
o Node C is independent of node A.
Note: The Bayesian network graph does not contain any cycles. Hence, it is known as a directed acyclic graph, or DAG.
The Bayesian network has mainly two components:
o Causal Component
o Actual numbers
Each node in the Bayesian network has a conditional probability distribution P(Xi | Parents(Xi)), which quantifies the effect of the parents on that node.
A Bayesian network is based on the joint probability distribution and conditional probability. So let's first understand the joint probability distribution:
Joint probability distribution:
If we have variables x1, x2, x3, ..., xn, then the probability of every combination of values of x1, x2, x3, ..., xn is known as the joint probability distribution.
By the chain rule, P[x1, x2, x3, ..., xn] can be written as:
= P[x1 | x2, x3, ..., xn] P[x2, x3, ..., xn]
= P[x1 | x2, x3, ..., xn] P[x2 | x3, ..., xn] ... P[xn-1 | xn] P[xn].
In a Bayesian network, for each variable Xi this simplifies to:
P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))
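As a small worked illustration (assuming, purely for example, a network in which A and B have no parents, C has parents A and B, and D has parent C), the factorization becomes:
P[A, B, C, D] = P[A] . P[B] . P[C | A, B] . P[D | C]
For four boolean variables this requires only 1 + 1 + 4 + 2 = 8 probability values, instead of the 15 independent values needed to specify the full joint distribution directly.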
Explanation of Bayesian network:
Let's understand the Bayesian network through an example by creating a directed acyclic graph:
Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm responds reliably to a burglary, but it also occasionally responds to minor earthquakes. Harry has two neighbors, David and Sophia, who have taken responsibility for informing Harry at work when they hear the alarm. David always calls Harry when he hears the alarm, but sometimes he confuses the phone ringing with the alarm and calls then as well. Sophia, on the other hand, likes to listen to loud music, so she sometimes fails to hear the alarm. Here we would like to compute the probability of the burglar alarm event.
Problem:
Calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has occurred, and both David and Sophia have called Harry.
Solution:
o The Bayesian network for the above problem is given below. The network structure shows that Burglary and Earthquake are the parent nodes of Alarm and directly affect the probability of the alarm going off, while David's and Sophia's calls depend only on the Alarm node.
o The network thus represents our assumptions: the neighbors do not perceive the burglary directly, do not notice a minor earthquake, and do not confer with each other before calling.
o The conditional distribution for each node is given as a conditional probability table, or CPT.
o Each row in a CPT must sum to 1, because the entries in the row represent an exhaustive set of cases for the variable.
o In a CPT, a boolean variable with k boolean parents requires 2^k rows, one per combination of parent values. Hence, if there are two parents, the CPT will contain 4 probability values.
List of all events occurring in this network:
o Burglary (B)
o Earthquake(E)
o Alarm(A)
o David Calls(D)
o Sophia calls(S)
We can write the events of the problem statement in terms of the joint probability P[D, S, A, B, E], and rewrite it using the chain rule and the network's conditional independences:
P[D, S, A, B, E] = P[D | S, A, B, E] . P[S, A, B, E]
= P[D | S, A, B, E] . P[S | A, B, E] . P[A, B, E]
= P[D | A] . P[S | A, B, E] . P[A, B, E]
= P[D | A] . P[S | A] . P[A | B, E] . P[B, E]
= P[D | A] . P[S | A] . P[A | B, E] . P[B | E] . P[E]

Let's take the observed probability for the Burglary and earthquake component:
P(B= True) = 0.002, which is the probability of burglary.
P(B= False)= 0.998, which is the probability of no burglary.
P(E= True) = 0.001, which is the probability of a minor earthquake.
P(E= False) = 0.999, which is the probability that an earthquake did not occur.
We can provide the conditional probabilities as per the below tables:
Conditional probability table for Alarm A:
The conditional probability of Alarm A depends on Burglary and Earthquake:

B        E        P(A= True)    P(A= False)
True     True     0.94          0.06
True     False    0.95          0.05
False    True     0.31          0.69
False    False    0.001         0.999

Conditional probability table for David Calls:
The conditional probability that David calls depends on the state of the Alarm:

A        P(D= True)    P(D= False)
True     0.91          0.09
False    0.05          0.95

Conditional probability table for Sophia Calls:
The conditional probability that Sophia calls depends on its parent node "Alarm":

A        P(S= True)    P(S= False)
True     0.75          0.25
False    0.02          0.98
From the formula of joint distribution, we can write the problem statement in the form of probability
distribution:
P(S, D, A, ¬B, ¬E) = P(S|A) * P(D|A) * P(A|¬B ∧ ¬E) * P(¬B) * P(¬E)
= 0.75 * 0.91 * 0.001 * 0.998 * 0.999
≈ 0.00068045.
Hence, a Bayesian network can answer any query about the domain by using the joint distribution.
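The same calculation can be sketched in a few lines of Python. This is only an illustrative sketch that hard-codes the CPT values from the tables above; it is not a general-purpose Bayesian network implementation.

# Priors
P_B = {True: 0.002, False: 0.998}   # Burglary
P_E = {True: 0.001, False: 0.999}   # Earthquake

# Conditional probability tables (probability of True given the parents)
P_A = {(True, True): 0.94, (True, False): 0.95,
       (False, True): 0.31, (False, False): 0.001}   # P(A=True | B, E)
P_D = {True: 0.91, False: 0.05}                      # P(D=True | A)
P_S = {True: 0.75, False: 0.02}                      # P(S=True | A)

def joint(d, s, a, b, e):
    # P(D, S, A, B, E) = P(D|A) * P(S|A) * P(A|B,E) * P(B) * P(E)
    p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p_d = P_D[a] if d else 1 - P_D[a]
    p_s = P_S[a] if s else 1 - P_S[a]
    return p_d * p_s * p_a * P_B[b] * P_E[e]

# Alarm sounded, no burglary, no earthquake, both neighbours called:
print(joint(d=True, s=True, a=True, b=False, e=False))   # ≈ 0.00068045

Summing this function over the hidden variables would likewise answer conditional queries such as P(B | D, S).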
The semantics of a Bayesian network:
There are two ways to understand the semantics of a Bayesian network:
1. As a representation of the joint probability distribution. This view is helpful for understanding how to construct the network.
2. As an encoding of a collection of conditional independence statements. This view is helpful for designing inference procedures.
Dempster Shafer Theory

Uncertainty is a pervasive aspect of AI systems, as they often deal with incomplete or conflicting information.
Dempster–Shafer Theory, named after its inventors Arthur P. Dempster and Glenn Shafer, offers a
mathematical framework to represent and reason with uncertain information. By utilizing belief functions,
Dempster–Shafer Theory in Artificial Intelligence systems enables them to handle imprecise and conflicting
evidence, making it a powerful tool in decision-making processes.
Introduction
In recent times, the scientific and engineering community has come to realize the significance of incorporating
multiple forms of uncertainty. This expanded perspective on uncertainty has been made feasible by notable
advancements in computational power within the field of artificial intelligence. As computational systems
become more adept at handling intricate analyses, the limitations of relying solely on traditional probability
theory to encompass the entirety of uncertainty have become apparent.
Traditional probability theory falls short in its ability to effectively address consonant, consistent, or arbitrary
evidence without the need for additional assumptions about probability distributions within a given set.
Moreover, it fails to express the extent of conflict that may arise between different sets of evidence. To
overcome these limitations, Dempster-Shafer theory has emerged as a viable framework, blending the concept
of probability with the conventional understanding of sets. Dempster-Shafer theory provides the means to
handle diverse types of evidence, and it incorporates various methods to account for conflicts when combining
multiple sources of information in the context of artificial intelligence.
What Is Dempster – Shafer Theory (DST)?
Dempster-Shafer Theory (DST) is a theory of evidence that has its roots in the work of Dempster and Shafer.
While traditional probability theory is limited to assigning probabilities to mutually exclusive single events,
DST extends this to sets of events in a finite discrete space. This generalization allows DST to handle evidence
associated with multiple possible events, enabling it to represent uncertainty in a more meaningful way. DST
also provides a more flexible and precise approach to handling uncertain information without relying on
additional assumptions about events within an evidential set.
Where sufficient evidence is present to assign probabilities to single events, the Dempster-Shafer model can
collapse to the traditional probabilistic formulation. Additionally, one of the most significant features of DST
is its ability to handle different levels of precision regarding information without requiring further
assumptions. This characteristic enables the direct representation of uncertainty in system responses, where
an imprecise input can be characterized by a set or interval, and the resulting output is also a set or interval.
The incorporation of Dempster Shafer theory in artificial intelligence allows for a more comprehensive
treatment of uncertainty. By leveraging the unique features of this theory, AI systems can better navigate
uncertain scenarios, leveraging the potential of multiple evidentiary types and effectively managing conflicts.
The utilization of Dempster Shafer theory in artificial intelligence empowers decision-making processes in the
face of uncertainty and enhances the robustness of AI systems. Therefore, Dempster-Shafer theory is a
powerful tool for building AI systems that can handle complex uncertain scenarios.
The Uncertainty in this Model
At its core, DST represents uncertainty using a mathematical object called a belief function. This belief
function assigns degrees of belief to various hypotheses or propositions, allowing for a nuanced representation
of uncertainty. Three crucial points illustrate the nature of uncertainty within this theory:
1. Conflict: In DST, uncertainty arises from conflicting evidence or incomplete information. The theory
captures these conflicts and provides mechanisms to manage and quantify them, enabling AI systems
to reason effectively.
2. Combination Rule: DST employs a combination rule known as Dempster's rule of combination to
merge evidence from different sources. This rule handles conflicts between sources and determines
the overall belief in different hypotheses based on the available evidence.
3. Mass Function: The mass function, denoted as m(K), quantifies the belief assigned to a set of
hypotheses, denoted as K. It provides a measure of uncertainty by allocating probabilities to various
hypotheses, reflecting the degree of support each hypothesis has from the available evidence.
Example
Consider a scenario in artificial intelligence (AI) where an AI system is tasked with solving a murder mystery
using Dempster–Shafer Theory. The setting is a room with four individuals: A, B, C, and D. Suddenly, the
lights go out, and upon their return, B is discovered dead, having been stabbed in the back with a knife. No
one entered or exited the room, and it is known that B did not commit suicide. The objective is to identify the
murderer.
To address this challenge using Dempster–Shafer Theory, we can explore various possibilities:
1. Possibility 1: The murderer could be either A, C, or D.
2. Possibility 2: The murderer could be a combination of two individuals, such as A and C, C and D, or
A and D.
3. Possibility 3: All three individuals, A, C, and D, might be involved in the crime.
4. Possibility 4: None of the individuals present in the room is the murderer.
To find the murderer using Dempster–Shafer Theory, we can examine the evidence and assign measures of
plausibility to each possibility. We create a set of possible conclusions (P)(P) with individual
elements {p1,p2,...,pn}{p1,p2,...,pn}, where at least one element (p)(p) must be true. These elements must be
mutually exclusive.
By constructing the power set, which contains all possible subsets, we can analyze the evidence. For instance,
if P={a,b,c}P={a,b,c}, the power set would
be {o,{a},{b},{c},{a,b},{b,c},{a,c},{a,b,c}}{o,{a},{b},{c},{a,b},{b,c},{a,c},{a,b,c}},
comprising 23=823=8 elements.
Mass function m(K)
In Dempster–Shafer Theory, the mass function m(K) represents evidence for a hypothesis or subset K. It
denotes that evidence for {K or B} cannot be further divided into more specific beliefs for K and B.
Belief in K
The belief in K, denoted Bel(K), is calculated by summing the masses of all subsets of K. For example, if K = {a, d, c}, then Bel(K) = m(a) + m(d) + m(c) + m(a, d) + m(a, c) + m(d, c) + m(a, d, c).
Plausibility of K
The plausibility of K, denoted Pl(K), is determined by summing the masses of all sets that intersect K. It represents the cumulative evidence supporting the possibility that K is true. Here Pl(K) = m(a) + m(d) + m(c) + m(a, d) + m(d, c) + m(a, c) + m(a, d, c).
By leveraging Dempster–Shafer Theory in AI, we can analyze the evidence, assign masses to subsets of
possible conclusions, and calculate beliefs and plausibilities to infer the most likely murderer in this murder
mystery scenario.
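The belief and plausibility computations above can be sketched in Python. The mass values below are invented numbers over the murder-mystery frame {A, C, D}, used only to show the mechanics of Bel and Pl.

# Frame of discernment for the murder mystery (B is the victim)
frame = frozenset({"A", "C", "D"})

# Assumed (hypothetical) mass assignment over subsets of the frame; masses sum to 1.
m = {
    frozenset({"A"}): 0.30,
    frozenset({"C"}): 0.10,
    frozenset({"A", "D"}): 0.25,
    frozenset({"C", "D"}): 0.15,
    frame: 0.20,               # mass on the whole frame represents ignorance
}

def belief(K):
    # Bel(K): total mass of all subsets contained in K.
    return sum(mass for s, mass in m.items() if s <= K)

def plausibility(K):
    # Pl(K): total mass of all sets that intersect K.
    return sum(mass for s, mass in m.items() if s & K)

K = frozenset({"A", "D"})
print("Bel({A, D}) =", belief(K))         # 0.30 + 0.25 = 0.55
print("Pl({A, D})  =", plausibility(K))   # 1 - m({C}) = 0.90

The gap between Bel(K) and Pl(K) is exactly the uncertainty (ignorance) that remains about the hypothesis K.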
Characteristics of Dempster Shafer Theory
Dempster Shafer Theory in artificial intelligence (AI) exhibits several notable characteristics:
1. Handling Ignorance: Dempster Shafer Theory has a distinctive way of dealing with ignorance: the total mass assigned across all subsets of the frame sums to 1, and mass that cannot be attributed to any specific hypothesis is assigned to the whole frame. This allows the theory to address situations involving incomplete or missing information.
2. Reduction of Ignorance: In this theory, ignorance is gradually diminished through the accumulation
of additional evidence. By incorporating more and more evidence, Dempster Shafer Theory enables
AI systems to make more informed and precise decisions, thereby reducing uncertainties.
3. Combination Rule: The theory employs a combination rule to effectively merge and integrate various
types of possibilities. This rule allows for the synthesis of different pieces of evidence, enabling AI
systems to arrive at comprehensive and robust conclusions by considering the diverse perspectives
presented.
By leveraging these distinct characteristics, Dempster Shafer Theory proves to be a valuable tool in the field
of artificial intelligence, empowering systems to handle ignorance, reduce uncertainties, and combine multiple
types of evidence for more accurate decision-making.
Advantages and Disadvantages
Dempster Shafer Theory in Artificial Intelligence (AI) Offers Numerous Benefits:
1. Firstly, it presents a systematic and well-founded framework for effectively managing uncertain
information and making informed decisions in the face of uncertainty.
2. Secondly, the application of Dempster–Shafer Theory allows for the integration and fusion of diverse
sources of evidence, enhancing the robustness of decision-making processes in AI systems.
3. Moreover, this theory caters to the handling of incomplete or conflicting information, which is a
common occurrence in real-world scenarios encountered in artificial intelligence.
Nevertheless, it is Crucial to Acknowledge Certain Limitations Associated with the Utilization of
Dempster Shafer Theory in Artificial Intelligence:
1. One drawback is that the computational complexity of DST increases significantly when confronted
with a substantial number of events or sources of evidence, resulting in potential performance
challenges.
2. Furthermore, the process of combining evidence using Dempster–Shafer Theory necessitates careful
modeling and calibration to ensure accurate and reliable outcomes.
3. Additionally, the interpretation of belief and plausibility values in DST may possess subjectivity,
introducing the possibility of biases influencing decision-making processes in artificial intelligence.
MACHINE LEARNING

Machine learning (ML) is a subdomain of artificial intelligence (AI) that focuses on developing systems
that learn—or improve performance—based on the data they ingest. Artificial intelligence is a broad word
that refers to systems or machines that resemble human intelligence. Machine learning and AI are frequently
discussed together, and the terms are occasionally used interchangeably, although they do not signify the same
thing. A crucial distinction is that, while all machine learning is AI, not all AI is machine learning.
What is Machine Learning?
Machine Learning is the field of study that gives computers the capability to learn without being explicitly programmed. ML is one of the most exciting technologies one could come across. As the name suggests, it gives the computer the quality that makes it more similar to humans: the ability to learn. Machine learning is actively being used today, perhaps in many more places than one would expect.

Features of Machine Learning


• Machine learning is a data-driven technology. A large amount of data is generated by organizations
daily, enabling them to identify notable relationships and make better decisions.
• Machines can learn from past data and automatically improve their performance.
• Given a dataset, ML can detect various patterns in the data.
• For large organizations, where branding is crucial, ML makes it easier to target a relevant customer base.
• It is similar to data mining, as both deal with substantial amounts of data.
Advantages and Disadvantages of Machine Learning
Advantages:
1. Improved Accuracy and Precision: Machine learning (ML) excels in analyzing vast data sets,
identifying patterns, and improving accuracy in predictions, such as diagnosing diseases or detecting
anomalies that human analysis might miss.
2. Automation of Repetitive Tasks: ML automates routine tasks, such as data entry and customer
service, leading to increased productivity, efficiency, and the allocation of human resources to more
creative tasks.
3. Enhanced Decision-Making: By analyzing large datasets, ML provides valuable insights, aiding in
data-driven decision-making across industries, including finance, healthcare, and marketing.
4. Personalization and Customer Experience: ML algorithms allow businesses to personalize
products and services based on user behavior, enhancing customer satisfaction, such as through
personalized recommendations in e-commerce or content platforms.
5. Predictive Analytics: ML can predict future trends or events by analyzing historical data, such as
forecasting demand or identifying potential disease outbreaks, helping industries plan more
effectively.
6. Scalability: Machine learning models can efficiently handle and process large datasets, making them
essential for big data applications like social media analysis or real-time business operations.
7. Improved Security: ML helps in detecting cybersecurity threats by identifying abnormal patterns in
data. It’s used in fraud detection by monitoring transactions and analyzing network activity for
suspicious behavior.
8. Cost Reduction: By automating tasks and optimizing processes, ML reduces operational costs, such
as predictive maintenance in manufacturing that prevents costly machine failures.
9. Innovation and Competitive Advantage: Companies adopting ML gain a competitive edge by
innovating and responding to customer demands faster. ML-driven products and insights can lead to
new revenue streams and market leadership.
10. Enhanced Human Capabilities: ML amplifies human potential, offering tools that provide insights
and help professionals, such as assisting doctors in diagnosing diseases or researchers in processing
complex data.
Disadvantages:
1. Data Dependency: ML models require vast amounts of data to function effectively. The quality,
quantity, and diversity of data are crucial to the model’s performance, and biased or insufficient data
can lead to poor results.
2. High Computational Costs: Training ML models can be resource-intensive, often requiring
expensive hardware like GPUs or TPUs. The energy consumption is also significant, raising concerns
about sustainability.
3. Complexity and Interpretability: Complex ML models, especially deep neural networks, are difficult
to interpret, leading to a "black-box" problem where understanding how a model arrived at a decision
becomes challenging, especially in sensitive fields like healthcare.
4. Overfitting and Underfitting: ML models can suffer from overfitting (when they memorize the
training data) or underfitting (when they are too simplistic), leading to poor generalization to new
data.
5. Ethical Concerns: ML raises ethical issues around privacy, as models often rely on sensitive personal
data. Biases in data can also perpetuate social inequalities, resulting in unfair treatment.
6. Lack of Generalization: ML models are often designed for specific tasks and may struggle when
applied to different datasets or domains. Generalizing across diverse contexts remains a challenge in
machine learning.
7. Dependency on Expertise: Developing ML models requires specialized knowledge in algorithms,
data preprocessing, and model evaluation. A shortage of skilled professionals can limit the adoption
of ML.
8. Security Vulnerabilities: ML models can be vulnerable to adversarial attacks, where manipulated
input data is used to trick the model into making incorrect predictions, posing risks in applications
like autonomous vehicles and cybersecurity.
9. Maintenance and Updates: ML models require ongoing maintenance and retraining as data changes
over time. Data drift, where the underlying data distribution shifts, can degrade model performance if
not addressed.
10. Legal and Regulatory Challenges: The use of ML, especially in handling personal data, faces legal
and regulatory challenges like complying with GDPR. A lack of clear regulations can create
uncertainty for developers and businesses.
Conclusion:
Machine learning offers numerous advantages, such as automation, enhanced accuracy, scalability, and
personalization, making it highly valuable across industries. However, it also faces challenges like data
dependency, computational costs, interpretability issues, and security vulnerabilities. Addressing these
challenges is essential for ethical and effective use of ML technologies.
Supervised Learning
In supervised learning, the machine is trained on a set of labeled data, which means that the input data is
paired with the desired output. The machine then learns to predict the output for new input data. Supervised
learning is often used for tasks such as classification, regression, and object detection.
In unsupervised learning, the machine is trained on a set of unlabeled data, which means that the input data is
not paired with the desired output. The machine then learns to find patterns and relationships in the data.
Unsupervised learning is often used for tasks such as clustering, dimensionality reduction, and anomaly
detection.
What is Supervised learning?
Supervised learning is a type of machine learning algorithm that learns from labeled data. Labeled data is data
that has been tagged with a correct answer or classification.
Supervised learning, as the name indicates, involves a supervisor acting as a teacher. In supervised learning, we teach or train the machine using data that is well labelled, which means some data is already tagged with the correct answer. After that, the machine is given a new set of examples (data), and the supervised learning algorithm analyses the training data (the set of training examples) and produces a correct outcome from the labeled data.
For example, a labeled dataset of images of elephants, camels, and cows would have each image tagged as "Elephant", "Camel", or "Cow".

Key Points:
• Supervised learning involves training a machine from labeled data.
• Labeled data consists of examples with the correct answer or classification.
• The machine learns the relationship between inputs (fruit images) and outputs (fruit labels).
• The trained machine can then make predictions on new, unlabeled data.
Example:
Let's say you have a basket of fruit that you want the machine to identify. The machine would first analyze each image to extract features such as shape, color, and texture. Then it would compare these features to the features of the fruits it has already learned about. If the new image's features are most similar to those of an apple, the machine would predict that the fruit is an apple.
For instance, suppose you are given a basket filled with different kinds of fruits. Now the first step is to train
the machine with all the different fruits one by one like this:
• If the shape of the object is rounded and has a depression at the top, is red in color, then it will be
labeled as –Apple.
• If the shape of the object is a long curving cylinder having Green-Yellow color, then it will be labeled
as –Banana.
Now suppose that, after training, the machine is given a new, separate fruit from the basket, say a banana, and is asked to identify it.
Since the machine has already learned from the previous data, it now has to use that knowledge wisely: it classifies the fruit by its shape and color, confirms the fruit name as BANANA, and puts it in the Banana category. Thus the machine learns from the training data (the basket of fruits) and then applies that knowledge to the test data (the new fruit).
Types of Supervised Learning
Supervised learning is classified into two categories of algorithms:
• Regression: A regression problem is when the output variable is a real value, such as “dollars” or
“weight”.
• Classification: A classification problem is when the output variable is a category, such as “Red” or
“blue” , “disease” or “no disease”.
Supervised learning deals with or learns with “labeled” data. This implies that some data is already tagged
with the correct answer.
1- Regression
Regression is a type of supervised learning that is used to predict continuous values, such as house prices,
stock prices, or customer churn. Regression algorithms learn a function that maps from the input features to
the output value.
Some common regression algorithms include:
• Linear Regression
• Polynomial Regression
• Support Vector Machine Regression
• Decision Tree Regression
• Random Forest Regression
2- Classification
Classification is a type of supervised learning that is used to predict categorical values, such as whether a
customer will churn or not, whether an email is spam or not, or whether a medical image shows a tumor or
not. Classification algorithms learn a function that maps from the input features to a probability distribution
over the output classes.
Some common classification algorithms include:
• Logistic Regression
• Support Vector Machines
• Decision Trees
• Random Forests
• Naive Bayes
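A brief illustrative sketch of this classification workflow, assuming the scikit-learn library and its bundled Iris dataset (any labeled dataset and any of the classifiers listed above would fit the same pattern):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Labeled data: feature vectors X paired with their correct class labels y
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the labeled data to check generalization
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Learn the mapping from features to labels, then predict labels for unseen data
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print("Test accuracy:", accuracy_score(y_test, predictions))

Swapping DecisionTreeClassifier for LogisticRegression, SVC, or RandomForestClassifier changes only one line; the train/predict/evaluate pattern stays the same.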
Evaluating Supervised Learning Models
Evaluating supervised learning models is an important step in ensuring that the model is accurate and
generalizable. There are a number of different metrics that can be used to evaluate supervised learning models,
but some of the most common ones include:
For Regression
• Mean Squared Error (MSE): MSE measures the average squared difference between the predicted
values and the actual values. Lower MSE values indicate better model performance.
• Root Mean Squared Error (RMSE): RMSE is the square root of MSE, representing the standard
deviation of the prediction errors. Similar to MSE, lower RMSE values indicate better model
performance.
• Mean Absolute Error (MAE): MAE measures the average absolute difference between the predicted
values and the actual values. It is less sensitive to outliers compared to MSE or RMSE.
• R-squared (Coefficient of Determination): R-squared measures the proportion of the variance in
the target variable that is explained by the model. Higher R-squared values indicate better model fit.
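A minimal sketch, using only NumPy, of how the regression metrics above are computed from actual and predicted values (the numbers are made up for illustration):

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # actual values (illustrative)
y_pred = np.array([2.8, 5.4, 2.0, 6.5])   # model predictions (illustrative)

errors = y_pred - y_true
mse = np.mean(errors ** 2)                        # Mean Squared Error
rmse = np.sqrt(mse)                               # Root Mean Squared Error
mae = np.mean(np.abs(errors))                     # Mean Absolute Error
ss_res = np.sum(errors ** 2)                      # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot                          # R-squared

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")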
For Classification
• Accuracy: Accuracy is the percentage of predictions that the model makes correctly. It is calculated
by dividing the number of correct predictions by the total number of predictions.
• Precision: Precision is the percentage of positive predictions that the model makes that are actually
correct. It is calculated by dividing the number of true positives by the total number of positive
predictions.
• Recall: Recall is the percentage of all positive examples that the model correctly identifies. It is
calculated by dividing the number of true positives by the total number of positive examples.
• F1 score: The F1 score is a weighted average of precision and recall. It is calculated by taking the
harmonic mean of precision and recall.
• Confusion matrix: A confusion matrix is a table that shows the number of predictions for each
class, along with the actual class labels. It can be used to visualize the performance of the model and
identify areas where the model is struggling.
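The classification metrics above all follow from the counts in a confusion matrix. A minimal sketch for the binary case, with made-up counts:

# Illustrative binary confusion-matrix counts
tp, fp, fn, tn = 40, 10, 5, 45   # true/false positives, false/true negatives

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)            # of predicted positives, how many are correct
recall = tp / (tp + fn)               # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(f"accuracy={accuracy:.3f}  precision={precision:.3f}  "
      f"recall={recall:.3f}  f1={f1:.3f}")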
Applications of Supervised learning
Supervised learning can be used to solve a wide variety of problems, including:
• Spam filtering: Supervised learning algorithms can be trained to identify and classify spam emails
based on their content, helping users avoid unwanted messages.
• Image classification: Supervised learning can automatically classify images into different categories,
such as animals, objects, or scenes, facilitating tasks like image search, content moderation, and image-
based product recommendations.
• Medical diagnosis: Supervised learning can assist in medical diagnosis by analyzing patient data,
such as medical images, test results, and patient history, to identify patterns that suggest specific
diseases or conditions.
• Fraud detection: Supervised learning models can analyze financial transactions and identify patterns
that indicate fraudulent activity, helping financial institutions prevent fraud and protect their
customers.
• Natural language processing (NLP): Supervised learning plays a crucial role in NLP tasks,
including sentiment analysis, machine translation, and text summarization, enabling machines to
understand and process human language effectively.
Advantages of Supervised learning
• Supervised learning allows a model to produce outputs for new data based on previous experience (the labeled training data).
• Helps to optimize performance criteria with the help of experience.
• Supervised machine learning helps to solve various types of real-world computation problems.
• It performs classification and regression tasks.
• It allows estimating or mapping the result to a new sample.
• We have complete control over choosing the number of classes we want in the training data.
Disadvantages of Supervised learning
• Classifying big data can be challenging.
• Training for supervised learning needs a lot of computation time. So, it requires a lot of time.
• Supervised learning cannot handle all complex tasks in Machine Learning.
• Computation time is vast for supervised learning.
• It requires a labelled data set.
• It requires a training process.

Unsupervised Learning.
Unsupervised learning is a type of machine learning that learns from unlabeled data. This means that the data
does not have any pre-existing labels or categories. The goal of unsupervised learning is to discover patterns
and relationships in the data without any explicit guidance.
Unsupervised learning is the training of a machine using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance. Here the task of the machine is to group unsorted information according to similarities, patterns, and differences, without any prior training on the data.
Unlike supervised learning, no teacher is provided, which means no training answers are given to the machine. The machine must therefore find the hidden structure in the unlabeled data by itself.
For example, you could use unsupervised learning to examine data gathered about animals and distinguish several groups according to the animals' traits and behaviors. These groupings might correspond to different animal species, letting you categorize the creatures without relying on pre-existing labels.
Key Points
• Unsupervised learning allows the model to discover patterns and relationships in unlabeled data.
• Clustering algorithms group similar data points together based on their inherent characteristics.
• Feature extraction captures essential information from the data, enabling the model to make
meaningful distinctions.
• Label association assigns categories to the clusters based on the extracted patterns and characteristics.
Example
Imagine you have a machine learning model trained on a large dataset of unlabeled images, containing both
dogs and cats. The model has never seen an image of a dog or cat before, and it has no pre-existing labels or
categories for these animals. Your task is to use unsupervised learning to identify the dogs and cats in a new,
unseen image.
For instance, suppose the model is given an image containing both dogs and cats, which it has never seen before.
The machine has no idea about the features of dogs and cats, so it cannot categorize them as "dogs" and "cats" by name. But it can group them according to their similarities, patterns, and differences: it can easily split the picture into two parts, the first containing all images with dogs in them and the second containing all images with cats in them. No prior training data or labeled examples were used.
It allows the model to work on its own to discover patterns and information that was previously undetected.
It mainly deals with unlabelled data.
Types of Unsupervised Learning
Unsupervised learning is classified into two categories of algorithms:
• Clustering: A clustering problem is where you want to discover the inherent groupings in the data,
such as grouping customers by purchasing behavior.
• Association: An association rule learning problem is where you want to discover rules that describe
large portions of your data, such as people that buy X also tend to buy Y.
Clustering
Clustering is a type of unsupervised learning that is used to group similar data points together. Clustering algorithms typically work iteratively, assigning data points to clusters and refining the clusters so that points within a cluster are close to their cluster center and far from points in other clusters. Broad categories of clustering include:
1. Exclusive (partitioning)
2. Agglomerative
3. Overlapping
4. Probabilistic
Commonly used clustering and related unsupervised techniques:
1. Hierarchical clustering
2. K-means clustering
3. Principal Component Analysis
4. Singular Value Decomposition
5. Independent Component Analysis
6. Gaussian Mixture Models (GMMs)
7. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
(Note that Principal Component Analysis, Singular Value Decomposition, and Independent Component Analysis are dimensionality-reduction techniques often used alongside clustering rather than clustering algorithms themselves.)
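A brief illustrative sketch of K-means clustering, assuming scikit-learn; the synthetic blob data is generated purely for demonstration and its true labels are discarded:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Unlabeled data: three synthetic groups, true labels thrown away
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Group the points into 3 clusters using only their features
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Cluster sizes:", [int((labels == k).sum()) for k in range(3)])
print("Cluster centers:\n", kmeans.cluster_centers_)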
Association rule learning
Association rule learning is a type of unsupervised learning that is used to identify patterns in data. Association rule learning algorithms work by finding relationships between different items in a dataset.
Some common association rule learning algorithms include:
• Apriori Algorithm
• Eclat Algorithm
• FP-Growth Algorithm
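Full algorithms such as Apriori add candidate generation and pruning, but the core quantities they compute, the support and confidence of a rule "people who buy X also buy Y", can be sketched in a few lines; the transactions below are invented for illustration:

# Hypothetical market-basket transactions (illustrative only)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Of the transactions containing the antecedent, the fraction that also
    # contain the consequent: support(A and C together) / support(A).
    return support(antecedent | consequent) / support(antecedent)

lhs, rhs = {"bread"}, {"milk"}
print("support(bread, milk)      =", support(lhs | rhs))     # 3/5 = 0.60
print("confidence(bread -> milk) =", confidence(lhs, rhs))   # 3/4 = 0.75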
Evaluating Unsupervised Learning Models
Evaluating unsupervised learning models is an important step in ensuring that the model is effective and useful. However, it can be more challenging than evaluating supervised learning models, as there is usually no ground-truth data to compare the model's predictions to.
There are a number of different metrics that can be used to evaluate non-supervised learning models, but some
of the most common ones include:
• Silhouette score: The silhouette score measures how well each data point is clustered with its own
cluster members and separated from other clusters. It ranges from -1 to 1, with higher scores
indicating better clustering.
• Calinski-Harabasz score: The Calinski-Harabasz score measures the ratio between the variance
between clusters and the variance within clusters. It ranges from 0 to infinity, with higher scores
indicating better clustering.
• Adjusted Rand index: The adjusted Rand index measures the similarity between two clusterings. It
ranges from -1 to 1, with higher scores indicating more similar clusterings.
• Davies-Bouldin index: The Davies-Bouldin index measures the average similarity between
clusters. It ranges from 0 to infinity, with lower scores indicating better clustering.
• F1 score: The F1 score is the harmonic mean of precision and recall, two metrics commonly used in supervised learning to evaluate classification models. When ground-truth labels happen to be available, it can also be used to assess clustering results.
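A minimal, self-contained sketch (assuming scikit-learn) that computes three of the internal clustering metrics listed above on synthetic data; no ground-truth labels are needed:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# All three scores are computed from the data X and the cluster labels alone
print("Silhouette:       ", silhouette_score(X, labels))          # higher is better
print("Calinski-Harabasz:", calinski_harabasz_score(X, labels))   # higher is better
print("Davies-Bouldin:   ", davies_bouldin_score(X, labels))      # lower is better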
Applications of Unsupervised learning
Unsupervised learning can be used to solve a wide variety of problems, including:
• Anomaly detection: Unsupervised learning can identify unusual patterns or deviations from normal
behavior in data, enabling the detection of fraud, intrusion, or system failures.
• Scientific discovery: Unsupervised learning can uncover hidden relationships and patterns in scientific
data, leading to new hypotheses and insights in various scientific fields.
• Recommendation systems: Unsupervised learning can identify patterns and similarities in user
behavior and preferences to recommend products, movies, or music that align with their interests.
• Customer segmentation: Unsupervised learning can identify groups of customers with similar
characteristics, allowing businesses to target marketing campaigns and improve customer service
more effectively.
• Image analysis: Unsupervised learning can group images based on their content, facilitating tasks
such as image classification, object detection, and image retrieval.
Advantages of Unsupervised learning
• It does not require training data to be labeled.
• Dimensionality reduction can be easily accomplished using unsupervised learning.
• Capable of finding previously unknown patterns in data.
• Unsupervised learning can help you gain insights from unlabeled data that you might not have been
able to get otherwise.
• Unsupervised learning is good at finding patterns and relationships in data without being told what
to look for. This can help you learn new things about your data.
Disadvantages of Unsupervised learning
• Difficult to measure accuracy or effectiveness due to lack of predefined answers during training.
• The results often have lesser accuracy.
• The user needs to spend time interpreting and labeling the clusters that result from the analysis.
• Unsupervised learning can be sensitive to data quality, including missing values, outliers, and noisy
data.
• Without labelled data, it can be difficult to evaluate the performance of unsupervised learning models,
making it challenging to assess their effectiveness.
Reinforcement Learning
Reinforcement Learning (RL) is a branch of machine learning focused on making decisions to maximize
cumulative rewards in a given situation. Unlike supervised learning, which relies on a training dataset with
predefined answers, RL involves learning through experience. In RL, an agent learns to achieve a goal in an
uncertain, potentially complex environment by performing actions and receiving feedback through rewards
or penalties.
Key Concepts of Reinforcement Learning
• Agent: The learner or decision-maker.
• Environment: Everything the agent interacts with.
• State: A specific situation in which the agent finds itself.
• Action: All possible moves the agent can make.
• Reward: Feedback from the environment based on the action taken.
How Reinforcement Learning Works
RL operates on the principle of learning optimal behavior through trial and error. The agent takes actions
within the environment, receives rewards or penalties, and adjusts its behavior to maximize the cumulative
reward. This learning process is characterized by the following elements:
• Policy: A strategy used by the agent to determine the next action based on the current state.
• Reward Function: A function that provides a scalar feedback signal based on the state and action.
• Value Function: A function that estimates the expected cumulative reward from a given state.
• Model of the Environment: A representation of the environment that helps in planning by predicting
future states and rewards.
Example: Navigating a Maze
The problem is as follows: we have an agent and a reward, with many hurdles in between. The agent is supposed to find the best possible path to reach the reward. The following example explains the problem more clearly.

The above image shows a robot, a diamond, and fire. The goal of the robot is to reach the reward, the diamond, while avoiding the hurdles, which are the fire. The robot learns by trying all the possible paths and then choosing the path that reaches the reward with the fewest hurdles. Each right step gives the robot a reward, and each wrong step subtracts from the robot's reward. The total reward is calculated when it reaches the final reward, the diamond.
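The following is a minimal tabular Q-learning sketch in the spirit of the maze example; the grid layout, reward values, and hyper-parameters are illustrative assumptions rather than part of the original example:

```python
# Minimal tabular Q-learning sketch for a 3x3 grid world.
# Assumed layout: start at (0, 0), diamond (+10) at (2, 2),
# fire (-10) at (1, 1); every other step costs -1.
import random

ROWS, COLS = 3, 3
GOAL, FIRE = (2, 2), (1, 1)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount, exploration
Q = {((r, c), a): 0.0 for r in range(ROWS) for c in range(COLS) for a in range(4)}

def step(state, a):
    dr, dc = ACTIONS[a]
    r = min(max(state[0] + dr, 0), ROWS - 1)
    c = min(max(state[1] + dc, 0), COLS - 1)
    nxt = (r, c)
    if nxt == GOAL:
        return nxt, 10, True    # reached the diamond
    if nxt == FIRE:
        return nxt, -10, True   # stepped into the fire
    return nxt, -1, False       # ordinary move

for episode in range(500):
    state, done = (0, 0), False
    while not done:
        # Epsilon-greedy action selection (trial and error).
        if random.random() < epsilon:
            a = random.randrange(4)
        else:
            a = max(range(4), key=lambda x: Q[(state, x)])
        nxt, reward, done = step(state, a)
        best_next = max(Q[(nxt, x)] for x in range(4))
        # Q-learning update: move the estimate toward reward + discounted future value.
        Q[(state, a)] += alpha * (reward + gamma * best_next - Q[(state, a)])
        state = nxt

# Greedy action from the start state after training (index into ACTIONS).
print(max(range(4), key=lambda a: Q[((0, 0), a)]))
```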

Main points in Reinforcement Learning:
• Input: The input is an initial state from which the model starts.
• Output: There are many possible outputs, as there are a variety of solutions to a particular problem.
• Training: The training is based on the input. The model returns a state, and the user decides whether to reward or punish the model based on its output.
• The model continues to learn.
• The best solution is decided based on the maximum reward.

Difference between Reinforcement Learning and Supervised Learning:
• Decision making: Reinforcement learning is all about making decisions sequentially; the output depends on the state of the current input, and the next input depends on the output of the previous input. In supervised learning, the decision is made on the initial input, or the input given at the start.
• Dependence of decisions: In reinforcement learning, decisions are dependent, so labels are given to sequences of dependent decisions. In supervised learning, decisions are independent of each other, so a label is given to each decision.
• Examples: Reinforcement learning is used for tasks such as a chess game or text summarization; supervised learning is used for tasks such as object recognition or spam detection.

Types of Reinforcement:
1. Positive: Positive reinforcement occurs when an event, produced by a particular behavior, increases the strength and frequency of that behavior. In other words, it has a positive effect on behavior.
Advantages of positive reinforcement:
• Maximizes performance
• Sustains change for a long period of time
A drawback is that too much reinforcement can lead to an overload of states, which can diminish the results.
2. Negative: Negative reinforcement is defined as the strengthening of a behavior because a negative condition is stopped or avoided.
Advantages of negative reinforcement:
• Increases behavior
• Provides defiance of a minimum standard of performance
A drawback is that it only provides enough to meet the minimum behavior.
Elements of Reinforcement Learning
i) Policy: Defines the agent’s behavior at a given time.
ii) Reward Function: Defines the goal of the RL problem by providing feedback.
iii) Value Function: Estimates long-term rewards from a state.
iv) Model of the Environment: Helps in predicting future states and rewards for planning.

Support Vector Machine.


A Support Vector Machine (SVM) is a powerful machine learning algorithm widely used for both linear
and nonlinear classification, as well as regression and outlier detection tasks. SVMs are highly adaptable,
making them suitable for various applications such as text classification, image classification, spam
detection, handwriting identification, gene expression analysis, face detection, and anomaly detection.
SVMs are particularly effective because they focus on finding the maximum separating hyperplane between
the different classes in the target feature, making them robust for both binary and multiclass classification.
In this section, we will explore the SVM algorithm, how it works, and the terminology associated with it.
Support Vector Machine
A Support Vector Machine (SVM) is a supervised machine learning algorithm used for
both classification and regression tasks. While it can be applied to regression problems, SVM is best suited
for classification tasks. The primary objective of the SVM algorithm is to identify the optimal
hyperplane in an N-dimensional space that can effectively separate data points into different classes in the
feature space. The algorithm ensures that the margin between the closest points of different classes, known
as support vectors, is maximized.
The dimension of the hyperplane depends on the number of features. For instance, if there are two input
features, the hyperplane is simply a line, and if there are three input features, the hyperplane becomes a 2-D
plane. As the number of features increases beyond three, the complexity of visualizing the hyperplane also
increases.
Consider two independent variables, x1 and x2, and one dependent variable represented as either a blue circle
or a red circle.
• In this scenario, the hyperplane is a line because we are working with two features (x1 and x2).
• There are multiple lines (or hyperplanes) that can separate the data points.
• The challenge is to determine the best hyperplane that maximizes the separation margin between
the red and blue circles.

Linearly Separable Data points


From the figure above it is clear that there are multiple lines (our hyperplane here is a line because we are considering only two input features, x1 and x2) that separate the data points, i.e., classify the red and blue circles. So how do we choose the best line, or in general the best hyperplane, that separates our data points?
How does Support Vector Machine Algorithm Work?
One reasonable choice for the best hyperplane in a Support Vector Machine (SVM) is the one that
maximizes the separation margin between the two classes. The maximum-margin hyperplane, also
referred to as the hard margin, is selected based on maximizing the distance between the hyperplane and the
nearest data point on each side.

Multiple hyperplanes separate the data from two classes


So we choose the hyperplane whose distance to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane, or hard margin. From the above figure, we therefore choose L2. Now let's consider the scenario shown below.

Selecting hyperplane for data with outlier


Here we have one blue ball inside the region of the red balls. So how does SVM classify the data? The blue ball lying among the red ones is an outlier of the blue class. The SVM algorithm has the ability to ignore such outliers and still find the hyperplane that maximizes the margin, which makes SVM robust to outliers.

Hyperplane which is the most optimized one


So for this type of data, SVM finds the maximum margin as it did with the previous data sets, and in addition it adds a penalty each time a point crosses the margin. The margins in such cases are called soft margins. With a soft margin, the SVM tries to minimize an objective of the form (1/margin) + λ * Σ(penalty). Hinge loss is a commonly used penalty: if there are no violations, the hinge loss is zero; if there are violations, the hinge loss is proportional to the distance of the violation.
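A small NumPy sketch of the hinge-loss idea just described; the weight vector, bias, and sample points are illustrative assumptions:

```python
# Hinge loss sketch: zero loss for points outside the margin,
# a penalty proportional to the violation for points inside it or misclassified.
# The weight vector, bias, and sample points are illustrative assumptions.
import numpy as np

w, b = np.array([1.0, -1.0]), 0.0          # assumed linear decision function w.x + b
X = np.array([[2.0, -1.0], [0.2, 0.1], [-1.0, 1.0]])
y = np.array([1, 1, 1])                    # all labelled +1 for illustration

margins = y * (X @ w + b)                  # y_i * (w.x_i + b); >= 1 means no violation
hinge = np.maximum(0.0, 1.0 - margins)     # loss grows linearly with the violation
print(hinge)   # [0.  0.9 3. ] -> correctly outside margin, inside margin, misclassified
```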
Till now, we were talking about linearly separable data(the group of blue balls and red balls are separable by
a straight line/linear line). What to do if data are not linearly separable?

Original 1D dataset for classification


Say our data is as shown in the figure above. SVM solves this by creating a new variable using a kernel. We take a point x_i on the line and create a new variable y_i as a function of its distance from the origin O. If we plot this, we get something like the figure shown below.

Mapping 1D data to 2D to become able to separate the two classes


In this case, the new variable y is created as a function of distance from the origin. A non-linear function that
creates a new variable is referred to as a kernel.
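A brief sketch of this kernel idea on a toy 1D dataset (an assumption for illustration), where adding the squared distance from the origin as a second coordinate makes the two classes linearly separable:

```python
# Kernel-style feature mapping sketch: 1D points that are not linearly
# separable become separable once we add y_i = x_i**2 (a distance-based feature).
# The toy data below is an illustrative assumption.
import numpy as np

x = np.array([-3.0, -2.5, -0.5, 0.0, 0.4, 2.6, 3.1])
labels = np.array([1, 1, 0, 0, 0, 1, 1])      # far from the origin -> class 1

# Map each 1D point x_i to the 2D point (x_i, x_i**2).
features_2d = np.column_stack([x, x ** 2])

# In 2D, a horizontal line such as y = 2 now separates the classes:
print((features_2d[:, 1] > 2.0).astype(int))  # matches `labels`
```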
Support Vector Machine Terminology
• Hyperplane: The hyperplane is the decision boundary used to separate data points of different classes
in a feature space. For linear classification, this is a linear equation represented as wx+b=0.
• Support Vectors: Support vectors are the closest data points to the hyperplane. These points are
critical in determining the hyperplane and the margin in Support Vector Machine (SVM).
• Margin: The margin refers to the distance between the support vector and the hyperplane. The
primary goal of the SVM algorithm is to maximize this margin, as a wider margin typically results in
better classification performance.
• Kernel: The kernel is a mathematical function used in SVM to map input data into a higher-
dimensional feature space. This allows the SVM to find a hyperplane in cases where data points are
not linearly separable in the original space. Common kernel functions include linear, polynomial,
radial basis function (RBF), and sigmoid.
• Hard Margin: A hard margin refers to the maximum-margin hyperplane that perfectly separates the
data points of different classes without any misclassifications.
• Soft Margin: When data contains outliers or is not perfectly separable, SVM uses the soft
margin technique. This method introduces a slack variable for each data point to allow some
misclassifications while balancing between maximizing the margin and minimizing violations.
• C: The C parameter in SVM is a regularization term that balances margin maximization and the penalty for misclassifications. A higher C value imposes a stricter penalty for margin violations, leading to a smaller margin but fewer misclassifications (see the sketch after this list).
• Hinge Loss: The hinge loss is a common loss function in SVMs. It penalizes misclassified points or
margin violations and is often combined with a regularization term in the objective function.
• Dual Problem: The dual problem in SVM involves solving for the Lagrange multipliers associated
with the support vectors. This formulation allows for the use of the kernel trick and facilitates more
efficient computation.
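The sketch below ties together the kernel, C, and support-vector terminology above using scikit-learn's SVC; the dataset and the particular C values are illustrative assumptions:

```python
# Soft-margin SVM sketch: the C parameter trades margin width against
# margin violations. The dataset and C values are illustrative assumptions.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # A small C tolerates more violations and tends to keep more support vectors;
    # a large C penalizes violations heavily, giving a narrower, stricter margin.
    print(f"C={C}: {clf.n_support_.sum()} support vectors, "
          f"train accuracy={clf.score(X, y):.2f}")
```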
Mathematical Computation: SVM
Consider a binary classification problem with two classes, labeled as +1 and -1. We have a training dataset
consisting of input feature vectors X and their corresponding class labels Y.
The equation for the linear hyperplane can be written as:

w^T x + b = 0

The vector w represents the normal vector to the hyperplane, i.e., the direction perpendicular to the hyperplane. The parameter b represents the offset, or distance of the hyperplane from the origin along the normal vector w.
The distance between a data point x_i and the decision boundary can be calculated as:

d_i = (w^T x_i + b) / ||w||

where ||w|| is the Euclidean norm of the weight vector w.
Optimization:
• For the hard-margin linear SVM classifier, maximizing the margin is equivalent to:

minimize (1/2) ||w||^2 subject to y_i (w^T x_i + b) >= 1 for all i
Types of Support Vector Machine


Based on the nature of the decision boundary, Support Vector Machines (SVM) can be divided into two main
parts:
• Linear SVM: Linear SVMs use a linear decision boundary to separate the data points of different
classes. When the data can be precisely linearly separated, linear SVMs are very suitable. This means
that a single straight line (in 2D) or a hyperplane (in higher dimensions) can entirely divide the data
points into their respective classes. A hyperplane that maximizes the margin between the classes is
the decision boundary.
• Non-Linear SVM: A non-linear SVM is used to classify data that cannot be separated into two classes by a straight line (in the 2D case). By using kernel functions, non-linear SVMs can handle non-linearly separable data. These kernel functions transform the original input data into a higher-dimensional feature space, where the data points can be linearly separated. A linear SVM is then used to locate the decision boundary in this transformed space, which corresponds to a non-linear boundary in the original space (a brief sketch follows below).
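A brief scikit-learn sketch contrasting a linear SVM with a non-linear (RBF-kernel) SVM on concentric-circle data; the dataset and the gamma value are illustrative assumptions:

```python
# Non-linear SVM sketch: concentric-circle data cannot be split by a line,
# but an RBF kernel separates it by implicitly mapping to a higher-dimensional
# feature space. The dataset and gamma value are illustrative assumptions.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf", gamma=1.0).fit(X, y).score(X, y)

print(f"linear kernel accuracy: {linear_acc:.2f}")   # roughly chance level
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")      # near-perfect separation
```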
Popular kernel functions in SVM include:
• Linear kernel: K(x, y) = x · y
• Polynomial kernel: K(x, y) = (x · y + c)^d
• Radial basis function (RBF) kernel: K(x, y) = exp(-γ ||x - y||^2)
• Sigmoid kernel: K(x, y) = tanh(γ x · y + c)