AIML Material: Unit-1 to 5
UNIT– I:
Introduction: Definition of Artificial Intelligence, Evolution, Need, and applications in real world.
Intelligent Agents, Agents and environments; Good Behavior-The concept of rationality, the nature
of environments, structure of agents.
Neural Networks and Genetic Algorithms: Neural network representation, problems, perceptrons,
multilayer networks and back propagation algorithms, Genetic algorithms.
UNIT– II:
Knowledge–Representation and Reasoning: Logical Agents: Knowledge based agents, the
Wumpus world, logic. Patterns in Propositional Logic, Inference in First-Order Logic-Propositional
vs first order inference, unification and lifting
UNIT– III:
Bayesian and Computational Learning: Bayes theorem, concept learning, maximum likelihood,
minimum description length principle, Gibbs Algorithm, Naïve Bayes Classifier, Instance Based
Learning- K-Nearest neighbour learning
Introduction to Machine Learning (ML): Definition, Evolution, Need, applications of ML in
industry and real world, classification; differences between supervised and unsupervised learning
paradigms.
UNIT– IV:
Basic Methods in Supervised Learning: Distance-based methods, Nearest-Neighbors, Decision
Trees, Support Vector Machines, Nonlinearity and Kernel Methods.
Unsupervised Learning: Clustering, K-means, Dimensionality Reduction, PCA and kernel.
UNIT– V:
Machine Learning Algorithm Analytics: Evaluating Machine Learning algorithms, Model,
Selection, Ensemble Methods (Boosting, Bagging, and Random Forests).
Modeling Sequence/Time-Series Data and Deep Learning: Deep generative models, Deep
Boltzmann Machines, Deep auto-encoders, Applications of Deep Networks.
UNIT-I
The goal of AI is to build smart computer systems that, like humans, can solve complex
problems.
What Is AI?
AI aims to build machines capable of human-like activities such as:
• Talking
• Thinking
• Learning
• Planning
• Understanding
AI Examples
• E-Payment
• Google Maps
• Text Autocorrect
• Automated Translation
• Chatbots
• Social Media
• Face Detection
• Search Algorithms
• Robots
• Automated Investment
• Flying Drones
• Dr. Watson
• Apple Siri
• Microsoft Cortana
• Amazon Alexa
Artificial Intelligence is neither a new term nor a new technology for researchers;
it is much older than you might imagine.
• Year 1943: The first work that is now recognized as AI was done by Warren McCulloch and
Walter Pitts in 1943. They proposed a model of artificial neurons.
• Year 1949: Donald Hebb demonstrated an updating rule for modifying the connection strength
between neurons. His rule is now called Hebbian learning.
• Year 1950: Alan Turing, an English mathematician who pioneered machine learning, published
"Computing Machinery and Intelligence," in which he proposed a test that checks a machine's
ability to exhibit intelligent behaviour equivalent to human intelligence, now called the
Turing test.
• Year 1955: Allen Newell and Herbert A. Simon created the Logic Theorist, the first AI
program, which proved mathematics theorems and found new and more elegant proofs for some
theorems.
• Year 1956: The term "Artificial Intelligence" was first adopted by American computer scientist
John McCarthy at the Dartmouth Conference. For the first time, AI was coined as an academic
field.
At that time, high-level computer languages such as FORTRAN, LISP, and COBOL were being
invented, and enthusiasm for AI was very high.
• Year 1966: Joseph Weizenbaum created the first chatbot, named ELIZA.
• Year 1972: The first intelligent humanoid robot, named WABOT-1, was built in Japan.
A boom of AI (1980-1987)
• Year 1980: After the first AI winter, AI came back with "Expert Systems," programs that
emulate the decision-making ability of a human expert.
• In the year 1980, the first national conference of the American Association of Artificial
Intelligence was held at Stanford University.
• Later, investors and governments again stopped funding AI research because of high costs and
inefficient results, even though expert systems such as XCON had initially been very
cost-effective.
• Year 2002: For the first time, AI entered the home in the form of Roomba, a vacuum cleaner.
• Year 2006: By this year AI had entered the business world; companies like Facebook, Twitter, and
Netflix started using AI.
The agents sense the environment through sensors and act on their environment through actuators.
An AI agent can have mental properties such as knowledge, belief, intention, etc.
Agent Terminology
• Performance Measure of Agent − It is the criteria, which determines how successful an agent is.
• Behavior of Agent − It is the action that agent performs after any given sequence of percepts.
• Percept − It is agent’s perceptual inputs at a given instance.
• Percept Sequence − It is the history of all that an agent has perceived till date.
• Agent Function − It is a map from the percept sequence to an action.
What is an Agent?
An agent is anything that gets information from its environment through sensors and acts on the
environment through actuators. For example:
o Human-Agent: A human agent has eyes, ears, and other organs which work for sensors and hand, legs, vocal tract
work for actuators.
o Robotic Agent: A robotic agent can have cameras, infrared range finder, sensors and various motors for actuators.
o Software Agent: Software agent can have keystrokes, file contents as sensory input and act on those inputs and
display output on the screen.
Hence the world around us is full of agents, such as cellphones and cameras; even we ourselves are agents.
Before moving forward, we should first know about sensors, effectors, and actuators.
Sensor: Sensor is a device which detects the change in the environment and sends the information to other electronic
devices. An agent observes its environment through sensors.
Actuators: Actuators are the component of machines that converts energy into motion. The actuators are only responsible
for moving and controlling a system. An actuator can be an electric motor, gears, etc.
Effectors: Effectors are the devices which affect the environment. Effectors can be legs, wheels, arms, fingers, wings, fins,
and display screen.
Intelligent Agents:
An intelligent agent is an autonomous entity which acts upon an environment using sensors and actuators to achieve
goals.
An intelligent agent may learn from the environment to achieve its goals.
Following are the main four rules for an AI agent:
o Rule 1: An AI agent must have the ability to perceive (receive) the environment.
o Rule 2: The observation must be used to make decisions.
o Rule 3: A decision should result in an action.
o Rule 4: The action taken must be a rational action.
Rational Agent:
A rational agent is an agent that acts so as to maximize its performance measure, considering
all possible actions, based on the percepts received from the environment.
Rational agents are useful in game theory and decision theory for various real-world scenarios.
PEAS Representation
PEAS is a model on which an AI agent works.
When we define an AI agent or rational agent, then we can group its properties under PEAS representation model. It is made
up of four words:
o P: Performance measure
o E: Environment
o A: Actuators
o S: Sensors
Here performance measure is the objective for the success of an agent's behavior.
3. Part-picking Robot: Performance measure: percentage of parts in correct bins; Environment:
conveyor belt with parts, bins; Actuators: jointed arm and hand; Sensors: camera, joint angle
sensors.
Where the right action means the action that causes the agent to be most successful in the given percept sequence.
The problem the agent solves is characterized by Performance Measure, Environment, Actuators, and Sensors (PEAS).
Internal State − It is a representation of unobserved aspects of current state depending on percept history.
Since the knowledge supporting a decision is explicitly modeled, it can be modified.
The environment is where the agent lives and operates; it provides the agent with something to sense and act upon.
Features of Environment
As per Russell and Norvig, an environment can have various features from the point of view of an agent:
1. Fully observable vs Partially observable
2. Static vs Dynamic
3. Discrete vs Continuous
4. Deterministic vs Stochastic
5. Single-agent vs Multi-agent
6. Episodic vs sequential
7. Known vs Unknown
8. Accessible vs Inaccessible
1. Fully observable vs Partially observable:
o If an agent sensor can sense or access the complete state of an environment at each point of time then it is a fully
observable environment, else it is partially observable.
o A fully observable environment is easy as there is no need to maintain the internal state to keep track history of the
world.
o If an agent has no sensors in an environment, then such an environment is called unobservable.
2. Deterministic vs Stochastic:
o If an agent's current state and selected action can completely determine the next state of the environment, then
such an environment is called deterministic; otherwise it is stochastic.
o In a deterministic, fully observable environment, agent does not need to worry about uncertainty.
3. Episodic vs Sequential:
o In an episodic environment, there is a series of one-shot actions, and only the current percept is required for the
action.
o However, in Sequential environment, an agent requires memory of past actions to determine the next best actions.
4. Single-agent vs Multi-agent
o If only one agent is involved in an environment, and operating by itself then such an environment is called single
agent environment.
o However, if multiple agents are operating in an environment, then such an environment is called a multi-agent
environment.
o The agent design problems in the multi-agent environment are different from single agent environment.
5. Static vs Dynamic:
o If the environment can change itself while an agent is deliberating then such environment is called a dynamic
environment else it is called a static environment.
o Static environments are easy to deal with because an agent does not need to keep looking at the world while
deciding on an action.
o However, for a dynamic environment, agents need to keep looking at the world before each action.
6. Discrete vs Continuous:
o If in an environment there are a finite number of percepts and actions that can be performed within it, then such an
environment is called a discrete environment else it is called continuous environment.
o A chess game comes under discrete environment as there is a finite number of moves that can be performed.
7. Known vs Unknown
o Known and unknown are not actually a feature of an environment, but it is an agent's state of knowledge to perform
an action.
o In a known environment, the results for all actions are known to the agent. While in unknown environment, agent
needs to learn how it works in order to perform an action.
o It is quite possible for a known environment to be partially observable, and for an unknown environment to be fully
observable.
8. Accessible vs Inaccessible
o If an agent can obtain complete and accurate information about the state's environment, then such an environment
is called an Accessible environment else it is called inaccessible.
o An empty room whose state can be defined by its temperature is an example of an accessible environment.
Structure of an AI Agent
The task of AI is to design an agent program which implements the agent function.
The structure of an intelligent agent is a combination of architecture and agent program.
It can be viewed as:
1. Agent = Architecture + Agent program
Following are the main three terms involved in the structure of an AI agent:
Architecture: Architecture is the machinery on which an AI agent executes.
Agent Function: Agent function is a map from a percept sequence to an action.
Agent Program: Agent program is an implementation of the agent function, which executes on the physical architecture.
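The agent function can be sketched as a lookup table from percepts to actions. The two-location vacuum world below is a standard illustration; the location names, status values, and actions are hypothetical, and single percepts (rather than full percept sequences) are used for brevity.

```python
# A minimal table-driven agent for a hypothetical two-location
# vacuum world: the agent function maps a percept to an action.
AGENT_TABLE = {
    ("A", "Dirty"): "Suck",
    ("A", "Clean"): "Right",
    ("B", "Dirty"): "Suck",
    ("B", "Clean"): "Left",
}

def agent_function(percept):
    """Map a percept (location, status) to an action."""
    return AGENT_TABLE[percept]

print(agent_function(("A", "Dirty")))  # Suck
```

A real agent program would implement this function on the agent's architecture; the table form only works when the percept space is small.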
1. A given node takes the weighted sum of its inputs, and passes it through a non-linear activation function.
2. This is the output of the node, which then becomes the input of another node in the next layer.
3. The signal flows from left to right, and the final output is calculated by performing this procedure for all
the nodes.
4. Training this deep neural network means learning the weights associated with all the edges.
5. The equation for a given node is the weighted sum of its inputs passed through a non-linear
activation function: output = f(w1*x1 + w2*x2 + ... + wn*xn).
6. It can be represented as a vector dot product, output = f(w · x), where n is the number of inputs for the node.
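The computation at a single node described above can be sketched as follows. The sigmoid activation and the sample input and weight values are illustrative choices, not values from the text.

```python
import math

def node_output(inputs, weights, bias=0.0):
    """A single node: the weighted sum of its inputs passed through
    a non-linear activation function (sigmoid here), f(w . x + b)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid squashes to (0, 1)

# Example node with two inputs (illustrative values).
out = node_output([1.0, 0.5], [0.4, -0.2])
print(out)
```

The output of this node would then become one of the inputs to a node in the next layer, and repeating the computation left to right over all nodes gives the network's final output.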
Referred from:-https://towardsdatascience.com/applied-deep-learning-part-1-artificial-neural-networks-d7834f67a4f6
Neurons
Scientists agree that our brain has around 100 billion neurons.
These neurons have hundreds of billions of connections between them.
Neurons (aka Nerve Cells) are the fundamental units of our brain and nervous system.
The neurons are responsible for receiving input from the external world, for sending output
(commands to our muscles), and for transforming the electrical signals in between.
Neural Networks
Artificial Neural Networks are normally called Neural Networks (NN).
Neural networks are in fact multi-layer perceptrons (perceptron: a unit of perception).
The perceptron is the single basic unit from which neural networks are built.
Activation functions are the final and important components that help determine whether the neuron will fire or not.
An activation function can be as simple as a step function.
Weights
Neural network training is about finding weights that minimize prediction error.
We usually start our training with a set of randomly generated weights.
Then, backpropagation is used to update the weights in an attempt to correctly map arbitrary inputs to outputs.
w1 = 0.11, w2 = 0.21, w3 = 0.12, w4 = 0.08, w5 = 0.14, w6 = 0.15
Dataset
Our dataset has one sample with two inputs and one output.
Forward Pass
We will use the given weights and inputs to predict the output. Inputs are multiplied by weights; the results are then passed
forward to the next layer.
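The forward pass can be sketched with the six weights listed above. A 2-2-1 network with linear activations is assumed, and the sample values (inputs 2 and 3, target output 1) are illustrative, since the original dataset figure is not reproduced here.

```python
# Forward pass through a 2-2-1 network with linear activations.
# Sample (inputs 2, 3; target 1) is an assumed illustration.
i1, i2, target = 2.0, 3.0, 1.0
w1, w2, w3, w4, w5, w6 = 0.11, 0.21, 0.12, 0.08, 0.14, 0.15

h1 = i1 * w1 + i2 * w2          # hidden node 1
h2 = i1 * w3 + i2 * w4          # hidden node 2
prediction = h1 * w5 + h2 * w6  # output node

error = 0.5 * (prediction - target) ** 2  # squared-error loss
print(prediction)  # ≈ 0.191, far from the target 1.0
print(error)
```

The prediction is nowhere near the actual output, which is exactly the situation the error-calculation step above describes.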
Calculating Error
• Now, it’s time to find out how our network performed by calculating the difference between the actual output
and predicted one.
• It’s clear that our network output, or prediction, is not even close to actual output.
• We can calculate the difference, or the error, as follows.
Reducing Error
Our main goal in training is to reduce the error, i.e., the difference between the prediction and the actual output.
Since the actual output is constant ("not changing"), the only way to reduce the error is to change the prediction value.
The question now is, how to change prediction value?
By decomposing prediction into its basic elements we can find that weights are the variable elements
affecting prediction value.
In other words, in order to change prediction value, we need to change weights values.
Backpropagation
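One backpropagation step, as described above, can be sketched for the same 2-2-1 linear network: compute the gradient of the error with respect to each weight by the chain rule, then move each weight a small step against its gradient. The sample values (inputs 2 and 3, target 1) and the learning rate 0.05 are assumptions for illustration.

```python
# One gradient-descent / backpropagation step (illustrative values).
lr = 0.05
i1, i2, target = 2.0, 3.0, 1.0
w1, w2, w3, w4, w5, w6 = 0.11, 0.21, 0.12, 0.08, 0.14, 0.15

def forward():
    h1 = i1 * w1 + i2 * w2
    h2 = i1 * w3 + i2 * w4
    return h1, h2, h1 * w5 + h2 * w6

h1, h2, prediction = forward()
old_error = 0.5 * (prediction - target) ** 2
delta = prediction - target           # d(error)/d(prediction)

# Chain rule: gradient of the error w.r.t. each weight.
grads = (delta * w5 * i1, delta * w5 * i2,   # w1, w2
         delta * w6 * i1, delta * w6 * i2,   # w3, w4
         delta * h1, delta * h2)             # w5, w6
w1, w2, w3, w4, w5, w6 = (w - lr * g for w, g in
                          zip((w1, w2, w3, w4, w5, w6), grads))

_, _, prediction = forward()
new_error = 0.5 * (prediction - target) ** 2
print(old_error, new_error)  # the error shrinks after the update
```

Repeating this update many times drives the prediction toward the target; this is exactly the "change the weights to change the prediction" idea from the previous section.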
o Population: Population is the subset of all possible or probable solutions, which can solve the given problem.
o Gene: A chromosome is divided into different elements called genes.
o Chromosomes: A chromosome is one of the solutions in the population for the given problem; the collection of genes makes up a
chromosome.
o Allele: Allele is the value provided to the gene within a particular chromosome.
o Fitness Function: The fitness function is used to determine the individual's fitness level in the population. It means the ability of an individual
to compete with other individuals. In every iteration, individuals are evaluated based on their fitness function.
o Genetic Operators: In a genetic algorithm, the best individuals mate to produce offspring better than the parents. Genetic
operators play a role in changing the genetic composition of the next generation.
o Selection
After calculating the fitness of every individual in the population, a selection process is used to determine which of the individuals
in the population will get to reproduce and create the offspring that will form the next generation.
1. Initialization
The process of a genetic algorithm starts by generating the set of individuals, which is called population. Here each individual is the solution for the
given problem. An individual contains or is characterized by a set of parameters called Genes. Genes are combined into a string and generate
chromosomes, which is the solution to the problem. One of the most popular techniques for initialization is the use of random binary strings.
2. Fitness Assignment
1. The fitness function is used to determine how fit an individual is.
2. This score determines the probability of being selected for reproduction.
3. The higher the fitness score, the greater the chance of being selected for reproduction.
3. Selection
1. The selection phase involves the selection of individuals for the reproduction of offspring.
2. All the selected individuals are then arranged in pairs of two for reproduction.
3. Then these individuals transfer their genes to the next generation.
4. Reproduction
After the selection process, the creation of a child occurs in the reproduction step. In this step, the genetic algorithm uses
two variation operators that are applied to the parent population. The two operators involved in the reproduction phase are
given below:
o Crossover: The crossover plays a most significant role in the reproduction phase of the genetic algorithm. In this
process, a crossover point is selected at random within the genes. Then the crossover operator swaps genetic
information of two parents from the current generation to produce a new individual representing the offspring.
o The genes of the parents are exchanged among themselves until the crossover point is met.
Types of crossover styles available:
1. One-point crossover
2. Two-point crossover
3. Uniform crossover
Mutation
The mutation operator inserts random genes in the offspring (new child) to maintain the diversity in the population.
o It can be done by flipping some bits in the chromosomes.
Mutation helps in solving the issue of premature convergence and enhances diversification.
Types of mutation styles available,
➢ Flip bit mutation
➢ Exchange/Swap mutation
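The whole cycle above (initialization, fitness assignment, selection, crossover, mutation) can be sketched on a toy problem. Maximizing the number of 1-bits in a binary string (the OneMax problem), tournament selection, and the population and rate parameters below are all illustrative choices.

```python
import random

random.seed(0)

def fitness(chrom):
    """OneMax: fitness is the count of 1-bits (illustrative)."""
    return sum(chrom)

def select(pop):
    """Tournament selection of size 2: the fitter of two random
    individuals wins the right to reproduce."""
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):
    """One-point crossover at a random point within the genes."""
    point = random.randint(1, len(p1) - 1)
    return p1[:point] + p2[point:]

def mutate(chrom, rate=0.05):
    """Flip-bit mutation to maintain diversity in the population."""
    return [1 - g if random.random() < rate else g for g in chrom]

# 1. Initialization: a population of random binary strings.
pop = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]

# 2-5. Repeat fitness assignment, selection, and reproduction.
for _ in range(40):
    pop = [mutate(crossover(select(pop), select(pop))) for _ in pop]

best = max(pop, key=fitness)
print(fitness(best))  # close to the optimum of 20
```

After a few dozen generations the population converges toward the all-ones string; mutation keeps it from converging prematurely.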
KNOWLEDGE REPRESENTATION
Suppose we wish to write a program to play a simple card game using the standard
deck of 52 playing cards. We will need some way to represent the cards dealt to each player
and a way to express the rules. We can represent cards in different ways.
1. The most straightforward way is to record the suit (clubs, diamonds, hearts, spades)
and face values (ace, 2, 3, ..., 10, jack, queen, king) as a symbolic pair. So the queen
of hearts might be represented as <queen, hearts>.
2. Alternatively, we could assign abbreviated codes (c6 for the 6 of clubs), numeric
values which ignore suit (1, 2, ... , 13), or some other scheme. If the game we wish to
play is bridge, suit as well as value will be important.
3. On the other hand, if the game is blackjack, only face values are important and a
simpler program will result if only numeric values are used.
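The two representations described above can be sketched as follows. The blackjack point-value mapping (counting an ace as 11) is a simplifying assumption for illustration.

```python
# 1. Symbolic pair (face value, suit): keeps suit, which matters
#    for games like bridge.
queen_of_hearts = ("queen", "hearts")

# 2. Numeric value only: enough for blackjack, where suit is
#    irrelevant (ace counted as 11 here for simplicity).
def blackjack_value(card):
    """Map a (face, suit) pair to its blackjack point value."""
    face = card[0]
    if face in ("jack", "queen", "king"):
        return 10
    if face == "ace":
        return 11
    return int(face)

print(blackjack_value(queen_of_hearts))  # 10
print(blackjack_value(("7", "clubs")))   # 7
```

Choosing the second representation throws away suit information, which is exactly why the choice of representation depends on the problem being solved.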
To see how important a good representation is, one only needs to try solving a few simple
problems using different representations. Consider the problem of discovering a pattern in the
sequence of numbers 1 1 2 3 4 7. A change of base in the number from l 0 to 2 transforms the
number to
011011011011011011.
In order to solve complex problems encountered in artificial intelligence, one needs both
a large amount of knowledge and some mechanism for manipulating that knowledge to
create solutions.
Knowledge and Representation are two distinct entities. They play central but
distinguishable roles in the intelligent system.
Explicit knowledge
Exists outside a human being; it is embedded in artifacts.
Can be articulated formally.
Can be shared, copied, processed, and stored.
So it is easy to steal or copy.
Drawn from an artifact of some type: a principle, procedure, process, or concept.
A variety of ways of representing knowledge have been exploited in AI programs.
There are two different kinds of entities we are dealing with:
1. Facts: truths in some relevant world; things we want to represent.
2. Representations of facts in some chosen formalism; things we will be able to
manipulate.
These entities are structured at two levels:
1. The knowledge level, at which facts are described.
2. The symbol level, at which representations of objects are defined in terms of
symbols that can be manipulated by programs.
The computer requires a well-defined problem description to process, and provides a
well-defined, acceptable solution. The computer can then use an algorithm to compute an
answer. This process is illustrated by the knowledge representation framework.
Relational Knowledge
The simplest way to represent declarative facts is a set of relations of the same sort used
in the database system.
It provides a framework to compare two objects based on equivalent attributes. Any
instance in which two different objects are compared is a relational type of knowledge.
The table below shows a simple way to store facts.
The facts about a set of objects are put systematically in columns.
This representation provides little opportunity for inference.
Given the facts, it is not possible to answer a simple question such as: "Who is the
heaviest player?"
But if a procedure for finding the heaviest player is provided, then these facts will
enable that procedure to compute an answer.
We can ask things like who "bats left" and "throws right".
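The point above, that relational facts support no inference on their own and need a supplied procedure, can be sketched as follows. The player records are made-up illustrations of the kind of baseball table the text describes.

```python
# A relational table of facts, stored as rows with named columns.
# The player data is illustrative.
players = [
    {"name": "Hank",    "height": 6.0, "weight": 180,
     "bats": "left",  "throws": "right"},
    {"name": "Pee Wee", "height": 5.8, "weight": 170,
     "bats": "right", "throws": "right"},
]

# The facts alone cannot answer "Who is the heaviest player?";
# a procedure must be supplied to compute the answer.
heaviest = max(players, key=lambda p: p["weight"])
print(heaviest["name"])

# Procedure for "who bats left and throws right?"
both = [p["name"] for p in players
        if p["bats"] == "left" and p["throws"] == "right"]
print(both)
```

The representation stores the facts; all of the inferential power lives in the procedures written against it.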
Inheritable Knowledge
Here the knowledge elements inherit attributes from their parents.
The knowledge embodied in the design hierarchies found in the functional, physical and
process domains.
Within the hierarchy, elements inherit attributes from their parents, but in many cases, not
all attributes of the parent elements are prescribed to the child elements.
Inheritance is a powerful form of inference, but not adequate on its own; the basic KR
(Knowledge Representation) needs to be augmented with an inference mechanism.
Property inheritance: The objects or elements of specific classes inherit attributes and
values from more general classes.
The classes are organized in a generalized hierarchy.
Inferential Knowledge
This knowledge generates new information from the given information.
The new information does not require further data gathering from the source, but does
require analysis of the given information to generate new knowledge.
Example: given a set of relations and values, one may infer other values or relations.
Predicate logic (a mathematical deduction system) is used to infer from a set of attributes.
Inference through predicate logic uses a set of logical operations to relate
individual data.
Represent knowledge as formal logic: All dogs have tails: ∀x: dog(x) → hastail(x)
Advantages:
A set of strict rules.
Can be used to derive more facts.
Truths of new statements can be verified.
Guaranteed correctness.
Many inference procedures are available to implement standard rules of logic popular in
AI systems, e.g., automated theorem proving.
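As a minimal sketch of inferential knowledge, the rule above (all dogs have tails) can be applied to a set of stored facts to derive new facts. The individuals named are illustrative.

```python
# Facts as (predicate, individual) pairs; the names are made up.
facts = {("dog", "fido"), ("dog", "rex"), ("cat", "tom")}

# Rule: for all x, dog(x) -> hastail(x).
# Applying it derives hastail(...) for every known dog.
derived = {("hastail", x) for (pred, x) in facts if pred == "dog"}
facts |= derived

print(("hastail", "fido") in facts)  # True: inferred, not stored
print(("hastail", "tom") in facts)   # False: the rule says nothing about cats
```

This is the essence of forward inference: new statements are generated mechanically, and their truth is guaranteed by the strictness of the rule.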
Procedural Knowledge
A representation in which the control information needed to use the knowledge is embedded in
the knowledge itself. For example, computer programs, directions, and recipes; these indicate
specific use or implementation.
Knowledge is encoded in procedures: small programs that know how to do
specific things, how to proceed.
Advantages:
Heuristic or domain-specific knowledge can be represented.
Extended logical inferences, such as default reasoning, are facilitated.
Side effects of actions may be modeled. Some rules may become false in time;
keeping track of this in large systems may be tricky.
Disadvantages:
Completeness — not all cases may be represented.
Consistency — not all deductions may be correct. E.g., if we know that Fred is a
bird we might deduce that Fred can fly; later we might discover that Fred is an
emu.
Modularity is sacrificed: changes in the knowledge base might have far-reaching effects.
Cumbersome control information.
All Knowledge representation schemes suffer from problems. We will discuss them here.
Are any attributes of objects so basic that they occur in almost every problem
domain? If there are, we need to make sure that they are handled appropriately in
each of the mechanisms we propose. If such attributes exist, what are they?
Are there any important relationships that exist among attributes of objects?
Given a large amount of knowledge stored in a database, how can relevant parts
be accessed when they are needed?
Important Attributes
There are two attributes that are of very general significance, and we have already seen
their use: instance and isa. These attributes are important because they support property
inheritance. They are called a variety of things in AI systems, but the names do not
matter. What does matter is that they represent class membership and class inclusion,
and that class inclusion is transitive.
The attributes that we use to describe objects are themselves entities that we represent.
What properties do they have, independent of the specific knowledge they encode? There
are four such properties, listed as issues that should be raised when using a knowledge
representation technique:
Inverses
The relationships between the attributes of an object include inverses, existence, techniques
for reasoning about values, and single-valued attributes. We can consider an example of an
inverse in the assertion band(John Zorn, Naked City).
This can be treated as "John Zorn plays in the band Naked City" or "John Zorn's band is Naked
City".
The second approach is to use attributes that focus on a single entity, but to use them in
pairs, one the inverse of the other. The band information is represented with two attributes.
Just as there are classes of objects and specialized subsets of those classes, there are attributes
and specializations of attributes. Consider, for example, the attribute height. It Is actually a
specialization of the more general attribute physical-size which is, in turn, a specialization of
physical-attribute. These generalization-specialization relationships are important for
attributes for the same reason that they are important for other concepts: they support
inheritance. In the case of attributes, they support inheriting information about such things
as constraints on the values that the attribute can have and mechanisms for computing those
values.
Sometimes values of attributes are specified explicitly when a knowledge base is created.
Several kinds of information can play a role in this reasoning, including:
• Information about the type of the value. For example, the value of height must be a number
measured in a unit of length.
• Constraints on the value, often stated in terms of related entities. For example, the age of a
person cannot be greater than the age of either of that person's parents.
• Rules for computing the value when it is needed. These rules are called backward rules;
they have also been called if-needed rules.
• Rules that describe actions that should be taken if a value ever becomes known. These rules
are called forward rules, or sometimes if-added rules.
Single-valued attribute
A specific but very useful kind of attribute is one that is guaranteed to take a unique value.
For example, a baseball player can, at any one time, have only a single height and be a
member of only one team. If there is already a value present for one of these attributes and a
different value is asserted, then one of two things has happened: either a change has
occurred in the world, or there is now a contradiction in the knowledge base that needs to be
resolved. Knowledge-representation systems have taken several different approaches to
providing support for single-valued attributes, including:
• Introduce an explicit notation for temporal intervals. If two different values are ever asserted
for the same temporal interval, signal a contradiction automatically.
• Assume that the only temporal interval that is of interest is now. So if a new value is
asserted, replace the old value.
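The second approach above, keeping only the current value and replacing the old one on a new assertion, can be sketched as follows. The entity and attribute names are hypothetical.

```python
# Knowledge base keyed by (entity, attribute); single-valued
# attributes keep only the value that holds "now".
knowledge = {}

def assert_value(entity, attribute, value):
    """Assert a value for a single-valued attribute: the new
    assertion replaces the old value, which is returned so a
    change (or contradiction) could be flagged if desired."""
    old = knowledge.get((entity, attribute))
    knowledge[(entity, attribute)] = value
    return old

assert_value("reese", "team", "Brooklyn-Dodgers")
previous = assert_value("reese", "team", "Pirates")
print(knowledge[("reese", "team")])  # Pirates
print(previous)                      # Brooklyn-Dodgers
```

The temporal-interval approach would instead store both values, each tagged with the interval during which it held, and signal a contradiction only if two values were asserted for the same interval.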
At what level should the knowledge be represented, and what are the primitives? This is the
issue of choosing the granularity of representation. Primitives are fundamental concepts such
as holding, seeing, and playing. As English is a very rich language with over half a million
words, it is clear we will find difficulty in deciding which words to choose as our primitives
in a series of situations, for example:
feeds(tom, dog)
There are several arguments against the use of low-level primitives. One is that simple
high-level facts may require a lot of storage when broken down into primitives. Much of that
storage is really wasted since the low-level rendition of a particular high level concept will
appear many times, once for each time the high-level concept is referenced. For example,
suppose that actions are being represented as combinations of a small set of primitive actions.
Then the fact that John punched Mary might be represented as shown in first figure. The
representation says that there was physical contact between John's fist and Mary. The contact
was caused by John propelling his fist toward Mary, and to do that John first went to where
Mary was. But suppose we also know that Mary punched John. Then we must also store the
structure shown in second figure below. If, however, punching were represented simply as
punching, then most of the detail of both structures could be omitted from the structures
themselves. It could instead be stored just once in a common representation of the concept of
punching.
Representing Set of Objects
It is important to be able to represent set of objects for several reasons. One is that there are
some properties that are true of sets that are not true of the individual members of a set. As
examples, consider the assertions that are being made in the sentences.
"There are more sheep than people in Australia" and "English speakers can be found all over
the world.".
The only way to represent the facts described in these sentences is to attach assertions to the
sets representing people, sheep, and English speakers, since, for example, no single English
speaker can be found all over the world. The other reason that it is important to be able to
represent sets of objects is that if a property is true of all (or even most) elements of a set,
then it is more efficient to associate it once with the set rather than to associate it explicitly
with every element of the set.
Thus if we assert something like large(Elephant), it must be clear whether we are asserting
some property of the set itself (i.e., that the set of elephants is large) or some property that
holds for individual elements of the set (i.e., that anything that is an elephant is large). There
are three obvious ways in which sets may be represented. The simplest is just by a name.
This is the issue of locating appropriate knowledge structures that have been stored in
memory. For example, suppose we have a script (a description of a class of events in terms of
contexts, participants, and subevents) that describes the typical sequence of events in a
restaurant. This script would enable us to take a text such as:
John went to Steak and Ale last night. He ordered a large, rare steak, paid his bill, and left.
And answer yes to the question: Did John eat dinner last night?
Notice that nowhere in the story was John's eating anything mentioned explicitly. But the fact
that when one goes to a restaurant one eats will be contained in the restaurant script. If we
know in advance to use the restaurant script, then we can answer the question easily. But in
order to be able to reason about a variety of things, a system must have many scripts for
everything from going to work to sailing around the world. How will it select the appropriate
one each time? For example, nowhere in our story was the word "restaurant" mentioned.
In fact in order to have access to the right structure for describing a particular situation, it is
necessary to solve all of the following problems.
There is no good, general-purpose method for solving all these problems. Some
knowledge-representation techniques solve some of them. This leads to two questions: how to
select an initial structure to consider, and how to find a better structure (or revise it) if
that one turns out not to be a good match.
KNOWLEDGE REPRESENTATION USING PROPOSITIONAL AND PREDICATE
LOGIC
It is raining.
My car is painted silver.
John and Sue have five children.
Snow is white.
People live on the moon.
Compound propositions are formed from atomic formulas using the logical connectives not,
and, or, if . . . then, and if and only if.
We will use capital letters, sometimes followed by digits, to stand for propositions; T and F
are special symbols having the values true and false, respectively.
~ (or -) for not or negation
& for and or conjunction
V for or or disjunction
→ for if . . . then or implication
↔ for if and only if or double implication
In addition, left and right parentheses, left and right braces, and the period will be used as
delimiters for punctuation. So, for example, to represent the compound sentence "It is raining
and the wind is blowing" we could write (R & B) where R and B stand for the propositions "It
is raining" and "the wind is blowing," respectively. If we write (R V B) we mean "it is raining
or the wind is blowing or both" that is, V indicates inclusive disjunction.
Semantics
The semantics or meaning of a sentence is just the value true or false; that is, it is an
assignment of a truth value to the sentence. The values true and false should not be confused
with the symbols T and F which can appear within a sentence. An interpretation for a
sentence or group of sentences is an assignment of a truth value to each propositional symbol.
As an example, consider the statement (P & -Q). One interpretation (I1) assigns true to P and
false to Q. A different interpretation (I2) assigns true to P and true to Q. Clearly, there are
four distinct interpretations for this sentence. Some semantic rules are summarized in the
table below.
We can now find the meaning of any statement given an interpretation I for the statement.
For example, let I assign true to P, false to Q and false to R in the statement
Application of rule 2 then gives -Q as true, rule 3 gives (P & -Q) as true, rule 6 gives (P & -
Q) → R as false, and rule 5 gives the statement value as false.
Properties of statements
LOGICAL CONSEQUENCES
More generally, a statement is a logical consequence of other statements if and only if, for
any interpretation in which those statements are true, it is also true. A valid
statement is satisfiable, and a contradictory statement is invalid, but the converse is not
necessarily true. As examples of the above definitions consider the following statements.
P is satisfiable but not valid since an interpretation that assigns false to P assigns false to the
sentence P.
The notion of logical consequence provides us with a means to perform valid inferencing in
PL. The following are two important theorems which give criteria for a statement to be a
logical consequence of a set of statements.
Theorem 4.1. The sentence s is a logical consequence of s1, ..., sn if and only if s1 & s2
& ... & sn → s is valid.
Theorem 4.2. The sentence s is a logical consequence of s1, ..., sn if and only if s1 & s2
& ... & sn & ~s is inconsistent. Table 4.2 lists some of the important laws of PL.
One way to determine the equivalence of two sentences is by using a truth table. For example,
the conditional elimination and biconditional elimination laws in the above table can be
verified by the following truth table.
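This kind of equivalence check can be automated by enumerating every interpretation; a minimal Python sketch (the helper name `implies` is our own):

```python
from itertools import product

def implies(p, q):
    """Truth value of p -> q."""
    return (not p) or q

# Conditional elimination: P -> Q is equivalent to ~P v Q
for p, q in product([True, False], repeat=2):
    assert implies(p, q) == ((not p) or q)

# Biconditional elimination: P <-> Q is equivalent to (P -> Q) & (Q -> P)
for p, q in product([True, False], repeat=2):
    assert (p == q) == (implies(p, q) and implies(q, p))

print("Both equivalences hold in all four interpretations")
```

Because a propositional sentence over n symbols has only 2^n interpretations, this exhaustive check is always possible, though it grows exponentially.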
Inference Rules
The inference rules of PL provide the means to perform logical proofs or deductions. The
problem is, given a set of sentences S = {s1, ..., sn} (the premises), prove the truth of s
(the conclusion); that is, show that S ⊢ s. The use of truth tables to do this is a form of
semantic proof. Other syntactic methods of inference or deduction are also possible. Such
methods do not depend on truth assignments but on syntactic relationships only; that is, it is
possible to derive new sentences which are logical consequences of s1… sn using only
syntactic operations. Few inference rules are given here.
SYNTAX AND SEMANTICS FOR PREDICATE LOGIC (FOPL-First Order
Predicate Logic)
Expressiveness is one of the requirements for any serious representation scheme. It should be
possible to accurately represent most, if not all, concepts which can be verbalized. PL falls
short of this requirement in some important respects. It is too "coarse" to easily describe
properties of objects, and it lacks the structure to express relations that exist among two or
more entities. Furthermore, PL does not permit us to make generalized statements about
classes of similar objects. These are serious limitations when reasoning about real world
entities. For example, given the following statements, it should be possible to conclude that
John must take the Pascal course.
As stated, it is not possible to conclude in PL that John must take Pascal since the second
statement does not occur as part of the first one. To draw the desired conclusion with a valid
inference rule, it would be necessary to rewrite the sentences.
FOPL was developed by logicians to extend the expressiveness of PL. It is a generalization of
PL that permits reasoning about world objects as relational entities as well as classes or
subclasses of objects. This generalization comes from the introduction of predicates in place
of propositions, the use of functions and the use of variables together with variable
quantifiers.
Syntax of FOPL
The symbols and rules of combination permitted in FOPL are defined as follows.
Semantics for FOPL
When considering specific wffs, we always have in mind some domain D. If not stated
explicitly, D will be understood from the context. D is the set of all elements or objects from
which fixed assignments are made to constants and from which the domain and range of
functions are defined. The arguments of predicates must be terms (constants, variables, or
functions). Therefore, the domain of each n-place predicate is also defined over D.
For example, our domain might be all entities that make up the Computer Science
Department at the University of Texas. In this case, constants would be professors (Bell,
Cooke, Gelfond, and so on), staff (Martha, Pat, Linda, and so on), books, labs, offices, and so
forth. The functions we may choose might be
PROPOSITIONAL LOGIC: RESOLUTION
The following steps should be carried out in sequence to employ resolution for theorem
proving in propositional logic:
Resolution Algorithm:
Given:
A set of clauses, called axioms and a goal.
Aim:
To test whether the goal is derivable from the axioms.
Begin:
1. Construct a set S of axioms plus the negated goal.
2. Represent each element of S into conjunctive normal form (CNF) by the following
steps:
(a) Replace each 'if-then' operator by NEGATION and OR operations, using theorem 10
(p → q ≡ ¬p ∨ q).
(b) Bring each modified clause into the following form and then drop the AND operators
connecting the square brackets. The clauses thus obtained are in conjunctive normal
form (CNF). It may be noted that pij may be in negated or non-negated form.
3. Repeat:
(a) Select any two clauses from S, such that one clause contains a negated literal and the other
clause contains its corresponding positive (non-negated) literal.
(b) Resolve these two clauses and call the resulting clause the resolvent. Remove the parent
clauses from S.
Until a null clause is obtained or no further progress can be made.
Example 1:
We are given the axioms given in the first column of table 6.3, and we want to prove R. First
we convert the axioms to clause form, as shown in the second column of the table. Then we
negate R, producing ¬ R which is already in clause form it is added into the given clauses
(data base).
Then we look for pairs of clauses to resolve together. Although many pairs of clauses can be
resolved, only those pairs which contain complementary literals will produce a resolvent
which is likely to lead to the goal, shown by the empty clause (drawn as a box). We begin by
resolving R with the clause ¬ R since that is one of the clauses which must be involved in the
contradiction we are trying to find. The sequence of contradiction resolvents of the example
in table 6.3., is shown in Fig. 6.4.
Example 2:
Consider the following knowledge base:
1. the-humidity-is-high or the-sky-is-cloudy.
2. If the-sky-is-cloudy then it-will-rain.
3. If the-humidity-is-high then it-is-hot.
4. It-is-not-hot.
and the goal: it-will-rain. Prove by the resolution theorem that the goal is derivable from the
knowledge base.
Proof:
Let us first denote the above clauses by the following symbols.
1. p ∨ q
2. ¬ q ∨ r (after applying theorem 10)
3. ¬ p ∨ s (after applying theorem 10)
4. ¬ s
5. ¬ r
and the negated goal = ¬ r. The set of statements S thus includes all these five clauses in
normal form. When all the clauses are connected through the connective ∧, they are said to
be the CNF of the set S.
Now by resolution algorithm, we construct the graph of Fig. 6.5. Since it terminates with a
null clause the goal is proved.
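The refutation just carried out by hand can be sketched in code. This minimal version represents each clause as a frozenset of string literals ("~" marks negation, a convention chosen for illustration) and repeats step 3 of the algorithm until the empty clause appears or no progress is made:

```python
def negate(lit):
    """~p <-> p"""
    return lit[1:] if lit.startswith("~") else "~" + lit

def resolve(c1, c2):
    """Return all resolvents of two clauses (frozensets of literals)."""
    resolvents = []
    for lit in c1:
        if negate(lit) in c2:
            resolvents.append(frozenset((c1 - {lit}) | (c2 - {negate(lit)})))
    return resolvents

def resolution_refutation(clauses):
    """True if the empty clause is derivable, i.e. the clause set is unsatisfiable."""
    clauses = set(clauses)
    while True:
        new = set()
        for a in clauses:
            for b in clauses:
                if a == b:
                    continue
                for r in resolve(a, b):
                    if not r:          # empty clause: contradiction found
                        return True
                    new.add(r)
        if new <= clauses:             # no new resolvents: goal not derivable
            return False
        clauses |= new

# Example 2 above: {p v q, ~q v r, ~p v s, ~s} plus the negated goal ~r
kb = [frozenset(c) for c in [{"p", "q"}, {"~q", "r"}, {"~p", "s"}, {"~s"}, {"~r"}]]
print(resolution_refutation(kb))   # → True, so the goal r is derivable
```

The sketch regenerates all resolvents each round rather than tracking parent clauses, so it proves derivability but does not reconstruct the refutation graph of Fig. 6.5.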
KNOWLEDGE REPRESENTATION USING PREDICATE LOGIC
Propositional logic is useful because it is simple to deal with and a decision procedure for it
exists.
In order to draw conclusions, facts are represented in a more convenient way, as:
1. Marcus is a man.
man(Marcus)
2. Plato is a man.
man(Plato)
3. All men are mortal.
mortal(men)
But propositional logic fails to capture the relationship between an individual being a man
and that individual being mortal.
How can these sentences be represented so that we can infer the third sentence from the
first two?
Propositional logic commits only to the existence of facts that may or may not be the case
in the world being represented.
It has a simple syntax and simple semantics, and it suffices to illustrate the process of
inference.
However, propositional logic quickly becomes impractical, even for very small worlds.
Predicate logic
First-order Predicate logic (FOPL) models the world in terms of
Objects, which are things with individual identities
Properties of objects that distinguish them from other objects
Relations that hold among sets of objects
Functions, which are a subset of relations where there is only one "value" for any given
"input"
First-order Predicate logic (FOPL) provides
Constants: a, b, dog33. Name a specific object.
Variables: X, Y. Refer to an object without naming it.
Functions: Mapping from objects to objects.
Terms: Refer to objects
Atomic Sentences: in(dad-of(X), food6) Can be true or false, Correspond to propositional
symbols P, Q.
A well-formed formula (wff) is a sentence containing no "free" variables. That is, all
variables are "bound" by universal or existential quantifiers.
(∀x)P(x, y) has x bound as a universally quantified variable, but y is free.
Quantifiers
Universal quantification
(∀x)P(x) means that P holds for all values of x in the domain associated with that variable
E.g., (∀x) dolphin(x) → mammal(x)
Existential quantification
(∃ x)P(x) means that P holds for some value of x in the domain associated with that
variable
E.g., (∃ x) mammal(x) ∧ lays-eggs(x)
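Over a finite domain, the two quantifiers correspond directly to Python's all() and any(); a toy check with a made-up domain of three animals:

```python
# Hypothetical finite domain; each individual has three properties
animals = {
    "flipper": {"dolphin": True,  "mammal": True,  "lays_eggs": False},
    "perry":   {"dolphin": False, "mammal": True,  "lays_eggs": True},   # a platypus
    "rex":     {"dolphin": False, "mammal": False, "lays_eggs": True},
}

# (forall x) dolphin(x) -> mammal(x): implication holds for every element
forall_holds = all((not a["dolphin"]) or a["mammal"] for a in animals.values())

# (exists x) mammal(x) & lays_eggs(x): conjunction holds for some element
exists_holds = any(a["mammal"] and a["lays_eggs"] for a in animals.values())

print(forall_holds, exists_holds)   # → True True
```

Note how the universal quantifier pairs naturally with implication and the existential with conjunction, exactly as in the two example wffs above.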
Consider the following example that shows the use of predicate logic as a way of
representing knowledge.
1. Marcus was a man.
2. Marcus was a Pompeian.
3. All Pompeians were Romans.
4. Caesar was a ruler.
5. All Romans were either loyal to Caesar or hated him.
6. Everyone is loyal to someone.
7. People only try to assassinate rulers they are not loyal to.
8. Marcus tried to assassinate Caesar.
The facts described by these sentences can be represented as a set of well-formed formulas
(wffs)
as follows:
1. Marcus was a man.
man(Marcus)
2. Marcus was a Pompeian.
Pompeian(Marcus)
3. All Pompeians were Romans.
∀x: Pompeian(x) → Roman(x)
4. Caesar was a ruler.
ruler(Caesar)
5. All Romans were either loyal to Caesar or hated him.
inclusive-or:
∀x: Roman(x) → loyalto(x, Caesar) ∨ hate(x, Caesar)
exclusive-or:
∀x: Roman(x) → (loyalto(x, Caesar) ∧ ¬hate(x, Caesar)) ∨ (¬loyalto(x, Caesar) ∧ hate(x, Caesar))
Now suppose if we want to use these statements to answer the question: Was Marcus loyal to
Caesar?
Now let's try to produce a formal proof, reasoning backward from the desired goal:
¬loyalto(Marcus, Caesar)
In order to prove the goal, we need to use the rules of inference to transform it into another
goal (or possibly a set of goals) that can, in turn, be transformed, and so on, until there are no
unsatisfied goals remaining.
1. Many English sentences are ambiguous (for example, 5, 6, and 7 above). Choosing the
correct interpretation may be difficult.
2. There is often a choice of how to represent the knowledge. Simple representations are
desirable, but they may exclude certain kinds of reasoning.
3. Even in very simple situations, a set of sentences is unlikely to contain all the information
necessary to reason about the topic at hand. In order to be able to use a set of statements
effectively, it is usually necessary to have access to another set of statements that represent
facts that people consider too obvious to mention.
* Specific attributes instance and isa play an important role particularly in a useful form of
reasoning called property inheritance.
* The predicates instance and isa explicitly captured the relationships they used to express,
namely class membership and class inclusion.
* Figure shows the first five sentences of the last section represented in logic in three
different ways.
* The first part of the figure contains the representations we have already discussed. In
these representations, class membership is represented with unary predicates (such as
Roman), each of which corresponds to a class.
* Asserting that P(x) is true is equivalent to asserting that x is an instance (or element) of P.
* The second part of the figure contains representations that use the instance predicate
explicitly.
The following figure shows three ways of representing class membership: isa relationships
The predicate instance is a binary one, whose first argument is an object and whose second
argument is a class to which the object belongs.
But these representations do not use an explicit isa predicate.
Instead, subclass relationships, such as that between Pompeians and Romans, are described
as shown in sentence 3.
The implication rule states that if an object is an instance of the subclass Pompeian then it
is an instance of the superclass Roman.
Note that this rule is equivalent to the standard set-theoretic definition of the subclass
superclass relationship.
The third part contains representations that use both the instance and isa predicates
explicitly.
The use of the isa predicate simplifies the representation of sentence 3, but it requires that
one additional axiom (shown here as number 6) be provided.
To express simple facts, such as the following greater-than and less-than relationships:
gt(1,0) lt(0,1) gt(2,1) lt(1,2) gt(3,2) lt(2,3)
It is often also useful to have computable functions as well as computable predicates.
Thus we might want to be able to evaluate the truth of gt(2 + 3,1)
To do so requires that we first compute the value of the plus function given the arguments
2 and 3, and then send the arguments 5 and 1 to gt.
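Evaluating gt(2 + 3, 1) in exactly this two-step way might look like the following sketch, where plus and gt (names from the text) are implemented as an ordinary computable function and predicate:

```python
def plus(a, b):
    """Computable function: maps a pair of numbers to their sum."""
    return a + b

def gt(a, b):
    """Computable predicate: true when a is greater than b."""
    return a > b

# First evaluate the inner function, then pass its value to the predicate
print(gt(plus(2, 3), 1))   # → True
```

The point is that instead of storing infinitely many facts like gt(5, 1), the system evaluates the predicate on demand.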
Consider the following set of facts, again involving Marcus:
1) Marcus was a man.
man(Marcus)
2) Marcus was a Pompeian.
Pompeian(Marcus)
3) Marcus was born in 40 A.D.
born(Marcus, 40)
4) All men are mortal.
∀x: man(x) → mortal(x)
5) All Pompeians died when the volcano erupted in 79 A.D.
erupted(volcano, 79) ∧ ∀x: [Pompeian(x) → died(x, 79)]
6) No mortal lives longer than 150 years.
∀x: ∀t1: ∀t2: mortal(x) ∧ born(x, t1) ∧ gt(t2 − t1, 150) → dead(x, t2)
7) It is now 1991.
now = 1991
The above example shows how these ideas of computable functions and predicates can be
useful.
It also makes use of the notion of equality and allows equal objects to be substituted for each
other whenever it appears helpful to do so during a proof.
Now suppose we want to answer the question "Is Marcus alive?"
From the statements given, there may be two ways of deducing an answer.
Either we can show that Marcus is dead because he was killed by the volcano or we can
show that he must be dead because he would otherwise be more than 150 years old, which we
know is not possible.
As soon as we attempt to follow either of those paths rigorously, however, we
discover, just as we did in the last example, that we need some additional knowledge. For
example, our statements talk about dying, but they say nothing that relates to being alive,
which is what the question is asking. So we add the following facts:
8) Alive means not dead.
∀x: ∀t: [alive(x, t) → ¬dead(x, t)] ∧ [¬dead(x, t) → alive(x, t)]
9) If someone dies, then he is dead at all later times.
∀x: ∀t1: ∀t2: died(x, t1) ∧ gt(t2, t1) → dead(x, t2)
Now let's attempt to answer the question "Is Marcus alive?" by proving: ¬ alive(Marcus,
now)
RESOLUTION
Resolution is used when several statements are given and we need to prove a conclusion
from those statements. Unification is a key concept in proofs by resolution.
Resolution is a single inference rule which can efficiently operate on the conjunctive normal
form or clausal form.
Clause: A disjunction of literals (atomic sentences) is called a clause. A clause containing a
single literal is known as a unit clause.
To better understand all the above steps, we will take an example in which we will
apply resolution.
Example:
Step-1: Conversion of facts into FOL. In the first step we will convert all the given
statements into first-order logic.
Step-2: Conversion of FOL into CNF. Eliminate all implications (→) and rewrite.
1. ∀x ¬food(x) ∨ likes(John, x)
2. food(Apple) ∧ food(vegetables)
3. ∀x ∀y ¬[eats(x, y) ∧ ¬killed(x)] ∨ food(y)
4. eats(Anil, Peanuts) ∧ alive(Anil)
5. ∀x ¬eats(Anil, x) ∨ eats(Harry, x)
6. ∀x ¬[¬killed(x)] ∨ alive(x)
7. ∀x ¬alive(x) ∨ ¬killed(x)
8. likes(John, Peanuts).
The inference engine is the component of the intelligent system in artificial intelligence,
which applies logical rules to the knowledge base to infer new information from known facts.
The first inference engine was part of the expert system. Inference engine commonly
proceeds in two modes, which are:
a. Forward chaining
b. Backward chaining
Horn clauses and definite clauses are forms of sentences which enable the knowledge base to
use a more restricted and efficient inference algorithm. Logical inference algorithms use
forward and backward chaining approaches, which require the KB in the form of first-order
definite clauses.
Definite clause: A clause which is a disjunction of literals with exactly one positive
literal is known as a definite clause or strict horn clause.
Horn clause: A clause which is a disjunction of literals with at most one positive literal is
known as horn clause. Hence all the definite clauses are horn clauses.
For example, the definite clause ¬p ∨ ¬q ∨ k has exactly one positive literal, k, and is
equivalent to p ∧ q → k.
Forward Chaining
Forward chaining is also known as forward deduction or the forward reasoning method when
using an inference engine. Forward chaining is a form of reasoning which starts with atomic
sentences in the knowledge base and applies inference rules (Modus Ponens) in the forward
direction to extract more data until a goal is reached.
The forward-chaining algorithm starts from known facts, triggers all rules whose premises
are satisfied, and adds their conclusions to the known facts. This process repeats until the
problem is solved.
Properties of Forward-Chaining:
Consider the following famous example which we will use in both approaches:
Example:
"As per the law, it is a crime for an American to sell weapons to hostile nations. Country A,
an enemy of America, has some missiles, and all the missiles were sold to it by Robert, who
is an American citizen."
Prove that "Robert is criminal."
To solve the above problem, first, we will convert all the above facts into first-order definite
clauses, and then we will use a forward-chaining algorithm to reach the goal.
Step-1:
In the first step we will start with the known facts and will choose the sentences which do not
have implications, such as: American(Robert), Enemy(A, America), Owns(A, T1), and
Missile(T1). All these facts will be represented as below.
Step-2:
At the second step, we will add those facts which can be inferred from the available facts
and whose premises are satisfied.
Rule-(1) does not satisfy premises, so it will not be added in the first iteration.
Rule-(4) is satisfied with the substitution {p/T1}, so Sells(Robert, T1, A) is added, inferred
from the conjunction of Rules (2) and (3).
Rule-(6) is satisfied with the substitution {p/A}, so Hostile(A) is added, inferred from
Rule-(7).
Step-3:
At step-3, as we can check, Rule-(1) is satisfied with the substitution {p/Robert, q/T1, r/A},
so we can add Criminal(Robert), which is inferred from all the available facts. Hence we
have reached our goal statement.
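The three steps above can be sketched for a propositional reduction of the crime example. Encoding rules as (premises, conclusion) pairs and flattening the first-order facts into plain strings are simplifications made for illustration; a real forward chainer would perform unification:

```python
def forward_chain(facts, rules):
    """Fire every rule whose premises are all known until no new facts appear."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if set(premises) <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

# Propositional reduction of the crime example (an illustrative simplification)
rules = [
    (["Missile(T1)"], "Weapon(T1)"),
    (["Missile(T1)", "Owns(A,T1)"], "Sells(Robert,T1,A)"),
    (["Enemy(A,America)"], "Hostile(A)"),
    (["American(Robert)", "Weapon(T1)", "Sells(Robert,T1,A)", "Hostile(A)"],
     "Criminal(Robert)"),
]
facts = {"American(Robert)", "Enemy(A,America)", "Owns(A,T1)", "Missile(T1)"}
print("Criminal(Robert)" in forward_chain(facts, rules))   # → True
```

Each pass over the rules mirrors one "step" of the hand trace: Weapon(T1), Sells(Robert,T1,A), and Hostile(A) are derived first, then Criminal(Robert).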
Backward Chaining:
Example:
In backward chaining, we will use the same example as above and rewrite all the rules.
Backward-Chaining proof:
In Backward chaining, we will start with our goal predicate, which is Criminal(Robert), and
then infer further rules.
Step-1:
At the first step, we take the goal fact and, from it, infer other facts which we will finally
prove true. Our goal fact is "Robert is criminal," so the corresponding predicate is as follows.
Step-2:
At the second step, we infer other facts from the goal fact which satisfy the rules. As we can
see in Rule-1, the goal predicate Criminal(Robert) is present with the substitution
{Robert/p}. So we add all the conjunctive facts below the first level and replace p with
Robert.
Step-4:
At step-4, we can infer the facts Missile(T1) and Owns(A, T1) from Sells(Robert, T1, r),
which satisfies Rule-4 with the substitution of A in place of r. So these two statements are
proved here.
Step-5:
At step-5, we can infer the fact Enemy(A, America) from Hostile(A) which satisfies Rule-
6. And hence all the statements are proved true using backward chaining.
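The goal-first strategy can also be sketched in code. This minimal backward chainer uses a propositional reduction of the crime example (an illustrative simplification, without unification) and recursively proves the premises of any rule that concludes the current goal:

```python
def backward_chain(goal, facts, rules):
    """Prove goal by recursively proving the premises of a rule that concludes it."""
    if goal in facts:
        return True
    for premises, conclusion in rules:
        if conclusion == goal and all(backward_chain(p, facts, rules) for p in premises):
            return True
    return False

rules = [
    (["Missile(T1)"], "Weapon(T1)"),
    (["Missile(T1)", "Owns(A,T1)"], "Sells(Robert,T1,A)"),
    (["Enemy(A,America)"], "Hostile(A)"),
    (["American(Robert)", "Weapon(T1)", "Sells(Robert,T1,A)", "Hostile(A)"],
     "Criminal(Robert)"),
]
facts = {"American(Robert)", "Enemy(A,America)", "Owns(A,T1)", "Missile(T1)"}
print(backward_chain("Criminal(Robert)", facts, rules))   # → True
```

Unlike forward chaining, only facts that bear on the query are ever examined; the recursion follows exactly the goal tree built in steps 1 through 5.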
SEMANTIC TABLEAU
Since the 1980s another technique for determining the validity of arguments in either PC
(the propositional calculus) or LPC (the lower predicate calculus) has gained some
popularity, owing both to its ease of learning and to its straightforward implementation by
computer programs.
Originally suggested by the Dutch logician Evert W. Beth, it was more fully developed and
publicized by the American mathematician and logician Raymond M. Smullyan. Resting on
the observation that it is impossible for the premises of a valid argument to be true while
the conclusion is false, this method attempts to interpret (or evaluate) the premises in such a
way that they are all simultaneously satisfied and the negation of the conclusion is also
satisfied. Success in such an effort would show the argument to be invalid, while failure to
find such an interpretation would show it to be valid.
Only if all the sentences in at least one branch are true is it possible for the original premises
to be true and the conclusion false (equivalently for the negation of the conclusion). By
tracing the line upward in each branch to the top of the tree, one observes that no valuation
of a in the left branch will result in all the sentences in that branch receiving the value true
(because of the presence of a and ∼a). Similarly, in the right branch the presence of b and
∼b makes it impossible for a valuation to result in all the sentences of the branch receiving
the value true. These are all the possible branches; thus, it is impossible to find a situation in
which the premises are true and the conclusion false. The original argument is therefore valid
Furthermore, in LPC, rules for instantiating quantified wffs need to be introduced. Clearly,
any branch containing both (∀x)ϕx and ∼ϕy is one in which not all the sentences in that
branch can be simultaneously satisfied. Again, if all the branches fail to be simultaneously satisfiable,
the original argument is valid.
UNIFICATION ALGORITHM
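A minimal sketch of the standard unification algorithm, with compound terms represented as nested tuples, variables as capitalized strings, and the occurs check omitted for brevity (all of these representation choices are illustrative assumptions):

```python
def is_var(t):
    """Illustrative convention: capitalized strings are variables."""
    return isinstance(t, str) and t[:1].isupper()

def unify(x, y, subst=None):
    """Return a substitution (dict) unifying x and y, or None on failure."""
    if subst is None:
        subst = {}
    if x == y:
        return subst
    if is_var(x):
        return unify_var(x, y, subst)
    if is_var(y):
        return unify_var(y, x, subst)
    if isinstance(x, tuple) and isinstance(y, tuple) and len(x) == len(y):
        for xi, yi in zip(x, y):        # unify argument lists element by element
            subst = unify(xi, yi, subst)
            if subst is None:
                return None
        return subst
    return None                          # clash: different constants or arities

def unify_var(var, term, subst):
    if var in subst:                     # variable already bound: unify its binding
        return unify(subst[var], term, subst)
    return {**subst, var: term}

# knows(john, X) unified with knows(john, jane)
print(unify(("knows", "john", "X"), ("knows", "john", "jane")))   # → {'X': 'jane'}
```

A production unifier would also perform the occurs check (rejecting X against f(X)) to guarantee termination on all inputs.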
CONVERSION TO DIFFERENT FORMS
DEDUCTION
PROPOSITIONAL THEOREM PROVING
INFERENCING
It applies logical rules to the knowledge base to infer new information from known facts.
Inference engine proceeds in two modes:
1. Forward Chaining 2. Backward Chaining
The location of inference engine is shown in the figure here.
In artificial intelligence, we need intelligent computers which can create new logic from old
logic or from evidence; generating conclusions from evidence and facts is termed inference.
Inference rules:
Inference rules are the templates for generating valid arguments. Inference rules are applied
to derive proofs in artificial intelligence, and the proof is a sequence of the conclusion that
leads to the desired goal.
In inference rules, the implication among all the connectives plays an important role.
Following are some terminologies related to inference rules:
From the above terms, some of the compound statements are equivalent to each other, which
we can prove using a truth table:
Hence from the above truth table, we can prove that P → Q is equivalent to ¬ Q → ¬ P, and
Q→ P is equivalent to ¬ P → ¬ Q.
Types of Inference rules:
1. Modus Ponens:
The Modus Ponens rule is one of the most important rules of inference. It states that if P
and P → Q are true, then we can infer that Q will be true. It can be represented as:
Example:
2. Modus Tollens:
The Modus Tollens rule states that if P → Q is true and ¬Q is true, then ¬P will also be true.
It can be represented as:
3. Hypothetical Syllogism:
The Hypothetical Syllogism rule states that if P→Q is true and Q→R is true, then P→R is
also true. It can be represented with the following notation:
Example:
Statement-1: If you have my home key then you can unlock my home. (P → Q)
Statement-2: If you can unlock my home then you can take my money. (Q → R)
Conclusion: If you have my home key then you can take my money. (P → R)
4. Disjunctive Syllogism:
The Disjunctive Syllogism rule states that if P∨Q is true and ¬P is true, then Q will be true.
It can be represented as:
Example:
Proof by truth-table:
5. Addition:
The Addition rule is one of the common inference rules. It states that if P is true, then P∨Q
will be true.
Example:
Proof by Truth-Table:
6. Simplification:
The simplification rule states that if P ∧ Q is true, then P and Q will each also be true. It can
be represented as:
Proof by Truth-Table:
7. Resolution:
The Resolution rule states that if P∨Q and ¬P∨R are true, then Q∨R will also be true. It can
be represented as:
Proof by Truth-Table:
MONOTONIC AND NON-MONOTONIC REASONING
Monotonic Reasoning:
In monotonic reasoning, once a conclusion is drawn, it remains the same even if we add
further information to the existing knowledge base. In monotonic reasoning, adding
knowledge does not decrease the set of propositions that can be derived.
To solve monotonic problems, we can derive the valid conclusion from the available
facts only, and it will not be affected by new facts.
Monotonic reasoning is not useful for the real-time systems, as in real time, facts get
changed, so we cannot use monotonic reasoning.
Non-monotonic Reasoning:
In non-monotonic reasoning, some conclusions may be invalidated when more information is
added to the knowledge base.
Example:
"Human perceptions for various things in daily life" is a general example of non-monotonic
reasoning.
Example: Suppose the knowledge base contains the following knowledge: Birds can fly;
Penguins cannot fly; Pitty is a bird.
So from the above sentences, we can conclude that Pitty can fly.
However, if we add one another sentence into knowledge base "Pitty is a penguin",
which concludes "Pitty cannot fly", so it invalidates the above conclusion.
According to Bayes' theorem, P(H|E) = P(E|H) × P(H) / P(E)
Where:
- P(H|E) is the posterior probability of the hypothesis (H) given the evidence (E)
- P(E|H) is the likelihood of the evidence given the hypothesis
- P(H) is the prior probability of the hypothesis
- P(E) is the probability of the evidence
2. New Evidence: Collect new data or evidence (E) related to the hypothesis.
Example:
Suppose we want to determine the probability that a person has a disease (H)
based on a positive test result (E).
- Prior Probability (P(H)): 0.01 (1% of the population has the disease)
- Likelihood (P(E|H)): 0.9 (90% of people with the disease test positive)
- Probability of a positive test (P(E)): 0.02
Then P(H|E) = (0.9 × 0.01) / 0.02 = 0.45
1. Medical diagnosis
2. Spam filtering
3. Image recognition
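The disease-test example can be computed directly; the evidence probability P(E) = 0.02 used here is an assumption chosen so the posterior matches the 0.45 stated above:

```python
def bayes_posterior(prior, likelihood, evidence):
    """P(H|E) = P(E|H) * P(H) / P(E)"""
    return likelihood * prior / evidence

# Disease example: prior 1%, test sensitivity 90%, assumed P(positive test) = 0.02
posterior = bayes_posterior(prior=0.01, likelihood=0.9, evidence=0.02)
print(round(posterior, 2))   # → 0.45
```

Note how a highly sensitive test still yields a posterior below 50% because the disease is rare: this is the base-rate effect that makes Bayes' theorem important in medical diagnosis.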
Concept Learning:
3. Define the Likelihood Function: Write the likelihood function based on the
model and data.
2. Efficiency: MLEs are often the most efficient estimators, meaning they have
the smallest variance.
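As a concrete instance of maximum likelihood estimation, the MLE of a coin's head probability is just the sample frequency; a minimal sketch with invented data:

```python
from math import log

def bernoulli_log_likelihood(p, data):
    """Log-likelihood of i.i.d. Bernoulli observations (1 = head, 0 = tail)."""
    return sum(log(p) if x == 1 else log(1 - p) for x in data)

data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # invented sample: 7 heads out of 10

# Closed-form MLE for the Bernoulli parameter: the sample mean
p_hat = sum(data) / len(data)
print(p_hat)   # → 0.7

# Sanity check: the MLE achieves a higher likelihood than nearby values of p
assert bernoulli_log_likelihood(p_hat, data) > bernoulli_log_likelihood(0.5, data)
assert bernoulli_log_likelihood(p_hat, data) > bernoulli_log_likelihood(0.9, data)
```

Maximizing the log-likelihood rather than the likelihood itself is standard practice: the maximizer is the same, but sums of logs are numerically far better behaved than long products of probabilities.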
2. Model Complexity: The complexity of a model, which affects its ability to fit
the data.
3. Data Compression: The idea that a good model should be able to compress
the data, reducing the description length.
MDL Principle:
The MDL principle states that the best model is the one that minimizes the total
description length, which includes:
1. Model Description Length: The length of the description of the model itself.
2. Data Description Length: The length of the description of the data given the
model.
Mathematical Formulation:
Let M be a model and D be the data. The MDL principle can be formulated as:
minimize L(M, D) = L(M) + L(D|M)
where L(M) is the model description length, L(D|M) is the data description
length given the model, and L(M, D) is the total description length.
Applications of MDL:
1. Model Selection: MDL can be used to select the best model from a set of
candidate models.
2. Hypothesis Testing: MDL can be used to test hypotheses and select the most
plausible explanation.
3. Data Compression: MDL can be used to compress data by finding the most
concise description.
Advantages of MDL:
Advantages of MDL:
Gibbs Algorithm:
The Gibbs Algorithm is a Markov chain Monte Carlo (MCMC) method used for
estimating the distribution of a random variable or a set of random variables. It's
a powerful tool for Bayesian inference and is widely used in machine learning,
statistics, and data science.
2. Monte Carlo: The algorithm uses Monte Carlo methods to approximate the
distribution of the random variables.
2. Efficiency: The algorithm can be more efficient than other MCMC methods,
especially for high-dimensional distributions.
2. Machine Learning: The algorithm is used in machine learning for tasks such
as clustering, dimensionality reduction, and regression.
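A minimal Gibbs-sampling sketch for a standard bivariate normal with correlation ρ = 0.8 (the target distribution and its parameters are chosen purely for illustration). Each step samples one variable from its exact conditional given the other, which is the defining move of the Gibbs algorithm:

```python
import random

def gibbs_bivariate_normal(n_samples, rho, seed=0):
    """Gibbs sampling from a standard bivariate normal with correlation rho.

    Each full conditional is itself normal: x | y ~ N(rho * y, 1 - rho**2),
    and symmetrically for y | x, so each coordinate can be sampled exactly.
    """
    rng = random.Random(seed)
    cond_sd = (1 - rho ** 2) ** 0.5
    x, y = 0.0, 0.0
    samples = []
    for _ in range(n_samples):
        x = rng.gauss(rho * y, cond_sd)   # sample x from p(x | y)
        y = rng.gauss(rho * x, cond_sd)   # sample y from p(y | x)
        samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(20000, rho=0.8)
mean_x = sum(s[0] for s in samples) / len(samples)
corr_num = sum(s[0] * s[1] for s in samples) / len(samples)
print(round(mean_x, 1), round(corr_num, 1))   # mean near 0, E[xy] near 0.8
```

Because successive Gibbs samples are correlated, estimates converge more slowly than with independent draws; in practice one discards an initial burn-in and may thin the chain.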
2. Naïve Assumption: The algorithm assumes that the features of the instances
are independent of each other, given the class.
1. Text Classification: The algorithm is widely used for text classification tasks,
such as spam detection and sentiment analysis.
3. Not Suitable for Complex Relationships: The algorithm is not suitable for
modeling complex relationships between features.
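A minimal Naïve Bayes text classifier showing the conditional-independence assumption and Laplace smoothing in action; the tiny spam/ham corpus is invented for illustration:

```python
from collections import Counter, defaultdict
from math import log

def train_nb(docs):
    """docs: list of (label, word_list). Returns class counts, word counts, vocab."""
    class_counts = Counter(label for label, _ in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, words in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict_nb(words, class_counts, word_counts, vocab):
    total_docs = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for label, c in class_counts.items():
        score = log(c / total_docs)                  # log prior
        total_words = sum(word_counts[label].values())
        for w in words:                              # naive independence assumption
            count = word_counts[label][w] + 1        # Laplace (add-one) smoothing
            score += log(count / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

docs = [("spam", ["win", "money", "now"]), ("spam", ["free", "money"]),
        ("ham", ["meeting", "at", "noon"]), ("ham", ["lunch", "at", "noon"])]
model = train_nb(docs)
print(predict_nb(["free", "money", "now"], *model))   # → spam
```

Multiplying per-word probabilities (summing their logs) is exactly where the "naïve" assumption enters: word occurrences are treated as independent given the class.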
IBL is a type of machine learning that involves storing and retrieving instances
of data. The goal of IBL is to make predictions or take actions based on the
similarity between new, unseen instances and the stored instances.
KNN is a specific type of IBL that involves finding the k most similar instances
to a new instance. The KNN algorithm works as follows:
Advantages of KNN
2. Flexible: KNN can be used for classification, regression, and other tasks.
3. Robust: KNN is robust to noisy data and can handle high-dimensional data.
Disadvantages of KNN
3. Not Suitable for Complex Relationships: KNN is not suitable for modeling
complex relationships between features.
Applications of KNN
1. Image Classification: KNN can be used for image classification tasks, such as
object recognition and image tagging.
2. Text Classification: KNN can be used for text classification tasks, such as
spam detection and sentiment analysis.
Variations of KNN
2. K-D Trees: Uses a k-d tree data structure to efficiently search for the k
nearest neighbors.
3. Ball Tree: Uses a ball tree data structure to efficiently search for the k nearest
neighbors.
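The KNN procedure described above can be sketched in a few lines of standard-library Python; the training points and labels are invented for illustration:

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """train: list of (point, label). Classify query by majority vote of k nearest."""
    neighbors = sorted(train, key=lambda pl: dist(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Two well-separated clusters of 2-D points (made up for the example)
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_predict(train, (2, 2)))       # → A
print(knn_predict(train, (8.5, 8.5)))   # → B
```

This brute-force version is O(n) per query; the k-d tree and ball tree variants listed above exist precisely to avoid scanning every stored instance.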
Definition:
Machine Learning (ML) is a subset of Artificial Intelligence (AI) that involves
the development of algorithms and statistical models that enable machines to
learn from data, make decisions, and improve their performance on a task
without being explicitly programmed.
Evolution:
Machine Learning has its roots in the 1950s, but it has evolved significantly
over the years. Here's a brief timeline:
1. 1950s: Alan Turing proposed the Turing Test, which measures a machine's
ability to exhibit intelligent behavior equivalent to, or indistinguishable from,
that of a human.
2. 1960s: The first machine learning algorithms, such as decision trees and
clustering, were developed.
5. 2000s: The rise of big data and the development of deep learning algorithms,
such as convolutional neural networks (CNNs) and recurrent neural networks
(RNNs), transformed the field of machine learning.
Need:
Real-World Examples:
1. Virtual Assistants: Siri, Alexa, and Google Assistant use machine learning to
understand voice commands and respond accordingly.
2. Image Recognition: Facebook's facial recognition feature uses machine
learning to identify and tag people in photos.
Here's an overview of classification and the differences between supervised and unsupervised learning paradigms:
Classification
Classification is the task of assigning an input to one of a set of discrete categories (labels).
Supervised Learning
- The algorithm learns to predict the output label based on the input data.
- The goal is to minimize the error between predicted and actual labels.
- Examples:
- Image classification
- Sentiment analysis
- Speech recognition
Unsupervised Learning
- Key Characteristics:
- The algorithm finds patterns or structure in unlabeled data, with no predefined output labels.
- Examples:
- Clustering
- Dimensionality reduction
- Anomaly detection
Key Differences
Here are the key differences between supervised and unsupervised learning:
- Labeled vs. Unlabeled Data: Supervised learning uses labeled data, while
unsupervised learning uses unlabeled data.
The perceptron was intended to be a machine, rather than a program, and while its
first implementation was in software for the IBM 704, it was subsequently
implemented in custom-built hardware as the “Mark 1 perceptron”. This machine
was designed for image recognition: it had an array of 400 photocells, randomly
connected to the “neurons”. Weights were encoded in potentiometers, and weight
updates during learning were performed by electric motors.
Although the perceptron initially seemed promising, it was quickly proved that
perceptrons could not be trained to recognize many classes of patterns. This caused
the field of neural network research to stagnate for many years before it was
recognized that a feedforward neural network with two or more layers (also called a
multilayer perceptron) had greater processing power than a perceptron with one
layer (also called a single-layer perceptron).
Single-layer perceptrons are only capable of learning linearly separable patterns. For
a classification task with some step activation function, a single node will have a
single line dividing the data points forming the patterns. More nodes can create
more dividing lines, but those lines must somehow be combined to form more
complex classifications. A second layer of perceptrons, or even linear nodes, is
sufficient to solve a lot of otherwise non-separable problems.
In 1969, a famous book entitled Perceptrons by Marvin Minsky and Seymour Papert
showed that it was impossible for these classes of network to learn an XOR function.
It is often believed (incorrectly) that they also conjectured that a similar result would
hold for a multi-layer perceptron network. However, this is not true, as both Minsky
and Papert already knew that multi-layer perceptrons could produce an XOR
function. Nevertheless, the often-miscited Minsky/Papert text caused a significant
decline in interest and funding of neural network research. It took ten more years
until neural network research experienced a resurgence in the 1980s. The text was
reprinted in 1987 as “Perceptrons: Expanded Edition”, where some errors in the
original text are shown and corrected.
A set of data points is said to be linearly separable if the data can be divided into
two classes using a straight line. If the data cannot be divided into two classes by a
straight line, the data points are said to be non-linearly separable.
Although the perceptron rule finds a successful weight vector when the training
examples are linearly separable, it can fail to converge if the examples are not
linearly separable.
A second training rule, called the delta rule, is designed to overcome this difficulty.
If the training examples are not linearly separable, the delta rule converges toward a
best-fit approximation to the target concept.
The key idea behind the delta rule is to use gradient descent to search the
hypothesis space of possible weight vectors to find the weights that best fit the
training examples.
This rule is important because gradient descent provides the basis for the
BACKPROPAGATION algorithm, which can learn networks with many interconnected
units.
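The delta rule described above can be sketched as batch gradient descent on the squared error of a single linear unit. The toy data (targets of y = 1 + 2x) and the learning rate are invented for illustration:

```python
import numpy as np

# Delta rule (LMS) sketch: gradient descent on squared error for one linear unit.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # first column is a bias input
t = np.array([1.0, 3.0, 5.0])                        # targets from y = 1 + 2x

w = np.zeros(2)      # weight vector, initialized to zero
eta = 0.1            # learning rate
for _ in range(2000):
    o = X @ w                    # linear unit output for every example
    w += eta * X.T @ (t - o)     # delta rule: w <- w + eta * sum((t - o) * x)

print(np.round(w, 2))  # approaches [1.0, 2.0], the true line
```

Because the data here happens to be exactly linear, the rule converges to the true weights; on non-separable or noisy data it converges toward the best-fit approximation, which is the property the text emphasizes.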
Multilayer networks
In network theory, multidimensional networks, a special type of multilayer network,
are networks with multiple kinds of relations. Increasingly sophisticated attempts to
model real-world systems as multidimensional networks have yielded valuable
insight in the fields of social network analysis, economics, urban and international
transport, ecology, psychology, medicine, biology, Commerce, climatology, physics,
computational neuroscience, operations management, infrastructures, and finance.
The rapid exploration of complex networks in recent years has been dogged by a
lack of standardized naming conventions, as various groups use overlapping and
contradictory terminology to describe specific network configurations (e.g.,
multiplex, multilayer, multilevel, multidimensional, multirelational, interconnected).
Formally, multidimensional networks are edge-labeled multigraphs. The term “fully
multidimensional” has also been used to refer to a multipartite edge-labeled
multigraph. Multidimensional networks have also recently been reframed as specific
instances of multilayer networks. In this case, there are as many layers as there are
dimensions, and the links between nodes within each layer are simply all the links
for a given dimension.
In another example, Stella used an “eco multiplex model” to study the spread of
Trypanosoma cruzi (the cause of Chagas disease in humans) across different mammal
species. This pathogen can be transmitted either through invertebrate vectors
(Triatominae or kissing bugs) or through predation when a susceptible predator
feeds on infected prey or vectors. Thus, their model included two
ecological/transmission layers: the food-web and vector layers. Their results showed
that studying the multiplex network structure offered insights on which host species
facilitate parasite spread, and thus which would be more effective to immunize to
control the spread. At the same time, they showed how, in this system, when
parasite spread occurs primarily through the trophic layer, immunizing predators
hampers parasite transmission more than immunizing prey.
For coupled processes involving a disease alongside a social process (i.e., spread of
information or disease awareness), we might expect that the spread of the pathogen
will be associated with the spread of disease awareness or preventative behaviors
such as mask-wearing, and in these cases theoretical models suggest that
considering the spread of disease awareness can result in reduced disease spread. A
model was presented by Granell, which represented two competing processes on the
same network: infection spread (modeled using a Susceptible-Infected-Susceptible
compartmental model) coupled with information spread through a social network (an
Unaware-Aware-Unaware compartmental model). The authors used their model to
show that the timing of self-awareness of infection had little effect on the epidemic
dynamics. However, the degree of immunization (a parameter which regulates the
probability of becoming infected when aware) and mass media information spread
on the social layer did critically impact disease spread. A similar framework has been
used to study the effect of the diffusion of vaccine opinion (pro or anti) across a
social network with concurrent infectious disease spread. The study showed a clear
regime shift from a vaccinated population and controlled outbreak to vaccine refusal
and epidemic spread depending on the strength of opinion on the perceived risks of
the vaccine. The shift in outcomes from a controlled to uncontrolled outbreak was
accompanied by an increase in the spatial correlation of cases. While models in the
veterinary literature have accounted for altered behavior of nodes (imposition of
control measures) because of detection or awareness of disease, it is not common
for awareness to be considered as a dynamic process that is influenced by how each
node has interacted with the pathogen (i.e., contact with an infected neighbor). For
example, the rate of adoption of biosecurity practices at a farm, such as enhanced
surveillance, use of vaccination, or installation of air filtration systems, may be
dependent on the presence of disease in neighboring farms or the farmers’
awareness of a pathogen through a professional network of colleagues.
There is also some evidence that nodes that are more connected in their “social
support” networks (e.g., connections with family and close friends in humans) can
alter network processes that result in negative outcomes, such as pathogen
exposure or engagement in high-risk behavior. In a case based on users of
injectable drugs, social connections with non-injectors can reduce drug-users
connectivity in a network based on risky behavior with other drug injectors. In a
model presented by Chen, a social-support layer of a multiplex network drove the
allocation of resources for infection recovery, meaning that infected individuals
recovered faster if they possessed more neighbors in the social support layer. In
animal (both wild and domesticated) populations, this concept could be adapted to
represent an individual’s likelihood of recovery from, or tolerance to, infection being
influenced by the buffering effect of affiliative social relationships. For domestic
animals, investment in certain resources at a farm level could influence a premise’s
ability to recover (e.g., treatment) or onwards transmission of a pathogen (e.g.,
treatment or biosecurity practices). Sharing of these resources between farms could
be modeled through a “social-support” layer in a multiplex, for example, where a
farm’s transmissibility is impacted by access to shared truck-washing facilities.
Multi-Host Infections
Multilayer networks can be used to study the features of mixed species contact
networks or model the spread of a pathogen in a host community, providing
important insights into multi-host pathogens. Scenarios like this are commonplace at
the livestock-wildlife interface and therefore the insights provided could be of real
interest to veterinary epidemiology. In the case of multi-host pathogens, intralayer
and interlayer edges represent the contacts between individuals of the same species
and between individuals of different species, respectively. They can therefore be
used to identify bottlenecks of transmission and provide a clearer idea of how
spillover occurs. For example, Silk used an interconnected network with three layers
to study potential routes of transmission in a multi-host system. One layer consisted
of a wild European badger (Meles meles) contact network, the second a
domesticated cattle contact network, and the third a layer containing badger latrine
sites (potentially important sites of indirect environmental transmission). No
intralayer edges were possible in the latrine layer. The authors demonstrated the
importance of these environmental sites in shortening paths through the multilayer
network (for both between- and within-species transmission routes) and showed
that some latrine sites were more important than others in connecting the different
layers. Pilosof presented a theoretical model, labeling the species as focal (i.e., of
interest) and non-focal, showing that the outbreak probability and outbreak size
depend on which species originates the outbreak and on asymmetries in between-
species transmission probabilities.
Backpropagation Algorithm
In machine learning, backpropagation (backprop, BP) is a widely used algorithm for
training feedforward neural networks. Generalizations of backpropagation exist for
other artificial neural networks (ANNs), and for functions generally. These classes of
algorithms are all referred to generically as “backpropagation”. In fitting a neural
network, backpropagation computes the gradient of the loss function with respect to
the weights of the network for a single input–output example, and does so
efficiently, unlike a naive direct computation of the gradient with respect to each
weight individually. This efficiency makes it feasible to use gradient methods for
training multilayer networks, updating weights to minimize loss; gradient descent, or
variants such as stochastic gradient descent, are commonly used. The
backpropagation algorithm works by computing the gradient of the loss function
with respect to each weight by the chain rule, computing the gradient one layer at a
time, iterating backward from the last layer to avoid redundant calculations of
intermediate terms in the chain rule; this is an example of dynamic programming.
The term backpropagation strictly refers only to the algorithm for computing the
gradient, not how the gradient is used; however, the term is often used loosely to
refer to the entire learning algorithm, including how the gradient is used, such as by
stochastic gradient descent. Backpropagation generalizes the gradient computation
in the delta rule, which is the single-layer version of backpropagation, and is in turn
generalized by automatic differentiation, where backpropagation is a special case of
reverse accumulation (or “reverse mode”). The term backpropagation and its
general use in neural networks was announced in Rumelhart, Hinton & Williams
(1986a), then elaborated and popularized in Rumelhart, Hinton & Williams (1986b),
but the technique was independently rediscovered many times and had many
predecessors dating to the 1960s. A modern overview is given in the
deep learning textbook by Goodfellow, Bengio & Courville.
The algorithm is used to effectively train a neural network by applying the
chain rule. In simple terms, after each forward pass through a network,
backpropagation performs a backward pass while adjusting the model’s parameters
(weights and biases).
Input layer
The neurons, colored in purple, represent the input data. These can be as simple as
scalars or more complex like vectors or multidimensional matrices.
Hidden layers
The final values at the hidden neurons, colored in green, are computed using z^l —
weighted inputs in layer l, and a^l— activations in layer l.
Output layer
The final part of a neural network is the output layer, which produces the predicted
value. In our simple example, it is presented as a single neuron, colored in blue.
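The forward pass / backward pass cycle described above can be sketched on the classic XOR problem. Everything below (network size, learning rate, iteration count, random seed) is an illustrative assumption, not a prescribed setup; the weighted inputs and activations correspond to the z^l and a^l notation used earlier:

```python
import numpy as np

# Two-layer sigmoid network trained by backpropagation on XOR,
# with squared-error loss and plain gradient descent.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros(4)   # hidden layer (4 units)
W2 = rng.normal(0, 1, (4, 1)); b2 = np.zeros(1)   # output layer
sig = lambda z: 1 / (1 + np.exp(-z))

for _ in range(5000):
    # Forward pass: activations a^l from weighted inputs z^l.
    a1 = sig(X @ W1 + b1)
    a2 = sig(a1 @ W2 + b2)
    # Backward pass: chain rule applied one layer at a time,
    # starting from the output layer.
    d2 = (a2 - t) * a2 * (1 - a2)
    d1 = (d2 @ W2.T) * a1 * (1 - a1)
    W2 -= 0.5 * a1.T @ d2; b2 -= 0.5 * d2.sum(0)
    W1 -= 0.5 * X.T @ d1;  b1 -= 0.5 * d1.sum(0)

print(np.round(a2.ravel()))  # typically approaches [0, 1, 1, 0]
```

Note how the gradient for the first layer reuses `d2` from the output layer rather than recomputing it; avoiding that redundant work is exactly the dynamic-programming character of backpropagation mentioned above.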
The adjective “deep” in deep learning refers to the use of multiple layers in the
network. Early work showed that a linear perceptron cannot be a universal classifier,
but that a network with a nonpolynomial activation function with one hidden layer of
unbounded width can. Deep learning is a modern variation which is concerned with
an unbounded number of layers of bounded size, which permits practical application
and optimized implementation, while retaining theoretical universality under mild
conditions. In deep learning the layers are also permitted to be heterogeneous and
to deviate widely from biologically informed connectionist models, for the sake of
efficiency, trainability, and understandability.
Most modern deep learning models are based on artificial neural networks,
specifically convolutional neural networks (CNN)s, although they can also include
propositional formulas or latent variables organized layer-wise in deep generative
models such as the nodes in deep belief networks and deep Boltzmann machines.
In deep learning, each level learns to transform its input data into a slightly more
abstract and composite representation. In an image recognition application, the raw
input may be a matrix of pixels; the first representational layer may abstract the
pixels and encode edges; the second layer may compose and encode arrangements
of edges; the third layer may encode a nose and eyes; and the fourth layer may
recognize that the image contains a face. Importantly, a deep learning process can
learn which features to optimally place in which level on its own. This does not
completely eliminate the need for hand-tuning; for example, varying numbers of
layers and layer sizes can provide different degrees of abstraction.
The word “Deep” in “Deep learning” refers to the number of layers through which
the data is transformed. More precisely, deep learning systems have a substantial
credit assignment path (CAP) depth. The CAP is the chain of transformations from
input to output. CAPs describe potentially causal connections between input and
output. For a feedforward neural network, the depth of the CAPs is that of the
network and is the number of hidden layers plus one (as the output layer is also
parameterized). For recurrent neural networks, in which a signal may propagate
through a layer more than once, the CAP depth is potentially unlimited. No
universally agreed-upon threshold of depth divides shallow learning from deep
learning, but most researchers agree that deep learning involves CAP depth higher
than 2. CAP of depth 2 has been shown to be a universal approximator in the sense
that it can emulate any function. Beyond that, more layers do not add to the
function approximator ability of the network. Deep models (CAP > 2) can extract
better features than shallow models and hence, extra layers help in learning the
features effectively.
The universal approximation theorem for deep neural networks concerns the
capacity of networks with bounded width but the depth is allowed to grow. Lu
proved that if the width of a deep neural network with ReLU activation is strictly
larger than the input dimension, then the network can approximate any Lebesgue
integrable function; if the width is smaller than or equal to the input dimension, then
a deep neural network is not a universal approximator.
The probabilistic interpretation derives from the field of machine learning. It features
inference, as well as the optimization concepts of training and testing, related to
fitting and generalization, respectively. More specifically, the probabilistic
interpretation considers the activation nonlinearity as a cumulative distribution
function. The probabilistic interpretation led to the introduction of dropout as a
regularizer in neural networks. The probabilistic interpretation was introduced by
researchers including Hopfield, Widrow and Narendra and popularized in surveys
such as the one by Bishop.
Architectures:
Recurrent Neural Network (performs the same task for every element of a sequence):
Allows for parallel and sequential computation. Like the human brain (a large
feedback network of connected neurons), recurrent networks can remember important
things about the input they received, which enables them to be more precise.
Limitations:
Advantages:
Disadvantages:
Applications:
Automatic Text Generation: A corpus of text is learned, and from this model new
text is generated word-by-word or character-by-character. Such a model can learn
how to spell, punctuate, and form sentences, and it may even capture the style of
the corpus.
Architecture
In a CNN, the input is a tensor with a shape: (number of inputs) x (input height) x
(input width) x (input channels). After passing through a convolutional layer, the
image becomes abstracted to a feature map, also called an activation map, with
shape: (number of inputs) x (feature map height) x (feature map width) x (feature
map channels).
Convolutional layers convolve the input and pass its result to the next layer. This is
similar to the response of a neuron in the visual cortex to a specific stimulus. Each
convolutional neuron processes data only for its receptive field. Although fully
connected feedforward neural networks can be used to learn features and classify
data, this architecture is generally impractical for larger inputs such as high-
resolution images. It would require a very high number of neurons, even in a
shallow architecture, due to the large input size of images, where each pixel is a
relevant input feature. For instance, a fully connected layer for a (small) image of
size 100 x 100 has 10,000 weights for each neuron in the second layer. Instead,
convolution reduces the number of free parameters, allowing the network to be
deeper. For example, regardless of image size, using a 5 x 5 tiling region, each with
the same shared weights, requires only 25 learnable parameters. Using regularized
weights over fewer parameters avoids the vanishing gradients and exploding
gradients problems seen during backpropagation in traditional neural networks.
Furthermore, convolutional neural networks are ideal for data with a grid-like
topology (such as images) as spatial relations between separate features are
considered during convolution and/or pooling.
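The parameter-count comparison in the paragraph above can be checked with simple arithmetic: a fully connected neuron over a 100 x 100 image needs one weight per pixel, while a 5 x 5 convolutional filter shares the same 25 weights across the entire image:

```python
# Fully connected vs. convolutional parameter counts (bias terms omitted).
image_h, image_w = 100, 100
fc_weights_per_neuron = image_h * image_w   # one weight per input pixel
conv_weights_per_filter = 5 * 5             # shared 5 x 5 filter, any image size

print(fc_weights_per_neuron, conv_weights_per_filter)  # 10000 25
```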
Pooling layers
Convolutional networks may include local and/or global pooling layers along with
traditional convolutional layers. Pooling layers reduce the dimensions of data by
combining the outputs of neuron clusters at one layer into a single neuron in the
next layer. Local pooling combines small clusters, tiling sizes such as 2 x 2 are
commonly used. Global pooling acts on all the neurons of the feature map. There
are two common types of pooling in popular use: max and average. Max pooling
uses the maximum value of each local cluster of neurons in the feature map, while
average pooling takes the average value.
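Max and average pooling over 2 x 2 local clusters, as described above, can be sketched with NumPy (the 4 x 4 feature map below is made up for illustration):

```python
import numpy as np

# A toy 4 x 4 feature map.
fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 1],
                 [0, 2, 5, 7],
                 [1, 1, 3, 2]], dtype=float)

# Split into four non-overlapping 2 x 2 tiles, then reduce each tile.
blocks = fmap.reshape(2, 2, 2, 2).swapaxes(1, 2)
print(blocks.max(axis=(2, 3)))   # max pooling:     [[6. 2.] [2. 7.]]
print(blocks.mean(axis=(2, 3)))  # average pooling: [[3.5 1. ] [1.  4.25]]
```

Either way, the 4 x 4 map shrinks to 2 x 2: pooling keeps a summary of each local cluster while discarding the exact position of the activation within it.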
Fully connected layers connect every neuron in one layer to every neuron in another
layer. It is the same as a traditional multi-layer perceptron neural network (MLP).
The flattened matrix goes through a fully connected layer to classify the images.
Receptive field
In neural networks, each neuron receives input from some number of locations in
the previous layer. In a convolutional layer, each neuron receives input from only a
restricted area of the previous layer called the neuron’s receptive field. Typically, the
area is a square (e.g. 5 by 5 neurons). Whereas, in a fully connected layer, the
receptive field is the entire previous layer. Thus, in each convolutional layer, each
neuron takes input from a larger area in the input than previous layers. This is due
to applying the convolution over and over, which considers the value of a pixel, as
well as its surrounding pixels. When using dilated layers, the number of pixels in the
receptive field remains constant, but the field is more sparsely populated as its
dimensions grow when combining the effect of several layers.
Weights
The vector of weights and the bias are called filters and represent features of the
input (e.g., a particular shape). A distinguishing feature of CNNs is that many
neurons can share the same filter. This reduces the memory footprint because a
single bias and a single vector of weights are used across all receptive fields that
share that filter, as opposed to each receptive field having its own bias and vector
weighting.
Convolutional layers are the major building blocks used in convolutional neural
networks.
Activation function
Activation function decides whether a neuron should be activated or not by
calculating weighted sum and further adding bias with it. The purpose of the
activation function is to introduce non-linearity into the output of a neuron.
Neural network has neurons that work in correspondence of weight, bias and their
respective activation function. In a neural network, we would update the weights
and biases of the neurons based on the error at the output. This process is known
as back-propagation. Activation functions make the back-propagation possible since
the gradients are supplied along with the error to update the weights and biases.
1) Linear Function:
Equation: A linear function has the equation of a straight line, i.e. y = ax.
No matter how many layers we have, if all of them are linear in nature, the final
activation of the last layer is just a linear function of the input of the
first layer.
Range: -inf to +inf
Uses: The linear activation function is used in just one place, i.e. the output layer.
Issues: The derivative of a linear function is a constant that no longer depends
on the input x, so it cannot introduce non-linearity, and stacking such layers
adds no expressive power to the algorithm.
2) Sigmoid Function:
A = 1/(1 + e^(-x))
Tanh Function: The activation that works almost always better than the sigmoid
function is the Tanh function, also known as the Tangent Hyperbolic function. It is a
mathematically shifted version of the sigmoid function. Both are similar and can be
derived from each other.
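The relationship between the two activations can be checked numerically: tanh is a scaled, shifted sigmoid, tanh(x) = 2*sigmoid(2x) - 1:

```python
import numpy as np

def sigmoid(x):
    # A = 1/(1 + e^(-x)), squashing any real input into (0, 1)
    return 1 / (1 + np.exp(-x))

# Verify tanh(x) = 2*sigmoid(2x) - 1 on a grid of sample points.
x = np.linspace(-3, 3, 7)
assert np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1)

print(np.round(sigmoid(np.array([-2.0, 0.0, 2.0])), 3))  # [0.119 0.5   0.881]
print(np.round(np.tanh(np.array([-2.0, 0.0, 2.0])), 3))  # [-0.964  0.     0.964]
```

Sigmoid outputs lie in (0, 1) while tanh outputs lie in (-1, 1) and are zero-centered, which is one reason tanh often trains better in hidden layers.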
Pooling Layer
The pooling or down-sampling layer is responsible for reducing the spatial size of the
activation maps. In general, they are used after multiple stages of other layers (i.e.
convolutional and non-linearity layers) to reduce the computational requirements
progressively through the network as well as minimizing the likelihood of overfitting.
The key concept of the pooling layer is to provide translational invariance since
particularly in image recognition tasks, the feature detection is more important
compared to the feature’s exact location. Therefore, the pooling operation aims to
preserve the detected features in a smaller representation and does so, by
discarding less significant data at the cost of spatial resolution.
Fully connected
Fully connected layers connect every neuron in one layer to every neuron in another
layer. It is the same as a traditional multi-layer perceptron neural network (MLP).
The flattened matrix goes through a fully connected layer to classify the images.
Fully connected neural networks (FCNNs) are a type of artificial neural network
where the architecture is such that all the nodes, or neurons, in one layer are
connected to the neurons in the next layer.
While this type of algorithm is commonly applied to some types of data, in practice
this type of network has some issues in terms of image recognition and
classification. Such networks are computationally intense and may be prone to
overfitting. When such networks are also ‘deep’ (meaning there are many layers of
nodes or neurons) they can be particularly hard for humans to understand.
With the human-like ability to problem-solve and apply that skill to huge datasets
neural networks possess the following powerful attributes:
Adaptive Learning: Like humans, neural networks model non-linear and complex
relationships and build on previous knowledge. For example, software uses adaptive
learning to teach math and language arts.
Self-Organization: The ability to cluster and classify vast amounts of data makes
neural networks uniquely suited for organizing the complicated visual problems
posed by medical image analysis.
Fault Tolerance: When significant parts of a network are lost or missing, neural
networks can fill in the blanks. This ability is especially useful in space exploration,
where the failure of electronic devices is always a possibility.
Neural networks are highly valuable because they can carry out tasks to make sense
of data while retaining all their other attributes. Here are the critical tasks that
neural networks perform:
Clustering: They identify a unique feature of the data and classify it without any
knowledge of prior data.
Associating: You can train neural networks to “remember” patterns. When you
show an unfamiliar version of a pattern, the network associates it with the most
comparable version in its memory and reverts to the latter.
This neural network has two input units and one output unit with no hidden layers.
These are also known as ‘single-layer perceptrons.’
These networks are like the feed-forward Neural Network, except radial basis
function is used as these neurons’ activation function.
These networks use more than one hidden layer of neurons, unlike single-layer
perceptron. These are also known as Deep Feedforward Neural Networks.
Hopfield Network
These networks are like the Hopfield network, except some neurons are input, while
others are hidden in nature. The weights are initialized randomly and learn through
the backpropagation algorithm.
The environment is typically stated in the form of a Markov decision process (MDP)
because many reinforcement learning algorithms for this context use dynamic
programming techniques. The main difference between the classical dynamic
programming methods and reinforcement learning algorithms is that the latter do
not assume knowledge of an exact mathematical model of the MDP and they target
large MDPs where exact methods become infeasible.
Input: The input should be an initial state from which the model will start
Output: There are many possible outputs as there are a variety of solutions to a
particular problem
Training: The training is based upon the input; the model will return a state, and
the user will decide to reward or punish the model based on its output. The model
continues to learn. The best solution is decided based on the maximum
reward.
Positive:
- Maximizes performance
- Sustains change for a long period of time
- Too much reinforcement can lead to an overload of states, which can diminish the results
Negative:
- Increases behavior
- Provides defiance to a minimum standard of performance
- Only provides enough to meet the minimum behavior
There are many different algorithms that tackle this issue. As a matter of fact,
Reinforcement Learning is defined by a specific type of problem, and all its solutions
are classed as Reinforcement Learning algorithms. In the problem, an agent is
supposed to decide the best action to select based on its current state. When this
step is repeated, the problem is known as a Markov Decision Process.
State
A State is a set of tokens that represent every state that the agent can be in.
Model
Actions
An Action A is a set of all possible actions. A(s) defines the set of actions that can be
taken being in state S.
For any finite Markov decision process (FMDP), Q-learning finds an optimal policy in
the sense of maximizing the expected value of the total reward over all successive
steps, starting from the current state. Q-learning can identify an optimal action-
selection policy for any given FMDP, given infinite exploration time and a partly
random policy. “Q” refers to the function that the algorithm computes: the expected
reward for an action taken in a given state.
The goal of the agent is to maximize its total reward. It does this by adding the
maximum reward attainable from future states to the reward for achieving its
current state, effectively influencing the current action by the potential future
reward. This potential reward is a weighted sum of expected values of the rewards
of all future steps starting from the current state.
On the next day, by random chance (exploration), you decide to wait and let other
people depart first. This initially results in a longer wait time, but the time spent
fighting other passengers is less. Overall, this path has a higher reward than that of
the previous day, since the total boarding time is lower.
Through exploration, despite the initial (patient) action resulting in a larger cost (or
negative reward) than in the forceful strategy, the overall cost is lower, thus
revealing a more rewarding strategy.
Q Learning Algorithm
Q-learning is an off-policy learner, meaning it learns the value of the optimal
policy independently of the agent’s actions. An on-policy learner, by contrast,
learns the value of the policy being carried out by the agent, including its
exploration steps, and finds a policy that is optimal given the exploration
inherent in that policy.
Q-Table
A Q-Table is the data structure used to store the maximum expected future reward
for each action at each state; it guides us to the best action in each state. The
Q-Learning algorithm is used to learn each value of the Q-table.
We first build the Q-table with n columns, where n is the number of actions, and
m rows, where m is the number of states, and initialize all values to 0.
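Building the table is one line with NumPy. The sizes here (5 states, 4 actions) are hypothetical:

```python
import numpy as np

# Hypothetical problem size: m states, n actions.
m_states, n_actions = 5, 4

# Q-table: m rows (one per state) x n columns (one per action),
# with every entry initialized to 0.
q_table = np.zeros((m_states, n_actions))
```

Row s of the table holds the current value estimates Q(s, a) for every action a; the best known action in state s is simply the column with the largest value in that row.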
This combination of steps is repeated until we stop the training, or until the
training loop terminates as defined in the code.
We choose an action (a) in state (s) based on the Q-Table. However, as mentioned
earlier, when an episode first starts every Q-value is 0.
Steps 4 and 5: evaluate
Now we have taken an action and observed an outcome and reward. We need to
update the function Q(s,a).
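The standard Q-learning update is Q(s,a) ← Q(s,a) + α[r + γ·max Q(s′,a′) − Q(s,a)], where α is the learning rate and γ the discount factor. A sketch of this step (the α and γ values are illustrative):

```python
import numpy as np

def q_update(q_table, s, a, reward, s_next, lr=0.1, gamma=0.9):
    """One Q-learning update for the transition (s, a, reward, s_next)."""
    # Temporal-difference target: observed reward plus the discounted
    # best value achievable from the next state.
    td_target = reward + gamma * np.max(q_table[s_next])
    # Move Q(s, a) a fraction lr of the way toward the target.
    q_table[s, a] += lr * (td_target - q_table[s, a])
    return q_table

q = np.zeros((2, 2))
q_update(q, s=0, a=1, reward=1.0, s_next=1)
# q[0, 1] is now 0.1 * (1.0 + 0.9 * 0 - 0) = 0.1
```

Because the target uses max over the next state's actions rather than the action the agent actually takes next, this update is exactly what makes Q-learning off-policy.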
Manufacturing
At Fanuc, a robot uses deep reinforcement learning to pick a device from one box
and put it in a container. Whether it succeeds or fails, it memorizes the object,
gains knowledge, and trains itself to do the job with great speed and precision.
Many warehousing facilities used by eCommerce sites and other supermarkets use
these intelligent robots to sort millions of products every day and help deliver
the right products to the right people. Tesla’s factory, for example, comprises
more than 160 robots that do a major part of the work on its cars to reduce the
risk of defects.
Finance
Inventory Management
Healthcare
With technology improving and advancing on a regular basis, it has taken over
almost every industry today, especially the healthcare sector. With the
implementation of reinforcement learning, the healthcare system has generated
better outcomes consistently. One well-known application of reinforcement
learning in the healthcare domain is Quotient Health.
Delivery Management
Image Processing
Deep Q-learning
The DeepMind system used a deep convolutional neural network, with layers of tiled
convolutional filters to mimic the effects of receptive fields. Reinforcement learning is
unstable or divergent when a nonlinear function approximator such as a neural
network is used to represent Q. This instability comes from the correlations present
in the sequence of observations, the fact that small updates to Q may significantly
change the policy of the agent and the data distribution, and the correlations
between Q and the target values.
The technique used experience replay, a biologically inspired mechanism that uses a
random sample of prior actions instead of the most recent action to proceed. This
removes correlations in the observation sequence and smooths changes in the data
distribution. Iterative updates adjust Q towards target values that are only
periodically updated, further reducing correlations with the target.
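Experience replay can be sketched as a fixed-size buffer from which random minibatches of past transitions are drawn. This is a simplified illustration of the mechanism, not DeepMind's implementation:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of past transitions. Sampling uniformly at
    random breaks the correlation between consecutive observations."""

    def __init__(self, capacity):
        # deque with maxlen silently discards the oldest transition
        # once the buffer is full.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Draw a random minibatch of prior transitions for training.
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=100)
for i in range(10):
    buf.add(i, 0, 1.0, i + 1)
batch = buf.sample(4)  # 4 transitions drawn at random
```

Training on these shuffled minibatches, rather than on the most recent transition alone, is what smooths the data distribution and stabilizes learning when Q is represented by a neural network.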