Unit-8 - Reinforcement Learning
• model-based learning
• model-free learning
▪ In model-based learning an agent generates an approximation of the transition function, T̂(s, a, s'), by keeping counts of the number of times it arrives in each state s' after entering each q-state (s, a). The agent can then generate the approximate transition function T̂ upon request by normalizing the counts it has collected, dividing the count for each observed tuple (s, a, s') by the sum over the counts for all instances where the agent was in q-state (s, a). Normalization of counts scales them such that they sum to one, allowing them to be interpreted as probabilities.
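▪ A minimal sketch of this counting-and-normalizing step is shown below; the state and action labels and the estimate_transition_model helper are hypothetical, for illustration only.

```python
from collections import defaultdict

def estimate_transition_model(experiences):
    """Build an approximate transition function T^(s, a, s') by counting
    observed (s, a, s') triples and normalizing, so the counts for each
    q-state (s, a) sum to one and can be read as probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, s_next in experiences:
        counts[(s, a)][s_next] += 1

    T_hat = {}
    for (s, a), next_counts in counts.items():
        total = sum(next_counts.values())
        T_hat[(s, a)] = {s_next: c / total for s_next, c in next_counts.items()}
    return T_hat

# Example: three transitions observed from q-state ('A', 'right')
experiences = [('A', 'right', 'B'), ('A', 'right', 'B'), ('A', 'right', 'C')]
print(estimate_transition_model(experiences))
# {('A', 'right'): {'B': 0.666..., 'C': 0.333...}}
```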
▪ There are several model-free learning algorithms, and we’ll
cover three of them: direct evaluation, temporal difference
learning, and Q-learning. Direct evaluation and temporal
difference learning fall under a class of algorithms known as
passive reinforcement learning.
▪ In passive reinforcement learning, an agent is given a policy to
follow and learns the value of states under that policy as it
experiences episodes, which is exactly what is done by policy
evaluation for MDPs when T and R are known.
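▪ As a minimal sketch of passive learning, the snippet below applies the standard TD(0) value update while following a fixed policy; the episode data, learning rate, and discount factor are assumptions for illustration.

```python
def td_value_update(V, s, reward, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) update of V(s) under the fixed policy: move V(s)
    toward the observed sample reward + gamma * V(s')."""
    v_s = V.get(s, 0.0)
    sample = reward + gamma * V.get(s_next, 0.0)
    V[s] = v_s + alpha * (sample - v_s)
    return V

# Example episode of (state, reward, next_state) transitions
V = {}
for s, r, s_next in [('A', -1, 'B'), ('B', -1, 'C'), ('C', 10, 'exit')]:
    td_value_update(V, s, r, s_next)
print(V)   # approximate values of A, B, C under the fixed policy
```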
▪ In the case of passive RL, the agent's policy is fixed, which means that it is told what to do; the agent simply watches the world go by and tries to learn the utilities of being in various states.
▪ In active RL, an agent needs to decide what to do as there’s no fixed
policy that it can act on.
▪ Therefore, the goal of a passive RL agent is to execute a fixed policy
(sequence of actions) and evaluate it while that of an active RL agent is
to act and learn an optimal policy.
▪ The agent sees the sequence of state transitions and the associated rewards.
▪ The environment generates state transitions and the agent perceives them.
▪ It is fast, but can become quite costly to compute for large state spaces. ADP (Adaptive Dynamic Programming) is a model-based approach and requires the transition model of the environment.
▪ Q-learning is a reinforcement learning algorithm that finds the next best action, given the current state. During learning it may choose actions at random (exploration), while aiming to maximize the reward.
▪ Q-learning is a model-free, off-policy reinforcement learning algorithm that finds the best course of action given the current state of the agent. Depending on where the agent is in the environment, it decides the next action to be taken.
▪ The objective of the model is to find the best course of action given its current state. To do this, it may come up with rules of its own, or it may operate outside the policy given to it to follow; because the policy it learns about is not the one it follows while acting, we call it off-policy.
▪ Model-free means that the agent does not rely on a learned model of the environment's expected response; it learns through trial and error, using the rewards it receives, rather than by planning with a model.
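▪ To make this concrete: the agent can follow an exploratory behavior policy (for example, ε-greedy) while its update always targets the greedy action, which is what makes Q-learning off-policy. The sketch below shows such an action-selection rule; the ε value and Q-table shape are illustrative assumptions.

```python
import random
import numpy as np

def epsilon_greedy_action(Q, state, epsilon=0.1):
    """Behavior policy: act greedily with respect to the current Q-table,
    but pick a random action with probability epsilon (trial and error)."""
    if random.random() < epsilon:
        return random.randrange(Q.shape[1])   # explore
    return int(np.argmax(Q[state]))           # exploit

Q = np.zeros((6, 4))                          # hypothetical 6 states x 4 actions
action = epsilon_greedy_action(Q, state=0)
```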
▪ Both direct evaluation and TD learning will eventually learn the
true value of all states under the policy they follow.
▪ However, they both have a major inherent issue - we want to
find an optimal policy for our agent, which requires knowledge
of the q-values of states.
▪ To compute q-values from the values we have, we require a
transition function and reward function as dictated by the
Bellman equation.
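▪ For reference, the standard Bellman relationship that makes this dependence on T and R explicit is:

$$ Q(s,a) \;=\; \sum_{s'} T(s,a,s')\,\big[\, R(s,a,s') + \gamma\, V(s') \,\big] $$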
▪ Advertisement recommendation system.
▪ In a normal ad recommendation system, the ads you get are
based on your previous purchases or websites you may have
visited. If you’ve bought a TV, you will get recommended TVs of
different brands.
▪ Using Q-learning, we can optimize the ad recommendation system to recommend products that are frequently bought together. The reward is given if the user clicks on the suggested product.
▪ Let’s say that a robot has to cross a maze and reach the end
point. There are mines, and the robot can only move one tile at
a time. If the robot steps onto a mine, the robot is dead. The
robot has to reach the end point in the shortest time possible.
▪ The scoring/reward system is as below:
1. The robot loses 1 point at each step. This is done so that the robot
takes the shortest path and reaches the goal as fast as possible.
2. If the robot steps on a mine, the point loss is 100 and the game ends.
3. If the robot gets power ⚡️, it gains 1 point.
4. If the robot reaches the end goal, the robot gets 100 points.
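▪ One hypothetical way to write this scoring scheme down as a reward table (the keys below are illustrative labels):

```python
REWARDS = {
    'step':  -1,    # every move costs 1 point, encouraging the shortest path
    'mine':  -100,  # stepping on a mine ends the game
    'power': +1,    # picking up power
    'goal':  +100,  # reaching the end goal
}
```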
How do we train a robot to reach the end goal with the shortest
path without stepping on a mine?
▪ Q-Table is just a fancy name for a simple lookup table where we calculate the maximum expected future reward for each action at each state. Basically, this table will guide us to the best action at each state.
▪ Each Q-table score will be the maximum expected future reward that
the robot will get if it takes that action at that state. This is an
iterative process, as we need to improve the Q-Table at each
iteration.
➢ The questions are:
• How do we calculate the values of the Q-table?
• Are the values available or predefined?
➢ Answer:
▪ Q-function
▪ The Q-function uses the Bellman equation and takes two inputs:
state (s) and action (a).
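▪ A standard way to write this update (shown here for reference; α is the learning rate and γ the discount factor) is:

$$ Q(s,a) \;\leftarrow\; Q(s,a) + \alpha\,\big[\, R(s,a) + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\big] $$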
▪ Using the above function, we get the values of Q for the cells in
the table.
▪ When we start, all the values in the Q-table are zeros.
▪ There is an iterative process of updating the values. As we start to
explore the environment, the Q-function gives us better and
better approximations by continuously updating the Q-values in
the table.
▪ Now, let’s understand how the updating takes place.
▪ There are n columns, where n = number of actions. There are m rows, where m = number of states. We will initialize the values at 0.
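▪ A minimal sketch of this initialization; the values of m and n below are placeholders for illustration.

```python
import numpy as np

m = 25   # number of states (e.g., tiles in the maze)
n = 4    # number of actions (up, down, left, right)

# Q-table: m rows (states) x n columns (actions), all values start at 0
Q = np.zeros((m, n))
```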
We can now update the Q-values for being at the start and moving
right using the Bellman equation.
▪ Now we have taken an action and observed an outcome and reward.
We need to update the function Q(s,a).
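▪ Putting this together, the sketch below performs one such update for a single observed transition; the state/action indices, learning rate, and discount factor are illustrative assumptions.

```python
import numpy as np

n_states, n_actions = 25, 4
Q = np.zeros((n_states, n_actions))   # Q-table initialized to zeros
alpha, gamma = 0.1, 0.9               # learning rate and discount factor

def q_update(s, a, r, s_next):
    """Move Q(s, a) toward the Bellman sample r + gamma * max_a' Q(s', a')."""
    sample = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (sample - Q[s, a])

# Example: from the start state (index 0), moving right (action 3)
# costs -1 and lands in state 1.
q_update(s=0, a=3, r=-1, s_next=1)
print(Q[0, 3])   # -0.1 after the first update
```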