RL MJJ
Q-Learning
Introduction
Example:
In the above image, the agent is at the very first block of the maze. The
maze contains an S6 block, which is a wall; an S8 block, which is a fire
pit; and an S4 block, which holds a diamond.
The agent cannot cross the S6 block, as it is a solid wall. If the agent
reaches the S4 block, it gets a +1 reward; if it reaches the fire pit, it
gets a -1 reward. It can take four actions: move up, move down, move
left, and move right.
The agent can take any path to reach the final point, but it should do so
in as few steps as possible. Suppose the agent follows the path
S9-S5-S1-S2-S3; it will then receive the +1 reward.
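
To make the setup concrete, here is a minimal Python sketch of such a maze environment. Only the special blocks are fixed by the text (S4 diamond +1, S6 wall, S8 fire pit -1); the 3x3 grid layout and the step() helper are assumptions chosen so that the path S9-S5-S1-S2-S3 into S4 is valid.

# A minimal maze sketch; the grid arrangement is an assumption, since the
# original figure is not reproduced here.
GRID = [
    ["S1", "S2", "S3"],
    ["S5", "S6", "S4"],
    ["S9", "S8", "S7"],
]
WALL, FIRE, DIAMOND = "S6", "S8", "S4"
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Apply one action; return (next_state, reward, done)."""
    # Locate the current state in the grid.
    r, c = next((i, j) for i, row in enumerate(GRID)
                for j, s in enumerate(row) if s == state)
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    # Moves off the grid or into the wall leave the agent where it is.
    if not (0 <= nr < 3 and 0 <= nc < 3) or GRID[nr][nc] == WALL:
        return state, 0.0, False
    nxt = GRID[nr][nc]
    if nxt == DIAMOND:
        return nxt, +1.0, True    # +1 reward for reaching the diamond
    if nxt == FIRE:
        return nxt, -1.0, True    # -1 reward for falling into the fire pit
    return nxt, 0.0, False

# The path from the text: S9 -> S5 -> S1 -> S2 -> S3 -> S4 (diamond).
state = "S9"
for a in ["up", "up", "right", "right", "down"]:
    state, reward, done = step(state, a)
    print(state, reward, done)

Running the loop prints each transition along the path and ends with the +1 reward on entering S4.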
The agent will try to remember the steps it has taken to reach the final
point. To memorize them, it assigns a value of 1 to each block along the
path. Consider the following step:
Now the agent has successfully stored the previous steps by assigning
the value 1 to each block along the path. But what will the agent do if it
starts from a block that has a 1-valued block on both sides?
Consider the below diagram:
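
The difficulty can be shown directly in code: if every block on a successful path simply stores the value 1, a greedy lookup from a block with 1-valued neighbours on both sides ties, and the agent has no basis to choose. All names and numbers below are hypothetical.

# Flat bookkeeping: every block on a previously successful path stores 1.
values = {"S2": 1, "S5": 1}                  # hypothetical stored values
neighbours = {"right": "S2", "down": "S5"}   # hypothetical neighbours

# A greedy lookup over neighbour values ties -- both moves look equally good.
scores = {a: values[s] for a, s in neighbours.items()}
print(scores)  # {'right': 1, 'down': 1}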
In the above image, we can see an agent that has three value options:
V(s1), V(s2), and V(s3). Since this is an MDP (Markov Decision
Process), the agent only cares about the current state and the future
state. The agent can move in any direction (up, left, or right), so it
needs to decide where to go to follow the optimal path. Here the agent
moves on a probability basis and changes its state. But if we want
exact moves, we need to make some changes in terms of the Q-value.
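
A rough sketch of that difference, with all numbers hypothetical: with only state values, one option is to pick the next state in proportion to V, whereas a table of Q-values lets the agent look up one exact best action.

import random

# Hypothetical values of the three reachable states.
V = {"s1": 0.5, "s2": 0.8, "s3": 0.5}

# With state values alone, the agent can move probabilistically,
# weighting each candidate next state by its value.
states, weights = zip(*V.items())
print(random.choices(states, weights=weights)[0])   # stochastic move

# With Q-values, quality is stored per (state, action) pair, so the agent
# can pick the single best action deterministically.
Q = {("s0", "up"): 0.5, ("s0", "left"): 0.8, ("s0", "right"): 0.5}
print(max(["up", "left", "right"], key=lambda a: Q[("s0", a)]))  # 'left'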
Q represents the quality of the actions at each state. So instead of
using a value at each state, we use a pair of state and action,
i.e., Q(s, a). The Q-value specifies which action is more lucrative
than the others, and the agent takes its next move according to the
best Q-value. The Bellman equation can be used for deriving the Q-
value.
On performing an action, the agent gets a reward R(s, a) and ends up
in a certain next state s', so the Q-value equation will be:

Q(s, a) = R(s, a) + γ max Q(s', a'),

where the maximum is taken over the actions a' available in the next
state s', and γ is the discount factor.
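
As a minimal sketch of how this equation drives learning, the following Python fills in a Q-table for the maze sketched earlier (it reuses the step() helper from that sketch). The learning rate, exploration rate, and episode count are illustrative assumptions.

import random

GAMMA, ALPHA, EPSILON = 0.9, 0.5, 0.2   # discount, learning rate, exploration
STATES = [f"S{i}" for i in range(1, 10)]
ACTION_LIST = ["up", "down", "left", "right"]

# Q-table: one Q-value per (state, action) pair, initialised to 0.
Q = {(s, a): 0.0 for s in STATES for a in ACTION_LIST}

def update(s, a, r, s_next):
    """One Q-learning update based on the Bellman equation above."""
    best_next = max(Q[(s_next, a2)] for a2 in ACTION_LIST)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

# Illustrative training loop over many episodes starting from S9.
for episode in range(200):
    s, done = "S9", False
    while not done:
        # epsilon-greedy: mostly exploit the best Q-value, sometimes explore.
        if random.random() < EPSILON:
            a = random.choice(ACTION_LIST)
        else:
            a = max(ACTION_LIST, key=lambda x: Q[(s, x)])
        s_next, r, done = step(s, a)   # step() from the maze sketch above
        update(s, a, r, s_next)
        s = s_next

# After training, the greedy action at each state follows the safe route
# around the wall; at S9 that should be 'up'.
print(max(ACTION_LIST, key=lambda x: Q[("S9", x)]))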
Q-table