Markov Decision Process: Reinforcement Learning
Reinforcement Learning:
Reinforcement Learning is a type of Machine Learning. It allows machines and software agents
to automatically determine the ideal behavior within a specific context, in order to maximize their
performance. Simple reward feedback is required for the agent to learn its behavior; this is
known as the reinforcement signal.
There are many different algorithms that tackle this issue. As a matter of fact, Reinforcement
Learning is defined by a specific type of problem and all its solutions are classed as
Reinforcement Learning algorithms. In the problem, an agent is supposed to decide the best
action to select based on its current state. When this step is repeated, the problem is known as a
Markov Decision Process.
What is a State?
A State is a set of tokens that represent every state that the agent can be in.
What is a Model?
A Model (sometimes called Transition Model) gives an action’s effect in a state. In particular,
T(S, a, S’) defines a transition T where being in state S and taking an action ‘a’ takes us to state
S’ (S and S’ may be the same). For stochastic actions (noisy, non-deterministic) we also define a
probability P(S’|S,a) which represents the probability of reaching a state S’ if action ‘a’ is taken
in state S. Note that the Markov property states that the effects of an action taken in a state depend only on that state and not on the prior history.
What is an Action?
An Action A is the set of all possible actions. A(s) defines the set of actions that can be taken while in state S.
What is a Reward?
A Reward is a real-valued reward function. R(s) indicates the reward for simply being in the
state S. R(S,a) indicates the reward for being in a state S and taking an action ‘a’. R(S,a,S’)
indicates the reward for being in a state S, taking an action ‘a’ and ending up in a state S’.
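To make these pieces concrete, here is a minimal sketch in Python of the S, A, T(S, a, S') and R(S, a, S') components for a tiny, made-up MDP; the state names, actions, probabilities and rewards are purely illustrative.

```python
# A tiny, hypothetical MDP with two states.
# T[(s, a)] maps a next state s' to P(s' | s, a); R gives R(s, a, s').
states = ["sunny", "rainy"]
actions = {"sunny": ["walk", "drive"], "rainy": ["drive"]}   # A(s)

T = {
    ("sunny", "walk"):  {"sunny": 0.9, "rainy": 0.1},
    ("sunny", "drive"): {"sunny": 0.7, "rainy": 0.3},
    ("rainy", "drive"): {"sunny": 0.4, "rainy": 0.6},
}

R = {
    ("sunny", "walk", "sunny"):  2.0,
    ("sunny", "walk", "rainy"): -1.0,
    ("sunny", "drive", "sunny"): 1.0,
    ("sunny", "drive", "rainy"): 0.0,
    ("rainy", "drive", "sunny"): 1.0,
    ("rainy", "drive", "rainy"): -0.5,
}

def transition_prob(s, a, s_next):
    """P(s' | s, a) read off the transition model T."""
    return T[(s, a)].get(s_next, 0.0)

print(transition_prob("sunny", "walk", "rainy"))  # 0.1
```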
What is a Policy?
A Policy is a solution to the Markov Decision Process: a mapping from states to actions that indicates the action 'a' to be taken while in state S.
As a running example, consider a grid world in which the agent starts in a START grid and must reach a Diamond grid while avoiding a Fire grid. The agent can take any one of these actions: UP, DOWN, LEFT, RIGHT.
Walls block the agent's path, i.e., if there is a wall in the direction the agent would have taken, the agent stays in the same place. So, for example, if the agent says LEFT in the START grid it would stay put in the START grid.
First Aim: To find the shortest sequence getting from START to the Diamond. Two such
sequences can be found:
RIGHT RIGHT UP UP RIGHT
UP UP RIGHT RIGHT RIGHT
Let us take the second one (UP UP RIGHT RIGHT RIGHT) for the subsequent discussion.
The move is now noisy: 80% of the time the intended action works correctly, while 20% of the time the action the agent takes causes it to move at right angles to the intended direction. For example, if the agent says UP the
probability of going UP is 0.8 whereas the probability of going LEFT is 0.1, and the probability
of going RIGHT is 0.1 (since LEFT and RIGHT are right angles to UP).
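As a minimal sketch, this noise model can be written as a probability distribution over actual moves given the intended move; the dictionary and helper name below are illustrative only.

```python
# Noisy action model: the intended move succeeds with probability 0.8,
# and the agent slips to either perpendicular direction with probability 0.1 each.
PERPENDICULAR = {
    "UP":    ("LEFT", "RIGHT"),
    "DOWN":  ("LEFT", "RIGHT"),
    "LEFT":  ("UP", "DOWN"),
    "RIGHT": ("UP", "DOWN"),
}

def action_distribution(intended):
    """Return {actual_move: probability} for a noisy intended move."""
    left, right = PERPENDICULAR[intended]
    return {intended: 0.8, left: 0.1, right: 0.1}

print(action_distribution("UP"))  # {'UP': 0.8, 'LEFT': 0.1, 'RIGHT': 0.1}
```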
There is a small reward for each step (it can be negative, in which case it can also be termed a punishment; in the above example, entering the Fire can have a reward of -1).
Deep Q-Learning
Q-Learning is required as a prerequisite: it is a process that creates an exact matrix for the working agent, which the agent can "refer to" in order to maximize its reward in the long run. Although this approach is not wrong in itself, it is only practical for very small environments and quickly loses its feasibility when the number of states and actions in the environment increases. The solution to the above problem comes from the realization that the values in the matrix only have relative importance, i.e., the values only have importance with respect to the other values. Thus,
this thinking leads us to Deep Q-Learning which uses a deep neural network to approximate the
values. This approximation of values does not hurt as long as the relative importance is
preserved. The basic working step of Deep Q-Learning is that the initial state is fed into the neural network, which returns the Q-values of all possible actions as output. The difference between Q-Learning and Deep Q-Learning can be illustrated as follows:
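For contrast, here is a minimal sketch of the tabular case, where Q really is an explicit matrix indexed by (state, action) and is updated in place; the learning rate, discount and state names are made-up values. In Deep Q-Learning, the table lookup is replaced by a neural network that maps a state to the Q-values of all actions.

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.9      # illustrative learning rate and discount factor
Q = defaultdict(float)       # the "exact matrix": Q[(state, action)] -> value

def q_learning_update(s, a, r, s_next, actions):
    """One tabular update: Q(s,a) += alpha * (target - Q(s,a))."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Example with made-up grid-world values.
q_learning_update("START", "UP", 0.0, "NEXT_CELL", ["UP", "DOWN", "LEFT", "RIGHT"])
```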
Observe that in the equation target = R(s,a,s') + \gamma \max_{a'} Q_k(s',a'), the term \gamma \max_{a'} Q_k(s',a') is a variable term. Therefore, in this process, the target for the neural network is variable, unlike other typical Deep Learning processes where the target is stationary. This problem is overcome by having two neural networks instead of one. One neural network is used to adjust the parameters of the network, and the other, which has the same architecture as the first network but frozen parameters, is used for computing the target. After every x iterations, the parameters of the primary network are copied to the target network.
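A minimal sketch of this two-network setup, written here in PyTorch; the layer sizes, discount factor and copy interval x are placeholder values, and the state is assumed to be a small feature vector.

```python
import copy
import torch
import torch.nn as nn

# Online network (trained) and target network (frozen copy with the same architecture).
online_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(online_net)
for p in target_net.parameters():
    p.requires_grad = False          # the target network's parameters stay frozen

gamma, copy_every = 0.99, 1000       # illustrative discount factor and copy interval x

def td_target(reward, next_state, done):
    """target = R(s,a,s') + gamma * max_a' Q_target(s',a'), zeroed at terminal states."""
    with torch.no_grad():
        next_q = target_net(next_state).max(dim=-1).values
    return reward + gamma * next_q * (1.0 - done)

def maybe_sync(step):
    """Every copy_every iterations, copy the primary network's parameters across."""
    if step % copy_every == 0:
        target_net.load_state_dict(online_net.state_dict())
```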
Deep Q-Learning is a type of reinforcement learning algorithm that uses a deep neural network
to approximate the Q-function, which is used to determine the optimal action to take in a given
state. The Q-function represents the expected cumulative reward of taking a certain action in a
certain state and following a certain policy. In Q-Learning, the Q-function is updated iteratively
as the agent interacts with the environment. Deep Q-Learning is used in various applications
such as game playing, robotics and autonomous vehicles.
Deep Q-Learning is a variant of Q-Learning that uses a deep neural network to represent the Q-
function, rather than a simple table of values. This allows the algorithm to handle environments
with a large number of states and actions, as well as to learn from high-dimensional inputs such
as images or sensor data.
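As an illustration of learning from high-dimensional inputs, here is a minimal sketch of a convolutional Q-network in PyTorch; the 84x84 grayscale frame, the layer sizes and the four actions are assumptions made for the example.

```python
import torch
import torch.nn as nn

# A small convolutional Q-network for image observations (sizes are illustrative).
q_net = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),   # 84x84 -> 20x20
    nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),  # 20x20 -> 9x9
    nn.Flatten(),
    nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
    nn.Linear(256, 4),                                      # one Q-value per action
)

frame = torch.zeros(1, 1, 84, 84)   # a batch containing one observation
print(q_net(frame).shape)           # torch.Size([1, 4])
```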
One of the key challenges in implementing Deep Q-Learning is that the Q-function is typically
non-linear and can have many local minima. This can make it difficult for the neural network to
converge to the correct Q-function. To address this, several techniques have been proposed, such
as experience replay and target networks.
Experience replay is a technique where the agent stores a subset of its experiences (state, action,
reward, next state) in a memory buffer and samples from this buffer to update the Q-function.
This helps to decorrelate the data and make the learning process more stable. Target networks,
on the other hand, are used to stabilize the Q-function updates. In this technique, a separate
network is used to compute the target Q-values, which are then used to update the Q-function
network.
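A minimal sketch of an experience replay buffer, assuming each experience is stored as a (state, action, reward, next state) tuple; the capacity and batch size are illustrative defaults.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory of past transitions, sampled uniformly at random."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped first

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # Random sampling decorrelates consecutive experiences.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```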
Deep Q-Learning has been applied to a wide range of problems, including game playing,
robotics, and autonomous vehicles. For example, it has been used to train agents that can play
games such as Atari and Go, and to control robots for tasks such as grasping and navigation.
Understanding Exploitation
Exploitation is a strategy of using the accumulated knowledge to make decisions that maximize
the expected reward based on the present information. The focus of exploitation is on utilizing
what is already known about the environment and achieving the best outcome using that
information. The key aspects of exploitation include:
Reward Maximization: Maximizing the immediate or short-term reward based on the current
understanding of the environment is the main objective of exploitation. This means choosing the courses of action that learned values or model predictions indicate will yield the highest expected payoff.
Decision Efficiency: Exploitation can often make more efficient decisions by concentrating on
known high-reward actions, which lowers the computational and temporal costs associated with
exploration.
Risk Aversion: Exploitation inherently involves a lower level of risk as it relies on tried and
tested actions, avoiding the uncertainty associated with less familiar options.
Exploitation strategies focus on using currently known solutions with the aim of obtaining the maximum benefit in the short term.
Greedy Algorithms: Greedy algorithms choose the locally optimal solution at each step without considering the potential impact on the overall solution. They are often efficient in terms of computation time; however, this approach may be suboptimal when sacrifices are required to achieve the best global solution (a minimal sketch of greedy action selection follows this list).
Exploitation of Learned Policies: Reinforcement learning algorithms tend to follow previously learned policies as a way of leveraging earlier gains. This means picking the action that has yielded high rewards in situations similar to previous experiences.
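A minimal sketch of the greedy, exploitation-only choice described above: the agent simply picks the action with the highest current value estimate. The action names and values are placeholders.

```python
# Pure exploitation: always pick the action with the highest estimated value.
estimated_values = {"UP": 0.2, "DOWN": -0.1, "LEFT": 0.05, "RIGHT": 0.4}

def greedy_action(values):
    """Return the action whose current value estimate is largest."""
    return max(values, key=values.get)

print(greedy_action(estimated_values))  # RIGHT
```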
Understanding Exploration
Exploration is the strategy of trying options whose outcomes are still uncertain in order to gather new information about the environment. The key aspects of exploration include:
Information Gain: The main objective of exploration is to gather fresh data that can improve
the model's comprehension of the surroundings. This involves exploring distinct regions of the
state space or experimenting with different actions whose outcomes are unknown.
State Space Coverage: In certain models, especially those with large or continuous state spaces,
exploration makes sure that enough different areas of the state space are visited to prevent
learning that is biased toward a small number of experiences.
Exploration Strategies in Machine Learning
In exploration, newly gathered data is used to extend or upgrade the model's knowledge by considering the opportunities offered by other options. Some common exploration techniques in machine learning include:
Thompson Sampling: Thompson sampling uses a Bayesian approach to explore and exploit simultaneously. It maintains probability distributions over the parameters associated with each option and samples from them, so that options likely to be the best are chosen most often while uncertain options still get tried, balancing exploration and exploitation.
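A minimal sketch of Thompson sampling for options with 0/1 rewards, assuming Beta distributions over each option's unknown success rate; the number of options and the prior values are illustrative.

```python
import random

n_options = 3
successes = [1] * n_options   # Beta posterior parameter (alpha) per option
failures = [1] * n_options    # Beta posterior parameter (beta) per option

def choose_option():
    """Sample a plausible success rate for each option and pick the best sample."""
    samples = [random.betavariate(successes[i], failures[i]) for i in range(n_options)]
    return max(range(n_options), key=lambda i: samples[i])

def update(option, reward):
    """Update the chosen option's posterior with an observed 0/1 reward."""
    if reward:
        successes[option] += 1
    else:
        failures[option] += 1
```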
One of the critical aspects of machine learning that practitioners must keep in mind is the proper balance between exploitation and exploration; this balance is what makes the learning process of a machine learning system efficient. Exploitation is needed to secure maximum short-term gains, but exploration helps to discover new strategies and find ways out of inferior solutions.
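One standard way to strike this balance, not covered above but widely used, is epsilon-greedy action selection: with a small probability the agent explores a random action, and otherwise it exploits the best-known one. A minimal sketch, with an illustrative epsilon:

```python
import random

def epsilon_greedy(values, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.choice(list(values))   # exploration: try any action
    return max(values, key=values.get)       # exploitation: current best estimate

action = epsilon_greedy({"UP": 0.2, "DOWN": -0.1, "RIGHT": 0.4})
```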
Dynamic Parameter Tuning: The algorithm dynamically adjusts the exploration and exploitation parameters according to how the model performs and how the environment's characteristics change, so that it adapts to the changing environment and keeps learning efficiently.
Multi-Armed Bandit Frameworks: The multi-armed bandit framework provides a formal basis for balancing exploration and exploitation in sequential decision problems. It supplies algorithms that analyze this trade-off under different reward structures and conditions (a minimal sketch of one such algorithm appears after this list).
Hierarchical Approaches: Hierarchical reinforcement learning (RL) approaches can maintain a balance between exploration and exploitation at different levels of the architecture. Organizing actions and policies hierarchically allows efficient search for new combinations of methods while reusing known answers at each level, as in exploitation.
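As a minimal sketch of the multi-armed bandit approach mentioned above, here is the classic UCB1 rule, which adds an exploration bonus to each arm's average reward; the arm count and bookkeeping variables are illustrative.

```python
import math

counts = [0, 0, 0]        # number of times each arm has been pulled
totals = [0.0, 0.0, 0.0]  # cumulative reward observed for each arm

def ucb1_choice(t):
    """Pick the arm with the highest average reward plus exploration bonus at step t."""
    for i, c in enumerate(counts):
        if c == 0:
            return i      # try every arm at least once before applying the bonus
    return max(range(len(counts)),
               key=lambda i: totals[i] / counts[i]
                             + math.sqrt(2 * math.log(t) / counts[i]))

def record(arm, reward):
    """Update the statistics for the arm that was just pulled."""
    counts[arm] += 1
    totals[arm] += reward
```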