
Markov Decision Process

Reinforcement Learning:

Reinforcement Learning is a type of Machine Learning. It allows machines and software agents to automatically determine the ideal behavior within a specific context in order to maximize their performance. Simple reward feedback is required for the agent to learn its behavior; this is known as the reinforcement signal.

There are many different algorithms that tackle this issue. As a matter of fact, Reinforcement Learning is defined by a specific type of problem, and all of its solutions are classed as Reinforcement Learning algorithms. In the problem, an agent is supposed to decide the best action to select based on its current state. When this step is repeated, the problem is known as a Markov Decision Process.

A Markov Decision Process (MDP) model contains:

• A set of possible world states S.
• A set of Models.
• A set of possible actions A.
• A real-valued reward function R(s,a).
• A policy, which is a solution to the Markov Decision Process.
What is a State?

A State is a set of tokens that represent every state that the agent can be in.

What is a Model?

A Model (sometimes called Transition Model) gives an action’s effect in a state. In particular, T(S, a, S’) defines a transition T where being in state S and taking an action ‘a’ takes us to state S’ (S and S’ may be the same). For stochastic actions (noisy, non-deterministic) we also define a probability P(S’|S,a), which represents the probability of reaching state S’ if action ‘a’ is taken in state S. Note that the Markov property states that the effects of an action taken in a state depend only on that state and not on the prior history.

What are Actions?

A is the set of all possible actions. A(s) defines the set of actions that can be taken in state S.

What is a Reward?

A Reward is a real-valued reward function. R(s) indicates the reward for simply being in the
state S. R(S,a) indicates the reward for being in a state S and taking an action ‘a’. R(S,a,S’)
indicates the reward for being in a state S, taking an action ‘a’ and ending up in a state S’.

What is a Policy?

A Policy is a solution to the Markov Decision Process. A policy is a mapping from S to A. It indicates the action ‘a’ to be taken while in state S.
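
To make these definitions concrete, here is a minimal Python sketch of how the pieces of an MDP can be represented; the state names, transition probabilities and reward values below are invented purely for illustration.

import random

S = ["s0", "s1", "s2"]                          # set of possible world states
A = {"s0": ["a", "b"], "s1": ["a"], "s2": []}   # A(s): actions available in each state

# Transition model: P(S'|S, a); probabilities for each (state, action) pair sum to 1
P = {
    ("s0", "a"): {"s1": 0.9, "s0": 0.1},
    ("s0", "b"): {"s2": 1.0},
    ("s1", "a"): {"s2": 1.0},
}

# Reward function R(S, a, S')
R = {
    ("s0", "a", "s1"): 0.0,
    ("s0", "a", "s0"): 0.0,
    ("s0", "b", "s2"): -1.0,
    ("s1", "a", "s2"): 1.0,
}

# A policy: a mapping from states to actions
policy = {"s0": "a", "s1": "a"}

def step(state, action):
    """Sample the next state from P(S'|S, a) and return it with the reward."""
    outcomes = P[(state, action)]
    next_state = random.choices(list(outcomes), weights=list(outcomes.values()))[0]
    return next_state, R[(state, action, next_state)]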

Let us take the example of a grid world:


An agent lives in the grid. The above example is a 3*4 grid. The grid has a START state (grid no. 1,1). The purpose of the agent is to wander around the grid and finally reach the Blue Diamond (grid no. 4,3). Under all circumstances, the agent should avoid the Fire grid (orange color, grid no. 4,2). Also, grid no. 2,2 is a blocked grid; it acts as a wall, hence the agent cannot enter it.

The agent can take any one of these actions: UP, DOWN, LEFT, RIGHT

Walls block the agent’s path, i.e., if there is a wall in the direction the agent would have taken, the agent stays in the same place. So, for example, if the agent chooses LEFT in the START grid, it would stay put in the START grid.

First Aim: To find the shortest sequence getting from START to the Diamond. Two such
sequences can be found:
RIGHT RIGHT UP UP RIGHT

UP UP RIGHT RIGHT RIGHT

Let us take the second one (UP UP RIGHT RIGHT RIGHT) for the subsequent discussion.

The move is now noisy. 80% of the time the intended action works correctly. 20% of the time the action the agent takes causes it to move at right angles to the intended direction. For example, if the agent chooses UP, the probability of going UP is 0.8, whereas the probability of going LEFT is 0.1 and the probability of going RIGHT is 0.1 (since LEFT and RIGHT are at right angles to UP).

The agent receives rewards at each time step:

Small reward for each step (this can be negative, in which case it can also be termed a punishment; in the above example, entering the Fire grid can have a reward of -1).

Big rewards come at the end (good or bad).

The goal is to Maximize the sum of rewards.
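
The grid world above can be sketched directly in code. The following minimal Python sketch follows the example's layout and the 0.8/0.1/0.1 noise model; the per-step reward of -0.04 and the +1 reward for the Diamond are assumed values, since the text only fixes the Fire reward at -1.

import random

ROWS, COLS = 3, 4                               # 3*4 grid: columns 1..4, rows 1..3
START, DIAMOND, FIRE = (1, 1), (4, 3), (4, 2)   # (column, row) as in the example
BLOCKED = (2, 2)                                # wall grid the agent cannot enter
STEP_REWARD = -0.04                             # assumed small per-step reward
END_REWARDS = {DIAMOND: 1.0, FIRE: -1.0}        # big rewards at the end (good or bad)

MOVES = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}
SIDEWAYS = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
            "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}

def move(state, direction):
    """Deterministic move; walls and the blocked grid keep the agent in place."""
    x, y = state
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    if not (1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS) or nxt == BLOCKED:
        return state
    return nxt

def step(state, action):
    """Noisy step: 0.8 intended direction, 0.1 each for the two right-angle directions."""
    left, right = SIDEWAYS[action]
    direction = random.choices([action, left, right], weights=[0.8, 0.1, 0.1])[0]
    nxt = move(state, direction)
    done = nxt in END_REWARDS
    reward = END_REWARDS[nxt] if done else STEP_REWARD
    return nxt, reward, done

# Following the chosen sequence UP UP RIGHT RIGHT RIGHT from the START grid
state, total = START, 0.0
for a in ["UP", "UP", "RIGHT", "RIGHT", "RIGHT"]:
    state, reward, done = step(state, a)
    total += reward
    if done:
        break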

Deep Q-Learning
Q-Learning is required as a prerequisite: Q-Learning is a process that creates an exact matrix for the working agent which it can “refer to” to maximize its reward in the long run. Although this approach is not wrong in itself, it is only practical for very small environments and quickly loses its feasibility when the number of states and actions in the environment increases. The solution to the above problem comes from the realization that the values in the matrix only have relative importance, i.e., the values only have importance with respect to the other values. This thinking leads us to Deep Q-Learning, which uses a deep neural network to approximate the values. This approximation of values does not hurt as long as the relative importance is preserved. The basic working step of Deep Q-Learning is that the initial state is fed into the neural network, which returns the Q-value of all possible actions as an output. The difference between Q-Learning and Deep Q-Learning can be illustrated as follows:
Observe that in the equation target = R(s,a,s’) + γ max_{a’} Q_k(s’,a’), the term γ max_{a’} Q_k(s’,a’) is a variable term. Therefore, in this process, the target for the neural network is variable, unlike other typical Deep Learning processes where the target is stationary. This problem is overcome by having two neural networks instead of one. One neural network is used to adjust the parameters of the network, and the other is used for computing the target; it has the same architecture as the first network but has frozen parameters. After every x iterations of training the primary network, the parameters are copied to the target network.
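
Below is a minimal PyTorch-style sketch of this two-network arrangement; the network sizes, learning rate, discount factor and copy interval are illustrative assumptions, not values from the text.

import copy
import torch
import torch.nn as nn

# Q-network: maps a state vector to one Q-value per action (sizes are illustrative)
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)           # same architecture, frozen parameters
for p in target_net.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99
COPY_EVERY = 1000                           # "x" iterations between parameter copies

def update(step, s, a, r, s_next, done):
    """One Deep Q-Learning update on a batch of tensors:
    target = R(s,a,s') + gamma * max_a' Q_k(s',a')."""
    with torch.no_grad():
        max_next_q = target_net(s_next).max(dim=1).values
        target = r + gamma * max_next_q * (1 - done)   # no bootstrap at terminal states
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % COPY_EVERY == 0:              # copy parameters to the frozen target network
        target_net.load_state_dict(q_net.state_dict())

Freezing the target network's parameters keeps the target stationary between copies, which is the stabilization described above.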

Deep Q-Learning is a type of reinforcement learning algorithm that uses a deep neural network
to approximate the Q-function, which is used to determine the optimal action to take in a given
state. The Q-function represents the expected cumulative reward of taking a certain action in a
certain state and following a certain policy. In Q-Learning, the Q-function is updated iteratively
as the agent interacts with the environment. Deep Q-Learning is used in various applications
such as game playing, robotics and autonomous vehicles.

Deep Q-Learning is a variant of Q-Learning that uses a deep neural network to represent the Q-
function, rather than a simple table of values. This allows the algorithm to handle environments
with a large number of states and actions, as well as to learn from high-dimensional inputs such
as images or sensor data.
One of the key challenges in implementing Deep Q-Learning is that the Q-function is typically non-linear, and the training objective can have many local minima. This can make it difficult for the neural network to converge to the correct Q-function. To address this, several techniques have been proposed, such as experience replay and target networks.

Experience replay is a technique where the agent stores a subset of its experiences (state, action,
reward, next state) in a memory buffer and samples from this buffer to update the Q-function.
This helps to decorrelate the data and make the learning process more stable. Target networks,
on the other hand, are used to stabilize the Q-function updates. In this technique, a separate
network is used to compute the target Q-values, which are then used to update the Q-function
network.
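
A replay buffer of this kind can be sketched in a few lines of Python; the capacity and batch size below are assumed values.

import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) tuples and samples them at random."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old experiences are discarded automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)   # random sampling decorrelates the data
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)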

Deep Q-Learning has been applied to a wide range of problems, including game playing,
robotics, and autonomous vehicles. For example, it has been used to train agents that can play
games such as Atari and Go, and to control robots for tasks such as grasping and navigation.

Exploitation and Exploration in Machine Learning
Exploration and Exploitation are methods for building effective learning algorithms that can
adapt and perform optimally in different environments. This article focuses on exploitation and
exploration in machine learning, and it elucidates various techniques involved.

Understanding Exploitation

Exploitation is a strategy of using the accumulated knowledge to make decisions that maximize
the expected reward based on the present information. The focus of exploitation is on utilizing
what is already known about the environment and achieving the best outcome using that
information. The key aspects of exploitation include:

Reward Maximization: Maximizing the immediate or short-term reward based on the current
understanding of the environment is the main objective of exploitation. This is choosing courses
of action based on learned values or rewards that the model predicts will yield the highest
expected payoff.
Decision Efficiency: Exploitation can often make more efficient decisions by concentrating on
known high-reward actions, which lowers the computational and temporal costs associated with
exploration.

Risk Aversion: Exploitation inherently involves a lower level of risk as it relies on tried and
tested actions, avoiding the uncertainty associated with less familiar options.

Exploitation Strategies in Machine Learning

Exploitation strategies focus on tapping currently known solutions with the aim of getting maximum benefit in the short term.

Some common exploitation techniques in machine learning include:

Greedy Algorithms: Greedy algorithms choose the locally optimal solution at each step without considering the potential impact on the overall solution. They are often efficient in terms of computation time; however, this approach may be suboptimal when sacrifices are required to achieve the best global solution (see the sketch after these strategies).

Exploitation of Learned Policies: Reinforcement learning algorithms tend to base their decisions on previously learned policies as a way of leveraging past gains. This means picking the action that has yielded high rewards in similar previous experiences.

Model-Based Methods: Model-based approaches take advantage of underlying models of the environment and make decisions based on their predictive capabilities.
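
As a small illustration of pure exploitation, the sketch below always picks the action with the highest learned value from a Q-table; the table and its numbers are invented for this example.

# Pure exploitation: always act greedily with respect to learned values.
Q = {
    "s0": {"left": 0.2, "right": 0.7},
    "s1": {"left": 0.9, "right": 0.1},
}

def greedy_action(state):
    """Pick the action with the highest learned value for this state (no exploration)."""
    return max(Q[state], key=Q[state].get)

print(greedy_action("s0"))   # -> "right"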

Understanding Exploration

Exploration is used to increase knowledge about an environment or model. The exploration process selects actions with uncertain outcomes to gather information about the possible states and rewards that the performed actions will result in. The key aspects of exploration include:

Information Gain: The main objective of exploration is to gather fresh data that can improve
the model's comprehension of the surroundings. This involves exploring distinct regions of the
state space or experimenting with different actions whose outcomes are unknown.

Uncertainty Reduction: Reducing uncertainty in the model's estimates of the environment guides which actions are selected. For example, actions that have rarely been selected in the past may be prioritized according to their possible rewards.

State Space Coverage: In certain models, especially those with large or continuous state spaces,
exploration makes sure that enough different areas of the state space are visited to prevent
learning that is biased toward a small number of experiences.
Exploration Strategies in Machine Learning

In the strategy called exploration, newly gathered data is used to extend or update the model's knowledge by considering the opportunities offered by other options. Some common exploration techniques in machine learning include:

Epsilon-Greedy Exploration: Epsilon-greedy algorithms unify these two characteristics (exploitation and exploration) by sometimes choosing completely random actions with probability epsilon, while continuing to use the current best-known action with probability (1 - epsilon); a short sketch is given after these strategies.

Thompson Sampling: Thompson sampling uses a Bayesian method to explore and exploit simultaneously. It maintains probability distributions over the parameters of the reward model and samples from them, taking into consideration what is most likely to happen, so as to balance exploration and exploitation.
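
The epsilon-greedy rule mentioned above can be sketched as follows; the Q-table layout and the epsilon value of 0.1 are assumptions for illustration.

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit the best-known one."""
    if random.random() < epsilon:
        return random.choice(actions)                              # exploration
    return max(actions, key=lambda a: Q.get((state, a), 0.0))      # exploitation

In practice, epsilon is often decayed over time, so the agent explores heavily at first and exploits more as its estimates improve.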

Balancing Exploitation and Exploration

One of the critical aspects of machine learning to keep in mind is the proper balance between exploitation and exploration. This way, an efficient learning process for machine learning systems can be achieved. Exploitation is necessary to secure short-term reward, but exploration helps to discover new strategies and find ways out of inferior solutions.

Several approaches can help maintain this balance:

Exploration-Exploitation Trade-off: The foremost idea here is to understand the trade-off between the exploration and exploitation processes. Resources should be allocated to each, depending on the current state of knowledge and the complexity of the learning task at a given time.

Dynamic Parameter Tuning: The algorithm dynamically adjusts its exploration and exploitation parameters according to how the model performs and how the characteristics of the environment change, so that it adapts to the changing environment and learns efficiently.

Multi-Armed Bandit Frameworks: Multi-armed bandit theory provides a formal basis for balancing exploration and exploitation in sequential decision problems. It offers algorithms for analyzing this trade-off under different reward systems and conditions.
Hierarchical Approaches: Hierarchical reinforcement learning (RL) approaches can maintain a balance between exploration and exploitation at different levels of the architecture. Organizing actions and policies hierarchically allows an efficient search for new combinations of behaviors while exploiting known solutions at every level.
