0% found this document useful (0 votes)

23 views15 pages

Reinforcement Learning MY101

Uploaded by

Azzeddine Ben

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

23 views15 pages

Reinforcement Learning MY101

Uploaded by

Azzeddine Ben

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 15

Reinforcement Learning: All the

Basics
Mohamed Yosef

January 2024

Abstract

The way we learn is the core thing in any part of our life. So, can we move
from the traditional learning way in machine learning which is about learning from
labeled examples (supervised) or unlabeled examples (unsupervised)? Reinforcement
learning provides a solution for this, where you have an agent (the AI) that learns
through trial and error and learn from mistakes or achievements. There are two
main approaches to solve almost any reinforcement learning problem; value-based
and policy-based. Around these two approaches, I created this piece covering the
basics, Markov decision processes, and main RL algorithms. All of this with a goal
to help you understand all the basics by reading this one piece.

Contents
1 What is Reinforcement Learning? 2
1.1 Reinforcement Learning Process . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Types of RL tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 The Exploration/Exploitation trade-off . . . . . . . . . . . . . . . . . . . 3
1.4 Two approaches for solving RL problems . . . . . . . . . . . . . . . . . . 4

2 Markov Decision Processes 4

2.1 The Bellman Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Dynamic Programming (DP) . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4 Temporal Difference Learning . . . . . . . . . . . . . . . . . . . . . . . . 8

3 RL Algorithms 9
3.1 Q-learning and DQN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Policy Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3 Actor Critic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

1
1 What is Reinforcement Learning?

If you think about how you learn and the nature of learning, you will clearly see that
you learn by interacting with your world (or environment). In the same time, you are
acutely aware of how your world responds to what you do, and your goal is to get the
best results through your actions. The same thing happens with our little RL agent;
the agent learns from the world or environment by interacting with it, through trial
and error, and receiving rewards, negative or positive, as a feedback for performing
actions. The agent is not told which actions to take at first, but the agent use the
feedback from the environment to discover which actions yield the most reward.

Reinforcement learning is different from supervised learning; supervised learning is

learning from a training set of labeled examples provided by a knowledgeable external
supervisor giving the AI the solution and the right action to take in a specific situation.
The goal of supervised learning is to generalize a rule for the AI to deal with other
situations that is not in the training set. BUT in real world interactive problems, the
answer often emerges through exploration and trial-and-error. There might not be a
definitive “correct” answer for every situation the agent encounters. Even if there is a
right answer for some situations, it will not work well as a general solution (Sutton and
Barto, 2018).

Reinforcement learning is also different from unsupervised learning; unsupervised learn-

ing is finding structure hidden in collection of unlabeled data. Understanding the hidden
structure can be useful in reinforcement leaning, but unsupervised leaning itself does
not maximize the reward signal.

So, reinforcement learning is the third machine learning paradigm alongside with su-
pervised learning and unsupervised learning with a goal to maximize the total rewards
that agent gets from the environment.

1.1 Reinforcement Learning Process

The process of reinforcement learning (as shown in Figure 1) starts with the agent
observing a state st which is a representation of the current situation the agent is
in within its environment. Each state gives the agent information about the world
(environment). Based on the state, the agent selects an action at which is the move,
or decision made by the agent in a given state of the environment – the agent decides
what to do using policy π, the agent’s brain, decides what actions to take based on
the observed state– then, the environment provides rewards rt to guide the agent after
taking the action. The idea of rewards came from points in games; i.e., in football, the
team gets 3 points for winning and 1 point for a draw and 0 points for losing.

2
Figure 1: Reinforcement Learning process starts with the agent observing the current state
in the environment, choosing an action, get a reward from the environment, then adjust the
policy, and repeat. Based on a similar figure in (Sutton and Barto, 2018)

The agent’s goal is to maximize its expected return (cumulative reward). So, instead
of individual rewards, we often consider the return, which sums up all future rewards.
Since we collect rewards over time, we need a way to determines how much future
rewards matter in other words a discount rate, gamma, γ. A higher gamma prioritizes
long-term rewards (to take 100 dollar after a year), while a lower gamma focuses on
immediate rewards (to take 20 dollar now). To it all together, we have a trajectory τ
which is a sequence of states, actions, and rewards the agent experiences in the world.

τ = (s0 , a0 , s1 , a1 , ...)

1.2 Types of RL tasks

A task is a specific instance of a problem that you face everyday in your job. There
are mainly two categories of tasks in reinforcement learning: episodic and continuous.
Episodic tasks have a clear beginning and specific end, or a terminal state. In contrast,
continuous tasks are ongoing, lacking a definitive endpoint, which requires the agent
to improve the policy continuously while interacting with the environment.

1.3 The Exploration/Exploitation trade-off

Sometimes, agents need to explore to learn new things and exploit to use what they
know to do well. But the question is still: how to balance between exploration and
exploitation? Exploration is when the agent tries out different things in the environ-
ment to learn more about it. It’s like looking around to find new information. Therefore

3
exploitation is when the agent uses what it already knows to get the best results. It’s
like using a map you’ve made to find the quickest route to a treasure.

1.4 Two approaches for solving RL problems

The Policy is the function we want to learn. Our goal is to find the optimal policy π,
the policy that maximizes expected return when the agent acts according to it. We
find this π through training. There are approaches to find this optimal policy π ∗ : (1)
Directly, by teaching the agent to learn which action to take, given a state; policy-based
methods. (2) Indirectly, teach the agent to learn which state is more valuable and then
take the action that leads to more valuable states; value-based methods.

Policy-based methods (direct approach) focus on directly learning a mapping from

states to probabilities of taking specific action. This policy, often stochastic and repre-
sented by a neural network, takes the current state as input and outputs a probability
distribution over actions. Common Algorithms are REINFORCE, Proximal Policy Op-
timization (PPO), and Deterministic Policy Gradient (DPG).

Value-based methods (indirect approach) focus on learning the value of states or

state-action pairs, rather than directly learning the optimal policy. Value-based meth-
ods estimate the expected cumulative reward associated with being in a particular state,
state-value function V (s), or taking a specific action in that state, action-value function
Q(s, a). Common Algorithms are SARSA, Q-Learning, and Deep Q-Networks (DQN).

2 Markov Decision Processes

The major goal of AI and reinforcement learning is to help us make better decisions.
Markov decision process is a classical way to set up almost any problem in reinforce-
ment learning. All states in the Markov decision process have MP, Markov property
(Corcoran, 2023), which means the future only depends on the present, current state,
not the past, all previous states:

P[st+1 |st ] = P[st+1 |s1 , ..., st ]

Here, we will take about Markov decision processes assuming we have complete informa-
tion about the environment. In most cases, we don’t know exactly how an environment
will react or the rewards for our actions. However, Markov Decision Processes (MDPs)
lay the theoretical foundation for many reinforcement learning algorithms. Markov de-
cision process consists of five elements M = ⟨S, A, P, R, γ⟩, where S → a set of states;

4
A → a set of actions; P → transition probability function which specify the probability
distribution over the next states given the current state (shown in Figure 2); R →
reward function; γ → discounting factor that specifies how much immediate rewards
are favored over future rewards, γ ∈ [0, 1], when γ equals 1, it implies that the future
rewards are equally important as the present rewards. When γ equals 0, this implies
that we only care about present rewards.

Figure 2: Markov Transition Dynamics among three states; A, B, and C with the probabilities
from moving from one state to the other. Let’s say the agent starts in state s1 = C, the
dynamics describe the chances it transitions to other states s2 = B has a probability of 0.1.

2.1 The Bellman Equations

The key idea is that we want to calculate the expected long-term return starting from
any given state. This is called the value of that state, denoted V (s). One way to
calculate V (s) is through simulation — we could sample many episodes starting from
state, s, calculate the sum of discounted rewards in each one, and take the average.
Here is a formula for state-value function (Weng, 2018),

V (s) = E[Gt |st = s]

= E[rt+1 + γrt+2 + γ 2 rt+3 + . . . |st = s]
= E[rt+1 + γ(rt+2 + γrt+3 + . . . )|st = s]
= E[rt+1 + γGt+1 |st = s]
= E[rt+1 + γV (st+1 )|st = s]

5
Similarly, for action-value or Q-value,

Q(s, a) = E[rt+1 + γV (st+1 ) | st = s, at = a]

= E[rt+1 + γEa∼π Q(st+1 , a) | st = s, at = a]

If we only care about finding the optimal values and the optimal policy, π ∗ , which
dictates the best action to take in each state. The Bellman optimality equation gives
us a faster way and break down the values recursively, without having to simulate full
episodes (bootstrapping). It says:

V∗ (s) = max (R(s, a) + γV (s′ ))

a∈A

where R(s) is the immediate reward received after taking action a in state s; γ is the
discount factor; V (s′ ) is the value of the next state s′ that follows s. So instead of
calculating V (s) from scratch using many episodes, we can build it up iteratively using
the values of the next states.

Consider Figure 3 as a summary of what I’m going to cover in the next three sections.

Figure 3: Comparison of the backup diagrams of Monte-Carlo, Temporal-Difference learning,

and Dynamic Programming for state value functions (Silver, 2015)

2.2 Dynamic Programming (DP)

A collection of algorithms that can be used to compute optimal policies given a per-
fect model of the environment as a Markov decision process (MDP).A model is how
the environment change in response to the agent’s actions. While if we have complete
information about the environment, we call it model-based learning and model-free
is the opposite; where we know nothing about the environment dynamics. However in
dynamic programming, the key idea is to break down complex problems into smaller,
simpler subproblems and then solves them recursively, reusing the solutions of sub-
problems to find the solution to the larger problem. DP algorithms leverage Bellman

6
equations iteratively to update the value functions, starting from an initial guess and
progressively getting closer to the optimal values.

There are two main dynamic programming algorithms value iteration and policy iter-
ation; Value Iteration, VI, updates the state-value function, V (s), for all states. In
each iteration, VI uses the current estimate of V (s) to calculate an improved estimate
based on the Bellman optimality equation for V (s). This process continues until the
values converge to the optimal V∗ (s).

Vt+1 (s) = Eπ [r + γVt (s′ )|st = s]

X X
= π(a|s) P (s′ , r|s, a)(r + γVt (s′ ))
a s′ ,r

The second type of dynamic programming algorithms is Policy Iteration, PI, which
is based on the value functions, PI starts with an initial policy, even a random one, and
iteratively improves it.

Qπ (s, a) = E[rt+1 + γVπ (st+1 )|st = s, at = a]

X
= P (s′ , r|s, a)(r + γVπ (s′ ))
s′ ,r

In each iteration, PI evaluates the current policy by calculating the state-value function
for each state under that policy. Then, it uses this state-value function to find greedy
policy, one that takes the action with the highest Q-value in each state. Finally, it
compares the new greedy policy to the old one and keeps the one with the higher
expected return. This process called Generalized Policy Iteration, GPI.

evaluation improve evaluation improve evaluation improve evaluation

π0 −−−−−→ Vπ0 −−−−→ π1 −−−−−→ Vπ1 −−−−→ π2 −−−−−→ . . . −−−−→ π∗ −−−−−→ V∗

This policy iteration process works and always converges to the optimality, but why
this is the case? Say, we have a policy π and then generate an improved version π ′ by
greedily taking actions, π ′ (s) = arg maxa∈A Qπ (s, a). The value of this improved π ′ is
guaranteed to be better because:

Qπ (s, π ′ (s)) = Qπ (s, arg max Qπ (s, a))

a∈A
= max Qπ (s, a) ≥ Qπ (s, π(s))
a∈A
= Vπ (s)

7
2.3 Monte Carlo Methods

Monte Carlo methods estimate the quality of a given policy at the end of an episode
(only episodic tasks). These methods rely on experiencing the environment under the
policy’s control and averaging the observed rewards to estimate the value of states and
actions. A key characteristic of Monte Carlo methods is their reliance on the completion
of an episode before calculating the return. The return, denoted by GT , is computed
using the following formula:

−t−1
TX
GT = γ k Rt+k+1
k=0

This return is then utilized as the target for value updates:

V (st ) ← V (st ) + α[Gt − V (st )]

2.4 Temporal Difference Learning

“If one had to identify one idea as central and novel to reinforcement learning, it would
undoubtedly be temporal-difference learning.” (Sutton and Barto, 2018).

TD Learning is a combination of dynamic programming and Monte Carlo ideas that es-
timates the quality of a given policy at each time step, think of it as an exam and your
grads are updated after each question, instead of just averaging all grads, returns, at
the end of the exam, an episode, like Monte Carlo. Because we didn’t experience an
entire episode, we don’t have return Gt . Instead, we estimate the return by adding
reward and the discounted value of the next state, γ V (st+1 ):

V (st ) ← V (st ) + α[rt+1 + γV (st+1 ) − V (st )]

Similarly, for action-value estimation:

Q(st , at ) ← Q(st , at ) + α(rt+1 + γQ(st+1 , at+1 ) − Q(st , at ))

8
3 RL Algorithms

Figure 4: A non-exhaustive, but useful taxonomy of algorithms in reinforcement learning

(OpenAI, 2018).

3.1 Q-learning and DQN

I told you before that model-free means that the agent doesn’t know anything about
the environment dynamics or how it works. I also want you remember that a policy
is the agent’s brain; it is the center of decision making. Therefore, when the agent try
to improve this policy by learning from its own action, this is called on-policy. In
contrast, when the agent learns from the actions of the others, this is off-policy. Now,
I can say that Q-learning is a model-free, off-policy reinforcement learning algorithm.
So, rather than learning a value estimate, we learn action-value or the quality of the
action, Qπ (a, s), and to get the expected future reward of starting in state s and taking
action a by using a Temporal Difference (TD) learning approach to optimize the value
function:

Q(st , at ) ← Q(st , at ) + α[Rt+1 + γ max Q(st+1 , a) − Q(st , at )]

If you recall from section 1.3, the agent doesn’t know when to explore and when to
exploit. Therefore, Q-learning often uses an epsilon-greedy policy (ϵ − greeds) in which
the agent exploits by taking the best action according to the agent information 1 − ϵ
times and explore a random action ϵ times. Using the outsourced policy (epsilon-greed
policy), the agent updates the Q-values to determine the best action in each state

9
Figure 5: The problem of temporal difference where one frame or state was not enough to
determine the direction of the ball. So we used three frames instead.

(agent’s policy), aiming for optimal performance. Since the environment is stochastic,
it is possible to receive different rewards from taking the same action in the same state,
so we don’t want to overwrite our previous estimate of Q(a, s). Instead, we can create
a loss function by squaring the TD error:

L(st , at , rt , st+1 ) = (rt + γ max Qπ (at+1 , st+1 ) − Qπ (at , st ))2

at+1

By iteratively optimizing this loss function using gradient-based optimization, we can

refine our Q-value estimates until they converge to their actual expected value.

Deep Q-learning (Mnih et al., 2013) is an advanced form of Q-learning that integrates
deep neural networks with reinforcement learning. Deep Q-network (DQN) uses a deep
neural network to estimate the quality of the action Q(a, s) by treating the state s
as input, and having an output neuron for each possible action K to estimate Q(aˆk,
s). But DQN struggle with temporal limitation where one state in not enough (shown
in Figure 5). Deep Q-learning addresses this by considering multiple future states,
allowing the agent to evaluate actions based on both immediate and future rewards.
Another problem is that the agent sometimes forgets previous lessons, so we first store
all observed (st , at , rt , st+1 ) tuples into an experience replay buffer, and randomly
sample batches from this buffer to calculate the loss.

In the early stages of Q-learning, the Q estimates are based on very few samples which
can be quite noisy and tend to be optimistic about the future rewards. That’s why
we need to have a target Q-network (Van Hasselt et al., 2016) which is initialized
randomly, and slowly updated so the parameters move towards the values of the main

10
Q-network (since we now have two the new, target network and the old, main one).

3.2 Policy Gradients

Unlike value-based methods (like Q-learning), which require evaluating the quality of
each action, policy-based methods use gradient descent to directly improve the policy
based on the gradient of the expected return Gt with respect to the policy parameter θ.
So you don’t need separate value function approximation. Also, the idea behind policy
gradients is simple; increasing the probability of actions that led to high rewards, and
decreasing the probability of actions that led to negative, low rewards.

The goal of policy gradient methods — like any RL technique — is to find policy
parameters that maximize the expected cumulative reward (return). In our case, a
neural network outputs a probability distribution over actions (I know everything is
about this probability distribution over actions). To measure policy performance,
you first need to define an objective function that gives the expected return over a
trajectory based on the policy:

J(θ) = Eπθ [r(τ )] = Eπθ [γ t rt ]

You know what... I can’t dive into the policy gradient theorem (bore me), but I want you
to know that this theorem reformulates the objective so you can estimate its gradient
with no need to differentiate the environment dynamics.

∇θ J(θ) = Eπθ [∇θ log πθ (at |st )r(τ )]

The first advantage of policy gradients algorithm is how it deals with exploration/exploitation
trade-off (described in section 1.3) without the need to tune how often the agent should
explore vs exploit (e.g., using ϵ-greedy in Q-learning). But with policy gradients, you di-
rectly model a stochastic policy that outputs a probability over actions. So the agent
automatically explores different states and trajectories because of random sampling
from the policy distribution each time-step. For example, if your policy outputs a 60%
chance for action 1 and 40% for action 2, the agent will naturally end up trying action 1
more often, but also frequently explore action 2 without any extra code for exploration
vs exploitation.

Also, when two different states appear to be similar, but require different actions or
what we simply call perceptual aliasing. For example, if you are training a self-
driving car and it reaches an intersection. The traffic light may look exactly the same
(green light) in multiple environments. However, in a given scenario with the same

11
green light visual, there may be ongoing cross traffic that requires your car to continue
waiting rather than drive into the intersection. Policy gradient methods give distinct
probabilities of proceeding vs waiting to the exact same traffic light input depending
on the surrounding context.

As you know deep Q-learning learns a value function (judging how good each action
is at every state) which works with a limited set of actions. But in case of your self-
driving car, you have infinite actions (tiny variations in wheel angle, brake pressure,
etc.). Therefore it’s impossible to store a Q-value, reward, for every possible tiny
action because you can’t represent infinite values (or maybe you can but it’s not a good
thing to do anyway). So instead, you can use policy gradients which directly output a
probability distribution over the best actions based on the state. Rather than rating
every individual action choice, they learn a policy that says “for this state steer 30
degrees left with high probability”.

The problem of any gradient is that the algorithm often get trapped in local max-
ima rather than the global best policy (the same problem with optimization in deep
learning). Gradient estimates used for updating the policy tend to have high variance,
causing unstable learning. Actor-critic methods help address this.

3.3 Actor Critic

Actor-Critic methods are a hybrid architecture combining value-based (e.g., Q-learning)

and policy-based (e.g., policy gradients). Therefore, it solves the problem of high
variance in policy gradients and makes our agent train faster and better. Actor-critic
methods have two key components (shown in Figure 6); Actor which aims to choose
actions that will lead to high rewards in the long run, and Critic helps the actor learn
better by providing feedback on the chosen actions. With that said, you can see that we
have two function approximations (i.e., two neural networks); the first one is a policy
function represents the actor while the other one is the value function represents the
critic.

The process (Simonini, 2018) starts at timestep, t, where we get the current state St
from the environment and pass it as input through our Actor and Critic where our
policy takes the state and outputs an action At . The critic takes takes the action as
input as well and computes the quality of the action (Q-value). Since the action is
taken, the environment outputs a reward Rt+1 and a new state St+1 . Now, the actor is
ready to update its policy using the Q-value:

∆θ = α∇θ (log πθ (s, a))q̂w (s, a)

12
Figure 6: Two persons; one is playing a game represents the actor and another saying ‘this
is a really bad move’ represents the critic (Simonini, 2018).

where ∆θ is the change in policy parameters/weights and q̂w (s, a) is the action value
estimate. With that, the Actor produces the next action to take at+1 in the new state
st+1 . The Critic then updates its value parameters:

∆w = β(rt+1 + γ q̂w (st+1 , at+1 ) − q̂w (st , at ))∇w q̂w (st , at )

where β is the value learning rate which is different from α (the policy learning rate),
∇w q̂w (st , at ) is the gradient of our value function, and the rest of the equation is the
TD error.

We can stabilize learning further by directly using the advantage function as Critic
instead of the action-value function (Simonini, 2018). The idea is that the Advantage
function calculates the relative advantage of an action compared to other possible at a
state: how taking that action at a state is better compared to the average value of the
state. It’s subtracting the mean value of the state from the state-action pair:

A(a, s) = Q(a, s) − V (st )

= rt + γV (st+1 ) − V (st )

The advantage function describes how much better the current reward is than what we
expect to get. If we substitute the advantage function into policy gradients, we obtain
the Advantage Actor-Critic (A2C) algorithm (Mnih et al., 2016):

13
∇θ J(θ) = Eπθ [[rt + γV (st+1 ) − V (st ) − V (st )]∇θ log πθ (at |st )]
= Eπθ [A(at , st )∇θ log πθ (at |st )]

Final Words

This piece reviewed the basic concepts in reinforcement learning, there are for sure
many concepts that I was not able to cover in this piece. So, consider the following
resources: (Corcoran, 2023) for Markov processes, (Sutton and Barto, 2018) for RL
concepts, (Simonini, 2018) for some RL implementation, and (Jaques, 2019) for social
learning. At the same time, if you found any error, or you want to give me some
suggestions, feel free to email me: mohamedyosef101@outlook.com.

References

Jem Corcoran. Markov processes, 2023. URL https://www.youtube.com/playlist?list=

PLLyj1Zd4UWrP3rME2XvFvE4Q5vI3H 7 Z.

Natasha Jaques. Social and Affective Machine Learning. PhD dissertation, Mas-
sachusetts Institute of Technology, 2019. URL https://www.media.mit.edu/
publications/social-and-affective-machine-learning/.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou,
Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learn-
ing, December 2013. URL http://arxiv.org/abs/1312.5602.

Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P.
Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous Methods
for Deep Reinforcement Learning, June 2016. URL http://arxiv.org/abs/1602.01783.
arXiv:1602.01783 [cs].

OpenAI. Part 2: kinds of rl algorithms — spinning up documentation, 2018. URL

https://spinningup.openai.com/en/latest/spinningup/rl intro2.html.

David Silver. Model-free prediction, 2015. URL https://youtu.be/PnHCvfgC ZA?si=

zov8IxW dj3870Ki.

Thomas Simonini. Introduction - hugging face deep rl course, 2018. URL https://
huggingface.co/learn/deep-rl-course/unit6/introduction.

14
Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT
Press, second edition edition, November 2018. ISBN 978-0-262-03924-6. URL https:
//mitpress.mit.edu/9780262039246/reinforcement-learning/.

Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with
double q-learning. Proceedings of the AAAI Conference on Artificial Intelligence, 30
(1), March 2016. ISSN 2374-3468, 2159-5399. doi: 10.1609/aaai.v30i1.10295. URL
https://ojs.aaai.org/index.php/AAAI/article/view/10295.

Lilian Weng. A (long) peek into reinforcement learning, 2018. URL https://lilianweng.
github.io/posts/2018-02-19-rl-overview/.

é<Ë@ YÒm'. Õç'

CMPE257 - W10C13 - Reinforcement Learning
No ratings yet
CMPE257 - W10C13 - Reinforcement Learning
161 pages
Ai (It) Unit-5
No ratings yet
Ai (It) Unit-5
43 pages
Essence of Ganesha Mahima
No ratings yet
Essence of Ganesha Mahima
119 pages
Importance of Functional English
No ratings yet
Importance of Functional English
6 pages
Introduction To Reinforcement Learning
100% (1)
Introduction To Reinforcement Learning
52 pages
Reinforcement Learning: Karan Kathpalia
No ratings yet
Reinforcement Learning: Karan Kathpalia
80 pages
A Crash Course On Reinforcement Learning - Felix Wagner
No ratings yet
A Crash Course On Reinforcement Learning - Felix Wagner
84 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
31 pages
Lec17 ReinforcementLearning
No ratings yet
Lec17 ReinforcementLearning
58 pages
AI Unit - 3
No ratings yet
AI Unit - 3
102 pages
Praying Like Jesus - 10 Brief Studies in Prayer Author Pastor Rick Ezell
No ratings yet
Praying Like Jesus - 10 Brief Studies in Prayer Author Pastor Rick Ezell
11 pages
HINDI - PAPER-I To PAPER-IV
No ratings yet
HINDI - PAPER-I To PAPER-IV
9 pages
Unit 6
No ratings yet
Unit 6
34 pages
A (Long) Peek Into Reinforcement Learning - Lil'Log
No ratings yet
A (Long) Peek Into Reinforcement Learning - Lil'Log
23 pages
Unit 4
No ratings yet
Unit 4
56 pages
Sara Reinforcement Learning
No ratings yet
Sara Reinforcement Learning
69 pages
RL & DL Notes
No ratings yet
RL & DL Notes
73 pages
Reinforcement Learning2A
No ratings yet
Reinforcement Learning2A
88 pages
Pass The Architect Board Exam Made Easy
No ratings yet
Pass The Architect Board Exam Made Easy
63 pages
Module 1
No ratings yet
Module 1
72 pages
Sections
No ratings yet
Sections
76 pages
RL & DL Notes
No ratings yet
RL & DL Notes
43 pages
Lec 04 Reinforcement Learning
No ratings yet
Lec 04 Reinforcement Learning
57 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
38 pages
UNIT V Reinforcement Learning
No ratings yet
UNIT V Reinforcement Learning
8 pages
4.3 Reinforcement Learning
No ratings yet
4.3 Reinforcement Learning
27 pages
Chapter 11 - PPT - PPT - Updated
No ratings yet
Chapter 11 - PPT - PPT - Updated
20 pages
IntroductiontoRL BR
No ratings yet
IntroductiontoRL BR
22 pages
Module - 1 - Reinforcement Learning and Markov Decision Process
No ratings yet
Module - 1 - Reinforcement Learning and Markov Decision Process
19 pages
Unit 3
No ratings yet
Unit 3
29 pages
Unit 5 ML
No ratings yet
Unit 5 ML
15 pages
Lecture 9 Reiforcement Learning
No ratings yet
Lecture 9 Reiforcement Learning
29 pages
Reinforcement Learning
100% (1)
Reinforcement Learning
64 pages
SSLC Result 2024 25 Division Wise
No ratings yet
SSLC Result 2024 25 Division Wise
7 pages
7.reinforcement Learning-Introduction-The Learning Task Q-Learning
No ratings yet
7.reinforcement Learning-Introduction-The Learning Task Q-Learning
34 pages
ELC 101 Learning Episode 10 - FINALITY
No ratings yet
ELC 101 Learning Episode 10 - FINALITY
12 pages
Unit Vi
No ratings yet
Unit Vi
17 pages
Unit 1 - Reinforcement Learning, Overfitting, Training, Validation Sets, Metrics, Bias and Variance
No ratings yet
Unit 1 - Reinforcement Learning, Overfitting, Training, Validation Sets, Metrics, Bias and Variance
16 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
17 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
29 pages
ML Unit-4
No ratings yet
ML Unit-4
10 pages
MODULE 2 and 3
No ratings yet
MODULE 2 and 3
37 pages
Lecture Week12
No ratings yet
Lecture Week12
37 pages
Unit-5 ML Notes
No ratings yet
Unit-5 ML Notes
31 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
30 pages
MLT Unit-5 Notes
No ratings yet
MLT Unit-5 Notes
17 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
10 pages
Unit-5 Mla
No ratings yet
Unit-5 Mla
22 pages
37 RL
No ratings yet
37 RL
18 pages
Equity and Succession Apraku Lecture 1 To Three
No ratings yet
Equity and Succession Apraku Lecture 1 To Three
13 pages
Unit-5 (AI)
No ratings yet
Unit-5 (AI)
21 pages
ML Unit-4 - RTU
No ratings yet
ML Unit-4 - RTU
18 pages
Reinforcement Learning-1
No ratings yet
Reinforcement Learning-1
13 pages
Unit 5
No ratings yet
Unit 5
10 pages
Learning Content General Mathematics
No ratings yet
Learning Content General Mathematics
40 pages
Reinforcement Learning: By: Chandra Prakash IIITM Gwalior
No ratings yet
Reinforcement Learning: By: Chandra Prakash IIITM Gwalior
64 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
32 pages
Reinforced Learning
No ratings yet
Reinforced Learning
25 pages
Unit 5
No ratings yet
Unit 5
45 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
6 pages
Marketing Analytics
No ratings yet
Marketing Analytics
3 pages
Unit 5 - Reinforcement Learning
No ratings yet
Unit 5 - Reinforcement Learning
15 pages
SSC Maths
0% (1)
SSC Maths
27 pages
Gillon (2003) Ethics Needs Principles
No ratings yet
Gillon (2003) Ethics Needs Principles
7 pages
Reinforcement Learning: Nguyen Do Van, PHD
No ratings yet
Reinforcement Learning: Nguyen Do Van, PHD
40 pages
Apostrophe Lesson
No ratings yet
Apostrophe Lesson
3 pages
Viegas 2014
No ratings yet
Viegas 2014
10 pages
PTMF Pattern - Social Inequality
No ratings yet
PTMF Pattern - Social Inequality
7 pages
Two Types of Writing - Creative and Letter Writing PDF
No ratings yet
Two Types of Writing - Creative and Letter Writing PDF
3 pages
4.1 Reinforcement Learning 2
No ratings yet
4.1 Reinforcement Learning 2
31 pages
Winning Appliance MIG Invoice
No ratings yet
Winning Appliance MIG Invoice
61 pages
Reinforcement
No ratings yet
Reinforcement
9 pages
Fundamentals of Reinforcement Learning
No ratings yet
Fundamentals of Reinforcement Learning
33 pages
GPU Computing CIS-543: Lecture 10: Streams and Events
No ratings yet
GPU Computing CIS-543: Lecture 10: Streams and Events
23 pages
Markov Decision Process and Reinforcement Learning
No ratings yet
Markov Decision Process and Reinforcement Learning
36 pages
The Kite Runner Essay Good
No ratings yet
The Kite Runner Essay Good
8 pages
A Beginner's Guide To Deep Reinforcement Learning: Skymind - Ai
No ratings yet
A Beginner's Guide To Deep Reinforcement Learning: Skymind - Ai
23 pages
Evolutionary Game Theory and Multi-Agent Reinforcement Learning
No ratings yet
Evolutionary Game Theory and Multi-Agent Reinforcement Learning
26 pages
Perspective: Intelligent Supply Chain Management During Uncertain Times
No ratings yet
Perspective: Intelligent Supply Chain Management During Uncertain Times
8 pages
ENG105 Research Paper
No ratings yet
ENG105 Research Paper
27 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
23 pages
Crim Midterm Reviewer
No ratings yet
Crim Midterm Reviewer
3 pages
Judicial Council of California Civil Jury Instruction 100
No ratings yet
Judicial Council of California Civil Jury Instruction 100
1 page
Undertow Tanka Review Issue 4
No ratings yet
Undertow Tanka Review Issue 4
40 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
32 pages
Ecpe Speaking Notes Part 1 Webinar
100% (3)
Ecpe Speaking Notes Part 1 Webinar
12 pages
Rosa Godines PC Statement
No ratings yet
Rosa Godines PC Statement
2 pages
Four Positive Effects From The Teachings of Sri Sathya Sai Baba
No ratings yet
Four Positive Effects From The Teachings of Sri Sathya Sai Baba
13 pages
Grammer Exercise
100% (1)
Grammer Exercise
4 pages
Reinforcement Learning Explained - A Step-by-Step Guide to Reward-Driven AI
From Everand
Reinforcement Learning Explained - A Step-by-Step Guide to Reward-Driven AI
Luka Nikolic
No ratings yet

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

Reinforcement Learning MY101

Uploaded by

Reinforcement Learning MY101

Uploaded by

Reinforcement Learning: All the

2 Markov Decision Processes 4

Reinforcement learning is different from supervised learning; supervised learning is

Reinforcement learning is also different from unsupervised learning; unsupervised learn-

1.1 Reinforcement Learning Process

1.2 Types of RL tasks

1.3 The Exploration/Exploitation trade-off

1.4 Two approaches for solving RL problems

Policy-based methods (direct approach) focus on directly learning a mapping from

Value-based methods (indirect approach) focus on learning the value of states or

2 Markov Decision Processes

P[st+1 |st ] = P[st+1 |s1 , ..., st ]

2.1 The Bellman Equations

V (s) = E[Gt |st = s]

Q(s, a) = E[rt+1 + γV (st+1 ) | st = s, at = a]

V∗ (s) = max (R(s, a) + γV (s′ ))

Figure 3: Comparison of the backup diagrams of Monte-Carlo, Temporal-Difference learning,

2.2 Dynamic Programming (DP)

Vt+1 (s) = Eπ [r + γVt (s′ )|st = s]

Qπ (s, a) = E[rt+1 + γVπ (st+1 )|st = s, at = a]

evaluation improve evaluation improve evaluation improve evaluation

Qπ (s, π ′ (s)) = Qπ (s, arg max Qπ (s, a))

This return is then utilized as the target for value updates:

V (st ) ← V (st ) + α[Gt − V (st )]

2.4 Temporal Difference Learning

V (st ) ← V (st ) + α[rt+1 + γV (st+1 ) − V (st )]

Similarly, for action-value estimation:

Q(st , at ) ← Q(st , at ) + α(rt+1 + γQ(st+1 , at+1 ) − Q(st , at ))

Figure 4: A non-exhaustive, but useful taxonomy of algorithms in reinforcement learning

3.1 Q-learning and DQN

Q(st , at ) ← Q(st , at ) + α[Rt+1 + γ max Q(st+1 , a) − Q(st , at )]

L(st , at , rt , st+1 ) = (rt + γ max Qπ (at+1 , st+1 ) − Qπ (at , st ))2

By iteratively optimizing this loss function using gradient-based optimization, we can

3.2 Policy Gradients

J(θ) = Eπθ [r(τ )] = Eπθ [γ t rt ]

∇θ J(θ) = Eπθ [∇θ log πθ (at |st )r(τ )]

3.3 Actor Critic

Actor-Critic methods are a hybrid architecture combining value-based (e.g., Q-learning)

∆θ = α∇θ (log πθ (s, a))q̂w (s, a)

∆w = β(rt+1 + γ q̂w (st+1 , at+1 ) − q̂w (st , at ))∇w q̂w (st , at )

A(a, s) = Q(a, s) − V (st )

Jem Corcoran. Markov processes, 2023. URL https://www.youtube.com/playlist?list=

OpenAI. Part 2: kinds of rl algorithms — spinning up documentation, 2018. URL

David Silver. Model-free prediction, 2015. URL https://youtu.be/PnHCvfgC ZA?si=

é<Ë@ YÒm'. Õç'

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

é<Ë@ YÒm'. Õç'