Introduction To Reinforcement Learning
Contributors:
• Damien Ernst (dernst@uliege.be),
• Arthur Louette (arthur.uliege@uliege.be)
February 6, 2024
Outline
Introduction to the reinforcement learning framework
Characterization of RL problems
Goal of reinforcement learning
RL problems with small state-action spaces
Convergence of Q-learning
1/60
Introduction to the reinforcement learning framework
Artificial autonomous intelligent agent: formal definition
Definition (Agent)
An agent is anything that is capable of acting upon information it perceives.
2/60
Definition (Artificial autonomous intelligent agent)
An artificial autonomous intelligent agent is anything we create that is capable
of actions based on information it perceives, its own experience, and its own
decisions about which actions to perform.
3/60
Application of intelligent agents
5/60
Definition (The policy)
The policy of an agent determines the way the agent selects its action based on the information it has. A policy can be either deterministic or stochastic and either stationary or history-dependent.
Where does the intelligence come from? The policies process the information in an “intelligent way” so as to select “good actions”.
6/60
An RL agent interacting with its environment¹
¹ Source: Kaufmann et al. [2023]
7/60
Demo
https://www.youtube.com/watch?v=EtRXay2kqtc
8/60
Some generic difficulties with designing intelligent agents
• Inference problem. The environment dynamics and the mechanism behind the reward signal are (partially) unknown. The policies need to be able to infer “good control actions” from the information the agent has gathered through interaction with the system.
• Exploration/exploitation dilemma.²
² May be seen as a subproblem of the general inference problem. This problem is often referred to in “classical control theory” as the dual control problem.
9/60
The agent has to exploit what it already knows in order to obtain reward, but it
also has to explore in order to make better action selections in the future. The
dilemma is that neither exploration nor exploitation can be pursued exclusively
without failing at the task. The agent must try a variety of actions and
progressively favor those that appear to be best. On a stochastic task, each action must be tried many times to gain a reliable estimate of its expected reward.
10/60
Characterization of RL problems
Different characterizations of RL problems
11/60
• Multi-agent framework versus single-agent framework. In a multi-agent
framework the environment may be itself composed of (intelligent) agents. A
multi-agent framework can often be reduced to a single-agent framework by considering that the internal states of the other agents are unobservable
variables. Game theory and, more particularly, the theory of learning in
games study situations where various intelligent agents interact with each
other.
12/60
Characterization of the RL problem adopted in this class
• System dynamics:
s_{t+1} = f(s_t, a_t, w_t),  t = 0, 1, 2, . . .
where for all t, the state s_t is an element of the state space S, the action a_t is an element of the action space A and the random disturbance w_t is an element of the disturbance space W. The disturbance w_t is generated by the time-invariant conditional probability distribution P_w(·|s, a).
• Reward signal:
To the transition from t to t + 1 is associated a reward signal γ^t r_t = γ^t r(s_t, a_t, w_t), where r(s, a, w) is the so-called reward function, supposed to be bounded from below by 0 and from above by a constant B_r ≥ 0, and where γ ∈ [0, 1[ is a decay factor.
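As an illustration of this formalism, here is a minimal Python sketch for a hypothetical stochastic chain problem; the functions f, r, sample_disturbance and all numerical choices are illustrative assumptions, not material from the slides.

```python
import random

# Hypothetical finite spaces: four states, two actions, a binary disturbance.
S = [1, 2, 3, 4]
A = [-1, 1]
GAMMA = 0.9  # decay factor gamma in [0, 1[

def sample_disturbance(s, a):
    """Draw w ~ P_w(.|s, a); here a time-invariant coin flip ("slip" event)."""
    return random.random() < 0.1

def f(s, a, w):
    """System dynamics s_{t+1} = f(s_t, a_t, w_t)."""
    move = -a if w else a            # the disturbance reverses the move
    return min(max(s + move, 1), 4)  # stay inside the state space

def r(s, a, w):
    """Reward function bounded by 0 <= r <= B_r (here B_r = 1)."""
    return 1.0 if (s >= 3 and a == 1 and not w) else 0.0

# One interaction step from state s = 2 with action a = 1:
s, a = 2, 1
w = sample_disturbance(s, a)
print("next state:", f(s, a, w), "reward:", r(s, a, w))
```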
13/60
Definition (Cumulative reward signal)
Let h_t ∈ H be the trajectory from instant time 0 to t in the combined state, action, reward spaces: h_t = (s_0, a_0, r_0, s_1, a_1, r_1, . . . , a_{t−1}, r_{t−1}, s_t). Let π ∈ Π be a stochastic policy such that π : H × A → [0, 1] and let us denote by J^π(s) the expected return of a policy π (or expected cumulative reward signal) when the system starts from s_0 = s:
J^π(s) = lim_{T→∞} E[ Σ_{t=0}^{T} γ^t r(s_t, a_t ∼ π(h_t, ·), w_t) | s_0 = s ]
• Information available:
The agent does not know f , r and Pw . The only information it has on these
three elements is the information contained in ht .
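In practice, the expected return can be approximated by truncating the sum at a finite horizon T and averaging over simulated trajectories. A minimal sketch, assuming the same illustrative chain dynamics as above and a uniformly random stationary policy (all names and constants are assumptions):

```python
import random

GAMMA, T, N_RUNS = 0.9, 200, 1000   # decay factor, truncation horizon, number of trajectories

def f(s, a, w):                     # illustrative dynamics on states {1, ..., 4}
    return min(max(s + (-a if w else a), 1), 4)

def r(s, a, w):                     # illustrative bounded reward, B_r = 1
    return 1.0 if (s >= 3 and a == 1 and not w) else 0.0

def pi(h):                          # a uniformly random stationary policy pi(h_t, .)
    return random.choice([-1, 1])

def estimate_J(s0):
    """Monte Carlo estimate of J^pi(s0) = lim_T E[sum_t gamma^t r_t | s_0 = s0]."""
    total = 0.0
    for _ in range(N_RUNS):
        s, ret = s0, 0.0
        for t in range(T):
            a = pi(None)
            w = random.random() < 0.1
            ret += GAMMA ** t * r(s, a, w)
            s = f(s, a, w)
        total += ret
    return total / N_RUNS

print(estimate_J(1))
```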
14/60
Exercise 1: Computation of the cumulative reward signal
15/60
Exercise 1: Solution
Solution:
= lim_{T→∞} [ γ^2 + . . . + γ^T ]
= γ^2 lim_{T→∞} [ 1 + . . . + γ^{T−2} ]
= γ^2 / (1 − γ)

Reminder: lim_{T→∞} Σ_{t=0}^{T} γ^t = 1 / (1 − γ)
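A quick numerical check of this closed form (γ = 0.9 is an arbitrary choice):

```python
gamma, T = 0.9, 10_000
truncated = sum(gamma ** t for t in range(2, T + 1))  # gamma^2 + ... + gamma^T
print(truncated, gamma ** 2 / (1 - gamma))            # both are approximately 8.1
```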
16/60
Goal of reinforcement learning
Goal of reinforcement learning
Find a policy π^* ∈ Π such that
J^{π^*}(s) = max_{π∈Π} J^π(s)   ∀s ∈ S   (1)
³ We will suppose that these mild assumptions are always satisfied afterwards.
17/60
Dynamic Programming (DP) theory reminder: optimality of stationary policies
Let µ^* be a policy such that J^{µ^*}(s) = max_{µ∈Π_µ} J^µ(s) everywhere on S. We name such a policy an optimal deterministic stationary policy.
Let π^* denote the optimal history-dependent policy. From classical dynamic programming theory, we know that J^{µ^*}(s) = J^{π^*}(s) everywhere.
⇒ considering only stationary policies is not suboptimal!
18/60
Truncated return J_N^µ
Definition (Truncated return J_N^µ)
We define the functions J_N^µ : S → R by the recurrence equation
J_N^µ(s) = E_{w∼P_w(·|s,µ(s))}[ r(s, µ(s), w) + γ J_{N−1}^µ(f(s, µ(s), w)) ],  ∀N ≥ 1,   (3)
with J_0^µ(s) ≡ 0.
Using (5) and (6), we can show that there exists a bound on ∥J^µ − J_N^µ∥_∞:
∥J^µ − J_N^µ∥_∞ ≤ (γ^N / (1 − γ)) B_r.   (7)
Indeed,
∥J^µ − J_N^µ∥_∞ ≤ γ^N ∥J^µ∥_∞ ≤ (γ^N / (1 − γ)) B_r.
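The recurrence translates directly into code. A sketch for a finite state space with deterministic, disturbance-free dynamics (the model functions and the policy are illustrative assumptions):

```python
GAMMA = 0.9
S = [1, 2, 3, 4]

def f(s, a):                 # illustrative deterministic dynamics
    return min(max(s + a, 1), 4)

def r(s, a):                 # illustrative reward, bounded by B_r = 1
    return 1.0 if s >= 3 and a == 1 else 0.0

def mu(s):                   # the stationary policy to evaluate
    return 1

def J_N(N):
    """Truncated return J_N^mu computed by the recurrence, starting from J_0^mu = 0."""
    J = {s: 0.0 for s in S}
    for _ in range(N):
        J = {s: r(s, mu(s)) + GAMMA * J[f(s, mu(s))] for s in S}
    return J

print(J_N(4))   # e.g. J_4^mu(1) = gamma^2 + gamma^3 under these assumptions
```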
20/60
Exercise 2: Computation of the J_N^µ function
Compute J_4^µ(1), J_4^µ(2), J_4^µ(3) and J_4^µ(4) with the policy µ(s) = 1, ∀s ∈ S.
The state space and action space are respectively defined by:
21/60
Exercise 2: Solution
Solution:
J_N^µ(s) = r(s, µ(s)) + γ J_{N−1}^µ(f(s, µ(s))),   J_0^µ(s) ≡ 0
(the dynamics are deterministic here, so the expectation over w disappears).
22/60
DP theory reminder: Q_N-functions and Bellman equation
Definition (Q-function)
We define the Q-function as being the unique solution of the Bellman equation:
Q(s, a) = E_{w∼P_w(·|s,a)}[ r(s, a, w) + γ max_{a′∈A} Q(f(s, a, w), a′) ].
The Q_N-functions are defined by the recurrence
Q_N(s, a) = E_{w∼P_w(·|s,a)}[ r(s, a, w) + γ max_{a′∈A} Q_{N−1}(f(s, a, w), a′) ],  ∀N ≥ 1,   (8)
with Q_0(s, a) ≡ 0.
Theorem (Convergence of Q_N)
The sequence of functions Q_N converges to the Q-function in the infinity norm, i.e. lim_{N→∞} ∥Q_N − Q∥_∞ = 0.
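When the model is known, the Q_N sequence can be computed iteratively and the convergence monitored through ∥Q_N − Q_{N−1}∥_∞. A sketch with illustrative deterministic dynamics (all choices are assumptions):

```python
GAMMA = 0.9
S, A = [1, 2, 3, 4], [-1, 1]

def f(s, a):   # illustrative deterministic dynamics
    return min(max(s + a, 1), 4)

def r(s, a):   # illustrative bounded reward
    return 1.0 if s >= 3 and a == 1 else 0.0

Q = {(s, a): 0.0 for s in S for a in A}      # Q_0 = 0
for N in range(1, 1000):
    Q_new = {(s, a): r(s, a) + GAMMA * max(Q[(f(s, a), b)] for b in A)
             for s in S for a in A}
    gap = max(abs(Q_new[k] - Q[k]) for k in Q)   # ||Q_N - Q_{N-1}||_inf
    Q = Q_new
    if gap < 1e-6:                               # Q_N is now close to Q
        break

print(N, {s: max(Q[(s, a)] for a in A) for s in S})   # iterations used, value function
```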
23/60
Exercise 3: Computation of the Q_N function
24/60
Exercise 3: Solution
Solution:
N            1      2          3                4                       5
Q_N(1, −1)   0      0          0                γ^3                     γ^3 + γ^4
Q_N(1, 1)    0      0          γ^2              γ^2 + γ^3               γ^2 + γ^3 + γ^4
Q_N(2, −1)   0      0          0                γ^3                     γ^3 + γ^4
Q_N(2, 1)    0      γ          γ + γ^2          γ + γ^2 + γ^3           γ + γ^2 + γ^3 + γ^4
Q_N(3, −1)   0      0          γ^2              γ^2 + γ^3               γ^2 + γ^3 + γ^4
Q_N(3, 1)    1      1 + γ      1 + γ + γ^2      1 + γ + γ^2 + γ^3       1 + γ + γ^2 + γ^3 + γ^4
Q_N(4, −1)   0      γ          γ + γ^2          γ + γ^2 + γ^3           γ + γ^2 + γ^3 + γ^4
Q_N(4, 1)    1      1 + γ      1 + γ + γ^2      1 + γ + γ^2 + γ^3       1 + γ + γ^2 + γ^3 + γ^4
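The table can be checked programmatically if one assumes the underlying problem is the deterministic four-state chain suggested by the entries (action +1 moves right, −1 moves left, reward 1 when playing +1 in states 3 or 4); this reconstruction of the exercise's MDP is an assumption, not taken from the missing figure.

```python
GAMMA = 0.9                      # any value in [0, 1[ works for the check
S, A = [1, 2, 3, 4], [-1, 1]
f = lambda s, a: min(max(s + a, 1), 4)                 # assumed chain dynamics
r = lambda s, a: 1.0 if (s >= 3 and a == 1) else 0.0   # assumed reward function

Q = {(s, a): 0.0 for s in S for a in A}                # Q_0 = 0
for N in range(1, 6):
    Q = {(s, a): r(s, a) + GAMMA * max(Q[(f(s, a), b)] for b in A)
         for s in S for a in A}

# Q_5(3, 1) should equal 1 + gamma + gamma^2 + gamma^3 + gamma^4
print(Q[(3, 1)], sum(GAMMA ** k for k in range(5)))
```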
25/60
Optimal stationary policy
An optimal stationary policy satisfies µ^*(s) ∈ arg max_{a∈A} Q(s, a), ∀s ∈ S.
We also have J^{µ^*}(s) = max_{a∈A} Q(s, a), which is also called the value function.
26/60
Bound on the suboptimality of J^{µ^*_N}
Theorem (Bound on the suboptimality of J^{µ^*_N})
27/60
RL problems with small state-action spaces
A pragmatic model-based approach for designing good policies π̂ ∗
[Diagram: from the information it has gathered, the agent selects an action a according to this randomized policy.]
28/60
2. Resolution of the optimization problem.
Find in Π_µ the policy µ̂^* such that ∀s ∈ S, Ĵ^{µ̂^*}(s) = max_{µ∈Π_µ} Ĵ^µ(s),
where Ĵ^µ is defined similarly to the function J^µ but with f̂, P̂_w and r̂ replacing f, P_w and r, respectively.
⁴ We will not address further the design of the ‘right function’ ε : H → [0, 1]. In many applications, it is chosen equal to a small constant (say, 0.05) everywhere.
29/60
Some constructive algorithms for designing π̂ ∗ when dealing with finite
state-action spaces
• Unless said otherwise, we consider the particular case of finite state and action spaces (i.e., S × A finite).
• We focus first on approaches that solve Step 1 and Step 2 separately, and then on approaches that solve both steps together.
30/60
Reminder on Markov Decision Processes
• p(s′ |s, a) gives the probability of reaching state s′ after taking action a while
being in state s.
• We consider MDPs for which we want to find decision policies that maximize the expected sum of discounted rewards Σ_{t=0}^{∞} γ^t r(s_t, a_t) over an infinite time horizon.
31/60
MDP Structure Definition from the System Dynamics and Reward
Function
• We define
r(s, a) = E_{w∼P_w(·|s,a)}[ r(s, a, w) ]   (14)
p(s′|s, a) = E_{w∼P_w(·|s,a)}[ I_{{s′=f(s,a,w)}} ]   (15)
• Equations (14) and (15) define the structure of an equivalent MDP in the sense that the expected return of any policy applied to the original optimal control problem is equal to its expected return for the MDP.
32/60
Reminder: Random Variable and Strong Law of Large Numbers
• A random variable is not a variable but rather a function that maps outcomes
(of an experiment) to numbers. Mathematically, a random variable is defined as
a measurable function from a probability space to some measurable space. We
consider here random variables θ defined on the probability space (Ω, P).⁶
⁶ For the sake of simplicity, we have considered here that (Ω, P) indeed defines a probability space, which is not rigorous.
33/60
Step 1. Identification by learning the structure of the equivalent MDP
• The objective is to infer some ‘good approximations’ of p(s′|s, a) and r(s, a) from the information contained in h_t. We denote by X(s, a) ⊆ {0, 1, . . . , t − 1} the set of time indices k such that (s_k, a_k) = (s, a).⁷
⁷ If X is a set of elements, #X denotes the cardinality of X.
34/60
Estimation of p(s′|s, a):
The values I_{{s′=s_{k_1+1}}}, I_{{s′=s_{k_2+1}}}, . . . , I_{{s′=s_{k_{#X(s,a)}+1}}} are #X(s, a) values of the random variable I_{{s′=f(s,a,w)}} which are drawn independently. To estimate its mean value p(s′|s, a), we can use the unbiased estimator:
p̂(s′|s, a) = ( Σ_{k∈X(s,a)} I_{{s_{k+1}=s′}} ) / #X(s, a)   (18)
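A sketch of this kind of estimator, computed from a list of observed one-step transitions (s_k, a_k, r_k, s_{k+1}); the toy data, the dictionaries and the companion sample-mean estimator r̂(s, a) are illustrative assumptions:

```python
from collections import defaultdict

# Observed one-step transitions (s_k, a_k, r_k, s_{k+1}); illustrative toy data.
transitions = [(1, 1, 0.0, 2), (2, 1, 0.0, 3), (3, 1, 1.0, 4),
               (1, 1, 0.0, 2), (1, 1, 0.0, 1)]

counts = defaultdict(int)        # #X(s, a)
next_counts = defaultdict(int)   # number of times s' followed (s, a)
reward_sums = defaultdict(float)

for s, a, rew, s_next in transitions:
    counts[(s, a)] += 1
    next_counts[(s, a, s_next)] += 1
    reward_sums[(s, a)] += rew

def p_hat(s_prime, s, a):
    """Unbiased estimator (18) of p(s'|s, a)."""
    return next_counts[(s, a, s_prime)] / counts[(s, a)]

def r_hat(s, a):
    """Sample-mean estimator of r(s, a)."""
    return reward_sums[(s, a)] / counts[(s, a)]

print(p_hat(2, 1, 1), r_hat(3, 1))   # 2/3 and 1.0 on this toy data
```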
35/60
Step 2. Computation of µ̂^* from the learned structure of the equivalent MDP
• We compute the Q̂_N-functions from the knowledge of r̂ and p̂ by exploiting the recurrence equation:
Q̂_N(s, a) = r̂(s, a) + γ Σ_{s′∈S} p̂(s′|s, a) max_{a′∈A} Q̂_{N−1}(s′, a′),   Q̂_0(s, a) ≡ 0,
and take µ̂^*_N(s) ∈ arg max_{a∈A} Q̂_N(s, a) as approximation of the optimal policy, with N ‘large enough’ (e.g., the right-hand side of inequality (13) drops below ε).
• One can show that if the estimated MDP structure lies in an ‘ε-neighborhood’ of the true structure, then J^{µ̂^*} is in an ‘O(ε)-neighborhood’ of J^{µ^*}, where µ̂^*(s) = lim_{N→∞} arg max_{a∈A} Q̂_N(s, a).
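A sketch of this recurrence on an estimated MDP, followed by extraction of the greedy policy; the p̂ and r̂ dictionaries below are hypothetical outputs of Step 1, and N_MAX plays the role of N ‘large enough’:

```python
GAMMA, N_MAX = 0.9, 200
S, A = [1, 2, 3, 4], [-1, 1]

# Hypothetical estimated MDP structure (it would come out of Step 1).
p_hat = {(s, a, min(max(s + a, 1), 4)): 1.0 for s in S for a in A}
r_hat = {(s, a): (1.0 if s >= 3 and a == 1 else 0.0) for s in S for a in A}

Q = {(s, a): 0.0 for s in S for a in A}                # Q_hat_0 = 0
for _ in range(N_MAX):                                 # N 'large enough'
    Q = {(s, a): r_hat[(s, a)]
                 + GAMMA * sum(p_hat.get((s, a, s2), 0.0)
                               * max(Q[(s2, b)] for b in A) for s2 in S)
         for s in S for a in A}

mu_hat = {s: max(A, key=lambda a: Q[(s, a)]) for s in S}   # greedy policy mu_hat*
print(mu_hat)
```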
36/60
The Case of Limited Computational Resources
37/60
The Q-learning Algorithm
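A minimal sketch of tabular Q-learning, consistent with the refresh rule analyzed in the convergence section below; the environment function, step size, ε-greedy exploration constant and restart scheme are illustrative assumptions rather than material from the slides.

```python
import random

GAMMA, ALPHA, EPS, STEPS = 0.9, 0.1, 0.05, 20_000
S, A = [1, 2, 3, 4], [-1, 1]

def step(s, a):
    """Illustrative environment: returns (reward, next state)."""
    s_next = min(max(s + a, 1), 4)
    return (1.0 if s >= 3 and a == 1 else 0.0), s_next

Q = {(s, a): 0.0 for s in S for a in A}
s = random.choice(S)
for t in range(STEPS):
    if t % 100 == 0:            # occasional restarts so every state keeps being visited
        s = random.choice(S)
    # epsilon-greedy action selection
    if random.random() < EPS:
        a = random.choice(A)
    else:
        a = max(A, key=lambda b: Q[(s, b)])
    rew, s_next = step(s, a)
    # Q-learning refresh: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    Q[(s, a)] += ALPHA * (rew + GAMMA * max(Q[(s_next, b)] for b in A) - Q[(s, a)])
    s = s_next

print({state: max(A, key=lambda b: Q[(state, b)]) for state in S})   # learned greedy policy
```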
38/60
Q-learning: some remarks
• Up to now, we have considered problems having discrete (and not too large)
state and action spaces ⇒ µ̂∗ and the Q̂N -functions could be represented in a
tabular form.
• We consider now the case of very large or infinite state-action spaces: function approximators need to be used to represent µ̂^* and the Q̂_N-functions.
• These function approximators need to be used in a way that they are able to generalize well, over the whole state-action space, the information contained in h_t.
• There is a vast literature on function approximators in reinforcement learning.
We focus first on one algorithm named ‘fitted Q iteration’ which computes the
functions Q̂N from ht by solving a sequence of batch mode supervised learning
problems.
40/60
Reminder: Batch mode supervised learning
41/60
• Typical supervised learning methods are: kernel-based methods, (deep) neural
networks, tree-based methods.
[Figure: a batch mode supervised learning algorithm applied to a dataset produces a decision tree predicting ‘Survived’/‘Died’ from the attributes Gender, Age and Class.]
Iteration 1: the algorithm outputs a model Q̂_1 of Q_1(s, a) = E_{w∼P_w(·|s,a)}[ r(s, a, w) ] by running a SL algorithm on the training set:
TS = {((s_k, a_k), r_k)}_{k=0}^{t−1}   (21)
43/60
Iteration N > 1: the algorithm outputs a model Q̂_N of
Q_N(s, a) = E_{w∼P_w(·|s,a)}[ r(s, a, w) + γ max_{a′∈A} Q_{N−1}(f(s, a, w), a′) ]
by running a SL algorithm on the training set:
TS = {((s_k, a_k), r_k + γ max_{a′∈A} Q̂_{N−1}(s_{k+1}, a′))}_{k=0}^{t−1}
• The algorithm stops when N is ‘large enough’ and µ̂^*_N(s) ∈ arg max_{a∈A} Q̂_N(s, a) is taken as approximation of µ^*(s).
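A sketch of the fitted Q iteration loop, using an ensemble of regression trees (scikit-learn's ExtraTreesRegressor) as the batch mode supervised learner; the synthetic transition data and all hyperparameters are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

GAMMA, N_ITER = 0.9, 20
A = [-1, 1]

# One-step transitions (s_k, a_k, r_k, s_{k+1}); illustrative random data on a small chain.
rng = np.random.default_rng(0)
s = rng.integers(1, 5, size=500)
a = rng.choice(A, size=500)
s_next = np.clip(s + a, 1, 4)
r = ((s >= 3) & (a == 1)).astype(float)

X = np.column_stack([s, a])                            # inputs (s_k, a_k)
model = None
for _ in range(N_ITER):
    if model is None:
        y = r                                          # iteration 1: targets are r_k
    else:                                              # iteration N > 1: targets are
        q_next = np.column_stack(                      # r_k + gamma * max_a' Q_{N-1}(s_{k+1}, a')
            [model.predict(np.column_stack([s_next, np.full_like(s_next, b)])) for b in A])
        y = r + GAMMA * q_next.max(axis=1)
    model = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X, y)

# Greedy policy extracted from the last model
for state in [1, 2, 3, 4]:
    q_vals = [model.predict([[state, b]])[0] for b in A]
    print(state, A[int(np.argmax(q_vals))])
```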
44/60
The fitted Q iteration algorithm: some remarks
• Excellent and stable performances have been observed when the algorithm is combined with supervised learning methods based on ensembles of regression trees and, of course, with deep neural nets, especially when images are used as input.
• The fitted Q iteration algorithm can be used with any set of one-step system transitions (s_t, a_t, r_t, s_{t+1}), where each one-step system transition gives information about: a state, the action taken while being in this state, the reward signal observed and the next state reached.
45/60
Computation of µ̂^*: from an inference problem to a problem of computational complexity
• When having at one's disposal only a few one-step system transitions, the main problem is a problem of inference.
• When many one-step system transitions are available, the main problem becomes a problem of computational complexity.
• Should we rely on algorithms having weaker inference capabilities than the ‘fitted Q iteration algorithm’ but which are also less computationally demanding, so as to mitigate this problem of computational complexity? ⇒ Open research question.
46/60
• There is a serious problem plaguing every reinforcement learning algorithm known as the curse of dimensionality⁸: whatever the mechanism behind the
generation of the trajectories and without any restrictive assumptions on
f (s, a, w), r(s, a, w), S and A, the number of computer operations required to
determine (close-to-) optimal policies tends to grow exponentially with the
dimensionality of S × A.
⁸ A term introduced by Richard Bellman (the founder of the DP theory) in the fifties.
47/60
Q-learning with parametric function approximators
1. Equation (20) provides us with a desired update for Q̂(s_t, a_t, θ) after observing (s_t, a_t, r_t, s_{t+1}), here:
δ(s_t, a_t) = r_t + γ max_{a∈A} Q̂(s_{t+1}, a, θ) − Q̂(s_t, a_t, θ).
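A sketch of this update for a linear parametric approximator Q̂(s, a, θ) = θᵀφ(s, a), where one then applies θ ← θ + α δ(s_t, a_t) ∇_θ Q̂(s_t, a_t, θ); the feature map and the constants are illustrative assumptions:

```python
import numpy as np

GAMMA, ALPHA = 0.9, 0.05
A = [-1, 1]

def phi(s, a):
    """Illustrative feature map for the linear approximator Q(s, a, theta) = theta . phi(s, a)."""
    return np.array([1.0, s, a, s * a], dtype=float)

def q(s, a, theta):
    return theta @ phi(s, a)

def update(theta, s, a, rew, s_next):
    """One parametric Q-learning step after observing (s_t, a_t, r_t, s_{t+1})."""
    delta = rew + GAMMA * max(q(s_next, b, theta) for b in A) - q(s, a, theta)
    # for a linear approximator, grad_theta Q(s, a, theta) = phi(s, a)
    return theta + ALPHA * delta * phi(s, a)

theta = np.zeros(4)
theta = update(theta, 2, 1, 0.0, 3)
print(theta)
```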
48/60
Convergence of Q-learning
Contraction mapping
Let B(E) be the set of all bounded real-valued functions defined on an arbitrary set E. With every function R : E → R that belongs to B(E), we associate the scalar:
∥R∥_∞ = sup_{e∈E} |R(e)|.
A mapping G : B(E) → B(E) is said to be a contraction mapping if there exists a scalar ρ < 1 such that ∥GR − GR′∥_∞ ≤ ρ ∥R − R′∥_∞, ∀R, R′ ∈ B(E).
49/60
Fixed point
R^* ∈ B(E) is a fixed point of a mapping G : B(E) → B(E) if
GR^* = R^*.   (25)
If G is a contraction mapping, it admits a unique fixed point R^*, and lim_{k→∞} G^k R = R^* for any R ∈ B(E).
50/60
Algorithmic models for computing a fixed point
All elements of R are refreshed: suppose we have the algorithm that updates R at stage k (k ≥ 0) as follows:
R ← GR. (27)
51/60
One element of R is refreshed and noise introduction: let η ∈ R be a noise factor and α ∈ R. Suppose we have the algorithm that selects at stage k (k ≥ 0) an element e ∈ E and updates R(e) according to:
R(e) ← (1 − α)R(e) + α((GR)(e) + η).
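To illustrate this algorithmic model, a sketch with a simple scalar contraction mapping G (so that E contains a single element) and zero-mean bounded noise; with step sizes α_k = 1/k the iterates approach the fixed point (all choices here are illustrative):

```python
import random

def G(x):
    """A contraction mapping on R with factor 0.5; its fixed point is x* = 2."""
    return 0.5 * x + 1.0

x = 0.0
for k in range(1, 200_001):
    alpha = 1.0 / k                      # decreasing step sizes
    eta = random.uniform(-1.0, 1.0)      # zero-mean bounded noise
    x = (1 - alpha) * x + alpha * (G(x) + eta)

print(x)   # close to the fixed point 2.0
```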
52/60
We define the history Fk of the algorithm at stage k as being:
53/60
The Q-function as a fixed point of a contraction mapping
• We define the mapping H : B(S × A) → B(S × A) such that
(HK)(s, a) = E_{w∼P_w(·|s,a)}[ r(s, a, w) + γ max_{a′∈A} K(f(s, a, w), a′) ],   ∀(s, a) ∈ S × A.
• The Q-function, as the unique solution of the Bellman equation, is a fixed point of H: Q = HQ.
• The recurrence equation (8) for computing the Q_N-functions can be rewritten as Q_N = H Q_{N−1}, ∀N > 1, with Q_0(s, a) ≡ 0.
54/60
H is a contraction mapping
∥HK − HK′∥_∞ ≤ γ ∥K − K′∥_∞,   ∀K, K′ ∈ B(S × A).
⁹ We make the additional assumption here that the rewards are strictly positive.
55/60
Q-learning convergence proof
¹⁰ The element (s_k, a_k, r_k, s_{k+1}) used to refresh the Q-function at iteration k of the Q-learning algorithm is “replaced” here by (s_k, a_k, r(s_k, a_k, w_k), f(s_k, a_k, w_k)).
56/60
By using the H mapping definition (equation (35)), equation (37) can be rewritten as follows:
Q_{k+1}(s_k, a_k) = (1 − α_k(s_k, a_k)) Q_k(s_k, a_k) + α_k(s_k, a_k) ( (HQ_k)(s_k, a_k) + η_k )
with
η_k = r(s_k, a_k, w_k) + γ max_{a′∈A} Q_k(f(s_k, a_k, w_k), a′) − (HQ_k)(s_k, a_k),
which has exactly the same form as equation (30) (Q_k corresponding to R_k, H to G, (s_k, a_k) to e_k and S × A to E).
57/60
We know that H is a contraction mapping. If the α_k(s_k, a_k) terms satisfy expression (34), we still have to verify that η_k satisfies expressions (32) and (33). For expression (32), by the very definition of H we have
E[η_k | F_k] = E[ r(s_k, a_k, w_k) + γ max_{a′∈A} Q_k(f(s_k, a_k, w_k), a′) | F_k ] − (HQ_k)(s_k, a_k) = 0.
58/60
In order to prove that expression (33) is satisfied, one can first note that:
By noting that
8 B_r γ max_{(s,a)∈S×A} Q_k(s, a) < 8 B_r γ + 8 B_r γ ( max_{(s,a)∈S×A} Q_k(s, a) )^2   (42)
59/60
References and additional readings
References
Pascal Leroy, Pablo G. Morato, Jonathan Pisane, Athanasios Kolios, and Damien Ernst. IMP-MARL: a suite of environments for large-scale infrastructure management planning via MARL, 2023.
Elia Kaufmann, Leonard Bauersfeld, Antonio Loquercio, Matthias Mueller, Vladlen Koltun, and Davide Scaramuzza. Champion-level drone racing using deep reinforcement learning. Nature, 620:982–987, 08 2023.
Lucian Busoniu, Robert Babuska, Bart De Schutter, and Damien Ernst. Reinforcement learning and dynamic programming using function approximators. CRC Press, 2017.
Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. MIT Press, 2018.
Dimitri Bertsekas. Dynamic programming and optimal control: Volume I, volume 4. Athena Scientific, 2012.
Csaba Szepesvári. Algorithms for reinforcement learning. Springer Nature, 2022.
60/60