
Algorithm Notations

1. Reinforcement learning involves learning optimal policies through interactions with an environment. The goal is to maximize the rewards received over time. Dynamic programming and Monte Carlo methods are commonly used to estimate state/action values and learn optimal policies.
2. Dynamic programming methods such as value iteration and policy iteration use the Bellman equations to iteratively estimate values and improve policies. Monte Carlo methods estimate values by averaging returns from sample episodes generated by following a policy.
3. The Bellman equations relate the value of a state/action to expected rewards plus discounted future values, providing a recursive definition used by dynamic programming algorithms to estimate optimal values and policies.


REINFORCEMENT LEARNING

1. The Problem
St state at time t
At action at time t
Rt reward at time t
γ discount rate (where 0 ≤ γ ≤ 1)
Gt discounted return at time t (∑_{k=0}^∞ γ^k Rt+k+1)
S set of all nonterminal states
S+ set of all states (including terminal states)
A set of all actions
A(s) set of all actions available in state s
R set of all rewards
p(s′, r|s, a) probability of next state s′ and reward r, given current state s and current action a (P(St+1 = s′, Rt+1 = r|St = s, At = a))
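
As a purely illustrative aid (not part of the original notes), the one-step dynamics p(s′, r|s, a) of a small MDP can be stored as a nested Python dictionary, with dynamics[s][a] holding a list of (probability, next state, reward) triples; the states, actions, and rewards below are invented for the example.

# Hypothetical encoding of the one-step dynamics p(s', r | s, a):
# dynamics[s][a] is a list of (probability, next_state, reward) triples,
# and terminal states appear as keys with no available actions.
dynamics = {
    "s0": {
        "left":  [(1.0, "s0", 0.0)],
        "right": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    },
    "s1": {
        "left":  [(1.0, "s0", 0.0)],
        "right": [(1.0, "terminal", 10.0)],
    },
    "terminal": {},  # terminal state: no actions, value stays 0
}
gamma = 0.9  # discount rate, 0 <= gamma <= 1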

2. The Solution
π policy
if deterministic: π(s) ∈ A(s) for all s ∈ S
if stochastic: π(a|s) = P(At = a|St = s) for all s ∈ S and a ∈ A(s)
vπ state-value function for policy π (vπ(s) ≐ E[Gt |St = s] for all s ∈ S)
qπ action-value function for policy π (qπ(s, a) ≐ E[Gt |St = s, At = a] for all s ∈ S and a ∈ A(s))
v∗ optimal state-value function (v∗(s) ≐ maxπ vπ(s) for all s ∈ S)
q∗ optimal action-value function (q∗(s, a) ≐ maxπ qπ(s, a) for all s ∈ S and a ∈ A(s))

3. Bellman Equations
3.1. Bellman Expectation Equations.
vπ(s) = ∑_{a∈A(s)} π(a|s) ∑_{s′∈S,r∈R} p(s′, r|s, a)(r + γvπ(s′))

qπ(s, a) = ∑_{s′∈S,r∈R} p(s′, r|s, a)(r + γ ∑_{a′∈A(s′)} π(a′|s′)qπ(s′, a′))
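
A minimal sketch of a single Bellman expectation backup for vπ, assuming the hypothetical dictionary encoding of p(s′, r|s, a) shown after Section 1 (dynamics[s][a] as (probability, next state, reward) triples) and a stochastic policy stored as policy[s][a] = π(a|s); these names are assumptions, not part of the original notes.

def bellman_expectation_backup(s, policy, dynamics, V, gamma):
    """One application of the Bellman expectation equation for v_pi at state s.

    policy[s][a] is pi(a|s); dynamics[s][a] is a list of (probability,
    next_state, reward) triples; V maps states to current value estimates.
    """
    return sum(
        policy[s][a] * prob * (reward + gamma * V[s_next])
        for a in dynamics[s]
        for (prob, s_next, reward) in dynamics[s][a]
    )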

3.2. Bellman Optimality Equations.


v∗(s) = max_{a∈A(s)} ∑_{s′∈S,r∈R} p(s′, r|s, a)(r + γv∗(s′))

q∗(s, a) = ∑_{s′∈S,r∈R} p(s′, r|s, a)(r + γ max_{a′∈A(s′)} q∗(s′, a′))

3.3. Useful Formulas for Deriving the Bellman Equations.


vπ(s) = ∑_{a∈A(s)} π(a|s)qπ(s, a)

v∗(s) = max_{a∈A(s)} q∗(s, a)

qπ(s, a) = ∑_{s′∈S,r∈R} p(s′, r|s, a)(r + γvπ(s′))

q∗(s, a) = ∑_{s′∈S,r∈R} p(s′, r|s, a)(r + γv∗(s′))
qπ(s, a) ≐ Eπ[Gt |St = s, At = a]                                                                            (1)
        = ∑_{s′∈S,r∈R} P(St+1 = s′, Rt+1 = r|St = s, At = a) Eπ[Gt |St = s, At = a, St+1 = s′, Rt+1 = r]     (2)
        = ∑_{s′∈S,r∈R} p(s′, r|s, a) Eπ[Gt |St = s, At = a, St+1 = s′, Rt+1 = r]                             (3)
        = ∑_{s′∈S,r∈R} p(s′, r|s, a) Eπ[Gt |St+1 = s′, Rt+1 = r]                                             (4)
        = ∑_{s′∈S,r∈R} p(s′, r|s, a) Eπ[Rt+1 + γGt+1 |St+1 = s′, Rt+1 = r]                                   (5)
        = ∑_{s′∈S,r∈R} p(s′, r|s, a) (r + γEπ[Gt+1 |St+1 = s′])                                              (6)
        = ∑_{s′∈S,r∈R} p(s′, r|s, a) (r + γvπ(s′))                                                           (7)

The reasoning for the above is as follows:

• (1) by definition (qπ(s, a) ≐ Eπ[Gt |St = s, At = a])
• (2) Law of Total Expectation
• (3) by definition (p(s′, r|s, a) ≐ P(St+1 = s′, Rt+1 = r|St = s, At = a))
• (4) Eπ[Gt |St = s, At = a, St+1 = s′, Rt+1 = r] = Eπ[Gt |St+1 = s′, Rt+1 = r], by the Markov property
• (5) Gt = Rt+1 + γGt+1
• (6) Linearity of Expectation
• (7) vπ(s′) = Eπ[Gt+1 |St+1 = s′]

4. Dynamic Programming

Algorithm 1: Policy Evaluation


Input: MDP, policy π, small positive number θ
Output: V ≈ vπ
Initialize V arbitrarily (e.g., V (s) = 0 for all s ∈ S + )
repeat
∆←0
for s ∈ S do
v ← V (s)
V (s) ← ∑_{a∈A(s)} π(a|s) ∑_{s′∈S,r∈R} p(s′, r|s, a)(r + γV (s′))

∆ ← max(∆, |v − V (s)|)
end
until ∆ < θ;
return V
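
A minimal Python sketch of Algorithm 1, assuming the hypothetical dictionary encoding introduced earlier (dynamics[s][a] as (probability, next state, reward) triples, terminal states present as keys with no actions, policy[s][a] = π(a|s)); this is one possible realization of the pseudocode, not the original implementation.

def policy_evaluation(dynamics, policy, gamma, theta=1e-8):
    """Iterative policy evaluation (Algorithm 1): returns V ~ v_pi."""
    V = {s: 0.0 for s in dynamics}        # includes terminal states, which keep value 0
    while True:
        delta = 0.0
        for s in dynamics:
            if not dynamics[s]:           # skip terminal states
                continue
            v = V[s]
            V[s] = sum(
                policy[s][a] * prob * (reward + gamma * V[s_next])
                for a in dynamics[s]
                for (prob, s_next, reward) in dynamics[s][a]
            )
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V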

Algorithm 2: Estimation of Action Values


Input: MDP, state-value function V
Output: action-value function Q
for s ∈ S do
for a ∈ A(s) do
Q(s, a) ← ∑_{s′∈S,r∈R} p(s′, r|s, a)(r + γV (s′))

end
end
return Q

Algorithm 3: Policy Improvement
Input: MDP, value function V
Output: policy π′
for s ∈ S do
for a ∈ A(s) do
Q(s, a) ← ∑_{s′∈S,r∈R} p(s′, r|s, a)(r + γV (s′))

end
π′(s) ← argmax_{a∈A(s)} Q(s, a)
end
return π′
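
A sketch covering Algorithms 2 and 3 together, under the same assumed dictionary encoding: q_from_v estimates the action values of a single state from V, and policy_improvement returns the greedy (deterministic) policy π′.

def q_from_v(dynamics, V, s, gamma):
    """Action values at state s from a state-value function V (Algorithm 2)."""
    return {
        a: sum(prob * (reward + gamma * V[s_next])
               for (prob, s_next, reward) in dynamics[s][a])
        for a in dynamics[s]
    }

def policy_improvement(dynamics, V, gamma):
    """Greedy policy improvement (Algorithm 3): pi'(s) = argmax_a Q(s, a)."""
    pi_new = {}
    for s in dynamics:
        if not dynamics[s]:               # skip terminal states
            continue
        Q_s = q_from_v(dynamics, V, s, gamma)
        pi_new[s] = max(Q_s, key=Q_s.get)
    return pi_new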

Algorithm 4: Policy Iteration


Input: MDP, small positive number θ
Output: policy π ≈ π∗
Initialize π arbitrarily (e.g., π(a|s) = 1/|A(s)| for all s ∈ S and a ∈ A(s))
policy-stable ← false
repeat
V ← Policy Evaluation(MDP, π, θ)
π′ ← Policy Improvement(MDP, V )
if π = π′ then
policy-stable ← true
end
π ← π′
until policy-stable = true;
return π
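
A sketch of Algorithm 4 that reuses the policy_evaluation and policy_improvement sketches above; the equiprobable random policy is used for initialization, and the loop stops once the greedy policy no longer changes.

def policy_iteration(dynamics, gamma, theta=1e-8):
    """Policy iteration (Algorithm 4): returns a policy pi ~ pi* as one-hot dicts."""
    states = [s for s in dynamics if dynamics[s]]
    policy = {s: {a: 1.0 / len(dynamics[s]) for a in dynamics[s]} for s in states}
    while True:
        V = policy_evaluation(dynamics, policy, gamma, theta)
        greedy = policy_improvement(dynamics, V, gamma)
        # express the deterministic improvement in the stochastic pi(a|s) form
        new_policy = {
            s: {a: 1.0 if a == greedy[s] else 0.0 for a in dynamics[s]}
            for s in states
        }
        if new_policy == policy:          # policy is stable
            return policy
        policy = new_policy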

Algorithm 5: Truncated Policy Evaluation


Input: MDP, policy π, value function V , positive integer max iterations
Output: V ≈ vπ (if max iterations is large enough)
counter ← 0
while counter < max iterations do
for s ∈ S do
V (s) ← ∑_{a∈A(s)} π(a|s) ∑_{s′∈S,r∈R} p(s′, r|s, a)(r + γV (s′))

end
counter ← counter + 1
end
return V
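
A sketch of Algorithm 5 under the same assumed encoding; Algorithm 6 simply alternates policy_improvement with this fixed number of sweeps until V stops changing by more than θ.

def truncated_policy_evaluation(dynamics, policy, V, gamma, max_iterations):
    """Truncated policy evaluation (Algorithm 5): a fixed number of sweeps."""
    for _ in range(max_iterations):
        for s in dynamics:
            if not dynamics[s]:           # skip terminal states
                continue
            V[s] = sum(
                policy[s][a] * prob * (reward + gamma * V[s_next])
                for a in dynamics[s]
                for (prob, s_next, reward) in dynamics[s][a]
            )
    return V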

Algorithm 6: Truncated Policy Iteration
Input: MDP, positive integer max iterations, small positive number θ
Output: policy π ≈ π∗
Initialize V arbitrarily (e.g., V (s) = 0 for all s ∈ S + )
Initialize π arbitrarily (e.g., π(a|s) = 1/|A(s)| for all s ∈ S and a ∈ A(s))
repeat
π ← Policy Improvement(MDP, V )
Vold ← V
V ← Truncated Policy Evaluation(MDP, π, V, max iterations)
until maxs∈S |V (s) − Vold (s)| < θ;
return π

Algorithm 7: Value Iteration


Input: MDP, small positive number θ
Output: policy π ≈ π∗
Initialize V arbitrarily (e.g., V (s) = 0 for all s ∈ S + )
repeat
∆←0
for s ∈ S do
v ← V (s)
V (s) ← max_{a∈A(s)} ∑_{s′∈S,r∈R} p(s′, r|s, a)(r + γV (s′))

∆ ← max(∆, |v − V (s)|)
end
until ∆ < θ;
π ← Policy Improvement(MDP, V )
return π
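
A sketch of Algorithm 7 under the same assumed encoding, ending with a call to the policy_improvement sketch above to extract the greedy policy.

def value_iteration(dynamics, gamma, theta=1e-8):
    """Value iteration (Algorithm 7): Bellman optimality sweeps, then greedy policy."""
    V = {s: 0.0 for s in dynamics}
    while True:
        delta = 0.0
        for s in dynamics:
            if not dynamics[s]:           # skip terminal states
                continue
            v = V[s]
            V[s] = max(
                sum(prob * (reward + gamma * V[s_next])
                    for (prob, s_next, reward) in dynamics[s][a])
                for a in dynamics[s]
            )
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return policy_improvement(dynamics, V, gamma)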

5. Monte Carlo Methods

Algorithm 8: First-Visit MC Prediction (for state values)


Input: policy π, positive integer num episodes
Output: value function V (≈ vπ if num episodes is large enough)
Initialize N (s) = 0 for all s ∈ S
Initialize returns sum(s) = 0 for all s ∈ S
for i ← 1 to num episodes do
Generate an episode S0 , A0 , R1 , . . . , ST using π
for t ← 0 to T − 1 do
if St is a first visit (with return Gt ) then
N (St ) ← N (St ) + 1
returns sum(St ) ← returns sum(St ) + Gt
end
end
V (s) ← returns sum(s)/N (s) for all s ∈ S
return V
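
A sketch of Algorithm 8 for a hypothetical episodic environment with env.reset() → state and env.step(action) → (next_state, reward, done), and a callable policy(state) → action; this interface is an assumption, not part of the original notes.

from collections import defaultdict

def mc_prediction_v(env, policy, num_episodes, gamma):
    """First-visit MC prediction for state values (Algorithm 8)."""
    N = defaultdict(int)
    returns_sum = defaultdict(float)
    for _ in range(num_episodes):
        # generate an episode S0, A0, R1, ..., S_T following the policy
        episode, state, done = [], env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # compute returns backwards; earlier visits overwrite later ones,
        # so each state keeps the return of its first visit
        G, first_visit = 0.0, {}
        for state, _action, reward in reversed(episode):
            G = reward + gamma * G
            first_visit[state] = G
        for state, G in first_visit.items():
            N[state] += 1
            returns_sum[state] += G
    return {s: returns_sum[s] / N[s] for s in returns_sum}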

Algorithm 9: First-Visit MC Prediction (for action values)


Input: policy π, positive integer num episodes
Output: value function Q (≈ qπ if num episodes is large enough)
Initialize N (s, a) = 0 for all s ∈ S, a ∈ A(s)
Initialize returns sum(s, a) = 0 for all s ∈ S, a ∈ A(s)
for i ← 1 to num episodes do
Generate an episode S0 , A0 , R1 , . . . , ST using π
for t ← 0 to T − 1 do
if (St , At ) is a first visit (with return Gt ) then
N (St , At ) ← N (St , At ) + 1
returns sum(St , At ) ← returns sum(St , At ) + Gt
end
end
Q(s, a) ← returns sum(s, a)/N (s, a) for all s ∈ S, a ∈ A(s)
return Q

Algorithm 10: First-Visit GLIE MC Control
Input: positive integer num episodes, GLIE {εi}
Output: policy π (≈ π∗ if num episodes is large enough)
Initialize Q(s, a) = 0 for all s ∈ S and a ∈ A(s)
Initialize N (s, a) = 0 for all s ∈ S, a ∈ A(s)
for i ← 1 to num episodes do
ε ← εi
π ← ε-greedy(Q)
Generate an episode S0 , A0 , R1 , . . . , ST using π
for t ← 0 to T − 1 do
if (St , At ) is a first visit (with return Gt ) then
N (St , At ) ← N (St , At ) + 1
Q(St , At ) ← Q(St , At ) + (1/N (St , At ))(Gt − Q(St , At ))
end
end
return π
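
A sketch of Algorithm 10, assuming the same hypothetical env interface as the MC prediction sketch plus env.actions(state) returning the available actions, and eps_schedule(i) giving εi; the sketch returns Q, from which the (ε-)greedy policy is read off.

import random
from collections import defaultdict

def glie_mc_control(env, num_episodes, gamma, eps_schedule):
    """First-visit GLIE MC control (Algorithm 10)."""
    Q = defaultdict(float)
    N = defaultdict(int)

    def epsilon_greedy(state, eps):
        actions = env.actions(state)
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for i in range(1, num_episodes + 1):
        eps = eps_schedule(i)
        # generate an episode with the current epsilon-greedy policy
        episode, state, done = [], env.reset(), False
        while not done:
            action = epsilon_greedy(state, eps)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # first-visit return for each (state, action) pair
        G, first_visit = 0.0, {}
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            first_visit[(state, action)] = G
        for sa, G in first_visit.items():
            N[sa] += 1
            Q[sa] += (G - Q[sa]) / N[sa]   # incremental mean: step size 1/N(St, At)
    return Q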

Algorithm 11: First-Visit Constant-α (GLIE) MC Control


Input: positive integer num episodes, small positive fraction α, GLIE {εi}
Output: policy π (≈ π∗ if num episodes is large enough)
Initialize Q arbitrarily (e.g., Q(s, a) = 0 for all s ∈ S and a ∈ A(s))
for i ← 1 to num episodes do
ε ← εi
π ← ε-greedy(Q)
Generate an episode S0 , A0 , R1 , . . . , ST using π
for t ← 0 to T − 1 do
if (St , At ) is a first visit (with return Gt ) then
Q(St , At ) ← Q(St , At ) + α(Gt − Q(St , At ))
end
end
return π
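
Algorithm 11 changes only the update step of the GLIE sketch above: a fixed step size α replaces 1/N(St, At). A minimal sketch of that replacement, reusing the Q and first_visit names assumed there:

def constant_alpha_update(Q, first_visit, alpha):
    """Constant-alpha first-visit MC update (Algorithm 11)."""
    for state_action, G in first_visit.items():
        Q[state_action] += alpha * (G - Q[state_action])
    return Q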

6. Temporal-Difference Methods

Algorithm 12: TD(0)


Input: policy π, positive integer num episodes, small positive fraction α
Output: value function V (≈ vπ if num episodes is large enough)
Initialize V arbitrarily (e.g., V (s) = 0 for all s ∈ S + )
for i ← 1 to num episodes do
Observe S0
t←0
repeat
Choose action At using policy π
Take action At and observe Rt+1 , St+1
V (St ) ← V (St ) + α(Rt+1 + γV (St+1 ) − V (St ))
t←t+1
until St is terminal ;
end
return V
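
A sketch of Algorithm 12, assuming the same hypothetical env interface (reset/step) and a callable policy(state) → action; V defaults to 0 for unseen states, so V(terminal) stays 0.

from collections import defaultdict

def td_zero(env, policy, num_episodes, alpha, gamma):
    """TD(0) prediction (Algorithm 12): returns V ~ v_pi."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # one-step TD update toward the bootstrapped target
            V[state] += alpha * (reward + gamma * V[next_state] - V[state])
            state = next_state
    return dict(V)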

Algorithm 13: Sarsa


Input: policy π, positive integer num episodes, small positive fraction α, GLIE {εi}
Output: value function Q (≈ qπ if num episodes is large enough)
Initialize Q arbitrarily (e.g., Q(s, a) = 0 for all s ∈ S and a ∈ A(s), and Q(terminal-state, ·) = 0)
for i ← 1 to num episodes do
ε ← εi
Observe S0
Choose action A0 using policy derived from Q (e.g., ε-greedy)
t←0
repeat
Take action At and observe Rt+1 , St+1
Choose action At+1 using policy derived from Q (e.g., ε-greedy)
Q(St , At ) ← Q(St , At ) + α(Rt+1 + γQ(St+1 , At+1 ) − Q(St , At ))
t←t+1
until St is terminal ;
end
return Q
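
A sketch of Algorithm 13, assuming the hypothetical env interface used above (reset/step/actions) and eps_schedule(i) = εi; Q(terminal, ·) is treated as 0 by using the reward alone as the target on the last step.

import random
from collections import defaultdict

def sarsa(env, num_episodes, alpha, gamma, eps_schedule):
    """Sarsa (Algorithm 13): on-policy TD control."""
    Q = defaultdict(float)

    def epsilon_greedy(state, eps):
        actions = env.actions(state)
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for i in range(1, num_episodes + 1):
        eps = eps_schedule(i)
        state, done = env.reset(), False
        action = epsilon_greedy(state, eps)
        while not done:
            next_state, reward, done = env.step(action)
            if done:
                target = reward                       # Q(terminal, .) = 0
            else:
                next_action = epsilon_greedy(next_state, eps)
                target = reward + gamma * Q[(next_state, next_action)]
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            if not done:
                state, action = next_state, next_action
    return Q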

Algorithm 14: Sarsamax (Q-Learning)
Input: policy π, positive integer num episodes, small positive fraction α, GLIE {εi}
Output: value function Q (≈ qπ if num episodes is large enough)
Initialize Q arbitrarily (e.g., Q(s, a) = 0 for all s ∈ S and a ∈ A(s), and Q(terminal-state, ·) = 0)
for i ← 1 to num episodes do
ε ← εi
Observe S0
t←0
repeat
Choose action At using policy derived from Q (e.g., ε-greedy)
Take action At and observe Rt+1 , St+1
Q(St , At ) ← Q(St , At ) + α(Rt+1 + γ max_{a∈A(St+1 )} Q(St+1 , a) − Q(St , At ))
t←t+1
until St is terminal ;
end
return Q
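
A sketch of Algorithm 14 under the same assumed interfaces; the only change from the Sarsa sketch is the target, which uses the greedy (max) action value at the next state.

import random
from collections import defaultdict

def q_learning(env, num_episodes, alpha, gamma, eps_schedule):
    """Sarsamax / Q-learning (Algorithm 14): off-policy TD control."""
    Q = defaultdict(float)

    def epsilon_greedy(state, eps):
        actions = env.actions(state)
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for i in range(1, num_episodes + 1):
        eps = eps_schedule(i)
        state, done = env.reset(), False
        while not done:
            action = epsilon_greedy(state, eps)
            next_state, reward, done = env.step(action)
            # max over next actions is 0 at a terminal state (Q(terminal, .) = 0)
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions(next_state))
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q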

Algorithm 15: Expected Sarsa


Input: policy π, positive integer num episodes, small positive fraction α, GLIE {εi}
Output: value function Q (≈ qπ if num episodes is large enough)
Initialize Q arbitrarily (e.g., Q(s, a) = 0 for all s ∈ S and a ∈ A(s), and Q(terminal-state, ·) = 0)
for i ← 1 to num episodes do
ε ← εi
Observe S0
t←0
repeat
Choose action At using policy derived from Q (e.g., ε-greedy)
Take action At and observe Rt+1 , St+1
Q(St , At ) ← Q(St , At ) + α(Rt+1 + γ ∑_{a∈A(St+1 )} π(a|St+1 )Q(St+1 , a) − Q(St , At ))
t←t+1
until St is terminal ;
end
return Q
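
A sketch of Algorithm 15 under the same assumed interfaces; the target averages Q(St+1, ·) under the current ε-greedy policy instead of sampling At+1 (Sarsa) or taking the max (Q-learning).

import random
from collections import defaultdict

def expected_sarsa(env, num_episodes, alpha, gamma, eps_schedule):
    """Expected Sarsa (Algorithm 15)."""
    Q = defaultdict(float)

    def greedy_action(state):
        return max(env.actions(state), key=lambda a: Q[(state, a)])

    def epsilon_greedy(state, eps):
        if random.random() < eps:
            return random.choice(env.actions(state))
        return greedy_action(state)

    for i in range(1, num_episodes + 1):
        eps = eps_schedule(i)
        state, done = env.reset(), False
        while not done:
            action = epsilon_greedy(state, eps)
            next_state, reward, done = env.step(action)
            if done:
                expected_q = 0.0                      # Q(terminal, .) = 0
            else:
                actions = env.actions(next_state)
                # epsilon-greedy probabilities over the next actions
                probs = {a: eps / len(actions) for a in actions}
                probs[greedy_action(next_state)] += 1.0 - eps
                expected_q = sum(probs[a] * Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * expected_q - Q[(state, action)])
            state = next_state
    return Q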

