
Dynamic Programming in Reinforcement Learning

1. What is Dynamic Programming in Reinforcement Learning?

- Dynamic Programming (DP) is a family of algorithms used in reinforcement learning to find the optimal policy and value functions when a perfect model of the environment is available.

- The environment is modeled as a Markov Decision Process (MDP).

- DP uses the Bellman equations to compute the value of states or state-action pairs.

- The main idea is to break a complex problem into smaller subproblems and solve them recursively.

- DP provides exact solutions, but it is computationally expensive, so it is rarely applied directly to large real-world problems.

- Other RL methods, such as Monte Carlo and Temporal Difference learning, can be viewed as approximations of DP that do not require a full model.

2. Policy Evaluation (Prediction)

- Policy Evaluation is used to calculate how good a policy π is by estimating its value function v_π(s) for all states s.

- It uses the Bellman expectation equation: v_π(s) = Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [ r + γ v_π(s') ].

- Instead of solving this system of equations directly, we apply iterative updates starting from an initial guess.

- This method is called Iterative Policy Evaluation and continues until the value function converges.

- The updates are based on expected values, not samples, and are performed through repeated sweeps of the state space, as in the sketch below.
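The sketch below is a minimal illustration of Iterative Policy Evaluation, assuming the MDP dynamics are available as a nested structure P[s][a] = [(prob, next_state, reward), ...] and the policy as a table of action probabilities; the names policy_evaluation, P, and theta are illustrative, not taken from any particular library.

```python
import numpy as np

def policy_evaluation(P, policy, gamma=0.9, theta=1e-8):
    """Iterative Policy Evaluation for a tabular MDP.

    Assumes P[s][a] is a list of (prob, next_state, reward) tuples and
    policy[s][a] is the probability of taking action a in state s.
    """
    n_states = len(P)
    V = np.zeros(n_states)                       # initial guess: v(s) = 0 everywhere
    while True:
        delta = 0.0
        for s in range(n_states):
            v_new = 0.0
            for a, pi_sa in enumerate(policy[s]):
                for prob, s_next, r in P[s][a]:
                    # Bellman expectation backup: an expected value, not a sample
                    v_new += pi_sa * prob * (r + gamma * V[s_next])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:                        # stop when a full sweep barely changes V
            break
    return V
```

Each pass of the while loop is one sweep of the state space; theta controls how close the estimate gets to v_π before the iteration stops.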

3. Policy Improvement

- Policy Improvement improves a given policy by checking whether a different action yields a higher value.

- We compute the action-value function: q_π(s, a) = Σ_{s',r} p(s', r | s, a) [ r + γ v_π(s') ].

- If a different action gives a higher value, we update the policy at that state.

- The new policy π' is better if v_π'(s) >= v_π(s) for all states s; this is the Policy Improvement Theorem.

- Acting greedily with respect to the value function never makes the policy worse and usually improves it, as in the sketch below.
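Continuing with the same assumed P[s][a] representation, a minimal sketch of the greedy improvement step might look like this; q_from_v and policy_improvement are illustrative names.

```python
import numpy as np

def q_from_v(P, V, s, gamma=0.9):
    """One-step lookahead: q_pi(s, a) for every action, from the current estimate V."""
    q = np.zeros(len(P[s]))
    for a in range(len(P[s])):
        for prob, s_next, r in P[s][a]:
            q[a] += prob * (r + gamma * V[s_next])
    return q

def policy_improvement(P, V, gamma=0.9):
    """Return a deterministic policy that is greedy with respect to V."""
    n_states, n_actions = len(P), len(P[0])
    policy = np.zeros((n_states, n_actions))
    for s in range(n_states):
        best_a = int(np.argmax(q_from_v(P, V, s, gamma)))
        policy[s][best_a] = 1.0                  # all probability on the greedy action
    return policy
```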

4. Policy Iteration

- Policy Iteration finds the optimal policy by repeating Policy Evaluation and Policy Improvement.

- It starts with any policy and evaluates it using Iterative Policy Evaluation.

- Then it improves the policy using the Policy Improvement step.

- These steps are repeated until the policy no longer changes.

- The final policy and value function are both optimal; see the combined sketch below.
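Reusing the policy_evaluation and policy_improvement sketches from the previous sections (and the same assumed MDP representation), Policy Iteration can be sketched as the following loop.

```python
import numpy as np

def policy_iteration(P, gamma=0.9, theta=1e-8):
    """Alternate full evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions = len(P), len(P[0])
    policy = np.ones((n_states, n_actions)) / n_actions   # start from the uniform random policy
    while True:
        V = policy_evaluation(P, policy, gamma, theta)    # evaluate the current policy
        new_policy = policy_improvement(P, V, gamma)      # act greedily w.r.t. its values
        if np.array_equal(new_policy, policy):            # stop when the policy stops changing
            return policy, V
        policy = new_policy
```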

5. Value Iteration

- Value Iteration simplifies Policy Iteration by combining evaluation and improvement into a single update.

- It updates values using the Bellman optimality equation: v(s) = max_a Σ_{s',r} p(s', r | s, a) [ r + γ v(s') ].

- Values are updated directly and repeatedly until they converge.

- Once the values stabilize, the optimal policy is obtained by choosing, in each state, the action with the highest expected value.

- It is often faster than Policy Iteration because it avoids running a full policy evaluation between improvements; see the sketch below.
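Under the same assumed P[s][a] representation, a minimal Value Iteration sketch replaces the expectation backup with a max over actions and extracts the greedy policy only after the values have converged.

```python
import numpy as np

def value_iteration(P, gamma=0.9, theta=1e-8):
    """Repeated Bellman optimality backups, then greedy policy extraction."""
    n_states, n_actions = len(P), len(P[0])

    def backup(s, V):
        # expected return of each action under the current value estimate
        return [sum(prob * (r + gamma * V[s_next]) for prob, s_next, r in P[s][a])
                for a in range(n_actions)]

    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            best = max(backup(s, V))             # max over actions: the optimality backup
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break

    policy = np.zeros((n_states, n_actions))
    for s in range(n_states):
        policy[s][int(np.argmax(backup(s, V)))] = 1.0   # greedy policy from converged values
    return policy, V
```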

6. Asynchronous Dynamic Programming

- Asynchronous DP updates the values of states in any order instead of sweeping all of them at once.

- In standard (synchronous) DP we perform full sweeps over the state space, whereas in Asynchronous DP we update one or a few states at a time.

- This is useful in large problems where full sweeps are expensive.

- It still converges as long as every state continues to be updated; no state may be permanently neglected.

- It is more practical and flexible for real-world applications; a sketch follows below.
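A minimal sketch of the idea, again assuming the P[s][a] representation: instead of sweeping every state, one randomly chosen state is backed up at a time. The random selection and the fixed update budget are illustrative choices; any ordering that keeps visiting every state works.

```python
import numpy as np

def async_value_iteration(P, gamma=0.9, n_updates=50_000, seed=0):
    """Asynchronous DP: Bellman optimality backups applied to one state at a time."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = len(P), len(P[0])
    V = np.zeros(n_states)
    for _ in range(n_updates):
        s = int(rng.integers(n_states))          # pick any state; every state must keep being updated
        V[s] = max(
            sum(prob * (r + gamma * V[s_next]) for prob, s_next, r in P[s][a])
            for a in range(n_actions)
        )
    return V
```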

7. Generalized Policy Iteration (GPI)

- GPI is the general idea of letting Policy Evaluation and Policy Improvement interact.

- Evaluation and improvement are interleaved rather than run as fixed, complete phases.

- As the value function improves, the policy improves, and vice versa.

- This process continues until both the policy and the value function converge.

- GPI is the conceptual foundation of many advanced reinforcement learning algorithms; a sketch of one GPI variant follows below.
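To make the interplay concrete, here is one possible sketch of a GPI-style loop under the same assumptions: evaluation is deliberately truncated to a few sweeps before each improvement. With many evaluation sweeps it behaves like Policy Iteration, with very few it behaves more like Value Iteration; eval_sweeps and the other names are illustrative.

```python
import numpy as np

def generalized_policy_iteration(P, gamma=0.9, eval_sweeps=3, max_iters=1000):
    """GPI: interleave truncated policy evaluation with greedy policy improvement."""
    n_states, n_actions = len(P), len(P[0])
    policy = np.ones((n_states, n_actions)) / n_actions
    V = np.zeros(n_states)
    for _ in range(max_iters):
        # partial evaluation: only a few expectation sweeps, not full convergence
        for _ in range(eval_sweeps):
            for s in range(n_states):
                V[s] = sum(policy[s][a] * prob * (r + gamma * V[s_next])
                           for a in range(n_actions)
                           for prob, s_next, r in P[s][a])
        # greedy improvement with respect to the current (approximate) V
        new_policy = np.zeros((n_states, n_actions))
        for s in range(n_states):
            q = [sum(prob * (r + gamma * V[s_next]) for prob, s_next, r in P[s][a])
                 for a in range(n_actions)]
            new_policy[s][int(np.argmax(q))] = 1.0
        if np.array_equal(new_policy, policy):   # evaluation and improvement have stabilized
            break
        policy = new_policy
    return policy, V
```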
