The document discusses policy search algorithms that operate directly in the policy space, focusing on binary bandit problems and various algorithms like LR-P, LR-εP, and LR-I. It also covers policy gradient approaches, emphasizing parameterized policies and the use of stochastic gradient ascent for policy evaluation. Additionally, it introduces the incremental version of policy gradient algorithms and the concept of characteristic eligibility for reward calibration.
Policy Search Algorithms
Dr. D. John Pradeep
VIT-AP University

Policy search
• These algorithms act directly over the policy space.
• The policy is denoted πt(a) = Pr(at = a), the probability of selecting action a at time t.

Binary bandit [only two arms]
• Rewards: Rt = 0 or 1.
• If Rt = 1:
  πt+1(at) = πt(at) + α[1 − πt(at)]
  πt+1(a′) = πt(a′) + α[0 − πt(a′)] = πt(a′)[1 − α]
• If Rt = 0:
  πt+1(at) = πt(at) + β[0 − πt(at)]
  πt+1(a′) = πt(a′) + β[1 − πt(a′)]

Policy search
• If α = β: linear reward-penalty algorithm (LR-P).
• If α >> β: LR-εP algorithm (the reward step is much larger than the penalty step).
• If β = 0: LR-I algorithm (reward update, inaction on penalty). A code sketch of this update family is given below, after the policy-gradient derivation.

Policy gradient approaches
• Parameterized (θ) policy approach – the assumption is that the policy depends on some set of parameters.
• Meaning – the policy is a probability distribution which can be specified in many ways (softmax, Gaussian distribution, etc.).
• For example, with a softmax policy, defining the policy only requires the terms in the exponent, and these parameters need not be explicit value functions.
• They may simply be preferences for the actions at a given time.
• Examples of parameters are the weights of a neural network.

Policy gradient approaches
• Let θ be the set of parameters defining the policy and η(θ) the performance metric evaluating the policy for a specific choice of θ.
• A specific value of θ refers to one policy.
• η(θ) is the evaluation of the policy defined by θ.
• The most natural way of evaluating a policy is the expected payoff:
  η(θ) = E[Rt] = Σa πθ(a) q*(a)
• Gradient ascent: θ ← θ + α ∇θ η(θ)
• Increment θ in the direction of increasing performance metric to find the best policy.
• Since η(θ) is unknown, updating θ by directly finding the maximum of η(θ) is not possible. η(θ) is instead estimated by drawing samples and estimating q*(a) – hence a stochastic gradient ascent algorithm.

Policy gradient approaches
• ∇θ η(θ) = Σa q*(a) ∇θ πθ(a)
• = Σa πθ(a) q*(a) [∇θ πθ(a) / πθ(a)]   (rewriting the sum in the form of an expectation)
• = E[ q*(at) ∇θ ln πθ(at) ]
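Returning to the binary bandit above, the following is a minimal Python sketch (not from the slides) of the linear reward-penalty family. The arm success probabilities, the step sizes alpha and beta, and the function names are illustrative assumptions; setting beta = 0 gives LR-I, beta = alpha gives LR-P, and 0 < beta << alpha gives LR-εP.

import random

def lr_update(pi, a, reward, alpha, beta):
    # Linear reward-penalty update for a two-armed bandit.
    # pi: [pi(0), pi(1)], a: index of the action just taken, reward: 0 or 1.
    other = 1 - a
    if reward == 1:
        # Rt = 1: move probability mass toward the rewarded action.
        pi[a] += alpha * (1 - pi[a])
        pi[other] += alpha * (0 - pi[other])
    else:
        # Rt = 0: move probability mass away from the penalized action.
        pi[a] += beta * (0 - pi[a])
        pi[other] += beta * (1 - pi[other])
    return pi

# Hypothetical Bernoulli bandit: arm 1 pays off more often than arm 0.
p_success = [0.3, 0.7]
pi = [0.5, 0.5]
for t in range(2000):
    a = 0 if random.random() < pi[0] else 1
    r = 1 if random.random() < p_success[a] else 0
    pi = lr_update(pi, a, r, alpha=0.05, beta=0.0)   # beta = 0  ->  LR-I
print(pi)   # pi[1] should be close to 1

Because the two probabilities change by equal and opposite amounts in each update, πt(0) + πt(1) remains 1 throughout.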
• Issue: we have to compute an expectation of an expectation, since q*(a) is itself an expected reward. How?
Algorithm:
1. Take a sample by selecting an action.
2. Compute the average of Rt ∇θ ln πθ(at) over the sampled actions.

Policy gradient approaches
• ∇θ η(θ) = E[ q*(at) ∇θ ln πθ(at) ] = E[ Rt ∇θ ln πθ(at) ], since E[Rt | at] = q*(at)
• A sample average of Rt ∇θ ln πθ(at) over the selected actions is therefore an estimate of the gradient.
• This reduces the computational burden
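A minimal sketch of this sampled estimate, assuming a softmax policy over action preferences θ; the bandit's true action values q_star, the sample count, and the helper names are illustrative, not from the slides. Instead of summing q*(a) ∇θ πθ(a) over all actions, we draw actions from πθ and average Rt ∇θ ln πθ(at).

import numpy as np

def softmax(theta):
    z = np.exp(theta - np.max(theta))
    return z / z.sum()

def grad_log_pi(theta, a):
    # For a softmax policy, d ln pi(a) / d theta = one_hot(a) - pi.
    g = -softmax(theta)
    g[a] += 1.0
    return g

def estimate_gradient(theta, q_star, n_samples, rng):
    # Monte-Carlo estimate of grad eta(theta) = E[ Rt * grad ln pi(at) ].
    pi = softmax(theta)
    total = np.zeros_like(theta)
    for _ in range(n_samples):
        a = rng.choice(len(theta), p=pi)      # 1. take a sample by selecting an action
        r = rng.binomial(1, q_star[a])        #    observe a reward with E[Rt | at] = q*(at)
        total += r * grad_log_pi(theta, a)    # 2. accumulate Rt * grad ln pi(at)
    return total / n_samples                  #    and average over the samples

rng = np.random.default_rng(0)
theta = np.zeros(2)                  # preferences for a two-armed bandit
q_star = np.array([0.3, 0.7])        # hypothetical true action values (unknown in practice)
print(estimate_gradient(theta, q_star, n_samples=1000, rng=rng))

The exact gradient would require knowing q*(a) for every action; the sampled form needs only the rewards actually received, which is the reduction in computational burden referred to above.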
Incremental Version of PGA
• At every step, the parameters are changed by a value Δθn:
  θn+1 = θn + Δθn
• Δθn = α (Rn − bn) ∂ ln πθ(an)/∂θ
• Characteristic eligibility: ∂ ln πθ(an)/∂θ
• bn – reinforcement baseline
• bn assesses the goodness of the reward received at each step (reward calibration).
• If reward > bn, it is a good reward: Δθn points along the eligibility, making the chosen action more likely.
• If reward < bn, it is a bad reward: Δθn is negative along the eligibility, making the chosen action less likely.
• bn can be the average of all rewards received up to time n.

Characteristic eligibility
• The components of θ most responsible for the change Δθn are the most eligible to receive the update.
• This is the REINFORCE update.
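A minimal sketch of the incremental REINFORCE update with a running-average baseline, again assuming a softmax parameterization; the three-armed bandit, step size, and iteration count are illustrative assumptions, not values from the slides.

import numpy as np

rng = np.random.default_rng(1)
q_star = np.array([0.2, 0.5, 0.8])   # hypothetical true action values of a 3-armed bandit
theta = np.zeros(3)                  # action preferences (the policy parameters)
baseline = 0.0                       # bn: running average of all rewards received so far
alpha = 0.1                          # illustrative step size

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

for n in range(1, 5001):
    pi = softmax(theta)
    a = rng.choice(len(theta), p=pi)   # select an action from the current policy
    r = rng.binomial(1, q_star[a])     # observe the reward Rn

    # Characteristic eligibility: d ln pi(an) / d theta = one_hot(an) - pi.
    eligibility = -pi
    eligibility[a] += 1.0

    # REINFORCE update: Delta theta_n = alpha * (Rn - bn) * eligibility.
    theta += alpha * (r - baseline) * eligibility

    # bn: average of all rewards received up to step n.
    baseline += (r - baseline) / n

print(softmax(theta))   # the probability of the best arm should dominate

Each component of the eligibility vector measures how much the corresponding preference influenced the probability of the action taken, so the parameters most responsible for the choice receive the largest share of the update.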