Multi-Armed Bandit Problem

The document discusses the Upper Confidence Bound (UCB) algorithm, emphasizing its approach of balancing exploration and exploitation in multi-armed bandit problems. It details the UCB1 variant, its mathematical formulation, and provides code examples for implementation and simulation. The analysis includes performance comparisons against other algorithms, highlighting UCB's limitations in scenarios with closely valued arms, where Epsilon Greedy may perform better.
Multi-Armed Bandit Analysis of Upper Confidence Bound Algorithm
Kenneth Foo · Published in Analytics Vidhya · Feb 21, 2020

The Upper Confidence Bound (UCB) algorithm is often described as "optimism in the face of uncertainty". To understand why, consider that at any given round, each arm's reward can be summarised by a point estimate: its observed average rate of reward. Drawing intuition from confidence intervals, we can also attach an uncertainty boundary around each point estimate, so that every arm has both a lower boundary and an upper boundary. The UCB algorithm is aptly named because we are only concerned with the upper bound, given that we are trying to find the arm with the highest reward rate.

There are different variants of the UCB algorithm, but in this article we will look at the UCB1 algorithm. At a given round, after n trials in total, the reward UCB of arm i is:

UCB_i = \bar{x}_i + \sqrt{\frac{2 \ln(n)}{n_i}}

where \bar{x}_i is the current average reward of arm i, n is the number of trials passed so far, and n_i is the number of pulls given to arm i in the playthrough history.

The formulation is simple, yet it has several interesting implications:

- The exploration term is proportional to \sqrt{\ln(n)}, which means that as the experiment progresses, the upper boundaries of all arms slowly grow with \sqrt{\ln(n)}.
- The exploration term is inversely proportional to \sqrt{n_i}. The more often a specific arm has been pulled in the past, the more its confidence boundary shrinks towards its point estimate.

At every round, the UCB algorithm picks the arm with the highest reward UCB as given by the equation above.

Beyond the formulation, here is a simple thought experiment to glean some intuition on how UCB incorporates exploration and exploitation. The different growth rates of the numerator and denominator create a tension between the two: any increase in n raises the UCB only logarithmically, while each additional pull of arm i shrinks its bonus in proportion to 1/\sqrt{n_i}. An arm that has not been explored as often as the others therefore carries a larger bonus component. Depending on its current average, the overall UCB of that arm may exceed that of arms with a higher average but a smaller bonus, and the under-explored arm gets picked.
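To make that tension concrete, here is a small illustrative sketch that is not part of the original article: it evaluates the UCB1 formula for two hypothetical arms, with averages and pull counts invented purely for illustration.

import math

def ucb1_score(avg_reward, pulls, total_pulls):
    # UCB1 score = empirical mean + exploration bonus sqrt(2 ln(n) / n_i)
    return avg_reward + math.sqrt(2 * math.log(total_pulls) / pulls)

# Hypothetical state after 100 total pulls: arm A has the higher average,
# but arm B has been pulled far less often and so carries a larger bonus.
score_a = ucb1_score(avg_reward=0.60, pulls=90, total_pulls=100)
score_b = ucb1_score(avg_reward=0.55, pulls=10, total_pulls=100)

print(round(score_a, 2))  # ~0.92 = 0.60 + sqrt(2 ln(100) / 90)
print(round(score_b, 2))  # ~1.51 = 0.55 + sqrt(2 ln(100) / 10)
# Arm B wins this round despite its lower average, which is exactly the exploration effect.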
The following analysis is based on the book "Bandit Algorithms for Website Optimization" by John Myles White. For easier understanding of the code, I have included comments. Below is the code for the creation of the UCB1 algorithm setup and the progressive updates of counts and values for the arms:

- Counts: the recorded number of times each arm was pulled.
- Values: the known mean reward of each arm. In the case of a Bernoulli arm, the value represents the probability of reward, which ranges from 0 to 1.

import math

class UCB1():
    def __init__(self, counts, values):
        # Counts: pulls per arm. For multiple arms, this is a list of counts.
        self.counts = counts
        # Values: average reward per arm. For multiple arms, this is a list of values.
        self.values = values
        return

    # Initialise k number of arms
    def initialize(self, n_arms):
        self.counts = [0 for col in range(n_arms)]
        self.values = [0.0 for col in range(n_arms)]
        return

    # UCB arm selection based on max of UCB reward of each arm
    def select_arm(self):
        n_arms = len(self.counts)
        # Play every arm once before applying the UCB formula
        for arm in range(n_arms):
            if self.counts[arm] == 0:
                return arm

        ucb_values = [0.0 for arm in range(n_arms)]
        total_counts = sum(self.counts)
        for arm in range(n_arms):
            bonus = math.sqrt((2 * math.log(total_counts)) / float(self.counts[arm]))
            ucb_values[arm] = self.values[arm] + bonus
        return ucb_values.index(max(ucb_values))

    # Update counts and mean reward estimate for the chosen arm
    def update(self, chosen_arm, reward):
        self.counts[chosen_arm] = self.counts[chosen_arm] + 1
        n = self.counts[chosen_arm]

        # Incremental update of the average value/reward for the chosen arm
        value = self.values[chosen_arm]
        new_value = ((n - 1) / float(n)) * value + (1 / float(n)) * reward
        self.values[chosen_arm] = new_value
        return

As per the discussion in previous articles, we will use a Bernoulli distribution to represent the reward function of each arm.

import random

class BernoulliArm():
    def __init__(self, p):
        self.p = p

    # Reward system based on a Bernoulli draw: reward of 1.0 with probability p, else 0.0
    def draw(self):
        if random.random() > self.p:
            return 0.0
        else:
            return 1.0

To proceed with any further analysis, an operational script is required to run the simulation, where:

- num_sims: the number of independent simulations, each of length equal to "horizon".
- horizon: the number of time steps/trials per round of simulation.

def test_algorithm(algo, arms, num_sims, horizon):
    # Initialise variables for the full accumulated simulation (num_sims * horizon)
    chosen_arms = [0.0 for i in range(num_sims * horizon)]
    rewards = [0.0 for i in range(num_sims * horizon)]
    cumulative_rewards = [0.0 for i in range(num_sims * horizon)]
    sim_nums = [0.0 for i in range(num_sims * horizon)]
    times = [0.0 for i in range(num_sims * horizon)]

    for sim in range(num_sims):
        sim = sim + 1
        algo.initialize(len(arms))

        for t in range(horizon):
            t = t + 1
            index = (sim - 1) * horizon + t - 1
            sim_nums[index] = sim
            times[index] = t

            # Selection of best arm and engaging it
            chosen_arm = algo.select_arm()
            chosen_arms[index] = chosen_arm

            # Engage chosen Bernoulli arm and obtain reward info
            reward = arms[chosen_arm].draw()
            rewards[index] = reward

            if t == 1:
                cumulative_rewards[index] = reward
            else:
                cumulative_rewards[index] = cumulative_rewards[index - 1] + reward

            algo.update(chosen_arm, reward)

    return [sim_nums, times, chosen_arms, rewards, cumulative_rewards]

Simulation of Arms with relatively big differences in Means

Similar to the previous analysis for Epsilon Greedy, the simulation comprises the following:

- Create 5 arms, four of which have an average reward of 0.1, while the last/best arm has an average reward of 0.9.
- Save the simulation output to a tab-separated file.
- Run 5000 independent simulations.

In this example, since the UCB algorithm does not have any hyperparameter, we create a single set of 5000 simulations. The choice of 5000 independent simulations is made because we want to determine the average performance. Any single run is subject to stochasticity and its performance might be skewed by random chance, so it is important to run a reasonably high number of simulations to evaluate the average performance.

import random
import numpy as np

random.seed(1)

# Out of 5 arms, 1 arm is clearly the best
means = [0.1, 0.1, 0.1, 0.1, 0.9]
n_arms = len(means)

# Shuffling arms
random.shuffle(means)

# Create list of Bernoulli arms with reward information
arms = list(map(lambda mu: BernoulliArm(mu), means))
print("Best arm is " + str(np.argmax(means)))

f = open("standard_ucb_results.tsv", "w+")

# Create 1 round of 5000 simulations, each with a horizon of 250 time steps
algo = UCB1([], [])
algo.initialize(n_arms)
results = test_algorithm(algo, arms, 5000, 250)

# Store data
for i in range(len(results[0])):
    f.write("\t".join([str(results[j][i]) for j in range(len(results))]) + "\n")
f.close()

Using some data preprocessing and basic Altair visualisation, we can plot the probability of pulling the best arm.
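The article does not show the preprocessing step itself, so here is a minimal sketch of the kind of aggregation that could produce the plotted quantities. It assumes pandas is available; the column names and the best_arm value are my own choices rather than part of the original code, and the actual charts in the article were built with Altair.

import pandas as pd

# Load the simulation output written above; columns follow the order of the results list
cols = ["sim_num", "time_step", "chosen_arm", "reward", "cumulative_reward"]
df = pd.read_csv("standard_ucb_results.tsv", sep="\t", names=cols)

best_arm = 2  # hypothetical: use the index printed by np.argmax(means) in the script above

# For each time step, the fraction of the 5000 simulations that chose the best arm
rate_best = (df["chosen_arm"] == best_arm).groupby(df["time_step"]).mean()

# Mean cumulative reward at each time step, averaged across simulations
mean_cum_reward = df.groupby("time_step")["cumulative_reward"].mean()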
[Figure: "UCB: Mean Rate of Choosing Best Arm from 5000 Simulations, 5 Arms" (rate of choosing best arm vs. time step)]

The UCB algorithm shows extreme fluctuations in its rate of choosing the best arm in the early phase of the experiment, roughly time steps 0 to 60. This can be explained by the emphasis on exploration across all arms, since the UCB bonus components of every arm are much bigger at the start. As the trial progresses, the bonus components shrink for all arms, and the UCB value of each arm converges towards its average reward. The arm with the higher mean therefore becomes more distinguishable to the algorithm and is picked more frequently as the trial progresses. We also observe that the rate of choosing the best arm does not seem to hit a hard asymptote but converges towards 1. The rate of convergence slows as it approaches 1, and the experiment's time horizon was too short for us to observe any further convergence.

[Figure: "UCB: Mean Cumulative Reward from 5000 Simulations, 5 Arms = [4 x 0.1, 1 x 0.9]"]

The cumulative reward plot of the UCB algorithm is comparable to the other algorithms. Although it does not do as well as the best Softmax settings (tau = 0.1 or 0.2), where the cumulative reward went beyond 200, the UCB cumulative reward is close to that range (around 190). We also observe some curvature in the early phase of the trial, which corroborates the extreme fluctuations we saw in the rate of choosing the best arm. Likewise, as the experiment progresses, the algorithm distinguishes the best arm and picks it with higher frequency, so the cumulative reward plot settles into a straight line whose gradient should approximate 0.9, based on consistently choosing the best arm.

Simulation of Arms with relatively smaller differences in Means

The previous analysis was a simulation exercise on arms with big differences in reward returns. We now extend the analysis to a situation where the arms are relatively closer: we simulate 5 arms, 4 of which have a mean of 0.8, while the last/best arm has a mean of 0.9. The rerun is sketched below.
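The article does not reprint the script for this second scenario, but it is presumably identical to the earlier one with only the list of means changed. A minimal sketch of that rerun, assuming the class definitions and test_algorithm above are in scope; the output file name is my own choice.

# Close-means scenario: 4 arms at 0.8 and the best arm at 0.9
random.seed(1)
means = [0.8, 0.8, 0.8, 0.8, 0.9]
n_arms = len(means)
random.shuffle(means)

arms = list(map(lambda mu: BernoulliArm(mu), means))
print("Best arm is " + str(np.argmax(means)))

algo = UCB1([], [])
algo.initialize(n_arms)
results = test_algorithm(algo, arms, 5000, 250)

# Store data (hypothetical file name for the close-means run)
f = open("close_means_ucb_results.tsv", "w+")
for i in range(len(results[0])):
    f.write("\t".join([str(results[j][i]) for j in range(len(results))]) + "\n")
f.close()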
With the reduced difference between the reward returns of the arms, we observe a big deterioration in the performance of the UCB algorithm. The rate of choosing the best arm now approaches only about 0.32, which is similar to what we saw with the Softmax algorithm; the smaller gap in reward functions makes it harder to determine which arm is the best. Note that the random chance of picking the best arm in this case is 1 in 5, or 0.20. It should also be noted that in this scenario the Epsilon Greedy algorithm achieves a higher rate of choosing the best arm, in the range of 0.5 to 0.7.

This also seems to imply that Epsilon Greedy might be better suited than UCB or Softmax for multi-armed bandit situations where the differences in means are much smaller.

[Figure: "UCB: Mean Cumulative Reward from 5000 Simulations, 5 Arms = [4 x 0.8, 1 x 0.9]" (cumulative reward vs. time step)]

Since the arms are close in average returns, the eventual UCB cumulative reward reaches a value of around 210. Compared with always choosing the best arm, which would return 0.9 * 250 = 225, we see a regret of about 15. That might seem small, but as a percentage (roughly 6.67%) it can be considered significant, depending on the application. Taking a look at the overall cumulative regret may provide a better perspective on performance, especially with respect to the other algorithms.
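The article does not show how the regret curve is computed. The sketch below assumes the common definition of regret against always playing the best arm (expected reward 0.9 per step), which is consistent with the quoted numbers, and it reuses the results list from the close-means rerun sketched above; the exact computation in the original notebook may differ.

import numpy as np

num_sims, horizon = 5000, 250

# results[4] holds cumulative rewards, laid out simulation by simulation
cum_rewards = np.array(results[4]).reshape(num_sims, horizon)
mean_cum_reward_close = cum_rewards.mean(axis=0)        # average cumulative reward at each time step
t = np.arange(1, horizon + 1)
cumulative_regret = 0.9 * t - mean_cum_reward_close     # shortfall versus always playing the 0.9 arm

print(cumulative_regret[-1])  # should land near the ~18 reported in the article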
[Figure: "UCB: Mean Cumulative Regret from 5000 Simulations, 5 Arms = [4 x 0.8, 1 x 0.9]" (cumulative regret vs. time step)]

Based on the cumulative regret plots, we see that UCB1 ends with a cumulative regret of about 18, which is similar to the Softmax algorithm. It is also worse off than the Epsilon Greedy algorithm, which ranged from 12.3 to 14.8. The cumulative regret line is also relatively straight, which means the algorithm will continue to accumulate regret over a longer time horizon.

Summary

In this analysis of the UCB algorithm, we broke down its formulation and performed simulation experiments on different sets of arms to illustrate its robustness (or lack thereof). Similar to the Softmax algorithm, a learning takeaway is that for arms with closer means, the UCB algorithm does not seem to be as robust at determining the best arm, and Epsilon Greedy is more suitable in that setting.

For reference on this project on bandit simulation analysis, please refer to this GitHub repo. For quick reference on the actual code, please refer to this Jupyter notebook.
