Multi-Armed Bandit Problem

The document discusses the Upper Confidence Bound (UCB) algorithm, emphasizing its approach of balancing exploration and exploitation in multi-armed bandit problems. It details the UCB1 variant, its mathematical formulation, and provides code examples for implementation and simulation. The analysis includes performance comparisons against other algorithms, highlighting UCB's limitations in scenarios with closely valued arms, where Epsilon Greedy may perform better.
Multi-Armed Bandit Analysis of Upper Confidence Bound Algorithm
Kenneth Foo · Published in Analytics Vidhya · Feb 21, 2020

The Upper Confidence Bound (UCB) algorithm is often described as "optimism in the face of uncertainty". To understand why, consider that at any given round, each arm's reward can be summarised by a point estimate: its observed average rate of reward. Drawing intuition from confidence intervals, we can also attach an uncertainty boundary around each point estimate, so that every arm has both a lower boundary and an upper boundary. The UCB algorithm is aptly named because we are only concerned with the upper bound, given that we are trying to find the arm with the highest reward rate.

There are different variants of the UCB algorithm, but in this article we will look at the UCB1 algorithm. At a given round, after n trials in total, the reward UCB of arm i is:

UCB_i = \bar{x}_i + \sqrt{\frac{2 \ln(n)}{n_i}}

where \bar{x}_i is the current average reward of arm i, n is the number of trials passed so far, and n_i is the number of pulls given to arm i in the playthrough history.

The formulation is simple, yet it has several interesting implications:

- The exploration term is proportional to \sqrt{\ln(n)}, which means that as the experiment progresses, the upper boundaries of all arms slowly grow with \sqrt{\ln(n)}.
- The exploration term is inversely proportional to \sqrt{n_i}. The more often a specific arm has been pulled in the past, the more its confidence boundary shrinks towards its point estimate.

At every round, the UCB algorithm picks the arm with the highest reward UCB as given by the equation above.

Beyond the formulation, here is a simple thought experiment to glean some intuition on how UCB incorporates exploration and exploitation. The different growth rates of the numerator and denominator create a tension between the two: any increase in n raises the UCB only logarithmically, while each additional pull of arm i shrinks its bonus in proportion to 1/\sqrt{n_i}. An arm that has not been explored as often as the others therefore carries a larger bonus component. Depending on its current average, the overall UCB of that arm may exceed that of arms with a higher average but a smaller bonus, and the under-explored arm gets picked.
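To make that tension concrete, here is a small illustrative sketch that is not part of the original article: it evaluates the UCB1 formula for two hypothetical arms, with averages and pull counts invented purely for illustration.

import math

def ucb1_score(avg_reward, pulls, total_pulls):
    # UCB1 score = empirical mean + exploration bonus sqrt(2 ln(n) / n_i)
    return avg_reward + math.sqrt(2 * math.log(total_pulls) / pulls)

# Hypothetical state after 100 total pulls: arm A has the higher average,
# but arm B has been pulled far less often and so carries a larger bonus.
score_a = ucb1_score(avg_reward=0.60, pulls=90, total_pulls=100)
score_b = ucb1_score(avg_reward=0.55, pulls=10, total_pulls=100)

print(round(score_a, 2))  # ~0.92 = 0.60 + sqrt(2 ln(100) / 90)
print(round(score_b, 2))  # ~1.51 = 0.55 + sqrt(2 ln(100) / 10)
# Arm B wins this round despite its lower average, which is exactly the exploration effect.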
The following analysis is based on the book "Bandit Algorithms for Website Optimization" by John Myles White. For easier understanding of the code, I have included comments. Below is the code for the creation of the UCB1 algorithm setup and the progressive updates of counts and values for the arms:

- Counts: the recorded number of times each arm was pulled.
- Values: the known mean reward of each arm. In the case of a Bernoulli arm, the value represents the probability of reward, which ranges from 0 to 1.

import math

class UCB1():
    def __init__(self, counts, values):
        # Counts: pulls per arm. For multiple arms, this is a list of counts.
        self.counts = counts
        # Values: average reward per arm. For multiple arms, this is a list of values.
        self.values = values
        return

    # Initialise k number of arms
    def initialize(self, n_arms):
        self.counts = [0 for col in range(n_arms)]
        self.values = [0.0 for col in range(n_arms)]
        return

    # UCB arm selection based on max of UCB reward of each arm
    def select_arm(self):
        n_arms = len(self.counts)
        # Play every arm once before applying the UCB formula
        for arm in range(n_arms):
            if self.counts[arm] == 0:
                return arm

        ucb_values = [0.0 for arm in range(n_arms)]
        total_counts = sum(self.counts)
        for arm in range(n_arms):
            bonus = math.sqrt((2 * math.log(total_counts)) / float(self.counts[arm]))
            ucb_values[arm] = self.values[arm] + bonus
        return ucb_values.index(max(ucb_values))

    # Update counts and mean reward estimate for the chosen arm
    def update(self, chosen_arm, reward):
        self.counts[chosen_arm] = self.counts[chosen_arm] + 1
        n = self.counts[chosen_arm]

        # Incremental update of the average value/reward for the chosen arm
        value = self.values[chosen_arm]
        new_value = ((n - 1) / float(n)) * value + (1 / float(n)) * reward
        self.values[chosen_arm] = new_value
        return

As per the discussion in previous articles, we will use a Bernoulli distribution to represent the reward function of each arm.

import random

class BernoulliArm():
    def __init__(self, p):
        self.p = p

    # Reward system based on a Bernoulli draw: reward of 1.0 with probability p, else 0.0
    def draw(self):
        if random.random() > self.p:
            return 0.0
        else:
            return 1.0

To proceed with any further analysis, an operational script is required to run the simulation, where:

- num_sims: the number of independent simulations, each of length equal to "horizon".
- horizon: the number of time steps/trials per round of simulation.

def test_algorithm(algo, arms, num_sims, horizon):
    # Initialise variables for the full accumulated simulation (num_sims * horizon)
    chosen_arms = [0.0 for i in range(num_sims * horizon)]
    rewards = [0.0 for i in range(num_sims * horizon)]
    cumulative_rewards = [0.0 for i in range(num_sims * horizon)]
    sim_nums = [0.0 for i in range(num_sims * horizon)]
    times = [0.0 for i in range(num_sims * horizon)]

    for sim in range(num_sims):
        sim = sim + 1
        algo.initialize(len(arms))

        for t in range(horizon):
            t = t + 1
            index = (sim - 1) * horizon + t - 1
            sim_nums[index] = sim
            times[index] = t

            # Selection of best arm and engaging it
            chosen_arm = algo.select_arm()
            chosen_arms[index] = chosen_arm

            # Engage chosen Bernoulli arm and obtain reward info
            reward = arms[chosen_arm].draw()
            rewards[index] = reward

            if t == 1:
                cumulative_rewards[index] = reward
            else:
                cumulative_rewards[index] = cumulative_rewards[index - 1] + reward

            algo.update(chosen_arm, reward)

    return [sim_nums, times, chosen_arms, rewards, cumulative_rewards]

Simulation of Arms with relatively big differences in Means

Similar to the previous analysis for Epsilon Greedy, the simulation comprises the following:

- Create 5 arms, four of which have an average reward of 0.1, while the last/best arm has an average reward of 0.9.
- Save the simulation output to a tab-separated file.
- Run 5000 independent simulations.

In this example, since the UCB algorithm does not have any hyperparameter, we create a single set of 5000 simulations. The choice of 5000 independent simulations is made because we want to determine the average performance. Any single run is subject to stochasticity and its performance might be skewed by random chance, so it is important to run a reasonably high number of simulations to evaluate the average performance.

import random
import numpy as np

random.seed(1)

# Out of 5 arms, 1 arm is clearly the best
means = [0.1, 0.1, 0.1, 0.1, 0.9]
n_arms = len(means)

# Shuffling arms
random.shuffle(means)

# Create list of Bernoulli arms with reward information
arms = list(map(lambda mu: BernoulliArm(mu), means))
print("Best arm is " + str(np.argmax(means)))

f = open("standard_ucb_results.tsv", "w+")

# Create 1 round of 5000 simulations, each with a horizon of 250 time steps
algo = UCB1([], [])
algo.initialize(n_arms)
results = test_algorithm(algo, arms, 5000, 250)

# Store data
for i in range(len(results[0])):
    f.write("\t".join([str(results[j][i]) for j in range(len(results))]) + "\n")
f.close()

Using some data preprocessing and basic Altair visualisation, we can plot the probability of pulling the best arm.
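The article does not show the preprocessing step itself, so here is a minimal sketch of the kind of aggregation that could produce the plotted quantities. It assumes pandas is available; the column names and the best_arm value are my own choices rather than part of the original code, and the actual charts in the article were built with Altair.

import pandas as pd

# Load the simulation output written above; columns follow the order of the results list
cols = ["sim_num", "time_step", "chosen_arm", "reward", "cumulative_reward"]
df = pd.read_csv("standard_ucb_results.tsv", sep="\t", names=cols)

best_arm = 2  # hypothetical: use the index printed by np.argmax(means) in the script above

# For each time step, the fraction of the 5000 simulations that chose the best arm
rate_best = (df["chosen_arm"] == best_arm).groupby(df["time_step"]).mean()

# Mean cumulative reward at each time step, averaged across simulations
mean_cum_reward = df.groupby("time_step")["cumulative_reward"].mean()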
[Figure: "UCB: Mean Rate of Choosing Best Arm from 5000 Simulations, 5 Arms" (rate of choosing best arm vs. time step)]

The UCB algorithm shows extreme fluctuations in its rate of choosing the best arm in the early phase of the experiment, roughly time steps 0 to 60. This can be explained by the emphasis on exploration across all arms, since the UCB bonus components of every arm are much bigger at the start. As the trial progresses, the bonus components shrink for all arms, and the UCB value of each arm converges towards its average reward. The arm with the higher mean therefore becomes more distinguishable to the algorithm and is picked more frequently as the trial progresses. We also observe that the rate of choosing the best arm does not seem to hit a hard asymptote but converges towards 1. The rate of convergence slows as it approaches 1, and the experiment's time horizon was too short for us to observe any further convergence.

[Figure: "UCB: Mean Cumulative Reward from 5000 Simulations, 5 Arms = [4 x 0.1, 1 x 0.9]"]

The cumulative reward plot of the UCB algorithm is comparable to the other algorithms. Although it does not do as well as the best Softmax settings (tau = 0.1 or 0.2), where the cumulative reward went beyond 200, the UCB cumulative reward is close to that range (around 190). We also observe some curvature in the early phase of the trial, which corroborates the extreme fluctuations we saw in the rate of choosing the best arm. Likewise, as the experiment progresses, the algorithm distinguishes the best arm and picks it with higher frequency, so the cumulative reward plot settles into a straight line whose gradient should approximate 0.9, based on consistently choosing the best arm.

Simulation of Arms with relatively smaller differences in Means

The previous analysis was a simulation exercise on arms with big differences in reward returns. We now extend the analysis to a situation where the arms are relatively closer: we simulate 5 arms, 4 of which have a mean of 0.8, while the last/best arm has a mean of 0.9. The rerun is sketched below.
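The article does not reprint the script for this second scenario, but it is presumably identical to the earlier one with only the list of means changed. A minimal sketch of that rerun, assuming the class definitions and test_algorithm above are in scope; the output file name is my own choice.

# Close-means scenario: 4 arms at 0.8 and the best arm at 0.9
random.seed(1)
means = [0.8, 0.8, 0.8, 0.8, 0.9]
n_arms = len(means)
random.shuffle(means)

arms = list(map(lambda mu: BernoulliArm(mu), means))
print("Best arm is " + str(np.argmax(means)))

algo = UCB1([], [])
algo.initialize(n_arms)
results = test_algorithm(algo, arms, 5000, 250)

# Store data (hypothetical file name for the close-means run)
f = open("close_means_ucb_results.tsv", "w+")
for i in range(len(results[0])):
    f.write("\t".join([str(results[j][i]) for j in range(len(results))]) + "\n")
f.close()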
With the reduced difference between the reward returns of the arms, we observe a big deterioration in the performance of the UCB algorithm. The rate of choosing the best arm now approaches only about 0.32, which is similar to what we saw with the Softmax algorithm; the smaller gap in reward functions makes it harder to determine which arm is the best. Note that the random chance of picking the best arm in this case is 1 in 5, or 0.20. It should also be noted that in this scenario the Epsilon Greedy algorithm achieves a higher rate of choosing the best arm, in the range of 0.5 to 0.7.

This also seems to imply that Epsilon Greedy might be better suited than UCB or Softmax for multi-armed bandit situations where the differences in means are much smaller.

[Figure: "UCB: Mean Cumulative Reward from 5000 Simulations, 5 Arms = [4 x 0.8, 1 x 0.9]" (cumulative reward vs. time step)]

Since the arms are close in average returns, the eventual UCB cumulative reward reaches a value of around 210. Compared with always choosing the best arm, which would return 0.9 * 250 = 225, we see a regret of about 15. That might seem small, but as a percentage (roughly 6.67%) it can be considered significant, depending on the application. Taking a look at the overall cumulative regret may provide a better perspective on performance, especially with respect to the other algorithms.
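The article does not show how the regret curve is computed. The sketch below assumes the common definition of regret against always playing the best arm (expected reward 0.9 per step), which is consistent with the quoted numbers, and it reuses the results list from the close-means rerun sketched above; the exact computation in the original notebook may differ.

import numpy as np

num_sims, horizon = 5000, 250

# results[4] holds cumulative rewards, laid out simulation by simulation
cum_rewards = np.array(results[4]).reshape(num_sims, horizon)
mean_cum_reward_close = cum_rewards.mean(axis=0)        # average cumulative reward at each time step
t = np.arange(1, horizon + 1)
cumulative_regret = 0.9 * t - mean_cum_reward_close     # shortfall versus always playing the 0.9 arm

print(cumulative_regret[-1])  # should land near the ~18 reported in the article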
[Figure: "UCB: Mean Cumulative Regret from 5000 Simulations, 5 Arms = [4 x 0.8, 1 x 0.9]" (cumulative regret vs. time step)]

Based on the cumulative regret plots, we see that UCB1 ends with a cumulative regret of about 18, which is similar to the Softmax algorithm. It is also worse off than the Epsilon Greedy algorithm, which ranged from 12.3 to 14.8. The cumulative regret line is also relatively straight, which means the algorithm will continue to accumulate regret over a longer time horizon.

Summary

In this analysis of the UCB algorithm, we broke down its formulation and performed simulation experiments on different sets of arms to illustrate its robustness (or lack thereof). Similar to the Softmax algorithm, a learning takeaway is that for arms with closer means, the UCB algorithm does not seem to be as robust at determining the best arm, and Epsilon Greedy is more suitable in that setting.

For reference on this project on bandit simulation analysis, please refer to this GitHub repo. For quick reference on the actual code, please refer to this Jupyter notebook.
