AI Report Endsem
Abstract—This submission addresses multiple reinforcement learning tasks. First, it reviews and analyzes the MENACE system by Donald Michie, identifying key implementation components and attempting to recreate it in code. Next, it explores a binary bandit problem using an epsilon-greedy algorithm to optimize action selection. Finally, a 10-armed bandit with non-stationary rewards is developed, and a modified epsilon-greedy agent is used to track evolving rewards, with its performance evaluated over 10,000 time steps. These tasks apply core reinforcement learning principles to adaptive decision-making in dynamic environments.

I. GITHUB LINK

GitHub link for all code: GitHub Link (click here).

II. PROBLEM STATEMENT

The task is to read the reference on MENACE by Donald Michie and examine its implementations. The goal is to pick the implementation that you find most compelling, carefully analyze the code, and highlight the key sections that contribute significantly to its function. If feasible, an attempt should be made to code the MENACE system in a programming language of your choice.

III. MENACE OVERVIEW

MENACE learns noughts and crosses through reinforcement: the outcome of each game strengthens or weakens the moves that were played, changing the likelihood of that move being chosen again in similar game situations.

IV. COMPUTER SIMULATION PROGRAM

In the computerized version of MENACE, two key differences arise:
• Backwards Stage Evaluation: Moves are evaluated starting from the end of the game and working backward, making decisions in later stages more crucial as the game progresses.
• Odds System: Move probabilities are adjusted using odds, defined as

Odds = p / (1 - p),

where p represents the probability of selecting a particular move.

V. REINFORCEMENT ADJUSTMENT

The reinforcement system modifies the odds after each game based on the outcome (a sketch of this update appears after the list):
• Victory: Odds are multiplied by a reinforcement multiplier M_n.
• Defeat: Odds are divided by M_n.
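As an illustration of this odds-based update, the following Python sketch (not Michie's or the report's original code; the function names and the example multiplier value are assumptions) converts a move's selection probability to odds, scales the odds after a win or a loss, and converts back to a probability.

```python
def prob_to_odds(p):
    """Convert a selection probability p into odds p / (1 - p)."""
    return p / (1.0 - p)

def odds_to_prob(odds):
    """Convert odds back into a probability."""
    return odds / (1.0 + odds)

def update_move_probability(p, won, multiplier):
    """Reinforce or punish a move: multiply the odds by the stage-dependent
    multiplier M_n after a victory, divide after a defeat."""
    odds = prob_to_odds(p)
    odds = odds * multiplier if won else odds / multiplier
    return odds_to_prob(odds)

# Example: a move played with probability 0.25 in a game MENACE won,
# using an assumed reinforcement multiplier of 3 for that stage (illustrative only).
print(update_move_probability(0.25, won=True, multiplier=3.0))   # ~0.5
print(update_move_probability(0.25, won=False, multiplier=3.0))  # ~0.1
```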
Abstract—This assignment explores the application of Markov Decision Processes (MDPs) and dynamic programming in solving real-world problems. Specifically, it involves solving the Gbike bicycle rental problem by formulating a finite MDP, applying policy iteration, and using the value iteration method. Additionally, we investigate the impact of varying reward functions in a Markov Decision Process environment.

I. INTRODUCTION

Markov Decision Processes (MDPs) provide a mathematical framework for sequential decision-making under uncertainty, where decisions impact the future state of the environment. This study investigates the Gbike bicycle rental problem, where we manage two locations for renting bicycles, and the goal is to formulate an MDP to optimize the decision-making process. We utilize dynamic programming, specifically value iteration and policy iteration, to find the optimal policy that maximizes the expected return from bicycle rentals.
II. PROBLEM FORMULATION

A. Markov Decision Process (MDP)

In the Gbike problem, we consider a finite MDP with the following components (a small sketch of the Poisson rental model appears after this list):
• States: The state is represented by the number of bikes at each location at the end of the day. This is a tuple of the number of bikes at each location, constrained by the maximum number of bikes that can be parked at a location (20 bikes per location).
• Actions: The action corresponds to the number of bikes moved from one location to another overnight, with a maximum of 5 bikes moved per night.
• Rewards: The reward depends on whether bikes are available for rent. If a bike is rented, the reward is INR 10. If no bike is available, the reward is 0. Additionally, moving bikes incurs a cost of INR 2 per bike, and a parking fee of INR 4 is charged if more than 10 bikes are parked at a location.
• Transition Probability: The transitions between states are probabilistic, influenced by random arrival and return rates of bikes, which follow Poisson distributions. The transitions are influenced by the action taken and the number of bikes rented or returned.
• Discount Factor: The discount factor is set to 0.9, considering the long-term nature of the business.
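As a rough sketch of how the Poisson rental model and the INR 10 rental reward could be encoded, the snippet below computes the expected one-day rental revenue at a single location. The demand rate `lam_requests` and the helper names are assumptions for illustration, not values or code from the report.

```python
from math import exp, factorial

RENT_REWARD = 10   # INR earned per bike rented (from the problem statement)
MAX_BIKES = 20     # capacity per location

def poisson_pmf(n, lam):
    """Probability of exactly n events under a Poisson(lam) distribution."""
    return (lam ** n) * exp(-lam) / factorial(n)

def expected_rental_revenue(bikes_available, lam_requests):
    """Expected INR earned in one day at a location that starts with
    `bikes_available` bikes, when rental requests are Poisson(lam_requests).
    Requests beyond the available bikes earn nothing."""
    expected = 0.0
    for requests in range(MAX_BIKES + 1):            # truncate the negligible tail
        rented = min(requests, bikes_available)
        expected += poisson_pmf(requests, lam_requests) * RENT_REWARD * rented
    return expected

# Example with an assumed demand rate of 3 requests/day (not stated in the report):
print(round(expected_rental_revenue(bikes_available=5, lam_requests=3), 2))
```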
B. Finite MDP Formulation

The finite MDP can be described by the Bellman optimality equation:

V(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V(s') ]

where:
• V(s) is the value function at state s,
• T(s, a, s') is the transition probability from state s to s' given action a,
• R(s, a, s') is the immediate reward for transitioning from s to s' under action a,
• γ is the discount factor.
A one-state backup illustrating this equation is sketched below.
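To make this backup concrete, here is a minimal Python sketch of the right-hand side of the equation for a single state. The nested dictionary layout `T[s][a]` holding `(next_state, probability, reward)` triples and the toy numbers are assumptions for illustration, not the report's implementation.

```python
GAMMA = 0.9  # discount factor from the problem formulation

def bellman_optimality_backup(state, V, T):
    """One Bellman optimality backup:
    V(s) <- max_a sum_{s'} T(s,a,s') [ R(s,a,s') + gamma * V(s') ].
    `T[state][action]` is assumed to be a list of (next_state, prob, reward)."""
    best = float("-inf")
    for action, outcomes in T[state].items():
        q = sum(prob * (reward + GAMMA * V[next_state])
                for next_state, prob, reward in outcomes)
        best = max(best, q)
    return best

# Tiny hypothetical two-state example (numbers are illustrative only).
T = {"s0": {"move1": [("s1", 1.0, 10.0)], "stay": [("s0", 1.0, 0.0)]}}
V = {"s0": 0.0, "s1": 5.0}
print(bellman_optimality_backup("s0", V, T))  # max(10 + 0.9*5, 0) = 14.5
```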
III. SOLUTION APPROACH: POLICY ITERATION

We apply policy iteration to solve the Gbike problem. Policy iteration involves two main steps:
1) Policy Evaluation: Given a policy, we calculate the value function for each state.
2) Policy Improvement: Given the value function, we improve the policy by selecting the action that maximizes the expected return at each state.
The process repeats until the policy converges to the optimal policy.

A. Modified Gbike Problem

The modified Gbike problem introduces an additional constraint: one employee at the first location can shuttle one bike to the second location for free. This change reduces the cost of moving a bike between locations and impacts the policy iteration calculations. Additionally, a penalty of INR 4 is imposed if more than 10 bikes are parked overnight at any location, influencing the reward function. A sketch of the resulting moving-cost and parking-penalty calculation follows.
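The following fragment is a hedged sketch of how the modified cost terms could be computed: the first bike shuttled from location 1 to location 2 each night is free, every other move costs INR 2, and any location holding more than 10 bikes overnight incurs the INR 4 penalty. The function and variable names are illustrative, not the report's code.

```python
MOVE_COST = 2        # INR per bike moved overnight
PARK_PENALTY = 4     # INR if more than 10 bikes stay overnight at a location
FREE_SHUTTLE = 1     # one free bike from location 1 to location 2 (modified problem)

def overnight_cost(bikes_moved_1_to_2, bikes_after_loc1, bikes_after_loc2):
    """Total overnight cost under the modified Gbike rules (a sketch)."""
    # Moving cost: one bike shuttled from location 1 to location 2 is free.
    if bikes_moved_1_to_2 > 0:
        paid_moves = max(bikes_moved_1_to_2 - FREE_SHUTTLE, 0)
    else:
        paid_moves = abs(bikes_moved_1_to_2)  # moves from location 2 to 1 are all paid
    cost = MOVE_COST * paid_moves
    # Parking penalty applies to each location holding more than 10 bikes overnight.
    for bikes in (bikes_after_loc1, bikes_after_loc2):
        if bikes > 10:
            cost += PARK_PENALTY
    return cost

# Moving 3 bikes from location 1 to 2, ending with 12 and 8 bikes respectively:
print(overnight_cost(3, 12, 8))  # 2*(3-1) + 4 = 8
```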
B. Implementation of Policy Iteration

To implement policy iteration, we proceed as follows (a compact sketch of this loop appears after the list):
• Initialize a random policy.
• Evaluate the policy using the Bellman equation.
• Improve the policy based on the value function.
• Iterate the policy evaluation and improvement steps until convergence.
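Below is a minimal, generic policy iteration loop in Python, reusing the assumed `T[s][a]` representation of `(next_state, probability, reward)` triples from the earlier sketch. It illustrates the procedure described above and is not the report's actual Gbike implementation.

```python
GAMMA, THETA = 0.9, 1e-6  # discount factor and evaluation tolerance (tolerance assumed)

def policy_evaluation(policy, V, T):
    """Iteratively evaluate V for a fixed policy until values stop changing."""
    while True:
        delta = 0.0
        for s in T:
            v_new = sum(p * (r + GAMMA * V[s2]) for s2, p, r in T[s][policy[s]])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < THETA:
            return V

def policy_improvement(V, T):
    """Return the greedy policy with respect to the current value function."""
    return {s: max(T[s], key=lambda a: sum(p * (r + GAMMA * V[s2])
                                           for s2, p, r in T[s][a]))
            for s in T}

def policy_iteration(T):
    """Alternate evaluation and improvement until the policy is stable."""
    V = {s: 0.0 for s in T}
    policy = {s: next(iter(T[s])) for s in T}      # arbitrary initial policy
    while True:
        V = policy_evaluation(policy, V, T)
        new_policy = policy_improvement(V, T)
        if new_policy == policy:                   # converged to the optimal policy
            return policy, V
        policy = new_policy
```

For the Gbike problem, the states would be all pairs of bike counts (0 to 20) at the two locations, and the actions the number of bikes moved overnight (at most 5 per night).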
IV. RESULTS
After applying policy iteration to the Gbike problem, we
observe the following:
• The optimal policy maximizes the expected profit by ap-
propriately balancing bike movement between locations
and minimizing parking penalties.
• The introduction of the free bike shuttle service reduced
the overall transportation cost and improved the policy.
• The number of bikes parked at each location remained
below 10 in most states, preventing parking penalties.
V. DISCUSSION
The results demonstrate that policy iteration effectively
solves the Gbike bicycle rental problem. By incorporating
stochastic factors like Poisson distributions for bike rentals
and returns, the policy was adapted to real-world uncertainties.
The free bike shuttle service improved operational efficiency
by reducing transportation costs. Future work could explore
scaling the approach to additional locations and more complex state spaces.
VI. CONCLUSION
This assignment provided a comprehensive understanding of
Markov Decision Processes and their application to dynamic
decision-making problems. By solving the Gbike bicycle rental
problem using policy iteration, we demonstrated the power
of MDPs in managing real-world logistics with uncertainty.
The methodology can be applied to various domains requiring
sequential decision-making.