Tutorial Questions (Annexure I)
Q1 (CO 1, BTL 2): Develop an agent that can interact with a Multi-Armed Bandit environment, explore the different arms, and gradually converge to the arm that provides the highest average reward. The agent should learn to make decisions that maximize the cumulative reward over time by effectively balancing exploration and exploitation.
Ans: To develop this agent:
Initialize estimated values for each arm.
Use an ε-greedy strategy:
o With probability ε, explore (choose a random arm).
o With probability 1−ε, exploit (choose the arm with the highest estimated value).
After each pull, update the estimated value using:
Q_new(a) = Q_old(a) + α [R − Q_old(a)]
This balances exploration and exploitation so that the agent gradually favors the best arm.
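A minimal Python sketch of such an agent (the arm reward distributions, ε = 0.1, and α = 0.1 below are illustrative assumptions, not part of the question):

```python
import random

class EpsilonGreedyAgent:
    """epsilon-greedy agent for a k-armed bandit (illustrative sketch)."""
    def __init__(self, k, epsilon=0.1, alpha=0.1):
        self.k = k
        self.epsilon = epsilon      # exploration probability
        self.alpha = alpha          # step size for the incremental update
        self.Q = [0.0] * k          # estimated value of each arm

    def select_arm(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if random.random() < self.epsilon:
            return random.randrange(self.k)
        return max(range(self.k), key=lambda a: self.Q[a])

    def update(self, arm, reward):
        # Q_new(a) = Q_old(a) + alpha * (R - Q_old(a))
        self.Q[arm] += self.alpha * (reward - self.Q[arm])

# Illustrative environment: each arm pays a Gaussian reward with a hidden mean.
true_means = [0.2, 0.5, 0.8]            # assumed arm means (arm 2 is best)
agent = EpsilonGreedyAgent(k=3)
for t in range(1000):
    arm = agent.select_arm()
    reward = random.gauss(true_means[arm], 1.0)
    agent.update(arm, reward)
print("Estimated values:", agent.Q)     # estimates should approach true_means
```

Using a constant step size α weights recent rewards more heavily, which also keeps the agent responsive if the arm payoffs drift over time.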
A. Devise three example tasks of your own that fit into the MDP framework, identifying for each its states, actions, and rewards. Make the three examples as different from each other as possible. The framework is abstract and flexible and can be applied in many ways. Stretch its limits in some way in at least one of your examples.
Ans: Three Example MDPs:
1. Autonomous Vacuum Cleaner
o States: Room layout and dust status.
o Actions: Move, clean.
o Rewards: +1 for cleaning, −1 for bumping wall.
2. Stock Trading Bot
o States: Current stock prices and portfolio.
o Actions: Buy, sell, hold.
o Rewards: Profit/loss at each step.
3. Dynamic Game NPC Behavior
o States: Player proximity and health.
o Actions: Attack, defend, hide.
o Rewards: +1 for damage, −1 for getting hit.
A simple two-state MDP (states s1, s2; actions a1, a2) can be specified as follows:
Transition probabilities:
o P(s1∣s1,a1) = 0.8, P(s2∣s1,a1) = 0.2
o P(s1∣s2,a2) = 0.4, P(s2∣s2,a2) = 0.6
Rewards:
o R(s1,a1) = 5, R(s2,a2) = 10
Discount factor: γ = 0.9
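The exercise statement attached to this specification is not shown here, so the following is only a minimal sketch assuming the intended task is to compute state values under the given transitions, rewards, and γ by iterating the Bellman update (the iteration limit and tolerance are illustrative):

```python
# Two-state MDP from the specification above.
# Only one action is specified per state (a1 in s1, a2 in s2),
# so the iteration evaluates that fixed behaviour.
gamma = 0.9
# P[s][s_next]: transition probabilities for the specified action in s
P = {"s1": {"s1": 0.8, "s2": 0.2},
     "s2": {"s1": 0.4, "s2": 0.6}}
R = {"s1": 5.0, "s2": 10.0}           # R(s1, a1) = 5, R(s2, a2) = 10

V = {"s1": 0.0, "s2": 0.0}
for _ in range(1000):                  # repeat the Bellman update until it converges
    V_new = {s: R[s] + gamma * sum(P[s][sn] * V[sn] for sn in V) for s in V}
    if max(abs(V_new[s] - V[s]) for s in V) < 1e-6:
        V = V_new
        break
    V = V_new
print(V)   # converged state values under the specified transitions and rewards
```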
Q9 (CO 3, BTL 3): Driving Home: Each day as you drive home from work, you try to predict how long it will take to get home. When you leave your office, you note the time, the day of week, the weather, and anything else that might be relevant. Say on this Friday you are leaving at exactly 6 o'clock, and you estimate that it will take 30 minutes to get home. As you reach your car it is 6:05, and you notice it is starting to rain. Traffic is often slower in the rain, so you reestimate that it will take 35 minutes from then, or a total of 40 minutes. Fifteen minutes later you have completed the highway portion of your journey in good time. As you exit onto a secondary road you cut your estimate of total travel time to 35 minutes. Unfortunately, at this point you get stuck behind a slow truck, and the road is too narrow to pass. You end up having to follow the truck until you turn onto the side street where you live at 6:40. Three minutes later you are home. The sequence of states, times, and predictions is thus as follows:
Use the Monte Carlo method to plot the predicted total time.
Ans:
Track the states (6:00, 6:05, etc.).
Track the actual return (total time = 43 minutes).
First-visit MC: average the returns observed for each state.
Plot the predictions vs. the actual total time over the episode to visualize convergence.
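A minimal sketch of this plot, using only the predictions stated in the narrative (30, 40, and 35 minutes) against the actual 43-minute outcome; the step size α = 0.5 and the state labels are illustrative assumptions:

```python
import matplotlib.pyplot as plt

# Predicted total travel time at each state, taken from the narrative above
states = ["leave office (6:00)", "reach car, raining (6:05)", "exit highway (6:20)"]
predicted_total = [30, 40, 35]
actual_total = 43            # the episode's return: the drive actually took 43 minutes

# First-visit Monte Carlo update: move each state's estimate toward the actual return
alpha = 0.5                  # illustrative step size
updated = [v + alpha * (actual_total - v) for v in predicted_total]

plt.plot(states, predicted_total, marker="o", label="predicted total time")
plt.plot(states, updated, marker="s", label="after MC update")
plt.axhline(actual_total, linestyle="--", label="actual outcome (43 min)")
plt.ylabel("minutes")
plt.xticks(rotation=20, ha="right")
plt.legend()
plt.tight_layout()
plt.show()
```

Because Monte Carlo waits for the final outcome, every state's prediction is pulled toward 43 only after the episode ends.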
The policy is updated using the term ∇θ log π(a∣s) (r + γV(s′) − V(s)), i.e. the log-policy gradient weighted by the one-step TD error.
Effective for real-time adaptation in dynamic environments.
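The question this fragment answers is not shown, so the sketch below only illustrates how such an update term is typically applied in a one-step actor-critic loop; the linear critic, feature vectors, and step sizes are assumptions:

```python
import numpy as np

def actor_critic_step(theta, w, s_feat, s_next_feat, grad_log_pi, r,
                      gamma=0.9, alpha_theta=0.01, alpha_w=0.1):
    """One online actor-critic update with a linear critic V(s) = w @ s_feat."""
    td_error = r + gamma * (w @ s_next_feat) - (w @ s_feat)   # r + γV(s') − V(s)
    w = w + alpha_w * td_error * s_feat                        # critic: move V(s) toward the target
    theta = theta + alpha_theta * td_error * grad_log_pi       # actor: ∇θ log π(a|s) × TD error
    return theta, w

# Illustrative call with random 4-dimensional feature vectors (made up for the sketch)
rng = np.random.default_rng(0)
theta, w = np.zeros(4), np.zeros(4)
theta, w = actor_critic_step(theta, w, rng.normal(size=4), rng.normal(size=4),
                             rng.normal(size=4), r=1.0)
```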