Results / Interpretation
Figure 1: Reward per episode plotted during training. The increase in average reward
indicates effective learning and convergence of the agent toward a stable policy.
Figure 2: Evaluation result showing the agent achieving the maximum possible reward
consistently across all 10 test episodes.
Figure 3: Snapshot of the CartPole agent during execution. The agent maintains balance
and avoids episode termination by executing a stable policy.
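For reference, the rendered behavior shown in Figure 3 can be reproduced with a short rollout script. The sketch below assumes the trained model was saved to a file named "dqn_cartpole"; that file name is hypothetical and not taken from the report.

    import gymnasium as gym
    from stable_baselines3 import DQN

    # Create the environment with on-screen rendering enabled.
    env = gym.make("CartPole-v1", render_mode="human")
    model = DQN.load("dqn_cartpole", env=env)  # hypothetical file name

    # Roll out one episode with the greedy (deterministic) policy.
    obs, _ = env.reset()
    done = False
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
    env.close()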
4.4 Observations
Based on the training logs, evaluation metrics, and rendered gameplay, the following
observations were made:
- The agent successfully learned a near-optimal policy to solve the CartPole-v1 task.
- The reward curve shows consistent convergence with minimal oscillations or instability.
- The evaluation confirms a 100% episode success rate, with the maximum allowable
reward (500) achieved in each test run.
- Visual inspection validates the effectiveness of the learned policy from a behavioral
standpoint.
- The training setup, though relatively lightweight, was sufficient to yield a high-
performing RL agent in a constrained control setting.
These results collectively demonstrate that Deep Q-Networks, when applied correctly
with tuned hyperparameters, are effective at solving classic control problems like
CartPole.
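As a concrete illustration of such a setup, the following is a minimal training sketch using Gymnasium and Stable Baselines3. The hyperparameter values are illustrative rather than the project's actual tuned configuration, and the log directory and model file names are hypothetical.

    import gymnasium as gym
    from stable_baselines3 import DQN

    env = gym.make("CartPole-v1")

    model = DQN(
        "MlpPolicy",
        env,
        learning_rate=1e-3,            # illustrative value, not the project's tuned setting
        buffer_size=50_000,            # experience replay buffer size
        learning_starts=1_000,         # random steps collected before updates begin
        batch_size=64,
        gamma=0.99,
        train_freq=4,
        target_update_interval=500,    # how often the separate target network is refreshed
        exploration_fraction=0.1,      # fraction of training spent annealing epsilon
        exploration_final_eps=0.05,
        tensorboard_log="./dqn_cartpole_tb/",  # hypothetical log directory
        verbose=1,
    )

    model.learn(total_timesteps=100_000)   # 100,000 training steps, as used in this project
    model.save("dqn_cartpole")             # hypothetical file name

Training progress can then be inspected in TensorBoard by pointing it at the chosen log directory.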
5. Conclusion
This project applied the Deep Q-Network (DQN) algorithm to the CartPole-v1 control
task using a structured reinforcement learning approach. The problem was framed as a
Markov Decision Process (MDP), and the agent's objective was to learn an optimal
control policy for balancing the pole through trial and error using Q-learning with
function approximation.
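For reference, the standard Q-learning target used with function approximation in DQN is

    y = r + \gamma \max_{a'} Q_{\theta^-}(s', a'),
    \qquad
    L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \big[ (y - Q_\theta(s, a))^2 \big]

where the expectation is over transitions sampled from the replay buffer D and theta^- denotes the parameters of the target network, which are updated only periodically.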
The training phase used a neural network to approximate Q-values for state-action
pairs, trained on minibatches sampled from an experience replay buffer, with targets
computed by a separate, periodically updated target network. Over 100,000 training
steps, the agent progressively improved its performance, as evidenced by the
TensorBoard reward graphs and evaluation metrics.
The final model consistently achieved the maximum reward of 500 in every test episode
with zero standard deviation, confirming both convergence and stability. Visual rendering
further supported the effectiveness of the learned policy, showing the agent keeping the
pole balanced for the full length of each rendered episode.
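The evaluation result quoted above (mean reward of 500 with zero standard deviation over 10 test episodes) corresponds to the kind of check sketched below using Stable Baselines3's evaluate_policy helper; the model file name is again hypothetical.

    import gymnasium as gym
    from stable_baselines3 import DQN
    from stable_baselines3.common.evaluation import evaluate_policy

    env = gym.make("CartPole-v1")
    model = DQN.load("dqn_cartpole", env=env)  # hypothetical file name

    # Deterministic evaluation over 10 test episodes, as in Figure 2.
    mean_reward, std_reward = evaluate_policy(
        model, env, n_eval_episodes=10, deterministic=True
    )
    print(f"mean reward: {mean_reward:.1f} +/- {std_reward:.1f}")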
This work successfully aligns with the objectives of the Reinforcement Learning course.
It demonstrates:
- The transition from theoretical understanding to hands-on application,
- Familiarity with popular libraries like Gymnasium and Stable Baselines3,
- Mastery of monitoring, debugging, and evaluating RL models using tools such as
TensorBoard.