Self-Driving Car Racing: Application of Deep Reinforcement Learning
Reinforcement Learning
1.2 Innovativeness
The application of RL to car racing is relatively unexplored compared to other control tasks. We aim to demonstrate the effectiveness of RL algorithms in a challenging and dynamic scenario.
We emphasize responsible AI considerations, including designing reward functions that prioritize safety and exploring techniques for interpretability and transparency in the learned policy. We explore advanced RL algorithms such as DQN, PPO, and transfer learning integration to determine which yields the best results.
Action space
In the discrete setting there are 5 actions: 0 = do nothing, 1 = full steer left, 2 = full steer right, 3 = full gas, 4 = full brake, each represented as an integer as indicated.
In the continuous setting there are 3 actions: steering (-1 is full left, +1 is full right), gas, and braking, represented as Box([-1, 0, 0], 1.0, (3,), float32), a 3-dimensional array where action[0] = steering direction, action[1] = % gas pedal, and action[2] = % brake pedal.
Figure 1: Game environment
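As a concrete illustration, the sketch below instantiates both action-space variants. It assumes the Gymnasium CarRacing-v2 environment and its continuous flag, which match the Box specification quoted above; the environment name and API details are assumptions and may differ in other setups.

```python
# Minimal sketch (assuming Gymnasium's CarRacing-v2) showing the two action-space variants.
import gymnasium as gym

# Discrete variant: integer actions 0-4 (nothing, left, right, gas, brake).
env_discrete = gym.make("CarRacing-v2", continuous=False)
print(env_discrete.action_space)       # Discrete(5)

# Continuous variant: [steering, gas, brake] in a Box([-1, 0, 0], 1.0, (3,), float32).
env_continuous = gym.make("CarRacing-v2", continuous=True)
print(env_continuous.action_space)

obs, info = env_continuous.reset(seed=0)
# 20% gas, no steering, no brake
obs, reward, terminated, truncated, info = env_continuous.step([0.0, 0.2, 0.0])
```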
Value-based methods
This method tries to approximate the optimal action-value function (Q-function), given by

$$Q^*(s, a) = \max_{\pi} \mathbb{E}\left[ R_t \mid s_t = s,\ a_t = a,\ \pi \right],$$

which assesses the expected return of taking a certain action in a given state. The agent then selects the action with the highest expected return according to the Q-function. The models implemented here include the Deep Q-Network (DQN) and our customized variants, i.e. a ResNet transfer-learning variant and an LSTM-ResNet variant.
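To make the action-selection rule concrete, the hedged sketch below shows ε-greedy selection over the five discrete actions; q_network and state are hypothetical placeholders for a PyTorch Q-network and a preprocessed observation tensor, not the project's exact code.

```python
# A minimal sketch of value-based action selection under an epsilon-greedy policy,
# assuming `q_network` maps a state tensor to one Q-value per discrete action.
import random
import torch

def select_action(q_network, state, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(5)                  # 5 discrete actions in the environment
    with torch.no_grad():
        q_values = q_network(state.unsqueeze(0))    # shape: (1, 5)
        return int(q_values.argmax(dim=1).item())   # action with the highest expected return
```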
Policy-based methods
This method directly parameterizes and learns the policy that maps states to actions without explicitly
learning a value function. The model implemented here is Proximal Policy Optimization.
Full algorithm
The core of the DQN algorithm is encapsulated in its loss function for the training of the Q-network. The
loss function, Li (θi ), quantifies the difference between the predicted Q-value and the target Q-value. It’s
given by the mean squared error [10]:

$$L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\!\left[\left(y_i - Q(s, a; \theta_i)\right)^2\right], \qquad \text{with target Q-value } \ y_i = \mathbb{E}_{s' \sim \mathcal{E}}\!\left[\, r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \,\middle|\, s, a \right]$$
To optimize the Q-function, stochastic gradient descent, specifically RMSprop, is applied to this loss function (equation 3 of the full DQN algorithm in [10]).
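The sketch below illustrates how this loss can be computed in PyTorch. It is only an illustrative sketch: online_net, target_net, and the batch layout are assumptions, not the exact implementation used in this project.

```python
# Minimal sketch of the DQN loss above, assuming `online_net` and `target_net`
# are Q-networks and the batch of transitions comes from a replay buffer.
import torch
import torch.nn.functional as F

def dqn_loss(online_net, target_net, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch   # dones: float tensor of 0/1
    # Q(s, a; theta_i) for the actions actually taken
    q_pred = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Target y_i = r + gamma * max_a' Q(s', a'; theta_{i-1}), using the frozen target network
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        y = rewards + gamma * (1.0 - dones) * q_next
    # Mean squared error between prediction and target
    return F.mse_loss(q_pred, y)

# RMSprop, as used in the text, would then be applied to this loss, e.g.:
# optimizer = torch.optim.RMSprop(online_net.parameters(), lr=2.5e-4)
```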
Advantages of DQN
DQN can handle large state spaces with raw sensory inputs, such as images or other complex state representations. The target network provides a stable target for the online network to learn from, while experience replay reduces the correlation between consecutive samples, breaking temporal dependencies and stabilizing learning [9].
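A minimal replay buffer illustrating the mechanism described above might look as follows; the capacity and transition layout are illustrative assumptions, not the project's exact configuration.

```python
# Minimal sketch of an experience replay buffer, the mechanism described above
# for breaking temporal correlations between consecutive samples.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are discarded automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling decorrelates the minibatch from the trajectory order
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```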
Here, πθ represents a stochastic policy, and Ât is an estimate of the advantage function at time step t.
The expectation Êt [·] denotes the empirical average over a finite batch of samples, within an algorithm that
iterates between sampling and optimization. Implementations employing automatic differentiation software
create an objective function whose gradient yields the policy gradient estimator. The estimator ĝ is derived
by differentiating the objective:
$$L^{PG}(\theta) = \hat{\mathbb{E}}_t\!\left[ \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \right]$$
While it may seem enticing to perform multiple optimization steps on this loss $L^{PG}$ using the same trajectory, such an approach lacks sufficient justification. Empirically, it often results in excessively large policy updates.
TRPO maximizes a surrogate objective while adhering to a constraint on the magnitude of the policy update.
This optimization problem is formulated as:
$$\max_\theta \; \hat{\mathbb{E}}_t\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, \hat{A}_t \right] \quad \text{subject to} \quad \hat{\mathbb{E}}_t\!\left[ \mathrm{KL}\!\left[ \pi_{\theta_{\text{old}}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t) \right] \right] \le \delta$$
Here, θold represents the vector of policy parameters before the update. The theory behind TRPO in fact suggests using a penalty instead of a strict constraint, i.e. solving an unconstrained optimization problem, to maintain monotonic improvement. However, selecting a penalty coefficient (β) that generalizes across different problems is challenging.
PPO addresses the limitations of TRPO by introducing a clipped surrogate objective:
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[ \min\!\left( r_t(\theta)\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right]$$
where $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the probability ratio and $\epsilon$ is a hyperparameter (e.g., $\epsilon = 0.2$). This objective ensures that policy updates remain within a
reasonable range by constraining the probability ratio. By choosing the minimum between the clipped and
unclipped objectives, PPO maintains a lower bound on the unclipped objective, thus penalizing excessively
large updates.
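For illustration, the clipped surrogate objective translates into a few lines of PyTorch; the tensor names here (log_probs, old_log_probs, advantages) are assumptions about how rollout data is batched, not the project's exact code.

```python
# Minimal sketch of the clipped surrogate objective L^CLIP, with epsilon = 0.2 as in the text.
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, epsilon=0.2):
    ratio = torch.exp(log_probs - old_log_probs)                          # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Negate because optimizers minimize, while L^CLIP is maximized
    return -torch.min(unclipped, clipped).mean()
```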
Advantages: PPO offers simplicity in implementation, greater generality, and improved empirical stability
and data efficiency. Notably, it performs well on continuous action spaces (where DQN struggles [15]) and
doesn’t require extensive hyperparameter tuning.
Eventually, the algorithm took 15 hours to reach its maximum average performance of 910.12 after 1.45 million time steps, and oscillated afterwards. The full training process is shown in Figure 5.
We changed the preprocessing stage to take in all three RGB color channels to fit the ResNet-18 input. The model now processes one image frame at a time.
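A sketch of this transfer-learning setup is given below, assuming a torchvision ResNet-18 pretrained on ImageNet with its classifier head replaced by a linear layer over the five discrete actions; the head size and preprocessing shown are illustrative assumptions rather than our exact configuration.

```python
# Minimal sketch of a DQN Q-network built on a pretrained ResNet-18 backbone.
import torch
import torch.nn as nn
from torchvision import models

class ResNetQNetwork(nn.Module):
    def __init__(self, num_actions=5):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        # Keep the pretrained convolutional stack, drop the ImageNet classifier head
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.head = nn.Linear(backbone.fc.in_features, num_actions)

    def forward(self, x):                 # x: (batch, 3, 96, 96), values in [0, 1]
        z = self.features(x).flatten(1)   # (batch, 512) feature vector per frame
        return self.head(z)               # one Q-value per discrete action
```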
Our model was trained on Google Colab's L4 GPU for 23 hours. We observed that, compared to the DQN + CNN implementation, the model learned in relatively fewer steps, reaching an average return of 600 in less than 200,000 time steps and a peak performance of 912 after 1,200,000 time steps. This performance is likely attributable to the image segmentation effect produced by the ResNet layers, which capture more meaningful spatial relationships than the traditional CNN implementation.
We observe that capturing spatio-temporal relationships, through the combination of an image segmentation backbone and a memory layer, contributed to faster convergence, reaching high average return values in less than 100,000 steps.
However, this approach demands substantially greater computational resources than the alternative methods, which prompted our team to discontinue model evaluation after 90,000 time steps. Our model was trained on Google Colab's A100 GPU for 8 hours and exceeded Colab's 83.5 GB of system RAM at this iteration count. Possible improvements include reducing the model's parameter count to improve computational efficiency, or applying distributed reinforcement learning algorithms [8].
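For reference, a minimal sketch of the ResNet + LSTM idea is shown below: a pretrained ResNet-18 extracts per-frame features and an LSTM aggregates them over a short sequence of frames. The hidden size, sequence handling, and output head are illustrative assumptions, not the exact trained configuration.

```python
# Minimal sketch of combining a ResNet-18 feature extractor with an LSTM memory layer.
import torch
import torch.nn as nn
from torchvision import models

class ResNetLSTMQNetwork(nn.Module):
    def __init__(self, num_actions=5, hidden_size=256):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.features = nn.Sequential(*list(backbone.children())[:-1])   # per-frame features
        self.lstm = nn.LSTM(input_size=512, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_actions)

    def forward(self, frames):                    # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        z = self.features(frames.flatten(0, 1))   # (B*T, 512, 1, 1)
        z = z.flatten(1).view(b, t, -1)           # (B, T, 512) sequence of frame features
        out, _ = self.lstm(z)                     # temporal aggregation over the sequence
        return self.head(out[:, -1])              # Q-values from the last time step
```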
Figure 8: PPO with default Adam (left) vs. non-stationary Adam (right)
We predict that PPO would still achieve good performance as the number of time steps increases (as suggested by several other published results 2 ); however, we lacked the compute resources to verify this. Nevertheless, we consider demonstrating this capability important for real-world self-driving scenarios, where the car should be able to apply, say, 20% gas and a 10-degree left turn (achievable with a continuous action space) rather than only full gas or a full left turn at a time (discrete actions).
Figure 11: The agent handling a potential skid well, i.e. drifting
In our experiments, PPO proved more unstable and sensitive to policy collapse during training. This can be attributed to PPO's reliance on a fixed-size trust region: when the policy deviates too much from the previous policy, the trust-region constraint can lead to overly conservative updates, in which the policy becomes stuck in a suboptimal solution.
Comparing the performance of our AI agents with human players (ourselves) 3 , the average score of the three human players is around 800, whereas the AI could consistently reach an average reward of 850-900. We welcome human testers to play the game here.
The videos uploaded to the Google Drive folder illustrate the model's exploration behavior during training. The first video shows that the agent indeed treated all actions equally at the start of training, causing it to struggle to move forward even on a straight route. The second video shows that the agent has acquired the ability to drive decently fast on the track, despite consistently making minor turns along the way.
Final behaviour
The three demonstration videos in the Google Drive folder showcase the agent's advanced driving capabilities. In particular, the agent has learned to perform delicate drifting and handle skidding when encountering U-turns, a challenging maneuver especially at high speed when the car is prone to losing traction. Furthermore, the agent demonstrates the ability to slow down appropriately when navigating sharp turns. On straight routes, the agent consistently applies the "gas" to maintain optimal speed. These behaviors highlight the agent's adaptability and its capacity to make intelligent decisions based on the track's layout, ultimately resulting in smooth and efficient driving performance.
3 https://drive.google.com/drive/folders/1ntYOZsL1ZZ1l8miHlr2z3E1mUHlHdVwD?usp=sharing
Florentiana Yuwono: overall direction of the team, research on papers and implementation, in charge of
PPO implementation.
Gan Pang Yen: research on papers and implementation, in charge of DQN and PPO training.
Jason Christopher: research on papers and implementation, in charge of ResNet and ResNet + LSTM
model design.
5 Conclusion
This project has demonstrated the potential and effectiveness of various deep reinforcement learning algo-
rithms in navigating a car autonomously in a simulated environment. Through extensive experimentation
with DQN, PPO, and innovative adaptations incorporating transfer learning and RNNs, we have uncovered
significant insights into the strengths and limitations of each approach within the context of self-driving car
racing.
Our findings reveal that while DQN provides a robust foundation, the incorporation of advanced neural
network architectures like ResNet and LSTM can enhance the agent’s performance by enabling it to capture
complex spatial and temporal dependencies within the environment. Meanwhile, PPO has shown promising
results, particularly in scenarios requiring fine control over continuous action spaces, which are crucial for
realistic driving simulations.
The integration of ResNet with LSTM, while offering superior ability to capture spatio-temporal relation-
ships, poses significant computational challenges. To facilitate the scaling of such models to millions of time
steps, further enhancements in computational efficiency or access to more substantial computing resources
will be necessary. This could involve optimizing the architecture for better performance on available hard-
ware or employing more advanced parallel computing techniques. Future work will focus on refining these
models and exploring the integration of these techniques into actual autonomous driving systems. Addition-
ally, further research into the phenomenon of policy collapse in PPO could lead to more stable and reliable
learning algorithms.
This project not only advances our understanding of applying deep reinforcement learning to autonomous
driving but also sets the stage for future innovations in this exciting and rapidly evolving field.
References
[1] Baldwin, A. (2023). Driverless racecars on track for April Abu Dhabi debut. Reuters. Last Modified:
21 December 2023. Available from: https://www.reuters.com/sports/motor-sports/driverless-racecars-
track-april-abu-dhabi-debut-2023-12-20/
[2] Chen, C., Ying, V., Laird, D. (2016). Deep Q-Learning with Recurrent Neural Networks. Stanford
University.
[3] Dohare S, Lan Q, Mahmood AR. Overcoming Policy Collapse in Deep Reinforcement Learning. Pub-
lished: 20 Jul 2023, Last Modified: 29 Aug 2023.
[4] Van Hasselt, H., Guez, A., & Silver, D. (2016, March). Deep reinforcement learning with double q-
learning. In Proceedings of the AAAI conference on artificial intelligence (Vol. 30, No. 1).
[5] Hidden Beginner. CartRacing-v2 DQN. hiddenbeginner.github.io/study-notes/contents/tutorials/2023-
04-20 CartRacing-v2 DQN.html.
[6] Indy Autonomous Challenge Unveils Next Gen Autonomous Vehicle Platform IAC AV-24. AIthority.
Last Modified: 9 January 2024. Available from: https://aithority.com/technology/indy-autonomous-
challenge-unveils-next-gen-autonomous-vehicle-platform-iac-av-24/
[7] Johny Code (2024). Deep Q-Learning (DQL) / Deep Q-Network (DQN) Explained — Python+Pytorch
Deep Reinforcement Learning. https://youtu.be/EUrWGTCGzlA?si=7jeYbCsATmYaxBXZ
[8] Kapturowski, S., Ostrovski, G., Dabney, W., Quan, J., Munos, R. (2019). Recurrent Experience Replay in Distributed Reinforcement Learning. International Conference on Learning Representations 2019.
[9] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Hassabis, D.
(2015). Human-level control through deep reinforcement learning. nature, 518(7540), 529-533.
[10] Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller M. Playing Atari
with Deep Reinforcement Learning. NIPS Deep Learning Workshop 2013. arXiv:1312.5602 [cs.LG]. DOI:
10.48550/arXiv.1312.5602.
[11] Osband, I., Blundell, C., Pritzel, A., & Van Roy, B. (2016). Deep exploration via bootstrapped DQN.
Advances in neural information processing systems, 29.
[12] Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2015). Prioritized experience replay. arXiv preprint
arXiv:1511.05952.
[13] Schmidt, R. (2019). Recurrent Neural Networks (RNNs): A gentle Introduction and Overview. arXiv
preprint arXiv:1912.05911v1.
[14] Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O. Proximal Policy Optimization Algorithms.
arXiv:1707.06347 [cs.LG]. DOI: 10.48550/arXiv.1707.06347.
[15] Wang K, Bartsch A, Barati Farimani A. MAN: Multi-Action Networks Learning. arXiv:2209.09329
[cs.LG]. DOI: 10.48550/arXiv.2209.09329.
[16] Zhu, Z., Lin, K., Jain, A. K., Zhou, J. (2023). Transfer Learning in Deep Reinforcement Learning: A
Survey. arXiv preprint arXiv:2009.07888.
[17] Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., Xiong, H., He, Q. (2020). A Comprehensive
Survey on Transfer Learning. arXiv preprint arXiv:1911.02685.