An End-to-End Curriculum Learning Approach for Autonomous Driving Scenarios

Luca Anzalone, Paola Barra, Silvio Barra, Aniello Castiglione, and Michele Nappi

Abstract— In this work, we combine Curriculum Learning with Deep Reinforcement Learning to learn, without any prior domain knowledge, an end-to-end competitive driving policy for the CARLA autonomous driving simulator. To our knowledge, we are the first to provide consistent results of our driving policy on all towns available in CARLA. Our approach divides the reinforcement learning phase into multiple stages of increasing difficulty, such that our agent is guided towards learning an increasingly better driving policy. The agent architecture comprises various neural networks that complement the main convolutional backbone, represented by a ShuffleNet V2. Further contributions are given by (i) the proposal of a novel value decomposition scheme for learning the value function in a stable way and (ii) an ad-hoc function for normalizing the growth in size of the gradients. We show both quantitative and qualitative results of the learned driving policy.

Index Terms— Autonomous driving, CARLA simulator, automotive, deep reinforcement learning, curriculum learning.

Manuscript received 30 June 2021; revised 27 October 2021; accepted 22 February 2022. Date of publication 26 May 2022; date of current version 11 October 2022. This work was supported in part by PRIN 2017 PREVUE: "PRediction of activities and Events by Vision in an Urban Environment," through the Italian Ministry of Education, University and Research, under Grant 2017N2RK7K. The Associate Editor for this article was B. B. Gupta. (Corresponding author: Aniello Castiglione.)
Luca Anzalone is with the Department of Physics and Astronomy (DIFA), University of Bologna, 40127 Bologna, Italy (e-mail: luca.anzalone2@unibo.it).
Paola Barra is with the Department of Computer Science, Sapienza University of Rome, 00185 Rome, Italy (e-mail: barra@di.uniroma1.it).
Silvio Barra is with the Department of Electrical and Information Technology Engineering (DIETI), University of Naples "Federico II", 80138 Naples, Italy (e-mail: silvio.barra@unina.it).
Aniello Castiglione is with the Department of Science and Technology (DIST), University of Naples "Parthenope", 80133 Naples, Italy (e-mail: castiglione@ieee.org).
Michele Nappi is with the Department of Computer Science, University of Salerno, 84084 Salerno, Italy (e-mail: mnappi@unisa.it).
Digital Object Identifier 10.1109/TITS.2022.3160673

I. INTRODUCTION

AUTONOMOUS Driving (AD) technology promises to change the way we travel. Thanks to the emerging automotive applications, Autonomous Vehicles (AV) will be able to recognize the road and the driving context so as to plan the route by monitoring the dynamics of the other vehicles and subjects within the scene. Thanks to AVs, people will be able to travel from place to place in a safer, more environmentally friendly and even more time-efficient way. These new technologies are expected to reduce road fatalities and pollution, and to provide greater autonomy.

Such complex goals can only be achieved by highly autonomous vehicles, classified as level 4 (high automation) and level 5 (full automation) by the SAE J3016 standard [1]. High-to-full autonomous vehicles must master tasks known as perception, planning, and control [2], [3]. Perception refers to the ability of an autonomous system to collect information and extract relevant knowledge from the environment. In order to do so, the autonomous vehicle needs to understand the driving scenario (environmental perception), to compute its pose and motion (localization), and to determine which portions of the driving space are occupied by other objects (occupancy grids). Planning relies on the output of the perception component to devise an obstacle-free route that the vehicle has to follow to avoid any collision while reaching its intended destination. The planned route is made of high-level commands that do not tell the vehicle's software how to actually implement them in terms of torques and forces. Finally, motion control accounts for this, converting high-level commands into low-level actions, consisting of specific torque and force values to be applied to the vehicle's actuators in order to make it move and steer properly.

For such purpose, both level 4 and level 5 autonomous vehicles are equipped with a variety of exteroceptive sensors, like cameras, LiDAR, RADAR, and ultrasonic sensors, to perceive the external environment including dynamic and static objects, and proprioceptive sensors, like IMUs, tachometers and altimeters, for internal vehicle state monitoring [4]. Moreover, high sensor redundancy along with sensor fusion are often necessary to achieve improved performance and high robustness, especially in degraded driving and weather conditions.

The tasks of perception, planning and control can be solved in isolation or jointly. The isolated approach is implemented as a modular pipeline in which each module is separate and performs a specific task [4]. The resulting system suffers from error propagation: the modules are designed by humans and are therefore potentially imperfect; every small error propagates through the system, compounding with the errors of the other modules. Basically, the isolated approach is neither optimal nor reliable. These weaknesses motivate the choice of the end-to-end driving paradigm. With end-to-end driving, the perception, planning and control tasks are solved jointly and are not represented explicitly. These systems have a more functional design and are easier to develop and maintain.

In general, we can distinguish various categories of system architectures for autonomous vehicle design, which also account (or not) for connectivity among vehicles [4]:

• Ego-only systems (or standalone vehicles) do not share information with other autonomous vehicles. A standalone vehicle uses only its own knowledge to devise driving decisions. The lack of connectivity makes this category of AVs simpler to design compared to vehicles that are connected together.
• Connected systems are able to distribute the basic operations of automated driving among other autonomous vehicles, thus forming a connected multi-agent system. In this way, vehicles can share detailed driving information and use such additional information to make better decisions. Communication among vehicles requires a specific infrastructure and communication protocols, in addition to being able to efficiently transmit and store large amounts of data.
• Modular systems are structured as a pipeline of separate components (as discussed previously), each of them solving a specific task. The main advantage is that the complex problem of autonomous driving can be decomposed into a set of smaller and easier-to-solve problems.
• End-to-end driving generates ego-motion directly from (raw) sensory inputs (e.g. RGB camera images), without the need to design any intermediate module. Ego-motion can be either the continuous operation of steering wheel and pedals (i.e. acceleration and braking) or a discrete set of actions. End-to-end driving is simple to implement, but often leads to less interpretable systems.

Imitation Learning [5], [6] is the preferred approach for end-to-end driving, given its design simplicity and optimization stability, despite requiring a considerable amount of expert data for learning a competitive policy. Deep Reinforcement Learning (RL) is gaining interest for its encouraging results in the field [7], [8], without requiring the collection of expert trajectories: just a real or simulated environment (e.g. CARLA [9], or AirSim [10]) is needed, instead. Moreover, RL can potentially discover better-than-expert behavior, since it maximizes the agent's performance with respect to a designed reward function.

In this paper, we provide the following contributions:

• We combine the Proximal Policy Optimization (PPO) [11] algorithm with Curriculum Learning [12], showing how to learn an end-to-end urban driving policy for the CARLA driving simulator [9].
• We evaluate our curriculum-based agent on various metrics, towns, weather conditions, and traffic scenarios. To our knowledge, we are the first to demonstrate consistent results on all towns provided by CARLA, by training the agent on only one town.
• Moreover, we point out two important sources of instability in reinforcement learning algorithms: learning the value function V(s), and normalizing the estimated advantage function A(s, a).
• Finally, we provide two novel techniques to solve these issues. The two methods can be applied to any value-based RL algorithm, as well as to actor-critic algorithms. More notably, the same technique we use to learn the value function is general enough to be employed in almost any ML regression problem.

The paper is organized as follows: Section II defines and describes the related work on the topic, categorizing it into (i) Autonomous Driving approaches based on Deep Learning techniques, (ii) Reinforcement Learning for Autonomous Driving, and (iii) Autonomous Driving Simulators. Section III introduces the formalisms and definitions needed to understand the background of the paper. In Section IV the proposed approach is presented. Section V shows the results obtained on the CARLA towns. Finally, Section VI concludes the paper.

II. RELATED WORK

A. Deep Learning-Based Autonomous Driving

Deep learning-based end-to-end driving systems aim to achieve human-like driving simply by learning a mapping function from inputs to output targets, so being able to imitate human experts. These inputs are often (monocular) camera images, while the targets can be quantities like the steering angle, the vehicle's speed, the route-following direction, throttle and braking values, or even high-level commands.

Reference [13] trained a convolutional neural network to map raw pixels from a single front-facing camera directly to steering commands. The authors managed to drive in traffic on local roads, on highways, and even in areas with unclear visual guidance. To correct the vehicle drifting from the ground-truth trajectory, the authors employed two additional cameras to record left and right shifts. The authors evaluated their system by measuring the autonomy metric, being autonomous 98% of the time. To mitigate this shifting problem, [14] developed a sensor setup that provides a 360-degree view of the area surrounding the vehicle by using eight cameras. Their driving model uses multiple Convolutional Neural Networks (CNNs) as feature encoders, four Long Short-Term Memory (LSTM) recurrent networks [15] as temporal encoders, and a fully-connected network to incorporate map information. Their system is trained to minimize the mean squared error (MSE) against speed and steering angle.

Reference [5] proposes to condition the imitation learning procedure on a high-level routing command (i.e. a one-hot encoded vector), such that trained policies can be controlled at test time by a passenger or by a topological planner. The authors evaluated the approach in a simulated urban environment provided by the CARLA driving simulator [9] and on a physical system: a 1/5-scale truck. For goal-based navigation they recorded a success rate of 88% in Town 1 (training scenario), and of 64% in Town 2 (testing scenario); two of the simplest towns available.
End-to-end behavioral cloning is appealing for its simplicity and scalability, but it has limitations [6], such as: dataset bias and overfitting when data is not diverse enough, generalization issues towards dynamic objects seen during training, and domain shift between the off-line training experience and the on-line behavior. Despite these limitations, behavioral cloning can still achieve state-of-the-art results, as demonstrated by [6]. In fact, the authors proposed a ResNet-based [16] architecture with a speed prediction branch. According to them, in the presence of large amounts of data a deep model can reduce both bias and variance over data, also achieving better generalization performance on learning reactions to dynamic objects and traffic lights in complex urban environments. The authors also proposed a novel CARLA driving benchmark, called NoCrash, in which the ability of the ego vehicle is tested on three urban scenarios with different weather conditions: empty town with no dynamic objects, regular traffic with a moderate amount of cars and pedestrians, and dense traffic with a large number of vehicles and pedestrians.

Reference [17] proposed the first direct perception method - an emerging paradigm that combines both end-to-end learning and control algorithms - named Conditional Affordance Learning (CAL), to handle traffic lights and speed signs by using image-level labels, as well as smooth car-following, resulting in a significant reduction of traffic accidents in simulation. Their CAL agent consists of a neural network that predicts six types of affordances from the input observation, and a lateral and longitudinal controller which predicts the throttle, brake, and steering values.

Reference [18] proposed the first interpretable neural motion planner for learning to drive autonomously in complex urban scenarios that include traffic-light handling, yielding, and interactions with multiple road-users. Their model employs a convolutional backbone to predict the bounding boxes of other actors, as well as a space-time cost volume for planning. The input representation consists of LiDAR point clouds coupled with annotated HD maps of the road. The space-time cost volume represents the goodness of each location that the self-driving car can take within a planning horizon. Their model is trained end-to-end with a multi-task objective: the planning loss encourages the minimum cost plan to be similar to the trajectory performed by human demonstrators, and the perception loss encourages the intermediate representations to produce accurate 3D detection and motion forecasting. According to the authors, the combination of these two losses ensures the interpretability of the intermediate representations.

B. Reinforcement Learning-Based Autonomous Driving

Reference [8] demonstrated the first application of deep reinforcement learning to autonomous driving. Their model is able to learn a policy for lane following in a handful of training episodes, using a single monocular image as input. The authors used the Deep Deterministic Policy Gradient (DDPG) algorithm [19] with prioritized experience replay [20], with all exploration and optimization performed on-vehicle. Their state space consists of monocular camera images compressed by a learned Variational Auto-Encoder (VAE) [21], together with the observed vehicle speed and steering angle. The authors defined a two-dimensional continuous action space: steering angle, and speed set-point. The authors utilize a 250 meter section of road for real-world driving experiments. Their best performing model is capable of solving a simple lane following driving task in half an hour.

Reference [7] proposes Controllable Imitative Reinforcement Learning (CIRL) to learn a driving policy based only on vision inputs from the CARLA simulator [9]. CIRL adopts a two-stage learning procedure: a first imitation stage pretrains the actor's network on ground-truth actions recorded from human driving videos, and the subsequent reinforcement learning stage employs DDPG [19] to improve the driving policy. According to the authors, the first imitation stage is necessary to prevent DDPG from falling into local optima due to poor exploration. CIRL uses a four-branch network with a speed prediction branch, similar to [6]. The authors conducted experiments on the CARLA simulator benchmark, showing that the CIRL performance is comparable to the best imitation learning methods, such as CIL [5], CAL [17], and CIRLS [6].

Often, training a competitive driving policy from high-dimensional observations is too difficult or expensive for RL. Reference [22] proposes to visually encode the perception and routing information the agent receives into a bird-view image, which is further compressed by a VAE [21]. To reduce training complexity the authors employed the frame-skip trick, in which each action made by the ego-vehicle is repeated for the subsequent k = 4 frames. The authors evaluated their approach on CARLA [9], specifically on a challenging roundabout scenario in Town 3. They compared three RL algorithms: Double DQN [23], TD3 [24], and SAC [25]. The latter achieved the best performance.

Reference [26] proposed a multi-objective DQN agent, motivated by the fact that a multi-objective approach can help overcome the difficulties of designing a scalar reward that properly weighs each performance criterion. Furthermore, the authors suggest that when each aspect is learned separately, it is possible to choose which aspect to explore in a given state. In particular, they learned a separate agent for each objective which, collectively, form a combined policy that takes all these objectives into account. The authors trained the agent on two four-way intersecting roads with random surrounding traffic provided by the SUMO traffic simulator [27], demonstrating a very low infraction rate.

C. Autonomous Driving Simulators

Autonomous driving research requires a considerable amount of diversified data, collected on a variety of driving scenarios with different weather conditions as well. Collecting such an amount of data in the real world is difficult, time-consuming, and costly. Moreover, driving datasets often focus only on specific aspects of the driving task, and are also collected with specific sensor modalities (e.g. RGB cameras vs LiDAR sensors).
An increasingly popular alternative to real-world data are autonomous driving simulators. Modern driving simulators like CARLA (Car Learning to Act) [9] and AirSim [10] provide realistic 3D graphics and physics simulation, traffic management, weather conditions, a variety of sensors, pedestrian management, different vehicles, and various driving scenarios as well. In particular, AirSim also supports autonomous aerial vehicles, like drones. These kinds of simulators are very flexible, providing an easy way to collect data in different driving scenarios and weather conditions, with different vehicles and sensor modalities. TORCS (The Open Racing Car Simulator) [28] is a modular, multi-agent car simulator that focuses on racing scenarios, instead. Compared to CARLA and AirSim, TORCS has lower-quality graphics, no traffic and pedestrian simulation, and a limited set of sensors. Other kinds of driving simulators focus solely on traffic simulation. SUMO (Simulation of Urban Mobility) [27] is a microscopic traffic simulation tool that models each vehicle and its dynamics individually. In particular, SUMO can even simulate railways and the CO2 emissions of individual vehicles.

III. BACKGROUND

In this section we provide the basic formalism and results about Reinforcement Learning [29], Generalized Advantage Estimation [30], and Proximal Policy Optimization [11], needed for understanding and developing the subsequent sections.

A. Reinforcement Learning

Reinforcement Learning (RL) [29] is a learning paradigm to tackle decision-making problems that provides a formalism for modeling behavior, in which a software or physical agent learns how to take optimal actions within an environment (i.e. a real or simulated world) by trial and error, guided only by positive or negative scalar reward signals (sometimes called reinforcements).

Formally, an environment is a Markov Decision Process (MDP) represented by a tuple (S, A, P, r, γ), in which: S is the state space, A is the action space, P(s′ | s, a) is the transition model (also called the environment dynamics), with which it is possible to predict the evolution of the environment's state, r: S × A → ℝ is the reward function, and finally γ ∈ (0, 1] is the discount factor.

The state space defines all the possible states s ∈ S (of the environment) that can be experienced by the agent. Instead, the action space depicts all the possible actions a ∈ A that the agent can predict. If the state space is not fully observable, the agent instead perceives observations o ∈ O, which are yielded by the environment itself. The observation space O contains only a partial amount of the information described by S; the rest (such as the environment's internal state) is hidden. In order to recover such hidden information, the agent usually retains (or processes somehow) the full (or partial) history of the previous observations, i.e. o_{1:t}, until the current timestep t. This setting is usually referred to as a partially-observable Markov decision process (POMDP).

The agent derives actions according to its policy π: S → A, which can be either deterministic, a_t = π(s_t), or stochastic, π(a_t | s_t), mapping states s_t to actions a_t. Note that in a partially-observable setting (i.e. a POMDP) the true states are not available to the agent, which derives actions by conditioning on (one or more) past observations instead: π(a_t | o_{j:t}), where the index j (j ≤ t) indicates how many past observations are considered. For our purposes, we restrict the policy to be a Deep Neural Network (DNN) [31], πθ with learnable parameters θ, that samples actions from a probability distribution, i.e. a_t ∼ πθ(· | s_t). In our case, our agent predicts two continuous actions, so we need to sample them from a continuous probability distribution like a Gaussian. Motivated by [32], we use a Beta distribution instead, which, apart from outperforming the Gaussian distribution, is particularly suited for continuous actions that are also bounded.

In order to learn the desired behavior, the agent has to interact with the target environment: at the first timestep (t = 0) the environment provides the agent with an initial state s_0 ∼ ρ(s_0), sampled from the initial state distribution ρ(s_0), usually implicitly defined by the environment. Then, the agent uses its policy to predict and execute the action a_0 affecting the environment, resulting in state s_1 according to the environment dynamics, i.e. s_1 ∼ P(· | s_0, a_0). Consequently, the environment evaluates the newly reached state s_1 with its reward function, also providing the agent with the respective immediate reward r_0 = r(s_0, a_0). Then, the interaction loop repeats for the next timestep until either a final state or the maximum number of timesteps has been reached. In general, the interaction loop proceeds as follows: at a generic timestep t the agent experiences a state s_t, then it computes an action a_t resulting in state s_{t+1}, for which it receives a reward r_t = r(s_t, a_t) from the environment. In practice, we consider finite-horizon episodes of maximum length T.
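To make the interaction loop concrete, the following is a minimal sketch of how one finite-horizon episode could be collected. It assumes a generic environment object exposing Gym-style reset() and step() methods and a policy object with a predict() method; these names are illustrative placeholders, not the actual CARLA wrappers used in this work.

# Minimal sketch of the agent-environment interaction loop described above.
# `env` and `policy` are hypothetical placeholders following a common
# Gym-style interface; they are not the CARLA wrappers used in this paper.

def collect_episode(env, policy, max_timesteps):
    states, actions, rewards = [], [], []
    state = env.reset()                               # s_0 ~ rho(s_0)

    for t in range(max_timesteps):                    # finite horizon of length T
        action = policy.predict(state)                # a_t ~ pi_theta(. | s_t)
        next_state, reward, done = env.step(action)   # s_{t+1} ~ P(. | s_t, a_t), r_t

        states.append(state)
        actions.append(action)
        rewards.append(reward)

        state = next_state
        if done:                                      # a final state has been reached
            break

    return states, actions, rewards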
B. Proximal Policy Optimization

Proximal Policy Optimization (PPO) [11] is a model-free RL algorithm from the policy optimization family that aims to learn policies in a faster, more efficient, and more robust way compared to vanilla policy gradient [33] and TRPO [34]. In general, the aim of RL algorithms is to indirectly maximize the performance objective J(θ), in order to maximize the agent's performance on the given task:

    J(θ) = E[ Σ_{t=0}^{T−1} γ^t r_t ]    (1)

Maximizing the performance objective J(θ) means maximizing the expected sum of discounted rewards, seeking a policy π* = arg max_π J(θ) that achieves maximal performance (i.e. Σ_t r_t is maximal). The objective (1) is stochastic (since the rewards result from states and actions sampled by following π), apart from being not directly differentiable. Hence, policy optimization algorithms (like other RL methods) optimize a surrogate objective J̃(θ) instead, called the policy gradient:

    ∇θ J̃(θ) = E[ Σ_{t=0}^{T−1} ∇θ log πθ(a_t | s_t) A(s_t, a_t) ]    (2)

where πθ is a policy parameterized by θ, and A(s, a) is the advantage function. The PPO algorithm optimizes a slightly different policy gradient objective to maximize J(θ). In particular, we utilize the following clipping objective variant (borrowing notation from [11]):

    L^clip(θ) = E_t[ min( ratio_t(θ) Â_t, clip(ratio_t(θ), 1 − ε, 1 + ε) Â_t ) ]    (3)

where ratio_t(θ) = πθ(a_t | s_t) / πθ_old(a_t | s_t) denotes the probability ratio between the current policy πθ and the old policy πθ_old, Â_t represents the advantages estimated by using GAE, and lastly the clip(·) function constrains the ratio to the interval [1 − ε, 1 + ε].
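As a concrete reference for the objective in (3), the snippet below is a minimal TensorFlow 2 sketch of the clipped surrogate loss. The tensor names (log_probs, old_log_probs, advantages) and the clipping value ε = 0.2 are assumptions made for illustration, not the exact variables or hyperparameters of our implementation.

# Minimal TensorFlow 2 sketch of the clipped surrogate objective in Eq. (3).
# `log_probs`, `old_log_probs` and `advantages` are per-timestep tensors that a
# training loop would provide; their names and clip_ratio=0.2 are assumptions.
import tensorflow as tf

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_ratio=0.2):
    # probability ratio pi_theta(a|s) / pi_theta_old(a|s), computed in log-space
    ratio = tf.exp(log_probs - old_log_probs)

    # unclipped and clipped terms of Eq. (3)
    unclipped = ratio * advantages
    clipped = tf.clip_by_value(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * advantages

    # PPO maximizes the objective, so the loss is its negation
    return -tf.reduce_mean(tf.minimum(unclipped, clipped))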
Fig. 1. Example results of applying data augmentation. The original image is at the top-left corner.

B. Data Augmentation

As demonstrated by previous work [5], [17], [39], data augmentation is crucial to let the agent generalize across different towns and weather conditions. Similarly to [5], the augmentations used are: color distortion (i.e. changes in contrast, brightness, saturation, and hue), Gaussian blur, Gaussian noise, salt-and-pepper noise, cutout, and coarse dropout. Each augmentation function is applied with a certain probability and intensity (see Fig. 1).

Geometrical transformations commonly used for image detection tasks, including horizontal or vertical flipping, rotation, and shearing, are not applied in this case, since they would significantly alter the driving scene.

Note that data augmentation has been used only in the last two stages of the reinforced curriculum learning procedure (more details in Section IV-F).
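The snippet below is a minimal sketch of how such probabilistic color-distortion augmentations could be composed in TensorFlow 2. The probabilities and intensity ranges are illustrative assumptions, and the remaining augmentations (blur, salt-and-pepper noise, cutout, coarse dropout) are omitted for brevity.

# A minimal sketch of probabilistic color-distortion augmentation in TensorFlow 2.
# It assumes float images in [0, 1]; probabilities and ranges are illustrative
# assumptions, not the exact values used by the authors.
import tensorflow as tf

def maybe(fn, image, prob):
    # apply `fn` to `image` with probability `prob`, otherwise return it unchanged
    return tf.cond(tf.random.uniform([]) < prob, lambda: fn(image), lambda: image)

def augment(image):
    image = maybe(lambda x: tf.image.random_brightness(x, max_delta=0.2), image, prob=0.3)
    image = maybe(lambda x: tf.image.random_contrast(x, 0.8, 1.2), image, prob=0.3)
    image = maybe(lambda x: tf.image.random_saturation(x, 0.8, 1.2), image, prob=0.3)
    image = maybe(lambda x: tf.image.random_hue(x, max_delta=0.05), image, prob=0.3)
    # additive Gaussian noise with a random standard deviation
    noise = tf.random.normal(tf.shape(image), stddev=tf.random.uniform([], 0.0, 0.05))
    image = maybe(lambda x: x + noise, image, prob=0.2)
    return tf.clip_by_value(image, 0.0, 1.0)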
Fig. 2. The neural network architecture of the proposed agent (with minor omissions). The first half depicts the shared network Pψ, while the second half shows, respectively from top to bottom, the value Vφ and policy πθ branches. At the center, the outputs of the first half of the network are first concatenated and then linearly combined, before being fed to both the value and policy branches.

C. Agent Architecture

The agent is implemented by a deep neural network [31] that takes the current observation o_t as input, and outputs the next action a_t ∼ πθ(z_t) along with its value v_t = Vφ(z_t), where z_t = Pψ(o_t). The deep neural network representing the agent has two branches: the policy branch πθ with parameters θ (the actor), and the value branch Vφ with parameters φ (the critic). The policy branch samples actions from a Beta distribution, as motivated by [32]. The value branch outputs the value v of the states s, which is used to estimate the advantage function A(s, a) with the GAE [30] technique. Both branches share a common neural network Pψ with parameters ψ, that processes observations o into an intermediate representation z.

Since each observation o_t is a stack of 4 sets of tensors (see Section IV-A), i.e. o_t = [o_t^1, . . . , o_t^4], the network Pψ is applied sequentially on each o_t^i, yielding four z_t^i which are aggregated by Gated Recurrent Units (GRUs) [40] to obtain z_t. Moreover, Pψ embeds a ShuffleNet V2 [41] to process image data. Finally, both Vφ and πθ are feed-forward NNs with two layers of 320 SiLU-activated [42] units and batch normalization [43].

The overall architecture of the agent is depicted in Fig. 2. The blue rectangles indicate fully-connected (or dense) layers. The blue circle, i.e. ⊕, denotes layer concatenation along the first dimension (or axis), where the batch dimension is at axis zero. The shared network Pψ (first half) processes each component of the observation tensor o_t^i separately, and the components are independently aggregated by GRU layers [40] into single vectors. Then, the output of the concatenation is linearly combined and fed to the two branches. Lastly, values are decomposed into two numbers, bases b and exponents e, as motivated in the following section.
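The following is a minimal tf.keras sketch of the two-branch architecture described above, assuming illustrative input shapes and head sizes. ShuffleNet V2 is not bundled with tf.keras, so a small convolutional stack stands in for the image backbone; the Beta-parameter and base-exponent heads reflect the description above, but not necessarily the exact layer configuration of our agent.

# A minimal tf.keras sketch of the agent's two-branch architecture.
# A small CNN stands in for the ShuffleNet V2 backbone; input shapes, the
# vector-feature size and the output heads are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def build_agent(image_shape=(90, 120, 3), vec_size=16, stack=4, k=6):
    # each observation is a stack of 4 sets of tensors (images + vector features)
    images = layers.Input((stack,) + image_shape, name='images')
    vectors = layers.Input((stack, vec_size), name='vectors')

    # stand-in for the ShuffleNet V2 backbone, applied to every stacked frame
    backbone = tf.keras.Sequential([
        layers.Conv2D(32, 5, strides=2, activation='relu'),
        layers.Conv2D(64, 3, strides=2, activation='relu'),
        layers.GlobalAveragePooling2D()])
    img_feat = layers.TimeDistributed(backbone)(images)

    # each component is aggregated by a GRU into a single vector, then
    # the vectors are concatenated and linearly combined
    z_img = layers.GRU(128)(img_feat)
    z_vec = layers.GRU(64)(vectors)
    z = layers.Dense(320)(layers.Concatenate()([z_img, z_vec]))

    def branch(x):
        for _ in range(2):  # two 320-unit SiLU layers with batch normalization
            x = layers.Dense(320, activation=tf.nn.silu)(x)
            x = layers.BatchNormalization()(x)
        return x

    # policy branch: alpha/beta parameters of a Beta distribution (two actions)
    pi = branch(z)
    alpha = layers.Dense(2, activation='softplus')(pi)
    beta = layers.Dense(2, activation='softplus')(pi)

    # value branch: base in [-1, 1] and exponent in [0, k] (see the next section)
    v = branch(z)
    base = layers.Dense(1, activation='tanh')(v)
    exponent = layers.Lambda(lambda e: k * e)(layers.Dense(1, activation='sigmoid')(v))

    return tf.keras.Model([images, vectors], [alpha, beta, base, exponent])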
D. Learning the Value Function

The value function is learned by minimizing the squared loss L_v(φ) = ‖v − R‖₂² between the network's estimate of the values, v = [v_t]_{t=0}^{T−1}, and the true returns R = [R_t]_{t=0}^{T−1}, where each return R_t = Σ_{i=t}^{T−1} γ^i r_i is the discounted sum of rewards from timestep t to the end of the episode, T − 1.

Notice that when the quantity ‖v − R‖₂² is large, because the estimate v is far from the ground-truth R, (the norm of) its gradient ∇φ L_v(φ) is also large, and so the parameters φ get a big update that can cause training to be less stable. A commonly used practice is to normalize both values and returns to have zero mean and unit variance, so that the magnitude of the error is always small. However, this approach is biased: the normalization statistics are not known in advance, and the scale of such quantities changes as the performance of the agent improves.

The following outlines the approach we use to learn the value function stably and accurately, without any normalization bias: both values v and returns R are respectively decomposed into bases b_v, b_R ∈ [−1, 1] and exponents e_v, e_R ∈ [0, k] such that

    v = b_v · 10^{e_v}
    R = b_R · 10^{e_R},

where k ∈ ℕ is a positive constant that should be large enough to represent even the largest returns. For example, we set k = 6 so that even returns up to ±10^6 can be properly represented. With such a base-exponent decomposition, learning the value function is a matter of regressing both bases and exponents; the new loss function L_v(φ) is defined as follows:

    L_v(φ) = Σ_{t=0}^{T−1} [ (b_{v_t} − b_{R_t})² / 4 + (e_{v_t} − e_{R_t})² / k² ]    (7)

Hence, even large errors now lie in a small interval, because both the base b and the exponent e take values in a small interval, and so the gradient ∇φ L_v(φ) is always reasonably small, resulting in more stable training. Note that the bases b have a different scale from the exponents e, so we normalize them (by respectively dividing by 4 and k²) such that they contribute equally to the loss value, once again avoiding the need to weight these two error terms. The normalizing coefficients are obtained by considering the worst case of the squared differences. Since the bases b ∈ [−1, 1], the worst case (i.e. the largest error value) is given by (1 − (−1))² = 4, supposing b_{v_t} = 1 and b_{R_t} = −1 (or vice-versa). Similarly, for the exponents the worst case is (0 − k)² = k², since e ∈ [0, k], again supposing e_{v_t} = 0 and e_{R_t} = k (or vice-versa).

Fig. 3. Example of a value function learned through base-exponent decomposition. In the leftmost plot, the learned value function compared to returns; in the center plot, the regression of bases; in the rightmost plot, the regression of exponents.
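For clarity, the snippet below sketches how returns could be decomposed into base-exponent targets and how the loss in (7) is then computed. The decomposition routine and its epsilon are assumptions made for illustration; only the constant k = 6 and the 1/4 and 1/k² weights follow the text.

# A minimal TensorFlow 2 sketch of the base-exponent decomposition and of Eq. (7).
# The helper name and the epsilon guarding log10(0) are assumptions; k = 6 follows
# the example given in the text.
import tensorflow as tf

K = 6.0  # exponents lie in [0, k]

def decompose(x, eps=1e-7):
    # exponent e in [0, K] and base b in [-1, 1] such that x ≈ b * 10^e
    exponent = tf.clip_by_value(
        tf.math.ceil(tf.math.log(tf.abs(x) + eps) / tf.math.log(10.0)), 0.0, K)
    base = x / tf.pow(10.0, exponent)
    return base, exponent

def value_loss(pred_base, pred_exp, returns):
    target_base, target_exp = decompose(returns)
    # each squared error is divided by its worst case: 4 for bases, K^2 for exponents
    return tf.reduce_sum((pred_base - target_base) ** 2 / 4.0
                         + (pred_exp - target_exp) ** 2 / K ** 2)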
E. Sign-Preserving Advantage Normalization

The estimated advantages Â_t directly affect the norm of the gradient ∇θ L^clip(θ) of PPO's policy objective (3), being a multiplicative factor. Consequently, if the advantages are large, the norm of ∇θ L^clip(θ) is also large, resulting in a considerable change of the policy's parameters and thus a probable change of the agent's behavior, which may easily diverge; alternatively, we could lower the learning rate by several factors, potentially slowing down training. Note that the magnitude of the advantages strictly depends on the quality of the learned value function: poorly estimated values imply large advantages, since Â ≈ Vφ(s) − R, where Vφ is the learned value function and R the true returns. So, it is important to scale the advantages into a reasonable range, without introducing any bias, to stabilize learning (Fig. 4).

Fig. 4. Normalized advantages (b) now have a small scale, roughly in [−1, 1]. The magnitude of the original advantages (a) was much larger, on the order of 10^5. This ensures the policy gradient's norm to be small as well. Notice the scale of the normalized advantages is almost 10^4 times smaller. Moreover, our normalization scheme ensures the preservation of the sign, that is, if in (a) some advantages were positive, they will still be positive after our normalization in (b).

For such purpose we propose the sign-preserving normalization function, which separately normalizes positive values from negative ones. The function is defined by the following TensorFlow 2 [44] code:

import tensorflow as tf

def sign_preserving_norm(adv, eps=1e-3):
    adv_max = tf.reduce_max(adv)
    adv_min = tf.reduce_min(adv)

    # first, filter positives and negatives
    pos = adv * tf.cast(adv > 0, tf.float32)
    neg = adv * tf.cast(adv < 0, tf.float32)

    # then, normalize them separately: positives by the maximum,
    # negatives by the (absolute value of the) minimum
    return (pos / (adv_max + eps)) + (neg / -(adv_min - eps))

Advantages normalized with the above function have the benefit of having the same sign (and, thus, meaning) as the original advantages (Fig. 5), while having a small and controllable scale, which we argue contributes to stabilizing training. Preserving the sign is an important property which avoids detrimental gradient-flipping issues that cause ambiguity in the policy between better-than-average and worse-than-average actions, which would otherwise be mistaken for one another: for example, widely used normalization techniques like min-max normalization and standardization (i.e. zero-mean unit-variance normalization) lack this property. In particular, min-max normalization transforms values to be in the range [0, 1], such that the minimum value corresponds to 0 and the maximum to 1. Such a normalization would make the normalized advantages always positive: thus, the sign is lost. Similarly, standardization would change the sign to negative for those values which are below the mean value.
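As a toy illustration of this property (the values below are made up, not taken from our experiments), the proposed function can be contrasted with standardization as follows:

# Toy illustration (not from the paper): the advantage values are made up to show
# that the proposed normalization keeps signs while standardization can flip them.
import tensorflow as tf

adv = tf.constant([25000.0, -300.0, 1200.0, -90000.0])

print(sign_preserving_norm(adv))
# ~[1.0, -0.0033, 0.048, -1.0]: small, controllable scale with all signs preserved

print((adv - tf.reduce_mean(adv)) / tf.math.reduce_std(adv))
# standardization maps -300.0 to a positive value (the mean is about -16025),
# flipping its sign and thus its better/worse-than-average meaning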
F. Reinforced Curriculum Learning

Since the problem of autonomous driving is extremely complex, we adopt a stage-based learning procedure for our PPO agent, inspired by Curriculum Learning [12]. We divide
TABLE I
PERFORMANCE OF OUR AGENTS: Curriculum (C), Standard (S), AND Untrained (U). BEST RESULTS ARE HIGHLIGHTED IN BOLD. THE RESULTS HAVE BEEN AGGREGATED OVER THE TWO WEATHER SETS (Soft AND Hard), AND THREE TRAFFIC SCENARIOS (No, Regular, AND Dense)
The evaluation considers three traffic scenarios: no traffic (without any pedestrian nor vehicle), regular traffic (50 pedestrians and 50 vehicles), and dense traffic (200 pedestrians and 100 vehicles).

We also evaluate the benefit of curriculum learning, comparing the same agent with and without curriculum: we refer to the former agent as curriculum (C), and to the latter as standard (S). Moreover, we also provide the (non-trivial) baseline performance of an agent with the same architecture as the other two, but with random weights kept fixed for the entire evaluation procedure: we refer to this agent as untrained (U). Notice that the untrained agent is a stronger (but still naive) baseline compared to a purely random-guess agent, which completely discards the input observations it receives, solely sampling actions uniformly. Relative performance, aggregated over the three traffic scenarios as well as the two weather sets, is shown in Table I. Qualitative results are provided in Fig. 6.

Fig. 6. Performance of our agent in various settings, towns and weather. Notice that scenarios (a) and (c) are novel, not experienced by the agent during training.

B. Discussion

From the detailed evaluation results, we point out two major weaknesses of our approach: (1) the agent struggles at coordinating acceleration and braking, and (2) at recognizing obstacles. This results in low speed (about 9 km/h) and many collisions as well. Such behavior could be due to a lack of exploration, limited network capacity and/or architecture, as well as various difficulties in optimizing the policy gradient.

We also demonstrate the following: (1) emerging driving behavior without leveraging any domain knowledge, that is (2) robust and consistent across towns and weather conditions; furthermore, (3) the stage-based reinforcement learning procedure has proven to be competitive, even better, compared to plain reinforcement learning.

VI. CONCLUSION

Deep reinforcement learning is still a relatively new field with many unexplored research directions; it enables us to solve even complex decision-making problems in a completely end-to-end fashion, without leveraging any domain-specific knowledge or expensive sets of highly-annotated data. On the contrary, imitation learning is a stronger approach for autonomous driving that heavily relies on high-quality and high-quantity datasets, which should also provide demonstrations of recovery from driving mistakes in order to learn a reliable driving policy.

Although our approach is not yet competitive with the state-of-the-art (CIRL [7], CAL [17], and CIRLS [6]), we demonstrate emerging driving behavior that is consistent across all CARLA towns and robust to changes in weather. To our knowledge, we are the first to provide baseline performance on all towns, and to demonstrate such consistency. We also provide a decomposition of the returns that allows learning the value function in a stable and accurate way, as well as a proper normalization function for the estimated advantages.

REFERENCES

[1] SAE International On-Road Automated Vehicle Standards Committee, Taxonomy and Definitions for Terms Related to On-Road Motor Vehicle Automated Driving Systems, Warrendale, PA, USA, Inf. Rep., 2014.
[2] S. Pendleton et al., "Perception, planning, control, and coordination for autonomous vehicles," Machines, vol. 5, no. 1, p. 6, Feb. 2017.
[3] S. Grigorescu, B. Trasnea, T. Cocias, and G. Macesanu, "A survey of deep learning techniques for autonomous driving," J. Field Robot., vol. 37, no. 3, pp. 362–386, 2020.
[4] E. Yurtsever, J. Lambert, A. Carballo, and K. Takeda, "A survey of autonomous driving: Common practices and emerging technologies," IEEE Access, vol. 8, pp. 58443–58469, 2020.
[5] F. Codevilla, M. Müller, A. Lopez, V. Koltun, and A. Dosovitskiy, "End-to-end driving via conditional imitation learning," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), May 2018, pp. 1–9.
[6] F. Codevilla, E. Santana, A. Lopez, and A. Gaidon, "Exploring the limitations of behavior cloning for autonomous driving," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9329–9338.
[7] X. Liang, T. Wang, L. Yang, and E. Xing, "CIRL: Controllable imitative reinforcement learning for vision-based self-driving," in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 584–599.
[8] A. Kendall et al., "Learning to drive in a day," in Proc. Int. Conf. Robot. Automat. (ICRA), May 2019, pp. 8248–8254.
[9] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, "CARLA: An open urban driving simulator," 2017, arXiv:1711.03938.
[10] S. Shah, D. Dey, C. Lovett, and A. Kapoor, "AirSim: High-fidelity visual and physical simulation for autonomous vehicles," in Field and Service Robotics, M. Hutter and R. Siegwart, Eds. Cham, Switzerland: Springer, 2018, pp. 621–635.
[11] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017, arXiv:1707.06347.
[12] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, "Curriculum learning," in Proc. 26th Annu. Int. Conf. Mach. Learn., 2009, pp. 41–48.
[13] M. Bojarski et al., "End to end learning for self-driving cars," 2016, arXiv:1604.07316.
[14] S. Hecker, D. Dai, and L. Van Gool, "End-to-end learning of driving models with surround-view cameras and route planners," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 435–453.
[15] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[16] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[17] A. Sauer, N. Savinov, and A. Geiger, "Conditional affordance learning for driving in urban environments," 2018, arXiv:1806.06498.
[18] W. Zeng et al., "End-to-end interpretable neural motion planner," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 8660–8669.
[19] T. P. Lillicrap et al., "Continuous control with deep reinforcement learning," 2015, arXiv:1509.02971.
[20] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," 2015, arXiv:1511.05952.
[21] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," 2013, arXiv:1312.6114.
[22] J. Chen, B. Yuan, and M. Tomizuka, "Model-free deep reinforcement learning for urban autonomous driving," in Proc. IEEE Intell. Transp. Syst. Conf. (ITSC), Oct. 2019, pp. 2765–2771.
[23] H. van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," 2015, arXiv:1509.06461.
[24] S. Fujimoto, H. van Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," 2018, arXiv:1802.09477.
[25] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," 2018, arXiv:1801.01290.
[26] C. Li and K. Czarnecki, "Urban driving with multi-objective deep reinforcement learning," 2018, arXiv:1811.08586.
[27] P. A. Lopez et al., "Microscopic traffic simulation using SUMO," in Proc. 21st Int. Conf. Intell. Transp. Syst. (ITSC), Nov. 2018, pp. 2575–2582.
[28] B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom, and A. Sumner, (2000). TORCS, The Open Racing Car Simulator. [Online]. Available: http://torcs.sourceforge.net
[29] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
[30] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation," 2015, arXiv:1506.02438.
[31] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep Learning, vol. 1, no. 2. Cambridge, MA, USA: MIT Press, 2016.
[32] P.-W. Chou, D. Maturana, and S. Scherer, "Improving stochastic policy gradients in continuous control with deep reinforcement learning using the beta distribution," in Proc. Int. Conf. Mach. Learn., 2017, pp. 834–843.
[33] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Mach. Learn., vol. 8, nos. 3–4, pp. 229–256, 1992.
[34] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in Proc. Int. Conf. Mach. Learn., 2015, pp. 1889–1897.
[35] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980.
[36] V. Mnih et al., "Asynchronous methods for deep reinforcement learning," in Proc. Int. Conf. Mach. Learn., 2016, pp. 1928–1937.
[37] G. Brockman et al., "OpenAI gym," 2016, arXiv:1606.01540.
[38] V. Mnih et al., "Playing Atari with deep reinforcement learning," 2013, arXiv:1312.5602.
[39] F. Codevilla, A. M. Lopez, V. Koltun, and A. Dosovitskiy, "On offline evaluation of vision-based driving models," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 236–251.
[40] K. Cho et al., "Learning phrase representations using RNN encoder-decoder for statistical machine translation," 2014, arXiv:1406.1078.
[41] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, "ShuffleNet V2: Practical guidelines for efficient CNN architecture design," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 116–131.
[42] S. Elfwing, E. Uchibe, and K. Doya, "Sigmoid-weighted linear units for neural network function approximation in reinforcement learning," Neural Netw., vol. 107, pp. 3–11, Nov. 2018.
[43] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," 2015, arXiv:1502.03167.
[44] M. Abadi et al., "TensorFlow: A system for large-scale machine learning," in Proc. 12th USENIX Symp. Operating Syst. Design Implement. (OSDI), 2016, pp. 265–283.
[45] L. Anzalone, S. Barra, and M. Nappi, "Reinforced curriculum learning for autonomous driving in CARLA," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2021, pp. 3318–3322.

Luca Anzalone received the B.Sc. and M.Sc. degrees (cum laude) in computer science from the University of Salerno, in 2018 and 2020, respectively. He is currently pursuing the Ph.D. degree in data science and computation with the University of Bologna. His research interests include deep learning and deep reinforcement learning.

Paola Barra received the B.S. degree in computer science from the University of Salerno, the M.S. degree in business informatics from the University of Pisa, and the Ph.D. degree from the University of Salerno in 2021. Her research interests include machine learning techniques for computer vision problems, such as facial and gait recognition, action recognition, and tumor detection. She is a member of GIRPR/IAPR.

Silvio Barra was born in Battipaglia, Salerno, Italy, in 1985. He received the B.Sc. and M.Sc. degrees (cum laude) in computer science from the University of Salerno, in 2009 and 2012, respectively, and the Ph.D. degree from the University of Cagliari, in 2017. Currently, he is a Research Assistant with the University of Naples Federico II. He has authored more than 50 papers, published in international journals, conferences, and books. His main research interests include pattern recognition, biometrics, video analysis and analytics, and financial forecasting.

Aniello Castiglione (Member, IEEE) received the Ph.D. degree in computer science from the University of Salerno, Italy. He is currently an Associate Professor with the University of Naples Parthenope, Italy. He received the Italian National Qualification as a Full Professor of computer science in 2021. He published more than 240 papers in international journals and conferences. Considering his journal articles, more than 85 of them are ranked Q1 in the Scopus/Scimago classification and more than 70 of them are ranked Q1 in the Clarivate Analytics/ISI-WoS classification. His international academic profile is spread among his 86 international coauthors, who belong to 75 different institutions located in 18 countries. He served as the Program Chair and a TPC Member in around 250 international conferences (some of them ranked A+/A/A- in the CORE, LiveSHINE, and Microsoft Academic international classifications). His current research interests include information forensics, digital forensics, security and privacy on cloud, communication networks, applied cryptography, and sustainable computing. Currently, he is the Editor-in-Chief of the Special Issues for the Journal of Ambient Intelligence and Humanized Computing (Springer). He served as the Managing Editor for two ISI-ranked international journals and as a Reviewer for 110 international journals. In addition, he served as a Guest Editor for 30 Special Issues and served on the Editorial Boards of more than ten international journals, such as IEEE TRANSACTIONS ON SUSTAINABLE COMPUTING, IEEE ACCESS, IET Image Processing (IET), Journal of Ambient Intelligence and Humanized Computing (Springer), MTAP, Sustainability (MDPI), Smart Cities (MDPI), and Future Internet (MDPI). One of his papers (published in the IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING) was selected as "Featured Article" in the "IEEE Cybersecurity Initiative" in 2014. In October 2020 and October 2021, he was included in the ranking of the top 100,000 scientists for the years 2019 and 2020. He is a member of ACM.

Michele Nappi (Senior Member, IEEE) received the laurea degree (cum laude) in computer science from the University of Salerno, Italy, in 1991, the M.Sc. degree in information and communication technology from I.I.A.S.S. E.R. Caianiello, in 1997, and the Ph.D. degree in applied mathematics and computer science from the University of Padova, Italy, in 1997. He was one of the founders of the spin-off BS3 (biometric system for security and safety) in 2014. He is currently a Full Professor of computer science with the University of Salerno. He is a Team Leader of the Biometric and Image Processing Laboratory (BIPLAB). He has authored more than 180 papers in peer-reviewed international journals, international conferences, and book chapters. His research interests include pattern recognition, image processing, image compression and indexing, multimedia databases and biometrics, human–computer interaction, and VR/AR. He is a member of the TPC of international conferences. He is a GIRPR/IAPR Member. He received several international awards for scientific and research activities. He is the Co-Editor of several international books. He serves as an Associate Editor and a Managing Guest Editor for several international journals. He is the President of the Italian Chapter of the IEEE Biometrics Council.