An End-to-End Curriculum Learning Approach for Autonomous Driving Scenarios

Luca Anzalone, Paola Barra, Silvio Barra, Aniello Castiglione, and Michele Nappi

Abstract— In this work, we combine Curriculum Learning with Deep Reinforcement Learning to learn, without any prior domain knowledge, an end-to-end competitive driving policy for the CARLA autonomous driving simulator. To our knowledge, we are the first to provide consistent results of our driving policy on all towns available in CARLA. Our approach divides the reinforcement learning phase into multiple stages of increasing difficulty, such that our agent is guided towards learning an increasingly better driving policy. The agent architecture comprises various neural networks that complement the main convolutional backbone, represented by a ShuffleNet V2. Further contributions are given by (i) the proposal of a novel value decomposition scheme for learning the value function in a stable way and (ii) an ad-hoc function for normalizing the growth in size of the gradients. We show both quantitative and qualitative results of the learned driving policy.

Index Terms— Autonomous driving, CARLA simulator, automotive, deep reinforcement learning, curriculum learning.

Manuscript received 30 June 2021; revised 27 October 2021; accepted 22 February 2022. Date of publication 26 May 2022; date of current version 11 October 2022. This work was supported in part by PRIN 2017 PREVUE: "PRediction of activities and Events by Vision in an Urban Environment," through the Italian Ministry of Education, University and Research, under Grant 2017N2RK7K. The Associate Editor for this article was B. B. Gupta. (Corresponding author: Aniello Castiglione.)
Luca Anzalone is with the Department of Physics and Astronomy (DIFA), University of Bologna, 40127 Bologna, Italy (e-mail: luca.anzalone2@unibo.it).
Paola Barra is with the Department of Computer Science, Sapienza University of Rome, 00185 Rome, Italy (e-mail: barra@di.uniroma1.it).
Silvio Barra is with the Department of Electrical and Information Technology Engineering (DIETI), University of Naples "Federico II", 80138 Naples, Italy (e-mail: silvio.barra@unina.it).
Aniello Castiglione is with the Department of Science and Technology (DIST), University of Naples "Parthenope", 80133 Naples, Italy (e-mail: castiglione@ieee.org).
Michele Nappi is with the Department of Computer Science, University of Salerno, 84084 Salerno, Italy (e-mail: mnappi@unisa.it).
Digital Object Identifier 10.1109/TITS.2022.3160673

I. INTRODUCTION

AUTONOMOUS Driving (AD) technology promises to change the way we travel. Thanks to the emerging automotive applications, Autonomous Vehicles (AV) will be able to recognize the road and the driving context so as to plan the route by monitoring the dynamics of the other vehicles and subjects within the scene. Thanks to AVs, people will be able to travel from place to place in a safer, more environmentally friendly and even more time-efficient way. These new technologies are expected to reduce road fatalities and pollution, and to provide greater autonomy.

Such complex goals can only be achieved by highly autonomous vehicles, classified as level 4 (high automation) and level 5 (full automation) by the SAE J3016 standard [1]. High-to-full autonomous vehicles must master tasks known as perception, planning, and control [2], [3]. Perception refers to the ability of an autonomous system to collect information and extract relevant knowledge from the environment. In order to do so, the autonomous vehicle needs to understand the driving scenario (environmental perception), to compute its pose and motion (localization), and to determine which portions of the driving space are occupied by other objects (occupancy grids). Planning relies on the output of the perception component to devise an obstacle-free route that the vehicle has to follow to avoid any collision while reaching its intended destination. The planned route is made of high-level commands that do not tell the vehicle's software how to actually implement them in terms of torques and forces. Finally, motion control accounts for this, converting high-level commands into low-level actions, consisting of specific torque and force values to be applied to the vehicle's actuators in order to make it move and steer properly.

For such purpose, both level 4 and level 5 autonomous vehicles are equipped with a variety of exteroceptive sensors, like cameras, LiDAR, RADAR, and ultrasonic sensors, to perceive the external environment including dynamic and static objects, and proprioceptive sensors, like IMUs, tachometers and altimeters, for internal vehicle state monitoring [4]. Moreover, high sensor redundancy along with sensor fusion are often necessary to achieve improved performance and high robustness, especially in degraded driving and weather conditions.

The tasks of perception, planning and control can be solved in isolation or jointly. The isolated approach is implemented as a modular pipeline in which each module is separate and performs a specific task [4]. The resulting system suffers from error propagation: the modules are designed by humans and are therefore potentially imperfect; every small error propagates through the system, compounding with the errors of the other modules. Basically, the isolated approach is neither optimal nor reliable. These weaknesses motivate the choice of the end-to-end driving paradigm. With end-to-end driving, the perception, planning and control tasks are solved jointly and are not represented explicitly. These systems have a more functional design and are easier to develop and maintain.

In general, we can distinguish various categories of system architectures for autonomous vehicle design, which also account (or not) for connectivity among vehicles [4]:

• Ego-only systems (or standalone vehicles) do not share information with other autonomous vehicles. A standalone vehicle uses only its own knowledge to devise driving decisions. The lack of connectivity makes this category of AVs simpler to design compared to vehicles that are connected together.
• Connected systems are able to distribute the basic operations of automated driving among other autonomous vehicles, thus forming a connected multi-agent system. In this way, vehicles can share detailed driving information and use such additional information to make better decisions. Communication among vehicles requires a specific infrastructure and communication protocols, in addition to being able to efficiently transmit and store large amounts of data.
• Modular systems are structured as a pipeline of separate components (as discussed previously), each of them solving a specific task. The main advantage is that the complex problem of autonomous driving can be decomposed into a set of smaller and easier-to-solve problems.
• End-to-end driving generates ego-motion directly from (raw) sensory inputs (e.g. RGB camera images), without the need to design any intermediate module. Ego-motion can be either the continuous operation of steering wheel and pedals (i.e. acceleration and braking) or a discrete set of actions. End-to-end driving is simple to implement, but often leads to less interpretable systems.

Imitation Learning [5], [6] is the preferred approach for end-to-end driving, given its design simplicity and optimization stability, despite requiring a considerable amount of expert data for learning a competitive policy. Deep Reinforcement Learning (RL) is gaining interest for its encouraging results in the field [7], [8], without requiring the collection of expert trajectories: just a real or simulated environment (e.g. CARLA [9], or AirSim [10]) is needed, instead. Moreover, RL can potentially discover better-than-expert behavior, since it maximizes the agent's performance with respect to a designed reward function.

In this paper, we provide the following contributions:

• We combine the Proximal Policy Optimization (PPO) [11] algorithm with Curriculum Learning [12], showing how to learn an end-to-end urban driving policy for the CARLA driving simulator [9].
• We evaluate our curriculum-based agent on various metrics, towns, weather conditions, and traffic scenarios. To our knowledge, we are the first to demonstrate consistent results on all towns provided by CARLA, by training the agent on only one town.
• Moreover, we point out two important sources of instability in reinforcement learning algorithms: learning the value function V(s), and normalizing the estimated advantage function A(s, a).
• Finally, we provide two novel techniques to solve these issues. The two methods can be applied to any value-based RL algorithm, as well as to actor-critic algorithms. More notably, the same technique we use to learn the value function is general enough to be employed in almost any ML regression problem.

The paper is organized as follows: Section II defines and describes the related work on the topic, categorizing it into (i) Autonomous Driving approaches based on Deep Learning techniques, (ii) Reinforcement Learning for Autonomous Driving, and (iii) Autonomous Driving Simulators. Section III introduces the formalisms and definitions needed to understand the background of the paper. In Section IV the proposed approach is presented. Section V shows the results obtained on the CARLA towns. Finally, Section VI concludes the paper.

II. RELATED WORK

A. Deep Learning-Based Autonomous Driving

Deep learning-based end-to-end driving systems aim to achieve human-like driving simply by learning a mapping function from inputs to output targets, so being able to imitate human experts. These inputs are often (monocular) camera images, while the targets can be quantities like the steering angle, the vehicle's speed, the route-following direction, throttle and braking values, or even high-level commands.

Reference [13] trained a convolutional neural network to map raw pixels from a single front-facing camera directly to steering commands. The authors managed to drive in traffic on local roads, on highways, and even in areas with unclear visual guidance. To correct the vehicle drifting from the ground-truth trajectory, the authors employed two additional cameras to record left and right shifts. The authors evaluated their system by measuring the autonomy metric, being autonomous 98% of the time. To mitigate this shifting problem, [14] developed a sensor setup that provides a 360-degree view of the area surrounding the vehicle by using eight cameras. Their driving model uses multiple Convolutional Neural Networks (CNNs) as feature encoders, four Long Short-Term Memory (LSTM) recurrent networks [15] as temporal encoders, and a fully-connected network to incorporate map information. Their system is trained to minimize the mean squared error (MSE) against speed and steering angle.

Reference [5] proposes to condition the imitation learning procedure on a high-level routing command (i.e. a one-hot encoded vector), such that trained policies can be controlled at test time by a passenger or by a topological planner. The authors evaluated the approach in a simulated urban environment provided by the CARLA driving simulator [9] and on a physical system: a 1/5-scale truck. For goal-based navigation they recorded a success rate of 88% in Town 1 (training scenario), and of 64% in Town 2 (testing scenario); two of the simplest towns available.
End-to-end behavioral cloning is appealing for its simplicity and scalability, but it has limitations [6], such as: dataset bias and overfitting when data is not diverse enough, generalization issues towards dynamic objects seen during training, and domain shift between the off-line training experience and the on-line behavior. Despite these limitations, behavioral cloning can still achieve state-of-the-art results, as demonstrated by [6]. In fact, the authors proposed a ResNet-based [16] architecture with a speed prediction branch. According to them, in the presence of large amounts of data a deep model can reduce both bias and variance over data, also achieving better generalization performance on learning reactions to dynamic objects and traffic lights in complex urban environments. The authors also proposed a novel CARLA driving benchmark, called NoCrash, in which the ability of the ego vehicle is tested on three urban scenarios with different weather conditions: empty town with no dynamic objects, regular traffic with a moderate amount of cars and pedestrians, and dense traffic with a large number of vehicles and pedestrians.

Reference [17] proposed the first direct perception method - an emerging paradigm that combines both end-to-end learning and control algorithms - named Conditional Affordance Learning (CAL), to handle traffic lights and speed signs by using image-level labels, as well as smooth car-following, resulting in a significant reduction of traffic accidents in simulation. Their CAL agent consists of a neural network that predicts six types of affordances from the input observation, and a lateral and longitudinal controller which predicts the throttle, brake, and steering values.

Reference [18] proposed the first interpretable neural motion planner for learning to drive autonomously in complex urban scenarios that include traffic-light handling, yielding, and interactions with multiple road-users. Their model employs a convolutional backbone to predict the bounding boxes of other actors, as well as a space-time cost volume for planning. The input representation consists of LiDAR point clouds coupled with annotated HD maps of the road. The space-time cost volume represents the goodness of each location that the self-driving car can take within a planning horizon. Their model is trained end-to-end with a multi-task objective: the planning loss encourages the minimum cost plan to be similar to the trajectory performed by human demonstrators, and the perception loss encourages the intermediate representations to produce accurate 3D detection and motion forecasting. According to the authors, the combination of these two losses ensures the interpretability of the intermediate representations.

B. Reinforcement Learning-Based Autonomous Driving

Reference [8] demonstrated the first application of deep reinforcement learning to autonomous driving. Their model is able to learn a policy for lane following in a handful of training episodes, using a single monocular image as input. The authors used the Deep Deterministic Policy Gradient (DDPG) algorithm [19] with prioritized experience replay [20], with all exploration and optimization performed on-vehicle. Their state space consists of monocular camera images compressed by a learned Variational Auto-Encoder (VAE) [21], together with the observed vehicle speed and steering angle. The authors defined a two-dimensional continuous action space: steering angle, and speed set-point. The authors utilize a 250 meter section of road for real-world driving experiments. Their best performing model is capable of solving a simple lane following driving task in half an hour.

Reference [7] proposes Controllable Imitative Reinforcement Learning (CIRL) to learn a driving policy based only on vision inputs from the CARLA simulator [9]. CIRL adopts a two-stage learning procedure: a first imitation stage pretrains the actor's network on ground-truth actions recorded from human driving videos, and the subsequent reinforcement learning stage employs DDPG [19] to improve the driving policy. According to the authors, the first imitation stage is necessary to prevent DDPG from falling into local optima due to poor exploration. CIRL uses a four-branch network with a speed prediction branch, similar to [6]. The authors conducted experiments on the CARLA simulator benchmark, showing that the CIRL performance is comparable to the best imitation learning methods, such as CIL [5], CAL [17], and CIRLS [6].

Often, training a competitive driving policy from high-dimensional observations is too difficult or expensive for RL. Reference [22] proposes to visually encode the perception and routing information the agent receives into a bird-view image, which is further compressed by a VAE [21]. To reduce training complexity the authors employed the frame-skip trick, in which each action made by the ego-vehicle is repeated for the subsequent k = 4 frames. The authors evaluated their approach on CARLA [9], specifically on a challenging roundabout scenario in Town 3. They compared three RL algorithms: Double DQN [23], TD3 [24], and SAC [25]. The latter achieved the best performance.

Reference [26] proposed a multi-objective DQN agent, motivated by the fact that a multi-objective approach can help overcome the difficulties of designing a scalar reward that properly weighs each performance criterion. Furthermore, the authors suggest that when each aspect is learned separately, it is possible to choose which aspect to explore in a given state. In particular, they learned a separate agent for each objective which, collectively, form a combined policy that takes all these objectives into account. The authors trained the agent on two four-way intersecting roads with random surrounding traffic provided by the SUMO traffic simulator [27], demonstrating a very low infraction rate.

C. Autonomous Driving Simulators

Autonomous driving research requires a considerable amount of diversified data, collected on a variety of driving scenarios with different weather conditions as well. Collecting such an amount of data in the real world is difficult, time-consuming, and costly. Moreover, driving datasets often focus only on specific aspects of the driving task, and are also collected with specific sensor modalities (e.g. RGB cameras vs LiDAR sensors).
An increasingly popular alternative to real-world data are autonomous driving simulators. Modern driving simulators like CARLA (Car Learning to Act) [9] and AirSim [10] provide realistic 3D graphics and physics simulation, traffic management, weather conditions, a variety of sensors, pedestrian management, different vehicles, and various driving scenarios as well. In particular, AirSim also supports autonomous aerial vehicles, like drones. These kinds of simulators are very flexible, providing an easy way to collect data in different driving scenarios and weather conditions, with different vehicles and sensor modalities. TORCS (The Open Racing Car Simulator) [28] is a modular, multi-agent car simulator that focuses on racing scenarios, instead. Compared to CARLA and AirSim, TORCS has lower-quality graphics, no traffic and pedestrian simulation, and a limited set of sensors. Other kinds of driving simulators focus solely on traffic simulation. SUMO (Simulation of Urban Mobility) [27] is a microscopic traffic simulation tool that models each vehicle and its dynamics individually. In particular, SUMO can even simulate railways and the CO2 emissions of individual vehicles.

III. BACKGROUND

In this section we provide the basic formalism and results about Reinforcement Learning [29], Generalized Advantage Estimation [30], and Proximal Policy Optimization [11], needed for understanding and developing the subsequent sections.

A. Reinforcement Learning

Reinforcement Learning (RL) [29] is a learning paradigm to tackle decision-making problems that provides a formalism for modeling behavior, in which a software or physical agent learns how to take optimal actions within an environment (i.e. a real or simulated world) by trial and error, guided only by positive or negative scalar reward signals (sometimes called reinforcements).

Formally, an environment is a Markov Decision Process (MDP) represented by a tuple (S, A, P, r, γ), in which: S is the state space, A is the action space, P(s′ | s, a) is the transition model (also called the environment dynamics), with which it is possible to predict the evolution of the environment's state, r: S × A → ℝ is the reward function, and finally γ ∈ (0, 1] is the discount factor.

The state space defines all the possible states s ∈ S (of the environment) that can be experienced by the agent. Instead, the action space depicts all the possible actions a ∈ A that the agent can predict. If the state space is not fully observable, the agent instead perceives observations o ∈ O, which are yielded by the environment itself. The observation space O contains only a partial amount of the information described by S; the rest (such as the environment's internal state) is hidden. In order to recover such hidden information, the agent usually retains (or processes somehow) the full (or partial) history of the previous observations, i.e. o_{1:t}, until the current timestep t. This setting is usually referred to as a partially-observable Markov decision process (POMDP).

The agent derives actions according to its policy π: S → A, which can be either deterministic, a_t = π(s_t), or stochastic, π(a_t | s_t), mapping states s_t to actions a_t. Note that in a partially-observable setting (i.e. a POMDP) the true states are not available to the agent, which derives actions by conditioning on (one or more) past observations instead: π(a_t | o_{j:t}), where the index j (j ≤ t) indicates how many past observations are considered. For our purposes, we restrict the policy to be a Deep Neural Network (DNN) [31], πθ with learnable parameters θ, that samples actions from a probability distribution, i.e. a_t ∼ πθ(· | s_t). In our case, our agent predicts two continuous actions, so we need to sample them from a continuous probability distribution like a Gaussian. Motivated by [32], we use a Beta distribution instead, which, apart from outperforming the Gaussian distribution, is particularly suited for continuous actions that are also bounded.

In order to learn the desired behavior, the agent has to interact with the target environment: at the first timestep (t = 0) the environment provides the agent with an initial state s_0 ∼ ρ(s_0), sampled from the initial state distribution ρ(s_0), usually implicitly defined by the environment. Then, the agent uses its policy to predict and execute the action a_0 affecting the environment, resulting in state s_1 according to the environment dynamics, i.e. s_1 ∼ P(· | s_0, a_0). Consequently, the environment evaluates the newly reached state s_1 with its reward function, also providing the agent with the respective immediate reward r_0 = r(s_0, a_0). Then, the interaction loop repeats for the next timestep until either a final state or the maximum number of timesteps has been reached. In general, the interaction loop proceeds as follows: at a generic timestep t the agent experiences a state s_t, then it computes an action a_t resulting in state s_{t+1}, for which it receives a reward r_t = r(s_t, a_t) from the environment. In practice, we consider finite-horizon episodes of maximum length T.
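To make the interaction loop concrete, the following is a minimal sketch of how one finite-horizon episode could be collected. It assumes a generic environment object exposing Gym-style reset() and step() methods and a policy object with a predict() method; these names are illustrative placeholders, not the actual CARLA wrappers used in this work.

# Minimal sketch of the agent-environment interaction loop described above.
# `env` and `policy` are hypothetical placeholders following a common
# Gym-style interface; they are not the CARLA wrappers used in this paper.

def collect_episode(env, policy, max_timesteps):
    states, actions, rewards = [], [], []
    state = env.reset()                               # s_0 ~ rho(s_0)

    for t in range(max_timesteps):                    # finite horizon of length T
        action = policy.predict(state)                # a_t ~ pi_theta(. | s_t)
        next_state, reward, done = env.step(action)   # s_{t+1} ~ P(. | s_t, a_t), r_t

        states.append(state)
        actions.append(action)
        rewards.append(reward)

        state = next_state
        if done:                                      # a final state has been reached
            break

    return states, actions, rewards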
B. Proximal Policy Optimization

Proximal Policy Optimization (PPO) [11] is a model-free RL algorithm from the policy optimization family that aims to learn policies in a faster, more efficient, and more robust way compared to vanilla policy gradient [33] and TRPO [34]. In general, the aim of RL algorithms is to indirectly maximize the performance objective J(θ), in order to maximize the agent's performance on the given task:

    J(θ) = E[ Σ_{t=0}^{T−1} γ^t r_t ]    (1)

Maximizing the performance objective J(θ) means maximizing the expected sum of discounted rewards, seeking a policy π* = arg max_π J(θ) that achieves maximal performance (i.e. Σ_t r_t is maximal). The objective (1) is stochastic (since the rewards result from states and actions sampled by following π), apart from being not directly differentiable. Hence, policy optimization algorithms (like other RL methods) optimize a surrogate objective J̃(θ) instead, called the policy gradient:

    ∇θ J̃(θ) = E[ Σ_{t=0}^{T−1} ∇θ log πθ(a_t | s_t) A(s_t, a_t) ]    (2)

where πθ is a policy parameterized by θ, and A(s, a) is the advantage function. The PPO algorithm optimizes a slightly different policy gradient objective to maximize J(θ). In particular, we utilize the following clipping objective variant (borrowing notation from [11]):

    L^clip(θ) = E_t[ min( ratio_t(θ) Â_t, clip(ratio_t(θ), 1 − ε, 1 + ε) Â_t ) ]    (3)

where ratio_t(θ) = πθ(a_t | s_t) / πθ_old(a_t | s_t) denotes the probability ratio between the current policy πθ and the old policy πθ_old, Â_t represents the advantages estimated by using GAE, and lastly the clip(·) function constrains the ratio to the interval [1 − ε, 1 + ε].
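As a concrete reference for the objective in (3), the snippet below is a minimal TensorFlow 2 sketch of the clipped surrogate loss. The tensor names (log_probs, old_log_probs, advantages) and the clipping value ε = 0.2 are assumptions made for illustration, not the exact variables or hyperparameters of our implementation.

# Minimal TensorFlow 2 sketch of the clipped surrogate objective in Eq. (3).
# `log_probs`, `old_log_probs` and `advantages` are per-timestep tensors that a
# training loop would provide; their names and clip_ratio=0.2 are assumptions.
import tensorflow as tf

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_ratio=0.2):
    # probability ratio pi_theta(a|s) / pi_theta_old(a|s), computed in log-space
    ratio = tf.exp(log_probs - old_log_probs)

    # unclipped and clipped terms of Eq. (3)
    unclipped = ratio * advantages
    clipped = tf.clip_by_value(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * advantages

    # PPO maximizes the objective, so the loss is its negation
    return -tf.reduce_mean(tf.minimum(unclipped, clipped))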
Fig. 1. Example results of applying data augmentation. The original image is at the top-left corner.

B. Data Augmentation

As demonstrated by previous work [5], [17], [39], data augmentation is crucial to let the agent generalize across different towns and weather conditions. Similarly to [5], the augmentations used are: color distortion (i.e. changes in contrast, brightness, saturation, and hue), Gaussian blur, Gaussian noise, salt-and-pepper noise, cutout, and coarse dropout. Each augmentation function is applied with a certain probability and intensity (see Fig. 1).

Geometrical transformations commonly used for image detection tasks, including horizontal or vertical flipping, rotation, and shearing, are not applied in this case, since they would significantly alter the driving scene.

Note that data augmentation has been used only in the last two stages of the reinforced curriculum learning procedure (more details in Section IV-F).
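The snippet below is a minimal sketch of how such probabilistic color-distortion augmentations could be composed in TensorFlow 2. The probabilities and intensity ranges are illustrative assumptions, and the remaining augmentations (blur, salt-and-pepper noise, cutout, coarse dropout) are omitted for brevity.

# A minimal sketch of probabilistic color-distortion augmentation in TensorFlow 2.
# It assumes float images in [0, 1]; probabilities and ranges are illustrative
# assumptions, not the exact values used by the authors.
import tensorflow as tf

def maybe(fn, image, prob):
    # apply `fn` to `image` with probability `prob`, otherwise return it unchanged
    return tf.cond(tf.random.uniform([]) < prob, lambda: fn(image), lambda: image)

def augment(image):
    image = maybe(lambda x: tf.image.random_brightness(x, max_delta=0.2), image, prob=0.3)
    image = maybe(lambda x: tf.image.random_contrast(x, 0.8, 1.2), image, prob=0.3)
    image = maybe(lambda x: tf.image.random_saturation(x, 0.8, 1.2), image, prob=0.3)
    image = maybe(lambda x: tf.image.random_hue(x, max_delta=0.05), image, prob=0.3)
    # additive Gaussian noise with a random standard deviation
    noise = tf.random.normal(tf.shape(image), stddev=tf.random.uniform([], 0.0, 0.05))
    image = maybe(lambda x: x + noise, image, prob=0.2)
    return tf.clip_by_value(image, 0.0, 1.0)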
Fig. 2. The neural network architecture of the proposed agent (with minor omissions). The first half depicts the shared network Pψ, while the second half shows, respectively from top to bottom, the value Vφ and policy πθ branches. At the center, the outputs of the first half of the network are first concatenated and then linearly combined, before being fed to both the value and policy branches.

C. Agent Architecture

The agent is implemented by a deep neural network [31] that takes the current observation o_t as input, and outputs the next action a_t ∼ πθ(z_t) along with its value v_t = Vφ(z_t), where z_t = Pψ(o_t). The deep neural network representing the agent has two branches: the policy branch πθ with parameters θ (the actor), and the value branch Vφ with parameters φ (the critic). The policy branch samples actions from a Beta distribution, as motivated by [32]. The value branch outputs the value v of the states s, which is used to estimate the advantage function A(s, a) with the GAE [30] technique. Both branches share a common neural network Pψ with parameters ψ, that processes observations o into an intermediate representation z.

Since each observation o_t is a stack of 4 sets of tensors (see Section IV-A), i.e. o_t = [o_t^1, . . . , o_t^4], the network Pψ is applied sequentially on each o_t^i, yielding four z_t^i which are aggregated by Gated Recurrent Units (GRUs) [40] to obtain z_t. Moreover, Pψ embeds a ShuffleNet V2 [41] to process image data. Finally, both Vφ and πθ are feed-forward NNs with two layers of 320 SiLU-activated [42] units and batch normalization [43].

The overall architecture of the agent is depicted in Fig. 2. The blue rectangles indicate fully-connected (or dense) layers. The blue circle, i.e. ⊕, denotes layer concatenation along the first dimension (or axis), where the batch dimension is at axis zero. The shared network Pψ (first half) processes each component of the observation tensor o_t^i separately, and the components are independently aggregated by GRU layers [40] into single vectors. Then, the output of the concatenation is linearly combined and fed to the two branches. Lastly, values are decomposed into two numbers, bases b and exponents e, as motivated in the following section.
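The following is a minimal tf.keras sketch of the two-branch architecture described above, assuming illustrative input shapes and head sizes. ShuffleNet V2 is not bundled with tf.keras, so a small convolutional stack stands in for the image backbone; the Beta-parameter and base-exponent heads reflect the description above, but not necessarily the exact layer configuration of our agent.

# A minimal tf.keras sketch of the agent's two-branch architecture.
# A small CNN stands in for the ShuffleNet V2 backbone; input shapes, the
# vector-feature size and the output heads are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def build_agent(image_shape=(90, 120, 3), vec_size=16, stack=4, k=6):
    # each observation is a stack of 4 sets of tensors (images + vector features)
    images = layers.Input((stack,) + image_shape, name='images')
    vectors = layers.Input((stack, vec_size), name='vectors')

    # stand-in for the ShuffleNet V2 backbone, applied to every stacked frame
    backbone = tf.keras.Sequential([
        layers.Conv2D(32, 5, strides=2, activation='relu'),
        layers.Conv2D(64, 3, strides=2, activation='relu'),
        layers.GlobalAveragePooling2D()])
    img_feat = layers.TimeDistributed(backbone)(images)

    # each component is aggregated by a GRU into a single vector, then
    # the vectors are concatenated and linearly combined
    z_img = layers.GRU(128)(img_feat)
    z_vec = layers.GRU(64)(vectors)
    z = layers.Dense(320)(layers.Concatenate()([z_img, z_vec]))

    def branch(x):
        for _ in range(2):  # two 320-unit SiLU layers with batch normalization
            x = layers.Dense(320, activation=tf.nn.silu)(x)
            x = layers.BatchNormalization()(x)
        return x

    # policy branch: alpha/beta parameters of a Beta distribution (two actions)
    pi = branch(z)
    alpha = layers.Dense(2, activation='softplus')(pi)
    beta = layers.Dense(2, activation='softplus')(pi)

    # value branch: base in [-1, 1] and exponent in [0, k] (see the next section)
    v = branch(z)
    base = layers.Dense(1, activation='tanh')(v)
    exponent = layers.Lambda(lambda e: k * e)(layers.Dense(1, activation='sigmoid')(v))

    return tf.keras.Model([images, vectors], [alpha, beta, base, exponent])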
D. Learning the Value Function

The value function is learned by minimizing the squared loss L_v(φ) = ‖v − R‖₂² between the network's estimate of the values, v = [v_t]_{t=0}^{T−1}, and the true returns R = [R_t]_{t=0}^{T−1}, where each return R_t = Σ_{i=t}^{T−1} γ^i r_i is the discounted sum of rewards from timestep t to the end of the episode, T − 1.

Notice that when the quantity ‖v − R‖₂² is large, because the estimate v is far from the ground-truth R, (the norm of) its gradient ∇φ L_v(φ) is also large, and so the parameters φ get a big update that can cause training to be less stable. A commonly used practice is to normalize both values and returns to have zero mean and unit variance, so that the magnitude of the error is always small. However, this approach is biased: the normalization statistics are not known in advance, and the scale of such quantities changes as the performance of the agent improves.

The following outlines the approach we use to learn the value function stably and accurately, without any normalization bias: both values v and returns R are respectively decomposed into bases b_v, b_R ∈ [−1, 1] and exponents e_v, e_R ∈ [0, k] such that

    v = b_v · 10^{e_v}
    R = b_R · 10^{e_R},

where k ∈ ℕ is a positive constant that should be large enough to represent even the largest returns. For example, we set k = 6 so that even returns up to ±10^6 can be properly represented. With such a base-exponent decomposition, learning the value function is a matter of regressing both bases and exponents; the new loss function L_v(φ) is defined as follows:

    L_v(φ) = Σ_{t=0}^{T−1} [ (b_{v_t} − b_{R_t})² / 4 + (e_{v_t} − e_{R_t})² / k² ]    (7)

Hence, even large errors now lie in a small interval, because both the base b and the exponent e take values in a small interval, and so the gradient ∇φ L_v(φ) is always reasonably small, resulting in more stable training. Note that the bases b have a different scale from the exponents e, so we normalize them (by respectively dividing by 4 and k²) such that they contribute equally to the loss value, once again avoiding the need to weight these two error terms. The normalizing coefficients are obtained by considering the worst case of the squared differences. Since the bases b ∈ [−1, 1], the worst case (i.e. the largest error value) is given by (1 − (−1))² = 4, supposing b_{v_t} = 1 and b_{R_t} = −1 (or vice-versa). Similarly, for the exponents the worst case is (0 − k)² = k², since e ∈ [0, k], again supposing e_{v_t} = 0 and e_{R_t} = k (or vice-versa).

Fig. 3. Example of a value function learned through base-exponent decomposition. In the leftmost plot, the learned value function compared to returns; in the center plot, the regression of bases; in the rightmost plot, the regression of exponents.
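For clarity, the snippet below sketches how returns could be decomposed into base-exponent targets and how the loss in (7) is then computed. The decomposition routine and its epsilon are assumptions made for illustration; only the constant k = 6 and the 1/4 and 1/k² weights follow the text.

# A minimal TensorFlow 2 sketch of the base-exponent decomposition and of Eq. (7).
# The helper name and the epsilon guarding log10(0) are assumptions; k = 6 follows
# the example given in the text.
import tensorflow as tf

K = 6.0  # exponents lie in [0, k]

def decompose(x, eps=1e-7):
    # exponent e in [0, K] and base b in [-1, 1] such that x ≈ b * 10^e
    exponent = tf.clip_by_value(
        tf.math.ceil(tf.math.log(tf.abs(x) + eps) / tf.math.log(10.0)), 0.0, K)
    base = x / tf.pow(10.0, exponent)
    return base, exponent

def value_loss(pred_base, pred_exp, returns):
    target_base, target_exp = decompose(returns)
    # each squared error is divided by its worst case: 4 for bases, K^2 for exponents
    return tf.reduce_sum((pred_base - target_base) ** 2 / 4.0
                         + (pred_exp - target_exp) ** 2 / K ** 2)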
E. Sign-Preserving Advantage Normalization

The estimated advantages Â_t directly affect the norm of the gradient ∇θ L^clip(θ) of PPO's policy objective (3), being a multiplicative factor. Consequently, if the advantages are large, the norm of ∇θ L^clip(θ) is also large, resulting in a considerable change of the policy's parameters and thus a probable change of the agent's behavior, which may easily diverge; alternatively, we could lower the learning rate by several factors, potentially slowing down training. Note that the magnitude of the advantages strictly depends on the quality of the learned value function: poorly estimated values imply large advantages, since Â ≈ Vφ(s) − R, where Vφ is the learned value function and R the true returns. So, it is important to scale the advantages into a reasonable range, without introducing any bias, to stabilize learning (Fig. 4).

Fig. 4. Normalized advantages (b) now have a small scale, roughly in [−1, 1]. The magnitude of the original advantages (a) was much larger, on the order of 10^5. This ensures the policy gradient's norm to be small as well. Notice the scale of the normalized advantages is almost 10^4 times smaller. Moreover, our normalization scheme ensures the preservation of the sign, that is, if in (a) some advantages were positive, they will still be positive after our normalization in (b).

For such purpose we propose the sign-preserving normalization function, which separately normalizes positive values from negative ones. The function is defined by the following TensorFlow 2 [44] code:

import tensorflow as tf

def sign_preserving_norm(adv, eps=1e-3):
    adv_max = tf.reduce_max(adv)
    adv_min = tf.reduce_min(adv)

    # first, filter positives and negatives
    pos = adv * tf.cast(adv > 0, tf.float32)
    neg = adv * tf.cast(adv < 0, tf.float32)

    # then, normalize them separately: positives by the maximum,
    # negatives by the (absolute value of the) minimum
    return (pos / (adv_max + eps)) + (neg / -(adv_min - eps))

Advantages normalized with the above function have the benefit of having the same sign (and, thus, meaning) as the original advantages (Fig. 5), while having a small and controllable scale, which we argue contributes to stabilizing training. Preserving the sign is an important property which avoids detrimental gradient-flipping issues that cause ambiguity in the policy between better-than-average and worse-than-average actions, which would otherwise be mistaken for one another: for example, widely used normalization techniques like min-max normalization and standardization (i.e. zero-mean unit-variance normalization) lack this property. In particular, min-max normalization transforms values to be in the range [0, 1], such that the minimum value corresponds to 0 and the maximum to 1. Such a normalization would make the normalized advantages always positive: thus, the sign is lost. Similarly, standardization would change the sign to negative for those values which are below the mean value.
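As a toy illustration of this property (the values below are made up, not taken from our experiments), the proposed function can be contrasted with standardization as follows:

# Toy illustration (not from the paper): the advantage values are made up to show
# that the proposed normalization keeps signs while standardization can flip them.
import tensorflow as tf

adv = tf.constant([25000.0, -300.0, 1200.0, -90000.0])

print(sign_preserving_norm(adv))
# ~[1.0, -0.0033, 0.048, -1.0]: small, controllable scale with all signs preserved

print((adv - tf.reduce_mean(adv)) / tf.math.reduce_std(adv))
# standardization maps -300.0 to a positive value (the mean is about -16025),
# flipping its sign and thus its better/worse-than-average meaning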
F. Reinforced Curriculum Learning

Since the problem of autonomous driving is extremely complex, we adopt a stage-based learning procedure for our PPO agent, inspired by Curriculum Learning [12]. We divide
TABLE I
PERFORMANCE OF OUR AGENTS: Curriculum (C), Standard (S), AND Untrained (U). BEST RESULTS ARE HIGHLIGHTED IN BOLD. THE RESULTS HAVE BEEN AGGREGATED OVER THE TWO WEATHER SETS (Soft AND Hard), AND THREE TRAFFIC SCENARIOS (No, Regular, AND Dense)
The evaluation considers three traffic scenarios: no traffic (without any pedestrian nor vehicle), regular traffic (50 pedestrians and 50 vehicles), and dense traffic (200 pedestrians and 100 vehicles).

We also evaluate the benefit of curriculum learning, comparing the same agent with and without curriculum: we refer to the former agent as curriculum (C), and to the latter as standard (S). Moreover, we also provide the (non-trivial) baseline performance of an agent with the same architecture as the other two, but with random weights kept fixed for the entire evaluation procedure: we refer to this agent as untrained (U). Notice that the untrained agent is a stronger (but still naive) baseline compared to a purely random-guess agent, which completely discards the input observations it receives, solely sampling actions uniformly. Relative performance, aggregated over the three traffic scenarios as well as the two weather sets, is shown in Table I. Qualitative results are provided in Fig. 6.

Fig. 6. Performance of our agent in various settings, towns and weather. Notice that scenarios (a) and (c) are novel, not experienced by the agent during training.

B. Discussion

From the detailed evaluation results, we point out two major weaknesses of our approach: (1) the agent struggles at coordinating acceleration and braking, and (2) at recognizing obstacles. This results in low speed (about 9 km/h) and many collisions as well. Such behavior could be due to a lack of exploration, limited network capacity and/or architecture, as well as various difficulties in optimizing the policy gradient.

We also demonstrate the following: (1) emerging driving behavior without leveraging any domain knowledge, that is (2) robust and consistent across towns and weather conditions; furthermore, (3) the stage-based reinforcement learning procedure has proven to be competitive, even better, compared to plain reinforcement learning.

VI. CONCLUSION

Deep reinforcement learning is still a relatively new field with many unexplored research directions; it enables us to solve even complex decision-making problems in a completely end-to-end fashion, without leveraging any domain-specific knowledge or expensive sets of highly-annotated data. On the contrary, imitation learning is a stronger approach for autonomous driving that heavily relies on high-quality and high-quantity datasets, which should also provide demonstrations of recovery from driving mistakes in order to learn a reliable driving policy.

Although our approach is not yet competitive with the state-of-the-art (CIRL [7], CAL [17], and CIRLS [6]), we demonstrate emerging driving behavior that is consistent across all CARLA towns and robust to changes in weather. To our knowledge, we are the first to provide baseline performance on all towns, and to demonstrate such consistency. We also provide a decomposition of the returns that allows learning the value function in a stable and accurate way, as well as a proper normalization function for the estimated advantages.

REFERENCES

[1] SAE International On-Road Automated Vehicle Standards Committee, Taxonomy and Definitions for Terms Related to On-Road Motor Vehicle Automated Driving Systems, Warrendale, PA, USA, Inf. Rep., 2014.
[2] S. Pendleton et al., "Perception, planning, control, and coordination for autonomous vehicles," Machines, vol. 5, no. 1, p. 6, Feb. 2017.
[3] S. Grigorescu, B. Trasnea, T. Cocias, and G. Macesanu, "A survey of deep learning techniques for autonomous driving," J. Field Robot., vol. 37, no. 3, pp. 362–386, 2020.
[4] E. Yurtsever, J. Lambert, A. Carballo, and K. Takeda, "A survey of autonomous driving: Common practices and emerging technologies," IEEE Access, vol. 8, pp. 58443–58469, 2020.
[5] F. Codevilla, M. Müller, A. Lopez, V. Koltun, and A. Dosovitskiy, "End-to-end driving via conditional imitation learning," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), May 2018, pp. 1–9.
[6] F. Codevilla, E. Santana, A. Lopez, and A. Gaidon, "Exploring the limitations of behavior cloning for autonomous driving," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9329–9338.
[7] X. Liang, T. Wang, L. Yang, and E. Xing, "CIRL: Controllable imitative reinforcement learning for vision-based self-driving," in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 584–599.
[8] A. Kendall et al., "Learning to drive in a day," in Proc. Int. Conf. Robot. Automat. (ICRA), May 2019, pp. 8248–8254.
[9] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, "CARLA: An open urban driving simulator," 2017, arXiv:1711.03938.
[10] S. Shah, D. Dey, C. Lovett, and A. Kapoor, "AirSim: High-fidelity visual and physical simulation for autonomous vehicles," in Field and Service Robotics, M. Hutter and R. Siegwart, Eds. Cham, Switzerland: Springer, 2018, pp. 621–635.
[11] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017, arXiv:1707.06347.
[12] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, "Curriculum learning," in Proc. 26th Annu. Int. Conf. Mach. Learn., 2009, pp. 41–48.
[13] M. Bojarski et al., "End to end learning for self-driving cars," 2016, arXiv:1604.07316.
[14] S. Hecker, D. Dai, and L. Van Gool, "End-to-end learning of driving models with surround-view cameras and route planners," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 435–453.
[15] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[16] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[17] A. Sauer, N. Savinov, and A. Geiger, "Conditional affordance learning for driving in urban environments," 2018, arXiv:1806.06498.
[18] W. Zeng et al., "End-to-end interpretable neural motion planner," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 8660–8669.
[19] T. P. Lillicrap et al., "Continuous control with deep reinforcement learning," 2015, arXiv:1509.02971.
[20] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," 2015, arXiv:1511.05952.
[21] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," 2013, arXiv:1312.6114.
[22] J. Chen, B. Yuan, and M. Tomizuka, "Model-free deep reinforcement learning for urban autonomous driving," in Proc. IEEE Intell. Transp. Syst. Conf. (ITSC), Oct. 2019, pp. 2765–2771.
[23] H. van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," 2015, arXiv:1509.06461.
[24] S. Fujimoto, H. van Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," 2018, arXiv:1802.09477.
[25] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," 2018, arXiv:1801.01290.
[26] C. Li and K. Czarnecki, "Urban driving with multi-objective deep reinforcement learning," 2018, arXiv:1811.08586.
[27] P. A. Lopez et al., "Microscopic traffic simulation using SUMO," in Proc. 21st Int. Conf. Intell. Transp. Syst. (ITSC), Nov. 2018, pp. 2575–2582.
[28] B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom, and A. Sumner, (2000). TORCS, The Open Racing Car Simulator. [Online]. Available: http://torcs.sourceforge.net
[29] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
[30] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation," 2015, arXiv:1506.02438.
[31] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep Learning, vol. 1, no. 2. Cambridge, MA, USA: MIT Press, 2016.
[32] P.-W. Chou, D. Maturana, and S. Scherer, "Improving stochastic policy gradients in continuous control with deep reinforcement learning using the beta distribution," in Proc. Int. Conf. Mach. Learn., 2017, pp. 834–843.
[33] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Mach. Learn., vol. 8, nos. 3–4, pp. 229–256, 1992.
[34] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in Proc. Int. Conf. Mach. Learn., 2015, pp. 1889–1897.
[35] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980.
[36] V. Mnih et al., "Asynchronous methods for deep reinforcement learning," in Proc. Int. Conf. Mach. Learn., 2016, pp. 1928–1937.
[37] G. Brockman et al., "OpenAI gym," 2016, arXiv:1606.01540.
[38] V. Mnih et al., "Playing Atari with deep reinforcement learning," 2013, arXiv:1312.5602.
[39] F. Codevilla, A. M. Lopez, V. Koltun, and A. Dosovitskiy, "On offline evaluation of vision-based driving models," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 236–251.
[40] K. Cho et al., "Learning phrase representations using RNN encoder-decoder for statistical machine translation," 2014, arXiv:1406.1078.
[41] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, "ShuffleNet V2: Practical guidelines for efficient CNN architecture design," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 116–131.
[42] S. Elfwing, E. Uchibe, and K. Doya, "Sigmoid-weighted linear units for neural network function approximation in reinforcement learning," Neural Netw., vol. 107, pp. 3–11, Nov. 2018.
[43] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," 2015, arXiv:1502.03167.
[44] M. Abadi et al., "TensorFlow: A system for large-scale machine learning," in Proc. 12th USENIX Symp. Operating Syst. Design Implement. (OSDI), 2016, pp. 265–283.
[45] L. Anzalone, S. Barra, and M. Nappi, "Reinforced curriculum learning for autonomous driving in CARLA," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2021, pp. 3318–3322.

Luca Anzalone received the B.Sc. and M.Sc. degrees (cum laude) in computer science from the University of Salerno, in 2018 and 2020, respectively. He is currently pursuing the Ph.D. degree in data science and computation with the University of Bologna. His research interests include deep learning and deep reinforcement learning.

Paola Barra received the B.S. degree in computer science from the University of Salerno, the M.S. degree in business informatics from the University of Pisa, and the Ph.D. degree from the University of Salerno in 2021. Her research interests include machine learning techniques for computer vision problems, such as facial and gait recognition, action recognition, and tumor detection. She is a member of GIRPR/IAPR.

Silvio Barra was born in Battipaglia, Salerno, Italy, in 1985. He received the B.Sc. and M.Sc. degrees (cum laude) in computer science from the University of Salerno, in 2009 and 2012, respectively, and the Ph.D. degree from the University of Cagliari, in 2017. Currently, he is a Research Assistant with the University of Naples Federico II. He has authored more than 50 papers, published in international journals, conferences, and books. His main research interests include pattern recognition, biometrics, video analysis and analytics, and financial forecasting.

Aniello Castiglione (Member, IEEE) received the Ph.D. degree in computer science from the University of Salerno, Italy. He is currently an Associate Professor with the University of Naples Parthenope, Italy. He received the Italian National Qualification as a Full Professor of computer science in 2021. He published more than 240 papers in international journals and conferences. Considering his journal articles, more than 85 of them are ranked Q1 in the Scopus/Scimago classification and more than 70 of them are ranked Q1 in the Clarivate Analytics/ISI-WoS classification. His international academic profile is spread among his 86 international coauthors, who belong to 75 different institutions located in 18 countries. He served as the Program Chair and a TPC Member in around 250 international conferences (some of them ranked A+/A/A- in the CORE, LiveSHINE, and Microsoft Academic international classifications). His current research interests include information forensics, digital forensics, security and privacy on cloud, communication networks, applied cryptography, and sustainable computing. Currently, he is the Editor-in-Chief of the Special Issues for the Journal of Ambient Intelligence and Humanized Computing (Springer). He served as the Managing Editor for two ISI-ranked international journals and as a Reviewer for 110 international journals. In addition, he served as a Guest Editor for 30 Special Issues and served on the Editorial Boards of more than ten international journals, such as IEEE TRANSACTIONS ON SUSTAINABLE COMPUTING, IEEE ACCESS, IET Image Processing (IET), Journal of Ambient Intelligence and Humanized Computing (Springer), MTAP, Sustainability (MDPI), Smart Cities (MDPI), and Future Internet (MDPI). One of his papers (published in the IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING) was selected as "Featured Article" in the "IEEE Cybersecurity Initiative" in 2014. In October 2020 and October 2021, he was included in the ranking of the top 100,000 scientists for the years 2019 and 2020. He is a member of ACM.

Michele Nappi (Senior Member, IEEE) received the laurea degree (cum laude) in computer science from the University of Salerno, Italy, in 1991, the M.Sc. degree in information and communication technology from I.I.A.S.S. E.R. Caianiello, in 1997, and the Ph.D. degree in applied mathematics and computer science from the University of Padova, Italy, in 1997. He was one of the founders of the spin-off BS3 (biometric system for security and safety) in 2014. He is currently a Full Professor of computer science with the University of Salerno. He is a Team Leader of the Biometric and Image Processing Laboratory (BIPLAB). He has authored more than 180 papers in peer-reviewed international journals, international conferences, and book chapters. His research interests include pattern recognition, image processing, image compression and indexing, multimedia databases and biometrics, human–computer interaction, and VR/AR. He is a member of the TPC of international conferences. He is a GIRPR/IAPR Member. He received several international awards for scientific and research activities. He is the Co-Editor of several international books. He serves as an Associate Editor and a Managing Guest Editor for several international journals. He is the President of the Italian Chapter of the IEEE Biometrics Council.