De - Biasing 2 MIT

This document proposes a new method for training deep neural networks for autonomous driving using a variational autoencoder. The method allows for novelty detection by learning latent variables that characterize the training data and can identify situations with insufficient data. It also addresses training data imbalances by resampling the data distribution. The model is trained end-to-end to directly output steering commands from camera images while also learning unsupervised latent variables. This enables novelty detection to increase robustness and automated debiasing to improve training accuracy and speed. The approach was evaluated on a challenging autonomous driving dataset.


Variational Autoencoder for End-to-End Control of Autonomous

Driving with Novelty Detection and Training De-biasing

The MIT Faculty has made this article openly available. Please share
how this access benefits you. Your story matters.

Citation Amini, Alexander, Wilko Schwarting, Guy Rosman, Brandon Araki, Sertac Karaman and Daniela Rus. "Variational Autoencoder for End-to-End Control of Autonomous Driving with Novelty Detection and Training De-biasing." 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (Palacio Municipal de Congresos, Madrid, Spain, Oct. 1-5 2018)

As Published https://www.iros2018.org/

Version Author's final manuscript

Citable link http://hdl.handle.net/1721.1/118139

Terms of Use Creative Commons Attribution-Noncommercial-Share Alike

Detailed Terms http://creativecommons.org/licenses/by-nc-sa/4.0/


Variational Autoencoder for End-to-End Control of Autonomous Driving
with Novelty Detection and Training De-biasing
Alexander Amini¹, Wilko Schwarting¹, Guy Rosman², Brandon Araki¹, Sertac Karaman³, Daniela Rus¹

¹ Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology {amini,wilkos,araki,rus}@mit.edu
² Toyota Research Institute (TRI) {guy.rosman}@tri.global
³ Laboratory for Information and Decision Systems, Massachusetts Institute of Technology {sertac}@mit.edu

Toyota Research Institute (TRI) provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Volta V100 GPU used for this research.

Abstract: This paper introduces a new method for end-to-end training of deep neural networks (DNNs) and evaluates it in the context of autonomous driving. DNN training has been shown to result in high accuracy for perception to action learning given sufficient training data. However, the trained models may fail without warning in situations with insufficient or biased training data. In this paper, we propose and evaluate a novel architecture for self-supervised learning of latent variables to detect the insufficiently trained situations. Our method also addresses training data imbalance, by learning a set of underlying latent variables that characterize the training data and evaluate potential biases. We show how these latent distributions can be leveraged to adapt and accelerate the training pipeline by training on only a fraction of the total dataset. We evaluate our approach on a challenging dataset for driving. The data is collected from a full-scale autonomous vehicle. Our method provides qualitative explanation for the latent variables learned in the model. Finally, we show how our model can be additionally trained as an end-to-end controller, directly outputting a steering control command for an autonomous vehicle.

Fig. 1: Semi-supervised end-to-end control. An encoder neural network is trained to learn a supervised control command as well as various other unsupervised outputs that qualitatively describe the image. This enables two key contributions of novelty detection and dataset debiasing. (Diagram labels: Encoder; Unsupervised Latent Variables: Weather (Snow), Adjacent Vehicles, Road Surface; Steering Control Command; Propagate Uncertainty; Uncertainty Estimation & Novelty Detection; Dataset Debiasing; Resampled data distribution; Sample Efficient Accelerated Training.)

I. INTRODUCTION

Robots operating in human-centered environments have to perform reliably in unanticipated situations. While deep neural networks (DNNs) offer great promise in enabling robots to learn from humans and their environments (as opposed to hand-coding rules), substantial challenges remain [1]. For example, previous work in autonomous driving has demonstrated the ability to train end-to-end a DNN capable of generating vehicle steering commands directly from car-mounted video camera data with high accuracy so long as sufficient training data is provided [2]. But true autonomous systems should also gracefully handle scenarios with insufficient training data. Existing DNNs will likely produce incorrect decisions without a reliable measure of confidence when placed in environments for which they were insufficiently trained.

A society where robots are safely and reliably integrated into daily life demands agents that are aware of scenarios for which they are insufficiently trained. Furthermore, subsystems of these agents must effectively convey the confidence associated with their decisions. Finally, robust performance of these systems necessitates an unbiased, balanced training dataset. To date, many such systems have been trained with datasets that are either biased or contain class imbalances, due to the lack of labeled data. This negatively impacts both the speed and accuracy of training.

In this paper, we address these limitations by developing an end-to-end control method capable of novelty detection and automated debiasing through self-supervised learning of latent variable representations. In addition to learning a final control output directly from raw perception data, we also learn a number of underlying latent variables that qualitatively capture the underlying features of the data (cf. Fig. 1). These latent variables, as well as their associated uncertainties, are learned through self-supervision of a network trained to reconstruct its own input. By estimating the distribution of latent factors, we can estimate when the network is likely to fail (thus increasing the robustness of the controller) and adapt the training pipeline to cater to the distribution of these underlying factors, thereby improving training accuracy. Our approach makes two key contributions:

1) Detection of novel events for which the network has been insufficiently trained and cannot be trusted to produce reliable outputs; and
2) Automated debiasing of a neural network training pipeline, leading to faster training convergence and increased accuracy.

Our solution uses a Variational Autoencoder (VAE) network architecture comprised of two parts, an encoder and a decoder. The encoder is responsible for learning a mapping from raw sensor data directly to a low dimensional latent
space encoding that maximally explains as much of the data as possible. The decoder is responsible for learning the inverse mapping that takes as input a single sample from the aforementioned latent space and reconstructs the original input. As opposed to a standard VAE model, which self-supervises the entire latent space, we also explicitly supervise a single dimension of the latent space to represent the robotic control output.

Fig. 2: Novel VAE architecture for end-to-end control. Image features are extracted through convolutional layers to encode the input image into the variational latent space with one of the latent variables explicitly supervised to learn steering control. The resulting latent variables are self-supervised by feeding the entire encoding into a decoder network that learns to reconstruct the original input image. Uncertainty is modeled through the variance of each latent variable ($\sigma_k^2$). (Diagram labels: Data Collection (data collected from human drivers); Domain Knowledge (1. Recover from off-center and off-orientation, 2. Downsample straight roads); Encoder; Latent Space (indices 0 to k-1); Decoder. Legend: Conv, Deconv, FC, Mean, Variance, Rand Sample.)

We use end-to-end autonomous driving as the robotic control use case. Here a steering control command is predicted from only a single input image. As a safety-critical task, autonomous driving is particularly well-suited for our approach. Control systems for autonomous vehicles, when deployed in the real world, face enormous amounts of uncertainty and possibly even environments that they have never encountered before. Additionally, autonomous driving is a safety-critical application of robotics; such control systems must possess reliable ways of assessing their own confidence.

We evaluate our VAE-based approach on a challenging real driving dataset of time-synchronized image and control data samples collected with a full-scale autonomous vehicle and demonstrate both novelty detection and automated debiasing of the dataset. Our algorithm includes a VAE-based indicator to detect novel images that were not contained in the training distribution. We demonstrate our algorithm's ability to detect over 97% of novel image frames that were trained during a different time of day and detect 100% of camera sensor malfunctions in our dataset.

We address training set data imbalance by introducing unsupervised latent variables into the VAE model. These latent variables qualitatively represent many of the high level semantic features of the scene. Moreover, we show that we can estimate the distributions of the latent variables over the training data, enabling automated debiasing of newly collected datasets against the learned latent variables. This results in accelerated and sample efficient training.

The remainder of the paper is structured as follows: we summarize the related work in Sec. II, formulate the model in Sec. III, describe our experimental setup and dataset in Sec. IV, and provide an overview of our results in Sec. V.

II. RELATED WORKS

Traditional methods for autonomous driving first decompose the problem into smaller components, with an individual algorithm applied to each component. These submodules can range from mapping and localization [3], [4], to perception [5]–[7], planning [8], [9], and control [10], [11]. On the other hand, end-to-end systems attempt to collapse the entire problem (from raw sensory data to control outputs) into a single learned model. The ALVINN system [12] originally demonstrated this idea by training a multilayer perceptron to learn the direction a vehicle should travel from pixel image inputs. Advancements in deep learning have caused a revival of end-to-end methods for self-driving cars [2], [13]–[15]. These systems have shown enormous promise by outputting a single steering command and controlling a full-scale autonomous vehicle through residential and highway streets [2]. The system in [13] learns a discrete probability distribution over possible control commands while [15] applies the output of their convolutional feature extractor to a long short-term memory (LSTM) recurrent neural network for generating a driving policy. However, all of these models are still largely trained as black-boxes and lack a measure of associated confidence in their output and a method for interpreting the learned features.

Understanding and estimating confidence of the output of machine learning models has been investigated in different ways: One can formulate training of DNNs as a maximum likelihood problem using a softmax activation layer on the output to estimate probabilities of discrete classes [16] as well as discrete probability distributions [17]. Introspective capacity has been used to evaluate model performance for a variety of commonly used classification and detection tasks [18] by estimating an algorithm's uncertainty by the distance from train to test distribution in feature space. Bayesian deep learning has even been applied to end-to-end autonomous vehicle control pipelines [19] and offers an additional way to estimate confidence through Monte Carlo dropout sampling of weights in recurrent [20] and convolutional neural networks [21]. However, Monte Carlo dropout sampling is not always feasible in real-time on many hardware platforms
due to its repetitive inference calls. Additionally, there is no explanation or interpretation of the resulting control actions available, as well as no confidence metric of whether the model was trained on similar data as the current input data.

Variational Autoencoders (VAEs) [22], [23] are gaining importance as an unsupervised method to learn hidden representations. Such latent representations have been shown to qualitatively provide semantic structure underlying raw image data [24]. Previous works have leveraged VAEs for novelty detection [25], but did not directly use the predicted distributions of the sensor for a model fit criterion. Conversely, our work presents an indicator to detect novel images that were not contained in the training distribution by weighting the reconstructed image by the latent uncertainty propagated through the network. High loss indicates that the model has not been trained on that type of image and thus reflects lower confidence in the network's ability to generalize to that scenario. Operating over input distributions diverging from the training distribution is potentially dangerous, since the model cannot sufficiently reason about the input data.

Additionally, learning in the presence of class imbalance is one of the key challenges underlying all machine learning systems. When the classes are explicitly defined and labeled it is possible to resample the dataset [26], [27] or even reweight the loss function [28], [29] to avoid training bias. However, these techniques are not immediately applicable if the underlying class bias is not explicitly labeled (as is the case in many real world training problems). While [30] demonstrates how K-means can be used to find clusters within the data before training to provide a loss reweighting scheme, this method does not adapt to the model during training. Additionally, running K-means on high dimensional image data can be extremely computationally expensive and destroys all spatial information between nearby pixels. This paper provides an algorithm for adaptive dataset debiasing during training by learning the latent space distribution and sampling from the most representative regions of this space.

III. MODEL

In this section, we start by formulating the end-to-end control problem and then describe our model architecture for estimating steering control of an autonomous vehicle (in an end-to-end manner). We explain how our model allows us to self-supervise on a number of other latent variables which qualitatively describe the driving dataset. All models in this paper were trained on an NVIDIA Volta V100 GPU.

A. End-to-End Model Formulation

We start from the end-to-end model framework. In this framework we observe $n$ training images, $X = \{x_1, \ldots, x_n\}$, which are collections of raw pixels from a front-facing video camera. We aim to build a model that can directly map our input images, $X$, to output steering commands based on the curvature of the road $Y = \{y_1, \ldots, y_n\}$. To train a single image model we optimized the mean squared error (MSE) or L2 loss function between the human actuated control, $y_i$, and the predicted control, $\hat{y}_i$:

$$\mathcal{L}_y(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{1}$$

Note that only a single image is used as input at every time instant. This follows from original observations where models that were trained end-to-end with temporal information (CNN+LSTM) are unable to decouple the underlying spatial information from the temporal control aspect. While these models perform well on test datasets, they face control feedback issues when placed on a physical vehicle and consistently drift off the road.

B. VAE Architecture

We extend the end-to-end control model from Sec. III-A into a variational autoencoding (VAE) model. Unlike the classical VAE model [22], [23], one particular latent variable is explicitly supervised to predict steering control, and combined with the remaining latent variables to reconstruct the input image. Our model is shown in Fig. 2. The model accepts as input a 66 × 200 × 3 RGB image in mini-batches of size n = 50. We use a convolutional network encoder, comprised of convolutional (Conv) and fully connected (FC) layers, to compute $Q(z|X)$, the distribution of the latent variables given a data point. The encoder has an architecture similar to a standard end-to-end regression model and is detailed in Table I, with 5 convolutional layers and two fully connected layers with dropout. The latent space section of Fig. 2 shows the encoder outputting $2k$ activations corresponding to $\mu \in \mathbb{R}^k$, $\Sigma = \mathrm{Diag}[\sigma] \succ 0$ used to define the distribution of $z$. Next, there is a decoder network that mirrors the encoder (two FC layers with dropout and then 5 de-convolutional layers) to reconstruct the image back from the latent space sample $z$.

TABLE I: Architecture of the encoder neural network. "Conv" and "FC" refer to convolutional and fully connected layers, while k denotes the number of latent variables.

Layer               Output      Activation      Kernel   Stride
1. Input            66x200x3    Identity        -        -
2. Conv1            31x98x24    ReLU            5x5      2x2
3. Conv2            14x47x36    ReLU            5x5      2x2
4. Conv3            5x22x48     ReLU            5x5      2x2
5. Conv4            3x20x64     ReLU            3x3      1x1
6. Conv5            1x18x64     ReLU+Flatten    3x3      1x1
7. FC1              1000        ReLU            -        -
8. FC2              100         ReLU            -        -
9. FC3              2k          Identity        -        -
10. Latent Sample   k           Identity        -        -

In order to differentiate the outputs through the sampling phase, VAEs use a reparameterization "trick", sampling $\epsilon \sim \mathcal{N}(0, I)$ and computing $z = \mu(X) + \Sigma^{1/2}(X) \times \epsilon$. This allows us to train the encoder and decoder, end-to-end, using backpropagation.

In our VAE model we additionally supervise one of the latent variables to take on the value of the curvature of the vehicle's path. We represent this modified variable as $z_0 = z_{\hat{y}} = \hat{y}$. It is visible in Fig. 2 as the steering wheel graphic. The loss function of our VAE has three parts: a reconstruction loss, a latent loss, and a supervised latent loss. The reconstruction loss is the L1 norm between the input image and the output image, and serves the purpose of
training the decoder. We define this loss function as follows:

$$\mathcal{L}_x(x, \hat{x}) = \frac{1}{n} \sum_{i=1}^{n} |x_i - \hat{x}_i| \tag{2}$$

The latent loss is the Kullback-Leibler divergence between the latent variables and a unit Gaussian, providing regularization for the latent space [22], [23]. For Gaussian distributions, the KL divergence has the closed form

$$\mathcal{L}_{KL}(\mu, \sigma) = -\frac{1}{2} \sum_{j=0}^{k-1} \left( 1 - \mu_j^2 - \sigma_j^2 + \log\big((\sigma_j + \rho)^2\big) \right) \tag{3}$$

where $\rho = 10^{-8}$ is added for numerical stability.

Lastly, the supervised latent loss is a new addition to the loss function, and it is defined as the MSE between the predicted and actual curvature of the vehicle's path. It is the same as Eq. 1 from Sec. III-A.

These three losses are summed to compute the total loss:

$$\mathcal{L}_{TOTAL}(\cdot) = c_1 \mathcal{L}_y(y, \hat{y}) + c_2 \mathcal{L}_x(x, \hat{x}) + c_3 \mathcal{L}_{KL}(\mu, \sigma) \tag{4}$$

where $c_1$, $c_2$, and $c_3$ are used to weight the importance of each loss function. We found that $c_1 = 0.033$, $c_2 = 0.1$, $c_3 = 0.001$ yielded a good trade-off in importance between steering control, reconstruction, and KL loss, wherein no one individual loss component overpowers the others during training.

C. Novelty Detection

A crucial aspect underlying many deep learning systems is their ability to reliably compute uncertainty measurements and thus determine when they have low confidence in the given environment. Many standard end-to-end networks are seen as black-boxes, producing only a single control output number at the final layer. In this section, we explore a novel method for using the architecture to detect driving environments with insufficient training that cannot be trusted to produce a reliable decision.

We note the VAE architecture provides uncertainty estimates for every variable in the latent space via their parameters $\{\mu_i, \sigma_i\}_{i=0}^{k-1}$. However, what we really desire are uncertainty estimates in the input image space since these observations are available at test time. We therefore propagate the uncertainties through the remainder of the network by sampling in the latent layer space and computing empirical uncertainty estimates in any of the future layers (including the final reconstructed image space).

This can be explained by treating the VAE encoder network as a posterior model inference of the parameters $\theta$, and $z$ as samples from the posterior distribution inferred from $\theta$. Propagation of $z$ into $\hat{x}$ results in a posterior predictive sample. A common model fit measure for $\theta$ would be

$$\log P(x \mid \theta); \quad \theta = \{\sigma_i, \mu_i\}_{i=0}^{k-1} \tag{5}$$

Using a common pixel-wise independence approximation allows us to get a model rejection criterion based on the images, using a weighted L2 error. As commonly done in image processing, we use instead the L1 norm due to its robustness. Averaging over image pixels yields the error term:

$$D(x, \hat{x}) = \left\langle \frac{\left| x^{(p)} - E\left[\hat{x}^{(p)} \mid \theta\right] \right|}{\sqrt{\mathrm{Var}\left[\hat{x}^{(p)} \mid \theta\right]}} \right\rangle \tag{6}$$

where $E[\hat{x}^{(p)} \mid \theta]$, $\mathrm{Var}[\hat{x}^{(p)} \mid \theta]$ are computed by sampling values of $z$ and computing statistics for the resulting decoded images $\hat{x}$. Additionally, $x^{(p)}$ denotes the value of pixel $p$ in image $x$. The approach for pixelwise uncertainty estimates is similar to the unscented transform [31], and is captured in Algorithm 1. Using these statistics indicates whether an observed image is well captured by our trained model. In practice, we implemented a real-time version of this algorithm using 20 samples on each time iteration.

Algorithm 1 Compute pixel-wise expectation and variance
Require: Input image $x$, Encoder NN, & Decoder NN
    $\{\sigma_i, \mu_i\}_{i=1}^{k} \leftarrow$ Encoder($x$)
    $\theta \leftarrow \{\sigma_i, \mu_i\}_{i=1}^{k}$

    for $j = 1$ to $T$ do
        for $i = 1$ to $k$ do
            Sample $z_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$
        end for
        $\hat{x}_j$ = Decoder($z$)
    end for

    $E[\hat{x}^{(p)} \mid \theta] = \frac{1}{T} \sum_{j=1}^{T} \hat{x}_j^{(p)}$
    $\mathrm{Var}[\hat{x}^{(p)} \mid \theta] = \frac{1}{T} \sum_{j=1}^{T} \left( \hat{x}_j^{(p)} - E[\hat{x}^{(p)} \mid \theta] \right)^2$

    return $E[\hat{x} \mid \theta]$, $\mathrm{Var}[\hat{x} \mid \theta]$

D. Increased Training on Rare Events

A majority of driving data consists of straight road driving, while turns and curves are vastly underrepresented. End-to-end network training often handles this by resampling the training set to place more emphasis on the rarer events (i.e., turns) [2]. We generalize this notion to the latent space of our VAE model, better exploring the space of both control events and nuisance factors, for not only the steering command but all other underlying features. By feeding the original training data through the learned model we estimate the training data distribution $Q(z|X)$ in the latent space. Subsequently, it is possible to increase the proportion of rarer datapoints by dropping overrepresented regions of the latent space. We approximate $Q(z|X)$ as a histogram $\hat{Q}(z|X)$, where $z$ is the output of the Encoder NN corresponding to the input images $x \in X$. To avoid data-hungry high-dimensional (in our case 25 dimensions) histograms, we further simplify and utilize independent histograms $\hat{Q}_i(z_i|X)$ for each latent variable $z_i$ and approximate $\hat{Q}(z|X) \propto \prod_i \hat{Q}_i(z_i|X)$. Naturally, we would like to train on a higher number of unlikely training examples and drop many samples overrepresented in the dataset. We therefore train on a subsampled training set $X_{sub}$ including datapoints $x$ with probability $W(z(x)|X) \propto \prod_i 1/(\hat{Q}_i(z_i(x)|X) + \alpha)$. For small $\alpha$ the
subsampled training distribution is close to uniform over $z$, whereas large $\alpha$ keeps the subsampled training distribution closer to the original training distribution. At every epoch all images $x$ from the original dataset $X$ are propagated through the (learned) model to evaluate the corresponding latent variables $z(x)$. The histograms $\hat{Q}_i(z_i(x)|X)$ are updated accordingly. A new subsampled training set $X_{sub}$ is drawn by keeping images $x$ from the original dataset $X$ with likelihood $W(z(x)|X)$. Training on the subsampled data $X_{sub}$ now forces the classifier into a choice of parameters that work better in rare cases without strong deterioration of performance for common training examples. Most importantly, the debiasing is not manually specified beforehand but based on learned latent variables.

Fig. 3: Loss evolution. Convergence of our four loss functions (defined in Eqs. (1)-(4)) on the train (blue) and validation (red) data. (Panels: Training Set, Validation Set; horizontal axis: Iterations, ×10^4.)

IV. DATASET
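The resampling scheme of Sec. III-D can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the bin count, the value of `alpha`, and the toy latent codes are all hypothetical, and dividing by the largest weight is just one way to turn the proportionality $W \propto \prod_i 1/(\hat{Q}_i + \alpha)$ into keep-probabilities.

```python
import random

def histogram_density(values, n_bins=10):
    """Return a function mapping a value to its normalised histogram frequency."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # guard against a constant latent variable
    counts = [0] * n_bins

    def bin_of(v):
        return min(int((v - lo) / width), n_bins - 1)

    for v in values:
        counts[bin_of(v)] += 1
    total = len(values)
    return lambda v: counts[bin_of(v)] / total

def keep_probabilities(latents, alpha=0.01, n_bins=10):
    """latents: one latent vector z(x) per training image (assumed layout)."""
    k = len(latents[0])
    # One independent histogram per latent variable, as in the factorised approximation.
    densities = [histogram_density([z[i] for z in latents], n_bins) for i in range(k)]
    weights = []
    for z in latents:
        w = 1.0
        for i in range(k):
            w *= 1.0 / (densities[i](z[i]) + alpha)
        weights.append(w)
    w_max = max(weights)
    return [w / w_max for w in weights]   # rarest samples get probability 1

def subsample(dataset, latents, alpha=0.01, rng=random):
    """Keep each datapoint with its keep-probability; call once per epoch."""
    probs = keep_probabilities(latents, alpha)
    return [x for x, p in zip(dataset, probs) if rng.random() < p]

# Toy data (purely illustrative): 90 images share one latent value, 10 are rare.
latents = [[0.0]] * 90 + [[1.0]] * 10
probs = keep_probabilities(latents, alpha=0.01)
kept = subsample(list(range(100)), latents, alpha=0.01, rng=random.Random(0))
```

In this toy case the rare samples are always kept while the common ones are kept with probability of about 0.12, and raising `alpha` pushes all keep-probabilities toward equality, which mirrors the role of $\alpha$ described in Sec. III-D.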
The vehicle base platform employed to collect the dataset is a Toyota Prius 2015 V used in collaboration with the MIT-Toyota partnership. A forward facing Leopard Imaging LI-AR0231-GMSL camera, which has a field-of-view of 120 degrees and captures 1080p RGB images at approximately 30Hz, is the vision data source for this study. Sensors also collected the human actuated steering wheel angle (rad) as well as vehicle speed (m/sec). An Xsens MTi 100-series Inertial Measurement Unit (IMU) was used to collect acceleration, rotation, and orientation data from the rigid body frame and thus compute the curvature of the vehicle's path. Specifically, given a yaw rate $\gamma_i$ (rad/sec), and the speed of the vehicle, $v_i$ (m/sec), we compute the curvature (or inverse radius) of the path as $y_i = \gamma_i / v_i$. Note that we can model a simple relationship between steering wheel angle and road curvature given the vehicle slip angle by approximating the vehicle according to the Bicycle Model. While we employ curvature ($y_i$) to model our networks, for the remainder of this paper we will use the terms "steering control" and "curvature" interchangeably, since we can compute one given the other by estimating slip angle directly from IMU data.

All communication and data logging was done directly on an NVIDIA PX2, which is an in-car supercomputer specifically developed for autonomous driving. As part of data collection for this project we set up the PX2 to connect to, communicate with, and control a full-scale Toyota Prius with drive-by-wire. Additionally, we developed the software infrastructure to read the video stream and synchronize with the other data streams (inertial and steering data) on the PX2. We drove the vehicle in the Boston metropolitan area and collected data for approximately 4 hours (which was split 3/1 into training and test sets). In the following subsection, we outline the data processing and augmentation techniques that were performed prior to training our models.

A. Processing and Augmentation

A number of preprocessing steps were taken to clean and prepare the data for training. First, we annotated each frame of collected video according to the time of day, road type, weather, and maneuver (lane-stable, turn, lane change, junk). This labeling process allowed us to segment out the pieces of our data which we wanted to train with. Note the labels were not used as part of the training process, but only for data filtering and preprocessing. Following previous work [2] we used only "lane-stable" in our dataset, which corresponds to sections of driving where the vehicle is stable in its lane.

To inject domain knowledge into our network we augmented the dataset with images collected from cameras placed approximately 2 feet to the left and right of the main center camera. We correspondingly changed the supervised control value to teach the model how to recover from off-center positions. For example, for an image collected from the right camera we perturb the steering control with a negative number to steer slightly left, and vice versa. We add images from all three cameras (center, left, and right) to our dataset for training.

B. Optimization

We trained our models with the Adam optimizer [32] with $\alpha = 10^{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. We considered the number of latent variables, $k$, to be a hyperparameter and trained models with 400, 100, 50, 25, and 15 latent variables. By analyzing the validation error upon convergence, we were able to identify that the model with 25 latent variables provided the best fit of our dataset while providing realistic reconstructions. Therefore, we use $k = 25$ for the rest of our analysis. Fig. 3 shows the evolution of all four losses (as defined in Eqs. 1-4) for the training and validation sets over $2.6 \times 10^4$ steps. The MSE steering loss converges to nearly 0 for training but decreases more slowly for the validation set. Meanwhile, the KL divergence loss of Eq. 3 increases rapidly before plateauing at a relatively low value for both training and validation data. This is to be expected, since the latent variable distributions are initialized as $\mathcal{N}(0, 1)$ and are then perturbed to find approximations of the Gaussian that allow the distributions to best reduce the overall loss. Since we use 25 latent variables to model an image with 66 × 200 × 3 = 39,600 dimensions, it is expected that the decoder cannot exactly reproduce the input image (i.e. $\mathcal{L}_x(x, \hat{x}) > 0$).

V. RESULTS

A. Steering Control Accuracy

To evaluate the performance of our model for the task of end-to-end autonomous vehicle control of steering we start by training a standard regression network which takes as input a single image and outputs steering curvature [2]. The
architecture of this model is almost identical to the first nine layers of our encoder neural network with the exception of the final layer being only one neuron (as opposed to $2k$ neurons). We found that the training loss of both the regression and our VAE network were almost exactly the same. The steering validation loss for the VAE was roughly 4.4, versus a value of 3.8 for the regression model loss. Therefore the loss is 14% higher, corresponding to a mean error in steering wheel angle of only 1.4 degrees. Therefore, we use the VAE model to predict both a steering control and uncertainty measure with roughly equal accuracy as the baseline regression model but simultaneously gain all of the additional advantages from the learned latent encodings.

Fig. 4 shows the images associated with the best and worst steering predictions. The mean uncertainty of the best predictions was $9.3 \times 10^{-4}$ vs $1.7 \times 10^{-2}$ for the worst predictions, indicating that our model can indeed predict when its estimated steering angle is uncertain. The images associated with the worst steering predictions are mostly from when the car is at a high angle of incidence towards the center of the road, representing a small portion of our training dataset. On the other hand, the images with the best steering prediction are mostly straight roads with strongly defined lanes, probably because our dataset has many examples of straight roads (despite debiasing) and because lane markers are a key feature for predicting steering angle.

Fig. 4: Bad apple analysis. Test images with the best (right) and worst (left) steering error. (Panels: Worst Steering Prediction, Best Steering Prediction, each labeled with its mean uncertainty.)

Fig. 5: Precision and uncertainty estimation. Plots of true vs. estimated steering (left) and true steering vs. estimated uncertainty (right). Both show that the model variance tends to increase on larger turns (i.e., greater steering magnitude). (Axes: True Steering vs. Est. Steering; True Steering vs. Est. Uncertainty.)

B. Latent Variables

In this subsection, we analyze the resulting latent space that our encoder learned. We start by gauging the underlying meaning of each of the latent variables. We synthesize images using our decoder, starting with a vector in the latent space, and perturb a single latent variable as we reconstruct output images. By varying the perturbation we can understand how that specific latent variable affects the image. The results for an exemplary set of latent variables are shown in Fig. 6. We observe that the network is able to generate intuitive representations for lane markings and additional environmental structure such as other surrounding vehicles and weather without ever actively being told to do so. By identifying latent variable representations we can immediately observe what the network sees and explain how the corresponding steering control is derived.

Intuitively, we know that the network will be penalized for having redundant latent variables due to the fact that the reconstruction loss penalizes reconstructed images that do not resemble the input image. This means that the encoder should learn a latent representation of the input such that as much distinct information is explained away as possible. This causes the variables learned to be the most important underlying variables in the dataset so the decoder can reconstruct the image from such a low dimensional space.
Fig. 5 corroborates these results. The first chart shows
that there is a strong correlation between the true steering C. Detecting out of sample environments
and the estimated steering. Interestingly, the values of the
estimated steering fan out at the extreme values, showing that Next, we examined ways to interpret if our network
1) our dataset is sparse at extreme steering values and 2) that is confident in its end-to-end control prediction. Figure 7
the uncertainty of the prediction should increase at extreme shows sample pixel-wise uncertainties (red) obtained from
steering values. The second chart shows the relationship this method overlaid on top of the respective input images.
between true steering and the estimated uncertainty, and it As one might expect, details around adjacent vehicles, the
indeed shows that uncertainty increases as the absolute value distant horizon and the presence of snow next to the road
of the steering increases. Although this shows a weak point highlight the greatest uncertainty. This make sense since the
of a dataset – that it is sparse in images associated with large model is constrained to only 25 latent variables and doesnot
steering angle – it shows a strong point of the VAE model, have capacity to hold any of these less meaningful details.
which is that it is able to predict that there is high uncertainty We can now plot the distribution of this distance over
for large values of the steering angle. Additionally, such datasets on which the network received sufficient and insuf-
uncertainty at greater turns makes intuitive sense since larger ficient training data. To be very explicit, we train a network
road curvatures imply less future road is visible in the image. on daytime driving and detect novel nighttime driving. Since
For example, on extreme turns it is common for less than the network has not been trained with night data, we should
10m to be visible in image, whereas straight road images not trust the output and therefore want to understand, as
present can present visible far more into the future closer best as possible, when nighttime data is being fed into the
to 100m. The fact that we can see less of the future road network. We set a simple threshold γ95 at the 95th percentile
supports the increased uncertainty in such scenarios. of D(x, x̂) for every x in the entire training set. For any
Fig. 6: Latent variable perturbation. A selection of six learned latent variables with associated interpretable descriptors
(left labels). Images along the x-axis (from left to right) were generated by linearly perturbing the latent vector encoding
along that single latent dimension. While steering command (top) was a supervised latent variable, all others (bottom five)
were entirely unsupervised and learned by the model from the dataset.

subsequent data sample x_test, we classify it as novel if

D(x_test, x̂_test) > γ95    (7)

Fig. 7: Propagating uncertainty into pixel-space. Sample images from the dataset along with pixel-wise uncertainty estimates. Uncertain pixels are highlighted red.

[Fig. 8 plot: histogram (y-axis: Count, 0 to 300; x-axis: 1 to 10, ×10⁷) with three groups labeled Training (Day), Novel (Night), and Sensor Malfunction.]
Fig. 8: Novelty Detection. Feeding training distribution data (day time) through the network as well as a novel data distribution (night time). A third peak forms when the camera sensor malfunctions from poor illumination.

[Fig. 9 plot: curves labeled Original and Debiased; y-axis 0 to 100; x-axis: Iterations, 0 to 12 (×10³).]

Figure 8 illustrates the training set distribution (daytime driving) in blue as well as the extracted threshold γ95. When image frames collected at night are fed through the network we observe the orange distribution and are able to correctly classify 97% of these frames as novel. Furthermore, we experiment with frames collected during image sensor malfunctions. These malfunctions are a realistic and common failure mode for the AR0231 sensor caused by incorrect white balance estimation. Feeding such images through the network can be catastrophic and yield unpredictable control responses. Using our metric D(x, x̂) we see that the distribution of these images (green) falls far from the training set and we can successfully detect 100% of these frames.
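The thresholding rule in Eq. (7) can be illustrated with a minimal NumPy sketch. It assumes the distances D(x, x̂) have already been computed for every frame; the distance function itself is not reproduced here. The function names `fit_novelty_threshold` and `is_novel` and the synthetic distance values are hypothetical, chosen only for illustration, and are not taken from the pipeline described in this paper.

```python
import numpy as np


def fit_novelty_threshold(train_distances, percentile=95.0):
    """Return the threshold gamma: the given percentile of D(x, x_hat)
    computed over the training set (gamma_95 when percentile=95)."""
    distances = np.asarray(train_distances, dtype=float)
    return float(np.percentile(distances, percentile))


def is_novel(test_distances, gamma):
    """Apply Eq. (7): flag a sample as novel when D(x_test, x_hat_test) > gamma."""
    return np.asarray(test_distances, dtype=float) > gamma


# Synthetic stand-ins for D(x, x_hat); these are NOT values from the paper.
rng = np.random.default_rng(0)
train_d = rng.normal(loc=2.0, scale=0.5, size=10_000)  # "daytime" training frames
night_d = rng.normal(loc=5.0, scale=0.5, size=1_000)   # "nighttime" novel frames

gamma_95 = fit_novelty_threshold(train_d)

# Fraction of each set flagged as novel by the threshold.
train_flag_rate = is_novel(train_d, gamma_95).mean()
night_flag_rate = is_novel(night_d, gamma_95).mean()
```

By construction, roughly 5% of the training frames themselves exceed γ95, so the chosen percentile directly sets the false-alarm rate on in-distribution data; a higher percentile trades fewer false alarms for lower sensitivity to novel inputs.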

D. Debiasing the model

Fig. 9: Accelerated training with debiasing. Comparison of training loss evolution with/without automated debiasing.

To evaluate the effect of debiasing during training we train two models: one without debiasing the dataset for inherent latent imbalances, and one that subsamples our dataset to reduce the over-represented (i.e., biased) samples. On every epoch we sample only 50% of the dataset for training while the remaining data is discarded. Figure 9 illustrates the loss evolution throughout both training schemes. Note that the loss is computed on the original data distribution (and not the subsampled distribution), since we ultimately only care about our performance on the original data. Debiasing the training pipeline allows the model to focus on events that are typically more rare (and inherently more difficult since they occur less frequently). This results in training that is more data efficient (using only 50% of the data), and also
faster than standard training. Figure 9 shows a minimum loss of 20 achieved after roughly half as many training iterations compared to training on the original data distribution.

VI. CONCLUSION

This paper presents a novel deep learning-based algorithm for end-to-end autonomous driving that also captures uncertainty through an intermediate latent representation. Specifically, we built a learning pipeline that computes a steering control command directly from raw pixel inputs of a front facing camera. Compared to previous research on end-to-end driving, our model also captures uncertainty of the final control prediction. We treat our input image data as being modeled by a set of underlying latent variables (one of which is the steering command taken by a human driver) with a VAE architecture. Additionally, we propose a novel method for detecting novel inputs which have not been sufficiently trained for by propagating the VAE's latent uncertainty through the decoder. Finally, we provide an algorithm for debiasing against learned biases based on the unsupervised latent space. By adaptively subsampling half of the dataset throughout training, we remove the over-represented latent regions and empirically observe 2× training speedups.

While the steering command latent variable is explicitly supervised by ground truth human data and can be used to control the vehicle, we also learn many other latent variables in an unsupervised manner. We show that these unsupervised latent variables represent concrete and interpretable features in the driving scene, such as the presence of different lane markers, surrounding vehicles, and even the weather.

Our approach is scalable to massive driving datasets since it does not require any manual data-labeling of the supervised signals. While previous work in end-to-end learning presents a form of reactionary control, lane following, and object avoidance, this technique encodes a much richer set of information in a probability distribution. Furthermore, since autonomous driving is an extremely safety critical application of AI, the uncertainty measurements that we provide are absolutely crucial for end-to-end techniques to be deployed onto real-world roads.

REFERENCES

[1] W. Schwarting, J. Alonso-Mora, and D. Rus, “Planning and decision-making for autonomous vehicles,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 1, pp. 187–210, 2018.
[2] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al., “End to end learning for self-driving cars,” arXiv preprint arXiv:1604.07316, 2016.
[3] J. Levinson and S. Thrun, “Robust vehicle localization in urban environments using probabilistic maps,” in Robotics and Automation (ICRA), 2010 IEEE International Conference on, 2010.
[4] S. Thrun, D. Fox, W. Burgard, and F. Dellaert, “Robust monte carlo localization for mobile robots,” Artificial intelligence, vol. 128, no. 1-2, pp. 99–141, 2001.
[5] A. Amini, B. Horn, and A. Edelman, “Accelerated Convolutions for Efficient Multi-Scale Time to Contact Computation in Julia,” arXiv preprint arXiv:1612.08825, 2016.
[6] A. A. Assidiq, O. O. Khalifa, M. R. Islam, and S. Khan, “Real time lane detection for autonomous vehicles,” in Computer and Communication Engineering, 2008. ICCCE 2008. International Conference on. IEEE, 2008, pp. 82–88.
[7] Z. Kim, “Robust lane detection and tracking in challenging scenarios,” IEEE Transactions on Intelligent Transportation Systems, vol. 9, no. 1, pp. 16–26, 2008.
[8] U. Lee, S. Yoon, H. Shim, P. Vasseur, and C. Demonceaux, “Local path planning in a complex environment for self-driving car,” in Cyber Technology in Automation, Control, and Intelligent Systems (CYBER), 2014 IEEE 4th Annual International Conference on, 2014.
[9] C. Richter, W. Vega-Brown, and N. Roy, “Bayesian learning for safe high-speed navigation in unknown environments,” in Proceedings of the International Symposium on Robotics Research (ISRR 2015), Sestri Levante, Italy, 2015.
[10] W. Schwarting, J. Alonso-Mora, L. Paull, S. Karaman, and D. Rus, “Safe nonlinear trajectory generation for parallel autonomy with a dynamic vehicle model,” IEEE Transactions on Intelligent Transportation Systems, no. 99, pp. 1–15, 2017.
[11] P. Falcone, F. Borrelli, J. Asgari, H. E. Tseng, and D. Hrovat, “Predictive active steering control for autonomous vehicle systems,” IEEE Transactions on control systems technology, 2007.
[12] D. A. Pomerleau, “ALVINN, an autonomous land vehicle in a neural network,” Carnegie Mellon University, Computer Science Department, Tech. Rep., 1989.
[13] A. Amini, L. Paull, T. Balch, S. Karaman, and D. Rus, “Learning steering bounds for parallel autonomous systems,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018.
[14] H. Xu, Y. Gao, F. Yu, and T. Darrell, “End-to-end learning of driving models from large-scale video datasets,” arXiv preprint arXiv:1612.01079, 2016.
[15] J. Zhang and K. Cho, “Query-efficient imitation learning for end-to-end autonomous driving,” CoRR, vol. abs/1605.06450, 2016.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[17] J. Zhang, J. T. Springenberg, J. Boedecker, and W. Burgard, “Deep reinforcement learning with successor features for navigation across similar environments,” CoRR, vol. abs/1612.05533, 2016.
[18] H. Grimmett, R. Triebel, R. Paul, and I. Posner, “Introspective classification for robot perception,” The International Journal of Robotics Research, vol. 35, no. 7, pp. 743–762, 2016.
[19] A. Amini, A. Soleimany, S. Karaman, and D. Rus, “Spatial Uncertainty Sampling for End-to-End control,” in Neural Information Processing Systems (NIPS); Bayesian Deep Learning Workshop, 2017.
[20] Y. Gal and Z. Ghahramani, “A theoretically grounded application of dropout in recurrent neural networks,” in Advances in neural information processing systems, 2016, pp. 1019–1027.
[21] Y. Gal and Z. Ghahramani, “Bayesian convolutional neural networks with bernoulli approximate variational inference,” arXiv preprint arXiv:1506.02158, 2015.
[22] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
[23] D. J. Rezende, S. Mohamed, and D. Wierstra, “Stochastic backpropagation and approximate inference in deep generative models,” arXiv preprint arXiv:1401.4082, 2014.
[24] Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin, “Variational autoencoder for deep learning of images, labels and captions,” in Advances in neural information processing systems, 2016, pp. 2352–2360.
[25] C. Richter and N. Roy, “Safe visual navigation via deep learning and novelty detection,” in Proc. of the Robotics: Science and Systems Conference, 2017.
[26] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote: synthetic minority over-sampling technique,” Journal of artificial intelligence research, vol. 16, pp. 321–357, 2002.
[27] A. More, “Survey of resampling techniques for improving classification performance in unbalanced datasets,” arXiv preprint arXiv:1608.06048, 2016.
[28] Z.-H. Zhou and X.-Y. Liu, “Training cost-sensitive neural networks with methods addressing the class imbalance problem,” IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 63–77, 2006.
[29] S. Suresh, N. Sundararajan, and P. Saratchandran, “Risk-sensitive loss functions for sparse multi-category classification problems,” Information Sciences, vol. 178, no. 12, pp. 2621–2638, 2008.
[30] G. H. Nguyen, A. Bouzerdoum, and S. L. Phung, “A supervised learning approach for imbalanced data sets,” in Pattern Recognition, 2008. ICPR 2008. 19th International Conference on. IEEE, 2008, pp. 1–4.
[31] E. A. Wan and R. Van Der Merwe, “The unscented kalman filter for nonlinear estimation,” in Adaptive Systems for Signal Processing, Communications, and Control Symposium 2000. AS-SPCC. The IEEE 2000. IEEE, 2000, pp. 153–158.
[32] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
