Variational Autoencoder for End-to-End Control of Autonomous Driving with Novelty Detection and Training De-biasing

Alexander Amini¹, Wilko Schwarting¹, Guy Rosman², Brandon Araki¹, Sertac Karaman³, and Daniela Rus¹
Abstract— This paper introduces a new method for end-to-end training of deep neural networks (DNNs) and evaluates it in the context of autonomous driving. DNN training has been shown to result in high accuracy for perception-to-action learning given sufficient training data. However, the trained models may fail without warning in situations with insufficient or biased training data. In this paper, we propose and evaluate a novel architecture for self-supervised learning of latent variables to detect the insufficiently trained situations. Our method also addresses training data imbalance by learning a set of underlying latent variables that characterize the training data and evaluate potential biases. We show how these latent distributions can be leveraged to adapt and accelerate the training pipeline by training on only a fraction of the total dataset. We evaluate our approach on a challenging dataset for driving. The data is collected from a full-scale autonomous vehicle. Our method provides qualitative explanation for the latent variables learned in the model. Finally, we show how our model can be additionally trained as an end-to-end controller, directly outputting a steering control command for an autonomous vehicle.

[Fig. 1 graphic: an encoder maps input images to (i) uncertainty estimation & novelty detection via propagated uncertainty, (ii) unsupervised latent variables such as weather (snow), adjacent vehicles, and road surface, alongside the steering control command, and (iii) dataset debiasing via a resampled data distribution for sample-efficient, accelerated training.]

Fig. 1: Semi-supervised end-to-end control. An encoder neural network is trained to learn a supervised control command as well as various other unsupervised outputs that qualitatively describe the image. This enables two key contributions: novelty detection and dataset debiasing.

I. INTRODUCTION

Robots operating in human-centered environments have to perform reliably in unanticipated situations. While deep neural networks (DNNs) offer great promise in enabling robots to learn from humans and their environments (as opposed to hand-coding rules), substantial challenges remain [1]. For example, previous work in autonomous driving has demonstrated the ability to train a DNN end-to-end that is capable of generating vehicle steering commands directly from car-mounted video camera data with high accuracy, so long as sufficient training data is provided [2]. But true autonomous systems should also gracefully handle scenarios with insufficient training data. Existing DNNs will likely produce incorrect decisions without a reliable measure of confidence when placed in environments for which they were insufficiently trained.

A society where robots are safely and reliably integrated into daily life demands agents that are aware of scenarios for which they are insufficiently trained. Furthermore, subsystems of these agents must effectively convey the confidence associated with their decisions. Finally, robust performance of these systems necessitates an unbiased, balanced training dataset. To date, many such systems have been trained with datasets that are either biased or contain class imbalances, due to the lack of labeled data. This negatively impacts both the speed and accuracy of training.

In this paper, we address these limitations by developing an end-to-end control method capable of novelty detection and automated debiasing through self-supervised learning of latent variable representations. In addition to learning a final control output directly from raw perception data, we also learn a number of underlying latent variables that qualitatively capture the underlying features of the data; cf. Fig. 1. These latent variables, as well as their associated uncertainties, are learned through self-supervision of a network trained to reconstruct its own input. By estimating the distribution of latent factors, we can estimate when the network is likely to fail (thus increasing the robustness of the controller) and adapt the training pipeline to cater to the distribution of these underlying factors, thereby improving training accuracy. Our approach makes two key contributions:

1) Detection of novel events for which the network has been insufficiently trained and cannot be trusted to produce reliable outputs; and
2) Automated debiasing of a neural network training pipeline, leading to faster training convergence and increased accuracy.

Toyota Research Institute (TRI) provided funds to assist the authors with their research, but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Volta V100 GPU used for this research.
¹Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology {amini,wilkos,araki,rus}@mit.edu
²Toyota Research Institute (TRI) {guy.rosman}@tri.global
³Laboratory for Information and Decision Systems, Massachusetts Institute of Technology {sertac}@mit.edu

Our solution uses a Variational Autoencoder (VAE) network architecture comprised of two parts, an encoder and a decoder. The encoder is responsible for learning a mapping from raw sensor data directly to a low-dimensional latent space encoding that maximally explains as much of the data as possible. The decoder is responsible for learning the inverse mapping that takes as input a single sample from the aforementioned latent space and reconstructs the original input. As opposed to a standard VAE model, which self-supervises the entire latent space, we also explicitly supervise a single dimension of the latent space to represent the robotic control output.
[Fig. 2 graphic: Data Collection (data collected from human drivers; domain knowledge: 1. recover from off-center and off-orientation, 2. downsample straight roads) → Encoder (Conv/FC layers) → Latent Space (mean, variance, random sample) → Decoder (FC/Deconv layers).]

Fig. 2: Novel VAE architecture for end-to-end control. Image features are extracted through convolutional layers to encode the input image into the variational latent space, with one of the latent variables explicitly supervised to learn steering control. The resulting latent variables are self-supervised by feeding the entire encoding into a decoder network that learns to reconstruct the original input image. Uncertainty is modeled through the variance of each latent variable (σ_k²).
We use end-to-end autonomous driving as the robotic control use case. Here a steering control command is predicted from only a single input image. As a safety-critical task, autonomous driving is particularly well-suited for our approach. Control systems for autonomous vehicles, when deployed in the real world, face enormous amounts of uncertainty and possibly even environments that they have never encountered before. Additionally, autonomous driving is a safety-critical application of robotics; such control systems must possess reliable ways of assessing their own confidence.

We evaluate our VAE-based approach on a challenging real driving dataset of time-synchronized image and control data samples collected with a full-scale autonomous vehicle, and demonstrate both novelty detection and automated debiasing of the dataset. Our algorithm includes a VAE-based indicator to detect novel images that were not contained in the training distribution. We demonstrate our algorithm's ability to detect over 97% of novel image frames that were collected during a different time of day, and to detect 100% of camera sensor malfunctions in our dataset.

We address training set data imbalance by introducing unsupervised latent variables into the VAE model. These latent variables qualitatively represent many of the high-level semantic features of the scene. Moreover, we show that we can estimate the distributions of the latent variables over the training data, enabling automated debiasing of newly collected datasets against the learned latent variables. This results in accelerated and sample-efficient training.

The remainder of the paper is structured as follows: we summarize the related work in Sec. II, formulate the model in Sec. III, describe our experimental setup and dataset in Sec. IV, and provide an overview of our results in Sec. V.

II. RELATED WORKS

Traditional methods for autonomous driving first decompose the problem into smaller components, with an individual algorithm applied to each component. These submodules can range from mapping and localization [3], [4], to perception [5]–[7], planning [8], [9], and control [10], [11]. On the other hand, end-to-end systems attempt to collapse the entire problem (from raw sensory data to control outputs) into a single learned model. The ALVINN system [12] originally demonstrated this idea by training a multilayer perceptron to learn the direction a vehicle should travel from pixel image inputs. Advancements in deep learning have caused a revival of end-to-end methods for self-driving cars [2], [13]–[15]. These systems have shown enormous promise by outputting a single steering command and controlling a full-scale autonomous vehicle through residential and highway streets [2]. The system in [13] learns a discrete probability distribution over possible control commands, while [15] applies the output of their convolutional feature extractor to a long short-term memory (LSTM) recurrent neural network for generating a driving policy. However, all of these models are still largely trained as black-boxes and lack both a measure of associated confidence in their output and a method for interpreting the learned features.

Understanding and estimating confidence of the output of machine learning models has been investigated in different ways: one can formulate training of DNNs as a maximum likelihood problem, using a softmax activation layer on the output to estimate probabilities of discrete classes [16] as well as discrete probability distributions [17]. Introspective capacity has been used to evaluate model performance for a variety of commonly used classification and detection tasks [18] by estimating an algorithm's uncertainty as the distance from train to test distribution in feature space. Bayesian deep learning has even been applied to end-to-end autonomous vehicle control pipelines [19] and offers an additional way to estimate confidence through Monte Carlo dropout sampling of weights in recurrent [20] and convolutional [21] neural networks. However, Monte Carlo dropout sampling is not always feasible in real-time on many hardware platforms due to its repetitive inference calls. Additionally, there is no explanation or interpretation of the resulting control actions available, as well as no confidence metric of whether the model was trained on similar data as the current input data.
Variational Autoencoders (VAEs) [22], [23] are gaining importance as an unsupervised method to learn hidden representations. Such latent representations have been shown to qualitatively provide semantic structure underlying raw image data [24]. Previous works have leveraged VAEs for novelty detection [25], but did not directly use the predicted distributions of the sensor for a model fit criterion. Conversely, our work presents an indicator to detect novel images that were not contained in the training distribution by weighting the reconstructed image by the latent uncertainty propagated through the network. High loss indicates that the model has not been trained on that type of image and thus reflects lower confidence in the network's ability to generalize to that scenario. Operating over input distributions diverging from the training distribution is potentially dangerous, since the model cannot sufficiently reason about the input data.

Additionally, learning in the presence of class imbalance is one of the key challenges underlying all machine learning systems. When the classes are explicitly defined and labeled, it is possible to resample the dataset [26], [27] or even reweight the loss function [28], [29] to avoid training bias. However, these techniques are not immediately applicable if the underlying class bias is not explicitly labeled (as is the case in many real-world training problems). While [30] demonstrates how K-means can be used to find clusters within the data before training to provide a loss reweighting scheme, this method does not adapt to the model during training. Additionally, running K-means on high-dimensional image data can be extremely computationally expensive and destroys all spatial information between nearby pixels. This paper provides an algorithm for adaptive dataset debiasing during training by learning the latent space distribution and sampling from the most representative regions of this space.

III. MODEL

In this section, we start by formulating the end-to-end control problem and then describe our model architecture for estimating steering control of an autonomous vehicle (in an end-to-end manner). We explain how our model allows us to self-supervise on a number of other latent variables which qualitatively describe the driving dataset. All models in this paper were trained on an NVIDIA Volta V100 GPU.

A. End-to-End Model Formulation

We start from the end-to-end model framework. In this framework we observe n training images, X = {x_1, ..., x_n}, which are collections of raw pixels from a front-facing video camera. We aim to build a model that can directly map our input images, X, to output steering commands based on the curvature of the road, Y = {y_1, ..., y_n}. To train a single-image model, we optimized the mean squared error (MSE), or L2, loss function between the human-actuated control, y_i, and the predicted control, ŷ_i:

L_y(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (1)

Note that only a single image is used as input at every time instant. This follows from original observations where models that were trained end-to-end with temporal information (CNN+LSTM) are unable to decouple the underlying spatial information from the temporal control aspect. While these models perform well on test datasets, they face control feedback issues when placed on a physical vehicle and consistently drift off the road.

B. VAE Architecture

We extend the end-to-end control model from Sec. III-A into a variational autoencoding (VAE) model. Unlike the classical VAE model [22], [23], one particular latent variable is explicitly supervised to predict steering control and is combined with the remaining latent variables to reconstruct the input image. Our model is shown in Fig. 2. The model accepts as input a 66 × 200 × 3 RGB image in mini-batches of size n = 50. We use a convolutional network encoder, comprised of convolutional (Conv) and fully connected (FC) layers, to compute Q(z|X), the distribution of the latent variables given a data point. The encoder has a similar architecture to a standard end-to-end regression model and is detailed in Table I, with 5 convolutional layers and two fully connected layers with dropout. The latent space section of Fig. 2 shows the encoder outputting 2k activations corresponding to µ ∈ R^k and Σ = Diag[σ] ≻ 0, used to define the distribution of z. Next, there is a decoder network that mirrors the encoder (two FC layers with dropout and then 5 de-convolutional layers) to reconstruct the image back from the latent space sample z.

TABLE I: Architecture of the encoder neural network. "Conv" and "FC" refer to convolutional and fully connected layers, while k denotes the number of latent variables.

Layer              Output      Activation     Kernel   Stride
1. Input           66x200x3    Identity       -        -
2. Conv1           31x98x24    ReLU           5x5      2x2
3. Conv2           14x47x36    ReLU           5x5      2x2
4. Conv3           5x22x48     ReLU           5x5      2x2
5. Conv4           3x20x64     ReLU           3x3      1x1
6. Conv5           1x18x64     ReLU+Flatten   3x3      1x1
7. FC1             1000        ReLU           -        -
8. FC2             100         ReLU           -        -
9. FC3             2k          Identity       -        -
10. Latent Sample  k           Identity       -        -

In order to differentiate the outputs through the sampling phase, VAEs use a reparameterization "trick": sampling ε ∼ N(0, I) and computing z = µ(X) + Σ^{1/2}(X) × ε. This allows us to train the encoder and decoder, end-to-end, using backpropagation.

In our VAE model we additionally supervise one of the latent variables to take on the value of the curvature of the vehicle's path. We represent this modified variable as z_0 = z_ŷ = ŷ. It is visible in Fig. 2 as the steering wheel graphic.
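For concreteness, the following is a minimal PyTorch sketch of this architecture: the encoder of Table I, the reparameterized latent layer, and a mirrored decoder. The paper does not provide code; the layer shapes follow Table I, while the dropout rate, the log-variance parameterization, the decoder output paddings, and the output activation are our assumptions.

```python
import torch
import torch.nn as nn

class DrivingVAE(nn.Module):
    """Sketch of the paper's VAE: encoder layers per Table I, a k-dim latent
    space whose first variable z0 is supervised as steering curvature, and a
    mirrored deconvolutional decoder. Dropout rate, log-variance output, and
    decoder paddings are assumptions, not published details."""

    def __init__(self, k=25, p_drop=0.5):
        super().__init__()
        self.k = k
        self.encoder = nn.Sequential(                   # shapes per Table I
            nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),   # 3x66x200 -> 24x31x98
            nn.Conv2d(24, 36, 5, stride=2), nn.ReLU(),  # -> 36x14x47
            nn.Conv2d(36, 48, 5, stride=2), nn.ReLU(),  # -> 48x5x22
            nn.Conv2d(48, 64, 3, stride=1), nn.ReLU(),  # -> 64x3x20
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),  # -> 64x1x18
            nn.Flatten(),
            nn.Linear(64 * 1 * 18, 1000), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(1000, 100), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(100, 2 * k),                      # FC3: mu and log sigma^2
        )
        self.decoder_fc = nn.Sequential(                # mirror of the FC stack
            nn.Linear(k, 100), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(100, 64 * 1 * 18), nn.ReLU(),
        )
        self.decoder_deconv = nn.Sequential(            # inverts the conv stack
            nn.ConvTranspose2d(64, 64, 3), nn.ReLU(),   # -> 64x3x20
            nn.ConvTranspose2d(64, 48, 3), nn.ReLU(),   # -> 48x5x22
            nn.ConvTranspose2d(48, 36, 5, stride=2, output_padding=(1, 0)),
            nn.ReLU(),                                  # -> 36x14x47
            nn.ConvTranspose2d(36, 24, 5, stride=2, output_padding=(0, 1)),
            nn.ReLU(),                                  # -> 24x31x98
            nn.ConvTranspose2d(24, 3, 5, stride=2, output_padding=(1, 1)),
            nn.Sigmoid(),                               # -> 3x66x200 in [0, 1]
        )

    def encode(self, x):
        stats = self.encoder(x)
        return stats[:, :self.k], stats[:, self.k:]     # mu, logvar

    def reparameterize(self, mu, logvar):
        eps = torch.randn_like(mu)                      # eps ~ N(0, I)
        return mu + torch.exp(0.5 * logvar) * eps       # z = mu + sigma * eps

    def decode(self, z):
        return self.decoder_deconv(self.decoder_fc(z).view(-1, 64, 1, 18))

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), z[:, 0], mu, logvar      # z0: supervised steering
```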
The loss function of our VAE has three parts: a reconstruction loss, a latent loss, and a supervised latent loss. The reconstruction loss is the L1 norm between the input image and the output image, and serves the purpose of training the decoder. We define this loss function as follows:

L_x(x, \hat{x}) = \frac{1}{n} \sum_{i=1}^{n} |x_i - \hat{x}_i|    (2)

The latent loss is the Kullback-Leibler divergence between the latent variables and a unit Gaussian, providing regularization for the latent space [22], [23]. For Gaussian distributions, the KL divergence has the closed form

L_{KL}(\mu, \sigma) = -\frac{1}{2} \sum_{j=0}^{k-1} \left( 1 - \mu_j^2 - \sigma_j^2 + \log (\sigma_j + \rho)^2 \right)    (3)

where ρ = 10⁻⁸ is added for numerical stability.

Lastly, the supervised latent loss is a new addition to the loss function, and it is defined as the MSE between the predicted and actual curvature of the vehicle's path. It is the same as Eq. 1 from Sec. III-A.

These three losses are summed to compute the total loss:

L_{TOTAL}(\cdot) = c_1 L_y(y, \hat{y}) + c_2 L_x(x, \hat{x}) + c_3 L_{KL}(\mu, \sigma)    (4)

where c_1, c_2, and c_3 are used to weight the importance of each loss function. We found that c_1 = 0.033, c_2 = 0.1, and c_3 = 0.001 yielded a good trade-off in importance between steering control, reconstruction, and KL loss, wherein no individual loss component overpowers the others during training.
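A hedged sketch of this three-part objective, written against the hypothetical DrivingVAE interface above; note that the paper parameterizes σ directly, whereas this sketch derives it from a log-variance output:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, y, y_hat, mu, logvar,
             c1=0.033, c2=0.1, c3=0.001, rho=1e-8):
    L_y = F.mse_loss(y_hat, y)            # Eq. (1): supervised steering loss
    L_x = F.l1_loss(x_hat, x)             # Eq. (2): L1 reconstruction loss
    sigma = torch.exp(0.5 * logvar)
    L_kl = -0.5 * torch.mean(torch.sum(   # Eq. (3), with rho for stability
        1 - mu**2 - sigma**2 + torch.log((sigma + rho)**2), dim=1))
    return c1 * L_y + c2 * L_x + c3 * L_kl   # Eq. (4)
```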
C. Novelty Detection

A crucial aspect underlying many deep learning systems is their ability to reliably compute uncertainty measurements and thus determine when they have low confidence in the given environment. Many standard end-to-end networks are seen as black-boxes, producing only a single control output number at the final layer. In this section, we explore a novel method for using the architecture to detect driving environments with insufficient training that cannot be trusted to produce a reliable decision.

We note that the VAE architecture provides uncertainty estimates for every variable in the latent space via their parameters {µ_i, σ_i}_{i=0}^{k-1}. However, what we really desire are uncertainty estimates in the input image space, since these observations are available at test time. We therefore propagate the uncertainties through the remainder of the network by sampling in the latent layer space and computing empirical uncertainty estimates in any of the future layers (including the final reconstructed image space).

This can be explained by treating the VAE encoder network as a posterior model inference of the parameters θ, and z as samples from the posterior distribution inferred from θ. Propagation of z into x̂ results in a posterior predictive sample. A common model fit measure for θ would be

\log P(x \mid \theta); \quad \theta = \{\sigma_i, \mu_i\}_{i=0}^{k-1}    (5)

Using a common pixel-wise independence approximation allows us to get a model rejection criterion based on the images, using a weighted L2 error. As commonly done in image processing, we use instead the L1 norm due to its robustness. Averaging over image pixels yields the error term:

D(x, \hat{x}) = \left\langle \frac{x^{(p)} - E[\hat{x}^{(p)} \mid \theta]}{\sqrt{Var[\hat{x}^{(p)} \mid \theta]}} \right\rangle    (6)

where E[x̂^(p)|θ] and Var[x̂^(p)|θ] are computed by sampling values of z and computing statistics for the resulting decoded images x̂, and x^(p) denotes the value of pixel p in image x. The approach for pixel-wise uncertainty estimates is similar to the unscented transform [31], and is captured in Algorithm 1. Using these statistics indicates whether an observed image is well captured by our trained model. In practice, we implemented a real-time version of this algorithm using 20 samples on each time iteration.

Algorithm 1 Compute pixel-wise expectation and variance
Require: Input image x, Encoder NN, & Decoder NN
 1: {σ_i, µ_i}_{i=1}^k ← Encoder(x)
 2: θ ← {σ_i, µ_i}_{i=1}^k
 3: for j = 1 to T do
 4:     for i = 1 to k do
 5:         Sample z_i ∼ N(µ_i, σ_i²)
 6:     end for
 7:     x̂_j = Decoder(z)
 8: end for
 9: E[x̂^(p)|θ] = (1/T) Σ_{j=1}^T x̂_j^(p)
10: Var[x̂^(p)|θ] = (1/T) Σ_{j=1}^T (x̂_j^(p) − E[x̂^(p)|θ])²
11: return E[x̂|θ], Var[x̂|θ]
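A possible implementation of Algorithm 1 together with the distance of Eq. (6), again written against the hypothetical DrivingVAE sketch from Sec. III-B:

```python
import torch

@torch.no_grad()
def pixelwise_uncertainty(model, x, T=20):
    """Algorithm 1: draw T latent samples, decode each, and compute the
    pixel-wise mean and variance of the reconstructions; D follows Eq. (6)."""
    mu, logvar = model.encode(x)
    x_hats = torch.stack([model.decode(model.reparameterize(mu, logvar))
                          for _ in range(T)])        # T x B x C x H x W
    mean, var = x_hats.mean(dim=0), x_hats.var(dim=0)
    # Eq. (6): L1 deviation weighted by the predicted pixel uncertainty,
    # averaged over all pixels of each image.
    D = ((x - mean).abs() / var.clamp_min(1e-8).sqrt()).mean(dim=(1, 2, 3))
    return mean, var, D
```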
datapoints by dropping overrepresented regions of the latent
This can be explained by treating the VAE encoder net-
space. We approximate Q(z|X) as a histogram Q̂(z|X),
work as a posterior model inference of the parameters θ,
where z is the output of the Encoder NN corresponding to the
z as samples from the posterior distribution inferred from
input images x ∈ X. To avoid data-hungry high-dimensional
θ. Propagation of z into x̂ results in a posterior predictive
(in our case 25 dimensions) histograms, we further simplify
sample. A common model fit measure for θ would be
and utilize independent histograms Q̂i (zi |X)Qfor each la-
log P (x|θ); θ = {σi , µi }k−1 (5) tent variable zi and approximate Q̂(z|X) ∝ i Q̂i (zi |X).
i=0
Naturally, we would like to train on a higher number of
Using a common pixel-wise independence approximation unlikely training examples and drop many samples over-
allows us to get a model rejection criteria based on the represented in the dataset. We therefore train on a subsam-
images, using a weighted L2 error. As commonly done in Xsub including datapoints x with probability
pled training set Q
image processing, we use instead the L1 norm due to its W (z(x)|X) ∝ i 1/(Q̂i (zi (x)|X) + α). For small α the
For small α the subsampled training distribution is close to uniform over z, whereas large α keeps the subsampled training distribution closer to the original training distribution. At every epoch, all images x from the original dataset X are propagated through the (learned) model to evaluate the corresponding latent variables z(x), and the histograms Q̂_i(z_i(x)|X) are updated.
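A minimal NumPy sketch of this per-dimension histogram debiasing is given below; the bin count, the value of α, and the use of encoder latent values as z(x) are our assumptions, as the paper only specifies k = 25 latent dimensions:

```python
import numpy as np

def subsampling_weights(Z, alpha=0.01, bins=50):
    """Per-dimension histogram debiasing: Z is an (N, k) array of latent
    values z(x) over the training set; returns sampling probabilities
    W(z(x)|X) proportional to prod_i 1 / (Q_i(z_i(x)|X) + alpha)."""
    N, k = Z.shape
    weights = np.ones(N)
    for i in range(k):
        counts, edges = np.histogram(Z[:, i], bins=bins)
        probs = counts / N                        # histogram Q_i(z_i|X)
        idx = np.digitize(Z[:, i], edges[1:-1])   # bin index of each sample
        weights *= 1.0 / (probs[idx] + alpha)
    return weights / weights.sum()

# Each epoch: recompute Z with the current encoder, then draw the subsampled
# training set, e.g. np.random.choice(N, size=m, p=subsampling_weights(Z)).
```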
Fig. 4: Bad apple analysis. Test images with the best (right) and worst (left) steering error.

Fig. 5: Precision and uncertainty estimation. Plots of true vs. estimated steering (left) and true steering vs. estimated uncertainty (right). Both show that the model variance tends to increase on larger turns (i.e., greater steering magnitude).

The architecture of this model is almost identical to the first nine layers of our encoder neural network, with the exception of the final layer being only one neuron (as opposed to 2k neurons). We found that the training loss of both the regression network and our VAE network were almost exactly the same. The steering validation loss for the VAE was roughly 4.4, versus a value of 3.8 for the regression model. The loss is therefore 14% higher, corresponding to a mean error in steering wheel angle of only 1.4 degrees. Thus, we use the VAE model to predict both a steering control and an uncertainty measure with roughly equal accuracy to the baseline regression model, while simultaneously gaining all of the additional advantages of the learned latent encodings.

Fig. 4 shows the images associated with the best and worst steering predictions. The mean uncertainty of the best predictions was 9.3 × 10⁻⁴ vs. 1.7 × 10⁻² for the worst predictions, indicating that our model can indeed predict when its estimated steering angle is uncertain. The images associated with the worst steering predictions are mostly from when the car is at a high angle of incidence towards the center of the road, representing a small portion of our training dataset. On the other hand, the images with the best steering predictions are mostly straight roads with strongly defined lanes, probably because our dataset has many examples of straight roads (despite debiasing) and because lane markers are a key feature for predicting steering angle.

Fig. 5 corroborates these results. The first chart shows that there is a strong correlation between the true steering and the estimated steering. Interestingly, the values of the estimated steering fan out at extreme values, showing 1) that our dataset is sparse at extreme steering values and 2) that the uncertainty of the prediction should increase at extreme steering values. The second chart shows the relationship between true steering and the estimated uncertainty, and it indeed shows that uncertainty increases as the absolute value of the steering increases. Although this reveals a weak point of the dataset – that it is sparse in images associated with large steering angles – it shows a strong point of the VAE model, which is that it is able to predict that there is high uncertainty for large values of the steering angle. Additionally, such uncertainty at greater turns makes intuitive sense, since larger road curvatures imply less future road is visible in the image. For example, on extreme turns it is common for less than 10 m of road to be visible in the image, whereas on straight roads the visible road can extend far further into the future, closer to 100 m. The fact that we can see less of the future road supports the increased uncertainty in such scenarios.
B. Latent Variables

In this subsection, we analyze the resulting latent space that our encoder learned. We start by gauging the underlying meaning of each of the latent variables. We synthesize images using our decoder, starting with a vector in the latent space, and perturb a single latent variable as we reconstruct output images. By varying the perturbation, we can understand how that specific latent variable affects the image. The results for an exemplary set of latent variables are shown in Fig. 6. We observe that the network is able to generate intuitive representations for lane markings and additional environmental structure, such as other surrounding vehicles and weather, without ever actively being told to do so. By identifying latent variable representations we can immediately observe what the network sees and explain how the corresponding steering control is derived.

Intuitively, we know that the network will be penalized for having redundant latent variables, due to the fact that the reconstruction loss penalizes reconstructed images that do not resemble the input image. This means that the encoder should learn a latent representation of the input such that as much distinct information as possible is explained away. This causes the variables learned to be the most important underlying variables in the dataset, so the decoder can reconstruct the image from such a low-dimensional space.
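The perturbation probe behind Fig. 6 can be sketched as follows; the sweep range and the use of the latent mean as the base vector are our assumptions:

```python
import torch

@torch.no_grad()
def perturb_latent(model, x, dim, deltas=(-2.0, -1.0, 0.0, 1.0, 2.0)):
    """Encode an image, sweep a single latent dimension across `deltas`,
    and decode each variant to visualize what that variable controls."""
    mu, _ = model.encode(x)          # latent mean as the base vector
    images = []
    for d in deltas:
        z = mu.clone()
        z[:, dim] += d               # linearly perturb one latent dimension
        images.append(model.decode(z))
    return torch.cat(images)         # one reconstruction per perturbation
```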
C. Detecting out of sample environments

Next, we examined ways to interpret whether our network is confident in its end-to-end control prediction. Figure 7 shows sample pixel-wise uncertainties (red) obtained from this method, overlaid on top of the respective input images. As one might expect, details around adjacent vehicles, the distant horizon, and the presence of snow next to the road highlight the greatest uncertainty. This makes sense, since the model is constrained to only 25 latent variables and does not have the capacity to hold any of these less meaningful details.

We can now plot the distribution of this distance over datasets on which the network received sufficient and insufficient training data. To be very explicit, we train a network on daytime driving and detect novel nighttime driving. Since the network has not been trained with night data, we should not trust the output and therefore want to understand, as best as possible, when nighttime data is being fed into the network. We set a simple threshold γ_95 at the 95th percentile of D(x, x̂) for every x in the entire training set. For any subsequent data sample x_test, we classify it as novel if

D(x_{test}, \hat{x}_{test}) > \gamma_{95}    (7)
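A minimal sketch of this thresholding rule, assuming the distances D(x, x̂) over the training set have already been computed (e.g., with the pixelwise_uncertainty sketch above):

```python
import numpy as np

def fit_novelty_threshold(D_train, q=95):
    """gamma_95: the 95th percentile of D(x, x_hat) over the training set."""
    return np.percentile(D_train, q)

# Eq. (7): flag a subsequent sample as novel when its distance exceeds it.
# novel = D_test > fit_novelty_threshold(D_train)
```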
Fig. 6: Latent variable perturbation. A selection of six learned latent variables with associated interpretable descriptors
(left labels). Images along the x-axis (from left to right) were generated by linearly perturbing the latent vector encoding
along that single latent dimension. While steering command (top) was a supervised latent variable, all others (bottom five)
were entirely unsupervised and learned by the model from the dataset.
Fig. 7: Propagating uncertainty into pixel-space. Sample images from the dataset along with pixel-wise uncertainty estimates. Uncertain pixels are highlighted red.

Fig. 8: Novelty detection. Feeding training distribution data (day time) through the network as well as a novel data distribution (night time). A third peak forms when the camera sensor malfunctions from poor illumination.

Figure 8 illustrates the training set distribution (daytime driving) in blue, as well as the extracted threshold γ_95. When image frames collected at night are fed through the network, we observe the orange distribution and are able to correctly classify 97% of these frames as novel. Furthermore, we experiment with frames collected during image sensor malfunctions. These malfunctions are a realistic and common failure mode for the AR0231 sensor, caused by incorrect white balance estimation. Feeding such images through the network can be catastrophic and yield unpredictable control responses. Using our metric D(x, x̂), we see that the distribution of these images (green) falls far from the training set, and we can successfully detect 100% of these frames.

[Figure: training loss vs. iterations (×10³) for the original and debiased training sets.]
3