SAR Target Recognition Based On Deep Learning
Abstract—Deep learning algorithms such as convolutional neural networks (CNN) have been successfully applied in computer vision. This paper attempts to adapt the optical camera-oriented CNN to its microwave counterpart, i.e. synthetic aperture radar (SAR). As a preliminary study, a single layer of convolutional neural network is used to automatically learn features from SAR images. Instead of using the classical backpropagation algorithm, the convolution kernel is trained on randomly sampled image patches using an unsupervised sparse auto-encoder. After convolution and pooling, an input SAR image is transformed into a series of feature maps, which are then used to train a final softmax classifier. Initial experiments on the MSTAR public data set show that an accuracy of 90.1% can be achieved on a three-class target classification task, and an accuracy of 84.7% on a ten-class target classification task.

Keywords—Convolutional Neural Network; sparse auto-encoder; Synthetic Aperture Radar; Automatic Target Recognition
I. INTRODUCTION
Synthetic aperture radar (SAR) is a kind of airborne or spaceborne two-dimensional high-resolution imaging radar. Unlike optical remote sensing, which cannot work in poor weather conditions or at night, SAR can operate regardless of weather conditions, day and night, so it is of high value in military and civil applications. However, the interpretation of SAR images requires specialists, since unlike natural images, SAR images reflect the backscattering intensity of electromagnetic waves. Moreover, searching for targets of interest in massive SAR images by hand is time-consuming and extremely difficult, which justifies the need for efficient SAR automatic target recognition (ATR) algorithms.

MIT Lincoln Laboratory has pioneered research on this topic, as evidenced by its two SAR ATR systems: the template-based SAIP (semi-automated image intelligence processing) program [1, 2] and the model-based MSTAR (moving and stationary target acquisition and recognition) program [3]. The main component of the SAIP system is a standard pattern recognition architecture. The researchers collected several hundreds of thousands of airborne SAR images, designed five different feature extraction algorithms, and finally used these feature vectors as inputs to train a classifier. The performance is nearly perfect when testing conditions are similar to the training conditions; however, when testing conditions change, the recognition rate decreases drastically [2]. The reason is that SAR images are very sensitive to all kinds of variations, such as articulation, obscuration, and camouflage on military targets, as well as changes of background environment. Uncertainties in the real world are so large that it is unrealistic to collect sample images covering all possible variations. To address these problems, a model-based module was later added to the back-end of the MSTAR system. The front-end of the MSTAR system is still a typical pattern recognition system, while targets that cannot be recognized by the front-end are sent to the back-end, which is an iterative searching classifier. In each iteration, features extracted from test images are compared with features predicted from target CAD models using electromagnetic signature prediction code. The need for another iteration is determined according to the degree of matching. In both of these SAR ATR systems, designing and selecting useful features plays an important role in recognition accuracy.

Just like machine learning applied to speech recognition and computer vision, the most important ingredient is feature engineering: designing better features to represent targets and then using these features as input to train a classifier. Previously, researchers in the computer vision community developed many powerful features for object recognition, such as SIFT (scale-invariant feature transform) and HOG (histogram of oriented gradients). Over the last few years, considerable efforts have been made in the speech recognition and computer vision domains to design multi-stage architectures that automatically learn representations or features from data instead of relying on hand-crafted features. On many speech recognition and natural-image object recognition benchmark datasets, these feature learning algorithms have achieved superior performance.

Nowadays, deep architectures with convolution and pooling are found to be highly effective and are commonly used in
computer vision and object recognition [4-14]. The most impressive result was achieved in the 2012 ImageNet contest, whose training set contained 1.2 million images in 1000 different object classes. On a test set of 150,000 images, the deep Convolutional Neural Network (CNN) approach described in [4] achieved error rates considerably lower than the previous state-of-the-art results. Furthermore, CNN has achieved superior classification accuracy on different tasks, such as handwritten digit, Latin, and Chinese character recognition [5, 6], traffic sign recognition [7], and face detection and recognition [8]. Deep networks are shown to be powerful for computer vision and image recognition tasks because they extract appropriate features while jointly performing discrimination [9].

A few researchers in the SAR recognition area have tried to use multi-layer networks to learn features for classification [15, 16]. However, in their work, complex image pre-processing and hand-crafted feature extraction are needed on the first layer; the fully unsupervised feature learning then begins from these extracted features instead of the raw input.
This paper attempts to adapt the optical camera-oriented CNN to its microwave counterpart, i.e. synthetic aperture radar (SAR). As a preliminary study, a single layer of convolutional neural network is used to automatically learn features from SAR images. It will be shown that although SAR images are extremely noisy and identifying targets is not an easy task even for humans, features learned from raw input pixels can still be useful for classification.

II. ALGORITHMS

The most successful and commonly used feature learning models typically consist of a multi-stage trainable feature extractor that is trained greedily, layer-wise, one layer at a time: the features learned by the previous layer serve as the input to the next layer, and each layer is trained with an unsupervised feature learning algorithm. For each of these layers, a number of parameters are chosen by cross-validation or empirically: the size of the convolution receptive field, the number of feature maps in each layer, and the dimension of the pooling region. Results in [12] demonstrate that with a large number of hidden nodes and dense feature extraction, state-of-the-art results can be achieved on both the CIFAR-10 and NORB datasets using a simple unsupervised learning algorithm and only a single layer of features. Because our goal is first to determine whether features learned from raw input SAR images are useful for target recognition, we choose a single-layer convolutional neural network (CNN) for simplicity.

Our approach consists of the following steps to learn a feature representation: (1) Extract random patches from unlabeled images; the dimension of these patches is the same as that of the convolution kernel. (2) Subtract the mean value and then apply ZCA whitening pre-processing to the sampled patches. (3) Instead of training purely supervised from labeled image data using classical backpropagation and stochastic gradient descent, train the convolutional layer using a sparse auto-encoder on the randomly sampled image patches.

Given the learned convolution kernel and a set of labeled training images, we can then perform feature extraction and classification: (1) Convolve the input images with the learned convolution kernel. (2) Pool features over local neighborhoods using the max operation, reducing the dimension of the feature values and achieving invariance to small distortions. (3) Train a softmax regression classifier to predict the labels given the pooled convolved feature vectors; the weights of this final layer are trained supervised using backpropagation. Fig. 1 shows the flowchart of the algorithm.

Fig. 1 Flowchart of the algorithm (unsupervised feature learning: random patch sampling → mean subtraction and ZCA whitening → sparse auto-encoder → convolution kernel and bias; supervised training: SAR image dataset → CNN → softmax classifier)
A. Generate image patches for unsupervised learning

Our approach begins by resizing all of the SAR images to 128 x 128 pixels and then extracting a large number of image patches from the unlabeled training images. The size of each image patch is $w \times w$, where $w$ refers to the size of the convolution receptive field. Next, each image patch is converted into an $N$-dimensional vector, where $N = w \cdot w$. Suppose the total number of sampled image patches is $m$; they are concatenated into an $N \times m$ matrix $X = [x^{(1)}, x^{(2)}, \ldots, x^{(m)}]$, where $x^{(i)} \in \mathbb{R}^N$.
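As a concrete illustration, the patch sampling step might look like the following minimal numpy sketch; the function name, array shapes, and default values are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def sample_patches(images, w=9, m=10000, seed=0):
    """Randomly sample m patches of size w x w from a stack of images
    (shape: num_images x 128 x 128) and return them as an N x m matrix,
    where N = w * w, with one vectorized patch per column."""
    rng = np.random.default_rng(seed)
    num_images, height, width = images.shape
    X = np.empty((w * w, m))
    for i in range(m):
        img = images[rng.integers(num_images)]
        r = rng.integers(height - w + 1)   # top-left corner of the patch
        c = rng.integers(width - w + 1)
        X[:, i] = img[r:r + w, c:c + w].ravel()
    return X

# Example: 100 random 128x128 arrays stand in for the real SAR data set.
images = np.random.rand(100, 128, 128)
X = sample_patches(images, w=9, m=10000)   # X has shape (81, 10000)
```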
B. Pre-processing

In this step, each patch is normalized by subtracting its mean value, and Zero Component Analysis (ZCA) whitening [17] is then applied. This pre-processing approach is commonly used in deep learning. The main idea is that since adjacent pixel values are highly correlated, the raw input is highly redundant. The goal of ZCA whitening is to make the inputs less correlated with each other and to make all input elements have the same variance. We can compute the ZCA-whitened data as $X_{ZCA} = T X$, where $T = U P^{-1/2} U^T$, and $U$ and $P$ are the eigenvectors and eigenvalues of the covariance matrix of $X$: $\Sigma = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} (x^{(i)})^T$.
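A minimal numpy sketch of this pre-processing step, assuming X is the N x m patch matrix from the previous subsection; the small eps regularizer is a common practical addition, not something stated in the paper.

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """Mean-subtract each patch (column) and apply ZCA whitening.
    X has shape (N, m): one vectorized patch per column."""
    X = X - X.mean(axis=0)                    # per-patch mean subtraction
    sigma = X @ X.T / X.shape[1]              # covariance matrix (N x N)
    U, P, _ = np.linalg.svd(sigma)            # eigenvectors U, eigenvalues P
    T = U @ np.diag(1.0 / np.sqrt(P + eps)) @ U.T   # whitening transform
    return T @ X

X_zca = zca_whiten(np.random.rand(81, 10000))
```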
C. Sparse auto-encoder

The sparse auto-encoder is a kind of one-hidden-layer neural network that sets its output target value to be the same as the input data. In other words, it tries to learn an identity function $h_{W,b}(x) \approx x$. By minimizing the reconstruction error with an extra sparsity penalty term that encourages the hidden units to keep a low average activation, the convolution kernel $W$ and biases $b$ are learned using backpropagation from unlabeled data. By imposing a sparsity constraint on the hidden layer, the structure of the input data is discovered by the auto-encoder.

For each training sample $x^{(i)}$, the activation value on the hidden layer is given by:

$$a^{(i)} = f(W x^{(i)} + b) \qquad (1)$$

The output layer then reconstructs the input from the hidden activations,

$$h_{W,b}(x^{(i)}) = f(W_2 a^{(i)} + b_2) \qquad (2)$$

and the basic cost function is:

$$J(W,b) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left\| h_{W,b}(x^{(i)}) - x^{(i)} \right\|^2 + \frac{\lambda}{2} \|W\|^2 \qquad (3)$$

The first term is an average sum-of-squared-errors term, which tries to minimize the reconstruction error. The second term is a regularization term, which is commonly used to prevent overfitting.

The sparsity constraint means that for each input $x^{(i)}$, most of the corresponding hidden unit activations $a_j^{(i)}$ are close to 0, and only a few of them are activated. Given a training set with $m$ samples, the average activation value of hidden unit $j$ is:

$$\hat{\rho}_j = \frac{1}{m} \sum_{i=1}^{m} a_j^{(i)} \qquad (4)$$

The sparsity penalty term is based on the Kullback-Leibler (KL) divergence between a Bernoulli random variable with mean $\rho$ and a Bernoulli random variable with mean $\hat{\rho}_j$. KL-divergence is used to measure the extent of discrepancy between two different distributions. This penalty term can be written as:

$$\sum_{j=1}^{K} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \sum_{j=1}^{K} \left[ \rho \log \frac{\rho}{\hat{\rho}_j} + (1-\rho) \log \frac{1-\rho}{1-\hat{\rho}_j} \right] \qquad (5)$$

where $\rho$ is the sparsity parameter, typically a small value close to zero (say 0.02). $\mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = 0$ if $\hat{\rho}_j = \rho$, and it increases monotonically as $\hat{\rho}_j$ deviates from $\rho$. This penalty term therefore enforces the average activation of each hidden unit to be close to $\rho$.

Suppose the number of hidden units is $K$; the overall cost function is:

$$J_{sparse}(W,b) = J(W,b) + \beta \sum_{j=1}^{K} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j) \qquad (6)$$

where $\beta$ controls the weight of the sparsity penalty term.
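The cost function above maps directly to code. Below is a minimal numpy sketch of the forward pass and the overall cost $J_{sparse}$; the separate decoder weights W2, b2 and all parameter values are illustrative assumptions, and a full implementation would also need the corresponding gradients for backpropagation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_ae_cost(W1, b1, W2, b2, X, lam=1e-4, rho=0.02, beta=3.0):
    """Sparse auto-encoder cost on patch matrix X (shape N x m)."""
    m = X.shape[1]
    A = sigmoid(W1 @ X + b1)                   # hidden activations, eq. (1)
    X_hat = sigmoid(W2 @ A + b2)               # reconstruction, eq. (2)
    recon = 0.5 * np.sum((X_hat - X) ** 2) / m           # first term of (3)
    decay = 0.5 * lam * (np.sum(W1**2) + np.sum(W2**2))  # second term of (3)
    rho_hat = A.mean(axis=1)                   # average activations, eq. (4)
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))  # eq. (5)
    return recon + decay + beta * kl           # overall cost, eq. (6)

X = np.random.rand(81, 1000)                   # whitened patches would go here
K, N = 64, X.shape[0]
W1, b1 = 0.1 * np.random.randn(K, N), np.zeros((K, 1))
W2, b2 = 0.1 * np.random.randn(N, K), np.zeros((N, 1))
print(sparse_ae_cost(W1, b1, W2, b2, X))
```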
Fig. 2 Architecture of the 1-layer CNN

Non-linearity layer: Simply apply the logistic sigmoid function $f(x) = 1/(1+e^{-x})$ point-wise. Then a method called Local Contrast Normalization is employed to perform local subtractive and divisive normalizations, which enforces local competition between adjacent features in a feature map, and between features at the same spatial location in different feature maps [10]. The subtractive normalization operation removes the weighted average of neighboring neurons from the current neuron:

$$v_{ijk} = x_{ijk} - \sum_{ipq} w_{pq} \, x_{i,j+p,k+q}$$

where $w_{pq}$ is a Gaussian weighting window (typically of size $9 \times 9$) normalized so that $\sum_{ipq} w_{pq} = 1$. The divisive normalization computes

$$y_{ijk} = v_{ijk} / \max(c, \sigma_{jk}), \quad \text{where} \quad \sigma_{jk} = \Big( \sum_{ipq} w_{pq} \, v_{i,j+p,k+q}^2 \Big)^{1/2}$$

and $c$ is a constant (in [10], the mean of $\sigma_{jk}$ over the feature map).
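A minimal sketch of these two normalization steps for a single feature map, using scipy's Gaussian filtering; the Gaussian width and the omission of the cross-map summation of [10] are simplifying assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast_normalize(x, sigma=2.0, eps=1e-8):
    """Subtractive then divisive normalization of one feature map x."""
    # Subtractive step: remove the Gaussian-weighted local mean.
    v = x - gaussian_filter(x, sigma)
    # Divisive step: divide by the local standard deviation, floored at c.
    local_std = np.sqrt(gaussian_filter(v ** 2, sigma))
    c = local_std.mean()                      # per-map constant, as in [10]
    return v / np.maximum(c, local_std + eps)

feature_map = np.random.rand(120, 120)        # e.g. 128x128 image after 9x9 convolution
normalized = local_contrast_normalize(feature_map)
```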
Feature pooling layer: The max pooling operation is chosen, which computes the maximum value over a neighborhood region in each feature map: $y_{ijk} = \max_{(p,q) \in N(j,k)} v_{ipq}$. This process not only reduces the dimension of the output feature maps but also provides invariance to small distortions.
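A minimal sketch of max pooling over non-overlapping neighborhoods; the 4 x 4 pooling region is an assumed value, since the paper treats the pooling dimension as a tunable parameter.

```python
import numpy as np

def max_pool(feature_map, region=4):
    """Max pooling over non-overlapping region x region neighborhoods."""
    h, w = feature_map.shape
    h, w = h - h % region, w - w % region     # crop to a multiple of region
    blocks = feature_map[:h, :w].reshape(h // region, region,
                                         w // region, region)
    return blocks.max(axis=(1, 3))            # maximum of each neighborhood

pooled = max_pool(np.random.rand(120, 120), region=4)   # -> shape (30, 30)
```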
The final softmax classifier is trained by minimizing a cost function of the form:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{k} 1\{y^{(i)} = j\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}} + \frac{\lambda}{2} \sum_{j=1}^{k} \|\theta_j\|^2 \qquad (8)$$

where $1\{\cdot\}$ is the indicator function, so that 1{a true statement} = 1 and 1{a false statement} = 0. For example, 1{2 + 2 = 4} = 1, whereas 1{1 + 1 = 3} = 0. The second term is a weight decay term which prevents overfitting. The minimum of $J(\theta)$ is found by gradient descent. The gradient of the cost function is:

$$\nabla_{\theta_j} J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ x^{(i)} \left( 1\{y^{(i)} = j\} - p(y^{(i)} = j \mid x^{(i)}; \theta) \right) \right] + \lambda \theta_j \qquad (9)$$

Plugging this into the gradient descent algorithm, each iteration performs the update $\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)$.
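Equation (9) and the update rule map directly to code. Below is a minimal batch gradient descent sketch of softmax training; the function name, learning rate, and weight decay values are illustrative assumptions.

```python
import numpy as np

def softmax_train(X, y, k, alpha=0.1, lam=1e-4, iters=200):
    """Train softmax regression by gradient descent.
    X: n x m feature matrix (one pooled feature vector per column),
    y: length-m integer labels in {0, ..., k-1}."""
    n, m = X.shape
    theta = np.zeros((k, n))
    Y = np.eye(k)[:, y]                        # indicator matrix, 1{y(i) = j}
    for _ in range(iters):
        scores = theta @ X                     # k x m class scores
        scores -= scores.max(axis=0)           # stabilize the exponentials
        P = np.exp(scores)
        P /= P.sum(axis=0)                     # p(y = j | x; theta)
        grad = -(Y - P) @ X.T / m + lam * theta   # eq. (9)
        theta -= alpha * grad                  # theta := theta - alpha * grad
    return theta

theta = softmax_train(np.random.rand(50, 300), np.random.randint(0, 10, 300), k=10)
```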
III. EXPERIMENTAL RESULTS

A. Dataset

The experimental data set was collected by the Sandia National Laboratory (SNL) SAR sensor platform. The collection was jointly sponsored by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory as part of the Moving and Stationary Target Acquisition and Recognition (MSTAR) program. Hundreds of thousands of SAR images containing ground military targets were collected, of which only a small subset is publicly available on the website [18]. The publicly released datasets include 10 different types of ground military targets (BMP2, BRDM2, BTR60, BTR70, D7, T62, T72, ZIL131, ZSU234, 2S1) and were collected using an X-band SAR sensor in one-foot-resolution spotlight mode with full aspect coverage (in the range of 0 to 360 degrees). The data collected at a 17-degree depression angle are used for training, and the data collected at a 15-degree depression angle are used for testing.
B. Classification Results

Test set   BMP2  BRDM2  BTR60  BTR70    D7   2S1   T62   T72  ZIL131  ZSU234   Classification accuracy (%)
BMP2        157      9      2      9     0     4     0     4       6       4   80.5
BRDM2         9    220      6     18     0     3     1     2      15       0   80.2
BTR60         0     11    168      4     4     4     1     2       1       0   86.1
BTR70         3      4      3    181     0     4     0     0       1       0   92.3
D7            0      0      0      0   252     0     8     2       5       7   91.9
2S1          14      9      5      5     0   190     7    22      21       1   69.3
T62           2      1      5      0     4     7   242     3       7       2   88.6
T72           3      3      1      1     0     8     2   168       9       1   85.7
ZIL131        5      6      5      7     1    12     3     9     226       0   82.4
ZSU234        1      1      3      0     4     1     2     7       6     249   90.8

Average classification rate: 84.7%