
SAR Target Recognition Based on Deep Learning

Sizhe Chen, Haipeng Wang


Key Laboratory for Information Sciences of Electromagnetic Waves (MoE)
Fudan University, Shanghai, China
Email: hpwang@fudan.edu.cn

Abstract—Deep learning algorithms such as convolutional neural networks (CNN) have been successfully applied in computer vision. This paper attempts to adapt the optical camera-oriented CNN to its microwave counterpart, i.e., synthetic aperture radar (SAR). As a preliminary study, a single-layer convolutional neural network is used to automatically learn features from SAR images. Instead of using the classical backpropagation algorithm, the convolution kernel is trained on randomly sampled image patches using an unsupervised sparse auto-encoder. After convolution and pooling, an input SAR image is transformed into a series of feature maps, which are then used to train a final softmax classifier. Initial experiments on the MSTAR public data set show that an accuracy of 90.1% can be achieved on a three-type target classification task, and an accuracy of 84.7% on a ten-type target classification task.

Keywords—Convolutional Neural Network; sparse auto-encoder; Synthetic Aperture Radar; Automatic Target Recognition

I. INTRODUCTION

Synthetic aperture radar (SAR) is a kind of airborne or spaceborne two-dimensional high-resolution imaging radar. Unlike optical remote sensing, which cannot work in poor weather or at night, SAR can operate regardless of weather conditions, day and night, so it is of high value in military and civil applications. But the interpretation of SAR images needs specialists, since unlike natural images, SAR images reflect the backscattering intensity of electromagnetic waves. Moreover, searching for targets of interest in massive SAR imagery by hand is time-consuming and extremely difficult, which justifies the need for efficient SAR automatic target recognition (ATR) algorithms.

MIT Lincoln Laboratory has pioneered research on this topic, as evidenced by its two SAR ATR systems: the template-based SAIP (semi-automated image intelligence processing) program [1, 2] and the model-based MSTAR (moving and stationary target acquisition and recognition) program [3]. The main component of the SAIP system is a standard pattern recognition architecture. The researchers collected several hundred thousand airborne SAR images, designed five different feature extraction algorithms, and finally used the resulting feature vectors as inputs to train a classifier. The performance is nearly perfect when testing conditions are similar to the training conditions; however, when testing conditions change, the recognition rate decreases drastically [2]. The reason is that SAR images are very sensitive to all kinds of variations, such as articulation, obscuration, and camouflage on military targets, and changes of background environment. Uncertainties in the real world are so large that it is unrealistic to collect sample images covering all possible variations. To address these problems, a model-based module was later added at the back end of the MSTAR system. The front end of the MSTAR system is still a typical pattern recognition system, while targets that cannot be recognized by the front end are sent to the back end, an iterative searching classifier. In each iteration, features extracted from test images are compared with features predicted from a target CAD model using electromagnetic signature prediction code, and the need for another iteration is determined by the degree of matching. In both of these SAR ATR systems, designing and selecting useful features plays an important role in recognition accuracy.

Just as in machine learning applied to speech recognition and computer vision, the most important work is feature engineering: designing better features to represent targets and then using these features as inputs to train a classifier. Researchers in the computer vision community previously developed many powerful features for object recognition, such as SIFT (scale-invariant feature transform) and HOG (histogram of oriented gradients). Over the last few years, considerable effort has been made in the speech recognition and computer vision domains to design multi-stage architectures that automatically learn representations or features from data instead of relying on hand-crafted features. On many speech recognition and natural-image object recognition benchmarks, these feature learning algorithms have achieved superior performance.

Nowadays, deep architectures with convolution and pooling are found to be highly effective and are commonly used in computer vision and object recognition [4-14]. The most impressive result was achieved in the 2012 ImageNet contest, with 1.2 million training images in 1000 different object classes. On the test data set of 150,000 images, the deep convolutional neural network (CNN) approach described in [4] achieved error rates considerably lower than the previous state of the art. Furthermore, CNN has achieved superior classification accuracy on different tasks, such as handwritten digit and Latin and Chinese character recognition [5, 6], traffic sign recognition [7], and face detection and recognition [8]. Deep networks are shown to be powerful for computer vision and image recognition tasks because they extract appropriate features while jointly performing discrimination [9].

A few researchers in the SAR recognition area have tried to use multi-layer networks to learn features for classification [15, 16]. However, their work requires complex image pre-processing and hand-crafted feature extraction on the first layer, so the fully unsupervised feature learning algorithm begins from extracted features instead of the raw input.

This paper attempts to adapt the optical camera-oriented CNN to its microwave counterpart, i.e., synthetic aperture radar (SAR). As a preliminary study, a single-layer convolutional neural network is used to automatically learn features from SAR images. It will be shown that although SAR images are extremely noisy and identifying targets is not an easy task even for humans, features learned from raw input pixels can still be useful for classification.

II. ALGORITHMS

Most successful and commonly used feature learning models consist of a multi-stage trainable feature extractor, trained greedily one layer at a time with an unsupervised feature learning algorithm, treating the features learned by the previous layer as the input to the next layer. For each of these layers, a number of parameters are chosen by cross-validation or empirically: the size of the convolution receptive field, the number of feature maps in each layer, and the dimension of the pooling region. Results in [12] demonstrate that with a large number of hidden nodes and dense feature extraction, a simple unsupervised learning algorithm and only a single layer of features can achieve state-of-the-art results on both the CIFAR-10 and NORB datasets. Because our goal is first to find out whether features learned from raw input SAR images are useful for target recognition at all, we choose a single-layer convolutional neural network (CNN) for simplicity.

Our approach consists of the following steps to learn a feature representation: (1) extract random patches from unlabeled images, with the same dimensions as the convolution kernel; (2) subtract the mean value and apply ZCA whitening to the sampled patches; (3) instead of training purely supervised from labeled image data using classical backpropagation and stochastic gradient descent, train the convolutional layer with a sparse auto-encoder on the randomly sampled image patches.

Given the learned convolution kernel and a set of labeled training images, we can then perform feature extraction and classification: (1) convolve the input images with the learned convolution kernel; (2) pool features over local neighborhoods using the max operation, reducing the dimension of the feature values and achieving invariance to small distortions; (3) train a softmax regression classifier to predict labels from the pooled convolved feature vectors, where the weights of this final layer are trained supervised using backpropagation. Fig. 1 shows the flowchart of the algorithm.

[Fig. 1 Flowchart of the algorithm: unsupervised feature learning (random patch sampling, then mean subtraction and ZCA whitening, then sparse auto-encoder) yields the convolution kernel and bias of the CNN; the SAR image dataset is passed through the CNN, and supervised training fits the softmax classifier.]

A. Generate image patches for unsupervised learning

Our approach begins by resizing all SAR images to $128 \times 128$ pixels and then extracting a large number of image patches from the unlabeled training images. The size of each image patch is $w \times w$, where $w$ refers to the size of the convolution receptive field. Next, each image patch is converted into an $N$-dimensional vector, where $N = w \cdot w$. Suppose the total number of sampled image patches is $m$; all of them are concatenated into an $N \times m$ matrix $X = [x^{(1)}, x^{(2)}, \dots, x^{(m)}]$, where $x^{(i)} \in \mathbb{R}^N$.
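As an illustration of this sampling step, a minimal NumPy sketch is given below; the patch width w, the number of patches m, and the function name are illustrative choices, since the paper does not fix these values.

```python
import numpy as np

def sample_patches(images, w=9, m=100000, seed=0):
    """Draw m random w-by-w patches from a stack of 128x128 SAR images
    (shape: n_img x 128 x 128) and stack them as columns of an N x m
    matrix X, with N = w*w, as described in subsection A."""
    rng = np.random.default_rng(seed)
    n_img, H, W = images.shape
    X = np.empty((w * w, m))
    for k in range(m):
        i = rng.integers(0, n_img)            # pick a random image
        r = rng.integers(0, H - w + 1)        # top-left corner of the patch
        c = rng.integers(0, W - w + 1)
        X[:, k] = images[i, r:r + w, c:c + w].ravel()
    return X
```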
B. Pre-processing

In this step, each patch is normalized by subtracting its mean value, and Zero Component Analysis (ZCA) whitening [17] is then applied. This pre-processing approach is commonly used in deep learning. The main idea is that since adjacent pixel values are highly correlated, the raw input is highly redundant. The goal of ZCA whitening is to make the inputs less correlated with each other and to give all input elements the same variance. The ZCA-whitened data can be computed as

$$x_{\mathrm{ZCAwhite}} = U P^{-1/2} U^T x$$

where $U$ and $P$ are the eigenvectors and eigenvalues of the covariance matrix of $X$: $\Sigma = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} (x^{(i)})^T$.
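A minimal sketch of this pre-processing step follows; the small regularizer eps added to the eigenvalues for numerical stability is a common convention (e.g., in the UFLDL tutorial [17]) rather than something stated in the paper.

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """Mean-subtract and ZCA-whiten an N x m patch matrix X, computing
    x_zca = U P^(-1/2) U^T x from the covariance of the patches."""
    X = X - X.mean(axis=0, keepdims=True)     # per-patch mean subtraction
    sigma = X @ X.T / X.shape[1]              # N x N covariance matrix
    P, U = np.linalg.eigh(sigma)              # eigenvalues P, eigenvectors U
    W = U @ np.diag(1.0 / np.sqrt(P + eps)) @ U.T
    return W @ X, W                           # whitened data and transform
```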
C. Sparse auto-encoder

The sparse auto-encoder is a one-hidden-layer neural network that sets its output target value to be the same as the input data. In other words, it tries to learn an identity function $h_{W,b}(x) \approx x$. By minimizing the reconstruction error with an extra sparsity penalty term that encourages the hidden units to keep a low average activation, the convolution kernel $W$ and biases $b$ are learned by backpropagation from unlabeled data. By imposing a sparsity constraint on the hidden layer, the auto-encoder is forced to discover structure in the input data.

For each training sample $x$, the activation value on the hidden layer is given by

$$a = f(W^{(1)} x + b^{(1)}) \qquad (1)$$

where $W^{(1)}$ and $b^{(1)}$ denote the connection weights and bias respectively, and $f(z) = 1/(1 + e^{-z})$ is the logistic sigmoid function. However, the sigmoid function cannot be used on the output layer, since its output lies in the range $[0, 1]$ while the input data are not rescaled to $[0, 1]$. So a linear activation function is chosen for the output layer:

$$h_{W,b}(x) = W^{(2)} a + b^{(2)} \qquad (2)$$

where $W^{(2)}$ and $b^{(2)}$ are the output-layer weights and bias. With a training set of $m$ examples, the overall cost function is defined as

$$J(W, b) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left\| h_{W,b}(x^{(i)}) - x^{(i)} \right\|^2 + \frac{\lambda}{2} \sum W^2 \qquad (3)$$

The first term is an average sum-of-squared-errors term, which tries to minimize the reconstruction error. The second term is a regularization term, commonly used to prevent overfitting.

The sparsity constraint means that for each input $x^{(i)}$, most of the corresponding hidden-unit activations $a_j$ are close to 0; only a few of them will be activated. Given a training set with $m$ samples, the average activation value of hidden unit $j$ is

$$\hat{\rho}_j = \frac{1}{m} \sum_{i=1}^{m} a_j(x^{(i)}) \qquad (4)$$

The sparsity penalty term is based on the Kullback-Leibler (KL) divergence between a Bernoulli random variable with mean $\rho$ and a Bernoulli random variable with mean $\hat{\rho}_j$. KL divergence measures the extent of discrepancy between two different distributions. This penalty term can be written as

$$\sum_{j} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \sum_{j} \left[ \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j} \right] \qquad (5)$$

where $\rho$ is the sparsity parameter, typically a small value around zero (say 0.02). $\mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = 0$ if $\hat{\rho}_j = \rho$, and it increases monotonically as $\hat{\rho}_j$ deviates from $\rho$. This penalty term therefore drives the average activation of each hidden unit towards $\rho$.

Suppose the number of hidden units is $K$; the overall cost function is

$$J_{\mathrm{sparse}}(W, b) = J(W, b) + \beta \sum_{j=1}^{K} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j) \qquad (6)$$
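To make the pieces of Eq. (6) concrete, here is a sketch of the forward pass and cost; the weight-decay and sparsity weights lam and beta are placeholder values, since the paper gives only the sparsity target (rho around 0.02), and the backpropagation gradients are omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sae_cost(W1, b1, W2, b2, X, lam=1e-4, beta=3.0, rho=0.02):
    """Sparse auto-encoder cost of Eq. (6) on an N x m patch matrix X."""
    m = X.shape[1]
    A = sigmoid(W1 @ X + b1[:, None])         # hidden activations, Eq. (1)
    H = W2 @ A + b2[:, None]                  # linear output layer, Eq. (2)
    J = 0.5 * np.sum((H - X) ** 2) / m        # reconstruction term of Eq. (3)
    J += 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))  # weight decay
    rho_hat = A.mean(axis=1)                  # average activations, Eq. (4)
    kl = (rho * np.log(rho / rho_hat)
          + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))  # Eq. (5)
    return J + beta * kl.sum()
```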
D. Convolutional feature extraction and pooling

A single stage of a convolutional network consists of three layers: a convolution filter layer, a non-linearity layer, and a feature pooling layer. At the output, each input image is represented by a set of arrays called feature maps. Each feature map represents a particular feature extracted at all locations of the whole image. Fig. 2 shows the architecture of the 1-layer CNN.

Convolution filter layer: given an input image of $n \times n$ pixels, the output $y$ is composed of $K$ feature maps of size $(n - w + 1) \times (n - w + 1)$, where $K$ corresponds to the number of hidden units in the sparse auto-encoder trained previously on small patches. All trainable convolution kernels have size $w \times w$ and are convolved with the large input images. The model computes $y_k = \tilde{W}_k * x + b_k$, where $*$ is the 2D discrete convolution operator, $b_k$ is a trainable bias, and $\tilde{W}$ means flipping the matrix horizontally and vertically. Both the convolution kernels and the biases are trained unsupervised using the sparse auto-encoder.
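A sketch of the convolution filter layer follows; since discrete convolution already flips its kernel, convolving with the flipped kernel $\tilde{W}_k$ amounts to cross-correlating the image with the learned weights themselves.

```python
import numpy as np
from scipy.signal import convolve2d

def convolve_features(img, W1, b1, w):
    """Compute K feature maps y_k = flipped(W_k) * img + b_k for an
    n x n image; each map has size (n - w + 1) x (n - w + 1)."""
    K = W1.shape[0]                           # one map per hidden unit
    n = img.shape[0]
    maps = np.empty((K, n - w + 1, n - w + 1))
    for k in range(K):
        kernel = np.flipud(np.fliplr(W1[k].reshape(w, w)))
        maps[k] = convolve2d(img, kernel, mode='valid') + b1[k]
    return maps
```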
[Fig. 2 Architecture of the 1-layer CNN: input image, convolutions, feature maps, contrast normalization, pooling.]

Non-linearity layer: simply apply the logistic sigmoid function $f(z) = 1/(1 + e^{-z})$ point-wise. Then a method called local contrast normalization is employed to perform local subtractive and divisive normalizations, which enforce local competition between adjacent features in a feature map, and between features at the same spatial location in different feature maps [10]. The subtractive normalization operation removes the weighted average of the neighboring neurons from the current neuron: $v_{ijk} = x_{ijk} - \sum_{pq} w_{pq} \, x_{i, j+p, k+q}$, where $w_{pq}$ is a Gaussian weighting window (typically of size $9 \times 9$) normalized so that $\sum_{pq} w_{pq} = 1$. The divisive normalization computes $y_{ijk} = v_{ijk} / \max(c, \sigma_{jk})$, where $\sigma_{jk} = \left( \sum_{ipq} w_{pq} \, v_{i, j+p, k+q}^2 \right)^{1/2}$.

Feature pooling layer: the max pooling operation is chosen, which computes the maximum value over a neighborhood region in each feature map, $y_{ijk} = \max_{(p,q) \in N_{jk}} x_{ipq}$. This process not only reduces the dimension of the output feature maps but also introduces small translation invariance, which helps to build robust features for classification.
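A minimal sketch of non-overlapping max pooling; the pooling size p is a hyper-parameter the paper leaves unspecified, so the value below is illustrative.

```python
import numpy as np

def max_pool(feature_maps, p=4):
    """Max-pool each of K feature maps over non-overlapping p x p
    neighborhoods, shrinking each spatial dimension by a factor of p."""
    K, n, _ = feature_maps.shape
    n_out = n // p                            # drop any ragged border
    fm = feature_maps[:, :n_out * p, :n_out * p]
    fm = fm.reshape(K, n_out, p, n_out, p)
    return fm.max(axis=(2, 4))                # max over each p x p block
```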
E. Softmax regression classifier

In this step, the pooled features and their corresponding labels are used to train the final softmax classifier. The softmax regression model addresses multi-class classification: given a test input $x$, the output of the softmax classifier is a $k$-dimensional vector whose elements are the estimated probabilities $p(y = j \mid x; \theta)$ of the class label taking each of the $k$ values, $j = 1, \dots, k$, conditioned on the input feature. The $k$-dimensional hypothesis takes the form

$$h_{\theta}(x) = \begin{bmatrix} p(y = 1 \mid x; \theta) \\ p(y = 2 \mid x; \theta) \\ \vdots \\ p(y = k \mid x; \theta) \end{bmatrix} = \frac{1}{\sum_{j=1}^{k} e^{\theta_j^T x}} \begin{bmatrix} e^{\theta_1^T x} \\ e^{\theta_2^T x} \\ \vdots \\ e^{\theta_k^T x} \end{bmatrix} \qquad (7)$$

Here $\theta_1, \theta_2, \dots, \theta_k$ are the parameters to be learned supervised. The cost function is

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{k} 1\{y^{(i)} = j\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}} + \frac{\lambda}{2} \sum_{i=1}^{k} \sum_{j} \theta_{ij}^2 \qquad (8)$$

where $1\{\cdot\}$ is the indicator function, so that $1\{\text{a true statement}\} = 1$ and $1\{\text{a false statement}\} = 0$; for example, $1\{2 + 2 = 4\} = 1$, whereas $1\{1 + 1 = 3\} = 0$. The second term is a weight decay term that prevents overfitting. The minimum of $J(\theta)$ is found by gradient descent. The gradient of the cost function is

$$\nabla_{\theta_j} J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} x^{(i)} \left( 1\{y^{(i)} = j\} - p(y^{(i)} = j \mid x^{(i)}; \theta) \right) + \lambda \theta_j \qquad (9)$$

Plugging this into the gradient descent algorithm, each iteration performs the update $\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)$.
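The classifier can be sketched directly from Eqs. (7)-(9); the learning rate alpha and weight decay lam below are illustrative values, not ones given in the paper.

```python
import numpy as np

def softmax_probs(Theta, X):
    """Eq. (7): class probabilities for a k x d parameter matrix Theta
    and d x m feature matrix X (one pooled feature vector per column)."""
    Z = Theta @ X
    Z -= Z.max(axis=0, keepdims=True)         # stabilize the exponentials
    E = np.exp(Z)
    return E / E.sum(axis=0, keepdims=True)

def softmax_gd_step(Theta, X, y, alpha=0.1, lam=1e-4):
    """One gradient-descent update using the gradient of Eq. (9);
    y holds integer labels in 0..k-1."""
    k, m = Theta.shape[0], X.shape[1]
    P = softmax_probs(Theta, X)
    Y = np.zeros((k, m))
    Y[y, np.arange(m)] = 1.0                  # indicator 1{y = j}
    grad = -(Y - P) @ X.T / m + lam * Theta
    return Theta - alpha * grad               # theta_j := theta_j - alpha * grad_j
```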

III. EXPERIMENTAL RESULTS

A. Dataset

The experimental data set was collected by the Sandia National Laboratory (SNL) SAR sensor platform. The collection was jointly sponsored by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory as part of the Moving and Stationary Target Acquisition and Recognition (MSTAR) program. Hundreds of thousands of SAR images containing ground military targets were collected, only a small subset of which is publicly available on the website [18]. The publicly released data sets include 10 different types of ground military targets (BMP2, BRDM2, BTR60, BTR70, D7, T62, T72, ZIL131, ZSU234, 2S1) and were collected using an X-band SAR sensor in one-foot-resolution spotlight mode with full aspect coverage (in the range of 0 to 360 degrees). The data collected at a 17-degree depression angle are used for training, and the data collected at a 15-degree depression angle are used for testing.

[Fig. 3 Ten types of military targets in SAR images (left) and their corresponding optical images (right): (1) BRDM2, (2) BMP2, (3) BTR70, (4) BTR60, (5) 2S1, (6) D7, (7) T72, (8) T62, (9) ZSU234, (10) ZIL131.]

[Fig. 4 Visualization of learned convolutional kernels from SAR images (top) vs. those learned from optical images as given in [17] (bottom).]

The ten types of targets in SAR images are shown in Fig. 3 with the corresponding optical images given for comparison. As can be seen, the same target appears very different in a SAR image than in an optical image; the most prominent difference lies in the 'speckle' appearance of SAR imagery. Fig. 4 compares the convolutional kernels learned from SAR images with those learned from conventional optical images. Interestingly, the SAR features look very different from the optical features, which are basically edge-detector templates. The SAR features look more like 'sinc'-shaped responses, probably due to SAR speckle.

B. Classification Results

The algorithms are tested in two classification experiments. The first is classifying three types of military targets. The details of the training and testing data sets are shown in TABLE I.
In this experiment, deformation targets are included in the test set. For example, T72_132, T72_812, and T72_s7 all belong to the T72 tank, but they have different military configurations, such as whether or not they carry machine guns, antennas, or protection armor. The classification accuracy on three types of targets is 90.1% (error rate 9.9%).

TABLE I. SAMPLE NUMBER OF TRAINING AND TESTING SET FOR THREE TYPES

Training set   Sample number   Testing set   Sample number
BMP2_c21       233             BMP2_c21      196
BTR70_c71      233             BMP2_9563     195
T72_132        232             BMP2_9566     196
                               BTR70_c71     196
                               T72_132       196
                               T72_812       195
                               T72_s7        191

The second experiment classifies ten types of military targets. In this experiment, all targets are in standard configuration; deformation targets are not included. The details of the training and testing data sets are shown in TABLE II. After adjusting hyper-parameters by cross-validation, we achieve a classification accuracy of 84.7% (error rate 15.3%). The detailed ambiguity matrix is shown in TABLE III.

TABLE II. SAMPLE NUMBER OF TRAINING AND TESTING SET FOR TEN TYPES

Training set   Sample number   Testing set   Sample number
BMP2           233             BMP2          195
BRDM2          298             BRDM2         274
BTR60          256             BTR60         195
BTR70          233             BTR70         196
D7             299             D7            274
2S1            299             2S1           274
T62            299             T62           273
T72            232             T72           196
ZIL131         299             ZIL131        274
ZSU234         299             ZSU234        274

For comparison, similar studies using handcrafted features [19] to classify the same MSTAR data sets have achieved accuracies of 96.79% for three types and 90.5% for ten types, respectively. It can be seen that the fully automatic single-layer CNN achieves reasonable performance (about 5% lower than the manually designed classifier), which makes us confident that a multi-layer 'deep' CNN might achieve comparable or better performance than hand-crafted algorithms.

TABLE III. CLASSIFICATION AMBIGUITY MATRIX

Test set   BMP2  BRDM2  BTR60  BTR70  D7  2S1  T62  T72  ZIL131  ZSU234  Classification accuracy (%)
BMP2 157 9 2 9 0 4 0 4 6 4 80.5
BRDM2 9 220 6 18 0 3 1 2 15 0 80.2
BTR60 0 11 168 4 4 4 1 2 1 0 86.1
BTR70 3 4 3 181 0 4 0 0 1 0 92.3
D7 0 0 0 0 252 0 8 2 5 7 91.9
2S1 14 9 5 5 0 190 7 22 21 1 69.3
T62 2 1 5 0 4 7 242 3 7 2 88.6
T72 3 3 1 1 0 8 2 168 9 1 85.7
ZIL131 5 6 5 7 1 12 3 9 226 0 82.4
ZSU234 1 1 3 0 4 1 2 7 6 249 90.8
average classification rate: 84.7%
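For reference, a table like TABLE III can be tallied with a few lines of NumPy; the function name and the integer label encoding below are illustrative, not from the paper.

```python
import numpy as np

def ambiguity_matrix(y_true, y_pred, k=10):
    """Count a k x k ambiguity (confusion) matrix from integer labels
    and return it with per-class and overall classification accuracy."""
    M = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        M[t, p] += 1                          # row: true class, column: predicted
    per_class = np.diag(M) / M.sum(axis=1)
    overall = np.diag(M).sum() / M.sum()
    return M, per_class, overall
```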

IV. CONCLUSION AND FUTURE WORK

As a preliminary study, a single stage of convolutional network is adapted to automatically learn features useful for SAR target recognition. Using the learned convolutional features, accuracies of 90.1% and 84.7% are achieved on the three-type and ten-type MSTAR target classification tasks, respectively. The next step is to learn hierarchical features using a multi-stage convolutional network; considering the scarcity of SAR image samples, data augmentation techniques and overfitting reduction methods must be utilized.

REFERENCES

[1] Novak L M, Owirka G J, Brower W S, et al. "The automatic target-recognition system in SAIP," Lincoln Laboratory Journal, 1997, 10(2).
[2] Novak L M. "State-of-the-art of SAR automatic target recognition," Radar Conference, 2000. The Record of the IEEE 2000 International. IEEE, 2000: 836-843.
[3] Wissinger J, Ristroph R, Diemunsch J R, et al. "MSTAR's extensible search engine and model-based inferencing toolkit," AeroSense'99. International Society for Optics and Photonics, 1999: 554-570.
[4] Krizhevsky A, Sutskever I, Hinton G E. "ImageNet Classification with Deep Convolutional Neural Networks," NIPS. 2012, 1(2): 4.
[5] Ciresan D C, Meier U, Gambardella L M, et al. "Deep, big, simple neural nets for handwritten digit recognition," Neural Computation, 2010, 22(12): 3207-3220.
[6] Ciresan D C, Meier U, Schmidhuber J. "Transfer learning for Latin and Chinese characters with deep neural networks," Neural Networks (IJCNN), The 2012 International Joint Conference on. IEEE, 2012: 1-6.
[7] Cireşan D, Meier U, Masci J, et al. "Multi-column deep neural network for traffic sign classification," Neural Networks, 2012, 32: 333-338.
[8] Taigman Y, Yang M, Ranzato M, et al. "DeepFace: Closing the Gap to Human-Level Performance in Face Verification," IEEE CVPR. 2014.
[9] LeCun Y. "Learning invariant feature hierarchies," Computer Vision–ECCV 2012. Workshops and Demonstrations. Springer Berlin Heidelberg, 2012: 496-505.
[10] Jarrett K, Kavukcuoglu K, Ranzato M, et al. "What is the best multi-stage architecture for object recognition?" Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009: 2146-2153.
[11] Kavukcuoglu K, Sermanet P, Boureau Y L, et al. "Learning Convolutional Feature Hierarchies for Visual Recognition," NIPS. 2010, 1(2): 5.
[12] Coates A, Ng A Y, Lee H. "An analysis of single-layer networks in unsupervised feature learning," International Conference on Artificial Intelligence and Statistics. 2011: 215-223.
[13] Zeiler M D, Krishnan D, Taylor G W, et al. "Deconvolutional networks," Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010: 2528-2535.
[14] Zeiler M D, Taylor G W, Fergus R. "Adaptive deconvolutional networks for mid and high level feature learning," Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011: 2018-2025.
[15] Diemunsch J R, Wissinger J. "Moving and stationary target acquisition and recognition (MSTAR) model-based automatic target recognition: search technology for a robust ATR," Aerospace/Defense Sensing and Controls. International Society for Optics and Photonics, 1998: 481-492.
[16] Sun Z, Xue L, Xu Y, et al. "Marginal Fisher Analysis Feature Extraction Algorithm Based on Multilayer Auto-encoder," Journal of Information and Computational Science, 2012, 9(18): 5897-5906.
[17] Deep learning tutorial: http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial
[18] MSTAR public targets dataset: https://www.sdms.afrl.af.mil/index.php?collection=mstar
[19] Zhang Xinzheng, Huang Peikang. "SAR ATR based on Bayesian compressive sensing," Systems Engineering and Electronics, 2013, 35(1): 40-44. [in Chinese]
