A Method for Improving CNN-Based Image Recognition Using DCGAN
CMC, vol. 57, no. 1, pp. 167-178, 2018
1 School of Computer & Software, Jiangsu Engineering Center of Network Monitoring, Nanjing University of Information Science & Technology, Nanjing, 210044, China.
2 State Key Lab. for Novel Software Technology, Nanjing University, Nanjing, 210023, China.
3 Computer Science Department, University of Central Arkansas, Conway AR, 72035, USA.
Abstract: Image recognition has long been a hot research topic in both the scientific community and industry. The emergence of convolutional neural networks (CNN) has made this technology a research focus in the field of computer vision, especially in image recognition, but it also makes recognition results largely dependent on the number and quality of training samples. Recently, DCGAN has become a leading method for generating images, sounds, and videos. In this paper, DCGAN is used to generate samples that are difficult to collect, and an efficient design method for the generative model is proposed. We combine DCGAN with CNN: DCGAN generates the samples, which are then used to train an image recognition model based on CNN. This method can strengthen the classification model and effectively improve the accuracy of image recognition. In our experiments, we used radar profiles of 4 categories as the dataset and achieved satisfactory classification performance. This paper applies image recognition technology to the meteorological field.
1 Introduction
Nowadays, with the development of deep learning, people increasingly pursue higher accuracy in image recognition. Deep neural networks that mimic human thinking appear more and more often in this field. The design of the architecture, the tuning of parameters, and the selection of samples directly influence the final recognition results of the neural network. At present, many studies have used the Convolutional Neural Network (CNN) as an entry point to improve the accuracy of image recognition. As is well known, a CNN can use the original pixels of an image directly as input; it is no longer necessary to extract features in advance using traditional methods. This has been reported to give superior performance compared to earlier work relying on manual features [Dixit, Chen, Gao et al. (2015)]. In fact, CNN has been successfully applied to the classification of handwritten characters and to gesture recognition [Kim, Lee and Park (2008)], operating directly on the data stream without pre-processing or feature selection. A trained CNN model is invariant to distortions such as scaling, translation,
and rotation, and it has strong generalization ability. The biggest advantage of CNN is that it can handle high-dimensional data through shared convolution kernels. Convolutional kernels handle complex feature computations through multi-layer training in end-to-end networks. This design greatly reduces the number of parameters of the neural network and at the same time reduces the complexity of the neural network model, leaving large room for optimizing classification accuracy.
Most CNN image classification is based on supervised learning: a large amount of data is needed as training samples to obtain accurate classification. However, some samples are hard to collect, for example, radar profiles of specific climate conditions, where the limitations of observation conditions make collection extremely difficult. Fortunately, Ian Goodfellow [Goodfellow, Pouget-Abadie, Mirza et al. (2014)] proposed GAN, a framework of generative models inspired by game theory. GAN can generate images or perform image restoration; pix2pix, for instance, turns monochrome images into color images and line drawings into images with texture, shadows and luster [Isola, Zhu, Zhou et al. (2017)]. To counter the divergence observed in raw GAN training, Conditional GAN [Mirza and Osindero (2014)] turns the original generation process into one conditioned on additional information: the generator takes labels and random noise together, and the discriminator discriminates the data source and the data label at the same time, providing the generator with a more efficient gradient and making the framework easily extendable to semi-supervised learning. After that, to address the instability and intractability of adversarial training, DCGAN [Radford, Metz and Chintala (2015)] extended the structure to convolutional neural networks. In that work, a set of convolutional architectures was proposed, using Batch Normalization to achieve local normalization, making it possible to train on real large-scale datasets such as CelebA.
The major contributions of this paper are as follows:
1. We designed a novel model structure to generate samples that are hard to collect, based on DCGAN's high scalability and excellent sample generation capabilities.
2. A learning rate decay strategy is used to speed up learning on the generator optimization problem.
3. In our image recognition experiment, we built a recognition framework based on CNN and used enough generated samples to strengthen the training of the recognition model, finally improving the classification accuracy.
4. We used radar profiles as the dataset, applied the proposed technology to the meteorological field, and extended the application of image recognition.
2 Related work
2.1 GAN
There are two components in the GAN framework: one is the generative model G and the other is the discriminative model D. The G model is responsible for producing spurious data that is close to the real data; the D model is responsible for identifying the authenticity of the data produced by G. Competition between D and G makes the two sides continuously optimize during training until they reach a balanced state. Through this clever design, GAN can learn independently. Training resembles a two-player min-max game, where one player, the generator, attempts to generate samples from random noise, and the other, the discriminator, attempts to distinguish synthetic samples from real ones. The overall loss function is expressed as minimizing the distance between the generated data distribution and the real data distribution. Given a fixed generator, the best discriminator D is:
$$D(x) = \frac{P_{data}(x)}{P_{data}(x) + P_{model}(x)} \tag{1}$$
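For context, formula (1) is the pointwise optimum of the two-player min-max objective of Goodfellow et al. (2014). A standard statement of that objective, supplied here for completeness rather than taken from this paper's original layout, is:

$$\min_G \max_D V(D,G) = \mathbb{E}_{x \sim P_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))]$$

Fixing G and maximizing V over D for each x yields formula (1), where $P_{model}$ is the distribution of the generated samples $G(z)$.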
2.2 CNN
From LeNet [LeCun, Bottou, Bengio et al. (1998)] to AlexNet, which was the key to promoting the development of CNN, and from GoogLeNet to VGGNet and OverFeat, deep network features can be used for image classification, detection and segmentation [Krizhevsky, Sutskever and Hinton (2012); Szegedy, Liu, Jia et al. (2015); Simonyan and Zisserman (2014)]. The aim of object detection is to find the locations of all targets and specify each target's category in a given image or video. The radar profile that we want to recognize is different from general object images: it is categorized by spectral distribution and color similarity, and CNN can extract features at this level of semantics well [Dixit, Chen, Gao et al. (2015)]. A classification operation needs to be performed after features are extracted by the recognition system. We directly connect the feature extractor and the classifier in the network as the main structure of the recognition framework. This paper does not use MPM as the classifier, which can directly estimate the probabilistic accuracy bound by minimizing the maximum probability of misclassification [Gu, Sun and Sheng (2016)], but instead adopts a fully connected layer with Softmax classification.
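The paper does not publish its network code, so the following is only a minimal PyTorch sketch of the feature-extractor-plus-fully-connected-Softmax structure described above; the layer widths, the 64×64 input size, and the RadarCNN name are illustrative assumptions, with only the 4-category output taken from the paper.

```python
import torch
import torch.nn as nn

class RadarCNN(nn.Module):
    """Feature extractor + fully connected layer + Softmax classifier.
    Layer widths and the 64x64 RGB input are illustrative assumptions;
    the paper does not publish its exact architecture."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),   # 64 -> 32
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),  # 32 -> 16
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), # 16 -> 8
            nn.ReLU(),
        )
        self.classifier = nn.Linear(128 * 8 * 8, num_classes)

    def forward(self, x):
        h = self.features(x)
        return self.classifier(h.flatten(1))  # logits; softmax applied below

model = RadarCNN()
probs = torch.softmax(model(torch.randn(1, 3, 64, 64)), dim=1)  # 4 class probabilities
```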
3 Method
In this section, we explain the main ideas and methods of this paper, including the establishment of the network models, the generation of image samples, the testing of sample quality, and the specific image recognition scheme. The overall process is shown in Fig. 1.
All activation functions except the output layer use ReLU; the output layer uses the tanh function. In this way, several random feature vectors can be turned into pictures. Upsampling and downsampling are achieved by setting the stride in code. In our model structure, we built our own sample generation network and discriminant network by referring to the DCGAN network structure.
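As a concrete illustration of this description, a minimal DCGAN-style generator sketch in PyTorch follows: strided transposed convolutions perform the upsampling, Batch Normalization sits between each convolution and its ReLU activation, and the output layer uses tanh. The channel widths, the 100-dimensional noise vector, and the 64×64 output size are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """DCGAN-style generator: stride-2 transposed convolutions upsample the
    feature maps, BatchNorm sits between each convolution and its ReLU, and
    the output layer uses tanh, as described in the text above."""
    def __init__(self, z_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim, 256, 4, stride=1, padding=0),  # 1 -> 4
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),    # 4 -> 8
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),     # 8 -> 16
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),      # 16 -> 32
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),       # 32 -> 64
            nn.Tanh(),  # only the output layer uses tanh
        )

    def forward(self, z):
        return self.net(z.view(z.size(0), -1, 1, 1))

fake = Generator()(torch.randn(8, 100))  # 8 random vectors -> 8 fake 64x64 images
```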
A fixed learning rate can cause the gradient descent process to fail to converge accurately during iteration. After every certain number of training steps, the learning rate is therefore decayed once. In the beginning, a larger learning rate achieves very fast convergence; as the learning rate gets smaller, the stride of convergence also decreases, so not much error is introduced even if the parameters swing around the minimum. The learning rate decay strategy can be expressed as:
$$\alpha_i = \frac{1}{1 + decay\_rate \cdot epoch_i} \cdot \alpha_0 \tag{5}$$
The $decay\_rate$ is set to 0.95 in the subsequent experiments, $epoch_i$ denotes the $i$-th training epoch, and $\alpha_0$ is the initial learning rate. The decayed learning rate needs to be combined with an optimizer in order to quickly reach an optimal solution and make the later training more stable. One optimizer that operates on the parameters is Momentum, which accelerates descent along the gradient; although it converges faster, it can make training hard to control. Another, AdaGrad, adds a penalty term that modifies the learning rate so that each parameter has its own learning rate, but it becomes inefficient. In this paper we combine the properties of these two optimizers and use Adam to accelerate the training of the neural networks. Its mathematical expression is as follows:
$$m = b_1 \cdot m + (1 - b_1) \cdot dx \tag{6}$$
$$v = b_2 \cdot v + (1 - b_2) \cdot dx^2 \tag{7}$$
$$W \mathrel{+}= -\,learn\_rate \cdot m / \sqrt{v} \tag{8}$$
In these formulas, $b_1$ and $b_2$ are decay coefficients and $W$ is the weight being updated. The update of the weight parameters depends on the two moving averages $m$ and $v$ and on the gradient $dx$. Formula (6) carries the gradient-momentum property of Momentum, and formula (7) carries the damping property of AdaGrad. Therefore, by taking both $m$ and $v$ into account, formula (8) updates the weight parameters.
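A small NumPy sketch of formulas (5)-(8) follows. The default b1 and b2 values and the eps guard in the denominator are standard additions assumed here; the paper's simplified formulas omit them (as they do the bias correction of the full Adam optimizer).

```python
import numpy as np

def decayed_lr(alpha0, decay_rate, epoch):
    """Formula (5): inverse-time learning rate decay.
    E.g. with alpha0 = 0.001 and decay_rate = 0.95 (the paper's value),
    epoch 1 gives ~5.1e-4 and epoch 10 gives ~9.5e-5."""
    return alpha0 / (1.0 + decay_rate * epoch)

def adam_step(W, dx, m, v, lr, b1=0.9, b2=0.999, eps=1e-8):
    """Formulas (6)-(8): a Momentum-like first moment m and an AdaGrad-like
    second moment v drive the weight update. b1, b2 defaults and eps are
    assumed standard values, not taken from the paper."""
    m = b1 * m + (1 - b1) * dx            # formula (6)
    v = b2 * v + (1 - b2) * dx ** 2       # formula (7)
    W = W - lr * m / np.sqrt(v + eps)     # formula (8)
    return W, m, v
```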
In previous tests, we often found barely noticeable differences among the generated images because the sample parameters almost converged to one point. During the sample generation experiment, we used mini-batch execution to improve training efficiency. This strategy makes reasonable use of the computer's memory while also saving training time. At the same time, however, batch training brings competition between gradients. Studies of neural networks have found that when the parameters of a certain layer change during gradient training, the distribution of that layer's output data may change as well. For each layer of the network, the output distribution will differ from the corresponding input distribution after the operations within the layer. This difference increases with network depth, resulting in the covariate shift problem [Ioffe and Szegedy (2015)], so that the trained model cannot generalize well, and gradients may gradually vanish as they propagate. For this reason, we add Batch Normalization between the convolutions and activation functions to alleviate the vanishing gradient problem and help the gradient propagate to each layer. Batch Normalization can overcome the difficulty of training deep neural networks well. In the normalization process, the batch of training samples can be expressed as $X = \{x_1, x_2, \dots, x_m\}$, where $x_i$ denotes the sample at index $i$. The mean and variance can be calculated from these samples; the mathematical expressions are as follows:
$$mean_x = \frac{1}{m} \sum_{i=1}^{m} x_i \tag{9}$$
$$\sigma_x^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - mean_x)^2 \tag{10}$$
Formula (9) computes the mean over the sample points; using this mean, the variance is computed by formula (10). With these results, the normalization operation of formula (11) can be executed, where the small constant $\varepsilon$ keeps the denominator away from zero and constrains the range of $\hat{x}_i$.
$$\hat{x}_i = \frac{x_i - mean_x}{\sqrt{\sigma_x^2 + \varepsilon}} \tag{11}$$
$$y_i = \gamma \hat{x}_i + \beta \tag{12}$$
To enable DCGAN to learn an appropriate representation of the input, $\gamma$ and $\beta$ are used to transform $\hat{x}_i$; the whole process is denoted $BN_{\gamma,\beta}(x_i)$. Both $\gamma$ and $\beta$ are learned autonomously by the network.
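A minimal NumPy sketch of the normalization pipeline of formulas (9)-(12) is given below; the eps value and the array shapes are illustrative assumptions, since the paper leaves them unspecified.

```python
import numpy as np

def batch_norm_forward(X, gamma, beta, eps=1e-5):
    """Formulas (9)-(12): normalize a mini-batch X of shape (m, features),
    then scale and shift with the learned parameters gamma and beta."""
    mean_x = X.mean(axis=0)                       # formula (9)
    var_x = ((X - mean_x) ** 2).mean(axis=0)      # formula (10)
    X_hat = (X - mean_x) / np.sqrt(var_x + eps)   # formula (11)
    return gamma * X_hat + beta                   # formula (12)

batch = np.random.randn(32, 128)  # 32 samples, 128 features (assumed sizes)
y = batch_norm_forward(batch, gamma=np.ones(128), beta=np.zeros(128))
```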
4 Experiments
In this section, we first introduce the dataset and then report the pre-training results of the recognition framework as the basis for subsequent optimization. According to the verification results on the generated samples, we selected part of the generated samples together with real samples to participate in retraining on top of the pre-training. The final results include the accuracy during training and the final test accuracy.
According to the experimental results, we abandoned the raw CNN network structure and ran our self-designed model for the following experiments. After testing radar profiles of the 4 categories, the results are shown in Fig. 6.
Finally, we collected the latest radar data for testing. The results show that the accuracy of model recognition after mixed training improved, as shown in Fig. 9.
5 Conclusion
In this paper, we have combined DCGAN with a CNN-based recognition model. The experimental results show that accelerating learning through the learning rate decay strategy makes the samples generated by DCGAN more valuable for training: they can not only participate in training alongside real data in the CNN network we designed, but also improve the recognition accuracy of radar profile images. At the same time, we also alleviated the problem of difficult parameter convergence caused by the difficulty of collecting samples and by excessively similar features. In terms of recognition details, DCGAN can be further optimized by adjusting the number of training iterations and the learning rate to obtain more realistic samples, and the CNN image recognition framework can produce more accurate results through the choice of network depth and convolution kernel parameters. Combining image recognition with meteorological applications extends deep learning in practice. In the future, we can automate the recognition task, detect weather conditions in real time, and use more optimized deep learning algorithms to achieve more accurate weather forecasts.
Acknowledgement: This work was supported in part by the Priority Academic Program
Development of Jiangsu Higher Education Institutions.
References
Dixit, M.; Chen, S.; Gao, D.; Rasiwasia, N.; Vasconcelos, N. (2015): Scene
classification with semantic fisher vectors. Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 2974-2983.
Gong, Y.; Wang, L.; Guo, R.; Lazebnik, S. (2014): Multi-scale orderless pooling of deep
convolutional activation features. European Conference on Computer Vision, pp. 392-407.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D. et al. (2014):
Generative adversarial nets. Advances in Neural Information Processing Systems, pp.
2672-2680.
Gu, B.; Sun, X.; Sheng, V. S. (2016): Structural minimax probability machine. IEEE
Transactions on Neural Networks and Learning Systems, vol. 28, no. 7, pp. 1646-1656.
Ioffe, S.; Szegedy, C. (2015): Batch normalization: Accelerating deep network training
by reducing internal covariate shift. Machine Learning, pp. 1-11.
Isola, P.; Zhu, J. Y.; Zhou, T.; Efros, A. A. (2017): Image-to-Image translation with
conditional adversarial networks. Computer Vision and Pattern Recognition, pp. 1-17.
Kim, H. J.; Lee, J. S.; Park, J. H. (2008): Dynamic hand gesture recognition using a
CNN model with 3D receptive fields. IEEE Conference on Neural Networks and Signal
Processing, pp. 14-19.
Krizhevsky, A.; Sutskever, I.; Hinton, G. E. (2012): ImageNet classification with deep
convolutional neural networks. Advances in Neural Information Processing Systems, pp.
1097-1105.
LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. (1998): Gradient-based learning applied
to document recognition. Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324.
Mirza, M.; Osindero, S. (2014): Conditional generative adversarial nets. Machine
Learning, pp. 1-7.
Oquab, M.; Bottou, L.; Laptev, I.; Sivic, J. (2014): Learning and transferring mid-level
image representations using convolutional neural networks. IEEE International Conference
on Computer Vision and Pattern Recognition, pp. 1717-1724.
Radford, A.; Metz, L.; Chintala, S. (2015): Unsupervised representation learning with
deep convolutional generative adversarial networks. Computer Science, pp. 1-15.
Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A. et al. (2016):
Improved techniques for training GANs. Advances in Neural Information Processing
Systems, pp. 2234-2242.
Simonyan, K.; Zisserman, A. (2014): Very deep convolutional networks for large-scale
image recognition. Computer Vision and Pattern Recognition, pp. 1-14.
Shin, A.; Yamaguchi, M.; Ohnishi, K.; Harada, T. (2016): Dense image representation
with spatial pyramid VLAD coding of CNN for locally robust captioning. Computer
Vision and Pattern Recognition, pp. 1-18.
Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S. et al. (2015): Going deeper with
convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 1-9.
Theis, L.; Oord, A.; Bethge, M. (2015): A note on the evaluation of generative models.
Machine Learning, pp. 1-10.
Zhou, Z.; Wu, Q. J.; Huang, F.; Sun, X. (2017): Fast and accurate near-duplicate image
elimination for visual sensor networks. International Journal of Distributed Sensor
Networks, vol. 13, no. 2.