

Partial Connection Based on Channel Attention for Differentiable Neural Architecture Search

Yu Xue, Member, IEEE, and Jiafeng Qin

Abstract—Differentiable neural architecture search (DARTS), as a gradient-guided search method, greatly reduces the cost of computation and speeds up the search. In DARTS, architecture parameters are introduced for the candidate operations, but the parameters of some weight-equipped operations may not be trained well in the initial stage, which causes unfair competition between candidate operations. The weight-free operations then appear in large numbers, which results in the phenomenon of performance crash. Besides, a lot of memory is occupied during supernet training, which causes the memory utilization to be low. In this paper, a partial channel connection based on channel attention for differentiable neural architecture search (ADARTS) is proposed. Some channels with higher weights are selected through the attention mechanism and sent into the operation space, while the other channels are directly concatenated with the processed channels. Selecting a few channels with higher attention weights can better transmit important feature information into the search space and greatly improve search efficiency and memory utilization. The instability of the network structure caused by random selection can also be avoided. The experimental results show that ADARTS achieved 2.46% and 17.06% classification error rates on CIFAR-10 and CIFAR-100, respectively. ADARTS can effectively solve the problem that too many skip connections appear in the search process and obtain network structures with better performance.

Index Terms—Neural architecture search, channel attention, image classification, partial connection.

This work was partially supported by the National Natural Science Foundation of China (61876089, 61876185, and 61902281) and the Natural Science Foundation of Jiangsu Province (BK20141005).
Yu Xue (corresponding author) and Jiafeng Qin are with the School of Computer Science, Nanjing University of Information Science and Technology, Jiangsu, China. E-mails: xueyu@nuist.edu.cn; qinjiafeng@nuist.edu.cn.

I. INTRODUCTION

Deep neural network [1], [2] is an important research topic in deep learning [3]. Compared with a shallow neural network, its multi-layer structure can extract richer and more complex feature information to obtain higher performance. Therefore, it has made remarkable progress in semantic recognition [4], image recognition [5], and data forecasting [6], [7]. The performance of a deep neural network mainly depends on its parameters and structure. The vital task is to design appropriate strategies to optimize the structure and parameters of the neural network to improve its final performance. At present, gradient methods such as the SGD optimizer [8] and the Adam optimizer [9] are mainly used to optimize network parameters. However, unlike parameter optimization, building a network structure with better performance requires a lot of expert experience. For example, some of the better neural networks, such as VGG [10] and ResNet [11], were designed by experts, and all of them were designed manually in a trial-and-error manner. A lot of time and resources are taken to design these networks, and different network structures need to be built for different problems and datasets. Neural architecture search (NAS) [12], as an effective method, can automatically search network structures with higher performance.

As a part of automated machine learning (AutoML), NAS builds different neural network structures by constructing a large search space and search strategies. In NAS, the search space, the appropriate search method, and the evaluation method are the three main tasks.

In general, there are many methods that can be used for architecture search, such as reinforcement learning (RL) [13], evolutionary computing (EC) [14], and gradient-based methods. Reinforcement learning regards the search of a neural network structure as an agent's action. The network is constructed through different behaviors and rewards which are based on the evaluation of the network performance on the test set. Representation and optimization of the agent policy are the two keys to using reinforcement learning to search network architectures. Zoph et al. [15] used recurrent neural network (RNN) strategies to sequentially sample strings and then encode the neural architecture. Baker et al. [16] used a Q-learning training strategy, which in turn selects the type of each layer and the corresponding hyperparameters. However, methods based on reinforcement learning consume extremely large computing resources. For example, Zoph [15] used 800 GPUs to complete the search process in three to four weeks. Therefore, as another method to replace reinforcement learning, evolutionary computing can reduce computational consumption and find a better solution compared with reinforcement learning. Xie et al. [14] represented the network structure with a fixed-length binary code, and used a genetic algorithm to initialize individuals and explore the network space through selection, crossover, and mutation. It only used 17 GPUs to train for one day on the same dataset, which is much faster than the reinforcement learning method. The difference among evolutionary computation methods mainly lies in how to choose the initial population, update the population, and generate offspring.

Real et al. [17] used the tournament method to select the parent and deleted the worst individual from the population, and the new offspring inherited all the parameters of the parent. Although such inheritance is not strict about inheriting performance, it can still speed up the learning process compared with random initialization. Elsken et al. [18] sampled parents from the multi-objective Pareto frontier to generate better offspring. In terms of initialization methods, these algorithms often encoded convolution, pooling, and other operations directly [19]. In addition, the basic modules used in the ResNet and DenseNet networks can deal well with gradient disappearance, so Sun et al. [20] used the genetic algorithm to search ResNet and DenseNet blocks to construct networks with high performance. Yang et al. [21] used the cell, which is a continuous search space, as a basic module to speed up the search. However, methods based on evolutionary computation still have considerable computational overhead, and it takes a lot of time to evaluate multiple network structures. To further reduce the search cost, differentiable architecture search (DARTS) [22] used weight sharing and optimized the process of supernet training and subnet search so that it can search for the neural network architecture quickly. However, DARTS has the phenomenon that the weight-free operations, such as skip connection and max pooling, increase in the later stage of the search. A large number of skip connections appear in a cell in Fig. 1. This type of cell cannot extract image features well, resulting in network performance collapse.

Fig. 1: Searched normal cell on CIFAR-10 which has too many skip connections.

To search for network architectures with high performance and further improve search efficiency, this paper proposes a partial channel connection based on the attention mechanism for differentiable neural architecture search (ADARTS). The channel attention mechanism is used to select important channels and reinforce feature information. In addition, partial channel connections reduce unfair competition between operations and reduce memory occupation, making the search process faster and more stable. We summarize our contributions as follows:

1. The attention mechanism is employed to extract the importance of the channels of the input data in the searching process. Then, the obtained attention weights are multiplied by the original input data to generate new input so that the key features in the input data can be identified and the neural network can use more important information.

2. The channel selection is used to send the channels with higher attention weights into the operation space, and the other channels are directly concatenated with the output of the operation space to improve the search efficiency and stability. Furthermore, it can weaken the unfair competition between candidate operations caused by the parameters of weight-equipped operations which are not trained well in the initial stage.

3. The proposed method searches for 0.2 GPU days on the CIFAR-10 dataset, and the searched structure achieves a 2.46% classification error with 2.9M parameters. It also achieves a 17.06% classification error when transferred to the CIFAR-100 dataset for evaluation, which is better than other comparative NAS methods.

The remainder of this article is organized as follows. In Section II, we introduce the basic ideas and methods of DARTS. In Section III, we describe the proposed algorithm that uses the channel attention strategy to select some channels for connection. Section IV introduces the experimental plan, datasets, parameters, and accuracy in image classification. In addition, we prove its effectiveness through comparison with other NAS algorithms and ablation studies. Finally, the conclusion is given in Section V.

II. RELATED WORKS

Differentiable neural architecture search makes the search space continuous and uses gradients to alternately optimize network parameters and network structure weights, which greatly reduces the cost of calculation and improves the search speed. Fig. 2 describes the basic process of DARTS. First, the search space is composed of nodes, and there is a candidate operation space between the nodes. DARTS needs to select the connection mode between two nodes. The operation with the highest weight in the candidate space is selected as the connection operation. After all connected directed edges and corresponding operations are selected, the final network structure is determined. However, unfair competition in the operation space is caused because the weight-equipped operations are not trained well in the initial training stage, which leads to network collapse. There are a large number of skip connections in the network and the performance of the network drops sharply. FairDARTS [23] used sigmoid functions instead of softmax functions to calculate the weights to avoid unfair competition between operations. PDARTS [24] used dropout to randomly cut skip connections during network searching and reduced the dropout rate as the other operation parameters were gradually trained. In addition, a progressive search method was introduced, and the number of cells was gradually increased to make the search network more similar to the final trained network. However, a larger memory occupation was generated when the cells were increased. PDARTS continuously reduced the candidate operations in the training process, but the final structure could be restricted.

Fig. 2: The process of differentiable neural architecture search. The left illustration is the search space of the cell, the middle illustration is the operation space between node i and node j, and the right illustration is the structure obtained by selecting the operation with the maximum weight α between node i and node j.

To increase the memory utilization of DARTS during searching, GDAS [25] sampled only a part of the operations each time for optimization. PCDARTS [26] adopted the method of partial channel connection to reduce the computation in the operation space, but the random selection of channels led to instability of the network structure. The proposed method in this paper selects the more important channels by channel attention weight to update the operation weights, making the search more stable and faster.

A. Search Space

In DARTS, some cells are stacked to form the network structure. Cells are divided into reduction cells and normal cells. Cells of the same type have the same internal connection mode and share weights. A cell is a directed graph containing k nodes x_i = {x_1, x_2, ..., x_k}: two input nodes, one output node, and k-3 intermediate nodes. In addition, the j-th intermediate node is connected to all of its predecessor nodes, and the input of an intermediate node is obtained through its predecessor nodes, as shown in formula (1):

x_j = \sum_{i < j} o^{(i,j)}(x_i)    (1)

where x_j is the j-th node, x_i represents the i-th node, and o^{(i,j)} represents the candidate operation between node j and node i.

Each directed edge (i, j) has a corresponding operation o^{(i,j)} and a weight \alpha_o^{k,(i,j)} which is calculated by softmax, as shown in formula (2). The output of each candidate operation is weighted by the corresponding weight \alpha_o^{k,(i,j)}. The total input f(x_j) is expressed as formula (3):

\alpha_o^{k,(i,j)} = \frac{\exp(\alpha_o^{k,(i,j)})}{\sum_{k' \in O} \exp(\alpha_o^{k',(i,j)})}    (2)

f(x_j) = \sum_{k \in O} \alpha_o^{k,(i,j)} \, o^{(i,j)}(x_i)    (3)

where O is the candidate operation space, including 'none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'dil_conv_3x3', 'sep_conv_5x5', and 'dil_conv_5x5', o^{(i,j)} is the candidate operation, and \alpha_o^{k,(i,j)} is the weight of the k-th candidate operation on the directed edge (i, j).
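As a concrete illustration of formulas (2) and (3), the following sketch computes the softmax-weighted mixture of candidate operations on one edge. It assumes a PyTorch implementation; the class name MixedOp and its arguments are illustrative, not taken from the DARTS code base.

import torch
import torch.nn as nn

class MixedOp(nn.Module):
    """Softmax-weighted sum of all candidate operations on one directed edge (i, j)."""
    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)   # one nn.Module per operation in O

    def forward(self, x, alpha_edge):
        # alpha_edge: architecture parameters of this edge, shape (number of operations,)
        weights = torch.softmax(alpha_edge, dim=-1)                  # formula (2)
        return sum(w * op(x) for w, op in zip(weights, self.ops))    # formula (3)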
B. Search Method

There are two groups of parameters in DARTS. The network weight ω is used to train the network, and the architecture parameter α is added to transform the discrete space into a continuous search space; both groups of parameters are trained by gradient descent. The training loss L_train is used to optimize the network weight parameter ω, and the validation loss L_val is used to optimize the architecture parameter α. The two optimization steps are carried out as formulas (4)-(5), and finally the subnet with better performance is searched:

\min_{\alpha} L_{val}(\omega^*(\alpha), \alpha)    (4)

s.t. \omega^*(\alpha) = \arg\min_{\omega} L_{train}(\omega, \alpha)    (5)

where L_val is the validation loss, ω is the network weight, α is the network architecture weight, and L_train is the training loss.

After the ω and α parameters are updated in each generation, for each intermediate node x_j, two connecting edges are selected according to the largest \alpha_o^{k,(i,j)} on the directed edges (i, j) of the node, and the corresponding operation o_{select}^{(i,j)} is selected. After the connection mode and operation of all intermediate nodes are determined, the other directed edges and candidate operations are removed. The method of selecting the operation is as follows:

o_{select}^{(i,j)} = \arg\max_{k \in O} \alpha_o^{k,(i,j)}    (6)

where o_{select}^{(i,j)} is the selected operation between node i and node j, and \alpha_o^{k,(i,j)} is the corresponding weight of the operation.
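A minimal sketch of the alternating optimization in formulas (4) and (5), assuming a PyTorch implementation with one optimizer over the network weights ω and one over the architecture parameters α. The first-order approximation of DARTS is used here, and all function and variable names are illustrative.

def search_epoch(model, w_optimizer, alpha_optimizer, train_loader, val_loader, criterion):
    """One epoch of first-order alternating updates (illustrative only)."""
    for (x_train, y_train), (x_val, y_val) in zip(train_loader, val_loader):
        # Update the architecture parameters alpha on the validation loss, formula (4).
        alpha_optimizer.zero_grad()
        criterion(model(x_val), y_val).backward()
        alpha_optimizer.step()

        # Update the network weights omega on the training loss, formula (5).
        w_optimizer.zero_grad()
        criterion(model(x_train), y_train).backward()
        w_optimizer.step()
    # After searching, each edge keeps only the operation with the largest alpha,
    # e.g. via torch.argmax over the edge's alpha vector, which corresponds to formula (6).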
III. METHODOLOGY

In the search process of DARTS, each operation and its corresponding output need to be stored in the node, which occupies a large amount of memory. Therefore, researchers have to set a small batch size, which limits the memory utilization. The proposed partial channel connection method can greatly improve the operation efficiency and reduce the unfair competition between candidate operations. However, the random selection of channels also leads to instability in the search process. Therefore, this paper proposes to use the attention mechanism to select more meaningful channels and strengthen the features of some channels to make the searched network structure more stable. In the following subsections, we first introduce the framework of our algorithm, and then elaborate the important steps of the attention mechanism, channel selection, and partial channel connection in turn.

A. Overall Framework

The framework of ADARTS is briefly described as follows: firstly, the attention mechanism is added before each operation space to produce its inputs, i.e., the features of all channels are extracted through global pooling and used as the inputs of a multi-layer perceptron (MLP). Then the channel attention weights F_c are obtained through the MLP. After that, the obtained weight F_c is multiplied by the original input to get the new input. Finally, the top 1/K new channels with the larger weights are marked as 1 and the rest are marked as 0; the channels marked with 1 are sent into the operation space of ADARTS for the following calculation, while the other channels are directly concatenated with the output.

B. Attention Mechanism

Usually, many channels are generated by convolution operations in deep neural networks, but the feature information from some channels has little effect. The channel attention mechanism focuses on distinguishing the importance of channels [27]. In the channel attention mechanism, it is necessary to compress the channels to extract feature information. Therefore, max pooling and average pooling are used to obtain spatial features. The pseudo code of the attention mechanism is given in Algorithm 1. For input image data F with height H, width W, and C channels, max pooling and average pooling are applied to each channel respectively to obtain the feature data f_ap and f_mp with the size of 1x1xC. Then the two feature maps are input into a shared MLP. The number of neurons in the hidden layer is set to C/K to reduce computational complexity. Through the MLP, the obtained f'_ap and f'_mp are added to generate the channel attention weight F_c. The calculation process of the channel attention weights is shown in Fig. 3. The mapping function of the channel attention mechanism is given as formula (7):

F_c = \sigma(MLP(avgpool(F)) + MLP(maxpool(F)))
    = \sigma(MLP(f_{ap}) + MLP(f_{mp}))
    = \sigma(\omega_2(\omega_1(f_{ap})) + \omega_2(\omega_1(f_{mp})))
    = \sigma(f'_{ap} + f'_{mp})    (7)

where F is the input image data, f_ap is the average-pooled F, f_mp is the max-pooled F, MLP is the multi-layer perceptron, ω_1 are the weights between the input layer and the hidden layer, ω_2 are the weights between the hidden layer and the output layer, f'_ap is the result of f_ap processed by the MLP, f'_mp is the result of f_mp processed by the MLP, F_c is the channel attention weight, and σ is the sigmoid function 1/(1 + e^{-x}). Then, new feature maps are generated by multiplying the channel attention weight with the original feature maps, so the importance of the input features can be identified. In detail, Fig. 4 describes the calculation process in which the data F with the original input size of HxWxC is multiplied with F_c of size 1x1xC in channel order to obtain F'. The calculation is expressed as formula (8):

F' = F_c * F    (8)

where F_c is the channel attention weight, F is the input feature data, and F' is the feature data multiplied by the channel attention weight.

Fig. 3: The process of calculating channel attention. The input feature F is processed with max pooling and average pooling respectively, and then goes through the MLP. The corresponding outputs of the MLP are added to generate the channel attention weight F_c.

Fig. 4: The process of enhancing the features of the input data F by the channel attention F_c.

Algorithm 1 Attention Mechanism
Input: Feature data F
Output: Result data F' of channel attention
1: function ATTENTION(F)
2:   f_ap <- global average pooling of the HxWxC feature data F
3:   f_mp <- global max pooling of the HxWxC feature data F
4:   f'_ap <- the 1x1xC feature data f_ap is sent into the MLP for calculation
5:   f'_mp <- the 1x1xC feature data f_mp is sent into the MLP for calculation
6:   F_c <- add f'_ap and f'_mp to get the channel attention weight
7:   F' <- multiply F_c with the original feature data F according to the channel order
8:   return F'
9: end function
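The channel attention of Algorithm 1 (formulas (7) and (8)) can be sketched as follows. This assumes a PyTorch implementation; the reduction ratio plays the role of K (hidden size C/K), and the class and argument names are illustrative rather than the authors' code.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Algorithm 1: F' = sigmoid(MLP(avgpool(F)) + MLP(maxpool(F))) * F."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(                       # shared MLP with hidden size C/K
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                               # x: (N, C, H, W)
        f_ap = self.mlp(x.mean(dim=(2, 3)))             # global average pooling branch
        f_mp = self.mlp(x.amax(dim=(2, 3)))             # global max pooling branch
        fc = torch.sigmoid(f_ap + f_mp)                 # channel attention weight F_c, formula (7)
        return x * fc.view(x.size(0), -1, 1, 1), fc     # weighted features F' (formula (8)) and F_c

Returning F_c alongside F' makes the attention weights available for the channel selection step described next.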
C. Channel Selection

The weights of channel attention can reflect the importance of the feature channels. The channels with higher attention weights possess more important information. To better carry out partial channel connections, the more important channels need to be selected to make the search more stable and accurate. In this paper, the channel mask M^{(i,j)}, which is computed according to formula (9), is used to represent the selected and masked feature channels in the operation space o^{(i,j)}. The pseudo code of feature channel selection is shown in Algorithm 2. In this process, the top 1/K channels are selected according to the channel attention weights and the rest are masked channels. The selected channels are assigned a value of 1 and transferred to the operation space for calculation. The masked channels are assigned a value of 0, directly skip the operation space, and are concatenated with the output. This feature channel process is shown in Fig. 5.

M_k^{(i,j)} = \begin{cases} 1, & F_k \in \text{the top } 1/K \text{ channels} \\ 0, & F_k \notin \text{the top } 1/K \text{ channels} \end{cases}    (9)

where M^{(i,j)} is the channel mask between the i-th node and the j-th node, and F_k is the k-th channel attention weight.

Fig. 5: The process of selecting channels according to the channel attention weight. The top 1/K channels are selected and marked as 1 and the rest as 0.

Algorithm 2 Channel Selection
Input: Channel attention weight F_c, selected channel proportion 1/K
Output: Channel mask M^{(i,j)}
1: function SELECTION(F_c, K)
2:   k <- 0
3:   F_max <- the top 1/K weights in F_c
4:   while k < C do
5:     if F_k in F_max then
6:       M_k^{(i,j)} <- 1
7:     else
8:       M_k^{(i,j)} <- 0
9:     end if
10:    k <- k + 1
11:  end while
12:  return M^{(i,j)}
13: end function
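A sketch of the channel selection of Algorithm 2 and formula (9), assuming the attention weights F_c computed by the module above; torch.topk replaces the explicit loop, and the function name is illustrative.

import torch

def channel_mask(fc, K=4):
    """Return a 0/1 mask keeping the top 1/K channels by attention weight (formula (9))."""
    n, c = fc.shape                                   # fc: (N, C) channel attention weights
    num_selected = max(1, c // K)
    top_idx = fc.topk(num_selected, dim=1).indices    # indices of the 1/K largest weights
    mask = torch.zeros_like(fc)
    mask.scatter_(1, top_idx, 1.0)                    # selected channels -> 1, the rest -> 0
    return mask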

D. Partial Channel Connection

To solve the problem that a too small batch size causes instability of parameters and structures during the search process, this paper proposes partial channel connections to reduce the memory usage of the operation space at runtime. With the same batch size, if the proportion of partially connected channels is 1/K, the running memory usage is reduced to 1/K of the original. In this way, the batch size of the input data can be greatly increased, which can speed up the operation and make the network search more stable. Besides, DARTS tends to choose weight-free operations. This is because the weight-free operations provide more precise information than the weight-equipped operations when the neural network is not well trained. For example, the skip connection transfers data directly to the next node. Because weight training needs a lot of iterations, weight-free operations accumulate a lot of advantages before training is complete. In addition, there is a competitive relationship between operations, so even in the late training period the weight-free operations have more advantages, which results in the phenomenon of network structure collapse. The pseudo code of the partial channel connection is provided in Algorithm 3. In this algorithm, by using partial channel connections, only the selected channels enter the operation space for training, while the rest of the channels are not processed and are then concatenated with the outputs of the selected channels. The process is described in Fig. 6. The calculation of the output of the operation space o^{(i,j)} between node j and node i is expressed by formula (10). M^{(i,j)} represents the selected and masked channels. Even if the weight-equipped operations are not trained well, the loss of using partial channels is smaller than that of sending all channels into the operation space. This makes the advantage of weight-free operations less obvious. However, because only partial channels are sent into the operation space, random selection of the connection channels will also cause instability of the network architecture and parameters. Therefore, the channel attention weight F_c = {F_1, F_2, ..., F_n} in Section III-B is used when selecting channels. Selecting the 1/K channels with larger weights as the input channels can not only strengthen the data features and speed up the search, but also make the weight parameters more stable during training, strengthening the searched network structure and avoiding the phenomenon of structural collapse in the later search.

f(x_j) = \sum_{k \in O} \alpha_o^{k,(i,j)} \, o^{(i,j)}(x_i * M^{(i,j)}) + (1 - M^{(i,j)}) * x_i    (10)

where x_i * M^{(i,j)} are the selected channels, (1 - M^{(i,j)}) * x_i are the masked channels, f(x_j) is the input of node j, \alpha_o^{k,(i,j)} is the weight of the candidate operation on each directed edge (i, j), and o^{(i,j)} is the candidate operation between node j and node i.

Fig. 6: The selected channels are marked as 1 and the masked channels are marked as 0 according to the channel attention weight. The selected channels go through the operation space, and the calculation result of each channel is multiplied by the weight α. Finally, the output channels are concatenated with the remaining channels.

Algorithm 3 Partial Channel Connection
Input: Feature data through channel attention F', channel mask M^{(i,j)}
Output: Hybrid computation f(x_j) of node j
1: function PARTIALCONNECTION(F', M^{(i,j)})
2:   k <- 0
3:   while k < C do
4:     if M_k^{(i,j)} == 1 then
5:       F'_k is sent to the operation space to perform hybrid operations
6:     else
7:       F'_k skips the operation space
8:     end if
9:     k <- k + 1
10:  end while
11:  The feature output through the operation space and the unprocessed features are concatenated to produce f(x_j) according to the original channel order
12:  return f(x_j)
13: end function
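Formula (10) and Algorithm 3 can be sketched as the forward computation below. It evaluates formula (10) literally, reusing the MixedOp and channel_mask sketches given earlier; a memory-saving implementation would instead gather only the selected channels before applying the mixed operation. All names are illustrative.

def partial_connection(x, mask, mixed_op, alpha_edge):
    """Selected channels pass through the mixed operation, masked channels bypass it (formula (10))."""
    m = mask.view(mask.size(0), -1, 1, 1)          # broadcast the 0/1 channel mask over H and W
    return mixed_op(x * m, alpha_edge) + (1.0 - m) * x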
IV. EXPERIMENTS ON CLASSIFICATION TASKS

We applied ADARTS to image classification to test its performance on common image datasets (CIFAR-10/CIFAR-100). The network architecture found on CIFAR-10 is transferred to CIFAR-100 for evaluation. In addition, we conduct ablation experiments to verify the classification accuracy and stability of ADARTS.

A. Dataset

CIFAR-10 and CIFAR-100 each contain 60,000 color images with 32x32 resolution. There are 10 classes with 6,000 images each in CIFAR-10 and 100 classes with 600 images each in CIFAR-100. Each dataset is divided into two parts, i.e., a training set and a test set. The training set is used to search for the network structure, and the test set is used to verify the performance of the network.

B. Implementation Details

1) Network structure: The network consists of normal cells and reduction cells. There are two input nodes, four intermediate nodes, and one output node in each cell. The cells at one-third and two-thirds of the network depth are reduction cells and the rest are normal cells. The operation space between nodes has eight candidate operations: 'none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'dil_conv_3x3', 'sep_conv_5x5', and 'dil_conv_5x5'.
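As a rough illustration of this cell layout, the sketch below places reduction cells at one-third and two-thirds of the depth; the helper names make_normal_cell and make_reduction_cell are hypothetical placeholders, not the authors' API.

def build_cells(num_cells, make_normal_cell, make_reduction_cell):
    """Reduction cells at 1/3 and 2/3 of the depth; all other cells are normal."""
    cells = []
    for i in range(num_cells):
        if i in (num_cells // 3, 2 * num_cells // 3):
            cells.append(make_reduction_cell())
        else:
            cells.append(make_normal_cell())
    return cells
# e.g. num_cells = 8 for the search network and 20 for the evaluation network.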
2) Experimental methods: The search structure and the evaluation structure are shown in Fig. 7 (a) and Fig. 7 (b). In the searching stage, eight cells are stacked into a network, and a number of epochs are run on CIFAR-10 to search the cell structures. The searched normal cells and reduction cells are then used to generate an evaluation structure by stacking again. Then, the network parameters are trained to test its performance.

Fig. 7: The network structures used in searching and training, based on normal cells and reduction cells: (a) the searching network, (b) the evaluation network.

3) Parameter setting: The number of epochs for the network search was set to 80, the batch size was 96, and cross-entropy loss was used to calculate L_train and L_val. The SGD optimizer was used to update the network parameters ω; the initial learning rate was 0.025, the momentum was 0.9, and the weight decay was 0.0003. The learning rate was gradually decreased to 0 by using a cosine annealing strategy. The Adam optimizer was used to update the network architecture parameters α with an initial learning rate of 0.0006, a momentum of (0.5, 0.999), and a weight decay of 0.001.
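These search-stage hyperparameters map directly onto standard PyTorch optimizers. The sketch below is written under that assumption; the placeholder parameter lists are illustrative.

import torch

model_weights = [torch.nn.Parameter(torch.randn(8, 8))]     # placeholder for the network weights omega
arch_parameters = [torch.nn.Parameter(torch.zeros(14, 8))]  # placeholder for the architecture parameters alpha

# Network weights: SGD with a cosine-annealed learning rate over the 80 search epochs.
w_optimizer = torch.optim.SGD(model_weights, lr=0.025, momentum=0.9, weight_decay=3e-4)
w_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(w_optimizer, T_max=80, eta_min=0.0)

# Architecture parameters: Adam, with betas corresponding to the reported momentum (0.5, 0.999).
alpha_optimizer = torch.optim.Adam(arch_parameters, lr=6e-4, betas=(0.5, 0.999), weight_decay=1e-3)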
which greatly reduces the feature information extracted by
C. Experimental Results and Analysis the network, and results in low performance. When partial
1) Analysis of searched cells: The searched cell structures connection method is added to DARTS, the number of
by ADARTS are shown in Fig. 8. Compared with final feature channels in the operation space is reduced, which can
architecture in DARTS, the number of skip connections in weaken the unfair competition in the early stage and reduce
the network architecture obtained by ADARTS is less and the accumulation of advantages of weight-free operations.
more stable, which can effectively alleviate the problems It can also find that the number of skip connections of
of gradient disappearance and network degradation. At the DARTS with partial connection is less than that of DARTS.
same time, more useful feature information can be extracted However, the random channel selection also leads to the
to enhance the performance of the network. instability of search, so the channel attention mechanism
2) Evaluation on CIFAR-10 and CIFAR-100: Twenty can select more important channels and strengthen feature
cells are stacked into a new network for performance eval- information to make the search process more stable. Only
uation and the structure of the evaluation network is shown in the middle and late stages of the search, the network
in Fig. 7 (b). The network was trained for 600 epochs with a allows a few skip connections to alleviate the problem of
batchsize of 64. SGD optimizer was used to update network gradient disappearance. All in all, ADARTS can effectively
parameters with a momentum of 0.9 and a weight decay of solve the problem of too many skip connections and obtain
0.0003. The initial learning rate was 0.017. In addition, the more stable and better network structures.
cosine annealing strategy was used to decrease the learning 2) Analysis of channel proportion 1/K: To verify the
rate to 0. The cutout length of 16, auxiliary weight of 0.4, influence of the K value on memory usage and classification
and drop path probability of 0.3 were applied to prevent accuracy, and find the best K value, we set different values
overfitting. of K={1,2,4,8,16} for experimental analysis since the initial
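Cutout, used above with a length of 16, masks one random square patch of each training image to zero. A minimal sketch of such a transform for a CHW image tensor, assuming PyTorch and an illustrative class name:

import torch

class Cutout:
    """Zero out one random length x length square patch of a CHW image tensor."""
    def __init__(self, length=16):
        self.length = length

    def __call__(self, img):
        _, h, w = img.shape
        cy = torch.randint(h, (1,)).item()
        cx = torch.randint(w, (1,)).item()
        y1, y2 = max(0, cy - self.length // 2), min(h, cy + self.length // 2)
        x1, x2 = max(0, cx - self.length // 2), min(w, cx + self.length // 2)
        img = img.clone()
        img[:, y1:y2, x1:x2] = 0.0
        return img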
We compare the accuracy of ADARTS on CIFAR-10 and CIFAR-100 with some hand-designed networks, RL-based, evolution-based, and other gradient-based search methods. The results are given in Table I. We can see from this table that ADARTS achieved a 2.46% classification error on CIFAR-10 and a 17.06% classification error on CIFAR-100. More importantly, ADARTS only uses 0.2 GPU days, which is much faster than NASNet-A and AmoebaNet-B, and also faster than other gradient-based search methods. Compared with the baseline DARTS, ADARTS speeds up the search while also improving the classification accuracy by 0.3% on CIFAR-10 and 0.48% on CIFAR-100. Compared with GDAS, ADARTS achieves better classification accuracy within the same search time. Compared with PCDARTS, ADARTS reduces the instability of selecting channels randomly, so it achieves better performance. Also, ADARTS has 2.9 million parameters, which is fewer than most of the other algorithms. This indicates that ADARTS achieves better classification accuracy and faster search speed. Moreover, the network structure searched on CIFAR-10 is transferred to CIFAR-100 for testing and performs well, indicating that the searched network architecture has strong robustness.

TABLE I: Classification error rate of ADARTS on CIFAR-10 and CIFAR-100.

Architecture | Params (M) | Search Cost (GPU days) | Test Error CIFAR-10 (%) | Test Error CIFAR-100 (%) | Search Method
ResNet (depth=110) [11] | 1.7 | - | 6.43 | 25.16 | manual
ResNet (depth=1202) [11] | 10.2 | - | 7.93 | 27.82 | manual
DenseNet-BC [28] | 25.6 | - | 3.46 | 17.18 | manual
VGG [10] | 20.1 | - | 6.66 | 28.05 | manual
MobileNetV2 [29] | 2.2 | - | 4.26 | 19.20 | manual
Genetic CNN [14] | - | 17 | 7.10 | 29.05 | evolution
AmoebaNet-B [30] | 2.8 | 3150 | 2.55 | - | evolution
Hierarchical Evolution [31] | 15.7 | 300 | 3.75 | - | evolution
CARS [21] | 2.4 | 0.4 | 3.00 | - | evolution
ENAS [32] | 4.6 | 0.5 | 2.89 | 19.43 | RL
NASNet-A [13] | 3.3 | 1800 | 3.41 | - | RL
NASNet-A + cutout [13] | 3.3 | 1800 | 2.65 | - | RL
SMASH [33] | 16 | 1.5 | 4.03 | - | RL
DARTS (first order) + cutout [22] | 3.3 | 1.5 | 3.00 | 17.76 | gradient-based
DARTS (second order) + cutout [22] | 3.3 | 4.0 | 2.76 | 17.54 | gradient-based
GDAS [25] | 3.4 | 0.2 | 3.87 | 19.68 | gradient-based
GDAS + cutout [25] | 3.4 | 0.2 | 2.93 | 19.38 | gradient-based
SNAS + cutout [34] | 2.8 | 1.5 | 2.85 | 17.55 | gradient-based
P-DARTS + cutout [24] | 3.4 | 0.3 | 2.50 | 17.20 | gradient-based
PCDARTS + cutout [26] | 3.6 | 0.1 | 2.57 | - | gradient-based
FairDARTS + cutout [23] | 2.8 | 0.4 | 2.54 | 17.61 | gradient-based
SDARTS-ADV + cutout [35] | 3.3 | 1.3 | 2.61 | 16.73 | gradient-based
EoiNAS + cutout [36] | 3.4 | 0.6 | 2.50 | 17.30 | gradient-based
ADARTS | 2.9 | 0.2 | 3.70 | 18.21 | gradient-based
ADARTS + cutout | 2.9 | 0.2 | 2.46 | 17.06 | gradient-based

D. Ablation Studies

1) Analysis of skip connection: In DARTS, too many skip connections lead to low network performance, which is caused by unfair competition between operations. The parameters of the weight-equipped operations are not trained well in the early stage of the search, and the weight-free operations accumulate too many advantages. There will be more and more skip connections in the network as the number of training epochs increases. The number of skip connections is shown in Fig. 9. In this figure, we can see that four skip connections appeared when DARTS had searched for 50 epochs, which greatly reduces the feature information extracted by the network and results in low performance. When the partial connection method is added to DARTS, the number of feature channels in the operation space is reduced, which can weaken the unfair competition in the early stage and reduce the accumulation of advantages of weight-free operations. It can also be found that the number of skip connections of DARTS with partial connection is less than that of DARTS. However, random channel selection also leads to instability of the search, so the channel attention mechanism can select more important channels and strengthen feature information to make the search process more stable. Only in the middle and late stages of the search does the network allow a few skip connections to alleviate the problem of gradient disappearance. All in all, ADARTS can effectively solve the problem of too many skip connections and obtain more stable and better network structures.

Fig. 9: The number of skip connections in the cells searched on CIFAR-10 by ADARTS: (a) skip connections in the normal cell, (b) skip connections in the reduction cell.
2) Analysis of channel proportion 1/K: To verify the influence of the K value on memory usage and classification accuracy, and to find the best K value, we set different values of K = {1, 2, 4, 8, 16} for experimental analysis, since the initial number of channels of the input data is 16. Fig. 10 shows the experimental results with different K values. With the increase of K, the memory usage decreases continuously, while the classification accuracy of the network first improves and then decreases. When K is set to 4, the classification accuracy of the searched network structure is the highest, and the memory usage decreases greatly, which speeds up the search. This shows that partial channel connection based on the attention mechanism can more effectively search network structures with better performance. However, the number of channels entering the search space should not be too small, otherwise the network cannot get enough feature information during training and thus cannot obtain a good network structure.

Fig. 10: The memory usage and classification accuracy with different K values.

3) Analysis of methods of ADARTS: To verify the effectiveness of the attention mechanism and partial connection on architecture searching, we use DARTS with random partial connection, DARTS with attention mechanism, and ADARTS to search on CIFAR-10, and test the searched network structures on CIFAR-10 and CIFAR-100. The experimental results are shown in Table II. It is obvious that the search cost is greatly reduced through partial channel connection. In addition, the test classification errors also decrease on CIFAR-10/CIFAR-100 when using the attention mechanism or partial channel connection. Moreover, ADARTS, which combines the attention mechanism and partial channel connection, achieves the smallest number of parameters, a fast search speed, and the best classification accuracy among all the methods. This shows that ADARTS can speed up the search process and improve the performance of the searched networks.

TABLE II: The ablation studies on CIFAR-10 and CIFAR-100.

Method | Params (M) | Search Cost (GPU days) | Test Error CIFAR-10 | Test Error CIFAR-100
DARTS | 3.3 | 4.0 | 2.76% | 17.54%
DARTS + random partial connection | 3.4 | 0.1 | 2.71% | 17.42%
DARTS + attention mechanism | 3.1 | 0.6 | 2.67% | 17.47%
ADARTS | 2.9 | 0.2 | 2.46% | 17.06%

4) Analysis of batchsize: Because increasing the batch size reduces the number of parameter update steps in mini-batch gradient descent, the learning rate should be appropriately increased [37], [38]. Thus, we set several groups of learning rate and batch size to verify the effect of increasing the batch size appropriately. The experimental results are shown in Table III. When the learning rate is set to 0.01, the accuracy of ADARTS increases as the batch size increases from 8 to 16. Besides, when the learning rate is set to 0.025, the accuracy of ADARTS increases as the batch size increases from 32 to 96. Moreover, a too small batch size leads to a significant decrease in the stability and accuracy of ADARTS. Therefore, increasing the batch size appropriately can improve the performance of ADARTS.

TABLE III: The classification accuracy of ADARTS with various learning rates and batch sizes.

Method | lr=0.01, batchsize=8 | lr=0.01, batchsize=16 | lr=0.025, batchsize=32 | lr=0.025, batchsize=64 | lr=0.025, batchsize=96
ADARTS | 96.89% | 97.11% | 97.36% | 97.51% | 97.54%
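The trend in Table III matches the common linear scaling heuristic from [37], [38]: when the batch size grows by some factor, the learning rate is increased by roughly the same factor. A small illustrative calculation (the heuristic itself is from the cited works, not a result of this paper):

def scaled_lr(base_lr, base_batchsize, batchsize):
    """Linear scaling heuristic: the learning rate grows proportionally with the batch size [37], [38]."""
    return base_lr * batchsize / base_batchsize

# Doubling the batch size from 16 to 32 suggests roughly doubling the learning
# rate (0.01 -> 0.02), close to the 0.025 that Table III pairs with batch sizes 32-96.
print(scaled_lr(0.01, 16, 32))   # 0.02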

V. CONCLUSION

In this paper, ADARTS is proposed to solve the performance crash caused by too many weight-free operations in the late search stage and the instability of the searched structure caused by low memory utilization. Channel attention and partial channel connection are used to solve these problems. The channels with higher weights in the channel attention weight are sent to the operation space. These methods can reduce the computation in the operation space to speed up the search and improve the stability of the searched network structure. Besides, the number of weight-free operations in the structure can be controlled so that the searched network can perform well. The algorithm is applied to CIFAR-10 and CIFAR-100 for image classification, and the experiments show that ADARTS can search out network architectures quickly and stably. In future work, we will further enhance the stability of channel selection and reduce redundant feature information. We are considering introducing some typical feature selection techniques to feature map selection problems in deep neural networks. In addition, we will conduct the proposed method on some industrial datasets for network search, and use the obtained network structures in facial recognition, target detection, semantic recognition, etc.

REFERENCES

[1] Y. Sun, J. Xu, G. Lin, W. Ji, and L. Wang, "RBF neural network-based supervisor control for maglev vehicles on an elastic track with network time delay," IEEE Transactions on Industrial Informatics, vol. 18, no. 1, pp. 509-519, 2022.
[2] A. Slowik, "Application of an adaptive differential evolution algorithm with multiple trial vectors to artificial neural network training," IEEE Transactions on Industrial Electronics, vol. 58, no. 8, pp. 3160-3167, 2010.

[3] Y. Deng, T. Zhang, G. Lou, X. Zheng, J. Jin, and Q.-L. Han, "Deep learning-based autonomous driving systems: a survey of attacks and defenses," IEEE Transactions on Industrial Informatics, vol. 17, no. 12, pp. 7897-7912, 2021.
[4] R. C. Luo and C.-J. Chen, "Recursive neural network based semantic navigation of an autonomous mobile robot through understanding human verbal instructions," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 1519-1524.
[5] R. C. Luo, H.-C. Lin, and Y.-T. Hsu, "CNN based reliable classification of household chores objects for service robotics applications," in 2019 IEEE 17th International Conference on Industrial Informatics (INDIN), vol. 1, 2019, pp. 547-552.
[6] S. M. J. Jalali, S. Ahmadian, A. Khosravi, M. Shafie-khah, S. Nahavandi, and J. P. S. Catalão, "A novel evolutionary-based deep convolutional neural network model for intelligent load forecasting," IEEE Transactions on Industrial Informatics, vol. 17, no. 12, pp. 8243-8253, 2021.
[7] H. Jahangir, H. Tayarani, S. S. Gougheri, M. A. Golkar, A. Ahmadian, and A. Elkamel, "Deep learning-based forecasting approach in smart grids with microclustering and bidirectional LSTM network," IEEE Transactions on Industrial Electronics, vol. 68, no. 9, pp. 8298-8309, 2021.
[8] H. Yu, R. Jin, and S. Yang, "On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization," in International Conference on Machine Learning, 2019, pp. 7184-7193.
[9] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations, 2015.
[10] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in International Conference on Learning Representations, 2015.
[11] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[12] X. Liu, J. Zhao, J. Li, B. Cao, and Z. Lv, "Federated neural architecture search for medical data security," IEEE Transactions on Industrial Informatics, vol. 18, no. 8, pp. 5628-5636, 2022.
[13] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8697-8710.
[14] L. Xie and A. Yuille, "Genetic CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1379-1388.
[15] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," in International Conference on Learning Representations, 2017.
[16] B. Baker, O. Gupta, N. Naik, and R. Raskar, "Designing neural network architectures using reinforcement learning," arXiv preprint arXiv:1611.02167, 2016.
[17] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin, "Large-scale evolution of image classifiers," in International Conference on Machine Learning, 2017, pp. 2902-2911.
[18] T. Elsken, J. H. Metzen, and F. Hutter, "Efficient multi-objective neural architecture search via lamarckian evolution," in International Conference on Learning Representations, 2018.
[19] Y. Sun, B. Xue, M. Zhang, and G. G. Yen, "Evolving deep convolutional neural networks for image classification," IEEE Transactions on Evolutionary Computation, vol. 24, no. 2, pp. 394-407, 2019.
[20] Y. Sun, B. Xue, M. Zhang, and G. G. Yen, "Completely automated CNN architecture design based on blocks," IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 4, pp. 1242-1254, 2019.
[21] Z. Yang, Y. Wang, X. Chen, B. Shi, C. Xu, C. Xu, Q. Tian, and C. Xu, "CARS: Continuous evolution for efficient neural architecture search," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 1829-1838.
[22] H. Liu, K. Simonyan, and Y. Yang, "DARTS: Differentiable architecture search," in International Conference on Learning Representations, 2018.
[23] X. Chu, T. Zhou, B. Zhang, and J. Li, "Fair DARTS: Eliminating unfair advantages in differentiable architecture search," in European Conference on Computer Vision, 2020, pp. 465-480.
[24] X. Chen, L. Xie, J. Wu, and Q. Tian, "Progressive differentiable architecture search: Bridging the depth gap between search and evaluation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1294-1303.
[25] X. Dong and Y. Yang, "Searching for a robust neural architecture in four GPU hours," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1761-1770.
[26] Y. Xu, L. Xie, X. Zhang, X. Chen, G.-J. Qi, Q. Tian, and H. Xiong, "PC-DARTS: Partial channel connections for memory-efficient architecture search," arXiv preprint arXiv:1907.05737, 2019.
[27] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, "CBAM: Convolutional block attention module," in Proceedings of the European Conference on Computer Vision, 2018, pp. 3-19.
[28] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700-4708.
[29] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510-4520.
[30] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, "Regularized evolution for image classifier architecture search," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 4780-4789.
[31] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu, "Hierarchical representations for efficient architecture search," in International Conference on Learning Representations, 2018.
[32] H. Pham, M. Guan, B. Zoph, Q. Le, and J. Dean, "Efficient neural architecture search via parameters sharing," in International Conference on Machine Learning, 2018, pp. 4095-4104.
[33] A. Brock, T. Lim, J. Ritchie, and N. Weston, "SMASH: One-shot model architecture search through hypernetworks," in International Conference on Learning Representations, 2018.
[34] S. Xie, H. Zheng, C. Liu, and L. Lin, "SNAS: Stochastic neural architecture search," in International Conference on Learning Representations, 2018.
[35] X. Chen and C.-J. Hsieh, "Stabilizing differentiable architecture search via perturbation-based regularization," in International Conference on Machine Learning, 2020, pp. 1554-1565.
[36] Y. Zhou, X. Xie, and S.-Y. Kung, "Exploiting operation importance for differentiable neural architecture search," IEEE Transactions on Neural Networks and Learning Systems, 2021.
[37] P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, "Accurate, large minibatch SGD: Training ImageNet in 1 hour," arXiv preprint arXiv:1706.02677, 2018.
[38] A. Krizhevsky, "One weird trick for parallelizing convolutional neural networks," arXiv preprint arXiv:1404.5997, 2014.
