
Large Scale Incremental Learning

Yue Wu¹, Yinpeng Chen², Lijuan Wang², Yuancheng Ye³, Zicheng Liu², Yandong Guo², Yun Fu¹

¹ Northeastern University    ² Microsoft Research    ³ City University of New York
{yuewu,yunfu}@ece.neu.edu, yye@gradcenter.cuny.edu
{yiche,lijuanw,zliu}@microsoft.com, yandong.guo@live.com

Abstract

Modern machine learning suffers from catastrophic forgetting when learning new classes incrementally. The performance dramatically degrades due to the missing data of old classes. Incremental learning methods have been proposed to retain the knowledge acquired from the old classes, by using knowledge distilling and keeping a few exemplars from the old classes. However, these methods struggle to scale up to a large number of classes. We believe this is because of the combination of two factors: (a) the data imbalance between the old and new classes, and (b) the increasing number of visually similar classes. Distinguishing between an increasing number of visually similar classes is particularly challenging when the training data is unbalanced. We propose a simple and effective method to address this data imbalance issue. We found that the last fully connected layer has a strong bias towards the new classes, and this bias can be corrected by a linear model. With two bias parameters, our method performs remarkably well on two large datasets: ImageNet (1000 classes) and MS-Celeb-1M (10000 classes), outperforming the state-of-the-art algorithms by 11.1% and 13.2% respectively.

Figure 1. Performance degradation of incremental learning algorithms on ImageNet-100 (100 classes) and ImageNet-1000 (1000 classes). Each dataset has 10 incremental steps. The degradation is the gap between the accuracy of the final incremental step and the accuracy of a non-incremental classifier, which is trained using all data. When the scale goes up (from ImageNet-100 to ImageNet-1000), the degradation for the state-of-the-art algorithms (iCaRL [19] and EEIL [2]) increases. The degradation for our BiC method is small for both scales. Although iCaRL has a relative degradation similar to our method (an increase of about 50% from ImageNet-100 to ImageNet-1000), it performs poorly at both scales.

1. Introduction

Natural learning systems are inherently incremental: new knowledge is continuously learned over time while existing knowledge is maintained [19, 13]. Many computer vision applications in the real world require incremental learning capabilities. For example, a face recognition system should be able to add new persons without forgetting the faces already learned. However, most deep learning approaches suffer from catastrophic forgetting [15]: a significant performance degradation when the past data are not available.

The missing data for old classes introduce two challenges: (a) maintaining the classification performance on old classes, and (b) balancing between old classes and new classes. Distillation [13, 19, 2] has been used to effectively address the former challenge. Recent studies [19, 2] also show that selecting a few exemplars from the old classes can alleviate the imbalance problem. These methods perform well on small datasets. However, they suffer from a significant performance degradation when the number of classes becomes large (e.g. thousands of classes). Fig. 1 demonstrates the performance degradation of these state-of-the-art algorithms, using a non-incremental classifier as the reference. When the number of classes increases from 100 to 1000, both iCaRL [19] and EEIL [2] degrade more.

Why is it more challenging to handle a large number of classes for incremental learning? We believe this is due to the coupling of two factors. First, the training data are unbalanced. Secondly, as the number of classes increases, it is more likely to have visually similar classes (e.g. multiple dog classes in ImageNet) across different incremental steps. Under the incremental constraint with data imbalance,
the increasing number of visually similar classes is particularly challenging, since the small margin around the boundary between classes is too sensitive to the data imbalance. The boundary is pushed to favor classes with more samples.

In this work, we present a method to address the data imbalance problem in large scale incremental learning. Firstly, we found a strong bias towards the new classes in the classifier layer (i.e. the last fully connected layer) of the convolutional neural network (CNN). Based upon this finding, we propose a simple and effective method, called BiC (bias correction), to correct the bias. We add a bias correction layer after the last fully connected (FC) layer (shown in Fig. 2), which is a simple linear model with two parameters. The bias correction layer is learned at the second stage, after learning the convolution layers and FC layer at the first stage. The data, including exemplars from the old classes and samples from the new classes, are split into a training set for the first stage and a validation set for the second stage. The validation set helps to approximate the real distribution of both old and new classes in the feature space, allowing us to estimate the bias in the FC layer. We found that the bias can be effectively corrected with a small validation set.

Our BiC method achieves remarkably good performance, especially on large scale datasets. The experimental results show that our method outperforms state-of-the-art algorithms (iCaRL [19] and EEIL [2]) on two large datasets (ImageNet ILSVRC 2012 and MS-Celeb-1M) by a large margin. Our BiC method gains 11.1% on ImageNet and 13.2% on MS-Celeb-1M, respectively.

Figure 2. Overview of our BiC method. The exemplars from the old classes and the samples of the new classes are split into training and validation sets. The training set is used to train the convolution layers and FC layer (in stage 1). The validation set is used for bias correction (in stage 2).

2. Related Work

Incremental learning has been a long standing problem in machine learning [3, 17, 16, 12]. Before deep learning took off, people had been developing incremental learning techniques by leveraging linear classifiers, ensembles of weak classifiers, nearest neighbor classifiers, etc. Recently, thanks to the exciting progress in deep learning, there has been a lot of research on incremental learning with deep neural network models. The work can be roughly divided into three categories depending on whether it requires real data, synthetic data, or nothing from the old classes.

Without using old data: Methods in the first category do not require any old data. [9] presented a method for domain transfer learning. They try to maintain the performance on old tasks by freezing the final layer and discouraging the change of shared weights in feature extraction layers. [10] proposed a technique to remember old tasks by constraining the important weights when optimizing a new task. One limitation of this approach is that the old and new tasks may conflict on these important weights. [13] presented a method that applies knowledge distillation [8] to maintain the performance on old tasks. [13] separated the old and new tasks in multi-task learning, which is different from learning a classifier incrementally. [23] applied knowledge distillation for learning object detectors incrementally. [18] utilized an autoencoder to retain the knowledge from old tasks. [25, 26] updated a knowledge dictionary for new tasks and kept dictionary coefficients for old tasks.

Using synthetic data: Both [22] and [27] employed GANs [4] to replay synthetic data for old tasks. [22] applied a cross entropy loss on synthetic data with the old solver's response as the target. [27] utilized a root mean-squared error for learning the response of old tasks on synthetic data. [22, 27] highly depend on the capability of generative models and struggle with complex objects and scenes.

Using exemplars from old data: Methods in the third category require part of the old data. [19] proposed a method to select a small number of exemplars from each old class. [2] keeps classifiers for all incremental steps and uses them for distillation. It introduces balanced fine-tuning and temporary distillation to alleviate the imbalance between the old and new classes. [14] proposed a continuous learning framework where the training samples for different tasks are used one by one during training. It constrains the cross entropy loss on softmax outputs of old tasks when the new task comes. [28] proposed a training method that grows a network hierarchically as new training data are added. Similarly, [21] increases the number of layers in the network to handle newly coming data.

Our BiC method belongs to the third category; we keep exemplars from the old classes in a similar manner to [19, 2]. However, we handle the data imbalance differently. We first locate a strong bias in the classifier layer (the last fully connected layer), and then apply a linear model to correct the bias using a small validation set. The validation set is a small subset of exemplars which is excluded from training and used for bias correction alone. Compared with the state of the art ([19, 2]), our BiC method is more effective on large datasets with 1000+ classes.
Figure 3. Diagram of the baseline solution using distillation. It contains two losses: the distilling loss on old classes and the softmax cross-entropy loss on all old and new classes.

3. Baseline: Incremental Learning using Knowledge Distillation

In this section, we introduce a baseline solution for incremental learning using knowledge distillation [13]. This corresponds to the first stage in Fig. 2. For an incremental step with n old classes and m new classes, we learn a new model to perform classification on n + m classes, by using knowledge distillation from an old model that classifies the old n classes (illustrated in Fig. 3). The new model is learned by using a distilling loss and a classification loss.

Let us denote the samples of the new classes as X^m = {(x_i, y_i), 1 ≤ i ≤ M, y_i ∈ [n+1, ..., n+m]}, where M is the number of new samples, and x_i and y_i are the image and the label, respectively. The selected exemplars from the old n classes are denoted as X̂^n = {(x̂_j, ŷ_j), 1 ≤ j ≤ N_s, ŷ_j ∈ [1, ..., n]}, where N_s is the number of selected old images (N_s/n ≪ M/m). Let us also denote the output logits of the old and new classifiers as ô^n(x) = [ô_1(x), ..., ô_n(x)] and o^{n+m}(x) = [o_1(x), ..., o_n(x), o_{n+1}(x), ..., o_{n+m}(x)] respectively. The distilling loss is formulated as follows:

L_d = \sum_{x \in \hat{X}^n \cup X^m} \sum_{k=1}^{n} -\hat{\pi}_k(x) \log[\pi_k(x)],    (1)

\hat{\pi}_k(x) = \frac{e^{\hat{o}_k(x)/T}}{\sum_{j=1}^{n} e^{\hat{o}_j(x)/T}}, \qquad \pi_k(x) = \frac{e^{o_k(x)/T}}{\sum_{j=1}^{n} e^{o_j(x)/T}},

where T is the temperature scalar. The distilling loss is computed for all samples from the new classes and exemplars from the old classes (i.e. X̂^n ∪ X^m).

We use the softmax cross entropy as the classification loss, which is computed as follows:

L_c = \sum_{(x,y) \in \hat{X}^n \cup X^m} \sum_{k=1}^{n+m} -\delta_{y=k} \log[p_k(x)],    (2)

where \delta_{y=k} is the indicator function and p_k(x) is the output probability (i.e. softmax of logits) of the k-th class among the n + m old and new classes.

The overall loss combines the distilling loss and the classification loss as follows:

L = \lambda L_d + (1 - \lambda) L_c,    (3)

where the scalar \lambda is used to balance between the two terms. The scalar \lambda is set to n/(n+m), where n and m are the numbers of old and new classes. \lambda is 0 for the first batch, since all classes are new. For the extreme case where n ≫ m, \lambda is nearly 1, indicating the importance of maintaining the old classes.
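Since this baseline is the foundation that BiC builds on, here is a minimal PyTorch-style sketch of the combined objective in Eqs. 1-3. The paper's own implementation uses TensorFlow; the function name, tensor layout, and the batch-mean reduction below are our assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def baseline_loss(old_logits, new_logits, labels, n_old, n_new, T=2.0):
    """old_logits: [B, n_old]        logits of the frozen old model (o_hat)
       new_logits: [B, n_old+n_new]  logits of the new model (o)
       labels:     [B]               ground-truth class indices (0-based)"""
    # Classification loss (Eq. 2): softmax cross-entropy over all n+m classes.
    loss_c = F.cross_entropy(new_logits, labels)
    if n_old == 0:                      # first batch: all classes are new, lambda = 0
        return loss_c
    # Distilling loss (Eq. 1): soften both models' logits on the old n classes
    # with temperature T, then take the cross-entropy between the two.
    pi_hat = F.softmax(old_logits / T, dim=1)
    log_pi = F.log_softmax(new_logits[:, :n_old] / T, dim=1)
    loss_d = -(pi_hat * log_pi).sum(dim=1).mean()
    # Overall loss (Eq. 3) with lambda = n / (n + m).
    lam = n_old / float(n_old + n_new)
    return lam * loss_d + (1.0 - lam) * loss_c
```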

4. Diagnosis: FC Layer is Biased

The baseline model has a bias towards the new classes, due to the imbalance between the number of samples from the new classes and the number of exemplars from the old classes. Our hypothesis is that the last fully connected layer is biased because its weights are not shared across classes. To validate this hypothesis, we design an experiment on the CIFAR-100 dataset with five incremental batches (each has 20 classes).

First, we train a set of incremental classifiers using the baseline method. The classification accuracy quickly drops as more incremental steps arrive (shown as the bottom curve in Fig. 4-(a)). For the last incremental step (classes 81-100), we observe a strong bias towards the newest 20 classes in the confusion matrix (Fig. 4-(b)). Compared to the upper bound, i.e. the classifiers learned using all training data (the top curve in Fig. 4-(a)), the baseline model has a clear performance degradation.

Then, we conduct another experiment to evaluate whether the fully connected layer is heavily biased. This experiment has two steps for each incremental batch: (a) applying the baseline model to learn both the feature and fully connected layers, (b) freezing the feature layers and retraining the fully connected layer alone using all training samples from both old and new classes. Compared to the baseline, the accuracy improves (the second curve from the top in Fig. 4-(a)). The accuracy of the final classifier on 100 classes improves by 20%. These results validate our hypothesis that the fully connected layer is heavily biased. We also observe the gap between this result and the upper bound, which reflects the bias within the feature layers. In this paper, we focus on correcting the bias in the fully connected layer.

Figure 4. Experimental results on CIFAR-100 with a split of 20 classes to validate the bias in the last FC layer. (a) classification accuracy curves for the baseline, our bias correction (BiC), retraining the FC layer using all data, and training the whole network using all data (from bottom to top). (b) confusion matrix of the incremental classifier from 80 classes to 100 classes without bias removal. (Best viewed in color)

5. Bias Correction (BiC) Method

Based upon our finding that the fully connected layer is heavily biased, we propose a simple and effective bias correction method (BiC). Our method includes two stages in training (shown in Fig. 2). Firstly, we train the convolution layers and the fully connected layer by following the baseline method. At the second stage, we freeze both the convolution and the fully connected layers, and estimate two bias parameters by using a small validation set. In this section, we discuss how the validation set is generated and the details of the bias correction layer.

5.1. Validation Set

We estimate the bias by using a small validation set. The basic idea is to exclude the validation set from training the feature representation, allowing it to reflect the unbiased distribution of both old and new classes in the feature space (shown in Fig. 5). Therefore, we split the exemplars from the old classes and the samples from the new classes into a training set and a validation set. The training set is used to learn the convolution and fully connected layers (see Fig. 2), while the validation set is used for the bias correction.

Figure 5. Diagram of bias correction. Since the number of exemplars from old classes is small, they have narrow distributions in the feature space. This causes the learned classifier to prefer new classes. Validation samples, not involved in training the feature representation, may better reflect the unbiased distribution of both old and new classes in the feature space. Thus, we can use the validation samples to correct the bias. (Best viewed in color)

Fig. 2 illustrates the generation of the validation set. The stored exemplars from the old classes are split into a training subset (referred to as train_old) and a validation subset (referred to as val_old). The samples for the new classes are also split into a training subset (referred to as train_new) and a validation subset (referred to as val_new). train_old and train_new are used to learn the convolution and FC layers (see Fig. 2). val_old and val_new are used to estimate the parameters in the bias correction layer. Note that val_old and val_new are balanced.
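As a concrete illustration of this split, the sketch below divides a list of (image, label) pairs class by class. The helper name, the list-based data format, and the default 9:1 ratio (reported later in Section 6.1) are our assumptions, not the authors' code.

```python
import random
from collections import defaultdict

def split_train_val(samples, val_fraction=0.1, seed=0):
    """Split (image, label) pairs into train/val subsets, class by class."""
    by_class = defaultdict(list)
    for image, label in samples:
        by_class[label].append((image, label))
    train, val = [], []
    rng = random.Random(seed)
    for items in by_class.values():
        rng.shuffle(items)
        n_val = max(1, int(round(len(items) * val_fraction)))
        val.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, val

# Stage 1 trains the CNN and FC layer on train_old + train_new;
# stage 2 fits the bias parameters on val_old + val_new, which the
# paper keeps class-balanced between old and new classes, e.g.:
#   train_old, val_old = split_train_val(old_exemplars)
#   train_new, val_new = split_train_val(new_samples)
```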
5.2. Bias Correction Layer

The bias correction layer should be simple, with a small number of parameters, since val_old and val_new have a small size. Thus, we use a linear model (with two parameters) to correct the bias. This is achieved by adding a bias correction layer in the network (shown in Fig. 2). We keep the output logits for the old classes (1, ..., n) and apply a linear model to correct the bias on the output logits for the new classes (n + 1, ..., n + m) as follows:

q_k = \begin{cases} o_k, & 1 \le k \le n \\ \alpha o_k + \beta, & n+1 \le k \le n+m \end{cases}    (4)

where \alpha and \beta are the bias parameters for the new classes and o_k (defined in Section 3) is the output logit for the k-th class. Note that the bias parameters (\alpha, \beta) are shared by all new classes, allowing us to estimate them with a small validation set. When optimizing the bias parameters, the convolution and fully connected layers are frozen. The classification loss (softmax with cross entropy) is used to optimize the bias parameters as follows:

L_b = -\sum_{k=1}^{n+m} \delta_{y=k} \log[\mathrm{softmax}(q_k)].    (5)

We found that this simple linear model is effective in correcting the bias introduced in the fully connected layer.
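To make the second stage concrete, the following is a minimal PyTorch-style sketch of the bias correction layer (Eq. 4) and its optimization with Eq. 5 on the held-out validation split. The paper's implementation is in TensorFlow; the class and function names and the data-loader interface below are our assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiasCorrection(nn.Module):
    """Keeps old-class logits and applies q = alpha * o + beta to new-class logits (Eq. 4)."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, logits, n_old):
        old = logits[:, :n_old]                            # o_k for k <= n
        new = self.alpha * logits[:, n_old:] + self.beta   # alpha * o_k + beta
        return torch.cat([old, new], dim=1)                # corrected logits q_k

def fit_bias(bic, frozen_model, val_loader, n_old, epochs=1, lr=0.1):
    """Optimize (alpha, beta) with Eq. 5; conv and FC layers stay frozen.
    `val_loader` is assumed to yield the class-balanced val_old + val_new split."""
    opt = torch.optim.SGD(bic.parameters(), lr=lr)
    frozen_model.eval()
    for _ in range(epochs):
        for images, labels in val_loader:
            with torch.no_grad():
                logits = frozen_model(images)   # stage-1 model, no gradients
            loss = F.cross_entropy(bic(logits, n_old), labels)  # Eq. 5
            opt.zero_grad()
            loss.backward()
            opt.step()
    return bic
```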

6. Experiments

We compare our BiC method to the state-of-the-art methods on two large datasets (ImageNet ILSVRC 2012 [20] and MS-Celeb-1M [6]), and one small dataset (CIFAR-100 [11]). We also perform ablation experiments to analyze different components of our approach.

6.1. Datasets

We use all data in CIFAR-100 and ImageNet ILSVRC 2012 (referred to as ImageNet-1000), and randomly choose 10000 classes in MS-Celeb-1M (referred to as Celeb-10000). We follow the iCaRL benchmark protocol [19] to select exemplars. The total number of exemplars for the old classes is fixed. The details of these three datasets are as follows:

CIFAR-100: contains 60k 32 × 32 RGB images of 100 object classes. Each class has 500 training images and 100 testing images. The 100 classes are split into 5, 10, 20 and 50 incremental batches. 2,000 samples are stored as exemplars.

ImageNet-1000: includes 1,281,167 images for training and 50,000 images for validation. The 1000 classes are split into 10 incremental batches. 20,000 samples are stored as exemplars.

Celeb-10000: a random subset of 10,000 classes selected from the MS-Celeb-1M-base [5] face dataset, which has 20,000 classes. MS-Celeb-1M-base is a smaller yet nearly noise-free version of MS-Celeb-1M [6], which has nearly 100,000 classes with a total of 1.2 million aligned face images. For the randomly selected 10,000 classes, there are 293,052 images for training and 141,984 images for validation. The 10000 classes are split into 10 incremental batches (1000 classes per batch). 50,000 samples are stored as exemplars.

For our BiC method, the ratio of the train/validation split on the exemplars is 9:1 for CIFAR-100 and ImageNet-1000. This ratio is obtained from the ablation study (see Section 6.6). We change the split ratio to 4:1 on Celeb-10000, allowing at least one validation image to be kept per person.

6.2. Implementation Details

Our implementation uses TensorFlow [1]. We use an 18-layer ResNet [7] for ImageNet-1000 and Celeb-10000 and a 32-layer ResNet for CIFAR-100. The ResNet implementation is from the TensorFlow official models (https://github.com/tensorflow/models/tree/master/official/resnet). The training details for each dataset are listed as follows:

ImageNet-1000 and Celeb-10000: Each incremental training has 100 epochs. The learning rate is set to 0.1 and reduces to 1/10 of the previous learning rate after 30, 60, 80 and 90 epochs. The weight decay is set to 0.0001 and the batch size is 256. Image pre-processing follows the VGG pre-processing steps [24], including random cropping, horizontal flipping, aspect-preserving resizing and mean subtraction.

CIFAR-100: Each incremental training has 250 epochs. The learning rate starts from 0.1 and reduces to 0.01, 0.001 and 0.0001 after 100, 150 and 200 epochs, respectively. The weight decay is set to 0.0002 and the batch size is 128. Random cropping and horizontal flipping are adopted for data augmentation, following the original ResNet implementation [7].

For a fair comparison with iCaRL [19] and EEIL [2], we use the same networks, keep the same number of exemplars and follow the same protocols of splitting classes into incremental batches. We use the identical class order generated from the iCaRL implementation (https://github.com/srebuffi/iCaRL) for CIFAR-100 and ImageNet-1000. On Celeb-10000, the class order is randomly generated and identical for all comparisons. The temperature scalar T in Eq. 1 is set to 2, following [13, 2].
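The training schedules above can also be summarized as a short configuration sketch. This is a PyTorch-style illustration only (the authors use TensorFlow); `make_optimizer` and the `dataset` flag are hypothetical names, and momentum is left at the SGD default because the paper does not specify it.

```python
import torch

def make_optimizer(model, dataset="imagenet"):
    if dataset == "imagenet":
        # ImageNet-1000 / Celeb-10000: lr 0.1, decayed x0.1 at epochs
        # 30, 60, 80, 90 (100 epochs total), weight decay 1e-4.
        opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
        sched = torch.optim.lr_scheduler.MultiStepLR(
            opt, milestones=[30, 60, 80, 90], gamma=0.1)
    else:
        # CIFAR-100: lr 0.1 -> 0.01 -> 0.001 -> 0.0001 at epochs
        # 100, 150, 200 (250 epochs total), weight decay 2e-4.
        opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=2e-4)
        sched = torch.optim.lr_scheduler.MultiStepLR(
            opt, milestones=[100, 150, 200], gamma=0.1)
    return opt, sched
```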
6.3. Comparison on Large Datasets

In this section, we compare our BiC method with the state-of-the-art methods on two large datasets (ImageNet-1000 and Celeb-10000). The state-of-the-art methods include LwF [13], iCaRL [19] and EEIL [2]. All of them utilize knowledge distillation to prevent catastrophic forgetting. iCaRL and EEIL keep exemplars for old classes, while LwF does not use any old data.

Figure 6. Incremental learning results (accuracy %) on (a) ImageNet-1000 and (b) Celeb-10000. Both datasets have ten incremental batches. The Upper Bound result, shown in the last step, is obtained by training a non-incremental model using all training samples from all classes. (Best viewed in color)

The incremental learning results on ImageNet-1000 are shown in Table 1 and Figure 6-(a). Our BiC method outperforms both EEIL [2] and iCaRL [19] by a large margin. BiC has only a small gain over iCaRL for the first couple of incremental batches and is worse than EEIL in the first two increments. However, the gain of BiC increases as more incremental batches arrive. Regarding the final incremental classifier on all classes, our BiC method outperforms EEIL [2] and iCaRL [19] by 18.5% and 26.5% respectively. On average over 10 incremental batches, BiC outperforms EEIL [2] and iCaRL [19] by 11.1% and 19.7% respectively.

Note that the data imbalance increases as more incremental steps arrive. The reason is that the number of exemplars per old class decreases as the incremental step increases, since the total number of exemplars is fixed (following the fixed-memory protocol in EEIL [2] and iCaRL [19]). The gap between our BiC method and the other methods becomes wider as the incremental step increases and the data imbalance grows. This demonstrates the advantage of our BiC method.

We also observe that EEIL performs better for the second batch (even higher than the first batch) on ImageNet-1000. This is mostly due to the enhanced data augmentation (EDA) in EEIL, which is more effective for the first couple of incremental batches when the data imbalance is mild. EDA includes random brightness shift, contrast normalization, random cropping and horizontal flipping. In contrast, BiC only applies random cropping and horizontal flipping. EEIL [2] shows that EDA is effective for early incremental batches when the data imbalance is not severe. Even without the enhanced data augmentation, our BiC still outperforms EEIL by a large margin on ImageNet-1000 starting from the third batch.

The incremental learning results on Celeb-10000 are shown in Table 2 and Figure 6-(b).
Number of classes   100    200    300    400    500    600    700    800    900    1000
LwF [13]            90.0   77.0   68.0   59.5   52.5   49.5   46.5   43.0   40.5   39.0
iCaRL [19]          90.0   83.0   77.5   70.5   63.0   57.5   53.5   50.0   48.0   44.0
EEIL [2]            95.0   95.5   86.0   77.5   71.0   68.0   62.0   59.8   55.0   52.0
BiC (Ours)          94.1   92.5   89.6   89.1   85.7   83.2   80.2   77.5   75.0   73.2

Table 1. Incremental learning results (accuracy %) on the ImageNet-1000 dataset with an increment of 100 classes. LwF [13] does not use any exemplars from the old classes. iCaRL [19], EEIL [2] and our BiC method use the same amount of exemplars from the old classes. Note that the numbers for LwF, iCaRL and EEIL on ImageNet-1000 are estimated from the figures in the original papers. The best results are marked in bold.

Number of classes   1000    2000    3000    4000    5000    6000    7000    8000    9000    10000
iCaRL [19]          94.31   94.26   91.09   86.88   81.06   77.45   75.29   71.34   68.78   65.56
BiC (Ours)          95.90   96.65   96.68   96.16   95.43   94.45   93.35   91.90   90.18   87.98

Table 2. Incremental learning results (accuracy %) on the Celeb-10000 dataset with an increment of 1000 classes. iCaRL [19] and our BiC method use the same amount of exemplars from the old classes. The best results are marked in bold.

To the best of our knowledge, we have not seen any incremental learning method reporting results on 10,000 or more classes. The results for iCaRL are generated by applying its GitHub implementation to the Celeb-10000 dataset. For the first couple of incremental steps, our BiC method is only slightly better than iCaRL (< 3%). But from the third incremental step on, the gap becomes wider. At the last incremental step, BiC outperforms iCaRL by 22.4%. The average gain over 10 incremental batches is 13.2%.

These results demonstrate that our BiC method is more effective and robust in dealing with a large number of classes. As the number of classes increases, it is more frequent to have visually similar classes across different incremental batches with unbalanced data. This introduces a strong bias towards new classes and misclassifies the old classes that are visually similar. Our BiC method is able to effectively reduce this bias and improve the classification accuracy.

6.4. Comparison between Different Scales

In this section, we compare our BiC method with the state-of-the-art on two different scales of ImageNet. The small scale deals with 100 randomly selected classes (referred to as ImageNet-100), while the large scale involves all 1000 classes (referred to as ImageNet-1000). Both scales have 10 incremental batches. This follows the same protocol as EEIL [2] and iCaRL [19]. The results for ImageNet-1000 are the same as in the previous section.

Figure 7. Incremental learning results (accuracy %) on ImageNet-100 and ImageNet-1000. Both have ten incremental batches. The Upper Bound result, shown in the last step, is obtained by training a non-incremental model using all training samples from all classes. (Best viewed in color)

The incremental learning results on ImageNet-100 and ImageNet-1000 are shown in Fig. 7. Our BiC method outperforms the state-of-the-art at both scales in terms of the final incremental accuracy and the average incremental accuracy, and the gain for the large scale is bigger. We also compare the final incremental accuracy (the last step) to the upper bound, which is obtained by training a non-incremental model using all classes and their training data (shown at the last step in Fig. 7). Compared to the upper bound, our BiC method degrades 10.5% and 16.0% on ImageNet-100 and ImageNet-1000 respectively. However, EEIL [2] degrades 15.1% and 37.2%, and iCaRL [19] degrades 31.1% and 45.2%. Compared with EEIL [2] and iCaRL [19], which suffer more performance degradation from the small scale to the large scale, our BiC method is much more consistent. This demonstrates that BiC has a better capability to handle the large scale.

We are aware that BiC is behind EEIL [2] for the first three incremental batches on ImageNet-100. As explained in Section 6.3, this is mostly due to the enhanced data augmentation (EDA) in EEIL [2].

6.5. Comparison on a Small Dataset

We also compare our BiC method with the state-of-the-art algorithms on a small dataset, CIFAR-100 [11]. The incremental learning results with four different splits of 5, 10, 20 and 50 classes are shown in Fig. 8. Our BiC method has similar performance to iCaRL [19] and EEIL [2]. BiC is better on the splits of 50 and 20 classes, but is slightly behind EEIL on the splits of 10 and 5 classes. The margins are small for all splits.

Figure 8. Incremental learning results on CIFAR-100 with splits of (a) 5 classes, (b) 10 classes, (c) 20 classes and (d) 50 classes. The Upper Bound result, shown in the last step, is obtained by training a non-incremental model using all training samples for all classes. (Best viewed in color)

Although our method focuses on large scale incremental learning, it is also compelling on the small scale. Note that EEIL has more data augmentation, such as brightness augmentation and contrast normalization, which are not utilized in LwF, iCaRL or BiC.

6.6. Ablation Study

We now analyze the components of our BiC method and demonstrate their impact. The ablation study is performed on CIFAR-100 [11] with an increment of 20 classes, as incremental learning on a large dataset is time consuming. The size of the stored exemplars from old classes is 2,000. In the following, we analyze (a) the impact of bias correction, (b) the split of the validation set, and (c) the sensitivity to exemplar selection.

The Impact of Bias Correction: We compare our BiC method with two baseline variations and the upper bound, to analyze the impact of bias correction. The baselines and the upper bound are explained as follows:
baseline-1: the model is trained using the classification loss alone (Eq. 2).
baseline-2: the model is trained using both the distilling loss and the classification loss (Eq. 3). Compared to baseline-1, the distilling loss is added.
BiC: the model is trained using both the distilling loss and the classification loss, with the bias correction.
upper bound: the model is first trained using both the distilling loss and the classification loss. Then, the feature layers are frozen and the classifier layer (i.e. the fully connected layer) is retrained using all training data (including the samples from the old classes that are not stored). Although it is infeasible to have all training samples from the old classes, this shows the upper bound for the bias correction in the fully connected layer.

The incremental learning results are shown in Table 3. With the help of the knowledge distillation, baseline-2 is slightly better than baseline-1 since it retains the classification capability on the old classes. However, both baseline-1 and baseline-2 have low accuracy on the final step when classifying all 100 classes (about 40%). This is mainly because of the data imbalance between the old and new classes. With the bias correction, BiC improves the accuracy on all incremental steps. The classification accuracy on the final step (100 classes) is boosted from 40.34% to 56.69%. This demonstrates that the bias is a big issue and our method is effective in addressing it. Furthermore, our method is close to the upper bound. The small gap (4.24%) between our approach (56.69%) and the upper bound (60.93%) shows the superiority of our method.

The confusion matrices of these four variations are shown in Fig. 9. Clearly, baseline-1 and baseline-2 suffer from the bias towards the new classes (strong confusions on the last 20 classes). BiC reduces the bias and has a confusion matrix similar to the upper bound.

These results validate our hypothesis that there exists a strong bias towards the new classes in the last fully connected layer. In addition, the results demonstrate that the proposed bias correction, using a linear model on a small validation set, is capable of correcting the bias.

The Split of Validation Set: We study the impact of different splits of the validation set (see Section 5.1). As illustrated in Fig. 2, our BiC splits the stored exemplars from the old classes into a training set (train_old) and a validation set (val_old). The samples from the new classes also have a train/val split (train_new and val_new). train_old and train_new are used to learn the convolution layers and the fully connected layer, while val_old and val_new are used to learn the bias correction layer. Note that val_old and val_new are balanced, having the same number of samples per class.

Since only a few exemplars (i.e. train_old ∪ val_old) are stored for the old classes, it is critical to find a good split that deals with the trade-off between training the feature representation and correcting the bias in the fully connected layer.

Table 4 shows the incremental learning results for four different splits of train_old : val_old. The split of 9:1 has the best classification accuracy for all four incremental steps.

Variations     cls loss   distilling loss   bias removal   FC retrain     20      40      60      80      100
baseline-1        X                                                     84.40   68.30   55.10   48.52   39.83
baseline-2        X             X                                       85.05   72.22   59.41   50.43   40.34
BiC (Ours)        X             X                X                      84.00   74.69   67.93   61.25   56.69
upper bound       X             X                               X       84.39   76.15   69.51   64.03   60.93

Table 3. Incremental learning results on CIFAR-100 with a batch of 20 classes. baseline-1 uses the classification loss alone. baseline-2 uses both the distilling loss and the classification loss. BiC corrects the bias in the FC layer of baseline-2. The upper bound retrains the last FC layer using all samples from both old and new classes after learning the model of baseline-2. The best results are marked in bold.

Figure 9. Confusion matrices of four different variations: (a) baseline-1, (b) baseline-2, (c) BiC, (d) upper bound. Both baseline-1 and baseline-2 have a strong bias towards new classes. BiC is able to remove most of the bias and has a confusion matrix similar to the upper bound. (Best viewed in color)

train_old : val_old     20      40      60      80      100
9:1                   84.00   74.69   67.93   61.25   56.69
8:2                   84.50   73.19   65.01   58.68   54.31
7:3                   84.70   71.60   63.68   58.12   53.74
6:4                   83.33   68.84   62.21   56.00   51.17

Table 4. Incremental learning results on CIFAR-100 with a batch of 20 classes for different training/validation splits on the exemplars from the old classes. The training set is used to learn the feature and classifier layers, and the validation set is used to learn the bias correction layer. The best results are marked in bold.

Exemplar selection      20      40      60      80      100
random                85.20   74.59   66.76   60.14   55.55
iCaRL [19]            84.00   74.69   67.93   61.25   56.69

Table 5. Incremental learning results on CIFAR-100 with a batch of 20 classes for different exemplar management strategies. The best results are marked in bold.

The column 20 refers to learning a classifier for the first 20 classes, without incremental learning. As the portion for the validation set increases, the performance drops consistently due to the lack of exemplars (from the old classes) to train the feature layers. A small validation set (1/10 of the exemplars) is good enough to estimate the bias parameters (α and β in Eq. 4). In this paper, we use the 9:1 split for all experiments except Celeb-10000. The 4:1 split is adopted on Celeb-10000, as each old class only has 5 exemplars at the last incremental step.

The Sensitivity of Exemplar Selection: We also study the impact of different exemplar management strategies. We compare two strategies: (a) random selection, and (b) the exemplar management strategy proposed by iCaRL [19]. iCaRL maintains the samples that are closest to the class center in the feature space. Both strategies store 2,000 exemplars from the old classes. The incremental learning results are shown in Table 5. The iCaRL exemplar management strategy performs slightly better than random selection. The gap is about 1%. This demonstrates that our method is not sensitive to the exemplar selection.

7. Conclusions

In this paper, we proposed a new method to address the imbalance issue in incremental learning, which is critical when the number of classes becomes large. Firstly, we validated our hypothesis that the classifier layer (the last fully connected layer) has a strong bias towards the new classes, which have substantially more training data than the old classes. Secondly, we found that this bias can be effectively corrected by applying a linear model with a small validation set. Our method achieves excellent results on two large datasets with 1,000+ classes (ImageNet ILSVRC 2012 and MS-Celeb-1M), outperforming the state-of-the-art by a large margin (11.1% on ImageNet ILSVRC 2012 and 13.2% on MS-Celeb-1M).

8. Acknowledgments

Part of the work was done when Yue Wu was an intern at Microsoft. This research is supported in part by the NSF IIS Award 1651902.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[2] Francisco M. Castro, Manuel J. Marin-Jimenez, Nicolas Guil, Cordelia Schmid, and Karteek Alahari. End-to-end incremental learning. In The European Conference on Computer Vision (ECCV), September 2018.
[3] Gert Cauwenberghs and Tomaso Poggio. Incremental and decremental support vector machine learning. In Advances in Neural Information Processing Systems, pages 409-415, 2001.
[4] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.
[5] Yandong Guo and Lei Zhang. One-shot face recognition by promoting underrepresented classes. 2017.
[6] Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. MS-Celeb-1M: A dataset and benchmark for large scale face recognition. In ECCV, 2016.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[8] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.
[9] Heechul Jung, Jeongwoo Ju, Minju Jung, and Junmo Kim. Less-forgetting learning in deep neural networks. arXiv preprint arXiv:1607.00122, 2016.
[10] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521-3526, 2017.
[11] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
[12] Ilja Kuzborskij, Francesco Orabona, and Barbara Caputo. From N to N+1: Multiclass transfer incremental learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3358-3365, 2013.
[13] Zhizhong Li and Derek Hoiem. Learning without forgetting. In European Conference on Computer Vision, pages 614-629. Springer, 2016.
[14] David Lopez-Paz et al. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6470-6479, 2017.
[15] Michael McCloskey and Neal J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24:109-165, 1989.
[16] Thomas Mensink, Jakob Verbeek, Florent Perronnin, and Gabriela Csurka. Distance-based image classification: Generalizing to new classes at near-zero cost. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2624-2637, 2013.
[17] Robi Polikar, Lalita Upda, Satish S. Upda, and Vasant Honavar. Learn++: An incremental learning algorithm for supervised neural networks. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 31(4):497-508, 2001.
[18] Amal Rannen Ep Triki, Rahaf Aljundi, Matthew Blaschko, and Tinne Tuytelaars. Encoder based lifelong learning. In Proceedings ICCV 2017, pages 1320-1328, 2017.
[19] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. iCaRL: Incremental classifier and representation learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[20] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.
[21] Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
[22] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pages 2994-3003, 2017.
[23] Konstantin Shmelkov, Cordelia Schmid, and Karteek Alahari. Incremental learning of object detectors without catastrophic forgetting. In Proceedings of the International Conference on Computer Vision, 2017.
[24] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[25] Gan Sun, Yang Cong, Ji Liu, Lianqing Liu, Xiaowei Xu, and Haibin Yu. Lifelong metric learning. IEEE Transactions on Cybernetics, (99):1-12, 2018.
[26] Gan Sun, Yang Cong, and Xiaowei Xu. Active lifelong learning with "watchdog". In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[27] Ragav Venkatesan, Hemanth Venkateswara, Sethuraman Panchanathan, and Baoxin Li. A strategy for an uncompromising incremental learner. arXiv preprint arXiv:1705.00744, 2017.
[28] Tianjun Xiao, Jiaxing Zhang, Kuiyuan Yang, Yuxin Peng, and Zheng Zhang. Error-driven incremental learning in deep convolutional neural network for large-scale image classification. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 177-186. ACM, 2014.

