Large Scale Incremental Learning
The increasing number of visually similar classes is particularly challenging, since the small margin around the boundary between classes is too sensitive to the data imbalance. The boundary is pushed to favor classes with more samples.

In this work, we present a method to address the data imbalance problem in large scale incremental learning. Firstly, we found a strong bias towards the new classes in the classifier layer (i.e. the last fully connected layer) of the convolutional neural network (CNN). Based upon this finding, we propose a simple and effective method, called BiC (bias correction), to correct the bias. We add a bias correction layer after the last fully connected (FC) layer (shown in Fig. 2), which is a simple linear model with two parameters. The bias correction layer is learned at the second stage, after learning the convolution layers and the FC layer at the first stage. The data, including exemplars from the old classes and samples from the new classes, are split into a training set for the first stage and a validation set for the second stage. The validation set helps approximate the real distribution of both old and new classes in the feature space, allowing us to estimate the bias in the FC layer. We found that the bias can be effectively corrected with a small validation set.

Our BiC method achieves remarkably good performance, especially on large scale datasets. The experimental results show that our method outperforms the state-of-the-art algorithms (iCaRL [19] and EEIL [2]) on two large datasets (ImageNet ILSVRC 2012 and MS-Celeb-1M) by a large margin: BiC gains 11.1% on ImageNet and 13.2% on MS-Celeb-1M, respectively.

Figure 2. Overview of our BiC method. The exemplars from the old classes and the samples of the new classes are split into training and validation sets. The training set is used to train the convolution layers and the FC layer (in stage 1). The validation set is used for bias correction (in stage 2).

2. Related Work

Incremental learning has been a long standing problem in machine learning [3, 17, 16, 12]. Before deep learning took off, people developed incremental learning techniques by leveraging linear classifiers, ensembles of weak classifiers, nearest neighbor classifiers, etc. Recently, thanks to the exciting progress in deep learning, there has been a lot of research on incremental learning with deep neural network models. The work can be roughly divided into three categories, depending on whether it requires real data, synthetic data, or nothing from the old classes.

Without using old data: Methods in the first category do not require any old data. [9] presented a method for domain transfer learning that tries to maintain the performance on old tasks by freezing the final layer and discouraging changes of the shared weights in the feature extraction layers. [10] proposed a technique to remember old tasks by constraining the important weights when optimizing a new task; one limitation of this approach is that the old and new tasks may conflict on these important weights. [13] presented a method that applies knowledge distillation [8] to maintain the performance on old tasks, separating the old and new tasks in multi-task learning, which is different from learning a classifier incrementally. [23] applied knowledge distillation for learning object detectors incrementally. [18] utilized an autoencoder to retain the knowledge from old tasks. [25, 26] updated a knowledge dictionary for new tasks and kept the dictionary coefficients for old tasks.

Using synthetic data: Both [22] and [27] employed a GAN [4] to replay synthetic data for old tasks. [22] applied a cross entropy loss on the synthetic data with the old solver's response as the target, while [27] utilized a root mean-squared error for learning the response of old tasks on synthetic data. Both [22, 27] depend highly on the capability of the generative models and struggle with complex objects and scenes.

Using exemplars from old data: Methods in the third category require part of the old data. [19] proposed a method to select a small number of exemplars from each old class. [2] keeps classifiers for all incremental steps and uses them for distillation; it introduces balanced fine-tuning and temporary distillation to alleviate the imbalance between the old and new classes. [14] proposed a continual learning framework where the training samples for different tasks are used one by one during training, constraining the cross entropy loss on the softmax outputs of old tasks when a new task comes. [28] proposed a training method that grows a network hierarchically as new training data are added. Similarly, [21] increases the number of layers in the network to handle newly arriving data.

Our BiC method belongs to the third category: we keep exemplars from the old classes in a similar manner to [19, 2]. However, we handle the data imbalance differently. We first locate a strong bias in the classifier layer (the last fully connected layer), and then apply a linear model to correct the bias using a small validation set. The validation set is a small subset of the exemplars which is excluded from training and used for bias correction alone. Compared with the state of the art ([19, 2]), our BiC method is more effective on large datasets with 1000+ classes.
Figure 5. Diagram of bias correction. Since the number of exemplars from old classes is small, they have narrow distributions in the feature space. This causes the learned classifier to prefer new classes. Validation samples, not involved in training the feature representation, may better reflect the unbiased distribution of both old and new classes in the feature space. Thus, we can use the validation samples to correct the bias. (Best viewed in color)

…the accuracy of the final classifier on 100 classes improves by 20%. These results validate our hypothesis that the fully connected layer is heavily biased. We also observe a gap between this result and the upper bound, which reflects the bias within the feature layers. In this paper, we focus on correcting the bias in the fully connected layer.

5. Bias Correction (BiC) Method

Based upon our finding that the fully connected layer is heavily biased, we propose a simple and effective bias correction method (BiC). Our method includes two stages in training (shown in Fig. 2). Firstly, we train the convolution layers and the fully connected layer by following the baseline method. At the second stage, we freeze both the convolution and the fully connected layers, and estimate two bias parameters by using a small validation set. In this section, we discuss how the validation set is generated and the details of the bias correction layer.

5.1. Validation Set

We estimate the bias by using a small validation set. The basic idea is to exclude the validation set from training the feature representation, allowing it to reflect the unbiased distribution of both old and new classes in the feature space (shown in Fig. 5). Therefore, we split the exemplars from the old classes and the samples from the new classes into a training set and a validation set. The training set is used to learn the convolution and fully connected layers (see Fig. 2), while the validation set is used for the bias correction.

Fig. 2 illustrates the generation of the validation set. The stored exemplars from the old classes are split into a training subset (referred to as train_old) and a validation subset (referred to as val_old). The samples for the new classes are also split into a training subset (referred to as train_new) and a validation subset (referred to as val_new). train_old and train_new are used to learn the convolution and FC layers (see Fig. 2), while val_old and val_new are used to estimate the parameters in the bias correction layer. Note that val_old and val_new are balanced.
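As a concrete illustration of this split, the sketch below (our own plain-Python sketch, not the released code; `split_train_val` and the `(image, label)` sample format are assumptions) partitions the stored exemplars and the new-class samples into the four subsets described above, holding out the same number of validation images per class so that val_old and val_new stay balanced.

```python
import random
from collections import defaultdict

def split_train_val(samples, val_per_class, seed=0):
    """Split (image, label) pairs into a training subset and a balanced
    validation subset containing `val_per_class` images per class."""
    by_class = defaultdict(list)
    for image, label in samples:
        by_class[label].append((image, label))
    rng = random.Random(seed)
    train, val = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        val.extend(items[:val_per_class])    # held out for bias correction (stage 2)
        train.extend(items[val_per_class:])  # used to train conv + FC layers (stage 1)
    return train, val

# exemplars   : stored (image, label) pairs from the old classes
# new_samples : all (image, label) pairs from the new classes
# The same val_per_class is used for both calls, so val_old and val_new are balanced:
# train_old, val_old = split_train_val(exemplars, val_per_class)
# train_new, val_new = split_train_val(new_samples, val_per_class)
```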
5.2. Bias Correction Layer

The bias correction layer should be simple, with a small number of parameters, since val_old and val_new are small. Thus, we use a linear model (with two parameters) to correct the bias. This is achieved by adding a bias correction layer to the network (shown in Fig. 2). We keep the output logits for the old classes (1, ..., n) and apply a linear model to correct the bias on the output logits for the new classes (n+1, ..., n+m) as follows:

$$q_k = \begin{cases} o_k & 1 \le k \le n \\ \alpha o_k + \beta & n+1 \le k \le n+m \end{cases} \qquad (4)$$

where α and β are the bias parameters for the new classes and o_k (defined in Section 3) is the output logit for the k-th class. Note that the bias parameters (α, β) are shared by all new classes, allowing us to estimate them with a small validation set. When optimizing the bias parameters, the convolution and fully connected layers are frozen. The classification loss (softmax with cross entropy) is used to optimize the bias parameters:

$$L_b = -\sum_{k=1}^{n+m} \delta_{y=k} \log\big[\mathrm{softmax}(q_k)\big] \qquad (5)$$

We found that this simple linear model is effective at correcting the bias introduced in the fully connected layer.
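A minimal NumPy sketch of Eqs. (4) and (5) follows; it is our own illustration, not the paper's implementation. `fit_bias_params` and its numerical-gradient loop are assumptions, and the logits are assumed to be precomputed by the frozen stage-1 model on val_old and val_new.

```python
import numpy as np

def bias_correct(logits, alpha, beta, n_old):
    """Eq. (4): keep old-class logits, apply alpha * o + beta to new-class logits."""
    q = logits.copy()
    q[:, n_old:] = alpha * q[:, n_old:] + beta
    return q

def bias_loss(logits, labels, alpha, beta, n_old):
    """Eq. (5): softmax cross entropy over the corrected logits q."""
    q = bias_correct(logits, alpha, beta, n_old)
    q = q - q.max(axis=1, keepdims=True)                      # numerical stability
    log_p = q - np.log(np.exp(q).sum(axis=1, keepdims=True))  # log softmax
    return -log_p[np.arange(len(labels)), labels].mean()

def fit_bias_params(val_logits, val_labels, n_old, lr=0.01, steps=2000, eps=1e-4):
    """Estimate (alpha, beta) on the balanced validation set.

    The convolution and FC layers are frozen, so `val_logits` are simply the
    precomputed outputs of the stage-1 model on val_old + val_new.  Numerical
    gradients are enough here because only two scalars are optimized.
    """
    alpha, beta = 1.0, 0.0
    for _ in range(steps):
        g_a = (bias_loss(val_logits, val_labels, alpha + eps, beta, n_old)
               - bias_loss(val_logits, val_labels, alpha - eps, beta, n_old)) / (2 * eps)
        g_b = (bias_loss(val_logits, val_labels, alpha, beta + eps, n_old)
               - bias_loss(val_logits, val_labels, alpha, beta - eps, n_old)) / (2 * eps)
        alpha -= lr * g_a
        beta -= lr * g_b
    return alpha, beta
```

Because only two scalars are fitted on top of frozen features, even this crude optimizer is sufficient for a few hundred validation samples.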
6. Experiments

We compare our BiC method to the state-of-the-art methods on two large datasets (ImageNet ILSVRC 2012 [20] and MS-Celeb-1M [6]) and one small dataset (CIFAR-100 [11]). We also perform ablation experiments to analyze different components of our approach.

6.1. Datasets

We use all data in CIFAR-100 and ImageNet ILSVRC 2012 (referred to as ImageNet-1000), and randomly choose 10,000 classes from MS-Celeb-1M (referred to as Celeb-10000). We follow the iCaRL benchmark protocol [19] to select exemplars; the total number of exemplars for the old classes is fixed. The details of the three datasets are as follows:

CIFAR-100: contains 60k 32×32 RGB images of 100 object classes. Each class has 500 training images and 100 testing images. The 100 classes are split into 5, 10, 20 and 50 incremental batches. 2,000 samples are stored as exemplars.

ImageNet-1000: includes 1,281,167 images for training and 50,000 images for validation. The 1000 classes are split into 10 incremental batches. 20,000 samples are stored as exemplars.
Celeb-10000: a random subset of 10,000 classes selected from MS-Celeb-1M-base, which has 20,000 classes. MS-Celeb-1M-base is a smaller yet nearly noise-free version of MS-Celeb-1M [6], which has nearly 100,000 classes with a total of 1.2 million aligned face images. For the randomly selected 10,000 classes, there are 293,052 images for training and 141,984 images for validation. The 10,000 classes are split into 10 incremental batches (1000 classes per batch). 50,000 samples are stored as exemplars.

Figure 6. Incremental learning results (accuracy %) on (a) ImageNet-1000 and (b) Celeb-10000. Both datasets have ten incremental batches. The Upper Bound result, shown in the last step, is obtained by training a non-incremental model using all training samples from all classes. (Best viewed in color)

For our BiC method, the ratio of the train/validation split on the exemplars is 9:1 for CIFAR-100 and ImageNet-1000. This ratio is obtained from the ablation study (see Section 6.6). We change the split ratio to 4:1 on Celeb-10000, so that at least one validation image is kept per person.
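For concreteness, the fixed-memory bookkeeping implied above can be sketched as follows; the helper `exemplar_budget` is our own and is not part of the paper's protocol description.

```python
def exemplar_budget(total_exemplars, num_old_classes, val_fraction):
    """Per-class exemplar count under a fixed total memory, and the
    train/validation counts implied by a given validation fraction."""
    per_class = total_exemplars // num_old_classes
    val_per_class = max(1, round(per_class * val_fraction))  # keep >= 1 validation image
    return per_class, per_class - val_per_class, val_per_class

# ImageNet-1000 halfway through (500 old classes), 9:1 split -> 40 exemplars: 36 train / 4 val
print(exemplar_budget(20000, 500, 0.1))    # (40, 36, 4)
# Celeb-10000 at the last step (10,000 old classes), 4:1 split -> 5 exemplars: 4 train / 1 val
print(exemplar_budget(50000, 10000, 0.2))  # (5, 4, 1)
```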
6.2. Implementation Details

Our implementation uses TensorFlow [1]. We use an 18-layer ResNet [7] for ImageNet-1000 and Celeb-10000, and a 32-layer ResNet for CIFAR-100. The ResNet implementation is from the TensorFlow official models (https://github.com/tensorflow/models/tree/master/official/resnet). The training details for each dataset are as follows:

ImageNet-1000 and Celeb-10000: Each incremental training has 100 epochs. The learning rate starts at 0.1 and is reduced to 1/10 of the previous learning rate after 30, 60, 80 and 90 epochs. The weight decay is 0.0001 and the batch size is 256. Image pre-processing follows the VGG pre-processing steps [24], including random cropping, horizontal flipping, aspect-preserving resizing and mean subtraction.

CIFAR-100: Each incremental training has 250 epochs. The learning rate starts from 0.1 and is reduced to 0.01, 0.001 and 0.0001 after 100, 150 and 200 epochs, respectively. The weight decay is 0.0002 and the batch size is 128. Random cropping and horizontal flipping are adopted for data augmentation, following the original ResNet implementation [7].
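The two step schedules above can be summarized in a small helper; this is our own paraphrase of the stated hyper-parameters, not the authors' training script.

```python
def learning_rate(epoch, dataset):
    """Step schedule paraphrasing the settings above: start at 0.1 and divide
    by 10 at each listed epoch boundary."""
    if dataset in ("ImageNet-1000", "Celeb-10000"):
        boundaries = [30, 60, 80, 90]   # 100 epochs total
    elif dataset == "CIFAR-100":
        boundaries = [100, 150, 200]    # 250 epochs total
    else:
        raise ValueError(f"unknown dataset: {dataset}")
    drops = sum(epoch >= b for b in boundaries)
    return 0.1 * (0.1 ** drops)

assert learning_rate(0, "CIFAR-100") == 0.1
assert abs(learning_rate(120, "CIFAR-100") - 0.01) < 1e-9
assert abs(learning_rate(95, "ImageNet-1000") - 1e-5) < 1e-12
```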
For a fair comparison with iCaRL [19] and EEIL [2], we use the same networks, keep the same number of exemplars, and follow the same protocols for splitting classes into incremental batches. We use the identical class order generated by the iCaRL implementation (https://github.com/srebuffi/iCaRL) for CIFAR-100 and ImageNet-1000. On Celeb-10000, the class order is randomly generated and identical for all comparisons. The temperature scalar T in Eq. 1 is set to 2, following [13, 2].
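Eq. 1 itself is not reproduced in this excerpt; for reference, a generic temperature-scaled distillation term in the spirit of [8, 13] looks like the sketch below (our own formulation with T = 2, not a verbatim copy of the paper's equation).

```python
import numpy as np

def softmax(z, T=1.0):
    """Row-wise softmax with temperature T."""
    z = z / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distilling_loss(new_logits_on_old_classes, old_model_logits, T=2.0):
    """Generic temperature-scaled distillation: push the new model's softened
    probabilities on the old classes towards the old model's soft targets."""
    soft_targets = softmax(old_model_logits, T)
    log_probs = np.log(softmax(new_logits_on_old_classes, T) + 1e-12)
    return -(soft_targets * log_probs).sum(axis=1).mean()
```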
6.3. Comparison on Large Datasets

In this section, we compare our BiC method with the state-of-the-art methods on two large datasets (ImageNet-1000 and Celeb-10000). The state-of-the-art methods include LwF [13], iCaRL [19] and EEIL [2]. All of them utilize knowledge distillation to prevent catastrophic forgetting. iCaRL and EEIL keep exemplars for old classes, while LwF does not use any old data.

The incremental learning results on ImageNet-1000 are shown in Table 1 and Figure 6-(a). Our BiC method outperforms both EEIL [2] and iCaRL [19] by a large margin. BiC has a small gain over iCaRL for the first couple of incremental batches and is worse than EEIL in the first two increments. However, the gain of BiC increases as more incremental batches arrive. Regarding the final incremental classifier on all classes, our BiC method outperforms EEIL [2] and iCaRL [19] by 18.5% and 26.5%, respectively. On average over 10 incremental batches, BiC outperforms EEIL [2] and iCaRL [19] by 11.1% and 19.7%, respectively.

Note that the data imbalance increases as more incremental steps arrive. The reason is that the number of exemplars per old class decreases as the incremental step increases, since the total number of exemplars is fixed (following the fixed memory protocol in EEIL [2] and iCaRL [19]). The gap between our BiC method and the other methods becomes wider as the incremental step increases and the data imbalance grows. This demonstrates the advantage of our BiC method.

We also observe that EEIL performs better on the second batch (even higher than on the first batch) on ImageNet-1000. This is mostly due to the enhanced data augmentation (EDA) in EEIL, which is more effective for the first couple of incremental batches when the data imbalance is mild. EDA includes random brightness shift, contrast normalization, random cropping and horizontal flipping. In contrast, BiC only applies random cropping and horizontal flipping. EEIL [2] shows that EDA is effective for early incremental batches when the data imbalance is not severe. Even without the enhanced data augmentation, our BiC still outperforms EEIL by a large margin on ImageNet-1000, starting from the third batch.
Number of classes   100    200    300    400    500    600    700    800    900    1000
LwF [13]            90.0   77.0   68.0   59.5   52.5   49.5   46.5   43.0   40.5   39.0
iCaRL [19]          90.0   83.0   77.5   70.5   63.0   57.5   53.5   50.0   48.0   44.0
EEIL [2]            95.0   95.5   86.0   77.5   71.0   68.0   62.0   59.8   55.0   52.0
BiC (Ours)          94.1   92.5   89.6   89.1   85.7   83.2   80.2   77.5   75.0   73.2

Table 1. Incremental learning results (accuracy %) on the ImageNet-1000 dataset with an increment of 100 classes. LwF [13] does not use any exemplars from the old classes. iCaRL [19], EEIL [2] and our BiC method use the same amount of exemplars from the old classes. Note that the numbers for LwF, iCaRL and EEIL on ImageNet-1000 are estimated from the figures in the original papers. The best results are marked in bold.
Number of classes   1000    2000    3000    4000    5000    6000    7000    8000    9000    10000
iCaRL [19]          94.31   94.26   91.09   86.88   81.06   77.45   75.29   71.34   68.78   65.56
BiC (Ours)          95.90   96.65   96.68   96.16   95.43   94.45   93.35   91.90   90.18   87.98

Table 2. Incremental learning results (accuracy %) on the Celeb-10000 dataset with an increment of 1000 classes. iCaRL [19] and our BiC method use the same amount of exemplars from the old classes. The best results are marked in bold.
The incremental learning results on Celeb-10000 are shown in Table 2 and Figure 6-(b). To the best of our knowledge, we have not seen any other incremental learning method evaluated at this scale; the result for iCaRL is generated by applying its github implementation to the Celeb-10000 dataset. For the first couple of incremental steps, our BiC method is slightly better than iCaRL (< 3%), but from the third incremental step the gap becomes wider. At the last incremental step, BiC outperforms iCaRL by 22.4%. The average gain over 10 incremental batches is 13.2%.

These results demonstrate that our BiC method is more effective and robust in dealing with a large number of classes. As the number of classes increases, visually similar classes occur more frequently across different incremental batches with unbalanced data. This introduces a strong bias towards new classes and misclassifies the old classes that are visually similar. Our BiC method is able to effectively reduce this bias and improve the classification accuracy.

Figure 7. Incremental learning results (accuracy %) on ImageNet-100 and ImageNet-1000. Both have ten incremental batches. The Upper Bound result, shown in the last step, is obtained by training a non-incremental model using all training samples from all classes. (Best viewed in color)

6.4. Comparison between Different Scales

In this section, we compare our BiC method with the state-of-the-art at two different scales on ImageNet. The small scale deals with 100 randomly selected classes (referred to as ImageNet-100), while the large scale involves all 1000 classes (referred to as ImageNet-1000). Both scales have 10 incremental batches. This follows the same protocol as EEIL [2] and iCaRL [19]. The results for ImageNet-1000 are the same as in the previous section.

The incremental learning results on ImageNet-100 and ImageNet-1000 are shown in Fig. 7. Our BiC method outperforms the state-of-the-art at both scales in terms of the final incremental accuracy and the average incremental accuracy, but the gain at the large scale is bigger. We also compare the final incremental accuracy (the last step) to the upper bound, which is obtained by training a non-incremental model using all classes and their training data (shown at the last step in Fig. 7). Compared to the upper bound, our BiC method degrades by 10.5% and 16.0% on ImageNet-100 and ImageNet-1000, respectively, whereas EEIL [2] degrades by 15.1% and 37.2% and iCaRL [19] by 31.1% and 45.2%. Compared with EEIL [2] and iCaRL [19], which suffer more performance degradation when moving from the small scale to the large scale, our BiC method is much more consistent. This demonstrates that BiC has a better capability to handle the large scale.

We are aware that BiC is behind EEIL [2] for the first three incremental batches on ImageNet-100. As explained in Section 6.3, this is mostly due to the enhanced data augmentation (EDA) in EEIL [2].

6.5. Comparison on a Small Dataset

We also compare our BiC method with the state-of-the-art algorithms on a small dataset, CIFAR-100 [11]. The incremental learning results with four different splits of 5, 10, 20 and 50 classes are shown in Fig. 8. Our BiC method has similar performance to iCaRL [19] and EEIL [2]: BiC is better on the splits of 50 and 20 classes, but is slightly behind EEIL on the splits of 10 and 5 classes. The margins are small for all splits.
Figure 8. Incremental learning results on CIFAR-100 with splits of (a) 5 classes, (b) 10 classes, (c) 20 classes and (d) 50 classes. The Upper Bound result, shown in the last step, is obtained by training a non-incremental model using all training samples for all classes. (Best viewed in color)
Although our method focuses on large scale incremental learning, it is also compelling at the small scale. Note that EEIL uses more data augmentation, such as brightness augmentation and contrast normalization, which are not utilized in LwF, iCaRL or BiC.

6.6. Ablation Study

We now analyze the components of our BiC method and demonstrate their impact. The ablation study is performed on CIFAR-100 [11] with an increment of 20 classes, as incremental learning on a large dataset is time consuming. The size of the stored exemplars from old classes is 2,000. In the following, we analyze (a) the impact of bias correction, (b) the split of the validation set, and (c) the sensitivity of exemplar selection.

The Impact of Bias Correction We compare our BiC method with two baseline variations and the upper bound, to analyze the impact of bias correction. The baselines and the upper bound are defined as follows:

baseline-1: the model is trained using the classification loss alone (Eq. 2).
baseline-2: the model is trained using both the distilling loss and the classification loss (Eq. 3). Compared to baseline-1, the distilling loss is added.
BiC: the model is trained using both the distilling loss and the classification loss, with the bias correction.
upper bound: the model is first trained using both the distilling loss and the classification loss. Then, the feature layers are frozen and the classifier layer (i.e. the fully connected layer) is retrained using all training data (including the samples from the old classes that are not stored). Although it is infeasible to keep all training samples from the old classes, this shows the upper bound for the bias correction in the fully connected layer.

The incremental learning results are shown in Table 3. With the help of knowledge distillation, baseline-2 is slightly better than baseline-1 since it retains the classification capability on the old classes. However, both baseline-1 and baseline-2 have low accuracy at the final step of classifying all 100 classes (about 40%). This is mainly because of the data imbalance between the old and new classes. With the bias correction, BiC improves the accuracy at all incremental steps; the classification accuracy at the final step (100 classes) is boosted from 40.34% to 56.69%. This demonstrates that the bias is a big issue and our method is effective at addressing it. Furthermore, our method is close to the upper bound: the small gap (4.24%) between our 56.69% and the upper bound's 60.93% shows the superiority of our method.

The confusion matrices of the four variations are shown in Fig. 9. Clearly, baseline-1 and baseline-2 suffer from the bias towards the new classes (strong confusions on the last 20 classes). BiC reduces the bias and has a confusion matrix similar to the upper bound.

These results validate our hypothesis that there exists a strong bias towards the new classes in the last fully connected layer. In addition, they demonstrate that the proposed bias correction, using a linear model on a small validation set, is capable of correcting the bias.

The Split of Validation Set We study the impact of different splits of the validation set (see Section 5.1). As illustrated in Fig. 2, our BiC splits the stored exemplars from the old classes into a training set (train_old) and a validation set (val_old). The samples from the new classes also have a train/val split (train_new and val_new). train_old and train_new are used to learn the convolution layers and the fully connected layer, while val_old and val_new are used to learn the bias correction layer. Note that val_old and val_new are balanced, having the same number of samples per class.

Since only a few exemplars (i.e. train_old ∪ val_old) are stored for the old classes, it is critical to find a good split that deals with the trade-off between training the feature representation and correcting the bias in the fully connected layer.

Table 4 shows the incremental learning results for four different splits of train_old : val_old. The split of 9:1 has the best classification accuracy for all four incremental steps.
Variations     cls loss   distilling loss   bias removal   FC retrain     20      40      60      80      100
baseline-1        X                                                     84.40   68.30   55.10   48.52   39.83
baseline-2        X             X                                       85.05   72.22   59.41   50.43   40.34
BiC (Ours)        X             X                X                      84.00   74.69   67.93   61.25   56.69
upper bound       X             X                               X       84.39   76.15   69.51   64.03   60.93

Table 3. Incremental learning results on CIFAR-100 with a batch of 20 classes. baseline-1 uses the classification loss alone. baseline-2 uses both the distilling loss and the classification loss. BiC corrects the bias in the FC layer of baseline-2. The upper bound retrains the last FC layer using all samples from both old and new classes after learning the model of baseline-2. The best results are marked in bold.
Figure 9. Confusion matrices of four different variations: (a) baseline-1, (b) baseline-2, (c) BiC, (d) upper bound. Both baseline-1 and baseline-2 have a strong bias towards the new classes. BiC is capable of removing most of the bias and has a confusion matrix similar to the upper bound. (Best viewed in color)
References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

[2] Francisco M. Castro, Manuel J. Marin-Jimenez, Nicolas Guil, Cordelia Schmid, and Karteek Alahari. End-to-end incremental learning. In The European Conference on Computer Vision (ECCV), September 2018.

[3] Gert Cauwenberghs and Tomaso Poggio. Incremental and decremental support vector machine learning. In Advances in Neural Information Processing Systems, pages 409–415, 2001.

[4] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[5] Yandong Guo and Lei Zhang. One-shot face recognition by promoting underrepresented classes. 2017.

[6] Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. MS-Celeb-1M: A dataset and benchmark for large scale face recognition. In ECCV, 2016.

[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[8] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.

[9] Heechul Jung, Jeongwoo Ju, Minju Jung, and Junmo Kim. Less-forgetting learning in deep neural networks. arXiv preprint arXiv:1607.00122, 2016.

[10] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

[11] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.

[12] Ilja Kuzborskij, Francesco Orabona, and Barbara Caputo. From N to N+1: Multiclass transfer incremental learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3358–3365, 2013.

[13] Zhizhong Li and Derek Hoiem. Learning without forgetting. In European Conference on Computer Vision, pages 614–629. Springer, 2016.

[14] David Lopez-Paz et al. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6470–6479, 2017.

[15] Michael McCloskey and Neal J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24:109–165, 1989.

[16] Thomas Mensink, Jakob Verbeek, Florent Perronnin, and Gabriela Csurka. Distance-based image classification: Generalizing to new classes at near-zero cost. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2624–2637, 2013.

[17] Robi Polikar, Lalita Upda, Satish S Upda, and Vasant Honavar. Learn++: An incremental learning algorithm for supervised neural networks. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 31(4):497–508, 2001.

[18] Amal Rannen Ep Triki, Rahaf Aljundi, Matthew Blaschko, and Tinne Tuytelaars. Encoder based lifelong learning. In Proceedings ICCV 2017, pages 1320–1328, 2017.

[19] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. iCaRL: Incremental classifier and representation learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[20] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[21] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.

[22] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pages 2994–3003, 2017.

[23] Konstantin Shmelkov, Cordelia Schmid, and Karteek Alahari. Incremental learning of object detectors without catastrophic forgetting. In Proceedings of the International Conference on Computer Vision, 2017.

[24] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[25] Gan Sun, Yang Cong, Ji Liu, Lianqing Liu, Xiaowei Xu, and Haibin Yu. Lifelong metric learning. IEEE Transactions on Cybernetics, (99):1–12, 2018.

[26] Gan Sun, Yang Cong, and Xiaowei Xu. Active lifelong learning with "watchdog". In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[27] Ragav Venkatesan, Hemanth Venkateswara, Sethuraman Panchanathan, and Baoxin Li. A strategy for an uncompromising incremental learner. arXiv preprint arXiv:1705.00744, 2017.

[28] Tianjun Xiao, Jiaxing Zhang, Kuiyuan Yang, Yuxin Peng, and Zheng Zhang. Error-driven incremental learning in deep convolutional neural network for large-scale image classification. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 177–186. ACM, 2014.