Remembering Transformer For Continual Learning
Yuwei Sun1,2, Ippei Fujisawa1, Arthur Juliani3, Jun Sakuma2,4, Ryota Kanai1
1 Araya, 2 RIKEN AIP, 3 Microsoft Research, 4 Tokyo Institute of Technology
Abstract
1 Introduction
The Catastrophic Forgetting (CF) problem arises in the sequential learning of neural networks, wherein the new task knowledge tends to interfere with old task knowledge.

[Figure 1: Overview of the Remembering Transformer (RT) module, comprising an adapter and an autoencoder attached to the attention layer outputs.]
2 Related work
This section provides a summary and comparison of relevant research on continual learning. To
tackle the catastrophic forgetting (CF) problem, there are approaches including fine-tuning, generative
replay, regularization, and soft parameter sharing. The fine-tuning method retrains the model with
data from previous tasks when training on new task data [28]. Generative replay aims to train a
generative model to reconstruct previous task data for the retraining [10, 2]. However, as the number of tasks grows, it becomes increasingly infeasible to either store or learn a generative model for retraining on previous task data. A recent study [10] showed that the learned generative model itself suffers from the CF problem, degrading the quality of the reconstructed task data. Additionally, fine-tuning on each observed task is computationally expensive
for continual learning. To tackle these problems, regularization and soft parameter sharing methods
selectively update model parameters without disturbing parameters important for old tasks [14, 1].
The soft parameter sharing method leverages a modular architecture to learn task-specific neural
module parameters [11, 6, 2]. The isolation of parameters alleviates negative interference among
various tasks. However, the soft parameter sharing method is inefficient in terms of model parameters
and typically requires the task identity information for selecting relevant parameters.
In contrast, the Remembering Transformer leverages a novelty detection method to accurately infer
relevant neural modules without requiring any task identity information. It significantly reduces
the model parameter size of each module based on a mixture-of-adapters architecture, wherein a set of task-specific linear transformation matrices is employed in each attention layer of a pretrained Transformer model. By adaptively learning, activating, and fusing these neural modules, it significantly reduces the memory cost for continual learning.

Figure 2: Adapter fusion based on knowledge distillation with a limited capacity $E$. Step 1: old task selection via novelty detection and the Herding-based replay memory. Step 2: distillation with a cross-entropy loss on the new task data and an L2 loss on the old task replay. We update the $(E+1)$-th adapter $\{B^l_{E+1}, A^l_{E+1}\}_{l=1}^{L}$ using new task data and the soft probability output of the old task replay.
3 Methodology
In this section, we delve into a comprehensive exploration of the assumptions and the proposed
Remembering Transformer’s technical underpinnings. These include the incorporation of mixture-of-
adapters in Transformer models, a generative model-based novelty detection for expert routing, and
adapter fusion based on knowledge distillation (see Figure 2). The adapter fusion enables efficient
knowledge retrieval from relevant previously learned adapters while learning new tasks, preventing
catastrophic forgetting of previous tasks.
We aim to tackle continual learning on a sequence of tasks $\{T^{(1)}, T^{(2)}, \ldots, T^{(N)}\}$, where $T^{(t)} = \{x_i^t, y_i^t\}_{i=1}^{K_t} \subset T$ is a labeled dataset including $K_t$ pairs of instances $x^t$ and their corresponding labels $y^t$. We investigate the most challenging class-incremental learning setting in continual learning, where each task $T^{(t)}$ is defined by a non-overlapping subset of a dataset $T^{(t)} \subset \{x_i, y_i\}_{i=1}^{K}$, where $K = \sum_{t=1}^{N} K_t$. $Y^t \subset Y$ changes over time, i.e., $Y^i \cap Y^j = \emptyset \ \forall i \neq j$. The posterior $P(X|Y)$ is consistent across tasks $T^{(t)} \subset T$. During inference, we evaluate on $P(X) \sim \frac{1}{K}\sum_{t=1}^{N} P(X|Y)P_t(Y)$ without the task identity information $t$.
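To make the class-incremental protocol concrete, the following minimal sketch (an illustration, not the authors' code; the helper name and the 10-class/5-task numbers are placeholders) partitions a label space into disjoint per-task subsets satisfying $Y^i \cap Y^j = \emptyset$:

```python
# A minimal sketch (not from the paper) of the class-incremental protocol: the label
# space Y is partitioned into disjoint per-task subsets Y^t.
from typing import Dict, List

def make_class_incremental_splits(num_classes: int, num_tasks: int) -> Dict[int, List[int]]:
    """Assign each class to exactly one task so that Y^i and Y^j are disjoint for i != j."""
    assert num_classes % num_tasks == 0, "assume equal-sized splits for simplicity"
    per_task = num_classes // num_tasks
    return {t: list(range(t * per_task, (t + 1) * per_task)) for t in range(num_tasks)}

splits = make_class_incremental_splits(num_classes=10, num_tasks=5)   # e.g., CIFAR10/5
assert all(set(splits[i]).isdisjoint(splits[j]) for i in splits for j in splits if i != j)
print(splits)   # {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7], 4: [8, 9]}
```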
Our method is inspired by the Mixture-of-Experts (MoE) [24, 5, 3], with diverse neural modules
constituting a comprehensive neural network. Different parts of the neural network learn and adapt
independently without disrupting the knowledge that other parts have acquired, thus mitigating
interference between tasks. The intuition is to learn a diverse collection of neural modules based
on a pretrained Vision Transformer (ViT) and adaptively select the most relevant neural module for
adapting to various continual learning tasks.
ViT partitions an image into a sequence of non-overlapping patches. Let $x \in \mathbb{R}^{H \times W \times C}$ be an image input, where $(H, W)$ is the resolution of the image and $C$ is the number of channels. $x$ is separated into a sequence of 2D patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where $(P, P)$ is the resolution of each image patch and $N = \frac{HW}{P^2}$ is the number of patches. These patches are mapped to tokens $v_p \in \mathbb{R}^{\frac{HW}{P^2} \times D}$ with a learnable linear projection. A learnable 1D position embedding $v_0 \in \mathbb{R}^{D}$ is prepended to the tokens to retain positional information, resulting in $v \in \mathbb{R}^{(\frac{HW}{P^2}+1) \times D}$.
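A minimal PyTorch sketch of the tokenization described above, assuming standard ViT-Base shapes (224×224 images, 16×16 patches, D = 768); the module is illustrative, not the paper's implementation:

```python
# A minimal patch-embedding sketch following the description above (module names
# and default shapes are illustrative assumptions).
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_channels=3, dim=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2                  # HW / P^2
        self.proj = nn.Linear(patch_size * patch_size * in_channels, dim)   # learnable projection
        self.v0 = nn.Parameter(torch.zeros(1, 1, dim))                      # prepended learnable token
        self.patch_size = patch_size

    def forward(self, x):                                                   # x: (B, C, H, W)
        B, C, H, W = x.shape
        P = self.patch_size
        # split into non-overlapping P x P patches and flatten each to P^2 * C values
        patches = x.unfold(2, P, P).unfold(3, P, P)                         # (B, C, H/P, W/P, P, P)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, self.num_patches, -1)
        tokens = self.proj(patches)                                         # (B, HW/P^2, D)
        v0 = self.v0.expand(B, -1, -1)
        return torch.cat([v0, tokens], dim=1)                               # (B, HW/P^2 + 1, D)

v = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(v.shape)   # torch.Size([2, 197, 768])
```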
Moreover, we employ the soft parameter sharing method to enhance ViT's ability for continual learning, alleviating catastrophic forgetting. Notably, for each task $T^{(t)} \subseteq T$, we utilize the Low-Rank Adaptation (LoRA) method [9] for efficient fine-tuning using low-rank decomposition matrices. We apply the trainable decomposition matrices to the attention weights of ViT's different layers. For each attention layer $l$, linear transformations are applied to the query, key, value, and output weights $(W_l^Q, W_l^K, W_l^V, W_l^O)$. For $W_l \in \mathbb{R}^{D \times D} \in (W_l^Q, W_l^K, W_l^V, W_l^O)$, the parameter update $\Delta W_l$ is then computed by $W_l + \Delta W_l = W_l + B^l A^l$, where $A^l \in \mathbb{R}^{r \times D}$, $B^l \in \mathbb{R}^{D \times r}$, and $r$ is a low rank, $r \ll D$. For an input token $v^t$, the output of the $l$th attention layer is $\hat{W}_l v^t + \Delta W_l v^t = \hat{W}_l v^t + B^l A^l v^t$, where $\hat{\cdot}$ indicates untrainable (frozen) parameters. For each task $T^{(t)}$, an adapter $\theta^t_{\text{adapter}} = \{A_t^l, B_t^l\}_{l=1}^{L}$ is added to the pretrained ViT model $\{\hat{\theta}_{\text{ViT}}, \theta^t_{\text{adapter}}\}$, where $L$ represents the total number of attention layers. We formulate the optimization of the adapter based on task $t$'s dataset $\{v_i^t, y_i^t\}_{i=1}^{K_t}$ as follows: $\hat{\theta}^t_{\text{adapter}} \leftarrow \arg\min_{\theta^t_{\text{adapter}}} -\sum_{i=1}^{K_t} f(v_i^t; \{\hat{\theta}_{\text{ViT}}, \theta^t_{\text{adapter}}\}) \log(y_i^t)$. Then, $\hat{\theta}^t_{\text{adapter}}$ is added to the list of adapters $\{\theta^1_{\text{adapter}}, \theta^2_{\text{adapter}}, \ldots, \theta^{t-1}_{\text{adapter}}\}$ to form the mixture-of-adapters architecture.
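As a minimal illustration of the low-rank update $\hat{W}_l v + B^l A^l v$ for a single frozen projection (a sketch under the assumption of rank $r = 4$ and $D = 768$; names are not from the released code):

```python
# A minimal LoRA sketch for one frozen attention projection; only A and B are trainable.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_model=768, rank=4):
        super().__init__()
        self.W = nn.Linear(d_model, d_model, bias=False)          # pretrained weight, frozen
        self.W.weight.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, d_model) * 0.01)  # A^l in R^{r x D}
        self.B = nn.Parameter(torch.zeros(d_model, rank))         # B^l in R^{D x r}, zero init

    def forward(self, v):                                         # v: (..., D)
        return self.W(v) + v @ self.A.t() @ self.B.t()            # Ŵ v + B A v

layer = LoRALinear()
out = layer(torch.randn(2, 197, 768))
print(out.shape)                                                  # torch.Size([2, 197, 768])
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only A and B train
```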
Utilizing the input tokens, our goal is to adaptively activate the most relevant adapters within the
mixture-of-adapters framework. We propose a novelty detection mechanism based on generative
models for effective adapter routing. In particular, a routing neural network outputs gating weights $W_g \in \mathbb{R}^{E}$ that indicate the probability of employing a specific adapter for a given task, where $E$ represents the number of existing adapters. Conventional routing functions in Mixture-of-Experts (MoE) usually struggle with the more complex continual learning tasks, where sequential task inputs necessitate continual updates of the gating weights, which in turn leads to forgetting in the routing network itself.
A generative model encodes tokens from a specific task and assesses the familiarity of a new task in relation to the encoded knowledge of the old task. Notably, to facilitate efficient routing, we employ a low-rank autoencoder $\theta_{\text{AE}}$ that consists of an encoder and a decoder, each of which leverages one linear transformation layer, $F \in \mathbb{R}^{s \times D}$ and $G \in \mathbb{R}^{D \times s}$, where $s$ is a low rank, $s \ll D$. For each task $t$, we train the AE to encode the embedding layer output $V^t$, which is flattened and passed through a Sigmoid activation $\sigma(\cdot)$: $\hat{V}^t = G_t F_t \sigma(V^t)$. The Sigmoid activation constrains the reconstructed tokens from the AE to the range between 0 and 1 to facilitate learning. Then, the AE is updated based on the mean squared error loss in Eq. (1):
$$\ell_t(\theta^t_{\text{AE}}) = (\sigma(V^t) - \hat{V}^t)^2, \qquad \hat{\theta}^t_{\text{AE}} = \arg\min_{\theta^t_{\text{AE}}} \ell_t. \tag{1}$$
During inference, the collection of AEs $\hat{\theta}^e_{\text{AE}} \in \{\hat{G}_e, \hat{F}_e\}_{e=1}^{E}$ trained on various tasks provides estimates of the input tokens $v$'s novelty in relation to the previously learned tasks based on the computed reconstruction loss $\ell_e(v)$. In particular, when the input closely resembles an old task learned by an existing AE, the reconstruction loss for the input task by this AE tends to be low. Consequently, the old task $e^*$ corresponding to the AE with the minimum reconstruction loss is the most relevant to the input task. Then, the adapter $\theta^{e^*}_{\text{adapter}}$ is leveraged to adapt the pretrained ViT for tackling the input task. We devise the proposed novelty detection mechanism as follows:
$$\ell_e = (\sigma(v) - \hat{G}_e \hat{F}_e \sigma(v))^2, \qquad e^* = \arg\min_e \ell_e, \qquad W_g(e) = \begin{cases} 1 & \text{if } e = e^* \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$
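A minimal sketch of the generative routing in Eqs. (1)-(2); the class and function below are illustrative assumptions (dimensions, rank, and names are placeholders), not the released implementation:

```python
# Low-rank autoencoder per task; routing picks the task whose AE reconstructs the input best.
import torch
import torch.nn as nn

class LowRankAE(nn.Module):
    def __init__(self, dim=768, rank=8):
        super().__init__()
        self.F = nn.Linear(dim, rank, bias=False)    # encoder F in R^{s x D}
        self.G = nn.Linear(rank, dim, bias=False)    # decoder G in R^{D x s}

    def recon_loss(self, v):                         # v: (B, N, D) embedding-layer tokens
        target = torch.sigmoid(v)                    # σ(V^t)
        recon = self.G(self.F(target))               # G F σ(V^t)
        return ((target - recon) ** 2).mean()        # mean squared error, Eq. (1)

def route(v, autoencoders):
    """Eq. (2): pick the AE (old task) with the lowest reconstruction loss."""
    with torch.no_grad():
        losses = torch.stack([ae.recon_loss(v) for ae in autoencoders])
    e_star = int(losses.argmin())
    gate = torch.zeros(len(autoencoders))
    gate[e_star] = 1.0                               # one-hot gating weights W_g
    return e_star, gate

aes = [LowRankAE() for _ in range(3)]                # one AE per seen task
e_star, gate = route(torch.randn(4, 197, 768), aes)
print(e_star, gate)
```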
The output $v_l$ of the $l$th attention layer is then computed by $\hat{W}_l v_{l-1} + \sum_{e=1}^{E} W_g(e) B_e^l A_e^l v_{l-1}$, where $v_{l-1}$ is the output of the previous $(l-1)$th attention layer. Additionally, for enhanced parameter efficiency, the routing model $\{\theta^t_{\text{AE}}\}_{t=1}^{N}$ is globally shared across different attention layers.
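The gated layer output above can be sketched as follows (illustrative tensors only); since $W_g$ is one-hot, the sum over experts reduces to applying the single selected adapter:

```python
# A minimal sketch of Ŵ_l v + Σ_e W_g(e) B_e^l A_e^l v with one-hot gating.
import torch

def gated_output(W_hat, adapters, gate, v):
    """W_hat: frozen (D, D); adapters: list of (B_e, A_e) with shapes (D, r), (r, D); v: (..., D)."""
    out = v @ W_hat.t()
    for w_e, (B_e, A_e) in zip(gate, adapters):
        if w_e > 0:                                   # only the selected expert contributes
            out = out + w_e * (v @ A_e.t() @ B_e.t())
    return out

D, r = 768, 4
adapters = [(torch.zeros(D, r), torch.randn(r, D)) for _ in range(3)]
gate = torch.tensor([0.0, 1.0, 0.0])                  # W_g from the routing step
y = gated_output(torch.eye(D), adapters, gate, torch.randn(2, 197, D))
print(y.shape)                                        # torch.Size([2, 197, 768])
```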
The Remembering Transformer leverages the mixture-of-adapters architecture and the low-rank generative routing for enhanced parameter efficiency in continual learning. However, with an increasing number of tasks, the number of adapter parameters still grows. To further enhance parameter efficiency, we explore a scenario where the number of adapters is constrained. We then propose a novel adapter fusion approach within the mixture-of-adapters framework to identify and aggregate resembling adapters based on knowledge distillation [8]. This involves transferring the knowledge of a selected old adapter $\theta^{e^*}_{\text{adapter}}$ to the new adapter $\theta^{t}_{\text{adapter}}$ by replaying old task samples when the model capacity $E$ is reached, i.e., when $t \geq E + 1$ (see Figure 2).

Algorithm 1 Remembering Transformer with Adapter Fusion
1: for each task $t = 1, 2, \ldots, N$ do
2:   Initialize a new autoencoder $\theta^t_{\text{AE}}: \{G_t, F_t\}$.
3:   $\ell_t = \frac{1}{K_t}\sum_{i=1}^{K_t} (\sigma(v_i^t) - G_t F_t \sigma(v_i^t))^2$.
4:   $\hat{\theta}^t_{\text{AE}} = \arg\min_{\theta^t_{\text{AE}}} \ell_t$.
5:   Add $\hat{\theta}^t_{\text{AE}}$ to the list of AEs $\{\theta^1_{\text{AE}}, \theta^2_{\text{AE}}, \ldots, \theta^{t-1}_{\text{AE}}\}$.
6:   For $y \in Y^t$, $\mu_y \leftarrow \frac{1}{K_t^y}\sum_{i=1}^{K_t^y} \hat{F}_t \sigma(v_{y,i}^t)$.
7:   $\Xi_t \leftarrow \{\text{top-}\frac{M}{U}(v_{y,i}^t)\}_{y \in Y^t}$ by ranking $\|\mu_y - \hat{F}_t \sigma(v_{y,i}^t)\|$ in ascending order.
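A compact per-task driver mirroring steps 2-5 of Algorithm 1; it reuses the illustrative LowRankAE helper from the routing sketch above, and the optimizer choice is an assumption (the 10-epoch and 0.005-learning-rate defaults follow the training details reported in the experiments):

```python
# Per-task autoencoder training (Algorithm 1, steps 2-5); illustrative only.
import torch

def learn_autoencoder_for_task(ae, task_loader, autoencoders, ae_epochs=10, lr=5e-3):
    """ae: a fresh low-rank autoencoder exposing recon_loss(v) (e.g., the LowRankAE sketch
    above); task_loader yields (tokens, labels) with embedding-layer tokens V^t."""
    opt = torch.optim.Adam(ae.parameters(), lr=lr)    # optimizer choice is an assumption
    for _ in range(ae_epochs):                        # steps 3-4: minimize the loss in Eq. (1)
        for v, _ in task_loader:
            opt.zero_grad()
            ae.recon_loss(v).backward()
            opt.step()
    autoencoders.append(ae)                           # step 5: extend the list of AEs
    # Steps 6-7 (class centers and top-M/U replay selection) follow the Herding
    # procedure described in the next paragraph.
    return ae
```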
Furthermore, to reduce the computational cost of knowledge distillation, we leverage the Herding method [29, 22] to obtain a small set of the most representative replay samples of each task during training. In particular, given tokens from a specific class $y \in Y^t$ of task $t$, i.e., $V_y^t = \{v_{y,1}^t, v_{y,2}^t, \ldots, v_{y,K_t^y}^t\}$, where $K_t^y$ is the total number of samples from class $y$, we compute the class center $\mu_y$ based on the latent representations learned by the autoencoder $\hat{\theta}^t_{\text{AE}} := \{\hat{G}_t, \hat{F}_t\}$. The class center in the latent space is $\mu_y \leftarrow \frac{1}{K_t^y}\sum_{i=1}^{K_t^y} \hat{F}_t \sigma(v_{y,i}^t)$. Then, the distance of each token's latent representation to the class center, $\|\mu_y - \hat{F}_t \sigma(v_{y,i}^t)\|$, is ranked in ascending order. For each class $y \in Y^t$, we select and add the top-$\frac{M}{U}$ tokens ($U$ is the number of classes and $M$ is the task replay memory size) with the least distance to the class center in the latent space to the replay memory $\Xi_t$ of task $t$, i.e., $\Xi_t \leftarrow \{\text{top-}\frac{M}{U}(v_{y,i}^t)\}_{y \in Y^t}$.
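A minimal sketch of this selection step (function and tensor names are illustrative, not from the released code): tokens are ranked by distance to their class center in the AE latent space, and the $M/U$ closest per class are kept.

```python
# Herding-style replay selection in the AE latent space.
import torch

def build_replay_memory(tokens, labels, F_hat, memory_size):
    """tokens: (K, D); labels: (K,); F_hat: encoder weight (s, D); returns the selected tokens."""
    latents = torch.sigmoid(tokens) @ F_hat.t()            # F̂_t σ(v)
    classes = labels.unique()
    per_class = memory_size // len(classes)                # M / U samples per class
    selected = []
    for y in classes:
        idx = (labels == y).nonzero(as_tuple=True)[0]
        z = latents[idx]
        mu_y = z.mean(dim=0)                               # class center in the latent space
        dist = (z - mu_y).norm(dim=1)                      # ||μ_y − F̂_t σ(v)||
        keep = idx[dist.argsort()[:per_class]]             # ascending order, keep top-M/U
        selected.append(tokens[keep])
    return torch.cat(selected, dim=0)

mem = build_replay_memory(torch.randn(1000, 768), torch.randint(0, 2, (1000,)),
                          torch.randn(8, 768), memory_size=512)
print(mem.shape)                                           # up to 512 selected tokens
```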
The old adapter $e^*$ for knowledge distillation is selected based on the novelty detection mechanism in Eq. (2). We determine the most relevant old adapter for the new task by identifying the minimum reconstruction loss. We compute the soft probability output $f(\Xi_{e^*}; \{\hat{\theta}_{\text{ViT}}, \hat{\theta}^{e^*}_{\text{adapter}}\})$ of the old adapter by replaying samples $\Xi_{e^*}$ from the memory. Then, the new adapter $\theta^{E+1}_{\text{adapter}}$ is trained using its task data $\{V^{E+1}, Y^{E+1}\}$ and the soft probability output of the old task $\{\Xi_{e^*}, f(\Xi_{e^*}; \{\hat{\theta}_{\text{ViT}}, \hat{\theta}^{e^*}_{\text{adapter}}\})\}$. Notably, we employ the cross-entropy loss $\ell_{\text{CE}}(\cdot)$ and the L2 loss $\ell_{\text{L2}}(\cdot)$ for training on the new and old tasks, respectively. We devise the optimization process for the adapter fusion as follows:
$$\ell_{\text{CE}}(\theta^{E+1}_{\text{adapter}}) = -\sum_{i=1}^{K_{E+1}} f(v_i^{E+1}; \{\hat{\theta}_{\text{ViT}}, \theta^{E+1}_{\text{adapter}}\}) \log(y_i^{E+1}),$$
$$\ell_{\text{L2}}(\theta^{E+1}_{\text{adapter}}) = \sum_{i=1}^{M} \left( f(v_i^{e^*}; \{\hat{\theta}_{\text{ViT}}, \hat{\theta}^{e^*}_{\text{adapter}}\}) - f(v_i^{e^*}; \{\hat{\theta}_{\text{ViT}}, \theta^{E+1}_{\text{adapter}}\}) \right)^2, \tag{3}$$
$$\hat{\theta}^{E+1}_{\text{adapter}} = \arg\min_{\theta^{E+1}_{\text{adapter}}} \alpha \cdot \ell_{\text{CE}} + (1-\alpha) \cdot \ell_{\text{L2}}.$$
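A minimal sketch of the combined objective in Eq. (3) (tensor shapes and names are placeholders; $\alpha = 0.5$ follows the coefficient reported in the experiments):

```python
# Cross-entropy on new-task data plus L2 distillation toward the old adapter's soft outputs.
import torch
import torch.nn.functional as F

def fusion_loss(new_logits, new_labels, student_replay_probs, teacher_replay_probs, alpha=0.5):
    """alpha weighs the new-task CE term against the L2 distillation term."""
    ce = F.cross_entropy(new_logits, new_labels)                       # ℓ_CE on {V^{E+1}, Y^{E+1}}
    l2 = ((student_replay_probs - teacher_replay_probs) ** 2).sum()    # ℓ_L2 on replayed Ξ_{e*}
    return alpha * ce + (1.0 - alpha) * l2

loss = fusion_loss(torch.randn(8, 10), torch.randint(0, 10, (8,)),
                   torch.softmax(torch.randn(4, 10), dim=-1),
                   torch.softmax(torch.randn(4, 10), dim=-1))
print(loss.item())
```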
4 Experiments
In this section, we provide a detailed description of the datasets, model architectures, and metrics
used in the experiments. An extensive empirical evaluation including an ablation study is performed
based on a wide range of split and permutation tasks in continual learning. Moreover, we investigate
the Remembering Transformer’s performance when its capacity of adapters is limited. The results
demonstrate that the Remembering Transformer achieves SOTA performance while retaining great
parameter efficiency compared with conventional methods.
4.1 Settings
Datasets We evaluated the model performance on continual learning datasets adapted from CIFAR10, CIFAR100 [15], and Permuted MNIST [26]. We investigated two types of continual learning tasks: (1) the split task, where each dataset is split into several subsets with equal numbers of classes and each subset is treated as a task. For example, we divide the CIFAR-10 dataset into five tasks: the first task consists of classes 0-1, the second task of classes 2-3, and so on. We denote the task of CIFAR-10 in five splits as CIFAR10/5. (2) the permutation task, where the input pixels of each image in the training and test data are shuffled with a random permutation, with a different permutation for each task.
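For the permutation tasks, a minimal sketch (illustrative, not the authors' data pipeline) of drawing one fixed pixel permutation per task and applying it to the images:

```python
# One fixed random pixel permutation per task, applied to both train and test images.
import torch

def make_permutation(num_pixels, seed):
    g = torch.Generator().manual_seed(seed)          # a different seed per task
    return torch.randperm(num_pixels, generator=g)

def apply_permutation(images, perm):
    """images: (B, C, H, W); shuffle pixels with the task's fixed permutation."""
    return images.flatten(1)[:, perm].view_as(images)

images = torch.rand(8, 1, 28, 28)                    # e.g., MNIST-sized inputs
perm = make_permutation(images[0].numel(), seed=3)   # task 3's permutation
print(apply_permutation(images, perm).shape)         # torch.Size([8, 1, 28, 28])
```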
Metrics For the split tasks, we measure the average task accuracy over all tasks after training, i.e., $\text{Acc} = \frac{1}{N}\sum_{i=1}^{N} \text{Acc}_i$, where $\text{Acc}_i$ is the accuracy on the $i$th task after learning the final task $T^{(N)}$. Moreover, we evaluate the memory footprint of a method based on its trainable model parameter size. For the permutation tasks, we evaluate the average task accuracy each time the model learns a new permutation. We then report the mean and standard deviation of three individual experiments with random seeds, with each seed resampling the splits to avoid any favorable class split and resampling the permutation patterns to avoid any favorable permutation.
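As a small worked example of the reported metric (accuracy values below are placeholders), the average task accuracy is simply the mean of the per-task accuracies measured after the final task:

```python
# Average task accuracy Acc = (1/N) * sum_i Acc_i, evaluated after learning task T^(N).
def average_task_accuracy(per_task_acc):
    return sum(per_task_acc) / len(per_task_acc)

print(average_task_accuracy([0.993, 0.981, 0.975, 0.990, 0.987]))   # 0.9852
```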
Table 1: Average task accuracy in the split tasks.
Method CIFAR10/5 (%) CIFAR100/10 (%) CIFAR100/20 (%) Average (%)
SupSup [30] 26.2 ± 0.46 33.1 ± 0.47 12.3 ± 0.30 23.87
BI-R+SI [26] 41.7 ± 0.25 22.7 ± 0.81 19.1 ± 0.04 27.83
HyperNet [27] 47.4 ± 5.78 29.7 ± 2.19 19.4 ± 1.44 32.17
MUC [19] 53.6 ± 0.95 30.0 ± 1.37 14.4 ± 0.93 32.67
OWM [31] 51.7 ± 0.06 29.0 ± 0.72 24.2 ± 0.11 35.63
PASS [33] 47.3 ± 0.97 36.8 ± 1.64 25.3 ± 0.81 36.47
LwF [17] 54.7 ± 1.18 45.3 ± 0.75 44.3 ± 0.46 48.10
iCaRL [23] 63.4 ± 1.11 51.4 ± 0.99 47.8 ± 0.48 54.20
Mnemonics [18] 64.1 ± 1.47 51.0 ± 0.34 47.6 ± 0.74 54.23
DER++ [4] 66.0 ± 1.27 55.3 ± 0.10 46.6 ± 1.44 55.97
IL2A [32] 92.0 ± 0.23 60.3 ± 0.14 57.9 ± 0.31 70.07
CLOM [13] 88.0 ± 0.48 65.2 ± 0.71 58.0 ± 0.45 70.40
SSRE [34] 90.5 ± 0.61 65.0 ± 0.27 61.7 ± 0.15 72.40
FeTrIL [21] 90.9 ± 0.38 65.2 ± 0.16 61.5 ± 0.73 72.53
Ours 99.3 ± 0.24 86.1 ± 2.09 79.9 ± 3.56 88.43
Figure 3: Test accuracy curves in the CIFAR100 tasks (one panel per task/split).
and a weight decay of 0.01. We employed learning rates of 0.001 and 0.005 for updating the adapters and autoencoders, respectively, and a coefficient of α = 0.5 in the knowledge distillation loss function. We trained the model for 50 epochs in the CIFAR10 split task, 200 epochs in the CIFAR100 split tasks, and 2,000 epochs in the permuted-MNIST task. Additionally, we employed the average along the latent dimension of each token of the embedding layer output to train the autoencoders for enhanced parameter efficiency. The autoencoders were trained once for 10 epochs at the beginning of each task. All experiments were based on PyTorch and four A100 GPUs. The code will be made publicly available.
Accuracy curves Figure 3 shows the accuracy curves of the CIFAR100/10 and CIFAR100/20 split tasks. For comparison, we visualized the performance of FeTrIL [21] and SSRE [34]. We also added an oracle model based on an Independent and Identically Distributed (IID) training setting, where tasks are defined as identical instances of the same dataset with all classes; this is equivalent to fine-tuning the pretrained model on the entire dataset with a single adapter. The Remembering Transformer demonstrated greatly enhanced continual learning ability. In contrast, the existing methods encountered the forgetting problem when the task changed. The Remembering Transformer even outperforms the oracle model in the CIFAR100/10 task, leveraging the proposed modular architecture for reduced interference among classes.

Figure 4: Memory footprint of the comparison models.

Table 2: Model capacity vs. test accuracy.
#Adapters (memory footprint)   Test accuracy (%)
5 (0.37M)                      99.3 ± 0.24
3 (0.22M)                      93.2 ± 0.72
2 (0.15M)                      87.1 ± 0.85

Table 3: Replay memory size vs. test accuracy.
#Replay samples   Test accuracy (%)
512               93.2 ± 0.72
256               90.5 ± 0.41
128               86.6 ± 0.26
Memory footprint We evaluated the memory footprint by measuring the trainable model parameter
size. Figure 4 demonstrated that the Remembering Transformer is parameter-efficient for continual
learning compared to conventional methods, reducing the memory footprint in the CIFAR10 split
task from 11.18M (FeTrIL [21]) to 0.22M. Moreover, although OWM and HyperNet utilized smaller
model sizes, they achieved much worse model performance. The Remembering Transformer achieved
SOTA performance while retaining a small memory footprint.
We studied the efficacy of the Remembering Transformer by conducting ablations on the generative
routing mechanism and varying the model capacity of adapters.
Ablating the generative routing Remembering Transformer leverages the generative routing to
facilitate sparse adapter activation. We investigated whether the Remembering Transformer with the
generative routing ablated could tackle the CIFAR100 split tasks (see Figure 3). The empirical results
demonstrated that the mixture-of-adapters architecture alone could not adapt to the continual learning
task. It highlighted the importance of the generative routing mechanism for a model continually
learning new tasks without forgetting old knowledge.
Varying the model capacity The adapter fusion method leverages knowledge distillation to transfer
old task knowledge to a newly learned adapter for reducing the memory footprint. To assess the
model efficacy under a constrained adapter capacity, we varied the capacity by reducing the adapters
to E = 3, 2 in the CIFAR10/5 split task while employing a replay memory size of M = 512. Table
2 demonstrated that the Remembering Transformer could learn more tasks than the number of available adapters while retaining competitive performance. Moreover, we varied the replay memory size
to M = {128, 256, 512} when employing a model capacity of E = 3 (see Table 3). Consequently,
compared to FeTrIL, which obtained a performance of 90.9% with a memory footprint of 11.18M
(see Table 1), the Remembering Transformer achieved a performance of 93.2% with a much smaller
memory footprint of 0.22M, when E = 3 and M = 512.
Figure 5: Test accuracy curves of the Remembering Transformer compared to the conventional methods for the permuted-MNIST task (one panel per task/permutation).
Acknowledgements
We would like to thank Steve Lin and Zhirong Wu for their insightful discussions and valuable
contributions to this work.
References
[1] Hongjoon Ahn, Sungmin Cha, Donggyu Lee, and Taesup Moon. Uncertainty-based continual
learning with adaptive regularization. Advances in neural information processing systems, 32,
2019.
[2] Rahaf Aljundi, Punarjay Chakravarty, and Tinne Tuytelaars. Expert gate: Lifelong learning
with a network of experts. In CVPR, 2017.
[3] James Urquhart Allingham, Florian Wenzel, Zelda E. Mariet, and et al. Sparse moes meet
efficient ensembles. arXiv:2110.03360, 2021.
[4] Pietro Buzzega, Matteo Boschini, Angelo Porrello, and et al. Dark experience for general
continual learning: a strong, simple baseline. In NeurIPS, 2020.
[5] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, and et al. An image is worth 16x16
words: Transformers for image recognition at scale. In ICLR, 2021.
[6] Beyza Ermis, Giovanni Zappella, Martin Wistuba, and et al. Continual learning with transform-
ers for image classification. In CVPR Workshops, 2022.
[7] Ruy Gómez-Ocádiz, Massimiliano Trippa, Chun-Lei Zhang, Lorenzo Posani, Simona Cocco,
Rémi Monasson, and Christoph Schmidt-Hieber. A synaptic signal for novelty processing in
the hippocampus. Nature Communications, 13(1):4122, 2022.
[8] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural
network. arXiv:1503.02531, 2015.
[9] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang,
Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv
preprint arXiv:2106.09685, 2021.
[10] Wenpeng Hu, Zhou Lin, Bing Liu, and et al. Overcoming catastrophic forgetting for continual
learning via model adaptation. In ICLR, 2019.
[11] Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive
mixtures of local experts. Neural Comput., 3(1):79–87, 1991.
[12] Alex Kafkas and Daniela Montaldi. How do memory systems detect and respond to novelty?
Neuroscience letters, 680:60–68, 2018.
[13] Gyuhak Kim, Sepideh Esmaeilpour, Changnan Xiao, and Bing Liu. Continual learning based
on OOD detection and task masking. In CVPR Workshops, pages 3855–3865. IEEE, 2022.
[14] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins,
Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al.
Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of
sciences, 114(13):3521–3526, 2017.
[15] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
[16] Dharshan Kumaran, Demis Hassabis, and James L McClelland. What learning systems do
intelligent agents need? complementary learning systems theory updated. Trends in cognitive
sciences, 20(7):512–534, 2016.
[17] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE transactions on pattern
analysis and machine intelligence, 40(12):2935–2947, 2017.
[18] Yaoyao Liu, Yuting Su, An-An Liu, Bernt Schiele, and Qianru Sun. Mnemonics training:
Multi-class incremental learning without forgetting. In CVPR, pages 12242–12251. Computer
Vision Foundation / IEEE, 2020.
[19] Yu Liu, Sarah Parisot, Gregory G. Slabaugh, and et al. More classifiers, less forgetting: A
generic multi-classifier paradigm for incremental learning. In ECCV (26), volume 12371 of
Lecture Notes in Computer Science, pages 699–716. Springer, 2020.
[20] James L McGaugh. Memory–a century of consolidation. Science, 287(5451):248–251, 2000.
[21] Grégoire Petit, Adrian Popescu, Hugo Schindler, and et al. Fetril: Feature translation for
exemplar-free class-incremental learning. In WACV, pages 3900–3909. IEEE, 2023.
[22] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. icarl:
Incremental classifier and representation learning. In CVPR, pages 5533–5542. IEEE Computer
Society, 2017.
[23] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. icarl:
Incremental classifier and representation learning. In CVPR, pages 5533–5542. IEEE Computer
Society, 2017.
[24] Carlos Riquelme, Joan Puigcerver, Basil Mustafa, and et al. Scaling vision with sparse mixture
of experts. In NeurIPS, 2021.
[25] Jonathan Schwarz, Wojciech Czarnecki, Jelena Luketina, Agnieszka Grabska-Barwinska,
Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework
for continual learning. In International conference on machine learning, pages 4528–4537.
PMLR, 2018.
[26] Gido M Van de Ven, Hava T Siegelmann, and Andreas S Tolias. Brain-inspired replay for
continual learning with artificial neural networks. Nature communications, 11(1):4069, 2020.
[27] Johannes von Oswald, Christian Henning, João Sacramento, and Benjamin F. Grewe. Continual
learning with hypernetworks. In ICLR, 2020.
[28] Tianchun Wang, Wei Cheng, Dongsheng Luo, and et al. Personalized federated learning via
heterogeneous modular networks. In ICDM, 2022.
[29] Max Welling. Herding dynamical weights to learn. In ICML, volume 382 of ACM International
Conference Proceeding Series, pages 1121–1128. ACM, 2009.
[30] Mitchell Wortsman, Vivek Ramanujan, Rosanne Liu, and et al. Supermasks in superposition. In
NeurIPS, 2020.
[31] Guanxiong Zeng, Yang Chen, Bo Cui, and Shan Yu. Continuous learning of context-dependent processing in neural networks. Nature Machine Intelligence, 1:364–372, 2019.
[32] Fei Zhu, Zhen Cheng, Xu-Yao Zhang, and Chenglin Liu. Class-incremental learning via dual
augmentation. In NeurIPS, pages 14306–14318, 2021.
[33] Fei Zhu, Xu-Yao Zhang, Chuang Wang, and et al. Prototype augmentation and self-supervision
for incremental learning. In CVPR, pages 5871–5880. Computer Vision Foundation / IEEE,
2021.
[34] Kai Zhu, Wei Zhai, Yang Cao, and et al. Self-sustaining representation expansion for non-
exemplar class-incremental learning. In CVPR, pages 9286–9295. IEEE, 2022.