
Remembering Transformer For Continual Learning

Yuwei Sun 1,2, Ippei Fujisawa 1, Arthur Juliani 3, Jun Sakuma 2,4,*, Ryota Kanai 1,*
1 Araya, 2 RIKEN AIP, 3 Microsoft Research, 4 Tokyo Institute of Technology
* Equal advising; Corresponding author: yuwei_sun@araya.org

arXiv:2404.07518v3 [cs.LG] 16 May 2024. Preprint. Under review.

Abstract

Neural networks encounter the challenge of Catastrophic Forgetting (CF) in continual learning, where new task learning interferes with previously learned knowledge. Existing data fine-tuning and regularization methods necessitate task identity information during inference and cannot eliminate interference among different tasks, while soft parameter sharing approaches encounter the problem of an increasing model parameter size. To tackle these challenges, we propose the Remembering Transformer, inspired by the brain's Complementary Learning Systems (CLS). Remembering Transformer employs a mixture-of-adapters architecture and a generative model-based novelty detection mechanism in a pretrained Transformer to alleviate CF. Remembering Transformer dynamically routes task data to the most relevant adapter with enhanced parameter efficiency based on knowledge distillation. We conducted extensive experiments, including ablation studies on the novelty detection mechanism and model capacity of the mixture-of-adapters, in a broad range of class-incremental split tasks and permutation tasks. Our approach demonstrated SOTA performance, surpassing the second-best method by 15.90% in the split tasks and reducing the memory footprint from 11.18M to 0.22M in the five-split CIFAR10 task.

1 Introduction
The Catastrophic Forgetting (CF) problem arises in the sequential learning of neural networks, wherein new task knowledge tends to interfere with old knowledge, causing previously learned tasks to be forgotten. The ability to learn a new task without interfering with previously learned ones is of great importance for achieving continual learning. An intuitive approach to CF is data fine-tuning [28]: a model alleviates the forgetting of previous tasks by replaying and retraining on old task samples while learning a new task. Nevertheless, as the number of encountered tasks increases, this necessitates the storage of a large number of old task samples. In addition, data fine-tuning based on memory replay usually cannot eliminate the CF problem, leading to suboptimal performance.

Figure 1: Remembering Transformer leverages the mixture-of-adapters that are sparsely activated with a novelty detection mechanism.

Biological neural networks exhibit apparent advantages in continual learning via the Complementary Learning Systems (CLS) [20, 16]. In CLS, the hippocampus rapidly encodes task data and then consolidates the task knowledge into the cortex by forming new neural connections. The hippocampus has developed a novelty detection mechanism that facilitates consolidation by switching among neural modules in the cortex for various tasks [12, 7]. To this end, we propose a novel Remembering Transformer inspired by the CLS in the brain. Our approach consolidates a mixture-of-adapters
architecture with generative routing to tackle the challenges in conventional continual learning
methods (see Figure 1). Notably, the mixture-of-adapters architecture incorporates and jointly trains
low-rank adapters with a pretrained Transformer. Then, these adapters are selectively activated for
parameter-efficient fine-tuning through a novelty detection mechanism, where each task’s samples
are encoded in specific autoencoders for effective routing. During inference, the autoencoders, each
encoding different task knowledge, predict routing weights for the various adapters, thus allocating
an input sample to the most relevant adapter.
Moreover, we leverage low-rank autoencoders to facilitate novelty detection based on the magnitudes
of reconstruction losses of various autoencoders, with the current task data as input. The reconstruc-
tion losses represent the similarity between the current task and the old task knowledge encoded in
these autoencoders. In a more challenging setting of limited model parameter capacity, we assume
there is a limit for the number of adapters added to the Transformer. To tackle this challenge, we fur-
ther propose the adapter fusion approach based on knowledge distillation [8] by distilling knowledge
from the most relevant old adapter into a newly learned adapter. The knowledge distillation leverages
a small set of old task replay memory and trains the new adapter on the probability distribution
of the replay memory samples. Consequently, we demonstrate the superiority of Remembering
Transformer in terms of enhanced task accuracy and parameter efficiency compared to a wide range
of conventional continual learning methods. The results indicate that the proposed Remembering
Transformer can achieve competitive performance even with the limited model capacity.
Overall, our main contributions are three-fold:
1) We propose the Remembering Transformer inspired by the Complementary Learning Systems
to tackle the catastrophic forgetting problem in split tasks and permutation tasks. We investigate
two challenging real-world scenarios in continual learning: class-incremental learning without task
identity information and learning with limited model parameter capacity (Section 3.3).
2) We propose the adapter fusion to distill old task knowledge from existing adapters and aggregate
relevant adapters, enhancing parameter efficiency when learning with limited model parameter
capacity (Section 3.4).
3) The empirical experiment results demonstrate the superiority of Remembering Transformer
compared to conventional methods including Feature Translation and Representation Expansion, in
terms of task accuracy and parameter efficiency in various continual learning tasks (Section 4).
The remainder of this paper is structured as follows. Section 2 reviews the most relevant work on
continual learning. Section 3 describes the essential definitions, assumptions, and technical underpinnings of the Remembering Transformer. Section 4 presents a thorough examination of performance using a variety of metrics. Section 5 concludes our findings and outlines future directions.

2 Related work

This section provides a summary and comparison of relevant research on continual learning. To
tackle the catastrophic forgetting (CF) problem, there are approaches including fine-tuning, generative
replay, regularization, and soft parameter sharing. The fine-tuning method retrains the model with
data from previous tasks when training on new task data [28]. Generative replay aims to train a
generative model to reconstruct previous task data for the retraining [10, 2]. However, with the
increasing number of tasks, it becomes less and less feasible to either store or learn a generative
model for the retraining on the previous task data. A recent study [10] showed that the learned generative model suffered from the CF problem as well, leading to degraded quality of
reconstructed task data. Additionally, fine-tuning on each observed task is computationally expensive
for continual learning. To tackle these problems, regularization and soft parameter sharing methods
selectively update model parameters without disturbing parameters important for old tasks [14, 1].
The soft parameter sharing method leverages a modular architecture to learn task-specific neural
module parameters [11, 6, 2]. The isolation of parameters alleviates negative interference among
various tasks. However, the soft parameter sharing method is inefficient in terms of model parameters
and typically requires the task identity information for selecting relevant parameters.
In contrast, the Remembering Transformer leverages a novelty detection method to accurately infer the relevant neural modules without requiring any task identity information. It significantly reduces the model parameter size of each module based on a mixture-of-adapters architecture, wherein a set of task-specific linear transformation matrices is employed in each attention layer of a pretrained Transformer model. By adaptively learning, activating, and fusing these neural modules, it further reduces the memory cost for continual learning.

Figure 2: Adapter fusion based on knowledge distillation with a limited capacity $E$. We update the $(E+1)$-th adapter $\{B_{E+1}^l, A_{E+1}^l\}_{l=1}^{L}$ using new task data and the soft probability output of the old task replay.

3 Methodology
In this section, we delve into a comprehensive exploration of the assumptions and the proposed
Remembering Transformer’s technical underpinnings. These include the incorporation of mixture-of-
adapters in Transformer models, a generative model-based novelty detection for expert routing, and
adapter fusion based on knowledge distillation (see Figure 2). The adapter fusion enables efficient
knowledge retrieval from relevant previously learned adapters while learning new tasks, preventing
catastrophic forgetting of previous tasks.

3.1 Continual learning task setting

We aim to tackle continual learning on a sequence of tasks $\{\mathcal{T}^{(1)}, \mathcal{T}^{(2)}, \ldots, \mathcal{T}^{(N)}\}$, where $\mathcal{T}^{(t)} = \{x_i^t, y_i^t\}_{i=1}^{K_t} \subset \mathcal{T}$ is a labeled dataset including $K_t$ pairs of instances $x^t$ and their corresponding labels $y^t$. We investigate the most challenging class-incremental learning setting in continual learning, where each task $\mathcal{T}^{(t)}$ is defined by a non-overlapping subset of a dataset, $\mathcal{T}^{(t)} \subset \{x_i, y_i\}_{i=1}^{K}$ with $K = \sum_{t=1}^{N} K_t$. The label set $\mathcal{Y}^t \subset \mathcal{Y}$ changes over time, i.e., $\mathcal{Y}^i \cap \mathcal{Y}^j = \emptyset\ \forall i \neq j$. The posterior $P(X|Y)$ is consistent across tasks $\mathcal{T}^{(t)} \subset \mathcal{T}$. During inference, we evaluate on $P(X) \sim \frac{1}{K}\sum_{t=1}^{N} P(X|Y)P_t(Y)$ without the task identity information $t$.
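As a concrete illustration of this setting, the following is a minimal sketch (not the authors' released code; torchvision's CIFAR10 and the helper name make_split_tasks are assumptions) of how a dataset can be partitioned into class-incremental split tasks with disjoint label sets:

```python
# Build class-incremental split tasks, e.g., CIFAR10/5: five tasks, two disjoint classes each.
import numpy as np
from torchvision import datasets, transforms
from torch.utils.data import Subset

def make_split_tasks(dataset, num_tasks, seed=0):
    """Partition a labeled dataset into tasks with disjoint label subsets."""
    labels = np.array(dataset.targets)
    classes = np.unique(labels)
    rng = np.random.default_rng(seed)
    rng.shuffle(classes)                      # resample the class split per seed
    class_groups = np.array_split(classes, num_tasks)
    tasks = []
    for group in class_groups:
        idx = np.where(np.isin(labels, group))[0]
        tasks.append(Subset(dataset, idx))    # Y^i ∩ Y^j = ∅ for i ≠ j
    return tasks

cifar10 = datasets.CIFAR10(root="./data", train=True, download=True,
                           transform=transforms.ToTensor())
tasks = make_split_tasks(cifar10, num_tasks=5)   # CIFAR10/5
```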

3.2 Vision Transformers with mixture-of-adapters

Our method is inspired by the Mixture-of-Experts (MoE) [24, 5, 3], with diverse neural modules
constituting a comprehensive neural network. Different parts of the neural network learn and adapt
independently without disrupting the knowledge that other parts have acquired, thus mitigating
interference between tasks. The intuition is to learn a diverse collection of neural modules based
on a pretrained Vision Transformer (ViT) and adaptively select the most relevant neural module for
adapting to various continual learning tasks.
ViT partitions an image into a sequence of non-overlapping patches. Let $x \in \mathbb{R}^{H \times W \times C}$ be an image input, where $(H, W)$ is the resolution of the image and $C$ is the number of channels. $x$ is separated into a sequence of 2D patches $x_p \in \mathbb{R}^{\frac{HW}{P^2} \times (P^2 \cdot C)}$, where $(P, P)$ is the resolution of each image patch and $\frac{HW}{P^2}$ is the number of patches. These patches are mapped to tokens $v_p \in \mathbb{R}^{\frac{HW}{P^2} \times D}$ with a learnable linear projection. A learnable 1D position embedding $v_0 \in \mathbb{R}^{D}$ is prepended to the tokens to retain positional information, resulting in $v \in \mathbb{R}^{(\frac{HW}{P^2}+1) \times D}$.
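For intuition, a brief sketch of this standard ViT tokenization (generic PyTorch, not the paper's code) is shown below; with $H = W = 224$ and $P = 16$, it yields $HW/P^2 = 196$ patch tokens plus the prepended learnable token, i.e., 197 tokens of dimension $D = 768$:

```python
import torch
import torch.nn as nn

H = W = 224; C = 3; P = 16; D = 768
x = torch.randn(1, C, H, W)                           # a batch with one image
patchify = nn.Conv2d(C, D, kernel_size=P, stride=P)   # learnable linear projection of P x P patches
v_p = patchify(x).flatten(2).transpose(1, 2)          # (1, HW/P^2, D) = (1, 196, 768)
v0 = nn.Parameter(torch.zeros(1, 1, D))               # learnable prepended token
v = torch.cat([v0, v_p], dim=1)                       # (1, HW/P^2 + 1, D) = (1, 197, 768)
print(v.shape)
```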

Moreover, we employ the soft parameter sharing method to enhance ViT's ability for continual learning, alleviating catastrophic forgetting. Notably, for each task $\mathcal{T}^{(t)} \subseteq \mathcal{T}$, we utilize the Low-Rank Adaptation (LoRA) method [9] for efficient fine-tuning using low-rank decomposition matrices. We apply the trainable decomposition matrices to the attention weights of ViT's different layers. For each attention layer $l$, linear transformations are applied to the query, key, value, and output weights $(W_l^Q, W_l^K, W_l^V, W_l^O)$. For $W_l \in \mathbb{R}^{D \times D}$, $W_l \in \{W_l^Q, W_l^K, W_l^V, W_l^O\}$, the parameter update $\Delta W_l$ is computed by $W_l + \Delta W_l = W_l + B^l A^l$, where $A^l \in \mathbb{R}^{r \times D}$, $B^l \in \mathbb{R}^{D \times r}$, and $r$ is a low rank, $r \ll D$. For an input token $v^t$, the output of the $l$-th attention layer is $\hat{W}_l v^t + \Delta W_l v^t = \hat{W}_l v^t + B^l A^l v^t$, where $\hat{\cdot}$ indicates untrainable (frozen) parameters. For each task $\mathcal{T}^{(t)}$, an adapter $\theta_{\text{adapter}}^{t} = \{A_t^l, B_t^l\}_{l=1}^{L}$ is added to the pretrained ViT model $\{\hat{\theta}_{\text{ViT}}, \theta_{\text{adapter}}^{t}\}$, where $L$ represents the total number of attention layers. We formulate the optimization of the adapter based on task $t$'s dataset $\{v_i^t, y_i^t\}_{i=1}^{K_t}$ as follows: $\hat{\theta}_{\text{adapter}}^{t} \leftarrow \arg\min_{\theta_{\text{adapter}}^{t}} -\sum_{i=1}^{K_t} y_i^t \log f(v_i^t; \{\hat{\theta}_{\text{ViT}}, \theta_{\text{adapter}}^{t}\})$. Then, $\hat{\theta}_{\text{adapter}}^{t}$ is added to the list of adapters $\{\theta_{\text{adapter}}^{1}, \theta_{\text{adapter}}^{2}, \ldots, \theta_{\text{adapter}}^{t-1}\}$ to form the mixture-of-adapters architecture.
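A minimal sketch of this per-task LoRA update is given below, assuming a frozen nn.Linear attention projection; the class and variable names (e.g., LoRAAdapter) are illustrative, not the authors' implementation:

```python
import torch
import torch.nn as nn

class LoRAAdapter(nn.Module):
    """Low-rank update ΔW_l = B^l A^l added on top of a frozen attention projection W_l."""
    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, dim) * 0.01)  # A^l ∈ R^{r×D}
        self.B = nn.Parameter(torch.zeros(dim, rank))         # B^l ∈ R^{D×r}, zero-init so ΔW starts at 0

    def forward(self, frozen_proj: nn.Linear, v: torch.Tensor) -> torch.Tensor:
        # Ŵ_l v + B^l A^l v, with the pretrained projection kept untrainable.
        return frozen_proj(v) + v @ self.A.t() @ self.B.t()

proj = nn.Linear(768, 768)
proj.requires_grad_(False)                 # Ŵ_l: frozen pretrained weights
adapter = LoRAAdapter(dim=768, rank=4)     # θ_adapter^t for one attention layer l
out = adapter(proj, torch.randn(1, 197, 768))
```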

3.3 Generative model-based novelty detection for expert routing

Utilizing the input tokens, our goal is to adaptively activate the most relevant adapters within the
mixture-of-adapters framework. We propose a novelty detection mechanism based on generative
models for effective adapter routing. In particular, a routing neural network outputs gating weights
Wg ∈ RE that indicate the probability of employing a specific adapter for a given task, where E
represents the number of existing adapters. Conventional routing functions in Mixture-of-Experts
(MoE) usually struggle with the more complex continual learning tasks, where sequential task inputs
necessitate a continual update of the gating weights leading to forgetting of the routing neural
networks.
A generative model encodes tokens from a specific task and assesses the familiarity of a new task in relation to the encoded knowledge of the old task. Notably, to facilitate efficient routing, we employ a low-rank autoencoder $\theta_{\text{AE}}$ that consists of an encoder and a decoder, each of which leverages one linear transformation layer, $F \in \mathbb{R}^{s \times D}$ and $G \in \mathbb{R}^{D \times s}$, where $s$ is a low rank, $s \ll D$. For each task $t$, we train the AE to encode the embedding layer output $V^t$, which is flattened and passed through a Sigmoid activation $\sigma(\cdot)$: $\hat{V}^t = G_t F_t \sigma(V^t)$. The Sigmoid activation constrains the reconstructed tokens from the AE to the range between 0 and 1 to facilitate learning. Then, the AE is updated based on the mean squared error loss in Eq. (1):

$$\ell_t(\theta_{\text{AE}}^{t}) = (\sigma(V^t) - \hat{V}^t)^2, \qquad \hat{\theta}_{\text{AE}}^{t} = \arg\min_{\theta_{\text{AE}}^{t}} \ell_t. \qquad (1)$$

During inference, the collection of AEs $\hat{\theta}_{\text{AE}}^{e} \in \{\hat{G}_e, \hat{F}_e\}_{e=1}^{E}$ trained on various tasks provides estimates of the input tokens $v$'s novelty in relation to the previously learned tasks based on the computed reconstruction loss $\ell_e(v)$. In particular, when the input closely resembles an old task learned by an existing AE, the reconstruction loss for the input task by this AE tends to be low. Consequently, the old task $e^*$ corresponding to the AE with the minimum reconstruction loss is the most relevant to the input task. Then, the adapter $\theta_{\text{adapter}}^{e^*}$ is leveraged to adapt the pretrained ViT for tackling the input task. We devise the proposed novelty detection mechanism as follows:

$$\ell_e = (\sigma(v) - \hat{G}_e \hat{F}_e \sigma(v))^2, \quad e^* = \arg\min_e \ell_e, \quad W_g(e) = \begin{cases} 1 & \text{if } e = e^* \\ 0 & \text{otherwise.} \end{cases} \qquad (2)$$

The output $v_l$ of the $l$-th attention layer is then computed by $\hat{W}_l v_{l-1} + \sum_{e=1}^{E} W_g(e) B_e^l A_e^l v_{l-1}$, where $v_{l-1}$ is the output of the previous $(l-1)$-th attention layer. Additionally, for enhanced parameter efficiency, the routing model $\{\theta_{\text{AE}}^{t}\}_{t=1}^{N}$ is globally shared across different attention layers.
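The following is a minimal sketch of this generative routing (illustrative names such as LowRankAE and route; not the authors' implementation): one single-layer low-rank autoencoder per task scores the input by reconstruction loss, and the adapter of the lowest-loss task is activated.

```python
import torch
import torch.nn as nn

class LowRankAE(nn.Module):
    """Single-linear-layer encoder F ∈ R^{s×D} and decoder G ∈ R^{D×s}."""
    def __init__(self, dim: int, latent: int = 1):
        super().__init__()
        self.F = nn.Linear(dim, latent, bias=False)   # encoder
        self.G = nn.Linear(latent, dim, bias=False)   # decoder

    def recon_loss(self, v: torch.Tensor) -> torch.Tensor:
        target = torch.sigmoid(v)                     # σ(v) in [0, 1]
        recon = self.G(self.F(target))                # G F σ(v)
        return ((target - recon) ** 2).mean()         # mean squared error, as in Eq. (1)

def route(v: torch.Tensor, autoencoders: list[LowRankAE]) -> int:
    """Return e* = argmin_e ℓ_e(v), i.e., the index of the most relevant adapter (hard gate W_g)."""
    with torch.no_grad():
        losses = torch.stack([ae.recon_loss(v) for ae in autoencoders])
    return int(torch.argmin(losses))
```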

3.4 Adapter fusion based on knowledge distillation

Remembering Transformer leverages the mixture-of-adapters architecture and the low-rank generative routing for enhanced parameter efficiency in continual learning. However, with the increasing number of tasks, it still incurs a growing number of parameters. To further enhance its parameter efficiency, we explore a scenario where the number of adapters is constrained. We then propose a novel adapter fusion approach within the mixture-of-adapters framework to identify and aggregate resembling adapters based on knowledge distillation [8]. This involves transferring the knowledge of a selected old adapter $\theta_{\text{adapter}}^{e^*}$ to the new adapter $\theta_{\text{adapter}}^{t}$ by replaying old task samples when the model capacity $E$ is reached, i.e., when $t \geq E + 1$ (see Figure 2).

Algorithm 1 Remembering Transformer with Adapter Fusion
 1: for each task $t = 1, 2, \ldots, N$ do
 2:     Initialize a new autoencoder $\theta_{\text{AE}}^{t} = \{G_t, F_t\}$.
 3:     $\ell_t = \frac{1}{K_t}\sum_{i=1}^{K_t} (\sigma(v_i^t) - G_t F_t \sigma(v_i^t))^2$.
 4:     $\hat{\theta}_{\text{AE}}^{t} = \arg\min_{\theta_{\text{AE}}^{t}} \ell_t$.
 5:     Add $\hat{\theta}_{\text{AE}}^{t}$ to the list of AEs $\{\theta_{\text{AE}}^{1}, \theta_{\text{AE}}^{2}, \ldots, \theta_{\text{AE}}^{t-1}\}$.
 6:     For $y \in \mathcal{Y}^t$, $\mu_y \leftarrow \frac{1}{K_t^y}\sum_{i=1}^{K_t^y} \hat{F}_t \sigma(v_{y,i}^t)$.
 7:     $\Xi_t \leftarrow \{\text{top-}\frac{M}{U}(v_{y,i}^t)\}_{y \in \mathcal{Y}^t}$ by ranking $\|\mu_y - \hat{F}_t \sigma(v_{y,i}^t)\|$ in ascending order.
 8:     Add $\Xi_t$ to the replay memory $\{\Xi_1, \Xi_2, \ldots, \Xi_{t-1}\}$.
 9:     for each AE $e = 1, 2, \ldots, t-1$ in the AE list do
10:         $\ell_e = (\sigma(v_i^t) - \hat{G}_e \hat{F}_e \sigma(v_i^t))^2$.
11:     end for
12:     Obtain the index of the most relevant AE: $e^* = \arg\min_e \ell_e$.
13:     if $t > E$ then
14:         Replay old task $e^*$ samples from the memory: $\{v_i^{e^*}\}_{i=1}^{M} \leftarrow \Xi_{e^*}$.
15:         $\ell_{L2}(\theta_{\text{adapter}}^{t}) = \sum_{i=1}^{M} \big(f(v_i^{e^*}; \{\hat{\theta}_{\text{ViT}}, \hat{\theta}_{\text{adapter}}^{e^*}\}) - f(v_i^{e^*}; \{\hat{\theta}_{\text{ViT}}, \theta_{\text{adapter}}^{t}\})\big)^2$.
16:         $\ell_{CE}(\theta_{\text{adapter}}^{t}) = -\sum_{i=1}^{K_t} y_i^t \log f(v_i^t; \{\hat{\theta}_{\text{ViT}}, \theta_{\text{adapter}}^{t}\})$.
17:         $\hat{\theta}_{\text{adapter}}^{t} \leftarrow \arg\min_{\theta_{\text{adapter}}^{t}} \alpha \cdot \ell_{CE} + (1-\alpha) \cdot \ell_{L2}$.
18:         The old adapter is removed and the samples of task $e^*$ are distributed to the newly learned adapter $t$: $\text{Gate}(e^*) \leftarrow t$.
19:     else
20:         Initialize a new adapter $\theta_{\text{adapter}}^{t} = \{A_t^l, B_t^l\}_{l=1}^{L}$.
21:         $\hat{\theta}_{\text{adapter}}^{t} \leftarrow \arg\min_{\theta_{\text{adapter}}^{t}} -\sum_{i=1}^{K_t} y_i^t \log f(v_i^t; \{\hat{\theta}_{\text{ViT}}, \theta_{\text{adapter}}^{t}\})$.
22:         Add $\hat{\theta}_{\text{adapter}}^{t}$ to the list of adapters $\{\theta_{\text{adapter}}^{1}, \theta_{\text{adapter}}^{2}, \ldots, \theta_{\text{adapter}}^{t-1}\}$.
23:     end if
24: end for
Furthermore, to reduce the computational cost of knowledge distillation, we leverage the Herding method [29, 22] to obtain a small set of the most representative replay samples of each task during training. In particular, given tokens from a specific class $y \in \mathcal{Y}^t$ of task $t$, i.e., $V_y^t = \{v_{y,1}^t, v_{y,2}^t, \ldots, v_{y,K_t^y}^t\}$, where $K_t^y$ is the total number of samples from class $y$, we compute the class center $\mu_y$ based on the latent representations learned by the autoencoder $\hat{\theta}_{\text{AE}}^{t} := \{\hat{G}_t, \hat{F}_t\}$. The class center in the latent space is $\mu_y \leftarrow \frac{1}{K_t^y}\sum_{i=1}^{K_t^y} \hat{F}_t \sigma(v_{y,i}^t)$. Then, the distance of each token's latent representation to the class center, $\|\mu_y - \hat{F}_t \sigma(v_{y,i}^t)\|$, is ranked in ascending order. For each class $y \in \mathcal{Y}^t$, we select the top-$\frac{M}{U}$ tokens ($U$ is the number of classes and $M$ is the task replay memory size) with the least distance to the class center in the latent space and add them to the replay memory $\Xi_t$ of task $t$, i.e., $\Xi_t \leftarrow \{\text{top-}\frac{M}{U}(v_{y,i}^t)\}_{y \in \mathcal{Y}^t}$.
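A minimal sketch of this latent-space selection is shown below (the helper name select_replay_memory is assumed, not the authors' code); it implements the ranking described above, keeping the M/U tokens per class closest to the class center:

```python
import torch

def select_replay_memory(tokens: torch.Tensor, labels: torch.Tensor,
                         encoder, memory_size: int) -> dict[int, torch.Tensor]:
    """tokens: (K_t, D); encoder: the frozen task autoencoder's F̂_t; memory_size: M."""
    classes = labels.unique().tolist()
    per_class = memory_size // len(classes)          # M / U samples per class
    memory = {}
    with torch.no_grad():
        latents = encoder(torch.sigmoid(tokens))     # F̂_t σ(v)
    for y in classes:
        z = latents[labels == y]
        mu = z.mean(dim=0)                           # class center μ_y
        dist = (z - mu).norm(dim=1)                  # ||μ_y − F̂_t σ(v)||
        keep = dist.argsort()[:per_class]            # ascending order, least distance first
        memory[y] = tokens[labels == y][keep]
    return memory

# Example: 1000 tokens of dim 768, 10 classes, memory size M = 512.
enc = torch.nn.Linear(768, 1, bias=False)
mem = select_replay_memory(torch.randn(1000, 768), torch.randint(0, 10, (1000,)), enc, 512)
```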

The old adapter $e^*$ for knowledge distillation is selected based on the novelty detection mechanism in Eq. (2). We determine the most relevant old adapter for the new task by identifying the minimum reconstruction loss. We compute the soft probability output $f(\Xi_{e^*}; \{\hat{\theta}_{\text{ViT}}, \hat{\theta}_{\text{adapter}}^{e^*}\})$ of the old adapter by replaying samples $\Xi_{e^*}$ from the memory. Then, the new adapter $\theta_{\text{adapter}}^{E+1}$ is trained using its task data $\{V^{E+1}, Y^{E+1}\}$ and the soft probability output of the old task $\{\Xi_{e^*}, f(\Xi_{e^*}; \{\hat{\theta}_{\text{ViT}}, \hat{\theta}_{\text{adapter}}^{e^*}\})\}$. Notably, we employ the cross-entropy loss $\ell_{CE}(\cdot)$ and the L2 loss $\ell_{L2}(\cdot)$ for training on the new and old tasks, respectively. We devise the optimization process for the adapter fusion as follows:

$$\begin{aligned}
\ell_{CE}(\theta_{\text{adapter}}^{E+1}) &= -\sum_{i=1}^{K_{E+1}} y_i^{E+1} \log f(v_i^{E+1}; \{\hat{\theta}_{\text{ViT}}, \theta_{\text{adapter}}^{E+1}\}), \\
\ell_{L2}(\theta_{\text{adapter}}^{E+1}) &= \sum_{i=1}^{M} \Big(f(v_i^{e^*}; \{\hat{\theta}_{\text{ViT}}, \hat{\theta}_{\text{adapter}}^{e^*}\}) - f(v_i^{e^*}; \{\hat{\theta}_{\text{ViT}}, \theta_{\text{adapter}}^{E+1}\})\Big)^2, \\
\hat{\theta}_{\text{adapter}}^{E+1} &= \arg\min_{\theta_{\text{adapter}}^{E+1}} \alpha \cdot \ell_{CE} + (1-\alpha) \cdot \ell_{L2},
\end{aligned} \qquad (3)$$

where $\alpha$ is a coefficient to balance the two loss terms.

Additionally, after adapter fusion, the old adapter is removed, and the samples of task $e^*$ are distributed to the newly learned adapter $E+1$ for inference, i.e., $\text{Gate}(e^*) \leftarrow E+1$, where $\text{Gate}(\cdot)$ is a function that projects the result of novelty detection to the defined new gate. Note that $\text{Gate}(\cdot)$ is initialized with an identity projection, $\text{Gate}_{\text{init}}(e) = e$. We can formulate the routing weights as follows: $W_g(e) = \begin{cases} 1 & \text{if } e = \text{Gate}(e^*) \\ 0 & \text{otherwise.} \end{cases}$
We devise the algorithm for the Remembering Transformer in Algorithm 1.
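To make the fusion objective in Eq. (3) concrete, here is a minimal sketch (assumed function and argument names, not the authors' code) of combining the new-task cross-entropy with the L2 distillation term on the replayed samples:

```python
import torch
import torch.nn.functional as F

def adapter_fusion_loss(new_task_logits, new_task_labels,
                        replay_logits_new, replay_logits_old, alpha=0.5):
    """α · ℓ_CE (new task) + (1 − α) · ℓ_L2 (old-task replay), following Eq. (3)."""
    # ℓ_CE: cross-entropy of the new adapter on new-task data.
    ce = F.cross_entropy(new_task_logits, new_task_labels)
    # ℓ_L2: squared difference between the old adapter's (frozen) soft outputs and
    # the new adapter's soft outputs on the replayed samples Ξ_{e*}.
    old_probs = replay_logits_old.detach().softmax(dim=1)
    new_probs = replay_logits_new.softmax(dim=1)
    l2 = ((old_probs - new_probs) ** 2).sum()
    return alpha * ce + (1 - alpha) * l2

# Example shapes: 32 new-task samples, M = 16 replayed samples, 10 classes.
loss = adapter_fusion_loss(torch.randn(32, 10), torch.randint(0, 10, (32,)),
                           torch.randn(16, 10), torch.randn(16, 10))
```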

4 Experiments
In this section, we provide a detailed description of the datasets, model architectures, and metrics
used in the experiments. An extensive empirical evaluation including an ablation study is performed
based on a wide range of split and permutation tasks in continual learning. Moreover, we investigate
the Remembering Transformer’s performance when its capacity of adapters is limited. The results
demonstrate that the Remembering Transformer achieves SOTA performance while retaining great
parameter efficiency compared with conventional methods.

4.1 Settings

Datasets We evaluated the model performance based on continual learning datasets adapted from
CIFAR10, CIFAR100 [15], and Permuted MNIST [26] tasks. We investigated two types of continual
learning datasets: (1) the split task where each dataset is split into several subsets of equal numbers
of classes, with each subset as a task. For example, we divide the CIFAR-10 dataset into five tasks.
The first task consists of classes 0–1, the second task consists of classes 2–3, and so on. We denote the task of CIFAR-10 in five splits as CIFAR10/5. (2) the permutation task, where
the input pixels of an image in the training and test data are shuffled with a random permutation, with
a different permutation for each task.

Metrics For the split tasks, we measure the average task accuracy over all tasks after training, i.e., $\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{Acc}_i$, where $\mathrm{Acc}_i$ is the accuracy on the $i$-th task after learning the final task $\mathcal{T}^{(N)}$.
Moreover, we evaluate the memory footprint of a method based on its trainable model parameter size.
For the permutation tasks, we evaluate the average task accuracy each time the model learns a new
permutation. We then report the mean and standard deviation of three individual experiments with
random seeds, with each seed resampling the splits to avoid any favorable class split and resampling
the permutation patterns to avoid any favorable permutation.

Table 1: Average task accuracy in the split tasks.
Method          CIFAR10/5 (%)   CIFAR100/10 (%)   CIFAR100/20 (%)   Average (%)
SupSup [30]     26.2 ± 0.46     33.1 ± 0.47       12.3 ± 0.30       23.87
BI-R+SI [26]    41.7 ± 0.25     22.7 ± 0.81       19.1 ± 0.04       27.83
HyperNet [27]   47.4 ± 5.78     29.7 ± 2.19       19.4 ± 1.44       32.17
MUC [19]        53.6 ± 0.95     30.0 ± 1.37       14.4 ± 0.93       32.67
OWM [31]        51.7 ± 0.06     29.0 ± 0.72       24.2 ± 0.11       35.63
PASS [33]       47.3 ± 0.97     36.8 ± 1.64       25.3 ± 0.81       36.47
LwF [17]        54.7 ± 1.18     45.3 ± 0.75       44.3 ± 0.46       48.10
iCaRL [23]      63.4 ± 1.11     51.4 ± 0.99       47.8 ± 0.48       54.20
Mnemonics [18]  64.1 ± 1.47     51.0 ± 0.34       47.6 ± 0.74       54.23
DER++ [4]       66.0 ± 1.27     55.3 ± 0.10       46.6 ± 1.44       55.97
IL2A [32]       92.0 ± 0.23     60.3 ± 0.14       57.9 ± 0.31       70.07
CLOM [13]       88.0 ± 0.48     65.2 ± 0.71       58.0 ± 0.45       70.40
SSRE [34]       90.5 ± 0.61     65.0 ± 0.27       61.7 ± 0.15       72.40
FeTrIL [21]     90.9 ± 0.38     65.2 ± 0.16       61.5 ± 0.73       72.53
Ours            99.3 ± 0.24     86.1 ± 2.09       79.9 ± 3.56       88.43

Figure 3: Test accuracy curves in the CIFAR100 tasks (panels: Task/Split 1 and Task/Split 2).

Hyperparameters We employed the base-size Vision Transformer pretrained on ImageNet-21k, with a resolution of 224x224 pixels and a patch size of 16 × 16. For the detailed architecture, we followed the default author-suggested settings [5]. We employed a two-layer autoencoder with a latent dimension of one for all split tasks (5-20 splits) and a four-layer autoencoder with a hidden dimension of 32 and a latent dimension of one for the permutation task (50 permutations). Moreover, the hyperparameters were chosen based on a grid search. Batch sizes of 128 and 1 were employed for training and test, respectively. We utilized the AdamW optimizer with β1 = 0.9, β2 = 0.999, and a weight decay of 0.01. We employed learning rates of 0.001 and 0.005 for updating the adapters and autoencoders, respectively. We employed a coefficient of 0.5 in the loss function of the knowledge distillation. We trained the model for 50 epochs in the CIFAR10 split task, 200 epochs in the CIFAR100 split tasks, and 2000 epochs in the permuted-MNIST task. Additionally, we employed the average along the latent dimension of each token of the embedding layer output to train the autoencoders for enhanced parameter efficiency. The autoencoders were trained once for 10 epochs at the beginning of each task. All experiments were based on PyTorch and four A100 GPUs. The code will be made publicly available.
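As a sketch of the optimizer setup described above (placeholder parameter groups; adapter_params and autoencoder_params are assumed names, not the authors' code):

```python
import torch
import torch.nn as nn

# Placeholder parameter groups standing in for the LoRA adapters and the routing autoencoders.
adapter_params = nn.ParameterList([nn.Parameter(torch.zeros(768, 4))])
autoencoder_params = nn.ParameterList([nn.Parameter(torch.zeros(768, 1))])

optimizer = torch.optim.AdamW(
    [
        {"params": adapter_params, "lr": 1e-3},      # adapters: learning rate 0.001
        {"params": autoencoder_params, "lr": 5e-3},  # autoencoders: learning rate 0.005
    ],
    betas=(0.9, 0.999),
    weight_decay=0.01,
)
```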

4.2 Class-incremental split tasks

Average task accuracy We train each task for $\frac{\text{total epochs}}{\#\text{tasks}}$ epochs before the model moves on to a different task. We conducted a comprehensive evaluation in the CIFAR10/5, CIFAR100/10, and CIFAR100/20 split tasks based on the average task accuracy, comparing to a wide range of conventional methods. Table 1 demonstrated that Remembering Transformer significantly enhanced model performance in the challenging class-incremental learning setting without task identity information.
Remembering Transformer surpassed the second-best method, FeTrIL [21], by 15.90%, establishing
a SOTA performance.

Accuracy curves Figure 3 demonstrated the accuracy curves of the CIFAR100/10 and CI-
FAR100/20 split tasks. For comparison, we visualized the performance of FeTrIL [21] and SSRE
[34]. We added an oracle model based on an Independent and Identically Distributed (IID) training
setting where tasks are defined as identical instances of the same dataset with all classes. This is
equivalent to finetuning the pretrained model on the entire dataset based on a single adapter. The
Remembering Transformer demonstrated greatly enhanced continual learning ability. In contrast,

Figure 4: Memory footprint of the comparison models.

Table 2: Model capacity vs. test accuracy.
#Adapters (memory footprint)   Test accuracy (%)
5 (0.37M)                      99.3 ± 0.24
3 (0.22M)                      93.2 ± 0.72
2 (0.15M)                      87.1 ± 0.85

Table 3: Replay memory size vs. test accuracy.
#Replay samples   Test accuracy (%)
512               93.2 ± 0.72
256               90.5 ± 0.41
128               86.6 ± 0.26

the existing methods encountered the forgetting problem when the task changed. The Remembering
Transformer even outperforms the oracle model in the CIFAR100/10 task leveraging the proposed
modular architecture for reduced interference among classes.

Memory footprint We evaluated the memory footprint by measuring the trainable model parameter
size. Figure 4 demonstrated that the Remembering Transformer is parameter-efficient for continual
learning compared to conventional methods, reducing the memory footprint in the CIFAR10 split
task from 11.18M (FeTrIL [21]) to 0.22M. Moreover, although OWM and HyperNet utilized smaller
model sizes, they achieved much worse model performance. The Remembering Transformer achieved
SOTA performance while retaining a small memory footprint.

4.3 Ablation studies

We studied the efficacy of the Remembering Transformer by conducting ablations on the generative
routing mechanism and varying the model capacity of adapters.

Ablating the generative routing Remembering Transformer leverages the generative routing to
facilitate sparse adapter activation. We investigated whether the Remembering Transformer with the
generative routing ablated could tackle the CIFAR100 split tasks (see Figure 3). The empirical results
demonstrated that the mixture-of-adapters architecture alone could not adapt to the continual learning
task. It highlighted the importance of the generative routing mechanism for a model continually
learning new tasks without forgetting old knowledge.

Varying the model capacity The adapter fusion method leverages knowledge distillation to transfer
old task knowledge to a newly learned adapter for reducing the memory footprint. To assess the
model efficacy under a constrained adapter capacity, we varied the capacity by reducing the adapters
to E = 3, 2 in the CIFAR10/5 split task while employing a replay memory size of M = 512. Table
2 demonstrated that the Remembering Transformer could learn more tasks than the number of available adapters while retaining competitive performance. Moreover, we varied the replay memory size
to M = {128, 256, 512} when employing a model capacity of E = 3 (see Table 3). Consequently,
compared to FeTrIL, which obtained a performance of 90.9% with a memory footprint of 11.18M
(see Table 1), the Remembering Transformer achieved a performance of 93.2% with a much smaller
memory footprint of 0.22M, when E = 3 and M = 512.

Figure 5: Test accuracy curves of the Remembering Transformer compared to the conventional methods for the permuted-MNIST task (panels: Task/Permutation 1 and Task/Permutation 2).

4.4 Permutation tasks

We evaluated the Remembering Transformer on the permuted-MNIST dataset. Each task consists of all 10 digits, with a specific permutation shift applied to the image pixels. During inference, the model is given samples featuring the same set of permutation patterns but lacks identity information on the various permutations. Notably, we employed a total of 50 different random permutation patterns and trained the model for 2000 epochs, with 40 epochs for each permutation. We evaluated the average task accuracy each time the model learned a new permutation. We demonstrated the Remembering Transformer's performance, comparing to a broad range of conventional methods for this dataset, including Learning without Forgetting [17], Elastic Weight Consolidation (EWC) [14], Online EWC [25], and Brain-Inspired Replay [26]. Figure 5 demonstrated that the Remembering Transformer retained its performance while learning new permutations. While Brain-Inspired Replay also demonstrated competitive performance, it relies on model retraining using replay memory samples from all old tasks. In contrast, our method utilizes only the most relevant old task's replay samples during the adapter fusion when there is a constraint on the model capacity. Furthermore, we visualized the reconstruction losses between pairwise tasks and autoencoders during inference. A small magnitude of the reconstruction loss indicates a high degree of relevance of a task to the knowledge encoded in the specific autoencoder. For visualization, the raw losses were normalized within a range of 0 and 1. Figure 6 demonstrated that the proposed generative routing method facilitated the adaptive selection of the most relevant adapter (along the diagonal) for tackling a specific task.

Figure 6: Heatmap of the reconstruction losses of autoencoders with respect to different tasks during inference time.

5 Limitations and conclusions

The Remembering Transformer leverages the mixture-of-adapters architecture that is seamlessly integrated into conventional vision Transformers for parameter-efficient continual learning. The
novelty detection method identifies task similarities and routes samples to the most relevant adapters.
We demonstrated the superiority of the Remembering Transformer in various class-incremental
split tasks and permutation tasks, resulting in SOTA performance with a small memory footprint.
Remembering Transformer learns a set of adapters to efficiently tackle various tasks, reducing the
memory footprint through the adapter fusion. However, addressing longer task sequences necessitates
learning hierarchical representations within these adapters. This would facilitate more flexible and
efficient knowledge reuse for continual learning.

Acknowledgements
We would like to thank Steve Lin and Zhirong Wu for their insightful discussions and valuable
contributions to this work.

References
[1] Hongjoon Ahn, Sungmin Cha, Donggyu Lee, and Taesup Moon. Uncertainty-based continual
learning with adaptive regularization. Advances in neural information processing systems, 32,
2019.
[2] Rahaf Aljundi, Punarjay Chakravarty, and Tinne Tuytelaars. Expert gate: Lifelong learning
with a network of experts. In CVPR, 2017.
[3] James Urquhart Allingham, Florian Wenzel, Zelda E. Mariet, and et al. Sparse moes meet
efficient ensembles. arXiv:2110.03360, 2021.
[4] Pietro Buzzega, Matteo Boschini, Angelo Porrello, and et al. Dark experience for general
continual learning: a strong, simple baseline. In NeurIPS, 2020.
[5] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, and et al. An image is worth 16x16
words: Transformers for image recognition at scale. In ICLR, 2021.
[6] Beyza Ermis, Giovanni Zappella, Martin Wistuba, and et al. Continual learning with transform-
ers for image classification. In CVPR Workshops, 2022.
[7] Ruy Gómez-Ocádiz, Massimiliano Trippa, Chun-Lei Zhang, Lorenzo Posani, Simona Cocco,
Rémi Monasson, and Christoph Schmidt-Hieber. A synaptic signal for novelty processing in
the hippocampus. Nature Communications, 13(1):4122, 2022.
[8] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural
network. arXiv:1503.02531, 2015.
[9] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv:2106.09685, 2021.
[10] Wenpeng Hu, Zhou Lin, Bing Liu, and et al. Overcoming catastrophic forgetting for continual
learning via model adaptation. In ICLR, 2019.
[11] Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive
mixtures of local experts. Neural Comput., 3(1):79–87, 1991.
[12] Alex Kafkas and Daniela Montaldi. How do memory systems detect and respond to novelty?
Neuroscience letters, 680:60–68, 2018.
[13] Gyuhak Kim, Sepideh Esmaeilpour, Changnan Xiao, and Bing Liu. Continual learning based
on OOD detection and task masking. In CVPR Workshops, pages 3855–3865. IEEE, 2022.
[14] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins,
Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al.
Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of
sciences, 114(13):3521–3526, 2017.
[15] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
[16] Dharshan Kumaran, Demis Hassabis, and James L McClelland. What learning systems do
intelligent agents need? complementary learning systems theory updated. Trends in cognitive
sciences, 20(7):512–534, 2016.
[17] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE transactions on pattern
analysis and machine intelligence, 40(12):2935–2947, 2017.
[18] Yaoyao Liu, Yuting Su, An-An Liu, Bernt Schiele, and Qianru Sun. Mnemonics training:
Multi-class incremental learning without forgetting. In CVPR, pages 12242–12251. Computer
Vision Foundation / IEEE, 2020.
[19] Yu Liu, Sarah Parisot, Gregory G. Slabaugh, and et al. More classifiers, less forgetting: A
generic multi-classifier paradigm for incremental learning. In ECCV (26), volume 12371 of
Lecture Notes in Computer Science, pages 699–716. Springer, 2020.

[20] James L McGaugh. Memory–a century of consolidation. Science, 287(5451):248–251, 2000.
[21] Grégoire Petit, Adrian Popescu, Hugo Schindler, and et al. FeTrIL: Feature translation for exemplar-free class-incremental learning. In WACV, pages 3900–3909. IEEE, 2023.
[22] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. icarl:
Incremental classifier and representation learning. In CVPR, pages 5533–5542. IEEE Computer
Society, 2017.
[23] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. icarl:
Incremental classifier and representation learning. In CVPR, pages 5533–5542. IEEE Computer
Society, 2017.
[24] Carlos Riquelme, Joan Puigcerver, Basil Mustafa, and et al. Scaling vision with sparse mixture
of experts. In NeurIPS, 2021.
[25] Jonathan Schwarz, Wojciech Czarnecki, Jelena Luketina, Agnieszka Grabska-Barwinska,
Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework
for continual learning. In International conference on machine learning, pages 4528–4537.
PMLR, 2018.
[26] Gido M Van de Ven, Hava T Siegelmann, and Andreas S Tolias. Brain-inspired replay for
continual learning with artificial neural networks. Nature communications, 11(1):4069, 2020.
[27] Johannes von Oswald, Christian Henning, João Sacramento, and Benjamin F. Grewe. Continual
learning with hypernetworks. In ICLR, 2020.
[28] Tianchun Wang, Wei Cheng, Dongsheng Luo, and et al. Personalized federated learning via
heterogeneous modular networks. In ICDM, 2022.
[29] Max Welling. Herding dynamical weights to learn. In ICML, volume 382 of ACM International
Conference Proceeding Series, pages 1121–1128. ACM, 2009.
[30] Mitchell Wortsman, Vivek Ramanujan, Rosanne Liu, and et al. Supermasks in superposition. In
NeurIPS, 2020.
[31] Guanxiong Zeng, Yang Chen, Bo Cui, and Shan Yu. Continuous learning of context-dependent processing in neural networks. Nature Machine Intelligence, 2019.
[32] Fei Zhu, Zhen Cheng, Xu-Yao Zhang, and Chenglin Liu. Class-incremental learning via dual
augmentation. In NeurIPS, pages 14306–14318, 2021.
[33] Fei Zhu, Xu-Yao Zhang, Chuang Wang, and et al. Prototype augmentation and self-supervision
for incremental learning. In CVPR, pages 5871–5880. Computer Vision Foundation / IEEE,
2021.
[34] Kai Zhu, Wei Zhai, Yang Cao, and et al. Self-sustaining representation expansion for non-
exemplar class-incremental learning. In CVPR, pages 9286–9295. IEEE, 2022.
