Published as a conference paper at ICLR 2019

META-LEARNING WITH DIFFERENTIABLE CLOSED-FORM SOLVERS

Luca Bertinetto, FiveAI & University of Oxford, luca@robots.ox.ac.uk
João Henriques, University of Oxford, joao@robots.ox.ac.uk
Philip H.S. Torr, FiveAI & University of Oxford, philip.torr@eng.ox.ac.uk
Andrea Vedaldi, University of Oxford, vedaldi@robots.ox.ac.uk
arXiv:1805.08136v3 [cs.CV] 24 Jul 2019

ABSTRACT

Adapting deep networks to new concepts from a few examples is challenging,
due to the high computational requirements of standard fine-tuning procedures.
Most work on few-shot learning has thus focused on simple learning techniques
for adaptation, such as nearest neighbours or gradient descent. Nonetheless, the
machine learning literature contains a wealth of methods that learn non-deep models
very efficiently. In this paper, we propose to use these fast convergent methods as
the main adaptation mechanism for few-shot learning. The main idea is to teach
a deep network to use standard machine learning tools, such as ridge regression,
as part of its own internal model, enabling it to quickly adapt to novel data. This
requires back-propagating errors through the solver steps. While normally the
cost of the matrix operations involved in such a process would be significant, by
using the Woodbury identity we can make the small number of examples work to
our advantage. We propose both closed-form and iterative solvers, based on ridge
regression and logistic regression components. Our methods constitute a simple
and novel approach to the problem of few-shot learning and achieve performance
competitive with or superior to the state of the art on three benchmarks.

1 INTRODUCTION
Humans can efficiently perform fast mapping (Carey, 1978; Carey & Bartlett, 1978), i.e. learning
a new concept after a single exposure. By contrast, supervised learning algorithms — and neural
networks in particular — typically need to be trained using a vast amount of data in order to generalize
well. This requirement is problematic, as the availability of large labelled datasets cannot always be
taken for granted. Labels can be costly to acquire: in drug discovery, for instance, campaign budgets
often limit researchers to a small amount of biological data that can be used to
form predictions about properties and activities of compounds (Altae-Tran et al., 2017). In other
circumstances, data itself can be scarce, as can happen for example with the problem of classifying
rare animal species, whose exemplars are not easy to observe. Such a scenario, in which just one or
a handful of training examples is provided, is referred to as one-shot or few-shot learning (Miller
et al., 2000; Fei-Fei et al., 2006; Lake et al., 2015; Hariharan & Girshick, 2017) and has recently seen
a tremendous surge in interest within the machine learning community (e.g. Vinyals et al. (2016);
Bertinetto et al. (2016); Ravi & Larochelle (2017); Finn et al. (2017)).
Currently, most methods tackling few-shot learning operate within the general paradigm of meta-
learning, which allows one to develop algorithms in which the process of learning can improve
with the number of training episodes (Thrun, 1998; Vilalta & Drissi, 2002). This can be achieved
by distilling and transferring knowledge across episodes. In practice, for the problem of few-shot
classification, meta-learning is often implemented using two “nested training loops”. The base learner
works at the level of individual episodes, which correspond to learning problems characterised by
having only a small set of labelled training images available. The meta learner, by contrast, learns
from a collection of such episodes, with the goal of improving the performance of the base learner
across episodes.


[Figure 1 diagram. Labels shown: base training-set X with labels Y; CNN Φ; R.R.; W1; base test-set X′
with labels Y′; ω; Λ; cross-entropy loss; Episodes 1, 2, 3, …, N.]

Figure 1: Diagram of the proposed method for one episode, of which several are seen during
meta-training. The task is to learn new classes given just a few sample images per class. In this
illustrative example, there are 3 classes and 2 samples per class, making each episode a 3-way, 2-shot
classification problem. At the base learning level, learning is accomplished by a differentiable ridge
regression layer (R.R.), which computes episode-specific weights (referred to as $w_E$ in Section 3.1
and as W in Section 3.2). At the meta-training level, by back-propagating errors through many of
these small learning problems, we train a network whose weights are shared across episodes, together
with the hyper-parameters of the R.R. layer. In this way, the R.R. base learner can improve its learning
capabilities as the number of experienced episodes increases.

Clearly, in any meta-learning algorithm, it is of paramount importance to choose the base learner
carefully. On one side of the spectrum, methods related to nearest-neighbours, such as learning
similarity functions (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017), are fast but rely solely
on the quality of the similarity metric, with no additional data-dependent adaptation at test-time. On
the other side of the spectrum, methods that optimize standard iterative learning algorithms, such as
backpropagating through gradient descent (Finn et al., 2017; Nichol et al., 2018) or explicitly learning
the learner’s update rule (Hochreiter et al., 2001; Andrychowicz et al., 2016; Ravi & Larochelle,
2017), are slower but allow more adaptability to different problems/datasets.
In this paper, we take a different perspective. As base learners, we propose to adopt simple learning
algorithms that admit a closed-form solution such as ridge regression. Crucially, the simplicity and
differentiability of these solutions allow us to backpropagate through learning problems. Moreover,
these algorithms are particularly suitable for use within a meta-learning framework for few-shot
classification for two main reasons. First, their closed-form solution allows learning problems to be
solved efficiently. Second, in a data regime characterized by few examples of high dimensionality,
the Woodbury identity (Petersen et al., 2008, Chapter 3.2) can be used to obtain a very significant
gain in terms of computational speed.
We demonstrate the strength of our approach by performing extensive experiments on Omniglot (Lake
et al., 2015), CIFAR-100 (Krizhevsky & Hinton, 2009) (adapted to the few-shot problem) and
miniImageNet (Vinyals et al., 2016). Our base learners are fast, simple to implement, and can achieve
performance that is competitive with or superior to the state of the art in terms of accuracy.

2 RELATED WORK

The topic of meta-learning gained importance in the machine learning community several decades ago,
with the first examples already appearing in the eighties and early nineties (Utgoff, 1986; Schmidhuber,
1987; Naik & Mammone, 1992; Bengio et al., 1992; Thrun & Pratt, 1998). Utgoff (1986) proposed a
framework describing when and how it is useful to dynamically adjust the inductive bias of a learning
algorithm, thus implicitly “changing the ordering” of the elements of its hypothesis space (Vilalta &
Drissi, 2002). Later, Bengio et al. (1992) interpreted the update rule of a neural network’s weights
as a function that is learnable. Another seminal work is the one of Thrun (1996), which presents
the so-called lifelong learning scenario, where a learning algorithm gradually encounters an ordered
sequence of learning problems. Throughout this course, the learner can benefit from re-using the
knowledge accumulated during previous tasks. In later work, Thrun & Pratt (1998) stated that an
algorithm is learning to learn if “[...] its performance at each task improves with experience and with
the number of tasks”. This characterisation has been inspired by Mitchell et al. (1997)’s definition of
a learning algorithm as a computer program whose performance on a task improves with experience.
Similarly, Vilalta & Drissi (2002) explained meta-learning as organised in two “nested learning
levels”. At the base level, an algorithm is confined within a limited hypothesis space while solving a
single learning problem. Contrarily, the meta-level can “accrue knowledge” by spanning multiple
problems, so that the hypothesis space at the base level can be adapted effectively.
Arguably, the simplest approach to meta-learning is to train a similarity function by exposing it to
many matching problems (Bromley et al., 1993; Chopra et al., 2005; Koch et al., 2015). Despite
its simplicity, this general strategy is particularly effective and it is at the core of several state-
of-the-art few-shot classification algorithms (Vinyals et al., 2016; Snell et al., 2017; Sung et al.,
2018). Interestingly, Garcia & Bruna (2018) interpret learning as information propagation from
support (training) to query (test) images and propose a graph neural network that can generalize
matching-based approaches. Since this line of work relies on learning a similarity metric, one
distinctive characteristic is that parameter updates only occur within the long time horizon of the
outer training loop. While this can clearly spare costly computations, it also prevents these methods
from performing adaptation at test time. A possible way to overcome the lack of adaptability is
to train a neural network capable of predicting (some of) its own parameters. This technique was
first introduced by Schmidhuber (1992; 1993) and recently revisited by Bertinetto et al. (2016)
and Munkhdalai & Yu (2017). Rebuffi et al. (2017) showed that a similar approach can be used to
adapt a neural network, on the fly, to entirely different visual domains.
Another popular approach to meta-learning is to interpret the gradient update of SGD as a parametric
and learnable function rather than a fixed ad-hoc routine. Younger et al. (2001) and Hochreiter et al.
(2001) observed that, because of the sequential nature of a learning algorithm, a recurrent neural
network can be considered as a meta-learning system. They identify LSTMs as particularly apt
for the task because of their ability to span long-term dependencies, which are essential in order to
meta-learn. A modern take on this idea has been presented by Andrychowicz et al. (2016) and Ravi &
Larochelle (2017), showing benefits on large-scale classification, style transfer and few-shot learning.
A recent and promising research direction is the one set by Maclaurin et al. (2015) and by the MAML
algorithm (Finn et al., 2017; Finn & Levine, 2018). Instead of explicitly designing a meta-learner
module for learning the update rule, they backpropagate through the very operation of gradient
descent to optimize for the hyperparameters or the initial parameters of the learner. However, back-
propagation through gradient descent steps is costly in terms of memory, and thus the total number of
steps must be kept small.
To alleviate the drawback of catastrophic forgetting typical of deep neural networks (McCloskey &
Cohen, 1989), several recent methods (Santoro et al., 2016; Kaiser et al., 2017; Munkhdalai & Yu,
2017; Sprechmann et al., 2018) make use of memory-augmented models, which can first retain and
then access important and previously unseen information associated with newly encountered episodes.
While such memory modules store and retrieve information in the long time range, approaches
based on attention like the one of Vinyals et al. (2016) are useful to specify the most relevant pieces
of knowledge within an episode. Mishra et al. (2018) complemented soft attention with temporal
convolutions (Oord et al., 2016), thus allowing the attention mechanism to access information related
to past episodes.
In this paper, we instead argue for simple, fast and differentiable base learners such as ridge regression.
Compared to nearest-neighbour methods, they allow more flexibility because they produce a different
set of parameters for different episodes ($W_i$ in Figure 1). Compared to methods that adapt SGD,
they exhibit an inherently fast rate of convergence, particularly in cases where a closed form solution
exists. A similar idea has been discussed by Bengio (2000), where the analytic formulations of
zero-gradient solutions are used to obtain meta-gradients analytically and optimize hyper-parameters.
More recently, Ionescu et al. (2015) and Valmadre et al. (2017) have derived backpropagation forms
for the SVD and Correlation Filter, so that SGD can be applied, respectively, to a deep neural network
that computes the solution to either an eigenvalue problem or a system of linear equations where the
data matrix has a circulant structure.


3 METHOD
3.1 META-LEARNING

According to widely accepted definitions of learning (Mitchell, 1980) and meta-learning (Vilalta &
Drissi, 2002; Vinyals et al., 2016), an algorithm is “learning to learn” if it can improve its learning
skills with the number of experienced episodes (by progressively and dynamically modifying its
inductive bias). There are two main components in a meta-learning algorithm: a base learner and a
meta-learner (Vilalta & Drissi, 2002). The base learner works at the level of individual episodes (or
tasks), which in the few-shot scenario correspond to learning problems characterised by having only
a small set of labelled training images available. The meta-learner learns from several such episodes
in sequence with the goal of improving the performance of the base learner across episodes.
In other words, the goal of meta-learning is to enable a base learning algorithm to adapt to new
episodes efficiently by generalizing from a set of training episodes $E \in \mathcal{E}$. $E$ can be modelled as
a probability distribution of example inputs $x \in \mathbb{R}^m$ and outputs $y \in \mathbb{R}^o$, such that we can write
$(x, y) \sim E$.
In the case of few-shot classification, the inputs are represented by a few images belonging to different
unseen classes, while the outputs are the (episode-specific) class labels. It is important not to confuse
the small sets that are used in an episode $E$ with the super-set $\mathcal{E}$ (such as Omniglot or miniImageNet,
Section 4.1) from which they are drawn.
Consider a generic feature extractor, such as commonly used pre-trained networks¹, $\phi(x) : \mathbb{R}^m \to \mathbb{R}^e$.
Then, a much simpler episode-specific predictor $f(\phi(x); w_E) : \mathbb{R}^e \times \mathbb{R}^p \to \mathbb{R}^o$ can be trained to
map input embeddings to outputs. The predictor is parameterized by a set of parameters $w_E \in \mathbb{R}^p$,
which are specific to the episode $E$.
To train and assess the predictor on one episode, we are given access to training samples
$Z_E = \{(x_i, y_i)\} \sim E$ and test samples $Z'_E = \{(x'_i, y'_i)\} \sim E$, sampled independently from the distribution
$E$. We can then use a learning algorithm $\Lambda$ to obtain the parameters $w_E = \Lambda(\phi(Z_E))$, where
$\phi(Z_E) \triangleq \{(\phi(x_i), y_i)\}$. The expected quality of the trained predictor is then computed by a standard
loss or error function $L : \mathbb{R}^o \times \mathbb{R}^o \to \mathbb{R}$, which is evaluated on the test samples $Z'_E$:
$$q(E) = \frac{1}{|Z'_E|} \sum_{(x', y') \in Z'_E} L\left(f(\phi(x'); w_E), y'\right), \quad \text{with } w_E = \Lambda(\phi(Z_E)). \tag{1}$$

Other than abstracting away the complexities of the learning algorithm as Λ, eq. (1) corresponds to the
standard train-test protocol commonly employed in machine learning, here applied to a single episode
E. However, simply re-training a predictor for each episode ignores potentially useful knowledge
that can be transferred between them. For this reason, we now take the step of parameterizing φ
and Λ with two sets of meta-parameters, respectively ω and ρ, which can aid the training procedure.
In particular, ω affects the representation of the input of the base learner algorithm Λ, while ρ
corresponds to its hyper-parameters, which here can be learnt by the meta-learner loop instead of
being manually set, as usually happens in a standard training scenario. These meta-parameters will
affect the generalization properties of the learned predictors. This motivates evaluating the result of
training on a held-out test set $Z'_E$ (eq. (1)). In order to learn ω and ρ, we minimize the expected loss
on held-out test sets over all episodes $E \in \mathcal{E}$:
$$\min_{\omega, \rho} \frac{1}{|\mathcal{E}| \cdot |Z'_E|} \sum_{E \in \mathcal{E}} \sum_{(x', y') \in Z'_E} L\left(f(\phi(x'; \omega); w_E), y'\right), \quad \text{with } w_E = \Lambda(\phi(Z_E; \omega); \rho). \tag{2}$$

Since eq. (2) consists of a composition of non-linear functions, we can leverage the same tools used
successfully in deep learning, namely back-propagation and stochastic gradient descent (SGD), to
optimize it. The main obstacle is to choose a learning algorithm Λ that is amenable to optimization
with such tools. This means that, in practice, Λ must be quite simple.
Examples of meta-learning algorithms. Using eq. 2, it is possible to describe several of the meta-
learning methods in the literature, which mostly differ in the choice of Λ. The feature extractor
φ is typically a standard CNN, whose intermediate layers are trained jointly as ω (and thus are not
episode-specific). The last layer represents the linear predictor f, with episode-specific parameters
$w_E$. In Siamese networks (Bromley et al., 1993; Chopra et al., 2005; Koch et al., 2015), f is a nearest
neighbour classifier, which becomes soft k-means in the semi-supervised setting proposed by Ren
et al. (2018). Ravi & Larochelle (2017) and Andrychowicz et al. (2016) used an LSTM to implement
Λ, while the Learnet (Bertinetto et al., 2016) uses a factorized CNN and MAML (Finn et al., 2017)
implements it using SGD (and furthermore adapts all parameters of the CNN).

¹ Note that in practice we do not use pre-trained networks, but are able to train them from scratch.

Instead, we use simple and fast-converging methods as base learner Λ, namely least-squares based
solutions for ridge regression and logistic regression. In the outer loop, we allow SGD to learn both
the parameters ω of the feature representation of Λ and its hyper-parameters ρ.

3.2 EFFICIENT RIDGE REGRESSION BASE LEARNERS

Similarly to the methods discussed in Section 3.1, over the course of a single episode we adapt a
linear predictor f , which can be considered as the final layer of a CNN. The remaining layers φ
are trained from scratch (within the outer loop of meta-learning) to generalize between episodes,
but for the purposes of one episode they are considered fixed. In this section, we assume that the
inputs were pre-processed by the CNN φ, and that we are dealing only with the final linear predictor
$f(\phi(x)) = \phi(x)W \in \mathbb{R}^o$, where the parameters $w_E$ are reorganized into a matrix $W \in \mathbb{R}^{e \times o}$.
The motivation for our work is that, while not quite as simple as nearest neighbours, least-squares
regressors admit closed-form solutions. Although simple least-squares is prone to overfitting, it is
easy to augment it with L2 regularization (controlled by a positive hyper-parameter λ), in what is
known as ridge regression:
$$\Lambda(Z) = \arg\min_W \|XW - Y\|^2 + \lambda \|W\|^2 \tag{3}$$
$$= (X^T X + \lambda I)^{-1} X^T Y, \tag{4}$$
where $X \in \mathbb{R}^{n \times e}$ and $Y \in \mathbb{R}^{n \times o}$ contain the n sample pairs of input embeddings and outputs from
Z, stacked as rows.
Because ridge regression admits a closed form solution (eq. (4)), it is relatively easy to integrate into
meta-learning (eq. (2)) using standard automatic differentiation packages. The only element that
may have to be treated more carefully is the matrix inversion. When the matrix to invert is close to
singular (which we do not expect when λ > 0), it is possible to achieve more numerically accurate
results by replacing the matrix inverse and vector product with a linear system solver (Murphy, 2012,
7.5.2). In our experiments, the matrices were not close to singular and we did not find this necessary.
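
To make the differentiability of eq. (4) concrete, the sketch below differentiates the closed-form solution with respect to λ by hand and compares the result with a central finite difference. It is only an illustration under stated assumptions: NumPy, the function names `ridge` and `ridge_grad_lambda`, and the toy sizes are choices of this example and do not come from the paper, and in practice an automatic differentiation package would compute this gradient. Writing $M = X^T X + \lambda I$, the derivative used is $\partial W / \partial \lambda = -M^{-1} W$.

```python
import numpy as np

def ridge(X, Y, lam):
    """Closed-form ridge regression, eq. (4), using a linear solve instead of an explicit inverse."""
    M = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(M, X.T @ Y)

def ridge_grad_lambda(X, Y, lam):
    """Analytic derivative dW/dlam = -(X^T X + lam I)^{-1} W (hypothetical helper)."""
    M = X.T @ X + lam * np.eye(X.shape[1])
    return -np.linalg.solve(M, ridge(X, Y, lam))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))   # toy episode: n = 5 samples, e = 8 features (assumed sizes)
Y = np.eye(5)                 # one-hot targets, one sample per class
lam, h = 1.0, 1e-5

analytic = ridge_grad_lambda(X, Y, lam)
numeric = (ridge(X, Y, lam + h) - ridge(X, Y, lam - h)) / (2 * h)
max_gap = np.abs(analytic - numeric).max()   # should be close to zero
```

The same reasoning applies to the gradient with respect to X, which is what allows errors to flow back into the CNN parameters ω.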
Another concern about eq. (4) is that the intermediate matrix $X^T X \in \mathbb{R}^{e \times e}$ grows quadratically
with the embedding size e. Given the high dimensionality of features typically used in deep networks,
the inversion could come at a very expensive cost. To alleviate this, we rely on the Woodbury
formula (Petersen et al., 2008, Chapter 3.2), obtaining:
$$W = \Lambda(Z) = X^T (XX^T + \lambda I)^{-1} Y. \tag{5}$$
The main advantage of eq. (5) is that the intermediate matrix $XX^T \in \mathbb{R}^{n \times n}$ now grows quadratically
with the number of samples in the episode, n. As we are interested in one or few-shot learning, this is
typically very small. The overall cost of eq. (5) is only linear in the embedding size e.
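
The equivalence of eq. (4) and eq. (5) can be checked numerically. The following is a minimal sketch, assuming NumPy and toy sizes (n = 5 samples, e = 200 features); the function names `ridge_primal` and `ridge_woodbury` are hypothetical and are not taken from the authors' code. Both functions use a linear solve rather than an explicit inverse.

```python
import numpy as np

def ridge_primal(X, Y, lam):
    """Eq. (4): solves an e x e system, costly when the embedding size e is large."""
    e = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(e), X.T @ Y)

def ridge_woodbury(X, Y, lam):
    """Eq. (5): solves an n x n system, cheap when the episode has few samples."""
    n = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), Y)

rng = np.random.default_rng(0)
n, e, o = 5, 200, 5                    # assumed toy sizes: 5-way, 1-shot, 200-d features
X = rng.normal(size=(n, e))            # embeddings stacked as rows
Y = np.eye(o)                          # one-hot targets, one sample per class

W_primal = ridge_primal(X, Y, lam=1.0)
W_woodbury = ridge_woodbury(X, Y, lam=1.0)
same_solution = np.allclose(W_primal, W_woodbury)
```

Here the first form factorizes a 200×200 matrix and the second a 5×5 one, while both return the same 200×5 weight matrix up to floating-point error.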
Although this method was originally designed for regression, we found that it works well also in a
(few-shot) classification scenario, where the target outputs are one-hot vectors representing classes.
However, since eq. 4 does not directly produce classification labels, it is important to calibrate its
output for the cross-entropy loss, which is used to evaluate the episode’s test samples (L in eq. 2).
This can be done by simply adjusting our prediction $X'W$ with a scale and a bias $\alpha, \beta \in \mathbb{R}$:
$$\hat{Y} = \alpha X' W + \beta. \tag{6}$$
Note that λ, α and β are hyper-parameters of the base learner Λ and can be learnt by the outer learning
loop represented by the meta-learner, together with the CNN parameters ω.
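
To illustrate why the scale in eq. (6) matters, the sketch below feeds calibrated scores to a softmax cross-entropy. The softmax form of the loss, the function name `calibrated_cross_entropy` and the toy inputs are assumptions of this example. Regression outputs trained on one-hot targets lie roughly in [0, 1], so without scaling the softmax stays fairly flat even when every prediction is correct.

```python
import numpy as np

def calibrated_cross_entropy(X_test, W, Y_test, alpha, beta):
    """Mean softmax cross-entropy of the calibrated scores alpha * X'W + beta, as in eq. (6)."""
    scores = alpha * (X_test @ W) + beta
    scores = scores - scores.max(axis=1, keepdims=True)   # shift for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -(Y_test * log_probs).sum(axis=1).mean()

# Toy case (assumed): three test samples whose raw predictions X'W are exactly one-hot.
X_test = np.eye(3)
W = np.eye(3)
Y_test = np.eye(3)

loss_unscaled = calibrated_cross_entropy(X_test, W, Y_test, alpha=1.0, beta=0.0)
loss_scaled = calibrated_cross_entropy(X_test, W, Y_test, alpha=10.0, beta=0.0)
```

In this toy case the loss is about 0.55 with α = 1 and close to zero with α = 10, although the predicted classes are identical.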

3.3 ITERATIVE BASE LEARNERS AND LOGISTIC REGRESSION

It is natural to ask whether other learning algorithms can be integrated as efficiently as ridge regression
within our meta-learning framework. In general, a similar derivation is possible for iterative solvers,
as long as the operations are differentiable. For linear models with convex loss functions, a better
choice than gradient descent is Newton’s method, which uses curvature (second-order) information to
reach the solution in very few steps. One learning objective of particular interest is logistic regression,
which unlike ridge regression directly produces classification labels, and thus does not require the use
of calibration before the (binary) cross-entropy loss.
When one applies Newton’s method to logistic regression, the resulting algorithm takes a familiar
form — it consists of a series of weighted least squares (or ridge regression) problems, giving it the
name Iteratively Reweighted Least Squares (IRLS) (Murphy, 2012, Chapter 8.3.4). Given inputs
$X \in \mathbb{R}^{n \times e}$ and binary outputs $y \in \{-1, 1\}^n$, the i-th iteration updates the parameters $w_i \in \mathbb{R}^e$ as:
$$w_i = \left(X^T \mathrm{diag}(s_i) X + \lambda I\right)^{-1} X^T \mathrm{diag}(s_i) z_i, \tag{7}$$
where $I$ is an identity matrix, $s_i = \mu_i (1 - \mu_i)$, $z_i = w_{i-1}^T X + (y - \mu_i)/s_i$, and $\mu_i = \sigma(w_{i-1}^T X)$
applies a sigmoid function σ to the predictions using the previous parameters $w_{i-1}$.
Since eq. (7) takes a similar form to ridge regression, we can use it for meta-learning in the same
way as in section 3.2, with the difference that a small number of steps (eq. (7)) must be performed in
order to obtain the final parameters wE . Similarly, at each step i, we obtain a solution with a cost
which is linear rather than quadratic in the embedding size by employing the Woodbury formula:
$$w_i = X^T \left(XX^T + \lambda\, \mathrm{diag}(s_i)^{-1}\right)^{-1} z_i,$$
where the inner inverse has negligible cost since it is a diagonal matrix. Note that a similar strategy
could be followed for other learning algorithms based on IRLS, such as L1 minimization and LASSO.
We take logistic regression to be a sufficiently illustrative example, of particular interest for binary
classification in one/few-shot learning, leaving the exploration of other variants for future work.
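
The two forms of the IRLS update can be compared numerically. The sketch below is an illustration under several assumptions: NumPy, toy data, hypothetical function names, labels encoded as 0/1 as in the standard IRLS derivation (the text above writes them as {−1, 1}), and predictions computed as `X @ w` with samples stacked as rows of X.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_quantities(X, y, w_prev):
    """Weights s_i and working targets z_i computed from the previous parameters."""
    mu = sigmoid(X @ w_prev)
    s = mu * (1.0 - mu)
    z = X @ w_prev + (y - mu) / s
    return s, z

def irls_step(X, y, w_prev, lam):
    """Eq. (7): weighted ridge regression with an e x e linear system."""
    s, z = irls_quantities(X, y, w_prev)
    M = X.T @ (s[:, None] * X) + lam * np.eye(X.shape[1])
    return np.linalg.solve(M, X.T @ (s * z))

def irls_step_woodbury(X, y, w_prev, lam):
    """Woodbury form: an n x n linear system; the inner inverse is diagonal."""
    s, z = irls_quantities(X, y, w_prev)
    M = X @ X.T + lam * np.diag(1.0 / s)
    return X.T @ np.linalg.solve(M, z)

rng = np.random.default_rng(0)
n, e = 6, 50                                   # assumed toy sizes
X = 0.1 * rng.normal(size=(n, e))
y = np.array([0.0, 1.0, 0.0, 1.0, 1.0, 0.0])   # assumed 0/1 label encoding

w_direct = np.zeros(e)
w_woodbury = np.zeros(e)
for _ in range(5):                             # a small number of Newton steps
    w_direct = irls_step(X, y, w_direct, lam=1.0)
    w_woodbury = irls_step_woodbury(X, y, w_woodbury, lam=1.0)

steps_agree = np.allclose(w_direct, w_woodbury)
```

Each step of the second form solves a 6×6 system instead of a 50×50 one, and the two sequences of parameters agree up to floating-point error.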

3.4 TRAINING POLICY

Figure 1 illustrates our overall framework. Like most meta-learning techniques, we organize our
training procedure into episodes, each of which corresponds to a few-shot classification problem. In
standard classification, training requires sampling from a distribution of images and labels. Instead,
in our case we sample from a distribution of episodes, each containing its own training set and test
set, with just a few samples per class. Each episode also contains two sets of labels: $Y$ and $Y'$. The
former is used to train the base learner, while the latter is used to compute the error of the just-trained
base learner, enabling back-propagation in order to learn ω, λ, α and β.
In our implementation, one episode corresponds to a mini-batch of size S = N (K + Q), where N is
the number of different classes (“ways”), K the number of samples per class (“shots”) and Q the
number of query (or test) images per class.
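
The episode structure described above can be sketched as follows. This is one possible sampler, assuming NumPy and a flat array of integer class labels in which every class has at least K + Q samples; the function name `sample_episode` and the toy label array are hypothetical and do not describe the authors' data pipeline.

```python
import numpy as np

def sample_episode(labels, n_way, k_shot, q_query, rng):
    """Return index arrays for the base training set and base test set of one episode."""
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    train_idx, test_idx = [], []
    for c in classes:
        # Shuffle the samples of class c and keep K for training and Q for testing.
        idx = rng.permutation(np.flatnonzero(labels == c))[: k_shot + q_query]
        train_idx.extend(idx[:k_shot])
        test_idx.extend(idx[k_shot:])
    return np.array(train_idx), np.array(test_idx)

# Toy dataset (assumed): 10 classes with 20 samples each.
labels = np.repeat(np.arange(10), 20)
rng = np.random.default_rng(0)
n_way, k_shot, q_query = 3, 2, 4

train_idx, test_idx = sample_episode(labels, n_way, k_shot, q_query, rng)
episode_size = len(train_idx) + len(test_idx)   # S = N (K + Q)
```

With N = 3, K = 2 and Q = 4 the episode contains S = 3 × (2 + 4) = 18 samples, and the training and test indices never overlap.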

4 EXPERIMENTS
In this section, we provide practical details for the two novel methods introduced in Sections 3.2
and 3.3, which we dub R2-D2 (Ridge Regression Differentiable Discriminator) and LR-D2
(Logistic Regression Differentiable Discriminator). We analyze their performance against the recent
literature on multi-class and binary classification problems using three few-shot learning
benchmarks: Omniglot (Lake et al., 2015), miniImageNet (Vinyals et al., 2016) and CIFAR-FS, which
we introduce in this paper. The code for both our methods and the splits of CIFAR-FS are available
at http://www.robots.ox.ac.uk/~luca/r2d2.html.

4.1 FEW-SHOT LEARNING BENCHMARKS

Let $I_\star$ and $C_\star$ be respectively the set of images and the set of classes belonging to a certain data split
$\star$. In standard classification datasets, $I_{\text{train}} \cap I_{\text{test}} = \emptyset$ and $C_{\text{train}} = C_{\text{test}}$. Instead, the few-shot setup
requires both $I_{\text{meta-train}} \cap I_{\text{meta-test}} = \emptyset$ and $C_{\text{meta-train}} \cap C_{\text{meta-test}} = \emptyset$, while within an episode we
have $C_{\text{task-train}} = C_{\text{task-test}}$.
Omniglot (Lake et al., 2015) is a dataset of handwritten characters that has been referred to as the
“MNIST transpose” for its high number of classes and small number of instances per class. It contains
20 examples of 1623 characters, grouped in 50 different alphabets. In order to be able to compare
against the state of the art, we adopt the same setup and data split used in Vinyals et al. (2016).
Hence, we resize images to 28×28 and we augment the dataset using four rotated versions of
each instance (0°, 90°, 180°, 270°). Including rotations, we use 4800 classes for meta-training and
meta-validation and 1692 for meta-testing.
miniImageNet (Vinyals et al., 2016) aims at representing a challenging dataset without demanding
considerable computational resources. It is randomly sampled from ImageNet (Russakovsky et al.,
2015) and it is constituted by a total of 60,000 images from 100 different classes, each with 600
instances. All images are RGB and have been downsampled to 84×84. Like all recent work, we
adopt the same splits as Ravi & Larochelle (2017), who employ 64 classes for meta-training, 16 for
meta-validation and 20 for meta-testing.
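
The class and image counts quoted above for Omniglot and miniImageNet are mutually consistent, as the short arithmetic check below shows. The variable names are ours; all figures are the ones stated in the text.

```python
# Omniglot: each of the 1623 characters yields four classes, one per rotation.
omniglot_characters = 1623
rotations = 4
omniglot_classes = omniglot_characters * rotations
omniglot_split_total = 4800 + 1692      # (meta-training + meta-validation) + meta-testing

# miniImageNet: 64 / 16 / 20 classes with 600 instances each.
mini_classes = 64 + 16 + 20
mini_images = mini_classes * 600
```

Both totals match: 1623 × 4 = 6492 = 4800 + 1692 classes for Omniglot, and 100 × 600 = 60,000 images for miniImageNet.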
CIFAR-FS. On the one hand, despite being lightweight, Omniglot is becoming too simple for
modern few-shot learning methods, especially with the splits of Vinyals et al. (2016). On the other,
miniImageNet is more challenging, but it might still require a model to train for several hours before
convergence. Thus, we propose CIFAR-FS (CIFAR100 few-shots), which is randomly sampled from
CIFAR-100 (Krizhevsky & Hinton, 2009) by using the same criteria with which miniImageNet has
been generated. We observed that the average inter-class similarity is sufficiently high to represent a
challenge for the current state of the art. Moreover, the limited original resolution of 32×32 makes
the task harder and at the same time allows fast prototyping.

4.2 EXPERIMENTAL RESULTS

In order to produce the features X for the base learners (eq. 4 and 7), like many recent methods we
use a shallow network of four convolutional “blocks”, each consisting of the following sequence: a
3×3 convolution (padding=1, stride=1), batch-normalization, 2×2 max-pooling, and a leaky-ReLU
with a factor of 0.1. Max pooling’s stride is 2 for the first three layers and 1 for the last one. The
four convolutional layers have [96, 192, 384, 512] filters. Dropout is applied to the last two blocks
for the experiments on miniImageNet and CIFAR-FS, respectively with probabilities 0.1 and 0.4. We
do not use any fully connected layer. Instead, we flatten and concatenate the output of the third and
fourth convolutional blocks and feed it to the base learner. Doing so, we obtain high-dimensional
features of size 3584, 72576 and 8064 for Omniglot, miniImageNet and CIFAR-FS respectively. It is
important to mention that the use of the Woodbury formula (section 3.2) allows us to make use of
high-dimensional features without incurring burdensome computations. In fact, in few-shot problems
the data matrix X is particularly “large and short”. As an example, with a 5-way/1-shot problem
from miniImageNet we have $X \in \mathbb{R}^{5 \times 72576}$. Applying the Woodbury identity, we obtain significant
gains in computation, as in eq. 5 we invert a matrix that is only 5×5 instead of 72576×72576.
Like Snell et al. (2017), we observe that using a higher number of classes during training is important.
Hence, despite the few-shot problem at test time being 5 or 20-way, in our multi-class classification
experiments we train using 60 classes for Omniglot, 16 for miniImageNet and 20 for CIFAR-FS.
Moreover, in order not to train a different model for every single configuration (two for miniImageNet
and CIFAR-FS, four for Omniglot), similarly to Mishra et al. (2018) and differently from previous
work, we train our models with a random number of shots, which does not deteriorate the performance
and allows us to simply train one model per dataset. We then choose Q (the size of the query or test
set) accordingly, so that the batch size S remains constant throughout the episodes. We set S to 600
for Omniglot and 240 for both miniImageNet and CIFAR-FS.
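
Combining the constant batch size with the relation S = N (K + Q) from Section 3.4 gives the query size for any number of shots. The helper below is a hypothetical sketch of that arithmetic, assuming that S is an exact multiple of the number of training classes N; it is not taken from the authors' code.

```python
def query_size(batch_size, n_way, k_shot):
    """Number of query images per class Q such that S = N (K + Q) holds exactly."""
    per_class, remainder = divmod(batch_size, n_way)
    if remainder != 0 or per_class <= k_shot:
        raise ValueError("batch size must be a multiple of n_way and leave room for queries")
    return per_class - k_shot

# Settings quoted in the text: S = 600 with 60 classes (Omniglot),
# S = 240 with 16 classes (miniImageNet) and S = 240 with 20 classes (CIFAR-FS).
q_omniglot_1shot = query_size(600, 60, 1)
q_mini_5shot = query_size(240, 16, 5)
q_cifar_1shot = query_size(240, 20, 1)
```

For instance, a 1-shot Omniglot training episode leaves 600 / 60 − 1 = 9 query images per class, and a 5-shot miniImageNet episode leaves 240 / 16 − 5 = 10.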
At the meta-learning level, we train our methods with Adam (Kingma & Ba, 2015) with an initial
learning rate of 0.005, dampened by 0.5 every 2,000 episodes. Training is stopped when the error on
the meta-validation set does not decrease meaningfully for 20,000 episodes.
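
One reading of this schedule is a staircase decay, in which the learning rate is halved once every 2,000 episodes. The sketch below follows that interpretation; the staircase form and the function name `learning_rate` are assumptions of this example, since the text does not say whether the decay is applied in steps or continuously.

```python
def learning_rate(episode, base_lr=0.005, decay=0.5, step=2000):
    """Staircase schedule: multiply the base rate by `decay` once per `step` episodes."""
    return base_lr * decay ** (episode // step)

lr_start = learning_rate(0)          # 0.005
lr_before_drop = learning_rate(1999) # still 0.005
lr_first_drop = learning_rate(2000)  # 0.0025
lr_second_drop = learning_rate(4000) # 0.00125
```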
As for the base learner, we let SGD learn the parameters ω of the CNN, as well as the regularization
factor λ and the scale α and bias β of the calibration layer of R2-D2 (end of Section 3.2). In practice,
we observed that it is important to use SGD to adapt α and β, while it makes little difference whether
λ is learnt or not. A more detailed analysis can be found in Appendix C.
Multi-class classification. Tables 1 and 2 show the performance of our closed-form base learner
R2-D2 against the current state of the art for shallow architectures of four convolutional layers.
Values represent average classification accuracies obtained by sampling 10,000 episodes from the
meta test-set and are presented with 95% confidence intervals. For each column, the best performance
is in bold. If more than one value is in bold, their intervals overlap. For prototypical
networks, we report the results reproduced by the code provided by the authors. For our comparison,
we report the results of methods which train their models from scratch for few-shot classification,
omitting very recent work of Qiao et al. (2018) and Gidaris & Komodakis (2018), which instead
make use of pre-trained embeddings.

Table 1: Few-shot multi-class classification accuracies on miniImageNet and CIFAR-FS.

                                            miniImageNet, 5-way       CIFAR-FS, 5-way
Method                                      1-shot      5-shot        1-shot      5-shot
Matching Net (Vinyals et al., 2016)         44.2%       57%           -           -
MAML (Finn et al., 2017)                    48.7±1.8%   63.1±0.9%     58.9±1.9%   71.5±1.0%
MAML∗                                       40.9±1.5%   58.9±0.9%     53.8±1.8%   67.6±1.0%
Meta-LSTM (Ravi & Larochelle, 2017)         43.4±0.8%   60.6±0.7%     -           -
Proto Net (Snell et al., 2017)              47.4±0.6%   65.4±0.5%     55.5±0.7%   72.0±0.6%
Proto Net∗                                  42.9±0.6%   65.9±0.6%     57.9±0.8%   76.7±0.6%
Relation Net (Sung et al., 2018)            50.4±0.8%   65.3±0.7%     55.0±1.0%   69.3±0.8%
SNAIL (with ResNet) (Mishra et al., 2018)   55.7±1.0%   68.9±0.9%     -           -
SNAIL (with 32C) (Mishra et al., 2018)      45.1%       55.2%         -           -
GNN (Garcia & Bruna, 2018)                  50.3%       66.4%         61.9%       75.3%
GNN∗                                        50.3%       68.2%         56.0%       72.5%
Ours/R2-D2 (with 64C)                       49.5±0.2%   65.4±0.2%     62.3±0.2%   77.4±0.2%
Ours/R2-D2                                  51.8±0.2%   68.4±0.2%     65.4±0.2%   79.4±0.2%
Ours/LR-D2 (1 iter.)                        51.0±0.2%   65.6±0.2%     64.5±0.2%   75.8±0.2%
Ours/LR-D2 (5 iter.)                        51.9±0.2%   68.7±0.2%     65.3±0.2%   78.3±0.2%

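For readers who want to reproduce figures of the form "mean ± interval", the sketch below computes a 95% confidence interval from per-episode accuracies. It assumes the usual normal approximation (1.96 standard errors), which the text does not spell out. The input data is synthetic and is not a result from the paper.

```python
import numpy as np

def mean_and_ci95(per_episode_accuracy):
    """Mean accuracy and half-width of a normal-approximation 95% confidence interval."""
    acc = np.asarray(per_episode_accuracy, dtype=float)
    mean = acc.mean()
    std_err = acc.std(ddof=1) / np.sqrt(acc.size)   # standard error of the mean
    return mean, 1.96 * std_err

# Synthetic per-episode accuracies, for illustration only.
rng = np.random.default_rng(0)
fake_acc = rng.uniform(0.3, 0.7, size=10_000)
mean, half_width = mean_and_ci95(fake_acc)
print(f"{100 * mean:.1f} ± {100 * half_width:.1f}%")
```

With 10,000 episodes the standard error shrinks by a factor of 100 relative to the per-episode spread, which is why intervals computed this way can be as narrow as a few tenths of a percent.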
In terms of feature embeddings, Vinyals et al. (2016), Finn et al. (2017), Snell et al. (2017) and Ravi &
Larochelle (2017) use 64 filters per layer (reduced to 32 for miniImageNet in Ravi & Larochelle (2017)
and Finn et al. (2017) to limit overfitting). On top of this, Sung et al. (2018) also use a relation
module of two convolutional and two fully connected layers. GNN (Garcia & Bruna, 2018) employs
an embedding with [64, 96, 128, 256] filters, a fully connected layer and a graph neural network (with
its own extra parameters). In order to ensure a fair comparison, we increased the capacity of the
architectures of three representative methods (MAML, prototypical networks and GNN) to match
ours. The results of these experiments are marked with a ∗ in Table 1. We make use of dropout on
the last two layers for all the experiments on baselines with ∗, as we verified that it helps to reduce
overfitting. Moreover, we report results for experiments on our R2-D2 in which we use a 64-channel
embedding.
Despite its simplicity, our proposed method achieves an average accuracy that, on miniImageNet
and CIFAR-FS, is superior to the state of the art with shallow architectures. For example, averaged over the
four problems of Table 1, R2-D2 improves on GNN (the second-best method) by a relative 4.3%.
R2-D2 also shows competitive results on Omniglot (Table 2), achieving among the
best performance for all problems. Furthermore, when we use the “lighter” embedding, we still
observe performance in line with the state of the art. Interestingly, increasing the capacity
of the other methods is not particularly helpful. It is beneficial only for GNN on miniImageNet and
prototypical networks on CIFAR-FS, while being detrimental in all the other cases.
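The relative improvement quoted above can be checked directly from the Table 1 entries. The short calculation below assumes that "relative" means the per-problem gain divided by the GNN accuracy, averaged over the four problems. The variable names are illustrative.

```python
# Accuracies (%) copied from Table 1: miniImageNet 1-/5-shot, then CIFAR-FS 1-/5-shot.
r2d2 = [51.8, 68.4, 65.4, 79.4]
gnn = [50.3, 66.4, 61.9, 75.3]

# Relative gain of R2-D2 over GNN on each of the four problems, in percent.
relative_gain = [100.0 * (ours - theirs) / theirs for ours, theirs in zip(r2d2, gnn)]
average_gain = sum(relative_gain) / len(relative_gain)

print([round(g, 2) for g in relative_gain])
print(round(average_gain, 1))
```

Under that reading the average comes out at roughly 4.3%, with the larger gains on CIFAR-FS.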
Our R2-D2 is also competitive against SNAIL, which uses a much deeper architecture (a ResNet
with a total of 14 convolutional layers). Despite being outperformed in the 1-shot case, we
match its results in the 5-shot one. Moreover, it is paramount for SNAIL to make use of such a deep
embedding, as its performance drops significantly with a shallow one.
LR-D2 performance on multi-class classification. In order to be able to compare our binary
classifier LR-D2 with the state-of-the-art in few-shot N -class classification, it is possible to jointly
consider N binary classifiers, each of which discriminates between a specific class and all the
remaining ones (Bishop, 2006, Chapter 4.1). In our framework, this can be easily implemented
by concatenating together the outputs of N instances of LR-D2, resulting in a single multi-class
prediction.
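The one-vs-all construction can be sketched as follows. This is an illustrative NumPy version on synthetic data, not the authors' implementation: the binary learner is an L2-regularised logistic regression fitted with a few IRLS steps in its plain (non-Woodbury) form, there is no bias term, and the function names, toy features and value of `lam` are all assumptions.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_binary_irls(X, y, lam=1.0, n_iter=5):
    """L2-regularised logistic regression fitted with a few IRLS (Newton) steps."""
    d = X.shape[1]
    w = np.zeros(d)
    for _ in range(n_iter):
        mu = sigmoid(X @ w)
        s = np.clip(mu * (1.0 - mu), 1e-6, None)   # IRLS weights, clipped for stability
        z = X @ w + (y - mu) / s                   # working targets
        w = np.linalg.solve(X.T @ (s[:, None] * X) + lam * np.eye(d), X.T @ (s * z))
    return w

def fit_one_vs_all(X, labels, n_classes, **kwargs):
    """One binary classifier per class; column c separates class c from the rest."""
    columns = [fit_binary_irls(X, (labels == c).astype(float), **kwargs)
               for c in range(n_classes)]
    return np.stack(columns, axis=1)

def predict(X, W):
    # Concatenated binary scores give a single multi-class decision via argmax.
    return np.argmax(X @ W, axis=1)

# Synthetic 3-way, 5-shot episode with well-separated class means (toy data).
rng = np.random.default_rng(0)
n_classes, d = 3, 5
means = 4.0 * np.eye(n_classes, d)
y_support = np.repeat(np.arange(n_classes), 5)
y_query = np.repeat(np.arange(n_classes), 20)
X_support = means[y_support] + 0.5 * rng.normal(size=(y_support.size, d))
X_query = means[y_query] + 0.5 * rng.normal(size=(y_query.size, d))

W = fit_one_vs_all(X_support, y_support, n_classes)
accuracy = np.mean(predict(X_query, W) == y_query)
print(W.shape, accuracy)
```

Each IRLS iteration solves one linear system per class, so the cost grows with N.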


Table 2: Few-shot multi-class classification accuracies on Omniglot.

                                        Omniglot, 5-way              Omniglot, 20-way
Method                                  1-shot        5-shot         1-shot        5-shot
Siamese Net (Koch et al., 2015)         96.7%         98.4%          88%           96.5%
Matching Net (Vinyals et al., 2016)     98.1%         98.9%          93.8%         98.5%
MAML (Finn et al., 2017)                98.7±0.4%     99.9±0.1%      95.8±0.3%     98.9±0.2%
Proto Net (Snell et al., 2017)          98.5±0.2%     99.5±0.1%      95.3±0.2%     98.7±0.1%
SNAIL (Mishra et al., 2018)             99.07±0.16%   99.77±0.09%    97.64±0.30%   99.36±0.18%
GNN (Garcia & Bruna, 2018)              99.2%         99.7%          97.4%         99.0%
Ours/R2-D2 (with 64C)                   98.55±0.05%   99.66±0.02%    94.70±0.05%   98.91±0.02%
Ours/R2-D2                              98.91±0.05%   99.74±0.02%    96.24±0.05%   99.20±0.02%

Table 3: Few-shot binary classification accuracies on miniImageNet and CIFAR-FS.

                                      miniImageNet, 2-way       CIFAR-FS, 2-way
Method                                1-shot      5-shot        1-shot      5-shot
MAML (Finn et al., 2017)              74.9±3.0%   84.4±1.2%     82.8±2.7%   88.3±1.1%
Proto Nets (Snell et al., 2017)       71.7±1.0%   84.8±0.7%     76.4±0.9%   88.5±0.6%
Relation Net (Sung et al., 2018)      76.2±1.2%   86.8±1.0%     75.0±1.5%   86.7±0.9%
GNN (Garcia & Bruna, 2018)            78.4%       87.1%         79.3%       89.1%
Ours/R2-D2                            77.4±0.3%   86.8±0.2%     84.1±0.3%   91.7±0.2%
Ours/LR-D2 (10 iter.)                 78.1±0.3%   86.5±0.2%     84.7±0.3%   91.5±0.2%

We use the same setup and hyper-parameters as for R2-D2 (Section 4), except for the number of
classes/ways used at training, which we limit to 10. Interestingly, with five IRLS iterations the
accuracy of the 1-vs-all variant of LR-D2 is similar to that of R2-D2 (Table 1): 51.9% and
68.7% for miniImageNet (1-shot and 5-shot); 65.3% and 78.3% for CIFAR-FS. With a single iteration,
performance is still very competitive: 51.0% and 65.6% for miniImageNet; 64.5% and 75.8% for
CIFAR-FS. However, the requirement of solving N binary problems per iteration makes it much less
efficient than R2-D2, as is evident in Table 4.
Binary classification. Finally, in Table 3 we report the performance of both our ridge regression
and logistic regression base learners, together with four representative methods. Since LR-D2 is
limited to operating in a binary classification setup, we run our R2-D2 and prototypical networks
without oversampling the number of ways. For both our methods and prototypical networks, we report
the performance obtained by annealing the learning rate by a factor of 0.99, which works better than the
schedule used for multi-class classification. Moreover, motivated by the small size of the mini-batches,
we replace Batch Normalization with Group Normalization (Wu & He, 2018). For this table, we
use the default setup found in the code of MAML, which uses 5 SGD iterations during training and
10 during testing. Table 3 confirms the validity of both our approaches on the binary classification
problem.
Although different in nature, both MAML and our LR-D2 make use of iterative base learners:
the former is based on SGD, while the latter is based on Newton’s method (in the form of Iteratively
Reweighted Least Squares). The use of second-order optimization might suggest that LR-D2 is
characterized by computationally demanding steps. However, we can apply the Woodbury identity at
every iteration and obtain a significant speedup. In Figure 2 we compare the performance of LR-D2
with that of MAML for different numbers of steps of the base learner (kept constant between
training and testing). LR-D2 is superior to MAML, especially for a higher number of steps.
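To make the role of the Woodbury identity in each Newton step concrete, the sketch below writes one IRLS update in two algebraically equivalent ways: a primal form that solves a d × d system, and a Woodbury form that solves an n × n system. The notation, the helper names, the value of `lam` and the toy data are assumptions made for illustration and may differ from the paper's own equations.

```python
import numpy as np

def _weights_and_targets(X, y, w):
    """IRLS quantities for logistic regression at the current weights w."""
    mu = 1.0 / (1.0 + np.exp(-(X @ w)))
    s = mu * (1.0 - mu)                 # diagonal of the reweighting matrix
    z = X @ w + (y - mu) / s            # working targets
    return s, z

def irls_step_primal(X, y, w, lam):
    """w = (X^T S X + lam I)^-1 X^T S z; solves a (d x d) system."""
    s, z = _weights_and_targets(X, y, w)
    d = X.shape[1]
    return np.linalg.solve(X.T @ (s[:, None] * X) + lam * np.eye(d), X.T @ (s * z))

def irls_step_woodbury(X, y, w, lam):
    """w = X^T (X X^T + lam S^-1)^-1 z; solves only an (n x n) system."""
    s, z = _weights_and_targets(X, y, w)
    return X.T @ np.linalg.solve(X @ X.T + lam * np.diag(1.0 / s), z)

rng = np.random.default_rng(0)
n, d = 10, 512                              # few samples, high-dimensional features
X = rng.normal(size=(n, d)) / np.sqrt(d)    # synthetic embeddings
y = np.array([0.0, 1.0] * 5)                # balanced binary labels
w = np.zeros(d)
max_gap = 0.0
for _ in range(5):
    w_primal = irls_step_primal(X, y, w, lam=0.1)
    w = irls_step_woodbury(X, y, w, lam=0.1)
    max_gap = max(max_gap, float(np.max(np.abs(w - w_primal))))
print(max_gap)
```

Both forms give the same update up to floating-point error. With n much smaller than d, as in few-shot episodes, the n × n solve is the cheaper of the two.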
Efficiency. In Table 4 we compare the amount of time required by two representative methods and
ours to solve 10,000 episodes (each with 10 images) on a single NVIDIA GTX 1080 GPU. We use
miniImageNet (5-way, 1-shot) and adopt, for the lower part of the table, a lightweight embedding
network of 4 layers and 32 channels per layer. For reference, in the upper part of the table we also
report the timings for R2-D2 with [64, 64, 64, 64] and [96, 192, 384, 512] embeddings.


[Figure 2 plots: four panels (miniImageNet 2-way, 1-shot; miniImageNet 2-way, 5-shot; CIFAR-FS 2-way, 1-shot; CIFAR-FS 2-way, 5-shot). x-axis: number of iterations (0, 1, 2, 5, 10); y-axis: accuracy (%). Curves: MAML, Ours/LR-D2, Ours/R2-D2.]
Figure 2: Binary classification accuracy on two datasets and two setups at different number of steps of
the base learner for MAML, R2-D2 and LR-D2. Shaded areas represent 95% confidence intervals.

Interestingly, R2-D2 achieves an efficiency comparable to
that of prototypical networks and significantly higher than that of MAML. Notably, unlike prototypical
networks, our methods do allow per-episode adaptation through the weights W of the solver.

Table 4: Time required to solve 10,000 miniImageNet episodes of 10 samples each.

Method                                        miniImageNet, 5-way, 1-shot
Ours/R2-D2                                    1 min 23 sec
Ours/R2-D2 (with 64C)                         1 min 4 sec
------------------------------------------------------------------------
MAML (Finn et al., 2017) (with 32C)           6 min 35 sec
Ours/LR-D2 (1-vs-all) (1 iter.) (with 32C)    5 min 48 sec
Ours/R2-D2 (with 32C)                         57 sec
Proto Nets (Snell et al., 2017) (with 32C)    24 sec

5 CONCLUSIONS
With the aim of allowing efficient adaptation to unseen learning problems, in this paper we explored
the feasibility of incorporating fast solvers with closed-form solutions as the base learning component
of a meta-learning system. Importantly, the use of the Woodbury identity allows significant computational
gains in a scenario presenting only a few samples with high dimensionality, like one-shot or
few-shot learning. R2-D2, the differentiable ridge regression base learner we introduce, is almost as
fast as prototypical networks and strikes a useful compromise between not performing adaptation
for new episodes (like metric-learning-based approaches) and conducting a costly iterative approach
(like MAML or LSTM-based meta-learners). In general, we showed that our base learners work
remarkably well, with excellent results on few-shot learning benchmarks, generalizing to episodes
with new classes that were not seen during training. We believe that our findings point in an exciting
direction of more sophisticated yet efficient online adaptation methods, able to leverage the potential
of prior knowledge distilled in an offline training phase. In future work, we would like to explore
Newton’s methods with more complicated second-order structure than ridge regression.

ACKNOWLEDGMENTS

We would like to thank Jack Valmadre, Namhoon Lee and the anonymous reviewers for their insightful
comments, which have been useful to improve the manuscript. This work was partially supported by
the ERC grant 638009-IDIU.


REFERENCES
Han Altae-Tran, Bharath Ramsundar, Aneesh S Pappu, and Vijay Pande. Low data drug discovery
with one-shot learning. ACS central science, 2017.
Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul,
and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in
Neural Information Processing Systems, 2016.
Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. On the optimization of a synaptic
learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, pp. 6–8.
Univ. of Texas, 1992.
Yoshua Bengio. Gradient-based optimization of hyperparameters. Neural computation, 2000.
Luca Bertinetto, João F Henriques, Jack Valmadre, Philip Torr, and Andrea Vedaldi. Learning
feed-forward one-shot learners. In Advances in Neural Information Processing Systems, 2016.
Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and
Statistics). Springer-Verlag, Berlin, Heidelberg, 2006.
Jane Bromley, James W Bentz, Léon Bottou, Isabelle Guyon, Yann LeCun, Cliff Moore, Eduard
Säckinger, and Roopak Shah. Signature verification using a “Siamese” time delay neural network.
International Journal of Pattern Recognition and Artificial Intelligence, 1993.
Susan Carey. Less may never mean more. Recent advances in the psychology of language, 1978.
Susan Carey and Elsa Bartlett. Acquiring a single new word. 1978.
Rich Caruana. Multitask learning. In Learning to learn. Springer, 1998.
Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with
application to face verification. In IEEE Conference on Computer Vision and Pattern Recognition,
2005.
Brian Chu, Vashisht Madhavan, Oscar Beijbom, Judy Hoffman, and Trevor Darrell. Best practices
for fine-tuning visual classifiers to new domains. In European Conference on Computer Vision
workshops. Springer, 2016.
Li Fei-Fei, Rob Fergus, and Pietro Perona. One-shot learning of object categories. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 2006.
Chelsea Finn and Sergey Levine. Meta-learning and universality: Deep representations and gradient
descent can approximate any learning algorithm. 2018.
Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of
deep networks. In International Conference on Machine Learning, 2017.
Victor Garcia and Joan Bruna. Few-shot learning with graph neural networks. In International
Conference on Learning Representations, 2018.
Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In IEEE
Conference on Computer Vision and Pattern Recognition, 2018.
Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised
domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.
Bharath Hariharan and Ross B Girshick. Low-shot visual recognition by shrinking and hallucinating
features. In IEEE International Conference on Computer Vision, 2017.
Sepp Hochreiter, A Steven Younger, and Peter R Conwell. Learning to learn using gradient descent.
In International Conference on Artificial Neural Networks, pp. 87–94. Springer, 2001.
Catalin Ionescu, Orestis Vantzos, and Cristian Sminchisescu. Training deep networks with structured
layers by matrix backpropagation. arXiv preprint arXiv:1509.07838, 2015.


Łukasz Kaiser, Ofir Nachum, Aurko Roy, and Samy Bengio. Learning to remember rare events. In
International Conference on Learning Representations, 2017.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. 2015.
Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot
image recognition. In International Conference on Machine Learning workshops, 2015.
Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning
through probabilistic program induction. Science, 2015.
Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization
through reversible learning. In International Conference on Machine Learning, pp. 2113–2122,
2015.
Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The
sequential learning problem. In Psychology of learning and motivation. 1989.
Erik G Miller, Nicholas E Matsakis, and Paul A Viola. Learning from one example through shared
densities on transforms. In IEEE Conference on Computer Vision and Pattern Recognition. IEEE,
2000.
Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-
learner. In International Conference on Learning Representations, 2018.
Tom M Mitchell. The need for biases in learning generalizations. Department of Computer Science,
Laboratory for Computer Science Research, Rutgers Univ. New Jersey, 1980.
Tom M Mitchell et al. Machine learning. Burr Ridge, IL: McGraw Hill, 1997.
Tsendsuren Munkhdalai and Hong Yu. Meta networks. In International Conference on Machine
Learning, 2017.
Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
Devang K Naik and RJ Mammone. Meta-neural networks that learn by learning. In Neural Networks,
1992. IJCNN., International Joint Conference on. IEEE, 1992.
Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. CoRR,
2018. URL http://arxiv.org/abs/1803.02999.
Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves,
Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw
audio. arXiv preprint arXiv:1609.03499, 2016.
Kaare Brandt Petersen, Michael Syskind Pedersen, et al. The matrix cookbook. Technical University
of Denmark, 2008.
Siyuan Qiao, Chenxi Liu, Wei Shen, and Alan L. Yuille. Few-shot image recognition by predicting
parameters from activations. In IEEE Conference on Computer Vision and Pattern Recognition,
2018.
Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International
Conference on Learning Representations, 2017.
Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with
residual adapters. In Advances in Neural Information Processing Systems, 2017.
Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B Tenenbaum,
Hugo Larochelle, and Richard S Zemel. Meta-learning for semi-supervised few-shot classification.
In International Conference on Learning Representations, 2018.


Sebastian Ruder. An overview of multi-task learning in deep neural networks. arXiv preprint
arXiv:1706.05098, 2017.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang,
Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition
challenge. 2015.
Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-
learning with memory-augmented neural networks. In International Conference on Machine
Learning, 2016.
Jürgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn:
the meta-meta-... hook. PhD thesis, Technische Universität München, 1987.
Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent
networks. Neural Computation, 1992.
Jürgen Schmidhuber. A neural network that embeds its own meta-levels. In Neural Networks, 1993.,
IEEE International Conference on. IEEE, 1993.
Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In
Advances in Neural Information Processing Systems, 2017.
Pablo Sprechmann, Siddhant M Jayakumar, Jack W Rae, Alexander Pritzel, Adrià Puigdomènech
Badia, Benigno Uria, Oriol Vinyals, Demis Hassabis, Razvan Pascanu, and Charles Blundell.
Memory-based parameter adaptation. 2018.
Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales.
Learning to compare: Relation network for few-shot learning. In IEEE Conference on Computer
Vision and Pattern Recognition, 2018.
Albert Tarantola. Inverse problem theory and methods for model parameter estimation, volume 89.
SIAM, 2005.
Sebastian Thrun. Is learning the n-th thing any easier than learning the first? In Advances in Neural
Information Processing Systems, 1996.
Sebastian Thrun. Lifelong learning algorithms. In Learning to learn. Springer, 1998.
Sebastian Thrun and Lorien Pratt. Learning to learn. Springer Science & Business Media, 1998.
Paul E Utgoff. Shift of bias for inductive concept learning. Machine learning: An artificial intelligence
approach, 1986.
Jack Valmadre, Luca Bertinetto, João Henriques, Andrea Vedaldi, and Philip HS Torr. End-to-end
representation learning for correlation filter based tracking. In IEEE Conference on Computer
Vision and Pattern Recognition, 2017.
Ricardo Vilalta and Youssef Drissi. A perspective view and survey of meta-learning. Artificial
Intelligence Review, 2002.
Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot
learning. In Advances in Neural Information Processing Systems, 2016.
Yuxin Wu and Kaiming He. Group normalization. CoRR, 2018. URL
http://arxiv.org/abs/1803.08494.
Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep
neural networks? In Advances in Neural Information Processing Systems, 2014.
A Steven Younger, Sepp Hochreiter, and Peter R Conwell. Meta-learning with backpropagation. In
Neural Networks, 2001. Proceedings. IJCNN’01. International Joint Conference on. IEEE, 2001.


A EXTENDED DISCUSSION

Contributions within the few-shot learning paradigm. In this work, we evaluated our proposed
methods R2-D2 and LR-D2 in the few-shot learning scenario (Fei-Fei et al., 2006; Lake et al., 2015;
Vinyals et al., 2016; Ravi & Larochelle, 2017; Hariharan & Girshick, 2017), which consists of
learning how to discriminate between images given one or very few examples. For methods tackling
this problem, it is common practice to organise the training procedure in two nested loops. The
inner loop is used to solve the actual few-shot classification problem, while the outer loop guides
the former by gradually modifying the inductive bias of the base learner (Vilalta &
Drissi, 2002). Differently from standard classification benchmarks, the few-shot ones enforce that
classes are disjoint between dataset splits.
In the literature (e.g. Vinyals et al. (2016)), the very small classification problems with unseen classes
solved within the inner loop have often been referred to as episodes or tasks. Considering the general
few-shot learning paradigm just described, methods in the recent literature mostly differ in the type
of learner they use in the inner loop and the amount of per-episode adaptability they allow. For
example, at one end of the spectrum in terms of “amount of adaptability”, we find methods
such as MAML (Finn et al., 2017), which learns how to efficiently fine-tune the parameters of a
neural network with a few iterations of SGD. At the other end, we have methods based on metric
learning such as prototypical networks (Snell et al., 2017) and relation networks (Sung et al., 2018),
which are fast but do not perform adaptation. Note that the amount of adaptation to a new episode
(i.e. a new classification problem with unseen classes) is not at all indicative of the performance in
few-shot learning benchmarks. As a matter of fact, both Snell et al. (2017) and Sung et al. (2018)
achieve higher accuracy than MAML. Nonetheless, adaptability is a desirable property, as it allows
more design flexibility.
Within this landscape, our work proposes a novel technique (R2-D2) that does allow per-episode
adaptation while at the same time being fast (Table 4) and achieving strong performance (Table 1).
The key innovation is to use a simple (and differentiable) solver such as ridge regression within the
inner loop, which requires back-propagating through the solution of a learning problem. Crucially,
its closed-form solution and the use of the Woodbury identity (particularly advantageous in the low
data regime) allow this non-trivial endeavour to be efficient. We further demonstrate that this strategy
is not limited to the ridge regression case, but it can also be extended to other solvers (LR-D2) by
dividing the problem into a short series of weighted least squares problems (Murphy, 2012, Chapter
8.3.4).
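The efficiency argument can be illustrated with a small NumPy sketch of a ridge-regression base learner. It compares the usual primal solution with its Woodbury form and then applies a scale and bias to the query scores, in the spirit of the calibration layer mentioned in Section 4. The embedding size, the value of `lam`, the bias value and the synthetic data are hypothetical. The sketch covers only the forward computation, not back-propagation through the solver.

```python
import numpy as np

def ridge_primal(X, Y, lam):
    """W = (X^T X + lam I)^-1 X^T Y; the linear system is (d x d)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def ridge_woodbury(X, Y, lam):
    """W = X^T (X X^T + lam I)^-1 Y; the linear system is only (n x n)."""
    n = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), Y)

rng = np.random.default_rng(0)
n_ways, n_shots, d = 5, 1, 1600                    # d is a hypothetical embedding size
labels = np.repeat(np.arange(n_ways), n_shots)
X = rng.normal(size=(n_ways * n_shots, d))         # synthetic support embeddings
Y = np.eye(n_ways)[labels]                         # one-hot regression targets

W = ridge_woodbury(X, Y, lam=50.0)
same_solution = np.allclose(W, ridge_primal(X, Y, lam=50.0))

alpha, beta = 10.0, 0.0                            # calibration scale and bias (illustrative values)
X_query = X + 0.1 * rng.normal(size=X.shape)       # synthetic queries near the support points
logits = alpha * (X_query @ W) + beta
predictions = np.argmax(logits, axis=1)
print(same_solution, predictions)
```

For a 5-way, 1-shot episode the Woodbury form inverts a 5 × 5 matrix instead of a d × d one, which is where the saving in the low-data regime comes from.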
Disambiguation from the multi-task learning paradigm. Our work – and more generally the
few-shot learning literature as a whole – is related to the multi-task learning paradigm (Caruana,
1998; Ruder, 2017). However, several crucial differences exist. In terms of setup, multi-task learning
methods are trained to solve a fixed set of T tasks (or domains). At test time, the same T tasks or
domains are encountered. For instance, the popular Office-Caltech (Gong et al., 2012) dataset is
constructed by considering all the images from 10 classes present in 4 different datasets (the domains).
For multi-task learning, the splits span the domains but contain all the 10 classes. Conversely, few-shot
learning datasets have splits with disjoint sets of classes (i.e. each split’s classes are not contained in
other splits). Moreover, only a few examples (shots) can be used as training data within one episode,
while in multi-task learning this limitation is not present. For this reason, meta-learning methods
applied to few-shot learning (e.g. ours; Vinyals et al., 2016; Finn et al., 2017; Ravi & Larochelle,
2017; Mishra et al., 2018) crucially take into account adaptation already during the training process
to mimic the test-time setting, de facto learning how to learn from limited data.
The importance of considering adaptation during training. Considering adaptation during train-
ing is also one of the main traits that differentiate our approach from basic transfer learning approaches
in which a neural network is first pre-trained on one dataset/task and then adapted to a different
dataset/task by simply adapting the final layer(s) (e.g. Yosinski et al. (2014); Chu et al. (2016)).
To better illustrate this point, we conducted a baseline experiment. First, we pre-trained for a standard
classification problem the same 4-layers CNN architecture using the same training datasets. We
simply added a final fully-connected layer (with 64 outputs, like the number of classes in the training
splits) and used the cross-entropy loss. Then, we used the convolutional part of this trained network
as a feature extractor and fed its activations to our ridge-regression layer to produce a per-episode
set of weights W . On miniImagenet, the drop in performance w.r.t. our proposed R2-D2 is very

14
Published as a conference paper at ICLR 2019

significant: −13.8% and −11.6% accuracy for the 1 and 5 shot problems respectively. The drop in
performance is consistent on CIFAR, though a bit less drastic: −11.5% and −5.9%.
These results empirically confirm that simply using basic transfer learning techniques with a shared
feature representation and task-specific final layers is not a good strategy to obtain results competitive
with the state-of-the-art in few-shot learning. Instead, it is necessary to enforce the generality of
the underlying features during training explicitly, which we do by back-propagating through the
adaptation procedure (the regressors R2-D2 and LR-D2).

B DIFFERENT GAUSSIAN PRIORS FOR REGULARIZATION

The regularization term can be seen as a Gaussian prior distribution over the parameters in a Bayesian
interpretation, or more simply as Tikhonov regularization (Tarantola, 2005). In the most common case
of λI, it corresponds to an isotropic Gaussian prior on the parameters.
In addition to the case in which λ is a scalar, we also experiment with the variant diag(λ),
corresponding to an axis-aligned Gaussian prior with an independent variance for each parameter, which
can potentially exploit the fact that the parameters have different scales. Replacing λI with diag(λ)
in eq. 4, the final expression for W after having applied the Woodbury identity becomes:
$$W = \Lambda(Z) = \operatorname{diag}(\lambda)^{-1} X^T \left( X \operatorname{diag}(\lambda)^{-1} X^T + I \right)^{-1} Y. \tag{8}$$
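As a sanity check on eq. 8, the sketch below compares it numerically with the direct solution (X^T X + diag(λ))^-1 X^T Y, which is what one obtains from ridge regression with a per-parameter regulariser before applying the Woodbury identity. The function names and the synthetic data are illustrative assumptions.

```python
import numpy as np

def ridge_diag_primal(X, Y, lam_vec):
    """W = (X^T X + diag(lam))^-1 X^T Y, one regularisation strength per feature."""
    return np.linalg.solve(X.T @ X + np.diag(lam_vec), X.T @ Y)

def ridge_diag_woodbury(X, Y, lam_vec):
    """Eq. 8: W = diag(lam)^-1 X^T (X diag(lam)^-1 X^T + I)^-1 Y."""
    X_scaled = X / lam_vec                      # X diag(lam)^-1, scaling each column
    n = X.shape[0]
    return X_scaled.T @ np.linalg.solve(X_scaled @ X.T + np.eye(n), Y)

rng = np.random.default_rng(0)
n, d, n_classes = 5, 64, 5
X = rng.normal(size=(n, d))                     # synthetic embeddings
Y = np.eye(n_classes)                           # one example per class, one-hot targets
lam_vec = rng.uniform(0.5, 2.0, size=d)         # positive per-feature regularisers

W_primal = ridge_diag_primal(X, Y, lam_vec)
W_woodbury = ridge_diag_woodbury(X, Y, lam_vec)
print(np.max(np.abs(W_primal - W_woodbury)))
```

Because diag(λ) is diagonal, its inverse is an element-wise division, so the Woodbury form keeps the n × n cost of the scalar case.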

C BASE LEARNER HYPER-PARAMETERS

Figure 3 illustrates the effect of using SGD to learn, together with the parameters ω of the CNN,
the hyper-parameters (ρ in eq. 2) of the base learner Λ as well. We find that it is very important to
learn the scalar α (right plot of Figure 3) used to calibrate the output of R2-D2 in eq. 6, while it
makes little difference whether or not λ is learnt. Note that, by using SGD to update α, it is possible (e.g. in the
range [10^-3, 10^0]) to recover from poor initial values and suffer just a small performance loss w.r.t.
the optimal value of α = 10.
The left plot of Figure 3 also shows the performance of R2-D2 with the variant diag(λ) introduced
in Appendix B. Unfortunately, although this formulation allows us to make use of a more expressive
prior, it does not improve the results compared to using a simple scalar λ. Moreover, performance
abruptly deteriorates for λ > 0.01.

[Figure 3 plots: two panels, both CIFAR-FS 5-way, 1-shot. y-axis: accuracy (%). Left panel: x-axis is the initial value of λ (10^-4 to 10^5), with curves Fixed, Learnt and Learnt diag(λ). Right panel: x-axis is the initial value of α (10^-3 to 10^5), with curves Fixed and Learnt.]

Figure 3: Shaded areas represent 95% confidence intervals.

Make 04 00002 v2
No ratings yet
Make 04 00002 v2
20 pages
Transfer Learning in Building Neural Network Model Case Study
No ratings yet
Transfer Learning in Building Neural Network Model Case Study
6 pages
Learn2learn A Library For Meta-Learning Research
No ratings yet
Learn2learn A Library For Meta-Learning Research
10 pages
On Episodes, Prototypical Networks, and Few-Shot Learning: Steinar Laenen Luca Bertinetto WWW - Five.ai
No ratings yet
On Episodes, Prototypical Networks, and Few-Shot Learning: Steinar Laenen Luca Bertinetto WWW - Five.ai
19 pages
Diligenti 2017
No ratings yet
Diligenti 2017
4 pages
A Tutorial On Meta-Reinforcement Learning: Foundations and Trends in Machine Learning
No ratings yet
A Tutorial On Meta-Reinforcement Learning: Foundations and Trends in Machine Learning
164 pages
4 PDF
No ratings yet
4 PDF
37 pages
Machine Learning
No ratings yet
Machine Learning
6 pages
INTRODUCTION
No ratings yet
INTRODUCTION
67 pages
Enhancing Few-Shot Image Classification Through Learnable Multi-Scale Embedding and Attention Mechanisms
No ratings yet
Enhancing Few-Shot Image Classification Through Learnable Multi-Scale Embedding and Attention Mechanisms
14 pages
Ad3501-Dl-Unit 1 Notes
No ratings yet
Ad3501-Dl-Unit 1 Notes
43 pages
Kearns Vairani 1994 Introduction To Computational Learning Thoery Ch01
No ratings yet
Kearns Vairani 1994 Introduction To Computational Learning Thoery Ch01
41 pages
AD3501-DL-Unit 1 Notes
No ratings yet
AD3501-DL-Unit 1 Notes
43 pages
High-Mobility Free-Standing InSb Nanoflags Grown On InP Nanowire Stems For Quantum Devices
No ratings yet
High-Mobility Free-Standing InSb Nanoflags Grown On InP Nanowire Stems For Quantum Devices
9 pages
Top-Down Fabrication of Bulk-Insulating Topological Insulator Nanowires For Quantum Devices
No ratings yet
Top-Down Fabrication of Bulk-Insulating Topological Insulator Nanowires For Quantum Devices
32 pages
Quantum Coherent Tunable Coupling of Superconducting Qubits: Reports
No ratings yet
Quantum Coherent Tunable Coupling of Superconducting Qubits: Reports
5 pages
Quantum Nanophotonics in Two-Dimensional Materials
No ratings yet
Quantum Nanophotonics in Two-Dimensional Materials
27 pages
Topological Quantum Devices - A Review
No ratings yet
Topological Quantum Devices - A Review
56 pages
Simulating 2D Lattice Gauge Theories On A Qudit Quantum Computer
No ratings yet
Simulating 2D Lattice Gauge Theories On A Qudit Quantum Computer
23 pages
Gate-Controlled Quantum Dots Based On Two-Dimensional Materials
No ratings yet
Gate-Controlled Quantum Dots Based On Two-Dimensional Materials
37 pages
Physics
No ratings yet
Physics
42 pages
Quantum Sensing With Erasure Qubits
No ratings yet
Quantum Sensing With Erasure Qubits
6 pages
Parallel Computing and Monte Carlo Algorithms
No ratings yet
Parallel Computing and Monte Carlo Algorithms
27 pages
Superconducting Phase Qubits
No ratings yet
Superconducting Phase Qubits
23 pages
Impurity in A Bose-Einstein Condensate - Study of The Attractive
No ratings yet
Impurity in A Bose-Einstein Condensate - Study of The Attractive
13 pages
RSC Advances
No ratings yet
RSC Advances
14 pages
Multimodality in Meta-Learning - A Comprehensive Survey
No ratings yet
Multimodality in Meta-Learning - A Comprehensive Survey
21 pages
Markov Chain Monte Carlo Without Likelihoods
No ratings yet
Markov Chain Monte Carlo Without Likelihoods
5 pages
Nano-Alumina Modified With 2,4-Dinitrophenylhydrazine
No ratings yet
Nano-Alumina Modified With 2,4-Dinitrophenylhydrazine
9 pages
10.1351 Pac200476040801
No ratings yet
10.1351 Pac200476040801
13 pages
HM Removal by Iron NP
No ratings yet
HM Removal by Iron NP
11 pages
Non-Hermitian Quantum Thermodynamics
No ratings yet
Non-Hermitian Quantum Thermodynamics
8 pages
Provable Meta-Learning of Linear Representati
No ratings yet
Provable Meta-Learning of Linear Representati
10 pages
Meta-Learning in Distributed Data Mining Systems
No ratings yet
Meta-Learning in Distributed Data Mining Systems
38 pages
Meta-Learning Assisted Robust Control of Universal Quantum Gates With Uncertainties
No ratings yet
Meta-Learning Assisted Robust Control of Universal Quantum Gates With Uncertainties
10 pages
Meta-Data - Characterization of Input Features For Meta-Learning
No ratings yet
Meta-Data - Characterization of Input Features For Meta-Learning
12 pages
Federated Meta-Learning With Fast Convergence and
No ratings yet
Federated Meta-Learning With Fast Convergence and
14 pages
Making Scalable Meta Learning Practical
No ratings yet
Making Scalable Meta Learning Practical
20 pages
Clock-Work Trade-Off Relation For Coherence in Quantum Thermodynamics
No ratings yet
Clock-Work Trade-Off Relation For Coherence in Quantum Thermodynamics
16 pages
Entropy 12 01975
No ratings yet
Entropy 12 01975
70 pages
Quantum Information Processing For A Coherent Superposition State Via A Mixed Entangled Coherent Channel
No ratings yet
Quantum Information Processing For A Coherent Superposition State Via A Mixed Entangled Coherent Channel
7 pages
Operational Interpretation of Quantum Fisher Information in Quantum Thermodynamics
No ratings yet
Operational Interpretation of Quantum Fisher Information in Quantum Thermodynamics
7 pages
Naivebayes Tute
No ratings yet
Naivebayes Tute
4 pages
Fake News Detection Using Machine Learning
No ratings yet
Fake News Detection Using Machine Learning
11 pages
MBA786M Project
No ratings yet
MBA786M Project
2 pages
Lecture1 IntroductiontoML
No ratings yet
Lecture1 IntroductiontoML
70 pages
Data Mining Lesson Plan-Revised Syllabus
No ratings yet
Data Mining Lesson Plan-Revised Syllabus
4 pages
AI Algorithm - Detect Glasses
No ratings yet
AI Algorithm - Detect Glasses
41 pages
Group Technology and Cellular Manufacturing-I
100% (1)
Group Technology and Cellular Manufacturing-I
20 pages
ANOVA & ANACOVA and Previous RGUHS Questions With Answers
No ratings yet
ANOVA & ANACOVA and Previous RGUHS Questions With Answers
55 pages
Data Science II: Charles C.N. Wang
No ratings yet
Data Science II: Charles C.N. Wang
38 pages
Credit Card Fraud Detection
No ratings yet
Credit Card Fraud Detection
24 pages
Hydroclassifier
No ratings yet
Hydroclassifier
14 pages
ML Visuals
No ratings yet
ML Visuals
61 pages
Email Classification: Roll No-41463 (LP-3)
No ratings yet
Email Classification: Roll No-41463 (LP-3)
5 pages
P1 - Single Layer Feed Forward Networks
No ratings yet
P1 - Single Layer Feed Forward Networks
52 pages
CCST9017 (2023-24lecture11printed Version) MachineLearning
No ratings yet
CCST9017 (2023-24lecture11printed Version) MachineLearning
55 pages
Buet Cse Syllabus
No ratings yet
Buet Cse Syllabus
24 pages
Customer Churn Prediction
No ratings yet
Customer Churn Prediction
70 pages
Analyzing Large Scale Human Mobility Dat
No ratings yet
Analyzing Large Scale Human Mobility Dat
23 pages
A Review On Data Mining Techniques For Fertilizer Recommendation
No ratings yet
A Review On Data Mining Techniques For Fertilizer Recommendation
5 pages
LLSPS - INT - 2831 - Predicting Life Expectancy Using Machine Learning
100% (1)
LLSPS - INT - 2831 - Predicting Life Expectancy Using Machine Learning
36 pages
Cu5096 Ia1
No ratings yet
Cu5096 Ia1
1 page
ML Probable Questions 2026 - أسئلة محتملة لامتحان تعلم الآلة 2026 ??
No ratings yet
ML Probable Questions 2026 - أسئلة محتملة لامتحان تعلم الآلة 2026 ??
2 pages
Lab Manual DAR
No ratings yet
Lab Manual DAR
81 pages
Predicting Probability of Debt Default A Study of Corporate Debt Market in India and Other Countries
No ratings yet
Predicting Probability of Debt Default A Study of Corporate Debt Market in India and Other Countries
321 pages
Classifier
No ratings yet
Classifier
39 pages
R23 3rd Year B.Tech AI and DS
No ratings yet
R23 3rd Year B.Tech AI and DS
52 pages
DMLab
No ratings yet
DMLab
27 pages
The Handbook of Data Mining - 1st Edition ISBN 0805840818, 9780805840810 Complete EPUB Ebook
No ratings yet
The Handbook of Data Mining - 1st Edition ISBN 0805840818, 9780805840810 Complete EPUB Ebook
17 pages
Business ANalytics - Logistic Regression
No ratings yet
Business ANalytics - Logistic Regression
24 pages




3.2 EFFICIENT RIDGE REGRESSION BASE LEARNERS

3.3 ITERATIVE BASE LEARNERS AND LOGISTIC REGRESSION

3.4 TRAINING POLICY

4.1 FEW-SHOT LEARNING BENCHMARKS

20 examples of 1623 characters, grouped in 50 different alphabets. In order to be able to compare

4.2 EXPERIMENTAL RESULTS

Table 1: Few-shot multi-class classification accuracies on miniImageNet and CIFAR-FS.

Table 2: Few-shot multi-class classification accuracies on Omniglot.

Omniglot, 5-way Omniglot, 20-way

Table 3: Few-shot binary classification accuracies on miniImageNet and CIFAR-FS.

Table 4: Time required to solve 10,000 miniImageNet episodes of 10 samples each.

B DIFFERENT GAUSSIAN PRIORS FOR REGULARIZATION

C BASE LEARNER HYPER-PARAMETERS

Figure 3: Shaded areas represent 95% confidence intervals.
