Published in Towards Data Science

Hugo Tessier

Sep 9, 2021 · 22 min read

Neural Network Pruning 101

All you need to know not to get lost

Whether it is in computer vision, natural language processing or image generation, deep neural networks yield the state of the art. However, their cost in terms of computational power, memory or energy consumption can be prohibitive, making some of them downright unaffordable for most limited hardware. Yet, many domains would benefit from neural networks, hence the need to reduce their cost while maintaining their performance.

That is the whole point of neural network compression. This field comprises multiple families of methods, such as quantization [11], factorization [13], distillation [32] or, the focus of this post, pruning.

Neural network pruning is a method that revolves around the intuitive idea of removing superfluous parts of a network that performs well but costs a lot of resources. Indeed, even though large neural networks have proven countless times how well they can learn, it turns out that not all of their parts are still useful once training is over. The idea is to eliminate these parts without impacting the network’s performance.

Unfortunately, the dozens, if not hundreds, of papers published each year reveal the hidden complexity of a supposedly straightforward idea. Indeed, a quick overview of the literature yields countless ways of identifying said useless parts or of removing them before, during or after training; it even turns out that not all kinds of pruning actually allow for accelerating neural networks, which is supposed to be the whole point.

The goal of this post is to provide a solid foundation to tackle the intimidatingly wild literature around neural network pruning. We will successively review three questions that seem to be at the core of the whole domain: “What kind of part should I prune?”, “How to tell which parts can be pruned?” and “How to prune parts without harming the network?”. To sum it up, we will detail pruning structures, pruning criteria and pruning methods.

1 — Pruning structures

1.1 — Unstructured pruning


When talking about the cost of neural networks, the parameter count is surely one of the most widely used metrics, along with FLOPs (floating-point operations). It is indeed intimidating to see networks displaying astronomical amounts of weights (up to billions for some), often correlated with stellar performance. Therefore, it is quite intuitive to aim at directly reducing this count by removing the parameters themselves. Actually, pruning connections is one of the most widespread paradigms in the literature, enough to be considered the default framework when dealing with pruning. The seminal work of Han et al. [26] presented this kind of pruning and served as a basis for numerous contributions [18, 21, 25].

Directly pruning parameters has many advantages. First, it is simple, since replacing the value of a weight with zero, within the parameter tensors, is enough to prune a connection. Widespread deep learning frameworks, such as PyTorch, make it easy to access all the parameters of a network, which makes the method extremely simple to implement. Still, the greatest advantage of pruning connections remains that they are the smallest, most fundamental elements of networks and, therefore, numerous enough to be pruned in large quantities without impacting performance. Such a fine granularity allows pruning very subtle patterns, down to parameters within convolution kernels, for example. As pruning weights is not limited by any constraint and is the finest way to prune a network, this paradigm is called unstructured pruning.
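To make this concrete, here is a minimal PyTorch sketch of unstructured pruning through masking; the helper name and the random masks are mine, purely illustrative, and during any subsequent training the masks would have to be re-applied after every optimizer step so that pruned connections stay at zero.

```python
import torch
import torch.nn as nn

def apply_unstructured_masks(model: nn.Module, masks: dict) -> None:
    """Zero out individual connections in-place, given one boolean
    mask per parameter tensor (True = keep, False = prune)."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name].to(param.dtype))

# Toy example with arbitrary random masks, just to show the mechanics.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
masks = {name: torch.rand_like(p) > 0.5  # keep roughly half of the weights
         for name, p in model.named_parameters() if "weight" in name}
apply_unstructured_masks(model, masks)
```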

However, this method presents a major, fatal drawback: most frameworks and hardware cannot accelerate sparse matrix computation, meaning that no matter how many zeros you fill the parameter tensors with, it will not impact the actual cost of the network. What does impact it, however, is pruning in a way that directly alters the very architecture of the network, which any framework can handle.

Difference between unstructured (left) and structured (right) pruning: structured pruning removes both convolution filters and rows of kernels instead of just pruning connections. This leads to fewer feature maps within intermediate representations. (image by author)

1.2 — Structured pruning


This is the reason why many works have focused on pruning larger structures, such as whole neurons [36] or, their direct equivalent in the more modern deep convolutional networks, convolution filters [40, 41, 66]. Filter pruning allows for an exploitable and yet fine enough granularity, as large networks tend to include numerous convolution layers, each counting up to hundreds or thousands of filters. Not only does removing such structures result in sparse layers that can be directly instantiated as thinner ones, but doing so also eliminates the feature maps that are the outputs of these filters.

Therefore, not only are such networks lighter to store, due to fewer parameters, but they also require fewer computations and generate lighter intermediate representations, hence needing less memory at runtime. Actually, it is sometimes more beneficial to reduce bandwidth than the parameter count. Indeed, for tasks that involve large images, such as semantic segmentation or object detection, intermediate representations may be prohibitively memory-consuming, far more than the network itself. For these reasons, filter pruning is now seen as the default kind of structured pruning.

Yet, when applying such pruning, one should pay attention to the following aspects. Let’s consider how a convolution layer is built: for Cin input channels and Cout output ones, a convolution layer is made of Cout filters, each counting Cin kernels; each filter outputs one feature map and, within each filter, one kernel is dedicated to each input channel. Considering this architecture, and acknowledging that a regular convolutional network basically stacks convolution layers, when pruning whole filters one may observe that pruning a filter, and hence the feature map it outputs, actually results in pruning the corresponding kernels in the ensuing layer too. That means that, when pruning filters, one may actually prune twice the amount of parameters thought to be removed in the first place, as the sketch below illustrates.
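As a sketch of that double counting, here is a hedged example on two consecutive convolutions; it assumes plain convolutions with no batch normalization or residual connection in between, and the helper name is mine:

```python
import torch
import torch.nn as nn

def prune_filter(conv: nn.Conv2d, next_conv: nn.Conv2d, filter_idx: int):
    """Physically remove one filter from `conv` and the kernels that consumed
    its feature map in `next_conv`; return the new layers and the number of
    parameters actually removed (roughly twice the first layer's share alone)."""
    keep = [i for i in range(conv.out_channels) if i != filter_idx]

    new_conv = nn.Conv2d(conv.in_channels, len(keep), conv.kernel_size,
                         conv.stride, conv.padding, bias=conv.bias is not None)
    new_conv.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        new_conv.bias.data = conv.bias.data[keep].clone()

    new_next = nn.Conv2d(len(keep), next_conv.out_channels, next_conv.kernel_size,
                         next_conv.stride, next_conv.padding,
                         bias=next_conv.bias is not None)
    new_next.weight.data = next_conv.weight.data[:, keep].clone()  # drop matching kernels
    if next_conv.bias is not None:
        new_next.bias.data = next_conv.bias.data.clone()

    removed = (sum(p.numel() for p in conv.parameters())
               + sum(p.numel() for p in next_conv.parameters())
               - sum(p.numel() for p in new_conv.parameters())
               - sum(p.numel() for p in new_next.parameters()))
    return new_conv, new_next, removed
```

In a real network, batch-normalization parameters and anything else consuming the pruned feature map (a residual branch, for instance) would have to be adjusted as well.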

Consider too that, when a whole layer happens to get pruned (which tends to happen because of layer collapse [62] but, depending on the architecture, does not always break the network), the previous layer’s outputs are now totally unconnected and hence pruned too: pruning a whole layer may actually prune all its previous layers whose outputs are not somehow connected elsewhere (through residual connections [28] or whole parallel paths [61]). Therefore, when pruning filters, one should take care to compute the exact number of actually pruned parameters. Indeed, pruning the same number of filters may, depending on their distribution within the architecture, not lead to the same actual amount of pruned parameters, making any results impossible to compare.

Before changing topic, let’s just mention that, albeit a minority, some works focus on pruning convolution kernels, intra-kernel structures [2, 24, 46] or even specific parameter-wise structures. However, such structures need special implementations to lead to any kind of speedup (as for unstructured pruning). Another kind of exploitable structure, though, is to turn convolutions into “shift layers” by pruning all but one parameter in each kernel, which can then be summed up as a combination of a shifting operation and a 1 × 1 convolution [24].

The danger of structured pruning: altering the input and output dimensions of layers can lead to discrepancies. While, on the left, both layers output the same number of feature maps, which can therefore be summed afterward, their pruned counterparts on the right produce intermediate representations of different dimensions that cannot be summed without further processing. (image by author)

2 — Pruning criteria

Once one has decided what kind of structure to prune, the next question is: “Now, how do I figure out which ones to keep and which ones to prune?”. To answer it, one needs a proper pruning criterion that will rank the relative importance of the parameters, filters or other structures.

2.1 — Weight magnitude criterion


One criterion that is quite intuitive and surprisingly effective is pruning the weights whose absolute value (or “magnitude”) is the smallest. Indeed, under the constraint of weight decay, the weights that do not contribute significantly to the function are expected to see their magnitude shrink during training. Therefore, the superfluous weights are expected to be those of lesser magnitude [8]. Notwithstanding its simplicity, the magnitude criterion is still widely used in modern works [21, 26, 58], making it a staple of the domain.
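As a sketch of how this criterion translates into pruning masks (the function name is mine, and the threshold here is computed globally over the whole network), one can simply compare each weight’s absolute value to the k-th smallest one:

```python
import torch
import torch.nn as nn

def magnitude_masks(model: nn.Module, pruning_rate: float) -> dict:
    """Boolean keep-masks that prune the `pruning_rate` fraction of weights
    with the smallest absolute value, measured over the whole model."""
    scores = torch.cat([p.detach().abs().flatten()
                        for n, p in model.named_parameters() if "weight" in n])
    k = int(pruning_rate * scores.numel())
    threshold = scores.kthvalue(k).values if k > 0 else scores.min() - 1
    return {n: p.detach().abs() > threshold
            for n, p in model.named_parameters() if "weight" in n}
```

Combined with a masking step like the one sketched in section 1.1, this is essentially the core of magnitude-based unstructured pruning.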

However, although this criterion seems trivial to implement in the case of unstructured pruning, one may wonder how to adapt it to structured pruning. One straightforward way is to order filters according to their norm (L1 or L2, for example) [40, 70]. Beyond this simple method, one may desire to encapsulate multiple sets of parameters within one measure: for example, a convolution filter, its bias and its batch-normalization parameters together, or even the corresponding filters within parallel layers whose outputs are then fused and whose channels we would like to reduce.
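For the structured case, ranking filters by their L1 norm can be sketched as follows (the function name is mine):

```python
import torch
import torch.nn as nn

def filters_to_prune(conv: nn.Conv2d, pruning_rate: float) -> list:
    """Indices of the filters with the smallest L1 norm in a convolution layer."""
    norms = conv.weight.detach().abs().sum(dim=(1, 2, 3))  # one norm per filter
    n_pruned = int(pruning_rate * conv.out_channels)
    return torch.argsort(norms)[:n_pruned].tolist()
```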

One way to do that, without having to compute the combined norm of these parameters, is to insert a learnable multiplicative parameter, or gate, for each feature map after each set of layers to prune. When reduced to zero, this gate effectively prunes the whole set of parameters responsible for this channel, and its magnitude accounts for the importance of all of them. The method hence consists in pruning the gates of lesser magnitude [36, 41].
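A minimal sketch of such a gate (module and attribute names are mine): a learnable per-channel multiplier inserted after the set of layers to prune, whose smallest-magnitude entries designate the channels, and every parameter feeding them, to remove.

```python
import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    """Learnable multiplicative gate per feature map; a gate driven to zero
    effectively prunes the whole set of parameters behind that channel."""
    def __init__(self, num_channels: int):
        super().__init__()
        self.gate = nn.Parameter(torch.ones(num_channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, height, width)
        return x * self.gate.view(1, -1, 1, 1)

block = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1),
                      nn.BatchNorm2d(64),
                      ChannelGate(64),
                      nn.ReLU())
```

In practice, a sparsity-inducing penalty (typically L1) is often added on the gates during training so that unimportant ones are pushed toward zero before being pruned.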

2.2 — Gradient magnitude pruning


The magnitude of the weight is not the only popular criterion (or family of criteria). Actually, the other main criterion to have lasted up to now is the magnitude of the gradient. Indeed, back in the 1980s, some fundamental works [37, 53] theorized, through a Taylor decomposition of the impact of removing a parameter on the loss, that some metrics derived from the back-propagated gradient may provide a good way to determine which parameters could be pruned without damaging the network.

More modern implementations of this criterion [4, 50] accumulate gradients over a minibatch of training data and prune on the basis of the product between this gradient and the corresponding weight of each parameter. This criterion can be applied to the aforementioned gates too [49].
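Such a criterion can be sketched as follows (the function name is mine, and a single minibatch stands in for the gradient accumulation of [4, 50]): each parameter is scored by the magnitude of the product between its value and its gradient.

```python
import torch
import torch.nn as nn

def taylor_scores(model: nn.Module, inputs, targets, loss_fn) -> dict:
    """First-order, Taylor-style importance: |weight * gradient|,
    estimated here from a single minibatch for brevity."""
    model.zero_grad()
    loss_fn(model(inputs), targets).backward()
    return {name: (param.detach() * param.grad).abs()
            for name, param in model.named_parameters() if param.grad is not None}
```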

2.3 — Global or local pruning


One final aspect to take into consideration is whether the chosen criterion is applied globally, to all parameters or filters of the network, or computed independently for each layer. While global pruning has repeatedly been shown to yield better results, it can lead to layer collapse [62]. A simple way to avoid this problem, when the method used cannot prevent layer collapse, is to resort to layer-wise local pruning, namely pruning at the same rate within each layer.

Difference between local pruning (left) and global pruning (right): local pruning applies the same rate to each layer while global pruning applies it to the whole network at once. (image by author)
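In code, the difference boils down to where the threshold is computed: the global variant was sketched in section 2.1, and a layer-wise local counterpart (helper name mine) computes one threshold per parameter tensor.

```python
import torch
import torch.nn as nn

def local_magnitude_masks(model: nn.Module, pruning_rate: float) -> dict:
    """Local pruning: the same rate is applied independently within each layer."""
    masks = {}
    for name, p in model.named_parameters():
        if "weight" in name:
            scores = p.detach().abs().flatten()
            k = int(pruning_rate * scores.numel())
            threshold = scores.kthvalue(k).values if k > 0 else scores.min() - 1
            masks[name] = p.detach().abs() > threshold
    return masks
```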

3 — Pruning methods

Now that we have our pruning structure and criterion, the only choice left is the method we should use to prune the network. This is actually the topic on which the literature can be the most confusing, as each paper brings its own quirks and gimmicks, so much that one may get lost between what is methodologically relevant and what is just a specificity of a given paper.

This is why we will thematically overview some of the most popular families of methods for pruning neural networks, in an order that highlights the evolution of the use of sparsity during training.

3.1 — The classic framework: train, prune and fine-tune

The first basic framework to know is the train, prune and fine-tune method, which involves 1) training the network, 2) pruning it by setting to 0 all the parameters targeted by the pruning structure and criterion (these parameters cannot recover afterwards) and 3) training the network for a few extra epochs, at the lowest learning rate, to give it a chance to recover from the loss in performance induced by pruning. Usually, these last two steps can be iterated, with the pruning rate growing each time.
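Reusing the helpers sketched in the previous sections, the whole framework can be summarized as the loop below; the model, the data loader, the rates and the epoch counts are arbitrary placeholders, not the exact recipe of [26].

```python
import torch
import torch.nn as nn

def fine_tune(model, loader, masks, epochs, lr):
    """A few extra epochs at a low learning rate; the masks are re-applied
    after every step so that pruned weights cannot recover."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss_fn(model(inputs), targets).backward()
            optimizer.step()
            apply_unstructured_masks(model, masks)

# 1) train the dense network as usual (not shown), then iterate 2) and 3):
for rate in (0.5, 0.75, 0.9):                            # growing pruning rate
    masks = magnitude_masks(model, rate)                 # 2) prune
    apply_unstructured_masks(model, masks)
    fine_tune(model, loader, masks, epochs=10, lr=1e-3)  # 3) recover
```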

The method proposed by Han et al. [26] applies this framework, with 5 iterations between pruning and fine-tuning, to weight magnitude pruning. Iterating has been shown to improve performance, at the cost of extra computation and training time. This simple framework serves as a basis for many works [26, 40, 41, 50, 66] and can be seen as the default method upon which all the others have built.

3.2 — Extending the classic framework


While not straying too far, some methods have brought significant modifications to the aforementioned classic framework of Han et al. [26]. Gale et al. [21] have pushed the principle of iterations further by progressively removing an increasing amount of weights all along the training process, which makes it possible to benefit from the advantages of iterating while removing the whole fine-tuning step. He et al. [29] set prunable filters to 0 at each epoch, while still allowing them to learn and be updated afterward, in order to let their weights grow back after pruning while enforcing sparsity during training.
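One hedged way to implement such progressive pruning (a simple linear ramp, not necessarily the exact schedule used in [21]) is to recompute the magnitude masks at regular intervals for a growing target sparsity:

```python
def sparsity_at(step: int, total_steps: int, final_rate: float = 0.9) -> float:
    """Fraction of weights pruned at a given training step; the linear
    ramp-up here is an arbitrary choice of schedule."""
    return final_rate * min(1.0, step / total_steps)

# During training, every few hundred steps:
#   masks = magnitude_masks(model, sparsity_at(step, total_steps))
# then re-apply the masks after each optimizer step, so that sparsity grows
# progressively instead of appearing all at once at the end of training.
```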

Finally, the method of Renda et al. [58] involves fully retraining a network once it is pruned. Unlike fine-tuning, which is performed at the lowest learning rate, retraining follows the same learning-rate schedule as the original training, hence its name: “Learning Rate Rewinding”. This retraining has been shown to yield better performance than mere fine-tuning, at a significantly higher cost.

3.3 — Pruning at initialization


In order to speed up training, avoid fine-tuning and prevent any alteration of the architecture during or after training, multiple works have focused on pruning before training. In the wake of SNIP [39], many works have studied the use of the work of Le Cun et al. [37] or of Mozer and Smolensky [53] to prune at initialization [12, 64], including intensive theoretical studies [27, 38, 62]. However, Optimal Brain Damage [37] relies on multiple approximations, including an “extremal” approximation that “assumes that parameter deletion will be performed after training has converged” [37]; this fact is rarely mentioned, even among the works that are based on it. Some works have raised reservations about the ability of such methods to generate masks whose relevance outshines that of random masks with a similar per-layer distribution [20].

Another family of methods that study the relationship between pruning and initialization gravitates around the “Lottery Ticket Hypothesis” [18]. This hypothesis states that “a randomly-initialized, dense neural network contains a subnetwork that is initialized such that — when trained in isolation — it can match the test accuracy of the original network after training for at most the same number of iterations”. In practice, this literature studies how well a pruning mask, defined using an already converged network, can be applied to that network back when it was just initialized. Multiple works have expanded, stabilized or studied this hypothesis [14, 19, 45, 51, 69]. However, once again, multiple works tend to question the validity of the hypothesis and of the method used to study it [21, 42], and some even tend to show that its benefits rather come from the principle of fully training with the definitive mask instead of from a hypothetical “winning ticket” [58].

Comparison between the classic “train, prune and fine-tune” framework [26], the lottery ticket experiment [18] and learning rate rewinding [58]. (image by author)
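In code, the lottery ticket experiment amounts to something like the sketch below, reusing the earlier helpers; `train` stands for an ordinary (placeholder) training loop and the 90% pruning rate is arbitrary.

```python
import copy

# 1) Save the freshly initialized weights.
initial_state = copy.deepcopy(model.state_dict())

# 2) Train to convergence and derive a pruning mask from the converged weights.
train(model, loader)
masks = magnitude_masks(model, 0.9)

# 3) Rewind to the initial weights, apply the mask, and train the subnetwork
#    in isolation (re-applying the masks after each optimizer step).
model.load_state_dict(initial_state)
apply_unstructured_masks(model, masks)
train(model, loader)
```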

3.4 — Sparse training


The previous methods are linked by a seemingly shared underlying theme: training under sparsity constraints. This principle is at the core of a family of methods called sparse training, which consists in enforcing a constant rate of sparsity during training while its distribution varies and is progressively adjusted. Introduced by Mocanu et al. [47], it involves: 1) initializing the network with a random mask that prunes a certain proportion of the network, 2) training this pruned network for one epoch, 3) pruning a certain amount of the weights of lowest magnitude and 4) regrowing the same amount of random weights.
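A sketch of one prune-and-regrow update on a single parameter tensor (names are mine; real implementations such as [47] apply this layer-wise or globally, once per epoch):

```python
import torch

def prune_and_regrow(weight: torch.Tensor, mask: torch.Tensor, n: int) -> torch.Tensor:
    """Drop the n surviving weights of smallest magnitude, then regrow n
    connections at random among the pruned ones, keeping sparsity constant."""
    flat_mask = mask.flatten().clone()
    flat_weight = weight.detach().flatten()

    # 3) prune: among surviving weights, disable the n of lowest magnitude
    alive = flat_mask.nonzero().squeeze(1)
    drop = alive[torch.argsort(flat_weight[alive].abs())[:n]]
    flat_mask[drop] = False

    # 4) regrow: re-enable n connections picked at random among the pruned ones
    dead = (~flat_mask).nonzero().squeeze(1)
    grow = dead[torch.randperm(dead.numel())[:n]]
    flat_mask[grow] = True

    return flat_mask.view_as(mask)
```

Regrown connections are typically re-initialized (often at zero), and the updated mask is re-applied to the weights after every optimizer step.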

That way, the pruning mask, at first random, is progressively adjusted to target the least important weights, while sparsity is enforced all throughout training. The sparsity level can be the same for each layer [47] or global [52]. Other methods have extended sparse training by using a certain criterion to regrow weights instead of choosing them randomly [15, 17].

Sparse training cuts and grows different weights periodically during training, which leads to a progressively adjusted pruning mask. (image by author)