[Fall 2024] Deep Learning 3
Outline
1. Representation Learning
2. Transfer Learning
3. Unsupervised and Self-Supervised Learning
What are we trying to learn?
What is a “feature”
● Consider the classical machine learning problem:
○ You have an input x and some function 𝚽(x) that returns any relevant features from x
■ For example, if x is a house and y is its selling price, then 𝚽(x) could be a vector with
information like the house age, the number of rooms, whether it has a basement or not, etc.
○ You pass the features into some model fθ(𝚽(x)), parameterized by θ, to predict the label y
○ The machine learning problem is to “learn” θ from a training dataset of (x, y) pairs
[Table: example houses described by hand-crafted features (House Age, Number of Bathrooms, Number of Bedrooms, Size (sq. feet), Floors, Basement?, Garage?, Backyard?, Pool?) together with an Output column, the selling price of each house]
Deep learning is Representation learning
● Deep learning allows a model to learn “good” representations directly from data!
○ The main idea is to relinquish all control to the model and let it learn whatever it feels is important to
solve the task at hand
○ Features are synonymous with representations in ML
● The output of each layer in a neural network is a learned representation so deep
learning can be viewed as the process of learning stacked representations
○ We call these representations hierarchical, i.e., later representations depend on, and are more
abstract than, earlier ones — depth refines representations
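As an illustrative sketch (not part of the original slides), one way to see these stacked representations is to register forward hooks on a pretrained torchvision ResNet-18 and inspect the output of each residual stage:

```python
import torch
from torchvision import models

# Load an ImageNet-pretrained ResNet-18 and record the output (the learned
# representation) of each of its four residual stages.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

activations = {}
def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

for name in ["layer1", "layer2", "layer3", "layer4"]:
    getattr(model, name).register_forward_hook(save_activation(name))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))   # a dummy "image"

for name, feat in activations.items():
    print(name, tuple(feat.shape))
# Depth trades spatial resolution for more abstract channels:
# layer1 (1, 64, 56, 56) ... layer4 (1, 512, 7, 7)
```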
Deep learning is Representation learning
Transfer Learning
Training a network from scratch
● Training a good model from scratch takes
○ Time
○ Compute
○ Training data — the more, the better.
■ Models benefit significantly from A LOT of data — especially in computer vision, a few
thousand examples usually doesn’t cut it.
● Money for all of the above
Transfer learning motivation
● When trained from scratch, a model’s parameters are initialized randomly and
updated by some optimization algorithm like stochastic gradient descent or Adam
● So, if there are two different tasks to be solved, no matter how similar, this process
will be repeated separately both times
● However, does it really need to be? Consider the case when we humans learn
something new: do we ALWAYS start from the ground up?
Transfer learning motivation
● Let’s apply this idea to deep neural networks — consider training separate models
for cat/dog classification and face recognition
○ They will all need to learn a suitable set of feature extractors for their inputs. Since representations
are hierarchical:
■ Low-level (earlier) layers might learn feature extractors for very concrete details like edges,
corners, shapes, patterns, colors, etc.
■ High-level (deeper) layers might learn feature extractors for abstract concepts like mental
models of cats/dogs or human faces
○ Note: these low-level features are general enough to be extracted from any kind of image for
any kind of task; representations/features only start to diverge significantly across tasks in
the later layers
○ Don’t worry about what exactly these feature extractors / neural network layers look like for now…
we will discuss that next time
Take some hypothetical model trained on
image inputs — what kind of information
and/or visual content does each layer look for?
Transfer learning motivation
● What about a neural network trained for
ImageNet classification?
○ The ImageNet dataset is a huge collection of ~1.3 million images divided across 1000
incredibly diverse classes
○ ResNet, AlexNet and DenseNet are just special kinds of neural network architectures…
we will cover some of them in detail soon enough, but just think of them as some black-box NNs for now
○ Takeaway: different models trained for image classification learn very similar lower layers
[Figure: first-layer filters of ResNet 18, ResNet 101, AlexNet and DenseNet 121. The first layers of completely different models, trained separately, are trying to extract the same kind of information from an input image!]
Transfer learning motivation
How can we transfer the “knowledge gained” by one network to another? What are we
really “sharing” between the two networks? Or rather, what can we possibly share
between them?
[Diagram: a network mapping input x through several layers to output y]
Freezing layers
● Take the pretrained network from task 1
● Freeze some of the initial layers (i.e. disable gradient updates to them) and treat them as a fixed feature
extractor — take the activations from these layers as some intermediate representation of your input
[Diagram: x → frozen layers → remaining layers → y]
Freezing layers
● Take the pretrained network from task 1
● Freeze some of the initial layers (i.e. disable gradient updates to them) and treat them as a fixed feature
extractor — take the activations from these layers as some intermediate representation of your input
● Discard the remaining unfrozen later layers (possibly none)
[Diagram: x → frozen layers → y, with the later layers discarded]
Freezing layers
● Take the pretrained network from task 1
● Freeze some of the initial layers (i.e. disable gradient updates to them) and treat them as a fixed feature
extractor — take the activations from these layers as some intermediate representation of your input
● Discard the remaining unfrozen later layers (possibly none)
● Optionally, attach a new, unfrozen custom network (often just a few randomly initialized linear layers)
on top of the frozen layers
[Diagram: x → frozen layers → new custom layers → y]
Aside: in computer vision literature, if this custom network is only a single linear layer, then this process is also called linear probing!
Freezing layers
● Take the pretrained network from task 1
● Freeze some of the initial layers (i.e. disable gradient updates to them) and treat them as a fixed feature
extractor — take the activations from these layers as some intermediate representation of your input
● Discard the remaining unfrozen later layers (possibly none)
● Optionally, attach a new, unfrozen custom network (often just a few randomly initialized linear layers)
on top of the frozen layers
● Train these new layers on unseen data from task 2
[Diagram: x → frozen layers → new custom layers → y]
Aside: in computer vision literature, if this custom network is only a single linear layer, then this process is also called linear probing!
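A minimal PyTorch sketch of this recipe, under the assumption that task 1 is ImageNet classification (so we can reuse a pretrained torchvision ResNet-18) and that task 2 is a hypothetical 10-class problem:

```python
import torch
import torch.nn as nn
from torchvision import models

# Task 1: ImageNet classification, so we can grab a pretrained ResNet-18.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained layers: gradient updates to them are disabled.
for param in backbone.parameters():
    param.requires_grad = False

# Replace the old classification head with a new, randomly initialized one for
# task 2 (a hypothetical 10-class problem). A single linear layer like this is
# exactly the "linear probing" setup mentioned above.
backbone.fc = nn.Linear(backbone.fc.in_features, 10)

# Only the new head is trainable; the frozen layers act as a fixed feature extractor.
trainable = [p for p in backbone.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
```

During training on task 2, only the new head's parameters change; everything upstream simply produces intermediate representations of the input.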
Fine-tuning layers
● Instead of discarding the non-frozen layers of the pretrained network, fine-tune them, i.e., train them
further (starting with the same pre-trained weights from task 1) on the new data for task 2.
[Diagram: x → frozen layers → trainable (fine-tuned) layers → y]
Fine-tuning layers
● Instead of discarding the non-frozen layers of the pretrained network, fine-tune them, i.e., train them
further (starting with the same pre-trained weights from task 1) on the new data for task 2.
● Can also simultaneously train any (optional) additional output layers (again, these tend to just be
randomly initialized linear layers) on top of the original pre-trained network layers as necessary!
[Diagram: x → pretrained (fine-tuned) layers → new output layers → y]
Fine-tuning the whole network
In some cases, it might be favorable to fine-tune the entire pretrained network rather
than some subset of layers. This can equivalently be viewed as initializing a network’s
parameters to a pretrained network’s parameters, instead of the usual random
initialization. In other words, a good pretrained network gives you a “head-start” in the
training process.
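As a hedged sketch of this idea (again assuming an ImageNet-pretrained ResNet-18 and a hypothetical 10-class downstream task; the learning rates are arbitrary placeholders), every parameter stays trainable and the pretrained weights simply serve as the initialization:

```python
import torch
import torch.nn as nn
from torchvision import models

# Initialize from pretrained weights instead of random ones, swap in a new head,
# and keep every parameter trainable. The 10 classes and the learning rates
# below are placeholders; a smaller rate for the pretrained backbone is a common
# choice so its weights are only gently adjusted.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 10)

optimizer = torch.optim.SGD(
    [
        {"params": [p for n, p in model.named_parameters() if not n.startswith("fc")],
         "lr": 1e-4},                                   # pretrained layers: small updates
        {"params": model.fc.parameters(), "lr": 1e-2},  # fresh head: larger updates
    ],
    lr=1e-4, momentum=0.9,
)
```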
Fine-tuning vs freezing
Choosing the number of frozen, fine-tunable and custom layers greatly depends on the
problem at hand. Nonetheless, here is what CS231N (Stanford’s DL course) suggests
for four common scenarios in which transfer learning is generally applicable:
[Diagram: Dataset 1 → Learning system (task 1) → Knowledge → Learning system (task 2) ← Dataset 2]
Unsupervised Learning
Supervised vs unsupervised learning
● Supervised learning — the training process receives supervision from labels
○ Ex: classification, regression, object detection, etc.
● Unsupervised learning — the training process receives no supervision from labels
○ Ex: dimensionality reduction (PCA, t-SNE), clustering, generative modeling, etc.
Unsupervised representation learning
● Models trained on large labeled datasets learn pretty useful features in general,
but can we also learn them from completely unlabeled data?
● Imagine scraping a giant collection of images, text, audio samples, etc. from the
web without having to manually label them. Consider the implications of this:
○ ImageNet only has a million images but there are billions of images online
○ Curated text datasets (like the English Wikipedia) may have billions of tokens but there are trillions
of words on the internet
● Moreover, labeling data for good supervision can be extremely difficult, tedious,
expensive and time-consuming
○ Ex. medical data, legal data, etc.
Self-Supervised Learning
● Self-supervised learning — the model first generates labels from the raw data itself and
then trains in a supervised manner
● “The general technique of self-supervised learning is to predict any unobserved or
hidden part of the input from any observed or unhidden part of the input”
Some terminology
● Recall how we have been referring to a “task 1” and “task 2” during our discussion
of transfer learning so far. It turns out they each have a special name:
○ Pretext task — the task on which a self-supervised network is first (pre)trained
○ Downstream task — the task that then leverages the pretrained representations in some way.
Some examples in different fields of ML include:
■ Computer Vision – classification, object detection, segmentation, etc.
■ NLP – classification, translation, summarization, question-answering, etc.
■ RL – online model-free fine-tuning, etc.
Autoencoder (the most basic unsupervised model)
● Pretext task — learn a “compressed” representation of an input (does not have to
be an image, but it’s one in the example below for illustrative purposes)
● The bottleneck is a hidden layer with very few neurons — if the input can be
reconstructed from the output of the bottleneck layer, it must have captured
enough “important” information about the original input in a smaller vector
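Here is a minimal autoencoder sketch in PyTorch (not taken from the slides); the 28×28 input size, the 32-dimensional bottleneck, and the layer widths are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A tiny fully connected autoencoder for flattened 28x28 inputs. The 32-dim
# bottleneck forces the encoder to keep only the "important" information needed
# to reconstruct the input.
class AutoEncoder(nn.Module):
    def __init__(self, input_dim=28 * 28, bottleneck_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, bottleneck_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)       # compressed representation (the bottleneck output)
        return self.decoder(z)    # reconstruction of the original input

model = AutoEncoder()
x = torch.rand(64, 28 * 28)       # a batch of flattened "images"
loss = F.mse_loss(model(x), x)    # pretext task: reconstruct the input
loss.backward()
```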
Rotnet
● The pretext task is to predict the
rotation angle of a rotated image
○ 4-way classification between
0, 90, 180 and 270 degrees
● Model learns about the
relationships between high level
features in an image (ex. location,
type and poses of objects), instead
of just low-level patterns
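A possible way to generate the pretext labels (a simplified sketch, not the original RotNet implementation):

```python
import torch

# Generate (rotated image, rotation label) pairs: rotate each image by a random
# multiple of 90 degrees and use that multiple (0-3) as the classification target.
def make_rotation_batch(images):
    """images: (N, C, H, W) -> (rotated images, labels in {0, 1, 2, 3})."""
    labels = torch.randint(0, 4, (images.shape[0],))
    rotated = torch.stack([
        torch.rot90(img, k=int(k), dims=(1, 2)) for img, k in zip(images, labels)
    ])
    return rotated, labels

images = torch.rand(8, 3, 32, 32)
rotated, labels = make_rotation_batch(images)
# Any image classifier can now be trained with cross-entropy on (rotated, labels).
```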
Rotnet
Jigsaw
● Pretext task – sample 9 patches from a 3-by-3 grid over the image, shuffle them, and then
predict their original order
○ Shuffling order sampled from a predefined set instead of all 9! permutations
● Model learns that images are made up of “parts” that are related to each other!
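A simplified sketch of the pretext data generation (the two permutations and the patch size below are placeholders; the actual method uses a larger, carefully chosen permutation set):

```python
import torch

# Cut an image into a 3x3 grid of patches, shuffle them with a permutation drawn
# from a small predefined set, and use the permutation's index as the label.
PERMUTATIONS = [
    (0, 1, 2, 3, 4, 5, 6, 7, 8),
    (8, 7, 6, 5, 4, 3, 2, 1, 0),
]

def make_jigsaw_example(image, patch=32):
    """image: (C, 3*patch, 3*patch) -> (shuffled patches of shape (9, C, patch, patch), label)."""
    patches = [image[:, r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
               for r in range(3) for c in range(3)]
    label = torch.randint(0, len(PERMUTATIONS), ()).item()
    shuffled = torch.stack([patches[i] for i in PERMUTATIONS[label]])
    return shuffled, label

patches, label = make_jigsaw_example(torch.rand(3, 96, 96))
```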
Word2Vec
● A family of methods for learning embeddings that map discrete words to dense vectors
● Multiple ways to learn word2vec embeddings, two of which include:
○ Continuous Bag of Words — predict a word from the surrounding (context) words
■ Ex: predict bit in “The dog bit the man” from dog and the.
○ Skip-Gram — predict the surrounding context from a word
■ Ex: predict dog and the in “The dog bit the man” from bit
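As a sketch of how such training pairs can be generated for the skip-gram objective (the window size and sentence are purely illustrative):

```python
# For each center word, every word inside a small window around it becomes a
# (center, context) training pair; the model learns to predict context from center.
sentence = "the dog bit the man".split()
window = 2

pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

print(pairs)   # includes ('bit', 'the'), ('bit', 'dog'), ('bit', 'the'), ('bit', 'man'), ...
```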
Masked Language Models
● Common word2vec approaches involve predicting words from only their
immediately surrounding context. What if we take into consideration the entire
sentence a word belongs to?
● Further, what if you predict multiple words within that sentence simultaneously?
● We say that these words have been masked and the pretext task here is to unmask
them using the remaining words as context
BERT (2018)
● Input tokens (words) are
randomly masked with 15%
probability — the model
must predict these
○ Gives word level
representations
● Pairs of sentences are passed in together — the model must predict whether the second
sentence actually follows the first (next sentence prediction)
○ Gives sentence level
representations
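A highly simplified sketch of the masking step (the token IDs and mask ID are placeholders; BERT's full recipe also sometimes keeps the selected token unchanged or swaps in a random one instead of always using [MASK]):

```python
import torch

# Hide ~15% of the tokens and train the model to recover them.
MASK_ID = 0
token_ids = torch.randint(1, 30_000, (1, 12))      # a fake tokenized sentence

mask = torch.rand(token_ids.shape) < 0.15          # select ~15% of the positions
inputs = token_ids.clone()
inputs[mask] = MASK_ID                             # replace them with [MASK]

# The loss is computed only at the masked positions, e.g.:
# loss = F.cross_entropy(model(inputs)[mask], token_ids[mask])
```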
Lecture Attendance
http://tinyurl.com/fa24-dl4cv
Contributors
● Aryan Jain