
Deep Learning 3

By: ML@B Edu Team


Announcements
● Quiz 2 due next Monday
○ Covers content from Monday’s (Deep Learning 2) and today’s (Deep Learning 3) lectures
● Homework 1 due next Monday
● Office Hours will be held every Thursday 3-4 PM, at Cory 531
Outline
1. Deep Learning is Representation Learning
2. Transfer Learning
3. Unsupervised and Self-Supervised Learning
What are we trying to learn?
What is a “feature”?
● Consider the classical machine learning problem:
○ You have an input x and some function 𝚽(x) that returns any relevant features from x
■ For example, if x is a house and y is its selling price, then 𝚽(x) could be a vector with information like the house age, the number of rooms, whether it has a basement or not, etc.
○ You pass the features into some model fθ(𝚽(x)), parameterized by θ, to predict the label y
○ The machine learning problem is to “learn” θ from a training dataset of (x, y) pairs

House | House Age | Number of Bathrooms | Number of Bedrooms | Size (sq. feet) | Floors | Basement? | Garage? | Backyard? | Pool?
1     | 12 months | 4                   | 4                  | 2500            | 2      | Yes       | Yes     | No        | No
2     | 30 months | 2                   | 3                  | 2000            | 1      | No        | Yes     | Yes       | Yes

Pipeline: Input x → Feature Extraction (x → 𝚽(x)) → Features 𝚽(x) → ML Algorithm (𝚽(x) → y) → Output y
An alternate view of deep learning
● In the early days, feature extractors were programmed by hand, i.e., ML
practitioners would manually select what features they would feed into the model
of their choice.
○ If the input is an image, this could include edge and corner information, presence of certain shapes,
patterns or colors, etc — these are all different representations of the same image

Input → Hand-engineered features → Algorithm with Learned Weights → Output
An alternate view of deep learning
● However, this is a compromise! The model is learning its parameters from data yet
we are still hand-programming the feature extractors ourselves. Can we make the
entire process learned from end-to-end?
○ Yes! This is where deep learning comes in!
○ Learn the feature extractor as well — learn EVERYTHING in the entire pipeline!

Input → Learned Feature Extractor → Algorithm with Learned Weights → Output
Deep learning is Representation learning
● Deep learning allows a model to learn “good” representations directly from data!
○ The main idea is to relinquish all control to the model and let it learn whatever it feels is important to
solve the task at hand
○ Features are synonymous with representations in ML
● The output of each layer in a neural network is a learned representation so deep
learning can be viewed as the process of learning stacked representations
○ We call these representations hierarchical, i.e., later representations depend on, and are more
abstract than earlier ones — depth refines representations
Transfer Learning
Training a network from scratch
● Training a good model from scratch takes
○ Time
○ Compute
○ Training data — the more, the better.
■ Models benefit significantly from A LOT of data — especially in computer vision, a few
thousand examples usually doesn’t cut it.
● Money for all of the above
Transfer learning motivation
● When trained from scratch, a model’s parameters are initialized randomly and
updated by some optimization algorithm like stochastic gradient descent or Adam
● So, if there are two different tasks to be solved, no matter how similar, this process
will be repeated separately both times
● However, does it really need to be? Consider the case when we humans learn
something new: do we ALWAYS start from the ground up?
Transfer learning motivation
● Let’s apply this idea to deep neural networks — consider training separate models
for cat/dog classification and face recognition
○ Both will need to learn a suitable set of feature extractors for their inputs. Since representations
are hierarchical:
■ Low-level (earlier) layers might learn feature extractors for very concrete details like edges,
corners, shapes, patterns, colors, etc.
■ High-level (deeper) layers might learn feature extractors for abstract concepts like mental
models of cats/dogs or human faces
○ Note: these low-level features are general enough to be extracted from any kind of image for
any kind of task and representations/features only start to diverge significantly across tasks in
the later layers
○ Don’t worry about what exactly these feature extractors / neural network layers look like for now…
we will discuss that next time
Take some hypothetical model trained on
image inputs — what kind of information
and/or visual content does each layer look for?
Transfer learning motivation
● What about a neural network trained for ImageNet classification?
○ The ImageNet dataset is a huge collection of ~1.3 million images divided across 1000 incredibly diverse classes
○ ResNet, AlexNet and DenseNet are just special kinds of neural network architectures… we will cover some of them in detail soon enough, but just think of them as some black box NNs for now
○ Takeaway: different models trained for image classification learn very similar lower layers
[Figure: first-layer filters of ResNet 18, ResNet 101, AlexNet and DenseNet 121] The first layers of completely different models, trained separately, are trying to extract the same kind of information from an input image!
Transfer learning motivation
How can we transfer the “knowledge gained” by one network to another? What are we really “sharing” between the two networks? Or rather, what can we possibly share between them?

Hint: neural networks are stacks of layers


Freezing layers
● Take the pretrained network from task 1
● Freeze some of the initial layers (i.e. disable gradient updates to them) and treat them as a fixed feature extractor — take the activations from these layers as some intermediate representation of your input
● Discard the remaining un-frozen (possibly none) later layers
● Optionally, attach a new custom network (often only a few linear layers that are initialized randomly) that is not frozen to the frozen layers
● Train these new layers on unseen data from task 2

x → [Frozen Layers] → [Custom Layers] → y

Aside: in computer vision literature, if this custom network is only a single linear layer, then this process is also called linear probing!
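Below is a minimal PyTorch sketch of this recipe, assuming a recent torchvision is installed; an ImageNet-pretrained ResNet-18 stands in for the task 1 network, and the number of task 2 classes is a placeholder.

```python
import torch
import torch.nn as nn
from torchvision import models

# Take the pretrained network from task 1 (here, ImageNet classification).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze every pretrained layer: no gradient updates flow into them.
for p in backbone.parameters():
    p.requires_grad = False

# Replace the original head with a single new, trainable linear layer
# for task 2 (linear probing). num_task2_classes is a placeholder.
num_task2_classes = 10
backbone.fc = nn.Linear(backbone.fc.in_features, num_task2_classes)

# Only the new layer's parameters are given to the optimizer.
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(backbone(images), labels)
    loss.backward()          # gradients only reach the unfrozen linear layer
    optimizer.step()
    return loss.item()
```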
Fine-tuning layers
● Instead of discarding the non-frozen layers of the pretrained network, fine-tune them, i.e., train them further (starting with the same pre-trained weights from task 1) on the new data for task 2.
● Can also simultaneously train any (optional) additional output layers (again, these tend to just be randomly initialized linear layers) on top of the original pre-trained network layers as necessary!

x → [Frozen Layers] → [Trainable Layers] → [Custom Layers] → y
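A sketch of the fine-tuning variant under the same assumptions (torchvision ResNet-18, placeholder class count): the early layers stay frozen, the last residual stage is fine-tuned, and a fresh linear head is trained, with a smaller learning rate commonly used for the pretrained weights.

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze everything first...
for p in model.parameters():
    p.requires_grad = False
# ...then unfreeze the last residual stage so it can be fine-tuned on task 2.
for p in model.layer4.parameters():
    p.requires_grad = True

# Attach a randomly initialized linear head for task 2 (10 classes is a placeholder).
model.fc = nn.Linear(model.fc.in_features, 10)

# Common trick: smaller learning rate for pretrained weights than for the new head.
optimizer = torch.optim.Adam([
    {"params": model.layer4.parameters(), "lr": 1e-4},
    {"params": model.fc.parameters(),     "lr": 1e-3},
])
```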
Fine-tuning the whole network
In some cases, it might be favorable to fine-tune the entire pretrained network rather
than some subset of layers. This can equivalently be viewed as initializing a network’s
parameters to a pretrained network’s parameters, instead of the usual random
initialization. In other words, a good pretrained network gives you a “head-start” in the
training process.
Fine-tuning vs freezing
Choosing the number of frozen, fine-tunable and custom layers greatly depends on the
problem at hand. Nonetheless, here is what CS231N (Stanford’s DL course) suggests
for four common scenarios in which transfer learning is generally applicable:

● Task 2 dataset similar to task 1 dataset:
○ Large task 2 dataset: should be ok to fine-tune the entire network.
○ Small task 2 dataset: no need to fine-tune. Freeze most of the initial layers and train a linear classifier on top of them.
● Task 2 dataset different from task 1 dataset:
○ Large task 2 dataset: should be ok to fine-tune the entire network.
○ Small task 2 dataset: don't fine-tune. Freeze some of the initial layers and train a custom network on top of them.
A small aside on some deep learning terminology
Shots
● A shot is the number of new examples you show a pre-trained network during
transfer learning
● few shot: show the network a small number of examples
● k-shot: show the network exactly k examples (say, k labeled data points)
● 1-shot: show the network only a single example
● 0-shot: don’t show the network any new examples
○ Using frozen feature extractors is an example of 0-shot adaptation — one example is LLMs
Embeddings
● An embedding is the output of a map
from some (discrete or continuous)
space to a different (continuous)
space.
○ For example, this map can be from the
space of image pixels, text tokens or audio
waves to the space of d-dimensional
vectors.
○ This map yields image, text or audio
embeddings respectively.
● In other words, embeddings are just
another way of representing complex
non-numerical data using numbers!
Learned embeddings
● How does one obtain these embeddings in the first place?
○ Like with everything in deep learning, let’s learn them!
● Neural networks already learn their parameters via gradient-based optimization
— it is only natural to use one to generate data embeddings.
● Or alternatively, instead of training a whole new network from scratch, just take
the output of some intermediate or final layer of an existing neural network.
○ Recall that pretrained models with frozen layers can be treated as feature extractors.
○ These features can now function as pretrained learned embeddings!
Learned embeddings
[Figures: a network trained with Softmax + Cross Entropy, whose intermediate outputs serve as learned embeddings]
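As a hypothetical illustration of this, one can chop the final classification layer off an ImageNet-pretrained ResNet-50 (assuming torchvision is available) and use the pooled activations as 2048-dimensional image embeddings:

```python
import torch
from torchvision import models

resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.eval()

# Drop the final fc layer; what remains maps an image to a 2048-d vector
# (the output of the global average pooling layer).
embedder = torch.nn.Sequential(*list(resnet.children())[:-1])

with torch.no_grad():
    images = torch.randn(8, 3, 224, 224)          # stand-in for a real batch
    emb = embedder(images).flatten(start_dim=1)   # shape: (8, 2048)
print(emb.shape)
```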
Latent spaces
● High dimensional data can have an inherent low-dimensional structure that may be preferable to work with
● This low-dimensional structure can be
captured by a latent space of features
that aims to encode all meaningful
information required to represent the
original high-dimensional data.
● In particular, the low-dimensional latent space is said to be embedded in the high-dimensional input space of data
Why are latent spaces important?
● Latent spaces contain the compressed representations (sometimes also called latents) of the original inputs. Latent vectors
○ are embeddings which tend to be closer together for items that are semantically “similar” to each other in their original high-dimensional space
○ are a very important concept in the theory of generative modeling, and we will revisit them when discussing models such as VAEs, GANs, LDMs, etc.
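A tiny sketch of the "nearby latents for similar items" idea, using cosine similarity between made-up embedding vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 4-d latents: two "cat-like" inputs and one "car-like" input.
cat_a = np.array([0.9, 0.1, 0.0, 0.2])
cat_b = np.array([0.8, 0.2, 0.1, 0.1])
car   = np.array([0.0, 0.1, 0.9, 0.7])

print(cosine_similarity(cat_a, cat_b))  # close to 1: semantically similar
print(cosine_similarity(cat_a, car))    # much smaller: semantically different
```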
Back to transfer learning
Why are pretrained networks useful?
● The most useful pretrained networks tend to be the ones trained on massive
datasets: lots of diverse data ⇒ better and generalizable representations
○ These are particularly useful when being leveraged for tasks in low-data regimes, where there isn’t
enough data for a model to develop these representations on its own
○ Someone has already done the hard work on training on large datasets for us… let’s just build off of
that to improve our own pipelines
● Popular pretrained networks are available off-the-shelf. For example,
○ Parameters for vision models trained on ImageNet can be found online
○ LLMs are pre-trained networks and some of them (like Meta's LLaMA) are even open source!
○ Various word2vec or GloVe embeddings are available through certain Python libraries
Why are pretrained networks useful?
● Embeddings can be pre-computed and stored in memory instead of the original
high-dimensional data
○ This can free up a lot of space
■ A 256 x 256 x 3 image contains 196,608 pixel values. Suppose each value occupies 1 byte of memory — so this image effectively takes up 192 kB of space.
■ What if we can get a 2048 dimensional embedding of this image? Each element in the embedding vector will be a 4 byte floating point number — this embedding only takes up 8192 bytes = 8 kB of space.
■ This is an almost 96% reduction in the memory requirement for storing this dataset!
○ Lower-dimensional embeddings (which already capture most of the low-level features of the
original data) can also speed up training — think about the computation time of passing a
~200k-dimensional image through a large neural network vs passing a ~2k-dimensional embedding
into a small neural network
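The arithmetic above, checked in a few lines of Python (taking 1 kB = 1024 bytes):

```python
image_bytes = 256 * 256 * 3 * 1   # one byte per pixel value -> 196,608 bytes
embedding_bytes = 2048 * 4        # 2048 float32 values      -> 8,192 bytes
print(image_bytes / 1024, embedding_bytes / 1024)  # 192.0 kB vs 8.0 kB
print(1 - embedding_bytes / image_bytes)           # ~0.958, i.e. almost a 96% reduction
```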
Transfer learning Recap
● Transfer learning in a nutshell: yoinking the work of other people, who have already done the hard part of training big models on huge datasets, to aid your own model.
● While most of the examples today focused on CV, everything can generally be
applied to any kind of neural network. In fact, transfer learning is huge in NLP!

[Diagram: Dataset 1 → learning system (task 1); the knowledge it gains is transferred to the learning system for task 2, which trains on Dataset 2]
Unsupervised Learning
Supervised vs unsupervised learning
● Supervised learning — the training process receives supervision from labels
○ Ex: classification, regression, object detection, etc.
● Unsupervised learning — the training process receives no supervision from labels
○ Ex: dimensionality reduction (PCA, t-SNE), clustering, generative modeling, etc.
Unsupervised representation learning
● Models trained on large labelled datasets learn pretty useful features in general,
but can we also learn them from completely unlabeled data?
● Imagine scraping a giant collection of images, text, audio samples, etc. from the
web without having to manually label them. Consider the implications of this:
○ ImageNet only has a million images but there are billions of images online
○ Curated text datasets (like the English Wikipedia) may have billions of tokens but there are trillions
of words on the internet
● Moreover, labeling data for good supervision can be extremely difficult, tedious,
expensive and time-consuming
○ Ex. medical data, legal data, etc.
Self-Supervised Learning
● Self-supervised learning — the ML model first generates labels out of raw data and then trains in a supervised manner
● “The general technique of self-supervised learning is to predict any unobserved or
hidden part of the input from any observed or unhidden part of the input”
Some terminology
● Recall how we have been referring to a “task 1” and “task 2” during our discussion
of transfer learning so far. Turns out, they both have a special name:
○ Pretext task — the task on which a self-supervised pretrained network is first trained
○ Downstream tasks — the tasks which then leverage the pretrained representations in some way.
Some examples in different fields of ML include:
■ Computer Vision – classification, object detection, segmentation, etc.
■ NLP – classification, translation, summarization, question-answering, etc.
■ RL – online model-free fine-tuning, etc.
Autoencoder (the most basic unsupervised model)
● Pretext task — learn a “compressed” representation of an input (does not have to
be an image, but it’s one in the example below for illustrative purposes)
● The bottleneck is a hidden layer with very few neurons — if the input can be
reconstructed from the output of the bottleneck layer, it must have captured
enough “important” information about the original input in a smaller vector
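A minimal PyTorch autoencoder sketch (the layer sizes and the flattened 28x28 input are illustrative choices, not from the slides): the reconstruction error is the training signal, so no labels are needed.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, bottleneck_dim=32):
        super().__init__()
        # Encoder: squeeze the input through a narrow bottleneck.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, bottleneck_dim),
        )
        # Decoder: reconstruct the original input from the bottleneck code.
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)          # compressed representation
        return self.decoder(z)

model = Autoencoder()
x = torch.randn(16, 784)                      # stand-in batch of flattened images
loss = nn.functional.mse_loss(model(x), x)    # the "labels" are the inputs themselves
loss.backward()
```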
RotNet
● The pretext task is to predict the
rotation angle of a rotated image
○ 4-way classification between
0, 90, 180 and 270 degrees
● Model learns about the
relationships between high level
features in an image (ex. location,
type and poses of objects), instead
of just low-level patterns
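A sketch of how RotNet-style pretext labels can be generated for free from unlabeled images (the classifier network itself is omitted); the helper below is hypothetical.

```python
import torch

def make_rotation_batch(images: torch.Tensor):
    """images: (N, C, H, W). Returns rotated images and their pretext labels 0-3."""
    labels = torch.randint(0, 4, (images.shape[0],))
    rotated = torch.stack([
        torch.rot90(img, k=int(k), dims=(1, 2))   # rotate by k * 90 degrees
        for img, k in zip(images, labels)
    ])
    return rotated, labels   # train a 4-way classifier on (rotated, labels)

x = torch.randn(8, 3, 32, 32)        # stand-in for a batch of unlabeled images
rot_x, rot_y = make_rotation_batch(x)
```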
Jigsaw
● Pretext task – sample 9 patches from a 3 by 3 square grid, shuffle them and then predict their original order
○ Shuffling order sampled from a predefined set instead of all 9! permutations
● Model learns that images are made up of “parts” that are related to each other!
Word2Vec
● Ways to generate embeddings that map discrete words to vectors
● Multiple ways to learn word2vec embeddings, two of which include:
○ Continuous Bag of Words — predict a word from the surrounding (context) words
■ Ex: predict bit in “The dog bit the man” from dog and the.
○ Skip-Gram — predict the surrounding context from a word
■ Ex: predict dog and the in “The dog bit the man” from bit
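A small sketch of how skip-gram training pairs could be generated from a sentence; the window size and whitespace tokenization are simplifying assumptions, and CBOW would simply use the reversed pairs (context predicting the center word).

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) training pairs for skip-gram."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("the dog bit the man".split()))
# pairs include ('bit', 'the'), ('bit', 'dog'), ('bit', 'the'), ('bit', 'man'), ...
```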
Masked Language Models
● Common word2vec approaches involve predicting words from only their
immediately surrounding context. What if we take into consideration the entire
sentence a word belongs to?
● Further, what if you predict multiple words within that sentence simultaneously?
● We say that these words have been masked and the pretext task here is to unmask
them using the remaining words as context
BERT (2018)
● Input tokens (words) are
randomly masked with 15%
probability — the model
must predict these
○ Gives word level
representations
● Pairs of sentences are passed in together — the model must predict whether the second sentence actually follows the first (next sentence prediction)
○ Gives sentence level representations
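A sketch of the masked-language-model data preparation only (the transformer is omitted, and real BERT also sometimes keeps or randomly replaces the chosen tokens; this simplified version always substitutes a [MASK] symbol):

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, targets); targets are None for unmasked positions."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(tok)       # the model must predict this token
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets

print(mask_tokens("the dog bit the man and ran away".split(), seed=0))
```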
Lecture Attendance

http://tinyurl.com/fa24-dl4cv
Contributors
● Aryan Jain
