Week 3
deeplearning.ai
Overview
deeplearning.ai
Week 3
Week 3 topics: Question Answering, Transfer learning, BERT, T5.
Question Answering comes in two forms, each with its own model:
● Context-based: the answer is extracted from a given passage.
● Closed book: the model answers without being given a passage.
Not just the model
Classical training: data goes into a model for training, and data (e.g. a course review) goes into the trained model for inference.
Transfer Learning: Different Tasks
A pre-trained model is trained again on a “downstream” task (e.g. course reviews) before being used for inference.
Pre-Training
Pre-training task, Sentiment Classification: “Watching the movie is like ...” → Model.
Training on the downstream task, Question Answering:
“When’s my birthday?” → “Umm…”
“When is Pi Day?” → “March 14!”
BERT: Bi-directional Context
“Learning from deeplearning.ai is like watching the sunset with my best friend!”
● Uni-directional: only the context on one side of a word is used.
● Bi-directional: context from both sides of the word is used.
T5: Single task vs. Multi task
● Single task: one model per task (Model 1, Model 2), each given inputs like “Studying with deeplearning.ai was ...”.
● Multi task: a single model handles all of the tasks.
T5: more data, better performance
● English Wikipedia: ~13 GB
● C4 (Colossal Clean Crawled Corpus): ~800 GB
Transfer Learning in NLP
deeplearning.ai
Desirable Goals
Transfer Learning!
● Improve predictions
● Small datasets
Transfer learning options
1 2
Pre-train
Transfer Model
data
Train
Model Labeled
data Feature- Unlabeled
based
3
prediction Fine- Pre-training task
tuning
Language modeling
Masked words
Next sentence
Option 1: General-purpose learning
Word embeddings: a CBOW model predicts a word from its context, e.g. “I am ____ because I am learning” → “happy”. The embeddings then serve as input “features” for downstream tasks such as translation.
Option 1: Feature-based vs. Fine-Tuning
● Feature-based: pre-train, take the learned features, and train a new model that makes the prediction.
● Fine-tuning: pre-train, then fine-tune the same model on the downstream task and use it for prediction.
Option 1: Fine-tune by adding a layer
Pre-train on plentiful data (e.g. movie reviews), add a new output layer, and fine-tune on the downstream data (e.g. course reviews).
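A minimal sketch of the two options, assuming the Hugging Face transformers library (introduced later this week); the checkpoint name and label count are illustrative, not taken from the slide:

```python
# A minimal sketch of feature-based transfer vs. fine-tuning, assuming the
# Hugging Face transformers library; checkpoint and num_labels are illustrative.
from transformers import AutoModelForSequenceClassification

# Pre-trained encoder plus a freshly added classification layer on top.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

feature_based = True  # set to False to fine-tune the whole model instead
if feature_based:
    # Feature-based: freeze the pre-trained weights; only the new layer trains.
    for param in model.base_model.parameters():
        param.requires_grad = False
# Fine-tuning: leave every parameter trainable and train end to end.
```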
Option 2: Pre-train data
Data and performance: more pre-training data generally means better model performance.
Option 2: Labeled vs. Unlabeled data
Pre-training data can be labeled or unlabeled; labeled data is scarce, while unlabeled text is abundant.
Option 3: Self-supervised pre-training task
From unlabeled data, create the inputs (features) and the targets (labels) yourself.
Option 3: Self-supervised task, example
Unlabeled data: “Learning from deeplearning.ai is like watching the sunset with my best friend.”
Input: the sentence with the last word held out. Target: “friend”.
The model predicts the target and its weights are updated. This is language modeling.
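A minimal sketch of turning that unlabeled sentence into a self-supervised (input, target) pair; the helper name is hypothetical, not from any library:

```python
# Create a self-supervised (input, target) pair as in the example above.
# make_lm_pair is a hypothetical helper for illustration only.
def make_lm_pair(sentence: str):
    tokens = sentence.rstrip(".!").split()
    # Input (features): everything except the held-out word.
    # Target (label): the word the model must predict.
    return " ".join(tokens[:-1]), tokens[-1]

text = "Learning from deeplearning.ai is like watching the sunset with my best friend."
features, label = make_lm_pair(text)
print(features)  # "Learning from deeplearning.ai is like watching the sunset with my best"
print(label)     # "friend"
```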
Fine-tune a model for each downstream task
Pre-training gives one model; training on each downstream task then gives one fine-tuned model per task.
Predicting the words “on” “the” “right” “side” one at a time:
● A bi-directional LSTM (an RNN with an encoder and a decoder) reads context from both directions to predict a word such as “right”.
● A uni-directional decoder sees only the context to its left. Why not bi-directional?
Transformer
A Transformer can be built as an encoder-decoder (with attention in both), a decoder only, or an encoder only.
“The legislators believed that they were on the _____ side of history, so they changed the law.”
Filling in that blank needs bi-directional context.
Transformer + Bi-directional Context
The encoder-only form gives bi-directional context; it can also take a pair of sentences, Sentence “A” and Sentence “B”, and ask whether B follows A.
T5: Multi-task
A single model handles many tasks on inputs such as “Studying with deeplearning.ai was ...”. How?
T5: Text-to-Text
Every task becomes text in, text out: input “Classify: Learning from deeplearning.ai is like...” → output “5 stars”. The same format covers tasks such as classification and next sentence prediction.
Bidirectional Encoder Representations from Transformers (BERT)
deeplearning.ai
Outline
● Learn about the BERT architecture
● ...
BERT
● A multi-layer bidirectional Transformer
● Positional embeddings
● BERT_base:
  ● 12 layers (12 transformer blocks)
  ● 12 attention heads
  ● 110 million parameters
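A quick way to check these numbers, assuming the Hugging Face transformers library and the "bert-base-uncased" checkpoint (the slide does not name a specific checkpoint):

```python
# Inspect a BERT_base checkpoint; library and checkpoint name are assumptions.
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
print(model.config.num_hidden_layers)    # 12 transformer blocks
print(model.config.num_attention_heads)  # 12 attention heads
print(sum(p.numel() for p in model.parameters()))  # roughly 110 million parameters
```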
BERT pre-training
Input representation: the sum of three embeddings per token.
● Token embeddings: [CLS] my dog is cute [SEP] he likes play ##ing [SEP]
● Segment embeddings: A A A A A A B B B B B
● Position embeddings: 0 1 2 3 4 5 6 7 8 9 10
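A minimal sketch of producing the token and segment ids above, assuming the Hugging Face tokenizer for "bert-base-uncased"; the position embeddings are added inside the model from positions 0, 1, 2, ...:

```python
# Tokenize a sentence pair the way BERT expects; library is an assumption.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("my dog is cute", "he likes playing")

# Wordpiece tokens, with [CLS] at the start and [SEP] after each sentence.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# Segment ids: 0 for every token of sentence A, 1 for sentence B.
print(encoded["token_type_ids"])
```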
Visualizing the output
The [CLS] position feeds the next sentence prediction (NSP) head; every token position of Masked sentence A ([CLS] Tok 1 ... Tok N [SEP]) and Masked sentence B (Tok 1 ... Tok M [SEP]) feeds the masked language model (Masked LM) head.
● [SEP]: a special separator token
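A minimal sketch of the masked-LM head in action, assuming the Hugging Face fill-mask pipeline with the "bert-base-uncased" checkpoint:

```python
# Ask BERT's masked-LM head to fill in a masked word; library is an assumption.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
candidates = unmasker(
    "Learning from deeplearning.ai is like watching the sunset with my best [MASK]."
)
for c in candidates:
    print(c["token_str"], c["score"])  # predicted words with their probabilities
```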
Summary
● BERT objective
● Model inputs/outputs
Fine-tuning BERT
deeplearning.ai
Fine-tuning BERT: Outline
Pre-train BERT once, then fine-tune it on each downstream task:
● MNLI: Hypothesis and Premise go in as Sentence A and Sentence B
● NER: Sentence A in, Tags out
● SQuAD: Question in, Answer out
Inputs summary: Sentence A / Sentence B, Hypothesis / Premise, Sentence / Entities, ⋮
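A minimal sketch of giving the same pre-trained checkpoint a different head per downstream task, assuming the Hugging Face transformers library; the checkpoint name and label counts are illustrative:

```python
# One pre-trained checkpoint, three task-specific heads; names are assumptions.
from transformers import (
    AutoModelForSequenceClassification,  # MNLI: hypothesis/premise -> label
    AutoModelForTokenClassification,     # NER: sentence -> per-token tags
    AutoModelForQuestionAnswering,       # SQuAD: question + context -> answer span
)

checkpoint = "bert-base-uncased"
mnli_model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)
ner_model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=9)
squad_model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)
```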
Transformer: T5
deeplearning.ai
Outline
● Classification
● Summarization
● Question Answering (Q&A)
● Sentiment
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, Raffel et al., 2020
Transformer - T5 Model
Pre-training: take the original text, mask out spans to form the inputs, and use the masked-out spans as the targets. (Raffel et al., 2020)
Model Architecture
Three attention patterns over inputs x1 x2 x3 x4 and targets y1 y2:
● Encoder-decoder: the encoder sees the full input; the decoder attends causally while producing y1 y2.
● Language model: a single decoder stack attends causally over the whole sequence.
● Prefix LM: fully-visible attention over the input prefix, causal attention over the targets y1 y2.
(Raffel et al., 2020)
Model Architecture
● Encoder/decoder: the form T5 uses. (Raffel et al., 2020)
Summary
● Prefix LM attention
● Model architecture
● Pre-training T5 (MLM)
Multi-task Training Strategy
deeplearning.ai
Multi-task training strategy
A task prefix in the input text tells T5 which task to perform:
“translate English to German: That is good.” → “Das ist gut”
“cola sentence: The course is jumping well.” → “not acceptable”
(Raffel et al., 2020)
Input and Output Format
● Machine translation:
  translate English to German: That is good.
● Predict entailment, contradiction, or neutral:
  mnli premise: I hate pigeons hypothesis: My feelings towards pigeons are filled with animosity. target: entailment
● Winograd schema:
  The city councilmen refused the demonstrators a permit because *they* feared violence
(Raffel et al., 2020)
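A minimal sketch of the text-to-text format in code, assuming the Hugging Face "t5-small" checkpoint (an assumption; the released T5 models all use these task prefixes):

```python
# Run a task prefix through T5; checkpoint choice is an assumption.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The prefix selects the task; the model reads text and writes text.
inputs = tokenizer("translate English to German: That is good.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # e.g. "Das ist gut."
```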
Multi-task Training Strategy
Fine-tuning methods are compared across benchmarks: GLUE, CNNDM, SQuAD, SGLUE, and translation to German (EnDe), French (EnFr), and Romanian (EnRo). (Raffel et al., 2020)
Data Training Strategies
● Examples-proportional mixing: sample from each dataset in proportion to its size (Data 1, Data 2 → Sample 1, Sample 2).
● Equal mixing: sample from each dataset with equal probability.
● Temperature-scaled mixing: interpolate between the two by raising the mixing rates to 1/T and renormalizing.
● Gradual unfreezing vs. adapter layers
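A minimal sketch of the three mixing rates, following the definitions in Raffel et al. (2020); the dataset sizes and the limit K below are illustrative numbers:

```python
# Compute multi-task mixing rates; sizes and K are made-up illustrative values.
def mixing_rates(sizes, K=2**19, temperature=1.0):
    # Examples-proportional mixing: r_m = min(e_m, K) / sum_n min(e_n, K)
    capped = [min(e, K) for e in sizes]
    rates = [c / sum(capped) for c in capped]
    # Temperature-scaled mixing: raise each rate to 1/T and renormalize.
    # T = 1 keeps proportional mixing; a large T approaches equal mixing.
    scaled = [r ** (1.0 / temperature) for r in rates]
    return [s / sum(scaled) for s in scaled]

sizes = [1_000_000, 50_000, 5_000]             # three hypothetical datasets
print(mixing_rates(sizes, temperature=1.0))    # examples-proportional (with cap K)
print(mixing_rates(sizes, temperature=100.0))  # close to equal mixing
```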
Q&A
GLUE Benchmark
deeplearning.ai
General Language Understanding Evaluation (GLUE)
● A collection used to train, evaluate, and analyze natural language understanding systems
● Datasets of different genres, sizes, and difficulties
● Leaderboard
Tasks Evaluated on
● Is a sentence grammatical or not?
● Sentiment
● Paraphrase
● Similarity
● Are two questions duplicates?
● Answerability
● Contradiction
● Entailment
● Winograd (co-reference)
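Most of these tasks can be pulled directly from the GLUE collection; a minimal sketch assuming the Hugging Face datasets library:

```python
# Load one GLUE task; "cola" is the grammatical-acceptability task above.
from datasets import load_dataset

cola = load_dataset("glue", "cola")
print(cola["train"][0])  # {'sentence': ..., 'label': 0 or 1, 'idx': ...}
```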
General Language Understanding Evaluation
● Drive research
● Model agnostic
Transformer encoder
(Figure: Inputs → Input Embedding → Positional Encoding → Multi-Head Attention → Add & Norm → Feed Forward → Add & Norm)

Feedforward block:
[
    LayerNorm,
    dense,
    activation,
    dropout_middle,
    dense,
    dropout_final,
]

Encoder block:
[
    Residual(
        LayerNorm,
        attention,
        dropout_,
    ),
    Residual(
        feed_forward,
    ),
]
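The listings above follow Trax's combinator style, so here is a minimal runnable sketch under that assumption; the layer sizes and dropout rate are illustrative values, not taken from the slides:

```python
# A minimal Trax sketch of the feedforward and encoder blocks listed above;
# d_model, d_ff, n_heads and dropout are illustrative.
from trax import layers as tl

def FeedForward(d_model=512, d_ff=2048, dropout=0.1, mode='train'):
    # LayerNorm -> dense -> activation -> dropout_middle -> dense -> dropout_final
    return [
        tl.LayerNorm(),
        tl.Dense(d_ff),
        tl.Relu(),
        tl.Dropout(rate=dropout, mode=mode),
        tl.Dense(d_model),
        tl.Dropout(rate=dropout, mode=mode),
    ]

def EncoderBlock(d_model=512, d_ff=2048, n_heads=8, dropout=0.1, mode='train'):
    # Residual(LayerNorm, attention, dropout) followed by Residual(feed_forward)
    return [
        tl.Residual(
            tl.LayerNorm(),
            tl.Attention(d_model, n_heads=n_heads, dropout=dropout, mode=mode),
            tl.Dropout(rate=dropout, mode=mode),
        ),
        tl.Residual(
            FeedForward(d_model, d_ff, dropout, mode),
        ),
    ]
```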
Context: Since the end of the Second World War, France has become an ethnically diverse country. Today, approximately five percent of the French population is non-European and non-white. This does not approach the number of non-white citizens in the United States (roughly 28–37%, depending on how Latinos are classified; see Demographics of the United States). Nevertheless, it amounts to at least three million people, and has forced the issues of ethnic diversity onto the French policy agenda. France has developed an approach to dealing with ethnic problems that stands in contrast to that of many advanced, industrialized countries. Unlike the United States, Britain, or even the Netherlands, France maintains a "color-blind" model of public policy. This means that it targets virtually no policies directly at racial or ethnic groups. Instead, it uses geographic or class criteria to address issues of social inequalities. It has, however, developed an extensive anti-racist policy repertoire since the early 1970s. Until recently, French policies focused primarily on issues of hate speech, going much further than their American counterparts, and relatively less on issues of discrimination in jobs, housing, and in provision of goods and services.
● Process the data to get the required inputs and outputs: "question: Q context: C" as the input and "A" as the target.
  T5 uses the same format for other tasks, e.g. "cola sentence: The course is jumping well." → "not acceptable", or "stsb sentence1: The rhino ..."
● Use the Transformers library for Q&A: Context + Questions → Answers.
Hugging Face: Fine-Tuning Transformers
● Datasets: one thousand
● Tokenizer
● Model checkpoints: more than 14 thousand
● Trainer, evaluation metrics
Checkpoint: a set of learned parameters for a model, obtained using a training procedure for some task.
The end result is human-readable output.
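A minimal sketch of loading a checkpoint and its matching tokenizer, assuming the Hugging Face transformers library; the checkpoint name below is just one of the many question-answering checkpoints on the Hub:

```python
# Load a checkpoint and its tokenizer; checkpoint choice is an assumption.
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

checkpoint = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)              # text -> token ids
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)  # learned parameters
```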
Hugging Face: Using Transformers
deeplearning.ai
Using Transformers
Pipelines: 1. Pre-processing your inputs, then running the model and turning its output into a human-readable result.
Q&A pipeline: Context + Questions → Answers.
Pipeline initialization: choose a task (from the supported tasks) and a model checkpoint; the matching tokenizer comes with it.
● Model checkpoints: more than 14 thousand
● Trainer, evaluation metrics
● Datasets: one thousand; load them using just one function
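A minimal sketch of a question-answering pipeline, assuming the Hugging Face transformers library; the checkpoint, question, and context strings are illustrative:

```python
# Build and call a question-answering pipeline; inputs are illustrative.
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
result = qa(
    question="What model of public policy does France maintain?",
    context="Unlike the United States, Britain, or even the Netherlands, "
            "France maintains a 'color-blind' model of public policy.",
)
print(result["answer"], result["score"])  # human-readable answer plus a confidence score
```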
Fine-tuning settings passed to the Trainer include:
● Number of epochs
● Warm-up steps
● Weight decay
● ...
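A minimal sketch of wiring these settings into the Hugging Face Trainer; model, train_dataset, eval_dataset, and compute_metrics are assumed to be defined elsewhere (e.g. a checkpoint and a dataset loaded as shown above):

```python
# Pass the fine-tuning settings above to the Trainer; surrounding objects
# (model, datasets, metrics function) are assumed to exist already.
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="finetuned-model",
    num_train_epochs=3,   # number of epochs
    warmup_steps=500,     # warm-up steps for the learning-rate schedule
    weight_decay=0.01,    # weight decay
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,  # evaluation metrics
)
trainer.train()
```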