
Natural Language Processing

with Deep Learning


CS224N/Ling284

John Hewitt
Lecture 9: Pretraining
Adapted from slides by Anna Goldie, John Hewitt
Lecture Plan
1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
1. Encoders
2. Encoder-Decoders
3. Decoders
4. Interlude: what do we think pretraining is teaching?
5. Very large models and in-context learning

Reminders:
Assignment 5 is out on Thursday! It covers lecture 8 and lecture 9 (today)!
It has ~pedagogically relevant math~
2
Word structure and subword models
Let’s take a look at the assumptions we’ve made about a language’s vocabulary.

We assume a fixed vocab of tens of thousands of words, built from the training set.
All novel words seen at test time are mapped to a single UNK.

word → vocab mapping (each index has a learned embedding):

    Common words:   hat            → pizza (index)
                    learn          → tasty (index)
    Variations:     taaaaasty      → UNK (index)
    Misspellings:   laern          → UNK (index)
    Novel items:    Transformerify → UNK (index)

3
Word structure and subword models
Finite vocabulary assumptions make even less sense in many languages.
• Many languages exhibit complex morphology, or word structure.
• The effect is more word types, each occurring fewer times.

Example: Swahili verbs can have hundreds of conjugations, each encoding a wide variety of information (tense, mood, definiteness, negation, information about the object, and more).

Here's a small fraction of the conjugations for ambia – to tell.

[Figure: a table of conjugated forms of ambia, from Wiktionary.]

4 [Wiktionary]
The byte-pair encoding algorithm
Subword modeling in NLP encompasses a wide range of methods for reasoning about
structure below the word level. (Parts of words, characters, bytes.)
• The dominant modern paradigm is to learn a vocabulary of parts of words (subword tokens).
• At training and testing time, each word is split into a sequence of known subwords.

Byte-pair encoding is a simple, effective strategy for defining a subword vocabulary.


1. Start with a vocabulary containing only characters and an “end-of-word” symbol.
2. Using a corpus of text, find the most common adjacent characters “a,b”; add “ab” as a subword.
3. Replace instances of the character pair with the new subword; repeat until desired vocab size.

Originally used in NLP for machine translation; now a similar method (WordPiece) is used in pretrained
models.

5 [Sennrich et al., 2016, Wu et al., 2016]
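To make the algorithm concrete, here is a minimal sketch of BPE vocabulary learning (illustrative only; real tokenizers, like the original Sennrich et al. release, handle pair-count updates and tokenization of new text more carefully):

    from collections import Counter

    def learn_bpe(word_freqs, num_merges):
        # word_freqs: dict mapping each corpus word to its frequency.
        # Start from characters plus an end-of-word symbol.
        vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
        merges = []
        for _ in range(num_merges):
            # Count adjacent symbol pairs, weighted by word frequency.
            pairs = Counter()
            for symbols, freq in vocab.items():
                for a, b in zip(symbols, symbols[1:]):
                    pairs[a, b] += freq
            if not pairs:
                break
            best = max(pairs, key=pairs.get)   # most frequent adjacent pair
            merges.append(best)
            # Replace every occurrence of the pair with the merged subword.
            new_vocab = {}
            for symbols, freq in vocab.items():
                merged, i = [], 0
                while i < len(symbols):
                    if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                        merged.append(symbols[i] + symbols[i + 1])
                        i += 2
                    else:
                        merged.append(symbols[i])
                        i += 1
                new_vocab[tuple(merged)] = freq
            vocab = new_vocab
        return merges

    # Toy corpus: after a few merges, frequent fragments like "est</w>" emerge.
    print(learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 10))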


Word structure and subword models
Common words end up being a part of the subword vocabulary, while rarer words are split
into (sometimes intuitive, sometimes not) components.

In the worst case, words are split into as many subwords as they have characters.

word → vocab mapping (each subword has a learned embedding):

    Common words:   hat            → hat
                    learn          → learn
    Variations:     taaaaasty      → taa## aaa## sty
    Misspellings:   laern          → la## ern##
    Novel items:    Transformerify → Transformer## ify

6
Outline
1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
1. Encoders
2. Encoder-Decoders
3. Decoders
4. What do we think pretraining is teaching?

7
Motivating word meaning and context
Recall the adage we mentioned at the beginning of the course:

“You shall know a word by the company it keeps” (J. R. Firth 1957: 11)

This quote is a summary of distributional semantics, and motivated word2vec. But:

“… the complete meaning of a word is always contextual,


and no study of meaning apart from a complete context
can be taken seriously.” (J. R. Firth 1935)

Consider I record the record: the two instances of record mean different things.

8 [Thanks to Yoav Goldberg on Twitter for pointing out the 1935 Firth quote.]
Where we were: pretrained word embeddings
Circa 2017:
• Start with pretrained word embeddings (no context!)
• Learn how to incorporate context in an LSTM or Transformer while training on the task.

Some issues to think about:
• The training data we have for our downstream task (like question answering) must be sufficient to teach all contextual aspects of language.
• Most of the parameters in our network are randomly initialized!

[Figure: a network predicting ŷ for "... the movie was ...": only the word embeddings at the bottom are pretrained; the LSTM/Transformer above them is not. Recall that "movie" gets the same word embedding no matter what sentence it shows up in.]
9
Where we're going: pretraining whole models
In modern NLP:
• All (or almost all) parameters in NLP networks are initialized via pretraining.
• Pretraining methods hide parts of the input from the model, and train the model to reconstruct those parts.
• This has been exceptionally effective at building strong:
  • representations of language
  • parameter initializations for strong NLP models
  • probability distributions over language that we can sample from

[Figure: a network predicting ŷ for "... the movie was ...", with all parameters pretrained jointly. This model has learned how to represent entire sentences through pretraining.]
10
What can we learn from reconstructing the input?

Stanford University is located in __________, California.

11
What can we learn from reconstructing the input?

I put ___ fork down on the table.

12
What can we learn from reconstructing the input?

The woman walked across the street,


checking for traffic over ___ shoulder.

13
What can we learn from reconstructing the input?

I went to the ocean to see the fish, turtles, seals, and _____.

14
What can we learn from reconstructing the input?

Overall, the value I got from the two hours watching


it was the sum total of the popcorn and the drink.
The movie was ___.

15
What can we learn from reconstructing the input?

Iroh went into the kitchen to make some tea.


Standing next to Iroh, Zuko pondered his destiny.
Zuko left the ______.

16
What can we learn from reconstructing the input?

I was thinking about the sequence that goes


1, 1, 2, 3, 5, 8, 13, 21, ____

17
Pretraining through language modeling [Dai and Le, 2015]
Recall the language modeling task:
• Model $p_\theta(w_t \mid w_{1:t-1})$, the probability distribution over words given their past contexts.
• There's lots of data for this! (In English.)

Pretraining through language modeling:
• Train a neural network to perform language modeling on a large amount of text.
• Save the network parameters.

[Figure: a Decoder (Transformer, LSTM, ++) reads "Iroh goes to make tasty tea" and predicts "goes to make tasty tea END".]
18
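As a minimal sketch (in PyTorch; `model` stands for any decoder that maps token ids to per-position vocabulary logits, and all names here are illustrative), the pretraining objective is just next-token cross-entropy:

    import torch.nn.functional as F

    def lm_loss(model, tokens):
        # tokens: (batch, seq_len) ids of a chunk of pretraining text.
        inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict each next token
        logits = model(inputs)                            # (batch, seq_len-1, vocab)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))

    # Pretraining = minimize this loss over a large corpus, then save the parameters.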
The Pretraining / Finetuning Paradigm
Pretraining can improve NLP applications by serving as parameter initialization.

Step 1: Pretrain (on language modeling). Lots of text; learn general things!
Step 2: Finetune (on your task). Not many labels; adapt to the task!

[Figure: the same network (Transformer, LSTM, ++) is first trained to predict "goes to make tasty tea END" from "Iroh goes to make tasty tea", then finetuned to predict ☺/☹ from "... the movie was ...".]
19
Stochastic gradient descent and pretrain/finetune
Why should pretraining and finetuning help, from a “training neural nets” perspective?

• Pretraining provides parameters $\hat{\theta}$ by approximating $\min_\theta \mathcal{L}_{\text{pretrain}}(\theta)$.
  • (The pretraining loss.)
• Then, finetuning approximates $\min_\theta \mathcal{L}_{\text{finetune}}(\theta)$, starting at $\hat{\theta}$.
  • (The finetuning loss.)
• The pretraining may matter because stochastic gradient descent sticks (relatively) close to $\hat{\theta}$ during finetuning.
  • So, maybe the finetuning local minima near $\hat{\theta}$ tend to generalize well!
  • And/or, maybe the gradients of the finetuning loss near $\hat{\theta}$ propagate nicely!

20
Lecture Plan
1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
1. Encoders
2. Encoder-Decoders
3. Decoders
4. What do we think pretraining is teaching?

21
Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.

Encoders:
• Gets bidirectional context – can condition on future!
• How do we train them to build strong representations?

Encoder-Decoders:
• Good parts of decoders and encoders?
• What's the best way to pretrain them?

Decoders:
• Language models! What we've seen so far.
• Nice to generate from; can't condition on future words.
22
Pretraining encoders: what pretraining objective to use?
So far, we've looked at language model pretraining. But encoders get bidirectional context, so we can't do language modeling!

Idea: replace some fraction of words in the input with a special [MASK] token; predict these words.

$h_1, \dots, h_T = \text{Encoder}(w_1, \dots, w_T)$
$y_i \sim A h_i + b$

Only add loss terms from words that are "masked out." If $\tilde{x}$ is the masked version of $x$, we're learning $p_\theta(x \mid \tilde{x})$. Called Masked LM.

[Figure: the encoder reads "I [M] to the [M]"; through parameters $A, b$ it predicts "went" and "store" at the masked positions.]

[Devlin et al., 2018]

24
BERT: Bidirectional Encoder Representations from Transformers
Devlin et al., 2018 proposed the "Masked LM" objective and released the weights of a pretrained Transformer, a model they labeled BERT.

Some more details about Masked LM for BERT:
• Predict a random 15% of (sub)word tokens.
  • Replace input word with [MASK] 80% of the time
  • Replace input word with a random token 10% of the time
  • Leave input word unchanged 10% of the time (but still predict it!)
• Why? Doesn't let the model get complacent and not build strong representations of non-masked words. (No masks are seen at fine-tuning time!)

[Figure: the Transformer Encoder reads "I pizza to the [M]" – where "pizza" is a replaced token, "to the" is not replaced, and "[M]" is masked – and predicts "went", "to", and "store" at those positions.]
25 [Devlin et al., 2018]
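A sketch of that corruption step (illustrative, not BERT's exact implementation; the [MASK] id and the -100 ignore-label convention are assumptions matching common PyTorch practice):

    import torch

    MASK_ID = 103  # assumed id for the [MASK] token

    def corrupt_for_masked_lm(tokens, vocab_size, select_prob=0.15):
        # tokens: (batch, seq_len) input ids. Returns (corrupted inputs, labels);
        # labels are -100 (ignored by cross-entropy) at unselected positions.
        labels = tokens.clone()
        selected = torch.rand(tokens.shape) < select_prob   # pick ~15% of tokens
        labels[~selected] = -100                            # only score those

        r = torch.rand(tokens.shape)
        corrupted = tokens.clone()
        corrupted[selected & (r < 0.8)] = MASK_ID           # 80%: replace with [MASK]
        swap = selected & (r >= 0.8) & (r < 0.9)            # 10%: random token
        corrupted[swap] = torch.randint(vocab_size, tokens.shape)[swap]
        # remaining 10%: leave the input unchanged (but still predict it!)
        return corrupted, labels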


BERT: Bidirectional Encoder Representations from Transformers

• The pretraining input to BERT was two separate contiguous chunks of text.
• BERT was trained to predict whether one chunk follows the other or is randomly sampled.
• Later work has argued this "next sentence prediction" is not necessary.

26 [Devlin et al., 2018, Liu et al., 2019]


BERT: Bidirectional Encoder Representations from Transformers
Details about BERT
• Two models were released:
• BERT-base: 12 layers, 768-dim hidden states, 12 attention heads, 110 million params.
• BERT-large: 24 layers, 1024-dim hidden states, 16 attention heads, 340 million params.
• Trained on:
• BooksCorpus (800 million words)
• English Wikipedia (2,500 million words)
• Pretraining is expensive and impractical on a single GPU.
• BERT was pretrained with 64 TPU chips for a total of 4 days.
• (TPUs are special tensor operation acceleration hardware)
• Finetuning is practical and common on a single GPU
• “Pretrain once, finetune many times.”

27 [Devlin et al., 2018]


BERT: Bidirectional Encoder Representations from Transformers
BERT was massively popular and hugely versatile; finetuning BERT led to new state-of-
the-art results on a broad range of tasks.
• QQP: Quora Question Pairs (detect paraphrase questions)
• QNLI: natural language inference over question answering data
• SST-2: sentiment analysis
• CoLA: Corpus of Linguistic Acceptability (detect whether sentences are grammatical)
• STS-B: semantic textual similarity
• MRPC: Microsoft Research Paraphrase Corpus
• RTE: a small natural language inference corpus
28 [Devlin et al., 2018]


Limitations of pretrained encoders
Those results looked great! Why not use pretrained encoders for everything?

If your task involves generating sequences, consider using a pretrained decoder; BERT and other
pretrained encoders don’t naturally lead to nice autoregressive (1-word-at-a-time) generation
methods.

[Figure: a Pretrained Encoder predicts make/brew/craft for "[MASK]" in "Iroh goes to [MASK] tasty tea", while a Pretrained Decoder continues "Iroh goes to make tasty tea" left-to-right, producing "goes to make tasty tea END".]

29
Extensions of BERT
You’ll see a lot of BERT variants like RoBERTa, SpanBERT, +++
Some generally accepted improvements to the BERT pretraining formula:
• RoBERTa: mainly just train BERT for longer and remove next sentence prediction!
• SpanBERT: masking contiguous spans of words makes a harder, more useful pretraining task

[Figure: for the sentence "It's irresistibly good", BERT masks scattered subwords ([MASK] irr## esi## sti## [MASK] good), while SpanBERT masks a contiguous span (It' [MASK] [MASK] [MASK] [MASK] good) and predicts the whole removed span.]

30 [Liu et al., 2019; Joshi et al., 2020]


Extensions of BERT
A takeaway from the RoBERTa paper: more compute, more data can improve pretraining
even when not changing the underlying Transformer encoder.

31 [Liu et al., 2019; Joshi et al., 2020]


Full Finetuning vs. Parameter-Efficient Finetuning
Finetuning every parameter in a pretrained model works well, but is memory-intensive. Lightweight finetuning methods instead adapt pretrained models in a constrained way, which leads to less overfitting and/or more efficient finetuning and inference.

Full Finetuning: adapt all parameters.
Lightweight Finetuning: train a few existing or new parameters.

[Figure: the same network (Transformer, LSTM, ++) reads "... the movie was ..." and predicts ☺/☹ in both settings; full finetuning updates every parameter, lightweight finetuning only a small subset.]

32
Parameter-Efficient Finetuning: Prefix-Tuning, Prompt tuning
Prefix-Tuning adds a prefix of parameters, and freezes all pretrained parameters.
The prefix is processed by the model just like real words would be.
Advantage: each element of a batch at inference could run a different tuned model.

[Figure: learnable prefix parameters are prepended to "... the movie was ..." and processed by the frozen network (Transformer, LSTM, ++), which predicts ☺/☹.]
33 [Li and Liang, 2021; Lester et al., 2021]
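A minimal sketch of the idea, assuming a backbone whose forward pass accepts a sequence of embeddings (every name here is illustrative):

    import torch
    import torch.nn as nn

    class PrefixTuned(nn.Module):
        def __init__(self, backbone, hidden_dim, prefix_len=10):
            super().__init__()
            self.backbone = backbone
            for p in self.backbone.parameters():
                p.requires_grad = False        # freeze all pretrained parameters
            # The only trainable parameters: embeddings of "virtual" prefix tokens.
            self.prefix = nn.Parameter(0.01 * torch.randn(prefix_len, hidden_dim))

        def forward(self, token_embs):
            # token_embs: (batch, seq_len, hidden_dim) embeddings of real words.
            prefix = self.prefix.unsqueeze(0).expand(token_embs.size(0), -1, -1)
            # The prefix is processed by the model just like real words would be.
            return self.backbone(torch.cat([prefix, token_embs], dim=1))

Since the backbone is frozen, serving many tasks only requires swapping small prefix tensors, which is what makes running a different tuned model per batch element possible.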
Parameter-Efficient Finetuning: Low-Rank Adaptation
Low-Rank Adaptation learns a low-rank "diff" between the pretrained and finetuned weight matrices.
Easier to learn than prefix-tuning.

For each pretrained weight matrix $W \in \mathbb{R}^{d \times d}$, learn low-rank factors $A \in \mathbb{R}^{d \times k}$ and $B \in \mathbb{R}^{k \times d}$ (with $k \ll d$), and use $W + AB$ in its place; only $A$ and $B$ are trained.

[Figure: the network (Transformer, LSTM, ++) reads "... the movie was ..." and predicts ☺/☹, with each adapted weight matrix computed as $W + AB$.]
34 [Hu et al., 2021]
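A sketch of a LoRA-style layer under the slide's shapes (square $W$, rank $k \ll d$; initializing $B$ to zero makes the layer start out exactly equal to the pretrained one):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, W, k=8):
            super().__init__()
            d = W.size(0)
            self.W = nn.Parameter(W, requires_grad=False)     # frozen pretrained W
            self.A = nn.Parameter(0.01 * torch.randn(d, k))   # trainable, d x k
            self.B = nn.Parameter(torch.zeros(k, d))          # trainable, k x d

        def forward(self, x):
            # Effective weight is W + AB; only A and B receive gradients,
            # i.e. 2dk trainable numbers instead of d*d.
            return x @ (self.W + self.A @ self.B)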


Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.

Encoders:
• Gets bidirectional context – can condition on future!
• How do we train them to build strong representations?

Encoder-Decoders:
• Good parts of decoders and encoders?
• What's the best way to pretrain them?

Decoders:
• Language models! What we've seen so far.
• Nice to generate from; can't condition on future words.
35
Pretraining encoder-decoders: what pretraining objective to use?
For encoder-decoders, we could do something like language modeling, but where a prefix of every input is provided to the encoder and is not predicted:

$h_1, \dots, h_T = \text{Encoder}(w_1, \dots, w_T)$
$h_{T+1}, \dots, h_{2T} = \text{Decoder}(w_{T+1}, \dots, w_{2T},\ h_1, \dots, h_T)$
$y_i \sim A h_i + b,\quad i > T$

The encoder portion benefits from bidirectional context; the decoder portion is used to train the whole model through language modeling.

[Figure: the encoder reads $w_1, \dots, w_T$; the decoder reads $w_{T+1}, \dots, w_{2T}$ and predicts $w_{T+2}, \dots$.]

[Raffel et al., 2019]
36
Pretraining encoder-decoders: what pretraining objective to use?
What Raffel et al., 2019 found to work best was span corruption. Their model: T5.

Replace different-length spans from the input with unique placeholders; decode out the spans that were removed!

This is implemented in text preprocessing: it's still an objective that looks like language modeling at the decoder side.

37
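A sketch of that preprocessing, assuming T5-style <extra_id_N> sentinel strings (the span positions below are illustrative):

    def corrupt_spans(tokens, spans):
        # tokens: list of (sub)word strings; spans: sorted, non-overlapping
        # (start, end) index pairs marking the spans to remove.
        inp, target, cursor = [], [], 0
        for i, (start, end) in enumerate(spans):
            sentinel = f"<extra_id_{i}>"            # unique placeholder per span
            inp += tokens[cursor:start] + [sentinel]
            target += [sentinel] + tokens[start:end]
            cursor = end
        inp += tokens[cursor:]
        target.append(f"<extra_id_{len(spans)}>")   # final sentinel ends the target
        return inp, target

    words = "Thank you for inviting me to your party last week .".split()
    inp, tgt = corrupt_spans(words, [(2, 4), (8, 9)])
    # inp: Thank you <extra_id_0> me to your party <extra_id_1> week .
    # tgt: <extra_id_0> for inviting <extra_id_1> last <extra_id_2>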
Pretraining encoder-decoders: what pretraining objective to use?
Raffel et al., 2019 found encoder-decoders to work better than decoders for their tasks, and span corruption (denoising) to work better than language modeling.

Pretraining encoder-decoders: what pretraining objective to use?

A fascinating property of T5: it can be finetuned to answer a wide range of questions, retrieving knowledge from its parameters.

[Figure: open-domain question answering accuracy for T5 models of 220 million, 770 million, 3 billion, and 11 billion parameters, on NQ (Natural Questions), WQ (WebQuestions), and TQA (TriviaQA), all "open-domain" versions.]

[Raffel et al., 2019]
Pretraining for three types of architectures
The neural architecture influences the type of pretraining, and natural use cases.

Encoders:
• Gets bidirectional context – can condition on future!
• How do we train them to build strong representations?

Encoder-Decoders:
• Good parts of decoders and encoders?
• What's the best way to pretrain them?

Decoders:
• Language models! What we've seen so far.
• Nice to generate from; can't condition on future words.
• All the biggest pretrained models are Decoders.
40
Pretraining decoders
When using language-model-pretrained decoders, we can ignore that they were trained to model $p_\theta(w_t \mid w_{1:t-1})$.

We can finetune them by training a classifier on the last word's hidden state:

$h_1, \dots, h_T = \text{Decoder}(w_1, \dots, w_T)$
$y \sim A h_T + b$

where $A$ and $b$ are randomly initialized and specified by the downstream task.

Gradients backpropagate through the whole network.

[Figure: a linear layer $A, b$ (not pretrained; learned from scratch) maps the last hidden state $h_T$ to a prediction ☺/☹ for input $w_1, \dots, w_T$.]

41
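In code, the finetuning setup might look like this sketch (the decoder is assumed to return one hidden state per position; names are illustrative):

    import torch.nn as nn

    class DecoderClassifier(nn.Module):
        def __init__(self, pretrained_decoder, hidden_dim, num_classes):
            super().__init__()
            self.decoder = pretrained_decoder              # pretrained parameters
            self.head = nn.Linear(hidden_dim, num_classes) # A, b: random init

        def forward(self, tokens):
            h = self.decoder(tokens)    # (batch, T, hidden_dim)
            # Classify from the last word's hidden state h_T; gradients flow
            # into the new head AND back through the whole pretrained decoder.
            return self.head(h[:, -1])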
Pretraining decoders
It's natural to pretrain decoders as language models and then use them as generators, finetuning their $p_\theta(w_t \mid w_{1:t-1})$!

This is helpful in tasks where the output is a sequence with a vocabulary like that at pretraining time!
• Dialogue (context = dialogue history)
• Summarization (context = document)

$h_1, \dots, h_T = \text{Decoder}(w_1, \dots, w_T)$
$w_t \sim A h_{t-1} + b$

where $A, b$ were pretrained in the language model!

[Figure: the decoder reads $w_1, \dots, w_5$ and predicts $w_2, \dots, w_6$; note that here the output layer $A, b$ has been pretrained.]
42
Generative Pretrained Transformer (GPT) [Radford et al., 2018]
2018’s GPT was a big success in pretraining a decoder!
• Transformer decoder with 12 layers, 117M parameters.
• 768-dimensional hidden states, 3072-dimensional feed-forward hidden layers.
• Byte-pair encoding with 40,000 merges
• Trained on BooksCorpus: over 7000 unique books.
• Contains long spans of contiguous text, for learning long-distance dependencies.
• The acronym “GPT” never showed up in the original paper; it could stand for
“Generative PreTraining” or “Generative Pretrained Transformer”

43


Generative Pretrained Transformer (GPT) [Radford et al., 2018]
How do we format inputs to our decoder for finetuning tasks?

Natural Language Inference: label pairs of sentences as entailing/contradictory/neutral.

Premise: The man is in the doorway
Hypothesis: The person is near the door
→ entailment

Radford et al., 2018 evaluate on natural language inference.


Here’s roughly how the input was formatted, as a sequence of tokens for the decoder.

[START] The man is in the doorway [DELIM] The person is near the door [EXTRACT]

The linear classifier is applied to the representation of the [EXTRACT] token.

44
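As a sketch, the formatting is just token-sequence concatenation (these special-token strings stand in for whatever ids the tokenizer actually assigns):

    def format_nli_input(premise, hypothesis):
        # Build the decoder's input sequence; the classifier is applied to the
        # hidden state at the final [EXTRACT] position.
        return (["[START]"] + premise.split()
                + ["[DELIM]"] + hypothesis.split() + ["[EXTRACT]"])

    print(format_nli_input("The man is in the doorway",
                           "The person is near the door"))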
Generative Pretrained Transformer (GPT) [Radford et al., 2018]
GPT results on various natural language inference datasets.

45
Increasingly convincing generations (GPT-2) [Radford et al., 2019]
We mentioned how pretrained decoders can be used in their capacities as language models.
GPT-2, a larger version (1.5B) of GPT trained on more data, was shown to produce relatively
convincing samples of natural language.
GPT-3, In-context learning, and very large models
So far, we’ve interacted with pretrained models in two ways:
• Sample from the distributions they define (maybe providing a prompt)
• Fine-tune them on a task we care about, and take their predictions.

Very large language models seem to perform some kind of learning without gradient
steps simply from examples you provide within their contexts.

GPT-3 is the canonical example of this. The largest T5 model had 11 billion parameters.
GPT-3 has 175 billion parameters.

47
GPT-3, In-context learning, and very large models
Very large language models seem to perform some kind of learning without gradient
steps simply from examples you provide within their contexts.

The in-context examples seem to specify the task to be performed, and the conditional
distribution mocks performing the task to a certain extent.
Input (prefix within a single Transformer decoder context):
“ thanks -> merci
hello -> bonjour
mint -> menthe
otter -> ”
Output (conditional generations):
loutre …
48
GPT-3, In-context learning, and very large models
Very large language models seem to perform some kind of learning without gradient
steps simply from examples you provide within their contexts.

49
Scaling Efficiency: how do we best use our compute?
GPT-3 was 175B parameters and trained on 300B tokens of text.
Roughly, the cost of training a large transformer scales as parameters × tokens.
Did OpenAI strike the right parameter–token tradeoff to get the best model? No.

[Figure: a comparison in which a 70B parameter model, trained on more tokens, is better than the much larger other models!]
50
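As a rough worked example (the widely used 6-FLOPs-per-parameter-per-token rule of thumb is an assumption here, not something stated on the slide):

    params, tokens = 175e9, 300e9   # GPT-3's parameter and training-token counts
    flops = 6 * params * tokens     # ~6 FLOPs per parameter per token
    print(f"{flops:.2e}")           # ≈ 3.15e+23 FLOPs for one training run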
The prefix as task specification and scratch pad: chain-of-thought

[Figure: a chain-of-thought prompting example, in which worked-out reasoning steps included in the prompt lead the model to generate its own intermediate reasoning before its final answer.]

51
[Wei et al., 2023]
Outline
1. A brief note on subword modeling
2. Motivating model pretraining from word embeddings
3. Model pretraining three ways
1. Encoders
2. Encoder-Decoders
3. Decoders
4. What do we think pretraining is teaching?

52
What kinds of things does pretraining teach?
There’s increasing evidence that pretrained models learn a wide variety of things about
the statistical properties of language. Taking our examples from the start of class:
• Stanford University is located in __________, California. [Trivia]
• I put ___ fork down on the table. [syntax]
• The woman walked across the street, checking for traffic over ___ shoulder. [coreference]
• I went to the ocean to see the fish, turtles, seals, and _____. [lexical semantics/topic]
• Overall, the value I got from the two hours watching it was the sum total of the popcorn
and the drink. The movie was ___. [sentiment]
• Iroh went into the kitchen to make some tea. Standing next to Iroh, Zuko pondered his
destiny. Zuko left the ______. [some reasoning – this is harder]
• I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, ____ [some basic arithmetic; they don't learn the Fibonacci sequence]
• Models also learn – and can exacerbate – racism, sexism, and all manner of bad biases.
• More on all this in the interpretability lecture!
53
Parting remarks
These models are still not well-understood.
“Small” models like BERT have become general tools in a wide range of settings.
More on this in later lectures!
Assignment 5 is out Thursday! Tuesday's and today's lectures are its subject matter.

54
