The document discusses pretrained transformers, focusing on the pretraining and fine-tuning paradigm, particularly in the context of BERT and its applications. It covers various aspects such as the use of masks for bidirectional context, byte-pair encoding for vocabulary management, and fine-tuning techniques for tasks like sequence classification and named entity recognition. The document emphasizes the importance of transfer learning and the ability of pretrained models to generalize from large datasets for downstream applications.


Pretrained Transformers

Pawan Goyal

CSE, IIT Kharagpur

March 15th, 2023



Pretraining through Language Modeling - General Paradigm


The Pretraining / Finetuning paradigm



Stochastic Gradient Descent and Pretrain/Finetune



Beyond ELMo: Using Transformers for Pretraining



Pretraining for three types of architectures





Pretraining Encoders

What would be the objective function?


So far, we’ve looked at language model pretraining (ELMo).
But encoders get bidirectional context, so we can’t do language modeling!



Solution: Use Masks
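The figure on the original slide is not reproduced here. As an illustration of the masking idea, here is a minimal sketch only, not BERT's exact recipe: it assumes the commonly cited 15% masking rate with 80/10/10 replacement, and the token ids, vocabulary size, and [MASK] id below are made up.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15, seed=0):
    """Return (corrupted_input, labels) for masked language modeling.

    Roughly 15% of positions are selected; of those, 80% become [MASK],
    10% become a random token, and 10% are left unchanged. Labels hold the
    original ids at selected positions and -100 (ignored) elsewhere.
    """
    rng = random.Random(seed)
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)          # -100 = position ignored by the loss
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok                   # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_id        # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return corrupted, labels

# Toy usage with made-up ids: vocabulary of 30522 tokens, [MASK] id = 103
inp, lbl = mask_tokens([2023, 2003, 1037, 7099, 6251], mask_id=103, vocab_size=30522)
print(inp, lbl)
```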



BERT: Bidirectional Encoder Representations from Transformers


BERT: another view

$y_i = \mathrm{softmax}(W_V h_i), \quad W_V \in \mathbb{R}^{|V| \times d_h}, \; h_i \in \mathbb{R}^{d_h}$


BERT: Next Sentence Prediction

$y = \mathrm{softmax}(W_{NSP}\, C), \quad W_{NSP} \in \mathbb{R}^{2 \times d_h}, \; C \in \mathbb{R}^{d_h}$


BERT: Next Sentence Prediction

Why NSP?
Masking focuses on predicting words from surrounding contexts so as to produce effective word-level representations.
Many applications require modeling the relationship between two sentences, e.g.,
▶ paraphrase detection (detecting if two sentences have similar meanings),
▶ entailment (detecting if the meanings of two sentences entail or contradict each other),
▶ discourse coherence (deciding if two neighboring sentences form a coherent discourse).

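The NSP objective itself needs paired training data. To make this concrete, here is a hedged sketch of how 50/50 "is-next" vs. "random sentence" pairs could be assembled from a document; the helper name and the toy sentences are illustrative, and BERT's actual sampling procedure differs in minor details.

```python
import random

def make_nsp_pairs(sentences, num_pairs, seed=0):
    """Build (sentence_a, sentence_b, label) triples for next sentence prediction.

    label 1 ("IsNext"): sentence_b actually follows sentence_a in the document.
    label 0 ("NotNext"): sentence_b is a randomly chosen sentence.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        i = rng.randrange(len(sentences) - 1)
        a = sentences[i]
        if rng.random() < 0.5:
            b, label = sentences[i + 1], 1        # the true next sentence
        else:
            # In a real pipeline the negative would be drawn from a different document.
            b, label = rng.choice(sentences), 0   # a random sentence
        pairs.append((a, b, label))
    return pairs

doc = ["The man went to the store.",
       "He bought a gallon of milk.",
       "Penguins are flightless birds."]
for a, b, y in make_nsp_pairs(doc, 4):
    print(f"[CLS] {a} [SEP] {b} [SEP]  -> label {y}")
```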


BERT: Bidirectional Encoder Representations from Transformers


Using BERT to create contextualized word embeddings
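The figures on the original slides are not reproduced here. In practice, one common way to read off such contextual vectors is sketched below, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint, which are not part of these slides.

```python
# A minimal sketch using the Hugging Face `transformers` library (assumed installed);
# any BERT checkpoint name would work in place of "bert-base-uncased".
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("The bank raised interest rates.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One d_h-dimensional contextual vector per (sub)word token, including [CLS] and [SEP]
hidden = outputs.last_hidden_state            # shape: (1, seq_len, 768)
for tok, vec in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), hidden[0]):
    print(tok, vec.shape)
```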



A quick detail about the vocabulary



Finite vocabulary assumption

Finite vocabulary assumptions make even less sense in many languages with complex morphology.




The byte-pair encoding algorithm

Start with a vocabulary containing only characters and an “end of word” symbol.
Using a corpus of text, find the most common adjacent characters “a, b”; add “ab” as a subword.
Replace instances of the character pair with the new subword; repeat until the desired vocabulary size is reached (see the sketch below).

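A compact sketch of this loop, adapted from the short reference implementation that is often reproduced in the BPE literature (e.g., Sennrich et al.); the toy corpus and the number of merges below are made up.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over a {word-as-space-separated-symbols: freq} dict."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words split into characters plus an end-of-word symbol "</w>"
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

num_merges = 8                        # stop once the desired vocabulary size is reached
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print("merged", best)
```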


Byte Pair Encoding

When do we stop?
When the desired vocabulary size is met.



Wordpiece/Sentencepiece model

Google NMT (GNMT) uses a variant of this:
▶ V1: wordpiece model
▶ V2: sentencepiece model
Rather than char n-gram counts, it uses a greedy approximation to maximizing language model log likelihood to choose the pieces.
Add the n-gram that maximally reduces perplexity.
Here, perplexity is computed based on a unigram LM.



Wordpiece/Sentencepiece model

The wordpiece model tokenizes inside words.
The sentencepiece model works from raw text.
Whitespace is retained as a special token (_) and grouped normally.
You can reverse things at the end by joining the pieces and recoding the markers back to spaces.

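A purely conceptual sketch of this whitespace round-trip; this is not the sentencepiece implementation. The "▁" marker is the one sentencepiece uses, but the toy segmentation function is an assumption for illustration.

```python
# Conceptual sketch of sentencepiece-style whitespace handling:
# spaces are encoded as a visible marker so that detokenization is lossless.
MARKER = "\u2581"   # "▁", the whitespace marker used by sentencepiece

def to_pieces(text, segment):
    """Replace spaces with the marker, then apply some segmentation function."""
    return segment(MARKER + text.replace(" ", MARKER))

def from_pieces(pieces):
    """Reverse the process: concatenate pieces and turn markers back into spaces."""
    return "".join(pieces).replace(MARKER, " ").strip()

# A made-up segmentation just for illustration (a real model learns its pieces)
toy_segment = lambda s: [s[i:i + 4] for i in range(0, len(s), 4)]

pieces = to_pieces("hello brave new world", toy_segment)
print(pieces)
print(from_pieces(pieces))   # recovers "hello brave new world"
```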


Using BERT for different tasks



Transfer Learning through Fine-tuning

The power of pretrained language models lies in their ability to extract generalizations from large amounts of text.
To make practical use of these generalizations, we need to create interfaces from these models to downstream applications through a process called fine-tuning.
Fine-tuning facilitates the creation of applications on top of pretrained models through the addition of a small set of application-specific parameters.
The fine-tuning process consists of using labeled data from the application to train these additional application-specific parameters.
Typically, this training will either freeze or make only minimal adjustments to the pretrained language model parameters (see the sketch below).

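As a sketch of the "freeze the pretrained parameters, train only the new parameters" option, assuming PyTorch and the Hugging Face transformers library; the checkpoint name, label count, and learning rate are arbitrary choices for illustration.

```python
import torch
from transformers import AutoModel

bert = AutoModel.from_pretrained("bert-base-uncased")

# Freeze all pretrained parameters: only the new head will receive gradients.
for p in bert.parameters():
    p.requires_grad = False

# Small application-specific head on top of the pretrained hidden size (d_h = 768 here).
num_labels = 3
head = torch.nn.Linear(bert.config.hidden_size, num_labels)

# The optimizer only sees the head's parameters, so the pretrained weights stay fixed.
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
```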


Fine Tuning for Sequence Classification


With RNNs, we used the hidden layer associated with the final input element to stand for the entire sequence. In BERT, the [CLS] token plays the role of a sentence embedding.
This unique token is added to the vocabulary and is prepended to the start of all input sequences, both during pretraining and encoding.
The output vector $C \in \mathbb{R}^{d_h}$ in the final layer of the model for the [CLS] input serves as the input to a classifier head.
The only new parameters introduced during fine-tuning are the classification layer weights $W_C \in \mathbb{R}^{K \times d_h}$, where $K$ is the number of labels.


$y = \mathrm{softmax}(W_C\, C)$

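Concretely, this head is a single linear layer followed by a softmax. A minimal PyTorch sketch with assumed shapes ($d_h = 768$, $K = 2$); the random vector stands in for the actual [CLS] output.

```python
import torch

d_h, K = 768, 2                            # hidden size and number of labels (assumed)
W_C = torch.nn.Linear(d_h, K, bias=False)  # the only new fine-tuning parameters

C = torch.randn(1, d_h)                    # stand-in for the final-layer [CLS] vector
y = torch.softmax(W_C(C), dim=-1)          # mirrors y = softmax(W_C C)
print(y)                                   # class probabilities, shape (1, K)
```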


Pair-wise Sequence Classification

Example: MultiNLI
Pairs of sentences are given one of three labels: entails, contradicts, and neutral.
These labels describe a relationship between the meaning of the first sentence (the premise) and the second sentence (the hypothesis).



Pair-wise Sequence Classification
As with NSP training, the two inputs are separated by a [SEP] token.
As with sequence classification, the output vector associated with the prepended [CLS] token represents the model’s view of the input pair.
This vector $C$ provides the input to a three-way classifier that can be trained on the MultiNLI training corpus.

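In practice the packed "[CLS] premise [SEP] hypothesis [SEP]" input can be produced directly by a BERT tokenizer. A sketch assuming the Hugging Face transformers library, with made-up example sentences.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

premise = "A soccer game with multiple males playing."
hypothesis = "Some men are playing a sport."

# Passing two texts packs them into one sequence: [CLS] ... [SEP] ... [SEP]
enc = tokenizer(premise, hypothesis, return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]))
# token_type_ids distinguish the premise segment (0) from the hypothesis segment (1)
print(enc["token_type_ids"][0])
```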


Sequence Labeling

Here, the final output vector corresponding to each input token is passed to a classifier that produces a softmax distribution over the possible set of tags.
The set of weights to be learned for this additional layer is $W_K \in \mathbb{R}^{k \times d_h}$, where $k$ is the number of possible tags for the task.

POS Tagging
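The POS-tagging figure from the original slide is not reproduced here. As a stand-in, a minimal sketch of the per-token head described above applied to POS-style tags; the hidden size, sequence length, and tag count are assumptions.

```python
import torch

d_h, k = 768, 17                         # hidden size and number of tags (assumed; ~17 universal POS tags)
W_K = torch.nn.Linear(d_h, k)            # the additional layer learned during fine-tuning

H = torch.randn(1, 12, d_h)              # stand-in for final-layer vectors of a 12-token input
probs = torch.softmax(W_K(H), dim=-1)    # one distribution over tags per token
pred_tags = probs.argmax(dim=-1)         # shape (1, 12): a tag index for every token
print(pred_tags)
```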



Named Entity Recognition and BIO Scheme

Supervised training data for tasks like named entity recognition (NER) is typically in the form of BIO tags associated with text segmented at the word level: each word is labeled B (beginning of an entity), I (inside an entity), or O (outside any entity).


BIO Scheme with subwords

After WordPiece tokenization, the subword sequence no longer aligns with the original word-level tags.

Solution: Training and Decoding



Training: we can just assign the gold-standard tag associated with each word to all of the subword tokens derived from it.
Decoding: the simplest approach is to use the argmax BIO tag associated with the first subword token of a word (see the sketch below).

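A small sketch of both steps, assuming the word-to-subword mapping is already available; the subword split and the tag names below are made up for illustration.

```python
def expand_tags_to_subwords(word_tags, subwords_per_word):
    """Training: copy each word's gold BIO tag to all of its subword tokens."""
    subword_tags = []
    for tag, pieces in zip(word_tags, subwords_per_word):
        subword_tags.extend([tag] * len(pieces))
    return subword_tags

def collapse_to_word_tags(subword_tags, subwords_per_word):
    """Decoding: keep only the tag predicted for the first subword of each word."""
    word_tags, idx = [], 0
    for pieces in subwords_per_word:
        word_tags.append(subword_tags[idx])
        idx += len(pieces)
    return word_tags

# Made-up example: "Mt. Sanitas is in Boulder" with a toy subword segmentation
words_pieces = [["mt", "."], ["san", "##itas"], ["is"], ["in"], ["boulder"]]
gold = ["B-LOC", "I-LOC", "O", "O", "B-LOC"]

sub_gold = expand_tags_to_subwords(gold, words_pieces)
print(sub_gold)                                       # training targets per subword
print(collapse_to_word_tags(sub_gold, words_pieces))  # back to word-level tags
```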


Fine-tuning for span-based applications



Fine-tuning for SQuAD

We represent the input question and passage as a single packed sequence (separated by [SEP]).
We only introduce a start vector $S \in \mathbb{R}^{d_h}$ and an end vector $E \in \mathbb{R}^{d_h}$ during fine-tuning.
The probability of word $i$ being the start of the answer span is computed as a dot product between $T_i$ and $S$ followed by a softmax over all of the words in the paragraph:
$$P_i = \frac{e^{S \cdot T_i}}{\sum_j e^{S \cdot T_j}}$$
The analogous formula is used for the end of the answer span.
The score of a candidate span from position $i$ to position $j$ is defined as $S \cdot T_i + E \cdot T_j$, and the maximum scoring span with $j \geq i$ is used as the prediction (see the sketch below).

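A hedged sketch of this span-scoring computation in PyTorch, with random tensors standing in for the actual passage representations $T_i$; all shapes are assumptions.

```python
import torch

d_h, seq_len = 768, 20                     # hidden size and passage length (assumed)
T = torch.randn(seq_len, d_h)              # stand-in for final-layer vectors T_i of passage tokens
S = torch.randn(d_h)                       # start vector, learned during fine-tuning
E = torch.randn(d_h)                       # end vector, learned during fine-tuning

# These realize the P_i formula above (and its end-of-span analogue).
start_probs = torch.softmax(T @ S, dim=0)  # P_i = exp(S.T_i) / sum_j exp(S.T_j)
end_probs = torch.softmax(T @ E, dim=0)

# Score every candidate span (i, j) with j >= i as S.T_i + E.T_j and take the maximum.
start_scores, end_scores = T @ S, T @ E
span_scores = start_scores[:, None] + end_scores[None, :]        # shape (seq_len, seq_len)
valid = torch.ones(seq_len, seq_len).triu().bool()               # True where j >= i
span_scores = span_scores.masked_fill(~valid, float("-inf"))     # rule out j < i
best = torch.argmax(span_scores)
best_i, best_j = divmod(int(best), seq_len)
print(best_i, best_j)                      # predicted start and end positions
```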
