BERT, GPT, and Chain of Thoughts (CoT)

Bidirectional Encoder Representations from Transformers (BERT)
Model Architecture
➢ Standard left-to-right language models use only left context, but language understanding is bidirectional.

➢ BERT’s model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in the “Attention is All You Need” paper.
Model Architecture
In their work,

1. They denoted the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A.

2. They primarily reported results on two model sizes:

• BERTBASE (L=12, H=768, A=12, Total Parameters=110M)

• BERTLARGE (L=24, H=1024, A=16, Total Parameters=340M).

https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1204/slides/Jacob_Devlin_BERT.pdf
https://arxiv.org/pdf/1810.04805
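As a quick illustration of these hyperparameters, the sketch below builds a randomly initialized encoder with the BERTBASE configuration and counts its parameters. It assumes the Hugging Face transformers library, which is not part of the original slides.

# Sketch: instantiate a BERTBASE-sized encoder and count its parameters.
# Assumes the Hugging Face `transformers` library is installed.
from transformers import BertConfig, BertModel

config = BertConfig(
    num_hidden_layers=12,    # L = 12 Transformer blocks
    hidden_size=768,         # H = 768
    num_attention_heads=12,  # A = 12 self-attention heads
)
model = BertModel(config)    # randomly initialized, not pre-trained

n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {n_params / 1e6:.0f}M")   # roughly 110M, matching BERTBASE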
Model Architecture
➢ BERTBASE was chosen to have the same model size as OpenAI GPT for comparison purposes.

➢ Critically, however, the BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention where every token can only attend to the context to its left.

https://arxiv.org/pdf/1810.04805
Input Representation
➢ To make BERT handle various downstream tasks, the input representation was such
that it could unambiguously represent both a single sentence and a pair of sentences
(e.g., (Question, Answer)) in one token sequence.
➢ A “sentence” can be an arbitrary span of contiguous text, rather than an actual
linguistic sentence.
➢ A “sequence” refers to the input token sequence to BERT, which may be a single
sentence or two sentences packed together.
➢ They used WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary.
➢ The first token of every sequence is always a special classification token ([CLS]).
➢ The final hidden state corresponding to this token is used as the aggregate
sequence representation for classification tasks.
https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1204/slides/Jacob_Devlin_BERT.pdf
https://arxiv.org/pdf/1810.04805
Wu et al., Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
Input Representation
➢ Sentence pairs were packed together into a single sequence. They differentiated
the sentences in two ways.

➢ First, they separated them with a special token ([SEP]). Second, they added a
learned embedding to every token, indicating whether it belongs to sentence A or B.

https://arxiv.org/pdf/1810.04805
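To make the [CLS]/[SEP] packing and the A/B segment embedding concrete, here is a small sketch using the Hugging Face BertTokenizer (an assumed tool, not referenced in the slides):

# Sketch: how a sentence pair is packed into one BERT input sequence.
# Assumes the `transformers` library and the public "bert-base-uncased" checkpoint.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("Where is the store?", "He bought a gallon of milk.")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'where', 'is', 'the', 'store', '?', '[SEP]',
#  'he', 'bought', 'a', 'gallon', 'of', 'milk', '.', '[SEP]']
print(encoded["token_type_ids"])
# [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]   -> sentence A vs. sentence B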
Input Representation
➢ They denoted the input embedding as E, the final hidden vector of the special [CLS] token as C ∈ R^H, and the final hidden vector for the i-th input token as T_i ∈ R^H.

➢ For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings.

https://arxiv.org/pdf/1810.04805
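A minimal PyTorch sketch of this sum of token, segment, and position embeddings (dimensions and variable names are illustrative assumptions, not the authors' code):

# Sketch: BERT input representation = token + segment + position embeddings.
import torch
import torch.nn as nn

vocab_size, max_len, n_segments, H = 30000, 512, 2, 768

token_emb = nn.Embedding(vocab_size, H)
segment_emb = nn.Embedding(n_segments, H)   # sentence A vs. sentence B
position_emb = nn.Embedding(max_len, H)     # learned absolute positions

def input_representation(token_ids, segment_ids):
    # token_ids, segment_ids: LongTensors of shape (batch, seq_len)
    positions = torch.arange(token_ids.size(1)).unsqueeze(0)
    return token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)

x = input_representation(torch.randint(0, vocab_size, (1, 16)),
                         torch.zeros(1, 16, dtype=torch.long))
print(x.shape)   # torch.Size([1, 16, 768])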
Pre-Training BERT
➢ BERT was pre-trained using two unsupervised tasks.
➢ Masked Language Modeling (MLM) Objective/Task
➢ Next Sentence Prediction (NSP) Objective/Task

https://arxiv.org/pdf/1810.04805
Pre-Training BERT
➢ BERT was pre-trained using two unsupervised tasks.
➢ Masked Language Modeling (MLM) Objective/Task
1. Select some tokens (each token is selected with a probability of 15%).
2. Replace these selected tokens (with the special token [MASK] with p=80%, with a random token with p=10%, or with the original token, i.e., left unchanged, with p=10%).
3. Predict the original tokens (compute the loss).
https://arxiv.org/pdf/1810.04805
https://lena-voita.github.io/nlp_course/transfer_learning.html#bert
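A minimal sketch of this 15% / 80-10-10 masking scheme (function and variable names are illustrative assumptions, not the authors' implementation):

# Sketch: BERT-style MLM masking (select 15%; then 80% [MASK], 10% random, 10% unchanged).
import random

def mlm_mask(token_ids, vocab_size, mask_id, select_prob=0.15):
    # Returns (masked_inputs, labels); labels are -100 where no prediction is made.
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < select_prob:                 # 1. select the token
            labels[i] = tok                               # 3. loss is computed on the original token
            r = random.random()                           # 2. decide how to corrupt it
            if r < 0.8:
                inputs[i] = mask_id                       # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)  # 10%: replace with a random token
            # else: 10%: keep the original token unchanged
    return inputs, labels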
Pre-Training BERT
➢ BERT was pre-trained using two unsupervised tasks.
➢ Next Sentence Prediction (NSP) Objective/Task
➢ The Next Sentence Prediction (NSP) objective is a binary classification task.

➢ From the final-layer representation of the special token [CLS], the model predicts whether
the two sentences are consecutive in some text or not.

➢ Note that 50% of the training examples contain consecutive sentences extracted from the training texts (from the same document), and the other 50% contain a random pair of sentences (from different documents).

➢ Input: [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
➢ Label: isNext

➢ Input: [CLS] the man went to [MASK] store [SEP] penguin [MASK] are flight ##less birds [SEP]
➢ Label: notNext

➢ This task teaches the model to understand the relationships between sentences. As we'll see later, this enables the use of BERT for complicated tasks requiring some kind of reasoning.
https://arxiv.org/pdf/1810.04805
https://lena-voita.github.io/nlp_course/transfer_learning.html#bert
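A small sketch of how such 50/50 NSP training pairs could be assembled from a document-level corpus (illustrative; it assumes every document contains at least two sentences):

# Sketch: building a Next Sentence Prediction (NSP) example.
# `documents` is a list of documents, each a list of "sentences" (text spans).
import random

def make_nsp_example(documents):
    doc = random.choice(documents)
    i = random.randrange(len(doc) - 1)
    sentence_a = doc[i]
    if random.random() < 0.5:                 # 50%: the actual next sentence
        sentence_b, label = doc[i + 1], "isNext"
    else:                                     # 50%: a random sentence (ideally from another document)
        other = random.choice(documents)
        sentence_b, label = random.choice(other), "notNext"
    return sentence_a, sentence_b, label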
Pre-Training BERT

https://arxiv.org/pdf/1810.04805
https://lena-voita.github.io/nlp_course/transfer_learning.html#bert
Pre-Training Data
➢ For the pre-training corpus, they used BooksCorpus (800M words) (Zhu et al.,
2015) and English Wikipedia (2,500M words).

➢ For Wikipedia, they extracted only the text passages and ignored lists, tables, and
headers.

➢ It is critical to use a document-level corpus rather than a shuffled sentence-level corpus such as the Billion Word Benchmark (Chelba et al., 2013) in order to extract long contiguous sequences.

https://arxiv.org/pdf/1810.04805
Pre-Training Procedure
➢ To generate each training input sequence, they sampled two spans of text from the corpus, which they referred to as “sentences” even though they were typically much longer than single sentences (but could also be shorter).

➢ The first sentence received the A embedding, and the second received the B
embedding. 50% of the time, B was the actual next sentence that followed A and
50% of the time it was a random sentence, which was done for the “next sentence
prediction” task. They were sampled such that the combined length is ≤ 512 tokens.

➢ The LM masking was applied after WordPiece tokenization with a uniform masking
rate of 15% and no special consideration was given to partial word pieces.

https://arxiv.org/pdf/1810.04805
Pre-Training Procedure
➢ They trained with a batch size of 256 sequences (256 sequences * 512 tokens =
128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over
the 3.3 billion word corpus.

➢ They used Adam with a learning rate of 1e-4, β1 = 0.9, β2 = 0.999, L2 weight decay of 0.01, learning rate warmup over the first 10,000 steps, and linear decay of the learning rate.

➢ They used a dropout probability of 0.1 on all layers. They used a GELU activation (Hendrycks and Gimpel, 2016) rather than the standard ReLU, following OpenAI GPT. The training loss was the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood.

https://arxiv.org/pdf/1810.04805
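A sketch of this warmup-then-linear-decay schedule as a function of the step number (the numbers match the slide; the function itself is an illustrative assumption):

# Sketch: linear warmup over the first 10,000 steps, then linear decay to 0 at step 1,000,000.
def bert_lr(step, peak_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps   # linear warmup
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)   # linear decay

print(bert_lr(5_000))     # halfway through warmup -> 5e-05
print(bert_lr(10_000))    # peak learning rate -> 0.0001
print(bert_lr(505_000))   # halfway through decay -> 5e-05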
Pre-Training Procedure
➢ Training of BERTBASE was performed on 4 Cloud TPUs in Pod configuration (16
TPU chips total).

➢ Training of BERTLARGE was performed on 16 Cloud TPUs (64 TPU chips total).
Each pretraining took 4 days to complete.

➢ Longer sequences are disproportionately expensive because attention is quadratic in the sequence length. To speed up pretraining in their experiments, they pre-trained the model with a sequence length of 128 for 90% of the steps (batch: 1024 x 128).

➢ Then, they trained the remaining 10% of the steps with a sequence length of 512 (batch: 256 x 512).

https://arxiv.org/pdf/1810.04805
GLUE (General Language Understanding Evaluation) Tasks
1. MNLI (Multi-Genre Natural Language Inference)
➢ MNLI is a benchmark dataset for evaluating natural language inference models. It
consists of pairs of sentences, where the task is to determine if the second sentence
is an entailment, contradiction, or neutral with respect to the first sentence, covering
multiple genres of text.

2. QQP (Quora Question Pairs)


➢ QQP is a dataset of question pairs from Quora, where the objective is to determine
whether two questions are semantically equivalent. This task helps assess the ability to
understand paraphrases and similarity between questions.

3. QNLI (Question Natural Language Inference)


➢ QNLI is derived from the Stanford Question Answering Dataset (SQuAD). The task
involves determining whether a given context sentence contains the answer to a
question. It frames question-answer pairs as an inference task to test understanding
of the relationship between questions and their corresponding contexts.

https://arxiv.org/pdf/1810.04805
GLUE Tasks
4. SST-2 (Stanford Sentiment Treebank)
➢ SST-2 is a sentiment analysis dataset where the task is to classify sentences
from movie reviews as either positive or negative. It is commonly used to
evaluate models on sentiment understanding and classification capabilities.

5. CoLA (Corpus of Linguistic Acceptability)


➢ CoLA is a dataset that consists of sentences labeled as either grammatically
acceptable or unacceptable. The task is to classify sentences based on their
grammaticality, serving as a benchmark for evaluating models on linguistic
acceptability.

6. STS-B (Semantic Textual Similarity Benchmark)


➢ STS-B is a dataset designed to evaluate models on semantic textual
similarity. The task involves scoring pairs of sentences on a scale from 0 to 5
based on their semantic similarity, helping assess the ability to understand
nuanced meanings.
https://arxiv.org/pdf/1810.04805
GLUE Tasks
7. MRPC (Microsoft Research Paraphrase Corpus)
➢ MRPC consists of pairs of sentences from news articles labeled as
paraphrases or non-paraphrases. The objective is to determine whether two
sentences convey the same meaning, making it a valuable resource for
evaluating paraphrase detection models.

8. RTE (Recognizing Textual Entailment)


➢ RTE is a task focused on determining whether a hypothesis sentence can be
inferred from a premise sentence. The dataset consists of sentence pairs,
and the model needs to predict if the hypothesis logically follows from the
premise. This task is closely related to natural language inference.

https://arxiv.org/pdf/1810.04805
Fine-Tuning BERT on Different Tasks

https://arxiv.org/pdf/1810.04805
GLUE Test Results
➢ General Language Understanding Evaluation (GLUE) Results
➢ The only new parameters introduced during fine-tuning are the classification layer weights W ∈ R^{K×H}, where K is the number of labels.

➢ They computed a standard classification loss with C (recall: C is the final hidden representation of [CLS]) and W, i.e., log(softmax(C W^T)).

https://arxiv.org/pdf/1810.04805
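A PyTorch sketch of this classification head on top of the [CLS] representation C (shapes and names are illustrative; cross_entropy applies the log-softmax internally):

# Sketch: fine-tuning classification head over the [CLS] vector C.
import torch
import torch.nn as nn

H, K = 768, 3                       # hidden size and number of labels (e.g., K=3 for MNLI)
W = nn.Linear(H, K, bias=False)     # the only new parameters: W in R^{K x H}

C = torch.randn(8, H)               # final [CLS] hidden states for a batch of 8 examples
logits = W(C)                       # C W^T, shape (8, K)
labels = torch.randint(0, K, (8,))
loss = nn.functional.cross_entropy(logits, labels)   # -log softmax(C W^T)[label]
print(loss.item())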
GLUE Test Results
➢ General Language Understanding Evaluation (GLUE) Results
➢ They used a batch size of 32 and fine-tuned for 3 epochs over the data for all
GLUE tasks.

➢ For each task, they selected the best fine-tuning learning rate (among 5e-5, 4e-5,
3e-5, and 2e-5) on the Dev set.

➢ Additionally, for BERTLARGE they found that fine-tuning was sometimes unstable on small datasets, so they ran several random restarts and selected the best model on the Dev set.

➢ With random restarts, they used the same pre-trained checkpoint but performed different fine-tuning data shuffling and classifier layer initialization.

https://arxiv.org/pdf/1810.04805
SQuAD
➢ SQuAD v1.1
➢ The Stanford Question Answering Dataset (SQuAD v1.1) is a collection of 100k
crowdsourced question/answer pairs (Rajpurkar et al., 2016). Given a question and a
passage from Wikipedia containing the answer, the task is to predict the answer
text span in the passage.

https://arxiv.org/pdf/1810.04805
SQuAD
➢ SQuAD v1.1

https://arxiv.org/pdf/1810.04805
SQuAD
➢ SQuAD V1.1 vs V2.0

https://arxiv.org/pdf/1810.04805
SQuAD
➢ SQuAD V2.0
➢ They used a simple approach to extend the SQuAD v1.1 BERT model for this task.

➢ They treated questions that do not have an answer as having an answer span with
start and end at the [CLS] token.

➢ The probability space for the start and end answer span positions is extended to
include the position of the [CLS] token.

➢ For prediction, they compared the score of the no-answer span, s_null = S·C + E·C, to the score of the best non-null span, ŝ_{i,j} = max_{j≥i} (S·T_i + E·T_j), where S and E are the start and end vectors introduced for SQuAD fine-tuning.

➢ They predicted a non-null answer when ŝ_{i,j} > s_null + τ, where the threshold τ is selected on the dev set to maximize F1.

➢ They did not use TriviaQA data for this model. They fine-tuned for 2 epochs with a
learning rate of 5e-5 and a batch size of 48.
https://arxiv.org/pdf/1810.04805
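A sketch of this null-vs-non-null comparison at prediction time (tensor names and the brute-force span search are illustrative assumptions; start_logits[i] plays the role of S·T_i and end_logits[j] of E·T_j, with position 0 being [CLS]):

# Sketch: SQuAD v2.0 answerability decision from start/end logits.
import torch

def predict_span(start_logits, end_logits, tau=0.0):
    s_null = start_logits[0] + end_logits[0]    # score of the no-answer ([CLS]) span
    best_score, best_span = -float("inf"), None
    for i in range(1, len(start_logits)):       # best non-null span with j >= i
        for j in range(i, len(end_logits)):
            score = start_logits[i] + end_logits[j]
            if score > best_score:
                best_score, best_span = score, (i, j)
    if best_score > s_null + tau:               # threshold tau tuned on the dev set to maximize F1
        return best_span                        # predict the answer span (i, j)
    return None                                 # predict "no answer"

print(predict_span(torch.randn(16), torch.randn(16)))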
SWAG
➢ SWAG
➢ The Situations With Adversarial Generations (SWAG) dataset contains 113k
sentence-pair completion examples that evaluate grounded commonsense inference
(Zellers et al., 2018).

➢ Given a sentence, the task is to choose the most plausible continuation among four
choices.

➢ When fine-tuning on the SWAG dataset, they constructed four input sequences,
each containing the concatenation of the given sentence (sentence A) and a possible
continuation (sentence B).

➢ The only task-specific parameters introduced were a vector whose dot product with the [CLS] token representation C gives a score for each choice, which is normalized with a softmax layer.

https://arxiv.org/pdf/1810.04805
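A sketch of this multiple-choice scoring (the encoder itself is abstracted away; names and shapes are illustrative):

# Sketch: SWAG-style multiple-choice scoring with a single learned vector v.
import torch
import torch.nn as nn

H, num_choices = 768, 4
v = nn.Parameter(torch.randn(H))        # the only task-specific parameters

def score_choices(cls_vectors):
    # cls_vectors: (num_choices, H), the [CLS] representation C of each
    # (sentence A, candidate continuation B) input sequence
    scores = cls_vectors @ v             # dot product with v gives one score per choice
    return torch.softmax(scores, dim=0)  # normalize over the four choices

probs = score_choices(torch.randn(num_choices, H))
print(probs, probs.argmax().item())      # probability per continuation and the predicted choice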
SWAG
➢ SWAG
➢ They fine-tuned the model for 3 epochs
with a learning rate of 2e-5 and a batch
size of 16. Results are presented in
Table 4.

➢ BERTLARGE outperformed the authors’ baseline ESIM+ELMo system by +27.1% and OpenAI GPT by 8.3%.

https://arxiv.org/pdf/1810.04805
Post-BERT Pretraining Advancements
➢ RoBERTa (A Robustly Optimized BERT Pretraining Approach):
➢ Trained BERT with a larger batch size, for more epochs, and/or on more data.

➢ A larger batch size helped improve convergence and stability.

➢ Showed that more epochs alone help, even on the same data.

➢ More data also helps (RoBERTa was trained on a significantly larger dataset (over 160GB of text, compared to BERT’s 16GB) sourced from diverse corpora such as BookCorpus, English Wikipedia, CC-News, OpenWebText, and Stories).

https://arxiv.org/pdf/1907.11692
Post-BERT Pretraining Advancements
➢ RoBERTa (A Robustly Optimized BERT Pretraining Approach):
➢ Improved upon BERT by removing NSP and enhancing MLM.

➢ RoBERTa applied dynamic masking, where the masked tokens were chosen
randomly for each training epoch. This allowed the model to see different
masked tokens for the same input text across epochs, providing more diverse
training signals and preventing overfitting to specific masked positions.

➢ BERT did not use dynamic masking during its pretraining. Instead, BERT used
static masking, where the masked tokens were selected once for each
training example and remained fixed throughout the training process.

https://arxiv.org/pdf/1907.11692
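A small sketch contrasting the two masking strategies (illustrative only; the real implementations also apply the 80/10/10 replacement rule described earlier):

# Sketch: static vs. dynamic masking.
import random

def mask_tokens(tokens, mask_token="[MASK]", prob=0.15):
    return [mask_token if random.random() < prob else t for t in tokens]

sentence = "the man went to the store and bought a gallon of milk".split()

# Static masking (BERT): mask once during preprocessing and reuse every epoch.
static_view = mask_tokens(sentence)
for epoch in range(3):
    print("static :", static_view)             # identical masked positions each epoch

# Dynamic masking (RoBERTa): re-sample the mask every time the example is seen.
for epoch in range(3):
    print("dynamic:", mask_tokens(sentence))   # different masked positions each epoch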
Post-BERT Pretraining Advancements
➢ RoBERTa (A Robustly Optimized BERT Pretraining Approach):
➢ Static vs Dynamic Masking

https://arxiv.org/pdf/1907.11692
Post-BERT Pretraining Advancements
➢ RoBERTa (A Robustly Optimized BERT Pretraining Approach):
➢ The NSP loss was hypothesized to be an important factor in training the
original BERT model.

➢ Devlin et al. (2019) observed that removing NSP hurts performance, with
significant performance degradation on QNLI, MNLI, and SQuAD 1.1.

➢ However, some later work questioned the necessity of the NSP loss (Lample
and Conneau, 2019; Yang et al., 2019; Joshi et al., 2019).

➢ To better understand this discrepancy, the researchers of RoBERTa compared several alternative training formats:

https://arxiv.org/pdf/1907.11692
Post-BERT Pretraining Advancements
➢ RoBERTa (A Robustly Optimized BERT Pretraining Approach):

https://arxiv.org/pdf/1907.11692
Post-BERT Pretraining Advancements
➢ RoBERTa (A Robustly Optimized BERT Pretraining Approach):

Table 2: Development set results for base models pretrained over BOOKCORPUS and WIKIPEDIA. All
models are trained for 1M steps with a batch size of 256 sequences. We report F1 for SQuAD and
accuracy for MNLI-m, SST-2 and RACE. Reported results are medians over five random initializations
(seeds). Results for BERTBASE and XLNetBASE are from Yang et al. (2019).
https://arxiv.org/pdf/1907.11692
Post-BERT Pretraining Advancements
➢ RoBERTa (A Robustly Optimized BERT Pretraining Approach):
➢ Batch-Size

Table 3: Perplexity on held-out training data (ppl) and development set accuracy for base
models trained over BOOKCORPUS and WIKIPEDIA with varying batch sizes (bsz). We
tune the learning rate (lr) for each setting. Models make the same number of passes over
the data (epochs) and have the same computational cost.

https://arxiv.org/pdf/1907.11692
Post-BERT Pretraining Advancements
➢ RoBERTa (A Robustly Optimized BERT Pretraining Approach):
➢ Development-Set Results

Table 4: Development set results for RoBERTa as we pretrain over more data (16GB → 160GB of text) and
pretrain for longer (100K → 300K → 500K steps). Each row accumulates improvements from the rows
above. RoBERTa matches the architecture and training objective of BERTLARGE . Results for BERTLARGE
and XLNetLARGE are from Devlin et al. (2019) and Yang et al. (2019), respectively.

https://arxiv.org/pdf/1907.11692
Post-BERT Pretraining Advancements
➢ ALBERT (A Lite BERT):
➢ Factorized Embedding Parameterization

➢ Cross-Layer Parameter Sharing
➢ The shared parameters include Q, K, V, W1, W2, b1, b2, γ, and β (an ablation study was also done).

➢ Sentence Order Prediction (SOP)
➢ Instead of Next Sentence Prediction (NSP), ALBERT uses SOP, where the model predicts the correct order of two consecutive sentences, which has been found to be more effective for certain tasks.
https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1204/slides/Jacob_Devlin_BERT.pdf
https://arxiv.org/pdf/1909.11942
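A sketch of the factorized embedding idea: instead of one V x H embedding matrix, ALBERT uses a V x E matrix followed by an E x H projection, which shrinks the embedding parameters when E << H (the sizes below are illustrative):

# Sketch: ALBERT-style factorized embedding parameterization.
import torch.nn as nn

V, H, E = 30000, 768, 128            # vocab size, hidden size, small embedding size

bert_style = nn.Embedding(V, H)                              # V*H parameters
albert_style = nn.Sequential(nn.Embedding(V, E),             # V*E parameters
                             nn.Linear(E, H, bias=False))    # + E*H parameters

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(bert_style))     # 23,040,000
print(count(albert_style))   #  3,938,304  (~6x fewer embedding parameters)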
Post-BERT Pretraining Advancements
➢ ALBERT (A Lite BERT):
➢ Results

State-of-the-art results on the GLUE benchmark. For single-task single-model results, we report ALBERT at 1M steps (comparable to RoBERTa) and at 1.5M steps.

https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1204/slides/Jacob_Devlin_BERT.pdf
https://arxiv.org/pdf/1909.11942
Generative Pre-trained Transformer (GPT)
➢ Architecture

https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Generative Pre-trained Transformer (GPT)
➢ Unsupervised pre-training

Training Data: BooksCorpus (over 7,000 unpublished books)


https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Generative Pre-trained Transformer (GPT)
➢ Supervised Fine-tuning

https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Generative Pre-trained Transformer (GPT)
➢ Supervised Fine-tuning
➢ All transformations include adding randomly initialized start and end tokens
(<s>, <e>).

https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Generative Pre-trained Transformer (GPT)
➢ Supervised Fine-tuning

https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Generative Pre-trained Transformer (GPT)
➢ Results

https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Generative Pre-trained Transformer (GPT)
➢ GPT 1 vs GPT 2 vs GPT 3

https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Generative Pre-trained Transformer (GPT)
➢ In-context Learning (ICL)
➢ It was formally introduced in GPT-3.

➢ In-context learning refers to a model's ability to learn and adapt to new tasks or contexts solely based on examples provided in the input prompt, without any additional fine-tuning or parameter updates.

➢ This approach allows the model to perform specific tasks by understanding and mimicking the patterns present in the input examples.
Generative Pre-trained Transformer (GPT)
➢ In-context Learning (ICL)
➢ How does ICL work?
➢ When you use in-context learning, you give the model a prompt that
includes examples of the task at hand.

➢ The model uses these examples to infer the task and generalize it to new,
similar instances within the same prompt. Essentially, the model "learns"
from the context provided directly in the input.
Generative Pre-trained Transformer (GPT)
➢ In-context Learning (ICL)
➢ Example
➢ Suppose you want a language model to perform English-to-French
translation using in-context learning. You can provide a prompt like this:
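The original slide showed the prompt as an image; a plausible prompt of the kind described (reconstructed for illustration, not the slide's exact text) would be:

Translate English to French.

English: I love learning new languages.
French: J'adore apprendre de nouvelles langues.

English: The weather is nice today.
French: Il fait beau aujourd'hui.

English: Where is the nearest train station?
French: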
Generative Pre-trained Transformer (GPT)
➢ In-context Learning (ICL)
➢ Example
➢ What is happening?
➢ You’ve provided the model with two examples of English sentences and
their French translations.

➢ The model infers from this context that its task is to translate English
sentences to French.

➢ Based on the examples, it will try to predict the French translation of the new sentence: "Where is the nearest train station?"

➢ So, basically, the model uses the patterns in the examples to understand
the structure and provide an output based on the task inferred from the
context.
Generative Pre-trained Transformer (GPT)
➢ In-context Learning (ICL)
➢ Zero-Shot Learning
➢ Definition: In zero-shot learning, the model performs a task without
having seen any specific examples for that task. It relies on general
language understanding and context to complete the task.
➢ Example: Suppose you want the model to summarize a sentence:
➢ Prompt:
➢ "Artificial Intelligence is the simulation of human intelligence in
machines that are programmed to think like humans and mimic their
actions.“
➢ Expected Output:
➢ AI is the simulation of human intelligence in machines.
➢ In this case, the model has not been given any explicit examples of how to
summarize; it’s expected to infer the task based solely on the prompt
instructions.
Generative Pre-trained Transformer (GPT)
➢ In-context Learning (ICL)
➢ One-Shot Learning
➢ Definition: In one-shot learning, the model is given one example of the
task before being asked to perform it. This example serves as a reference
for the model to understand and mimic the task.
➢ Example: Translating English to Spanish with one example:
➢ Prompt:
Generative Pre-trained Transformer (GPT)
➢ In-context Learning (ICL)
➢ One-Shot Learning
➢ Expected Output
➢ Muchas gracias

➢ Here, the model has seen one example of an English-to-Spanish translation and is expected to translate the second sentence based on that.
Generative Pre-trained Transformer (GPT)
➢ In-context Learning (ICL)
➢ Few-Shot Learning
➢ Definition
➢ In few-shot learning, the model is given a few examples of the task
before being asked to perform it.

➢ These multiple examples help the model better understand the task’s pattern.

➢ Prompt
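The prompt appeared as an image on the slide; a plausible reconstruction, consistent with the expected output ("Positive") on the next slide, would be:

Classify the sentiment of each sentence as Positive or Negative.

Sentence: The food was cold and the service was slow.
Sentiment: Negative

Sentence: I had a wonderful time and the staff was friendly.
Sentiment: Positive

Sentence: This is the best movie I have seen all year.
Sentiment: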
Generative Pre-trained Transformer (GPT)
➢ In-context Learning (ICL)
➢ Few-Shot Learning
➢ Expected Output
➢ Positive

➢ In this case, the model is given two labeled examples of sentiment


analysis. It uses these examples to understand the task and then
predicts the sentiment for the third sentence.
Generative Pre-trained Transformer (GPT)
➢ In-context Learning (ICL)
➢ Summary of Differences
➢ Zero-Shot: No examples; the model relies solely on prompt instructions.

➢ One-Shot: One example provided; the model generalizes from a single instance.

➢ Few-Shot: A few examples provided; the model learns patterns based on multiple instances.

➢ Each of these learning types demonstrates a different level of context dependency, with few-shot learning providing the most context, enabling the model to understand more complex tasks better.
Generative Pre-trained Transformer (GPT)
➢ In-context Learning (ICL)

https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1234/slides/cs224n-2023-lecture9-pretraining.pdf
Generative Pre-trained Transformer (GPT)
➢ GPT-2 could perform zero-shot learning
➢ It introduced significant advancements that enabled it to perform zero-shot
learning effectively.

➢ Here’s what makes GPT-2 capable of handling zero-shot tasks:

➢ Massive Scale and Diverse Pretraining Data:


➢ Scale: GPT-2 is a much larger model than GPT-1, with up to 1.5 billion parameters in
its largest version. This increased capacity allows it to capture more complex
patterns and relationships within the data.

➢ Diverse Data: GPT-2 was trained on a vast and diverse dataset, containing billions of
tokens from a wide range of sources, covering topics from news to literature to
technical documents. This diversity provides it with broad contextual knowledge
across many domains.

➢ Extensive Contextual Knowledge: By seeing such a wide array of language and tasks
during pretraining, GPT-2 learns generalized representations of language patterns,
which enables it to interpret new tasks based on context alone.
Generative Pre-trained Transformer (GPT)
➢ Notes:
➢ Zero-Shot, One-Shot, and Few-Shot Learning have been long-standing concepts in machine
learning. They refer to a model’s ability to generalize with minimal examples, originally
applied in areas like image recognition and transfer learning. These paradigms existed
before they were demonstrated in large language models.

➢ In-Context Learning as a distinct term became relevant with language models that could
learn from prompts alone, without parameter updates. It refers specifically to a model’s
capacity to learn based on examples given within the same prompt.

➢ While zero-shot, one-shot, and few-shot learning are paradigms that describe the number of
examples provided, in-context learning refers to the process itself — using context within a
prompt to perform tasks. Thus, in-context learning is an umbrella term that includes zero-
shot, one-shot, and few-shot learning within large language models.

➢ In summary: While zero-shot, one-shot, and few-shot learning existed as independent concepts before, in-context learning was introduced as a term to describe how large language models like GPT-3 could adapt to tasks based on examples within a prompt. This makes in-context learning the broader concept that formally ties them together in the realm of language models.
Generative Pre-trained Transformer (GPT)

https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1234/slides/cs224n-2023-lecture9-pretraining.pdf
Chain of Thoughts (CoT)
➢ Chain of Thought (CoT) reasoning is a prompting technique that encourages
language models to break down complex tasks into smaller, logical steps.
➢ This method helps the model articulate its reasoning process, leading to more
accurate and coherent answers.
➢ Here’s how it works, along with an example:
➢ Solve the following math problem: "If one shirt costs $25, how much will 3 shirts cost?"
➢ Chain of Thought Reasoning
➢ Step 1: Identify the Components of the Problem
➢ The problem gives me the cost of one shirt ($25) and the quantity of shirts I want to buy (3).
➢ Step 2: Formulate the Calculation
➢ To find the total cost, I need to multiply the cost of one shirt by the number of shirts:
➢ Total Cost=Cost per Shirt×Number of Shirts
➢ Step 3: Perform the Calculation
➢ Now, I can substitute the values into the formula:
➢ Total Cost=25×3
➢ Performing the multiplication gives:
➢ Total Cost=75
➢ Step 4: State the Final Answer
➢ Therefore, the total cost of buying 3 shirts is $75.
Chain of Thoughts (CoT)
➢ When using CoT prompting, the model would be encouraged to articulate these
steps explicitly (by means of writing/using a proper Prompt template).

➢ Here’s how the complete response might look when prompted to use chain-of-
thought reasoning:

➢ To find the total cost of buying 3 shirts, I first need to know the price of
one shirt, which is $25.

➢ Now, I multiply the price by the number of shirts I want to buy.

➢ So, the calculation is: Total Cost = Price per Shirt × Number of Shirts

➢ Total Cost = $25 × 3 = $75.

➢ Therefore, the total cost for 3 shirts is $75.


Chain of Thoughts (CoT)
➢ Benefits
➢ Clarity: By breaking down the problem, the model makes its reasoning
transparent.

➢ Accuracy: The structured approach helps prevent errors, especially in complex calculations or logic tasks.

➢ Flexibility: This method can be applied to various domains, including math, logic puzzles, and even ethical reasoning, by guiding the model through the necessary steps to arrive at a conclusion.

➢ Summary
➢ Chain of thought reasoning encourages a systematic approach to problem-
solving by explicitly stating each logical step.

➢ This method significantly enhances the model’s ability to tackle complex tasks effectively, providing clearer and more accurate outputs.
Chain of Thoughts (CoT)

https://arxiv.org/pdf/2201.11903
https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1234/slides/cs224n-2023-lecture9-pretraining.pdf
Chain of Thoughts (CoT)
import os
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Set your OpenAI API key
os.environ['OPENAI_API_KEY'] = 'YOUR_API_KEY'

# Step 1: Initialize the language model
# Note: "gpt-3.5-turbo" is a chat model; newer LangChain versions prefer
# ChatOpenAI from langchain.chat_models for it.
llm = OpenAI(model="gpt-3.5-turbo")  # Use the appropriate model

# Step 2: Create a prompt template for CoT reasoning
prompt_template = PromptTemplate(
    input_variables=["question"],
    template="""
To solve the following problem, please explain your reasoning step-by-step:

Problem: {question}

Step-by-step reasoning:
""",
)
Chain of Thoughts (CoT)
# Step 3: Create the LLM chain
chain = LLMChain(llm=llm, prompt=prompt_template)

# Step 4: Run the chain with user input
user_question = input("Please enter a word problem: ")
response = chain.run(question=user_question)

# Step 5: Print the response
print("Model's Response:")
print(response)
Chain of Thoughts (CoT)
Let us say that the goal is to solve the arithmetic problem: "What is 24 plus 36?"

Few-Shot prompt:

Example 1:
Question: What is 12 plus 15?
Answer: 27

Example 2:
Question: What is 45 plus 32?
Answer: 77

Example 3:
Question: What is 24 plus 36?
Answer:

Chain-of-Thought prompt:

Example 1:
Question: What is 12 plus 15?
Step-by-step:
1. Break down the numbers: 12 and 15.
2. Add them: 12 + 15 = 27.
Answer: 27

Example 2:
Question: What is 45 plus 32?
Step-by-step:
1. Break down the numbers: 45 and 32.
2. Add them: 45 + 32 = 77.
Answer: 77

Example 3:
Question: What is 24 plus 36?
Step-by-step:
1. Break down the numbers: 24 and 36.
2. Add them: 24 + 36 =
Answer:

Chain of Thoughts (CoT) prompting often uses few-shot examples to demonstrate the step-by-step reasoning process. In this way, CoT and few-shot learning can work hand in hand. CoT is often applied in a few-shot manner, but its defining characteristic is the detailed reasoning chain, which might not always be present in typical few-shot learning examples.
What does pretraining teach?

https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1234/slides/cs224n-2023-lecture9-pretraining.pdf
Disclaimer
➢ The content of this presentation is not original, and it has been
prepared from various sources for teaching purposes.
