BERT, GPT, and Chain of Thoughts (CoT)

Bidirectional Encoder Representations from Transformers (BERT)
Model Architecture
➢ Standard left-to-right language models use only left context, but language understanding is bidirectional.

➢ BERT’s model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in the “Attention is All You Need” paper.
Model Architecture
In their work,

1. They denoted the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A.

2. They primarily reported results on two model sizes:

• BERTBASE (L=12, H=768, A=12, Total Parameters=110M)

• BERTLARGE (L=24, H=1024, A=16, Total Parameters=340M).

https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1204/slides/Jacob_Devlin_BERT.pdf
https://arxiv.org/pdf/1810.04805
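As a quick illustration of these hyperparameters, the sketch below builds a randomly initialized encoder with the BERTBASE configuration and counts its parameters. It assumes the Hugging Face transformers library, which is not part of the original slides.

# Sketch: instantiate a BERTBASE-sized encoder and count its parameters.
# Assumes the Hugging Face `transformers` library is installed.
from transformers import BertConfig, BertModel

config = BertConfig(
    num_hidden_layers=12,    # L = 12 Transformer blocks
    hidden_size=768,         # H = 768
    num_attention_heads=12,  # A = 12 self-attention heads
)
model = BertModel(config)    # randomly initialized, not pre-trained

n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {n_params / 1e6:.0f}M")   # roughly 110M, matching BERTBASE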
Model Architecture
➢ BERTBASE was chosen to have the same model size as OpenAI GPT for comparison purposes.

➢ Critically, however, the BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention where every token can only attend to the context to its left.

https://arxiv.org/pdf/1810.04805
Input Representation
➢ To make BERT handle various downstream tasks, the input representation was such
that it could unambiguously represent both a single sentence and a pair of sentences
(e.g., (Question, Answer)) in one token sequence.
➢ A “sentence” can be an arbitrary span of contiguous text, rather than an actual
linguistic sentence.
➢ A “sequence” refers to the input token sequence to BERT, which may be a single
sentence or two sentences packed together.
➢ They used WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary.
➢ The first token of every sequence is always a special classification token ([CLS]).
➢ The final hidden state corresponding to this token is used as the aggregate
sequence representation for classification tasks.
https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1204/slides/Jacob_Devlin_BERT.pdf
https://arxiv.org/pdf/1810.04805
Wu et al., Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
Input Representation
➢ Sentence pairs were packed together into a single sequence. They differentiated
the sentences in two ways.

➢ First, they separated them with a special token ([SEP]). Second, they added a
learned embedding to every token, indicating whether it belongs to sentence A or B.

https://arxiv.org/pdf/1810.04805
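To make the [CLS]/[SEP] packing and the A/B segment embedding concrete, here is a small sketch using the Hugging Face BertTokenizer (an assumed tool, not referenced in the slides):

# Sketch: how a sentence pair is packed into one BERT input sequence.
# Assumes the `transformers` library and the public "bert-base-uncased" checkpoint.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("Where is the store?", "He bought a gallon of milk.")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'where', 'is', 'the', 'store', '?', '[SEP]',
#  'he', 'bought', 'a', 'gallon', 'of', 'milk', '.', '[SEP]']
print(encoded["token_type_ids"])
# [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]   -> sentence A vs. sentence B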
Input Representation
➢ They denoted the input embedding as E, the final hidden vector of the special [CLS] token as C ∈ R^H, and the final hidden vector for the i-th input token as T_i ∈ R^H.

➢ For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings.

https://arxiv.org/pdf/1810.04805
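A minimal PyTorch sketch of this sum of token, segment, and position embeddings (dimensions and variable names are illustrative assumptions, not the authors' code):

# Sketch: BERT input representation = token + segment + position embeddings.
import torch
import torch.nn as nn

vocab_size, max_len, n_segments, H = 30000, 512, 2, 768

token_emb = nn.Embedding(vocab_size, H)
segment_emb = nn.Embedding(n_segments, H)   # sentence A vs. sentence B
position_emb = nn.Embedding(max_len, H)     # learned absolute positions

def input_representation(token_ids, segment_ids):
    # token_ids, segment_ids: LongTensors of shape (batch, seq_len)
    positions = torch.arange(token_ids.size(1)).unsqueeze(0)
    return token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)

x = input_representation(torch.randint(0, vocab_size, (1, 16)),
                         torch.zeros(1, 16, dtype=torch.long))
print(x.shape)   # torch.Size([1, 16, 768])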
Pre-Training BERT
➢ BERT was pre-trained using two unsupervised tasks.
➢ Masked Language Modeling (MLM) Objective/Task
➢ Next Sentence Prediction (NSP) Objective/Task

https://arxiv.org/pdf/1810.04805
Pre-Training BERT
➢ BERT was pre-trained using two unsupervised tasks.
➢ Masked Language Modeling (MLM) Objective/Task
1. Select some tokens (each token is selected with a probability of 15%).
2. Replace these selected tokens (with the special token [MASK] with p=80%, with a random token with p=10%, or with the original token, i.e., left unchanged, with p=10%).
3. Predict the original tokens (compute the loss).
https://arxiv.org/pdf/1810.04805
https://lena-voita.github.io/nlp_course/transfer_learning.html#bert
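A minimal sketch of this 15% / 80-10-10 masking scheme (function and variable names are illustrative assumptions, not the authors' implementation):

# Sketch: BERT-style MLM masking (select 15%; then 80% [MASK], 10% random, 10% unchanged).
import random

def mlm_mask(token_ids, vocab_size, mask_id, select_prob=0.15):
    # Returns (masked_inputs, labels); labels are -100 where no prediction is made.
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < select_prob:                 # 1. select the token
            labels[i] = tok                               # 3. loss is computed on the original token
            r = random.random()                           # 2. decide how to corrupt it
            if r < 0.8:
                inputs[i] = mask_id                       # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)  # 10%: replace with a random token
            # else: 10%: keep the original token unchanged
    return inputs, labels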
Pre-Training BERT
➢ BERT was pre-trained using two unsupervised tasks.
➢ Next Sentence Prediction (NSP) Objective/Task
➢ The Next Sentence Prediction (NSP) objective is a binary classification task.

➢ From the final-layer representation of the special token [CLS], the model predicts whether
the two sentences are consecutive in some text or not.

➢ Note that 50% of the training examples contain consecutive sentences extracted from the training texts (from the same document), and the other 50% contain a random pair of sentences (from different documents).

➢ Input: [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
➢ Label: isNext

➢ Input: [CLS] the man went to [MASK] store [SEP] penguin [MASK] are flight ##less birds [SEP]
➢ Label: notNext

➢ This task teaches the model to understand the relationships between sentences. As we'll see later, this enables the use of BERT for complicated tasks requiring some kind of reasoning.
https://arxiv.org/pdf/1810.04805
https://lena-voita.github.io/nlp_course/transfer_learning.html#bert
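A small sketch of how such 50/50 NSP training pairs could be assembled from a document-level corpus (illustrative; it assumes every document contains at least two sentences):

# Sketch: building a Next Sentence Prediction (NSP) example.
# `documents` is a list of documents, each a list of "sentences" (text spans).
import random

def make_nsp_example(documents):
    doc = random.choice(documents)
    i = random.randrange(len(doc) - 1)
    sentence_a = doc[i]
    if random.random() < 0.5:                 # 50%: the actual next sentence
        sentence_b, label = doc[i + 1], "isNext"
    else:                                     # 50%: a random sentence (ideally from another document)
        other = random.choice(documents)
        sentence_b, label = random.choice(other), "notNext"
    return sentence_a, sentence_b, label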
Pre-Training BERT

https://arxiv.org/pdf/1810.04805
https://lena-voita.github.io/nlp_course/transfer_learning.html#bert
Pre-Training Data
➢ For the pre-training corpus, they used BooksCorpus (800M words) (Zhu et al.,
2015) and English Wikipedia (2,500M words).

➢ For Wikipedia, they extracted only the text passages and ignored lists, tables, and
headers.

➢ It is critical to use a document-level corpus rather than a shuffled sentence-level corpus such as the Billion Word Benchmark (Chelba et al., 2013) in order to extract long contiguous sequences.

https://arxiv.org/pdf/1810.04805
Pre-Training Procedure
➢ To generate each training input sequence, they sampled two spans of text from the corpus, which they referred to as “sentences” even though they were typically much longer than single sentences (but could also be shorter).

➢ The first sentence received the A embedding, and the second received the B
embedding. 50% of the time, B was the actual next sentence that followed A and
50% of the time it was a random sentence, which was done for the “next sentence
prediction” task. They were sampled such that the combined length is ≤ 512 tokens.

➢ The LM masking was applied after WordPiece tokenization with a uniform masking
rate of 15% and no special consideration was given to partial word pieces.

https://arxiv.org/pdf/1810.04805
Pre-Training Procedure
➢ They trained with a batch size of 256 sequences (256 sequences * 512 tokens =
128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over
the 3.3 billion word corpus.

➢ They used Adam with a learning rate of 1e-4, β1 = 0.9, β2 = 0.999, L2 weight decay of 0.01, learning rate warmup over the first 10,000 steps, and linear decay of the learning rate.

➢ They used a dropout probability of 0.1 on all layers. They used a GELU activation (Hendrycks and Gimpel, 2016) rather than the standard ReLU, following OpenAI GPT. The training loss was the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood.

https://arxiv.org/pdf/1810.04805
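A sketch of this warmup-then-linear-decay schedule as a function of the step number (the numbers match the slide; the function itself is an illustrative assumption):

# Sketch: linear warmup over the first 10,000 steps, then linear decay to 0 at step 1,000,000.
def bert_lr(step, peak_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps   # linear warmup
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)   # linear decay

print(bert_lr(5_000))     # halfway through warmup -> 5e-05
print(bert_lr(10_000))    # peak learning rate -> 0.0001
print(bert_lr(505_000))   # halfway through decay -> 5e-05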
Pre-Training Procedure
➢ Training of BERTBASE was performed on 4 Cloud TPUs in Pod configuration (16
TPU chips total).

➢ Training of BERTLARGE was performed on 16 Cloud TPUs (64 TPU chips total).
Each pretraining took 4 days to complete.

➢ Longer sequences are disproportionately expensive because attention is quadratic in the sequence length. To speed up pretraining in their experiments, they pre-trained the model with a sequence length of 128 for 90% of the steps (batch: 1024 x 128).

➢ Then, they trained the remaining 10% of the steps with a sequence length of 512 (batch: 256 x 512).

https://arxiv.org/pdf/1810.04805
GLUE (General Language Understanding Evaluation) Tasks
1. MNLI (Multi-Genre Natural Language Inference)
➢ MNLI is a benchmark dataset for evaluating natural language inference models. It
consists of pairs of sentences, where the task is to determine if the second sentence
is an entailment, contradiction, or neutral with respect to the first sentence, covering
multiple genres of text.

2. QQP (Quora Question Pairs)


➢ QQP is a dataset of question pairs from Quora, where the objective is to determine
whether two questions are semantically equivalent. This task helps assess the ability to
understand paraphrases and similarity between questions.

3. QNLI (Question Natural Language Inference)


➢ QNLI is derived from the Stanford Question Answering Dataset (SQuAD). The task
involves determining whether a given context sentence contains the answer to a
question. It frames question-answer pairs as an inference task to test understanding
of the relationship between questions and their corresponding contexts.

https://arxiv.org/pdf/1810.04805
GLUE Tasks
4. SST-2 (Stanford Sentiment Treebank)
➢ SST-2 is a sentiment analysis dataset where the task is to classify sentences
from movie reviews as either positive or negative. It is commonly used to
evaluate models on sentiment understanding and classification capabilities.

5. CoLA (Corpus of Linguistic Acceptability)


➢ CoLA is a dataset that consists of sentences labeled as either grammatically
acceptable or unacceptable. The task is to classify sentences based on their
grammaticality, serving as a benchmark for evaluating models on linguistic
acceptability.

6. STS-B (Semantic Textual Similarity Benchmark)


➢ STS-B is a dataset designed to evaluate models on semantic textual
similarity. The task involves scoring pairs of sentences on a scale from 0 to 5
based on their semantic similarity, helping assess the ability to understand
nuanced meanings.
https://arxiv.org/pdf/1810.04805
GLUE Tasks
7. MRPC (Microsoft Research Paraphrase Corpus)
➢ MRPC consists of pairs of sentences from news articles labeled as
paraphrases or non-paraphrases. The objective is to determine whether two
sentences convey the same meaning, making it a valuable resource for
evaluating paraphrase detection models.

8. RTE (Recognizing Textual Entailment)


➢ RTE is a task focused on determining whether a hypothesis sentence can be
inferred from a premise sentence. The dataset consists of sentence pairs,
and the model needs to predict if the hypothesis logically follows from the
premise. This task is closely related to natural language inference.

https://arxiv.org/pdf/1810.04805
Fine-Tuning BERT on Different Tasks

https://arxiv.org/pdf/1810.04805
GLUE Test Results
➢ General Language Understanding Evaluation (GLUE) Results
➢ The only new parameters introduced during fine-tuning are the classification layer weights W ∈ R^{K×H}, where K is the number of labels.

➢ They computed a standard classification loss with C (recall: C is the final hidden representation of [CLS]) and W, i.e., log(softmax(C W^T)).

https://arxiv.org/pdf/1810.04805
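A PyTorch sketch of this classification head on top of the [CLS] representation C (shapes and names are illustrative; cross_entropy applies the log-softmax internally):

# Sketch: fine-tuning classification head over the [CLS] vector C.
import torch
import torch.nn as nn

H, K = 768, 3                       # hidden size and number of labels (e.g., K=3 for MNLI)
W = nn.Linear(H, K, bias=False)     # the only new parameters: W in R^{K x H}

C = torch.randn(8, H)               # final [CLS] hidden states for a batch of 8 examples
logits = W(C)                       # C W^T, shape (8, K)
labels = torch.randint(0, K, (8,))
loss = nn.functional.cross_entropy(logits, labels)   # -log softmax(C W^T)[label]
print(loss.item())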
GLUE Test Results
➢ General Language Understanding Evaluation (GLUE) Results
➢ They used a batch size of 32 and fine-tuned for 3 epochs over the data for all
GLUE tasks.

➢ For each task, they selected the best fine-tuning learning rate (among 5e-5, 4e-5,
3e-5, and 2e-5) on the Dev set.

➢ Additionally, for BERTLARGE they found that fine-tuning was sometimes unstable on small datasets, so they ran several random restarts and selected the best model on the Dev set.

➢ With random restarts, they used the same pre-trained checkpoint but performed different fine-tuning data shuffling and classifier layer initialization.

https://arxiv.org/pdf/1810.04805
SQuAD
➢ SQuAD v1.1
➢ The Stanford Question Answering Dataset (SQuAD v1.1) is a collection of 100k
crowdsourced question/answer pairs (Rajpurkar et al., 2016). Given a question and a
passage from Wikipedia containing the answer, the task is to predict the answer
text span in the passage.

https://arxiv.org/pdf/1810.04805
SQuAD
➢ SQuAD v1.1

https://arxiv.org/pdf/1810.04805
SQuAD
➢ SQuAD V1.1 vs V2.0

https://arxiv.org/pdf/1810.04805
SQuAD
➢ SQuAD V2.0
➢ They used a simple approach to extend the SQuAD v1.1 BERT model for this task.

➢ They treated questions that do not have an answer as having an answer span with
start and end at the [CLS] token.

➢ The probability space for the start and end answer span positions is extended to
include the position of the [CLS] token.

➢ For prediction, they compared the score of the no-answer span, s_null = S·C + E·C, to the score of the best non-null span, ŝ_{i,j} = max_{j≥i} (S·T_i + E·T_j), where S and E are the start and end vectors introduced for SQuAD fine-tuning.

➢ They predicted a non-null answer when ŝ_{i,j} > s_null + τ, where the threshold τ is selected on the dev set to maximize F1.

➢ They did not use TriviaQA data for this model. They fine-tuned for 2 epochs with a
learning rate of 5e-5 and a batch size of 48.
https://arxiv.org/pdf/1810.04805
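A sketch of this null-vs-non-null comparison at prediction time (tensor names and the brute-force span search are illustrative assumptions; start_logits[i] plays the role of S·T_i and end_logits[j] of E·T_j, with position 0 being [CLS]):

# Sketch: SQuAD v2.0 answerability decision from start/end logits.
import torch

def predict_span(start_logits, end_logits, tau=0.0):
    s_null = start_logits[0] + end_logits[0]    # score of the no-answer ([CLS]) span
    best_score, best_span = -float("inf"), None
    for i in range(1, len(start_logits)):       # best non-null span with j >= i
        for j in range(i, len(end_logits)):
            score = start_logits[i] + end_logits[j]
            if score > best_score:
                best_score, best_span = score, (i, j)
    if best_score > s_null + tau:               # threshold tau tuned on the dev set to maximize F1
        return best_span                        # predict the answer span (i, j)
    return None                                 # predict "no answer"

print(predict_span(torch.randn(16), torch.randn(16)))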
SWAG
➢ SWAG
➢ The Situations With Adversarial Generations (SWAG) dataset contains 113k
sentence-pair completion examples that evaluate grounded commonsense inference
(Zellers et al., 2018).

➢ Given a sentence, the task is to choose the most plausible continuation among four
choices.

➢ When fine-tuning on the SWAG dataset, they constructed four input sequences,
each containing the concatenation of the given sentence (sentence A) and a possible
continuation (sentence B).

➢ The only task-specific parameters introduced were a vector whose dot product with the [CLS] token representation C gives a score for each choice, which is normalized with a softmax layer.

https://arxiv.org/pdf/1810.04805
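A sketch of this multiple-choice scoring (the encoder itself is abstracted away; names and shapes are illustrative):

# Sketch: SWAG-style multiple-choice scoring with a single learned vector v.
import torch
import torch.nn as nn

H, num_choices = 768, 4
v = nn.Parameter(torch.randn(H))        # the only task-specific parameters

def score_choices(cls_vectors):
    # cls_vectors: (num_choices, H), the [CLS] representation C of each
    # (sentence A, candidate continuation B) input sequence
    scores = cls_vectors @ v             # dot product with v gives one score per choice
    return torch.softmax(scores, dim=0)  # normalize over the four choices

probs = score_choices(torch.randn(num_choices, H))
print(probs, probs.argmax().item())      # probability per continuation and the predicted choice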
SWAG
➢ SWAG
➢ They fine-tuned the model for 3 epochs
with a learning rate of 2e-5 and a batch
size of 16. Results are presented in
Table 4.

➢ BERTLARGE outperformed the authors’ baseline ESIM+ELMo system by +27.1% and OpenAI GPT by 8.3%.

https://arxiv.org/pdf/1810.04805
Post-BERT Pretraining Advancements
➢ RoBERTa (A Robustly Optimized BERT Pretraining Approach):
➢ Trained BERT with a larger batch size, for more epochs, and/or on more data.

➢ A larger batch size helped improve convergence and stability.

➢ Showed that more epochs alone help, even on the same data.

➢ More data also helps (RoBERTa was trained on a significantly larger dataset (over 160GB of text, compared to BERT’s 16GB) sourced from diverse corpora such as BookCorpus, English Wikipedia, CC-News, OpenWebText, and Stories).

https://arxiv.org/pdf/1907.11692
Post-BERT Pretraining Advancements
➢ RoBERTa (A Robustly Optimized BERT Pretraining Approach):
➢ Improved upon BERT by removing NSP and enhancing MLM.

➢ RoBERTa applied dynamic masking, where the masked tokens were chosen
randomly for each training epoch. This allowed the model to see different
masked tokens for the same input text across epochs, providing more diverse
training signals and preventing overfitting to specific masked positions.

➢ BERT did not use dynamic masking during its pretraining. Instead, BERT used
static masking, where the masked tokens were selected once for each
training example and remained fixed throughout the training process.

https://arxiv.org/pdf/1907.11692
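A small sketch contrasting the two masking strategies (illustrative only; the real implementations also apply the 80/10/10 replacement rule described earlier):

# Sketch: static vs. dynamic masking.
import random

def mask_tokens(tokens, mask_token="[MASK]", prob=0.15):
    return [mask_token if random.random() < prob else t for t in tokens]

sentence = "the man went to the store and bought a gallon of milk".split()

# Static masking (BERT): mask once during preprocessing and reuse every epoch.
static_view = mask_tokens(sentence)
for epoch in range(3):
    print("static :", static_view)             # identical masked positions each epoch

# Dynamic masking (RoBERTa): re-sample the mask every time the example is seen.
for epoch in range(3):
    print("dynamic:", mask_tokens(sentence))   # different masked positions each epoch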
Post-BERT Pretraining Advancements
➢ RoBERTa (A Robustly Optimized BERT Pretraining Approach):
➢ Static vs Dynamic Masking

https://arxiv.org/pdf/1907.11692
Post-BERT Pretraining Advancements
➢ RoBERTa (A Robustly Optimized BERT Pretraining Approach):
➢ The NSP loss was hypothesized to be an important factor in training the
original BERT model.

➢ Devlin et al. (2019) observed that removing NSP hurts performance, with
significant performance degradation on QNLI, MNLI, and SQuAD 1.1.

➢ However, some later work questioned the necessity of the NSP loss (Lample
and Conneau, 2019; Yang et al., 2019; Joshi et al., 2019).

➢ To better understand this discrepancy, the researchers of RoBERTa compared several alternative training formats:

https://arxiv.org/pdf/1907.11692
Post-BERT Pretraining Advancements
➢ RoBERTa (A Robustly Optimized BERT Pretraining Approach):

https://arxiv.org/pdf/1907.11692
Post-BERT Pretraining Advancements
➢ RoBERTa (A Robustly Optimized BERT Pretraining Approach):

Table 2: Development set results for base models pretrained over BOOKCORPUS and WIKIPEDIA. All
models are trained for 1M steps with a batch size of 256 sequences. We report F1 for SQuAD and
accuracy for MNLI-m, SST-2 and RACE. Reported results are medians over five random initializations
(seeds). Results for BERTBASE and XLNetBASE are from Yang et al. (2019).
https://arxiv.org/pdf/1907.11692
Post-BERT Pretraining Advancements
➢ RoBERTa (A Robustly Optimized BERT Pretraining Approach):
➢ Batch-Size

Table 3: Perplexity on held-out training data (ppl) and development set accuracy for base
models trained over BOOKCORPUS and WIKIPEDIA with varying batch sizes (bsz). We
tune the learning rate (lr) for each setting. Models make the same number of passes over
the data (epochs) and have the same computational cost.

https://arxiv.org/pdf/1907.11692
Post-BERT Pretraining Advancements
➢ RoBERTa (A Robustly Optimized BERT Pretraining Approach):
➢ Development-Set Results

Table 4: Development set results for RoBERTa as we pretrain over more data (16GB → 160GB of text) and
pretrain for longer (100K → 300K → 500K steps). Each row accumulates improvements from the rows
above. RoBERTa matches the architecture and training objective of BERTLARGE . Results for BERTLARGE
and XLNetLARGE are from Devlin et al. (2019) and Yang et al. (2019), respectively.

https://arxiv.org/pdf/1907.11692
Post-BERT Pretraining Advancements
➢ ALBERT (A Lite BERT):
➢ Factorized Embedding Parameterization

➢ Cross-Layer Parameter Sharing
➢ The shared parameters include Q, K, V, W1, W2, b1, b2, γ, and β (an ablation study was also done).

➢ Sentence Order Prediction (SOP)
➢ Instead of Next Sentence Prediction (NSP), ALBERT uses SOP, where the model predicts the correct order of two consecutive sentences, which has been found to be more effective for certain tasks.
https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1204/slides/Jacob_Devlin_BERT.pdf
https://arxiv.org/pdf/1909.11942
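A sketch of the factorized embedding idea: instead of one V x H embedding matrix, ALBERT uses a V x E matrix followed by an E x H projection, which shrinks the embedding parameters when E << H (the sizes below are illustrative):

# Sketch: ALBERT-style factorized embedding parameterization.
import torch.nn as nn

V, H, E = 30000, 768, 128            # vocab size, hidden size, small embedding size

bert_style = nn.Embedding(V, H)                              # V*H parameters
albert_style = nn.Sequential(nn.Embedding(V, E),             # V*E parameters
                             nn.Linear(E, H, bias=False))    # + E*H parameters

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(bert_style))     # 23,040,000
print(count(albert_style))   #  3,938,304  (~6x fewer embedding parameters)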
Post-BERT Pretraining Advancements
➢ ALBERT (A Lite BERT):
➢ Results

State-of-the-art results on the GLUE benchmark. For single-task single-model results, we report ALBERT at 1M steps (comparable to RoBERTa) and at 1.5M steps.

https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1204/slides/Jacob_Devlin_BERT.pdf
https://arxiv.org/pdf/1909.11942
Generative Pre-trained Transformer (GPT)
➢ Architecture

https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Generative Pre-trained Transformer (GPT)
➢ Unsupervised pre-training

Training Data: BooksCorpus (over 7,000 unpublished books)


https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Generative Pre-trained Transformer (GPT)
➢ Supervised Fine-tuning

https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Generative Pre-trained Transformer (GPT)
➢ Supervised Fine-tuning
➢ All transformations include adding randomly initialized start and end tokens
(<s>, <e>).

https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Generative Pre-trained Transformer (GPT)
➢ Supervised Fine-tuning

https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Generative Pre-trained Transformer (GPT)
➢ Results

https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Generative Pre-trained Transformer (GPT)
➢ GPT 1 vs GPT 2 vs GPT 3

https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Generative Pre-trained Transformer (GPT)
➢ In-context Learning (ICL)
➢ It was formally introduced in GPT-3.

➢ In-context learning refers to a model's ability to learn and adapt to new tasks or contexts solely based on examples provided in the input prompt, without any additional fine-tuning or parameter updates.

➢ This approach allows the model to perform specific tasks by understanding and mimicking the patterns present in the input examples.
Generative Pre-trained Transformer (GPT)
➢ In-context Learning (ICL)
➢ How does ICL work?
➢ When you use in-context learning, you give the model a prompt that
includes examples of the task at hand.

➢ The model uses these examples to infer the task and generalize it to new,
similar instances within the same prompt. Essentially, the model "learns"
from the context provided directly in the input.
Generative Pre-trained Transformer (GPT)
➢ In-context Learning (ICL)
➢ Example
➢ Suppose you want a language model to perform English-to-French
translation using in-context learning. You can provide a prompt like this:
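The original slide showed the prompt as an image; a plausible prompt of the kind described (reconstructed for illustration, not the slide's exact text) would be:

Translate English to French.

English: I love learning new languages.
French: J'adore apprendre de nouvelles langues.

English: The weather is nice today.
French: Il fait beau aujourd'hui.

English: Where is the nearest train station?
French: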
Generative Pre-trained Transformer (GPT)
➢ In-context Learning (ICL)
➢ Example
➢ What is happening?
➢ You’ve provided the model with two examples of English sentences and
their French translations.

➢ The model infers from this context that its task is to translate English
sentences to French.

➢ Based on the examples, it will try to predict the French translation of the new sentence: "Where is the nearest train station?"

➢ So, basically, the model uses the patterns in the examples to understand
the structure and provide an output based on the task inferred from the
context.
Generative Pre-trained Transformer (GPT)
➢ In-context Learning (ICL)
➢ Zero-Shot Learning
➢ Definition: In zero-shot learning, the model performs a task without
having seen any specific examples for that task. It relies on general
language understanding and context to complete the task.
➢ Example: Suppose you want the model to summarize a sentence:
➢ Prompt:
➢ "Artificial Intelligence is the simulation of human intelligence in
machines that are programmed to think like humans and mimic their
actions.“
➢ Expected Output:
➢ AI is the simulation of human intelligence in machines.
➢ In this case, the model has not been given any explicit examples of how to
summarize; it’s expected to infer the task based solely on the prompt
instructions.
Generative Pre-trained Transformer (GPT)
➢ In-context Learning (ICL)
➢ One-Shot Learning
➢ Definition: In one-shot learning, the model is given one example of the
task before being asked to perform it. This example serves as a reference
for the model to understand and mimic the task.
➢ Example: Translating English to Spanish with one example:
➢ Prompt:
Generative Pre-trained Transformer (GPT)
➢ In-context Learning (ICL)
➢ One-Shot Learning
➢ Expected Output
➢ Muchas gracias

➢ Here, the model has seen one example of an English-to-Spanish translation and is expected to translate the second sentence based on that.
Generative Pre-trained Transformer (GPT)
➢ In-context Learning (ICL)
➢ Few-Shot Learning
➢ Definition
➢ In few-shot learning, the model is given a few examples of the task
before being asked to perform it.

➢ These multiple examples help the model better understand the task’s pattern.

➢ Prompt
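The prompt appeared as an image on the slide; a plausible reconstruction, consistent with the expected output ("Positive") on the next slide, would be:

Classify the sentiment of each sentence as Positive or Negative.

Sentence: The food was cold and the service was slow.
Sentiment: Negative

Sentence: I had a wonderful time and the staff was friendly.
Sentiment: Positive

Sentence: This is the best movie I have seen all year.
Sentiment: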
Generative Pre-trained Transformer (GPT)
➢ In-context Learning (ICL)
➢ Few-Shot Learning
➢ Expected Output
➢ Positive

➢ In this case, the model is given two labeled examples of sentiment


analysis. It uses these examples to understand the task and then
predicts the sentiment for the third sentence.
Generative Pre-trained Transformer (GPT)
➢ In-context Learning (ICL)
➢ Summary of Differences
➢ Zero-Shot: No examples; the model relies solely on prompt instructions.

➢ One-Shot: One example provided; the model generalizes from a single instance.

➢ Few-Shot: A few examples provided; the model learns patterns based on multiple instances.

➢ Each of these learning types demonstrates a different level of context dependency, with few-shot learning providing the most context, enabling the model to understand more complex tasks better.
Generative Pre-trained Transformer (GPT)
➢ In-context Learning (ICL)

https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1234/slides/cs224n-2023-lecture9-pretraining.pdf
Generative Pre-trained Transformer (GPT)
➢ GPT-2 could perform zero-shot learning
➢ It introduced significant advancements that enabled it to perform zero-shot
learning effectively.

➢ Here’s what makes GPT-2 capable of handling zero-shot tasks:

➢ Massive Scale and Diverse Pretraining Data:


➢ Scale: GPT-2 is a much larger model than GPT-1, with up to 1.5 billion parameters in
its largest version. This increased capacity allows it to capture more complex
patterns and relationships within the data.

➢ Diverse Data: GPT-2 was trained on a vast and diverse dataset, containing billions of
tokens from a wide range of sources, covering topics from news to literature to
technical documents. This diversity provides it with broad contextual knowledge
across many domains.

➢ Extensive Contextual Knowledge: By seeing such a wide array of language and tasks
during pretraining, GPT-2 learns generalized representations of language patterns,
which enables it to interpret new tasks based on context alone.
Generative Pre-trained Transformer (GPT)
➢ Notes:
➢ Zero-Shot, One-Shot, and Few-Shot Learning have been long-standing concepts in machine
learning. They refer to a model’s ability to generalize with minimal examples, originally
applied in areas like image recognition and transfer learning. These paradigms existed
before they were demonstrated in large language models.

➢ In-Context Learning as a distinct term became relevant with language models that could
learn from prompts alone, without parameter updates. It refers specifically to a model’s
capacity to learn based on examples given within the same prompt.

➢ While zero-shot, one-shot, and few-shot learning are paradigms that describe the number of
examples provided, in-context learning refers to the process itself — using context within a
prompt to perform tasks. Thus, in-context learning is an umbrella term that includes zero-
shot, one-shot, and few-shot learning within large language models.

➢ In summary: While zero-shot, one-shot, and few-shot learning existed as independent concepts before, in-context learning was introduced as a term to describe how large language models like GPT-3 could adapt to tasks based on examples within a prompt. This makes in-context learning the broader concept that formally ties them together in the realm of language models.
Generative Pre-trained Transformer (GPT)

https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1234/slides/cs224n-2023-lecture9-pretraining.pdf
Chain of Thoughts (CoT)
➢ Chain of Thought (CoT) reasoning is a prompting technique that encourages
language models to break down complex tasks into smaller, logical steps.
➢ This method helps the model articulate its reasoning process, leading to more
accurate and coherent answers.
➢ Here’s how it works, along with an example:
➢ Solve the following math problem: "If one shirt costs $25, how much will 3 shirts cost?"
➢ Chain of Thought Reasoning
➢ Step 1: Identify the Components of the Problem
➢ The problem gives me the cost of one shirt ($25) and the quantity of shirts I want to buy (3).
➢ Step 2: Formulate the Calculation
➢ To find the total cost, I need to multiply the cost of one shirt by the number of shirts:
➢ Total Cost=Cost per Shirt×Number of Shirts
➢ Step 3: Perform the Calculation
➢ Now, I can substitute the values into the formula:
➢ Total Cost=25×3
➢ Performing the multiplication gives:
➢ Total Cost=75
➢ Step 4: State the Final Answer
➢ Therefore, the total cost of buying 3 shirts is $75.
Chain of Thoughts (CoT)
➢ When using CoT prompting, the model would be encouraged to articulate these
steps explicitly (by means of writing/using a proper Prompt template).

➢ Here’s how the complete response might look when prompted to use chain-of-
thought reasoning:

➢ To find the total cost of buying 3 shirts, I first need to know the price of
one shirt, which is $25.

➢ Now, I multiply the price by the number of shirts I want to buy.

➢ So, the calculation is: Total Cost = Price per Shirt × Number of Shirts

➢ Total Cost = $25 × 3 = $75.

➢ Therefore, the total cost for 3 shirts is $75.


Chain of Thoughts (CoT)
➢ Benefits
➢ Clarity: By breaking down the problem, the model makes its reasoning
transparent.

➢ Accuracy: The structured approach helps prevent errors, especially in complex calculations or logic tasks.

➢ Flexibility: This method can be applied to various domains, including math, logic puzzles, and even ethical reasoning, by guiding the model through the necessary steps to arrive at a conclusion.

➢ Summary
➢ Chain of thought reasoning encourages a systematic approach to problem-
solving by explicitly stating each logical step.

➢ This method significantly enhances the model’s ability to tackle complex tasks effectively, providing clearer and more accurate outputs.
Chain of Thoughts (CoT)

https://arxiv.org/pdf/2201.11903
https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1234/slides/cs224n-2023-lecture9-pretraining.pdf
Chain of Thoughts (CoT)
import os
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Set your OpenAI API key
os.environ['OPENAI_API_KEY'] = 'YOUR_API_KEY'

# Step 1: Initialize the language model
# Note: "gpt-3.5-turbo" is a chat model; newer LangChain versions prefer
# ChatOpenAI from langchain.chat_models for it.
llm = OpenAI(model="gpt-3.5-turbo")  # Use the appropriate model

# Step 2: Create a prompt template for CoT reasoning
prompt_template = PromptTemplate(
    input_variables=["question"],
    template="""
To solve the following problem, please explain your reasoning step-by-step:

Problem: {question}

Step-by-step reasoning:
""",
)
Chain of Thoughts (CoT)
# Step 3: Create the LLM chain
chain = LLMChain(llm=llm, prompt=prompt_template)

# Step 4: Run the chain with user input
user_question = input("Please enter a word problem: ")
response = chain.run(question=user_question)

# Step 5: Print the response
print("Model's Response:")
print(response)
Chain of Thoughts (CoT)
Let us say that the goal is to solve the arithmetic problem: "What is 24 plus 36?"

Few-Shot prompt:

Example 1:
Question: What is 12 plus 15?
Answer: 27

Example 2:
Question: What is 45 plus 32?
Answer: 77

Example 3:
Question: What is 24 plus 36?
Answer:

Chain-of-Thought prompt:

Example 1:
Question: What is 12 plus 15?
Step-by-step:
1. Break down the numbers: 12 and 15.
2. Add them: 12 + 15 = 27.
Answer: 27

Example 2:
Question: What is 45 plus 32?
Step-by-step:
1. Break down the numbers: 45 and 32.
2. Add them: 45 + 32 = 77.
Answer: 77

Example 3:
Question: What is 24 plus 36?
Step-by-step:
1. Break down the numbers: 24 and 36.
2. Add them: 24 + 36 =
Answer:

Chain of Thoughts (CoT) prompting often uses few-shot examples to demonstrate the step-by-step reasoning process. In this way, CoT and few-shot learning can work hand in hand. CoT is often applied in a few-shot manner, but its defining characteristic is the detailed reasoning chain, which might not always be present in typical few-shot learning examples.
What does pretraining teach?

https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1234/slides/cs224n-2023-lecture9-pretraining.pdf
Disclaimer
➢ The content of this presentation is not original, and it has been
prepared from various sources for teaching purposes.
