BERT GPT CoT
Bidirectional Encoder Representations from Transformers (BERT)
Model Architecture
➢ Language models only use left context, but language understanding is bidirectional.
https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1204/slides/Jacob_Devlin_BERT.pdf
https://arxiv.org/pdf/1810.04805
Model Architecture
➢ BERTBASE was chosen to have the same model size as OpenAI GPT for comparison
purposes.
https://arxiv.org/pdf/1810.04805
Input Representation
➢ To make BERT handle various downstream tasks, the input representation was such
that it could unambiguously represent both a single sentence and a pair of sentences
(e.g., (Question, Answer)) in one token sequence.
➢ A “sentence” can be an arbitrary span of contiguous text, rather than an actual
linguistic sentence.
➢ A “sequence” refers to the input token sequence to BERT, which may be a single
sentence or two sentences packed together.
➢ They used WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary.
➢ The first token of every sequence is always a special classification token ([CLS]).
➢ The final hidden state corresponding to this token is used as the aggregate
sequence representation for classification tasks.
https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1204/slides/Jacob_Devlin_BERT.pdf
https://arxiv.org/pdf/1810.04805
Wu et al., Google’s neural machine translation system: Bridging the gap between human and machine translation.
arXiv preprint arXiv:1609.08144.
Input Representation
➢ Sentence pairs were packed together into a single sequence. They differentiated
the sentences in two ways.
➢ First, they separated them with a special token ([SEP]). Second, they added a
learned embedding to every token, indicating whether it belongs to sentence A or B.
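➢ A minimal Python sketch of this packing (illustrative only: tokens are plain strings and WordPiece tokenization is assumed to have been applied already):
def pack_pair(tokens_a, tokens_b):
    """Pack two tokenized sentences into one BERT input with segment (A/B) ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segment_ids = pack_pair(["the", "man", "went", "to", "the", "store"],
                                ["he", "bought", "milk"])
# tokens      -> [CLS] the man went to the store [SEP] he bought milk [SEP]
# segment_ids ->   0    0   0   0   0   0    0     0    1    1     1    1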
https://arxiv.org/pdf/1810.04805
Input Representation
➢ They denoted the input embedding as E, the final hidden vector of the special [CLS]
token as C ∈ R^H, and the final hidden vector for the i-th input token as T_i ∈ R^H.
https://arxiv.org/pdf/1810.04805
Pre-Training BERT
➢ BERT was pre-trained using two unsupervised tasks.
➢ Masked Language Modeling (MLM) Objective/Task
➢ Next Sentence Prediction (NSP) Objective/Task
https://arxiv.org/pdf/1810.04805
Pre-Training BERT
➢ BERT was pre-trained using two unsupervised tasks.
➢ Masked Language Modeling (MLM) Objective/Task
1. Select some tokens (each token is selected with a probability of 15%).
2. Replace these selected tokens: with the special token [MASK] with p = 80%, with a
random token with p = 10%, or with the original token (left unchanged) with p = 10%.
3. Predict the original tokens (compute the loss).
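➢ A minimal Python sketch of the 80/10/10 masking rule above (illustrative only: it works on token strings and assumes vocab is a list of vocabulary tokens):
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Apply the BERT MLM 80/10/10 masking rule to a list of WordPiece tokens."""
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:           # 1. select ~15% of tokens
            labels.append(tok)                    # 3. the original token is the prediction target
            r = random.random()
            if r < 0.8:                           # 2a. 80%: replace with [MASK]
                inputs.append("[MASK]")
            elif r < 0.9:                         # 2b. 10%: replace with a random token
                inputs.append(random.choice(vocab))
            else:                                 # 2c. 10%: keep the original token
                inputs.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)                   # no loss is computed for unselected tokens
    return inputs, labels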
https://arxiv.org/pdf/1810.04805
https://lena-voita.github.io/nlp_course/transfer_learning.html#bert
Pre-Training BERT
➢ BERT was pre-trained using two unsupervised tasks.
➢ Next Sentence Prediction (NSP) Objective/Task
➢ The Next Sentence Prediction (NSP) objective is a binary classification task.
➢ From the final-layer representation of the special token [CLS], the model predicts whether
the two sentences are consecutive in some text or not.
➢ Note that 50% of the training examples contain consecutive sentences extracted from the
training texts (from the same document), and the other 50% contain a random pair of
sentences (from different documents).
➢ Input: [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
➢ Label: isNext
➢ Input: [CLS] the man went to [MASK] store [SEP] penguin [MASK] are flight ##less birds
[SEP]
➢ Label: notNext
➢ This task teaches the model to understand relationships between sentences. As we'll see
later, this enables using BERT for complicated tasks that require some form of reasoning.
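➢ A minimal Python sketch of how such 50/50 sentence pairs could be constructed (illustrative only: each document is assumed to be a list of sentences):
import random

def make_nsp_example(documents):
    """Build one (sentence_a, sentence_b, label) example for next sentence prediction."""
    doc = random.choice(documents)
    i = random.randrange(len(doc) - 1)
    sentence_a = doc[i]
    if random.random() < 0.5:                         # 50%: the actual next sentence
        sentence_b, label = doc[i + 1], "IsNext"
    else:                                             # 50%: a random sentence from another document
        other = random.choice([d for d in documents if d is not doc])
        sentence_b, label = random.choice(other), "NotNext"
    return sentence_a, sentence_b, label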
https://arxiv.org/pdf/1810.04805
https://lena-voita.github.io/nlp_course/transfer_learning.html#bert
Pre-Training BERT
https://arxiv.org/pdf/1810.04805
https://lena-voita.github.io/nlp_course/transfer_learning.html#bert
Pre-Training Data
➢ For the pre-training corpus, they used BooksCorpus (800M words) (Zhu et al.,
2015) and English Wikipedia (2,500M words).
➢ For Wikipedia, they extracted only the text passages and ignored lists, tables, and
headers.
https://arxiv.org/pdf/1810.04805
Pre-Training Procedure
➢ To generate each training input sequence, they sampled two spans of text from the
corpus, which they referred to as “sentences” even though they were typically much
longer than single sentences (though they could also be shorter).
➢ The first sentence received the A embedding, and the second received the B
embedding. 50% of the time, B was the actual next sentence that followed A and
50% of the time it was a random sentence, which was done for the “next sentence
prediction” task. They were sampled such that the combined length is ≤ 512 tokens.
➢ The LM masking was applied after WordPiece tokenization with a uniform masking
rate of 15% and no special consideration was given to partial word pieces.
https://arxiv.org/pdf/1810.04805
Pre-Training Procedure
➢ They trained with a batch size of 256 sequences (256 sequences * 512 tokens =
128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over
the 3.3 billion word corpus.
➢ They used Adam with a learning rate of 1e-4, β1 = 0.9, β2 = 0.999, L2 weight decay
of 0.01, learning rate warmup over the first 10,000 steps, and linear decay of the
learning rate.
➢ They used a dropout probability of 0.1 on all layers. They used a gelu activation
(Hendrycks and Gimpel, 2016) rather than the standard relu, following OpenAI
GPT. The training loss was the sum of the mean masked LM likelihood and the mean
next-sentence prediction likelihood.
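➢ A minimal PyTorch sketch of this optimizer and schedule (illustrative only: AdamW stands in for Adam with L2 weight decay, and model is a placeholder torch.nn.Module):
import torch

warmup_steps, total_steps = 10_000, 1_000_000

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.999), weight_decay=0.01)

def lr_lambda(step):
    # linear warmup over the first 10,000 steps, then linear decay to zero
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# call scheduler.step() after each optimizer.step()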
https://arxiv.org/pdf/1810.04805
Pre-Training Procedure
➢ Training of BERTBASE was performed on 4 Cloud TPUs in Pod configuration (16
TPU chips total).
➢ Training of BERTLARGE was performed on 16 Cloud TPUs (64 TPU chips total).
Each pretraining took 4 days to complete.
➢ To speed up pre-training, they trained with a sequence length of 128 for 90% of the
steps; then they trained the remaining 10% of the steps with a sequence length of 512
(batch: 256 × 512) to learn the positional embeddings.
https://arxiv.org/pdf/1810.04805
GLUE (General Language Understanding Evaluation) Tasks
1. MNLI (Multi-Genre Natural Language Inference)
➢ MNLI is a benchmark dataset for evaluating natural language inference models. It
consists of pairs of sentences, where the task is to determine if the second sentence
is an entailment, contradiction, or neutral with respect to the first sentence, covering
multiple genres of text.
https://arxiv.org/pdf/1810.04805
GLUE Tasks
4. SST-2 (Stanford Sentiment Treebank)
➢ SST-2 is a sentiment analysis dataset where the task is to classify sentences
from movie reviews as either positive or negative. It is commonly used to
evaluate models on sentiment understanding and classification capabilities.
https://arxiv.org/pdf/1810.04805
GLUE Tasks
7. MRPC (Microsoft Research Paraphrase Corpus)
➢ MRPC consists of pairs of sentences from news articles labeled as
paraphrases or non-paraphrases. The objective is to determine whether two
sentences convey the same meaning, making it a valuable resource for
evaluating paraphrase detection models.
https://arxiv.org/pdf/1810.04805
Fine-Tuning BERT on Different Tasks
https://arxiv.org/pdf/1810.04805
GLUE Test Results
➢ General Language Understanding Evaluation (GLUE) Results
➢ The only new parameters introduced during fine-tuning are the classification layer
weights W ∈ R^(K×H), where K is the number of labels and H is the hidden size.
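➢ A minimal PyTorch sketch of such a classification head on top of the [CLS] representation C (illustrative sizes; the encoder itself is omitted):
import torch

H, K = 768, 3                        # hidden size and number of labels (e.g., MNLI)
classifier = torch.nn.Linear(H, K)   # weights W ∈ R^(K×H) plus a bias

C = torch.randn(8, H)                # final [CLS] hidden states for a batch of 8
logits = classifier(C)               # shape (8, K)
loss = torch.nn.functional.cross_entropy(logits, torch.randint(0, K, (8,)))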
https://arxiv.org/pdf/1810.04805
GLUE Test Results
➢ General Language Understanding Evaluation (GLUE) Results
➢ They used a batch size of 32 and fine-tuned for 3 epochs over the data for all
GLUE tasks.
➢ For each task, they selected the best fine-tuning learning rate (among 5e-5, 4e-5,
3e-5, and 2e-5) on the Dev set.
➢ Additionally, for BERTLARGE they found that fine-tuning was sometimes unstable on
small datasets, so they ran several random restarts and selected the best model on
the Dev set.
➢ With random restarts, they used the same pre-trained checkpoint but performed
different fine-tuning data shuffling and classifier layer initialization.
https://arxiv.org/pdf/1810.04805
SQuAD
➢ SQuAD v1.1
➢ The Stanford Question Answering Dataset (SQuAD v1.1) is a collection of 100k
crowdsourced question/answer pairs (Rajpurkar et al., 2016). Given a question and a
passage from Wikipedia containing the answer, the task is to predict the answer
text span in the passage.
https://arxiv.org/pdf/1810.04805
SQuAD
➢ SQuAD v1.1
https://arxiv.org/pdf/1810.04805
SQuAD
➢ SQuAD v1.1 vs v2.0
https://arxiv.org/pdf/1810.04805
SQuAD
➢ SQuAD v2.0
➢ They used a simple approach to extend the SQuAD v1.1 BERT model for this task.
➢ They treated questions that do not have an answer as having an answer span with
start and end at the [CLS] token.
➢ The probability space for the start and end answer span positions is extended to
include the position of the [CLS] token.
➢ For prediction, they compared the score of the no-answer span, s_null = S·C + E·C
(where S and E are the start and end vectors used for span prediction), to the score of
the best non-null span, ŝ_i,j = max_{j≥i} (S·T_i + E·T_j).
➢ A non-null answer is predicted when ŝ_i,j > s_null + τ, where the threshold τ is
selected on the dev set to maximize F1.
➢ They did not use TriviaQA data for this model. They fine-tuned for 2 epochs with a
learning rate of 5e-5 and a batch size of 48.
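➢ A minimal Python sketch of this comparison, given per-token start and end scores where position 0 is [CLS] (start_scores[i] = S·T_i, end_scores[j] = E·T_j; the answer-length cap and the threshold τ are illustrative):
def predict_span(start_scores, end_scores, tau=0.0, max_answer_len=30):
    """Return (start, end) of the best answer span, or None for 'no answer'."""
    s_null = start_scores[0] + end_scores[0]                 # score of the [CLS] (no-answer) span
    best_score, best_span = float("-inf"), None
    for i in range(1, len(start_scores)):
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            score = start_scores[i] + end_scores[j]
            if score > best_score:
                best_score, best_span = score, (i, j)
    return best_span if best_score > s_null + tau else None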
https://arxiv.org/pdf/1810.04805
SWAG
➢ SWAG
➢ The Situations With Adversarial Generations (SWAG) dataset contains 113k
sentence-pair completion examples that evaluate grounded commonsense inference
(Zellers et al., 2018).
➢ Given a sentence, the task is to choose the most plausible continuation among four
choices.
➢ When fine-tuning on the SWAG dataset, they constructed four input sequences,
each containing the concatenation of the given sentence (sentence A) and a possible
continuation (sentence B).
➢ The only task-specific parameters introduced were a vector whose dot product with
the [CLS] token representation C denotes a score for each choice, which is then
normalized with a softmax layer.
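➢ A minimal PyTorch sketch of this scoring (illustrative only: C holds the [CLS] representations of the four (sentence A, choice) sequences and v is the task-specific vector):
import torch

H = 768
v = torch.nn.Parameter(torch.randn(H))   # the only task-specific parameter vector

C = torch.randn(4, H)                    # [CLS] representations of the four choices
scores = C @ v                           # dot product with each choice, shape (4,)
probs = torch.softmax(scores, dim=0)     # softmax-normalized scores over the choices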
https://arxiv.org/pdf/1810.04805
SWAG
➢ SWAG
➢ They fine-tuned the model for 3 epochs with a learning rate of 2e-5 and a batch
size of 16. Results are presented in Table 4.
https://arxiv.org/pdf/1810.04805
Post-BERT Pretraining Advancements
➢ RoBERTa (A Robustly Optimized BERT Pretraining Approach):
➢ Trained BERT with a larger batch size, for more epochs, and/or on more data.
➢ Showed that more epochs alone help, even on the same data.
➢ More data also helps (RoBERTa was trained on a significantly larger dataset,
over 160GB of text compared to BERT’s 16GB, sourced from diverse corpora
such as BookCorpus, English Wikipedia, CC-News, OpenWebText, and Stories).
https://arxiv.org/pdf/1907.11692
Post-BERT Pretraining Advancements
➢ RoBERTa (A Robustly Optimized BERT Pretraining Approach):
➢ Improved upon BERT by removing NSP and enhancing MLM.
➢ RoBERTa applied dynamic masking, where the masked tokens were chosen
randomly for each training epoch. This allowed the model to see different
masked tokens for the same input text across epochs, providing more diverse
training signals and preventing overfitting to specific masked positions.
➢ BERT did not use dynamic masking during its pretraining. Instead, BERT used
static masking, where the masked tokens were selected once for each
training example and remained fixed throughout the training process.
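➢ A rough Python sketch of the difference, reusing the mask_tokens function sketched earlier (corpus, vocab, and num_epochs are placeholders):
# Static masking (original BERT): masks are generated once, during preprocessing,
# so the model sees the same masked positions for a sequence in every epoch.
static_data = [mask_tokens(seq, vocab) for seq in corpus]

# Dynamic masking (RoBERTa): a fresh mask is sampled every time a sequence is fed
# to the model, so the masked positions differ across epochs.
for epoch in range(num_epochs):
    for seq in corpus:
        inputs, labels = mask_tokens(seq, vocab)   # re-sampled each epoch
        # ... train on (inputs, labels)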
https://arxiv.org/pdf/1907.11692
Post-BERT Pretraining Advancements
➢ RoBERTa (A Robustly Optimized BERT Pretraining Approach):
➢ Static vs Dynamic Masking
https://arxiv.org/pdf/1907.11692
Post-BERT Pretraining Advancements
➢ RoBERTa (A Robustly Optimized BERT Pretraining Approach):
➢ The NSP loss was hypothesized to be an important factor in training the
original BERT model.
➢ Devlin et al. (2019) observed that removing NSP hurts performance, with
significant performance degradation on QNLI, MNLI, and SQuAD 1.1.
➢ However, some later work questioned the necessity of the NSP loss (Lample
and Conneau, 2019; Yang et al., 2019; Joshi et al., 2019).
https://arxiv.org/pdf/1907.11692
Post-BERT Pretraining Advancements
➢ RoBERTa (A Robustly Optimized BERT Pretraining Approach):
https://arxiv.org/pdf/1907.11692
Post-BERT Pretraining Advancements
➢ RoBERTa (A Robustly Optimized BERT Pretraining Approach):
Table 2: Development set results for base models pretrained over BOOKCORPUS and WIKIPEDIA. All
models are trained for 1M steps with a batch size of 256 sequences. We report F1 for SQuAD and
accuracy for MNLI-m, SST-2 and RACE. Reported results are medians over five random initializations
(seeds). Results for BERTBASE and XLNetBASE are from Yang et al. (2019).
https://arxiv.org/pdf/1907.11692
Post-BERT Pretraining Advancements
➢ RoBERTa (A Robustly Optimized BERT Pretraining Approach):
➢ Batch-Size
Table 3: Perplexity on held-out training data (ppl) and development set accuracy for base
models trained over BOOKCORPUS and WIKIPEDIA with varying batch sizes (bsz). We
tune the learning rate (lr) for each setting. Models make the same number of passes over
the data (epochs) and have the same computational cost.
https://arxiv.org/pdf/1907.11692
Post-BERT Pretraining Advancements
➢ RoBERTa (A Robustly Optimized BERT Pretraining Approach):
➢ Development-Set Results
Table 4: Development set results for RoBERTa as we pretrain over more data (16GB → 160GB of text) and
pretrain for longer (100K → 300K → 500K steps). Each row accumulates improvements from the rows
above. RoBERTa matches the architecture and training objective of BERTLARGE . Results for BERTLARGE
and XLNetLARGE are from Devlin et al. (2019) and Yang et al. (2019), respectively.
https://arxiv.org/pdf/1907.11692
Post-BERT Pretraining Advancements
➢ ALBERT (A Lite BERT):
➢ Factorized embedding parameterization
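➢ A rough Python illustration of the parameter savings from factorizing the V×H embedding matrix into V×E and E×H matrices (V = 30,000, H = 1,024, E = 128 are example sizes):
V, H, E = 30_000, 1_024, 128
print(V * H)          # 30,720,000 embedding parameters without factorization
print(V * E + E * H)  #  3,971,072 embedding parameters with factorization (~7.7x fewer)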
https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1204/slides/Jacob_Devlin_BERT.pdf
https://arxiv.org/pdf/1909.11942
Generative Pre-trained Transformer (GPT)
➢ Architecture
https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Generative Pre-trained Transformer (GPT)
➢ Unsupervised pre-training
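➢ The pre-training objective (from the paper) maximizes the standard language-modeling likelihood L1(U) = Σ_i log P(u_i | u_{i−k}, …, u_{i−1}; Θ), where k is the context window and the conditional probability is modeled by a Transformer decoder with parameters Θ.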
https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Generative Pre-trained Transformer (GPT)
➢ Supervised Fine-tuning
➢ All transformations include adding randomly initialized start and end tokens
(<s>, <e>).
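➢ In the paper, the fine-tuning head computes P(y | x^1, …, x^m) = softmax(W_y h_l^m), where h_l^m is the final transformer block's activation for the last token; training maximizes L2(C) = Σ log P(y | x^1, …, x^m), optionally combined with the language-modeling objective as L3(C) = L2(C) + λ · L1(C).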
https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Generative Pre-trained Transformer (GPT)
➢ Results
https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Generative Pre-trained Transformer (GPT)
➢ GPT-1 vs GPT-2 vs GPT-3
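➢ Roughly: GPT-1 (2018, ~117M parameters, pre-training followed by per-task fine-tuning), GPT-2 (2019, up to ~1.5B parameters, zero-shot task transfer), GPT-3 (2020, ~175B parameters, few-shot in-context learning).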
https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Generative Pre-trained Transformer (GPT)
➢ In-context Learning (ICL)
➢ In-context learning was formally introduced with GPT-3 (Brown et al., 2020): a few
demonstrations of a task are given directly in the prompt.
➢ The model uses these examples to infer the task and generalize it to new,
similar instances within the same prompt. Essentially, the model "learns"
from the context provided directly in the input.
Generative Pre-trained Transformer (GPT)
➢ In-context Learning (ICL)
➢ Example
➢ Suppose you want a language model to perform English-to-French
translation using in-context learning. You can provide a prompt like this:
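➢ For illustration, such a prompt might look like:
English: The weather is nice today. → French: Il fait beau aujourd'hui.
English: I love programming. → French: J'adore la programmation.
English: Where is the train station? → French: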
Generative Pre-trained Transformer (GPT)
➢ In-context Learning (ICL)
➢ Example
➢ What is happening?
➢ You’ve provided the model with two examples of English sentences and
their French translations.
➢ The model infers from this context that its task is to translate English
sentences to French.
➢ So, basically, the model uses the patterns in the examples to understand
the structure and provide an output based on the task inferred from the
context.
Generative Pre-trained Transformer (GPT)
➢ In-context Learning (ICL)
➢ Zero-Shot Learning
➢ Definition: In zero-shot learning, the model performs a task without
having seen any specific examples for that task. It relies on general
language understanding and context to complete the task.
➢ Prompt: a task instruction only, with no worked examples (see the illustrative prompt below).
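➢ For illustration, a zero-shot prompt might look like:
Classify the sentiment of the following movie review as Positive or Negative.
Review: "The plot was dull and the acting was worse."
Sentiment: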
Generative Pre-trained Transformer (GPT)
➢ In-context Learning (ICL)
➢ Few-Shot Learning
➢ Definition: In few-shot learning, the prompt contains a small number of labeled
examples of the task before the new input (see the illustrative prompt below).
➢ Expected Output: Positive
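➢ For illustration, a few-shot prompt for the same task might look like:
Review: "A complete waste of two hours." Sentiment: Negative
Review: "The soundtrack alone is worth the ticket." Sentiment: Positive
Review: "I laughed, I cried, I loved every minute." Sentiment: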
https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1234/slides/cs224n-2023-lecture9-pretraining.pdf
Generative Pre-trained Transformer (GPT)
➢ GPT-2 could perform zero-shot learning
➢ It introduced significant advancements that enabled it to perform zero-shot
learning effectively.
➢ Diverse Data: GPT-2 was trained on a vast and diverse dataset, containing billions of
tokens from a wide range of sources, covering topics from news to literature to
technical documents. This diversity provides it with broad contextual knowledge
across many domains.
➢ Extensive Contextual Knowledge: By seeing such a wide array of language and tasks
during pretraining, GPT-2 learns generalized representations of language patterns,
which enables it to interpret new tasks based on context alone.
Generative Pre-trained Transformer (GPT)
➢ Notes:
➢ Zero-Shot, One-Shot, and Few-Shot Learning have been long-standing concepts in machine
learning. They refer to a model’s ability to generalize with minimal examples, originally
applied in areas like image recognition and transfer learning. These paradigms existed
before they were demonstrated in large language models.
➢ In-Context Learning as a distinct term became relevant with language models that could
learn from prompts alone, without parameter updates. It refers specifically to a model’s
capacity to learn based on examples given within the same prompt.
➢ While zero-shot, one-shot, and few-shot learning are paradigms that describe the number of
examples provided, in-context learning refers to the process itself — using context within a
prompt to perform tasks. Thus, in-context learning is an umbrella term that includes zero-
shot, one-shot, and few-shot learning within large language models.
https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1234/slides/cs224n-2023-lecture9-pretraining.pdf
Chain of Thought (CoT)
➢ Chain of Thought (CoT) reasoning is a prompting technique that encourages
language models to break down complex tasks into smaller, logical steps.
➢ This method helps the model articulate its reasoning process, leading to more
accurate and coherent answers.
➢ Here’s how it works, along with an example:
➢ Solve the following math problem: If one shirt costs $25, how much will 3 shirts cost in total?
➢ Chain of Thought Reasoning
➢ Step 1: Identify the Components of the Problem
➢ The problem gives me the cost of one shirt ($25) and the quantity of shirts I want to buy (3).
➢ Step 2: Formulate the Calculation
➢ To find the total cost, I need to multiply the cost of one shirt by the number of shirts:
➢ Total Cost = Cost per Shirt × Number of Shirts
➢ Step 3: Perform the Calculation
➢ Now, I can substitute the values into the formula:
➢ Total Cost = 25 × 3
➢ Performing the multiplication gives:
➢ Total Cost = 75
➢ Step 4: State the Final Answer
➢ Therefore, the total cost of buying 3 shirts is $75.
Chain of Thought (CoT)
➢ When using CoT prompting, the model would be encouraged to articulate these
steps explicitly (by writing/using an appropriate prompt template).
➢ Here’s how the complete response might look when prompted to use chain-of-
thought reasoning:
➢ To find the total cost of buying 3 shirts, I first need to know the price of
one shirt, which is $25.
➢ So, the calculation is: Total Cost = Price per Shirt × Number of Shirts = 25 × 3 = 75.
➢ Therefore, the total cost of buying 3 shirts is $75.
➢ Summary
➢ Chain of thought reasoning encourages a systematic approach to problem-
solving by explicitly stating each logical step.
https://arxiv.org/pdf/2201.11903
https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1234/slides/cs224n-2023-lecture9-pretraining.pdf
Chain of Thought (CoT)
import os
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Step 1: Initialize the LLM (assumes OPENAI_API_KEY is set in the environment)
llm = OpenAI(temperature=0)

# Step 2: Define the chain-of-thought prompt template
prompt_template = PromptTemplate(
    input_variables=["question"],
    template="""Answer the following problem by reasoning step by step.
Problem: {question}
Step-by-step reasoning:
""",
)
Chain of Thought (CoT)
# Step 3: Create the LLM chain
chain = LLMChain(llm=llm, prompt=prompt_template)
# Step 4: Run the chain on an example question
print(chain.run(question="If one shirt costs $25, how much do 3 shirts cost in total?"))
https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1234/slides/cs224n-2023-lecture9-pretraining.pdf
Disclaimer
➢ The content of this presentation is not original, and it has been
prepared from various sources for teaching purposes.