
Introduction to Large Language Models for

Natural Language Processing

Lê Hồng Phương
Data Science Laboratory
VIASM & Vietnam National University, Hanoi
<phuonglh@vnu.edu.vn>

June 30, 2023

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 1 / 66
Content

1 Introduction

2 Transformers
Attention Mechanism
Transformer Architecture

3 LLMs
Encoder-Only
Encoder-Decoder
Decoder-Only
Emergent Abilities

4 Conclusion

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 2 / 66
Introduction

Language Models

A language model (LM) is a probability distribution over sequences of words. Given any sequence of words of length n, an LM assigns a probability P(w1, w2, . . . , wn) to the sequence.
An n-gram LM models word sequences as a Markov process in which we predict the next word given its history:

P(wn | w1, . . . , wn−1)

Text generation from the probabilistic LM: given the prefix “Mary swallowed a green ___”, the model assigns probabilities to candidate next words such as apple, frog, or mountain.
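As a minimal illustration (the probabilities below are invented for this example, not taken from any trained model), generating text from a probabilistic LM amounts to sampling from its conditional next-word distribution:

    import random

    # Toy distribution P(next word | "Mary swallowed a green ...") -- made-up numbers.
    next_word_probs = {"apple": 0.6, "frog": 0.3, "mountain": 0.1}

    def sample_next_word(probs):
        # Sample one word according to its probability.
        words, weights = zip(*probs.items())
        return random.choices(words, weights=weights, k=1)[0]

    print(sample_next_word(next_word_probs))  # e.g. 'apple'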

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 3 / 66
Introduction

Neural Language Models

Neural language models (or continuous-space language models) use continuous representations, or embeddings, of words to make their predictions. These models make use of neural networks.
Each word or token has an associated embedding vector xi ∈ Rd:

Mary swallowed a green apple
 x1     x2     x3  x4   x5

Static word embeddings: Skip-gram and CBOW (2013), GloVe (2014)


Dynamic word embeddings: ELMo (2017), BERT (2018)
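A minimal sketch of the lookup for static embeddings (the vectors below are random stand-ins for pretrained Skip-gram/GloVe vectors; dynamic embeddings would instead be computed from the whole sentence by a neural network):

    import numpy as np

    d = 4  # embedding dimension (tiny, for illustration)
    vocab = ["Mary", "swallowed", "a", "green", "apple"]
    rng = np.random.default_rng(0)

    # In practice these vectors come from a pretrained model (Skip-gram, GloVe, ...).
    embedding_table = {word: rng.normal(size=d) for word in vocab}

    sentence = "Mary swallowed a green apple".split()
    X = np.stack([embedding_table[w] for w in sentence])  # shape (5, d): x1 ... x5
    print(X.shape)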

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 4 / 66
Introduction

Large Language Models

What are LLMs?

ML algorithms that can recognize, predict, and generate human language.
Pretrained on petabyte-scale text datasets, resulting in large models with tens to hundreds of billions of parameters.
LLMs are normally pretrained and then tuned for a specific task.

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 5 / 66
Introduction

Large Language Models: Evolutionary Tree

A nice recent survey presents the evolution of LLMs.1

A simplified view of the stack, from foundations to model families:

Matrix Ops → Attention → Transformer → Encoder-Only LLMs | Encoder-Decoder LLMs | Decoder-Only LLMs

1
Yang et al. (2023) Harnessing the power of LLMs in practice: a survey on ChatGPT and beyond.
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 6 / 66
Introduction

Large Language Models: Evolutionary Tree

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 7 / 66
Introduction

A Timeline of LLMs

2
Zhao et al. (2023) A survey of large language models. https://arxiv.org/abs/2303.18223
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 8 / 66
Introduction

Pretraining – Finetuning Method: 2018–2021

Pretrained Language Model (BERT, T5) → Finetune on task A → Inference on task A

Pretraining – Finetuning:
Typically requires many task-specific examples
One specialized model for each task

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 9 / 66
Introduction

Prompting Method: 2021–2023

Pretrained Language Model (GPT-3) → Inference on task A

Prompting:
Prompts are used to interact with LLMs to accomplish a task.
A prompt is a user-provided input. Prompts can include instructions,
questions, or any other type of input, depending on the intended use
of the model.

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 10 / 66
Introduction

Instruction-Tuning Method: 2022–2023

Pretrained Language Model (FLAN, T0) → Instruction-tune on many tasks B, C, D, . . . → Inference on task A

Instruction-tuning:
Model learns to perform many tasks via natural language instructions
Inference on unseen tasks

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 11 / 66
Transformers Attention Mechanism

Content

1 Introduction

2 Transformers
Attention Mechanism
Transformer Architecture

3 LLMs
Encoder-Only
Encoder-Decoder
Decoder-Only
Emergent Abilities

4 Conclusion

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 12 / 66
Transformers Attention Mechanism

The Attention Mechanism

Currently, the dominant models for nearly all NLP tasks are based on the Transformer architecture. Given any new task in NLP, we usually:
Take a large Transformer-based pretrained model (BERT, GPT-x, T5-x, etc.)
Fine-tune or prompt-engineer the model on the available data for the downstream task
The core idea behind the Transformer model is the attention mechanism, an innovation introduced in 2014.3

3
Bahdanau et al. (2014). Neural machine translation by jointly learning to align and translate.
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 13 / 66
Transformers Attention Mechanism

The Attention Mechanism

In machine translation, attention models often assign high attention weights to cross-lingual synonyms when generating the corresponding words in the target language.
“My feet hurt” → “j’ai mal aux pieds”:
The model might assign high attention weights to the representation of “feet” when generating “pieds”.4

4
Reference: “Dive into DL” – https://d2l.ai/
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 14 / 66
Transformers Attention Mechanism

The Attention Mechanism

In 2017, the Transformer architecture was proposed; it relies on cleverly arranged attention mechanisms to capture all relationships among input and output tokens.5
Denote by D = {(k1, v1), (k2, v2), . . . , (km, vm)} a database of m (key, value) tuples.
Denote by q a query.
The attention over D is defined as

Attention(q, D) := Σ_{i=1}^{m} α(q, ki) vi,

where α(q, ki) ∈ R, ∀i = 1, . . . , m are scalar attention weights.

5
Vaswani et al. (2017). Attention is all you need. NIPS.
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 15 / 66
Transformers Attention Mechanism

Attention Pooling

[Figure: attention pooling — each key ki is compared with the query q to produce a weight α(q, ki), which scales the corresponding value vi; the weighted values are summed.]

Some special cases:

1 α(q, ki) ≥ 0 and Σi α(q, ki) = 1: the weights form a convex combination.
2 α(q, kj) = 1 and all other weights are 0: sparse attention (only one key is attended to).
3 α(q, ki) = 1/m: average pooling.
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 16 / 66
Transformers Attention Mechanism

Attention Pooling By Similarity

Some common similarity kernels α(q, k):

Gaussian: α(q, k) = exp(−‖q − k‖² / (2σ²))
Boxcar: α(q, k) = 1 if ‖q − k‖ ≤ 1, and 0 otherwise
Epanechnikov: α(q, k) = max{0, 1 − ‖q − k‖}
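A minimal NumPy sketch of kernel-based attention pooling: kernel scores are normalized into non-negative weights that sum to one and are then used to average the values (data and dimensions are invented for illustration):

    import numpy as np

    def gaussian_kernel(q, k, sigma=1.0):
        # Unnormalized similarity between the query and one key.
        return np.exp(-np.sum((q - k) ** 2) / (2 * sigma ** 2))

    def attention_pool(q, keys, values, kernel=gaussian_kernel):
        # Attention(q, D) = sum_i alpha(q, k_i) v_i, with normalized kernel weights.
        scores = np.array([kernel(q, k) for k in keys])
        alpha = scores / scores.sum()        # non-negative weights summing to 1
        return alpha @ values                # weighted sum of the values

    rng = np.random.default_rng(0)
    keys, values = rng.normal(size=(5, 2)), rng.normal(size=(5, 3))
    q = rng.normal(size=2)
    print(attention_pool(q, keys, values))   # a pooled 3-dimensional value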

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 17 / 66
Transformers Attention Mechanism

Scaled Dot Product Attention

From (the log of) the Gaussian kernel with σ = 1:

a(q, ki) = −(1/2) ‖q − ki‖²
         = q⊤ ki − (1/2) ‖ki‖² − (1/2) ‖q‖²

The term ‖q‖²/2 is identical for every key, so it vanishes after normalization; this motivates using the (scaled) dot product as the attention score.
The scaled dot-product attention used in the Transformer:

a(q, ki) = q⊤ ki / √d,  where q, ki ∈ Rd
α(q, ki) = exp(a(q, ki)) / Σ_{j=1}^{m} exp(a(q, kj))

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 18 / 66
Transformers Attention Mechanism

Scaled Dot Product Attention

We often compute attention scores in mini-batches for efficiency. For n queries and m key-value pairs, with queries and keys in Rd and values in Rv, we use batch matrix multiplication:

softmax( Q K⊤ / √d ) V ∈ Rn×v,

for Q ∈ Rn×d, K ∈ Rm×d, and V ∈ Rm×v.
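A minimal NumPy sketch of this batch formula (row-wise softmax over the m keys; random data, shapes as in the slide):

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # softmax(Q K^T / sqrt(d)) V, softmax taken row-wise over the keys.
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                    # (n, m)
        scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
        return weights @ V                               # (n, v)

    rng = np.random.default_rng(0)
    n, m, d, v = 2, 5, 4, 3
    Q, K, V = rng.normal(size=(n, d)), rng.normal(size=(m, d)), rng.normal(size=(m, v))
    print(scaled_dot_product_attention(Q, K, V).shape)   # (2, 3)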

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 19 / 66
Transformers Attention Mechanism

Scaled Dot Product Attention

When queries and keys are vectors of different dimensionalities, we can either
use a matrix to address the mismatch via q⊤ M k, or
use additive attention, for q ∈ Rq, k ∈ Rk:

a(q, k) = wv⊤ tanh(Wq q + Wk k) ∈ R,

where wv ∈ Rh, Wq ∈ Rh×q, Wk ∈ Rh×k are learnable parameters.
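A minimal sketch of the additive score (random matrices stand in for learned parameters; the query and key deliberately have different dimensions):

    import numpy as np

    def additive_score(q, k, w_v, W_q, W_k):
        # a(q, k) = w_v^T tanh(W_q q + W_k k); q and k may have different sizes.
        return w_v @ np.tanh(W_q @ q + W_k @ k)

    rng = np.random.default_rng(0)
    q_dim, k_dim, h = 4, 6, 8
    q, k = rng.normal(size=q_dim), rng.normal(size=k_dim)
    w_v, W_q, W_k = rng.normal(size=h), rng.normal(size=(h, q_dim)), rng.normal(size=(h, k_dim))
    print(additive_score(q, k, w_v, W_q, W_k))   # a scalar attention score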

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 20 / 66
Transformers Attention Mechanism

Multi-Head Attention

We want our model to combine knowledge from different behaviors of the same attention mechanism, such as capturing dependencies of various ranges within a sequence.
It is beneficial to allow our attention mechanism to jointly use different representation subspaces of queries, keys, and values.
We concatenate multiple attention pooling outputs: multi-head attention.
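A minimal sketch of multi-head self-attention: project the inputs, split into heads, run scaled dot-product attention per head, and concatenate (random matrices stand in for learned projections; biases and masking are omitted):

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
        # Project the inputs, split into heads, attend per head, concatenate, project.
        n, d = X.shape
        d_head = d // num_heads
        Q, K, V = X @ W_q, X @ W_k, X @ W_v                      # (n, d) each
        heads = []
        for h in range(num_heads):
            s = slice(h * d_head, (h + 1) * d_head)
            scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)        # (n, n)
            heads.append(softmax(scores) @ V[:, s])               # (n, d_head)
        return np.concatenate(heads, axis=-1) @ W_o               # (n, d)

    rng = np.random.default_rng(0)
    n, d, num_heads = 6, 8, 2
    X = rng.normal(size=(n, d))
    W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) for _ in range(4))
    print(multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads).shape)  # (6, 8)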

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 21 / 66
Transformers Attention Mechanism

Self-Attention

In sequence processing, each token has its own query, key, and value. For each token, we can compute a representation by building the appropriate weighted sum over the other tokens.
Each token attends to every other token ⇒ self-attention models.

[Figure: computing outputs s1 . . . s4 from inputs x1 . . . x4 — left: Recurrent Neural Network (sequential); right: Self-Attention (every output attends to every input).]

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 22 / 66
Transformers Attention Mechanism

Self-Attention

Computational complexity on a sequence of n tokens:

RNN: at each step, a d × d weight matrix multiplies a d-dimensional hidden state ⇒ O(nd²); the O(n) sequential operations cannot be parallelized.
Self-attention: an n × d matrix is multiplied by a d × n matrix, then the resulting n × n matrix is multiplied by an n × d matrix ⇒ O(n²d); computation can be parallelized with O(1) sequential operations.

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 23 / 66
Transformers Transformer Architecture

Content

1 Introduction

2 Transformers
Attention Mechanism
Transformer Architecture

3 LLMs
Encoder-Only
Encoder-Decoder
Decoder-Only
Emergent Abilities

4 Conclusion

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 24 / 66
Transformers Transformer Architecture

The Transformer Architecture

The Transformer is composed of an encoder and a decoder.
The encoder is a stack of multiple identical layers, where each layer has two sublayers:
The first is multi-head self-attention pooling.
The second is a positionwise feed-forward network.
There is a residual connection around each sublayer, inspired by ResNet, followed by layer normalization.
The encoder outputs a d-dimensional vector representation for each position of the input sequence.
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 25 / 66
Transformers Transformer Architecture

The Transformer Architecture

The decoder is also a stack of multiple identical layers with residual connections and layer normalizations.
Between the two sublayers, the decoder has a third sublayer, called the encoder-decoder attention.
In the encoder-decoder attention, queries come from the outputs of the previous decoder layer, while the keys and values come from the encoder outputs.
In the decoder self-attention, queries, keys, and values all come from the outputs of the previous decoder layer.
Each position in the decoder may only attend to positions up to and including itself.
This masked attention preserves the auto-regressive property, ensuring that each prediction depends only on the output tokens that have already been generated.
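A minimal sketch of the causal (masked) self-attention pattern: scores for future positions are set to −∞ before the softmax, so their weights become zero (NumPy only, random data):

    import numpy as np

    def masked_self_attention(X):
        # Scaled dot-product self-attention with a causal mask (queries = keys = values = X).
        n, d = X.shape
        scores = X @ X.T / np.sqrt(d)                          # (n, n)
        mask = np.triu(np.ones((n, n), dtype=bool), k=1)       # True above the diagonal = future positions
        scores = np.where(mask, -np.inf, scores)               # forbid attending to the future
        scores -= scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)                               # exp(-inf) = 0 for masked positions
        weights /= weights.sum(axis=-1, keepdims=True)         # row i covers only positions <= i
        return weights @ X

    rng = np.random.default_rng(0)
    print(masked_self_attention(rng.normal(size=(4, 3))).shape)  # (4, 3)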
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 26 / 66
Transformers Transformer Architecture

Encoder Self-Attention Weights

Two layers of multi-head attention weights are presented row by row.6


6
https://d2l.ai/chapter_attention-mechanisms-and-transformers/transformer.html
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 27 / 66
Transformers Transformer Architecture

Decoder Self-Attention Weights

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 28 / 66
Transformers Transformer Architecture

Encoder-Decoder Attention Weights

i’m home . ⇒ [je, suis, chez , moi, .]

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 29 / 66
LLMs

Large-Scale Pretraining with Transformers: Three Modes

Transformers have been extensively pretrained with a wealth of text to learn good representations.
Originally proposed for MT, the Transformer architecture consists of an encoder for representing input sequences and a decoder for generating target sequences.
Transformers can be used in three different modes:
1 encoder-only
2 encoder-decoder
3 decoder-only

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 30 / 66
LLMs Encoder-Only

Content

1 Introduction

2 Transformers
Attention Mechanism
Transformer Architecture

3 LLMs
Encoder-Only
Encoder-Decoder
Decoder-Only
Emergent Abilities

4 Conclusion

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 31 / 66
LLMs Encoder-Only

Encoder-Only Transformer

A Transformer encoder consists of self-attention layers, in which all input tokens attend to each other.
A sequence of input tokens is converted by the encoder into the same number of representations.
These representations can then be further projected into outputs (e.g., for classification).
A prominent encoder-only Transformer pretrained on text is BERT.7

7
Devlin et al. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding.
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 32 / 66
LLMs Encoder-Only

Encoder-Only Transformer: BERT Pretraining

BERT is pretrained on text sequences using masked language modeling: input text with randomly masked tokens is fed into a Transformer encoder to predict the masked tokens.
There is no constraint on the attention pattern of Transformer encoders: the prediction of a masked token such as “love” depends on input tokens both before and after it.
Large-scale text data can be used for pretraining BERT.
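A quick illustration of masked-token prediction, assuming the Hugging Face transformers package and the bert-base-uncased checkpoint are available:

    from transformers import pipeline

    # Masked language modeling: the model predicts the token hidden behind [MASK].
    unmasker = pipeline("fill-mask", model="bert-base-uncased")
    for candidate in unmasker("I [MASK] this movie very much."):
        print(candidate["token_str"], round(candidate["score"], 3))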

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 33 / 66
LLMs Encoder-Only

Encoder-Only Transformer: BERT Fine-Tuning

The pretrained BERT can be fine-tuned for downstream encoding tasks involving single texts or text pairs.
During fine-tuning, additional layers with randomly initialized parameters can be added on top of BERT; these parameters and the pretrained BERT parameters are updated to fit the training data of the downstream task.
The general language representations learned by the 340M-parameter BERT from 250B training tokens advanced the SOTA for many NLP tasks.
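A minimal fine-tuning setup sketch, again assuming the Hugging Face transformers package; data loading, batching, and the optimization loop are omitted:

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Pretrained BERT body + a randomly initialized classification head (2 labels).
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    batch = tokenizer(["a great movie", "a boring movie"], padding=True, return_tensors="pt")
    labels = torch.tensor([1, 0])
    outputs = model(**batch, labels=labels)   # returns loss and logits
    outputs.loss.backward()                   # gradients flow into the head and the BERT body
    print(outputs.logits.shape)               # torch.Size([2, 2])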

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 34 / 66
LLMs Encoder-Only

Derivatives of BERT
Other derivatives of BERT improved model architectures or pretraining objectives:
1 RoBERTa (2019)
change key hyperparameters
remove the next-sentence pretraining objective of BERT
train with much larger mini-batches and learning rates
2 DeBERTa (2021)
use a disentangled attention mechanism: each word is represented by two vectors that encode its content and its position; attention weights among words are computed using disentangled matrices on their contents and relative positions
use an enhanced mask decoder in place of the output softmax layer to predict the masked tokens
3 Others
ALBERT (2019): enforces parameter sharing
SpanBERT (2020): predicts spans of text
ELECTRA (2020): replaced-token detection
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 35 / 66
LLMs Encoder-Only

Multilingual BERT

1 mBERT (2019)
a multilingual version of BERT trained on 104 languages from the
Wikipedia corpus.
follows the BERT recipe with the same training architecture and
objective
110M to 340M parameters
2 XLM-R (2020)
a multilingual language model, trained on 2.5TB of filtered
CommonCrawl data of 100 different languages
performs particularly well on low-resource languages
outperforms mBERT on a variety of cross-lingual benchmarks.
3.5B to 10.7B parameters
3 BERT pretrained models for Vietnamese:
ViBERT (10/2020), FPT
PhoBERT (11/2020), VinAI

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 36 / 66
LLMs Encoder-Decoder

Content

1 Introduction

2 Transformers
Attention Mechanism
Transformer Architecture

3 LLMs
Encoder-Only
Encoder-Decoder
Decoder-Only
Emergent Abilities

4 Conclusion

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 37 / 66
LLMs Encoder-Decoder

Encoder-Decoder Transformer

The decoder autoregressively predicts the target sequence of arbitrary length, token by token, conditioned on both the encoder output and the decoder output so far:
the encoder-decoder cross-attention allows target tokens to attend to all input tokens
the masked multi-head attention of the decoder allows any target token to attend only to past and present tokens in the target sequence
Two well-known encoder-decoder Transformers; both reconstruct the original text in their pretraining objectives:
1 BART (2019): emphasizes noising the input (masking, deletion, permutation, rotation)
2 T5/mT5 (2020): emphasizes multitask unification

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 38 / 66
LLMs Encoder-Decoder

Encoder-Decoder Transformer: T5

Every task is cast as feeding the model text as input and training it to generate some target text.8
This allows the same model, loss function, and hyperparameters to be used across different tasks.
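A quick text-to-text usage sketch, assuming the Hugging Face transformers and sentencepiece packages and the t5-small checkpoint; the task prefix follows T5's convention of stating the task in the input text:

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # Same model, same interface: the task is specified entirely in the input text.
    inputs = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))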
8
Raffel et al. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR.
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 39 / 66
LLMs Encoder-Decoder

Encoder-Decoder Transformer: T5

T5 can be fine-tuned for novel tasks.9
The 11B-parameter T5 achieved SOTA on the GLUE, SuperGLUE, SQuAD, and CNN/Daily Mail benchmarks.
9
https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 40 / 66
LLMs Encoder-Decoder

Encoder-Decoder Transformer: T5

Switch Transformer (2022) is based on T5.10

simplifies the Mixture-of-Experts routing, with reduced communication and computational costs
advances the scale of LLMs by pre-training models with up to a trillion parameters, achieving a 4x speedup over the T5-XXL model
10
Fedus et al. (2022). Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. JMLR.
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 41 / 66
LLMs Decoder-Only

Content

1 Introduction

2 Transformers
Attention Mechanism
Transformer Architecture

3 LLMs
Encoder-Only
Encoder-Decoder
Decoder-Only
Emergent Abilities

4 Conclusion

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 42 / 66
LLMs Decoder-Only

Decoder-Only Transformer: GPT-2

Remove the entire encoder and the encoder-decoder cross-attention sublayer from the original encoder-decoder architecture.
GPT and GPT-2 choose a Transformer decoder as their backbone.

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 43 / 66
LLMs Decoder-Only

Decoder-Only Transformer: GPT-2

GPT (2018)11: 100M parameters, needs to be fine-tuned for downstream tasks.
GPT-2 (11/2019)12: 1.5B parameters, performed well on multiple other tasks without updating the parameters or architecture.13

11
Radford et al. (2018). Improving language understanding by generative pre-training. OpenAI.
12
Radford et al. (2019). Language models are unsupervised multitask learners. OpenAI Blog.
13
https://openai.com/research/better-language-models
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 44 / 66
LLMs Decoder-Only

Decoder-Only Transformer: GPT-3

A pretrained LM may generate the task output as a sequence, without any parameter update, conditioned on an input sequence containing
1 the task description
2 task-specific input-output examples
3 a prompt (task input)
This learning paradigm is called in-context learning.14
GPT-3 (2020), 175B parameters:
uses the same Transformer decoder architecture as GPT-2, except that attention patterns are sparser at alternating layers
pretrained on 300B tokens (40TB of text data)
performs better with larger model size, with few-shot performance increasing most rapidly

14
Brown et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems.
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 45 / 66
LLMs Decoder-Only

GPT-3 Zero-shot

Input: “Translate English to French:” [task description]  “i’m home →” [prompt]
Transformer Decoder (no parameter update)
Output: je suis malade

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 46 / 66
LLMs Decoder-Only

GPT-3 One-shot

Input: “Translate English to French:” [task description]  “go → va |” [1 exemplar]  “i’m home” [prompt]
Transformer Decoder (no parameter update)
Output: je suis malade

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 47 / 66
LLMs Decoder-Only

GPT-3 Few-shot

Input: “Translate English to French:” [task description]  “go → va ; i lost → j’ai perdu |” [2 exemplars]  “i’m home” [prompt]
Transformer Decoder (no parameter update)
Output: je suis malade
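A sketch of how such prompts can be assembled as plain strings before being sent to a decoder-only model (the exemplars mirror the slides; no model call is made here):

    def build_prompt(task_description, exemplars, task_input):
        # Concatenate task description, optional exemplars, and the task input.
        lines = [task_description]
        lines += [f"{src} → {tgt}" for src, tgt in exemplars]   # zero exemplars = zero-shot
        lines.append(f"{task_input} →")
        return "\n".join(lines)

    few_shot = build_prompt(
        "Translate English to French:",
        [("go", "va"), ("i lost", "j'ai perdu")],   # 2 exemplars = few-shot
        "i'm home",
    )
    print(few_shot)   # this string would be sent to the decoder-only model as-is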

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 48 / 66
LLMs Decoder-Only

Recall: Instruction-Tuning Method

Pretrained Language Model (FLAN, T0) → Instruction-tune on many tasks B, C, D, . . . → Inference on task A

Instruction-tuning:
Model learns to perform many tasks via natural language instructions
Inference on unseen tasks
Zero-shot learning

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 49 / 66
LLMs Decoder-Only

FLAN Instruction-Tuning

Leverage the intuition that NLP tasks can be described via natural
language instructions, such as
Is the sentiment of this movie review positive or negative?
Translate “how are you” into Chinese.

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 50 / 66
LLMs Decoder-Only

FLAN Instruction-Tuning

Take a pretrained decoder-only model (LaMDA-PT, 137B parameters) and perform instruction tuning: finetuning on 60+ NLP datasets.15

15
Wei et al. (2022) Finetuned language models are zero-shot learners. ICLR.
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 51 / 66
LLMs Decoder-Only

T0 Instruction-Tuning

T0 is an encoder-decoder model.16
16
Sanh et al. (2022) Multitask prompted training enables zero-shot task generalization. ICLR.
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 52 / 66
LLMs Decoder-Only

T0 Instruction-Tuning
T0 (2022): 11B parameters, trained on a multitask mixture of NLP datasets.
Each dataset is associated with multiple prompt templates used to format exemplars. T0 is as good as FLAN despite being 10x smaller.

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 53 / 66
LLMs Decoder-Only

FLAN Instruction-Tuning – Chain of Thoughts


FLAN-PaLM 540B instruction-finetuned on 1.8K tasks outperforms
PaLM 540B by a large margin (+9.4% on average).
FLAN-PaLM 540B achieves SOTA performance on several
benchmarks, such as 75.2% on five-shot MMLU.

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 54 / 66
LLMs Decoder-Only

FLAN Instruction-Tuning – Chain of Thoughts

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 55 / 66
LLMs Decoder-Only

FLAN Instruction-Tuning – Chain of Thoughts

17
Chung et al. (2022) Scaling instruction-finetuned language models.
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 56 / 66
LLMs Decoder-Only

Self-Instruct
Self-Instruct is a framework for improving the instruction-following capabilities of pretrained LLMs by bootstrapping off their own generations.18

18
Wang et al. (2023) Aligning language models with self-generated instructions. ACL
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 57 / 66
LLMs Decoder-Only

Self-Instruct

SuperNI is a benchmark consisting of 119 tasks with 100 instances per task.
Self-Instruct data is freely available (52K instructions).
Finetuning GPT-3 on this data leads to a 33% absolute improvement over the original GPT-3.

SuperNI19 , Self-Instruct20
19
Wang et al. (2022) Super-NaturalInstruction: Generalization via declarative instructions on 1600+ tasks. EMNLP
20
https://github.com/yizhongw/self-instruct
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 58 / 66
LLMs Emergent Abilities

Content

1 Introduction

2 Transformers
Attention Mechanism
Transformer Architecture

3 LLMs
Encoder-Only
Encoder-Decoder
Decoder-Only
Emergent Abilities

4 Conclusion

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 59 / 66
LLMs Emergent Abilities

Emergent Ability

An ability is emergent if it is not present in smaller models but is present in larger models.21

Model sizes:
GPT-3: 2 · 10²² training FLOPs (13B parameters)
LaMDA: 10²³ training FLOPs (68B parameters)
Gopher: 5 · 10²³ training FLOPs (280B parameters)
PaLM: 2.5 · 10²⁴ training FLOPs (540B parameters)

21
Wei et al. (2022) Emergent abilities of large language models. TMLR.
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 60 / 66
LLMs Emergent Abilities

Emergent Ability: Few-shot Prompting Setting

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 61 / 66
LLMs Emergent Abilities

Emergent Ability: Specialized Prompting or Finetuning

Multi-step reasoning: chain-of-thought prompting, guiding LLMs to produce a sequence of intermediate steps before giving the final answer.
Program execution: computational tasks involving multiple steps, such as adding large numbers or executing computer programs.
Model calibration: measures whether models can predict which questions they will be able to answer correctly.
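A sketch of a chain-of-thought prompt with one worked exemplar in the style of Wei et al.'s chain-of-thought examples (a plain string only; which model it is sent to is left open):

    cot_exemplar = (
        "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
        "A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11. The answer is 11.\n\n"
    )
    question = (
        "Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
        "How many apples do they have?\nA:"
    )
    print(cot_exemplar + question)   # the model is expected to imitate the worked steps before answering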
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 62 / 66
LLMs Emergent Abilities

Emergent Abilities

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 63 / 66
LLMs Emergent Abilities

Emergent Abilities
A dataset of 4,550 questions and solutions from problem sets,
midterm exams, and final exams across all MIT EECS courses.
GPT-3.5 successfully solves a third of the entire MIT curriculum.
GPT-4, with prompt engineering, achieves a perfect solve rate on a test
set excluding questions based on images.

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 64 / 66
LLMs Emergent Abilities

Cost

Training OPT and BLOOM (175B parameters) required 34 days on 992 A100 80GB GPUs.
Facebook's LLaMA (65B parameters) used 2,048 A100-80GB GPUs for a period of approximately 5 months, consuming around 2,638 MWh.22
https://github.com/facebookresearch/llama
PaLM’s (Google’s 540B LLM) training costs around $9M to $17M.
https://blog.heim.xyz/palm-training-cost/

22
Touvron et al. (2023) LLaMA: Open and efficient foundation language models.
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 65 / 66
Conclusion

Conclusion

LLMs have changed the NLP field completely in the last 5 years.
Single task ⇒ multitask
Monolingual processing ⇒ multilingual processing
Small models ⇒ very large models
Small corpora ⇒ colossal corpora
Core technologies of LLMs:
Attention mechanism → the Transformers and their variants
Distributed and parallel processing using GPU/TPU
Large-scale optimization algorithms
Current active research directions:
Model scaling; improved model architectures and training
Data scaling and selection;
Better techniques for, and understanding of, prompting
Understanding emergence

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 66 / 66
