
Introduction to Large Language Models for

Natural Language Processing

Lê Hồng Phương
Data Science Laboratory
VIASM & Vietnam National University, Hanoi
<phuonglh@vnu.edu.vn>

June 30, 2023

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 1 / 66
Content

1 Introduction

2 Transformers
Attention Mechanism
Transformer Architecture

3 LLMs
Encoder-Only
Encoder-Decoder
Decoder-Only
Emergent Abilities

4 Conclusion

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 2 / 66
Introduction

Language Models

A language model (LM) is a probability distribution over sequences of words. Given any sequence of words of length n, an LM assigns a probability P(w1, w2, . . . , wn) to the sequence.
An n-gram LM models word sequences as a Markov process in which we predict the next word given its history:

P(wn | w1, . . . , wn−1)

Text generation from the probabilistic LM: given the prefix “Mary swallowed a green ___”, the model assigns probabilities to candidate next words such as apple, frog, or mountain.
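As a minimal illustration (the probabilities below are invented for this example, not taken from any trained model), generating text from a probabilistic LM amounts to sampling from its conditional next-word distribution:

    import random

    # Toy distribution P(next word | "Mary swallowed a green ...") -- made-up numbers.
    next_word_probs = {"apple": 0.6, "frog": 0.3, "mountain": 0.1}

    def sample_next_word(probs):
        # Sample one word according to its probability.
        words, weights = zip(*probs.items())
        return random.choices(words, weights=weights, k=1)[0]

    print(sample_next_word(next_word_probs))  # e.g. 'apple'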

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 3 / 66
Introduction

Neural Language Models

Neural language models (or continuous-space language models) use continuous representations, or embeddings, of words to make their predictions. These models make use of neural networks.
Each word or token has an associated embedding vector xi ∈ Rd:

Mary swallowed a green apple
 x1     x2     x3  x4   x5

Static word embeddings: Skip-gram and CBOW (2013), GloVe (2014)


Dynamic word embeddings: ELMo (2017), BERT (2018)
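A minimal sketch of the lookup for static embeddings (the vectors below are random stand-ins for pretrained Skip-gram/GloVe vectors; dynamic embeddings would instead be computed from the whole sentence by a neural network):

    import numpy as np

    d = 4  # embedding dimension (tiny, for illustration)
    vocab = ["Mary", "swallowed", "a", "green", "apple"]
    rng = np.random.default_rng(0)

    # In practice these vectors come from a pretrained model (Skip-gram, GloVe, ...).
    embedding_table = {word: rng.normal(size=d) for word in vocab}

    sentence = "Mary swallowed a green apple".split()
    X = np.stack([embedding_table[w] for w in sentence])  # shape (5, d): x1 ... x5
    print(X.shape)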

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 4 / 66
Introduction

Large Language Models

What are LLMs?

ML algorithms that can recognize, predict, and generate human language.
Pretrained on petabyte-scale text datasets, resulting in large models with tens to hundreds of billions of parameters.
LLMs are normally pretrained and then tuned for a specific task.

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 5 / 66
Introduction

Large Language Models: Evolutionary Tree

A nice recent survey presents the evolution of LLMs.1

A simplified view of the stack, from foundations to model families:

Matrix Ops → Attention → Transformer → Encoder-Only LLMs | Encoder-Decoder LLMs | Decoder-Only LLMs

1
Yang et al. (2023) Harnessing the power of LLMs in practice: a survey on ChatGPT and beyond.
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 6 / 66
Introduction

Large Language Models: Evolutionary Tree

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 7 / 66
Introduction

A Timeline of LLMs

2
Zhao et al. (2023) A survey of large language models. https://arxiv.org/abs/2303.18223
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 8 / 66
Introduction

Pretraining – Finetuning Method: 2018–2021

Pretrained Language Model (BERT, T5) → Finetune on task A → Inference on task A

Pretraining – Finetuning:
Typically requires many task-specific examples
One specialized model for each task

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 9 / 66
Introduction

Prompting Method: 2021–2023

Pretrained Language Model (GPT-3) → Inference on task A

Prompting:
Prompts are used to interact with LLMs to accomplish a task.
A prompt is a user-provided input. Prompts can include instructions,
questions, or any other type of input, depending on the intended use
of the model.

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 10 / 66
Introduction

Instruction-Tuning Method: 2022–2023

Pretrained Language Model (FLAN, T0) → Instruction-tune on many tasks B, C, D, . . . → Inference on task A

Instruction-tuning:
Model learns to perform many tasks via natural language instructions
Inference on unseen tasks

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 11 / 66
Transformers Attention Mechanism

Content

1 Introduction

2 Transformers
Attention Mechanism
Transformer Architecture

3 LLMs
Encoder-Only
Encoder-Decoder
Decoder-Only
Emergent Abilities

4 Conclusion

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 12 / 66
Transformers Attention Mechanism

The Attention Mechanism

Currently, the dominant models for nearly all NLP tasks are based on the Transformer architecture. Given any new task in NLP, we usually:
Take a large Transformer-based pretrained model (BERT, GPT-x, T5-x, etc.)
Fine-tune or prompt-engineer the model on the available data for the downstream task
The core idea behind the Transformer model is the attention mechanism, an innovation introduced in 2014.3

3
Bahdanau et al. (2014). Neural machine translation by jointly learning to align and translate.
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 13 / 66
Transformers Attention Mechanism

The Attention Mechanism

In machine translation, attention models often assign high attention weights to cross-lingual synonyms when generating the corresponding words in the target language.
“My feet hurt” → “j’ai mal aux pieds”:
The model might assign high attention weights to the representation of “feet” when generating “pieds”.4

4
Reference: “Dive into DL” – https://d2l.ai/
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 14 / 66
Transformers Attention Mechanism

The Attention Mechanism

In 2017, the Transformer architecture was proposed; it relies on cleverly arranged attention mechanisms to capture all relationships among input and output tokens.5
Denote by D = {(k1, v1), (k2, v2), . . . , (km, vm)} a database of m (key, value) tuples.
Denote by q a query.
The attention over D is defined as

Attention(q, D) := Σ_{i=1}^{m} α(q, ki) vi,

where α(q, ki) ∈ R, ∀i = 1, . . . , m are scalar attention weights.

5
Vaswani et al. (2017). Attention is all you need. NIPS.
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 15 / 66
Transformers Attention Mechanism

Attention Pooling

[Figure: attention pooling — each key ki is compared with the query q to produce a weight α(q, ki), which scales the corresponding value vi; the weighted values are summed.]

Some special cases:

1 α(q, ki) ≥ 0 and Σi α(q, ki) = 1: the weights form a convex combination.
2 α(q, kj) = 1 and all other weights are 0: sparse attention (only one key is attended to).
3 α(q, ki) = 1/m: average pooling.
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 16 / 66
Transformers Attention Mechanism

Attention Pooling By Similarity

Some common similarity kernels α(q, k):

Gaussian: α(q, k) = exp(−‖q − k‖² / (2σ²))
Boxcar: α(q, k) = 1 if ‖q − k‖ ≤ 1, and 0 otherwise
Epanechnikov: α(q, k) = max{0, 1 − ‖q − k‖}
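A minimal NumPy sketch of kernel-based attention pooling: kernel scores are normalized into non-negative weights that sum to one and are then used to average the values (data and dimensions are invented for illustration):

    import numpy as np

    def gaussian_kernel(q, k, sigma=1.0):
        # Unnormalized similarity between the query and one key.
        return np.exp(-np.sum((q - k) ** 2) / (2 * sigma ** 2))

    def attention_pool(q, keys, values, kernel=gaussian_kernel):
        # Attention(q, D) = sum_i alpha(q, k_i) v_i, with normalized kernel weights.
        scores = np.array([kernel(q, k) for k in keys])
        alpha = scores / scores.sum()        # non-negative weights summing to 1
        return alpha @ values                # weighted sum of the values

    rng = np.random.default_rng(0)
    keys, values = rng.normal(size=(5, 2)), rng.normal(size=(5, 3))
    q = rng.normal(size=2)
    print(attention_pool(q, keys, values))   # a pooled 3-dimensional value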

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 17 / 66
Transformers Attention Mechanism

Scaled Dot Product Attention

From (the log of) the Gaussian kernel with σ = 1:

a(q, ki) = −(1/2) ‖q − ki‖²
         = q⊤ ki − (1/2) ‖ki‖² − (1/2) ‖q‖²

The term ‖q‖²/2 is identical for every key, so it vanishes after normalization; this motivates using the (scaled) dot product as the attention score.
The scaled dot-product attention used in the Transformer:

a(q, ki) = q⊤ ki / √d,  where q, ki ∈ Rd
α(q, ki) = exp(a(q, ki)) / Σ_{j=1}^{m} exp(a(q, kj))

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 18 / 66
Transformers Attention Mechanism

Scaled Dot Product Attention

We often compute attention scores in mini-batches for efficiency. For n queries and m key-value pairs, with queries and keys in Rd and values in Rv, we use batch matrix multiplication:

softmax( Q K⊤ / √d ) V ∈ Rn×v,

for Q ∈ Rn×d, K ∈ Rm×d, and V ∈ Rm×v.
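A minimal NumPy sketch of this batch formula (row-wise softmax over the m keys; random data, shapes as in the slide):

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # softmax(Q K^T / sqrt(d)) V, softmax taken row-wise over the keys.
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                    # (n, m)
        scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
        return weights @ V                               # (n, v)

    rng = np.random.default_rng(0)
    n, m, d, v = 2, 5, 4, 3
    Q, K, V = rng.normal(size=(n, d)), rng.normal(size=(m, d)), rng.normal(size=(m, v))
    print(scaled_dot_product_attention(Q, K, V).shape)   # (2, 3)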

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 19 / 66
Transformers Attention Mechanism

Scaled Dot Product Attention

When queries and keys are vectors of different dimensionalities, we can either
use a matrix to address the mismatch via q⊤ M k, or
use additive attention, for q ∈ Rq, k ∈ Rk:

a(q, k) = wv⊤ tanh(Wq q + Wk k) ∈ R,

where wv ∈ Rh, Wq ∈ Rh×q, Wk ∈ Rh×k are learnable parameters.
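A minimal sketch of the additive score (random matrices stand in for learned parameters; the query and key deliberately have different dimensions):

    import numpy as np

    def additive_score(q, k, w_v, W_q, W_k):
        # a(q, k) = w_v^T tanh(W_q q + W_k k); q and k may have different sizes.
        return w_v @ np.tanh(W_q @ q + W_k @ k)

    rng = np.random.default_rng(0)
    q_dim, k_dim, h = 4, 6, 8
    q, k = rng.normal(size=q_dim), rng.normal(size=k_dim)
    w_v, W_q, W_k = rng.normal(size=h), rng.normal(size=(h, q_dim)), rng.normal(size=(h, k_dim))
    print(additive_score(q, k, w_v, W_q, W_k))   # a scalar attention score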

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 20 / 66
Transformers Attention Mechanism

Multi-Head Attention

We want our model to combine knowledge from different behaviors of the same attention mechanism, such as capturing dependencies of various ranges within a sequence.
It is beneficial to allow our attention mechanism to jointly use different representation subspaces of queries, keys, and values.
We concatenate multiple attention pooling outputs: multi-head attention.
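A minimal sketch of multi-head self-attention: project the inputs, split into heads, run scaled dot-product attention per head, and concatenate (random matrices stand in for learned projections; biases and masking are omitted):

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
        # Project the inputs, split into heads, attend per head, concatenate, project.
        n, d = X.shape
        d_head = d // num_heads
        Q, K, V = X @ W_q, X @ W_k, X @ W_v                      # (n, d) each
        heads = []
        for h in range(num_heads):
            s = slice(h * d_head, (h + 1) * d_head)
            scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)        # (n, n)
            heads.append(softmax(scores) @ V[:, s])               # (n, d_head)
        return np.concatenate(heads, axis=-1) @ W_o               # (n, d)

    rng = np.random.default_rng(0)
    n, d, num_heads = 6, 8, 2
    X = rng.normal(size=(n, d))
    W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) for _ in range(4))
    print(multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads).shape)  # (6, 8)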

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 21 / 66
Transformers Attention Mechanism

Self-Attention

In sequence processing, each token has its own query, key, and value. For each token, we can compute a representation by building the appropriate weighted sum over the other tokens.
Each token attends to every other token ⇒ self-attention models.

[Figure: computing outputs s1 . . . s4 from inputs x1 . . . x4 — left: Recurrent Neural Network (sequential); right: Self-Attention (every output attends to every input).]

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 22 / 66
Transformers Attention Mechanism

Self-Attention

Computational complexity on a sequence of n tokens:

RNN: at each step, a d × d weight matrix multiplies a d-dimensional hidden state ⇒ O(nd²); the O(n) sequential operations cannot be parallelized.
Self-attention: an n × d matrix is multiplied by a d × n matrix, then the resulting n × n matrix is multiplied by an n × d matrix ⇒ O(n²d); computation can be parallelized with O(1) sequential operations.

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 23 / 66
Transformers Transformer Architecture

Content

1 Introduction

2 Transformers
Attention Mechanism
Transformer Architecture

3 LLMs
Encoder-Only
Encoder-Decoder
Decoder-Only
Emergent Abilities

4 Conclusion

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 24 / 66
Transformers Transformer Architecture

The Transformer Architecture

The Transformer is composed of an encoder and a decoder.
The encoder is a stack of multiple identical layers, where each layer has two sublayers:
The first is multi-head self-attention pooling.
The second is a positionwise feed-forward network.
There is a residual connection around each sublayer, inspired by ResNet, followed by layer normalization.
The encoder outputs a d-dimensional vector representation for each position of the input sequence.
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 25 / 66
Transformers Transformer Architecture

The Transformer Architecture

The decoder is also a stack of multiple identical layers with residual connections and layer normalizations.
Between the two sublayers, the decoder has a third sublayer, called the encoder-decoder attention.
In the encoder-decoder attention, queries come from the outputs of the previous decoder layer, while the keys and values come from the encoder outputs.
In the decoder self-attention, queries, keys, and values all come from the outputs of the previous decoder layer.
Each position in the decoder may only attend to positions up to and including itself.
This masked attention preserves the auto-regressive property, ensuring that each prediction depends only on the output tokens that have already been generated.
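A minimal sketch of the causal (masked) self-attention pattern: scores for future positions are set to −∞ before the softmax, so their weights become zero (NumPy only, random data):

    import numpy as np

    def masked_self_attention(X):
        # Scaled dot-product self-attention with a causal mask (queries = keys = values = X).
        n, d = X.shape
        scores = X @ X.T / np.sqrt(d)                          # (n, n)
        mask = np.triu(np.ones((n, n), dtype=bool), k=1)       # True above the diagonal = future positions
        scores = np.where(mask, -np.inf, scores)               # forbid attending to the future
        scores -= scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)                               # exp(-inf) = 0 for masked positions
        weights /= weights.sum(axis=-1, keepdims=True)         # row i covers only positions <= i
        return weights @ X

    rng = np.random.default_rng(0)
    print(masked_self_attention(rng.normal(size=(4, 3))).shape)  # (4, 3)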
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 26 / 66
Transformers Transformer Architecture

Encoder Self-Attention Weights

Two layers of multi-head attention weights are presented row by row.6


6
https://d2l.ai/chapter_attention-mechanisms-and-transformers/transformer.html
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 27 / 66
Transformers Transformer Architecture

Decoder Self-Attention Weights

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 28 / 66
Transformers Transformer Architecture

Encoder-Decoder Attention Weights

i’m home . ⇒ [je, suis, chez , moi, .]

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 29 / 66
LLMs

Large-Scale Pretraining with Transformers: Three Modes

Transformers have been extensively pretrained with a wealth of text to learn good representations.
Originally proposed for MT, the Transformer architecture consists of an encoder for representing input sequences and a decoder for generating target sequences.
Transformers can be used in three different modes:
1 encoder-only
2 encoder-decoder
3 decoder-only

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 30 / 66
LLMs Encoder-Only

Content

1 Introduction

2 Transformers
Attention Mechanism
Transformer Architecture

3 LLMs
Encoder-Only
Encoder-Decoder
Decoder-Only
Emergent Abilities

4 Conclusion

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 31 / 66
LLMs Encoder-Only

Encoder-Only Transformer

A Transformer encoder consists of self-attention layers, in which all input tokens attend to each other.
A sequence of input tokens is converted by the encoder into the same number of representations.
These representations can then be further projected into outputs (e.g., for classification).
A prominent encoder-only Transformer pretrained on text is BERT.7

7
Devlin et al. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding.
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 32 / 66
LLMs Encoder-Only

Encoder-Only Transformer: BERT Pretraining

BERT is pretrained on text sequences using masked language modeling: input text with randomly masked tokens is fed into a Transformer encoder to predict the masked tokens.
There is no constraint on the attention pattern of Transformer encoders: the prediction of a masked token such as “love” depends on input tokens both before and after it.
Large-scale text data can be used for pretraining BERT.
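A quick illustration of masked-token prediction, assuming the Hugging Face transformers package and the bert-base-uncased checkpoint are available:

    from transformers import pipeline

    # Masked language modeling: the model predicts the token hidden behind [MASK].
    unmasker = pipeline("fill-mask", model="bert-base-uncased")
    for candidate in unmasker("I [MASK] this movie very much."):
        print(candidate["token_str"], round(candidate["score"], 3))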

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 33 / 66
LLMs Encoder-Only

Encoder-Only Transformer: BERT Fine-Tuning

The pretrained BERT can be fine-tuned for downstream encoding tasks involving single texts or text pairs.
During fine-tuning, additional layers with randomly initialized parameters can be added on top of BERT; these parameters and the pretrained BERT parameters are updated to fit the training data of the downstream task.
The general language representations learned by the 340M-parameter BERT from 250B training tokens advanced the SOTA for many NLP tasks.
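A minimal fine-tuning setup sketch, again assuming the Hugging Face transformers package; data loading, batching, and the optimization loop are omitted:

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Pretrained BERT body + a randomly initialized classification head (2 labels).
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    batch = tokenizer(["a great movie", "a boring movie"], padding=True, return_tensors="pt")
    labels = torch.tensor([1, 0])
    outputs = model(**batch, labels=labels)   # returns loss and logits
    outputs.loss.backward()                   # gradients flow into the head and the BERT body
    print(outputs.logits.shape)               # torch.Size([2, 2])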

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 34 / 66
LLMs Encoder-Only

Derivatives of BERT
Other derivatives of BERT improved model architectures or pretraining objectives:
1 RoBERTa (2019)
change key hyperparameters
remove the next-sentence pretraining objective of BERT
train with much larger mini-batches and learning rates
2 DeBERTa (2021)
use a disentangled attention mechanism: each word is represented by two vectors that encode its content and its position; attention weights among words are computed using disentangled matrices on their contents and relative positions
use an enhanced mask decoder in place of the output softmax layer to predict the masked tokens
3 Others
ALBERT (2019): enforces parameter sharing
SpanBERT (2020): predicts spans of text
ELECTRA (2020): replaced-token detection
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 35 / 66
LLMs Encoder-Only

Multilingual BERT

1 mBERT (2019)
a multilingual version of BERT trained on 104 languages from the
Wikipedia corpus.
follows the BERT recipe with the same training architecture and
objective
110M to 340M parameters
2 XLM-R (2020)
a multilingual language model, trained on 2.5TB of filtered
CommonCrawl data of 100 different languages
performs particularly well on low-resource languages
outperforms mBERT on a variety of cross-lingual benchmarks.
3.5B to 10.7B parameters
3 BERT pretrained models for Vietnamese:
ViBERT (10/2020), FPT
PhoBERT (11/2020), VinAI

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 36 / 66
LLMs Encoder-Decoder

Content

1 Introduction

2 Transformers
Attention Mechanism
Transformer Architecture

3 LLMs
Encoder-Only
Encoder-Decoder
Decoder-Only
Emergent Abilities

4 Conclusion

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 37 / 66
LLMs Encoder-Decoder

Encoder-Decoder Transformer

The decoder autoregressively predicts the target sequence of arbitrary length, token by token, conditioned on both the encoder output and the decoder output so far:
the encoder-decoder cross-attention allows target tokens to attend to all input tokens
the masked multi-head attention of the decoder allows any target token to attend only to past and present tokens in the target sequence
Two well-known encoder-decoder Transformers; both reconstruct the original text in their pretraining objectives:
1 BART (2019): emphasizes noising the input (masking, deletion, permutation, rotation)
2 T5/mT5 (2020): emphasizes multitask unification

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 38 / 66
LLMs Encoder-Decoder

Encoder-Decoder Transformer: T5

Every task is cast as feeding the model text as input and training it to generate some target text.8
This allows the same model, loss function, and hyperparameters to be used across different tasks.
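A quick text-to-text usage sketch, assuming the Hugging Face transformers and sentencepiece packages and the t5-small checkpoint; the task prefix follows T5's convention of stating the task in the input text:

    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # Same model, same interface: the task is specified entirely in the input text.
    inputs = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))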
8
Raffel et al. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR.
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 39 / 66
LLMs Encoder-Decoder

Encoder-Decoder Transformer: T5

T5 can be fine-tuned for novel tasks.9
The 11B-parameter T5 achieved SOTA on the GLUE, SuperGLUE, SQuAD, and CNN/Daily Mail benchmarks.
9
https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 40 / 66
LLMs Encoder-Decoder

Encoder-Decoder Transformer: T5

Switch Transformer (2022) is based on T5.10

simplifies the Mixture-of-Experts routing, with reduced communication and computational costs
advances the scale of LLMs by pre-training models with up to a trillion parameters, achieving a 4x speedup over the T5-XXL model
10
Fedus et al. (2022). Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. JMLR.
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 41 / 66
LLMs Decoder-Only

Content

1 Introduction

2 Transformers
Attention Mechanism
Transformer Architecture

3 LLMs
Encoder-Only
Encoder-Decoder
Decoder-Only
Emergent Abilities

4 Conclusion

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 42 / 66
LLMs Decoder-Only

Decoder-Only Transformer: GPT-2

Remove the entire encoder and the encoder-decoder cross-attention sublayer from the original encoder-decoder architecture.
GPT and GPT-2 choose a Transformer decoder as their backbone.

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 43 / 66
LLMs Decoder-Only

Decoder-Only Transformer: GPT-2

GPT (2018)11: 100M parameters, needs to be fine-tuned for downstream tasks.
GPT-2 (11/2019)12: 1.5B parameters, performed well on multiple other tasks without updating the parameters or architecture.13

11
Radford et al. (2018). Improving language understanding by generative pre-training. OpenAI.
12
Radford et al. (2019). Language models are unsupervised multitask learners. OpenAI Blog.
13
https://openai.com/research/better-language-models
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 44 / 66
LLMs Decoder-Only

Decoder-Only Transformer: GPT-3

A pretrained LM may generate the task output as a sequence, without any parameter update, conditioned on an input sequence containing
1 the task description
2 task-specific input-output examples
3 a prompt (task input)
This learning paradigm is called in-context learning.14
GPT-3 (2020), 175B parameters:
uses the same Transformer decoder architecture as GPT-2, except that attention patterns are sparser at alternating layers
pretrained on 300B tokens (40TB of text data)
performs better with larger model size, with few-shot performance increasing most rapidly

14
Brown et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems.
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 45 / 66
LLMs Decoder-Only

GPT-3 Zero-shot

Input: “Translate English to French:” [task description]  “i’m home →” [prompt]
Transformer Decoder (no parameter update)
Output: je suis malade

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 46 / 66
LLMs Decoder-Only

GPT-3 One-shot

Input: “Translate English to French:” [task description]  “go → va |” [1 exemplar]  “i’m home” [prompt]
Transformer Decoder (no parameter update)
Output: je suis malade

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 47 / 66
LLMs Decoder-Only

GPT-3 Few-shot

Input: “Translate English to French:” [task description]  “go → va ; i lost → j’ai perdu |” [2 exemplars]  “i’m home” [prompt]
Transformer Decoder (no parameter update)
Output: je suis malade
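A sketch of how such prompts can be assembled as plain strings before being sent to a decoder-only model (the exemplars mirror the slides; no model call is made here):

    def build_prompt(task_description, exemplars, task_input):
        # Concatenate task description, optional exemplars, and the task input.
        lines = [task_description]
        lines += [f"{src} → {tgt}" for src, tgt in exemplars]   # zero exemplars = zero-shot
        lines.append(f"{task_input} →")
        return "\n".join(lines)

    few_shot = build_prompt(
        "Translate English to French:",
        [("go", "va"), ("i lost", "j'ai perdu")],   # 2 exemplars = few-shot
        "i'm home",
    )
    print(few_shot)   # this string would be sent to the decoder-only model as-is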

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 48 / 66
LLMs Decoder-Only

Recall: Instruction-Tuning Method

Pretrained Language Model (FLAN, T0) → Instruction-tune on many tasks B, C, D, . . . → Inference on task A

Instruction-tuning:
Model learns to perform many tasks via natural language instructions
Inference on unseen tasks
Zero-shot learning

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 49 / 66
LLMs Decoder-Only

FLAN Instruction-Tuning

Leverage the intuition that NLP tasks can be described via natural
language instructions, such as
Is the sentiment of this movie review positive or negative?
Translate “how are you” into Chinese.

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 50 / 66
LLMs Decoder-Only

FLAN Instruction-Tuning

Take a pretrained decoder-only model (LaMDA-PT, 137B parameters) and perform instruction tuning: finetuning on 60+ NLP datasets.15

15
Wei et al. (2022) Finetuned language models are zero-shot learners. ICLR.
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 51 / 66
LLMs Decoder-Only

T0 Instruction-Tuning

T0 is an encoder-decoder model.16
16
Sanh et al. (2022) Multitask prompted training enables zero-shot task generalization. ICLR.
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 52 / 66
LLMs Decoder-Only

T0 Instruction-Tuning
T0 (2022): 11B parameters, trained on a multitask mixture of NLP datasets.
Each dataset is associated with multiple prompt templates used to format exemplars. T0 is as good as FLAN despite being 10x smaller.

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 53 / 66
LLMs Decoder-Only

FLAN Instruction-Tuning – Chain of Thoughts


FLAN-PaLM 540B instruction-finetuned on 1.8K tasks outperforms
PaLM 540B by a large margin (+9.4% on average).
FLAN-PaLM 540B achieves SOTA performance on several
benchmarks, such as 75.2% on five-shot MMLU.

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 54 / 66
LLMs Decoder-Only

FLAN Instruction-Tuning – Chain of Thoughts

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 55 / 66
LLMs Decoder-Only

FLAN Instruction-Tuning – Chain of Thoughts

17
Chung et al. (2022) Scaling instruction-finetuned language models.
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 56 / 66
LLMs Decoder-Only

Self-Instruct
Self-Instruct is a framework for improving the instruction-following capabilities of pretrained LLMs by bootstrapping off their own generations.18

18
Wang et al. (2023) Aligning language models with self-generated instructions. ACL
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 57 / 66
LLMs Decoder-Only

Self-Instruct

SuperNI is a benchmark consisting of 119 tasks with 100 instances per task.
Self-Instruct data is freely available (52K instructions).
Finetuning GPT-3 on this data leads to a 33% absolute improvement over the original GPT-3.

SuperNI19 , Self-Instruct20
19
Wang et al. (2022) Super-NaturalInstruction: Generalization via declarative instructions on 1600+ tasks. EMNLP
20
https://github.com/yizhongw/self-instruct
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 58 / 66
LLMs Emergent Abilities

Content

1 Introduction

2 Transformers
Attention Mechanism
Transformer Architecture

3 LLMs
Encoder-Only
Encoder-Decoder
Decoder-Only
Emergent Abilities

4 Conclusion

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 59 / 66
LLMs Emergent Abilities

Emergent Ability

An ability is emergent if it is not present in smaller models but is present in larger models.21

Model sizes:
GPT-3: 2 · 10²² training FLOPs (13B parameters)
LaMDA: 10²³ training FLOPs (68B parameters)
Gopher: 5 · 10²³ training FLOPs (280B parameters)
PaLM: 2.5 · 10²⁴ training FLOPs (540B parameters)

21
Wei et al. (2022) Emergent abilities of large language models. TMLR.
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 60 / 66
LLMs Emergent Abilities

Emergent Ability: Few-shot Prompting Setting

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 61 / 66
LLMs Emergent Abilities

Emergent Ability: Specialized Prompting or Finetuning

Multi-step reasoning: chain-of-thought prompting, guiding LLMs to produce a sequence of intermediate steps before giving the final answer.
Program execution: computational tasks involving multiple steps, such as adding large numbers or executing computer programs.
Model calibration: measures whether models can predict which questions they will be able to answer correctly.
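A sketch of a chain-of-thought prompt with one worked exemplar in the style of Wei et al.'s chain-of-thought examples (a plain string only; which model it is sent to is left open):

    cot_exemplar = (
        "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
        "A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11. The answer is 11.\n\n"
    )
    question = (
        "Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
        "How many apples do they have?\nA:"
    )
    print(cot_exemplar + question)   # the model is expected to imitate the worked steps before answering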
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 62 / 66
LLMs Emergent Abilities

Emergent Abilities

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 63 / 66
LLMs Emergent Abilities

Emergent Abilities
A dataset of 4,550 questions and solutions from problem sets,
midterm exams, and final exams across all MIT EECS courses.
GPT-3.5 successfully solves a third of the entire MIT curriculum.
GPT-4, with prompt engineering, achieves a perfect solve rate on a test
set excluding questions based on images.

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 64 / 66
LLMs Emergent Abilities

Cost

Training OPT and BLOOM (175B parameters) required 34 days on 992 A100 80GB GPUs.
Facebook's LLaMA (65B parameters) used 2,048 A100-80GB GPUs for a period of approximately 5 months, consuming around 2,638 MWh.22
https://github.com/facebookresearch/llama
PaLM’s (Google’s 540B LLM) training costs around $9M to $17M.
https://blog.heim.xyz/palm-training-cost/

22
Touvron et al. (2023) LLaMA: Open and efficient foundation language models.
Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 65 / 66
Conclusion

Conclusion

LLMs have changed the NLP field completely in the last 5 years.
Single task ⇒ multitask
Monolingual processing ⇒ multilingual processing
Small models ⇒ very large models
Small corpora ⇒ colossal corpora
Core technologies of LLMs:
Attention mechanism → the Transformers and their variants
Distributed and parallel processing using GPU/TPU
Large-scale optimization algorithms
Current active research directions:
Model scaling; improved model architectures and training
Data scaling and selection;
Better techniques for, and understanding of, prompting
Understanding emergence

Lê Hồng Phương, Data Science Laboratory Introduction to LLMs for NLP June 30, 2023 66 / 66
