Introduction to LLMs for NLP
Lê Hồng Phương
Data Science Laboratory
VIASM & Vietnam National University, Hanoi
<phuonglh@vnu.edu.vn>
June 30, 2023
Content
1 Introduction
2 Transformers
Attention Mechanism
Transformer Architecture
3 LLMs
Encoder-Only
Encoder-Decoder
Decoder-Only
Emergent Abilities
4 Conclusion
Introduction
Language Models
(Figure: a language model predicts likely next words, e.g. "mountain".)
(Figure: the building blocks of modern LLMs, from matrix operations to attention to the Transformer, branching into encoder-only, encoder-decoder, and decoder-only LLMs [1].)
[1] Yang et al. (2023). Harnessing the power of LLMs in practice: a survey on ChatGPT and beyond.
A Timeline of LLMs
[2] Zhao et al. (2023). A survey of large language models. https://arxiv.org/abs/2303.18223
(Figure: pretraining then finetuning: a pretrained language model (BERT, T5) is finetuned on task A, then used for inference on task A.)
Pretraining – Finetuning:
Typically requires many task-specific examples
One specialized model for each task
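As a concrete illustration of this paradigm, here is a minimal fine-tuning sketch using the Hugging Face transformers library; the model name, the toy two-example dataset, and the hyperparameters are placeholders chosen for the example, not recommendations.

import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# A toy labeled dataset standing in for "task A"; real tasks need many such examples.
texts, labels = ["a wonderful film", "a waste of time"], [1, 0]
enc = tokenizer(texts, truncation=True, padding=True)

class TinyDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in enc.items()}
        item["labels"] = torch.tensor(labels[i])
        return item

args = TrainingArguments(output_dir="bert-task-a", num_train_epochs=1,
                         per_device_train_batch_size=2)
trainer = Trainer(model=model, args=args, train_dataset=TinyDataset())
trainer.train()   # updates all parameters -> one specialized model per downstream task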
(Figure: prompting: a pretrained language model (GPT-3) is used directly for inference on task A.)
Prompting:
Prompts are used to interact with LLMs to accomplish a task.
A prompt is a user-provided input. Prompts can include instructions,
questions, or any other type of input, depending on the intended use
of the model.
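For illustration, here are two prompts of the kind one might send verbatim to an LLM; the wording is invented for this example.

# An instruction-style prompt
instruction_prompt = "Summarize the following review in one sentence:\n<review text>"

# A few-shot prompt: demonstrations followed by an unfinished example
few_shot_prompt = (
    "Classify the sentiment of each review.\n"
    "Review: The plot was dull. -> negative\n"
    "Review: A delightful surprise. -> positive\n"
    "Review: I could not stop smiling. -> "
)
# The model is expected to continue the text with "positive".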
(Figure: instruction-tuning: a pretrained language model (FLAN, T0) is instruction-tuned on many tasks B, C, D, …, then used for inference on task A.)
Instruction-tuning:
Model learns to perform many tasks via natural language instructions
Inference on unseen tasks
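A hypothetical sketch of how one task instance can be rendered as an (instruction, input, output) training example; the template wording is illustrative, not the exact format used by FLAN or T0.

def format_example(instruction, input_text, output):
    """Render one (instruction, input, output) triple as a prompt/target pair."""
    prompt = f"Instruction: {instruction}\n"
    if input_text:
        prompt += f"Input: {input_text}\n"
    return prompt + "Output:", output

prompt, target = format_example(
    "Translate the sentence into French.", "How are you?", "Comment allez-vous ?")
# The model is trained to generate `target` given `prompt`, across many such tasks.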
Transformers: Attention Mechanism
Currently, the dominant models for nearly all NLP tasks are based on the
Transformer architecture. Given any new NLP task, we usually:
Take a large Transformer-based pretrained model (BERT, GPT-x, T5-x, etc.)
Fine-tune or prompt-engineer the model on the available data for the downstream task
The core idea behind the Transformer model is the attention mechanism,
an innovation introduced in 2014 [3].
[3] Bahdanau et al. (2014). Neural machine translation by jointly learning to align and translate.
[4] Reference: “Dive into DL” – https://d2l.ai/
[5] Vaswani et al. (2017). Attention is all you need. NIPS.
Attention Pooling
(Figure: attention pooling: a query $\mathbf{q}$ is scored against keys $\mathbf{k}_1, \dots, \mathbf{k}_m$, and the weights $\alpha(\mathbf{q}, \mathbf{k}_i)$ combine the values $\mathbf{v}_1, \dots, \mathbf{v}_m$ into the output $\sum_{i=1}^{m} \alpha(\mathbf{q}, \mathbf{k}_i)\,\mathbf{v}_i$.)
$$a(\mathbf{q}, \mathbf{k}_i) = \frac{\mathbf{q}^\top \mathbf{k}_i}{\sqrt{d}}, \qquad \mathbf{q}, \mathbf{k}_i \in \mathbb{R}^d$$
$$\alpha(\mathbf{q}, \mathbf{k}_i) = \frac{\exp(a(\mathbf{q}, \mathbf{k}_i))}{\sum_{j=1}^{m} \exp(a(\mathbf{q}, \mathbf{k}_j))}$$
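The formulas above translate directly into a few lines of NumPy; this is a minimal sketch of scaled dot-product attention, not an optimized implementation.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d) queries, K: (m, d) keys, V: (m, v) values.
    Returns the (n, v) attention-pooled values and the (n, m) weights alpha."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # a(q, k_i) for every query-key pair
    scores -= scores.max(axis=-1, keepdims=True)     # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys -> alpha(q, k_i)
    return weights @ V, weights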
Multi-Head Attention
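A minimal sketch of the multi-head idea, reusing scaled_dot_product_attention and NumPy from the sketch above: the model dimension is split into h heads, each head attends on its own slice, and the head outputs are concatenated. The learned per-head and output projections of a full implementation are omitted here.

def multi_head_attention(Q, K, V, num_heads):
    """Q: (n, d_model), K, V: (m, d_model), with d_model divisible by num_heads."""
    d_model = Q.shape[-1]
    d_head = d_model // num_heads
    outputs = []
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)     # this head's sub-dimensions
        out, _ = scaled_dot_product_attention(Q[:, sl], K[:, sl], V[:, sl])
        outputs.append(out)
    return np.concatenate(outputs, axis=-1)          # (n, d_model)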
Self-Attention
In sequence processing, each token has its own query, key, and value. For
each token, we compute a representation as the appropriate weighted sum
over the other tokens.
Each token attends to every other token ⇒ self-attention models.
(Figure: each output representation $s_i$ is computed from all input tokens $x_1, \dots, x_4$.)
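Continuing the sketch above, self-attention is obtained by deriving queries, keys, and values from the same token sequence; the projection matrices here are randomly initialized purely for illustration.

rng = np.random.default_rng(0)
n, d_model = 4, 8                        # 4 tokens x1..x4, embedding size 8
X = rng.normal(size=(n, d_model))        # token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

S, A = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
# S[i] is the contextualized representation s_i of token x_i;
# A[i] holds token i's attention weights over all tokens (including itself).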
Transformers: Transformer Architecture
LLMs
LLMs: Encoder-Only
Encoder-Only Transformer
[7] Devlin et al. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding.
Derivatives of BERT
Other derivatives of BERT improved the model architecture or the pretraining objectives:
1 RoBERTa (2019)
Multilingual BERT:
1 mBERT (2019)
a multilingual version of BERT trained on 104 languages from the Wikipedia corpus
follows the BERT recipe, with the same architecture and training objective
110M to 340M parameters
2 XLM-R (2020)
a multilingual language model trained on 2.5 TB of filtered CommonCrawl data covering 100 languages
performs particularly well on low-resource languages
outperforms mBERT on a variety of cross-lingual benchmarks
3.5B to 10.7B parameters
3 BERT-style pretrained models for Vietnamese:
ViBERT (10/2020), FPT
PhoBERT (11/2020), VinAI
LLMs: Encoder-Decoder
Encoder-Decoder Transformer
Encoder-Decoder Transformer: T5
Every task is cast as feeding the model text as input and training it to
generate some target text [8].
This allows the same model, loss function, and hyperparameters to be used
across different tasks.
[8] Raffel et al. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR.
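To make the text-to-text idea concrete, here are input/target pairs in the spirit of T5's task prefixes; the prefix strings and examples are illustrative, loosely based on the paper's figures rather than copied from them.

# Every task becomes "text in, text out" for the same encoder-decoder model.
examples = [
    ("translate English to German: That is good.", "Das ist gut."),
    ("summarize: storms caused widespread flooding in the region on Tuesday ...",
     "storms flooded the region on tuesday."),
    ("cola sentence: The course is jumping well.", "not acceptable"),
]
# Training minimizes the same sequence-to-sequence loss on every (input, target) pair.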
Encoder-Decoder Transformer: T5
(Figures: T5 pretraining and fine-tuning.)
LLMs: Decoder-Only
[11] Radford et al. (2018). Improving language understanding by generative pre-training. OpenAI.
[12] Radford et al. (2019). Language models are unsupervised multitask learners. OpenAI Blog.
[13] https://openai.com/research/better-language-models
[14] Brown et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems.
GPT-3 Zero-shot
(Figure: the Transformer decoder is prompted with a task description only and completes “je suis malade”; no parameter update.)
GPT-3 One-shot
(Figure: the Transformer decoder is prompted with a task description plus one demonstration and completes “je suis malade”; no parameter update.)
GPT-3 Few-shot
(Figure: the Transformer decoder is prompted with a task description plus a few demonstrations and completes “je suis malade”; no parameter update.)
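The three settings differ only in how many demonstrations appear in the prompt. A sketch of the corresponding prompts, written in the style of the GPT-3 paper's translation example (the demonstration pairs are illustrative); in every case the model is expected to complete the text with "je suis malade", with no parameter update.

zero_shot = ("Translate English to French:\n"
             "I am sick =>")

one_shot = ("Translate English to French:\n"
            "cheese => fromage\n"
            "I am sick =>")

few_shot = ("Translate English to French:\n"
            "sea otter => loutre de mer\n"
            "cheese => fromage\n"
            "plush giraffe => girafe en peluche\n"
            "I am sick =>")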
(Figure: instruction-tuning: a pretrained language model (FLAN, T0) is instruction-tuned on many tasks B, C, D, …, then used for inference on task A.)
Instruction-tuning:
Model learns to perform many tasks via natural language instructions
Inference on unseen tasks
Zero-shot learning
FLAN Instruction-Tuning
Leverage the intuition that NLP tasks can be described via natural
language instructions, such as:
Is the sentiment of this movie review positive or negative?
Translate “how are you” into Chinese.
[15] Wei et al. (2022). Finetuned language models are zero-shot learners. ICLR.
T0 Instruction-Tuning
T0 is an encoder-decoder model [16].
[16] Sanh et al. (2022). Multitask prompted training enables zero-shot task generalization. ICLR.
T0 (2022): 11B parameters, trained on a multitask mixture of NLP datasets.
Each dataset is associated with multiple prompt templates used to format
exemplars. T0 is as good as FLAN despite being 10x smaller.
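A minimal sketch of the multiple-templates idea; the templates below are invented for illustration and are not the actual ones from the T0 prompt collection.

# Two hypothetical templates for the same sentiment-classification dataset.
templates = [
    "Review: {text}\nIs this review positive or negative?",
    "{text}\nDid the reviewer like the movie? Answer yes or no.",
]

example = {"text": "A moving, beautifully shot film.", "label": "positive"}
prompts = [t.format(text=example["text"]) for t in templates]
# Each dataset example thus yields several differently phrased training prompts.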
[17] Chung et al. (2022). Scaling instruction-finetuned language models.
Self-Instruct
Self-Instruct is a framework for improving the instruction-following
capabilities of pretrained LLMs by bootstrapping off their own
generations [18].
[18] Wang et al. (2023). Self-Instruct: Aligning language models with self-generated instructions. ACL.
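A highly simplified sketch of the bootstrapping loop, assuming a hypothetical complete(prompt) function that calls the pretrained LLM; the real pipeline additionally classifies task types, generates input/output instances, and filters candidates with a ROUGE-based overlap check.

import random

def self_instruct(seed_instructions, complete, rounds=3, samples_per_round=8):
    """Grow a pool of instructions by prompting the model with samples from the pool."""
    pool = list(seed_instructions)
    for _ in range(rounds):
        demos = random.sample(pool, k=min(6, len(pool)))
        prompt = ("Come up with a new task instruction.\n"
                  + "\n".join(f"Task: {t}" for t in demos)
                  + "\nTask:")
        for _ in range(samples_per_round):
            candidate = complete(prompt).strip()
            if candidate and candidate not in pool:   # crude novelty filter
                pool.append(candidate)
    return pool   # the grown pool is then used to finetune the same model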
SuperNI is a benchmark consisting of 119 tasks with 100 instances in each task.
The Self-Instruct data is freely available (52K instructions).
Finetuning GPT-3 on this data leads to a 33% absolute improvement over the original GPT-3.
SuperNI [19], Self-Instruct [20]
[19] Wang et al. (2022). Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. EMNLP.
[20] https://github.com/yizhongw/self-instruct
LLMs: Emergent Abilities
Emergent Ability
Model sizes:
GPT-3: 2 · 10^22 training FLOPs (13B parameters)
LaMDA: 10^23 training FLOPs (68B parameters)
Gopher: 5 · 10^23 training FLOPs (280B parameters)
PaLM: 2.5 · 10^24 training FLOPs (540B parameters)
[21] Wei et al. (2022). Emergent abilities of large language models. TMLR.
Emergent Abilities
A dataset of 4,550 questions and solutions from problem sets,
midterm exams, and final exams across all MIT EECS courses.
GPT-3.5 successfully solves a third of the entire MIT curriculum.
GPT-4, with prompt engineering, achieves a perfect solve rate on a test
set excluding questions based on images.
Cost
[22] Touvron et al. (2023). LLaMA: Open and efficient foundation language models.
Conclusion
LLMs have completely changed the NLP field in the last 5 years:
Single task ⇒ multitask
Monolingual processing ⇒ multilingual processing
Small models ⇒ very large models
Small corpora ⇒ colossal corpora
Core technologies of LLMs:
Attention mechanism → the Transformer and its variants
Distributed and parallel processing using GPUs/TPUs
Large-scale optimization algorithms
Current active research directions:
Model scaling; improved model architectures and training
Data scaling and selection
Better techniques for, and understanding of, prompting
Understanding emergence