Transformer Basics
▪ A large language model is a type of machine learning model that is trained on a large corpus of text
data to generate outputs for various natural language processing (NLP) tasks, such as text generation,
question answering, and machine translation.
▪ Large language models are typically based on deep learning neural networks such as the Transformer
architecture and are trained on massive amounts of text data, often involving billions of words. Larger
models, such as Google’s BERT, are trained on large datasets drawn from many data sources, which
allows them to produce useful output for many tasks.
[Diagram: text input → language model → text output, or a numeric representation of the text useful for other systems]
USE CASES OF LLMs
Large language models can be applied to a variety of use cases and industries,
including healthcare, retail, tech, and more. The following use cases exist across
all industries (a minimal usage sketch follows the list):
Text generation
Sentiment analysis
Chatbots
Question answering
Code generation
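As a quick, hedged illustration of the text-generation and sentiment-analysis use cases above, the sketch below uses the Hugging Face transformers pipeline API; the specific checkpoints (gpt2 and distilbert-base-uncased-finetuned-sst-2-english) are illustrative choices, not part of the original slides.

# Minimal sketch: applying pretrained language models to two common use cases.
# The model names below are illustrative defaults; any compatible checkpoint works.
from transformers import pipeline

# Text generation: continue a prompt with a small GPT-style model.
generator = pipeline("text-generation", model="gpt2")
print(generator("Large language models can", max_new_tokens=30)[0]["generated_text"])

# Sentiment analysis: classify the polarity of a sentence.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("Transformer models have made NLP workflows much simpler."))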
Transformer Components
TRANSFORMER MODELS
Transformers largely replaced RNN models following the publication of Attention Is All You Need by Google in 2017
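The key building block introduced in Attention Is All You Need is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. A minimal sketch in plain PyTorch follows (a single unbatched head with illustrative variable names, not any particular library's implementation):

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) tensors for a single attention head.
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (seq_len, seq_len) similarity scores
    weights = F.softmax(scores, dim=-1)            # attention weights for each query position
    return weights @ V                             # weighted sum of value vectors

# Toy usage: 5 tokens, an 8-dimensional head.
Q, K, V = torch.randn(5, 8), torch.randn(5, 8), torch.randn(5, 8)
out = scaled_dot_product_attention(Q, K, V)        # shape (5, 8)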
CATEGORIES OF TRANSFORMER MODELS
[Diagram: Transformer architecture, showing inputs, outputs (shifted right), and output probabilities]
GPT
▪ Originally published by OpenAI in 2018, followed by GPT-2 in 2019, and GPT-3 in 2020.
▪ Architecture is also a single stack like BERT, but is a traditional left-to-right language model
▪ Can be used for generating larger blocks of text (e.g. chatbots), but can also be used for question answering
▪ Has been the model family we have focused on most with Megatron
▪ Faster to train than BERT since each iteration gets signal from every token in the sequence (see the sketch below)
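A hedged illustration of the last point: in a left-to-right (causal) language model, every position predicts the next token, so the training loss collects a signal from every token rather than only from the masked subset used in BERT-style pretraining. A minimal sketch, assuming a generic decoder model that maps token IDs to per-position logits:

import torch
import torch.nn.functional as F

def causal_lm_loss(logits, input_ids):
    # logits:    (batch, seq_len, vocab_size) from a left-to-right decoder
    # input_ids: (batch, seq_len) token IDs
    # Position t predicts token t+1, so shift logits and labels by one step.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    # Cross-entropy over every shifted position: all tokens contribute training
    # signal, unlike masked-LM training where only the masked tokens do.
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )

# Toy usage with random logits standing in for a model's output.
batch, seq_len, vocab = 2, 10, 100
logits = torch.randn(batch, seq_len, vocab)
input_ids = torch.randint(0, vocab, (batch, seq_len))
loss = causal_lm_loss(logits, input_ids)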
WHEN DO LARGE LANGUAGE MODELS MAKE SENSE?
                           Traditional NLP approach           Large language models
Requires labelled data     Yes                                No
Parameters                 100s of millions                   Billions to trillions
Desired model capability   Specific (one model per task)      General (model can do many tasks)
Training frequency         Retrain frequently with            Never retrain, or retrain minimally
                           task-specific training data

▪ Zero-shot (or few-shot) learning (see the prompt sketch below)
▪ Painful and impractical to get a large corpus of labelled data
▪ Models can learn new tasks
▪ If you want models with “common sense” that generalize well to new tasks
▪ A single model can serve all use cases
▪ At scale you avoid the cost and complexity of many models, saving cost in data curation, training, and managing deployment
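To make the zero-shot and few-shot point concrete, the sketch below builds example prompts: the "training data" for a new task is either a handful of in-context examples or just an instruction, so no task-specific labelled corpus or retraining is needed. The review texts and tasks are illustrative, not from the original slides.

# Few-shot prompting: a handful of labelled examples are embedded directly in the
# prompt, so a single general-purpose model can be reused for a brand-new task.
few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: The battery lasts all day and the screen is gorgeous.
Sentiment: Positive

Review: It stopped working after a week and support never replied.
Sentiment: Negative

Review: Setup took five minutes and everything just worked.
Sentiment:"""

# Zero-shot prompting: no examples at all, only an instruction.
zero_shot_prompt = (
    "Translate the following sentence to French: "
    "'The model was never fine-tuned for this task.'"
)

# In practice these prompts would be sent to one large, general-purpose model
# (for example through an inference API); no per-task retraining is required.
print(few_shot_prompt)
print(zero_shot_prompt)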
DISTRIBUTED TRAINING
Data, Pipeline and Tensor Parallelism
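The slide names the three parallelism strategies without detail. As a hedged illustration of tensor parallelism, the sketch below simulates Megatron-style column-parallel splitting of a linear layer's weight matrix; here the shards are just chunks on one machine, whereas in real distributed training each shard would live on a different GPU and a collective communication step would combine the partial results.

import torch

def column_parallel_linear(x, W, num_shards=2):
    # Tensor parallelism: split the weight matrix column-wise across shards,
    # compute each partial output independently, then combine the pieces.
    # In Megatron-style training each shard lives on its own GPU and the final
    # concatenation is replaced by a collective communication operation.
    shards = torch.chunk(W, num_shards, dim=1)         # each: (d_in, d_out / num_shards)
    partial_outputs = [x @ shard for shard in shards]  # computed in parallel on separate devices
    return torch.cat(partial_outputs, dim=-1)          # (batch, d_out), identical to x @ W

# Sanity check: the sharded computation matches the unsharded linear layer.
x = torch.randn(4, 16)   # batch of 4, hidden size 16
W = torch.randn(16, 32)  # full weight matrix
assert torch.allclose(column_parallel_linear(x, W), x @ W, atol=1e-5)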
CHALLENGES
Compute-, cost-, and time-intensive workload: significant capital investment and large-scale compute
infrastructure are necessary to develop and maintain LLMs.
Technical expertise: due to their scale, training and deploying large language models are very
difficult.
THANK YOU!