
Transformers & LLM Basics

LARGE LANGUAGE MODELS (LLMs)

▪ A large language model is a type of machine learning model that is trained on a large corpus of text
data to generate outputs for various natural language processing (NLP) tasks, such as text generation,
question answering, and machine translation.

▪ Large language models are typically based on deep learning neural networks such as the Transformer
architecture and are trained on massive amounts of text data, often involving billions of words. Larger
models, such as Google's BERT, are trained on large datasets drawn from many different sources, which
allows them to handle a wide range of tasks.

[Diagram: Text Input → Language Model → Text Output, along with a numeric representation of the text useful for other systems]
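A minimal sketch of the flow in the diagram above: text goes in, and a numeric representation useful for other systems comes out. This assumes the Hugging Face transformers and torch packages are installed; the model name (bert-base-uncased) and the mean pooling are illustrative choices, not prescribed by the slides.

# Minimal sketch: turning text into a numeric representation with a
# pretrained Transformer encoder. Assumes `pip install transformers torch`;
# the model name and mean pooling are illustrative choices.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "Transformers map text to vectors."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings into one vector usable by other systems
# (search, clustering, downstream classifiers, ...).
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # torch.Size([1, 768]) for bert-base-uncased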
USE CASES OF LLMs
Large language models can be applied to a variety of use cases and industries,
including healthcare, retail, tech, and more. The following use cases exist across
all industries (two of them are sketched in the code after this list):

▪ Text generation
▪ Sentiment analysis
▪ Chatbots
▪ Textual entailment recognition
▪ Question answering
▪ Code generation
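A minimal sketch of two of these use cases, sentiment analysis and question answering, assuming the Hugging Face transformers library is installed; the pipelines download small default models, and the example inputs are illustrative only, not part of the original slides.

# Minimal sketch: two of the use cases above via Hugging Face pipelines.
# Assumes `pip install transformers`; pipelines pull small default models.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")
print(sentiment("The new release is fast and easy to use."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]

qa = pipeline("question-answering")
print(qa(
    question="What are large language models trained on?",
    context="Large language models are trained on massive amounts of text data.",
))
# e.g. {'answer': 'massive amounts of text data', ...}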
TRANSFORMER COMPONENTS
TRANSFORMER MODELS
Largely replaced RNN models after the publication of "Attention Is All You Need" by Google in 2017
CATEGORIES OF TRANSFORMER MODELS

Encoders (for understanding language)
▪ Suited for tasks requiring an understanding of the full sentence, such as sentence classification, named entity recognition, and extractive question answering.
▪ Models: BERT, ALBERT, DistilBERT

Decoders (for generative models)
▪ Suited for tasks involving text generation.
▪ Models: GPT-2, GPT-3

Encoder-Decoders (sequence to sequence)
▪ Suited for tasks around generating new sentences depending on a given input, such as summarization, translation, or generative question answering.
▪ Models: T5, multilingual mT5

[Diagram: the Transformer architecture, with encoder and decoder stacks, inputs, outputs (shifted right), and output probabilities]
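As a concrete illustration of the encoder-decoder category, a minimal sketch using T5, which frames every task as text-to-text; it assumes the transformers and sentencepiece packages are installed, and the model name t5-small and the translation prompt are illustrative choices.

# Minimal sketch: an encoder-decoder (sequence-to-sequence) model.
# Assumes `pip install transformers sentencepiece`; t5-small is illustrative.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# T5 frames every task as text-to-text; the prefix selects the task.
inputs = tokenizer(
    "translate English to German: The model reads the whole input.",
    return_tensors="pt",
)
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))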
BERT

▪ BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (from Google in 2018)

▪ Encoder-only architecture that performs two main pre-training tasks:

  ▪ Predicts several blanks in the input given the entire context around each blank

  ▪ When given sentences A and B, determines whether B actually follows A

▪ Used for question answering, classification, etc.

▪ Takes a long time to train, since each iteration only gets signal from a handful of tokens in each sequence
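A minimal sketch of BERT's first pre-training task, predicting blanked-out (masked) tokens from their surrounding context, assuming the Hugging Face transformers fill-mask pipeline; the example sentence is illustrative.

# Minimal sketch: BERT's masked-token prediction, the first of the two
# pre-training tasks above. Assumes `pip install transformers`.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("The capital of France is [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
# The highest-scoring candidate is typically "paris".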
GPT
Generative Pre-Training

▪ Originally published by OpenAI in 2018, followed by GPT-2 in 2019 and GPT-3 in 2020.
▪ Architecture is also a single stack like BERT, but it is a traditional left-to-right language model.
▪ Can be used for generating larger blocks of text (e.g. chatbots), but can also be used for question answering.
▪ Has been the model we have focused on most with Megatron.
▪ Faster to train than BERT, since each iteration gets signal from every token in the sequence.
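A minimal sketch of left-to-right generation with GPT-2, assuming the Hugging Face transformers library is installed; the prompt and sampling settings are illustrative.

# Minimal sketch: left-to-right text generation with GPT-2.
# Assumes `pip install transformers`; sampling settings are illustrative.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Large language models can", max_new_tokens=30, do_sample=True)
print(result[0]["generated_text"])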
WHEN DO LARGE LANGUAGE MODELS MAKE SENSE?

Traditional NLP approach: requires labelled data; 100s of millions of parameters; specific models (one model per task); retrained frequently with task-specific training data.

Large language models: no labelled data required; billions to trillions of parameters; general models (one model can do many tasks); never retrained, or retrained minimally.

Large language models make sense when:
▪ You need zero-shot (or few-shot) learning (a minimal sketch follows this list)
▪ It is painful and impractical to get a large corpus of labelled data
▪ Models need to learn new tasks
▪ You want models with "common sense" that can generalize well to new tasks
▪ A single model can serve all use cases
▪ At scale you avoid the cost and complexity of many models, saving on data curation, training, and deployment management
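A minimal sketch of the zero-shot behaviour referenced above: classifying text into labels the model was never explicitly fine-tuned on, with no labelled training data. It assumes the transformers zero-shot-classification pipeline; the default model, example text, and candidate labels are illustrative.

# Minimal sketch: zero-shot classification, i.e. a new task with no
# labelled training data. Assumes `pip install transformers`; the default
# pipeline model and the candidate labels are illustrative.
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
result = classifier(
    "The patient reports chest pain and shortness of breath.",
    candidate_labels=["healthcare", "retail", "tech"],
)
print(result["labels"][0], round(result["scores"][0], 3))  # most likely label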
DISTRIBUTED TRAINING
Data, Pipeline and Tensor Parallelism
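A conceptual sketch of one of these techniques, tensor (intra-layer) parallelism, simulated here on CPU shards rather than separate GPUs so it stays self-contained: a linear layer's weight is split column-wise, each shard computes its slice of the output, and concatenating the slices reproduces the unsharded result. It assumes only that torch is installed; the shapes and shard count are illustrative.

# Conceptual sketch of tensor (intra-layer) parallelism: a linear layer's
# weight matrix is split column-wise across "devices" (simulated here as
# CPU shards), each shard computes its slice, and the slices are
# concatenated. Assumes `pip install torch`; shapes are illustrative.
import torch

torch.manual_seed(0)
hidden, out_features, num_shards = 64, 256, 4

x = torch.randn(8, hidden)                  # a batch of activations
weight = torch.randn(hidden, out_features)  # full (unsharded) weight

# Column-parallel split: each shard owns out_features // num_shards columns.
shards = torch.chunk(weight, num_shards, dim=1)

# Each "device" computes its partial output independently...
partial_outputs = [x @ w_shard for w_shard in shards]

# ...and an all-gather (here: a concatenation) rebuilds the full output.
parallel_out = torch.cat(partial_outputs, dim=1)
reference_out = x @ weight

print(torch.allclose(parallel_out, reference_out, atol=1e-5))  # True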
CHALLENGES

▪ Compute-, cost-, and time-intensive workload: significant capital investment and large-scale compute infrastructure are necessary to develop and maintain LLMs.

▪ Scale of data required: as mentioned, training a large model requires a significant amount of data, and many companies struggle to get access to large enough datasets.

▪ Technical expertise: due to their scale, large language models are very difficult to train and deploy.
THANK YOU!
