DAB311 DL Week 11 RNN

The document discusses the need for sequential modeling in machine learning, highlighting the limitations of Fully Connected Networks (FCN) and Convolutional Neural Networks (CNN) in handling fixed input dimensions and lack of memory. It emphasizes the importance of sequential models for time series data, such as video and autonomous vehicle data, which exhibit periodic cycles, trends, and sudden changes. The motivation for using sequential models is to effectively capture and analyze temporal patterns in data.


Sequential Modelling

Week 11
Need for sequential modelling
• Fully Connected Network (FCN)
  • Fixed input dimension, e.g. input [x1 x2 x3 x4 … xn]
  • If the input is shorter than n → pad the remaining positions with zeros
  • If the input is longer than n → the extra input data is ignored (truncated); see the sketch below

• Convolutional Neural Network (CNN)
  • Carries spatial information
  • Good for image data

FCN and CNN:
▪ Produce an output for a given snapshot; the next set of inputs is treated as a new snapshot
▪ Fixed input dimension
▪ Do not carry memory
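A minimal sketch of the fixed-input workaround above, assuming a target length n: shorter sequences are zero-padded and longer ones are truncated. The function name and values are illustrative only.

```python
def to_fixed_length(x, n):
    """Pad a sequence with zeros (if shorter than n) or truncate it (if longer)."""
    if len(x) < n:
        return x + [0.0] * (n - len(x))   # pad the missing positions with zeros
    return x[:n]                          # ignore (drop) the extra input data

print(to_fixed_length([1.0, 2.0, 3.0], 5))                      # [1.0, 2.0, 3.0, 0.0, 0.0]
print(to_fixed_length([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 5))       # only the first 5 values survive
```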
Need for sequential modelling (cont’d)
• Motivation for sequential models:
➢ Time series data
  o Periodic cycles
  o Trends
  o Regularity
  o Sudden spikes/drops
  Examples of time series data: video, autonomous vehicle (object state), electric circuit, temperature variation, stock price

➢ Natural language:
  o Email auto-complete
  o Translation (e.g., English to French)
  o Sentiment analysis
Need for sequential modelling (cont’d)
• Example sentences with word-level token IDs:
  Today(1) is(2) the(3) coolest(4) temperature(5) in(6) Windsor(7)
  The(1) historical(8) average(9) temperature(5) in(6) November(10) is(2) 12(11) degree(13) Celsius(14)

NLP:
▪ Varying input size (the two sentences have different lengths, and repeated words such as "temperature", "in", and "is" reuse the same token IDs)

Tokenization is the process of breaking down text into smaller, manageable pieces called "tokens".

Word tokenization – ["I", "love", "NLP"]

Character tokenization – ["N", "L", "P"]

A token ID is a numerical identifier assigned to each token during the tokenization process.
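A minimal sketch of word tokenization, character tokenization, and token-ID assignment; the mapping below is built on the fly for illustration and is not a real model's vocabulary.

```python
text = "I love NLP"

word_tokens = text.split()          # ['I', 'love', 'NLP']
char_tokens = list("NLP")           # ['N', 'L', 'P']

# Assign a numerical token ID to each unique word token.
token_ids = {tok: i for i, tok in enumerate(sorted(set(word_tokens)))}
print(word_tokens, char_tokens, token_ids)
```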
Need for sequential modelling (cont’d)
• Example 1: loan application (sequence does not matter)
  Inputs: salary, credit score, experience, age
  Outputs: loan rejected, loan granted, needs verification

• Example 2: sentiment of the sentence "I like this dish" (sequence matters)
  Outputs: neutral, positive, negative

Problems:
• Varying input size
• Too much computation
• No parameter sharing
Recurrent Neural Network (RNN)

• An RNN is usually drawn unwrapped (unrolled) through time: the same network is applied at every time step, passing its hidden state forward.

• Simple RNN: one hidden layer
• Deep RNN: many hidden layers

Issues with RNNs
• Vanishing gradient
• Exploding gradient
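A minimal sketch of the recurrence itself, assuming NumPy: the hidden state is updated at every time step as h = tanh(W_xh x_t + W_hh h_prev + b), with the same weights shared across all steps. Repeatedly multiplying by W_hh during backpropagation through time is what causes the vanishing and exploding gradient issues above. Sizes and random weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 4, 8, 10

W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden weights, shared across time
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)                       # initial hidden state
for t in range(seq_len):
    x_t = rng.normal(size=input_size)           # one time step of the input sequence
    h = np.tanh(W_xh @ x_t + W_hh @ h + b)      # the hidden state carries memory forward

print(h.shape)   # (8,)
```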

Long Short-Term Memory (LSTM)


• LSTMs introduce special units called memory cells to store information across time steps in a sequence. These
cells can maintain their state (memory) over a longer period of time than traditional RNN units.

• The memory cells are controlled by three gates: input gate, forget gate, and output gate. These gates allow
LSTMs to decide which information to keep, which to discard, and which new information to add.

Gated Recurrent Unit (GRU)


• GRUs are similar to Long Short-Term Memory (LSTMs) but have a simpler structure and fewer parameters,
making them computationally more efficient.
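A quick way to see the "fewer parameters" point, assuming PyTorch is available: for the same input and hidden sizes, an LSTM layer stores four gate blocks while a GRU stores three, so the GRU has roughly three quarters of the parameters.

```python
import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=128)
gru = nn.GRU(input_size=64, hidden_size=128)

count = lambda m: sum(p.numel() for p in m.parameters())
print("LSTM parameters:", count(lstm))   # 4 gate blocks
print("GRU parameters:", count(gru))     # 3 gate blocks, roughly 3/4 of the LSTM
```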
Large Language Models (LLM)
• Language Models:
➢ Basic NLP tasks (answering questions, translation, sentiment analysis)

• An LLM is a form of Generative Artificial Intelligence (GenAI – able to generate new content)

• An LLM is a Neural Network designed to
➢ Understand
➢ Generate
➢ Respond
to human-like text

• Deep NN trained on a massive (large) amount of data

Why do we call them Large Language Models?

• Trained on a massive amount of data
• Billions of parameters
Large Language Models (cont’d.)
LLM vs Earlier NLP (or simple LM) Models
• NLP/LM:
➢ Very specific tasks (e.g., translation, sentiment analysis)
➢ Not able to write an email from given instructions

• LLM:
➢ Can do a wide range of NLP tasks
➢ Able to write an email for a given set of instructions, and more

• Why are LLMs so good compared to earlier NLP/LM models?

TRANSFORMER ARCHITECTURE
➢ Not all LLMs are transformers
➢ Not all transformers are LLMs
Large Language Models (cont’d.)

• Generative Artificial Intelligence (GenAI): generates new content

• LLMs typically deal with text, but do they have to be limited to text only? NO

• GPT-4 is a multimodal model that can process text and images; however, it is referred to as an LLM because its primary focus and fundamental design are around text-based tasks.
• Waymo's multimodal end-to-end model refers to their integrated approach for autonomous driving, where multiple types of data inputs (camera, radar, and lidar) are processed together to make driving decisions.
Use Cases of LLM
o Machine translation: LLMs can be used to translate text from one language to another.

o Content generation: LLMs can generate new text, such as fiction, articles, and even computer
code.

o Sentiment analysis: LLMs can be used to analyze the sentiment of a piece of text, such as
determining whether it is positive, negative, or neutral.

o Text summarization: LLMs can be used to summarize a long piece of text, such as an article or a
document.

o Chatbots and virtual assistants: LLMs can be used to power chatbots and virtual assistants,
such as OpenAI's ChatGPT or Google's Gemini (formerly called Bard).

o Knowledge retrieval: LLMs can be used to retrieve knowledge from vast volumes of text in
specialized areas such as medicine or law.
Stages of Building LLMs

Note: huge computational cost (e.g., GPT-3 training cost is approximately 4.6 million dollars)

▪ Stage 1: Implementing the LLM architecture and data preparation process. This stage involves preparing and sampling the text data and understanding the basic mechanisms behind LLMs.

▪ Stage 2: Pretraining an LLM to create a foundation model. This stage involves pretraining the LLM on unlabeled data, typically a large, diverse data set (also known as a general data set).

▪ Stage 3: Fine-tuning the foundation model to become a personal assistant or text classifier. This stage involves fine-tuning the pretrained LLM on labeled data, which can be either an instruction dataset or a dataset with class labels.

Why is fine-tuning important?
• Train on your specific data set
• Customize for your application or organization (e.g., health care, airline, law firm, educational institute, etc.)
Simplified Transformer Architecture
• An encoder that processes the input text and produces an embedding representation (a numerical representation that captures many different factors in different dimensions) of the text
  • Encodes the input text into vectors

• A decoder that can use the encoded representation to generate the translated text one word at a time
  • Generates output text from the encoded vectors

Self-attention mechanism:
• Key part of transformers; it allows the model to weigh the importance of different words/tokens relative to each other
• Enables the model to capture long-range dependencies
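A minimal sketch of scaled dot-product self-attention, the weighting mechanism described above; the query, key, and value projections here are random matrices rather than trained model weights.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Each token attends to every other token; the softmax weights say how important each one is."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])                    # similarity of every token pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the sequence
    return weights @ V                                         # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16
X = rng.normal(size=(seq_len, d_model))                        # one embedded input sequence
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)                  # (5, 16)
```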
Transformer Architecture

Attention Is All You Need

https://arxiv.org/pdf/1706.03762
BERT vs GPT Architecture

• Bidirectional Encoder Representations from Transformers (BERT): the encoder segment exemplifies BERT-like LLMs, which focus on masked word prediction and are primarily used for tasks like text classification
  • Predict hidden (masked) words in a given sentence

• Generative Pre-trained Transformer (GPT): the decoder segment showcases GPT-like LLMs, designed for generative tasks and producing coherent text sequences
  • Generate new words


GPT Architecture
• The GPT architecture employs only the decoder portion of the original transformer.

• It is designed for unidirectional, left-to-right processing, making it well suited for text generation and next-word prediction tasks.

• It generates text in an iterative fashion, one word at a time.
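A minimal sketch of that iterative loop: the model scores the next token given everything generated so far, the chosen token is appended, and the process repeats. The next_token_logits function below is a hypothetical stand-in for a real model's forward pass, not an actual GPT call.

```python
import numpy as np

VOCAB_SIZE = 50257                                  # GPT-2's vocabulary size, used here only as a plausible number

def next_token_logits(context):
    """Hypothetical stand-in for a trained model's forward pass over the context."""
    rng = np.random.default_rng(len(context))       # deterministic toy scores, NOT real predictions
    return rng.normal(size=VOCAB_SIZE)

context = [464, 2068]                               # token IDs given / generated so far
for _ in range(5):                                  # generate five more tokens
    logits = next_token_logits(context)
    next_id = int(np.argmax(logits))                # greedy choice of the most likely next token
    context.append(next_id)                         # the new token becomes part of the next step's input
print(context)
```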
GPT Architecture (cont’d.)
Working with text
Embedding / Vector Embedding

Words corresponding to similar concepts often appear close to each other in the embedding space. For instance, different types of birds appear closer to each other in the embedding space than to countries and cities.
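A minimal sketch of the "close in embedding space" idea using cosine similarity; the 3-dimensional vectors below are made up for illustration, whereas real embeddings have hundreds or thousands of dimensions.

```python
import numpy as np

embeddings = {                      # toy, hand-made vectors (not from a real model)
    "sparrow": np.array([0.9, 0.8, 0.1]),
    "eagle":   np.array([0.8, 0.9, 0.2]),
    "Canada":  np.array([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings["sparrow"], embeddings["eagle"]))   # high: both are birds
print(cosine(embeddings["sparrow"], embeddings["Canada"]))  # lower: different concepts
```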
Tokenizing Texts

Here, we split an input text into individual tokens, which are either words or special characters, such as punctuation characters.
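A minimal sketch of this splitting step using a regular expression that keeps punctuation characters as separate tokens; the exact pattern is one reasonable choice, not the only one.

```python
import re

text = "Hello, world. Is this a token?"
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)   # split on punctuation and whitespace, keeping them
tokens = [t for t in tokens if t.strip()]            # drop empty strings and bare whitespace
print(tokens)   # ['Hello', ',', 'world', '.', 'Is', 'this', 'a', 'token', '?']
```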
Converting Tokens into Token IDs

We build a vocabulary by tokenizing the entire text in a training dataset into individual tokens. These individual tokens are then sorted alphabetically, and duplicate tokens are removed. The unique tokens are then aggregated into a vocabulary that defines a mapping from each unique token to a unique integer value. The depicted vocabulary is purposefully small and contains no punctuation or special characters for simplicity.
Converting Tokens into Token IDs (cont’d.)
Starting with a new text sample, we tokenize the text and use the vocabulary to convert the text tokens into token IDs. The vocabulary is built from the entire training set and can be applied to the training set itself and to any new text samples. The depicted vocabulary contains no punctuation or special characters for simplicity.
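A minimal sketch of both steps: building the vocabulary from a toy training text and then using it to convert a new text sample into token IDs.

```python
import re

def tokenize(text):
    tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    return [t for t in tokens if t.strip()]

training_text = "the quick brown fox jumps over the lazy dog ."
# Sort the unique tokens alphabetically and map each one to a unique integer.
vocab = {tok: i for i, tok in enumerate(sorted(set(tokenize(training_text))))}

new_sample = "the lazy fox ."
token_ids = [vocab[tok] for tok in tokenize(new_sample)]
print(token_ids)                                   # IDs looked up in the training vocabulary
```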
Adding special context tokens

We add special tokens to a vocabulary to deal with certain contexts. For instance, we add an <|unk|> token to represent new and unknown words that were not part of the training data and thus not part of the existing vocabulary. Furthermore, we add an <|endoftext|> token that we can use to separate two unrelated text sources.
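A minimal sketch of extending a toy vocabulary with <|endoftext|> and <|unk|> and falling back to <|unk|> for words that never appeared in the training data.

```python
vocab = {"brown": 0, "dog": 1, "fox": 2, "lazy": 3, "the": 4}   # toy vocabulary
for special in ("<|endoftext|>", "<|unk|>"):                    # append special tokens at the end
    vocab[special] = len(vocab)

def encode(tokens, vocab):
    unk_id = vocab["<|unk|>"]
    return [vocab.get(tok, unk_id) for tok in tokens]           # unknown words map to <|unk|>

text_a = ["the", "lazy", "fox"]
text_b = ["the", "purple", "unicorn"]              # 'purple' and 'unicorn' are not in the vocabulary
print(encode(text_a + ["<|endoftext|>"] + text_b, vocab))
```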
Byte Pair Encoding (BPE)
The BPE tokenizer was used to train LLMs such as GPT-2, GPT-3, and the original model used in ChatGPT.

BPE tokenizers break down unknown words into subwords and individual characters. This way, a BPE tokenizer can parse any word and doesn’t need to replace unknown words with special tokens, such as <|unk|>.
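The GPT-2 BPE tokenizer can be tried directly through OpenAI's tiktoken package, assuming it is installed; note how the made-up word below is split into subword pieces instead of being replaced by <|unk|>.

```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
ids = tokenizer.encode("Hello, do you like someunknownPlace?", allowed_special={"<|endoftext|>"})
print(ids)                              # subword token IDs, including pieces of 'someunknownPlace'
print(tokenizer.decode(ids))            # round-trips back to the original text
```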
