Unit 4 NLP
Models, Self-attention, multi-headed attention, BERT, RoBERTa, Fine Tuning for downstream tasks, Text
classification and Text generation
Recurrent Neural Networks (RNNs) work a little differently from regular neural networks. In a regular neural network, information flows in one direction, from input to output. In an RNN, however, information is fed back into the system after each step. Think of it like reading a sentence: when you are trying to predict the next word, you don't just look at the current word, you also need to remember the words that came before to make an accurate guess.
RNNs allow the network to "remember" past information by feeding the output from one step into the next step. This helps the network understand the context of what has already happened and make better predictions based on that. For example, when predicting the next word in a sentence, the RNN uses the previous words to help decide which word is most likely to come next.
Feedforward Neural Networks (FNNs) process data in one direction from input to output without retaining information
from previous inputs. This makes them suitable for tasks with independent inputs like image classification. However,
FNNs struggle with sequential data since they lack memory.
Recurrent Neural Networks (RNNs) solve this by incorporating loops that allow information from previous steps
to be fed back into the network. This feedback enables RNNs to remember prior inputs making them ideal for tasks
where context is important.
Recurrent neural networks (RNNs) set themselves apart from other neural networks with their unique capabilities:
• Internal Memory: This is the key feature of RNNs. It allows them to remember past inputs and use that context
when processing new information.
• Sequential Data Processing: Because of their memory, RNNs are exceptional at handling sequential data where
the order of elements matters. This makes them ideal for speech recognition, machine translation, natural
language processing (NLP) and text generation.
• Contextual Understanding: RNNs can analyze the current input in relation to what they’ve “seen” before. This
contextual understanding is crucial for tasks where meaning depends on prior information.
• Dynamic Processing: RNNs can continuously update their internal memory as they process new data, allowing
them to adapt to changing patterns within a sequence.
1. Recurrent Neurons
The fundamental processing unit in RNN is a Recurrent Unit. Recurrent units hold a hidden state that maintains
information about previous inputs in a sequence. Recurrent units can “remember” information from prior steps by
feeding back their hidden state, allowing them to capture dependencies across time.
2. RNN Unfolding
RNN unfolding or unrolling is the process of expanding the recurrent structure over time steps. During unfolding each
step of the sequence is represented as a separate layer in a series illustrating how information flows across each time
step.
This unrolling enables backpropagation through time (BPTT), a learning process in which errors are propagated across time steps to adjust the network's weights, enhancing the RNN's ability to learn dependencies within sequential data.
RNNs share similarities in input and output structures with other deep learning architectures but differ significantly in
how information flows from input to output. Unlike traditional deep neural networks, where each dense layer has
distinct weight matrices, RNNs use shared weights across time steps, allowing them to remember information over
sequences.
In RNNs, the hidden state hi is calculated for every input xi to retain sequential dependencies. The computations follow these core formulas:
1. Hidden State Calculation:
h = σ(U⋅X + W⋅ht−1 + B)
Here, h represents the current hidden state, U and W are weight matrices, and B is the bias.
2. Output Calculation:
Y = O(V⋅h + C)
The output Y is calculated by applying O, an activation function, to the weighted hidden state, where V and C represent the output weights and bias.
Working of RNN
At each time step RNNs process units with a fixed activation function. These units have an internal hidden state that
acts as memory that retains information from previous time steps. This memory allows the network to store past
knowledge and adapt based on new inputs.
The current hidden state ht depends on the previous state ht−1 and the current input xt, and is calculated using the
following relations:
1. State Update:
ht = f(ht−1, xt)
where ht is the current state, ht−1 is the previous state and xt is the input at the current time step. With a tanh activation, this becomes:
ht = tanh(Whh⋅ht−1 + Wxh⋅xt)
Here, Whh is the weight matrix for the recurrent neuron and Wxh is the weight matrix for the input neuron.
2. Output Calculation:
yt = Why⋅ht
where yt is the output and Why is the weight matrix at the output layer.
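The recurrence above can be written out directly. The following is a minimal NumPy sketch of an RNN forward pass; the layer sizes, random weights and the 5-step input sequence are illustrative assumptions, not values from the text.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions)
input_size, hidden_size, output_size = 4, 8, 3

# Weights are shared across all time steps
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden (Wxh)
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden (Whh)
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))  # hidden -> output (Why)

def rnn_forward(xs):
    """xs: list of input vectors, one per time step."""
    h = np.zeros(hidden_size)                 # initial hidden state
    outputs = []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t)    # state update: ht = tanh(Whh.ht-1 + Wxh.xt)
        outputs.append(W_hy @ h)              # output: yt = Why.ht
    return outputs, h

sequence = [rng.normal(size=input_size) for _ in range(5)]
ys, final_h = rnn_forward(sequence)
print(len(ys), final_h.shape)                 # 5 outputs, hidden state of size 8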
These parameters are updated using backpropagation. However, since RNNs work on sequential data, an adapted form of backpropagation known as backpropagation through time is used.
Backpropagation Through Time (BPTT) in RNNs
Since RNNs process sequential data, Backpropagation Through Time (BPTT) is used to update the network's parameters. The loss function L(θ) depends on the final hidden state (for example, h3 for a three-step sequence), and each hidden state relies on the preceding ones, forming a sequential dependency chain.
In BPTT, gradients are backpropagated through each time step. This is essential for updating network parameters based
on temporal dependencies.
Issues in RNN
In recurrent neural networks (RNNs), the exploding gradient and vanishing gradient problems occur due to the
repeated multiplication of gradients through many time steps during backpropagation. These issues affect the learning
process and model performance.
1. Vanishing Gradient Problem
• When backpropagating through time (BPTT), if the gradient values are less than 1, they keep getting smaller
as they propagate backward.
• This leads to very small weight updates, making the network slow to learn or even stop learning altogether.
• Cause: Activation functions like sigmoid or tanh squash their inputs into (0, 1) or (−1, 1), so their derivatives are small; multiplying many such small derivatives makes the gradients shrink.
2. Exploding Gradient Problem
• If the gradient values are greater than 1, they increase exponentially as they propagate backward.
• This leads to very large weight updates, causing the network to become unstable and diverge.
• Cause: Large weight values or repeated multiplication of large gradients.
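As a toy illustration of both problems, the gradient that reaches the earliest time step contains a product of one factor per step. The sketch below (with assumed per-step factors of 0.9 and 1.1 over 50 steps) shows how such a product either vanishes or explodes.

import numpy as np

steps = 50
grad_vanish = np.prod(np.full(steps, 0.9))   # factors < 1: product shrinks toward 0
grad_explode = np.prod(np.full(steps, 1.1))  # factors > 1: product grows rapidly

print(f"0.9 multiplied over {steps} steps: {grad_vanish:.6f}")   # ~0.005
print(f"1.1 multiplied over {steps} steps: {grad_explode:.2f}")  # ~117

In practice, exploding gradients are often handled by gradient clipping, while vanishing gradients motivated gated architectures such as the LSTM described below.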
There are four types of RNNs based on the number of inputs and outputs in the network:
1. One-to-One RNN
This is the simplest type of neural network architecture where there is a single input and a single output. It is used for
straightforward classification tasks such as binary classification where no sequential data is involved.
2. One-to-Many RNN
In a One-to-Many RNN the network processes a single input to produce multiple outputs over time. This is useful in
tasks where one input triggers a sequence of predictions (outputs). For example in image captioning a single image can
be used as input to generate a sequence of words as a caption.
Example: Music Generation
Input: A single note Output: A sequence of notes
3. Many-to-One RNN
The Many-to-One RNN receives a sequence of inputs and generates a single output. This type is useful when the
overall context of the input sequence is needed to make one prediction. In sentiment analysis the model receives a
sequence of words (like a sentence) and produces a single output like positive, negative or neutral.
4. Many-to-Many RNN
The Many-to-Many RNN type processes a sequence of inputs and generates a sequence of outputs. In language
translation task a sequence of words in one language is given as input, and a corresponding sequence in another language
is generated as output.
To solve the problem of vanishing and exploding gradients in a deep recurrent neural network, many variations were developed. One of the most famous of them is the Long Short-Term Memory network (LSTM). In concept, an LSTM recurrent unit tries to "remember" all the past knowledge that the network has seen so far and to "forget" irrelevant data. This is done by introducing different activation-function layers called "gates" for different purposes. Each LSTM recurrent unit also maintains a vector called the internal cell state, which conceptually describes the information that was chosen to be retained by the previous LSTM recurrent unit.
LSTM Architecture
The LSTM architecture involves a memory cell which is controlled by three gates: the input gate, the forget gate and the output gate. These gates decide what information is added to, removed from and output from the memory cell.
• Input gate: Controls what information is added to the memory cell.
• Forget gate: Determines what information is removed from the memory cell.
• Output gate: Controls what information is output from the memory cell.
This allows LSTM networks to selectively retain or discard information as it flows through the network which allows
them to learn long-term dependencies. The network has a hidden state which is like its short-term memory. This memory
is updated using the current input, the previous hidden state and the current state of the memory cell.
Working of LSTM
LSTM architecture has a chain structure that contains four neural networks and different memory blocks called cells.
Information is retained by the cells and the memory manipulations are done by the gates. There are three gates –
Forget Gate
The information that is no longer useful in the cell state is removed with the forget gate. Two inputs, xt (the input at the current time step) and ht−1 (the previous hidden state), are fed to the gate and multiplied with weight matrices, followed by the addition of a bias. The result is passed through a sigmoid activation function, which gives an output between 0 and 1. If the output for a particular cell-state element is close to 0, that piece of information is forgotten; if it is close to 1, the information is retained for future use.
ft = σ(Wf⋅[ht−1, xt] + bf)
where:
• [ht-1, xt] denotes the concatenation of the current input and the previous hidden state.
Input gate
The addition of useful information to the cell state is done by the input gate. First, the information is regulated using a sigmoid function, which filters the values to be remembered (similar to the forget gate) using the inputs ht−1 and xt. Then a candidate vector is created using the tanh function, which gives an output between −1 and +1 containing all the possible values from ht−1 and xt. Finally, the values of the candidate vector and the regulated values are multiplied to obtain the useful information. The equations for the input gate are:
it = σ(Wi⋅[ht−1, xt] + bi)
Ĉt = tanh(Wc⋅[ht−1, xt] + bc)
We multiply the previous cell state by ft, discarding the information we had previously chosen to forget. Next, we add it⊙Ĉt, the candidate values scaled by how much we chose to update each state value:
Ct = ft⊙Ct−1 + it⊙Ĉt
where ⊙ denotes element-wise multiplication.
Output gate
The task of extracting useful information from the current cell state to be presented as output is done by the output gate. First, a vector is generated by applying the tanh function to the cell state. Then the information is regulated using the sigmoid function, filtered by the values to be remembered, using the inputs ht−1 and xt. Finally, the values of the vector and the regulated values are multiplied and sent as the output and as the hidden state for the next cell. The equations for the output gate are:
ot = σ(Wo⋅[ht−1, xt] + bo)
ht = ot⊙tanh(Ct)
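Putting the three gates together, the following is a minimal NumPy sketch of a single LSTM step using the equations above; the layer sizes, random weights and the short input sequence are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM step; every W acts on the concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)              # forget gate
    i_t = sigmoid(W_i @ z + b_i)              # input gate
    c_hat = np.tanh(W_c @ z + b_c)            # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat          # new cell state
    o_t = sigmoid(W_o @ z + b_o)              # output gate
    h_t = o_t * np.tanh(c_t)                  # new hidden state
    return h_t, c_t

rng = np.random.default_rng(1)
hidden, inputs = 6, 4                         # illustrative sizes (assumptions)
shape = (hidden, hidden + inputs)
W_f, W_i, W_c, W_o = (rng.normal(scale=0.1, size=shape) for _ in range(4))
b_f, b_i, b_c, b_o = (np.zeros(hidden) for _ in range(4))

h, c = np.zeros(hidden), np.zeros(hidden)
for _ in range(3):                            # a short 3-step sequence
    h, c = lstm_step(rng.normal(size=inputs), h, c,
                     W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o)
print(h.shape, c.shape)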
Attention Mechanism
Imagine you are trying to understand a complex image or translate a sentence from one language to another. Your brain
instinctively focuses on specific parts of the image or particular words in the sentence that are most relevant to your
task. This selective focus is what we refer to as attention, and it’s a fundamental aspect of human cognition. Attention
mechanisms in deep learning aim to mimic this selective focus process in artificial neural networks.
At its core, an attention mechanism allows a model to focus on different parts of the input data with varying degrees
of importance. It assigns weights to each element in the input sequence, highlighting the elements that are most
relevant to the task at hand. This not only enhances the model’s understanding of the data but also improves its
performance in tasks like language translation, image captioning, and more.
Key Takeaway: Attention mechanisms enable neural networks to mimic human-like selective focus, improving their
ability to process and understand complex data.
Attention mechanisms have become indispensable in various deep-learning applications due to their ability to address
some critical challenges:
1. Long Sequences: Traditional neural networks struggle with processing long sequences, such as translating a
paragraph from one language to another. Attention mechanisms allow models to focus on the relevant parts of
the input, making them more effective at handling lengthy data.
2. Contextual Understanding: In tasks like language translation, understanding the context of a word is crucial
for accurate translation. Attention mechanisms enable models to consider the context by assigning different
attention weights to each word in the input sequence.
3. Improved Performance: Models equipped with attention mechanisms often outperform their non-attention
counterparts. They achieve state-of-the-art results in tasks like machine translation, image classification, and
speech recognition.
Let's say we want to translate the English sentence: "I love deep learning"
A traditional sequence-to-sequence (Seq2Seq) model with an LSTM or GRU processes each word sequentially and
generates an output word-by-word. However, this approach struggles with long sentences because the entire meaning
is compressed into a fixed-length context vector.
Instead of encoding everything into a single vector, attention assigns different importance (weights) to different
input words while generating each output word.
For example, when translating "love" to "adore", the model should focus more on "love" rather than the other words.
Attention mechanism computes a score for each input word at every decoding step and dynamically decides which
parts of the input sequence are most relevant.
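The following is a minimal NumPy sketch of this idea for the example sentence. The encoder hidden states and the decoder state are made-up numbers, chosen only so that "love" ends up with the largest attention weight.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

words = ["I", "love", "deep", "learning"]
encoder_states = np.array([[0.1, 0.0, 0.2],     # assumed encoder hidden states, one per word
                           [0.9, 0.8, 0.7],
                           [0.3, 0.1, 0.4],
                           [0.2, 0.5, 0.3]])
decoder_state = np.array([1.0, 0.9, 0.8])       # assumed decoder state while producing "adore"

scores = encoder_states @ decoder_state         # one score per input word
weights = softmax(scores)                       # attention weights sum to 1
context = weights @ encoder_states              # weighted sum of encoder states

for word, weight in zip(words, weights):
    print(f"{word:9s} {weight:.2f}")            # "love" receives the largest weight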
Self-Attention
Self-Attention (also called Scaled Dot-Product Attention) is a mechanism where a model learns to weigh different
parts of the same input sequence to determine which are most important for making predictions. This is a key
component of Transformers (e.g., BERT, GPT).
Consider a sentence such as "The cat sat on the mat because it was soft." Here, the word "it" refers to "the mat", not "the cat". Traditional models (like RNNs and LSTMs) struggle to capture such long-range dependencies. Self-attention helps by dynamically focusing on the important words.
Each word in the sentence is first converted into a word embedding (a vector representation), and from each embedding the model derives three vectors: a Query (Q), a Key (K) and a Value (V).
Attention scores determine how much focus one word should give to another. They are computed as a scaled dot product of the queries and keys:
Score = QKᵀ / √dk
Applying a softmax to these scores for the whole sentence yields an attention matrix. In this example, the row for "sat" places most of its weight on itself (0.45) and some weight on "cat" (0.30).
Each word's new representation is then the weighted sum of the Value (V) vectors:
Output = Attention Weights × V
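A minimal NumPy sketch of scaled dot-product self-attention is given below; the sequence length, embedding size and random projection matrices are illustrative assumptions.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over one sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # Score = QK^T / sqrt(d_k)
    weights = softmax(scores, axis=-1)         # each row sums to 1 (the attention matrix)
    return weights @ V, weights                # new token representations, attention matrix

rng = np.random.default_rng(2)
seq_len, d_model, d_k = 6, 8, 4                # illustrative sizes (assumptions)
X = rng.normal(size=(seq_len, d_model))        # stand-in word embeddings
W_q, W_k, W_v = (rng.normal(scale=0.3, size=(d_model, d_k)) for _ in range(3))

out, attn = self_attention(X, W_q, W_k, W_v)
print(out.shape)                               # (6, 4): one new vector per token
print(attn.sum(axis=-1))                       # every row of the attention matrix sums to 1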
Multi-Head Attention
Multi-Head Attention (MHA) is an enhanced version of self-attention used in Transformer models (e.g., BERT,
GPT, ViTs). It allows the model to attend to different parts of the input simultaneously, improving its ability to
capture complex relationships.
In self-attention, a single attention mechanism focuses on one type of relationship at a time. However, language and
images have multiple aspects to capture.
Example sentence: "She saw the jaguar in the zoo and was amazed by its speed."
If we use only one attention mechanism, the model might focus only on "zoo" and ignore "speed." Multi-head
attention solves this by using multiple attention heads to learn different aspects of meaning!
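A minimal NumPy sketch of multi-head attention is shown below: several independent attention heads run in parallel and their outputs are concatenated (in a real Transformer a final linear projection follows). The sizes and random weights are illustrative assumptions.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads):
    """heads: list of (W_q, W_k, W_v) tuples, one per attention head."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        outputs.append(softmax(scores) @ V)        # each head attends independently
    return np.concatenate(outputs, axis=-1)        # concatenate the heads

rng = np.random.default_rng(3)
seq_len, d_model, n_heads = 5, 16, 4               # illustrative sizes (assumptions)
d_head = d_model // n_heads
X = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(scale=0.3, size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]

print(multi_head_attention(X, heads).shape)        # (5, 16): d_head * n_heads per token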
Transformers are a type of deep learning architecture that have revolutionized the field of natural language processing
(NLP) in recent years. They are widely used for tasks such as language translation, text classification, sentiment
analysis, and more.
History
The transformer architecture was first introduced in a 2017 paper by Google researchers Vaswani et al. called “Attention
Is All You Need”. This paper proposed a novel approach to NLP tasks that relied solely on the self-attention mechanism,
a type of attention mechanism that allows the model to weigh the importance of different words in a sentence when
encoding it into a fixed-size vector representation. This was a departure from previous NLP models that relied on
recurrent neural networks (RNNs) or convolutional neural networks (CNNs) to process sequences of words.
The transformer architecture was revolutionary in that it allowed for much faster training times and better parallelization
on GPUs, since the self-attention mechanism could be computed in parallel for all words in a sequence. This made it
possible to train much larger models on much larger datasets, leading to significant improvements in NLP performance.
Transformer
The transformer architecture is composed of an encoder and a decoder, each of which is made up of multiple layers of
self-attention and feedforward neural networks. The self-attention mechanism is the heart of the transformer, allowing
the model to weigh the importance of different words in a sentence based on their affinity with each other. This is similar
to how a human might read a sentence, focusing on the most relevant parts of the text rather than reading it linearly
from beginning to end.
In addition to self-attention, the transformer also introduces positional encodings, which allow the model to keep track of the positions of words in a sentence. This is important because the order of words in a sentence can significantly impact its meaning.
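The original "Attention Is All You Need" paper uses fixed sinusoidal positional encodings that are simply added to the word embeddings. A minimal NumPy sketch is shown below; the sequence length and model dimension are illustrative.

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    positions = np.arange(max_len)[:, None]                    # (max_len, 1)
    dims = np.arange(d_model)[None, :]                         # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                      # sine on even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                      # cosine on odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)                                                # (50, 16), added to the embeddings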
Transformer: Encoder-Decoder
The transformer encoder-decoder architecture is used for tasks like language translation, where the model must take in
a sentence in one language and output a sentence in another language. The encoder takes in the input sentence and
produces a fixed-size vector representation of it, which is then fed into the decoder to generate the output sentence. The
decoder uses both self-attention and cross-attention, where the attention mechanism is applied to the output of the
encoder and the input of the decoder.
One of the most popular transformer encoder-decoder models is the T5 (Text-to-Text Transfer Transformer), which was
introduced by Google in 2019. The T5 can be fine-tuned for a wide range of NLP tasks, including language translation,
question answering, summarization, and more.
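As an illustration, a pre-trained T5 checkpoint can be used through the Hugging Face transformers pipeline; the small t5-small checkpoint, the task and the example sentence below are illustrative choices.

from transformers import pipeline

# T5 casts every task as text-to-text; the pipeline adds the task prefix for you
translator = pipeline("translation_en_to_de", model="t5-small")
print(translator("I love deep learning."))
# prints a German translation, e.g. [{'translation_text': '...'}]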
Real-world examples of the transformer encoder-decoder architecture include neural machine translation systems such as Google Translate, and Facebook's M2M-100, a massive multilingual machine translation model that can translate between 100 different languages.
Transformer: Encoder
The transformer encoder architecture is used for tasks like text classification, where the model must classify a piece of
text into one of several predefined categories, such as sentiment analysis, topic classification, or spam detection. The
encoder takes in a sequence of tokens and produces a fixed-size vector representation of the entire sequence, which can
then be used for classification.
One of the most popular transformer encoder models is BERT (Bidirectional Encoder Representations from
Transformers), which was introduced by Google in 2018. BERT is pre-trained on large amounts of text data and can be
fine-tuned for a wide range of NLP tasks.
Unlike the encoder-decoder architecture, the transformer encoder is only concerned with the input sequence and does
not generate any output sequence. It applies self-attention mechanism to the input tokens, allowing it to focus on the
most relevant parts of the input for the given task.
Real-world examples of the transformer encoder architecture include sentiment analysis, where the model must classify
a given review as positive or negative, and email spam detection, where the model must classify a given email as spam
or not spam.
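As a concrete illustration, a fine-tuned encoder model can be used for sentiment analysis through the transformers pipeline; the DistilBERT checkpoint below (fine-tuned on SST-2) and the example sentences are illustrative choices.

from transformers import pipeline

classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

print(classifier("This movie was absolutely wonderful!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
print(classifier("The package arrived late and damaged."))
# e.g. [{'label': 'NEGATIVE', 'score': ...}]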
Transformer: Decoder
The transformer decoder architecture is used for tasks like language generation, where the model must generate a
sequence of words based on an input prompt or context. The decoder takes in a fixed-size vector representation of the
context and uses it to generate a sequence of words one at a time, with each word being conditioned on the previously
generated words.
One of the most popular transformer decoder models is the GPT-3 (Generative Pre-trained Transformer 3), which was
introduced by OpenAI in 2020. The GPT-3 is a massive language model that can generate human-like text in a wide
range of styles and genres.
The transformer decoder architecture introduces a technique called triangle masking for attention, which ensures that
the attention mechanism only looks at tokens to the left of the current token being generated. This prevents the model
from “cheating” by looking at tokens that it hasn’t generated yet.
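A minimal NumPy sketch of such a triangle (causal) mask is shown below: positions above the diagonal are set to a very large negative number before the softmax, so each token can only attend to itself and to tokens on its left. The raw scores are random illustrative values.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
scores = np.random.default_rng(4).normal(size=(seq_len, seq_len))  # stand-in Q.K^T scores

mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)        # True above the diagonal
scores[mask] = -1e9                                                 # effectively -inf
weights = softmax(scores, axis=-1)

print(np.round(weights, 2))   # upper triangle is ~0: no attention to future tokens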
Real-world examples of the transformer decoder architecture include text generation, where the model must generate a
story or article based on a given prompt or topic, and chatbots, where the model must generate responses to user inputs
in a natural and engaging way.
Drawbacks
1. High computational cost due to the attention mechanism, which increases quadratically with sequence length.
2. Difficulty in interpretation and debugging due to the attention mechanism operating over the entire input
sequence.
Despite these downsides, the transformer architecture remains a powerful and widely-used tool in NLP, and research is
ongoing to mitigate its computational requirements and improve its interpretability and robustness.
BERT
BERT is a tool/model which understands language better than any model before it. It is freely available and incredibly versatile, as it can solve a large number of language-related problems. You may have used BERT without even knowing it: if you have used Google Search, you have already used BERT.
BERT is based on the Transformer model architecture. Examining the model as if it were a single black box, a machine
translation application would take a sentence in one language and translate it into a different language.
• Basic Transformer consists of an encoder to read the text input and a decoder to produce a prediction for the
task.
• Since BERT's goal is to generate a language representation model, it only needs the encoder part. Hence, BERT is basically a trained stack of Transformer encoders.
Figure: A basic structure of an encoder block.
Training of BERT
During pretraining, BERT uses two objectives: masked language modeling and next sentence prediction.
• Masked Language Modeling (MLM) randomly selects 15% of the input tokens; 80% of these are replaced with a [MASK] token, and the model uses the remaining tokens to predict the masked (missing) words (a minimal example is sketched below).
• Next Sentence Prediction (NSP) is a binary classification loss for predicting whether two segments follow each other in the original document or come from different documents, encouraging the model to capture sentence-level semantics.
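A minimal sketch with the transformers fill-mask pipeline is given here; the bert-base-uncased checkpoint and the sentence are illustrative choices.

from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("The cat sat on the [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))   # top candidate words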
BERT is significantly undertrained, and the following areas leave scope for modification (these observations motivated RoBERTa, discussed below):
1. Static Masking: The masking is done only once during data preprocessing, resulting in a single static mask. Hence, the same input masks are fed to the model on every epoch.
2. Next Sentence Prediction: Each input has a pair of segments, which can each contain multiple natural sentences, but the total combined length must be less than 512 tokens. It was noticed that individual sentences hurt performance on downstream tasks, the hypothesis being that the model is not able to learn long-range dependencies; hence the authors could experiment with removing or keeping the NSP loss to see its effect on performance.
3. Text Encoding: The original BERT implementation uses a character-level BPE vocabulary of size 30K; BERT uses the WordPiece method, a language-modelling-based variant of Byte-Pair Encoding.
4. Training Scale: Originally, BERT is trained for 1M steps with a batch size of 256 sequences, which shows room for improvement in perplexity on the Masked Language Modelling objective.
Removing NSP: To study the effect of the NSP loss, several alternative input formats were compared:
• SENTENCE-PAIR+NSP: Each input contains a pair of natural sentences, sampled from a contiguous portion
of one document or separate documents. The NSP loss is retained.
• FULL-SENTENCES: Each input is packed with full sentences sampled contiguously from single or cross
documents, such that the total length is at most 512 tokens. We remove the NSP loss.
• DOC-SENTENCES: Inputs are constructed similarly to FULL-SENTENCES, except that they may not cross document boundaries. Inputs sampled near the end of a document may be shorter than 512 tokens, so the batch size is dynamically increased in these cases to achieve a similar number of total tokens as FULL-SENTENCES. The NSP loss is removed.
Results:
• Removing the NSP loss matches or slightly improves downstream task performance compared with the original BERT trained with the NSP loss.
• Sequences drawn from a single document (DOC-SENTENCES) perform slightly better than sequences packed from multiple documents (FULL-SENTENCES), as shown in Table 1.
Table 1: Comparison of the performance of models with and without the NSP loss (taken from the RoBERTa paper).
Training with Large Batches: It was noticed that training the model with large mini-batches improves the perplexity of the MLM objective as well as end-task accuracy. Training for 1M steps with a batch size of 256 sequences has a computational cost equivalent to training for 31K steps with a batch size of 8K. Large batches are also easier to parallelize via distributed data-parallel training.
Byte-Pair Encoding: Here, Byte-Pair Encoding is applied over raw bytes instead of Unicode characters, and the BPE subword vocabulary is set to 50K units (still bigger than BERT's 30K vocabulary).
Figure: A quick example of Byte-Pair Encoding (BPE).
Despite slightly degrading end-task performance in some cases, this encoding was chosen because it is a universal scheme that does not need any additional preprocessing or tokenization rules.
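The difference between BERT's WordPiece vocabulary and a byte-level BPE vocabulary can be inspected with the transformers tokenizers; the checkpoints and sentence below are illustrative choices, and the exact token splits depend on the vocabularies.

from transformers import AutoTokenizer

text = "Byte-level BPE needs no preprocessing rules!"

wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")   # WordPiece, ~30K vocab
byte_bpe = AutoTokenizer.from_pretrained("roberta-base")         # byte-level BPE, ~50K vocab

print(wordpiece.tokenize(text))       # '##' marks continuation pieces
print(byte_bpe.tokenize(text))        # 'Ġ' marks a piece that starts with a space
print(len(wordpiece), len(byte_bpe))  # vocabulary sizes (roughly 30K vs 50K)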
It was observed that training BERT on larger datasets greatly improves its performance; hence, the training data was increased to 160GB of uncompressed text.
RoBERTa
RoBERTa (short for “Robustly Optimized BERT Approach”) is an advanced version of the BERT (Bidirectional
Encoder Representations from Transformers) model, created by researchers at Facebook AI. Similar to BERT,
RoBERTa is a transformer-based language model that employs self-attention to analyze input sequences and produce
contextualized word representations within a sentence.
RoBERTa vs BERT
A key difference between RoBERTa and BERT is that RoBERTa was trained on a significantly larger dataset and with
a more effective training procedure. Specifically, RoBERTa was trained on 160GB of text, over 10 times the size of the
dataset used for BERT. Additionally, RoBERTa employs a dynamic masking technique during training, which enhances
the model’s ability to learn more robust and generalizable word representations.
RoBERTa has demonstrated superior performance compared to BERT and other leading models on various natural
language processing tasks, such as language translation, text classification, and question answering. It has also served
as a foundational model for numerous successful NLP models and has gained popularity for both research and industrial
applications.
In summary, RoBERTa is a powerful and effective language model that has made significant contributions to NLP,
advancing progress across a wide range of applications.
The RoBERTa model shares the same architecture as the BERT model. It is a reimplementation of BERT with
modifications to key hyperparameters and minor adjustments to embeddings.
The general pre-training and fine-tuning procedures for BERT are illustrated in Figure 1. In BERT, the same architecture is used for both pre-training and fine-tuning, except for the output layers. The pre-trained model parameters are used to initialize models for various downstream tasks, and during fine-tuning all parameters are adjusted.
Figure 1: Architecture of the BERT model (pre-training and fine-tuning).
In contrast, RoBERTa does not use the next-sentence pretraining objective. Instead, it is trained with much larger mini-
batches and higher learning rates. RoBERTa employs a different pretraining scheme and replaces the character-level
BPE vocabulary with a byte-level BPE tokenizer (similar to GPT-2). Additionally, RoBERTa does not require the
definition of which token belongs to which segment, as it lacks token_type_ids. Segments can be easily divided using
the separation token tokenizer.sep_token (or </s>).
Furthermore, unlike the 16GB dataset originally used to train BERT, RoBERTa is trained on a massive dataset exceeding
160GB of uncompressed text. This dataset includes the 16GB of English Wikipedia and Books Corpus used in BERT,
along with additional data from the WebText corpus (38 GB), the CommonCrawl News dataset (63 million articles, 76
GB), and Stories from Common Crawl (31 GB). RoBERTa was pre-trained using this extensive dataset and 1024 V100
Tesla GPUs running for a day.
RoBERTa has a similar architecture to BERT, but to enhance performance, the authors made several simple design
changes to the architecture and training procedure. These changes include:
1. Removing the Next Sentence Prediction (NSP) Objective: In BERT, the model is trained to predict whether
two segments of a document are from the same or different documents using an auxiliary NSP loss. The authors
experimented with versions of the model with and without the NSP loss and found that removing the NSP loss
either matched or slightly improved performance on downstream tasks.
2. Training with Larger Batch Sizes and Longer Sequences: BERT was originally trained for 1 million steps with a batch size of 256 sequences. RoBERTa was instead trained for 125,000 steps with a batch size of 2,000 sequences, and for 31,000 steps with a batch size of 8,000 sequences. Larger batches improve perplexity on the masked language modeling objective and end-task accuracy. They are also easier to parallelize using distributed data-parallel training.
3. Dynamically Changing the Masking Pattern: In BERT, masking is done once during data preprocessing, resulting in a single static mask. To avoid this, the training data is duplicated and masked 10 times with different strategies over 40 epochs, so each sequence is seen with the same mask only 4 times. This static strategy was compared with dynamic masking, where a new mask is generated every time a sequence is passed into the model (a minimal sketch of dynamic masking is given below).
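The sketch below uses the transformers DataCollatorForLanguageModeling, which is one convenient way to re-mask inputs on the fly; the RoBERTa authors implemented dynamic masking inside their own training pipeline. The checkpoint and sentence are illustrative choices.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

encoded = tokenizer(["RoBERTa is trained with dynamic masking."])
example = {"input_ids": encoded["input_ids"][0]}

batch_1 = collator([example])   # masking is sampled fresh on every call,
batch_2 = collator([example])   # so the <mask> positions typically differ
print(batch_1["input_ids"])
print(batch_2["input_ids"])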
While RoBERTa is a powerful model, it’s not without its limitations. Here are some:
1. Computational Resources: Training and fine-tuning RoBERTa requires significant computational resources,
including powerful GPUs and large amounts of memory. This can make it challenging for individuals or
organizations with limited resources to utilize RoBERTa effectively.
2. Domain Specificity: Pre-trained language models like RoBERTa may not perform optimally on domain-
specific tasks or datasets without further fine-tuning. They may require additional training on domain-specific
data to achieve the desired level of performance.
3. Data Efficiency: RoBERTa and similar models require large amounts of data for pre-training, which might not
be available for all languages or domains. This reliance on extensive data can limit their applicability in settings
where data is scarce or expensive to acquire.
4. Interpretability: The black-box nature of RoBERTa can make it difficult to interpret how the model arrives at
its predictions. Understanding the inner workings of the model and diagnosing errors or biases can be
challenging, especially in complex applications or sensitive domains.
5. Fine-tuning Challenges: While fine-tuning RoBERTa for specific tasks can improve performance, it requires
expertise and experimentation to select the right hyperparameters, data augmentation techniques, and training
strategies. This process can be time-consuming and resource-intensive.
6. Bias and Fairness: Pre-trained language models like RoBERTa can inherit biases present in the training data,
leading to biased or unfair predictions. Addressing bias and ensuring fairness in AI models remains a
significant challenge, requiring careful data curation and model design considerations.
Text Classification
Text Classification is the task of assigning a label or class to a given text. Some use cases are sentiment analysis, natural
language inference, and assessing grammatical correctness.
Example output of a sentiment analysis model:
POSITIVE: 0.900
NEUTRAL: 0.100
NEGATIVE: 0.000
Use Cases
You can track the sentiments of your customers from the product reviews using sentiment analysis models. This can
help understand churn and retention by grouping reviews by sentiment, to later analyze the text and make strategic
decisions based on this knowledge.
Task Variants
Natural Language Inference (NLI)
In NLI the model determines the relationship between two given texts. Concretely, the model takes a premise and a hypothesis and returns a class that can either be:
• entailment, which means the hypothesis is true given the premise,
• contradiction, which means the hypothesis is false given the premise, or
• neutral, which means there is no relation between the hypothesis and the premise.
The benchmark dataset for this task is GLUE (General Language Understanding Evaluation). NLI models have different variants, such as Multi-Genre NLI, Question NLI and Winograd NLI.
Example 1:
Premise: A man inspects the uniform of a figure in some East Asian country.
Hypothesis: The man is sleeping.
Label: Contradiction
Example 2:
Premise: Soccer game with multiple males playing.
Hypothesis: Some men are playing a sport.
Label: Entailment
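A minimal sketch of running an NLI model with transformers is shown below; the roberta-large-mnli checkpoint and the premise/hypothesis pair are illustrative choices.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "roberta-large-mnli"                    # a RoBERTa model fine-tuned on MultiNLI
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = "A man inspects the uniform of a figure in some East Asian country."
hypothesis = "The man is sleeping."

inputs = tokenizer(premise, hypothesis, return_tensors="pt")   # encodes the text pair
with torch.no_grad():
    logits = model(**inputs).logits
probs = logits.softmax(dim=-1)[0]

for idx, label in model.config.id2label.items():
    print(f"{label}: {probs[idx].item():.3f}")       # contradiction should score highest here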
QNLI
QNLI is the task of determining if the answer to a certain question can be found in a given document. If the answer can be found the label is "entailment"; if the answer cannot be found the label is "not entailment".
Example 1:
Question: What percentage of marine life died during the extinction?
Sentence: It is also known as the "Great Dying" because it is considered the largest mass extinction in the Earth's history.
Label: not entailment
Example 2:
Question: Who was the London Weekend Television's Managing Director?
Sentence: The managing director of London Weekend Television (LWT), Greg Dyke, met with the representatives of the "big five" football clubs in England in 1990.
Label: entailment
Sentiment Analysis
In Sentiment Analysis, the classes can be polarities like positive, negative, neutral, or sentiments such as happiness or
anger.
Quora Question Pairs
Quora Question Pairs models assess whether two provided questions are paraphrases of each other. The model takes two questions and returns a binary value, with 0 mapped to "not paraphrase" and 1 to "paraphrase". The benchmark dataset is Quora Question Pairs inside the GLUE benchmark. The dataset consists of question pairs and their labels.
Question1: “How can I increase the speed of my internet connection while using a VPN?”
Label: Paraphrase
Grammatical Correctness
Linguistic Acceptability is the task of assessing the grammatical acceptability of a sentence. The classes in this task are
“acceptable” and “unacceptable”. The benchmark dataset used for this task is Corpus of Linguistic Acceptability
(CoLA). The dataset consists of texts and their labels.
Example:
"Books were sent to each other by the students."
Label: Unacceptable
"She voted for herself."
Label: Acceptable
Text Generation
Text generation is the task of producing new text given another text as input. These models can, for example, fill in incomplete text or paraphrase.
Input: "Once upon a time,"
Output: "Once upon a time, we knew that our ancestors were on the verge of extinction. The great explorers and poets of the Old World, from Alexander the Great to Chaucer, are dead and gone. A good many of our ancient explorers and poets have ..."
This task covers guides on both text-generation and text-to-text generation models. Popular large language models that
are used for chats or following instructions are also covered in this task. You can find the list of selected open-source
large language models here, ranked by their performance scores.
Use Cases
Code Generation
A Text Generation model, also known as a causal language model, can be trained on code from scratch to help the
programmers in their repetitive coding tasks. One of the most popular open-source models for code generation is
StarCoder, which can generate code in 80+ languages. You can try it here.
Stories Generation
A story generation model can receive an input like "Once upon a time" and proceed to create a story-like text based on
those first words. You can try this application which contains a model trained on story generation, by MosaicML.
If the available generative models were trained on data that differs from your use case, you can train a causal language model from scratch. Learn how to do it in the free transformers course!
Task Variants
Completion Generation Models
A popular variant of text generation models predicts the next word given a sequence of words; word by word, a longer text is formed.
The most popular models for this task are GPT-based models, Mistral or Llama series. These models are trained on data
that has no labels, so you just need plain text to train your own model. You can train text generation models to generate
a wide variety of documents, from code to stories.
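A minimal sketch with the transformers text-generation pipeline is shown below; the small GPT-2 checkpoint and the prompt are illustrative choices, and larger instruction-tuned models are used the same way.

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Once upon a time,", max_new_tokens=30, num_return_sequences=1)
print(result[0]["generated_text"])   # the prompt followed by a model-written continuation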
Text-to-Text Generation Models
These models are trained to learn the mapping between a pair of texts (e.g. translation from one language to another). The most popular variants of these models are NLLB, FLAN-T5 and BART. Text-to-Text models are trained with multi-tasking capabilities, so they can accomplish a wide range of tasks, including summarization, translation and text classification.
When it comes to text generation, the underlying language model can come in several types:
Base models: plain language models like Mistral 7B and Meta Llama-3-70b. These models are good for fine-tuning and few-shot prompting.
Instruction-trained models: these models are trained in a multi-task manner to follow a broad range of instructions
like "Write me a recipe for chocolate cake". Models like Qwen 2 7B, Yi 1.5 34B Chat, and Meta Llama 70B Instruct
are examples of instruction-trained models. In general, instruction-trained models will produce better responses to
instructions than base models.
Human feedback models: these models extend base and instruction-trained models by incorporating human feedback
that rates the quality of the generated text according to criteria like helpfulness, honesty, and harmlessness. The human
feedback is then combined with an optimization technique like reinforcement learning to align the original model more closely with human preferences. The overall methodology is often called Reinforcement Learning from Human
Feedback, or RLHF for short. Zephyr ORPO 141B A35B is an open-source model aligned through human feedback.
There are language models that can input both text and image and output text, called vision language models. IDEFICS
2 and MiniCPM Llama3 V are good examples. They accept the same generation parameters as other language models.
However, since they also take images as input, you have to use them with the image-to-text pipeline. You can find more
information about this in the image-to-text task page.
Text-to-Text generation models have a separate pipeline called text2text-generation. This pipeline takes an input
containing the sentence including the task and returns the output of the accomplished task.
from transformers import pipeline

text2text_generator = pipeline("text2text-generation")
text2text_generator("question: What is 42 ? context: 42 is the answer to life, the universe and everything")
You can also use huggingface.js to run text classification models on the Hugging Face Hub from JavaScript, for example with an HfInference client from the @huggingface/inference package:

await hf.textClassification({
  model: "distilbert-base-uncased-finetuned-sst-2-english",
  inputs: "I love Transformers!",
});