Unit 4 NLP
Models, Self-attention, multi-headed attention, BERT, RoBERTa, Fine Tuning for downstream tasks, Text
classification and Text generation
Recurrent Neural Networks (RNNs) work a little differently from regular neural networks. In a regular neural network, information flows in one direction, from input to output. In an RNN, however, information is fed back into the system after each step. Think of it like reading a sentence: when you are trying to predict the next word, you don't just look at the current word, you also need to remember the words that came before to make an accurate guess.
RNNs allow the network to "remember" past information by feeding the output from one step into the next step. This helps the network understand the context of what has already happened and make better predictions based on that. For example, when predicting the next word in a sentence, the RNN uses the previous words to help decide which word is most likely to come next.
Feedforward Neural Networks (FNNs) process data in one direction from input to output without retaining information
from previous inputs. This makes them suitable for tasks with independent inputs like image classification. However,
FNNs struggle with sequential data since they lack memory.
Recurrent Neural Networks (RNNs) solve this by incorporating loops that allow information from previous steps
to be fed back into the network. This feedback enables RNNs to remember prior inputs making them ideal for tasks
where context is important.
Recurrent neural networks (RNNs) set themselves apart from other neural networks with their unique capabilities:
• Internal Memory: This is the key feature of RNNs. It allows them to remember past inputs and use that context
when processing new information.
• Sequential Data Processing: Because of their memory, RNNs are exceptional at handling sequential data where
the order of elements matters. This makes them ideal for speech recognition, machine translation, natural
language processing (NLP) and text generation.
• Contextual Understanding: RNNs can analyze the current input in relation to what they’ve “seen” before. This
contextual understanding is crucial for tasks where meaning depends on prior information.
• Dynamic Processing: RNNs can continuously update their internal memory as they process new data, allowing
them to adapt to changing patterns within a sequence.
1. Recurrent Neurons
The fundamental processing unit in RNN is a Recurrent Unit. Recurrent units hold a hidden state that maintains
information about previous inputs in a sequence. Recurrent units can “remember” information from prior steps by
feeding back their hidden state, allowing them to capture dependencies across time.
2. RNN Unfolding
RNN unfolding or unrolling is the process of expanding the recurrent structure over time steps. During unfolding each
step of the sequence is represented as a separate layer in a series illustrating how information flows across each time
step.
This unrolling enables backpropagation through time (BPTT), a learning process in which errors are propagated across time steps to adjust the network's weights, enhancing the RNN's ability to learn dependencies within sequential data.
RNNs share similarities in input and output structures with other deep learning architectures but differ significantly in
how information flows from input to output. Unlike traditional deep neural networks, where each dense layer has
distinct weight matrices, RNNs use shared weights across time steps, allowing them to remember information over
sequences.
In RNNs, the hidden state hi is calculated for every input xi to retain sequential dependencies. The computations follow these core formulas:
1. Hidden State Calculation:
h = σ(U⋅X + W⋅ht−1 + B)
Here, h represents the current hidden state, U and W are weight matrices, and B is the bias.
2. Output Calculation:
Y = O(V⋅h + C)
The output Y is calculated by applying O, an activation function, to the weighted hidden state, where V and C represent the output weights and bias.
Working of RNN
At each time step RNNs process units with a fixed activation function. These units have an internal hidden state that
acts as memory that retains information from previous time steps. This memory allows the network to store past
knowledge and adapt based on new inputs.
The current hidden state ht depends on the previous state ht−1 and the current input xt, and is calculated using the
following relations:
1. State Update:
ht = f(ht−1, xt)
where ht is the current state, ht−1 is the previous state and xt is the input at the current time step. With a tanh activation, this becomes:
ht = tanh(Whh⋅ht−1 + Wxh⋅xt)
Here, Whh is the weight matrix for the recurrent neuron and Wxh is the weight matrix for the input neuron.
2. Output Calculation:
yt = Why⋅ht
where yt is the output and Why is the weight matrix at the output layer.
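The recurrence above can be written out directly. The following is a minimal NumPy sketch of an RNN forward pass; the layer sizes, random weights and the 5-step input sequence are illustrative assumptions, not values from the text.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions)
input_size, hidden_size, output_size = 4, 8, 3

# Weights are shared across all time steps
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden (Wxh)
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden (Whh)
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))  # hidden -> output (Why)

def rnn_forward(xs):
    """xs: list of input vectors, one per time step."""
    h = np.zeros(hidden_size)                 # initial hidden state
    outputs = []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t)    # state update: ht = tanh(Whh.ht-1 + Wxh.xt)
        outputs.append(W_hy @ h)              # output: yt = Why.ht
    return outputs, h

sequence = [rng.normal(size=input_size) for _ in range(5)]
ys, final_h = rnn_forward(sequence)
print(len(ys), final_h.shape)                 # 5 outputs, hidden state of size 8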
These parameters are updated using backpropagation. However, since RNNs work on sequential data, an adapted form of backpropagation known as backpropagation through time is used.
Backpropagation Through Time (BPTT) in RNNs
Since RNNs process sequential data, Backpropagation Through Time (BPTT) is used to update the network's parameters. The loss function L(θ) depends on the final hidden state (for example, h3 for a three-step sequence), and each hidden state relies on the preceding ones, forming a sequential dependency chain.
In BPTT, gradients are backpropagated through each time step. This is essential for updating network parameters based
on temporal dependencies.
Issues in RNN
In recurrent neural networks (RNNs), the exploding gradient and vanishing gradient problems occur due to the
repeated multiplication of gradients through many time steps during backpropagation. These issues affect the learning
process and model performance.
1. Vanishing Gradient Problem
• When backpropagating through time (BPTT), if the gradient values are less than 1, they keep getting smaller
as they propagate backward.
• This leads to very small weight updates, making the network slow to learn or even stop learning altogether.
• Cause: Activation functions like sigmoid or tanh squash their inputs into (0, 1) or (−1, 1), so their derivatives are small; multiplying many such small derivatives makes the gradients shrink.
2. Exploding Gradient Problem
• If the gradient values are greater than 1, they increase exponentially as they propagate backward.
• This leads to very large weight updates, causing the network to become unstable and diverge.
• Cause: Large weight values or repeated multiplication of large gradients.
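As a toy illustration of both problems, the gradient that reaches the earliest time step contains a product of one factor per step. The sketch below (with assumed per-step factors of 0.9 and 1.1 over 50 steps) shows how such a product either vanishes or explodes.

import numpy as np

steps = 50
grad_vanish = np.prod(np.full(steps, 0.9))   # factors < 1: product shrinks toward 0
grad_explode = np.prod(np.full(steps, 1.1))  # factors > 1: product grows rapidly

print(f"0.9 multiplied over {steps} steps: {grad_vanish:.6f}")   # ~0.005
print(f"1.1 multiplied over {steps} steps: {grad_explode:.2f}")  # ~117

In practice, exploding gradients are often handled by gradient clipping, while vanishing gradients motivated gated architectures such as the LSTM described below.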
There are four types of RNNs based on the number of inputs and outputs in the network:
1. One-to-One RNN
This is the simplest type of neural network architecture where there is a single input and a single output. It is used for
straightforward classification tasks such as binary classification where no sequential data is involved.
2. One-to-Many RNN
In a One-to-Many RNN the network processes a single input to produce multiple outputs over time. This is useful in
tasks where one input triggers a sequence of predictions (outputs). For example in image captioning a single image can
be used as input to generate a sequence of words as a caption.
Example: Music Generation
Input: A single note Output: A sequence of notes
3. Many-to-One RNN
The Many-to-One RNN receives a sequence of inputs and generates a single output. This type is useful when the
overall context of the input sequence is needed to make one prediction. In sentiment analysis the model receives a
sequence of words (like a sentence) and produces a single output like positive, negative or neutral.
4. Many-to-Many RNN
The Many-to-Many RNN type processes a sequence of inputs and generates a sequence of outputs. In language
translation task a sequence of words in one language is given as input, and a corresponding sequence in another language
is generated as output.
To solve the problem of vanishing and exploding gradients in a deep recurrent neural network, many variations were developed. One of the most famous of them is the Long Short-Term Memory network (LSTM). In concept, an LSTM recurrent unit tries to "remember" all the past knowledge that the network has seen so far and to "forget" irrelevant data. This is done by introducing different activation-function layers called "gates" for different purposes. Each LSTM recurrent unit also maintains a vector called the internal cell state, which conceptually describes the information that was chosen to be retained by the previous LSTM recurrent unit.
LSTM Architecture
The LSTM architecture involves a memory cell which is controlled by three gates: the input gate, the forget gate and the output gate. These gates decide what information is added to, removed from and output from the memory cell.
• Input gate: Controls what information is added to the memory cell.
• Forget gate: Determines what information is removed from the memory cell.
• Output gate: Controls what information is output from the memory cell.
This allows LSTM networks to selectively retain or discard information as it flows through the network which allows
them to learn long-term dependencies. The network has a hidden state which is like its short-term memory. This memory
is updated using the current input, the previous hidden state and the current state of the memory cell.
Working of LSTM
LSTM architecture has a chain structure that contains four neural networks and different memory blocks called cells.
Information is retained by the cells and the memory manipulations are done by the gates. There are three gates –
Forget Gate
The information that is no longer useful in the cell state is removed with the forget gate. Two inputs, xt (the input at the current time step) and ht−1 (the previous hidden state), are fed to the gate and multiplied with weight matrices, followed by the addition of a bias. The result is passed through a sigmoid activation function, which gives an output between 0 and 1. If the output for a particular cell-state element is close to 0, that piece of information is forgotten; if it is close to 1, the information is retained for future use.
ft = σ(Wf⋅[ht−1, xt] + bf)
where:
• [ht-1, xt] denotes the concatenation of the current input and the previous hidden state.
Input gate
The addition of useful information to the cell state is done by the input gate. First, the information is regulated using a sigmoid function, which filters the values to be remembered (similar to the forget gate) using the inputs ht−1 and xt. Then a candidate vector is created using the tanh function, which gives an output between −1 and +1 containing all the possible values from ht−1 and xt. Finally, the values of the candidate vector and the regulated values are multiplied to obtain the useful information. The equations for the input gate are:
it = σ(Wi⋅[ht−1, xt] + bi)
Ĉt = tanh(Wc⋅[ht−1, xt] + bc)
We multiply the previous cell state by ft, discarding the information we had previously chosen to forget. Next, we add it⊙Ĉt, the candidate values scaled by how much we chose to update each state value:
Ct = ft⊙Ct−1 + it⊙Ĉt
where ⊙ denotes element-wise multiplication.
Output gate
The task of extracting useful information from the current cell state to be presented as output is done by the output gate. First, a vector is generated by applying the tanh function to the cell state. Then the information is regulated using the sigmoid function, filtered by the values to be remembered, using the inputs ht−1 and xt. Finally, the values of the vector and the regulated values are multiplied and sent as the output and as the hidden state for the next cell. The equations for the output gate are:
ot = σ(Wo⋅[ht−1, xt] + bo)
ht = ot⊙tanh(Ct)
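Putting the three gates together, the following is a minimal NumPy sketch of a single LSTM step using the equations above; the layer sizes, random weights and the short input sequence are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM step; every W acts on the concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)              # forget gate
    i_t = sigmoid(W_i @ z + b_i)              # input gate
    c_hat = np.tanh(W_c @ z + b_c)            # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat          # new cell state
    o_t = sigmoid(W_o @ z + b_o)              # output gate
    h_t = o_t * np.tanh(c_t)                  # new hidden state
    return h_t, c_t

rng = np.random.default_rng(1)
hidden, inputs = 6, 4                         # illustrative sizes (assumptions)
shape = (hidden, hidden + inputs)
W_f, W_i, W_c, W_o = (rng.normal(scale=0.1, size=shape) for _ in range(4))
b_f, b_i, b_c, b_o = (np.zeros(hidden) for _ in range(4))

h, c = np.zeros(hidden), np.zeros(hidden)
for _ in range(3):                            # a short 3-step sequence
    h, c = lstm_step(rng.normal(size=inputs), h, c,
                     W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o)
print(h.shape, c.shape)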
Attention Mechanism
Imagine you are trying to understand a complex image or translate a sentence from one language to another. Your brain
instinctively focuses on specific parts of the image or particular words in the sentence that are most relevant to your
task. This selective focus is what we refer to as attention, and it’s a fundamental aspect of human cognition. Attention
mechanisms in deep learning aim to mimic this selective focus process in artificial neural networks.
At its core, an attention mechanism allows a model to focus on different parts of the input data with varying degrees
of importance. It assigns weights to each element in the input sequence, highlighting the elements that are most
relevant to the task at hand. This not only enhances the model’s understanding of the data but also improves its
performance in tasks like language translation, image captioning, and more.
Key Takeaway: Attention mechanisms enable neural networks to mimic human-like selective focus, improving their
ability to process and understand complex data.
Attention mechanisms have become indispensable in various deep-learning applications due to their ability to address
some critical challenges:
1. Long Sequences: Traditional neural networks struggle with processing long sequences, such as translating a
paragraph from one language to another. Attention mechanisms allow models to focus on the relevant parts of
the input, making them more effective at handling lengthy data.
2. Contextual Understanding: In tasks like language translation, understanding the context of a word is crucial
for accurate translation. Attention mechanisms enable models to consider the context by assigning different
attention weights to each word in the input sequence.
3. Improved Performance: Models equipped with attention mechanisms often outperform their non-attention
counterparts. They achieve state-of-the-art results in tasks like machine translation, image classification, and
speech recognition.
Let's say we want to translate the English sentence: "I love deep learning"
A traditional sequence-to-sequence (Seq2Seq) model with an LSTM or GRU processes each word sequentially and
generates an output word-by-word. However, this approach struggles with long sentences because the entire meaning
is compressed into a fixed-length context vector.
Instead of encoding everything into a single vector, attention assigns different importance (weights) to different
input words while generating each output word.
For example, when translating "love" to "adore", the model should focus more on "love" rather than the other words.
Attention mechanism computes a score for each input word at every decoding step and dynamically decides which
parts of the input sequence are most relevant.
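The following is a minimal NumPy sketch of this idea for the example sentence. The encoder hidden states and the decoder state are made-up numbers, chosen only so that "love" ends up with the largest attention weight.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

words = ["I", "love", "deep", "learning"]
encoder_states = np.array([[0.1, 0.0, 0.2],     # assumed encoder hidden states, one per word
                           [0.9, 0.8, 0.7],
                           [0.3, 0.1, 0.4],
                           [0.2, 0.5, 0.3]])
decoder_state = np.array([1.0, 0.9, 0.8])       # assumed decoder state while producing "adore"

scores = encoder_states @ decoder_state         # one score per input word
weights = softmax(scores)                       # attention weights sum to 1
context = weights @ encoder_states              # weighted sum of encoder states

for word, weight in zip(words, weights):
    print(f"{word:9s} {weight:.2f}")            # "love" receives the largest weight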
Self-Attention
Self-Attention (also called Scaled Dot-Product Attention) is a mechanism where a model learns to weigh different
parts of the same input sequence to determine which are most important for making predictions. This is a key
component of Transformers (e.g., BERT, GPT).
Consider a sentence such as "The cat sat on the mat because it was soft." Here, the word "it" refers to "the mat", not "the cat". Traditional models (like RNNs and LSTMs) struggle to capture such long-range dependencies. Self-attention helps by dynamically focusing on the important words.
Each word in the sentence is first converted into a word embedding (a vector representation), and from each embedding the model derives three vectors: a Query (Q), a Key (K) and a Value (V).
Attention scores determine how much focus one word should give to another. They are computed as a scaled dot product of the queries and keys:
Score = QKᵀ / √dk
Applying a softmax to these scores for the whole sentence yields an attention matrix. In this example, the row for "sat" places most of its weight on itself (0.45) and some weight on "cat" (0.30).
Each word's new representation is then the weighted sum of the Value (V) vectors:
Output = Attention Weights × V
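A minimal NumPy sketch of scaled dot-product self-attention is given below; the sequence length, embedding size and random projection matrices are illustrative assumptions.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over one sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # Score = QK^T / sqrt(d_k)
    weights = softmax(scores, axis=-1)         # each row sums to 1 (the attention matrix)
    return weights @ V, weights                # new token representations, attention matrix

rng = np.random.default_rng(2)
seq_len, d_model, d_k = 6, 8, 4                # illustrative sizes (assumptions)
X = rng.normal(size=(seq_len, d_model))        # stand-in word embeddings
W_q, W_k, W_v = (rng.normal(scale=0.3, size=(d_model, d_k)) for _ in range(3))

out, attn = self_attention(X, W_q, W_k, W_v)
print(out.shape)                               # (6, 4): one new vector per token
print(attn.sum(axis=-1))                       # every row of the attention matrix sums to 1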
Multi-Head Attention
Multi-Head Attention (MHA) is an enhanced version of self-attention used in Transformer models (e.g., BERT,
GPT, ViTs). It allows the model to attend to different parts of the input simultaneously, improving its ability to
capture complex relationships.
In self-attention, a single attention mechanism focuses on one type of relationship at a time. However, language and
images have multiple aspects to capture.
Example sentence: "She saw the jaguar in the zoo and was amazed by its speed."
If we use only one attention mechanism, the model might focus only on "zoo" and ignore "speed." Multi-head
attention solves this by using multiple attention heads to learn different aspects of meaning!
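A minimal NumPy sketch of multi-head attention is shown below: several independent attention heads run in parallel and their outputs are concatenated (in a real Transformer a final linear projection follows). The sizes and random weights are illustrative assumptions.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads):
    """heads: list of (W_q, W_k, W_v) tuples, one per attention head."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        outputs.append(softmax(scores) @ V)        # each head attends independently
    return np.concatenate(outputs, axis=-1)        # concatenate the heads

rng = np.random.default_rng(3)
seq_len, d_model, n_heads = 5, 16, 4               # illustrative sizes (assumptions)
d_head = d_model // n_heads
X = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(scale=0.3, size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]

print(multi_head_attention(X, heads).shape)        # (5, 16): d_head * n_heads per token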
Transformers are a type of deep learning architecture that have revolutionized the field of natural language processing
(NLP) in recent years. They are widely used for tasks such as language translation, text classification, sentiment
analysis, and more.
History
The transformer architecture was first introduced in a 2017 paper by Google researchers Vaswani et al. called “Attention
Is All You Need”. This paper proposed a novel approach to NLP tasks that relied solely on the self-attention mechanism,
a type of attention mechanism that allows the model to weigh the importance of different words in a sentence when
encoding it into a fixed-size vector representation. This was a departure from previous NLP models that relied on
recurrent neural networks (RNNs) or convolutional neural networks (CNNs) to process sequences of words.
The transformer architecture was revolutionary in that it allowed for much faster training times and better parallelization
on GPUs, since the self-attention mechanism could be computed in parallel for all words in a sequence. This made it
possible to train much larger models on much larger datasets, leading to significant improvements in NLP performance.
Transformer
The transformer architecture is composed of an encoder and a decoder, each of which is made up of multiple layers of
self-attention and feedforward neural networks. The self-attention mechanism is the heart of the transformer, allowing
the model to weigh the importance of different words in a sentence based on their affinity with each other. This is similar
to how a human might read a sentence, focusing on the most relevant parts of the text rather than reading it linearly
from beginning to end.
In addition to self-attention, the transformer also introduces positional encodings, which allow the model to keep track of the positions of words in a sentence. This is important because the order of words in a sentence can significantly impact its meaning.
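The original "Attention Is All You Need" paper uses fixed sinusoidal positional encodings that are simply added to the word embeddings. A minimal NumPy sketch is shown below; the sequence length and model dimension are illustrative.

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    positions = np.arange(max_len)[:, None]                    # (max_len, 1)
    dims = np.arange(d_model)[None, :]                         # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                      # sine on even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                      # cosine on odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)                                                # (50, 16), added to the embeddings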
Transformer: Encoder-Decoder
The transformer encoder-decoder architecture is used for tasks like language translation, where the model must take in
a sentence in one language and output a sentence in another language. The encoder takes in the input sentence and
produces a fixed-size vector representation of it, which is then fed into the decoder to generate the output sentence. The
decoder uses both self-attention and cross-attention, where the attention mechanism is applied to the output of the
encoder and the input of the decoder.
One of the most popular transformer encoder-decoder models is the T5 (Text-to-Text Transfer Transformer), which was
introduced by Google in 2019. The T5 can be fine-tuned for a wide range of NLP tasks, including language translation,
question answering, summarization, and more.
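As an illustration, a pre-trained T5 checkpoint can be used through the Hugging Face transformers pipeline; the small t5-small checkpoint, the task and the example sentence below are illustrative choices.

from transformers import pipeline

# T5 casts every task as text-to-text; the pipeline adds the task prefix for you
translator = pipeline("translation_en_to_de", model="t5-small")
print(translator("I love deep learning."))
# prints a German translation, e.g. [{'translation_text': '...'}]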
Real-world examples of the transformer encoder-decoder architecture include neural machine translation systems such as Google Translate, and Facebook's M2M-100, a massive multilingual machine translation model that can translate between 100 different languages.
Transformer: Encoder
The transformer encoder architecture is used for tasks like text classification, where the model must classify a piece of
text into one of several predefined categories, such as sentiment analysis, topic classification, or spam detection. The
encoder takes in a sequence of tokens and produces a fixed-size vector representation of the entire sequence, which can
then be used for classification.
One of the most popular transformer encoder models is BERT (Bidirectional Encoder Representations from
Transformers), which was introduced by Google in 2018. BERT is pre-trained on large amounts of text data and can be
fine-tuned for a wide range of NLP tasks.
Unlike the encoder-decoder architecture, the transformer encoder is only concerned with the input sequence and does
not generate any output sequence. It applies self-attention mechanism to the input tokens, allowing it to focus on the
most relevant parts of the input for the given task.
Real-world examples of the transformer encoder architecture include sentiment analysis, where the model must classify
a given review as positive or negative, and email spam detection, where the model must classify a given email as spam
or not spam.
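As a concrete illustration, a fine-tuned encoder model can be used for sentiment analysis through the transformers pipeline; the DistilBERT checkpoint below (fine-tuned on SST-2) and the example sentences are illustrative choices.

from transformers import pipeline

classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

print(classifier("This movie was absolutely wonderful!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
print(classifier("The package arrived late and damaged."))
# e.g. [{'label': 'NEGATIVE', 'score': ...}]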
Transformer: Decoder
The transformer decoder architecture is used for tasks like language generation, where the model must generate a
sequence of words based on an input prompt or context. The decoder takes in a fixed-size vector representation of the
context and uses it to generate a sequence of words one at a time, with each word being conditioned on the previously
generated words.
One of the most popular transformer decoder models is the GPT-3 (Generative Pre-trained Transformer 3), which was
introduced by OpenAI in 2020. The GPT-3 is a massive language model that can generate human-like text in a wide
range of styles and genres.
The transformer decoder architecture introduces a technique called triangle masking for attention, which ensures that
the attention mechanism only looks at tokens to the left of the current token being generated. This prevents the model
from “cheating” by looking at tokens that it hasn’t generated yet.
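A minimal NumPy sketch of such a triangle (causal) mask is shown below: positions above the diagonal are set to a very large negative number before the softmax, so each token can only attend to itself and to tokens on its left. The raw scores are random illustrative values.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
scores = np.random.default_rng(4).normal(size=(seq_len, seq_len))  # stand-in Q.K^T scores

mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)        # True above the diagonal
scores[mask] = -1e9                                                 # effectively -inf
weights = softmax(scores, axis=-1)

print(np.round(weights, 2))   # upper triangle is ~0: no attention to future tokens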
Real-world examples of the transformer decoder architecture include text generation, where the model must generate a
story or article based on a given prompt or topic, and chatbots, where the model must generate responses to user inputs
in a natural and engaging way.
Drawbacks
1. High computational cost due to the attention mechanism, which increases quadratically with sequence length.
2. Difficulty in interpretation and debugging due to the attention mechanism operating over the entire input
sequence.
Despite these downsides, the transformer architecture remains a powerful and widely-used tool in NLP, and research is
ongoing to mitigate its computational requirements and improve its interpretability and robustness.
BERT
BERT is a tool/model which understands language better than any model before it. It is freely available and incredibly versatile, as it can solve a large number of language-related problems. You may have used BERT without even knowing it: if you have used Google Search, you have already used BERT.
BERT is based on the Transformer model architecture. Examining the model as if it were a single black box, a machine
translation application would take a sentence in one language and translate it into a different language.
• Basic Transformer consists of an encoder to read the text input and a decoder to produce a prediction for the
task.
• Since BERT's goal is to generate a language representation model, it only needs the encoder part. Hence, BERT is basically a trained stack of Transformer encoders.
Figure: A basic structure of an encoder block.
Training of BERT
During pretraining, BERT uses two objectives: masked language modeling and next sentence prediction.
• Masked Language Modeling (MLM) randomly selects 15% of the input tokens; 80% of these are replaced with a [MASK] token, and the model uses the remaining tokens to predict the masked (missing) words (a minimal example is sketched below).
• Next Sentence Prediction (NSP) is a binary classification loss for predicting whether two segments follow each other in the original document or come from different documents, encouraging the model to capture sentence-level semantics.
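A minimal sketch with the transformers fill-mask pipeline is given here; the bert-base-uncased checkpoint and the sentence are illustrative choices.

from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("The cat sat on the [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))   # top candidate words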
BERT is significantly undertrained, and the following areas leave scope for modification (these observations motivated RoBERTa, discussed below):
1. Static Masking: The masking is done only once during data preprocessing, resulting in a single static mask. Hence, the same input masks are fed to the model on every epoch.
2. Next Sentence Prediction: Each input has a pair of segments, which can each contain multiple natural sentences, but the total combined length must be less than 512 tokens. It was noticed that individual sentences hurt performance on downstream tasks, the hypothesis being that the model is not able to learn long-range dependencies; hence the authors could experiment with removing or keeping the NSP loss to see its effect on performance.
3. Text Encoding: The original BERT implementation uses a character-level BPE vocabulary of size 30K; BERT uses the WordPiece method, a language-modelling-based variant of Byte-Pair Encoding.
4. Training Scale: Originally, BERT is trained for 1M steps with a batch size of 256 sequences, which shows room for improvement in perplexity on the Masked Language Modelling objective.
Removing NSP: To study the effect of the NSP loss, several alternative input formats were compared:
• SENTENCE-PAIR+NSP: Each input contains a pair of natural sentences, sampled from a contiguous portion
of one document or separate documents. The NSP loss is retained.
• FULL-SENTENCES: Each input is packed with full sentences sampled contiguously from single or cross
documents, such that the total length is at most 512 tokens. We remove the NSP loss.
• DOC-SENTENCES: Inputs are constructed similarly to FULL-SENTENCES, except that they may not cross document boundaries. Inputs sampled near the end of a document may be shorter than 512 tokens, so the batch size is dynamically increased in these cases to achieve a similar number of total tokens as FULL-SENTENCES. The NSP loss is removed.
Results:
• Removing the NSP loss matches or slightly improves downstream task performance compared with the original BERT trained with the NSP loss.
• Sequences drawn from a single document (DOC-SENTENCES) perform slightly better than sequences packed from multiple documents (FULL-SENTENCES), as shown in Table 1.
Table 1: Comparison of the performance of models with and without the NSP loss (taken from the RoBERTa paper).
Training with Large Batches: It was noticed that training the model with large mini-batches improves the perplexity of the MLM objective as well as end-task accuracy. Training for 1M steps with a batch size of 256 sequences has a computational cost equivalent to training for 31K steps with a batch size of 8K. Large batches are also easier to parallelize via distributed data-parallel training.
Byte-Pair Encoding: Here, Byte-Pair Encoding is applied over raw bytes instead of Unicode characters, and the BPE subword vocabulary is set to 50K units (still bigger than BERT's 30K vocabulary).
Figure: A quick example of Byte-Pair Encoding (BPE).
Despite slightly degrading end-task performance in some cases, this encoding was chosen because it is a universal scheme that does not need any additional preprocessing or tokenization rules.
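The difference between BERT's WordPiece vocabulary and a byte-level BPE vocabulary can be inspected with the transformers tokenizers; the checkpoints and sentence below are illustrative choices, and the exact token splits depend on the vocabularies.

from transformers import AutoTokenizer

text = "Byte-level BPE needs no preprocessing rules!"

wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")   # WordPiece, ~30K vocab
byte_bpe = AutoTokenizer.from_pretrained("roberta-base")         # byte-level BPE, ~50K vocab

print(wordpiece.tokenize(text))       # '##' marks continuation pieces
print(byte_bpe.tokenize(text))        # 'Ġ' marks a piece that starts with a space
print(len(wordpiece), len(byte_bpe))  # vocabulary sizes (roughly 30K vs 50K)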
It was observed that training BERT on larger datasets greatly improves its performance; hence, the training data was increased to 160GB of uncompressed text.
RoBERTa
RoBERTa (short for “Robustly Optimized BERT Approach”) is an advanced version of the BERT (Bidirectional
Encoder Representations from Transformers) model, created by researchers at Facebook AI. Similar to BERT,
RoBERTa is a transformer-based language model that employs self-attention to analyze input sequences and produce
contextualized word representations within a sentence.
RoBERTa vs BERT
A key difference between RoBERTa and BERT is that RoBERTa was trained on a significantly larger dataset and with
a more effective training procedure. Specifically, RoBERTa was trained on 160GB of text, over 10 times the size of the
dataset used for BERT. Additionally, RoBERTa employs a dynamic masking technique during training, which enhances
the model’s ability to learn more robust and generalizable word representations.
RoBERTa has demonstrated superior performance compared to BERT and other leading models on various natural
language processing tasks, such as language translation, text classification, and question answering. It has also served
as a foundational model for numerous successful NLP models and has gained popularity for both research and industrial
applications.
In summary, RoBERTa is a powerful and effective language model that has made significant contributions to NLP,
advancing progress across a wide range of applications.
The RoBERTa model shares the same architecture as the BERT model. It is a reimplementation of BERT with
modifications to key hyperparameters and minor adjustments to embeddings.
The general pre-training and fine-tuning procedures for BERT are illustrated in Figure 1. In BERT, the same architecture is used for both pre-training and fine-tuning, except for the output layers. The pre-trained model parameters are used to initialize models for various downstream tasks, and during fine-tuning all parameters are adjusted.
Figure 1: Architecture of the BERT model (pre-training and fine-tuning).
In contrast, RoBERTa does not use the next-sentence pretraining objective. Instead, it is trained with much larger mini-
batches and higher learning rates. RoBERTa employs a different pretraining scheme and replaces the character-level
BPE vocabulary with a byte-level BPE tokenizer (similar to GPT-2). Additionally, RoBERTa does not require the
definition of which token belongs to which segment, as it lacks token_type_ids. Segments can be easily divided using
the separation token tokenizer.sep_token (or </s>).
Furthermore, unlike the 16GB dataset originally used to train BERT, RoBERTa is trained on a massive dataset exceeding
160GB of uncompressed text. This dataset includes the 16GB of English Wikipedia and Books Corpus used in BERT,
along with additional data from the WebText corpus (38 GB), the CommonCrawl News dataset (63 million articles, 76
GB), and Stories from Common Crawl (31 GB). RoBERTa was pre-trained using this extensive dataset and 1024 V100
Tesla GPUs running for a day.
RoBERTa has a similar architecture to BERT, but to enhance performance, the authors made several simple design
changes to the architecture and training procedure. These changes include:
1. Removing the Next Sentence Prediction (NSP) Objective: In BERT, the model is trained to predict whether
two segments of a document are from the same or different documents using an auxiliary NSP loss. The authors
experimented with versions of the model with and without the NSP loss and found that removing the NSP loss
either matched or slightly improved performance on downstream tasks.
2. Training with Larger Batch Sizes and Longer Sequences: BERT was originally trained for 1 million steps with a batch size of 256 sequences. RoBERTa was instead trained for 125,000 steps with a batch size of 2,000 sequences, and for 31,000 steps with a batch size of 8,000 sequences. Larger batches improve perplexity on the masked language modeling objective and end-task accuracy. They are also easier to parallelize using distributed data-parallel training.
3. Dynamically Changing the Masking Pattern: In BERT, masking is done once during data preprocessing, resulting in a single static mask. To avoid this, the training data is duplicated and masked 10 times with different strategies over 40 epochs, so each sequence is seen with the same mask only 4 times. This static strategy was compared with dynamic masking, where a new mask is generated every time a sequence is passed into the model (a minimal sketch of dynamic masking is given below).
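The sketch below uses the transformers DataCollatorForLanguageModeling, which is one convenient way to re-mask inputs on the fly; the RoBERTa authors implemented dynamic masking inside their own training pipeline. The checkpoint and sentence are illustrative choices.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

encoded = tokenizer(["RoBERTa is trained with dynamic masking."])
example = {"input_ids": encoded["input_ids"][0]}

batch_1 = collator([example])   # masking is sampled fresh on every call,
batch_2 = collator([example])   # so the <mask> positions typically differ
print(batch_1["input_ids"])
print(batch_2["input_ids"])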
While RoBERTa is a powerful model, it’s not without its limitations. Here are some:
1. Computational Resources: Training and fine-tuning RoBERTa requires significant computational resources,
including powerful GPUs and large amounts of memory. This can make it challenging for individuals or
organizations with limited resources to utilize RoBERTa effectively.
2. Domain Specificity: Pre-trained language models like RoBERTa may not perform optimally on domain-
specific tasks or datasets without further fine-tuning. They may require additional training on domain-specific
data to achieve the desired level of performance.
3. Data Efficiency: RoBERTa and similar models require large amounts of data for pre-training, which might not
be available for all languages or domains. This reliance on extensive data can limit their applicability in settings
where data is scarce or expensive to acquire.
4. Interpretability: The black-box nature of RoBERTa can make it difficult to interpret how the model arrives at
its predictions. Understanding the inner workings of the model and diagnosing errors or biases can be
challenging, especially in complex applications or sensitive domains.
5. Fine-tuning Challenges: While fine-tuning RoBERTa for specific tasks can improve performance, it requires
expertise and experimentation to select the right hyperparameters, data augmentation techniques, and training
strategies. This process can be time-consuming and resource-intensive.
6. Bias and Fairness: Pre-trained language models like RoBERTa can inherit biases present in the training data,
leading to biased or unfair predictions. Addressing bias and ensuring fairness in AI models remains a
significant challenge, requiring careful data curation and model design considerations.
Text Classification
Text Classification is the task of assigning a label or class to a given text. Some use cases are sentiment analysis, natural
language inference, and assessing grammatical correctness.
Example output of a sentiment analysis model:
POSITIVE: 0.900
NEUTRAL: 0.100
NEGATIVE: 0.000
Use Cases
You can track the sentiments of your customers from the product reviews using sentiment analysis models. This can
help understand churn and retention by grouping reviews by sentiment, to later analyze the text and make strategic
decisions based on this knowledge.
Task Variants
Natural Language Inference (NLI)
In NLI the model determines the relationship between two given texts. Concretely, the model takes a premise and a hypothesis and returns a class that can either be:
• entailment, which means the hypothesis is true given the premise,
• contradiction, which means the hypothesis is false given the premise, or
• neutral, which means there is no relation between the hypothesis and the premise.
The benchmark dataset for this task is GLUE (General Language Understanding Evaluation). NLI models have different variants, such as Multi-Genre NLI, Question NLI and Winograd NLI.
Example 1:
Premise: A man inspects the uniform of a figure in some East Asian country.
Hypothesis: The man is sleeping.
Label: Contradiction
Example 2:
Premise: Soccer game with multiple males playing.
Hypothesis: Some men are playing a sport.
Label: Entailment
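A minimal sketch of running an NLI model with transformers is shown below; the roberta-large-mnli checkpoint and the premise/hypothesis pair are illustrative choices.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "roberta-large-mnli"                    # a RoBERTa model fine-tuned on MultiNLI
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = "A man inspects the uniform of a figure in some East Asian country."
hypothesis = "The man is sleeping."

inputs = tokenizer(premise, hypothesis, return_tensors="pt")   # encodes the text pair
with torch.no_grad():
    logits = model(**inputs).logits
probs = logits.softmax(dim=-1)[0]

for idx, label in model.config.id2label.items():
    print(f"{label}: {probs[idx].item():.3f}")       # contradiction should score highest here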
QNLI
QNLI is the task of determining if the answer to a certain question can be found in a given document. If the answer can be found the label is "entailment"; if the answer cannot be found the label is "not entailment".
Example 1:
Question: What percentage of marine life died during the extinction?
Sentence: It is also known as the "Great Dying" because it is considered the largest mass extinction in the Earth's history.
Label: not entailment
Example 2:
Question: Who was the London Weekend Television's Managing Director?
Sentence: The managing director of London Weekend Television (LWT), Greg Dyke, met with the representatives of the "big five" football clubs in England in 1990.
Label: entailment
Sentiment Analysis
In Sentiment Analysis, the classes can be polarities like positive, negative, neutral, or sentiments such as happiness or
anger.
Quora Question Pairs
Quora Question Pairs models assess whether two provided questions are paraphrases of each other. The model takes two questions and returns a binary value, with 0 mapped to "not paraphrase" and 1 to "paraphrase". The benchmark dataset is Quora Question Pairs inside the GLUE benchmark. The dataset consists of question pairs and their labels.
Question1: “How can I increase the speed of my internet connection while using a VPN?”
Label: Paraphrase
Grammatical Correctness
Linguistic Acceptability is the task of assessing the grammatical acceptability of a sentence. The classes in this task are
“acceptable” and “unacceptable”. The benchmark dataset used for this task is Corpus of Linguistic Acceptability
(CoLA). The dataset consists of texts and their labels.
Example:
"Books were sent to each other by the students."
Label: Unacceptable
"She voted for herself."
Label: Acceptable
Text Generation
Text generation is the task of producing new text given another text as input. These models can, for example, fill in incomplete text or paraphrase.
Input: "Once upon a time,"
Output: "Once upon a time, we knew that our ancestors were on the verge of extinction. The great explorers and poets of the Old World, from Alexander the Great to Chaucer, are dead and gone. A good many of our ancient explorers and poets have ..."
This task covers guides on both text-generation and text-to-text generation models. Popular large language models that
are used for chats or following instructions are also covered in this task. You can find the list of selected open-source
large language models here, ranked by their performance scores.
Use Cases
Code Generation
A Text Generation model, also known as a causal language model, can be trained on code from scratch to help the
programmers in their repetitive coding tasks. One of the most popular open-source models for code generation is
StarCoder, which can generate code in 80+ languages. You can try it here.
Stories Generation
A story generation model can receive an input like "Once upon a time" and proceed to create a story-like text based on
those first words. You can try this application which contains a model trained on story generation, by MosaicML.
If the available generative models were trained on data that differs from your use case, you can train a causal language model from scratch. Learn how to do it in the free transformers course!
Task Variants
Completion Generation Models
A popular variant of text generation models predicts the next word given a sequence of words; word by word, a longer text is formed.
The most popular models for this task are GPT-based models, Mistral or Llama series. These models are trained on data
that has no labels, so you just need plain text to train your own model. You can train text generation models to generate
a wide variety of documents, from code to stories.
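A minimal sketch with the transformers text-generation pipeline is shown below; the small GPT-2 checkpoint and the prompt are illustrative choices, and larger instruction-tuned models are used the same way.

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Once upon a time,", max_new_tokens=30, num_return_sequences=1)
print(result[0]["generated_text"])   # the prompt followed by a model-written continuation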
Text-to-Text Generation Models
These models are trained to learn the mapping between a pair of texts (e.g. translation from one language to another). The most popular variants of these models are NLLB, FLAN-T5 and BART. Text-to-Text models are trained with multi-tasking capabilities, so they can accomplish a wide range of tasks, including summarization, translation and text classification.
When it comes to text generation, the underlying language model can come in several types:
Base models: plain language models like Mistral 7B and Meta Llama-3-70b. These models are good for fine-tuning and few-shot prompting.
Instruction-trained models: these models are trained in a multi-task manner to follow a broad range of instructions
like "Write me a recipe for chocolate cake". Models like Qwen 2 7B, Yi 1.5 34B Chat, and Meta Llama 70B Instruct
are examples of instruction-trained models. In general, instruction-trained models will produce better responses to
instructions than base models.
Human feedback models: these models extend base and instruction-trained models by incorporating human feedback
that rates the quality of the generated text according to criteria like helpfulness, honesty, and harmlessness. The human
feedback is then combined with an optimization technique like reinforcement learning to align the original model more closely with human preferences. The overall methodology is often called Reinforcement Learning from Human
Feedback, or RLHF for short. Zephyr ORPO 141B A35B is an open-source model aligned through human feedback.
There are language models that can input both text and image and output text, called vision language models. IDEFICS
2 and MiniCPM Llama3 V are good examples. They accept the same generation parameters as other language models.
However, since they also take images as input, you have to use them with the image-to-text pipeline. You can find more
information about this in the image-to-text task page.
Text-to-Text generation models have a separate pipeline called text2text-generation. This pipeline takes an input
containing the sentence including the task and returns the output of the accomplished task.
from transformers import pipeline

text2text_generator = pipeline("text2text-generation")
text2text_generator("question: What is 42 ? context: 42 is the answer to life, the universe and everything")
You can also use huggingface.js to run text classification models on the Hugging Face Hub from JavaScript, for example with an HfInference client from the @huggingface/inference package:

await hf.textClassification({
  model: "distilbert-base-uncased-finetuned-sst-2-english",
  inputs: "I love Transformers!",
});