Machine learning
Text Classification is a core task in Natural Language Processing (NLP) where a piece
of text is assigned one or more categories or labels based on its content.
It’s like training a computer to understand what a text is about and then sort it into the
right box.
Common Applications:
• Spam Detection
• Sentiment Analysis
• Language Detection
Sentiment Analysis in Natural Language Processing (NLP) is the task of determining the
emotional tone or attitude expressed in a piece of text. It’s often used to identify whether
the sentiment behind a statement is positive, negative, or neutral.
Sentiment analysis tries to figure out how someone feels based on what they’ve written
or said.
How It Works:
• You give the algorithm input-output pairs (e.g., text and its label).
Example (each text paired with its label):
Text | Label
1. Labels – The categories the model must choose between (e.g., positive, negative).
2. Features – Text transformed into a format the model can understand (e.g., vectors), as in the sketch after the list below.
• Sentiment Analysis
• Text Classification
• Part-of-Speech Tagging
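To make the labels-plus-features idea concrete, here is a minimal sketch using scikit-learn; the tiny dataset, the label names, and the TF-IDF + logistic regression combination are illustrative assumptions rather than anything prescribed above.

```python
# A minimal text-classification / sentiment-analysis sketch (toy data for illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Input-output pairs: each text comes with a sentiment label.
texts = [
    "I loved this movie, it was fantastic",
    "Absolutely terrible, a waste of time",
    "What a great and enjoyable film",
    "I hated every minute of it",
]
labels = ["positive", "negative", "positive", "negative"]

# Features: the vectorizer turns raw text into numeric vectors (TF-IDF weights),
# i.e., a format the model can understand.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["what a wonderful film"]))  # expected: ['positive']
```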
RNN:
Recurrent Neural Networks (RNNs) are a class of neural networks that are particularly well-
suited for sequence data, making them a foundational model in Natural Language
Processing (NLP). Here's a breakdown of what RNNs are, how they work in NLP, and their
limitations:
What Are RNNs?
RNNs are designed to handle sequential data, where the order of inputs matters. Unlike
traditional feedforward neural networks, RNNs have a "memory" — they maintain a
hidden state that is updated at each time step, allowing them to capture dependencies
over time.
This is especially useful in NLP tasks, since language is inherently sequential (e.g., word
order matters).
At each time step t, an RNN takes:
• the current input x_t, and
• the hidden state from the previous time step h_{t-1},
and computes the new hidden state h_t:
• h_t = tanh(W · x_t + U · h_{t-1} + b)
where:
• W, U: Weight matrices learned during training.
• b: Bias vector.
• tanh: A non-linear activation function (squashes the output between -1 and 1).
This hidden state can then be used to make predictions (e.g., the next word, a classification
label, etc.).
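As a concrete illustration of the update h_t = tanh(W · x_t + U · h_{t-1} + b), here is a bare-bones NumPy sketch of one recurrent step applied over a short sequence; the layer sizes and random weights are arbitrary assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4                   # illustrative sizes

W = rng.normal(size=(hidden_size, input_size))   # input-to-hidden weights
U = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-hidden weights
b = np.zeros(hidden_size)                        # bias vector

def rnn_step(x_t, h_prev):
    """One time step: combine the current input with the previous hidden state."""
    return np.tanh(W @ x_t + U @ h_prev + b)

# Process a sequence of 5 time steps, carrying the hidden state ("memory") forward.
h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h = rnn_step(x_t, h)

print(h)  # final hidden state, usable for a prediction (next word, class label, etc.)
```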
Typical applications include:
• Named Entity Recognition (NER): Identify entities like names or dates in text.
• Speech Recognition: Convert audio into text (with RNNs for audio sequences).
Variants of RNNs
To address limitations of basic RNNs, several advanced versions have been developed:
• LSTM (Long Short-Term Memory): Adds gates and a cell state that help preserve information over long spans.
• GRU (Gated Recurrent Unit): A simplified version of LSTM that performs similarly
with fewer parameters.
Limitations of RNNs
1. Vanishing Gradient Problem
What is it?
When training RNNs using backpropagation through time (BPTT), the gradients (used to update weights) can become very small as they move backward through time steps.
The model struggles to learn from data where context from far back in the sequence is crucial (e.g., understanding a sentence with clauses spread out).
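A quick back-of-the-envelope illustration of why this happens (the per-step factor is an assumed number, not taken from a real training run): during BPTT the gradient is repeatedly multiplied by factors that are often smaller than 1, so it shrinks exponentially with the number of steps it travels back.

```python
factor = 0.9      # assumed per-step gradient scaling factor, slightly below 1
gradient = 1.0
for step in range(50):        # 50 time steps back through the sequence
    gradient *= factor

print(gradient)   # ~0.005: the earliest time steps receive almost no learning signal
```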
2. Exploding Gradient Problem
What is it?
The opposite can also happen: gradients grow very large as they travel back through many time steps, which makes training unstable.
Common fix: gradient clipping (capping the size of the gradient during training).
3. No Parallelization (Slow Training)
What is it?
RNNs process one time step at a time — you can’t compute them all in parallel like you can with CNNs or Transformers.
4. Short-Term Memory
Example:
In the sentence:
"The cat that was chased by the dog was very tired."
To predict "tired," the model should understand the subject "cat" — but basic RNNs often
fail to connect that far back unless enhanced (e.g., with LSTMs).
5. Trouble with Very Long Sequences
What is it?
Even LSTMs and GRUs (which are better than vanilla RNNs) start to break down with very long sequences (like full documents, paragraphs, etc.).
RNNs tend to give more weight to recent words, which can hurt performance in tasks that
require long-term context retention.
6. One Direction Only (by Default)
A standard RNN only processes sequences left to right (past to future). That means it can’t use future context unless you build a Bidirectional RNN, which doubles the computation.
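A small PyTorch sketch (with made-up sizes) of what turning on bidirectionality does: the layer runs one pass left-to-right and one right-to-left, so the output features double in size and so does the computation.

```python
import torch
import torch.nn as nn

uni = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
bi  = nn.LSTM(input_size=8, hidden_size=16, batch_first=True, bidirectional=True)

x = torch.randn(2, 5, 8)   # batch of 2 sequences, 5 time steps, 8 features each
out_uni, _ = uni(x)
out_bi, _ = bi(x)

print(out_uni.shape)       # torch.Size([2, 5, 16])
print(out_bi.shape)        # torch.Size([2, 5, 32]): forward and backward outputs concatenated
```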
While RNNs were dominant in NLP for years, they’ve largely been superseded by
Transformer-based models like BERT and GPT, which can capture long-range
dependencies more effectively and train in parallel.
Gradient: A gradient is a vector of partial derivatives; in simpler terms, it tells you how much a function changes when you change its input a little bit.
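A tiny numerical check of that idea, using an assumed example function f(x, y) = x**2 + 3*y whose partial derivatives are (2x, 3):

```python
def f(x, y):
    return x**2 + 3*y

eps = 1e-6
x, y = 2.0, 1.0
df_dx = (f(x + eps, y) - f(x, y)) / eps   # finite-difference estimate of df/dx
df_dy = (f(x, y + eps) - f(x, y)) / eps   # finite-difference estimate of df/dy

print(df_dx, df_dy)   # approximately 4.0 and 3.0, i.e., the gradient at (2, 1)
```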
Both LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units) are types of recurrent neural networks (RNNs), but with gate mechanisms that help them keep important information over long sequences, discard what is irrelevant, and avoid the vanishing gradient problem.
LSTM (Long Short-Term Memory)
Key Components: a cell state (long-term memory) plus a hidden state (the output at each step).
Main Gates: forget gate, input gate, output gate.
Strengths: very effective at retaining information across long sequences.
GRU (Gated Recurrent Unit)
Key Components: a single hidden state (no separate cell state).
Main Gates: update gate, reset gate.
Strengths: fewer parameters than an LSTM, so it trains faster and often performs comparably.
LSTM vs GRU
LSTMs are slightly more expressive (three gates plus a cell state); GRUs are simpler and lighter (two gates). In practice the better choice is usually found empirically.
Summary
Both replace the plain recurrence of a vanilla RNN with gated updates, which is what lets them carry information across far more time steps.
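To make the "fewer parameters" point concrete, here is a quick PyTorch comparison; the input and hidden sizes are illustrative assumptions.

```python
import torch.nn as nn

def count_params(module):
    return sum(p.numel() for p in module.parameters())

lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
gru = nn.GRU(input_size=64, hidden_size=128, batch_first=True)

print("LSTM parameters:", count_params(lstm))  # 99,328 (four gated blocks)
print("GRU parameters: ", count_params(gru))   # 74,496 (three gated blocks)
```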
Parameters:
In a neural network, the parameters are the values the model learns during training (its weights and biases), so "fewer parameters" means a smaller model that is cheaper to train.
A Transformer is a deep learning architecture introduced in the paper "Attention is All You
Need" (2017) by Vaswani et al. It was designed to handle sequential data (like text), but
without using RNNs or CNNs.
Key Idea:
Instead of processing data step-by-step (like RNNs), Transformers look at all tokens at
once — thanks to the Attention Mechanism.
What Is Attention?
The Attention Mechanism allows a model to focus on relevant parts of the input when
making predictions — kind of like how you pay more attention to certain words in a
sentence to understand its meaning.
Simple Example:
Take this sentence:
"The cat that was chased by the dog ran up the tree."
If you want to understand who ran up the tree, attention helps the model focus on "cat" when processing "ran" — even though they’re far apart in the sentence.
For each word, the model essentially asks:
"How important are the other words when interpreting this word?"
It assigns a score (weight) to every other word, then creates a weighted sum of their representations.
Transformer Architecture
A Transformer is made of Encoder and Decoder blocks (used in translation, etc.), but
many modern models (like BERT or GPT) use just the encoder or decoder.
Encoder Layer:
• Multi-Head Self-Attention
• Feed-Forward Network
Decoder Layer:
• Masked Multi-Head Self-Attention
• Encoder-Decoder Attention
• Feed-Forward Network
Self-Attention
Each word attends to all other words in the input to build a context-aware representation
of itself. That’s how Transformers understand the meaning of words in context.
Multi-Head Attention
Instead of doing attention once, the model does it multiple times in parallel (called
"heads"). Each head learns to focus on different types of relationships between words.
Summary:
What is Self-Attention?
Self-attention is a mechanism that allows each word (or token) in a sequence to look at all
other words in the sequence and decide which ones are important for understanding its
meaning.
It helps the model understand context, even for words that are far apart.
Simple Analogy:
Imagine reading: "After days of heavy rain, the bank was flooded."
To understand the meaning of "bank", your brain looks at "flooded" and "rain", realizing it’s probably a river bank, not a financial institution.
How it’s computed (for each word):
1. Create a Query, Key, and Value vector for the word (by multiplying its embedding with learned weight matrices).
2. Compute attention scores:
o Take the dot product of the query of the current word with the keys of all words in the sequence.
o This gives a score for how much attention this word should pay to every other word.
3. Apply softmax:
o The scores are turned into weights that sum to 1.
4. Take the weighted sum of the value vectors, using those weights, to produce the word’s new context-aware representation.
Formula:
Attention(Q, K, V) = softmax(Q · Kᵀ / √d_k) · V
where d_k is the dimension of the key vectors.
In other words, each word asks:
"What other words in the sentence should I focus on when forming my own representation?"
And every word does this at the same time — hence "self-attention."
Example:
Sentence: "The cat sat on the mat."
To understand "sat", self-attention might make the model focus more on "cat", while for
"mat", it might attend to "on" and "sat".
Why it’s powerful:
• Handles long-term dependencies easily (no need for sequential processing like RNNs).
Introduced in the 2017 paper "Attention is All You Need", the Transformer is a neural
network designed to handle sequences without recurrence (no RNNs).
Key Ideas
• Self-Attention: every token can attend directly to every other token in the sequence.
• Positional Encoding: Since there's no recurrence, we inject info about the position of each word.
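One common way to do this is the sinusoidal encoding described in "Attention is All You Need": each position gets a fixed pattern of sine and cosine values that is added to the token embeddings. The sketch below uses assumed sizes.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    positions = np.arange(max_len)[:, None]            # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # even embedding dimensions
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even indices get sine
    pe[:, 1::2] = np.cos(angles)                       # odd indices get cosine
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)   # (50, 16): one 16-dimensional position vector per token position
```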
High-level architecture:
Input → Encoder → Decoder → Output
But in many modern NLP models (like BERT or GPT), only one side is used.
Each decoder layer contains:
• Multi-Head Self-Attention
• Encoder-Decoder Attention (attends to the encoder’s output)
Each encoder layer stacks:
[Multi-Head Self-Attention]
↓
[Feed-Forward Network]
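A minimal PyTorch sketch of one such layer, using the library's built-in encoder layer (self-attention followed by a feed-forward network); the sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(
    d_model=32,            # embedding size
    nhead=4,               # number of attention heads
    dim_feedforward=64,    # hidden size of the feed-forward sub-layer
    batch_first=True,
)

x = torch.randn(1, 6, 32)  # 1 sentence, 6 tokens, 32-dim embeddings
print(layer(x).shape)      # torch.Size([1, 6, 32]): same shape, context-mixed representations
```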
Positional Encoding
Multi-Head Attention
Feed-Forward Network
Summary