Machine learning

Text Classification in NLP involves assigning categories or labels to text based on its content, with applications in sentiment analysis, spam detection, and intent detection. Supervised learning is used to train models on labeled data, while Recurrent Neural Networks (RNNs) and their variants like LSTMs and GRUs are employed for handling sequential data. Transformers have largely replaced RNNs due to their ability to process data in parallel and manage long-range dependencies effectively.


What is Text Classification in NLP?

Text Classification is a core task in Natural Language Processing (NLP) where a piece
of text is assigned one or more categories or labels based on its content.

It’s like training a computer to understand what a text is about and then sort it into the
right box.

Examples:

Text | Classification Task | Result
"I love this product!" | Sentiment Analysis | Positive
"You’ve won a free iPhone!" | Spam Detection | Spam
"Breaking: Stock prices soar today" | News Categorization | Business

Common Applications:

• Spam Detection

• Sentiment Analysis

• Topic Labeling (e.g., Politics, Sports, Tech)

• Intent Detection in chatbots

• Language Detection

• Toxic Comment Detection

Sentiment Analysis in Natural Language Processing (NLP) is the task of determining the
emotional tone or attitude expressed in a piece of text. It’s often used to identify whether
the sentiment behind a statement is positive, negative, or neutral.

Sentiment analysis tries to figure out how someone feels based on what they’ve written
or said.

Supervised Learning Basics in NLP


Supervised Learning is a type of machine learning where the model learns from labeled
data—meaning each training example is paired with the correct output (or label).

How It Works:

• You give the algorithm input-output pairs (e.g., text and its label).

• The model learns patterns from these examples.

• Once trained, it can predict labels for unseen data.

Example:

Text | Label
“This movie was fantastic!” | Positive
“The service was terrible.” | Negative

The model learns what makes a sentence "positive" or "negative."
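
As a concrete illustration, a tiny scikit-learn pipeline can learn this mapping from a handful of made-up examples. The texts, labels, and model choice below are assumptions for the sketch, not a prescribed recipe:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled dataset (input-output pairs).
texts = ["This movie was fantastic!", "The service was terrible.",
         "Absolutely loved it", "Worst experience ever"]
labels = ["Positive", "Negative", "Positive", "Negative"]

# Turn text into numeric features (TF-IDF vectors) and fit a classifier on them.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Predict the label of unseen text.
print(model.predict(["What a great film"]))   # e.g. ['Positive']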

Components of Supervised Learning in NLP:

1. Dataset – A collection of text samples and their labels.

2. Features – Text transformed into a format the model can understand (e.g., vectors).

3. Model – An algorithm that maps features to labels.

4. Evaluation Metrics – Accuracy, Precision, Recall, F1-Score.

Common NLP Tasks Using Supervised Learning:

• Sentiment Analysis

• Named Entity Recognition (NER)

• Text Classification

• Part-of-Speech Tagging

• Question Answering (in simpler forms)

RNN:

Recurrent Neural Networks (RNNs) are a class of neural networks that are particularly well-
suited for sequence data, making them a foundational model in Natural Language
Processing (NLP). Here's a breakdown of what RNNs are, how they work in NLP, and their
limitations:
What Are RNNs?

RNNs are designed to handle sequential data, where the order of inputs matters. Unlike
traditional feedforward neural networks, RNNs have a "memory" — they maintain a
hidden state that is updated at each time step, allowing them to capture dependencies
over time.

This is especially useful in NLP tasks, since language is inherently sequential (e.g., word
order matters).

How RNNs Work in NLP

At each time step t, an RNN receives:

• an input vector x_t (e.g., the embedding of the current word),

• the hidden state from the previous time step, h_{t-1},

and computes the new hidden state h_t:

h_t = tanh(W·x_t + U·h_{t-1} + b)

Here’s what each part means:

• x_t: The input vector at time t (e.g., the embedding of a word).

• h_{t-1}: The hidden state from the previous time step.

• W: Weight matrix for the input.

• U: Weight matrix for the previous hidden state.

• b: Bias vector.

• tanh: A non-linear activation function (squashes the output between -1 and 1).

This hidden state can then be used to make predictions (e.g., the next word, a classification
label, etc.).
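
A minimal NumPy sketch of this update rule is shown below; the sizes (embedding size 4, hidden size 3, a 5-word sentence) and random weights are placeholders chosen just for illustration:

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # input-to-hidden weights
U = rng.normal(size=(3, 3))   # hidden-to-hidden weights
b = np.zeros(3)               # bias

h = np.zeros(3)                        # initial hidden state h_0
sentence = rng.normal(size=(5, 4))     # 5 word embeddings of size 4 (stand-ins)

for x_t in sentence:
    h = np.tanh(W @ x_t + U @ h + b)   # update the hidden state at each time step

print(h)   # final hidden state, usable for a prediction (e.g. a class label)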

Common NLP Tasks with RNNs

• Text Generation: Predict the next word/character in a sequence.

• Language Modeling: Learn probabilities of sequences of words.


• Machine Translation: Translate a sentence from one language to another.

• Named Entity Recognition (NER): Identify entities like names or dates in text.

• Sentiment Analysis: Classify the sentiment of a sentence or document.

• Speech Recognition: Convert audio into text (with RNNs for audio sequences).

Variants of RNNs

To address limitations of basic RNNs, several advanced versions have been developed:

• LSTM (Long Short-Term Memory): Introduces gates to control the flow of information and deal with long-term dependencies.

• GRU (Gated Recurrent Unit): A simplified version of LSTM that performs similarly
with fewer parameters.

Limitations of RNNs

• Vanishing/Exploding Gradients: Hard to learn long-term dependencies in long sequences.

• Sequential Processing: Can’t be parallelized easily (unlike transformers).

• Training Time: Slower compared to more modern models.

Limitations of RNNs

1. Vanishing Gradient Problem

What is it?

When training RNNs using backpropagation through time (BPTT), the gradients (used to
update weights) can become very small as they move backward through time steps.

Why it's a problem:

• The network “forgets” long-range dependencies.

• If an important piece of information appears early in the sequence, it may not influence the final output.

Result:

The model struggles to learn from data where context from far back in the sequence is
crucial (e.g., understanding a sentence with clauses spread out).
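
A toy NumPy demonstration of this effect (the weight scale, state size, and sequence length below are arbitrary choices for illustration): the gradient of the final hidden state with respect to the first one is a product of per-step Jacobians diag(1 - h_t^2)·U, and that product shrinks toward zero over many steps.

import numpy as np

rng = np.random.default_rng(0)
U = 0.3 * rng.normal(size=(4, 4))   # small recurrent weight matrix (assumed)
h = np.zeros(4)
jacobian = np.eye(4)                # gradient of the current state w.r.t. h_0

for t in range(50):
    h = np.tanh(U @ h + rng.normal(size=4))        # one RNN step with random input
    jacobian = (np.diag(1 - h**2) @ U) @ jacobian  # chain rule through that step

print(np.linalg.norm(jacobian))   # typically a vanishingly small number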

2. Exploding Gradient Problem

What is it?

Opposite of vanishing gradients. Sometimes gradients become too large, leading to unstable training.

Why it's a problem:

• The model weights explode.

• The learning process breaks down (loss becomes NaN or infinity).

Common fix:

• Gradient clipping: Set a maximum limit on gradient values (or on their overall norm) during training, as in the sketch below.
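
A minimal PyTorch sketch of where clipping sits in a training step (the model, loss function, optimizer, and data are assumed to exist; the max_norm value is arbitrary):

import torch

def training_step(model, loss_fn, optimizer, inputs, targets, max_norm=1.0):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    # Rescale all gradients so their combined norm never exceeds max_norm.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()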

3. Sequential Processing = Slow Training

What is it?

RNNs process one time step at a time — you can’t compute them all in parallel like you can
with CNNs or Transformers.

Why it's a problem:

• Training and inference are slower, especially for long sequences.

• Can't fully utilize modern parallel-processing hardware like GPUs.

4. Short-Term Memory

Basic RNNs are only good at capturing short-term dependencies.

Example:

In the sentence:
"The cat that was chased by the dog was very tired."

To predict "tired," the model should understand the subject "cat" — but basic RNNs often
fail to connect that far back unless enhanced (e.g., with LSTMs).

5. Difficulty Handling Long Sequences

What is it?

Even LSTMs and GRUs (which are better than vanilla RNNs) start to break down with very
long sequences (like full documents, paragraphs, etc.).

6. Bias Toward Recent Inputs

RNNs tend to give more weight to recent words, which can hurt performance in tasks that
require long-term context retention.

7. Limited Bidirectionality (in vanilla RNNs)

A standard RNN only processes sequences left to right (past to future). That means it can’t
use future context unless you build a Bidirectional RNN, which doubles the computation.

Why Transformers Took Over

Transformers (like BERT, GPT, etc.) solve many of these issues:

• No vanishing gradients (self-attention sees all tokens at once).

• Fully parallelizable = much faster training.

• Better long-range dependency handling.

RNNs vs. Transformers

While RNNs were dominant in NLP for years, they’ve largely been superseded by
Transformer-based models like BERT and GPT, which can capture long-range
dependencies more effectively and train in parallel.

Gradient: A gradient is basically a vector of partial derivatives — in simpler terms, it tells you how much a function changes when you change its input a little bit.

LSTMs and GRUs

What Are LSTMs and GRUs?

Both LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units) are types of
recurrent neural networks (RNNs), but with gate mechanisms that help them:

• Remember important information over long sequences

• Forget irrelevant or outdated info

• Handle vanishing gradients much better

LSTM (Long Short-Term Memory)

Key Components:

• Cell state: Memory of the network — flows through time steps.

• Hidden state: Output at each time step.

• Gates: Special neural layers that control the flow of info.

Main Gates:

1. Forget gate: Decides what to forget.

2. Input gate: Decides what new info to store.

3. Output gate: Decides what to output.

Strengths:

• Excellent for long-term dependencies

• Works well in tasks like text classification, machine translation, speech recognition, etc.

GRU (Gated Recurrent Unit)

Key Components:

• Combines cell state and hidden state into one.


• Fewer gates than LSTM → faster and simpler.

Main Gates:

1. Update gate: Controls what to keep from the past.

2. Reset gate: Controls how much past info to forget.

Strengths:

• Faster to train (fewer parameters than LSTM).

• Still handles long dependencies well.

• Often performs similarly to LSTMs with less complexity.

Use in NLP Tasks

Both are used in:

NLP Task | How They’re Used
Text Classification | Sentiment analysis, spam detection, etc.
Machine Translation | Used in seq2seq models for encoder-decoder architecture.
Speech Recognition | Recognizing words from audio sequences.
Named Entity Recognition (NER) | Identifying names, dates, etc., in text.
Question Answering | As part of deep contextual models.

LSTM vs GRU

Feature | LSTM | GRU
Gates | 3 (Forget, Input, Output) | 2 (Update, Reset)
Memory | Separate cell & hidden state | Combined state
Complexity | More complex (more parameters) | Simpler (faster to train)
Performance | Slightly better on some tasks | Comparable, sometimes better
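
To make the comparison concrete, here is a minimal PyTorch sketch that runs both layers on the same dummy batch (batch size, sequence length, embedding size, and hidden size are arbitrary toy values):

import torch
import torch.nn as nn

x = torch.randn(2, 5, 8)   # (batch, time steps, embedding size)

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)

out_lstm, (h_n, c_n) = lstm(x)   # LSTM keeps a separate cell state and hidden state
out_gru, h_n_gru = gru(x)        # GRU keeps a single combined state

print(sum(p.numel() for p in lstm.parameters()))   # LSTM: more parameters
print(sum(p.numel() for p in gru.parameters()))    # GRU: fewer parameters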

Summary

• LSTM = great when you need strong long-term memory.

• GRU = faster, simpler, often just as good.

• Both are a major improvement over basic RNNs.

• However, in many modern NLP systems, they’ve been replaced or augmented by Transformer-based models (like BERT, GPT, etc.) due to their scalability and power.

Parameters:

The internal values (weights and biases) that a model learns from its training data; a model with more parameters is larger and more complex.

What Are Transformers?

A Transformer is a deep learning architecture introduced in the paper "Attention is All You
Need" (2017) by Vaswani et al. It was designed to handle sequential data (like text), but
without using RNNs or CNNs.

Key Idea:

Instead of processing data step-by-step (like RNNs), Transformers look at all tokens at
once — thanks to the Attention Mechanism.

What Is Attention?

The Attention Mechanism allows a model to focus on relevant parts of the input when
making predictions — kind of like how you pay more attention to certain words in a
sentence to understand its meaning.

Simple Example:
Take this sentence:

"The cat, which the dog chased, ran up the tree."

If you want to understand who ran up the tree, attention helps the model focus on "cat"
when processing "ran" — even though they’re far apart in the sentence.

How Attention Works (Conceptually)

For every word in the input, attention calculates:

How important are the other words when interpreting this word?

It assigns a score (weight) to every other word, then creates a weighted sum of their
representations.

Transformer Architecture

A Transformer is made of Encoder and Decoder blocks (used in translation, etc.), but
many modern models (like BERT or GPT) use just the encoder or decoder.

Encoder Layer:

• Multi-Head Self-Attention

• Feed Forward Neural Network

• Add & Norm (residual connections + layer normalization)

Decoder Layer:

• Masked Multi-Head Self-Attention

• Encoder-Decoder Attention

• Feed Forward Neural Network

Self-Attention

Each word attends to all other words in the input to build a context-aware representation
of itself. That’s how Transformers understand the meaning of words in context.
Multi-Head Attention

Instead of doing attention once, the model does it multiple times in parallel (called
"heads"). Each head learns to focus on different types of relationships between words.

Why Transformers Are Better Than RNNs:

Feature | RNNs/LSTMs | Transformers
Parallel processing | No | Yes
Long-range dependencies | Struggles | Handles well
Speed | Slow (step-by-step) | Fast (all at once)
Memory of past tokens | Limited | Strong via self-attention

Transformers in NLP Today

Transformers power all modern NLP giants:

• BERT: Bidirectional Encoder Representations

• GPT (1, 2, 3, 4): Generative Pretrained Transformers

• T5, XLNet, RoBERTa, etc.

Summary:

• Attention = lets the model focus on relevant words.

• Transformer = architecture that uses only attention (no RNNs).

• Why it matters: Transformers are fast, scalable, and state-of-the-art in NLP.

What is Self-Attention?
Self-attention is a mechanism that allows each word (or token) in a sequence to look at all
other words in the sequence and decide which ones are important for understanding its
meaning.

It helps the model understand context, even for words that are far apart.

Simple Analogy:

Imagine reading this sentence:

"The bank was flooded after the heavy rain."

To understand the meaning of "bank", your brain looks at "flooded" and "rain", realizing
it’s probably a river bank, not a financial institution.

That’s self-attention in action — you’re focusing on the relevant words to understand a particular word.

How Self-Attention Works (Step-by-Step)

For each word in the input sequence:

1. Create three vectors for each word:

o Query (Q): What am I looking for?

o Key (K): What do I offer?

o Value (V): What info do I carry?

2. Compute attention scores:

o Take the dot product of the query of the current word with the keys of all
words in the sequence.

o This gives a score for how much attention this word should pay to every other
word.

3. Apply softmax:

o Turn the scores into probabilities (all add up to 1).

4. Weighted sum of values:

o Multiply each value vector by its attention weight.


o Sum them to get the new representation of the word.

Formula:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

• Q, K, V: Matrices of query, key, and value vectors

• d_k: Dimension of the key vectors (used for scaling)
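
The steps above can be written in a few lines of NumPy. This is a toy sketch with 4 tokens and embedding size 8, where random matrices stand in for the learned projection weights:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))          # 4 token embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(K.shape[-1])   # how much each token attends to the others
weights = softmax(scores)                 # each row sums to 1
context = weights @ V                     # weighted sum of the value vectors

print(weights.round(2))   # 4x4 attention matrix
print(context.shape)      # (4, 8): one context-aware vector per token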

Self-Attention = "Look Around"

Each word learns:

“What other words in the sentence should I focus on when forming my own
representation?”

And every word does this at the same time — hence "self-attention."

Example:

Sentence:

"The cat sat on the mat."

To understand "sat", self-attention might make the model focus more on "cat", while for
"mat", it might attend to "on" and "sat".

So each word gets contextualized based on the entire sentence.

Why It’s Powerful

• Handles long-term dependencies easily (no need for sequential processing like
RNNs).

• Can be parallelized (huge speed boost).

• Helps in understanding word meaning in context (e.g., "bank" in different sentences).

What is Transformer Architecture?

Introduced in the 2017 paper "Attention is All You Need", the Transformer is a neural
network designed to handle sequences without recurrence (no RNNs).

Instead, it relies entirely on the Self-Attention mechanism and Feed-Forward Networks — making it fast, scalable, and highly effective for NLP tasks like translation, summarization, and text generation.

Key Ideas

• Self-Attention: Every word in the sequence attends to every other word to understand its context.

• Positional Encoding: Since there's no recurrence, we inject info about the position of each word.

• Stacked Layers: Multiple layers of attention and feed-forward sub-layers process and refine the information.

• Parallelization: Entire sequences are processed at once (not step-by-step), making training fast.

Transformer Architecture Overview

It has two main parts (originally):

         ┌─────────┐      ┌─────────┐
Input ──▶│ Encoder │─────▶│ Decoder │──▶ Output
         └─────────┘      └─────────┘
But in many modern NLP models (like BERT or GPT), only one side is used.

1. Encoder (used in BERT, T5)

Each encoder layer contains:

• Multi-Head Self-Attention

• Add & Layer Norm

• Feed Forward Network

• Add & Layer Norm

There are N encoder layers stacked on top of each other.
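
PyTorch ships these blocks as ready-made modules; the sketch below stacks N = 2 encoder layers with arbitrary toy sizes (model width 32, 4 heads, feed-forward width 64):

import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, dim_feedforward=64,
                                   batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)   # N stacked encoder layers

tokens = torch.randn(1, 10, 32)   # (batch, sequence length, model width)
print(encoder(tokens).shape)      # torch.Size([1, 10, 32])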

2. Decoder (used in GPT, T5)

Each decoder layer contains:

• Masked Multi-Head Self-Attention (prevents attending to future words)

• Encoder-Decoder Attention (attends to the encoder’s output)

• Feed Forward Network

• Add & Norm layers throughout

Again, N decoder layers are stacked.
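
The "masked" part of the decoder's self-attention comes from a causal mask; here is a small PyTorch sketch that builds one for a 5-token sequence (the length is arbitrary):

import torch
import torch.nn as nn

seq_len = 5
mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
print(mask)
# 0 on and below the diagonal (allowed), -inf above it (future tokens are hidden),
# so after the softmax the attention weights on future positions become 0.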

Internal Block Structure

[Input Embeddings] + [Positional Encodings]
                 ↓
[Multi-Head Self-Attention]
                 ↓
[Add & Norm]
                 ↓
[Feed-Forward Network]
                 ↓
[Add & Norm]

Key Components Explained:

Positional Encoding

• Adds a sense of order to the sequence (since transformers have no loops).

• Injects position info using sinusoidal patterns, as sketched below.
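
A minimal NumPy sketch of those sinusoidal patterns (the toy sizes of 10 positions and model width 16 are arbitrary): even dimensions get a sine and odd dimensions a cosine, each at a different frequency.

import numpy as np

def positional_encoding(num_positions, d_model):
    positions = np.arange(num_positions)[:, None]   # (position, 1)
    dims = np.arange(d_model)[None, :]              # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])   # sine on even indices
    pe[:, 1::2] = np.cos(angles[:, 1::2])   # cosine on odd indices
    return pe

pe = positional_encoding(10, 16)
print(pe.shape)   # (10, 16): one position vector added to each token embedding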

Multi-Head Attention

• Runs self-attention multiple times in parallel.

• Each head learns different relationships in the sentence.

Feed-Forward Network

• A simple fully connected network applied to each position independently.

Residual Connections + Layer Norm

• Helps with training stability and flow of gradients.

Transformer in Action (Example NLP Tasks):

NLP Task | Use of Transformer
Text Classification | Use encoder output (e.g. BERT)
Machine Translation | Full encoder-decoder
Summarization | Encoder-decoder (e.g. T5, BART)
Text Generation | Decoder only (e.g. GPT series)

Encoder vs Decoder Use Cases


Model | Uses Encoder | Uses Decoder | Purpose
BERT | Yes | No | Understanding text
GPT | No | Yes | Generating text
T5 | Yes | Yes | Text-to-text tasks
BART | Yes | Yes | Seq2Seq tasks (like summarization)

Summary

• Transformers use attention instead of recurrence.

• They’re fast, parallelizable, and context-aware.

• Powerhouse behind almost every modern NLP model.
