
UNIT-2

Word Window Classification

• Word Window Classification is a technique used in natural language processing (NLP) to classify words based on the context provided by surrounding words, known as a "window".
• This approach is particularly useful in tasks like Named Entity Recognition (NER), Part-of-Speech (POS) tagging, and other sequence labelling tasks.

How Word Window Classification Works?
• Word Window: A word window is a fixed-size context around a target word. For instance, if the window size is 3, the window will include the target word, one word to its left, and one word to its right.
• Feature Extraction: The features for the target word are extracted from this window. These features can include the words themselves, their embeddings, POS tags, or any other relevant linguistic features.
• Model Training: A machine learning model (e.g., logistic regression, SVM, or a neural network) is trained on these features to classify the target word.
• Sliding Window: The window slides over the text, classifying each word based on its surrounding context.

Example
• Consider the sentence: "The quick brown fox jumps over the lazy dog."
• If the target word is "fox" and the window size is 3, the window will look like this:
• Previous word: "brown"
• Target word: "fox"
• Next word: "jumps"
• Features for "fox" could include the embeddings of "brown", "fox", and "jumps".

Example to illustrate how word window classification works:
• Task: Part-of-Speech Tagging
Example Sentence:
• The cat sat on the mat.
Goal:
• Assign each word in the sentence its correct part of speech (POS) tag.
Word Window:
• We'll use a word window of size 3 (1 word to the left, the target word, and 1 word to the right).
• For the sentence "The cat sat on the mat":
• Target Word: "cat"
– Word Window: [The, cat, sat]
– POS Tags: [DT (determiner), NN (Noun), VB (Verb)]
– Classification: NN (Noun)
• Target Word: "sat"
– Word Window: [cat, sat, on]
– POS Tags: [NN, VB, IN]
– Classification: VB (Verb)
• Target Word: "on"
– Word Window: [sat, on, the]
– POS Tags: [VB, IN, DT]
– Classification: IN (Preposition)

Applications
• Named Entity Recognition (NER): Classifying words into categories like person, location, organization, etc.
• Part-of-Speech (POS) Tagging: Assigning parts of speech to each word in a sentence.
• Chunking: Dividing a text into syntactically correlated parts like noun or verb phrases.

Benefits
• Context-Aware: Takes into account the surrounding context, leading to better classification performance.
• Simplicity: Relatively simple to implement and understand.
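To make the pipeline above concrete, the sketch below extracts fixed-size windows and trains a logistic-regression tagger on them. It is a minimal illustration, not a full POS/NER system: the toy training sentence, the window size of 3, and the feature names (w-1, w0, w+1) are assumptions chosen for the example, and scikit-learn is assumed to be available.

```python
# Minimal word-window POS tagger sketch (assumes scikit-learn is installed).
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def windows(tokens, pad="<PAD>"):
    """Yield size-3 windows (previous word, target word, next word) as feature dicts."""
    padded = [pad] + tokens + [pad]
    for i in range(1, len(padded) - 1):
        yield {"w-1": padded[i - 1], "w0": padded[i], "w+1": padded[i + 1]}

# Toy training data: one POS-tagged sentence (for illustration only).
tokens = ["The", "cat", "sat", "on", "the", "mat"]
tags   = ["DT",  "NN",  "VB",  "IN", "DT",  "NN"]

vec = DictVectorizer()                             # one-hot encodes the window features
X = vec.fit_transform(list(windows(tokens)))       # one feature row per target word
clf = LogisticRegression(max_iter=1000).fit(X, tags)

# Slide the same window over a new sentence and classify each word from its context.
test = ["A", "dog", "sat", "on", "a", "rug"]
print(list(zip(test, clf.predict(vec.transform(list(windows(test)))))))
```

With more training sentences and richer features (embeddings, POS tags of neighbours), the same sliding-window setup scales to real sequence-labelling tasks.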
Neural Networks for text
Neural networks are like smart algorithms that learn patterns from data. When it comes to text, here's a simple explanation:
Basic Concept:
• Neurons (Nodes): Think of neurons as tiny decision-makers. Each neuron takes some input (like a word or a number) and decides if it should pass that input along to the next layer of neurons based on how "important" it thinks the input is.
Layers:
– Input Layer: This is where the text data first enters the network. If we're working with text, each word or character can be turned into numbers (using something called embeddings or one-hot encoding) and fed into the input layer.
– Hidden Layers: These are layers between the input and output. They do the heavy lifting by learning complex patterns in the data. Each layer passes its output to the next one.
– Output Layer: This gives the final prediction, like identifying whether a text is positive or negative in sentiment analysis.
Learning:
• The network learns by adjusting the importance (weights) of the connections between neurons. It does this over many cycles (called epochs) using examples of text and the correct answers.
• It tries to minimize errors using a method called backpropagation, where it checks how far off its prediction was and adjusts the weights to do better next time.
Activation Functions:
• These are like filters that decide if a neuron should activate (send a signal forward). They add non-linearity, which helps the network learn complex patterns.
Applying to Text
• Text as Input: Text is turned into numbers (embeddings) that the network can understand.
• Pattern Recognition: The neural network learns patterns like which words usually appear together, sentence structures, or even the sentiment behind phrases.
• Prediction: After learning from lots of examples, it can predict things like the sentiment of a sentence, classify topics, or generate new text.
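As a rough sketch of these ideas (input layer, one hidden layer, output layer, non-linear activation), the following NumPy snippet runs a forward pass of a tiny sentiment-style classifier over a bag-of-words vector. The vocabulary, the randomly initialised weights, and the 4-unit hidden layer are assumptions for illustration; a real model would learn the weights with backpropagation over many epochs.

```python
import numpy as np

# Tiny assumed vocabulary; a sentence becomes a bag-of-words count vector (the input layer).
vocab = ["good", "bad", "movie", "great", "boring"]
def encode(sentence):
    words = sentence.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(4)   # input -> hidden weights
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output weights

def forward(x):
    h = np.maximum(0, x @ W1 + b1)              # hidden layer with ReLU activation (non-linearity)
    return 1 / (1 + np.exp(-(h @ W2 + b2)))     # output layer: probability of "positive"

print(forward(encode("great movie")))           # untrained weights, so the output is essentially random
```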

Embeddings
• Embeddings are a way to represent words or phrases as numerical vectors, making them understandable for machine learning models, especially neural networks.
Why Use Embeddings?
• Computers Understand Numbers: Text data needs to be converted into numbers because computers work with numbers, not words.
• Capture Meaning: Simple methods like assigning a unique number to each word don't capture the meaning or relationships between words. Embeddings solve this by encoding semantic relationships between words.
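A small sketch of what "encoding semantic relationships" means in practice: words map to vectors, and related words end up with similar vectors, which can be measured with cosine similarity. The 3-dimensional vectors below are made up for illustration; real embeddings (e.g., word2vec or GloVe) are learned from large corpora and have hundreds of dimensions.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings (real ones are learned and much larger).
emb = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "car": np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: close to 1.0 means similar direction, near 0.0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["cat"], emb["dog"]))  # high: semantically related words
print(cosine(emb["cat"], emb["car"]))  # low: unrelated words
```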
N-gram Language Models
• N-gram language models are a type of statistical model used in natural language processing (NLP) to predict the probability of a sequence of words in a sentence. They are called "N-gram" models because they consider sequences of "N" words at a time.

Key Concepts:
• N-gram:
– An N-gram is a contiguous sequence of "N" items (usually words) from a given text.
– For example:
• Unigram (1-gram): A single word (e.g., "The")
• Bigram (2-gram): A sequence of two words (e.g., "The cat")
• Trigram (3-gram): A sequence of three words (e.g., "The cat sits")
• And so on...
• Language Model:
– A language model assigns probabilities to sequences of words.
– For an N-gram model, the probability of a word depends on the previous N-1 words.
– For example, in a trigram model, the probability of a word depends on the two preceding words.

How It Works?
• The model is trained on a large corpus of text, counting how often different N-grams occur.
• It uses these counts to estimate the probability of a word following a given sequence of N-1 words.
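A minimal sketch of this count-and-estimate procedure for a bigram model (N = 2), using plain Python. The two training sentences are made up for illustration; real models are trained on much larger corpora and usually add smoothing for unseen N-grams.

```python
from collections import Counter

# Toy training corpus (assumed for illustration).
corpus = ["the cat sat on the mat", "the dog sat on the rug"]

bigram_counts = Counter()
unigram_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split()          # <s> marks the start of a sentence
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[(prev, word)] += 1
        unigram_counts[prev] += 1

def p(word, prev):
    """Estimate P(word | prev) = count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p("sat", "cat"))   # 1.0  -> "cat" is always followed by "sat" in this tiny corpus
print(p("cat", "the"))   # 0.25 -> "the" is followed by "cat" in 1 of its 4 occurrences
```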
Applications
• Text Prediction: N-gram models can predict the next word or sequence of words in a sentence (e.g., in autocomplete).
• Speech Recognition: They help in determining the most probable words spoken in a sequence.
• Spelling Correction: They can suggest corrections by considering the most likely word sequences.
Limitations:
• Limited Context: Higher-order N-gram models (e.g., trigram, 4-gram) capture more context but require much more data and computational power.
• Data Sparsity: Rare N-grams might not appear often enough in the training data, leading to poor probability estimates for some word sequences.
• Overfitting: High-order N-gram models might fit the training data too closely and not generalize well to new text.

Perplexity
• Perplexity is a measurement used in natural language processing to evaluate the quality of a language model.
• It essentially tells us how well a probability model predicts a sample of text.
• Lower perplexity indicates a better model because it suggests the model is better at predicting the text.
What is Perplexity?
• Understanding Perplexity:
– Perplexity is the exponentiation of the average negative log-likelihood of a test set, which can be interpreted as the average branching factor of a language model.
– In simpler terms, it tells us how "surprised" the model is by the text. If a model is well-trained and predicts the text well, it will have low perplexity (low surprise). If the model is poorly trained, it will have high perplexity (high surprise).
Perplexity and Language Models:
• N-gram Models: Perplexity is often used to evaluate N-gram models. A trigram model, for example, will have lower perplexity than a bigram model if it better captures the text's patterns.
• Neural Language Models: Modern neural language models (e.g., RNNs, Transformers) often achieve much lower perplexity than traditional N-gram models, indicating they are better at predicting sequences of words.
Interpreting Perplexity:
• A lower perplexity score indicates a better model. For example, if one model has a perplexity of 50 and another has 100, the first model is considered to be better at predicting the text.
• However, perplexity is relative; it should be compared within the same dataset and task.

Example
• Suppose a language model predicts the following sequence: "The cat sat on the mat."
• If the model predicts each word with high probability, the perplexity will be low, suggesting the model understands the text well.
• If the model predicts each word with low probability, the perplexity will be high, suggesting the model is less effective.
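Since perplexity is the exponentiation of the average negative log-likelihood, it can be computed directly from the per-word probabilities a model assigns. The probabilities below are made-up values for the example sentence, just to show the arithmetic; in practice they would come from an actual language model.

```python
import math

# Hypothetical per-word probabilities assigned by a model to "The cat sat on the mat."
word_probs = [0.2, 0.1, 0.3, 0.25, 0.2, 0.15]

# Perplexity = exp( -(1/N) * sum(log p(w_i)) )
avg_neg_log_likelihood = -sum(math.log(p) for p in word_probs) / len(word_probs)
print(math.exp(avg_neg_log_likelihood))   # ~5.3: on average the model is "choosing" among ~5 words

# If the model were more confident (higher probabilities), perplexity would drop:
confident = [0.8, 0.7, 0.9, 0.85, 0.8, 0.75]
print(math.exp(-sum(math.log(p) for p in confident) / len(confident)))  # ~1.25
```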
Hidden Markov Models
• A Hidden Markov Model (HMM) is a statistical model used to represent systems that are governed by a Markov process with hidden states.
• HMMs are widely used in areas such as speech recognition, natural language processing, and bioinformatics.
States:
• Hidden States: The actual states of the system are not directly observable. Instead, they are inferred based on observable outputs.
• Observable States: These are the outputs or observations that can be directly seen or measured.
• Markov Property: The Markov property assumes that the probability of transitioning to the next state depends only on the current state, not on the sequence of previous states. This is known as the first-order Markov property.
• Transition Probabilities: These define the likelihood of moving from one hidden state to another. They are usually represented in a matrix called the transition matrix.
• Emission Probabilities: These define the likelihood of observing a particular output given a specific hidden state. These are represented in an emission matrix.
• Initial State Probabilities: These are the probabilities of the system starting in each possible hidden state.
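To make these components concrete, here is a small sketch that writes down an HMM (initial, transition, and emission probabilities) and uses the standard forward algorithm to score an observation sequence. The weather-style states, the observations, and all the probability values are assumptions invented for the example.

```python
import numpy as np

# Hidden states and observable outputs (hypothetical example).
states = ["Rainy", "Sunny"]
observations = ["walk", "shop", "clean"]

initial = np.array([0.6, 0.4])               # initial state probabilities
transition = np.array([[0.7, 0.3],           # P(next hidden state | current hidden state)
                       [0.4, 0.6]])
emission = np.array([[0.1, 0.4, 0.5],        # P(observation | hidden state)
                     [0.6, 0.3, 0.1]])

def forward(obs_indices):
    """Forward algorithm: probability of the observation sequence under the HMM."""
    alpha = initial * emission[:, obs_indices[0]]
    for o in obs_indices[1:]:
        alpha = (alpha @ transition) * emission[:, o]
    return alpha.sum()

# P(observing "walk", then "shop", then "clean")
seq = [observations.index(o) for o in ["walk", "shop", "clean"]]
print(forward(seq))
```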
Recurrent Neural Network

RNN (Recurrent Neural Network): Recurrent Neural Networks (RNNs) are a class of neural networks specifically designed to handle sequential data, making them particularly well-suited for tasks in Natural Language Processing (NLP). Unlike traditional feedforward neural networks, RNNs have loops that allow information to be passed from one step of the network to the next, enabling the network to maintain a memory of previous inputs.

Key Features of RNNs:
• Sequential Data Handling: RNNs are designed to process sequences of data, such as sentences or time series, where the order of the elements is important.
• Hidden State: RNNs maintain a hidden state that is updated at each time step, capturing information about previous elements in the sequence. This hidden state helps the network retain context and understand dependencies between words in a sentence.
• Shared Weights: The same set of weights is applied at each time step, allowing RNNs to generalize across different positions in the input sequence.

RNNs are widely used in various NLP tasks, including:
• Language Modelling: Predicting the next word in a sequence based on the previous words.
• Text Generation: Generating coherent text by predicting sequences of words, character by character or word by word.
• Machine Translation: Translating text from one language to another by processing the input sequence and generating the corresponding sequence in the target language.
• Speech Recognition: Converting spoken language into text by processing the sequence of audio features.
• Sentiment Analysis: Analysing sequences of text to determine the sentiment (positive, negative, neutral) expressed in the text.
Architecture of an RNN:
• Input Layer: Accepts sequential input data (e.g., words in a sentence, stock prices).
• Hidden Layer: Contains a recurrent connection that allows the network to remember past states.
• Output Layer: Generates the output at each time step or after processing the entire sequence.
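The sketch below shows the recurrence behind this architecture with plain NumPy: at every time step the same weight matrices (shared weights) combine the current input with the previous hidden state to produce a new hidden state and an output. The dimensions, random weights, and random input sequence are all assumptions for illustration; real RNNs learn these weights by backpropagation through time.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 4, 8, 3

# Shared weights, reused at every time step.
W_xh = rng.normal(scale=0.1, size=(input_size, hidden_size))   # input  -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden (the recurrent loop)
W_hy = rng.normal(scale=0.1, size=(hidden_size, output_size))  # hidden -> output

def rnn_forward(sequence):
    h = np.zeros(hidden_size)                     # initial hidden state (the "memory")
    outputs = []
    for x in sequence:                            # one step per element of the sequence
        h = np.tanh(x @ W_xh + h @ W_hh)          # update hidden state from current input + previous state
        outputs.append(h @ W_hy)                  # output at this time step
    return outputs, h

sequence = rng.normal(size=(5, input_size))       # e.g., 5 word vectors
outputs, final_state = rnn_forward(sequence)
print(len(outputs), final_state.shape)            # 5 outputs, hidden state of size 8
```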
Challenges with RNNs:
• Vanishing Gradient Problem: Gradients diminish over long sequences, making it difficult for the network to learn long-term dependencies.
• Exploding Gradient Problem: Gradients can grow excessively large during backpropagation, causing instability.
• Limited Memory: Difficulty in handling very long sequences due to reliance on hidden states.

Vanishing Gradients and Exploding Gradients

Vanishing Gradients and Exploding Gradients are two common problems encountered during the training of deep neural networks, particularly in Recurrent Neural Networks (RNNs) and other deep architectures. These issues arise during the backpropagation process, which is used to update the model's weights by calculating gradients.

Vanishing Gradients
• The Vanishing Gradient problem occurs when the gradients of the loss function with respect to the model's parameters become very small as they are propagated backward through the network. This leads to very small updates to the model's weights, effectively stalling learning, particularly in the early layers of the network. This problem is especially prevalent in deep networks or in RNNs when trying to capture long-term dependencies.
• In RNNs: When processing long sequences, the contributions from earlier inputs diminish exponentially, making it difficult for the network to learn relationships between distant inputs in the sequence.
• Impact: The model struggles to learn and represent long-range dependencies in the data, leading to poor performance on tasks that require understanding of context over long sequences (e.g., in NLP tasks like language modelling or translation).

Exploding Gradients
• The Exploding Gradient problem occurs when the gradients become excessively large during backpropagation. This can cause the model's weights to grow exponentially, leading to numerical instability. The model may diverge during training, making it impossible to learn anything meaningful.
• In RNNs: This typically happens when there are large weight values or when trying to model very complex sequences, where the error gradients multiply and grow rapidly as they propagate backward through time.
• Impact: The training process becomes unstable, with the model's loss function often resulting in "NaN" (Not a Number) values, and the model's performance deteriorates.
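The notes above do not cover remedies, but one common mitigation for exploding gradients is worth sketching: gradient clipping, which rescales the gradient vector whenever its norm exceeds a threshold before the weight update. The NumPy version below is a simplified illustration with an assumed threshold of 5.0; deep-learning frameworks provide equivalents (for example, PyTorch's torch.nn.utils.clip_grad_norm_).

```python
import numpy as np

def clip_by_global_norm(gradients, max_norm=5.0):
    """Rescale a list of gradient arrays so their overall norm never exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in gradients))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        gradients = [g * scale for g in gradients]
    return gradients

# An artificially "exploded" gradient gets scaled back down before the weight update.
grads = [np.array([300.0, -400.0])]           # norm = 500
clipped = clip_by_global_norm(grads)
print(np.linalg.norm(clipped[0]))             # 5.0
```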
