Embedding
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
shylaja.sharath@pes.edu
Sequence models are designed for tasks involving sequential data, where the order of elements is crucial.
Applications include speech recognition.
Source: https://cbare.github.io/images/rnn-applications.png
Characteristics of Sequential Data
Source: https://www.researchgate.net/publication/320032800/figure/fig2/AS:631649222025264@1527608318736/Sequential-data-can-be-ordered-in-many-ways-including-A-temperature-measured-over-time.png
Key Features of Sequence Models
Memory Maintenance: Sequence models maintain a 'memory' of previous inputs, allowing them to
influence future predictions.
Adaptability: They can handle various types of sequential data, including text, audio, and time series.
Source: https://www.scaler.com/topics/images/sequence-to-sequence-model-1.webp
Common Notations Used in Sequence Models
Sequence Indexing: x^<t> denotes the element of the input sequence at time step t and y^<t> the corresponding output; T_x and T_y denote the lengths of the input and output sequences.
START and END Tokens:
START Token (<s>): Indicates the beginning of a sequence, essential for models to know when to
start generating output.
END Token (</s>): Signals the completion of a sequence, allowing models to determine when to stop
generating output.
Source: https://neclab.eu/fileadmin/_processed_/6/9/csm_bison_df9f29f797.png
Common Notations Used in Sequence Models
Input Example: x represents the input sentence (e.g., "Harry Potter and Hermione
Granger invented a new spell.").
Recurrent Neural Networks (RNNs) are a class of neural networks designed to process sequential data
by retaining information from previous inputs.
Basic Structure:
Input Layer: Receives sequential data.
Hidden Layer: Contains recurrent connections that allow information to be passed from one time step
to the next, maintaining a hidden state (memory).
Output Layer: Produces the final output based on the processed information.
Functionality: RNNs utilize loops in their architecture, enabling them to remember previous inputs and
make predictions based on both current and past data.
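To make the recurrence concrete, here is a minimal NumPy sketch (not the course's reference code) of one forward pass of a vanilla RNN; the tanh activation, layer sizes, and random parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 4, 8, 5

# Learned parameters (randomly initialised here just so the sketch runs).
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden (recurrent loop)
b_h = np.zeros(hidden_size)

x = rng.normal(size=(seq_len, input_size))  # one input sequence
h = np.zeros(hidden_size)                   # initial hidden state ("memory")

for t in range(seq_len):
    # The new hidden state depends on the current input AND the previous hidden state.
    h = np.tanh(W_xh @ x[t] + W_hh @ h + b_h)

print("final hidden state:", h.round(3))
```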
Backpropagation Through Time (BPTT)
BPTT trains an RNN by unrolling it across the time steps of a sequence and propagating the gradient of the loss backward through every step, so the shared weights are updated using contributions from all time steps.
Source1: https://miro.medium.com/v2/resize:fit:1204/0*SWHzEFzYRDSnc3w2
Source2: https://stanford.edu/~shervine/teaching/cs-230/illustrations/architecture-rnn-ltr.png?9ea4417fc145b9346a3e288801dbdfdc
Types of RNNs
Source: https://api.wandb.ai/files/ayush-thakur/images/projects/103390/4fc355be.png
Types of RNNs
● Many-to-One RNN:
○ Multiple inputs are processed to produce a single output.
○ This structure aggregates information from a sequence into one final prediction.
○ Example: Sentiment Analysis, where a sequence of words is analyzed to determine the
sentiment (positive or negative).
● Many-to-Many RNN:
○ Multiple inputs are mapped to multiple outputs.
○ Each input in the sequence can produce an output, allowing for complex
transformations.
○ Example: Machine Translation, where a sentence in one language is translated into another
language (see the sketch below).
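A minimal sketch of the difference, assuming PyTorch is available; the layer sizes and the linear output heads are illustrative, not part of the slides.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=16, hidden_size=32, batch_first=True)
x = torch.randn(2, 10, 16)          # (batch, time steps, features)
outputs, h_n = rnn(x)               # outputs: (2, 10, 32) = one hidden state per time step

# Many-to-one (e.g., sentiment analysis): keep only the last time step.
sentiment_logits = nn.Linear(32, 2)(outputs[:, -1, :])      # shape (2, 2)

# Many-to-many (e.g., per-step outputs): use every time step.
per_step_logits = nn.Linear(32, 1000)(outputs)              # shape (2, 10, 1000)
print(sentiment_logits.shape, per_step_logits.shape)
```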
Language Models
Role in Sequence Generation: Language models predict the next word in a sequence based on previous words,
making them essential for tasks like text generation and machine translation.
Functionality: By leveraging the hidden state in RNNs, language models can capture contextual relationships between
words, improving coherence and relevance in generated text.
Source: https://www.researchgate.net/publication/387105119/figure/fig1/AS:11431281298347911@1734405295120/a-Next-token-
prediction-The-LLM-is-trained-to-predict-thenext-token-in-corpus.png
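The next-word-prediction view corresponds to the standard chain-rule factorization of a sequence probability; this LaTeX rendering is the usual textbook formula, not an equation taken from the slides.

```latex
% A language model factorises the probability of a sequence into a
% product of conditional next-token probabilities.
P(w_1, w_2, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
```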
Sampling Novel Sequences
Sampling novel sequences involves generating new data points from a trained model,
particularly in tasks like text generation, music composition, or image synthesis.
Techniques: Sample the next word from the model's output (softmax) distribution, feed it back as the input at the next time step, and repeat until an END token or a length limit is reached (a minimal sketch follows below).
Examples:
Text Generation: Using an RNN to generate sentences by sampling words based on learned
probabilities.
Music Composition: Generating new melodies by sampling notes from a trained model.
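A minimal sketch of this sampling loop; `tiny_model`, the toy vocabulary, and the uniform distribution are stand-ins for a trained network's softmax output, not anything defined in the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<s>", "the", "cat", "sat", "on", "mat", "</s>"]

def tiny_model(tokens):
    """Stand-in for an RNN's softmax output: a uniform distribution over the vocab."""
    return np.full(len(vocab), 1.0 / len(vocab))

def sample_sequence(max_len=20):
    tokens = ["<s>"]
    while len(tokens) < max_len:
        probs = tiny_model(tokens)                 # P(next word | words so far)
        next_word = rng.choice(vocab, p=probs)     # sample rather than take the argmax
        tokens.append(next_word)
        if next_word == "</s>":                    # END token: stop generating
            break
    return tokens

print(sample_sequence())
```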
Source: https://www.researchgate.net/publication/371808605/figure/fig4/AS:11431281201908100@1698662784771/Schematic-of-the-novel-sampling-approach-based-on-image-segmentation-technology.png
Vanishing Gradients Problem
When gradients are backpropagated through many time steps, repeated multiplication by small factors shrinks them toward zero, so early time steps receive almost no learning signal and long-range dependencies become hard to learn.
Source: https://miro.medium.com/v2/resize:fit:1280/1*yRetLBtXzHj9zsZ-n_BqDg.gif
Vanishing Gradients Problem
Solutions:
● Long Short-Term Memory (LSTM): Introduces memory cells and gating mechanisms that
help retain information over long periods and mitigate the vanishing gradient problem.
● Gated Recurrent Units (GRU): A simpler alternative to LSTMs that merges the input and
forget gates into a single update gate to manage information flow effectively.
● Gradient Clipping: A technique where gradients are capped at a threshold to prevent them
from growing too large; it addresses the exploding gradients problem rather than the
vanishing gradients problem (see the snippet below).
Source: https://analyticsindiamag.com/wp-content/uploads/2021/11/gradients.png
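A minimal sketch of gradient clipping inside a PyTorch training step; the model, data, and loss are placeholders, and only the `clip_grad_norm_` call is the point being illustrated.

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 12, 8)
output, _ = model(x)
loss = output.pow(2).mean()          # placeholder loss just to produce gradients

optimizer.zero_grad()
loss.backward()
# Rescale gradients so their global norm is at most 1.0 (targets exploding gradients).
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```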
Concept of Deep RNNs
Deep Recurrent Neural Networks (Deep RNNs) extend the traditional RNN architecture by stacking
multiple layers of recurrent units. This allows the network to learn more complex representations
and capture intricate patterns in sequential data.
Mechanism: In Deep RNNs, the hidden state information is passed not only to the next time step of
the current layer but also to the current time step of the next layer, facilitating richer temporal
dynamics.
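A minimal sketch of stacking recurrent layers, assuming PyTorch; `num_layers=3` and the tensor sizes are illustrative choices, not values from the slides.

```python
import torch
import torch.nn as nn

# num_layers stacks recurrent layers: each layer's hidden states feed the layer
# above at the same time step, as described for Deep RNNs.
deep_rnn = nn.LSTM(input_size=16, hidden_size=32, num_layers=3, batch_first=True)
x = torch.randn(2, 10, 16)
outputs, (h_n, c_n) = deep_rnn(x)
print(outputs.shape)  # (2, 10, 32): top-layer hidden states for every time step
print(h_n.shape)      # (3, 2, 32): final hidden state of each of the 3 layers
```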
Introduction to Word Embeddings
Word embeddings are dense vector representations of words in a continuous vector space, where similar
words are positioned close to each other based on their semantic meanings and contextual usage.
Importance in NLP:
● Semantic Understanding: Word embeddings enable models to understand the meaning of words based on
their context, capturing nuanced relationships that traditional methods like one-hot encoding cannot.
● Dimensionality Reduction: Compared to one-hot encoding, which results in high-dimensional sparse
vectors, word embeddings provide a lower-dimensional representation that is computationally efficient.
● Facilitates Machine Learning: By converting words into numerical vectors, embeddings allow machine
learning algorithms to process text data effectively.
Key Concepts:
● Distributional Hypothesis: This principle states that words with similar meanings tend to occur in similar
contexts, forming the basis for training word embeddings.
● Training Methods: Common approaches include Word2Vec, a predictive model that uses neural networks to
learn word representations from surrounding context, and GloVe, a count-based model built from global
co-occurrence statistics (both are covered in later slides).
Using Word Embeddings
Pre-trained Embeddings:
● Utilizing pre-trained models (e.g., Google News vectors) can save time and resources. These
embeddings can be fine-tuned for specific tasks or used directly in various applications.
Source : https://kavita-ganesan.com/wp-content/uploads/skipgram-vs-cbow-continuous-bag-of-words-word2vec-word-representation-1024x538.png
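A minimal sketch of loading pre-trained vectors, assuming gensim is installed and that the "word2vec-google-news-300" model name is available through gensim's downloader in your version.

```python
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")   # large download on first use
print(vectors["king"].shape)                     # 300-dimensional dense vector
print(vectors.most_similar("king", topn=3))      # nearest neighbours in the embedding space
```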
Using Word Embeddings
Source: https://thinkingneuron.com/wp-content/uploads/2021/12/image-38.png
Key Properties of Word Embeddings
Dimensionality: Embeddings typically use a few hundred dimensions (commonly 50-300), far fewer than the vocabulary-sized one-hot vectors they replace.
Semantic Relationships:
The distance between word vectors in the embedding space reflects their semantic similarity. Words
with similar meanings are located closer together, while those with different meanings are further
apart.
Example: In a well-trained embedding space, the vectors for "king" and "queen" are close together,
while "king" and "car" are farther apart.
Key Properties of Word Embeddings
Arithmetic Properties:
Vector Arithmetic: Word embeddings exhibit interesting arithmetic properties that allow for
semantic manipulation. For instance, the relationship can be expressed as:
vec("king") - vec("man") + vec("woman") ≈ vec("queen")
This property illustrates how embeddings can capture gender relationships and other analogies.
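A minimal sketch of the analogy arithmetic with cosine similarity. The random vectors below only demonstrate the mechanics; the "king - man + woman ≈ queen" result holds for well-trained embeddings, not for these placeholder values.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["king", "queen", "man", "woman", "car"]}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(query, emb[w]))
print("closest word to king - man + woman:", best)
```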
Structure of the Embedding Matrix:
An embedding matrix is a two-dimensional array where each row corresponds to a unique word in
the vocabulary and each column represents a dimension of the embedding.
Usage:
● Input to Models: The embedding matrix is used as input to machine learning models, allowing
them to process text data as numerical vectors.
● Training Process: During training, the embeddings are adjusted based on the context in which
words appear, optimizing their positions in the vector space to reflect semantic relationships
accurately.
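A minimal sketch of an embedding-matrix lookup: row i of E is the embedding of word i, and multiplying by a one-hot vector recovers the same row. The toy vocabulary and sizes are illustrative.

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2}
vocab_size, dim = len(vocab), 4
E = np.random.default_rng(0).normal(size=(vocab_size, dim))  # embedding matrix

one_hot = np.zeros(vocab_size)
one_hot[vocab["cat"]] = 1.0
assert np.allclose(one_hot @ E, E[vocab["cat"]])   # lookup == one-hot multiplication
print(E[vocab["cat"]])
```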
Techniques for Learning Word Embeddings
Overview of Methods: the most widely used approaches are Word2Vec (CBOW and Skip-Gram, typically trained with negative sampling) and GloVe (count-based, built from global co-occurrence statistics); both are described in the following slides.
Source: https://miro.medium.com/v2/resize:fit:1400/1*Pub6nXI7Wg4wbNQvEsyd4Q.png
Word2Vec Model
Developed by Google, Word2Vec is a predictive model that uses a neural network to learn word representations.
Continuous Bag of Words (CBOW): Predicts the target word from surrounding context words. Example: For the sentence "The cat
sat on the mat," if "sat" is the target, CBOW uses "The," "cat," "on," "the," and "mat" as input.
Skip-Gram: Predicts surrounding context words given a target word. Example: Given the target word "sat," it predicts "The," "cat,"
"on," "the," and "mat."
Source: https://miro.medium.com/v2/resize:fit:1400/1*xC6wfTU_zpUlpRlXs5NZ4w.png
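A minimal sketch of training Word2Vec with gensim on a toy corpus (real use needs a much larger corpus); the corpus, `vector_size`, and `window` values are illustrative assumptions.

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                 sg=1,          # 1 = Skip-Gram, 0 = CBOW
                 negative=5)    # negative sampling with 5 noise words
print(model.wv["cat"].shape)                 # (50,)
print(model.wv.most_similar("cat", topn=2))
```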
Negative Sampling Technique
Negative sampling is used to improve the efficiency of training models like Word2Vec by focusing on meaningful
comparisons during optimization.
Process: Instead of calculating gradients for all possible output words, negative sampling selects a small subset of
negative examples that are not true associations with the input. For example, when training the model with the
target word "king," it might learn to associate it with "queen" while distinguishing it from unrelated words like
"table."
Benefits:
● Reduces computational complexity by limiting the number of comparisons needed during training.
● Simplifies the training objective to a binary classification task, where the model predicts whether pairs of words
are likely to co-occur.
Source: https://i.sstatic.net/7V7EO.png
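For reference, the skip-gram-with-negative-sampling objective for a centre word c, an observed context word o, and k sampled negative words is commonly written as below; the symbols u (context vectors), v (centre vectors), and P_n (noise distribution) follow the original Word2Vec formulation rather than these slides.

```latex
% Maximise the score of the true (centre, context) pair while pushing down
% k randomly sampled "negative" words drawn from the noise distribution P_n.
J(o, c) = \log \sigma\!\left(u_o^{\top} v_c\right)
        + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\!\left(-u_{w_i}^{\top} v_c\right)\right]
```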
Negative Sampling in Word2Vec
Source: https://miro.medium.com/v2/resize:fit:2000/format:webp/1*YnI8R6DWKd8Zw1mTeF-bSA.png
GloVe Word Vectors
GloVe is a count-based model that generates word embeddings by leveraging global statistical
information from a corpus. It creates a co-occurrence matrix that captures how often words appear
together in a given context.
Comparison with Word2Vec: While Word2Vec focuses on local context (predicting surrounding
words), GloVe utilizes global statistics to derive embeddings.
Key Features: Captures both global and local context information, making it effective for various
NLP tasks.
Applications: Similar to Word2Vec, GloVe is used in sentiment analysis, recommendation systems,
and information retrieval.
Word2Vec vs. GloVe
Word2Vec:
● Transforms the unlabelled raw corpus into labelled data.
● Requires little memory.
● The mapping between the target word and its context words implicitly embeds the sub-linear relationship into the vector space of words.
● The sub-linear relationships are not explicitly defined.
● Computationally expensive.
GloVe:
● Enforces the word vectors to capture sub-linear relationships in the vector space.
● Adds more practical meaning to word vectors.
● Gives lower weight to highly frequent word pairs, preventing meaningless stop-words from dominating the training process.
● Takes a lot of memory to store the co-occurrence matrix.
● Time consuming, since the co-occurrence matrix must be rebuilt when hyper-parameters change.
Introduction to Sentiment Analysis
Sentiment analysis, also known as opinion mining, is a Natural Language Processing (NLP) technique
that involves identifying and extracting subjective information from text data. It determines the
emotional tone behind a series of words, classifying sentiments as positive, negative, or neutral.
Techniques Used:
● Lexicon-Based Approach: Utilizes predefined lists of words (lexicons) with associated sentiment
scores. The overall sentiment is calculated by summing the scores of words in a text (see the sketch after this list).
● Machine Learning-Based Approach: Employs algorithms trained on labeled datasets to classify
sentiment. Common algorithms include Naive Bayes, Support Vector Machines (SVM), and
Logistic Regression.
● Deep Learning-Based Approach: Uses neural networks (e.g., RNNs, LSTMs, Transformers like
BERT) to capture complex patterns and context in text data.
○ Advantages: High accuracy in understanding context and nuances.
○ Limitations: Computationally intensive and requires significant resources.
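A minimal sketch of the lexicon-based approach described above: sum per-word sentiment scores from a small hand-made lexicon. The lexicon and its scores are illustrative placeholders, not a real resource.

```python
sentiment_lexicon = {"good": 1.0, "great": 2.0, "love": 2.0,
                     "bad": -1.0, "terrible": -2.0, "boring": -1.5}

def lexicon_sentiment(text):
    words = text.lower().split()
    score = sum(sentiment_lexicon.get(w, 0.0) for w in words)  # unknown words score 0
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("The movie was great but a little boring"))  # net positive
```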
Debiasing Word Embeddings
Word embeddings can inadvertently capture and perpetuate biases present in training data.
Debiasing is crucial to ensure fairness and ethical considerations in NLP applications.
Methods of Debiasing:
● Hard Debiasing: Projects word vectors onto the subspace orthogonal to an estimated gender
direction, removing the gender component while retaining semantic meaning (see the sketch after this list).
● Soft Debiasing: A more flexible approach that reduces bias while allowing some degree of
information retention. This method adjusts embeddings based on their proximity to biased
terms.
● FineDeb Framework: A two-phase debiasing process that first modifies embeddings learned by
the language model and then fine-tunes the model on the language modeling objective. This
approach has shown effectiveness in reducing bias while maintaining performance.
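A minimal sketch of the projection step used in hard debiasing. The vectors here are random placeholders; in practice the bias direction is estimated from definitional pairs such as ("he", "she") in trained embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["he", "she", "doctor", "nurse"]}

gender_direction = emb["he"] - emb["she"]
gender_direction /= np.linalg.norm(gender_direction)

def debias(vector, direction):
    # Subtract the component of the vector that lies along the bias direction.
    return vector - (vector @ direction) * direction

debiased_doctor = debias(emb["doctor"], gender_direction)
print(abs(debiased_doctor @ gender_direction))  # ~0: no gender component left
```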
Debiasing Word Embeddings
Source: https://arxiv.org/html/2402.11512v2/x1.png
Basic Attention Models
Attention mechanisms allow models to focus on specific parts of the input sequence, enhancing
performance in tasks such as machine translation and text summarization by assigning different
importance levels to various input elements.
Significance:
● Contextual Relevance: Attention helps models weigh the importance of different words in a
sentence, similar to how humans focus on relevant information while ignoring distractions.
● Improved Performance: By enabling models to consider the entire context rather than relying
solely on fixed-length representations, attention mechanisms significantly enhance the ability
to capture long-range dependencies.
Basic Attention Models
Types of Attention:
● Scaled Dot-Product Attention: Computes attention weights using the dot product of query
and key vectors, scaled by the square root of their dimension.
● Multi-Head Attention: Allows the model to attend to different parts of the input
simultaneously, capturing various relationships and interactions.
Attention Mechanism Illustrated
Picking the Most Likely Sentence
Attention Scoring: Each word in a sequence is assigned an attention score based on its
relevance to the context. The model computes these scores using learned weights from
the training phase.
In a translation task, when translating "The cat sat on the mat," the model will assign
higher attention scores to "cat" and "mat" when generating corresponding words in another
language.
Beam Search Strategy
Beam search is an optimization technique used in NLP for generating sequences. It maintains
multiple hypotheses at each step of generation, allowing for a more comprehensive search
compared to greedy algorithms.
Key Components:
● Beam Width: This hyperparameter determines how many sequences (hypotheses) are kept at
each step. A larger beam width results in a broader search but increases computational costs.
● Process: The algorithm begins with an initial state and expands all possible next states. It then
selects the top 'N' states based on their probabilities for further exploration (a minimal sketch follows after this list).
● Refinements: Adjusting beam width can balance between computational efficiency and output
quality. Techniques like length normalization can also be applied to prevent bias towards
shorter sequences.
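A minimal sketch of beam search over a toy next-token distribution; `next_token_probs`, the vocabulary, and the fixed probabilities are stand-ins for a trained decoder, and length normalization is omitted for brevity.

```python
import math

def next_token_probs(prefix):
    """Stand-in model: slightly prefers 'a', then the end token."""
    return {"a": 0.5, "b": 0.3, "</s>": 0.2}

def beam_search(beam_width=2, max_len=5):
    # Each hypothesis is (tokens, accumulated log-probability).
    beams = [(["<s>"], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, logp in beams:
            if tokens[-1] == "</s>":          # finished hypotheses are carried over unchanged
                candidates.append((tokens, logp))
                continue
            for tok, p in next_token_probs(tokens).items():
                candidates.append((tokens + [tok], logp + math.log(p)))
        # Keep only the top `beam_width` hypotheses by score.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

for tokens, logp in beam_search():
    print(" ".join(tokens), round(logp, 3))
```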
Beam Search Strategy Illustrated
Source: https://miro.medium.com/v2/resize:fit:770/1*tEjhWqUgjX37VnT7gJN-4g.png
Error Analysis in Beam Search
Common Pitfalls:
● Suboptimal Beam Width Selection: Choosing a beam width that is too small may lead to missing optimal sequences, while too large a beam width may increase the computational burden without significant gains.
● Pruning Issues: Incorrect pruning strategies can lead to discarding potentially good candidates early in the search process, resulting in suboptimal outputs.
● Lack of Diversity: If all hypotheses are too similar, beam search may converge on a single type of output, limiting creativity and variability in generated sequences.
Mitigation Strategies: Implementing techniques such as diverse beam search can help maintain diversity among generated sequences while still focusing on high-probability candidates.
Attention Model Intuition
Attention mechanisms enable neural networks to focus selectively on specific parts of the input
data, allowing for more efficient processing and better performance in tasks where context is
crucial.
Conceptual Overview:
Human Analogy: Just as humans can focus on relevant information while ignoring distractions,
attention mechanisms allow models to weigh the importance of different input elements
dynamically.
Attention Model Intuition
How It Works:
The attention mechanism computes a set of attention scores, which determine how much focus to
place on each part of the input. These scores are derived from the relationships between the input
elements.
Mathematical Representation:
● Given an input sequence represented as vectors, the attention weight α_ij for each pair of
input elements can be computed as
  α_ij = exp(score(q_i, k_j)) / Σ_j' exp(score(q_i, k_j'))
● where score(q_i, k_j) is a function (e.g., dot product) that measures the compatibility
between query q_i and key k_j.
Attention Model Intuition
Source: https://www.youtube.com/watch?v=S7oA5C43Rbc
Detailed Attention Model Explanation
Components Involved:
1. Queries (Q): Represent the current state or position in the sequence that needs to focus
on other parts.
2. Keys (K): Represent all positions in the input sequence that can be attended to.
3. Values (V): Contain the actual information that corresponds to each key.
This is a common technique used in attention models where the dot product of queries and keys is
computed to derive attention scores. The scores are then scaled by the square root of the
dimension of the key vectors to stabilize gradients during training.
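The description above corresponds to the standard scaled dot-product attention formula; the symbol d_k for the key dimension is this rendering's notation, not taken from the slide.

```latex
% Scaled dot-product attention over queries Q, keys K, and values V,
% where d_k is the dimension of the key vectors.
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```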
Detailed Attention Model Explanation
Multi-Head Attention:
Extends scaled dot-product attention by allowing multiple sets of queries, keys, and values. Each
head learns different aspects of the input data, providing richer representations.
Attention mechanisms in speech recognition allow models to focus on specific segments of audio input,
improving the accuracy of transcribing spoken language into text.
● Use Cases:
○ Real-Time Transcription: Systems like Google Voice and virtual assistants (e.g., Siri, Alexa) utilize
attention mechanisms to enhance the accuracy of real-time speech-to-text conversion by
focusing on relevant audio features.
○ Contextual Understanding: Attention helps models determine which parts of the audio signal
correspond to specific words, enabling better handling of homophones and context-dependent
phrases.
Speech Recognition with Attention Models
● Technical Implementation:
○ Sequence-to-Sequence Models: In a typical architecture, an encoder processes the audio input
to produce a feature representation, while a decoder generates the text output. The attention
mechanism allows the decoder to focus on different parts of the encoder output at each
decoding step.
○ Attention Weights: The model computes attention weights that indicate how much focus
should be placed on different time steps of the audio input during transcription.
Trigger Word Detection Techniques
Trigger word detection involves identifying specific keywords or phrases (e.g., "Hey Siri," "OK Google") that
activate voice assistants. Attention mechanisms enhance this process by focusing on relevant parts of the
audio stream.
● Techniques Used:
○ End-to-End Deep Learning Models: These models use attention mechanisms to analyze audio signals
and detect trigger words directly from raw audio inputs without requiring extensive feature
engineering.
● Attention Mechanism Role:
○ By applying attention to segments of audio data, the model can prioritize frames that are more likely
to contain trigger words, improving detection accuracy and reducing false positives.
● Implementation Example:
○ A convolutional neural network (CNN) combined with an attention layer can process spectrograms of
audio signals, allowing the model to focus on time-frequency representations that correlate with
trigger words.
Introduction to Transformer Networks
Transformer networks are a type of deep learning architecture designed to handle sequential data,
primarily used in natural language processing (NLP). They leverage self-attention mechanisms to process
input data in parallel, enabling efficient training and improved performance on various tasks.
Key Features:
● Parallelization: Unlike recurrent neural networks (RNNs), which process data sequentially,
transformers can process entire sequences simultaneously, significantly speeding up training times.
● Self-Attention Mechanism: This allows the model to weigh the importance of different input
elements dynamically, capturing long-range dependencies without the limitations of fixed-length
contexts.
● Encoder-Decoder Architecture: Comprising an encoder that processes the input sequence and a
decoder that generates the output sequence, transformers are highly effective for tasks like
translation and summarization.
● Layer Normalization and Residual Connections: These components help stabilize training and
facilitate the flow of gradients through the network.
Self-Attention Mechanism Explained
● Functionality:
1. Self-attention transforms an input sequence into three distinct vectors: Query (Q), Key (K), and Value (V). Each
input element is projected into these vectors through learned linear transformations.
● Process:
1. Compute Attention Scores: The attention score between each query and key is calculated using the dot product: score(q_i, k_j) = q_i · k_j.
2. Scale Scores: The scores are scaled by the square root of the dimension of the key vectors to prevent overly large values that can skew softmax outputs.
3. Apply Softmax: The scaled scores are passed through a softmax function to obtain attention weights that sum to one.
4. Weighted Sum of Values: Each value vector is weighted by the corresponding attention weight, producing a contextually enriched output for each input element (see the sketch below).
● Benefits:
1. Long-range Dependencies: Self-attention captures relationships between distant elements in sequences,
enhancing contextual understanding.
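A minimal NumPy sketch of the four steps above for a single attention head; the sequence length, dimensions, and randomly initialised projection matrices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8

X = rng.normal(size=(seq_len, d_model))          # input sequence
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v              # learned linear projections

scores = Q @ K.T                                  # 1. attention scores (dot products)
scores /= np.sqrt(d_k)                            # 2. scale by sqrt(d_k)
weights = np.exp(scores)                          # 3. softmax over each row
weights /= weights.sum(axis=-1, keepdims=True)
output = weights @ V                              # 4. weighted sum of value vectors

print(weights.round(2))   # each row sums to 1
print(output.shape)       # (4, 8): one context-enriched vector per input element
```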
Multi-Head Attention Concept
Multi-head attention extends the self-attention mechanism by running multiple attention heads in
parallel, allowing the model to capture diverse relationships within the input data.
Importance in Processing Data: Each head independently computes its own set of attention scores
and outputs. This enables the model to focus on different parts of the input sequence
simultaneously, leading to a richer representation.
Process Overview: The queries, keys, and values are linearly projected once per head with different learned matrices; scaled dot-product attention runs in each head in parallel; the head outputs are concatenated and passed through a final linear projection.
Benefits: Heads can specialize in different types of relationships (e.g., positional, syntactic, or semantic), yielding richer representations than a single attention head (see the sketch below).
Please head over here for a good visualization of the underlying processes: https://poloclub.github.io/transformer-explainer/
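A minimal sketch using PyTorch's built-in multi-head attention, assuming batch-first tensors; `embed_dim=64` and `num_heads=8` are illustrative values, not taken from the slides.

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 64)                     # (batch, sequence, embedding)

# Self-attention: the same tensor serves as query, key, and value.
output, attn_weights = mha(x, x, x)
print(output.shape)        # (2, 10, 64): concatenated and re-projected head outputs
print(attn_weights.shape)  # (2, 10, 10): attention weights (averaged over heads by default)
```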
References
Main Reference:
https://www.youtube.com/watch?v=S7oA5C43Rbc