
UE22AM343BB5

Large Language Models and Their Applications

Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
shylaja.sharath@pes.edu

Ack: Varun Bharadwaj, Teaching Assistant
UE22AM343BB5: Large Language Models and Their Applications
Overview of Deep Learning Specialization

Importance of Deep Learning:

Revolutionizing AI with advanced algorithms.

Enabling machines to learn from vast amounts of data.

Applications:

Natural Language Processing (NLP)

Speech Recognition

Time Series Forecasting

Image and Video Analysis


UE22AM343BB5: Large Language Models and Their Applications
What are Sequence Models?

Sequence models are designed for tasks involving sequential data, where the order of elements
is crucial.

Examples of sequential data include:


● Textual data
● Time series data
● Audio signals
● Video streams

Source: https://cbare.github.io/images/rnn-applications.png
UE22AM343BB5: Large Language Models and Their Applications
Characteristics of Sequential Data

Sequential Dependency: The order of data points affects predictions.


Interdependence: Elements within a sequence are not independent, unlike traditional i.i.d. data; their relationships matter for accurate modeling.

Variable Length: Sequences can vary in length.

Source: https://www.researchgate.net/publication/320032800/figure/fig2/AS:631649222025264@1527608318736/Sequential-data-can-be-ordered-in-many-ways-including-A-temperature-measured-over-time.png
UE22AM343BB5: Large Language Models and Their Applications
Key Features of Sequence Models

Memory Maintenance: Sequence models maintain a 'memory' of previous inputs, allowing them to
influence future predictions.
Adaptability: They can handle various types of sequential data, including text, audio, and time series.

Source: https://www.scaler.com/topics/images/sequence-to-sequence-model-1.webp
UE22AM343BB5: Large Language Models and Their Applications
Common Notations Used in Sequence Models

Sequence and Time-Step Notation: x⟨t⟩ denotes the element of the input sequence at time step t, and y⟨t⟩ the corresponding output at that step; Tx and Ty denote the lengths of the input and output sequences, which need not be equal.
START and END Tokens:
START Token (<s>): Indicates the beginning of a sequence, essential for models to know when to
start generating output.
END Token (</s>): Signals the completion of a sequence, allowing models to determine when to stop
generating output.

Source: https://neclab.eu/fileadmin/_processed_/6/9/csm_bison_df9f29f797.png
UE22AM343BB5: Large Language Models and Their Applications
Common Notations Used in Sequence Models

Notation in Named Entity Recognition (NER):

Input Example: x represents the input sentence (e.g., "Harry Potter and Hermione
Granger invented a new spell.").

Output Example: y is the corresponding output sequence, indicating the entities recognized in the input (e.g., a label y⟨t⟩ per word marking whether it is part of a person's name).
UE22AM343BB5: Large Language Models and Their Applications
Introduction to RNNs

Recurrent Neural Networks (RNNs) are a class of neural networks designed to process sequential data
by retaining information from previous inputs.
Basic Structure:
Input Layer: Receives sequential data.
Hidden Layer: Contains recurrent connections that allow information to be passed from one time step
to the next, maintaining a hidden state (memory).
Output Layer: Produces the final output based on the processed information.

Functionality: RNNs utilize loops in their architecture, enabling them to remember previous inputs and
make predictions based on both current and past data.
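To make this structure concrete, here is a minimal sketch of a single vanilla RNN cell in NumPy; the tanh activation, weight shapes, and toy dimensions are illustrative assumptions rather than the lecture's exact formulation.

```python
import numpy as np

def rnn_cell_forward(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One time step: combine the current input with the previous hidden state."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)   # updated hidden state ("memory")
    y_t = W_hy @ h_t + b_y                            # output for this time step (e.g., logits)
    return h_t, y_t

# Toy dimensions: 4-d inputs, 3-d hidden state, 2-d outputs.
rng = np.random.default_rng(0)
W_xh, W_hh, W_hy = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), rng.normal(size=(2, 3))
b_h, b_y = np.zeros(3), np.zeros(2)

h = np.zeros(3)                       # initial hidden state
for x in rng.normal(size=(5, 4)):     # a sequence of 5 input vectors
    h, y = rnn_cell_forward(x, h, W_xh, W_hh, W_hy, b_h, b_y)
```

The same weights are reused at every time step, which is what lets the hidden state carry information forward through the sequence.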
UE22AM343BB5: Large Language Models and Their Applications
Backpropagation Through Time (BPTT)

BPTT is an extension of the backpropagation algorithm used specifically for training RNNs. It allows the network to learn from sequences by propagating errors backward through time steps.
Importance: Enables the adjustment of weights across multiple time steps, allowing RNNs to learn temporal dependencies in sequential data. Because the same weights are shared across time steps, the gradient contributions from every time step are summed when updating them.
Process: During training, the RNN is "unfolded" over time steps, creating a feedforward network for each time step. Errors are calculated and propagated back through these steps to update the weights. Over long sequences, BPTT is prone to the vanishing and exploding gradients problems discussed later.

Source 1: https://miro.medium.com/v2/resize:fit:1204/0*SWHzEFzYRDSnc3w2
Source 2: https://stanford.edu/~shervine/teaching/cs-230/illustrations/architecture-rnn-ltr.png?9ea4417fc145b9346a3e288801dbdfdc
UE22AM343BB5: Large Language Models and Their Applications
Types of RNNs

Source: https://api.wandb.ai/files/ayush-thakur/images/projects/103390/4fc355be.png
UE22AM343BB5: Large Language Models and Their Applications
Types of RNNs

RNNs are of four different types:


● One-to-One RNN:
○ A single input is mapped to a single output.
○ This structure is used for tasks where each input corresponds directly to one
output.
○ Image Classification, where an image is classified into one category.
● One-to-Many RNN:
○ A single input generates multiple outputs.
○ Useful for generating sequences based on a single input.
○ Image Captioning, where an image is used to generate a descriptive sentence.
UE22AM343BB5: Large Language Models and Their Applications
Types of RNNs

● Many-to-One RNN:
○ Multiple inputs are processed to produce a single output.
○ This structure aggregates information from a sequence into one final prediction.
○ Sentiment Analysis, where a sequence of words is analyzed to determine the
sentiment (positive or negative).
● Many-to-Many RNN:
○ Multiple inputs are mapped to multiple outputs.
○ Each input in the sequence can produce an output, allowing for complex
transformations.
○ Machine Translation, where a sentence in one language is translated into another
language.
UE22AM343BB5: Large Language Models and Their Applications
Language Models

Role in Sequence Generation: Language models predict the next word in a sequence based on previous words,
making them essential for tasks like text generation and machine translation.

Functionality: By leveraging the hidden state in RNNs, language models can capture contextual relationships between
words, improving coherence and relevance in generated text.

Applications: Used in chatbots, automated content creation, and real-time translation services.

Source: https://www.researchgate.net/publication/387105119/figure/fig1/AS:11431281298347911@1734405295120/a-Next-token-prediction-The-LLM-is-trained-to-predict-thenext-token-in-corpus.png
UE22AM343BB5: Large Language Models and Their Applications
Sampling Novel Sequences

Sampling novel sequences involves generating new data points from a trained model,
particularly in tasks like text generation, music composition, or image synthesis.

Techniques:

● Random Sampling: Selecting the next element based on a probability distribution derived from the model's output. This can lead to diverse outputs but may also produce incoherent sequences.
● Top-k Sampling: Instead of sampling from the entire output distribution, only the top-k most probable next elements are considered, promoting more coherent results.
● Temperature Sampling: Adjusts the probability distribution's sharpness. A higher temperature leads to more randomness, while a lower temperature makes the model's predictions more deterministic.
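A minimal sketch combining these ideas over a toy vocabulary; the vocabulary, logits, and parameter values are invented for illustration.

```python
import numpy as np

def sample_next(logits, vocab, temperature=1.0, top_k=None, rng=None):
    """Sample the next token from (optionally truncated and re-scaled) model logits."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature      # <1 sharpens, >1 flattens
    if top_k is not None:
        cutoff = np.sort(logits)[-top_k]                        # keep only the k largest scores
        logits = np.where(logits >= cutoff, logits, -np.inf)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(vocab, p=probs)

vocab = ["the", "cat", "sat", "on", "mat"]
logits = [2.0, 1.5, 0.3, 0.1, -1.0]                 # hypothetical model outputs
print(sample_next(logits, vocab, temperature=0.7, top_k=3))
```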
UE22AM343BB5: Large Language Models and Their Applications
Sampling Novel Sequences

Examples:

Text Generation: Using an RNN to generate sentences by sampling words based on learned
probabilities.

Music Composition: Generating new melodies by sampling notes from a trained model.

Source: https://www.researchgate.net/publication/371808605/figure/fig4/AS:11431281201908100@1698662784771/Schematic-of-the-novel-sampling-approach-based-on-image-segmentation-technology.png
UE22AM343BB5: Large Language Models and Their Applications
Vanishing Gradients Problem

The vanishing gradients problem occurs when gradients used in backpropagation become exceedingly small, effectively preventing weights from updating and halting learning. This is particularly problematic in RNNs with long sequences.

Causes: The repeated multiplication of gradients during backpropagation through many layers can lead to exponential decay, resulting in negligible updates for earlier layers.

source: https://miro.medium.com/v2/resize:fit:1280/1*yRetLBtXzHj9zsZ-n_BqDg.gif
UE22AM343BB5: Large Language Models and Their Applications
Vanishing Gradients Problem

Solutions:

● Long Short-Term Memory (LSTM): Introduces memory cells and gating mechanisms that
help retain information over long periods and mitigate the vanishing gradient problem.
● Gated Recurrent Units (GRU): A simpler alternative to LSTMs that combines input and
forget gates to manage information flow effectively.
● Gradient Clipping: A technique where gradients are capped at a certain threshold to prevent them from becoming too large (it primarily addresses the exploding gradients problem rather than vanishing gradients).
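A minimal sketch of gradient clipping by global norm (the idea behind utilities such as PyTorch's clip_grad_norm_); the gradient values below are illustrative.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / (total_norm + 1e-12)) for g in grads]
    return grads

grads = [np.array([3.0, 4.0]), np.array([12.0])]      # global norm = 13
clipped = clip_by_global_norm(grads, max_norm=5.0)    # rescaled so the norm becomes 5
```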

Source: https://analyticsindiamag.com/wp-content/uploads/2021/11/gradients.png
UE22AM343BB5: Large Language Models and Their Applications
Concept of Deep RNNs

Deep Recurrent Neural Networks (Deep RNNs) extend the traditional RNN architecture by stacking
multiple layers of recurrent units. This allows the network to learn more complex representations
and capture intricate patterns in sequential data.

Advantages over Shallow Networks:


● Hierarchical Feature Learning: Deep RNNs can learn hierarchical representations of data,
enabling them to capture both low-level and high-level features from sequences.
● Improved Performance: With multiple layers, Deep RNNs can model complex dependencies and
interactions within the data, leading to better performance on tasks such as language modeling
and time series prediction.
● Enhanced Capacity: More layers increase the model's capacity to learn from large datasets,
improving generalization on unseen data.

Mechanism: In Deep RNNs, the hidden state information is passed not only to the next time step of
the current layer but also to the current time step of the next layer, facilitating richer temporal
dynamics.
UE22AM343BB5: Large Language Models and Their Applications
Introduction to Word Embeddings

Word embeddings are dense vector representations of words in a continuous vector space, where similar
words are positioned close to each other based on their semantic meanings and contextual usage.

Importance in NLP:
● Semantic Understanding: Word embeddings enable models to understand the meaning of words based on
their context, capturing nuanced relationships that traditional methods like one-hot encoding cannot.
● Dimensionality Reduction: Compared to one-hot encoding, which results in high-dimensional sparse
vectors, word embeddings provide a lower-dimensional representation that is computationally efficient.
● Facilitates Machine Learning: By converting words into numerical vectors, embeddings allow machine
learning algorithms to process text data effectively.

Key Concepts:
● Distributional Hypothesis: This principle states that words with similar meanings tend to occur in similar
contexts, forming the basis for training word embeddings.
● Training Methods: Common approaches include Word2Vec, a predictive model that uses neural networks to learn word representations based on surrounding context, and GloVe, which derives embeddings from global co-occurrence statistics (both are detailed in later slides).
UE22AM343BB5: Large Language Models and Their Applications
Using Word Embeddings

Techniques for Implementation:

● Word2Vec: Developed by Google, Word2Vec employs two architectures: Continuous Bag of Words (CBOW) and Skip-Gram.

● GloVe: Constructs embeddings by analyzing global word co-occurrence statistics from a corpus. It creates a co-occurrence matrix and derives embeddings from it.

Pre-trained Embeddings:

● Utilizing pre-trained models (e.g., Google News vectors) can save time and resources. These
embeddings can be fine-tuned for specific tasks or used directly in various applications.

Source : https://kavita-ganesan.com/wp-content/uploads/skipgram-vs-cbow-continuous-bag-of-words-word2vec-word-representation-1024x538.png
UE22AM343BB5: Large Language Models and Their Applications
Using Word Embeddings

Applications of Word Embeddings:

Used in sentiment analysis, machine translation, text classification, and more, enhancing model performance by providing rich semantic representations.

Source: https://thinkingneuron.com/wp-content/uploads/2021/12/image-38.png
UE22AM343BB5: Large Language Models and Their Applications
Key Properties of Word Embeddings

Dimensionality:

● Word embeddings are typically represented in a lower-dimensional space (e.g., 50 to 300 dimensions), which reduces the complexity of the data while preserving semantic relationships.
● Lower dimensionality helps mitigate the curse of dimensionality, allowing models to generalize better and reducing computational costs compared to high-dimensional sparse representations like one-hot encoding.

Semantic Relationships:
The distance between word vectors in the embedding space reflects their semantic similarity. Words
with similar meanings are located closer together, while those with different meanings are further
apart.

Example: In a well-trained embedding space, the vectors for "king" and "queen" are close together,
while "king" and "car" are farther apart.
UE22AM343BB5: Large Language Models and Their Applications
Key Properties of Word Embeddings

Arithmetic Properties:

Vector Arithmetic: Word embeddings exhibit interesting arithmetic properties that allow for
semantic manipulation. For instance, the relationship can be expressed as:

king − man + woman ≈ queen

This property illustrates how embeddings can capture gender relationships and other analogies.
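A minimal sketch of this analogy with toy 3-dimensional vectors invented for illustration (real embeddings use 50-300 dimensions learned from data).

```python
import numpy as np

emb = {                                 # hypothetical embeddings, hand-crafted for the example
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.2, 0.9, 0.1]),
    "woman": np.array([0.2, 0.1, 0.9]),
    "car":   np.array([0.1, 0.5, 0.2]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = emb["king"] - emb["man"] + emb["woman"]       # vector arithmetic on word embeddings
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(emb[w], target))
print(best)   # "queen" is the closest remaining word to the target vector
```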
UE22AM343BB5: Large Language Models and Their Applications
Structure of the Embedding Matrix:

An embedding matrix is a two-dimensional array where each row corresponds to a unique word in
the vocabulary and each column represents a dimension of the embedding.

Usage:
● Input to Models: The embedding matrix is used as input to machine learning models, allowing
them to process text data as numerical vectors.
● Training Process: During training, the embeddings are adjusted based on the context in which
words appear, optimizing their positions in the vector space to reflect semantic relationships
accurately.
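A minimal sketch of an embedding matrix lookup; the vocabulary and embedding size are illustrative.

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2}                       # word -> row index
E = np.random.default_rng(0).normal(size=(len(vocab), 4))    # 3 words x 4 embedding dimensions

word_vec = E[vocab["cat"]]            # looking up a word is just selecting its row

one_hot = np.zeros(len(vocab))        # equivalently: one-hot vector times the matrix
one_hot[vocab["cat"]] = 1.0
assert np.allclose(one_hot @ E, word_vec)
```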
UE22AM343BB5: Large Language Models and Their Applications
Techniques for Learning Word Embeddings

Overview of Methods:

Word embeddings can be learned using two primary approaches: frequency-based and prediction-based methods.

Frequency-based Methods: These methods derive embeddings from the statistical properties of words in a corpus, focusing on word co-occurrences.

Prediction-based Methods: These methods leverage neural networks to predict words based on their context, resulting in more nuanced embeddings.

Source: https://miro.medium.com/v2/resize:fit:1400/1*Pub6nXI7Wg4wbNQvEsyd4Q.png
UE22AM343BB5: Large Language Models and Their Applications
Word2Vec Model

Developed by Google, Word2Vec is a predictive model that uses a neural network to learn word representations.

Architectures: Word2Vec uses one of two architectures, CBOW or Skip-Gram, described below.

Continuous Bag of Words (CBOW): Predicts the target word from surrounding context words. Example: For the sentence "The cat
sat on the mat," if "sat" is the target, CBOW uses "The," "cat," "on," "the," and "mat" as input.

Skip-Gram: Predicts surrounding context words given a target word. Example: Given the target word "sat," it predicts "The," "cat,"
"on," "the," and "mat."

Source: https://miro.medium.com/v2/resize:fit:1400/1*xC6wfTU_zpUlpRlXs5NZ4w.png
UE22AM343BB5: Large Language Models and Their Applications
Negative Sampling Technique

Negative sampling is used to improve the efficiency of training models like Word2Vec by focusing on meaningful
comparisons during optimization.
Process: Instead of calculating gradients for all possible output words, negative sampling selects a small subset of
negative examples that are not true associations with the input. For example, when training the model with the
target word "king," it might learn to associate it with "queen" while distinguishing it from unrelated words like
"table."
Benefits:

● Reduces computational complexity by limiting the number of comparisons needed during training.
● Simplifies the training objective to a binary classification task, where the model predicts whether pairs of words
are likely to co-occur.
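A minimal sketch of the negative-sampling objective as binary classification over word pairs; the vectors and number of negatives are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_target, v_context, v_negatives):
    """-log sigma(t·c) - sum_k log sigma(-t·n_k), for k sampled negative words."""
    pos = -np.log(sigmoid(v_target @ v_context))                           # pull the true pair together
    neg = -sum(np.log(sigmoid(-(v_target @ v_n))) for v_n in v_negatives)  # push negatives apart
    return pos + neg

rng = np.random.default_rng(0)
t, c = rng.normal(size=10), rng.normal(size=10)        # e.g., "king" and "queen"
negatives = [rng.normal(size=10) for _ in range(5)]    # e.g., "table", "rock", ...
print(negative_sampling_loss(t, c, negatives))
```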

Source: https://i.sstatic.net/7V7EO.png
UE22AM343BB5: Large Language Models and Their Applications
Negative Sampling in Word2Vec

Source: https://miro.medium.com/v2/resize:fit:2000/format:webp/1*YnI8R6DWKd8Zw1mTeF-bSA.png
UE22AM343BB5: Large Language Models and Their Applications
GloVe Word Vectors

GloVe is a count-based model that generates word embeddings by leveraging global statistical
information from a corpus. It creates a co-occurrence matrix that captures how often words appear
together in a given context.
Comparison with Word2Vec: While Word2Vec focuses on local context (predicting surrounding
words), GloVe utilizes global statistics to derive embeddings.
Key Features: Captures both global and local context information, making it effective for various
NLP tasks.
Applications: Similar to Word2Vec, GloVe is used in sentiment analysis, recommendation systems,
and information retrieval.
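A minimal sketch of the co-occurrence counting step GloVe starts from, using a symmetric window of size 2 over a toy corpus (the full GloVe objective, including its weighting function, is not shown).

```python
from collections import Counter

corpus = ["the", "cat", "sat", "on", "the", "mat"]
window = 2
cooc = Counter()
for i, word in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            cooc[(word, corpus[j])] += 1      # count context words within the window

print(cooc[("the", "cat")])   # how often "cat" appears within 2 words of "the"
```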
UE22AM343BB5: Large Language Models and Their Applications
Word2Vec vs GloVe

Word2Vec:
● Transforms the unlabelled raw corpus into labelled (target, context) training pairs.
● Requires little memory.
● The mapping between a target word and its context words implicitly embeds sub-linear relationships into the vector space of words.
● The sub-linear relationships are not explicitly defined.
● Computationally expensive.

GloVe:
● Enforces the word vectors to capture sub-linear relationships in the vector space.
● Adds some more practical meaning into word vectors.
● Gives lower weight to highly frequent word pairs, so that meaningless stop-words do not dominate the training process.
● Takes a lot of memory to store the co-occurrence matrix.
● Time consuming, as the co-occurrence matrix must be reconstructed when hyper-parameters change.
UE22AM343BB5: Large Language Models and Their Applications
Introduction to Sentiment Analysis

Sentiment analysis, also known as opinion mining, is a Natural Language Processing (NLP) technique
that involves identifying and extracting subjective information from text data. It determines the
emotional tone behind a series of words, classifying sentiments as positive, negative, or neutral.

Techniques Used:

● Lexicon-Based Approach: Utilizes predefined lists of words (lexicons) with associated sentiment scores. The overall sentiment is calculated by summing the scores of words in a text (a minimal scoring sketch follows this list).
● Machine Learning-Based Approach: Employs algorithms trained on labeled datasets to classify sentiment. Common algorithms include Naive Bayes, Support Vector Machines (SVM), and Logistic Regression.
● Deep Learning-Based Approach: Uses neural networks (e.g., RNNs, LSTMs, Transformers like BERT) to capture complex patterns and context in text data.
  ○ Advantages: High accuracy in understanding context and nuances.
  ○ Limitations: Computationally intensive and requires significant resources.
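A minimal sketch of the lexicon-based scoring mentioned above; the tiny lexicon is invented for illustration.

```python
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "terrible": -2, "hate": -2}

def lexicon_sentiment(text):
    """Sum per-word sentiment scores and map the total to a class label."""
    score = sum(LEXICON.get(word, 0) for word in text.lower().split())
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(lexicon_sentiment("I love this great movie"))   # positive
print(lexicon_sentiment("what a terrible plot"))      # negative
```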
UE22AM343BB5: Large Language Models and Their Applications
Debiasing Word Embeddings

Word embeddings can inadvertently capture and perpetuate biases present in training data.
Debiasing is crucial to ensure fairness and ethical considerations in NLP applications.

Methods of Debiasing:

● Hard Debiasing: Involves projecting biased word vectors onto a subspace that removes gender-
related information while retaining semantic meaning.
● Soft Debiasing: A more flexible approach that reduces bias while allowing some degree of
information retention. This method adjusts embeddings based on their proximity to biased
terms.
● FineDeb Framework: A two-phase debiasing process that first modifies embeddings learned by
the language model and then fine-tunes the model on the language modeling objective. This
approach has shown effectiveness in reducing bias while maintaining performance.
UE22AM343BB5: Large Language Models and Their Applications
Debiasing Word Embeddings

Source: https://arxiv.org/html/2402.11512v2/x1.png
UE22AM343BB5: Large Language Models and Their Applications
Basic Attention Models

Attention mechanisms allow models to focus on specific parts of the input sequence, enhancing
performance in tasks such as machine translation and text summarization by assigning different
importance levels to various input elements.

Significance:

● Contextual Relevance: Attention helps models weigh the importance of different words in a
sentence, similar to how humans focus on relevant information while ignoring distractions.
● Improved Performance: By enabling models to consider the entire context rather than relying
solely on fixed-length representations, attention mechanisms significantly enhance the ability
to capture long-range dependencies.
UE22AM343BB5: Large Language Models and Their Applications
Basic Attention Models

Types of Attention:

● Scaled Dot-Product Attention: Computes attention weights using the dot product of query
and key vectors, scaled by the square root of their dimension.
● Multi-Head Attention: Allows the model to attend to different parts of the input
simultaneously, capturing various relationships and interactions.
UE22AM343BB5: Large Language Models and Their Applications
Attention Mechanism Illustrated
UE22AM343BB5: Large Language Models and Their Applications
Picking the Most Likely Sentence

Attention Scoring: Each word in a sequence is assigned an attention score based on its
relevance to the context. The model computes these scores using learned weights from
the training phase.

Sentence Generation Process: The model generates sentences by selecting words based on their computed probabilities. It uses attention scores to prioritize words that are more relevant given the context provided by previous words.
Examples:

In a translation task, when translating "The cat sat on the mat," the model will assign
higher attention scores to "cat" and "mat" when generating corresponding words in another
language.
UE22AM343BB5: Large Language Models and Their Applications
Beam Search Strategy

Beam search is an optimization technique used in NLP for generating sequences. It maintains
multiple hypotheses at each step of generation, allowing for a more comprehensive search
compared to greedy algorithms.

Key Components:

● Beam Width: This hyperparameter determines how many sequences (hypotheses) are kept at
each step. A larger beam width results in a broader search but increases computational costs.
● Process: The algorithm begins with an initial state and expands all possible next states. It then
selects the top 'N' states based on their probabilities for further exploration.
● Refinements: Adjusting beam width can balance between computational efficiency and output
quality. Techniques like length normalization can also be applied to prevent bias towards
shorter sequences.
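A minimal sketch of beam search over a hypothetical next-token distribution; next_probs is a stand-in for a trained model and the tiny vocabulary is invented for illustration.

```python
import math

def next_probs(sequence):
    """Toy conditional distribution over the next token given the tokens so far."""
    table = {
        (): {"the": 0.6, "a": 0.4},
        ("the",): {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
        ("a",):   {"cat": 0.4, "dog": 0.4, "</s>": 0.2},
    }
    return table.get(tuple(sequence), {"</s>": 1.0})

def beam_search(beam_width=2, max_len=3):
    beams = [([], 0.0)]                                    # (tokens, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, logp in beams:
            if tokens and tokens[-1] == "</s>":            # finished hypotheses are kept as-is
                candidates.append((tokens, logp))
                continue
            for tok, p in next_probs(tokens).items():      # expand every surviving hypothesis
                candidates.append((tokens + [tok], logp + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]  # keep top-N
    return beams

print(beam_search(beam_width=2))   # the two most probable sentences under the toy model
```

With beam_width=1 this reduces to greedy decoding; a larger width keeps more hypotheses alive at the cost of more computation.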
UE22AM343BB5: Large Language Models and Their Applications
Beam Search Strategy Illustrated

Source: https://miro.medium.com/v2/resize:fit:770/1*tEjhWqUgjX37VnT7gJN-4g.png
UE22AM343BB5: Large Language Models and Their Applications
Error Analysis in Beam Search

Common Pitfalls:
● Suboptimal Beam Width Selection: Choosing a beam width that is too small may lead to missing optimal sequences, while one that is too large may increase the computational burden without significant gains.
● Pruning Issues: Incorrect pruning strategies can lead to discarding potentially good candidates early in the search process, resulting in suboptimal outputs.
● Lack of Diversity: If all hypotheses are too similar, beam search may converge on a single type of output, limiting creativity and variability in generated sequences.

Mitigation Strategies: Implementing techniques such as diverse beam search can help maintain diversity among generated sequences while still focusing on high-probability candidates.
UE22AM343BB5: Large Language Models and Their Applications
Attention Model Intuition

Attention mechanisms enable neural networks to focus selectively on specific parts of the input
data, allowing for more efficient processing and better performance in tasks where context is
crucial.

Conceptual Overview:

Human Analogy: Just as humans can focus on relevant information while ignoring distractions,
attention mechanisms allow models to weigh the importance of different input elements
dynamically.
UE22AM343BB5: Large Language Models and Their Applications
Attention Model Intuition

How It Works:

The attention mechanism computes a set of attention scores, which determine how much focus to
place on each part of the input. These scores are derived from the relationships between the input
elements.

Mathematical Representation:
● Given an input sequence represented as vectors, the attention weight αij for each pair of input elements is obtained by normalizing compatibility scores with a softmax:
  αij = exp(score(qi, kj)) / Σj' exp(score(qi, kj'))
● where score(qi, kj) is a function (e.g., dot product) that measures the compatibility between query qi and key kj.
UE22AM343BB5: Large Language Models and Their Applications
Attention Model Intuition

Source: https://www.youtube.com/watch?v=S7oA5C43Rbc
UE22AM343BB5: Large Language Models and Their Applications
Detailed Attention Model Explanation

Components Involved:

1. Queries (Q): Represent the current state or position in the sequence that needs to focus
on other parts.
2. Keys (K): Represent all positions in the input sequence that can be attended to.
3. Values (V): Contain the actual information that corresponds to each key.

Scaled Dot-Product Attention:

This is a common technique used in attention models where the dot product of queries and keys is
computed to derive attention scores. The scores are then scaled by the square root of the
dimension of the key vectors to stabilize gradients during training.
UE22AM343BB5: Large Language Models and Their Applications
Detailed Attention Model Explanation

Multi-Head Attention:

Extends scaled dot-product attention by allowing multiple sets of queries, keys, and values. Each
head learns different aspects of the input data, providing richer representations.

Attention Mechanism Workflow:

1. Compute attention scores using queries and keys.
2. Apply softmax to obtain normalized weights.
3. Multiply the weights with the value vectors to get the final output representation.
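A minimal sketch of this workflow as scaled dot-product attention in NumPy; the matrix shapes and random inputs are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # step 1: query-key scores, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)        # step 2: normalized attention weights
    return weights @ V, weights               # step 3: weighted sum of the value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # 3 query positions, d_k = 8
K = rng.normal(size=(5, 8))   # 5 key positions
V = rng.normal(size=(5, 8))   # one value vector per key
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)   # (3, 8) (3, 5)
```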
UE22AM343BB5: Large Language Models and Their Applications
Speech Recognition with Attention Models

Attention mechanisms in speech recognition allow models to focus on specific segments of audio input,
improving the accuracy of transcribing spoken language into text.
● Use Cases:
○ Real-Time Transcription: Systems like Google Voice and virtual assistants (e.g., Siri, Alexa) utilize
attention mechanisms to enhance the accuracy of real-time speech-to-text conversion by
focusing on relevant audio features.
○ Contextual Understanding: Attention helps models determine which parts of the audio signal
correspond to specific words, enabling better handling of homophones and context-dependent
phrases.
UE22AM343BB5: Large Language Models and Their Applications
Speech Recognition with Attention Models

● Technical Implementation:
○ Sequence-to-Sequence Models: In a typical architecture, an encoder processes the audio input
to produce a feature representation, while a decoder generates the text output. The attention
mechanism allows the decoder to focus on different parts of the encoder output at each
decoding step.
○ Attention Weights: The model computes attention weights that indicate how much focus
should be placed on different time steps of the audio input during transcription.
UE22AM343BB5: Large Language Models and Their Applications
Trigger Word Detection Techniques

Trigger word detection involves identifying specific keywords or phrases (e.g., "Hey Siri," "OK Google") that
activate voice assistants. Attention mechanisms enhance this process by focusing on relevant parts of the
audio stream.
● Techniques Used:
○ End-to-End Deep Learning Models: These models use attention mechanisms to analyze audio signals
and detect trigger words directly from raw audio inputs without requiring extensive feature
engineering.
● Attention Mechanism Role:
○ By applying attention to segments of audio data, the model can prioritize frames that are more likely
to contain trigger words, improving detection accuracy and reducing false positives.
● Implementation Example:
○ A convolutional neural network (CNN) combined with an attention layer can process spectrograms of
audio signals, allowing the model to focus on time-frequency representations that correlate with
trigger words.
UE22AM343BB5: Large Language Models and Their Applications
Introduction to Transformer Networks

Transformer networks are a type of deep learning architecture designed to handle sequential data,
primarily used in natural language processing (NLP). They leverage self-attention mechanisms to process
input data in parallel, enabling efficient training and improved performance on various tasks.
Key Features:

● Parallelization: Unlike recurrent neural networks (RNNs), which process data sequentially,
transformers can process entire sequences simultaneously, significantly speeding up training times.
● Self-Attention Mechanism: This allows the model to weigh the importance of different input
elements dynamically, capturing long-range dependencies without the limitations of fixed-length
contexts.
● Encoder-Decoder Architecture: Comprising an encoder that processes the input sequence and a
decoder that generates the output sequence, transformers are highly effective for tasks like
translation and summarization.
● Layer Normalization and Residual Connections: These components help stabilize training and
facilitate the flow of gradients through the network.
UE22AM343BB5: Large Language Models and Their Applications
Self-Attention Mechanism Explained

● Functionality:
Self-attention transforms an input sequence into three distinct vectors: Query (Q), Key (K), and Value (V). Each input element is projected into these vectors through learned linear transformations.
● Process:
1. Compute Attention Scores: The attention score between each query and key is calculated using the dot product, score(qi, kj) = qi · kj.
2. Scale Scores: The scores are scaled by the square root of the dimension of the key vectors to prevent overly large values that can skew softmax outputs.
3. Apply Softmax: The scaled scores are passed through a softmax function to obtain attention weights that sum to one.
4. Weighted Sum of Values: Each value vector is weighted by the corresponding attention weight, producing a contextually enriched output for each input element.
● Benefits:
Long-range Dependencies: Self-attention captures relationships between distant elements in sequences, enhancing contextual understanding.
UE22AM343BB5: Large Language Models and Their Applications
Multi-Head Attention Concept

Multi-head attention extends the self-attention mechanism by running multiple attention heads in
parallel, allowing the model to capture diverse relationships within the input data.
Importance in Processing Data: Each head independently computes its own set of attention scores
and outputs. This enables the model to focus on different parts of the input sequence
simultaneously, leading to a richer representation.
Process Overview:

● Input Embeddings: Input text is converted into embeddings.


● Linear Projections: The embeddings are projected into multiple sets of Q, K, and V vectors for
each attention head.
● Scaled Dot-Product Attention for Each Head: Each head performs its own scaled dot-product
attention calculation.
● Concatenation and Linear Transformation: Outputs from all heads are concatenated and
passed through a final linear layer to produce a unified output.
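A minimal sketch of this process in NumPy, with the per-head projections kept in plain lists; the number of heads, model size, and random weights are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """X: (seq_len, d_model); W_q/W_k/W_v are lists with one projection matrix per head."""
    heads = [attention(X @ Wq, X @ Wk, X @ Wv) for Wq, Wk, Wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o     # concatenate heads, then final linear layer

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 16, 2
d_head = d_model // num_heads
X = rng.normal(size=(seq_len, d_model))                                  # input embeddings
W_q = [rng.normal(size=(d_model, d_head)) for _ in range(num_heads)]
W_k = [rng.normal(size=(d_model, d_head)) for _ in range(num_heads)]
W_v = [rng.normal(size=(d_model, d_head)) for _ in range(num_heads)]
W_o = rng.normal(size=(num_heads * d_head, d_model))
print(multi_head_attention(X, W_q, W_k, W_v, W_o).shape)   # (4, 16)
```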
UE22AM343BB5: Large Language Models and Their Applications
Multi-Head Attention Concept

Benefits:

● Diverse Representations: Captures various aspects of relationships in the data, enhancing model robustness.
● Improved Context Understanding: By attending to multiple parts simultaneously, multi-head attention enables deeper contextual insights.
UE22AM343BB5: Large Language Models and Their Applications
Multi-Head Attention

Please head over here for a good visualization of the underlying processes: https://poloclub.github.io/transformer-explainer/
UE22AM343BB5: Large Language Models and Their Applications
References

Main Reference:
https://www.youtube.com/watch?v=S7oA5C43Rbc
UE22AM343BB5
Large Language Models and Their Applications

Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
shylaja.sharath@pes.edu

Ack: Varun Bharadwaj, Teaching Assistant
