Embedding
Dr. Shylaja S S
Director of Cloud Computing & Big Data (CCBD), Centre
for Data Sciences & Applied Machine Learning (CDSAML)
Department of Computer Science and Engineering
shylaja.sharath@pes.edu
Sequence models are designed for tasks involving sequential data, where the order of elements is crucial.
Applications include speech recognition.
Source: https://cbare.github.io/images/rnn-applications.png
Characteristics of Sequential Data
Source: https://www.researchgate.net/publication/320032800/figure/fig2/AS:631649222025264@1527608318736/Sequential-data-can-be-ordered-in-many-ways-including-A-temperature-measured-over-time.png
Key Features of Sequence Models
Memory Maintenance: Sequence models maintain a 'memory' of previous inputs, allowing them to
influence future predictions.
Adaptability: They can handle various types of sequential data, including text, audio, and time series.
Source: https://www.scaler.com/topics/images/sequence-to-sequence-model-1.webp
Common Notations Used in Sequence Models
Sequence Indexing: x^<t> denotes the element of the input sequence at time step t and y^<t> the corresponding output; T_x and T_y denote the lengths of the input and output sequences.
START and END Tokens:
START Token (<s>): Indicates the beginning of a sequence, essential for models to know when to
start generating output.
END Token (</s>): Signals the completion of a sequence, allowing models to determine when to stop
generating output.
Source: https://neclab.eu/fileadmin/_processed_/6/9/csm_bison_df9f29f797.png
Common Notations Used in Sequence Models
Input Example: x represents the input sentence (e.g., "Harry Potter and Hermione
Granger invented a new spell.").
Recurrent Neural Networks (RNNs) are a class of neural networks designed to process sequential data
by retaining information from previous inputs.
Basic Structure:
Input Layer: Receives sequential data.
Hidden Layer: Contains recurrent connections that allow information to be passed from one time step
to the next, maintaining a hidden state (memory).
Output Layer: Produces the final output based on the processed information.
Functionality: RNNs utilize loops in their architecture, enabling them to remember previous inputs and
make predictions based on both current and past data.
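To make the recurrence concrete, here is a minimal NumPy sketch (not the course's reference code) of one forward pass of a vanilla RNN; the tanh activation, layer sizes, and random parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 4, 8, 5

# Learned parameters (randomly initialised here just so the sketch runs).
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden (recurrent loop)
b_h = np.zeros(hidden_size)

x = rng.normal(size=(seq_len, input_size))  # one input sequence
h = np.zeros(hidden_size)                   # initial hidden state ("memory")

for t in range(seq_len):
    # The new hidden state depends on the current input AND the previous hidden state.
    h = np.tanh(W_xh @ x[t] + W_hh @ h + b_h)

print("final hidden state:", h.round(3))
```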
Backpropagation Through Time (BPTT)
BPTT trains an RNN by unrolling it across the time steps of a sequence and propagating the gradient of the loss backward through every step, so the shared weights are updated using contributions from all time steps.
Source1: https://miro.medium.com/v2/resize:fit:1204/0*SWHzEFzYRDSnc3w2
Source2: https://stanford.edu/~shervine/teaching/cs-230/illustrations/architecture-rnn-ltr.png?9ea4417fc145b9346a3e288801dbdfdc
Types of RNNs
Source: https://api.wandb.ai/files/ayush-thakur/images/projects/103390/4fc355be.png
Types of RNNs
● Many-to-One RNN:
○ Multiple inputs are processed to produce a single output.
○ This structure aggregates information from a sequence into one final prediction.
○ Example: Sentiment Analysis, where a sequence of words is analyzed to determine the
sentiment (positive or negative).
● Many-to-Many RNN:
○ Multiple inputs are mapped to multiple outputs.
○ Each input in the sequence can produce an output, allowing for complex
transformations.
○ Example: Machine Translation, where a sentence in one language is translated into another
language (see the sketch below).
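A minimal sketch of the difference, assuming PyTorch is available; the layer sizes and the linear output heads are illustrative, not part of the slides.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=16, hidden_size=32, batch_first=True)
x = torch.randn(2, 10, 16)          # (batch, time steps, features)
outputs, h_n = rnn(x)               # outputs: (2, 10, 32) = one hidden state per time step

# Many-to-one (e.g., sentiment analysis): keep only the last time step.
sentiment_logits = nn.Linear(32, 2)(outputs[:, -1, :])      # shape (2, 2)

# Many-to-many (e.g., per-step outputs): use every time step.
per_step_logits = nn.Linear(32, 1000)(outputs)              # shape (2, 10, 1000)
print(sentiment_logits.shape, per_step_logits.shape)
```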
Language Models
Role in Sequence Generation: Language models predict the next word in a sequence based on previous words,
making them essential for tasks like text generation and machine translation.
Functionality: By leveraging the hidden state in RNNs, language models can capture contextual relationships between
words, improving coherence and relevance in generated text.
Source: https://www.researchgate.net/publication/387105119/figure/fig1/AS:11431281298347911@1734405295120/a-Next-token-
prediction-The-LLM-is-trained-to-predict-thenext-token-in-corpus.png
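The next-word-prediction view corresponds to the standard chain-rule factorization of a sequence probability; this LaTeX rendering is the usual textbook formula, not an equation taken from the slides.

```latex
% A language model factorises the probability of a sequence into a
% product of conditional next-token probabilities.
P(w_1, w_2, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
```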
Sampling Novel Sequences
Sampling novel sequences involves generating new data points from a trained model,
particularly in tasks like text generation, music composition, or image synthesis.
Techniques: Sample the next word from the model's output (softmax) distribution, feed it back as the input at the next time step, and repeat until an END token or a length limit is reached (a minimal sketch follows below).
Examples:
Text Generation: Using an RNN to generate sentences by sampling words based on learned
probabilities.
Music Composition: Generating new melodies by sampling notes from a trained model.
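A minimal sketch of this sampling loop; `tiny_model`, the toy vocabulary, and the uniform distribution are stand-ins for a trained network's softmax output, not anything defined in the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<s>", "the", "cat", "sat", "on", "mat", "</s>"]

def tiny_model(tokens):
    """Stand-in for an RNN's softmax output: a uniform distribution over the vocab."""
    return np.full(len(vocab), 1.0 / len(vocab))

def sample_sequence(max_len=20):
    tokens = ["<s>"]
    while len(tokens) < max_len:
        probs = tiny_model(tokens)                 # P(next word | words so far)
        next_word = rng.choice(vocab, p=probs)     # sample rather than take the argmax
        tokens.append(next_word)
        if next_word == "</s>":                    # END token: stop generating
            break
    return tokens

print(sample_sequence())
```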
Source: https://www.researchgate.net/publication/371808605/figure/fig4/AS:11431281201908100@1698662784771/Schematic-of-the-novel-sampling-approach-based-on-image-segmentation-technology.png
Vanishing Gradients Problem
When gradients are backpropagated through many time steps, repeated multiplication by small factors shrinks them toward zero, so early time steps receive almost no learning signal and long-range dependencies become hard to learn.
Source: https://miro.medium.com/v2/resize:fit:1280/1*yRetLBtXzHj9zsZ-n_BqDg.gif
Vanishing Gradients Problem
Solutions:
● Long Short-Term Memory (LSTM): Introduces memory cells and gating mechanisms that
help retain information over long periods and mitigate the vanishing gradient problem.
● Gated Recurrent Units (GRU): A simpler alternative to LSTMs that merges the input and
forget gates into a single update gate to manage information flow effectively.
● Gradient Clipping: A technique where gradients are capped at a threshold to prevent them
from growing too large; it addresses the exploding gradients problem rather than the
vanishing gradients problem (see the snippet below).
Source: https://analyticsindiamag.com/wp-content/uploads/2021/11/gradients.png
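A minimal sketch of gradient clipping inside a PyTorch training step; the model, data, and loss are placeholders, and only the `clip_grad_norm_` call is the point being illustrated.

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 12, 8)
output, _ = model(x)
loss = output.pow(2).mean()          # placeholder loss just to produce gradients

optimizer.zero_grad()
loss.backward()
# Rescale gradients so their global norm is at most 1.0 (targets exploding gradients).
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```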
Concept of Deep RNNs
Deep Recurrent Neural Networks (Deep RNNs) extend the traditional RNN architecture by stacking
multiple layers of recurrent units. This allows the network to learn more complex representations
and capture intricate patterns in sequential data.
Mechanism: In Deep RNNs, the hidden state information is passed not only to the next time step of
the current layer but also to the current time step of the next layer, facilitating richer temporal
dynamics.
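A minimal sketch of stacking recurrent layers, assuming PyTorch; `num_layers=3` and the tensor sizes are illustrative choices, not values from the slides.

```python
import torch
import torch.nn as nn

# num_layers stacks recurrent layers: each layer's hidden states feed the layer
# above at the same time step, as described for Deep RNNs.
deep_rnn = nn.LSTM(input_size=16, hidden_size=32, num_layers=3, batch_first=True)
x = torch.randn(2, 10, 16)
outputs, (h_n, c_n) = deep_rnn(x)
print(outputs.shape)  # (2, 10, 32): top-layer hidden states for every time step
print(h_n.shape)      # (3, 2, 32): final hidden state of each of the 3 layers
```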
Introduction to Word Embeddings
Word embeddings are dense vector representations of words in a continuous vector space, where similar
words are positioned close to each other based on their semantic meanings and contextual usage.
Importance in NLP:
● Semantic Understanding: Word embeddings enable models to understand the meaning of words based on
their context, capturing nuanced relationships that traditional methods like one-hot encoding cannot.
● Dimensionality Reduction: Compared to one-hot encoding, which results in high-dimensional sparse
vectors, word embeddings provide a lower-dimensional representation that is computationally efficient.
● Facilitates Machine Learning: By converting words into numerical vectors, embeddings allow machine
learning algorithms to process text data effectively.
Key Concepts:
● Distributional Hypothesis: This principle states that words with similar meanings tend to occur in similar
contexts, forming the basis for training word embeddings.
● Training Methods: Common approaches include Word2Vec, a predictive model that uses neural networks to
learn word representations from surrounding context, and GloVe, a count-based model built from global
co-occurrence statistics (both are covered in later slides).
Using Word Embeddings
Pre-trained Embeddings:
● Utilizing pre-trained models (e.g., Google News vectors) can save time and resources. These
embeddings can be fine-tuned for specific tasks or used directly in various applications.
Source : https://kavita-ganesan.com/wp-content/uploads/skipgram-vs-cbow-continuous-bag-of-words-word2vec-word-representation-1024x538.png
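A minimal sketch of loading pre-trained vectors, assuming gensim is installed and that the "word2vec-google-news-300" model name is available through gensim's downloader in your version.

```python
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")   # large download on first use
print(vectors["king"].shape)                     # 300-dimensional dense vector
print(vectors.most_similar("king", topn=3))      # nearest neighbours in the embedding space
```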
Using Word Embeddings
Source: https://thinkingneuron.com/wp-content/uploads/2021/12/image-38.png
Key Properties of Word Embeddings
Dimensionality: Embeddings typically use a few hundred dimensions (commonly 50-300), far fewer than the vocabulary-sized one-hot vectors they replace.
Semantic Relationships:
The distance between word vectors in the embedding space reflects their semantic similarity. Words
with similar meanings are located closer together, while those with different meanings are further
apart.
Example: In a well-trained embedding space, the vectors for "king" and "queen" are close together,
while "king" and "car" are farther apart.
Key Properties of Word Embeddings
Arithmetic Properties:
Vector Arithmetic: Word embeddings exhibit interesting arithmetic properties that allow for
semantic manipulation. For instance, the relationship can be expressed as:
vec("king") - vec("man") + vec("woman") ≈ vec("queen")
This property illustrates how embeddings can capture gender relationships and other analogies.
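A minimal sketch of the analogy arithmetic with cosine similarity. The random vectors below only demonstrate the mechanics; the "king - man + woman ≈ queen" result holds for well-trained embeddings, not for these placeholder values.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["king", "queen", "man", "woman", "car"]}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(query, emb[w]))
print("closest word to king - man + woman:", best)
```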
Structure of the Embedding Matrix:
An embedding matrix is a two-dimensional array where each row corresponds to a unique word in
the vocabulary and each column represents a dimension of the embedding.
Usage:
● Input to Models: The embedding matrix is used as input to machine learning models, allowing
them to process text data as numerical vectors.
● Training Process: During training, the embeddings are adjusted based on the context in which
words appear, optimizing their positions in the vector space to reflect semantic relationships
accurately.
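A minimal sketch of an embedding-matrix lookup: row i of E is the embedding of word i, and multiplying by a one-hot vector recovers the same row. The toy vocabulary and sizes are illustrative.

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2}
vocab_size, dim = len(vocab), 4
E = np.random.default_rng(0).normal(size=(vocab_size, dim))  # embedding matrix

one_hot = np.zeros(vocab_size)
one_hot[vocab["cat"]] = 1.0
assert np.allclose(one_hot @ E, E[vocab["cat"]])   # lookup == one-hot multiplication
print(E[vocab["cat"]])
```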
Techniques for Learning Word Embeddings
Overview of Methods: the most widely used approaches are Word2Vec (CBOW and Skip-Gram, typically trained with negative sampling) and GloVe (count-based, built from global co-occurrence statistics); both are described in the following slides.
Source: https://miro.medium.com/v2/resize:fit:1400/1*Pub6nXI7Wg4wbNQvEsyd4Q.png
Word2Vec Model
Developed by Google, Word2Vec is a predictive model that uses a neural network to learn word representations.
Continuous Bag of Words (CBOW): Predicts the target word from surrounding context words. Example: For the sentence "The cat
sat on the mat," if "sat" is the target, CBOW uses "The," "cat," "on," "the," and "mat" as input.
Skip-Gram: Predicts surrounding context words given a target word. Example: Given the target word "sat," it predicts "The," "cat,"
"on," "the," and "mat."
Source: https://miro.medium.com/v2/resize:fit:1400/1*xC6wfTU_zpUlpRlXs5NZ4w.png
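A minimal sketch of training Word2Vec with gensim on a toy corpus (real use needs a much larger corpus); the corpus, `vector_size`, and `window` values are illustrative assumptions.

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                 sg=1,          # 1 = Skip-Gram, 0 = CBOW
                 negative=5)    # negative sampling with 5 noise words
print(model.wv["cat"].shape)                 # (50,)
print(model.wv.most_similar("cat", topn=2))
```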
Negative Sampling Technique
Negative sampling is used to improve the efficiency of training models like Word2Vec by focusing on meaningful
comparisons during optimization.
Process: Instead of calculating gradients for all possible output words, negative sampling selects a small subset of
negative examples that are not true associations with the input. For example, when training the model with the
target word "king," it might learn to associate it with "queen" while distinguishing it from unrelated words like
"table."
Benefits:
● Reduces computational complexity by limiting the number of comparisons needed during training.
● Simplifies the training objective to a binary classification task, where the model predicts whether pairs of words
are likely to co-occur.
Source: https://i.sstatic.net/7V7EO.png
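For reference, the skip-gram-with-negative-sampling objective for a centre word c, an observed context word o, and k sampled negative words is commonly written as below; the symbols u (context vectors), v (centre vectors), and P_n (noise distribution) follow the original Word2Vec formulation rather than these slides.

```latex
% Maximise the score of the true (centre, context) pair while pushing down
% k randomly sampled "negative" words drawn from the noise distribution P_n.
J(o, c) = \log \sigma\!\left(u_o^{\top} v_c\right)
        + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\!\left(-u_{w_i}^{\top} v_c\right)\right]
```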
Negative Sampling in Word2Vec
Source: https://miro.medium.com/v2/resize:fit:2000/format:webp/1*YnI8R6DWKd8Zw1mTeF-bSA.png
GloVe Word Vectors
GloVe is a count-based model that generates word embeddings by leveraging global statistical
information from a corpus. It creates a co-occurrence matrix that captures how often words appear
together in a given context.
Comparison with Word2Vec: While Word2Vec focuses on local context (predicting surrounding
words), GloVe utilizes global statistics to derive embeddings.
Key Features: Captures both global and local context information, making it effective for various
NLP tasks.
Applications: Similar to Word2Vec, GloVe is used in sentiment analysis, recommendation systems,
and information retrieval.
Word2Vec vs. GloVe
Word2Vec:
● Transforms the unlabelled raw corpus into labelled data.
● Requires little memory.
● The mapping between the target word and its context words implicitly embeds the sub-linear relationship into the vector space of words.
● The sub-linear relationships are not explicitly defined.
● Computationally expensive.
GloVe:
● Enforces the word vectors to capture sub-linear relationships in the vector space.
● Adds more practical meaning to word vectors.
● Gives lower weight to highly frequent word pairs, preventing meaningless stop-words from dominating the training process.
● Takes a lot of memory to store the co-occurrence matrix.
● Time consuming, since the co-occurrence matrix must be rebuilt when hyper-parameters change.
Introduction to Sentiment Analysis
Sentiment analysis, also known as opinion mining, is a Natural Language Processing (NLP) technique
that involves identifying and extracting subjective information from text data. It determines the
emotional tone behind a series of words, classifying sentiments as positive, negative, or neutral.
Techniques Used:
● Lexicon-Based Approach: Utilizes predefined lists of words (lexicons) with associated sentiment
scores. The overall sentiment is calculated by summing the scores of words in a text (see the sketch after this list).
● Machine Learning-Based Approach: Employs algorithms trained on labeled datasets to classify
sentiment. Common algorithms include Naive Bayes, Support Vector Machines (SVM), and
Logistic Regression.
● Deep Learning-Based Approach: Uses neural networks (e.g., RNNs, LSTMs, Transformers like
BERT) to capture complex patterns and context in text data.
○ Advantages: High accuracy in understanding context and nuances.
○ Limitations: Computationally intensive and requires significant resources.
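A minimal sketch of the lexicon-based approach described above: sum per-word sentiment scores from a small hand-made lexicon. The lexicon and its scores are illustrative placeholders, not a real resource.

```python
sentiment_lexicon = {"good": 1.0, "great": 2.0, "love": 2.0,
                     "bad": -1.0, "terrible": -2.0, "boring": -1.5}

def lexicon_sentiment(text):
    words = text.lower().split()
    score = sum(sentiment_lexicon.get(w, 0.0) for w in words)  # unknown words score 0
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("The movie was great but a little boring"))  # net positive
```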
Debiasing Word Embeddings
Word embeddings can inadvertently capture and perpetuate biases present in training data.
Debiasing is crucial to ensure fairness and ethical considerations in NLP applications.
Methods of Debiasing:
● Hard Debiasing: Projects word vectors onto the subspace orthogonal to an estimated gender
direction, removing the gender component while retaining semantic meaning (see the sketch after this list).
● Soft Debiasing: A more flexible approach that reduces bias while allowing some degree of
information retention. This method adjusts embeddings based on their proximity to biased
terms.
● FineDeb Framework: A two-phase debiasing process that first modifies embeddings learned by
the language model and then fine-tunes the model on the language modeling objective. This
approach has shown effectiveness in reducing bias while maintaining performance.
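A minimal sketch of the projection step used in hard debiasing. The vectors here are random placeholders; in practice the bias direction is estimated from definitional pairs such as ("he", "she") in trained embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["he", "she", "doctor", "nurse"]}

gender_direction = emb["he"] - emb["she"]
gender_direction /= np.linalg.norm(gender_direction)

def debias(vector, direction):
    # Subtract the component of the vector that lies along the bias direction.
    return vector - (vector @ direction) * direction

debiased_doctor = debias(emb["doctor"], gender_direction)
print(abs(debiased_doctor @ gender_direction))  # ~0: no gender component left
```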
Debiasing Word Embeddings
Source: https://arxiv.org/html/2402.11512v2/x1.png
Basic Attention Models
Attention mechanisms allow models to focus on specific parts of the input sequence, enhancing
performance in tasks such as machine translation and text summarization by assigning different
importance levels to various input elements.
Significance:
● Contextual Relevance: Attention helps models weigh the importance of different words in a
sentence, similar to how humans focus on relevant information while ignoring distractions.
● Improved Performance: By enabling models to consider the entire context rather than relying
solely on fixed-length representations, attention mechanisms significantly enhance the ability
to capture long-range dependencies.
Basic Attention Models
Types of Attention:
● Scaled Dot-Product Attention: Computes attention weights using the dot product of query
and key vectors, scaled by the square root of their dimension.
● Multi-Head Attention: Allows the model to attend to different parts of the input
simultaneously, capturing various relationships and interactions.
Attention Mechanism Illustrated
Picking the Most Likely Sentence
Attention Scoring: Each word in a sequence is assigned an attention score based on its
relevance to the context. The model computes these scores using learned weights from
the training phase.
In a translation task, when translating "The cat sat on the mat," the model will assign
higher attention scores to "cat" and "mat" when generating corresponding words in another
language.
Beam Search Strategy
Beam search is an optimization technique used in NLP for generating sequences. It maintains
multiple hypotheses at each step of generation, allowing for a more comprehensive search
compared to greedy algorithms.
Key Components:
● Beam Width: This hyperparameter determines how many sequences (hypotheses) are kept at
each step. A larger beam width results in a broader search but increases computational costs.
● Process: The algorithm begins with an initial state and expands all possible next states. It then
selects the top 'N' states based on their probabilities for further exploration (a minimal sketch follows after this list).
● Refinements: Adjusting beam width can balance between computational efficiency and output
quality. Techniques like length normalization can also be applied to prevent bias towards
shorter sequences.
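A minimal sketch of beam search over a toy next-token distribution; `next_token_probs`, the vocabulary, and the fixed probabilities are stand-ins for a trained decoder, and length normalization is omitted for brevity.

```python
import math

def next_token_probs(prefix):
    """Stand-in model: slightly prefers 'a', then the end token."""
    return {"a": 0.5, "b": 0.3, "</s>": 0.2}

def beam_search(beam_width=2, max_len=5):
    # Each hypothesis is (tokens, accumulated log-probability).
    beams = [(["<s>"], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, logp in beams:
            if tokens[-1] == "</s>":          # finished hypotheses are carried over unchanged
                candidates.append((tokens, logp))
                continue
            for tok, p in next_token_probs(tokens).items():
                candidates.append((tokens + [tok], logp + math.log(p)))
        # Keep only the top `beam_width` hypotheses by score.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

for tokens, logp in beam_search():
    print(" ".join(tokens), round(logp, 3))
```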
Beam Search Strategy Illustrated
Source: https://miro.medium.com/v2/resize:fit:770/1*tEjhWqUgjX37VnT7gJN-4g.png
Error Analysis in Beam Search
Common Pitfalls:
● Suboptimal Beam Width Selection: Choosing a beam width that is too small may lead to missing optimal sequences, while too large a beam width may increase the computational burden without significant gains.
● Pruning Issues: Incorrect pruning strategies can lead to discarding potentially good candidates early in the search process, resulting in suboptimal outputs.
● Lack of Diversity: If all hypotheses are too similar, beam search may converge on a single type of output, limiting creativity and variability in generated sequences.
Mitigation Strategies: Implementing techniques such as diverse beam search can help maintain diversity among generated sequences while still focusing on high-probability candidates.
Attention Model Intuition
Attention mechanisms enable neural networks to focus selectively on specific parts of the input
data, allowing for more efficient processing and better performance in tasks where context is
crucial.
Conceptual Overview:
Human Analogy: Just as humans can focus on relevant information while ignoring distractions,
attention mechanisms allow models to weigh the importance of different input elements
dynamically.
Attention Model Intuition
How It Works:
The attention mechanism computes a set of attention scores, which determine how much focus to
place on each part of the input. These scores are derived from the relationships between the input
elements.
Mathematical Representation:
● Given an input sequence represented as vectors, the attention weight α_ij for each pair of
input elements can be computed as
  α_ij = exp(score(q_i, k_j)) / Σ_j' exp(score(q_i, k_j'))
● where score(q_i, k_j) is a function (e.g., dot product) that measures the compatibility
between query q_i and key k_j.
Attention Model Intuition
Source: https://www.youtube.com/watch?v=S7oA5C43Rbc
Detailed Attention Model Explanation
Components Involved:
1. Queries (Q): Represent the current state or position in the sequence that needs to focus
on other parts.
2. Keys (K): Represent all positions in the input sequence that can be attended to.
3. Values (V): Contain the actual information that corresponds to each key.
This is a common technique used in attention models where the dot product of queries and keys is
computed to derive attention scores. The scores are then scaled by the square root of the
dimension of the key vectors to stabilize gradients during training.
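The description above corresponds to the standard scaled dot-product attention formula; the symbol d_k for the key dimension is this rendering's notation, not taken from the slide.

```latex
% Scaled dot-product attention over queries Q, keys K, and values V,
% where d_k is the dimension of the key vectors.
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```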
Detailed Attention Model Explanation
Multi-Head Attention:
Extends scaled dot-product attention by allowing multiple sets of queries, keys, and values. Each
head learns different aspects of the input data, providing richer representations.
Attention mechanisms in speech recognition allow models to focus on specific segments of audio input,
improving the accuracy of transcribing spoken language into text.
● Use Cases:
○ Real-Time Transcription: Systems like Google Voice and virtual assistants (e.g., Siri, Alexa) utilize
attention mechanisms to enhance the accuracy of real-time speech-to-text conversion by
focusing on relevant audio features.
○ Contextual Understanding: Attention helps models determine which parts of the audio signal
correspond to specific words, enabling better handling of homophones and context-dependent
phrases.
Speech Recognition with Attention Models
● Technical Implementation:
○ Sequence-to-Sequence Models: In a typical architecture, an encoder processes the audio input
to produce a feature representation, while a decoder generates the text output. The attention
mechanism allows the decoder to focus on different parts of the encoder output at each
decoding step.
○ Attention Weights: The model computes attention weights that indicate how much focus
should be placed on different time steps of the audio input during transcription.
Trigger Word Detection Techniques
Trigger word detection involves identifying specific keywords or phrases (e.g., "Hey Siri," "OK Google") that
activate voice assistants. Attention mechanisms enhance this process by focusing on relevant parts of the
audio stream.
● Techniques Used:
○ End-to-End Deep Learning Models: These models use attention mechanisms to analyze audio signals
and detect trigger words directly from raw audio inputs without requiring extensive feature
engineering.
● Attention Mechanism Role:
○ By applying attention to segments of audio data, the model can prioritize frames that are more likely
to contain trigger words, improving detection accuracy and reducing false positives.
● Implementation Example:
○ A convolutional neural network (CNN) combined with an attention layer can process spectrograms of
audio signals, allowing the model to focus on time-frequency representations that correlate with
trigger words.
Introduction to Transformer Networks
Transformer networks are a type of deep learning architecture designed to handle sequential data,
primarily used in natural language processing (NLP). They leverage self-attention mechanisms to process
input data in parallel, enabling efficient training and improved performance on various tasks.
Key Features:
● Parallelization: Unlike recurrent neural networks (RNNs), which process data sequentially,
transformers can process entire sequences simultaneously, significantly speeding up training times.
● Self-Attention Mechanism: This allows the model to weigh the importance of different input
elements dynamically, capturing long-range dependencies without the limitations of fixed-length
contexts.
● Encoder-Decoder Architecture: Comprising an encoder that processes the input sequence and a
decoder that generates the output sequence, transformers are highly effective for tasks like
translation and summarization.
● Layer Normalization and Residual Connections: These components help stabilize training and
facilitate the flow of gradients through the network.
Self-Attention Mechanism Explained
● Functionality:
1. Self-attention transforms an input sequence into three distinct vectors: Query (Q), Key (K), and Value (V). Each
input element is projected into these vectors through learned linear transformations.
● Process:
1. Compute Attention Scores: The attention score between each query and key is calculated using the dot product: score(q_i, k_j) = q_i · k_j.
2. Scale Scores: The scores are scaled by the square root of the dimension of the key vectors to prevent overly large values that can skew softmax outputs.
3. Apply Softmax: The scaled scores are passed through a softmax function to obtain attention weights that sum to one.
4. Weighted Sum of Values: Each value vector is weighted by the corresponding attention weight, producing a contextually enriched output for each input element (see the sketch below).
● Benefits:
1. Long-range Dependencies: Self-attention captures relationships between distant elements in sequences,
enhancing contextual understanding.
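A minimal NumPy sketch of the four steps above for a single attention head; the sequence length, dimensions, and randomly initialised projection matrices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8

X = rng.normal(size=(seq_len, d_model))          # input sequence
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v              # learned linear projections

scores = Q @ K.T                                  # 1. attention scores (dot products)
scores /= np.sqrt(d_k)                            # 2. scale by sqrt(d_k)
weights = np.exp(scores)                          # 3. softmax over each row
weights /= weights.sum(axis=-1, keepdims=True)
output = weights @ V                              # 4. weighted sum of value vectors

print(weights.round(2))   # each row sums to 1
print(output.shape)       # (4, 8): one context-enriched vector per input element
```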
Multi-Head Attention Concept
Multi-head attention extends the self-attention mechanism by running multiple attention heads in
parallel, allowing the model to capture diverse relationships within the input data.
Importance in Processing Data: Each head independently computes its own set of attention scores
and outputs. This enables the model to focus on different parts of the input sequence
simultaneously, leading to a richer representation.
Process Overview: The queries, keys, and values are linearly projected once per head with different learned matrices; scaled dot-product attention runs in each head in parallel; the head outputs are concatenated and passed through a final linear projection.
Benefits: Heads can specialize in different types of relationships (e.g., positional, syntactic, or semantic), yielding richer representations than a single attention head (see the sketch below).
Please head over here for a good visualization of the underlying processes: https://poloclub.github.io/transformer-explainer/
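A minimal sketch using PyTorch's built-in multi-head attention, assuming batch-first tensors; `embed_dim=64` and `num_heads=8` are illustrative values, not taken from the slides.

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 64)                     # (batch, sequence, embedding)

# Self-attention: the same tensor serves as query, key, and value.
output, attn_weights = mha(x, x, x)
print(output.shape)        # (2, 10, 64): concatenated and re-projected head outputs
print(attn_weights.shape)  # (2, 10, 10): attention weights (averaged over heads by default)
```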
References
Main Reference:
https://www.youtube.com/watch?v=S7oA5C43Rbc