AI Techniques

Few-Shot and Zero-Shot Learning

Machine learning is evolving beyond traditional supervised methods to reduce its
reliance on large labeled datasets. Few-shot and zero-shot learning are at the forefront of
this shift, allowing models to learn from minimal or even no labeled examples.
This article explores the mechanics of both approaches, their real-world
applications, and their role in the era of large language models (LLMs).
 Few-Shot Learning (FSL) - Learns from very few labeled examples (roughly 1-100
samples per class) and involves training on multiple related tasks.
 Zero-Shot Learning (ZSL) - Classifies new tasks or categories with no
task-specific examples, leveraging auxiliary data such as attributes or
language models.
 Do They Require Training? FSL requires task-specific training, even if
minimal, while ZSL typically leverages pre-trained models and does not
require training on new tasks.
Few-Shot Learning (FSL)
Few-shot learning tackles the problem of learning a new task from minimal
labeled examples, generally 1 to 100 samples per class. This approach is
essential in low-data scenarios, such as healthcare or specialized industries,
where collecting labeled data is difficult or costly.
 Training - Few-shot learning requires prior training, but not directly on the
task of interest. Instead, the model is trained on a large set of similar tasks
to develop a generalized learning strategy.
 Meta-Learning - Often referred to as "learning to learn," meta-learning is
a central concept in FSL. The model trains on a variety of tasks to develop
a learning procedure that can be quickly adapted to new tasks. Two widely
used meta-learning methods are described below (a small code sketch of the
first follows this list):
Prototypical Networks - Learn a metric space in which few-shot tasks are
solved by computing a prototype vector (the mean embedding) for each class;
new sample embeddings are then compared to these prototypes using a
distance metric such as Euclidean distance.
MAML (Model-Agnostic Meta-Learning) - Learns an initialization that can be
adapted to new tasks with only a few gradient updates, and applies flexibly
across classification, regression, and reinforcement learning tasks.
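To make the prototypical-network idea concrete, here is a minimal NumPy sketch of a single classification step. It assumes some encoder has already produced embeddings for the support and query examples; the toy data, dimensions, and function name are illustrative, not taken from any specific library.

import numpy as np

def prototypical_predict(support_embeddings, support_labels, query_embeddings):
    """Classify queries by Euclidean distance to per-class prototype vectors."""
    classes = np.unique(support_labels)
    # Prototype = mean embedding of each class's few labeled (support) examples
    prototypes = np.stack([
        support_embeddings[support_labels == c].mean(axis=0) for c in classes
    ])
    # Squared Euclidean distance from every query to every prototype
    dists = ((query_embeddings[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    return classes[np.argmin(dists, axis=1)]

# Toy 2-way, 3-shot episode with random stand-in "embeddings"
rng = np.random.default_rng(0)
support = np.vstack([rng.normal(0, 1, (3, 8)), rng.normal(5, 1, (3, 8))])
labels = np.array([0, 0, 0, 1, 1, 1])
queries = np.vstack([rng.normal(0, 1, (2, 8)), rng.normal(5, 1, (2, 8))])
print(prototypical_predict(support, labels, queries))  # should print [0 0 1 1]

In a full FSL pipeline, the encoder itself is meta-trained across many such episodes so that these prototypes become discriminative on new tasks.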
Challenges:
 Task Similarity - FSL works best when the new task is similar to those
seen during training. If the tasks are very different, model performance
may degrade.
 Embedding Quality - Embedding-based methods like Prototypical Networks
rely on the quality of the learned representation space, which may not
generalize well across domains (e.g., from natural images to medical
images).
Zero-Shot Learning (ZSL)
Zero-shot learning enables models to handle tasks or categories without seeing
any task-specific examples. Instead of relying on labeled examples for each
new class, ZSL uses auxiliary information such as semantic embeddings,
attributes, or language models.
 Training - ZSL doesn't require training on the specific task or class it
needs to solve. The model is typically pre-trained on a large corpus of
general data, and then generalizes to unseen classes using a shared
semantic space.
 Semantic Space - ZSL models map both the input (e.g., images) and the
auxiliary information (e.g., textual descriptions, attributes) into a shared
semantic space. One example is the Attribute Label Embedding (ALE)
model, which uses a set of attributes (e.g., "has tail," "striped") to describe
objects. These attributes guide classification for unseen categories by
comparing them to known objects.
ZSL and Large Language Models (LLMs) - In the era of LLMs like GPT-4 and
CLIP (Contrastive Language-Image Pretraining), ZSL capabilities have expanded.
These models are pre-trained on vast datasets of text-image pairs,
enabling them to perform zero-shot classification by aligning visual data with
natural language descriptions.
 CLIP jointly learns image and text representations, allowing for zero-shot
image classification based on textual prompts. For instance, given a
prompt like "a dog wearing sunglasses," CLIP retrieves images by mapping
both text and image to a shared representation space (a toy sketch follows below).
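The following NumPy sketch shows the zero-shot classification logic of CLIP-style models at a conceptual level. It assumes an image and a set of text prompts have already been embedded into the same shared space by a pre-trained multimodal encoder; the random vectors below are stand-ins for those embeddings, not real CLIP outputs.

import numpy as np

def zero_shot_classify(image_embedding, text_embeddings, prompts):
    """Pick the prompt whose embedding is most similar (cosine) to the image."""
    normalize = lambda v: v / np.linalg.norm(v, axis=-1, keepdims=True)
    scores = normalize(text_embeddings) @ normalize(image_embedding)
    return prompts[int(np.argmax(scores))], scores

prompts = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
rng = np.random.default_rng(1)
text_embeddings = rng.normal(size=(3, 512))             # stand-ins for encoded prompts
image_embedding = text_embeddings[0] + 0.1 * rng.normal(size=512)  # a "dog-like" image
label, scores = zero_shot_classify(image_embedding, text_embeddings, prompts)
print(label)                                            # "a photo of a dog"

No labeled dog images are needed for the new class; the natural-language prompt supplies the auxiliary information.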
Challenges:
 Hubness Problem - In ZSL, a few points in the semantic space can become
the nearest neighbors of many different queries, so predictions collapse
onto these "hubs" and performance suffers. Techniques like generative
adversarial networks (GANs) or variational autoencoders (VAEs) are used to
synthesize samples for unseen classes, helping alleviate this issue.
Key Points
 Training and Generalization - FSL needs training across many tasks,
while ZSL leverages pre-trained models and auxiliary data for
generalization. FSL tends to specialize within domains, whereas ZSL
thrives on transferring knowledge across domains.
 Multimodal Learning - The role of multimodal models (like CLIP)
highlights the power of using textual information as a supervisory signal.
Combining visual data with attributes or textual descriptions enhances the
capacity of models to generalize without task-specific data.
 LLM Integration - The integration of ZSL techniques into LLMs shows
great promise for future AI systems that handle multimodal tasks or
unseen categories with minimal supervision. As models like CLIP and GPT
evolve, their ability to generalize across domains keeps improving, offering
even richer applications for ZSL in real-world scenarios.
References
1. Snell, J., Swersky, K., & Zemel, R. S. (2017). Prototypical Networks for Few-
shot Learning. Advances in Neural Information Processing Systems
(NeurIPS).
2. Finn, C., Abbeel, P., & Levine, S. (2017). Model-Agnostic Meta-Learning for
Fast Adaptation of Deep Networks. Proceedings of the International
Conference on Machine Learning (ICML).
3. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., &
Sutskever, I. (2021). Learning Transferable Visual Models From Natural
Language Supervision. Proceedings of the International Conference on
Machine Learning (ICML).
Retrieval Augmented Generation (RAG)
In the ever-evolving field of Natural Language Processing (NLP), models like GPT-
3 have revolutionized text generation, while search engines have dominated the
art of retrieval. However, the real magic happens when you combine these two
capabilities, and that's exactly what Retrieval Augmented Generation (RAG) does.
My journey into understanding RAG opened up new perspectives on how AI
systems can generate context-aware, factually grounded, and dynamic
responses.
This overview will walk you through what RAG is, its technical aspects, and how it
functions by merging retrieval and generation to create something far more
powerful than standalone models.
What is Retrieval Augmented Generation (RAG)?
At its core, RAG is a hybrid model that integrates retrieval (fetching information
from external data) and generation (producing coherent text). Imagine asking a
highly knowledgeable assistant to answer a tricky question. Instead of relying
only on their memory, they quickly sift through books and articles to find the
most relevant information before crafting a detailed and accurate answer. This is
essentially how RAG works, combining the search capabilities of a retriever with
the text-producing skills of a generator.
The Two Key Components
Retriever: This component pulls relevant information from a large knowledge
base (think of it as a huge library). RAG typically uses models like Dense Passage
Retrieval (DPR), which turns both the query and documents into dense vectors
(mathematical representations). By comparing these vectors, the model retrieves
the most relevant documents.
Generator: The generator is a language model, usually a transformer-based
model like BART or GPT, that produces text based on the retrieved information. It
stitches the retrieved content together into a coherent response, adding the
creativity and fluency we expect from AI models today.
The Human Encyclopedia with Internet Access
Picture yourself asking a friend who’s great at trivia, “What causes
thunderstorms?” Now, your friend might have some knowledge about weather
patterns, but instead of relying solely on memory, they quickly skim through
weather articles on the internet. With this information in hand, they then give
you a precise, well-crafted answer.
The retriever is your friend scanning the internet for articles.
The generator is the same friend, now articulating the most relevant details into
a coherent response.
This two-step approach is exactly what makes RAG powerful. It doesn't just
"guess" from what it knows; it searches first, then speaks.

Why RAG is Revolutionary in NLP


Now, let's dive into why RAG was needed in the first place. While language
models like GPT-3 and BERT are capable of generating coherent text, they are
limited by what they were trained on. These models are "frozen in time" once
training is complete; they don't have access to new information or updates. If
you ask them about recent events or highly specific queries, their
answers could be outdated or wrong.
RAG solves two key challenges:
Keeping Information Current: By using a retriever to access external, up-to-date
information, RAG stays current, even if the language model itself was trained
years ago.
Improving Accuracy: Language models sometimes "hallucinate," meaning they
generate text that sounds right but is factually incorrect. RAG helps by grounding
the generated response in actual documents or data retrieved from a knowledge
base, reducing the likelihood of hallucination.
How It Works
To better understand how RAG operates, we need to break it down step by step.
Step 1 - Retrieval Using Dense Passage Retrieval (DPR)
The first component in RAG is the retriever, and one of the most commonly used
techniques is Dense Passage Retrieval (DPR). Unlike traditional keyword-based
retrieval methods, DPR converts both the query and the documents into dense
vectors (multi-dimensional numerical representations). These vectors allow the
model to better understand the context and meaning of the text, improving the
accuracy of the documents it retrieves.
Let’s say you ask RAG, “What are black holes?” DPR converts both your query
and millions of scientific articles into vectors and retrieves the most relevant
articles about black holes. The difference between DPR and traditional search
engines is that it isn’t just matching keywords; it’s matching semantic meaning.
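As a rough illustration of dense retrieval, here is a small Python sketch. It substitutes a generic sentence-transformers encoder for DPR's separate query and passage BERT encoders; the model name and the toy documents are illustrative choices, not part of the original RAG recipe.

import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "A black hole is a region of spacetime where gravity is so strong that nothing escapes.",
    "Thunderstorms form when warm, moist air rises rapidly into cooler air.",
    "Transformers process all tokens in a sentence in parallel using self-attention.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")       # illustrative encoder choice
doc_vectors = encoder.encode(documents, normalize_embeddings=True)

def retrieve(query, top_k=2):
    """Rank documents by cosine similarity between query and document vectors."""
    query_vector = encoder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ query_vector                 # cosine similarity (unit vectors)
    best = np.argsort(-scores)[:top_k]
    return [(documents[i], float(scores[i])) for i in best]

for doc, score in retrieve("What are black holes?"):
    print(f"{score:.3f}  {doc}")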
Step 2 - Generating with Transformers
Once the retriever gathers the top relevant documents, the generator (typically a
transformer-based model like BART or GPT) takes these documents and uses
them to craft a response.
Transformers are ideal for this task because they excel at understanding long-
range dependencies in text, which means they can consider the overall context
of the documents while generating fluent and contextually accurate responses.
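Continuing the sketch above, the retrieved passages can simply be placed into the generator's prompt. The snippet below reuses retrieve() from the previous example and uses a small instruction-tuned seq2seq model as a stand-in for the BART/GPT generators described here; the model choice and prompt format are illustrative assumptions.

from transformers import pipeline

# "google/flan-t5-small" is only an illustrative, lightweight generator choice
generator = pipeline("text2text-generation", model="google/flan-t5-small")

def rag_answer(question):
    passages = [doc for doc, _ in retrieve(question)]   # retrieval step from earlier sketch
    context = "\n".join(passages)
    prompt = (
        "Answer the question using only the context.\n"
        f"Context:\n{context}\nQuestion: {question}"
    )
    return generator(prompt, max_new_tokens=64)[0]["generated_text"]

print(rag_answer("What are black holes?"))

Because the answer is conditioned on retrieved text rather than on the model's parameters alone, updating the document store updates the system's knowledge without retraining the generator.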
Two Modes of RAG
The way RAG merges retrieval and generation is unique. I discovered that RAG
operates in two primary modes: Sequence-Level Fusion and Token-Level Fusion.
These modes dictate how the retriever's data is integrated into the generation
process.
Sequence-Level Fusion: In this mode, the generator receives the entire
retrieved document at once and uses it as context to generate a response. It's
similar to giving your assistant a batch of research material and asking them to
summarize it.
Token-Level Fusion: This is a bit more complex. The generator doesn't wait for
the entire retrieved document. Instead, at each generated token, it
decides whether to rely on its internal knowledge or to use information from the
retrieved document. This fine-grained approach allows for more accurate,
context-aware generation.
ROPE and Positional Encoding
As discussed in the previous article, ROPE (Rotary Positional Embedding) has
become the new norm in transformers for its ability to go beyond traditional
positional encoding. What makes ROPE unique is that it enhances the model's
ability to handle longer sequences. Traditional positional encodings lose accuracy
with longer input texts; ROPE addresses this by applying mathematical rotations
to the embeddings, better preserving the relative positions of tokens over long
distances.
Technical Breakdown of ROPE
Rotary Positional Embedding is based on the idea that position information can be
encoded by rotating pairs of vector components through angles defined by
sinusoidal functions of position.
It gives transformers the ability to "remember" the position of tokens even when
processing very long sequences of text, allowing the model to maintain a high
level of contextual understanding without degrading as sequences grow.
Where RAG Shines
When I explored the real-world uses of RAG, it became clear that this model has
huge potential across various industries. Here are some applications where RAG
is already proving to be a game changer:
Dynamic Question Answering Systems: RAG can access vast databases or
knowledge sources to provide real-time, factually correct answers. For example,
a medical chatbot using RAG could pull the latest medical studies to answer
questions on rare diseases.
Legal and Financial Document Analysis: Imagine a law firm using RAG to quickly
retrieve and summarize relevant cases or financial reports from thousands of
pages of documentation.
Real-Time Information Summarization: In industries like journalism, RAG could
retrieve breaking news updates and generate concise summaries for reporters,
allowing them to quickly digest large volumes of information.
Why RAG Represents the Future of NLP
The beauty of Retrieval Augmented Generation lies in its ability to combine the
best of two worlds, retrieval and generation. Through its retriever, it ensures that
responses are based on relevant, accurate data, and through its generator, it
crafts these responses into fluent, human-like text.
This hybrid approach addresses the limitations of traditional generative models,
offering solutions to problems like outdated knowledge and hallucination. With
advancements like ROPE, RAG continues to evolve, pushing the boundaries of
what’s possible in NLP. Whether used for question answering, content creation, or
information retrieval, RAG is paving the way for more intelligent, reliable, and
adaptable AI systems.

UNDERSTANDING TRANSFORMERS IN NLP


Transformers are the backbone of modern Natural Language Processing (NLP).
They’ve transformed how models understand language by processing all the
words in a sentence simultaneously, rather than sequentially, and leveraging the
self-attention mechanism to capture complex dependencies between words,
regardless of how far apart they are. Here’s a technical breakdown of how
transformers work.
1. Why Transformers Revolutionized NLP
Before transformers, models like Recurrent Neural Networks (RNNs) and Long
Short-Term Memory networks (LSTMs) were used. These models processed words
one at a time, making them effective for short sequences but inefficient for long
sentences or paragraphs. Their sequential nature also made training slower due
to limited parallelization.
Transformers removed these limitations by processing words in parallel. This
allowed them to understand long-range dependencies (connections between
distant words in a sentence) without forgetting earlier information, a common
issue with RNNs.
2. Transformer Architecture - The Core Structure
The transformer model is made up of two key components:
 Encoder: Reads and processes the input sentence.
 Decoder: Generates the output sentence based on what the encoder has
learned.
Each encoder and decoder block contains layers that include:
 Self-Attention: This helps the model decide which words in the sentence
should influence each other.
 Feed-Forward Neural Networks (FFNNs): A simple neural network
applied to every word’s representation after self-attention.
 Positional Encoding: Adds information about the order of words.
In this architecture, the input sentence is fed into the encoder, which uses
self-attention and positional encodings to create meaningful word
representations. The decoder mirrors this process while generating the output.

3. Self-Attention - How Transformers Understand Long Sentences


The self-attention mechanism is at the heart of the transformer model. It
allows each word to attend to (or focus on) other words in the sentence to capture
relationships, no matter how far apart they are. Here's how self-attention works,
step by step:
1. Query, Key, and Value Vectors (Q, K, V): For each word, the model
computes three vectors. The Query represents what this word is asking
(what it wants to focus on). The Key helps other words decide how much
they should pay attention to this word. The Value carries the actual
information that is transferred.
2. Dot Product: For each word, the model computes a dot product between
its Query and the Key of every other word. These scores tell the model
how much attention that word should pay to each of the others.
3. Weighted Sum of Values: After computing attention scores, the model
applies them to the Value vectors and creates new representations of the
words that incorporate context from the entire sentence.
In practice, the self-attention mechanism helps the model connect words like
"lion" and "rest" in the sentence “The lion, which was tired after the long
hunt, lay down to rest.” Even though "lion" and "rest" are far apart, the self-
attention layer ensures that the model understands that they’re closely related.
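Here is a minimal NumPy sketch of a single self-attention head over a toy sequence. The projection matrices are randomly initialized purely for illustration; in a real transformer they are learned during training.

import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model) token embeddings -> contextualized representations."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                  # Query, Key, Value vectors
    scores = q @ k.T / np.sqrt(k.shape[-1])              # scaled dot products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over each row
    return weights @ v                                   # weighted sum of Values

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16                                 # e.g. a 6-token toy sentence
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)            # (6, 16)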
4. Multi-Head Attention - Seeing Language From Different Angles
In real-world sentences, a single attention mechanism might not be enough to
capture all the nuances of word relationships. That’s why transformers use
multi-head attention, which allows the model to focus on different parts of the
sentence at the same time.
 Each attention head computes its own self-attention scores, meaning one
head might focus on verb-object relationships, while another might focus
on subject-modifier relationships.
 The results from all heads are then combined, allowing the model to have
a richer understanding of the sentence.
For example, in the sentence “The cat sat on the mat,” one attention head
might focus on the relationship between “cat” and “sat,” while another might
focus on the relationship between “on” and “mat.”
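Building on the single-head sketch above (and reusing its self_attention function), the following shows the multi-head pattern: the model dimension is split across several heads, each attends independently, and the results are concatenated and projected. The head count and sizes here are illustrative.

import numpy as np

def multi_head_attention(x, num_heads=4, seed=1):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    rng = np.random.default_rng(seed)
    head_outputs = []
    for _ in range(num_heads):
        # Each head gets its own (here random, normally learned) smaller projections
        w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        head_outputs.append(self_attention(x, w_q, w_k, w_v))
    w_o = rng.normal(size=(d_model, d_model))            # final output projection
    return np.concatenate(head_outputs, axis=-1) @ w_o

x = np.random.default_rng(0).normal(size=(6, 16))
print(multi_head_attention(x).shape)                     # (6, 16)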
5. Positional Encoding - Teaching Transformers About Word Order
Unlike RNNs, transformers don’t process words in a specific order. To understand
which word comes first, transformers use positional encodings. These are
added to the word embeddings and allow the model to capture the order of
words in a sentence.
Positional encoding is done using sine and cosine functions that generate unique
values for each position. For instance, in “The cat sat on the mat,” positional
encodings help the model understand that "the" comes before "cat" and that the
order affects the meaning of the sentence.
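A short NumPy sketch of these sinusoidal encodings, using the standard sine/cosine formulation; the sequence length and embedding size below are chosen just for illustration.

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]                       # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                      # even feature indices
    angle_rates = 1.0 / np.power(10000.0, dims / d_model)
    angles = positions * angle_rates                              # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                                  # sine on even dims
    pe[:, 1::2] = np.cos(angles)                                  # cosine on odd dims
    return pe

# "The cat sat on the mat" -> 6 positions; row i is added to word i's embedding
print(sinusoidal_positional_encoding(seq_len=6, d_model=8).shape)  # (6, 8)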
5.1 Rotary Positional Embedding (ROPE) - A New Approach to Positional
Encoding
While traditional positional encodings, based on sine and cosine functions, assign
fixed values to represent the positions of words in a sentence, Rotary
Positional Embeddings (ROPE) offer a more flexible method by embedding
positional information directly into the self-attention mechanism. ROPE helps
capture relative positional relationships between words more effectively,
especially for long-range dependencies.
To understand how ROPE improves upon traditional positional encoding, let’s use
a sentence and examine how each method works.
Example: "The spacecraft, after years of exploration, finally returned to
Earth."
Traditional Positional Encoding
In transformers, positional encodings are added to word embeddings to give the
model a sense of order. For example, in the sentence,
 Word Embeddings: Each word ("The," "spacecraft," "returned," "Earth,"
etc.) is first transformed into a high-dimensional vector representing its
meaning.
 Positional Encodings: Then, sine and cosine functions generate
positional embeddings that reflect the absolute position of each word. For
instance, "The" (position 1) gets a different positional encoding than
"returned" (position 9), and "Earth" (position 10).
In traditional positional encoding, the model understands the sentence's
structure based on these absolute positions. However, because the positional
encodings are fixed, the model might struggle to capture long-range
relationships efficiently. It might understand that "returned" is linked to
"spacecraft," but the further apart the words are, the harder it is for the model to
grasp their connection fully.
How ROPE (Rotary Positional Embedding) Handles Position
Now, let’s see how ROPE approaches this.
In ROPE, instead of adding absolute positional encodings to word embeddings,
the positional information is embedded directly into the self-attention mechanism
through the rotation of query and key vectors. This embedding process helps
capture relative positions between words rather than just absolute ones.
 For example, when processing "spacecraft" and "returned," the model
doesn’t just consider their fixed positions (2 and 9); instead, it rotates the
query and key vectors for these words by an angle that reflects their
relative distance from each other in the sentence.
 This rotation allows the self-attention mechanism to directly encode
positional relationships, meaning the transformer can better
understand that "spacecraft" and "returned" are related, even though they
are separated by many words.
Key Differences Between Traditional Positional Encoding and ROPE
 Traditional Positional Encoding: This method assigns a unique position
to each word in the sequence and represents that position through fixed
sine and cosine values. This works well for short sentences but may
struggle with long-range dependencies because it lacks the capacity to
model relationships between distant words efficiently. In the example, it
encodes "spacecraft" and "returned" based on their absolute positions (2
and 9). The model knows where each word is, but as the distance between
words grows, the relationship becomes harder to capture accurately.
 ROPE: In contrast, ROPE rotates the query and key vectors in self-
attention based on the relative positions of the words. This gives the
transformer a more flexible way to capture dependencies over long
distances. In the same example, the query vector for "spacecraft" and the
key vector for "returned" would be rotated relative to their distance. This
allows the model to more easily recognize that these two words are
related, despite being far apart in the sentence.
Visualizing the Difference
Imagine positional encoding as a timeline where each word has a fixed spot.
Traditional encodings only consider where a word is on that timeline. ROPE,
however, measures the distance between points on the timeline (i.e., the
relative distance), which allows the transformer to better understand which
words are related, even if they are far apart.
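To make the rotation idea concrete, here is a minimal NumPy sketch of applying rotary embeddings to query and key vectors. The dimensions, frequencies, and positions are illustrative; real implementations apply this inside each attention head.

import numpy as np

def apply_rope(x, positions):
    """Rotate consecutive pairs of dimensions of x by position-dependent angles."""
    d = x.shape[-1]
    freqs = 1.0 / np.power(10000.0, np.arange(0, d, 2) / d)      # one frequency per pair
    angles = positions[:, None] * freqs[None, :]                 # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    rotated = np.empty_like(x)
    rotated[:, 0::2] = x1 * cos - x2 * sin
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return rotated

rng = np.random.default_rng(0)
q, k = rng.normal(size=(10, 8)), rng.normal(size=(10, 8))        # toy queries and keys
positions = np.arange(10)
q_rot, k_rot = apply_rope(q, positions), apply_rope(k, positions)
# The attention score between positions 2 and 9 (e.g. "spacecraft" and "returned")
# now depends on their relative offset rather than on absolute positions alone.
print(float(q_rot[2] @ k_rot[9]))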
6. Residual Connections and Layer Normalization - Making Transformers
More Efficient
Residual connections (skip connections) add each sub-layer's input back to its
output, preserving information and easing gradient flow through deep stacks of
encoder and decoder blocks. They work hand in hand with Layer Normalization, a
key technique used in transformers to stabilize training. Traditional normalization
methods like batch normalization rely on batch statistics, which can be
problematic when the batch size is small or when working with variable-length
sequences, as is common in NLP.
Layer normalization, on the other hand, operates on individual layers by
normalizing across the features for each sample, allowing for more consistent
gradient flow. It helps the model train faster by maintaining a stable range of
inputs across layers and preventing issues like vanishing or exploding gradients.
Here's how it works:
 At each layer, the mean and variance of the activations are calculated for
a single input.
 These values are then used to normalize the activations so that they have
a mean of 0 and a variance of 1, ensuring stable training dynamics.
 This normalization is especially beneficial for models like transformers,
where deeper architectures might otherwise struggle with gradient
instability.
Layer normalization not only accelerates convergence but also improves the
robustness of the model during training.
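A minimal NumPy sketch of the layer normalization step described above: each token's activation vector is normalized over its own features and then rescaled by learnable parameters (set to identity values here for illustration).

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """x: (seq_len, d_model); normalize each row across its features."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(4, 8))
gamma, beta = np.ones(8), np.zeros(8)         # learnable gain and bias (identity here)
y = layer_norm(x, gamma, beta)
print(y.mean(axis=-1).round(6))               # ~0 for every token
print(y.var(axis=-1).round(6))                # ~1 for every token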
7. Transformers in Action - Capturing Long Dependencies
Let’s revisit our earlier example sentence,
“The lion, which was tired after the long hunt, lay down to rest.”
In this sentence, understanding the word “lion” depends not only on nearby
words but also on the distant word “rest”. The self-attention mechanism
ensures that while processing the word “lion,” the transformer gives significant
attention to “rest,” allowing the model to understand that the lion is the one
resting, despite the distance between these words.
Traditional models like RNNs would struggle with such long-range dependencies,
as they process words sequentially and often forget earlier words as they move
through the sentence. Transformers, with their parallel processing and self-
attention, solve this problem by considering the full sentence at once.
