Autoregressive Models
Understanding Autoregressive Models
Definition: Autoregressive (AR) models predict future values in a time series
based on a linear function of previous observations. They are widely used in
time-series forecasting, econometrics, and natural language processing.
Equation:
X_t = b + \sum_{i=1}^{p} w_i X_{t-i} + \epsilon_t    (1)
For a first-order AR(1) model this reduces to:
X_t = \phi X_{t-1} + \epsilon_t    (6)
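As a concrete illustration, the sketch below simulates an AR(1) process and computes a one-step AR(p) forecast with NumPy; the coefficients, bias, and noise scale are made-up values, not fitted parameters.

```python
import numpy as np

def ar_predict(history, weights, bias):
    """One-step AR(p) forecast: X_t = b + sum_i w_i * X_{t-i}, as in equation (1)."""
    p = len(weights)
    lags = history[-p:][::-1]          # X_{t-1}, X_{t-2}, ..., X_{t-p}
    return bias + np.dot(weights, lags)

# Simulate an AR(1) process X_t = phi * X_{t-1} + eps_t (equation (6))
rng = np.random.default_rng(0)
phi, n = 0.8, 200
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal(scale=0.1)

# Forecast the next value with hand-picked AR(2) coefficients (illustrative only)
print(ar_predict(x, weights=np.array([0.7, 0.1]), bias=0.0))
```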
Binary Cross-Entropy Loss (BCELoss)
Definition
Binary Cross-Entropy (BCE) is a widely used loss function for binary classifi-
cation tasks. It measures the difference between two probability distributions
and is commonly used in logistic regression and neural networks.
Mathematical Formulation:
L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]    (7)
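A small NumPy sketch of equation (7); the clipping constant is an added assumption for numerical stability and is not part of the formula itself.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean BCE loss over N samples, following equation (7)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.7, 0.6])
print(binary_cross_entropy(y_true, y_pred))  # ~0.30
```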
NLP Pipeline
Natural Language Processing (NLP) involves preprocessing textual data before
feeding it into machine learning models. The key steps include:
Tokenization
• Splitting text into words or subwords.
• Helps in feature extraction for models.
• Example: "Hello, world!" → ["Hello", ",", "world", "!"]
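Below is a minimal sketch of word-and-punctuation tokenization using a regular expression; production pipelines would normally rely on a library tokenizer (e.g., NLTK, listed later under tools).

```python
import re

def tokenize(text):
    """Split text into words and punctuation symbols."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```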
Text Normalization
• Converts words into a standard format to improve model performance.
• Includes lowercasing, stemming, and lemmatization.
Lowercasing
• Converts text to lowercase to avoid treating "Hello" and "hello" as different words.
Stemming
• Reduces words to their root form by removing suffixes.
Lemmatization
• Converts words to their dictionary form.
• Example: "better" → "good"
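The normalization steps above can be sketched with NLTK (mentioned later among popular libraries); this assumes the WordNet data has been downloaded and is only an illustrative snippet.

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # assumption: network access for the corpus download

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print("Hello".lower())                          # hello  (lowercasing)
print(stemmer.stem("running"))                  # run    (stemming strips the suffix)
print(lemmatizer.lemmatize("better", pos="a"))  # good   (lemmatization uses the dictionary form)
```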
Understanding N-Grams
Definition: An N-gram is a contiguous sequence of N items (typically words) from a given text.
Types:
• Unigram (N = 1): single words
• Bigram (N = 2): pairs of consecutive words
• Trigram (N = 3): triples of consecutive words
Probability Calculation:
P(w_n \mid w_{n-1}) = \frac{\text{count}(w_{n-1}, w_n)}{\text{count}(w_{n-1})}    (8)

P(w_n \mid w_{n-2}, w_{n-1}) = \frac{\text{count}(w_{n-2}, w_{n-1}, w_n)}{\text{count}(w_{n-2}, w_{n-1})}    (9)
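A small sketch estimating the bigram probability of equation (8) from raw counts over a made-up toy corpus:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat slept".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev_word, word):
    """P(w_n | w_{n-1}) = count(w_{n-1}, w_n) / count(w_{n-1}), equation (8)."""
    return bigrams[(prev_word, word)] / unigrams[prev_word]

print(bigram_prob("the", "cat"))  # 2/3
```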
Recurrent Neural Networks (RNNs)
Definition
A Recurrent Neural Network (RNN) is a type of artificial neural network de-
signed to process sequential data. Unlike traditional feedforward networks,
RNNs use internal memory (hidden states) to capture dependencies in sequen-
tial inputs.
Basic Structure:
• Input Layer: Accepts sequential data.
• Hidden Layer: Maintains a hidden state that is updated at each time step and serves as the network's memory.
• Output Layer: Produces predictions from the hidden state.
Types of RNNs:
• One-to-One: Standard feedforward network.
• One-to-Many: Single input, multiple outputs (e.g., image captioning).
• Many-to-One: Multiple inputs, single output (e.g., sentiment classification).
• Many-to-Many: Multiple inputs, multiple outputs (e.g., machine translation).
Challenges of RNNs
• Vanishing Gradient Problem: Gradients shrink exponentially over
time, making it difficult for the network to learn long-term dependencies.
• Exploding Gradient Problem: Gradients grow exponentially, leading
to unstable training.
• Limited Memory: Standard RNNs struggle with capturing long-range
dependencies in sequential data.
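To make the recurrence concrete, here is a minimal NumPy sketch of a vanilla RNN forward pass, h_t = tanh(W_x x_t + W_h h_{t-1} + b); the dimensions and random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 4, 8, 5

# Randomly initialized parameters (illustrative only)
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

x_seq = rng.normal(size=(seq_len, input_dim))  # a toy input sequence
h = np.zeros(hidden_dim)                       # initial hidden state

for x_t in x_seq:
    h = np.tanh(W_x @ x_t + W_h @ h + b)       # hidden state carries memory forward

print(h.shape)  # (8,)
```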
Long Short-Term Memory (LSTM)
Definition
LSTMs are an advanced type of RNN designed to address the vanishing gradient
problem by incorporating memory cells that selectively retain information over
long sequences.
LSTM Architecture
LSTM networks consist of three gates:
• Forget Gate: Controls what portion of the past information should be
discarded.
• Input Gate: Regulates what new information should be added to mem-
ory.
• Output Gate: Determines the final output based on the cell state.
Mathematical Formulation:
Forget Gate:
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)    (12)
Input Gate:
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)    (13)
\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)    (14)
Cell State Update:
C_t = f_t * C_{t-1} + i_t * \tilde{C}_t    (15)
Output Gate:
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)    (16)
h_t = o_t * \tanh(C_t)    (17)
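A NumPy sketch of a single LSTM step following equations (12)-(17); [h_{t-1}, x_t] is implemented as concatenation, and the randomly initialized weights are placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM time step following equations (12)-(17)."""
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)               # forget gate (12)
    i_t = sigmoid(W_i @ z + b_i)               # input gate (13)
    c_hat = np.tanh(W_C @ z + b_C)             # candidate cell state (14)
    c_t = f_t * c_prev + i_t * c_hat           # cell state update (15)
    o_t = sigmoid(W_o @ z + b_o)               # output gate (16)
    h_t = o_t * np.tanh(c_t)                   # hidden state (17)
    return h_t, c_t

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
Ws = [rng.normal(scale=0.1, size=(d_h, d_h + d_in)) for _ in range(4)]
bs = [np.zeros(d_h) for _ in range(4)]
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_in), h, c, *Ws, *bs)
print(h.shape, c.shape)  # (8,) (8,)
```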
Gated Recurrent Units (GRUs)
GRUs simplify the LSTM by using only two gates (reset and update) and no separate cell state.
Mathematical Formulation
Reset Gate:
r_t = \sigma(W_r [h_{t-1}, x_t] + b_r)    (18)
Update Gate:
z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)    (19)
Candidate Hidden State:
\tilde{h}_t = \tanh(W_h [r_t * h_{t-1}, x_t] + b_h)    (20)
Hidden State Update:
h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t    (21)
Bidirectional RNNs
Definition
A Bidirectional RNN (BiRNN) processes input sequences in both forward and
backward directions to improve context understanding.
Mathematical Formulation:
h_t^{\text{forward}} = f(W_x x_t + U h_{t-1}^{\text{forward}})    (22)
h_t^{\text{backward}} = f(W_x x_t + U h_{t+1}^{\text{backward}})    (23)
h_t = [h_t^{\text{forward}}, h_t^{\text{backward}}]    (24)
Attention Mechanism for NLP
For multi-head attention, each head i uses learned projection matrices
W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}, \quad W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}, \quad W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}, \quad W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}
Scaled dot-product attention:
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
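A NumPy sketch of the scaled dot-product attention formula above for a single head; the shapes are arbitrary illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq_q, seq_k) similarity scores
    weights = softmax(scores, axis=-1)         # attention weights per query
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 16))
print(attention(Q, K, V).shape)  # (3, 16)
```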
Sentiment Classification
Definition:
Sentiment classification is the task of identifying the emotional tone behind a
body of text, typically categorized as positive, negative, or neutral.
Goal:
To automatically determine the sentiment expressed in text using computational
methods.
Applications:
• Product reviews
• Social media monitoring
• Customer feedback analysis
Approaches:
1. Rule-based: Uses manually defined rules and sentiment lexicons (e.g.,
SentiWordNet).
2. Machine Learning: Trains models (e.g., Naive Bayes, SVM) on labeled
data.
3. Deep Learning: Uses neural networks (e.g., RNNs, LSTMs, Transform-
ers) for better context understanding.
Steps Involved:
• Text Preprocessing: Tokenization, stopword removal, stemming/lemmatization
• Feature Extraction: Bag of Words, TF-IDF, word embeddings
Popular Tools/Libraries:
• NLTK
• TextBlob
• Scikit-learn
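As a rough sketch of the pipeline above using scikit-learn (one of the listed libraries): TF-IDF feature extraction followed by a Naive Bayes classifier; the four training sentences are invented toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled data (illustrative only)
texts = ["I love this product", "Terrible quality, very disappointed",
         "Absolutely fantastic experience", "Worst purchase ever"]
labels = ["positive", "negative", "positive", "negative"]

# Feature extraction (TF-IDF) + classifier (Naive Bayes) in one pipeline
model = make_pipeline(TfidfVectorizer(lowercase=True, stop_words="english"), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["what a fantastic product"]))  # likely ['positive']
```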
Word Embeddings
Definition:
Word embeddings are dense vector representations of words in a continuous
vector space where semantically similar words are mapped close together.
Purpose:
To capture syntactic and semantic meaning of words for use in machine learning
models.
Characteristics:
• Dense, low-dimensional, real-valued vectors (in contrast to sparse one-hot representations).
• Learned from large corpora so that words used in similar contexts receive similar vectors.
Popular Techniques:
• Word2Vec: Uses two architectures:
– CBOW (Continuous Bag of Words): Predicts the current word
from surrounding context.
– Skip-gram: Predicts surrounding context words given the current
(center) word.
• GloVe (Global Vectors): Learns embeddings from global word co-occurrence statistics across the corpus.
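A small Word2Vec sketch using Gensim (an assumption; Gensim is not among the techniques or tools named here). The sg flag switches between the two architectures above, and the toy corpus is far too small to produce meaningful embeddings.

```python
from gensim.models import Word2Vec

# Toy tokenized corpus (illustrative only; real embeddings need large corpora)
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "in", "the", "city"],
    ["the", "woman", "walks", "in", "the", "city"],
]

# sg=1 -> skip-gram; sg=0 -> CBOW
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50, seed=42)

print(model.wv["king"].shape)                 # (50,) dense vector
print(model.wv.most_similar("king", topn=3))  # nearest neighbours in the toy space
```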
Advantages:
• Captures semantic similarity (e.g., king - man + woman = queen)
• Improves performance in NLP tasks
Applications:
• Text classification
• Machine translation
• Question answering
• Named entity recognition (NER)
BLEU Score (Bilingual Evaluation Understudy)
Definition:
BLEU is an automatic evaluation metric for comparing machine-generated text
(e.g., translations) against one or more reference texts using n-gram precision.
Purpose:
To measure the quality of machine translation by checking how many n-grams
in the candidate sentence appear in the reference sentences.
Formula:
\text{BLEU} = BP \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right)
Where:
• p_n : modified n-gram precision for n-grams of order n
• w_n : weight for each n-gram order (commonly w_n = 1/N)
• BP : brevity penalty
• c : length of the candidate translation; r : effective reference length
Brevity Penalty:
BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \le r \end{cases}
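For a quick check, NLTK provides a sentence-level BLEU implementation; the smoothing function below is an added assumption to avoid zero n-gram precisions on such short toy sentences.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the", "cat", "is", "on", "the", "mat"]
candidate = ["the", "cat", "sat", "on", "the", "mat"]

# Uniform weights over 1- to 4-grams (w_n = 1/N with N = 4)
score = sentence_bleu(
    [reference], candidate,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(round(score, 3))
```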
Machine Translation (MT)
Definition:
Machine Translation is the task of automatically translating text from one lan-
guage to another using computational models.
Evolution:
• Statistical MT: Early systems based on word/phrase probabilities.
• Seq2Seq Models: Encoder-decoder neural networks using RNNs.
• Attention Mechanism: Solves long-range dependency issues in Seq2Seq.
• Transformers: Fully attention-based models, state-of-the-art in MT.
Preprocessing Pipeline:
• Tokenization: Splits text into meaningful units using language-specific
tokenizers.
• Vocabulary Building: Maps words to indices; includes special tokens
like <sos>, <eos>, <pad>, and <unk>.
• Numericalization: Converts token sequences into index sequences for
model input.
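A minimal sketch of vocabulary building and numericalization with the special tokens listed above; whitespace tokenization and the toy sentences are simplifications for illustration.

```python
# Special tokens as described above
specials = ["<pad>", "<sos>", "<eos>", "<unk>"]

corpus = ["ich liebe katzen", "ich liebe hunde"]
tokens = {tok for sent in corpus for tok in sent.split()}

# Vocabulary: token -> index, with special tokens first
vocab = {tok: i for i, tok in enumerate(specials + sorted(tokens))}

def numericalize(sentence):
    """Convert a token sequence into an index sequence for model input."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in sentence.split()]
    return [vocab["<sos>"]] + ids + [vocab["<eos>"]]

print(vocab)
print(numericalize("ich liebe vögel"))  # unknown word maps to <unk>
```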
Introduction to Transformers
Transformers revolutionized machine translation by overcoming the limitations
of RNNs:
• RNN Issues: Struggles with long-range dependencies, poor paralleliza-
tion, and fading context.
• Transformer Strengths: Self-attention allows direct modeling of de-
pendencies, facilitates parallel computation, and maintains context across
sentences.
Positional Encoding
Self-attention has no built-in notion of token order, so positional encodings are added to the token embeddings to provide order information.
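One common concrete choice, used in the original Transformer paper, is the sinusoidal encoding; the sketch below assumes that variant, though learned positional embeddings are also widely used.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    positions = np.arange(max_len)[:, None]                   # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                  # even dimensions
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(max_len=50, d_model=64)
print(pe.shape)  # (50, 64); added element-wise to the token embeddings
```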
Encoder
The encoder processes input sequences and creates contextualized embeddings.
It consists of the following components:
• Input embedding
• Positional encoding
• Multi-head self-attention
• Feed-forward network
• Layer normalization + residual connections
Decoder
The decoder generates the output sequence by attending to both previous tokens
and the encoder’s output. It consists of the following components:
• Output embedding
• Positional encoding
• Masked multi-head self-attention
• Encoder-decoder attention
• Feed-forward network
• Linear + softmax layer
Semi-Supervised Learning (SSL)
Motivation
Labeled data is costly to acquire, while unlabeled data is abundant. SSL utilizes
unlabeled data to enhance learning, reducing reliance on labeled data.
Mathematical Formulation
Let:
• D_L = \{(x_i, y_i)\}_{i=1}^{l} : Labeled data
• D_U = \{x_i\}_{i=l+1}^{l+u} : Unlabeled data
Importance
SSL reduces the need for large labeled datasets, making it particularly useful in
domains with limited labeled data but abundant unlabeled data.
Loss function:
L = L_{\text{supervised}} + \sum_{l} \lambda_l \, L_{\text{reconstruction}}^{(l)}
Pi-Model
The Pi-Model enforces consistency regularization by ensuring stable predictions
under input perturbations:
• Input x is passed through the network twice with different augmentations/noise.
• A consistency loss (e.g., mean squared error) penalizes disagreement between the two predictions, in addition to the standard supervised loss on labeled examples.
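A schematic sketch of the Pi-Model's consistency term; `network` and `augment` below are hypothetical placeholders standing in for a real model and real data augmentation.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 10))        # placeholder "network" weights

def network(x):
    """Stand-in model: a linear layer followed by softmax."""
    logits = W @ x
    e = np.exp(logits - logits.max())
    return e / e.sum()

def augment(x):
    """Stand-in stochastic augmentation: additive Gaussian noise."""
    return x + rng.normal(scale=0.05, size=x.shape)

x = rng.normal(size=10)                        # one unlabeled example
p1, p2 = network(augment(x)), network(augment(x))

consistency_loss = np.mean((p1 - p2) ** 2)     # encourage agreement between the two passes
print(consistency_loss)
```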
Variational Autoencoders (VAEs)
Intuition
The encoder compresses an input into a latent variable z, treated as a random
variable. This enables diverse outputs from the same input.
VAE Loss
L = \mathbb{E}_{q(z|x)}\left[\log p(x|z)\right] - \mathrm{KL}\left(q(z|x) \,\|\, p(z)\right)
Consists of:
• Reconstruction Loss
• KL Divergence
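A sketch of computing the two loss terms for a diagonal-Gaussian posterior with a standard-normal prior, assuming a Bernoulli (binary cross-entropy) reconstruction term; all numbers are toy values.

```python
import numpy as np

def vae_loss(x, x_recon, mu, logvar, eps=1e-12):
    """Negative ELBO: reconstruction loss + KL(q(z|x) || N(0, I))."""
    x_recon = np.clip(x_recon, eps, 1 - eps)
    recon = -np.sum(x * np.log(x_recon) + (1 - x) * np.log(1 - x_recon))
    # Closed-form KL divergence between N(mu, sigma^2) and N(0, 1), summed over latent dims
    kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))
    return recon + kl

x = np.array([1.0, 0.0, 1.0, 1.0])             # toy binary input
x_recon = np.array([0.9, 0.1, 0.8, 0.7])       # decoder output p(x|z)
mu, logvar = np.array([0.1, -0.2]), np.array([-0.5, 0.3])
print(vae_loss(x, x_recon, mu, logvar))
```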
Diffusion Models
Diffusion models generate samples by reversing a gradual, multi-step noise-addition process, which typically yields sharper outputs than the single-step decoding used by VAEs.
VAEs in SSL
VAEs can be extended with a classifier: labeled data contribute reconstruction, KL-divergence, and classification losses, while unlabeled data are incorporated through pseudo-labels.
Applications
• Medical Imaging: Small labeled datasets, synthetic augmentation.
• Autonomous Driving: Unlabeled video frames.
Conditional VAEs
CVAEs guide generation by conditioning on a label y, e.g., generating a digit
“4” by conditioning on y = 4.
Reinforcement Learning: Key Components
Environment
Responds to agent’s actions, provides feedback, and can be deterministic or
stochastic.
State/Observation
Represents the environment’s current situation. States can be complete or par-
tial.
Action
Decisions made by the agent that affect the environment.
Reward
Scalar feedback indicating the quality of actions taken by the agent.
Policy
Strategy that defines how actions are selected.
Common Approaches to Balancing Exploration and Exploitation
• ϵ-greedy
• Decaying ϵ-greedy
• Softmax exploration
• Upper Confidence Bound (UCB)
• Thompson Sampling
• Intrinsic Motivation/Curiosity
State Spaces
The state space is the set of all possible states:
• Discrete: Finite number of states
• Continuous: Infinitely many possible states (e.g., real-valued positions or velocities)
Markov Property
The Markov property states that the future depends only on the current state:
• Forms the foundation of Markov Decision Processes (MDPs).
• Simplifies learning by discarding past history.
Reward Estimation
Each arm corresponds to an unknown reward distribution. We estimate the
expected reward using the sample mean:
Q_t(a) = \frac{1}{N_t(a)} \sum_{i=1}^{N_t(a)} R_i
where Qt (a) is the estimated value of arm a at time t, and Nt (a) is the number
of times arm a has been selected up to time t.
Greedy Algorithm
The greedy algorithm always selects the arm with the highest estimated reward:
a_t = \arg\max_a Q_t(a)
Epsilon-Greedy Algorithm
The Epsilon-Greedy algorithm balances exploration and exploitation:
• With probability ϵ, select an arm randomly (exploration).
• With probability 1−ϵ, select the arm with the highest Qt (a) (exploitation).
The exploration rate ϵ controls the trade-off.
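A compact sketch of the epsilon-greedy loop with incremental sample-mean estimates of Q_t(a); the true arm means are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.2, 0.5, 0.8]                    # unknown to the agent (toy values)
n_arms, n_steps, epsilon = len(true_means), 2000, 0.1

Q = np.zeros(n_arms)                            # estimated values Q_t(a)
N = np.zeros(n_arms)                            # pull counts N_t(a)

for _ in range(n_steps):
    if rng.random() < epsilon:
        a = int(rng.integers(n_arms))           # explore: random arm
    else:
        a = int(np.argmax(Q))                   # exploit: best estimate so far
    reward = rng.normal(loc=true_means[a], scale=0.1)
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]              # incremental sample-mean update

print(np.round(Q, 2), N)                        # the best arm should dominate the pulls
```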
Markov Decision Processes (MDPs)
An MDP is described by a tuple (S, A, P, R, \gamma): states, actions, transition probabilities, rewards, and a discount factor.
State-value function:
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s \right]
SARSA (On-policy):
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]