
End Module C Exam Handbook

Autoregressive Models
Understanding Autoregressive Models
Definition: Autoregressive (AR) models predict future values in a time series
based on a linear function of previous observations. They are widely used in
time-series forecasting, econometrics, and natural language processing.
Equation:

X_t = b + \sum_{i=1}^{p} w_i X_{t-i} + \epsilon_t    (1)

X_t = b + w_1 X_{t-1}    (AR(1))    (2)

X_t = b + w_1 X_{t-1} + w_2 X_{t-2}    (AR(2))    (3)

State Transition Matrix:

\begin{pmatrix} X_t \\ X_{t-1} \\ \vdots \\ X_{t-p+1} \end{pmatrix}
=
\begin{pmatrix}
w_1 & w_2 & \cdots & w_p \\
1 & 0 & \cdots & 0 \\
\vdots & \ddots & \ddots & \vdots \\
0 & \cdots & 1 & 0
\end{pmatrix}
\begin{pmatrix} X_{t-1} \\ X_{t-2} \\ \vdots \\ X_{t-p} \end{pmatrix}
+
\begin{pmatrix} b \\ 0 \\ \vdots \\ 0 \end{pmatrix}    (4)

Sequence Modelling Auto-Regressive Models


Definition: A time-series model where current values depend on past values.
Equation:

X_t = c + \sum_{i=1}^{p} \phi_i X_{t-i} + \epsilon_t    (5)

Special Case: AR(1):

X_t = \phi X_{t-1} + \epsilon_t    (6)

Markov Chain Connection:


• Transition probability: P(X_{t+1} \mid X_t)
• Can be represented as a transition matrix.
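
To make the AR(p) recursion concrete, here is a minimal NumPy sketch (the coefficients, noise scale, and series length are arbitrary illustration values) that simulates an AR(2) process and then recovers its coefficients by least squares:

import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(2) process: X_t = b + w1*X_{t-1} + w2*X_{t-2} + eps_t
b, w1, w2 = 0.5, 0.6, -0.3
n = 1000
x = np.zeros(n)
for t in range(2, n):
    x[t] = b + w1 * x[t - 1] + w2 * x[t - 2] + rng.normal(scale=0.1)

# Recover (b, w1, w2) by ordinary least squares on lagged values
X = np.column_stack([np.ones(n - 2), x[1:-1], x[:-2]])  # columns: [1, X_{t-1}, X_{t-2}]
y = x[2:]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated b, w1, w2:", coef)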

Binary Cross-Entropy Loss (BCELoss)
Definition
Binary Cross-Entropy (BCE) is a widely used loss function for binary classifi-
cation tasks. It measures the difference between two probability distributions
and is commonly used in logistic regression and neural networks.
Mathematical Formulation:

L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]    (7)

Properties of BCE Loss


• Penalizes incorrect predictions heavily.

• Ensures stable training with a small ϵ to avoid log(0) errors.


• Works well for probabilistic models in classification tasks.
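
A minimal NumPy sketch of equation (7), including the small ϵ clamp mentioned above (the value 1e-7 is an arbitrary choice for illustration):

import numpy as np

def bce_loss(y_true, y_pred, eps=1e-7):
    # Clamp predictions away from 0 and 1 to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.7, 0.4])
print(bce_loss(y_true, y_pred))  # confident-but-wrong predictions are penalized most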

NLP Pipeline
Natural Language Processing (NLP) involves preprocessing textual data before
feeding it into machine learning models. The key steps include:

Tokenization
• Splitting text into words or subwords.
• Helps in feature extraction for models.
• Example: "Hello, world!" → ["Hello", ",", "world", "!"]

Part-of-Speech (POS) Tagging


• Assigns grammatical roles to words (noun, verb, adjective, etc.).
• Used in syntactic parsing and named entity recognition.

• Example: "The cat sleeps" → [("The", DET), ("cat", NOUN), ("sleeps", VERB)]

Text Normalization
• Converts words into a standard format to improve model performance.
• Includes lowercasing, stemming, and lemmatization.

Lowercasing
• Converts text to lowercase to avoid treating "Hello" and "hello" as different words.

Stemming
• Reduces words to their root form by removing suffixes.

• Example: "running" → "run"

Lemmatization
• Converts words to their dictionary form.
• Example: "better" → "good"

Stopword and Punctuation Removal


• Removes commonly occurring words (e.g., "the", "is", "and") that do not add significant meaning.
• Helps reduce dimensionality and improve model performance.
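
A compact sketch of these preprocessing steps, assuming NLTK and its standard resources (punkt, stopwords, wordnet) are installed; any tokenizer/stemmer library with equivalent functionality would work the same way:

import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

text = "The cats are running faster than the dogs!"

tokens = nltk.word_tokenize(text.lower())                      # tokenization + lowercasing
tokens = [t for t in tokens
          if t not in stopwords.words("english")
          and t not in string.punctuation]                     # stopword/punctuation removal
stems  = [PorterStemmer().stem(t) for t in tokens]             # stemming
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]    # lemmatization

print(stems)   # stemmed tokens
print(lemmas)  # lemmatized tokens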

Understanding N-Grams
Definition: A sequence of N items from a given text.
Types:

• Unigram: {word1 , word2 , ...}


• Bigram: {word1 word2 , word2 word3 , ...}
• Trigram: {word1 word2 word3 , ...}

Probability Calculation:

P(w_n \mid w_{n-1}) = \frac{\text{count}(w_{n-1}, w_n)}{\text{count}(w_{n-1})}    (8)

P(w_n \mid w_{n-2}, w_{n-1}) = \frac{\text{count}(w_{n-2}, w_{n-1}, w_n)}{\text{count}(w_{n-2}, w_{n-1})}    (9)
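
A small sketch of equation (8) using plain Python counters (the toy corpus is made up for illustration):

from collections import Counter

corpus = "the cat sat on the mat the cat slept".split()

unigrams = Counter(corpus)
bigrams  = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    # P(word | prev) = count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("the", "cat"))  # 2 occurrences of "the cat" / 3 of "the" ≈ 0.67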

Recurrent Neural Networks (RNNs)
Definition
A Recurrent Neural Network (RNN) is a type of artificial neural network de-
signed to process sequential data. Unlike traditional feedforward networks,
RNNs use internal memory (hidden states) to capture dependencies in sequen-
tial inputs.
Basic Structure:
• Input Layer: Accepts sequential data.

• Hidden Layer with Loops: Stores past information.


• Output Layer: Produces predictions.
Mathematical Formulation:

h_t = \tanh(W_x x_t + W_h h_{t-1} + b)    (10)

o_t = W_o h_t    (11)
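
A minimal NumPy sketch of one recurrent step, following equations (10)-(11); the dimensions and random initialization are arbitrary illustration values:

import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 4, 8, 3

# Parameters (randomly initialized for the sketch)
W_x = rng.normal(size=(d_h, d_in))
W_h = rng.normal(size=(d_h, d_h))
b   = np.zeros(d_h)
W_o = rng.normal(size=(d_out, d_h))

def rnn_step(x_t, h_prev):
    h_t = np.tanh(W_x @ x_t + W_h @ h_prev + b)   # hidden state update, eq. (10)
    o_t = W_o @ h_t                               # output, eq. (11)
    return h_t, o_t

# Unroll over a sequence of 5 input vectors
h = np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):
    h, o = rnn_step(x_t, h)
print(o.shape)  # (3,)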

Types of RNNs:
• One-to-One: Standard feedforward network.
• One-to-Many: Single input, multiple outputs.

• Many-to-One: Multiple inputs, single output.


• Many-to-Many: Sequence-to-sequence tasks.
Backpropagation Through Time (BPTT):
• Used to train RNNs by unrolling through time and applying backpropa-
gation.

Challenges of RNNs
• Vanishing Gradient Problem: Gradients shrink exponentially over
time, making it difficult for the network to learn long-term dependencies.
• Exploding Gradient Problem: Gradients grow exponentially, leading
to unstable training.
• Limited Memory: Standard RNNs struggle with capturing long-range
dependencies in sequential data.

Long Short-Term Memory (LSTM)
Definition
LSTMs are an advanced type of RNN designed to address the vanishing gradient
problem by incorporating memory cells that selectively retain information over
long sequences.

LSTM Architecture
LSTM networks consist of three gates:
• Forget Gate: Controls what portion of the past information should be
discarded.
• Input Gate: Regulates what new information should be added to mem-
ory.

• Output Gate: Determines the final output based on the cell state.
Mathematical Formulation:

Forget Gate:

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)    (12)

Input Gate:

i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)    (13)
\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)    (14)

Cell State Update:

C_t = f_t * C_{t-1} + i_t * \tilde{C}_t    (15)

Output Gate:

o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)    (16)
h_t = o_t * \tanh(C_t)    (17)
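
A NumPy sketch of one LSTM cell step following equations (12)-(17); the shapes and random initialization are illustrative only:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
d_cat = d_h + d_in  # size of the concatenated [h_{t-1}, x_t] vector

# One weight matrix per gate / candidate, plus shared zero biases for the sketch
W_f, W_i, W_C, W_o = (rng.normal(size=(d_h, d_cat)) for _ in range(4))
b_f = b_i = b_C = b_o = np.zeros(d_h)

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate, eq. (12)
    i_t = sigmoid(W_i @ z + b_i)             # input gate, eq. (13)
    C_tilde = np.tanh(W_C @ z + b_C)         # candidate memory, eq. (14)
    C_t = f_t * C_prev + i_t * C_tilde       # cell state update, eq. (15)
    o_t = sigmoid(W_o @ z + b_o)             # output gate, eq. (16)
    h_t = o_t * np.tanh(C_t)                 # hidden state, eq. (17)
    return h_t, C_t

h, C = np.zeros(d_h), np.zeros(d_h)
h, C = lstm_step(rng.normal(size=d_in), h, C)
print(h.shape, C.shape)  # (8,) (8,)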

Gated Recurrent Units (GRUs)


Definition
GRUs are a variant of LSTMs that use gating mechanisms to control information
flow but require fewer parameters than LSTMs.

Mathematical Formulation
Reset Gate:

r_t = \sigma(W_r [h_{t-1}, x_t] + b_r)    (18)

Update Gate:

z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)    (19)

Candidate Hidden State:

\tilde{h}_t = \tanh(W_h [r_t * h_{t-1}, x_t] + b_h)    (20)

Final Hidden State:

h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t    (21)

Bidirectional RNNs
Definition
A Bidirectional RNN (BiRNN) processes input sequences in both forward and
backward directions to improve context understanding.
Mathematical Formulation:

h_t^{\text{forward}} = f(W_x x_t + U h_{t-1}^{\text{forward}})    (22)

h_t^{\text{backward}} = f(W_x x_t + U h_{t+1}^{\text{backward}})    (23)

h_t = [h_t^{\text{forward}}, h_t^{\text{backward}}]    (24)

Applications of RNNs, LSTMs, and GRUs


• Natural Language Processing (NLP): Machine translation, text gen-
eration, and sentiment analysis.
• Speech Recognition: Transcribing spoken language into text.

• Time-Series Forecasting: Predicting financial market trends and weather patterns.
• Bioinformatics: Analyzing DNA sequences and protein structures.

Attention Mechanism for NLP

[Figure 1: Input transformation used in a Transformer. The input matrix X is multiplied by weight matrices W^Q, W^K, and W^V to produce Query (Q), Key (K), and Value (V) matrices.]

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O

where

\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)

and projections are the parameter matrices:

W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}, \quad W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}, \quad W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}, \quad W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}

The attention mechanism is defined as:

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V
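
A NumPy sketch of the scaled dot-product attention defined above; matrix sizes are arbitrary illustration values:

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)        # each query's weights sum to 1
    return weights @ V                        # weighted sum of the value vectors

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 16, 32
Q, K = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
print(attention(Q, K, V).shape)  # (5, 32)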

Sentiment Classification
Definition:
Sentiment classification is the task of identifying the emotional tone behind a
body of text, typically categorized as positive, negative, or neutral.
Goal:
To automatically determine the sentiment expressed in text using computational
methods.
Applications:

• Product reviews
• Social media monitoring
• Customer feedback analysis

Approaches:
1. Rule-based: Uses manually defined rules and sentiment lexicons (e.g.,
SentiWordNet).
2. Machine Learning: Trains models (e.g., Naive Bayes, SVM) on labeled
data.
3. Deep Learning: Uses neural networks (e.g., RNNs, LSTMs, Transform-
ers) for better context understanding.

Steps Involved:
• Text Preprocessing: Tokenization, stopword removal, stemming/lemmatization
• Feature Extraction: Bag of Words, TF-IDF, word embeddings

• Model Training: Using labeled data to train classifiers


• Prediction: Classify unseen text into sentiment categories

Popular Tools/Libraries:

• NLTK
• TextBlob
• Scikit-learn

• Hugging Face Transformers
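
As a minimal illustration of the machine-learning approach, here is a sketch using scikit-learn with TF-IDF features and logistic regression (the tiny dataset is invented purely for demonstration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled data: 1 = positive, 0 = negative
texts  = ["great product, works perfectly", "terrible, broke after a day",
          "absolutely love it", "waste of money, very disappointed"]
labels = [1, 0, 1, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["really love this, great value"]))  # expected: [1]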

Word Embeddings
Definition:
Word embeddings are dense vector representations of words in a continuous
vector space where semantically similar words are mapped close together.
Purpose:
To capture syntactic and semantic meaning of words for use in machine learning
models.
Characteristics:

• Low-dimensional (e.g., 100–300 dimensions)


• Real-valued vectors
• Context-based representations

Popular Techniques:
• Word2Vec: Uses two architectures:
– CBOW (Continuous Bag of Words): Predicts the current word
from surrounding context.
– Skip-gram: Predicts surrounding context words given the current
(center) word.
• GloVe (Global Vectors): Combines global word co-occurrence statis-
tics.

• FastText: Improves Word2Vec by using subword information.


• BERT Embeddings: Contextual embeddings generated using trans-
former models.

Advantages:
• Captures semantic similarity (e.g., king − man + woman ≈ queen)
• Improves performance in NLP tasks

Applications:
• Text classification
• Machine translation

• Question answering
• Named entity recognition (NER)
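
A brief sketch of training Word2Vec embeddings with Gensim (assuming gensim 4.x; the toy corpus and hyperparameters are illustrative only):

from gensim.models import Word2Vec

# Each sentence is a list of tokens; a real corpus would be much larger
sentences = [["the", "king", "rules", "the", "kingdom"],
             ["the", "queen", "rules", "the", "kingdom"],
             ["the", "cat", "sat", "on", "the", "mat"]]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)  # sg=1 -> skip-gram

vec = model.wv["queen"]                       # 50-dimensional dense vector
print(model.wv.most_similar("king", topn=3))  # nearest neighbours in embedding space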

BLEU Score (Bilingual Evaluation Understudy)
Definition:
BLEU is an automatic evaluation metric for comparing machine-generated text
(e.g., translations) against one or more reference texts using n-gram precision.
Purpose:
To measure the quality of machine translation by checking how many n-grams
in the candidate sentence appear in the reference sentences.
Formula:

\text{BLEU} = \text{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right)

Where:

• p_n = modified precision for n-grams (usually n = 1 to 4)
• w_n = weight for each n-gram order (commonly w_n = 1/N)
• BP = brevity penalty to penalize short candidates

Brevity Penalty:

\text{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{(1 - r/c)} & \text{if } c \le r
\end{cases}

• c = length of candidate translation

• r = effective reference length
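
A quick sketch of computing a sentence-level BLEU score with NLTK (assuming nltk is installed; smoothing is added because short toy sentences otherwise give zero higher-order precisions):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]   # list of reference token lists
candidate =  ["the", "cat", "sat", "on", "the", "mat"]

score = sentence_bleu(reference, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),   # uniform w_n for n = 1..4
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))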

Machine Translation (MT)
Definition:
Machine Translation is the task of automatically translating text from one lan-
guage to another using computational models.
Evolution:
• Statistical MT: Early systems based on word/phrase probabilities.
• Seq2Seq Models: Encoder-decoder neural networks using RNNs.
• Attention Mechanism: Solves long-range dependency issues in Seq2Seq.
• Transformers: Fully attention-based models, state-of-the-art in MT.

Preprocessing Pipeline:
• Tokenization: Splits text into meaningful units using language-specific
tokenizers.
• Vocabulary Building: Maps words to indices; includes special tokens
like <sos>, <eos>, <pad>, and <unk>.
• Numericalization: Converts token sequences into index sequences for
model input.
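
A small sketch of the vocabulary-building and numericalization steps (the special-token indices and toy sentence are illustrative choices, not a fixed convention):

# Build a vocabulary with special tokens, then numericalize a tokenized sentence
specials = ["<pad>", "<sos>", "<eos>", "<unk>"]
corpus_tokens = ["the", "cat", "sat", "on", "the", "mat"]

vocab = {tok: idx for idx, tok in enumerate(specials)}
for tok in corpus_tokens:
    vocab.setdefault(tok, len(vocab))

def numericalize(tokens):
    unk = vocab["<unk>"]
    return [vocab["<sos>"]] + [vocab.get(t, unk) for t in tokens] + [vocab["<eos>"]]

print(vocab)
print(numericalize(["the", "dog", "sat"]))  # "dog" is out-of-vocabulary -> <unk> index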

Introduction to Transformers
Transformers revolutionized machine translation by overcoming the limitations
of RNNs:
• RNN Issues: Struggles with long-range dependencies, poor paralleliza-
tion, and fading context.
• Transformer Strengths: Self-attention allows direct modeling of de-
pendencies, facilitates parallel computation, and maintains context across
sentences.

Positional Encoding
Transformers are permutation-invariant. Positional encodings are added to to-
ken embeddings to provide order information.
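
One common choice, used in the original Transformer paper, is sinusoidal positional encoding; here is a NumPy sketch of it, with illustrative dimensions:

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]           # (seq_len, 1)
    i   = np.arange(d_model // 2)[None, :]      # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=64)
print(pe.shape)  # (50, 64) -- added elementwise to the token embeddings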

Encoder
The encoder processes input sequences and creates contextualized embeddings.
It consists of the following components:
• Input embedding

• Positional encoding
• Multi-head self-attention
• Feed-forward network
• Layer normalization + residual connections

Decoder
The decoder generates the output sequence by attending to both previous tokens
and the encoder’s output. It consists of the following components:
• Output embedding
• Positional encoding
• Masked multi-head self-attention
• Encoder-decoder attention
• Feed-forward network
• Linear + softmax layer

Complete Transformer Model


The Transformer model consists of:
• Encoder: Processes the source sequence in parallel.
• Decoder: Generates the target sequence, attending to the encoder’s out-
put.
• Training: Uses teacher forcing with a shifted version of the target se-
quence.

Semi-Supervised Learning (SSL)


Introduction
Semi-Supervised Learning (SSL) combines a small amount of labeled data with
a large pool of unlabeled data. It strikes a balance between supervised learning
(using labeled data) and unsupervised learning (using only unlabeled data),
leveraging both to improve model performance.

Motivation
Labeled data is costly to acquire, while unlabeled data is abundant. SSL utilizes
unlabeled data to enhance learning, reducing reliance on labeled data.

Mathematical Formulation
Let:
• D_L = \{(x_i, y_i)\}_{i=1}^{l}: Labeled data
• D_U = \{x_i\}_{i=l+1}^{l+u}: Unlabeled data

The objective is to minimize:

L(f) = L_{\text{supervised}}(f; D_L) + L_{\text{unsupervised}}(f; D_U)

Importance
SSL reduces the need for large labeled datasets, making it particularly useful in
domains with limited labeled data but abundant unlabeled data.

Key Assumptions in SSL


• Self-Training: Confident predictions on unlabeled data are added to the
training set.
• Co-Training: Two classifiers, trained on different views, teach each
other.
• Cluster Assumption: Points in the same cluster likely share the same
label.
• Manifold Assumption: Data points lie on a low-dimensional manifold.

Inductive vs Transductive SSL


• Inductive SSL: Learns a general classifier applicable to unseen data.
• Transductive SSL: Focuses on labeling a fixed set of unlabeled data.

Ladder Networks and Π-Models (Pi-Models)


Ladder Networks
A Ladder Network combines supervised and unsupervised learning by denoising
internal representations:
• Encoder: Adds noise to the input and processes it.
• Decoder: Recovers clean activations for each layer.

• Skip Connections: Form a "ladder" structure between encoder and decoder.

Loss function:

L = L_{\text{supervised}} + \sum_{l} \lambda_l \, L_{\text{reconstruction}}(l)

Where \lambda_l is a weight for each layer's reconstruction loss.

Pi-Model
The Pi-Model enforces consistency regularization by ensuring stable predictions
under input perturbations:
• Input x is passed through the network twice with different augmenta-
tions/noise.

• Two outputs: f_1(x + \epsilon_1) and f_2(x + \epsilon_2).

Loss function:

L_{\text{unsupervised}} = \| f_1(x) - f_2(x) \|^2

This encourages predictions to remain consistent under small input changes.
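
A toy sketch of the consistency term, using NumPy and Gaussian input noise as the perturbation (the "network" here is a stand-in function, purely for illustration):

import numpy as np

rng = np.random.default_rng(0)

def network(x):
    # Stand-in for a real model: any deterministic mapping works for the sketch
    return np.tanh(x @ np.ones((x.shape[-1], 3)))

def pi_model_consistency_loss(x, noise_std=0.1):
    # Two forward passes under different perturbations of the same input
    out1 = network(x + rng.normal(scale=noise_std, size=x.shape))
    out2 = network(x + rng.normal(scale=noise_std, size=x.shape))
    return np.mean((out1 - out2) ** 2)   # mean squared difference between predictions

x_unlabeled = rng.normal(size=(16, 5))   # a batch of unlabeled inputs
print(pi_model_consistency_loss(x_unlabeled))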

Variational Autoencoders (VAEs)


Introduction
VAEs model latent representations as probability distributions, enabling data
generation by sampling from the learned distribution.

Intuition
The encoder compresses an input into a latent variable z, treated as a random
variable. This enables diverse outputs from the same input.

VAE Loss
L = \mathbb{E}_{q(z|x)}[\log p(x|z)] - \text{KL}(q(z|x) \,\|\, p(z))
Consists of:

• Reconstruction Loss
• KL Divergence

Why Use a Distribution?


Modeling z as a distribution allows infinite variants of data generation.

Diffusion Models
Diffusion models reverse noise addition across steps for sharper outputs, unlike
VAEs’ single-step approach.

VAEs in SSL
Extended with a classifier, using reconstruction, KL divergence, and classifica-
tion losses for labeled data, and pseudo-labels for unlabeled data.

Applications
• Medical Imaging: Small labeled datasets, synthetic augmentation.
• Autonomous Driving: Unlabeled video frames.

Conditional VAEs
CVAEs guide generation by conditioning on a label y, e.g., generating a digit
“4” by conditioning on y = 4.

Introduction to Reinforcement Learning


Definition and Basic Concepts
Reinforcement Learning (RL) is a machine learning paradigm where an agent
learns to make decisions through interaction with an environment to maximize
rewards.

Comparison with Other Learning Paradigms

Key Components of Reinforcement Learning


Agent
Learns and makes decisions to maximize cumulative reward.

Environment
Responds to agent’s actions, provides feedback, and can be deterministic or
stochastic.

State/Observation
Represents the environment’s current situation. States can be complete or par-
tial.

Action
Decisions made by the agent that affect the environment.

Reward
Scalar feedback indicating the quality of actions taken by the agent.

Policy
Strategy that defines how actions are selected.

Exploration vs. Exploitation Dilemma


Balancing between exploring new actions and exploiting known good actions.

Common Approaches
• ϵ-greedy
• Decaying ϵ-greedy

• Softmax exploration
• Upper Confidence Bound (UCB)
• Thompson Sampling

• Intrinsic Motivation/Curiosity

Types of Reinforcement Learning Algorithms


• Model-based vs. Model-free

• Value-based vs. Policy-based


• On-policy vs. Off-policy
• Deterministic vs. Stochastic Policies

Concepts in Reinforcement Learning


Episodes
An episode is a sequence from an initial state to a terminal state:
• Finite in length, with a defined start and endpoint.

State Spaces
The state space is the set of all possible states:
• Discrete: Finite number of states

• Continuous: Infinite number of states.


• Fully Observable: Complete state visible to the agent.
• Partially Observable: Only partial state visible to the agent.

Markov Property
The Markov property states that the future depends only on the current state:
• Forms the foundation of Markov Decision Processes (MDPs).
• Simplifies learning by discarding past history.

Challenges in Reinforcement Learning


Delayed Rewards
Challenges with delayed rewards:
• Credit Assignment Problem: Determining which actions led to the
final reward.
• Approaches: Temporal Difference Learning, Eligibility Traces, Reward
Shaping, Hierarchical RL.

Continuous vs. Discrete Action Spaces


• Discrete Actions: Finite set (e.g., Q-learning).
• Continuous Actions: Infinite set (e.g., DDPG, SAC, PPO).

Multi-Armed Bandit Problem (MAB)


Problem Overview
The goal in the multi-armed bandit (MAB) problem is to maximize the expected
total reward by selecting actions (arms) over time, while facing uncertainty in
reward distributions. We aim to learn which actions yield better rewards.

Reward Estimation
Each arm corresponds to an unknown reward distribution. We estimate the
expected reward using the sample mean:
Q_t(a) = \frac{1}{N_t(a)} \sum_{i=1}^{N_t(a)} R_i

where Qt (a) is the estimated value of arm a at time t, and Nt (a) is the number
of times arm a has been selected up to time t.

Incremental Update Rule


To avoid storing all past rewards, we update the estimated value incrementally:
Q_n = Q_{n-1} + \frac{1}{n} (R_n - Q_{n-1})
where Qn is the updated estimate after observing reward Rn at step n.

Greedy Algorithm
The greedy algorithm selects the arm with the highest estimated reward:

\text{Action} = \arg\max_{a} Q_t(a)

However, it may converge prematurely to suboptimal arms and avoid exploring.

Epsilon-Greedy Algorithm
The Epsilon-Greedy algorithm balances exploration and exploitation:
• With probability ϵ, select an arm randomly (exploration).

• With probability 1−ϵ, select the arm with the highest Qt (a) (exploitation).
The exploration rate ϵ controls the trade-off.

Upper Confidence Bound (UCB)


The UCB algorithm explores uncertain arms more frequently. The UCB1 formula is:

\text{UCB}_t(a) = Q_t(a) + c \cdot \sqrt{\frac{\ln t}{N_t(a)}}
where c is the exploration coefficient. Arms with higher uncertainty are explored
earlier.
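
A compact sketch of an ε-greedy bandit using the incremental update rule above (the Bernoulli arm probabilities, ε, and horizon are invented for the example):

import numpy as np

rng = np.random.default_rng(0)
true_probs = [0.2, 0.5, 0.8]           # unknown Bernoulli reward probability per arm
Q = np.zeros(len(true_probs))          # estimated value Q_t(a)
N = np.zeros(len(true_probs), dtype=int)
epsilon, steps = 0.1, 5000

for t in range(steps):
    if rng.random() < epsilon:                 # explore with probability epsilon
        a = rng.integers(len(true_probs))
    else:                                      # exploit the current best estimate
        a = int(np.argmax(Q))
    r = float(rng.random() < true_probs[a])    # sample a Bernoulli reward
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]                  # incremental update: Q_n = Q_{n-1} + (R_n - Q_{n-1})/n

print(np.round(Q, 2))  # estimates should approach [0.2, 0.5, 0.8]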

Markov Decision Processes (MDPs)
State-value function:

V^\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s]

Expected return starting from state s under policy π.


Action-value function:

Q^\pi(s, a) = \mathbb{E}_\pi[G_t \mid s_t = s, a_t = a]

Expected return from state s, taking action a, then following π.

Solving MDPs with Dynamic Programming


Bellman Expectation Equation (Policy Evaluation):

V^\pi(s) = \sum_{a} \pi(a|s) \sum_{s'} P(s'|s, a) \left[ R + \gamma V^\pi(s') \right]

Iteratively update value estimates using current policy.

Value Iteration:

V(s) = \max_{a} \sum_{s'} P(s'|s, a) \left[ R + \gamma V(s') \right]

Optimal value found by repeatedly applying Bellman optimality.
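
A toy sketch of value iteration on a made-up 3-state MDP (the transition and reward numbers are random, invented only to make the update loop concrete):

import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9

# P[s, a, s'] = transition probability, R[s, a] = immediate reward (toy values)
rng = np.random.default_rng(0)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)          # normalize so each (s, a) row is a distribution
R = rng.random((n_states, n_actions))

V = np.zeros(n_states)
for _ in range(200):
    # Q[s, a] = R[s, a] + gamma * sum_{s'} P[s, a, s'] * V[s']
    Q = R + gamma * (P @ V)
    V_new = Q.max(axis=1)                  # Bellman optimality backup
    if np.max(np.abs(V_new - V)) < 1e-8:   # stop when the values have converged
        break
    V = V_new

print(np.round(V, 3))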

Monte Carlo Methods


MC Value Estimation:

V(s) = \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_i

Average return over episodes starting from s.


Learn from complete episodes without model of environment.

Temporal Difference (TD) and Deep Reinforcement Learning
TD(0) Update:

V(s) \leftarrow V(s) + \alpha [r + \gamma V(s') - V(s)]

Update value using immediate reward and next state’s estimate.

SARSA (On-policy):

Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma Q(s', a') - Q(s, a)]

Learn action values while following the same policy.


Q-Learning (Off-policy):

Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma \max_{a'} Q(s', a') - Q(s, a)]

Learn optimal policy regardless of the behavior policy.
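
A minimal tabular Q-learning sketch on a toy 5-state chain environment (the environment, hyperparameters, and reward scheme are invented for illustration; the agent starts in the middle and only the rightmost state pays a reward):

import numpy as np

n_states, n_actions = 5, 2          # a 5-state chain; actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.3
alpha, gamma, epsilon = 0.1, 0.9, 0.3
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def step(s, a):
    # Deterministic chain: the episode ends at either end; only the right end pays +1
    s_next = s + 1 if a == 1 else s - 1
    done = s_next in (0, n_states - 1)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward, done

for episode in range(2000):
    s, done = 2, False                     # every episode starts in the middle state
    while not done:
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        # Off-policy TD target: bootstrap with the max over next-state actions
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.argmax(Q[1:4], axis=1))  # greedy action in the interior states; expected [1 1 1] (move right)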


DQN Loss Function:

L(\theta) = \left( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \right)^2

Neural network predicts Q-values; loss guides learning.

