
BCS602 | MACHINE LEARNING| SEARCH CREATORS.

Module-5

Chapter – 01 - Clustering Algorithms

Introduction to Clustering Approaches

 Cluster analysis is the fundamental task of unsupervised learning. Unsupervised learning involves exploring the given dataset.

 Cluster analysis is a technique of partitioning a collection of unlabelled objects, each with many attributes, into meaningful disjoint groups or clusters.

 This is done using a trial-and-error approach, as there are no supervisors available as in classification.

 The characteristic of clustering is that the objects within a cluster are similar to each other, while differing significantly from the objects in other clusters.

 The input for cluster analysis is examples or samples, also known as objects, data points or data instances. All these terms are used interchangeably in this chapter. Samples or objects with no labels associated with them are called unlabelled.

 The output is the set of clusters (or groups) of similar data, if such structure exists in the input.

 For example, Figure 13.1(a) shows data points or samples with two features, drawn as differently shaded samples, and Figure 13.1(b) shows a manually drawn ellipse indicating the clusters formed.


Visual identification of clusters in this case is easy as the examples have only two
features.

But, when examples have more features, say 100, then clustering cannot be done
manually and automatic clustering algorithms are required.

Also, automating the clustering process is desirable, as clustering such data manually is difficult for humans and often impossible. All clusters are represented by centroids.

Example: If the input samples are (3, 3), (2, 6) and (7, 9), then the centroid is the component-wise mean: ((3 + 2 + 7)/3, (3 + 6 + 9)/3) = (4, 6).

The clusters should not overlap and every cluster should represent only one class. Therefore, clustering algorithms use a trial-and-error method to form clusters that can be converted to labels.

Difference between Clustering & Classification


Applications of Clustering

Challenges of Clustering Algorithms

High-Dimensional Data

o As the number of features increases, clustering becomes difficult.

Scalability Issue

o Some algorithms perform well for small datasets but fail for large-scale data.

Unit Inconsistency

o Different measurement units (e.g., kg vs. pounds) can create problems.

Proximity Measure Design

o Choosing an appropriate distance metric is crucial for accurate clustering.


Advantages and Disadvantages of Clustering Algorithms

Proximity Measures

Proximity measures determine similarity or dissimilarity among objects.

Distance measures (dissimilarity) indicate how different objects are.

Similarity measures indicate how alike objects are.

Inverse relationship: more distance → less similarity, and vice versa.

Properties of Distance Measures (Metric Conditions)

A distance measure d is a metric if it satisfies:

 Non-negativity: d(x, y) ≥ 0
 Identity: d(x, y) = 0 if and only if x = y
 Symmetry: d(x, y) = d(y, x)
 Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z)


Types of Distance Measures Based on Data Types

Quantitative Variables

For quantitative (numeric) attributes, commonly used distance measures include Euclidean distance, Manhattan (city-block) distance, and the more general Minkowski distance.


Binary Attributes

Categorical Variables

Distance is 1 if different, 0 if same.

Example: Gender (Male, Female) → Distance = 1


Ordinal Variables

Vector-Based Distance Measures (For Text & Documents)

Cosine Similarity

o Measures the angle between two vectors; vectors pointing in similar directions are treated as similar.

o Formula: cos θ = (x · y) / (||x|| · ||y||), where x · y is the dot product and ||x|| is the vector norm.
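As a quick illustration, the cosine similarity formula can be computed with NumPy; the vectors x and y below are made-up examples, not data from the text:

import numpy as np

def cosine_similarity(x, y):
    # cos(theta) = (x . y) / (||x|| * ||y||)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Illustrative term-count vectors for two short documents
x = np.array([1.0, 2.0, 0.0, 3.0])
y = np.array([2.0, 1.0, 0.0, 3.0])

print(cosine_similarity(x, y))        # similarity close to 1 means very alike
print(1.0 - cosine_similarity(x, y))  # cosine distance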


Distance Measures

Hierarchical Clustering Algorithms

Overview

 Produces a nested partition of objects with hierarchical relationships.


 Represented using a dendrogram.
 Two main categories: Agglomerative and Divisive methods.

Types of Hierarchical Clustering

1. Agglomerative Methods (Bottom-Up)


o Each sample starts as an individual cluster.
o Clusters are merged iteratively until one cluster remains.
o Once a cluster is formed, it cannot be undone (irreversible).
2. Divisive Methods (Top-Down)
o Starts with a single cluster containing all data points.
o Splits iteratively into smaller clusters.
o Continues until each sample becomes its own cluster.


Agglomerative Clustering Techniques

Single Linkage (MIN Algorithm)

o Merges clusters based on the smallest distance between two points from different
clusters.
o Related to the Minimum Spanning Tree (MST).
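A minimal sketch of single-linkage (MIN) clustering using SciPy's hierarchical clustering routines; the small 2-D array X is an illustrative assumption:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative 2-D samples forming two loose groups
X = np.array([[1.0, 1.0], [1.5, 1.0], [1.0, 1.5],
              [8.0, 8.0], [8.5, 8.0], [8.0, 8.5]])

# method='single' merges the pair of clusters with the smallest
# point-to-point (MIN) distance at every step
Z = linkage(X, method='single')

# Cut the resulting dendrogram into 2 flat clusters
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)   # e.g., [1 1 1 2 2 2]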

Complete Linkage (MAX or Clique Algorithm)


Average Linkage Algorithm

Mean-Shift Clustering Algorithm

 Non-parametric and hierarchical clustering technique.


 Also known as mode-seeking or sliding window algorithm.
 No prior knowledge of cluster count or shape required.
 Moves towards high-density regions in data using a kernel function (e.g., Gaussian
window).
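A short scikit-learn sketch of mean-shift clustering; the generated data and the bandwidth estimation settings are illustrative assumptions:

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.default_rng(0)
# Illustrative 2-D data with two dense regions
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(6, 1, size=(50, 2))])

# Bandwidth is the only parameter; here it is estimated from the data
bw = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bw).fit(X)

print(ms.cluster_centers_)     # modes found by the sliding windows
print(np.unique(ms.labels_))   # number of clusters is discovered, not specified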


Advantages of Mean-Shift Clustering

 No model assumptions

 Suitable for all non-convex cluster shapes

 Only one parameter, the window bandwidth, is required

 Robust to noise

 No issues of local minima or premature termination

Disadvantages of Mean-Shift Clustering

 Selecting the bandwidth is challenging: if it is too large, many clusters are missed; if it is too small, many points are missed and convergence becomes a problem.

 The number of clusters cannot be specified; the user has no control over this parameter.

Partitional Clustering Algorithm

 k-means is a widely used partitional clustering algorithm.


 The user specifies k, the number of clusters.
 Assumes non-overlapping clusters.
 Works well for circular or spherical clusters.

Process of k-means Algorithm

1. Initialization
o Select k initial cluster centers (randomly or using prior knowledge).
o Normalize data for better performance.
2. Assignment of Data Points
o Assign each data point to the nearest centroid based on Euclidean distance.
3. Update Centroids


o Compute the mean vector of assigned points to update cluster centroids.


o Repeat this process until no changes occur in cluster assignments.
4. Termination
o The process stops when cluster assignments remain unchanged.
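The four steps above can be sketched in plain NumPy as follows; the toy data, k = 2, and the random initialization scheme are assumptions for illustration:

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct samples as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assignment: each point goes to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: new centroid = mean vector of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # 4. Termination: stop when the centroids (assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(30, 2)), rng.normal(5, 1, size=(30, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)   # roughly (0, 0) and (5, 5)

In practice, library implementations such as scikit-learn's KMeans add refinements on top of these steps, for example multiple random restarts (n_init) and smarter k-means++ initialization.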

Mathematical Optimization

k-means minimizes the within-cluster sum of squared errors (SSE), also called the WCSS:

J = Σ_i Σ_{x ∈ Ci} ||x − μi||², where μi is the centroid (mean vector) of cluster Ci.

Advantages

1. Simple and easy to implement.


2. Efficient for small to medium datasets.


Disadvantages

1. Sensitive to initialization – different initial points may lead to different results.


2. Time-consuming for large datasets – requires multiple iterations.

Choosing the Value of k

 No fixed rule for selecting k.


 Use Elbow Method:
o Run k-means with different values of k.
o Plot Within Cluster Sum of Squares (WCSS) vs. k.
o The optimal k is at the "elbow" where the curve flattens.
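A brief sketch of the elbow method using scikit-learn's KMeans, whose inertia_ attribute is the WCSS; the data and the range of k values are illustrative:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.8, size=(40, 2)) for c in (0.0, 4.0, 8.0)])

wcss = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)          # within-cluster sum of squares (WCSS)

for k, w in zip(range(1, 9), wcss):
    print(k, round(w, 1))             # look for the "elbow" where the drop flattens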

Computational Complexity

O(nkId), where:

o n = number of samples
o k = number of clusters
o I = number of iterations
o d = number of attributes

Density-based Methods

 DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm.
 Clusters are dense regions of data points separated by areas of low density (noise).
 Works well for arbitrary-shaped clusters and datasets with noise.


Uses two parameters:

1. ε (epsilon) – Neighborhood radius.


2. m (minPts) – Minimum number of points within ε to form a cluster.

Types of Points in DBSCAN

1. Core Point
o A point with at least m points in its ε-neighborhood.
2. Border Point
o Has fewer than m points in its ε-neighborhood but is adjacent to a core point.
3. Noise Point (Outlier)
o Neither a core point nor a border point.
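A minimal scikit-learn DBSCAN sketch; eps and min_samples correspond to ε and m above, and the generated data is an illustrative assumption:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus a few scattered noise points
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(5, 0.3, size=(50, 2)),
               rng.uniform(-3, 8, size=(5, 2))])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)

print(np.unique(db.labels_))           # cluster ids; -1 marks noise (outlier) points
print(len(db.core_sample_indices_))    # how many core points were found

Points labelled -1 are the noise points, indices in core_sample_indices_ are the core points, and the remaining clustered points are border points.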

Density Connectivity Measures

1. Direct Density Reachability


o Point X is directly reachable from Y if:
 X is in the ε-neighborhood of Y.
 Y is a core point.
2. Densely Reachable


o X is densely reachable from Y if there exists a chain of core points linking them.
3. Density Connected
o X and Y are density connected if they are both densely reachable from a common core point Z.

Advantages of DBSCAN

1. Can detect arbitrary-shaped clusters.


2. Robust to noise and outliers.
3. Does not require specifying the number of clusters in advance (unlike k-means).

Disadvantages of DBSCAN

1. Sensitive to ε and m parameters – Poor parameter choice can affect results.


2. Fails in datasets with varying density – A single ε may not work for all clusters.
3. Computationally expensive for high-dimensional data.

Grid-based Approach

 Grid-based clustering partitions space into a grid structure and fits data into cells for
clustering.
 Suitable for high-dimensional data.
 Uses subspace clustering, dense cells, and monotonicity property.

Concepts

Subspace Clustering

o Clusters are formed using a subset of features (dimensions) rather than all
attributes.
o Useful for high-dimensional data like gene expression analysis.


o CLIQUE (Clustering in Quest) is a widely used grid-based subspace clustering algorithm.

Concept of Dense Cells

o CLIQUE partitions dimensions into intervals (cells).


o A cell is dense if its data point density exceeds a threshold.
o Dense cells are merged to form clusters.

Monotonicity Property


o Uses anti-monotonicity (Apriori property):


 If a k-dimensional cell is dense, then all (k-1) dimensional projections
must also be dense.
 If a lower-dimensional cell is not dense, then higher-dimensional cells
containing it are also not dense.
o Similar to association rule mining in frequent pattern mining.

Advantages of CLIQUE

1. Insensitive to input order of objects.


2. No assumptions about data distribution.
3. Finds high-density clusters in subspaces of high-dimensional data.

Disadvantages of CLIQUE

 Tuning grid parameters (grid size, density threshold) is difficult.


 Finding the optimal threshold to classify a cell as dense is challenging.


Chapter – 02 - Reinforcement Learning

Overview of Reinforcement Learning

What is Reinforcement Learning?

 Reinforcement Learning (RL) is a machine learning paradigm that mimics how humans and animals learn through experience.
 Humans interact with the environment, receive feedback (rewards or penalties),
and adjust their behavior accordingly.
 Example: A child touching fire learns to avoid it after experiencing pain (negative
reinforcement).

How RL Works in Machines

 RL simulates real-world scenarios for a computer program (agent) to learn by trial and error.

 The agent executes actions, receives positive or negative rewards, and optimizes its future actions based on these experiences.

Types of Reinforcement Learning

1. Positive Reinforcement Learning


o Rewards encourage good behavior (reinforce correct actions).
o Example: A robot gets +10 points for reaching a goal successfully.
o Effect: Increases the likelihood of repeating the rewarded action.
2. Negative Reinforcement Learning
o Negative rewards discourage unwanted actions.
o Example: A game agent loses -10 points for stepping into a danger zone.
o Effect: Helps the agent learn to avoid negative outcomes.


Characteristics of RL

 Sequential Decision-Making: The agent makes a series of decisions to maximize total rewards.

 Trial and Error Learning: The agent learns by exploring different actions and their consequences.

 No Supervised Labels: Unlike supervised learning, RL does not require labeled data; it learns from experience.

Applications of Reinforcement Learning

 Robotics: Teaching robots to walk, grasp objects, or perform complex tasks.


 Gaming: AI agents in chess, Go, and video games (e.g., AlphaGo, OpenAI Five).
 Autonomous Vehicles: Self-driving cars learn optimal driving strategies.
 Finance: AI-based trading strategies for stock markets.
 Healthcare: Personalized treatment plans based on patient responses.

Scope of Reinforcement Learning

Reinforcement Learning (RL) is well-suited for decision-making problems in dynamic and uncertain environments. It excels in cases where an agent must learn through trial and error and optimize its actions based on delayed rewards.

Situations Where RL Can Be Used

Pathfinding and Navigation

o Consider a grid-based game where a robot must navigate from a starting node (E) to
a goal node (G) by choosing the optimal path.
o RL can learn the best route by exploring different paths and receiving feedback on
their efficiency.


o In obstacle-based games, RL can identify safe paths while avoiding dangerous zones.

Dynamic Decision-Making with Uncertainty

o RL is useful in environments where not all information is known upfront.


o It is not suitable for tasks like object detection, where a classifier with complete
labeled data performs better.

Characteristics of Reinforcement Learning

1. Sequential Decision-Making
o In RL, decisions are made in steps, and each step influences future choices.
o Example: In a maze game, a wrong turn can lead to failure.
2. Delayed Feedback
o Rewards are not always immediate; the agent may need to take multiple steps
before receiving feedback.
3. Interdependence of Actions
o Each action affects the next set of choices, meaning an incorrect move can
have long-term consequences.
4. Time-Related Decisions
o Actions are taken in a specific sequence over time, affecting the final
outcome.

Challenges in Reinforcement Learning

Reward Design

o Setting the right reward values is crucial. Incorrectly designed rewards may lead the
agent to learn undesired behavior.

Absence of a Fixed Model


o Some environments, like chess, have fixed rules, but many real-world problems lack
predefined models.
o Example: Training a self-driving car requires simulations to generate experiences.

Partial Observability

o Some environments, like weather prediction, involve uncertainty because complete state information is unavailable.

High Computational Complexity

o Games like Go involve a huge state space, making RL training time-consuming.


o More possible actions → More training time needed.

Applications of Reinforcement Learning

1. Industrial Automation
o Optimizing robot movements for efficiency.
2. Resource Management
o Allocating resources in data centers and cloud computing.
3. Traffic Light Control
o Reducing congestion by adjusting signal timings dynamically.
4. Personalized Recommendation Systems
o Used in news feeds, e-commerce, and streaming services (e.g., Netflix
recommendations).
5. Online Advertisement Bidding
o Optimizing ad placements for maximum engagement.
6. Autonomous Vehicles
o RL helps in training self-driving cars to navigate safely.
7. Game AI (Chess, Go, Dota 2, etc.)
o AI models like AlphaGo use RL to master complex games.
8. DeepMind Applications


o AI systems that generate programs and images, and optimize machine learning models.

Reinforcement Learning as Machine Learning

Reinforcement Learning (RL) is a distinct branch of machine learning that differs significantly from supervised learning.

While supervised learning depends on labeled data, reinforcement learning learns through interaction with the environment, making decisions based on trial and error.

Why RL Is Necessary?

Some tasks cannot be solved using supervised learning due to the absence of a labeled
training dataset. For example:

 Chess & Go: There is no dataset with all possible game moves and their
outcomes. RL allows the agent to explore and improve over time.
 Autonomous Driving: The car must learn from real-world experiences rather than
relying on a fixed dataset.

Challenges in Reinforcement Learning Compared to Supervised Learning

 More complex decision-making since every action affects future outcomes.


 Longer training times due to trial-and-error learning.
 Delayed rewards, making it difficult to attribute success or failure to a specific
action.


Differences between Supervised Learning and Reinforcement Learning

Components of Reinforcement Learning

Reinforcement Learning (RL) is based on an agent interacting with an environment to learn an optimal strategy through trial and error.

Basic Components of RL


1. Agent – The decision-maker (e.g., a robot, self-driving car, AI player in a game).


2. Environment – The external world where the agent interacts (e.g., a game board,
real-world traffic).
3. State (S) – A representation of the environment at a specific time.
4. Actions (A) – The possible choices available to the agent.
5. Rewards (R) – The feedback signal received by the agent for taking an action.
6. Policy (π) – The agent’s strategy for selecting actions based on states.
7. Episodes – The sequence of states, actions, and rewards from the start state to
the goal state.

Types of RL Problems
Learning Problems

 Unknown environment – The agent learns by trial and error.


 Goal – Improve the policy through interaction.
 Example – A robot navigating through an unknown maze.

Planning Problems

 Known environment – The agent can compute and improve the policy using a
model.
 Example – Chess AI that plans its moves based on game rules.

Environment and Agent

 The environment contains all elements the agent interacts with, including
obstacles, rewards, and state transitions.
 The agent makes decisions and performs actions to maximize rewards.

Example

In self-driving cars,


 The environment includes roads, traffic, and signals.


 The agent is the AI system making driving decisions.

States and Actions

 State (S) – Represents the current situation.


 Action (A) – Causes a transition from one state to another.

Example (Navigation)

In a grid-based game, states represent positions (A, B, C, etc.), and actions are
movements (UP, DOWN, LEFT, RIGHT).

Types of States

1. Start State – Where the agent begins.


2. Goal State – The target state with the highest reward.
3. Non-terminal States – Intermediate steps between start and goal.


Types of Episodes

 Episodic – Has a definite start and goal state (e.g., solving a maze).
 Continuous – No fixed goal state; the task continues indefinitely (e.g., stock
trading).

Policies in RL

A policy (π) is the strategy used by the agent to choose actions.

Types of Policies

 Deterministic policy – maps each state to a single action.
 Stochastic policy – assigns a probability distribution over the possible actions in each state.

Choosing the Best Policy

 The optimal policy is the one that maximizes cumulative expected rewards.

Rewards in RL

 Immediate Reward (r) – The instant feedback for an action.


 Total Reward (G) – The sum of all rewards collected during an episode.
 Long-term Reward – The cumulative future reward.


Discount Factor (γ)

The discount factor γ (0 ≤ γ ≤ 1) determines how much future rewards count towards the return, G = r1 + γ r2 + γ² r3 + … A value of γ near 0 makes the agent short-sighted (immediate rewards dominate), while a value near 1 makes it value long-term rewards.

RL Algorithm Categories

 Model-Based RL – Uses a predefined model (e.g., Chess AI).


 Model-Free RL – Learns by trial and error (e.g., a robot navigating an unknown
environment).

Markov Decision Process

A Markov Chain is a stochastic process that satisfies the Markov property.

It consists of a sequence of random variables where the probability of transitioning to the next state depends only on the current state and not on the past states.


Example: University Transition

Consider two universities:

 80% of students from University A move to University B for a master's degree, while 20% remain in University A.

 60% of students from University B move to University A, while 40% remain in University B.

This can be represented as a Markov Chain, where:

 States represent the universities.


 Edges denote the probability of transitioning between states.

A transition matrix T (rows: current university, columns: next university, in the order A, B) is defined as:

T = | 0.2 0.8 |
    | 0.6 0.4 |

Each row represents a probability distribution, meaning the sum of the elements in each row equals 1.

Probability Prediction

Let the initial distribution over the two universities be a row vector x0.

To find the state distribution after one time step, multiply the initial distribution by the transition matrix: x1 = x0 T.

After two time steps: x2 = x1 T = x0 T².


The system stabilizes over time, reflecting the equilibrium distribution.
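A small NumPy sketch of this calculation; the starting distribution x (everyone in University A) is an illustrative assumption, since the original example's initial vector is not shown:

import numpy as np

# Transition matrix: rows = current university (A, B), columns = next (A, B)
T = np.array([[0.2, 0.8],
              [0.6, 0.4]])

x = np.array([1.0, 0.0])   # illustrative start: all students in University A

for step in range(1, 11):
    x = x @ T              # distribution after one more time step
    print(step, x.round(4))

# The printed vectors converge to the equilibrium (stationary) distribution,
# i.e., the vector pi that satisfies pi = pi @ T (about [0.4286, 0.5714] here).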

Markov Decision Process (MDP)

An MDP extends a Markov Chain by incorporating rewards. It consists of:

1. Set of states
2. Set of actions
3. Transition probability function
4. Reward function
5. Policy
6. Value function

Markov Assumption

The Markov property states that the probability of reaching state s_t+1 and receiving reward r_t+1 depends only on the previous state and action:

P(s_t+1, r_t+1 | s_t, a_t)

MDP Process

1. Observe the current state s.
2. Choose an action a.
3. Receive a reward r.
4. Move to the next state s'.
5. Repeat to maximize cumulative rewards.

State Transition Probability


The probability of moving from state s to state s' after taking action a is given by:

P(s' | s, a) = Pr(S_t+1 = s' | S_t = s, A_t = a)

This forms a state transition matrix, where each row represents transition probabilities
from one state to another.

Expected Reward

The expected reward for taking action a in state s is given by:

r(s, a) = E[ R_t+1 | S_t = s, A_t = a ]

Training and Testing of RL Systems

Once an MDP is modeled, the system undergoes:

1. Training: The agent repeatedly interacts with the environment, adjusting parameters based on rewards.
2. Inference: A trained model is deployed to make decisions in real-time.
3. Retraining: When the environment changes, the model is retrained to adapt and improve performance.

Goal of MDP

The agent's objective is to maximize total accumulated rewards over time by following
an optimal policy.


Multi-Arm Bandit Problem and Reinforcement Problem Types

Reinforcement Learning Overview

Reinforcement learning (RL) uses trial and error to learn a series of actions that
maximize the total reward. RL consists of two fundamental sub-problems:

Prediction (Value Estimation):

o The goal is to predict the total reward (return), also known as policy evaluation or
value estimation.
o This requires the formulation of a function called the state-value function.
o The estimation of the state-value function can be performed using Temporal
Difference (TD) Learning.

Policy Improvement:

o The objective is to determine actions that maximize returns.


o This process is known as policy improvement.
o Both prediction and policy improvement can be combined into policy iteration, where
these steps are used alternately to find an optimal policy.

Multi-Arm Bandit Problem

A commonly encountered problem in reinforcement learning is the multi-arm bandit problem (or N-arm bandit problem).

Consider a hypothetical casino with a robotic arm that activates a 5-armed slot machine.
When a lever is pulled, the machine returns a reward within a specified range (e.g., $1 to
$10).

The challenge is that each arm provides rewards randomly within this range.


Objective:

Given a limited number of attempts, the goal is to maximize the total reward by
selecting the best lever.

A logical approach is to determine which lever has the highest average reward and use it
repeatedly.

Formalization:

Given k attempts on an N-arm slot machine, with rewards r1, r2, ..., rk, the expected reward (action-value function) is:

Q(a) = (r1 + r2 + ... + rk) / k

The best action is defined as:

a* = argmax_a Q(a)

This is the action with the highest average reward, and Q(a) serves as an indicator of action quality.

Example:

If a slot machine arm is chosen five times and returns rewards r1, ..., r5, the quality of this action is the average of those five rewards: Q(a) = (r1 + r2 + r3 + r4 + r5) / 5.


Exploration vs Exploitation and Selection Policies

In reinforcement learning, an agent must decide how to select actions:

Exploration:

o Tries all actions, even if they lead to sub-optimal decisions.


o Useful in games where exploring different actions provides better long-term
rewards.
o Risky but informative.

Exploitation:

o Uses the current best-known action repeatedly.


o Focuses on short-term gains.
o Simple but often sub-optimal.

A balance between exploration and exploitation is crucial for optimal decision-making.

Selection Policies

Greedy Method

 Picks the best-known action at any given time.


 Based solely on exploitation.
 Risk: It may miss out on exploring better options.

ε-Greedy Method

 Balances exploration and exploitation.


 With probability ε, the agent explores a random action.
 With probability 1 - ε, it selects the best-known action.
 ε ranges from 0 to 1 (e.g., ε = 0.1 means a 10% chance of exploration).
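A short NumPy simulation of the ε-greedy method on a hypothetical 5-armed bandit; the reward distributions, ε = 0.1, and the number of pulls are assumptions for illustration:

import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([2.0, 5.0, 4.0, 7.0, 3.0])   # unknown to the agent
k_arms = len(true_means)

Q = np.zeros(k_arms)      # action-value estimates (average reward per arm)
N = np.zeros(k_arms)      # number of times each arm was pulled
epsilon = 0.1

for t in range(2000):
    if rng.random() < epsilon:
        a = int(rng.integers(k_arms))    # explore: try a random arm
    else:
        a = int(np.argmax(Q))            # exploit: best-known arm so far
    r = rng.normal(true_means[a], 1.0)   # noisy reward from the chosen arm
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]            # incremental sample-average update

print(Q.round(2))          # estimates approach the true means
print(int(np.argmax(Q)))   # best action a* = argmax Q(a); arm 3 here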


Reinforcement Learning Agent Types

An RL agent can be classified into different approaches based on how it learns:

1. Value-Based Approaches
o Optimize the value function v(s), which represents the maximum expected future reward from a given state.
o Uses discount factors to prioritize future rewards.
2. Policy-Based Approaches
o Focus on finding the optimal policy π, a function that maps states to actions.
o Rather than estimating values, it directly learns which action to take.
3. Actor-Critic Methods (Hybrid Approaches)
o Combine value-based and policy-based methods.
o The actor updates the policy, while the critic evaluates it.
4. Model-Based Approaches
o Create a model of the environment (e.g., using Markov Decision Processes
(MDPs)).
o Use simulations to plan the best actions.
5. Model-Free Approaches


o No predefined model of the environment.


o Use methods like Temporal Differencing (TD) Learning and Monte Carlo
methods to estimate values from experience.

Reinforcement Algorithm Selection

The choice of a reinforcement learning algorithm depends on factors such as:

 Availability of models
 Nature of updates (incremental vs. batch learning)
 Exploration vs. exploitation trade-offs
 Computational efficiency

Model-based Learning

Passive Learning refers to a model-based environment, where the environment is known. This means that for any given state, the next state and action probability distribution are known.

Markov Decision Process (MDP) and Dynamic Programming are powerful tools for
solving reinforcement learning problems in this context.


The mathematical foundation for passive learning is provided by MDP. These model-
based reinforcement learning problems can be solved using dynamic programming after
constructing the model with MDP.

The primary objective in reinforcement learning is to take an action a that transitions the
system from the current state to the end state while maximizing rewards. These
rewards can be positive or negative.

The goal is to maximize expected rewards by choosing the optimal policy:

π* = argmax_π E[ Gt | St = s ], for all possible values of s at time t.

Policy and Value Functions

An agent in reinforcement learning has multiple courses of action for a given state. The
way the agent behaves is determined by its policy.

A policy is a distribution over all possible actions with probabilities assigned to each
action.

Different actions yield different rewards. To quantify and compare these rewards, we
use value functions.

Value Function Notation

A value function summarizes possible future scenarios by averaging the expected returns under a given policy π.

It is a prediction of future rewards and computes the expected sum of future rewards for a given state s under policy π:

vπ(s) = Eπ[ R_t+1 + γ R_t+2 + γ² R_t+3 + … | S_t = s ]


where v(s) represents the quality of the state based on a long-term strategy.

Example

If we have two states with values 0.2 and 0.9, the state with 0.9 is a better state to be in.

Value functions can be of two types:

 State-Value Function (for a state)


 State-Action Function (for a state-action pair)

State-Value Function

Denoted as v(s), the state-value function of an MDP is the expected return from state s under a policy π:

vπ(s) = Eπ[ Gt | St = s ], where Gt = R_t+1 + γ R_t+2 + γ² R_t+3 + …

This function accumulates all expected rewards, potentially discounted over time, and helps determine the goodness of a state.

The optimal state-value function is given by:

v*(s) = max_π vπ(s)

Action-Value Function (Q-Function)

Apart from v(s), another function called the Q-function is used. This function returns a
real value indicating the total expected reward when an agent:

1. Starts in state s
2. Takes action a
3. Follows a policy π afterward


Bellman Equation

Dynamic programming methods require a recursive formulation of the problem. The recursive formulation of the state-value function is given by the Bellman equation:

vπ(s) = Σ_a π(a|s) Σ_s' P(s'|s, a) [ r(s, a, s') + γ vπ(s') ]

Solving Reinforcement Problems

There are two main algorithms for solving reinforcement learning problems using
conventional methods:

1. Value Iteration
2. Policy Iteration

Value Iteration

Value iteration estimates v(s) iteratively using the Bellman optimality update:

v(s) ← max_a Σ_s' P(s'|s, a) [ r(s, a, s') + γ v(s') ]


Algorithm

1. Initialize v(s) arbitrarily (e.g., all zeros).


2. Iterate until convergence:
o For each state s, update v(s) using the Bellman equation.
o Repeat until changes are negligible.
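A compact sketch of value iteration on a hypothetical 4-state chain world (move left or right, small step penalty, bonus for reaching the goal); the environment, rewards, and γ are assumptions:

# Value iteration on a hypothetical 4-state chain: 0 -> 1 -> 2 -> 3 (goal).
# Actions: 0 = left, 1 = right; each move costs -1, reaching state 3 gives +10.
gamma = 0.9
states = [0, 1, 2, 3]
actions = [0, 1]

def step(s, a):
    """Deterministic illustrative environment: return (next_state, reward)."""
    if s == 3:
        return s, 0.0                    # terminal state
    s_next = max(0, s - 1) if a == 0 else min(3, s + 1)
    return s_next, (10.0 if s_next == 3 else -1.0)

V = {s: 0.0 for s in states}             # 1. initialize v(s) arbitrarily (zeros)
while True:                              # 2. iterate until convergence
    delta = 0.0
    for s in states:
        if s == 3:
            continue
        # Bellman optimality update: v(s) = max_a [ r + gamma * v(s') ]
        best = max(r + gamma * V[s2] for s2, r in (step(s, a) for a in actions))
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < 1e-6:                     # stop when changes are negligible
        break

print({s: round(v, 2) for s, v in V.items()})   # {0: 6.2, 1: 8.0, 2: 10.0, 3: 0.0}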

Policy Iteration

Policy iteration consists of two main steps:

1. Policy Evaluation
2. Policy Improvement

Policy Evaluation

Initially, for a given policy π, the algorithm starts with v(s) = 0 (no reward). The Bellman
equation is used to obtain v(s), and the process continues iteratively until the optimal
v(s) is found.

Policy Improvement

The policy improvement process is performed as follows:

1. Evaluate the current policy using policy evaluation.


2. Solve the Bellman equation for the current policy to obtain v(s).
3. Improve the policy by applying the greedy approach to maximize expected
rewards.
4. Repeat the process until the policy converges to the optimal policy.

Algorithm

1. Start with an arbitrary policy π.


2. Perform policy evaluation using Bellman’s equation.

3. Improve the policy greedily.


4. Repeat until convergence.
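A matching sketch of policy iteration on the same kind of hypothetical chain world, alternating policy evaluation and greedy policy improvement until the policy stops changing:

# Policy iteration on a hypothetical chain world: states 0..3, state 3 is the goal.
gamma = 0.9
states = [0, 1, 2, 3]
actions = [0, 1]                         # 0 = left, 1 = right

def step(s, a):
    if s == 3:
        return s, 0.0
    s_next = max(0, s - 1) if a == 0 else min(3, s + 1)
    return s_next, (10.0 if s_next == 3 else -1.0)

policy = {s: 0 for s in states}          # 1. start with an arbitrary policy (always left)
V = {s: 0.0 for s in states}

while True:
    # 2. Policy evaluation: iterate the Bellman equation for the current policy
    while True:
        delta = 0.0
        for s in states:
            if s == 3:
                continue
            s2, r = step(s, policy[s])
            v_new = r + gamma * V[s2]
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < 1e-6:
            break
    # 3. Policy improvement: act greedily with respect to the evaluated v(s)
    stable = True
    for s in states:
        if s == 3:
            continue
        best_a = max(actions, key=lambda a: step(s, a)[1] + gamma * V[step(s, a)[0]])
        if best_a != policy[s]:
            policy[s] = best_a
            stable = False
    if stable:                           # 4. repeat until the policy converges
        break

print(policy)                            # optimal policy: move right in states 0-2
print({s: round(v, 2) for s, v in V.items()})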

Model Free Methods

Model-free methods do not require complete knowledge of the environment. Instead, they learn through experience and interaction with the environment.

The reward determination in model-free methods can be categorized into three formulations:

1. Episodic Formulation: Rewards are assigned based on the outcome of an entire episode. For example, if a game is won, all actions in the episode receive a positive reward (+1). If lost, all actions receive a negative reward (-1). However, this approach may unfairly penalize or reward intermediate actions.
2. Continuous Formulation: Rewards are determined immediately after an action. An example is the multi-armed bandit problem, where an immediate reward between $1 - $10 can be given after each action.
3. Discounted Returns: Long-term rewards are considered using a discount factor. This method is often used in reinforcement learning algorithms.

Model-free methods primarily utilize the following techniques:

 Monte Carlo (MC) Methods


 Temporal Difference (TD) Learning

Monte-Carlo Methods

Monte Carlo (MC) methods do not assume any predefined model, making them purely
experience-driven. This approach is analogous to how humans and animals learn from
interactions with their environment.


Characteristics of Monte Carlo Methods:

 Experience is divided into episodes, where each episode is a sequence of states from a starting state to a goal state.
 Episodes must terminate; regardless of the starting point, an episode must reach
an endpoint.
 Value-action functions are computed only after the completion of an episode,
making MC an incremental method.
 MC methods compute rewards at the end of an episode to estimate maximum
expected future rewards.
 Empirical mean is used instead of expected return; the total return over multiple
episodes is averaged.
 Due to the non-stationary nature of environments, value functions are computed
for a fixed policy and revised using dynamic programming.

Monte Carlo Mean Value Computation:

The mean value of a state is calculated as the average of the returns observed from that state over N episodes:

v(s) = (G1 + G2 + … + GN) / N

Incremental Monte Carlo Update:

The value function is updated incrementally after each new return G using:

v(s) ← v(s) + α (G − v(s)), where α is a step-size (learning-rate) parameter.


Temporal Difference (TD) Learning

Temporal Difference (TD) Learning is an alternative to Monte Carlo methods. It is also a model-free technique that learns from experience and interaction with the environment.

Characteristics of TD Learning:

 Bootstrapping Method: Updates are based on the current estimate and future
reward.
 Incremental Updates: Unlike MC, which waits until the end of an episode, TD
updates values at each step.
 More Efficient: TD can learn before an episode ends, making it more sample-
efficient than MC methods.
 Used for Non-Stationary Problems: Suitable for environments where conditions
change over time.
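A minimal sketch of tabular TD(0) value estimation on a hypothetical 5-state random walk under a fixed random policy; α, γ, and the environment are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
n_states = 5                 # states 0..4; states 0 and 4 are terminal
V = np.zeros(n_states)       # value estimates for a fixed random policy
alpha, gamma = 0.1, 1.0

for episode in range(2000):
    s = 2                    # every episode starts in the middle of the walk
    while s not in (0, 4):
        s_next = s + int(rng.choice([-1, 1]))   # random policy: step left or right
        r = 1.0 if s_next == 4 else 0.0         # reward only at the right terminal
        # TD(0) bootstrapping update: uses the current estimate of the next state,
        # so learning happens at every step instead of at the end of the episode
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next

print(V.round(2))            # interior values approach about 0.25, 0.50, 0.75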

Differences between Monte Carlo and TD Learning


Eligibility Traces and TD(λ)

TD Learning can be accelerated using eligibility traces, which allow updates to be spread over multiple states. This leads to a family of algorithms called TD(λ), where λ is the decay parameter (0 ≤ λ ≤ 1):

 λ = 0: Only the previous prediction is updated.


 λ = 1: All previous predictions are updated.

By incorporating eligibility traces, TD(λ) provides an alternative short-term memory mechanism to enhance learning efficiency.


Q-Learning

Q-Learning Algorithm

1. Initialize Q-table:
o Create a table Q(s,a) with states s and actions a.
o Initialize Q-values with random or zero values.
2. Set parameters:
o Learning rate α (typically between 0 and 1).
o Discount factor γ (typically close to 1).
o Exploration-exploitation trade off strategy (e.g., ε-greedy policy).
3. Repeat for each episode:
o Start from an initial state s.
o Repeat until reaching a terminal state:
 Choose an action a for state s using the exploration strategy (e.g., ε-greedy).
 Take action a, observe the reward r and the next state s'.
 Update: Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ].
 Set s ← s'.

4. End the training once convergence is reached (Q-values become stable).

This iterative process helps the agent learn optimal Q-values, which guide it to take
actions that maximize rewards.
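A compact sketch of tabular Q-learning on a hypothetical 1-D grid world with the goal in the rightmost cell; the environment, α, γ, and ε are assumptions for illustration:

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2                   # actions: 0 = left, 1 = right; state 4 = goal
Q = np.zeros((n_states, n_actions))          # 1. initialize the Q-table with zeros
alpha, gamma, epsilon = 0.5, 0.9, 0.1        # 2. learning parameters

def env_step(s, a):
    """Hypothetical environment: -1 per move, +10 on reaching the goal cell."""
    s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    reward = 10.0 if s_next == n_states - 1 else -1.0
    return s_next, reward, s_next == n_states - 1

for episode in range(500):                   # 3. repeat for each episode
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
        s_next, r, done = env_step(s, a)
        # Q-learning (off-policy) update: bootstrap from the GREEDY next-state value
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.round(2))                            # 4. training ends once Q-values are stable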


SARSA Learning
SARSA Algorithm (State-Action-Reward-State-Action)

Initialize Q-table:

o Create a table Q(s,a) for all state-action pairs.


o Initialize Q-values with random or zero values.

Set parameters:

o Learning rate α (typically between 0 and 1).


o Discount factor γ (typically close to 1).
o Exploration-exploitation strategy (e.g., ε-greedy policy).

Repeat for each episode:

o Start from an initial state s.


o Choose an action a using the ε-greedy policy.

Repeat until the terminal state is reached:

o Take action a, observe the reward r and the next state s'.
o Choose the next action a' from s' using the ε-greedy policy.
o Update: Q(s, a) ← Q(s, a) + α [ r + γ Q(s', a') − Q(s, a) ].
o Set s ← s' and a ← a'.

End the training when Q-values converge.
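A matching SARSA sketch on the same kind of hypothetical grid world; the key difference from the Q-learning sketch is that the update uses Q(s', a') for the action a' actually selected by the ε-greedy policy:

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2                   # 0 = left, 1 = right; state 4 = goal
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.5, 0.9, 0.1

def env_step(s, a):
    s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    return s_next, (10.0 if s_next == n_states - 1 else -1.0), s_next == n_states - 1

def eps_greedy(s):
    return int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())

for episode in range(500):
    s = 0
    a = eps_greedy(s)                        # choose the first action with the ε-greedy policy
    done = False
    while not done:
        s_next, r, done = env_step(s, a)
        a_next = eps_greedy(s_next)          # choose the NEXT action before updating
        # SARSA (on-policy) update: bootstrap from Q(s', a') of the action actually taken
        Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
        s, a = s_next, a_next

print(Q.round(2))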

Differences between SARSA and Q-Learning

 SARSA is on-policy: its update uses Q(s', a') for the action a' actually selected by the current (ε-greedy) policy.
 Q-Learning is off-policy: its update uses the greedy value max_a' Q(s', a'), regardless of the action actually taken next.
 As a result, SARSA tends to learn more conservative behaviour while exploration is active, whereas Q-Learning learns the value of the greedy policy directly.
