Module-5
Clustering is done using a trial-and-error approach, as there is no supervisor available as in classification.
The characteristic of clustering is that the objects within a cluster are similar to each other, while they differ significantly from the objects in other clusters.
The input for cluster analysis is examples or samples. These are also known as objects or data points; all these terms are the same and are used interchangeably in this chapter. All the samples are unlabelled.
The output is the set of clusters (or groups) of similar data, if it exists in the input.
For example, Figure 13.1(a) shows data points or samples with two features, shown as differently shaded samples, and Figure 13.1(b) shows the manually identified clusters of these samples.
Visual identification of clusters in this case is easy, as the examples have only two features.
But when examples have more features, say 100, clustering cannot be done manually and automatic clustering algorithms are required.
Also, automating the clustering process is desirable, as such tasks are difficult and almost impossible for humans. All clusters are represented by centroids.
Example: If the input examples or data points are (3, 3), (2, 6) and (7, 9), then the centroid is given as ((3 + 2 + 7)/3, (3 + 6 + 9)/3) = (4, 6).
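As a quick illustration, here is a minimal Python sketch of this centroid computation, assuming the three data points above (the function name mean_centroid is chosen only for this example):

import numpy as np

def mean_centroid(points):
    # Component-wise mean of a list of 2-D points.
    return tuple(np.mean(points, axis=0))

points = [(3, 3), (2, 6), (7, 9)]
print(mean_centroid(points))   # (4.0, 6.0)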
The clusters should not overlap, and every cluster should represent only one class. Therefore, clustering algorithms use a trial-and-error method to form clusters that can then be converted to class labels.
Applications of Clustering
High-Dimensional Data
Scalability Issue
o Some algorithms perform well for small datasets but fail for large-scale data.
Unit Inconsistency
Proximity Measures
Quantitative Variables
Binary Attributes
Categorical Variables
Ordinal Variables
Cosine Similarity
Distance Measures
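To illustrate the proximity measures listed above, here is a small Python sketch that computes the Euclidean distance and the cosine similarity between two quantitative feature vectors; the vectors themselves are arbitrary example values, not data from the text.

import numpy as np

def euclidean_distance(x, y):
    # Square root of the sum of squared feature differences.
    return np.sqrt(np.sum((x - y) ** 2))

def cosine_similarity(x, y):
    # Dot product normalised by the magnitudes of the two vectors.
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(euclidean_distance(x, y))   # about 3.74
print(cosine_similarity(x, y))    # 1.0, since the vectors point in the same direction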
Overview
o Merges clusters based on the smallest distance between two points from different
clusters.
o Related to the Minimum Spanning Tree (MST).
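A brief Python sketch of this smallest-distance (single-linkage) merging criterion, assuming clusters are simply lists of points; the function and example points are illustrative only.

import numpy as np

def single_linkage_distance(cluster_a, cluster_b):
    # Smallest pairwise Euclidean distance between points of two different clusters.
    return min(
        np.linalg.norm(np.array(p) - np.array(q))
        for p in cluster_a
        for q in cluster_b
    )

a = [(1, 1), (2, 1)]
b = [(5, 4), (6, 6)]
print(single_linkage_distance(a, b))   # about 4.24, between (2, 1) and (5, 4)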
No model assumptions are made.
Only one parameter, the window bandwidth, is required.
Robust to noise.
Selecting the bandwidth is a challenging task. If it is too large, then many clusters are missed. If it is too small, then many points are missed and convergence becomes a problem.
The number of clusters cannot be specified in advance, and the user has no control over this parameter.
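A hedged scikit-learn sketch of mean-shift clustering, assuming scikit-learn is available; here the bandwidth is estimated from the data with estimate_bandwidth rather than hand-tuned, and the synthetic data is purely illustrative.

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three groups (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# Estimate a bandwidth from the data; a larger quantile gives a wider window.
bandwidth = estimate_bandwidth(X, quantile=0.2)

ms = MeanShift(bandwidth=bandwidth)
labels = ms.fit_predict(X)
print("Number of clusters found:", len(np.unique(labels)))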
1. Initialization
o Select k initial cluster centers (randomly or using prior knowledge).
o Normalize data for better performance.
2. Assignment of Data Points
o Assign each data point to the nearest centroid based on Euclidean distance.
3. Update Centroids
o Recompute each centroid as the mean of the data points assigned to it, and repeat the assignment and update steps until the centroids no longer change (see the sketch below).
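A compact sketch of these three k-means steps using scikit-learn (assumed to be available); the synthetic data and parameter values are illustrative.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic data; scaling follows the normalization advice in step 1.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)     # steps 2 and 3 are repeated internally until convergence
print(km.cluster_centers_)     # final centroids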
Mathematical Optimization
Advantages
Disadvantages
Computational Complexity
O(nkId), where:
o n = number of samples
o k = number of clusters
o I = number of iterations
o d = number of attributes
Density-based Methods
1. Core Point
o A point with at least m points in its ε-neighborhood.
2. Border Point
o Has fewer than m points in its ε-neighborhood but is adjacent to a core point.
3. Noise Point (Outlier)
o Neither a core point nor a border point.
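A hedged scikit-learn sketch of DBSCAN, where the eps parameter plays the role of the ε-neighborhood radius and min_samples plays the role of m above; the data and parameter values are illustrative.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: a shape k-means handles poorly but DBSCAN handles well.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5)    # eps ~ ε, min_samples ~ m
labels = db.fit_predict(X)

print("Clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("Noise points (label -1):", int(np.sum(labels == -1)))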
Advantages of DBSCAN
Disadvantages of DBSCAN
Grid-based Approach
Grid-based clustering partitions space into a grid structure and fits data into cells for
clustering.
Suitable for high-dimensional data.
Uses subspace clustering, dense cells, and monotonicity property.
Concepts
Subspace Clustering
o Clusters are formed using a subset of features (dimensions) rather than all
attributes.
o Useful for high-dimensional data like gene expression analysis.
Monotonicity Property
Advantages of CLIQUE
Disadvantage of CLIQUE
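A rough Python sketch of the dense-cell idea behind grid-based clustering (not the full CLIQUE algorithm): partition each dimension into intervals, count points per cell, and keep the cells whose count reaches a density threshold. The function name, interval count, and threshold are illustrative assumptions.

import numpy as np
from collections import Counter

def dense_cells(X, intervals=4, threshold=5):
    # Map each point to a grid cell and return cells holding at least `threshold` points.
    mins, maxs = X.min(axis=0), X.max(axis=0)
    widths = (maxs - mins) / intervals
    cells = np.clip(((X - mins) / widths).astype(int), 0, intervals - 1)
    counts = Counter(map(tuple, cells))
    return {cell: n for cell, n in counts.items() if n >= threshold}

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
print(dense_cells(X))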
Chapter - 2
Reinforcement Learning
Characteristics of RL
o Consider a grid-based game where a robot must navigate from a starting node (E) to
a goal node (G) by choosing the optimal path.
o RL can learn the best route by exploring different paths and receiving feedback on
their efficiency.
1. Sequential Decision-Making
o In RL, decisions are made in steps, and each step influences future choices.
o Example: In a maze game, a wrong turn can lead to failure.
2. Delayed Feedback
o Rewards are not always immediate; the agent may need to take multiple steps
before receiving feedback.
3. Interdependence of Actions
o Each action affects the next set of choices, meaning an incorrect move can
have long-term consequences.
4. Time-Related Decisions
o Actions are taken in a specific sequence over time, affecting the final
outcome.
Reward Design
o Setting the right reward values is crucial. Incorrectly designed rewards may lead the
agent to learn undesired behavior.
o Some environments, like chess, have fixed rules, but many real-world problems lack
predefined models.
o Example: Training a self-driving car requires simulations to generate experiences.
Partial Observability
1. Industrial Automation
o Optimizing robot movements for efficiency.
2. Resource Management
o Allocating resources in data centers and cloud computing.
3. Traffic Light Control
o Reducing congestion by adjusting signal timings dynamically.
4. Personalized Recommendation Systems
o Used in news feeds, e-commerce, and streaming services (e.g., Netflix
recommendations).
5. Online Advertisement Bidding
o Optimizing ad placements for maximum engagement.
6. Autonomous Vehicles
o RL helps in training self-driving cars to navigate safely.
7. Game AI (Chess, Go, Dota 2, etc.)
o AI models like AlphaGo use RL to master complex games.
8. DeepMind Applications
Why RL Is Necessary?
Some tasks cannot be solved using supervised learning due to the absence of a labeled
training dataset. For example:
Chess & Go: There is no dataset with all possible game moves and their
outcomes. RL allows the agent to explore and improve over time.
Autonomous Driving: The car must learn from real-world experiences rather than
relying on a fixed dataset.
Basic Components of RL
Types of RL Problems
Learning Problems
Planning Problems
Known environment – The agent can compute and improve the policy using a
model.
Example – Chess AI that plans its moves based on game rules.
The environment contains all elements the agent interacts with, including
obstacles, rewards, and state transitions.
The agent makes decisions and performs actions to maximize rewards.
Example
In self-driving cars, the car is the agent, while the roads, other vehicles, pedestrians, and traffic signals form the environment it interacts with.
Example (Navigation)
In a grid-based game, states represent positions (A, B, C, etc.), and actions are
movements (UP, DOWN, LEFT, RIGHT).
Types of States
Types of Episodes
Episodic – Has a definite start and goal state (e.g., solving a maze).
Continuous – No fixed goal state; the task continues indefinitely (e.g., stock
trading).
Policies in RL
Types of Policies
The optimal policy is the one that maximizes cumulative expected rewards.
Rewards in RL
RL Algorithm Categories
Each row represents a probability distribution, meaning the sum of elements in each
row equals 1.
Probability Prediction
1. Set of states
2. Set of actions
3. Transition probability function
4. Reward function
5. Policy
6. Value function
Markov Assumption
The Markov property states that the probability of reaching the next state s_{t+1} and receiving a reward r_{t+1} depends only on the previous state s_t and action a_t:
P(s_{t+1}, r_{t+1} | s_t, a_t) = P(s_{t+1}, r_{t+1} | s_1, a_1, r_1, ..., s_t, a_t)
MDP Process
The probability of moving from state s to state s' after taking action a is given by:
P(s' | s, a) = Pr(s_{t+1} = s' | s_t = s, a_t = a)
This forms a state transition matrix, where each row gives the transition probabilities from one state to every other state.
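A small illustrative example of such a transition matrix for a two-state system (the probabilities are assumed values, not taken from the text); note that every row sums to 1.

import numpy as np

# P[i, j] = probability of moving from state i to state j under some fixed action.
P = np.array([
    [0.7, 0.3],   # from state 0
    [0.4, 0.6],   # from state 1
])
assert np.allclose(P.sum(axis=1), 1.0)   # each row is a probability distribution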
Expected Reward
Goal of MDP
The agent's objective is to maximize total accumulated rewards over time by following
an optimal policy.
Reinforcement learning (RL) uses trial and error to learn a series of actions that
maximize the total reward. RL consists of two fundamental sub-problems:
Prediction:
o The goal is to predict the total reward (return), also known as policy evaluation or
value estimation.
o This requires the formulation of a function called the state-value function.
o The estimation of the state-value function can be performed using Temporal
Difference (TD) Learning.
Policy Improvement:
Consider a hypothetical casino with a robotic arm that activates a 5-armed slot machine.
When a lever is pulled, the machine returns a reward within a specified range (e.g., $1 to
$10).
The challenge is that each arm provides rewards randomly within this range.
Objective:
Given a limited number of attempts, the goal is to maximize the total reward by
selecting the best lever.
A logical approach is to determine which lever has the highest average reward and use it
repeatedly.
Formalization:
Given k attempts on an N-arm slot machine, with rewards r_1, r_2, ..., r_k, the expected reward (action-value function) is:
Q(a) = (r_1 + r_2 + ... + r_k) / k
The action with the highest average reward is the preferred one, and Q(a) is used as an indicator of action quality.
Example:
If a slot machine arm is chosen five times and returns rewards r_1, r_2, r_3, r_4, r_5, the quality of this action is Q(a) = (r_1 + r_2 + r_3 + r_4 + r_5) / 5.
Exploration:
Exploitation:
Selection Policies
Greedy Method
ε-Greedy Method
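A hedged Python sketch of the ε-greedy selection policy on a simulated 5-arm bandit; the reward distribution, ε value, and number of pulls are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n_arms, epsilon, pulls = 5, 0.1, 1000

true_means = rng.uniform(1, 10, n_arms)    # hidden average payout of each arm
q_estimates = np.zeros(n_arms)             # sample-average estimates Q(a)
counts = np.zeros(n_arms)

for _ in range(pulls):
    if rng.random() < epsilon:
        a = int(rng.integers(n_arms))      # explore: pick a random arm
    else:
        a = int(np.argmax(q_estimates))    # exploit: pick the best arm so far
    reward = rng.normal(true_means[a], 1.0)
    counts[a] += 1
    q_estimates[a] += (reward - q_estimates[a]) / counts[a]   # incremental sample average

print("Best arm (true):", int(np.argmax(true_means)))
print("Best arm (learned):", int(np.argmax(q_estimates)))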
1. Value-Based Approaches
o Optimize the value function V(s), which represents the maximum expected future reward from a given state.
o Uses a discount factor γ to weight future rewards.
2. Policy-Based Approaches
o Focus on finding the optimal policy π, a function that maps states to actions.
o Rather than estimating values, it directly learns which action to take.
3. Actor-Critic Methods (Hybrid Approaches)
o Combine value-based and policy-based methods.
o The actor updates the policy, while the critic evaluates it.
4. Model-Based Approaches
o Create a model of the environment (e.g., using Markov Decision Processes
(MDPs)).
o Use simulations to plan the best actions.
5. Model-Free Approaches
Availability of models
Nature of updates (incremental vs. batch learning)
Exploration vs. exploitation trade-offs
Computational efficiency
Model-based Learning
Markov Decision Process (MDP) and Dynamic Programming are powerful tools for
solving reinforcement learning problems in this context.
The mathematical foundation for passive learning is provided by MDP. These model-
based reinforcement learning problems can be solved using dynamic programming after
constructing the model with MDP.
The primary objective in reinforcement learning is to take an action a that transitions the
system from the current state to the end state while maximizing rewards. These
rewards can be positive or negative.
An agent in reinforcement learning has multiple courses of action for a given state. The
way the agent behaves is determined by its policy.
A policy is a distribution over all possible actions with probabilities assigned to each
action.
Different actions yield different rewards. To quantify and compare these rewards, we
use value functions.
It is a prediction of future rewards and computes the expected sum of future rewards for a given state s under policy π:
v_π(s) = E_π[ r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... | s_t = s ]
where v(s) represents the quality of the state based on a long-term strategy.
Example
If we have two states with values 0.2 and 0.9, the state with 0.9 is a better state to be in.
State-Value Function
Denoted as v(s), the state-value function of an MDP is the expected return from state s under a policy π:
v_π(s) = E_π[ G_t | s_t = s ], where the return G_t is the (possibly discounted) sum of future rewards.
This function accumulates all expected rewards, potentially discounted over time, and
helps determine the goodness of a state.
Apart from v(s), another function called the Q-function is used. This function returns a
real value indicating the total expected reward when an agent:
1. Starts in state s
2. Takes action a
3. Follows the policy π afterward.
Formally, q_π(s, a) = E_π[ G_t | s_t = s, a_t = a ].
Bellman Equation
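For reference, the standard Bellman expectation equation expresses v_π(s) recursively in terms of the values of successor states, where R(s, a, s') denotes the expected immediate reward of the transition and γ the discount factor:
v_π(s) = Σ_a π(a | s) Σ_s' P(s' | s, a) [ R(s, a, s') + γ v_π(s') ]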
There are two main algorithms for solving reinforcement learning problems using
conventional methods:
1. Value Iteration
2. Policy Iteration
Value Iteration
Algorithm
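A minimal Python sketch of value iteration on a tiny, made-up MDP with two states and two actions; the transition probabilities, rewards, and discount factor below are illustrative assumptions.

import numpy as np

gamma, theta = 0.9, 1e-6                     # discount factor and stopping threshold

# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward (assumed values).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

V = np.zeros(2)
while True:
    # Bellman optimality backup: V(s) = max_a [ R(s, a) + γ Σ_s' P(s' | s, a) V(s') ]
    Q = R + gamma * (P @ V)                  # shape (states, actions)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < theta:
        break
    V = V_new

print("Optimal state values:", V_new)
print("Greedy policy (action per state):", Q.argmax(axis=1))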
Policy Iteration
1. Policy Evaluation
2. Policy Improvement
Policy Evaluation
Initially, for a given policy π, the algorithm starts with v(s) = 0 (no reward). The Bellman
equation is used to obtain v(s), and the process continues iteratively until the optimal
v(s) is found.
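A short sketch of this iterative policy evaluation on an assumed two-state MDP, where P_pi and R_pi are the transition probabilities and expected rewards induced by the fixed policy π (values chosen only for illustration).

import numpy as np

gamma, theta = 0.9, 1e-6
P_pi = np.array([[0.8, 0.2],
                 [0.5, 0.5]])
R_pi = np.array([1.0, 0.5])

V = np.zeros(2)                        # start with v(s) = 0, as described above
while True:
    V_new = R_pi + gamma * (P_pi @ V)  # Bellman expectation backup
    if np.max(np.abs(V_new - V)) < theta:
        break
    V = V_new
print("v(s) under the fixed policy:", V_new)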
Policy Improvement
Algorithm
Monte-Carlo Methods
Monte Carlo (MC) methods do not assume any predefined model, making them purely
experience-driven. This approach is analogous to how humans and animals learn from
interactions with their environment.
Characteristics of TD Learning:
Bootstrapping Method: Updates are based on the current estimate and future
reward.
Incremental Updates: Unlike MC, which waits until the end of an episode, TD
updates values at each step.
More Efficient: TD can learn before an episode ends, making it more sample-
efficient than MC methods.
Used for Non-Stationary Problems: Suitable for environments where conditions
change over time.
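A brief sketch of a single TD(0) state-value update as described above; the states, reward, and step size are placeholder values used only for illustration.

# TD(0) update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
alpha, gamma = 0.1, 0.9
V = {"A": 0.0, "B": 0.0}            # value estimates for two illustrative states

s, r, s_next = "A", 1.0, "B"        # one observed transition (assumed)
td_target = r + gamma * V[s_next]   # bootstraps on the current estimate of the next state
V[s] += alpha * (td_target - V[s])  # incremental update made at this step, not at episode end
print(V)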
Q-Learning
Q-Learning Algorithm
1. Initialize Q-table:
o Create a table Q(s,a) with states s and actions a.
o Initialize Q-values with random or zero values.
2. Set parameters:
o Learning rate α (typically between 0 and 1).
o Discount factor γ (typically close to 1).
o Exploration-exploitation trade-off strategy (e.g., ε-greedy policy).
3. Repeat for each episode:
o Start from an initial state s.
o Repeat until reaching a terminal state: choose an action a for the current state s (e.g., using the ε-greedy policy), take it, observe the reward r and the next state s', update Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ], and set s ← s'.
This iterative process helps the agent learn optimal Q-values, which guide it to take
actions that maximize rewards.
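A compact Python sketch of tabular Q-learning on a toy one-dimensional corridor environment; the environment itself (5 cells, goal at the right end, reward 1 at the goal) is an assumption made purely for illustration of the steps above.

import numpy as np

n_states, n_actions = 5, 2              # states 0..4; actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.5, 0.9, 0.1
goal = n_states - 1
rng = np.random.default_rng(0)

Q = np.zeros((n_states, n_actions))     # step 1: initialize the Q-table with zeros

for episode in range(200):              # step 3: repeat for each episode
    s = 0
    while s != goal:
        # ε-greedy action selection (the trade-off strategy from step 2)
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next = min(s + 1, goal) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == goal else 0.0
        # Q-learning update uses the greedy (max) value of the next state
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.round(Q, 2))                   # learned action values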
SARSA Learning
SARSA Algorithm (State-Action-Reward-State-Action)
Initialize Q-table:
Set parameters:
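For comparison with Q-learning, here is a one-step sketch of the SARSA update, which uses the action a' actually chosen in the next state (by the same ε-greedy policy) rather than the greedy maximum; all values below are assumed for illustration.

# SARSA update for one observed (s, a, r, s', a') tuple:
# Q(s, a) <- Q(s, a) + alpha * (r + gamma * Q(s', a') - Q(s, a))
alpha, gamma = 0.5, 0.9
Q = {("A", "right"): 0.0, ("B", "right"): 0.5}

s, a, r = "A", "right", 1.0
s_next, a_next = "B", "right"           # a' comes from the behaviour policy itself
Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
print(Q)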