Reinforcement Learning-1
UNIT 1
Reinforcement learning is a machine learning technique that trains a model to take actions in an
environment to maximize reward. The model learns through trial and error, and is rewarded for
desired behaviors and punished for undesired ones.
Reinforcement learning is one of the three basic machine learning paradigms, along with supervised and
unsupervised learning. It is used by software agents and machines to find the best behavior or path to take
in a specific situation.
RL Framework
Agent: The agent is the entity that interacts with the environment and makes decisions. The agent
is like a student or a learner. Imagine you're teaching a robot to clean a room. The robot is the agent.
It's the one that has to figure out how to clean the room effectively.
Environment: Think of the environment as the world or the place where the agent (robot) is
working. In our example, the environment is the room itself. It's everything around the robot,
including the furniture and the mess on the floor.
State: A state is like the situation or condition the agent (robot) finds itself in. For the cleaning robot,
a state could be when it's in front of a dirty spot, or when it's near an obstacle like a chair. States
describe what's happening at a specific moment.
Action: Actions are like the things the agent (robot) can do. For our cleaning robot, actions could
include moving forward, turning left, picking up trash, or stopping. These are the choices the robot
can make to change its state.
Reward: Think of rewards as points or treats that the agent (robot) gets when it does something
good. If the robot cleans a dirty spot, it gets a reward. If it bumps into a chair, it might get a negative
reward. The goal is for the robot to collect as many rewards as possible.
So, in our example, the cleaning robot (agent) is in a room (environment), and it has to decide what
to do (actions) based on what it sees (state) to earn rewards. It's like teaching the robot to clean by
rewarding it when it does a good job and giving it feedback when it makes mistakes. Over time, the
robot learns to clean the room better because it wants to get more rewards. That's how RL works –
by learning from experiences in an environment to make better decisions.
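Putting these pieces together, the interaction can be sketched as a simple loop in which the agent observes a state, picks an action, and receives a reward. The CleaningRobotEnv class, its states, and its reward values below are all invented for illustration; this is a minimal sketch, not a real robotics API.

```python
import random

class CleaningRobotEnv:
    """A toy, made-up environment with three possible states."""
    def reset(self):
        self.state = "clear floor"
        return self.state

    def step(self, action):
        # Invented reward scheme: cleaning a dirty spot is good, bumping the chair is bad.
        if self.state == "dirty spot" and action == "clean":
            reward = +1
        elif self.state == "near chair" and action == "move forward":
            reward = -1
        else:
            reward = 0
        # The room changes in a random (made-up) way after each action.
        self.state = random.choice(["dirty spot", "clear floor", "near chair"])
        return self.state, reward

env = CleaningRobotEnv()
state = env.reset()
total_reward = 0
for t in range(10):
    action = random.choice(["move forward", "turn left", "clean", "stop"])  # no learning yet
    state, reward = env.step(action)
    total_reward += reward
print("Total reward collected:", total_reward)
```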
RL Elements
Aside from the agent and the environment, a reinforcement learning model has four essential
components: a policy, a reward, a value function, and an environment model.
Policy:
Think of a policy as a set of rules or instructions for the agent to decide what to do.
It's like a strategy or a plan that the agent follows to take actions in different situations.
For example, in a game, a policy could be: "When the enemy is close, run away. When you see a
treasure, go get it."
The policy helps the agent know how to act in the environment.
Deterministic Policy:
It's like a strict rule where the agent always does the same thing in a given situation, no
matter what.
Stochastic Policy:
It's more flexible. The agent sometimes does one thing and sometimes another in the same
situation, based on probabilities, like rolling a die to decide.
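As a rough sketch (with made-up states, actions, and probabilities), the difference between the two kinds of policy looks like this in code:

```python
import random

def deterministic_policy(state):
    # A strict rule: the same state always maps to the same action.
    rules = {"enemy close": "run away", "treasure visible": "grab treasure"}
    return rules.get(state, "explore")

def stochastic_policy(state):
    # Actions are sampled from a probability distribution that depends on the state.
    if state == "enemy close":
        return random.choices(["run away", "fight"], weights=[0.8, 0.2])[0]
    return random.choices(["explore", "rest"], weights=[0.7, 0.3])[0]

print(deterministic_policy("enemy close"))  # always 'run away'
print(stochastic_policy("enemy close"))     # usually 'run away', sometimes 'fight'
```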
Reward:
Imagine rewards as prizes or scores the agent earns for doing things right.
It's like getting a gold star for completing a homework assignment or a treat for doing a trick.
Rewards tell the agent if it's doing well or not in achieving its goals.
For instance, in a maze-solving task, the agent gets a large reward when it finds the exit and a smaller
(or negative) reward for taking wrong turns.
Value Function:
Think of the value function as a guide that helps the agent decide how good different situations or
actions are.
It's like having a map that tells you which paths are better and which ones are worse.
The value function helps the agent understand which states or actions are more likely to lead to high
rewards.
For example, in a chess game, it might tell the agent that being in a strong position on the board is
better than being in a weak one.
Environment Model:
An environment model is like a simulator or a copy of the world that the agent uses to practice and
learn without any real consequences.
It's like having a video game where you can try different strategies without any real-life risks.
The agent can use this model to predict what might happen if it takes certain actions in the real
environment.
Markov Property
Imagine you're playing a game, and you want to decide your next move. The Markov Property says
that all you need to know to make a good decision is the current situation or state you're in. You
don't need to remember the entire history of the game; just knowing your current state is enough.
For example, if you're playing chess, knowing the current position of the pieces on the board is all
you need to decide your next move. You don't have to remember all the previous moves.
Formally, if S[t] denotes the current state of the agent and S[t+1] denotes the next state, the Markov
Property can be written as:
P(S[t+1] | S[t]) = P(S[t+1] | S[1], S[2], ..., S[t])
This equation means that the transition from state S[t] to S[t+1] is entirely independent of the earlier
history: conditioning on the whole past (the RHS) gives the same probability as conditioning on the
current state alone (the LHS) whenever the system has the Markov Property.
Bellman Equation
The Bellman equation in reinforcement learning expresses how valuable it is to be in a certain
state in a recursive way: the value of a state is the immediate reward plus the discounted value
of the states that can follow it. It tells us how much total reward we can expect to collect from
that state going forward, so it acts as a way to calculate how good or bad a place or situation is
for making decisions.
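As a minimal sketch of this idea, the value of a state can be computed by a one-step "Bellman backup" that combines the immediate reward with the discounted value of possible next states. The two-state example data below (the states, P, and R) is entirely hypothetical.

```python
def bellman_backup(s, V, A, P, R, gamma=0.9):
    """One Bellman optimality backup for state s:
    V(s) = max over a of [ R[s][a] + gamma * sum over s' of P[s][a][s'] * V(s') ]"""
    return max(
        R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
        for a in A
    )

# Tiny made-up example with two states and one action.
V = {"s1": 0.0, "s2": 0.0}
P = {"s1": {"go": {"s2": 1.0}}, "s2": {"go": {"s1": 1.0}}}
R = {"s1": {"go": 1.0}, "s2": {"go": 0.0}}
print(bellman_backup("s1", V, ["go"], P, R))  # 1.0 on the first backup
```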
A Markov Decision Process (MDP) is a mathematical framework used in Reinforcement Learning (RL)
to model decision-making problems. It's characterized by a tuple (S,A,P,R,γ), where:
S represents the set of states in the environment.
A is the set of actions that an agent can take in each state.
P is the state transition probability matrix. It represents the probability of transitioning from one
state to another by taking a specific action.
R is the reward function, which defines the immediate reward an agent receives upon transitioning
from one state to another by taking a specific action.
γ is the discount factor that represents the importance of future rewards relative to immediate
rewards. It's a value between 0 and 1.
The MDP framework assumes the Markov property, which means the future state depends only on
the current state and action, not on the history of states and actions that led to the current state.
The goal in RL, within the MDP framework, is to find the optimal policy. A policy defines the agent's
strategy or behavior, specifying which action to take in each state. The optimal policy is the one that
maximizes the expected cumulative reward over time.
In summary, MDPs provide a formalism for modeling decision-making problems in a way that allows
RL algorithms to learn optimal strategies by interacting with an environment, receiving feedback in
the form of rewards, and updating their policies based on this feedback.
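For a concrete picture, a tiny MDP can be written down directly as the tuple (S, A, P, R, γ). The two-state "weather" example below is entirely made up; it only shows the shape of the data an RL algorithm works with.

```python
S = ["sunny", "rainy"]                      # states
A = ["walk", "bus"]                         # actions
P = {                                       # P[s][a][s'] = transition probability
    "sunny": {"walk": {"sunny": 0.8, "rainy": 0.2}, "bus": {"sunny": 0.6, "rainy": 0.4}},
    "rainy": {"walk": {"sunny": 0.3, "rainy": 0.7}, "bus": {"sunny": 0.5, "rainy": 0.5}},
}
R = {                                       # R[s][a] = immediate reward
    "sunny": {"walk": 2.0, "bus": 1.0},
    "rainy": {"walk": -1.0, "bus": 0.5},
}
gamma = 0.9                                 # discount factor between 0 and 1

# Sanity check: outgoing transition probabilities sum to 1 for every state-action pair.
for s in S:
    for a in A:
        assert abs(sum(P[s][a].values()) - 1.0) < 1e-9
```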
Markov Chain
In reinforcement learning (RL), Markov chains serve as a foundational concept for modeling stochastic
environments. A Markov chain is a mathematical system that transitions between different states while
satisfying the following key properties:
1. Markov Property:
The probability of transitioning to the next state depends only on the current state and not on
the sequence of previous states.
2. Homogeneity Property:
In a homogeneous Markov chain, the transition probabilities between states remain
constant over time. This property is often expressed as:
P(X_n+1 = j | X_n = i) = P(X_1 = j | X_0 = i) for all n, i, and j.
3. Irreducibility Property:
An irreducible Markov chain means that it is possible to reach any state from any
other state with a positive probability over some number of steps. In other words,
there are no isolated sets of states within the chain.
These properties are fundamental in understanding and analyzing Markov Chains, which are
often used to model systems where events evolve in a probabilistic manner, like random
walks or queueing systems.
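A homogeneous Markov chain can be simulated in a few lines; the two-state weather transition probabilities below are invented for illustration, and the next state is sampled using only the current state, exactly as the Markov property requires.

```python
import random

# Hypothetical homogeneous Markov chain: transition probabilities never change over time.
transition = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}

def simulate(start, steps):
    state, path = start, [start]
    for _ in range(steps):
        next_states = list(transition[state].keys())
        probs = list(transition[state].values())
        state = random.choices(next_states, weights=probs)[0]  # depends only on current state
        path.append(state)
    return path

print(simulate("sunny", 10))
```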
Applications of Markov chains span various fields, including economics, biology, information
theory, and more. They are utilized in modeling systems that exhibit probabilistic transitions
between states, such as predicting stock market trends, analyzing DNA sequences, modeling
random walks, and simulating queueing systems.
Bandit Optimality
Bandit optimality refers to achieving the best possible performance in a problem known as the multi-
armed bandit. This problem is named after a hypothetical scenario where a gambler is facing
multiple slot machines (or "one-armed bandits") with different payout probabilities (unknown to the
gambler). The gambler aims to maximize their total winnings by deciding which machines to play and
how often.
1. Multiple Options (Arms) with Unknown Rewards: Imagine you're faced with
several choices, like different buttons to press, each leading to a surprise gift, but you
don't know which button gives the best gift. Each button has a hidden chance of
giving a reward.
2. Making Choices and Getting Rewards: You have to keep choosing and pressing
these buttons one after the other. Every time you press a button, you receive a gift,
but you don't know in advance what you'll get from each button.
3. Goal: Maximize Total Rewards: Your main aim is to get as many good gifts as
possible by pressing buttons multiple times. Ultimately, you want to figure out which
buttons tend to give the best gifts and use that information to get the most gifts
overall.
This creates a trade-off between two competing needs:
1. Exploration: You want to try different machines to understand which ones are the
most rewarding.
2. Exploitation: You also want to keep playing the machines that seem to pay out well
based on what you've learned.
The key is to balance trying new machines (exploration) and playing the ones you think are
best (exploitation) to maximize your total winnings over time.
Bandit optimality is about finding the best strategy to balance trying new machines and
sticking with the ones that seem to pay out the most. The goal is to win as much money as
possible while learning about the machines' payout rates.
Researchers study different strategies to see which ones work best in the long run. They look
at regret (how much you might have missed out on by not choosing the best machines
earlier), convergence (how strategies perform as they get more chances to play), and overall
performance to figure out which strategies help you win the most money over time.
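One common strategy for balancing exploration and exploitation is epsilon-greedy: with a small probability you explore a random machine, otherwise you exploit the one that currently looks best. The payout probabilities below are hidden, made-up values used only to simulate the machines; this is a sketch, not a statement about which strategy is optimal.

```python
import random

true_payout_probs = [0.2, 0.5, 0.7]           # hidden from the gambler; invented for the demo
counts = [0, 0, 0]                            # how often each arm was pulled
estimates = [0.0, 0.0, 0.0]                   # running estimate of each arm's payout
epsilon = 0.1                                 # fraction of the time we explore

total = 0
for t in range(1000):
    if random.random() < epsilon:
        arm = random.randrange(3)             # explore: try a random machine
    else:
        arm = estimates.index(max(estimates)) # exploit: play the best-looking machine
    reward = 1 if random.random() < true_payout_probs[arm] else 0
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental average
    total += reward

print("Estimated payouts:", [round(e, 2) for e in estimates])
print("Total winnings:", total)
```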
AMP
Imagine you're trying to teach a computer program to play a game.
Initially, when it starts learning, its performance might not be great. It might make mistakes
and not win much. However, as it plays more and more games, it gets better and its average
performance improves.
AMP focuses on understanding how well the learning algorithm performs in the long run,
after it has had lots of chances to learn and get better. It helps us assess whether the
learning method is consistently improving and approaching its best possible performance as
it gains more experience.
So, when we talk about AMP in RL, we're looking at how good the learning process becomes
after a lot of practice or experience, and whether it reaches a stable or optimal level of
performance over time. It helps us evaluate if the learning algorithm is getting better and
better at the task it's learning to solve.
UNIT 2
Dynamic programming is a method for solving problems that involve making a sequence of decisions
over time. In the context of Markov Decision Processes (MDPs), it's a way to find the best strategy to
make decisions in order to achieve a goal.
So, dynamic programming for MDPs is like solving a puzzle by repeatedly estimating the value of
being in different states and finding the best actions to take in each state. It's a systematic way to make
decisions that lead to the best possible outcome in a world with uncertainty and rewards.
Bellman Optimality:
Dynamic programming is a problem-solving approach that helps us find the best solution to a
complex problem by breaking it into smaller, simpler parts. The key idea behind dynamic
programming is the "principle of optimality," which says that the best solution to the overall problem
can be achieved by combining the best solutions to its smaller sub-problems.
This principle works well for problems that have a finite or countable number of states, making it
easier to minimize the complexity of finding the best solution. However, it doesn't work as effectively
for problems with continuous state spaces, like inventory management or dynamic pricing, where
there are an infinite number of possible states.
In dynamic programming, we break the problem down into smaller sub-problems. The key insight is that
once we know the current state, the best decisions from that point onward do not depend on how we
arrived at that state. This allows us to separate the current decision from future decisions and optimize
them in stages. We calculate the value of being in a certain state (Vπ(S)) as the reward we get in that
state (r(s, a)) plus the discounted value of being in the next state (γVπ(S')). This helps us find the best
solution step by step by combining solutions to smaller sub-problems.
So, in simple terms, dynamic programming helps us find the best solution to a complex problem by
breaking it into smaller pieces and combining the best solutions to those pieces, but it's more effective
for problems with a limited number of states and less so for problems with continuous states.
Policy Iteration and Value Iteration
Policy Iteration:
Policy Iteration is a dynamic programming algorithm used to find the optimal policy of an MDP.
The algorithm starts with an initial policy and iteratively improves it until convergence.
Policy Evaluation: Imagine you're playing a game, and you have a certain way of making
decisions in each situation (a policy). You want to know how good this policy is. So, you try it
out, see how much you win or lose, and keep adjusting it until it's as good as it can be.
Policy Improvement: Once you've figured out how good your current policy is, you make it
better by choosing the best actions in each situation. This helps you win more in the game.
You keep going back and forth between evaluating your policy and improving it until you
can't make it any better. That's when you've found the best way to play the game.
Each iteration is computationally expensive because it requires a full policy evaluation, which makes it
costly for large MDPs, but it typically converges in fewer iterations than value iteration.
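A compact sketch of policy iteration, assuming the same nested-dictionary MDP structures (S, A, P, R, gamma) as the hypothetical example in UNIT 1, might look like this:

```python
def policy_iteration(S, A, P, R, gamma=0.9, theta=1e-6):
    """Sketch of policy iteration. P[s][a] maps next states to probabilities;
    R[s][a] is the immediate reward; theta is the convergence threshold."""
    policy = {s: A[0] for s in S}                  # start from an arbitrary policy
    V = {s: 0.0 for s in S}
    while True:
        # Policy evaluation: estimate V for the current policy until it stops changing.
        while True:
            delta = 0.0
            for s in S:
                a = policy[s]
                v_new = R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to the evaluated V.
        stable = True
        for s in S:
            best = max(A, key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:                                  # no action changed, so stop
            return policy, V

# Example usage with the MDP sketched in UNIT 1: policy, V = policy_iteration(S, A, P, R, gamma)
```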
Value Iteration:
Value Iteration is a dynamic programming algorithm used to find the optimal value function of an
MDP.
Value Iteration Step: Instead of having a specific strategy (policy), you're more focused on
figuring out how good each situation is and what's the best thing to do in each situation.
You start with some guesses about how good each situation is.
Then, you keep updating your guesses by looking at how good nearby situations are and the
rewards you get for different actions. You do this over and over until your guesses don't
change much.
After you're done, you can tell what's the best thing to do in each situation based on your
final guesses.
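Value iteration can be sketched in the same style: repeatedly apply the Bellman optimality backup to every state until the guesses stop changing (again assuming the hypothetical MDP structures from UNIT 1).

```python
def value_iteration(S, A, P, R, gamma=0.9, theta=1e-6):
    """Sketch of value iteration over the same nested-dictionary MDP structures;
    theta is the convergence threshold."""
    V = {s: 0.0 for s in S}                         # initial guesses for every state
    while True:
        delta = 0.0
        for s in S:
            v_new = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()) for a in A)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:                           # guesses no longer change much
            break
    # Read off the best action in each state from the converged values.
    policy = {s: max(A, key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
              for s in S}
    return policy, V
```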
In short, policy iteration is about refining your strategy step by step, while value iteration is about
figuring out how good each situation is and what's the best action in each situation. Both methods
help you make better decisions in games or real-life situations.
UNIT 4
Function Approximation
Function approximation in Reinforcement Learning (RL) involves using parameterized functions, such as neural
networks, linear models, decision trees, or other models, to approximate and represent complex relationships
between states, actions, and values within an RL problem.
In Reinforcement Learning (RL), function approximation helps an agent figure out what actions
to take in different situations without keeping track of every single possibility.
Value Function Approximation: Instead of remembering the value of every specific action in every
situation (like a big table), we use a smart function, like a neural network, to guess these values. This
function takes in information about the situation (state) and predicts how good each action might be.
For example, think of a game where you have to make decisions. Instead of remembering the outcome
of every choice you've made before in every possible scenario, a function (like a neural network)
helps guess how good each choice might be based on the current situation.
Policy Approximation: Function approximation can also help directly with decision-making by
learning a strategy or plan (policy) for the agent. Instead of remembering a strict set of rules for each
situation, a function, such as a neural network, learns to suggest the best action to take given a certain
situation.
For instance, consider learning how to play a video game. Rather than memorizing a list of
instructions for every level, a function (like a neural network) learns to guide your actions based on
what it has learned about the game.
So, these methods use smart functions (like neural networks) to help the agent make decisions and
learn strategies without needing to remember every single detail of every situation.
Basic Equation for Value Function Approximation: In the context of RL, the value function (V) for a
given state (S) is usually approximated as a weighted sum of features (F) with some adjustable
parameters (θ):
V(S) ≈ θ₁ * F₁(S) + θ₂ * F₂(S) + θ₃ * F₃(S) + ... + θₙ * Fₙ(S)
V(S) represents the estimated value of being in state S. F₁(S), F₂(S), ..., Fₙ(S) are feature functions
that describe the relevant characteristics of the state. θ₁, θ₂, ..., θₙ are the parameters of the function
that need to be learned.
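A minimal sketch of this weighted-sum idea, using hand-picked polynomial features for a one-dimensional state and a single made-up update target, could look like this:

```python
import numpy as np

def features(state):
    # Hypothetical feature functions F1(S), F2(S), F3(S) for a one-dimensional state.
    return np.array([1.0, state, state ** 2])

theta = np.zeros(3)                  # parameters θ1, θ2, θ3 to be learned
alpha = 0.01                         # step size

def V(state):
    return theta @ features(state)   # V(S) ≈ θ1*F1(S) + θ2*F2(S) + θ3*F3(S)

# One gradient-style update nudging V(state) toward a target value (e.g., a TD target).
state, target = 0.5, 1.2
theta = theta + alpha * (target - V(state)) * features(state)
print(round(V(state), 4))
```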
Basic Equation for Policy Approximation: For policy approximation, a similar concept applies. The
probability of taking an action (A) in a given state (S) is approximated using a function that depends
on adjustable parameters (θ):
π(A|S) ≈ θ₁ * ϕ₁(S, A) + θ₂ * ϕ₂(S, A) + θ₃ * ϕ₃(S, A) + ... + θₘ * ϕₘ(S, A)
π(A|S) represents the estimated probability of taking action A in state S. ϕ₁(S, A), ϕ₂(S, A), ..., ϕₘ(S, A)
are feature functions that describe the state-action pairs. θ₁, θ₂, ..., θₘ are the parameters of the
policy function.
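A sketch of policy approximation follows the same pattern; in practice the linear preferences θ·ϕ(S, A) are usually passed through a softmax so the action probabilities are non-negative and sum to 1. The features, actions, and parameter values here are invented for illustration.

```python
import numpy as np

actions = ["left", "right"]

def phi(state, action):
    # Hypothetical state-action features ϕ1(S, A), ϕ2(S, A).
    return np.array([1.0, state if action == "right" else -state])

theta = np.array([0.1, 0.5])         # policy parameters θ1, θ2

def policy(state):
    prefs = np.array([theta @ phi(state, a) for a in actions])  # linear preferences θ·ϕ(S, A)
    probs = np.exp(prefs - prefs.max())                         # softmax for valid probabilities
    return probs / probs.sum()

print(dict(zip(actions, policy(0.8).round(3))))
```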
For example, if you have data points representing the growth of a plant over time, you can use
function approximation to find an equation that accurately describes how the plant's height changes as
a function of time. This equation can then be used for predictions, analysis, or simply understanding
the data better.
Least Squares Method:
The Least Squares Method is a specific technique within function approximation, most commonly used
for finding the equation of a straight line that best fits a set of data points.
For instance, if you have data on the relationship between hours of study and exam scores, you can
use the Least Square Method to find the best-fitting straight line that describes how studying time
relates to exam performance.
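As a small sketch with invented data, fitting that straight line by least squares takes only a few lines (np.polyfit minimizes the sum of squared errors):

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5], dtype=float)        # hours of study (made-up data)
scores = np.array([52, 58, 65, 71, 78], dtype=float)  # exam scores (made-up data)

w, b = np.polyfit(hours, scores, deg=1)                # best-fitting line: score ≈ w*hours + b
print(f"score ≈ {w:.2f} * hours + {b:.2f}")
```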
Eligibility Traces
Imagine you're trying to teach a robot to clean a room. The robot takes actions like moving around, picking up
objects, and cleaning. In RL terms, each action it takes might result in some immediate reward (like picking up a
dirty item) and might also affect future rewards (like making the room cleaner).
Eligibility traces are a concept used in Reinforcement Learning (RL) that helps in credit assignment—
determining which actions or states are responsible for the received rewards. They are a way to assign
credit for received rewards to the actions or states that contributed to those rewards, even if they
occurred several time steps earlier.
The idea behind eligibility traces is to maintain a memory or trace of the recent states and actions that
have been visited by the agent. This memory, represented as eligibility traces, influences how much
credit is assigned to certain actions or states when a reward is received.
There are different types of eligibility traces, such as accumulating traces (accumulating credit over
multiple time steps) and replacing traces (replacing old traces with new ones). The two most
commonly used types of traces are:
Accumulating Traces: In accumulating traces, the trace value builds up over time, accumulating
credit for states or actions that are visited repeatedly. It's often updated using the following formula:
e_t(s) = γλ e_{t-1}(s) + 1(S_t = s)
where γ is the discount factor, λ is the trace-decay parameter, and 1(S_t = s) is an indicator that equals
1 if the current state is s and 0 otherwise.
Replacing Traces: In replacing traces, the trace of the state just visited is reset to 1 instead of being
incremented:
e_t(s) = 1 if S_t = s, and e_t(s) = γλ e_{t-1}(s) otherwise.
Eligibility traces are often used in conjunction with Temporal Difference (TD) learning methods, such
as TD(λ) or SARSA(λ), to update value functions or policies.
By using eligibility traces, RL agents can better assign credit over time, effectively handle delayed
rewards, and learn more efficiently from experiences in environments where rewards might not be
immediate or clear-cut.
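A rough sketch of tabular TD(λ) with accumulating traces on a made-up five-state chain (state 4 is the goal) shows how the trace spreads credit backwards over recently visited states; the chain, rewards, and parameter values are all invented for illustration.

```python
import random

states = list(range(5))                      # a made-up 5-state chain; state 4 is terminal
V = {s: 0.0 for s in states}
gamma, lam, alpha = 0.9, 0.8, 0.1

for episode in range(200):
    e = {s: 0.0 for s in states}             # eligibility traces start at zero
    s = 0
    while s != 4:
        s_next = min(s + 1, 4) if random.random() < 0.7 else max(s - 1, 0)
        r = 1.0 if s_next == 4 else 0.0      # reward only when the goal is reached
        delta = r + gamma * (0.0 if s_next == 4 else V[s_next]) - V[s]
        e[s] += 1.0                          # accumulating trace: e_t(s) = γλ e_{t-1}(s) + 1(S_t = s)
        for st in states:
            V[st] += alpha * delta * e[st]   # credit every recently visited state
            e[st] *= gamma * lam             # decay all traces
        s = s_next

print({s: round(v, 2) for s, v in V.items()})
```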
UNIT 5
Creating real-time applications in reinforcement learning (RL) is an exciting area with numerous
potential use cases; two prominent examples are natural language processing (NLP) and recommendation
systems.
In both NLP and system recommendation, RL enables systems to adapt and improve continuously,
resulting in more efficient and user-centric decision-making processes, ultimately benefiting both
businesses and end-users.