
Assignment 6: Understanding and Implementing Hopfield Networks for Associative Memory and Combinatorial Optimization

Nimisha Kushwaha (Roll number: 202251079), Garima Singh (Roll number: 202251047), Pankaj (Roll number: 202251083), Diya Manth (Roll number: 202251043)
Computer Science, IIIT Vadodara

Abstract—This lab assignment explores the Hopfield network, focusing on its utility in associative memory and solving combinatorial optimization problems. We implement a 10x10 binary associative memory and evaluate the network's storage capacity. Additionally, we assess its error-correcting capability and apply the network to solve the Eight-Rook and Traveling Salesman Problems (TSP). Through careful design of energy functions and weight matrices, the study highlights the practical applications of Hopfield networks in pattern recognition and optimization.
I. INTRODUCTION

Hopfield networks are a class of recurrent artificial neural networks introduced by John Hopfield in 1982. These networks are known for their ability to store and recall patterns through associative memory and solve optimization problems using energy minimization. This lab aims to achieve the following:
• Implement and analyze a 10x10 binary associative memory.
• Evaluate the storage capacity and error-correcting capability of the Hopfield network.
• Solve combinatorial problems such as the Eight-Rook Problem and a Traveling Salesman Problem (TSP) using a Hopfield network.

Fig. 1. Diagram of a Hopfield Network
II. IMPLEMENTATION AND ANALYSIS

A. 10x10 Binary Associative Memory

The Hopfield network is implemented to store binary patterns in a 10x10 grid. The weight matrix is calculated using the Hebbian learning rule:

W_{ij} = \frac{1}{N} \sum_{p=1}^{P} \xi_i^p \xi_j^p, \quad i \neq j    (1)

where P is the number of patterns, \xi_i^p is the i-th element of the p-th pattern, and N is the number of neurons.

1) Storage Capacity: The theoretical storage capacity of a Hopfield network is approximately 0.15N, where N is the number of neurons. For a 10x10 grid (N = 100), the capacity is around 15 distinct patterns. Simulations confirm this estimate, as additional patterns degrade recall accuracy due to spurious states.

2) Error-Correcting Capability: To evaluate error correction, patterns are stored and then presented to the network with random bit flips. The network successfully corrects up to 20% bit flips for stored patterns. The performance diminishes with higher noise levels, highlighting the network's inherent limitations.
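As a concrete illustration of the storage and recall procedure described above, the following sketch builds the Hebbian weight matrix of equation (1) for bipolar (±1) patterns and recalls a stored pattern from a 20%-corrupted probe. The function names and the asynchronous update scheme are illustrative choices, not the report's actual code.

```python
import numpy as np

def train_hopfield(patterns):
    """Hebbian weight matrix for bipolar (+1/-1) patterns of length N."""
    _, n_neurons = patterns.shape
    W = patterns.T @ patterns / n_neurons   # (1/N) * sum_p xi_i^p xi_j^p
    np.fill_diagonal(W, 0.0)                # no self-connections (i != j)
    return W

def recall(W, state, n_passes=10):
    """Asynchronous updates until the state settles (or n_passes elapse)."""
    state = state.copy()
    for _ in range(n_passes):
        for i in np.random.permutation(len(state)):
            state[i] = 1 if W[i] @ state >= 0 else -1
    return state

# Store a few random 10x10 patterns and recall one from a 20%-corrupted probe.
rng = np.random.default_rng(0)
patterns = rng.choice([-1, 1], size=(5, 100))
W = train_hopfield(patterns)

probe = patterns[0].copy()
flip = rng.choice(100, size=20, replace=False)   # 20% bit flips
probe[flip] *= -1
recovered = recall(W, probe)
print("bits recovered correctly:", np.sum(recovered == patterns[0]), "/ 100")
```

Storing many more than about 15 such patterns makes recall progressively less reliable, in line with the 0.15N capacity estimate.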
B. Eight-Rook Problem

The Eight-Rook problem requires placing eight rooks on a chessboard such that no two rooks threaten each other. The energy function is designed to penalize invalid configurations:

E = \frac{1}{2} \sum_{i,j} w_{ij} s_i s_j - \sum_i \theta_i s_i    (2)

where w_{ij} ensures rooks are not placed in the same row or column, and \theta_i enforces one rook per row and column. Weights are chosen such that invalid placements result in higher energy, encouraging valid solutions during convergence.
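A minimal sketch of how the penalty weights and energy of equation (2) can be assembled for the 8x8 board is shown below; the penalty constant A and the helper names are assumptions made for illustration.

```python
import numpy as np

N_ROOKS = 8
N = N_ROOKS * N_ROOKS          # one neuron per board square
A = 2.0                        # penalty strength (illustrative choice)

def idx(row, col):
    return row * N_ROOKS + col

# w[i, j] > 0 penalizes two rooks sharing a row or a column (eq. 2).
w = np.zeros((N, N))
for r1 in range(N_ROOKS):
    for c1 in range(N_ROOKS):
        for r2 in range(N_ROOKS):
            for c2 in range(N_ROOKS):
                i, j = idx(r1, c1), idx(r2, c2)
                if i != j and (r1 == r2 or c1 == c2):
                    w[i, j] = A

theta = np.full(N, A)          # bias encouraging one rook per row/column

def energy(s):
    """Energy of a 0/1 placement vector s, following eq. (2)."""
    return 0.5 * s @ w @ s - theta @ s

valid = np.zeros((N_ROOKS, N_ROOKS))
np.fill_diagonal(valid, 1)                      # rooks on the main diagonal: valid
clash = valid.copy(); clash[1, 1] = 0; clash[0, 1] = 1   # two rooks share row 0
# the clashing placement has strictly higher energy
print("valid energy:", energy(valid.ravel()), "| invalid energy:", energy(clash.ravel()))
```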
Fig. 2. Eight-Rook Problem Solution: Rooks placed on a chessboard such that no two threaten each other.
C. Traveling Salesman Problem (TSP)

The TSP involves finding the shortest route visiting a set of cities exactly once. For 10 cities, the Hopfield network is modeled with N = 100 neurons, representing city-time combinations. The energy function is defined as:

E = A \sum_i \left( \sum_j x_{ij} - 1 \right)^2 + B \sum_j \left( \sum_i x_{ij} - 1 \right)^2 + C \sum_{i,j} d_{ij} x_{ij} x_{i'j'}    (3)

where x_{ij} indicates visiting city i at time j, d_{ij} is the distance between cities, and A, B, C are weighting factors. With 10 cities, 10^2 = 100 neurons are required, giving a 100 x 100 weight matrix.
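The sketch below evaluates an energy of this form for a candidate tour encoded as a 10x10 city-by-time matrix. The penalty constants A, B, C and the convention that the distance term couples consecutive time steps are assumptions for illustration, not values taken from the report.

```python
import numpy as np

def tsp_energy(x, dist, A=500.0, B=500.0, C=1.0):
    """Energy of a city-by-time assignment matrix x (eq. 3, with the distance
    term taken over consecutive time steps, an assumed convention)."""
    n = x.shape[0]
    row_term = A * np.sum((x.sum(axis=1) - 1) ** 2)   # each city visited once
    col_term = B * np.sum((x.sum(axis=0) - 1) ** 2)   # one city per time step
    tour_len = 0.0
    for t in range(n):
        t_next = (t + 1) % n
        tour_len += x[:, t] @ dist @ x[:, t_next]     # distance between steps t, t+1
    return row_term + col_term + C * tour_len

rng = np.random.default_rng(1)
pts = rng.random((10, 2))
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

perm = rng.permutation(10)                 # a valid tour as a permutation matrix
x_valid = np.eye(10)[perm]
x_invalid = np.zeros((10, 10)); x_invalid[0, :] = 1   # city 0 visited at every step
print("valid tour energy:  ", tsp_energy(x_valid, dist))
print("invalid tour energy:", tsp_energy(x_invalid, dist))
```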

Fig. 3. Traveling Salesman Problem: A Hopfield network solution for 10 cities.

III. RESULTS AND DISCUSSION

The Hopfield network demonstrates robust performance for associative memory tasks, with reliable pattern recall and error correction for moderate noise levels. For the Eight-Rook problem, the network converges to valid configurations efficiently. Solving the TSP with 10 cities highlights the network's potential in optimization, though scalability remains a challenge for larger instances due to increased weight complexity.

IV. CONCLUSION

This lab showcases the versatility of Hopfield networks in associative memory and combinatorial optimization. While effective for small-scale problems, future work should explore advanced techniques to enhance scalability and minimize spurious states. Applications in real-world scenarios further validate the practical utility of these networks.

REFERENCES

[1] J. J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities," Proceedings of the National Academy of Sciences, vol. 79, no. 8, pp. 2554–2558, 1982.
[2] D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press, 2003.
Week 7 - Basics of data structures for state-space search tasks and use of random numbers required for MDP and RL; understanding exploitation and exploration in a simple n-arm bandit reinforcement learning task; the epsilon-greedy algorithm

Nimisha Kushwaha (Roll number: 202251079), Garima Singh (Roll number: 202251047), Pankaj (Roll number: 202251083), Diya Manth (Roll number: 202251043)
Computer Science, IIIT Vadodara

Abstract—This submission addresses multiple reinforcement learning tasks. First, it reviews and analyzes the MENACE system by Donald Michie, identifying key implementation components and attempting to recreate it in code. Next, it explores a binary bandit problem using an epsilon-greedy algorithm to optimize action selection. Finally, a 10-armed bandit with non-stationary rewards is developed, and a modified epsilon-greedy agent is used to track evolving rewards, with its performance evaluated over 10,000 time steps. These tasks apply core reinforcement learning principles to adaptive decision-making in dynamic environments.

I. GITHUB LINK

GitHub link for all codes: GitHub Link, click here

II. PROBLEM STATEMENT

The task is to read the reference on MENACE by Donald Michie and examine its implementations. The goal is to pick the implementation that you find most compelling, carefully analyze the code, and highlight the key sections that contribute significantly to its function. If feasible, an attempt should be made to code the MENACE system in a programming language of your choice.

III. MENACE OVERVIEW

MENACE operates as a trial-and-error learning system, where each game position is represented by a distinct matchbox containing beads of various colors. The colors correspond to different potential moves, and the machine learns through reinforcement based on game outcomes:
• Positive Reinforcement: If the machine wins, it is rewarded by adding more beads of the same color as the selected move to the corresponding matchbox. This increases the likelihood of selecting that move in future games.
• Negative Reinforcement: If the machine loses, the bead corresponding to the selected move is removed, reducing the likelihood of that move being chosen again in similar game situations.

IV. COMPUTER SIMULATION PROGRAM

In the computerized version of MENACE, two key differences arise:
• Backwards Stage Evaluation: Moves are evaluated starting from the end of the game and working backward, making decisions in later stages more crucial as the game progresses.
• Odds System: Move probabilities are adjusted using odds, defined as Odds = p / (1 - p), where p represents the probability of selecting a particular move.

V. REINFORCEMENT ADJUSTMENT

The reinforcement system modifies the odds after each game based on the outcome:
• Victory: Odds are multiplied by a reinforcement multiplier M_n.
• Defeat: Odds are divided by M_n.

VI. SLIDING ORIGIN MECHANISM

A dynamic reinforcement multiplier is introduced, based on the player's performance history. Outcomes (win, draw, or loss) are weighted using a decay factor D, giving more recent games a higher weight:
• Win: R_n = M_{n-µ+1}
• Draw: R_n = M_{n-µ}
• Loss: R_n = M_{n-µ-1}
This adaptive technique fine-tunes the reinforcement based on past outcomes, encouraging better strategies over time and improving overall gameplay. As more games are played, the system approaches expert-level performance through repeated cycles of learning.
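A small sketch of the odds-based reinforcement described in Sections IV-VI follows; the move labels, the fixed multiplier value, and the function names are illustrative assumptions rather than Michie's original implementation.

```python
import random

def select_move(odds):
    """Pick a move with probability proportional to its odds (p / (1 - p))."""
    moves, weights = zip(*odds.items())
    return random.choices(moves, weights=weights, k=1)[0]

def reinforce(odds, played_moves, outcome, multiplier=1.5):
    """Multiply odds of the played moves on a win, divide them on a loss."""
    for move in played_moves:
        if outcome == "win":
            odds[move] *= multiplier
        elif outcome == "loss":
            odds[move] /= multiplier
        # draws are left unchanged in this simplified sketch
    return odds

# One matchbox (game position) with three candidate moves at even odds.
box = {"corner": 1.0, "edge": 1.0, "centre": 1.0}
move = select_move(box)
box = reinforce(box, [move], "win")
print(move, box)
```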
VII. PROBLEM STATEMENT - 02

The objective of this project is to maximize expected rewards in a binary bandit setting involving two independent bandits, binaryBanditA and binaryBanditB. Each bandit offers two actions, yielding binary rewards of 1 (success) or 0 (failure) based on a stationary stochastic process. The implementation of an epsilon-greedy algorithm will guide action selection, balancing exploration and exploitation to enhance cumulative rewards.

VIII. METHODOLOGY

A. Initialization

To begin, the following parameters are established:
• Expected Reward Estimates: The initial expected rewards for both bandits are set to zero: Q_A = Q_B = 0.
• Action Count Initialization: The count of actions taken for each bandit is initialized to zero: N_A = N_B = 0.

B. Epsilon-Greedy Algorithm

An epsilon value of ε = 0.1 is employed to regulate the exploration-exploitation trade-off:
• Exploration: With a probability of ε, a random bandit (A or B) is selected.
• Exploitation: With a probability of 1 - ε, the bandit with the highest estimated reward is chosen.
During each iteration, a random number k is generated:
• If k < 0.1, a random bandit is chosen.
• If k ≥ 0.1, the bandit with the highest expected reward is selected.

C. Reward Update

After selecting a bandit, the expected reward estimates are updated using the formulas:
• For Bandit A: Q_new(A) = Q_old(A) + [R(A) - Q_old(A)] / N(A)
• For Bandit B: Q_new(B) = Q_old(B) + [R(B) - Q_old(B)] / N(B)
where R(A) and R(B) are the rewards received from selecting Bandit A or Bandit B, respectively.

IX. RESULTS

The epsilon-greedy algorithm was implemented over 10,000 iterations, yielding a total cumulative reward of X. The selection frequency revealed that Bandit A was chosen N_A times, while Bandit B was selected N_B times. The final estimated rewards were Q_A = Y_A and Q_B = Y_B, with approximately 10% of the actions involving exploration, demonstrating the algorithm's efficacy in balancing exploration and exploitation.

X. PROBLEM STATEMENT - 03

The objective is to develop a 10-armed bandit model where the mean rewards of ten arms start equal and evolve through independent random walks. Each mean reward is updated at each time step by adding a normally distributed increment with a mean of zero and a standard deviation of 0.01. This dynamic setup simulates fluctuating rewards over time, necessitating an effective strategy to maximize cumulative rewards via the bandit_nonstat(action) function.

XI. METHODOLOGY

The methodology for implementing the 10-armed bandit problem consists of the following steps (a code sketch follows this list):
• Initialization:
  – Set the number of arms N = 10.
  – Set the number of iterations T = 1000.
  – Initialize the estimated rewards Q and action counts N as zero vectors: Q = 0 ∈ R^{1×N}, N = 0 ∈ R^{1×N}.
  – Initialize the mean rewards array with equal values: mean_rewards = 0 ∈ R^{1×N}.
• Action Selection:
  – Select the action using a greedy strategy by choosing the arm with the highest estimated reward from the array Q.
  – Identify the index of the maximum value in Q: action = argmax_i Q[i].
  – This selection maximizes the expected reward based on prior information.
• Reward Generation and Update: Obtain the reward for the selected action by calling the function bandit_nonstat: reward = bandit_nonstat(action, mean_rewards). After obtaining the reward, update the action count and the estimated reward for the selected action: N(action) = N(action) + 1 and Q(action) = Q(action) + (1 / N(action)) · (reward - Q(action)).
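A compact sketch of the greedy sample-average loop and a bandit_nonstat reward function consistent with the description above is given here. The noise and drift parameters follow Sections X and XII, while the variable names and seeding are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ARMS, T = 10, 1000

def bandit_nonstat(action, mean_rewards):
    """Noisy reward for one arm, then a random walk of all arm means."""
    reward = mean_rewards[action] + rng.normal(0.0, 1.0)       # observation noise
    mean_rewards += rng.normal(0.0, 0.01, size=N_ARMS)         # non-stationary drift
    return reward

Q = np.zeros(N_ARMS)             # estimated rewards
N = np.zeros(N_ARMS)             # action counts
mean_rewards = np.zeros(N_ARMS)  # true (drifting) means, all equal at the start

avg_reward = []
total = 0.0
for t in range(1, T + 1):
    action = int(np.argmax(Q))                      # greedy selection
    reward = bandit_nonstat(action, mean_rewards)
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]   # sample-average update
    total += reward
    avg_reward.append(total / t)                    # cumulative average reward

print("final estimates:", np.round(Q, 3))
print("final cumulative average reward:", round(avg_reward[-1], 3))
```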
This update rule ensures that the estimated reward for the chosen action becomes more accurate over time as more data is collected, allowing the algorithm to better predict future rewards based on past experiences.

XII. FUNCTION ANALYSIS

– The algorithm logs rewards from each selected action to track performance.
– Cumulative average rewards are calculated by dividing total rewards by the number of iterations, facilitating effectiveness assessment.
The bandit_nonstat function simulates non-stationary rewards by adding normally distributed noise (mean = 0, standard deviation = 1) to the mean reward of the selected action, capturing essential reward variability for evaluating adaptability.

XIII. RESULTING GRAPH

The graph produced by the ten_armed_bandit function displays the average reward over time, with the x-axis representing iterations and the y-axis showing the cumulative average reward.

Fig. 1. Average Reward Over Time for the 10-Armed Bandit Algorithm

Initially, the average reward may fluctuate due to the random walk affecting the mean rewards. Over time, it is expected to stabilize as the algorithm identifies and favors the arms yielding higher rewards. This trend illustrates the algorithm's adaptation to non-stationary conditions and demonstrates the effectiveness of the greedy strategy in balancing exploration and exploitation within the 10-armed bandit framework.

XIV. PROBLEM STATEMENT - 04

The challenge in this section is to develop a 10-armed bandit system with non-stationary rewards, where each arm's expected reward undergoes a random walk. The basic epsilon-greedy algorithm, while effective for stationary environments, struggles in this non-stationary setting. To address this, a modified epsilon-greedy agent is implemented, utilizing a "forgetting factor" α, which prioritizes recent rewards over older data, allowing for better adaptation to dynamic reward changes.
The goal is to test this modified epsilon-greedy algorithm across 10,000 iterations and evaluate its performance in terms of identifying the most rewarding actions in a non-stationary context.

A. Modified Epsilon-Greedy Algorithm

In non-stationary environments, the challenge lies in adapting to changing rewards, as past performance may no longer accurately reflect future outcomes. The basic epsilon-greedy algorithm calculates the average of all rewards, which is not optimal for dynamic reward environments. To solve this issue, we incorporate a forgetting factor α into the reward update equation:

Q_new(a) = Q_old(a) + α · (R_observed - Q_old(a))

where:
– Q_new(a) is the updated estimated reward for action a.
– R_observed is the reward observed after taking action a.
– α is the forgetting factor, or step size, which ranges between 0 and 1.
This adjustment allows the algorithm to react faster to recent changes in the reward distributions, giving more importance to the current rewards over past performance, thus improving its ability to adapt to non-stationary environments.

B. Action Selection and Reward Update

As with the previous sections, action selection follows an epsilon-greedy policy. However, the update to the action-value estimates incorporates the forgetting factor:
– Action Selection: At each time step, the action is selected either greedily (with probability 1 - ε) or randomly (with probability ε).
– Reward Update: After observing the reward, the action-value estimate is updated using the forgetting factor α as per the equation above.

C. Experimental Setup

The 10-armed bandit problem is simulated for 10,000 time steps. At each step, the reward of each arm evolves via an independent random walk, defined by adding a normally distributed increment with a mean of 0 and standard deviation of 0.01 to each arm's mean reward. The performance of the modified epsilon-greedy algorithm is tracked by measuring the cumulative reward over time, along with the frequency of action selection for each arm. A sketch of this setup appears below.
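The following sketch implements the constant-step-size (forgetting-factor) update in a non-stationary 10-armed bandit over 10,000 steps, with ε = 0.1 and an assumed α = 0.1; the exact constants used in the report are not stated, so these values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
N_ARMS, T = 10, 10_000
EPSILON, ALPHA = 0.1, 0.1        # exploration rate and forgetting factor (assumed values)

Q = np.zeros(N_ARMS)             # action-value estimates
mean_rewards = np.zeros(N_ARMS)  # true means, drifting by a random walk
pulls = np.zeros(N_ARMS, dtype=int)
cumulative = 0.0

for t in range(T):
    # epsilon-greedy action selection
    if rng.random() < EPSILON:
        action = rng.integers(N_ARMS)
    else:
        action = int(np.argmax(Q))

    reward = mean_rewards[action] + rng.normal(0.0, 1.0)    # noisy observation
    Q[action] += ALPHA * (reward - Q[action])                # constant-alpha update
    pulls[action] += 1
    cumulative += reward

    mean_rewards += rng.normal(0.0, 0.01, size=N_ARMS)       # random-walk drift

print("cumulative reward:", round(cumulative, 1))
print("selection counts:", pulls)
```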
D. Results and Evaluation

After 10,000 iterations, the modified epsilon-greedy agent demonstrates a notable improvement over the standard epsilon-greedy approach in a non-stationary setting. The inclusion of the forgetting factor α allows the algorithm to adapt more effectively to the evolving reward landscape, as evidenced by:
– Higher cumulative reward: The agent achieves higher total rewards over time compared to the basic epsilon-greedy algorithm.
– Faster adaptation: The agent is quicker to shift its action selection toward arms with increasing rewards, indicating successful tracking of the non-stationary environment.
– Improved exploitation: With better action-value estimates, the agent exploits high-reward arms more effectively as time progresses.
The results suggest that the modified epsilon-greedy algorithm, with a properly tuned forgetting factor, can significantly outperform the standard approach in non-stationary environments.

Fig. 2. Performance of Modified Epsilon-Greedy Algorithm in a 10-Armed Bandit with Non-Stationary Rewards

This graph (Figure 2) illustrates the agent's cumulative reward over time, showing how the modified epsilon-greedy algorithm successfully tracks and exploits the most rewarding actions in the evolving environment. The x-axis represents the number of iterations, and the y-axis indicates the cumulative reward.

XV. CONCLUSION

The modified epsilon-greedy algorithm with a forgetting factor provides a robust solution to the problem of non-stationary rewards. It enables quicker adaptation to changes in the reward landscape, allowing for more effective decision-making in dynamic environments. The results from this 10-armed bandit simulation demonstrate the practical advantages of using a forgetting factor in reinforcement learning tasks with non-stationary conditions.
Assignment 8: Markov Decision Process and Dynamic Programming: Solving the Gbike Bicycle Rental Problem

Nimisha Kushwaha (Roll number: 202251079), Garima Singh (Roll number: 202251047), Pankaj (Roll number: 202251083), Diya Manth (Roll number: 202251043)
Computer Science, IIIT Vadodara

Abstract—This assignment explores the application of Markov Decision Processes (MDPs) and dynamic programming in solving real-world problems. Specifically, it involves solving the Gbike bicycle rental problem by formulating a finite MDP, applying policy iteration, and using the value iteration method. Additionally, we investigate the impact of varying reward functions in a Markov Decision Process environment.

I. INTRODUCTION

Markov Decision Processes (MDPs) provide a mathematical framework for sequential decision-making under uncertainty, where decisions impact the future state of the environment. This study investigates the Gbike bicycle rental problem, where we manage two locations for renting bicycles, and the goal is to formulate an MDP to optimize the decision-making process. We utilize dynamic programming, specifically value iteration and policy iteration, to find the optimal policy that maximizes the expected return from bicycle rentals.

II. PROBLEM FORMULATION

A. Markov Decision Process (MDP)

In the Gbike problem, we consider a finite MDP with the following components:
• States: The state is represented by the number of bikes at each location at the end of the day. This is a tuple of the number of bikes at each location, constrained by the maximum number of bikes that can be parked at a location (20 bikes per location).
• Actions: The action corresponds to the number of bikes moved from one location to another overnight, with a maximum of 5 bikes moved per night.
• Rewards: The reward depends on whether bikes are available for rent. If a bike is rented, the reward is INR 10. If no bike is available, the reward is 0. Additionally, moving bikes incurs a cost of INR 2 per bike, and a parking fee of INR 4 is charged if more than 10 bikes are parked at a location.
• Transition Probability: The transitions between states are probabilistic, influenced by random arrival and return rates of bikes, which follow Poisson distributions. The transitions are influenced by the action taken and the number of bikes rented or returned.
• Discount Factor: The discount factor is set to 0.9, considering the long-term nature of the business.

B. Finite MDP Formulation

The finite MDP can be described as:

V(s) = \max_a \left( \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V(s') \right] \right)

where:
• V(s) is the value function at state s,
• T(s, a, s') is the transition probability from state s to s' given action a,
• R(s, a, s') is the immediate reward for transitioning from s to s' under action a,
• γ is the discount factor.

III. SOLUTION APPROACH: POLICY ITERATION

We apply policy iteration to solve the Gbike problem. Policy iteration involves two main steps:
1) Policy Evaluation: Given a policy, we calculate the value function for each state.
2) Policy Improvement: Given the value function, we improve the policy by selecting the action that maximizes the expected return at each state.
The process repeats until the policy converges to the optimal policy.

A. Modified Gbike Problem

The modified Gbike problem introduces an additional constraint: one employee at the first location can shuttle one bike to the second location for free. This change reduces the cost of moving a bike between locations and impacts the policy iteration calculations. Additionally, a penalty of INR 4 is imposed if more than 10 bikes are parked overnight at any location, influencing the reward function.

B. Implementation of Policy Iteration

To implement policy iteration, we (see the sketch below):
• Initialize a random policy.
• Evaluate the policy using the Bellman equation.
• Improve the policy based on the value function.
• Iterate the policy evaluation and improvement steps until convergence.
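A compact policy-iteration skeleton for this formulation is sketched below. The Poisson rates for rentals and returns, the truncation cutoff, and the fixed evaluation-sweep counts are assumptions (the report does not list them); the move cost, parking fee, free shuttle, state cap, and discount factor follow the text.

```python
import numpy as np
from math import exp, factorial

MAX_BIKES, MAX_MOVE, GAMMA = 20, 5, 0.9
RENT, MOVE_COST, PARK_FEE = 10, 2, 4
LAMBDAS = [(3, 3), (4, 2)]   # assumed (rental, return) Poisson rates per location
CUTOFF = 13                  # truncate Poisson tails

def pois(k, lam):
    return exp(-lam) * lam ** k / factorial(k)

def location_model(lam_req, lam_ret):
    """Expected rental revenue R[n] and transition matrix P[n, n'] for one location."""
    R = np.zeros(MAX_BIKES + 1)
    P = np.zeros((MAX_BIKES + 1, MAX_BIKES + 1))
    for n in range(MAX_BIKES + 1):
        for req in range(CUTOFF):
            rented = min(req, n)
            R[n] += pois(req, lam_req) * RENT * rented
            for ret in range(CUTOFF):
                n_next = min(n - rented + ret, MAX_BIKES)
                P[n, n_next] += pois(req, lam_req) * pois(ret, lam_ret)
    return R, P

(R1, P1), (R2, P2) = location_model(*LAMBDAS[0]), location_model(*LAMBDAS[1])

def expected_return(s1, s2, a, V):
    """One-step expected return for moving a bikes overnight from location 1 to 2."""
    if a > s1 or -a > s2:
        return -np.inf
    n1 = min(s1 - a, MAX_BIKES)
    n2 = min(s2 + a, MAX_BIKES)
    cost = MOVE_COST * (max(a - 1, 0) if a > 0 else -a)   # first 1->2 shuttle is free
    cost += PARK_FEE * ((n1 > 10) + (n2 > 10))            # overnight parking penalty
    ev = R1[n1] + R2[n2] - cost
    ev += GAMMA * P1[n1] @ V @ P2[n2]                     # E[V(s')] over both locations
    return ev

V = np.zeros((MAX_BIKES + 1, MAX_BIKES + 1))
policy = np.zeros_like(V, dtype=int)
actions = range(-MAX_MOVE, MAX_MOVE + 1)

for _ in range(50):                                        # policy iteration
    for _ in range(30):                                    # approximate policy evaluation
        for s1 in range(MAX_BIKES + 1):
            for s2 in range(MAX_BIKES + 1):
                V[s1, s2] = expected_return(s1, s2, policy[s1, s2], V)
    new_policy = np.array([[max(actions, key=lambda a: expected_return(s1, s2, a, V))
                            for s2 in range(MAX_BIKES + 1)] for s1 in range(MAX_BIKES + 1)])
    if np.array_equal(new_policy, policy):                 # policy is stable: stop
        break
    policy = new_policy

print("optimal action at state (10, 10):", policy[10, 10])
```

The inner loop uses a fixed number of in-place sweeps as an approximate policy evaluation; running the sweeps to convergence would give exact policy iteration at a higher computational cost.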
IV. RESULTS

After applying policy iteration to the Gbike problem, we observe the following:
• The optimal policy maximizes the expected profit by appropriately balancing bike movement between locations and minimizing parking penalties.
• The introduction of the free bike shuttle service reduced the overall transportation cost and improved the policy.
• The number of bikes parked at each location remained below 10 in most states, preventing parking penalties.

V. DISCUSSION

The results demonstrate that policy iteration effectively solves the Gbike bicycle rental problem. By incorporating stochastic factors like Poisson distributions for bike rentals and returns, the policy was adapted to real-world uncertainties. The free bike shuttle service improved operational efficiency by reducing transportation costs. Future work could explore scalability to larger locations and more complex state spaces.

VI. CONCLUSION

This assignment provided a comprehensive understanding of Markov Decision Processes and their application to dynamic decision-making problems. By solving the Gbike bicycle rental problem using policy iteration, we demonstrated the power of MDPs in managing real-world logistics with uncertainty. The methodology can be applied to various domains requiring sequential decision-making.

REFERENCES

[1] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 3rd ed. Pearson, 2010.
[2] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. MIT Press, 2018.
