
Exp No: 5 Flappy Bird

27.03.2025

Aim: To train an AI agent to play Flappy Bird using Deep Q-Networks (DQN).
Objective: Flappy Bird is a game where the player controls a bird that must navigate through
gaps between pipes without hitting them. The goal is to maximize the agent's survival time
and score using reinforcement learning with a deep Q-network (DQN).

Simulation Tool: The Flappy Bird environment is implemented using gym or gymnasium along with
pygame for visualization.

Action Space        Discrete(2)
Observation Space   Continuous (processed via CNN in DQN)
Import              gymnasium.make("FlappyBird-v0")

Description: The game starts with the bird in the air, where it continuously falls due to
gravity. The player (agent) can either flap (jump) or do nothing. The objective is to pass
through as many pipes as possible without colliding.
Algorithm:

1. Initialize the deep Q-network with random weights.

2. Set hyperparameters such as the learning rate (α), discount factor (γ), exploration rate (ε), and
replay memory size.

3. For each episode:

Start at the initial state.

Choose an action using an ε-greedy policy.

Perform the action and observe the next state, reward, and done status.

Store the experience (state, action, reward, next state, done) in a replay buffer.

Sample a mini-batch from the replay buffer.

Compute the target Q-value using y = r + γ · max_a′ Q_target(s′, a′) for non-terminal
transitions, and y = r when the episode has ended (see the sketch after these steps).

Update the Q-network using backpropagation.



Reduce ε over time.

Repeat until the game is over.

4. Train the DQN for multiple episodes until convergence.

5. Use the trained model to find the optimal policy.
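
A minimal sketch of the ε-greedy choice and the target computation from step 3 is given below. It uses hypothetical names (q_network and target_network for the online and target Keras models; rewards, next_states, dones for NumPy arrays sampled from the replay buffer); the full program later in this report implements the same logic inside the DQNAgent class.

import numpy as np

def epsilon_greedy(q_network, state, epsilon, n_actions):
    # Explore with probability epsilon, otherwise act greedily on the predicted Q-values.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    q_values = q_network.predict(state[np.newaxis, :], verbose=0)
    return int(np.argmax(q_values[0]))

def dqn_targets(target_network, rewards, next_states, dones, gamma=0.95):
    # y = r + gamma * max_a' Q_target(s', a') for non-terminal transitions, y = r otherwise.
    next_q = target_network.predict(next_states, verbose=0)
    return rewards + (1.0 - dones.astype(np.float32)) * gamma * next_q.max(axis=1)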

Action Space: The action shape is (1,) in the range {0, 1} , indicating whether the bird should
flap or not.

0: Do nothing (bird falls)

1: Flap (bird jumps up)

Observation Space:
The observation consists of pixel frames processed using convolutional neural networks
(CNNs). The input state includes:

Stacked frames for temporal information (see the sketch after this list)

Bird’s vertical position

Bird’s velocity

Distance to next pipe

Height of next pipe
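
The "stacked frames" item refers to concatenating the last few frames so the network can infer velocity from pixels. The helper below is only an illustrative sketch of that idea, assuming grayscale frames (e.g. 84×84); the program in this report instead feeds the four hand-crafted features listed above directly to the network.

import numpy as np
from collections import deque

class FrameStack:
    # Keep the last k frames and expose them as a single (H, W, k) observation.
    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, frame):
        # Fill the stack with the first frame so the shape is valid from step one.
        for _ in range(self.k):
            self.frames.append(frame)
        return self.observation()

    def push(self, frame):
        self.frames.append(frame)
        return self.observation()

    def observation(self):
        return np.stack(self.frames, axis=-1)  # e.g. (84, 84, k) for 84x84 grayscale frames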

Rewards: Reward schedule:

Successfully passing a pipe: +1

Collision with ground or pipe: -1 (Game Over)

Starting State:

The episode starts with the bird in an initial position with a downward velocity.
Episode End: The episode ends if any of the following happens:

The bird collides with a pipe.

The bird collides with the ground.

The bird flies too high (in some implementations).

Arguments
The render_mode argument enables visualization, and sutton_barto_reward modifies the
reward structure to match the original implementation.
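
As a usage sketch (assuming the FlappyBird-v0 id is registered by a third-party package such as flappy-bird-gymnasium, and that the keyword arguments named above are accepted by the chosen implementation):

import gymnasium as gym
# import flappy_bird_gymnasium  # uncomment if using that package; importing it registers FlappyBird-v0

env = gym.make("FlappyBird-v0", render_mode="human")  # render_mode enables the pygame window
obs, info = env.reset(seed=0)
done = False
while not done:
    action = env.action_space.sample()  # random policy: 0 = do nothing, 1 = flap
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()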

Program:



import os
import random
import numpy as np
import matplotlib.pyplot as plt
from collections import deque
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
import gym
from gym import spaces
from IPython.display import clear_output, display
import time

# Suppress TensorFlow warnings


os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

# Force CPU usage instead of GPU to avoid compatibility issues


os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

print("TensorFlow version:", tf.__version__)


print("Running with device:", tf.config.list_physical_devices())

# Define the Flappy Bird environment


class FlappyBirdEnv(gym.Env):
def __init__(self):
super(FlappyBirdEnv, self).__init__()

# Environment parameters
self.gravity = 1
self.bird_velocity = 0
self.bird_position = 50
self.pipe_gap = 40
self.pipe_width = 10
self.pipe_velocity = 2
self.pipes = []
self.screen_width = 100
self.screen_height = 100



self.pipe_spawn_freq = 50
self.frames_since_last_pipe = 0
self.score = 0

# Define action and observation space


self.action_space = spaces.Discrete(2) # 0: do nothing, 1: flap
self.observation_space = spaces.Box(low=0, high=255, shape=(4,), dtype=np.float32)

def reset(self):
self.bird_velocity = 0
self.bird_position = 50
self.pipes = [{'x': 70, 'gap_pos': random.randint(20, 80)}]
self.frames_since_last_pipe = 0
self.score = 0
return self._get_state()

def step(self, action):


# Apply action (flap or do nothing)
if action == 1:
self.bird_velocity = -10

# Update bird position


self.bird_velocity += self.gravity
self.bird_position += self.bird_velocity

# Spawn new pipes


self.frames_since_last_pipe += 1
if self.frames_since_last_pipe >= self.pipe_spawn_freq:
self.pipes.append({'x': self.screen_width, 'gap_pos': random.randint(20, 80)})
self.frames_since_last_pipe = 0

# Move pipes
for pipe in self.pipes:
pipe['x'] -= self.pipe_velocity

# Remove pipes that are off-screen


self.pipes = [pipe for pipe in self.pipes if pipe['x'] + self.pipe_width > 0]



# Check if bird has passed a pipe
for pipe in self.pipes:
if self.screen_width // 5 == pipe['x'] + self.pipe_width:
self.score += 1

# Check for collisions


done = False
reward = 0.1 # Default small positive reward

# Bird hits the ground or ceiling


if self.bird_position <= 0 or self.bird_position >= self.screen_height:
done = True
reward = -10
else:
# Check for pipe collisions
for pipe in self.pipes:
if (self.screen_width // 5 >= pipe['x'] and
self.screen_width // 5 <= pipe['x'] + self.pipe_width):
if (self.bird_position <= pipe['gap_pos'] - self.pipe_gap // 2 or
self.bird_position >= pipe['gap_pos'] + self.pipe_gap // 2):
done = True
reward = -10
break

return self._get_state(), reward, done, {'score': self.score}

def _get_state(self):
# Get the nearest pipe
nearest_pipe = None
nearest_distance = float('inf')

for pipe in self.pipes:


if pipe['x'] + self.pipe_width >= self.screen_width // 5:
distance = pipe['x'] - self.screen_width // 5
if distance < nearest_distance:
nearest_distance = distance
nearest_pipe = pipe



if nearest_pipe is None:
# If no pipe ahead, use default values
horizontal_distance = self.screen_width
gap_pos = self.screen_height // 2
else:
horizontal_distance = nearest_pipe['x'] - self.screen_width // 5
gap_pos = nearest_pipe['gap_pos']

# Normalized state:
# [bird_y, bird_velocity, distance_to_pipe, center_of_gap]
state = [
self.bird_position / self.screen_height,
self.bird_velocity / 10,
horizontal_distance / self.screen_width,
gap_pos / self.screen_height if nearest_pipe else 0.5
]

return np.array(state, dtype=np.float32)

def render(self):
# Create a simple visualization using matplotlib
plt.figure(figsize=(5, 5))
plt.xlim(0, self.screen_width)
plt.ylim(0, self.screen_height)

# Draw pipes
for pipe in self.pipes:
# Top pipe
plt.fill_between([pipe['x'], pipe['x'] + self.pipe_width],
[0, 0],
[pipe['gap_pos'] - self.pipe_gap // 2, pipe['gap_pos'] - self.pipe_gap // 2],
color='green')
# Bottom pipe
plt.fill_between([pipe['x'], pipe['x'] + self.pipe_width],
[pipe['gap_pos'] + self.pipe_gap // 2, pipe['gap_pos'] + self.pipe_gap // 2],
[self.screen_height, self.screen_height],
color='green')



# Draw bird
plt.scatter(self.screen_width // 5, self.bird_position, color='yellow', s=100)

# Add score
plt.text(5, 95, f'Score: {self.score}', fontsize=12)

plt.title('Flappy Bird')
plt.axis('off')

# Display in Colab
display(plt.gcf())
plt.close()

# Define DQN Agent


class DQNAgent:
def __init__(self, state_size, action_size):
self.state_size = state_size
self.action_size = action_size
self.memory = deque(maxlen=2000)
self.gamma = 0.95 # discount rate
self.epsilon = 1.0 # exploration rate
self.epsilon_min = 0.01
self.epsilon_decay = 0.995
self.learning_rate = 0.001
self.model = self._build_model()
self.target_model = self._build_model()
self.update_target_model()

def _build_model(self):
# Neural Net for Deep-Q learning Model - simplified for stability
model = Sequential()
model.add(Dense(24, input_dim=self.state_size, activation='relu'))
model.add(Dense(24, activation='relu'))
model.add(Dense(self.action_size, activation='linear'))
model.compile(loss='mse', optimizer=Adam(learning_rate=self.learning_rate))
return model

def update_target_model(self):



# copy weights from model to target_model
self.target_model.set_weights(self.model.get_weights())

def remember(self, state, action, reward, next_state, done):


self.memory.append((state, action, reward, next_state, done))

def act(self, state):


if np.random.rand() <= self.epsilon:
return random.randrange(self.action_size)
state_tensor = tf.convert_to_tensor(state.reshape(1, -1), dtype=tf.float32)
act_values = self.model(state_tensor, training=False)
return np.argmax(act_values[0])

def replay(self, batch_size):


if len(self.memory) < batch_size:
return

minibatch = random.sample(self.memory, batch_size)


states = np.array([transition[0] for transition in minibatch])
actions = np.array([transition[1] for transition in minibatch])
rewards = np.array([transition[2] for transition in minibatch])
next_states = np.array([transition[3] for transition in minibatch])
dones = np.array([transition[4] for transition in minibatch])

# Get current states and predict Q values


states_tensor = tf.convert_to_tensor(states, dtype=tf.float32)
with tf.GradientTape() as tape:
q_values = self.model(states_tensor, training=True)

# Select the Q values for the actions that were taken


indices = tf.range(0, tf.shape(q_values)[0]) * tf.shape(q_values)[1] + actions
selected_q_values = tf.gather(tf.reshape(q_values, [-1]), indices)

# Get Q values for next states with target model


next_states_tensor = tf.convert_to_tensor(next_states, dtype=tf.float32)
next_q_values = self.target_model(next_states_tensor, training=False)

# Calculate targets



max_next_q_values = tf.reduce_max(next_q_values, axis=1)
# Cast rewards/dones to float32 tensors so the Bellman target matches the Q-value dtype
rewards_tensor = tf.convert_to_tensor(rewards, dtype=tf.float32)
dones_tensor = tf.convert_to_tensor(dones, dtype=tf.float32)
targets = rewards_tensor + (1.0 - dones_tensor) * self.gamma * max_next_q_values

# Calculate loss
loss = tf.keras.losses.mse(selected_q_values, targets)

# Get gradients and update model


grads = tape.gradient(loss, self.model.trainable_variables)
self.model.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))

# Decay epsilon
if self.epsilon > self.epsilon_min:
self.epsilon *= self.epsilon_decay

# Custom save method to avoid keras restrictions


def save_model(self, filepath):
self.model.save_weights(filepath)

# Custom load method


def load_model(self, filepath):
self.model.load_weights(filepath)

# Training function
def train_dqn(episodes=100):
env = FlappyBirdEnv()
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
agent = DQNAgent(state_size, action_size)
batch_size = 32

# Keep track of scores


scores = []

print("Starting training for", episodes, "episodes")

for e in range(episodes):
state = env.reset()
total_reward = 0



done = False
step = 0

while not done:


step += 1
# Choose action
action = agent.act(state)

# Take action
next_state, reward, done, info = env.step(action)
total_reward += reward

# Remember the experience


agent.remember(state, action, reward, next_state, done)

state = next_state

# Train through replay


if len(agent.memory) > batch_size:
agent.replay(batch_size)

# Update target model occasionally


if step % 10 == 0:
agent.update_target_model()

# Visualize occasionally - reduced frequency for better performance


if e % 50 == 0 and step % 50 == 0:
clear_output(wait=True)
env.render()
time.sleep(0.01)

scores.append(info['score'])

# Print episode stats


avg_score = np.mean(scores[-100:]) if len(scores) >= 100 else np.mean(scores)
print(f"Episode: {e+1}/{episodes}, Score: {info['score']}, Epsilon: {agent.e

# Save model weights occasionally



if (e+1) % 50 == 0:
print(f"Saving model at episode {e+1}")
model_path = f"flappy_bird_model_ep{e+1}"
try:
agent.save_model(model_path)
print(f"Successfully saved model to {model_path}")
except Exception as ex:
print(f"Failed to save model: {ex}")
# Continue without saving

# Plot learning curve


plt.figure(figsize=(10, 6))
plt.plot(scores)
plt.title('Learning Curve')
plt.xlabel('Episode')
plt.ylabel('Score')
plt.show()

return agent, scores

# Function to watch trained agent play


def watch_agent_play(agent, episodes=3):
env = FlappyBirdEnv()

for e in range(episodes):
state = env.reset()
done = False
step = 0

while not done and step < 1000: # Add step limit as a safeguard
step += 1
clear_output(wait=True)
env.render()

# Agent chooses action with no exploration


state_tensor = tf.convert_to_tensor(state.reshape(1, -1), dtype=tf.float32)
action = np.argmax(agent.model(state_tensor, training=False)[0])



# Take action
state, reward, done, info = env.step(action)

time.sleep(0.05) # Slow down for better visualization

print(f"Episode {e+1}: Score = {info['score']}")

# Run the training


print("Starting Flappy Bird DQN training (Final fixed version for Colab)")

# Use try-except to handle potential errors gracefully


try:
# Train agent with fewer episodes for testing - reduce to 100 for quicker results
agent, scores = train_dqn(episodes=100)

# Watch the trained agent play


watch_agent_play(agent, episodes=3)
except Exception as e:
print(f"An error occurred during execution: {e}")

Output: (Simulation Screen Shots)



Result: Using Deep Q-Networks (DQN), we successfully train an agent to play Flappy Bird.
The neural network learns optimal Q-values to make decisions based on state observations.
The trained model enables the agent to flap at the right time to maximize its survival and
score.

