DQN with Atari in 6 Minutes
Published:
This post will introduce Q-Learning, one of the most famous algorithms in Reinforcement Learning which utilizes Neural Networks as function approximators. All tutorials in the Reinforcement Learning sections require prior knowledge of Deep Neural Networks and it is recommended that you have a look at the notebooks in the Deep Learning section.
We will be implementing Q-Learning using ConvNets to play the game of Pong in Atari 2600 domain. This is a naive replication of David Silver's DQN paper. So let's get started!
Importing Dependencies
We will use PyTorch and gym (by OpenAI) for our experiments as these are lightweight and easy to program.
from google.colab import drive
user_name = '/content/drive'
drive.mount(user_name, force_remount=True)
import math, random
import gym
import numpy as np
import pickle as pkl
from matplotlib.image import imsave
import torch
import torch.nn as nn
import torch.optim as optim
import torch.autograd as autograd
import torch.nn.functional as F
import matplotlib.pyplot as plt
checkpoint_name = '/content/drive/My Drive/Colab Notebooks/Checkpoint'
USE_CUDA = torch.cuda.is_available()
device = torch.device('cuda' if USE_CUDA else 'cpu')
Variable = lambda *args, **kwargs: autograd.Variable(*args, **kwargs).cuda() if USE_CUDA else autograd.Variable(*args, **kwargs)
import sys
sys.path.append('/content/drive/My Drive/Colab Notebooks/')
from wrappers import make_atari, wrap_deepmind, wrap_pytorch
Replay Buffer
Almost all Reinforcement Learning methods make use of a Replay Buffer which saves past experiences from a particular episode. Each (state,action,reward,next_state) tuple is stored in the buffer so that the RL algorithm can be trained later when the episode is over.
class ReplayBuffer:
def __init__(self,capacity):
self.capacity = capacity
self.buffer = []
self.position = 0
def push(self, state, action, reward, next_state, done):
state = np.expand_dims(state, 0)
next_state = np.expand_dims(next_state, 0)
self.buffer.append((state, action, reward, next_state, done))
def sample(self, batch_size):
state, action, reward, next_state, done = zip(*random.sample(self.buffer, batch_size))
return np.concatenate(state), action, reward, np.concatenate(next_state), done
def __len__(self):
return len(self.buffer)
ConvNet Architecture
Now it's time to construct our agent. We will make use of a ConvNet to train our Q-Learning algorithm. Parameters for the ConvNet such as number of layers, filter sizes, stride and activations have all been kept same as the ones mentioned in the paper. Fully-Connected Dense layers consisting of 52 hidden units and ReLU non-linearity have been added on top of the ConvNet architecture.
Our agent will take the input as the state from the environment, an 84x84 image of the game screen and the output will be one of the following 3 actions- move up (0), move down (1) and do nothing (2). The output layer consists of 3 output nodes of which the $argmax$ is selected as a result of the greedy poliy which we will discuss in the next section.
class CnnDQN(nn.Module):
def __init__(self, input_shape, num_actions):
super(CnnDQN, self).__init__()
self.input_shape = input_shape
self.num_actions = num_actions
self.features = nn.Sequential(
nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4),
nn.ReLU(),
nn.Conv2d(32, 64, kernel_size=4, stride=2),
nn.ReLU(),
nn.Conv2d(64, 64, kernel_size=3, stride=1),
nn.ReLU()
)
self.fc = nn.Sequential(
nn.Linear(self.feature_size(), 512),
nn.ReLU(),
nn.Linear(512, self.num_actions)
)
def forward(self, x):
x = self.features(x)
x = x.view(x.size(0), -1)
x = self.fc(x)
return x
def feature_size(self):
return self.features(autograd.Variable(torch.zeros(1, *self.input_shape))).view(1, -1).size(1)
def act(self, state, epsilon):
if random.random() > epsilon:
state = Variable(torch.FloatTensor(np.float32(state)).unsqueeze(0), volatile=True)
q_value = self.forward(state)
action = q_value.max(1)[1].data[0]
else:
action = random.randrange(env.action_space.n)
return action
Update Rule
Now we will make our update rule for the multi-agent algorithm. We will make use of Q-Learning which updates our parameters based on the best/greedy Q-action value $\underset{a}{max}Q(s^{'},a)$. Here, $s^{'}$ is the next state and $a$ is the action selected.
Weight updates for our agents take the form of Q-updates-
For our experiment, we will compute the TD Loss corresponding to each agent as the MSE difference between the Q-value and the expected Q-value ($\underset{a}{max}Q(s^{'},a)$). Mathematically,
The total loss will be the weighted sum of the manager and order network losses. Feel free to play around with the weight coefficients and see how loss variation progresses.
def compute_td_loss(batch_size):
state, action, reward, next_state, done = replay_buffer.sample(batch_size)
state = Variable(torch.FloatTensor(np.float32(state))).to(device)
next_state = Variable(torch.FloatTensor(np.float32(next_state)), volatile=True).to(device)
action = Variable(torch.LongTensor(action)).to(device)
reward = Variable(torch.FloatTensor(reward)).to(device)
done = Variable(torch.FloatTensor(done)).to(device)
# criterion = nn.SmoothL1Loss(reduction='mean')
q_values = model(state)
next_q_values = model(next_state)
q_value = q_values.gather(1, action.unsqueeze(1)).squeeze(1)
next_q_value = next_q_values.max(1)[0]
# reward += abs(q_value - next_q_value)
expected_q_value = reward + gamma * next_q_value * (1 - done)
# loss = criterion(q_value,Variable(expected_q_value.data))
loss = (q_value - Variable(expected_q_value.data)).pow(2).mean()
loss = loss.clamp(-1,1)
optimizer.zero_grad()
loss.backward()
optimizer.step()
return loss
Initialize Setup
Now that we are all ready to train our model, we will initialize our learning setup. We will implement our algorithm on Pong by skipping frequent frames in order to save up on memory and make the process data efficient. We will initialize our environment, our agent, the optimizer and the replay buffer along with its capacity. Keep in mind that Atari environments require significant exploration. Thus, a buffer with large capacity is suitable for the training process.
env_id = "PongNoFrameskip-v4"
env = make_atari(env_id)
env = wrap_deepmind(env)
env = wrap_pytorch(env)
model = CnnDQN(env.observation_space.shape, env.action_space.n).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.00001)
replay_initial = 10000
replay_buffer = ReplayBuffer(100000)
load_model = True
if load_model==True:
model_checkpoint = torch.load(checkpoint_name+'/model.pth.tar', map_location=device)
model.load_state_dict(model_checkpoint['model_state_dict'])
optimizer.load_state_dict(model_checkpoint['optimizer_state_dict'])
loss = model_checkpoint['loss']
Exploration
As discussed in our previous tutorials, we make use of the exploration-starts strategy wherein we start of with 100% exploration and exponentially decay its amount over agent's learning. This is a key step for the model to sample appropriate actions during long-horizon episodic tasks.
epsilon_start = 1.0
epsilon_final = 0.01
epsilon_decay = 30000
epsilon_by_frame = lambda frame_idx: epsilon_final + (epsilon_start - epsilon_final) * math.exp(-1. * frame_idx / epsilon_decay)
Learning Loop
We initialize our training hyperparameters and reset the environment to its initial state. Now, we make use of the standard PyTorch loop mechanism and step through the environment upon sampling actions from our agent. TD updates are made once the replay buffer reaches a capacity of more than 10,000 experiences. Once the loop completes, we save our progress (loss and rewards) for visualization.
loss = None
num_frames = 50000
batch_size = 32
gamma = 0.99
losses = []
all_rewards = []
episode_reward = 0
state = env.reset()
for frame_idx in range(1, num_frames + 1):
epsilon = 0.01 #epsilon_by_frame(frame_idx)
action = model.act(state, epsilon)
next_state, reward, done, _ = env.step(action)
replay_buffer.push(state, action, reward, next_state, done)
state = next_state
episode_reward += reward
if frame_idx > 10000:
imsave(checkpoint_name+'/Frames/'+str(frame_idx)+'.png',state[0,:,:])
if done:
state = env.reset()
all_rewards.append(episode_reward)
episode_reward = 0
if len(replay_buffer) > replay_initial:
loss = compute_td_loss(batch_size)
loss = loss.item()
losses.append(loss)
if frame_idx % 10000 == 0:
print('Step-',frame_idx,'/',num_frames,'|Episode Reward-',all_rewards[-1],'|Loss-',loss)
torch.save({'model_state_dict': model.state_dict(), 'optimizer_state_dict': optimizer.state_dict(), 'loss': loss},checkpoint_name+'/model.pth.tar')
# plt.plot(all_rewards)
data_save = {}
data_save['loss'] = losses
data_save['reward'] = all_rewards
with open(checkpoint_name+'/data_save.pkl', 'wb') as f: #data+same as frame folder
pkl.dump(data_save, f)