DQN with Atari in 6 Minutes

This post introduces Deep Q-Learning (DQN), one of the most famous algorithms in Reinforcement Learning, which combines Q-Learning with a neural network as a function approximator. All tutorials in the Reinforcement Learning section assume prior knowledge of deep neural networks, so it is recommended that you have a look at the notebooks in the Deep Learning section first.

We will implement Q-Learning with a ConvNet to play the game of Pong in the Atari 2600 domain. This is a naive replication of the DQN paper from DeepMind (Mnih et al., co-authored by David Silver). So let's get started!

Importing Dependencies

We will use PyTorch and gym (by OpenAI) for our experiments, as both are lightweight and easy to work with.

# Mount Google Drive so that checkpoints and helper files can be read and written
from google.colab import drive
mount_point = '/content/drive'
drive.mount(mount_point, force_remount=True)

import math, random
import gym
import numpy as np
import pickle as pkl
from matplotlib.image import imsave
import torch
import torch.nn as nn
import torch.optim as optim
import torch.autograd as autograd 
import torch.nn.functional as F
import matplotlib.pyplot as plt

# Path on Google Drive where model checkpoints and frames will be stored
checkpoint_name = '/content/drive/My Drive/Colab Notebooks/Checkpoint'

USE_CUDA = torch.cuda.is_available()
device = torch.device('cuda' if USE_CUDA else 'cpu')

# Helper that places tensors on the GPU when available
# (autograd.Variable is kept for compatibility with older PyTorch versions)
Variable = lambda *args, **kwargs: autograd.Variable(*args, **kwargs).cuda() if USE_CUDA else autograd.Variable(*args, **kwargs)

# The Atari wrappers live in a helper file on Drive
import sys
sys.path.append('/content/drive/My Drive/Colab Notebooks/')
from wrappers import make_atari, wrap_deepmind, wrap_pytorch

Replay Buffer

Off-policy Reinforcement Learning methods such as DQN make use of a replay buffer, which stores past experiences collected while interacting with the environment. Each (state, action, reward, next_state, done) tuple is pushed into the buffer so that the algorithm can later sample random mini-batches of transitions for training, which breaks the correlation between consecutive experiences.

class ReplayBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.position = 0

    def push(self, state, action, reward, next_state, done):
        # Add a batch dimension so states can later be concatenated into a batch
        state      = np.expand_dims(state, 0)
        next_state = np.expand_dims(next_state, 0)

        # Overwrite the oldest transition once the buffer is full (circular buffer)
        if len(self.buffer) < self.capacity:
            self.buffer.append(None)
        self.buffer[self.position] = (state, action, reward, next_state, done)
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        state, action, reward, next_state, done = zip(*random.sample(self.buffer, batch_size))
        return np.concatenate(state), action, reward, np.concatenate(next_state), done

    def __len__(self):
        return len(self.buffer)
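
As a quick sanity check (a small sketch with dummy single-channel 84x84 states, not part of the training pipeline), you can push a few random transitions and sample a mini-batch to verify the shapes:

# Fill a small buffer with dummy transitions and sample a batch (illustrative only)
buf = ReplayBuffer(capacity=1000)
for _ in range(64):
    s  = np.random.rand(1, 84, 84).astype(np.float32)    # fake channel-first frame
    s2 = np.random.rand(1, 84, 84).astype(np.float32)
    buf.push(s, 0, 0.0, s2, False)

states, actions, rewards, next_states, dones = buf.sample(32)
print(states.shape)    # (32, 1, 84, 84)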

ConvNet Architecture

Now it's time to construct our agent. We will make use of a ConvNet as the function approximator for our Q-Learning algorithm. Parameters for the ConvNet such as the number of layers, filter sizes, strides and activations have all been kept the same as those in the paper. A fully-connected head consisting of 512 hidden units with ReLU non-linearity is added on top of the ConvNet features.

Our agent takes as input the state from the environment, an 84x84 image of the game screen, and outputs a Q-value for every available action (for Pong the relevant moves are move up, move down and do nothing). The output layer has one node per action, and the $argmax$ over these Q-values gives the action chosen by the greedy policy, which we will discuss in the next section.

class CnnDQN(nn.Module):
    def __init__(self, input_shape, num_actions):
        super(CnnDQN, self).__init__()
        
        self.input_shape = input_shape
        self.num_actions = num_actions
        
        # Convolutional feature extractor, same layout as in the DQN paper
        self.features = nn.Sequential(
            nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU()
        )
        
        # Fully-connected head producing one Q-value per action
        self.fc = nn.Sequential(
            nn.Linear(self.feature_size(), 512),
            nn.ReLU(),
            nn.Linear(512, self.num_actions)
        )
        
    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x
    
    def feature_size(self):
        # Pass a dummy input through the conv stack to infer the flattened feature size
        return self.features(autograd.Variable(torch.zeros(1, *self.input_shape))).view(1, -1).size(1)
    
    def act(self, state, epsilon):
        # Epsilon-greedy action selection
        if random.random() > epsilon:
            with torch.no_grad():
                state   = Variable(torch.FloatTensor(np.float32(state)).unsqueeze(0))
                q_value = self.forward(state)
            action = q_value.max(1)[1].item()
        else:
            action = random.randrange(self.num_actions)
        return action
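
Before wiring the network to the environment, it helps to verify the output shape with a dummy input; a minimal sketch, assuming a hypothetical single-channel 84x84 observation and a 6-action space:

# Shape check with a dummy observation (hypothetical 1x84x84 input, 6 actions)
net   = CnnDQN(input_shape=(1, 84, 84), num_actions=6)
dummy = torch.zeros(1, 1, 84, 84)    # batch containing one blank frame
print(net(dummy).shape)              # torch.Size([1, 6]): one Q-value per action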

Update Rule

Now we will define the update rule for our agent. We make use of Q-Learning, which updates the parameters towards the best/greedy Q-value of the next state, $\underset{a}{\max}Q(s^{'},a)$. Here, $s^{'}$ is the next state and the maximum is taken over the available actions $a$.

Weight updates for our agent take the form of the Q-Learning update:

$w\leftarrow w + \alpha\,[r + \gamma\underset{a}{\max}Q(s^{'},a,w) - Q(s,a,w)]\,\nabla_{w} Q(s,a,w)$

For our experiment, we compute the TD loss as the mean-squared error between the predicted Q-value and the TD target $r + \gamma\underset{a}{\max}Q(s^{'},a)$. Mathematically, over a mini-batch of $N$ transitions,

$L = \frac{1}{N}\sum_{i}\big[Q(s_{i},a_{i}) - (r_{i} + \gamma\underset{a}{\max}Q(s_{i}^{'},a))\big]^{2}$

Note that the TD target is treated as a fixed quantity, so no gradients flow through it; the loss simply pushes $Q(s,a)$ towards $r + \gamma\underset{a}{\max}Q(s^{'},a)$. Feel free to play around with the discount factor and learning rate and see how the loss progresses.
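
As a quick worked example with hypothetical numbers: suppose a sampled transition has $r = 1$, $\gamma = 0.99$, and the network currently estimates $\underset{a}{\max}Q(s^{'},a) = 2.0$ while $Q(s,a) = 1.5$. The TD target is $1 + 0.99 \times 2.0 = 2.98$, so this transition contributes $(1.5 - 2.98)^{2} \approx 2.19$ to the loss before averaging over the batch.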

def compute_td_loss(batch_size):
    state, action, reward, next_state, done = replay_buffer.sample(batch_size)

    state      = Variable(torch.FloatTensor(np.float32(state))).to(device)
    next_state = Variable(torch.FloatTensor(np.float32(next_state))).to(device)
    action     = Variable(torch.LongTensor(action)).to(device)
    reward     = Variable(torch.FloatTensor(reward)).to(device)
    done       = Variable(torch.FloatTensor(done)).to(device)

    # Q(s, a) for the actions that were actually taken
    q_values = model(state)
    q_value  = q_values.gather(1, action.unsqueeze(1)).squeeze(1)

    # Greedy Q-value of the next state; no gradients flow through the target
    with torch.no_grad():
        next_q_value = model(next_state).max(1)[0]
    expected_q_value = reward + gamma * next_q_value * (1 - done)

    # Mean-squared TD error. The DQN paper clips the TD error to [-1, 1],
    # which is equivalent to using a Huber loss such as nn.SmoothL1Loss.
    loss = (q_value - expected_q_value.detach()).pow(2).mean()
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    return loss

Initialize Setup

Now that we are all ready to train our model, we will initialize our learning setup. We will run our algorithm on Pong through DeepMind-style wrappers, which repeat each action over several raw frames (frame skipping) and downscale observations to 84x84 grayscale images; this saves memory and makes the process more data efficient. We will initialize our environment, our agent, the optimizer and the replay buffer along with its capacity. Keep in mind that Atari environments require significant exploration, so a buffer with a large capacity is suitable for the training process.

env_id = "PongNoFrameskip-v4"
env    = make_atari(env_id)    # create the Atari environment with frame skipping
env    = wrap_deepmind(env)    # DeepMind-style preprocessing (84x84 grayscale frames)
env    = wrap_pytorch(env)     # reshape observations to channel-first (C, H, W) for PyTorch

model = CnnDQN(env.observation_space.shape, env.action_space.n).to(device)
    
optimizer = optim.Adam(model.parameters(), lr=0.00001)

replay_initial = 10000
replay_buffer = ReplayBuffer(100000)

# Set load_model = False when training from scratch (no checkpoint saved on Drive yet)
load_model = True
if load_model:
    model_checkpoint = torch.load(checkpoint_name+'/model.pth.tar', map_location=device)
    model.load_state_dict(model_checkpoint['model_state_dict'])
    optimizer.load_state_dict(model_checkpoint['optimizer_state_dict'])
    loss = model_checkpoint['loss']
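
If you want to double-check what the wrappers produce before training starts, you can print the observation and action spaces (an optional check, not part of the original notebook):

# Optional sanity check of the wrapped environment
print(env.observation_space.shape)   # channel-first image shape, e.g. (1, 84, 84)
print(env.action_space.n)            # number of discrete actions available in Pong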

Exploration

As discussed in our previous tutorials, we make use of an epsilon-greedy strategy wherein we start off with 100% exploration and exponentially decay the exploration rate over the course of the agent's learning. This is a key step for the model to sample appropriate actions during long-horizon episodic tasks.

epsilon_start = 1.0
epsilon_final = 0.01
epsilon_decay = 30000

epsilon_by_frame = lambda frame_idx: epsilon_final + (epsilon_start - epsilon_final) * math.exp(-1. * frame_idx / epsilon_decay)
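
To get a feel for this schedule, you can print a few values of epsilon over training (the numbers below are approximate):

# Approximate exploration rates produced by the schedule
for idx in [0, 10000, 30000, 100000]:
    print(idx, round(epsilon_by_frame(idx), 3))
# 0      -> 1.0
# 10000  -> ~0.719
# 30000  -> ~0.374
# 100000 -> ~0.045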

Learning Loop

We initialize our training hyperparameters and reset the environment to its initial state. We then run a standard training loop, stepping through the environment with actions sampled from our agent. TD updates begin once the replay buffer holds more than 10,000 transitions. Once the loop completes, we save our progress (losses and rewards) for visualization.

loss = None
num_frames = 50000
batch_size = 32
gamma      = 0.99

losses = []
all_rewards = []
episode_reward = 0

state = env.reset()
for frame_idx in range(1, num_frames + 1):
    epsilon = epsilon_by_frame(frame_idx)  # when resuming from a trained checkpoint, a small fixed value such as 0.01 can be used instead
    action = model.act(state, epsilon)
    next_state, reward, done, _ = env.step(action)
    replay_buffer.push(state, action, reward, next_state, done)
    state = next_state
    episode_reward += reward
    
    # Optionally dump game frames to Drive for later visualization (one image per step)
    if frame_idx > 10000:
        imsave(checkpoint_name+'/Frames/'+str(frame_idx)+'.png', state[0,:,:])

    if done:
        state = env.reset()
        all_rewards.append(episode_reward)
        episode_reward = 0
        
    if len(replay_buffer) > replay_initial:
        loss = compute_td_loss(batch_size)
        loss = loss.item()
        losses.append(loss)
        
    if frame_idx % 10000 == 0:
        last_reward = all_rewards[-1] if all_rewards else None  # guard against no finished episode yet
        print('Step-', frame_idx, '/', num_frames, '|Episode Reward-', last_reward, '|Loss-', loss)
        # Periodically checkpoint the model and optimizer to Drive
        torch.save({'model_state_dict': model.state_dict(), 'optimizer_state_dict': optimizer.state_dict(), 'loss': loss}, checkpoint_name+'/model.pth.tar')
        # plt.plot(all_rewards)
        # plt.plot(all_rewards)

data_save = {}
data_save['loss'] = losses
data_save['reward'] = all_rewards

with open(checkpoint_name+'/data_save.pkl', 'wb') as f:  # save alongside the checkpoint and frames
    pkl.dump(data_save, f)
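
Once training is done, you can reload the pickle and plot the learning curves; a minimal sketch, reusing the checkpoint_name path and the imports from above:

# Reload the saved losses and episode rewards and plot them
with open(checkpoint_name+'/data_save.pkl', 'rb') as f:
    data = pkl.load(f)

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(data['reward'])
plt.title('Episode reward')
plt.subplot(1, 2, 2)
plt.plot(data['loss'])
plt.title('TD loss')
plt.show()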