Ping-Pong Playing Learner

Abstract

Pong is a good example of how reinforcement learning (RL) algorithms can be applied to games so that an AI agent can play against humans. At any moment, the only information available to us is an image frame of the game (of size 210x160x3), a snapshot of the game at that moment. In RL terminology, this image is the “state” of the game. Our task is to find the best “action” for that state so that, by taking it, we maximize our score over time. The actions available to us are move “UP”, move “DOWN”, or “STAY” in the same position.

Problem Statement

The goal of this project is to build a Pong-playing learner that scores at least 5 points on average over 20 episodes (counting -1 whenever it misses the ball), under a fixed training-time budget. We use two different algorithms to build the Ping-Pong playing learner.

  1. Noisy Cross Entropy Method
  2. Deep-Q Networks

We then compare the performance of both methods against the Policy Gradients method (adapted from Karpathy’s blog). We train our RL (Reinforcement Learning) agent by actually playing against a hard-coded opponent. Among these techniques, the cross-entropy method uses self-designed features, i.e., it does not use convolutions to extract features from game frames, whereas the others do. Moreover, the Deep Q-Networks and Policy Gradients techniques use deep neural networks to learn the action-value (Q) function and the policy, respectively.

Reinforcement learning algorithms have a proven track record on complex games such as Go (AlphaGo) and Dota 2. The main motivation to take up this project was DeepMind’s phenomenal work in the field of RL for games.

Papers and work that caught our attention:

  • Andrej Karpathy’s Pong from Pixels article
  • Playing Atari with Deep Reinforcement Learning, Mnih et al, 2013. Algorithm: DQN.
  • Deep Recurrent Q-Learning for Partially Observable MDPs, Hausknecht and Stone, 2015. Algorithm: Deep Recurrent Q-Learning.
  • Asynchronous Methods for Deep Reinforcement Learning, Mnih et al, 2016. Algorithm: A3C.
  • Trust Region Policy Optimization, Schulman et al, 2015. Algorithm: TRPO.

Dataset

Reinforcement learning differs from methods like supervised, unsupervised, and semi-supervised learning in terms of training data. In RL, we do not explicitly provide external training data; it is created during training. In other words, the training data is generated by the model itself as the agent actually plays the game. Since the RL algorithm learns through the specified reward system, the player treats a game it wins as a positive example and a game it loses as a negative example.

Methodology

Reinforcement Learning (RL) makes an agent learn to perform actions in an environment so as to maximize a reward. RL is a set of methods that learn how to behave optimally in an environment, and a Markov Decision Process (MDP) is the formal representation of such an environment. The given problem can be formulated as an MDP because it satisfies the Markov property: the current state contains all the information needed to predict the future, i.e., no past information is required.

There are two main components: (1) Environment: represents the problem to be solved; (2) Agent: represents the learning algorithm.

Components of MDP:

  • Agent: Trained Pong AI-model
  • State: Sequence of frames. NOTE: State representation differs for different methods.
  • Action: discrete space specifying whether the paddle moves UP, moves DOWN, or stays IDLE
  • Reward: reward-function shaping is crucial in any reinforcement learning algorithm, as it tells the agent how to play, and it can be very difficult for complex games like Dota. In Pong, the reward function is: -1 if the agent misses the ball, +1 if the opponent misses the ball, and 0 otherwise, i.e., r(s, a) = number of points gained in the current turn.

We used OpenAI’s Gym environment to develop and test the Pong-playing agent.
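As a minimal sketch of this setup (assuming the classic Gym API where `env.step` returns `(observation, reward, done, info)` and the `Pong-v0` environment id; both may differ in newer Gym versions):

```python
import gym

# Minimal interaction loop with the Pong environment (random actions, for illustration).
# Rewards follow the convention above: -1 when we miss, +1 when the opponent misses, 0 otherwise.
env = gym.make("Pong-v0")              # assumed environment id
state = env.reset()                    # raw frame of shape (210, 160, 3)

total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()             # placeholder for the learned policy
    state, reward, done, info = env.step(action)
    total_reward += reward                         # episode score = points won minus points lost

print("Episode reward:", total_reward)
```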

1. Cross Entropy Method

Due to the discrete nature of the game, gradient-free methods such as the Cross-Entropy Method and CMA-ES are a natural first approach to reward optimization. These methods perform better when the state space is reduced to a set of hand-crafted features, with a set of weights over those features learned across reinforcement-learning episodes. The features were designed by us for this project and capture higher-level properties of the game frame: (1) agent’s paddle position, (2) opponent’s paddle position, (3) ball x-position, (4) ball y-position, (5) ball direction, and (6) distance difference (the change in Euclidean distance from the agent’s paddle to the ball between the current frame and the next frame).
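A rough illustration of how such features could be read off a raw frame is sketched below; the colour constants, column bands, and crop rows are illustrative assumptions, not the exact values used in the project:

```python
import numpy as np

# Illustrative constants for the raw (210, 160, 3) Atari Pong frame -- assumptions, not project values.
BACKGROUND = np.array([144, 72, 17])   # assumed background colour
AGENT_COLS = slice(140, 144)           # assumed column band of our paddle (right side)
OPP_COLS = slice(16, 20)               # assumed column band of the opponent's paddle (left side)
PLAY_ROWS = slice(34, 194)             # assumed rows of the playing field (score and borders cropped)

def _centroid_y(region):
    """Mean row index of non-background pixels in a (rows, cols, 3) region, or 0 if empty."""
    ys = np.nonzero(np.any(region != BACKGROUND, axis=-1))[0]
    return ys.mean() if len(ys) else 0.0

def _ball_pos(field):
    """(row, col) of the ball: the only non-background object between the two paddles."""
    mask = np.any(field[:, 20:140] != BACKGROUND, axis=-1)
    ys, xs = np.nonzero(mask)
    return (ys.mean(), xs.mean() + 20) if len(ys) else (0.0, 0.0)

def extract_features(frame, next_frame):
    """Six hand-crafted features from two consecutive raw frames."""
    field, next_field = frame[PLAY_ROWS], next_frame[PLAY_ROWS]
    agent_y, opp_y = _centroid_y(field[:, AGENT_COLS]), _centroid_y(field[:, OPP_COLS])
    ball_y, ball_x = _ball_pos(field)
    next_ball_y, next_ball_x = _ball_pos(next_field)

    direction = np.sign(next_ball_x - ball_x)                      # ball moving right (+1) or left (-1)
    dist = np.hypot(ball_y - agent_y, ball_x - 142)                # paddle-to-ball Euclidean distance
    next_dist = np.hypot(next_ball_y - agent_y, next_ball_x - 142)
    return np.array([agent_y, opp_y, ball_x, ball_y, direction, next_dist - dist])
```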

More recent work on Pong uses deep learning methods: a stack of convolutional layers is first applied to the game frame to extract features, which are then fed into fully connected layers. For Pong, the frames are pre-processed to black out the unwanted pixels. As a result, most of the pixels become zero, which makes the input data sparse. This may lead to inefficient learning; we compare the approaches at the end of this page.

Approach: We used a variant of the cross-entropy method, the noisy cross-entropy method, which is a black-box policy optimization technique. It is called black-box because the algorithm simply searches for the weights that map states to the best action. The method alternates two phases: first, draw samples from a probability distribution; second, minimize the cross-entropy between this distribution and a target distribution (fit it to the elite samples) so that better samples are produced in the next iteration. We implemented this technique to learn the weights for the six hand-crafted features. The policy is a linear model (weights multiplied by features) that produces a value for each possible action given the state information, and we choose the action with the maximum value. In each iteration, the algorithm evaluates many players over a set of episodes and sets the weights for the next iteration to the mean of the elite samples, i.e., the samples that score the maximum number of points on average over their episodes. Because the update averages over many sampled players rather than keeping only the single best one, it is not a greedy approach.
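A minimal sketch of such a linear policy; the weight shape used here (3 actions x 6 features) and the Gym action ids are assumptions for illustration (the hyperparameters section lists a 6-dimensional weight vector, so the exact mapping in our code may differ):

```python
import numpy as np

# Gym Pong action ids (assumed): 0 = STAY/NOOP, 2 = UP, 3 = DOWN.
ACTIONS = [0, 2, 3]

def select_action(weights, features):
    """Linear policy: score each action as a weighted sum of the six features and take the argmax.

    weights  : (3, 6) array sampled by the cross-entropy method, one row of feature weights per action
    features : (6,) array of hand-crafted features for the current state
    """
    action_values = weights @ features        # one scalar value per action
    return ACTIONS[int(np.argmax(action_values))]
```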

After running some iterations of the cross-entropy method without any noise added to the covariance, we observed that the learned weights converge to sub-optimal values very early (i.e., they get stuck in a local optimum). This matches the known issue that applying CE to RL problems makes the learned distribution concentrate on a single point too quickly. To prevent this early convergence, [8] introduces a trick: add some extra noise (Z) to the distribution. Moreover, to make the agent perform even better, [8] notes that gradually decreasing the noise works better than adding a constant amount.
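Putting the pieces together, a condensed sketch of the noisy cross-entropy loop looks as follows (the dummy `evaluate` stands in for actually playing episodes, and the decreasing-noise schedule is one of the forms suggested in [8], assumed rather than the exact schedule we used):

```python
import numpy as np

def evaluate(weights, episodes=40):
    """Average reward of a candidate weight vector over several episodes.

    Placeholder: in the project this plays Gym Pong with the linear policy above;
    a dummy score is returned here so the loop below stays self-contained.
    """
    return float(np.random.randn())

n_features, batch_size, elite_frac, max_iters = 6, 20, 0.3, 100
mean, cov = np.zeros(n_features), np.eye(n_features) * 100.0    # initial distribution
n_elite = int(batch_size * elite_frac)

for t in range(max_iters):
    # 1. Sample a batch of candidate weight vectors from the current distribution.
    samples = np.random.multivariate_normal(mean, cov, size=batch_size)
    rewards = np.array([evaluate(w) for w in samples])

    # 2. Refit the distribution to the elite samples (highest average reward).
    elite = samples[np.argsort(rewards)[-n_elite:]]
    mean, cov = elite.mean(axis=0), np.cov(elite, rowvar=False)

    # 3. Add decreasing extra noise to the covariance to avoid premature convergence [8].
    noise = max(5.0 - t / 10.0, 0.0)           # assumed schedule, in the spirit of [8]
    cov = cov + noise * np.eye(n_features)
```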

2. Deep Q Networks (DQN)

Q Values

Q values are given by the action-value function, which returns the total expected reward obtained by performing a certain action (a) in the current state (s) and then following a policy (π). It tells us the total discounted reward we can expect in the future if we perform action (a) right now.
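Formally, for a policy π and discount factor γ (γ = 0.99 in our experiments), this is:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\Big|\, s_{0} = s,\ a_{0} = a\Big]$$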

Q Learning

Q-learning is a model-free, value-based, off-policy learner that estimates the agent’s Q-value function.

Q Learning Overview

  • Create an empty Q table
  • While not converged, update the Q table:
    • Start from state s1, take an action a1, and receive reward r1
    • The action a1 is chosen either as the action with the highest value in the Q table or at random (for exploration)
    • Update the Q values (see the update sketched below)
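A minimal tabular sketch of this loop (assuming the classic Gym step API, an illustrative ε-greedy exploration rate, and discrete hashable states; for raw Pong frames such a table is intractable, which is exactly the drawback discussed next):

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration (illustrative sketch)."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: [0.0] * n_actions)       # empty Q table, one row per visited state

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Pick the highest-valued action from the table, or a random one for exploration.
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[state][a])

            next_state, reward, done, _ = env.step(action)

            # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
            target = reward + gamma * max(Q[next_state]) * (not done)
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q
```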

Drawbacks of Q Learning

  • A large number of state-action pairs leads to a very large Q table, which does not scale well.
  • High computation time.

Approach

  • In a DQN, a neural network is used as a function approximator instead of a Q table to estimate the Q values. The network outputs estimates of the Q values for all actions available in a state, rather than directly telling us which action to perform. It is updated by minimizing the error between the predicted Q value and the target Q value, but both are estimated using the same network. This means that after an update, the target Q value for the very same state changes as well, so the predicted Q and target Q constantly play catch-up with each other. The targets are themselves just estimates of Q values, unlike the fixed labels in supervised learning. This technique of using one estimate to update another estimate is known as bootstrapping, and it can cause instability unless a separate network, called a target network, is used. In our work we did not use a target network; a single network computes both the predicted Q and the target Q values.

  • We used this standard single-network DQN instead of adding an extra target network. During training we minimize the mean squared error between the target Q value and the predicted Q value and optimize the weights accordingly (the target and loss are sketched in the code after this list).

  • Here the states are pre-processed images of snapshots of the game. Each image is converted to grayscale and normalized in order to reduce the state space, and is stacked with the previous frames, resulting in a 3D tensor of shape (4, 80, 80). This tensor is passed as the state input to the DQN. Only one channel comes from the current frame; the other three channels are carried over from the previous state. The outputs of the DQN are the Q values for all available actions.

  • The architecture used: normalized grayscale image (4, 80, 80) -> Conv2d layer -> Conv2d layer -> Conv2d layer -> Linear layer -> ReLU -> Linear layer -> 4 Q values (one for each action)

  • Each convolution layer is followed by batch normalization and a ReLU activation (a sketch of the network and the single-network update follows).
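A sketch of this network and of the single-network update in PyTorch; the kernel sizes, strides, channel counts, and hidden width are assumptions (the classic DQN choices), since the exact values are not listed above:

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """(4, 80, 80) stacked frames -> 3x [Conv2d -> BatchNorm -> ReLU] -> Linear -> ReLU -> Linear."""

    def __init__(self, n_actions=4):
        super().__init__()
        # Kernel sizes, strides, and channel counts are assumed (classic DQN values).
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.BatchNorm2d(64), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 6 * 6, 512), nn.ReLU(),   # 6x6 is the spatial size left after the convs
            nn.Linear(512, n_actions),               # one Q value per action
        )

    def forward(self, x):                            # x: (batch, 4, 80, 80), normalized to [0, 1]
        return self.head(self.features(x).flatten(start_dim=1))


def dqn_loss(net, batch, gamma=0.99):
    """Single-network update: MSE between Q(s, a) and the target r + gamma * max_a' Q(s', a')."""
    states, actions, rewards, next_states, dones = batch
    q_pred = net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                            # the same network provides the target (no target net)
        q_next = net(next_states).max(dim=1).values
    q_target = rewards + gamma * q_next * (1.0 - dones)
    return nn.functional.mse_loss(q_pred, q_target)
```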

3. Policy Gradients

(Source: Andrej Karpathy’s Pong from Pixels article.) As we are using the Gym environment to learn Pong, the input is an RGB frame showing the paddles, the ball, and the score.

Much of this information, such as the borders and the score, is irrelevant to understanding the state of the game. So the input frames of size (210x160x3) are pre-processed: we crop the frame, downsample by a factor of 2, and black out the background pixels. Finally, we set the pixel values of the paddles and the ball to one and flatten the result, which gives a grayscale vector of size 6400 (80x80). To capture the game state better (e.g., the ball’s direction), we feed the difference between the current and previous pre-processed frames as input.
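A sketch of this pre-processing, closely following Karpathy’s version; the crop rows and the two background pixel values being blacked out are taken from that reference and are assumptions about our exact code:

```python
import numpy as np

def preprocess(frame):
    """Raw (210, 160, 3) RGB frame -> flat 6400-dim binary vector (80x80)."""
    frame = frame[35:195]                         # crop out the score and borders
    frame = frame[::2, ::2, 0]                    # downsample by 2, keep one channel -> (80, 80)
    frame = frame.astype(np.float32)
    frame[(frame == 144) | (frame == 109)] = 0    # black out the two background shades
    frame[frame != 0] = 1                         # paddles and ball become 1
    return frame.ravel()                          # flatten to a 6400-dim vector

def policy_input(current_frame, prev_processed):
    """Difference of consecutive processed frames, so the ball's direction is visible."""
    cur = preprocess(current_frame)
    diff = cur - prev_processed if prev_processed is not None else np.zeros(6400, dtype=np.float32)
    return diff, cur
```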

Next, we feed this into a neural network (architecture defined in the section below). It has a single output unit giving the probability of taking action 2 (in Gym, UP = 2 and DOWN = 3).

Main step: We compute the discounted reward over the entire episode, exponentially discounting rewards that are further in the future. This is then normalized (the mean is subtracted and the result divided by the standard deviation) and used as the advantage (we will see how it is used below). Since this is essentially logistic regression, the loss can be written as the cross-entropy L = -(y log p + (1 - y) log(1 - p)).
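A sketch of the discounted-return and normalization step; resetting the running sum whenever a point is scored follows Karpathy’s implementation and is an assumption about our exact code:

```python
import numpy as np

def discounted_rewards(rewards, gamma=0.99):
    """Discount rewards backwards through the episode: R_t = r_t + gamma * R_{t+1}."""
    returns = np.zeros(len(rewards), dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:
            running = 0.0                 # reset at rally boundaries (a point was just scored)
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantages(rewards, gamma=0.99):
    """Normalized discounted returns used as the advantage."""
    r = discounted_rewards(rewards, gamma)
    return (r - r.mean()) / (r.std() + 1e-8)
```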

But in policy gradients, we multiply the log probabilities by the advantage to weight them according to how well the corresponding actions performed; modulating the gradient with the advantage in this way speeds up learning. We then backpropagate the derivative of the log probability of the taken action with respect to the network’s output (before the sigmoid), and finally use RMSProp to update the weights (a combined sketch of the forward pass and this advantage weighting follows the architecture below).

Architecture: Input (6400) -> Hidden layer (200 units, ReLU) -> Output (1 unit, sigmoid)
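A NumPy sketch of this two-layer policy together with the advantage-weighted gradient described above (the weight initialization is an assumption, and the RMSProp update itself is omitted):

```python
import numpy as np

D, H = 6400, 200
rng = np.random.default_rng(0)
model = {
    "W1": rng.standard_normal((H, D)) / np.sqrt(D),   # input -> hidden (assumed initialization)
    "W2": rng.standard_normal(H) / np.sqrt(H),        # hidden -> single output unit
}

def policy_forward(x):
    """Return the probability of taking action UP (2) and the hidden activations."""
    h = np.maximum(0, model["W1"] @ x)                # ReLU hidden layer
    p = 1.0 / (1.0 + np.exp(-(model["W2"] @ h)))      # sigmoid output
    return p, h

def policy_backward(xs, hs, dlogps, advs):
    """Gradient of the advantage-weighted log-likelihood w.r.t. the weights.

    dlogps[i] = y_i - p_i, the derivative of log pi(a_i | x_i) w.r.t. the pre-sigmoid output.
    """
    dlogps = dlogps * advs                            # modulate the gradient with the advantage
    dW2 = hs.T @ dlogps
    dh = np.outer(dlogps, model["W2"])
    dh[hs <= 0] = 0                                   # backprop through the ReLU
    dW1 = dh.T @ xs
    return {"W1": dW1, "W2": dW2}
```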

Training

As mentioned before, the training time was fixed for all the methods: we trained the models for roughly 36 hours of CPU time on a 24-core 3.5 GHz machine. The figures below show the learning curves, i.e., the running average reward over iterations/episodes. Once the average reward crosses zero, the agent is more likely to win than lose.

In the Cross Entropy Method, each iteration corresponds to at most 800 games (20 players x 40 episodes); it can be fewer because we vary the episode count, reducing it as the model improves. The final average reward was 5.583, which corresponds to a 0.6329 probability of winning.

For the given training time, DQN and Policy Gradients ran for 4012 and 9756 episodes respectively. The final average reward for PG was 3.01, meaning the agent beats the hard-coded player with probability 0.5714; the final average reward for DQN was -16.25, meaning the agent beats the hard-coded player with probability only 0.1131.

[Learning curves: Cross Entropy Method | DQN | Policy Gradient Method]

Final weights for the six hand-crafted features in the CEM method. This gives a good intuition of the role each feature plays.

Hyperparameters:

  1. CEM
    • weight_cols = 6 # because of six hand-crafted features
    • mean = zeros of size weight_cols # Initialization
    • cov = identity matrix of size weight_cols, scaled by 100 # initial covariance
    • percentile = 0.3 # for elite sample consideration
    • r_percentile = 0.1
    • batch_size = 20 # number of players
    • episodes = 40 # number of episodes each player plays
    • episodes_end = 10 (varying episodes concept)
    • maxits = 100
  2. DQN
    • learning rate = 1e-4
    • batch_size = 32
    • gamma = 0.99 # discount factor in reward computation
  3. PG
    • batch_size = 10 # episodes frequency for parameter update
    • learning_rate = 1e-3
    • gamma = 0.99 # discount factor in reward computation
    • decay_rate = 0.99 # decay factor for RMSProp

Evaluation and Results

This can be considered the testing step. In RL, especially when building a game-playing agent, the performance obtained during the training phase is very much in line with testing, since the player effectively evaluates itself as it trains.

Metric: Average reward over 20 episodes (after ~36 hours of training)

| Method | Average reward over 20 episodes | Maximum reward among 20 episodes | Win probability |
| --- | --- | --- | --- |
| Noisy Cross Entropy Method | 4.75 | 14 | 0.6131 |
| Deep Q-Networks | -17.24 | -13.65 | 0.089 |
| Policy Gradient Method | 2.3 | 9 | 0.5547 |

Agent performance over 20 episodes (green indicates a win and red indicates a loss)

[Per-episode results: Cross Entropy Method (CEM_episodes) | DQN (DQN_episodes) | Policy Gradient Method (PG_episodes)]

Finally, let’s see how our agent plays at different points during training (CEM):

  • Right (Green) - Trained agent
  • Left (Orange) - Hard-coded player
[Gameplay clips after 0 hours, 12 hours, and 36 hours of training]

Demo:

High resolution video demonstration here

  • Best Player - Noisy Cross Entropy Method

Video

Presentation Video

Repository

Github Link

Slides

Link

Explanation Video

Link

Difficulties faced

  1. In CEM, we need to simulate future actions and decide which is best. There was an error in random sampling while cloning and restoring environment states, which sometimes resulted in the agent not playing the complete episode.
  2. A couple of times we found bugs in the script during model training, which cost us a decent amount of learning time.