Reinforcement Learning Tutorial with OpenAI Gym

Abhinav Sagar

This blog is Part 1 of a series. I plan to write as many posts as I can on this subject, as it is one of my favourites. So yeah, let's get started.

According to Wikipedia, reinforcement learning (RL) is an area of machine learning concerned with how software agents should take actions in an environment so as to maximize some notion of cumulative reward. In recent years, we've seen a lot of progress in this fascinating area of research. Examples include DeepMind's Deep Q-Network (DQN) architecture in 2013, AlphaGo beating the world champion at the game of Go in 2016, and OpenAI's Proximal Policy Optimization (PPO) in 2017, amongst others.

In this article I will implement the cross-entropy method in OpenAI Gym's MountainCarContinuous environment. OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms. I will use the PyTorch library for the implementation. The following dependencies are required:

  1. gym
  2. matplotlib
  3. numpy
  4. pytorch
  5. jupyter-notebook
Reinforcement Learning Environments

A car is on a one-dimensional track, positioned between two “mountains”. The goal is to drive up the mountain on the right; however, the car’s engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum.

Let’s get started with the code. Feel free to use the source code from the corresponding notebook here.

I started by importing the libraries and dependencies.
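Here is roughly what that looks like (a minimal sketch; I'm assuming the standard scientific Python stack alongside gym and torch):

```python
import math
from collections import deque

import gym
import matplotlib.pyplot as plt
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
```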

Next, I created the MountainCarContinuous environment and a class representing the agent, initialized with the environment and a hidden layer size. This class has methods for setting the weights, getting the weight dimensions, running the feed-forward pass, and evaluating the agent's performance over an episode.
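A sketch of what this could look like is below. The class and method names (Agent, set_weights, get_weights_dim, evaluate) are my assumptions based on the description above, not necessarily the notebook's exact code, and I'm using the older Gym API where env.reset() returns a state and env.step() returns four values:

```python
class Agent(nn.Module):
    def __init__(self, env, h_size=16):
        super().__init__()
        self.env = env
        # state, hidden layer, and action sizes
        self.s_size = env.observation_space.shape[0]
        self.h_size = h_size
        self.a_size = env.action_space.shape[0]
        # a simple two-layer feed-forward network
        self.fc1 = nn.Linear(self.s_size, self.h_size)
        self.fc2 = nn.Linear(self.h_size, self.a_size)

    def set_weights(self, weights):
        """Load a flat numpy weight vector into the network's parameters."""
        s, h, a = self.s_size, self.h_size, self.a_size
        fc1_end = s * h + h
        fc1_W = torch.from_numpy(weights[: s * h])
        fc1_b = torch.from_numpy(weights[s * h : fc1_end])
        fc2_W = torch.from_numpy(weights[fc1_end : fc1_end + h * a])
        fc2_b = torch.from_numpy(weights[fc1_end + h * a :])
        self.fc1.weight.data.copy_(fc1_W.view_as(self.fc1.weight.data))
        self.fc1.bias.data.copy_(fc1_b.view_as(self.fc1.bias.data))
        self.fc2.weight.data.copy_(fc2_W.view_as(self.fc2.weight.data))
        self.fc2.bias.data.copy_(fc2_b.view_as(self.fc2.bias.data))

    def get_weights_dim(self):
        # total number of weights and biases in the network
        return (self.s_size + 1) * self.h_size + (self.h_size + 1) * self.a_size

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = torch.tanh(self.fc2(x))  # continuous actions in [-1, 1]
        return x.cpu().data

    def evaluate(self, weights, gamma=1.0, max_t=5000):
        """Run one episode with the given weights and return the discounted return."""
        self.set_weights(weights)
        episode_return = 0.0
        state = self.env.reset()
        for t in range(max_t):
            state = torch.from_numpy(state).float().unsqueeze(0)
            action = self.forward(state)
            state, reward, done, _ = self.env.step(action.numpy()[0])
            episode_return += reward * math.pow(gamma, t)
            if done:
                break
        return episode_return


env = gym.make('MountainCarContinuous-v0')
env.seed(101)
np.random.seed(101)

agent = Agent(env)
```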

Then I created a function implementing the cross-entropy method. Feel free to refer to the comments in the code for the parameters and their meanings. I trained the agent for a maximum of 500 iterations, printing the average score after every 10 iterations.
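The idea behind the cross-entropy method is simple: sample a population of weight vectors around the current best estimate, evaluate each one, and average the top performers to get the next candidate. The loop below is a sketch of this; the population size, elite fraction, and noise scale are hyperparameters I've assumed, while the 500-iteration cap, printing every 10 iterations, and a solve threshold near 90 follow the description in this post:

```python
def cem(agent, n_iterations=500, max_t=1000, gamma=1.0, print_every=10,
        pop_size=50, elite_frac=0.2, sigma=0.5):
    """Cross-entropy method.

    Params
    ======
        n_iterations (int): maximum number of training iterations
        max_t (int): maximum number of timesteps per episode
        gamma (float): discount rate
        print_every (int): how often to print the average score
        pop_size (int): size of the population at each iteration
        elite_frac (float): fraction of the population kept as elite
        sigma (float): standard deviation of the additive noise
    """
    n_elite = int(pop_size * elite_frac)

    scores_deque = deque(maxlen=100)
    scores = []
    best_weight = sigma * np.random.randn(agent.get_weights_dim())

    for i_iteration in range(1, n_iterations + 1):
        # sample a population of weight vectors around the current best
        weights_pop = [best_weight + sigma * np.random.randn(agent.get_weights_dim())
                       for _ in range(pop_size)]
        rewards = np.array([agent.evaluate(w, gamma, max_t) for w in weights_pop])

        # keep the elite and average them to get the new best weights
        elite_idxs = rewards.argsort()[-n_elite:]
        elite_weights = [weights_pop[i] for i in elite_idxs]
        best_weight = np.array(elite_weights).mean(axis=0)

        # track the (undiscounted) score of the averaged weights
        reward = agent.evaluate(best_weight, gamma=1.0)
        scores_deque.append(reward)
        scores.append(reward)

        if i_iteration % print_every == 0:
            print('Episode {}\tAverage Score: {:.2f}'.format(
                i_iteration, np.mean(scores_deque)))

        if np.mean(scores_deque) >= 90.0:
            print('Environment solved in {} iterations!\tAverage Score: {:.2f}'.format(
                i_iteration, np.mean(scores_deque)))
            break

    return scores


scores = cem(agent)
```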

Let's see the result. Woah! The agent learnt a fairly good policy in just 47 iterations. As we can see, the rewards received initially were quite low; in fact, they were negative. As training progresses, the rewards increase until the model converges. The average score at the end of 47 iterations is 90.83.

[Figure: Score vs. Episode]
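To reproduce a plot like this one, something along the following lines works, assuming scores is the list returned by the training function above:

```python
# plot the score obtained at each training episode
fig = plt.figure()
plt.plot(np.arange(1, len(scores) + 1), scores)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()
```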

I think implementing the cross-entropy method is a good starting point for someone starting out with reinforcement learning. In the next part, I will tackle OpenAI Gym's BipedalWalker environment using the Deep Deterministic Policy Gradient (DDPG) algorithm.

The corresponding source code can be found here.

Happy reading, happy learning and happy coding.