Deep Reinforcement Learning in Robotics

From jderobot


"SL wants to work. Even if you screw something up you'll usually get something non-random back. RL must be forced to work. If you screw something up or don't tune something well enough you're exceedingly likely to get a policy that is even worse than random. And even if it's all well tuned you'll get a bad policy 30% of the time, just because."

Andrej Karpathy


  • Sepehr MohaimenianPour (smohaime[at]sfu[dot]ca)
  • Alberto Martín (almartinflorido[at]gmail[dot]com)
  • Francisco Miguel Rivas Montero (franciscomiguel[dot]rivas[at]urjc[dot]es)


The purpose of this project is to expose the Gazebo robot simulator as an OpenAI Gym environment for deep reinforcement learning in robotics. The project is developed under the JdeRobot framework and as a Google Summer of Code project.



  • Project Repository
    • Most of the developed code is in the openAI_gym folder of the project
    • Some necessary changes to the JdeRobot code itself (such as implementing the restart behaviour for Gazebo using ICE and ...)
  • Our DQN Agent (based on open-source implementations of DQN)


Available environments[edit]

Please note that these environments work regardless of the map: you can place the TurtleBot JdeRobot robot in any desired Gazebo world and use these environments to expose that world as an OpenAI Gym environment without making any changes.

Some sample worlds we used for training and testing our agents are available in the project repository.

TurtleBot laser environment[edit]

  • The observation space is a 1D array of floating-point values from the laser scanner.
  • Predefined action space is 3 actions (Left, Right, Straight).
  • Environment id is jde-gazebo-kobuki-laser-v0

TurtleBot laser2D environment[edit]

  • The observation space is a 180x180 px 2D image converted from the laser data.
  • Predefined action space is 3 actions (Left, Right, Straight).
  • Environment id is jde-gazebo-kobuki-laser2D-v0
  • Suitable for 2D Convolutional (image) based agents such as DQN.
  • Figure below demonstrates the conversion between laser data to image.
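The laser-to-image conversion mentioned above can be sketched as follows. This is a minimal illustration that assumes 180 evenly spaced beams covering the front 180 degrees and a fixed maximum range; the environment's actual conversion may differ in resolution and scaling:

```python
import numpy as np

def laser_to_image(ranges, size=180, max_range=10.0):
    """Project a 1D array of laser ranges onto a 2D grayscale image.

    The beams are assumed to sweep 180 degrees in front of the robot;
    each beam endpoint is drawn as a white pixel, with the robot sitting
    at the bottom-centre of the image.
    """
    image = np.zeros((size, size), dtype=np.uint8)
    angles = np.linspace(0.0, np.pi, num=len(ranges))
    # Scale ranges so that max_range maps to the image border.
    r = np.clip(np.asarray(ranges, dtype=float), 0.0, max_range) / max_range
    x = ((size - 1) / 2.0) * (1.0 + r * np.cos(angles))
    y = (size - 1) * (1.0 - r * np.sin(angles))
    image[y.astype(int), x.astype(int)] = 255
    return image

obs = laser_to_image([5.0] * 180)  # a 180x180 uint8 image
```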

TurtleBot RGB Camera environment[edit]

  • The observation space is the RGB image from the TurtleBot's left camera, resized to 180x180 px.
  • Predefined action space is 3 actions (Left, Right, Straight).
  • Environment id is jde-gazebo-kobuki-rgb-v0
  • Suitable for 3 channel (RGB) or 1 channel (Grayscale) image based agents.
  • Figure below demonstrates the resized camera observation.

Build your own environments[edit]

Building your own environment for new robots/sensors is straightforward. We have provided a well-commented Template file, with laser and camera sensors already implemented as examples. All you have to do is:

  • Connect to the robot through the ICE interface by loading your config file in the initialisation.
  • Connect to whichever sensors you want on the robot using the JdeComm ICE interface from the config file.
  • Define your Observation and Action spaces.
  • Use the update function to read and translate all the sensor data into your observation.
  • Use the ActionToVel function to translate your action into the robot's velocities.
  • Define your reward function, based on each action and the state reached after executing that action, in the Step function.
  • Include your new environment in the __init__ file.
  • Pick a name for your environment here; this is the name that will be registered with Gym and that your agents will use to connect to the environment.
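The steps above can be sketched as a skeleton class. This is an illustrative, dependency-free sketch with hypothetical names: a real environment subclasses gym.Env, declares gym.spaces, and talks to the robot over the ICE/JdeComm interfaces loaded from the config file; those calls are stubbed out here so the sketch runs stand-alone. The velocities and rewards shown are the tuned values from later in this page:

```python
import numpy as np

class MyRobotEnv:
    """Illustrative environment skeleton (stubbed sensors and motors)."""

    def __init__(self):
        # 1-2. Connect to the robot and its sensors via ICE here (stubbed).
        # 3. Define observation and action spaces.
        self.observation_shape = (180,)   # e.g. one value per laser beam
        self.num_actions = 3              # Left, Right, Straight

    def _update(self):
        # 4. Read the sensors and translate them into an observation.
        return np.zeros(self.observation_shape, dtype=np.float32)

    def _action_to_vel(self, action):
        # 5. Translate the discrete action into (v, w) robot velocities.
        return {0: (0.3, 0.9), 1: (0.3, -0.9), 2: (0.9, 0.0)}[action]

    def step(self, action):
        v, w = self._action_to_vel(action)
        # Send (v, w) to the robot here, then read the resulting state.
        observation = self._update()
        # 6. Reward function based on the action and the new state.
        reward = 0.9 if action == 2 else -0.003
        terminal = False  # e.g. crash detected via a laser threshold
        return observation, reward, terminal, {}

    def reset(self):
        # Trigger the Gazebo restart behaviour here, then observe.
        return self._update()
```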

And you are good to go! Enjoy training.


JdeRobot & Gazebo[edit]

For compiling the JdeRobot project from source code, please refer to this link.

OpenAI Gym[edit]

You can use one of the options below to install OpenAI Gym. For more information, please refer to the OpenAI Gym installation instructions.

Minimal Installation using pip[edit]
$ pip install gym
Minimal installation from source code[edit]
$ git clone
$ cd gym
$ pip install -e .
Full installation using pip[edit]
$ apt-get install python-numpy python-dev cmake zlib1g-dev libjpeg-dev xvfb libav-tools xorg-dev python-opengl libboost-all-dev libsdl2-dev swig

JdeRobot Gazebo OpenAI Gym environment[edit]

$ git clone
$ cd colab-gsoc2017-SepehrMohaimanian/src/tools/openAI_gym_env/gym-gazebo/jde_gym_gazebo
$ pip install -e .

How to use[edit]

Using this environment in Python code is easy and hassle-free: run a Gazebo instance loading a world that has the appropriate robot in it, import the environment package in your Python code (agent), and you are good to go.


  • Load any world containing a Jde TurtleBot robot in it. You can find some sample worlds here.
  • In your Python agent's code, import gym and jde_gym_gazebo.
  • Create the environment as a standard OpenAI Gym environment, using the aforementioned environment ids.
  • Sample code:
import gym
import jde_gym_gazebo

# Make the environment
env = gym.make('jde-gazebo-kobuki-laser-v0')
for x in range(total_episodes):
    # Reset the env at the start of each episode
    observation = env.reset()
    for i in range(max_steps_per_episode):
        # Pick an action based on the current observation
        action = qlearn.chooseAction(observation)
        # Execute the action and get feedback
        observation, reward, terminal, info = env.step(action)
        if terminal:
            break
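The qlearn object above is the learning agent. A minimal tabular Q-learning agent with an epsilon-greedy chooseAction, in the spirit of the one used here, might look like this (an illustrative sketch, not the project's exact agent; note that the laser observation must first be discretised into a hashable state):

```python
import random
from collections import defaultdict

class QLearn:
    """Minimal tabular Q-learning agent (illustrative sketch)."""

    def __init__(self, actions, alpha=0.2, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)   # (state, action) -> estimated value
        self.actions = actions
        self.alpha = alpha            # learning rate
        self.gamma = gamma            # discount factor
        self.epsilon = epsilon        # exploration rate

    def chooseAction(self, state):
        # Epsilon-greedy: explore with probability epsilon,
        # otherwise pick the action with the highest Q-value.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        values = [self.q[(state, a)] for a in self.actions]
        return self.actions[values.index(max(values))]

    def learn(self, state, action, reward, next_state):
        # Standard Q-learning update toward the bootstrapped target.
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```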



Please refer to Report #5 for more information on this agent.

We started by training a simple Q-Learning agent using the reward function and action space suggested in Extending the OpenAI Gym for robotics:

  • Action Space:
    • Forward: v = 0.3 m/s, w = 0.0 rad/s
    • Left/Right: v = 0.05 m/s , w = +-0.3 rad/s
  • Rewards Function:
    • Forward: 5
    • Left/Right: 1
    • Crash: -200

The training curve looks very promising:

But the actual robot behaviour was not as good as expected. This video shows the result after 4000 episodes of training, played back 6x faster than real time:

The results are similar to those reported in the aforementioned paper.

This agent has the following issues:

  • Because the chosen action-space velocities are very small, the agent is very slow. Driving slowly is safer for the agent, but we definitely don't want a boringly slow robot. (Note that the simulation is being run 6x faster than real time.)
  • The reward function is chosen poorly, so the agent tends to make a lot of unnecessary turns and behaves semi-stochastically in the environment.

We carried on tuning and editing the environment and agent by experimenting with different environment parameters, such as the action space, the laser-scanner quantisation for the state space, and the reward function, as well as different hyperparameters for the agent itself (more on that in Report #4).

The training itself takes a very long time, and checking every single set of parameters meant several hours of training. Here's one of the fastest training instances:

After some tuning we converged to a faster agent that can explore the environment more sanely.

  • Action Space:
    • Forward: v = 0.9 m/s, w = 0.0 rad/s
    • Left/Right: v = 0.3 m/s , w = +-0.9 rad/s
  • Rewards Function:
    • Forward: 0.9
    • Left/Right: -0.003
    • Crash: -99

Looking at the training curve:

it might seem that the agent is doing worse than the previous setup suggested in Extending the OpenAI Gym for robotics. However, the resulting agent does a better job of exploring the environment, faster and more reliably, without making unnecessary turns and detours.

This improvement is achieved by:

  • Forcing the robot to go straight more often and avoid turning unless absolutely necessary, by slightly penalising turns instead of rewarding them as suggested in Extending the OpenAI Gym for robotics.
  • Increasing the distance threshold from the walls, registering a crash at a greater distance to keep the robot in the centre of corridors.
  • Discretising the laser data for the state space non-linearly.
  • And some other tweaks in the environment and agent
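The non-linear discretisation mentioned above can be sketched as binning each laser range with thresholds that are denser near the robot, where small distance changes matter most for obstacle avoidance. The bin edges below are illustrative, not the project's exact values:

```python
import numpy as np

# Hypothetical bin edges (metres): finer resolution close to the robot.
BIN_EDGES = [0.3, 0.6, 1.0, 2.0, 4.0]

def discretise_laser(ranges, edges=BIN_EDGES):
    """Map each laser range to a small integer bin (0 = closest)."""
    return tuple(int(np.digitize(r, edges)) for r in ranges)

state = discretise_laser([0.2, 0.9, 5.0])
# 0.2 m -> bin 0, 0.9 m -> bin 2, 5.0 m -> bin 5
```

Returning a tuple makes the discretised state hashable, so it can be used directly as a key in a tabular Q-learning agent.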

SARSA Agent[edit]

TBA (Postponed because of time restriction for training)

DQN Agent[edit]

We have tried several available open-source DQN agents, as well as our own implementation of the algorithm, on the environment. All of the DQN agents managed to solve the environment, with different convergence rates and speeds. The following result is from an open-source implementation of DQN trained out-of-the-box on the exact same environment we used for the Q-Learning and SARSA agents. The reward function and action space are exactly the same; the only difference is the observation space, which here is the 2D grayscale image generated from our laser data, used as the input to the agent.

To reproduce the following results use DQN Agent.


To train the agent from scratch (randomly initialised weights):

$ python

Each training cycle takes at least 3 days on an Nvidia Titan Xp. The bottleneck in training is either Gazebo or ICE (this needs to be investigated), which cannot provide more than 12 commands per second.

To train using my training results as the initialisation for the weights (fine-tuning for a new map with different obstacles and ...):

$ python --load_weights weights/laser2D_baseline

Please note that epsilon starts at 1 and decreases slowly over time; thus the actions during the warmup_steps of training are completely (100%) random.
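A schedule matching this description is commonly implemented as a linear anneal from 1.0 down to a small floor after the warm-up period. The constants below are illustrative; the agent's actual warm-up length and decay rate may differ:

```python
def epsilon_at(step, warmup_steps=50000, anneal_steps=1000000,
               eps_start=1.0, eps_end=0.1):
    """Exploration schedule: fully random during warm-up, then a
    linear decay from eps_start to eps_end over anneal_steps."""
    if step < warmup_steps:
        return eps_start
    progress = min(1.0, (step - warmup_steps) / anneal_steps)
    return eps_start + progress * (eps_end - eps_start)
```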

And if you want to render and see the network input live, use the --render argument:

$ python --load_weights weights/laser2D_baseline --render

The following video is made from segments recorded in real time over the 3 days of training.

The reward chart for the first 3 million iterations is as follows.

Comparing this chart with the Q-Learning agent's shows a significant improvement in the results.

Final Results[edit]

The weights from this result are provided as the baseline for obstacle avoidance using laser data as the input image for the DQN agent; they can be found in the agent's weights folder under the name laser2D_baseline.

To run the agent with these weights in testing mode (epsilon=0 and backward passes disabled), use the --evaluate argument:

$ python --evaluate --load_weights weights/laser2D_baseline

And again, to render the network input, use the --render argument:

$ python --evaluate --load_weights weights/laser2D_baseline --render

After 3 million iterations of training (almost 40,000 episodes), the agent can navigate the environment perfectly, avoiding every obstacle while exploring indefinitely. We also tried the trained agent on new maps that were very different from the original training map. Interestingly, the agent managed to explore the unseen maps without any issue, even though these maps had new structures that produce new observation states (the sharp turns and forks generate shapes in the observation space that the agent had not seen during its training phase). The following video presents the agent exploring the training map and two new maps with new features (sharp turns, forks, ...). We have also rendered the agent's observation image in the video for clarity.

One can argue that the agent has learned obstacle avoidance and exploration in general and is not bound to a particular map, as it was in Extending the OpenAI Gym for robotics. We believe that training the exact same agent on a more general map, containing many different kinds of obstacles, with a bigger action space (a wider variety of velocities), could produce a general-purpose obstacle-avoidance agent.


Detailed Reports[edit]

Report #1

Report #2

Report #3 Reset behaviour implementation

Report #4 Environmental tuning (Last edit: 07/20/2017)

Report #5 Simple Q-Learning agent (Last edit: 07/21/2017)

Report #6 SARSA agent (TBA)

Report #7 Simple DQN agent (TBA)

Week One[edit]

  • [DONE] Compiling the JdeRobot code from source.
    • [BACKLOG] Local installation does not respect the CMAKE_INSTALL_PREFIX for python components (Related Issue).
  • [DONE] Fork JdeRobot project in my own repository.

Week Two[edit]

  • [DONE] Run Gazebo plugins and find out how they work in JdeRobot.
  • [DONE] Get familiar with ICE interface.
  • [DONE] Get familiar with OpenAI Gym environments.
  • [DONE] Write a technical proposal for mentors.
  • [DONE] Implement a method for restarting Gazebo for training purpose.

Week Three[edit]

  • [DONE] Get familiar with OpenAI Gym Spaces.
  • [DONE] Implement first version of environment for TurtleBot.
  • [DONE] Receive the laser data from Gazebo TurtleBot in environment (State space).
  • [DONE] Create action space in environment, map actions to TurtleBot's motors speeds.
  • [DONE] Send motor speeds to Gazebo.
  • [DONE] Connect the reset method to environment reset.
  • [DONE] Detect collisions to terminate the episode (bumper or wall sensor?). Used a laser-data threshold.

Week Four[edit]

  • [DONE] Implement a basic Q-Learning agent for basic environment
  • [DONE] Train the basic Q-Learning agent
  • [DONE] Select another RL algorithm to train on the basic environment: Actor-Critic
  • [CHANGED] Implement a basic Actor-Critic agent for basic environment: Replaced by SARSA agent
  • [CHANGED] Train the basic Actor-Critic agent
  • [ToDo] Compare the 2 agents results
  • [DONE] Do the first evaluation

Week Five[edit]

  • [DONE] Research on DQN Agent
  • [DONE] Tune the environment
  • [DONE] Implement a SARSA agent for basic environment

Week Six[edit]

  • [DONE] Continue research on DQN Agent
  • [DONE] Find out how to convert the 1D laser data to 2D image
  • [BACKLOG] Train the SARSA agent using laser scanner

Week Seven[edit]

  • [DONE] Implement a basic DQN agent for basic environment using 1D convolution
  • [DONE] Generate 2D image from laser scanner data
  • [DONE] Implement a basic DQN agent for basic environment using 2D convolution

Week Eight[edit]

  • [DONE] Train 1D DQN Agent using laser scanner
  • [DONE] Train 2D DQN Agent using laser scanner
  • [DONE] Do the second evaluation

Week Nine[edit]

  • [DONE] Continue training until you get the result
  • [DONE] Test the trained agent on different maps

Week Ten[edit]

  • [DONE] Implement the environment that uses camera
  • [DONE] Implement the RGB DQN agent
  • [DONE] Prepare Gazebo worlds for RGB training
  • [DONE] Start training the RGB agent

Week Eleven[edit]

  • [DOING] Train the RGB DQN agent for navigation using RGB data
  • [DONE] Tune the parameters
  • [DOING] Come up with a good reward function

Week Twelve[edit]

  • [DONE] Clean Up the code
  • [DONE] Update the Wiki
  • [DONE] Merge all development branches into master
  • [DONE] Test everything
  • [DONE] Prepare for release
  • [ToDo] Do the final evaluation

List of all commits[edit]