Trains an agent with stochastic policy gradient ascent to solve the Lunar Lander challenge from OpenAI

Overview

lunar-lander-logo

Introduction

This script trains an agent with stochastic policy gradient ascent to solve the Lunar Lander challenge from OpenAI.

In order to run this script, NumPy, the OpenAI Gym toolkit, and PyTorch will need to be installed.

Each step through the Lunar Lander environment takes the general form:

state, reward, done, info = env.step(action)

and the goal is for the agent to take actions that maximize the cumulative reward achieved for the episode's duration. In this specific environment, the state space is 8-dimensional and continuous, while the action space consists of four discrete options:

  • do nothing,
  • fire the left orientation engine,
  • fire the main engine,
  • and fire the right orientation engine.

In order to "solve" the environment, the agent needs to complete the episode with at least 200 points. To learn more about how the agent receives rewards, see here.

Algorithm

Since the agent can only take one of four actions, a, at each time step t, a natural choice of policy would yield probabilities of each action as its output, given an input state, s. Namely, the policy, πθ(a|s), chosen for the agent is a neural network function approximator, designed to more closely approximate the optimal policy π*(a|s) of the agent as it trains over more and more episodes. Here, θ represents the parameters of the neural network that are initially randomized but improve over time to produce more optimal actions, meaning those actions that lead to more cumulative reward over time. Each hidden layer of the neural network uses a ReLU activation. The last layer is a softmax layer of four neurons, meaning each neuron outputs the probability that its corresponding action will be selected.

neural-network

Now that the agent has a stochastic mechanism to select output actions given an input state, it begs the question as to how the policy itself improves over episodes. At the end of each episode, the reward, Gt, due to selecting a specific action, at, at time t during the episode can be expressed as follows:

Gt = rt + (γ)rt+1 + (γ2)rt+2 + ...

where rt is the immediate reward and all remaining terms form the discounted sum of future rewards with discount factor 0 < γ < 1.

Then, the goal is to change the parameters to increase the expectation of future rewards. By taking advantage of likelihood ratios, a gradient estimator of the form below can be used:

grad = Et [ ∇θ log( πθ( at | st ) ) Gt ]

where the advantage function is given by the total reward Gt produced by the action at. Updating the parameters in the direction of the gradient has the net effect of increasing the likelihood of taking actions that were eventually rewarded and decreasing the likelihood of taking actions that were eventually penalized. This is possible because Gt takes into account all the future rewards received as well as the immediate reward.

Results

Solving the Lunar Lander challenge requires safely landing the spacecraft between two flag posts while consuming limited fuel. The agent's ability to do this was quite abysmal in the beginning.

failure...'

After training the agent overnight on a GPU, it could gracefully complete the challenge with ease!

success!

Below, the performance of the agent over 214,000 episodes is documented. The light-blue line indicates individual episodic performance, and the black line is a 100-period moving average of performance. The red line marks the 200 point success threshold.

training-results

It took a little over 17,000 episodes before the agent completed the challenge with a total reward of at least 200 points. After around 25,000 episodes, its average performance began to stabilize, yet, it should be noted that there remained a high amount of variance between individual episodes. In particular, even within the last 15,000 episodes of training, the agent failed roughly 5% of the time. Although the agent could easily conquer the challenge, it occasionally could not prevent making decisions that would eventually lead to disastrous consequences.

Discussion

One caveat with this specific implementation is that it only works with a discrete action space. However, it is possible to adapt the same algorithm to work with a continuous action space. In order to do so, the softmax output layer would have to transform into a sigmoid or tanh layer, nulling the idea that the output layer corresponds to probabilities. Each output neuron would now correspond to the mean, μ, of the (assumed) Gaussian distribution to which each action belongs. In essence, the distributional means themselves would be functions of the input state.

The training process would then consist of updating parameters such that the means shift to favor actions that result in eventual rewards and disfavor actions that are eventually penalized. While it is possible to adapt the algorithm to support continuous action spaces, it has been noted to have relatively poor or limited performance in practice. In actual scenarios involving continuous action spaces, it would almost certainly be preferable to use DDPG, PPO, or a similar algorithm.

References

License

All files in the repository are under the MIT license.

Owner
Momin Haider
Momin Haider
Source code for Fathony, Sahu, Willmott, & Kolter, "Multiplicative Filter Networks", ICLR 2021.

Multiplicative Filter Networks This repository contains a PyTorch MFN implementation and code to perform & reproduce experiments from the ICLR 2021 pa

Bosch Research 66 Jan 04, 2023
Pytorch implementation of Zero-DCE++

Zero-DCE++ You can find more details here: https://li-chongyi.github.io/Proj_Zero-DCE++.html. You can find the details of our CVPR version: https://li

Chongyi Li 157 Dec 23, 2022
PyTorch implementation of SMODICE: Versatile Offline Imitation Learning via State Occupancy Matching

SMODICE: Versatile Offline Imitation Learning via State Occupancy Matching This is the official PyTorch implementation of SMODICE: Versatile Offline I

Jason Ma 14 Aug 30, 2022
1st Solution For NeurIPS 2021 Competition on ML4CO Dual Task

KIDA: Knowledge Inheritance in Data Aggregation This project releases our 1st place solution on NeurIPS2021 ML4CO Dual Task. Slide and model weights a

MEGVII Research 24 Sep 08, 2022
A cross-lingual COVID-19 fake news dataset

CrossFake An English-Chinese COVID-19 fake&real news dataset from the ICDMW 2021 paper below: Cross-lingual COVID-19 Fake News Detection. Jiangshu Du,

Yingtong Dou 11 Dec 01, 2022
Wind Speed Prediction using LSTMs in PyTorch

Implementation of Deep-Forecast using PyTorch Deep Forecast: Deep Learning-based Spatio-Temporal Forecasting Adapted from original implementation Setu

Onur Kaplan 151 Dec 14, 2022
Implementation of ResMLP, an all MLP solution to image classification, in Pytorch

ResMLP - Pytorch Implementation of ResMLP, an all MLP solution to image classification out of Facebook AI, in Pytorch Install $ pip install res-mlp-py

Phil Wang 178 Dec 02, 2022
A tool to visualise the results of AlphaFold2 and inspect the quality of structural predictions

AlphaFold Analyser This program produces high quality visualisations of predicted structures produced by AlphaFold. These visualisations allow the use

Oliver Powell 3 Nov 13, 2022
Deep Unsupervised 3D SfM Face Reconstruction Based on Massive Landmark Bundle Adjustment.

(ACMMM 2021 Oral) SfM Face Reconstruction Based on Massive Landmark Bundle Adjustment This repository shows two tasks: Face landmark detection and Fac

BoomStar 51 Dec 13, 2022
3rd place solution for the Weather4cast 2021 Stage 1 Challenge

weather4cast2021_Stage1 3rd place solution for the Weather4cast 2021 Stage 1 Challenge Dependencies The code can be executed from a fresh environment

5 Aug 14, 2022
Codes for SIGIR'22 Paper 'On-Device Next-Item Recommendation with Self-Supervised Knowledge Distillation'

OD-Rec Codes for SIGIR'22 Paper 'On-Device Next-Item Recommendation with Self-Supervised Knowledge Distillation' Paper, saved teacher models and Andro

Xin Xia 11 Nov 22, 2022
High-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently.

TL;DR Ignite is a high-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently. Click on the image to

4.2k Jan 01, 2023
Flexible-CLmser: Regularized Feedback Connections for Biomedical Image Segmentation

Flexible-CLmser: Regularized Feedback Connections for Biomedical Image Segmentation The skip connections in U-Net pass features from the levels of enc

Boheng Cao 1 Dec 29, 2021
This is an example of a reproducible modelling project

An example of a reproducible modelling project What are we doing? This example was created for the 2021 fall lecture series of Stanford's Center for O

Armin Thomas 2 Oct 26, 2021
Code for HodgeNet: Learning Spectral Geometry on Triangle Meshes, in SIGGRAPH 2021.

HodgeNet | Webpage | Paper | Video HodgeNet: Learning Spectral Geometry on Triangle Meshes Dmitriy Smirnov, Justin Solomon SIGGRAPH 2021 Set-up To ins

Dima Smirnov 61 Nov 27, 2022
Official Repo for Ground-aware Monocular 3D Object Detection for Autonomous Driving

Visual 3D Detection Package: This repo aims to provide flexible and reproducible visual 3D detection on KITTI dataset. We expect scripts starting from

Yuxuan Liu 305 Dec 19, 2022
Official implementation of the network presented in the paper "M4Depth: A motion-based approach for monocular depth estimation on video sequences"

M4Depth This is the reference TensorFlow implementation for training and testing depth estimation models using the method described in M4Depth: A moti

Michaël Fonder 76 Jan 03, 2023
PyTorch implementation of 'Gen-LaneNet: a generalized and scalable approach for 3D lane detection'

(pytorch) Gen-LaneNet: a generalized and scalable approach for 3D lane detection Introduction This is a pytorch implementation of Gen-LaneNet, which p

Yuliang Guo 233 Jan 06, 2023
Emulation and Feedback Fuzzing of Firmware with Memory Sanitization

BaseSAFE This repository contains the BaseSAFE Rust APIs, introduced by "BaseSAFE: Baseband SAnitized Fuzzing through Emulation". The example/ directo

Security in Telecommunications 138 Dec 16, 2022
Code for the paper: Fighting Fake News: Image Splice Detection via Learned Self-Consistency

Fighting Fake News: Image Splice Detection via Learned Self-Consistency [paper] [website] Minyoung Huh *12, Andrew Liu *1, Andrew Owens1, Alexei A. Ef

minyoung huh (jacob) 174 Dec 09, 2022