Advancements in PPO

Introduction

Proximal Policy Optimisation (Schulman et al., 2017) is the leading algorithm for training reinforcement learning models. Among many other applications, PPO plays a huge role in applying RLHF to LLMs (Ouyang et al., 2022). Like other reinforcement learning algorithms, it suffers from data inefficiency and from converging to local rather than global optima; improving both is an open area of research.

In this post, I present some new research that improves the efficiency and performance of the algorithm in the following ways:

  1. An improved training algorithm (which could pretentiously be called PPO+): a small modification to the original algorithm's update ordering that improves performance.
  2. An alternative critic loss, testing the idea of using a value support in the critic instead of a scalar value.
  3. A comparison of LSTM and GRU cells in recurrent architectures.
  4. Using a general distribution instead of a Gaussian for continuous action spaces.

All of the experiments are based on the incredible blog post The 37 Implementation Details of Proximal Policy Optimization (Huang et al., 2022), using their codebase as a starting point. I use the same hyperparameters as the original setup; exploring this space to optimise the results is left as further work. All code is available on GitHub, and the results of each experiment appear as you read through the post.

Update Ordering

The first trick applies to actor-critic setups in general; however, I have tested it on PPO to compare with state-of-the-art results.

The PPO algorithm (Schulman et al., 2017) is as follows:

for iteration \(i = 1, 2, \ldots\) do
  for actor \(= 1, 2, \ldots, N\) do
    Run policy \(\pi_{\theta_{i-1}}\) in environment for \(T\) timesteps
    Compute advantage estimates \(\hat{A}_t = G(s_t, a_t) - V_{\theta_{i-1}}(s_t)\)
  end for
  Optimize surrogate \(L\) wrt \(\theta_i\), with \(K\) epochs and minibatch size \(M\)
end for
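
To make the ordering concrete, one iteration of this loop can be sketched in Python as follows. The helper names (`collect_rollout`, `ppo_update`) and the rollout attributes are hypothetical placeholders rather than functions from the Huang et al. (2022) codebase:

```python
# Rough sketch of one standard PPO iteration (hypothetical helper names).
def ppo_iteration(policy, value_net, envs, T, K, M):
    # Run pi_{theta_{i-1}} in the environments for T timesteps per actor.
    rollout = collect_rollout(policy, envs, num_steps=T)

    # Advantages are computed with V_{theta_{i-1}}, which was trained on
    # experience gathered by the previous policy pi_{theta_{i-2}}.
    advantages = rollout.returns - value_net(rollout.observations)

    # Policy and value losses are then optimised jointly for K epochs
    # over minibatches of size M.
    ppo_update(policy, value_net, rollout, advantages, epochs=K, minibatch_size=M)
```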

The main problem with the current algorithm is the way the advantage \(\hat{A}_t\) is calculated. When computing the advantage at iteration \(i\), we use the value network \(V_{\theta_{i-1}}\), which was trained on experience from the policy \(\pi_{\theta_{i-2}}\). Although the divergence between successive policies is small, I propose instead using a value network \(V_{\theta_i}\) trained on experience from the current rollout policy \(\pi_{\theta_{i-1}}\). Here is the updated algorithm:

for iteration \(i = 1, 2, \ldots\) do
  for actor \(= 1, 2, \ldots, N\) do
    Run policy \(\pi_{\theta_{i-1}}\) in environment for \(T\) timesteps
  end for
  Optimize surrogate \(L_V\) wrt \(V_{\theta_i}\), with \(K\) epochs and minibatch size \(M\)
  Compute advantage estimates \(\hat{A}_t = G(s_t, a_t) - V_{\theta_i}(s_t)\)
  Optimize surrogate \(L_{\pi}\) wrt \(\pi_{\theta_i}\), with \(K\) epochs and minibatch size \(M\)
end for
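
Compared with the sketch above, the reordered iteration splits the joint update into a value phase and a policy phase, so the advantages are computed with the freshly trained value network (again with hypothetical helper names):

```python
# Rough sketch of one reordered iteration (hypothetical helper names).
def reordered_ppo_iteration(policy, value_net, envs, T, K, M):
    # Run pi_{theta_{i-1}} in the environments for T timesteps per actor.
    rollout = collect_rollout(policy, envs, num_steps=T)

    # First fit V_{theta_i} on the new rollout (K epochs, minibatches of size M).
    fit_value_function(value_net, rollout, epochs=K, minibatch_size=M)

    # Advantages now use the freshly trained V_{theta_i}.
    advantages = rollout.returns - value_net(rollout.observations)

    # Finally, optimise the clipped surrogate for the policy alone.
    update_policy(policy, rollout, advantages, epochs=K, minibatch_size=M)
```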

For this and subsequent experiments, I use the same setup as Huang et al. (2022), and my version improves on both the Acrobot and CartPole classic control environments. The modification does not address the exploration problem, hence all algorithms score 0 on MountainCar.

Additionally, I modify the Spinning Up implementation (Achiam, 2018) in the same way: the updated version performs better on Acrobot, performs similarly on CartPole, and again fails to solve MountainCar.

Value Support

The next critic trick is a negative result; however, it would not be science if I left it out. In an earlier post, I described how a value support was useful for improving performance. Unfortunately, this turns out not to be the case here. The idea is for the critic network to output a probability distribution over possible values (the support) and to use the mean of that distribution as the estimated value. For more details, see this extract from the previous post. In principle, this should work better because it is hard for a model to learn features that are linearly related to a target value, whereas predicting a distribution over bins is an easier, classification-like target. After running experiments to test this, however, we see a decrease in performance on both Acrobot and CartPole, and again a score of 0 on MountainCar.
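
To make the idea concrete, here is a minimal PyTorch sketch of a support-based critic head. The support range, number of atoms, and two-hot projection below are illustrative choices for this sketch, not necessarily the exact construction used in the experiments:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def two_hot(values: torch.Tensor, support: torch.Tensor) -> torch.Tensor:
    """Spread each scalar target over its two neighbouring support atoms."""
    delta = support[1] - support[0]
    idx = ((values - support[0]) / delta).clamp(0, support.numel() - 1)
    lower, upper = idx.floor().long(), idx.ceil().long()
    frac = idx - lower.float()
    target = torch.zeros(*values.shape, support.numel(), device=values.device)
    target.scatter_add_(-1, lower.unsqueeze(-1), (1 - frac).unsqueeze(-1))
    target.scatter_add_(-1, upper.unsqueeze(-1), frac.unsqueeze(-1))
    return target


class SupportCritic(nn.Module):
    """Critic head that predicts a distribution over a fixed value support."""

    def __init__(self, feature_dim: int, num_atoms: int = 51,
                 v_min: float = -10.0, v_max: float = 10.0):
        super().__init__()
        self.head = nn.Linear(feature_dim, num_atoms)
        # Evenly spaced support; the value estimate is the expectation over it.
        self.register_buffer("support", torch.linspace(v_min, v_max, num_atoms))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        probs = F.softmax(self.head(features), dim=-1)
        return (probs * self.support).sum(dim=-1)  # scalar value per input

    def loss(self, features: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
        # Cross-entropy between the predicted distribution and projected returns.
        log_probs = F.log_softmax(self.head(features), dim=-1)
        targets = two_hot(
            returns.clamp(float(self.support[0]), float(self.support[-1])),
            self.support,
        )
        return -(targets * log_probs).sum(dim=-1).mean()
```

In this sketch, the value loss becomes a cross-entropy against the projected return, while the rest of PPO consumes the scalar expectation exactly as before.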

LSTM vs GRU

The next experiments investigate a question that remains poorly understood: LSTM vs GRU. With the rise of transformers, LSTMs and GRUs have fallen out of fashion for next-token prediction tasks. However, the advice I have always followed when choosing between them (for any sequence prediction problem) is that nobody knows in advance which will work better on a given task, so try both and pick whichever performs best. In RL, it is still common to use LSTMs in recurrent architectures. However, the GRU performs just as well on BeamRider and Pong, and even better on Breakout. On top of this, it runs slightly faster thanks to its simpler architecture. It is worth noting that all of these experiments use the NoFrameskip version of the Atari environments, which shows the agent every frame and therefore requires a policy with finer-grained control over the duration of each action.
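
Swapping the recurrent core is a small change in PyTorch; here is a minimal sketch (the hidden size and wiring are illustrative rather than the exact agent used in the experiments):

```python
import torch.nn as nn


class RecurrentCore(nn.Module):
    """Recurrent feature extractor where the cell type is a single switch."""

    def __init__(self, input_dim: int, hidden_dim: int = 128, cell: str = "gru"):
        super().__init__()
        if cell == "gru":
            # GRU: 3 gates and a single hidden state -> fewer parameters, less compute.
            self.rnn = nn.GRU(input_dim, hidden_dim)
        elif cell == "lstm":
            # LSTM: 4 gates plus a separate cell state.
            self.rnn = nn.LSTM(input_dim, hidden_dim)
        else:
            raise ValueError(f"unknown cell type: {cell}")

    def forward(self, features, state):
        # features: (seq_len, batch, input_dim); state is h for the GRU
        # and the tuple (h, c) for the LSTM.
        return self.rnn(features, state)
```

In either case, the recurrent state still has to be reset at episode boundaries, as in the LSTM implementation described by Huang et al. (2022).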

Continuous Distributions

The final experiments challenge the assumption that you should use a Gaussian distribution with a diagonal covariance matrix for continuous action spaces. Instead, I use a general distribution defined by pointwise estimates of the PDF at equally spaced intervals. The action dimensions are still independent, meaning the model outputs \(n_\text{combs} \times \text{action\_dim}\) values that are normalised separately for each action dimension.

The comb distribution performs better on Cheetah but worse on Hopper and Walker. I cannot offer an explanation, as all of these environments seem very similar in terms of action dimensionality and objective. Even though it generalises to a much wider class of distributions, the comb distribution is much slower to run and more complicated to implement. I call this a CombDistribution; however, the PDF is linearly interpolated between the points, so the action space remains continuous rather than being a discretised version of the original action space.
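
A simplified PyTorch sketch of the idea is below. It assumes an action range of \([-1, 1]\) and treats each interval as uniform instead of implementing the exact linear interpolation between points, so it is only an approximation of the distribution used in the experiments:

```python
import math

import torch
import torch.nn.functional as F


class CombDistribution:
    """Simplified comb distribution: an independent categorical over evenly
    spaced bins per action dimension, sampled uniformly within each bin."""

    def __init__(self, logits: torch.Tensor, low: float = -1.0, high: float = 1.0):
        # logits: (batch, action_dim, n_combs), normalised separately per dimension.
        self.log_weights = F.log_softmax(logits, dim=-1)
        self.low = low
        self.n_combs = logits.shape[-1]
        self.bin_width = (high - low) / self.n_combs

    def sample(self) -> torch.Tensor:
        # Pick a bin per dimension, then a uniform offset inside that bin.
        bins = torch.distributions.Categorical(logits=self.log_weights).sample()
        offsets = torch.rand_like(bins, dtype=torch.float32) * self.bin_width
        return self.low + bins.float() * self.bin_width + offsets  # (batch, action_dim)

    def log_prob(self, actions: torch.Tensor) -> torch.Tensor:
        # Bin probability divided by bin width gives a density; dimensions are
        # independent, so log-densities sum over the action dimension.
        bins = ((actions - self.low) / self.bin_width).long().clamp(0, self.n_combs - 1)
        per_dim = self.log_weights.gather(-1, bins.unsqueeze(-1)).squeeze(-1)
        return (per_dim - math.log(self.bin_width)).sum(dim=-1)
```

Here the network head outputs \(n_\text{combs} \times \text{action\_dim}\) values, reshaped to (batch, action_dim, n_combs) before being normalised and passed to the distribution.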

Conclusion

In this post, I have presented a range of research on PPO. The most impressive results come from the first experiment, which shows how a small change in the order of updates can improve performance, and the third experiment, which shows that the GRU is an alternative to the LSTM that matches or improves performance while requiring less computation.

Reinforcement learning offers huge potential for agents to interact with and learn from their environments without the limitations of human cognition. Further research is needed before these techniques can compete with human-level learning, and I believe this is only the beginning of my contributions to this area.

Bibliography

G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, W. Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

M. Towers, J. Terry, A. Kwiatkowski, J. Balis, G. Cola, T. Deleu, M. Goulão, A. Kallinteris, A. KG, M. Krimmel et al. Gymnasium. On GitHub, 2023.

S. Huang, R. Dossa, A. Raffin, A. Kanervisto, W. Wang. The 37 Implementation Details of Proximal Policy Optimization. ICLR Blog Track, 2022.

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.

J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel et al. Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. arXiv preprint arXiv:1911.08265, 2019.

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347, 2017.

J. Achiam. Spinning Up in Deep Reinforcement Learning. On GitHub, 2018.

Code

The code is available on GitHub: https://github.com/George-Ogden/ppo-experiments contains the main experiments, and https://github.com/George-Ogden/spinningup contains the modified Spinning Up implementation.