This is the official implementation of OPEN from *Can Learned Optimization Make Reinforcement Learning Less Difficult?*, published at NeurIPS 2024 (Spotlight) and the AutoRL Workshop @ ICML 2024 (Spotlight).
OPEN is a framework for learning to optimize (L2O) in reinforcement learning. Here, we provide full JAX code to replicate the experiments in our paper and foster future work in this direction. Our current codebase can be used with environments from gymnax or Brax.
All files for running OPEN are stored in `rl_optimizer/`.
Alongside the training code in `rl_optimizer/train.py`, we include configs for [freeway, asterix, breakout, spaceinvaders, ant, gridworld]. We automate parallelisation over multiple GPUs using JAX sharding (a minimal sketch of the idea appears after the command below). The flag `--larger` can be used to increase the size of the network in OPEN. To learn an optimizer in one or a combination of these environments, run:
```
python3 train.py --envs <env> --num-rollouts <num_rollouts> --popsize <popsize> --noise-level <sigma_init> --sigma-decay <sigma_decay> --lr <lr> --lr-decay <lr-decay> --num-generations <num_gens> --save-every-k <evaluation_frequency> --wandb-name "<wandb name>" --wandb-entity "<wandb entity>" [--larger]
```
This will save a checkpoint, and evaluate the performance of the optimizer, every `--save-every-k` generations. Note that gridworld cannot be run in tandem with other environments, as it is the only environment to which we apply antithetic task sampling.
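The multi-GPU parallelisation mentioned above is built on `jax.sharding`. The snippet below is a minimal, self-contained sketch of the idea, splitting a hypothetical population of flat parameter vectors across the available devices; it is illustrative only and does not mirror the code in `rl_optimizer/train.py`.

```python
# Illustrative sketch of population sharding with jax.sharding.
# This is NOT the OPEN implementation; names and shapes here are hypothetical.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec

popsize, num_params = 64, 1024            # popsize should be divisible by the device count
population = jnp.zeros((popsize, num_params))

# 1D device mesh; split the population axis across devices, replicate the parameter axis.
mesh = Mesh(np.array(jax.devices()), axis_names=("pop",))
population = jax.device_put(population, NamedSharding(mesh, PartitionSpec("pop", None)))

@jax.jit
def fitness(pop):
    # Placeholder objective; each device only evaluates its own shard of the population.
    return -jnp.sum(pop ** 2, axis=-1)

print(fitness(population).shape)          # (popsize,), computed across all devices
```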
We include our hyperparameters in the paper. An example usage is:
```
python3 train.py --envs breakout --num-rollouts 1 --popsize 64 --noise-level 0.03 --sigma-decay 0.999 --lr 0.03 --lr-decay 0.999 --num-generations 500 --save-every-k 24 --wandb-name "OPEN Breakout"
```
To evaluate the performance of learned optimizers, run the command below, providing the relevant wandb run IDs to `--exp-name` and the generation numbers to `--exp-num`. This code is also run intermittently during training.
For experimental purposes, we provide learned weights for the trained optimizers from our paper for the aforementioned environments in `rl_optimizer/pretrained`. These can be used with the argument `--pretrained` in place of wandb IDs. Use the `--larger` flag if it was used in training, and to experiment with our pretrained `multi` optimizers pass the `--multi` flag.
```
python3 rl_optimizer.eval --envs <env-names> --exp-name <wandb experiment IDs> --exp-num <generation numbers> --num-runs 16 --title <foldername for saving files> [--pretrained --multi --larger]
```
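For example, a hypothetical invocation evaluating the pretrained breakout optimizer might look like the following; the output folder name is arbitrary and the exact flag combination is illustrative rather than prescriptive:

```
python3 rl_optimizer.eval --envs breakout --pretrained --num-runs 16 --title breakout_pretrained
```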
We include submodules for Learned Optimization and GROOVE. Therefore, when cloning this repo, be sure to use `--recurse-submodules`:
```
git clone --recurse-submodules git@github.com:AlexGoldie/rl-learned-optimization.git
```
We include requirements in `setup/requirements.txt`. Dependencies can be installed locally using:
```
pip install -r setup/requirements.txt
```
We also provide files to help build a Docker image. Since we use wandb for logging checkpoints, you should supply your wandb API key as an argument to `build_docker.sh`:
```
cd setup
chmod +x build_docker.sh
./build_docker.sh {WANDB_API_KEY}
cd ..
chmod +x run_docker.sh
./run_docker.sh {GPU_NAMES}
```
For example, to start the Docker container with access to GPUs 0 and 1, run `./run_docker.sh 0,1`.
The following projects were used extensively in the making of OPEN:
If you use OPEN in your work, please cite the following:
```bibtex
@inproceedings{goldie2024can,
  author={Alexander D. Goldie and Chris Lu and Matthew Thomas Jackson and Shimon Whiteson and Jakob Nicolaus Foerster},
  booktitle={Advances in Neural Information Processing Systems},
  title={Can Learned Optimization Make Reinforcement Learning Less Difficult?},
  year={2024},
}
```
