RL Games: High performance RL library

Note: The next release will be 2.0.0 (unreleased). It migrates fully from gym to gymnasium. EnvPool support has been restored with envpool >= 1.2.5 (Python 3.11–3.14, NumPy 2.x, MuJoCo 3.x compatible).

Discord Channel Link

https://discord.gg/hnYRq7DsQh

Papers and related links

Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning: https://arxiv.org/abs/2108.10470
DeXtreme: Transfer of Agile In-Hand Manipulation from Simulation to Reality: https://dextreme.org/ https://arxiv.org/abs/2210.13702
Transferring Dexterous Manipulation from GPU Simulation to a Remote Real-World TriFinger: https://s2r2-ig.github.io/ https://arxiv.org/abs/2108.09779
Is Independent Learning All You Need in the StarCraft Multi-Agent Challenge? https://arxiv.org/abs/2011.09533
Superfast Adversarial Motion Priors (AMP) implementation: https://twitter.com/xbpeng4/status/1506317490766303235 https://github.com/NVIDIA-Omniverse/IsaacGymEnvs
OSCAR: Data-Driven Operational Space Control for Adaptive and Robust Robot Manipulation: https://cremebrule.github.io/oscar-web/ https://arxiv.org/abs/2110.00704
EnvPool: A Highly Parallel Reinforcement Learning Environment Execution Engine: https://arxiv.org/abs/2206.10558 and https://github.com/sail-sg/envpool
TimeChamber: A Massively Parallel Large Scale Self-Play Framework: https://github.com/inspirai/TimeChamber

Some results on the different environments

NVIDIA Isaac Gym

Dextreme

DexPBT

MJLab (MuJoCo Lab) — quadruped and humanoid locomotion

Starcraft 2 Multi Agents
BRAX
DeepMind Control Suite
EnvPool — high-throughput MuJoCo / Atari / DM Control vectorized envs
Random Envs

Implemented in Pytorch:

PPO with the support of asymmetric actor-critic variant
Support of end-to-end GPU accelerated training pipeline with Isaac Gym and Brax
Masked actions support
Multi-agent training, decentralized and centralized critic variants
Self-play

Implemented in Tensorflow 1.x (was removed in this version):

Rainbow DQN
A2C
PPO

Quickstart: Colab in the Cloud

Explore RL Games quickly and easily in Colab notebooks:

Mujoco training: Mujoco gymnasium training example.
Brax training: Brax training example that keeps all observations and actions on the GPU.
Onnx discrete space export example with Cartpole
Onnx continuous space export example with Pendulum
Onnx continuous space with LSTM export example with Pendulum

Installation

For maximum training performance, PyTorch >= 2.2 with CUDA is recommended.

pip install rl-games

Or clone the repo and install the latest version from source:

pip install -e .

With optional extras (e.g. Atari, Mujoco, EnvPool):

pip install -e ".[atari,mujoco,envpool]"

Available extras: atari, mujoco, envpool, brax, pufferlib.

For high-throughput vectorized MuJoCo / Atari / DM Control training, install the envpool extra and see docs/ENVPOOL.md.

Using uv (recommended)

uv is a fast Python package manager. To create a virtual environment and install rl_games:

uv venv --python 3.11
source .venv/bin/activate
uv pip install -e ".[mujoco,envpool]"

Citing

If you use rl-games in your research please use the following citation:

@misc{rl-games2021,
title = {rl-games: A High-performance Framework for Reinforcement Learning},
author = {Makoviichuk, Denys and Makoviychuk, Viktor},
month = {May},
year = {2021},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/Denys88/rl_games}},
}

Development setup

uv venv --python 3.11
source .venv/bin/activate
uv pip install -e ".[atari,mujoco]"

Training

NVIDIA Isaac Gym

Download and follow the installation instructions of Isaac Gym: https://developer.nvidia.com/isaac-gym
And IsaacGymEnvs: https://github.com/NVIDIA-Omniverse/IsaacGymEnvs

Ant

python train.py task=Ant headless=True
python train.py task=Ant test=True checkpoint=nn/Ant.pth num_envs=100

Humanoid

python train.py task=Humanoid headless=True
python train.py task=Humanoid test=True checkpoint=nn/Humanoid.pth num_envs=100

Shadow Hand block orientation task

python train.py task=ShadowHand headless=True
python train.py task=ShadowHand test=True checkpoint=nn/ShadowHand.pth num_envs=100

Other

Atari Pong

python runner.py --train --file rl_games/configs/atari/ppo_pong.yaml
python runner.py --play --file rl_games/configs/atari/ppo_pong.yaml --checkpoint nn/PongNoFrameskip.pth

Brax Ant

pip install -U "jax[cuda12]"
pip install brax
python runner.py --train --file rl_games/configs/brax/ppo_ant.yaml
python runner.py --play --file rl_games/configs/brax/ppo_ant.yaml --checkpoint runs/Ant_brax/nn/Ant_brax.pth

Experiment tracking

rl_games supports experiment tracking with Weights and Biases.

python runner.py --train --file rl_games/configs/atari/ppo_breakout_torch.yaml --track
WANDB_API_KEY=xxxx python runner.py --train --file rl_games/configs/atari/ppo_breakout_torch.yaml --track
python runner.py --train --file rl_games/configs/atari/ppo_breakout_torch.yaml --wandb-project-name rl-games-special-test --track
python runner.py --train --file rl_games/configs/atari/ppo_breakout_torch.yaml --wandb-project-name rl-games-special-test --wandb-entity openrlbenchmark --track

Multi GPU

We use torchrun to orchestrate any multi-gpu runs.

torchrun --standalone --nnodes=1 --nproc_per_node=2 runner.py --train --file rl_games/configs/ppo_cartpole.yaml
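
torchrun starts one process per GPU and exports the RANK, LOCAL_RANK, and WORLD_SIZE environment variables to each of them. As a minimal sketch (the helper name `read_dist_env` is hypothetical and is not part of rl_games), a script could read that layout like this to pick its device:

```python
import os


def read_dist_env(environ=os.environ):
    """Read the process layout that torchrun exports (hypothetical helper).

    Falls back to a single-process layout when the variables are absent,
    for example when the script is started with plain `python`.
    """
    rank = int(environ.get("RANK", 0))
    local_rank = int(environ.get("LOCAL_RANK", 0))
    world_size = int(environ.get("WORLD_SIZE", 1))
    return {
        "rank": rank,                      # global process index
        "local_rank": local_rank,          # process index on this node
        "world_size": world_size,          # total number of processes
        "device": f"cuda:{local_rank}",    # one GPU per local process
        "multi_gpu": world_size > 1,
    }


if __name__ == "__main__":
    print(read_dist_env())
```

With the command above (`--nproc_per_node=2`), the two processes would see local ranks 0 and 1 and a world size of 2.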

Config Parameters

Field	Example Value	Default	Description
seed	8	None	Seed for pytorch, numpy etc.
algo			Algorithm block.
name	a2c_continuous	None	Algorithm name. Possible values are: sac, a2c_discrete, a2c_continuous
model			Model block.
name	continuous_a2c_logstd	None	Possible values: continuous_a2c (expects sigma to be in (0, +inf)), continuous_a2c_logstd (expects sigma to be in (-inf, +inf)), a2c_discrete, a2c_multi_discrete
network			Network description.
name	actor_critic		Possible values: actor_critic or soft_actor_critic.
separate	False		Whether to use a separate network with the same architecture for the critic. In almost all cases, if you normalize value it is better to keep it False.
space			Network space
continuous			continuous or discrete
mu_activation	None		Activation for mu. In almost all cases None works the best, but we may try tanh.
sigma_activation	None		Activation for sigma. Will be treated as log(sigma) or sigma depending on the model.
mu_init			Initializer for mu.
name	default
sigma_init			Initializer for sigma. If you are using the logstd model, a good value is 0.
name	const_initializer
val	0
fixed_sigma	True		If true then sigma vector doesn't depend on input.
cnn			Convolution block.
type	conv2d		Type: right now two types supported: conv2d or conv1d
activation	elu		activation between conv layers.
initializer			Initializer. I took some names from TensorFlow.
name	glorot_normal_initializer		Initializer name
gain	1.4142		Additional parameter.
convs			Convolution layers. Same parameters as we have in torch.
filters	32		Number of filters.
kernel_size	8		Kernel size.
strides	4		Strides
padding	0		Padding
filters	64		Next convolution layer info.
kernel_size	4
strides	2
padding	0
filters	64
kernel_size	3
strides	1
padding	0
mlp			MLP Block. Convolution is supported too. See other config examples.
units			Array of sizes of the MLP layers, for example: [512, 256, 128]
d2rl	False		Use d2rl architecture from https://arxiv.org/abs/2010.09163.
activation	elu		Activations between dense layers.
initializer			Initializer.
name	default		Initializer name.
rnn			RNN block.
name	lstm		RNN Layer name. lstm and gru are supported.
units	256		Number of units.
layers	1		Number of layers
before_mlp	False	False	Apply rnn before mlp block or not.
config			RL Config block.
reward_shaper			Reward Shaper. Can apply simple transformations.
min_val	-1		You can apply min_val, max_val, scale and shift.
scale_value	0.1	1
normalize_advantage	True	True	Normalize Advantage.
gamma	0.995		Reward Discount
tau	0.95		Lambda for GAE. Called tau by mistake a long time ago because lambda is a keyword in Python :(
learning_rate	3e-4		Learning rate.
name	walker		Name which will be used in tensorboard.
save_best_after	10		How many epochs to wait before starting to save the checkpoint with the best score.
score_to_win	300		Training stops once the score is >= this value.
grad_norm	1.5		Grad norm. Applied if truncate_grads is True. Good value is in (1.0, 10.0)
entropy_coef	0		Entropy coefficient. Good value for continuous space is 0. For discrete is 0.02
truncate_grads	True		Apply truncate grads or not. It stabilizes training.
env_name	BipedalWalker-v3		Environment name.
e_clip	0.2		clip parameter for ppo loss.
clip_value	False		Apply clip to the value loss. If you are using normalize_value you don't need it.
num_actors	16		Number of running actors/environments.
horizon_length	4096		Horizon length per each actor. Total number of steps will be num_actors * horizon_length * num_agents (if env is not MA num_agents==1).
minibatch_size	8192		Minibatch size. Total number of steps must be divisible by minibatch size.
minibatch_size_per_env	8		Minibatch size per env. If specified, overrides the default minibatch size with minibatch_size_per_env * num_envs.
mini_epochs	4		Number of miniepochs. Good value is in [1,10]
critic_coef	2		Critic coef. by default critic_loss = critic_coef * 1/2 * MSE.
lr_schedule	adaptive	None	Scheduler type. Could be None, linear or adaptive. Adaptive is the best for continuous control tasks. Learning rate is changed every miniepoch.
kl_threshold	0.008		KL threshold for the adaptive schedule. If KL < kl_threshold/2 then lr = lr * 1.5, and the opposite when KL is too high.
normalize_input	True		Apply running mean std for input.
bounds_loss_coef	0.0		Coefficient of the auxiliary loss for continuous space.
max_epochs	10000		Maximum number of epochs to run.
max_frames	5000000		Maximum number of frames (env steps) to run.
normalize_value	True		Use value running mean std normalization.
use_diagnostics	True		Adds more information into the tensorboard.
value_bootstrap	True		Bootstrapping value when episode is finished. Very useful for different locomotion envs.
bound_loss_type	regularisation	None	Adds aux loss for continuous case. 'regularisation' is the sum of squared actions. 'bound' is the sum of actions higher than 1.1.
bounds_loss_coef	0.0005	0	Regularisation coefficient
use_smooth_clamp	False		Use smooth clamp instead of regular clamp for clipping.
zero_rnn_on_done	False	True	If False, the RNN internal state is not reset (set to 0) when an environment is reset. Could improve training in some cases, for example when domain randomization is on.
player			Player configuration block.
render	True	False	Render environment
deterministic	True	True	Use deterministic policy ( argmax or mu) or stochastic.
use_vecenv	True	False	Use vecenv to create environment for player
games_num	200		Number of games to run in the player mode.
env_config			Env configuration block. It goes directly to the environment. This example was taken from my atari wrapper.
skip	4		Number of frames to skip
name	BreakoutNoFrameskip-v4		The exact name of an (atari) gym env. This is an example; depending on the training env these parameters can be different.
evaluation	True	False	Enables the evaluation feature for inferencing while training.
update_checkpoint_freq	100	100	Frequency in number of steps to look for new checkpoints.
dir_to_monitor			Directory to search for checkpoints in during evaluation.
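
The batch-size and adaptive learning-rate rules from the table can be sketched in plain Python. This is an illustration, not the rl_games implementation: the function names are hypothetical, and the upper branch of the adaptive rule (dividing by 1.5 when KL > 2 * kl_threshold) is an assumption about what "the opposite" means.

```python
def batch_layout(num_actors, horizon_length, minibatch_size, num_agents=1):
    """Return (batch_size, num_minibatches) for one epoch of rollouts."""
    batch_size = num_actors * horizon_length * num_agents
    if batch_size % minibatch_size != 0:
        raise ValueError(
            f"batch size {batch_size} is not divisible by "
            f"minibatch_size {minibatch_size}"
        )
    return batch_size, batch_size // minibatch_size


def adaptive_lr(lr, kl, kl_threshold):
    """One update of the adaptive schedule described in the table."""
    if kl < kl_threshold / 2:
        return lr * 1.5   # policy barely changed: raise the learning rate
    if kl > kl_threshold * 2:
        return lr / 1.5   # assumed "opposite" branch: lower the learning rate
    return lr


if __name__ == "__main__":
    # Example values from the table: 16 actors, horizon 4096, minibatch 8192.
    print(batch_layout(16, 4096, 8192))   # (65536, 8)
    print(adaptive_lr(3e-4, kl=0.002, kl_threshold=0.008))
```

Checking the divisibility up front gives a clear error message before any rollout is collected.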

Custom network example:

simple test network
This network takes a dictionary observation. To register it, add the following code to your __init__.py:

from rl_games.envs.test_network import TestNetBuilder 
from rl_games.algos_torch import model_builder
model_builder.register_network('testnet', TestNetBuilder)

simple test environment example environment

Additional environment supported properties and functions

Field	Default Value	Description
use_central_value	False	If true then the returned obs is expected to be a dict with 'obs' and 'state'
value_size	1	Shape of the returned rewards. Network will support multihead value automatically.
concat_infos	False	Whether the default vecenv should convert the list of dicts to a dict of lists. Very useful if you want to use value bootstrapping. In this case you need to always return 'time_outs' : True or False, from the env.
get_number_of_agents(self)	1	Returns number of agents in the environment
has_action_mask(self)	False	Returns True if environment has invalid actions mask.
get_action_mask(self)	None	Returns action masks if has_action_mask is true. Good example is SMAC Env
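
The optional members in the table can be illustrated with a toy environment. This is a sketch, not code from rl_games: `ToyMaskedEnv`, its observation layout, and the `dicts_to_lists` helper are hypothetical, `step` returns a 4-tuple (obs, reward, done, info) only for brevity, and a real environment would also define observation and action spaces.

```python
class ToyMaskedEnv:
    """Toy 3-action env exposing the optional members from the table."""

    use_central_value = False   # obs is a plain value, not {'obs', 'state'}
    value_size = 1              # one scalar reward per step
    concat_infos = True         # ask the vecenv for a dict of lists

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.steps = 0

    def get_number_of_agents(self):
        return 1

    def has_action_mask(self):
        return True

    def get_action_mask(self):
        # Action 2 is only valid on even steps in this toy example.
        return [True, True, self.steps % 2 == 0]

    def reset(self):
        self.steps = 0
        return [0.0]

    def step(self, action):
        if not self.get_action_mask()[action]:
            raise ValueError(f"action {action} is masked out")
        self.steps += 1
        done = self.steps >= self.max_steps
        # The only way this toy episode ends is the step limit,
        # so 'time_outs' mirrors 'done'.
        info = {"time_outs": done}
        return [float(self.steps)], 1.0, done, info


def dicts_to_lists(infos):
    """Convert a list of per-env info dicts into one dict of lists."""
    return {key: [info[key] for info in infos] for key in infos[0]}


if __name__ == "__main__":
    env = ToyMaskedEnv(max_steps=2)
    env.reset()
    infos = [env.step(0)[3], env.step(1)[3]]
    print(dicts_to_lists(infos))
```

Always returning the `time_outs` key keeps the per-env info dicts uniform, which is what makes the list-to-dict conversion well defined.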

Release Notes

1.6.5

Added torch.compile support with configurable modes. Provides 10-40% performance improvement. Requires torch 2.2 or newer.
- Default mode is reduce-overhead for balanced compilation time and runtime performance
- Configurable via torch_compile parameter in yaml configs (true/false/"default"/"reduce-overhead"/"max-autotune")
- Separate compilation modes for actor and central value networks
- See torch.compile documentation for detailed configuration and mode selection guidance
Fixed critical bugs in asymmetric actor-critic (central_value) training:
- Fixed incorrect device reference in update_lr() method
- Fixed infinite loop when iterating over dataset
- Added proper __iter__ method to PPODataset class
Fixed variance calculation in RunningMeanStd to use population variance
Fixed get_mean_std_with_masks function.
Fixed missing central value optimizer state in checkpoint save/load
Added myosuite support.
Added auxiliary loss support.
Update for tacsl release: CNN tower processing, critic weights loading and freezing.
Fixed SAC input normalization.
Fixed SAC agent summary writer to use configured directory instead of hardcoded 'runs/'
Fixed default player config num_games value.
Fixed applying minibatch size per env.
Added concat_output support for RNN.
SAC improvements:
- Fixed missing gamma_tensor initialization bug
- Removed hardcoded torch.compile decorators (now respects YAML config)
- Optimized tensor operations and removed unnecessary clones
Environment wrapper fixes:
- Fixed tuple/list observation handling for compatibility with various gym environments
- Added proper numpy to torch tensor conversion in cast_obs
- Fixed missing gym import in envpool wrapper
Ray integration improvements:
- Moved Ray import to lazy loading (only when RayVecEnv is used)
- Added configurable Ray initialization with ray_config parameter
- Added proper cleanup with close() method for Ray actors
- Default 1GB object store memory allocation

1.6.1

Fixed a Central Value RNN bug which occurs if you train a multi-agent environment.
Added Deepmind Control PPO benchmark.
Added a few more experimental ways to train value prediction (OneHot, TwoHot encoding and crossentropy loss instead of L2).
The new methods cannot be turned on from the yaml files yet. Once we find an env which trains better with them, they will be added to the config.
Added shaped reward graph to the tensorboard.
Fixed bug with SAC not saving weights with save_frequency.
Added multi-node training support for GPU-accelerated training environments like Isaac Gym. No changes in training scripts are required. Thanks to @ankurhanda and @ArthurAllshire for assistance in implementation.
Added evaluation feature for inferencing during training. Checkpoints from the training process can be automatically picked up and updated in the inferencing process when enabled.
Added get/set API for runtime update of rl training parameters. Thanks to @ArthurAllshire for the initial version of fast PBT code.
Fixed SAC not loading weights properly.
Removed Ray dependency for use cases where it's not required.
Added warning for using deprecated 'seq_len' instead of 'seq_length' in configs with RNN networks.

1.6.0

Added ONNX export colab example for discrete and continuous action spaces. For the continuous case an LSTM policy example is provided as well.
Improved RNNs training in continuous space, added option zero_rnn_on_done.
Added NVIDIA CuLE support: https://github.com/NVlabs/cule
Added player config override. Vecenv is used for inference.
Fixed multi-gpu training with central value.
Fixed max_frames termination condition, and its interaction with the linear learning rate: #212
Fixed "deterministic" misspelling issue.
Fixed Mujoco and Brax SAC configs.
Fixed multiagent envs statistics reporting. Fixed Starcraft2 SMAC environments.

1.5.2

Added observation normalization to the SAC.
Brought back adaptive KL legacy mode.

1.5.1

Fixed build package issue.

1.5.0

Added wandb support.
Added poetry support.
Fixed various bugs.
Fixed cnn input was not divided by 255 in case of the dictionary obs.
Added more envpool mujoco and atari training examples. Some of the results: 15 min Mujoco humanoid training, 2 min atari pong.
Added Brax and Mujoco colab training examples.
Added 'seed' command line parameter. Will override seed in config in case it's > 0.
Deprecated horovod in favor of torch.distributed (#171).

1.4.0

Added discord channel https://discord.gg/hnYRq7DsQh :)
Added envpool support with a few atari examples. Works 3-4x faster than ray.
Added mujoco results. Much better than openai spinning up ppo results.
Added tcnn(https://github.com/NVlabs/tiny-cuda-nn) support. Reduces 5-10% of training time in the IsaacGym envs.
Various fixes and improvements.

1.3.2

Added 'sigma' command line parameter. Will override sigma for continuous space in case if fixed_sigma is True.

1.3.1

Fixed SAC not working

1.3.0

Simplified rnn implementation. Works a little bit slower but much more stable.
Now central value can be non-rnn if policy is rnn.
Removed load_checkpoint from the yaml file. Now --checkpoint works for both train and play.

1.2.0

Added Swish (SILU) and GELU activations, it can improve Isaac Gym results for some of the envs.
Removed tensorflow and made initial cleanup of the old/unused code.
Simplified runner.
Now networks are created in the algos with load_network method.

1.1.4

Fixed crash in a play (test) mode in player, when simulation and rl_devices are not the same.
Fixed various multi gpu errors.

1.1.3

Fixed crash when running single Isaac Gym environment in a play (test) mode.
Added config parameter clip_actions for switching off internal action clipping and rescaling

1.1.0

Added to pypi: pip install rl-games
Added reporting env (sim) step fps, without policy inference. Improved naming.
Renames in yaml config for better readability: steps_num to horizon_length and lr_threshold to kl_threshold

Custom stop callback

In addition to the built-in score_to_win, max_epochs, and max_frames stop conditions, training can be terminated by a user-defined callback. The callback receives the algorithm instance and returns True to stop. It is checked once per epoch on rank 0 and broadcast to other ranks under multi-GPU. Works with PPO (continuous and discrete) and SAC.

The signature:

def my_stop(algo) -> bool:
    # algo exposes: epoch_num, frame, last_mean_rewards, mean_rewards,
    # game_rewards, game_lengths, config, ...
    return algo.last_mean_rewards > 18.0 and algo.epoch_num >= 100

There are two ways to wire it in.

A. Manual algo instantiation

Build the agent yourself, set the callback, then call train():

import yaml
from rl_games.torch_runner import Runner

with open('rl_games/configs/atari/ppo_pong.yaml') as f:
    cfg = yaml.safe_load(f)

runner = Runner()
runner.load(cfg)
runner.load_config(runner.default_config)

agent = runner.algo_factory.create(
    runner.algo_name, base_name='run', params=runner.params
)
agent.stop_fn = lambda algo: algo.last_mean_rewards > 18.0 and algo.epoch_num >= 100
agent.train()

B. Through `Runner.run`

Pass the callable in the runtime args dict (programmatic), or reference it from YAML by import path (string).

Programmatic:

def my_stop(algo):
    return algo.last_mean_rewards > 18.0 and algo.epoch_num >= 100

runner = Runner()
runner.load(cfg)
runner.run({'train': True, 'stop_fn': my_stop})

YAML — set config.stop_fn to either pkg.mod:function or pkg.mod.function:

params:
  config:
    stop_fn: my_project.stops:reward_plateau

The string is resolved by importlib at training start. args['stop_fn'] (programmatic) takes precedence over config['stop_fn'] (YAML) if both are set.
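
Resolving such a string can be sketched as follows. This illustrates the general importlib pattern and is not the rl_games code: `resolve_callable` is a hypothetical name, and the example paths point at the standard library so the snippet runs anywhere.

```python
import importlib


def resolve_callable(path):
    """Resolve 'pkg.mod:function' or 'pkg.mod.function' to a callable."""
    if ":" in path:
        module_name, attr = path.split(":", 1)
    else:
        # No colon: treat everything before the last dot as the module.
        module_name, _, attr = path.rpartition(".")
    if not module_name or not attr:
        raise ValueError(f"not an import path: {path!r}")
    fn = getattr(importlib.import_module(module_name), attr)
    if not callable(fn):
        raise ValueError(f"{path!r} does not refer to a callable")
    return fn


if __name__ == "__main__":
    print(resolve_callable("math:sqrt")(9.0))
    print(resolve_callable("os.path.basename")("a/b"))
```

The colon form is unambiguous about where the module path ends, which is why it is checked first.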

stop_fn must be a callable taking the algo and returning bool; a ValueError is raised at startup otherwise.
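
A stateful stop condition, such as the `reward_plateau` name used in the YAML example, could look like the sketch below. Everything here is hypothetical except the `epoch_num` and `last_mean_rewards` attributes mentioned in the signature comment: the class name, its parameters, and the plateau rule are illustrative choices.

```python
from types import SimpleNamespace


class RewardPlateau:
    """Stop when the mean reward has not improved for `patience` epochs."""

    def __init__(self, patience=20, min_delta=0.0, min_epochs=0):
        self.patience = patience
        self.min_delta = min_delta
        self.min_epochs = min_epochs
        self.best = float("-inf")
        self.stale_epochs = 0

    def __call__(self, algo) -> bool:
        reward = float(algo.last_mean_rewards)
        if reward > self.best + self.min_delta:
            # New best score: remember it and restart the counter.
            self.best = reward
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return (algo.epoch_num >= self.min_epochs
                and self.stale_epochs >= self.patience)


# Module-level instance, so that an import path ending in
# ':reward_plateau' would resolve to a callable.
reward_plateau = RewardPlateau(patience=20)

if __name__ == "__main__":
    stop = RewardPlateau(patience=2)
    for epoch, reward in enumerate([1.0, 2.0, 2.0, 2.0], start=1):
        # Stand-in for the algorithm instance, for demonstration only.
        fake_algo = SimpleNamespace(epoch_num=epoch, last_mean_rewards=reward)
        print(epoch, stop(fake_algo))
```

Because the callback is checked once per epoch, counting calls is enough to measure how long the reward has been flat.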

Troubleshooting

Some of the supported envs are not installed with setup.py; you need to install them manually.
Starting from rl-games 1.1.0, old yaml configs are not compatible with the new version:
- steps_num should be changed to horizon_length and lr_threshold to kl_threshold

Known issues

Running a single environment with Isaac Gym can cause a crash. If this happens, switch to at least 2 environments simulated in parallel.
