Rainbow-DemoRL

Rainbow-DemoRL is a modular framework for combining demonstrations with reinforcement learning for robot manipulation. It implements three orthogonal strategies for leveraging demonstrations during online RL training and supports arbitrary hybrid combinations of these strategies. It is built on ManiSkill's GPU-accelerated environments.

Three Strategies for Using Demonstrations

Strategy	Flag / keys	Idea
A: Direct data	`--use-offline-data-for-rl` (RLPD), `--use-auxiliary-bc-loss`	Prefill buffer with offline trajectories for direct use in online RL update (RLPD) and/or add a BC term on the actor.
B: Offline pretrain	`--pretrained-offline-policy-type`, `--pretrained-offline-value-type`	Train BC / CQL / CalQL / MCQ on HDF5 demos, then finetune online (CQL, CalQL).
C: Action mixing	`IBRL_*`, `CHEQ_*`, `RESRL_*`	Blend or choose between a frozen control prior and the RL policy (IBRL, CHEQ, Residual RL).
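
As a rough illustration of how the three action-mixing variants in row C differ, the sketch below treats actions as plain Python lists. The function names, the `toy_q` critic, the mixing weight `lam`, and the `[-1, 1]` clipping range are hypothetical stand-ins rather than identifiers from this repository, and the real IBRL, CHEQ, and Residual RL implementations are more involved (CHEQ, for example, adapts its weight online instead of taking a constant).

```python
from typing import Callable, List, Sequence


def ibrl_select(prior_action: Sequence[float], rl_action: Sequence[float],
                q_value: Callable[[Sequence[float]], float]) -> List[float]:
    """Choose between proposals: keep whichever action the critic scores higher."""
    if q_value(prior_action) >= q_value(rl_action):
        return list(prior_action)
    return list(rl_action)


def cheq_blend(prior_action: Sequence[float], rl_action: Sequence[float],
               lam: float) -> List[float]:
    """Blend proposals: lam weights the RL action, (1 - lam) weights the prior."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [(1.0 - lam) * p + lam * r for p, r in zip(prior_action, rl_action)]


def residual_add(prior_action: Sequence[float], residual: Sequence[float],
                 low: float = -1.0, high: float = 1.0) -> List[float]:
    """Residual RL: the learned policy outputs a correction added to the prior action."""
    # Clipping to [low, high] assumes a normalized action space (assumption).
    return [min(high, max(low, p + d)) for p, d in zip(prior_action, residual)]


# Toy critic (assumption): prefers small-magnitude actions.
toy_q = lambda a: -sum(x * x for x in a)
prior, rl = [0.5, -0.5], [0.1, 0.2]

chosen = ibrl_select(prior, rl, toy_q)        # the toy critic prefers the RL action
blended = cheq_blend(prior, rl, lam=0.25)     # mostly the prior action
corrected = residual_add(prior, [0.8, 0.1])   # first dimension is clipped to 1.0
```

In all three cases the control prior stays frozen; only the RL policy (and, for a learned weight, the mixing rule) changes during training.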

WSRL (A+B): set --offline-buffer-type rollout to fill the offline buffer with rollouts from the pretrained policy instead of raw demos.
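
The difference between the `demos` and `rollout` buffer types can be sketched as follows. Everything here (the toy `ToyEnv`, `collect_rollouts`, the transition tuple layout, the stand-in policy) is a hypothetical illustration and not the repository's buffer code. The point is only that the warm-start data comes from running the pretrained policy in the environment rather than from reading the demonstration file.

```python
import random
from typing import Callable, List, Tuple

# (obs, action, reward, next_obs, done); layout chosen for this sketch only
Transition = Tuple[float, float, float, float, bool]


class ToyEnv:
    """1-D point that should move to the origin; stands in for a real task."""

    def __init__(self, horizon: int = 5, seed: int = 0):
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0
        self.x = 0.0

    def reset(self) -> float:
        self.t = 0
        self.x = self.rng.uniform(-1.0, 1.0)
        return self.x

    def step(self, action: float) -> Tuple[float, float, bool]:
        self.x += action
        self.t += 1
        return self.x, -abs(self.x), self.t >= self.horizon


def collect_rollouts(env: ToyEnv, policy: Callable[[float], float],
                     num_transitions: int) -> List[Transition]:
    """Fill a buffer by acting with a fixed (pretrained) policy."""
    buffer: List[Transition] = []
    obs = env.reset()
    while len(buffer) < num_transitions:
        action = policy(obs)
        next_obs, reward, done = env.step(action)
        buffer.append((obs, action, reward, next_obs, done))
        obs = env.reset() if done else next_obs
    return buffer


# Stand-in for an actor loaded from a Stage 2 checkpoint (assumption).
pretrained_policy = lambda obs: -0.5 * obs
offline_buffer = collect_rollouts(ToyEnv(), pretrained_policy, num_transitions=12)
print(len(offline_buffer), "transitions collected from the pretrained policy")
```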

Beyond the paper: ACT (ACT, ACT_TD3), PARL (PARL_TD3, PARL_SAC, PARL_ACT).

Base RL: TD3 and SAC. All hybrid variants build on one of these; we recommend SAC.

Installation

git clone https://github.com/dwaitbhatt/Rainbow-DemoRL.git && cd Rainbow-DemoRL
pip install -e .

Training Pipeline

The general workflow has up to three stages. By default, we recommend generating demonstrations by training a simple RL expert (SAC or TD3) for roughly 1M environment steps and saving its replay buffer to HDF5. Because that data is collected by an RL agent acting in the same environment, it resembles the experience gathered during hybrid training. Alternatively, you can use motion-planning trajectories from ManiSkill when you want cheap, task-specific expert data without RL pretraining.

  Stage 1: Demonstrations                    Stage 2 (optional):            Stage 3: Online RL
  (pick one source)                         Offline Pretraining
 ┌────────────────────────────┐             ┌──────────────────────┐        ┌──────────────────────────┐
 │ DEFAULT: Train RL expert   │──h5──>      │ BC / CQL / CalQL /   │─.pt──> │ SAC or TD3               │
 │ SAC or TD3, ~1M steps      │             │ MCQ / ACT            │        │ + Strategy A/B/C flags   │
 │ + --save-buffer            │             └──────────────────────┘        └──────────────────────────┘
 └────────────────────────────┘
 ┌────────────────────────────┐
 │ ALTERNATIVE: Motion        │
 │ planning demos             │
 └────────────────────────────┘

Stage 1 -- Default (RL expert demos): Run pure online RL with --save-buffer. The trainer writes trajectories under demos/<robot>/<env_id>/rl_buffer/<exp_name>/ as ManiSkill-compatible HDF5 (see TrajReplayBuffer.enable_saving in the online trainer). Use that file as --demo-path for offline training, RLPD, or filtering (e.g. filter_dataset_by_return.py for top-X% expert slices).

Stage 1 -- Alternative (motion planning): Run python -m rainbow_demorl.generate_motionplanning_demos to produce solver-generated trajectories without training an RL policy first.

Stage 2 (optional, for Strategies B and C) trains an offline policy and/or value function from the HDF5 demonstrations and produces a .pt checkpoint.

Stage 3 trains the online RL agent, optionally leveraging the pretrained checkpoint (Strategy B), demo data (Strategy A), and/or a control prior for action mixing (Strategy C).
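
To make the three stages concrete, the sketch below assembles one command per stage using only flags that appear in this README (the ACT to ACT_TD3 route). The helper `train_cmd` and all file paths are placeholders chosen for illustration. The actual location of the saved buffer and the checkpoint name depend on your run. The commands are only printed, not executed.

```python
import shlex
from typing import Dict, List, Optional


def train_cmd(algorithm: str, env_id: str = "PickCube-v1",
              robot: str = "xarm6_robotiq",
              flags: Optional[Dict[str, Optional[object]]] = None) -> List[str]:
    """Build an argv list for `python -m rainbow_demorl.train` (hypothetical helper)."""
    cmd = ["python", "-m", "rainbow_demorl.train",
           "-a", algorithm, "-e", env_id, "-r", robot]
    for name, value in (flags or {}).items():
        cmd.append(f"--{name}")
        if value is not None:  # None marks a boolean switch such as --save-buffer
            cmd.append(str(value))
    return cmd


DEMO_PATH = "path/to/trajectory.h5"  # placeholder, adjust to your saved buffer
ACT_CKPT = "path/to/act.pt"          # placeholder, adjust to your checkpoint

pipeline = [
    # Stage 1: train an RL expert and save its replay buffer
    train_cmd("SAC", flags={"online-learning-timesteps": 1_000_000,
                            "save-buffer": None,
                            "exp-name": "my_sac_expert_buffer"}),
    # Stage 2: offline pretraining on the demonstrations
    train_cmd("ACT", flags={"demo-path": DEMO_PATH}),
    # Stage 3: online finetuning from the pretrained checkpoint
    train_cmd("ACT_TD3", flags={"pretrained-offline-policy-type": "ACT",
                                "pretrained-offline-policy-path": ACT_CKPT,
                                "offline-buffer-type": "demos",
                                "demo-path": DEMO_PATH}),
]

for stage, cmd in enumerate(pipeline, start=1):
    print(f"Stage {stage}: {shlex.join(cmd)}")
```

Passing each list to `subprocess.run(cmd, check=True)` would run the stages in order, provided the placeholder paths are replaced first.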

Quick Start

1. Obtain demonstrations

Recommended - train an RL expert (~1M steps), save the replay buffer, filter best trajectories:

# Train a SAC expert and save its replay buffer
python -m rainbow_demorl.train \
    -a SAC \
    -e PickCube-v1 \
    -r xarm6_robotiq \
    --online-learning-timesteps 1000000 \
    --save-buffer \
    --exp-name my_sac_expert_buffer

# Filter the saved trajectories by return
python rainbow_demorl/utils/filter_dataset_by_return.py -i path/to/trajectory.h5 -p 0.9

Adjust -i to your actual .h5 path. Use the filtered file as --demo-path for BC, CQL, ACT, RLPD, etc.
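
Conceptually, return-based filtering ranks trajectories by their summed reward and keeps the best ones. The sketch below shows that idea on in-memory reward lists. It is not the code of `filter_dataset_by_return.py`, and the function name `top_fraction_by_return` and its `fraction` argument are hypothetical. How the script's `-p` value maps to the kept fraction should be checked against the script itself.

```python
import math
from typing import Dict, List


def top_fraction_by_return(rewards_per_traj: Dict[str, List[float]],
                           fraction: float) -> List[str]:
    """Return trajectory names with the highest summed reward, best first."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must lie in (0, 1]")
    returns = {name: sum(rewards) for name, rewards in rewards_per_traj.items()}
    num_keep = max(1, math.ceil(fraction * len(returns)))
    ranked = sorted(returns, key=returns.get, reverse=True)
    return ranked[:num_keep]


# Toy per-step rewards for four trajectories (made-up data).
toy_rewards = {
    "traj_0": [0.0, 0.0, 1.0],
    "traj_1": [0.0, 0.0, 0.0],
    "traj_2": [1.0, 1.0, 1.0],
    "traj_3": [0.0, 1.0, 1.0],
}
kept = top_fraction_by_return(toy_rewards, fraction=0.5)
print(kept)
```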

Alternative - motion planning (no RL expert):

python -m rainbow_demorl.generate_motionplanning_demos \
    -nt 1000 \
    -e PickCube-v1 \
    -r xarm6_robotiq

2. Train offline policy / value functions (BC, CQL, ACT, etc.)

# Simple Behavioral Cloning
python -m rainbow_demorl.train \
    -a BC_DET \
    -e PickCube-v1 \
    -r xarm6_robotiq \
    --demo-path path/to/trajectory.h5

# CQL (offline RL)
python -m rainbow_demorl.train \
    -a CQL \
    --cql_variant cql-rho \
    -e PickCube-v1 \
    -r xarm6_robotiq \
    --demo-path path/to/trajectory.h5

# ACT (offline imitation on the same demonstrations)
python -m rainbow_demorl.train \
    -a ACT \
    -e PickCube-v1 \
    -r xarm6_robotiq \
    --demo-path path/to/trajectory.h5
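
For orientation, deterministic behavioral cloning reduces to supervised regression from observations to demonstrated actions. The sketch below fits a one-parameter linear policy by gradient descent on a mean-squared error. The data, learning rate, and the name `bc_fit` are made up for illustration; the repository's `BC_DET` trainer operates on HDF5 demonstrations and is not reproduced here.

```python
from typing import List, Tuple


def bc_fit(demos: List[Tuple[float, float]], lr: float = 0.1,
           steps: int = 200) -> float:
    """Fit action = w * obs by minimizing the mean squared action error."""
    w = 0.0
    n = len(demos)
    for _ in range(steps):
        # Gradient of mean((w * obs - act)^2) with respect to w
        grad = sum(2.0 * (w * obs - act) * obs for obs, act in demos) / n
        w -= lr * grad
    return w


# Synthetic "expert" that always acts with action = -0.5 * obs (assumption).
demos = [(x / 10.0, -0.5 * x / 10.0) for x in range(-10, 11)]
w = bc_fit(demos)
print(f"learned gain: {w:.4f}")
```

Offline RL methods such as CQL add a value-learning objective on top of the same dataset, which is what makes their critics reusable as Strategy B value initializations.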

3. Train hybrid methods as below

Example commands

Pattern	Command gist
Pure online	`-a SAC`
RLPD (A)	`-a SAC --use-offline-data-for-rl --offline-buffer-type demos --demo-path ...`
BC finetune (B)	`-a SAC --pretrained-offline-policy-type BC_GAUSS --pretrained-offline-policy-path ...`
ACT offline + ACT_TD3 (B)	`-a ACT --demo-path ...` then `-a ACT_TD3 --pretrained-offline-policy-type ACT --pretrained-offline-policy-path ... --offline-buffer-type demos --demo-path ...` (same HDF5 for `norm_stats`)
RLPD + BC (A+B)	BC path + `--use-offline-data-for-rl --offline-buffer-type demos --demo-path ...`
WSRL (A+B)	pretrained policy path + `--use-offline-data-for-rl --offline-buffer-type rollout`
IBRL + aux BC (A+C)	`-a IBRL_TD3 --control-prior-path ... --use-auxiliary-bc-loss --offline-buffer-type demos --demo-path ...`
CalQL value + CHEQ (B+C)	`-a CHEQ_SAC --pretrained-offline-value-type CALQL --pretrained-offline-value-path ... --control-prior-path ...`
RLPD + CQL value + IBRL (A+B+C)	IBRL + `--pretrained-offline-value-type CQL_RHO --pretrained-offline-value-path ...` + RLPD flags + `--demo-path ...`
PARL	`-a PARL_TD3` (no control prior); `PARL_ACT` needs `--demo-path`

Full copy-paste blocks:

# RLPD
python -m rainbow_demorl.train -a SAC -e PickCube-v1 -r xarm6_robotiq \
  --use-offline-data-for-rl --offline-buffer-type demos --demo-path path/to/trajectory.h5

# ACT_TD3 finetune after offline ACT
python -m rainbow_demorl.train -a ACT_TD3 -e PickCube-v1 -r xarm6_robotiq \
  --pretrained-offline-policy-type ACT --pretrained-offline-policy-path path/to/act.pt \
  --offline-buffer-type demos --demo-path path/to/trajectory.h5
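
The RLPD command above amounts to drawing each update batch from both the online replay buffer and the offline data. A minimal sketch of that sampling step follows, assuming an even 50/50 split; the ratio actually used by this repository is not stated here, and `sample_mixed_batch`, `online_buffer`, and `offline_buffer` are hypothetical names.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_mixed_batch(online_buffer: Sequence[T], offline_buffer: Sequence[T],
                       batch_size: int, offline_fraction: float = 0.5,
                       rng: Optional[random.Random] = None) -> List[T]:
    """Sample a batch with a fixed share of offline transitions."""
    rng = rng or random.Random()
    num_offline = int(round(batch_size * offline_fraction))
    num_online = batch_size - num_offline
    batch = [rng.choice(offline_buffer) for _ in range(num_offline)]
    batch += [rng.choice(online_buffer) for _ in range(num_online)]
    rng.shuffle(batch)  # avoid a fixed offline/online ordering within the batch
    return batch


# Tagged toy transitions so the source of each sample is visible.
online_buffer = [("online", i) for i in range(100)]
offline_buffer = [("offline", i) for i in range(10)]

batch = sample_mixed_batch(online_buffer, offline_buffer, batch_size=8,
                           rng=random.Random(0))
num_offline_in_batch = sum(1 for source, _ in batch if source == "offline")
print(num_offline_in_batch, "of", len(batch), "samples come from the offline data")
```

Because the offline share is fixed per batch, the demonstrations keep influencing updates even after the online buffer has grown much larger than the demo set.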

Main CLI flags

Flag	Alias	Role
`--algorithm`	`-a`	Algorithm name (`SAC`, `TD3`, `BC_DET`, `CQL`, `ACT`, `ACT_TD3`, `IBRL_TD3`, …)
`--env-id`	`-e`	ManiSkill task
`--robot`	`-r`	e.g. `xarm6_robotiq`, `panda`
`--online-learning-timesteps`	`-ton`	Online environment interaction steps
`--offline-learning-grad-steps`	`-toff`	Offline training steps
`--demo-path`		HDF5 demonstrations
`--offline-buffer-type`		`none` / `demos` / `rollout`
`--use-offline-data-for-rl`		RLPD-style mixing
`--use-auxiliary-bc-loss`		Extra BC on actor
`--pretrained-offline-policy-type` / `--pretrained-offline-policy-path`		Strategy B actor
`--pretrained-offline-value-type` / `--pretrained-offline-value-path`		Strategy B critic
`--control-prior-type` / `--control-prior-path`		Strategy C prior
`--save-buffer`		Save online replay under `demos/`

python -m rainbow_demorl.train --help lists everything (CHEQ, CQL variants, ACT, PARL, etc.).
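
The `--use-auxiliary-bc-loss` flag is described above as an extra BC term on the actor. One common way to write such an objective is the usual actor loss plus a weighted imitation error, sketched below. The function name, the `bc_weight` parameter, and the squared-error form are assumptions for illustration and may differ from what the trainer computes.

```python
from typing import List, Sequence


def actor_loss_with_bc(q_values: Sequence[float],
                       policy_actions: Sequence[Sequence[float]],
                       demo_actions: Sequence[Sequence[float]],
                       bc_weight: float) -> float:
    """Mean of -Q over the batch plus bc_weight times the mean squared action error."""
    rl_term = -sum(q_values) / len(q_values)
    per_sample_error: List[float] = [
        sum((p - d) ** 2 for p, d in zip(pa, da)) / len(pa)
        for pa, da in zip(policy_actions, demo_actions)
    ]
    bc_term = sum(per_sample_error) / len(per_sample_error)
    return rl_term + bc_weight * bc_term


# Made-up batch of two samples with 2-D actions.
loss = actor_loss_with_bc(
    q_values=[1.0, 3.0],
    policy_actions=[[0.0, 0.0], [1.0, 1.0]],
    demo_actions=[[0.0, 1.0], [1.0, 0.0]],
    bc_weight=2.0,
)
print(loss)
```

Setting `bc_weight` to zero recovers the plain actor objective, so the BC term acts purely as a regularizer toward the demonstrated actions.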

Environments

Any registered ManiSkill env (e.g. PickCube-v1, PushCube-v1, StackCube-v1) can be used. We also provide examples of custom variants of PickCube in rainbow_demorl/envs/maniskill.py.

License

This project is licensed under the MIT License. See LICENSE for details.
