Transactions on Machine Learning Research (TMLR), 2025
Zhao Yang, Thomas M. Moerland, Mike Preuss, Aske Plaat, Edward S. Hu
The official implementation of the Model-based Reset-free (MoReFree) agent, a model-based RL agent for reset-free tasks.
If you find our paper or code useful, please reference us:
@article{yangreset,
  title={Reset-free Reinforcement Learning with World Models},
  author={Yang, Zhao and Moerland, Thomas M and Preuss, Mike and Plaat, Aske and Hu, Edward S},
  journal={Transactions on Machine Learning Research},
  year={2025}
}
To learn more:
MoReFree builds on PEG, an unsupervised RL agent for hard exploration tasks. We first adapted it to the reset-free setting, then added back-and-forth exploration and task-relevant imagination training.
During exploration (data collection), the long reset-free horizon is split into small 'chunks'. Within each chunk, MoReFree strikes a balance between exploring unseen states and practicing optimal behavior in task-relevant regions by directing the goal-conditioned policy to achieve evaluation states, initial states (emulating a reset), or exploratory goals (defined by PEG). During imagination training, MoReFree focuses the goal-conditioned policy training inside the world model on achieving evaluation states, initial states, and random replay buffer states, to better prepare the policy for the exploration scheme described above.
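To make the exploration scheme concrete, here is a minimal, purely illustrative sketch of how a per-chunk goal could be selected. The names (`sample_chunk_goal`, `peg_goal_fn`, `alpha`) are hypothetical and not taken from the codebase; the actual selection logic lives in `resetfree/goal_picker_wrapper.py` (see below).

```python
import random

def sample_chunk_goal(alpha, eval_goals, initial_states, peg_goal_fn):
    """Hypothetical sketch of back-and-forth goal selection for one chunk.

    With probability `alpha` the goal is task-relevant: either an evaluation
    state (practice the task) or an initial state (emulate a reset).
    Otherwise an exploratory goal is sampled via PEG.
    """
    if random.random() < alpha:
        # Back-and-forth: alternate between going towards the task goals
        # ("forth") and returning to the initial states ("back").
        if random.random() < 0.5:
            return random.choice(eval_goals)     # evaluation state
        return random.choice(initial_states)     # initial state (pseudo-reset)
    return peg_goal_fn()                         # exploratory goal from PEG
```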
Model-based reset-free RL agents (reset-free PEG and MoReFree) outperform the baselines in all tasks, and MoReFree performs best in the three more challenging tasks.
To better understand MoReFree, you can first take a look at PEG and follow the PEG installation instructions. We list our main modifications over PEG below:
- In `resetfree/env.py`, we convert a reset-free environment (EARL format) into an environment that returns a 'done' flag every n steps. This tells the PEG codebase to finish the current 'chunk' and move on to the next one by sampling another goal, while the environment itself is left as is (see the wrapper sketch after this list).
- In `resetfree/goal_picker_wrapper.py`, we define the goal sampling logic: with probability $p$ we sample task-relevant goals ($\rho_g^*$ and $\rho_0$), and the rest of the time we sample exploratory goals using PEG ($\rho_E$).
- To implement task-relevant imagination training, we set `train_env_goal_percent = 0.2`. We first include both initial states and goal states in `env_goal` (the original PEG only includes goal states), then sample from them and train the policy on them.
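The first modification can be pictured as a small gym-style wrapper. The sketch below is illustrative only (old gym API; the class name and `chunk_length` argument are made up) and is not the actual `resetfree/env.py` code:

```python
import gym

class ChunkWrapper(gym.Wrapper):
    """Sketch: emit 'done' every `chunk_length` steps so the PEG training
    loop ends the current chunk and samples a new goal, while the underlying
    reset-free environment is never actually reset."""

    def __init__(self, env, chunk_length):
        super().__init__(env)
        self.chunk_length = chunk_length
        self._t = 0
        self._last_obs = None

    def reset(self, **kwargs):
        # Only the very first reset touches the underlying environment;
        # afterwards the physical state carries over between chunks.
        if self._last_obs is None:
            self._last_obs = self.env.reset(**kwargs)
        self._t = 0
        return self._last_obs

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._last_obs = obs
        self._t += 1
        # Mark the end of the chunk, not of the (reset-free) episode.
        done = done or self._t >= self.chunk_length
        return obs, reward, done, info
```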
If you have already finished the PEG installation, you can skip this part.
Create the conda environment by running:
conda env create -f environment.yml
Then navigate to the MoReFree folder and install it as a local Python package:
# in the MoReFree folder, like /home/MoReFree/
pip install -e .
We evaluate MoReFree on 8 environments: 2 (Tabletop and Sawyer Door) are from the EARL benchmark, 3 (PointUMaze, Fetch Push and Fetch Pick&Place) are from the IBC tasks, and we built 3 more challenging tasks ourselves (Ant, Fetch Push Hard and Fetch Pick&Place Hard).
You can follow the EARL codebase to set up their environments. The other environments (from IBC or from mrl in the PEG codebase) are already included in the `envs` folder.
You can now run the experiment by
python examples/run_goal_cond.py --configs <configs here> --logdir <your_logdir>
We select the method and the environment through `configs.yaml`. If you take a look at it, you will see a number of predefined configurations. Here are the relevant ones:
- PointUMaze: `point_umaze`
- Tabletop: `earl_tabletop`
- Sawyer Door: `earl_sawyer_door`
- Fetch Push: `fetch_push`
- Fetch Push (hard): `fetch_push_hard`
- Fetch Pick&Place: `fetch_pick`
- Fetch Pick&Place (hard): `fetch_pick_hard`
- Ant: `ant_maze`
So if you would like to train a MoReFree agent in the PointUMaze environment, you need to run: `python examples/run_goal_cond.py --configs point_umaze --logdir 'zhao/logdir/pointumaze/'`.
We list some important hyperparameters that MoReFree includes:
- `alpha_s`, `alpha_e`, `alpha_decay_rate`: scheduling of $\alpha$, which controls the back-and-forth exploration ratio. In the paper, $\alpha=0.2$, so we set `alpha_s` = `alpha_e` = $0.2$ (see the schedule sketch after this list).
- `train_env_goal_percent`: the ratio of task-relevant states that the goal-conditioned policy is trained on during imagination training. In the paper, we set it to $0.2$.
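For intuition, here is one hypothetical reading of the three $\alpha$ parameters as an exponential interpolation from `alpha_s` towards `alpha_e`; the actual schedule in the codebase may use a different functional form. With the paper's setting (`alpha_s` = `alpha_e` = 0.2), any such schedule is simply constant.

```python
def alpha_schedule(step, alpha_s, alpha_e, alpha_decay_rate):
    """Hypothetical alpha schedule: decay from alpha_s towards alpha_e,
    with alpha_decay_rate controlling how fast the gap shrinks per step.
    This is a sketch of one plausible form, not the codebase's definition."""
    return alpha_e + (alpha_s - alpha_e) * (1.0 - alpha_decay_rate) ** step

# With the paper's setting, the ratio stays at 0.2 throughout training:
assert alpha_schedule(step=10_000, alpha_s=0.2, alpha_e=0.2, alpha_decay_rate=0.01) == 0.2
```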
Once your agent is training, navigate to the log folder specified by --logdir. It should look something like this:
point_umaze/
|- config.yaml # parsed configs
|- eval_episodes/ # folder of evaluation episodes
|- train_episodes/ # replay buffer/folder of training episodes
|- events.out.tfevents # tensorboard outputs (scalars, GIFs)
|- metrics.jsonl # metrics in json format for plotting
|- variables.pkl # most recent snapshot of weights
If you open up TensorBoard, you should see scalars for training and evaluation.
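If you prefer custom plots, `metrics.jsonl` is straightforward to parse. The snippet below is a minimal sketch; it simply prints whatever metric keys the run actually logged, and the example log directory is the one used above.

```python
import json
import pathlib

# Minimal sketch: load metrics.jsonl from a run's logdir for custom plotting.
logdir = pathlib.Path("zhao/logdir/pointumaze")
lines = (logdir / "metrics.jsonl").read_text().splitlines()
records = [json.loads(line) for line in lines if line.strip()]

print(f"loaded {len(records)} metric records")
print("available metric keys:", sorted(records[-1].keys()))
```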
MoReFree builds on the prior work PEG, and some environments are taken from IBC and EARL; we thank the authors for their contributions.

