Conference on Robot Learning (CoRL) 2025
Yuxin Chen*1, Chen Tang*2, Jianglan Wei1, Chenran Li1, Ran Tian1, Xiang Zhang1, Wei Zhan1, Peter Stone2,3, Masayoshi Tomizuka1
*Equal contribution
1University of California, Berkeley
2The University of Texas at Austin
3Sony AI
Aligning robot behavior with human preferences is crucial for deploying embodied AI agents in human-centered environments. A promising solution is interactive imitation learning from human intervention, where a human expert observes the policy's execution and provides interventions as feedback. However, existing methods often fail to utilize the prior policy efficiently to facilitate learning, thus hindering sample efficiency.
In this work, we introduce Maximum-Entropy Residual-Q Inverse Reinforcement Learning (MEReQ), designed for sample-efficient alignment from human intervention. Instead of inferring the complete human behavior characteristics, MEReQ infers a residual reward function that captures the discrepancy between the human expert's and the prior policy's underlying reward functions. Residual Q-Learning (RQL) is then employed to align the policy with human preferences using the inferred reward function. Extensive evaluations on simulated and real-world tasks show that MEReQ achieves sample-efficient alignment from human intervention compared to baselines.
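In other words (using our own schematic notation, not the paper's exact formulation), the residual reward is the gap between the reward underlying the human expert's behavior and the reward the prior policy was trained under:

$$
r_{\text{res}}(s, a) = r_{\text{H}}(s, a) - r_{\text{prior}}(s, a)
$$

MEReQ estimates only $r_{\text{res}}$ from intervention data, and RQL then aligns the prior policy with the combined reward $r_{\text{prior}} + r_{\text{res}}$.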
```bash
conda env create -f environment.yml
conda activate mereq

# Install customized Gymnasium
cd Gymnasium && pip install -e . && cd ..

# Install customized HighwayEnv
cd HighwayEnv && pip install -e . && cd ..

# Install customized Stable-Baselines3
cd stable-baselines3 && pip install -e . && cd ..
```

Train a prior policy using the provided training script:

```bash
python ./tests/test_train.py -a <ALGO> -e <ENV> -n <N_TIMESTEPS> -tr
```

Supported Environments:
| Environment Key | Environment ID |
|---|---|
| lunarLanderDiscrete | LunarLander-v2 |
| lunarLanderContinuous | LunarLanderContinuous-v2 |
| mountainCarContinuous | MountainCarContinuous-v0 |
| highwayBasic | highway-basic-v0 |
| highwayAddRightReward | highway-addRightReward-v0 |
| fanucPusherBasic | Fanuc-pusher-v0 |
| fanucPusherAddTableReward | Fanuc-pusher-addTableReward-v0 |
| fanucGribberBasic | Fanuc-gribber-v0 |
| fanucGribberAddTableReward | Fanuc-gribber-addTableReward-v0 |
| fanucEraserBasic | Fanuc-eraser-v0 |
| fanucEraserAddTableReward | Fanuc-eraser-addTableReward-v0 |
Supported Algorithms:
| Algorithm Key | Algorithm |
|---|---|
| dqn | DQN |
| dqn_me | DQN_ME |
| dqn_residual | ResidualSoftDQN |
| sac | SAC |
| sac_residual | ResidualSAC |
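For example, a SAC prior policy for the basic Highway environment could be trained with an invocation along these lines (the keys come from the tables above; the timestep count is only an illustrative value):

```bash
# Illustrative example: train SAC on highway-basic-v0 for an arbitrary 100k timesteps
python ./tests/test_train.py -a sac -e highwayBasic -n 100000 -tr
```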
Evaluate a trained model:
```bash
python ./tests/test_train.py -a <ALGO> -e <ENV> -te
```

Optional flags:

- `-g`: Save rendering as a GIF image
- `-r`: Sample random actions from the action space
- `-m <MODEL_PATH>`: Evaluate a custom model
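For instance, to evaluate the policy trained above and save a GIF of the rollout (an illustrative invocation using the keys from the tables above):

```bash
# Illustrative example: evaluate the sac/highwayBasic policy and save a GIF rendering
python ./tests/test_train.py -a sac -e highwayBasic -te -g
```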
See scripts/pretrain.sh for example commands.
Run MEReQ with simulated expert intervention in the Highway environment:

```bash
python -u MEReQ/highway_intvn.py -e RQL_intvn_plc_<EXP_NUM> 2>&1 | tee outputs/RQL_intvn_plc_<EXP_NUM>.txt
```

Optional flags:

- `-dm`: Use expert demonstrations (learning from demonstration)
- `-tf`: Use the total expert feature count (accumulated expert feature counts)
- `-rp`: Use a random policy (learning from a random policy with Q-learning)
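As a concrete illustration, the following runs the simulated-intervention experiment with accumulated expert feature counts, using `01` as an arbitrary experiment number:

```bash
# Illustrative example: <EXP_NUM> replaced by an arbitrary tag; -tf enables accumulated expert feature counts
python -u MEReQ/highway_intvn.py -tf -e RQL_intvn_plc_01 2>&1 | tee outputs/RQL_intvn_plc_01.txt
```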
See scripts/IRL_intvn.sh for example commands.
Run MEReQ with human-in-the-loop intervention:
```bash
python -u MEReQ/highway_human.py -e RQL_human_<EXP_NUM> 2>&1 | tee outputs/RQL_human_<EXP_NUM>.txt
```

Optional flags:

- `-tf`: Use the total human feature count (accumulated human feature counts)
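For example, a human-in-the-loop run with accumulated feature counts might look like this (the experiment tag `01` is arbitrary):

```bash
# Illustrative example: -tf enables accumulated human feature counts; 01 is an arbitrary experiment tag
python -u MEReQ/highway_human.py -tf -e RQL_human_01 2>&1 | tee outputs/RQL_human_01.txt
```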
We provide implementations of several baseline methods. Below are examples using the fanuc_gribber task:
| Method | Command |
|---|---|
| MEReQ | `python -u MEReQ/fanuc_gribber_intvn.py -eco -e fanuc_gribber/RQL_intvn_eco_<exp_num> 2>&1 \| tee outputs/fanuc_gribber/RQL_intvn_eco_<exp_num>.txt` |
| MEReQ-NP | `python -u MEReQ/fanuc_gribber_intvn.py -e fanuc_gribber/RQL_intvn_<exp_num> 2>&1 \| tee outputs/fanuc_gribber/RQL_intvn_<exp_num>.txt` |
| MaxEnt | `python -u MEReQ/fanuc_gribber_intvn.py -rp -e fanuc_gribber/QL_intvn_rd_<exp_num> 2>&1 \| tee outputs/fanuc_gribber/QL_intvn_rd_<exp_num>.txt` |
| MaxEnt-FT | `python -u MEReQ/fanuc_gribber_intvn.py -cl -rp -e fanuc_gribber/QL_intvn_cl_<exp_num> 2>&1 \| tee outputs/fanuc_gribber/QL_intvn_cl_<exp_num>.txt` |
| HG-DAgger-FT | `python -u MEReQ/fanuc_gribber_hgdagger.py -e fanuc_gribber/HGDAgger_ft_<exp_num> 2>&1 \| tee outputs/fanuc_gribber/HGDAgger_ft_<exp_num>.txt` |
| IWR-FT | `python -u MEReQ/fanuc_gribber_iwr.py -e fanuc_gribber/IWR_ft_<exp_num> 2>&1 \| tee outputs/fanuc_gribber/IWR_ft_<exp_num>.txt` |
If you find this work useful, please consider citing:
```bibtex
@inproceedings{chen2025mereq,
  author    = {Chen, Yuxin and Tang, Chen and Wei, Jianglan and Li, Chenran and Tian, Ran and Zhang, Xiang and Zhan, Wei and Stone, Peter and Tomizuka, Masayoshi},
  title     = {MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention},
  booktitle = {9th Annual Conference on Robot Learning (CoRL)},
  year      = {2025},
}
```

This codebase builds upon the following open-source projects:

- Gymnasium
- HighwayEnv
- Stable-Baselines3
This project is licensed under the MIT License - see the LICENSE file for details.
