
MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention


Conference on Robot Learning (CoRL) 2025

Yuxin Chen*¹, Chen Tang*², Jianglan Wei¹, Chenran Li¹, Ran Tian¹, Xiang Zhang¹, Wei Zhan¹, Peter Stone²·³, Masayoshi Tomizuka¹

*Equal contribution
¹University of California, Berkeley    ²The University of Texas at Austin    ³Sony AI

Figure: MEReQ overview.

📄 Abstract

Aligning robot behavior with human preferences is crucial for deploying embodied AI agents in human-centered environments. A promising solution is interactive imitation learning from human intervention, where a human expert observes the policy's execution and provides interventions as feedback. However, existing methods often fail to exploit the prior policy to facilitate learning, which hinders sample efficiency.

In this work, we introduce Maximum-Entropy Residual-Q Inverse Reinforcement Learning (MEReQ), designed for sample-efficient alignment from human intervention. Instead of inferring the complete characteristics of human behavior, MEReQ infers a residual reward function that captures the discrepancy between the human expert's underlying reward function and that of the prior policy. Residual Q-Learning (RQL) then aligns the policy with human preferences using this inferred residual reward. Extensive evaluations on simulated and real-world tasks show that MEReQ achieves sample-efficient alignment from human intervention compared to baselines.
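The idea can be illustrated on a toy tabular MDP. The sketch below is conceptual, not the repo's API: all names and the toy dynamics are made up, and for simplicity the combined soft-optimal policy is re-solved exactly at each step, whereas MEReQ instead learns a residual Q-function on top of the frozen prior. Only the residual reward weights are fitted, by matching the expert's feature counts (the max-ent IRL gradient).

import numpy as np

rng = np.random.default_rng(0)
nS, nA, nF = 5, 2, 3                            # toy sizes: states, actions, reward features
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] = next-state distribution
phi = rng.normal(size=(nS, nA, nF))             # reward features, r(s, a) = phi[s, a] @ w
gamma, alpha = 0.9, 1.0                         # discount and entropy temperature

def soft_q(r, iters=300):
    # Soft value iteration: V(s) = alpha * logsumexp(Q(s, .) / alpha)
    Q = np.zeros((nS, nA))
    for _ in range(iters):
        V = alpha * np.log(np.exp(Q / alpha).sum(axis=1))
        Q = r + gamma * (P @ V)
    return Q

def policy(Q):
    # Max-ent (Boltzmann) policy induced by a soft Q-function
    p = np.exp((Q - Q.max(axis=1, keepdims=True)) / alpha)
    return p / p.sum(axis=1, keepdims=True)

def feat_counts(pi, horizon=60):
    # Discounted expected feature counts under pi from a uniform start state
    d, f = np.full(nS, 1.0 / nS), np.zeros(nF)
    for t in range(horizon):
        sa = d[:, None] * pi                    # state-action occupancy at step t
        f += gamma**t * np.einsum('sa,saf->f', sa, phi)
        d = np.einsum('sa,san->n', sa, P)
    return f

w_prior = rng.normal(size=nF)                   # reward the prior policy was trained on
w_true = np.array([1.0, -0.5, 0.0])             # hidden residual the "expert" optimizes
mu_exp = feat_counts(policy(soft_q(phi @ (w_prior + w_true))))  # expert feature counts

w = np.zeros(nF)                                # inferred residual reward weights
for _ in range(300):
    pi = policy(soft_q(phi @ (w_prior + w)))    # policy under prior + current residual
    w += 0.05 * (mu_exp - feat_counts(pi))      # max-ent IRL gradient, residual only
print("inferred residual:", w.round(2))         # matches expert feature counts (not
                                                # necessarily w_true exactly)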

🛠️ Installation

1. Create Conda Environment

conda env create -f environment.yml
conda activate mereq

2. Install Local Dependencies

# Install customized Gymnasium
cd Gymnasium && pip install -e . && cd ..

# Install customized HighwayEnv
cd HighwayEnv && pip install -e . && cd ..

# Install customized Stable-Baselines3
cd stable-baselines3 && pip install -e . && cd ..
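A quick sanity check after installation may be useful (a sketch; it assumes the editable installs shadow any PyPI versions, and uses an environment ID from the table under Usage):

# Verify the editable installs are picked up
python -c "import gymnasium, stable_baselines3; print(gymnasium.__version__, stable_baselines3.__version__)"

# Smoke-test an environment from the supported-environments table
python -c "import gymnasium as gym; env = gym.make('MountainCarContinuous-v0'); obs, info = env.reset(seed=0); print(obs.shape, env.action_space); env.close()"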

🚀 Usage

🎯 Pretraining

Train a prior policy using the provided training script:

python ./tests/test_train.py -a <ALGO> -e <ENV> -n <N_TIMESTEPS> -tr

Supported Environments:

Environment Key              Environment ID
---------------------------  --------------------------------
lunarLanderDiscrete          LunarLander-v2
lunarLanderContinuous        LunarLanderContinuous-v2
mountainCarContinuous        MountainCarContinuous-v0
highwayBasic                 highway-basic-v0
highwayAddRightReward        highway-addRightReward-v0
fanucPusherBasic             Fanuc-pusher-v0
fanucPusherAddTableReward    Fanuc-pusher-addTableReward-v0
fanucGribberBasic            Fanuc-gribber-v0
fanucGribberAddTableReward   Fanuc-gribber-addTableReward-v0
fanucEraserBasic             Fanuc-eraser-v0
fanucEraserAddTableReward    Fanuc-eraser-addTableReward-v0

Supported Algorithms:

Algorithm Key   Algorithm
--------------  ---------------
dqn             DQN
dqn_me          DQN_ME
dqn_residual    ResidualSoftDQN
sac             SAC
sac_residual    ResidualSAC
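
For example, to pretrain a maximum-entropy soft-DQN prior on the basic highway task (the step count here is only illustrative):

python ./tests/test_train.py -a dqn_me -e highwayBasic -n 100000 -tr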

📊 Evaluation

Evaluate a trained model:

python ./tests/test_train.py -a <ALGO> -e <ENV> -te

Optional flags:

  • -g: Save the rendering as a GIF
  • -r: Sample random actions from the action space
  • -m <MODEL_PATH>: Evaluate a custom model
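
For example, to evaluate a pretrained SAC policy on the continuous mountain-car task and save a GIF of the rollout (<MODEL_PATH> remains a placeholder, as above):

python ./tests/test_train.py -a sac -e mountainCarContinuous -te -g -m <MODEL_PATH>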

See scripts/pretrain.sh for example commands.


🤖 MEReQ Training

🎮 Policy Intervention (Simulated Expert)

Run MEReQ with simulated expert intervention on the Highway environment:

python -u MEReQ/highway_intvn.py -e RQL_intvn_plc_<EXP_NUM> 2>&1 | tee outputs/RQL_intvn_plc_<EXP_NUM>.txt
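The tee target assumes the outputs/ directory exists; create it once beforehand:

mkdir -p outputs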

Optional flags:

  • -dm: Use expert demonstrations (learning from demonstration)
  • -tf: Use the total (accumulated) expert feature counts
  • -rp: Use a random prior policy (learning from a random policy with Q-learning)

See scripts/IRL_intvn.sh for example commands.

👤 Human Intervention

Run MEReQ with human-in-the-loop intervention:

python -u MEReQ/highway_human.py -e RQL_human_<EXP_NUM> 2>&1 | tee outputs/RQL_human_<EXP_NUM>.txt

Optional flags:

  • -tf: Use the total (accumulated) human feature counts

📈 Baselines

We provide implementations of several baseline methods. Below are examples using the fanuc_gribber task:

MEReQ
  python -u MEReQ/fanuc_gribber_intvn.py -eco -e fanuc_gribber/RQL_intvn_eco_<exp_num> 2>&1 | tee outputs/fanuc_gribber/RQL_intvn_eco_<exp_num>.txt

MEReQ-NP
  python -u MEReQ/fanuc_gribber_intvn.py -e fanuc_gribber/RQL_intvn_<exp_num> 2>&1 | tee outputs/fanuc_gribber/RQL_intvn_<exp_num>.txt

MaxEnt
  python -u MEReQ/fanuc_gribber_intvn.py -rp -e fanuc_gribber/QL_intvn_rd_<exp_num> 2>&1 | tee outputs/fanuc_gribber/QL_intvn_rd_<exp_num>.txt

MaxEnt-FT
  python -u MEReQ/fanuc_gribber_intvn.py -cl -rp -e fanuc_gribber/QL_intvn_cl_<exp_num> 2>&1 | tee outputs/fanuc_gribber/QL_intvn_cl_<exp_num>.txt

HG-DAgger-FT
  python -u MEReQ/fanuc_gribber_hgdagger.py -e fanuc_gribber/HGDAgger_ft_<exp_num> 2>&1 | tee outputs/fanuc_gribber/HGDAgger_ft_<exp_num>.txt

IWR-FT
  python -u MEReQ/fanuc_gribber_iwr.py -e fanuc_gribber/IWR_ft_<exp_num> 2>&1 | tee outputs/fanuc_gribber/IWR_ft_<exp_num>.txt
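
To run several trials of one method, the experiment tag can be looped over in the shell; a minimal sketch (the trial indices here are arbitrary):

# Run three MEReQ trials on the fanuc_gribber task
mkdir -p outputs/fanuc_gribber
for i in 1 2 3; do
    python -u MEReQ/fanuc_gribber_intvn.py -eco -e fanuc_gribber/RQL_intvn_eco_$i 2>&1 | tee outputs/fanuc_gribber/RQL_intvn_eco_$i.txt
done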

📝 Citation

If you find this work useful, please consider citing:

@inproceedings{chen2025mereq,
  author    = {Chen, Yuxin and Tang, Chen and Wei, Jianglan and Li, Chenran and Tian, Ran and Zhang, Xiang and Zhan, Wei and Stone, Peter and Tomizuka, Masayoshi},
  title     = {MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention},
  booktitle = {9th Annual Conference on Robot Learning (CoRL)},
  year      = {2025},
}

🙏 Acknowledgements

This codebase builds upon the following open-source projects, customized versions of which are vendored in this repository: Gymnasium, HighwayEnv, and Stable-Baselines3.

⚖️ License

This project is licensed under the MIT License - see the LICENSE file for details.
