Conference on Robot Learning (CoRL) 2025
Yuxin Chen*1, Chen Tang*2, Jianglan Wei1, Chenran Li1, Ran Tian1, Xiang Zhang1, Wei Zhan1, Peter Stone2,3, Masayoshi Tomizuka1
*Equal contribution
1University of California, Berkeley
2The University of Texas at Austin
3Sony AI
Aligning robot behavior with human preferences is crucial for deploying embodied AI agents in human-centered environments. A promising solution is interactive imitation learning from human intervention, where a human expert observes the policy's execution and provides interventions as feedback. However, existing methods often fail to utilize the prior policy efficiently to facilitate learning, thus hindering sample efficiency.
In this work, we introduce Maximum-Entropy Residual-Q Inverse Reinforcement Learning (MEReQ), designed for sample-efficient alignment from human intervention. Instead of inferring the complete human behavior characteristics, MEReQ infers a residual reward function that captures the discrepancy between the human expert's and the prior policy's underlying reward functions. Residual Q-Learning (RQL) is then employed to align the policy with human preferences using the inferred reward function. Extensive evaluations on simulated and real-world tasks show that MEReQ achieves sample-efficient alignment from human intervention compared to baselines.
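In other words (using our own schematic notation, not the paper's exact formulation), the residual reward is the gap between the reward underlying the human expert's behavior and the reward the prior policy was trained under:

$$
r_{\text{res}}(s, a) = r_{\text{H}}(s, a) - r_{\text{prior}}(s, a)
$$

MEReQ estimates only $r_{\text{res}}$ from intervention data, and RQL then aligns the prior policy with the combined reward $r_{\text{prior}} + r_{\text{res}}$.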
```bash
conda env create -f environment.yml
conda activate mereq

# Install customized Gymnasium
cd Gymnasium && pip install -e . && cd ..

# Install customized HighwayEnv
cd HighwayEnv && pip install -e . && cd ..

# Install customized Stable-Baselines3
cd stable-baselines3 && pip install -e . && cd ..
```

Train a prior policy using the provided training script:

```bash
python ./tests/test_train.py -a <ALGO> -e <ENV> -n <N_TIMESTEPS> -tr
```

Supported Environments:
| Environment Key | Environment ID |
|---|---|
| lunarLanderDiscrete | LunarLander-v2 |
| lunarLanderContinuous | LunarLanderContinuous-v2 |
| mountainCarContinuous | MountainCarContinuous-v0 |
| highwayBasic | highway-basic-v0 |
| highwayAddRightReward | highway-addRightReward-v0 |
| fanucPusherBasic | Fanuc-pusher-v0 |
| fanucPusherAddTableReward | Fanuc-pusher-addTableReward-v0 |
| fanucGribberBasic | Fanuc-gribber-v0 |
| fanucGribberAddTableReward | Fanuc-gribber-addTableReward-v0 |
| fanucEraserBasic | Fanuc-eraser-v0 |
| fanucEraserAddTableReward | Fanuc-eraser-addTableReward-v0 |
Supported Algorithms:
| Algorithm Key | Algorithm |
|---|---|
| dqn | DQN |
| dqn_me | DQN_ME |
| dqn_residual | ResidualSoftDQN |
| sac | SAC |
| sac_residual | ResidualSAC |
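For example, a SAC prior policy for the basic Highway environment could be trained with an invocation along these lines (the keys come from the tables above; the timestep count is only an illustrative value):

```bash
# Illustrative example: train SAC on highway-basic-v0 for an arbitrary 100k timesteps
python ./tests/test_train.py -a sac -e highwayBasic -n 100000 -tr
```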
Evaluate a trained model:
```bash
python ./tests/test_train.py -a <ALGO> -e <ENV> -te
```

Optional flags:

- `-g`: Save rendering as a GIF image
- `-r`: Sample random actions from the action space
- `-m <MODEL_PATH>`: Evaluate a custom model
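For instance, to evaluate the policy trained above and save a GIF of the rollout (an illustrative invocation using the keys from the tables above):

```bash
# Illustrative example: evaluate the sac/highwayBasic policy and save a GIF rendering
python ./tests/test_train.py -a sac -e highwayBasic -te -g
```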
See scripts/pretrain.sh for example commands.
Run MEReQ with simulated expert intervention in the Highway environment:

```bash
python -u MEReQ/highway_intvn.py -e RQL_intvn_plc_<EXP_NUM> 2>&1 | tee outputs/RQL_intvn_plc_<EXP_NUM>.txt
```

Optional flags:

- `-dm`: Use expert demonstrations (learning from demonstration)
- `-tf`: Use the total expert feature count (accumulated expert feature counts)
- `-rp`: Use a random policy (learning from a random policy with Q-learning)
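As a concrete illustration, the following runs the simulated-intervention experiment with accumulated expert feature counts, using `01` as an arbitrary experiment number:

```bash
# Illustrative example: <EXP_NUM> replaced by an arbitrary tag; -tf enables accumulated expert feature counts
python -u MEReQ/highway_intvn.py -tf -e RQL_intvn_plc_01 2>&1 | tee outputs/RQL_intvn_plc_01.txt
```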
See scripts/IRL_intvn.sh for example commands.
Run MEReQ with human-in-the-loop intervention:
```bash
python -u MEReQ/highway_human.py -e RQL_human_<EXP_NUM> 2>&1 | tee outputs/RQL_human_<EXP_NUM>.txt
```

Optional flags:

- `-tf`: Use the total human feature count (accumulated human feature counts)
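For example, a human-in-the-loop run with accumulated feature counts might look like this (the experiment tag `01` is arbitrary):

```bash
# Illustrative example: -tf enables accumulated human feature counts; 01 is an arbitrary experiment tag
python -u MEReQ/highway_human.py -tf -e RQL_human_01 2>&1 | tee outputs/RQL_human_01.txt
```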
We provide implementations of several baseline methods. Below are examples using the fanuc_gribber task:
| Method | Command |
|---|---|
| MEReQ | `python -u MEReQ/fanuc_gribber_intvn.py -eco -e fanuc_gribber/RQL_intvn_eco_<exp_num> 2>&1 \| tee outputs/fanuc_gribber/RQL_intvn_eco_<exp_num>.txt` |
| MEReQ-NP | `python -u MEReQ/fanuc_gribber_intvn.py -e fanuc_gribber/RQL_intvn_<exp_num> 2>&1 \| tee outputs/fanuc_gribber/RQL_intvn_<exp_num>.txt` |
| MaxEnt | `python -u MEReQ/fanuc_gribber_intvn.py -rp -e fanuc_gribber/QL_intvn_rd_<exp_num> 2>&1 \| tee outputs/fanuc_gribber/QL_intvn_rd_<exp_num>.txt` |
| MaxEnt-FT | `python -u MEReQ/fanuc_gribber_intvn.py -cl -rp -e fanuc_gribber/QL_intvn_cl_<exp_num> 2>&1 \| tee outputs/fanuc_gribber/QL_intvn_cl_<exp_num>.txt` |
| HG-DAgger-FT | `python -u MEReQ/fanuc_gribber_hgdagger.py -e fanuc_gribber/HGDAgger_ft_<exp_num> 2>&1 \| tee outputs/fanuc_gribber/HGDAgger_ft_<exp_num>.txt` |
| IWR-FT | `python -u MEReQ/fanuc_gribber_iwr.py -e fanuc_gribber/IWR_ft_<exp_num> 2>&1 \| tee outputs/fanuc_gribber/IWR_ft_<exp_num>.txt` |
If you find this work useful, please consider citing:
```bibtex
@inproceedings{chen2025mereq,
  author    = {Chen, Yuxin and Tang, Chen and Wei, Jianglan and Li, Chenran and Tian, Ran and Zhang, Xiang and Zhan, Wei and Stone, Peter and Tomizuka, Masayoshi},
  title     = {MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention},
  booktitle = {9th Annual Conference on Robot Learning (CoRL)},
  year      = {2025},
}
```

This codebase builds upon the following open-source projects:

- Gymnasium
- HighwayEnv
- Stable-Baselines3
This project is licensed under the MIT License - see the LICENSE file for details.
