A novel policy optimization framework for diffusion large language models that leverages their unique "inpainting" ability to guide exploration and improve RL training efficiency and model performance.
```bash
conda env create -f env.yml
conda activate igpo
```

Next, download the MetaMathQA dataset from Hugging Face.
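One way to fetch it is with the Hugging Face CLI (shipped with `huggingface_hub`); note that the repo id `meta-math/MetaMathQA` below is an assumption about where the dataset is hosted, so adjust it if needed:

```bash
# Sketch: download the dataset with the Hugging Face CLI.
# The repo id is an assumption; change it if the dataset lives elsewhere.
huggingface-cli download meta-math/MetaMathQA \
    --repo-type dataset \
    --local-dir igpo/MetaMathQA
```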
After downloading, the structure should be:
```
igpo/MetaMathQA/
├── MetaMathQA-395K.json
└── README.md
```
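As a quick sanity check that the download completed, you can count the records; this assumes the file is a top-level JSON array of examples, which is an assumption about its layout:

```bash
# Sanity check: count records in the downloaded file.
# Assumes a top-level JSON array; adjust if the layout differs.
python -c "import json; print(len(json.load(open('igpo/MetaMathQA/MetaMathQA-395K.json'))))"
```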
To run IGPO:

```bash
sbatch run_igpo.slurm
```

Note: set your own wandb API key in the Slurm scripts before submitting.
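The key is typically passed through wandb's standard environment variable; a line like the following is a sketch of what to edit, though the exact line in `run_igpo.slurm` and `run_grpo.slurm` may differ:

```bash
# In run_igpo.slurm / run_grpo.slurm: export your key before the training
# command. WANDB_API_KEY is wandb's standard variable; the exact placement
# in the shipped scripts is an assumption, so check them for the actual line.
export WANDB_API_KEY="your_api_key_here"
```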
To run GRPO:

```bash
sbatch run_grpo.slurm
```

This code is built on the D1 codebase.
If you find IGPO useful in your research, please consider citing:
```bibtex
@article{zhao2025inpainting,
  title={Inpainting-Guided Policy Optimization for Diffusion Large Language Models},
  author={Zhao, Siyan and Liu, Mengchen and Huang, Jing and Liu, Miao and Wang, Chenyu and Liu, Bo and Tian, Yuandong and Pang, Guan and Bell, Sean and Grover, Aditya and others},
  journal={arXiv preprint arXiv:2509.10396},
  year={2025}
}
```
IGPO is MIT licensed, as found in the LICENSE file.
