We propose RoboVIP, a multi-view, inpainting-based video diffusion model conditioned on identity references, which augments robot manipulation data in both simulation and real-world robot setups.
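To make the conditioning interface concrete, below is a minimal, hypothetical sketch in plain PyTorch (not the RoboVIP release code) of how inpainting-style conditioning for a multi-view clip is commonly assembled: the masked video and its masks are concatenated along the channel axis, while an identity reference image is kept as a separate conditioning input. All tensor names, shapes, and the resolution here are illustrative assumptions.

```python
import torch

# Illustrative shapes (assumptions, not RoboVIP's actual configuration):
# V camera views, T frames, 3-channel RGB at 256x256 resolution.
V, T, C, H, W = 2, 16, 3, 256, 256

video = torch.rand(V, T, C, H, W)                  # multi-view manipulation clip
mask = (torch.rand(V, T, 1, H, W) > 0.5).float()   # 1 = region to be regenerated
identity_ref = torch.rand(1, C, H, W)              # reference image providing the visual identity

# Common inpainting-style conditioning: hide the masked region, then concatenate
# the masked video with the mask along the channel axis so a denoiser sees both
# the preserved context and where new content should be generated.
masked_video = video * (1.0 - mask)
inpaint_cond = torch.cat([masked_video, mask], dim=2)  # (V, T, C + 1, H, W)

print(inpaint_cond.shape)   # torch.Size([2, 16, 4, 256, 256])
print(identity_ref.shape)   # passed separately, e.g. through an image encoder
```

Channel-wise concatenation of masked frames and masks is the standard recipe in inpainting diffusion models; see the paper for how RoboVIP actually injects the multi-view and identity-reference conditions.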
Update | Installation | Inference Augmentation | Dataset Preprocessing | Train
- Release the paper
- Release the Video Diffusion Model weights and Inference Code
- Release a less GPU-memory-intensive (<80 GB) version of Bridge RLDS
- Release the dataset preprocessing code
- Release the training code for the Video Diffusion Model
- Release the simulation testing code
- Release the training code for simulation
If you like RoboVIP, please give this repo a star. Thanks!
Under Review. Code Will Be Released Soon.
If you make use of our work, please cite our paper.
@article{wang2026robovip,
  title={RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation},
  author={Wang, Boyang and Zhang, Haoran and Zhang, Shujie and Hao, Jinkun and Jia, Mingda and Lv, Qi and Mao, Yucheng and Lyu, Zhaoyang and Zeng, Jia and Xu, Xudong and others},
  journal={arXiv preprint arXiv:2601.05241},
  year={2026}
}

RoboVIP is built on diffusers and RoboEngine. We thank the authors for sharing their excellent codebases.

