π§ MinD: Learning A Dual-System World Model for Real-Time Planning and Action Consistency Video Generation
Xiaowei Chi1,2*, Kuangzhi Ge3*, Jiaming Liu3β , Siyuan Zhou2, Peidong Jia3, Zichen He3, Kevin Zhang3, Rui Zhao1, Yuzhen Liu1, Tingguang Li1, Sirui Han2, Shanghang Zhang3β, Yike Guo2β
1Tencent RoboticsX, 2Hong Kong University of Science and Technology,
3Peking University
MinD is a dual-system world model for robotics that unifies video imagination and action generation. It enables real-time planning, implicit risk analysis, and explainable control. By combining a low-frequency visual diffusion model and a high-frequency action policy, MinD supports fast, safe, and semantically grounded decision-making for embodied agents.
-
Dual Diffusion System:
Combines a slow video generator (LoDiff-Visual) with a fast action generator (HiDiff-Policy) for planning and control. -
Real-Time Inference:
Single-step prediction enables inference up to 11.3 FPS, suitable for real-world robot execution. -
Implicit Risk Analysis:
Predicts task failures ahead of time by analyzing intermediate latent features from the video model. -
Multimodal & Modular:
Compatible with various vision, language, and action model backbones. Easy to integrate and extend.
βββ remote_infer.py # Inference server entry point
βββ vla/ # Vision-Language-Action modules
βββ action_model/ # HiDiff-Policy: diffusion-based action generator
βββ video_model/ # LoDiff-Visual: latent video prediction model
βββ matcher/ # DiffMatcher: aligns video and action features
βββ checkpoints/ # Pretrained model weights
βββ predicted_videos/ # Generated future frames (optional)
βββ scripts/ # Evaluation and visualization scripts
βββ requirements.txt # Python dependenciesWe are actively iterating on the codebase. Some paths, formats, and module APIs may change in the near future. Here's what's in progress:
- Refactoring module paths and configs for better modularity
- Adding support for more VLM backbones
- Exposing training interface for LoDiff / HiDiff fine-tuning
- Improving documentation and demo scripts
- Open-sourcing the training pipeline (ETA: TBD)
π We would like to thank CogACT and OpenVLA projects for inspiring the architecture and implementation of MinD.
If you encounter any issues, please open an issue β we will respond and fix them as soon as possible!
We welcome contributions! You can:
- Submit issues for bugs or feature requests
- Open pull requests with improvements or new modules
- Help with documentation or testing
- Python β₯ 3.8
- PyTorch β₯ 2.0
- CUDA Toolkit β₯ 12.1
- Transformers
- Pillow, NumPy
- OpenCLIP & RLBench (for simulation)
Install dependencies:
pip install -r requirements.txt- Python >= 3.8
- PyTorch >= 2.0
- CUDA Toolkit >= 12.1
- Clone the repository & install dependencies:
git clone https://github.com/manipulate-in-dream/MinD.git
cd MinD
pip install -r requirements.txt- Set up environment variables:
cp .env.example .env
# Edit .env file with your paths- Download pretrained weights:
mkdir -p checkpoints/vgm
# Download VGM-VLA checkpoint and place in checkpoints/vgm/# Set environment variables
export MIND_FULL_CHECKPOINT=/path/to/vgm_checkpoint.pt
export DATASET_STATISTICS_JSON=/path/to/stats.json
# Run inference
python vla/vgmactvla.py --input_image_path /path/to/image.png
# Or run the inference server
python remote_infer.pyVGM-VLA (MinD) achieves state-of-the-art performance with superior accuracy and real-time inference:
- Mean Success Rate: 63.0% (VGM-VLA)
- Inference Speed: Up to 11.3 FPS
- Failure Prediction Accuracy: 74%
| Task | VGM-VLA (MinD) | VPP-VLA | RoboDreamer | OpenVLA |
|---|---|---|---|---|
| Close Laptop Lid | 68% | 52% | 76% | 45% |
| Sweep to Dustpan | 96% | 72% | 76% | 58% |
| Mean Accuracy | 63.0% | 48.5% | 50.3% | 42.1% |
VGM-VLA demonstrates robust real-world performance, significantly outperforming baselines including VPP:
| Task | VGM-VLA (Wrist) | VGM-VLA (Front) | VPP-VLA | OpenVLA |
|---|---|---|---|---|
| Pick & Place | 75% | 60% | 50% | 40% |
| Unplug Charger | 65% | 50% | 40% | 25% |
| Wipe Whiteboard | 65% | 85% | 55% | 30% |
| Average | 72.5% | 68.75% | 48.3% | 37.5% |
VGM-VLA achieves 50% relative improvement over VPP-VLA and 93% over OpenVLA in real-world tasks.
- LoDiff predicts future frames as latent features.
- DiffMatcher aligns these with HiDiffβs action space.
- Latent PCA analysis shows clear separation between successful and failed task predictions.
- Enables early-stage failure detection without extra supervision.
- Train VGM-VLA:
scripts/train_vgmvla.py - RLBench evaluation:
scripts/eval_rlbench.py - Real-world testing:
scripts/eval_realworld.py - Latent feature analysis:
scripts/pca_analysis.py
# Train on Franka with OXE dataset
bash scripts/train_vdmvla_franka_oxe.sh
# Train on specific tasks
bash scripts/train_vdmvla_franka_whiteboard.shThe core MinD model combines:
| Component | Description |
|---|---|
| VGM-Visual | Vision-guided latent diffusion for future prediction |
| VGM-Policy | High-frequency action generation module |
| VGM-Matcher | Cross-modal alignment between vision and action |
| Risk Module | Implicit failure detection via latent analysis |
Key advantages over baselines:
- Real-time inference: 11.3 FPS vs VPP's 3.2 FPS
- Higher accuracy: 63% vs VPP's 48.5% on RLBench
- Better generalization: Superior zero-shot transfer
For questions or collaborations, please open an issue or contact the project maintainers.
If you find this project helpful, please cite:
@misc{chi2025mindlearningdualsystemworld,
title={MinD: Learning A Dual-System World Model for Real-Time Planning and Implicit Risk Analysis},
author={Xiaowei Chi and Kuangzhi Ge and Jiaming Liu and Siyuan Zhou and Peidong Jia and Zichen He and Rui Zhao and Yuzhen Liu and Tingguang Li and Lei Han and Sirui Han and Shanghang Zhang and Yike Guo},
year={2025},
eprint={2506.18897},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2506.18897},
}- π Project Website
- π₯ Demo Videos: see
/predicted_videos/ - π¦ Full code and checkpoints will be released soon.