Junke Wang1 Qihang Zhang2 Shuai Yang2 Yiming Luo2 Yujun Shen2 Zuxuan Wu1,* Yu-Gang Jiang1,* Yinghao Xu2,3,*
1Institute of Trustworthy Embodied AI, Fudan University
2Robbyant, Ant Group
3Hong Kong University of Science and Technology
*Corresponding authors
RepWAM is a representation-centric world action model, built around semantic visual-action tokenization. It first learns a visual-action tokenizer, and then trains a causal WAM to model future states and actions under language instructions, enabling effective transfer from world modeling to robot control.
- ⛰️ RepWAM is the first representation-centric WAM trained on semantic visual and action latent tokens.
- ⚡️ Our representation visual-action tokenizer, RepViTok, not only achieves superior visual reconstruction quality, but also transfers well to robot actions.
- 🔥 Leading performance on real robot tasks and RoboTwin 2.0, showing clear gains over WAN2.2 VAE.
- [2026/06/12] Paper release.
- [2026/06] Inference code release.
- [2026/06] Code and model weights release.
RepViTok first trains a video tokenizer with both pixel reconstruction and semantic alignment supervision. On top of the visual latent space, it then learns latent actions as compact transitions between visual states.
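As a rough illustration of training with both pixel reconstruction and semantic alignment supervision, the sketch below combines the two terms into a single objective. The specific loss forms (mean squared error and cosine distance to a frozen teacher feature), the function names, and `align_weight` are assumptions for illustration only, not details taken from the paper.

```python
import math

def mse(pred, target):
    """Mean squared error between two equal-length flat lists of pixel values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def cosine_distance(a, b, eps=1e-12):
    """1 - cosine similarity between two feature vectors (eps avoids division by zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / max(norm, eps)

def tokenizer_loss(recon, frame, latent_feat, teacher_feat, align_weight=0.5):
    """Assumed objective: pixel reconstruction plus weighted semantic alignment.

    recon / frame: decoded and ground-truth pixels (flattened).
    latent_feat / teacher_feat: tokenizer feature and a frozen semantic
    encoder's feature for the same frame (hypothetical inputs).
    align_weight: hypothetical trade-off coefficient.
    """
    return mse(recon, frame) + align_weight * cosine_distance(latent_feat, teacher_feat)
```

With a perfect reconstruction and perfectly aligned features, both terms vanish; raising `align_weight` would push the latent space toward the semantic teacher at the cost of pixel fidelity.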
With the paired visual-action latents, RepWAM trains a causal diffusion transformer over visual-action chunks, jointly modeling future visual states and the latent actions that connect them under language conditioning.
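One common way to make a transformer causal over chunks is a block-causal attention mask: tokens attend freely within their own chunk and to all earlier chunks, but never to later ones. The sketch below builds such a mask. Whether RepWAM uses exactly this layout (for example, bidirectional attention inside a chunk) is an assumption, and `block_causal_mask` is a hypothetical helper.

```python
def block_causal_mask(num_chunks, tokens_per_chunk):
    """Return mask[i][j] == True when token i may attend to token j.

    Tokens are laid out chunk by chunk; each chunk is assumed to hold the
    visual and action tokens of one time window.
    """
    n = num_chunks * tokens_per_chunk
    chunk_of = [i // tokens_per_chunk for i in range(n)]
    # A query token sees every key token whose chunk index is not in the future.
    return [[chunk_of[j] <= chunk_of[i] for j in range(n)] for i in range(n)]

mask = block_causal_mask(num_chunks=3, tokens_per_chunk=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

For three chunks of two tokens each, the printed pattern is a staircase: the first two rows allow only the first chunk, the next two allow the first two chunks, and the last two allow everything.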
We evaluate our model on a Franka dual-arm robot platform across three manipulation tasks, on which RepWAM consistently surpasses existing vision-language-action models and WAMs.
Real-world rollouts
| Task | Description |
|---|---|
| Pick Fruit | Pick fruits into a plate. |
| Push Drawer | Push a drawer and place a block inside. |
| Insert Tube | Insert a test tube into a rack. |
Trained from scratch without WAN initialization, RepWAM-5B reaches competitive results on RoboTwin 2.0: 89.3 on Easy and 88.4 on Hard over the 50-task suite.
| Model | Backbone pretrained | Hor-2 Easy | Hor-2 Hard | Hor-3 Easy | Hor-3 Hard | 50-task Easy | 50-task Hard |
|---|---|---|---|---|---|---|---|
| pi0.5 | Yes | 79.3 | 73.0 | 78.6 | 67.4 | 82.7 | 76.8 |
| Motus | Yes | 85.2 | 80.9 | 85.0 | 84.2 | 88.7 | 87.0 |
| Lingbot-VA | Yes | 85.3 | 86.9 | 89.6 | 90.6 | 92.9 | 91.6 |
| RepWAM-1.3B | No | 85.7 | 84.0 | 92.0 | 85.4 | 86.6 | 83.1 |
| RepWAM-5B | No | 87.4 | 87.6 | 88.0 | 90.4 | 89.3 | 88.4 |
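To read the 50-task columns more easily, the short script below computes the gap between RepWAM-5B and every other row. The numbers are copied from the table above; `RESULTS` and `gaps_to` are hypothetical names used only for this illustration.

```python
# 50-task success rates (Easy, Hard), copied from the table above.
RESULTS = {
    "pi0.5": (82.7, 76.8),
    "Motus": (88.7, 87.0),
    "Lingbot-VA": (92.9, 91.6),
    "RepWAM-1.3B": (86.6, 83.1),
    "RepWAM-5B": (89.3, 88.4),
}

def gaps_to(reference, results=RESULTS):
    """Return {model: (easy_gap, hard_gap)} of `reference` minus each other model."""
    ref_easy, ref_hard = results[reference]
    return {
        name: (round(ref_easy - easy, 1), round(ref_hard - hard, 1))
        for name, (easy, hard) in results.items()
        if name != reference
    }

for name, (d_easy, d_hard) in gaps_to("RepWAM-5B").items():
    print(f"RepWAM-5B vs {name}: {d_easy:+.1f} Easy, {d_hard:+.1f} Hard")
```

On these figures, RepWAM-5B is ahead of pi0.5 and Motus and behind Lingbot-VA on the 50-task suite, while being the only one of the four trained without a pretrained backbone.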
Replacing WAN2.2 VAE with RepViTok improves the average success rate by 8.6 points on Easy and 7.1 points on Hard, supporting the importance of semantic visual-action tokenization for world action modeling.
Open-loop video generation results from RepWAM.
Compared with prior latent action models, RepViTok better captures manipulation-relevant changes, thus leading to lower action loss.
If you find RepWAM helpful, please consider starring 🌟 our repo and citing the paper.
@article{wang2026repwam,
  title   = {RepWAM: World Action Modeling with Representation Visual-Action Tokenizers},
  author  = {Wang, Junke and Zhang, Qihang and Yang, Shuai and Luo, Yiming and Shen, Yujun and Wu, Zuxuan and Jiang, Yu-Gang and Xu, Yinghao},
  journal = {arXiv preprint arXiv:2606.13674},
  year    = {2026}
}




