wdrink/RepWAM

RepWAM: World Action Modeling with Representation Visual-Action Tokenizers

Junke Wang1    Qihang Zhang2    Shuai Yang2    Yiming Luo2    Yujun Shen2    Zuxuan Wu1,*    Yu-Gang Jiang1,*    Yinghao Xu2,3,*

1Institute of Trustworthy Embodied AI, Fudan University    2Robbyant, Ant Group
3Hong Kong University of Science and Technology
*Corresponding authors

Introduction

RepWAM is a representation-centric world action model (WAM) built around semantic visual-action tokenization. It first learns a visual-action tokenizer, then trains a causal WAM to model future states and actions under language instructions, enabling effective transfer from world modeling to robot control.

Highlights

  • ⛰️ RepWAM is the first representation-centric WAM trained on semantic visual and action latent tokens.
  • ⚡️ Our representation visual-action tokenizer, RepViTok, not only achieves superior visual reconstruction quality, but also transfers well to robot actions.
  • 🔥 Leading performance on real robot tasks and competitive results on RoboTwin 2.0 without a pretrained backbone, with clear gains over the WAN2.2 VAE.

Open-source Plan

  • [2026/06/12] Paper release.
  • [2026/06] Inference code release.
  • [2026/06] Code and model weights release.

Method

Representation Visual-Action Tokenizer

RepViTok first trains a video tokenizer with both pixel reconstruction and semantic alignment supervision. On top of the visual latent space, it then learns latent actions as compact transitions between visual states.

RepViTok tokenizer
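
The two training stages described above can be illustrated with a toy sketch. This is not the released RepViTok code: the MSE reconstruction term, the cosine alignment term, the `align_weight` value, and the linear `proj` matrix standing in for a learned latent-action encoder are all assumptions chosen for illustration.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cosine_distance(u, v):
    """1 - cosine similarity (assumes both vectors are non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def tokenizer_loss(recon, pixels, latent_feat, teacher_feat, align_weight=0.5):
    """Stage 1 (assumed form): pixel reconstruction + semantic alignment.

    teacher_feat stands for features from some frozen semantic encoder;
    align_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    return mse(recon, pixels) + align_weight * cosine_distance(latent_feat, teacher_feat)


def latent_action(z_t, z_next, proj):
    """Stage 2 (toy stand-in): compress the change between two visual latents.

    proj is a hypothetical (action_dim x latent_dim) matrix used here in
    place of a learned latent-action encoder.
    """
    delta = [b - a for a, b in zip(z_t, z_next)]
    return [sum(w * d for w, d in zip(row, delta)) for row in proj]
```

The point of the sketch is the ordering: the visual latent space is shaped first by both loss terms, and the latent action is then defined on top of it as a low-dimensional summary of how one visual state changes into the next.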

World Action Model

With the paired visual-action latents, RepWAM trains a causal diffusion transformer over visual-action chunks, jointly modeling future visual states and the latent actions that connect them under language conditioning.

RepWAM world action model
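
One common way to make a transformer causal over chunks is a block-causal attention mask: tokens inside a chunk see each other, and every chunk sees all earlier chunks but no later ones. The sketch below shows that pattern only. Whether RepWAM uses exactly this mask, and how visual and action tokens are laid out inside a chunk, are assumptions; `block_causal_mask` and its parameters are hypothetical names.

```python
def block_causal_mask(num_chunks, tokens_per_chunk):
    """Return mask[i][j] == True when query token i may attend to key token j.

    Tokens are assumed to be ordered chunk by chunk, with each chunk holding
    tokens_per_chunk visual and action tokens.
    """
    n = num_chunks * tokens_per_chunk
    return [
        [j // tokens_per_chunk <= i // tokens_per_chunk for j in range(n)]
        for i in range(n)
    ]


# Two chunks of two tokens each: the first chunk cannot see the second.
for row in block_causal_mask(2, 2):
    print(["x" if allowed else "." for allowed in row])
```

With such a mask, each chunk can be denoised conditioned on the chunks before it, which is what lets the model roll forward in time at inference.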

Results

Real-world Manipulation

We evaluate RepWAM on a Franka dual-arm robot platform across three manipulation tasks, where it consistently surpasses existing vision-language-action models and WAMs.

Real-world success rates

Real-world rollouts

Pick Fruit   |   Pick up fruits and place them on a plate.
Pick fruit real-world demo

Push Drawer   |   Push a drawer and place a block inside.
Push drawer real-world demo

Insert Tube   |   Insert a test tube into a rack.
Insert tube real-world demo

RoboTwin 2.0

Trained from scratch without WAN initialization, RepWAM-5B reaches competitive results on RoboTwin 2.0: 89.3 on Easy and 88.4 on Hard over the 50-task suite.

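| Model | Pretrained backbone | Hor-2 Easy | Hor-2 Hard | Hor-3 Easy | Hor-3 Hard | 50-task Easy | 50-task Hard |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pi0.5 | Yes | 79.3 | 73.0 | 78.6 | 67.4 | 82.7 | 76.8 |
| Motus | Yes | 85.2 | 80.9 | 85.0 | 84.2 | 88.7 | 87.0 |
| Lingbot-VA | Yes | 85.3 | 86.9 | 89.6 | 90.6 | 92.9 | 91.6 |
| RepWAM-1.3B | No | 85.7 | 84.0 | 92.0 | 85.4 | 86.6 | 83.1 |
| RepWAM-5B | No | 87.4 | 87.6 | 88.0 | 90.4 | 89.3 | 88.4 |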
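
To read the comparison column by column, the short script below copies the success rates reported above and finds the top model in each column. The helper name `best_per_column` is illustrative, and the script adds no data beyond the table.

```python
COLUMNS = [
    "Hor-2 Easy", "Hor-2 Hard", "Hor-3 Easy",
    "Hor-3 Hard", "50-task Easy", "50-task Hard",
]

# Success rates copied from the RoboTwin 2.0 table, in COLUMNS order.
SCORES = {
    "pi0.5":       [79.3, 73.0, 78.6, 67.4, 82.7, 76.8],
    "Motus":       [85.2, 80.9, 85.0, 84.2, 88.7, 87.0],
    "Lingbot-VA":  [85.3, 86.9, 89.6, 90.6, 92.9, 91.6],
    "RepWAM-1.3B": [85.7, 84.0, 92.0, 85.4, 86.6, 83.1],
    "RepWAM-5B":   [87.4, 87.6, 88.0, 90.4, 89.3, 88.4],
}


def best_per_column(scores, columns):
    """Map each column name to the (model, score) pair with the highest score."""
    best = {}
    for idx, col in enumerate(columns):
        model = max(scores, key=lambda name: scores[name][idx])
        best[col] = (model, scores[model][idx])
    return best


for col, (model, score) in best_per_column(SCORES, COLUMNS).items():
    print(f"{col}: {model} ({score})")
```

The RepWAM variants lead on Hor-2 Easy, Hor-2 Hard, and Hor-3 Easy, while Lingbot-VA, which uses a pretrained backbone, leads on Hor-3 Hard and both 50-task columns.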

Replacing WAN2.2 VAE with RepViTok improves the average success rate by 8.6 points on Easy and 7.1 points on Hard, supporting the importance of semantic visual-action tokenization for world action modeling.

WAN VAE ablation bar chart

Visualizations

Video Generation

Open-loop video generation results from RepWAM.

Open-loop generated video gallery

Latent Actions

Compared with prior latent action models, RepViTok better captures manipulation-relevant changes, thus leading to lower action loss.

Latent action visualization

Citation

If you find RepWAM helpful, please consider starring 🌟 our repo and citing the paper.

@article{wang2026repwam,
  title  = {RepWAM: World Action Modeling with Representation Visual-Action Tokenizers},
  author = {Wang, Junke and Zhang, Qihang and Yang, Shuai and Luo, Yiming and Shen, Yujun and Wu, Zuxuan and Jiang, Yu-Gang and Xu, Yinghao},
  journal= {arXiv preprint arXiv:2606.13674},
  year   = {2026}
}
