RepWAM: World Action Modeling with
Representation Visual-Action Tokenizers

Junke Wang¹ Qihang Zhang² Shuai Yang² Yiming Luo² Yujun Shen² Zuxuan Wu¹,* Yu-Gang Jiang¹,* Yinghao Xu²,³,*

¹Institute of Trustworthy Embodied AI, Fudan University ²Robbyant, Ant Group
³Hong Kong University of Science and Technology
*Corresponding authors

Introduction

RepWAM is a representation-centric world action model, built around semantic visual-action tokenization. It first learns a visual-action tokenizer, and then trains a causal WAM to model future states and actions under language instructions, enabling effective transfer from world modeling to robot control.

Highlights

⛰️ RepWAM is the first representation-centric WAM trained on semantic visual and action latent tokens.
⚡️ Our representation visual-action tokenizer, RepViTok, not only achieves superior visual reconstruction quality, but also transfers well to robot actions.
🔥 Leading performance on real robot tasks and RoboTwin 2.0, with clear gains over a WAN2.2 VAE tokenizer baseline.

Open-source Plan

[2026/06/12] ~~Paper release.~~
[2026/06] Inference code release.
[2026/06] Code and model weights release.

Method

Representation Visual-Action Tokenizer

RepViTok first trains a video tokenizer with both pixel reconstruction and semantic alignment supervision. On top of the visual latent space, it then learns latent actions as compact transitions between visual states.
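
The two supervision signals described above can be combined into a single objective. The following is only a minimal sketch under stated assumptions: this README does not give the loss forms, so mean squared error stands in for pixel reconstruction, cosine distance to features from a frozen teacher encoder stands in for semantic alignment, and `tokenizer_loss`, `teacher_feat`, and `align_weight` are hypothetical names, not RepViTok's API.

```python
import math
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; 0 when the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def tokenizer_loss(
    recon: Sequence[float],
    pixels: Sequence[float],
    latent_feat: Sequence[float],
    teacher_feat: Sequence[float],
    align_weight: float = 0.5,  # hypothetical weighting, not from the paper
) -> float:
    """Assumed objective: pixel reconstruction plus weighted semantic alignment."""
    reconstruction = mse(recon, pixels)
    alignment = cosine_distance(latent_feat, teacher_feat)
    return reconstruction + align_weight * alignment
```

A perfect reconstruction whose latent feature is parallel to the teacher feature gives a loss of zero; raising `align_weight` trades reconstruction fidelity for semantic structure in the latent space.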

World Action Model

With the paired visual-action latents, RepWAM trains a causal diffusion transformer over visual-action chunks, jointly modeling future visual states and the latent actions that connect them under language conditioning.
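
One common way to make a transformer causal over chunks is a block-causal attention mask. The sketch below illustrates that idea only: it assumes each chunk holds a fixed number of visual and action tokens, and that a token may attend to every token in its own chunk and in earlier chunks. The README does not specify RepWAM's actual masking scheme, and `block_causal_mask` is a hypothetical helper.

```python
from typing import List


def block_causal_mask(num_chunks: int, tokens_per_chunk: int) -> List[List[bool]]:
    """Return mask[i][j], True when query token i may attend to key token j.

    Tokens see all tokens in the same chunk and in earlier chunks,
    but none in later chunks (assumed layout, not confirmed by the source).
    """
    if num_chunks <= 0 or tokens_per_chunk <= 0:
        raise ValueError("num_chunks and tokens_per_chunk must be positive")
    total = num_chunks * tokens_per_chunk
    # Chunk index of every token position in the flattened sequence.
    chunk_of = [idx // tokens_per_chunk for idx in range(total)]
    return [
        [chunk_of[key] <= chunk_of[query] for key in range(total)]
        for query in range(total)
    ]
```

With two chunks of two tokens each, the first chunk sees only itself while the second chunk sees the whole sequence, which is what lets later visual-action chunks be predicted from earlier ones.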

Results

Real-world Manipulation

We evaluate RepWAM on a Franka dual-arm robot platform across three manipulation tasks, where it consistently surpasses existing vision-language-action models and WAMs.

Real-world rollouts

Pick Fruit: Pick fruits into a plate.

Push Drawer: Push a drawer and place a block inside.

Insert Tube: Insert a test tube into a rack.

RoboTwin 2.0

Trained from scratch without WAN initialization, RepWAM reaches competitive results on RoboTwin 2.0: RepWAM-5B scores 89.3 on Easy and 88.4 on Hard over the 50-task RoboTwin suite.

Model	Backbone pretrained	Hor-2 Easy	Hor-2 Hard	Hor-3 Easy	Hor-3 Hard	50-task Easy	50-task Hard
pi0.5	Yes	79.3	73.0	78.6	67.4	82.7	76.8
Motus	Yes	85.2	80.9	85.0	84.2	88.7	87.0
Lingbot-VA	Yes	85.3	86.9	89.6	90.6	92.9	91.6
RepWAM-1.3B	No	85.7	84.0	92.0	85.4	86.6	83.1
RepWAM-5B	No	87.4	87.6	88.0	90.4	89.3	88.4

Replacing WAN2.2 VAE with RepViTok improves the average success rate by 8.6 points on Easy and 7.1 points on Hard, supporting the importance of semantic visual-action tokenization for world action modeling.
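
For readers who want to compare rows of the RoboTwin 2.0 table directly, the snippet below recomputes per-column gaps between any two models. The numbers are copied from the table; the helper name `column_gaps` is illustrative and not part of any released code.

```python
from typing import Dict, List

COLUMNS: List[str] = [
    "Hor-2 Easy", "Hor-2 Hard",
    "Hor-3 Easy", "Hor-3 Hard",
    "50-task Easy", "50-task Hard",
]

# Success rates as reported in the RoboTwin 2.0 table.
SCORES: Dict[str, List[float]] = {
    "pi0.5":       [79.3, 73.0, 78.6, 67.4, 82.7, 76.8],
    "Motus":       [85.2, 80.9, 85.0, 84.2, 88.7, 87.0],
    "Lingbot-VA":  [85.3, 86.9, 89.6, 90.6, 92.9, 91.6],
    "RepWAM-1.3B": [85.7, 84.0, 92.0, 85.4, 86.6, 83.1],
    "RepWAM-5B":   [87.4, 87.6, 88.0, 90.4, 89.3, 88.4],
}


def column_gaps(model: str, baseline: str) -> Dict[str, float]:
    """Per-column score of `model` minus `baseline`, rounded to one decimal."""
    return {
        column: round(a - b, 1)
        for column, a, b in zip(COLUMNS, SCORES[model], SCORES[baseline])
    }


if __name__ == "__main__":
    for column, gap in column_gaps("RepWAM-5B", "RepWAM-1.3B").items():
        print(f"{column}: {gap:+.1f}")
```

On these numbers, RepWAM-5B improves on RepWAM-1.3B in five of the six columns and trails by 4.0 points on Hor-3 Easy.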

Visualizations

Video Generation

Open-loop video generation results from RepWAM.

Latent Actions

Compared with prior latent action models, RepViTok better captures manipulation-relevant changes, thus leading to lower action loss.

Citation

If you find RepWAM helpful, please consider starring 🌟 our repo and citing the paper.

@article{wang2026repwam,
  title  = {RepWAM: World Action Modeling with Representation Visual-Action Tokenizers},
  author = {Wang, Junke and Zhang, Qihang and Yang, Shuai and Luo, Yiming and Shen, Yujun and Wu, Zuxuan and Jiang, Yu-Gang and Xu, Yinghao},
  journal= {arXiv preprint arXiv:2606.13674},
  year   = {2026}
}
