TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers


Bin Yu1,2,* Shijie Lian2,4,* Xiaopeng Lin2,5,* Yuliang Wei1,† Zhaolong Shen2,6
Changti Wu2,7 Yuzhuo Miao1,2 Xinming Wang2,8 Bailing Wang1 Cong Huang2,3 Kai Chen2,3,9,†

1HIT, 2ZGCA, 3ZGCI, 4HUST, 5HKUST(GZ), 6BUAA, 7ECNU, 8CASIA, 9DeepCybo

*Equal contribution, †Corresponding author

ZGCA: Zhongguancun Academy; ZGCI: Zhongguancun Institute of Artificial Intelligence


πŸ“– Abstract

Standard Vision-Language-Action (VLA) models typically fine-tune a monolithic Vision-Language Model (VLM) backbone explicitly for robotic control. However, this approach creates a critical tension between maintaining high-level general semantic understanding and learning low-level, fine-grained sensorimotor skills, often leading to "catastrophic forgetting" of the model's open-world capabilities.

To resolve this conflict, we introduce TwinBrainVLA, a novel architecture that coordinates a generalist VLM, which retains universal semantic understanding, with a specialist VLM dedicated to embodied proprioception, for joint robotic control. TwinBrainVLA synergizes a frozen "Left Brain", which preserves robust general visual reasoning, with a trainable "Right Brain" specialized for embodied perception, via a novel Asymmetric Mixture-of-Transformers (AsyMoT) mechanism. This design allows the Right Brain to dynamically query semantic knowledge from the frozen Left Brain and fuse it with proprioceptive states, providing rich conditioning for a Flow-Matching Action Expert that generates precise continuous controls.

Extensive experiments on the SimplerEnv and RoboCasa benchmarks demonstrate that TwinBrainVLA outperforms state-of-the-art baselines in manipulation while explicitly preserving the comprehensive visual understanding of the pre-trained VLM. This offers a promising direction for building general-purpose robots that combine high-level semantic understanding with low-level physical dexterity.
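The Flow-Matching Action Expert mentioned above generates an action chunk by integrating a learned velocity field from Gaussian noise toward the target actions. The following is a minimal NumPy sketch of that sampling loop, with a hypothetical closed-form velocity field standing in for the learned DiT; the function names, shapes, and step count are illustrative, not the repository's API.

```python
import numpy as np

def sample_actions(velocity_field, horizon, action_dim, cond, steps=50, seed=0):
    """Euler-integrate a flow from noise to an action chunk.

    `velocity_field(a, t, cond)` stands in for the learned Action Expert
    (a Flow-Matching DiT in the paper); here it is any callable.
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(horizon, action_dim))   # a_0 ~ N(0, I): pure noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt                               # t runs over [0, 1 - dt]
        a = a + dt * velocity_field(a, t, cond)  # Euler step along the flow
    return a                                     # approximation of a_1

# Toy velocity field for a straight-line flow toward the conditioning
# vector (a stand-in target; the real field is learned from data).
cond = np.ones(7)
straight_line_v = lambda a, t, c: (c - a) / (1.0 - t)
actions = sample_actions(straight_line_v, horizon=8, action_dim=7, cond=cond)
print(actions.shape)  # (8, 7)
```

With the straight-line velocity field above, the Euler iterates move linearly from the noise sample to `cond`, so the final chunk lands on the target; the learned model replaces this closed form at inference time.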

πŸ—οΈ Architecture

TwinBrainVLA mimics the biological principle of hemispheric lateralization:

(Figure: TwinBrainVLA framework overview)
  • Left Brain (Generalist): A frozen, pre-trained VLM (e.g., Qwen-VL) that serves as a semantic anchor, preserving open-world knowledge.
  • Right Brain (Specialist): A trainable VLM initialized with the same weights, specialized for embodied control and proprioceptive state encoding.
  • Asymmetric MoT (AsyMoT): A mechanism where the Right Brain attends to the frozen Key-Value (KV) pairs of the Left Brain via causal self-attention, transferring semantic knowledge without parameter pollution.
  • Action Expert: A Flow-Matching Diffusion Transformer (DiT) that generates continuous actions based on the condition features from the Right Brain.
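To make the AsyMoT step above concrete, here is a minimal single-head, batch-free NumPy sketch: the Right Brain's queries attend over the frozen Left Brain's key-value cache plus the Right Brain's own causally masked keys and values. All names and shapes are hypothetical simplifications of the paper's mechanism; the real implementation operates per layer inside the transformer with learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def asymot_attention(q_right, k_left, v_left, k_right, v_right):
    """Right-Brain queries attend over [frozen Left-Brain KV; own KV].

    The Left-Brain KV cache is treated as constant (no gradients flow
    into it), which is what keeps the generalist weights untouched.
    """
    T_r, d = q_right.shape
    T_l = k_left.shape[0]
    k = np.concatenate([k_left, k_right], axis=0)  # (T_l + T_r, d)
    v = np.concatenate([v_left, v_right], axis=0)
    scores = q_right @ k.T / np.sqrt(d)            # (T_r, T_l + T_r)
    # Left-Brain tokens are fully visible; the Right Brain's own tokens
    # keep a standard causal mask (no attending to future positions).
    causal = np.triu(np.ones((T_r, T_r), dtype=bool), k=1)
    scores[:, T_l:][causal] = -np.inf
    return softmax(scores) @ v                     # (T_r, d)

# Toy shapes: 4 Left-Brain tokens, 3 Right-Brain tokens, dim 8.
rng = np.random.default_rng(0)
out = asymot_attention(rng.normal(size=(3, 8)),
                       rng.normal(size=(4, 8)), rng.normal(size=(4, 8)),
                       rng.normal(size=(3, 8)), rng.normal(size=(3, 8)))
print(out.shape)  # (3, 8)
```

Because the Left-Brain keys and values enter only as attention context, semantic knowledge is transferred without writing into the generalist's parameters, matching the "no parameter pollution" property described above.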

πŸ™ Acknowledgements

We thank the starVLA project for its inspiring work and open-source contributions. We are also grateful to the following projects:

Citation

If you find this project or the dataset helpful, please cite:

@misc{TwinBrainVLA,
      title={TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers}, 
      author={Bin Yu and Shijie Lian and Xiaopeng Lin and Yuliang Wei and Zhaolong Shen and Changti Wu and Yuzhuo Miao and Xinming Wang and Bailing Wang and Cong Huang and Kai Chen},
      year={2026},
      eprint={2601.14133},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2601.14133}, 
}
