ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation
Zhenyang Liu1,2, Yongchong Gu1, Yikai Wang3,
Xiangyang Xue1,β, Yanwei Fu1,2,β
1Fudan University, 2Shanghai Innovation Institute, 3Nanyang Technological University
β Corresponding Authors
This repository is the official implementation of ActiveVLA. We are currently preparing the code and data for release. Please stay tuned!
- Release the Code (Training & Inference scripts).
- Release Pre-trained Models.
- Release Evaluation Scripts (RLBench, COLOSSEUM, GemBench).
- Release Real-Robot Control Code.
Most existing Vision-Language-Action (VLA) models rely on static, wrist-mounted cameras that provide a fixed, end-effector-centric viewpoint. This setup limits perceptual flexibility: the agent cannot adaptively adjust its viewpoint or camera resolution according to the task context, leading to failures in long-horizon tasks or fine-grained manipulation due to occlusion and lack of detail.
We propose ActiveVLA, a novel vision-language-action framework that explicitly integrates active perception into robotic manipulation. Unlike passive perception methods, ActiveVLA empowers robots to:
- Actively Select Viewpoints: Autonomously determine optimal camera perspectives to maximize visibility and task relevance while minimizing occlusions.
- Active 3D Zoom-in: Selectively obtain high-resolution views of task-critical regions within the 3D scene.
By dynamically refining its perceptual input, ActiveVLA achieves superior adaptability and performance in complex scenarios. Experiments show that ActiveVLA outperforms state-of-the-art baselines on RLBench, COLOSSEUM, and GemBench, and transfers seamlessly to real-world robots.
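Since the code has not been released yet, the snippet below is a minimal, hypothetical sketch of what these two operations could look like on a raw point cloud. All names (`ActivePerceptionSketch`, `select_viewpoint`, `zoom_in_3d`) are placeholders rather than the official API, and the occlusion count is a simplified stand-in for the paper's relevance and diversity criteria.

```python
# Hypothetical sketch of the two active-perception operations above.
# Names and scoring are placeholders, NOT the released ActiveVLA API.
import numpy as np


class ActivePerceptionSketch:
    def __init__(self, num_candidate_views: int = 8, radius: float = 1.0):
        # Candidate virtual cameras on a ring above the workspace.
        az = np.linspace(0.0, 2.0 * np.pi, num_candidate_views, endpoint=False)
        self.candidates = np.stack(
            [radius * np.cos(az), radius * np.sin(az), np.full_like(az, 0.6 * radius)],
            axis=-1,
        )

    def select_viewpoint(self, points: np.ndarray, region_center: np.ndarray) -> np.ndarray:
        # Score each candidate by how few points block its line of sight to the
        # task-critical region (a crude proxy for "minimize occlusion").
        scores = []
        for cam in self.candidates:
            ray = region_center - cam
            ray = ray / np.linalg.norm(ray)
            rel = points - cam
            t = rel @ ray                               # distance along the ray
            closest = cam + np.outer(t, ray)            # nearest ray point per scene point
            dist = np.linalg.norm(points - closest, axis=-1)
            blocking = (t > 0) & (t < np.linalg.norm(region_center - cam)) & (dist < 0.05)
            scores.append(-int(blocking.sum()))         # fewer blockers -> higher score
        return self.candidates[int(np.argmax(scores))]

    def zoom_in_3d(self, points: np.ndarray, region_center: np.ndarray,
                   radius: float = 0.15) -> np.ndarray:
        # "Virtual zoom": crop the cloud around the predicted critical region so
        # downstream rendering spends its resolution budget on that region.
        mask = np.linalg.norm(points - region_center, axis=-1) < radius
        return points[mask]
```

In the method itself, candidate views are scored with a hypothesis-testing strategy over amodal relevance and diversity, and the cropped region is re-rendered at higher resolution rather than simply filtered; the sketch only conveys the interface.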
Figure 1: Comparison between previous VLA methods and ActiveVLA. While traditional VLAs fail due to fixed cameras and occlusion (left), ActiveVLA leverages 3D scene understanding to actively select better views and observe more carefully (right).
We propose a coarse-to-fine active perception framework that integrates 3D spatial reasoning with vision-language understanding.
The pipeline consists of two main stages (a simplified sketch follows Figure 2):
- Critical Region Localization (Coarse Stage): Renders the 3D input into multi-view 2D projections and predicts heatmaps that localize task-critical 3D regions.
- Active Perception Optimization (Fine Stage):
- Active Viewpoint Selection: Uses a hypothesis testing strategy to choose optimal viewpoints that maximize amodal relevance and diversity.
- Active 3D Zoom-in: Applies a virtual optical zoom effect to improve resolution in key areas for precise manipulation.
Figure 2: The pipeline of ActiveVLA. It adopts a two-stage strategy involving coarse heatmap prediction followed by active view selection and 3D zoom-in to generate the final 3D action.
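Until the official training and inference scripts are released, the sketch below only illustrates how the two stages could be wired together. `heatmap_net`, `policy`, and `perception` are hypothetical components supplied by the caller (e.g. an instance of the sketch class above), and a single top-down splat stands in for the paper's multi-view projection.

```python
# Simplified wiring of the coarse-to-fine pipeline in Figure 2.
# `heatmap_net`, `policy`, and `perception` are hypothetical stand-ins.
import numpy as np


def splat_top_down(points: np.ndarray, res: int = 64):
    """Toy orthographic splat (stand-in for the multi-view renderer).
    Returns an occupancy image plus an index map for lifting pixels back to 3D."""
    img = np.zeros((res, res), dtype=np.float32)
    idx = np.full((res, res), -1, dtype=np.int64)
    xy = ((points[:, :2] + 1.0) * 0.5 * (res - 1)).round().astype(int).clip(0, res - 1)
    img[xy[:, 1], xy[:, 0]] = 1.0
    idx[xy[:, 1], xy[:, 0]] = np.arange(len(points))
    return img, idx


def run_pipeline(points: np.ndarray, instruction: str, heatmap_net, policy, perception):
    # --- Coarse stage: critical region localization -----------------------
    img, idx = splat_top_down(points)            # one view for brevity; the paper
                                                 # predicts heatmaps over several views
    heat = heatmap_net(img, instruction)         # (H, W) task-relevance heatmap
    y, x = np.unravel_index(np.argmax(heat * (idx >= 0)), heat.shape)
    region_center = points[idx[y, x]]            # lift the 2D peak back to 3D

    # --- Fine stage: active perception optimization -----------------------
    best_view = perception.select_viewpoint(points, region_center)
    zoomed = perception.zoom_in_3d(points, region_center)

    # The fine-stage policy maps the zoomed-in observation and chosen viewpoint
    # to the next end-effector action (e.g. a 6-DoF pose plus gripper state).
    return policy(zoomed, best_view, instruction)
```

A caller would pass `perception = ActivePerceptionSketch()` from the sketch above along with trained `heatmap_net` and `policy` models; the real components will ship with the code release.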
Note: For more visualizations and real-world robot demos, please visit our Project Page.
ActiveVLA achieves state-of-the-art performance across multiple benchmarks:
- RLBench: Achieves an average success rate of 91.8%, ranking 1st in 10 tasks.
- COLOSSEUM: Demonstrates superior robustness with a 65.9% success rate in challenging generalization scenarios.
- GemBench: Outperforms all baselines with strong adaptability across diverse tasks.
- Real World: High success rates in occlusion-heavy tasks (e.g., retrieving items from drawers, handling occluded objects).
If you find our work useful in your research, please consider citing:
@misc{liu2026activevlainjectingactiveperception,
  title={ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation},
  author={Zhenyang Liu and Yongchong Gu and Yikai Wang and Xiangyang Xue and Yanwei Fu},
  year={2026},
  eprint={2601.08325},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2601.08325},
}