Weiyu Guo1,4, He Zhang1,4, Pengteng Li1,4, Tiefu Cai1,4, Ziyang Chen1,4, Yandong Guo1,4,
He Xiao4, Yongkui Yang3*, Ying Sun1,2*, Hui Xiong1,2*
1The Thrust of Artificial Intelligence, HKUST (Guangzhou), China
2The Department of CSE, HKUST, Hong Kong, China
3Shenzhen Institutes of Advanced Technology, CAS, China
4AI2Robotics, Shenzhen, China
Real-world experiments evaluating precision (Pouring), memory (Shaking), and safety reflexes (Collision Recovery).
The pursuit of general-purpose embodied intelligence faces a critical sensorimotor paradox: traditional Vision-Language-Action (VLA) models suffer from "temporal blindness" and high latency, leading to action jitter and an inability to react reflexively in dynamic scenarios.
NeuroVLA introduces a bio-inspired, tri-level hierarchical architecture that restores the canonical division of labor found in biological motor systems. Instead of a monolithic processor, NeuroVLA decouples high-level cognition from low-level motor control:
- Cortical Module (Vision-Language): Responsible for semantic planning and high-level goal generation.
- Cerebellar Module (Adaptive): Functions as a high-frequency adaptive filter to predict sensory consequences and refine timing.
- Spinal Module (Spiking Neural Network): Implements asynchronous, localized actuation and fast sensorimotor loops.
By mapping the spinal module to event-driven spiking networks, NeuroVLA exploits temporal sparsity to minimize end-to-end latency, enabling localized, hardware-efficient learning on edge devices.
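To make the event-driven idea concrete, here is a minimal, illustrative sketch of a leaky integrate-and-fire (LIF) neuron, the standard building block of spiking networks like the spinal module. This is not NeuroVLA's implementation; the function name and all parameter values (`tau`, `v_threshold`, `v_reset`) are hypothetical defaults chosen for clarity.

```python
# Illustrative sketch only: a discrete-time leaky integrate-and-fire neuron.
# Parameter values are hypothetical, not taken from NeuroVLA.

def lif_step(v, input_current, tau=0.9, v_threshold=1.0, v_reset=0.0):
    """One LIF update: leak, integrate, and fire when the threshold is crossed."""
    v = tau * v + input_current      # leaky integration of the input
    spike = v >= v_threshold         # event-driven output: emit a spike only past threshold
    if spike:
        v = v_reset                  # hard reset of the membrane potential after a spike
    return v, spike

# With zero input the neuron stays quiescent, illustrating temporal sparsity:
v = 0.0
spikes = []
for current in [0.0, 0.0, 1.5, 0.0, 0.0]:
    v, s = lif_step(v, current)
    spikes.append(s)
# spikes -> [False, False, True, False, False]
```

Because computation happens only on spike events, a hardware implementation can idle between spikes, which is the source of the latency and energy advantages claimed above.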
Our experiments on both simulated benchmarks and physical robotic hardware demonstrate distinctive capabilities that scaling monolithic VLAs alone cannot replicate:
- Kinematic Smoothness (75% Jerk Reduction): The cerebellar module functions as an adaptive filter, effectively suppressing high-frequency intention tremor. This reduces kinematic jerk by over 75%, ensuring fluid execution even with noisy visual feedback.
- Survival Reflexes (< 20 ms Latency): Under unexpected physical collisions, the cerebellar-spinal loops trigger rapid withdrawal reflexes in < 20 ms, bypassing the prohibitive latency (> 200 ms) of the cortical loop to protect hardware.
- Emergent Sparsity: The neuromorphic spinal layer exhibits unsupervised functional self-organization without explicit training signals:
- Temporal Sparsity: Neurons spontaneously revert to quiescence during static posturing to minimize metabolic cost.
- Spatial Disentanglement: The network naturally segregates high-dimensional control signals into distinct, somatotopic behavioral modes.
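For intuition on the smoothness metric above, the following sketch shows one common way to estimate kinematic jerk from a sampled joint trajectory: a third-order finite difference of position. The function name, sample traces, and timestep are hypothetical examples, not NeuroVLA's evaluation code.

```python
# Illustrative sketch only: a jerk metric via third finite differences.
# Names and values are hypothetical, not NeuroVLA's evaluation pipeline.

def mean_abs_jerk(positions, dt):
    """Mean absolute third finite difference of a 1-D position trace."""
    jerks = []
    for i in range(len(positions) - 3):
        # third-order forward difference approximates d^3 x / dt^3
        j = (positions[i + 3] - 3 * positions[i + 2]
             + 3 * positions[i + 1] - positions[i]) / dt ** 3
        jerks.append(abs(j))
    return sum(jerks) / len(jerks)

smooth = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]    # constant velocity: near-zero jerk
jittery = [0.0, 0.2, 0.1, 0.4, 0.2, 0.5]   # oscillating trace: high jerk
assert mean_abs_jerk(smooth, dt=0.1) < mean_abs_jerk(jittery, dt=0.1)
```

A "75% jerk reduction" would then correspond to the filtered trajectory scoring a quarter of the unfiltered one under such a metric.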
The environment setup is based on standard VLA dependencies. We recommend using conda to manage the environment.
- Linux (Ubuntu 20.04/22.04 recommended)
- Python 3.10+
- NVIDIA GPU with CUDA support
# 1. Create a conda environment
conda create -n neurovla python=3.10 -y
conda activate neurovla
# 2. Install PyTorch (Adjust CUDA version as needed)
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
# Install requirements
pip install -r requirements.txt
# Install FlashAttention2
pip install flash-attn --no-build-isolation
Note: For specific dependency versions and detailed configuration related to the base VLA framework, please refer to the StarVLA Environment Setup Guide. Our implementation builds upon these foundational libraries.
# 1. Run training example
bash NeuroVLA/scripts/run_scripts/run_libero_train_NeuroVLA.sh
# 2. Run evaluation example
bash NeuroVLA/examples/LIBERO/eval_libero.sh
If you find our code or architecture helpful in your research, please cite our repository:
@misc{guo2025neurovla,
  author       = {Guo, Weiyu and Zhang, He and Li, Pengteng and Cai, Tiefu and Chen, Ziyang and Guo, Yandong and Xiao, He and Yang, Yongkui and Sun, Ying and Xiong, Hui},
  title        = {NeuroVLA: A Brain-like Embodied Intelligence for Fluid and Fast Reflexive Robotics Control},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {https://github.com/guoweiyu/NeuroVLA}
}
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).
This is a strict copyleft license. If you use this software (or a modified version of it) to provide a service over a network, you must make the source code available to the users of that service.
See LICENSE for more details.