Official implementation of the paper "InSpire: Vision-Language-Action Models with Intrinsic Spatial Reasoning".
Note: We are actively improving this work. If you have any questions or suggestions, please feel free to open an issue in this repo or contact us at shihan.wu.koorye@outlook.com.
[Project] [ArXiv] [PDF] [Inspire-FAST]
- 🔥 Sep 29, 2025: CALVIN evaluation experiment results are now available.
- 🔥 May 23, 2025: Our paper has been updated for better clarity and readability. The optimized version is now available on arXiv.
- 🔥 May 21, 2025: The code is released, and the paper is now available on arXiv.
Abstract

Leveraging pretrained Vision-Language Models (VLMs) to map language instructions and visual observations to raw low-level actions, Vision-Language-Action models (VLAs) hold great promise for achieving general-purpose robotic systems. Despite their advancements, existing VLAs tend to spuriously correlate task-irrelevant visual features with actions, limiting their generalization capacity beyond the training data. To address this challenge, we propose Intrinsic Spatial Reasoning (InSpire), which mitigates the adverse effects of spurious correlations by boosting the spatial reasoning ability of VLAs. Specifically, InSpire redirects the model's attention to task-relevant visual clues by simply appending the question “In which direction is the [object] relative to the robot?” before the language instruction and aligning the VLA's answer “right / left / up / down / front / back / grasp” and predicted actions with the ground truth. Notably, InSpire can be employed as a plugin to enhance existing autoregressive VLAs, requiring no extra data or interaction with other large models. Extensive experimental results in both simulation and real-world environments demonstrate the effectiveness and flexibility of our approach.
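To make the mechanism concrete, below is a minimal sketch of the prompt construction and direction labeling described above. Everything in it (the function names, the axis convention, and the `grasp_dist` threshold) is an illustrative assumption, not the repository's actual implementation.

```python
# Minimal sketch of the InSpire idea: prepend a spatial question to the task
# instruction and derive a ground-truth direction token for supervision.
# All names, axis conventions, and thresholds here are illustrative assumptions.

DIRECTIONS = ["right", "left", "up", "down", "front", "back", "grasp"]

def build_inspire_prompt(instruction: str, obj: str) -> str:
    """Prepend the spatial-reasoning question to the language instruction."""
    question = f"In which direction is the {obj} relative to the robot?"
    return f"{question} {instruction}"

def direction_label(obj_pos, ee_pos, grasp_dist=0.05):
    """Map the object's offset from the end-effector to a direction token.

    obj_pos / ee_pos are (x, y, z) positions; the axis convention and the
    5 cm grasp threshold are assumptions for illustration.
    """
    dx, dy, dz = (o - e for o, e in zip(obj_pos, ee_pos))
    if (dx * dx + dy * dy + dz * dz) ** 0.5 < grasp_dist:
        return "grasp"  # close enough to grasp
    # Otherwise report the dominant displacement axis.
    axis, value = max(zip("xyz", (dx, dy, dz)), key=lambda p: abs(p[1]))
    return {
        "x": "front" if value > 0 else "back",
        "y": "left" if value > 0 else "right",
        "z": "up" if value > 0 else "down",
    }[axis]

# Example: the object sits 20 cm along the assumed "left" axis.
print(build_inspire_prompt("pick up the orange", "orange"))
print(direction_label((0.4, 0.2, 0.1), (0.4, 0.0, 0.1)))  # -> "left"
```

During training, the VLA is supervised to emit both the direction token and the actions; at inference the question is prepended in the same way, which is why no extra data or calls to other large models are needed.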
Real-world Environments
LIBERO Simulated Environments
CALVIN Simulated Environments
Note: The real-world experiments were conducted on the $\pi_0$-FAST model, and the relevant code is available at Inspire-FAST.
Seen Tasks
Unseen Tasks
| Pick Orange | Banana Towel | Ball Book | Orange Cup Plate |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
| InSpire: Pick Orange | InSpire: Banana Towel | InSpire: Ball Book | InSpire: Orange Cup Plate |
| ![]() | ![]() | ![]() | ![]() |
Seen Tasks
| miniVLA: Butter Drawer | miniVLA: Moka Stove | miniVLA: Sauce Tray | miniVLA: Book Caddy |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
| InSpire: Butter Drawer | InSpire: Moka Stove | InSpire: Sauce Tray | InSpire: Book Caddy |
| ![]() | ![]() | ![]() | ![]() |
Unseen Tasks
| miniVLA: Bowl Plate | miniVLA: Cheese Basket | miniVLA: Bowl Plate | miniVLA: Book Caddy |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
| InSpire: Bowl Plate | InSpire: Cheese Basket | InSpire: Bowl Plate | InSpire: Book Caddy |
| ![]() | ![]() | ![]() | ![]() |
ABC -> D (train on environments A, B, and C; evaluate on environment D)
| miniVLA | InSpire |
|---|---|
| ![]() | ![]() |
| ![]() | ![]() |
| Model | Dataset | Checkpoint |
|---|---|---|
| MiniVLA | Libero90 | Download |
| InspireVLA | Libero90 | Download |
| InspireVLA | Libero10+Goal+Object+Spatial | Download |
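The Download links above are hosted on Hugging Face, so fetching a checkpoint can also be scripted. Below is a minimal sketch using `huggingface_hub`; the repo id is a placeholder assumption (use the id behind the corresponding Download link), and the token file matches the `.hf_token` created in the installation steps below.

```python
# Sketch: download a checkpoint programmatically instead of via the links
# above. The repo id below is a placeholder, not the project's real id.
from pathlib import Path

from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="your-org/inspire-libero90",  # hypothetical repo id
    local_dir="checkpoints/inspire_libero90",
    token=Path(".hf_token").read_text().strip(),  # Hugging Face token file
)
print(f"Checkpoint saved to {local_path}")
```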
- Clone the repository.
```bash
git clone https://github.com/Koorye/Inspire.git
```

- Install the dependencies.
```bash
conda create -n inspire python=3.10
conda activate inspire

cd LIBERO
pip install -r requirements.txt
pip install -e .
cd ..

cd vq_bet_official
pip install -r requirements.txt
pip install -e .
cd ..

pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements-min.txt

# (Optional) for Flash Attention
pip install packaging ninja
ninja --version; echo $?  # Verify Ninja: should return exit code "0"
pip install "flash-attn==2.5.5" --no-build-isolation
```
- Create a `.hf_token` file in the root directory and add your Hugging Face token.

```bash
echo "your_huggingface_token" > .hf_token
```

- Download the pretrained checkpoints.
```bash
bash scripts/download_pretrained_weights.sh
```

- Run the evaluation script.
```bash
bash vla_scripts/eval/eval_baseline_libero90.sh
bash vla_scripts/eval/eval_inspire_libero90.sh
```

- Prepare the dataset. See Dataset Preparation.
- Run the training script.
```bash
bash vla_scripts/train/train_baseline_libero90.sh
bash vla_scripts/train/train_inspire_libero90.sh
```

- Run the evaluation script.

```bash
bash vla_scripts/eval/eval_baseline_libero90.sh
bash vla_scripts/eval/eval_inspire_libero90.sh
```

Our work is built upon the following open-source projects: CALVIN, LIBERO, miniVLA, and Pi-0. We thank the authors for releasing their code. If you use our model and code, please consider citing these works as well.