Inspire

Official implementation of the paper "InSpire: Vision-Language-Action Models with Intrinsic Spatial Reasoning".

Note: We are doing our best to improve this work. If you have any questions or suggestions, please feel free to create an issue in this repo or contact us at shihan.wu.koorye@outlook.com.

[Project] [ArXiv] [PDF] [Inspire-FAST]

News

  • 🔥Sep 29, 2025: Evaluation results on CALVIN are now available.

  • 🔥May 23, 2025: Our paper has been updated for better clarity and readability. The optimized version is now available on arXiv.

  • 🔥May 21, 2025: The code is released and the paper is now available on arXiv.

Introduction

Abstract

Leveraging pretrained Vision-Language Models (VLMs) to map language instructions and visual observations to raw low-level actions, Vision-Language-Action models (VLAs) hold great promise for achieving general-purpose robotic systems. Despite their advancements, existing VLAs tend to spuriously correlate task-irrelevant visual features with actions, limiting their generalization capacity beyond the training data. To address this challenge, we propose Intrinsic Spatial Reasoning (InSpire), which mitigates the adverse effects of spurious correlations by boosting the spatial reasoning ability of VLAs. Specifically, InSpire redirects the model's attention to task-relevant visual clues by simply appending the question “In which direction is the [object] relative to the robot” before the language instruction and aligning the VLA's answer “right / left / up / down / front / back / grasp” and predicted actions with the ground truth. Notably, InSpire can be employed as a plugin to enhance existing autoregressive VLAs, requiring no extra data or interaction with other large models. Extensive experimental results in both simulation and real-world environments demonstrate the effectiveness and flexibility of our approach.
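
The core mechanism is simple enough to sketch in a few lines. The snippet below is a minimal illustration (not the repository's actual code) of how the spatial question could be prepended to the instruction and how a ground-truth direction word could be derived from the object's position in the robot frame; the axis convention, grasp threshold, and all function names are our assumptions.

import numpy as np

def spatial_question(obj: str) -> str:
    # Auxiliary question placed before the original language instruction.
    return f"In which direction is the {obj} relative to the robot?"

def direction_label(obj_pos_robot_frame: np.ndarray, grasp_radius: float = 0.05) -> str:
    # Coarse ground-truth answer from the object's position in the robot frame.
    # Assumed convention (illustrative only): x forward, y left, z up; objects
    # closer than grasp_radius are labeled "grasp".
    x, y, z = obj_pos_robot_frame
    if np.linalg.norm(obj_pos_robot_frame) < grasp_radius:
        return "grasp"
    axis = int(np.argmax(np.abs([x, y, z])))
    if axis == 0:
        return "front" if x > 0 else "back"
    if axis == 1:
        return "left" if y > 0 else "right"
    return "up" if z > 0 else "down"

def build_prompt(instruction: str, obj: str) -> str:
    # InSpire's prompt: spatial question first, then the task instruction.
    return f"{spatial_question(obj)} {instruction}"

# Example: the VLA is trained to output both the direction word and the actions,
# each aligned with its ground truth.
print(build_prompt("put the cookies on the towel", "cookies"))
print(direction_label(np.array([0.30, -0.10, 0.05])))  # -> "front"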

[Figures: Motivation and Method overview]

Experiments

Overall Performance

Real-world Environments

[Figure: Real-world environment results]

LIBERO Simulated Environments

[Figure: LIBERO simulated environment results]

CALVIN Simulated Environments

[Figure: CALVIN results]

Attention Maps

[Figure: Attention maps]

Videos

Real-world Environments

Note: The real-world experiments were conducted with the $\pi_0$-FAST model; the relevant code is available at Inspire-FAST.

Seen Tasks

[Videos] $\pi_0$-FAST vs. InSpire on four seen tasks: Cookies Towel, Left Bowl on Middle Bowl, Blue Cup Plate, Pull Bottom Plate.

Unseen Tasks

[Videos] $\pi_0$-FAST vs. InSpire on four unseen tasks: Pick Orange, Banana Towel, Ball Book, Orange Cup Plate.

LIBERO Simulated Environments

Seen Tasks

[Videos] miniVLA vs. InSpire on four seen Libero-90 tasks: Butter Drawer, Moka Stove, Sauce Tray, Book Caddy.

Unseen Tasks

[Videos] miniVLA vs. InSpire on four unseen tasks: Bowl Plate (Libero-Goal), Cheese Basket (Libero-Object), Bowl Plate (Libero-Spatial), Book Caddy (Libero-10).

CALVIN Simulated Environments

ABC -> D

[Videos] miniVLA vs. InSpire on the CALVIN ABC -> D split.

Model Checkpoints

Model      | Dataset                      | Checkpoint
---------- | ---------------------------- | ----------
MiniVLA    | Libero90                     | Download
InspireVLA | Libero90                     | Download
InspireVLA | Libero10+Goal+Object+Spatial | Download

Installation

  1. Clone the repository.
git clone https://github.com/Koorye/Inspire.git
  2. Install the dependencies.
conda create -n inspire python=3.10
conda activate inspire

cd LIBERO
pip install -r requirements.txt
pip install -e .
cd ..

cd vq_bet_official
pip install -r requirements.txt
pip install -e .
cd ..

pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements-min.txt

# (Optional) for Flash Attention
pip install packaging ninja
ninja --version; echo $?  # Verify Ninja --> should return exit code "0"
pip install "flash-attn==2.5.5" --no-build-isolation

Evaluation with Pretrained Checkpoints

  1. Create a .hf_token file in the root directory and add your Hugging Face token.
echo "your_huggingface_token" > .hf_token
  2. Download the pretrained checkpoints (see the Python sketch after these steps).
bash scripts/download_pretrained_weights.sh
  3. Run the evaluation scripts.
bash vla_scripts/eval/eval_baseline_libero90.sh
bash vla_scripts/eval/eval_inspire_libero90.sh
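
For reference, steps 1 and 2 correspond roughly to the Python sketch below, which uses huggingface_hub directly; the repository id shown is hypothetical, so prefer the official download script above for the actual checkpoints.

# Rough Python equivalent of the token/download steps (sketch only; the
# repo_id below is hypothetical, use scripts/download_pretrained_weights.sh
# for the real checkpoints).
from pathlib import Path
from huggingface_hub import login, snapshot_download

token = Path(".hf_token").read_text().strip()
login(token=token)

local_dir = snapshot_download(repo_id="Koorye/InspireVLA-Libero90")  # hypothetical id
print("Checkpoint downloaded to:", local_dir)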

Training Your Own Checkpoints

  1. Prepare the dataset.

See Dataset Preparation.

  2. Run the training scripts.
bash vla_scripts/train/train_baseline_libero90.sh
bash vla_scripts/train/train_inspire_libero90.sh
  3. Run the evaluation scripts.
bash vla_scripts/eval/eval_baseline_libero90.sh
bash vla_scripts/eval/eval_inspire_libero90.sh

Acknowledgements

Our work is built upon the following open-source projects: CALVIN, LIBERO, miniVLA, Pi-0. We thank the authors for releasing their code. If you use our model and code, please consider citing these works as well.
