LiamLian0727/Euclids_Gift

Euclid’s Gift: Enhancing Spatial Perception and Reasoning in Vision‑Language Models via Geometric Surrogate Tasks


📢 News

  • [02/21/2026]⚡: Euclid’s Gift has been accepted to CVPR 2026 Findings.
  • [10/24/2025]: We trained Qwen3VL (4B, 8B, and 30B) using Euclid30K, and the results show that the models also achieve significant gains across various spatial intelligence tasks. The weights of the fine-tuned models are available here.
| Model | SuperClevr | Omni3D Bench | VSIBench* | MindCube |
|---|---|---|---|---|
| Qwen3VL-4B | 55.36 | 27.74 | 35.51 | 26.11 |
| Qwen3VL-Euclid-4B | 61.24 (+5.88) | 31.74 (+4.00) | 42.26 (+6.75) | 32.98 (+6.87) |
| Qwen3VL-8B | 48.30 | 34.01 | 33.25 | 34.16 |
| Qwen3VL-Euclid-8B | 48.96 (+0.66) | 35.03 (+1.02) | 35.54 (+2.29) | 41.02 (+6.86) |
| Qwen3VL-30B | 64.12 | 36.71 | 40.00 | 39.75 |
| Qwen3VL-Euclid-30B | 70.18 (+6.06) | 38.90 (+2.19) | 45.80 (+5.80) | 40.68 (+0.93) |

Qwen3VL and Qwen3VL-Euclid are evaluated using the same prompting template defined in test/eval_qwen.sh to ensure a fair comparison.

Abstract

Spatial intelligence spans abilities such as visualizing and transforming shapes, mental rotation, reasoning about relative positions and containment, and counting/estimation. These remain challenging for modern Multimodal Large Language Models (MLLMs). We propose solving Euclidean geometry problems as a surrogate task and construct Euclid30K, a dataset of roughly 30K 2D and 3D geometry questions. We then fine‑tune Qwen2.5‑VL and RoboBrain2.0 models with Group Relative Policy Optimization (GRPO), enabling the models to internalize and apply Euclidean principles for shape recognition, counting, relation extraction, and multi‑step deductive reasoning. Without task‑specific adaptations, our models achieve significant zero‑shot gains on four spatial‑reasoning benchmarks: Super‑CLEVR, Omni3DBench, VSI‑Bench, and MindCube. For example, on VSI‑Bench, average accuracy improves from 34.5% to 40.5% (+5.5 percentage points); RoboBrain2.0‑Euclid‑7B reaches 49.6%, surpassing the previous SOTA (Spatial‑MLLM).

Architecture

Gain

Quick Start

1) Environment Setup

Training

Evaluation

  • Install lmms‑eval following its official documentation. You can either:
    • Use the lmms-eval/ copy included in this repository; or
    • Copy the four task folders provided under test/lmms_eval/tasks/ into your existing lmms‑eval setup.
  • Download the benchmark datasets Super‑CLEVR, Omni3DBench, VSI‑Bench, and MindCube_lmms_eval; then update the dataset paths in each corresponding YAML under test/lmms_eval/tasks/.
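The second installation option (copying the task folders into an existing lmms‑eval setup) can be scripted. The following is a minimal sketch, not part of this repository: the helper name `install_task_folders` is hypothetical, and the destination layout `<lmms-eval root>/lmms_eval/tasks/` is an assumption about your lmms‑eval checkout that you should confirm before running it.

```python
import shutil
from pathlib import Path


def install_task_folders(src_tasks_dir: str, lmms_eval_root: str) -> list[str]:
    """Copy every task folder under src_tasks_dir into <lmms_eval_root>/lmms_eval/tasks/.

    Hypothetical helper: the destination layout is an assumption about the
    lmms-eval checkout, not something defined by this repository.
    """
    src = Path(src_tasks_dir)
    dst = Path(lmms_eval_root) / "lmms_eval" / "tasks"  # assumed layout
    if not src.is_dir():
        raise FileNotFoundError(f"task source directory not found: {src}")
    dst.mkdir(parents=True, exist_ok=True)

    copied = []
    for folder in sorted(p for p in src.iterdir() if p.is_dir()):
        # dirs_exist_ok=True lets the copy be re-run after updating a task folder.
        shutil.copytree(folder, dst / folder.name, dirs_exist_ok=True)
        copied.append(folder.name)
    return copied


# Example call (paths are placeholders):
# install_task_folders("test/lmms_eval/tasks", "/path/to/lmms-eval")
```

After copying, the dataset paths in each task YAML still need to be edited by hand to point at your local downloads.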

2) Training

Below is an example command for training (e.g., 8 GPUs). For multi‑node multi‑GPU training, see the example script train/dist_train.sh.

python3 -m verl.trainer.main \
    config=examples/config.yaml \
    data.train_files=/mnt/datasets/Euclid30K/Euclid30K_train.parquet \
    data.val_files=/mnt/datasets/Euclid30K/Euclid30K_val.parquet \
    worker.actor.model.model_path=/mnt/models/Qwen2.5-VL-7B-Instruct \
    trainer.experiment_name=EXPERIMENT_NAME \
    worker.actor.micro_batch_size_per_device_for_update=1 \
    worker.actor.micro_batch_size_per_device_for_experience=8 \
    worker.actor.clip_ratio_low=0.2 \
    worker.actor.clip_ratio_high=0.28 \
    worker.reward.reward_function=/mnt/code/Euclids_Gift/train/euclid.py:compute_score \
    trainer.total_epochs=10 \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.save_checkpoint_path=/mnt/models/Qwen2.5-VL-7B-Euclid
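To give a feel for what `clip_ratio_low` and `clip_ratio_high` control, here is a generic GRPO-style sketch in plain Python. It is not the trainer's implementation. It assumes that the two values bound the policy probability ratio to `[1 - clip_ratio_low, 1 + clip_ratio_high]`, and that advantages are rewards normalized within a group of responses to the same prompt. The function names and the use of the population standard deviation are illustrative choices.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses (GRPO-style).

    Assumption: population standard deviation, chosen here for simplicity.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(ratio, advantage, clip_low=0.2, clip_high=0.28):
    """Per-token clipped objective with asymmetric bounds.

    Assumption: the ratio is clipped to [1 - clip_low, 1 + clip_high],
    and the pessimistic (minimum) of the two terms is kept.
    """
    clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    return min(ratio * advantage, clipped * advantage)


# Four responses to one prompt, two of them rewarded as correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
objective = clipped_surrogate(ratio=1.5, advantage=advantages[0])
```

With a higher upper bound than lower bound, a response with positive advantage may raise its probability ratio slightly further before clipping removes the gradient.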

3) Evaluation

Use test/eval_qwen.sh, test/eval_robo.sh, and test/eval_euclid.sh to evaluate the Qwen2.5‑VL series, the RoboBrain 2.0 series, and Euclid models trained on Euclid30K, respectively.

Before running these scripts, set model_path in each script to the path of the model you want to evaluate.

As noted in VSI-Bench, spatial reasoning ability is the primary bottleneck limiting MLLM performance on that benchmark. To show how models perceive scenes and reason spatially, and to verify whether they genuinely acquire spatial intelligence from geometric knowledge, we deviate from the original VSI-Bench setup. That setup uses prompts such as "Answer with the option's letter from the given choices directly" or "Please answer the question using a single word or phrase" and limits responses to 16 tokens.

Instead, we follow the prompt configuration described in RoboBrain2.0 Sec. B, which encourages the model to reason about the problem before answering, and we set the maximum response length to 1024 tokens. This setup lets us observe the model's intermediate reasoning and assess whether it has internalized transferable spatial priors from Euclid30K training.
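With reasoning-first prompts, the final choice has to be recovered from a long response before scoring. The sketch below illustrates one way to do that for multiple-choice items. It assumes the response ends with a phrase such as "Answer: B" or "the answer is (C)", which is a hypothetical convention and not the format defined by RoboBrain2.0 Sec. B or by the evaluation scripts in this repository. The function name `extract_final_choice` is likewise illustrative.

```python
import re
from typing import Optional

# Matches "Answer: B", "answer is (C)", etc. The lookahead rejects words
# that merely start with A-D (e.g. "Answer: Both").
_CHOICE_PATTERN = re.compile(
    r"[Aa]nswer\s*(?:is|:)?\s*\(?([A-D])\)?(?![A-Za-z])"
)


def extract_final_choice(response: str) -> Optional[str]:
    """Return the last option letter stated in a reasoning-style response.

    Hypothetical parser: it assumes the model states its choice in an
    "Answer: X" style phrase and returns None when no such phrase exists.
    """
    matches = _CHOICE_PATTERN.findall(response)
    return matches[-1] if matches else None


response = (
    "The sofa is closer to the door than the table. "
    "At first the answer is (A), but the door is behind the camera. "
    "Answer: B"
)
choice = extract_final_choice(response)
```

Taking the last match rather than the first matters here, because a model that reasons step by step may mention a tentative option before settling on its final one.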

Citation

If you find this project or the dataset helpful, please cite:

@misc{Euclids_Gift,
    title={Euclid’s Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks},
    author={Shijie Lian and Changti Wu and Laurence Tianruo Yang and Hang Yuan and Bin Yu and Lei Zhang and Kai Chen},
    year={2025},
    eprint={2509.24473},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2509.24473}
}

Acknowledgements

We thank the VeRL / EasyR1 training framework, as well as the benchmark suites Super‑CLEVR, Omni3DBench, VSI‑Bench, and MindCube.


About

This repo is the official implementation of "Euclid’s Gift: Enhancing Spatial Perception and Reasoning in Vision‑Language Models via Geometric Surrogate Tasks"
