Holistic Evaluation of Multimodal LLMs on Spatial Intelligence
English | 简体中文
- EASI is a unified evaluation suite for Spatial Intelligence in multimodal LLMs.
- EASI supports two evaluation backends: VLMEvalKit and lmms-eval.
- After installation, you can quickly try a SenseNova-SI model with:
Using EASI (backend=VLMEvalKit):
```bash
cd VLMEvalKit/
python run.py --data MindCubeBench_tiny_raw_qa \
    --model SenseNova-SI-1.3-InternVL3-8B \
    --verbose --reuse --judge extract_matching
```
Using EASI (backend=lmms-eval):
```bash
lmms-eval --model qwen2_5_vl \
    --model_args pretrained=sensenova/SenseNova-SI-1.1-Qwen2.5-VL-3B \
    --tasks site_bench_image \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```
EASI is a unified evaluation suite for Spatial Intelligence. It benchmarks state-of-the-art proprietary and open-source multimodal LLMs across a growing set of spatial benchmarks.
- Comprehensive Support: EASI (v0.2.0) currently supports 23 Spatial Intelligence models and 25 spatial benchmarks.
- Dual Backends:
- VLMEvalKit: Rich model zoo with built-in judging capabilities.
- lmms-eval: Lightweight, accelerate-based distributed evaluation.
Full details are available at Supported Models & Benchmarks. EASI also provides transparent Benchmark Verification against official scores.
[2026-01-16] EASI v0.2.0 is released. Major updates include:
- New Backend Support: Integrated lmms-eval alongside VLMEvalKit, offering flexible evaluation options.
- Expanded benchmark support: Added DSR-Bench.
For the full release history and detailed changelog, please see Changelog.
EASI provides two evaluation backends. You can install one or both depending on your needs.
VLMEvalKit backend:
```bash
git clone --recursive https://github.com/EvolvingLMMs-Lab/EASI.git
cd EASI
pip install -e ./VLMEvalKit
```
lmms-eval backend:
```bash
git clone --recursive https://github.com/EvolvingLMMs-Lab/EASI.git
cd EASI
pip install -e ./lmms-eval spacy
```
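To confirm the editable installs succeeded, a quick import check is enough (this assumes you installed both backends; drop whichever package you skipped):
```bash
# Sanity check: both backend packages should import without errors
python -c "import vlmeval, lmms_eval; print('EASI backends installed')"
```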
```bash
# Recommended Dependencies
# Use "torch==2.7.1", "torchvision==0.22.1" in pyproject.toml (this works with most models)
# Install flash-attn for faster inference
pip install flash-attn --no-build-isolation
```
Alternatively, build and run the provided Docker image:
```bash
# Build the runtime image, then start a container with GPU access and your data mounted
bash dockerfiles/EASI/build_runtime_docker.sh
docker run --gpus all -it --rm \
    -v /path/to/your/data:/mnt/data \
    --name easi-runtime \
    VLMEvalKit_EASI:latest \
    /bin/bash
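```
If your benchmark data lives on the host, you may also want to tell the container where to find it. The environment variable below follows upstream VLMEvalKit's convention for its data root; treat it as an assumption about this particular image:
```bash
# Assumption: VLMEvalKit resolves its data root via the LMUData environment
# variable (default: ~/LMUData); point it at the mounted directory
docker run --gpus all -it --rm \
    -v /path/to/your/data:/mnt/data \
    -e LMUData=/mnt/data \
    --name easi-runtime \
    VLMEvalKit_EASI:latest \
    /bin/bash
```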
EASI supports two evaluation backends. Choose the one that best fits your needs.
General command
```bash
python run.py --data {BENCHMARK_NAME} --model {MODEL_NAME} --judge {JUDGE_MODE} --verbose --reuse
```
Please refer to the Configuration section below for the full list of available models and benchmarks. See run.py for the full list of arguments.
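Since run.py exposes a standard argparse-based CLI in upstream VLMEvalKit, printing the full argument list directly should also work:
```bash
# Print all supported command-line arguments (assumes the usual argparse CLI)
python run.py --help
```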
Example
Evaluate SenseNova-SI-1.3-InternVL3-8B on MindCubeBench_tiny_raw_qa:
```bash
python run.py --data MindCubeBench_tiny_raw_qa \
    --model SenseNova-SI-1.3-InternVL3-8B \
    --verbose --reuse --judge extract_matching
```
This uses regex-based answer extraction. For LLM-based judging (e.g., on SpatialVizBench_CoT), switch to the OpenAI judge:
```bash
export OPENAI_API_KEY=YOUR_KEY
python run.py --data SpatialVizBench_CoT \
    --model {MODEL_NAME} \
    --verbose --reuse --judge gpt-4o-1120
```
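If the judge model is served from a self-hosted, OpenAI-compatible endpoint, upstream VLMEvalKit also reads an API base URL from the environment; the variable name and URL below are assumptions based on that convention, so check the docs for your version:
```bash
# Assumption: the judge client honors OPENAI_API_BASE for OpenAI-compatible servers
export OPENAI_API_BASE=https://your-openai-compatible-endpoint/v1
```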
lmms-eval provides accelerate-based distributed evaluation with support for multi-GPU inference.
General command
```bash
lmms-eval --model {MODEL_TYPE} \
    --model_args pretrained={MODEL_PATH} \
    --tasks {TASK_NAME} \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```
Example: Single GPU
Evaluate SenseNova-SI-1.1-Qwen2.5-VL-3B on site_bench_image:
```bash
lmms-eval --model qwen2_5_vl \
    --model_args pretrained=sensenova/SenseNova-SI-1.1-Qwen2.5-VL-3B \
    --tasks site_bench_image \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```
Example: Multi-GPU with accelerate
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
    --num_processes=4 \
    --num_machines=1 \
    --mixed_precision=no \
    --dynamo_backend=no \
    --main_process_port=12346 \
    -m lmms_eval \
    --model qwen2_5_vl \
    --model_args pretrained=sensenova/SenseNova-SI-1.1-Qwen2.5-VL-3B,attn_implementation=flash_attention_2 \
    --tasks site_bench_image \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```
List available tasks
```bash
lmms-eval --tasks list
```
For more details on lmms-eval usage, refer to the documentation in lmms-eval/docs/, including model guide, task guide, and run examples.
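If you want to inspect the metrics after a run without opening the files by hand, a small shell sketch like the one below can preview the newest results file written under --output_path. The glob and layout are assumptions: lmms-eval's exact output directory structure and file names vary by version.
```bash
# Minimal sketch: pretty-print the newest JSON written under ./logs/
# (assumes lmms-eval writes per-run JSON files somewhere below --output_path)
latest=$(find ./logs -name '*.json' -printf '%T@ %p\n' | sort -n | tail -1 | cut -d' ' -f2-)
python -m json.tool "$latest" | head -n 60
```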
EASI (backend=VLMEvalKit)
- Models: Defined in `vlmeval/config.py`. Verify inference with `vlmutil check {MODEL_NAME}` (see the example after this list).
- Benchmarks: The full list of supported benchmarks is available at VLMEvalKit Supported Benchmarks.
- EASI Specifics: Benchmarks used for the EASI Leaderboard are summarized in Supported Models & Benchmarks.
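For instance, to confirm that the quick-start model from the examples above loads and can run inference:
```bash
# Verify that the model definition in vlmeval/config.py works end to end
vlmutil check SenseNova-SI-1.3-InternVL3-8B
```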
EASI (backend=lmms-eval)
- Models: lmms-eval supports various model types including `qwen2_5_vl`, `llava`, `internvl2`, and more. Use `--model_args` to specify model parameters like `pretrained`, `attn_implementation`, etc.
- Tasks: Tasks are defined in `lmms-eval/lmms_eval/tasks/`. To list all available tasks: `lmms-eval --tasks list`
Example tasks for spatial intelligence evaluation:
| Task Name | Description |
| --- | --- |
| `site_bench_image` | SITE-Bench image evaluation |
| `site_bench_video` | SITE-Bench video evaluation |

For more details on lmms-eval usage, refer to the lmms-eval documentation.
To submit your evaluation results to our EASI Leaderboard:
- Go to the EASI Leaderboard page.
- Click "Submit here!" to open the submission form.
- Follow the instructions to fill in the submission form, and submit your results.
EASI is an open and evolving evaluation suite. We warmly welcome community contributions, including:
- New spatial benchmarks
- New model baselines
- Evaluation tools
If you are interested in contributing, or have questions about integration, please contact us at easi-lmms-lab@outlook.com.
```bibtex
@article{easi2025,
  title={Holistic Evaluation of Multimodal LLMs on Spatial Intelligence},
  author={Cai, Zhongang and Wang, Yubo and Sun, Qingping and Wang, Ruisi and Gu, Chenyang and Yin, Wanqi and Lin, Zhiqian and Yang, Zhitao and Wei, Chen and Shi, Xuanke and Deng, Kewang and Han, Xiaoyang and Chen, Zukai and Li, Jiaqi and Fan, Xiangyu and Deng, Hanming and Lu, Lewei and Li, Bo and Liu, Ziwei and Wang, Quan and Lin, Dahua and Yang, Lei},
  journal={arXiv preprint arXiv:2508.13142},
  year={2025}
}
```