Highlights · Environment Setup · Quick Start · Repository Structure · Acknowledgements · License · Citation
SpecEyes is a speculative perception and planning framework for agentic multimodal LLMs. It uses a lightweight vision-language model to quickly screen visual inputs and questions, then applies answer separability gating to either return the fast answer or defer to a stronger tool-using model. This repository provides evaluation code, judge scripts, confidence analysis, and result aggregation tools for SpecEyes.
| Direction | Description |
|---|---|
| Stateful Bottleneck Analysis | Reveal the sequential tool-use dependency limiting latency and concurrency in agentic MLLMs. |
| Agentic-Level Speculation | Propose speculative reasoning that skips full tool invocation loops for easy queries. |
| Answer Separability Gating | Introduce a new confidence metric based on top-K logit gaps to decide safe bypass. |
- Highlights ✨
- Table of Contents
- 1. Environment Setup 🛠️
- 2. Quick Start 🚀
- 3. Repository Structure 🗂️
- 4. Acknowledgements 🙏
- 5. License ⚖️
- 6. Citation 📚
We recommend Python 3.11. Install the PyTorch build matching your CUDA version first, then install the project requirements:
pip install -r requirements.txtRecommended optional packages:
flash-attn: useful for higher throughput on supported GPUsvllm==0.12.0: recommended in a separate environment for the judge model service
This repository also relies on a patched image-loading behavior in qwen-vl-utils. After installing qwen-vl-utils, run:
python scripts/patch_qwen_vl_utils.pyDownload the datasets and models into the following directories, or pass explicit paths at runtime:
- V*:
data/vstar - HR-Bench:
data/HR-Bench - POPE:
data/POPE - Deepeyes:
ChenShawn/DeepEyes-7B - Thyme:
Kwai-Keye/Thyme-RL - Qwen3-VL-2B:
Qwen/Qwen3-VL-2B-Instruct - Qwen2.5-72B:
Qwen/Qwen2.5-72B-Instruct
# Deepeyes baseline
python eval_code_deepeyes/SpecEyes.py --baseline
# Deepeyes with confidence gating
python eval_code_deepeyes/SpecEyes.py --score_threshold 0.98
# Thyme baseline
python eval_code_thyme/SpecEyes.py --baseline
# Thyme with confidence gating
python eval_code_thyme/SpecEyes.py --score_threshold 0.98For the code-reasoning variant, replace SpecEyes.py with SpecReason.py.
bash scripts/start_qwen2.5_72b_vllm.shThe default judge endpoint is http://localhost:23333/v1. Override it with --api_url if needed.
bash scripts/run_judges.shYou can also run them manually:
python judge_code/judge_vstar.py --input_folder eval_results_qwen3vl-2b-Instruct
python judge_code/judge_hr.py --input_folder eval_results_qwen3vl-2b-Instruct
python judge_code/judge_pope.py --input_folder eval_results_qwen3vl-2b-Instruct# Run batched small-model inference
python scripts/small_model_batch_inference.py
# Judge the generated outputs
python judge_code/judge_vstar.py --input_folder eval_results_qwen3vl-2b-Instruct
python judge_code/judge_hr.py --input_folder eval_results_qwen3vl-2b-Instruct
# Analyze judge results
python scripts/analyze_small_confidence.py --input_folder judge_results_qwen3vl-2b-Instruct
python scripts/analyze_small_conf_percentage.py --input_folder judge_results_qwen3vl-2b-InstructSpecEyes/
├── data/
│ ├── vstar/
│ ├── HR-Bench/
│ └── POPE/
├── eval_code_deepeyes/
├── eval_code_thyme/
├── judge_code/
├── scripts/
├── vis/
├── eval_results_deepeyes/
├── eval_results_thyme/
└── ...
Core directories:
| Path | Description |
|---|---|
eval_code_deepeyes/ |
SpecEyes and SpecReason evaluation code built on Deepeyes |
eval_code_thyme/ |
SpecEyes and SpecReason evaluation code built on Thyme |
judge_code/ |
Judge scripts using a vLLM OpenAI-compatible endpoint |
scripts/small_model_batch_inference.py |
Batched small-model inference and confidence signal export |
scripts/gather_result.py |
Aggregation of speedup, and accuracy results |
scripts/analyze_small_confidence.py |
Confidence-distribution and performance analysis |
vis/ |
Plotting and visualization utilities used in the paper |
Additional notes:
eval_code_thyme/sandbox.pyis a localized sandbox copy used by the Thyme evaluation pipeline- Temporary processed images are written to
eval_code_thyme/temp_processed_images/ - Result folders and cache directories are intentionally excluded through
.gitignore
This repository benefits from code references from the DeepEyes repository. We sincerely thank the authors and maintainers for their open-source contributions, which helped inform parts of our implementation and experimentation workflow.
This repository is released under Apache-2.0. See LICENSE for the full license text.
The repository also includes notes about third-party code and patches, including:
- the upstream source attribution for
eval_code_thyme/sandbox.py - the patching behavior for
qwen-vl-utils
See THIRD_PARTY_NOTICES.md for the relevant attribution and redistribution notes. If you redistribute or modify those third-party-related components, you should also follow the corresponding upstream license requirements.
If you use this repository, please cite the corresponding paper:
@article{huang2026,
title={SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning},
author={Huang, Haoyu and Huang, Jinfa and Wan, Zhongwei and Zheng, Xiawu and Ji, Rongrong and Luo, Jiebo},
journal={arXiv preprint arXiv:2603.23483},
year={2026}
}