SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

Highlights · Environment Setup · Quick Start · Repository Structure · Acknowledgements · License · Citation

SpecEyes is a speculative perception and planning framework for agentic multimodal LLMs. It uses a lightweight vision-language model to quickly screen visual inputs and questions, then applies answer separability gating to either return the fast answer or defer to a stronger tool-using model. This repository provides evaluation code, judge scripts, confidence analysis, and result aggregation tools for SpecEyes.

Highlights ✨

Direction	Description
Stateful Bottleneck Analysis	Reveal the sequential tool-use dependency limiting latency and concurrency in agentic MLLMs.
Agentic-Level Speculation	Propose speculative reasoning that skips full tool invocation loops for easy queries.
Answer Separability Gating	Introduce a new confidence metric based on top-K logit gaps to decide safe bypass.

1. Environment Setup 🛠️

We recommend Python 3.11. Install the PyTorch build matching your CUDA version first, then install the project requirements:

pip install -r requirements.txt

Recommended optional packages:

flash-attn: useful for higher throughput on supported GPUs
vllm==0.12.0: recommended in a separate environment for the judge model service

This repository also relies on a patched image-loading behavior in qwen-vl-utils. After installing qwen-vl-utils, run:

python scripts/patch_qwen_vl_utils.py

2. Quick Start 🚀

2.1 Prepare Datasets and Models

Download the datasets and models into the following directories, or pass explicit paths at runtime:

V*: data/vstar
HR-Bench: data/HR-Bench
POPE: data/POPE
Deepeyes: ChenShawn/DeepEyes-7B
Thyme: Kwai-Keye/Thyme-RL
Qwen3-VL-2B: Qwen/Qwen3-VL-2B-Instruct
Qwen2.5-72B: Qwen/Qwen2.5-72B-Instruct

2.2 Run the Main Evaluation

# Deepeyes baseline
python eval_code_deepeyes/SpecEyes.py --baseline

# Deepeyes with confidence gating
python eval_code_deepeyes/SpecEyes.py --score_threshold 0.98

# Thyme baseline
python eval_code_thyme/SpecEyes.py --baseline

# Thyme with confidence gating
python eval_code_thyme/SpecEyes.py --score_threshold 0.98

For the code-reasoning variant, replace SpecEyes.py with SpecReason.py.

2.3 Start the Judge Model

bash scripts/start_qwen2.5_72b_vllm.sh

The default judge endpoint is http://localhost:23333/v1. Override it with --api_url if needed.

2.4 Run the Judge Scripts

bash scripts/run_judges.sh

You can also run them manually:

python judge_code/judge_vstar.py --input_folder eval_results_qwen3vl-2b-Instruct
python judge_code/judge_hr.py --input_folder eval_results_qwen3vl-2b-Instruct
python judge_code/judge_pope.py --input_folder eval_results_qwen3vl-2b-Instruct

3.5 Analyze Small-Model Confidence

# Run batched small-model inference
python scripts/small_model_batch_inference.py

# Judge the generated outputs
python judge_code/judge_vstar.py --input_folder eval_results_qwen3vl-2b-Instruct
python judge_code/judge_hr.py --input_folder eval_results_qwen3vl-2b-Instruct

# Analyze judge results
python scripts/analyze_small_confidence.py --input_folder judge_results_qwen3vl-2b-Instruct
python scripts/analyze_small_conf_percentage.py --input_folder judge_results_qwen3vl-2b-Instruct

3. Repository Structure 🗂️

SpecEyes/
├── data/
│   ├── vstar/
│   ├── HR-Bench/
│   └── POPE/
├── eval_code_deepeyes/
├── eval_code_thyme/
├── judge_code/
├── scripts/
├── vis/
├── eval_results_deepeyes/
├── eval_results_thyme/
└── ...

Core directories:

Path	Description
`eval_code_deepeyes/`	`SpecEyes` and `SpecReason` evaluation code built on Deepeyes
`eval_code_thyme/`	`SpecEyes` and `SpecReason` evaluation code built on Thyme
`judge_code/`	Judge scripts using a vLLM OpenAI-compatible endpoint
`scripts/small_model_batch_inference.py`	Batched small-model inference and confidence signal export
`scripts/gather_result.py`	Aggregation of speedup, and accuracy results
`scripts/analyze_small_confidence.py`	Confidence-distribution and performance analysis
`vis/`	Plotting and visualization utilities used in the paper

Additional notes:

eval_code_thyme/sandbox.py is a localized sandbox copy used by the Thyme evaluation pipeline
Temporary processed images are written to eval_code_thyme/temp_processed_images/
Result folders and cache directories are intentionally excluded through .gitignore

4. Acknowledgements 🙏

This repository benefits from code references from the DeepEyes repository. We sincerely thank the authors and maintainers for their open-source contributions, which helped inform parts of our implementation and experimentation workflow.

5. License ⚖️

This repository is released under Apache-2.0. See LICENSE for the full license text.

The repository also includes notes about third-party code and patches, including:

the upstream source attribution for eval_code_thyme/sandbox.py
the patching behavior for qwen-vl-utils

See THIRD_PARTY_NOTICES.md for the relevant attribution and redistribution notes. If you redistribute or modify those third-party-related components, you should also follow the corresponding upstream license requirements.

6. Citation 📚

If you use this repository, please cite the corresponding paper:

@article{huang2026,
  title={SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning},
  author={Huang, Haoyu and Huang, Jinfa and Wan, Zhongwei and Zheng, Xiawu and Ji, Rongrong and Luo, Jiebo},
  journal={arXiv preprint arXiv:2603.23483},
  year={2026}
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

Highlights ✨

Table of Contents

1. Environment Setup 🛠️

2. Quick Start 🚀

2.1 Prepare Datasets and Models

2.2 Run the Main Evaluation

2.3 Start the Judge Model

2.4 Run the Judge Scripts

3.5 Analyze Small-Model Confidence

3. Repository Structure 🗂️

4. Acknowledgements 🙏

5. License ⚖️

6. Citation 📚

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
eval_code_deepeyes		eval_code_deepeyes
eval_code_thyme		eval_code_thyme
figures		figures
judge_code		judge_code
scripts		scripts
vis		vis
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
THIRD_PARTY_NOTICES.md		THIRD_PARTY_NOTICES.md
requirements-vllm.txt		requirements-vllm.txt
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

Highlights ✨

Table of Contents

1. Environment Setup 🛠️

2. Quick Start 🚀

2.1 Prepare Datasets and Models

2.2 Run the Main Evaluation

2.3 Start the Judge Model

2.4 Run the Judge Scripts

3.5 Analyze Small-Model Confidence

3. Repository Structure 🗂️

4. Acknowledgements 🙏

5. License ⚖️

6. Citation 📚

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages