MAC-AutoML/SpecEyes

SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

arXiv Hugging Face License: Apache-2.0

Highlights · Environment Setup · Quick Start · Repository Structure · Acknowledgements · License · Citation

SpecEyes is a speculative perception and planning framework for agentic multimodal LLMs. It uses a lightweight vision-language model to quickly screen visual inputs and questions, then applies answer separability gating to either return the fast answer or defer to a stronger tool-using model. This repository provides evaluation code, judge scripts, confidence analysis, and result aggregation tools for SpecEyes.

Highlights ✨

Direction Description
Stateful Bottleneck Analysis Reveal the sequential tool-use dependency limiting latency and concurrency in agentic MLLMs.
Agentic-Level Speculation Propose speculative reasoning that skips full tool invocation loops for easy queries.
Answer Separability Gating Introduce a new confidence metric based on top-K logit gaps to decide safe bypass.
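As a concrete illustration of the gating idea, one plausible form of an answer-separability score is the gap between the top-2 softmax probabilities of the small model's answer logits, compared against `--score_threshold`. The exact metric and normalization used by SpecEyes may differ; the sketch below is an assumption:

```python
import math

def separability_score(logits: list[float]) -> float:
    """Gap between the top-1 and top-2 softmax probabilities.
    One plausible top-K logit-gap metric; the paper's exact formula may differ."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # shift for numerical stability
    z = sum(exps)
    probs = sorted((e / z for e in exps), reverse=True)
    return probs[0] - probs[1]

def should_bypass(logits: list[float], score_threshold: float = 0.98) -> bool:
    # Accept the fast model's answer only when it is clearly separated;
    # otherwise defer to the stronger tool-using model.
    return separability_score(logits) >= score_threshold
```

A probability-scale gap like this is consistent with thresholds such as 0.98 used in the commands below, but treat the formula as illustrative.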

1. Environment Setup 🛠️

We recommend Python 3.11. Install the PyTorch build matching your CUDA version first, then install the project requirements:

pip install -r requirements.txt

Recommended optional packages:

  • flash-attn: useful for higher throughput on supported GPUs
  • vllm==0.12.0: recommended in a separate environment for the judge model service

This repository also relies on a patched image-loading behavior in qwen-vl-utils. After installing qwen-vl-utils, run:

python scripts/patch_qwen_vl_utils.py

2. Quick Start 🚀

2.1 Prepare Datasets and Models

Download the datasets and models into the directories expected by the evaluation scripts (see Repository Structure), or pass explicit paths at runtime.

2.2 Run the Main Evaluation

# DeepEyes baseline
python eval_code_deepeyes/SpecEyes.py --baseline

# DeepEyes with confidence gating
python eval_code_deepeyes/SpecEyes.py --score_threshold 0.98

# Thyme baseline
python eval_code_thyme/SpecEyes.py --baseline

# Thyme with confidence gating
python eval_code_thyme/SpecEyes.py --score_threshold 0.98

For the code-reasoning variant, replace SpecEyes.py with SpecReason.py.

2.3 Start the Judge Model

bash scripts/start_qwen2.5_72b_vllm.sh

The default judge endpoint is http://localhost:23333/v1. Override it with --api_url if needed.
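The judge scripts talk to a vLLM OpenAI-compatible chat-completions endpoint. A minimal sketch of assembling such a request follows; the model name, prompt wording, and field layout are illustrative assumptions, not taken from the repository:

```python
def build_judge_request(question: str, reference: str, prediction: str,
                        model: str = "Qwen2.5-72B-Instruct") -> dict:
    """Assemble a chat-completions payload for an OpenAI-compatible endpoint.
    Model name and judge prompt here are illustrative, not the repo's exact ones."""
    prompt = (
        "Judge whether the prediction answers the question correctly.\n"
        f"Question: {question}\n"
        f"Reference: {reference}\n"
        f"Prediction: {prediction}\n"
        "Reply with 'correct' or 'incorrect'."
    )
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,  # deterministic judging
    }
```

A payload like this would be POSTed to http://localhost:23333/v1/chat/completions (or the URL passed via --api_url).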

2.4 Run the Judge Scripts

bash scripts/run_judges.sh

You can also run them manually:

python judge_code/judge_vstar.py --input_folder eval_results_qwen3vl-2b-Instruct
python judge_code/judge_hr.py --input_folder eval_results_qwen3vl-2b-Instruct
python judge_code/judge_pope.py --input_folder eval_results_qwen3vl-2b-Instruct

2.5 Analyze Small-Model Confidence

# Run batched small-model inference
python scripts/small_model_batch_inference.py

# Judge the generated outputs
python judge_code/judge_vstar.py --input_folder eval_results_qwen3vl-2b-Instruct
python judge_code/judge_hr.py --input_folder eval_results_qwen3vl-2b-Instruct

# Analyze judge results
python scripts/analyze_small_confidence.py --input_folder judge_results_qwen3vl-2b-Instruct
python scripts/analyze_small_conf_percentage.py --input_folder judge_results_qwen3vl-2b-Instruct
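One way such an analysis can work is to bucket judged outputs by confidence score and report per-bucket accuracy, which shows whether high-confidence answers are safe to bypass. The record field names (`score`, `correct`) and bucket edges below are assumptions for illustration:

```python
from collections import defaultdict

def accuracy_by_confidence(records: list[dict],
                           edges: tuple = (0.5, 0.9, 0.98)) -> dict:
    """Bucket judged records by confidence and compute per-bucket accuracy.
    Field names 'score' and 'correct' are illustrative, not the repo's schema."""
    buckets = defaultdict(lambda: [0, 0])  # bucket index -> [n_correct, n_total]
    for r in records:
        b = sum(r["score"] >= e for e in edges)  # 0 .. len(edges)
        buckets[b][0] += int(r["correct"])
        buckets[b][1] += 1
    return {b: c / t for b, (c, t) in sorted(buckets.items())}
```

If accuracy in the top bucket is near the full pipeline's, answers above that threshold can be returned without invoking the tool-using model.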

3. Repository Structure 🗂️

SpecEyes/
├── data/
│   ├── vstar/
│   ├── HR-Bench/
│   └── POPE/
├── eval_code_deepeyes/
├── eval_code_thyme/
├── judge_code/
├── scripts/
├── vis/
├── eval_results_deepeyes/
├── eval_results_thyme/
└── ...

Core directories:

Path Description
eval_code_deepeyes/ SpecEyes and SpecReason evaluation code built on DeepEyes
eval_code_thyme/ SpecEyes and SpecReason evaluation code built on Thyme
judge_code/ Judge scripts using a vLLM OpenAI-compatible endpoint
scripts/small_model_batch_inference.py Batched small-model inference and confidence signal export
scripts/gather_result.py Aggregation of speedup and accuracy results
scripts/analyze_small_confidence.py Confidence-distribution and performance analysis
vis/ Plotting and visualization utilities used in the paper
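An aggregation step like scripts/gather_result.py might combine per-benchmark counts and timings into overall accuracy and speedup. The sketch below assumes hypothetical per-run fields (`correct`, `total`, `baseline_s`, `gated_s`); the actual script's schema may differ:

```python
def gather_results(runs: list[dict]) -> dict:
    """Aggregate accuracy and wall-clock speedup across benchmark runs.
    Field names are illustrative assumptions, not the repo's actual schema."""
    correct = sum(r["correct"] for r in runs)
    total = sum(r["total"] for r in runs)
    baseline = sum(r["baseline_s"] for r in runs)  # full agentic-loop latency
    gated = sum(r["gated_s"] for r in runs)        # latency with confidence gating
    return {
        "accuracy": correct / total,
        "speedup": baseline / gated,
    }
```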

Additional notes:

  • eval_code_thyme/sandbox.py is a localized sandbox copy used by the Thyme evaluation pipeline
  • Temporary processed images are written to eval_code_thyme/temp_processed_images/
  • Result folders and cache directories are intentionally excluded through .gitignore

4. Acknowledgements 🙏

This repository adapts code from the DeepEyes repository. We sincerely thank the authors and maintainers for their open-source contributions, which informed parts of our implementation and experimentation workflow.

5. License ⚖️

This repository is released under Apache-2.0. See LICENSE for the full license text.

The repository also includes notes about third-party code and patches, including:

  • the upstream source attribution for eval_code_thyme/sandbox.py
  • the patching behavior for qwen-vl-utils

See THIRD_PARTY_NOTICES.md for the relevant attribution and redistribution notes. If you redistribute or modify those third-party-related components, you should also follow the corresponding upstream license requirements.

6. Citation 📚

If you use this repository, please cite the corresponding paper:

@article{huang2026,
  title={SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning},
  author={Huang, Haoyu and Huang, Jinfa and Wan, Zhongwei and Zheng, Xiawu and Ji, Rongrong and Luo, Jiebo},
  journal={arXiv preprint arXiv:2603.23483},
  year={2026}
}

About

This is the official implementation of our paper "SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning"
