
research(performance): speculative tool calls — overlap LLM decoding with execution, 100s tok/s latency gain (arXiv:2512.15834) #2290

@bug-ops

Description


Finding

Paper: "Optimizing Agentic Language Model Inference via Speculative Tool Calls"
arXiv: https://arxiv.org/abs/2512.15834 (December 2025)

Core Idea

Reduces per-turn tool-use latency by speculatively issuing tool calls before the LLM finishes decoding the current turn:

  • Predict the most likely tool call from the partially decoded output
  • Execute it speculatively, in parallel with the remaining generation
  • If the prediction matches the final decoded call, reuse the speculative result and skip re-execution (an effective gain of hundreds of tokens/sec)
  • On a mismatch, discard the speculative result and fall back to normal sequential execution
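The mechanism above can be sketched in a few lines. This is a minimal illustration, not Zeph's actual API: `ToolCall`, `execute`, and `speculative_call` are hypothetical names, and a real integration would use the async runtime rather than raw threads.

```rust
use std::thread;

// Hypothetical tool-call representation; Zeph's real types will differ.
#[derive(Clone, PartialEq, Debug)]
struct ToolCall {
    name: String,
    args: String,
}

// Stand-in for a real tool executor (bash, web scrape, ...).
fn execute(call: &ToolCall) -> String {
    format!("result of {}({})", call.name, call.args)
}

/// Execute `predicted` speculatively while `finish_decoding` runs.
/// On a hit, reuse the speculative result and skip re-execution;
/// on a miss, discard it and execute the real call sequentially.
fn speculative_call(
    predicted: ToolCall,
    finish_decoding: impl FnOnce() -> ToolCall,
) -> String {
    let spec = {
        let p = predicted.clone();
        thread::spawn(move || execute(&p))
    };
    let actual = finish_decoding(); // overlaps with speculative execution
    if actual == predicted {
        spec.join().expect("speculative task panicked") // hit
    } else {
        let _ = spec.join(); // miss: discard, fall back to sequential
        execute(&actual)
    }
}

fn main() {
    let predicted = ToolCall { name: "search_code".into(), args: "query".into() };
    let decoded = predicted.clone();
    println!("{}", speculative_call(predicted, move || decoded));
}
```

The only correctness requirement is that speculatively executed tools are safe to run on a misprediction (idempotent or side-effect-free), which is why gating matters.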

Applicability to Zeph (4/5)

Zeph's tool execution is fully sequential (LLM finishes → parse tool call → execute → return). The streaming architecture (Telegram, TUI) already has partial response delivery, making speculative dispatch technically feasible.

Applicable to:

  • High-latency tools (bash, web scrape, search_code) where execution time dominates
  • Multi-step tool chains where the next call is highly predictable from context
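A plausible gating policy follows from these two bullets: speculate only when the tool is slow enough to amortize a wasted execution and the prediction is confident. The tool list and threshold below are illustrative assumptions, not measured values:

```rust
// Hypothetical gate: tools whose execution time dominates the turn
// (worth speculating on even with occasional mispredictions).
const SLOW_TOOLS: &[&str] = &["bash", "web_scrape", "search_code"];

/// Decide whether to dispatch a speculative call. `confidence` is the
/// predictor's estimate that the partial decode determines the call.
fn should_speculate(tool: &str, confidence: f64) -> bool {
    SLOW_TOOLS.contains(&tool) && confidence >= 0.9
}

fn main() {
    assert!(should_speculate("bash", 0.95));
    assert!(!should_speculate("bash", 0.5)); // low confidence: stay sequential
    assert!(!should_speculate("echo", 0.99)); // cheap tool: not worth it
}
```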

Notes

Requires changes to ToolExecutor and the agent loop to support speculative dispatch. This is lower priority than correctness/reliability work, but high impact for interactive TUI/Telegram sessions, where per-turn latency is user-visible.
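One concrete piece the agent loop would need is a predictor over the partial stream. A sketch, assuming the model emits tool calls as JSON with a `"tool"` field (the field name and format are assumptions; real parsing belongs in Zeph's tool-call parser):

```rust
/// Guess the tool name from a partially decoded turn as soon as a
/// complete `"tool": "<name>"` field appears, before the arguments
/// finish generating. Returns None until enough has been decoded.
fn predict_tool(partial: &str) -> Option<&str> {
    let idx = partial.find("\"tool\"")?;
    let rest = partial[idx + "\"tool\"".len()..]
        .trim_start()
        .strip_prefix(':')?
        .trim_start()
        .strip_prefix('"')?;
    let end = rest.find('"')?;
    Some(&rest[..end])
}

fn main() {
    // Field complete: prediction available mid-stream.
    assert_eq!(predict_tool(r#"{"tool": "bash", "ar"#), Some("bash"));
    // Not enough decoded yet: no prediction.
    assert_eq!(predict_tool(r#"{"to"#), None);
}
```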

Metadata

Labels

  • P3: Research, medium-high complexity
  • llm: zeph-llm crate (Ollama, Claude)
  • research: Research-driven improvement
  • tools: Tool execution and MCP integration
