
research(performance): speculative tool calls — overlap LLM decoding with execution, 100s tok/s latency gain (arXiv:2512.15834) #2290

@bug-ops

Description


Finding

Paper: "Optimizing Agentic Language Model Inference via Speculative Tool Calls"
arXiv: https://arxiv.org/abs/2512.15834 (December 2025)

Core Idea

Reduces per-turn tool-use latency by speculatively issuing tool calls before the LLM finishes decoding the current turn:

  • Predict the most likely tool call from the partially decoded output
  • Execute it speculatively, in parallel with the remaining generation
  • If the prediction matches the final decoded call, reuse the speculative result and skip re-execution (an effective gain of hundreds of tokens/sec)
  • On a mismatch, discard the speculative result and fall back to normal sequential execution
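The mechanism above can be sketched in a few lines. This is a minimal illustration, not Zeph's actual API: `ToolCall`, `execute`, and `speculative_call` are hypothetical names, and a real integration would use the async runtime rather than raw threads.

```rust
use std::thread;

// Hypothetical tool-call representation; Zeph's real types will differ.
#[derive(Clone, PartialEq, Debug)]
struct ToolCall {
    name: String,
    args: String,
}

// Stand-in for a real tool executor (bash, web scrape, ...).
fn execute(call: &ToolCall) -> String {
    format!("result of {}({})", call.name, call.args)
}

/// Execute `predicted` speculatively while `finish_decoding` runs.
/// On a hit, reuse the speculative result and skip re-execution;
/// on a miss, discard it and execute the real call sequentially.
fn speculative_call(
    predicted: ToolCall,
    finish_decoding: impl FnOnce() -> ToolCall,
) -> String {
    let spec = {
        let p = predicted.clone();
        thread::spawn(move || execute(&p))
    };
    let actual = finish_decoding(); // overlaps with speculative execution
    if actual == predicted {
        spec.join().expect("speculative task panicked") // hit
    } else {
        let _ = spec.join(); // miss: discard, fall back to sequential
        execute(&actual)
    }
}

fn main() {
    let predicted = ToolCall { name: "search_code".into(), args: "query".into() };
    let decoded = predicted.clone();
    println!("{}", speculative_call(predicted, move || decoded));
}
```

The only correctness requirement is that speculatively executed tools are safe to run on a misprediction (idempotent or side-effect-free), which is why gating matters.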

Applicability to Zeph (4/5)

Zeph's tool execution is fully sequential (LLM finishes → parse tool call → execute → return). The streaming architecture (Telegram, TUI) already has partial response delivery, making speculative dispatch technically feasible.

Applicable to:

  • High-latency tools (bash, web scrape, search_code) where execution time dominates
  • Multi-step tool chains where the next call is highly predictable from context
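A plausible gating policy follows from these two bullets: speculate only when the tool is slow enough to amortize a wasted execution and the prediction is confident. The tool list and threshold below are illustrative assumptions, not measured values:

```rust
// Hypothetical gate: tools whose execution time dominates the turn
// (worth speculating on even with occasional mispredictions).
const SLOW_TOOLS: &[&str] = &["bash", "web_scrape", "search_code"];

/// Decide whether to dispatch a speculative call. `confidence` is the
/// predictor's estimate that the partial decode determines the call.
fn should_speculate(tool: &str, confidence: f64) -> bool {
    SLOW_TOOLS.contains(&tool) && confidence >= 0.9
}

fn main() {
    assert!(should_speculate("bash", 0.95));
    assert!(!should_speculate("bash", 0.5)); // low confidence: stay sequential
    assert!(!should_speculate("echo", 0.99)); // cheap tool: not worth it
}
```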

Notes

Requires changes to ToolExecutor and the agent loop to support speculative dispatch. This is lower priority than correctness/reliability work, but high impact for interactive TUI/Telegram sessions, where per-turn latency is user-visible.
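One concrete piece the agent loop would need is a predictor over the partial stream. A sketch, assuming the model emits tool calls as JSON with a `"tool"` field (the field name and format are assumptions; real parsing belongs in Zeph's tool-call parser):

```rust
/// Guess the tool name from a partially decoded turn as soon as a
/// complete `"tool": "<name>"` field appears, before the arguments
/// finish generating. Returns None until enough has been decoded.
fn predict_tool(partial: &str) -> Option<&str> {
    let idx = partial.find("\"tool\"")?;
    let rest = partial[idx + "\"tool\"".len()..]
        .trim_start()
        .strip_prefix(':')?
        .trim_start()
        .strip_prefix('"')?;
    let end = rest.find('"')?;
    Some(&rest[..end])
}

fn main() {
    // Field complete: prediction available mid-stream.
    assert_eq!(predict_tool(r#"{"tool": "bash", "ar"#), Some("bash"));
    // Not enough decoded yet: no prediction.
    assert_eq!(predict_tool(r#"{"to"#), None);
}
```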

Metadata

Labels

  • P3: Research, medium-high complexity
  • llm: zeph-llm crate (Ollama, Claude)
  • research: Research-driven improvement
  • tools: Tool execution and MCP integration
