Source
Speculative Tool Calls: Overlapping Tool Execution with Generation
https://arxiv.org/abs/2512.15834 — December 2025
Summary
The paper proposes a "tool cache" that indexes results by normalized (tool_name, canonicalized_args) key to avoid redundant executions across turns. A companion speculative-dispatch mechanism predicts tool calls before the LLM finishes decoding, but the cache is independently useful.
Applicability to Zeph
HIGH. Within a single session it is common for the LLM to call the same deterministic tool multiple times (e.g., re-reading the same file, repeated web scrape of the same URL). Each call wastes latency and tokens in the context.
Proposed implementation
- Scope: in-memory, per-session only (no persistence across sessions)
- Key: {tool_name}:{canonicalized_args_json} (sorted keys, normalized values)
- TTL: configurable (default 5 min), reset on /clear
- Opt-out: non-deterministic tools (shell commands with side effects, memory_save) must be excluded via a cacheable = false flag in the ToolExecutor trait
- Location: CompositeExecutor in zeph-tools wraps inner executors with a CachingExecutor layer
- Config: [tools] result_cache = { enabled = true, ttl_secs = 300 } (new section)
- Metrics: cache hit count tracked in MetricsSnapshot
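The key scheme and executor layering above can be sketched in Rust. This is a minimal sketch under stated assumptions: the ToolExecutor trait signature, the string-based argument representation, and the CountingExecutor are illustrative inventions, not the actual zeph-tools API (which would likely be async with proper error types).

```rust
use std::collections::{BTreeMap, HashMap};
use std::time::{Duration, Instant};

// Canonical cache key: sorted argument keys, trimmed values. BTreeMap
// iterates in sorted key order, so the same logical call yields the same
// key regardless of the order the LLM emitted the arguments in.
fn cache_key(tool_name: &str, args: &BTreeMap<String, String>) -> String {
    let canon: Vec<String> = args
        .iter()
        .map(|(k, v)| format!("\"{k}\":\"{}\"", v.trim()))
        .collect();
    format!("{tool_name}:{{{}}}", canon.join(","))
}

// Hypothetical minimal trait; the real ToolExecutor in zeph-tools will differ.
trait ToolExecutor {
    fn execute(&mut self, tool: &str, args: &BTreeMap<String, String>) -> String;
    fn cacheable(&self, tool: &str) -> bool;
}

struct CachingExecutor<E: ToolExecutor> {
    inner: E,
    ttl: Duration,
    cache: HashMap<String, (Instant, String)>,
    hits: u64, // would be surfaced via MetricsSnapshot
}

impl<E: ToolExecutor> CachingExecutor<E> {
    fn new(inner: E, ttl: Duration) -> Self {
        Self { inner, ttl, cache: HashMap::new(), hits: 0 }
    }

    fn execute(&mut self, tool: &str, args: &BTreeMap<String, String>) -> String {
        if !self.inner.cacheable(tool) {
            return self.inner.execute(tool, args); // opt-out path
        }
        let key = cache_key(tool, args);
        if let Some((stored_at, result)) = self.cache.get(&key) {
            if stored_at.elapsed() < self.ttl {
                self.hits += 1;
                return result.clone();
            }
        }
        let result = self.inner.execute(tool, args);
        self.cache.insert(key, (Instant::now(), result.clone()));
        result
    }

    /// Called on /clear: drops all cached entries.
    fn clear(&mut self) {
        self.cache.clear();
    }
}

// Toy inner executor that counts real executions, for demonstration only.
struct CountingExecutor {
    calls: u32,
}

impl ToolExecutor for CountingExecutor {
    fn execute(&mut self, tool: &str, args: &BTreeMap<String, String>) -> String {
        self.calls += 1;
        format!("result of {tool} with {} args", args.len())
    }
    fn cacheable(&self, tool: &str) -> bool {
        tool != "shell" // side-effecting tools excluded
    }
}

fn main() {
    let ttl = Duration::from_secs(300); // mirrors ttl_secs = 300
    let mut exec = CachingExecutor::new(CountingExecutor { calls: 0 }, ttl);
    let mut args = BTreeMap::new();
    args.insert("path".to_string(), "/tmp/notes.txt".to_string());

    let first = exec.execute("file_read", &args);
    let second = exec.execute("file_read", &args); // served from cache
    assert_eq!(first, second);
    assert_eq!(exec.inner.calls, 1); // only one real execution
    assert_eq!(exec.hits, 1);

    exec.clear(); // what a /clear handler would invoke
    assert!(exec.cache.is_empty());
    println!("hits={} real_calls={}", exec.hits, exec.inner.calls);
}
```

The wrapper checks cacheable before touching the cache at all, so opted-out tools never pay even the key-construction cost; TTL is checked lazily on lookup rather than with a background sweeper, which is adequate for a small per-session map.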
Expected benefit
- Eliminates redundant file reads and identical web scrapes
- Removes re-execution latency when a cached result is re-injected (the tool_result content, and therefore its token count, is unchanged)
- No LLM inference engine changes required — pure application-layer optimization
Non-goals
- Speculative dispatch (requires inference engine access, not feasible at app layer)
- Cross-session caching (stale results risk — too dangerous for a general cache)