Source
arXiv:2603.19896 — Utility-Guided Agent Orchestration for Efficient LLM Tool Use (2026-03-20)
Technique
An orchestration policy that scores each candidate action (respond, retrieve, tool call, verify, stop) using a utility function weighting:
- Estimated gain — expected quality improvement from the action
- Step cost — token/latency cost
- Uncertainty — model confidence in the current state
- Redundancy — overlap with already-retrieved context
The policy selects the action with the highest utility rather than letting the LLM freely choose.
Applicability to Zeph
Bandit routing layer (#2415 BaRP): The utility function formulation is a natural extension of the BaRP cost-weight dial. Instead of a single cost_weight scalar, the bandit reward signal could incorporate all four components: gain (quality), cost, uncertainty, and redundancy. This would make the LinUCB reward more semantically rich.
ToolExecutor step sequencing: The summarize_output flag and the tool overflow threshold are ad hoc; utility scoring could replace them with a principled "is this tool call worth running?" gate.
context_strategy = "adaptive": The adaptive context strategy already attempts a similar multi-signal tradeoff — this paper provides formal grounding.
Implementation sketch
- Add utility scoring as an optional gate in
zeph-tools/src/composite.rs before dispatching a tool
- Feed utility signal as an additional feature to the LinUCB bandit reward model
- Config:
[tools] utility_scoring = true, utility_gain_weight, utility_cost_weight
Related
Source
arXiv:2603.19896 — Utility-Guided Agent Orchestration for Efficient LLM Tool Use (2026-03-20)
Technique
An orchestration policy that scores each candidate action (respond, retrieve, tool call, verify, stop) using a utility function weighting:
The policy selects the action with the highest utility rather than letting the LLM freely choose.
Applicability to Zeph
Bandit routing layer (#2415 BaRP): The utility function formulation is a natural extension of the BaRP cost-weight dial. Instead of a single
cost_weightscalar, the bandit reward signal could incorporate all four components: gain (quality), cost, uncertainty, and redundancy. This would make the LinUCB reward more semantically rich.ToolExecutor step sequencing: The
summarize_outputflag and the tool overflow threshold are ad hoc; utility scoring could replace them with a principled "is this tool call worth running?" gate.context_strategy = "adaptive": The adaptive context strategy already attempts a similar multi-signal tradeoff — this paper provides formal grounding.Implementation sketch
zeph-tools/src/composite.rsbefore dispatching a tool[tools] utility_scoring = true,utility_gain_weight,utility_cost_weightRelated