Overview
When the agent approaches its max iteration limit, it currently gets no advance warning — it simply hits the wall and _handle_max_iterations() makes one final tool-less API call asking for a summary. This means the LLM has no opportunity to proactively wrap up, consolidate its findings, or produce a quality final response before being cut off.
This idea comes from Utah (Inngest's agent harness), which implements a two-tier budget pressure system that injects system messages into the LLM context as iterations run low. The pattern is simple, zero-dependency, and addresses a real failure mode where agents exhaust iterations doing tool calls without ever producing a response.
Research Findings
How Utah's Budget Pressure Works
Utah injects ephemeral system messages at two tiers based on remaining iterations (default maxIterations = 20):
// CAUTION tier — 10 iterations before the end
if (iterations >= maxIterations - 10) {
"[SYSTEM: Iteration N/M. Start wrapping up — respond with text soon.]"
}
// WARNING tier — last 3 iterations
if (iterations >= maxIterations - 3) {
"[SYSTEM: You are on iteration N of M. You MUST respond with your final
answer NOW. Do not call any more tools.]"
}
Key design decisions:
- Messages are appended to
messagesForLLM (the copy sent to the API), not to the persistent messages array — they don't pollute session history
- Two tiers provide graduated pressure: first a nudge, then urgency
- If the loop exhausts all iterations anyway, a static fallback response is returned:
"(Reached max iterations: 20)"
Current State in Hermes Agent
In run_agent.py, the agent loop (while api_call_count < self.max_iterations) has:
- No pre-warning to the LLM about approaching the limit
- A post-hoc
_handle_max_iterations() (lines 2640-2757) that:
- Injects a user message asking for a summary after the limit is hit
- Makes one final API call with NO tools
- Returns whatever the LLM produces or an error string
max_iterations defaults to 60, displayed in progress output but never communicated to the LLM
The step_callback fires per-iteration (line 2941) and could be extended, but budget warnings are better handled as message injection into the API call.
Implementation Plan
Skill vs. Tool Classification
This is a core codebase change to run_agent.py, not a skill or tool. It modifies the agent loop's message preparation logic.
What We'd Need
- Configurable thresholds for warning tiers
- Message injection into
api_messages (not persisted messages)
- Integration with existing
_handle_max_iterations() as a fallback
Phased Rollout
Phase 1: Basic two-tier budget warnings
- Add
BUDGET_CAUTION_THRESHOLD = 0.7 and BUDGET_WARNING_THRESHOLD = 0.9 (fraction of max_iterations)
- Before each API call, check
api_call_count / self.max_iterations against thresholds
- Inject ephemeral system messages into the messages sent to the API:
- Caution (70%):
"[BUDGET: Iteration {N}/{max}. You have {remaining} iterations left. Start consolidating your work and prepare to provide a final response.]"
- Warning (90%):
"[BUDGET: Iteration {N}/{max}. You MUST provide your final response NOW. Do not make additional tool calls unless absolutely critical.]"
- Messages injected into
api_messages copy only, never persisted to messages or session DB
- Injection point: after line ~3000 in
run_agent.py, before the API call
Phase 2: Adaptive thresholds
- Scale thresholds based on actual
max_iterations value (a 10-turn session needs earlier warnings than a 60-turn one)
- Consider task complexity signals (number of tools called, context size) to adjust pressure timing
- Add config options in
cli.py CLI_CONFIG for threshold customization
Phase 3: Smart wrap-up behavior
- When budget warning fires, optionally reduce the available toolset (e.g., remove heavy tools like delegate_task, browser)
- Track whether the LLM acknowledged the warning (produced text alongside tool calls) vs. ignored it
- If warning was ignored, escalate to injecting the message as a forced user turn on next iteration
Pros & Cons
Pros
- Trivial to implement — ~20 lines of code in the main loop, no new dependencies
- Prevents silent exhaustion — the most common failure mode where agents loop endlessly doing tool calls
- Better response quality — the LLM can thoughtfully conclude vs. being abruptly asked to summarize after cutoff
- No architectural changes — works within the existing loop structure
- Ephemeral injection — doesn't pollute session history or affect context compression
Cons / Risks
- May cause premature wrap-up — aggressive thresholds might make the LLM stop working too early
- Threshold tuning — the right thresholds likely depend on task type; a fixed percentage may not be optimal for all cases
- Token cost — injected messages consume context tokens (minimal, but nonzero)
Open Questions
- Should the budget message format be a system message or a user message? (Utah uses user-role messages, but system role avoids confusing the LLM about who's speaking)
- Should the thresholds be absolute (e.g., "last 5 iterations") or relative (e.g., "last 10%")? Relative works better across different
max_iterations values
- Should Phase 3 tool reduction be opt-in or default?
- Should this interact with
_handle_max_iterations() — e.g., if the budget warning successfully caused a response, skip the post-hoc summary call?
References
Overview
When the agent approaches its max iteration limit, it currently gets no advance warning — it simply hits the wall and
_handle_max_iterations()makes one final tool-less API call asking for a summary. This means the LLM has no opportunity to proactively wrap up, consolidate its findings, or produce a quality final response before being cut off.This idea comes from Utah (Inngest's agent harness), which implements a two-tier budget pressure system that injects system messages into the LLM context as iterations run low. The pattern is simple, zero-dependency, and addresses a real failure mode where agents exhaust iterations doing tool calls without ever producing a response.
Research Findings
How Utah's Budget Pressure Works
Utah injects ephemeral system messages at two tiers based on remaining iterations (default
maxIterations = 20):Key design decisions:
messagesForLLM(the copy sent to the API), not to the persistentmessagesarray — they don't pollute session history"(Reached max iterations: 20)"Current State in Hermes Agent
In
run_agent.py, the agent loop (while api_call_count < self.max_iterations) has:_handle_max_iterations()(lines 2640-2757) that:max_iterationsdefaults to 60, displayed in progress output but never communicated to the LLMThe step_callback fires per-iteration (line 2941) and could be extended, but budget warnings are better handled as message injection into the API call.
Implementation Plan
Skill vs. Tool Classification
This is a core codebase change to
run_agent.py, not a skill or tool. It modifies the agent loop's message preparation logic.What We'd Need
api_messages(not persistedmessages)_handle_max_iterations()as a fallbackPhased Rollout
Phase 1: Basic two-tier budget warnings
BUDGET_CAUTION_THRESHOLD = 0.7andBUDGET_WARNING_THRESHOLD = 0.9(fraction of max_iterations)api_call_count / self.max_iterationsagainst thresholds"[BUDGET: Iteration {N}/{max}. You have {remaining} iterations left. Start consolidating your work and prepare to provide a final response.]""[BUDGET: Iteration {N}/{max}. You MUST provide your final response NOW. Do not make additional tool calls unless absolutely critical.]"api_messagescopy only, never persisted tomessagesor session DBrun_agent.py, before the API callPhase 2: Adaptive thresholds
max_iterationsvalue (a 10-turn session needs earlier warnings than a 60-turn one)cli.pyCLI_CONFIG for threshold customizationPhase 3: Smart wrap-up behavior
Pros & Cons
Pros
Cons / Risks
Open Questions
max_iterationsvalues_handle_max_iterations()— e.g., if the budget warning successfully caused a response, skip the post-hoc summary call?References
run_agent.pylines 2640-2757 (_handle_max_iterations), line 2919 (main loop condition)