Skip to content

Feature: Iteration Budget Pressure — Warn the LLM Before Max Iterations Hit #414

@teknium1

Description

@teknium1

Overview

When the agent approaches its max iteration limit, it currently gets no advance warning — it simply hits the wall and _handle_max_iterations() makes one final tool-less API call asking for a summary. This means the LLM has no opportunity to proactively wrap up, consolidate its findings, or produce a quality final response before being cut off.

This idea comes from Utah (Inngest's agent harness), which implements a two-tier budget pressure system that injects system messages into the LLM context as iterations run low. The pattern is simple, zero-dependency, and addresses a real failure mode where agents exhaust iterations doing tool calls without ever producing a response.


Research Findings

How Utah's Budget Pressure Works

Utah injects ephemeral system messages at two tiers based on remaining iterations (default maxIterations = 20):

// CAUTION tier — 10 iterations before the end
if (iterations >= maxIterations - 10) {
  "[SYSTEM: Iteration N/M. Start wrapping up — respond with text soon.]"
}

// WARNING tier — last 3 iterations
if (iterations >= maxIterations - 3) {
  "[SYSTEM: You are on iteration N of M. You MUST respond with your final
   answer NOW. Do not call any more tools.]"
}

Key design decisions:

  • Messages are appended to messagesForLLM (the copy sent to the API), not to the persistent messages array — they don't pollute session history
  • Two tiers provide graduated pressure: first a nudge, then urgency
  • If the loop exhausts all iterations anyway, a static fallback response is returned: "(Reached max iterations: 20)"

Current State in Hermes Agent

In run_agent.py, the agent loop (while api_call_count < self.max_iterations) has:

  • No pre-warning to the LLM about approaching the limit
  • A post-hoc _handle_max_iterations() (lines 2640-2757) that:
    • Injects a user message asking for a summary after the limit is hit
    • Makes one final API call with NO tools
    • Returns whatever the LLM produces or an error string
  • max_iterations defaults to 60, displayed in progress output but never communicated to the LLM

The step_callback fires per-iteration (line 2941) and could be extended, but budget warnings are better handled as message injection into the API call.


Implementation Plan

Skill vs. Tool Classification

This is a core codebase change to run_agent.py, not a skill or tool. It modifies the agent loop's message preparation logic.

What We'd Need

  1. Configurable thresholds for warning tiers
  2. Message injection into api_messages (not persisted messages)
  3. Integration with existing _handle_max_iterations() as a fallback

Phased Rollout

Phase 1: Basic two-tier budget warnings

  • Add BUDGET_CAUTION_THRESHOLD = 0.7 and BUDGET_WARNING_THRESHOLD = 0.9 (fraction of max_iterations)
  • Before each API call, check api_call_count / self.max_iterations against thresholds
  • Inject ephemeral system messages into the messages sent to the API:
    • Caution (70%): "[BUDGET: Iteration {N}/{max}. You have {remaining} iterations left. Start consolidating your work and prepare to provide a final response.]"
    • Warning (90%): "[BUDGET: Iteration {N}/{max}. You MUST provide your final response NOW. Do not make additional tool calls unless absolutely critical.]"
  • Messages injected into api_messages copy only, never persisted to messages or session DB
  • Injection point: after line ~3000 in run_agent.py, before the API call

Phase 2: Adaptive thresholds

  • Scale thresholds based on actual max_iterations value (a 10-turn session needs earlier warnings than a 60-turn one)
  • Consider task complexity signals (number of tools called, context size) to adjust pressure timing
  • Add config options in cli.py CLI_CONFIG for threshold customization

Phase 3: Smart wrap-up behavior

  • When budget warning fires, optionally reduce the available toolset (e.g., remove heavy tools like delegate_task, browser)
  • Track whether the LLM acknowledged the warning (produced text alongside tool calls) vs. ignored it
  • If warning was ignored, escalate to injecting the message as a forced user turn on next iteration

Pros & Cons

Pros

  • Trivial to implement — ~20 lines of code in the main loop, no new dependencies
  • Prevents silent exhaustion — the most common failure mode where agents loop endlessly doing tool calls
  • Better response quality — the LLM can thoughtfully conclude vs. being abruptly asked to summarize after cutoff
  • No architectural changes — works within the existing loop structure
  • Ephemeral injection — doesn't pollute session history or affect context compression

Cons / Risks

  • May cause premature wrap-up — aggressive thresholds might make the LLM stop working too early
  • Threshold tuning — the right thresholds likely depend on task type; a fixed percentage may not be optimal for all cases
  • Token cost — injected messages consume context tokens (minimal, but nonzero)

Open Questions

  • Should the budget message format be a system message or a user message? (Utah uses user-role messages, but system role avoids confusing the LLM about who's speaking)
  • Should the thresholds be absolute (e.g., "last 5 iterations") or relative (e.g., "last 10%")? Relative works better across different max_iterations values
  • Should Phase 3 tool reduction be opt-in or default?
  • Should this interact with _handle_max_iterations() — e.g., if the budget warning successfully caused a response, skip the post-hoc summary call?

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions