fix(agent): boost max_tokens on length-continuation retries #9489

Closed
ygd58 wants to merge 1 commit into NousResearch:main from ygd58:fix/continuation-max-tokens-boost

@ygd58 commented Apr 14, 2026

Contributor

Problem

Closes #9372

Models with a low default output limit (e.g. GLM-4.7 on NVIDIA Build) truncate at the same limit on every continuation attempt, exhausting all 3 retries without ever finishing.

Root Cause

The continuation loop sends the continue prompt but calls the API with the same max_tokens each time. If the model hits the limit on the first call, it hits it again on every retry.

Fix

Set _ephemeral_max_output_tokens before each continuation retry, growing the budget per attempt (capped at 32K): Retry 1 = max_tokens x 2, Retry 2 = max_tokens x 3.
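
For illustration, the retry schedule above can be sketched as a small helper. This is a minimal sketch, not the code in run_agent.py: the function name `boosted_max_tokens` and the constant `BOOST_CAP` are hypothetical, and it assumes retries are counted from 1 and that the cap of 32K means 32,768 tokens.

```python
BOOST_CAP = 32_768  # assumed value of the "32K" cap


def boosted_max_tokens(base_max_tokens: int, retry: int, cap: int = BOOST_CAP) -> int:
    """Output budget for the given continuation retry (1-based).

    Retry 1 gets base * 2, retry 2 gets base * 3, and so on,
    never exceeding `cap`.
    """
    if retry < 1:
        raise ValueError("retry is 1-based")
    return min(base_max_tokens * (retry + 1), cap)


if __name__ == "__main__":
    base = 4096
    for retry in (1, 2):
        print(retry, boosted_max_tokens(base, retry))
```

With a base of 4096 this yields 8192 on the first retry and 12288 on the second; a base of 16384 would already be clamped to the cap on the second retry.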

Also extended _ephemeral_max_output_tokens consumption to the chat_completions path in _build_api_kwargs() so both chat_completions and anthropic_messages modes benefit.
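
As a rough sketch of what "consuming" the override could look like, the class below applies a one-shot `_ephemeral_max_output_tokens` value when building request kwargs and then clears it. Only the attribute and method names come from this PR; the class `AgentSketch`, its constructor, the `api_mode` argument, and the clear-after-use behavior are assumptions made for the example and may not match run_agent.py.

```python
from typing import Any, Dict, List, Optional


class AgentSketch:
    """Hypothetical stand-in for the agent; not the real class."""

    def __init__(self, max_tokens: Optional[int] = None) -> None:
        self.max_tokens = max_tokens
        # One-shot override set by the continuation loop before a retry.
        self._ephemeral_max_output_tokens: Optional[int] = None

    def _build_api_kwargs(self, api_mode: str, messages: List[Dict[str, Any]]) -> Dict[str, Any]:
        # Prefer the ephemeral override, fall back to the configured limit.
        max_out = self._ephemeral_max_output_tokens or self.max_tokens
        # Assumption: the override applies to a single request only.
        self._ephemeral_max_output_tokens = None

        kwargs: Dict[str, Any] = {"messages": messages}
        # In this sketch both modes ("chat_completions" and
        # "anthropic_messages") take the limit as `max_tokens`.
        if max_out is not None:
            kwargs["max_tokens"] = max_out
        return kwargs


if __name__ == "__main__":
    agent = AgentSketch(max_tokens=4096)
    agent._ephemeral_max_output_tokens = 8192
    print(agent._build_api_kwargs("chat_completions", [])["max_tokens"])
    print(agent._build_api_kwargs("chat_completions", [])["max_tokens"])
```

The first call uses the boosted budget and the second falls back to the configured default, which is the behavior the continuation loop relies on under these assumptions.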

Before / After

Before: GLM-4.7 truncated x3 at same limit, error.
After: GLM-4.7 retried with growing limit, more room to finish.

Models with a low default output limit (e.g. GLM-4.7 on NVIDIA Build)
truncate at the same limit on every continuation attempt, exhausting
all 3 retries without ever finishing the response.

Fix: set _ephemeral_max_output_tokens before each continuation retry,
growing the output budget with each attempt (capped at 32K). This
applies to both chat_completions and anthropic_messages modes:
- chat_completions: _ephemeral_max_output_tokens now consumed in
  _build_api_kwargs() alongside the existing anthropic path
- Retry 1: max_tokens * 2, Retry 2: max_tokens * 3 (max 32K)

Fixes NousResearch#9372
@alt-glitch added the labels type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), and comp/agent (Core agent loop, run_agent.py, prompt builder) on Apr 27, 2026
@alt-glitch

Collaborator

Related to #12231 (merged) which wired ephemeral max_tokens into chat_completions. Check if the continuation-retry boost in this PR is already covered by that merge.

@teknium1

Contributor

Thanks for this contribution @ygd58! The fixes described in this PR are already present on main.

This is an automated hermes-sweeper review.

Evidence:

  • run_agent.py lines 12810–12812: progressive length-continuation boost (_boost_base * (length_continue_retries + 1), capped at 32,768) setting _ephemeral_max_output_tokens on each retry, exactly matching this PR's approach.
  • run_agent.py lines 8394–8396: _ephemeral_max_output_tokens consumed in the chat_completions branch of _build_api_kwargs, covering OpenRouter, NVIDIA NIM, and all other non-Anthropic providers.
  • Implementing commit: f7af90e2d, "fix: wire _ephemeral_max_output_tokens into chat_completions and add NVIDIA NIM default", whose commit message describes all three fixes verbatim.

As noted by @alt-glitch, PR #12231 (merged) covered the same ground. Closing as implemented on main.

@teknium1 closed this Jun 10, 2026
@teknium1 added the sweeper:implemented-on-main label (Sweeper: behavior already present on current main) on Jun 10, 2026

Development

Successfully merging this pull request may close these issues.

[Bug]: NVIDIA Build API modle z-ai/glm4.7 returns model hit max output tokens