fix: wire ephemeral max_tokens into chat_completions + NVIDIA NIM default #12231

Merged
teknium1 merged 1 commit into main from salvage/nvidia-nim-ephemeral-max-tokens on Apr 18, 2026

Conversation

@kshitijk4poor

Collaborator

Summary

Salvage of #12152 by @LVT382009.

What this fixes

  1. _ephemeral_max_output_tokens not consumed by chat_completions: The error-recovery ephemeral override (set when the API returns "max_tokens too large given prompt") was only consumed in the anthropic_messages branch of _build_api_kwargs, so all chat_completions providers (OpenRouter, NVIDIA NIM, Qwen, Alibaba, custom, etc.) silently ignored it. The chat_completions branch now consumes it at the highest priority in the max_tokens cascade, matching the anthropic_messages pattern.

  2. NVIDIA NIM max_tokens default (16384): NVIDIA NIM falls back to a very low internal default when max_tokens is omitted, causing models like GLM-4.7 to truncate immediately (thinking tokens exhaust the budget before the response starts). Requests to NVIDIA NIM now default to max_tokens=16384 when no other limit applies.

  3. Progressive length-continuation boost: When finish_reason='length' triggers a continuation retry, the output budget now grows progressively (2x base on retry 1, 3x on retry 2, capped at 32768) via _ephemeral_max_output_tokens. Previously the retry loop re-sent the same token limit on all 3 attempts.
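
For reviewers, the priority cascade described in items 1 and 2 could look roughly like the sketch below. This is not the code from run_agent.py: the AgentSketch class, the resolve_max_tokens helper, the "nvidia" provider string, and the one-shot clearing of the ephemeral value are all assumptions made for illustration. Only _ephemeral_max_output_tokens, _build_api_kwargs (mirrored here as build_api_kwargs), and the 16384 default come from this PR.

```python
from typing import Optional

NVIDIA_NIM_DEFAULT_MAX_TOKENS = 16384  # default named in this PR


class AgentSketch:
    """Hypothetical stand-in for the agent; holds only the fields used here."""

    def __init__(self, provider: str, max_tokens: Optional[int] = None):
        self.provider = provider      # "nvidia" is an assumed provider id
        self.max_tokens = max_tokens  # user-configured limit, if any
        self._ephemeral_max_output_tokens: Optional[int] = None

    def resolve_max_tokens(self) -> Optional[int]:
        """Return the max_tokens to send, or None to omit the field."""
        # 1. The ephemeral override wins. Assumption: it is one-shot,
        #    so it is cleared as soon as it has been read.
        ephemeral = self._ephemeral_max_output_tokens
        if ephemeral is not None:
            self._ephemeral_max_output_tokens = None
            return ephemeral
        # 2. An explicit user setting comes next.
        if self.max_tokens is not None:
            return self.max_tokens
        # 3. NVIDIA NIM gets a default instead of its low internal one.
        if self.provider == "nvidia":
            return NVIDIA_NIM_DEFAULT_MAX_TOKENS
        # 4. Every other provider: omit max_tokens entirely.
        return None

    def build_api_kwargs(self, model: str, messages: list) -> dict:
        kwargs = {"model": model, "messages": messages}
        max_tokens = self.resolve_max_tokens()
        if max_tokens is not None:
            kwargs["max_tokens"] = max_tokens
        return kwargs


agent = AgentSketch("nvidia")
agent._ephemeral_max_output_tokens = 4096
print(agent.build_api_kwargs("some-model", [])["max_tokens"])  # 4096
print(agent.build_api_kwargs("some-model", [])["max_tokens"])  # 16384
```

The second call falls back to the NIM default because the sketch clears the override after one use; whether the real implementation clears it at the same point is not stated in this description.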

Test plan

  • pytest tests/run_agent/ tests/agent/test_auxiliary_client.py — 851 passed (3 pre-existing failures on main)
  • E2E: 9 scenarios covering the full priority cascade (NIM default, ephemeral priority, user override, non-NVIDIA omission, OpenRouter ephemeral, boost math, cap)
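
The "boost math" and "cap" scenarios can be illustrated with a small pure function. This is a sketch under assumptions, not the merged code: the function name boosted_max_tokens and the choice of "base" as the token limit used on the first attempt are hypothetical. The multipliers (2x, 3x) and the 32768 cap are taken from the summary above.

```python
LENGTH_BOOST_CAP = 32768  # cap stated in this PR


def boosted_max_tokens(base: int, retry: int, cap: int = LENGTH_BOOST_CAP) -> int:
    """Output budget for continuation retry number `retry` (1-based).

    Retry 1 gets 2x base, retry 2 gets 3x base, each clamped to `cap`.
    """
    if retry < 1:
        raise ValueError("retry must be >= 1")
    return min(base * (retry + 1), cap)


# With a 4096-token base the budget grows on each retry.
print(boosted_max_tokens(4096, 1))   # 8192
print(boosted_max_tokens(4096, 2))   # 12288

# With the 16384 NIM default the cap is reached on the first retry.
print(boosted_max_tokens(16384, 1))  # 32768
print(boosted_max_tokens(16384, 2))  # 32768
```

In the flow described above, the result would be written to _ephemeral_max_output_tokens before the continuation request is built. How the real code behaves when the base already exceeds the cap is not covered by this description, and the sketch does not handle that case specially.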

Closes #12152

@github-actions

Contributor

⚠️ Supply Chain Risk Detected

This PR contains patterns commonly associated with supply chain attacks. This does not mean the PR is malicious — but these patterns require careful human review before merging.

⚠️ WARNING: Container build files modified

Changes to Dockerfiles or compose files can alter base images, add build steps, or expose ports. Verify base image pins and build commands.

Files:

Dockerfile

Automated scan triggered by supply-chain-audit. If this is a false positive, a maintainer can approve after manual review.

…NVIDIA NIM default

Based on #12152 by @LVT382009.

Three fixes to run_agent.py:

1. _ephemeral_max_output_tokens consumption in chat_completions path:
   The error-recovery ephemeral override was only consumed in the
   anthropic_messages branch of _build_api_kwargs.  All chat_completions
   providers (OpenRouter, NVIDIA NIM, Qwen, Alibaba, custom, etc.)
   silently ignored it.  Now consumed at highest priority, matching the
   anthropic pattern.

2. NVIDIA NIM max_tokens default (16384):
   NVIDIA NIM falls back to a very low internal default when max_tokens
   is omitted, causing models like GLM-4.7 to truncate immediately
   (thinking tokens exhaust the budget before the response starts).

3. Progressive length-continuation boost:
   When finish_reason='length' triggers a continuation retry, the output
   budget now grows progressively (2x base on retry 1, 3x on retry 2,
   capped at 32768) via _ephemeral_max_output_tokens.  Previously the
   retry loop just re-sent the same token limit on all 3 attempts.

@teknium1 force-pushed the salvage/nvidia-nim-ephemeral-max-tokens branch from 4954768 to f38de37 on April 18, 2026 19:51
@teknium1 merged commit f7af90e into main on Apr 18, 2026 (5 of 7 checks passed)
@teknium1 deleted the salvage/nvidia-nim-ephemeral-max-tokens branch on April 18, 2026 19:51