fix: NVIDIA NIM models always truncate due to missing max_tokens default and ephemeral boost not wired to chat_completions by LVT382009 · Pull Request #12152 · NousResearch/hermes-agent

LVT382009 · 2026-04-18T13:02:37Z

Fix `finish_reason='length'` truncation loop on NVIDIA NIM (GLM-4.7 and others)

So I've been using Hermes with GLM-4.7 on NVIDIA NIM and kept hitting this super frustrating issue where even a simple "hi" would loop through 3 truncation warnings and then die with "Response remained truncated after 3 continuation attempts". I spent a while digging through run_agent.py to figure out what was actually going on and found two separate problems compounding each other.

What's actually happening

Problem 1 — NVIDIA NIM never gets a max_tokens value

When you don't set max_tokens in your config (or even when you do, since it never gets passed to AIAgent anyway), self.max_tokens ends up as None. Most providers are fine with that and just use a sensible default. NVIDIA NIM is not — it falls back to a really low internal default that GLM-4.7's thinking tokens alone can blow through before it even starts writing the actual response. So it truncates immediately, every single time, on the first call.

Problem 2 — The retry boost only works for Anthropic, not NVIDIA NIM

There's already a retry boost mechanism in the code — when a length truncation happens, it sets _ephemeral_max_output_tokens to a growing multiple of the base token budget before retrying. Good idea. The problem is that _build_api_kwargs() only consumes _ephemeral_max_output_tokens inside the anthropic_messages branch. NVIDIA NIM goes through chat_completions, so the boost is set but never actually sent to the API. All 3 retries hit the exact same token limit and fail identically.
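To make the mismatch concrete, here is a minimal sketch of the behaviour described above. `ToyAgent` and `build_api_kwargs` are hypothetical stand-ins, not the real `AIAgent` code; only the two attribute names and the two branch names come from this PR.

```python
from typing import Optional


class ToyAgent:
    """Hypothetical stand-in for AIAgent; only the token-budget logic is modelled."""

    def __init__(self, api_mode: str, max_tokens: Optional[int] = None):
        self.api_mode = api_mode
        self.max_tokens = max_tokens
        self._ephemeral_max_output_tokens: Optional[int] = None

    def build_api_kwargs(self) -> dict:
        kwargs: dict = {}
        if self.api_mode == "anthropic_messages":
            # Only this branch looks at the ephemeral override.
            budget = self._ephemeral_max_output_tokens or self.max_tokens
            self._ephemeral_max_output_tokens = None
            if budget is not None:
                kwargs["max_tokens"] = budget
        else:  # chat_completions
            # The override is never read here, so a retry looks like the first call.
            if self.max_tokens is not None:
                kwargs["max_tokens"] = self.max_tokens
        return kwargs


# The retry loop sets the boost on both agents...
nim = ToyAgent("chat_completions")
nim._ephemeral_max_output_tokens = 8192
nim_kwargs = nim.build_api_kwargs()

claude = ToyAgent("anthropic_messages")
claude._ephemeral_max_output_tokens = 8192
claude_kwargs = claude.build_api_kwargs()

# ...but only the anthropic_messages request actually carries it.
print(nim_kwargs, claude_kwargs)
```

In this toy version the chat_completions request goes out with no max_tokens at all, which is the situation where NVIDIA NIM applies its own low default.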

The fix

Three small changes to run_agent.py:

1. Add the boost logic to the retry loop

Around the if restart_with_length_continuation: block, add the growing budget before continue:

if restart_with_length_continuation:
    _boost_base = self.max_tokens if self.max_tokens else 4096
    _boost = _boost_base * (length_continue_retries + 1)
    self._ephemeral_max_output_tokens = min(_boost, 32768)
    continue

Retry 1 gets base × 2, retry 2 gets base × 3, capped at 32K.
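The arithmetic can be checked in isolation. The helper below is a sketch, not code from run_agent.py: `boosted_budget` is a made-up name, and it assumes `length_continue_retries` has already been incremented to 1 by the time the first retry is scheduled, which is what "retry 1 gets base × 2" implies.

```python
from typing import Optional


def boosted_budget(
    max_tokens: Optional[int],
    length_continue_retries: int,
    fallback: int = 4096,
    cap: int = 32768,
) -> int:
    """Mirror the snippet above: base * (retries + 1), capped at 32768."""
    base = max_tokens if max_tokens else fallback
    return min(base * (length_continue_retries + 1), cap)


# No max_tokens configured: the base falls back to 4096.
budgets = [boosted_budget(None, n) for n in (1, 2, 3)]
print(budgets)

# A configured 16384 hits the cap on the first retry.
print(boosted_budget(16384, 1), boosted_budget(16384, 2))
```

With max_tokens unset, the three retries request 8192, 12288 and 16384 tokens. Read literally, that means a NIM retry with no configured max_tokens would ask for less than the 16384 first-call default from change 3, unless the real code derives the base differently.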

2. Wire up ephemeral consumption in the chat_completions path

In _build_api_kwargs(), the section that builds max_tokens for chat_completions currently just checks self.max_tokens. Replace it with:

_ephemeral_out = getattr(self, "_ephemeral_max_output_tokens", None)
if _ephemeral_out is not None:
    self._ephemeral_max_output_tokens = None  # consume immediately
    api_kwargs.update(self._max_tokens_param(_ephemeral_out))
elif self.max_tokens is not None:
    api_kwargs.update(self._max_tokens_param(self.max_tokens))
elif "integrate.api.nvidia.com" in self._base_url_lower:
    api_kwargs.update(self._max_tokens_param(16384))
elif self._is_qwen_portal():
    ...

3. Set a sane default for NVIDIA NIM

Even with the boost working, the first call still truncates because nothing sets a max_tokens for NVIDIA NIM upfront. The elif "integrate.api.nvidia.com" line above handles this — sends 16384 as the default so the first call has enough room and the boost rarely needs to kick in at all.
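The resulting priority order is easier to see as a pure function. This is a sketch under assumptions: `resolve_max_tokens` and the two constants are illustrative names, and the Qwen branch (elided in the snippet above) is left out.

```python
from typing import Optional

NIM_HOST = "integrate.api.nvidia.com"
NIM_DEFAULT_MAX_TOKENS = 16384


def resolve_max_tokens(
    ephemeral: Optional[int],
    configured: Optional[int],
    base_url: str,
) -> Optional[int]:
    """Return the output budget to send, or None to omit max_tokens entirely."""
    if ephemeral is not None:
        return ephemeral               # 1. one-shot retry boost wins
    if configured is not None:
        return configured              # 2. then the user's configured value
    if NIM_HOST in base_url.lower():
        return NIM_DEFAULT_MAX_TOKENS  # 3. then the NVIDIA NIM default
    return None                        # 4. otherwise let the provider decide


nim_url = "https://integrate.api.nvidia.com/v1"
print(resolve_max_tokens(None, None, nim_url))    # first call on NIM
print(resolve_max_tokens(8192, None, nim_url))    # boosted retry
print(resolve_max_tokens(None, 2048, nim_url))    # explicit config
print(resolve_max_tokens(None, None, "https://example.invalid/v1"))
```

Keeping the ephemeral value first means a boost always takes effect, even when the user configured a smaller max_tokens.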

Result

Before: hi → ⚠️ ⚠️ ⚠️ → ❌ dead

After: hi → ✅ response, no warnings

Tested on GLM-4.7 via https://integrate.api.nvidia.com/v1. Should also help any other model on NVIDIA NIM that suffers from the same low default.

…NVIDIA NIM default

Based on #12152 by @LVT382009. Three changes to run_agent.py:

1. _ephemeral_max_output_tokens consumption in chat_completions path: The error-recovery ephemeral override was only consumed in the anthropic_messages branch of _build_api_kwargs. All chat_completions providers (OpenRouter, NVIDIA NIM, Qwen, Alibaba, custom, etc.) silently ignored it. Now consumed at highest priority, matching the anthropic pattern.

2. NVIDIA NIM max_tokens default (16384): NVIDIA NIM falls back to a very low internal default when max_tokens is omitted, causing models like GLM-4.7 to truncate immediately (thinking tokens exhaust the budget before the response starts).

3. Progressive length-continuation boost: When finish_reason='length' triggers a continuation retry, the output budget now grows progressively (2x base on retry 1, 3x on retry 2, capped at 32768) via _ephemeral_max_output_tokens. Previously the retry loop just re-sent the same token limit on all 3 attempts.

LVT382009

Works perfectly!

teknium1 · 2026-04-18T19:52:00Z

Merged via PR #12231 (rebase-merge, commit f7af90e on main). Your commit was cherry-picked onto current main with authorship preserved in git log. Thanks for the fix, @LVT382009 — NIM users no longer truncate on first call.

LVT382009 · 2026-04-18T21:41:03Z

np

fix: set default max_tokens for NVIDIA NIM and wire ephemeral boost to chat_completions path

915ce18

LVT382009 mentioned this pull request Apr 18, 2026

[Bug]: NVIDIA Build API modle z-ai/glm4.7 returns model hit max output tokens #9372

Closed

1 task

kshitijk4poor mentioned this pull request Apr 18, 2026

fix: wire ephemeral max_tokens into chat_completions + NVIDIA NIM default #12217

Closed

2 tasks

This was referenced Apr 18, 2026

fix: wire ephemeral max_tokens into chat_completions + NVIDIA NIM default #12231

Merged

meta: NVIDIA NIM provider parity tracker #12233

Closed

teknium1 closed this in #12231 Apr 18, 2026

LVT382009 wants to merge 1 commit into NousResearch:main from LVT382009:fix/nvidia-nim-max-tokens-truncation
