fix: NVIDIA NIM models always truncate due to missing max_tokens default and ephemeral boost not wired to chat_completions #12152

Closed
LVT382009 wants to merge 1 commit into NousResearch:main from LVT382009:fix/nvidia-nim-max-tokens-truncation

@LVT382009


Fix finish_reason='length' truncation loop on NVIDIA NIM (GLM-4.7 and others)

I've been using Hermes with GLM-4.7 on NVIDIA NIM and kept hitting an issue where even a simple "hi" loops through 3 truncation warnings and then dies with "Response remained truncated after 3 continuation attempts". After digging through run_agent.py, I found two separate problems that compound each other.


What's actually happening

Problem 1 — NVIDIA NIM never gets a max_tokens value

When you don't set max_tokens in your config (or even when you do, since it never gets passed to AIAgent anyway), self.max_tokens ends up as None. Most providers are fine with that and just use a sensible default. NVIDIA NIM is not — it falls back to a really low internal default that GLM-4.7's thinking tokens alone can blow through before it even starts writing the actual response. So it truncates immediately, every single time, on the first call.

Problem 2 — The retry boost only works for Anthropic, not NVIDIA NIM

There's already a retry boost mechanism in the code — when a length truncation happens, it sets _ephemeral_max_output_tokens to a growing multiple of the base token budget before retrying. Good idea. The problem is that _build_api_kwargs() only consumes _ephemeral_max_output_tokens inside the anthropic_messages branch. NVIDIA NIM goes through chat_completions, so the boost is set but never actually sent to the API. All 3 retries hit the exact same token limit and fail identically.
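To make Problem 2 concrete, here is a minimal sketch of the behaviour described above. `ToyAgent` and `build_api_kwargs` are hypothetical stand-ins written for illustration, not the real `AIAgent` or `_build_api_kwargs()` code; only the branch structure (ephemeral override read in the `anthropic_messages` branch, ignored in `chat_completions`) is taken from the description.

```python
class ToyAgent:
    """Hypothetical stand-in for AIAgent; models only the max_tokens logic."""

    def __init__(self, api_mode, max_tokens=None):
        self.api_mode = api_mode
        self.max_tokens = max_tokens
        self._ephemeral_max_output_tokens = None

    def build_api_kwargs(self):
        kwargs = {}
        if self.api_mode == "anthropic_messages":
            # Only this branch looks at the ephemeral override.
            ephemeral = self._ephemeral_max_output_tokens
            if ephemeral is not None:
                self._ephemeral_max_output_tokens = None  # consume once
                kwargs["max_tokens"] = ephemeral
            elif self.max_tokens is not None:
                kwargs["max_tokens"] = self.max_tokens
        else:
            # chat_completions (pre-fix): the override is never read.
            if self.max_tokens is not None:
                kwargs["max_tokens"] = self.max_tokens
        return kwargs


agent = ToyAgent("chat_completions")
agent._ephemeral_max_output_tokens = 8192  # what the retry loop would set
retry_kwargs = agent.build_api_kwargs()
print(retry_kwargs)  # prints {}: the boost never reaches the request
```

In this toy model the retry request is identical to the first one, which matches the symptom of all 3 retries failing the same way.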


The fix

Three small changes to run_agent.py:

1. Add the boost logic to the retry loop

Around the if restart_with_length_continuation: block, add the growing budget before continue:

if restart_with_length_continuation:
    _boost_base = self.max_tokens if self.max_tokens else 4096
    _boost = _boost_base * (length_continue_retries + 1)
    self._ephemeral_max_output_tokens = min(_boost, 32768)
    continue

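Retry 1 gets base × 2, retry 2 gets base × 3, capped at 32768 tokens. This assumes length_continue_retries has already been incremented for the current retry when the block runs.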
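The arithmetic of the boost can be checked in isolation. The helper below is a standalone sketch, not code from run_agent.py: `boosted_budget` is a hypothetical name, and `retries` is assumed to be the counter value after it has been incremented for the current retry (1 on the first retry).

```python
def boosted_budget(max_tokens, retries, cap=32768, fallback=4096):
    """Return the output-token budget for a length-continuation retry.

    Mirrors the snippet above: base * (retries + 1), capped at `cap`.
    `fallback` is used when no max_tokens is configured.
    """
    base = max_tokens if max_tokens else fallback
    return min(base * (retries + 1), cap)


# No configured max_tokens: the 4096 fallback grows on each retry.
schedule_default = [boosted_budget(None, r) for r in (1, 2, 3)]
print(schedule_default)  # [8192, 12288, 16384]

# With a 16384 base the cap is reached on the first retry.
schedule_nim = [boosted_budget(16384, r) for r in (1, 2, 3)]
print(schedule_nim)  # [32768, 32768, 32768]
```

Note that with a 16384 base the cap means every retry gets the same 32768 budget, so the growth only matters for smaller bases.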

2. Wire up ephemeral consumption in the chat_completions path

In _build_api_kwargs(), the section that builds max_tokens for chat_completions currently just checks self.max_tokens. Replace it with:

_ephemeral_out = getattr(self, "_ephemeral_max_output_tokens", None)
if _ephemeral_out is not None:
    self._ephemeral_max_output_tokens = None  # consume immediately
    api_kwargs.update(self._max_tokens_param(_ephemeral_out))
elif self.max_tokens is not None:
    api_kwargs.update(self._max_tokens_param(self.max_tokens))
elif "integrate.api.nvidia.com" in self._base_url_lower:
    api_kwargs.update(self._max_tokens_param(16384))
elif self._is_qwen_portal():
    ...

3. Set a sane default for NVIDIA NIM

Even with the boost working, the first call still truncates because nothing sets a max_tokens for NVIDIA NIM upfront. The elif "integrate.api.nvidia.com" line above handles this — sends 16384 as the default so the first call has enough room and the boost rarely needs to kick in at all.
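The resulting priority order can be summarised as a pure function. This is a simplified sketch under stated assumptions: `resolve_max_tokens` is a hypothetical helper, it leaves out the one-shot consumption of the ephemeral value and the Qwen portal branch, and it returns `None` where the real code would fall through to later branches.

```python
NIM_HOST = "integrate.api.nvidia.com"
NIM_DEFAULT_MAX_TOKENS = 16384


def resolve_max_tokens(ephemeral, configured, base_url):
    """Pick the max_tokens value for a chat_completions request.

    Priority: ephemeral retry boost > configured max_tokens >
    NVIDIA NIM default > None (let the provider decide).
    """
    if ephemeral is not None:
        return ephemeral
    if configured is not None:
        return configured
    if NIM_HOST in base_url.lower():
        return NIM_DEFAULT_MAX_TOKENS
    return None


nim_url = "https://integrate.api.nvidia.com/v1"
print(resolve_max_tokens(None, None, nim_url))    # 16384
print(resolve_max_tokens(None, 2048, nim_url))    # 2048
print(resolve_max_tokens(32768, 2048, nim_url))   # 32768
```

A user-configured max_tokens still wins over the NIM default, so the 16384 value only applies when nothing else is set.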


Result

Before: hi → ⚠️ ⚠️ ⚠️ → ❌ dead

After: hi → ✅ response, no warnings

Tested on GLM-4.7 via https://integrate.api.nvidia.com/v1. Should also help any other model on NVIDIA NIM that suffers from the same low default.

kshitijk4poor pushed a commit that referenced this pull request Apr 18, 2026
…NVIDIA NIM default

Based on #12152 by @LVT382009.

Three fixes to run_agent.py:

1. _ephemeral_max_output_tokens consumption in chat_completions path:
   The error-recovery ephemeral override was only consumed in the
   anthropic_messages branch of _build_api_kwargs.  All chat_completions
   providers (OpenRouter, NVIDIA NIM, Qwen, Alibaba, custom, etc.)
   silently ignored it.  Now consumed at highest priority, matching the
   anthropic pattern.

2. NVIDIA NIM max_tokens default (16384):
   NVIDIA NIM falls back to a very low internal default when max_tokens
   is omitted, causing models like GLM-4.7 to truncate immediately
   (thinking tokens exhaust the budget before the response starts).

3. Progressive length-continuation boost:
   When finish_reason='length' triggers a continuation retry, the output
   budget now grows progressively (2x base on retry 1, 3x on retry 2,
   capped at 32768) via _ephemeral_max_output_tokens.  Previously the
   retry loop just re-sent the same token limit on all 3 attempts.

@LVT382009 LVT382009 left a comment


Works perfectly!

teknium1 pushed a commit that referenced this pull request Apr 18, 2026
@teknium1


Merged via PR #12231 (rebase-merge, commit f7af90e on main). Your commit was cherry-picked onto current main with authorship preserved in git log. Thanks for the fix, @LVT382009 — NIM users no longer truncate on first call.

#12231

@LVT382009


np

ulasbilgen pushed a commit to ulasbilgen/hermes-adhd-agent that referenced this pull request May 1, 2026
aj-nt pushed a commit to aj-nt/hermes-agent that referenced this pull request May 1, 2026
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026