Skip to content

[Bug]: NVIDIA Build API modle z-ai/glm4.7 returns Thinking Budget Exhausted #7729

@an-gith

Description

@an-gith

Bug Description

When usingNVIDIA Build provider with model z-ai/glm4.7 , the API returns Thinking Budget Exhausted error, The model ID works on OpenClaw, but Hermes cannot callit successfully.

Steps to Reproduce

  1. Run hermes setup and configure custom with NVIDIA Build API key

  2. Select model z-ai/glm4.7

  3. Try to chat: hernes chat or via gateway

  4. Error: ⚠️ Response truncated (finish_reason='length') - model hit max output tokens
    💭 Reasoning exhausted the output token budget — no visible response was produced.
    ─ ⚕ Hermes ─────────────────────────────────────────────────────────────

    ⚠️ Thinking Budget Exhausted

    The model used all its output tokens on reasoning and had none left
    for the actual response.

    To fix this:
    → Lower reasoning effort: /thinkon low or /thinkon minimal
    → Increase the output token limit: set model.max_tokens in
    config.yaml
    5.return to Ollama(cloud) model minimax-m2.7 ,changing nothing,it works

Expected Behavior

⚠️ Response truncated (finish_reason='length') - model hit max output tokens
💭 Reasoning exhausted the output token budget — no visible response was produced.
─ ⚕ Hermes ─────────────────────────────────────────────────────────────

 ⚠️ **Thinking Budget Exhausted**                                         
                                                                          
 The model used all its output tokens on reasoning and had none left      
 for the actual response.                                                 
                                                                          
 To fix this:                                                             
 → Lower reasoning effort: `/thinkon low` or `/thinkon minimal`           
 → Increase the output token limit: set `model.max_tokens` in             
 config.yaml

Actual Behavior

Configuration (configyaml, env, hermes setup)

Affected Component

Setup / Installation

Messaging Platform (if gateway-related)

No response

Operating System

docker in synology(DSM 7.2.2-72806) docker

Python Version

3.13.5

Hermes Version

Hermes Agent v0.8.0 (2026.4.8)

Relevant Logs / Traceback

⚠️  Response truncated (finish_reason='length') - model hit max output tokens
  💭 Reasoning exhausted the output token budget — no visible response was produced.
   ─  ⚕ Hermes  ───────────────────────────────────────────────────────────── 
                                                                              
     ⚠️ **Thinking Budget Exhausted**                                         
                                                                              
     The model used all its output tokens on reasoning and had none left      
     for the actual response.                                                 
                                                                              
     To fix this:                                                             
     → Lower reasoning effort: `/thinkon low` or `/thinkon minimal`           
     → Increase the output token limit: set `model.max_tokens` in             
     config.yaml

Root Cause Analysis (optional)

No response

Proposed Fix (optional)

No response

Are you willing to submit a PR for this?

  • I'd like to fix this myself and submit a PR

Metadata

Metadata

Assignees

No one assigned

    Labels

    type/bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions