Bug Description
When using OpenRouter-compatible providers (Nous Research inference, OpenRouter itself), the parse_available_output_tokens_from_error() function in agent/model_metadata.py fails to detect "output cap too large" errors, which sends Hermes into an infinite loop:
- User configures max_tokens larger than context_length - input_tokens
- Provider returns a 400 error with the output cap clearly stated
- Hermes classifies it as context_overflow (correct)
- But parse_available_output_tokens_from_error() returns None: it only recognizes Anthropic's "available_tokens" keyword, not OpenRouter's "X in the output" format
- Hermes falls through to the input-too-large recovery path (compression)
- On a fresh session with 1 message, compression cannot reduce anything
- Returns "Context length exceeded (4,882 tokens). Cannot compress further."
- Gateway auto-resets the session
- Next message → same error → infinite loop
The user sees "Session auto-reset" repeatedly, and /new has no effect because the root cause is the max_tokens config value, not the session history.
Steps to Reproduce
- Configure Hermes to use an OpenRouter/Nous provider with a model that has a 256K context window (e.g., stepfun/step-3.7-flash:free)
- Set max_tokens: 262000 in config.yaml (exceeding context_length - input_tokens)
- Send ANY message — even "hi" in a brand-new session
- Observe: API returns 400, Hermes attempts compression, fails, auto-resets the session, loops forever
Actual API Error (from Nous Research)
Error code: 400 - {
"status": 400,
"message": "This request is not valid. Check the model name and other parameters.
Additional info: This endpoint's maximum context length is 256000 tokens.
However, you requested about 281093 tokens
(5683 of text input, 13410 of tool input, 262000 in the output).
Please reduce the length of either one, or use the context-compression plugin
to compress your prompt automatically."
}
Breakdown:
| Component | Tokens |
| --- | --- |
| Text input (system prompt + user msg) | 5,683 |
| Tool schemas | 13,410 |
| max_tokens (output) | 262,000 |
| Total requested | 281,093 |
| Model context window | 256,000 |
| Exceeds by | 25,093 |
Note: The input (5683 + 13410 = 19093) is NOT the problem. The output cap (262000) is.
Expected Behavior
Hermes should:
- Parse the OpenRouter/Nous error format: extract maximum context length (256000), text input (5683), tool input (13410), and output (262000)
- Detect this is an output-cap error (max_tokens is the issue, not input)
- Calculate available_output = context_length - text_input - tool_input
- Auto-reduce max_tokens to available_output - safety_margin and retry
- Warn the user: "Output cap (262000) too large for remaining context. Auto-reduced to ~236907."
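The steps above can be sketched as a small pure function that works on the numbers already parsed from the error. This is only an illustration: the names plan_output_cap_retry and RetryPlan and the 1024-token safety_margin default are hypothetical, since the issue does not specify a margin or any Hermes API for this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RetryPlan:
    max_tokens: int  # reduced output cap to use on the retry
    warning: str     # message to surface to the user


def plan_output_cap_retry(
    context_length: int,
    text_input: int,
    tool_input: int,
    requested_output: int,
    safety_margin: int = 1024,  # assumed value; the issue does not specify one
) -> Optional[RetryPlan]:
    """Return a retry plan if lowering max_tokens alone can fix the overflow."""
    available_output = context_length - text_input - tool_input
    if available_output < 1:
        # The input alone fills the window: a genuine input overflow,
        # so the caller should fall back to compression instead.
        return None
    if requested_output <= available_output:
        # The output cap already fits; nothing to reduce.
        return None
    new_max = max(1, available_output - safety_margin)
    warning = (
        f"Output cap ({requested_output}) too large for remaining context. "
        f"Auto-reduced to {new_max}."
    )
    return RetryPlan(max_tokens=new_max, warning=warning)


# Numbers from the Nous Research error in this report.
plan = plan_output_cap_retry(256000, 5683, 13410, 262000)
print(plan)
```

Returning None in the input-overflow case keeps the existing compression path as the fallback, so only errors that a smaller max_tokens can actually resolve take the new retry route.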
Affected Component
Root Cause Analysis
Primary: Missing error format in parse_available_output_tokens_from_error()
File: agent/model_metadata.py, line ~958
The current is_output_cap_error guard:
    is_output_cap_error = (
        "max_tokens" in error_lower
        and ("available_tokens" in error_lower or "available tokens" in error_lower)
    )
This only matches Anthropic-style errors ("... = available_tokens: 10000"). OpenRouter/Nous use the format:
(N of text input, M of tool input, K in the output)
This format appears in errors from at least:
- Nous Research inference API (inference-api.nousresearch.com)
- OpenRouter (openrouter.ai), same proxy software
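To see the miss concretely, the guard quoted above can be run against both message styles. The Anthropic-style string below is an illustrative stand-in built around the fragment quoted in this report, not a verbatim provider response; the Nous string is abbreviated from the actual error shown earlier.

```python
def is_output_cap_error(error_message: str) -> bool:
    """Copy of the current guard as quoted in this report."""
    error_lower = error_message.lower()
    return (
        "max_tokens" in error_lower
        and ("available_tokens" in error_lower or "available tokens" in error_lower)
    )


# Illustrative Anthropic-style text (assumed wording, not a verbatim response).
anthropic_style = (
    "max_tokens: 32000 > context_window: 200000 "
    "- input_tokens: 190000 = available_tokens: 10000"
)

# Abbreviated from the Nous Research error shown earlier.
nous_style = (
    "This endpoint's maximum context length is 256000 tokens. "
    "However, you requested about 281093 tokens "
    "(5683 of text input, 13410 of tool input, 262000 in the output)."
)

print(is_output_cap_error(anthropic_style))  # True
print(is_output_cap_error(nous_style))       # False
```

The Nous message fails both halves of the condition: it contains neither "max_tokens" nor "available_tokens".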
Secondary: Bad UX — error message is misleading
The final error message seen by the user is:
Context length exceeded (4,882 tokens). Cannot compress further.
💡 The conversation has accumulated too much content.
Try /new to start fresh, or /compress to manually trigger compression.
This message implies that the conversation history is too long, yet 4,882 tokens is tiny. The real fix is to reduce max_tokens, but the user has no way to know this, and /new doesn't reset max_tokens, so the loop continues.
Relevant Architecture Issue
Issue #9181 ("Architecture: separate base vs effective context in overflow recovery") tracks the broader design problem: overflow recovery conflates input overflow with output-cap overflow. This bug is a concrete, reproducible instance of that conflation causing an infinite loop.
Proposed Fix
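Add OpenRouter/Nous error format support to parse_available_output_tokens_from_error(). The OpenRouter/Nous message quoted above does not contain the string "max_tokens", so the new pattern has to be accepted on its own rather than ANDed with that check:
    # In the is_output_cap_error check (requires "import re"):
    is_output_cap_error = (
        # Anthropic style: "max_tokens ... available_tokens: N"
        (
            "max_tokens" in error_lower
            and (
                "available_tokens" in error_lower
                or "available tokens" in error_lower
            )
        )
        # OpenRouter/Nous style: "... 262000 in the output"
        or re.search(r'\d+\s+in\s+the\s+output', error_lower) is not None
    )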
And extract the available output tokens from the OpenRouter format:
    # New extraction for OpenRouter/Nous format:
    # "5683 of text input, 13410 of tool input, 262000 in the output"
    # + "maximum context length is 256000"
    or_match = re.search(
        r'maximum\s+context\s+length\s+is\s+(\d+).*?'
        r'(\d+)\s+of\s+text\s+input.*?'
        r'(\d+)\s+of\s+tool\s+input',
        error_lower,
        re.DOTALL,  # let ".*?" cross newlines if the message is multi-line
    )
    if or_match:
        max_ctx = int(or_match.group(1))
        text_input = int(or_match.group(2))
        tool_input = int(or_match.group(3))
        available = max_ctx - text_input - tool_input
        if available >= 1:
            return available
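The extraction can be exercised in isolation before wiring it into Hermes. The following is a minimal sketch under stated assumptions: the helper name available_output_from_openrouter_error is hypothetical, and re.DOTALL is passed on the assumption that the provider message may contain literal newlines, as the error printed in this report suggests.

```python
import re
from typing import Optional

# Hypothetical standalone helper; in Hermes this logic would live inside
# parse_available_output_tokens_from_error().
_OPENROUTER_OVERFLOW = re.compile(
    r"maximum\s+context\s+length\s+is\s+(\d+).*?"
    r"(\d+)\s+of\s+text\s+input.*?"
    r"(\d+)\s+of\s+tool\s+input",
    re.DOTALL,  # assumption: the message may contain literal newlines
)


def available_output_from_openrouter_error(error_message: str) -> Optional[int]:
    """Return context_length - text_input - tool_input, or None if unparseable."""
    match = _OPENROUTER_OVERFLOW.search(error_message.lower())
    if match is None:
        return None
    max_ctx, text_input, tool_input = (int(g) for g in match.groups())
    available = max_ctx - text_input - tool_input
    # A non-positive result means the input itself overflows the window.
    return available if available >= 1 else None


nous_error = (
    "This endpoint's maximum context length is 256000 tokens.\n"
    "However, you requested about 281093 tokens\n"
    "(5683 of text input, 13410 of tool input, 262000 in the output)."
)
print(available_output_from_openrouter_error(nous_error))  # 236907
print(available_output_from_openrouter_error("rate limit exceeded"))  # None
```

The pattern still requires the "of tool input" segment; whether the provider omits it for requests without tools is not stated in this report, so that case is left unhandled here.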
Additional Notes
Workaround on the config side: set max_tokens to at most context_length minus roughly 20K (to leave room for the system prompt and tool schemas), or remove it entirely to let the provider default apply.
Environment
- OS: Ubuntu 24.04 (server), Telegram gateway
- Hermes version: 0.15.x (installed from source at /usr/local/lib/hermes-agent)
- Model: stepfun/step-3.7-flash:free via Nous Research
- Provider: nous (OpenRouter-compatible protocol)
Are you willing to submit a PR for this?