fix(compression): sanitize malformed tool messages + auto-recover on API 400 by franksong2702 · Pull Request #18650 · NousResearch/hermes-agent

franksong2702 · 2026-05-02T03:46:42Z

Problem

Multi-pass context compression on long sessions (155+ messages) can produce tool-role messages with tool_call_id set to None, empty string, or missing entirely. The existing _sanitize_tool_pairs method handles two failure modes but misses this third one — these malformed messages slip through and reach the API, causing HTTP 400:

'tool_call_id' is not set

Once a session gets corrupted this way, it's permanently stuck — every subsequent API call fails with the same 400, and since the error classifier treats it as a generic format_error (non-retryable), it falls back to the same broken messages. The user sees Primary model failed — switching to fallback repeatedly until they manually run /new.

Reproduction: Long WeChat session (155 msgs, ~33K tokens) → 5+ compression passes → MiMo API rejects every subsequent request with 400. Workaround is /new to start a fresh session.

Fixes #16472 — Tool call ID not invalidated after 400 error (same permanent-failure loop from malformed tool messages).

Fixes #4662 — Malformed persisted tool calls poisoning sessions (our Part 1 prevents corruption at compression time, Part 2 auto-recovers when corruption is detected via API 400).

Also mitigates:

Session compression corrupts tool_call arguments, truncating JSON to 214 chars #12731 — Session compression corrupts tool_call arguments (compression sanitizer now catches missing IDs)
Truncated tool_call.arguments in conversation history wedges the retry loop #14443 — Truncated tool_call.arguments wedges the retry loop (retry loop now auto-sanitizes instead of looping forever)
run_agent sanitizer misclassifies tool results when tool_call_id has surrounding whitespace #9999 — sanitizer misclassifies tool results with whitespace in tool_call_id (different bug, but our sanitizer improvements reduce the attack surface)

Root Cause

Two-part chain:

Part 1 — Compression produces malformed messages:
_sanitize_tool_pairs only collects tool_call_id values that are truthy. Tool messages with falsy tool_call_id never enter the orphan-detection set, so the filter never matches them.

Part 2 — Error classifier doesn't recognize tool format errors:
The _classify_400 function only triggers compression for context_length_exceeded patterns. A 400 like tool_call_id is not set falls through to format_error → non-retryable → fallback → same broken messages → still 400 → permanent loop.

Fix

Part 1: Prevent corruption (context_compressor.py)

Added Step 0 to _sanitize_tool_pairs: filter out tool-role messages where tool_call_id is falsy before the existing orphan checks run.

Part 2: Auto-recover from corruption (error_classifier.py + run_agent.py)

New FailoverReason.tool_message_malformed enum value
New should_sanitize_tools flag on ClassifiedError
New _TOOL_CALL_MALFORMED_PATTERNS matching tool_call_id, function call output, etc.
New _sanitize_tool_messages_for_retry() method on AIAgent that removes malformed/orphaned tool messages in-place
Retry loop checks classified.should_sanitize_tools and auto-recovers instead of falling back

Tests

6 new tests across 2 test files:

test_sanitizer_removes_tool_messages_with_none_tool_call_id
test_sanitizer_removes_tool_messages_with_empty_tool_call_id
test_sanitizer_removes_all_orphaned_when_no_assistant_calls
test_400_tool_call_id_malformed (error classifier)
test_400_no_tool_call_found_for_output (error classifier)
test_enum_members_exist (updated for new enum value)

All 125+ tests pass.

…all_id Sanitizer handled orphaned tool results (no matching tool_call) and missing tool results (tool_call without result), but tool messages with tool_call_id=None, '', or missing entirely slipped through both checks and reached the API, causing HTTP 400 'tool_call_id is not set' errors. This manifests on providers like MiMo after multi-pass compression on long sessions (155+ messages), where message copying or provider formatting can produce tool messages without a valid tool_call_id. Adds Step 0 to _sanitize_tool_pairs: filter out tool-role messages where tool_call_id is falsy before the existing orphan checks run. Includes 3 new unit tests covering None, empty string, and missing field cases.

Second half of the compression corruption fix. When the API rejects a request with 400 containing 'tool_call_id' or 'function call output' patterns, the error classifier now recognizes this as tool_message_malformed (retryable + should_sanitize_tools) instead of a generic non-retryable format_error. The retry loop runs _sanitize_tool_messages_for_retry() which: 1. Removes tool messages with missing/null/empty tool_call_id 2. Removes orphaned tool results (valid id but no matching tool_call) 3. Retries the API call with cleaned messages This recovers sessions that got permanently stuck after a bad compression pass produced malformed tool messages. Previously such sessions required manual /new — now they self-heal.

Community PRs applied: - NousResearch#18596: Enable secret redaction by default (SECURITY) - NousResearch#18650: Sanitize malformed tool messages + auto-recover on API 400 - NousResearch#18607: Emergency compression before max_iterations exhaustion - NousResearch#18603: Compression fallback to main model on 413 rate limit - NousResearch#18638: Pass threshold_percent on model switch - NousResearch#18663: Strip extra_content from tool_calls for strict APIs - NousResearch#18618: Forward explicit_api_key to OpenRouter - NousResearch#18632: Show cache tokens in /insights breakdown - NousResearch#18614: Add idempotency guard for patch duplicate loops - NousResearch#18600: Raise ValueError when HERMES_HOME unset in profile mode - NousResearch#18616: Allow ZWJ emoji in context files - NousResearch#18582: Reload .env on /restart - NousResearch#18547: Stabilize system prompt prefix for KV cache reuse - NousResearch#18692: Strip FTS5 operators from session search truncation terms Fix: Add order_by_last_active=True to list_sessions_rich call (pre-existing commit 142b4bf code sync)

teknium1 · 2026-06-10T12:32:49Z

This looks implemented on current main by the later pre-call message repair path, so this PR is now stale as a separate fix. This is an automated hermes-sweeper review.

Evidence:

agent/conversation_loop.py:579 sanitizes tool-call arguments, then agent/conversation_loop.py:598 calls _repair_message_sequence(messages) before the API payload is built.
agent/agent_runtime_helpers.py:399 keeps a tool message only when tool_call_id is truthy and matches the current assistant tool-call IDs; missing, None, empty-string, and orphaned tool results are dropped in-place before request construction.
agent/conversation_loop.py:692 applies the second per-call safety net, _sanitize_api_messages(api_messages), and agent/agent_runtime_helpers.py:1928 removes orphaned tool results / injects stubs for missing paired results before dispatch.
The direct summary path is covered too: agent/chat_completion_helpers.py:1340 calls _sanitize_api_messages(api_messages) before the summary API call.
The main-line repair landed in 812ce0b9878d1dc9ac1f7c419a620deeb57117f3 (fix(run_agent): break permanent empty-response loop from orphan tool-tail (#21385)), contained in v2026.5.16 and later tags.

The exact enum/should_sanitize_tools recovery API proposed here was not adopted, but the user-facing failure mode this PR targets is covered by the current pre-call sanitizer on main.

alt-glitch added type/bug Something isn't working P1 High — major feature broken, no workaround comp/agent Core agent loop, run_agent.py, prompt builder labels May 2, 2026

franksong2702 changed the title ~~fix(compression): remove tool messages with missing/null/empty tool_call_id~~ fix(compression): sanitize malformed tool messages + auto-recover on API 400 May 2, 2026

This was referenced May 2, 2026

Bug: Tool call ID not invalidated after 400 error — causes persistent failures across sessions #16472

Open

[Bug]: Malformed persisted tool calls can poison a session and cause repeated 400 errors on subsequent requests #4662

Closed

alt-glitch mentioned this pull request May 16, 2026

[Bug]: Tool Result Contamination Causes Persistent HTTP 400 Error Loop #27033

Open

1 task

vynxevainglory-ai mentioned this pull request May 20, 2026

fix(agent): make tool error messages ephemeral to prevent HTTP 400 replay loop #29436

Open

teknium1 closed this Jun 10, 2026

teknium1 added the sweeper:implemented-on-main Sweeper: behavior already present on current main label Jun 10, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(compression): sanitize malformed tool messages + auto-recover on API 400#18650

fix(compression): sanitize malformed tool messages + auto-recover on API 400#18650
franksong2702 wants to merge 2 commits into
NousResearch:mainfrom
franksong2702:fix/orphan-tool-call-id-sanitizer

franksong2702 commented May 2, 2026 •

edited

Loading

Uh oh!

teknium1 commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

franksong2702 commented May 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Root Cause

Fix

Part 1: Prevent corruption (context_compressor.py)

Part 2: Auto-recover from corruption (error_classifier.py + run_agent.py)

Tests

Uh oh!

teknium1 commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

franksong2702 commented May 2, 2026 •

edited

Loading