Skip to content

fix(compression): sanitize malformed tool messages + auto-recover on API 400#18650

Closed
franksong2702 wants to merge 2 commits into
NousResearch:mainfrom
franksong2702:fix/orphan-tool-call-id-sanitizer
Closed

fix(compression): sanitize malformed tool messages + auto-recover on API 400#18650
franksong2702 wants to merge 2 commits into
NousResearch:mainfrom
franksong2702:fix/orphan-tool-call-id-sanitizer

Conversation

@franksong2702

@franksong2702 franksong2702 commented May 2, 2026

Copy link
Copy Markdown

Problem

Multi-pass context compression on long sessions (155+ messages) can produce tool-role messages with tool_call_id set to None, empty string, or missing entirely. The existing _sanitize_tool_pairs method handles two failure modes but misses this third one — these malformed messages slip through and reach the API, causing HTTP 400:

'tool_call_id' is not set

Once a session gets corrupted this way, it's permanently stuck — every subsequent API call fails with the same 400, and since the error classifier treats it as a generic format_error (non-retryable), it falls back to the same broken messages. The user sees Primary model failed — switching to fallback repeatedly until they manually run /new.

Reproduction: Long WeChat session (155 msgs, ~33K tokens) → 5+ compression passes → MiMo API rejects every subsequent request with 400. Workaround is /new to start a fresh session.

Fixes #16472 — Tool call ID not invalidated after 400 error (same permanent-failure loop from malformed tool messages).

Fixes #4662 — Malformed persisted tool calls poisoning sessions (our Part 1 prevents corruption at compression time, Part 2 auto-recovers when corruption is detected via API 400).

Also mitigates:

Root Cause

Two-part chain:

Part 1 — Compression produces malformed messages:
_sanitize_tool_pairs only collects tool_call_id values that are truthy. Tool messages with falsy tool_call_id never enter the orphan-detection set, so the filter never matches them.

Part 2 — Error classifier doesn't recognize tool format errors:
The _classify_400 function only triggers compression for context_length_exceeded patterns. A 400 like tool_call_id is not set falls through to format_error → non-retryable → fallback → same broken messages → still 400 → permanent loop.

Fix

Part 1: Prevent corruption (context_compressor.py)

Added Step 0 to _sanitize_tool_pairs: filter out tool-role messages where tool_call_id is falsy before the existing orphan checks run.

Part 2: Auto-recover from corruption (error_classifier.py + run_agent.py)

  • New FailoverReason.tool_message_malformed enum value
  • New should_sanitize_tools flag on ClassifiedError
  • New _TOOL_CALL_MALFORMED_PATTERNS matching tool_call_id, function call output, etc.
  • New _sanitize_tool_messages_for_retry() method on AIAgent that removes malformed/orphaned tool messages in-place
  • Retry loop checks classified.should_sanitize_tools and auto-recovers instead of falling back

Tests

6 new tests across 2 test files:

  • test_sanitizer_removes_tool_messages_with_none_tool_call_id
  • test_sanitizer_removes_tool_messages_with_empty_tool_call_id
  • test_sanitizer_removes_all_orphaned_when_no_assistant_calls
  • test_400_tool_call_id_malformed (error classifier)
  • test_400_no_tool_call_found_for_output (error classifier)
  • test_enum_members_exist (updated for new enum value)

All 125+ tests pass.

…all_id

Sanitizer handled orphaned tool results (no matching tool_call) and
missing tool results (tool_call without result), but tool messages with
tool_call_id=None, '', or missing entirely slipped through both checks
and reached the API, causing HTTP 400 'tool_call_id is not set' errors.

This manifests on providers like MiMo after multi-pass compression on
long sessions (155+ messages), where message copying or provider
formatting can produce tool messages without a valid tool_call_id.

Adds Step 0 to _sanitize_tool_pairs: filter out tool-role messages
where tool_call_id is falsy before the existing orphan checks run.
Includes 3 new unit tests covering None, empty string, and missing
field cases.
@alt-glitch alt-glitch added type/bug Something isn't working P1 High — major feature broken, no workaround comp/agent Core agent loop, run_agent.py, prompt builder labels May 2, 2026
Second half of the compression corruption fix. When the API rejects a
request with 400 containing 'tool_call_id' or 'function call output'
patterns, the error classifier now recognizes this as
tool_message_malformed (retryable + should_sanitize_tools) instead of
a generic non-retryable format_error.

The retry loop runs _sanitize_tool_messages_for_retry() which:
1. Removes tool messages with missing/null/empty tool_call_id
2. Removes orphaned tool results (valid id but no matching tool_call)
3. Retries the API call with cleaned messages

This recovers sessions that got permanently stuck after a bad
compression pass produced malformed tool messages. Previously such
sessions required manual /new — now they self-heal.
@franksong2702 franksong2702 changed the title fix(compression): remove tool messages with missing/null/empty tool_call_id fix(compression): sanitize malformed tool messages + auto-recover on API 400 May 2, 2026
Cyrene963 pushed a commit to Cyrene963/hermes-agent that referenced this pull request May 3, 2026
Community PRs applied:
- NousResearch#18596: Enable secret redaction by default (SECURITY)
- NousResearch#18650: Sanitize malformed tool messages + auto-recover on API 400
- NousResearch#18607: Emergency compression before max_iterations exhaustion
- NousResearch#18603: Compression fallback to main model on 413 rate limit
- NousResearch#18638: Pass threshold_percent on model switch
- NousResearch#18663: Strip extra_content from tool_calls for strict APIs
- NousResearch#18618: Forward explicit_api_key to OpenRouter
- NousResearch#18632: Show cache tokens in /insights breakdown
- NousResearch#18614: Add idempotency guard for patch duplicate loops
- NousResearch#18600: Raise ValueError when HERMES_HOME unset in profile mode
- NousResearch#18616: Allow ZWJ emoji in context files
- NousResearch#18582: Reload .env on /restart
- NousResearch#18547: Stabilize system prompt prefix for KV cache reuse
- NousResearch#18692: Strip FTS5 operators from session search truncation terms

Fix: Add order_by_last_active=True to list_sessions_rich call
(pre-existing commit 142b4bf code sync)
@teknium1

Copy link
Copy Markdown
Contributor

This looks implemented on current main by the later pre-call message repair path, so this PR is now stale as a separate fix. This is an automated hermes-sweeper review.

Evidence:

  • agent/conversation_loop.py:579 sanitizes tool-call arguments, then agent/conversation_loop.py:598 calls _repair_message_sequence(messages) before the API payload is built.
  • agent/agent_runtime_helpers.py:399 keeps a tool message only when tool_call_id is truthy and matches the current assistant tool-call IDs; missing, None, empty-string, and orphaned tool results are dropped in-place before request construction.
  • agent/conversation_loop.py:692 applies the second per-call safety net, _sanitize_api_messages(api_messages), and agent/agent_runtime_helpers.py:1928 removes orphaned tool results / injects stubs for missing paired results before dispatch.
  • The direct summary path is covered too: agent/chat_completion_helpers.py:1340 calls _sanitize_api_messages(api_messages) before the summary API call.
  • The main-line repair landed in 812ce0b9878d1dc9ac1f7c419a620deeb57117f3 (fix(run_agent): break permanent empty-response loop from orphan tool-tail (#21385)), contained in v2026.5.16 and later tags.

The exact enum/should_sanitize_tools recovery API proposed here was not adopted, but the user-facing failure mode this PR targets is covered by the current pre-call sanitizer on main.

@teknium1 teknium1 closed this Jun 10, 2026
@teknium1 teknium1 added the sweeper:implemented-on-main Sweeper: behavior already present on current main label Jun 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder P1 High — major feature broken, no workaround sweeper:implemented-on-main Sweeper: behavior already present on current main type/bug Something isn't working

Projects

None yet

3 participants