fix(session-repair): strip malformed tool_use blocks to prevent permanent session corruption #6687

Closed
NSEvent wants to merge 6 commits into openclaw:main from NSEvent:fix/session-corruption-malformed-tool-use

Conversation

@NSEvent commented Feb 1, 2026

Summary

  • Strips malformed tool_use/toolCall/functionCall blocks from assistant messages BEFORE the existing pairing repair runs
  • Adds droppedMalformedToolUseCount to the repair report for observability
  • Prevents creating synthetic error results for blocks that were never valid tool calls

Problem

When tool calls are interrupted (by error, timeout, content filtering, or process termination), sessions become permanently corrupted. Every subsequent API request fails with:

  • unexpected tool_use_id found in tool_result blocks
  • tool result's tool id not found (2013)

Root cause: The existing extractToolCallsFromAssistant() skips malformed blocks (missing id) but leaves them in the message content. The blocks remain in the transcript, causing API rejections.

Solution

Add a pre-processing step that strips malformed tool_use blocks before the pairing repair runs:

Malformed conditions detected:

  • Missing or empty id field (tool call wasn't fully initialized)
  • Has partialJson field present (Anthropic SDK streaming artifact); detected with a property presence check ("partialJson" in rec) so it is caught regardless of value
  • Has partial field set to true (generic streaming indicator)
  • Has incomplete field set to true (OpenAI-style indicator)

Type variants supported (via shared TOOL_BLOCK_TYPES Set):

  • camelCase: toolCall, toolUse, functionCall
  • snake_case: tool_use, function_call

Both isValidToolUseBlock and extractToolCallsFromAssistant use the same TOOL_BLOCK_TYPES Set to ensure consistent handling across validation and extraction.

The name field is intentionally NOT required: extractToolCallsFromAssistant already handles missing names gracefully by defaulting to undefined.
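
The detection rules and type variants listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: only TOOL_BLOCK_TYPES, isValidToolUseBlock, and droppedMalformedToolUseCount are names taken from this description. The Block type and the isToolBlock and stripMalformedToolUse helpers are hypothetical, and the real block shapes in the codebase may differ.

```typescript
// Sketch only. Block, isToolBlock and stripMalformedToolUse are hypothetical names.
type Block = Record<string, unknown>;

// Shared set of tool block type variants (camelCase and snake_case).
const TOOL_BLOCK_TYPES: ReadonlySet<string> = new Set([
  "toolCall",
  "toolUse",
  "functionCall",
  "tool_use",
  "function_call",
]);

// True when the value is an object whose `type` is one of the tool block variants.
function isToolBlock(block: unknown): block is Block {
  if (typeof block !== "object" || block === null) return false;
  const type = (block as Block).type;
  return typeof type === "string" && TOOL_BLOCK_TYPES.has(type);
}

// Applies the four malformed conditions; `name` is deliberately not checked.
function isValidToolUseBlock(rec: Block): boolean {
  const id = rec.id;
  if (typeof id !== "string" || id.length === 0) return false; // missing or empty id
  if ("partialJson" in rec) return false; // presence check, regardless of value
  if (rec.partial === true) return false; // strict boolean check
  if (rec.incomplete === true) return false; // strict boolean check
  return true;
}

// Removes malformed tool blocks from one assistant message's content array
// and reports how many were dropped. Non-tool blocks are always kept.
function stripMalformedToolUse(content: unknown[]): {
  content: unknown[];
  droppedMalformedToolUseCount: number;
} {
  const kept = content.filter((b) => !isToolBlock(b) || isValidToolUseBlock(b));
  return {
    content: kept,
    droppedMalformedToolUseCount: content.length - kept.length,
  };
}
```

Running the strip step before extraction means the pairing repair only ever sees blocks with a usable id, so no synthetic result is created for a block that was never a valid tool call.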

Design decisions

  • Shared constant: TOOL_BLOCK_TYPES is a Set used by both functions to ensure consistency.
  • Property presence vs value check: For partialJson, we use "partialJson" in rec rather than !== undefined because the mere presence of this field (even if explicitly undefined) indicates a streaming artifact.
  • Strict boolean checks for partial/incomplete: We use === true rather than a looser check. A presence check (as used for partialJson) would flag falsy values like 0, "", or null, and a truthy check would flag non-boolean values like 1 or "false"; none of these indicate a partial tool call.
  • Expanded logging: All non-zero repair counters are now logged (malformed stripped, orphans dropped, duplicates dropped, synthetic results added) for easier debugging.
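
To make the difference between the check styles concrete, here is a small hedged demonstration. The helper names (hasPartialJsonKey, hasPartialJsonValue, isFlagged) and the sample records are made up for illustration and are not from the PR.

```typescript
// Illustration only; helper names and sample records are hypothetical.
type Rec = Record<string, unknown>;

// Presence check: true whenever the key exists, even if its value is undefined.
const hasPartialJsonKey = (rec: Rec): boolean => "partialJson" in rec;

// Value check: misses a key that is present but explicitly undefined.
const hasPartialJsonValue = (rec: Rec): boolean => rec.partialJson !== undefined;

// Strict boolean check: only the literal `true` counts as a partial indicator.
const isFlagged = (rec: Rec, key: "partial" | "incomplete"): boolean =>
  rec[key] === true;

const explicitUndefined: Rec = { type: "tool_use", id: "call_1", partialJson: undefined };
const absent: Rec = { type: "tool_use", id: "call_2" };
const stringFlag: Rec = { type: "tool_use", id: "call_3", partial: "false" };

console.log(hasPartialJsonKey(explicitUndefined)); // true
console.log(hasPartialJsonValue(explicitUndefined)); // false
console.log(hasPartialJsonKey(absent)); // false
console.log(isFlagged(stringFlag, "partial")); // false
console.log(Boolean(stringFlag.partial)); // true, so a truthy check would flag it
```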

Test plan

  • Added comprehensive tests for malformed block detection
  • Existing tests pass (pnpm test src/agents/session-transcript-repair.test.ts)
  • Full test suite passes (pnpm test)
  • Lint passes (pnpm lint)

Fixes #5497, #5481, #5430, #5518

🤖 Generated with Claude Code

Greptile Overview

Greptile Summary

This PR hardens session transcript repair by stripping malformed assistant tool blocks (e.g., missing/empty id or streaming artifacts like partialJson, partial: true, incomplete: true) before the existing tool-call/tool-result pairing logic runs. It also unifies tool-block type detection across validation and extraction via a shared TOOL_BLOCK_TYPES set (supporting both camelCase and snake_case variants), adds droppedMalformedToolUseCount to the repair report for observability, and updates the Google embedded runner to log non-zero repair counters.

The change integrates cleanly with existing transcript sanitation: sanitizeToolUseResultPairing() now delegates to repairToolUseResultPairing(), which first cleans assistant content and then enforces strict provider requirements by moving matching toolResult messages directly after the corresponding assistant tool-call turn, dropping orphan/duplicate results, and synthesizing missing results only for valid tool calls.
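
The pairing behavior described above (move matching results next to their tool call, drop orphans and duplicates, synthesize missing results) might look something like the sketch below. It is an assumption-laden illustration, not the real repairToolUseResultPairing(): the message shapes, the function name repairPairingSketch, and the counter names other than droppedMalformedToolUseCount are hypothetical. It also assumes malformed blocks were already stripped and that tool call ids are unique.

```typescript
// Hypothetical message shapes; the real transcript types may differ.
type AssistantBlock =
  | { type: "text"; text: string }
  | { type: "toolCall"; id: string; name?: string };

type ToolResultMsg = {
  role: "toolResult";
  toolCallId: string;
  content: string;
  isError?: boolean;
};

type Msg =
  | { role: "user"; content: string }
  | { role: "assistant"; content: AssistantBlock[] }
  | ToolResultMsg;

interface PairingReport {
  messages: Msg[];
  droppedOrphanCount: number;
  droppedDuplicateCount: number;
  addedSyntheticCount: number;
}

function repairPairingSketch(messages: Msg[]): PairingReport {
  // Pass 1: keep the first result per tool call id, count later ones as duplicates.
  const firstResult = new Map<string, ToolResultMsg>();
  let droppedDuplicateCount = 0;
  for (const m of messages) {
    if (m.role !== "toolResult") continue;
    if (firstResult.has(m.toolCallId)) droppedDuplicateCount += 1;
    else firstResult.set(m.toolCallId, m);
  }

  // Pass 2: rebuild the transcript, emitting each result directly after its call.
  const out: Msg[] = [];
  const paired = new Set<string>();
  let addedSyntheticCount = 0;
  for (const m of messages) {
    if (m.role === "toolResult") continue; // re-emitted next to its call, or dropped
    out.push(m);
    if (m.role !== "assistant") continue;
    for (const block of m.content) {
      if (block.type !== "toolCall") continue;
      const result = firstResult.get(block.id);
      if (result !== undefined) {
        out.push(result);
      } else {
        // No result was recorded: add a synthetic error result for this valid call.
        out.push({
          role: "toolResult",
          toolCallId: block.id,
          content: "[synthetic] tool call was interrupted",
          isError: true,
        });
        addedSyntheticCount += 1;
      }
      paired.add(block.id);
    }
  }

  // Results whose id matched no tool call are orphans and were not re-emitted.
  let droppedOrphanCount = 0;
  firstResult.forEach((_result, id) => {
    if (!paired.has(id)) droppedOrphanCount += 1;
  });

  return { messages: out, droppedOrphanCount, droppedDuplicateCount, addedSyntheticCount };
}
```

A caller could then log whichever of these counters are non-zero, which is the kind of observability the expanded logging aims for.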

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk.
  • Changes are narrowly scoped to transcript sanitation, include defensive runtime checks, preserve message ordering/metadata, and are covered by targeted unit tests that exercise the new malformed-block stripping and reporting behavior.
  • No files require special attention

Context used:

  • Context from dashboard - CLAUDE.md (source)
  • Context from dashboard - AGENTS.md (source)

@openclaw-barnacle bot added the "agents" (Agent runtime and tooling) label Feb 1, 2026
NSEvent and others added 6 commits February 1, 2026 15:39
…uption

When tool calls are interrupted (by error, timeout, content filtering, or
process termination), sessions can become permanently corrupted. Every
subsequent API request fails with errors like:
- "unexpected tool_use_id found in tool_result blocks"
- "tool result's tool id not found (2013)"

Root cause: extractToolCallsFromAssistant() skips malformed tool_use blocks
but leaves them in the message content. The blocks remain in the transcript
causing API rejections.

Fix: Strip malformed tool_use blocks (missing id, missing name, or with
partialJson field) BEFORE the pairing repair runs. This prevents creating
synthetic results for invalid blocks and allows sessions to auto-recover.

Fixes openclaw#5497, openclaw#5481, openclaw#5430, openclaw#5518

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add additional streaming/partial indicators beyond partialJson:
- partial === true (generic streaming indicator)
- incomplete === true (OpenAI-style indicator)

This ensures we catch malformed tool_use blocks from all provider
SDK shapes, not just Anthropic's partialJson field.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

P1 fix: Use property presence check ("partialJson" in rec) instead of
`!== undefined` to correctly detect the streaming artifact field
regardless of its value.

P3 fix: Expand logging to include all non-zero repair counters
(malformed stripped, orphans dropped, duplicates dropped, synthetic
results added) for easier debugging of transcript issues.

Added docstring explaining why partial/incomplete use strict boolean
checks (to avoid false positives from falsy values like 0 or "").

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

…rmatting

Address Greptile feedback:
- Add snake_case type variants (tool_use, function_call) to support
  sessions from providers/SDKs that emit these types
- Format test object literal across multiple lines for readability

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Address Greptile P1: Both `isValidToolUseBlock` and `extractToolCallsFromAssistant`
now use a shared `TOOL_BLOCK_TYPES` Set to ensure consistent handling of
all tool block type variants.

This ensures snake_case tool blocks (tool_use, function_call) are properly
extracted for pairing repair, not just stripped.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@steipete (Contributor) commented

AI-assisted stale triage closure (2026-02-24).

Closing this PR because the fix is already superseded by merged work.

Why:

This is AI-closed housekeeping, not a rejection of corruption-prevention concerns.

If malformed tool_use blocks still corrupt sessions on current main, open a fresh, focused PR with a failing transcript sample.

@steipete closed this Feb 24, 2026

Labels

agents Agent runtime and tooling

Development

Successfully merging this pull request may close these issues.

Orphaned tool_result after mid-stream assistant error causes permanent session breakage

2 participants