
fix(session): strip malformed tool_use blocks to prevent session corruption #5557

Closed
NSEvent wants to merge 2 commits into openclaw:main from NSEvent:fix/session-corruption-malformed-tool-use

Conversation

NSEvent commented Jan 31, 2026

Summary

  • Strip malformed tool_use blocks from assistant messages during transcript repair to fix permanent session corruption
  • Malformed blocks (missing id, missing name, or with partialJson field) are now detected and removed before pairing repair runs
  • Sessions that would previously require manual JSONL surgery now auto-recover
  • Added logging when malformed blocks are stripped for observability

Problem

When tool calls are interrupted (by error, timeout, content filtering, or process termination), sessions become permanently corrupted. Every subsequent API request fails with:

  • unexpected tool_use_id found in tool_result blocks
  • tool result's tool id not found (2013)

Root cause: extractToolCallsFromAssistant() skips malformed tool_use blocks (lines 22-23 in the old code) but leaves them in the message content. The blocks therefore remain in the transcript, and the API rejects every later request that includes them.
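To illustrate the root cause, here is a minimal sketch of the skip-but-keep behavior. The `Block` shape, the `"toolCall"` type string, and the simplified `extractToolCalls` helper are assumptions made for illustration; they are not the actual code from the repository.

```typescript
// Hypothetical block shape; the real types in the repository may differ.
type Block = { type: string; id?: string; name?: string; partialJson?: string };
type ToolCall = { id: string; name: string };

// Simplified stand-in for extractToolCallsFromAssistant(): it returns only
// the well-formed calls and never modifies the content array it reads.
function extractToolCalls(content: Block[]): ToolCall[] {
  const calls: ToolCall[] = [];
  for (const block of content) {
    if (block.type !== "toolCall") continue;
    // Malformed blocks are skipped here, but they stay in `content`.
    if (!block.id || !block.name) continue;
    calls.push({ id: block.id, name: block.name });
  }
  return calls;
}

// An assistant turn interrupted mid-stream: no id yet, arguments cut off.
const content: Block[] = [
  { type: "text" },
  { type: "toolCall", name: "read_file", partialJson: '{"path": "/tm' },
];

const calls = extractToolCalls(content);
const stillInTranscript = content.filter((block) => block.type === "toolCall").length;
```

In this sketch `calls` comes back empty while the malformed block is still counted in `stillInTranscript`, which is the mismatch described above: the repair logic ignores the block, but the transcript sent to the API still carries it.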

Solution

  1. Add isValidToolUseBlock() to detect malformed blocks
  2. Add stripMalformedToolUseBlocks() to remove them from assistant messages
  3. Call the strip function before pairing repair to prevent creating synthetic results for invalid blocks
  4. Track droppedMalformedToolUseCount in the repair report for observability
  5. Log a warning when malformed blocks are stripped
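The first two steps could look roughly like the sketch below. The function names and the `droppedMalformedToolUseCount` field come from this PR; the `ContentBlock` and `AssistantMessage` types, the `"toolCall"` type string, and the return shape are assumptions, so treat this as an approximation rather than the merged code.

```typescript
// Hypothetical types; the repository's real message types may differ.
type ContentBlock = {
  type: string;
  id?: unknown;
  name?: unknown;
  [key: string]: unknown;
};
type AssistantMessage = { role: "assistant"; content: ContentBlock[] };

// A tool call counts as valid only if it has a non-empty id and name
// and carries no partialJson field left over from streaming.
function isValidToolUseBlock(block: ContentBlock): boolean {
  const hasId = typeof block.id === "string" && block.id.trim() !== "";
  const hasName = typeof block.name === "string" && block.name.trim() !== "";
  return hasId && hasName && !("partialJson" in block);
}

// Returns a copy of the message without malformed tool calls, plus the
// number of blocks dropped so the caller can report or log it.
function stripMalformedToolUseBlocks(message: AssistantMessage): {
  message: AssistantMessage;
  droppedMalformedToolUseCount: number;
} {
  const kept = message.content.filter(
    (block) => block.type !== "toolCall" || isValidToolUseBlock(block),
  );
  return {
    message: { ...message, content: kept },
    droppedMalformedToolUseCount: message.content.length - kept.length,
  };
}

// Example: one text block, one valid call, two malformed calls.
const repaired = stripMalformedToolUseBlocks({
  role: "assistant",
  content: [
    { type: "text", text: "Reading the file now." },
    { type: "toolCall", id: "call_1", name: "read_file", arguments: {} },
    { type: "toolCall", id: "call_2", name: "read_file", partialJson: '{"pa' },
    { type: "toolCall", name: "read_file" },
  ],
});
```

Running the strip before pairing repair matters because the pairing step would otherwise try to create synthetic results for blocks that should not exist at all.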

Test plan

  • All 13 unit tests pass (4 existing + 9 new)
  • New test "recovers from interrupted tool call (session corruption scenario)" reproduces the exact failure mode
  • Lint passes
  • Build succeeds

Future suggestions

These are not blockers but could improve robustness:

  1. Consolidate validation logic - extractToolCallsFromAssistant() has similar but not identical validation. Could extract shared validation to reduce duplication.

  2. Provider-specific partial data fields - Currently only checks for partialJson. Other providers might use different field names for partial/streaming data.

  3. Consider logging in stripMalformedToolUseBlocks directly - Currently only logs in google.ts. Other call sites via sanitizeToolUseResultPairing() don't get the warning.
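Suggestions 2 and 3 could be combined by letting the strip function accept a list of partial-data field names and a drop callback. This is only a sketch of one possible direction: `stripMalformedToolCalls`, `StripOptions`, `partialFields`, `onDrop`, and the `partialArgs` field are all hypothetical names, and only `partialJson` is taken from the PR.

```typescript
// All names here are hypothetical except partialJson, which the PR checks for.
type Block = { type: string; id?: unknown; name?: unknown; [key: string]: unknown };

type StripOptions = {
  // Field names that mark a block as partially streamed (assumed list).
  partialFields?: readonly string[];
  // Called once per invocation when at least one block was dropped.
  onDrop?: (droppedCount: number) => void;
};

const DEFAULT_PARTIAL_FIELDS: readonly string[] = ["partialJson"];

function stripMalformedToolCalls(content: Block[], options: StripOptions = {}): Block[] {
  const partialFields = options.partialFields ?? DEFAULT_PARTIAL_FIELDS;
  const kept = content.filter((block) => {
    if (block.type !== "toolCall") return true; // non-tool blocks pass through
    const hasId = typeof block.id === "string" && block.id.length > 0;
    const hasName = typeof block.name === "string" && block.name.length > 0;
    const isPartial = partialFields.some((field) => field in block);
    return hasId && hasName && !isPartial;
  });
  const dropped = content.length - kept.length;
  if (dropped > 0) options.onDrop?.(dropped);
  return kept;
}

// Usage: every call site gets the warning, not just one runner.
const warnings: string[] = [];
const cleaned = stripMalformedToolCalls(
  [
    { type: "toolCall", id: "call_1", name: "search", partialArgs: "{" },
    { type: "toolCall", id: "call_2", name: "search" },
  ],
  {
    partialFields: ["partialJson", "partialArgs"], // partialArgs is made up
    onDrop: (n) => warnings.push(`stripped ${n} malformed tool call block(s)`),
  },
);
```

Passing the logger in as a callback keeps the repair module free of a direct logging dependency while still giving every caller the same observability.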

Fixes #5497, #5481, #5430, #5518

🤖 Generated with Claude Code

Greptile Overview

Greptile Summary

This PR improves session transcript repair by stripping malformed assistant toolCall blocks (e.g., missing/empty id or containing partialJson) before attempting to re-pair toolResult messages. The repair report now exposes droppedMalformedToolUseCount, and the Google embedded runner logs a warning when malformed tool blocks are removed. A new test suite reproduces the “interrupted tool call causes permanent session corruption” scenario and validates the new behavior.

Confidence Score: 4/5

  • This PR looks safe to merge and should improve recovery from corrupted transcripts, with low behavioral risk outside transcript sanitization.
  • Changes are localized to transcript-repair logic and the Google runner’s sanitization pipeline, with comprehensive new unit tests covering key malformed-block scenarios. Minor edge-case risk remains around provider-specific malformed tool-call shapes not covered (beyond partialJson/missing id).
  • src/agents/session-transcript-repair.ts

@openclaw-barnacle openclaw-barnacle bot added the agents Agent runtime and tooling label Jan 31, 2026

@greptile-apps greptile-apps bot left a comment

1 file reviewed, 2 comments

jnvw commented Jan 31, 2026

Why is this check failing?

openclaw@2026.1.30 format /home/runner/_work/openclaw/openclaw

oxfmt --check

Checking formatting...

docs/automation/cron-jobs.md (94ms)

Format issues found in above 1 files. Run without --check to fix.
Finished in 1345ms on 3589 files using 10 threads.
 ELIFECYCLE  Command failed with exit code 1.
Error: Process completed with exit code 1.

@openclaw-barnacle openclaw-barnacle bot added the docs Improvements or additions to documentation label Jan 31, 2026
@NSEvent NSEvent force-pushed the fix/session-corruption-malformed-tool-use branch 2 times, most recently from a58932e to 0e2ac92 Compare January 31, 2026 17:15
@openclaw-barnacle openclaw-barnacle bot added docs Improvements or additions to documentation and removed docs Improvements or additions to documentation labels Jan 31, 2026
NSEvent and others added 2 commits January 31, 2026 09:21
…uption

When tool calls are interrupted (by error, timeout, content filtering, or
process termination), sessions can become permanently corrupted. Every
subsequent API request fails with errors like:
- "unexpected tool_use_id found in tool_result blocks"
- "tool result's tool id not found (2013)"

Root cause: extractToolCallsFromAssistant() skips malformed tool_use blocks
but leaves them in the message content. The blocks remain in the transcript
causing API rejections.

Fix: Strip malformed tool_use blocks (missing id, missing name, or with
partialJson field) BEFORE the pairing repair runs. This prevents creating
synthetic results for invalid blocks and allows sessions to auto-recover.

Fixes openclaw#5497, openclaw#5481, openclaw#5430, openclaw#5518

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@NSEvent NSEvent force-pushed the fix/session-corruption-malformed-tool-use branch from fc3255e to 7d53df6 Compare January 31, 2026 17:22
@openclaw-barnacle openclaw-barnacle bot removed the docs Improvements or additions to documentation label Jan 31, 2026
NSEvent commented Jan 31, 2026

Why is this check failing?

openclaw@2026.1.30 format /home/runner/_work/openclaw/openclaw

oxfmt --check

Checking formatting...

docs/automation/cron-jobs.md (94ms)

Format issues found in above 1 files. Run without --check to fix.
Finished in 1345ms on 3589 files using 10 threads.
 ELIFECYCLE  Command failed with exit code 1.
Error: Process completed with exit code 1.

Not sure how that file got changed; it is fixed now.

NSEvent commented Jan 31, 2026

@greptileai

@greptile-apps greptile-apps bot left a comment

No files reviewed, no comments

NSEvent commented Feb 1, 2026

Closing to resubmit with improvements addressing edge-case coverage.

@NSEvent NSEvent closed this Feb 1, 2026
@adam91holt

🔍 Overlapping PRs Detected

This PR appears to overlap with 22 other open PRs all addressing tool call/tool_result pairing and sanitization issues:

| Similarity | PR | Author | Title |
| --- | --- | --- | --- |
| 91.5% | #4844 | @lailoo | fix(agents): skip error/aborted assistant messages in transcript repair |
| 90.6% | #4476 | @kira-ariaki | fix: skip tool calls from aborted assistant messages in transcript repair |
| 90.5% | #3194 | @koriyoshi2041 | fix: skip incomplete tool calls in transcript repair |
| 90.5% | #3565 | @kiranjd | fix(sessions): truncate at incomplete tool calls instead of synthetic repair |
| 88.7% | #2253 | @Zedit42 | fix: sanitize incomplete tool calls with partialJson |
| 88.6% | #3647 | @nhangen | fix: sanitize tool arguments in session history |
| 87.5% | #3707 | @bheemreddy181 | fix: repair unpaired tool calls when loading sessions |
| 87.2% | #3622 | @mickobizzle | fix(agents): drop orphan tool results |
| 86.7% | #3125 | @snejati86 | fix: prevent orphan tool_result errors from streaming failures |
| 86.5% | #4852 | @lailoo | fix(agents): sanitize tool pairing after compaction and history truncation |
| 85.6% | #4516 | @chesterbella | fix: drop errored assistant tool calls and their orphan tool_results |
| 84.3% | #4009 | @drag88 | fix(agent): sanitize messages after orphan user repair |
| 84.3% | #3880 | @SalimBinYousuf1 | fix: drop assistant messages with stopReason error to avoid orphaning |
| 83.7% | #1859 | @zerone0x | fix(agents): skip extracting tool calls from errored assistant turns |
| 83.5% | #4700 | @marcelomar21 | fix: deduplicate tool_use IDs and enable sanitization for Anthropic |
| 82.9% | #3362 | @samhotchkiss | fix: auto-repair and retry on orphan tool_result errors |
| 82.6% | #5482 | @bsmithelion-arcadia | fix(session): normalize tool call blocks for cross-provider compat |
| 82.4% | #4719 | @bsmithelion-arcadia | fix(session): normalize tool call blocks for cross-provider compat |
| 82.2% | #5032 | @shayan919293 | fix: re-run sanitization after limitHistoryTurns to fix orphaned tool results |
| 82.1% | #4598 | @aisling404 | fix(agents): skip tool extraction for aborted/errored assistant messages |
| 78.3% | #2557 | @steve-rodri | fix(agents): preserve tool call/result pairing in history limiting |
| | #2213 | @manzienkog | fix: normalize toolCall arguments to prevent Anthropic API rejection |

Similarity scores computed using Voyage AI embeddings (cosine similarity) on standardized PR summaries.

Labels

agents Agent runtime and tooling

Development

Successfully merging this pull request may close these issues.

Orphaned tool_result after mid-stream assistant error causes permanent session breakage

3 participants