fix(replay-history): drop trailing stream-error placeholder before pr… by openperf · Pull Request #77287 · openclaw/openclaw

openperf · 2026-05-04T11:43:48Z

Summary

Problem: A session whose last assistant turn errored before producing content — persisted as { role: "assistant", stopReason: "error", content: [] } (or, after the offline session-file repair pass, with content: [{ type: "text", text: "[assistant turn failed before producing content]" }]) — gets sent to the next provider call as the trailing message. Prefill-strict providers (the report's repro case is github-copilot/claude-opus-4.6 in Session corruption: prefill error cascades into provider cooldown + repair makes it worse #77228, but the same pattern hits any model that requires the conversation to end with a user message) reject the call with 400 This model does not support assistant message prefill. The conversation must end with a user message.. The retry path keeps re-sending the same shape and the user sees a session that never recovers.
Root Cause: src/agents/pi-embedded-runner/replay-history.ts:327-399 (normalizeAssistantReplayContent) deliberately rewrites empty content: [] assistant turns whose stopReason === "error" (or isZeroUsageEmptyStopAssistantTurn(message) — the stopReason: "stop" zero-usage shape) into a non-empty single-text-block carrying STREAM_ERROR_FALLBACK_TEXT (src/agents/stream-message-shared.ts:90). The doc comment at replay-history.ts:370-396 explicitly states the goal: AWS Bedrock Converse rejects assistant messages with no ContentBlock, so the rewrite is needed to keep replay valid for Bedrock. That goal is correct for non-trailing error turns. The same rewrite is wrong when applied to the trailing error turn — Bedrock's contract is satisfied by the next user message that immediately follows in normal flow, but when the error turn is the last entry there is no following user message and the rewrite produces a request whose final message is assistant. src/agents/session-file-repair.ts:65 further persists the same sentinel content to disk, so on the next turn the loaded transcript already carries the sentinel as a non-empty trailing assistant block, which the existing Array.isArray(replayContent) && replayContent.length === 0 branch does not match (it now skips through unchanged) and the trailing-prefill 400 reproduces.
Fix: After the existing sentinel-rewrite loop in normalizeAssistantReplayContent finishes, walk back from the tail and drop trailing assistant turns that match the dropable-trailing predicate: an assistant turn whose content is either (a) empty and stopReason === "error" or isZeroUsageEmptyStopAssistantTurn(message), or (b) the single-text-block sentinel shape [{ type: "text", text: STREAM_ERROR_FALLBACK_TEXT }]. Both shapes carry zero usage and no model output; dropping is lossless. The next provider request now ends in the last user (or whatever non-droppable turn was before the synthetic tail), which Bedrock accepts (no empty assistant ContentBlock) and prefill-strict providers also accept (last message is user). The trim sits after the rewrite loop so it transparently catches the persisted-sentinel-on-disk shape as well as the in-memory rewrite shape, with one helper.
What changed:
- src/agents/pi-embedded-runner/replay-history.ts — extend normalizeAssistantReplayContent with a post-loop tail trim plus two pure helpers (isReplayDroppableTrailingAssistant, isStreamErrorSentinelContent). Comment block records why dropping is lossless and which provider classes are protected.
- src/agents/pi-embedded-runner/replay-history.test.ts — relocate two existing tests that locked in the buggy "rewrite trailing error to sentinel" behavior so they exercise the mid-turn sentinel rewrite (with a follow-up user message); add four new tests covering trailing drop for empty-error, trailing drop for zero-usage-empty-stop, trailing drop for persisted sentinel content, multi-turn trailing drop, and boundary preservation for real assistant content + non-error empty toolUse/length turns.
- CHANGELOG.md — single Fixes line under Unreleased referencing the issue with non-closing Refs syntax.
What did NOT change (scope boundary):
- Behavior is unchanged for non-trailing error/zero-usage-empty-stop turns: they still get rewritten with the sentinel, preserving Bedrock Converse compatibility verified by the relocated tests and by run.empty-error-retry.test.ts's "Clean stop with no output is a legitimate silent reply, not a crash" boundary.
- No changes to STREAM_ERROR_FALLBACK_TEXT itself, to buildStreamErrorAssistantMessage, to session-file-repair.ts's on-disk write of the sentinel, or to any provider-specific replay path. The fix is provider-agnostic and lives at the single normalization chokepoint that all replay flows funnel through (attempt.ts:546, attempt.ts:3019, and replay-history.ts:628).
- No changes to the cascading auth-profile cooldown that this trailing prefill 400 used to trigger; that amplifier is tracked separately under the same issue and is being addressed in a sibling PR.
- No changes to the auto-repair routine that, on the second turn after the prefill 400, fills the JSONL with null-role entries; that is a third independent failure mode in the same issue and gets its own PR.

Reproduction

Drive an agent on a prefill-strict provider/model (the report uses github-copilot/claude-opus-4.6) into the failure shape: a tool-call assistant turn aborts before producing content. The session manager persists the entry as { role: "assistant", stopReason: "error", content: [] } and (after the offline session-file repair pass that's documented at session-file-repair.ts:65) may rewrite it to the [{ type: "text", text: "[assistant turn failed before producing content]" }] sentinel shape on disk.
Without sending another user turn (e.g. an internal retry, a heartbeat replay, or a continuation that loads the persisted transcript) trigger another provider call.
Without this fix: provider returns 400 ... The conversation must end with a user message., the failover reason classifies as format, the session is stuck. The same shape continues to be re-sent on every retry.
With this fix: normalizeAssistantReplayContent drops the trailing synthetic assistant turn before the request leaves OpenClaw. The provider sees a request ending in the last real user turn (or whatever non-droppable entry precedes the synthetic tail). Bedrock still accepts it (no empty ContentBlock), prefill-strict providers also accept it (last message is user), and the agent generates a fresh assistant reply.

Risk / Mitigation

Risk 1 — losing a legitimate trailing assistant turn: Could a turn that the user actually wants the model to "continue" be dropped? No. The drop predicate matches only:
- empty content: [] plus stopReason: "error" (synthetic, zero-usage, the failed turn itself)
- empty content: [] plus the isZeroUsageEmptyStopAssistantTurn shape (already documented in the existing rewrite branch as a synthesized failure surface — see replay-history.ts:387-388 and the boundary test run.empty-error-retry.test.ts referenced in the existing comment)
- non-empty content that is exactly [{ type: "text", text: STREAM_ERROR_FALLBACK_TEXT }] (synthetic sentinel)
Anything else — real text content, real usage, stopReason: "toolUse" or "length" — is preserved. Test cases lock both directions of this predicate.
Mitigation: explicit "does not drop a trailing assistant turn that has real content" and "does not drop a trailing assistant turn with non-error empty content (toolUse / length)" tests; the run.empty-error-retry.test.ts silent-reply boundary continues to pass because nonzero-usage stop turns do not match isZeroUsageEmptyStopAssistantTurn (existing boundary preserved).
Risk 2 — breaking Bedrock Converse compatibility: Could Bedrock now reject calls that used to work because we removed the sentinel rewrite? No. The pre-existing rewrite branch is unchanged for non-trailing error turns — those still get the sentinel inserted. The trim only fires for trailing turns, and a trailing assistant message with empty content is exactly the case Bedrock would have rejected anyway; dropping it produces a request whose last message is user, which Bedrock accepts.
Mitigation: relocated tests for "converts mid-turn assistant content: [] to a non-empty sentinel text block when stopReason is error" and "converts mid-turn zero-usage empty stop turns to a replay sentinel" continue to assert the rewrite path explicitly.
Risk 3 — provider-specific divergence: The fix is provider-agnostic. Could it diverge somewhere downstream? No. normalizeAssistantReplayContent is the single chokepoint upstream of all provider runtime calls (attempt.ts:546 normalizeMessagesForLlmBoundary, attempt.ts:3019 normalizedReplayMessages, replay-history.ts:628). After it returns, downstream provider plugins do their own provider-specific massaging on the message list, but they receive a list whose tail is already user-shaped, which is the universal contract.
Mitigation: existing sanitizeProviderReplayHistoryWithPlugin and validateProviderReplayTurnsWithPlugin flows are downstream of this normalization; they continue to receive a structurally-valid list.
Risk 4 — repair amplification: This PR does not stop the auto-repair routine from filling the JSONL with null-role entries when triggered by a different invalid sequence. That is a separate root cause and gets its own PR. The trim here removes the trigger condition for the prefill 400 specifically.
Mitigation: the changelog entry and the Refs reference are explicit about scope.

Change Type (select all)

Bug fix

Scope (select all touched areas)

Agents (replay history normalization)
Tests (replay-history.test.ts)
Changelog (Unreleased Fixes entry)

Linked Issue/PR

Refs #77228 — addresses the trailing prefill-placeholder root cause only. The cascading auth-profile cooldown amplifier (a single format 400 currently poisons the whole auth profile across sessions) and the auto-repair amplifier (the routine writes 935+ null-role entries when reconciling an invalid sequence) are independent failure modes in the same issue and are out of scope here so each can ship as its own narrowly-scoped PR.

clawsweeper · 2026-05-04T11:46:54Z

Codex review: needs maintainer review before merge.

Summary
The branch trims trailing synthetic stream-error/zero-usage assistant placeholders in replay normalization, adds regression tests for trim and preservation boundaries, and records the user-facing fix in the changelog.

Reproducibility: yes. at source level. Current main can rewrite a trailing empty assistant error/zero-usage stop turn into a sentinel assistant-final transcript that flows through the shared provider replay path; I did not run a live github-copilot request in this read-only review.

Next step before merge
Protected maintainer-labeled MEMBER PR with queued CI; maintainers should finish normal review and merge handling rather than dispatch a repair job.

Security
Cleared: The diff only changes replay normalization, focused tests, and the changelog; it does not touch CI, dependencies, install scripts, package publishing, secrets, or other supply-chain execution paths.

Review details

Best possible solution:

Land the shared normalizer fix after maintainer review and completion of queued CI checks; keep the broader provider cooldown and repair-corruption follow-ups tracked separately under #77228 or sibling PRs.

Do we have a high-confidence way to reproduce the issue?

Yes, at source level. Current main can rewrite a trailing empty assistant error/zero-usage stop turn into a sentinel assistant-final transcript that flows through the shared provider replay path; I did not run a live github-copilot request in this read-only review.

Is this the best way to solve the issue?

Yes. The PR fixes the shared replay chokepoint, preserves non-trailing Bedrock sentinel behavior, requires synthetic provenance before dropping sentinel text, and now includes changelog coverage.

What I checked:

Current main replay behavior: Current main rewrites empty assistant stopReason: "error" and zero-usage empty stop turns to STREAM_ERROR_FALLBACK_TEXT inside normalizeAssistantReplayContent, with no tail trim before returning. (src/agents/pi-embedded-runner/replay-history.ts:370, d5edeae6ee9d)
Shared replay chokepoint: normalizeAssistantReplayContent feeds both the LLM-boundary normalization path and the active session replay update path, so the PR targets the shared outbound replay surface. (src/agents/pi-embedded-runner/run/attempt.ts:546, d5edeae6ee9d)
Docs scope: Transcript hygiene docs identify sanitizeSessionHistory/replay-history as the transcript repair surface and document Bedrock empty assistant stream-error repair as current behavior. Public docs: docs/reference/transcript-hygiene.md. (docs/reference/transcript-hygiene.md:51, d5edeae6ee9d)
PR tail trim: The latest head adds a post-loop trim that drops only trailing assistant turns matching empty error/zero-usage-stop or sentinel content with synthetic provenance. (src/agents/pi-embedded-runner/replay-history.ts:400, d2089a759ec5)
Regression coverage: The PR adds tests for trailing empty-error, zero-usage stop, persisted sentinel, multi-tail trimming, real content preservation, and sentinel-text real-reply boundaries. (src/agents/pi-embedded-runner/replay-history.test.ts:190, d2089a759ec5)
Changelog present: The latest PR head includes the required Unreleased changelog entry for the user-facing replay-history fix. (CHANGELOG.md:313, d2089a759ec5)

Likely related people:

openperf: Merged fix(agents): prevent Bedrock replay death loop on empty assistant content #71627 introduced the shared stream-error sentinel, replay rewrite, and session-file repair paths that this PR refines; this PR also builds directly on that current-main history. (role: introduced behavior and adjacent maintainer; confidence: high; commits: 930d81aa41d2; files: src/agents/pi-embedded-runner/replay-history.ts, src/agents/session-file-repair.ts, src/agents/stream-message-shared.ts)
steipete: Recent history shows repeated transcript-hygiene and replay-history maintenance, including metadata-only replay cleanup and blank Bedrock user replay sanitation, and fix(agents): prevent Bedrock replay death loop on empty assistant content #71627 was merged by this maintainer. (role: recent maintainer and merger; confidence: high; commits: d80a8eb3adb6, 9d33da6ddf2c, 929df0f5567f; files: src/agents/pi-embedded-runner/replay-history.ts, src/agents/session-file-repair.ts, docs/reference/transcript-hygiene.md)
obviyus: Merged Preserve delivered assistant replies in session repair #76420 changed session-file repair to preserve delivered assistant replies and clarified that assistant-prefill handling belongs in outbound replay rather than durable history. (role: adjacent owner; confidence: medium; commits: e8756d99aef0; files: src/agents/session-file-repair.ts, docs/reference/transcript-hygiene.md, CHANGELOG.md)
amknight: Authored fix(agents): trim trailing assistant turns and rewrite blank user messages in session repair #75606, which handled trailing assistant prefill and blank user repair in session-file repair before later refinement by Preserve delivered assistant replies in session repair #76420, so the history is directly adjacent to this replay-boundary work. (role: adjacent feature author; confidence: medium; commits: 524528944f05; files: src/agents/session-file-repair.ts, src/agents/session-file-repair.test.ts)

Remaining risk / open question:

Several check runs for head d2089a759ec55cf7911d51b7cfaa619941fc2616 were still queued at review time, so final merge readiness depends on CI completion.
No live provider request was executed in this read-only review; confidence comes from source inspection, PR diff review, docs, and added regression coverage.

Codex review notes: model gpt-5.5, reasoning high; reviewed against d5edeae6ee9d.

…ovider replay The STREAM_ERROR_FALLBACK_TEXT sentinel exists to satisfy Bedrock Converse's "ContentBlock must not be empty" rule for non-trailing assistant error turns. When the same shape ends up as the *trailing* message in a replay request, prefill-strict providers (e.g. github-copilot/claude-opus-4.6) reject the call with 400 "The conversation must end with a user message.". The original turn carried zero usage and either empty content or only the synthetic sentinel text — drop is lossless. The trim runs after the existing sentinel-rewrite loop so it also covers transcripts where session-file repair previously persisted the sentinel to disk. Refs openclaw#77228 — addresses the prefill-trigger root cause only. The cascading auth-profile cooldown amplifier and the auto-repair null-role amplifier are tracked separately in the same issue.

…penclaw#77228)

openperf · 2026-05-05T06:14:14Z

Landed as squash commit 24bd0b212f54013f3e85a5d64f50487136944383 on main.

@yfge

…ainer-hardening * origin/main: (843 commits) docs(changelog): relocate openclaw#77046 and openclaw#77280 entries from 2026.5.3 to Unreleased (openclaw#77728) docs: reorder unreleased changelog fix: expose ollama thinking profile before activation (openclaw#77617) (thanks @yfge) fix: expose ollama thinking profile before activation test(gateway): preserve dispatch timers in waiter test(gateway): keep startup context timer live docs: document cache-friendly activity helper ci: install ffmpeg for Mantis media previews fix: avoid impossible device token rotation advice (openclaw#77688) (thanks @Conan-Scott) docs(changelog): note doctor device pairing advice fix fix(doctor): avoid impossible device token rotation advice ci: use Crabbox media previews for Mantis docs: filter maintainer-owned triage noise test: cover GitHub activity helper fix(session-file-repair): drop null-role message entries instead of preserving them (openclaw#77288) fix: prune orphan session artifacts perf: reduce GitHub activity cache misses fix: cache session list model resolution (openclaw#77650) (thanks @ragesaq) ci: embed Mantis desktop previews fix(replay-history): drop trailing stream-error placeholder before provider send (openclaw#77287) ... # Conflicts: # CHANGELOG.md

…ovider send (openclaw#77287) normalizeAssistantReplayContent rewrites empty assistant error turns into a STREAM_ERROR_FALLBACK_TEXT sentinel to satisfy Bedrock Converse's non-empty ContentBlock requirement for non-trailing turns. When that sentinel is the trailing entry, prefill-strict providers reject the request with "400 This model does not support assistant message prefill. The conversation must end with a user message." and the session cannot recover on its own. Add a post-loop tail trim that drops trailing assistant turns whose content is empty with stopReason "error" or zero-usage empty stop, or carries only the sentinel text with the same synthetic provenance. A real model reply whose content happens to equal the sentinel string is preserved by requiring zero usage or stopReason "error" before dropping. The trim catches both the in-memory rewrite shape and the sentinel persisted to disk by session-file-repair. Tests: - pnpm test src/agents/pi-embedded-runner/replay-history.test.ts - pnpm exec oxfmt --check --threads=1 CHANGELOG.md src/agents/pi-embedded-runner/replay-history.ts src/agents/pi-embedded-runner/replay-history.test.ts - pnpm check:changed Refs openclaw#77228

openclaw-barnacle Bot added agents Agent runtime and tooling size: S maintainer Maintainer-authored PR labels May 4, 2026

openperf force-pushed the fix/77228-replay-trim-trailing-stream-error-placeholder branch from 1f0e275 to befc941 Compare May 4, 2026 12:04

openperf force-pushed the fix/77228-replay-trim-trailing-stream-error-placeholder branch from 3e60c65 to e9e58e1 Compare May 5, 2026 06:03

openperf added 3 commits May 5, 2026 14:08

docs(changelog): add entry for replay-history trailing sentinel fix (o…

2aa84c9

…penclaw#77228)

docs(changelog): move replay-history entry to Unreleased section

9327313

openperf force-pushed the fix/77228-replay-trim-trailing-stream-error-placeholder branch from aa7a39b to 9327313 Compare May 5, 2026 06:08

openperf merged commit 24bd0b2 into openclaw:main May 5, 2026
109 checks passed

github-actions Bot mentioned this pull request May 5, 2026

📡 Upstream Digest — 2026-05-05 09:55 UTC curtismercier/openclaw-mods#773

Open

clawsweeper Bot mentioned this pull request May 8, 2026

fix(agents): strip trailing empty assistant turns before fallback #75524

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(replay-history): drop trailing stream-error placeholder before pr…#77287

fix(replay-history): drop trailing stream-error placeholder before pr…#77287
openperf merged 3 commits intoopenclaw:mainfrom
openperf:fix/77228-replay-trim-trailing-stream-error-placeholder

openperf commented May 4, 2026

Uh oh!

clawsweeper Bot commented May 4, 2026 •

edited

Loading

Uh oh!

Uh oh!

openperf commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

openperf commented May 4, 2026

Summary

Reproduction

Risk / Mitigation

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

Uh oh!

clawsweeper Bot commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

openperf commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

clawsweeper Bot commented May 4, 2026 •

edited

Loading