Skip to content

fix(replay-history): drop trailing stream-error placeholder before pr…#77287

Merged
openperf merged 3 commits intoopenclaw:mainfrom
openperf:fix/77228-replay-trim-trailing-stream-error-placeholder
May 5, 2026
Merged

fix(replay-history): drop trailing stream-error placeholder before pr…#77287
openperf merged 3 commits intoopenclaw:mainfrom
openperf:fix/77228-replay-trim-trailing-stream-error-placeholder

Conversation

@openperf
Copy link
Copy Markdown
Member

@openperf openperf commented May 4, 2026

Summary

  • Problem: A session whose last assistant turn errored before producing content — persisted as { role: "assistant", stopReason: "error", content: [] } (or, after the offline session-file repair pass, with content: [{ type: "text", text: "[assistant turn failed before producing content]" }]) — gets sent to the next provider call as the trailing message. Prefill-strict providers (the report's repro case is github-copilot/claude-opus-4.6 in Session corruption: prefill error cascades into provider cooldown + repair makes it worse #77228, but the same pattern hits any model that requires the conversation to end with a user message) reject the call with 400 This model does not support assistant message prefill. The conversation must end with a user message.. The retry path keeps re-sending the same shape and the user sees a session that never recovers.

  • Root Cause: src/agents/pi-embedded-runner/replay-history.ts:327-399 (normalizeAssistantReplayContent) deliberately rewrites empty content: [] assistant turns whose stopReason === "error" (or isZeroUsageEmptyStopAssistantTurn(message) — the stopReason: "stop" zero-usage shape) into a non-empty single-text-block carrying STREAM_ERROR_FALLBACK_TEXT (src/agents/stream-message-shared.ts:90). The doc comment at replay-history.ts:370-396 explicitly states the goal: AWS Bedrock Converse rejects assistant messages with no ContentBlock, so the rewrite is needed to keep replay valid for Bedrock. That goal is correct for non-trailing error turns. The same rewrite is wrong when applied to the trailing error turn — Bedrock's contract is satisfied by the next user message that immediately follows in normal flow, but when the error turn is the last entry there is no following user message and the rewrite produces a request whose final message is assistant. src/agents/session-file-repair.ts:65 further persists the same sentinel content to disk, so on the next turn the loaded transcript already carries the sentinel as a non-empty trailing assistant block, which the existing Array.isArray(replayContent) && replayContent.length === 0 branch does not match (it now skips through unchanged) and the trailing-prefill 400 reproduces.

  • Fix: After the existing sentinel-rewrite loop in normalizeAssistantReplayContent finishes, walk back from the tail and drop trailing assistant turns that match the dropable-trailing predicate: an assistant turn whose content is either (a) empty and stopReason === "error" or isZeroUsageEmptyStopAssistantTurn(message), or (b) the single-text-block sentinel shape [{ type: "text", text: STREAM_ERROR_FALLBACK_TEXT }]. Both shapes carry zero usage and no model output; dropping is lossless. The next provider request now ends in the last user (or whatever non-droppable turn was before the synthetic tail), which Bedrock accepts (no empty assistant ContentBlock) and prefill-strict providers also accept (last message is user). The trim sits after the rewrite loop so it transparently catches the persisted-sentinel-on-disk shape as well as the in-memory rewrite shape, with one helper.

  • What changed:

    • src/agents/pi-embedded-runner/replay-history.ts — extend normalizeAssistantReplayContent with a post-loop tail trim plus two pure helpers (isReplayDroppableTrailingAssistant, isStreamErrorSentinelContent). Comment block records why dropping is lossless and which provider classes are protected.
    • src/agents/pi-embedded-runner/replay-history.test.ts — relocate two existing tests that locked in the buggy "rewrite trailing error to sentinel" behavior so they exercise the mid-turn sentinel rewrite (with a follow-up user message); add four new tests covering trailing drop for empty-error, trailing drop for zero-usage-empty-stop, trailing drop for persisted sentinel content, multi-turn trailing drop, and boundary preservation for real assistant content + non-error empty toolUse/length turns.
    • CHANGELOG.md — single Fixes line under Unreleased referencing the issue with non-closing Refs syntax.
  • What did NOT change (scope boundary):

    • Behavior is unchanged for non-trailing error/zero-usage-empty-stop turns: they still get rewritten with the sentinel, preserving Bedrock Converse compatibility verified by the relocated tests and by run.empty-error-retry.test.ts's "Clean stop with no output is a legitimate silent reply, not a crash" boundary.
    • No changes to STREAM_ERROR_FALLBACK_TEXT itself, to buildStreamErrorAssistantMessage, to session-file-repair.ts's on-disk write of the sentinel, or to any provider-specific replay path. The fix is provider-agnostic and lives at the single normalization chokepoint that all replay flows funnel through (attempt.ts:546, attempt.ts:3019, and replay-history.ts:628).
    • No changes to the cascading auth-profile cooldown that this trailing prefill 400 used to trigger; that amplifier is tracked separately under the same issue and is being addressed in a sibling PR.
    • No changes to the auto-repair routine that, on the second turn after the prefill 400, fills the JSONL with null-role entries; that is a third independent failure mode in the same issue and gets its own PR.

Reproduction

  1. Drive an agent on a prefill-strict provider/model (the report uses github-copilot/claude-opus-4.6) into the failure shape: a tool-call assistant turn aborts before producing content. The session manager persists the entry as { role: "assistant", stopReason: "error", content: [] } and (after the offline session-file repair pass that's documented at session-file-repair.ts:65) may rewrite it to the [{ type: "text", text: "[assistant turn failed before producing content]" }] sentinel shape on disk.
  2. Without sending another user turn (e.g. an internal retry, a heartbeat replay, or a continuation that loads the persisted transcript) trigger another provider call.
  3. Without this fix: provider returns 400 ... The conversation must end with a user message., the failover reason classifies as format, the session is stuck. The same shape continues to be re-sent on every retry.
  4. With this fix: normalizeAssistantReplayContent drops the trailing synthetic assistant turn before the request leaves OpenClaw. The provider sees a request ending in the last real user turn (or whatever non-droppable entry precedes the synthetic tail). Bedrock still accepts it (no empty ContentBlock), prefill-strict providers also accept it (last message is user), and the agent generates a fresh assistant reply.

Risk / Mitigation

  • Risk 1 — losing a legitimate trailing assistant turn: Could a turn that the user actually wants the model to "continue" be dropped? No. The drop predicate matches only:

    • empty content: [] plus stopReason: "error" (synthetic, zero-usage, the failed turn itself)
    • empty content: [] plus the isZeroUsageEmptyStopAssistantTurn shape (already documented in the existing rewrite branch as a synthesized failure surface — see replay-history.ts:387-388 and the boundary test run.empty-error-retry.test.ts referenced in the existing comment)
    • non-empty content that is exactly [{ type: "text", text: STREAM_ERROR_FALLBACK_TEXT }] (synthetic sentinel)

    Anything else — real text content, real usage, stopReason: "toolUse" or "length" — is preserved. Test cases lock both directions of this predicate.
    Mitigation: explicit "does not drop a trailing assistant turn that has real content" and "does not drop a trailing assistant turn with non-error empty content (toolUse / length)" tests; the run.empty-error-retry.test.ts silent-reply boundary continues to pass because nonzero-usage stop turns do not match isZeroUsageEmptyStopAssistantTurn (existing boundary preserved).

  • Risk 2 — breaking Bedrock Converse compatibility: Could Bedrock now reject calls that used to work because we removed the sentinel rewrite? No. The pre-existing rewrite branch is unchanged for non-trailing error turns — those still get the sentinel inserted. The trim only fires for trailing turns, and a trailing assistant message with empty content is exactly the case Bedrock would have rejected anyway; dropping it produces a request whose last message is user, which Bedrock accepts.
    Mitigation: relocated tests for "converts mid-turn assistant content: [] to a non-empty sentinel text block when stopReason is error" and "converts mid-turn zero-usage empty stop turns to a replay sentinel" continue to assert the rewrite path explicitly.

  • Risk 3 — provider-specific divergence: The fix is provider-agnostic. Could it diverge somewhere downstream? No. normalizeAssistantReplayContent is the single chokepoint upstream of all provider runtime calls (attempt.ts:546 normalizeMessagesForLlmBoundary, attempt.ts:3019 normalizedReplayMessages, replay-history.ts:628). After it returns, downstream provider plugins do their own provider-specific massaging on the message list, but they receive a list whose tail is already user-shaped, which is the universal contract.
    Mitigation: existing sanitizeProviderReplayHistoryWithPlugin and validateProviderReplayTurnsWithPlugin flows are downstream of this normalization; they continue to receive a structurally-valid list.

  • Risk 4 — repair amplification: This PR does not stop the auto-repair routine from filling the JSONL with null-role entries when triggered by a different invalid sequence. That is a separate root cause and gets its own PR. The trim here removes the trigger condition for the prefill 400 specifically.
    Mitigation: the changelog entry and the Refs reference are explicit about scope.

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • Agents (replay history normalization)
  • Tests (replay-history.test.ts)
  • Changelog (Unreleased Fixes entry)

Linked Issue/PR

Refs #77228 — addresses the trailing prefill-placeholder root cause only. The cascading auth-profile cooldown amplifier (a single format 400 currently poisons the whole auth profile across sessions) and the auto-repair amplifier (the routine writes 935+ null-role entries when reconciling an invalid sequence) are independent failure modes in the same issue and are out of scope here so each can ship as its own narrowly-scoped PR.

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: S maintainer Maintainer-authored PR labels May 4, 2026
@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented May 4, 2026

Codex review: needs maintainer review before merge.

Summary
The branch trims trailing synthetic stream-error/zero-usage assistant placeholders in replay normalization, adds regression tests for trim and preservation boundaries, and records the user-facing fix in the changelog.

Reproducibility: yes. at source level. Current main can rewrite a trailing empty assistant error/zero-usage stop turn into a sentinel assistant-final transcript that flows through the shared provider replay path; I did not run a live github-copilot request in this read-only review.

Next step before merge
Protected maintainer-labeled MEMBER PR with queued CI; maintainers should finish normal review and merge handling rather than dispatch a repair job.

Security
Cleared: The diff only changes replay normalization, focused tests, and the changelog; it does not touch CI, dependencies, install scripts, package publishing, secrets, or other supply-chain execution paths.

Review details

Best possible solution:

Land the shared normalizer fix after maintainer review and completion of queued CI checks; keep the broader provider cooldown and repair-corruption follow-ups tracked separately under #77228 or sibling PRs.

Do we have a high-confidence way to reproduce the issue?

Yes, at source level. Current main can rewrite a trailing empty assistant error/zero-usage stop turn into a sentinel assistant-final transcript that flows through the shared provider replay path; I did not run a live github-copilot request in this read-only review.

Is this the best way to solve the issue?

Yes. The PR fixes the shared replay chokepoint, preserves non-trailing Bedrock sentinel behavior, requires synthetic provenance before dropping sentinel text, and now includes changelog coverage.

What I checked:

Likely related people:

Remaining risk / open question:

  • Several check runs for head d2089a759ec55cf7911d51b7cfaa619941fc2616 were still queued at review time, so final merge readiness depends on CI completion.
  • No live provider request was executed in this read-only review; confidence comes from source inspection, PR diff review, docs, and added regression coverage.

Codex review notes: model gpt-5.5, reasoning high; reviewed against d5edeae6ee9d.

@openperf openperf force-pushed the fix/77228-replay-trim-trailing-stream-error-placeholder branch from 1f0e275 to befc941 Compare May 4, 2026 12:04
@openperf openperf force-pushed the fix/77228-replay-trim-trailing-stream-error-placeholder branch from 3e60c65 to e9e58e1 Compare May 5, 2026 06:03
openperf added 3 commits May 5, 2026 14:08
…ovider replay

The STREAM_ERROR_FALLBACK_TEXT sentinel exists to satisfy Bedrock Converse's
"ContentBlock must not be empty" rule for non-trailing assistant error
turns. When the same shape ends up as the *trailing* message in a replay
request, prefill-strict providers (e.g. github-copilot/claude-opus-4.6)
reject the call with 400 "The conversation must end with a user message.".
The original turn carried zero usage and either empty content or only the
synthetic sentinel text — drop is lossless. The trim runs after the existing
sentinel-rewrite loop so it also covers transcripts where session-file
repair previously persisted the sentinel to disk.

Refs openclaw#77228 — addresses the prefill-trigger root cause only.
The cascading auth-profile cooldown amplifier and the auto-repair null-role
amplifier are tracked separately in the same issue.
@openperf openperf force-pushed the fix/77228-replay-trim-trailing-stream-error-placeholder branch from aa7a39b to 9327313 Compare May 5, 2026 06:08
@openperf openperf merged commit 24bd0b2 into openclaw:main May 5, 2026
109 checks passed
@openperf
Copy link
Copy Markdown
Member Author

openperf commented May 5, 2026

Landed as squash commit 24bd0b212f54013f3e85a5d64f50487136944383 on main.

vincentkoc added a commit to VintageAyu/openclaw that referenced this pull request May 5, 2026
…ainer-hardening

* origin/main: (843 commits)
  docs(changelog): relocate openclaw#77046 and openclaw#77280 entries from 2026.5.3 to Unreleased (openclaw#77728)
  docs: reorder unreleased changelog
  fix: expose ollama thinking profile before activation (openclaw#77617) (thanks @yfge)
  fix: expose ollama thinking profile before activation
  test(gateway): preserve dispatch timers in waiter
  test(gateway): keep startup context timer live
  docs: document cache-friendly activity helper
  ci: install ffmpeg for Mantis media previews
  fix: avoid impossible device token rotation advice (openclaw#77688) (thanks @Conan-Scott)
  docs(changelog): note doctor device pairing advice fix
  fix(doctor): avoid impossible device token rotation advice
  ci: use Crabbox media previews for Mantis
  docs: filter maintainer-owned triage noise
  test: cover GitHub activity helper
  fix(session-file-repair): drop null-role message entries instead of preserving them (openclaw#77288)
  fix: prune orphan session artifacts
  perf: reduce GitHub activity cache misses
  fix: cache session list model resolution (openclaw#77650) (thanks @ragesaq)
  ci: embed Mantis desktop previews
  fix(replay-history): drop trailing stream-error placeholder before provider send (openclaw#77287)
  ...

# Conflicts:
#	CHANGELOG.md
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026
…ovider send (openclaw#77287)

normalizeAssistantReplayContent rewrites empty assistant error turns into
a STREAM_ERROR_FALLBACK_TEXT sentinel to satisfy Bedrock Converse's
non-empty ContentBlock requirement for non-trailing turns. When that
sentinel is the trailing entry, prefill-strict providers reject the
request with "400 This model does not support assistant message prefill.
The conversation must end with a user message." and the session cannot
recover on its own.

Add a post-loop tail trim that drops trailing assistant turns whose
content is empty with stopReason "error" or zero-usage empty stop, or
carries only the sentinel text with the same synthetic provenance. A real
model reply whose content happens to equal the sentinel string is
preserved by requiring zero usage or stopReason "error" before dropping.
The trim catches both the in-memory rewrite shape and the sentinel
persisted to disk by session-file-repair.

Tests:
- pnpm test src/agents/pi-embedded-runner/replay-history.test.ts
- pnpm exec oxfmt --check --threads=1 CHANGELOG.md src/agents/pi-embedded-runner/replay-history.ts
  src/agents/pi-embedded-runner/replay-history.test.ts
- pnpm check:changed

Refs openclaw#77228
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling maintainer Maintainer-authored PR size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant