fix(replay-history): drop trailing stream-error placeholder before pr…#77287
Conversation
|
Codex review: needs maintainer review before merge. Summary Reproducibility: yes. at source level. Current main can rewrite a trailing empty assistant error/zero-usage stop turn into a sentinel assistant-final transcript that flows through the shared provider replay path; I did not run a live github-copilot request in this read-only review. Next step before merge Security Review detailsBest possible solution: Land the shared normalizer fix after maintainer review and completion of queued CI checks; keep the broader provider cooldown and repair-corruption follow-ups tracked separately under #77228 or sibling PRs. Do we have a high-confidence way to reproduce the issue? Yes, at source level. Current main can rewrite a trailing empty assistant error/zero-usage stop turn into a sentinel assistant-final transcript that flows through the shared provider replay path; I did not run a live github-copilot request in this read-only review. Is this the best way to solve the issue? Yes. The PR fixes the shared replay chokepoint, preserves non-trailing Bedrock sentinel behavior, requires synthetic provenance before dropping sentinel text, and now includes changelog coverage. What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against d5edeae6ee9d. |
1f0e275 to
befc941
Compare
3e60c65 to
e9e58e1
Compare
…ovider replay The STREAM_ERROR_FALLBACK_TEXT sentinel exists to satisfy Bedrock Converse's "ContentBlock must not be empty" rule for non-trailing assistant error turns. When the same shape ends up as the *trailing* message in a replay request, prefill-strict providers (e.g. github-copilot/claude-opus-4.6) reject the call with 400 "The conversation must end with a user message.". The original turn carried zero usage and either empty content or only the synthetic sentinel text — drop is lossless. The trim runs after the existing sentinel-rewrite loop so it also covers transcripts where session-file repair previously persisted the sentinel to disk. Refs openclaw#77228 — addresses the prefill-trigger root cause only. The cascading auth-profile cooldown amplifier and the auto-repair null-role amplifier are tracked separately in the same issue.
aa7a39b to
9327313
Compare
|
Landed as squash commit |
…ainer-hardening * origin/main: (843 commits) docs(changelog): relocate openclaw#77046 and openclaw#77280 entries from 2026.5.3 to Unreleased (openclaw#77728) docs: reorder unreleased changelog fix: expose ollama thinking profile before activation (openclaw#77617) (thanks @yfge) fix: expose ollama thinking profile before activation test(gateway): preserve dispatch timers in waiter test(gateway): keep startup context timer live docs: document cache-friendly activity helper ci: install ffmpeg for Mantis media previews fix: avoid impossible device token rotation advice (openclaw#77688) (thanks @Conan-Scott) docs(changelog): note doctor device pairing advice fix fix(doctor): avoid impossible device token rotation advice ci: use Crabbox media previews for Mantis docs: filter maintainer-owned triage noise test: cover GitHub activity helper fix(session-file-repair): drop null-role message entries instead of preserving them (openclaw#77288) fix: prune orphan session artifacts perf: reduce GitHub activity cache misses fix: cache session list model resolution (openclaw#77650) (thanks @ragesaq) ci: embed Mantis desktop previews fix(replay-history): drop trailing stream-error placeholder before provider send (openclaw#77287) ... # Conflicts: # CHANGELOG.md
…ovider send (openclaw#77287) normalizeAssistantReplayContent rewrites empty assistant error turns into a STREAM_ERROR_FALLBACK_TEXT sentinel to satisfy Bedrock Converse's non-empty ContentBlock requirement for non-trailing turns. When that sentinel is the trailing entry, prefill-strict providers reject the request with "400 This model does not support assistant message prefill. The conversation must end with a user message." and the session cannot recover on its own. Add a post-loop tail trim that drops trailing assistant turns whose content is empty with stopReason "error" or zero-usage empty stop, or carries only the sentinel text with the same synthetic provenance. A real model reply whose content happens to equal the sentinel string is preserved by requiring zero usage or stopReason "error" before dropping. The trim catches both the in-memory rewrite shape and the sentinel persisted to disk by session-file-repair. Tests: - pnpm test src/agents/pi-embedded-runner/replay-history.test.ts - pnpm exec oxfmt --check --threads=1 CHANGELOG.md src/agents/pi-embedded-runner/replay-history.ts src/agents/pi-embedded-runner/replay-history.test.ts - pnpm check:changed Refs openclaw#77228
Summary
Problem: A session whose last assistant turn errored before producing content — persisted as
{ role: "assistant", stopReason: "error", content: [] }(or, after the offline session-file repair pass, withcontent: [{ type: "text", text: "[assistant turn failed before producing content]" }]) — gets sent to the next provider call as the trailing message. Prefill-strict providers (the report's repro case is github-copilot/claude-opus-4.6 in Session corruption: prefill error cascades into provider cooldown + repair makes it worse #77228, but the same pattern hits any model that requires the conversation to end with a user message) reject the call with400 This model does not support assistant message prefill. The conversation must end with a user message.. The retry path keeps re-sending the same shape and the user sees a session that never recovers.Root Cause:
src/agents/pi-embedded-runner/replay-history.ts:327-399(normalizeAssistantReplayContent) deliberately rewrites emptycontent: []assistant turns whosestopReason === "error"(orisZeroUsageEmptyStopAssistantTurn(message)— thestopReason: "stop"zero-usage shape) into a non-empty single-text-block carryingSTREAM_ERROR_FALLBACK_TEXT(src/agents/stream-message-shared.ts:90). The doc comment atreplay-history.ts:370-396explicitly states the goal: AWS Bedrock Converse rejects assistant messages with noContentBlock, so the rewrite is needed to keep replay valid for Bedrock. That goal is correct for non-trailing error turns. The same rewrite is wrong when applied to the trailing error turn — Bedrock's contract is satisfied by the next user message that immediately follows in normal flow, but when the error turn is the last entry there is no following user message and the rewrite produces a request whose final message is assistant.src/agents/session-file-repair.ts:65further persists the same sentinel content to disk, so on the next turn the loaded transcript already carries the sentinel as a non-empty trailing assistant block, which the existingArray.isArray(replayContent) && replayContent.length === 0branch does not match (it now skips through unchanged) and the trailing-prefill 400 reproduces.Fix: After the existing sentinel-rewrite loop in
normalizeAssistantReplayContentfinishes, walk back from the tail and drop trailing assistant turns that match the dropable-trailing predicate: an assistant turn whosecontentis either (a) empty andstopReason === "error"orisZeroUsageEmptyStopAssistantTurn(message), or (b) the single-text-block sentinel shape[{ type: "text", text: STREAM_ERROR_FALLBACK_TEXT }]. Both shapes carry zero usage and no model output; dropping is lossless. The next provider request now ends in the last user (or whatever non-droppable turn was before the synthetic tail), which Bedrock accepts (no empty assistant ContentBlock) and prefill-strict providers also accept (last message is user). The trim sits after the rewrite loop so it transparently catches the persisted-sentinel-on-disk shape as well as the in-memory rewrite shape, with one helper.What changed:
src/agents/pi-embedded-runner/replay-history.ts— extendnormalizeAssistantReplayContentwith a post-loop tail trim plus two pure helpers (isReplayDroppableTrailingAssistant,isStreamErrorSentinelContent). Comment block records why dropping is lossless and which provider classes are protected.src/agents/pi-embedded-runner/replay-history.test.ts— relocate two existing tests that locked in the buggy "rewrite trailing error to sentinel" behavior so they exercise the mid-turn sentinel rewrite (with a follow-up user message); add four new tests covering trailing drop for empty-error, trailing drop for zero-usage-empty-stop, trailing drop for persisted sentinel content, multi-turn trailing drop, and boundary preservation for real assistant content + non-error emptytoolUse/lengthturns.CHANGELOG.md— single Fixes line under Unreleased referencing the issue with non-closingRefssyntax.What did NOT change (scope boundary):
run.empty-error-retry.test.ts's "Clean stop with no output is a legitimate silent reply, not a crash" boundary.STREAM_ERROR_FALLBACK_TEXTitself, tobuildStreamErrorAssistantMessage, tosession-file-repair.ts's on-disk write of the sentinel, or to any provider-specific replay path. The fix is provider-agnostic and lives at the single normalization chokepoint that all replay flows funnel through (attempt.ts:546,attempt.ts:3019, andreplay-history.ts:628).Reproduction
{ role: "assistant", stopReason: "error", content: [] }and (after the offline session-file repair pass that's documented atsession-file-repair.ts:65) may rewrite it to the[{ type: "text", text: "[assistant turn failed before producing content]" }]sentinel shape on disk.400 ... The conversation must end with a user message., the failover reason classifies asformat, the session is stuck. The same shape continues to be re-sent on every retry.normalizeAssistantReplayContentdrops the trailing synthetic assistant turn before the request leaves OpenClaw. The provider sees a request ending in the last real user turn (or whatever non-droppable entry precedes the synthetic tail). Bedrock still accepts it (no empty ContentBlock), prefill-strict providers also accept it (last message is user), and the agent generates a fresh assistant reply.Risk / Mitigation
Risk 1 — losing a legitimate trailing assistant turn: Could a turn that the user actually wants the model to "continue" be dropped? No. The drop predicate matches only:
content: []plusstopReason: "error"(synthetic, zero-usage, the failed turn itself)content: []plus theisZeroUsageEmptyStopAssistantTurnshape (already documented in the existing rewrite branch as a synthesized failure surface — seereplay-history.ts:387-388and the boundary testrun.empty-error-retry.test.tsreferenced in the existing comment)[{ type: "text", text: STREAM_ERROR_FALLBACK_TEXT }](synthetic sentinel)Anything else — real text content, real usage,
stopReason: "toolUse"or"length"— is preserved. Test cases lock both directions of this predicate.Mitigation: explicit "does not drop a trailing assistant turn that has real content" and "does not drop a trailing assistant turn with non-error empty content (toolUse / length)" tests; the
run.empty-error-retry.test.tssilent-reply boundary continues to pass because nonzero-usagestopturns do not matchisZeroUsageEmptyStopAssistantTurn(existing boundary preserved).Risk 2 — breaking Bedrock Converse compatibility: Could Bedrock now reject calls that used to work because we removed the sentinel rewrite? No. The pre-existing rewrite branch is unchanged for non-trailing error turns — those still get the sentinel inserted. The trim only fires for trailing turns, and a trailing assistant message with empty content is exactly the case Bedrock would have rejected anyway; dropping it produces a request whose last message is user, which Bedrock accepts.
Mitigation: relocated tests for "converts mid-turn assistant content: [] to a non-empty sentinel text block when stopReason is error" and "converts mid-turn zero-usage empty stop turns to a replay sentinel" continue to assert the rewrite path explicitly.
Risk 3 — provider-specific divergence: The fix is provider-agnostic. Could it diverge somewhere downstream? No.
normalizeAssistantReplayContentis the single chokepoint upstream of all provider runtime calls (attempt.ts:546 normalizeMessagesForLlmBoundary,attempt.ts:3019 normalizedReplayMessages,replay-history.ts:628). After it returns, downstream provider plugins do their own provider-specific massaging on the message list, but they receive a list whose tail is already user-shaped, which is the universal contract.Mitigation: existing
sanitizeProviderReplayHistoryWithPluginandvalidateProviderReplayTurnsWithPluginflows are downstream of this normalization; they continue to receive a structurally-valid list.Risk 4 — repair amplification: This PR does not stop the auto-repair routine from filling the JSONL with null-role entries when triggered by a different invalid sequence. That is a separate root cause and gets its own PR. The trim here removes the trigger condition for the prefill 400 specifically.
Mitigation: the changelog entry and the
Refsreference are explicit about scope.Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Refs #77228 — addresses the trailing prefill-placeholder root cause only. The cascading auth-profile cooldown amplifier (a single
format400 currently poisons the whole auth profile across sessions) and the auto-repair amplifier (the routine writes 935+ null-role entries when reconciling an invalid sequence) are independent failure modes in the same issue and are out of scope here so each can ship as its own narrowly-scoped PR.